SGML Syntax Reference

Introduction

SGML, like HTML (which is based on SGML), is a text format built on the idea of organizing information by marking up text with tags. SGML is also a meta-language for describing markup vocabularies such as HTML, and their parsing rules.

Consider the following basic HTML document:

<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>Section Title</h1>
<p>Body Text with <a href="otherdoc.html">link to another document</a>.</p>
<footer>Page Footer</footer>
</body>
</html>

The element grammar for this document can be described as an SGML Document Type Definition (DTD) as follows:

<!ELEMENT html - - (head?,body)>
<!ELEMENT head - - (title?)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT body - - (h1,p+,footer?)>
<!ELEMENT h1 - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA|a)*>
<!ELEMENT a - - (#PCDATA)>
<!ELEMENT footer - - (#PCDATA)>

In this grammar, the content model expression head?,body means that the content of the html element is expected to consist of an (optional) head element, followed by a body element; the head and body elements each have grammar rules for their own content, in turn. #PCDATA means that text is expected at the respective position.

Given such a markup grammar and other declarations, SGML can

check the markup of a given document or a larger collection of documents, and enforce presence or absence of tags or attributes

infer tags and attributes not present in a document but desired for content delivery (as used for automatically adding boilerplate and structuring content in web applications, and to simplify content creation)

attach processing to elements or more complex contexts for generating dynamic web content, or for other template processing applications in content production

SGML can be used for

content authoring and workflow organization using straightforward concepts such as files and folders, as well as more sophisticated declarative techniques for web applications

content delivery over the web, with rich facilities for fetching and preparing content from databases or web services, and for integration into mainstream web application stacks

sanitizing potentially malicious user content in dynamic web applications or content production processes (injection prevention).

searching, transforming, analyzing and otherwise processing web content and other markup documents.

The following sections describe the markup declarations that can be used in a DTD, and their effect on the respective markup constructs in document content.

Elements

The general form of an element declaration is

<!ELEMENT element-name [rank] [tag-omission-rules] content [exceptions]>

<!ELEMENT name-group [rank] [tag-omission-rules] content [exceptions]>

where

element-name

is a single element name to declare

name-group

is a list of element names to declare

a name group has the form (element1|element2|...|elementN).

rank (optional)

is a non-negative decimal number which is treated as a rank suffix that the declared element must have when used in content

the element name is treated as a rank stem, rather than a complete element name, if a rank suffix is specified

an element declared with rank having a rank suffix specified in content (ie. ending in a number in a start-element tag), sets the implied rank suffix for any element tag in subsequent content

for an element declared with rank having its rank suffix omitted in content, the effective rank suffix is that of the most recent element declared in the same declaration that has a rank suffix specified; the most recent element doesn’t necessarily have to be a parent element, but can be any preceding element

it’s an error if the first occurrence of an element declared with rank in a document instance has its rank suffix omitted

with respect to rank minimization, sgmljs.net treats all elements declared with the same rank suffix in a DTD as if those were declared in the same declaration; ie. a rank suffix is not only inferred from prior elements declared in the same declaration, but from any prior element having a rank declared and specified in content

an element declared with rank is referenced by its rank stem and rank suffix as concatenated name from other declarations; e.g. an element declared with rank stem abc and rank suffix 3 is referenced as abc3 in content model expressions of other element declarations where the element may occur as content model token

note that using element ranks does not in itself enable uses such as e.g. automatically assigning/incrementing header levels based on tag nesting levels; instead, rank omission always infers from the most recently specified rank: see rank-examples
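The rank inference rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the sgmljs.net implementation: the function name resolve_ranks, the representation of tags as plain strings, and the ranked_stems parameter are all hypothetical. It follows the sgmljs.net behaviour described above, where one inferred rank suffix is shared by all elements declared with rank.

```python
import re

def resolve_ranks(tags, ranked_stems):
    """Fill in omitted rank suffixes from the most recently specified one.

    tags         -- element names as written in start-element tags (hypothetical input)
    ranked_stems -- set of rank stems declared with a rank in the DTD
    """
    current = None  # most recently specified rank suffix
    resolved = []
    for tag in tags:
        # split the name into a stem and a (possibly empty) trailing number
        stem, suffix = re.fullmatch(r"(.*?)(\d*)", tag).groups()
        if stem not in ranked_stems:
            resolved.append(tag)  # not a ranked element: leave untouched
            continue
        if suffix:
            current = suffix  # a specified suffix sets the implied rank
        elif current is None:
            # first ranked element in the document must specify its suffix
            raise ValueError(f"rank suffix omitted on first ranked element: {tag}")
        resolved.append(stem + current)
    return resolved
```

Note that an omitted suffix is never incremented or derived from nesting depth; it is only copied from the most recent specified suffix.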

tag-omission-rules (optional)

- - means both start- and end-tag must be specified

- O means the end-tag can be left out

O - means the start-element tag can be left out

O O means both start- and end-element tag may be left out

in the above syntax rules O refers to the letter O, and - to the minus character

there must be whitespace between the specifier for start- and end-tag omission rules

the tag-omission-rules specification may be left out altogether in which case it defaults to - -

see Tag Inference for applying tag omission rules

content

either a Content Model, with surrounding parentheses

or ANY, allowing any content

or EMPTY, which forbids the element to have content

or CDATA, which will make element content parse as character data

or RCDATA, which will make element content parse as character data, with general entity references being expanded into the respective entity replacement text

see below for detailed explanation

exceptions

an expression of the form -(exclusions) +(inclusions) where either the exclusions- or the inclusions-part, or both, can be omitted

if both the exclusion- and the inclusion-part is specified, then the inclusion-part must follow the exclusion-part

inclusions is a single element or a name group (a list of elements) allowed to occur anywhere and arbitrarily often in descendant content in addition to elements specified in the content model

exclusions is a single element or a name group of elements not allowed to occur in descendant content, even though allowed by the content model or included by an element declaration for a parent element

if an element is excluded, it can’t be included by an element declaration for a descendant element (the inclusion is ignored)

an element that is required in a content model can’t be excluded

it’s an error for an element to be both excluded and included in the same declaration

if an element occurs at a position where it matches a model group token, and is also in the set of included elements, then it is accepted as content model token (inclusion of the element is ignored)

exceptions can only be specified for elements having a content model or for elements with declared content ANY

Declared content

An element declared ANY, EMPTY, CDATA, or RCDATA is said to have declared content.

`ANY` content

When an element is declared to have ANY content, any content (character data or any nested elements, subject to the effective value of IMPLYDEF ELEMENT in the SGML declaration) may occur between the element start- and end-tag.

`EMPTY` content

When an element is declared to have EMPTY content, it must be specified

either just in start-element tags (ie. end-element tags can’t be used for that element at all), or,

if EMPTYNRM YES is specified in the SGML declaration, with an optional end-element tag immediately following the start-element tag.

Note that if, in addition to FEATURES MINIMIZE EMPTYNRM YES, also FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET is specified in the SGML declaration, and / and > are declared to have the NESTC and NET delimiter roles, respectively, then any element having no content (regardless of whether the element is declared EMPTY), can be specified as an XML-style empty element, ie. can be abbreviated by <element/>, instead of having to specify <element></element> (see SGML declaration for details).

Note that sgmljs.net supports only the characters stated above for the NESTC and NET delimiter roles (or no assignment to these delimiters at all). Moreover, sgmljs.net restricts the supported combinations of the FEATURES MINIMIZE EMPTYNRM and FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET SGML declaration properties: either both have the values stated above, or both have the value NO. The first combination, introduced with WebSGML (the Annex K revision of SGML), corresponds to modern polyglot markup writing (and is used by default in sgmljs.net), while the latter corresponds to the traditional SGML authoring style.

Note that, when processing XML, empty elements are required to either have end-element tags, or to be specified as XML-style empty elements (ie. as <element/>).

Note that apart from declaring an element to have EMPTY content, an element must also have empty content when a #CONREF attribute is specified on it; see #CONREF in attribute default values.

`CDATA` content

Elements declared CDATA contain unparsed character data as child content.

The & (ampersand) character has no special meaning in content of elements declared CDATA: character sequences looking like named entity references aren’t expanded to replacement text, and are, like character entity references, reproduced as-is to result markup.

A < (less-than) character followed by a valid name start character terminates content of elements declared CDATA, just like regular elements declared with content models.

Note the CDATA reserved word is also used as declared value in attribute declarations and entity declarations, and as keyword in marked sections.

See declared content examples.

Content models

A content model specifies the sequence of sub-elements and/or character data content that an element’s child content is expected to have.

It is specified by a content model expression. For example, the content model expression a, b?, c* describes a sequence consisting of a single a element, followed by an optional b element, followed optionally by a sequence of any number of c elements.

A content model expression is an expression constructed from content model tokens and compositors, with optional grouping and nesting of subexpressions in parentheses.

Content model tokens

Content model tokens are either

element names declared in the same or another element declaration within the same declaration set, or

the #PCDATA token representing parsed character data being allowed at the position in the content model expression where it is specified

Compositors

A compositor is one of the following characters, listed along with the compositor’s application to operand elements and/or compound subexpressions, and its semantics:

operand? (zero-or-one compositor)

means "zero or one" of the element or content model subexpression to which it applies

the operand element or content model subexpression to which the compositor applies is written to the left of the compositor

operand* (Kleene star compositor)

means "zero or more" of the element or content model subexpression to which it applies

the operand element or content model subexpression to which the compositor applies is written to the left of the compositor

operand+ (plus compositor)

means "one or more" of the operand element or content model expression to which it applies

the operand element or content model subexpression to which the compositor applies is written to the left of the compositor

an expression such as a+ where a is an element or subexpression, is equivalent to a,a*

left-operand, right-operand (comma compositor)

means "a sequence of the left, followed by the right operand" element name or content model subexpression

(operand) (grouping)

expressions can be grouped in parentheses such that they can be used as operands to higher level compositors; when parentheses are omitted, content model expressions are parsed left-to-right, ie. a compositor to the left of an operand takes precedence over a compositor to the right

left-operand & right-operand (allgroups-compositor)

means any sequence of the operand elements or subexpressions, provided that any one element or subexpression occurs at most once in total

when applied to an operand subexpression that has the "zero-or-one" compositor as top-most compositor, that operand subexpression isn’t required to occur, but if it occurs, it must occur at most once anywhere in the content of the element being declared

the content model expression a & b & c is equivalent to the content model expression (a,((b,c)|(c,b)))|(b,((a,c)|(c,a)))|(c,((a,b)|(b,a)))

Note:

In sgmljs.net SGML, operands of the allgroup compositor must be either

a single element name, or

a subexpression having the zero-or-one compositor, the sole operand of which is a single element name.

More complex operands for the allgroup compositor aren’t supported.
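With operands restricted in this way, checking child content against an allgroup reduces to a set membership test. The following Python sketch illustrates the idea; the function name matches_allgroup and the dictionary representation of operands are assumptions made for this example and aren't part of sgmljs.net.

```python
def matches_allgroup(operands, children):
    """Check a sequence of child element names against an allgroup.

    operands -- maps each element name to True if it is required,
                or False if it was written with the zero-or-one compositor
    children -- element names in the order they occur in content
    """
    seen = set()
    for child in children:
        if child not in operands or child in seen:
            return False  # element not in the group, or occurring twice
        seen.add(child)
    # any order is fine, but every required operand must have occurred
    return all(name in seen for name, required in operands.items() if required)
```

For a & b? & c, the operands would be {"a": True, "b": False, "c": True}: the sequences c,a and b,c,a are accepted, while a,b (c is missing) and a,c,a (a occurs twice) are rejected.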

If #PCDATA is specified as content token, it is implicitly treated as if (#PCDATA)* were specified, ie. parsed character data is always optional in content models.

Content models must be unambiguous, ie. any content token must be uniquely matched without looking ahead at subsequent content tokens for disambiguation. For example, the content model

(a,b)|(a,c)

is ambiguous, since element a can be matched as the beginning of either (a,b) or (a,c). On the other hand, the equivalent content model expression

a,(b|c)

is unambiguous.
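To see that the two expressions accept the same content, a content model can be translated into an ordinary regular expression over child element names. The Python sketch below is a rough illustration only: the helper names model_to_regex and accepts are hypothetical, #PCDATA and the allgroup compositor aren't handled, and groups are assumed to use a single kind of connector. Because a backtracking regex engine looks ahead freely, the sketch accepts ambiguous models too; it demonstrates equivalence, not the SGML ambiguity check.

```python
import re

def model_to_regex(model):
    """Translate a content model expression into a Python regex string."""
    compact = re.sub(r"\s+", "", model)
    # each element name becomes a group matching the name plus a separator
    body = re.sub(r"[A-Za-z][A-Za-z0-9.-]*",
                  lambda m: "(?:" + re.escape(m.group(0)) + " )",
                  compact)
    # the comma compositor is plain concatenation in a regex;
    # ?, *, +, | and parentheses carry over unchanged
    return body.replace(",", "")

def accepts(model, children):
    """Return True if the sequence of child element names matches the model."""
    text = "".join(name + " " for name in children)
    return re.fullmatch(model_to_regex(model), text) is not None
```

Both accepts("(a,b)|(a,c)", ["a", "c"]) and accepts("a,(b|c)", ["a", "c"]) succeed, and both models reject a lone a element; only the second form lets a parser decide which token matched a without looking at the following element.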

Tag Inference

For automatic generation of required elements not present in content, FEATURES MINIMIZE OMITTAG YES must be enabled in the SGML declaration (which it is by default, except when processing XML).

In the following description of SGML tag inference, trivial actions on special conditions aren’t described, such as on

`ANY` content models, or, equivalently, implied-`ANY` elements; implied-`ANY` elements are elements having child elements with implied element declarations ie. undeclared elements (when allowed to occur via `IMPLYDEF ELEMENT YES`)

`EMPTY` elements, or, equivalently, implied-`EMPTY` elements (elements governed by content references)

inference of document elements (which is just a special case of general start-element tag inference).

Actions performed on a start-element tag, or on parsed character data

1. Close definitely completed elements

Definitely completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in its content model declaration such that only an end-element tag for the enclosing (definitely completed) element is accepted at the context position.

A model group ending in an optional content token or in a content token with one-or-more compositor can’t be definitely completed, and isn’t considered for automatic closing.

It’s an error if a definitely completed element’s end-element tag isn’t omissible at this point, because a start-element action cannot be accommodated at the context position.

2. Check if the start-element tag or parsed character data is accepted at the context position; that is, check it’s accepted at the current position in the model group and isn’t excluded via exclusion exceptions

3. Open contextually required elements

Elements are contextually required if the content model of the enclosing element accepts a single element at the context position as a required element.

The element to accommodate is not influential in opening elements here, only the state of model group(s) already opened is considered.

4. If a contextually required element is opened, and matches the content token to accommodate, tag inference is completed for this action

Additional rules

The following actions are performed by sgmljs.net SGML in addition (these and similar recovery actions are also performed by third party SGML parsers such as SP, but are reported as recoverable errors by those parsers, whereas sgmljs.net SGML performs these actions silently):

At step 3, if it isn’t possible to open a contextually required element, and

the context is immediately below the document element (such that inferring an end-element tag at the context position will close the document element, and logically end the document), and

the element to accommodate is not declared to have rank or ends with a numeric token (see next section), and

there’s a single transition over an element from the context state, and

the start-element tag of the element to transition over is omissible

then that single transitioned-over element is opened as if it were contextually required.

At step 3, if it isn’t possible to open a contextually required element, and

the element to accommodate is declared as having rank, and

the element’s rank suffix to accommodate is higher (numerically larger) than that of the parent element (or the parent has no rank, in which case it is treated as having rank 0), and

there’s a single transition over a ranked element from the context state, and

the rank of that single transitioned-over element is the same as that of the element to accommodate, and

the start-element tag of the ranked element to transition over is omissible

then that single transitioned-over element is opened as if it were contextually required.

Moreover, if it isn’t possible to open either a contextually required element or a rank-implied element as described, the parent element is closed, if it is potentially completed (see definition below).

At step 2, if the element to accommodate isn’t accepted at the context position due to exclusion exceptions, close as many potentially completed parent elements as necessary until it is (ie. until no more exclusion exceptions apply to the element to accommodate, if possible).

Actions performed on an end-element tag

1. Close potentially completed elements

Potentially completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in its content model declaration; as opposed to definitely completed elements, the model group may allow further optional elements, or end in a content token (or in a nested model group) having the one-or-more compositor.

2. If the end-element to accommodate matches the most recently closed element, tag inference is completed for this action

Empty element minimization

As an additional minimization feature, SGML supports empty start- and end-element tags, ie. tags in which the element name is omitted. This feature doesn’t require any special markup declaration and can be applied to any element (except to the start-element tag of the document element), subject to the FEATURES MINIMIZE SHORTTAG STARTTAG EMPTY and FEATURES MINIMIZE SHORTTAG ENDTAG EMPTY SGML declaration settings, respectively.

An empty start-element tag is treated as if it were a start-element tag for the most recently closed element. For example, the empty start-element tag <> in the following markup text

<foo>
<bar>...</bar>
<>...</bar>
</foo>

is interpreted as a <bar> start-element tag.

An empty end-element tag is treated as if it were an end-element tag for the context element (ie. the nearest unclosed element). For example, </> is equivalent to </bar> in the following markup text:

<foo>
<bar>...</>
</foo>

Note: the SGML terms empty start-element tag and empty end-element tag are used for the <> and </> tokens. An XML-style empty element token, on the other hand, represents a different concept.
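The two resolution rules for empty tags can be sketched with a stack of open elements. The Python function below is a simplified illustration, not sgmljs.net code: the name resolve_empty_tags is hypothetical, tags are assumed to be pre-tokenized strings without attributes, and content is assumed to be well nested with no other tag omission in effect.

```python
def resolve_empty_tags(tags):
    """Replace <> and </> by the full tags they stand for."""
    open_elements = []  # stack of currently open element names
    last_closed = None  # name of the most recently closed element
    resolved = []
    for tag in tags:
        if tag == "<>":
            # empty start-element tag: reopen the most recently closed element
            if last_closed is None:
                raise ValueError("no previously closed element to refer to")
            open_elements.append(last_closed)
            resolved.append(f"<{last_closed}>")
        elif tag == "</>":
            # empty end-element tag: close the nearest unclosed element
            last_closed = open_elements.pop()
            resolved.append(f"</{last_closed}>")
        elif tag.startswith("</"):
            open_elements.pop()  # assumes the tag matches the open element
            last_closed = tag[2:-1]
            resolved.append(tag)
        else:
            open_elements.append(tag[1:-1])
            resolved.append(tag)
    return resolved
```

Applied to the tag sequence <foo>, <bar>, </bar>, <>, </>, </foo>, the sketch resolves <> to <bar> and </> to </bar>.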

Attributes

Declarations for attribute lists take the form

<!ATTLIST element-name attribute-name declared-value [default-value]
[attribute-name declared-value [default-value]] ...>

<!ATTLIST name-group attribute-name declared-value [default-value]
[attribute-name declared-value [default-value]] ...>

<!ATTLIST #ALL attribute-name declared-value [default-value]
[attribute-name declared-value [default-value]] ...>

where

element-name

is a single element name to declare attributes for

name-group

is a list of element names to declare attributes for

a name group has the form (element1|element2|...|elementN).

#ALL

declares the attribute on all (declared or undeclared) elements when used in place of element-name or name-group

attribute-name

is the name of the attribute to declare

declared-value

is one of the following possible lexical value types

an enumerated value type

`CDATA`, allowing any quoted string to be used as attribute value

`ENTITY`, allowing a name token declared as entity name in the same declaration set; the token doesn’t need quoting

`ENTITIES`, allowing, in addition to `ENTITY`, a space-separated list of name tokens declared as entity names; when actually specifying more than a single entity name in content, the attribute value must be quoted

`ID`, allowing a name token, which must be unique among all name tokens used as `ID` in a document, and which establishes an `ID` value for reference by `IDREF` or `IDREFS` attribute

`IDREF`, allowing a name token used as `ID` in the same document; the token doesn’t need quoting

`IDREFS`, allowing, in addition to `IDREF`, a space-separated list of name tokens declared as `ID` attribute value; when actually specifying more than a single `ID` value in content, the attribute value must be quoted

`NAME`, allowing a name token; the token doesn’t need quoting

`NAMES`, allowing, in addition to `NAME`, a space-separated list of name tokens

`NMTOKEN`, allowing, in addition to `NAME`, a token beginning with `.` (dot), `-` (minus), or `_` (underscore), whereas `NAME` allows these characters to occur only at the second or subsequent position in the attribute value

`NMTOKENS`, allowing, in addition to `NAMES`, a list of tokens, each of which beginning with `.` (dot), `-` (minus), or `_` (underscore)

`NOTATION`, allowing the attribute value to have a notation name specified in the enumerated list of permitted notation names

`NUMBER`, allowing a sequence of digits as attribute value

`NUMBERS`, allowing, in addition to `NUMBER`, a list of numerical values to occur

`NUTOKEN`, allowing a sequence of digits, followed by a sequence of letters (such as `64px`)

`NUTOKENS`, allowing, in addition to `NUTOKEN`, a list of `NUTOKEN` tokens

[data attribute specification], allowing for custom data attribute checks and value normalization (see Data attribute specification)

A single attribute list declaration can declare one or more attributes for one or more elements (when using the name group declaration variant).

Conversely, attributes of the same element can also be declared in multiple attribute list declarations (potentially from multiple declaration sets). However, a given attribute of a given element is effectively declared at most once: multiple declarations for the same attribute on the same element aren’t rejected, but only the first declaration in document order (and, by extension, in the order in which declaration sets are processed) becomes effective, while later declarations are ignored.
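The first-declaration-wins rule can be pictured as filling a lookup table that never overwrites existing entries. The following Python sketch assumes a hypothetical, already-parsed list of (element, attribute, declaration) tuples in processing order; it isn't how sgmljs.net represents declarations internally.

```python
def effective_attributes(attlists):
    """Collect effective attribute declarations; the first one wins.

    attlists -- (element, attribute, declaration) tuples in the order
                in which the attribute list declarations are processed
    """
    effective = {}
    for element, attribute, declaration in attlists:
        # setdefault keeps an existing entry, so later duplicates are ignored
        effective.setdefault((element, attribute), declaration)
    return effective
```

For example, if attr is first declared as CDATA #IMPLIED and later as NUMBER #REQUIRED on the same element, only the CDATA #IMPLIED declaration takes effect.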

See attribute declaration and use examples.

Default value

The default value is either

(for enumerated values) one of the enumerated values

(for `NOTATION` attributes) one of the enumerated notation names

(for other attributes) an attribute value literal; only needs quotes if the default value isn’t a name token

the token `#REQUIRED`, which means the attribute must be specified, and must have a value

the token `#IMPLIED`, which means the attribute doesn’t have to be specified (is optional)

the token #CONREF, which means that, if the attribute is specified, then the element on which it is specified is treated as if it were declared EMPTY

The token #FIXED may be specified before default values of the first, second, or third form above. When specified, the attribute either must have the default value, or mustn’t be used at all on the respective element.

Note that assigning template entities to attributes declared #CONREF can have additional semantics to the effect that the element on which the #CONREF attribute is specified gets replaced by external content.
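The interplay of literal default values, #REQUIRED, #IMPLIED, and #FIXED can be summarized in a small Python sketch. The function name check_attribute and its parameters are assumptions for illustration; #CONREF and lexical checks against the declared value are deliberately left out.

```python
def check_attribute(default, specified, fixed=False):
    """Return the effective attribute value, or raise ValueError.

    default   -- "#REQUIRED", "#IMPLIED", or a literal default value
    specified -- the value given on the element, or None if not specified
    fixed     -- True if #FIXED precedes the literal default value
    """
    if specified is None:
        if default == "#REQUIRED":
            raise ValueError("required attribute not specified")
        if default == "#IMPLIED":
            return None  # optional attribute without a value
        return default  # fall back to the declared default value
    if fixed and specified != default:
        raise ValueError("#FIXED attribute must have its default value")
    return specified
```

With the declaration from the enumerated values example, check_attribute("val1", None) yields val1, while check_attribute("val1", "val2", fixed=True) is rejected.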

Enumerated values

An attribute declaration such as

<!ATTLIST elmt attr (val1|val2|val3) val1>

declares the attribute attr on element elmt.

The attribute can have either of the values val1, val2, or val3, and its default value (its value when not specified on the element explicitly) is val1.

Element wildcards WebSGML

Using the #ALL keyword, it’s possible to declare one or more attributes on all elements; depending on whether undeclared elements are allowed (eg. by using IMPLYDEF ELEMENT YES or IMPLYDEF ELEMENT ANYOTHER as explained below), attributes declared in an attribute list declaration with #ALL can also be used on undeclared elements.

An attribute can be declared both in an #ALL attribute list as well as in a regular attribute list for a single element or an element namegroup at the same time. If an attribute is declared both on an individual element and on #ALL elements, its usage must satisfy both declarations.

For example, an attribute can be declared to have an enumerated value in an #ALL attribute list, and can be declared to have a #FIXED value in an attribute list declaration for an individual element. In this way, it’s possible to model a common design pattern in DTDs, wherein an attribute can be declared on an individual element in a more specific way than in a generic #ALL attribute declaration, while the generic #ALL declaration still expresses a baseline declaration and common requirement for the attribute’s use across all elements used in a document.

It’s a design error (and reported by sgmljs.net SGML as attribute validation error on actual attribute use), if an attribute is declared both as an #ALL attribute and as an attribute on an individual element, when the two declarations are not satisfiable simultaneously. For example, a #FIXED value for an attribute declared in an #ALL attribute declaration can’t be refined by declaring a different #FIXED value on an individual element for the same attribute.

The order of an #ALL declaration relative to an attribute declaration of an individual element for the same attribute isn’t significant and doesn’t change the interpretation of attribute declarations. Moreover, #ALL attribute declarations always apply to all elements of the document type and DTD containing the declaration, irrespective of whether element declarations are placed before or after the respective #ALL attribute declaration in document order (or are present at all).

Note sgmljs.net doesn’t support WebSGML’s other keywords (such as #IMPLICIT) on attribute declarations in place of #ALL. Moreover, #ALL isn’t supported for data attributes (ie. attributes of notations; see below).

Data attribute specifications WebSGML

In addition to the built-in parsing types for attributes as described above, attributes can be declared to have custom data types (this form of declaration makes use of notations explained in the next section).

For example, the following declarations

<!NOTATION html5-form-input
PUBLIC "+//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN">
<!ATTLIST elmt attr DATA html5-form-input>

declare the attr attribute to have a lexical type identified with a notation having the public identifier +//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN.

This public identifier represents the collection of lexical datatypes specified by HTML 5 form input validation, and imposes validation and value normalization on attribute values (and on the plain text content of CDATA and SDATA data entities declared to be in that notation).

WebSGML allows specifying attributes for the data library notation such as in

<!ATTLIST elmt attr DATA html5-form-input [ type="email" ]>

<!ATTLIST elmt attr DATA html5-form-input [ pattern="XYZ\d+" ]>

In the absence of the type or pattern attribute, sgmljs.net will behave as if text (the most basic HTML 5 input form validation type) had been specified for the type attribute. text accepts any text value as content, and the value normalization applied is restricted to removing newlines, if present.

See form input value checking for more details on lexical value checking.

Notations

A notation, in general SGML terms, is a representation format for data such as the image formats PNG, GIF, or JPEG, or a text format such as TeX for typesetting mathematics.

Notation markup can be used to specify content in a different data representation format than SGML, either embedded in a SGML document, or as a reference to an external resource.

In sgmljs.net, the notation construct is also used to provide custom processing on markup for a broad class of applications such as content formatting and filtering; see templating.

A notation is declared as follows:

<!NOTATION notation-name identifier>

where

notation-name

is the name of the notation to declare

identifier

is the public and/or system identifier for the notation, used to identify the notation as either a built-in notation (SGML, SQL, SPARQL, etc.) or an external custom notation; see identifiers

Notation attributes can be used to markup a piece of inline text as "in a notation": in the following example, the characters \sqrt{2} are marked up as TeX-formatted math:

<!doctype example [
<!element example (math)+>
<!element math CDATA>
<!attlist math format notation (tex) #implied>
<!notation tex public "TeX">
]>
<example>
<math format=tex>\sqrt{2}</math>
</example>

Note this is only an example of how to specify inline notation data; the use of the ad-hoc public identifier TeX here won’t cause sgmljs.net SGML to execute TeX instructions.

Note that when using notation attributes, the content restrictions and entity expansion behaviour declared in the element declaration for the element on which it is declared and specified apply unchanged.

The syntax for declaring (and specifying values for) NOTATION declared attributes is very similar to that of enumerated values; see attribute examples.

For using notations with external entities, see entities.

Data attributes

Like elements, notations can have attributes. Data attributes are used to configure properties of external data entities, or of inline notational content; see templating for details.

Data attributes are declared as follows:

<!ATTLIST #NOTATION notation-name attribute-name declared-value default-value
[attribute-name declared-value default-value] ...>

For data attributes, the same rules as for element attributes apply, with the following exceptions:

data attributes can’t have a declared value of `ID`, `IDREF`, `IDREFS`, `NOTATION`, `ENTITY`, or `ENTITIES` (however, special rules apply for templating)

unlike element attributes, data attributes must be declared and aren’t subject to MINIMIZE IMPLYDEF ATTLIST YES when declared in the SGML declaration

Entities

An entity, in SGML, is a stream of character data.

An entity declaration introduces a name for an entity for subsequent use in the SGML prolog or in content. Parsed entities (see general entities) are used for entity references, which are replaced by the entity’s character data on processing. Unparsed entities (see data entities) are used as values of ENTITY (or ENTITIES) attributes for templating or are processed in other entity type-specific ways.

General entities

The purpose of general entities is to support reuse of text at multiple places in a document, by placing references to a shared, declared general entity as follows:

<!DOCTYPE doc [
<!ENTITY text "some <i>reusable</i> text">
<!ELEMENT doc - - (p+)>
<!ELEMENT i - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA|i)*>
]>
<doc>
<p>First use of the "text" entity follows: &text</p>
<p>Second use of the "text" entity follows: &text</p>
</doc>

In the example, &text is a reference to the previously declared text (general) entity, and will expand to the string some <i>reusable</i> text in place.

Any markup contained in the entity replacement text will be interpreted as if it had been part of the text in which the entity reference is placed. This means that replacement text can contain tags (or any other SGML content construct such as marked sections, processing instructions, etc.). It may also contain further entity references in turn, which will be expanded in place recursively.

However, valid replacement text for an entity must not contain references to the entity being replaced itself (or, transitively, contain an entity reference expanding into a reference to the entity being expanded itself).

General entity references are expanded anywhere in content, regular attribute specifications, and replacement content of general entities, except in CDATA marked sections, CDATA content, data text entities (CDATA/SDATA entities), and attributes declared with data attribute specifications.

General entity references (as opposed to parameter entity references) aren’t expanded in markup declarations.

General entities are lazily fetched at the time(s) an entity reference is parsed in content. When processing an entity declaration with replacement text containing references to further entities, no check is performed whether referenced entities are declared and/or accessible. In particular, unlike parameter entities, at declaration time, replacement text for general entities may contain references to other entities that aren’t themselves declared (yet).
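As a minimal sketch of nested and lazily resolved references (the entity names outer and inner are made up for illustration), the following document declares outer before the inner entity it refers to. Going by the rules above, this should be acceptable for general entities, because the reference to inner is only resolved once &outer; is expanded in content:

```sgml
<!DOCTYPE doc [
<!-- outer refers to inner, which is not declared yet at this point -->
<!ENTITY outer "outer text with &inner;">
<!ENTITY inner "<i>nested</i> text">
<!ELEMENT doc - - (p+)>
<!ELEMENT i - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA|i)*>
]>
<doc>
<p>&outer;</p>
</doc>
```

The p element is then expected to contain the text outer text with <i>nested</i> text, with the i tags recognized as markup. By contrast, a declaration such as <!ENTITY loop "see &loop;"> would make any reference to loop invalid, since its replacement text refers back to the entity being expanded.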

External general entities

Rather than specifying the replacement text for an entity literally, it’s also possible to specify that replacement text should be retrieved from an external resource (such as a file or via HTTP) by declaring the entity as follows:

<!ENTITY ent SYSTEM "filename.txt">

where the part beginning with SYSTEM ... (containing a file name in the example) is an external identifier.

Data text entities

For entities declared as follows

<!ENTITY ent CDATA "escaped replacement text">

or, equivalently,

<!ENTITY ent SDATA "escaped replacement text">

entity references are expanded into the respective literal replacement text without further interpretation of the replacement text as markup. If the replacement text contains characters or character sequences that would be interpreted as markup delimiters (such as the < or & characters), then those characters will be expanded into character entity references.

Consequently, general entity references and tags aren’t recognized in data text entities; note, however, that the replacement text literal in a data text entity declaration is subject to parameter entity replacement.
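The following sketch illustrates the difference from a regular general entity; the entity name formula and its replacement text are assumptions chosen for illustration only:

```sgml
<!DOCTYPE doc [
<!-- data text entity: the replacement text is not parsed as markup -->
<!ENTITY formula CDATA "a < b & <i>c</i>">
<!ELEMENT doc - - (p+)>
<!ELEMENT p - - (#PCDATA)>
]>
<doc>
<p>The condition is &formula;</p>
</doc>
```

Because formula is declared as CDATA, the <i> and </i> character sequences in its replacement text are treated as plain text rather than tags, which is why the p element can get by with a (#PCDATA) content model here. The exact form of the escaped output depends on the processor.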

In sgmljs.net, CDATA and SDATA data text entities are treated identically.

Processing instruction data text entities

Apart from CDATA and SDATA, the PI keyword can also be used in data text entity declarations.

This variant introduces an entity containing a processing instruction, and is the only variant that can also be used with parameter entities.

References to PI data text entities can only be used in a context where a processing instruction can be used; specifically, PI data text general entity references can’t be used in attribute values.
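A small sketch of a PI data text entity follows; the entity name page-break and the processing instruction text are hypothetical:

```sgml
<!DOCTYPE doc [
<!-- entity whose replacement is a processing instruction -->
<!ENTITY page-break PI "page-break">
<!ELEMENT doc - - (p+)>
<!ELEMENT p - - (#PCDATA)>
]>
<doc>
<p>First page</p>
&page-break;
<p>Second page</p>
</doc>
```

The reference to page-break is placed between the two p elements, a position where a processing instruction is allowed; using it inside an attribute value would not be permitted.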

External data text entities

In sgmljs.net, an external data text entity is declared using the syntax for CDATA and SDATA data entities, explained below.

Character entity references

Character entity references are strings of the form &#NNNNNN where NNNNNN is a decimal number, or of the form &#xMMMMMM where MMMMMM is a hexadecimal number. The number refers to the code point in the document character set (Unicode) represented by the character entity reference.

Character entity references are passed as-is to the output; all browsers and markup processing tools are expected to be able to handle character entity references.
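As a brief illustration, both of the following paragraphs refer to the same code point (U+2013 EN DASH), once in decimal and once in hexadecimal notation; the surrounding p elements are merely assumed context:

```sgml
<p>An en dash written in decimal: &#8211;</p>
<p>The same en dash written in hexadecimal: &#x2013;</p>
```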

Parameter entities

Entity declarations with a % character following the ENTITY keyword introduce parameter entities. Where general entity declarations define replacement text for content, parameter entity declarations define replacement text for use in markup declarations.

For example, the following document type declaration set contains a declaration for the idattr parameter entity. The parameter entity is then referenced twice in further declarations.

<!DOCTYPE doc [
<!ENTITY % idattr "id ID #IMPLIED">
<!ELEMENT doc - - (#PCDATA|p|ul|a)>
<!ELEMENT p - - (#PCDATA)>
<!ELEMENT ul - - (li+)>
<!ELEMENT li - - (#PCDATA)>
<!ELEMENT a - - (#PCDATA)>
<!ATTLIST doc %idattr>
<!ATTLIST p %idattr>
<!ATTLIST ul %idattr>
<!ATTLIST li %idattr>
<!ATTLIST a href CDATA #IMPLIED %idattr>
]>
...

Similar to general entity references, the %idattr parameter entity reference is expanded into the replacement text

id ID #IMPLIED

so that all elements will have the same id attribute declaration as a result.

Furthermore, the a element will have the href attribute in addition to the id attribute. Note that the purpose of reusing an attribute declaration can also be achieved by using a name group - a list of element names - in an ATTLIST declaration (and furthermore could also be achieved using WebSGML’s #ALL keyword in place of an element name or name group).

A parameter entity reference must begin with the % character. A parameter entity declaration must have whitespace between the % character and the subsequent parameter entity name.

Apart from reusing parts of declaration text, parameter entities are used in particular for

customizing a generic external declaration set by overriding default declarations for parameter entities in the internal declaration set; see declaration sets

as placeholder for keywords in marked sections

designing declaration set text for reuse in general.
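The first two uses can be sketched as follows. The file name generic.dtd, the parameter entity names inline and draft, and the element names are all assumptions made for illustration. The generic external declaration set provides defaults:

```sgml
<!-- hypothetical content of generic.dtd -->
<!ENTITY % inline "em|strong">
<!ENTITY % draft "IGNORE">
<!ELEMENT doc - - (p+)>
<!ELEMENT p - - (#PCDATA|%inline;)*>
<!ELEMENT (em|strong|code) - - (#PCDATA)>
<![ %draft; [
<!ATTLIST p status CDATA #IMPLIED>
]]>
```

A document can then override the defaults in its internal declaration set:

```sgml
<!DOCTYPE doc SYSTEM "generic.dtd" [
<!-- these declarations take precedence over the defaults in generic.dtd -->
<!ENTITY % inline "em|strong|code">
<!ENTITY % draft "INCLUDE">
]>
<doc>
<p status="review">Some <code>inline code</code></p>
</doc>
```

This relies on the internal declaration set being processed before the external one, and on the first declaration of an entity being the one that counts. With draft set to INCLUDE, the marked section in generic.dtd becomes active and the status attribute is declared.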

Unlike general entities, parameter entities are fetched eagerly as soon as an external parameter entity declaration is processed. Therefore, it is an error for the replacement text of a parameter entity to contain unresolved references to (other) parameter entities; references to parameter entities already declared in a prior declaration (in markup declaration text order), on the other hand, are recognized and expanded in parameter entity replacement text.

Parameter entities can also be used for fetching external content when that content can’t or shouldn’t be fetched multiple times, as would be the case for external general entities; for example, when fetching an external service response into a parameter entity for use in multiple references, or when fetching from the standard input or from a network stream.

Parameter entity references are expanded in the replacement text for general entities (as well as in all other markup declarations, except in system identifier literals). This means that any parameter entity value can be re-declared (copied) as a general entity by placing a parameter entity reference into the replacement text for a general entity.
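A minimal sketch of copying a parameter entity value into a general entity follows; the entity names release and release-text are hypothetical:

```sgml
<!DOCTYPE doc [
<!ENTITY % release "1.2.3">
<!-- the parameter entity reference is expanded when this declaration is processed -->
<!ENTITY release-text "%release;">
<!ELEMENT doc - - (p+)>
<!ELEMENT p - - (#PCDATA)>
]>
<doc>
<p>This is release &release-text;</p>
</doc>
```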

Note that parameter (or general) entity references aren’t expanded in system identifier literals (of markup declarations using external identifiers, such as entity and notation declarations). To construct a system identifier from a parameter entity, an additional, derived parameter entity is declared consisting of a reference to the parameter entity to construct from, with leading and trailing quote characters added; the derived parameter entity is then used as system identifier literal.
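The construction described above might look as follows. This is only a sketch under the stated rules; the entity names and the file name are made up:

```sgml
<!ENTITY % chapter-file "data/chapter1.sgm">
<!-- derived parameter entity: the value above wrapped in quote characters -->
<!ENTITY % chapter-file-lit "'%chapter-file;'">
<!-- the derived entity takes the place of the system identifier literal -->
<!ENTITY chapter1 SYSTEM %chapter-file-lit;>
```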

External Parameter Entities

Like general entity declarations, parameter entity declarations can point to a system identifier (a file or network location to fetch character data from), rather than providing inline replacement text as parameter literal.
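For example, a shared set of declarations could be pulled into a document type declaration as sketched below; the file name common-decls.dtd is hypothetical and is assumed to contain the element declarations for doc:

```sgml
<!DOCTYPE doc [
<!ENTITY % common-decls SYSTEM "common-decls.dtd">
<!-- the reference inserts the fetched declarations at this position -->
%common-decls;
]>
```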

System-specific Entities

An entity declaration with omitted system identifier literal but containing SYSTEM, such as the following

<!ENTITY ent SYSTEM>

declares an entity which is resolved by default to the filename ent. The file is searched for in the same directory as the file declaring it (the resolved value or the directory to search can be changed using runtime parameters).

Any entity that can be declared as external entity (general, data and parameter entities) can be declared system-specific.

Implied Entities

When IMPLYDEF ENTITY YES is specified in the SGML declaration, general entity references to undeclared entities will be resolved as system-specific entities. This means there is no need to specify an entity declaration at all; entities can be referenced right away provided the entity name can be resolved as a file name, or another resolution rule has been provided as invocation parameter.

Parameter entities, on the other hand, must always be declared. Note, however, that external data text entities can’t be declared system-specific.

Data Entities

Entities can be declared to be in a notation as follows (where we first declare a notation to reference its name in the entity declaration):

<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY someent SYSTEM "some-entity" NDATA somenotation>

Entities declared like this are not considered SGML character data and won’t be expanded into replacement text when used in an entity reference.

Instead, the SGML processor just reproduces entity references to these as-is; special processing can be implemented and associated with a notation (i.e. with the public identifier of a notation) via notation handlers and the SGML API. A standard notation handler is provided by the templating feature.
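Besides entity references, a data entity can be named as the value of an attribute with declared value ENTITY. The following sketch assumes a hypothetical png notation, a logo entity, and an img element:

```sgml
<!DOCTYPE doc [
<!NOTATION png SYSTEM "image/png">
<!ENTITY logo SYSTEM "logo.png" NDATA png>
<!ELEMENT doc - - (img+)>
<!ELEMENT img - O EMPTY>
<!ATTLIST img src ENTITY #REQUIRED>
]>
<doc>
<img src=logo>
</doc>
```

Here, logo in the src attribute is the name of the declared entity rather than a file name; what is done with the entity is up to the handler associated with the notation.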

Data entities declared using the CDATA or SDATA keywords in place of NDATA, on the other hand, will be expanded into the respective replacement text when used as entity reference:

<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY ent SYSTEM "some-entity" CDATA somenotation>

An entity reference to ent will be expanded into the text contained in the "some-entity" file; as with data text entities, special characters such as < or & are escaped in the replacement text, and not treated as markup delimiters.

Providing values for data attributes

If a notation has data attributes, values for the data attributes can (or must, if no #FIXED or default values are provided) be specified as shown in the following example:

<!NOTATION n SYSTEM "some system id">
<!ATTLIST #NOTATION n x CDATA #IMPLIED y CDATA #IMPLIED>
<!ENTITY e SYSTEM "another system id" NDATA n [ x="val1" y="val2" ]>

where the first two declarations establish a notation with data attributes x and y, and the NDATA entity declaration for the e entity demonstrates the syntax for providing data attribute values.

Short references

Short references are a facility to replace short spans of punctuation marks and other characters in text content, such as dots, commas, tabs, brackets, spaces, and others, by entity references in a context-dependent way. For example, short references can be used to replace a sequence of two hyphen-minus characters (--) with an ndash character (roughly, a dash the width of an n character). Using short references, text can be typed using the hyphen-minus characters entered via standard keyboard keys, yet can be rendered using the typographically and semantically more desirable ndash character where appropriate.

Short reference map declaration

<!SHORTREF shortref-map-name shortref-delimiter replacement-entity-name
[shortref-delimiter replacement-entity-name] ...>

Short references are declared in a short reference map declaration for a named short reference map as shown in the following example, which declares that a sequence of two hyphen-minus characters should be replaced by a reference to the ndash-ent entity, which in turn maps to the character entity reference for ndash (Unicode code point 8211 in decimal), in text portions where the my-shortref-map short reference map is active:

<!ENTITY ndash-ent "&#8211;">
<!SHORTREF my-shortref-map
"--" ndash-ent>

As shown, a short reference map maps short reference delimiters to general entity names, rather than to replacement text directly.

Short reference use declaration

<!USEMAP shortref-map-name element-name>

<!USEMAP shortref-map-name name-group>

To then activate the my-shortref-map short reference map within the text content of a specific element (P in this example) or a group of elements, a short reference use declaration is used:

<!USEMAP my-shortref-map P>

Short reference map declarations can map more than a single short reference delimiter, as shown in the following example, which, in addition to mapping double hyphen-minus characters, also maps quotation mark characters (U+0022 QUOTATION MARK) to typographic quotation mark characters (U+201C LEFT DOUBLE QUOTATION MARK, represented by &#8220; as decimal character entity reference), which might be typographically more appealing, depending on the text language and typographic conventions:

<!ENTITY ndash-ent "&#8211;">
<!ENTITY curlyquot-ent "&#8220;">
<!SHORTREF enhanced-typography
"--" ndash-ent
'"'  curlyquot-ent>
<!USEMAP enhanced-typography p>

Of course, when the HTML predefined entities are declared in the SGML declaration (such as when processing .html or .md files, or when an SGML declaration activating the HTML predefined entities is put in the file to process, as shown here), a short reference map can directly refer to predefined entities, rather than having to declare ndash-ent and curlyquot-ent in the prolog:

<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE body [
<!ELEMENT body - - (p+)>
<!ELEMENT p - - (#PCDATA)>
<!SHORTREF enhanced-typography
"--" ndash
'"'  ldquo>
<!USEMAP enhanced-typography p>
]>
<body>
<p>"Murder" she said -- aka '4:50 from Paddington'</p>
</body>

The example imports the predefined entities for HTML, declares a tiny HTML-like vocabulary consisting of the body and p elements, and activates a single short reference map within the content of p. Because this map doesn’t distinguish opening from closing quotation marks, both are replaced by ldquo. Telling the two apart requires separate short reference maps and uses: for example, one for when in child content of p, starting a dedicated quot element for the quotation, and another one for ending the quot element from within quot content.

Invoking sgmlproc on the above content will produce

<body>
<p>&#8220;Murder&#8220; she said &#8211; aka '4:50 from Paddington'</p>
</body>
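To distinguish opening from closing quotation marks, a context-dependent variant can use two short reference maps. The following is only a sketch of this classic technique; the quot element, the map names in-p and in-quot, and the entity names start-quot and end-quot are assumptions made for illustration:

```sgml
<!DOCTYPE body [
<!ELEMENT body - - (p+)>
<!ELEMENT p - - (#PCDATA|quot)*>
<!ELEMENT quot - - (#PCDATA)>
<!ENTITY start-quot "<quot>">
<!ENTITY end-quot "</quot>">
<!-- within p, a quotation mark starts a quot element -->
<!SHORTREF in-p '"' start-quot>
<!-- within quot, a quotation mark ends the quot element -->
<!SHORTREF in-quot '"' end-quot>
<!USEMAP in-p p>
<!USEMAP in-quot quot>
]>
<body>
<p>"Murder" she said</p>
</body>
```

With these declarations, the first quotation mark is encountered in p content and is expected to open a quot element, while the second one is encountered in quot content and is expected to close it again, so that the quoted word ends up enclosed in quot tags.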

Short reference use declaration in content

As the

