[comp.text.sgml] Record boundaries in Author/Editor and SGML

rodney@sq.sq.com (Rodney Boyd) (12/05/90)

We don't usually receive customer support requests on the net, but
since we did, we thought we should post the resolution.

Gary Pianosi wrote:

> 
> I am able to import hand-edited files into Author/Editor without error,
> but when I try to validate the document, I get the error message:
> 
>         "Validation Error:  Text not allowed here"
> 
> wherever there is a new line between an end tag and a start tag.

and

> On the other side, when I try to export a document from Author/Editor,
> the resulting text file has new-lines only within text and after some
> start tags.  Aside from being difficult to read, some lines exceed 500
> characters making these files next to impossible to edit on the PC.

Both problems cleared up after he rebuilt the rules file
using SoftQuad RulesBuilder and changed some formatting parameters.
(A rules file is a compiled, binary form of a DTD that Author/Editor
can read.)

We conjecture that the rules file he received from us had become
corrupted or was for some other reason incompatible with the version
of Author/Editor he is using.

Exported files can be made more readable by
(i) setting the line length manually, and
(ii) instructing Author/Editor to export some or all elements in
	"blocked" format, i.e., starting on a new line,
	with the following element also starting on a new line.
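
For readers who cannot change the export settings, the effect of
"blocked" format can be approximated after the fact. The Python sketch
below is only an illustration, not what Author/Editor does internally;
the function name block_format and its arguments are made up for this
example. It is safe only where the parent of a blocked element has
element content, for the reasons given in the discussion of record
ends later in this post.

```python
import re

def block_format(text, blocked):
    """Start each named element on a new line, and start a new line after it.

    Hypothetical post-processing helper: 'blocked' is a list of element
    names chosen by the caller. Attributes containing '>' are not handled.
    """
    if not blocked:
        return text
    names = "|".join(re.escape(n) for n in blocked)
    # Newline before the start tag of a blocked element (unless already there).
    text = re.sub(rf"(?<!\n)(<(?:{names})\b[^>]*>)", r"\n\1", text,
                  flags=re.IGNORECASE)
    # Newline after the end tag of a blocked element (unless already there).
    text = re.sub(rf"(</(?:{names})\s*>)(?!\n)", r"\1\n", text,
                  flags=re.IGNORECASE)
    return text.lstrip("\n")

print(block_format("<doc><p>one</p><p>two</p></doc>", ["p"]))
```

With "p" blocked, each <p> element ends up on a line of its own
between the <doc> and </doc> tags.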

> Finally, my questions ...
> 
>     -   What is the significance of record boundaries in SGML?
> 

From one of our programmers:

------------------------------
Record boundaries in SGML and A/E:

Record boundaries are *very* complicated in SGML. Here is a partial
description of when record ends (RE's) are ignored or treated as data:

	1. The "first" RE and the "last" RE in an element are ignored
	   if nothing significant occurs in the element before the first
	   RE or after the last RE respectively.
	2. Record boundaries -- actually, separators -- are ignored between
	   elements in an element with "element content" but are considered
	   to be data (e.g. text) in an element with "mixed content."

So,
	<a>
	text
	</a>
is equivalent to
	<a>text</a>

And
	<a><b></b>
	</a>
is equivalent to
	<a><b></b></a>

However,
	<a><b></b>
	<b></b></a>
is not necessarily equivalent to
	<a><b></b><b></b></a>
since "a" might have mixed content -- e.g.
	<!ELEMENT a (#PCDATA | b)+>
But they would be equivalent if "a" has only element content -- e.g.
	<!ELEMENT a (b)+>

-------------------------------

(For most purposes, a "mixed content model" means one in
which elements are mixed with #PCDATA.)
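
The two rules and the <a>/<b> examples above can be checked
mechanically. The Python sketch below is an illustration under stated
assumptions, not how Author/Editor or any SGML parser is implemented:
it treats "\n" as the RE, reduces rule 1 to "an RE right after a start
tag or right before an end tag is ignored", and relies on the caller to
say which elements have element content. The names normalize and
element_content are invented for this sketch.

```python
import re

TAG = re.compile(r"(<[^>]*>)")

def normalize(doc, element_content=frozenset()):
    """Drop the REs that the two simplified rules say are ignored.

    'element_content' names the elements declared with element content;
    every other element is treated as having mixed content.
    Expects a well-formed instance with explicit start and end tags.
    """
    tokens = [t for t in TAG.split(doc) if t]
    stack, out = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("</"):
            stack.pop()
        elif tok.startswith("<"):
            stack.append(tok[1:-1].split()[0].lower())
        else:
            prev = tokens[i - 1] if i > 0 else ""
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Rule 1 (simplified): first RE after a start tag is ignored.
            if prev.startswith("<") and not prev.startswith("</") and tok.startswith("\n"):
                tok = tok[1:]
            # Rule 1 (simplified): last RE before an end tag is ignored.
            if nxt.startswith("</") and tok.endswith("\n"):
                tok = tok[:-1]
            # Rule 2: in element content, separators between elements are ignored.
            if stack and stack[-1] in element_content and not tok.strip():
                tok = ""
        out.append(tok)
    return "".join(out)

print(normalize("<a>\ntext\n</a>") == "<a>text</a>")
print(normalize("<a><b></b>\n</a>") == "<a><b></b></a>")
# Mixed content by default: the RE between the two <b> elements is kept as data.
print(normalize("<a><b></b>\n<b></b></a>") == "<a><b></b><b></b></a>")
# Declared as element content: the same RE is a separator and is dropped.
print(normalize("<a><b></b>\n<b></b></a>", {"a"}) == "<a><b></b><b></b></a>")
```

The third comparison is the only one that comes out False, which is
the "not necessarily equivalent" case described above.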


In a subsequent article, Erik Naggum writes:

> The problem can be reduced, I think, to the problem of the treatment
> of Record End in this contrived example:
> 
> 	<!element foo (bar+)>
> 	<!element bar (#PCDATA)>
> 
>      1	`<foo>'
>      2	`<bar>'
>      3	`caninus'
>      4	`</bar>'
>      5	`<bar>'
>      6	`felinus'
>      7	`</bar>'
>      8	`</foo>'
> 
> (where ` signifies Record Start, ' Record End for clarity, line
> numbers for reference, only)
> 
> According to section 7.6.1, this will be interpreted at the outer
> (foo) level as:
> 
>      1	`<foo>'
>      2	`<bar>...</bar>'
>      5	`<bar>...</bar>'
>      8	`</foo>'
> 
> Now, the RE in line 1 is clearly the first RE in the content of foo,
> and the RE in line 5 is clearly the last RE in the content of foo.
> According to said section, these are to be ignored.
> 
> The problem is the RE in line 2, and the question boils down to this:
> 
> 	Is this RE recognized as /content/ or as /markup/?
> 
> I believe I understand this to be markup, and thus that it should be
> ignored.  It seems that Gary's problems stem from some decision
> amounting to viewing this as content, in which the RE would imply the
> start of a bar element, in which a new bar element is illegal (see
> amended note to section 11.2.4), or in which data content is not
> valid.

In fact the sample file is a valid SGML file and can be imported and
validated with Author/Editor. Even with the following DTD,

	<!ELEMENT foo (#PCDATA|bar)+>
	<!ELEMENT bar (#PCDATA)>

the file will validate. "foo"'s content model is mixed, so the RE
between </bar> and <bar> is treated as data, but text (#PCDATA) is
explicitly permitted between instances of <bar> text </bar>.

However, if the DTD were as follows

	<!ELEMENT foo (#PCDATA|(bar+))>
	<!ELEMENT bar (#PCDATA)>

then the file would not validate. "foo"'s content model is mixed,
_and_ text is not explicitly permitted between instances of
<bar> text </bar>, so the RE between them is data that the model
does not allow.
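
The three DTD variants can be compared with a small Python sketch.
It only mirrors the reasoning in this post and is not a real SGML
validator: "\n" stands for the RE, the content of foo is summarised
as a string of E (a child element) and T (a run of data), and each
content model is translated by hand into a regular expression over
those letters. The names content_of_root and validates, and the
translations themselves, are assumptions made for the illustration.

```python
import re

TAG = re.compile(r"(<[^>]*>)")

SAMPLE = "<foo>\n<bar>\ncaninus\n</bar>\n<bar>\nfelinus\n</bar>\n</foo>"

def content_of_root(doc, element_content):
    """Summarise the root's direct content: 'E' per child, 'T' per data run."""
    tokens = [t for t in TAG.split(doc.strip()) if t]
    depth, seq = 0, []
    for i, tok in enumerate(tokens):
        if tok.startswith("</"):
            depth -= 1
        elif tok.startswith("<"):
            if depth == 1:
                seq.append("E")
            depth += 1
        elif depth == 1:
            prev, nxt = tokens[i - 1], tokens[i + 1]
            # RE right after the root's start tag is ignored.
            if not prev.startswith("</") and tok.startswith("\n"):
                tok = tok[1:]
            # RE right before the root's end tag is ignored.
            if nxt.startswith("</") and tok.endswith("\n"):
                tok = tok[:-1]
            # In element content, whitespace between children is a separator.
            if element_content and not tok.strip():
                continue
            if tok:
                seq.append("T")
    return "".join(seq)

def validates(doc, model_regex, element_content):
    """True if the summarised content matches the hand-translated model."""
    return re.fullmatch(model_regex, content_of_root(doc, element_content)) is not None

print(validates(SAMPLE, r"E+", element_content=True))        # (bar+)
print(validates(SAMPLE, r"(T|E)+", element_content=False))   # (#PCDATA|bar)+
print(validates(SAMPLE, r"T|E+", element_content=False))     # (#PCDATA|(bar+))
```

Under these assumptions the first two models accept the sample and
the third rejects it, because the RE between the two bar elements
shows up as a T that the model has no place for.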

_________________________________________________________________________
Rodney Boyd				Phone: (416) 963-8337
Customer Support Representative		Fax: (416) 963-9575
SoftQuad, Inc.				e-mail: support@sq.com,
						{uunet,utzoo}!sq!support
_________________________________________________________________________