[comp.text.sgml] Short Ref's Reparsed?

jxr@thumper.bellcore.com (Jonathan Rosenberg) (06/17/91)

Hello,

I hope I use the correct terminology to ask this question...

If you specify a short reference map, e.g.,

<!ENTITY aa "a">
<!SHORTREF foo "aa" a>

what happens to the modified text after the substitution is made?  I.e.,
if the text is

	aaa

after the first substitution it becomes

	aa

Now, is the map applied again to this text to get

	a

or is the modified text not scanned again?

JR

gtoal@tardis.computer-science.edinburgh.ac.uk (06/18/91)

In article <448@salt.bellcore.com> jxr@thumper.bellcore.com (Jonathan Rosenberg) writes:
>Hello,
... about entity refs ...

I've implemented entity refs the way you describe. That doesn't
mean we're correct though :-)

I've hacked my lex parser to push back the expansion of an entity
ref onto the incoming text stream, so it is reparsed.  A cheap technique
but works OK.  Means you can have infinite recursion though if you're
not careful...
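
A minimal sketch of the pushback technique described above, written in Python rather than lex.  The function name, the `&name;` reference syntax handling, and the expansion limit are assumptions made for illustration; the limit is one possible guard against the infinite recursion mentioned, not something the post specifies.

```python
from collections import deque

MAX_EXPANSIONS = 100  # hypothetical guard value, not taken from the post


def expand_by_pushback(text, entities, limit=MAX_EXPANSIONS):
    """Expand &name; references by pushing replacement text back onto the input."""
    stream = deque(text)
    out = []
    expansions = 0
    while stream:
        ch = stream.popleft()
        if ch != "&":
            out.append(ch)
            continue
        # Collect the entity name up to the closing ";".
        name = []
        while stream and stream[0] != ";":
            name.append(stream.popleft())
        if not stream:
            raise ValueError("unterminated entity reference")
        stream.popleft()  # discard ";"
        expansions += 1
        if expansions > limit:
            raise RecursionError("too many expansions; recursive entity?")
        # Push the replacement back so that it is scanned again.
        stream.extendleft(reversed(entities["".join(name)]))
    return "".join(out)


print(expand_by_pushback("x&a;y", {"a": "1&b;2", "b": "!"}))  # x1!2y
```

Because the replacement text rejoins the input stream, nothing marks where the entity ended, which is the point the replies below take issue with.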

Graham

erik@naggum.no (Erik Naggum) (06/19/91)

Jonathan Rosenberg <jxr@thumper.bellcore.com> writes:
|
|   If you specify a short reference map, e.g.,
|
|   <!ENTITY aa "a">
|   <!SHORTREF foo "aa" a>
|   
|   what happens to the modified text after the substitution is made?

Please note a conceptual difference between "substitution" and entity
references.  When the parser sees an entity reference, it invokes an
alternate input source from the entity manager, which then delivers
characters from this source until it ends and then sends an Entity end
signal.  After that, the entity manager continues to deliver characters
from the entity which contained the entity reference.
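
The order of delivery described above can be sketched in Python as follows.  This is only an illustration: the generator `deliver` and the string used for the Entity end signal are hypothetical, and for brevity the sketch recognizes `&name;` inside the same generator, whereas in the model described here the parser recognizes the reference and the entity manager only supplies characters.

```python
EE = "<Ee>"  # stand-in for the Entity end signal; the real signal is not a character


def deliver(text, entities):
    """Yield the characters of text; on &name; switch to that entity, then emit EE."""
    i = 0
    while i < len(text):
        if text[i] == "&":
            end = text.index(";", i)
            name = text[i + 1:end]
            # Alternate input source: characters now come from the entity.
            yield from deliver(entities[name], entities)
            yield EE  # the entity has ended
            i = end + 1  # resume in the entity that contained the reference
        else:
            yield text[i]
            i += 1


print(list(deliver("x&a;y", {"a": "12"})))  # ['x', '1', '2', '<Ee>', 'y']
```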

Short references are mapped to entity references, but not by way of
rescanning the input stream.  If, however, a document with short
references is sent to a system which cannot handle short references,
they are replaced by the corresponding general entity reference.

Your example does not say what you intended.  The short reference
mapping declaration refers to an entity "a", but you have defined
entity "aa".  If you meant

	<!ENTITY aa "a">
	<!SHORTREF map "aa" aa>
and
	aaa

this would map the first, longest occurrence of a short reference
delimiter ("aa") to a general entity reference to entity "aa", and the
letter "a" and then the Ee signal would be parsed, as read from the
entity manager.  Parsing then sees the following "a", and the Ee
intervenes so no recursive mapping can take place.
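
To make the "Ee intervenes" point concrete, here is a small Python sketch.  It assumes, as described above, that short reference delimiters are recognized within the text of one entity at a time, longest match first; the function `scan` and its token format are invented for the illustration, and delimiters are assumed to be non-empty strings.

```python
def scan(text, shortrefs):
    """Split one entity's text into ("ref", name) and ("data", char) tokens."""
    # Try longer delimiters first so that the longest match wins.
    delims = sorted(shortrefs, key=len, reverse=True)
    i = 0
    while i < len(text):
        for d in delims:
            if text.startswith(d, i):
                yield ("ref", shortrefs[d])
                i += len(d)
                break
        else:
            yield ("data", text[i])
            i += 1


shortrefs = {"aa": "aa"}  # <!SHORTREF map "aa" aa>
print(list(scan("aaa", shortrefs)))  # [('ref', 'aa'), ('data', 'a')]
print(list(scan("a", shortrefs)))    # [('data', 'a')]
```

The document text and the replacement text of entity "aa" are scanned separately, so the "a" that ends the entity and the "a" that follows the reference are never seen as one "aa".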

If you meant

	<!ENTITY a "aa">
	<!SHORTREF map "aa" a>
and
	aaa

the first "aa" would map to the entity a, "aa" would be parsed, and at
this point there could be (1) an infinite recursion, or (2) the entity
could be read as "aa" by itself.  Then, there would be an Entity end,
which intervenes between the second "a" of the entity and the third
"a" in the data, so no mapping would take place.

I don't know for certain whether case (1) or (2) is the right one.  If
the standard provides that a short reference map applies only within
the entity that contains the element for which the map is declared to
be used, or invoked, this would indicate case (2).  If there is no
such provision, case (1) applies, unless there are other provisions
which preclude recursive mappings.  Once again, I will have to spend a
few hours with the Handbook to answer this question.  I would tend to
think that defining such a potentially recursive mapping is slightly
silly, except for the sole purpose of showing that a vendor's parser
is broken. :-)

The difference between "substitution" and "entity reference" is very
important.  I spent a lot of time fighting this one, until Charles
Goldfarb showed me the right direction.  An entity reference is really
a request to the entity manager for input from an alternate source of
input (his terms, very accurate), not a substitution.  The parser only
reads a stream of characters from the entity manager, and does not
rescan previously read input.  The entity manager does not have such a
capability in the first place.

Hope this helps.

</Erik>
--
Erik Naggum             Professional Programmer            +47-2-836-863
Naggum Software             Electronic Text             <erik@naggum.no>
0118 OSLO, NORWAY       Computer Communications        <enag@ifi.uio.no>

enag@ifi.uio.no (Erik Naggum) (06/19/91)

gtoal@tardis.computer-science.edinburgh.ac.uk writes:
|
|   I've hacked my lex parser to push back the expansion of an entity
|   ref onto the incoming text stream, so it is reparsed.  A cheap technique
|   but works OK.  Means you can have infinite recursion though if you're
|   not careful...

Remember that the only way to get "<" into a document, when followed
by a valid name start character (a-zA-Z) is to define an entity which
expands to "<", e.g. <!ENTITY lang "<" -- left angle bracket -->.

If you place an "entity end" signal at the end of the expansion, it
would appear to be right, except in the cases where the entity
references an external entity.

lex doesn't quite cut it.  Better to write an entity manager, and
provide an interface with the following three primitives:

	-- define entity	<!ENTITY ...>
	-- invoke entity	&name; %name;
	-- read character

After syntax checking, the ENTITY markup declaration could be pushed
to the entity manager's define entity function.  Whenever a
syntactically legal entity reference is parsed or named, the requestor
(parser or application) calls the entity manager's invoke entity
function and reads characters until Entity end occurs.

Putting both entity manager and parser in the same control path is
probably a bad idea due to their conceptual independence, and the fact
that even the application may need to reference entities.
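
A sketch of such an interface in Python, under stated assumptions: the class and method names are hypothetical, entities are plain strings (no external entities), and the Entity end signal is represented by a marker string.  The toy `parse` function stands in for the requestor and is kept outside the manager, in line with the separation argued for above.

```python
EE = "<Ee>"  # hypothetical stand-in for the Entity end signal


class EntityManager:
    def __init__(self, document):
        self.entities = {}
        # Stack of [text, position] pairs; the document entity is at the bottom.
        self.sources = [[document, 0]]

    def define(self, name, text):
        """define entity: called once an <!ENTITY ...> declaration is checked."""
        self.entities[name] = text

    def invoke(self, name):
        """invoke entity: make the named entity the current input source."""
        self.sources.append([self.entities[name], 0])

    def read(self):
        """read character: next char, EE at an entity's end, None at document end."""
        text, pos = self.sources[-1]
        if pos < len(text):
            self.sources[-1][1] = pos + 1
            return text[pos]
        if len(self.sources) > 1:
            self.sources.pop()
            return EE
        return None


def parse(manager):
    """Toy requestor: recognizes &name; and asks the manager for the entity."""
    out = []
    while (ch := manager.read()) is not None:
        if ch != "&":
            out.append(ch)
            continue
        name = ""
        while (c := manager.read()) not in (";", EE, None):
            name += c
        if c != ";":
            raise ValueError("unterminated entity reference")
        manager.invoke(name)
    return out


manager = EntityManager("x&a;y")
manager.define("a", "12")
print(parse(manager))  # ['x', '1', '2', '<Ee>', 'y']
```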


jxr@thumper.bellcore.com (Jonathan Rosenberg) (06/20/91)

From: erik@naggum.no (Erik Naggum)
Newsgroups: comp.text.sgml
Subject: Re: Short Ref's Reparsed?
Date: 19 Jun 91 14:36:47 GMT

> Jonathan Rosenberg <jxr@thumper.bellcore.com> writes:
> |
> |   If you specify a short reference map, e.g.,
> |
> |   <!ENTITY aa "a">
> |   <!SHORTREF foo "aa" a>
> |   
> |   what happens to the modified text after the substitution is made?

> . . .

> Your example does not say what you intended.  The short reference
> mapping declaration refers to an entity "a", but you have defined
> entity "aa".  If you meant

>	<!ENTITY aa "a">
>	<!SHORTREF map "aa" aa>
> and
>	aaa

Yes, this is what I meant.

> this would map the first, longest occurrence of a short reference
> delimiter ("aa") to a general entity reference to entity "aa", and the
> letter "a" and then the Ee signal would be parsed, as read from the
> entity manager.  Parsing then sees the following "a", and the Ee
> intervenes so no recursive mapping can take place.

Ok.  but, what happens in the following case:

	<!ENTITY aaa "aa">
	<!ENTITY aa "a">
	<!SHORTREF map "aaa" aaa>
	<!SHORTREF map "aa" aa>
and
	aaa
???

Does this become
	aa
or
	a
???

I think that I found a clause in the standard that outlaws recursive
applications of short reference maps in any case.  Section 9.4.6.1 (page
354 of the Handbook) says (in part):

	" A short reference can be removed from a document by replacing
	 it with an equivalent reference string that contains a named entity
	 reference.  The entity name must be that to which the short
	 reference is mapped in the current map."

This says clearly to me that given the above
	aaa
is equivalent to
	&aaa;Ee
which will (eventually) become
	aa
and not 
	a

>. . .

> Hope this helps.

Absolutely.

> </Erik>

JR

erik@naggum.no (Erik Naggum) (06/22/91)

I have consulted The SGML Handbook, and received a note from Goldfarb
himself, both of which clearly conclude that in your example case, you
would achieve infinite recursion.  There is no specific provision in
the standard to preclude recursion.

If, however, you really wish the replacement text to contain the short
reference delimiter(s), you can specify the entity to be character
data, as in

	<!ENTITY aa CDATA "a">

Jonathan Rosenberg <jxr@thumper.bellcore.com> writes:
|
|   Ok.  but, what happens in the following case:
|
|	    <!ENTITY aaa "aa">
|	    <!ENTITY aa "a">
|	    <!SHORTREF map "aaa" aaa>
|	    <!SHORTREF map "aa" aa>
|   and
|	    aaa
|   ???

1.  "aaa" maps to "&aaa;"
2.  "&aaa;" produces "aa<Ee>"
3.  "aa" maps to "&aa;"
3.  "&aa;" produces "a<Ee>"

so this becomes "a", only.
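
The four steps can be replayed with a short Python sketch.  It assumes, as stated in this post, that the replacement text of a parsed entity is itself scanned for short reference delimiters under the same map; the function `expand` and its trace format are made up for the illustration, and no guard against recursive references is included.

```python
def expand(text, entities, shortrefs, trace):
    """Return the data produced by text, mapping short references recursively."""
    delims = sorted(shortrefs, key=len, reverse=True)  # longest match first
    out = []
    i = 0
    while i < len(text):
        d = next((d for d in delims if text.startswith(d, i)), None)
        if d is None:
            out.append(text[i])
            i += 1
            continue
        name = shortrefs[d]
        trace.append(f'"{d}" maps to "&{name};"')
        trace.append(f'"&{name};" produces "{entities[name]}<Ee>"')
        # The replacement text is itself scanned for short references.
        out.append(expand(entities[name], entities, shortrefs, trace))
        i += len(d)
    return "".join(out)


entities = {"aaa": "aa", "aa": "a"}
shortrefs = {"aaa": "aaa", "aa": "aa"}
trace = []
print(expand("aaa", entities, shortrefs, trace))  # a
print("\n".join(trace))
```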

|   I think that I found a clause in the standard that outlaws
|   recursive applications of short reference maps in any case.
|   Section 9.4.6.1 (page 354 of the Handbook) says (in part):
|
|	"A short reference can be removed from a document by replacing
|	 it with an equivalent reference string that contains a named
|	 entity reference.  The entity name must be that to which the
|	 short reference is mapped in the current map."
|
|   This says clearly to me that given the above
|	    aaa
|   is equivalent to
|	    &aaa;Ee
		 ^^-- not here
|   which will (eventually) become
|	    aa
	      ^-- but here
|   and not 
|	    a

I agree that this looks possible, but the entity referenced (aaa) also
needs to map occurrences of short reference delimiters ("aa") to
entity references (aa) inside it, since it would have been parsed like
this had the short references been used.  I haven't found any way to
inhibit short reference delimiter recognition inside an entity which
was referenced by a short reference instead of a general reference;
but it seems to be quite necessary:

    a corollary is that translation from short references to entity
    references may modify the contents of entities referred to in the
    replacement text of the entity referenced by the short reference,
    according as the replacement text contains short reference
    delimiters, demanding multiple versions of entities according to
    context. 

This seems to defeat the purpose of the simple translation, and I can
readily foresee problems in this regard.  I think benefits may be
reaped from conservative use of short reference delimiters in entities
thus referenced, or making them data entities with CDATA or SDATA.

This is not only true for your particular questions, but in general,
since it may not be obvious, when declaring an entity, which short
reference delimiters will be mapped where the entity is referenced.

E.g.,

	<!ENTITY issue "Vol 5 #6">
	...
	<!ENTITY pound SDATA "[libra]">
	<!SHORTREF map "#" pound>
	...
	<!USEMAP map>
	... &issue; ...		=> Vol 5 [libra]6

From which I conclude that it's probably best to use CDATA for entities
which contain short reference delimiters.  In fact, I think CDATA
should be used whenever there is no specific need to have the entity
parsed, but this may be overly restrictive.
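
The difference can be sketched in Python.  The sketch assumes the behaviour described in this post (parsed entities are scanned under the current map, CDATA and SDATA entities are not) and, for brevity, handles single-character delimiters only; the `reference` function and the `(kind, text)` tuples are hypothetical.

```python
def reference(name, entities, shortrefs):
    """Return the data delivered by a reference to the named entity."""
    kind, text = entities[name]
    if kind in ("CDATA", "SDATA"):
        return text  # data entity: not scanned for delimiters
    # Parsed entity: every character is checked against the current map.
    return "".join(
        reference(shortrefs[ch], entities, shortrefs) if ch in shortrefs else ch
        for ch in text
    )


entities = {
    "issue": ("parsed", "Vol 5 #6"),
    "pound": ("SDATA", "[libra]"),
}
usemap = {"#": "pound"}  # <!SHORTREF map "#" pound>

print(reference("issue", entities, {}))      # Vol 5 #6
print(reference("issue", entities, usemap))  # Vol 5 [libra]6

entities["issue"] = ("CDATA", "Vol 5 #6")
print(reference("issue", entities, usemap))  # Vol 5 #6
```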

Is this a feasible compromise, Jonathan?

There is another thing with the USEMAP declaration which I discovered
while reading up on this.  One can say <!USEMAP map> and <!USEMAP
#EMPTY>, but once that is done, there is no way to restore the map
except by knowing which map was specified.  The #RESTORE found with
USELINK would have been handy.

</Erik>

jxr@thumper.bellcore.com (Jonathan Rosenberg) (06/24/91)

> I have consulted The SGML Handbook, and received a note from Goldfarb
> himself, both of which clearly conclude that in your example case, you
> would achieve infinite recursion.  There is no specific provision in
> the standard to preclude recursion.

> If, however, you really wish the replacement text to contain the short
> reference delimiter(s), you can specify the entity to be character
> data, as in

>	<!ENTITY aa CDATA "a">

> Jonathan Rosenberg <jxr@thumper.bellcore.com> writes:
>|
>|   Ok.  but, what happens in the following case:
>|
>|	    <!ENTITY aaa "aa">
>|	    <!ENTITY aa "a">
>|	    <!SHORTREF map "aaa" aaa>
>|	    <!SHORTREF map "aa" aa>
>|   and
>|	    aaa
>|   ???

> 1.  "aaa" maps to "&aaa;"
> 2.  "&aaa;" produces "aa<Ee>"
> 3.  "aa" maps to "&aa;"
> 3.  "&aa;" produces "a<Ee>"

[Oops, I guess that should be 4.]

> so this becomes "a", only.

That's what I was afraid of.

>|   I think that I found a clause in the standard that outlaws
>|   recursive applications of short reference maps in any case.
>|   Section 9.4.6.1 (page 354 of the Handbook) says (in part):

>|	"A short reference can be removed from a document by replacing
>|	 it with an equivalent reference string that contains a named
>|	 entity reference.  The entity name must be that to which the
>|	 short reference is mapped in the current map."

>|   This says clearly to me that given the above
>|	    aaa
>|   is equivalent to
>|	    &aaa;Ee
>		 ^^-- not here
>|   which will (eventually) become
>|	    aa
>	      ^-- but here
>|   and not 
>|	    a

> I agree that this looks possible, but the entity referenced (aaa) also
> needs to map occurrences of short reference delimiters ("aa") to
> entity references (aa) inside it, since it would have been parsed like
> this had the short references been used.

Yeah, that makes sense (unfortunately).

> I haven't found any way to inhibit short reference delimiter
> recognition inside an entity which was referenced by a short
> reference instead of a general reference; but it seems to be quite
> necessary:

It certainly does to me, too.

>    a corollary is that translation from short references to entity
>    references may modify the contents of entities referred to in the
>    replacement text of the entity referenced by the short reference,
>    according as the replacement text contains short reference
>    delimiters, demanding multiple versions of entities according to
>    context. 

> This seems to defeat the purpose of the simple translation, and I can
> readily foresee problems in this regard.

Ugh.  This appears to be impossibly complicated.  Can this really happen?

> I think benefits may be reaped from conservative use of short
> reference delimiters in entities thus referenced, or making them
> data entities with CDATA or SDATA.

> This is not only true for your particular questions, but in general,
> since it may not be intuitive when declaring an entity to what short
> reference delimiters may map where the entity is referenced.

>  . . .

> To which I conclude that it's probably best with CDATA for entities
> which contain short reference delimiters.  In fact, I think CDATA
> should be used whenever there is no specific need to have the entity
> parsed, but this may be overly restrictive.

> Is this a feasible compromise, Jonathan?

It is if I understand it correctly.  Are you saying that the following
"works correctly":

	<!ENTITY aaa CDATA "aa">
	<!ENTITY aa CDATA "a">
	 <!SHORTREF map "aaa" aaa>
	 <!SHORTREF map "aa" aa>

"Correctly" in the sense that the string
	aaa
would become simply
	aa
??  And, that the reason for this is that CDATA indicates that the
replacement string is unparseable character data?

> There is another thing with the USEMAP declaration which I discovered
> while reading up on this.  One can say <!USEMAP map> and <!USEMAP
> #EMPTY>, but once that is done, there is no way to restore the map
> except by knowing which map was specified.  The #RESTORE found with
> USELINK would have been handy.

I remember reading about this in the handbook.

Thanks for the help.

> </Erik>

JR

P.S.  Now I see why you use "|" instead of ">" in replying.

P.P.S.  The line eater wants these lines.
Here they are ...
I hate this software.

enag@ifi.uio.no (Erik Naggum) (06/25/91)

I'll try to avoid quoting stuff, this time.

Archimedes is rumored to have shouted Eureka! and streaked through his
native little town upon discovering a basic law of nature.  I generally
dislike clothes, and I did shout Eureka today, so there must be
a newly discovered law of nature somewhere...

Section 9.4 Entity References contains this little paragraph, which is
worth the above excursion:

	A reference to an entity that has already been referenced and
	has not yet ended is invalid (i.e. entities cannot be
	referenced recursively).

So the much publicized example

	<!ENTITY a "aa">
	<!SHORTREF map "aa" a>

is invalid, which will be discovered when the map is in use.  Why?
Let's look at what happens, once again, when "aa" is parsed as PCDATA:

1.	"aa" -> "&a;"
2.	"&a;" -> "aa<Ee>"	;; entity a is now open
3.	"aa" -> "&a;"		;; new reference to entity a
4.	Kaboom!			;; see 348:10 for details...

And as if that wasn't enough, 9.4.1 Quantities [348:13] takes care of
the infinite recursion, anyhow:

	The number of open entities ... cannot exceed the "ENTLVL"
	quantity.

ENTLVL defaults to 16.
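
Both rules can be sketched as checks on a stack of open entities.  The Python below is an illustration only: the class, the exception, and the method names are hypothetical, and whether the document entity itself counts toward ENTLVL is not addressed here.

```python
ENTLVL = 16  # default value as given above


class EntityError(Exception):
    """Raised when a reference violates one of the two rules."""


class OpenEntities:
    def __init__(self, entlvl=ENTLVL):
        self.entlvl = entlvl
        self.stack = []  # names of entities referenced and not yet ended

    def open(self, name):
        if name in self.stack:
            raise EntityError(f"entity {name!r} referenced while still open")
        if len(self.stack) >= self.entlvl:
            raise EntityError("number of open entities would exceed ENTLVL")
        self.stack.append(name)

    def close(self):
        """Called when Entity end is reached."""
        return self.stack.pop()


opened = OpenEntities()
opened.open("a")        # step 2: entity a is now open
try:
    opened.open("a")    # step 3: new reference to entity a
except EntityError as err:
    print("Kaboom!", err)
```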

This eliminates at least a fraction of our problems.

</Erik>

enag@ifi.uio.no (Erik Naggum) (06/25/91)

I wrote (despite the indent which looks like a quotation):
|
|    a corollary is that translation from short references to entity
|    references may modify the contents of entities referred to in the
|    replacement text of the entity referenced by the short reference,
|    according as the replacement text contains short reference
|    delimiters, demanding multiple versions of entities according to
|    context. 
|
| This seems to defeat the purpose of the simple translation, and I can
| readily foresee problems in this regard.

Jonathan Rosenberg <jxr@thumper.bellcore.com> responds:
|
|   Ugh.  This appears to be impossibly complicated.  Can this really happen?

I think my example with # for number and # for pounds shows that it
can get this complicated:

	<!ENTITY issue "Vol 5 #6">
	<!ENTITY pounds SDATA "[libra]">
	<!SHORTREF map "#" pounds>

	&issue;
	<!USEMAP map>
	&issue;

Conversion to a system without short references is left as an exercise
for the reader...

|   It is if I understand it correctly.  Are you saying that the following
|   "works correctly":
|
|	    <!ENTITY aaa CDATA "aa">
|	    <!ENTITY aa CDATA "a">
|	     <!SHORTREF map "aaa" aaa>
|	     <!SHORTREF map "aa" aa>
|
|   "Correctly" in the sense that the string
|	    aaa
|   would become simply
|	    aa
|   ??

Yes.

|   And, that the reason for this is that CDATA indicates that the
|   replacement string is unparseable character data?

Well, not exactly "unparseable" :-), but at least it will not be
treated as potential markup, just as data.  Specifically, if you need
to have markup characters in the text, you could do it like this:

	<!ENTITY tag CDATA "<tag>">

|   P.S.  Now I see why you use "|" instead of ">" in replying.

Well, it's actually because I think it looks better (and it's
different from everybody else's conventions).

</Erik>