[comp.text] SGML question

ath@prosys.se (Anders Thulin) (12/12/89)

Hope someone on the net can help me with this:

* What is the difference between the character set described in the
  CHARSET section of an SGML declaration, and that described in the
  SYNTAX section?  How do they interact? 

Neither the ISO document nor the book 'SGML: An Author's Guide'
seems to be very clear about this.  Thanks in advance.
-- 
Anders Thulin, Programsystem AB, Teknikringen 2A, S-583 30 Linkoping, Sweden
ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath

ath@prosys.se (Anders Thulin) (08/27/90)

SGML gurus, ahoy!

I've been trying to make sense of the SGML standard. I'm beginning to
think that it shouldn't be done at home, and only attempted by highly
trained professionals :-,

The first problem is with the modes used for recognizing delimiters
(section 9.6.1 in the standard). Here is the background:

1) SGML delimiters are recognized only in certain modes. For example,
   "<?" is recognized as the PIO delimiter only in the CON and DSM
   modes.  In other modes "<?" is not regarded as a delimiter.

2) According to section 9.6.1, CON mode delimiters are recognized in
   the _content_ rule of the SGML grammar, and in the _marked section_
   of marked section declarations that occur in _content_.

   This seems to be reasonably clear and unambiguous.

3) Annex F (not formally a part of the standard) says that the
   document is scanned in CON mode after the prolog.

   This fits reasonably well with the previous rule.

This is my problem.  Section 9.6.1 seems to be very clear on the extent
of CON mode: it's essentially used `inside' the _content_ production
[27].  How, then, are delimiters *outside* _content_ recognized?

After folding some irrelevant(?) productions, I find the syntax for an
SGML document entity begins like this:

  s*, SGML declaration, prolog, s*, start-tag? content, end-tag?
                                               ^CON mode starts here
                                ^or here (Annex F)

These two 'points' are by and large the same: both s* and start-tag
may be empty. No problem so far.

Now, _prolog_ may contain processing instructions, which begin with
the PIO delimiter ("<?"). PIO is recognized only in CON and DSM modes.
And DSM cannot be used in this context - it is used only in
declaration subsets or in marked sections.

Hence, the document must be scanned in CON mode at the *beginning* of
_prolog_.  Using the same argument, CON mode recognition must be used
also at the beginning of the document for the SGML declaration.

This seems to be a contradiction with rule 2 above: CON mode
recognition *must* be used also outside of _content_, even though 9.6.1
states the CON delimiters are recognized in _content_ and in the
_marked section_ of ... and so forth.

Is there a contradiction here, or have I missed something important?

---

Another problem is the ">" delimiter. According to the table in Figure
3 (page 31) in the standard, ">" is recognized as either MDC or TAGC
in CXT mode. But I find nothing that says why it should be recognized
as one rather than the other.

I would assume that ">" would be recognized as TAGC if the parser was
currently parsing a _tag_, and as MDC in markup declarations. And this
is the behaviour I would expect from any decent implementation. I just
can't see that it is supported by the standard.

So: would a scanner that recognized ">" as MDC when the person who
created the document meant TAGC strictly speaking be in error?

-- 
Anders Thulin       ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden

scjones@thor.UUCP (Larry Jones) (08/30/90)

In article <555@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
> I've been trying to make sense of the SGML standard. I'm beginning to
> think that it shouldn't be done at home, and only attempted by highly
> trained professionals :-,

<flame>
It shouldn't be done at all.  The SGML standard is, without a doubt,
the poorest excuse for a language standard I have ever seen.  I can
only assume that it was developed by people with no knowledge of
formal languages who adopted a formalism to make the resulting
document look more technical.  Regular expressions are clearly
inappropriate for describing the language, a phrase structure grammar
would have been much better.  In addition, the standard contains a
number of errors, inconsistencies, ambiguities, and violations of
the formalisms.  The SGML standard is in dire need of interpretation
and revision -- unfortunately, unlike ANSI standards which clearly
specify how to go about submitting comments and requesting
interpretations, ISO standards provide no clues at all.
</flame>

Now, having gotten that out of my system, allow me to try to provide
some answers to your questions.

> This is my problem.  Section 9.6.1 seems to be very clear on the extent
> of CON mode: it's essentially used `inside' the _content_ production
> [27].  How, then, are delimiters *outside* _content_ recognized?

That's what it says, but it's not what it means.  CON mode is simply
the default mode when you're not in any other mode.  So you do indeed
start off at the very beginning of the entire document in CON mode
and return to it whenever a nested mode is ended.
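
A minimal sketch of that reading, in Python: delimiters are looked up
against the current mode, and CON is whatever you are in when no nested
mode is active.  Only the PIO entry ("<?" in CON and DSM) is taken from
this thread; the ModeStack class, the recognize() helper, and the use of
"LIT" as the nested mode are illustrative assumptions, not anything the
standard prescribes.

```python
# Sketch only: mode-dependent delimiter recognition with CON as the default.
# The PIO entry comes from the thread; everything else is hypothetical.

RECOGNITION_MODES = {
    # delimiter string: (role name, modes in which it is recognized)
    "<?": ("PIO", {"CON", "DSM"}),
}


class ModeStack:
    """Tracks nested recognition modes; an empty stack means CON."""

    def __init__(self):
        self._nested = []

    @property
    def current(self):
        return self._nested[-1] if self._nested else "CON"

    def enter(self, mode):
        self._nested.append(mode)

    def leave(self):
        self._nested.pop()


def recognize(text, pos, mode):
    """Return the delimiter role found at text[pos] in this mode, or None."""
    for string, (role, modes) in RECOGNITION_MODES.items():
        if text.startswith(string, pos) and mode in modes:
            return role
    return None


modes = ModeStack()
at_document_start = recognize("<?page break>", 0, modes.current)

modes.enter("LIT")                      # some nested mode without PIO
inside_nested = recognize("<?page break>", 0, modes.current)
modes.leave()                           # back to the default, CON
```

With this reading a processing instruction at the very start of the
document is recognized, because the scanner is already in CON mode there.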

> Another problem is the ">" delimiter. According to the table in Figure
> 3 (page 31) in the standard, ">" is recognized as either MDC or TAGC
> in CXT mode. But I find nothing that says why it should be recognized
> as one rather than the other.

CXT isn't a real mode, it's just a pseudo-mode.  Some delimiters
aren't recognized as such unless they are followed by some particular
context as specified in 9.6.2.  This context may include other
delimiters so CXT mode is used to indicate that a delimiter needs to
be recognized while verifying the context of another delimiter.  Thus,
there should really be a separate CXT mode for each of the contexts
listed in 9.6.2 (e.g. DCL-CXT, GI-CXT, etc.) and the table in Figure
3 should be expanded appropriately (e.g. TAGC should be GI-CXT [but
only if "SHORTTAG YES" is specified on the SGML declaration], and
MDC should be DCL-CXT and MSE-CXT).
----
Larry Jones                         UUCP: uunet!sdrc!thor!scjones
SDRC                                      scjones@thor.UUCP
2000 Eastman Dr.                    BIX:  ltl
Milford, OH  45150-2789             AT&T: (513) 576-2070
Oh, now YOU'RE going to start in on me TOO, huh? -- Calvin

ath@prosys.se (Anders Thulin) (09/02/90)

In article <146@thor.UUCP> scjones@thor.UUCP (Larry Jones) writes:
>In article <555@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
>> I've been trying to make sense of the SGML standard. I'm beginning to
>> think that it shouldn't be done at home, and only attempted by highly
>> trained professionals :-,
>
><flame>
>[ ... ] In addition, the standard contains a
>number of errors, inconsistencies, ambiguities, and violations of
>the formalisms. 
></flame>

Glad to hear it ... sort of ... well, ...  :-/

Is there any list of these errors, etc floating around? It would
probably be a help to the readers of comp.text.sgml to have one
around. Alternatively, can we compile one? I'm thinking of sending
the CON and CXT questions to ISO for formal interpretation. A list
of bugs in the standard might provide sufficient reason to revise
it.

Or perhaps this has been tried earlier without success?

-- 
Anders Thulin       ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden

tut@cairo.Sun.COM (Bill "Bill" Tuthill) (09/05/90)

In article <582@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
> 
> It would probably be a help to the readers of comp.text.sgml to have
> [a list of SGML specification problems] around.

Is there a newsgroup comp.text.sgml somewhere?  Not where I work.

We investigated SGML two years ago, going so far as to send several people
to high-priced conferences.  Our conclusion was that SGML doesn't solve
the main problem we need solved, which is document portability.  Excuse the
baseball metaphor, but SGML has three strikes against it:

1. It originated at IBM, a company not renowned for software prowess.

2. The spec is bloated, bogus, pretentious, and incomprehensible.

3. It doesn't deal with graphics, tables, or equations (is that 5 strikes?).
   It only provides text portability, which ASCII does more elegantly.

My mind is open, but so far SGML proponents have said nothing to make me
change my mind.  We all want document portability, but I don't think we'll
ever see it from SGML.  And if SGML isn't supposed to provide document
portability, just what problem is it supposed to solve?

ath@prosys.se (Anders Thulin) (09/05/90)

In article <141829@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:

>3. It doesn't deal with graphics, tables, or equations (is that 5 strikes?).

Graphics, no. Although graphics should probably be treated as non-SGML
data, which SGML can handle.

Tables it does provide for. That is, SGML permits you to mark up a
table.  The actual formatting is left to a 'processor', which is
supposed to do what you want in some way. So, provided that you have
this 'processor' you can do tables. I understand the word 'processor'
to mean a formatting program, e.g. troff.

Equations are almost the same thing. SGML permits text (entities?) to
be written in a 'notation', which then is interpreted by some
'notation translator'. I believe there is a registered BSI notation
standard which could be used. Alternatively, TeX notation could
probably be used as well. 

You are right that SGML doesn't provide these capabilities directly.
But an SGML parser/translator should not choke on them either. It
should just pass the buck to something that does better.

>   It only provides text portability, which ASCII does more elegantly.

ASCII isn't much of a help if I want to insert a Swedish { in the
text.  (That '{', of course, is an 'a' with an umlaut accent). Nor is
ASCII very good at indicating italics or boldface.

Some further coding conventions are required if I want to be able to
use these 'signs'. SGML is one way of making such conventions.

>My mind is open, but so far SGML proponents have said nothing to make me
>change my mind.  We all want document portability, but I don't think we'll
>ever see it from SGML.  And if SGML isn't supposed to provide document
>portability, just what problem is it supposed to solve?

Document markup. I think SGML solves that, although perhaps not as
neatly as I would wish it did.

The interchange (or portability) part should be solved by the SDIF
standard.  I haven't seen it, so I can't swear to it. 

--

It would be interesting to hear from anyone with hands-on experience
with SGML products if they are capable of handling graphics, equations
and things like that, and how well they do.

-- 
Anders Thulin       ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden

tut@cairo.Sun.COM (Bill "Bill" Tuthill) (09/06/90)

In article <583@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
> 
> > SGML only provides text portability, which ASCII does more elegantly.
> 
> ASCII isn't much of a help if I want to insert a Swedish { in the
> text.  (That '{', of course, is an 'a' with an umlaut accent).

I probably should have said ISO 8859-1 (also known as ISO Latin-1),
not ASCII, but I didn't think many people would know what that is.
ISO Latin-1 is an 8-bit codeset identical to ASCII in the lower half,
but extended into the upper half with accent marks and characters
required in western Europe.  Anders could produce any Scandinavian
character using this extended code set.
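
A small illustration of the difference, as a Python sketch.  The sample
word and the helper from_swedish_7bit() are assumptions made for the
example; the single '{' mapping is the convention Anders describes above.

```python
# Sketch only: ASCII vs. ISO 8859-1 (Latin-1) for Swedish text.
text = "Linköping"                      # illustrative sample string

# Latin-1 stores every character of this word in exactly one byte.
latin1_bytes = text.encode("iso-8859-1")

# Plain ASCII has no code for the o with umlaut, so encoding fails.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# The lower half (codes 0..127) of Latin-1 decodes exactly like ASCII.
lower_half_matches = all(
    bytes([code]).decode("iso-8859-1") == bytes([code]).decode("ascii")
    for code in range(128)
)

# The 7-bit convention from the quoted post: '{' stands for a-umlaut.
SWEDISH_7BIT = {"{": "ä"}


def from_swedish_7bit(s):
    """Hypothetical helper: map the 7-bit stand-ins to real letters."""
    return "".join(SWEDISH_7BIT.get(ch, ch) for ch in s)
```

The catch with the 7-bit convention is that a literal brace can no longer
be told apart from the letter, which the 8-bit code set avoids.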

However, ISO Latin-1 doesn't solve the character encoding problem
outside western Europe.  There are ISO standards for eastern Europe
(ISO 8859-2), Greece (ISO Greek), and Russia (ISO Cyrillic).  There
are also standards for Japan (JIS) and elsewhere in Asia.  But these
code sets are not mutually compatible; they are not interchangeable.

I believe the ultimate answer is Unicode, a 16-bit code set that
includes all known languages of the world in a single, interchangeable
code set.  Developed by Joe Becker and others at Xerox, Apple, and
elsewhere, Unicode represents a tremendous leap forward.  The main
reason 16 bits is sufficient is that Chinese, Japanese and Korean
pictographs have been combined so as to be complete and correctly
ordered, though not necessarily contiguous.

The main drawback to Unicode is that files will be twice as big.  But
being able to exchange data without shifting and conversion is a huge
advantage.  Space has even been left in the Unicode address space for
ancient writing systems such as hieroglyphics and cuneiform.

emv@math.lsa.umich.edu (Edward Vielmetti) (09/07/90)

In article <141829@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
   Is there a newsgroup comp.text.sgml somewhere?  Not where I work.

comp.text.sgml was created just a few minutes ago; you should see
it soon.  This should be the first message in it.

I hope to come up with a more reasonable introduction to the group in
the next few weeks.  In the interim if you're on the internet the
site "sgml.math.lsa.umich.edu" has a directory /pub/sgml with some
stuff in it.

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>

bobs (Bob Stayton, Yoyodoc) (09/08/90)

In article <141829@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
>In article <582@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
>> 
>We investigated SGML two years ago, going so far as to send several people
>to high-priced conferences.  Our conclusion was that SGML doesn't solve
>the main problem we need solved, which is document portability.  Excuse the
>baseball metaphor, but SGML has three strikes against it:
>
>1. It originated at IBM, a company not renowned for software prowess.
>
>2. The spec is bloated, bogus, pretentious, and incomprehensible.
>
>3. It doesn't deal with graphics, tables, or equations (is that 5 strikes?).
>   It only provides text portability, which ASCII does more elegantly.

I won't touch your first two strikes, but the third requires
a response.

The most thorough implementation of SGML to date appears to
be CALS, the emerging Defense Department standard for
documentation.  If you haven't looked at SGML for two
years, then you probably missed CALS.

CALS does address graphics, tables, and equations.
It sort of cheats on graphics by choosing three formats
from among the many, but hey, why invent another format?
The tables spec has gone through a major revision and
appears to address most of the needs of describing
complex tables.  The equation part was adapted from ISO TR
9573 written by Anders Berglund.

The latest CALS spec also provides a method for identifying
a default format for each text element.  That way you
can have your text and format it too |).

The baseline tag set of CALS provides actual tag names and
definitions that will become some kind of standard merely
by the weight of the DoD.  Several vendors of publishing
software are providing CALS compliance, which will also help
make it a de facto standard for document interchange.
Of course, I don't know of anybody that is actually doing
this yet, but it is certainly feasible now.

bobs
Bob Stayton                                 425 Encinal Street
Technical Publications                      Santa Cruz, CA  95060
The Santa Cruz Operation, Inc.              (408) 425-7222
                                            ...!uunet!sco!bobs
/* I don't speak for my company and they don't speak for me. */

bts@unx.sas.com (Brian T. Schellenberger) (09/11/90)

In article <141873@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
|I believe the ultimate answer is Unicode, a 16-bit code set that
|includes all known languages of the world in a single, interchangeable
|code set.  Developed by Joe Becker and others at Xerox, Apple, and
|elsewhere, Unicode represents a tremendous leap forward.  The main
|reason 16 bits is sufficient is that Chinese, Japanese and Korean
|pictographs have been combined so as to be complete and correctly
|ordered, though not necessarily contiguous.
|
|The main drawback to Unicode is that files will be twice as big.  But
|being able to exchange data without shifting and conversion is a huge
|advantage.  Space has even been left in the Unicode address space for
|ancient writing systems such as hieroglyphics and cuneiform.

This is by no means necessary.  Even the "large" versions of Kanji and
such-like only have 6000 or so characters.  Allowing for overlap, I would
be immensely surprised if you couldn't take care of all the ideographic living
languages (mostly Chinese, Japanese, and Korean) in 15,000 characters, tops.
Non-ideographic languages tend to have no more than about 40
characters in no more than two variations (e.g., capital and lowercase
for Roman; Katakana and Hiragana for Japanese), making for fewer than 100
characters each, so we can accommodate more than 150 non-overlapping languages
in the same space.  This leaves 2,000 slots for "special" symbols and
punctuation, while still coming in at less than 32,000 characters.  Thus,
it can easily be fit into 15 bits.  That way, we can have the 127 or so
most common characters encoded in 7 bits.  Then if the eighth bit is
set, we know it starts a 15-bit character.
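
A minimal Python sketch of such an encoding, under stated assumptions:
the code numbers are hypothetical (no real character set is implied),
codes below 128 are written as one byte, and anything else is written as
two bytes with the top bit of the first byte set and 15 bits of payload.
The names BUDGET, encode() and decode() are made up for the example.

```python
# Sketch only: one byte for 7-bit codes, two bytes for 15-bit codes.

# Slot budget from the paragraph above.
BUDGET = {
    "ideographic": 15_000,
    "non-ideographic": 150 * 100,       # 150 languages, up to 100 codes each
    "symbols and punctuation": 2_000,
}
TOTAL_SLOTS = sum(BUDGET.values())
FITS_IN_15_BITS = TOTAL_SLOTS <= 2 ** 15


def encode(codes):
    """Encode a sequence of integer character codes (0 .. 2**15 - 1)."""
    out = bytearray()
    for code in codes:
        if not 0 <= code < 2 ** 15:
            raise ValueError(f"code {code} does not fit in 15 bits")
        if code < 0x80:
            out.append(code)                    # top bit clear: 7-bit code
        else:
            out.append(0x80 | (code >> 8))      # top bit set: starts 15-bit code
            out.append(code & 0xFF)             # low 8 bits
    return bytes(out)


def decode(data):
    """Inverse of encode(); returns a list of integer character codes."""
    codes = []
    i = 0
    while i < len(data):
        first = data[i]
        if first & 0x80:
            codes.append(((first & 0x7F) << 8) | data[i + 1])
            i += 2
        else:
            codes.append(first)
            i += 1
    return codes
```

Note that the second byte of a two-byte character can take any value, so
a scanner has to read from the start of the data to know where characters
begin.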

The obvious starting place is to make the 7-bit code be the current ASCII
code.  I suspect that this is close to the set of most commonly used
characters world-wide, but it need not be the choice.  If such a scheme is
used, most files will be only a little bit longer than they are now, and
(assuming ASCII is used as the base), computer code will increase not one
whit.  Neither will English text.  Since English is the lingua franca of the
world these days, this will include a lot of international text, and not just
English.

Finally, such a scheme is not all that difficult to work with.  It is
certainly easier than shared 7- and 8-bit codes that include "shift"
characters.


-- 
-- Brian, the Man from Babble-on.		bts@unx.sas.com
-- (Brian Schellenberger)
"And when the votes were cast, the winner was . . .
 Mister James K. Polk, Napolean of the stump."        -- THEY MIGHT BE GIANTS.

tut@cairo.Sun.COM (Bill "Bill" Tuthill) (09/11/90)

bts@unx.sas.com (Brian T. Schellenberger) writes:
> |
> |The main drawback to Unicode is that files will be twice as big.  But
> |being able to exchange data without shifting and conversion is a huge
> |advantage.  Space has even been left in the Unicode address space for
> |ancient writing systems such as hieroglyphics and cuneiform.
> 
> This is by no means necessary.  Even the "large" versions of Kanji and
> such-like only have 6000 or so characters.  Allowing for overlap, I would
> be immensely surprised if you couldn't take care of all the ideographic living
> languages (mostly Chinese, Japanese, and Korean) in 15,000 characters, tops.

Wait a minute, do you mean it isn't necessary to combine Asian language
character sets?  There are over 6000 Kanji characters in Japan, about the
same number in Korea and China (PRC) and around 14000 characters in Taiwan.
That adds up to at least 32000.  Combinatory efforts done by the Unicode
people have reduced that total to around 20000.

Or do you mean 16 bits aren't necessary, 15 are enough?  That means there
would only be 12767 empty slots after covering the Asian languages, which
is almost certainly not enough.  I really don't think the world needs yet
another shift encoding algorithm, anyway.
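
For anyone checking the numbers, the arithmetic above works out as in
this short Python sketch.  The per-region counts are the round figures
quoted in this post, not authoritative repertoire sizes, and the variable
names are made up for the example.

```python
# Sketch only: the character-count arithmetic from the post above.
counts = {
    "Japan": 6_000,
    "Korea": 6_000,
    "China (PRC)": 6_000,
    "Taiwan": 14_000,
}
uncombined_total = sum(counts.values())

unified_total = 20_000                  # figure quoted for the combined set

largest_15_bit_value = 2 ** 15 - 1      # 32767
empty_slots_15_bit = largest_15_bit_value - unified_total
empty_slots_16_bit = (2 ** 16 - 1) - unified_total
```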

yukngo@obelix.gaul.csd.uwo.ca (Cheung Yukngo) (09/11/90)

In article <142145@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
   bts@unx.sas.com (Brian T. Schellenberger) writes:
   > |
   > |The main drawback to Unicode is that files will be twice as big.  But
   > |being able to exchange data without shifting and conversion is a huge
   > |advantage.  Space has even been left in the Unicode address space for
   > |ancient writing systems such as hieroglyphics and cuneiform.
   > 
   > This is by no means necessary.  Even the "large" versions of Kanji and
   > such-like only have 6000 or so characters.  Allowing for overlap, I would
   > be immensely surprised if you couldn't take care of all the ideographic living
   > languages (mostly Chinese, Japanese, and Korean) in 15,000 characters, tops.

   Wait a minute, do you mean it isn't necessary to combine Asian language
   character sets?  There are over 6000 Kanji characters in Japan, about the
   same number in Korea and China (PRC) and around 14000 characters in Taiwan.
   That adds up to at least 32000.  Combinatory efforts done by the Unicode
   people have reduced that total to around 20000.

Well, I don't think 6000 Chinese characters is a reasonable number. It
is probably good enough to cover most of the Chinese surnames.
According to ``An Introduction to Chinese, Japanese and Korean
Computing'' by Huang and Huang, 74,000 is a reasonable number.
Granted, most of the characters are not used. But then you don't purge
a word just because it is not used in daily life---hence the size of
the Oxford English Dictionary.

I don't know anything about SGML. I hope SGML knows something about
Asian Languages.

lee@sq.sq.com (Liam R. E. Quin) (09/11/90)

tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
> [...] Unicode, a 16-bit code set that includes all known languages of the
> world in a single, interchangeable code set.

bts@unx.sas.com (Brian T. Schellenberger) writes:
> [...] still coming in at less than 32,000 characters.  Thus, it can easily
> be fit into 15 bits.  [Then] the 127 most common characters [could fit in] 7
> bits.  Then if the eighth bit is set, we know it starts a 15-bit character.

It might make more sense to mandate that _all_ of the bytes of such an
extended character, except the last, have the top bit set.  Then there is
no limit imposed on the number of characters, although of course some
software might croak on 48-bit characters :-)  This also means that your
files aren't twice as big, or even four times as big, and you can still use
lots of glyphs.

This does mean that algorithms such as Boyer-Moore pattern matching have to
look at at most one extra byte per probe in some cases, to ensure that a
match isn't the last byte of a multi-byte character.  With a shift, you'd
have to look at every byte in the input to remember the current mode.
With four-byte encodings you could have to look at up to three bytes.  So
my scheme is no worse than a plain two-byte encoding in this way either.
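
A minimal Python sketch of the scheme, assuming each byte carries seven
payload bits, most significant group first, with the top bit set on every
byte except the last.  The function names and the code numbers are
hypothetical; the scheme as described above does not fix those details.

```python
# Sketch only: variable-length characters where every byte except the
# last has its top bit set.


def encode_char(code):
    """Encode one non-negative integer character code."""
    groups = [code & 0x7F]
    code >>= 7
    while code:
        groups.append(code & 0x7F)
        code >>= 7
    groups.reverse()                            # most significant group first
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode(data):
    """Decode a byte string into a list of integer character codes."""
    codes, value = [], 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:                     # top bit clear: last byte
            codes.append(value)
            value = 0
    return codes


def starts_character(data, i):
    """True if data[i] begins a character: one look at the previous byte."""
    return i == 0 or not data[i - 1] & 0x80


# Code 0xC1 encodes as 0x81 0x41, so its last byte looks like ASCII 'A'.
data = encode_char(0xC1) + b"A"
false_hit = data.find(b"A")                     # lands inside the 2-byte char
real_hit = data.find(b"A", false_hit + 1)
```

Here starts_character(data, false_hit) rejects the first hit by
inspecting a single preceding byte, which is the extra probe mentioned
above for a single-byte pattern.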

You should also look at the work in progress by ISO committees such as
ISO/IEC JTC 1/SC 18/WG8 on Font Information Interchange (e.g. N1036), and
at ISO/IEC/DIS 9541-1.  They're busily working on ways of transmitting font
and glyph information about the place...

Lee
-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
/text/humour/quote: No such file or directory

ttl@sti.fi (Timo Lehtinen) (09/12/90)

In article <9847@scorn.sco.COM> bobs@sco.COM (Bob Stayton) writes about CALS:

Ok, where does one get a specification of CALS?

Timo

-- 
       ____/ ___   ___/    /		Kivihaantie 8 C 25
      /           /       /		SF-00310 HELSINKI, Finland
   ____  /       /       /	Phone:	+358 0 573 161, +358 49 424 012
  Stream Technologies Inc.	Fax:	+358 0 571 384

jwh@flam.ifs.umich.edu (Jim Howe) (09/13/90)

In article <1990Sep12.145848.9873@sti.fi>, ttl@sti.fi (Timo Lehtinen) writes:
|> 
|> Ok, where does one get a specification of CALS?
|> 
|> Timo

You can get some information via FTP from DURER.CME.NIST.GOV in the /pub/cals
directory.

James W. Howe			   internet: jwh@ifs.umich.edu
University of Michigan             uucp:     uunet!mailrus!ifs.umich.edu!jwh
Ann Arbor, MI   48103-4943