ath@prosys.se (Anders Thulin) (12/12/89)
Hope someone on the net can help me with this:
* What is the difference between the character set described in the
  CHARSET section of an SGML declaration, and that described in the
  SYNTAX section?  How do they interact? 
Neither the ISO document nor the book 'SGML: An Author's Guide'
seems to be very clear about this.  Thanks in advance.
-- 
Anders Thulin, Programsystem AB, Teknikringen 2A, S-583 30 Linkoping, Sweden
ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
ath@prosys.se (Anders Thulin) (08/27/90)
SGML gurus, ahoy!
I've been trying to make sense of the SGML standard. I'm beginning to
think that it shouldn't be done at home, and only attempted by highly
trained professionals :-,
The first problem is with the modes used for recognizing delimiters
(section 9.6.1 in the standard). Here is the background:
1) SGML delimiters are recognized only in certain modes. For example,
   "<?" is recognized as the PIO delimiter only in the CON and DSM
   modes.  In other modes "<?" is not regarded as a delimiter.
2) According to section 9.6.1, CON mode delimiters are recognized in
   the _content_ rule of the SGML grammar, and in the _marked section_
   of marked section declarations that occur in _content_.
   This seems to be reasonably clear and unambiguous.
3) Annex F (not formally a part of the standard) says that the
   document is scanned in CON mode after the prolog.
   This fits reasonably well with the previous rule.
This is my problem.  Section 9.6.1 seems to be very clear on the extent
of CON mode: it's essentially used `inside' the _content_ production
[27].  How, then, are delimiters *outside* _content_ recognized?
After folding some irrelevant(?) productions, I find the syntax for an
SGML document entity begins like this:
  s*, SGML declaration, prolog, s*, start-tag? content, end-tag?
                                               ^CON mode starts here
                                ^or here (Annex F)
These two 'points' are by and large the same: both s* and start-tag
may be empty. No problem so far.
Now, _prolog_ may contain processing instructions, which begin with
the PIO delimiter ("<?"). PIO is recognized only in CON and DSM modes.
And DSM cannot be used in this context - it is used only in
declaration subsets or in marked sections.
Hence, the document must be scanned in CON mode at the *beginning* of
_prolog_.  Using the same argument, CON mode recognition must be used
also at the beginning of the document for the SGML declaration.
This seems to contradict rule 2 above: CON mode recognition *must*
be used also outside of _content_, even though 9.6.1 states that
the CON delimiters are recognized in _content_ and in the
_marked section_ of ... and so forth.
Is there a contradiction here, or have I missed something important?
---
Another problem is the ">" delimiter. According to the table in Figure
3 (page 31) in the standard, ">" is recognized as either MDC or TAGC
in CTX mode. But I find nothing that says why it should be recognized
as one rather than the other.
I would assume that ">" would be recognized as TAGC if the parser was
currently parsing a _tag_, and as MDC in markup declarations. And this
is the behaviour I would expect from any decent implementation. I just
can't see that it is supported by the standard.
So: would a scanner that recognized ">" as MDC when the person who
created the document meant TAGC strictly speaking be in error?
-- 
Anders Thulin       ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
scjones@thor.UUCP (Larry Jones) (08/30/90)
In article <555@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
> I've been trying to make sense of the SGML standard. I'm beginning to
> think that it shouldn't be done at home, and only attempted by highly
> trained professionals :-,

<flame>
It shouldn't be done at all.  The SGML standard is, without a doubt,
the poorest excuse for a language standard I have ever seen.  I can
only assume that it was developed by people with no knowledge of
formal languages who adopted a formalism to make the resulting
document look more technical.  Regular expressions are clearly
inappropriate for describing the language; a phrase structure grammar
would have been much better.  In addition, the standard contains a
number of errors, inconsistencies, ambiguities, and violations of
the formalisms.  The SGML standard is in dire need of interpretation
and revision -- unfortunately, unlike ANSI standards which clearly
specify how to go about submitting comments and requesting
interpretations, ISO standards provide no clues at all.
</flame>

Now, having gotten that out of my system, allow me to try to provide
some answers to your questions.

> This is my problem.  Section 9.6.1 seems to be very clear on the extent
> of CON mode: it's essentially used `inside' the _content_ production
> [27].  How, then, are delimiters *outside* _content_ recognized?

That's what it says, but it's not what it means.  CON mode is simply
the default mode when you're not in any other mode.  So you do indeed
start off at the very beginning of the entire document in CON mode
and return to it whenever a nested mode is ended.

> Another problem is the ">" delimiter. According to the table in Figure
> 3 (page 31) in the standard, ">" is recognized as either MDC or TAGC
> in CTX mode. But I find nothing that says why it should be recognized
> as one rather than the other.

CXT isn't a real mode, it's just a pseudo-mode.  Some delimiters
aren't recognized as such unless they are followed by some particular
context as specified in 9.6.2.  This context may include other
delimiters, so CXT mode is used to indicate that a delimiter needs to
be recognized while verifying the context of another delimiter.  Thus,
there should really be a separate CXT mode for each of the contexts
listed in 9.6.2 (e.g. DCL-CXT, GI-CXT, etc.) and the table in Figure 3
should be expanded appropriately (e.g. TAGC should be GI-CXT [but only
if "SHORTTAG YES" is specified on the SGML declaration], and MDC
should be DCL-CXT and MSE-CXT).
----
Larry Jones                         UUCP: uunet!sdrc!thor!scjones
SDRC                                      scjones@thor.UUCP
2000 Eastman Dr.                    BIX:  ltl
Milford, OH  45150-2789             AT&T: (513) 576-2070
Oh, now YOU'RE going to start in on me TOO, huh? -- Calvin
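The "default mode plus nested modes" reading given above can be sketched in a few lines of Python. This is only an illustration, not text from ISO 8879 or code from any real parser: the `ModeStack` class and the `RECOGNIZED` table are hypothetical, and apart from PIO being recognized in CON and DSM (as stated earlier in the thread), the table entries are assumptions chosen to show how ">" could resolve to TAGC or MDC depending on the construct being parsed.

```python
# Illustrative only: mode names follow the thread; every table entry
# other than PIO-in-CON/DSM is an assumption made for this example.
RECOGNIZED = {
    "CON": {"<?": "PIO"},
    "DSM": {"<?": "PIO"},
    "TAG": {">": "TAGC"},
    "MD": {">": "MDC"},
}


class ModeStack:
    """Tracks nested recognition modes; CON is the default when empty."""

    def __init__(self):
        self._nested = []

    @property
    def current(self):
        # With no nested mode open, the scanner is in CON mode.
        return self._nested[-1] if self._nested else "CON"

    def enter(self, mode):
        self._nested.append(mode)

    def leave(self):
        self._nested.pop()

    def classify(self, text):
        """Return the delimiter role of `text` in the current mode, or None."""
        return RECOGNIZED.get(self.current, {}).get(text)


modes = ModeStack()
print(modes.current, modes.classify("<?"))  # very start of the document
modes.enter("TAG")
print(modes.current, modes.classify(">"))   # inside a tag
modes.leave()
modes.enter("MD")
print(modes.current, modes.classify(">"))   # inside a markup declaration
modes.leave()
print(modes.current, modes.classify(">"))   # back in the default mode
```

Under this reading the same string is never ambiguous at scan time, because the role is looked up in the table of whichever mode is currently open.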
ath@prosys.se (Anders Thulin) (09/02/90)
In article <146@thor.UUCP> scjones@thor.UUCP (Larry Jones) writes:
>In article <555@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
>> I've been trying to make sense of the SGML standard. I'm beginning to
>> think that it shouldn't be done at home, and only attempted by highly
>> trained professionals :-,
>
><flame>
>[ ... ] In addition, the standard contains a
>number of errors, inconsistencies, ambiguities, and violations of
>the formalisms.
></flame>

Glad to hear it ... sort of ... well, ... :-/

Is there any list of these errors, etc floating around?  It would
probably be a help to the readers of comp.text.sgml to have one
around.  Alternatively, can we compile one?

I'm thinking of sending the CON and CTX questions to ISO for formal
interpretation.  A list of bugs in the standard might provide
sufficient reason to revise it.  Or perhaps this has been tried
earlier without success?
-- 
Anders Thulin       ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
tut@cairo.Sun.COM (Bill "Bill" Tuthill) (09/05/90)
In article <582@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
>
> It would probably be a help to the readers of comp.text.sgml to have
> [a list of SGML specification problems] around.

Is there a newsgroup comp.text.sgml somewhere?  Not where I work.

We investigated SGML two years ago, going so far as to send several people
to high-priced conferences.  Our conclusion was that SGML doesn't solve
the main problem we need solved, which is document portability.  Excuse the
baseball metaphor, but SGML has three strikes against it:

1. It originated at IBM, a company not renowned for software prowess.

2. The spec is bloated, bogus, pretentious, and incomprehensible.

3. It doesn't deal with graphics, tables, or equations (is that 5 strikes?).
   It only provides text portability, which ASCII does more elegantly.

My mind is open, but so far SGML proponents have said nothing to make me
change my mind.  We all want document portability, but I don't think we'll
ever see it from SGML.  And if SGML isn't supposed to provide document
portability, just what problem is it supposed to solve?
ath@prosys.se (Anders Thulin) (09/05/90)
In article <141829@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
>3. It doesn't deal with graphics, tables, or equations (is that 5 strikes?).

Graphics, no.  Although graphics should probably be treated as
non-SGML data, which SGML can handle.

Tables it does provide for.  That is, SGML permits you to mark up a
table.  The actual formatting is left to a 'processor', which is
supposed to do what you want in some way.  So, provided that you have
this 'processor' you can do tables.  I understand the word 'processor'
to mean a formatting program, e.g. troff.

Equations are almost the same thing.  SGML permits text (entities?) to
be written in a 'notation', which then is interpreted by some
'notation translator'.  I believe there is a registered BSI notation
standard which could be used.  Alternatively, TeX notation could
probably be used as well.

You are right that SGML doesn't provide these capabilities directly.
But an SGML parser/translator should not choke on them either.  It
should just pass the buck to something that does better.

>   It only provides text portability, which ASCII does more elegantly.

ASCII isn't much of a help if I want to insert a Swedish { in the
text.  (That '{', of course, is an 'a' with an umlaut accent).

Nor is ASCII very good at indicating italics or boldface.  Some
further coding conventions are required if I want to be able to use
these 'signs'.  SGML is one way of making such conventions.

>My mind is open, but so far SGML proponents have said nothing to make me
>change my mind.  We all want document portability, but I don't think we'll
>ever see it from SGML.  And if SGML isn't supposed to provide document
>portability, just what problem is it supposed to solve?

Document markup.  I think SGML solves that, although perhaps not as
neatly as I would wish it did.

The interchange (or portability) part should be solved by the SDIF
standard.  I haven't seen it, so I can't swear to it.

--

It would be interesting to hear from anyone with hands-on experience
with SGML products if they are capable of handling graphics, equations
and things like that, and how well they do.
-- 
Anders Thulin       ath@prosys.se   {uunet,mcsun}!sunic!prosys!ath
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
tut@cairo.Sun.COM (Bill "Bill" Tuthill) (09/06/90)
In article <583@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
>
> > SGML only provides text portability, which ASCII does more elegantly.
>
> ASCII isn't much of a help if I want to insert a Swedish { in the
> text.  (That '{', of course, is an 'a' with an umlaut accent).

I probably should have said ISO 8859-1 (also known as ISO Latin-1), not
ASCII, but I didn't think many people would know what that is.  ISO Latin-1
is an 8-bit codeset identical to ASCII in the lower half, but extended into
the upper half with accent marks and characters required in western Europe.
Anders could produce any Scandinavian character using this extended code set.

However, ISO Latin-1 doesn't solve the character encoding problem outside
western Europe.  There are ISO standards for eastern Europe (ISO 8859-2),
Greece (ISO Greek), and Russia (ISO Cyrillic).  There are also standards
for Japan (JIS) and elsewhere in Asia.  But these code sets are not mutually
compatible; they are not interchangeable.

I believe the ultimate answer is Unicode, a 16-bit code set that
includes all known languages of the world in a single, interchangeable
code set.  Developed by Joe Becker and others at Xerox, Apple, and
elsewhere, Unicode represents a tremendous leap forward.  The main
reason 16 bits is sufficient is that Chinese, Japanese and Korean
pictographs have been combined so as to be complete and correctly
ordered, though not necessarily contiguous.

The main drawback to Unicode is that files will be twice as big.  But
being able to exchange data without shifting and conversion is a huge
advantage.  Space has even been left in the Unicode address space for
ancient writing systems such as hieroglyphics and cuneiform.
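The size and compatibility points above can be tried out with Python's standard codecs. This is a rough sketch under stated assumptions: the sample word is chosen only for illustration, and the modern `utf-16-be` codec is used as a stand-in for a fixed 16-bit code set (it behaves that way only for characters that fit in a single 16-bit unit).

```python
sample = "Linköping"  # sample word with one non-ASCII letter, for illustration

# Plain ASCII has no code for the accented letter.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

latin1 = sample.encode("latin-1")   # one byte per character
wide = sample.encode("utf-16-be")   # two bytes per character, no byte-order mark

# The same byte value stands for different letters in different 8-bit sets.
byte = "ä".encode("latin-1")
as_greek = byte.decode("iso8859_7")

print(ascii_ok, len(latin1), len(wide), byte.hex(), as_greek)
```

The last two values show why the 8-bit sets are not interchangeable: a receiver has to know which set was used before a byte in the upper half can be interpreted.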
emv@math.lsa.umich.edu (Edward Vielmetti) (09/07/90)
In article <141829@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
   Is there a newsgroup comp.text.sgml somewhere?  Not where I work.
comp.text.sgml was created a few minutes ago; you should see
it soon.  This should be the first message in it.
I hope to come up with a more reasonable introduction to the group in
the next few weeks.  In the interim if you're on the internet the
site "sgml.math.lsa.umich.edu" has a directory /pub/sgml with some
stuff in it.
--Ed
Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
bobs (Bob Stayton, Yoyodoc) (09/08/90)
In article <141829@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
>In article <582@helios.prosys.se>, ath@prosys.se (Anders Thulin) writes:
>>
>We investigated SGML two years ago, going so far as to send several people
>to high-priced conferences.  Our conclusion was that SGML doesn't solve
>the main problem we need solved, which is document portability.  Excuse the
>baseball metaphor, but SGML has three strikes against it:
>
>1. It originated at IBM, a company not renowned for software prowess.
>
>2. The spec is bloated, bogus, pretentious, and incomprehensible.
>
>3. It doesn't deal with graphics, tables, or equations (is that 5 strikes?).
>   It only provides text portability, which ASCII does more elegantly.

I won't touch your first two strikes, but the third requires
a response.  The most thorough implementation of SGML to date
appears to be CALS, the emerging Defense Department standard
for documentation.  If you haven't looked at SGML for two years,
then you probably missed CALS.

CALS does address graphics, tables, and equations.  It sort of
cheats on graphics by choosing three formats from among the many,
but hey, why invent another format?  The tables spec has gone
through a major revision and appears to address most of the needs
of describing complex tables.  The equation part was adapted from
ISO TR 9573 written by Anders Berglund.

The latest CALS spec also provides a method for identifying a
default format for each text element.  That way you can have your
text and format it too |).

The baseline tag set of CALS provides actual tag names and
definitions that will become some kind of standard merely by the
weight of the DoD.  Several vendors of publishing software are
providing CALS compliance, which will also help make it a de facto
standard for document interchange.  Of course, I don't know of
anybody that is actually doing this yet, but it is certainly
feasible now.

bobs
Bob Stayton                          425 Encinal Street
Technical Publications               Santa Cruz, CA   95060
The Santa Cruz Operation, Inc.       (408) 425-7222
...!uunet!sco!bobs
/* I don't speak for my company and they don't speak for me. */
bts@unx.sas.com (Brian T. Schellenberger) (09/11/90)
In article <141873@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
|I believe the ultimate answer is Unicode, a 16-bit code set that
|includes all known languages of the world in a single, interchangeable
|code set.  Developed by Joe Becker and others at Xerox, Apple, and
|elsewhere, Unicode represents a tremendous leap forward.  The main
|reason 16 bits is sufficient is that Chinese, Japanese and Korean
|pictographs have been combined so as to be complete and correctly
|ordered, though not necessarily contiguous.
|
|The main drawback to Unicode is that files will be twice as big.  But
|being able to exchange data without shifting and conversion is a huge
|advantage.  Space has even been left in the Unicode address space for
|ancient writing systems such as hieroglyphics and cuneiform.

This is by no means necessary.  Even the "large" versions of Kanji and
such-like only have 6000 or so characters.  Allowing for overlap, I would
be immensely surprised if you couldn't take care of all the ideographic
living languages (mostly Chinese, Japanese, and Korean) in 15,000
characters, tops.

For non-ideographic languages, which tend to have no more than about 40
characters tops in no more than two variations (e.g., capital and lowercase
for Roman; Katakana and Hiragana for Japanese), making for fewer than 100
characters total, we can accommodate more than 150 non-overlapping
languages in the same space.

This leaves 2,000 slots for "special" symbols and punctuation, while still
coming in at less than 32,000 characters.  Thus, it can easily be fit into
15 bits.

That way, we can have the 127 or so most common characters encoded in
7 bits.  Then if the eighth bit is set, we know it starts a 15-bit
character.

The obvious starting place is to make the 7-bit code be the current ASCII
code.  I suspect that this is close to the set of most commonly used
characters world-wide, but it need not be the choice.

If such a scheme is used, most files will be only a little bit longer than
they are now, and (assuming ASCII is used as the base), computer code will
increase not one whit.  Neither will English text.  Since English is the
lingua franca of the world these days, this will include a lot of
international text, and not just English.

Finally, such a scheme is not all that difficult to work with.  It is
certainly easier than shared 7- and 8-bit codes that include "shift"
characters.
-- 
-- Brian, the Man from Babble-on.		bts@unx.sas.com
-- (Brian Schellenberger)
"And when the votes were cast, the winner was . . . Mister James K. Polk,
Napoleon of the stump."  -- THEY MIGHT BE GIANTS.
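The scheme proposed above (7-bit codes stored in one byte, anything larger stored in two bytes with the top bit of the first byte set) can be sketched roughly as follows. The function names and the code numbers used are hypothetical; no real character assignment is implied, and the byte layout of the two-byte form is one possible choice rather than anything specified in the post.

```python
def encode(codes):
    """Pack integer character codes (0..32767) into bytes."""
    out = bytearray()
    for code in codes:
        if 0 <= code < 0x80:
            out.append(code)                  # common character: one byte
        elif code < 0x8000:
            out.append(0x80 | (code >> 8))    # flag bit plus the high 7 bits
            out.append(code & 0xFF)           # the low 8 bits
        else:
            raise ValueError("code does not fit in 15 bits")
    return bytes(out)


def decode(data):
    """Unpack bytes produced by encode() back into character codes."""
    codes = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            codes.append(first)
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            codes.append(((first & 0x7F) << 8) | data[i + 1])
            i += 2
    return codes


plain = list(b"SGML")                 # ASCII text stays one byte per character
mixed = plain + [0x1234, 0x7FFF]      # two made-up 15-bit codes appended
packed = encode(mixed)
print(len(encode(plain)), len(packed), decode(packed) == mixed)
```

One property worth noting in this layout: the second byte of a two-byte character can take any value, so a reader that starts in the middle of a file cannot tell from a single byte whether it is looking at a character boundary.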
tut@cairo.Sun.COM (Bill "Bill" Tuthill) (09/11/90)
bts@unx.sas.com (Brian T. Schellenberger) writes:
> |
> |The main drawback to Unicode is that files will be twice as big.  But
> |being able to exchange data without shifting and conversion is a huge
> |advantage.  Space has even been left in the Unicode address space for
> |ancient writing systems such as hieroglyphics and cuneiform.
>
> This is by no means necessary.  Even the "large" versions of Kanji and
> such-like only have 6000 or so characters.  Allowing for overlap, I would
> be immensely surprised if you couldn't take care of all the ideographic
> living languages (mostly Chinese, Japanese, and Korean) in 15,000
> characters, tops.

Wait a minute, do you mean it isn't necessary to combine Asian language
character sets?  There are over 6000 Kanji characters in Japan, about the
same number in Korea and China (PRC) and around 14000 characters in Taiwan.
That adds up to at least 32000.  Combinatory efforts done by the Unicode
people have reduced that total to around 20000.

Or do you mean 16 bits aren't necessary, 15 are enough?  That means there
would only be 12768 empty slots after covering the Asian languages, which
is almost certainly not enough.  I really don't think the world needs yet
another shift encoding algorithm, anyway.
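A quick check of the arithmetic in the post above, as a sketch only: the counts are the rough figures quoted in the thread, not authoritative repertoire sizes, and the variable names are made up for the example.

```python
# Approximate per-region ideograph counts as quoted in the post.
separate = {"Japan": 6000, "Korea": 6000, "China (PRC)": 6000, "Taiwan": 14000}
unified = 20000  # size after combining the sets, per the post

total_separate = sum(separate.values())
free_in_15_bits = 2**15 - unified   # slots left in a 15-bit code space
free_in_16_bits = 2**16 - unified   # slots left in a 16-bit code space

print(total_separate, free_in_15_bits, free_in_16_bits)
```

The comparison that matters for the argument is the gap between the last two numbers: one extra bit more than triples the room left for everything that is not an ideograph.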
yukngo@obelix.gaul.csd.uwo.ca (Cheung Yukngo) (09/11/90)
In article <142145@sun.Eng.Sun.COM> tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
   bts@unx.sas.com (Brian T. Schellenberger) writes:
   > |
   > |The main drawback to Unicode is that files will be twice as big.  But
   > |being able to exchange data without shifting and conversion is a huge
   > |advantage.  Space has even been left in the Unicode address space for
   > |ancient writing systems such as hieroglyphics and cuneiform.
   >
   > This is by no means necessary.  Even the "large" versions of Kanji and
   > such-like only have 6000 or so characters.  Allowing for overlap, I would
   > be immensely surprised if you couldn't take care of all the ideographic
   > living languages (mostly Chinese, Japanese, and Korean) in 15,000
   > characters, tops.

   Wait a minute, do you mean it isn't necessary to combine Asian language
   character sets?  There are over 6000 Kanji characters in Japan, about the
   same number in Korea and China (PRC) and around 14000 characters in Taiwan.
   That adds up to at least 32000.  Combinatory efforts done by the Unicode
   people have reduced that total to around 20000.

Well, I don't think 6000 Chinese characters is a reasonable amount.  It
is probably good enough to cover most of the Chinese surnames.
According to ``An Introduction to Chinese, Japanese and Korean
Computing'' by Huang and Huang, 74,000 is a reasonable amount.
Granted, most of the characters are not used.  But then you don't purge
a word just because it is not used in daily life; hence the size of the
Oxford English Dictionary.

I don't know anything about SGML.  I hope SGML knows something about
Asian Languages.
lee@sq.sq.com (Liam R. E. Quin) (09/11/90)
tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
> [...] Unicode, a 16-bit code set that includes all known languages of the
> world in a single, interchangeable code set.

bts@unx.sas.com (Brian T. Schellenberger) writes:
> [...] still coming in at less than 32,000 characters.  Thus, it can easily
> be fit into 15 bits.  [Then] the 127 most common characters [could fit in] 7
> bits.  Then if the eighth bit is set, we know it starts a 15-bit character.

It might make more sense to mandate that _all_ of the bytes of such an
extended character, except the last, have the top bit set.  Then there
is no limit imposed on the number of characters, although of course
some software might croak on 48-bit characters :-)

This also means that your files aren't twice as big, or even four times
as big, and you can still use lots of glyphs.

This does mean that algorithms such as Boyer-Moore pattern matching have
to look at at most one extra byte per probe in some cases, to ensure that
a match isn't the last byte of a multi-byte character.  With a shift,
you'd have to look at every byte in the input to remember the current
mode.  With four-byte encodings you could have to look at up to three
bytes.  So my scheme is no worse than a plain two-byte encoding in this
way either.

You should also look at the work in progress by ISO committees such as
ISO/IEC JTC 1/SC 18/WG8 on Font Information Interchange (e.g. N1036), and
at ISO/IEC/DIS 9541-1.  They're busily working on ways of transmitting
font and glyph information about the place...

Lee
-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
/text/humour/quote: No such file or directory
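The variant suggested above (every byte of a character except the last has its top bit set) might look roughly like this. It is a sketch under assumptions not stated in the post: each byte is taken to carry 7 bits of the code, most significant group first, and the function names are hypothetical.

```python
def encode_char(code):
    """Encode one non-negative character code as 7-bit groups."""
    if code < 0:
        raise ValueError("character codes must be non-negative")
    groups = [code & 0x7F]                    # last byte: top bit clear
    code >>= 7
    while code:
        groups.append(0x80 | (code & 0x7F))   # earlier bytes: top bit set
        code >>= 7
    return bytes(reversed(groups))


def decode(data):
    """Decode a byte string made of encode_char() outputs."""
    codes, value, pending = [], 0, False
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        pending = byte >= 0x80
        if not pending:                       # top bit clear: character ends
            codes.append(value)
            value = 0
    if pending:
        raise ValueError("input ends inside a character")
    return codes


def starts_char(data, index):
    """True if data[index] is the first byte of a character.

    Only the preceding byte needs to be examined, which is the
    'one extra byte per probe' cost mentioned in the post.
    """
    return index == 0 or data[index - 1] < 0x80


text = b"".join(encode_char(c) for c in [0x41, 0x80, 0x12345])
print(len(text), decode(text), [starts_char(text, i) for i in range(len(text))])
```

Because the length is carried by the continuation bits, nothing in this layout caps the size of a code, and plain 7-bit text is stored byte for byte as before.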
ttl@sti.fi (Timo Lehtinen) (09/12/90)
In article <9847@scorn.sco.COM> bobs@sco.COM (Bob Stayton) writes about CALS:
Ok, where does one get a specification of CALS ?
Timo
-- 
       ____/ ___   ___/    /		Kivihaantie 8 C 25
      /           /       /		SF-00310 HELSINKI, Finland
   ____  /       /       /	Phone:	+358 0 573 161, +358 49 424 012
  Stream Technologies Inc.	Fax:	+358 0 571 384
jwh@flam.ifs.umich.edu (Jim Howe) (09/13/90)
In article <1990Sep12.145848.9873@sti.fi>, ttl@sti.fi (Timo Lehtinen) writes:
|>
|> Ok, where does one get a specification of CALS ?
|>
|> Timo

You can get some information via FTP from DURER.CME.NIST.GOV in
the /pub/cals directory.

James W. Howe                   internet:  jwh@ifs.umich.edu
University of Michigan          uucp:      uunet!mailrus!ifs.umich.edu!jwh
Ann Arbor, MI  48103-4943