dns@sq.uucp (David Slocombe) (07/29/88)
Bill Tuthill's article voicing objections to Standard Generalized Markup Language (SGML), an ISO (and ANSI) standard for documents, gives me an excellent opportunity to set a lot of issues straight.

Bill's scathing comments on SGML appear to be based largely on Charles Goldfarb's June 1981 paper in "SIGPLAN Notices". The language was refined between 1981 and 1986, when it became a standard (ISO 8879), and Dr. Goldfarb and everyone else associated with SGML have learned a great deal since 1981 about how not to be misinterpreted. The standard has been amended once since 1986, and a second amendment is currently being balloted.

I should state up front that we at SoftQuad Inc. have a strong interest in the success of SGML as a standard -- we sell three products so far supporting the use of SGML, and more are on the way. The reason we committed ourselves to the support and promotion of SGML was that our combined experience in the publishing industry convinced us that SGML was a long-awaited solution to real problems, a solution whose time had come.

* * * * * * * * * * * * * *

1. The SGML philosophy:

The following might form a manifesto for the SGML approach:

- Documents are Data with Structure: A document is data in the same sense that an accounting file is data. The data in a document is *structured*, in the same sense that an accounting file has records and fields. Document structure is typically hierarchical.
- Logical Content is Independent of Presentation: The visual presentation of the structured data in a document may take many forms depending on the application. The document's logical structure and text should be distinguished from any visual-style-specific information so that the document may be readily formatted according to a style that is most suited to the application of the moment, and can be loaded conveniently into databases or otherwise processed.
- Document Classes Need to be Defined: For the sake of both database applications and graphic arts applications, it is desirable to define classes of document-structures and the syntax which encodes them, and to assign documents to the appropriate class (or to create them in the first place so that they belong to a specified class). The definition of a class of document-structures and the corresponding syntax amounts to a formal grammar, which the SGML world calls a Document Type Definition (or DTD).
- The DTD Goes With the Document: When a document is "interchanged", it is considered to consist of the appropriate DTD (or a reference to it) followed by the document itself (the "document-instance"), which must have a structure (and be encoded) according to the rules found in the DTD.
- Document Access is Through a Parser: An "SGML Parser" is an application program which reads the DTD and then parses the document-instance according to the rules found in the DTD. Such a parser will "validate" the document and report an error if the document does not actually parse correctly according to the DTD. This parser can also be used as the front-end to a database-loading or text-formatting program or tools like Writer's Workbench.

* * * * * * * * * * * * * *

2. SGML is NOT a text formatting language:

Because I am a troff-hack and it is easy for me to think of suitable examples from the troff language, and because I can reasonably assume that most readers of this group have at least a passing acquaintance with troff, I will be making comparisons to troff in what follows.

But please don't let that lead you into thinking that SGML is a way of specifying the way a document is to *look* on some medium. It isn't.

In fact, people new to SGML have to fight continually against the tendency to express document structures in terms of specific *appearance* instead of *information content*. A good example would be a title in a bibliographic reference.
It is likely to be typeset in italic type, so people have a tendency at first to set up an element called <italic> and use that for tagging this title (as well as emphasized text). But anyone who wants to search the document for a title reference is not interested in other things that might be in italic type, and does not care about the typeface actually used for titles. The element-type and its form of presentation are simply not in the same universe. The element is more appropriately called <title>.

(Minor point: there *is* an "escape hatch" in SGML: it is called the "processing instruction", which has -- in the reference concrete syntax -- the form <? ... > where the three dots can be any garbage. An SGML parser will not attempt to parse this, but will pass it on. This mechanism is sometimes used to pass formatter-specific stuff through to troff or whatever, but it can be misused.)

* * * * * * * * * * * * * * *

3. SGML allows flexibility without compromising portability:

This is a critical point. Although SGML *allows* (but does not require, of course!) variations in document structure and encoding that are simply mind-boggling, this flexibility does *NOT* reduce the portability of an SGML document.

This is because an SGML document logically begins with its Document Type Declaration, which must define the allowable structural elements and other SGML syntax for the following document-instance. The DTD shows how to interpret the information contained in the document in an abstract manner independent of specific (concrete) coding. With the SGML parser, the document can always be reduced to only the *information* which its author intended it to contain, and then re-encoded in an alternative manner.

There is a default "reference concrete syntax" which any SGML parser must assume on startup, and if the document itself uses a variant concrete syntax then the DTD must define that (using the reference concrete syntax to do so, of course).
This is analogous to troff's ".ec", ".cc" and ".c2" requests for changing the escape character and the two control characters. Note that troff also starts off with default values for these characters so that it can interpret any troff document. In SGML the comparable main defaults would be "<" and "&". However, in SGML there are many, many other characters or strings which are magical following "<" or "&" and which have defaults that can be redefined. Even the character-sets used for text, for non-text (perhaps binary) data, and for SGML tags are definable. Finally, the meaning of record boundaries (e.g. the newline) is subject to fine control.

But there are more interesting methods for changing the form in which a document is encoded. SGML has "markup-minimization", "shortrefs" and "datatags", which provide means to make documents with little or no visible structural coding acceptable to an SGML parser. People have used Sobemap's parser, for example, to parse telex messages containing bank transactions and to churn out database updates *plus* troff instructions to print paper records of the transactions. The telex messages were in natural-language French.

(This kind of thing relies on and takes advantage of the fact that in most application-areas information is already structured and that structure is already encoded in the document even if it is not obvious to anyone who is not intimate with that application. Writing an SGML DTD for a document encoded in such a way, which requires familiarity with the implicit structure and its encoding, makes that structure apparent and accessible to anyone with an SGML parser.)

This flexibility is just fine as far as portability is concerned because all the variations from plain-vanilla must be defined in the DTD, which in turn is written in a language guaranteed to be understood by the parser.
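
To make the point about variant concrete syntaxes a little more tangible, here is a toy sketch in Python. It is emphatically NOT an SGML parser: it knows nothing of DTDs, attributes, minimization or entities. The function names (parse, encode), the event tuples, and the square-bracket variant syntax are all invented for illustration. It only shows that once the delimiters are declared, two different encodings reduce to the same abstract information, which can then be re-encoded in plain-vanilla form.

```python
import re


def parse(text, stago="<", etago="</", tagc=">"):
    """Reduce a tagged string to a list of abstract events.

    stago, etago and tagc stand for the start-tag-open, end-tag-open and
    tag-close delimiters. The defaults imitate the reference concrete
    syntax; any other values play the role of a declared variant.
    """
    # The end-tag opener is tried first because "</" begins with "<".
    pattern = re.compile(
        "(?P<end>%s)|(?P<start>%s)" % (re.escape(etago), re.escape(stago))
    )
    events, pos = [], 0
    while True:
        m = pattern.search(text, pos)
        if m is None:
            break
        if m.start() > pos:
            events.append(("text", text[pos:m.start()]))
        close = text.index(tagc, m.end())  # raises ValueError if unclosed
        kind = "end" if m.group("end") else "start"
        events.append((kind, text[m.end():close]))
        pos = close + len(tagc)
    if pos < len(text):
        events.append(("text", text[pos:]))
    return events


def encode(events, stago="<", etago="</", tagc=">"):
    """Write abstract events back out using the given delimiters."""
    out = []
    for kind, value in events:
        if kind == "text":
            out.append(value)
        elif kind == "start":
            out.append(stago + value + tagc)
        else:
            out.append(etago + value + tagc)
    return "".join(out)


if __name__ == "__main__":
    variant = "[title]Moby Dick[/title]"
    # Same information, re-encoded with the default delimiters.
    print(encode(parse(variant, "[", "[/", "]")))
```

The design choice worth noticing is that the delimiters are parameters with defaults, much as troff starts with default control characters that a document may change. Nothing about the information content depends on which delimiters were used.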
I expect that, in practice, most documents in the future will be coded in something approaching plain-vanilla, at least once they get "into the system". From a data-processing point of view that is usually what you want. But two factors forced the SGML Committee to provide all the flexibility:

- Somewhere, somewhen, humans originate the documents and they don't want to work any harder than necessary. I'm sure there were Production Managers of typesetting departments on the Committee who complained, "But that means N percent more keystrokes! Keystrokes cost MONEY!" When I worked for a newspaper I heard this all the time.
- There are billions of documents already out there (ask the DOD) that have to be converted to SGML. If the current form of those documents *can* actually be parsed unambiguously without change, then there is at least a good chance an SGML DTD can be written to specify that parsing.

Often the first thing a site should do with a document encoded with all sorts of weird datatags and shortrefs and variant concrete syntaxes is to parse it (preceded, as it must be, by its DTD) with a full-blown SGML parser like Sobemap's, in order to turn it into a straightforward unfancy document with a much simpler DTD. After that, only a very simple parser will be needed to process and re-process this document. Absolutely no information is lost by this translation: you are just getting rid of various forms of data-compression or alternative notation.

* * * * * * * * * * * * * * * * *

4. SGML handles graphics, tables, and equations:

Graphics is easy: ISO already has standards for graphics -- more than one! So all the SGML Committee had to do was make sure that those other standards could be used within SGML. Basically they "punted the problem" rather than get into a turf war with other ISO and ANSI committees. The SGML standard effectively #includes the existing (and future) graphics standards.

Tables and equations are simply particular substructures of documents.
So SGML itself does not *need* to have any special facilities for these kinds of material. I have seen a number of different sub-DTDs to handle tables or equations. One of them unabashedly specified something close to eqn's style of input, partly to illustrate the power of SGML's data-compression tools mentioned in the previous section: shortrefs and datatags and markup-minimization. Right now, people are experimenting to see just how far they can push SGML's flexibility -- "Look Ma, no hands!" But simple approaches will probably be best in the end.

The AAP Electronic Manuscript Project has a booklet on each of "tabular matter" and "mathematical formulas" describing *their* particular sub-DTDs for tables and equations respectively. (AAP is the Association of American Publishers.)

* * * * * * * * * * * * * * * * *

5. SGML has an elegant symbolic reference mechanism:

In troff you define a string with ".ds x ..." and reference it with "\*x". For a macro you say ".de XX", and reference it with ".XX". Special symbols for which there is no single ASCII character are referenced with "\(xx". If you want to #include a file you do ".so filename". If you want to get some information from the system or another program you use ".sy commandline".

In SGML there is a single mechanism to do all these things: the Entity. You refer to an Entity in the same way no matter which kind of thing it is. It is only when you define the Entity that it takes on the character of a string, or special symbol, or macro, or file-to-be-included, or system-generated data.

Note that this is of particular significance for portability when the writer is using many uncommon symbols. At one site a symbol may be on a font. At another site it has to be constructed, as in eqnchar, out of other symbols. At yet another site it must be slurped in by the typesetting program in the form of a bitmap. The writer cannot know which technique will be needed at some arbitrary time in the future. With Entities he doesn't need to know.
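
The idea that the reference stays the same while the definition decides what kind of thing is delivered can be sketched in Python. This is a toy under stated assumptions, not SGML's entity declaration syntax: the expand function, the table layout, and the three kinds ("text", "file", "system") are invented for illustration. Only the "&name;" reference form is taken from the reference concrete syntax.

```python
import re
from pathlib import Path

# "&name;" is the entity reference form in the reference concrete syntax.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities):
    """Replace each &name; in text using the entity table.

    entities maps a name to one of (illustrative assumption):
      ("text", string)      literal replacement text
      ("file", path)        contents of a file to be included
      ("system", callable)  data produced on demand by a function
    """
    def replace(match):
        name = match.group(1)
        if name not in entities:
            raise KeyError("undefined entity: " + name)
        kind, value = entities[name]
        if kind == "text":
            return value
        if kind == "file":
            return Path(value).read_text()
        if kind == "system":
            return value()
        raise ValueError("unknown entity kind: " + kind)

    return ENTITY_REF.sub(replace, text)


if __name__ == "__main__":
    source = "The angle &alpha; is small."
    # Hypothetical site A: the symbol is on a font, so a troff escape will do.
    site_a = {"alpha": ("text", "\\(*a")}
    # Hypothetical site B: the symbol is generated by some local routine.
    site_b = {"alpha": ("system", lambda: "[bitmap: alpha]")}
    print(expand(source, site_a))
    print(expand(source, site_b))
```

The writer's text ("&alpha;") is identical at both hypothetical sites. Only the entity table differs, which is exactly why the writer does not need to know which technique a given site will use.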
* * * * * * * * * * * * * * * *

To sum up: a good standard -- one that is really *needed* -- is one that, by imposing a discipline, creates freedom to do things that couldn't be done reliably (or at all) before.

The SGML standard means that people can write or buy SGML parsing technology and then can be confident that their systems can understand documents written to meet the SGML standard no matter where (or when) they were generated. This is the first critical condition to be met if you are going to be able to format these documents or put them into databases.

There is room for many SGML Document Type Definitions (some small, some very very large). However, if your organization wants to guarantee that a document created in one department is processible without change in another department, you probably want to create, centrally, one fairly large and complex DTD which is adequate to the needs of all parts of the organization. Then some groups within the organization may use subsets of this DTD, but you know that documents created with one subset can be combined with documents created with another subset and that they will be supported by your typesetting system (assuming you have figured out how to format every aspect of the jumbo DTD). This is equivalent to the data definitions of a corporate management information system, with "views" which are subsets of the overall data definition set.

SGML is working for people now: last year, McGraw-Hill took a huge collection of SGML-encoded articles and produced a 16-volume Encyclopedia of Science and Technology. Simultaneously, they published a CD-ROM containing the same materials and also made the textual component available on-line via the Westlaw database. The same source files were used to create all three editions. This project could not have taken place without an SGML approach.
The main problem with SGML right now is that the Standard is difficult to understand (aren't all standards like this?), and books explaining SGML are still "in preparation". Time will take care of this. In the meantime the AAP Electronic Manuscript Project books are the best that are available. Also, the Graphic Communications Association runs courses on SGML.

UPCOMING CONFERENCES (where you'd hear lots about SGML):

    TechDoc XII, August 23 to 26, San Diego, CA
    SGML for the Desktop, November 16 to 18, Boston, MA

--------------------------------------------------------------------
David Slocombe                      uucp: {utai,utzoo}!sq!dns
Vice-Pres., R&D
SoftQuad Inc.                       Internet: dns@sq.com
720 Spadina Avenue
Toronto, Ontario, Canada M5S 2T9    (416) 963-8337