E.Ireland@massey.ac.nz (Evan Ireland) (02/07/91)
Hi,

I recently saw mention of SGML in comp.emacs, but the poster of that article could not provide me with a reference, and suggested that I ask here.

(1) ISO standard number and ordering information (I was told there is a related ISO standard).
(2) Reasonably comprehensive information that could be emailed to me.
(3) Other references.

Thanks in advance,

-------------------------------------------------------------------------------
E.Ireland@massey.ac.nz   Evan Ireland, School of Information Sciences,
(063) 69-099 x8541       Massey University, Palmerston North, New Zealand.
enag@ifi.uio.no (Erik Naggum) (02/08/91)
Anybody want to take up the task of writing an authoritative answer to this question? Perhaps we need a comp.text.sgml FAQ, just like some other groups?

Evan, this is of course not to deride your question, but it seems that it's very difficult to write a good summary of what SGML is, and spread it around sufficiently. I'll try, after a brief pause (a.k.a. sleep). Stay tuned.

To answer your questions:

(1) ISO 8879:1986 + ISO 8879A1 (year forgotten). You can order ISO standards from your national standards association or body. I have no idea who this is for New Zealand. However, see below (particularly the lines with --> on them).

(2) Forthcoming...

(3) SGML: The/An Author's Guide, Martin Bryan, Addison Wesley, 1987; a book which tries to encompass everything in the standard and doesn't succeed very well. I wouldn't recommend this book, but it was one of the earliest out, and many people have read it. Its fixation on keeping the language and acronymitis of the standard itself is its main drawback.

Practical SGML, Eric van Herwijnen (sp?), Kluwer, 1990; a more practical book (!) with a more limited scope and much more successful than the above at meeting its stated goal. There are lots of small, annoying inaccuracies in it (I'm preparing a bug list), but you get a really good grip on SGML through it. It's quite easy reading, comes in three parts for escalating user needs (SGML user, SGML application, SGML parser), but is a little "simplistic" on some of the difficult points. It skips the things you don't (or won't) need to know, and this is very important with an unruly beast like SGML.

The SGML Handbook, Charles F. Goldfarb, Oxford, 1991; an incredibly thorough book on the standard itself. I've found one very minor (but consistent) cosmetic bug, and a reference to H.400 which should have been X.400. Those are the bugs. Now the reasons I've fallen in love with the book: ISO 8879 is a jungle.
ISO 8879 has an amendment, no more than a diff in Unix parlance, which you have to consult to be certain of what "the" standard says. ISO 8879 has no index and no cross-references, and it separates the reference concrete syntax from the syntactic variables to such an extent that you need a blackboard on which to scrawl "MDO = '<!', MSC = ']]', ...", or numerous copies of pages from the standard to hang on your wall and squint at, and you end up cluttering the standard with Post-It notes all over the place.

CFG has done all that for you, with an expert hand. His book is the first I've seen with "paper hypertext", having "buttons" with page numbers and syntax production numbers on them, which makes it possible to look them up easily. It also has small boxes with the reference concrete syntax representation of the delimiter roles in the right margin whenever the roles appear. The book contains a heavily annotated, complete ISO 8879 plus amendment, plus suggested improvements. The text is full of wit and humor, which is badly needed with ISO standards in general. You'll find what you're looking for in this book, if your interest lies with the standard itself. It is no beginner's book, however.

--> Several people have predicted that The SGML Handbook will replace
--> the ISO standard as the reference work of choice. I fully support
--> this prediction.

The bibliographies in Eric van Herwijnen's book leave little to be desired (except, perhaps, issue numbers for <TAG>: The SGML Newsletter), and the section on SGML sources in The SGML Handbook is nearly complete. It's some time since I read Martin Bryan's book, but I was not that impressed with its bibliography, either.

I hope this has been of some help. Your request for "reasonably comprehensive information" I shall attempt to fulfill by the end of the day (T minus 21 hours, and counting...).

--
[Erik Naggum]   <enag@ifi.uio.no>
Naggum Software, Oslo, Norway   <erik@naggum.uu.no>
ct2g@plaid.acc.Virginia.EDU (Cheng Tang) (02/13/91)
I am trying to find anonymous ftp sites with software that claims to adhere to SGML, or that offers the same functionality as SGML. I do not know how ambiguous my request is, but am willing to learn. Any thoughts/comments/answers are welcome.

Thanks,
---Cheng
enag@ifi.uio.no (Erik Naggum) (02/19/91)
Hi, all, and sorry for the delay.

The following is an unfinished introductory article that I don't have the time to finish quite as soon as I had hoped. Please comment on it by sending me a message at <erik@naggum.uu.no>. Thanks.

Evan, I hope this will give you a flavor, if not the Whole Answer.

[Erik]

What is SGML?

SGML is the abbreviation for Standard Generalized Markup Language. If you know what Generalized Markup is and what Markup Languages are and do, the only remaining element is Standard. SGML is International Standard 8879 as issued by ISO - the International Organization for Standardization - in 1986 and later amended.

Many people apparently know what Generalized Markup and Markup Language mean, since SGML quickly became popular and is rumored to be one of ISO's best selling (or fastest selling, according to other rumors) standards. However, it seems even more people do not know what generalized markup is or what markup languages are or do, but nonetheless, they've heard that SGML is going to have an impact on how they do their work. Those who work with SGML certainly hope so.

In this article I will try to explain some of the key concepts that led to the birth of SGML and how they manifest themselves in the language. In addition, some key concepts of the language are presented for the casual observer. Later articles will deal with these issues in more detail.

Markup

The specification of typefaces, sizes, weights, spacing, indentation, justification, etc., for a given text before it is printed (or typeset) is called marking up the text. Traditionally, this has been done by writing small cryptic signs in the margin of the copy (text), much like proofreaders' marks, for those who have seen them. For instance, the abbreviation "SGML" could be set in small capitals, keywords could be set in italics, paragraph titles in boldface, a point or two smaller than running text.
Examples could be set in some monospaced font such as Courier, quotes in the same typeface as running text but two points smaller, on a narrower line spacing, with a shorter line length and a little indented from the left margin. All of these rather detailed specifications were instructions to the typesetter. Incidentally, we now call such markup "specific markup".

Authors have generally not had to know anything too specific about typography. Most of them used typewriters or had someone type their manuscript from dictated tapes, if they didn't write it out in handwriting. However, they had to communicate the structure of the text to the person doing the markup, even though that person added many choices of his own. A minimum would be to separate paragraphs from each other, the headings from the text, examples from running text, etc. Some of the more obnoxious authors no doubt had specific ideas on how it should look, as well.

As one comes to learn of the art of printing, one becomes awestruck (or should, anyhow) by the magnitude of the possibilities and choices. Typographers in the field of advertising exploit this to the fullest, but if you produced a book with the same typography as in ads, it would look horrible, and be unreadable. Typographers deal with the aesthetics of the printed page, while the copywriters concentrate on the message. Similarly, authors concentrate on the content of books and articles, deferring typographical decisions to people who master that part of the process. This is not always the case, as we shall see, and some software for popular hardware platforms deliberately confuses the issues of content and style.

Information and presentation

Ironically, in a world of ever-increasing specialization, it's important to distinguish between two aspects of any communicated piece of information that few would have thought could be confused.
When you pick up a newspaper or a book, you don't do so to admire the clean typography, the beauty of the typeface, or the orderly layout (unless, of course, you've spent some time under the wing of an old-school typographer, and write articles about what SGML is); you pick it up because the information looks interesting and worth reading. Many would argue that people are unlikely to do so if a whole 100-page newspaper consisted of page upon page of 8 columns of text packed together with no change in type, no superfluous spacing, no paragraph breaks, etc. (The Frankfurter Allgemeine to the contrary notwithstanding.)

On the printed page, information and presentation work together to bring the author's ideas and observations to the reader, as they do in a lecture, where a live lecturer with a rich voice and varied diction can do wonders for otherwise heavy material.

In the result, information and presentation invariably coexist, and it would indeed be difficult to imagine either without the other. This does not imply that the two are not radically different processes. The basic tenet is that information must have some form of presentation, but it can be any form of presentation, and that a form of presentation must be applied to some information, but it can be any information. Thus we deal with easily abstractable concepts. Let's take them each in their turn.

Information usually has some structure. Books, e.g., may have parts, chapters, sections, and paragraphs. They may have front matter, such as a foreword, a preface, table of contents, etc., and back matter such as bibliography, author biography, index, etc. Further, paragraphs can't contain chapters or a table of contents, but may contain other tables, figures, quotes, footnotes, etc. This forms a hierarchy of elements. (There are also non-hierarchical elements, such as index entries, references to chapters, paragraphs, tables, figures, and so on.)
This structure is almost always in the head of the author, and not put down on paper in detail. This unfortunately lends itself to inconsistencies, in particular where special conventions are used to indicate different parts of the structure.

A style, or consistent form of presentation, has a close relationship to the structure of the information to which it applies. Titles of chapters look the same, as do figure and table captions. Footnotes obey a certain regularity, etc. It should be evident that a style associates each element in the structure with a particular set of typographic instructions or specific markup as mentioned above.

Because of the relationship between structure and style, we talk of text which only carries indications of the structure as being marked up with "generalized markup". The implication is that we can choose any style for the structure in question, and that the author won't have to make any changes to the document if he (or the typographer) changes his mind about, say, footnote presentations, or the particular size of the typeface chosen for chapter titles. It should even be possible to produce the table of contents directly from the document by listing the titles of the chapters, sections, paragraphs, etc., and ignoring all other content. Similarly, of course, with lists of figures and tables.

It should be quite obvious that if all a document contains is specific markup, it would be impossible, without vast amounts of human intervention and artificial intelligence, to figure out mechanically what is a chapter title and what is a figure caption. To the reader, this is of course very simple, but we must remember that this is courtesy of a long tradition among typographers who have managed to present it all to us in a clean and consistent fashion. Humans also recognize what's a title and what's content by the form of the text therein, which a computer program would be hard pressed to do.
Physical arrangement on the page also helps, especially when more novel or experimental layout is used to represent keywords in the margin, titles in boxes on one side of the page, and so on.

The separation of information and presentation is thus most helpful for machine processing of documents. The argument is that most new documents are written with the aid of computers and that even for small amounts of machine processing, consistency is a virtue. It is cruel to demand of the user that he maintain the consistency with no aid from the otherwise useful computer. That is why languages are designed to help the author inform the computer of his intentions, so that the computer can both check consistency and be of the most help. After all, such chores are what computers are well suited to perform.

Markup Languages

Before discussing generalized markup languages, a quick presentation of markup languages in general is called for.

Instances of markup languages abound, and many have seen WordStar's dot-commands, TeX's backslash-introduced magic, and the intertwined text and control lines of nroff and troff. Less obviously, since only the editor is supposed to see them, the special codes used by the soi-disant WYSIWYG ("What You See Is What You Get") editors are also very simple markup languages. Users are "relieved" from knowing what's going on, and are thus not able to use the document in anything but the environment in which it was created.

Personally, I regard such closed environments as an abomination, and want nothing of them. Text and documents are much too important to be left to "integrated packages".
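
To make the earlier point about machine processing concrete, here is a minimal sketch in Python. It assumes a made-up, much simplified tag notation (`<chapter>`, `<section>`, `<title>`, `<p>`) and a hypothetical helper named `table_of_contents`; it is not real SGML parsing (no DTD, no tag minimization). It only illustrates that a table of contents can be pulled mechanically from text that carries nothing but structural markup, outside whatever environment the text was written in.

```python
import re

# Hypothetical sample: only structural tags, no typographic instructions.
DOCUMENT = """
<chapter><title>Markup</title>
<p>Marking up text used to mean writing signs in the margin.</p>
<section><title>Specific markup</title>
<p>Instructions to the typesetter.</p>
</section>
</chapter>
<chapter><title>Information and presentation</title>
<p>Information usually has some structure.</p>
</chapter>
"""

# Matches an opening chapter/section tag, a closing one, or a whole title element.
TOKEN = re.compile(r"<(chapter|section)>|</(chapter|section)>|<title>(.*?)</title>")


def table_of_contents(text):
    """Return (depth, title) pairs for every titled chapter or section."""
    depth = 0
    entries = []
    for match in TOKEN.finditer(text):
        opened, closed, title = match.groups()
        if opened:
            depth += 1          # entering a chapter or section
        elif closed:
            depth -= 1          # leaving it again
        else:
            entries.append((depth, title))  # a title at the current depth
    return entries


if __name__ == "__main__":
    # Indent each title by its nesting depth; all other content is ignored.
    for depth, title in table_of_contents(DOCUMENT):
        print("  " * (depth - 1) + title)
```

Nothing in the sample says how a title should look; a formatter could attach any style to `title` without the text changing.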
nroff and troff, to take but one example, or two of a kind, really, since they take the same input language and prepare output for two different classes of devices, take as input a document consisting of interspersed text lines and control lines, where the control lines are distinguished from the text lines by having the first character on the line be either a period or an apostrophe (or acute accent). Other lines are text lines and are processed according to the instructions provided by the control lines.

The primitive operations in nroff and troff are based on typesetter primitives, in addition to some powerful operations on registers, text justification and line drawing. The input language also supports expressions and conditionals. Another important feature of the input language is the ability to define macros for oft-repeated sequences of primitives. More than that, the macros may take arguments and pave the way for a generalized markup. Indeed, several such exist for many applications. E.g., the UNIX manual pages are all described with a highly generalized markup. Unfortunately, the language is somewhat restricted and is considered difficult to learn by the people who decide such things. There exist preprocessors for equations, tables, and pictures, and the result can be found in a large number of books. Most notably, the results are in line with sound typographical conventions, and even though it's possible to recognize that a book was set with the aid of troff, typographers are pleased with the result.

TeX, on the other hand, contains a gargantuan number of primitives (or not-so-primitives) and is a vast and complex language which is hard to learn and master, but which is said to be particularly well suited to typesetting mathematics. It is extremely cumbersome to use in its raw version. TeX also supports macro definitions, and indeed this is what most TeX coding is about.
However, the confusion of name spaces in TeX (something which is not present in nroff or troff) makes reading the macro libraries hard and changing them error-prone. Large macro packages and preprocessors exist, e.g., LaTeX, which function as generalized markup, except that it is difficult to use the input for anything but the particular preprocessing software. From the standpoint of a seasoned typographer, the output is gross and can easily be recognized even by the untrained eye.

Both of these, and undoubtedly most other, markup languages are concerned with producing output on some (range of) devices. In this manner, they are really specific markup languages with generalization support, instead of being generalized markup languages with specific markup support.

The Standard Generalized Markup Language is the first major attempt at a truly generalized markup language, the main feature of which is the ability to specify the structure and check it for consistency. Any document which uses a particular structure can be validated with respect thereto. SGML does not concern itself with producing output on any device; that is handled by an SGML application, external to both the language and the document.

What, then, does SGML do?

An SGML document consists of several parts, which need not be found in the same file. (Powerful means exist to enable any organization of the document.) From SGML's point of view, the DTD (Document Type Definition) is the most important. It defines a type of document in terms of its structure. From the author's point of view, the document instance is the most important. This is the author's content within the structure defined by the DTD.

<< More to come >>
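
The division of labour between a DTD and a document instance can be sketched roughly as follows, in Python. Everything here is an assumption made for illustration: the `CONTENT_MODEL` dictionary is only a hypothetical stand-in for a DTD (a real DTD is written in SGML's own declaration syntax and can express order, repetition and much more), and the nested tuples stand in for an already parsed document instance. The point is only that a structure, once declared, can be checked mechanically.

```python
# Hypothetical stand-in for a DTD: which children each element may contain.
# "#text" stands for plain character data.
CONTENT_MODEL = {
    "book": {"chapter"},
    "chapter": {"title", "para"},
    "title": {"#text"},
    "para": {"#text", "footnote"},
    "footnote": {"#text"},
}


def validate(element, path=""):
    """Return a list of structural errors for a (name, children) tree."""
    name, children = element
    here = f"{path}/{name}"
    if name not in CONTENT_MODEL:
        return [f"{here}: undeclared element"]
    errors = []
    for child in children:
        child_name = "#text" if isinstance(child, str) else child[0]
        if child_name not in CONTENT_MODEL[name]:
            errors.append(f"{here}: {child_name} not allowed here")
        elif not isinstance(child, str):
            errors.extend(validate(child, here))  # check nested elements
    return errors


# A document instance that follows the declared structure.
GOOD = ("book", [("chapter", [("title", ["Markup"]), ("para", ["Some text."])])])

# A paragraph containing a chapter, which the structure forbids.
BAD = ("book", [("chapter", [("para", [("chapter", [])])])])

if __name__ == "__main__":
    print(validate(GOOD))
    print(validate(BAD))
```

The first instance yields no errors; the second is rejected because a paragraph may not contain a chapter, which is the kind of consistency check the article has in mind.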