[comp.text] software tools for SGML, proposed comp.text.sgml

yuri@sq.sq.com (Yuri Rubinsky) (08/02/90)

In message <281@txsil.lonestar.org>
robin@txsil.lonestar.org (Robin Cover) writes:

> If some SGML experts from among the "major players" are to be attracted to
> the group, the distinctive name "sgml" and focused attention on SGML is
> a clear desideratum.  It will be hard enough to get support from SGML
> gurus anyway -- they will have neither time nor patience to muck through
> dozens of postings on unrelated topics....
> 
> For a healthy SGML discussion, I feel it is imperative to have a couple
> SGML experts listening in.  Those who have actually read the standard,
> or write DTD's, or build parsers will know what I mean.  There is still
> a lot of confusion about what SGML actually *IS* (and is not), and it's
> easy for an unmoderated forum to generate unfortunate "mis-information."
> I would even suggest that several companies or SGML-supporting agencies
> be contacted (e.g., Software Exoterica; SoftQuad; Datalogics) to see if
> they would designate persons to help referee the discussion -- at least
> at moments when mis-information goes unchecked or when technical
> questions cannot be answered by the forum's regular readers.

On behalf of SoftQuad, yes. We will do our best (within our time
constraints) to respond to appropriate information and questions.

Here are some now:

------------------------------------------------------------------------

In message <2152@tnoibbc.UUCP>
anita@tnoibbc.UUCP (Anita Eijs) writes/asks:

> 1)	Are WYSIWYG-wordprocessors available which can read and write
> 	SGML ?

Yes and no. SGML encoding is generally considered to be at its purest
when it is free of formatting information. Its job is to interchange
structural data and content in such a way that any number of required
"formats" can be derived.

This makes possible work such as that mentioned
in message <45238@brunix.UUCP> wherein
var@iris.brown.edu (Victor A. Riley) describes the work of:

> X3V1.8M MUSIC IN INFORMATION PROCESSING STANDARDS (MIPS) COMMITTEE
>       operating under the rules and procedures of the
>            American National Standards Institute

which is using the syntax of SGML to build a representation language
for hypermedia and time-based documents (music and multimedia events
are two examples). (I mention this because it has relevance later
with respect to the CGM question below.)

SGML is widely used for the storage of text in databases and is being
slowly but surely embraced by the CD-ROM community. [In his keynote address
at the 1988 CD-ROM Conference, Bill Gates announced that he thought it was
pretty clear that SGML was the storage format of choice for CD-ROM publishing.]

All in all, then: The creators of SGML-encoded files will not normally
know or be able to imagine all the uses to which their contents will one
day be put. "What You See is What You Get" is, accordingly, not a phrase
that has much meaning when longevity and multi-purposeness are the goal.

Nonetheless, WYSIWYG has a place in the SGML world. In article
<cah492S00VsA4tzYNl@andrew.cmu.edu>
mss+@andrew.cmu.edu (Mark Sherman) writes:

> One can define imaging semantics to be associated with SGML. The program
> AuthorEditor from SoftQuad is quite nice in that regard. But its
> conventions are parochial -- an "SGML" system knows nothing about AE's
> semantics, unless the exchanging parties agree to information outside of
> the standard.

Mark is being a little bit mischievous here. Certainly my favourite
dictionary defines parochial as "confined to a narrow area", but the
"but" in his sentence doesn't recognize that very often this
local functionality is indeed a Good Thing.

For example: Just because the footnotes in my final document may be
printed in 6 or 8 point type is no reason why I should have to look
at them in that size on the screen. I'm perfectly comfortable knowing
that a pair of simple SGML tags will allow a text-for-paper formatter to
ensure that the footnotes will appear at the bottom of the page or chapter
end in a small point size, while a text-for-screen formatter may place
them in-line or at the bottom of a screenful of text, or in a thin column
to the left of the text body. Having computer screens imitate a piece
of paper (of all ancient technologies!) hardly does justice to their
capabilities. 
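
A minimal sketch of the footnote example, written in Python purely for
illustration. The element name "fn" and both formatter functions are
hypothetical; the .FS/.FE requests are the footnote macros of the troff
ms package. Nothing here is taken from any product named in this
article.

```python
# Hypothetical sketch: one structural element, two presentations.
# "fn" and both function names are assumptions made for illustration.

def format_for_paper(element, text):
    """Return troff (ms macros) input for one parsed element."""
    if element == "fn":
        # Paper: let the formatter place the note at the page bottom
        # in its own (small) type size.
        return ".FS\n" + text + "\n.FE"
    return text


def format_for_screen(element, text):
    """Return a plain-text rendering suited to a terminal."""
    if element == "fn":
        # Screen: show the note in-line, at normal size.
        return "[note: " + text + "]"
    return text


if __name__ == "__main__":
    note = "See ISO 8879."
    print(format_for_paper("fn", note))
    print(format_for_screen("fn", note))
```

The markup carries only the fact that the text is a footnote; each
formatter decides for itself what a footnote should look like.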

Yes, in Author/Editor we DO associate screen formatting with SGML elements.
So too does IBM with its TextWrite product. Both of us do this for a very
good reason: Users take advantage of the screen formatting to build a
working environment in which they are comfortable and where the formatting
helps their tagging intuition. With a simple command in such an editor,
you insert "list item" or "table" tags (for example); screen feedback
assures you this is the element you wanted.

[A word of explanation for those who don't recognize these names: Both
SoftQuad Author/Editor and IBM's TextWrite are conforming SGML editors,
context-sensitive, structured and so forth, with good assistance for
the user encoding an SGML document, and "QUASI-WYG" in the way
described above. There are other SGML editors, Exoterica's Checkmark,
Sobemap's Write-It and Datalogics' WriterStation, which don't do this.]

So, What You See Accurately Represents What You Want, in the model
that suggests that writers are best left to writing, editors to editing,
and designers, later in the process (generally), to designing.

------------------------------------------------------------------------
Here's a much shorter answer to the WYSIWYG question and,
simultaneously, perhaps to:
> 2)	Are any translators available to convert SGML to troff, TeX,
> 	MSWord, etc., and vice versa ?

Microsoft has announced (in Government Computer News and the EPSIG
Newsletter, among other places) that it will announce a form of SGML
support by the end of 1990 for delivery in 1991. According to the
EPSIG Newsletter (the journal of the Association of American Publishers'
Electronic Publishing Special Interest Group operated by OCLC in
Columbus Ohio), Microsoft is currently evaluating SGML parsers.

WordPerfect Corporation released a Statement of Direction
in June 1989 saying "We are in the
process of developing a strategy to assist people in creating
WordPerfect documents that can be converted to and from SGML and ODA".
To the best of my knowledge, that company has made no other public
statements on this subject since.

Agfa Compugraphic CAPS, Xyvision, Frame, Intergraph, Interleaf, Context,
Datalogics, Arbortext, SoftQuad and perhaps others (apologies to anyone I've
forgotten in this list) have demonstrated the ability to take
SGML files encoded using specific tagsets (generally CALS 28001)
and show them on the screen matching line-for-line what will be 
output to a printer.

Translation from SGML to formatter input is properly the task of an
SGML Parser, a utility which can understand enough about the context
of an SGML element [read "object" such as paragraph, or list item, or
table cell, or figure] to be able to produce an output stream which
is meaningful to a processor which may not understand "context sensitivity".
This is not (except when the SGML elements and their inter-relations are
particularly unsophisticated) a job for sed or awk, or even yacc or lex.
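
A toy illustration of why context matters, in Python. It assumes
hypothetical element names (olist, ulist, item) and fully tagged input
with no attributes, entities or tag minimization. The same <item> tag
yields a numbered label inside one list and a bullet inside another, so
the translator must track the open elements on a stack. This is a
sketch of the idea only, not a conforming SGML parser.

```python
import re

# Toy tokenizer: start tags, end tags and text. A real SGML parser
# must also handle the DTD, attributes, entities and minimization.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<]+)")
LISTS = ("olist", "ulist")  # hypothetical element names


def translate(sgml):
    """Turn a fully tagged fragment into troff ms-macro input."""
    stack = []     # open elements, innermost last
    counters = []  # one item counter per open list
    out = []
    for close, name, text in TOKEN.findall(sgml):
        if text:
            if text.strip():
                out.append(text.strip())
        elif close:
            if not stack or stack[-1] != name:
                raise ValueError("unexpected </%s>" % name)
            stack.pop()
            if name in LISTS:
                counters.pop()
        else:
            if name == "item":
                if not stack or stack[-1] not in LISTS:
                    raise ValueError("item outside a list")
                counters[-1] += 1
                # The label depends on the enclosing element.
                if stack[-1] == "olist":
                    out.append(".IP %d." % counters[-1])
                else:
                    out.append(".IP \\(bu")
            elif name in LISTS:
                counters.append(0)
            stack.append(name)
    if stack:
        raise ValueError("unclosed <%s>" % stack[-1])
    return "\n".join(out)


if __name__ == "__main__":
    print(translate("<olist><item>one</item><item>two</item></olist>"
                    "<ulist><item>three</item></ulist>"))
```

A line-at-a-time substitution sees three identical <item> tags here;
only the stack of open elements tells them apart.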

On the subject of parsers, Mark Sherman writes:
> I believe SoftQuad sells them. Quality, functionality and price unknown
> to me. There are probably more around, although I recall an article by
> Larry Welsch from NIST (ACM document processing conference) claiming
> that some parts of SGML were exceedingly difficult to implement, so you
> should watch out for how much is implemented when someone makes a claim.

A Conformance Testing Initiative led by the Graphic Communications
Association in North America and by the National Computing Centre in the
UK (with the cooperation of the European Community) will, within a
year or so, eliminate this issue. Today, the most popular parsers,
which are generally conceded to also be the most conformant, are those
of Software Exoterica (of Ottawa Canada), licensed by Frame,
Arbortext and Intergraph; and of Sobemap (of Brussels Belgium,
marketed by Yard Software of Chippenham Wiltshire UK), licensed
by Agfa Compugraphic CAPS, Interleaf, Context and Xyvision.
We have made available to our consulting clients
the parser from Author/Editor, which
is optimized to work with our SoftQuad Publishing Software 
sqtroff component.


In Holland, Elsevier Scientific Publishers use, as a matter of course
I believe, the SGML Parser of the Vrije Universiteit Amsterdam
to convert SGML files to TeX. A number of other sites in Europe perform
the same conversion, as did the creators of the terrific
SGML/Structured Text Bibliography compiled by Robin Cover, Nicholas
Duncan and David Barnard [Queen's University at Kingston Ontario Canada,
Technical Report 90-281, still in draft form and available later this
year].

------------------------------------------------------------------------
Back to Anita's questions:

> 3)	Is an SGML to PostScript converter available ?

Well, yes, though we think of that process not so much as a conversion
as traditional document processing. One could describe any software
product which makes up pages from parsed SGML input as performing
SGML to PostScript conversion. Neither SGML nor PostScript alone has
the smarts to know when to break a line or a page, and so on.

------------------------------------------------------------------------
> 4)	Does SGML support drawings (illustrations) ? How about tables,
> 	mathematical expressions ?

Yes, certainly, but these two questions have quite different answers.

a) Drawings/Illustrations: Think of SGML, at one level, as process
control. [Stop! SGML is not a procedural language, but nonetheless,
I believe this is the most straightforward way to explain the
functionality ...] The standard formalizes a set of declarations
which associate certain entities with "data content notations".
SGML's job is not to attempt to predict all the ways that any number
of hardware and software systems will store graphic images, video,
sound, smell, voice annotation, and so on. Rather, an SGML
document will contain, in easily recognized constructs, all the
information that a system needs to recognize where parseable text
starts and stops, and where control must be passed to an application
that can deal with the strictly delimited content which is non-SGML
data. [The hypermedia/multimedia work going on in the ANSI committee
mentioned above uses these capabilities very elegantly, even building
in SGML constructs to point to "the interiors" of non-SGML contents.]

Mark Sherman writes:
> Now, you and I can make a
> side agreement that whenever we use the tag "my-CGM-byte", the marked
> bytes will be in CGM-compliant format. However, that is an agreement
> outside of the standard and only usable by our local cabal. Ditto for
> tables, mathematical expressions.

This is not true. The standard defines a document as (more or less)
a Document Type Definition -- the set of elements, other constructs,
and their relationships -- followed by an "instance" of that DTD,
content marked up using the semantics rigidly prescribed by the DTD.

An ability to read the DTD is a vital function within any SGML system.
Accordingly, there is a completely standardized, interchangeable
method, within the standard, to pass along the data content notations,
such as CGM, or TIFF, or RIFF, or IGES, or IFF, or anything. It is
not the job of SGML (nor should it be) to dictate how applications
software will respond to the content being passed.

"Our local cabal" has nothing to do with the story. Anyone with
an SGML parser can read any SGML file and be passed a meaningful
output stream.
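
A minimal sketch of the mechanism, in Python. The declarations in DECLS
are illustrative: the entity names, file names and handler functions
are hypothetical, and the regular expressions recognize only this one
simple form of NOTATION and ENTITY declaration, where a real parser
reads the whole DTD. The point is the division of labour: the
declarations name the notation, and an application outside SGML
decides what to do with the data.

```python
import re

# Illustrative declarations; names and system identifiers are made up.
DECLS = '''
<!NOTATION cgm  SYSTEM>
<!NOTATION tiff SYSTEM>
<!ENTITY fig1 SYSTEM "fig1.cgm" NDATA cgm>
<!ENTITY scan SYSTEM "scan.tif" NDATA tiff>
'''

NOTATION = re.compile(r'<!NOTATION\s+(\w+)\s+SYSTEM\s*>')
ENTITY = re.compile(
    r'<!ENTITY\s+(\w+)\s+SYSTEM\s+"([^"]*)"\s+NDATA\s+(\w+)\s*>')


def read_declarations(text):
    """Map each data entity to (system identifier, notation name)."""
    notations = set(NOTATION.findall(text))
    entities = {}
    for name, sysid, notation in ENTITY.findall(text):
        if notation not in notations:
            raise ValueError("undeclared notation: " + notation)
        entities[name] = (sysid, notation)
    return entities


# Hypothetical applications, one per notation. SGML does not say what
# they do with the data, only which notation the data is in.
HANDLERS = {
    "cgm": lambda sysid: "pass %s to a CGM interpreter" % sysid,
    "tiff": lambda sysid: "pass %s to a TIFF viewer" % sysid,
}


def dispatch(entities, name):
    """Hand a non-SGML entity to whichever application handles it."""
    sysid, notation = entities[name]
    handler = HANDLERS.get(notation)
    if handler is None:
        return "no application registered for " + notation
    return handler(sysid)


if __name__ == "__main__":
    ents = read_declarations(DECLS)
    for entity_name in sorted(ents):
        print(entity_name, "->", dispatch(ents, entity_name))
```

Any receiving system that reads the same declarations learns the same
notation names, whether or not it has an application for them.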


b) As for tables and mathematics: Both areas are covered in 
a "must-read" Technical Report (TR 9573)  published by ISO/IEC
and edited by Anders Berglund (now of ISO, ex of CERN), entitled
"SGML Support Facilities: Techniques for Using SGML". The DTDs created by
the Association of American Publishers (which are now an ANSI
standard) and by the US Defense Department under the CALS initiative,
also contain "content models" for tables of varying complexity. It
is now up to software developers to find mechanisms for presenting
these content models to users in as straightforward a way as is
possible, but there is nothing wrong with the underlying SGML data
representation. [Certainly the content models are complex. And so
they should be: tables can be extremely complicated.]

As far as math goes, for now, the CALS DTDs use the "data content
notation" construct described above, choosing to standardize
on TeX, EQN and IBM's Scientific and Mathematical Formula Format,
with tags to delimit nested math, and expecting the formatter to
handle the formatting.

------------------------------------------------------------------------
> 5)	Is it possible to use SGML and CGM in combination ? How about the
> 	availability of CGM-translators ?

See above. A variety of graphics and CAD packages exist which
claim CGM translation ability -- but to other graphics formats,
not to SGML.

The aforementioned "Techniques for Using SGML" extends an example given
in Annex E of SGML itself. The CGM clear text encoding in the example
is nested within the SGML document, but attributes associated
with the SGML elements dictate scaling and cropping.

> 6)	Are parsers available to check an SGML-document on syntax ?

Yes. Available parsers include Software Exoterica's XGML, Sobemap's
Mark-It, NIST's not-yet-complete public domain utility, the Amsterdam
Parser (which I've not seen, however), and, for SGML sites using sqtroff,
SoftQuad's. Datalogics bundles its own (built on top of the NIST parser,
I believe) with its WriterStation and Pager products; IBM includes one
with TextWrite.

> 7)	Are the software tools public domain ? What are the prices of the
> 	software tools ? What kind of software tools are available ?

There is an extraordinary variety of software tools available, from
all the vendors mentioned above, plus a few more:

Avalanche Development Company (Boulder Colorado) sells FastTag,
an "auto-tagger" which uses a proprietary visual recognition engine
to mark up documents from a variety of wordprocessors and scanner/OCRs.

PraXis Inc (Providence Rhode Island) will soon be showing its
Electronic Book Browser, a system which builds and displays hypertexts
compiled from SGML texts.

OWL (Office Workstations Limited of Edinburgh Scotland and Bellevue
Washington) uses SGML as an input source for its IDEX hypertext/
document database.

Other products (along with addresses and phone numbers
for all the companies mentioned throughout this article) are listed
in the SGML Source Guide, a publication of the
	Graphic Communications Association
	1730 North Lynn Street, Suite 604
	Arlington, Virginia 22209-2085  USA
	Telephone: 703 841-8160
	Fax: 703 841-8144
		attn: Marion Ellidge

GCA also publishes <TAG>, the SGML Newsletter,
which, along with the newsletters mentioned below, is a good
source of product descriptions and new product announcements.

GCA also hosts several SGML tutorials each year, as well as
the twice-annual TechDoc Conference [next one: August 20 to 24
in Washington DC] and, co-sponsored with the International
Users' Group, the annual Mark-up conference each May or June.

The EPSIG Newsletter, mentioned above, is available from 
	OCLC Inc
	6565 Frantz Road
	Dublin, Ohio 43017-0702  USA
	Telephone: 800 848-5878
		attn: Betsy Kaiser

The newsletter and bulletin of the International SGML Users' Group,
as well as a number of other publications, are available from
	International SGML Users' Group,
	c/o SoftQuad Inc
	720 Spadina Avenue
	Toronto, Ontario M5S 2T9  Canada
	Telephone: 416 963-8337
		attn: Steven Downie

A recent posting to this newsgroup described the work and plans
of an SGML Consortium, proposed by Ohio State University, which
intends to make available a variety of public domain SGML tools.

> 8)	Will the newsgroup 'comp.text.sgml' be created ?

I suspect that if there was any doubt before, then the outrageous
length of this posting will tip the balance as crowds of comp.text
subscribers say "Get this stuff out of here!" Nonetheless, it seems
to me that there is another point of view on the subject:

Until SGML is taken for granted as a useful and normal part of
the working lives of all who toil with documents,
a national and international standard of this level of
capability might well be usefully discussed in comp.text rather
than in a separate newsgroup. I think that people generally interested
in text issues would do well to follow these discussions, rather
than create a distinct SGML ghetto. With the support of so many
governments, associations, research groups, hardware and software
vendors, as well as electronic and paper publishers of all sorts,
it's not going to go away. Anyone involved with comp.text may be
well served by keeping on top of these developments.



------------------------------------------------------------------------
Yuri Rubinsky				(416) 963-8337
President                               (800) 387-2777 (from U.S. only)
SoftQuad Inc.				uucp: {uunet,utzoo}!sq!yuri
720 Spadina Ave.			Internet: yuri@sq.com
Toronto, Ontario, Canada M5S 2T9	Fax: (416) 963-9575