[comp.text] new tech report

jbarnes@xylophone.cis.ohio-state.edu (Julie Ann Barnes) (07/23/90)

We have recently published the following technical report:

Analysis of Document Encoding Schemes: A General Model and Retagging
Toolset
Julie Barnes
OSU-CISRC-7/90-TR19, July, 1990, 69 pp.

If you would like a copy, please send a request via email to

strawser@cis.ohio-state.edu

Please include your postal mailing address.


                                ABSTRACT

Many document encoding schemes exist today, along with many software
applications for processing electronically encoded documents.  This
plethora of schemes complicates the development of applications that
must access documents in more than one representation.  A uniform
representation of electronic documents would greatly facilitate
software development.

Unfortunately, retagging existing electronic documents is difficult
with current development tools.  The fundamental problem of
distinguishing the markup from the text strings is complicated by
problems such as context-sensitive markup, implicit markup, white
space, and the matching of start and end tags.  Lexical-analyzer
generators such as Lex are based on formal models that are inadequate
to handle these problems.  As a result, much of the retagging code
must be written by hand.
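
For example, properly matched start and end tags nest, and nested
structures lie beyond the regular languages on which Lex is based, so
the generated scanner must be supplemented with hand-written
bookkeeping along these lines (a minimal C sketch for illustration,
not code from the report):

    #include <stdio.h>
    #include <string.h>

    #define MAXDEPTH 64

    static char stack[MAXDEPTH][32];   /* names of currently open tags */
    static int  top = 0;

    /* Record an open start tag. */
    static int push_tag(const char *name)
    {
        if (top == MAXDEPTH) return 0;
        strncpy(stack[top], name, 31);
        stack[top][31] = '\0';
        top++;
        return 1;
    }

    /* Report whether an end tag matches the most recent start tag. */
    static int pop_tag(const char *name)
    {
        if (top == 0) return 0;
        top--;
        return strcmp(stack[top], name) == 0;
    }

    int main(void)
    {
        /* <b><i>ok</i></b> is balanced; <b><i>bad</b></i> is not. */
        push_tag("b");
        push_tag("i");
        printf("%d %d\n", pop_tag("i"), pop_tag("b"));   /* prints: 1 1 */
        return 0;
    }

The pushdown stack used here is precisely what a purely regular
recognizer cannot simulate.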

Based on a generalization of these problems, we develop a new model
for textual data objects with embedded markup.  The model rests on
the relationships between markup and text strings and includes four
classes of markup strings: symbol, nonsymbol, implicit segmenting,
and explicit segmenting tags.
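
The formal definitions of these four classes are given in the report
itself; purely to suggest the kind of distinction involved, a
retagger might represent them with an enumeration such as the
following (the comments are illustrative guesses, not the report's
definitions):

    /* Hypothetical sketch; the comments are guesses made for
       illustration, not the report's definitions. */
    enum markup_class {
        SYMBOL,              /* guess: a fixed code denoting a symbol    */
        NONSYMBOL,           /* guess: markup carrying other information */
        IMPLICIT_SEGMENTING, /* guess: segmentation signaled by layout   */
        EXPLICIT_SEGMENTING  /* guess: paired start and end tags         */
    };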

We propose a uniform representation called a Lexical Intermediate
Form (LIF) with the following lexical properties: 1) the tags are
easy to distinguish from the text, 2) the tags are unambiguous, and
3) the tags are explicit.  The LIF borrows its concrete syntax from
the ISO standard SGML, but it is not encumbered with the SGML concept
of document-type definitions.
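
As a hypothetical illustration (the tag names here are invented, not
taken from the report), a Scribe fragment and a LIF rendering with
these three properties might look like:

    Scribe source:    @i[retagging] is @b[hard]
    LIF (SGML-like):  <ital>retagging</ital> is <bold>hard</bold>

In the LIF line every tag is explicit and unambiguous, and the fixed
delimiters make the tags easy to distinguish from the text.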

Based on the model and the proposed LIF, we identify two steps in the
retagging process and develop software tools that automatically
generate the code for each of these steps.  Experiences using the
toolset are described for six encoding schemes of varying complexity:
the Thesaurus Linguae Graecae, the Dictionary of the Old Spanish
Language, the Lancaster-Oslo/Bergen Corpus, the Oxford Concordance
Program, WATCON-2, and Scribe.  Use of the toolset yields a savings
in coding effort of 4.3 to 23.2 lines of generated code per line of
toolset specification; approximately 98 percent of the retagging code
for these encoding schemes was generated automatically.
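
The abstract does not spell out the two steps or the specification
language, so the following C sketch is only an invented illustration
of the leverage involved: each one-line table entry, standing in for
a line of specification, drives the shared rewriting machinery.

    #include <stdio.h>
    #include <string.h>

    struct rule { const char *src; const char *lif; };

    /* The "specification": one table line per source construct.
       (A real retagger must also handle the context sensitivity
       described above; this toy does not.) */
    static const struct rule rules[] = {
        { "@i[", "<ital>"  },
        { "]",   "</ital>" },
    };
    #define NRULES (sizeof rules / sizeof rules[0])

    /* The shared machinery: apply the first matching rule, else
       copy the character through unchanged. */
    static void retag(const char *s)
    {
        while (*s) {
            size_t i;
            for (i = 0; i < NRULES; i++) {
                size_t n = strlen(rules[i].src);
                if (strncmp(s, rules[i].src, n) == 0) {
                    fputs(rules[i].lif, stdout);
                    s += n;
                    break;
                }
            }
            if (i == NRULES)
                putchar(*s++);
        }
        putchar('\n');
    }

    int main(void)
    {
        retag("@i[retagging] by table");
        /* prints: <ital>retagging</ital> by table */
        return 0;
    }
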
-=-
Julie A. Barnes               Department of Computer and Information Science
jbarnes@cis.ohio-state.edu    The Ohio State University
                              2036 Neil Ave.
                              Columbus, OH USA 43210-1277