jbarnes@xylophone.cis.ohio-state.edu (Julie Ann Barnes) (07/23/90)
We have recently published the following technical report:

    Analysis of Document Encoding Schemes:
    A General Model and Retagging Toolset
    Julie Barnes
    OSU-CISRC-7/90-TR19, July, 1990, 69 pp.

If you would like a copy, you may send the request via email to strawser@cis.ohio-state.edu. Please include your postal mailing address.

ABSTRACT

Many document encoding schemes and software applications to process electronically encoded documents exist today. The plethora of schemes complicates the development of applications that must access documents in more than one representation. A uniform representation of electronic documents would greatly facilitate software development. Unfortunately, the retagging of existing electronic documents is difficult, given the current development tools. The fundamental problem of distinguishing the markup from the text strings is complicated by problems such as context-sensitive markup, implicit markup, white space, and the matching of start and end tags. Lexical-analyzer generators such as Lex are based on formal models that are inadequate to handle these problems. Because of this, much of the retagging code must be written by hand.

Based on a generalization of these problems, we develop a new model for textual data objects with embedded markup. The new model for textual data objects is based on the relationships between markup and text strings. The model includes four classes of markup strings: symbol, nonsymbol, implicit segmenting, and explicit segmenting tags. We propose a uniform representation called a Lexical Intermediate Form with the following lexical properties: 1) the tags are easy to distinguish from the text, 2) the tags are unambiguous, and 3) the tags are explicit. The LIF borrows its concrete syntax from the ISO standard SGML, but it is not encumbered with the SGML concept of document-type definitions.

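As a rough illustration of the three lexical properties listed above, here is a minimal Python sketch that retags a toy input into explicit SGML-style start and end tags. It is not the toolset or the LIF described in the report. The input scheme (a `.ti ` prefix marking a title, blank lines implicitly separating paragraphs), the tag names `title` and `p`, and the function names `escape_text` and `retag` are all assumptions invented for this example.

```python
def escape_text(text):
    """Escape characters that could otherwise be mistaken for markup."""
    # "&" must be escaped first so the "&" in "&lt;" is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def retag(source):
    """Convert the toy scheme into explicit SGML-style start/end tags."""
    out = []
    paragraph = []

    def flush():
        # Close the paragraph collected so far, if there is one.
        if paragraph:
            out.append("<p>" + escape_text(" ".join(paragraph)) + "</p>")
            paragraph.clear()

    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(".ti "):
            # Hypothetical explicit tag in the source scheme: a title line.
            flush()
            out.append("<title>" + escape_text(stripped[4:]) + "</title>")
        elif not stripped:
            # Implicit markup: a blank line ends the current paragraph.
            flush()
        else:
            paragraph.append(stripped)
    flush()
    return "\n".join(out)


if __name__ == "__main__":
    sample = ".ti A < B\nfirst line\nsecond line\n\nnext & last"
    print(retag(sample))
```

In the output every segment boundary is carried by a matched start and end tag, and any `<` or `&` in the text is escaped, so a later processing step would not have to guess where markup ends and text begins.
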
Based on the model and the proposed LIF, we identify two steps in the retagging process and develop software tools that automatically generate the code for each of these steps. Experiences using the toolset are described for six encoding schemes of varying complexity: the Thesaurus Linguae Graecae, the Dictionary of the Old Spanish Language, the Lancaster-Oslo/Bergen Corpus, the Oxford Concordance Program, WATCON-2, and Scribe. Use of the toolset represents a savings in coding effort ranging from 4.3 to 23.2 lines of code generated per line of specification in the toolset. Approximately 98 per cent of the retagging code for these encoding schemes was automatically generated by the toolset.

-=-
Julie A. Barnes
jbarnes@cis.ohio-state.edu
Department of Computer and Information Science
The Ohio State University
2036 Neil Ave.
Columbus, OH USA 43210-1277