[comp.archives] [comp.newprod] Database for/with

hvh@cs.kun.nl (Hans van Halteren) (03/07/91)

Archive-name: text/syntax/ldb/1991-03-05
Archive-directory: phoibos.cs.kun.nl:/pub/LDB/ [131.174.81.1]
Original-posting-by: hvh@cs.kun.nl (Hans van Halteren)
Original-subject: Database for/with (syntactic analysis) trees
Reposted-by: emv@ox.com (Edward Vielmetti)

- The book "Linguistic Exploitation of Syntactic Databases", about the use
  of the Linguistic DataBase, is now available (270pp., Hfl. 70).

- Also new is a freely copyable demo version for MS-DOS.

See below for details and for a general introduction to the LDB.

=====

The Linguistic DataBase (LDB) 

 
The LDB is a database system developed by the TOSCA group at Nijmegen
University which allows linguists who are not experts in computing to
access syntactically analyzed corpora. The data in the database
comprises `syntactic analysis trees' of the contiguous utterances in a
natural-language text. Since these trees are built from a continuous
text, they give a good representation of actual language use and can
thus provide a testing ground for linguistic hypotheses. The range of
extractable information in such a database is mainly dependent on the
degree to which the text has been prepared. Formerly, studies of
corpora were restricted to the level of words or word-classes, but
with the Linguistic DataBase it becomes possible to extend these
studies to the level of syntax, so that larger constituents can be
analyzed.
 
Unlike currently available database packages, the LDB has been created
specifically to handle the type of data linguists need to analyze - a
labelled tree structure with a variable number of branches at each
node and the possibility of recursion.  The LDB can be used to examine
the trees on the terminal screen, search for utterances with given
properties, and handle database-wide queries about constructs in the
utterances.
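
As an illustration of the kind of data described above, here is a
minimal Python sketch of a labelled tree with a variable number of
branches per node and recursive nesting. It is not the LDB's internal
representation, which this announcement does not describe. The class
name Node, its fields, and the labels ("SU", "NP", etc.) are all
assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One constituent: two labels and any number of child constituents."""
    function: str                      # hypothetical label, e.g. "SU" for subject
    category: str                      # hypothetical label, e.g. "NP" for noun phrase
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None         # filled in only on leaf nodes

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def words(self) -> List[str]:
        """Return the words of the utterance in left-to-right order."""
        return [n.word for n in self.walk() if n.word is not None]


def example_tree() -> Node:
    # "the cat sleeps", analyzed with invented labels
    return Node("UTT", "S", [
        Node("SU", "NP", [
            Node("DT", "ART", word="the"),
            Node("HD", "N", word="cat"),
        ]),
        Node("V", "VP", [Node("HD", "V", word="sleeps")]),
    ])


if __name__ == "__main__":
    print(" ".join(example_tree().words()))
```

Because children is an ordinary list of further Node objects, a node
can have any number of branches and a category can reappear inside
itself at any depth.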
 
The LDB does not presume special graphics hardware. For this reason it
has been implemented for common machines (VAX and IBM PC/AT) and
common terminals (VT100, ADM3, etc.).  Where possible, special
terminal features are used, such as highlighting and graphics
characters, but even on the so-called `dumb' ADM3A the trees are
represented by an acceptable imitation of graphics. Terminal types not
already provided for can be easily installed by the user.
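
To give an idea of how a tree can be drawn without graphics hardware,
the following Python sketch renders a tree using only plain ASCII
characters. The layout is an assumption for illustration; the actual
screen layout of the LDB is not specified in this announcement. The
(label, children) tuple shape and the names render and DEMO are
likewise hypothetical.

```python
from typing import List

# A tree is a (label, children) pair; this shape is assumed for the sketch.
DEMO = ("S", [
    ("NP", [("the", []), ("cat", [])]),
    ("VP", [("sleeps", [])]),
])


def render(tree, prefix: str = "", is_last: bool = True,
           is_root: bool = True) -> List[str]:
    """Return the lines of an ASCII drawing of the tree, one node per line."""
    label, children = tree
    if is_root:
        lines = [label]
        child_prefix = ""
    else:
        # "+--" marks the last child of a node, "|--" any earlier child.
        lines = [prefix + ("+-- " if is_last else "|-- ") + label]
        # Keep a vertical bar running while later siblings still follow.
        child_prefix = prefix + ("    " if is_last else "|   ")
    for i, child in enumerate(children):
        last = i == len(children) - 1
        lines.extend(render(child, child_prefix, last, False))
    return lines


if __name__ == "__main__":
    print("\n".join(render(DEMO)))
```

On a terminal with line-drawing characters, the same routine could
substitute those characters for "|", "+" and "-" without changing the
layout logic.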
 
The LDB also does not presume a computationally expert user. Thus
control of the program is designed to be simple and clear. The overall
control is handled by a menu system, which displays short descriptions
of the choices, each of which can be activated by a single keystroke.
In the Tree Viewer, which is used to examine an analysis tree on the
terminal screen, there is not enough space left on the screen to
produce these descriptions, so that commands (mostly of one keystroke)
are listed in abbreviated form. A description of all commands can be
accessed by a `help' command, however.
 
For queries going beyond a single tree, the Exploration Scheme
formalism has been developed. An Exploration Scheme consists of a
search pattern, itself a tree much like the analysis trees, and a
specification of the operations to be performed on the information the
pattern discovers. Exploration Schemes serve a range of purposes, from
a simple search for a tree, in order to examine it with the Tree
Viewer, to the creation of frequency tables.
The formalism is designed in such a way that the novice can start
exploring immediately. From there, he can gradually expand his
knowledge to the more complex features. In order to facilitate
formulating Exploration Schemes the LDB has a special scheme editor.
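
The two parts of an Exploration Scheme described above (a tree-shaped
search pattern plus an operation on what it finds) can be loosely
sketched in Python. This is an analogy only, not the Exploration Scheme
formalism itself, whose syntax is not given in this announcement. The
names Node, Pattern, matches and frequency_table, the labels, and the
matching rule (child patterns must match children in left-to-right
order, with other children allowed in between) are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Node:
    function: str
    category: str
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Pattern:
    function: Optional[str] = None     # None means "any label"
    category: Optional[str] = None
    children: List["Pattern"] = field(default_factory=list)


def matches(p: Pattern, n: Node) -> bool:
    """True if n has p's labels and p's children match n's children in order."""
    if p.function is not None and p.function != n.function:
        return False
    if p.category is not None and p.category != n.category:
        return False
    remaining = iter(n.children)       # shared, so matches keep their order
    return all(any(matches(cp, c) for c in remaining) for cp in p.children)


def frequency_table(corpus: Iterable[Node], pattern: Pattern,
                    key: Callable[[Node], str]) -> Counter:
    """Count key(node) over every node in the corpus that matches pattern."""
    table: Counter = Counter()
    for tree in corpus:
        for node in tree.walk():
            if matches(pattern, node):
                table[key(node)] += 1
    return table


# Two tiny invented trees standing in for an analyzed corpus.
CORPUS = [
    Node("UTT", "S", [
        Node("SU", "NP", [Node("HD", "N")]),
        Node("V", "VP", [Node("HD", "V")]),
        Node("OD", "NP", [Node("DT", "ART"), Node("HD", "N")]),
    ]),
    Node("UTT", "S", [
        Node("SU", "NP", [Node("HD", "PRON")]),
        Node("V", "VP", [Node("HD", "V")]),
    ]),
]

if __name__ == "__main__":
    any_np = Pattern(category="NP")
    print(frequency_table(CORPUS, any_np, lambda n: n.function))
```

Here the pattern selects every noun phrase and the operation tabulates
the functions those noun phrases fulfil. Adding child patterns narrows
the search, for example to noun phrases that contain an article
followed by a noun.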
 
The LDB package comes with the Nijmegen Corpus, a 130,000 word
collection of modern British English with a full syntactic analysis of
each utterance. Each node in the tree (i.e. each constituent in the
utterance) carries a function label and a category label. In the
future more corpora will become available.  Furthermore, since the
database system is independent of both formalism and language, it is
possible to use it for any other kind of analyzed corpus.
 
The LDB package requires one of the following: (1) a VAX with VMS;
(2) an IBM PC (AT preferred) with 640K RAM, a hard disk, at least one
1.2 Mb high-capacity diskette drive, and MS-DOS (no special graphics
hardware is needed); or (3) any UNIX machine with a competent C
compiler, plus enough knowledge of terminal and file I/O to configure
the program for the system. The package is not copy protected.
Source code (ca. 25,000 lines of CDL2) is not available.

The package costs Hfl. 100 for academic institutions and Hfl. 5000 for others.
[as of Jan. 1991 Hfl. 1 is about $ 0.60]
A user manual is not included in the academic distribution; 
the book Linguistic Exploitation of Syntactic Databases (see
publications) contains all necessary information and is priced at Hfl. 70.
 
A fully functional demonstration version for any MS-DOS machine with a
hard disk is available
 - on a 5.25" 360K diskette from the address below
 - by ftp at phoibos.cs.kun.nl in the directory pub/LDB
 - by listserv from LISTSERV@HEARN as files
     LDBDEMOC INF TOSCA-L
     LDBDEMOC UUE TOSCA-L

 
For more information contact
  Hans van Halteren
  TOSCA Group
  Department of English
  University of Nijmegen 
  P.O. Box 9103
  6500 HD Nijmegen
  The Netherlands
  tel:    (+31)-080-512836
  e-mail: cor_hvh@kunrc1.urc.kun.nl
 


Publications
 
van Halteren, Hans and Nelleke Oostdijk. ``Using an Analyzed
Corpus as a Linguistic Database'', in Computers in Literary
and Linguistic Computing, Proceedings of the
XIIIth ALLC Conference (Norwich 1986),
John Roper (vol. ed.), J. Hamesse and A. Zampolli (series eds.)
 
van Halteren, Hans and Theo van den Heuvel. Linguistic
Exploitation of Syntactic Databases. (Rodopi, Amsterdam 1990).
 
de Haan, Pieter. ``Exploring the Linguistic Database: Noun Phrase
Complexity and Language Variation'', in Corpus Linguistics
and Beyond, Willem Meijs, ed. (Rodopi, Amsterdam 1987).