[comp.archives] [ai] the consortium for lexical research

ted@nmsu.edu (Ted Dunning) (02/27/91)

Archive-name: text/research/lexical/1991-02-20
Archive-directory: crl.nmsu.edu:/pub/lexical/ [128.123.1.14]
Original-posting-by: ted@nmsu.edu (Ted Dunning)
Original-subject: the consortium for lexical research
Reposted-by: emv@ox.com (Edward Vielmetti)




            The Consortium for Lexical Research

                Rio Grande Research Corridor
               Computing Research Laboratory
                New Mexico State University
              Box 30001, Las Cruces, NM 88003.

                      lexical@nmsu.edu
                       (505) 646-5466
                    Fax: (505) 646-6218


     Work in computational linguistics has reached the point where the
performance of many natural language processing systems is limited by
a "lexical bottleneck". That is, such systems could handle much more
text and produce much more impressive application results were it not
for the fact that their lexicons are too small.

     The Association for Computational Linguistics has established the
Consortium for Lexical Research (CLR), and DARPA has agreed to fund
this.  It will be sited at the Computing Research Laboratory, New
Mexico, under its Director, Yorick Wilks, and an ACL committee
consisting of Roy Byrd, Ralph Grishman, Mark Liberman and Don Walker.

     The Consortium for Lexical Research will be an organization for
sharing lexical data and tools used to perform research on natural
language dictionaries and lexicons, and for communicating the results
of that research.  Members of the Consortium will contribute resources
to a repository and withdraw resources from it in order to perform
their research.  There is no requirement that withdrawals be
compensated by contributions in kind.

     A basic premise of the proposal for cooperation on lexical
research is that the research must be "precompetitive".  That is, the
CLR will not have as its goal the creation of commercial products.
The goal of precompetitive research would be to augment our
understanding of what lexicons contain and, specifically, to build
computational lexicons having those contents.

     The task of the CLR is primarily to facilitate research, making
available to the whole natural language processing community certain
resources now held only by a few groups that have special
relationships with companies or dictionary publishers.  The CLR would
as far as is practically possible accept contributions from any
source, regardless of theoretical orientation, and make them available
as widely as possible for research.  There is also an underlying
theoretical assumption or hope: that the contents of major lexicons
are very similar, and that some neutral, or "polytheoretic," form of
the information they contain can be at least a research goal, and
would be a great boon if it could be achieved.  A major activity of
the CLR will be to negotiate agreements with "providers" on reassuring
and advantageous terms to both suppliers and researchers.  Major
funders of work in this area in the US have indicated interest in
making participation in the CLR a condition for financial support of
research.  An annual fee will be charged for membership. It is
intended that after an initial start-up period, the Consortium become
self-supporting.

     The Computing Research Lab (CRL) already has an active research
program in computational lexicons, text processing, machine
translation, etc., funded by DARPA and NSF as well as a range of
machines appropriate for advanced computing on dictionaries.

Resources and Services of the Consortium

     The following lists of lexical data and tools seem to provide a
reasonable starting content for the repository.  We will continually
solicit and encourage additions to this list.

Data

1. word lists (proper nouns, count/mass nouns, causative verbs,
movement verbs, predicative adjectives, etc.)

2. published dictionaries

3. specialized terminology, technical glossaries, etc.

4. statistical data

5. synonyms, antonyms, hypernyms, pertainyms, etc.

6. phrase lists

                           Tools

1. lexical data base management tools

2. lexical query languages

3.  text analysis tools (concordance, KWIC, statistical analysis,
collocation analysis, etc.)

4. SGML tools (particularly tuned to dictionary encoding)

5. parsers

6. morphological analyzers

7. user interfaces to dictionaries

8. lexical workbenches

9. dictionary definition sense taggers

                          Services

     Repository management will involve cataloging and storing
material in disparate formats, and providing for their retransmission
(with conversion, where appropriate tools exist).  In addition, it
will be necessary to maintain a library of documentation describing
the repository's contents and containing research papers resulting
from projects that use the material.  A brief description of the
services to be provided is as follows:

a.  CRL will provide a catalog of, and act as a clearing-house for,
    utilities programs that have been written for existing online
    lexical data.

b.  CRL will compile a list of known mistakes, misprints, etc.  that
    occur in each of the major published sources (dictionaries etc.).

c.  CRL will set up a new memorandum series explicitly devoted to the
    lexical center.

d.  CRL will also be a clearinghouse for preprints and hard-to-find
    reprints on machine-readable dictionaries.

e.  CRL also expects to conduct workshops in this area, including an
    inaugural workshop in late 1991 or early 1992.

f.  CRL would provide a catalog for access to repositories of
    corpus-manipulation tools held elsewhere.

g.  CRL has already set up a network accessible file transfer service.

     We invite you to participate in the Consortium for Lexical
Research.  Anyone interested in participating even in principle as a
provider or consumer of data, tools, or services should send a message
to

    lexical@nmsu.edu
 or
    lexical@nmsu.bitnet

as should anyone who would like to be on our lexical information
list.