[comp.ai] Summary: Pointers to English grammars, lexicons, corpora

bllklly@olsen.UUCP (William Kelly) (12/20/90)

Here is a summary of responses to my request in November for 
"Pointers to English grammars, lexicons, corpora".  Thanks to everyone.
As various people have requested a summary, I assume there is enough interest
to justify a posting.  Quotes are verbatim, except where I have put my comments in
square brackets.
Bill Kelly, bllklly@olsen.uu.ch  or  bllklly@olsen.uucp

---Grammars---
>From Mark.Kantrowitz@A.GP.CS.CMU.EDU Thu Nov  8 16:24:18 1990
Grammars: The largest grammar I know of is NIGEL, which is part of the
PENMAN project at the Information Sciences Institute (ISI) at USC.

>From laubsch@hpljl.hpl.hp.com Thu Nov 22 03:58:31 1990
I recommend reading the following:
	H. Alshawi et al "Research Programme in Natural Language Processing"
	Final Report, SRI International, Cambridge Centre,
        prepared for Alvey Project No. ALV/PRJ/IKBS/105

Their syntax has considerable coverage.

>From ingria@BBN.COM Fri Nov 16 18:59:02 1990
There is a recent book by Gerald Gazdar that contains a
grammar of English ready to be typed in, along with a parser.  The
book comes in LISP, POP-11, and PROLOG versions.
[The Prolog version is Natural Language Processing in Prolog, An Introduction
to Computational Linguistics, Gazdar and Mellish, 1989, Addison-Wesley,
ISBN  0-201-18053-7.  It is a nice introduction to CL.  The grammar is
only a small subset that illustrates the basic problems of NLP and the
features of CFGs and PATR-II.  Bill]
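[As a taste of what such a textbook grammar looks like, here is a small
illustrative sketch in Python.  It is my own toy example, not code or a
grammar from the book: a handful of invented CFG rules and a naive
top-down recognizer, just to show the flavour of the small grammars
these introductions work with.  Bill]

    # Toy context-free grammar (invented rules and lexicon, for illustration only)
    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["Det", "N"], ["PN"]],
        "VP": [["V", "NP"], ["V"]],
    }
    LEXICON = {
        "Det": {"the", "a"},
        "N":   {"dog", "cat"},
        "PN":  {"kim"},
        "V":   {"chased", "slept"},
    }

    def parse(cat, words):
        # Yield the number of leading words that can be analysed as category 'cat'.
        if cat in LEXICON:
            if words and words[0] in LEXICON[cat]:
                yield 1
            return
        for rhs in GRAMMAR.get(cat, []):
            yield from expand(rhs, words)

    def expand(rhs, words):
        # Match the sequence of categories 'rhs' against a prefix of 'words'.
        if not rhs:
            yield 0
            return
        for used in parse(rhs[0], words):
            for rest in expand(rhs[1:], words[used:]):
                yield used + rest

    def recognises(sentence):
        words = sentence.lower().split()
        return any(n == len(words) for n in parse("S", words))

    print(recognises("the dog chased a cat"))   # True
    print(recognises("dog the chased"))         # False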

>From kalita@linc.cis.upenn.edu Thu Nov  8 18:59:40 1990
Subject: Re: Pointers to English grammars, lexicons, corpora?

For a reasonably large grammar of English, please contact 
Kathleen Bishop of the University of Pennsylvania. She is a 
Ph.D. student in the Dept. of Computer Science.  She and a few 
other students have been working on a large grammar for 
a couple of years. Her address is   
   bishop@linc.cis.upenn.edu

She is working under the supervision of Dr. Joshi and her
grammar is based on Dr. Joshi's TAG (Tree Adjoining Grammar)
formalism.

Jugal Kalita

>From oldbin@harvard.harvard.edu Mon Nov 12 23:57:23 1990
Regarding your inquiry about English grammars, here are a few names
I know of, but unfortunately I have no specific publication dates or
places (sorry!):

1. Jane Robinson published an English grammar in a paper titled DIAGRAM
2. Otto Jespersen (a Dane) published a good English grammar
3. Woods developed a system called LUNAR in 1978 which included an
   English grammar
4. Partee and Schachter (at UCLA) developed an English grammar
5. Gary Hendrix (1978) developed a system called LIFER which allowed
   the end-user to actually extend the language
6. Terry Winograd developed a system called SHRDLU which included an
   English interface

---Lexicons---
>From ingria@BBN.COM Fri Nov 16 18:59:02 1990
Martha Evens is putting together a lexicon for distribution via ACL's
Data Collection Initiative.  There's also the Brandeis Verb Lexicon.

>From IDE@vaxsar.bitnet Mon Nov 19 14:59:56 1990
You should above all be aware of the Oxford Text Archive, which has lots of
dictionaries and word lists, all free and most with no strings attached.  I am
sending the archive's short list in a separate file.

If you are doing non-profit research you can get some dictionaries easily: the
OALD 3rd ed. is easy to get from Oxford U Press.  Longman is reasonably
forthcoming about LDOCE as well.  I can give you contact names if you are interested.

>From att!druwa!mfogg@relay.EU.net Tue Nov 13 18:59:18 1990
----------------------------------------------------------------------
  In the more obscure computer programming magazines there are some tiny
advertisements for products with names like Moby Parts of Speech
(seriously!).  I have some documentation around (or had).  As I
recall they were in the $250-$500 range, and were C source code.
I think that a complete parser/differentiator algorithm could
be built with these.  The address is:

The Austin Code Works
11100 Leafwood Lane
Austin, Texas  78750-3464
512 258 0785
        512 258 1342 (fax)
email info@acw.com

Moby Parts of Speech has 200,000 words organized under part-of-speech
headings, supplied as source code, for $200.

>From mgross@nosc.mil (Michelle Gross)
The point of contact for the OED on CD-ROM is
Royalynn O'Connor at 212-889-0206.

The mail order form is dated 9/88 and gives a part number of
0-944674-00-3.

The address is 
Oxford Electronic Publishing
Oxford University Press 
200 Madison Avenue
NY NY 10016

The cost is $950 each + $10 for shipping and handling

---Corpora---
[Various people suggested contacting the Association for Computational
Linguistics, which I did...see below.
Hans van Halteren (cor_hvh@kunrc1.urc.kun.nl) of the TOSCA Group at
the University of Nijmegen in the Netherlands sent me a very interesting
report.  It describes the TOSCA projects to create an English language corpus
with not just part of speech tagging, but also full syntactic and some
semantic analysis.  The report ("TOSCA", The Nijmegen Research Group for Corpus
Linguistics) is available from:  English Department, University of Nijmegen,
Erasmusplein 1, 6525 HT Nijmegen, the Netherlands, phone 080-512842/512157,
e-mail COR_HVH@HNYKUN52 (bitnet?).
Bill Kelly]

>From walker@flash.bellcore.com Tue Nov 27 03:57:31 1990
Subject: Re: English corpora

You should first check the LOB corpus at Birmingham, which I think
also has part of speech tagging.  Start with Stig Johansson at the
University of Oslo (h_johansson%use.uio.uninett@nac.no
or h_johansson%use.uio.uninett@norunit.bitnet).

The Brown corpus is available in a tagged version as well.  I'm not
quite sure how you get in touch with them.
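
[For what it's worth, tagged corpora of this family are usually distributed
as plain text with one word/TAG pair per token.  The exact tag set and
layout differ between releases, so the following Python fragment is only a
sketch of the general idea, with invented example tokens.  Bill]

    def read_tagged(line):
        # Split a line of word/TAG tokens into (word, tag) pairs.
        pairs = []
        for token in line.split():
            word, _, tag = token.rpartition("/")
            pairs.append((word, tag))
        return pairs

    print(read_tagged("The/AT dog/NN chased/VBD the/AT cat/NN ./."))
    # [('The', 'AT'), ('dog', 'NN'), ('chased', 'VBD'),
    #  ('the', 'AT'), ('cat', 'NN'), ('.', '.')]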

Della Summers at Longman just announced the availability of a large
text corpus:
Della Summers, Divisional Director
Dictionaries and Reference Division
Longman Group UK Ltd
Longman House
Burnt Mill, Harlow
Essex CM20 2JE, ENGLAND
PHONE: 44-279-26721
FAX: 44-279-31059
TELEX: 81259 Longmn G

Furthermore, there is a movement underway to develop a British
National Corpus.  Contact Jeremy Clear at Oxford University Press
for details (44-865-56767).

In addition, the ACL is preparing to release a substantial amount
of material on CDROM, some of which may be tagged.  However, it
is not as systematic as the materials referenced in the previous
paragraph.  Contact Mark Liberman (myl@unagi.cis.upenn.edu) at the
University of Pennsylvania for details.

>From FAFKH@NOBERGEN.bitnet
Subject: LOB & Brown Corpora

I enclose some material about our ICAME texts.  Please note that these
texts are for academic research and that we have restrictions on the use
of the material.

Knut Hofland

The Norwegian Computing Centre for the Humanities
Street adr: Harald Haarfagres gt. 31
Post adr:   P.O. Box 53, University, N-5027 Bergen, Norway
Tel: +47 5 212954/5/6 Fax: +47 5 322656

[The International Computer Archive of Modern English (ICAME),
P.O. Box 53 Universitetet,
N-5027 Bergen,
Norway
offers various versions of the LOB corpus (written British English), the 
Brown corpus (written American English), the London-Lund Corpus 
(educated spoken British English in orthographic transcription),
the Melbourne-Surrey Corpus (Australian newspaper texts),
the Kolhapur Corpus (printed Indian English texts), 
the Lancaster Spoken English Corpus, and the Polytechnic of Wales Corpus 
(100,000 words of analysed children's speech).

Most of the materials are available on tape, some on diskette.
Prices range from 700 to 2400 Norwegian kroner; I believe the exchange
rate is about 6-7 kroner to the US dollar.

"Most of the material has been described in greater
detail in previous issues of our journal. Prices and technical
specifications are given on the order forms which accompany the
journal. Note that tagged versions of the Brown Corpus cannot
be obtained through ICAME."

The corpora are for research purposes and may not be redistributed.
Contact the ICAME for full details.  Bill Kelly]

>From winarske@divsun.unige.ch (Amy Winarske)
> Organization: University of Geneva, Switzerland
For info on the Brown corpus, try asking Prof. Mitch Marcus at the University
of Pennsylvania.  He's heading up the Penn Treebank project, a DARPA-funded
mega-project to tag and store the Brown corpus, as well as other bodies of
English text.  His address is mitch@linc.cis.penn.edu
[Note: the Brown corpus is available through the ICAME, which also
distributes the LOB corpus.  Bill Kelly.]

>From IDE@vaxsar.bitnet Mon Nov 19 14:59:56 1990
I can tell you of several things that answer some of your questions. You should
first of all know about the ACL/DCI, which is gathering exactly the sort of
information you are interested in for public dissemination. Contact Donald
Walker, walker@flash.bellcore.com. Among the materials are corpora with full
syntactic (tree) tagging; this is being done by Mitch Marcus
(mitch@cis.upenn.edu) at Penn. The collection itself is overseen by Mark
Liberman (myl@unagi.cis.upenn.edu).


---Misc---
From rpg@rex.cs.tulane.edu Thu Nov  8 18:59:52 1990
(Robert Goldman)
..I got this information from Garside, Leech and Sampson: _The
Computational Analysis of English: a corpus-based approach_, Longman,
1987, which I recommend highly.

>From IDE@vaxsar.bitnet Mon Nov 19 14:59:56 1990
You should also be aware of the HUMANIST discussion group, where you would hear
about a lot of this: send to HUMANIST@BROWNVM.