bllklly@olsen.UUCP (William Kelly) (12/20/90)
Here is a summary of responses to my request in November for "Pointers to English grammars, lexicons, corpora". Thanks to everyone. As various people have requested a summary, I assume enough interest for a posting. Quotes are verbatim, except where I have put my comments in square brackets.

Bill Kelly, bllklly@olsen.uu.ch or bllklly@olsen.uucp

---Grammars---

>From Mark.Kantrowitz@A.GP.CS.CMU.EDU Thu Nov 8 16:24:18 1990

Grammars: The largest grammar I know of is NIGEL, which is part of the PENMAN project at Information Sciences Institute (ISI) at USC.

>From laubsch@hpljl.hpl.hp.com Thu Nov 22 03:58:31 1990

I recommend reading the following:
H. Alshawi et al "Research Programme in Natural Language Processing" Final Report, SRI International, Cambridge Centre, prepared for Alvey Project No. ALV/PRJ/IKBS/105
Their syntax has considerable coverage,

>From ingria@BBN.COM Fri Nov 16 18:59:02 1990

There was a recent book produced by Gerald Gazdar, that contains a grammar of English ready to be typed in, along with a parser. This book comes in LISP, POP-11, and PROLOG versions.

[The Prolog version is Natural Language Processing in Prolog, An Introduction to Computational Linguistics, Gazdar and Mellish, 1989, Addison-Wesley, ISBN 0-201-18053-7. It is a nice introduction to CL. The grammar is only a small subset that illustrates the basic problems of NLP and the features of CFGs and PATR-II. Bill]

>From kalita@linc.cis.upenn.edu Thu Nov 8 18:59:40 1990
Subject: Re: Pointers to English grammars, lexicons, corpora?

For a reasonably large grammar of English, please contact Kathleen Bishop of the University of Pennsylvania. She is a Ph.D. student in the Dept. of Computer Science. She and a few other students have been working on a large grammar for a couple of years. Her address is bishop@linc.cis.upenn.edu. She is working under the supervision of Dr. Joshi and her grammar is based on Dr. Joshi's TAG (Tree Adjoining Grammar) formalism.
Jugal Kalita

>From oldbin@harvard.harvard.edu Mon Nov 12 23:57:23 1990

Regarding your inquiry about English grammars, here are a few names I know of, but unfortunately, I have no specific publication times/places (sorry!):

1. Jane Robinson published an English grammar in a paper titled DIAGRAM
2. Otto Jespersen, (a Dane?), published a good English grammar
3. Woods developed a system called LUNAR in 1978 which included an English grammar
4. Partee and Shack (at UCLA) developed an English grammar
5. Gary Hendrix (1978) developed a system called LIFER which allowed the end-user to actually extend the language
6. Terry Winograd developed a system called SHRDLU which included an English interface

---Lexicons---

>From ingria@BBN.COM Fri Nov 16 18:59:02 1990

Martha Evens is putting together a lexicon for distribution via ACL's Data Collection Initiative. There's also the Brandeis Verb Lexicon.

>From IDE@vaxsar.bitnet Mon Nov 19 14:59:56 1990

You should above all be aware of the Oxford Text Archive, which has lots of dictionaries and word lists, all free and most with no strings. I am sending in a separate file the archive short list. If you are non-profit research you can get some dictionaries easily: the OALD 3 ed. is easy to get from Oxford U Press. Longman's is reasonably forthcoming about LDOCE as well. I can give you contact names if you are interested.

>From att!druwa!mfogg@relay.EU.net Tue Nov 13 18:59:18 1990
----------------------------------------------------------------------
In the more obscure computer programming mags, there are some advertisements (tiny) that have a name like Moby-parts of speech (seriously!). I have some documentation around (or had). As I recall they were about $250-$500 range, and were C source code. I think that the complete parser, differentiator algorithm could be accomplished with these.
The address is:
The Austin Code Works
11,100 Leafwood Land
Austin, Texas 78750-3464
512 258 0785 1342 (fax)
email info@acw.com

Moby Parts of Speech has 200,000 words in parts of speech headings as source code $200.

>From mgross@nosc.mil (Michelle Gross)

The Point of Contact for the OED on CD-ROM is Royalynn O'Connor at 212-889-0206
The mail order form is dated 9/88 and gives a part number of 0-944674-00-3
The address is
Oxford Electronic Publishing
Oxford University Press
200 Madison Avenue
NY NY 10016
The cost is $950 each + $10 for shipping and handling

---Corpora---

[Various people suggested contacting the Association for Computational Linguistics, which I did...see below. Hans van Halteren (cor_hvh@kunrc1.urc.kun.nl) of the TOSCA Group at the University of Nijmegen in the Netherlands sent me a very interesting report. It describes the TOSCA projects to create an English language corpus with not just part of speech tagging, but also full syntactic and some semantic analysis. The report ("TOSCA", The Nijmegen Research Group for Corpus Linguistics) is available from: English Department, University of Nijmegen, Erasmusplein 1, 6525 HT Nijmegen, Netherlands, phone 080-512842/512157, e-mail COR_HVH@HNYKUN52 (bitnet?). Bill Kelly]

>From walker@flash.bellcore.com Tue Nov 27 03:57:31 1990
Subject: Re: English corpora

You should first check the LOB corpus at Birmingham, which I think also has part of speech tagging. Start with Stig Johansson at the University of Oslo (johansson h_johansson%use.uio.uninett@nac.no or h_johansson%use.uio.uninett@norunit.bitnet). The Brown corpus is available in a tagged version as well. Not quite sure how you get in touch with them.
Della Summers at Longman just announced the availability of a large text corpus:

Della Summers, Divisional Director
Dictionaries and Reference Division
Longman Group UK Ltd
Longman House
Burnt Mill, Harlow
Essex CM20 2JE, ENGLAND
PHONE: 44-279-26721
FAX: 44-279-31059
TELEX: 81259 Longmn G

Furthermore, there is a movement underway to develop a British National Corpus. Contact Jeremy Clear at Oxford University Press for details (44-865-56767).

In addition, the ACL is preparing to release a substantial amount of material on CDROM, some of which may be tagged. However, it is not as systematic as the materials referenced in the previous paragraph. Contact Mark Liberman (myl@unagi.cis.upenn.edu) at the University of Pennsylvania for details.

>From FAFKH@NOBERGEN.bitnet
Subject: LOB & Brown Corpora

I enclose some material about our ICAME texts. Please note that these texts are for academic research and that we have restrictions for use of the material.

Knut Hofland
The Norwegian Computing Centre for the Humanities
Street adr: Harald Haarfagres gt. 31
Post adr: P.O. Box 53, University, N-5027 Bergen, Norway
Tel: +47 5 212954/5/6 Fax: +47 5 322656

[The International Computer Archive of Modern English (ICAME), P.O. Box 53 Universitetet, N-5027 Bergen, Norway offers various versions of the LOB corpus (written British English), the Brown corpus (written American English), the London-Lund Corpus (educated spoken British English in orthographic transcription), the Melbourne-Surrey Corpus (Australian newspaper texts), the Kolhapur Corpus (printed Indian English texts), the Lancaster Spoken English Corpus, and the Polytechnic of Wales Corpus (100,000 words of analysed children's speech). Most of the materials are available on tape, some on diskette. Prices range from 700 to 2400 Norwegian kroner, which I believe are about 6-7 to the US$. "Most of the material has been described in greater detail in previous issues of our journal.
Prices and technical specifications are given on the order forms which accompany the journal. Note that tagged versions of the Brown Corpus cannot be obtained through ICAME." The corpora are for research purposes and may not be redistributed. Contact the ICAME for full details. Bill Kelly]

>From winarske@divsun.unige.ch (Amy Winarske)
> Organization: University of Geneva, Switzerland

For info on the Brown corpus, try asking Prof. Mitch Marcus at the University of Pennsylvania. He's heading up the Penn treebank project, a DARPA-funded mega-project to tag and store the Brown corpus, as well as other bodies of English text. His address is mitch@linc.cis.penn.edu

[Note: the Brown corpus is available through the ICAME, which also distributes the LOB corpus. Bill Kelly.]

>From IDE@vaxsar.bitnet Mon Nov 19 14:59:56 1990

I can tell you of several things that answer some of your questions. You should first of all know about the ACL/DCI, which is gathering exactly the sort of information you are interested in for public dissemination. Contact Donald Walker, walker@flash.bellcore.com. Among the materials are corpora with full syntactic (tree) tagging; this is being done by Mitch Marcus (mitch@cis.upenn.edu) at Penn. The collection itself is overseen by Mark Liberman (myl@unagi.cis.upenn.edu).

---Misc---

>From rpg@rex.cs.tulane.edu Thu Nov 8 18:59:52 1990 (Robert Goldman)

..I got this information from Garside, Leech and Sampson: _The Computational Analysis of English: a corpus-based approach_, Longman, 1987, which I recommend highly.

>From IDE@vaxsar.bitnet Mon Nov 19 14:59:56 1990

You should also be aware of the HUMANIST discussion group, where you would hear about a lot of this: send to HUMANIST@BROWNVM.