mnr@DAISY.LEARNING.CS.CMU.EDU (Marc Ringuette) (10/16/89)
I need to find a source of large quantities (1 to 50 megabytes) of ASCII text, with a strong preference towards literature. Novels and essays are great. Newspaper or magazine articles are probably OK too.

Please send me Email at "mnr@cs.cmu.edu" if you have any ideas how I can find such text accessible from the Internet. Also let me know if I can include your response in a summary to post to comp.ai in a week or so.

Thanks!

--- Marc Ringuette /// CMU CS Dept, Pittsburgh /// mnr@cs.cmu.edu

[ Why? I need to run some statistics on word usage. Actually, I'd also appreciate pointers to dictionaries of personal and place names, and dictionaries with annotations of noun/verb/etc. Send mail! ]
mnr@daisy.learning.cs.cmu.edu (Marc Ringuette) (10/30/89)
Here is a summary of the responses I got to my request for on-line literature for doing text analysis. I haven't obtained any of the texts yet, so I can't report any success stories. Please bear in mind that these respondents may not expect a flood of requests...

------------------------------

Can you get access to a NeXT machine? I did some similar experiments some time ago, and found the complete works of Shakespeare to be a good source of text...

------------------------------

Stanford just got what are basically online copies of UPI newswires. The service is called ClariNet. You might check to see if CMU has such a service.

------------------------------

Oxford has some sort of on-line text center, but I think they only send out tapes. They have an enormous selection of stuff on-line, though. The Univ. of Waterloo is handling the Electronic Oxford English Dictionary, which is on-line, and has embedded text. Let me know if you want/need the email addresses.

Othar Hansson (othar@ernie.berkeley.edu)

------------------------------

The Brown corpus is text on the order of hundreds of thousands of words, with part of speech given too. You might ask Michael.Witbrock@cs.cmu.edu.

------------------------------

This summer I worked for Bob Berwick at MIT on projects that needed huge amounts of text. We had all of Sherlock Holmes, all the articles from the WSJ for 1988, the Oxford English Dictionary, etc. Some of this stuff we got from the MIT Media Lab. I know that all of it contains special licensing restrictions, so he probably can't just give you a copy. But he might be able to tell you where you can get your own copy. So send email to either berwick@chomsky.mit.edu or brent@reagan.ai.mit.edu (Michael Brent is the graduate student of Bob's who's doing most of the work on it.)

--Mark Kantrowitz (mkant@cs.cmu.edu)

------------------------------

From: Susan Gauch <sgauch@wellesley.edu>

Try contacting Ed Fox at Virginia Polytechnic.
He has collections of large amounts of text for distribution on optical disks. It may be available in other formats.

Also, contact John Smith at U. North Carolina (jbs@cs.unc.edu). He has found the Bible, plus several literary works, which he has online.

------------------------------

From: "Edward A. Fox" <fox@fox.cs.vt.edu>

We often have isrmac1.cs.vt.edu up with a CDROM holding lots of things you might be able to use. Try anonymous FTP - I will try to get people to keep that up.

[[ Not up as of 10/26/89...--Marc ]]

Eventually the CDROMs will be released and that will be easier.

There is also the ACL Data Collection Initiative which will soon be releasing mag tapes with lots and lots of text - you can ask to be on the list by contacting: myl%coma@research.att.com

------------------------------

You might be interested in contacting the Oxford Text Archive. They maintain a collection of works which they distribute (large texts on mag tape, alas...) for a very small charge (basically production costs). The archive also coordinates with other text archives worldwide. The archive has serviced my requests promptly and has treated me warmly. As of 11/4/87, they may be reached at:

s-mail: Oxford Text Archive
        Oxford University Computing Service
        13 Banbury Road
        Oxford OX2 6NN, UK
e-mail: archive@uk.ac.oxford.vax
phone:  +44 (865) 273238

There is also an effort within the Association for Computational Linguistics to assemble a large collection of text (for example, the 'Data Collection Initiative'; coding texts using the SGML). Contact Donald Walker at Bellcore (walker@flash.bellcore.com; 201-829-4312).

-- John Klockner <7334016@UWAVM.ACS.WASHINGTON.EDU>

------------------------------

That's it. Good luck!

-- Marc Ringuette // CMU CS Dept, Pittsburgh // Internet: mnr@cs.cmu.edu