edwards@cogsci.berkeley.edu (Jane Edwards) (02/06/90)
In a Jan. 7 article <6385@ucdavis.ucdavis.edu> ez000441@deneb.ucdavis.edu (Ron Goldthwaite) asked about the availability online of transcripts of conversations and narratives. I summarize below the ones I know of. If you know of others, I would very much like to hear from you, as I am trying to prepare a reasonably complete list of the major ones for publication in a book on related topics later this spring.

So far as I know, the biggest archive project is the Oxford Text Archive, with about 450 separate collections of written texts and transcripts of spoken language ("corpora"). Most are from written sources (e.g., literary classics), but it also has some well-known spoken language corpora, such as the Lancaster-Oslo-Bergen (LOB) and London-Lund corpora. Most of the holdings are in English, but a wide range of other languages is also represented: Dutch, French, Hebrew, Latvian, German, Icelandic, Gaelic, Coptic, Malayan, etc. The Oxford Text Archive also distributes information concerning the holdings of 4 other archives: U. of Cambridge, U. of Pisa, U. of Pennsylvania, and Brigham Young U.

Oxford Text Archive address:
    archive@uk.ac.ox.vax (JANET)
    archive%vax.ox.ac.uk@ucl.cs.edu (EDU)
    archive%vax.ox.ac.uk@ukacrl.earn (BITNET)

One of their written holdings is the BROWN CORPUS (asked about in a recent nl-kr digest), which is composed of 500 written language samples of 2000 words each, from a range of written styles of English printed in 1961 (described in Kucera and Francis, 1967, _Computational analysis of present-day American English_). This corpus is not used widely in linguistic research (though perhaps in Literature, or Humanities) because the data are: (a) from written rather than spoken language sources, and (b) 30 years old. The large "Australian Corpus Project" (described in Kyto, et al.
(eds.), 1988, _Corpus linguistics: hard and soft_, and in the book review in _Language_, 1989, 65(4), 843-848) may provide a needed updated sampling of a wide range of written (Australian/British) English, and some spoken English as well.

Another big archive project is the CHILDES project, at Carnegie-Mellon (Brian MacWhinney, brian@andrew.cmu.edu). While most of their data come from children speaking to adults, they also distribute adult written and adult spoken language corpora from the CORNELL project. The spoken samples range from abortion debates to the Patty Hearst trial to TV sitcoms. There are a fair number of typographical errors, unfortunately, including some which most spell-checkers would overlook (e.g., "feint" for "faint"). But it is a diverse, highly useful and recent collection.

For SPOKEN spontaneous adult English, the best and biggest is probably the London-Lund corpus (described in Svartvik & Quirk, 1980, _A corpus of spoken English_, and Svartvik, et al., 1982, _Survey of Spoken English_), available through the Oxford Text Archive. These data include conversations by people of various ages, occupations, etc., recorded under various circumstances. They have rich prosodic marking, and have been of enormous benefit to a wide range of linguistic investigations. A drawback for Americans, for some purposes, is that the data are British English.

Another big archive of spoken (British) English is the Lancaster-Oslo-Bergen (LOB) archive (52,000 words, prosodic marking, as close to RP as possible), also available through the Oxford Text Archive.

For SPOKEN ADULT AMERICAN English, there is, to my knowledge, no publicly accessible archive as large as those just mentioned. At Berkeley, we have a collection of various types of spoken interaction (from conversations, to the Oliver North trial, to lectures), collected and contributed mostly by professors here, and intended mainly for local use at this time.
The ethnomethodological corpora mentioned in article <1990Jan18.074947.28456@agate.berkeley.edu> by sp299-ad@violet.berkeley.edu (Celso Alvarez) also warrant looking into. An enormous archive of spoken American English is presently in the planning stages at UC Santa Barbara, to fill the need for a large-scale archive sampling a wide range of types of adult spoken American English discourse.

The 1987 Linguistic Society of America questionnaire turned up many private data sets, but relatively few of them on computer. The trend toward putting such data on computer is increasing rapidly, and with it discussion of standards, normalization, etc.; as that happens, more of them may become generally available.

In Germany, two archives warrant mention. One is in Mannheim (for which I have no email address or contact person) and contains various types of data in the German language. The other is at Univ. of Ulm (designed and coordinated by Erhard Mergenthaler, LU07@DMARUM8.bitnet, author of _Textbank systems: Computer science applied in the field of psychoanalysis_, 1985), and contains a large number of psychotherapy sessions and interviews (most in monolingual German, some in monolingual English).

In the Netherlands (Max-Planck-Institut fuer Psycholinguistik, Nijmegen, helmut@hnympi51.bitnet), there is the European Science Foundation Second Language Data Bank, containing transcripts of 10 groups of adult migrant workers learning the language of their "host" country (e.g., Turks learning German or Dutch, Punjabis learning English, Moroccans learning French, Spaniards and Finns learning Swedish).

These are all of the ones that I know about. If you know of others, or have email addresses for those above which I don't, I would very much appreciate hearing from you, and will summarize and post responses received.

Thanks,
Jane Edwards (edwards@cogsci.berkeley.edu)
Cognitive Science Program, UC Berkeley