edwards@cogsci.berkeley.edu (Jane Edwards) (02/06/90)

In a Jan. 7 article <6385@ucdavis.ucdavis.edu> ez000441@deneb.ucdavis.edu
(Ron Goldthwaite) asked about the availability online of transcripts of
conversations and narratives. I summarize below the ones I know of.
If you know of others, I would very much like to hear from you, as I am
trying to prepare a reasonably complete list of the major ones for
publication in a book on related topics later this spring.

So far as I know, the biggest archive project is the Oxford Text Archive,
with about 450 separate collections of written texts and transcripts of
spoken language ("corpora"). Most are from written sources (e.g., literary
classics), but it also has some well-known spoken language corpora, such
as the Lancaster-Oslo-Bergen (LOB) and London-Lund corpora. Most of the
holdings are in English, but a wide range of other languages are also
represented: Dutch, French, Hebrew, Latvian, German, Icelandic, Gaelic,
Coptic, Malayan, etc. The Oxford Text Archive also distributes information
concerning the holdings of four other archives: U. of Cambridge, U. of Pisa,
U. of Pennsylvania, and Brigham Young U. Oxford Text Archive address:
archive@uk.ac.ox.vax (JANET), archive%vax.ox.ac.uk@ucl.cs.edu (EDU),
archive%vax.ox.ac.uk@ukacrl.earn (BITNET). One of their written holdings
is the BROWN CORPUS (asked about in a recent nl-kr digest), which is
composed of 500 written language samples of about 2000 words each, drawn from
a range of written styles of English printed in 1961 (described in Kucera and
Francis, 1967, _Computational analysis of present-day American English_).
This corpus is not used widely in linguistic research (though perhaps in
Literature, or Humanities) because the data are: (a) from written rather
than spoken language sources, and (b) 30 years old. The large "Australian
Corpus Project" (described in Kyto, et al. (eds.), 1988, _Corpus linguistics:
hard and soft_, and in the book review in _Language_, 1989, 65(4), 843-848),
may provide a needed updated sampling of a wide range of written
(Australian/British) English, and some spoken English as well.

Another big archive project is the CHILDES project, at Carnegie-Mellon
(Brian MacWhinney, brian@andrew.cmu.edu). While most of their data are from
children speaking to adults, they also distribute adult written and adult
spoken language corpora from the CORNELL project. The spoken samples
range from abortion debates to the Patty Hearst trial to TV sitcoms.
There are a fair number of typographical errors, unfortunately, including
some which most spell-checkers would overlook (e.g., "feint" for "faint").
But it is a diverse, highly useful and recent collection.

For SPOKEN spontaneous adult English, the best and biggest is probably the
London-Lund corpus (described in Svartvik & Quirk, 1980, _A corpus of
spoken English_, and Svartvik, et al., 1982, _Survey of Spoken English_),
available through the Oxford Text Archive. These data include conversations
by people of various ages, occupations, etc., recorded under various
circumstances. They have rich prosodic marking, and have been of enormous
benefit to a wide range of linguistic investigations. A drawback for some
American users is that the data are British English. Another
big archive of spoken (British) English is the Lancaster-Oslo-Bergen (LOB)
archive (52,000 words, prosodic marking, as close to RP as possible),
also available through the Oxford Text Archive.

For SPOKEN ADULT AMERICAN English, there is, to my knowledge, no publicly
accessible archive as large as those just mentioned. At Berkeley, we have
a collection of various types of spoken interaction (from conversations,
to the Oliver North trial, to lectures), collected and contributed mostly
by professors here, and intended mainly for local use at this time. The
ethnomethodological corpora mentioned in article
<1990Jan18.074947.28456@agate.berkeley.edu> by sp299-ad@violet.berkeley.edu
(Celso Alvarez) also warrant looking into. An enormous archive of spoken
American English is presently in the planning stages at UC Santa Barbara
to fill the need for a large-scale archive sampling a wide range of types
of adult spoken American English discourse.

The 1987 Linguistic Society of America questionnaire turned up
many private data sets, but relatively few of them on computer.
The trend toward computerizing such data is increasing very rapidly,
and with it discussion of standards, normalization, etc. As that
happens, more of these data sets may become publicly available.

In Germany, two archives warrant mention. One is in Mannheim (for which
I have no email address or contact person) and contains various types
of data in the German language. The other is at Univ. of Ulm (designed
and coordinated by Erhard Mergenthaler, LU07@DMARUM8.bitnet, author of
_Textbank systems: Computer science applied in the field of psychoanalysis_
1985), and contains a large number of psychotherapy sessions and interviews
(most in monolingual German, some in monolingual English).

In the Netherlands (Max-Planck-Institut fuer Psycholinguistik, Nijmegen,
helmut@hnympi51.bitnet), there is the European Science Foundation
Second Language Data Bank, containing transcripts of 10 groups of adult
migrant workers learning the language of their "host" country (e.g., Turks
learning German or Dutch, Punjabis learning English, Moroccans learning
French, Spaniards and Finns learning Swedish).

These are all of the ones that I know about. If you know of others,
or have email addresses for those above that I lack, I would very much
appreciate hearing from you, and will summarize and post responses received.

Thanks,
Jane Edwards (edwards@cogsci.berkeley.edu)
Cognitive Science Program, UC Berkeley