[sci.lang] transcripts of conversations

edwards@cogsci.berkeley.edu (Jane Edwards) (02/06/90)
In a Jan. 7 article <6385@ucdavis.ucdavis.edu> ez000441@deneb.ucdavis.edu
(Ron Goldthwaite) asked about the availability online of transcripts of 
conversations and narratives.  I summarize below the ones I know of.  
If you know of others, I would very much like to hear from you, as I am 
trying to prepare a reasonably complete list of the major ones for 
publication in a book on related topics later this spring. 

So far as I know, the biggest archive project is the Oxford Text Archive, 
with about 450 separate collections of written texts and transcripts of 
spoken language ("corpora"). Most are from written sources (e.g., literary 
classics), but it also has some well-known spoken language corpora, such 
as the Lancaster-Oslo-Bergen (LOB) and London-Lund corpora.  Most of the 
holdings are in English, but a wide range of other languages are also 
represented: Dutch, French, Hebrew, Latvian, German, Icelandic, Gaelic, 
Coptic, Malayan, etc.  The Oxford Text Archive also distributes information 
concerning the holdings of four other archives: U. of Cambridge, U. of Pisa, 
U. of Pennsylvania, and Brigham Young U.  Oxford Text Archive address: 
archive@uk.ac.ox.vax (JANET), archive%vax.ox.ac.uk@ucl.cs.edu (EDU), 
archive%vax.ox.ac.uk@ukacrl.earn (BITNET).  One of their written holdings 
is the BROWN CORPUS (asked about in a recent nl-kr digest), which is 
composed of 500 written-language samples of 2000 words each, drawn from a 
range of written styles of English printed in 1961 (described in Kucera and 
Francis, 1967, _Computational analysis of present-day American English_).
This corpus is not widely used in linguistic research (though perhaps in 
literature or the humanities) because the data are (a) from written rather 
than spoken language sources, and (b) 30 years old.  The large "Australian
Corpus Project" (described in Kyto, et al. (eds.), 1988, _Corpus linguistics: 
hard and soft_, and in the book review in _Language_, 1989, 65(4), 843-848), 
may provide a needed updated sampling of a wide range of written 
(Australian/British) English, and some spoken English as well.

Another big archive project is the CHILDES project, at Carnegie-Mellon
(Brian MacWhinney, brian@andrew.cmu.edu).  While most of their data are from 
children speaking to adults, they also distribute adult written and adult
spoken language corpora from the CORNELL project.  The spoken samples
range from abortion debates to the Patty Hearst trial to TV sitcoms.
There are a fair number of typographical errors, unfortunately, including 
some which most spell-checkers would overlook (e.g., "feint" for "faint").  
But it is a diverse, highly useful and recent collection.  

For SPOKEN spontaneous adult English, the best and biggest is probably the
London-Lund corpus (described in Svartvik & Quirk, 1980, _A corpus of 
spoken English_, and Svartvik, et al., 1982, _Survey of Spoken English_), 
available through the Oxford Text Archive.  These data include conversations 
by people of various ages, occupations, etc., recorded under various
circumstances.  They have rich prosodic marking, and have been of enormous 
benefit to a wide range of linguistic investigations.  A drawback for 
Americans, for some purposes, is that the data are British English.  Another 
big archive of spoken (British) English is the Lancaster-Oslo-Bergen (LOB) 
archive (52,000 words, prosodic marking, as close to RP as possible), 
also available through the Oxford Text Archive.

For SPOKEN ADULT AMERICAN English, there is, to my knowledge, no publicly 
accessible archive as large as those just mentioned.  At Berkeley, we have 
a collection of various types of spoken interaction (from conversations, 
to the Oliver North trial, to lectures), collected and contributed mostly 
by professors here, and intended mainly for local use at this time.  The 
ethnomethodological corpora mentioned in article 
<1990Jan18.074947.28456@agate.berkeley.edu> by sp299-ad@violet.berkeley.edu 
(Celso Alvarez) also warrant looking into.  An enormous archive of spoken 
American English is presently in the planning stages at UC Santa Barbara 
to fill the need for a large-scale archive sampling a wide range of types
of adult spoken American English discourse.

The 1987 Linguistic Society of America questionnaire turned up
many private data sets, but relatively few of them on computer.
The trend toward putting such data on computer is growing rapidly, and 
with it discussion of standards, normalization, etc.; as that happens,
more of these data sets may become publicly available.

In Germany, two archives warrant mention.  One is in Mannheim (for which
I have no email address or contact person) and contains various types
of data in the German language.  The other is at the Univ. of Ulm (designed 
and coordinated by Erhard Mergenthaler, LU07@DMARUM8.bitnet, author of 
_Textbank systems: Computer science applied in the field of psychoanalysis_, 
1985), and contains a large number of psychotherapy sessions and interviews 
(most in monolingual German, some in monolingual English).  

In the Netherlands (Max-Planck-Institut fuer Psycholinguistik, Nijmegen,
helmut@hnympi51.bitnet), there is the European Science Foundation
Second Language Data Bank, containing transcripts of 10 groups of adult 
migrant workers learning the language of their "host" country (e.g., Turks
learning German or Dutch, Punjabis learning English, Moroccans learning 
French, and Spaniards and Finns learning Swedish).

So, these are all of the ones that I know about.  If you know of others, 
or have email addresses for those above that I lack, I would very much 
appreciate hearing from you, and will summarize and post responses received.  
Thanks,

Jane Edwards  (edwards@cogsci.berkeley.edu)
Cognitive Science Program, UC Berkeley