emv@ox.com (Ed Vielmetti) (03/20/91)
Archive-name: compression/corpus/text-compression-corpus/1991-03-18
Archive-directory: fsa.cpsc.ucalgary.ca:/pub/text.compression.corpus/ [136.159.2.1]
Original-posting-by: emv@ox.com (Ed Vielmetti)
Original-subject: Re: Bell-Witten-Cleary request
Reposted-by: emv@msen.com (Edward Vielmetti, MSEN)

When posting FTP sites, please include (where known) *both* the full host/domain name *and* its IP address. Domains were created to make host hierarchies easier to remember -- let's use them liberally.

actually, if you want to make it easier on everyone, this is the format I would use:

    fsa.cpsc.ucalgary.ca:/pub/text.compression.corpus/ [136.159.2.1]

Here's the README file from that directory. Good stuff!

--
 Msen   Edward Vielmetti
/|---   moderator, comp.archives
        emv@msen.com

Welcome to the Calgary/Canterbury text compression corpus.

This corpus is used in the book

    Bell, T.C., Cleary, J.G. and Witten, I.H. Text compression. Prentice Hall, Englewood Cliffs, NJ, 1990

and in the survey paper

    Bell, T.C., Witten, I.H. and Cleary, J.G. "Modeling for text compression," Computing Surveys 21(4): 557-591; December 1989,

to evaluate the practical performance of various text compression schemes. Several other researchers are now using the corpus to evaluate text compression schemes.

Nine different types of text are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative. Normal English, both fiction and non-fiction, is represented by two books and six papers (labeled book1, book2, paper1, paper2, paper3, paper4, paper5, paper6). More unusual styles of English writing are found in a bibliography (bib) and a batch of unedited news articles (news). Three computer programs represent artificial languages (progc, progl, progp). A transcript of a terminal session (trans) is included to indicate the increase in speed that could be achieved by applying compression to a slow line to a terminal.
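As a rough illustration of the kind of measurement this implies, the sketch below computes compressed bits per input character for a buffer and the resulting effective speed of a slow terminal line. It is only a sketch under stated assumptions: Python's zlib (DEFLATE) is a convenient stand-in and is not claimed to be one of the schemes evaluated in the book or paper, and the function names, the sample text, and the 2400 bps line speed are all hypothetical.

```python
import zlib


def bits_per_char(data: bytes, level: int = 9) -> float:
    """Compressed size in bits divided by the number of input bytes."""
    if not data:
        raise ValueError("cannot measure an empty input")
    return 8 * len(zlib.compress(data, level)) / len(data)


def effective_rate(line_bps: float, bpc: float) -> float:
    """Uncompressed-equivalent bits per second when every 8-bit
    character costs only `bpc` bits on the line."""
    return line_bps * 8 / bpc


# Hypothetical stand-in for a terminal transcript; to measure the real
# file, read it instead, e.g. open("trans", "rb").read().
sample = b"$ ls -l\ntotal 0\n$ pwd\n/pub/text.compression.corpus\n" * 100

bpc = bits_per_char(sample)
print(f"{bpc:.2f} bits/char")
print(f"a 2400 bps line behaves like {effective_rate(2400, bpc):.0f} bps")
```

The figure for the repeated sample is far more optimistic than a real transcript would give; running the same two functions over the actual corpus files gives a like-for-like comparison across text types.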
All of the files mentioned so far use ASCII encoding. Some non-ASCII files are also included: two files of executable code (obj1, obj2), some geophysical data (geo), and a bit-map black and white picture (pic). The file geo is particularly difficult to compress because it contains a wide range of data values, while the file pic is highly compressible because of large amounts of white space in the picture, represented by long runs of zeros.

More details of the individual texts are given in the book mentioned above. Both the book and the paper give the results of compression experiments on these texts.

The corpus itself consists of the files bib, book1, book2, geo, news, obj1, obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl, progp and trans. (The book and paper above do not give results for files paper3, paper4, paper5 or paper6.) The directory "index" contains the sizes of the files and some information about where they came from.

Ian H. Witten                     Timothy C. Bell
Computer Science Department       Computer Science Department
University of Calgary             University of Canterbury
Calgary T2N 1N4, Canada           Christchurch 1, New Zealand
Phone (403) 220-6780              Phone (64-3) 642352
email: ian@cpsc.UCalgary.CA       email: tim@cosc.canterbury.ac.nz