[comp.archives] [compression] Re: Bell-Witten-Cleary request

emv@ox.com (Ed Vielmetti) (03/20/91)
Archive-name: compression/corpus/text-compression-corpus/1991-03-18
Archive-directory: fsa.cpsc.ucalgary.ca:/pub/text.compression.corpus/ [136.159.2.1]
Original-posting-by: emv@ox.com (Ed Vielmetti)
Original-subject: Re: Bell-Witten-Cleary request
Reposted-by: emv@msen.com (Edward Vielmetti, MSEN)


   When posting FTP sites, please include (where known) *both* the full
   host/domain name *and* its IP address.  Domains were created to make
   host hierarchies easier to remember -- let's use them liberally.

Actually, if you want to make it easier on everyone, this is the
format I would use:

fsa.cpsc.ucalgary.ca:/pub/text.compression.corpus/ [136.159.2.1]
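
For anyone who wants to pick site lines like this out of postings by script, here is a minimal Python sketch. The pattern is only inferred from the single example above, not from any formal specification, and the names SITE_RE and parse_site are made up for illustration.

```python
import re

# Hypothetical pattern, inferred from the one example line in this post:
#   host:/path/ [a.b.c.d]
SITE_RE = re.compile(
    r"^(?P<host>[A-Za-z0-9.-]+):(?P<path>/\S*)\s+"
    r"\[(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\]$"
)

def parse_site(line):
    """Split 'host:/path/ [a.b.c.d]' into (host, path, ip), or return None."""
    m = SITE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("host"), m.group("path"), m.group("ip")

if __name__ == "__main__":
    print(parse_site(
        "fsa.cpsc.ucalgary.ca:/pub/text.compression.corpus/ [136.159.2.1]"
    ))
```

The sketch does not check that each part of the address is in the range 0-255; it only splits the line into its three fields.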

Here's the README file from that directory.  Good stuff!

-- 
 Msen	Edward Vielmetti
/|---	moderator, comp.archives
	emv@msen.com

Welcome to the Calgary/Canterbury text compression corpus.  This corpus is used
in the book

        Bell, T.C., Cleary, J.G. and Witten, I.H. Text compression.
        Prentice Hall, Englewood Cliffs, NJ, 1990

and in the survey paper

        Bell, T.C., Witten, I.H. and Cleary, J.G. "Modeling for text
        compression," Computing Surveys 21(4): 557-591; December 1989,

to evaluate the practical performance of various text compression schemes.
Several other researchers are now using the corpus to evaluate text compression
schemes.

Nine different types of text are represented, and to confirm that the
performance of schemes is consistent for any given type, many of the types have
more than one representative.  Normal English, both fiction and non-fiction, is
represented by two books and six papers (labeled book1, book2, paper1, paper2,
paper3, paper4, paper5, paper6).  More unusual styles of English writing are
found in a bibliography (bib) and a batch of unedited news articles (news).
Three computer programs represent artificial languages (progc, progl, progp).
A transcript of a terminal session (trans) is included to indicate the increase
in speed that could be achieved by compressing the data sent over a slow line
to a terminal.  All of the files mentioned so far use ASCII encoding.  Some
non-ASCII files are also included: two files of executable code (obj1, obj2),
some geophysical data (geo), and a black-and-white bit-map picture (pic).  The
file geo is particularly difficult to compress because it contains a wide range
of data values, while the file pic is highly compressible because of large
amounts of white space in the picture, represented by long runs of zeros.
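
As a rough illustration of the contrast between geo and pic, the Python sketch below computes the order-0 entropy (bits per byte, from byte frequencies alone) of two synthetic buffers. These buffers are made-up stand-ins, not the real corpus files, and order-0 entropy is only a crude yardstick: it ignores context and run structure, which practical compression schemes exploit.

```python
import math
from collections import Counter

def order0_entropy(data):
    """Return the order-0 entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Synthetic stand-ins (assumptions for illustration, not the corpus files).
mostly_zeros = bytes(1000) + b"\x01" * 24   # a long run of zeros, as in white space
wide_range = bytes(range(256)) * 4          # every byte value equally frequent

if __name__ == "__main__":
    print("mostly zeros:", order0_entropy(mostly_zeros))
    print("wide range:  ", order0_entropy(wide_range))
```

The buffer dominated by zeros comes out far below one bit per byte, while the buffer that uses every byte value equally often comes out at the full eight bits, so a frequency-based coder has nothing to gain on it.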

More details of the individual texts are given in the book mentioned above.
Both the book and the paper give the results of compression experiments on
these texts.

The corpus itself consists of the files bib, book1, book2, geo, news, obj1,
obj2, paper1, paper2, paper3, paper4, paper5, paper6, pic, progc, progl, progp
and trans.  (The book and paper above do not give results for files paper3,
paper4, paper5 or paper6.)
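
To try a compression scheme of your own on the files listed above, something like the following Python sketch may be a starting point. It uses Python's zlib module purely as a stand-in compressor; zlib is not one of the schemes evaluated in the book or paper. The names CORPUS_FILES, bits_per_byte and survey are invented for this example, and the directory "." is a placeholder for wherever the files were fetched.

```python
import os
import zlib

# File names as listed in this README.
CORPUS_FILES = [
    "bib", "book1", "book2", "geo", "news", "obj1", "obj2",
    "paper1", "paper2", "paper3", "paper4", "paper5", "paper6",
    "pic", "progc", "progl", "progp", "trans",
]

def bits_per_byte(data):
    """Compressed size in bits divided by the original size in bytes."""
    if not data:
        return 0.0
    # zlib at level 9 is only a stand-in; substitute the scheme under test.
    return 8.0 * len(zlib.compress(data, 9)) / len(data)

def survey(corpus_dir):
    """Return {name: bits per byte} for each corpus file found in corpus_dir."""
    results = {}
    for name in CORPUS_FILES:
        path = os.path.join(corpus_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as fh:
                results[name] = bits_per_byte(fh.read())
    return results

if __name__ == "__main__":
    # "." is a placeholder directory; files that are missing are skipped.
    for name, bpb in survey(".").items():
        print(f"{name:8s} {bpb:5.2f}")
```

Reading the files in binary mode matters here, since geo, obj1, obj2 and pic are not ASCII.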

The directory "index" contains the sizes of the files and some information
about where they came from.

Ian H. Witten                           Timothy C. Bell
Computer Science Department             Computer Science Department
University of Calgary                   University of Canterbury
Calgary T2N 1N4, Canada                 Christchurch 1, New Zealand
Phone (403) 220-6780                    Phone (64-3) 642352
email: ian@cpsc.UCalgary.CA             email: tim@cosc.canterbury.ac.nz