macrakis@harvard.UUCP (Stavros Macrakis) (11/20/85)
For an experiment in text compression, I would find it useful to have a collection of texts in a variety of languages. Ideally, I would like a half-dozen distinct texts, each 2000-15000 words long, in each language. The texts should be in a consistent (documented) transcription, preferably without formatting commands. The texts need not be selected to be `representative' of the language. For instance, technical papers are fine. The languages in which I am interested are French, Italian, German, (Modern) Greek, Arabic, and Turkish. If you have texts in other languages, please let me know.

If you could send me mail describing the texts you might be able to provide, we can find some way of transferring them later.

Thanks
-s

Macrakis@Harvard.{Harvard.EDU,ARPA,uucp,csnet}
@Harvunxh.bitnet
inc@fluke.UUCP (Gary Benson) (12/02/85)
> For an experiment in text compression, I would find it useful to have
> a collection of texts in a variety of languages. Ideally, I would
> like a half-dozen distinct texts, each 2000-15000 words long, in each
> language. The texts should be in a consistent (documented)
> transcription, preferably without formatting commands. The texts need
> not be selected to be `representative' of the language. For instance,
> technical papers are fine. The languages in which I am interested are
> French, Italian, German, (Modern) Greek, Arabic, and Turkish. If you
> have texts in other languages, please let me know.
>
> If you could send me mail describing the texts you might be able to
> provide, we can find some way of transferring them later.
>
> Thanks
> -s
>
> Macrakis@Harvard.{Harvard.EDU,ARPA,uucp,csnet}
> @Harvunxh.bitnet

When Xerox bought Diablo and got into the high-speed character printer business, they used a computer to generate a paragraph of "standard English text". It was nonsense to read, of course, but it contained the proper distribution of letters, word lengths, and sentence lengths. They used the text to time their printers.

Why not use a "standard text" from each of the languages you are interested in? It seems to me that this way you would get a clearer picture of the performance of a compression algorithm *on text in the language of interest* than you will with a transcription, which to my mind at least is really just a term meaning a translation.

--
Gary Benson * John Fluke Mfg. Co. * PO Box C9090 * Everett WA * 98206
MS/232-E = = {allegra} {uw-beaver} !fluke!inc = = (206)356-5367
_-_-_-_-_-_-_-_-ascii is our god and unix is his profit-_-_-_-_-_-_-_-_-_-_-_
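
The "standard text" idea in the reply above might be sketched roughly as follows. This is only an illustration, not the procedure Xerox actually used, which the post does not describe. It assumes that letter frequencies and word lengths are measured from a sample text supplied by the user and are drawn independently of each other; sentence lengths are left out for brevity. The names `build_model` and `generate` are hypothetical.

```python
import random
from collections import Counter


def build_model(sample):
    """Count letter frequencies and word lengths in a sample text."""
    # Keep only alphabetic characters so punctuation does not skew the counts.
    words = ["".join(ch for ch in w.lower() if ch.isalpha())
             for w in sample.split()]
    words = [w for w in words if w]
    letters = Counter(ch for w in words for ch in w)
    lengths = Counter(len(w) for w in words)
    return letters, lengths


def generate(letters, lengths, n_words, seed=0):
    """Emit nonsense words whose letters and lengths follow the counts."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    alphabet, letter_weights = zip(*sorted(letters.items()))
    sizes, size_weights = zip(*sorted(lengths.items()))
    out = []
    for _ in range(n_words):
        # Assumption: word length and letters are sampled independently.
        size = rng.choices(sizes, weights=size_weights)[0]
        out.append("".join(rng.choices(alphabet, weights=letter_weights, k=size)))
    return " ".join(out)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    letters, lengths = build_model(sample)
    print(generate(letters, lengths, 20))
```

Text generated this way has no repeated words or common letter sequences beyond what chance produces, so a compressor that exploits such repetition would probably do worse on it than on real prose in the same language.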