[net.internat] Texts in other languages

macrakis@harvard.UUCP (Stavros Macrakis) (11/20/85)

For an experiment in text compression, I would find it useful to have
a collection of texts in a variety of languages.  Ideally, I would
like a half-dozen distinct texts, each 2000-15000 words long, in each
language.  The texts should be in a consistent (documented)
transcription, preferably without formatting commands.  The texts need
not be selected to be `representative' of the language.  For instance,
technical papers are fine.  The languages in which I am interested are
French, Italian, German, (Modern) Greek, Arabic, and Turkish.  If you
have texts in other languages, please let me know.

If you could send me mail describing the texts you might be able to
provide, we can find some way of transferring them later.

  Thanks
	-s

	Macrakis@Harvard.{Harvard.EDU,ARPA,uucp,csnet}
		@Harvunxh.bitnet

inc@fluke.UUCP (Gary Benson) (12/02/85)

> For an experiment in text compression, I would find it useful to have
> a collection of texts in a variety of languages.  Ideally, I would
> like a half-dozen distinct texts, each 2000-15000 words long, in each
> language.  The texts should be in a consistent (documented)
> transcription, preferably without formatting commands.  The texts need
> not be selected to be `representative' of the language.  For instance,
> technical papers are fine.  The languages in which I am interested are
> French, Italian, German, (Modern) Greek, Arabic, and Turkish.  If you
> have texts in other languages, please let me know.
> 
> If you could send me mail describing the texts you might be able to
> provide, we can find some way of transferring them later.
> 
>   Thanks
> 	-s
> 
> 	Macrakis@Harvard.{Harvard.EDU,ARPA,uucp,csnet}
> 		@Harvunxh.bitnet

When Xerox bought Diablo and got into the high-speed character printer
business, they used a computer to generate a paragraph of "standard
English text". It was nonsense to read, of course, but it contained
the proper distribution of letters, word lengths, and sentence lengths.
They used the text to time their printers. Why not use a "standard
text" from each of the languages you are interested in? It seems to me
that way you would get a clearer picture of the performance of a compression
algorithm *on text in the language of interest* than you will with a
transcription, which to my mind at least is really just a term meaning
a translation.
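
To make the "standard text" idea concrete, here is a rough sketch of how
such nonsense text might be generated from a sample in any language. It
is only an illustration, not the method Xerox used: the function names
build_model and nonsense_text are made up for this example, it matches
only the letter and word-length distributions of the sample (sentence
lengths are ignored), and it draws each letter independently, so letter
pairs and real words are not preserved.

```python
import random
from collections import Counter


def build_model(sample):
    """Count letter and word-length frequencies in a sample text.

    Hypothetical helper: punctuation and digits are stripped, and
    letters are lowercased, before counting.
    """
    words = ["".join(ch for ch in w if ch.isalpha()) for w in sample.split()]
    words = [w.lower() for w in words if w]
    letters = Counter(ch for w in words for ch in w)
    lengths = Counter(len(w) for w in words)
    return letters, lengths


def nonsense_text(sample, n_words, seed=0):
    """Generate n_words of nonsense whose letter and word-length
    frequencies approximate those of the sample."""
    letters, lengths = build_model(sample)
    if not letters:
        raise ValueError("sample contains no letters")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    letter_pool, letter_weights = zip(*sorted(letters.items()))
    length_pool, length_weights = zip(*sorted(lengths.items()))
    out = []
    for _ in range(n_words):
        # Pick a word length, then fill it with independently drawn letters.
        k = rng.choices(length_pool, weights=length_weights)[0]
        out.append("".join(rng.choices(letter_pool, weights=letter_weights, k=k)))
    return " ".join(out)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(nonsense_text(sample, 12, seed=1))
```

Because the letters are drawn independently, a compressor will probably
behave differently on this output than on real prose, which has
repeated words and common letter sequences. That is one reason real
texts may still be the better test material.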

-- 
 Gary Benson  *  John Fluke Mfg. Co.  *  PO Box C9090  *  Everett WA  *  98206
   MS/232-E  = =   {allegra} {uw-beaver} !fluke!inc   = =   (206)356-5367
 _-_-_-_-_-_-_-_-ascii is our god and unix is his profit-_-_-_-_-_-_-_-_-_-_-_