hsu_wh@jhunix.HCF.JHU.EDU (William H Hsu) (06/27/91)
Could someone point me in the direction of some code for fast analysis of text files? I am looking for C source to do this, or bibliographic references which discuss it. I know there must be a lot of code out there, because last year I saw five or more posted requests for 1 MB+ test file samples for analysis.

What I am trying to get is code which will scan a text file and determine, in minimal time, whether it is normal English (or Roman-alphabet text, e.g., French without non-ASCII characters), a converted binary file (e.g., BinHex'ed, uuencoded), ANSI "graphics", or source code (assuming source code is sufficiently different from English to be distinguishable in a relatively short amount of time). I understand that there is probably a significant accuracy tradeoff as file size decreases, so for convenience it can be assumed that only files above 1 or 2K are analyzed.

Does such code exist, and if so, where can one obtain it? And what is the fastest implementation?
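For reference, here is a minimal sketch of the kind of classifier I have in mind, based on simple character-class frequencies over the first 2K of each file. The category names, thresholds, and the 2K sample size are my own guesses, not taken from any existing package, so treat it as a starting point rather than a working answer:

/*
 * classify.c -- rough sketch of a file-type guesser.
 * All thresholds below are ad hoc assumptions.
 *
 * Compile: cc -o classify classify.c
 * Usage:   classify file ...
 */
#include <stdio.h>
#include <ctype.h>

#define SAMPLE 2048          /* only examine the first 2K of the file */

static const char *guess(const unsigned char *buf, int n)
{
    int letters = 0, spaces = 0, punct = 0, ctrl = 0, high = 0;
    int braces = 0, esc_seq = 0, i;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c == 033 && i + 1 < n && buf[i + 1] == '[')
            esc_seq++;                                   /* ANSI escape sequence */
        if (c >= 128)                                    high++;
        else if (isalpha(c))                             letters++;
        else if (c == ' ' || c == '\n' || c == '\t')     spaces++;
        else if (c == '{' || c == '}' || c == ';' || c == '#') braces++;
        else if (isprint(c))                             punct++;
        else                                             ctrl++;
    }

    if (n == 0)                          return "empty";
    if (esc_seq > 2)                     return "ANSI graphics";
    if (high * 10 > n || ctrl * 10 > n)  return "raw binary (or 8-bit text)";
    /* uuencode/BinHex output is printable but much heavier in
       punctuation-range characters than ordinary English prose */
    if (punct > letters)                 return "converted binary (uuencode/BinHex?)";
    if (braces * 40 > n)                 return "source code";
    if ((letters + spaces) * 10 > n * 8) return "English-like text";
    return "unknown";
}

int main(int argc, char **argv)
{
    unsigned char buf[SAMPLE];
    FILE *fp;
    int i, n;

    for (i = 1; i < argc; i++) {
        if ((fp = fopen(argv[i], "rb")) == NULL) {
            perror(argv[i]);
            continue;
        }
        n = fread(buf, 1, SAMPLE, fp);
        fclose(fp);
        printf("%s: %s\n", argv[i], guess(buf, n));
    }
    return 0;
}

Since it only reads a fixed-size sample, the cost per file is constant regardless of file length, which is the sort of speed I am after; what I don't know is whether single-character frequencies are accurate enough, or whether existing code uses digraph/word statistics instead.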