hsu_wh@jhunix.HCF.JHU.EDU (William H Hsu) (06/27/91)
Could someone point me in the direction of some code for fast analysis of text files? I am looking for C source to do this, or bibliographic references which discuss it. I know there must be a lot of code out there, because last year I saw five or more posted requests for 1 MB+ test file samples for analysis.

What I am trying to get is code which will scan a text file and determine, in minimal time, whether it is normal English (or Roman-alphabet text, e.g., French without non-ASCII characters), a converted binary file (e.g., BinHex'ed, uuencoded), ANSI "graphics", or source code (assuming source code is sufficiently different from English to be distinguishable in a relatively short amount of time). I understand that there is probably a significant accuracy tradeoff as file size decreases, so for convenience it can be assumed that only files above 1 or 2K are analyzed.

Does such code exist, and if so, where can one obtain it? And what is the fastest implementation?
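For reference, here is a minimal sketch of the kind of classifier I have in mind, based on simple character-class frequencies over the first 2K of each file. The category names, thresholds, and the 2K sample size are my own guesses, not taken from any existing package, so treat it as a starting point rather than a working answer:

/*
 * classify.c -- rough sketch of a file-type guesser.
 * All thresholds below are ad hoc assumptions.
 *
 * Compile: cc -o classify classify.c
 * Usage:   classify file ...
 */
#include <stdio.h>
#include <ctype.h>

#define SAMPLE 2048          /* only examine the first 2K of the file */

static const char *guess(const unsigned char *buf, int n)
{
    int letters = 0, spaces = 0, punct = 0, ctrl = 0, high = 0;
    int braces = 0, esc_seq = 0, i;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c == 033 && i + 1 < n && buf[i + 1] == '[')
            esc_seq++;                                   /* ANSI escape sequence */
        if (c >= 128)                                    high++;
        else if (isalpha(c))                             letters++;
        else if (c == ' ' || c == '\n' || c == '\t')     spaces++;
        else if (c == '{' || c == '}' || c == ';' || c == '#') braces++;
        else if (isprint(c))                             punct++;
        else                                             ctrl++;
    }

    if (n == 0)                          return "empty";
    if (esc_seq > 2)                     return "ANSI graphics";
    if (high * 10 > n || ctrl * 10 > n)  return "raw binary (or 8-bit text)";
    /* uuencode/BinHex output is printable but much heavier in
       punctuation-range characters than ordinary English prose */
    if (punct > letters)                 return "converted binary (uuencode/BinHex?)";
    if (braces * 40 > n)                 return "source code";
    if ((letters + spaces) * 10 > n * 8) return "English-like text";
    return "unknown";
}

int main(int argc, char **argv)
{
    unsigned char buf[SAMPLE];
    FILE *fp;
    int i, n;

    for (i = 1; i < argc; i++) {
        if ((fp = fopen(argv[i], "rb")) == NULL) {
            perror(argv[i]);
            continue;
        }
        n = fread(buf, 1, SAMPLE, fp);
        fclose(fp);
        printf("%s: %s\n", argv[i], guess(buf, n));
    }
    return 0;
}

Since it only reads a fixed-size sample, the cost per file is constant regardless of file length, which is the sort of speed I am after; what I don't know is whether single-character frequencies are accurate enough, or whether existing code uses digraph/word statistics instead.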