levy@ttrdc.UUCP (Daniel R. Levy) (10/15/86)
In article <6168@alice.uUCp>, td@alice.UucP (Tom Duff) writes:
>>Suppose I have a very large (about 20 Mb) file of arbitrary length records (up
>>to about 20 bytes). 95% or so of these records will be unique within the file.
>>The other 5% will contain a number of different records which are repeated
>>throughout the file but not in any particular order.
>>My question: does anybody know of any algorithm for finding the 20 (say) most
>>common records in the file? I don't want to use large amounts of memory and I
>>don't want to have to make more than a few passes through the file.
>
>Write a program to encode each record as an isomorphic line of ASCII text.
>Call the program ``asc''
>Call your file ``file''
>then type
>    asc <file | sort | uniq -c | sort -nr | sed 20q
                 ^^^^             ^^^^
>to a UNIX shell.

Umm, with 20 meg of input, unless a larger amount of RAM is available and
sort can use it all, this is going to have problems with memory and
temp-file usage out of your /dev/wazoo.  Not that I can think of anything
better; it looks like sorting can't be avoided somewhere in the process.

>In this pipeline
>    asc <file
>converts the file into something sort can handle
>    sort
>gathers together equal records
>    uniq -c
>replaces each string of equal adjacent records with a single record
>preceded by a count
>    sort -nr
>sorts those records in reverse numerical order, and
>    sed 20q
>throws away all but the 20 highest ranking records
>
>Of course, if your records are already newline-terminated strings
>(as they should be, if you're a good UNIX user), you won't have to
>write any code at all.
-- 
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------    Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
   go for it!                            allegra,ulysses,vax135}!ttrdc!levy
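
P.S.  For concreteness, here's a minimal sketch of the ``asc'' encoder Duff
describes.  The original posting doesn't say how the variable-length records
are delimited, so this assumes fixed-length RECLEN-byte records purely for
illustration.  Hex encoding is one-to-one, so equal output lines correspond
exactly to equal input records, which is all sort and uniq need.

    #include <stdio.h>

    #define RECLEN 20  /* assumed record length; the posting only
                          says "up to about 20 bytes" */

    int main(void)
    {
        unsigned char rec[RECLEN];
        size_t i;

        /* hex-encode each record as one newline-terminated ASCII
           line; the mapping is isomorphic, so sorting the lines
           sorts (and groups) the underlying records */
        while (fread(rec, 1, RECLEN, stdin) == RECLEN) {
            for (i = 0; i < RECLEN; i++)
                printf("%02x", rec[i]);
            putchar('\n');
        }
        return 0;
    }

Compile it and drop it in front of Duff's pipeline:

    cc -o asc asc.c
    asc <file | sort | uniq -c | sort -nr | sed 20q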