levy@ttrdc.UUCP (Daniel R. Levy) (10/15/86)
In article <6168@alice.uUCp>, td@alice.UucP (Tom Duff) writes:
>>Suppose I have a very large (about 20 Mb) file of arbitrary length records (up
>>to about 20 bytes). 95% or so of these records will be unique within the file.
>>The other 5% will contain a number of different records which are repeated
>>throughout the file but not in any particular order.
>>My question: does anybody know of any algorithm for finding the 20 (say) most
>>common records in the file? I don't want to use large amounts of memory and I
>>don't want to have to make more than a few passes through the file.
>
>Write a program to encode each record as an isomorphic line of ASCII text.
>Call the program ``asc''
>Call your file ``file''
>then type
>    asc <file | sort | uniq -c | sort -nr | sed 20q
                 ^^^^             ^^^^
>to a UNIX shell.

Umm, with 20 meg of input, unless a larger amount of RAM is available and
sort can use it all, this is going to have problems with memory and
temp-file usage out of your /dev/wazoo.  Not that I can think of anything
better; it looks like sorting can't be avoided somewhere in the process.

>In this pipeline
>    asc <file
>converts the file into something sort can handle
>    sort
>gathers together equal records
>    uniq -c
>replaces each string of equal adjacent records with a single record
>preceded by a count
>    sort -nr
>sorts those records in reverse numerical order, and
>    sed 20q
>throws away all but the 20 highest ranking records
>
>Of course, if your records are already newline-terminated strings
>(as they should be, if you're a good UNIX user), you won't have to
>write any code at all.
-- 
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------    Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
   go for it!                            allegra,ulysses,vax135}!ttrdc!levy
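
P.S.  For concreteness, here's a minimal sketch of the ``asc'' encoder Duff
describes.  The original posting doesn't say how the variable-length records
are delimited, so this assumes fixed-length RECLEN-byte records purely for
illustration.  Hex encoding is one-to-one, so equal output lines correspond
exactly to equal input records, which is all sort and uniq need.

    #include <stdio.h>

    #define RECLEN 20  /* assumed record length; the posting only
                          says "up to about 20 bytes" */

    int main(void)
    {
        unsigned char rec[RECLEN];
        size_t i;

        /* hex-encode each record as one newline-terminated ASCII
           line; the mapping is isomorphic, so sorting the lines
           sorts (and groups) the underlying records */
        while (fread(rec, 1, RECLEN, stdin) == RECLEN) {
            for (i = 0; i < RECLEN; i++)
                printf("%02x", rec[i]);
            putchar('\n');
        }
        return 0;
    }

Compile it and drop it in front of Duff's pipeline:

    cc -o asc asc.c
    asc <file | sort | uniq -c | sort -nr | sed 20q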