[comp.os.research] Cache theory question

cjp%megatek.UUCP@ucsd.edu (Chris Pikus) (02/24/90)

	The Unix disk buffer cache is organized as
a combination of a hash table and linked lists.
That is, there are several linked lists, and which one to use
is determined by a hashing function.

	To wit, when one wishes to access or add a particular block
in the buffer cache, the kernel applies a hash function
to BLKNO (usually mod 17, I believe) and then walks down
that list to find said BLKNO.

	My question is thus: WHY mod 17???

	I know the answer is fairly obvious: since 17 is
a prime number, the distribution amongst the lists is more
likely to be even. If something in a filesystem
started to be regularly spaced, it might have
a tendency to "resonate" and fill up one list if one used a number
like 16.

	To be more precise, is there any reason they picked 17, and
if so, is there any supporting documentation/articles/mathematical_treatises?
Could they have picked any prime (like 29, or 53, or 2011)?

	I know this is a pretty esoteric question, but I was
curious, and I know there are a lot of UNIX historians on the net.

	Thank you very much,
	Christopher J. Pikus

rfg@ICS.UCI.EDU (Ronald Guilmette) (02/26/90)

In article <1456@darkstar.ucsc.edu> cjp%megatek.UUCP@ucsd.edu (Chris Pikus) writes:
>
...
>
>	To wit, when one wishes to access or add a particular block
>in the buffer cache, the kernel applies a hash function
>to BLKNO (usually mod 17, I believe) and then walks down
>that list to find said BLKNO.
>
>	My question is thus: WHY mod 17???
...

I am curious about this very same question.

Recently, I was writing a tool which reads assembly code for a particular
machine and does some mangling on it.  While doing this, I was hunting for
a nice hash function for the valid mnemonics of that machine.  I quickly
found that hash functions which involved prime numbers gave much better
results (i.e., in terms of minimizing collisions) than ones without a
prime involved.  A quick check of Knuth's books confirmed this to be a
well-known fact.

Now the only question remaining for me was "Which prime should I use for
this particular set of mnemonics?"  I wrote a short program which would try
all of the primes up to 65536 with the following function:


	unsigned int hash_mask = /* some power of two, minus one */ ;

	static unsigned int
	hash (const char *string, unsigned int prime)
	{
	  unsigned int ret_val = 0;

	  for (const char *p = string; *p; p++)
	    ret_val = ((unsigned char)*p ^ ret_val) * prime;
	  return ret_val & hash_mask;
	}

I had the program evaluate the "efficiency" of all of the various primes for
the given mnemonic set.

For the mnemonic set I was interested in (consisting of 165 strings) my
program found a prime that (when used with a primary table size of 2048)
yielded a perfect hash function.  The prime was 5381.

Curiously however, the prime number 17 also yielded good efficiency (i.e.
a small percentage of collisions) with several different primary table
sizes (with my specific input key set).  Other numbers that seemed good
with various table sizes were 19 and 29, but 17 was always better.

Now in my case, I had a fixed input key set that was not going to be changed,
so I went with 5381 as my prime, but it occurred to me that a different prime
would probably be best for my hash function which hashed *identifiers*.

Obviously, there is no way to predict the set of identifiers which will be
used in a given input program, so I would like to know if there have been
any studies which would suggest that any particular prime number (17?)
would (on average) yield good results for arbitrary sets of input strings
and arbitrary primary table sizes.

Anybody know?  Shall I do one?  Is there any theoretical basis for determining
the one "right" answer to this question (if there is one), or is an empirical
study the only way to come up with (at least) a good guess?

P.S.  Knuth uses 1009 in his examples, but what does he know! :-)

// rfg

caveh@csl.sri.com (Caveh Jalali) (02/28/90)

In article <1456@darkstar.ucsc.edu> cjp%megatek.UUCP@ucsd.edu (Chris Pikus) writes:
>
...
>
>	To wit, when one wishes to access or add a particular block
>in the buffer cache, the kernel applies a hash function
>to BLKNO (usually mod 17, I believe) and then walks down
>that list to find said BLKNO.
>
>	My question is thus: WHY mod 17???
...

the choice of 17 is probably arbitrary.  given that there is (hopefully)
no preference for any particular block numbers, i will go so far as to claim
that a power of 2 would have been a better choice.

as long as your hash key is distributed EVENLY, who cares what you
divide by -- pick a power of 2 that gives you the desired residency!

just because a number is prime doesn't mean you're going to have fewer
collisions.  you'll simply be sensitive to a different "beat"
frequency than with other numbers:  with 17, it's multiples of 17,
with 16 it would be multiples of 16 which defeat your hashing.

again, i see no reason to believe that block allocation is not
random.  there may be some clustering toward one or more areas
on the disk such as cylinder group boundaries in the BSD file
system, and probably toward the "beginning" of the partition
on older file systems, but within each cluster, the distribution
of blocks should be random.


as to why 17 seems to be the number chosen:  "old" disks used to
have 17 blocks/track.  this is still true for all MFM drives on
PCs, etc., i.e. anything that's 3600 rpm, with a 5 Mbit/sec data rate
and 512-byte sectors.  someone might want to check how many
sectors/track there were on the old PDP-11 drives!
if they still use 17 today, something is badly WRONG:
"old" systems didn't have enough memory to afford more than, say, 30
blocks of cache, hence the small value.  today, one might
choose a value in the 100's.

00c
--
00c -- Who was that masked interrupt?
Internet: caveh@csl.sri.com
UUCP: {ames|decwrl|pyramid|sun}!fernwood!hercules!caveh
ICBM: 37d 27' 14" North, 122d 10' 52" West

moss%ibis@cs.umass.edu (Eliot Moss) (03/01/90)

I suspect that 17 was chosen as being a prime close to the number of blocks
they expected to have in the cache in early systems (8k x 17 = 136k, a
reasonable chunk on a 1 or 2 megabyte system). With larger memory sizes
available today, it might pay to use a larger prime and cut down the cost of
searching the lists. If the size of the hash table is k and the number of
blocks in the cache is N, then the average number of probes it takes to find a
resident block is something like 1/2 * N/k and the average number for
non-resident blocks is N/k. Clearly, one desires large k for this, but pays a
space overhead for the table (perhaps a per-device one), so one must make an
appropriate tradeoff.
Even on my machines with large amounts of memory, the chains are probably not
longer than 10 or 20 blocks ....					Eliot
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu