cpshelley@violet.waterloo.edu (cameron shelley) (10/17/90)
(Greeting(time == morning? Goodmorning: ( time == afternoon? Goodafternoon: Goodevening) ) ); I have been (futily :<) experimenting with hash functions to hash english words into a large array. No book I can find around here gives the topic more than cursory discussion. Does anyone out there know of a good, simple one? They must exist, and I'm getting a little tired of mine... (er, function that is, I can't afford another data structures book right now :). I'll be happy to post mine if anyone needs a good laugh, btw. Cam -- Cameron Shelley | "Saw, n. A trite popular saying, or proverb. cpshelley@violet.waterloo.edu| So called because it makes its way into a Davis Centre Rm 2136 | wooden head." Phone (519) 885-1211 x3390 | Ambrose Bierce
macphee@convex.COM (Scott C. Mac Phee) (10/17/90)
In article <1990Oct16.184951.513@watdragon.waterloo.edu> cpshelley@violet.waterloo.edu (cameron shelley) writes: > > I have been (futily :<) experimenting with hash functions to >hash english words into a large array. No book I can find around >here gives the topic more than cursory discussion. Does anyone >out there know of a good, simple one? They must exist, and I'm >getting a little tired of mine... (er, function that is, I can't >afford another data structures book right now :). Here is some code I've been using. It is an adapted version of some hashing algorithms I've seen pass through this newsgroup. Hope it helps... =========================================================================== /* * ============================================================================ * HASH TABLE FUNCTIONS * ============================================================================ */ #define HASH_TABLE_SIZE 1000 static int eventhash[HASH_TABLE_SIZE] ; static int hash_entries, /* number in table... */ collisions ; /* count of collisions */ int init_hash_table() { register int i ; hash_entries = collisions = 0 ; for (i=0; i<HASH_TABLE_SIZE;) eventhash[i++] = 0 ; } /* * calc_hash_value.c * * calculate a hash table location from the given name. * */ int calc_hash_value(name) char *name ; { register int hash ; for (hash=0; *name;) hash = (hash<<2) ^ *name++ ; return(abs(hash % HASH_TABLE_SIZE)) ; } -- =============================================================================== CONVEX Computer Corporation (CVX) Richardson, Texas Scott C. Mac Phee (..uunet.uu.net!convex.com!macphee) 214-497-4772 ===============================================================================
macphee@convex.COM (Scott C. Mac Phee) (10/17/90)
In article <107251@convex.convex.com> macphee@convex.COM (Scott C. Mac Phee) writes: >In article <1990Oct16.184951.513@watdragon.waterloo.edu> cpshelley@violet.waterloo.edu (cameron shelley) writes: >> >> I have been (futily :<) experimenting with hash functions to >>hash english words into a large array. No book I can find around >>here gives the topic more than cursory discussion. Does anyone >>out there know of a good, simple one? They must exist, and I'm >>getting a little tired of mine... (er, function that is, I can't >>afford another data structures book right now :). > >Here is some code I've been using. It is an adapted version of some >hashing algorithms I've seen pass through this newsgroup. > >Hope it helps... > >=========================================================================== > >/* > * ============================================================================ > * HASH TABLE FUNCTIONS > * ============================================================================ > */ > >#define HASH_TABLE_SIZE 1000 >static int eventhash[HASH_TABLE_SIZE] ; >static int hash_entries, /* number in table... */ > collisions ; /* count of collisions */ Sorry... amended code... the hash table is an array of pointers to a structure, not int. ex: struct event { char * name ; struct event * next ; } eventhash[HASH_TABLE_SIZE] ; If a collision occurs use the hashed entry as the head of a linked list and walk the linked list until the end is found, then add the entry at the end. I pulled the code out and then added the declarations without looking at the original code. Ten minutes later I realized I'd messed up. Is today Monday, again? -- =============================================================================== CONVEX Computer Corporation (CVX) Richardson, Texas Scott C. Mac Phee (..uunet.uu.net!convex.com!macphee) 214-497-4772 ===============================================================================
chris@mimsy.umd.edu (Chris Torek) (10/18/90)
The following implements expanding chained hash tables for strings and numbers. This is unmodified from a special-purpose program (having to do with reading a large number of files with name=id pairs and doing various operations on them). Chain lengths are not monitored; instead, the table is invariably expanded whenever it becomes 2/3 full. Entries may be neither removed nor modified. An essentially-unrelated function called `string' maintains a string pool such that string(x) == string(y) iff strcmp(x,y)==0 (the program typically stores each name several times, and often needs to test for name equality, so this helps there). : Run this shell script with "sh" not "csh" PATH=/bin:/usr/bin:/usr/ucb:/etc:$PATH export PATH all=false if [ x$1 = x-a ]; then all=true fi echo Extracting hash.h sed 's/^X//' <<'//go.sysin dd *' >hash.h X/* X * $Id: hash.h,v 1.1 90/09/24 23:58:38 chris Exp $ X * X * Hash table entries keep track of name (or id) = value pairs. X * Values are simply pointers. Hash tables themselves are named X * (for debugging). X */ X Xstruct hashtab; X X/* X * The `ins' functions return nonzero if the old value existed, X * without changing the value. X * The iterate functions calls a given function with name,value X * or id,value pairs. X */ Xstruct hashtab *ht_new(); /* given name, create new hash table */ Xchar *ht_nget(); /* given table and name, get value */ Xchar *ht_iget(); /* given table and ID, get value */ Xint ht_nins(); /* given table and name, insert new value */ Xint ht_iins(); /* given table and id, insert new value */ Xvoid ht_niterate(); /* given table and function, iterate */ Xvoid ht_iiterate(); /* given table and function, iterate */ X X/* X * Some things that ought not to be here, but are anyway. X * goodnumber() takes a number of objects and a size and returns X * a new number of objects, such that malloc(goodnumber(n,size)*size) X * calls malloc with a `good' size value (resulting in less wasted X * memory). emalloc is malloc with program-abort on out-of-memory. X * string() makes a `read-only' copy of a string, reusing the previous X * copy if any. X */ Xint goodnumber(); /* given n & sizeof, return new n */ Xchar *emalloc(); /* malloc, exit on error */ Xchar *string(); /* make an `ideal' copy of a string */ //go.sysin dd * if [ `wc -c < hash.h` != 1415 ]; then made=false echo error transmitting hash.h -- echo length should be 1415, not `wc -c < hash.h` else made=true fi if $all; then echo ' Changing owner to bin' chown bin hash.h else echo ' Original owner was bin' fi if $made; then chmod 444 hash.h echo -n ' '; ls -ld hash.h fi echo Extracting hash.c sed 's/^X//' <<'//go.sysin dd *' >hash.c X#ifndef lint Xstatic char rcsid[] = "$Id: hash.c,v 1.2 90/10/02 13:32:23 chris Exp $"; X#endif X X/* X * Hash table routines. X */ X X#include <stdio.h> X#include <stdlib.h> X#include <string.h> X#include "hash.h" X X/* X * Hash table entries keep track of name=value pairs. X * The names may be numeric IDs instead (by having a null name). X */ Xstruct hashent { X struct hashent *h_next;/* next in chain */ X int h_hash; /* hash value or ID */ X char *h_name; /* name (null if from numeric ID) */ X char *h_value; /* value */ X}; X Xstruct hashtab { X int ht_size; /* size (power of 2) */ X int ht_mask; /* == ht_size - 1 */ X int ht_used; /* number of entries used */ X int ht_lim; /* when to expand */ X struct hashent **ht_tab;/* base of table */ X char ht_name[1]; /* table name; actually larger */ X}; X Xextern char *progname; X Xchar * Xemalloc(n) X size_t n; X{ X register char *p = malloc(n); X X if (p == NULL) { X (void) fprintf(stderr, "%s: out of memory\n", progname); X exit(1); X } X return (p); X} X X/* round up to next multiple of y, where y is a power of 2 */ X#define ROUND(x, y) (((x) + (y) - 1) & ~((y) - 1)) X X/* compute a `good' number of objects to allocate via malloc */ Xint Xgoodnumber(n, s) X int n; X size_t s; X{ X X /* 16384 is a guess at a good page size for malloc */ X /* 32 is a guess at malloc's overhead */ X return ((int)((ROUND(n * s + 32, 16384) - 32) / s)); X} X X/* X * Make a new hash table. X */ Xstruct hashtab * Xht_new(name) X char *name; X{ X register struct hashtab *ht; X register struct hashent **h; X register int n; X X ht = (struct hashtab *)emalloc(sizeof *ht + strlen(name)); X ht->ht_tab = h = (struct hashent **)emalloc(128 * sizeof *h); X ht->ht_size = 128; X ht->ht_mask = 127; X for (n = 128; --n >= 0;) X *h++ = NULL; X ht->ht_used = 0; X ht->ht_lim = (128 * 2) / 3; X (void) strcpy(ht->ht_name, name); X return (ht); X} X X/* X * Expand an existing hash table. X */ Xstatic void Xht_expand(ht) X register struct hashtab *ht; X{ X register int n = ht->ht_size * 2, i; X register struct hashent *p, **h, **oldh, *q; X X h = (struct hashent **)emalloc(n * sizeof *h); X for (i = n; --i >= 0;) X *h++ = NULL; X h -= n; X oldh = ht->ht_tab; X n--; X for (i = ht->ht_size; --i >= 0;) { X for (p = *oldh++; p != NULL; p = q) { X q = p->h_next; X p->h_next = h[p->h_hash & n]; X h[p->h_hash & n] = p; X } X } X free((char *)ht->ht_tab); X ht->ht_tab = h; X ht->ht_mask = n; X ht->ht_size = ++n; X ht->ht_lim = (n * 2) / 3; X} X X/* X * Make a new hash entry. Its h_next will be NULL. X */ Xstatic struct hashent * Xnewhashent(hash, name, value) X int hash; X char *name, *value; X{ X static struct hashent *hfree; X register struct hashent *h; X register int n, nalloc; X X if ((h = hfree) != NULL) X hfree = h->h_next; X else { X nalloc = goodnumber(2, sizeof *h); /* need at least 2 */ X hfree = h = (struct hashent *)emalloc(nalloc * sizeof *h) + 1; X for (n = nalloc - 2; --n >= 0; h++) X h->h_next = h + 1; X h->h_next = NULL; X h -= nalloc - 1; X } X h->h_next = NULL; X h->h_hash = hash; X h->h_name = name; X h->h_value = value; X return (h); X} X X#define HASH(str, h, p) \ X for (p = str, h = 0; *p;) h = (h << 5) - h + *p++ X X/* X * Look up a name=value. X */ Xchar * Xht_nget(ht, name) X register struct hashtab *ht; X char *name; X{ X register char *p; X register int hash; X register struct hashent *h; X X HASH(name, hash, p); X p = name; X for (h = ht->ht_tab[hash & ht->ht_mask]; h != NULL; h = h->h_next) X if (h->h_hash == hash && h->h_name != NULL && X strcmp(h->h_name, p) == 0) X return (h->h_value); X return (NULL); X} X X/* X * Look up an ID=value. X */ Xchar * Xht_iget(ht, id) X register struct hashtab *ht; X register int id; X{ X register struct hashent *h; X X for (h = ht->ht_tab[id & ht->ht_mask]; h != NULL; h = h->h_next) X if (h->h_hash == id && h->h_name == NULL) X return (h->h_value); X return (NULL); X} X X/* X * Insert (do not clobber) a name=value. X * Return zero on success. X */ Xint Xht_nins(ht, name, value) X register struct hashtab *ht; X char *name, *value; X{ X register char *p; X register int hash; X register struct hashent *h, **hp; X X HASH(name, hash, p); X p = name; X for (hp = &ht->ht_tab[hash & ht->ht_mask]; (h = *hp) != NULL; X hp = &h->h_next) X if (h->h_hash == hash && h->h_name != NULL && X strcmp(h->h_name, p) == 0) X return (-1); X *hp = newhashent(hash, name, value); X if (++ht->ht_used > ht->ht_lim) X ht_expand(ht); X return (0); X} X X/* X * Insert (do not clobber) an ID=value. X * Return zero on success. X */ Xint Xht_iins(ht, id, value) X register struct hashtab *ht; X register int id; X char *value; X{ X register struct hashent *h, **hp; X X for (hp = &ht->ht_tab[id & ht->ht_mask]; (h = *hp) != NULL; X hp = &h->h_next) X if (h->h_hash == id && h->h_name == NULL) X return (-1); X *hp = newhashent(id, (char *)NULL, value); X if (++ht->ht_used > ht->ht_lim) X ht_expand(ht); X return (0); X} X X/* X * Stash a copy of a string away; it will never be freed. X */ Xstatic char * Xpoolstr(s) X char *s; X{ X register char *p; X register size_t l = strlen(s) + 1; X static char *poolp; X static size_t nleft; X X if (nleft < l) X poolp = emalloc(nleft = goodnumber(l, 1)); X bcopy(s, p = poolp, l); X poolp += l; X return (p); X} X X/* X * Generate a single unique copy of the given string. X */ Xchar * Xstring(s) X char *s; X{ X register char *p; X register int hash; X register struct hashent *h, **hp; X static struct hashtab *ht; X X if (ht == NULL) X ht = ht_new("strings"); X HASH(s, hash, p); X p = s; X for (hp = &ht->ht_tab[hash & ht->ht_mask]; (h = *hp) != NULL; X hp = &h->h_next) X if (h->h_hash == hash && strcmp(h->h_name, p) == 0) X return (h->h_name); X *hp = h = newhashent(hash, poolstr(s), (char *)NULL); X if (++ht->ht_used > ht->ht_lim) X ht_expand(ht); X return (h->h_name); X} X X/* X * Call fn on all the name=value pairs. X */ Xvoid Xht_niterate(ht, fn) X struct hashtab *ht; X register void (*fn)(); X{ X register struct hashent *h, **hp; X register int n; X X for (n = ht->ht_size, hp = ht->ht_tab; --n >= 0;) X for (h = *hp++; h != NULL; h = h->h_next) X if (h->h_name != NULL) X (*fn)(h->h_name, h->h_value); X} X X/* X * Call fn on all the id=value pairs. X */ Xvoid Xht_iiterate(ht, fn) X struct hashtab *ht; X register void (*fn)(); X{ X register struct hashent *h, **hp; X register int n; X X for (n = ht->ht_size, hp = ht->ht_tab; --n >= 0;) X for (h = *hp++; h != NULL; h = h->h_next) X if (h->h_name == NULL) X (*fn)(h->h_hash, h->h_value); X} //go.sysin dd * if [ `wc -c < hash.c` != 6291 ]; then made=false echo error transmitting hash.c -- echo length should be 6291, not `wc -c < hash.c` else made=true fi if $all; then echo ' Changing owner to bin' chown bin hash.c else echo ' Original owner was bin' fi if $made; then chmod 444 hash.c echo -n ' '; ls -ld hash.c fi -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750) Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
gaynor@paul.rutgers.edu (Silver) (10/18/90)
You must supply definitions of uint8 (an 8-bit unsigned int), uint16
(similarly), and uint32 (likewise).
----------------------------------- hash.h ------------------------------------
/* This is an implementation of Peter K. Pearson's hash algorithm as */
/* presented in "Fast Hashing of Variable-Length Text Strings", ACM 6/90. */
/* The conclusion of his article is cited below. */
/* */
/* The main advantages of the hashing function presented here are: */
/* 1. No restriction is placed on the length of the text string. */
/* 2. The length of the text string does not need to be known beforehand. */
/* 3. Very little arithmetic is performed on each character being hashed. */
/* 4. Similar strings are not likely to collide. */
/* 5. Minimal, perfect hashing functions can be built in this form. */
/* [6. It's average distribution is quite random.] */
/* */
/* Its principal disadvantages are: */
/* 1. Output value ranges that are not powers of 2 are somwehat more */
/* complicated to provide. */
/* 2. More auxiliary memory (the 256-byte table T) is required by this */
/* hashing function than by many traditional functions. */
/* [3. I blew 50 cents copying the article and then had to type it in.] */
/* */
/* Elaborating on advantage 5 above, by fatootsing with the permutation */
/* table T, one can make hash8 perform perfect minimal hashes. The author */
/* provides the following guidelines for construction of such a table: */
/* */
/* [C[N] are the characters of the string being hashed and h[N] is the Nth */
/* iteration of the loop.] */
/* */
/* 1. A table T was consructed by pseudorandom permutation of the integers */
/* (0 ... 255). */
/* 2. One by one, the desired values were assigned to the words in the */
/* list. Each assignment was effected by exchanging two elemnts in the */
/* table. */
/* 3. For each word, the first candidate considered for exchange was */
/* T[h[n-1] ^ C[n]], the last table element referenced in the */
/* computation of the hash function for that word. */
/* 4. A table element could not be exchanged if it was referenced during */
/* the hashing of a previously assgined word or if it was referenced */
/* earlier in the hashing of the same word. */
/* 5. If the necessary exchange was forbidden by Rule 4, attention was */
/* shifted to the previously referenced table element, */
/* T[[h[n-2]] ^ C[n-1]]. */
/* */
/* The procedure is not always successful. For example, using the ascii */
/* character codes, if the word "a" hashes to 0 and the word "i" hashes to */
/* 15, it turns out that the word "in" must hash to 0. Initial attempts */
/* to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly */
/* this reason. The shift to the range (1 ... 31) was an ad hoc tactic to */
/* circumvent this problem. */
/* Hash s into an 8, 16, or 32 bit unsigned integer. The larger-sized */
/* values are generated by applying hash8 in parallel. */
extern uint8 hash8(uint8 * s);
extern uint16 hash16(uint8 * s);
extern uint32 hash32(uint8 * s);
/* The hash permutation table. */
extern uint32 * hashpermutation;
----------------------------------- hash.c ------------------------------------
#include "hash.h"
static uint8 permutation[256] =
{
51, 140, 233, 27, 118, 125, 170, 138, 119, 132, 174, 97, 25, 110, 1, 14,
65, 36, 40, 188, 73, 173, 7, 30, 68, 56, 169, 234, 107, 177, 197, 87,
28, 210, 186, 67, 2, 15, 115, 48, 223, 148, 211, 57, 190, 104, 213, 49,
144, 172, 147, 124, 157, 238, 167, 183, 78, 75, 58, 22, 70, 103, 181, 12,
254, 41, 198, 168, 46, 79, 241, 156, 83, 128, 66, 60, 86, 141, 161, 176,
221, 54, 192, 252, 116, 95, 206, 35, 88, 133, 154, 250, 237, 253, 85, 178,
93, 159, 155, 42, 9, 89, 3, 61, 201, 158, 106, 82, 240, 255, 218, 102,
189, 8, 33, 4, 145, 16, 150, 26, 99, 100, 195, 175, 34, 50, 80, 166,
194, 195, 164, 29, 134, 105, 55, 143, 122, 130, 245, 208, 72, 77, 64, 121,
139, 232, 191, 108, 228, 137, 59, 74, 11, 126, 171, 5, 242, 101, 239, 193,
112, 113, 98, 21, 207, 225, 151, 251, 92, 91, 17, 127, 20, 81, 24, 6,
43, 196, 204, 247, 212, 224, 220, 94, 32, 13, 187, 199, 214, 18, 226, 84,
71, 231, 165, 19, 202, 217, 90, 129, 136, 153, 182, 111, 244, 45, 236, 249,
109, 47, 180, 205, 215, 160, 53, 162, 114, 246, 179, 62, 227, 96, 142, 230,
184, 146, 117, 39, 69, 37, 23, 63, 52, 216, 0, 135, 149, 31, 38, 44,
209, 120, 76, 203, 229, 123, 131, 152, 10, 219, 243, 248, 235, 222, 200, 163
};
uint8 hash8(register uint8 * s)
{register static uint8 h;
if (*s)
{h = permutation[*s];
while (*++s)
h ^= permutation[h ^ *s];
return(h);}
else
{return((uint8)0);}
uint16 hash16(register uint8 * s);
{register uint8 h1;
register uint8 h2;
if (*s)
{h1 = permutation[*s];
h2 = permutation[*s + 1];
while (*++s)
{h1 ^= permutation[h1 ^ *s];
h2 ^= permutation[h2 ^ *s];}
return((uint16)((h1<<8) & h2));}
else
{return((uint16)0);}}
uint32 hash32(register uint8 * s);
{register uint8 h1;
register uint8 h2;
register uint8 h3;
register uint8 h4;
if (*s)
{h1 = permutation[*s]
h2 = permutation[*s + 1];
h2 = permutation[*s + 2];
h2 = permutation[*s + 3];
while (*++s)
{h1 ^= permutation[h1 ^ *s];
h2 ^= permutation[h2 ^ *s];
h3 ^= permutation[h3 ^ *s];
h4 ^= permutation[h4 ^ *s];}
return((uint32)((h1<<24) & (h2<<16) & (h3<<8) & h4));}
else
{return((uint32)0);}}
ron@sco.COM (Ron Irvine) (10/20/90)
I have used many hash functions. Each one has advantages and disadvantages. Here is a powerful but simple one. It simply does a crc16 on the character string. The crc16 is a great hash function since it is designed to produce a unique code for different character strings. So for "reasonable" text it gives a good distribution of hash numbers (a big problem to solve in general). The crc16 function I list below uses nibbles to build the crc from each byte, this is faster than a bit by bit and less room than a 256 entry table needed for a byte by byte (fits in cache). So, it is fast (no multiply, just shifts and masks), well behaved and simple. If you have a VAX you could even use the "crc" instruction instead of the crc16 function - what a CISC. To use, do a crc16 on the string and the truncate the 16 bit result to the size you need (this should be a power of 2). /* crc16 - crc16 routine * * R.K. Irvine * * This routine returns the crc16 for the string pointed * to by "in". * crc16 is given by: x^16 + x^15 + x^2 + 1 */ unsigned short crc16(register char *in) { register unsigned int n, crc; static unsigned short crc16l[] = { 0x0000,0xc0c1,0xc181,0x0140, 0xc301,0x03c0,0x0280,0xc241, 0xc601,0x06c0,0x0780,0xc741, 0x0500,0xc5c1,0xc481,0x0440, }; static unsigned short crc16h[] = { 0x0000,0xcc01,0xd801,0x1400, 0xf001,0x3c00,0x2800,0xe401, 0xa001,0x6c00,0x7800,0xb401, 0x5000,0x9c01,0x8801,0x4400, }; crc = 0; while (n = (unsigned char)(*in++)) { n ^= crc; crc = crc16l[n&0x0f] ^ crc16h[(n>>4)&0x0f] ^ (crc>>8); } return(crc); }
unhd (Jonathan W Miner) (10/24/90)
In article <1990Oct16.184951.513@watdragon.waterloo.edu> cpshelley@violet.waterloo.edu (cameron shelley) writes: > I have been (futily :<) experimenting with hash functions to >hash english words into a large array. No book I can find around >here gives the topic more than cursory discussion. Does anyone >out there know of a good, simple one? They must exist, and I'm >getting a little tired of mine... (er, function that is, I can't >afford another data structures book right now :). I am not an expert on hash functions, but can suggest a couple of methods: 1. Add the ascii values and 'mod' by the size of your array. 2. Or, a small modification: add only certain characters of each string. Say 1,3,5,7 and 11 and 'mod' by the size of the array. Both those methods, and many similar ones involve handling collisions in some manner. This can be messy (or so I'm told). To skip this problem, use a dynamic hashing function. This is refered to Fagin's Approach. It avoid's collisions by allocating space as needed and modifying the hash function. E-Mail if you want more info on this. ----------------------------------------------------------------- Jonathan Miner | I don't speak for UNH, and UNH does not jwm775@unhd.unh.edu | speak for me! (603)868-3416 | Rather be downhill skiing... or hacking... -- ----------------------------------------------------------------- Jonathan Miner | I don't speak for UNH, and UNH does not jwm775@unhd.unh.edu | speak for me! (603)868-3416 | Rather be downhill skiing... or hacking...