[comp.lang.c] Hash??? Not quite clear on what this is...

avery@netcom.UUCP (Avery Colter) (11/22/90)

I'm seeing all this talk of "Hash Functions".

Are you talking about parsing? What is being defined as "hashing"?

-- 
Avery Ray Colter    {apple|claris}!netcom!avery  {decwrl|mips|sgi}!btr!elfcat
(415) 839-4567   "I feel love has got to come on and I want it:
                  Something big and lovely!"         - The B-52s, "Channel Z"

gordon@osiris.cso.uiuc.edu (John Gordon) (11/23/90)

avery@netcom.UUCP (Avery Colter) writes:

>I'm seeing all this talk of "Hash Functions".

>Are you talking about parsing? What is being defined as "hashing"?

	Oh, goody.  We talked about hashing the last 2 weeks in class, so I
get to explain it.

	Hashing is: a method of storing items in an array such that they can
be quickly recalled when needed.  This is possible because hashing performs a
transmogrification (:-)) on the item and comes up with an array address, so
you can go get it directly without having to search for it.  

	Example:
	
	Let's say that your whole data set consists of the words: "I", "am", 
"the", "best", "programmer".  You want a function to store these words in an
array, and to be able to pick any one out without searching.  Well, a good
function for this would be to put the word in the [strlen()]'th position.  Do
you see how this would result in each word being in its own location?  "I" 
would be in slot #1, "am" would be in #2, etc.  So, if you are looking for
the word "I", you would apply the hash function and come up with 1, and you
would look in array element 1, and there it is!
	So, to reiterate: a hash function's purpose is to take data items and
perform some operation on them, resulting in an array address to store each
item in.  This is good because you do not have to search for the desired item;
you can find out exactly where it is and get it.

	NOTE: The above example is not realistic, for several reasons:
	1) The data set is ridiculously small.
	2) Many times you do not know the data set beforehand.
	3) hash functions never use something so simple as strlen().  They
	   break the item down into bits and twiddle them to produce the
	   array address.
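
	A rough, untested sketch of the strlen() example in C (the names here
are arbitrary):

#include <stdio.h>
#include <string.h>

#define SLOTS 11   /* strlen("programmer") == 10, so 11 slots suffice */

static const char *table[SLOTS];      /* unused slots stay NULL */

/* The toy hash from the example: a word's length is its slot number. */
static size_t toy_hash(const char *word)
{
    return strlen(word);
}

int main(void)
{
    static const char *words[] = { "I", "am", "the", "best", "programmer" };
    size_t i;

    /* store: compute each word's slot directly, no searching */
    for (i = 0; i < sizeof words / sizeof words[0]; i++)
        table[toy_hash(words[i])] = words[i];

    /* retrieve: one hash computation, one array access */
    printf("slot %lu holds \"%s\"\n",
           (unsigned long)toy_hash("I"), table[toy_hash("I")]);
    return 0;
}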


---
John Gordon
Internet: gordon@osiris.cso.uiuc.edu        #include <disclaimer.h>
          gordon@cerl.cecer.army.mil       #include <clever_saying.h>

"Gee, ralph, how many flames do ya think THIS post will get??"

gwyn@smoke.brl.mil (Doug Gwyn) (11/24/90)

In article <1990Nov22.235522.26487@ux1.cso.uiuc.edu> gordon@osiris.cso.uiuc.edu (John Gordon) writes:
>So, if you are looking for the word "I", you would apply the hash function
>and come up with 1, and you would look in array element 1, and there it is!

Or, some other datum that hashes to the same bucket is there.  While
so-called "perfect hashing" doesn't have that problem, in practice
perfect hashing is rarely possible (normally only for language keywords
in a compiler); instead, support must be provided to resolve hash
collisions.  Simple-minded collision resolution schemes can perform
extremely poorly...
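
One common fix is chaining: keep a linked list per bucket and walk it on a
collision.  A rough, untested sketch (the table size and hash function are
arbitrary):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 101                    /* arbitrary (prime) table size */

struct entry {                          /* one node in a bucket's chain */
    char *key;
    struct entry *next;
};

static struct entry *bucket[NBUCKETS];

/* any reasonable string hash will do; this one is just for illustration */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* colliding keys share a bucket, so a lookup walks that bucket's chain */
static struct entry *lookup(const char *key)
{
    struct entry *p;
    for (p = bucket[hash(key)]; p != NULL; p = p->next)
        if (strcmp(p->key, key) == 0)
            return p;
    return NULL;
}

/* add a key (if absent) at the head of its bucket's chain */
static struct entry *insert(const char *key)
{
    unsigned h = hash(key);
    struct entry *p = lookup(key);

    if (p != NULL)
        return p;
    p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->key = malloc(strlen(key) + 1);
    if (p->key == NULL) {
        free(p);
        return NULL;
    }
    strcpy(p->key, key);
    p->next = bucket[h];
    bucket[h] = p;
    return p;
}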

jih@ox.com (John Hritz) (11/24/90)

	Hashing can also be thought of as an engineering trade-off between
space and time.  There are far more space-efficient methods of storing a list
of data items, but hashing can have the benefit of yielding the searched-for
data faster than a linear, binary, or tree search with its associated
overhead.  The basic balancing point: will it take longer to calculate the
hash index and deal with collisions than it will to search by comparisons?
Generally, if the list is of moderate size and temporary in lifespan, and a
hash value can be calculated that doesn't result in a lot of collisions,
hashing is a win.  Otherwise memory, collisions, or the exoticness of your
hash function start to eat into your speed gains.
-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
     John I. Hritz                               Photons have mass?!
     jih@ox.com                                       I didn't know they
     313-930-9126                                             were catholic!

johncore@compnect.UUCP (John Core ) (11/29/90)

In article <17291@netcom.UUCP>, avery@netcom.UUCP (Avery Colter) writes:
> I'm seeing all this talk of "Hash Functions".
> 
> Are you talking about parsing? What is being defined as "hashing"?
> 


A hash is a numerical value computed from a string, confined to a range of
preset values. It can be used for record indexing. On LARGE databases
we would create a hash value (between 1 and our max records) of a
key and stick the key at that position in a header file with a pointer
to the actual record position in the data file. Programmers are always
searching for better hash routines since there is always a collision
possibility (two different keys hashing to the same value).
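
A rough, untested sketch of that indexing scheme in C (the key size and the
record limit here are made up):

#include <stdio.h>

#define MAX_RECORDS 100000L     /* the preset range the hash must fall in */

/* One fixed-size slot in the header (index) file: the key plus a byte
   offset pointing at the real record in the data file. */
struct index_slot {
    char key[32];
    long data_offset;           /* -1 while the slot is unused */
};

/* Hash a key to a record number between 1 and MAX_RECORDS.  The mixing
   step is only illustrative. */
static long key_to_slot(const char *key)
{
    unsigned long h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return (long)(h % MAX_RECORDS) + 1;
}

/* The slot number gives the seek position in the header file, e.g.:
 *   fseek(idx, (key_to_slot(key) - 1) * (long)sizeof(struct index_slot),
 *         SEEK_SET);
 * then read the slot and follow data_offset into the data file.
 */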





Wizard Systems              |    UUCP:   uunet!wa3wbu!compnect!johncore
P.O. Box 6269               |INTERNET:   johncore@compnect.wa3wbu
Harrisburg, Pa. 17112-6269  |a public bbs since 1978. Data(717)657-4992 & 4997
John Core, SYSOP            |-------------------------------------------------
----------------------------| No matter where you go, there you are!
a woman is just a woman, but a good cigar is a smoke.   -R. Kipling

johncore@compnect.UUCP (John Core ) (11/29/90)

In article <1990Nov22.235522.26487@ux1.cso.uiuc.edu>, gordon@osiris.cso.uiuc.edu (John Gordon) writes:
> avery@netcom.UUCP (Avery Colter) writes:
> 
> >I'm seeing all this talk of "Hash Functions".
> 
> >Are you talking about parsing? What is being defined as "hashing"?
> 
> 	Oh, goody.  We talked about hashing the last 2 weeks in class, so I
> get to explain it.
> 
> 	Hashing is: a method of storing items in an array such that they can
> be quickly recalled when needed.  This is possible because hashing performs a
> transmogrification (:-)) on the item and comes up with an array address, so
> you can go get it directly without having to search for it.  
> 
> "Gee, ralph, how many flames do ya think THIS post will get??"

This is not a flame! (well maybe it is.)

Last week I was talking to a prof at the local college; he is teaching a 'C'
course and had a textbook on 'C'. It spent 2 chapters on sorts, all of
them RAM sorts. NO file sorts. This is why I will never worry
about losing my job as a systems analyst to a bright new wizard out of college.
College programming courses seem to never deal with the REAL (ie buisness)
world. As for the sorts, no buisness (except maybe the local McDonald's)
has databases that fit in RAM. Your reply indicated hashing
arrays in RAM. Even a VAX 9000 with a gig of RAM cannot sort
a decent buisness database in memory, whether it be invoices, POs, or a mailing
list, unless of course you talk the system administrator into taking
the machine into single-user mode to run your application.

my reply to the original poster:

*>A hash is a numerical value computed from a string, confined to a range of
*>preset values. It can be used for record indexing. On LARGE databases
*>we would create a hash value (between 1 and our max records) of a
*>key and stick the key at that position in a header file with a pointer
*>to the actual record position in the data file. Programmers are always
*>searching for better hash routines since there is always a collision
*>possibility (two different keys hashing to the same value).

then you said:
>	2) Many times you do not know the data set beforehand.
>	3) hash functions never use something so simple as strlen().  They
>	   break the item down into bits and twiddle them to produce the
>	   array address.


Actually you have to know a lot, like the key's max length
and how many total records will be in your data base, since you have
to give the hash function a range to deal with. Example:
years back I worked for a firm that wrote the software for General Signal's
electric motor factories. We had to manage a database that stored over
200000 items. It consisted of individual parts for various electric motors,
the motors the parts were associated with, supplier, rate of consumption,
lag time from order to delivery, and price history. We played it safe and
presumed the base would never go over 300000 items. We would take an
item's search key and hash it to a number between 1 and 360000. The
extra 60000 was the collision overflow. We would then create an index
file that contained the primary key, secondary keys, and a pointer
to the actual record's file position in the data base. If part
'AB-DDD-104-20-5A' hashed to 7, its key and actual file position
(a new item went to the bottom of the master file) was written to
record 7 of the index file. If another key hashed to 7, it was written
to an overflow (collision) area. As the data base grows you have to
watch the collision area. If you have a lot of collisions, you have
to adjust your hashing scheme and the max records (high hash value)
of your hash index. In our hash calculations we seeded the hash
equation with a prime number that was equivalent to 80% of the total
number of records. The prime was kept in a separate file so we could
change the hash without recompiling. I was sent one day to GS's Wisconsin
plant (their software had been running great for 2 years) on the complaint
that the system was really SLOW, taking up to 5 minutes to retrieve a
record. The cause (it was the last place I looked): one of our 'new' boys
had been tuning the hash and used a non-prime number. 50% of new entries
were going into collision. Took 2 days to rebuild the index.

--just another old fart telling a war story...



Wizard Systems              |    UUCP:   uunet!wa3wbu!compnect!johncore
P.O. Box 6269               |INTERNET:   johncore@compnect.wa3wbu
Harrisburg, Pa. 17112-6269  |a public bbs since 1978. Data(717)657-4992 & 4997
John Core, SYSOP            |-------------------------------------------------
----------------------------| No matter where you go, there you are!
a woman is just a woman, but a good cigar is a smoke.   -R. Kipling

Markd@Aus.Sun.COM (12/01/90)

johncore@compnect.UUCP (John Core ) writes:

>In article <1990Nov22.235522.26487@ux1.cso.uiuc.edu>, gordon@osiris.cso.uiuc.edu (John Gordon) writes:
>> avery@netcom.UUCP (Avery Colter) writes:
>> 
>> >I'm seeing all this talk of "Hash Functions".

>...
>College programming courses seem to never deal with the REAL (ie buisness)
>...

>Actually you have to know a lot, like the key's max length
>and how many total records will be in your data base, since you have
>to give the hash function a range to deal with. Example:
>...
>50% of new entries were going into collision.  Took 2 days to rebuild
>the index. 

If you use simplistic static hashing in REAL buisness (sic) you deserve
the heat.  In real life, you need to know neither "the key's max
length", nor "how many total records will be in your data base". 

Dbm in the former case, and Dynamic Hashing (and its numerous variants such
as Probing) in the latter, demonstrate workable techniques that obviate
the need for such prescience.  Both have only marginal relevance to C.


>--just another old fart telling a war story... 

Quite.


------------         -----------------     --------------------
Mark Delany          markd@Aus.Sun.COM     ...!sun!sunaus!markd
------------         -----------------     --------------------

bilbo@bisco.kodak.COM (Charles Tryon) (12/03/90)

in <885@compnect.UUCP>, johncore@compnect.UUCP (John Core ) writes:
> > avery@netcom.UUCP (Avery Colter) writes:
> > 
> > >I'm seeing all this talk of "Hash Functions".
> > 
> College programming courses seem to never deal with the REAL (ie buisness)
> world.  ...
> 
> --just another old fart telling a war story...

  Well, I never learned much about hashing when I went to school (more years
  ago than I want to think about ;-).  I know the concept of hashing, but have
  never had a use for it in the applications I have worked on.  My question
  is, are there any recommendations out there for a good, fairly basic book on
  hashing?  I don't need to know all the gory details and the latest hot-shot
  methods, just the basics.

--
Chuck Tryon
    <bilbo@bisco.kodak.com>
    USmail: 46 Post Ave.;Roch. NY 14619                       B. Baggins
    <<...include standard disclamer...>>                      At Your Service

  "Swagger knows no upper bound, but the laws of physics remain unimpressed."
                                                            (D. Mocsny)

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (12/04/90)

In article <9012031446.AA24102@bisco.kodak.COM> bilbo@bisco.kodak.COM (Charles Tryon) writes:
> My question
>   is, are there any recomendations out there for a good, fairly basic book on
>   hashing?  I don't need to know all the gory details and the latest hot-shot
>   methods, just the basics.

Knuth spends a section on hashing (The Art of Computer Programming, Vol. 3,
section 6.4). The presentation is good and has all the basics, but doesn't go
into the latest hot-shot methods.

---Dan

salomon@ccu.umanitoba.ca (Dan Salomon) (12/05/90)

In article <9012031446.AA24102@bisco.kodak.COM> bilbo@bisco.kodak.COM (Charles Tryon) writes:
>  ... My question
>  is, are there any recomendations out there for a good, fairly basic book on
>  hashing?  I don't need to know all the gory details and the latest hot-shot
>  methods, just the basics.

Almost any university text on data structures will have a section on hash
searching.  See for instance:

"Fundamentals of Data Structures" by Horowitz and Sahni  
"Data Structures Techniques" by Standish
"Data Structures" by Reingold & Hansen

The first is the most readable of the three, and the last is the most
complete.  Hashing can be used on almost any computerized search
problem, not just string searching problems, and usually performs very
well.  I have even used hashing to search for matching states in a
parsing automaton.  But it may take some imagination and skill to apply
it to your specific problem.
-- 

Dan Salomon -- salomon@ccu.UManitoba.CA
               Dept. of Computer Science / University of Manitoba
	       Winnipeg, Manitoba, Canada  R3T 2N2 / (204) 275-6682

ccplumb@spurge.uwaterloo.ca (Colin Plumb) (12/16/90)

In article <9012031446.AA24102@bisco.kodak.COM> bilbo@bisco.kodak.COM (Charles Tryon) writes:
>  Well, I never learned much about hashing when I went to school (more years
>  ago than I want to think about ;-).  I know the concept of hashing, but have
>  never had a use for it in the applications I have worked on.  My question
>  is, are there any recomendations out there for a good, fairly basic book on
>  hashing?  I don't need to know all the gory details and the latest hot-shot
>  methods, just the basics.

Hashing: taking a long key (often a string) and returning a short key (often
an integer in some range) that tries to be different for every long key.
This is obviously impossible in general, and you get "collisions" where
different long keys map to identical short keys.

For a good hash function, collisions are no more likely than if you used random
numbers.  If you know all the input keys ahead of time, you can work out a hash
function that has no collisions at all; that is a "perfect hash."  If, in
addition, every short key is used (the N long keys map onto exactly 0..N-1),
it is a "minimal perfect hash."

A common application of hash functions is hash tables, where you want to store
some records indexed by long key and get access to them quickly.  So you keep
an array of records and index it by the hash function to find the record you
want.  Handling collisions can be done in various ways.  I like "chaining"
where you make each entry a linked list and do linear search along them.
There's also extensible hashing, where you copy everything to a bigger table
and arrange things so there are no collisions.  Then there's linear probing,
useful for tables that never need deletion.  You search forwards from the
hash position until you find an empty slot.  If you're writing, put it here.
If you're reading, this means that the value in question has never been
written.  Simple, but its performance gets Very Bad as the table fills up.
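
A bare-bones, untested linear probing sketch (table size and hash function
arbitrary):

#include <string.h>

#define TABLE_SIZE 256                  /* arbitrary size for illustration */

static const char *slot[TABLE_SIZE];    /* NULL means "never written" */

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Linear probing: scan forward from the hash position until the key or
 * an empty slot turns up.  If writing, claim the empty slot; if reading,
 * an empty slot means the key was never stored.  (No deletions, and the
 * table must never fill completely.) */
static const char *probe(const char *key, int writing)
{
    unsigned i = hash(key);

    for (;;) {
        if (slot[i] == NULL) {
            if (writing)
                slot[i] = key;          /* caller keeps the string alive */
            return slot[i];
        }
        if (strcmp(slot[i], key) == 0)
            return slot[i];
        i = (i + 1) % TABLE_SIZE;
    }
}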

Then there's double hashing, where you search forwards in steps given by a
second hash function but otherwise proceed the same way (this works very
well); uniform hashing, where the hash function produces a permutation of the
table buckets to search in (double hashing is a very close approximation to
it); and other tricks.

More unusual applications involve traditional indexing techniques (you don't
fill up the space as much), but you store a hash of the key because it's more
compact and quicker to compare.  Diff uses this to compare lines... it hashes
each line of text and compares the hashes first, then double-checks with a
full text compare when the hashes are equal.

Another interesting variant was used in a version of spell(1), although I don't
know if it's the current one or not.  The dictionary of correctly spelled words
was hashed into a large hash space (27 bits?) and the result stored as a
bitmap.  No, not as a flat 2^27 bits = 16 megabytes, but as the gaps between
adjacent set bits, with some higher-level indices.  That lets you check whether
bit #4392859 is set fairly quickly.  This has the possibility of missing a
misspelled word, but it's very rare, and a "stop list" of words it fails to
catch was added to deal with exceptions that actually were noticed in practice.

For integers to integers, input%outputrange works fairly well if you arrange
for outputrange to be prime.  For strings, hash = fiddle(hash)+*p++ is popular.
Someone just posted fiddle(x) = x*31 and x*33.  (You range reduce mod 2^n
for suitable n eventually.)
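
Spelled out (untested; x*33 and a 12-bit table are just one choice):

/* hash = fiddle(hash) + *p++, with fiddle(x) = x*33 here; range-reduce
 * mod 2^n at the end (n = 12 gives a 4096-entry table). */
static unsigned str_hash(const char *p)
{
    unsigned h = 0;

    while (*p)
        h = h * 33 + (unsigned char)*p++;
    return h & ((1u << 12) - 1);
}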

One that was published in CACM recently was to set up a table with a random
permutation of 0..255 and loop over hash = table[hash^*p++];.  If you set up
the table more methodically, you get an 8-bit CRC of the input string, but
any random permutation works well.
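
I believe that's Pearson's method from the June CACM.  Roughly (untested; the
shuffle is just one way to build the permutation):

#include <stdlib.h>

static unsigned char perm[256];         /* a random permutation of 0..255 */

/* Build the permutation with a Fisher-Yates shuffle. */
static void init_perm(void)
{
    int i, j;
    unsigned char t;

    for (i = 0; i < 256; i++)
        perm[i] = (unsigned char)i;
    for (i = 255; i > 0; i--) {
        j = rand() % (i + 1);
        t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
}

/* The table-driven loop described above: hash = table[hash ^ *p++]. */
static unsigned char byte_hash(const char *p)
{
    unsigned char h = 0;

    while (*p)
        h = perm[h ^ (unsigned char)*p++];
    return h;
}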

Knuth (section 6.4) mentions multiplicative hashing (take some of the middle
bits of constant*input).
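
In code, something like this (untested; the constant and the bit counts are
one arbitrary choice):

/* Multiplicative hashing, very roughly: multiply the key by a constant
 * (one related to the golden ratio is a traditional choice) and keep
 * some of the middle bits of the product as the table index. */
static unsigned mult_hash(unsigned long key)
{
    unsigned long h = key * 2654435769UL;       /* ~ 2^32 * 0.6180339887 */
    return (unsigned)((h >> 22) & 0x3FFUL);     /* 10 bits -> 1024 slots */
}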

That was quick, but that's the guts of the thing.