[comp.os.msdos.programmer] Fast Search Routines

DOUG@ysub.ysu.edu (Doug Sewell) (07/27/90)

Here's three suggestions:

>>> If the list is sorted, then you can use a binary search.

   Look at 'middle' of list.  If found, exit.
   If 'middle' is smaller, repeat on 'top' half of list
   else if 'middle' is larger, repeat on 'bottom' half of list.

>>> If you have some control over the way the data is stored in a
file, you can use a hash-table - perhaps hash into a fixed-size
table of pointers to the actual records in the file.

proper construction of the hash function to avoid collisions is
essential.  Here's two I've seen recently.

1. all words are 8 chars or less.  Numerical values are assigned
   for each letter: a-i = 1-9, j-r = 1-9, s-z = 2-9, blank = 0. (This
   case is charge-codes on an EBCDIC system, so this makes sense).
   SC123456 => 23123456, 'DOUG    ' => 46570000

   Reverse this number, divide into two parts, and add the two:

   sc123456 => 23123456 => 65432132 => 6543 + 2132 => 8675
   doug     => 46570000 => 00007564 => 7564 + 0000 => 7564

   In our case, we divided by 179 (there are 180 tracks in six
   cylinders of 3350 dasd), and stored the entries in the appropriate
   track, which the operating system handled searching for a particular
   entry automatically. For our userids, this has resulted in virtually
   uniform distribution of over 5000 entries.

2. make a list of prime numbers, with as many entries in the list as
   the maximum length of your search-string.  For example, for 8, use
   2,3,5,7,11,13,17,19.  Assign a value a-z = 1-26, respectively.  Stop
   on a blank, and count a '.' as 0.  (This function was for MVS dataset
   names, which could be up to 44 chars long).

   multiply each letter-value by the prime number associated with the
   letter's position in the prime-table and add them up.

   doug => 4*2 + 15*3 + 21*5 + 7*7 => 207

   I don't remember how well this distributed.

At any rate, there's a few suggestions for how to implement non-linear
searches of a table.

>>> One other idea: if the words are 8 characters or less, and there are
1000 or less of them, then keeping the entire table in-core will be
8000 characters or less.  You could either read them into fixed-slot
tables, or use length-prefixing:

fixed-slot (- = blank)         length-prefixing:
a-------                       1a
abc-----                       3abc
aeiou---                       5aeiou
bvd-----                       3bvd
... (stored as a-------abc---) ...  (stored as 1a3abc5aeiou3bvd...)

I hope these suggestions help.  I didn't give very concise algorithms,
but binary search and hash algorithms are common in data-structure
books.
--
Doug Sewell, Tech Support, Computer Center,
Youngstown State University, Youngstown,  OH 44555
E-mail: DOUG@YSUB.BITNET, DOUG@YSUB.YSU.EDU, ...!uunet!ysub.ysu.edu!doug
>> I am not a wizard.  What I do isn't magic.  I am a computer professional.

streib@silver.ucs.indiana.edu (Allan Streib) (07/28/90)

Sorry if this has already been suggested, I havn't seen it.

Check out Volume 3 (Sorting and Searching) of "The Art of Computer
Programming" by Donald Knuth  (Addison-Wesley Publishing, (c) 1973)
Don't let the Copyright date fool you; Knuth is one of the "universal"
references in Computer Programming. 

--
Allan (streib@silver.ucs.indiana.edu)

tomr@ashtate (Tom Rombouts) (08/01/90)

I missed the original posting, but you might want to check out
Blaise's Power Search package for a really unique approach to
this.  If I understand it correctly, as the expression is parsed,
little chunks of machine code are placed directly in memory which
are then executed by a pointer to that "assembled" function.
A former roommate of mine ordered the package just to check this
out from his technical curiosity!

Tom Rombouts  Torrance Techie  tomr@ashtate.A-T.com  V:(213)538-7108