DOUG@ysub.ysu.edu (Doug Sewell) (07/27/90)
Here's three suggestions: >>> If the list is sorted, then you can use a binary search. Look at 'middle' of list. If found, exit. If 'middle' is smaller, repeat on 'top' half of list else if 'middle' is larger, repeat on 'bottom' half of list. >>> If you have some control over the way the data is stored in a file, you can use a hash-table - perhaps hash into a fixed-size table of pointers to the actual records in the file. proper construction of the hash function to avoid collisions is essential. Here's two I've seen recently. 1. all words are 8 chars or less. Numerical values are assigned for each letter: a-i = 1-9, j-r = 1-9, s-z = 2-9, blank = 0. (This case is charge-codes on an EBCDIC system, so this makes sense). SC123456 => 23123456, 'DOUG ' => 46570000 Reverse this number, divide into two parts, and add the two: sc123456 => 23123456 => 65432132 => 6543 + 2132 => 8675 doug => 46570000 => 00007564 => 7564 + 0000 => 7564 In our case, we divided by 179 (there are 180 tracks in six cylinders of 3350 dasd), and stored the entries in the appropriate track, which the operating system handled searching for a particular entry automatically. For our userids, this has resulted in virtually uniform distribution of over 5000 entries. 2. make a list of prime numbers, with as many entries in the list as the maximum length of your search-string. For example, for 8, use 2,3,5,7,11,13,17,19. Assign a value a-z = 1-26, respectively. Stop on a blank, and count a '.' as 0. (This function was for MVS dataset names, which could be up to 44 chars long). multiply each letter-value by the prime number associated with the letter's position in the prime-table and add them up. doug => 4*2 + 15*3 + 21*5 + 7*7 => 207 I don't remember how well this distributed. At any rate, there's a few suggestions for how to implement non-linear searches of a table. >>> One other idea: if the words are 8 characters or less, and there are 1000 or less of them, then keeping the entire table in-core will be 8000 characters or less. You could either read them into fixed-slot tables, or use length-prefixing: fixed-slot (- = blank) length-prefixing: a------- 1a abc----- 3abc aeiou--- 5aeiou bvd----- 3bvd ... (stored as a-------abc---) ... (stored as 1a3abc5aeiou3bvd...) I hope these suggestions help. I didn't give very concise algorithms, but binary search and hash algorithms are common in data-structure books. -- Doug Sewell, Tech Support, Computer Center, Youngstown State University, Youngstown, OH 44555 E-mail: DOUG@YSUB.BITNET, DOUG@YSUB.YSU.EDU, ...!uunet!ysub.ysu.edu!doug >> I am not a wizard. What I do isn't magic. I am a computer professional.
streib@silver.ucs.indiana.edu (Allan Streib) (07/28/90)
Sorry if this has already been suggested, I havn't seen it. Check out Volume 3 (Sorting and Searching) of "The Art of Computer Programming" by Donald Knuth (Addison-Wesley Publishing, (c) 1973) Don't let the Copyright date fool you; Knuth is one of the "universal" references in Computer Programming. -- Allan (streib@silver.ucs.indiana.edu)
tomr@ashtate (Tom Rombouts) (08/01/90)
I missed the original posting, but you might want to check out Blaise's Power Search package for a really unique approach to this. If I understand it correctly, as the expression is parsed, little chunks of machine code are placed directly in memory which are then executed by a pointer to that "assembled" function. A former roommate of mine ordered the package just to check this out from his technical curiosity! Tom Rombouts Torrance Techie tomr@ashtate.A-T.com V:(213)538-7108