ephraim@THINK.COM (12/12/88)
After following the discussion of super-linear performance improvement (or its absence) on parallel machines, it struck me that I've been sitting on a possible instance. It hadn't occurred to me to publish, perhaps because it's a real application and not a research project :-). The task is looking up words in a dictionary. Like most dictionaries, the dictionary contents are sorted in alphabetical order. Like most words, the words to be looked up are not. The algorithm used is a variation of binary search called Parallel Binary Search (PBS) which searches for multiple words at once. PBS was devised by Craig Stanfill and others at Thinking Machines. The algorithm was debugged and implemented by yours truly. Needless to say, the algorithm is designed for efficient implementation on a Connection Machine. We have a dictionary of D unique entries memory-resident in a P processor Connection Machine. We have a total of W words to look up. In practice, D >> P and W >> P. That is, there are several dictionary entries in each processor (D/P) and there are more words to look up than can be accomodated in one pass. Each use of PBS can look up 1 <= w <= P words in the dictionary, so at least W/P calls are needed to look up all of the words. The cost of PBS is nearly independent of w, the number of words looked up in one call. One component of the cost per call is proportional to D/P, another is proportional to log2(P). So in practice, we use w ~= P for greatest efficiency. Consider what happens as P is increased. The number of calls to PBS is reduced in direct proportion to P, giving us a nearly linear performance improvement. (It's slightly sub-linear because of the log2(P) term.) But, D/P is also decreased, giving us an additional improvement. In an experiment carried out in mid-September, I had the following results: Machine size (P) CPU time 8192 165 S 16384 54 S 32768 20 S