[comp.lang.c] Wanted -- ambiguous lookup routine.

gisle@ifi.uio.no (Gisle Hannemyr) (08/09/90)

I am storing a number of strings (composed of several words) in a list
in RAM.  For instance (three strings):

    John Fitzgerald Kennedy
    John Frank Oppenheimer
    Lyndon Bird Johnson

I want to look up a string using an unambiguous abbreviation of the words
in the string.

Ie (given the short list above):
    J F K                   => John Fitzgerald Kennedy
    John Fitzgerald Kennedy => John Fitzgerald Kennedy
    L                       => Lyndon Bird Johnson
    J F                     => (error, ambiguous)
    J F L                   => (error, not found)

Typical applications for such a function would be phone directory, or
a completing unambiguous command parser.

Does anyone have a fast C routine that do the above, or know a
clever algorithm to accomplish it.  My main requirement is speed. 

Please mail your replies if possible.  I do not read these groups
regularly.

- gisle hannemyr  (Norwegian Computing Center)
  EAN:   C=no;PRMD=uninett;O=nr;S=Hannemyr;G=Gisle (X.400 SA format)
         gisle.hannemyr@nr.no                      (RFC-822  format)
  Inet:  gisle@ifi.uio.no
  UUCP:  ...!mcvax!ifi!gisle
------------------------------------------------

ssdken@watson.Claremont.EDU (Ken Nelson) (08/10/90)

 I would have responded by mail but your address kept bouncing.



How about this approach:

   When you store your strings store them in some kind of fast access
   tree, B-tree, AVL, Splay, or perhaps a hashed structure.

   Save them with the string as the key, and with the initials as
   the key.

   example:

    /* bsearch is a unix function for inserting, or searching into a
       binary tree. You could replace this with any type of insert
       into a fast access structure.  Don't really try this code
       it won't work, it is here to explain the concept.
    */

       bsearch("John F. Kennedy",tree,"John F. Kennedy")
       bsearch("JFK",tree,"John F. Kennedy");



    This is pretty fast, and space could be minmized by putting
    pointers to strings into the tree instead of having a copy
    of the string in each node of the tree.


    This approach works for state abbreviations as well, it is really
    like aliasing when writing parsers, and scanners....



        Hope this helps,

                                Ken Nelson
                                Principal Engineer
                                Software Systems Design
                                3627 Padua Av.
                                Claremont, CA 91711
                                (714) 624-3402

mikey@ontek.com (krill are yummy and krunchy) (08/11/90)

In comp.lang.c and comp.sources.wanted, 
  gisle@ifi.uio.no (Gisle Hannemyr) writes:
|
| I want to look up a string using an unambiguous abbreviation of the words
| in the string.
|
| Does anyone have a fast C routine that do the above, or know a
| clever algorithm to accomplish it.  My main requirement is speed.

The algorithm for determining whether the strings match is trivially
simple: loop over calls to strchr(), where the first parameter steps
through the "long" string and the second steps through the "short" 
string.  If the short string is consumed first, there is a match; if 
the long string is consumed first, there is no match.  Naturally, if
the "long" string has fewer characters than the "short" string, there
can be no match.

The issue of ambiguity is best solved by maintaining a list of all
the matches and then presenting that list to the user and letting
him or her choose.  A simpler and less user-friendly solution would
be to simply count the matches as you go and abort when the count
reaches two.

As for speed of execution, the above algorithm should work suitably
fast if the list is really in RAM.  If the list is so big that most
of it is sitting on a swap device somewhere, the disk access time
may overshadow the time spent checking the strings themselves.  A
solution I've often adopted is to add one more constraint to the
search: the first character of the two strings must match exactly.
Then sort the list and do a binary search on the first char and scan 
upwards for matches giving up when you get to the next letter of the 
ascii alphabet.

| Please mail your replies if possible.  I do not read these groups
| regularly.

I feel like posting it anyway.  Phblt.  (pbuh)

          love,
            mikey