[comp.lang.forth] HASHING THE DICTIONARY

wmb@MITCH.ENG.SUN.COM (Mitch Bradley) (10/08/90)

The F83 hash function selects one of 4 threads by looking at the lower 2
bits of the first character in the name (not the length byte, the first
character).

The current experimental version of my system uses a 256-way hash function
to index into a "cache" of recently-found entries from the forth vocabulary.
Other vocabularies do not participate.  The hash function concatenates
the low 3 bits of the name length with the low 5 bits of the first character
in the name.

I haven't studied the properties of this scheme in detail.  I sort of
threw it together in a hurry one evening.  Some advantages:
        * easy to layer on top of an existing system
        * doesn't consume much memory
        * hash cache memory can be discarded after the application is compiled

A fully-hashed dictionary like Ray Duncan uses is probably "better"
in some sense.

Binary trees have been tried.  I recall a paper by Hans Nieuwenhuiuzen
(I think) describing a tree-structured dictionary.  As I recall, the
disadvantages were:
    Some problem with using multiple vocabularies (although I can't
    think what it might be)

    Loss of chronological ordering; difficulty with redefining words.

These difficulties don't seem insurmountable.

Mitch

bouma@cs.purdue.EDU (William J. Bouma) (10/11/90)

In article <9010101326.AA20013@ucbvax.Berkeley.EDU> Mitch Bradley <wmb%MITCH.ENG.SUN.COM@CUNYVM.CUNY.EDU> writes:
>
>Binary trees have been tried.  I recall a paper by Hans Nieuwenhuiuzen
>(I think) describing a tree-structured dictionary.  As I recall, the
>disadvantages were:
>    Some problem with using multiple vocabularies (although I can't
>    think what it might be)
>
>    Loss of chronological ordering; difficulty with redefining words.
>
>These difficulties don't seem insurmountable.

  My system, FSH, uses balanced binary trees for the dictionary.  I have
  found there to be NO problems with this scheme unless you have limited
  memory of course.  I don't.  The header of a fsh word looks like this:

  typedef struct _word_ {
      char *name;                    Word name.
      funcptr f;                     Code for this word.
      struct _word_ **arg;           Argument for the code.
      char kind;                     Immediate, etc.
      char bal;                      Tree weight for balancing.
      short int mark;                General purpose.
      struct _word_ *parent;         Parent word.
      struct _word_ *vocab;          Children.
      struct _word_ *left, *right;   Siblings.
  } word;

  As you can see, there is provision for every word to have a sub-vocabulary.
  This allows for a neat uniform way of handling local variables and encap-
  sulated words.  Dictionary search is very neat and simple, although not as
  flexible as I would like.  Search begins in the current word's 'vocab'. If
  the word is not found, search proceeds to the parent word.  The top word,
  "forth" has no parent, so search stops there.  If you want an isolated
  word, just set its parent to 0.  For compiling, the current word is set
  to the word being compiled.  Any defining words, (:, variable), define
  the new word within the current word's vocab.  Since search also begins
  there, you are assured that the local variable definition will overshadow
  any preceeding definition.

  Why care about chronological order?

  Redefining words is simple.  I also have easy control over whether a new
  definition replaces the old definition or shadows it.  To replace, just
  find the original word, free the memory used by 'arg' and (recursively)
  'vocab'.  Then insert the new definions for 'f', 'arg', and 'vocab'. To
  shadow, simply build a new word and splice it into the tree where the 
  old one was.  All previous words will retain the original definition.

  It is fast.  Searching and inserting into a balanced binary tree takes
  log(n) time.  That means if there are 1000 words in a vocabulary it will
  take at most 10 checks to find the word you want.  2000 words and it will
  only take 11 tries to find out the word you want isn't there 8^).
-- 
Bill <bouma@cs.purdue.edu>      Just ask the Axis   He knows everything