[net.lang] Complex/Simple scanners - More hashing!

armstron@sjuvax.UUCP (L. Armstrong) (10/16/85)

*** YUM YUM - EAT 'EM UP ***

I have a question that has been bugging me for quite some time, and I was
wondering what the opinions of others are on the subject.

My question is this:
	Do you think it would be faster for a compiler to have a more
	complex scanner that could recognize key-words/reserved identifiers
	immediately (i.e., simply by keeping a record of each input
	character) and not have to hash and search for those key-words
	each time one is encountered, or would it be better to keep the
	scanner simple, recognizing each key-word as simply an identifier,
	and then search the symbol table to determine whether or not
	your identifier was reserved?

	It is not immediately obvious to me which one would be quicker,
	although I have a tendency to think it may be the former.
	Lex allows one to design a scanner of the first type
	very simply.
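To make the contrast concrete, here is a sketch (not from the post; the token names and the two-keyword language are invented for illustration) of the first approach: the scanner tracks a possible keyword match character by character as it consumes the input, so by the time the identifier ends it already knows whether it saw a keyword. Here the recognizer is hand-coded for just the words "if" and "in"; Lex would generate the full state table from a specification.

```c
/* Hand-coded fragment of a "complex scanner": a tiny DFA for the
   keywords "if" and "in", folded into identifier scanning.  No table
   lookup is needed after the fact -- the state says it all.
   Hypothetical token names; real compilers have many more states. */
#include <ctype.h>

enum tok { TOK_IF, TOK_IN, TOK_IDENT };

enum tok scan_word(const char *p)
{
    /* states: 0 = start, 1 = saw "i", 2 = saw "if", 3 = saw "in",
       -1 = can no longer be a keyword */
    int state = 0;
    for (; isalpha((unsigned char)*p); p++) {
        switch (state) {
        case 0:  state = (*p == 'i') ? 1 : -1; break;
        case 1:  state = (*p == 'f') ? 2 : (*p == 'n') ? 3 : -1; break;
        case 2:
        case 3:  state = -1; break;  /* keyword followed by more letters */
        default: break;              /* stay in the "not a keyword" state */
        }
    }
    if (state == 2) return TOK_IF;
    if (state == 3) return TOK_IN;
    return TOK_IDENT;                /* e.g. "inch", "i", "x" */
}
```

Note that adding a keyword means adding states, which is exactly the table-growth cost discussed below in this thread.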



						Thanks,
						-Len
-- 

NAME		Len Armstrong
UUCP		{astrovax | bpa | burdvax | allegra }!sjuvax!armstron
ORGANIZATIONS	RCA Advanced Technology Labs
		St. Joseph's University
PHONES		(WORK) (609) 866-6647

stevev@tekchips.UUCP (Steve Vegdahl) (10/21/85)

> 	Do you think it would be faster for a compiler to have a more
> 	complex scanner that could recognize key-words/reserved identifiers
> 	immediately (i.e., simply by keeping a record of each input
> 	character) and not have to hash and search for those key-words
> 	each time one is encountered, or would it be better to keep the
> 	scanner simple, recognizing each key-word as simply an identifier,
> 	and then search the symbol table to determine whether or not
> 	your identifier was reserved?
> 
> 	It is not immediately obvious to me which one would be quicker,
> 	although I have a tendency to think it may be the former.  
> 	Lex allows for one to design a scanner of the first type 
> 	very simply.

If you ignore things like virtual memory, I'd think that the complex scanner
would be faster if implemented in a straightforward way.  However, this
would increase the size of the state-transition table.  Instead of having
one or two character classes for letters, you'd have something like 26.

I prefer to use hashing here because the set of reserved words is known
at compiler-compile time.  Thus, you can analyze the reserved words in
advance and select a *perfect hash function* (or something close to it).
The size of your hash table will be small, and you will be guaranteed
to find any reserved word in a single probe of the table.
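The perfect-hash idea can be sketched for a toy keyword set (my example, not from the post): for the four words below, the function (length + first character) mod 7 happens to map each word to a distinct slot, so one probe plus one strcmp decides keyword versus identifier. For a real keyword set such a function can be found in advance by exhaustive search, since the set is fixed at compiler-compile time.

```c
/* Perfect-hash keyword lookup for a hypothetical four-word set.
   hash(s) = (strlen(s) + s[0]) mod 7 gives:
     "else" -> 0, "return" -> 1, "if" -> 2, "while" -> 5  (all distinct) */
#include <string.h>

enum tok { TOK_IF, TOK_ELSE, TOK_WHILE, TOK_RETURN, TOK_IDENT };

static const struct { const char *word; enum tok t; } table[7] = {
    [0] = { "else",   TOK_ELSE   },
    [1] = { "return", TOK_RETURN },
    [2] = { "if",     TOK_IF     },
    [5] = { "while",  TOK_WHILE  },
    /* slots 3, 4, and 6 stay empty (word == NULL) */
};

enum tok classify(const char *s)
{
    unsigned h = (unsigned)(strlen(s) + (unsigned char)s[0]) % 7;
    if (table[h].word && strcmp(table[h].word, s) == 0)
        return table[h].t;          /* single probe, guaranteed hit */
    return TOK_IDENT;               /* empty slot or strcmp mismatch */
}
```

The final strcmp is still required -- the hash is perfect only over the reserved words themselves, so an identifier like "whilst" can land in an occupied slot and must be rejected by comparison.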

		Steve Vegdahl
		Computer Research Lab.
		Tektronix, Inc.
		Beaverton, Oregon