[mod.compilers] Generator of Lexical Analyzers, mini-review

johnl@ima.UUCP (Compilers mailing list) (06/17/86)

I got a copy of the Colorado group's tech report on GLA (Generator for
Lexical Analyzers) and wasn't terribly impressed.  It may be fast, but
it's a lot of machinery for rather modest results.  It is very much
oriented towards fairly conventional programming languages, and is not
a general-purpose substitute for LEX (I don't think they ever claimed
it was, but the point should be made more explicitly).  It's not even
very versatile at handling programming languages; for example, it can't
handle C's hexadecimal numbers or string continuations.

To my mind, the most useful part of the package would be several libraries
provided with it, encapsulating things like string storage and target-
machine integer arithmetic.  Given the limited domain of application of
GLA, I suspect that a better approach would be to invest a bit more effort
on the libraries and supply a "boilerplate" C scanner invoking them.  The
problem of scanning conventional programming languages just isn't complex
enough to justify a program-generator system.  (Caveat:  I haven't seen the
software, just the tech report, but I'd be very surprised if there was much
hidden complexity.  I have written lexical analyzers, including two for C.)

I must admit, though, it's cheap enough:  $25 plus two signed copies of
the license agreement.  In fact, I'd suggest raising the price, because
$25 will barely cover materials costs and can't possibly cover the time
and effort needed to do software distribution.  (I've done such work.)

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima

Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (Compilers mailing list) (06/24/86)

Thank you, Henry, for the mini-review of GLA.  There are a couple
of points that I would like to bring out.  I'm not sure there
was a need for you to be "impressed"; GLA merely generates a scanner
for programming languages that runs fast.  I think the main points
are 1) why we built GLA, and 2) why use it.  You answered both points:

1> It may be fast, but it's a lot of machinery for rather modest results.

Well, it is fast.  Let me give some motivation.  Last fall I was
comparing and measuring the speed and size of GAG-generated compilers
against hand-written compilers.  Based on rumors, I had expected that
semantic processing would be terrible in the GAG-generated compilers.
Well, surprise: lexing and parsing took 75% of the analysis time (no codegen).
It was the "easy" stuff that slowed down the GAG-generated compiler!
We replaced the GAG-provided lexer and parser with an earlier GLA
and a fast parser we were working on.  The resulting analyzer (for Pascal)
ran faster than 4.2BSD pc, but not as fast as VMS-PASCAL.

There is plenty of additional evidence that shows that one needs to pay close
attention to lexing.  Have you ever seen LEX used for a production compiler?
cc, pc, and cpp don't use LEX, nor do most other frequently used compilers.
People write their own lexers because at this phase of the compilation
process, EVERY character has to be examined and you want this examination to
go very fast.  (Parsing looks at tokens, which occur 5-10 times less
frequently than characters of input).

"... lot of machinery ...".  I like to look at time and space requirements
of code.  The GLA generated lexer uses a lot less of these than a LEX lexer.
I assume Henry's comment was really referring to the auxiliary GLA
software for source input, error reporting in the source coordinates,
string memory (you don't want to use malloc), identifier table and
keywords, and denotation value management.  Both LEX and the hand
approach require that you provide all of this yourself AND work out
the non-trivial interfacing details to the parser.
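
To make the string-memory point concrete, a module of the kind I have in
mind might look roughly like this; it is a sketch only, not the actual GLA
library.  Token text is copied into large blocks, so the lexer does one
malloc per block rather than one per identifier:

    #include <stdlib.h>
    #include <string.h>

    /* Sketch of a string-memory module; not the actual GLA library. */

    #define POOL_BLOCK 8192

    static char *pool_next = NULL;      /* next free byte in current block */
    static size_t pool_left = 0;        /* bytes remaining in current block */

    /* Copy len bytes of token text into the pool; return a stable,
       NUL-terminated copy of it. */
    char *pool_save(const char *text, size_t len)
    {
        char *p;

        if (len + 1 > pool_left) {              /* need a fresh block */
            size_t size = (len + 1 > POOL_BLOCK) ? len + 1 : POOL_BLOCK;

            pool_next = malloc(size);           /* one malloc per block */
            if (pool_next == NULL)
                return NULL;
            pool_left = size;
        }
        p = memcpy(pool_next, text, len);
        p[len] = '\0';
        pool_next += len + 1;
        pool_left -= len + 1;
        return p;
    }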

2> The problem of scanning conventional programming languages just
 > isn't complex enough to justify a program-generator system.
 > I have written lexical analyzers, including two for C.

Now here is the meat of the discussion.  You imply that scanning
is not very complex (true), but does this mean that lexers should be
constructed by hand?
I definitely don't believe that just because something is easy
(arithmetic, for example), we should ignore tools (calculators).
Why not use whatever tools help accomplish a quality product?
Furthermore, how many people know about, or care to deal with, the subtle
issues of lexing, such as compact case labels?  Your lexer will run
up to 25% slower if you "switch" on the input character.  (Anyone interested
should see W.M. Waite, 'Cost of Lexical Analysis', Software P&E 16, 1986.)
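
For anyone who has not run into the issue, here is a rough sketch of what
compact case labels buy you; this is illustrative code, not GLA's generated
scanner.  Each character is classified through a 256-entry table, so the
switch in the inner loop has a handful of dense case labels instead of one
per possible character:

    #include <stdio.h>

    /* Illustrative only, not GLA's generated scanner. */

    enum { LETTER, DIGIT, SPACE, OTHER };

    static unsigned char class_of[256];     /* character -> class, built once */

    static void init_classes(void)
    {
        int c;

        for (c = 0; c < 256; c++) class_of[c] = OTHER;
        for (c = 'a'; c <= 'z'; c++) class_of[c] = LETTER;
        for (c = 'A'; c <= 'Z'; c++) class_of[c] = LETTER;
        for (c = '0'; c <= '9'; c++) class_of[c] = DIGIT;
        class_of[' '] = class_of['\t'] = class_of['\n'] = SPACE;
    }

    static void scan(FILE *fp)
    {
        int c;

        init_classes();
        while ((c = getc(fp)) != EOF) {
            switch (class_of[c]) {          /* dense: 0..3, not 0..255 */
            case LETTER:                    /* identifier */
                do { c = getc(fp); }
                while (c != EOF && (class_of[c] == LETTER || class_of[c] == DIGIT));
                if (c != EOF) ungetc(c, fp);
                break;
            case DIGIT:                     /* integer denotation */
                do { c = getc(fp); }
                while (c != EOF && class_of[c] == DIGIT);
                if (c != EOF) ungetc(c, fp);
                break;
            default:                        /* white space and everything else */
                break;
            }
        }
    }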

And what about reliability?  The code to properly handle tabs,
comments, and floating-point numbers is tricky.  How about
efficient keyword recognition?  There are so many pitfalls.
How long does it take to hand-build a fast, reliable lexer for an
arbitrary programming language?  I would much rather give a specification
to a tool so I will have time to work on hard, interesting problems.
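
As one example, a common scheme for keyword recognition (not necessarily
the one GLA uses) is to group the keywords by length, so an identifier is
compared only against the few keywords of the same length.  A rough sketch:

    #include <string.h>

    /* One common scheme, not necessarily GLA's.  Lists abbreviated;
       the remaining keywords and lengths are omitted here. */

    static const char *kw2[] = { "do", "if", 0 };
    static const char *kw3[] = { "for", "int", 0 };
    static const char *kw4[] = { "case", "char", "else", "goto", "long", "void", 0 };
    static const char *kw5[] = { "break", "float", "short", "union", "while", 0 };
    static const char *kw6[] = { "double", "return", "sizeof", "static", "struct", "switch", 0 };
    static const char **kw_by_len[] = { 0, 0, kw2, kw3, kw4, kw5, kw6 };

    /* Nonzero if the identifier just scanned (text, len) is a keyword. */
    int is_keyword(const char *text, size_t len)
    {
        const char **kw;

        if (len >= sizeof kw_by_len / sizeof kw_by_len[0] || kw_by_len[len] == 0)
            return 0;
        for (kw = kw_by_len[len]; *kw != 0; kw++)
            if (strcmp(*kw, text) == 0)
                return 1;
        return 0;
    }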

>It is very much
>oriented towards fairly conventional programming languages, and is not
>a general-purpose substitute for LEX (I don't think they ever claimed
>it was, but the point should be made more explicitly).  It's not even
>very versatile at handling programming languages; for example, it can't
>handle C's hexadecimal numbers or string continuations.

Yes, GLA is not a regular expression tool, but please acknowledge that
one often has to step outside of regular expressions to handle common
lexing problems.  (Ever seen the contortions necessary to process
a C-style comment in LEX?)  Why did you write several lexers by hand?
Was it because LEX and regular expressions just did not fit the problem
at hand, or because of time/space?  There are no inherent limitations in GLA
that prevent recognition of C hex numbers, or strings that span lines.
It is just that we did not implement that stuff in the current version.
It is not hard to take the produced lexer (in C code) and add things
we have not implemented.
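
As an illustration of the sort of hand extension I mean, a routine for C
hexadecimal constants might look roughly like the following.  This is
hypothetical code, not actual GLA output:

    #include <stdio.h>
    #include <ctype.h>

    /* Hypothetical hand extension, not actual GLA output: recognize a C
       hexadecimal constant once a leading '0' has already been consumed.
       c is the character following that '0'. */
    int scan_hex(FILE *fp, int c, unsigned long *value)
    {
        if (c != 'x' && c != 'X') {
            if (c != EOF)
                ungetc(c, fp);          /* plain decimal/octal constant instead */
            return 0;
        }
        *value = 0;
        for (c = getc(fp); isxdigit(c); c = getc(fp))
            *value = *value * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
        if (c != EOF)
            ungetc(c, fp);              /* push back the first non-hex character */
        return 1;
    }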

>To my mind, the most useful part of the package would be several libraries
>provided with it, encapsulating things like string storage and target-
>machine integer arithmetic.  Given the limited domain of application of
>GLA, I suspect that a better approach would be to invest a bit more effort
>on the libraries and supply a "boilerplate" C scanner invoking them.

This IS the approach used, just don't trivialize the "boilerplate".
GLA has to produce the boilerplate dynamically or else the resulting
scanner will suffer in speed.  Yes, GLA is not an exciting, complex tool,
but it saves a lot of tedious time when one needs a lexer.

-bob gray  bob@boulder.csnet {seismo!hao, ucbvax!nbires}!boulder!bob

-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima

Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (Compilers mailing list) (07/13/86)

On reflection, I think these issues deserve a bit more comment, and I
suspect that some aspects are of sufficiently general interest to make
it a followup rather than a private reply.

> ... one needs to pay close
> attention to lexing.  Have you ever seen LEX used for a production compiler?
> cc, pc, and cpp don't use LEX, nor do most other frequently used compilers.

At no time was I defending the use of LEX for production compilers.  LEX's
strong points are convenience and flexibility, which suit it well to things
like experimenting with notation.  For example, I tend to use LEX when I'm
basically inventing a new language for some specialized job, and I have no
particularly good idea of what it's going to look like.  In this situation,
efficiency is not my major concern.  I would not use LEX for a production
compiler, but that wasn't the issue.

> "... lot of machinery ...".  I like to look at time and space requirements
> of code.  The GLA generated lexer uses a lot less of these than a LEX lexer.

This is an unrealistic comparison, since we have already agreed that LEX
is unsuited to this application.  Actually, when I said "lots of machinery",
I wasn't referring so much to time and space as to the complexity of the
human interface of the scanner generator.

> I assume Henry's comment was really referring to the auxiliary GLA
> software...

Actually, the auxiliary software strikes me as the most valuable part
of GLA, since (as Bob points out) much of it has to be done anyway, and
doing it well is a hassle.  My objection to GLA is that it's not clear
to me that one really needs to top off this useful software with an
elaborate and quite inflexible scanner generator.

>  >I have written lexical analyzers, including two for C.
> 
> Why did you write several lexers by hand?
> Was it because LEX and regular expressions just did not fit the problem
> at hand, or because of time/space?

LEX could probably have handled the syntax, but since these were meant
to be production-quality scanners, LEX's efficiency problems made it
unacceptable.  A contributing factor for the C lexers in particular
was a desire to avoid dependencies on non-trivial Unix-specific utilities.

> How long does it take to hand build a fast reliable lexer for an
> arbitrary programming language?

Not long, if you (a) bear in mind that programming languages use very
stereotyped lexical forms -- they are not "arbitrary", and (b) work from
an existing high-quality design rather than starting from scratch.  Note
that I did not advocate starting from scratch each time; I advocated
re-using existing code, such as a "boilerplate" scanner.  This is a
much-neglected approach in problems which are (1) stereotyped, (2) too
variable for a library function, and (3) too simple for a program generator.
It doesn't necessarily generate a grade-A scanner, but it yields a B+
simply and quickly.

> >... It's not even
> >very versatile at handling programming languages; for example, it can't
> >handle C's hexadecimal numbers or string continuations.
> 
> ...  There are no inherent limitations in GLA
> that prevent recognition of C hex numbers, or strings that span lines.

My point here was not that there is some sort of intrinsic limit, but
that a piece of software which claims to be a generic lexer generator
in fact cannot generate a lexer for a common, important, not-too-messy
language.  This actually is the heart of my objection to GLA:  it makes
me learn a fairly elaborate piece of machinery which can't cope with a
very wide range of jobs.  I strongly suspect that it was built for doing
one particular language and variants thereon, and nobody took the trouble
to generalize it.

Despite negative implications earlier, I actually do believe that there is
a place for a good programming-language-oriented lexer generator.  However,
the current GLA is not it.  What's needed is something that is relatively
straightforward to use (GLA strikes me as marginal here), consistently
generates grade-A scanners (GLA probably does), and is flexible enough
to handle most programming languages without drastic measures like
hand-editing the resulting code (GLA falls down badly here).  Anybody
want to write one?

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima

Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request