johnl@ima.UUCP (Compilers mailing list) (06/17/86)
I got a copy of the Colorado group's tech report on GLA (Generator for Lexical Analyzers) and wasn't terribly impressed. It may be fast, but it's a lot of machinery for rather modest results. It is very much oriented towards fairly conventional programming languages, and is not a general-purpose substitute for LEX (I don't think they ever claimed it was, but the point should be made more explicitly). It's not even very versatile at handling programming languages; for example, it can't handle C's hexadecimal numbers or string continuations.

To my mind, the most useful part of the package would be several libraries provided with it, encapsulating things like string storage and target-machine integer arithmetic. Given the limited domain of application of GLA, I suspect that a better approach would be to invest a bit more effort on the libraries and supply a "boilerplate" C scanner invoking them. The problem of scanning conventional programming languages just isn't complex enough to justify a program-generator system. (Caveat: I haven't seen the software, just the tech report, but I'd be very surprised if there was much hidden complexity. I have written lexical analyzers, including two for C.)

I must admit, though, it's cheap enough: $25 plus two signed copies of the license agreement. In fact, I'd suggest raising the price, because $25 will barely cover materials costs and can't possibly cover the time and effort needed to do software distribution. (I've done such work.)

Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry
--
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (Compilers mailing list) (06/24/86)
Thank you Henry for the mini-review on GLA. There are a couple of points that I would like to bring out. I'm not sure there was a need for you to be "impressed"; GLA merely generates a scanner for programming languages that runs fast. I think the main points are 1) why did we build GLA, and 2) why use it. You answered both points:

> It may be fast, but it's a lot of machinery for rather modest results.

Well, it is fast. Let me give some motivation. Last fall I was comparing and measuring the speed/size of GAG-generated compilers against hand-written compilers. Based on rumors, I had expected that the semantic processing was going to be terrible for the GAG-generated compilers. Well, surprise: lexing and parsing took 75% of the analysis time (no codegen). It was the "easy" stuff that slowed down the GAG-generated compiler! We replaced the GAG-provided lexer and parser with an earlier GLA and a fast parser we were working on. The resulting analyzer (for Pascal) ran faster than 4.2BSD pc, but not as fast as VMS-PASCAL.

There is plenty of additional evidence that one needs to pay close attention to lexing. Have you ever seen LEX used for a production compiler? Cc, pc, and cpp don't use LEX, nor do most other frequently used compilers. People write their own lexers because at this phase of the compilation process EVERY character has to be examined, and you want this examination to go very fast. (Parsing looks at tokens, which occur 5-10 times less frequently than characters of input.)

"... lot of machinery ...". I like to look at time and space requirements of code. The GLA-generated lexer uses a lot less of these than a LEX lexer. I assume Henry's comment was really referring to the auxiliary GLA software for source input, error reporting in source coordinates, string memory (you don't want to use malloc), identifier table and keywords, and denotation value management.
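[Moderator's illustration: the string-memory and identifier-table support mentioned above might look roughly like the following. This is a minimal sketch, not GLA's actual interface; the names `intern`, `pool`, and the fixed sizes are invented for the example. The point is that identifiers are copied once into a private pool rather than malloc'ed per token.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of scanner support machinery: identifiers live in one
 * private string pool (no per-token malloc) and are mapped to small
 * integer indices, so the rest of the compiler compares ints, not
 * strings.  Layout and names are illustrative only. */

#define POOL_SIZE 4096
#define MAX_IDS   256

static char pool[POOL_SIZE];      /* string memory, filled left to right */
static size_t pool_used = 0;
static const char *ids[MAX_IDS];  /* id index -> interned string */
static int n_ids = 0;

/* Return a stable small index for name, interning it on first sight.
 * (A real implementation would hash; linear search keeps the sketch short.) */
int intern(const char *name)
{
    int i;
    size_t len = strlen(name) + 1;
    for (i = 0; i < n_ids; i++)
        if (strcmp(ids[i], name) == 0)
            return i;
    assert(pool_used + len <= POOL_SIZE && n_ids < MAX_IDS);
    memcpy(pool + pool_used, name, len);
    ids[n_ids] = pool + pool_used;
    pool_used += len;
    return n_ids++;
}
```

Keywords can be handled by pre-interning them, so keyword recognition falls out of the same lookup.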
Both LEX and the hand approach require that you provide all of this yourself AND work out the non-trivial interfacing details to the parser.

> The problem of scanning conventional programming languages just
> isn't complex enough to justify a program-generator system.
> I have written lexical analyzers, including two for C.

Now here is the meat of the discussion. You imply that scanning is not very complex (true), but does this mean that lexers should be constructed by hand? I definitely don't believe that just because something is easy (arithmetic, for example), we should ignore tools (calculators). Why not use whatever tools help accomplish a quality product?

Furthermore, how many people know about or care to deal with the subtle issues of lexing, such as compact case labels? Your lexer will run up to 25% slower if you "switch" on the input character. (Anyone interested should see W. M. Waite, "The Cost of Lexical Analysis", Software--Practice and Experience 16, 1986.) And what about reliability? The code to properly handle tabs, comments, and floating point numbers is tricky. How about efficient keyword recognition? There are so many pitfalls. How long does it take to hand-build a fast, reliable lexer for an arbitrary programming language? I would much rather give a specification to a tool so I will have time to work on hard, interesting problems.

> It is very much
> oriented towards fairly conventional programming languages, and is not
> a general-purpose substitute for LEX (I don't think they ever claimed
> it was, but the point should be made more explicitly). It's not even
> very versatile at handling programming languages; for example, it can't
> handle C's hexadecimal numbers or string continuations.

Yes, GLA is not a regular expression tool, but please acknowledge that one often has to step outside of regular expressions to handle common lexing problems. (Ever see the contortions necessary to process a C-style comment in LEX?) Why did you write several lexers by hand?
Was it because LEX and regular expressions just did not fit the problem at hand? Or time/space? There are no inherent limitations in GLA that prevent recognition of C hex numbers, or strings that span lines. It is just that we did not implement that stuff in the current version. It is not hard to take the produced lexer (in C code) and add things we have not implemented.

> To my mind, the most useful part of the package would be several libraries
> provided with it, encapsulating things like string storage and target-
> machine integer arithmetic. Given the limited domain of application of
> GLA, I suspect that a better approach would be to invest a bit more effort
> on the libraries and supply a "boilerplate" C scanner invoking them.

This IS the approach used; just don't trivialize the "boilerplate". GLA has to produce the boilerplate dynamically or else the resulting scanner will suffer in speed. Yes, GLA is not an exciting, complex tool, but it saves lots of tedious time when one needs a lexer.

-bob gray
bob@boulder.csnet
{seismo!hao, ucbvax!nbires}!boulder!bob
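[Moderator's illustration: the kind of hand addition Bob describes above -- editing GLA's produced C code to recognize, say, a C hexadecimal constant -- might look like this. The function name and interface are invented for the example; a real generated scanner would splice this into its own dispatch.]

```c
#include <assert.h>
#include <ctype.h>

/* Recognize a C hexadecimal constant ("0x1F") at p.
 * Returns the number of characters consumed, or 0 if p does not
 * start a hex constant.  Suffixes (u, l) are left to the caller. */
static int scan_hex(const char *p)
{
    const char *start = p;
    if (p[0] != '0' || (p[1] != 'x' && p[1] != 'X'))
        return 0;
    p += 2;
    if (!isxdigit((unsigned char)*p))
        return 0;                 /* "0x" alone is not a constant */
    while (isxdigit((unsigned char)*p))
        p++;
    return (int)(p - start);
}
```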
johnl@ima.UUCP (Compilers mailing list) (07/13/86)
On reflection, I think these issues deserve a bit more comment, and I suspect that some aspects are of sufficiently general interest to make it a followup rather than a private reply.

> ... one needs to pay close
> attention to lexing. Have you ever seen LEX used for a production compiler?
> Cc, pc, cpp don't use LEX, nor do most other frequently used compilers.

At no time was I defending the use of LEX for production compilers. LEX's strong point is convenience and flexibility, which suits it well to things like experimenting with notation. For example, I tend to use LEX when I'm basically inventing a new language for some specialized job, and I have no particularly good idea of what it's going to look like. In this situation, efficiency is not my major concern. I would not use LEX for a production compiler, but that wasn't the issue.

> "... lot of machinery ...". I like to look at time and space requirements
> of code. The GLA generated lexer uses a lot less of these than a LEX lexer.

This is an unrealistic comparison, since we have already agreed that LEX is unsuited to this application. Actually, when I said "lots of machinery", I wasn't referring so much to time and space as to the complexity of the human interface of the scanner generator.

> I assume Henry's comment was really referring to the auxiliary GLA
> software...

Actually, the auxiliary software strikes me as the most valuable part of GLA, since (as Bob points out) much of it has to be done anyway, and doing it well is a hassle. My objection to GLA is that it's not clear to me that one really needs to top off this useful software with an elaborate and quite inflexible scanner generator.

> >I have written lexical analyzers, including two for C.
>
> Why did you write several lexers by hand?
> Was it because LEX and regular expressions just did not fit the problem
> at hand? or Time/Space?
LEX could probably have handled the syntax, but since these were meant to be production-quality scanners, LEX's efficiency problems made it unacceptable. A contributing factor for the C lexers in particular was a desire to avoid dependencies on non-trivial Unix-specific utilities.

> How long does it take to hand build a fast reliable lexer for an
> arbitrary programming language?

Not long, if you (a) bear in mind that programming languages use very stereotyped lexical forms -- they are not "arbitrary" -- and (b) work from an existing high-quality design rather than starting from scratch. Note that I did not advocate starting from scratch each time; I advocated re-using existing code, such as a "boilerplate" scanner. This is a much-neglected approach to problems which are (1) stereotyped, (2) too variable for a library function, and (3) too simple for a program generator. It doesn't necessarily generate a grade-A scanner, but it yields a B+ simply and quickly.

> >... It's not even
> >very versatile at handling programming languages; for example, it can't
> >handle C's hexadecimal numbers or string continuations.
>
> ... There is no inherent limitations in GLA
> that prevent recognition of C hex numbers, or strings that span lines.

My point here was not that there is some sort of intrinsic limit, but that a piece of software which claims to be a generic lexer generator in fact cannot generate a lexer for a common, important, not-too-messy language. This actually is the heart of my objection to GLA: it makes me learn a fairly elaborate piece of machinery which can't cope with a very wide range of jobs. I strongly suspect that it was built for doing one particular language and variants thereon, and nobody took the trouble to generalize it.

Despite negative implications earlier, I actually do believe that there is a place for a good programming-language-oriented lexer generator. However, the current GLA is not it.
What's needed is something that is relatively straightforward to use (GLA strikes me as marginal here), consistently generates grade-A scanners (GLA probably does), and is flexible enough to handle most programming languages without drastic measures like hand-editing the resulting code (GLA falls down badly here). Anybody want to write one?

Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry
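[Moderator's illustration: Henry's "boilerplate" scanner idea -- a fixed dispatch loop that is adapted to a new language by editing tables, not by regenerating code -- might be sketched as follows. Token codes, the keyword list, and the `next_token` interface are all invented for the example; a B+ scanner, not a grade-A one.]

```c
#include <ctype.h>
#include <string.h>

/* Boilerplate scanner core: the loop below stays fixed; adapting it
 * to another conventional language mostly means editing the keyword
 * table and adding punctuation cases. */

enum tok { T_EOF, T_IDENT, T_KEYWORD, T_NUMBER, T_PUNCT };

static const char *keywords[] = { "if", "else", "while", NULL };

/* Scan the next token at *pp, advancing *pp past it. */
enum tok next_token(const char **pp)
{
    const char *p = *pp;
    while (isspace((unsigned char)*p))      /* skip blanks, tabs, newlines */
        p++;
    if (*p == '\0') { *pp = p; return T_EOF; }
    if (isalpha((unsigned char)*p)) {       /* identifier or keyword */
        const char *start = p;
        int i;
        while (isalnum((unsigned char)*p))
            p++;
        *pp = p;
        for (i = 0; keywords[i] != NULL; i++)
            if (strlen(keywords[i]) == (size_t)(p - start) &&
                strncmp(keywords[i], start, (size_t)(p - start)) == 0)
                return T_KEYWORD;
        return T_IDENT;
    }
    if (isdigit((unsigned char)*p)) {       /* integer constant */
        while (isdigit((unsigned char)*p))
            p++;
        *pp = p;
        return T_NUMBER;
    }
    *pp = p + 1;                            /* single-character punctuation */
    return T_PUNCT;
}
```

Floating point, comments, and the string-pool bookkeeping discussed earlier in the thread would be further cases in the same loop; the claim is only that copying and editing such a skeleton is often quicker than learning a generator.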