steve@oakhill.UUCP (steve) (11/12/88)
<11112@cup.portal.com> In article <11112@cup.portal.com>, PLS@cup.portal.com (Paul L Schauble) writes: > The tricks in a Fortran lexical analyzer are indeed interesting. What I'd > like to know is: Are these documented anywhere? Surely with all of those When I wrote the lexical analyzer for F77 fortran, I could find no such bag of tricks. I should add that the reason I got pulled into the company was because I had a good background in front end compiler theory. Most of the tricks become self-evident in heat of writing a lexer, others took some imagination. This is noway a complete list of all the tricks, just the ones I can come off of on the top of my head. 1) Read in the whole statement at once into a buffer. This solves any complications arising from continuation lines and inbedded comments. Also since you have a limited number of continuation lines allowed, you can pre-allocated the buffer used (Note : If they have extra continuation lines then they should be willing to pay the price when they fill the buffer. They probably will not fill up the buffer though because of reason 2.) 2) As you read in the line throw out inbedded spaces. Since they are not significant, they only get in the way. This actually gives some spare room in your buffer. You need to be careful not to throw out spaces in quotes and holeriths, people get mad when you do (Since holeriths are not part of the core 77 standard, you could make this job easier if you don't care about supporting this feature). 3) Treat each line as a token sequence with an END OF STATEMENT (EOS) token at the end. This will help with parsing and error catching. Also sending an EOS token will help the lexer know when it can read the next line. 4) Since you need to read the first 6 columns to tell if the next statement is a continuation, store these away for future use (if label or if you allow special statements denoted by special first characters). Before reading the next line, check these to see if you have an initial token (line number) or a special action (ignore if debug line, (you still have to read off the full line), etc.) 5) Before you begin parsing, figure out what type of line you have, this is where most of the tricks occur. A) When reading a line keep track of the parenthesis count and wheather an equal sign occurs outside all parenthesis. If one does, the statement is an assignment or a DO statement. If an equal sign does occur at this level and a comma occurs at this level after the equal sign and the first two letter are 'DO' then you have a DO statement. B) All other statements the initial character sequence can be checked against a table for statementness. Some statements need some pre-parsing though. REAL FUNCTION A(34) and REAL FUNCTION A(B) look the same til you find what in the parenthsis (assuming you allow long variable names). The first declares the array FUNCTIONA to be of size 34, the second is the initial statement of a function A with parameter B. Note, you cannot decide on what type of statement you are looking for from the current phase of the program (declaration phase, data statement phase, statement phase, etc.) since both the above statements could occur as the first statement of a file. Unfortunately I do not have the list of these available here. C) Any thing preparse should be put on a lex-queue for future reference. No need reinventing the wheel in the same program. If you are putting pre-parse quanities on the queue, also preparse the ASSIGN statement up to the TO (the only occurence of the TO keyword, and the only keyword appearing after the initial keyword sequence). With this done, all unread tokens can be read be a simple lexer, and appear to be what they are. D) The CASE= directives in I/O statements should be treated as key words. They are checked for in the following fashion. In an I/O statement at the first parenthesis, turn on a flag to search for these. Once the flag is on, check all identifiers for a following equals sign. If it has one, look it up in a table. Turn the flag off at the closing parenthesis. The flag is necessary. CASE= is valid in other constructs, including the implied do of the READ and WRITE. E) Treat labels as integers at lex time. At parse time they become apparent as to what they are. So the next token is derived like this if (new statement) if (inital token) keep new statement flag on if (token is label) return label else if (special action) do special action else Read in the new statement Determine what the line is (find the initial token, put other tokens on the lex-queue). Return the first token on the line. else (if the is something on the lex-queue) Pop it off the lex stack return it. else (if anything else to read in buffer) Lex off the next token and return it. else Return an EOS token enough from this mooncalf - Steven ---------------------------------------------------------------------------- These opinions aren't necessarily Motorola's or Remora's - but I'd like to think we share some common views. ---------------------------------------------------------------------------- Steven R Weintraub cs.utexas.edu!oakhill!devsys!steve Motorola Inc. Austin, Texas (512) 440-3023 (office) (512) 453-6953 (home) ----------------------------------------------------------------------------