smryan@garth.uucp (Steven Ryan) (06/14/88)
I once heard a suggestion that humans use a different parsing strategy than
compilers. Compilers use a deep parse with lotsa malenky rules and produce a
very impressive parse tree. The suggestion was that humans memorise faster
than they generate, and are better at handling large chunks in parallel than
small chunks nested. What I take that to mean is that we memorise words in
each inflected form, even regular forms, and possibly even groups of words.
Then language generation consists of inserting these chunks into a verb frame
chunk, with each sentence form being a different chunk.

I think I'll call this suggestion shallow parsing: the parse tree only goes
down two or three levels before running into unanalysed (id est memorised)
chunks. In terms of productions, this would mean having thousands of similar
yet distinct productions instead of factoring the similarities to reduce the
number of productions.

What interests me about this is the possible application to compilers: humans
parse an ambiguous and nondeterministic language in almost always linear
time. Most programming languages are intentionally designed with a
deterministic context-free syntax which is LL(1) or LR(1). LR(1) parsing is
all very interesting, but the resulting language is often cumbersome: the
programmer must write out a sufficient left context to help out the parser
even when he/she/it/they/te/se/... can look a little to the right and know
what is happening. Example: the Pascal 'var' is there to help the parser know
this is a declaration and still remain LL(1).

    var v1,v2,...,vn:type;

To claim LL(1)/LR(1) is superior because of the linear time, O(n), ignores
the fact that this is a context-free parse and must be followed by symbol
identification. Assuming the number of distinct symbols is logarithmic in the
program length, the time necessary for a context sensitive parse is from
O(n log log n) to O(n**2).

It would be interesting if we could learn something from our own minds and
make computer languages more like natural languages (but excluding
ambiguity). Make it simple--not simplistic.

sm ryan
[From smryan@garth.uucp (Steven Ryan)]
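For concreteness, here is a minimal sketch in C of the "shallow parsing" idea
described above: a table of memorised chunk templates is tried whole against
the front of the token stream before any fine-grained grammar rules are
consulted. The token classes, the chunk table, and all names are invented for
illustration and are not from the original article.

/* Sketch of "shallow parsing": memorised chunk templates matched whole
 * against the token stream, instead of deriving each phrase from many
 * small productions.  Token classes and table entries are illustrative. */

#include <stdio.h>
#include <string.h>

#define MAXPAT 8

struct chunk {
    const char *name;            /* what the memorised frame is              */
    const char *pat[MAXPAT];     /* token pattern; "ID"/"NUM" are wildcards  */
    int len;
};

/* A few memorised "sentence frames".  The suggestion in the article is that
 * a human-like parser holds thousands of similar yet distinct entries like
 * these rather than a small factored grammar. */
static struct chunk chunks[] = {
    { "assignment",   { "ID", ":=", "NUM", ";" },                      4 },
    { "declaration",  { "var", "ID", ":", "ID", ";" },                 5 },
    { "counted-loop", { "for", "ID", ":=", "NUM", "to", "NUM", "do" }, 7 },
};

static int matches(const char *pat, const char *tok)
{
    if (strcmp(pat, "ID") == 0)                  /* crude identifier test */
        return tok[0] >= 'a' && tok[0] <= 'z';
    if (strcmp(pat, "NUM") == 0)
        return tok[0] >= '0' && tok[0] <= '9';
    return strcmp(pat, tok) == 0;                /* literal keyword/sign  */
}

/* Try each memorised chunk against the front of the token stream; return
 * the one that fits, or NULL to fall back to the fine-grained grammar. */
static const struct chunk *match_chunk(const char *toks[], int ntok)
{
    for (unsigned i = 0; i < sizeof chunks / sizeof chunks[0]; i++) {
        const struct chunk *c = &chunks[i];
        if (c->len > ntok)
            continue;
        int ok = 1;
        for (int j = 0; j < c->len && ok; j++)
            ok = matches(c->pat[j], toks[j]);
        if (ok)
            return c;
    }
    return NULL;
}

int main(void)
{
    const char *input[] = { "for", "i", ":=", "1", "to", "10", "do" };
    const struct chunk *c = match_chunk(input, 7);
    printf("%s\n", c ? c->name : "no chunk: fall back to a full parse");
    return 0;
}

Whether a big flat table like this can beat a factored LL(1)/LR(1) grammar in
practice is exactly the open question of the article; the sketch only shows
the shape of the lookup.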
bart@reed.dec.com (Bart Massey) (06/16/88)
> The suggestion was that humans memorise faster than they generate, and are
> better at handling large chunks in parallel than small chunks nested.
> ...
> What interests me about this is the possible application to compilers:
> humans parse an ambiguous and nondeterministic language in almost always
> linear time. Most programming languages are intentionally designed with a
> deterministic context-free syntax which is LL(1) or LR(1). LR(1) parsing is
> all very interesting, but the resulting language is often cumbersome: the
> programmer must write out a sufficient left context to help out the parser
> even when he/she/it/they/te/se/... can look a little to the right and know
> what is happening.
> ...
> To claim LL(1)/LR(1) is superior because of the linear time, O(n), ignores
> the fact that this is a context-free parse and must be followed by symbol
> identification. Assuming the number of distinct symbols is logarithmic in
> the program length, the time necessary for a context sensitive parse is
> from O(n log log n) to O(n**2).

I don't believe that the above argument about time is correct. My guess is
that a human parsing *sentences of text* is much like a compiler parsing
*sequences of statements*... Certainly such a behavior will occur in "linear
time" on the number of sequences parsed. In fact, this is probably part of
the reason why languages have statements and texts have sentences.

I understand that statements far away from a given statement in a traditional
language are kept track of by the language semantics (e.g. the symbol table)
rather than the grammar itself. Thus, I agree it is the implementation of the
semantics of the language which will determine how parse time grows with
*program* length. But I'm not sure what this tells me -- in particular, both
humans and compilers seem to do quite well at recovering information about
distant tokens from memory in effectively *constant time* for the sizes of
texts and programs normally compiled. It's usually safe to assume that a
compiler will compile a 1000 line program 10 times faster than a 10000 line
program, or that a person will read a 100 page book 10 times faster than a
1000 page one.

I believe that the chief advantage of single-level context-free grammars in
the context of computer languages is to avoid the possibilities for ambiguity
the author refers to in his (omitted) last paragraph. Context-free grammars
are so simple (stupid) that there is very little possibility for ambiguous
communication of a program to the machine. In other words, it is the *parser*
which is "helping out" the *programmer*, in the sense that the programmer
*cannot* easily write an ambiguous program.

However, it is clear that humans use multilevel grammars with great effect.
Does anyone know of work on isolating levels of recognition of text (e.g.
natural language, mathematics) in humans? I know that HEARSAY II used a
multilevel system for speech recognition, but I'm not sure whether the levels
were modelled on human speech recognition, or just designed a priori... For
example, there's clearly a point in recognition at which idiomatic text is
converted to its corresponding meaning. Is this a separate step, or part of
some other?

I think maybe what the author is getting at is this: In C, there is the
common programming idiom

    for(<var> = 0; <var> < <const>; <var>++) {
        <loop body>;
    }

Does it pay to recognize this idiom separately from other for() loops? Why or
why not? In particular: Where does such recognition belong in the programming
process?
Must this separate recognition lead to ambiguity, or can one somehow
guarantee that this for() loop still has the same semantics as other for
loops? Can we somehow speed parsing by recognizing these loops *before* doing
a standard LL or LR parse?

I too am very interested to hear any answers anyone has to these questions.

Bart Massey
UUCP: ..tektronix!reed!bart

[I always thought that computers use unambiguous context free grammars
because we know how to parse them fast, and because they seem empirically to
be convenient for humans to try to express things like computer programs. I
also think that something like the current parsing methods is probably
time-optimal for conventional computer architectures. Ned Irons looked at
context-free parsing on small numbers of parallel processors and decided that
it wasn't worth the effort, because each processor ends up waiting for the
one to its left to tell it something crucial about the context for its chunk
of the program. If the number of processors approximates the number of
tokens, then using some of the processors to look for special cases might
make more sense. Has anyone looked at highly parallel compiling, as opposed
to compiling for highly parallel machines? -John]
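One way to make the last question concrete: a token-level pre-pass could try
to recognize the canonical counted loop before the general LL/LR parse ever
sees it. The sketch below, in C, does only the recognition step; the token
representation, the name COUNTED_LOOP, and the idea of reducing the matched
tokens to a single node are invented for illustration, and a real front end
would still have to show that the reduced form keeps exactly the semantics of
a general for().

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static int is_ident(const char *t)  { return isalpha((unsigned char)t[0]); }
static int is_number(const char *t) { return isdigit((unsigned char)t[0]); }

/* Return 1 if toks[0..12] spell   for ( v = 0 ; v < N ; v ++ )
 * with the same identifier v in all three positions.  A pre-pass that
 * finds this shape could hand the parser one prefabricated COUNTED_LOOP
 * node instead of thirteen tokens. */
static int is_counted_for(const char *toks[], int ntok)
{
    if (ntok < 13)
        return 0;
    return strcmp(toks[0], "for") == 0 && strcmp(toks[1], "(") == 0
        && is_ident(toks[2])
        && strcmp(toks[3], "=") == 0     && strcmp(toks[4], "0") == 0
        && strcmp(toks[5], ";") == 0
        && strcmp(toks[6], toks[2]) == 0 && strcmp(toks[7], "<") == 0
        && is_number(toks[8])            && strcmp(toks[9], ";") == 0
        && strcmp(toks[10], toks[2]) == 0 && strcmp(toks[11], "++") == 0
        && strcmp(toks[12], ")") == 0;
}

int main(void)
{
    const char *toks[] = { "for", "(", "i", "=", "0", ";",
                           "i", "<", "100", ";", "i", "++", ")" };

    if (is_counted_for(toks, 13))
        printf("counted loop: reduce 13 tokens to one COUNTED_LOOP node\n");
    else
        printf("general for(): hand the tokens to the ordinary parser\n");
    return 0;
}

Because such a pre-pass only replaces a sequence that the ordinary grammar
would have parsed as a for() anyway, it need not introduce ambiguity; whether
it actually saves time is the empirical question raised above.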
smryan@garth.uucp (Steven Ryan) (06/18/88)
In article <1114@ima.ISC.COM> Bart Massey <bart@reed.dec.com> writes:
>I don't believe that the above argument about time is correct. My guess is
>that a human parsing *sentences of text* is much like a compiler parsing
>*sequences of statements*... Certainly such a behavior will occur in
>"linear time" on the number of sequences parsed. In fact, this is probably
>part of the reason why languages have statements and texts have sentences.

Yes, I'm thinking that we divide the stream into sentence-like entities. In
spoken language we use pauses to signify a break (and catch our breath) and a
variety of gestures and eye movements. In written language we use endmarks
and paragraphs. Programming languages use begin/end/if/... keywords that mark
out the overall structure. Dividing the stream could be done in a
context-free, deterministic fashion. It may be that the contents of each
sentence are parsed differently.

>the semantics of the language which will determine how parse time grows with
>*program* length. But I'm not sure what this tells me -- in particular,
>both humans and compilers seem to do quite well at recovering information
>about distant tokens from memory in effectively *constant time* for the

I don't know about humans, but most compilers use a hash table for the symbol
table. In this case the time to identify one symbol out of n distinct symbols
is O(n/m) for a hash table of size m. If m is larger than n, it is constant
time. However, once n is large enough, the hash table only divides the search
time by a constant: the search down any one hash chain still grows as n
(usually) or log n (for a more sophisticated table). (A small sketch of such
a lookup appears below.)

>Does it pay to recognize this idiom separately from other for() loops? Why
>or why not?

Also, I'm wondering about the tradeoff of deterministic versus
nondeterministic grammars, or languages, or parsing. For those unfamiliar
with these terms: if a parser merrily trips through the string left to right,
then, given everything it has seen so far and looking at the current symbol
and maybe k symbols to the right, does it know exactly which production it is
in? If this is true for a fixed k, the parser is deterministic. If it
requires arbitrarily large k (the lookahead), it is nondeterministic. Or it
might be ambiguous. (Deterministic languages are not inherently ambiguous:
each has an LR(1) grammar, and LR(1) grammars are not ambiguous.)

>[I always thought that computers use unambiguous context free grammars
>because we know how to parse them fast, and because they seem empirically to
>be convenient for humans to try to express things like computer programs. I
>also think that something like the current parsing methods is probably
>time-optimal for conventional computer architectures.

I suspect it has more to do with parser generators: a parser can be
automatically generated for an LR(1) grammar, but not so for a
nondeterministic language. (Personal remark:) I think Wirth has crippled a
generation of programming languages by stripping them of anything that makes
life tough for a compiler writer (my occupation, if you haven't already
guessed).

> Ned Irons looked at
>context-free parsing on small numbers of parallel processors and decided that

I first thought about this with respect to the CDC Cyber 205 and ETA-10x
vector machines. If somebody can find a way to vectorise parsing, I think it
will open up a vast field of applications which currently appear to be scalar
or sequential.
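As a concrete illustration of the lookup cost described above, here is a
minimal sketch of a chained symbol table in C. The bucket count, hash
function, and names are all invented for illustration; the point is only that
with n symbols spread over m chains, an average lookup walks about n/m
entries, so a fixed m merely divides the linear search by a constant.

/* Minimal chained hash table for a compiler symbol table; names and sizes
 * are illustrative only, and error handling is omitted. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKET 64                      /* m: a fixed number of chains */

struct symbol {
    char *name;
    struct symbol *next;                /* next entry on the same chain */
};

static struct symbol *bucket[NBUCKET];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKET;
}

/* Look a name up, entering it if absent.  The walk down the chain is the
 * O(n/m) part: n symbols spread over m chains means about n/m probes. */
static struct symbol *lookup(const char *name)
{
    unsigned h = hash(name);
    struct symbol *s;

    for (s = bucket[h]; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;

    s = malloc(sizeof *s);
    s->name = malloc(strlen(name) + 1);
    strcpy(s->name, name);
    s->next = bucket[h];
    bucket[h] = s;
    return s;
}

int main(void)
{
    struct symbol *first = lookup("i");     /* first sight: entered      */
    lookup("count");
    lookup("limit");
    struct symbol *again = lookup("i");     /* later use: found on chain */

    printf("same entry for \"i\" both times: %s\n",
           first == again ? "yes" : "no");
    return 0;
}

Growing m with n is what gives the effectively constant behaviour Bart
describes; keeping each chain as a balanced tree instead gets the chain
search down to log n.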
Thank you for your support,
sm ryan

[Using Earley's algorithm you can parse ambiguous grammars without trouble,
although it is much slower than LR or LALR for the grammars that the latter
two can handle. -John]
[From smryan@garth.uucp (Steven Ryan)]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request