goldberg@su-russell.UUCP (05/28/87)
In article <13263@watmath.UUCP> erhoogerbeet@watmath.UUCP (Edwin (Deepthot)) writes: >Is there a Backus-Naur Form for the English language itself or is this too >complicated? If not, how is it that we can understand a huge variety of >different sentence forms and still recognize that some are syntactically >incorrect? Basically, what I am asking is it possible to do syntactic >checking as if "compiling" a sentence with rules set down in some BNF? >------ --------- >erhoogerbeet@watmath.uucp >ehoogerbeets@wateuler.uucp >Edwin (Deepthot) >Hoogerbeets There are a couple of things to distinguish here. It is possible that there is a recursive grammar for a natural language (let's say English) without there being a BNF. There are two questions here: (1) Is it possible in general to write a BNF grammar for a natural language? (2) Is it appropriate to write a BNF grammar for a natural language? First a small confession. I am a linguist and not an AI researcher. I have never taken a computer science course in my life. I believe, however, that a BNF grammar is form of Context-Free Phrase Structure Grammar (CF-PSG, I will explain this later). If I am wrong, a portion of what I have to say will be meaningless and I will be royally flamed. I don't have the references here right now, and I am willing to go out on a limb. Let me take the question (2) first. Is it appropriate to parse/describe/characterize the grammar of a natural language with a BNF grammar? Well, it seems that you have only been exposed to only this grammar formalism so you have found yourself captivated by this notion of defining a language in these terms. But there are other ways of characterizing (formal) languages. If you use any of the Unix editors (ed, ex, vi) or Unix utilities such as grep or sed you will be familier with another set of pattern matching tools. The pattern search done with these programs basically is checking whether some string is in the language defined by the grammar. Now while you could recognize (almost) any language that grep could recognize using a BNF grammar, it would be very awkward to have to write a BNF grammar when a simple grep type pattern would do the job. The language for writing grep type expressions is tailored to a particular class of tasks. It provides easy ways of talking about a range of characters, etc. The (largely) more powerful BN form lacks such convenient devices. So how does this apply to natural language syntax? Well in many natural languages verbs agree with their subjects. This could be coded up in a BNF grammar, but it would take a substantial multiplication of rules. Now you might not consider this a problem, but you must realize that the linguist would like her system for writing rules to express certain general facts about natural languages. One such fact is (3). (3) Agreement systems happen in whole bunches of languages. This shows one way in which BNF grammars are not tailored to natural language. Another would be the area of word order. Languages vary in how strict their word order is. BNF grammars are useful for very strict word order languages (like English), but worthless for Makua or Warlpiri. Yet if one were to modify the formalism one used for writing grammars so that a specific rule doesn't (as in BNF) tell you both about constituency (what elements can make up what kind of phrase) and linear precedence (what order those elements must come in) one would have a system that was much more appropriate for describing natural language. Linear precedence rules would be kept separate from constituency rules. I could go on and list hundreds of ways in which using BNF grammars would be inappropriate for natural language analysis. But I will stop with this question and go on to why it is impossible. If I am correct in assuming that BNF grammars are a subset of CF-PSGs then I can point out that there are a number of natural language constructs that cannot be defined via BNF grammars no matter how clumsily one were willing to do it. CF-PSGs can only describe a certain class of languages: the Context-Free Languages (CFLs). The (formal) language that has as "words" 'a', 'b', and 'c' and as sentences only those strings that consist of some number of 'a's followed by the same number of 'b's followed by the same number of 'c's is not a CFL. So this is the language containing the strings "", "abc", "aabbcc", "aaabbbccc", and so on. I dare you to write a BNF grammar for such a language! Anyway, I don't have the papers or the exact references right here, but there were two papers published in the journal "Linguistics & Philosophy" in 1985 which should that some natural languages had constructs which were provably non context-free. One was by Christopher Culy on reduplication in Bambara. The other was by Stuart Shieber on embedded serial verbs in Swiss German. I refer you to these articles for the proofs which are quite straight forward. To answer your question: It is not possible in general to write a BNF grammar for an arbitrary natural language. But linguists do believe that there is some form of grammar (specific to language, we don't expect to find it in a Computer Science text book) that is fully appropriate for describing natural language. Such a form will force each grammar to have the properties that are universal to language. It will not allow us to write grammars for unnatural languages. We are working on discovering this form it. I have cross posted this response to sci.lang and have directed follow-ups there as well. -- Jeff Goldberg goldberg@russell.stanford.edu, ...!ucbvax!russell.stanford.edu!goldberg [The Bulgarians can assassinate him. The cipher key won't get to Pest.] (The preceding two sentence are to annoy security organizations.)