[sci.lang] BNFs and NL

goldberg@su-russell.UUCP (05/28/87)
In article <13263@watmath.UUCP> erhoogerbeet@watmath.UUCP (Edwin (Deepthot)) writes:

>Is there a Backus-Naur Form for the English language itself or is this too
>complicated? If not, how is it that we can understand a huge variety of 
>different sentence forms and still recognize that some are syntactically
>incorrect? Basically, what I am asking is it possible to do syntactic
>checking as if "compiling" a sentence with rules set down in some BNF?

>------ ---------
>erhoogerbeet@watmath.uucp
>ehoogerbeets@wateuler.uucp
>Edwin (Deepthot)
>Hoogerbeets

There are a couple of things to distinguish here.  It is possible that
there is a recursive grammar for a natural language (let's say English)
without there being a BNF.  There are two questions here: 

    (1) Is it possible in general to write a BNF grammar for a
    natural language?

    (2) Is it appropriate to write a BNF grammar for a natural
    language?

First a small confession.  I am a linguist and not an AI researcher.  I
have never taken a computer science course in my life.  I believe, however,
that a BNF grammar is form of Context-Free Phrase Structure Grammar
(CF-PSG, I will explain this later).  If I am wrong, a portion of what I
have to say will be meaningless and I will be royally flamed.  I don't have
the references here right now, and I am willing to go out on a limb.

Let me take the question (2) first.  Is it appropriate to
parse/describe/characterize the grammar of a natural language with a BNF
grammar?  Well, it seems that you have only been exposed to only this
grammar formalism so you have found yourself captivated by this notion of
defining a language in these terms.  But there are other ways of
characterizing (formal) languages.  If you use any of the Unix editors (ed,
ex, vi) or Unix utilities such as grep or sed you will be familier with
another set of pattern matching tools.  The pattern search done with these
programs basically is checking whether some string is in the language
defined by the grammar.  Now while you could recognize (almost) any language
that grep could recognize using a BNF grammar, it would be very awkward to
have to write a BNF grammar when a simple grep type pattern would do the
job.  The language for writing grep type expressions is tailored to a
particular class of tasks.  It provides easy ways of talking about a range
of characters, etc.  The (largely) more powerful BN form lacks such
convenient devices.

So how does this apply to natural language syntax?  Well in many natural
languages verbs agree with their subjects.  This could be coded up in a BNF
grammar, but it would take a substantial multiplication of rules.  Now you
might not consider this a problem, but you must realize that the linguist
would like her system for writing rules to express certain general facts
about natural languages.  One such fact is (3).

    (3)  Agreement systems happen in whole bunches of languages.

This shows one way in which BNF grammars are not tailored to natural
language.  Another would be the area of word order.  Languages vary in how
strict their word order is.  BNF grammars are useful for very strict word
order languages (like English), but worthless for Makua or Warlpiri.  Yet
if one were to modify the formalism one used for writing grammars so that a
specific rule doesn't (as in BNF) tell you both about constituency (what
elements can make up what kind of phrase) and linear precedence (what order
those elements must come in) one would have a system that was much more
appropriate for describing natural language.  Linear precedence rules would
be kept separate from constituency rules.  I could go on and list hundreds
of ways in which using BNF grammars would be inappropriate for natural
language analysis.  But I will stop with this question and go on to why it
is impossible.

If I am correct in assuming that BNF grammars are a subset of
CF-PSGs then I can point out that there are a number of natural
language constructs that cannot be defined via BNF grammars no
matter how clumsily one were willing to do it.  CF-PSGs can only
describe a certain class of languages:  the Context-Free
Languages (CFLs).  The (formal) language that has as "words" 'a', 'b', and
'c' and as sentences only those strings that consist of some number of 'a's
followed by the same number of 'b's followed by the same number of 'c's
is not a CFL.  So this is the language containing the strings "", "abc",
"aabbcc", "aaabbbccc", and so on.  I dare you to write a BNF grammar for
such a language!

Anyway, I don't have the papers or the exact references right here, but
there were two papers published in the journal "Linguistics & Philosophy"
in 1985 which should that some natural languages had constructs which were
provably non context-free.  One was by Christopher Culy on reduplication in
Bambara.  The other was by Stuart Shieber on embedded serial verbs in Swiss
German.  I refer you to these articles for the proofs which are quite
straight forward.

To answer your question:  It is not possible in general to write a BNF
grammar for an arbitrary natural language.  But linguists do believe that
there is some form of grammar (specific to language, we don't expect to
find it in a Computer Science text book) that is fully appropriate for
describing natural language.  Such a form will force each grammar to have
the properties that are universal to language.  It will not allow us to
write grammars for unnatural languages.  We are working on discovering this
form it.

I have cross posted this response to sci.lang and have directed follow-ups
there as well.

-- 
Jeff Goldberg 
goldberg@russell.stanford.edu, ...!ucbvax!russell.stanford.edu!goldberg
[The Bulgarians can assassinate him. The cipher key won't get to Pest.]
(The preceding two sentence are to annoy security organizations.)