[comp.unix.shell] Understanding the Bourne Shell

martin@mwtech.UUCP (Martin Weitzel) (01/08/91)

In article <443@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>> What ALWAYS works in the Bourne-Shell is this:
>> 
>> 	for last do :; done
>
>Wow! A one-liner that works for more than 9 args!  Of course, there's 
>the question as to whether this loop is actually faster than starting 
>a subprocess that just does puts(argv[argc-1]), but at least there's
>a way to do it that is portable.

I have compared the alternatives here on my 386 box, and as you might guess,
the difference in speed depends on the length of the argument list.

For ~25 arguments the for-loop is the fastest; above that, up to ~100
arguments, there is little difference, but the for-loop uses more usr-time
and the sub-process more sys-time. There seem to be only minor differences
between what is called as the sub-process, i.e. a specialized C program (as
the poster suggested) or another shell-script (as Maarten Litmaath posted
earlier in this thread).

For the rather untypical size of 250 arguments there still isn't much
difference, but sometimes the sub-process is faster (the results vary over
some range and I didn't go to the effort of calculating the average).
My general experience with the 386 is that it starts sub-processes really
fast, so I think the for-do method will win even for more than 250
arguments on a lot of systems.
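
(If you want to repeat the measurement, something along these lines should
do; "lastarg" stands for whatever sub-process you compare against, the two
script names are made up, and the generated argument list is just an
example:)

	# last1.sh - the pure shell method: print the last argument
	for last do :; done
	echo $last

	# last2.sh - the sub-process method; "lastarg" is any helper
	# that just prints its last argument
	lastarg "$@"

	# generate a test argument list and time both alternatives
	ARGS=`awk 'BEGIN { for (i = 1; i <= 100; i++) printf "a%d ", i }'`
	time sh last1.sh $ARGS
	time sh last2.sh $ARGS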

(BTW: I've learned from my experiments that the shell internally limits
the number of arguments that can be passed to a sub-process to 254.
I always thought the only limit was the space supplied by the OS
to pass the stuff to the sub-process, which is typically several KByte
for the *contents* of arguments + environment. I never noticed the limit
on the *number* of arguments before.)
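
(Should you want to check the limit on your own system, something like the
following will show it; the exact number and the shell's diagnostic will of
course differ, and the awk loop is just one way to generate enough words:)

	# generate more than 254 words and hand them to a sub-process;
	# a shell with the limit described above complains on the last line
	set `awk 'BEGIN { for (i = 1; i <= 300; i++) printf "x%d ", i }'`
	echo $#
	/bin/echo $* >/dev/null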

>That comment isn't worth wasting the bandwidth, of course; my motive
>for this followup is a bit of bizarreness that I discovered while
>testing this command.  The usual format of a for loop is 3 lines:
>	for last
>	do :
>	done
>Usually when I want to collapse such vertical code into a horizontal
>format, I follow the rule "Replace the newlines with semicolons", and
>it works.  For instance,
>	if [ <test> ]
>	then <stuff>
>	else <stuff>
>	fi
>reduces to
>	if [ <test> ];then <stuff>;else <stuff>;fi
>which I can do in vi via a series of "Jr;" commands.  With the above 
>for-loop, this gives
>	for last;do :;done
>which doesn't work.  The shell gives a syntax error, complaining about
>an unexpected ';' in the line.  Myself, I found this to be a somewhat 
>unexpected error message.  It appears my simple-minded algorithm for 
>condensing code doesn't work in this case.
>
>So what's going on here?  What the @#$^&#( is the shell's syntax that 
>makes the semicolon not only unneeded, but illegal in this case?

Funny, I stumbled over the same thing when I "invented" my for-do method
for accessing the last argument some years ago. The explanation is a bit
longer, so all who aren't interested in the details should leave at this
point.

The syntax for the "for" statement is more or less the following (I stick
to the "yacc"-style here, but enclose keywords in single quotes even if
they are longer than one character, which is not allowed with "yacc"):


for_stmt : 'for' NAME 'in' word_list SEP 'do' cmd_list 'done'
	 | 'for' NAME 'do' cmd_list 'done'
	 ;

word_list: WORD
	 | word_list WORD
	 ;

cmd_list : cmd arg_list SEP
	 | cmd_list cmd arg_list SEP
	 ;

arg_list : /*empty*/
	 | arg_list WORD
	 ;

SEP	 : ';'
	 | '\n'
	 ;

(The meaning of NAME and WORD should be obvious - I don't want to go into
the syntactic details too far. I have further left out an undocumented
shell feature that allows you to replace "do" and "done" with "{" and "}";
note that the latter is only true for for-do-done, not for while-do-done
and until-do-done!)
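
(For the curious, on shells that accept this undocumented variant it looks
like the following - better not rely on it in portable scripts:)

	# undocumented: braces instead of do/done, for for-loops only
	for i in a b c
	{
		echo $i
	}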

Note that white space is allowed everywhere in between the tokens
and nonterminals, but SEP is a mandatory separator (which can be
a newline or a semicolon). The reason for requiring a separator in
some cases is simple: there is the possibility that some keywords of
the shell might also be used as a regular argument to commands or within
a word_list - we'll come back to this in a moment.

The shell detects the two forms of the "for" statement simply by looking
at what follows the loop-variable. If it is an "in", then there must also
follow a word_list, which in turn must be terminated by a mandatory
separator, as explained above. If a "do" follows, there is no
word_list. If a semicolon follows the loop-variable, this
is against the syntax (this is what puzzled the poster).

Of course, Mr. Bourne could have made the syntax allow for it by
changing the RHS of the rule for the "for" statement without "in" into

	'for' NAME SEP 'do' cmd_list 'done'

but IMHO the difficulties of the poster (and many more, me included)
have another reason, which has something to do with the difference
between
	- mandatory command separators or terminators,
	- optional white space before commands and keywords,
	- spaces as separators between a command and its argument list,
	- the semicolon being allowed only in the first case,
	- the newline being allowed in the first and second case, and
	- space characters being allowed in the second and third.

In a simple command, i.e. a program name that is followed by some arguments,
there's not much of a problem, as it seems "natural" for most users to type
spaces to separate the arguments and newlines to terminate commands, and it
seems obvious that the two can not be used interchangeably, as this would
either terminate the argument list prematurely (if you try to separate
arguments with a newline) or not properly end your command (if you don't
type a newline).

Now let's consider the more complex shell statements. Some very stupid
users might in fact expect that the shell can read their mind, but all the
others will understand that the shell must either treat ALL keywords (and
maybe even all the commands) specially, not allowing them as regular
arguments, or it needs some other separator than the one used between
arguments, if a keyword is to follow a command (or two commands are to
appear) on the same line. The logic can be applied to most keywords,
regardless of whether they introduce some complex command or mark the
beginning of the next part of the command (like "then" or "else" in an
"if" statement).
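
A small illustration of keywords in argument position (nothing unusual
assumed here, any Bourne-style shell should do):

	echo if then else fi		# in argument position these are plain words
	for word in do done fi		# ... and so are they in a word list
	do echo "$word"; done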

More puzzling is that the shell also ALLOWS newlines in place of spaces
where it's clear that a complex command isn't complete%. One place where
this occurs is when you start a "for" statement and have not yet supplied
the matching "done".  For example

	for var in foo bar
		<some newlines here (1)>
	do	<some newlines here (2)>
		cmd
		<some newlines here (3)>
	done

is all allowed, though seldom used, except for exactly one newline in
the place marked (2). Note that the newlines before and after "cmd" here
can not simply be seen as "empty commands", because if they could, the
following would be legal:

	for var in foo bar
	do
	done

which IS NOT, since there is at least ONE command necessary between "do"
and "done" (please refer to the syntax given above). Note further that a
semicolon by itself is NOT an empty command, as

	for var in foo bar
	do ;
	done

does not work - you need at least the colon here:

	for var in foo bar
	do :
	done

------
%: More puzzling is that the shell only allows it in some places.
   E.g. "for <newline>" is a syntax error while "for i <newline>"
   patiently waits for the "in" or "do".
------

>One of the real hassles I keep finding with /bin/sh (and /bin/csh is
>even worse ;-) is that the actual syntax regarding things like white
>space, newlines, and semicolons seems to be a secret.  It often takes 
>a lot of experimenting to find a way to get these syntax characters 
>right.  Is there any actual documentation on sh's syntax?  Is it truly 
>as ad-hoc as the above example implies?

For all I know the C-shell is more or less "ad-hoc", but for the Bourne
shell (which, until now and for the rest of this article, I always mean
when I speak of "the shell") you can find a formal syntax already in a
very ancient document, the "Bell System Technical Journal" (BSTJ for short)
from July/August 1978, ISSN 0005-8580. The grammar starts on page 1987 as
Appendix A of an article written by S. R. Bourne himself. Though it fails
to mention some of the finer points (like the space/newline problems just
discussed) it may serve as a start for you, and I found that it could even
be fed to yacc without many problems (I never tried to fill in the actions
to make it work as a "real" shell ...)

>Is there perhaps some logical 
>structure underlying it all that would explain why
> 	for last do :; done
>and
>	for last
>	do :
>	done
>both work but
>	for last;do :;done
>doesn't?

Well, "logic" is not so much an absolut value as many of us think, as it
often depends on what you expect. This is so because we may think we
have recognized something as a "rule" and tend to see all withstanding
observations as "illogical", where just the examples we studied were too
limited to recognize that we had only a seen special case (in this generality
that may also be true for the things we consider to be the "universal
laws" or "laws of nature" - but this brings us away from the topic.)

Now, what you observed was that newline and semicolon are interchangeable
in all the examples you looked at and tried before you came to that
"for" statement. (Remember, I told you in the beginning that I had the same
problem with this - so it can not be said that your expectations were
without reason.) A bit more experimentation could also have shown that in
general the two are not really interchangeable. E.g. if you type a single
newline nothing happens (except that the shell prompts again); if you type
two newlines still nothing happens; but if you type a semicolon + a newline
this is a syntax error. Hence semicolon and newline are not as
interchangeable as it seemed at first glance.
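
(You can try this at the prompt; the exact wording of the diagnostic of
course differs from shell to shell:)

	$
	$
	$ ;
	syntax error: `;' unexpected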

Now, having a little more experience we can come up with some other
explanation:

	- commands can not be empty (they consist at least of
	  an external or builtin command; the ":" is the builtin
	  command which does nothing but evaluate its arguments)
	- a semicolon or a newline% terminates a command
	- a command list is a non-empty sequence of commands, all
	  of which must be properly terminated
	- a semicolon or a newline terminates the word list of
	  the "in" part of the "for" statement
	- space characters and newlines are allowed before commands
	- nearly all the keywords of the shell are only recognized if
	  they are found in the position of a command, i.e. if there is
	  a previous command or a word list of a "for" statement there
	  MUST be a separator and there CAN be some space characters or
	  newlines
	- the most important exceptions from the above are "in" (for
	  the "for" statement as well as for the "case" statement) and
	  "do". But as the word list in the "in" part of a "for" statement
	  (or the command list after the "while" or "until" in such a
	  statement) must be properly terminated, a "do" NOT in command
	  position can only occur in an "in"-less "for" statement (see
	  the examples after the footnote below).

-----
%: There are other valid command separators/terminators besides the
semicolon, but this doesn't matter here.
-----
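
To illustrate the last point about "do" (the first line is just the method
from the beginning of this article, the others are made-up examples):

	for last do :; done		# "do" directly after the loop variable
	for i in a b; do :; done	# word list terminated, then "do" in command position
	for i in a b do			# this "do" is just another word of the list ...
	do :; done			# ... only this one starts the loop body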

In some sense, these are the "laws of nature" as derived from observing
the shell's behaviour. As the shell is not really nature but the outcome
of the thoughts of some human being, we could of course complain now
that this is "illogical" (compared to our sense of logic!) or that there
are "too many exceptions" and that it could be simplified with fewer,
but more general rules.

But when thinking about how to smooth things out by using fewer rules, we
often do not recognize all the consequences that this would have.
Assume for a moment we treated both newline and semicolon as
statement terminators. Have you really considered what this would mean?
Typing a newline (at your terminal or as an empty line in a shell script)
would be a syntax error (sic!), just as a single semicolon is. Quite simple,
I hear you say: then we allow an empty statement to be really empty,
which would allow for single newlines as well as single semicolons. But
be careful! We then must think about the exit status of such a statement.
Should it always be true, as with the colon command? But then you must be
very careful inserting empty lines into a script, because the following two
would have different semantics

	if		|		if	cmd
		cmd	|
	then		|		then

and you must never separate command execution and accessing $? by a
newline, since the empty command "newline" destroys the value of any
previous command's exit status. Again I hear you say: we make the
empty statement special - it shall leave the status of the "real" command
that was executed last. But now the following becomes dangerous

	while
	do
		<do something until exit or break>
	done

as whether the loop is entered the first time depends on the last command
BEFORE the loop, and after that on the last command executed
WITHIN the loop. So, step by step we may introduce more special casing
for something that looked like a trivial change in the first place!

I hope you have gained a little more understanding of the syntax of the
shell now. It isn't really as strange as it might seem at first glance,
though I admit a few things are not so obvious and it's easy to come to
some wrong conclusions if you have insufficient experience. (If this
article hadn't become that long I could write a little more on it - maybe
some other time.)
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

eric@mks.com (Eric Gisin) (01/08/91)

The shell's interpretation of newline is context sensitive.
It is usually equivalent to ";", but in a few cases it
is equivalent to white-space (space or tab). The latter cases
include after "|", "&&", "||", "for NAME", and "case WORD".

So all the following are valid:
$ ls |
> wc
$ true &&
> false ||
> maybe
$ for x
> in a b c
> do :
> done
$ case x
> in x) echo x!		# ;; optional here
> esac

allbery@NCoast.ORG (Brandon S. Allbery KB8JRR) (01/11/91)

As quoted from <1033@mwtech.UUCP> by martin@mwtech.UUCP (Martin Weitzel):
+---------------
| In some sense, these are the "laws of nature" as derived from observing
| the shell's behaviour. As the shell is not really nature but the outcome
| of the thoughts of some human being, we could of course complain now
| that this is "illogical" (compared to our sense of logic!) or that there
| are "too many exceptions" and that it could be simplified with fewer,
| but more general rules.
| 
| But when thinking about how to smooth things out by using fewer rules, we
| often do not recognize all the consequences that this would have.
+---------------

There is one other problem.  I daresay it would be possible to make Bourne
shell syntax a bit more "regular" by using a yacc grammar.  THIS WON'T WORK!
At least, not without making the shell much less useful --- yacc (or other
parser generator) grammars are not designed for interaction.  In order to
do interaction *well*, the shell needs to be able to have at least some idea
of what is going on *without* having read an entire complex command (read
"if/while/for/case/etc.").  I've tried writing a yacc grammar that does this
kind of thing in a graceful manner; I ended up using context-sensitive hacks,
which I dislike in otherwise simple parsers.  This is also why csh is not
actually like C --- C can depend on the parser collecting statements for it,
but csh is primarily designed for interactive use and therefore must be able
to keep track of what's going on incrementally.

++Brandon
-- 
Me: Brandon S. Allbery			    VHF/UHF: KB8JRR on 220, 2m, 440
Internet: allbery@NCoast.ORG		    Packet: KB8JRR @ WA8BXN
America OnLine: KB8JRR			    AMPR: KB8JRR.AmPR.ORG [44.70.4.88]
uunet!usenet.ins.cwru.edu!ncoast!allbery    Delphi: ALLBERY

ronald@robobar.co.uk (Ronald S H Khoo) (01/12/91)

allbery@ncoast.ORG (Brandon S. Allbery KB8JRR) writes:

> There is one other problem.  I daresay it would be possible to make Bourne
> shell syntax a bit more "regular" by using a yacc grammar.  THIS WON'T WORK!
> At least, not without making the shell much less useful

Well, some of the chaps at research seem to be quite happy with "rc"
and that's got a yacc grammar...  Apparently it was too painful to
port /bin/sh to Plan 9 so Duff wrote "rc".  (He presented a paper on it
to the UKUUG Summer Conference last year)

rc has exactly what you describe -- a regularised /bin/sh syntax.

And of course, since they use Gnots running Pike's windowing stuff, there's
no command-line history/editing or anything like that in rc; it's just a
shell, and looks quite nice too. Pity it's not available.
-- 
ronald@robobar.co.uk +44 81 991 1142 (O) +44 71 229 7741 (H)

allbery@NCoast.ORG (Brandon S. Allbery KB8JRR) (01/13/91)

As quoted from <1991Jan12.012225.6727@robobar.co.uk> by ronald@robobar.co.uk (Ronald S H Khoo):
+---------------
| allbery@ncoast.ORG (Brandon S. Allbery KB8JRR) writes:
| > There is one other problem.  I daresay it would be possible to make Bourne
| > shell syntax a bit more "regular" by using a yacc grammar.  THIS WON'T WORK!
| > At least, not without making the shell much less useful
| 
| Well, some of the chaps at research seem to be quite happy with "rc"
| and that's got a yacc grammar...  Apparently it was too painful to
| port /bin/sh to Plan 9 so Duff wrote "rc".  (He presented a paper on it
| to the UKUUG Summer Conference last year)
+---------------

I wondered if anyone would comment about that after I read the "rc" stuff.
However, "rc" follows the general Plan 9 form (which, many ages, ago, was the
general Unix form) of moving stuff into separate programs.  "rc" is, in many
ways, nowhere near as complex as even the V7 shell, much less the System V
shell; it can get away with simple means of handling interactiveness in
complex control structures.  I was able to handle interactive use simply in a
certain yacc grammar up to a certain point, then I had to start using context
flags all over the place to make interactive use behave in an intuitive way.
I don't recall what point it was, except that the program I was working on was
gradually turning into a shell, which is why I eventually scrapped it in favor
of using the existing shell.

++Brandon
-- 
Me: Brandon S. Allbery			    VHF/UHF: KB8JRR on 220, 2m, 440
Internet: allbery@NCoast.ORG		    Packet: KB8JRR @ WA8BXN
America OnLine: KB8JRR			    AMPR: KB8JRR.AmPR.ORG [44.70.4.88]
uunet!usenet.ins.cwru.edu!ncoast!allbery    Delphi: ALLBERY

martin@mwtech.UUCP (Martin Weitzel) (01/14/91)

In article <1991Jan11.035416.18772@NCoast.ORG> allbery@ncoast.ORG (Brandon S. Allbery KB8JRR) writes:
>As quoted from <1033@mwtech.UUCP> by martin@mwtech.UUCP (Martin Weitzel):
>+---------------
>| But when thinking about how to smooth [the shell syntax by using] fewer rules,
>| we often do not recognize all the consequences that this would have.
>+---------------
>
>There is one other problem.  I daresay it would be possible to make Bourne
>shell syntax a bit more "regular" by using a yacc grammar.  THIS WON'T WORK!
>At least, not without making the shell much less useful --- yacc (or other
>parser generator) grammars are not designed for interaction.

My observations differ a little here. It is true that using a parser
generator like yacc sometimes makes one less conscious of the actual parsing
algorithm, which may have to look at the next token to decide which rule
should be reduced (and hence which action should be executed).

But you can also write yacc-able grammars that can be parsed without
lookahead! (Actions are generally a bit more complex then - in most cases
you have to build the parse tree explicitly as a data structure rather than
simply depend on yyparse's value stack.)
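
(A toy fragment of what I mean by building the tree explicitly rather than
relying only on yyparse's value stack - the node-building and execute
helpers are of course made up:)

	/* the action builds an explicit tree node for each command list;
	   a complete input line is executed as soon as its newline is read */
	line	: list NL		{ execute($1); }
		;
	list	: command		{ $$ = $1; }
		| list ';' command	{ $$ = mklist($1, $3); }
		;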

But the conclusion that parser-generator grammars are not designed for
interaction is similar to the `goto-considered-harmful' discussion: You
cannot say that C programs are generally less structured just because
the language contains a `goto' statement. It depends much on the typical
usage of the `goto' throughout a program whether the program looks
structured or more like spaghetti-code. Of course, if C had no `goto'
at all, even those old-time BASIC-hackers would be forced to look at other
ways to do control-flow. To that extent I see some truth in Brandon's
statement: Parser generators make it easy to write grammars which do not
fit well into an interactive environment.

>In order to
>do interaction *well*, the shell needs to be able to have at least some idea
>of what is going on *without* having read an entire complex command (read
>"if/while/for/case/etc.").  I've tried writing a yacc grammar that does this
>kind of thing in a graceful manner; I ended up using context-sensitive hacks,
>which I dislike in otherwise simple parsers.

Again, `context-sensitive hacks' are not a bad thing a priori (maybe they
are if they are real `hacks', but I think Brandon meant that he fed
back some information from the syntax analysis to the lexer). There are
two different situations: Either you plan a completely new syntax for
a new language; in this case I would not recommend the coupling between
parser and scanner, because such a syntax becomes more difficult to learn
for a user of this new language (things have different meanings in different
contexts).

On the other hand, if you need to parse a given language that the user
already knows (e.g. some natural language or a sub-language thereof),
feedback from syntax analysis to lexical analysis will help much, as long
as it duplicates what the user already expects.

Finding a yacc-able syntax for the Bourne-Shell is a mixed case: a
long-time shell-user would expect all the things in it that a newcomer
might consider to be irregularities. (I don't dare to decide which are
really irregularities, as I belong rather to the former group, but at
least I know that most of the irregularities - e.g. the implied double
quotes around the word after an `=' in an assignment and between
`case' and `in' - help to save some keystrokes, though they really are
very non-intuitive for newcomers.)
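
A tiny illustration of the two exemptions just mentioned (any Bourne-style
shell should behave like this):

	ans='two words'
	copy=$ans			# no quotes needed: assignments are not field-split
	case $ans in			# the word after "case" isn't field-split either
	'two words')	echo matched ;;
	esac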

>This is also why csh is not
>actually like C --- C can depend on the parser collecting statements for it,
>but csh is primarily designed for interactive use and therefore must be able
>to keep track of what's going on incrementally.

Here I can second Brandon's statement and will even work it out a bit more:
one of the major problems comes up if the syntax allows an if-statement with
an optional else-part, as is the case in C (but not in the Bourne
Shell, as it has the closing `fi'). The user expects (of course) that
the if-part should be executed after it is completely written down.
But the parsing algorithm may want to look whether an `else' follows.
This is because the user "knows" what he or she will do next, but the
shell can not read the user's mind. That sort of thing must be taken
care of during the design of an interactive language. Simply adopting
the syntax of a non-interactive language for an interactive language is
bound to fail here.
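
(In grammar terms the difference looks roughly like this - the rules are
only a sketch, in the same notation as above:)

	/* C-like "if" with optional "else": after "if (expr) stmt" the
	   parser cannot commit yet, as an "else" might still follow ...  */
	stmt	 : 'if' '(' expr ')' stmt
		 | 'if' '(' expr ')' stmt 'else' stmt
		 ;

	/* ... while the shell's closing "fi" marks the statement as
	   complete, so an interactive shell can execute it right away    */
	if_stmt	 : 'if' cmd_list 'then' cmd_list 'fi'
		 | 'if' cmd_list 'then' cmd_list 'else' cmd_list 'fi'
		 ;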

To summarize: IMHO it is not the parser generators which complicate
things, but inappropriate design of an interactive language.
(Esp. to Brandon: Do your experiences stem from trying to derive a
yacc-able grammar for the Bourne-Shell or rather for the C-Shell?)

BTW: I've redirected followups to comp.lang.misc, since the topic tends
to turn away from the focus of comp.unix.shell.
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83