[comp.unix.wizards] what should egrep '|root' /etc/passwd print?

rick@seismo.CSS.GOV (Rick Adams) (09/14/88)

What should 
	egrep '|root' /etc/passwd
print? (Presume you have one and only one line containing the string
root in /etc/passwd).

On BSD systems, it prints the single line containing root. On
System 5 systems it prints EVERY line.

Obviously the question is what should the null pattern to
the left of the "|" match. With BSD it matches nothing, with SYS5 it
matches everything.

(That patter was generated by aprogram in case you were wondering
who would ever type something like that)

---rick

guy@gorodish.Sun.COM (Guy Harris) (09/14/88)

> What should 
> 	egrep '|root' /etc/passwd
> print? (Presume you have one and only one line containing the string
> root in /etc/passwd).
> 
> On BSD systems, it prints the single line containing root. On
> System 5 systems it prints EVERY line.

I tried it on a 3B2 running S5R3.1 here, and it printed only the line
containing "root".  Perhaps somebody "improved" the "egrep" on the S5 system in
question?

gwyn@smoke.ARPA (Doug Gwyn ) (09/14/88)

In article <44414@beno.seismo.CSS.GOV> rick@seismo.CSS.GOV (Rick Adams) writes:
>What should 
>	egrep '|root' /etc/passwd
>print?

The question of how to handle empty strings in regular expressions
does not have a clear-cut answer.  "sam", which has the most powerful
regular expression mechanism of any of the editor-like programs I use,
says the following:
	?no operand for `|'
which is yet another solution.  (Getting the description right for
where empty strings do and don't match was the hardest task when I
was overhauling sam's manual entry.)

andrew@alice.UUCP (Andrew Hume) (09/15/88)

it is true that youhave to think carefully about null regular expressions.
i (and gre) believe as sam does, '|root' is a syntax error as | is dyadic.
the behaviour of matching every line is just wrong. egrep handles this
as a special case; recall that egrep takes \n as a synonym for |.
aho wanted to forgive the user for being sloppy with quotes and blank lines
as in egrep '
abc
def
' file

which ordinarily has superfluous \n at the beginning and end. i agree that this
is a flimsy aim but that is the current practice with egrep.
i think that it is clear that grep '' file is always a syntax error;
there is no useful interpretation of such a regular expression.

oz@yunexus.UUCP (Ozan Yigit) (09/15/88)

In article <8492@smoke.ARPA> Doug Gwyn  writes:
>In article <44414@beno.seismo.CSS.GOV> Rick Adams writes:
>>What should 
>>	egrep '|root' /etc/passwd
>>print?
>
>The question of how to handle empty strings in regular expressions
>does not have a clear-cut answer.  

Oh ??? Just what does '(foo)?' match on your egrep ?? It just happens
to be the same thing as E|foo (E for Epsilon, or the null match),
or exactly what rick wants to match. It had a clear cut answer before,
but I am sure it is lost somewhere along the line. 

> .."sam", which has the most powerful
>regular expression mechanism of any of the editor-like programs I use,
>says the following:
>	?no operand for `|'

A parsing glitch no doubt. See, some expressions are more regular :-)
then others, so "sam" will no doubt accept the expression under a
disguise of "(foo)?". 

In any case, Rick should probably use jaw-improved-gnu-egrep for
correct results.

oz
-- 
Crud that is not paged	        | Usenet: ...!utzoo!yunexus!oz
is still crud. 			|   ...uunet!mnetor!yunexus!oz
	andrew@alice		| Bitnet: oz@[yulibra|yuyetti]
				| Phonet: +1 416 736-5257x3976

ok@quintus.uucp (Richard A. O'Keefe) (09/16/88)

In article <8202@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>it is true that youhave to think carefully about null regular expressions.
Right.
...
>i think that it is clear that grep '' file is always a syntax error;
>there is no useful interpretation of such a regular expression.
Wrong.

(1) The empty r.e. is the mathematically obvious way of writing a
    pattern which matches the empty string.

(2) grep -c '' file		- print the number of lines in the file
    grep -n '' file		- print lines with line numbers
    ls | egrep '\.c(|\.BAK)$'	- find *.c and *.c.BAK files

    The first two of these work fine in SunOS 3.2, the last one is the
    one I would really like, and it gets a syntax error.  There are
    several things I could do to make it acceptable, but why should I?
    It's clear as it stands, and csh would let me use *.c{,.BAK} as a
    pattern.
    
    It is true that the System V version of grep is broken here: if
    you try "grep -c '' file" you are told
	grep: RE error 41: No remembered search string
    which is absurd.  In order to write a pattern which will match the
    empty string, you are obliged to write something like
	grep -c 'x\{0\}' file
    It's crazy to allow a complicated hack like this but not the
    direct expression of the idea.  (Why is the \{..\} feature in
    grep, but not in egrep?)

andrew@alice.UUCP (Andrew Hume) (09/17/88)

it sounds appealing to allow a missing RE to mean the empty string
but i am unconvinced as to its utility. the examples show this; there
is little to choose between
	grep -c '' and grep -c '^'
for counting lines. the latter is one char longer but needs no new concepts.
if the example you realy want to work is
	ls | egrep '\.c(|\.BAK)$'	- find *.c and *.c.BAK files
then you can do that with the current egrep by
	ls | egrep '\.c(\.BAK)?$'
this time the patterns are the same length but no new concepts required.
can anyone think of other useful meanings for the null RE?

rick@seismo.CSS.GOV (Rick Adams) (09/17/88)

> In any case, Rick should probably use jaw-improved-gnu-egrep for
> correct results.


This all started becuase I WAS using the gnu-egrep and it broke
on of my shellscripts (The news "checkgroups" script). Admittedly,
the shellscript could have been better written, but it was an
incompatible behavior and I wasn't really sure which was 'right'

--rick

ok@quintus.uucp (Richard A. O'Keefe) (09/17/88)

In article <8209@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>it sounds appealing to allow a missing RE to mean the empty string
>but i am unconvinced as to its utility. the examples show this;
[He goes on to show that each of the examples can be done without the
 use of empty R.E.s]

This misses the point.  In fact, at least two points.
(1) The method of eliminating the empty R.E. is *DIFFERENT* in each case.
(2) The empty R.E. meaning "match the empty string" is not an arbitrary
    association.  "empty" is the identity for concatenation of both
    strings and patterns.  It is the mathematically obvious thing, and a
    well-written program would have to go out of its way to disallow it.

In fact, as the error message from a System V "grep" shows, grep at least
*IS* going out of its way to disallow empty patterns in order to support
an ed(1) hack (// means "use previous pattern") which cannot possibly be
useful to grep (there *is* no previous pattern).

The argument "I am unconvinced as to its utility [because you can always
translate it away]" can be transferred to the "*" operator: "with the
current egrep" you can always rewrite a pattern P* as P+ .  In fact,
*, +, and ? are all special cases of \{..\} (see grep(1)) so instead of
P*, P+, and P? we should be writing P\{0,\}, P\{1,\}, and P\{0,1\}.
(And, of course, empty expressions as ;\{0\}.)

Why go out of your way to prohibit the obvious way of doing something?

tanner@ki4pv.uucp (Dr. T. Andrews) (09/17/88)

In article <8202@alice.UUCP>, andrew@alice.UUCP (Andrew Hume) writes:
) i think that it is clear that grep '' file is always a syntax error;
) there is no useful interpretation of such a regular expression.

I wasn't aware that the "grep" family was intended to decide for me what
constituted a "useful" regular expression.  I was under the impression
that these programs were merely supposed to match on whatever expression
I cared to provide.
-- 
...!bikini.cis.ufl.edu!ki4pv!tanner  ...!bpa!cdin-1!cdis-1!ki4pv!tanner
or...  {allegra killer gatech!uflorida decvax!ucf-cs}!ki4pv!tanner

bzs@encore.UUCP (Barry Shein) (09/18/88)

Just to add some history to the discussion:

From "The SNOBOL4 Programming Language", Griswold et al, Section 2.6
pp. 35:

"2.6 The Null String in Pattern Matching

The null string is the string of zero length. Attempts by the scanner
to match the null string always succeed...Pattern matching in the
statement

	STR NULL			:S(ON)F(ERROR)

always succeeds, even if STR itself has the null string as value."

Perhaps the SNOBOL crew could be consulted for their original
reasoning. I suspect there is a reasonable set-theoretic argument
which can be made (e.g. the null set is the subset of all other sets.)

	-Barry Shein, ||Encore||

dmcanzi@watdcsu.waterloo.edu (David Canzi) (09/19/88)

In article <8209@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>it sounds appealing to allow a missing RE to mean the empty string
>but i am unconvinced as to its utility.

If x, y, and z are regular expressions, then xyz matches those strings
which can be formed by concatenating any three strings X, Y, and Z
where x matches X, y matches Y, and z matches Z.  The expression 'x|y'
matches any string that is matched by x or y.

So, suppose y=''.  Let x='aa' and z='bb'.  Then xyz='aabb'.  'aa' is
the only string x matches, and 'bb' is the only string z matches,
'aabb' is the only string xyz matches.  The only thing left for y to
match is the null string between 'aa' and 'bb'.  Therefore, the null
string matches the null string.

Let x='' and y='root', so that x|y = '|root'.  Then x|y matches the null
string (because it matches x) and the string 'root' (because it matches
y).  So the egrep command in the subject line should print out all of
/etc/passwd, since every line has the null string on it.

This is intuitively obvious to me, but I tried to prove it because I'm
not sure other people's intuitions are similar to mine.

As for utility, consider the case, which I have actually run into,
where I wanted an expression like 'aa(|bb)cc' to match the strings
'aacc' and 'aabbcc'.  In this case, it's clear I want the expression
in parentheses to match the null string.  The program I was using
wouldn't let me do this, and I had to use something like 'a(a|abb)cc'
to get what I wanted.  If I had had a program generate that expression,
I would have had to add code to detect this special case and rewrite
the regular expression.  Yecch.

-- 
David Canzi

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (09/20/88)

In article <5060@watdcsu.waterloo.edu> dmcanzi@watdcsu.waterloo.edu (David Canzi) writes:
: As for utility, consider the case, which I have actually run into,
: where I wanted an expression like 'aa(|bb)cc' to match the strings
: 'aacc' and 'aabbcc'.  In this case, it's clear I want the expression
: in parentheses to match the null string.  The program I was using
: wouldn't let me do this, and I had to use something like 'a(a|abb)cc'
: to get what I wanted.  If I had had a program generate that expression,
: I would have had to add code to detect this special case and rewrite
: the regular expression.  Yecch.

Interestingly enough, in Henry Spencer's regexp routines (which I borrowed
for perl), if you say /aa(bb)?cc/, it gets translated internally to
the equivalent /aa(bb|)cc/.

The null string should match anything because the whole idea of regular
expressions involves rejecting strings that you can't match.  To match /abc/,
you say "For each of the next N characters, bomb out if it doesn't match.
Otherwise it matches."  You don't go and change the rules just because N
happens to be 0 sometimes.

If you DO change the rules on boundary conditions, people who write program
generators will hate you forever, as David mentioned.  I know, I've been
there.  "Whaddya mean, I can't declare an array of size 0?"

Or look at it another way.  As the pattern gets shorter and shorter,
it matches more and more things.  When it gets as short as it can,
it ought to match as many things as it can, by the Principle of Least
Surprise.

Let's hear it for intuitionalization.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

dmcanzi@watdcsu.waterloo.edu (David Canzi) (09/23/88)

In article <2894@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>The null string should match anything because the whole idea of regular
>expressions involves rejecting strings that you can't match.  To match /abc/,
>you say "For each of the next N characters, bomb out if it doesn't match.
>Otherwise it matches."  You don't go and change the rules just because N
>happens to be 0 sometimes.

This is a pet peeve of mine.  Many programs do something odd and
irregular with operand values of 0, null strings, empty argument lists,
and so on, often in cases where reasonable and regular ways of handling
such things are obvious.  This became a pet peeve of mine while I was
working on VM/CMS, where you are not allowed to create an empty file
(no records), or an empty record within a file, and where ever a
program might attempt to do so, it must check for it and do some kind
of special handling or other.

Here's another example having to do with regular expressions.  If a
regular expression package won't accept an expression like "aa(|bb)cc",
then a program that generates a regular expression using a sprintf
format of "aa(%s|bb)cc" can't work if that string is ever null.  It
occurred to me that if there was some non-null regular expression that
matches only a null string, then you could use something like:

sprintf(regx, "aa(%s%s|bb)cc", matchnull, string);

to accomplish the purpose without causing the regular expression
routines to object.  All that is necessary is to figure out a value for
matchnull.  Suppose "_" was a regular expression metacharacter that
doesn't match anything.  Then "_*" would match 0 or more occurrences of
strings that match "_" -- but there are no such strings, so only 0
occurrences could be matched.  That is, "_*" would match only null
strings.  Can a regular expression that doesn't match anything be
written?  It seemed so to me.  If "[abc]" matches any character from
the set of characters between the brackets, then "[]", having no
characters between the brackets, shouldn't match anything.  So, matchnull
should be set to "[]*".

Nice theory, but when I tried this out with grep and egrep, they
complained about syntax.

-- 
David Canzi

allbery@ncoast.UUCP (Brandon S. Allbery) (09/24/88)

As quoted from <855@yunexus.UUCP> by oz@yunexus.UUCP (Ozan Yigit):
+---------------
| In article <8492@smoke.ARPA> Doug Gwyn  writes:
| >In article <44414@beno.seismo.CSS.GOV> Rick Adams writes:
| >>What should 
| >>	egrep '|root' /etc/passwd
| >>print?
| >
| >The question of how to handle empty strings in regular expressions
| >does not have a clear-cut answer.  
| 
| Oh ??? Just what does '(foo)?' match on your egrep ?? It just happens
| to be the same thing as E|foo (E for Epsilon, or the null match),
| or exactly what rick wants to match. It had a clear cut answer before,
| but I am sure it is lost somewhere along the line. 
+---------------

But (foo)? is well-defined, whereas (|foo) is open to variant
interpretations, as shown.  Why not accept it?  If you're going to complain
about (|foo), why not complain about (^|foo) as well?

System V grep prints that weird error message because it uses /bin/ed's
regexp code; and ed re-uses the last regexp when presented with a null
regexp.  Perhaps a bit strange, but it indicates a missing option in the
regexp common code rather than a bug in grep, in my opinion.  (What say we
get AT&T to use Henry Spencer's V8 regexp code?)

++Brandon
-- 
Brandon S. Allbery, uunet!marque!ncoast!allbery			DELPHI: ALLBERY
	    For comp.sources.misc send mail to ncoast!sources-misc
"Don't discount flying pigs before you have good air defense." -- jvh@clinet.FI

andrew@alice.UUCP (Andrew Hume) (10/04/88)

for what it is worth, i give in and will support a null regular
expression which matches the inter-character null string.
thus, so will gre. of course, programs which use my libraries are
free to intercept empty expressions and substitute the last expression
etc.

i thank several correspondants who convinced me of this but due
to the HOPELESS STUPID WRETCHED CRAPPY netnews/mail black hole of despair
i couldn't reply directly so i have to broadcast my thanks.