[comp.unix.shell] Puzzled by A Regexp...

tres@virga.rap.ucar.edu (Tres Hofmeister) (03/05/91)

	I've run across a regular expression that I don't quite understand.
Not that this hasn't happened before, but this seems like it should be
fairly straightforward...

	I'm trying to match entries in /etc/group which have one or more
members.  The following works just fine, matching each of the colon
delimited fields individually followed by one or more characters:

	grep '^.*:.*:.*:..*' /etc/group

	What I don't understand is why the following doesn't work the same
way:

	grep '^.*:..*' /etc/group

	It grabs entries with one or more members, true, but also grabs
entries with no members, e.g. "news:*:6:".  I figured that this regexp
would match the longest possible string at the beginning of a line,
terminated by a colon, which in the group file should include the first
two colons, followed by at least one character.  It seems to be doing
something else, given that it will also match a line with no members.

	Any ideas?


Tres Hofmeister
tres@ncar.ucar.edu

jik@athena.mit.edu (Jonathan I. Kamens) (03/05/91)

In article <10469@ncar.ucar.edu>, tres@virga.rap.ucar.edu (Tres Hofmeister) writes:
|> 	It grabs entries with one or more members, true, but also grabs
|> entries with no members, e.g. "news:*:6:".  I figured that this regexp
|> would match the longest possible string at the beginning of a line,
|> terminated by a colon, which in the group file should include the first
|> two colons, followed by at least one character.  It seems to be doing
|> something else, given that it will also match a line with no members.

  Each segment of a regular expression matches the longest possible string
that it can match *while allowing the rest of the regular expression to match
as well*.

  So, let's analyze what happens when the regexp "^.*:..*" is compared to
"news:*:6:".  It will first match the colon in that regexp against the last
colon in the string.  But then it will discover that when it does that, the
rest of the regexp can't be matched.  So it will back off and see if "^.*:"
can be matched against something shorter.  As a result, the colon will get
matched up with the second to last colon in the string, and the "..*" will
match against "6:".
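
  One way to watch the backing off is to have sed report what each half of
the regexp ended up matching.  This is only a sketch: the \( \) grouping and
the bracketed output format are added for illustration and are not part of
the original grep command.

```shell
# Split the sample line the same way the regexp "^.*:..*" does.
# \1 is what "^.*" matched, \2 is what "..*" matched.
line='news:*:6:'
split=$(printf '%s\n' "$line" | sed 's/^\(.*\):\(..*\)/[\1] [\2]/')
printf '%s\n' "$split"
```

  The first bracket comes out holding "news:*" and the second "6:", so the
colon in the regexp has lined up with the second to last colon in the string.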

  I hope this clears things up for you.

-- 
Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8085			      Home: 617-782-0710

krs@uts.amdahl.com (Kristopher Stephens) (03/07/91)

In article <10469@ncar.ucar.edu> tres@virga.rap.ucar.edu (Tres Hofmeister) writes:
>
>	I've run across a regular expression that I don't quite understand.
>Not that this hasn't happened before, but this seems like it should be
>fairly straightforward...
>
>	I'm trying to match entries in /etc/group which have one or more
>members.  The following works just fine, matching each of the colon
>delimited fields individually followed by one or more characters:
>
>	grep '^.*:.*:.*:..*' /etc/group

This one will find any line containing at least three colons with at
least one character (of any kind, including another colon) after the
third.  This re means

	From start of line
	zero or more of any characters
	a colon
	zero or more of any characters
	a colon
	zero or more of any characters
	a colon
	any single character
	zero or more of any characters

It'll match good group entries and

	:::::
	:::.:   ::: --:
	::::
	a:::a
	:::a

>	What I don't understand is why the following doesn't work the same
>way:
>
>	grep '^.*:..*' /etc/group

This one will find any record that includes a : before the last char in 
the line. The re means

	From start of line,
	zero or more of any characters
	a colon
	any single character
	0 or more of any characters

It matches the following

	::
	:::a
	:b:::
	:b
	gigo:1123

>	It grabs entries with one or more members, true, but also grabs
>entries with no members, e.g. "news:*:6:".  I figured that this regexp
>would match the longest possible string at the beginning of a line,
>terminated by a colon, which in the group file should include the first
>two colons, followed by at least one character.  It seems to be doing
>something else, given that it will also match a line with no members.

The only lines it *won't* match are those with no colons or where the
only colon in the line is the last character.  What it's looking for
is a line with a colon followed by a character.
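
A quick way to check this is to run a few made-up lines through the
pattern.  The sample lines here are hypothetical, picked to cover each
case (no colon, colon only at the end, colon followed by something):

```shell
# Hypothetical sample lines, one per case discussed above.
samples='news:*:6:
nocolon
trailing:
::
:b'
# Keep only the lines the short pattern matches.
matched=$(printf '%s\n' "$samples" | grep '^.*:..*')
printf '%s\n' "$matched"
```

Only "nocolon" and "trailing:" are dropped; the memberless "news:*:6:"
line gets through along with "::" and ":b".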

>	Any ideas?

Use   [^:]*   instead of   .*   in the first (field matching) version, so
that no field can swallow a colon:

	grep '^[^:]*:[^:]*:[^:]*:..*' /etc/group
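
As a sketch of how this behaves, here it is run against three made-up
group entries (the sample data is hypothetical, not from any real
/etc/group):

```shell
# Hypothetical group entries; only wheel and staff have members.
group_lines='wheel:*:0:root
news:*:6:
staff:*:10:tres,jik'
# [^:]* cannot cross a colon, so the pattern cannot back off into
# an earlier field the way .* does.
with_members=$(printf '%s\n' "$group_lines" | grep '^[^:]*:[^:]*:[^:]*:..*')
printf '%s\n' "$with_members"
```

The news line is no longer matched, since nothing follows its third
colon.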

Even better for the second example is to anchor at the END instead of the
BEGINNING of the data lines:

	egrep ':[^:]+$' /etc/group

will match any line with at least one non-colon character following the
last colon in the line.  (It has to be egrep, or grep -E, since plain
grep takes the + literally.)  Equivalent patterns that plain grep accepts:

	:[^:]\{1,\}$
	:[^:][^:]*$
	^.*:[^:][^:]*$
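
A small sketch to confirm that the three forms above agree, using plain
grep and made-up sample entries (the data is hypothetical):

```shell
# Hypothetical group entries; news has no members.
group_lines='wheel:*:0:root
news:*:6:
staff:*:10:tres,jik'
# Run each of the three basic-regexp forms over the same input.
a=$(printf '%s\n' "$group_lines" | grep ':[^:]\{1,\}$')
b=$(printf '%s\n' "$group_lines" | grep ':[^:][^:]*$')
c=$(printf '%s\n' "$group_lines" | grep '^.*:[^:][^:]*$')
printf '%s\n' "$a"
# Report whether all three selected the same lines.
if [ "$a" = "$b" ] && [ "$b" = "$c" ]; then echo same; else echo differ; fi
```

All three pick out the wheel and staff lines and skip news.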

Finally, any line not matching the following (egrep syntax, because of
the +) is either a group with no members or a badly formed line in the
file:

	^[^:]+:[^:]*:[0-9]+:[^:]+$

which matches

	From start of line
	at least one non-colon
	a colon
	any number of non-colons
	a colon
	a decimal number
	a colon
	at least one non-colon
	end of line

Note that it won't catch other anomalies, such as a gid that is too big
(the limit is system dependent, and we can't check whether it's 65536,
for instance, when 65535 is the biggest), usernames that are too long, or
weird stuff in the userids field (we could exclude spaces, for instance,
by using   [^: ]   in place of each   [^:]   ).  Still, any line *not*
found by the above is either a group with no members or a badly formed
line.
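
To list the lines that fail the check, the pattern can be handed to
grep with -v.  This is a sketch under a couple of assumptions: grep -E
(or egrep) is used because of the +, and the sample entries are made up.

```shell
# Hypothetical entries: two good ones, one with no members,
# one with no colons at all, and one with an empty group name.
group_lines='wheel:*:0:root
news:*:6:
staff:*:10:tres,jik
bogus line
:*:12:nobody'
# -v inverts the match, so only the suspect lines are kept.
suspect=$(printf '%s\n' "$group_lines" |
    grep -E -v '^[^:]+:[^:]*:[0-9]+:[^:]+$')
printf '%s\n' "$suspect"
```

The wheel and staff lines pass; the other three are reported, and it is
still up to you to tell the memberless group from the malformed lines.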

...Kris
-- 
Kristopher Stephens, | (408-746-6047) | krs@uts.amdahl.com | KC6DFS
Amdahl Corporation   |                |                    |
     [The opinions expressed above are mine, solely, and do not    ]
     [necessarily reflect the opinions or policies of Amdahl Corp. ]