[comp.unix.wizards] what should egrep '|root' print?

oz@yunexus.UUCP (Ozan Yigit) (09/18/88)

[Apologies to those getting tired of this topic.]

In article <8209@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
> >it sounds appealing to allow a missing RE to mean the empty string
> but i am unconvinced as to its utility.  
> 

With all due respect, the argument of "utility" except in the
"specific" case of '|foo' (as used by Rick@seismo) is suspect
(bogus?). Unless I am mistaken in the equivalence of (foo)?  and
(foo|E), the issue reduces to one of expression syntax vs semantics.
Is there a good syntactic reason not to allow (foo|) as a valid
expression, such as grammar ambiguity ?? If NOT, I would claim that
the parsers rejecting the expression are "incomplete" (some would
say broken :-), regardless of whether it is in "sam" (Gwyn special,
Argumentum Ad Sam) or wherever.
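
[Illustration, not part of the original post. A minimal sketch of the claimed equivalence between (foo)? and (foo|E), assuming Python's re module as a stand-in for an engine that accepts an empty alternative. The egrep and sam parsers under discussion may well reject it. The sample strings and the helper accepts() are made up for the example.]

```python
import re

# Assumption: Python's re accepts an empty alternative; the tools
# discussed in the thread may treat it as a syntax error instead.
optional = re.compile(r"blah(foo)?gasp")
empty_alt = re.compile(r"blah(foo|)gasp")

# Hypothetical sample inputs, chosen to cover match and non-match cases.
samples = ["blahgasp", "blahfoogasp", "blahfoofoogasp", "blahbargasp", ""]


def accepts(pattern, text):
    """Return True when the whole string is in the pattern's language."""
    return pattern.fullmatch(text) is not None


# The two spellings agree on every sample if they describe the same language.
agree = all(accepts(optional, s) == accepts(empty_alt, s) for s in samples)
print(agree)
```

Agreement on a handful of samples is evidence rather than proof, but it shows that nothing in the semantics distinguishes the two spellings; only the parser does.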

I agree that "blah(foo||bar)gasp" may not look quite as interesting
(arguably) as "blah(foo|bar)?ptui", but if they are equivalent (yeah,
I know, gasp is not equivalent to ptui. :-) and if there is no solid
syntactic reason to allow one and disallow the other, then, why bother
to come up with excuses for it ??

Any thoughts, and/or some real reason against (foo|) ??

oz
-- 
Crud that is not paged	        | Usenet: ...!utzoo!yunexus!oz
is still crud. 			|   ...uunet!mnetor!yunexus!oz
	andrew@alice		| Bitnet: oz@[yulibra|yuyetti]
				| Phonet: +1 416 736-5257x3976

henry@utzoo.uucp (Henry Spencer) (09/20/88)

In article <857@yunexus.UUCP> oz@yunexus.UUCP (Ozan Yigit) writes:
>Any thoughts, and/or some real reason against (foo|) ??

Well, personally, I'd dearly love to be able to use (| and |) as metasymbols,
since (a) one highly desirable extension to my regexp package would be the
beginning/end-of-identifier metasymbols found in many implementations,
(b) I am deeply opposed to declaring more unbackslashed characters to be
metasymbols, and (c) I am even more deeply opposed to declaring *any*
backslashed characters to be metasymbols.  There are other possibilities,
exploiting sequences that are syntax errors at the moment, but none of
them is nearly as pretty.  (Not a trivial issue, given that users have to
remember whatever sequence gets chosen.)  Alas, I am also sympathetic
to the argument that (1) it would be an unfortunate inconsistency, and
(2) programs that generate regexps might have to go out of their way to
avoid generating these magic sequences.  Argh.  Any thoughts?
-- 
NASA is into artificial        |     Henry Spencer at U of Toronto Zoology
stupidity.  - Jerry Pournelle  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ok@quintus.uucp (Richard A. O'Keefe) (09/21/88)

In article <1988Sep20.043728.20198@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Well, personally, I'd dearly love to be able to use (| and |) as metasymbols,

Why not use (* ... ) as the meta-construct?

>(2) programs that generate regexps might have to go out of their way to
>avoid generating these magic sequences.  Argh.  Any thoughts?

I suggest that there ought to be a way for programs to generate R.E.s
*without* using magic sequences.  How about having a program do e.g.
	begin_re();             /*  "/"    */
	literal("foo");		/*  "foo"  */
	begin_alternatives();	/*  "("    */
	literal("baz");		/*  "baz"  */
	next_alternative();	/*  "|"    */
	end_alternatives();	/*  ")"    */
	literal(".c");		/*  "\.c"  */
	pattern = end_re();	/*  "/"    */
to obtain a pattern equivalent to Csh's foo{baz,}.c  
It is *already* the case that programs which generate patterns have to
go out of their way to avoid far too many magic sequences; a library like
this would eliminate the problem at the source.
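
[Illustration, not part of the original post. One way such a library could look, sketched in Python rather than C. The class name PatternBuilder and its error handling are assumptions; the method names follow the calls in the post, and re.escape stands in for whatever quoting the real library would do.]

```python
import re


class PatternBuilder:
    """Hypothetical builder; method names mirror the calls in the post."""

    def __init__(self):
        self._parts = []
        self._depth = 0  # number of open alternative groups

    def literal(self, text):
        # re.escape quotes every character that Python's re treats as special,
        # so the caller never has to know which sequences are magic.
        self._parts.append(re.escape(text))
        return self

    def begin_alternatives(self):
        self._parts.append("(?:")
        self._depth += 1
        return self

    def next_alternative(self):
        if self._depth == 0:
            raise ValueError("next_alternative outside a group")
        self._parts.append("|")
        return self

    def end_alternatives(self):
        if self._depth == 0:
            raise ValueError("end_alternatives without begin_alternatives")
        self._parts.append(")")
        self._depth -= 1
        return self

    def end_re(self):
        if self._depth != 0:
            raise ValueError("unclosed alternatives group")
        return re.compile("".join(self._parts))


# Same call sequence as in the post: foo, then (baz | nothing), then ".c".
pattern = (
    PatternBuilder()
    .literal("foo")
    .begin_alternatives()
    .literal("baz")
    .next_alternative()  # nothing follows: an empty alternative
    .end_alternatives()
    .literal(".c")
    .end_re()
)
print(pattern.pattern)
```

Note that the builder produces an empty alternative as a matter of course, which is the case the thread is arguing about: a generator like this only works if the underlying engine accepts it.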

weemba@garnet.berkeley.edu (Obnoxious Math Grad Student) (09/21/88)

In article <1988Sep20.043728.20198@utzoo.uucp>, henry@utzoo (Henry Spencer) writes:
>					   Alas, I am also sympathetic
>to the argument that (1) it would be an unfortunate inconsistency, and
>(2) programs that generate regexps might have to go out of their way to
>avoid generating these magic sequences.  Argh.  Any thoughts?

From a theoretician's point of view, these are the only arguments.

I ran into null regexps in Gnews, when I generalized from KILLing based on
newsgroup names to KILLing based on newsgroup regexps.  I was so pleased
when I realized that the null regexp would match all newsgroup names, and
thus provide for global KILLs.  It never occurred to me that there might
be regexp handlers that would not take this: it's plain unnatural.
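
[Illustration, not part of the original post. Gnews itself is not shown here; this is a minimal Python sketch of the idea, assuming a kill table keyed by newsgroup regexps. The table contents and the function should_kill() are hypothetical.]

```python
import re

# Hypothetical kill table: (newsgroup regexp, subject regexp).
kill_rules = [
    ("", r"egrep"),                 # null regexp: applies in every newsgroup
    (r"^comp\.unix\.", r"flame"),   # applies only under comp.unix
]


def should_kill(newsgroup, subject):
    """Return True if any rule matches both the group and the subject."""
    return any(
        re.search(group_re, newsgroup) and re.search(subject_re, subject)
        for group_re, subject_re in kill_rules
    )
```

Because the empty pattern matches at the start of any string, the first rule needs no special "global" flag: it falls out of the ordinary matching rules.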

ucbvax!garnet!weemba	Matthew P Wiener/Brahms Gang/Berkeley CA 94720

henry@utzoo.uucp (Henry Spencer) (09/23/88)

In article <454@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>Why not use (* ... ) as the meta-construct?

The trouble is that the word brackets aren't always used together, so the
trailing bracket needs to be distinguishable by itself.  (* is attractive,
but it has no obvious counterpart to be the closing bracket.

>It is *already* the case that programs which generate patterns have to
>go out of their way to avoid far too many magic sequences; a library like
>this would eliminate the problem at the source.

Actually, with my regexp package it suffices to backslash all the ordinary
characters.  A bit crude, but it works.  This is one of the reasons why I
am very reluctant to assign special meaning to any backslashed characters.
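
[Illustration, not part of the original post. A sketch of why "backslash everything" only works while no backslashed character is special. It uses Python's re, which is not the regexp package described above and which does give meaning to sequences such as \d. The function names are made up for the example.]

```python
import re


def backslash_all(text):
    """The crude scheme from the post: a backslash before every character."""
    return "".join("\\" + ch for ch in text)


def quote_literal(text):
    """What Python needs instead: backslash only the special characters."""
    return re.escape(text)


crude = backslash_all("(|d")   # the pattern \(\|\d
safe = quote_literal("(|d")    # the pattern \(\|d

# In Python's re, \d means "a digit", so the crude quoting changes the
# meaning of the literal text it was supposed to protect.
print(re.fullmatch(crude, "(|d") is None)
print(re.fullmatch(safe, "(|d") is not None)
```

In an engine where no backslashed character is magic, backslash_all() alone would be enough, which is the property the post wants to preserve.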
-- 
NASA is into artificial        |     Henry Spencer at U of Toronto Zoology
stupidity.  - Jerry Pournelle  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

rroot@edm.UUCP (Stephen Samuel) (09/23/88)

From article <857@yunexus.UUCP>, by oz@yunexus.UUCP (Ozan Yigit):
> [Apologies to those getting tired of this topic.]
 
> In article <8209@alice.UUCP> andrew@alice.UUCP (Andrew Hume) writes:
>> >it sounds appealing to allow a missing RE to mean the empty string
>> but i am unconvinced as to its utility.  
  
> I agree that "blah(foo||bar)gasp" may not look quite as interesting
> (arguably) as "blah(foo|bar)?ptui", but if they are equivalent (yeah,
> I know, gasp is not equivalent to ptui. :-) and if there is no solid
> syntactic reason to allow one and disallow the other, then, why bother
> to come up with excuses for it ??
  
I am inclined to say that it might be worthwhile to allow it for the
purpose of completeness.  If you have something that does string
replacements, then there IS a real difference between:
   //  ,  /foo|/  and  /foo/
especially if they are prefixed by something else:  for example,
you might want to do something like:

     change:   /go\(ing|ne|\) /  = /went/

and if you were using grep to search for things like that, it would be
nice to be able to use pieces of your other expressions in
a 'grep' search, even if it does look like a null event sometimes.
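
[Illustration, not part of the original post. A sketch of the substitution case using Python's re.sub, which is an assumption: the editor in question is not named. The middle alternative is taken to be "ne" so that "gone" is covered, the replacement keeps the trailing space, and the sample text is made up.]

```python
import re

# Hypothetical sample text; "gopher" is there to show what is left alone.
text = "going gone go gopher "

# With the empty alternative, bare "go " is rewritten as well.
with_empty = re.sub(r"go(ing|ne|) ", "went ", text)

# Without it, bare "go " is left untouched.
without_empty = re.sub(r"go(ing|ne) ", "went ", text)

print(with_empty)     # went went went gopher 
print(without_empty)  # went went go gopher 
```

For a plain search the empty alternative looks like a no-op, but as soon as the match is replaced the two patterns give different results.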
-- 
-------------
 Stephen Samuel 			Disclaimer: You betcha!
  {ihnp4,ubc-vision,seismo!mnetor,vax135}!alberta!edm!steve
  BITNET: USERZXCV@UQV-MTS

ka@june.cs.washington.edu (Kenneth Almquist) (09/27/88)

henry@utzoo.uucp (Henry Spencer) writes:
> Well, personally, I'd dearly love to be able to use (| and |) as metasymbols,
> since (a) one highly desirable extension to my regexp package would be the
> beginning/end-of-identifier metasymbols found in many implementations,
> (b) I am deeply opposed to declaring more unbackslashed characters to be
> metasymbols, and (c) I am even more deeply opposed to declaring *any*
> backslashed characters to be metasymbols.  There are other possibilities,
> exploiting sequences that are syntax errors at the moment, but none of
> them is nearly as pretty.  (Not a trivial issue, given that users have to
> remember whatever sequence gets chosen.)  Alas, I am also sympathetic
> to the argument that (1) it would be an unfortunate inconsistency, and
> (2) programs that generate regexps might have to go out of their way to
> avoid generating these magic sequences.  Argh.  Any thoughts?

My solution (when I faced this problem a long time ago) was to make an
asterisk at the start of a regular expression require that the string
matched not be preceded or followed by a character which can appear in
a word.  The arguments pro and con seem to be:

1)  Word beginning and ending patterns are more flexible.  Can anyone come
    up with a use for this flexibility?  I can't.

2)  The asterisk convention is easier to type.

3)  The asterisk convention is easy to explain to a beginner on an intuitive
    level ("Place an asterisk in front of the expression to search for a
    word"), although a complete explanation of the semantics is about as
    complicated for either convention.

4)  Even after the user learns the word begin and end commands, the user
    still has to type two commands to get a word search, which increases
    the cognitive complexity compared to typing one command to get a word
    search.

5)  Neither syntax is intuitively obvious, but (| and |) do have intuitively
    obvious interpretations (both consist of a parenthesis and a '|' operator)
    which differ from the interpretation that Henry suggests for them.

The basic problem with the word beginning and ending patterns is that they
are at the wrong level.  If they are *only* used as building blocks to build
word searches, then a higher level feature like the asterisk convention
which allows users to request word searches directly is a better choice.
And they are too high level to be used for much else besides constructing
word searches.  The rare cases where they are used for something else (if
such cases exist) can be handled by lower level features from which word
beginning and ending patterns can be constructed.  I expect that Henry's
regexp package (like egrep) already has the required features.
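
[Illustration, not part of the original post. A sketch of the asterisk convention built on lower-level features, here Python's lookaround assertions. The definition of a word character and the function compile_search() are assumptions, not the implementation described above.]

```python
import re

# Assumption: letters, digits and underscore count as word characters.
WORD_CHAR = r"[A-Za-z0-9_]"


def compile_search(expr):
    """Hypothetical front end: a leading '*' requests a whole-word search."""
    if expr.startswith("*"):
        body = expr[1:]
        # The match must not be preceded or followed by a word character.
        expr = rf"(?<!{WORD_CHAR})(?:{body})(?!{WORD_CHAR})"
    return re.compile(expr)


word_search = compile_search("*root")
plain_search = compile_search("root")
```

A leading asterisk is otherwise a syntax error in Python's re ("nothing to repeat"), so the convention takes over a sequence that had no meaning before. The body is wrapped in a group so that an alternation such as "*foo|bar" is guarded as a whole.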

In conclusion, I believe that including the (| and |) operators in a regular
expression package is a poor idea on two grounds.  The semantics are wrong;
if word searches are desired there are better ways to provide them, such as
the asterisk convention.  And (| and |) are a lousy choice of operators,
for reasons which Henry notes in his article, while the asterisk convention
has no such problems.
				Kenneth Almquist