[gnu.utils.bug] Grep bug?

lang@PRC.UNISYS.COM (09/13/89)

This may not be a bug but rather my misunderstanding of
the documentation, but it strikes me as odd....

% grep '\wx\w' foo
finds all occurrences in foo of 'x' flanked by alphanumerics.

% grep '\Wx\W' foo
finds all occurrences in foo of 'x' flanked by non-alphanumerics.

Fine.  So far so good.
Suppose now I want to find all occurrences of 'x'
flanked by non-alphanumerics, but I want the match
on 'x' not to be case-sensitive.  Well, the -i flag
makes grep ignore case differences when comparing strings,
so I try

% grep -i '\Wx\W' foo

But the effect here is that the -i flag makes the '\W'
meta-character non-case-sensitive as well, and I get
all occurrences of x in foo, regardless of the case of x
(which is fine) but also regardless of whether or not
the x (or X) is flanked by non-alphanumerics!

Is this a feature?

--Francois Lang

tarvaine@tukki.jyu.fi (Tapani Tarvainen) (09/16/89)

In article <8909121817.AA11159@gem> lang@PRC.UNISYS.COM writes:
...
>% grep -i '\Wx\W' foo
>
>But the effect here is that the -i flag makes the '\W'
>meta-character non-case-sensitive as well

I would call this a bug (and patched it).
I traced the problem to the following piece of code in dfa.c:

/* Parse and analyze a single string of the given length. */
void
regcompile(s, len, r, searchflag)
     const char *s;
     size_t len;
     struct regexp *r;
     int searchflag;
{
  if (case_fold)	/* dummy folding in service of regmust() */
    {
	static char *p;

	case_fold = 0;
	for (p = (char *)s; *p != 0; p++)
		if (isupper((int)*p))
			*p = tolower((int) *p);
...

I.e., when the -i flag is given, it folds the entire regexp to lower
case before doing anything else with it, so that \W becomes \w and \B
becomes \b.  I failed to find any reason for this: the search routines
handle case folding on their own anyway, and removing the above loop
didn't seem to have any effect other than curing the unwanted effect
of the -i flag on \W (and \B).
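
To see the effect in isolation, here is a minimal sketch of the same
folding loop pulled out into a standalone function.  It is not the
dfa.c code itself: the name fold_all is hypothetical, and the casts
to unsigned char (instead of the original int casts) are an assumption
made so the ctype calls stay well-defined for any char value.

```c
#include <ctype.h>

/* Hypothetical helper (not in dfa.c): lower-case every upper-case
   letter of a NUL-terminated pattern in place, as the quoted loop
   does.  It has no notion of backslash escapes. */
void
fold_all(char *s)
{
  char *p;

  for (p = s; *p != 0; p++)
    if (isupper((unsigned char) *p))
      *p = tolower((unsigned char) *p);
}
```

Called on a writable buffer holding \Wx\W, this leaves \wx\w behind,
which is why the -i search then matches 'x' flanked by alphanumerics
rather than by non-alphanumerics.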

Does anybody know whether the folding loop is necessary in some other
program using dfa.c (or do they maybe have the same bug)?

If so, the following should work (avoids changing letters after a \):

	for (p = (char *)s; *p != 0; p++)
		if (isupper((int)*p))
			*p = tolower((int) *p);
		else if (*p == '\\' && *(p+1))
			p++;
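
The escape-aware loop above can be tried on its own with a sketch
like the following.  The function name fold_unescaped is hypothetical,
and the unsigned char casts are an assumption replacing the int casts
of the original; the control flow is otherwise the same.

```c
#include <ctype.h>

/* Hypothetical helper (not in dfa.c): lower-case a NUL-terminated
   pattern in place, but leave the character that follows a backslash
   untouched, so that \W and \B keep their meaning. */
void
fold_unescaped(char *s)
{
  char *p;

  for (p = s; *p != 0; p++)
    if (isupper((unsigned char) *p))
      *p = tolower((unsigned char) *p);
    else if (*p == '\\' && *(p + 1) != 0)
      p++;	/* step onto the escaped character; the loop's own
		   p++ then moves past it without folding it */
}
```

Given \WX\W this produces \Wx\W: the literal X is folded while both
\W escapes survive.  A backslash at the very end of the pattern is
left alone, because the *(p + 1) test keeps the loop from stepping
past the terminating NUL.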

As far as e?grep is concerned, however, simply removing the loop seems
to work fine (the declaration of p can be removed as well).
Here's a context diff that does so (actually it #if's the lines
out rather than deleting them, and adds a comment):

*** dfa.old	Sat Sep 16 12:04:32 1989
--- dfa.c	Sat Sep 16 12:04:34 1989
***************
*** 1668,1679 ****
--- 1668,1685 ----
  {
    if (case_fold)	/* dummy folding in service of regmust() */
      {
+ /* the following two #if 0's added by Tapani Tarvainen 16 Sep 89   */
+ /* to prevent -i flag from affecting \W and \B in e?grep           */
+ #if 0
  	static char *p;
+ #endif
  
  	case_fold = 0;
+ #if 0
  	for (p = (char *)s; *p != 0; p++)
  		if (isupper((int)*p))
  			*p = tolower((int) *p);
+ #endif
  	reginit(r);
  	r->mustn = 0;
  	r->must[0] = '\0';
-- 
Tapani Tarvainen    (tarvaine@tukki.jyu.fi, tarvainen@finjyu.bitnet)