[news.software.b] egrep hazardous to your system's cnews health

res@colnet.uucp (Rob Stampfli) (05/11/91)

I recently added /usr/lib/newsbin to my default search path, so I could use
the 'queuelen' command to see what was going on with the batcher.  I found,
in the process, that the performance of this command was abysmal.  Why,
'uustat' is significantly faster by comparison.

Looking at the 'queuelen' script, I found that, lo and behold, it wasn't
really that complicated.  The culprit turns out to be egrep.  There is
apparently something mightily inefficient about the way this command does
some searches.  The pertinent point is that the pipeline:

	ls | egrep "^C\..*$grade....\$" | wc | awk '{print $1}'

took 22+ seconds to execute on my Unix-PC (16 entries in the directory).
Changing egrep to grep cut the execution time to a more reasonable
2 seconds.
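
The pipeline's effect is easy to reproduce with made-up spool entries
(the C.* names below are hypothetical; real ones come from the uucp spool):

```shell
# Hypothetical demo of the queuelen pipeline with made-up file names.
# With grade=C the pattern expands to ^C\..*C....$ : a "C." prefix,
# anything, the grade letter, then exactly four trailing characters.
grade=C
dir=$(mktemp -d)
cd "$dir" || exit 1
touch C.kd8wkC1234 C.kd8wkC5678 D.kd8wk1234
ls | egrep "^C\..*$grade....\$" | wc | awk '{print $1}'   # prints 2 here
```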

The egrep inefficiency seems to be most exacerbated by the four '.'s
(match any character) after $grade, which seem to put egrep into an
almost infinite loop.

Is this true on other systems?  I noticed that cnews tends to use egrep
rather copiously.  Most of the instances could as easily use grep.  I am
tempted to start a wholesale replacement.
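
A minimal sketch of what such a replacement could look like; the script
text here is a stand-in, not the real queuelen, and each pattern would
still need checking by hand for egrep-only operators ('|', '+', '?')
that plain grep does not support:

```shell
# Hypothetical sketch: rewrite egrep to grep in a copy of a script.
# The sample script text below is made up for illustration.
script=$(mktemp)
cat > "$script" <<'EOF'
ls | egrep "^C\..*$grade....\$" | wc | awk '{print $1}'
EOF
sed 's/egrep/grep/g' "$script"     # writes the converted text to stdout
```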

BTW, I am running cnews with patches thru 12/90.  The system is a Unix-PC with
3.51 (s5r2) Unix.  Any sage advice?
-- 
Rob Stampfli, 614-864-9377, res@kd8wk.uucp (osu-cis!kd8wk!res), kd8wk@n8jyv.oh

emv@ox.com (Ed Vielmetti) (05/13/91)

> egrep and C news performance

there are a number of programs out there which go by the name of
egrep, perform roughly the same function, and whose performance
differs measurably depending on system speed, available memory, and
the nature of the data to be grep'd for.  

your best bet would be to get an egrep that uses the Henry Spencer
regexp library or one of its derivatives; that would probably be
substantially better than the egrep that shipped with your Unix PC.
I'll have to check on the pedigree of the gnu 'egrep'; it may also be
suited to your needs.

the other thing to note in the particular example is that it makes
very conservative assumptions (for portability considerations, no
doubt) about the function of your stock utilities.  you could save an
awk invocation if your wc supports the '-l' flag, since "wc -l" is the
same as "wc | awk '{print $1}'" for modern versions of wc.
there may well be other constructs which you could characterize,
isolate, and recode for efficiency on your own system.
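
the equivalence is easy to check with any small file; a sketch (the
sample file is made up):

```shell
# "wc -l" prints the line count alone, so the trailing awk stage is
# redundant wherever wc supports the flag.
tmp=$(mktemp)
printf 'C.a\nC.b\nC.c\n' > "$tmp"
wc -l < "$tmp"                     # the line count, 3
wc < "$tmp" | awk '{print $1}'     # the same count via the long pipeline
```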

C news does not ship with its own versions of sh, awk, ls, egrep, wc,
tr, sed, and cat; it is assumed that the vendor versions will be good
enough.  the existing shell scripts have to work around many known
deficiencies in these programs to be reasonably portable.  There is
enough consistency in the construction of C news shell idioms that
it's well within reason to look at methodically replacing them with
special-purpose programs or more precisely tailored shell constructs;
I'd expect that the UUNET funding of further C news development will
have some of this in mind.
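
one sketch of such a tailored construct, assuming the C.* queue-file
naming from the earlier pipeline (this is an illustration, not code
from C news):

```shell
# Hypothetical sketch: count queue files with shell globbing instead
# of ls | egrep | wc | awk, so no external process is needed at all.
# The file names are made up for the demonstration.
grade=C
dir=$(mktemp -d)
cd "$dir" || exit 1
touch C.kd8wkC1234 C.kd8wkC5678 D.kd8wk1234
set -- C.*"$grade"????
# an unmatched glob is left unexpanded, so $1 equals the pattern itself
if [ "$1" = "C.*$grade????" ]; then echo 0; else echo $#; fi   # prints 2 here
```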

-- 
Edward Vielmetti, vice president for research, MSEN Inc.  emv@msen.com

"(6) The Plan shall identify how agencies and departments can
collaborate to ... expand efforts to improve, document, and evaluate
unclassified public-domain software developed by federally-funded
researchers and other software, including federally-funded educational
and training software; "
			"High-Performance Computing Act of 1991, S. 272"

gemini@geminix.in-berlin.de (Uwe Doering) (05/13/91)

res@colnet.uucp (Rob Stampfli) writes:

>I recently added /usr/lib/newsbin to my default search path, so I could use
>the 'queuelen' command to see what was going on with the batcher.  I found,
>in the process, that the performance of this command was abysmal.  Why,
>'uustat' is significantly faster by comparison.
>
>Looking at the 'queuelen' script, I found that, lo and behold, it wasn't
>really that complicated.  The culprit turns out to be egrep.  There is
>apparently something mightily inefficient about the way this command does
>some searches.  The pertinent point is that the pipeline:
>
>	ls | egrep "^C\..*$grade....\$" | wc | awk '{print $1}'
>
>took 22+ seconds to execute on my Unix-PC (16 entries in the directory).
>Changing egrep to grep cut the execution time to a more reasonable
>2 seconds.
>
>The egrep inefficiency seems to be most exacerbated by the four '.'s
>(match any character) after $grade, which seem to put egrep into an
>almost infinite loop.
>
>Is this true on other systems?  I noticed that cnews tends to use egrep
>rather copiously.  Most of the instances could as easily use grep.  I am
>tempted to start a wholesale replacement.

I found this to be true on ISC UNIX 386, too. I changed most occurrences
of `egrep' to `grep' (at least where no special features of `egrep'
were used). Since then, batching runs much faster.

      Uwe
-- 
Uwe Doering  |  INET : gemini@geminix.in-berlin.de
Berlin       |----------------------------------------------------------------
Germany      |  UUCP : ...!unido!fub!geminix.in-berlin.de!gemini

henry@zoo.toronto.edu (Henry Spencer) (05/13/91)

In article <1991May11.040314.15393@colnet.uucp> res@colnet.uucp (Rob Stampfli) writes:
>Is this true on other systems?  I noticed that cnews tends to use egrep
>rather copiously.  Most of the instances could as easily use grep.  I am
>tempted to start a wholesale replacement.

On most systems egrep is faster, and its patterns are also more powerful.
We necessarily have to code for the usual situation rather than the odd ones.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry