[comp.unix.shell] How can I optimize this script?

palkovic@tomato.fnal.gov (John Palkovic) (03/02/91)

I posted this script yesterday (<9Y`#-6-@linac.fnal.gov> in
news.software.b) to remove old news floods (newly arrived articles
with old Date: headers):

#!/bin/sh
#
# Old.rm. Trash new news articles with old Date: headers.
# J. Palkovic 2/28/91
#
# You may need to change this next line:
PATH=/usr/lib/newsbin:/usr/local/bin:/usr/bin
TODAY="`date`"
TODAY=`getdate "$TODAY"`
LIMIT=`expr $TODAY - 1209600`
cd /usr/spool/news
find `ls |egrep -v '\.'` -mtime -7 -name '[0-9]*' -type f -print |(
 while read f
 do
   DATE=`header -f Date $f`
   THEN=`getdate "$DATE" 2>/dev/null`
   if test "$THEN"
   then
     test $THEN -lt $LIMIT && rm -f $f
   fi
 done
)

Header is a C program that extracts header fields, written by Chip
Salzenberg (see above mentioned post for source to header). The
question I have for all you comp.unix.shell gurus is: how can I speed
this script up? It took more than two hours to run on our 150+ MB news
spool.  I have some ideas, but would like to hear from you.

-John
--
palkovic@linac.fnal.gov || {tellab5,royko,simon}!linac!palkovic

tchrist@convex.COM (Tom Christiansen) (03/02/91)

From the keyboard of palkovic@linac.fnal.gov:
:I posted this script yesterday (<9Y`#-6-@linac.fnal.gov> in
:news.software.b) to remove old news floods (newly arrived articles
:with old Date: headers):
:
:#!/bin/sh
:#
:# Old.rm. Trash new news articles with old Date: headers.
:# J. Palkovic 2/28/91
:#
:# You may need to change this next line:
:PATH=/usr/lib/newsbin:/usr/local/bin:/usr/bin
:TODAY="`date`"
:TODAY=`getdate "$TODAY"`
:LIMIT=`expr $TODAY - 1209600`
:cd /usr/spool/news
:find `ls |egrep -v '\.'` -mtime -7 -name '[0-9]*' -type f -print |(
: while read f
: do
:   DATE=`header -f Date $f`
:   THEN=`getdate "$DATE" 2>/dev/null`
:   if test "$THEN"
:   then
:     test $THEN -lt $LIMIT && rm -f $f
:   fi
: done
:)
:
:Header is a C program that extracts header fields, written by Chip
:Salzenberg (see above mentioned post for source to header). The
:question I have for all you comp.unix.shell gurus is: how can I speed
:this script up? It took more than two hours to run on our 150+ MB news
:spool.  I have some ideas, but would like to hear from you.

The find, ls, and egrep aren't really a problem, as they only execute once.
Your big hit is that you have a lot of execs going in that tight loop.

I would try two things: first, run it under ksh (if you have it) and make
the tests be builtin.  That'll save you two execs.
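
As a rough illustration of that first suggestion, the loop might look
like this once both tests are shell builtins. This is only a sketch:
prune_old is a made-up function name, and header and getdate are
assumed to be the same external programs the original script calls.

```sh
#!/bin/sh
# Sketch only. prune_old is a hypothetical name; header and getdate are
# the external helpers used by the original script.
# Usage: <list of article paths on stdin> | prune_old LIMIT_IN_SECONDS
prune_old() {
    limit=$1
    while read -r f
    do
        date_hdr=`header -f Date "$f"`
        then_secs=`getdate "$date_hdr" 2>/dev/null`
        # In ksh (and other shells where [ is built in) neither test
        # below costs an exec.
        if [ -n "$then_secs" ] && [ "$then_secs" -lt "$limit" ]
        then
            rm -f "$f"
        fi
    done
}
```

It would sit at the end of the same find pipeline as the original
loop, with "$LIMIT" passed as its argument. The execs of header,
getdate, and rm are still there; only the two tests go away.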

The second thing I would try is to code up the whole tight loop in perl,
since the tests, rm, and header stuff are all trivial to do there.  
This could save you all the execs in the loop, which I think would be
a big win.
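
For anyone who wants to stay in the shell, the same idea (fewer execs
per article) can be approximated by pulling every Date: header in one
awk process per batch of files instead of one header process per
article. This is a swapped-in technique, not the perl version described
above, and date_lines is a hypothetical name. It assumes the header is
spelled exactly "Date:" and that article paths contain no whitespace,
which holds for a numeric news spool.

```sh
#!/bin/sh
# Sketch only. date_lines is a hypothetical name.
# Reads article paths on stdin; prints "path<TAB>Date header value" for
# each article that has a Date: line in its header.
date_lines() {
    xargs awk '
        FNR == 1 { in_header = 1 }            # new file: back in the header
        in_header && /^$/ { in_header = 0 }   # blank line ends the header
        in_header && /^Date:/ {
            sub(/^Date:[ \t]*/, "")           # strip the field name
            print FILENAME "\t" $0
            in_header = 0                     # only the first Date: counts
        }
    '
}
```

The output still has to be converted to seconds and compared against
the limit, so getdate (or an equivalent) is not eliminated by this
step alone.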

The gotcha is the getdate: I've got perl library routines to go from both
ctime and `date` output format, but getdate is actually much more clever
than just that.  I'm not convinced that all the dates will be in either
ctime or `date` format: is this mandatory?  A peremptory perusal predicts
promising perl processing, but perhaps perfect processing is preferable.
So maybe I'd just use `getdate` after all and eat one exec.  However,
I didn't find any failures in the 3000 articles I just tested, so maybe
perfection isn't worth it.

--tom
--
"UNIX was not designed to stop you from doing stupid things, because
 that would also stop you from doing clever things." -- Doug Gwyn

 Tom Christiansen                tchrist@convex.com      convex!tchrist