[comp.sources.d] "look" utility?

daleske@cbdkc1.UUCP (02/19/87)

Ok.  I've looked through all of the various places I could try.  A patch
came through for ispell which implements "L", a lookup option.  Based
on some ifdefs it would use "look" or "egrep", both which use a dictionary
in /usr/dict.  The dictionary could be called "words" or "web2".

What package does "look" come in?  Is it public domain or part of some
product?  How does it function?

Thanks for any help!  I'll post a summary of responses.

___________________________________________________________________________
John Daleske  Columbus, Ohio.  614-860-4335 | Default disclaimer...
UUCP: {ihnp4,cbosgd,desoto}!cbdkc1!daleske  |
____________________________________________|______________________________
"Now," said the butterfly, "look closer and tell me what you see."
"I see a tiny horse with wings upon its back" said Flutterby. "Why that's
 me I see!  But what am I?"
"You are you. Just as I am me!" said the wise old butterfly.  "Nothing more,
 nothing less."

campbell@maynard.UUCP (02/23/87)

In article <1916@cbdkc1.UUCP> daleske@cbdkc1.UUCP ( John Daleske x4335 3S324W FRAN) writes:
>
>What package does "look" come in?  Is it public domain or part of some
>product?  How does it function?

"look" (like "dbm") comes with V7-derived UNIX systems, which includes
Berkeley UNIX and derivatives (i.e., Ultrix).  It's seems to be missing
from System V, so I presume it's yet another utility AT&T forgot to
include in System V.

An excerpt from the man page follows:

SYNOPSIS
    look [-df] string [ file ]

DESCRIPTION
    look consults a sorted file and prints all lines that begin with string.
    It uses binary search.

Note that look searches for a literal string, not a regular expression.
You could use grep, but for a sizable file look can be hundreds of times
faster.
-- 
Larry Campbell                                The Boston Software Works, Inc.
Internet: campbell@maynard.uucp             120 Fulton Street, Boston MA 02109
uucp: {alliant,wjh12}!maynard!campbell              +1 617 367 6846
ARPA: campbell%maynard.uucp@harvisr.harvard.edu      MCI: LCAMPBELL

straka@ihlpf.UUCP (02/25/87)

>     look consults a sorted file and prints all lines that begin with string.
>     It uses binary search.
> Note that look searches for a literal string, not a regular expression.
> You could use grep, but for a sizable file look can be hundreds of times
> faster.


What about using fgrep?
Or possibly grep "^string"?

-- 
Rich Straka     ihnp4!ihlpf!straka

campbell@maynard.UUCP (02/27/87)

In article <1145@ihlpf.ATT.COM> straka@ihlpf.ATT.COM (Straka) writes:
>>     look consults a sorted file and prints all lines that begin with string.
>>     It uses binary search.
>> Note that look searches for a literal string, not a regular expression.
>> You could use grep, but for a sizable file look can be hundreds of times
>> faster.
>
>What about using fgrep?
>Or possibly grep "^string"?

What about it?  look is still hundreds of times faster.
-- 
Larry Campbell                                The Boston Software Works, Inc.
Internet: campbell@maynard.uucp             120 Fulton Street, Boston MA 02109
uucp: {alliant,wjh12}!maynard!campbell              +1 617 367 6846
ARPA: campbell%maynard.uucp@harvisr.harvard.edu      MCI: LCAMPBELL

ark@alice.UUCP (02/28/87)

In article <1145@ihlpf.ATT.COM>, straka@ihlpf.UUCP writes:
> >     look consults a sorted file and prints all lines that begin with string.
> >     It uses binary search.
> > Note that look searches for a literal string, not a regular expression.
> > You could use grep, but for a sizable file look can be hundreds of times
> > faster.
> 
> 
> What about using fgrep?
> Or possibly grep "^string"?

The run time of any program of the grep family increases linearly
with the size of the file being searched (with some exceptions for
complicated patterns in grep).

Look, on the other hand, doesn't touch the entire file.  That is,
after all, the essence of binary search.  So the run time increases
only logarithmically with the size of the file.

That difference is nothing to sneeze at.  Consider a file with a
million records as an example: a program that can find any record
with only 20 probes into the file is going to beat almost any linear
search you can imagine.

straka@ihlpf.UUCP (03/04/87)

In article <846@maynard.BSW.COM>, campbell@maynard.BSW.COM (Larry Campbell) writes:
> >>     It uses binary search.
> >> Note that look searches for a literal string, not a regular expression.
> >> You could use grep, but for a sizable file look can be hundreds of times
> >> faster.

> >What about using fgrep?
> >Or possibly grep "^string"?

> What about it?  look is still hundreds of times faster.

Yes, I apologize.  I agree; it was simply a case of being asleep at the
terminal.  Binary searches of large sorted data are certainly faster
than any simple sequential search.

Moral: Have enough caffeine in your veins before responding to postings!

-- 
Rich Straka     ihnp4!ihlpf!straka

solomon@crystal.UUCP (03/07/87)

Look is much faster than any of the greps, since it assumes the file
is sorted and uses binary search, for a run time proportional to the
log of the number of lines rather than the number of lines (16 vs 24K in
the case of /usr/dict/words).  What's more surprising is the relative speeds
of the various greps.  I ran a few tests (4.3BSD on an unloaded VAX 780)
and reproduce the data below.  Fgrep is always the slowest (it really should
be called 'dgrep' for 'dumb-grep').  Egrep is always the fastest.
A few years ago, I had a visit from Al Aho (author of egrep).  He was
quite proud of egrep (as well he should be).  He had me type on my terminal
	time wc /usr/dict/words
	time egrep -c '^' /usr/dict/words
Egrep was the clear winner.  He said he considered writing the inner loop
in assembler, but decided he'd gone far enough.  In most cases, the
program is I/O bound anyhow.  If only egrep understood all the options
of the others (-w and -i from grep, and -x from fgrep), we could dump the
others.  (The comment in the man page about exponential behavior is a red
herring.  The exponential behavior is a theoretical worst case that never
really comes up in practice, and besides, we're talking about *size* which
is exponential in the length of the *pattern*, which is usually quite short.

Here are my test results.  Warning:  These figures are for comparison only;
actual milage may vary.

first word

+ time grep -l a /usr/dict/words 
        4.8 real         4.1 user         0.3 sys  
+ time fgrep -l a /usr/dict/words 
        0.2 real         0.0 user         0.1 sys  
+ time egrep -l a /usr/dict/words 
        0.2 real         0.0 user         0.1 sys  
+ time grep -l ^a /usr/dict/words 
        4.7 real         3.7 user         0.4 sys  
+ time egrep -l ^a /usr/dict/words 
        0.2 real         0.0 user         0.1 sys  
+ time look a 
        1.1 real         0.4 user         0.2 sys  

middle word

+ time grep -l jade /usr/dict/words 
        4.1 real         3.4 user         0.3 sys  
+ time fgrep -l jade /usr/dict/words 
        3.0 real         2.4 user         0.3 sys  
+ time egrep -l jade /usr/dict/words 
        2.7 real         1.3 user         0.3 sys  
+ time grep -l ^jade /usr/dict/words 
        4.4 real         3.7 user         0.3 sys  
+ time egrep -l ^jade /usr/dict/words 
        1.8 real         1.2 user         0.3 sys  
+ time look jade 
        0.4 real         0.0 user         0.1 sys  

last word

+ time grep -l zygote /usr/dict/words 
        4.0 real         3.4 user         0.3 sys  
+ time fgrep -l zygote /usr/dict/words 
        5.7 real         4.8 user         0.4 sys  
+ time egrep -l zygote /usr/dict/words 
        3.6 real         2.5 user         0.5 sys  
+ time grep -l ^zygote /usr/dict/words 
        5.1 real         3.7 user         0.3 sys  
+ time egrep -l ^zygote /usr/dict/words 
        4.1 real         2.5 user         0.4 sys  
+ time look zygote 
        0.2 real         0.0 user         0.1 sys  

missing word

+ time grep -l 2 /usr/dict/words 
        4.1 real         3.3 user         0.4 sys  
+ time fgrep -l 2 /usr/dict/words 
        5.6 real         4.8 user         0.4 sys  
+ time egrep -l 2 /usr/dict/words 
        3.7 real         2.5 user         0.5 sys  
+ time grep -l ^2 /usr/dict/words 
        4.7 real         3.7 user         0.4 sys  
+ time egrep -l ^2 /usr/dict/words 
        3.3 real         2.5 user         0.4 sys  
+ time look 2 
        0.5 real         0.0 user         0.2 sys  
-- 
	Marvin Solomon
	Computer Sciences Department
	University of Wisconsin, Madison WI
	solomon@gjetost.wisc.edu or seismo!uwvax!solomon

reid@decwrl.DEC.COM (Brian Reid) (03/08/87)

In article <271@crys.WISC.EDU> solomon@crys.WISC.EDU (Marvin Solomon) writes:
>
>Look is much faster than any of the greps, since ...
>Fgrep is always the slowest ...
>Egrep is always the fastest.

Hi, Marvin. I'm always glad to see actual measured data on the network. But
clearly you don't have the James Woods (ames!jaw) grep. At our site we call
it ngrep. Here are the times for ngrep. Your machine CRYS.WISC.EDU is a 
Vax 780; I have a 785. Therefore I re-ran all of your tests, to avoid having
to convert the numbers between dissimilar machines. Here is my output of your
tests. Note that in every case, ngrep is faster than egrep by a huge amount
in cpu time and by a significant amount in real time.

$ time grep -l a /usr/dict/words
        4.2 real         2.9 user         0.4 sys  
$ time fgrep -l a /usr/dict/words
        0.5 real         0.0 user         0.1 sys  
$ time egrep -l a /usr/dict/words
        0.2 real         0.0 user         0.1 sys  
$ time egrep -l \^a /usr/dict/words
        0.2 real         0.0 user         0.1 sys  
$ time ngrep -l a /usr/dict/words
        0.3 real         0.0 user         0.1 sys  
$ time ngrep -l a /usr/dict/words
        0.2 real         0.0 user         0.1 sys  
$ 
$ 
$ time grep -l jade /usr/dict/words
        3.7 real         2.3 user         0.4 sys  
$ time fgrep -l jade /usr/dict/words
        2.5 real         1.5 user         0.3 sys  
$ time egrep -l jade /usr/dict/words
        1.8 real         0.8 user         0.3 sys  
$ time egrep -l \^jade /usr/dict/words
        2.5 real         0.9 user         0.3 sys  
$ time ngrep -l jade /usr/dict/words
        1.0 real         0.1 user         0.2 sys  
$ time ngrep -l \^jade /usr/dict/words
        2.3 real         0.3 user         0.8 sys  
$ 
$ 
$ time grep -l zygote /usr/dict/words
        3.9 real         2.2 user         0.5 sys  
$ time fgrep -l zygote /usr/dict/words
        4.5 real         3.1 user         0.5 sys  
$ time egrep -l zygote /usr/dict/words
        3.3 real         1.8 user         0.4 sys  
$ time egrep -l \^zygote /usr/dict/words
        3.1 real         1.7 user         0.4 sys  
$ time ngrep -l zygote /usr/dict/words
        1.9 real         0.1 user         0.4 sys  
$ time ngrep -l \^zygote /usr/dict/words
        2.7 real         0.2 user         0.8 sys  
$ 

However, that's just a Vax. If you want to see ngrep really sing, watch while
I run it on a DECWRL's "Titan" machine, which is about 15 times the speed of
a 780:

$ time ngrep -l \^Zurich /usr/dict/words
        3.7 real         0.0 user         1.5 sys

The real time is slower because my Titan has only one disk, and the
disk arms must move back and forth between /usr/dict and /bin/ngrep.
I can reduce the effect of having only one disk arm by using a longer file to
search:

$ ls -l /usr/dict/words
-rw-r--r--  1 root       201314 Nov  5 10:43 /usr/dict/words
$ ls -l /usr/dict/web2
-rw-r--r--  1 reid      2498185 Aug 13  1985 /usr/dict/web2
$ tail -1 /usr/dict/web2
Zyzzogeton
$ time egrep -l Zyzzogeton /usr/dict/web2
/usr/dict/web2
       10.3 real         5.1 user         2.5 sys
$ time ngrep -l Zyzzogeton /usr/dict/web2
/usr/dict/web2
        7.8 real         0.4 user         2.3 sys

Conclusion: be careful with claims about how fast egrep is, since ngrep is
almost always significantly faster.

ajs@hpfcdt.HP.COM (Alan Silverstein) (03/13/87)

Re: look(1)

If you want a nice routine to do the same thing look does -- bsearchstr
-- I had it posted to mod.sources a while back.  Also understands case
sensitive (sort) versus insensitive (sort -f) searches.

Alan Silverstein

cire@hpihoah.UUCP (03/17/87)

the key assumption is the file is already sorted so a binary search
works.  this is generally not true.  to be true one then needs to
include the time needed to sort files of the size we are talking
about.  fgrep wins.

chris@mimsy.UUCP (03/21/87)

In article <4670001@hpihoah.HP.COM> cire@hpihoah.HP.COM (Eric B.
Decker) writes:
>the key assumption is the file is already sorted so a binary search
>works.  this is generally not true.

`Look' operates on sorted dictionaries.  It is not `generally not
true' *in the cases for which the program was designed*.

>to be true one then needs to include the time needed to sort files
>of the size we are talking about.  fgrep wins.

Only if the file changes often:  If not, it is better first to sort
it.  In any case, fgrep does not win; see James A. Woods's article
<706@ames.UUCP>, `Get Hep to Kanji-Ready Five-Algorithm [ef]?grep'
for details.  If you have only the standard family of greps, use
egrep.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu