[comp.unix.questions] Finding words in paragraphs

lacey@batcomputer.tn.cornell.edu (John Lacey) (07/17/89)

As regards the question of finding paragraphs in text which 
contain a particular word, I sent the following reply directly
to the asker of the question.  But then I saw the reply that no Unix
utility could handle this, and I have to disagree.  Awk will handle
this case with no problem.  Certainly the Awk solution is much nicer
than the previous proposal.


Awk is what you want in this case.  Try something like this:

	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here

An Awk program is a series of pattern-action pairs.  Whenever a record matching
the pattern is read, the associated action is taken.  BEGIN is a special pattern
that matches exactly once, before the input file is read.  END is the
related pattern for after a file has reached EOF.
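
A minimal sketch of how BEGIN, an ordinary pattern, and END fit together,
using ordinary line-at-a-time records.  The sample text and the names
"sample", "result", and "n" are made up for illustration only:

```shell
# Hypothetical three-line sample input.
sample='alpha
beta test
gamma test'

# BEGIN runs once before any input is read, the /test/ action runs for
# every line that matches, and END runs once after the last line.
result=$(printf '%s\n' "$sample" | awk '
BEGIN  { print "start" }
/test/ { n++ }
END    { print n " matching lines" }')

echo "$result"
```

Here the count is printed only in END, after the two matching lines have
been seen.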

FS is the field separator, RS is the record separator.  So, we set RS to
a newline to make each paragraph (separated by a blank line) a different
record.  Then, we search for the word in question.  Patterns in Awk are
egrep-type regular expressions, bounded by /'s.  I left off the action,
to save space.  Any missing action is taken to be a print-the-record.
You can do this explicitly with a print command.
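
The default action is easy to check: a pattern with no action and the same
pattern with an explicit { print } should give the same output.  A small
sketch on made-up sample lines (the variable names are only for
illustration, and records here are plain lines):

```shell
# Hypothetical sample input.
sample='no match here
this is a test
another line'

# Pattern with the action left off: matching records are printed.
implicit=$(printf '%s\n' "$sample" | awk '/test/')

# The same pattern with the print written out.
explicit=$(printf '%s\n' "$sample" | awk '/test/ { print }')

echo "$implicit"
echo "$explicit"
```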

Awk is a lovely language.  I write a lot of one liners like this, and
I also use it to write reasonably large applications (including a small
relational database).

If you don't have awk documentation around, there is a book by Aho,
Kernighan, and Weinberger (A, W, K) called, appropriately, The
AWK Programming Language, that explains the whole thing.

Good luck, and cheers,

John Lacey           |  Internet:  lacey@tcgould.tn.cornell.edu
running unattached   |  BITnet:    lacey@crnlthry
                     |  UUCP:      cornell!batcomputer!lacey
"Whereof one cannot speak, thereof one must remain silent."  ---Wittgenstein

ip@me.utoronto.ca (Bevis Ip) (07/18/89)

In article <10545@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <8421@batcomputer.tn.cornell.edu> lacey@tcgould.tn.cornell.edu (John Lacey) writes:
>>Awk is what you want in this case.  Try something like this:
>>	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here

That is a problem.  But, Doug, I'm kind of surprised at the comment you made
in your last article.  Try this; it is simplified from something that
I wrote to search bibliographies without having to use indxbib.  I haven't
checked, but I think the hold buffer in sed is dynamically expanded, so paragraph
size might not be a problem in most implementations.


[ To the original poster:  I wasn't too careful when I cut my script to
	you; here's the correct one. ]

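# Build one "-e /word/!b" per argument: skip the paragraph unless it matches.
for i
do
	SEARCH="$SEARCH -e /$i/!b"
done
# Collect lines in the hold space; at a blank line or EOF, test the paragraph.
sed -n -e '/^$/b gotcha' -e H -e '$b gotcha' -e b \
	-e :gotcha -e x $SEARCH -e p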
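
One way to try the script without saving it to a file is to wrap it in a
shell function and feed it text on standard input.  This is only a sketch:
the function names "parasearch" and "sample" and the sample text are made
up here, and the words are assumed to contain no slashes or spaces.

```shell
# Hypothetical wrapper: arguments are the words, input comes from stdin.
parasearch() {
	SEARCH=
	for i
	do
		SEARCH="$SEARCH -e /$i/!b"
	done
	# $SEARCH is left unquoted on purpose so it splits into sed arguments.
	sed -n -e '/^$/b gotcha' -e H -e '$b gotcha' -e b \
		-e :gotcha -e x $SEARCH -e p
}

# Made-up two-paragraph sample; only the second paragraph contains "test".
sample() {
cat <<'EOF'
This isn't it.
Try again.

The requirement is to print the whole paragraph.
This is a test.
End of paragraph.
EOF
}

out=$(sample | parasearch test)
echo "$out"
```

Because H appends to the hold space with a leading newline, each paragraph
that is printed comes out preceded by an empty line.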
Bevis Ip                <>  ip@me.toronto.edu, ip@me.utoronto.ca
University of Toronto   <>  {pyramid,uunet}!utai!me!ip
Mechanical Engineering  <>  {allegra,decwrl}!utcsri!me!ip

gwyn@smoke.BRL.MIL (Doug Gwyn) (07/18/89)

In article <8421@batcomputer.tn.cornell.edu> lacey@tcgould.tn.cornell.edu (John Lacey) writes:
>Awk is what you want in this case.  Try something like this:
>	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here

$ awk 'BEGIN { FS = ""; RS = "\n"} /test/' > foo
This isn't it.
Try again.

The requirement is to print the whole paragraph.
This is a test.
End of paragraph.

$ cat foo
This is a test.

lacey@batcomputer.tn.cornell.edu (John Lacey) (07/18/89)

In article <10545@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <8421@batcomputer.tn.cornell.edu> lacey@tcgould.tn.cornell.edu (John Lacey) writes:
>>Awk is what you want in this case.  Try something like this:
>>	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here
> [A nice demonstration that this doesn't work.]

Yes, I mistyped the line.  It should be { FS = "\n"; RS = "" }.  I _did_
say "something like this" :-).

But, hey, Doug, why didn't you just post a fix to this? [ :-) ]  It's a 
simple typo, and you took the time to test it and post a demonstration
that it didn't work.  Anyhow, switching the values of the field and
record separators will fix this "code" right up.  By the way, the default
values for Awk are FS = " ", and RS = "\n".
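
For the record, a sketch of the corrected line run on a two-paragraph sample
like the one in the demonstration.  The function name "sample" and the
variable "out" are made up for illustration.  With RS set to the empty
string, awk treats blank-line-separated paragraphs as records, so the
whole matching paragraph should come out:

```shell
# Made-up sample input: two paragraphs separated by a blank line.
sample() {
cat <<'EOF'
This isn't it.
Try again.

The requirement is to print the whole paragraph.
This is a test.
End of paragraph.
EOF
}

# RS = "" selects paragraph records; FS = "\n" makes each line a field.
out=$(sample | awk 'BEGIN { FS = "\n"; RS = "" } /test/')
echo "$out"
```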


John Lacey           |  Internet:  lacey@tcgould.tn.cornell.edu
running unattached   |  BITnet:    lacey@crnlthry
                     |  UUCP:      cornell!batcomputer!lacey
"Whereof one cannot speak, thereof one must remain silent."  ---Wittgenstein

gwyn@smoke.BRL.MIL (Doug Gwyn) (07/19/89)

In article <89Jul17.172655edt.19593@me.utoronto.ca> ip@me.utoronto.ca (Bevis Ip) writes:
>But, Doug, I'm kind of surprised at the comment you made ...

Frankly, I didn't realize there was any way to get "sed" to do this task.
It's more programmable than I thought.