[comp.unix.questions] Finding words in paragraphs

lacey@batcomputer.tn.cornell.edu (John Lacey) (07/17/89)

As regards the question of finding paragraphs in text which 
contain a particular word, I sent the following reply directly
to the asker of the question.  But then I saw the reply that no Unix
utility could handle this, and I have to disagree.  Awk will handle
this case with no problem.  Certainly the Awk solution is much nicer
than the previous proposal.

----------------

Awk is what you want in this case.  Try something like this:

	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here

An Awk program is a series of pattern-action pairs.  Whenever text matching
the pattern is recognized, the associated action is taken.  BEGIN is a special
pattern that matches exactly once, before the input file is read.  END is the
related pattern that matches after the input has reached EOF.
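
A minimal sketch of the pattern-action model just described, assuming a
made-up three-line input (the names `sample` and `out` are illustrative
only, not part of the original question):

```shell
# Hypothetical sample input; any text would do.
sample='alpha
beta test
gamma'

# BEGIN runs once before any input is read, the /test/ pattern runs for
# each matching record, and END runs once after the last record.
out=$(printf '%s\n' "$sample" | awk '
BEGIN  { print "start" }
/test/ { print "match: " $0 }
END    { print "records read: " NR }
')
printf '%s\n' "$out"
```

With the default separators each line is one record, so this should print
"start", then "match: beta test", then "records read: 3".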

FS is the field separator, RS is the record separator.  So, we set RS to
a newline to make each paragraph (separated by a blank line) a different
record.  Then, we search for the word in question.  Patterns in Awk are
egrep-type regular expressions, bounded by /'s.  I left off the action,
to save space.  Any missing action is taken to be a print-the-record.
You can do this explicitly with a print command.
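
As a small illustration of the missing-action rule, the sketch below runs
the same pattern with and without an explicit print and compares the
results.  The input and the variable names are assumptions made up for
the example.

```shell
# Hypothetical sample input.
sample='one test
two
three test'

# A pattern with no action prints the matching record ...
implicit=$(printf '%s\n' "$sample" | awk '/test/')
# ... which is the same as spelling out the print action.
explicit=$(printf '%s\n' "$sample" | awk '/test/ { print }')

if [ "$implicit" = "$explicit" ]; then
	echo "same output"
fi
```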

Awk is a lovely language.  I write a lot of one-liners like this, and
I also use it to write reasonably large applications (including a small
relational database).

If you don't have awk documentation around, there is a book by Aho, 
Kernighan, and Weinberger (A, W, K) called, appropriately, The 
AWK Programming Language, that explains the whole thing.

Good luck, and cheers,


-- 
John Lacey           |  Internet:  lacey@tcgould.tn.cornell.edu
running unattached   |  BITnet:    lacey@crnlthry
                     |  UUCP:      cornell!batcomputer!lacey
"Whereof one cannot speak, thereof one must remain silent."  ---Wittgenstein

ip@me.utoronto.ca (Bevis Ip) (07/18/89)

In article <10545@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <8421@batcomputer.tn.cornell.edu> lacey@tcgould.tn.cornell.edu (John Lacey) writes:
>>Awk is what you want in this case.  Try something like this:
>>	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here

That is a problem.  But, Doug, I'm kind of surprised at the comment you made
in your last article.  Try this; it is simplified from something that
I wrote to search bibliographies without having to use indxbib.  I haven't
checked, but I think the hold buffer in sed is dynamically expanded, so
paragraph size might not be a problem in most implementations.

bevis

[ To the original poster:  I wasn't too careful when I cut my script to
	you; here's the correct one. ]

--------
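#!/bin/sh
# Print each blank-line-separated paragraph of standard input that
# matches every regular expression given as an argument.
for i
do
	# Branch to the end of the script unless the paragraph matches $i.
	SEARCH="$SEARCH -e /$i/!b"
done
# Append lines to the hold space; at a blank line or the last line,
# swap the collected paragraph into the pattern space, test it, print it.
sed -n -e '/^$/b gotcha' -e H -e '$b gotcha' -e b \
	-e :gotcha -e x $SEARCH -e p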
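
A sketch of the same hold-space technique with a single fixed pattern,
run against the test input Doug used in his demonstration.  Here the
`$SEARCH` expansion is replaced by a hard-coded `/test/!b`, and the
variable names `input` and `out` are assumptions for illustration.

```shell
# Test input taken from Doug Gwyn's demonstration in this thread.
input="This isn't it.
Try again.

The requirement is to print the whole paragraph.
This is a test.
End of paragraph.

Done."

# H appends each line to the hold space.  At a blank line or the last
# line, x swaps the collected paragraph into the pattern space, /test/!b
# skips it unless it contains "test", and p prints it.
out=$(printf '%s\n' "$input" |
	sed -n -e '/^$/b gotcha' -e H -e '$b gotcha' -e b \
		-e :gotcha -e x -e '/test/!b' -e p)
printf '%s\n' "$out"
```

Because H puts a newline in front of every line it appends, the printed
paragraph should begin with an empty line.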
-- 
Bevis Ip                <>  ip@me.toronto.edu, ip@me.utoronto.ca
University of Toronto   <>  {pyramid,uunet}!utai!me!ip
Mechanical Engineering  <>  {allegra,decwrl}!utcsri!me!ip

gwyn@smoke.BRL.MIL (Doug Gwyn) (07/18/89)

In article <8421@batcomputer.tn.cornell.edu> lacey@tcgould.tn.cornell.edu (John Lacey) writes:
>Awk is what you want in this case.  Try something like this:
>	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here

$ awk 'BEGIN { FS = ""; RS = "\n"} /test/' > foo
This isn't it.
Try again.

The requirement is to print the whole paragraph.
This is a test.
End of paragraph.

Done.
^D
$ cat foo
This is a test.
$

lacey@batcomputer.tn.cornell.edu (John Lacey) (07/18/89)

In article <10545@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <8421@batcomputer.tn.cornell.edu> lacey@tcgould.tn.cornell.edu (John Lacey) writes:
>>Awk is what you want in this case.  Try something like this:
>>	awk 'BEGIN { FS = ""; RS = "\n"} /the-word-here/' the-filename-here
>
> [A nice demonstration that this doesn't work.]

Yes, I mistyped the line.  It should be { FS = "\n"; RS = "" }.  I _did_
say "something like this" :-).

But, hey, Doug, why didn't you just post a fix to this? [ :-) ]  It's a 
simple typo, and you took the time to test it and post a demonstration
that it didn't work.  Anyhow, switching the values of the field and
record separators will fix this "code" right up.  By the way, the default
values for Awk are FS = " " and RS = "\n".
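
For reference, a sketch of the corrected command run against the test
input from Doug's demonstration.  The variable names `input` and `out`
are made up for the example.

```shell
# Test input taken from Doug Gwyn's demonstration in this thread.
input="This isn't it.
Try again.

The requirement is to print the whole paragraph.
This is a test.
End of paragraph.

Done."

# An empty RS makes each blank-line-separated paragraph one record;
# FS = "\n" makes each line of the paragraph one field.
out=$(printf '%s\n' "$input" | awk 'BEGIN { FS = "\n"; RS = "" } /test/')
printf '%s\n' "$out"
```

This should print the three lines of the middle paragraph and nothing
else.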

Cheers,

-- 
John Lacey           |  Internet:  lacey@tcgould.tn.cornell.edu
running unattached   |  BITnet:    lacey@crnlthry
                     |  UUCP:      cornell!batcomputer!lacey
"Whereof one cannot speak, thereof one must remain silent."  ---Wittgenstein

gwyn@smoke.BRL.MIL (Doug Gwyn) (07/19/89)

In article <89Jul17.172655edt.19593@me.utoronto.ca> ip@me.utoronto.ca (Bevis Ip) writes:
>But, Doug, I'm kind of surprised at the comment you made ...

Frankly, I didn't realize there was any way to get "sed" to do this task.
It's more programmable than I thought.