[comp.unix.questions] Text Processing Question

rkumar@buddha.usc.edu (C.P. Ravikumar) (03/18/91)

I was wondering if there is a utility to check
for repetition of words in a document.
For example, it should print the lines
in which a word is repeated, such as:

	the processor $i$  must send send its data to

(the word "send" appears twice in a row).

I have the feeling this can be done using "awk".
I would appreciate it if you could send your suggestions
to me directly, since I don't read this newsgroup
on a regular basis.

If there are other such "syntax" checkers, I will
be interested! Who wouldn't be? :-)

Thanks!
-- 
Ravikumar

goer@ellis.uchicago.edu (Richard L. Goerwitz) (03/18/91)

In article <31134@usc> rkumar@buddha.usc.edu (C.P. Ravikumar) writes:

>I was wondering if there is a utility to check
>for repetition of words in a document....
>
>I have the feeling this can be done using "awk".

The hard part, as always, is settling on a field separator -

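# Split on runs of characters that cannot be part of a word; the leading
# ['.]* also strips apostrophes and periods that trail a word.  The hyphen
# sits last in the bracket so it cannot be read as a range.
BEGIN	{ FS = "['.]*[^0-9A-Za-z'-]+" }
{    for (i = 1; i < NF; i++) {
         if ($i == $(i+1))          # adjacent fields match: repeated word
             print NR ":  " $0
     }
}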

-Richard
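
To see why the separator matters, here is a small sketch that runs the
same adjacent-field check twice on one line: once with awk's default
whitespace splitting, and once with a separator that also swallows
punctuation.  The helper name "dups" and the sample sentence are made up
for illustration, and the separator is a simplified one, not the one in
the script above.

```shell
# dups FS: print "line: word" for each adjacent repeated field, splitting on FS.
# Hypothetical helper for illustration only.
dups() {
    awk -v FS="$1" '{
        for (i = 1; i < NF; i++)
            if ($i != "" && $i == $(i+1))
                print NR ": " $i
    }'
}

line='must send, send its data'

# Default splitting (FS is a single space): the fields are "send," and
# "send", which differ, so nothing is reported.
printf '%s\n' "$line" | dups ' '

# Splitting on runs of non-alphanumerics discards the comma, so the two
# "send" fields now compare equal and the repeat is reported.
printf '%s\n' "$line" | dups '[^0-9A-Za-z]+'
```

Neither separator sees a repeat that straddles a line break, since awk
compares fields only within one record here.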

tchrist@convex.COM (Tom Christiansen) (03/18/91)

From the keyboard of goer@ellis.uchicago.edu (Richard L. Goerwitz):
:In article <31134@usc> rkumar@buddha.usc.edu (C.P. Ravikumar) writes:
:
:>I was wondering if there is a utility to check
:>for repetition of words in a document....
:>
:>I have the feeling this can be done using "awk".
:
:The hard part, as always, is settling on a field separator -

Perhaps.  I always thought the hard part was catching pairs of words that
extend over line boundaries.  Here's a perl version that catches these,
although I admit it's probably overkill to suck up the whole file into
memory before munging it.  Works fine on my machine. :-)

Here's the output when run on my C compiler man page:

/usr/man/man1/cc.1:
   39 compiler. Certain extensions, notably the [* long long *] type,
   57 Forces language and library interpretation based on [* the the *] original
  770 Each library has a profiled version whose name is formed [* by
  771 by *] inserting \(lq_p\(rq before the \(lq.a\(rq.

The precise definition of what constitutes a repeated word (and what
legit separators are) will vary according to tastes.  I chose identifier-
like tokens separated by white space.  Speed (and definitely memory)
optimizations are certainly possible, but this does the job well enough
for me.  The program (not line noise :-) follows:

--tom

#!/usr/bin/perl
undef $/;                               # slurp mode: read each file whole
while ( $ARGV = shift ) { 
    if (!open ARGV) { warn "$ARGV: $!\n"; next; } 
    $_ = <>;
    # bracket each run of a repeated word with \200 markers
    s/\b(\s?)(([a-z]\w*)(\s+\3)+\b)/$1\200$2\200/g || next;
    @lines = split(/\n/);               # explicit array; implicit @_ is gone in newer perls
    $n = 0; @hits = ();
    for (@lines) { $n++; push(@hits, sprintf("%5d %s", $n, $_)) if /\200/; } 
    $_ = join("\n",@hits);
    s/\200([^\200]+)\200/[* $1 *]/g;
    print "$ARGV:\n$_\n";
}

rsalz@bbn.com (Rich Salz) (03/18/91)

In <31134@usc> rkumar@buddha.usc.edu (C.P. Ravikumar) writes:
>
>I was wondering if there is a utility to check
>for repetition of words in a document.
The totally awesome book by Kernighan and Pike, The Unix Programming Environment,
contains an AWK script to do just that.  (My copy's at home, else I'd type it in.)
The description also has a nice joke in it...  An excellent example of "write to fit."
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

lewis@tramp.Colorado.EDU (LEWIS WILLIAM M JR) (03/19/91)

See page 121 of Kernighan & Pike "The UNIX Programming Environment" for
an awk version of "double".

In my edition of the book there is an error in the program -- near the
end it reads:

		printf "double %s, ...
	if (NF > 0)			# <= DELETE ONLY THIS LINE
		lastword = $NF
}' $*
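
For readers without the book at hand, the idea of remembering the last
word of each line can be sketched as below.  This is not the book's
listing: the function name "double_words", the output format, and the
choice to carry the remembered word across blank lines are all
assumptions made for this example.

```shell
# double_words: read text on stdin and report adjacent repeated words,
# including a word that ends one line and begins the next.
# Hypothetical sketch; words are plain whitespace-separated fields.
double_words() {
    awk '
    BEGIN { lastword = "" }     # start with a string, so comparisons are textual
    NF > 0 {
        # First word of this line against the last word of the previous
        # non-empty line.
        if ($1 == lastword)
            printf "double %s, line %d\n", $1, NR
        # Each remaining word against its left neighbour.
        for (i = 2; i <= NF; i++)
            if ($i == $(i-1))
                printf "double %s, line %d\n", $i, NR
        lastword = $NF
    }'
}

printf 'formed by\nby inserting the the suffix\n' | double_words
```

A repeat that spans two lines is reported against the second line, since
that is where the match is noticed.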

fitz@mml0.meche.rpi.edu (Brian Fitzgerald) (03/19/91)

C.P. Ravikumar writes:

>I was wondering if there is a utility to check
>for repetition of words in a document.

I use 'deroff -w | uniq -d'

It's from a Unix World article published about a year ago.
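
The pipeline works because "deroff -w" writes one word per line and
"uniq -d" prints only lines that are immediately repeated, so a doubled
word shows up as two identical adjacent lines.  Here is a sketch of the
same idea for systems without deroff, using tr as a stand-in word
splitter.  The substitution is an assumption for this example (tr does
not strip troff markup), and the function name "doubled" is made up.

```shell
# doubled: read text on stdin, print each word that is immediately repeated.
doubled() {
    # -c complements the set and -s squeezes runs, so every run of
    # non-letters becomes a single newline: one word per line.
    tr -cs 'A-Za-z' '\n' | uniq -d
}

printf 'the processor must send send its data\nbased on the\nthe original\n' | doubled
```

Because line breaks are just another separator here, repeats across line
boundaries are caught too.  The cost is that only the word is reported,
not the line it occurred on, and the comparison is case-sensitive.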

>I will appreciate if you can send your suggestion
>to me directly, since  I don't read this newsgroup
>on a regular basis.

Well, if you read this newsgroup again soon, you will see the posted
solutions.

Brian
-- 
Oh, God.  I forgot to upgrade to OS/2.