[comp.unix.questions] Proofreading documents with awk

heit@meme.Stanford.EDU (Evan Heit) (12/16/89)

I am looking for someone who has written a program in awk that will
will allow me to proofread my papers by by looking for word repetitions.

For example, the word "will" is repeated unnecessarily at the beginning of
the second line of this message, and the word "by" is also repeated.  I'm
looking for an awk program to catch word repetitions like these.

Of course, a program in something other than awk (a gnuemacs function?) would
be just as good.  Thanks in advance.

--Evan Heit
heit@psych.stanford.edu

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (12/16/89)

In article <25@meme.stanford.edu> heit@psych.Stanford.EDU (Evan Heit) writes:
: I am looking for someone who has written a program in awk that will
: will allow me to proofread my papers by by looking for word repetitions.

How about filtering through

    tr -cs "A-Za-z" "\012" | uniq -d

(Sys V'ers will have to make that [A-Z][a-z]).
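
To run it over a whole file (mypaper is only a stand-in name):

    tr -cs "A-Za-z" "\012" < mypaper | uniq -d

tr puts each word on a line of its own, and uniq -d prints any word that
shows up twice in a row.  The comparison is case-sensitive, so "The the"
slips through.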

I sincerely doubt that any awk (or perl) solution will do as well.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

steinbac@hpl-opus.HP.COM (Gunter Steinbach) (12/16/89)

> / hpl-opus:comp.unix.questions / heit@meme.Stanford.EDU (Evan Heit) /  
> 3:57 pm  Dec 15, 1989 /
> I am looking for someone who has written a program in awk that will
> will allow me to proofread my papers by by looking for word repetitions.

How about this awk script:

awk '{p=0 # use a flag so each offending line is printed only once
    if($1==last) {p=1; print NR-1 ": " last} # catch doubles across lines
    for(f=2;f<=NF;f++) if($f==$(f-1)) p=1    # catch doubles within a line
    if(p==1) print NR ": " $0
    last=$NF}'

This is the output when run on your original note:

15: will
16: will allow me to proofread my papers by by looking for word repetitions.

Good enough?
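
If it ends up getting used a lot, the same body can live in a file of its
own and be run with awk -f (dupline.awk and mypaper are only example names):

    # dupline.awk -- report lines that repeat a word
    { p = 0
      if ($1 == last) { p = 1; print NR-1 ": " last }    # doubles across lines
      for (f = 2; f <= NF; f++) if ($f == $(f-1)) p = 1  # doubles within a line
      if (p == 1) print NR ": " $0
      last = $NF }

    awk -f dupline.awk mypaper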

	 Guenter Steinbach		gunter_steinbach@hplabs.hp.com

arnold@audiofax.com (Arnold Robbins) (12/21/89)

:In article <25@meme.stanford.edu> heit@psych.Stanford.EDU (Evan Heit) writes:
:: I am looking for someone who has written a program in awk that will
:: will allow me to proofread my papers by by looking for word repetitions.

In article <6612@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>How about filtering through
>
>    tr -cs "A-Za-z" "\012" | uniq -d
>
>(Sys V'ers will have to make that [A-Z][a-z]).
>
>I sincerely doubt that any awk (or perl) solution will do as well.

Well, yes and no.  The following should work in GNU Awk and possibly
the V.4 nawk.  It is untested though.  Its advantage is that it
provides line number and file name information.

	#! /path/to/gawk -f

	{
		gsub(/[^A-Za-z0-9 \t]/, "");	# delete non-alphanumerics
		$0 = tolower($0)		# go to all one case
		if (NF == 0) { last = ""; next }	# blank line: reset and skip
		if ($1 == last)
			printf "Duplicate '%s' line %d, file %s\n",
				last, FNR, FILENAME
		for (i = 2; i <= NF; i++)
			if ($(i-1) == $i)
				printf "Duplicate '%s' line %d, file %s\n",
					$i, FNR, FILENAME
		last = $NF
	}
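
Filling in the real path to gawk on the #! line and making the script
executable lets it run on its own; failing that, an invocation along these
lines works (dupword.awk and the chapter names are only stand-ins):

	gawk -f dupword.awk chapter1 chapter2

With several files on the command line, FILENAME and FNR in the messages
say exactly which file and line each repeat came from.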

As Jeff Lee points out, this IS slower than the tr | uniq solution.
-- 
Arnold Robbins -- Senior Research Scientist - AudioFAX | Laundry increases
2000 Powers Ferry Road, #220 / Marietta, GA. 30067     | exponentially in the
INTERNET: arnold@audiofax.com	Phone: +1 404 933 7600 | number of children.
UUCP:	  emory!audfax!arnold	Fax:   +1 404 933 7606 |   -- Miriam Hartholz

ggg@sunquest.UUCP (Guy Greenwald) (12/22/89)

In article <25@meme.stanford.edu>, heit@meme.Stanford.EDU (Evan Heit) writes:
> I am looking for someone who has written a program in awk that will
> will allow me to proofread my papers by by looking for word repetitions.

Read Read pages 119-122 in The Unix Programming Environment, Kernighan and
and Pike, Prentice-Hall, 1984. ISBN 0-13-937681-X.

--G. Guy Greenwald II