[comp.lang.perl] Help wanted in uppercasing keywords in a text file

rohit@dmdev.UUCP (Rohit Mehrotra) (02/09/91)

Hi,

I want to convert a list of keywords (about 150) to upper case
wherever they occur in a text file. For example, wherever "update"
or "Update" occurs as a full word, change it to "UPDATE".

Is there a Perl, sed, or awk script out there that would do this for me?

Please EMAIL me your responses as my news software is screwed up a little
bit these days, and I WOULD post a summary.

thanks

rohit
EMAIL: rohit%dmdev@uunet.uu.net  or uunet!dmdev!rohit
-- 
Rohit Mehrotra
Fleet Credit Corporation
8325 NW 53rd St, Miami, Fl 33166.
E-MAIL Address uunet!dmdev!rohit VOICE 1-(305)-477-0390 Ext 469

tchrist@convex.COM (Tom Christiansen) (02/10/91)

> I want to convert a list of keywords (about 150) to upper case
> wherever they occur in a text file. For example, wherever "update"
> or "Update" occurs as a full word, change it to "UPDATE".

> Is there a Perl, sed, or awk script out there that would do this for me?

Here's my solution:

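    #!/usr/bin/perl
    # usage: capwords wordlist [files ...]
    $WORDS = shift          || die "usage: $0 wordlist [files ...]\n";
    # parentheses make the || test open's result, not the filehandle name
    open(WORDS, $WORDS)     || die "can't open $WORDS: $!\n";
    # build one loop with a substitution per keyword, then eval it once
    $code = "while (<>) {\n    study;\n";
    while (<WORDS>) {
        chop;                           # strip the trailing newline
        s/(\W)/\\$1/g;                  # escape non-word characters
        ($lhs = $_) =~ tr/A-Z/a-z/;     # pattern side (matched with /i)
        ($rhs = $_) =~ tr/a-z/A-Z/;     # replacement: the upper-cased word
        $code .= "    s/\\b$lhs\\b/$rhs/gi;\n";
    }
    $code .= "    print;\n}\n";
    #print STDERR $code;                # uncomment to see the generated loop
    eval $code;
    die $@ if $@;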

Whether the study helps you or not depends on the word list.
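
To see what the eval actually runs, here is a rough sketch that builds
the same loop text from an in-line list and prints it instead of
eval'ing it.  The sub name build_code and the two sample words are made
up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
# Sketch only: build_code() is a hypothetical helper that assembles the
# same generated loop as the script above, from a list instead of a file.
sub build_code {
    my @words = @_;
    my $code = "while (<>) {\n    study;\n";
    foreach my $w (@words) {
        (my $pat = $w) =~ s/(\W)/\\$1/g;     # escape non-word characters
        (my $lhs = $pat) =~ tr/A-Z/a-z/;     # pattern side
        (my $rhs = $pat) =~ tr/a-z/A-Z/;     # upper-cased replacement
        $code .= "    s/\\b$lhs\\b/$rhs/gi;\n";
    }
    $code .= "    print;\n}\n";
    return $code;
}

# Print the generated program for two sample (made-up) keywords.
print build_code("update", "select-all");
```

For those two words the generated loop should come out as:

    while (<>) {
        study;
        s/\bupdate\b/UPDATE/gi;
        s/\bselect\-all\b/SELECT\-ALL/gi;
        print;
    }

One substitution line per keyword, all compiled once by the eval, so
the per-line cost is just running the already-compiled substitutions.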

I ran mine on perl's reserved words (plus fuzz) on its man page:

   sed -ne 's/.*strEQ(d,"\([^"]*\).*/\1/p' perl/src/toke.c > words # ~200 
   time perl capwords words /usr/man/man1/perl.1 > capperl

It took me 20 user seconds to run this on a C-220.  It takes ~10 seconds
more without the study.  I doubt you'll get a sed/awk solution that
approaches this speed.  But I did try...

I attempted to construct an equivalent sh+sed program, but didn't know how
to express s/\bfoo\b/FOO/g in sed -- the \b escaped me.  So I decided to
make do with s/foo/FOO/g, but had problems with built-in limits on the
total number of sed commands.  When I reduced this to ~190 substitutions
instead of 200, it ran but it took more than twice as long (just to run
the dynamic sed script, not to build it with sh and paste and tr and
awk).  Then I remembered that the perl version was doing s/foo/FOO/gi so
changed the sed to things like s/[Ff][Oo][Oo]/FOO/g and found I'd now
exceeded sed's limit on the total amount of command text.  When I cut the
number of words in half (down to <100) and ran it, it took 4x the perl
time, to do less than half the work.  As we're now approaching an order of
magnitude difference, I gave up on sed.  One could probably construct an
awk script to do it, but I'd expect that to run much longer still.
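
To illustrate why the case-folded sed version runs into the
command-text limit, here is a small sketch of the expansion described
above, written in perl rather than the sh/paste/tr/awk pipeline I
actually used.  The sub name sed_cmd is hypothetical, and it assumes
the keywords contain only letters, digits, and underscores, so nothing
needs escaping for sed.

```perl
#!/usr/bin/perl
# Sketch only: sed_cmd() is a hypothetical helper.  It assumes the word
# holds only letters, digits, and underscores (no sed metacharacters).
sub sed_cmd {
    my ($word) = @_;
    my $pat = '';
    foreach my $c (split //, $word) {
        if ($c =~ /[A-Za-z]/) {
            # each letter becomes a two-character bracket class
            $pat .= '[' . uc($c) . lc($c) . ']';
        } else {
            $pat .= $c;                  # digits and _ pass through as-is
        }
    }
    (my $rhs = $word) =~ tr/a-z/A-Z/;    # upper-cased replacement
    return "s/$pat/$rhs/g";
}

print sed_cmd("foo"), "\n";              # prints s/[Ff][Oo][Oo]/FOO/g
```

Every letter of the pattern grows from one character to four, which is
how a couple of hundred keywords end up exceeding sed's limit on total
command text.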

In fact, I'll even bet that you'd need a highly tuned C program to 
get this fast.  This might be one of the cases where a C program wouldn't
be any faster.

--tom
--
 "All things are possible, but not all expedient."  (in life, UNIX, and perl)