[comp.lang.perl] All-uppercase text to mixed case

bill@ksr.com (Bill Mann) (01/20/90)

A few weeks ago Carl Hommel posted a request for a perl script to convert
all-uppercase messages to mixed case.  I editted his message slightly,
*converted it to uppercase*, and ran it through the perl script below.
Here's the result:

  A user asked for help in converting files from uppercase to lowercase, 
  and back again.  Of course, there are simple one-liners using tr that
  will do that.  However, i foolishly said that it would be easy to write
  a program to convert an all uppercase message, which often are sent by our
  novice users, to a more normal format.

  The algorithm i thought of is:
      o uppercase any alpha two spaces after a "."
      o uppercase any alpha at the beginning of a line, when there
        was a "." at the end of the previous line.

  I could brute force this, but that would be work, not fun.  I have twiddled
  with the following features of perl, but got nothing to work:
      o setting $/ = "\177" and reading in the whole file at once
      o using   if (/\.  (.)/) {  And $1.

  I really want an elegant regexp, combined with the tr command.  Any ideas?

  Carl hommel
  carlton@apollo.hp.com

Here's my perl script:

#!/usr/bin/perl

# This program copies its input to STDOUT, converting uppercase characters
# to lowercase according to a simple rule.

# RULE: a sentence begins with an alphanumeric and ends at the end of
# the input file or at the first terminator (period, question mark, or 
# exclamation point) which is followed by white space; all but the
# first character of each sentence is converted to lower case.

    $/ = "\177";        # do not split the input into lines
    $_ = <>;            # read the entire input

    s/(\w)(([^.?!]+|[.?!]+\S)+)/($a=$2)=~tr|A-Z|a-z|,$1.$a/eg;

    print;              # print the results


Notes:
  . The ($a=$2) is necessary, as modifying $2 directly does not work.
  . The description of $* in the perl manual seems to indicate it should 
    be set to 1, but the script works either way for this example.

Bill Mann
bill@ksr.com (ksr!bill@uunet.uu.net)

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (01/23/90)

Here is something I whipped up for Peter Yee.  It does the same thing,
only more so.  It uses /usr/dict/words plus an exception dictionary.
The one supplied is obviously the start of one for translating NASA articles,
though there's still some stuff missing from it.

This won't work quite right until patch 9 comes out, with a fix for /\b/i.

Among other things, this program assumes that words containing no vowels
are acronyms, and should be capitalized.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

#!/bin/sh
: make a subdirectory, cd to it, and run this through sh.
echo 'If this kit is complete, "End of kit" will echo at the end'
echo Extracting unuc
sed >unuc <<'!STUFFY!FUNK!' -e 's/X//'
X#!/usr/bin/perl
X
Xprint STDERR "Loading proper nouns...\n";
Xopen(DICT,"/usr/dict/words") || die "Can't find /usr/dict/words: $!\n";
Xwhile (<DICT>) {
X    if (/^[A-Z]/) {
X	chop;
X	($lower = $_) =~ y/A-Z/a-z/;
X	$proper{$lower} = $_;
X    }
X}
Xclose DICT;
Xprint STDERR "Loading exceptions...\n";
X
Xopen(PATS,"unuc.pats") || die "Can't find unuc.pats: $!\n";
X
X$prog = <<'EOT';
Xwhile (<>) {
X    next if /[a-z]/;
X    y/A-Z/a-z/;
X    s/(\w+)/$proper{$1} ? $proper{$1} : $1/eg;
X    s/^(\s*)([a-z])/$1 . (($tmp = $2) =~ y:a-z:A-Z:,$tmp)/e;
X    s/([-.?!]["']?(\n\s*|  \s*)["']?)([a-z])/$1 . (($tmp = $3) =~ y:a-z:A-Z:,$tmp)/eg;
X    s/\b([b-df-hj-np-tv-xz]+)\b/(($tmp = $1) =~ y:a-z:A-Z:,$tmp)/eg;
X    s/([a-z])'([SDT])\b/$1 . "'" . (($tmp = $2) =~ y:A-Z:a-z:,$tmp)/eg;
XEOT
Xwhile (<PATS>) {
X    chop;
X    next if /^$/;
X    next if /^#/;
X    if (! /;$/) {
X	$foo = $_;
X	$foo =~ y/A-Z/a-z/;
X	print STDERR "Dup $_\n" if $proper{$foo};
X	$foo =~ s/([^\w ])/\\$1/g;
X	$foo =~ s/ /(\\s+)/g;
X	$foo = "\\b" . $foo if $foo =~ /^\w/;	# XXX till patch 9
X	$foo .= "\\b" if $foo =~ /\w$/;
X	$i = 0;
X	($bar = $_) =~ s/ /'$' . ++$i/eg;
X	$_ = "s/$foo/$bar/gi;";
X    }
X    $prog .= '    ' . $_ . "\n";
X}
Xclose PATS;
X$prog .= "}\ncontinue {\n    print;\n}\n";
X
X$/ = '';
X#print $prog;
Xeval $prog; die $@ if $@;
!STUFFY!FUNK!
echo Extracting unuc.pats
sed >unuc.pats <<'!STUFFY!FUNK!' -e 's/X//'
XA.M.
XAir Force
XAir Force Base
XAir Force Station
XAmerican
XApr.
XAriane
XAug.
XAugust
XBureau of Labor Statistics
XCIT
XCaltech
XCape Canaveral
XChallenger
XChina
XCorporation
XCrippen
XDaily News in Brief
XDaniel Quayle
XDec.
XDiscovery
XEdwards
XEndeavour
XFeb.
XFord Aerospace
XFri.
XGeneral Dynamics
XGeorge Bush
XHeadline News
XHOTOL
XI
XII
XIII
XIV
XIX
XInstitute of Technology
XJPL
XJan.
XJul.
XJun.
XKennedy Space Center
XLDEF
XLong Duration Exposure Facility
XLong March
XMar.
XMarch
XMartin
XMartin Marietta
XMercury
XMon.
Xin May
Xs/\bmay (\d)/May $1/g;
Xs/\boffice of (\w)/'Office of ' . (($tmp = $1) =~ y:a-z:A-Z:,$tmp)/eg;
XNational Science Foundation
XNASA Select
XNew Mexico
XNov.
XOMB
XOct.
XOffice of Management and Budget
XPresident
XPresident Bush
XRichard Truly
XRocketdyne
XRussian
XRussians
XSat.
XSep.
XSoviet
XSoviet Union
XSoviets
XSpace Shuttle
XSun.
XThu.
XTue.
XU.S.
XUnion of Soviet Socialist Republics
XUnited States
XVI
XVII
XVIII
XVice President
XVice President Quayle
XWed.
XWhite Sands
XKaman Aerospace
XAerospace Daily
XAviation Week
XSpace Technology
XWashington Post
XLos Angeles Times
XNew York Times
XAerospace Industries Association
Xpresident of
XJohnson Space Center
XSpace Services
XInc.
XCo.
XHughes Aircraft
XCompany
XOrbital Sciences
XSwedish Space
XArnauld
XNicogosian
XMagellan
XGalileo
XMir
XJet Propulsion Laboratory
XUniversity
XDepartment of Defense
XOrbital Science
XOMS
XUnited Press International
XUnited Press
XUPI
XAssociated Press
XAP
XCable News Network
XCape York
XZenit
XSYNCOM
XEastern
XWestern
XTest Range
XJcsat
XJapanese Satellite Communications
XDefence Ministry
XDefense Ministry
XSkynet
XFixed Service Structure
XLaunch Processing System
XAsiasat
XLaunch Control Center
XEarth
XCNES
XGlavkosmos
XPacific
XAtlantic
!STUFFY!FUNK!
echo ""
echo "End of kit"
: I do not append .signature, but someone might mail this.
exit