bill@ksr.com (Bill Mann) (01/20/90)
A few weeks ago Carl Hommel posted a request for a perl script to convert all-uppercase messages to mixed case. I editted his message slightly, *converted it to uppercase*, and ran it through the perl script below. Here's the result: A user asked for help in converting files from uppercase to lowercase, and back again. Of course, there are simple one-liners using tr that will do that. However, i foolishly said that it would be easy to write a program to convert an all uppercase message, which often are sent by our novice users, to a more normal format. The algorithm i thought of is: o uppercase any alpha two spaces after a "." o uppercase any alpha at the beginning of a line, when there was a "." at the end of the previous line. I could brute force this, but that would be work, not fun. I have twiddled with the following features of perl, but got nothing to work: o setting $/ = "\177" and reading in the whole file at once o using if (/\. (.)/) { And $1. I really want an elegant regexp, combined with the tr command. Any ideas? Carl hommel carlton@apollo.hp.com Here's my perl script: #!/usr/bin/perl # This program copies its input to STDOUT, converting uppercase characters # to lowercase according to a simple rule. # RULE: a sentence begins with an alphanumeric and ends at the end of # the input file or at the first terminator (period, question mark, or # exclamation point) which is followed by white space; all but the # first character of each sentence is converted to lower case. $/ = "\177"; # do not split the input into lines $_ = <>; # read the entire input s/(\w)(([^.?!]+|[.?!]+\S)+)/($a=$2)=~tr|A-Z|a-z|,$1.$a/eg; print; # print the results Notes: . The ($a=$2) is necessary, as modifying $2 directly does not work. . The description of $* in the perl manual seems to indicate it should be set to 1, but the script works either way for this example. Bill Mann bill@ksr.com (ksr!bill@uunet.uu.net)
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (01/23/90)
Here is something I whipped up for Peter Yee. It does the same thing, only more so. It uses /usr/dict/words plus an exception dictionary. The one supplied is obviously the start of one for translating NASA articles, though there's still some stuff missing from it. This won't work quite right until patch 9 comes out, with a fix for /\b/i. Among other things, this program assumes that words containing no vowels are acronyms, and should be capitalized. Larry Wall lwall@jpl-devvax.jpl.nasa.gov #!/bin/sh : make a subdirectory, cd to it, and run this through sh. echo 'If this kit is complete, "End of kit" will echo at the end' echo Extracting unuc sed >unuc <<'!STUFFY!FUNK!' -e 's/X//' X#!/usr/bin/perl X Xprint STDERR "Loading proper nouns...\n"; Xopen(DICT,"/usr/dict/words") || die "Can't find /usr/dict/words: $!\n"; Xwhile (<DICT>) { X if (/^[A-Z]/) { X chop; X ($lower = $_) =~ y/A-Z/a-z/; X $proper{$lower} = $_; X } X} Xclose DICT; Xprint STDERR "Loading exceptions...\n"; X Xopen(PATS,"unuc.pats") || die "Can't find unuc.pats: $!\n"; X X$prog = <<'EOT'; Xwhile (<>) { X next if /[a-z]/; X y/A-Z/a-z/; X s/(\w+)/$proper{$1} ? $proper{$1} : $1/eg; X s/^(\s*)([a-z])/$1 . (($tmp = $2) =~ y:a-z:A-Z:,$tmp)/e; X s/([-.?!]["']?(\n\s*| \s*)["']?)([a-z])/$1 . (($tmp = $3) =~ y:a-z:A-Z:,$tmp)/eg; X s/\b([b-df-hj-np-tv-xz]+)\b/(($tmp = $1) =~ y:a-z:A-Z:,$tmp)/eg; X s/([a-z])'([SDT])\b/$1 . "'" . (($tmp = $2) =~ y:A-Z:a-z:,$tmp)/eg; XEOT Xwhile (<PATS>) { X chop; X next if /^$/; X next if /^#/; X if (! /;$/) { X $foo = $_; X $foo =~ y/A-Z/a-z/; X print STDERR "Dup $_\n" if $proper{$foo}; X $foo =~ s/([^\w ])/\\$1/g; X $foo =~ s/ /(\\s+)/g; X $foo = "\\b" . $foo if $foo =~ /^\w/; # XXX till patch 9 X $foo .= "\\b" if $foo =~ /\w$/; X $i = 0; X ($bar = $_) =~ s/ /'$' . ++$i/eg; X $_ = "s/$foo/$bar/gi;"; X } X $prog .= ' ' . $_ . "\n"; X} Xclose PATS; X$prog .= "}\ncontinue {\n print;\n}\n"; X X$/ = ''; X#print $prog; Xeval $prog; die $@ if $@; !STUFFY!FUNK! echo Extracting unuc.pats sed >unuc.pats <<'!STUFFY!FUNK!' -e 's/X//' XA.M. XAir Force XAir Force Base XAir Force Station XAmerican XApr. XAriane XAug. XAugust XBureau of Labor Statistics XCIT XCaltech XCape Canaveral XChallenger XChina XCorporation XCrippen XDaily News in Brief XDaniel Quayle XDec. XDiscovery XEdwards XEndeavour XFeb. XFord Aerospace XFri. XGeneral Dynamics XGeorge Bush XHeadline News XHOTOL XI XII XIII XIV XIX XInstitute of Technology XJPL XJan. XJul. XJun. XKennedy Space Center XLDEF XLong Duration Exposure Facility XLong March XMar. XMarch XMartin XMartin Marietta XMercury XMon. Xin May Xs/\bmay (\d)/May $1/g; Xs/\boffice of (\w)/'Office of ' . (($tmp = $1) =~ y:a-z:A-Z:,$tmp)/eg; XNational Science Foundation XNASA Select XNew Mexico XNov. XOMB XOct. XOffice of Management and Budget XPresident XPresident Bush XRichard Truly XRocketdyne XRussian XRussians XSat. XSep. XSoviet XSoviet Union XSoviets XSpace Shuttle XSun. XThu. XTue. XU.S. XUnion of Soviet Socialist Republics XUnited States XVI XVII XVIII XVice President XVice President Quayle XWed. XWhite Sands XKaman Aerospace XAerospace Daily XAviation Week XSpace Technology XWashington Post XLos Angeles Times XNew York Times XAerospace Industries Association Xpresident of XJohnson Space Center XSpace Services XInc. XCo. XHughes Aircraft XCompany XOrbital Sciences XSwedish Space XArnauld XNicogosian XMagellan XGalileo XMir XJet Propulsion Laboratory XUniversity XDepartment of Defense XOrbital Science XOMS XUnited Press International XUnited Press XUPI XAssociated Press XAP XCable News Network XCape York XZenit XSYNCOM XEastern XWestern XTest Range XJcsat XJapanese Satellite Communications XDefence Ministry XDefense Ministry XSkynet XFixed Service Structure XLaunch Processing System XAsiasat XLaunch Control Center XEarth XCNES XGlavkosmos XPacific XAtlantic !STUFFY!FUNK! echo "" echo "End of kit" : I do not append .signature, but someone might mail this. exit