[comp.unix.questions] need AWK help: lowercase, trim trailing spaces

mario@wjvax.UUCP (Mario Dona) (04/20/91)

HELP! I have a situation that just cries out for an awk solution, however
I'm at a loss over some minor, but important details.  I have a list of
companies that need to be preprocessed before sending them to our typing
department.  A simplified portion of the input file follows:

COMPANY1  2800 FULLING       P O BOX 3608     HARRISBURG PA           17105
COMPANY2  500 ELM                             MILWAUKEE WI            53122
COMPANY3  13500 CENTRAL      P O BOX 655303   DALLAS TX               75265-5303

^         ^                  ^                ^                       ^
|         |                  |                |                       |
1         11                 30               47                      71


My mission, which I chose to accept, was to reformat the list so that it looks
like this:

Company1
28 Fulling
P O Box 3608
Harrisburg PA 17105

Company2
500 Elm
Milwaukee WI 53122

Company3
13500 Central
P O Box 655303
Dallas TX 75265-5303

Using the SUBSTR function to get the parts I want was trivial; the problem
is, I can't figure out:

1.  How to prevent blank lines from printing if there is nothing to print
    (e.g. the second address in COMPANY2 above).
2.  How to concatenate the city and zip fields as shown.
3.  If a word is greater than 2 characters, lowercase all letters
    except for the first character (this is to keep state abbreviations
    capitalized).

My feeble attempt so far is shown below.  If anyone has any ideas, I'd be
much obliged.

BEGIN   {
        RS="\n"
        }
{
name=substr($0,1,10)
address1=substr($0,11,19)
address2=substr($0,30,17)
city=substr($0,47,24)
zip=substr($0,71,10)
printf("%s\n%s\n%s\n%s\n%s\n\n", name, address1, address2, city, zip)
}

  Mario Dona
  ...!{ !decwrl!qubix, ames!oliveb!tymix, pyramid}!wjvax!mario         
  The above opinions are mine alone and not, in any way, those of WJ.

goer@ellis.uchicago.edu (Richard L. Goerwitz) (04/20/91)

In article <1817@wjvax.UUCP> mario@wjvax.UUCP (Mario Dona) writes:
>
>HELP! I have a situation that just cries out for an awk solution, however
>I'm at a loss over some minor, but important details.  I have a list of
>companies that need to be preprocessed before sending them to our typing
>department.  A simplified portion of the input file follows:
>
>COMPANY1  2800 FULLING       P O BOX 3608     HARRISBURG PA           17105
>^         ^                  ^                ^                       ^
>|         |                  |                |                       |
>1         11                 30               47                      71
>
>My mission, which I chose to accept, was to reformat the list so that it looks
>like this:
>
>Company1
>28 Fulling
>P O Box 3608
>Harrisburg PA 17105
>...

Here is one Icon solution.  Note that it omits blank lines, capitalizes
multi-word city, street, and company names, and removes the annoying space
between the P and O in "P O Box."  It also inserts three spaces between
the state abbreviation and the zipcode (2 or 3 spaces is standard these
days).  A slight alteration (one line) would be all that you'd need to
add in to force all-uppercase company names.  Note that I split the line
based on the column positions you gave, although I can't imagine how
the gatherers of these statistics managed to fit everything into such
tight spaces!

procedure main()

    every line := trim(!&input,'\t ') do {
	line ? {
	    every i := 11|30|47
	    do write("" ~== capitalize_words(tab(i) \ 1))
	    writes(capitalize_words(tab(71), 1), "   ")
	    write(tab(0), "\n")
	}
    }

end    

procedure capitalize(s)
    s ? (return (move(1) || map(tab(upto('\t ') | 0)) || tab(0)) | "")
end

procedure capitalize_words(s, sw)

    s2 := ""
    trim(s,'\t ') ? {
	while chunk := capitalize(tab(upto('\t '))) do {
	    s2 ||:= chunk || { if chunk == "P" & =" O " then "O " else " " }
	    tab(many('\t '))
	}
	if \sw & s2 ~== ""
	then s2 ||:= tab(0)
	else s2 ||:= capitalize(tab(0))
    }
    return s2

end


-- 

   -Richard L. Goerwitz              goer%sophist@uchicago.bitnet
   goer@sophist.uchicago.edu         rutgers!oddjob!gide!sophist!goer

lewis@tramp.Colorado.EDU (LEWIS WILLIAM M JR) (04/21/91)

To prevent printing blank lines, simply do (in awk):

	address = substr(...   )
	if (length(address) > 0)
		print address
	etc., etc.

The upper/lower case problem is more difficult.
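
One caveat with the length test above: substr() on a fixed-width record
returns the padding too, so an "empty" second address is really 17
spaces and its length is not 0.  A minimal sketch of trimming before
testing, assuming a POSIX awk (nawk, gawk, mawk) that provides sub();
the names print_fields and rtrim are made up for illustration, and the
column positions are the ones from the original post:

```sh
#!/bin/sh
# Print name, address1 and address2 from each fixed-width record,
# skipping any field that is blank once its padding is removed.
print_fields() {
    awk '
    # Hypothetical helper: return s with trailing blanks and tabs removed.
    function rtrim(s) { sub(/[ \t]+$/, "", s); return s }
    {
        name     = rtrim(substr($0, 1, 10))
        address1 = rtrim(substr($0, 11, 19))
        address2 = rtrim(substr($0, 30, 17))
        if (name != "")     print name
        if (address1 != "") print address1
        if (address2 != "") print address2
    }'
}
```

Usage would be something like "print_fields < companies.txt", where the
file name is only an example.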

merlyn@iwarp.intel.com (Randal L. Schwartz) (04/21/91)

In article <1817@wjvax.UUCP>, mario@wjvax (Mario Dona) writes:
| HELP! I have a situation that just cries out for an awk solution, however
| I'm at a loss over some minor, but important details.

Well, to *me* it just cries out for a Perl solution.  Try this:

while (<>) {
	s/([A-Z]{3,})/\u\L$1$2/g;
	($name,$address1,$address2,$city,$zip) = unpack("A10A19A17A24A*",$_);
	print "$name\n";
	print "$address1\n";
	print "$address2\n" if $address2;
	print "$city $zip\n";
	print "\n";
}

Works just fine on your test data.

print "Just another Perl hacker,"
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

goer@ellis.uchicago.edu (Richard L. Goerwitz) (04/21/91)

merlyn@iwarp.intel.com (Randal L. Schwartz) writes:
>
>while (<>) {
>	s/([A-Z]{3,})/\u\L$1$2/g;
>	($name,$address1,$address2,$city,$zip) = unpack("A10A19A17A24A*",$_);
>	print "$name\n";
>	print "$address1\n";
>	print "$address2\n" if $address2;
>	print "$city $zip\n";
>	print "\n";
>}
>
>Works just fine on your test data.

No, no!  The gentleman said quite plainly that he wanted his data to
look like this:

Company1
28 Fulling
P O Box 3608
Harrisburg PA 17105

Company2
500 Elm
Milwaukee WI 53122

Company3
13500 Central
P O Box 655303
Dallas TX 75265-5303

Did you actually try running your perl code?

merlyn@iwarp.intel.com (Randal L. Schwartz) (04/22/91)

In article <1991Apr21.045226.16050@midway.uchicago.edu>, goer@ellis (Richard L. Goerwitz) writes:
| merlyn@iwarp.intel.com (Randal L. Schwartz) writes:
| >
| >while (<>) {
| >	s/([A-Z]{3,})/\u\L$1$2/g;
| >	($name,$address1,$address2,$city,$zip) = unpack("A10A19A17A24A*",$_);
| >	print "$name\n";
| >	print "$address1\n";
| >	print "$address2\n" if $address2;
| >	print "$city $zip\n";
| >	print "\n";
| >}
| >
| >Works just fine on your test data.
| 
| No, no!  The gentleman said quite plainly that he wanted his data to
| look like this:
| 
| Company1
| 28 Fulling
| P O Box 3608
| Harrisburg PA 17105
| 
| Company2
| 500 Elm
| Milwaukee WI 53122
| 
| Company3
| 13500 Central
| P O Box 655303
| Dallas TX 75265-5303
| 
| Did you actually try running your perl code?

Yes.  That's exactly what came out.  If it didn't come out on *your*
Perl, you have an old Perl.  (I used the new \u\L operators, if that's
what you're objecting to.)

I did make a silly typo in the first one.  The line:

	s/([A-Z]{3,})/\u\L$1$2/g;

should read:

	s/([A-Z]{3,})/\u\L$1/g;

The $2 was a leftover from doing it as two partial expressions, but
then I realized I didn't need to do that.  But the code worked in
either case, which is why I didn't catch it. :-)

This line finds 3 or more letters, and then lowercases all letters
after the first.  That was part of the spec.

(I hope I'm responding to your criticism.  It wasn't very specific.
But believe me, the code *does* work as requested.)

goer@quads.uchicago.edu (Richard L. Goerwitz) (04/22/91)

merlyn@iwarp.intel.com (Randal L. Schwartz) writes (in response to my
objection that his Perl code didn't do what's expected):
>
>Yes.  That's exactly what came out.  If it didn't come out on *your*
>Perl, you have an old Perl.  (I used the new \u\L operators, if that's
>what you're objecting to.)

Understood.  I tried it out on perl 3.0 pl 18.  I wouldn't call perl
3.0 an "old" perl by any stretch of the imagination, especially since
4.0 just finished coming over the net a couple of days ago!

-Richard

harrison@necssd.NEC.COM (Mark Harrison) (04/23/91)

In article <1817@wjvax.UUCP>, mario@wjvax.UUCP (Mario Dona) writes:

> HELP! I have a situation that just cries out for an awk solution

[converting from]

> COMPANY1  2800 FULLING       P O BOX 3608     HARRISBURG PA           17105

[to] 

> Company1
> 28 Fulling
> P O Box 3608
> Harrisburg PA 17105

> 1.  How to prevent blank lines from printing if there is nothing to print

Add this line after your BEGIN rule:
/^$/ {next} #skip blank lines

If you want to skip lines that may have white space:
/^[ \t]*$/ {next} #skip blank (non-text) lines

> 2.  How to concatenate the city and zip fields as shown.

To concatenate:

	city_and_zip = city " " zip

To strip trailing space from city before concatenating:

	while (substr(city, length(city)) == " ")
		city = substr(city, 1, length(city) - 1)
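
Where sub() is available, the trimming loop can be written as a single
call.  A rough sketch, assuming a POSIX awk (nawk, gawk, mawk) and the
column positions from the original post; the function name city_zip is
made up for illustration:

```sh
#!/bin/sh
# Join the city/state field and the zip field with a single space.
city_zip() {
    awk '{
        city = substr($0, 47, 24)
        zip  = substr($0, 71, 10)
        sub(/ +$/, "", city)    # drop the padding after the city and state
        sub(/ +$/, "", zip)     # drop any padding after the zip code
        print city " " zip
    }'
}
```

The old awk has no sub(), so there the while loop above is still the
way to go.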

> 3.  If a word is greater than 2 characters, lowercase all letters
>     except for the first character (this is to keep state abbreviations
>     capitalized).

This is doable, but not enjoyable.  It is easier if you use nawk or
gawk (gawk, for one, has tolower() and toupper() built in).  Otherwise,
make two lookup arrays:

uc["a"] = "A"  ... uc["z"] = "Z"
lc["A"] = "a"  ... lc["Z"] = "z"

and loop for the length of the string:

    c = substr(str, i, 1)
    if (uc[c] == "")              # not a lowercase letter: copy unchanged
        newstr = newstr c
    else                          # append the mapped letter; use lc[] to lowercase
        newstr = newstr uc[c]
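
With an awk that has tolower(), the lookup arrays are not needed.  A
minimal sketch of the "lowercase everything after the first character
of words longer than 2 characters" rule, assuming a POSIX awk (nawk,
gawk, mawk); the names capitalize_words and cap are made up for
illustration:

```sh
#!/bin/sh
# Apply the capitalization rule to every word of every input line.
capitalize_words() {
    awk '
    # Hypothetical helper: lowercase all but the first character of each
    # word longer than two characters; shorter words are left alone.
    # Runs of blanks between words collapse to a single space.
    function cap(s,    n, i, w, out) {
        n = split(s, w, " ")
        out = ""
        for (i = 1; i <= n; i++) {
            if (length(w[i]) > 2)
                w[i] = substr(w[i], 1, 1) tolower(substr(w[i], 2))
            out = out (i > 1 ? " " : "") w[i]
        }
        return out
    }
    { print cap($0) }'
}
```

Because split() on " " discards leading and trailing blanks, cap() also
takes care of the padding in the fixed-width fields, so it could be
applied directly to each substr() result.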

martin@mwtech.UUCP (Martin Weitzel) (05/09/91)

In article <1991Apr20.220114.8727@colorado.edu> lewis@tramp.Colorado.EDU (LEWIS WILLIAM M JR) writes:
>To prevent printing blank lines, simply do (in awk):
>
>	address = substr(...   )
>	if (length(address) > 0)
>		print address

Still simpler:

	address = substr(...   )
	if (address)
		print address

Or:
	# assign substring to address and print if not empty
	if (address = substr(...   ))
		print address

BE WARNED: I explicitly wrote a comment in front of the if statement.
So don't start a discussion thread whether it is obscure, good, bad,
professional or whatever programming style to write assignments within
conditional contexts :-)
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83