[comp.unix.questions] using awk with records that have over 100 fields

plantz@manta.NOSC.MIL (Glen W. Plantz) (01/01/91)

I posted an "awk" question several weeks ago, that I'm still having trouble
with. I have a "modified" version of the same question here. Any help would be
appreciated.

I need to use awk to scan a file that consists of lines, each line starting
with an integer, followed by a _LONG_ (paragraph) line of text that should
end with a period, followed by a newline character. The script should save
the number at the beginning of the line and, if the line does not have a
"period" prior to the "newline", print out a message with the integer
number that was at the beginning of the line.

The problem I've had so far with our version of "awk" is that the lines
(paragraphs) that have too many fields cause an error of the type:
	547
awk: record `   479 Provide techn...' has too many fields
 record number 4

These "_LONG_" lines could have several hundred words on them.  How can
I get awk or another unix utility to process this text?
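One way to sidestep the field limit entirely is to let sed do both the test
and the extraction, since sed never splits a record into fields. A minimal
sketch, with invented sample data standing in for the real file:

```shell
# Sample data in the shape described: leading integer, then a long paragraph.
printf '%s\n' \
  '  12 Provide one paragraph that ends correctly.' \
  '  47 Provide another paragraph that is missing its period' > paras.txt

# For every line that does NOT end in a period, strip everything but the
# leading integer and print it.  No field splitting ever happens, so line
# length and word count do not matter.
sed -n '/\.$/!{s/^ *//;s/ .*//;p;}' paras.txt   # prints: 47
```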

tchrist@convex.COM (Tom Christiansen) (01/01/91)

From the keyboard of plantz@manta.NOSC.MIL (Glen W. Plantz):
:The problem I've had so far with our version of "awk", is that the lines
:(paragraphs) that have too many fields cause an error of the type:
:	547
:awk: record `   479 Provide techn...' has too many fields
: record number 4
:
:These "_LONG_" lines could have several hundred words on them.  How can
:I get awk or another unix utility to process this text?

Run your awk script through the awk-to-perl translator, a2p, then run perl
on the resulting script as perl has no such limitations.  By default, the
translator will convert awk splits into perl splits with a maximum of 999
resulting fields, but you can easily increase or remove that restriction.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

skwu@boulder.Colorado.EDU (WU SHI-KUEI) (01/02/91)

In article <1990Dec31.200723.7929@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>From the keyboard of plantz@manta.NOSC.MIL (Glen W. Plantz):
>:The problem I've had so far with our version of "awk", is that the lines
>:(paragraphs) that have too many fields cause an error of the type:
>:	547
>:awk: record `   479 Provide techn...' has too many fields
>: record number 4
>:
>:These "_LONG_" lines could have several hundred words on them.  How can
>:I get awk or another unix utility to process this text?
>
>Run your awk script through the awk-to-perl translator, a2p, then run perl
>on the resulting script  ......

No need for 'perl', a boon to the majority of UNIX users who do not use it.
Simply replace the first whitespace field separator with some otherwise
unused glyph (e.g. @) using 'sed', and then set the awk FS to that glyph.
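Concretely, that sed-plus-awk pipeline might look like the sketch below,
with invented sample data; the '@' is assumed never to occur in the real
text:

```shell
# Invented sample data: leading integer, then a long paragraph.
printf '%s\n' \
  '  12 Provide one paragraph that ends correctly.' \
  '  47 Provide another paragraph that is missing its period' > paras.txt

# Turn the first field separator into '@'; with FS set to '@' every record
# then has exactly two fields, however many words the paragraph holds.
sed 's/^ *\([0-9]*\) */\1@/' paras.txt |
    awk -F'@' '$2 !~ /\.$/ { print "record", $1, "has no trailing period" }'
```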

tchrist@convex.COM (Tom Christiansen) (01/02/91)

From the keyboard of skwu@spot.Colorado.EDU (WU SHI-KUEI), quoting me:
:>Run your awk script through the awk-to-perl translator, a2p, then run perl
:>on the resulting script  ......
:
:No need for 'perl', a boon to the majority of UNIX users who do not use it.
:Simply replace the first whitespace field separator with some otherwise
:unused glyph (e.g. @) using 'sed', and then set the awk FS to that glyph.

While for this particular application, it may well be that this solution
suffices, there remain all kinds of internal limits you're going to run into
with awk.  Eventually these will annoy you enough to stop using it for
large and/or complex problems.  For example, if the application were to
build an associative array of word-frequencies and you had the tremendously 
long lines described by the original poster, then awk wouldn't be able to
handle it, causing you to go through brain-twisting and gut-wrenching
contortions to pound the data back into something awk can handle.
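The word-frequency case does have a classic pipeline workaround that never
asks any one tool to split a huge record into fields: break the text into
one word per line first, then sort and count. A sketch with invented sample
data:

```shell
# Invented sample text; imagine it as one tremendously long line.
printf '%s\n' 'the quick fox and the lazy dog and the cat' > text.txt

# Squeeze every run of non-letters into a newline (one word per line),
# then count duplicates.  Record length never matters to any stage.
tr -cs 'A-Za-z' '\n' < text.txt | sort | uniq -c | sort -rn
```

The trade-off, of course, is that the pipeline gives up awk's
programmability: anything fancier than counting means another tool.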

Although perl isn't really new anymore, it's still generally perceived to
be so, and the resistance to new, useful tools in the community is so high
that some people will insist on shooting themselves in the foot using old,
limited (and even brain-damaged) software for years in the future.  (Yes,
I know it's hard to get things standardized across millions of systems,
but that shouldn't stop us from striving to forge ahead.)  My suspicion is
that this is just a manifestation in Unixdom of a principle familiar to
sociologists and historians.  While the desire to embrace better
technology may be somewhat higher amongst computer users than in the
general populace, there will always be some who wish to live (if you can
call that living) in a totally static environment where nothing ever
changes, where no improvement is ever radically different from previous
practice, and where the JCL scripts from 25 years ago still function.

Use awk while you can.  When you can't, be aware that there's an easy,
portable, freely-available upgrade path that doesn't require recoding
everything in C, and is a lot easier than trying to get AT&T to invest
the time in fixing awk.  You could reasonably argue that there are
actually two such paths, since gawk comes close to meeting these criteria:
it has greatly increased the limits of things like line length and number
of fields.  However, these limits still exist even in gawk, whereas in
perl they're entirely removed, so gawk may not be enough.  It all depends
on the problem.  Different problems are often best solved by employing
different tools, even if perl is the Swiss army chainsaw of UNIX.  

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

skwu@boulder.Colorado.EDU (WU SHI-KUEI) (01/03/91)

In article <1991Jan02.133911.24428@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
.....quoting my posting, which quoted his posting, and then continues....
>While for this particular application, it may well be that this solution
>suffices, there remain all kinds of internal limits you're going to run into
>with awk.......
>
>Although perl isn't really new anymore, it's still generally perceived to
>be so, and the resistance to new, useful tools in the community is so high
>that some people will insist on shooting themselves in the foot using old,
>limited (and even brain-damaged) software for years in the future.....
......
>While the desire to embrace better
>technology may be somewhat higher amongst computer users than in the
>general populace, there will always be some who wish to live (if you can
>call that living) in a totally static environment where nothing ever
>changes, where no improvement is ever radically different from previous
>practice, and where the JCL scripts from 25 years ago still function.
........
>
>Different problems are often best solved by employing
>different tools, even if perl is the Swiss army chainsaw of UNIX.  

Has it ever struck you that perl scripts and JCL code are painfully similar
precisely because perl is a Swiss army chainsaw?

tchrist@convex.COM (Tom Christiansen) (01/03/91)

From the keyboard of skwu@spot.Colorado.EDU (WU SHI-KUEI):
:Has it ever struck you that perl scripts and JCL code are painfully similar
:precisely because perl is a Swiss army chainsaw?

Nope, not in the least.  Perl highly resembles its predecessors: awk, C,
and sed.  Pain is a matter of one's own making and perception.  You should
compare JCL with its UNIX equivalent, the original shell; you know, the
one where glob was a separate command.  Perl is far more analogous to REXX
on modern VM/CMS systems.

Only history will tell for sure, of course.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (01/03/91)

In article <1991Jan2.164006.24557@csn.org> skwu@spot.Colorado.EDU (WU SHI-KUEI) writes:
: Has it ever struck you that perl scripts and JCL code are painfully similar
: precisely because perl is a Swiss army chainsaw?

No, I haven't stopped beating my wife, but her backhand is improving.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

kimcm@diku.dk (Kim Christian Madsen) (01/04/91)

skwu@boulder.Colorado.EDU (WU SHI-KUEI) writes:

>No need for 'perl', a boon to the majority of UNIX users who do not use it.
>Simply replace the first whitespace field separator with some otherwise
>unused glyph (e.g. @) using 'sed', and then set the awk FS to that glyph.

That (awk) solution will not work, at least not on most System V systems,
since the number of allowable fields in each record is hardcoded into the
source code. If you have the source code you can increase the number and 
recompile. If not, I suggest you find another tool; Tom Christiansen has
provided a pointer to one of the more useful ones.

					Best Regards
					Kim Chr. Madsen

alex@am.sublink.org (Alex Martelli) (01/04/91)

tchrist@convex.COM (Tom Christiansen) writes on awk vs perl:
	...
>that this is just a manifestation in Unixdom of a principle familiar to
>sociologists and historians.  While the desire to embrace better

I rather believe it's a principle more familiar to booksellers - we're
just waiting for THE BOOK to get into our little grabby hands!-)

I know I don't speak for all Unix-lovers, but I wouldn't use awk, ksh,
icon, and so on, so willingly, if each did not have a good-to-great
book about it.  Great-to-good books ain't all (I *do* use dmake, and
*don't* use ratfor, for example...) - but they surely DO help!
-- 
Alex Martelli - (home snailmail:) v. Barontini 27, 40138 Bologna, ITALIA
Email: (work:) staff@cadlab.sublink.org, (home:) alex@am.sublink.org
Phone: (work:) ++39 (51) 371099, (home:) ++39 (51) 250434; 
Fax: ++39 (51) 366964 (work only), Fidonet: 332/401.3 (home only).

oz@yunexus.yorku.ca (Ozan Yigit) (01/05/91)

In article <1991Jan02.133911.24428@convex.com> tchrist@convex.COM
(Tom Christiansen) writes:

>... and the resistance to new, useful tools in the community is so high
>that some people will insist on shooting themselves in the foot using old,
>limited (and even brain-damaged) software for years in the future.

You may want to remind yourself of this when the replacement for perl is out.
We, too, don't like limited (and even brain-damaged) software.

oz
---
Good design means less design. Design   | Internet: oz@nexus.yorku.ca 
must serve users, not try to fool them. | UUCP: utzoo/utai!yunexus!oz
-- Dieter Rams, Chief Designer, Braun.  | phonet: 1+ 416 736 5257
  

john@basho.uucp (John Lacey) (01/07/91)

alex@am.sublink.org (Alex Martelli) writes:

>I know I don't speak for all Unix-lovers, but I wouldn't use awk, ksh,
>icon, and so on, so willingly, if each did not have a good-to-great
>book about it.  Great-to-good books ain't all (I *do* use dmake, and
>*don't* use ratfor, for example...) - but they surely DO help!

Ditto here.  In fact, I find that a program that comes with _any_
documentation is better than one that comes with none.  And
better documentation seems to be a good indicator of a better program.
These are generalizations, of course, broken from time to time.  My
favorite examples are TeX (ahh, bliss) and AWK.  The best
counter-example I know is Microsoft Word for the Macintosh, which has
well above average documentation ....
-- 
John Lacey         614 436 3773         73730,2250
john@basho.uucp  or  basho!john@cis.ohio-state.edu

david@cs.dal.ca (David Trueman) (01/08/91)

In article <1991Jan02.133911.24428@convex.com> tchrist@convex.COM (Tom Christiansen) writes:

>You could reasonably argue that there are
>actually two such paths, since gawk comes close to meeting these criteria:
>it has greatly increased the limits of things like line length and number
>of fields.  However, these limits still exist even in gawk, whereas in
>perl they're entirely removed, so gawk may not be enough.  It all depends

As the current primary developer of gawk, I would like to say that
I am unaware of any such limits in gawk (except the size of an int and the 
size of your swap space -- limits that I am sure perl also has).  If there
is any limit, it is an unknown bug that will be fixed -- yes, like perl
and unlike some unnamed commercial products, we do fix bugs!!  As they
say, you get what you pay for.
-- 
{uunet watmath}!dalcs!david  or  david@cs.dal.ca