[comp.unix.shell] Problem using multiple 'head' commands in shell script

garyb@abekrd.UUCP (Gary Bartlett) (01/30/91)

Can someone explain to me what is happening with the following Bourne shell
script and more importantly how I can get around it:


	#!/bin/sh

	cat file | (
		head -200
		echo "Line 201 follows"
		head -200
		echo "Line 401 follows"
		cat
	)

I am trying to use 'head' as a quick way to split up an input stream.  I
originally used 'read' and an 'expr' counter but this was too slow.

This script loses lines after each 'head'.  eg if file contained a stream of
numbers, the output would be missing lots of numbers!

It looks like 'head' initially reads in a whole buffer of data from file
(stdin), prints out the requisite number of lines and then dumps the rest
of the buffer.  The next 'head' then reads the NEXT buffer.  Is this
right/normal?  How can I get around this (preferably whilst still using
'head')?  Is it possible to change the buffering within this script?
Pointers, anyone?
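
(In case it helps anyone reproduce this, here's a quick, untested way to
see the effect - the awk just manufactures numbered lines, so the real
file doesn't matter:)

	awk 'BEGIN { for (i = 1; i <= 5000; i++) print i }' /dev/null |
	(
		head -200 > /dev/null	# eats a whole buffer, prints 200 lines
		head -1			# "should" print 201, but prints a much later line
	)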

Thanks,
Gary
-- 
---------------------------------------------------------------------------
Gary C. Bartlett               NET: garyb@abekrd.co.uk
Abekas Video Systems Ltd.     UUCP: ...!uunet!mcsun!ukc!pyrltd!abekrd!garyb
12 Portman Rd,   Reading,    PHONE: +44 734 585421
Berkshire.       RG3 1EA.      FAX: +44 734 567904
United Kingdom.              TELEX: 847579

mcgrew@ichthous.Eng.Sun.COM (Darin McGrew) (01/30/91)

In article <1671@abekrd.UUCP> garyb@abekrd.UUCP (Gary Bartlett) writes:
->Can someone explain to me what is happening with the following Bourne shell
->script and more importantly how I can get around it:
->
->	#!/bin/sh
->	cat file | (
->		head -200
->		echo "Line 201 follows"
->		head -200
->		echo "Line 401 follows"
->		cat
->	)
->
->...
->It looks like 'head' initially reads in a whole buffer of data from file
->(stdin), prints out the requisite number of lines and then dumps the rest
->of the buffer.  The next 'head' then reads the NEXT buffer....

Yes, head reads a bufferful at a time.  I'd use awk:

	awk '	NR==201	{print "Line 201 follows"}
		NR==401	{print "Line 401 follows"}
			{print}' < file

->Thanks,
->Gary

You're welcome.

                 Darin McGrew     "The Beginning will make all things new,
           mcgrew@Eng.Sun.COM      New life belongs to Him.
       Affiliation stated for      He hands us each new moment saying,
identification purposes only.      'My child, begin again....
				    You're free to start again.'"

krs@uts.amdahl.com (Kris Stephens [Hail Eris!]) (01/31/91)

In article <6925@exodus.Eng.Sun.COM> mcgrew@ichthous.Eng.Sun.COM (Darin McGrew) writes:
>In article <1671@abekrd.UUCP> garyb@abekrd.UUCP (Gary Bartlett) writes:
>->Can someone explain to me what is happening with the following Bourne shell
>->script and more importantly how I can get around it:
>->
>->	#!/bin/sh
>->	cat file | (
>->		head -200
>->		echo "Line 201 follows"
>->		head -200
>->		echo "Line 401 follows"
>->		cat
>->	)
>->
>->...
>->It looks like 'head' initially reads in a whole buffer of data from file
>->(stdin), prints out the requisite number of lines and then dumps the rest
>->of the buffer.  The next 'head' then reads the NEXT buffer....
>
>Yes, head reads a bufferful at a time.  I'd use awk:
>
>	awk '	NR==201	{print "Line 201 follows"}
>		NR==401	{print "Line 401 follows"}
>			{print}' < file

And this might be faster (note, though, that I'll need to left-justify
this to avoid inserting leading white-space).

-- start fragment --
sed \
-e '200a\
Line 201 follows' \
-e '400a\
Line 401 follows' < file
-- end fragment --

I'm making no statement here that the sed call is better than the
awk call, just that if performance is significant, you might want
to try this approach too and compare execution times.
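
If you actually want to race them, something along these lines will do
it (rough sketch; /bin/time reports on stderr, so the > /dev/null only
throws away the data, not the timings):

-- start timing fragment --
time awk 'NR==201 {print "Line 201 follows"}
	NR==401 {print "Line 401 follows"}
		{print}' < file > /dev/null

time sed \
-e '200a\
Line 201 follows' \
-e '400a\
Line 401 follows' < file > /dev/null
-- end timing fragment --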

If, however, the    echo "Line ?01 follows"    in the original example
was a place holder for "I want to do other stuff here, then pick up
processing with the next set of lines", neither the awk nor the sed
calls will allow it, as both simply insert the line-counting messages
into the stream of data from file.

Dog slow though it be, the following will do it:

	#!/bin/sh
	(
	i=1
	while [ $i -lt 201 ]
	do
		read line; echo "$line"
		i=`expr $i + 1`
	done
	: process some stuff here
	while [ $i -lt 401 ]
	do
		read line; echo "$line"
		i=`expr $i + 1`
	done
	: process some more stuff here
	cat -
	) < file

It's only slightly better in ksh, by replacing the i=1 assignment with
typeset -i i=1   and replacing the expr call that increments $i with
((i += 1)).   In either case, mayhem will result if file isn't at
least 400 lines long.
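
Spelled out, the ksh version looks something like this (untested, just
to show where the two substitutions go):

	#!/bin/ksh
	(
	typeset -i i=1			# integer variable, no expr needed
	while [ $i -lt 201 ]
	do
		read line; echo "$line"
		((i += 1))		# ksh built-in arithmetic
	done
	: process some stuff here
	while [ $i -lt 401 ]
	do
		read line; echo "$line"
		((i += 1))
	done
	: process some more stuff here
	cat -
	) < file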

You may be forced into multiple reads of the file to get something
resembling good performance:

	#!/bin/sh
	(
	sed 200q file
	echo "Line 201 follows"
	sed -e '1,200d' -e '400q' file
	echo "Line 401 follows"
	sed '1,400d' file
	)

The saving graces here are that, even though the file is opened three
times, (1) only the first 200 lines are read thrice and the second
200 twice, and (2) one avoids the nearly nightmarish performance of
the while loops in the example preceding this one.  It doesn't hurt,
either, that sed is pretty quick.

Now, let's take it one step further and generalize it into a function...

	#!/bin/sh
	
	#
	# A function to get $2 lines from file $1 starting at $3
	# Only the file ($1) is required
	#
	getlines() {
		file=$1
		count=$2
		start=${3:-1}	# default start at line 1
		if [ ! -r "$file" ]
		then
			echo "getlines: file '$1' not readable" 1>&2
			return 1
		fi
		# Whole file?
		if [ $start -eq 1 -a "$count" = "" ]
		then
			cat $file
			return $?
		fi
		# From start to EOF?
		if [  "$count" = "" ]
		then
			sed -n "$start,\$p" $file
			return $?
		fi
		# Start at line 1 for count lines?
		if [ $start -eq 1 ]
		then
			sed "${count}q" $file
			return $?
		fi
		# We have a start other than 1 and a count
		cut=`expr $start - 1`		# Don't print through $cut
		end=`expr $cut + $count`	# $end is last to print
		if [ $end -le $cut ]
		then
			echo "getlines: bad count($count)/start($start)" 1>&2
			return 1
		fi
		sed -e "1,${cut}d" -e "${end}q" $file
		return $?
		}

	#
	# Mainline code
	#
	file=${1:-file}	# If there's an arg, it's the filename
	wc=`wc -l < $file`
	count=200
	current=1
	while [ $current -le $wc ]
	do
		if [ $current -ne 1 ]
		then
			echo "Next line is $current"
		fi
		if getlines $file $count $current	# args: file, count, start
		then
			current=`expr $current + $count`
		else
			saverc=$?
			echo "$0: getlines returned $saverc" 1>&2
			exit $saverc
		fi
	done

All that's left is to have flags for count and maybe the initial "current".
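
The flag handling might look something like this, in place of the first
few lines of the mainline code (untested sketch; plain Bourne argument
parsing, so it doesn't depend on getopts being built in):

	count=200
	current=1
	while [ $# -gt 0 ]
	do
		case "$1" in
		-c)	count=$2; shift; shift ;;	# -c N: lines per chunk
		-s)	current=$2; shift; shift ;;	# -s N: first line to print
		-*)	echo "usage: $0 [-c count] [-s start] [file]" 1>&2
			exit 2 ;;
		*)	break ;;
		esac
	done
	file=${1:-file}		# anything left over is the filename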

>->Thanks,
>->Gary
>
>You're welcome.

I'll second that!
...Kris
-- 
Kristopher Stephens, | (408-746-6047) | krs@uts.amdahl.com | KC6DFS
Amdahl Corporation   |                |                    |
     [The opinions expressed above are mine, solely, and do not    ]
     [necessarily reflect the opinions or policies of Amdahl Corp. ]

tchrist@convex.COM (Tom Christiansen) (01/31/91)

From the keyboard of krs@amdahl.uts.amdahl.com (Kris Stephens [Hail Eris!]):
:In article <6925@exodus.Eng.Sun.COM> mcgrew@ichthous.Eng.Sun.COM (Darin McGrew) writes:
:>In article <1671@abekrd.UUCP> garyb@abekrd.UUCP (Gary Bartlett) writes:
:>->Can someone explain to me what is happening with the following Bourne shell
:>->script and more importantly how I can get around it:
:>->
:>->	#!/bin/sh
:>->	cat file | (
:>->		head -200
:>->		echo "Line 201 follows"
:>->		head -200
:>->		echo "Line 401 follows"
:>->		cat
:>->	)
:>->
:>->...

:If, however, the    echo "Line ?01 follows"    in the original example
:was a place holder for "I want to do other stuff here, then pick up
:processing with the next set of lines", neither the awk nor the sed
:calls will allow it, as both simply insert the line-counting messages
:into the stream of data from file.
:
:Dog slow though it be, the following will do it:

I really wasn't going to do this, but once I saw things like "dog slow"
and "shell functions" (which many of us don't have) and "mayhem will
result", I just couldn't not give a perl solution.

    while (<>) {		# fetch line into pattern space
	if ($. < 201) {
	    s/foo/bar/g;	# do some other stuff
	} elsif ($. < 401) {
	    tr/a-z/A-Z/;	# do other stuff
	} else {
	    print "line number is $., continuing...\n"; # do final stuff
	} 
    } 

Basically, perl keeps track of your current input line number in the $.
variable (mnemonic: think of dot in editors) just as awk does with NR.
The advantage to using perl is that you can do much more without having to
call other programs, and instead of asking yourself a load of questions
like "does their awk have functions?", "does their sh have functions?",
"did I exceed awk/sh's field limits?", you only have the one question of
whether perl is on the system, and unlike nawk and ksh (although like gawk
and bash), you can put it on your system without shelling out money if the
answer should be no.  You'll also find that perl will be faster than
the sed/sh/awk combo, and often faster than even just one of them.

Please, save the flames for alt.religion.computers.  I'm just trying
to present another possibility of which the original poster may not
have been aware.

--tom
--
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
 he can finally have the whole O/S built-in to his editor like he
 always wanted!" --me (Tom Christiansen <tchrist@convex.com>)

garyb@abekrd.UUCP (Gary Bartlett) (01/31/91)

In <fcPl016n13RO00@amdahl.uts.amdahl.com> krs@uts.amdahl.com (Kris Stephens [Hail Eris!]) writes:
>In article <6925@exodus.Eng.Sun.COM> mcgrew@ichthous.Eng.Sun.COM (Darin McGrew) writes:
>>In article <1671@abekrd.UUCP> garyb@abekrd.UUCP (Gary Bartlett) writes:
>>->...
>>->It looks like 'head' initially reads in a whole buffer of data from file
>>->(stdin), prints out the requisite number of lines and then dumps the rest
>>->of the buffer.  The next 'head' then reads the NEXT buffer....
>>

>If, however, the    echo "Line ?01 follows"    in the original example
>was a place holder for "I want to do other stuff here, then pick up
>processing with the next set of lines", neither the awk nor the sed
>calls will allow it, as both simply insert the line-counting messages
>into the stream of data from file.

This is indeed what I intended - see my last piece of news on the subject.

>Dog slow though it be, the following will do it:
>	#!/bin/sh
>	(
>	i=1
>	while [ $i -lt 201 ]
>	do
>		read line; echo "$line"
>		i=`expr $i + 1`
>	done
>	: process some more stuff here
>	cat -
>	) < file

This is effectively what I started out using - a 'while' loop, an 'expr'
counter, and a couple of 'read's.  Hideously slow!

>You may be forced into multiple reads of the file to get something
>resembling good performance:

>The saving graces here are that, even though the file is opened three
>times, (1) only the first 200 lines are read thrice and the second
>200 twice, and (2) one avoids the nearly nightmarish performance of
>the while loops in the example preceding this one.  It doesn't hurt,
>either, that sed is pretty quick.

The thing is, the file I'm merging from may be very long (ie very many
sed passes).

>Now, let's take it one step further and generalize it into a function...

I DO like the function idea though.

I did actually write my own 'head' (C) program which turned off all buffering
of the stdin before doing any reading.  This did the trick and worked in the
shell script.  It was faster but not greatly so - I guess it had to read every
character individually.  I did try using line-buffering but this did not work.
It still lost data (although not as much as when using the full-buffering of
head).  I'm not overly happy with that solution though - it's not at
all portable.

*** FLASH OF INSPIRATION ***

I have an idea:
- Process the original file by putting the line number at the beginning of
  each line,
- Process the file to be merged so that the merge points are at the beginning
  of each of these lines,
- Cat the two processed files together and pass through 'sort',
- Remove line numbers from beginning of resulting file, QED

It doesn't matter how big either file is.
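
Something like this, perhaps (untested sketch; it assumes the lines to
be merged sit in a second file, here called 'inserts', each already
prefixed with the number of the original line it should follow, and it
needs a newer awk that has sub()):

	#!/bin/sh
	# Key is NNNNNNNN:T, where T=0 for original lines and T=1 for
	# merged lines, so a merged line sorts just after the line it
	# should follow.
	awk '{ printf "%08d:0:%s\n", NR, $0 }' file    >  /tmp/mrg$$
	awk '{ n = $1; sub(/^[0-9][0-9]* /, "")
	       printf "%08d:1:%s\n", n, $0 }' inserts  >> /tmp/mrg$$
	sort /tmp/mrg$$ | sed 's/^[0-9]*:[01]://'	# strip the keys again
	rm -f /tmp/mrg$$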

Thoughts?

Thanks again for some very useful input,
Gary

-- 
---------------------------------------------------------------------------
Gary C. Bartlett               NET: garyb@abekrd.co.uk
Abekas Video Systems Ltd.     UUCP: ...!uunet!mcsun!ukc!pyrltd!abekrd!garyb
12 Portman Rd,   Reading,    PHONE: +44 734 585421
Berkshire.       RG3 1EA.      FAX: +44 734 567904
United Kingdom.              TELEX: 847579

tchrist@convex.COM (Tom Christiansen) (01/31/91)

From the keyboard of garyb@abekrd.UUCP (Gary Bartlett):

[various approaches including special versions of head and other
 nasties deleted, many complaining of being dog slow.]

:Thoughts?

I think this is all a lot of unnecessary complexity; it's not that I
think it impossible using tools from V6 UNIX, but it's more 
cumbersome and slow than it really needs to be.

Consequently, I offer to translate Gary's sed/sh/head/sort/blah script to
perl for him if he wants.  I promise it'll be faster, smaller, simpler,
and a lot more portable than writing a new version of head to do the job.

Sorry, one per customer please. :-)

--tom
--
"Hey, did you hear Stallman has replaced /vmunix with /vmunix.el?  Now
 he can finally have the whole O/S built-in to his editor like he
 always wanted!" --me (Tom Christiansen <tchrist@convex.com>)

martin@mwtech.UUCP (Martin Weitzel) (02/01/91)

In article <1678@abekrd.UUCP> garyb@abekrd.UUCP (Gary Bartlett) writes:
[...]
:*** FLASH OF INSPIRATION ***
:
:I have an idea:
:- Process the original file by putting the line number at the beginning of
:  each line,

Good.

:- Process the file to be merged so that the merge points are at the beginning
:  of each of these lines,

Good.

:- Cat the two processed files together and pass through 'sort',

Depending on what the lines look like, it may also be possible to use
join(1).

:- Remove line numbers from beginning of resulting file, QED

and split the joined lines.
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

krs@uts.amdahl.com (Kris Stephens [Hail Eris!]) (02/02/91)

In article <1678@abekrd.UUCP> garyb@abekrd.UUCP (Gary Bartlett) writes:
>*** FLASH OF INSPIRATION ***
>
>I have an idea:
>- Process the original file by putting the line number at the beginning of
>  each line,
>- Process the file to be merged so that the merge points are at the beginning
>  of each of these lines,
>- Cat the two processed files together and pass through 'sort',
>- Remove line numbers from beginning of resulting file, QED
>
>It doesn't matter how big either file is.
>
>Thoughts?

Hmmm....  If you've got the blocks available on your disk to
handle a second copy of the file, read it once and break it
into (n) pieces, each of your required length.  This would
use pretty much the same number of blocks of storage (plus
at most a block-per-chunk for fractional last blocks) and
some number of inodes (number of lines / lines per chunk).

--- multi-piece processing ---
:
file=${1:-file}
lines=${2:-200}
base=Base

# awk will return the number of subsets it created
subsets=`awk '
NR == 1 {
	i++
	outfile = base "" i
	print > outfile
	next
	}
NR % lines == 1 {
	close(outfile)
	i++
	outfile = base "" i
	print > outfile
	next
	}
	
	{ print >> outfile }
END { close(outfile); print i + 0 }' base=$base lines=$lines $file`

i=1
while [ $i -le $subsets ]
do
	subset="$base$i"
	cat $subset
	rm $subset	# Clean up now
	echo "# End of block $i"
	i=`expr $i + 1`
done > report	# or pipe it to something else (lp?)
--- multi-piece processing ---

The perl-literate will undoubtedly say "If you're going to do this,
use perl instead", but so it goes...   :-)

Anyhow, this reads the data twice (once to break it into pieces of
$lines lines each; once to cat them into the report file).

>Thanks again for some very useful input,
>Gary

You're very welcome -- helps me stretch out my thinking, too.
...Kris
-- 
Kristopher Stephens, | (408-746-6047) | krs@uts.amdahl.com | KC6DFS
Amdahl Corporation   |                |                    |
     [The opinions expressed above are mine, solely, and do not    ]
     [necessarily reflect the opinions or policies of Amdahl Corp. ]

rbp@investor.pgh.pa.us (Bob Peirce #305) (02/04/91)

In article <1671@abekrd.UUCP> garyb@abekrd.UUCP (Gary Bartlett) writes:
>Can someone explain to me what is happening with the following Bourne shell
>script and more importantly how I can get around it:
>
>	cat file | (
>		head -200
>		echo "Line 201 follows"
>		head -200
>		echo "Line 401 follows"
>		cat
>	)
>
>I am trying to use 'head' as a quick way to split up an input stream.  I
>originally used 'read' and an 'expr' counter but this was too slow.
>
>This script loses lines after each 'head'.  eg if file contained a stream of
>numbers, the output would be missing lots of numbers!
>
Why not use

	cat file | split -200
	commands to process x?? files
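
For example (untested sketch; the expr just rebuilds the line numbers
for the marker lines):

	cat file | split -200
	n=0
	for piece in x??
	do
		[ $n -gt 0 ] && echo "Line `expr $n \* 200 + 1` follows"
		cat $piece
		rm -f $piece			# clean up as we go
		n=`expr $n + 1`
	done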

-- 
Bob Peirce, Pittsburgh, PA				  412-471-5320
...!uunet!pitt!investor!rbp			rbp@investor.pgh.pa.us