[unix-pc.general] bsplit?

thad@cup.portal.com (Thad P Floryan) (02/01/90)

ken@cs.rochester.edu (Ken Yap) in <1990Jan30.051455.7500@cs.rochester.edu>
writes:

	Oh come on, what happened to reusing Unix tools?

	Here's a cheap bsplit, done in sh (hooray for redirection on builtins,
	fooey to csh on this point).

		for i in 1 2 3 4 5
		do
			dd bs=10k count=1 of=part$i
		done < foo

	Edit as appropriate.

Nothing's "wrong" with the above; thanks for the example and posting!

But, per "Edit as appropriate", one has to know beforehand how many parts
the original will be split into.  And here's the clinker: your example does
NOT preserve sequential order if there are more than 9 parts, so one cannot
simply do (later, when repacking after a uucp) "zcat part* | ..", because
"part10" collates after "part1" but BEFORE "part2".  In other words, an
"ls part*" would sequence part1, part10, part2, part3, ... , part9, which
is the incorrect order.

The output of bsplit (and my xsplit) preserves collating sequence per partaa,
partab, partac, ... partzz thus preserving the split-order.
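
To make the ordering point concrete, here is a minimal sketch, assuming a
POSIX sh and the mktemp utility. The scratch directory and the twelve empty
files are made up for illustration; only the names matter.

```shell
#!/bin/sh
# Illustrative only: the scratch directory and file names are hypothetical.
LC_ALL=C; export LC_ALL        # fix the collating sequence for the demo
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1

# Twelve empty parts with plain numeric suffixes: part1 ... part12
i=1
while [ "$i" -le 12 ]; do
    : > "part$i"
    i=$((i + 1))
done

# Wildcard expansion is sorted by collating sequence, not numerically.
numeric_order=$(echo part*)
echo "$numeric_order"

# Two-letter suffixes in the style of bsplit: partaa, partab, ...
rm -f part*
for s in aa ab ac ad ae af ag ah ai aj ak al; do
    : > "part$s"
done
alpha_order=$(echo part*)
echo "$alpha_order"
```

The first echo lists part10, part11 and part12 ahead of part2; the second
lists the parts in the order they were created.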

The benefit of "bsplit" is evident when uucp'ing the 40- or 60- or 90-part
distributions from, say, osu-cis.  Consider just the GNU gcc; it's a 20+ part
archive on osu-cis, and the split-sequence is maintained (for zcat and UNIX'
wildcarding) due to "aa", "ab", ..., "bh" suffixes on the filenames.

Thad Floryan [ thad@cup.portal.com (OR) ..!sun!portal!cup.portal.com!thad ]

ken@cs.rochester.edu (Ken Yap) (02/02/90)

> 		for i in 1 2 3 4 5
> 		do
> 			dd bs=10k count=1 of=part$i
> 		done < foo

> 	Edit as appropriate.

> Nothing's "wrong" with the above; thanks for the example and posting!

> But, per "Edit as appropriate", one has to know beforehand how many parts
> the original will be split into, and, here's the clinker to the above, your
> example does NOT preserve sequential order if there are more than 9 parts
> such that one could do (later, when repacking after a uucp) "zcat part* | .."
> because "part10" collates after "part1" but BEFORE "part2".  In other words,
> an "ls part*" would sequence part1, part10, part2, part3, ... , part9  which
> is the incorrect order.

True. When I say edit as appropriate, I really mean add the bells and
whistles as needed. It would take very little shell hacking to add the
features you want.  Probably something along the lines of: get the size
from ls; loop, keeping track of the bytes written so far, a digits and
a tens counter (and a hundreds counter, if you're greedy); increment
as appropriate, exiting the loop when the size of the file has been
reached. Another approach would be to precompute the number of parts
needed and generate fixed width numbers by prepadding with zeros, then
trimming to the final width with sed. I can see some people retching in
the aisles now, but hey, sh scripts are easy to get working. :-)
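
A rough sketch of the second approach (precompute the number of parts, then
zero-pad the suffix), assuming a POSIX sh. The function name shsplit, its
arguments and the 10240-byte default are hypothetical. It takes the size
from wc -c rather than ls, and pads with printf rather than sed; both are
substitutions made for brevity, not what was described above.

```shell
#!/bin/sh
# shsplit: hypothetical helper, not an existing tool.
# usage: shsplit file [prefix [size_in_bytes]]
shsplit() {
    file=$1
    prefix=${2:-part}
    size=${3:-10240}

    total=$(wc -c < "$file") || return 1
    total=$((total + 0))                      # drop any padding from wc
    parts=$(( (total + size - 1) / size ))    # round up
    width=${#parts}                           # digits needed for the suffix

    n=1
    while [ "$n" -le "$parts" ]; do
        suffix=$(printf "%0${width}d" "$n")
        # Each dd inherits the same open file, so it continues where
        # the previous one stopped.
        dd bs="$size" count=1 of="$prefix$suffix" 2>/dev/null || return 1
        n=$((n + 1))
    done < "$file"
}
```

Something like "shsplit foo part 10240" followed later by "cat part* > foo"
should round-trip a regular file. It still cannot read from a pipe, since
it needs the size up front.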

I'm not against C bsplit in any way. I'd probably get that myself if I
needed that function. I just wanted to point out that sh and many other
Unix tools have lots of underused features* and that sometimes writing
a shell script is faster than hacking C.

No doubt somebody will suggest a perl version next. :-)

* Here's another example:

<foo exec

in a shell script will cause the rest of the script to read from file
foo. Similarly for >.
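
A small sketch of both directions, assuming a POSIX sh and mktemp. The
scratch files stand in for "foo", and the redirection is written in the
more common "exec <file" order. Each exec is wrapped in a subshell only so
that the demonstration leaves the calling shell's own stdin and stdout
alone; in a real script it would sit at the top level.

```shell
#!/bin/sh
# Illustrative only: both scratch files are hypothetical.
foo=$(mktemp) || exit 1
log=$(mktemp) || exit 1
printf 'first line\nsecond line\n' > "$foo"

# After "exec <file", every later read in this (sub)shell comes from file.
from_foo=$(
    exec < "$foo"
    read -r a
    read -r b
    echo "$a / $b"
)
echo "$from_foo"

# After "exec >file", every later write in this (sub)shell goes to file.
(
    exec > "$log"
    echo "one"
    echo "two"
)
logged=$(cat "$log")
echo "$logged"

rm -f "$foo" "$log"
```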

les@chinet.chi.il.us (Leslie Mikesell) (02/03/90)

In article <1990Feb1.232040.26182@cs.rochester.edu> ken@cs.rochester.edu writes:
>> 		for i in 1 2 3 4 5
>> 		do
>> 			dd bs=10k count=1 of=part$i
>> 		done < foo
>
>Another approach would be to precompute the number of parts
>needed and generate fixed width numbers by prepadding with zeros, then
>trimming to the final width with sed.

The fixed width numbers are easy with something like: (3 digits)
case $i in
 ?)
  i=00$i
 ;;
 ??)
  i=0$i
 ;;
esac
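
For illustration, the same case statement wrapped in a function and used to
build part names. The name pad3 and the sample numbers are hypothetical.

```shell
#!/bin/sh
# pad3: hypothetical wrapper around the case statement; pads to 3 digits.
pad3() {
    i=$1
    case $i in
        ?)  i=00$i ;;
        ??) i=0$i ;;
    esac
    echo "$i"
}

names=
for n in 1 9 10 99 100; do
    names="$names part$(pad3 "$n")"
done
echo "$names"
```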

The real problem, though, is that you can't feed the script from a
pipe.  dd is almost unique among the unix tools in that it
uses read() rather than fread() and will fail to read the
requested amount if the input pipeline cannot stay ahead.
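
A minimal sketch of the short read, assuming a POSIX sh. The slow writer is
contrived: it emits 4 bytes, pauses for a second, then emits 4 more.

```shell
#!/bin/sh
# dd asks for one 10k block, but its single read() returns as soon as the
# first 4 bytes arrive, and that partial block uses up count=1.
short=$( (printf 'abc\n'; sleep 1; printf 'def\n') 2>/dev/null |
         dd bs=10k count=1 2>/dev/null )
echo "dd copied: $short"
```

On a typical system only the first line gets through, even though far less
than 10k was read.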

>No doubt somebody will suggest a perl version next. :-)
Good idea...
 
Les Mikesell
  les@chinet.chi.il.us

thad@cup.portal.com (Thad P Floryan) (02/03/90)

jbm@uncle.UUCP (John B. Milton) in <679@uncle.UUCP> mentions:

	What about this:

	uucp -r osu-cis!~/gnu/bsplit.c /usr/spool/uucppublic

Thanks!  That file (bsplit) is NOT listed in the "GNU.how-to-get" file; one
must either peruse "ls-lR.Z" or roam osu-cis' directories via ftp or telnet.

Thad Floryan [ thad@cup.portal.com (OR) ..!sun!portal!cup.portal.com!thad ]

ken@cs.rochester.edu (Ken Yap) (02/04/90)

|The real problem, though, is that you can't feed the script from a
|pipe.  dd is almost unique among the unix tools in that it
|uses read() rather than fread() and will fail to read the
|requested amount if the input pipeline cannot stay ahead.

There is a good reason for that: the semantics of reading large blocks
from tape have to be preserved. If stdio were used, the block size
would be whatever stdio happened to use. No doubt dd could be taught to
tell the difference between tape and pipes but nobody wants to mess
with nostalgia. :-)

res@cbnews.ATT.COM (Robert E. Stampfli) (02/06/90)

>The real problem, though, is that you can't feed the script from a
>pipe.  dd is almost unique among the unix tools in that it
>uses read() rather than fread() and will fail to read the
>requested amount if the input pipeline cannot stay ahead.

Yes.  One way of dealing with this, although it may not be the most
efficient, is to change instances of

	... | dd -args | ...
to
	... | dd bs=whatever | dd -args | ...

This has worked in a pinch for me several times.
-- 
Rob Stampfli	/ att.com!stampfli (uucp@work) / kd8wk@w8cqk (packet radio)
614-864-9377	/ osu-cis.cis.ohio-state.edu!kd8wk!res (uucp@home)
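
As an aside on the short-read problem discussed in this thread: some dd
implementations, GNU dd among them, accept iflag=fullblock, which makes dd
keep reading until each input block is full. The sketch below assumes such
a dd is installed; the flag is not available everywhere, and the slow
writer is contrived.

```shell
#!/bin/sh
# Assumes a dd that supports iflag=fullblock (e.g. GNU dd).
# The writer delivers 8 bytes in two bursts one second apart; with
# fullblock, dd waits for the whole 8-byte block before counting it.
full=$( (printf 'abc\n'; sleep 1; printf 'def\n') |
        dd bs=8 count=1 iflag=fullblock 2>/dev/null )
echo "$full"
```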