[comp.unix.questions] Multiple Field Sorts in UNIX

tes@whuts.UUCP (STERKEL) (07/21/87)

<*>
 I am not certain if this is a neophyte question (comp.unix.question)
 or a request for software (comp.sources.wanted). If you respond with
 sources, please, please, please make certain that it is fully
 compatible with vanilla SysV.  Thanks.
 <*>
 Now for the problem:
 I need a multiple field sort that maintains sub-field order.  Let
 me draw a picture:

 File contents
 field 0   field 1   field x  field 2  field y  field 3  field z
 -------   -------   -------  -------  -------  -------  ------
 ...       aaa       ...      ...      ...      ...      ...
 ...       bbb       ...      ...      ...      ...      ...
 ...       bba       ...      aaa      ...      ...      ...
 ...       bba       ...      aab      ...      axa      ...
 ...       bba       ...      aab      ...      aya      ...
 ...       bba       ...      aac      ...      xxx      ...
 ...       bba       ...      aac      ...      xxx      ...
 ...       ccc       ...      ...      ...      ...      ...
 ...       dcc       ...      ...      ...      ...      ...
 ...       ecc       ...      ...      ...      ...      ...
 ...       fcc       ...      aaa      ...      ...      ...
 ...       fcc       ...      bbb      ...      ...      ...
 ...       fcc       ...      ccc      ...      aaa      ...
 ...       fcc       ...      ccc      ...      aab      ...
 etc
 note: "..." denotes that field contents do not affect sort order.

 In narrative form:  Multiple Field Sorts sort on the first field,
 then on the second field if the first field has two or more records
 with identical contents, then on the third field if the second field
 has two or more records with identical contents, then on the
 fourth...etc.

 An inefficient implementation of this has been:
 cat file | sort on field3 | sort on field2 | sort on field1 > sorted

BUT, this only for sorts that are "bubble" and/or "shell". 
Using Sort(1) "scrambles" the previous sorts on each pass,
leaving me with no easy way to use sort(1) to do multiple
field sorts.

Any hints?
-- 
                         Terry Sterkel
        {clyde|harvard|cbosgd|allegra|ulysses|ihnp4}!whuts!tes
              [opinions are obviously only my own]

boykin@custom.UUCP (Joseph Boykin) (07/22/87)

In article <2459@whuts.UUCP>, tes@whuts.UUCP (STERKEL) writes:
> <*>
>  I am not certain if this is a neophyte question (comp.unix.question)
>  or a request for software (comp.sources.wanted). If you respond with
>  sources, please, please, please make certain that it is fully
>  compatible with vanilla SysV.  Thanks.
>  <*>
>  Now for the problem:
>  I need a multiple field sort that maintains sub-field order.  Let
>  me draw a picture:
.....
>  In narrative form:  Multiple Field Sorts sort on the first field,
>  then on the second field if the first field has two or more records
>  with identical contents, then on the third field if the second field
>  has two or more records with identical contents, then on the
>  fourth...etc.

UNIX Sort (as well as our own PC/SORT) allows you to do exactly what you
want.  If you specify more than one sort key on the command line the
file is sorted by the first key, if there is more than one line with
the same information, the second key is used.  There are a maximum of
ten sort keys which may be specified.  If all keys are used and they
call compare equally, the entire line is used for a final check.

I believe this is exactly what you are looking for.  The format of
doing this is as follows

	sort +pos1 [-pos2 [+pos3] [-pos4]]] file...

+pos1 and -pos2 specify the first key with the field starting at
position 1 and ending right before position 2.  If the ending
position is not specified, the end of the physical line is used.

There are a bunch of options which help delimit what a field looks
like, sorting order, etc.  Check sort(1) for details.

Joe Boykin
Custom Software Systems
...necntc!custom!boykin

-- 

Joe Boykin
Custom Software Systems
...necntc!custom!boykin

avr@hou2d.UUCP (Adam V. Reed) (07/22/87)

In article <2459@whuts.UUCP>, tes@whuts.UUCP (STERKEL) writes:
> I need a multiple field sort that maintains sub-field order. 
> Using Sort(1) "scrambles" the previous sorts on each pass,
> leaving me with no easy way to use sort(1) to do multiple
> field sorts.
> 
> Any hints?

Yes. Use cut(1) to merge the relevant fields in the proper order into a
single field, paste(1) the result in front of the original, sort(1) the
result on the merged field, and then cut(1) the merged field away. This
is a classic UNIX one-liner pipeline problem.
						Adam Reed

root@hobbes.UUCP (John Plocher) (07/22/87)

+---- (STERKEL) writes the following in article <2459@whuts.UUCP> ----
|  I need a multiple field sort that maintains sub-field order.  Let
| 
|  An inefficient implementation of this has been:
|  cat file | sort on field3 | sort on field2 | sort on field1 > sorted
| 
| BUT, this only for sorts that are "bubble" and/or "shell". 
| Using Sort(1) "scrambles" the previous sorts on each pass,
| leaving me with no easy way to use sort(1) to do multiple
| field sorts.
+----

I use the following command to sort a "rolodex"(tm) type file into
post office acceptable order (by zip then by route then by name):
(The file is of the form "name","addr",...,"route","zip","phone";
thus the need for 8.1 and 5.1 to skip the quotes for sorting numeric
fields)

    sort -t, -y  +8.1 -8.7 -n  +5.1 -6.0 -d +0 -1 -o names.4 names.4

ie:               field 8       field 5      field 0
	           (zip)   (Rural Route/Box)  (name)

Sort(1) does the stable sorts for you.  This is using Sys5r2.

  - John

ps.   Yes I do backup the database before doing the in-place sort!

-- 
John Plocher uwvax!geowhiz!uwspan!plocher  plocher%uwspan.UUCP@uwvax.CS.WISC.EDU

chris@mimsy.UUCP (Chris Torek) (07/23/87)

In article <758@custom.UUCP> boykin@custom.UUCP (Joseph Boykin) writes:
>UNIX Sort (as well as our own PC/SORT) allows you to do exactly what you
>want.

 ... unless you need a numeric sort on one field but an alphabetical
sorts on another, or if any of the other global sort flags conflict
for two fields.  It should not be too hard to write a stable sort
for such applications, especially if you let /usr/bin/sort do the
hard work.  (Note the merge option!)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	seismo!mimsy!chris

roy@phri.UUCP (Roy Smith) (07/23/87)

In article <2459@whuts.UUCP> tes@whuts.UUCP (STERKEL) writes:
>  I need a multiple field sort that maintains sub-field order.
>  [...]
>  An inefficient implementation of this has been:
>  cat file | sort on field3 | sort on field2 | sort on field1 > sorted
>
> BUT, this only for sorts that are "bubble" and/or "shell".  Using Sort(1)
> "scrambles" the previous sorts on each pass, leaving me with no easy way
> to use sort(1) to do multiple field sorts.

	What you want to do is "sort +2 +1 +0 file > sorted".  The +N
arguments mean to skip N fields.  This is a bit counter-intuitive, but it
boils down to "+2" meaning skip 2 fields, i.e. use the third field.  The
following "+1" and +0" arguments mean use the second and first field as
secondary and tertiary sort keys; exactly what you want (if I understood
your question right).

	The sort(1) man page says that if all keys compare equal, it sorts
on the whole input line.  This is why it "scrambles" your file when you do
a multi-pass sort.  It would be nice if sort had a "stable" option (i.e.
make two lines which compare equal on all specified keys stay in the
original input order; exactly what you need to do the multi-pass sort you
describe) but as far as I know, it doesn't.

	Not being a real sorting guru, I don't know what impact that option
would have on performance.  But I can tell you that the "sort +2 +1 +0" is
a lot faster than "sort +2 | sort +1 | sort +0", unless you happen to be on
a Sequent, and maybe not even then.

	You wanted a SysV answer.  I'll admit that this is a 4.N answer,
but since I don't think sort has changed much (on the outside) since v6,
I'm pretty confident that it works on SysV too.

	The Unix sort(1) utility is a pretty amazingly powerful tool,
especially with the "+N.M" stuff to skip fields and characters within
fields.  Unfortunately, the sort man page is about as obtruse as man pages
get.  I've been reading that man page on and off for about 10 years and I'm
still not sure I understand all the details.
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/23/87)

In article <2459@whuts.UUCP> tes@whuts.UUCP (STERKEL) writes:
> I need a multiple field sort that maintains sub-field order.

The standard UNIX "sort" utility can be asked to compare multiple
keys, which will probably do what you want.  If the original
record order must be retained even when ALL keys match, then add
an additional key field consisting of sequence numbers, sort on
the desired multiple keys followed by the sequence number key,
and finally strip off the sequence numbers.

Sorts that inherently preserve original record order when keys
(all) match are called "stable".  As you discovered, the n-way
merge sorting used by the UNIX "sort" utility is not stable.
My favorite in-core stable sorting technique is "list merge" as
described in Knuth Vol. 3, which is a good place to start if you
need to learn more about sorting techniques.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/23/87)

In article <7651@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
-In article <758@custom.UUCP> boykin@custom.UUCP (Joseph Boykin) writes:
->UNIX Sort (as well as our own PC/SORT) allows you to do exactly what you
->want.
- ... unless you need a numeric sort on one field but an alphabetical
-sorts on another, or if any of the other global sort flags conflict
-for two fields.

I don't see what the problem is.  The bdinr flags are specifiable
per-field as well as globally.

kimcm@olamb.UUCP (Kim Chr. Madsen) (07/23/87)

In article <2459@whuts.UUCP>, tes@whuts.UUCP (STERKEL) writes:
> 
>  An inefficient implementation of this has been:
>  cat file | sort on field3 | sort on field2 | sort on field1 > sorted
> 
> BUT, this only for sorts that are "bubble" and/or "shell". 
> Using Sort(1) "scrambles" the previous sorts on each pass,
> leaving me with no easy way to use sort(1) to do multiple
> field sorts.

You can do it in one pass through standard sort(1), by specifying the
fields in order.

	sort +1 -2 +3 -4 +2 -3 datafile

should sort "datafile" by primary key field 2 secondary key field 4
tertiary key field 3 and ignoring cointents of field 1.

					Kim Chr. Madsen.

kdw1@sphinx.uchicago.edu (Keith Waclena) (07/23/87)

In article <7651@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
!In article <758@custom.UUCP> boykin@custom.UUCP (Joseph Boykin) writes:
!!UNIX Sort (as well as our own PC/SORT) allows you to do exactly what you
!!want.
!
! ... unless you need a numeric sort on one field but an alphabetical
!sorts on another, or if any of the other global sort flags conflict
!for two fields.

Hmm.  My sort(1) man page says that you can apply the -n option (and
-dbfir) to individual sort fields, as in:

	$ sort +2n -3 +5df -6 +7n  file

I wonder: is this System V specific (we run SysVr2.1)?  Do you BSD users
have these options?

						Keith

--
Keith Waclena             University of Chicago Graduate Library School
                          1100 E. 57th Street, Chicago, Illinois   60637

                          ...ihnp4!gargoyle!sphinx!kdw1
#include <disclaimer.h>   kdw1@sphinx.UChicago.{EDU,BITNET,MAILNET,CSNET}

root@hobbes.UUCP (John Plocher) (07/23/87)

+---- Chris Torek writes the following in article <7651@mimsy.UUCP> ----
| >UNIX Sort  allows you to do exactly what you want.
|  ... unless you need a numeric sort on one field but an alphabetical
| sorts on another, or if any of the other global sort flags conflict
| for two fields.
+----

Oops!

You must have missed the fine print buried in the middle of the man page :-)
The manual page for Sys5r2 sort(1) states: (after giving the options
for dictionary order, upper/lower case folding, ASCII only, "month", and
numeric compares) :

  "When ordering options appear before restricted sort key specifications, the
  requested ordering rules are applied globally to all sort keys.  When
  attached to a specific sort key (described below), the specified ordering
  options override all global ordering options for that key.

  "The notation +pos1 -pos2 restricts a sort key to one beginning at pos1
  and ending at pos2.  ...


In the 7th edition manual which "reflects the state of the Berkeley system,
December 1979" it states:

  "The notation +pos1 -pos2 restricts a sort key to one beginning at pos1
  and ending at pos2.  Pos1 and pos2 each have the form m.n, optionally 
  followed by one or more of the flags *bdfinr*, where m tells a number of
  fields to skip from the beginning of the line and n tells a number of chars
  to skip further.  If any flags are present they override all global ordering
  options for this key.  ...

I hope that the behavior of sort(1) has not changed in 4.3BSD!

   - John

-- 
John Plocher uwvax!geowhiz!uwspan!plocher  plocher%uwspan.UUCP@uwvax.CS.WISC.EDU

chris@mimsy.UUCP (Chris Torek) (07/23/87)

>In article <7651@mimsy.UUCP> chris@mimsy.UUCP I wrote:
>... unless ... global sort flags conflict for two fields.

In article <6153@brl-smoke.ARPA> gwyn@brl-smoke.ARPA (Doug Gwyn) replies:
>The bdinr flags are specifiable per-field as well as globally.

Yes, one can indeed specify flags for individual fields.  I missed
that when I scanned the manual; I saw `global sort flags' and alarm
bells went off.  I do think the manual entry could be a bit clearer
about this (although the 4.3BSD entry does use an example with `+2n').
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	seismo!mimsy!chris

guy%gorodish@Sun.COM (Guy Harris) (07/23/87)

> Hmm.  My sort(1) man page says that you can apply the -n option (and
> -dbfir) to individual sort fields, as in:
> 
> 	$ sort +2n -3 +5df -6 +7n  file
> 
> I wonder: is this System V specific (we run SysVr2.1)?  Do you BSD users
> have these options?

*V7* users had those options, which is why both BSD and S5 users have
them.  From the 4.2BSD manual page:

	The notation "+'pos1' -'pos2'" restricts a sort key to a field
	beginning at "pos1" and ending just before "pos2".  "Pos1" and
	"pos2" each have the form "'m'.'n'", optionally followed by one or
	more of the flags "bdfinr", where "m" tells a number of fields
	to skip from the beginning of the line and "n" tells a number of
	characters to skip further.  *If any flags are present they override
	all the global ordering options for this key.*  If the "b" option
	is in effect "n" is counted from the first nonblank in the field;
	"b" is attached independently to "pos2".
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guyoups:uptions,h

boykin@custom.UUCP (Joseph Boykin) (07/24/87)

In article <7651@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <758@custom.UUCP> boykin@custom.UUCP (Joseph Boykin) writes:
> >UNIX Sort (as well as our own PC/SORT) allows you to do exactly what you
> >want.
> 
>  ... unless you need a numeric sort on one field but an alphabetical
> sorts on another, or if any of the other global sort flags conflict
> for two fields.  It should not be too hard to write a stable sort
> for such applications, especially if you let /usr/bin/sort do the
> hard work.  (Note the merge option!)
> -- 
> In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
> Domain:	chris@mimsy.umd.edu	Path:	seismo!mimsy!chris

As I said in my original posting, UNIX Sort will do exactly what he
wants, this includes the possibility of doing a numeric sort on one
field and alphabetical on another.  Sort has global options which
effect how the sort is done, i.e. do a numeric sort on all the specified
fields.  Each field may specify it's own sub-options, each of which
override the global options.  Specifically a sort key has the form
m.no where 'm' is the 'm'th field (0 relative), 'n' is the number of characters
from the start of that field and 'o' is zero or more of the flags
b d f i n or r.  If any of these are present they override the
global ordering options (but not the non-ordering options).

Joe Boykin
Custom Software Systems
...necntc!custom!boykin

m5@bobkat.UUCP (Mike McNally ) (07/24/87)

In article <2459@whuts.UUCP> tes@whuts.UUCP (STERKEL) writes:
><*>
> I need a multiple field sort that maintains sub-field order.  Let
> me draw a picture:

Try the sort(1) program.  It can be given more than one field on
which to sort.






-- 
Mike McNally, mercifully employed at Digital Lynx ---
    Where Plano Road the Mighty Flood of Forest Lane doth meet,
    And Garland fair, whose perfumed air flows soft about my feet...
uucp: {texsun,killer,infotel}!pollux!bobkat!m5 (214) 238-7474

allbery@ncoast.UUCP (Brandon Allbery) (07/29/87)

As quoted from <6153@brl-smoke.ARPA> by gwyn@brl-smoke.ARPA (Doug Gwyn ):
+---------------
| I don't see what the problem is.  The bdinr flags are specifiable
| per-field as well as globally.
+---------------

Yes, sort can do it all.

			...but its man page does nothing.  Will someone who
knows all of what sort can do *please* write and post a PD man page for it?
Something more readable than the trash that AT&T wrote?
-- 
 Brandon S. Allbery, moderator of comp.sources.misc and comp.binaries.ibm.pc
  {{harvard,mit-eddie}!necntc,well!hoptoad,sun!cwruecmp!hal}!ncoast!allbery
ARPA: necntc!ncoast!allbery@harvard.harvard.edu  Fido: 157/502  MCI: BALLBERY
   <<ncoast Public Access UNIX: +1 216 781 6201 24hrs. 300/1200/2400 baud>>

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/31/87)

In article <3679@ncoast.UUCP> allbery@ncoast.UUCP (Brandon Allbery) writes:
>Will someone who
>knows all of what sort can do *please* write and post a PD man page for it?
>Something more readable than the trash that AT&T wrote?

Actually, AT&T rewrote the "sort" manual entry for SVR2.
Unfortunately, in the process of clarifying things,
they broke some of the specs.  The SVR3 manual corrected this.
It now seems no worse than most manual entries;
i.e. the specification is unambiguous (but is not a tutorial).