[comp.unix.questions] sort question

tcianflo@nugipsy.UUCP (05/13/87)

Take a typical name and address file with many different names,
but some common names, such as:

Name1, J
Smith, A  |
Smith, C  | - family #1
Smith, E  |
Name2, R
Smith, B  |
Smith, D  | - family #2
Smith, F  |
Name3, P

Using the sort utility, and sorting alphabetically by last name,
you get family #1 and family #2 interleaved, of course, by virtue
of their same last names and first initials.
How would I set up the file so that I could get the sort
utility to:

	1) keep family members together without mixing up names
	with other families of the same last name

	2) force the listing of a family group to be in a specified
	order, for example, head-of-house followed by children.

Is there some scheme using additional fields in the file,
or perhaps some other utility I should be using?

Thanks in advance for any help you can send my way.
-- 
=> Regards, Tom Cianflone @ Gould Computer Systems Division <=
=>   ...!{seismo,sun,pur-ee,brl-smoke}!gould!tcianflone     <=
=>    ...!ihnp4!{codas,allegra}!novavax!gould!tcianflone    <=
=> NOTE: Disregard header info. Email to above paths only.  <=

boykin@custom.UUCP (Joseph Boykin) (05/15/87)

In article <303@nugipsy.UUCP>, tcianflo@nugipsy.UUCP (Tom Cianflone) writes:
> Using the sort utility, and sorting alphabetically by last name,
> you get family #1 and family #2 interleaved, of course, by virtue
> of their same last names and first initials.
> How would I set up the file so that I could get the sort
> utility to:
> 	1) keep family members together without mixing up names
> 	with other families of the same last name
> 	2) force the listing of a family group to be in a specified
> 	order, for example, head-of-house followed by children.

You would, obviously, have to place some other information in the
file.  However, sort does allow you to specify multiple sort fields.
It will sort using the first key, then sort lines which
are common by the subsequent sort keys.  That is, you could have a
file which has:
	Last	First	X

where 'X' is A for head of household, B for spouse, C for kids,
(D for pets!), etc.  To sort by family, and have it sorted by
"rank" within family, issue the following command:

	sort +0 -1 +2 -3 filename

This can be expanded to provide the information you were looking
for such as keeping people within the same family together, etc.
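
A minimal sketch of the two-key ordering described above, written in Python as a model of the comparison rather than as what sort(1) does internally. The sample records and the name family_rank_key are invented for illustration, and fields are assumed to be separated by white space. In current POSIX syntax the same two keys would normally be spelled "sort -k 1,1 -k 3,3 filename".

```python
# Model of "sort +0 -1 +2 -3": compare on field 1 first, then on field 3.
# The records are illustrative only; they are not from the original post.

def family_rank_key(line):
    fields = line.split()
    # (last name, rank letter); the whole line breaks any remaining ties,
    # roughly as sort(1) falls back to comparing entire lines.
    return (fields[0], fields[2], line)

records = [
    "Smith Ted C",
    "Jones Ann B",
    "Smith Bob A",
    "Jones Al A",
    "Smith Doris B",
]

for line in sorted(records, key=family_rank_key):
    print(line)
```

With only these two keys, two different Smith families would still share one block, so the file needs some further field to tell them apart.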


Joe Boykin
Custom Software Systems
...{necntc, frog}!custom!boykin

maciolek@gecrd1.UUCP (05/15/87)

In article <651@custom.UUCP> boykin@custom.UUCP (Joseph Boykin) writes:
>In article <303@nugipsy.UUCP>, tcianflo@nugipsy.UUCP (Tom Cianflone) writes:
>> How would I set up the file so that I could get the sort
>> utility to:
>> 	1) keep family members together without mixing up names
>> 	with other families of the same last name
>> 	2) force the listing of a family group to be in a specified
>> 	order, for example, head-of-house followed by children.
>
>You would, obviously, have to place some other information in the
>file.  However, sort does allow you to specify multiple sort fields.
>It will sort using the first key, then sort lines which
>are common by the subsequent sort keys.  That is, you could have a
>file which has:
>	Last	First	X
>
>where 'X' is A for head of household, B for spouse, C for kids,
>(D for pets!), etc.  To sort by family, and have it sorted by
>"rank" within family, issue the following command:
>
>	sort +0 -1 +2 -3 filename
>
>This can be expanded to provide the information you were looking
>for such as keeping people within the same family together, etc.

Well, this won't work either.  You'll wind up with something like

Smith Alice A
Smith Bob A
Smith Carol A
Smith Doris B
Smith Ted C
Smith Bill C

and won't know whether Ted Smith is a child of Alice, Bob or Carol.  To
make sure that families get grouped according to head-of-household, you
will have to be a little fancier.  Here are a couple of ideas...

First - all family names are prefaced by the head-of-household for that
family.  This is quite redundant and space-consuming, but it gets the job
done...unless two families have heads-of-household with the same name.

Thus, a file could be entered:

Smith Alice
Smith Alice	C Smith Bill
Smith Bob
Smith Bob	B Smith Doris
Smith Bob	C Smith Ted
Smith Carol

As I said, this makes for a lot of redundancy, especially in families
with many spouses :-) or children, since all family-members who are not
heads-of-household are prefixed by the name of their head-of-household. 
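
A rough Python sketch of why the prefix scheme groups families: with the head-of-household name at the start of every line, a plain string sort is enough. The tab between prefix and member follows the example above; the helper display_name is hypothetical and only strips the prefix again for printing.

```python
# Each non-head line is "<head of household>\t<rank> <member>"; a head's own
# line is just the name. Plain string comparison then places every member
# directly under the matching head. Records follow the example in the post.

def display_name(line):
    # Drop the head-of-household prefix and the rank letter for printing.
    if "\t" not in line:
        return line
    _head, member = line.split("\t", 1)
    return member.split(" ", 1)[1]

entries = [
    "Smith Bob\tC Smith Ted",
    "Smith Carol",
    "Smith Alice\tC Smith Bill",
    "Smith Bob",
    "Smith Bob\tB Smith Doris",
    "Smith Alice",
]

ordered = sorted(entries)
for line in ordered:
    print(display_name(line))
```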

Another solution that seems more feasible would be to group all family
members onto the same LINE as the head-of-household.  This is significant,
of course, because sort(1) has this notion of "lines" which are delimited
by newline characters.  Everything between a pair of newlines is treated
as a unit. SO...all you have to do is write a little filter which reads a
file like this:

Smith Bob A
Smith Doris B
Smith Ted C
Smith Carol A
Smith Alice A
Smith Bill C

and produces a file like this:

Smith Bob A ^MSmith Doris B ^MSmith Ted C
Smith Carol A
Smith Alice A ^MSmith Bill C

by joining all the lines which follow an 'A' (head-of-household) line UNTIL
another head-of-household is encountered.  I use control-M as a separator
here, though any character which would not appear in the text would be fine.

Now, when you run this file through sort(1), the output will be sorted by
head-of-household, with identically-named heads of household sorted next by
spouse, then by child.

Take the output and run it through an inverse of your original filter which
converts ^M's back to newlines:

filter1 <infile | sort | filter2 >outfile

An advantage is that you don't have to look up the syntax for specifying
the key field positions to sort(1).  The caveat here is that in the initial
file, the head of household always has to precede spouse and child entries,
AND the spouse and child entries will not be sorted under head-of-household.
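
The filter1 | sort | filter2 pipeline can be sketched in Python as follows. This is one possible reading of the two filters, not code from the post: the function names are made up, the rank letter is assumed to be the last field on each line, and a bare control-M is used as the separator (the post's sample also shows a space before it).

```python
SEP = "\r"  # control-M; any character that never occurs in the data would do

def join_families(lines):
    """filter1: fold each family onto the line of its 'A' (head) record."""
    groups = []
    for line in lines:
        fields = line.split()
        # Assumption: the rank letter is the last field of every record.
        if not groups or (fields and fields[-1] == "A"):
            groups.append([line])
        else:
            groups[-1].append(line)
    return [SEP.join(group) for group in groups]

def split_families(lines):
    """filter2: turn the separators back into line breaks."""
    out = []
    for line in lines:
        out.extend(line.split(SEP))
    return out

infile = [
    "Smith Bob A", "Smith Doris B", "Smith Ted C",
    "Smith Carol A",
    "Smith Alice A", "Smith Bill C",
]

for line in split_families(sorted(join_families(infile))):
    print(line)
```

As the caveat above says, members keep whatever order they had in the input; only the joined family lines are sorted.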

Having now beaten this subject to death, I hope I don't see umpty-zillion
followups by nit-pickers.  And thank you for your support.

-- 
Mike Maciolek	    seismo!rochester!pt.cs.cmu.edu!cadre!pitt!gecrd1!maciolek
-consulting for-
General Electric

"Epoxy can be cured."

dan@hrc.UUCP (Dan Troxel) (05/04/89)

How can I use the 'sort' command, to sort by two fields?
Example:

_4 test _E                  _1 test _E
_2 test _E    should be     _2 test _E
_3 test _F                  _4 test _E
_1 test _E                  _3 test _F
 ^       ^
 ^       ^
  sorted
    by
-- 
Dan Troxel @ Handwriting Research Corporation                  WK 1-602-957-8870
Camelback Corporate Center  2821 E. Camelback Road  Suite 600  Phoenix, AZ 85016
ncar!noao!asuvax!hrc!dan                                  hrc!dan@asuvax.asu.edu

mchinni@pica.army.mil (Michael J. Chinni, SMCAR-CCS-E) (05/04/89)

> From: Dan Troxel <dan@hrc.uucp>
> Subject: sort question
> Date: 3 May 89 20:15:01 GMT
> To:       info-unix@sem.brl.mil
>
> How can I use the 'sort' command, to sort by two fields?
> Example:
>
> _4 test _E                  _1 test _E
> _2 test _E    should be     _2 test _E
> _3 test _F                  _4 test _E
> _1 test _E                  _3 test _F
>
Try: sort +10 -11 +2 -3 filename

This tells sort to sort first on a field starting in col. 10 and ending just
before col. 11, and secondly a field starting in col. 2 and ending just before
col. 3.

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
			    Michael J. Chinni
      Chief Scientist, Simulation Techniques and Workplace Automation Team
	US Army Armament Research, Development, and Engineering Center
 User to skeleton sitting at cobweb    () Picatinny Arsenal, New Jersey  
   and dust covered terminal and desk  () ARPA: mchinni@pica.army.mil
    "System been down long?"           () UUCP: ...!uunet!pica.army.mil!mchinni
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/

davidsen@sungod.steinmetz (William Davidsen) (05/04/89)

In article <199448@hrc.UUCP> dan@hrc.UUCP (Dan Troxel) writes:
| How can I use the 'sort' command, to sort by two fields?
| Example:
| 
| _4 test _E                  _1 test _E
| _2 test _E    should be     _2 test _E
| _3 test _F                  _4 test _E
| _1 test _E                  _3 test _F

  Try "sort +2"

  Look at the +n.m and -n.m stuff. No you *don't* need "+2 +0", read the
man page...
	bill davidsen		(davidsen@crdos1.crd.GE.COM)
  {uunet | philabs}!crdgw1!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

jckfield@ihlpb.ATT.COM (Kelvin Fielding) (05/05/89)

In article <199448@hrc.UUCP>, dan@hrc.UUCP (Dan Troxel) writes:
> How can I use the 'sort' command, to sort by two fields?
> Example:
> 
> _4 test _E                  _1 test _E
> _2 test _E    should be     _2 test _E
> _3 test _F                  _4 test _E
> _1 test _E                  _3 test _F
>  ^       ^
>  ^       ^
>   sorted
>     by
> -- 

sort +0 -0 +2 <filename>

will do the trick.

Explanation:
		sort on field 1 +0 and ending on field 1 -0 then
		sort on field 1 +2.

 _   ,      _
' ) /      //
 /-<   _  // , __o  ____
/   ) </_</__\/ <__/ / < o

decot@hpisod2.HP.COM (Dave Decot) (05/05/89)

> How can I use the 'sort' command, to sort by two fields?
> Example:
> 
> _4 test _E                  _1 test _E
> _2 test _E    should be     _2 test _E
> _3 test _F                  _4 test _E
> _1 test _E                  _3 test _F
>  ^       ^
>  ^       ^
>   sorted
>     by

sort +0 -1 +2
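
The order of the keys matters for the sample data. The Python sketch below compares the two possible orders; it reads the "should be" column of the question as "third field first, then first field", which is an interpretation rather than something the question states outright. The function names are invented for illustration.

```python
rows = ["_4 test _E", "_2 test _E", "_3 test _F", "_1 test _E"]

def by_third_then_first(line):
    # Primary key: field 3; secondary key: field 1.
    f = line.split()
    return (f[2], f[0])

def by_first_then_third(line):
    # Primary key: field 1; secondary key: field 3.
    f = line.split()
    return (f[0], f[2])

print(sorted(rows, key=by_third_then_first))
print(sorted(rows, key=by_first_then_third))
```

Only the first ordering reproduces the requested output (_1, _2, _4, then _3). Because every first field in the sample is different, a sort that takes field 1 as its primary key never consults the second key.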

guy@auspex.auspex.com (Guy Harris) (05/06/89)

>Try: sort +10 -11 +2 -3 filename
>
>This tells sort to sort first on a field starting in col. 10 and ending just
>before col. 11,

No, it doesn't, at least not if you're talking about the standard UNIX
versions of "sort".  It tells "sort" to sort first on the 10th *field*
(that is, the 11th field on the line - after all, it's written in C
:-)).  Fields are normally separated by white space; with the "-t" flag,
they're separated by tab columns.
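
The zero-based field counting can be modelled in a few lines of Python. This is a simplified sketch, not the real parser: old_style_key is a made-up name, and it ignores the fact that sort(1) keeps the blanks in front of a field as part of the key unless told otherwise.

```python
def old_style_key(line, start, end):
    """Fields selected by '+start -end': skip `start` fields, stop after `end`."""
    return line.split()[start:end]

line = "_4 test _E"
print(old_style_key(line, 0, 1))    # ['_4']
print(old_style_key(line, 2, 3))    # ['_E']
print(old_style_key(line, 10, 11))  # []  (this line has no 11th field)
```

On three-field lines the "+10 -11" key is therefore empty for every record.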

gph@hpsemc.HP.COM (Paul Houtz) (05/09/89)

guy@auspex.auspex.com (Guy Harris) writes:

>>Try: sort +10 -11 +2 -3 filename
>>
>>This tells sort to sort first on a field starting in col. 10 and ending just
>>before col. 11,
>
>No, it doesn't, at least not if you're talking about the standard UNIX
>versions of "sort".  It tells "sort" to sort first on the 10th *field*
>(that is, the 11th field on the line - after all, it's written in C
>:-)).  Fields are normally separated by white space; with the "-t" flag,
>they're separated by tab columns.

Right.  There is no way to do a true column sort using this utility as you
can on IBM or MPE systems and here is why:   Sort requires a FIELD DELIMITER
character.   That means that there is SOME character that will never be 
sorted.   If you need to sort data from a foreign file about which you have
little information except column numbers to sort on, you are (as far as I
know) out of luck with 'sort'.

If, however, you do know of some character which will NEVER appear as 
data in a file, then you can do a column sort.   For instance, say you
know that the character '^' never appears in your data file.  Then you
can simply say:

sort -t^ +0.0 -0.4

This command assumes that there is only ONE field in the file, and that is
the entire record.  It then sorts the 0+0th column of the 0+0th field 
thru the 0+4th column of the 0+0th field, actually accomplishing a sort
of column 1 thru 4.
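
When the whole record is a single field, a "+0.a -0.b" pair behaves like a character slice. A small Python model, assuming zero-based offsets with an exclusive end (the function name and the sample lines are invented):

```python
def column_key(line, start, end):
    """Model of '+0.start -0.end' with no field splitting: line[start:end]."""
    return line[start:end]

lines = ["zzzzB tail", "aaaaC tail", "mmmmA tail"]

# Characters 1 through 4 of each line (offsets 0..3).
print(sorted(lines, key=lambda l: column_key(l, 0, 4)))
# Character 5 only (offset 4).
print(sorted(lines, key=lambda l: column_key(l, 4, 5)))
```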

Here is the worst one I have seen on Unix.  I converted this myself from a
sort done on an IBM System/34.  This is a good example of a COMMON type of
sort done in the commercial world which you never see on Unix:

sort -dt'\012' +0.6 -0.8 +0.13 -0.15 +0.15r -0.17r +0.8 -0.13  DISK-SUMARY >SUM1

This guy sorts the summary file using the newline character as a field 
delimiter (i.e., no fields), and you can tell what column ranges are 
being sorted by subtracting 1 from the 'x' field of the 0.x parms.  

It sorts the 5  thru 7  columns in ascending order, 
         the 12 thru 14 columns in ascending order,
         the 14 thru 16 columns in DESCENDING order, (the "r" after the column)
    then the 7  thru 12 columns in ascending order.
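
For readers who want to experiment with a command of this shape, here is a hedged Python model of its four keys. Each "+0.a -0.b" pair is taken literally as the slice line[a:b], the "r" suffix as a descending key, and the -d flag is left out. The 17-character record layout and all names are hypothetical, and records whose keys all compare equal simply keep their input order here, whereas sort(1) would fall back to comparing whole lines.

```python
# (start, end, descending) per key, in priority order, read off
# "+0.6 -0.8 +0.13 -0.15 +0.15r -0.17r +0.8 -0.13".
# The -d (dictionary order) flag is NOT modelled.
KEYS = [(6, 8, False), (13, 15, False), (15, 17, True), (8, 13, False)]

def multi_column_sort(lines, keys=KEYS):
    result = list(lines)
    # Stable sorts applied from the least significant key to the most
    # significant one give the same order as one multi-key comparison.
    for start, end, descending in reversed(keys):
        result.sort(key=lambda line: line[start:end], reverse=descending)
    return result

def record(k1, k4, k2, k3):
    # Hypothetical layout: 6 filler characters, then the key columns in
    # the order they occupy in the record.
    return "......" + k1 + k4 + k2 + k3

sample = [
    record("BB", "aaaaa", "00", "AA"),
    record("AA", "aaaaa", "02", "ZZ"),
    record("AA", "xxxxx", "01", "YY"),
    record("AA", "xxxxx", "01", "ZZ"),
]

for line in multi_column_sort(sample):
    print(line)
```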

If anyone ever tries to tell you that UNIX is user friendly, you can now
barf on them.  

Paul Houtz
HP Technology Access Center
10670 N. Tantau Avenue
Cupertino, Ca 95014
(408) 725-3864
hplabs!hpda!hpsemc!gph 
gph%hpsemc@hplabs.HP.COM

chris@mimsy.UUCP (Chris Torek) (05/10/89)

In article <810050@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>Right.  There is no way to do a true column sort using this utility as you
>can on IBM or MPE systems and here is why:   Sort requires a FIELD DELIMITER
>character.   That means that there is SOME character that will never be 
>sorted.

But (as you yourself point out) you can set the field delimiter to
newline, effectively making it vanish, then use the 0.n format to
specify column n.

>Here is the worst one I have seen on Unix.  I converted this myself from a
>sort done on an IBM System/34.  This is a good example of a COMMON type of
>sort done in the commercial world which you never see on Unix:
>
>sort -dt'\012' +0.6 -0.8 +0.13 -0.15 +0.15r -0.17r +0.8 -0.13 ...
>
>This guy sorts the summary file using the newline character as a field 
>delimiter (i.e., no fields), and you can tell what column ranges are 
>being sorted by subtracting 1 from the 'x' field of the 0.x parms.  

Or simply think of columns as numbered from 0 (if you count from 0
to 1023 on your fingers, as I do :-) ).

>If anyone ever tries to tell you that UNIX is user friendly, you can now
>barf on them.  

Why?

(Actually, Unix is not *meant* to be `user friendly'---if that means
`taking occasional users by the hand and leading them from each little
stepping stone on to the next'.  It is meant to get the job done
simply, tersely, without back-talk.  If you use a system every day,
you can get tired of wading through six levels of menus.  And if Unix
looks a little old and creaky in the user-interface area, well, it
*was* designed around printing terminals and dumb CRTs.  But then
again, that is all I have at home.  [An H19 with the Heath ROMs can
hardly be called clever :-) .])

If this sort of thing is not done often in Unix, why not?  Perhaps
because it is a bad idea not to delimit fields.  (I believe that any
fixed limit---this includes fixed field widths---is always too small.
I sometimes wonder what IBMers do about people with last names like
`de Martinesquez y de la Capillostraglio'; I *know* what they do with
people who, like me, put down an initial for a first name and a name
for a middle initial.)  But if, like Houtz, you are forced to make the
best of a bad design, and you dislike all the `+0.x -0.y', instead of
sulking, you *could* write a small shell script to convert whatever
column format you prefer into what sort requires.
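
A converter of the kind suggested here might look like the sketch below. It is written in Python rather than as a shell script, and it assumes one particular input convention: columns counted from 1, both ends inclusive, with an optional "r" for descending. The function name is invented.

```python
def to_sort_args(ranges):
    """Turn 1-based inclusive (first, last[, 'r']) column ranges into +0.x -0.y."""
    args = []
    for first, last, *flags in ranges:
        flag = "".join(flags)
        # "+0.x" skips x characters, so a range starting at column `first`
        # needs x = first - 1; "-0.y" stops after y characters.
        args += [f"+0.{first - 1}{flag}", f"-0.{last}{flag}"]
    return args

print(" ".join(to_sort_args([(7, 8), (14, 15), (16, 17, "r"), (9, 13)])))
# +0.6 -0.8 +0.13 -0.15 +0.15r -0.17r +0.8 -0.13
```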
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

gph@hpsemc.HP.COM (Paul Houtz) (05/17/89)

chris@mimsy.UUCP (Chris Torek) writes:

>In article <810050@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>>Right.  There is no way to do a true column sort using this utility as you
>>can on IBM or MPE systems and here is why:   Sort requires a FIELD DELIMITER
>>character.   That means that there is SOME character that will never be 
>>sorted.
>
>But (as you yourself point out) you can set the field delimiter to
>newline, effectively making it vanish, then use the 0.n format to
>specify column n.

   Wouldn't it be nice if the whole world spoke Unix.  It would make any
minor user Unfriendliness seem like such a minor issue.  Oh well.  Unfortunately
there will probably always be systems out there that need to talk to non-
unix systems.    

   The problem with having to set the field delimiter is that you have to
decide what to set it TO.  Now, if you are reading from a file that has 
binary data in it, then it is possible that a newline character could appear
in the binary data.  This seems to me like it might be a problem.   Sort
would think it found the end of line.

   I can write a sort program that will do this column sorting for me, but
what a pain.  It's too bad there isn't one for unix, like there is for 
all the other major operating systems.

   (On the other hand, I'll bet that some third party out there has already
written a true column sort for unix.  I just haven't found it yet.  Any
takers?)

gph@hpsemc.HP.COM (Paul Houtz) (05/17/89)

I wrote:

> Here is the worst one I have seen on Unix.  I converted this myself from a
>sort done on an IBM System/34.  This is a good example of a COMMON type of
>sort done in the commercial world which you never see on Unix:
>
>sort -dt'\012' +0.6 -0.8 +0.13 -0.15 +0.15r -0.17r +0.8 -0.13 DISK-SUMARY >SUM1
>
>This guy sorts the summary file using the newline character as a field 
>delimiter (i.e., no fields), and you can tell what column ranges are 
>being sorted by subtracting 1 from the 'x' field of the 0.x parms.  
>
> It sorts the 5  thru 7  columns in ascending order, 
> the 12 thru 14 columns in ascending order,
> the 14 thru 16 columns in DESCENDING order, (the "r" after the column)
> then the 7  thru 12 columns in ascending order.

Dr. T. Andrews, Systems, CompuData, Inc. DeLand, writes:

>Ah, yes, that's an ugly command.  Now, what is the command to run the
>general "sort" program on the "friendly" op sys where you would
>prefer, performing the same sort?

Okay, in MPE, you do this:

sort
input DISK-SUMARY
output SUM1
key 5,7;12,14;14,16,DESC;7,12
end

The "key" parm says to sort column 5 thru 7 ascending, 
                                   12 thru 14 ascending
                                   14 thru 16 descending,
                                    7 thru 12 ascending.

That seems much clearer to me.

chris@mimsy.UUCP (Chris Torek) (05/17/89)

In article <810054@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>... if you are reading from a file that has binary data in it, then it
>is possible that a newline character could appear in the binary data.
>This seems to me like it might be a problem.   Sort would think it
>found the end of line.

If your file is of binary data, you have more of a problem than that.
Sort(1) sorts ASCII text files, not binary files.  (Numeric sorts are
done by conversion to and from numeric values.)

(Somehow this argument seems rather like saying that quicksort is bad
because if you sort nearly-sorted lists, it runs $O(n^2)$.  Indeed it
does, but that just means you use a different algorithm [Shell sorts
work well; or if it is almost completely sorted, a bubble sort may
outperform anything else!].)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

gph@hpsemc.HP.COM (Paul Houtz) (05/17/89)

chris@mimsy.UUCP (Chris Torek) writes:
>In article <810054@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>>... if you are reading from a file that has binary data in it, then it
>>is possible that a newline character could appear in the binary data.
>>This seems to me like it might be a problem.   Sort would think it
>>found the end of line.
>
>If your file is of binary data, you have more of a problem than that.
>Sort(1) sorts ASCII text files, not binary files.  (Numeric sorts are
>done by conversion to and from numeric values.)

   What unix sort do you use if you have a data file with mixed binary
and ascii fields?

   IBM has a number of sort utilities that will do this.

   The MPE and MPE XL sort utility handles this case fine.

   The VMS sort utility handles this too.

   Unless there is a sort utility on Unix that I haven't heard of, 
I don't think unix does this.   (Please don't tell me that it isn't 
a good idea to mix ascii and binary data in the same file).

morrell@hpsal2.HP.COM (Michael Morrell) (05/18/89)

/ hpsal2:comp.unix.questions / gph@hpsemc.HP.COM (Paul Houtz) / 10:04 am  May 16, 1989 /
   The problem with having to set the field delimiter is that you have to
decide what to set it TO.  Now, if you are reading from a file that has 
binary data in it, then it is possible that a newline character could appear
in the binary data.  This seems to me like it might be a problem.   Sort
would think it found the end of line.
----------

  Maybe I'm confused, but how do you do a column sort on binary data?  Aren't
column numbers defined within each line, implying some "newline" char?

  Michael

gwyn@smoke.BRL.MIL (Doug Gwyn) (05/18/89)

In article <810056@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>What unix sort do you use if you have a data file with mixed binary
>and ascii fields?

Your first problem is to define exactly what constitutes a "record"
for such a case.  The other systems you mention support (require?)
file attributes such as record size; on UNIX files are just structureless
byte arrays.  The new-line terminator convention allows text files to
be dealt with on a line-oriented basis, but there is no widespread UNIX
convention for binary file structures.

chris@mimsy.UUCP (Chris Torek) (05/18/89)

In article <810056@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>   What unix sort do you use if you have a data file with mixed binary
>and ascii fields? ... (Please don't tell me that it isn't 
>a good idea to mix ascii and binary data in the same file).

Oh well, answer number zero down the tubes. :-)

Anyway, I would write a little filter to de-binarify the file (and
if necessary, another to re-binate it, or perhaps the same one to do
both).  That seems easier than writing a special sorter for it.
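
One way such a pair of filters could look, sketched in Python with the standard struct module. The record layout is purely hypothetical (a big-endian unsigned 32-bit id followed by an 8-byte space-padded ASCII name), and the input is assumed to be a whole number of records.

```python
import struct

# Hypothetical record layout, for illustration only: a big-endian unsigned
# 32-bit id followed by an 8-byte ASCII name padded with spaces.
RECORD = struct.Struct(">I8s")

def to_text(blob):
    """De-binarify: one text line per fixed-length record."""
    lines = []
    for offset in range(0, len(blob), RECORD.size):
        ident, name = RECORD.unpack_from(blob, offset)
        # Zero-padding to 10 digits makes text order agree with numeric order.
        lines.append(f"{ident:010d} {name.decode('ascii')}")
    return lines

def to_binary(lines):
    """Inverse filter: rebuild the packed records from the text lines."""
    return b"".join(
        RECORD.pack(int(line[:10]), line[11:19].encode("ascii")) for line in lines
    )

blob = RECORD.pack(300, b"carol   ") + RECORD.pack(7, b"alice   ")
sorted_blob = to_binary(sorted(to_text(blob)))
```

The text form in the middle contains one newline-terminated line per record, so it can be handed to an ordinary line-oriented sort.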
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

andrew@root.co.uk (Andrew Dingwall) (05/23/89)

In article <810054@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>>In article <810050@hpsemc.HP.COM> gph@hpsemc.HP.COM (Paul Houtz) writes:
>>>Right.  There is no way to do a true column sort using this utility as you
>>>can on IBM or MPE systems and here is why:   Sort requires a FIELD DELIMITER
>>>character.   That means that there is SOME character that will never be 
>>>sorted.
>>
>>But (as you yourself point out) you can set the field delimiter to
>>newline, effectively making it vanish, then use the 0.n format to
>>specify column n.
>

No, in a previous job, I made extensive use of sort +0.x -0.y ... to do column
sorts on newline-delimited records without needing to specify a field delimiter.
Admittedly, that was on unix V7 (and a long time ago!), but I have tried the
same on a System V system and it still seems to work.
The only thing that I found necessary to make the scheme work is the newline
at the end of the record and no nulls or non-ascii characters in the record
body.

>   The problem with having to set the field delimiter is that you have to
>decide what to set it TO.  Now, if you are reading from a file that has 
>binary data in it, then it is possible that a newline character could appear
>in the binary data.  This seems to me like it might be a problem.   Sort
>would think it found the end of line.
>
>   I can write a sort program that will do this column sorting for me, but
>what a pain.  It's too bad there isn't one for unix, like there is for 
>all the other major operating systems.
>
>   (On the other hand, I'll bet that some third party out there has already
>written a true column sort for unix.  I just haven't found it yet.  Any
>takers?)

Yes, we have written a binary sort called binsort.
It works like the unix sort except that it works on fixed-length binary records
and sorts by column position.
It understands all the usual unix data types (char, int, float, double etc),
together with data types more usually found in the commercial world
(cobol COMP (natural byte order signed binary) and COMP-3 (bcd)).
The command-line interface is similar to the unix sort (+m.n -m.n etc)
and appropriate options are supported (-d -f -u -c -r -m -o -T).
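
The general idea of sorting fixed-length binary records by column position can be illustrated in Python. This is not binsort, whose internals are not described here; it is a bare sketch that compares raw bytes only, which orders text and unsigned big-endian integers correctly but would need a decoding step for signed, floating-point or packed-decimal fields. The function name and parameters are invented.

```python
def sort_fixed_records(data, record_len, start, end):
    """Sort fixed-length records by the raw bytes in columns [start, end)."""
    if len(data) % record_len:
        raise ValueError("data is not a whole number of records")
    records = [data[i:i + record_len] for i in range(0, len(data), record_len)]
    # Bytes objects compare lexicographically by unsigned byte value.
    records.sort(key=lambda rec: rec[start:end])
    return b"".join(records)

# Three 4-byte records; the key is the 2-byte big-endian number at offset 2.
# The newline inside the first record is just another byte here.
data = b"b\n\x00\x02" + b"c\xff\x00\x03" + b"a\x00\x00\x01"
print(sort_fixed_records(data, 4, 2, 4))
```

Because records are cut by length rather than by a terminator, an embedded newline or null byte does not split a record.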

I'm not sure under what circumstances it might be made available but, as
UniSoft are a commercial organisation, it would probably cost money!

Andrew Dingwall
andrew@root.co.uk