[net.unix] Sort

woods@hao.UUCP (Greg "Bucket" Woods) (09/27/84)

  We have a need to numerically sort files which contain columns of numbers in
E-format, i.e. something of the form [+-]#.####e[+-]##, where "#" means
a digit and [+-] means an optional sign. Unfortunately, the -n option to
sort(1) does not recognize exponents and stops numerical conversion of the
sort field when it sees the "e". This results in incorrect sorting in some
cases; for example, it will put 1.0e-07 before 2.0e-09.
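
A minimal sketch of the misordering described above, assuming a POSIX sh with
sort and awk on the path. The function names misordered and smaller are made
up for illustration. sort -n reads only the leading "1.0" and "2.0", while awk
converts the whole string, exponent included, so it can confirm which value is
really the smaller one.

```shell
#!/bin/sh
# Assumption: POSIX sh, sort and awk. Function names are hypothetical.
LC_ALL=C; export LC_ALL

# sort -n stops converting at the "e", so it compares 2.0 against 1.0.
misordered() {
    printf '2.0e-09\n1.0e-07\n' | sort -n
}

# awk converts the full E-format string, so it compares the real values.
smaller() {
    awk -v a="$1" -v b="$2" 'BEGIN { print ((a + 0 < b + 0) ? a : b) }'
}

misordered                  # lists 1.0e-07 first, although it is the larger value
smaller 1.0e-07 2.0e-09     # prints 2.0e-09
```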
  So, before I go through the hassle of hacking the source code for sort(1),
or writing another whole program to do this, I'd like to know if anyone else
has already invented this wheel. Has anyone hacked sort(1) to do this correctly?
I will settle for any kludges that would work with existing programs as well.
Various thoughts of running sort(1) twice have crossed my mind, but I have
yet to come up with anything that will work reliably in all cases. Any ideas?
  Please *mail* me any hints. If someone comes up with an answer, I will post
it to the net. We are running 4.2BSD on a VAX 11/750, in case that matters.

--Greg
-- 
{ucbvax!hplabs | allegra!nbires | decvax!stcvax | harpo!seismo | ihnp4!stcvax}
       		        !hao!woods
   
     "  She could make happy, any man alive..."

woods@hao.UUCP (Greg "Bucket" Woods) (09/29/84)

> 
>   We have a need to numerically sort files which contain columns of numbers in
> E-format, i.e. something of the form [+-]#.####e[+-]##, where "#" means
> a digit and [+-] means an optional sign. Unfortunately, the -n option to
> sort(1) does not recognize exponents and stops numerical conversion of the
> sort field when it sees the "e". This results in incorrect sorting in some
> cases; for example, it will put 1.0e-07 before 2.0e-09.

   In reply to my own question, after a bit of trial and error I discovered
a method that seems to work. It does depend on the fact that every line
is identical in format, which is true in all cases we have. Here is an example:

 1.27000E-07 8.91000E+04 6.00495E+09 9.82000E+05 1.66451E+05 4.99966E+09 
 1.43000E-07 5.00000E+04 1.04275E+10 9.76000E+05 2.38238E+06 8.68145E+09 
 8.09000E-07 2.30000E+04 2.35302E+10 8.87000E+05 4.11476E+08 2.02331E+10 
 1.67000E-07 3.20000E+04 1.57815E+08 9.71000E+05 3.63586E+07 1.31336E+10 
 1.97000E-07 2.55000E+04 1.93346E+10 9.68000E+05 1.92010E+08 1.61099E+10 
 2.30000E-07 2.45000E+04 2.00822E+10 9.64000E+05 2.55430E+08 1.68091E+10 
 1.81000E-07 2.80000E+04 1.78057E+10 9.70000E+05 9.50806E+07 1.48126E+10 
 1.58000E-07 3.70000E+04 1.38137E+10 9.73000E+05 1.38215E+07 1.14989E+10 
 4.70000E-07 2.40000E+04 2.14417E+10 9.33000E+05 2.84392E+08 1.80507E+10 
 6.56000E-07 2.35000E+04 2.25669E+10 9.08000E+05 3.37865E+08 1.91669E+10 
 3.37000E-07 2.42000E+04 2.07391E+10 9.49000E+05 2.70261E+08 1.74114E+10 

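   We want to sort on the third column. The command "sort +2.9 -n +2"
run on this file does what we want. It says "skip to the third field, skip 9
more characters (which lands on the exponent digits) and sort numerically on
that, then subsort numerically on the third field from its start (which
compares the mantissas, since -n stops at the E)".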
It took a lot of trial and error to figure this one out! The only problem with
it is that it won't work if some of the exponents are negative (in all of our
cases, the exponents are all the same sign). I tried using "sort +2.8" instead, 
but apparently the stupid numeric sort algorithm knows about minus signs but 
not plus signs (AAARGH!) and so sort +2.8 failed totally. I'm going to see 
about fixing that so a plus sign as a leading character in a numeric field 
will be ignored instead of aborting the field.
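
For readers without the old "+pos" syntax, here is a rough sketch of the same
two-key idea in POSIX -k notation. It is an approximation under stated
assumptions, not a tested port of the 4.2BSD command. It assumes every value
in column 3 has the fixed shape #.#####E+## and that all exponents are
non-negative, as in the sample. The function name sort_col3 is made up.

```shell
#!/bin/sh
# Assumption: POSIX sort; column 3 always looks like 6.00495E+09.
LC_ALL=C; export LC_ALL

# sort_col3: order stdin by the exponent of column 3, then by its mantissa.
#   -b         count key offsets from the first non-blank of the field
#   -n         compare keys numerically
#   -k 3.10,3  key 1: tenth character of field 3 onward (the exponent digits)
#   -k 3,3     key 2: field 3 from its start; -n reads only the mantissa
sort_col3() {
    sort -b -n -k 3.10,3 -k 3,3
}

# Four rows taken from the sample above.
sort_col3 <<'EOF'
 1.27000E-07 8.91000E+04 6.00495E+09 9.82000E+05 1.66451E+05 4.99966E+09
 1.43000E-07 5.00000E+04 1.04275E+10 9.76000E+05 2.38238E+06 8.68145E+09
 8.09000E-07 2.30000E+04 2.35302E+10 8.87000E+05 4.11476E+08 2.02331E+10
 1.67000E-07 3.20000E+04 1.57815E+08 9.71000E+05 3.63586E+07 1.31336E+10
EOF
```

The rows should come out with third columns 1.57815E+08, 6.00495E+09,
1.04275E+10, 2.35302E+10. Like the original command, this sketch does not
handle negative exponents.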
  Thanks to all those who responded. Some people gave me kludges using "sed"
and/or "awk". I didn't actually try any of these, but from the looks of it, 
"awk" is aptly named! :-)
   One person even sent me mods to sort.c to make numeric sorts work on 
E-format.
   If anyone is interested in any of those, drop me a line and I'll be glad to
mail you everything I got.

--Greg
-- 
{ucbvax!hplabs | allegra!nbires | decvax!stcvax | harpo!seismo | ihnp4!stcvax}
       		        !hao!woods
   
     "Every silver lining has a touch of grey..."

smh@mit-eddie.UUCP (Steven M. Haflich) (09/29/84)

Quoth woods@hao.UUCP (Greg "Bucket" Woods):
  We have a need to numerically sort files which contain columns of
  numbers in E-format, i.e. something of the form [+-]#.####e[+-]##, where
  "#" means a digit and [+-] means an optional sign.

God is the following solution UGLY!!!!!!  But it works...  As a test
case, I use the output of the following program.
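	#include <math.h>
	#include <stdio.h>

	/* Print 180 test lines whose second field is an E-format number. */
	int main(void) { register int i; float foo = 0.0;
		for (i=90; i--; ) {	printf("foo %e bar\n", sin(foo));
					printf("foo %e bar\n", 123.*sin(foo));
					foo += .2;
		}
		return 0;
	}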
The following shell script will sort it on the E-format number in the
second whitespace-delimited field:
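	# Negatives first (biggest exponent first), then the rest; the last
	# awk glues mantissa and exponent back together.
	( awk '$2 ~ /^-/ {
		# number[1] is the mantissa, number[2] the signed exponent
		{ n = split($2, number, "e") }
		# sort -n does not take a leading "+", so blank it out
		{ if (number[2] ~ /^\+/) number[2] = " " substr(number[2],2) }
		{ print $1, number[1], number[2], $3 }
		}
	' "$@" | sort +2nr +1n; awk '$2 ~ /^[^-]/ {
		{ n = split($2, number, "e") }
		{ if (number[2] ~ /^\+/) number[2] = " " substr(number[2],2) }
		{ print $1, number[1], number[2], $3 }
		}
	' "$@" | sort +2n +1n) |
		awk '$3 ~ /^-/	{ print $1, $2 "e" $3, $4 }
	$3 ~ /^[^-]/	{ print $1, $2 "e+" $3, $4 }
	'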
Sorry -- this crock demands real file(s) as input and won't read a
pipe.  Converting it to read the proper input field is left as an
exercise for the student.
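
One possible answer to that exercise, as a hedged sketch rather than a
drop-in replacement: decorate each line with three numeric keys (sign,
exponent, mantissa), sort once, then strip the keys. It reads standard input,
takes the field number as an argument, and gives an exact zero its own sign
class so that it lands between the negative and positive values. It assumes
POSIX sh, awk, sort and cut, and normalized mantissas such as printf's %e
produces. The name esort is made up.

```shell
#!/bin/sh
# Assumption: POSIX tools; values look like [-]d.ddddde[+-]dd (normalized).
LC_ALL=C; export LC_ALL
tab=$(printf '\t')

# esort N: sort stdin on the E-format number in whitespace-delimited field N.
esort() {
    awk -v col="$1" '{
        split($col, part, /[eE]/)         # part[1] mantissa, part[2] exponent
        m = part[1] + 0
        e = part[2] + 0
        s = (m > 0) - (m < 0)             # sign class: -1, 0 or 1
        if (s < 0) e = -e                 # bigger exponent means smaller value
        if (s == 0) e = 0
        printf "%d\t%d\t%.15g\t%s\n", s, e, m, $0
    }' |
    sort -t "$tab" -k 1,1n -k 2,2n -k 3,3n |
    cut -f 4-                             # drop the three helper keys
}

printf 'foo 1.0e-07 bar\nfoo -3.0e+02 bar\nfoo 2.0e-09 bar\n' | esort 2
```

The three demo lines should come out in the order -3.0e+02, 2.0e-09, 1.0e-07.
Usage on a file would be along the lines of "esort 2 < datafile".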

What this proves to anyone still reading this gibberish is that shell
and awk scripts are easier to read than to write. :-) Have a nice day!

Steve Haflich, smh@mit-ems@mit-mc, {decvax!genrad, ihnp4}!mit-eddie!smh