woods@hao.UUCP (Greg "Bucket" Woods) (09/27/84)
We have a need to numerically sort files which contain columns of numbers in E-format, i.e. something of the form [+-]#.####e[+-]##, where "#" means a digit and [+-] means an optional sign. Unfortunately, the -n option to sort(1) does not recognize exponents and stops numerical conversion of the sort field when it sees the "e". This results in incorrect sorting in some cases, like it will put 1.0e-07 before 2.0e-09.

So, before I go through the hassle of hacking the source code for sort(1), or writing another whole program to do this, I'd like to know if anyone else has already invented this wheel. Has anyone hacked sort(1) to do this correctly? I will settle for any kludges that would work with existing programs as well. Various thoughts of running sort(1) twice have crossed my mind, but I have yet to come up with anything that will work reliably in all cases. Any ideas?

Please *mail* me any hints. If someone comes up with an answer, I will post it to the net. We are running 4.2BSD on a VAX 11/750, in case that matters.

--Greg
--
{ucbvax!hplabs | allegra!nbires | decvax!stcvax | harpo!seismo | ihnp4!stcvax}
       !hao!woods
" She could make happy, any man alive..."
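The misordering described above is easy to reproduce. The following is a minimal sketch, not part of the original post; it assumes a sort(1) whose -n option stops at the "e" (as the post reports) and an awk that converts E-format strings to numbers, which the common implementations do. The variable names are illustrative only.

```shell
# Two values from the post. Numerically, 2.0e-09 is the smaller one.
sample='1.0e-07
2.0e-09'

# sort -n compares only the leading "1.0" and "2.0" and ignores the exponents.
# LC_ALL=C pins "." as the decimal point.
naive=$(printf '%s\n' "$sample" | LC_ALL=C sort -n)
printf 'sort -n order:\n%s\n' "$naive"

# awk converts the whole string, exponent included, when it is used as a number
# (adding 0 forces the numeric conversion).
smaller=$(awk 'BEGIN { a = "1.0e-07"; b = "2.0e-09"; print ((a + 0 < b + 0) ? a : b) }')
printf 'numerically smaller: %s\n' "$smaller"
```

sort -n lists 1.0e-07 first even though awk reports 2.0e-09 as the smaller value.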
woods@hao.UUCP (Greg "Bucket" Woods) (09/29/84)
>
> We have a need to numerically sort files which contain columns of numbers in
> E-format, i.e. something of the form [+-]#.####e[+-]##, where "#" means
> a digit and [+-] means an optional sign. Unfortunately, the -n option to
> sort(1) does not recognize exponents and stops numerical conversion of the
> sort field when it sees the "e". This results in incorrect sorting in some
> cases, like it will put 1.0e-07 before 2.0e-09.

In reply to my own question, after a bit of trial and error I discovered a method that seems to work. It does depend on the fact that every line is identical in format, which is true in all cases we have. Here is an example:

    1.27000E-07 8.91000E+04 6.00495E+09 9.82000E+05 1.66451E+05 4.99966E+09
    1.43000E-07 5.00000E+04 1.04275E+10 9.76000E+05 2.38238E+06 8.68145E+09
    8.09000E-07 2.30000E+04 2.35302E+10 8.87000E+05 4.11476E+08 2.02331E+10
    1.67000E-07 3.20000E+04 1.57815E+08 9.71000E+05 3.63586E+07 1.31336E+10
    1.97000E-07 2.55000E+04 1.93346E+10 9.68000E+05 1.92010E+08 1.61099E+10
    2.30000E-07 2.45000E+04 2.00822E+10 9.64000E+05 2.55430E+08 1.68091E+10
    1.81000E-07 2.80000E+04 1.78057E+10 9.70000E+05 9.50806E+07 1.48126E+10
    1.58000E-07 3.70000E+04 1.38137E+10 9.73000E+05 1.38215E+07 1.14989E+10
    4.70000E-07 2.40000E+04 2.14417E+10 9.33000E+05 2.84392E+08 1.80507E+10
    6.56000E-07 2.35000E+04 2.25669E+10 9.08000E+05 3.37865E+08 1.91669E+10
    3.37000E-07 2.42000E+04 2.07391E+10 9.49000E+05 2.70261E+08 1.74114E+10

We want to sort on the third column. The command "sort +2.9 -n +2" run on this file, which says "sort on third field and skip 9 characters, sort this numerically, then subsort on the third field" does what we want. It took a lot of trial and error to figure this one out! The only problem with it is that it won't work if some of the exponents are negative (in all of our cases, the exponents are all the same sign).
I tried using "sort +2.8" instead, but apparently the stupid numeric sort algorithm knows about minus signs but not plus signs (AAARGH!) and so sort +2.8 failed totally. I'm going to see about fixing that so a plus sign as a leading character in a numeric field will be ignored instead of aborting the field.

Thanks to all those who responded. Some people gave me kludges using "sed" and/or "awk". I didn't actually try any of these, but from the looks of it, "awk" is aptly named! :-) One person even sent me mods to sort.c to make numeric sorts work on E-format. If anyone is interested in any of those, drop me a line and I'll be glad to mail you everything I got.

--Greg
--
{ucbvax!hplabs | allegra!nbires | decvax!stcvax | harpo!seismo | ihnp4!stcvax}
       !hao!woods
"Every silver lining has a touch of grey..."
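The character-offset trick above relies on every exponent having the same sign. One possible way around that limitation is to let awk build a sort key in front of each line, sort on the key, and strip it again. This is a sketch only and is not one of the mailed-in solutions, which the thread does not reproduce. The function name esort is hypothetical, and the code assumes whitespace-separated fields, an awk that converts E-format strings to numbers, and exponents smaller than 1000 in magnitude.

```shell
# esort COLUMN [FILE...]: sort lines numerically on an E-format column.
# Hypothetical helper; reads standard input when no files are given.
esort() {
    col=$1
    shift
    tab=$(printf '\t')
    LC_ALL=C awk -v col="$col" '{
        x = $col + 0                              # full conversion, exponent included
        split(sprintf("%.15e", x), part, "e")     # normalized mantissa and exponent
        mant = part[1]
        expo = part[2] + 0                        # "+09" becomes 9, "-07" becomes -7
        if (x > 0)      k1 = expo + 1000          # larger exponent sorts later
        else if (x < 0) k1 = -(expo + 1000)       # larger exponent sorts earlier
        else            k1 = 0                    # zero sits between the two groups
        printf "%d\t%s\t%s\n", k1, mant, $0       # key1, key2, original line
    }' "$@" |
    LC_ALL=C sort -t "$tab" -k1,1n -k2,2n |       # plain -n suffices for both keys
    cut -f3-                                      # drop the two key fields
}
```

For example, "esort 3 datafile" would sort on the third column. Both keys are ordinary decimal numbers with no exponent and no leading plus sign, so the two problems described above do not arise.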
smh@mit-eddie.UUCP (Steven M. Haflich) (09/29/84)
Quoth woods@hao.UUCP (Greg "Bucket" Woods):

    We have a need to numerically sort files which contain columns of
    numbers in E-format, i.e. something of the form [+-]#.####e[+-]##,
    where "#" means a digit and [+-] means an optional sign.

God is the following solution UGLY!!!!!! But it works...

As a test case, I use the output of the following program.

    #include <math.h>
    #include <stdio.h>

    int main()
    {
        register int i;
        float foo = 0.0;

        /* 90 iterations, two lines each: sin(foo) and 123*sin(foo) */
        for (i = 90; i--; ) {
            printf("foo %e bar\n", sin(foo));
            printf("foo %e bar\n", 123.*sin(foo));
            foo += .2;
        }
        return 0;
    }

The following shell script will sort it on the E-format number in the second whitespace-delimited field:

    # Pass 1: negative numbers, split into mantissa and exponent, largest exponent first.
    # Pass 2: the remaining numbers, smallest exponent first.
    # The final awk glues mantissa and exponent back together.
    # Note: +2nr and +1n are the old zero-based key syntax; POSIX sort
    # spells them -k3nr and -k2n.
    ( awk '$2 ~ /^-/ {
        { n = split($2, number, "e") }
        { if (number[2] ~ /^\+/) number[2] = " " substr(number[2],2) }
        { print $1, number[1], number[2], $3 }
    }
    ' $* | sort +2nr +1n;
    awk '$2 ~ /^[^-]/ {
        { n = split($2, number, "e") }
        { if (number[2] ~ /^\+/) number[2] = " " substr(number[2],2) }
        { print $1, number[1], number[2], $3 }
    }
    ' $* | sort +2n +1n) |
    awk '
    $3 ~ /^-/    { print $1, $2 "e" $3, $4 }
    $3 ~ /^[^-]/ { print $1, $2 "e+" $3, $4 }
    '

Sorry -- this crock demands real file(s) as input and won't read a pipe. Converting it to read the proper input field is left as an exercise for the student.

What this proves to anyone still reading this gibberish is that shell and awk scripts are easier to read than to write. :-)

Have a nice day!

Steve Haflich, smh@mit-ems@mit-mc, {decvax!genrad, ihnp4}!mit-eddie!smh
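The script above needs real files because it reads its input twice, once per awk pass. One possible way to accept a pipe as well, not from the thread, is to copy standard input to a temporary file first. A minimal sketch follows, assuming mktemp(1) is available (it is widespread but not part of POSIX). The name negatives_first is hypothetical, and the two awk passes are simplified stand-ins that only select lines, without the splitting and sorting.

```shell
# Hypothetical helper: emit lines whose second field is negative, then the
# rest. This takes two passes over the same input, like the script above.
negatives_first() {
    spool=
    if [ "$#" -eq 0 ]; then
        # No file arguments: save standard input so it can be read twice.
        spool=$(mktemp) || return 1
        cat > "$spool"
        set -- "$spool"
    fi
    awk '$2 ~ /^-/' "$@"        # pass 1: negative numbers
    awk '$2 ~ /^[^-]/' "$@"     # pass 2: everything else
    if [ -n "$spool" ]; then
        rm -f "$spool"          # clean up the temporary copy
    fi
}
```

With this wrapper, something like "generator | negatives_first" and "negatives_first file1 file2" would both work, at the cost of one extra copy of the data when reading from a pipe.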