[net.unix-wizards] sort bug

fpt@wgivax.UUCP (Fred Toth #7252) (08/09/85)

Yes, we found that one a while back. We were sorting a giant list
of state abbreviations, and (whoops) one disappeared.
This was on our 11/70 v7 system, and when we got a vax it was one
of the first things I checked. As you know, it was still there.
System V has it too. We have a little Zenix system here that seems
to have it fixed, but we don't have source.

To see if you have the bug (not you Carl), try cating this file
up to a size big enough to cause at least one merge run.
This meant about 25k on a v7 system, but on 4.2 you need
probably 150k or so. Then just sort -u. If you have the bug, there
will be no "AK" in the output.

VA
VA
KY
AK
AK
AK
AL
AL
TN
TN
TN
VA
TN
VA
KY

My fix is below, along with a comment I wrote at the time. If you really sort
a lot, you might want to be aware of another problem with 4.2. Sort's
temp file naming starts out with stm<PID>aa, followed by stm<PID>ab,
and so on, stepping through the character set to make new names. The
problem is that it doesn't stop with the character set (7 bit ascii) but
goes on into 8 bit values, and eventually tries to turn over. That is,
it will sooner or later try to make a file: stm<PID>'\0''\0'. Well, that's
fine, at least it was fine in version 7. By the time the turn over happened,
the original files were long unlinked, and everything worked out, even
though you had hundred's of unreadable and unprintable file names in your
temp directory.

All that changed with 4.2 because it doesn't allow the 8th bit in any file
name characters. Of course, this only happens when you are sorting a
BIG file, like 100 megs or so, but the problem is there none the less.
There are lot's of simple fixes. An example is to make the temp file names
something like stm<PID>.00000, stm<PID>.00001, etc.

Good luck! Sort is a subject in itself, and I could go on and on, because
we really sort A LOT. But enough for now. Let me know if any of this is
too obscure ...

Excuse the interruption, now you can all go back to the 'ls' discussion.

Fred Toth
Washburn Graphics, Inc.
Charlotte, NC
decvax!mcnc!unccvax!wgivax!fpt

/*
 * From fpt:
 * On sort -u on files big enough for more than one call to merge()
 * lspace has leftovers from the first merge when the second merge runs.
 * This caused an occasional error:
 * The line below 'j = (*compare ... ' would find a match with the
 * leftover, when it should find garbage. This was only on the last
 * compare of the merge run, after rline() began returning EOF.
 * This is completely unintelligible I know, but the fix is to clear
 * out the leftovers at merge init time.
 *
 * Later from fpt:
 * Well, that didn't work completely. Just nulling out the as yet
 * unused p->l is not good enough for sort. 
 *
 * The problem:
 *
 * When the next to the last merge file EOFs, muflg
 * gets turned on for the first time. Then, for the first time,
 * the (*compare) of ibuf[i-1]->l and p->l is executed. The problem
 * is that p->l has not had anything put into it yet, and won't
 * until the for loop is broken. MOST of the time p->l has garbage.
 * SOMETIMES it has leftover data which will MATCH. The first fix
 * "p->l[0] = '\0'" didn't work because the compare routine will
 * scream through many k of nulls to find data to look at. SOMETIMES
 * it will MATCH. Crazy, but as much sorting as we do, sooner or
 * later it had to happen.
 * The fix is to prevent the first compare from happening (so we 
 * don't look at garbage). If merge files are down to 1, and
 * muflg is just being turned on, break out of the loop. See below.
 *
 * Note: none of this was a problem if we were merging unique (-mu)
 * because muflg was on the whole time. Thus p->l always has a copy
 * of the last output line.
 */
424,429c424,425
< 				if(i == 1) {
< 					if (!muflg) {
< 						muflg = uflg;
< 						break;
< 					}
< 				}
---
> 				if(i == 1)
> 					muflg = uflg;