bruce@trigraph.UUCP (Bruce Freeman) (01/13/88)
Index: usr.bin/sort.c 4.3BSD
Description:
The output of sort -u can differ from sort | uniq. The -u option can
"lose" records during the merge phase of sort. This bug is also present
in the 4.2 sort and maybe even V7 sort but I don't have access to a
V7 system to check this.
Repeat-By:
I have a 3Mb input file that shows this problem but it is a little
big to include here. Basically you need a large input file with very
few unique records that will do more than one merge pass inside sort.
You can see that there is a problem by running dbx on sort. Use as
input something that is too big to be sorted in memory such as
/usr/dict/web2a. Set a breakpoint on this line (457 in our source):
j = (*compare)(ibuf[i-1]->l,p->l);
inside the function merge(). Do "run -u" on your input file. When the
breakpoint is reached do "print *p". Observe that the contents of p.l
are garbage since it has not been initialized.
Fix:
The problem is that p, which is used to keep track of the last record
output, is used before it is set when we use -u. p is only looked at if
muflg is true which is the case all the time if we do sort -mu or
sort -c. For sort -u muflg is set to true only when the last run is
being looked at during the merge step. Once it has been set to true p
is used in the call to (*compare)() shown above before it has been given
a value. Generally this compares our current record to random garbage
which it does not equal and we go to the top of the loop where we
finally give p a value. However the next time we enter merge() and hit
this call to (*compare)() p will have the last value from the previous
pass. If this happens to match the record we are currently looking at
in this merge pass that record will not be output. Given enough merge
passes and intermediate files that have the same last record this record
will disappear from the intermediate files and hence from the final
output.
My fix was to set p whenever a record was output not just when muflg is
true. This means that with -u we copy records around more than we used
to but at least p has the correct value when it is used.
A context diff follows:
*** /tmp/,RCSt1006226 Mon Jan 11 17:07:48 1988
--- sort.c Thu Jan 7 15:10:16 1988
***************
*** 424,442 ****
i = j;
while(i > 0) {
cp = ibuf[i-1]->l;
! if (!cflg && (uflg == 0 || muflg || i == 1 ||
! (*compare)(ibuf[i-1]->l,ibuf[i-2]->l))) {
! fputs(cp, os);
! if (ferror(os)) {
! error = 1;
! term();
}
! }
! if(muflg){
! cp = ibuf[i-1]->l;
! dp = p->l;
! do {
! } while((*dp++ = *cp++) != '\n');
}
for(;;) {
if(rline(ibuf[i-1])) {
--- 424,444 ----
i = j;
while(i > 0) {
cp = ibuf[i-1]->l;
! if (uflg == 0 || muflg || i == 1 ||
! (*compare)(ibuf[i-1]->l,ibuf[i-2]->l)) {
! if (!cflg) {
! fputs(cp, os);
! if (ferror(os)) {
! error = 1;
! term();
! }
}
! if(uflg || cflg){
! cp = ibuf[i-1]->l;
! dp = p->l;
! do {
! } while((*dp++ = *cp++) != '\n');
! }
}
for(;;) {
if(rline(ibuf[i-1])) {
--
Bruce Freeman Trigraph Inc., Toronto, Canada utzoo!trigraph!bruce