bruce@trigraph.UUCP (Bruce Freeman) (01/13/88)
Index: usr.bin/sort.c 4.3BSD Description: The output of sort -u can differ from sort | uniq. The -u option can "lose" records during the merge phase of sort. This bug is also present in the 4.2 sort and maybe even V7 sort but I don't have access to a V7 system to check this. Repeat-By: I have a 3Mb input file that shows this problem but it is a little big to include here. Basically you need a large input file with very few unique records that will do more than one merge pass inside sort. You can see that there is a problem by running dbx on sort. Use as input something that is too big to be sorted in memory such as /usr/dict/web2a. Set a breakpoint on this line (457 in our source): j = (*compare)(ibuf[i-1]->l,p->l); inside the function merge(). Do "run -u" on your input file. When the breakpoint is reached do "print *p". Observe that the contents of p.l are garbage since it has not been initialized. Fix: The problem is that p, which is used to keep track of the last record output, is used before it is set when we use -u. p is only looked at if muflg is true which is the case all the time if we do sort -mu or sort -c. For sort -u muflg is set to true only when the last run is being looked at during the merge step. Once it has been set to true p is used in the call to (*compare)() shown above before it has been given a value. Generally this compares our current record to random garbage which it does not equal and we go to the top of the loop where we finally give p a value. However the next time we enter merge() and hit this call to (*compare)() p will have the last value from the previous pass. If this happens to match the record we are currently looking at in this merge pass that record will not be output. Given enough merge passes and intermediate files that have the same last record this record will disappear from the intermediate files and hence from the final output. My fix was to set p whenever a record was output not just when muflg is true. This means that with -u we copy records around more than we used to but at least p has the correct value when it is used. A context diff follows: *** /tmp/,RCSt1006226 Mon Jan 11 17:07:48 1988 --- sort.c Thu Jan 7 15:10:16 1988 *************** *** 424,442 **** i = j; while(i > 0) { cp = ibuf[i-1]->l; ! if (!cflg && (uflg == 0 || muflg || i == 1 || ! (*compare)(ibuf[i-1]->l,ibuf[i-2]->l))) { ! fputs(cp, os); ! if (ferror(os)) { ! error = 1; ! term(); } ! } ! if(muflg){ ! cp = ibuf[i-1]->l; ! dp = p->l; ! do { ! } while((*dp++ = *cp++) != '\n'); } for(;;) { if(rline(ibuf[i-1])) { --- 424,444 ---- i = j; while(i > 0) { cp = ibuf[i-1]->l; ! if (uflg == 0 || muflg || i == 1 || ! (*compare)(ibuf[i-1]->l,ibuf[i-2]->l)) { ! if (!cflg) { ! fputs(cp, os); ! if (ferror(os)) { ! error = 1; ! term(); ! } } ! if(uflg || cflg){ ! cp = ibuf[i-1]->l; ! dp = p->l; ! do { ! } while((*dp++ = *cp++) != '\n'); ! } } for(;;) { if(rline(ibuf[i-1])) { -- Bruce Freeman Trigraph Inc., Toronto, Canada utzoo!trigraph!bruce