[comp.bugs.4bsd] sort -u can lose data +FIX

bruce@trigraph.UUCP (Bruce Freeman) (01/13/88)

Index:	usr.bin/sort.c 4.3BSD

Description:
	The output of sort -u can differ from sort | uniq. The -u option can
	"lose" records during the merge phase of sort. This bug is also present
	in the 4.2 sort and maybe even V7 sort but I don't have access to a
	V7 system to check this.
Repeat-By:
	I have a 3Mb input file that shows this problem but it is a little
	big to include here. Basically you need a large input file with very
	few unique records that will do more than one merge pass inside sort.

	You can see that there is a problem by running dbx on sort. Use as
	input something that is too big to be sorted in memory such as
	/usr/dict/web2a. Set a breakpoint on this line (457 in our source):
		j = (*compare)(ibuf[i-1]->l,p->l);
	inside the function merge(). Do "run -u" on your input file. When the
	breakpoint is reached do "print *p". Observe that the contents of p.l
	are garbage since it has not been initialized.
Fix:
	The problem is that p, which is used to keep track of the last record
	output, is used before it is set when we use -u. p is only looked at if
	muflg is true which is the case all the time if we do sort -mu or
	sort -c. For sort -u muflg is set to true only when the last run is
	being looked at during the merge step. Once it has been set to true p
	is used in the call to (*compare)() shown above before it has been given
	a value. Generally this compares our current record to random garbage
	which it does not equal and we go to the top of the loop where we
	finally give p a value. However the next time we enter merge() and hit
	this call to (*compare)() p will have the last value from the previous
	pass. If this happens to match the record we are currently looking at
	in this merge pass that record will not be output. Given enough merge
	passes and intermediate files that have the same last record this record
	will disappear from the intermediate files and hence from the final
	output.

	My fix was to set p whenever a record was output not just when muflg is
	true. This means that with -u we copy records around more than we used
	to but at least p has the correct value when it is used.

	A context diff follows:

*** /tmp/,RCSt1006226	Mon Jan 11 17:07:48 1988
--- sort.c	Thu Jan  7 15:10:16 1988
***************
*** 424,442 ****
  	i = j;
  	while(i > 0) {
  		cp = ibuf[i-1]->l;
! 		if (!cflg && (uflg == 0 || muflg || i == 1 ||
! 			(*compare)(ibuf[i-1]->l,ibuf[i-2]->l))) {
! 			fputs(cp, os);
! 			if (ferror(os)) {
! 				error = 1;
! 				term();
  			}
! 		}
! 		if(muflg){
! 			cp = ibuf[i-1]->l;
! 			dp = p->l;
! 			do {
! 			} while((*dp++ = *cp++) != '\n');
  		}
  		for(;;) {
  			if(rline(ibuf[i-1])) {
--- 424,444 ----
  	i = j;
  	while(i > 0) {
  		cp = ibuf[i-1]->l;
! 		if (uflg == 0 || muflg || i == 1 ||
! 		    (*compare)(ibuf[i-1]->l,ibuf[i-2]->l)) {
! 			if (!cflg) {
! 				fputs(cp, os);
! 				if (ferror(os)) {
! 					error = 1;
! 					term();
! 				}
  			}
! 			if(uflg || cflg){
! 				cp = ibuf[i-1]->l;
! 				dp = p->l;
! 				do {
! 				} while((*dp++ = *cp++) != '\n');
! 			}
  		}
  		for(;;) {
  			if(rline(ibuf[i-1])) {
-- 
Bruce Freeman	Trigraph Inc., Toronto, Canada		utzoo!trigraph!bruce