[comp.bugs.sys5] crufty memcmp on 386

gls@cbnewsh.att.com (Col. G. L. Sicherman) (11/13/90)

There is a problem with the memcmp routine on my 6386WGS running Vr3.2.
The ordering it returns is not transitive, or even antisymmetric.  In
other words, it can tell you that x < y and y < x, or that x < y < z < x.

I do not know whether the formal specifications for memcmp require that
it produces a lineal ordering, but obviously it should.  A nonlineal
memcmp cannot be used for sorting and searching.

The 386-specific code does this:

	jz	zeroexit
	lahf		// load low byte of flags to high byte of AX
	cwtl		// convert word to longword with sign extension

There are two problems with this.  One is that if EAX contains zero and
all the flag bits are 0, it will return 0 instead of positive.  This
never seems to happen; apparently the unused bit 1 is always set to 1.

The other is that it effectively returns the sign flag as the sign of
the result.  Even PDP-11 programmers know better than that!  There are
two ways to give a consistent ordering on a 386:

	1. Use unsigned byte comparison.  This amounts to replacing
	   the CWTL by a 9-bit right rotate, to put the carry flag
	   into the sign position.  There is probably a faster way;
	   I don't have a 386 manual.  Like the method above, this
	   assumes that bit 1 of the flags is set -- otherwise you
	   could get zero.

	2. Use consistent signed byte comparison.  This means testing
	   for SF=OF.  The simplest way to do it is with a JGE, but
	   again there may well be a faster way.

Is anybody working on a fix for the code and (if necessary) the formal
specs for memcmp()?
-- 
Col. G. L. Sicherman
gls@odyssey.att.COM

dpi@loft386.uucp (Doug Ingraham) (11/14/90)

In article <1990Nov12.173812.3227@cbnewsh.att.com>, gls@cbnewsh.att.com (Col. G. L. Sicherman) writes:
> There is a problem with the memcmp routine on my 6386WGS running Vr3.2.
> The ordering it returns is not transitive, or even antisymmetric.  In
> other words, it can tell you that x < y and y < x, or that x < y < z < x.
> 
> I do not know whether the formal specifications for memcmp require that
> it produces a lineal ordering, but obviously it should.  A nonlineal
> memcmp cannot be used for sorting and searching.

The ANSI spec that requires the behavior is in section 4.11.4 on page 165
of the standard.

4.11.4 Comparison Functions

   The sign of a nonzero value returned by the comparison functions memcmp,
strcmp, and strncmp is determined by the sign of the difference between the
values of the first pair of characters (both interpreted as unsigned char)
that differ in the objects being compared.

4.11.4.1 The memcmp Function

Synopsis

	#include <string.h>
	int memcmp(const void *s1, const void *s2, size_t n);

Description

   The memcmp function compares the first n characters of the object pointed
to by s1 to the first n characters of the object pointed to by s2.

Returns
   The memcmp function returns an integer greater than, equal to, or less
than zero, accordingly as the object pointed to by s1 is greater than, equal
to, or less than the object pointed to by s2.

I wrote an exhaustive test for the purposes of testing my version of this
function (included at the end) but the version supplied with Bell Tech's
version of System V/386 3.2 passes my test.

> 
> The 386-specific code does this:
> 
> 	jz	zeroexit
> 	lahf		// load low byte of flags to high byte of AX
> 	cwtl		// convert word to longword with sign extension
> 
> There are two problems with this.  One is that if EAX contains zero and
> all the flag bits are 0, it will return 0 instead of positive.  This
> never seems to happen; apparently the unused bit 1 is always set to 1.

The lahf instruction converts the 8086 16 bit flags register to the
8080 8 bit format.  The jz before the lahf is used to ensure that
a zero is returned only when equalty is true.  The not equal case which
is what is being determined is the only case where the lahf cwtl comes
into play.  You will get a negative number or a positive non-zero from
the output of this sequence.  Probably the comparison operation on the
previous line was a rep cmpsb which will compare the two bytes via
an 8 bit subtraction and set the flags.  One of the unused bits is set
to a 1 and was specified to do that on the 8085 (I think) so this is
a very safe assumption.

> Is anybody working on a fix for the code and (if necessary) the formal
> specs for memcmp()?

I can't make the original fail.  Could you provide a short code segment
that will display the claimed behavior?  If you can make AT&T's fail,
I would like to know if mine fails also.

The official specs are copied above and the following is my replacement
routine that can be almost 4 times faster than the AT&T implementation.

------------------------------------------------------------------------

	.file	"memcmp.s"
/ The Intel 80386 implementation of the ANSI memcmp() function.
/ Written April 8, 1990 by Doug Ingraham.  dpi@loft386.UUCP
/ Copyright 1990 by Douglas P. Ingraham.  This code is freely
/ distributable so long as this notice remains intact.
	.text
	.align	4
	.globl	memcmp
memcmp:
/ Save edi and esi in edx and eax to save 4 clocks over popping them
	movl	%edi,%edx		/ 2 clocks 2 bytes
	movl	%esi,%eax		/ 2 clocks 2 bytes
/ esi is pointer to the first string
	movl	0x4(%esp),%esi		/ 2 clocks 4 bytes
/ edi is pointer to the second string
	movl	0x8(%esp),%edi		/ 2 clocks 4 bytes
/ This tests to see if the pointers are to the same string.
	cmpl	%esi,%edi		/ 2 clocks 2 bytes
	je	same			/ 7+m,3 clocks 2 bytes
/ The ecx gets the count.  If Zero then by definition they are the same.
	movl	0xC(%esp),%ecx		/ 2 clocks 4 bytes
	jcxz	same			/ 9+m,5 clocks 2 bytes
/ Test the count to make sure there are enough bytes to bother with this
/ speed optimization.  12 was chosen as a ballpark figure.  Must be greater
/ than 7 as a minimum.
	cmpl	$12,%ecx		/ 2 clocks 3 bytes
	jl	byte			/ 7+m,3 clocks 2 bytes
/ Test to see if either pointer is word aligned.  Most is 3 cmpsb's to align
/ so the loop is unrolled.  75 clocks worst case.
	testl	$3,%esi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	testl   $3,%edi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	cmpsb				/ 10 clocks 1 byte
	jne	different		/ 7+m,3 clocks 2 bytes
	decl	%ecx			/ 2 clocks 1 byte
	testl	$3,%esi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	testl   $3,%edi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	cmpsb				/ 10 clocks 1 byte
	jne	different		/ 7+m,3 clocks 2 bytes
	decl	%ecx			/ 2 clocks 1 byte
	testl	$3,%esi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	testl   $3,%edi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	cmpsb				/ 10 clocks 1 byte
	jne	different		/ 7+m,3 clocks 2 bytes
	decl	%ecx			/ 2 clocks 1 byte
/ Save the modified count back on the stack for possible later use.
is_aligned:
	movl	%ecx,0xC(%esp)		/ 2 clocks 4 bytes
/ Divide by 4 so that we can do word compares
	shrl	$2,%ecx			/ 3 clocks 3 bytes
/ Do the long level search.
	repz;	cmpsl			/ 5+9*n clocks 2 bytes
	je	byte_equal		/ 7+m,3 clocks 2 bytes
/ backup 4 bytes
	movl	$4,%ecx			/ 2 clocks 5 bytes
	subl	%ecx,%esi		/ 2 clocks 2 bytes
	subl	%ecx,%edi		/ 2 clocks 2 bytes
/ Do the byte level search.
byte:
	repz;	cmpsb			/ 5+9*n clocks 2 bytes
	je	same			/ 7+m/3 clocks 2 bytes
/ Restore the esi so that eax is available for return value.
different:
	movl	%eax,%esi		/ 2 clocks 2 byte
/ This is a trick to put a positive or negative in the eax.
	lahf				/ 2 clocks 1 byte
	cwtl				/ 3 clocks 1 byte
/ Restore the edi.
	movl	%edx,%edi		/ 2 clocks 2 bytes
	ret				/ 10+m clocks 1 byte
/ If the string compared out then come here.
same:	movl	%eax,%esi		/ 2 clocks 2 bytes
short_same:
	movl	%edx,%edi		/ 2 clocks 2 bytes
	xorl	%eax,%eax		/ 2 clocks 2 bytes
	ret				/ 10+m clocks 1 byte
/ come here if the high speed test was equal 
byte_equal:
	movl	0xC(%esp),%ecx		/ 2 clocks 4 bytes
	andl	$3,%ecx			/ 2 clocks 6 bytes
	repz;	cmpsb			/ 5+9*n clocks 2 bytes
/ Restore the esi so that eax is available for return value.
	movl	%eax,%esi		/ 2 clocks 2 byte
	je	short_same		/ 7+m/3 clocks 2 bytes
/ This is a trick to put a positive or negative in the eax.
	lahf				/ 2 clocks 1 byte
	cwtl				/ 3 clocks 1 byte
/ Restore the edi.
	movl	%edx,%edi		/ 2 clocks 2 bytes
	ret				/ 10+m clocks 1 byte

------------------------------------------------------------------------

It is quite a bit longer than the original, but is much faster in the
general case.

Apologies to those with other architectures.


-- 
Doug Ingraham (SysAdmin)
Lofty Pursuits (Public Access for Rapid City SD USA)
bigtex!loft386!dpi
uunet!loft386!dpi

gls@cbnewsh.att.com (Col. G. L. Sicherman) (11/16/90)

In <1990Nov13.182133.27748@loft386.uucp>, dpi@loft386.uucp writes:
> 
> I can't make the original fail.  Could you provide a short code segment
> that will display the claimed behavior?  If you can make AT&T's fail,
> I would like to know if mine fails also.

Sure.  Here are my test program and what came out of it:

REPEAT BY:
	Compile this with cc and run it.

main()
{
	long a[4];
	int i, j;
	int memcmp();

	a[0] = 0x10000000;
	a[1] = 0x50000000;
	a[2] = 0x90000000;
	a[3] = 0xd0000000;

	for (i=0; i<4; i++) for (j=0; j<4; j++)
	if (i!=j)
	printf("memcmp(%lx,%lx,4) is %lx\n", a[i], a[j], memcmp(a+i,a+j, 4));
}

THE DAMNING EVIDENCE:

memcmp(10000000,50000000,4) is ffff8720
memcmp(10000000,90000000,4) is ffff8320
memcmp(10000000,d0000000,4) is 320
memcmp(50000000,10000000,4) is 224
memcmp(50000000,90000000,4) is ffff8724
memcmp(50000000,d0000000,4) is ffff8324
memcmp(90000000,10000000,4) is ffff8228
memcmp(90000000,50000000,4) is 228
memcmp(90000000,d0000000,4) is ffff8728
memcmp(d0000000,10000000,4) is ffff862c
memcmp(d0000000,50000000,4) is ffff822c
memcmp(d0000000,90000000,4) is 22c

This gives 0x10 < 0x90 < 0x10, and 0x10 < 0x50 < 0x90 < 0xd0 < 0x10.
Since you too use LAHF and CWTL, your code makes the same mistake --
it uses the SF bit instead of the CF bit.  How did you test it?
-- 
G. L. Sicherman
gls@odyssey.att.COM

dpi@loft386.uucp (Doug Ingraham) (11/19/90)

In article <1990Nov15.163550.27015@cbnewsh.att.com>, gls@cbnewsh.att.com (Col. G. L. Sicherman) writes:
> In <1990Nov13.182133.27748@loft386.uucp>, dpi@loft386.uucp writes:
> > 
> > I can't make the original fail.  Could you provide a short code segment
> > that will display the claimed behavior?  If you can make AT&T's fail,
> > I would like to know if mine fails also.
> 
> Sure.  Here are my test program and what came out of it:

[Test program deleted]

I coded my routine from a Draft of the standard.  The version provided
with SYS V 386 performs a signed test which was acceptable in pre standard
times to most.  I have since updated my routine and it now does unsigned
comparisons.  Col Sicherman, I wrote my test routine to test for the
signed version and wrote my original to match the version supplied by
AT&T.  This one passes your test and my more complete test.  Feel free
to use it in anyway you wish for any purpose.  It is as bug free as I
can make it.  If anyone finds a problem or an improvement please let me
know.
-------------------------------------------------------------------------
	.file	"memcmp.s"
/ The Intel 80386 implementation of the ANSI memcmp() function.
/ Written April 8, 1990 by Doug Ingraham.  dpi@loft386.UUCP
/ Copyright 1990 by Douglas P. Ingraham.  This code is freely
/ distributable so long as this notice remains intact.
/
/ Corrected November 18, 1990 to compare unsigned as per the standard.
/
	.text
	.align	4
	.globl	memcmp
memcmp:
/ Save edi and esi in edx and eax to save 4 clocks over popping them
	movl	%edi,%edx		/ 2 clocks 2 bytes
	movl	%esi,%eax		/ 2 clocks 2 bytes
/ esi is pointer to the first string
	movl	0x4(%esp),%esi		/ 2 clocks 4 bytes
/ edi is pointer to the second string
	movl	0x8(%esp),%edi		/ 2 clocks 4 bytes
/ This tests to see if the pointers are to the same string.
/ This test is probably a waste of time.
	cmpl	%esi,%edi		/ 2 clocks 2 bytes
	je	same			/ 7+m,3 clocks 2 bytes
/ The ecx gets the count.  If Zero then by definition they are the same.
	movl	0xC(%esp),%ecx		/ 2 clocks 4 bytes
	jcxz	same			/ 9+m,5 clocks 2 bytes
/ Test the count to make sure there are enough bytes to bother with this
/ speed optimization.  12 was chosen as a ballpark figure.  Must be greater
/ than 7 as a minimum.
	cmpl	$12,%ecx		/ 2 clocks 3 bytes
	jl	byte			/ 7+m,3 clocks 2 bytes
/ Test to see if either pointer is word aligned.  Most is 3 cmpsb's to align
/ so the loop is unrolled.  75 clocks worst case.
	testl	$3,%esi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	testl   $3,%edi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	cmpsb				/ 10 clocks 1 byte
	jne	different		/ 7+m,3 clocks 2 bytes
	decl	%ecx			/ 2 clocks 1 byte
	testl	$3,%esi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	testl   $3,%edi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	cmpsb				/ 10 clocks 1 byte
	jne	different		/ 7+m,3 clocks 2 bytes
	decl	%ecx			/ 2 clocks 1 byte
	testl	$3,%esi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	testl   $3,%edi			/ 2 clocks 6 bytes
	jz	is_aligned		/ 7+m,3 clocks 2 bytes
	cmpsb				/ 10 clocks 1 byte
	jne	different		/ 7+m,3 clocks 2 bytes
	decl	%ecx			/ 2 clocks 1 byte
/ Save the modified count back on the stack for possible later use.
is_aligned:
	movl	%ecx,0xC(%esp)		/ 2 clocks 4 bytes
/ Divide by 4 so that we can do word compares
	shrl	$2,%ecx			/ 3 clocks 3 bytes
/ Do the long level search.
	repz;	cmpsl			/ 5+9*n clocks 2 bytes
	je	byte_equal		/ 7+m,3 clocks 2 bytes
/ backup 4 bytes
	movl	$4,%ecx			/ 2 clocks 5 bytes
	subl	%ecx,%esi		/ 2 clocks 2 bytes
	subl	%ecx,%edi		/ 2 clocks 2 bytes
/ Do the byte level search.
byte:
	repz;	cmpsb			/ 5+9*n clocks 2 bytes
	je	same			/ 7+m/3 clocks 2 bytes
/ Restore the esi so that eax is available for return value.
different:
	movl	%eax,%esi		/ 2 clocks 2 byte
/ If the carry is set, return -1
	jc	less_than		/ 7+m/3 clocks 2 bytes
greater_than:
	mov	$1,%eax			/ 2 clocks 5 bytes
/ Restore the edi.
	movl	%edx,%edi		/ 2 clocks 2 bytes
	ret				/ 10+m clocks 1 byte
/ If the string compared out then come here.
same:	movl	%eax,%esi		/ 2 clocks 2 bytes
short_same:
	movl	%edx,%edi		/ 2 clocks 2 bytes
	xorl	%eax,%eax		/ 2 clocks 2 bytes
	ret				/ 10+m clocks 1 byte
/ come here if the high speed test was equal 
byte_equal:
	movl	0xC(%esp),%ecx		/ 2 clocks 4 bytes
	andl	$3,%ecx			/ 2 clocks 6 bytes
	repz;	cmpsb			/ 5+9*n clocks 2 bytes
/ Restore the esi so that eax is available for return value.
	movl	%eax,%esi		/ 2 clocks 2 byte
	je	short_same		/ 7+m/3 clocks 2 bytes
	jnc	greater_than		/ 7+m/3 clocks 2 bytes
less_than:
	mov	$-1,%eax		/ 2 clocks 5 bytes
/ Restore the edi.
	movl	%edx,%edi		/ 2 clocks 2 bytes
	ret				/ 10+m clocks 1 byte
------------------------------------------------------------------------

> -- 
> G. L. Sicherman
> gls@odyssey.att.COM

Thanks for finding this.

-- 
Doug Ingraham (SysAdmin)
Lofty Pursuits (Public Access for Rapid City SD USA)
bigtex!loft386!dpi
uunet!loft386!dpi