gls@cbnewsh.att.com (Col. G. L. Sicherman) (11/13/90)
There is a problem with the memcmp routine on my 6386WGS running Vr3.2. The ordering it returns is not transitive, or even antisymmetric. In other words, it can tell you that x < y and y < x, or that x < y < z < x. I do not know whether the formal specifications for memcmp require that it produces a lineal ordering, but obviously it should. A nonlineal memcmp cannot be used for sorting and searching. The 386-specific code does this: jz zeroexit lahf // load low byte of flags to high byte of AX cwtl // convert word to longword with sign extension There are two problems with this. One is that if EAX contains zero and all the flag bits are 0, it will return 0 instead of positive. This never seems to happen; apparently the unused bit 1 is always set to 1. The other is that it effectively returns the sign flag as the sign of the result. Even PDP-11 programmers know better than that! There are two ways to give a consistent ordering on a 386: 1. Use unsigned byte comparison. This amounts to replacing the CWTL by a 9-bit right rotate, to put the carry flag into the sign position. There is probably a faster way; I don't have a 386 manual. Like the method above, this assumes that bit 1 of the flags is set -- otherwise you could get zero. 2. Use consistent signed byte comparison. This means testing for SF=OF. The simplest way to do it is with a JGE, but again there may well be a faster way. Is anybody working on a fix for the code and (if necessary) the formal specs for memcmp()? -- Col. G. L. Sicherman gls@odyssey.att.COM
dpi@loft386.uucp (Doug Ingraham) (11/14/90)
In article <1990Nov12.173812.3227@cbnewsh.att.com>, gls@cbnewsh.att.com (Col. G. L. Sicherman) writes: > There is a problem with the memcmp routine on my 6386WGS running Vr3.2. > The ordering it returns is not transitive, or even antisymmetric. In > other words, it can tell you that x < y and y < x, or that x < y < z < x. > > I do not know whether the formal specifications for memcmp require that > it produces a lineal ordering, but obviously it should. A nonlineal > memcmp cannot be used for sorting and searching. The ANSI spec that requires the behavior is in section 4.11.4 on page 165 of the standard. 4.11.4 Comparison Functions The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared. 4.11.4.1 The memcmp Function Synopsis #include <string.h> int memcmp(const void *s1, const void *s2, size_t n); Description The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2. Returns The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2. I wrote an exhaustive test for the purposes of testing my version of this function (included at the end) but the version supplied with Bell Tech's version of System V/386 3.2 passes my test. > > The 386-specific code does this: > > jz zeroexit > lahf // load low byte of flags to high byte of AX > cwtl // convert word to longword with sign extension > > There are two problems with this. One is that if EAX contains zero and > all the flag bits are 0, it will return 0 instead of positive. This > never seems to happen; apparently the unused bit 1 is always set to 1. The lahf instruction converts the 8086 16 bit flags register to the 8080 8 bit format. The jz before the lahf is used to ensure that a zero is returned only when equalty is true. The not equal case which is what is being determined is the only case where the lahf cwtl comes into play. You will get a negative number or a positive non-zero from the output of this sequence. Probably the comparison operation on the previous line was a rep cmpsb which will compare the two bytes via an 8 bit subtraction and set the flags. One of the unused bits is set to a 1 and was specified to do that on the 8085 (I think) so this is a very safe assumption. > Is anybody working on a fix for the code and (if necessary) the formal > specs for memcmp()? I can't make the original fail. Could you provide a short code segment that will display the claimed behavior? If you can make AT&T's fail, I would like to know if mine fails also. The official specs are copied above and the following is my replacement routine that can be almost 4 times faster than the AT&T implementation. ------------------------------------------------------------------------ .file "memcmp.s" / The Intel 80386 implementation of the ANSI memcmp() function. / Written April 8, 1990 by Doug Ingraham. dpi@loft386.UUCP / Copyright 1990 by Douglas P. Ingraham. This code is freely / distributable so long as this notice remains intact. .text .align 4 .globl memcmp memcmp: / Save edi and esi in edx and eax to save 4 clocks over popping them movl %edi,%edx / 2 clocks 2 bytes movl %esi,%eax / 2 clocks 2 bytes / esi is pointer to the first string movl 0x4(%esp),%esi / 2 clocks 4 bytes / edi is pointer to the second string movl 0x8(%esp),%edi / 2 clocks 4 bytes / This tests to see if the pointers are to the same string. cmpl %esi,%edi / 2 clocks 2 bytes je same / 7+m,3 clocks 2 bytes / The ecx gets the count. If Zero then by definition they are the same. movl 0xC(%esp),%ecx / 2 clocks 4 bytes jcxz same / 9+m,5 clocks 2 bytes / Test the count to make sure there are enough bytes to bother with this / speed optimization. 12 was chosen as a ballpark figure. Must be greater / than 7 as a minimum. cmpl $12,%ecx / 2 clocks 3 bytes jl byte / 7+m,3 clocks 2 bytes / Test to see if either pointer is word aligned. Most is 3 cmpsb's to align / so the loop is unrolled. 75 clocks worst case. testl $3,%esi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes testl $3,%edi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes cmpsb / 10 clocks 1 byte jne different / 7+m,3 clocks 2 bytes decl %ecx / 2 clocks 1 byte testl $3,%esi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes testl $3,%edi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes cmpsb / 10 clocks 1 byte jne different / 7+m,3 clocks 2 bytes decl %ecx / 2 clocks 1 byte testl $3,%esi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes testl $3,%edi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes cmpsb / 10 clocks 1 byte jne different / 7+m,3 clocks 2 bytes decl %ecx / 2 clocks 1 byte / Save the modified count back on the stack for possible later use. is_aligned: movl %ecx,0xC(%esp) / 2 clocks 4 bytes / Divide by 4 so that we can do word compares shrl $2,%ecx / 3 clocks 3 bytes / Do the long level search. repz; cmpsl / 5+9*n clocks 2 bytes je byte_equal / 7+m,3 clocks 2 bytes / backup 4 bytes movl $4,%ecx / 2 clocks 5 bytes subl %ecx,%esi / 2 clocks 2 bytes subl %ecx,%edi / 2 clocks 2 bytes / Do the byte level search. byte: repz; cmpsb / 5+9*n clocks 2 bytes je same / 7+m/3 clocks 2 bytes / Restore the esi so that eax is available for return value. different: movl %eax,%esi / 2 clocks 2 byte / This is a trick to put a positive or negative in the eax. lahf / 2 clocks 1 byte cwtl / 3 clocks 1 byte / Restore the edi. movl %edx,%edi / 2 clocks 2 bytes ret / 10+m clocks 1 byte / If the string compared out then come here. same: movl %eax,%esi / 2 clocks 2 bytes short_same: movl %edx,%edi / 2 clocks 2 bytes xorl %eax,%eax / 2 clocks 2 bytes ret / 10+m clocks 1 byte / come here if the high speed test was equal byte_equal: movl 0xC(%esp),%ecx / 2 clocks 4 bytes andl $3,%ecx / 2 clocks 6 bytes repz; cmpsb / 5+9*n clocks 2 bytes / Restore the esi so that eax is available for return value. movl %eax,%esi / 2 clocks 2 byte je short_same / 7+m/3 clocks 2 bytes / This is a trick to put a positive or negative in the eax. lahf / 2 clocks 1 byte cwtl / 3 clocks 1 byte / Restore the edi. movl %edx,%edi / 2 clocks 2 bytes ret / 10+m clocks 1 byte ------------------------------------------------------------------------ It is quite a bit longer than the original, but is much faster in the general case. Apologies to those with other architectures. -- Doug Ingraham (SysAdmin) Lofty Pursuits (Public Access for Rapid City SD USA) bigtex!loft386!dpi uunet!loft386!dpi
gls@cbnewsh.att.com (Col. G. L. Sicherman) (11/16/90)
In <1990Nov13.182133.27748@loft386.uucp>, dpi@loft386.uucp writes: > > I can't make the original fail. Could you provide a short code segment > that will display the claimed behavior? If you can make AT&T's fail, > I would like to know if mine fails also. Sure. Here are my test program and what came out of it: REPEAT BY: Compile this with cc and run it. main() { long a[4]; int i, j; int memcmp(); a[0] = 0x10000000; a[1] = 0x50000000; a[2] = 0x90000000; a[3] = 0xd0000000; for (i=0; i<4; i++) for (j=0; j<4; j++) if (i!=j) printf("memcmp(%lx,%lx,4) is %lx\n", a[i], a[j], memcmp(a+i,a+j, 4)); } THE DAMNING EVIDENCE: memcmp(10000000,50000000,4) is ffff8720 memcmp(10000000,90000000,4) is ffff8320 memcmp(10000000,d0000000,4) is 320 memcmp(50000000,10000000,4) is 224 memcmp(50000000,90000000,4) is ffff8724 memcmp(50000000,d0000000,4) is ffff8324 memcmp(90000000,10000000,4) is ffff8228 memcmp(90000000,50000000,4) is 228 memcmp(90000000,d0000000,4) is ffff8728 memcmp(d0000000,10000000,4) is ffff862c memcmp(d0000000,50000000,4) is ffff822c memcmp(d0000000,90000000,4) is 22c This gives 0x10 < 0x90 < 0x10, and 0x10 < 0x50 < 0x90 < 0xd0 < 0x10. Since you too use LAHF and CWTL, your code makes the same mistake -- it uses the SF bit instead of the CF bit. How did you test it? -- G. L. Sicherman gls@odyssey.att.COM
dpi@loft386.uucp (Doug Ingraham) (11/19/90)
In article <1990Nov15.163550.27015@cbnewsh.att.com>, gls@cbnewsh.att.com (Col. G. L. Sicherman) writes: > In <1990Nov13.182133.27748@loft386.uucp>, dpi@loft386.uucp writes: > > > > I can't make the original fail. Could you provide a short code segment > > that will display the claimed behavior? If you can make AT&T's fail, > > I would like to know if mine fails also. > > Sure. Here are my test program and what came out of it: [Test program deleted] I coded my routine from a Draft of the standard. The version provided with SYS V 386 performs a signed test which was acceptable in pre standard times to most. I have since updated my routine and it now does unsigned comparisons. Col Sicherman, I wrote my test routine to test for the signed version and wrote my original to match the version supplied by AT&T. This one passes your test and my more complete test. Feel free to use it in anyway you wish for any purpose. It is as bug free as I can make it. If anyone finds a problem or an improvement please let me know. ------------------------------------------------------------------------- .file "memcmp.s" / The Intel 80386 implementation of the ANSI memcmp() function. / Written April 8, 1990 by Doug Ingraham. dpi@loft386.UUCP / Copyright 1990 by Douglas P. Ingraham. This code is freely / distributable so long as this notice remains intact. / / Corrected November 18, 1990 to compare unsigned as per the standard. / .text .align 4 .globl memcmp memcmp: / Save edi and esi in edx and eax to save 4 clocks over popping them movl %edi,%edx / 2 clocks 2 bytes movl %esi,%eax / 2 clocks 2 bytes / esi is pointer to the first string movl 0x4(%esp),%esi / 2 clocks 4 bytes / edi is pointer to the second string movl 0x8(%esp),%edi / 2 clocks 4 bytes / This tests to see if the pointers are to the same string. / This test is probably a waste of time. cmpl %esi,%edi / 2 clocks 2 bytes je same / 7+m,3 clocks 2 bytes / The ecx gets the count. If Zero then by definition they are the same. movl 0xC(%esp),%ecx / 2 clocks 4 bytes jcxz same / 9+m,5 clocks 2 bytes / Test the count to make sure there are enough bytes to bother with this / speed optimization. 12 was chosen as a ballpark figure. Must be greater / than 7 as a minimum. cmpl $12,%ecx / 2 clocks 3 bytes jl byte / 7+m,3 clocks 2 bytes / Test to see if either pointer is word aligned. Most is 3 cmpsb's to align / so the loop is unrolled. 75 clocks worst case. testl $3,%esi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes testl $3,%edi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes cmpsb / 10 clocks 1 byte jne different / 7+m,3 clocks 2 bytes decl %ecx / 2 clocks 1 byte testl $3,%esi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes testl $3,%edi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes cmpsb / 10 clocks 1 byte jne different / 7+m,3 clocks 2 bytes decl %ecx / 2 clocks 1 byte testl $3,%esi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes testl $3,%edi / 2 clocks 6 bytes jz is_aligned / 7+m,3 clocks 2 bytes cmpsb / 10 clocks 1 byte jne different / 7+m,3 clocks 2 bytes decl %ecx / 2 clocks 1 byte / Save the modified count back on the stack for possible later use. is_aligned: movl %ecx,0xC(%esp) / 2 clocks 4 bytes / Divide by 4 so that we can do word compares shrl $2,%ecx / 3 clocks 3 bytes / Do the long level search. repz; cmpsl / 5+9*n clocks 2 bytes je byte_equal / 7+m,3 clocks 2 bytes / backup 4 bytes movl $4,%ecx / 2 clocks 5 bytes subl %ecx,%esi / 2 clocks 2 bytes subl %ecx,%edi / 2 clocks 2 bytes / Do the byte level search. byte: repz; cmpsb / 5+9*n clocks 2 bytes je same / 7+m/3 clocks 2 bytes / Restore the esi so that eax is available for return value. different: movl %eax,%esi / 2 clocks 2 byte / If the carry is set, return -1 jc less_than / 7+m/3 clocks 2 bytes greater_than: mov $1,%eax / 2 clocks 5 bytes / Restore the edi. movl %edx,%edi / 2 clocks 2 bytes ret / 10+m clocks 1 byte / If the string compared out then come here. same: movl %eax,%esi / 2 clocks 2 bytes short_same: movl %edx,%edi / 2 clocks 2 bytes xorl %eax,%eax / 2 clocks 2 bytes ret / 10+m clocks 1 byte / come here if the high speed test was equal byte_equal: movl 0xC(%esp),%ecx / 2 clocks 4 bytes andl $3,%ecx / 2 clocks 6 bytes repz; cmpsb / 5+9*n clocks 2 bytes / Restore the esi so that eax is available for return value. movl %eax,%esi / 2 clocks 2 byte je short_same / 7+m/3 clocks 2 bytes jnc greater_than / 7+m/3 clocks 2 bytes less_than: mov $-1,%eax / 2 clocks 5 bytes / Restore the edi. movl %edx,%edi / 2 clocks 2 bytes ret / 10+m clocks 1 byte ------------------------------------------------------------------------ > -- > G. L. Sicherman > gls@odyssey.att.COM Thanks for finding this. -- Doug Ingraham (SysAdmin) Lofty Pursuits (Public Access for Rapid City SD USA) bigtex!loft386!dpi uunet!loft386!dpi