[comp.arch] Programming and Machine Operations

cik@l.cc.purdue.edu (Herman Rubin) (07/08/89)

In article <13970@haddock.ima.isc.com>, suitti@haddock.ima.isc.com (Stephen Uitti) writes:
> >For division with remainder, the code should
> >be machine language, not C.  Would you use C to do machine-independent
> >implementation of division?  You would not think of it.  Neither would
> >I think of making similar operations such as frexp, divrem, exponentiation,
> >absolute value, integer part, etc., machine-independently coded in C.
> >I would not consider coding the =UP above in C.  It is an additional
> >primitive operation, like +, *, /, &.
> 
> Not only would i think of it, i've done it.  A check is made to see that
> the compiler really generates the right instructions.

Of course algorithms should be checked.  And it is quite likely that the
checking procedure will be relatively inefficient.  But to include
fantastically slow machine-independent code in a VAX-specific source
library is criminal, especially since even a klutz who understands the
VAX instructions would always produce an algorithm that beats it.  And
making abs and fabs function calls!  This in a widely-distributed
product!  It is not at all difficult to pose natural problems in
which the frexp provided would be 1000 times as slow as crude inline
code.  And frexp was the first thing done in log and sqrt in 4.2BSD.
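
To illustrate what crude inline code looks like: on an IEEE-754 machine
(an assumption for this sketch, written in modern C; the VAX has its own
F and D formats), the exponent is just a bit field, and pulling it out
takes a few integer operations instead of a function call.

#include <math.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Assumes IEEE-754 doubles and a normal, nonzero argument.
 * Returns the exponent in frexp's convention (0.5 <= mantissa < 1). */
static int inline_exp(double x)
{
	uint64_t bits;
	memcpy(&bits, &x, sizeof bits);	/* reinterpret the bit pattern */
	return (int)((bits >> 52) & 0x7ff) - 1022;
}

int main(void)
{
	int e;
	double m = frexp(80.0, &e);	/* library call */
	printf("frexp: %g * 2^%d   inline: 2^%d\n", m, e, inline_exp(80.0));
	return 0;
}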

			...................

> While i find such papers hard to read (especially since the
> definition of the new notation syntax is seldom given), for
> programs it is worse.  Even with the C preprocessor's limited
> capabilities, one can write code that doesn't bear any
> resemblance to C.  Maintenance of this code is nearly impossible.
> Thus, sadly, notational flexibility and maintainability are at odds.

If the definition of the new notation is not given, it is considered
by the referee and associate editor not to be new.  It is expected that
someone with the necessary background to read the paper already knows
the notation.  I agree that this is not always the case, but even here
the author typically refers to previous papers for the notation.

> >A simple example of the lack of desirability
> >of C portability is the addition of two vectors.  The good coding of this
> >depends on the costs of *x++ in loading and storing and the use of
> >indexing in loading and storing.  These depend both on the machine and the
> >storage classes of the addresses.
> 
> The code sequence is:
> 
> void lvecadd(a, b, c, s) /* a = b + c, length s */
> long *a; long *b; long *c; long s;
> {
> 	do {
> 		*a++ = *b++ + *c++;
> 	} while (--s != 0);
> }

I am afraid I will have to give you a D- on this.  Most of the time I would
not even bother with a call, considering the code length.  But the code is
bad for vectors of length >2.  How about


{
	end = a + s;
 	do {
 		*a++ = *b++ + *c++;
 	} while (a < end);
}

I hope I am not unusual, but I consider things like this extremely obvious.
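
Spelled out as a complete function (declarations supplied; like both
versions, it assumes s > 0):

void lvecadd(a, b, c, s)	/* a = b + c, length s */
long *a; long *b; long *c; long s;
{
	long *end = a + s;	/* one past the last element written */
	do {
		*a++ = *b++ + *c++;
	} while (a < end);
}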

> or:
> 
> void lvecadd(a, b, c, s) /* a = b + c, length s */
> long *a; long *b; long *c; long s;
> {
> 	register long t;
> 	for (t = 0; t < s; t++)
> 		a[t] = b[t] + c[t];
> }
> 
> A '205 C compiler should (doesn't but should) take either
> piece of code and translate it into the instruction.

This was not the question.  I find the first code natural and efficient
on a VAX, and the second natural and efficient on a PYRAMID.  This comes
from reading the description of the addressing modes on the two machines.

> I've seen a C compiler which did better for the second example
> than the first.  What was odd was that the target machine, a
> 68000, does better with the first algorithm.  Worse, the second
> piece of code was effectively translated into the first, using
> redundant temporary variables for the pointers.  Somehow, the
> compiler thought that extra work had to take place for the first
> example.  The point, though, is that the programmer visible
> syntax need not bear any resemblance to the quality of the
> produced code.

I have yet to see a decent compiler.  However, I see no reason why the
human programmer should have any difficulty with this type of problem.
Is it the case that our training of programmers removes their native
ability to see things like this?  

> Portability has its costs - usually about 10 to 20% of the
> optimum machine speed.  Compiler technology has been attempting
to narrow this.  However, it has enormous benefits.  UNIX, being
> somewhat portable, allows new architectures to become useful in
> an extremely short time after manufacture (something like a
> year).  The speed penalty that comes from UNIX not being designed
> for the architecture is offset by the years of development that
> it takes to create a new OS.  By that time, the architecture may
> well be obsolete.  User applications are the same way.  The value
> of using a portable language is more than just being able to use
> code that was written on one machine on another.  It takes time
> for programmers to become proficient in a language.  If you write
> a new language for every program you write, no one will be able
> to make heads or tails of what you've done.

I can put up with 10 to 20%.  But I do not see that.  I see 50% or more.

It sure does take time for a programmer to become proficient in a
language.  Languages are complicated.  And there is no way the language
writers can anticipate what is needed.  They make no attempt to find out
what additional features the knowledgeable users want.  They complain,
rightly, that mathematicians do not use the computer enough, and then
almost go out of their way to leave out what is wanted.  This extends
to the hardware people, also.  I was told by an IBM physicist that the
IBM scientific people had complained about the 360 and its successors
having no direct path between the integer registers and the
floating-point registers.  At that time, the fix would have been easy.
Another example: if a machine has hardware division, it should have
hardware square root.  SOME hardware people see this as obvious, but
others do not see why there is any difference between square root and
transcendental functions.
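
To make the square root point concrete: given a hardware divide, Newton's
iteration x <- (x + a/x)/2 roughly doubles the number of correct digits
per step, so a handful of divides already yields full precision.  A
minimal sketch (my own illustration, not anyone's library code):

#include <stdio.h>

/* Newton's method for sqrt(a), a > 0: each step costs one divide, one
 * add, and one halving.  Starting above the root, the iterates decrease
 * monotonically, so a non-decreasing step signals convergence. */
double nr_sqrt(double a)
{
	double x = a > 1.0 ? a : 1.0;	/* crude guess, >= sqrt(a) */
	double next;

	for (;;) {
		next = 0.5 * (x + a / x);
		if (next >= x)		/* converged to rounding level */
			return x;
		x = next;
	}
}

int main(void)
{
	printf("%.17g\n", nr_sqrt(2.0));	/* 1.4142135623730951 */
	return 0;
}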

As for operating systems: maybe UNIX is a simplified version of
MULTICS, but I still consider it horribly complicated.  I believe
that the OS for a machine should be of the KISS variety, with everything
else added on.  And the syntax!  Who would use mv for rename?  I will
not go into further details.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

suitti@haddock.ima.isc.com (Stephen Uitti) (07/09/89)

In various articles, cik & suitti argue...
>>suitti
>cik

>> The code sequence is: ...
>> 
>> void lvecadd(a, b, c, s) /* a = b + c, length s */
>> long *a; long *b; long *c; long s;
>> {
>> 	do {
>> 		*a++ = *b++ + *c++;
>> 	} while (--s != 0);
>> }
For the VAX:
L18:	addl3	(r9)+,(r10)+,r0
	movl	r0,(r11)+
	decl	r8
	jneq	L18

>
>I am afraid I will have to give you a D- on this.  Most of the time I would
>not even bother with a call, considering the code length.  But the code is
>bad for vectors of length >2.  How about
>{
>	end = a + s;
> 	do {
> 		*a++ = *b++ + *c++;
> 	} while (a < end);
>}

L26:	addl3	(r9)+,(r10)+,r0
	movl	r0,(r11)+
	cmpl	r11,r7
	jlss	L26

These are the loops as coded by the VAX PCC-based C compiler.
I've no idea why it felt the movl was needed.  In any case both
code fragments compile to the same number of instructions.  My
decl has fewer arguments than your cmpl.  Without running the
test, there is no way to know which will run faster.  It depends
on which VAX.  A microVAX might (and probably does) have
different relative times than a 780.  Still, i expected the
compiler to use some sort of subtract one and branch if not zero
instruction instead of a decrement, and a branch.  Unfortunately,
i didn't check my VAX architecture handbook in advance, and i'd
forgotten (through disuse) that the VAX has an SOBGTR rather than
an SOBNE.  I should have used
	} while (--s > 0);
On a PDP-11, the "SOB" instruction uses a test for not equal to
zero.  Further, on a PDP-11 (and many other machines) an extra
variable may strain the register resources of the machine
requiring some high use variable be put into slower storage.  A
really good compiler should have been able to convert the end
condition and use the right instruction.  The VAX PCC-based
compiler is not a "really good compiler".
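
For the record, here is the loop with the corrected test; a sufficiently
good compiler could then map the countdown straight onto SOBGTR:

void lvecadd(a, b, c, s)	/* a = b + c, length s */
long *a; long *b; long *c; long s;
{
	do {
		*a++ = *b++ + *c++;
	} while (--s > 0);	/* matches SOBGTR: branch while > 0 */
}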

Also, i was considering inlined code, even if originally
presented as a function.  If the vector lengths are large (and
experience shows that they never are), it doesn't matter if it is
a subroutine or not, just as the computation for 'end' in your
example is immaterial.

Yeah, i know, facts in an argument.

Fortunately, i went to a school that didn't have letter grades.
It was pass/fail.  Many schools, including Purdue, often treat
their grad students with some dignity and respect.  I had the
good fortune to find an undergrad school that did so.  I never
had to argue with instructors.  It was always real clear if i'd
passed or failed (did i do the work? yes: pass, no: fail).  Thus,
i never got a D- for code that worked.  Once, however, in high
school, i managed to get an A (best you could get), with a U
(unsatisfactory) for effort.

Stephen.

hascall@atanasoff.cs.iastate.edu (John Hascall) (07/09/89)

In article <13979@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:
>In various articles, cik & suitti argue...
 
>>> The code sequence is: ...
>>> 
>>> void lvecadd(a, b, c, s) /* a = b + c, length s */
>>> long *a; long *b; long *c; long s;
>>> {
>>> 	do {
>>> 		*a++ = *b++ + *c++;
>>> 	} while (--s != 0);
>>> }
>For the VAX:
>L18:	addl3	(r9)+,(r10)+,r0
>	movl	r0,(r11)+
>	decl	r8
>	jneq	L18
 
>>I am afraid I will have to give you a D- on this.  Most of the time I would
>>not even bother with a call, considering the code length.  But the code is
>>bad for vectors of length >2.  How about
>>{
>>	end = a + s;
>> 	do {
>> 		*a++ = *b++ + *c++;
>> 	} while (a < end);
>>}
 
>L26:	addl3	(r9)+,(r10)+,r0
>	movl	r0,(r11)+
>	cmpl	r11,r7
>	jlss	L26
 

   Of course, both routines fall in the dumper if the vector length is 0 :-)

      void lvecadd(a, b, c, s)
      register long *a, *b, *c, s;         /* a[0..s-1] = b[0..s-1] + c[0..s-1] */
      {
              if (s <= 0) return;          /* never trust your callers */
	      do {
		      *a++ = *b++ + *c++;
              } while (--s > 0);
      }

   giving (using VAX C V2.4-026 [VMS]):

	                0000  lvecadd:                                  
                  007C  0000          .entry  lvecadd,^m<r2,r3,r4,r5,r6>
            5E 04 C2    0002          subl2   #4,sp                     
         53 04 AC D0    0005          movl    4(ap),r3                  
         52 08 AC D0    0009          movl    8(ap),r2                  
         51 0C AC D0    000D          movl    12(ap),r1                 
         50 10 AC D0    0011          movl    16(ap),r0                 
	       01 14    0015          bgtr    sym.1
		  04    0017          ret
			0018  sym.1:
         83 81 82 C1    0018          addl3   (r2)+,(r1)+,(r3)+
	       50 D7    001C          decl    r0   
	       F8 14    001E          bgtr    sym.1

   I'm not sure why "sobgtr r0,sym.1" wasn't used for the last two
   instructions; is "decl r0 / bgtr sym.1" faster??  Does the optimizer
   not recognize the decrement/branch sequence??  [maybe someone with a
   newer version of the compiler could try it.]

   Why are r4,r5,r6 being saved?  And what the devil is "subl2 #4,sp" for?
   Can't the optimizer clean up after itself??

   At least it got the important part (the loop) right by using (rn)+
   for all the operands.

   John Hascall