[comp.unix.wizards] Complaint about complex architectures

bcase@amdcad.UUCP (04/01/87)

In article <6042@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <5@wb1.cs.cmu.edu> avie@wb1.cs.cmu.edu (Avadis Tevanian) writes:
>>... the 4.3 libc ... has been carefully optimized to use the fancy
>>VAX instructions for the string routines.  Unfortunately, some of
>>these instructions are not implemented by the MicroVAX-II hardware.
>>As it turns out, what is happening is that your tests (including
>>Dhrystone) are causing kernel traps to emulate those instructions!
>
>Exactly.  Strcpy, strcat, and strlen were all modified to use the
>Vax `locc' instruction to find the ends of strings.  This instruction
>is not implemented in hardware in the uVax II.  The obvious solution
>is to arrange the libraries so that on a uVax, programs use a
>straightforward test-byte-and-branch loop (see sample code below).
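
(Chris's sample code is not reproduced in this excerpt.  A
straightforward test-byte-and-branch strcpy in C -- a sketch of the
idea, not the actual 4.3BSD source -- looks about like this:)

	char *
	strcpy(to, from)
		register char *to, *from;
	{
		char *save = to;

		/* test each byte, branch on zero: no locc, no emulation trap */
		while ((*to++ = *from++) != '\0')
			;
		return (save);
	}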

This brings up one of my major beefs about complex architectures:  an
optimizing compiler might have to do different things depending upon
the *version* of a CPU it is compiling for!  An optimizing compiler
that is considered "a great compiler" for one version of a CPU might
be "a mediocre" compiler for the next version of the machine.  The
compiler writer discovers that some obvious code sequences are not the
best ones for the current version of the machine, but then the
implementors of the next version "fake him out" by changing the
relative timings of the instructions.  (And note that determining
instruction timings for some machines, e.g. VAXes, is nearly
impossible, since DEC just won't tell you.  This makes superior code
generation a nightmare.)

One of the reasons that simple architectures are better for compilers
is that (nearly) all instructions take the same amount of time and space.
Thus, code generation and optimization are *much* easier.  Also, this
relationship of one time unit/one space unit per instruction is unlikely
to change as a function of CPU version.
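
For instance, both functions below compute n * 5.  On a complex machine,
whether the code generator should pick the multiply or the shift-and-add
form can flip from one CPU version to the next; on a machine where every
instruction costs one cycle, the choice never changes.  (This is just an
illustration of the argument, not output from any particular compiler.)

	int times5_mul(n)	/* the multiply form */
	int n;
	{
		return (n * 5);
	}

	int times5_shift(n)	/* the shift-and-add form */
	int n;
	{
		return ((n << 2) + n);
	}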

    bcase

zben@umd5.UUCP (04/04/87)

In article <15341@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:

> This brings up one of my major beefs about complex architectures:  an
> optimizing compiler might have to do different things depending upon
> the *version* of a CPU it is compiling for!  An optimizing compiler
> that is considered "a great compiler" for one version of a CPU might
> be "a mediocre" compiler for the next version of the machine.

Gosh, I seem to remember a Cobol compiler that generated different code for
programs with the following two directives:

Object-Computer is Univac-1108.

Object-Computer is Univac-1108 with four memory boxes.

Forgive me if the dashes are in the wrong places.  It's been a LONG time.
(Not long enough though...)  I don't buy the complexity argument.  You're
arguing that bicycles are better than cars because they are easier to fix
and easier to learn to drive, while completely forgetting the performance
differences.

Case in point:  I just came up with a fast integer square-root routine for
a local project (written in C, available on request).  It has one multiply
within the main loop.  I also have a Unisys 1100 assembly version with NO
multiplies in the loop, but I can't translate it to C because C doesn't
have the double register operations, double precision shifts, and there is
no easy way to code for the LSC (load shift and count) instruction other
than yet another C loop.
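
(One way to get an integer square root with a single multiply in the
loop is a binary search on the root.  The sketch below is illustrative
only -- it is not the routine mentioned above -- and assumes unsigned
long is at least 32 bits, so the root fits in 16:)

	unsigned long
	isqrt(n)
		unsigned long n;
	{
		unsigned long lo = 0, hi = 65536, mid;

		/* invariant: lo*lo <= n < hi*hi */
		while (hi - lo > 1) {
			mid = (lo + hi) >> 1;
			if (mid * mid <= n)	/* the one multiply */
				lo = mid;
			else
				hi = mid;
		}
		return (lo);
	}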

I guess the point here is that it is possible for a dedicated assembly
language programmer to effectively utilize these complex architectures
to fly rings around anything written in a higher-level language.  It is
also possible for a really brilliantly written code generator to approach
this kind of performance.  Any attempt to simplify these architectures
had better deliver blinding increases in hardware speed, or I'm still
going to think it's a plot by the programmers and compiler writers to
shirk their responsibilities...
-- 
                    umd5.UUCP    <= {seismo!mimsy,ihnp4!rlgvax}!cvl!umd5!zben
Ben Cranston zben @ umd2.UMD.EDU    Kingdom of Merryland UniSys 1100/92
                    umd2.BITNET     "via HASP with RSCS"

rbj@icst-cmr.arpa (04/09/87)

   Case in point:  I just came up with a fast integer square-root routine for
   a local project (written in C, available on request).  It has one multiply
   within the main loop.  I also have a Unisys 1100 assembly version with NO
   multiplies in the loop, but I can't translate it to C because C doesn't
   have the double register operations, double precision shifts, and there is
   no easy way to code for the LSC (load shift and count) instruction other
   than yet another C loop.

What, no `asm' directive? How about `sed'-ing the assembly output?

   I guess the point here is that it is possible for a dedicated assembly
   language programmer to effectively utilize these complex architectures
   to fly rings around anything written in a higher-level language.  It is
   also possible for a really brilliantly written code generator to approach
   this kind of performance.  Any attempt to simplify these architectures
   had better deliver blinding increases in hardware speed, or I'm still
   going to think it's a plot by the programmers and compiler writers to
   shirk their responsibilities...

I'm not so sure. Remember the day Mike McAmis (heard from him lately?)
was so proud he had occasion to use `Add Negative Thirds'? Remember
the `convert Fieldata to (or from) Binary' using `Masked Load Uppers'
with strings like `B0B0B0' and `888888'? Pretty arcane stuff.

They just don't make them like that anymore. The machines were designed
by engineers (remember, we dropped out of engineering and into computer
science) who thought, `yeah, it's easy to do ANT, I'll just gate some
of these carries end around (note: the U1108 is one's complement) instead
of to the next bit' rather than finishing the *useful* instruction set.
Therefore, we have `Test Greater' but not `Test Less'.
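
(For anyone who never met a one's-complement machine: `end around'
means a carry out of the top bit gets added back in at the low end,
rather than propagating to a next bit that isn't there.  A sketch in
C, modeling 16-bit words instead of the 1108's 36-bit ones:)

	#define MASK	0xFFFFL		/* 16-bit one's-complement word */

	unsigned long
	add1c(a, b)			/* a, b already masked to 16 bits */
		unsigned long a, b;
	{
		unsigned long sum = a + b;	/* may carry out of bit 15 */

		/* gate the carry end around, back into bit 0 */
		return ((sum & MASK) + (sum >> 16)) & MASK;
	}

So add1c(2, 0xFFFEL) -- that is, 2 + (-1) -- gives 1, and
add1c(1, 0xFFFEL) gives 0xFFFF, the infamous minus zero.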

Yeah, it was fun to look thru code, trying to slice off an instruction
here or there, looking for faster instructions, etc. But those days
are gone now, and it's conceptual clarity that counts.

I'm sure you know all the arguments about simplified decoding that
RISC is supposed to deliver. If the speed isn't delivered, at least
the machine should be cheaper.

It takes a Real Programmer to find occasion to use those Macho
instructions. What makes you think some Wimpy compiler can do it? :-)


(Root Boy) Jim "Just Say Yes" Cottrell	<rbj@icst-cmr.arpa>
I'm mentally OVERDRAWN!  What's that SIGNPOST up ahead?
Where's ROD SERLING when you really need him?