[net.arch] FLAME!!! Re: EA orthogonality

jack@boring.UUCP (05/20/85)

[ Note that I added net.arch to the newsgroup, since this is probably
where this discussion belongs]

In article <557@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>**** WARNING ****   The following comments are not as nice as etiquette
>recommends.
Agreed. Also, I think they're not true.
> 
>> I think total orthogonality would be *very* useful.
>> ...
>> A 68K compiler has to think about modifying the branch condition, etc.
>> A 32K compiler just generates code in the way it sees the statement.
>> 
>> Of course, an optimizer might throw everything around again
>> to save registers or whatever, but the initial code generation is
>> much simpler in the 32K case.
>
>What in heck do you think we users are paying you compiler writers to
>DO?
>
>The purpose of a CPU is to solve the *user's* application as quickly as
>possible.
Agreed. In my opinion, this means that the CPU should be optimized for
what most users do most of the time: running high-level language
programs.

>
>The purpose of a CPU is *NOT* to be as easy to write a compiler for as
>possible.
Not agreed. If a machine is simple, the compiler is simpler, and thus it
is available sooner, doesn't have as many bugs, etc.
>
>Why on earth should the design of a CPU be based on how easy it will
>make the jobs of the five people who will write the compilers for it?
Because *EVERYONE* will use the product of those five people.
If, for instance, a compiler for a certain machine generates lousy
code for a for-loop, because the compiler writers were too busy getting
the compiler to *work* to have time to optimize it, that will eventually
waste *HOURS* of CPU time for everyone using it.
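
To make that concrete, here is the kind of bread-and-butter loop I mean
(a made-up illustration; the function and its names are mine, not taken
from any real compiler test):

	/*
	 * Hypothetical example of the everyday case.  A code generator
	 * written in a hurry may reload `a' and `n' from memory on every
	 * trip around the loop; one the writers had time to polish keeps
	 * them, and `s' and `i', in registers.  The difference is paid by
	 * every user, on every run.
	 */
	long sum(long *a, long n)
	{
		long s = 0;
		long i;

		for (i = 0; i < n; i++)
			s += a[i];
		return s;
	}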

This is also the whole point behind RISC architecture, one of the
rising stars at the moment.
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

doug@terak.UUCP (Doug Pardee) (05/22/85)

me>The purpose of a CPU is *NOT* to be as easy to write a compiler for as
me>possible.

> Not agreed. If a machine is simple, the compiler is simpler, and thus it
> is available sooner, doesn't have as many bugs, etc.

Did I miss something here?  Since when is it any concern of mine, as a
user, whether the compiler is simple???

And I have seen no evidence that compilers for "simple" machines are
available any sooner, or are any more reliable, than compilers for
warpo machines.

me>Why on earth should the design of a CPU be based on how easy it will
me>make the jobs of the five people who will write the compilers for it?

One response:

> Because *EVERYONE* will use the product of those five people.

But that doesn't address the question as to why the comfort and
convenience of those five people is of any concern to "*EVERYONE*".

Another response:

> If you have tried to hire good
> compiler people lately, you know that compiler-writer time is neither
> cheap nor in infinite supply.

Ah, here we finally get to the nitty-gritty.  What we're saying is that
we want to have CPUs that are easy to write compilers for so that we can
hire less-capable (aka *cheaper*) programmers to write the compilers!!!

Given how few micro-processor instruction sets there are, and how few
languages of interest, you don't *need* an "infinite supply" of compiler
programmers.  In fact, about a dozen could do the job for the entire
microcomputer world.  There are certainly a dozen top-notch compiler
programmers available for this task.  And given the importance of having
good compilers, they're worth whatever they get paid.

But CPUs and compilers are put out by IC manufacturers, and they
understand chips better than software.  So they tend to put their money
into design work on the chip, and hire cheap programming labor to
produce less-than-thrilling compilers.  Since the manufacturers'
compilers are often poor, third-party operations spring up all over the
place to try to cash in.  Typically underfinanced, these operations
*also* hire cheap programming labor and produce less-than-thrilling
compilers.  And the vacuum remains, so even more third-party start-ups
appear.

For heaven's sake, how many C compilers do we have to develop for the
68000 before we get one that's good???  Wouldn't it have been a whole
lot easier if Motorola or Microsoft or *someone* had put up the bucks
necessary to hire real compiler writers in the first place?

I think it makes more sense to take compiler-writing seriously, rather
than try to kludge the CPU so that every basement hacker can write what
he calls a "compiler".
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

kds@intelca.UUCP (Ken Shoemaker) (05/23/85)

> 
> [ Note that I added net.arch to the newsgroup, since this is probably
> where this discussion belongs]
> 
> >
> >The purpose of a CPU is *NOT* to be as easy to write a compiler for as
> >possible.
> Not agreed. If a machine is simple, the compiler is simpler, and thus it
> is available sooner, doesn't have as many bugs, etc.
> 
> This is also the whole point behind RISC architecture, one of the
> rising stars at the moment.
> -- 
> 	Jack Jansen, jack@mcvax.UUCP
> 	The shell is my oyster.

Not entirely true.  Consider a few features typical of RISC designs, which
are there for hardware speed rather than for the compiler writer's
convenience:

- the only instructions that can access memory are mov (or load) operations

- jumps take effect only after the instruction following the jump has been
  executed (delayed branches)

- some don't have hardware interlocks to prevent a register from being read
  before a previous write to it has completed, so the compiler has to
  schedule enough work in between to avoid picking up a stale value

- they don't allow arbitrary byte boundaries for code/data

You can argue that coping with all this is merely code reorganization, but
these features are there so that you can eliminate both hardware pipeline
stages and the delays within the stages that remain.  Just my impressions...
-- 
It looks so easy, but looks sometimes deceive...

Ken Shoemaker, Intel, Santa Clara, Ca.
{pur-ee,hplabs,amd,scgvaxd,dual,omovax}!intelca!kds
	
---the above views are personal.  They may not represent those of Intel.

jack@boring.UUCP (05/25/85)

I'm not sure whether Doug Pardee is serious, or just trying to keep
the discussion going. I'll assume he *is* serious, and answer
him anyway.

Doug>The purpose of a CPU is *NOT* to be as easy to write a compiler for as
Doug>possible.

me> Not agreed. If a machine is simple, the compiler is simpler, and thus it
me> is available sooner, doesn't have as many bugs, etc.

Doug>Did I miss something here?  Since when is it any concern of mine, as a
Doug>user, whether the compiler is simple???

Exactly for the reasons stated above. You don't want to argue that an
8000 line compiler is easier to maintain, debug, etc. than one of
4000 lines, I hope?


Doug>And I have seen no evidence that compilers for "simple" machines are
Doug>available any sooner, or are any more reliable, than compilers for
Doug>warpo machines.

No? Get yourself a PR1ME, and try the pascal compiler :-(

Now, I won't comment on the rest point-by-point, since it would be
too long-winded that way. Let me just explain the following point:

When you are designing a machine, you are facing two size problems:
1. How do I fit all those transistors on this little square?
2. How do I fit all those opcodes in those 16 bits?

An orthogonal design is clearly good for (1), since it allows you to
reuse the hardware (or firmware) that calculates "x'100(a6:B)[d0]"
many times.

Now, to satisfy (2), you can do two things:
- Make the operand fields small, so you can have many opcodes.
- Make the opcode fields small, so you can have complicated operands.
(I won't go into RISC here, which makes both of them small).

If you take the first choice, you can have lots of nifty instructions
like 'search for a one bit, and return the position in a register'
or 'copy a string and translate' and those kinds of things, which will
*never* be used by *any* compiler (except for COBOL, maybe) since
most high-level languages don't have a construct for them.
Can you imagine a compiler that would recognize
	for(p=src, q=dst; *p; p++, q++)
		*q = table[*p];
and translate it into the above-mentioned instruction?

If you take the second choice, you will *not* have a string translate
instruction. You will, however, have the ability to make your
design orthogonal. Wirth (I think; I'm not sure) measured long ago
that the average expression has about 1.5 operands. This means that
half of the statements you write will be expressible in *one*
instruction, provided that the machine lets you address something on
the stack as an operand.

For example:
	a += b;
orthogonal:
	add	b(r5),a(r5)
non-orthogonal:
	mov	a(r5),r3		<-- AND MAKE SURE IT'S FREE!!
	add	b(r5),r3
	mov	r3,a(r5)
Now, in cycles, the first one would result in 4 memory cycles and
3 additions, and the second in 6 memory cycles and 4 additions (PLUS
an additional 2 instruction decodes).

Well, this has got long-winded after all, sorry for that.
You may do what you want, but I'll stick to hardware that was designed
by software people. 
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

cdshaw@watmum.UUCP (Chris Shaw) (05/27/85)

>The purpose of a CPU is *NOT* to be as easy to write a compiler for as
>possible.
>

All right then, what IS the purpose of a CPU?? It would seem to me
that the purpose of a CPU is to run programs. The purpose of a well-designed
instruction set is to make it as easy to program as possible without
sacrificing performance.

Now, it also seems to me that an intelligent CPU design takes into account
the types of programs that will run on it. Thus, it's obvious that the 8035
was never designed to be anything more than a controller. When designing
the 32032 then, the kind of programs the designers of the chip had in mind 
were those that would be created by high-level languages. Thus, they made the
instruction set as easy as possible to write compilers for. Obviously,
orthogonality doesn't matter quite so much on a controller, where the programmer
is a human, not a program. On a general-purpose CPU, however, most programs will
be created by programs (compilers), so it makes sense to tailor the instruction
set to its intended programmers.

Anybody who has written a compiler will tell you that ortho machines are easier
to write compilers for. It's a simple fact that has been true since day 1.
The benefits of programs that are easy to write vs hard to write are as follows:

	1) Productivity of the programmer is much higher. Despite Mr. Trissel's
	   comments, compiler writers are harder to come by than (say) COBOL
	   programmers, and are therefore more expensive. Simply asking for
	   better programmers doesn't solve this problem. Therefore, the more
	   productive your programmers, the better. Of course, if the market
	   for 8035 C compilers is twice that for 68000 C compilers, then
	   maybe start writing 8035 stuff, but that's another matter entirely.

	2) Program correctness (lack of compiler bugs). All things being equal
	   (which they aren't), a compiler for a weird machine produced from
	   N man-months of labour will generally be less correct than one for
	   an ortho machine. This point is really an outgrowth of productivity.
	   Almost as importantly, an ortho compiler will be easier to maintain
	   and fix bugs in than one for a non-ortho machine, since there is no
	   complicated register-assignment algorithm, etc...

	3) Object code speed. Given that CPUs x and y have the same hardware
	   but different instruction sets (2 microcode sets, say), compiler code
	   produced for the ortho version is most likely going to be faster,
	   since special-purpose register decisions are not reflected in the
	   code. In other words, non-orthogonality generates superfluous moves
	   that would probably not be necessary on an ortho machine. This point
	   is true whether the code is compiler or human produced. The lack
	   of a general reg-to-reg add on the Z80, for example, causes many
	   wasted reg-to-reg MOVs, or (worse) reg-to-memory MOVs.

>I think it makes more sense to take compiler-writing seriously, rather
>than try to kludge the CPU so that every basement hacker can write what
>he calls a "compiler".
>-- 
>Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug

I think this point is ripe nonsense. The bit which grabs me worst, of course,
is the twisted use of the word "kludge". And as for this garbage about basement
hackers, well.... (I guess it's time to go upstairs for a beer & mellow out :-)

Chris Shaw    watmath!watmum!cdshaw  or  cdshaw@watmath
University of Waterloo
In doubt?  Eat hot high-speed death -- the experts' choice in gastric vileness !

johnl@ima.UUCP (05/29/85)

/* Written  9:10 pm  May 27, 1985 by g-frank@gumby in ima:net.micro.68k */

>    Clever compilers for almost any language but C can paper over most sorts
> of yawning chasms.  Modula-2, Pascal, Ada, all are
> languages that port quite well to the 8086 family, and produce efficient,
> readable code without any sort of trickery required of the programmer.
> I have a stupid 68000 system in my basement
> that I can't use and can't sell because there's no software for it, and one of
> those [Intel CPU] vending machines sitting on my desk.

I wish I could get great performance out of my vending machine CPU merely by 
switching languages.  Unfortunately, the last time I looked, Modula, Pascal, 
and Ada compilers didn't produce notably better code than did C compilers.  
Would it were true.  In every case, if your total data is bigger than 64K, 
something gives.  You always find limits, like the total automatic data for 
a procedure (or, often, the whole program) being less than 64K.  I'm not
talking about a single huge array -- it's just as bad if you have lots of
small things all of which add up to more than 64K.
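
Here is the sort of thing I mean (a made-up example of mine, not from any
particular program):

	/*
	 * Hypothetical illustration: no single object here is huge, but the
	 * automatics of one procedure add up to well past 64K (assuming
	 * 4-byte longs), which is exactly where "something gives".
	 */
	#define ENTRIES 10000

	long total(void)
	{
		long table_a[ENTRIES];	/* 40000 bytes               */
		long table_b[ENTRIES];	/* 40000 more: past 64K now  */
		long sum = 0;
		int i;

		for (i = 0; i < ENTRIES; i++) {
			table_a[i] = i;
			table_b[i] = -i;
			sum += table_a[i] + table_b[i];
		}
		return sum;
	}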

The need to distinguish between long and short pointers in order to produce 
decent code always pops up in 8086 compilers somehow, either by not allowing 
long pointers, by generating poor large-model code everywhere, or by putting 
some wart in the language that lets the programmer tell the compiler what's 
long and what's short.  
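
In source code the wart usually looks something like this (an illustrative
sketch only -- `far' is one common spelling of the keyword, and the
HAVE_FAR_KEYWORD guard is a name I made up for the example):

	/*
	 * Some 8086 C compilers accept a non-standard `far' keyword so the
	 * programmer can mark which pointers need the full segment:offset
	 * form.  The guard below is hypothetical; it just lets the same
	 * source go through a compiler that has never heard of `far'.
	 */
	#ifndef HAVE_FAR_KEYWORD
	#define far			/* elsewhere, the wart simply vanishes */
	#endif

	char     *scratch;	/* "short" pointer: 16-bit offset within 64K */
	char far *bigbuf;	/* "long" pointer: 32-bit segment:offset     */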

There were legitimate reasons for IBM to pick the 8088 for the PC.  I gather
that the main competitor at the time was the Z80, so we should be thankful
for something, since 68008s weren't sufficiently available, and for price
reasons they wanted to stick with an 8-bit bus.  And I also recall that the
original PC came with only 16K, and loading up the machine past 128K was a
big deal.  But none of that means that the 8088 or the 286 is at all easy
to program for the sorts of things that people are doing on PCs now.  It also
doesn't mean that a chip that was designed to be spiritually compatible with
the 8080 is much of a choice for a general computing engine.

John Levine, ima!johnl

PS:  I hear that for applications with limited amounts of data and lots of
real-time I/O requirements, such as controlling vending machines, the 8088
is just great.

thomson@uthub.UUCP (Brian Thomson) (05/29/85)

Chris Shaw writes about orthogonality:
>When designing
>the 32032 then, the kind of programs the designers of the chip had in mind 
>were those that would be created by high-level languages. Thus, they made the
>instruction set as easy as possible to write compilers for. 
>On a general-purpose CPU, ... most programs will
>be created by programs (compilers), so it makes sense to tailor the instruction
>set to its intended programmers.

In my experience, the difficulty of (decent) compiler construction is affected
less by orthogonality than by the number of code sequences that must be
considered when implementing a given source language construct.
The C statement
		a = b * c + d + e;
might, in different contexts, be implemented on your 32032 as:

		movd	_b,r0
		muld	_c,r0
		addd	_d,r0
		addd	_e,r0
		movd	r0,_a

or, if c is the constant 2, d a stack local, and e the constant 4,

		movd	_b,r0
		addr	4(-4(fp))[r0:w],_a

or even, if b, c, and d are all unsigned shorts, and e == b,

		movzwd	_b,r0
		indexw	r0,_c,_d	; b * (c+1) + d
		movd	r0,_a

Does that last one look ridiculous?  That's exactly my point: it's the
best code sequence under the given set of assumptions, and no compiler
would ever find it.  If these fancy addressing modes and high-level
language oriented instructions could be added without penalizing the
performance of bread-and-butter instructions, I'd be all for it, but
such is never the case.

If a machine forces me to put something in a data register before I
can add to it, and has no exceptions to this rule, it will be easy to
generate code.  It only gets tough when there are options.
-- 
		    Brian Thomson,	    CSRI Univ. of Toronto
		    {linus,ihnp4,uw-beaver,floyd,utzoo}!utcsrgv!uthub!thomson

doug@terak.UUCP (Doug Pardee) (06/03/85)

Wait a second!  It looks like I should have used one of my "patented"
200-line postings, because an awful lot of people have misinterpreted
my comments.

The original posting to which I had responded did *not* say that EA
orthogonality would result in better compiled code.  It said that EA
orthogonality would allow the compiler writer to save himself the
trouble of swapping operands on a compare instruction and logically
inverting the branch condition.
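
What that bookkeeping amounts to is roughly this (my sketch, not code from
any actual compiler; the function name is invented for the example):

	#include <string.h>

	/*
	 * If the code generator emits the compare with its operands swapped,
	 * it must also replace the branch condition with its mirror image.
	 */
	static const char *swap_condition(const char *cond)
	{
		if (strcmp(cond, "lt") == 0) return "gt";	/* a <  b  is  b >  a */
		if (strcmp(cond, "le") == 0) return "ge";	/* a <= b  is  b >= a */
		if (strcmp(cond, "gt") == 0) return "lt";
		if (strcmp(cond, "ge") == 0) return "le";
		return cond;			/* eq and ne are symmetric */
	}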

This does *NOT* improve the performance of the compiled code.  In fact,
on the NS320xx cpus (the only ones around with 2-address architecture),
a "backwards" compare instruction takes an extra 2 clock cycles of
execution time.

I have no objection to compiler writers who wish to make a case that EA
orthogonality will result in better compiled object code.  But I object
strenuously to the notion that regardless of whether it would benefit
or hurt the users, the cpu architecture should be changed to please lazy
compiler writers.

EA orthogonality should be argued on the basis of the efficiency of the
resulting object code, not on the ease with which the handful of
compiler writers can do their job.

Some of the notes have indicated that these concerns are one and the
same.  Sometimes, but not always.  Here's a choice counter-example:
Some RISC machines have a "branch *after* next instruction" operation.
This allows the pipeline to be used more efficiently.  It results in
more efficient object code than conventional branch instructions, but
it is a booger-bear to write an effective compiler for.

A lot of folks have also suggested that compilers which were easily
written (I call them "hastily knocked out" :-) are more bug-free than
ones that took some time to implement.  I maintain that the quantity of
bugs is related to the quantity and quality of design and debugging.
Now how much design and debugging do you expect to get from a compiler
writer who thinks that putting the operands of a "compare" instruction
in the proper order is "too much work"?

It is also said that good compilers take longer to produce than crummy
ones.  True.  Are we all so impatient that we'd rather have a crummy
compiler now than to wait six months for a good one?

And it has been said that good compilers cost more than crummy ones.
I'm not exactly surprised.  Isn't there an old saw about "only getting
what you pay for"?

I suggest that part of the problem here is that a lot of folks who are
reading this hope to write The Great American Compiler.  They weren't
planning on spending the time and money to write a good compiler.  And
they don't much care for hearing suggestions that users don't want to
buy crummy compilers.  (Have at it, my mailbox is asbestos-lined now).
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

rap@oliveb.UUCP (Robert A. Pease) (06/05/85)

> 
> You may do what you want, but I'll stick to hardware that was designed
> by software people. 
> -- 
> 	Jack Jansen, jack@mcvax.UUCP
> 	The shell is my oyster.

The thing that I keep thinking about is that every paper, article,
text, or whatever I have seen on the subject says that the best way to
design a system is to first decide what the application will be and
then design the hardware to support the design goals.  Seems to me,
then, that an orthogonal architecture would support high-level
languages much better than one that is not orthogonal, or do I just
see things more clearly than others :-).
-- 
					Robert A. Pease
    {hplabs|zehntel|fortune|ios|tolerant|allegra|tymix}!oliveb!oliven!rap

paul@greipa.UUCP (Paul A. Vixie) (06/05/85)

In article <210@uthub.UUCP> thomson@uthub.UUCP (Brian Thomson) writes:
>The C statement
>		a = b * c + d + e;
>
>might, in different contexts be implemented on your 32032 as:
>		movd	_b,r0
>		muld	_c,r0
>		addd	_d,r0
>		addd	_e,r0
>		movd	r0,_a
>
>or, if c is the constant 2, d a stack local, and e the constant 4,
>		movd	_b,r0
>		addr	4(-4(fp))[r0:w],_a
>
>or even, if b, c, and d are all unsigned shorts, and e == b,
>		movzwd	_b,r0
>		indexw	r0,_c,_d	; b * (c+1) + d
>		movd	r0,_a

Or, how about:

;		extern long int a, b, c, d, e;
;		a = b * c + d + e;
		movd	ext(_b), tos
		muld	ext(_c), tos
		addd	ext(_d), tos
		addd	ext(_e), tos
		movd	tos, ext(_a)

;		extern long int a, b;
;		#define c 2
;		auto long int d;
;		#define e 4
;		a = b * c + d + e;
		movd	ext(_b), tos
		muld	2, tos
		addd	4(fp), tos
		addd	4, tos
		movd	tos, ext(_a)

;		extern long int a;
;		extern unsigned short int b, c, d;
;		a = b * c + d + b;
		movzwd	ext(_b), tos
		movzwd	ext(_c), tos
		muld	tos, tos
		movzwd	ext(_d), tos
		addd	tos, tos
		movzwd	ext(_b), tos
		addd	tos, tos
		movd	tos, ext(_a)

----------------
The above code is neither very pretty nor efficient.  In each case I have done
the same five operations:  move, multiply, add, add, move.  The only real
difference is in the addressing modes;  this seems typical of compiler-generated
code.

I am no longer (thank <insert deity here>) an expert on the 68xxx, but I
don't remember an external or frame-relative addressing mode;  one assumes
that the many otherwise useless address registers will be used to hold the
current global and frame pointers, and the loader has a lot of fixing up to
do on those globals - every reference needs modification, not just an extern
table (unless you plan to have your compiler generate enough low-level stuff
to do what the 32xxx external addressing mode does automagically).

Not being a compiler writer (yet, anyway :-) I don't see many other things
a compiler could optimize for (except the "muld 2, tos" which could have been
"ashd 1, tos" but only vax-11 C from DEC does this).  I do know that the
68xxx's addressing modes and strange restrictions on address and data registers
are more characteristic of RISC than a machine with all those instructions.
Can the 68xxx even do a "addd -(sp), (sp)" without doing the pop at the wrong
time?  The one I worked with didn't have any memory-to-memory instructions;
you could do register to memory, memory to register, or register to register,
but they were all different instructions (in fact, different instructions for
address and data registers, and that's when they felt like providing them -
often you had to move into an (address or data) register from a (data or
address) register to do a simple operation.)

Gosh, what a ramble.  Sorry about that everybody.  My point in all this is
that a compiler can generate *clean* code *easily* for the 32xxx because
of all the neato addressing modes;  generating code for the 68xxx is either
(easy, ugly, inefficient) or (hard, functional, efficient) but that's like
a choice between the electric chair and the gas chamber.

	Paul Vixie
	{pyramid,dual,decwrl}!greipa!paul

mark@rtech.UUCP (Mark Wittenberg) (06/05/85)

> For example:
> 	a += b;
> orthogonal:
> 	add	b(r5),a(r5)
> non-orthogonal:
> 	mov	a(r5),r3		<-- AND MAKE SURE IT'S FREE!!
> 	add	b(r5),r3
> 	mov	r3,a(r5)
> Now, in cycles, the first one would result in 4 memory cycles and
> 3 additions, and the second in 6 memory cycles and 4 additions (PLUS
> an additional 2 instruction decodes).
> 
> -- 
> 	Jack Jansen, jack@mcvax.UUCP
> 	The shell is my oyster.

And furthermore, the orthogonal sequence is normally atomic;
in an OS kernel the non-orthogonal sequence might easily have to
be protected by a "disable/enable interrupt" sequence around it,
or "test-and-set" or some such in a multi-processor system 
(e.g., "a" and "b" might be global vars).
Multi-process user-programs would need "enter/exit monitor" or
"block-on-semaphore" sequences.  Besides being a pain (sometimes
a royal pain) this has the potential for eating a lot of CPU time.
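
Spelled out in C, the exposure looks something like this (a sketch of mine;
the names are invented for the example):

	/*
	 * When `a += b' compiles to load/add/store, an interrupt handler
	 * that also updates `a' can run between the load and the store,
	 * and its update is silently overwritten.  The single
	 * memory-to-memory add leaves no such window between instructions.
	 */
	long a, b;		/* shared with an interrupt handler */

	void unprotected_add(void)
	{
		long tmp;

		tmp = a;	/* mov  a(r5),r3  -- interrupt may land here...     */
		tmp += b;	/* add  b(r5),r3  -- ...or here...                   */
		a = tmp;	/* mov  r3,a(r5)  -- ...and its update of a is lost  */
	}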
-- 

Mark Wittenberg
Relational Technology
zehntel!rtech!mark
ucbvax!mtxinu!rtech!mark