[comp.arch] The VAX Always Uses Fewer Instructions

crowl@cs.rochester.edu (Lawrence Crowl) (06/15/88)

In article <28200161@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>In article <491@daver.UUCP> daver@daver.UUCP (Dave Rand) writes:
>>I am confused. How can a risc machine have a higher "vax mips" than native
>>mips? MORE (not less) risc instructions are required to do the same task,
>>when compared to a vax.
>
>Not always. Consider A=B+C, all in registers:
>    VAX:
>	mov rB,rA
>	add rC,rA
>    3 address RISC:
>	add rA,rB,rC
>
>So, we have an existence proof. What characteristics of the machine actually
>let this happen?

This is incorrect.  The VAX has three-address arithmetic instructions, so the
above example for a VAX (the destination is always the rightmost operand) is:
 
        addl3 rB, rC, rA
 
It also takes four bytes to encode this instruction, the same as most RISC
machines.  The VAX instruction set wins (on number of instructions executed)
when using complex data structures because of the extensive addressing modes.
For example, the loop to add two vectors into a third on the VAX is:
 
   top: addl3 (rA)+, (rB)+, (rC)+
        sobgeq rD, top

which takes seven bytes for two instructions.  Most RISCs I know of would have
something like the following loop (again, destinations on the right):

   top: load  rA, rM
        load  rB, rN
        add   rM, rN, rO
        store rO, rC
        add   rA, 1, rA
        add   rB, 1, rB
        add   rC, 1, rC
        add   rD, -1, rD
        bgeq  top

which takes something on the order of thirty-six bytes and nine instructions.
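
(For reference, the source-level loop that both fragments implement is roughly
the C below, assuming longword elements and a count of rD+1; the parameter
names are made up.)

    /* Rough C restatement of the vector-add loop above; an illustration,
     * not compiler output.  Adds d+1 longword elements. */
    void vadd(long *a, long *b, long *c, long d)
    {
        do {
            *c++ = *a++ + *b++;     /* one addl3 with autoincrement on the VAX */
        } while (--d >= 0);         /* sobgeq: decrement, branch while >= 0 */
    }
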
I cannot think of any general computing task (such as the loop above) in which
the VAX will not execute fewer instructions.  Anyone?
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

chris@mimsy.UUCP (Chris Torek) (06/16/88)

In article <10595@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>For example, the loop to add two vectors into a third on the VAX is:
> 
>   top: addl3 (rA)+, (rB)+, (rC)+
>        sobgeq rD, top
>
>which takes seven bytes for two instructions.

True.  An optimising compiler might expand the loop, however:

	extzv	$0,$3,rD,r0
	bicl2	r0,rD			# or bicl2 $7; same length
	casel	r0,$0,$7		# start the right distance in
9:	.word	0f - 9b			# 0
	.word	1f - 9b			# 1
	...
	.word	7f - 9b			# 7
7:	addl3	(rA)+,(rB)+,(rC)+
6:	addl3	(rA)+,(rB)+,(rC)+
5:	addl3	(rA)+,(rB)+,(rC)+
4:	addl3	(rA)+,(rB)+,(rC)+
3:	addl3	(rA)+,(rB)+,(rC)+
2:	addl3	(rA)+,(rB)+,(rC)+
1:	addl3	(rA)+,(rB)+,(rC)+
0:	addl3	(rA)+,(rB)+,(rC)+
	acbl	$0,$-8,rD,7b		# while (rD-=8) >= 0

This pushes the size up to (I think) 70 bytes.  Too bad the RISC
machines are still faster anyway :-) .

Actually, you could get rid of the case and the branch table:

	extzv	$0,$3,rD,r0
	bicl2	r0,rD
	subl3	r0,$7,r0	# invert
	ashl	$2,r0,r0	# times 4, size of addl3 instr below
	jmp	(pc)[r0]	# into the breach (or is it breech?...kapow!
0:	addl3	(rA)+,(rB)+,(rC)+	# maybe an ancient muzzle loader :-) )
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	addl3	(rA)+,(rB)+,(rC)+
	acbl	$0,$-8,rD,0b

This drops off 9 bytes, down to 61 bytes.  You can get rid of 5 more
bytes by changing the acbl into

	subl2	$8,rD
	bgeq	0b

but on non-pipelined VAXen that might be slower.  Alternatively, if you
have another register free, `mnegl $8,r1'; then acbl with r1 instead of
$-8; this saves only 1 byte overall, but brings the acbl down to 6 bytes.

[nb. the sobgeq loop above runs rD+1 times, so I made the acbl loops
do the same.  rD is left in a different state (-8 vs -1), and I did
need r0 for entry calculation.]
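
[In C terms, the jump-into-the-middle entry is essentially the trick now known
as Duff's device.  A rough sketch, assuming the element count n is rD+1 and at
least 1:

	/* Unrolled-by-8 vector add with a computed entry point (cf. Duff's
	 * device).  Sketch only; n must be >= 1. */
	void vadd8(long *a, long *b, long *c, long n)
	{
		long passes = (n + 7) / 8;	/* times through the block */

		switch (n % 8) {		/* start the right distance in */
		case 0: do { *c++ = *a++ + *b++;
		case 7:      *c++ = *a++ + *b++;
		case 6:      *c++ = *a++ + *b++;
		case 5:      *c++ = *a++ + *b++;
		case 4:      *c++ = *a++ + *b++;
		case 3:      *c++ = *a++ + *b++;
		case 2:      *c++ = *a++ + *b++;
		case 1:      *c++ = *a++ + *b++;
			} while (--passes > 0);
		}
	}
]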

All of this just goes to show that the VAX provides too many ways to
do things!
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

mcglk@scott.stat.washington.edu (Ken McGlothlen) (06/16/88)

In article <11981@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
+----------
| In article <10595@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
| +----------
| |  For example, the loop to add two vectors into a third on the VAX is:
| |
| |    top: addl3 (rA)+, (rB)+, (rC)+
| |         sobgeq rD, top
| |
| |  which takes seven bytes for two instructions.
| +----------
| True.  An optimising compiler might expand the loop, however:
|
| [... case expansion example ...]
|
| This pushes the size up to (I think) 70 bytes.  Too bad the RISC
| machines are still faster anyway :-) .
|
| [... more examples ...]
|
| All of this just goes to show that the VAX provides too many ways to
| do things!
| -- 
| In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
| Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
+----------

Oh, please.

Guess we're gonna have to not only bash that ultra-complex VAX architecture,
but while we're at it, may as well bash C, too.  I mean, we've got so many
ways of adding five to an integer variable!

     i = i + 5;
     i += 5;
     i++; i++; i++; i++; i++;
     for( j = 0 ; j < 5 ; j++ )  i++;

Yup.  Definitely too complex.  I think we oughta just keep the "++" operator.

I still haven't seen any good arguments as to why RISC is so much better or
faster.  I'm kind of fond of the VAX instruction set, and you can do a heck
of a lot more with one line of its instruction set than you can with five
or ten lines of RISC code.  Is having eighty or so registers all that much
faster?

				--Ken McGlothlen
				  mcglk@scott.biostat.washington.edu
				  mcglk@max.acs.washington.edu

brooks@maddog.llnl.gov (Eugene D. Brooks III) (06/17/88)

In article <914@entropy.ms.washington.edu> mcglk@scott.biostat.washington.edu writes:
>I still haven't seen any good arguments as to why RISC is so much better or
>faster.  I'm kind of fond of the VAX instruction set, and you can do a heck
>of a lot more with one line of its instruction set than you can with five
>or ten lines of RISC code.  Is having eighty or so registers all that much
>faster?
If main memory, and in particular shared memory in a multiprocessor, is
20 to 40 clocks away, then having eighty or so registers with fully pipelined
memory access really is a whole lot faster.
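
In compiler terms (this is a sketch of mine, not a claim about any particular
machine): with enough registers the generated code can keep several independent
loads outstanding, so the 20-to-40-clock latency is overlapped rather than paid
on every reference.  Something like:

    /* Sketch: summing a vector when memory is 20 to 40 clocks away.
     * The naive loop serializes on each load; the unrolled version uses
     * extra registers as independent accumulators, so several loads can
     * be in flight at once (assuming pipelined memory, as above). */
    double sum_naive(double *x, int n)
    {
        double s = 0.0;
        int i;

        for (i = 0; i < n; i++)
            s += x[i];                  /* each add waits on its own load */
        return s;
    }

    double sum_unrolled(double *x, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;

        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i];                 /* four independent dependence chains */
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++)              /* leftover elements */
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }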

steve@basser.oz (Stephen Russell) (06/20/88)

>In article <914@entropy.ms.washington.edu> mcglk@scott.biostat.washington.edu writes:
>I still haven't seen any good arguments as to why RISC is so much better or
>faster.

Two quick reasons:

1. CISC costs silicon. Using up silicon in implementing complex instructions
leaves less for performance enhancements, like large register sets, on-board
caching, hardware assistance (barrel shifters, hardware multipliers, etc),
translation lookaside buffers, etc.

2. CISC costs performance. Lots of addressing modes, for example, result
in many pipeline blocks while additional addressing values are fetched
from cache/memory. The whole CPU stops because of some extra indirection.
Also, this stop/start behaviour must be planned for, and this costs more
silicon - see 1 above. Keeping things simple allows the system to do something
useful (or lots of useful things in parallel in the pipeline) for _every_
cycle.
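
To make (2) concrete (my own example; nothing beyond the addressing modes is
claimed for the VAX): the statement below needs two dependent memory fetches
per operand before the add can even start.  A VAX can typically fold each pair
into a single operand using a deferred addressing mode, but the fetches still
happen; on a load/store machine they are separate loads that the compiler can
at least schedule around.

    /* Illustration only: two levels of indirection per operand. */
    long f(long **p, long **q)
    {
        return *p[1] + *q[2];   /* fetch p[1], then *p[1]; likewise for q */
    }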

>I'm kind of fond of the VAX instruction set, and you can do a heck
>of a lot more with one line of its instruction set than you can with five
>or ten lines of RISC code.

But is the single VAX instruction actually faster, all else being equal?

chris@softway.oz (Chris Maltby) (06/21/88)

This debate is just a waste of time. I don't care what the assembler
code looks like or how many instructions there are etc etc.
It seems that we are going to see a lot of RISC machines from now
on, not because they are a clean or nice way to do things, but because
they are easier to design than complex instruction set machines.
This means that new technology hits the market first with a RISC
architecture, and it goes faster (for that reason alone) than the
CISC machines released alongside it, which use older technology.

In any case, all these dodgy machine architectures have put
compiler writers back in business after the 68000 era ...
-- 
Chris Maltby - Softway Pty Ltd	(chris@softway.oz)

PHONE:	+61-2-698-2322		UUCP:		uunet!softway.oz!chris
FAX:	+61-2-699-9174		INTERNET:	chris@softway.oz.au

darin@nova.laic.uucp (Darin Johnson) (06/22/88)

In article <1277@basser.oz>, steve@basser.oz (Stephen Russell) writes:
> >In article <914@entropy.ms.washington.edu> mcglk@scott.biostat.washington.edu writes:
> >I'm kind of fond of the VAX instruction set, and you can do a heck
> >of a lot more with one line of its instruction set than you can with five
> >or ten lines of RISC code.
> 
> But is the single VAX instruction actually faster, all else being equal?

Perhaps it would be possible for someone to come up with an 'assembler-compiler'
that would accept a CISC instruction set and generate RISC code.  This would
allow one to write using something like 'ADD mem-loc1 to mem-loc2 and store
in mem-loc3(R1)' without having to write the 5 or 10 RISC lines of code.
The biggest drawback I can see is that there would have to be 'optimizing
assemblers'.  Of course, such an assembler would find it difficult to 
take advantage of some common RISC idioms, such as register windows.

Just another naive thought from the mind of...
Darin Johnson (...pyramid.arpa!leadsv!laic!darin)
              (...ucbvax!sun!sunncal!leadsv!laic!darin)
	"All aboard the DOOMED express!"

jesup@cbmvax.UUCP (Randell Jesup) (06/24/88)

In article <270@laic.UUCP> darin@nova.laic.uucp (Darin Johnson) writes:
>Perhaps it would be possible for someone to come up with an 'assembler-compiler'
>that would accept a CISC instruction set and generate RISC code.  This would
>allow one to write using something like 'ADD mem-loc1 to mem-loc2 and store
>in mem-loc3(R1)' without having to write the 5 or 10 RISC lines of code.

	People already do this.  For example, on the Rpm40 there are several
"meta-instructions" you can use, each of which actually expands into a series
of real machine instructions.  Examples are MUL, CALL, DIV, FPLDD, etc.

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|ihnp4|allegra}!cbmvax!jesup

jlg@beta.lanl.gov (Jim Giles) (06/24/88)

In article <270@laic.UUCP>, darin@nova.laic.uucp (Darin Johnson) writes:
> [...]
> Perhaps it would be possible for someone to come up with an 'assembler-compiler'
> that would accept a CISC instruction set and generate RISC code.  This would
> allow one to write using something like 'ADD mem-loc1 to mem-loc2 and store
> to mem-loc3(R1)' without having to write the 5 or 10 RISC lines of code.
> The biggest drawback I can see is that there would have to be 'optimizing
> assemblers'.  Of course, such an assembler would find it difficult to 
> take advantage of some common RISC idioms, such as register windows.

This is already possible with macros, opdefs, and micros that many
assemblers have.  Just define one of these for each CISC instruction
mnemonic you wish to emulate.  (OK. The syntax for defining these may
be messy, but it could be made to work.)

As you pointed out, there is a need for optimizing assemblers.  This
need already exists for macro assemblers, since hand-pipelining code
is not possible through macro calls.  The Cray has needed such an
optimizing assembler for years.

J.Giles
Los Alamos

mash@mips.COM (John Mashey) (06/26/88)

In article <270@laic.UUCP> darin@nova.laic.uucp (Darin Johnson) writes:
....
>The biggest drawback I can see is that there would have to be 'optimizing
>assemblers'.  Of course, such an assembler would find it difficult to 
>take advantage of some common RISC idioms, such as register windows.

Optimizing assemblers have existed for years, in various forms,
and in many companies, on both CISC and RISC machines.

Many RISC machines have optimizing assemblers, including such things
as code scheduling, addressing-style optimization, optimization of
constant creation, code selection for multiply/divide by constants, etc., etc.
We had the first version of the MIPSco one working BEFORE the R2000
architecture was even frozen, for example.
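
As one concrete instance of the multiply-by-constant item (sketched here in C
rather than in any particular assembler's notation): a multiply by a known
constant is commonly selected down to shifts and adds when that is cheaper
than the general multiply.

    /* x * 10 selected as shifts and adds: 10 = 8 + 2,
     * so x * 10 == (x << 3) + (x << 1). */
    long mul10(long x)
    {
        return (x << 3) + (x << 1);
    }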

At this point, optimization is moving further afield, i.e., one is even
starting to see optimizing linkers [we do some of this, and I think
Moto does also.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

smryan@garth.UUCP (Steven Ryan) (06/26/88)

>As you pointed out, there is a need for optimizing assemblers.

Strong disagreement--an assembler should be safe, simple, and dumb. If you
want an optimiser, use a compiler.

What is preferable is a separate layer to do the optimisation. The problem with
an assembler is that it knows very little. Consider an assembler which replaces
a long branch with a different short branch. What if it occurs in a switch?
            goto next + index
            goto a
             ...
            goto z
Also, proper scheduling requires moving code past loads and stores. How can
an assembler safely and generally determine what the target of a memory
reference is?
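
A small example of the problem, in C since that is where the information gets
lost: the load below may be moved above the store only if p and q are known
not to point at the same word.  A compiler can sometimes prove that from the
source; a bare assembler sees two memory operands and must assume they alias.

    /* Hoisting the load of *q above the store to *p is wrong if p == q. */
    void update(long *p, long *q)
    {
        long t;

        *p = 1;
        t = *q;         /* depends on the store above when p == q */
        *p = t + 2;
    }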

chris@mimsy.UUCP (Chris Torek) (06/28/88)

>In article <11981@mimsy.UUCP> I wrote:
>>All of this just goes to show that the VAX provides too many ways to
>>do things!

In article <914@entropy.ms.washington.edu> mcglk@scott.stat.washington.edu
(Ken McGlothlen) writes:
>Oh, please.
>
>Guess we're gonna have to not only bash that ultra-complex VAX architecture,
>but while we're at it, may as well bash C, too.  I mean, we've got so many
>ways of adding five to an integer variable!

I think you missed the implicit :-) ---I was half kidding.  (But only
about half.)

>I still haven't seen any good arguments as to why RISC is so much better or
>faster.

Who cares about the arguments?  The fact is that if you have somewhere
between $10,000 and $1,000,000, and want to buy the fastest machine you
can get for that, right now that machine is probably `RISC-based'.

You can argue all you like as to why the Vax instruction set is better,
or why the 88000 instruction set is better, but the fastest Vax CPU from
DEC is slower than the fastest 88000 CPU from Motorola.  If it were the
other way around, DEC would be in fine shape.  (Maybe they just need
Motorola to design their next chip :-) .)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

colwell@mfci.uunet (06/28/88)

In article <12179@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>I still haven't seen any good arguments as to why RISC is so much better or
>>faster.
>
>Who cares about the arguments?  The fact is that if you have somewhere
>between $10,000 and $1,000,000, and want to buy the fastest machine you
>can get for that, right now that machine is probably `RISC-based'.
>
>You can argue all you like as to why the Vax instruction set is better,
>or why the 88000 instruction set is better, but the fastest Vax CPU from
>DEC is slower than the fastest 88000 CPU from Motorola.  If it were the
>other way around, DEC would be in fine shape.  (Maybe they just need
>Motorola to design their next chip :-) .)
>-- 

But DEC IS in fine shape.  They sell 'way more VAXen/year than
everybody else combined.  No judgment on the 88000 implied, but users
don't really care about performance per se.  They want solutions to
their problems, which almost always require decent I/O (large and
fast), acceptable reliability and service, and lots of available
software.  Something they don't tell you in your computer
architecture classes -- people don't always automatically buy the
machine with the highest performance (nor should they).

Also, please cast a jaundiced eye on the phrase "RISC-based".  I
think it has almost attained the status of "content-free".

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

mash@mips.COM (John Mashey) (06/30/88)

In article <12179@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
...
>Who cares about the arguments?  The fact is that if you have somewhere
>between $10,000 and $1,000,000, and want to buy the fastest machine you
>can get for that, right now that machine is probably `RISC-based'.
>
>You can argue all you like as to why the Vax instruction set is better,
>or why the 88000 instruction set is better, but the fastest Vax CPU from
>DEC is slower than the fastest 88000 CPU from Motorola.
--------^^^^^^

On what set of benchmarks is this assertion based?
(I believe it's true for integer and single-precision FP, and if you'd
said MIPS, it would have been true for DP floating also :-)
So far, the only double-precision floating-point number we've seen
for the 88K is Whetstone: 3 Megawhets (and I don't remember the source,
sorry). (An 8700 is about 4Mwhets DP).  We'd be VERY interested to
see more DP benchmarks: from an examination of the 88K's architecture
and the cycle counts, we have reasons to believe that the 88K
design essentially SACRIFICED double-precision floating performance for
most compiled programs.  We'd be glad to be disabused by knowledgeable folks
who can cite useful benchmarks like: DP Livermore Loops, Spice, Doduc, etc.
Note one more time that VAX-relative performance numbers, computed the
way DEC does, include floating point.....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

lackey@Alliant.COM (Stan Lackey) (07/01/88)

>In article <12179@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>...
>>Who cares about the arguments?  The fact is that if you have somewhere
>>between $10,000 and $1,000,000, and want to buy the fastest machine you
>>can get for that, right now that machine is probably `RISC-based'.
>>
>>You can argue all you like as to why the Vax instruction set is better,
>>or why the 88000 instruction set is better, but the fastest Vax CPU from
>>DEC is slower than the fastest 88000 CPU from Motorola.

This isn't much of an argument.  The Alliant single CPU (released in 1985)
also beats the VAX 8700 on Whetstones, Livermore Loops, Linpack, etc., and
is anything but a RISC: it has the 68020 instruction set plus floating-point,
vector, and concurrency instruction sets.

Not to say that RISC is "bad" - I would rather have implemented a RISC than
the 68020, and performance could very well have been better, design time 
would probably have been less, cost would have been less, etc.  But then
again we wouldn't have been able to offer Pascal, Ada, or C in that
timeframe.

There is much more than the classic RISC arguments to consider when making
a business decision.
-Stan

greg@vertical.oz (Greg Bond) (07/05/88)

In article <810@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>>As you pointed out, there is a need for optimizing assemblers.
>Strong disagreement--an assembler should be safe, simple, and dumb. If you
>want an optimiser, use a compiler.
>What is preferable is a separate layer to do the optimisation. The problem with

In fact, use the "optimisation" pass from your favourite C compiler.
This is tough to organise on most Unix boxes, but goes great on the
8086 X-compiler we have here.  The optimiser is a separate assembler-to-assembler
processor.  I can't see where its orientation as a C optimiser would kill the
semantics of assembler code.  But then, it may not be a really clever
optimiser either (for the general case).
-- 
Gregory Bond,  Vertical Software, Melbourne (greg@vertical.oz)
I used to be a pessimist. Now I am a realist.

ge@hobbit.sci.kun.nl (Ge' Weijers) (07/07/88)

In article <140@vertical.oz>, greg@vertical.oz (Greg Bond) writes:
) In article <810@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
) >>As you pointed out, there is a need for optimizing assemblers.
) >Strong disagreement--an assembler should be safe, simple, and dumb. If you
) >want an optimiser, use a compiler.
) >What is preferable is a separate layer to do the optimisation. The problem with
) 
) In fact, use the "optimisation" pass from your favourite C compiler.
) This is tough to organise on most Unix boxes, but goes great on the
) 8086 X-compiler we have here.  The optimiser is a separate assembler-to-assembler
) processor.  I can't see where its orientation as a C optimiser would kill the
) semantics of assembler code.  But then, it may not be a really clever
) optimiser either (for the general case).

Watch out for C optimisers. They usually assume things about their input
(register usage, Rx = frame pointer, etc) that are just NOT true for
hand-written assembly language. A case in point: see the manual of the
assembler for the Sun-4. It has an optimiser, but you are strongly advised
not to use it. 

Ge' Weijers, mcvax!kunivv1!hobbit!ge
-- 
Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands
UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge