[net.arch] Addressing modes

jeff1@garfield.UUCP (02/11/86)

	We are currently designing a RISC chip and are wondering how
few addressing modes we can get away with.  Right now we have register
direct and register indirect.  The question is, is this enough?  What do
people as programmers and/or designers think is a minimum set of addressing
modes?  We can get away with just these two (I hope!), but doing some things
requires a lot of convoluted code.
	I guess followups should go to net.arch, just to keep them all in the
same place.

Jeff Sparkes
jeff1%garfield.mun.cdn@ubc.csnet  <- preferred route
{allegra,seismo,psuvax1,utcsri}!garfield!jeff1  (I don't trust ihnp4 any more)

freak@nsc.UUCP (Curt Mayer) (02/13/86)

In article <946@garfield.UUCP> jeff1@garfield.UUCP (Jeff Sparkes) writes:
>
>	We are currently designing a RISC chip and are wondering how
>few addressing modes we can get away with.  Right now we have register
>direct and register indirect.  The question is, is this enough?  What do
>people as programmers and/or designers think is a minimum set of addressing
>modes?  We can get away with just these two (I hope!), but doing some things
>requires a lot of convoluted code.
>	I guess followups should go to net.arch, just to keep them all in the
>same place.

:-)
I had quite a bit of fun looking at bizarre ways to simulate the immediate
addressing mode.  Maybe you could play with XORs and SHIFTs to get numbers
into registers in the first place.

Take a look at the CDC Cybers designed by Seymour Cray and friends in the
early 60's.  Heap Big Nasty Machines.  I watch with amusement whenever some
university comes up with a new, novel thingy that Cray came up with 20
years ago. 

the more that things change, the more they stay the same

	curt

brooks@lll-crg.ARPA (Eugene D. Brooks III) (02/14/86)

>Take a look at the CDC Cybers designed by Seymour Cray and friends in the
>early 60's.  Heap Big Nasty Machines.  I watch with amusement whenever some
>university comes up with a new, novel thingy that Cray came up with 20
>years ago. 
>
I am waiting for a university to "invent" the first RISC with pipelined
functional units!

hansen@mips.UUCP (Craig Hansen) (02/14/86)

> 	We are currently designing a RISC chip and are wondering how
> few addressing modes we can get away with.  Right now we have register
> direct and register indirect.  The question is, is this enough?  What do
> people as programmers and/or designers think is a minimum set of addressing
> modes?  We can get away with just these two (I hope!), but doing some things
> requires a lot of convoluted code.
> 
> Jeff Sparkes
> jeff1%garfield.mun.cdn@ubc.csnet  <- preferred route
> {allegra,seismo,psuvax1,utcsri}!garfield!jeff1  (I don't trust ihnp4 any more)

Did you forget to mention an immediate operand addressing mode?

I fear you are asking the wrong question.  Are you trying to minimize the
number of addressing modes, or maximize performance?  The RISC design
approach involves selecting a small set of operations that minimize the
product of (the dynamic number of instructions executed) * (average
instruction execution time).  By minimizing only the first of these
parameters, you get the classic overwrought CISC design, and by minimizing
only the second, you may drastically increase the number of instructions
executed, and get a less than optimum design.

The number of addressing modes alone isn't enough to make the decision; what
you provide in the way of the remaining operations is important too.
For example, with a suitable add register instruction, you can synthesize
indexed addressing, an add immediate instruction can synthesize
base-displacement addressing, and a shift instruction can be used to
synthesize index-shifted addressing.
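
For instance, an index-shifted access can be lowered this way; here is a
sketch in C (names invented), one statement per synthesized instruction:

	#include <stdint.h>

	/* sketch: a[i] with 4-byte elements, built from shift, add,
	   and register-indirect load -- no fancy addressing mode */
	int32_t load_indexed(int32_t *a, long i)
	{
	    char *base = (char *)a;     /* base address in a register */
	    long  off  = i << 2;        /* shift: scale the index     */
	    char *ea   = base + off;    /* add register: indexing     */
	    return *(int32_t *)ea;      /* register-indirect load     */
	}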

As you say, keeping a small set of addressing modes will make the path
length (instruction count) of some operations rather high, but I would
expect that you would find relatively little usage of very complicated
addressing modes, and further, that these complicated modes can be
synthesized from a smaller set of operations.

Further, if you are trying to design a really practical RISC machine, you
must examine how machine code is generated and linked, as these operations
are just as critical to the selection of appropriate addressing structures
as the hardware trade-offs are.


Evurthin' tastes better when it sits on a RISC.

Craig Hansen
MIPS Computer Systems
...!decvax!decwrl!glacier!mips!hansen

kludge@gitpyr.UUCP (Scott Dorsey) (02/14/86)

In article <946@garfield.UUCP> jeff1@garfield.UUCP (Jeff Sparkes) writes:
>
>	We are currently designing a RISC chip and are wondering how
>few addressing modes we can get away with.  Right now we have register
>direct and register indirect.  The question is, is this enough?  What do
>people as programmers and/or designers think is a minimum set of addressing
>modes?  We can get away with just these two (I hope!), but doing some things
>requires a lot of convoluted code.
>	I guess followups should go to net.arch, just to keep them all in the
>same place.
 
   Well, I am a big fan of MOVQ type instructions, which move a 4-bit
literal field within the instruction into a register or address.  If you
have 32-bit instructions, there is plenty of room, and even on a 16-bit
machine, if the instruction set is small then there should be no problem.
For a good example, see the 32000 series.  It's not a RISC machine, but
its code density is admirable.  I don't know just how hard such an
instruction is to decode, but with a barrel shifter hanging off the side,
it shouldn't be bad.
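  As a sketch of what the decode amounts to (in C; the field position is
invented, and I'm assuming the literal is signed, as I believe the 32000's
quick operands are):

	#include <stdint.h>

	/* sketch: extract and sign-extend the 4-bit literal field of a
	   32-bit instruction word; the field position is made up */
	int32_t movq_literal(uint32_t instr)
	{
	    int32_t lit = (instr >> 8) & 0xF;  /* pull out the 4 bits     */
	    if (lit & 0x8)                     /* literal's sign bit set? */
	        lit |= ~(int32_t)0xF;          /* sign-extend to 32 bits  */
	    return lit;
	}
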
  I like double and triple indirection.  They probably don't justify
themselves in terms of speed, though.  But I still like them.
  I like indexed addressing (INDEX+constant)...and an instruction that
sums two values and gets the address pointed to by the sum would be cute
for array manipulation.  But then, I'm just an assembly hacker.

 

darrell@sdcsvax.UUCP (Darrell Long) (02/14/86)

You should include an indexed addressing mode.  Array manipulation
is a very common thing, and indexed mode helps the code generation
a lot!  I'm not sure your set is functionally complete; how do you
get an address into the register in the first place?  You might
have a small problem there.

DL
-- 
Darrell Long
Department of Electrical Engineering and Computer Science
University of California, San Diego

UUCP: sdcsvax!darrell
ARPA: darrell@sdcsvax

maa@oasys.UUCP (02/15/86)

<just a byte or two for the line eater>

>   I like double and triple indirection.  They probably don't justify
> themselves in terms of speed, though.  But I still like them.
>   I like indexed addressing (INDEX+constant)...and an instruction that
> sums two values and gets the address pointed to by the sum would be cute
> for array manipulation.  But then, I'm just an assembly hacker.
> 
>  
There was a computer in the basement of the University of Colorado EE building 
that I spent quite a lot of time playing with.  Its name was GPL-3055, built
by the Librascope division of General Precision.  [There were only three of
these ever built--for the DoD of course :-)]

This computer supported infinite indirection with indexing at every level.  
I.e., all addresses had an indirect bit and an index register number in them.
A guy I knew was trying to write some program that made use of indexing after 
several levels of indirection.  I don't think he ever got it to work, though.

The word size on this beast was 8*6 bits, and most of the instructions had a
field in them to control which characters in the word were operated on.  It
also had nice colourful instruction names like BRING, KEEP and JERK.

I think it's a shame that all these wonderful old computers with strange
architectures are being retired.  [I saw pieces of GPL-3055 in the local
electronics junk store a year or so ago.]  I think that the current EE/CS
students are missing a lot by not getting to play with them.

Happy hacking,
Mark

roy@phri.UUCP (Roy Smith) (02/16/86)

In article <175@oasys.UUCP> maa@oasys.UUCP writes:
> This computer [the GPL-3055] supported infinite indirection with indexing
> at every level.  I.e., all addresses had an indirect bit and an index register
> number in them.

	I understand that every address in the DG Nova used one bit (the
sign bit, I guess) to indicate indirection.  The indirect addressing mode
was not decoded as part of the instruction, but as part of the address.  As
long as the address you fetched had the indirect bit set, the machine kept
doing another level of indirection.
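
	In C terms the effective-address chase looks something like this
(a sketch; the bit layout follows my description above, the rest is assumed):

	#include <stdint.h>

	#define IND 0x8000U             /* the sign bit marks indirection */

	extern uint16_t mem[1 << 15];   /* 32K words of core */

	/* sketch: follow a Nova-style indirection chain; two words that
	   point at each other with IND set will loop here forever */
	uint16_t resolve(uint16_t addr)
	{
	    while (addr & IND)
	        addr = mem[addr & ~IND];
	    return addr;
	}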

	I think that the CPU only checked for interrupts at the completion
of an instruction.  You could generate an infinite indirection loop (two
addresses which point to each other) and there was no way to halt the CPU
short of pulling the plug.

	We had one of these in the EE lab at college, but I never actually
got a chance to play with it so I may have some details wrong.  By the time
I got my hands on it, the most valuable parts left were the rack and the
power supplies.
-- 
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

mlopes@hhb.UUCP (02/17/86)

>	We are currently designing a RISC chip and are wondering how
>few addressing modes we can get away with.  Right now we have register
>direct and register indirect.  The question is, is this enough?  What do
>people as programmers and/or designers think is a minimum set of addressing
>modes?  We can get away with just these two (I hope!), but doing some things
>requires a lot of convoluted code.

Functionally, there is no problem with register direct and register indirect
as the only addressing modes.  Any other modes can be simulated with
them.  Immediate addressing would be simulated by allocating a memory
location for the immediate value, and loading its address into a register.
Now you can use register indirect to load the value into another register.

A similar scheme can be used to simulate absolute addressing.  In this case,
the "immediate" value would be the absolute address.

All other fancier types of addressing can be simulated by address manipulation
within registers.

						Marco Lopes
						HHB-Systems
						Mahwah NJ
						philabs!hhb!mlopes

kludge@gitpyr.UUCP (Scott Dorsey) (02/17/86)

maa@oasys.UUCP writes:
>I think it's a shame that all these wonderful old computers with strange
>architectures are being retired.  [I saw pieces of GPL-3055 in the local
>electronics junk store a year or so ago.]  I think that the current EE/CS
>students are missing a lot by not getting to play with them.

Whatever happened to the TI 9900?  How about the PACE and IMP-8 from
Nat'l Semi?  And has anyone out there ever used the Writable Control
Store option on the Burroughs 5500?  There's some great stuff out there
that has been long forgotten (except by the gov't.  If you want to see
some neat stuff, take a look at some of the AN/UYK manuals!).
   Has anyone out there ever done any research on the increase in code
density produced by various addressing modes/instructions vs. the effort
to decode them?  I don't mean RISC stuff, I mean some really hairy stuff,
like infinite indirection (oh, how I would love to see that!).
   Just once, I'd like to see, oh, say a 32000 with a RISC banked register
structure and port-mapped I/O.  And whatever happened to port-mapped I/O
anyway?
-------
Disclaimer: Everything I say is probably a trademark of someone.  But
            don't worry, I probably don't know what I'm talking about.

Scott Dorsey
ICS Programming Lab, Georgia Insitute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge

jack@boring.uucp (Jack Jansen) (02/17/86)

In article <1417@sdcsvax.UUCP> darrell@sdcsvax.UUCP (Darrell Long) writes:
>You should include an indexed addressing mode.  Array manipulation
>is a very common thing, and indexed mode helps the code generation
>alot!
No, please..... The idea of a RISC architecture is to *eliminate*
all fancies like indexed addressing. The point is, a compiler will
have a terrible time generating code for it. As an example, the line

	a = arr[b];	/* a=r0, b=r1 */
could be translated into
	mov	arr(r1),r0	If 'arr' is static/external, and
				sizeof(arr[0]) = 1, or
	mov	arr(lb,r1.b),r0
	mov	arr(lb,r1.w),r0
	mov	arr(lb,r1.d),r0	If the machine knows double indexing
				with operand width specification
				and the array sits on the stack
				*and* sizeof(arr[0]) is 1,2 or 4; or
	mov	r1,r2
	mul	#sizeof(arr[0]), r2
	add	#arr,r2
	add	lb,r2
	mov	*r2, r0		If all else fails.

(And a lot of intermediate forms, depending on the machine).
Now, the point is that most compiler-writers don't have the time
to let their compiler look for all those fancy addressing modes
it might be able to use. And even if they do incorporate these
features, they'll make the compiler much bigger and hairier (and
buggier).

The idea behind a RISC machine is to have instructions with only
limited capabilities (when compared to a CISC machine, like a
VAX or a 32032), but to make them *fast*.
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

billw@Navajo.ARPA (William E. Westfield) (02/17/86)

The DEC Kx10 CPU (DECsystem10, DECsystem20) has infinite
indirection with possible indexing at each level.  I myself
have written programs with two levels of indexing.

The 10's (36-bit) architecture is quite nice (by my
standards), and has quite a number of RISC-like features
(foremost being an extremely regular instruction format).

Too bad it's dying out.

BillW

craig@dcl-cs.UUCP (Craig Wylie) (02/17/86)

In article <946@garfield.UUCP> jeff1@garfield.UUCP (Jeff Sparkes) writes:
>
>	We are currently designing a RISC chip and are wondering how
>few addressing modes we can get away with.  Right now we have register
>direct and register indirect.  The question is, is this enough?  What do
>people as programmers and/or designers think is a minimum set of addressing
>modes?  We can get away with just these two (I hope!), ...

Doesn't the Inmos Transputer get away with one mode?  (Everything is an address?)


Craig.

-- 
UUCP:	 ...!seismo!mcvax!ukc!dcl-cs!craig| Post: University of Lancaster,
DARPA:	 craig%lancs.comp@ucl-cs 	  |	  Department of Computing,
JANET:	 craig@uk.ac.lancs.comp		  |	  Bailrigg, Lancaster, UK.
Phone:	 +44 524 65201 Ext. 4146   	  |	  LA1 4YR
Project: Cosmos Distributed Operating Systems Research

pdg@ihdev.UUCP (P. D. Guthrie) (02/17/86)

In article <175@oasys.UUCP> maa@oasys.UUCP writes:
>This computer supported infinite indirection with indexing at every level.  
>I.e., all addresses had an indirect bit and an index register number in them.
>A guy I knew was trying to write some program that made use of indexing after 
>several levels of indirection.  I don't think he ever got it to work, though.
>

A Dec-20 also does this, and there are *lots* of those around.
-- 

Paul Guthrie				`When the going gets weird,
ihnp4!ihdev!pdg				 The weird turn pro'
					  - H. Thompson

kludge@gitpyr.UUCP (Scott Dorsey) (02/17/86)

In article <6777@boring.UUCP> jack@mcvax.UUCP (Jack Jansen) writes:
>No, please..... The idea of a RISC architecture is to *eliminate*
>all fancies like indexed addressing. The point is, a compiler will
>have a terrible time generating code for it. As an example, the line

   That is absolutely true, but who is to say just WHAT instructions
and addressing modes are really needed... That is to say, which instructions
are worth the time to decode them vs. the time 
required to use equivalent low-level instructions?  Depending on the
architecture of the machine, this varies a lot.  Who is to say what
the optimum architectural complexity vs. instruction set complexity
is?  Without fancy register stacks, RISC machines DO tend to execute
faster than their CISC contemporaries on most problems.  Maybe this
is due to the increase in compiler simplicity, in which case the idea
would seem to be to build a compiler which can better use the CISC.

-------
 Disclaimer:  I wouldn't like to code assembly on a RISC machine, but
             this message is being sent from one, and WOW!  Is it fast!

Scott Dorsey
ICS Programming Lab, Georgia Insitute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge

b-davis@utah-cs.UUCP (Brad Davis) (02/18/86)

In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes:
>Immediate addressing would be simulated by allocating a memory
>location for the immediate value, and loading its address into a register.

How do you load the address of the memory location into a register without
using an immediate addressing mode?

-- 
Brad Davis	{ihnp4, decvax, seismo}!utah-cs!b-davis	
		b-davis@utah-cs.ARPA
One drunk driver can ruin your whole day.

jqj@gvax.cs.cornell.edu (J Q Johnson) (02/18/86)

In article <175@oasys.UUCP> maa@oasys.UUCP writes:
>There was a computer in the basement of the University of Colorado EE building 
>that I spent quite a lot of time playing with.  Its name was GPL-3055, built
>by the Librascope division of General Precision.  [There were only three of
>these ever built--for the DoD of course :-)]
>
>This computer supported infinite indirection with indexing at every level.  
>I think it's a shame that all these wonderful old computers with strange
>architectures are being retired.

The PDP-10 architecture (DEC-10, DEC-20, etc.), of which there were 
substantially more than 3 built, also has infinite indirection with indexing
at every level.  In fact, one frequently uses multiple levels of indirection,
though generally only one or two indexing operations.

Yes, it's a shame the PDP-10 is being retired.  It was a nice machine, and
the mainstay of computer science research (especially AI) for a decade.

mlopes@anwar.UUCP (Marco Lopes) (02/18/86)

In my earlier posting, I assumed the existence of absolute addressing.
Therefore, a minimum set of addressing modes is:

		register
		register indirect
		absolute

Another functionally complete set would have immediate instead of absolute.

					Marco Lopes
					HHB-Systems
					Mahwah, NJ
					philabs!hhb!mlopes
-- 

-----------------------------------------------------------------------------
			Marco Lopes

		 	HHB-Softron
			1000 Wyckoff Ave.
			Mahwah NJ 07430
			201-848-8000

UUCP address:	{ihnp4,decvax,allegra}!philabs!hhb!mlopes

crowl@rochester.UUCP (Lawrence Crowl) (02/19/86)

In article <3686@utah-cs.UUCP> b-davis@utah-cs.UUCP (Brad Davis) writes:
>In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes:
>>Immediate addressing would be simulated by allocating a memory
>>location for the immediate value, and loading its address into a register.
>
>How do you load the address of the memory location into a register without
>using an immediate addressing mode?

Any practical architecture must have a mechanism for loading a constant value
into a register or using an absolute address.  If absolute addressing is
present, a constant can be placed in a memory location and then loaded with
the absolute addressing mode.  If constant value loading is present, an address
can be loaded as a constant and then used indirectly to achieve absolute
addressing.

However, it appears both authors assume no constant loading facility at all.
So, in response, I present a method to obtain constants without such a
facility.  To get the constant 10 (a mixture of C and Pascal):

    r1 := r1 - r1 ;  { r1 == 0 } 
    r1 := ~r1 ;      { r1 == -1, assuming two's complement }
    r2 := r2 - r2 ;  { r2 == 0 }
    r2 := r2 - r1 ;  { r2 == 1 }
    { from here the algorithm is constant and register dependent }
    r3 := r3 - r3 ;  { r3 == 0 }
    r3 := r3 + r2 ;  { r3 == 1 }
    r3 := r3 << r2 ; { r3 == 2 }
    r3 := r3 << r2 ; { r3 == 4 }
    r3 := r3 + r2 ;  { r3 == 5 }
    r3 := r3 << r2 ; { r3 == 10 }

The method can be extended in the obvious manner to generate arbitrary
constants.  I cannot recommend an architecture relying on this method.
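
For the curious, the extension is just a shift-and-add walk over the bits of
the target constant.  A sketch in C (the bit test, of course, would be made
when the code is generated, not at run time):

    #include <stdint.h>

    /* sketch: build an arbitrary constant from nothing but
       register-to-register subtract, complement, add, and shift */
    uint32_t synthesize(uint32_t target)
    {
        uint32_t zero = target - target;  /* any register minus itself */
        uint32_t one  = zero - ~zero;     /* 0 - (-1) == 1             */
        uint32_t acc  = zero;
        for (int i = 31; i >= 0; i--) {
            acc = acc << one;             /* make room for the next bit */
            if ((target >> i) & one)      /* decided at code-gen time   */
                acc = acc + one;
        }
        return acc;
    }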

I suggest looking very seriously at the addressing mode provided by Patterson
for the RISC I and RISC II chips.  The form is

      value in register + constant = effective address

This allows absolute addressing with a 0 valued register, register indirect
with a 0 valued constant, and register displaced when both are non-zero.
This mode is very useful for accessing variables off the stack frame (you will
need one even if you rely on registers).  In addition, it is very useful for
accessing pointer-based records and such. 
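
In C terms, the one mode covers all three idioms (a sketch):

    #include <stdint.h>

    /* sketch: effective address = register + constant */
    int32_t load_rc(char *reg, long constant)
    {
        return *(int32_t *)(reg + constant);
    }

    /* absolute:   load_rc(r0, addr)   -- with r0 wired to 0
       indirect:   load_rc(ptr, 0)
       displaced:  load_rc(fp, 8)      -- a local at offset 8 */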
-- 

Lawrence Crowl             716-275-5766        University of Rochester
                                               Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl    Rochester, New York,  14627

ludemann@ubc-cs.UUCP (Peter Ludemann) (02/19/86)

In article <2213@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	I understand that every address in the DG Nova used one bit (the
>sign bit, I guess) to indicate indirection.  The indirect addressing mode
>was not decoded as part of the instruction, but as part of the address.  As
>long as the address you fetched had the indirect bit set, the machine kept
>doing another level of indirection.
>
>	I think that the CPU only checked for interrupts at the completion
>of an instruction.  You could generate an infinite indirection loop (two
>addresses which point to each other) and there was no way to halt the CPU
>short of pulling the plug.

A friend of mine programmed one of these.  It had the delightful "feature"
that references to non-existent memory returned data with the
indirection bit set.  So the first invalid address put the machine into
a tight loop.

Old timers may recall that the IBM 1620 (a decimal machine) had an indirection
bit and also variable sized numbers (there was another bit on each digit
to indicate the end of the number).
-- 
-- Peter Ludemann
	ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann)
	ludemann@cs.ubc.cdn  (ludemann@cs.ubc.cdn@ubc.mailnet)
	ludemann@ubc.csnet   (ludemann%ubc.csnet@CSNET-RELAY.ARPA)

ludemann@ubc-cs.UUCP (Peter Ludemann) (02/19/86)

In article <1433@gitpyr.UUCP> kludge@gitpyr.UUCP (Scott Dorsey) writes:
>In article <6777@boring.UUCP> jack@mcvax.UUCP (Jack Jansen) writes:
>>No, please..... The idea of a RISC architecture is to *eliminate*
>>all fancies like indexed addressing. The point is, a compiler will
>>have a terrible time generating code for it. ...
>
>   That is absolutely true, but who is to say just WHAT instructions
>and addressing modes are really needed... 
>...  Without fancy register stacks, RISC machines DO tend to execute
>faster than their CISC contemporaries on most problems.  Maybe this
>is due to the increase in compiler simplicity, in which case the idea
>would seem to be to build a compiler which can better use the CISC.
>
As a sometimes compiler writer who has tackled some optimisation
problems, I would like to point out that compiler writing is
tricky enough without complexities added by the machine's
instruction set.  For a description of some of the difficulties,
take a look at the description of the portable C compiler in
the Unix programmer's manual, volume 2 (especially the part on
register allocation for expressions).

Here's a simple example problem.  On an IBM/360/370/...,
there are two ways to generate code for "A := B[I]":
	LA  R1,B	load address of B into reg 1
	A   R1,I	add value of I to reg 1
	L   R0,0(,R1)	load value pointed to by reg 1 into reg 0
	ST  R0,A	store value in reg 0 into A
and:
	LA  R1,B	load address of B into reg 1
	L   R2,I	load value of I into reg 2
	L   R0,0(R1,R2)	load value pointed to by reg1 + reg2 into reg 0
	ST  R0,A	store value in reg 0 into A
(There are more possibilities, such as replacing the last two lines
of the first way by "MVC A,0(R1) - move memory to memory".  Also,
the "A R1,I" could be done by "LA R1,I(R1)".)

Now, which is the best method?  In this case, the amount of code generated
is almost the same but execution time will vary.  In fact, what is faster
on one machine model may be slower on another.  A dramatic example of this
was a text processing program which was sped up about 10% by replacing
all MVCL (move character long) instructions by the simpler MVC (which could
only do up to 256 bytes) inside a loop.  But I digress.

The two methods have one other difference: the first uses 2 scratch
registers (1 if the MVC instruction is used) whereas the second uses
3 scratch registers.  So, deciding which to use can also depend on
how many scratch registers are available.

As far as I'm concerned, the test for RISCness should be: given any
piece of source code, is there only one reasonable code sequence which can be
output by the compiler?
-- 
-- Peter Ludemann
	ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann)
	ludemann@cs.ubc.cdn  (ludemann@cs.ubc.cdn@ubc.mailnet)
	ludemann@ubc.csnet   (ludemann%ubc.csnet@CSNET-RELAY.ARPA)

jans@tekecs.UUCP (Jan Steinman) (02/19/86)

In article <175@oasys.UUCP> maa@oasys.UUCP writes:
>
>This computer supported infinite indirection with indexing at every level.  
>I think it's a shame that all these wonderful old computers with strange
>architectures are being retired.

Look at Tandem computers.  These are still being made (indeed, Tandem holds
over 80% of the fault-tolerant transaction processing market) and they feature
an "indirect bit" which, if set, says "I'm an address, not data.  Tandem does
not support an assembler, so it is unlikely that infinite indirection will
occur, unless memory gets clobbered somehow.

:::::: Artificial   Intelligence   Machines   ---   Smalltalk   Project ::::::
:::::: Jan Steinman		Box 1000, MS 60-405	(w)503/685-2956 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

kludge@gitpyr.UUCP (Scott Dorsey) (02/20/86)

   I was speaking with David Lane, a user assistant here, about the minimum
number of addressing modes and instructions required to compute any computable
function, and we thought about conditionality.  First of all, why can't a jump
be treated as a special case of LOAD/STORE/MOVE, because most machines implement
it that way anyway (MOV 1200, PC)?  Of course, that makes conditional jumps
impossible.  So, just set one (or two) bits of every instruction aside to make
it possible to conditionally execute any instruction.  Is it worth the trouble
to decode it?  It could be treated so that as the instruction is decoded, the
condition field is checked, and if the condition is not true, the instruction
is aborted (it should only take one machine cycle to check the flag, right?).
One-cycle instructions would have to be executed only if the condition
was "always execute"; otherwise they would have to be delayed a cycle for the
condition field to be decoded.  Is it worth it?  Does anyone care?
-------
Disclaimer: Everything I say is probably a trademark of someone.  But
            don't worry, I probably don't know what I'm talking about.

Scott Dorsey
ICS Programming Lab, Georgia Insitute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge

martin@minster.UUCP (martin) (02/21/86)

In article <1441@gitpyr.UUCP> kludge@gitpyr.UUCP (Scott Dorsey) writes:
>
>impossible.  So, just set one (or two) bits of every instruction aside to make
>it possible to conditionally execute any instruction.  Is it worth the trouble

	The Acorn RISC Machine (ARM) does exactly this, although not for the
same reasons.  All instructions (which are 32 bits long) have a 4-bit condition
field, and the instruction is only executed if the condition is true.
Obviously the condition 'always true' is one of the 16 possibilities.
	The reason for this approach is that it allows small sequences of
instructions to be skipped without the overhead of a branch instruction and
the associated pipeline flush.  In a machine which executes instructions in
one clock cycle this can be a considerable saving.
	The ARM only has two main addressing modes, which are variations on an
indexed addressing mode, but these are associated with various options, so
that the result of the indexing operation can be written back to a register,
allowing other addressing modes (post-increment etc.) to be constructed.
There is also a form of immediate addressing mode.
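
	A sketch in C of the dispatch (the flag layout is invented; the
condition numbers follow the ARM's, as far as I know):

	/* sketch: ARM-style predication - every instruction carries a
	   4-bit condition, and a failing condition makes it a no-op */
	enum { EQ = 0, NE = 1, MI = 4, PL = 5, AL = 14 };  /* 5 of the 16 */

	struct flags { int n, z, c, v; };

	int condition_passes(int cond, struct flags f)
	{
	    switch (cond) {
	    case EQ: return f.z;     /* equal: Z set   */
	    case NE: return !f.z;    /* not equal      */
	    case MI: return f.n;     /* minus: N set   */
	    case PL: return !f.n;    /* plus or zero   */
	    case AL: return 1;       /* 'always true'  */
	    default: return 0;       /* others omitted */
	    }
	}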

	It is interesting to note that Acorn removed delayed branches when
virtual memory support was added to the processor, because it wasn't clear
how to restart the instruction following a branch!  Perhaps some of the
other RISC machines have solved this; if they have, I wonder whether it was
simple enough to be worth it.  Does anyone know?

For more information on the ARM, I know of the following articles:
	Byte January 1986:
		"Byte U.K.: The Acorn RISC Machine"
		pp 387-393
	Electronics, August 5, 1985:
		"Acorn goes to market with RISC microprocessor"
		pp 14-15
	Electronics, August 26, 1985:
		"At 3 Mips, RISC Processor is Among Fastest Chips Around"
		pp 48-49
and for those in the UK:
	Personal Computer World (Sometime late last year!)
		"The Acorn RISC Machine" (approx)

I hope this is of interest:
		M C Atkins
usenet: ukc!minster.uucp!martin

PS.
	I have no connection with Acorn at all, except that I like the look
of the ARM!

bcase@amdcad.UUCP (Brian case) (02/21/86)

In article <1441@gitpyr.UUCP>, kludge@gitpyr.UUCP (Scott Dorsey) writes:
> impossible.  So, just set one (or two) bits of every instruction aside to make
> it possible to conditionally execute any instruction.  Is it worth the trouble

Welllllll, the Acorn RISC has this feature, and there was some work
done by a student at UCSD on this very subject (I can get references
if anyone is interested).  Some conditional branches can be eliminated
by this technique, thus making the observed path length of a program
longer (the conditionally-squashed instructions still flow through the
pipeline); a longer path length can still mean faster execution if a separate
path to instruction memory exists and that memory can support some
sort of burst-mode access.  Anyway, branches are bad things to most
architectures, so eliminating some of them might yield better performance.

    bcase

nather@utastro.UUCP (Ed Nather) (02/22/86)

In article <1441@gitpyr.UUCP>, kludge@gitpyr.UUCP (Scott Dorsey) writes:
> First of all, why can't a jump
> be treated as a special case of LOAD/STORE/MOVE, because most machines implement
> it that way anyway (MOV 1200, PC)?  Of course, that makes conditional jumps
> impossible.  So, just set one (or two) bits of every instruction aside to make
> it possible to conditionally execute any instruction.
> 
> Scott Dorsey

Perhaps you could just paste a bit from a condition flag onto the (low) end;
then you would jump to one of two (successive) locations depending on the
condition.  The first of that pair would probably need to be an unconditional
jump in most cases, but look, Ma!  No decoding!

The LGP-30 computer had only one test instruction (test minus) and it worked,
though it was a touch awkward to program a test for zero -- but it could be done.
Whether it's worth doing is another question.

The TRUE RISC would have only one instruction: subtract.  With only one
instruction, you wouldn't need an op-code field; every instruction is just
a pair of addresses (what to subtract from what).  An "add" subroutine
would need 3 instructions.  With memory-mapped I/O and the above convention
for conditional branching, you'd have it all.
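
(A close cousin, the three-address "subtract and branch if not positive"
machine, makes the point concrete.  A sketch of an interpreter for it in C,
with the conventions invented:)

	#include <stdint.h>

	int32_t mem[4096];   /* program and data share one memory */

	/* sketch: each instruction is three words a, b, c meaning
	   mem[b] -= mem[a], then branch to c if the result is <= 0.
	   A negative pc halts the machine. */
	void run(int32_t pc)
	{
	    while (pc >= 0) {
	        int32_t a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
	        mem[b] -= mem[a];
	        pc = (mem[b] <= 0) ? c : pc + 3;
	    }
	}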

TRUE RISC is not a trademark of any known company.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.UTEXAS.EDU

tim@ism780c.UUCP (Tim Smith) (02/22/86)

In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes:
>
> Functionally, there is no problem with register direct and
> register indirect as the only addressing modes.  Any other modes
> can be simulated with them.  Immediate addressing would be
> simulated by allocating a memory location for the immediate
> value, and loading its address into a register.  Now you can
	     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> use register indirect to load the value into another register.

But how do you load THAT register?
-- 
Tim Smith       sdcrdcf!ism780c!tim || ima!ism780!tim || ihnp4!cithep!tim

patc@tekcrl.UUCP (Pat Caudill) (02/22/86)

	Under some circumstances you can get by with only register-indirect
load and store plus a load-half-register-immediate.  A register-to-register
move can be done as an add of a register and an immediate zero into
another register.  If the immediate is a small constant built into the
main instruction (so it doesn't require a second word fetch), this is a move.
Some designs also forced one register to always contain zero for
sequences like this.
	The IBM 801 allowed memory addressing only for load and store,
but it allowed 16-bit immediate operands.  The instruction would
operate on either the upper or lower half of the register.  At first
I thought this was a crock until I realized it didn't take any more
memory cycles than an instruction followed by a word of immediate
data.  And of course it greatly simplified the instruction cycling.
And sometimes you can get away without the second instruction.
The 801 allowed immediate operands for any of the logical instructions,
and lower-half immediates on add and subtract.
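
	The two-halves trick in miniature, as a C sketch (not 801 syntax):

	#include <stdint.h>

	/* sketch: composing a 32-bit constant from two 16-bit immediates,
	   one instruction per half, with no extra word of immediate data */
	uint32_t load_const(uint16_t upper, uint16_t lower)
	{
	    uint32_t r = 0;
	    r = r | ((uint32_t)upper << 16);  /* 'upper half' immediate op */
	    r = r | lower;                    /* 'lower half' immediate op */
	    return r;
	}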

			Tektronix!tekcrl!patc

mac@uvacs.UUCP (Alex Colvin) (02/22/86)

> So, just set one (or two) bits of every instruction aside to make it
> possible to conditionally execute any instruction.  Is it worth the trouble
> to decode it?

Instead of using the bits, make it an (optional) prefix to the instruction.
Then you've got a machine with conditional skips instead of conditional
jumps.  See the PDP-8, Nova, &c.

crickman@umn-cs.UUCP (Robin Crickman) (02/23/86)

> function, and we thought about conditionality.  First of all, why can't a jump
> be treated as a special case of LOAD/STORE/MOVE, because most machines implement
> it that way anyway (MOV 1200, PC)?  Of course, that makes conditional jumps
> impossible.  So, just set one (or two) bits of every instruction aside to make
> it possible to conditionally execute any instruction.  Is it worth the trouble
> to decode it?  It could be treated so that as the instruction is decoded, the
> condition field is checked, and if the condition is not true, the instruction
> is aborted (it should only take one machine cycle to check the flag, right?).
> One-cycle instructions would have to be executed only if the condition
> was "always execute"; otherwise they would have to be delayed a cycle for the
> condition field to be decoded.  Is it worth it?  Does anyone care?
> 
> Scott Dorsey
> ICS Programming Lab, Georgia Insitute of Technology, Atlanta Georgia, 30332

How about a conditional SKIP instruction (if TRUE, skip the next opcode, else 
execute it), which was the only kind of conditional instruction on the PDP-8?
I believe the PDP-8 used the sort of jump described above.  Was it a RISC 
machine?  It certainly had a reduced instruction set!

John Hasler (guest of ...ihnp4!umn-cs!crickman)

patc@tekcrl.UUCP (Pat Caudill) (02/23/86)

. .. . ... . (line eater kibbles)

	The use of a stored branch as the way to effect a
conditional jump is not new.  In "First Draft of a Report on the
EDVAC", Sec. 11.3, John von Neumann describes the conditional
'branching' facilities of the EDVAC.
	Some background: the EDVAC's ALU (arithmetic organ) had
three registers.  I and J were inputs and O was output.  Numbers were
loaded to I, with the previous value in I shifting to J.  When an
operation was performed the result appeared in O.  There were operations
which allowed I or J to go directly to O, and O to I.  The conditional
operation transferred the contents of I or J unchanged to O, depending
on the previous value in O.  To conditionally branch to one of two
locations, the inputs would be loaded with two jump instructions,
the correct one taken as the result and stored in memory, then
jumped to.  (Often the clever programmer would arrange the store
address to be the next location.)
	For example:

   if ( a < b ) then goto L1 else goto L2 ;

 would code up as:
    (my own notation; von Neumann's is very cryptic, and I don't have the
	Greek alphabet on this terminal)

		LD	A
		LD	B
		SUB				; -
		LD	#JMP+L1
		LD	#JMP+L2
		COND				; s
		STO	L3
	L3:	JMP	L3

	Turing, for the Pilot ACE (I think; look in Early British
Computers, from Digital Press, for more details), just had an instruction
which set the accumulator to zero or 1 depending on its sign.  He
suggested for a conditional branch:

	lbl1 * condition + lbl2 * (1-condition) + jmp -> nextInstr
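
	In C the select is just (a sketch; condition must be exactly 0 or 1):

	/* sketch of Turing's arithmetic conditional: exactly one product
	   survives, giving the chosen jump target */
	unsigned pick_target(unsigned condition, unsigned lbl1, unsigned lbl2)
	{
	    return lbl1 * condition + lbl2 * (1 - condition);
	}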

	We now look down on self modifying code, but at that time
it was considered a great step forward. Ahh, those were the good old days.

			Tektronix!tekcrl!patc

jer@peora.UUCP (J. Eric Roskos) (02/24/86)

> I suggest looking very seriously at the addressing mode provided by
> Patterson for the RISC I and RISC II chips.  The form is
>
>       value in register + constant = effective address
>
> This allows absolute addressing with a 0 valued register, register indirect
> with a 0 valued constant, and register displaced when both are non-zero.

Then, I suggest looking very seriously at the "indexed" mode of the
Motorola 6800 family, 10 years or so older than the RISC chips!

Fortunately, from the example on how to load a constant into a register
that preceded this, I know the above comment was meant as a joke... :-)
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
  US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	    2486 Sand Lake Road, Orlando, FL 32809-7642

jer@peora.UUCP (J. Eric Roskos) (02/24/86)

> As far as I'm concerned, the test for RISCness should be: given any piece
> of source code, is there only one reasonable code sequence which can be
> output by the compiler?

How is that a test for "RISCness"?  Following the above scheme, you could,
for example, have a machine that directly (and optimally) interpreted some
concise representation of the source code... but that's exactly what people
are calling a CISC machine, and are claiming is bad...
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
  US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	    2486 Sand Lake Road, Orlando, FL 32809-7642

mac@uvacs.UUCP (Alex Colvin) (02/25/86)

>
> The TRUE RISC would have only one instruction: subtract.  With only one
> instruction, you wouldn't need an op-code field; every instruction is just
> a pair of addresses (what to subtract from what).  An "add" subroutine
> would need 3 instructions.  With memory-mapped I/O and the above convention
> for conditional branching, you'd have it all.
> 
> TRUE RISC is not a trademark of any known company.

See the New England Digital ABLE 60 processor.  It actually has one
instruction (move).  Arithmetic is done more or less by moving to the ALU.

This processor is used for control systems, digital synthesizers
(Synclavier) and communications networks.

nather@utastro.UUCP (Ed Nather) (02/25/86)

In article <546@tekcrl.UUCP>, patc@tekcrl.UUCP (Pat Caudill) writes:
> 	We now look down on self modifying code, but at that time
> it was considered a great step forward. Ahh, those were the good old days.
> 
> 			Tektronix!tekcrl!patc

It will return; it's too neat an idea to stay dead.  What we lack is the
needed formal discipline to use it carefully.  Spaghetti code has given
way to the cleaner structured programs, now that we have patterned code
closer to the way we think.  I still have, and routinely use, a program
that depends for its successful operation on self-modifying code. [Yay!
I'm out of the closet at last!] I was able to find no other way to meet
the required conditions (animated display using a slow computer) and no
one else has either -- though several have tried.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.UTEXAS.EDU

rmarti@sun.uucp (Bob Marti) (02/25/86)

> I suggest looking very seriously at the addressing mode provided by
> Patterson for the RISC I and RISC II chips.  The form is
>
>       value in register + constant = effective address
>
> This allows absolute addressing with a 0 valued register, register indirect
> with a 0 valued constant, and register displaced when both are non-zero.

I seem to recall the CDC 6000 line had exactly this feature.  There were eight
18-bit index registers (B0 - B7) which were used by certain instructions.
The address was computed using the value in a B-register and adding a
constant.  They even had the B0 register "hardwired" to contain the value 0.

This was almost 20 years before the Berkeley people "invented" RISC ...

--Bob Marti

jlg@lanl.ARPA (Jim Giles) (02/25/86)

In article <1982@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>> As far as I'm concerned, the test for RISCness should be: given any piece
>> of source code, is there only one reasonable code sequence which can be
>> output by the compiler?
>
>How is that a test for "RISCness"?  Following the above scheme, you could,
>for example, have a machine that directly (and optimally) interpreted some
>concise representation of the source code... but that's exactly what people
>are calling a CISC machine, and are claiming is bad...

That's just the problem.  The designers of CISC machines think that they
are making the compiler writer's job easier by supplying opcodes to
perform high level semantics.  Unfortunately, NO ONE has yet succeeded
in developing an instruction set which can 'directly (and optimally)'
interpret the semantics of the source code.  What usually happens is
that the 'high level' instruction is not as efficient as a sequence or
loop of simpler instructions - or it is only efficient for a subset of
its possible uses. ( As an example, consider the famous case of the VAX
instruction for evaluating polynomials.  It was never as fast as a loop
of simpler code to do the same job and it was horribly inefficient in
the case where the compiler could detect a large number of zero
coefficients.)  As a result, compilers must be VERY complex to generate
good code on CISC machines.

At best, the CISC machines provide some instructions which are generally
useful as well as lots of instructions which are seldom or never used
(consider all the weird addressing modes of modern vaxen - which most
compilers never generate).  The idea of RISC is to keep these instructions
which are generally useful, and make them as fast as possible by
eliminating all the seldom used instructions.  A useful side effect of
RISC architecture is that it permits compiler writers to implement high
level semantics with simple 'code skeletons' since there is usually only
one efficient sequence of code which reflects the required meaning.

J. Giles
Los Alamos

kludge@gitpyr.UUCP (Scott Dorsey) (02/25/86)

In article <890@umn-cs.UUCP> crickman@umn-cs.UUCP (Robin Crickman) writes:
>How about a conditional SKIP instruction (if TRUE, skip the next opcode, else 
>execute it), which was the only kind of conditional instruction on the pdp-8?
>I believe the pdp-8 used the sort of jump described above.  Was it a RISC 
>machine?  It certainly had a reduced instruction set!
>
>John Hasler (guest of ...ihnp4!umn-cs!crickman)

   My HP-34C calculator has just that instruction.  Seems a little better
than a short branch, but not all that much.  Conditional branches are probably
faster, even if they require a larger word.  Most branches tend to be short, and
I suspect you'd wind up using a lot of (execute this branch if true) instructions.
I wouldn't call the PDP-8 a RISC machine, but then, what do you call a machine
that can't add?

-------
Disclaimer: Everything I say is probably a trademark of someone.  But
            don't worry, I probably don't know what I'm talking about.

Scott Dorsey
ICS Programming Lab, Georgia Insitute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge

jack@boring.UUCP (02/26/86)

In article <736@ism780c.UUCP> tim@ism780c.UUCP (Tim Smith) writes:
>In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes:
>>
>> Functionally, there is no problem with register direct and
>> register indirect as the only addressing modes.  Any other modes
>> can be simulated with them.  Immediate addressing would be
>> simulated by allocating a memory location for the immediate
>> value, and loading its address into a register.  Now you you can
>	     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> use register indirect to load the value into another register.
>
>But how do you load THAT register?
Ahh, that's simple. You just use the trick Marco Lopes
described above......
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

jeff1@garfield.UUCP (02/26/86)

	It appears that most people agree on the need for immediate or
absolute addressing.  What we have (hopefully) decided on are:
	register direct
	register indirect +4 bit constant

(This is only a 16 bit machine due to size constraints.  Also, this is
only for an undergrad course in VLSI design.  Who can design something
REALLY useful in three months anyway?)

This allows us to fake immediate addressing with two instructions:
	load	Rx,PC,2		; fetch the in-line data word (PC-relative)
	addq	PC,1		; step the PC past the data word
	data
	...

Indirection is trivial.  Load the register, and then use the register for
the address.  You can do based addressing with an add instruction and an
extra register.

Branch instructions are also done neatly.  We have a mask specifying which
condition codes to check, and then a value mask specifying the values to check 
for.  I.e., if (CC & mask) == (mask & value) then branch.
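
In C, the test is simply (the flag assignments here are invented):

	#define Z 0x1   /* zero flag     */
	#define N 0x2   /* negative flag */

	/* sketch of the branch test described above */
	int take_branch(unsigned cc, unsigned mask, unsigned value)
	{
	    return (cc & mask) == (value & mask);
	}

	/* branch-if-zero:     mask = Z, value = Z
	   branch-if-nonzero:  mask = Z, value = 0 */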

It's nice to get all of these useful responses.  Thanks!

Jeff Sparkes
garfield!jeff1
jeff@garfield.mun.cdn  <- ubc-vision or ubc.csnet

ehj@mordor.UUCP (Eric H Jensen) (02/27/86)

In article <426@utastro.UUCP> nather@utastro.UUCP (Ed Nather) writes:
>In article <546@tekcrl.UUCP>, patc@tekcrl.UUCP (Pat Caudill) writes:
>> 	We now look down on self modifying code, but at that time
>> it was considered a great step forward. Ahh, those were the good old days.
>> 
>It will return; it's too neat an idea to stay dead.  What we lack is ...
>                            ... I was able to find no other way to meet
>the required conditions (animated display using a slow computer) and no
>one else has either -- though several have tried.

Let's hope it doesn't....

If you allow self-modifying code, everyone would have to have a
separate copy of their task.  Swap space is not that cheap - with a
lot of people running emacs and maybe a lisp underneath it, kiss your
swap area and probably system performance (unless you buy more main
memory) goodbye....   

I can't see how you could effectively manage a POLICY of
"self-modifying code is okay" in a statically linked environment.
Every sub-system that linked in a routine that modified its own code
would inherit the 'impure' curse.   Perhaps this would not be such
a problem in a dynamically linked environment.

Hardware-wise the key here is "... using a SLOW computer" (emphasis
mine).  Once a computer is targeted to perform above a certain level,
the design must resort to having separate instruction and data caches
(unless you can get those 3ns 16k rams that the Japanese are using in
a dual port configuration). A virtually addressed, read-only (the
piece of logic that writes it is the miss handler) icache (instruction
cache) is a very simple and compact design. Btw the icache is usually
in one of the critical paths of the machine - pc setup for branches.

A store into the icache is going to cost at the very least one dead
cycle unless more memory chips are added to play odd/even games (i.e.
access the odd words while writing the even words), but then there is
more hair (deferred writes) and possibly a sacrifice in speed (longer
logic paths).  A cache coherency problem also arises - the write to
the icache must also be reflected in the dcache (data cache).  I can
think of a number of approaches to this problem, but either they are
messy or they create dead machine cycles (no computation takes place
while the update is made).  In any event the design gets a lot larger,
and it is dubious that any performance is gained.

I would opt for efficient indexed addressing modes and instructions that
keep the requirement for constant operands to a minimum (e.g. the
amount to shift by for the shift instruction).  Aren't these the items
that made self-modifying code passé?

-- 
eric h. jensen        (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229 USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA:  ehj@angband    UUCP:   ...!decvax!decwrl!mordor!angband!ehj

rogerh@megaron.UUCP (02/27/86)

(Oh, no -- I've lost my reference for this!
 If anyone knows what goes in the blank ("[]") below,
 send me mail, thanks. -rh)

This (self-modifying code) is the way [] does a bitblt; the
entry code checks for the base cases, otherwise it builds
custom code to do the bitblt and then jumps to it.  Very
clever -- the code that builds the code is larger than a 
general case within the bitblt would be, but performance is much
higher because the code that's built can be optimal.

clif@intelca.UUCP (Clif Purkiser) (02/27/86)

> In article <1982@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
> >> As far as I'm concerned, the test for RISCness should be: given any piece
> >> of source code, is there only one reasonable code sequence which can be
> >> output by the compiler?
> >
> >How is that a test for "RISCness"?  Following the above scheme, you could,
> >for example, have a machine that directly (and optimally) interpreted some
> >concise representation of the source code... but that's exactly what people
> >are calling a CISC machine, and are claiming is bad...
> 
> That's just the problem.  The designers of CISC machines think that they
> are making the compiler writer's job easier by supplying opcodes to
> perform high level semantics.  Unfortunately, NO ONE has yet succeeded
> in developing an instruction set which can 'directly (and optimally)'
> interpret the semantics of the source code.  What usually happens is
> that the 'high level' instruction is not as efficient as a sequence or
> loop of simpler instructions - or it is only efficient for a subset of
> its possible uses. ( As an example, consider the famous case of the VAX
> instruction for evaluating polynomials.  It was never as fast as a loop
> of simpler code to do the same job and it was horribly inefficient in
> the case where the compiler could detect a large number of zero
> coefficients.)  As a result, compilers must be VERY complex to generate
> good code on CISC machines.
> 
> At best, the CISC machines provide some instructions which are generally
> useful as well as lots of instructions which are seldom or never used
> (consider all the weird addressing modes of modern vaxen - which most
> compilers never generate).  The idea of RISC is to keep these instructions
> which are generally useful, and make them as fast as possible by
> eliminating all the seldom used instructions.  A useful side effect of
> RISC architecture is that it permits compiler writers to implement high
> level semantics with simple 'code skeletons' since there is usually only
> one efficient sequence of code which reflects the required meaning.
> 
> J. Giles
> Los Alamos

	The problem with RISC, when taken to its logical conclusion,
is that you get rid of a lot of useful instructions and very useful addressing 
modes.

	The RISC machine I designed for my Computer Architecture course
(taught by Mr. RISC himself, Dave Patterson) had 24 instructions and two 
addressing modes; it didn't even have a multiply.  While I was very happy at
the time that I didn't have to try to write microcode for a lot of complex
instructions (the previous class had to implement the Z8000 with TTL SSI and
MSI parts), my machine sure took a long time to do useful work.  Needless to
say, my RISC computer was a toy compared to real RISC machines.

	J. Giles' point is well taken that compiler writers have not yet figured
out a way of taking advantage of the numerous addressing modes and 
instructions offered by CISC machines such as the 86, Z8K, 68K or 32K families.
I would even concede that you can still have a RISC machine that takes more
than one clock to complete an instruction.  This allows an important
instruction such as multiply to be included in the instruction set.

	At first glance many CISC instructions seem useless (some probably
are), but on closer inspection they generally have quite a few uses.  Take for
example my favorite CISC machine (you guessed it), the 80386.

	I am sure no good RISC designer would include such obscure instructions
as ASCII Adjust for Addition or Subtraction (AAA, AAS, etc.) or XLAT 
(Table Look-up Translation).  Yet they all have their uses.  The ASCII adjust
instructions are used by COBOL compilers (yes, people still use it) and 
spreadsheet designers, because COBOL uses BCD and these instructions speed 
this process up.  Likewise, XLAT can be used for converting ASCII to EBCDIC
in 5 clocks.  

	One disagreement I have with the RISC proponents is the theory that
everyone writes in an HLL.  It seems that despite years of trying to force
everyone to write in high-level languages, there will always be a few assembly 
language programmers, because no matter how much performance semiconductor and 
computer companies give programmers, they always want their programs to run 
faster.  So these CISC instructions, while not useful to compiler writers, are 
useful to assembly language jocks. 

	I also think that many of the addressing modes on CISC machines
result in higher performance than the simpler RISC addressing modes.
For example take
long x, a[10], i; 
x = a[i];

	This probably compiles into these RISC instructions, assuming x, a, and i
are in registers R1, R2, and R3 respectively:
ShiftL R4, R3, #2
Add    R4, R4, R2
Load   R1, [R4]
This is a single 4 clock instruction on the 80386 vs 3 clocks for the RISC
chip.  However, the RISC chip has had to fetch 3 instructions vs one for 
the CISC processor.  So unless the RISC chip has a large on-chip cache it
will be slower.    

	I guess my main point is that a RISC designer, in the 
search to simplify the chip's complexity, may easily throw out obscure
instructions which are useful to certain applications, or may eliminate
generally useful addressing modes.

	I think that the good thing about the RISC philosophy is that it
will reduce the tendency of designers to add new instructions or addressing
modes just because they look whizzy or because Brand X has them.  If a complex
way of doing something is slower than a simple way, don't put it in.
	
-- 
Clif Purkiser, Intel, Santa Clara, Ca.
HIGH PERFORMANCE MICROPROCESSORS
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif
	
{standard disclaimer about how these views are mine and may not reflect
the views of Intel, my boss , or USNET goes here. }

jbs@mit-eddie.UUCP (Jeff Siegal) (02/27/86)

In article <5681@mordor.UUCP> ehj@mordor.UUCP (Eric H Jensen) writes:
>If you allow self-modifying code, everyone would have to have a
>separate copy of their task.  Swap space is not that cheap - with a
>lot of people running emacs and maybe a lisp underneath it, kiss your
>swap area and probably system performance (unless you buy more main
>memory) goodbye....   

hmmm.  What about copy-on-write access?

>I can't see how you could effectively manage a POLICY of
>"self-modifying code is okay" in a statically linked environment.
>Every sub-system that linked in a routine that modified its own code
>would inherit the 'impure' curse.   Perhaps this would not be such
>a problem in a dynamically linked environment.

I'm not sure I understand what you are saying.  Under Unix, at least,
sub-systems are not shared.  Please explain.

Jeff Siegal - MIT EECS

jer@peora.UUCP (J. Eric Roskos) (02/27/86)

> That's just the problem.  The designers of CISC machines think that they
> are making the compiler writer's job easier by supplying opcodes to
> perform high level semantics.  Unfortunately, NO ONE has yet succeeded
> in developing an instruction set which can 'directly (and optimally)'
> interpret the semantics of the source code.

Now, this is exactly the point at which the RISC/CISC debate degenerates into
the sort of assertions more commonly found in advertising than in research.
The fact that "NO ONE has yet succeeded" in no way demonstrates or proves
that it can't be done; this is a simple fallacy!  It involves arguing that

	(a "good" CISC machine exists) -> (one can be built)
	~(a "good" CISC machine exists)
	:. ~(one can be built)

which would be like arguing, at the time that people were trying to build
flying machines and they were all falling apart, that this meant that
it was impossible to do.

Note that in my original posting, I didn't say "interpret the semantics
of the source code," I said interpret some *representation* of the source
code.  There is a lot of work done by compilers (e.g., resolving symbolic
references, translating infix arithmetic expressions with ASCII constants
into a more manageable form, etc.) which takes a lot of time, and any
machine (such as these microcomputers they are always building projects
for in Byte that directly execute BASIC) which takes on that task will
be much slower.

The problem with the practical examples of why "it won't work" is that
invariably one group of people designed an instruction set, and put in
features that they thought, or had been told indirectly, would be "useful"
for compilers; but this group of people knew almost nothing about compiler
writing, and so they simply did it wrong.
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
  US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	    2486 Sand Lake Road, Orlando, FL 32809-7642

ehj@mordor.UUCP (Eric H Jensen) (02/27/86)

In article <906@megaron.UUCP> rogerh@megaron.UUCP (Roger Hayes) writes:
>(Oh, no -- I've lost my reference for this!
> If anyone knows what goes in the blank ("[]") below,
> send me mail, thanks. -rh)
>
>This (self-modifying code) is the way [] does a bitblt; the
>entry code checks for the base cases, otherwise it builds
>custom code to do the bitblt and then jumps to it.  Very
>clever -- the case to build the code is larger than the 
>case within the bitblt would be, but performance is much
>higher because the code that's built can be optimal.


I would argue that this is not self-modifying code - new code is being
generated, the old code is not being modified.  We use this technique
to speed up our hardware simulator.  The user can tell the simulator
to generate code for every node in the net that will gather, operate
on, and scatter the data bits associated with that node.  The
resulting simulation is REAL fast.

-- 
eric h. jensen        (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229 USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA:  ehj@angband    UUCP:   ...!decvax!decwrl!mordor!angband!ehj

mash@mips.UUCP (John Mashey) (02/28/86)

>M C Atkins, ukc!minster.uucp!martin writes:
> 	It is interesting to note that Acorn removed delayed branches when
> virtual memory support was added to the processor, because it wasn't clear
> how to restart the instruction following a branch! Perhaps some of the
> other RISC machines have solved this, if they have I wonder if it was simple
> enough to be worth it, or not? Does anyone know?

1) Other RISC machines have solved this.
2) Careful design is needed, but it turns out that the same careful design
is needed to make heavily-pipelined machines work efficiently anyway,
with clean exception handling, i.e., the virtual memory problem is just
one example of a general class. 
3) This can be done fairly simply (a sketch follows the list):
	a) On an exception caused by an instruction in a branch-delay slot:
	b) suppress the branch & the following instruction.
	c) Point a register at the branch, not at the instruction following.
	d) Set a magic bit that says that's what happened (a convenience).
	e) Take the exception.
	f) on returning from the exception, re-execute the branch.
4) Was it worth it? Yes.  With current technology, if you want to do 
heavily-pipelined, high-performance designs, delayed branches are Good Things.
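
To make 3) concrete, here is the recovery logic as a C-flavored sketch;
the names, the magic-bit encoding, and the 4-byte instruction size are
all assumptions for illustration, not any particular machine's actual
definitions:

unsigned long fixup_epc(pc, in_delay_slot, cause)
unsigned long pc;		/* address of the faulting instruction */
int in_delay_slot;		/* (a): the hardware knows where it was */
unsigned long *cause;
{
	if (in_delay_slot) {
		*cause |= 1;	/* (d): set the magic bit               */
		return pc - 4;	/* (b),(c): point at the branch; neither
				   it nor the delay slot has completed  */
	}
	return pc;		/* the ordinary case                    */
}

(e) is taking the exception itself; (f) falls out because returning to
the fixed-up address re-executes the branch, and the delay-slot
instruction then follows it normally.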
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.UUCP (John Mashey) (02/28/86)

Pat Caudill, Tektronix!tekcrl!patc writes:
> 	Under some circumstances you can get by with only register
> indirect load and store and load half register immediate....
> 	The IBM 801 allowed memory addressing only for load and store.
> But it allowed 16 bit immediate operands. The instruction would
> operate on either the upper or lower half of the register. At first
> I thought this was a crock until I realized it didn't take any more
> memory cycles than an instruction followed by a word of immediate
> data. And of course it greatly simplified the instruction cycling.

1) Yes, it looks like a crock, but it works.  Sneaky compilers can even
do things like keeping the upper-half of an address around in a register
for a while, thus lessening the number of fetches.
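
To illustrate with a sketch (not the 801's actual mnemonics), building a
32-bit value from two 16-bit immediates looks like this in C terms:

unsigned long build_addr(hi, lo)
unsigned int hi, lo;
{
	/* "load upper half", then "or in lower half" */
	return ((unsigned long)hi << 16) | lo;
}

If several globals share the same upper half, the first step happens once
and the register holding it gets reused, which is exactly the sneaky-compiler
trick described above.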

2) On this whole business of addressing modes [I'm just catching up,
since we moved buildings and were out of touch]: I recommend:
	A Characterization of Processor Performance in the VAX 11/780,
	Joel S. Emer, Douglas W. Clark, both of DEC
	Proc. 11th Ann. Symp. on Computer Architecture (Ann Arbor, Mich.,
	June 5-7, 1984), IEEE, N.Y., 301-310.
These guys have done a number of excellent studies of real VAX characteristics,
and even better, have published them.  I urge people who argue the merits
of architectural features to look up some of these papers.  For example,
see Table 4 in the above paper, from which a few columns are excerpted below:

Operand specifier distribution: (total)

Register	R	41.0%  ==> 41.0%

Short literal 	#n	15.8%
Immediate	(PC)+	 2.4%	==> 18.2%

Displacement	D(R)	25.0%
Reg.Deferred	(R)	 9.2%
Auto-inc	(R)+	 2.1%
Disp. Deferred	@D(R)	 2.7%
Absolute	@(PC)+	 0.6%
Auto-inc-deferr	@(R)+	 0.3%
Auto-dec	-(R)	 0.9%
			-----
			40.8%

What does this tell you?
1) registers get used (41%).
2) so do literals (15.8% + 2.4% = 18.2%)
3) Of the modes that actually reference memory, 83% ((25.0+9.2)/40.8) are
of the form D(R) or (R) (and the latter, of course, is 0(R)).

Note: don't read TOO much into these figures, but it is interesting to
see how little dynamic count is contributed by some of the lesser-used modes.
Again, I recommend this paper highly.  Among other things, it has a good
analysis of average VAX instruction timing, showing that the "average"
VAX instruction uses 10.6 200ns cycles (200ns is 5 MHz, and 5/10.6 is
about 0.47 native Mips), i.e.:
	A "1-MIP" 11/780 IS REALLY A HALF-VAX-MIP CPU!
(This is why we always talk in performance relative to VAXen; counting
Mips only confuses you.)
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ehj@mordor.UUCP (Eric H Jensen) (03/01/86)

In article <1106@mit-eddie.UUCP> jbs@eddie.UUCP (Jeff Siegal) writes:
>In article <5681@mordor.UUCP> ehj@mordor.UUCP (Eric H Jensen) writes:
>>I can't see how you could effectively manage a POLICY of
>>"self-modifying code is okay" in a statically linked environment.
>>Every sub-system that linked in a routine that modified its own code
>>would inherit the 'impure' curse.   Perhaps this would not be such
>>a problem in a dynamically linked environment.
>
>I'm not sure I understand what you are saying.  Under Unix, at least,
>sub-systems are not shared.  Please explain.

Let's say csh uses a library routine FOOBAR which modifies its code
(assuming this is allowed). Now consider that csh forks another csh
(called csh<2>).  csh<2> does not get the state of the csh executable
on the disk, it gets the state of its parent executable.  Furthermore
assume that the parent does not wait for its child to exit but keeps
on running. If csh and csh<2> share the executable and they both make
use of FOOBAR - good luck trying to find out what went wrong when your
system pukes.  So to rectify the situation when you fork, you bring in
a clean copy of the executable.

You mentioned a copy-on-write scheme. Maybe on a machine that allows a
physical page to be mapped to more than one virtual page this could be
done right, but on a machine like the latest IBM RISC, where a physical
page can only be mapped to one virtual page (this saves either a pipeline
stage or gobs of hardware for determining whether you hit or not), you
are forced to copy the executable into a new segment.

But the above really doesn't get at the issue.  If performance is a
goal, the hardware is going to be slower (or more expensive) if it has
to deal with self-modifying code even if the software can handle it.



-- 
eric h. jensen        (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229 USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA:  ehj@angband    UUCP:   ...!decvax!decwrl!mordor!angband!ehj

mc68020@gilbbs.UUCP (Tom Keller) (03/02/86)

In article <1989@peora.UUCP>, jer@peora.UUCP (J. Eric Roskos) writes:
> Note that in my original posting, I didn't say "interpret the semantics
> of the source code," I said interpret some *representation* of the source
> code.  There is a lot of work done by compilers (e.g., resolving symbolic
> references, translating infix arithmetic expressions with ASCII constants
> into a more manageable form, etc.) which takes a lot of time, and any
> machine (such as these microcomputers they are always building projects
> for in Byte that directly execute BASIC) which takes on that task will
> be much slower.

   A minor point, perhaps, but in fact there is no microprocessor on the
market that "directly executes BASIC."  There are a few that include a
ROM-based BASIC interpreter on-chip.  I realize that some people might 
consider me to be just a bit picky, but I don't consider this to be a
microprocessor that "directly executes BASIC."


-- 

====================================

tom keller
{ihnp4, dual}!ptsfa!gilbbs!mc68020

(* we may not be big, but we're small! *)

aglew@ccvaxa.UUCP (03/02/86)

> register indirect + 4 bit constant

4 bit constants hold only 20% of the offsets you're going to want, without
scaling, and only 30% with. These numbers are preliminary results of a
study I'm making. The distribution of the size of addressing constants
falls off roughly exponentially, with a 50% mark at 6 bits, and an 85% mark
at 8 bits.

What I'm trying to say is, you'll get much much better results if you provide
just a few more bits in the constant.
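
(One way to gather such numbers -- a sketch, and not necessarily the
method used in the study above -- is to bucket each addressing constant
in a trace by the number of bits it needs:)

int bits_needed(off)
unsigned long off;
{
	int n = 0;

	while (off != 0) {	/* width of the offset in bits */
		n++;
		off >>= 1;
	}
	return n;		/* 0 for a zero offset */
}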

aglew@ccvaxa.UUCP (03/03/86)

We DO have the formal discipline to use self-modifying code:
it's called incremental compilation.

For some specific instance of a general routine with all sorts of
ifs depending on parameter values, take a general prototype with all
the ifs in it, compile only the parts you need into a buffer, and then
execute that buffer.

Note: this is at the module level, not the machine instruction level.
Self-modifying code at the instruction level will just always be too
machine dependent to become formalized as standard practice. Doing things
at the module level doesn't lose much efficiency on the new machines
that support function calls well (see the sketch below).

By the way, such compilations into buffers DO NOT make programs non-reentrant
- you just have to have a concept of more than one code (text) segment,
some pure and shared, others not. 
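
A module-level sketch of the idea in C -- all names invented for
illustration, and with actual code generation replaced by precompiled
variants: choose the specialized routine once, instead of re-testing the
parameter flags inside the inner loop.

typedef long (*variant)();

long walk_checked() { return 1; }	/* general: all the ifs, tested at run time */
long walk_fast()    { return 0; }	/* specialized: the ifs compiled away       */

variant choose(checks_needed)
int checks_needed;
{
	return checks_needed ? walk_checked : walk_fast;
}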

prl@ethz.UUCP (Peter Lamb) (03/04/86)

In article <1468@gitpyr.UUCP> kludge@gitpyr.UUCP (Scott Dorsey) writes:
>I wouldn't call the PDP-8 a RISC machine, but then, what do you call a machine
>that can't add?

Now that just isn't true: the PDP-8 _did_ have an add instruction.
What it was missing was any LOAD instruction (no this is not a joke).

For those who have never seen a PDP-8 instruction set, here it is
(I'm doing this because I think that the '8 _was_ a RISC, though
the motivations for their design choices seem to be different:
this shouldn't surprise anyone, since the first '8s were implemented
with discrete transistors - that meant that ALU's and registers were
_expensive_)

Anyway, the 8 had a 3-bit opcode, 6 memory reference instructions
and the other 2 opcodes were used for register operate and IO,
where for these two instructions the rest of the bits described what the
instruction actually did.
Remaining were 9 bits (12-bit word), 1 indirect bit, 1 base page/current page
bit and 7 address bits. Fixed 128-word pages, none of your fancy PC-relative
stuff - that would have required another _expensive_ adder, or extra cpu
cycles.

The instructions were, then (some of the opcode names may be wrong,
it's been a long time...):
	TAD		Two's Complement Add ((acc) <- acc+(EA))
	SCA		Store and Clear Accumulator ((EA)<-(acc), (acc)<-0)
	ANA		And Accumulator ((acc)<-(EA)&(acc))
	ISZ		Increment & Skip on Zero ((EA)<-(EA)+1, skip if zero)
	JMP		Jump (PC)<-EA
	JSR		Jump to (EA)<-(PC),(PC)<-EA+1

The accumulator operate instructions could do two accumulator
operations in one instruction: one subset in the first clock cycle
of the instr, another in the next; generally they were arranged
sensibly, for example CMA (complement accumulator) was a cycle 1
opcode, and INC a cycle 2 opcode, so that negate could be done in one
instruction. There were also opcodes for shift, rotate, and conditional
skip.
Autoincrement indirection was also there (though not as opcode bits):
there were certain locations in low memory which, when addressed
indirectly, incremented themselves (shudder.....).

Anyway, given the aim of producing a minimum cost machine on the
available technology, I think that the PDP-8 was an _extremely good_
machine (fire-proof overalls on). I have heard one rumor that
the '8 was the first CPU to be sold for under $US10000, then later the
first under $US1000. In its time, it was very popular indeed, and
until reasonably recently (on the order of the mid-to-late 1970's) there were still
uProc versions of the 8 inside DEC products.

The important thing, in this discussion, though, was that
some similar philosophy was at work - the instruction set was
stripped down to a bare minimum: here to simplify hardware and
reduce cost by reducing the device count, in modern RISC to
reduce design time and (individual) instruction execution time.
But both with the aim (and in the case of the '8, successfully)
of getting better cost-performance.

Mind you, I wouldn't want to program one in assembler though :-) .
-- 
Peter Lamb	({seismo|decvax}!mcvax!cernvax!ethz!prl)
Institut fur Informatik
ETH-Zentrum
8092 Zurich

david@ztivax.UUCP (03/04/86)

/* Written  6:57 am  Feb 27, 1986 by clif@intelca in ztivax:net.arch */
I will try to keep from flaming, but clif@intelca, your claims are wrong.  I
do not work in your high performance processor lab, but I sometimes have to
write code for those awful things Intel makes.  Do not let your loyalties
to your employer cloud your thinking.

>> ...  What usually happens is
>> that the 'high level' instruction is not as efficient as a sequence or
>> loop of simpler instructions
>	I am sure no good RISC designer would include such obscure instructions
>as Ascii Adjust for Addition, Subtraction, AAA,AAS etc. or XLAT 
>(Table Look-up Translation).  

Yes, you are right.  This is a perfect example of a useless instruction.  If
you code the action in low level 80*86 instructions, you will have a faster 
loop.  You may claim that the instruction itself is fast, but look at all that
silly setup which must first be done to even use this obscure example of
brain damage.

>	One disagreement I have with the RISC proponents is the theory that
> everyone writes in a HLL.  It seems that despite years of trying to force
> everyone to write in HLLs, there will always be a few assembly language
> programmers, because no matter how much performance semiconductor and
> computer companies give programmers, they always want their programs to run
> faster.  

When optimizing code, I, like most others, have found that 1) it's in the
algorithm, not the instructions, and 2) the super whiz-bang instructions
are expensive due to special setup and slow hardware / micro-code 
implementation.  And, if I am going to write assembler, I always find it
easier and more maintainable if done using a regular orthogonal instruction
set, not using the obscure bells and whistles.

> So these CISC instructions, while not useful to compiler writers, are 
> useful to assembly language jocks. 

Really?  I have found, as have many others, that assembler programmers are
like high level language programmers:  they use convenient subsets of the 
language.

And for the quick clock times you quote for the 80*86 family, you should
include all the time it takes to perform the loads of operands into the
specific registers required.  You will be surprised and, I hope, enlightened.

.... End of flame

However, clif, a good question you have implied:  Does anyone do assembler 
programming on RISC machines?  

For example: Is UNIX on RISC entirely in C?  How do 
the stack recovery mechanisms work?  Interrupted / failed instructions?  
CPU mode register control manipulation (if applicable)?

David Smyth,
Free and proud of it.

rudell@cad.UUCP (Richard Rudell) (03/04/86)

>                                       ...... What usually happens is
> that the 'high level' instruction is not as efficient as a sequence or
> loop of simpler instructions - or it is only efficient for a subset of
> its possible uses. ( As an example, consider the famous case of the VAX
> instruction for evaluating polynomials.  It was never as fast as a loop
						  ^^^^^^^^^^^^^^^^^^^^^^^
> of simpler code to do the same job and it was horribly inefficient in
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> the case where the compiler could detect a large number of zero
> coefficients.)  As a result, compilers must be VERY complex to generate
> good code on CISC machines.

When in doubt, run a test.  The test program follows the message.  I tested
the following:

	1) VAX polyd instruction
	2) simple loop in C to evaluate a polynomial
	3) simple loop in assembler using register variables
	4) unrolled loop in assembler (i.e., no branches)

The test was run for a degree 4 polynomial with no zero coefficients.
Output from a VAX 11/785 with FPA:

	count is 1000000
	Time for null loop was 1.0 microseconds
	Time for polyd was 15.3 microseconds
	21.128271
	Time for poly in C was 36.8 microseconds
	21.128271
	Time for poly asm loop was 28.5 microseconds
	21.128271
	Time for poly asm loop unrolled was 22.6 microseconds
	21.128271

Certainly seems to me that the polyd runs faster than the best I can do
in assembly language on a VAX.  Results summarized from a VAX 8650 and a
MicroVax-II (with FPA) are:


			VAX 8650	MicroVax-II

	VAX polyd	5.1 us		30.4 us
	C code		8.7 us		69.2 us
	asm loop	7.5 us		57.9 us
	asm unrolled	6.0 us		48.4 us

I have been told that polyd is NOT emulated on the MicroVax-II, and
these results seem to confirm this.  The difference is less on the
8650, but still noticeable.

I will not argue the point of whether polyd is a frequent enough
instruction to warrant being included in an instruction set, nor
whether a VAX might have a cycle time n% less if polyd weren't
included.  Nor will I comment on the efficiency or inefficiency of
evaluating a polynomial with many zero coefficients in this way.
Lastly, I will not comment on the difficulty of a compiler generating
code for an instruction which changes 6 of the 16 general purpose
registers.

My only point is that the VAX polyd instruction is FASTER than the best
one can do in VAX assembler given all nonzero coefficients.  Period.
So why is the VAX poly instruction so famous ?

Richard Rudell.



--------- program follows for those interested ----------- 
/* quick hack to time polynomial instruction */

/* WARNING: highly VAX 4.3bsd specific !!! */

#include <stdio.h>
#include <sys/types.h>
#include <sys/times.h>

double ptime()
{
    struct tms buffer;
    double time;

    times(&buffer);
    time = buffer.tms_utime / 60.0;
    return time;
}

#define TIME(string, action) {\
    double time = ptime(); register int i_;\
    for(i_ = 0; i_ < count; i_++) {action;}\
    time = ptime() - time - offset;\
    printf("Time for %s was %3.1f microseconds\n", string, time/count*1e6);\
    if (strcmp(string, "null loop") == 0) offset = time;\
}

double coef[12] = {1.0,-2.0,3.0,-4.0,5.0,-6.0,7.0,-8.0,9.0,-10.0};

main(argc, argv)
int argc;
char **argv;
{
    double f, 				   /* -8(fp) */
    g=3.141512341222419283749239849839842, /* -16(fp) */
    h=2.7182812454219283748238748392, 	   /* -24(fp) */
    offset=0.0;				   /* -32(fp) */
    register int i, count;				  

    for(i = 0; i < 12; i++) coef[i] /= h;	/* mix things up a little */

    if (argc >= 2)
	count = atoi(argv[1]);
    else
	count = 100000;
    printf("count is %d\n", count);

    TIME("null loop", ;);

    TIME("polyd", asm("polyd -16(fp),$4,_coef"); 
		  asm("movd r0,-8(fp)"));
    printf("%f\n", f);

    TIME("poly in C", f=coef[0]; for(i=1;i<=4;i++) f=f*g+coef[i];);
    printf("%f\n", f);

    TIME("poly asm loop", 
		     asm("moval _coef,r3");
		     asm("movd (r3),r0");
		     asm("movd -16(fp),r4");
		     asm("movl $1,r2");
		     asm("Lxx: muld2 r4,r0");
		     asm("addd2 (r3)[r2],r0");
		     asm("aobleq $4,r2,Lxx");
		     asm("movd r0,-8(fp)"));
    printf("%f\n", f);

    TIME("poly asm loop unrolled", 
		     asm("moval _coef,r3");
		     asm("movd -16(fp),r4");	/* move arg to r4 */
		     asm("movd (r3)+,r0");
		     asm("muld2 r4,r0");
		     asm("addd2 (r3)+,r0");
		     asm("muld2 r4,r0");
		     asm("addd2 (r3)+,r0");
		     asm("muld2 r4,r0");
		     asm("addd2 (r3)+,r0");
		     asm("muld2 r4,r0");
		     asm("addd3 (r3),r0,-8(fp)"));
    printf("%f\n", f);
}

franka@mmintl.UUCP (Frank Adams) (03/04/86)

In article <187@anwar.UUCP> mlopes@anwar.UUCP (Marco Lopes) writes:
>In my earlier posting, I assumed the existence of absolute addressing.
>Therefore, a minimum set of addressing modes is:
>
>		register
>		register indirect
>		absolute
>
>Another functionally complete set would have immediate instead of absolute.

You should seriously consider using immediate instead of absolute.  One is
inclined to assume it is inferior, but this is not obvious to me.

Actually, I would recommend using both immediate and absolute addressing
modes.  At this point you have two one-bit decisions: register or immediate,
and direct or indirect; absolute is equivalent to immediate indirect.
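
In C terms the equivalence is just this (a sketch; the names are
invented):

long v = 42;	/* a global: it lives at a fixed ("absolute") address */

long demo()
{
	long x, y;

	x = 1234;	/* immediate: the constant itself is the operand */
	y = v;		/* absolute: the constant &v is used as an
			   address, i.e., an immediate taken indirectly */
	return x + y;
}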

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

franka@mmintl.UUCP (Frank Adams) (03/04/86)

In article <169@ubc-cs.UUCP> ludemann@ubc-cs.UUCP (Peter Ludemann) writes:
>As far as I'm concerned, the test for RISCness should be: given any
>piece of source code, is there only one reasonable code sequence which can be
>output by the compiler?

This is not a possible objective.  Consider sequences like:

   A = B[I];
   C = D[I];

On any machine which has registers, it is possible to do this by loading I
into a register, doing an address calculation for B[I] (which may be combined
with the next instruction), and moving the value of B[I] to A (perhaps via a
register).  One can then perform the same calculation for the second
statement.  Alternatively, one can keep the value of I (possibly shifted) in
a register after the first statement, and use it again in the second.

Either of these is a reasonable code sequence to be generated; it depends on
how much optimization you want to do.  This is only one simple example; there
are arbitrarily complex ones.  The problem of code optimization is
undecidable in general.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

jlg@lanl.UUCP (03/05/86)

In article <78@cad.UUCP> rudell@cad.UUCP (Richard Rudell) writes:
...
>Lastly, I will not comment on the difficulty of a compiler generating
>code for an instruction which changes 6 of the 16 general purpose
>registers.
>
>My only point is that the VAX polyd instruction is FASTER than the best
>one can do in VAX assembler given all nonzero coefficients.  Period.
>So why is the VAX poly instruction so famous ?

You would have convinced me if the benchmarks you did had included the
necessary set-up code for each different possible coding, as well as
use in a real code where register scheduling conflicts must be weighed
in the balance.  I didn't claim that the individual instruction was always
less efficient (maybe I should have phrased it differently). Sometimes it
is less efficient, but the hardware types wouldn't have provided it
if its failings were really obvious.  The problem is that the instruction
requires extra set-up that ordinary code sequences don't, and (as you say)
uses up 6 registers.  Since your benchmark doesn't include timing of
these in a real code environment, it doesn't really measure the speed
of the instruction.

J. Giles
Los Alamos

jdz@wucec2.UUCP (03/05/86)

> The TRUE RISC would have only one instruction: subtract.  With only one
> instruction, you wouldn't need an op-code field; every instruction is just
> a pair of addresses (what to subtract from what).  An "add" subroutine
> would need 3 instructions.  With memory-mapped I/O and the above convention
> for conditional branching, you'd have it all.

Not quite. You need two addresses, you see. If we go to two instructions, we
can get away with only one address. To wit:

	SUBS X	Subtract contents of location X from accumulator and
		store results in accumulator and in X

	JUMP X	Indirect jump; i.e. PC <= contents of location X

Proving this beast to be functionally complete is arduous but do-able.
We use it as an exercise in our Computer Architecture class. The machine is
called the TWINC, for TWo INstruction Computer. We have surmised the existence
of an OINC, but have not successfully proved functional completeness (the
conditional branch is usually the problem). You describe OINC with two addr. -
can you do it with only one?

One student did a VLSI design of the TWINC for LSI design class; we're thinking
about fabrication by MOSIS. Only did an 8 bit word, though.

I don't have the context for the comment about "the above convention for
conditional branching" - could someone mail it to me?
-- 
Jason D. Zions			...!{seismo,cbosgd,ihnp4}!wucs!wucec2!jdz
Box 1045 Washington University
St. Louis MO 63130  USA		(314) 889-6160
Nope, I didn't say nothing. Just random noise.

hansen@mips.UUCP (Craig Hansen) (03/06/86)

> For those who have never seen a PDP-8 instruction set, here it is
> The instructions were, then (some of the opcode names may be wrong,
> it's been a long time...):
> 	TAD		Two's Complement Add ((acc) <- acc+(EA))
> 	SCA		Store and Clear Accumulator ((EA)<-(acc), (acc)<-0)
> 	ANA		And Accumulator ((acc)<-(EA)&(acc))
> 	ISZ		Increment & Skip on Zero ((EA)<-(EA)+1, skip if zero)
> 	JMP		Jump (PC)<-EA
> 	JSR		Jump to (EA)<-(PC),(PC)<-EA+1

More correctly, these should be:

	AND		And
	TAD		Two's Complement Add
	ISZ		Increment and Skip if zero
	DCA		Deposit and clear accumulator
	JMP		Jump
	JMS		Jump to subroutine

> Mind you, I wouldn't want to program one in assembler though :-) .
> -- 
> Peter Lamb	({seismo|decvax}!mcvax!cernvax!ethz!prl)
> Institut fur Informatik
> ETH-Zentrum
> 8092 Zurich

Actually, I found the machine really enjoyable to program in assembler.  It
was an instruction set that one could really learn completely, down to
knowing exactly what the assembler could generate. It was common practice to
locate the targets of AND instructions in memory so that the AND
instruction was a useful character constant.  Using the "microcoded" operate
instructions, you could generate any of the constants 0,1,2,3,4,6,-1,-2,-3
in a single instruction. One of the more unusual programs written
for the machine was a two-voice music generator that generated
its output on an FM receiver placed close to the front panel of the machine!
Try that on your favorite multi-user, virtual memory computer!

Because of the small page sizes, and the difficulty of generating
relocatable code, I never encountered a really practical compiler for the
PDP-8.  Most so-called "compilers" generated pseudo-code, or virtual machine
code, that was interpreted.

In the realm of exploring the extremes of the price/performance
relationship, DEC built the PDP-8/S, which was a bit-serial implementation
of the PDP-8!

In later years, the PDP-8 line suffered from "creeping featurism." For some
reason, designers added an instruction which would swap the high and
low-order 6 bits of the 12-bit accumulator, called "byte swap."

I've got a lot of old code for the eight, but without a DECtape drive or a
paper tape reader, I can't make too much use of it.  Sigh.
I could run a software emulator for the eight in real time nowadays,
but it would never generate the same pattern of electrical noise
on an FM radio....

Craig Hansen
MIPS Computer Systems
...{decwrl,glacier}!mips!hansen

csg@pyramid.UUCP (Carl S. Gutekunst) (03/06/86)

In article <2900001@ztivax.UUCP> david@ztivax.UUCP writes:
>However, clif, a good question you have implied:  Does anyone do assembler 
>programming on RISC machines?  
>
>For example: Is UNIX on RISC entirely in C?  How do 
>the stack recovery mechanisms work?  Interrupted / failed instructions?  
>CPU mode register control manipulation (if applicable)?

For Pyramid's RISC CPU, we do need to do a little programming in Assembler --
very little, actually. The first few words of all interrupt handlers are
written in assembler, as are a few error handlers. But most of these promptly
call C functions to do the real work. Even the lowest-level I/O can be done
from C. I'd suspect the same is true of other commercial RISC processors.

A microcosmic but interesting comment on RISC comes from our string libc
functions, which some of the local speed freaks have implemented in assembler.
Even with just the peephole optimizer, the only real difference between the
compiler's versions and the optimal hand assembly versions is a handful of
extraneous instructions that get executed once. RISC machines just don't have
any clever kludges in the instruction set for moving bytes faster than what
the compiler can easily figure out for itself. I've generally seen this carry
over to more complex tasks as well. 

Whether this is a Good Thing or a Bad Thing, I'll let the rest of you debate.
I'm one of those good ol' assembler hackers. Programming in assembler on a RISC
machine is extremely boring. But it seems to make the compilers happy. :-)
--
Carl S. Gutekunst   {allegra,cmcl2,decwrl,hplabs,topaz,ut-sally}!pyramid!csg
Pyramid Technology Corp, Mountain View, CA  +1 415 965 7200

jer@peora.UUCP (J. Eric Roskos) (03/06/86)

> A minor point, perhaps, but in fact there is no microprocessor on the
> market that "directly executes BASIC."

I wondered if anybody would question my statement.  Actually, I maintain
that the particular microprocessor I'm thinking of (I forget the part
number; it is some Zilog single-chip microcomputer which Steve Ciarcia
writes a lot of his projects for) does indeed directly execute BASIC.  It
is true that there is a BASIC interpreter in ROM, but when you start up
the computer, it executes this interpreter.

This is no different from a more familiar microprogrammed computer, e.g.
an IBM 370.  Most (if not all) 370s execute an interpreter which
interprets the 370 instruction set; they don't execute it directly in
hardware.

Of course, the Zilog machine has another instruction set in which the
BASIC interpreter is written.  So there's another level of interpretation
involved.  But that's also true of some other microprogrammed machines.
It might be, in fact, that the microprogram for that instruction set is
implemented in a PLA, as a finite state machine, so you get into this area
of fine distinctions if you try to draw the line in terms of what the
"language" looks like in which it's programmed, i.e., when is it "hardware"
vs. "firmware" or "software".

Now, you could argue that the Zilog machine "actually" executes the
simpler instruction set, because you can jump into it from BASIC and have
machine-language programs.  I.e., there's a way to get to the lower-level
machine-language, so that must be the "real" language of the machine.  But
the machine I'm writing this on executes an instruction set on which you
can execute an "ECS" instruction and jump into a user-written *microcode*
routine.  I doubt many people would want to think of the microcode as the
"real" language of this machine, though (though of course you could argue
that it is, just as you could argue that the PLAs in the microprocessor
provide the "real" language for that machine, since that's what the hardware
executes).

So I will have to stand by my position that the language directly executed
by a machine is the language the machine presents to its user when it is
first started, the language it interprets when it first goes out to its
primary memory to begin executing a program. (Note that this is meant as
an informal sort of definition, so if you have another definition of
"directly executed" -- which may well be the case -- then you might
disagree on my choice of definition.  I.e., I'm not using someone else's
definition of the term and trying to apply it to this particular machine.
I'm just explaining the concept I meant in my original posting.  Also, it
*is* obviously just an informal definition, since you could argue
different ways about a machine that supports multiple instruction sets, as
well as a few other things I won't mention since they're minor points.)
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
  US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	    2486 Sand Lake Road, Orlando, FL 32809-7642   LOTD(5)=O
----------------------
Amusing error message explaining reason for some returned mail recently:
> 554 xxxxxx.xxxxxx.ATT.UUCP!xxx... Unknown domain address: Not a typewriter
(The above message is true... only the names have been changed...)

david@ztivax.UUCP (03/06/86)

/* Written  4:27 pm  Feb 24, 1986 by jer@peora in ztivax:net.arch */
>> As far as I'm concerned, the test for RISCness should be: given any piece
>> of source code, is there only one reasonable code sequence which can be
>> output by the compiler?

>How is that a test for "RISCness"?  Following the above scheme, you could,
>for example, have a machine that directly (and optimally) interpreted some
>concise representation of the source code... but that's exactly what people
>are calling a CISC machine, and are claiming is bad...

Sorry jer, but those CISC machines have some super instructions which are
of LIMITED SCOPE, and therefore CANNOT be directly and optimally
substituted for the source code.  Instead you must perform many complicated
what-ifs before deciding whether you can use the instruction more easily than
the low level instructions which can ALWAYS be used.

For example, let's say you were going to do ANYTHING with an intel (aka
Brain Damage (*$Q##!)) 80*8*.  Look thru the book, you'll find some super
instruction.  Only problem is, how many registers must you store (and
where?) to set up for it, and how many do you need to restore?  Sometimes
(if you just happened to have only garbage in the registers), it IS optimal
to do character translations with that XLT (or is it TLX, or #%&*%?).
Often (typically) it is not.

The thing is:  It's got the fancy instructions, and the simple instructions,
and SOMETIMES the fancy ones fit the specific high level functions, and
SOMETIMES they don't.

And that is a pain in the ass.

david smyth
free and proud of it

seismo!unido!ztivax!david

david@ztivax.UUCP (03/06/86)

/* Written  3:11 pm  Feb 27, 1986 by jer@peora in ztivax:net.arch */
>> ... Unfortunately, NO ONE has yet succeeded
>> in developing an instruction set which can 'directly (and optimally)'
>> interpret the semantics of the source code.

>Now, this is exactly the point at which the RISC/CISC debate degenerates into
>the sort of assertions more commonly found in advertising than in research.
>The fact that "NO ONE has yet succeeded" in no way demonstrates or proves
>that it can't be done; this is a simple fallacy!  

Sorry, but that is the way it is.  No, I am not flaming.

Software is never bug-free unless it is small and concise enough to be
proven.  For example, the inner loop of a parser can be proven, as it
only has about 40 lines of code and 3-5 branch points.  The grammar
can be proven, because the finite number of states is small.

Typical software has bugs because it's so damn complex.  No one will
succeed in making a CISC which CAN interpret a "high level language"
without SIGNIFICANT BUGS because it CAN NOT BE DONE without a
SIGNIFICANT breakthrough in mathematics and computer science.

Just ask Intel.  Whatever happened to the 432 (Ada in Silicon)?
Doesn't work, for the same reason all Ada compilers have bugs.  Too
big, too complex.  We (mankind, computer scientists, etc) do not have
the technology.  It is not on the horizon.

(Well, maybe I have an idea... soon I may publish...)

David Smyth
Free and proud of it

seismo!unido!ztivax!david

kludge@gitpyr.UUCP (Scott Dorsey) (03/07/86)

In article <5100016@ccvaxa> aglew@ccvaxa.UUCP writes:
>4 bit constants hold only 20% of the offsets you're going to want, without
>scaling, and only 30% with. These numbers are preliminary results of a
>study I'm making. The distribution of the size of addressing constants
>falls off roughly exponentially, with a 50% mark at 6 bits, and an 85% mark
>at 8 bits.
>
>What I'm trying to say is, you'll get much much better results if you provide
>just a few more bits in the constant.

     Personally, as an assembly hacker, I'd love a larger constant.  The
question is whether the increase in power from a larger field is worth
its cost (a decrease in the number of other instructions able to be
decoded.  No flames from RISC advocates, please).
An 8-bit constant would certainly be valuable, especially for character 
manipulation work.  Unless you already had hardware onboard to break up
5/6/7 bit fields, these would be hard to work with, however.  Would it
be worth the extra stuff?  Would it be better to just design the 5/6/7
bit decoding into the instruction set properly?  I don't know... but
I enjoy the 32032, and would give up many things to have even a four 
bit field on the machines I am programming today.

------
Disclaimer: Everything I say is probably a trademark of someone.  But
            don't worry, I probably don't know what I'm talking about.

Scott Dorsey
Kaptain_kludge
ICS Programming Lab, Rich 110,
Georgia Institute of Technology, Atlanta, Georgia 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge

USnail:  Box 36681, Atlanta GA. 30332

throopw@dg_rtp.UUCP (03/07/86)

>> [discussion of self-modifying code for conditional branching]

> [discussion of effect of self-modification on address space]
> So to rectify the situation when you fork, you bring in
> a clean copy of the executable.

Uh... I'm not sure what "fork" does on whatever OS *you're* talking
about, but in Unix(*), the fork(2) system call does *not* (repeat *not*)
bring in a "clean copy" of the executable.  The address space
*as* *modified* *by* *the* *parent* must be seen by the child.  On the
other hand, it is clear that memory containing modifiable bits, code or
not, cannot be in the "text" or "shareable" segment (unless special
fault-driven mechanisms are available for revocation of sharing).
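
A minimal demonstration in the usual Unix terms (a sketch; the variable
name is invented):

#include <stdio.h>

int flag = 0;	/* initialized data: 0 in the executable on disk */

main()
{
	flag = 1;		/* parent modifies its address space */
	if (fork() == 0) {
		/* the child sees the parent's modification, not a
		   "clean copy" from disk: this prints 1, not 0    */
		printf("child sees flag = %d\n", flag);
		exit(0);
	}
	wait((int *)0);
	exit(0);
}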

> You mentioned a copy-on-write scheme. Maybe on a machine that allows a
> physical page to be mapped to more than one virtual page this could be
> done right, but on a machine like the latest IBM RISC, where a physical
> page can only be mapped to one virtual page (this saves either a pipeline
> stage or gobs of hardware for determining whether you hit or not), you
> are forced to copy the executable into a new segment.

Interesting.  The IBM risc can't do shared text segments?  How odd.
Or perhaps by "a physical page can only be mapped to one virtual page"
you mean "all virtual pages that are mapped to a given physical page
must have the same virtual address"?  This latter interpretation would
allow shared text, while the original doesn't seem to.

> But the above really doesn't get at the issue.  If performance is a
> goal, the hardware is going to be slower (or more expensive) if it has
> to deal with self-modifying code even if the software can handle it.

I agree that this is so for *some* classes of hardware, and that these
classes of hardware include most widely-known-to-be-practical
high-performance designs.  Thus, though I think that some of what you
said in getting to your conclusion is inaccurate, I agree with your
conclusion.

> eric h. jensen    ehj@angband    ...!decvax!decwrl!mordor!angband!ehj
--
* Unix is a trademark of ATT
-- 
Wayne Throop at Data General, RTP, NC
<the-known-world>!mcnc!rti-sel!dg_rtp!throopw

debray@sbcs.UUCP (03/07/86)

>                                       ...... What usually happens is
> that the 'high level' instruction is not as efficient as a sequence or
> loop of simpler instructions - or it is only efficient for a subset of
> its possible uses. 

The problem is mostly that it's harder to provide efficient implementations
for "High Level" instructions than it is for simple, "low level" ones.
Some time back, when working on a Prolog machine emulator on a Vax 780, we
found that it was _significantly less efficient_ to use a single
"Test, Set and Branch" instruction than it was to use three separate
instructions to test, set and branch respectively.  This seemed very
counterintuitive until we realized that instructions with 2 operands or
fewer had the operands decoded in hardware, while instructions with more
than 2 operands (e.g. the "test-set-branch" instruction) had to go to
microcode for this.  I guess the extra instruction fetches for the case
using three separate instructions overlapped sufficiently due to pipelining
to give the hardware decoding an overwhelming edge.

On the other hand, I think that if efficient implementations could be
provided for some of these "High Level" instructions, it wouldn't be a bad
thing entirely.  In the same Prolog machine emulator, we looked at the
instruction sequences generated for various programs and coalesced a couple
of the more common instruction sequences to form "macro-instructions".  The
performance improvement from this was not unimpressive.  As far as generating
these complex instructions went, our compiler simply generated the old,
dumb code to begin with, then converted to the smarter code in one pass of
peephole optimization.
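
The coalescing pass itself can be quite simple; a sketch in C, with the
instruction list, opcodes, and the particular pair all invented for
illustration:

struct insn { int op; struct insn *next; };

#define OP_TEST		1
#define OP_SETBR	2
#define OP_TSB		3	/* the coalesced macro-instruction */

void coalesce(p)
struct insn *p;
{
	for (; p != 0 && p->next != 0; p = p->next)
		if (p->op == OP_TEST && p->next->op == OP_SETBR) {
			p->op = OP_TSB;			/* rewrite in place  */
			p->next = p->next->next;	/* unlink the second */
		}
}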
-- 
Saumya Debray
SUNY at Stony Brook

	uucp: {allegra, hocsd, philabs, ogcvax} !sbcs!debray
	arpa: debray%suny-sb.csnet@csnet-relay.arpa
	CSNet: debray@sbcs.csnet

mash@mips.UUCP (03/08/86)

David Smyth writes:
> However, clif, a good question you have implied:  Does anyone do assembler 
> programming on RISC machines?  
> 
> For example: Is UNIX on RISC entirely in C?  How do 
> the stack recovery mechanisms work?  Interrupted / failed instructions?  
> CPU mode register control manipulation (if applicable)?

1) Does anyone do assembler on RISC machines?
	Yes.  For what it's worth, a well-designed RISC is relatively
	easy to program in assembler when that makes sense.  After all,
	in those cases, what you're doing is writing microcode (more vertical
	than horizontal, but with some idea of overlap).
2) Is UNIX on RISC entirely in C?
	No.  On almost any machine, there are bound to be instructions that
	are hard to get to from C.  Also, there are a few time-critical
	pieces of code where the programmer knows attributes of the problem
	that no compiler can reasonably deduce.  Example: for string & block
	copy loops, there is usually a tradeoff between a faster loop with
	longer setup time, and a slower loop that gets started quicker.
	It's very hard for a compiler to guess the distribution.
3) Thus, there is still room for selected assembler code.  However, from
experience with assembler (on S/360s, PDP-11s, VAXen, 68K's, and our R2000),
in most cases I'd rather have fast instructions that are easy to use and
give me more access to the machine's pipeline, than most complex ones.
Most memory-intensive multi-cycle instructions (like string instructions)
run just as fast when hand-coded, because they're memory limited anyway;
a lot of them don't really match what you need in some cases: for example,
look at all of the mem* and str* routines in System V, and see how well
the string operations of most machines do them.  About the only win for
complex instructions is when they support a datatype (like floating point)
that is both frequently used in the application at hand, and difficult
to simulate cheaply.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aglew@ccvaxa.UUCP (03/09/86)

> /* Written  7:22 am  Mar  6, 1986 by jer@peora.UUCP in ccvaxa:net.arch */
> It might be, in fact, that the microprogram for that instruction set is
> implemented in a PLA, as a finite state machine, so you get into this area
                   ^^^
> of fine distinctions if you try to draw the line in terms of what the
> "language" looks like in which it's programmed, i.e., when is it "hardware"
> vs. "firmware" or "software".

Just thought that you might be amused to try macroprogramming in a PLA.
I implemented the Quine-McCluskey PLA compression algorithm, with some tree
search pruning techniques to speed things up, and was looking around for
things to test it on. You never have a large FSA when you need one.
For a lark, I applied it to a whole slew of Z80 assembler programs
(actually, the machine code) that somebody had left lying around on my
system: 30% compression. Of course, it took a week to do it.

mahar@fear.UUCP (mahar) (03/10/86)

> can get away with only one address. To wit:
> 
> 	SUBS X	Subtract contents of location X from accumulator and
> 		store results in accumulator and in X
> 
> 	JUMP X	Indirect jump; i.e. PC <= contents of location X
> 
> Proving this beast to be functionally complete is arduous but do-able.
> We use it as an exercise in our Computer Architecture class. The machine is
> called the TWINC, for TWo INstruction Computer. We have surmised the existence
> of an OINC, but have not successfully proved functional completeness (the
> conditional branch is usually the problem). You describe OINC with two addr. -
> can you do it with only one?
Yes. The instruction is reverse subtract and skip if borrow. There is no opcode
field, only the address field. The instruction subtracts the accumulator from
the memory operand. The result is stored in both the accumulator and the
memory operand, and the next instruction is skipped if a borrow occurred.
Memory location 0 is the PC.
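
The whole machine fits in a few lines of C -- a sketch, with the word
size, memory size, and lack of any halt convention all assumptions of the
illustration:

unsigned short mem[4096];	/* mem[0] is the PC */

void run()
{
	unsigned short acc = 0;
	unsigned short x;
	long r;

	for (;;) {
		x = mem[mem[0]++];		/* fetch the one address */
		r = (long)mem[x] - (long)acc;	/* reverse subtract      */
		acc = mem[x] = (unsigned short)r;
		if (r < 0)			/* borrow: skip next     */
			mem[0]++;
	}
}

Jumps fall out for free: since location 0 is the PC, an instruction whose
address field is 0 operates on the PC itself.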

A friend of mine built one of these out of TTL. He called it URISC for
Ultimate Reduced Instruction Set Computer. It uses about 8 standard 7400
series TTL packages. In this implementation the PC went backwards but that
is not necessary. He wrote a macro assembler which implemented a usable
set of instructions.

I know this design is functionally complete. I don't have a description
with me, so I can't prove it now. If you're really interested I can mail
you a list of the macro package.
-- 

	Mike Mahar
	UUCP: {turtlevax, cae780}!weitek!mahar

	Disclaimer: The above opinions are, in fact, not opinions.
	They are facts.

dalel@athena.UUCP (Dale Lehmann) (03/10/86)

------------
In article <1441@wucec2.UUCP> jdz@wucec2.UUCP (Jason D. Zions) writes:
>>
>> The TRUE RISC would have only one instruction: subtract....
>
>Not quite. You need two addresses, you see. If we go to two instructions, we
>can get away with only one address. To wit:
>
>	SUBS X	Subtract contents of location X from accumulator and
>		store results in accumulator and in X
>
>	JUMP X	Indirect jump; i.e. PC <= contents of location X
>
>Proving this beast to be functionally complete is arduous but do-able.
>We use it as an exercise in our Computer Architecture class....

When I was working as a lab instructor for a computer class at
Oregon State University, two of the students in the class did
something similar to this.  They had run across the claim that
the minimum number of instructions a computer needed was
"subtract one" and "branch on negative".  They proceeded to
write a simulator for the PDP-8 instruction set using only two
instructions: ISZ (increment memory, skip if result is zero)
and JMP.  Talk about your extra-credit projects!

--
Dale Lehmann
Tektronix, Inc.
UUCP: ...!tektronix!teklds!dalel
USMail: P.O. Box 4600, Beaverton, Oregon  97075

ludemann@ubc-cs.UUCP (Peter Ludemann) (03/10/86)

In article <1163@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>In article <169@ubc-cs.UUCP> ludemann@ubc-cs.UUCP (Peter Ludemann) writes:
>>As far as I'm concerned, the test for RISCness should be: given any
>>piece of source code, is there only one reasonable code sequence which can be
>>output by the compiler?
>
>This is not a possible objective.  Consider sequences like:
>
>   A = B[I];
>   C = D[I];
>
>On any machine which has registers, it is possible ...
>	(description of optimisations possible)

You miss my point.  The optimisations described (in this case common 
subexpression detection) are possible *regardless* of the target 
machine's architecture.  Any good compiler should try to optimise 
code like this.  However, once these "source level" and flow analysis 
optimisations are finished, there should be only one reasonable way 
of generating the machine code.  Otherwise, the poor compiler writer 
has to do another set of optimisation analyses, all because of the 
target machine's instruction set.  

As an aside, the MVCL (move character long) instruction on the 
IBM/370 machines was actually slower than an equivalent loop which 
used a simple block move instruction.  Although later models improved 
the situation, it is interesting to note that IBM's "optimizing" PL/1 
compiler still generates a subroutine which does *exactly* what a 
MVCL instruction does (even with the same register usage 
conventions).  
-- 
-- Peter Ludemann
	ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann)
	ludemann@cs.ubc.cdn  (ludemann@cs.ubc.cdn@ubc.mailnet)
	ludemann@ubc.csnet   (ludemann%ubc.csnet@CSNET-RELAY.ARPA)

paulr@decwrl.DEC.COM (Paul Richardson) (03/10/86)

In article <90@sbcs.UUCP> debray@sbcs.UUCP (Saumya Debray) writes:
>Some time back, when working on a Prolog machine emulator on a Vax 780, we
>found that it was _significantly less efficient_ to use a single
>"Test, Set and Branch" instruction than it was to use three separate
>instructions to test, set and branch respectively.  This seemed very
>counterintuitive until we realized that instructions with 2 operands or
>fewer had the operands decoded in hardware, while instructions with more
>than 2 operands (e.g the "test-set-branch" instruction) had to go to
>microcode for this.  I guess the extra instruction fetches for the case
>using three separate instructions overlapped sufficiently due to pipelining
>to give the hardware decoding an overwhelming edge.

While I cannot profess knowledge of the VAX 11/780 microcode, I do
believe that the 780 is NOT pipelined.  I stand to be corrected, but I am
fairly sure that the 8600 was/is the first pipelined implementation
of the VAX architecture.

P.S. - I could be wrong on the above; even DEC employees don't always know
what is going on.

jer@peora.UUCP (J. Eric Roskos) (03/10/86)

> Sorry jer, but those CISC machines have some super instructions which are
> of LIMITED SCOPE, and therefore CANNOT be directly and optimally
> substituted for the source code.

You missed my point.  It would be absurd to go backwards and try to show
that there is a mapping from source code into any particular existing CISC
instruction.

Few people familiar with the issues think much of the existing CISC
instructions that are usually cited as examples of bad CISC design.
Certainly I would never cite the Intel instructions you
listed as good CISC instructions (but then, I always considered the
Motorola 6800 as a most-elegant small instruction set design).  Actually
a lot of the research on instruction set design has been in the area of
providing mechanisms for determining how to create better instruction sets,
or in providing mechanisms to create new instruction sets automatically.

Actually, it's even somewhat strange that I always seem to argue here on
the side of CISC machines, since I generally favor the use of small
classes of powerful primitive operations, whether it's for instruction
sets or operating systems, and tend more often to disagree with people in
the "real world" because I think their designs are too complex, not too
simple.

Rather, I tend to feel that the RISC philosophy for many people involves a
concept of a "volkscomputer" -- a People's Machine by which those who
dedicate their efforts to difficult-to-understand research can be maligned.
Thus I tend to disagree at the tone in which the RISC approach is supported,
rather than with the results of the research.  On the other hand, you'll
note that I rarely disagree with people (such as mashey@mips) who seem to
understand the issues and state their arguments in a reasoned and
verbally unbiased tone.  (In fact, I often agree with them.)
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
  US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	    2486 Sand Lake Road, Orlando, FL 32809-7642
----------------------
Obviously if I supplied LOTD(7) you'd know what was going to happen next,
wouldn't you?

mash@mips.UUCP (John Mashey) (03/11/86)

Peter Ludemann 	ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann) writes:
> In article <1163@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
> >In article <169@ubc-cs.UUCP> ludemann@ubc-cs.UUCP (Peter Ludemann) writes:
> >>As far as I'm concerned, the test for RISCness should be: given any
> >>piece of source code, is there only one reasonable code sequence which can be
> >>output by the compiler?
> >
> >This is not a possible objective.  Consider sequences like:
> >	(description of optimisations possible)
> 
> You miss my point.  The optimisations described (in this case common 
> expression detection) are possible *regardless* of the target 
> machine's architecture.  Any good compiler should try to optimise 
> code like this.  However, once these "source level" and flow analysis 
> optimisations are finished, there should be only one reasonable way 
> of generating the machine code.  Otherwise, the poor compiler writer 
> has to do another set of optimisation analyses, all because of the 
> target machine's instruction set.  

One more time: I urge people not to propose metrics that are exceedingly
difficult to measure or comprehend or even define.  As I said before,
I'm not sure "RISCness" is something one necessarily wants to measure
as a goal, in place of things that one really cares about like performance,
price/performance, code size, etc, etc.  However, if one wants to measure
it, for whatever reason, PLEASE 
a) Use cycles/MIPS (normalized to some common unit of work), i.e., try:
	cycles/MIPS = (clock rate in Mhz) / (MIPS rating)
so that, for example, if you think a VAX 11/780 with UNIX is a 1 Mips cpu,
at 5Mhz, it gets a 5.  If you think a good 68020 implementation is
twice as fast, then it gets 16.67/2 = 8.33, etc.  [Notice that this
isn't cycles/instruction, but cycles/equivalent work.]
OR
b) Propose a different metric that has a reasonable way to measure it.
Is it unreasonable to ask for an algorithm to compute the metric?
[I suggest that the "only one way to do it" test is, in general,
mechanically undecidable.]

(Note: Peter's comment does get at something that usually happens,
which is that simpler machines have fewer obviously different code-selection
possibilities at some level or other, but even then, there are almost
always time-space tradeoffs to be made, which allow for different code
sequences for even the RISCiest of machines.)
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rb@ccivax.UUCP (03/11/86)

In article <1161@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>In article <187@anwar.UUCP> mlopes@anwar.UUCP (Marco Lopes) writes:
>>In my earlier posting, I assumed the existence of absolute addressing.
>>Therefore, a minimum set of addressing modes is:
>>		register
>>		register indirect
>>		absolute
>>Another functionally complete set would have immediate instead of absolute.
>You should seriously consider using immediate instead of absolute.
For any who might remember the RCA 1702 COSMAC VIP computer, there was
a strange way to get immediate addressing.  The VIP would let you switch
which register would work as a program counter and when that register
was used for instruction fetches or anything else, it would cause the
register to be incremented.  If R7 was your PC register, you could say
MOV R6,(R7), and the result was the same as a move immediate instruction,
because R7 would be incremented automatically.
This may have been a bug, but it was a useful bug (all instructions
had to be a uniform number of cycles, and immediate mode took too many
cycles).  Actually, the 1702 had a lot of "cute" features; it would be
interesting to see a 32-bit version.  Does anybody familiar with this
chip consider it RISC?  It was certainly a good learning tool.
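
Something like this toy fetch loop shows the effect (the opcode and
memory layout here are invented for illustration, not real COSMAC
encodings):

	#include <stdio.h>

	int main(void) {
		unsigned char mem[] = { 0x10, 42, 0x00 }; /* "load via PC", literal, halt */
		unsigned char R[8] = { 0 };               /* R7 serves as the program counter */
		for (;;) {
			unsigned char op = mem[R[7]++];       /* instruction fetch bumps the PC */
			if (op != 0x10) break;                /* anything else: halt */
			R[6] = mem[R[7]++];                   /* MOV R6,(R7) reads the next byte */
		}
		printf("R6 = %d\n", R[6]);                /* 42 -- an immediate, for free */
		return 0;
	}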

rb@ccivax.UUCP (03/11/86)

In the previous article I kept referring to the 1702; I meant the 1802.
Sorry about the confusion.  There is no 1702 CPU.

stevev@tekchips.UUCP (03/12/86)

> You miss my point.  The optimisations described (in this case common 
> expression detection) are possible *regardless* of the target 
> machine's architecture.  Any good compiler should try to optimise 
> code like this.  However, once these "source level" and flow analysis 
> optimisations are finished, there should be only one reasonable way 
> of generating the machine code.  Otherwise, the poor compiler writer 
> has to do another set of optimisation analyses, all because of the 
> target machine's instruction set.  

Even though the correctness of an optimization is machine-independent,
its effect on the quality of code is very machine-dependent.  The selection
of the ``best'' source-level optimizations will differ from machine to
machine, even for RISCs.  Consider common-subexpression elimination.
Sure, you can perform the optimization at source level, but for some
machines it might be cheaper to redo the computation rather than to
save the result, because saving the result ties up a register, which can
make register allocation poorer; in some cases, redoing the computation
could even be free on RISC machines if it can be done in an instruction
slot that would otherwise be a delayed-branch-induced NOP.

The point is that architectural attributes such as *the number of available
registers* and *delayed branching* can affect which optimizations you should
(or should not) do at the source level.
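
A tiny C-level illustration (which version a compiler should prefer
depends on exactly these machine attributes):

	/* CSE ties up a register holding t; recomputing may cost nothing
	   if the second add can fill an otherwise-wasted delay slot. */
	int with_cse(int a, int b)  { int t = a + b; return t * t; }
	int recompute(int a, int b) { return (a + b) * (a + b); }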

		Steve Vegdahl
		Computer Research Lab.
		Tektronix, Inc.
		Beaverton, Oregon

ron@brl-smoke.UUCP (03/13/86)

> >> The TRUE RISC would have only one instruction: subtract....
> >
> >Not quite. You need two addresses, you see. If we go to two instructions, we

> "subtract one" and "branch on negative".  They proceeded to

Somewhere (UTEXAS?) someone designed a MOVE PROCESSOR, which had only
move-from-memory-to-memory instructions.  To do math you fed the
operands to an ALU's memory addresses and pulled the result back from
another.  The PC was just another address.  Sort of like having to
program at the microcode level all the time.
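
A toy sketch of the idea (the "addresses" here are my own invention,
not the actual design's layout):

	#include <stdio.h>

	/* The only operation is "move"; the ALU is just magic locations. */
	static int alu_a, alu_b;                           /* operand "addresses" */
	static int alu_sum(void) { return alu_a + alu_b; } /* result "address" */

	int main(void) {
		alu_a = 3;            /* MOVE 3 -> ALU operand A */
		alu_b = 4;            /* MOVE 4 -> ALU operand B */
		int r = alu_sum();    /* MOVE ALU sum -> r */
		printf("%d\n", r);    /* 7 */
		return 0;
	}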

-Ron

greg@ncr-sd.UUCP (Greg Noel) (03/14/86)

In article <389@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>	cycles/MIPS = (clock rate in Mhz) / (MIPS rating)
>so that, for example, if you think a VAX 11/780 with UNIX is a 1 Mips cpu,
>at 5Mhz, it gets a 5.  If you think a good 68020 implementation is
>twice as fast, then it gets 16.67/2 = 8.33, etc.  [Notice that this
>isn't cycles/instruction, but cycles/equivalent work.]

I'm curious -- are these off-the-top-of-the-head numbers to illustrate the
point, or do they approximately reflect reality?  If the latter, are there
any published results?
-- 
-- Greg Noel, NCR Rancho Bernardo    Greg@ncr-sd.UUCP or Greg@nosc.ARPA

mash@mips.UUCP (03/15/86)

Greg Noel (Greg@ncr-sd.UUCP or Greg@nosc.ARPA) writes:
> In article <389@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
> >	cycles/MIPS = (clock rate in Mhz) / (MIPS rating)
> >so that, for example, if you think a VAX 11/780 with UNIX is a 1 Mips cpu,
> >at 5Mhz, it gets a 5.  If you think a good 68020 implementation is
> >twice as fast, then it gets 16.67/2 = 8.33, etc.  [Notice that this
> >isn't cycles/instruction, but cycles/equivalent work.]
> 
> I'm curious -- are these off-the-top-of-the-head numbers to illustrate the
> point, or do they approximately reflect reality?  If the latter, are there
> any published results?

1) They're not off-the-top-of-the-head numbers, but do reflect reality,
given enough caveats and specifications, always remembering that:
	a) Reducing CPU performance to 1 number is always hard.  This
	doesn't ever stop anybody in this business from trying.
	b) One must always include software: thus, a more complete
	description of the above might be:
	".. if you think a VAX 11/780 with 4.2BSD runs user programs like
	yacc, grep, diff, etc..., compiled with the regular C compiler,
	and runs them at a rate we'll call 1 Mips, at 5 Mhz, it gets a 5.
	If you think a 68020 (SUN 3/160, running their currently-released
	software) is twice as fast.... it gets 8.33."

2) SUN has published a bunch of comparisons as part of their SUN-3
announcement.  I think Convergent has published theirs for the MightyFrame,
and Altos for their 3068, and .....

3) I personally believe these numbers are probably good within +/- 20%.
What I see personally, when I look [which is not real often or very
carefully], is that the SUN is 2X or a little more for some things,
especially little benchmarks, and is about 1.8X for many tasks.

4) As I've observed before here, IT IS VERY IMPORTANT TO COMPARE ACTUAL
MACHINE PERFORMANCE AND NOT GET CONFUSED BY MIPS RATINGS.  In particular,
many people do call a 780 a 1Mips (or thereabouts) machine, but it is known
that for some fairly substantial runs (user and kernel), the 780 averages
10.6 (200ns) cycles / instruction, i.e., it does .5M VAX instructions/sec,
including all of the degradations for memory, TLB misses, pipeline bubbles
for branches, etc, etc. [There was a good 1984 paper by Clark and Emer of DEC on
this.]  Thus, it would appear that a  machine that does 1 Million VAX
instructions/sec (i.e., a bit faster than a 785) does an amount of work
like what people usually call a 2Mips machine.  Thus, a machine that
executes 1M of its own instructions/sec to do the same work as the 780's
.5M VAX instructions can also be called a 1Mips machine,
since people call the VAX a 1Mips cpu.  Thus, there would seem to be
twice as many, but "twice as simple" (whatever that means) instructions.
So, in some sense, the instruction within the mythical "1Mips" is probably
more like a RISC instruction than a VAX (CISC) instruction!
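
In code, that paragraph's arithmetic (using the Clark and Emer figures
cited above):

	#include <stdio.h>

	int main(void) {
		double mhz = 5.0;     /* 780: 200ns cycle */
		double cpi = 10.6;    /* measured cycles per VAX instruction */
		printf("%.2f M VAX instr/sec\n", mhz / cpi);  /* ~0.47, i.e. ~.5M */
		return 0;
	}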

Again, what counts is having a scale that people understand and on
which systems can be compared consistently. [At MIPS, we just always
call the given flavor of 780 == 1 and be done with it.]
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tim@amdcad.UUCP (Tim Olson) (03/17/86)

In article <460@ccivax.UUCP>, rb@ccivax.UUCP (rex ballard) writes:
> For any who might remember the RCA 1702 COSMAC VIP computer, there was
> a strange way to get immediate addressing.  The VIP would let you switch
> which register would work as a program counter and when that register
> was used for instruction fetches or anything else, it would cause the
> register to be incremented.  If R7 was your PC register, you could say
> MOV R6,(R7), and the result was the same as a move immediate instruction,
> because R7 would be incremented automatically.
> This may have been a bug, but it was a useful bug (all instructions
> had to be a uniform number of cycles, and immediate mode took too many
> cycles).  Actually, the 1702 had a lot of "cute" features; it would be
> interesting to see a 32-bit version.  Does anybody familiar with this
> chip consider it RISC?  It was certainly a good learning tool.

	I don't remember a 1702, but it sounds much like the RCA 1802.  This
processor had sixteen 16-bit general purpose registers, which could also be
used as thirty-two 8-bit registers.  However, except for INC and DEC, all
operations (including moves) had to be performed 8 bits at a time using the
accumulator.  Two 4-bit registers, P and X, pointed to the general-purpose
registers to be used for the program counter and index register, respectively.

	There were a few things that made the 1802 sort of "RISC-like".  One
was the large number of general-purpose registers available.  Another was the
small number of addressing modes: loads and stores were either register
indirect or implied ("double-indirect" through the 4-bit X register), and all
arithmetic and logical operations were performed using the accumulator (D) and
an implied (double-indirect through X) memory operand.  Note that the registers
could NOT be used as operands in arithmetic or logical operations! :-( 

	Another interesting feature was the absence of a subroutine call
instruction.  Instead, coroutines were implemented, each with a separate
program counter.  To call a coroutine, the P register was simply changed to
point to the new program counter register.  A subroutine was a coroutine that
ran to completion, then jumped to an SEP (set P register) instruction just
before the subroutine entry point.  This instruction "returned" to the calling
procedure, incrementing the subroutine's program counter to point to the entry
point again.  This coroutine method also made for fast interrupt handling.
The interrupt handler had a dedicated program counter (register 1).  When an
interrupt occurred, the 4-bit P and X registers were copied into an 8-bit 
T register, then P was forced to 1.  A return instruction copied back the P and
X registers.  In both these cases (subroutines and interrupts), only the basic
primitives were supplied; the user could build the required subroutine-call or
interrupt convention from these building blocks.  This approach of building
complex instructions from simple primitives can also be viewed as
"RISC-like".

	All of these "double-indirect" operations had a cost, however.  Most
instructions took 16 clock cycles to execute (8 for instruction fetch, 8 for
execution), and a few took 24.  To build a competitive 32-bit, single-cycle,
pipelined RISC machine using this architecture would require a lot of
complexity, especially in the register file, which would have to be a 7-port,
4-read, 3-write structure!  The 1802 certainly had some interesting concepts
for its time, however.

	Tim Olson
	Advanced Micro Devices
	ihnp4!amdcad!tim

fwb@siemens.UUCP (03/17/86)

/* Written 11:12 am  Mar 11, 1986 by rb@ccivax in siemens:net.arch */
/* ---------- "Re: Addressing modes (1702)" ---------- */

> For any who might remember the RCA 1702 COSMAC VIP computer, there was
> a strange way to get immediate addressing.

Actually, the CHIP was the 1802.  I wrote, and helped to write, many K of
code for this when I worked at RCA Laboratories.  The native instruction set
was miserable.  I wrote a cross assembler with a Data General macro
assembler.  This let me add macro instructions to do some nice things like
IF-THEN-ELSE, DO-WHILE, and 16-bit LOAD and STORE.  A little program took
the object module created by the DG assembler, and converted it to Intel HEX
format for a prom burner.  I rewrote these macros in Tek Development system
macro assembler for use by other groups in the Laboratories.

> If R7 was your PC register, you could say
> MOV R6,(R7), and the result was the same as a move immediate instruction,
> because R7 would be incremented automatically.
> This may have been a bug, but it was a useful bug

This was not a bug.  It was a feature.  Really!  The designers designed it
as an immediate mode.

> (all instructions
> had to be a uniform number of cycles, and immediate mode took too many
> cycles).

Yes, there were 8 (or was it 16?) clocks per cpu cycle.  "Why?" you ask.
Well, I did too.  The answer: The ALU was one bit wide!  The next
instruction was decoded while the computation was completed on the last one.
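
A quick sketch of what a one-bit ALU means in practice -- one result
bit per clock, so eight clocks for an 8-bit add:

	#include <stdio.h>

	unsigned char add8(unsigned char a, unsigned char b) {
		unsigned char sum = 0, carry = 0;
		for (int clk = 0; clk < 8; clk++) {   /* one bit position per clock */
			unsigned char x = (a >> clk) & 1, y = (b >> clk) & 1;
			sum   |= (unsigned char)((x ^ y ^ carry) << clk);
			carry  = (x & y) | (carry & (x ^ y));
		}
		return sum;
	}

	int main(void) { printf("%d\n", add8(100, 55)); return 0; }  /* 155 */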

> Actually, the 1702 had a lot of "cute" features; it would be
> interesting to see a 32-bit version.

Forget about a 32-bit version.  I doubt that RCA (or the new GERCA) is going
to make a big version based on this crummy architecture.  One of the big
problems with it was the D register.  This was an 8-bit register which sat
between memory and the 16 16-bit "scratchpad" register file.  All memory
references, arithmetic instructions, and I/O instructions went through this
bottleneck.  One of the things my macro instructions did was to hide this
register as much as possible.  It's too bad that I could not make it go away
completely.

> Does anybody familiar with this
> chip consider it RISC?  It was certainly a good learning tool.

I guess you could say that it is a RISC if you consider the PDP-8 to be a
RISC.  It was an 8-bit CMOS micro which came out before the 8080.  Silicon
real estate was very expensive, so the designers had to make some
compromises in the instruction set and operation.

I hope you learned something useful. :-)

-----------------------------------------------------
Frederic W. Brehm       (ihnp4!princeton!siemens!fwb)
Siemens Research and Technology Laboratories
105 College Road East
Princeton, NJ 08540
(609) 734-3336

radzy@calma.UUCP (Tim Radzykewycz) (03/18/86)

In article <460@ccivax.UUCP> rb@ccivax.UUCP (What's in a name ?) writes:
>For any who might remember the RCA 1702 COSMAC VIP computer, there was
>a strange way to get immediate addressing.  The VIP would let you switch
>which register would work as a program counter and when that register
>was used for instruction fetches or anything else, it would cause the
>register to be incremented.  If R7 was your PC register, you could say
>MOV R6,(R7), and the result was the same as a move immediate instruction,
>because R7 would be incremented automatically.

Actually, the fact that the 1702 (and 1802 for that matter) allowed
you to change which register was used as the PC is irrelevant for this.

The PDP-11 used the same method, although the assembler was smart enough
to turn instructions like

MOV #47,R2

into the word-sequence

MOV (PC)+,R2
47

which is exactly what was described above for the 1702.  Immediate
addressing is by definition just a form of PC-relative addressing, although
it might be implemented as a special case.
-- 
Tim (radzy) Radzykewycz, The Incredible Radical Cabbage.
	ARPA:	calma!radzy@ucbvax.berkeley.edu
	GEnet:	cc5vxa::cc5mxd::calma::radzy
	UUCP:	{ucbvax,sun,csd-gould}!calma!radzy
	VOICE:	"Hey! Radzy!"

earle@smeagol.UUCP (Greg Earle) (03/18/86)

In article <10693@amdcad.UUCP>, tim@amdcad.UUCP (Tim Olson) writes:
> ...  The 1802 certainly had some interesting concepts
> for its time, however.

Interesting that Tim refers to the 1802 in the past tense ... Here at The
Land That Time Forgot (erm, JPL that is), you folks out there might be
interested to know that both Galileo, and the not-to-fly-till-at-least-1989
Venus Radar Mapper mission (Magellan) are *both* using rad-hardened
1802's as the main CPU in the Command Data Subsystem.  There are Flight
Software people here even now huddled over their RCA COSMAC development
systems ... The people doing test software have to upload Hex'ed binary
files from the COSMAC to an IBM PC, run them through a filter program,
upload again to a Sun-2 which is the main front end to the Lab Test Set (LTS)
which finally uploads it into a breadboard version (in VRM's case) of the
CDS ...  Flight Software may get all the glory, but gimme my leading-edge
Suns any day.  Anyhow, next time you reminisce about the 1802, shed a tear
for those people for whom the past is not as far back as you think :-)

"Forward, into the Past!" (Overheard at a JPL Flight Software meeting :-)


-- 
				::::::\:::::::::
				::::'  \:  `::::
				::'   /:\    `::
Anarchy, Peace, and Freedom	::   / ::\    ::	Devastate to Liberate
				::  /_____\   ::
				::./:' :: `\..::
				::/:.  ::  .\:::
				:::::::::::::\::

	Greg Earle		sdcrdcf!smeagol!earle (UUCP)
	JPL 			ia-sun2!smeagol!earle@csvax.caltech.edu (ARPA)

king@kestrel.ARPA (Dick King) (03/21/86)

I'm sorry to bug the whole net, but nobody around here knows how to
mail to UUCP from ARPA.  [if anyone knows how, pls tell me.]

mahar@fear.UUCP, Please send me the macro package for the URISC.
Thanks...

-dick

cej@well.UUCP (Craig Jackson) (03/22/86)

I think that the source of some of this confusion might be the origin of the
'mip' term.  I believe it first arose in the IBM mainframe world.  Data
center types love to talk in 'mips' all day, because they know little else.

Anyway, when the 780 first came out, it was found to have roughly the same
computing power as what IBM was then calling a 1-mip processor (370/145, as I
remember).  Since a VAX is significantly more CISC than a 370 (3-address
instructions, etc.), that worked out to fewer than 1 million VAX instructions
per second.

-- 
Craig Jackson
UUCP: {ihnp4!linus,seismo!harvard}!axiom!mhis!dricej
      {dual,ptsfa,lll-crg,hplabs}!well!cej
BIX:  cjackson