jeff1@garfield.UUCP (02/11/86)
We are currently designing a RISC chip and are wondering how few addressing modes we can get away with. Right now we have register direct and register indirect. The question is, is this enough? What do people as programmers and/or designers think is a minimum set of addressing modes? We can get away with just these two (I hope!), but doing some things requires a lot of convoluted code. I guess followups should go to net.arch, just to keep them all in the same place. Jeff Sparkes jeff1%garfield.mun.cdn@ubc.csnet <- preferred route {allegra,seismo,psuvax1,utcsri}!garfield!jeff1 (I don't trust ihnp4 any more)
freak@nsc.UUCP (Curt Mayer) (02/13/86)
In article <946@garfield.UUCP> jeff1@garfield.UUCP (Jeff Sparkes) writes: > > We are currently designing a RISC chip and are wondering how >few addressing modes we can get away with. Right now we have register >direct and register indirect. The question is, is this enough? What do >people as programmers and/or designers think is a minimum set of addressing >modes? We can get away with just these two (I hope!), but doing some things >requires a lot of convoluted code. > I guess followups should go to net.arch, just to keep them all in the >same place. :-) i had quite a bit of fun looking at bizarre ways to simulate the immediate addressing mode. maybe you could play with XORs and SHIFTs to get numbers into registers in the first place. Take a look at the CDC Cybers designed by Seymour Cray and friends in the early 60's. Heap Big Nasty Machines. I watch with amusement whenever some university comes up with a new, novel thingy that Cray came up with 20 years ago. the more that things change, the more they stay the same curt
brooks@lll-crg.ARpA (Eugene D. Brooks III) (02/14/86)
>Take a look at the CDC Cybers designed by Seymour Cray and friends in the >early 60's. Heap Big Nasty Machines. I watch with amusement whenever some >university comes up with a new, novel thingy that Cray came up with 20 >years ago. > I am waiting for a university to "invent" the first RISC with pipelined functional units!
hansen@mips.UUCP (Craig Hansen) (02/14/86)
> We are currently designing a RISC chip and are wondering how > few addressing modes we can get away with. Right now we have register > direct and register indirect. The question is, is this enough? What do > people as programmers and/or designers think is a minimum set of addressing > modes? We can get away with just these two (I hope!), but doing some things > requires a lot of convoluted code. > > Jeff Sparkes > jeff1%garfield.mun.cdn@ubc.csnet <- preferred route > {allegra,seismo,psuvax1,utcsri}!garfield!jeff1 (I don't trust ihnp4 any more) Did you forget to mention an immediate operand addressing mode? I fear you are asking the wrong question. Are you trying to minimize the number of addressing modes, or maximize performance? The RISC design approach involves selecting a small set of operations that minimize the product of (the dynamic number of instructions executed) * (average instruction execution time). By minimizing only the first of these parameters, you get the classic overwrought CISC design, and by minimizing only the second, you may drastically increase the number of instructions executed, and get a less than optimum design. The number of addressing modes alone isn't enough to make the decision; what you provide in the way of the remaining operations is important too. For example, with a suitable add register instruction, you can synthesize indexed addressing, an add immediate instruction can synthesize base-displacement addressing, and a shift instruction can be used to synthesize index-shifted addressing. As you say, keeping a small set of addressing modes will make the path length (instruction count) of some operations rather high, but I would expect that you would find relatively little usage of very complicated addressing modes, and further, that these complicated modes can be synthesized from a smaller set of operations. 
Further, if you are trying to design a really practical RISC machine, you must examine how machine code is generated and linked, as these operations are just as critical to the selection of appropriate addressing structures as the hardware trade-offs are. Evurthin' tastes better when it sits on a RISC. Craig Hansen MIPS Computer Systems ...!decvax!decwrl!glacier!mips!hansen
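A minimal C sketch of the synthesis Hansen describes: an indexed access reduced to a shift (scale), an add (effective address), and a single register-indirect load. The function names and the 4-byte element size are illustrative assumptions, not anything from the post.

```c
#include <stdint.h>

/* Hypothetical illustration: with only register-indirect loads, an
 * indexed access like a[i] (4-byte elements assumed) is synthesized
 * from a shift and an add, as the post suggests. */
static int32_t load_indirect(const int32_t *r) { return *r; }

int32_t load_indexed(const int32_t *base, int32_t index)
{
    /* shift: scale the index by the element size (4 == 1 << 2) */
    intptr_t offset = (intptr_t)index << 2;
    /* add: form the effective address in a "register" */
    const int32_t *ea = (const int32_t *)((const char *)base + offset);
    /* register indirect: the only memory-addressing step */
    return load_indirect(ea);
}
```

The compiler emits the shift and add as ordinary register-register instructions; only the final load touches memory.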
kludge@gitpyr.UUCP (Scott Dorsey) (02/14/86)
In article <946@garfield.UUCP> jeff1@garfield.UUCP (Jeff Sparkes) writes: > > We are currently designing a RISC chip and are wondering how >few addressing modes we can get away with. Right now we have register >direct and register indirect. The question is, is this enough? What do >people as programmers and/or designers think is a minimum set of addressing >modes? We can get away with just these two (I hope!), but doing some things >requires a lot of convoluted code. > I guess followups should go to net.arch, just to keep them all in the >same place. Well, I am a big fan of MOVQ type instructions, which move a 4-bit literal field within the instruction into a register or address. If you have 32-bit instructions, there is plenty of room, and even on a 16-bit machine, if the instruction set is small then there should be no problem. For a good example, see the 32000 series. It's not a RISC machine, but its code density is admirable. I don't know just how difficult such an instruction is to handle, but with a barrel shifter hanging off the side, it shouldn't be bad. I like double and triple indirection. They probably don't justify themselves in terms of speed, though. But I still like them. I like indexed addressing (INDEX+constant)...and an instruction that sums two values and gets the address pointed to by the sum would be cute for array manipulation. But then, I'm just an assembly hacker.
darrell@sdcsvax.UUCP (Darrell Long) (02/14/86)
You should include an indexed addressing mode. Array manipulation is a very common thing, and indexed mode helps the code generation a lot! I'm not sure your set is functionally complete; how do you get an address into the register in the first place? You might have a small problem there. DL -- Darrell Long Department of Electrical Engineering and Computer Science University of California, San Diego UUCP: sdcsvax!darrell ARPA: darrell@sdcsvax
maa@oasys.UUCP (02/15/86)
<just a byte or two for the line eater> > I like double and triple indirection. They probably don't justify > themselves in terms of speed, though. But I still like them. > I like indexed addressing (INDEX+constant)...and an instruction that > sums two values and gets the address pointed to by the sum would be cute > for array manipulation. But then, I'm just an assembly hacker. > > There was a computer in the basement of the University of Colorado EE building that I spent quite a lot of time playing with. Its name was GPL-3055, built by the Librascope division of General Precision. [There were only three of these ever built--for the DoD of course :-)] This computer supported infinite indirection with indexing at every level. I.e. all addresses had an indirect bit and index register number in them. A guy I knew was trying to write some program that made use of indexing after several levels of indirection. I don't think he ever got it to work, though. The word size on this beast was 8*6 bits and most of the instructions had a field in them to control which characters in the word were operated on. It also had nice colourful instruction names like BRING, KEEP and JERK. I think it's a shame that all these wonderful old computers with strange architectures are being retired. [I saw pieces of GPL-3055 in the local electronics junk store a year or so ago.] I think that the current EE/CS students are missing a lot by not getting to play with them. Happy hacking, Mark
roy@phri.UUCP (Roy Smith) (02/16/86)
In article <175@oasys.UUCP> maa@oasys.UUCP writes: > This computer [the GPL-3055] supported infinite indirection with indexing > at every level. I.e. all addresses had an indirect bit and index register > number in them. I understand that every address in the DG Nova used one bit (the sign bit, I guess) to indicate indirection. The indirect addressing mode was not decoded as part of the instruction, but as part of the address. As long as the address you fetched had the indirect bit set, the machine kept doing another level of indirection. I think that the CPU only checked for interrupts at the completion of an instruction. You could generate an infinite indirection loop (two addresses which point to each other) and there was no way to halt the CPU short of pulling the plug. We had one of these in the EE lab at college, but I never actually got a chance to play with it so I may have some details wrong. By the time I got my hands on it, the most valuable parts left were the rack and the power supplies. -- Roy Smith, {allegra,philabs}!phri!roy System Administrator, Public Health Research Institute 455 First Avenue, New York, NY 10016
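A toy model of the Nova-style indirection chain described above (the bit layout is an assumption for illustration, not the actual Nova encoding): the top bit of each 16-bit word marks "indirect", and resolution keeps chasing addresses while the bit stays set. A real Nova had no loop limit, which is exactly the lock-up the post describes; this model caps the chain instead.

```c
#include <stdint.h>

#define IND_BIT   0x8000u   /* assumed: sign bit marks indirection */
#define ADDR_MASK 0x7FFFu

/* Returns the final effective address, or -1 if the chain looks
 * circular (two words pointing at each other, as in the post). */
int resolve(const uint16_t mem[], uint16_t word)
{
    int steps = 0;
    while (word & IND_BIT) {
        if (++steps > 32768) return -1;   /* would spin forever on real iron */
        word = mem[word & ADDR_MASK];     /* fetch the next address word */
    }
    return (int)(word & ADDR_MASK);
}
```

With the indirect bit clear the word is the address; with it set, resolution is a pure memory-walk that the instruction decoder never sees.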
mlopes@hhb.UUCP (02/17/86)
> We are currently designing a RISC chip and are wondering how >few addressing modes we can get away with. Right now we have register >direct and register indirect. The question is, is this enough? What do >people as programmers and/or designers think is a minimum set of addressing >modes? We can get away with just these two (I hope!), but doing some things >requires a lot of convoluted code. Functionally, there is no problem with register direct and register indirect as the only addressing modes. Any other modes can be simulated with them. Immediate addressing would be simulated by allocating a memory location for the immediate value, and loading its address into a register. Now you can use register indirect to load the value into another register. A similar scheme can be used to simulate absolute addressing. In this case, the "immediate" value would be the absolute address. All other fancier types of addressing can be simulated by address manipulation within registers. Marco Lopes HHB-Systems Mahwah NJ philabs!hhb!mlopes
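The simulation sketched above, in C (the pool name is invented): the "immediate" lives in an assembled memory cell, a register holds only its address, and register indirect fetches the value. Note that C's `&` operator quietly hides the question later replies raise, namely how the address itself gets into the register.

```c
#include <stdint.h>

/* Literal pool entry assembled into the program image (assumed name). */
static const int32_t pool_42 = 42;

/* Simulate "load immediate 42" with register indirect only. */
int32_t load_immediate_42(void)
{
    const int32_t *r1 = &pool_42;   /* address placed in a register */
    return *r1;                     /* register indirect load       */
}
```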
kludge@gitpyr.UUCP (Scott Dorsey) (02/17/86)
a@oasys.UUCP writes: >I think it's a shame that all these wonderful old computers with strange >architectures are being retired. [I saw pieces of GPL-3055 in the local >electronics junk store a year or so ago.] I think that the current EE/CS >students are missing a lot by not getting to play with them. Whatever happened to the TI 9900? How about the PACE and IMP-8 from Nat'l Semi? And has anyone out there ever used the Writable Control Store option on the Burroughs 5500? There's some great stuff out there that has been long forgotten (except by the gov't. If you want to see some neat stuff, take a look at some of the AN/UYK manuals!). Has anyone out there ever done any research on the increase in code density produced by various addressing modes/instructions vs. the effort to decode them? I don't mean RISC stuff, I mean some really hairy stuff, like infinite indirection (oh, how I would love to see that!). Just once, I'd like to see, oh, say a 32000 with a RISC banked register structure and port-mapped I/O. And whatever happened to port-mapped I/O anyway? ------- Disclaimer: Everything I say is probably a trademark of someone. But don't worry, I probably don't know what I'm talking about. Scott Dorsey ICS Programming Lab, Georgia Institute of Technology, Atlanta Georgia, 30332 ...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge
jack@boring.uucp (Jack Jansen) (02/17/86)
In article <1417@sdcsvax.UUCP> darrell@sdcsvax.UUCP (Darrell Long) writes:
>You should include an indexed addressing mode. Array manipulation
>is a very common thing, and indexed mode helps the code generation
>a lot!

No, please..... The idea of a RISC architecture is to *eliminate* all fancies like indexed addressing. The point is, a compiler will have a terrible time generating code for it. As an example, the line

	a = arr[b];	/* a=r0, b=r1 */

could be translated into

	mov arr(r1),r0

if 'arr' is static/external and sizeof(arr[0]) = 1; or

	mov arr(lb,r1.b),r0
	mov arr(lb,r1.w),r0
	mov arr(lb,r1.d),r0

if the machine knows double indexing with operand width specification and the array sits on the stack *and* sizeof(arr[0]) is 1, 2 or 4; or

	mov r1,r2
	mul #sizeof(arr[0]), r2
	add #arr,r2
	add lb,r2
	mov *r2, r0

if all else fails. (And a lot of intermediate forms, depending on the machine.) Now, the point is that most compiler-writers don't have the time to let their compiler look for all those fancy addressing modes it might be able to use. And even if they do incorporate these features, they'll make the compiler much bigger and hairier (and buggier). The idea behind a RISC machine is to have instructions with only limited capabilities (when compared to a CISC machine, like a VAX or a 32032), but to make them *fast*. -- Jack Jansen, jack@mcvax.UUCP The shell is my oyster.
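Jansen's "all else fails" sequence, written out as C over an explicit address computation (the parameter names are invented; in generated code the element size and offsets would be constants):

```c
#include <stddef.h>

/* mov r1,r2 / mul #sizeof / add #arr / add lb / mov *r2,r0,
 * step by step, for an int array at arr_offset in the frame. */
int fetch_element(const char *frame_base, size_t arr_offset,
                  size_t elem_size, size_t b)
{
    size_t r2 = b;                      /* mov r1,r2           */
    r2 *= elem_size;                    /* mul #sizeof(arr[0]) */
    r2 += arr_offset;                   /* add #arr            */
    const char *ea = frame_base + r2;   /* add lb              */
    return *(const int *)ea;            /* mov *r2,r0          */
}
```

Every step is a plain register operation a RISC already has; only the last is a memory access.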
billw@Navajo.ARPA (William E. Westfield) (02/17/86)
The DEC Kx10 CPU (DECsystem10, DECsystem20) has infinite indirection with possible indexing at each level. I myself have written programs with two levels of indexing. The 10's (36 bit) architecture is quite nice (by my standards), and has quite a number of RISC-like features (foremost being an extremely regular instruction format). Too bad it's dying out. BillW
craig@dcl-cs.UUCP (Craig Wylie) (02/17/86)
In article <946@garfield.UUCP> jeff1@garfield.UUCP (Jeff Sparkes) writes: > > We are currently designing a RISC chip and are wondering how >few addressing modes we can get away with. Right now we have register >direct and register indirect. The question is, is this enough? What do >people as programmers and/or designers think is a minimum set of addressing >modes? We can get away with just these two (I hope!), ... Doesn't the Inmos Transputer get away with one mode? (Everything is an address?) Craig. -- UUCP: ...!seismo!mcvax!ukc!dcl-cs!craig| Post: University of Lancaster, DARPA: craig%lancs.comp@ucl-cs | Department of Computing, JANET: craig@uk.ac.lancs.comp | Bailrigg, Lancaster, UK. Phone: +44 524 65201 Ext. 4146 | LA1 4YR Project: Cosmos Distributed Operating Systems Research
pdg@ihdev.UUCP (P. D. Guthrie) (02/17/86)
In article <175@oasys.UUCP> maa@oasys.UUCP writes: >This computer supported infinite indirection with indexing at every level. >I.e. all addresses had an indirect bit and index register number in them. A guy >I knew was trying to write some program that made use of indexing after >several levels of indirection. I don't think he ever got it to work, though. > A Dec-20 also does this, and there are *lots* of those around. -- Paul Guthrie `When the going gets weird, ihnp4!ihdev!pdg The weird turn pro' - H. Thompson
kludge@gitpyr.UUCP (Scott Dorsey) (02/17/86)
In article <6777@boring.UUCP> jack@mcvax.UUCP (Jack Jansen) writes: >No, please..... The idea of a RISC architecture is to *eliminate* >all fancies like indexed addressing. The point is, a compiler will >have a terrible time generating code for it. As an example, the line That is absolutely true, but who is to say just WHAT instructions and addressing modes are really needed... That is to say, which instructions are worth the amount of time to decode them vs. the amount of time required to use equivalent low-level instructions? Depending on the architecture of the machine, this varies a lot. Who is to say what the optimum architectural complexity vs. instruction set complexity is? Without fancy register stacks, RISC machines DO tend to execute faster than their CISC contemporaries doing most problems. Maybe this is due to the increase in compiler simplicity, in which case the idea would seem to me to build a compiler which can better use the CISC. ------- Disclaimer: I wouldn't like to code assembly on a RISC machine, but this message is being sent from one, and WOW! Is it fast! Scott Dorsey ICS Programming Lab, Georgia Institute of Technology, Atlanta Georgia, 30332 ...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge
b-davis@utah-cs.UUCP (Brad Davis) (02/18/86)
In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes: >Immediate addressing would be simulated by allocating a memory >location for the immediate value, and loading its address into a register. How do you load the address of the memory location into a register without using an immediate addressing mode? -- Brad Davis {ihnp4, decvax, seismo}!utah-cs!b-davis b-davis@utah-cs.ARPA One drunk driver can ruin your whole day.
jqj@gvax.cs.cornell.edu (J Q Johnson) (02/18/86)
In article <175@oasys.UUCP> maa@oasys.UUCP writes: >There was a computer in the basement of the University of Colorado EE building >that I spent quite a lot of time playing with. Its name was GPL-3055, built >by the Librascope division of General Precision. [There were only three of >these ever built--for the DoD of course :-)] > >This computer supported infinite indirection with indexing at every level. >I think it's a shame that all these wonderful old computers with strange >architectures are being retired. The PDP-10 architecture (DEC-10, DEC-20, etc.), of which there were substantially more than 3 built, also has infinite indirection with indexing at every level. In fact, one frequently uses multiple levels of indirection, though generally only one or two indexing operations. Yes, it's a shame the PDP-10 is being retired. It was a nice machine, and the mainstay of computer science research (especially AI) for a decade.
mlopes@anwar.UUCP (Marco Lopes) (02/18/86)
In my earlier posting, I assumed the existence of absolute addressing. Therefore, a minimum set of addressing modes is: register register indirect absolute Another functionally complete set would have immediate instead of absolute. Marco Lopes HHB-Systems Mahwah, NJ philabs!hhb!mlopes -- ----------------------------------------------------------------------------- Marco Lopes HHB-Softron 1000 Wyckoff Ave. Mahwah NJ 07430 201-848-8000 UUCP address: {ihnp4,decvax,allegra}!philabs!hhb!mlopes
crowl@rochester.UUCP (Lawrence Crowl) (02/19/86)
In article <3686@utah-cs.UUCP> b-davis@utah-cs.UUCP (Brad Davis) writes: >In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes: >>Immediate addressing would be simulated by allocating a memory >>location for the immediate value, and loading its address into a register. > >How do you load the address of the memory location into a register without >using an immediate addressing mode? Any practical architecture must have a mechanism for loading a constant value into a register or using an absolute address. With absolute addressing, the constant can be placed in a memory location and then loaded from there. If constant value loading is present, the value can be loaded and then used indirectly to achieve absolute addressing. However, it appears both authors assume no constant loading facility at all. So, in response, I present a method to obtain constants without such a facility. To get the constant 10 (mixture of C and Pascal):

	r1 := r1 - r1  ; { r1 == 0 }
	r1 := ~r1      ; { r1 == -1, assuming two's complement }
	r2 := r2 - r2  ; { r2 == 0 }
	r2 := r2 - r1  ; { r2 == 1 }
	{ from here the algorithm is constant and register dependent }
	r3 := r3 - r3  ; { r3 == 0 }
	r3 := r3 + r2  ; { r3 == 1 }
	r3 := r3 << r2 ; { r3 == 2 }
	r3 := r3 << r2 ; { r3 == 4 }
	r3 := r3 + r2  ; { r3 == 5 }
	r3 := r3 << r2 ; { r3 == 10 }

The method can be extended in the obvious manner to generate arbitrary constants. I cannot recommend an architecture relying on this method. I suggest looking very seriously at the addressing mode provided by Patterson for the RISC I and RISC II chips. The form is

	value in register + constant = effective address

This allows absolute addressing with a 0 valued register, register indirect with a 0 valued constant, and register displaced when both are non-zero. This mode is very useful for accessing variables off the stack frame (you will need one even if you rely on registers). In addition, it is very useful for accessing pointer based records and such. 
-- Lawrence Crowl 716-275-5766 University of Rochester Computer Science Department ...!{allegra,decvax,seismo}!rochester!crowl Rochester, New York, 14627
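Crowl's constant-building sequence can be checked directly in C; left shifts double the value, so `<<` plays the role his shift steps intend. Passing the registers in as parameters keeps the sketch free of any immediate loads of its own (the function name is invented):

```c
#include <stdint.h>

/* Build the constant 10 using only register-register operations:
 * subtract, complement, add, and shift. Inputs are arbitrary junk,
 * standing in for uninitialized registers. */
int32_t make_ten(int32_t r1, int32_t r2, int32_t r3)
{
    r1 = r1 - r1;   /* r1 == 0 */
    r1 = ~r1;       /* r1 == -1, two's complement assumed */
    r2 = r2 - r2;   /* r2 == 0 */
    r2 = r2 - r1;   /* r2 == 1 */
    r3 = r3 - r3;   /* r3 == 0 */
    r3 = r3 + r2;   /* r3 == 1 */
    r3 = r3 << r2;  /* r3 == 2 */
    r3 = r3 << r2;  /* r3 == 4 */
    r3 = r3 + r2;   /* r3 == 5 */
    r3 = r3 << r2;  /* r3 == 10 */
    return r3;
}
```

Ten instructions for one constant is exactly why no one would build a machine this way, which is Crowl's point.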
ludemann@ubc-cs.UUCP (Peter Ludemann) (02/19/86)
In article <2213@phri.UUCP> roy@phri.UUCP (Roy Smith) writes: > I understand that every address in the DG Nova used one bit (the >sign bit, I guess) to indicate indirection. The indirect addressing mode >was not decoded as part of the instruction, but as part of the address. As >long as the address you fetched had the indirect bit set, the machine kept >doing another level of indirection. > > I think that the CPU only checked for interrupts at the completion >of an instruction. You could generate an infinite indirection loop (two >addresses which point to each other) and there was no way to halt the CPU >short of pulling the plug. A friend of mine programmed one of these. It had the delightful "feature" that references to nonexistent memory returned data with the indirection bit turned on. So the first invalid address put the machine into a tight loop. Old timers may recall that the IBM 1620 (a decimal machine) had an indirection bit and also variable sized numbers (there was another bit on each digit to indicate the end of the number). -- -- Peter Ludemann ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann) ludemann@cs.ubc.cdn (ludemann@cs.ubc.cdn@ubc.mailnet) ludemann@ubc.csnet (ludemann%ubc.csnet@CSNET-RELAY.ARPA)
ludemann@ubc-cs.UUCP (Peter Ludemann) (02/19/86)
In article <1433@gitpyr.UUCP> kludge@gitpyr.UUCP (Scott Dorsey) writes: >In article <6777@boring.UUCP> jack@mcvax.UUCP (Jack Jansen) writes: >>No, please..... The idea of a RISC architecture is to *eliminate* >>all fancies like indexed addressing. The point is, a compiler will >>have a terrible time generating code for it. ... > > That is absolutely true, but who is to say just WHAT instructions >and addressing modes are really needed... >... Without fancy register stacks, RISC machines DO tend to execute >faster than their CISC contemporaries doing most problems. Maybe this >is due to the increase in compiler simplicity, in which case the idea >would seem to me to build a compiler which can better use the CISC. > As a sometimes compiler writer who has tackled some optimisation problems, I would like to point out that compiler writing is tricky enough without complexities added by the machine's instruction set. For a description of some of the difficulties, take a look at the description of the portable C compiler in the Unix programmer's manual, volume 2 (especially the part on register allocation for expressions). Here's a simple example problem. On an IBM/360/370/..., there are two ways to generate code for "A := B[I]":

	LA  R1,B         load address of B into reg 1
	A   R1,I         add value of I to reg 1
	L   R0,0(,R1)    load value pointed to by reg 1 into reg 0
	ST  R0,A         store value in reg 0 into A

and:

	LA  R1,B         load address of B into reg 1
	L   R2,I         load value of I into reg 2
	L   R0,0(R1,R2)  load value pointed to by reg 1 + reg 2 into reg 0
	ST  R0,A         store value in reg 0 into A

(There are more possibilities, such as replacing the last two lines of the first way by "MVC A,0(R1)" - move memory to memory. Also, the "A R1,I" could be done by "LA R1,I(R1)".) Now, which is the best method? In this case, the amount of code generated is almost the same but execution time will vary. In fact, what is faster on one machine model may be slower on another. 
A dramatic example of this was a text processing program which was sped up about 10% by replacing all MVCL (move character long) instructions by the simpler MVC (which could only do up to 256 bytes) inside a loop. But I digress. The two methods have one other difference: the first uses 2 scratch registers (1 if the MVC instruction is used) whereas the second uses 3 scratch registers. So, deciding which to use can also depend on how many scratch registers are available. As far as I'm concerned, the test for RISCness should be: given any piece of source code, is there only one reasonable code sequence which can be output by the compiler? -- -- Peter Ludemann ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann) ludemann@cs.ubc.cdn (ludemann@cs.ubc.cdn@ubc.mailnet) ludemann@ubc.csnet (ludemann%ubc.csnet@CSNET-RELAY.ARPA)
jans@tekecs.UUCP (Jan Steinman) (02/19/86)
In article <175@oasys.UUCP> maa@oasys.UUCP writes: > >This computer supported infinite indirection with indexing at every level. >I think it's a shame that all these wonderful old computers with strange >architectures are being retired. Look at Tandem computers. These are still being made (indeed, Tandem holds over 80% of the fault-tolerant transaction processing market) and they feature an "indirect bit" which, if set, says "I'm an address, not data." Tandem does not support an assembler, so it is unlikely that infinite indirection will occur, unless memory gets clobbered somehow. :::::: Artificial Intelligence Machines --- Smalltalk Project :::::: :::::: Jan Steinman Box 1000, MS 60-405 (w)503/685-2956 :::::: :::::: tektronix!tekecs!jans Wilsonville, OR 97070 (h)503/657-7703 ::::::
kludge@gitpyr.UUCP (Scott Dorsey) (02/20/86)
I was speaking with David Lane, a user assistant here, about the minimum number of addressing modes and instructions required to compute any computable function, and we thought about conditionality. First of all, why can't a jump be treated as a special case of LOAD/STORE/MOVE, because most machines implement it that way anyway (MOV 1200, PC). Of course, that makes conditional jumps impossible. So, just set one (or two) bits of every instruction aside to make it possible to conditionally execute any instruction. Is it worth the trouble to decode it? It could be treated so that as the instruction is decoded, the condition field is checked, and if the condition is not true, the instruction is aborted (it should only take one machine cycle to check the flag, right?). For one-cycle instructions, they would have to be executed only if the condition was "always execute", otherwise they would have to be delayed a cycle for the condition field to be decoded. Is it worth it? Does anyone care? ------- Disclaimer: Everything I say is probably a trademark of someone. But don't worry, I probably don't know what I'm talking about. Scott Dorsey ICS Programming Lab, Georgia Institute of Technology, Atlanta Georgia, 30332 ...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge
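A toy software model of the proposal: every instruction carries a small condition field checked against a flags register before the operation is allowed to take effect. The field encodings, the accumulator machine, and all names here are invented for illustration.

```c
#include <stdint.h>

enum cond { C_ALWAYS, C_IFZERO, C_IFNOTZERO, C_NEVER };

typedef struct { int32_t acc; int zflag; } cpu_t;

/* Execute "add imm to accumulator" only if the condition field passes;
 * otherwise the instruction is aborted, costing only the check. */
void add_if(cpu_t *cpu, enum cond c, int32_t imm)
{
    int take = (c == C_ALWAYS) ||
               (c == C_IFZERO    &&  cpu->zflag) ||
               (c == C_IFNOTZERO && !cpu->zflag);
    if (!take) return;              /* aborted: no state change */
    cpu->acc += imm;
    cpu->zflag = (cpu->acc == 0);   /* flags set by the executed op */
}
```

Short forward branches then become runs of predicated instructions rather than a jump around them.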
martin@minster.UUCP (martin) (02/21/86)
In article <1441@gitpyr.UUCP> kludge@gitpyr.UUCP (Scott Dorsey) writes: > >impossible. So, just set one (or two) bits of every instruction aside to make >it possible to conditionally execute any instruction. Is it worth the trouble The Acorn RISC Machine (ARM) does exactly this, although not for the same reasons. All instructions (which are 32 bits long) have a 4-bit condition field, and the instruction is only executed if the condition is true. Obviously the condition 'always true' is one of the 16 possible. The reason for this approach is that it allows small sequences of instructions to be skipped, without the overhead of a branch instruction, and the associated pipeline flush. In a machine which executes instructions in one clock cycle this can be a considerable saving. The ARM only has two main addressing modes, which are variations on an indexed addressing mode, but these are associated with various options, so that the result of the indexing operation can be written back to a register, allowing other addressing modes (post increment etc) to be constructed. There is also a form of immediate addressing mode. It is interesting to note that Acorn removed delayed branches when virtual memory support was added to the processor, because it wasn't clear how to restart the instruction following a branch! Perhaps some of the other RISC machines have solved this; if they have, I wonder if it was simple enough to be worth it, or not? Does anyone know? For more information on the ARM, I know of the following articles:

	Byte, January 1986: "Byte U.K.: The Acorn RISC Machine", pp 387-393
	Electronics, August 5, 1985: "Acorn goes to market with RISC microprocessor", pp 14-15
	Electronics, August 26, 1985: "At 3 Mips, RISC Processor is Among Fastest Chips Around", pp 48-49

and for those in the UK: Personal Computer World (sometime late last year!): "The Acorn RISC Machine" (approx) I hope this is of interest: M C Atkins usenet: ukc!minster.uucp!martin PS. 
I have no connection with Acorn at all, except that I like the look of the ARM!
bcase@amdcad.UUCP (Brian case) (02/21/86)
In article <1441@gitpyr.UUCP>, kludge@gitpyr.UUCP (Scott Dorsey) writes: > impossible. So, just set one (or two) bits of every instruction aside to make > it possible to conditionally execute any instruction. Is it worth the trouble Welllllll, the Acorn RISC has this feature, and there was some work done by a student at UCSD on this very subject (I can get references if anyone is interested). Some conditional branches can be eliminated by this technique, thus making the observed path length of a program longer; a longer path length can mean faster execution if a separate path to instruction memory exists and that memory can support some sort of burst mode access. Anyway, branches are bad things to most architectures, so eliminating some of them might yield better performance. bcase
nather@utastro.UUCP (Ed Nather) (02/22/86)
In article <1441@gitpyr.UUCP>, kludge@gitpyr.UUCP (Scott Dorsey) writes: > First of all, why can't a jump > be treated as a special case of LOAD/STORE/MOVE, because most machines implement > it that way anyway (MOV 1200, PC). Of course, that makes conditional jumps > impossible. So, just set one (or two) bits of every instruction aside to make > it possible to conditionally execute any instruction. > > Scott Dorsey Perhaps you just paste a bit from a condition flag onto the (low) end, and then you would jump to one of two (successive) locations depending on the condition. The first of that pair would probably need to be an unconditional jump in most cases, but look, Ma! No decoding! The LGP-30 computer had only one test instruction (test minus) and it worked, though programming a test for zero was a touch awkward -- but it could be done. Whether it's worth doing is another question. The TRUE RISC would have only one instruction: subtract. With only one instruction, you wouldn't need an op-code field; every instruction is just a pair of addresses (what to subtract from what). An "add" subroutine would need 3 instructions. With memory-mapped I/O and the above convention for conditional branching, you'd have it all. TRUE RISC is not a trademark of any known company. -- Ed Nather Astronomy Dept, U of Texas @ Austin {allegra,ihnp4}!{noao,ut-sally}!utastro!nather nather@astro.UTEXAS.EDU
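One well-known realization of Nather's "TRUE RISC" idea is the one-instruction SUBLEQ machine ("subtract and branch if less than or equal to zero"), which folds his conditional-branch convention into the subtract itself: each instruction is three addresses a, b, c meaning mem[b] -= mem[a]; if the result is <= 0, jump to c. A minimal interpreter, as a sketch rather than any shipping machine:

```c
#include <stdint.h>

/* Run a SUBLEQ program until the program counter goes negative
 * (the halt convention assumed here); return mem[result_addr]. */
int32_t subleq_run(int32_t mem[], int32_t pc, int32_t result_addr)
{
    while (pc >= 0) {
        int32_t a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
        mem[b] -= mem[a];                      /* the only operation   */
        pc = (mem[b] <= 0) ? c : pc + 3;       /* conditional branch   */
    }
    return mem[result_addr];
}
```

As the post predicts, an "add" takes three of these instructions: subtract a into a zeroed scratch cell, subtract the (negated) scratch from b, then clear the scratch.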
tim@ism780c.UUCP (Tim Smith) (02/22/86)
In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes: > > Functionally, there is no problem with register direct and > register indirect as the only addressing modes. Any other modes > can be simulated with them. Immediate addressing would be > simulated by allocating a memory location for the immediate > value, and loading its address into a register. Now you you can ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > use register indirect to load the value into another register. But how do you load THAT register? -- Tim Smith sdcrdcf!ism780c!tim || ima!ism780!tim || ihnp4!cithep!tim
patc@tekcrl.UUCP (Pat Caudill) (02/22/86)
Under some circumstances you can get by with only register indirect load and store and load half-register immediate. A register-to-register move can be done by adding a register to immediate zero, giving the destination register. If the immediate is a small constant built into the main instruction (doesn't require a second word fetch) this is a move. Some designs also forced a register to always contain zero for sequences like this. The IBM 801 allowed memory addressing only for load and store. But it allowed 16-bit immediate operands. The instruction would operate on either the upper or lower half of the register. At first I thought this was a crock until I realized it didn't take any more memory cycles than an instruction followed by a word of immediate data. And of course it greatly simplified the instruction cycling. And sometimes you can get away without the second instruction. The 801 allowed immediate operands for any of the logical instructions, and lower-half immediates on add and subtract. Tektronix!tekcrl!patc
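A sketch of the half-register-immediate scheme described above: a full 32-bit constant is built from two instructions, each carrying a 16-bit immediate aimed at one half of the register. The mnemonics and exact semantics are invented for illustration (the same idea later appears as MIPS's lui/ori pair).

```c
#include <stdint.h>

/* "Load upper immediate": set the upper half, keep the lower half. */
uint32_t load_upper_imm(uint32_t reg, uint16_t imm)
{
    return (reg & 0x0000FFFFu) | ((uint32_t)imm << 16);
}

/* "OR lower immediate": fill in the lower half. */
uint32_t or_lower_imm(uint32_t reg, uint16_t imm)
{
    return reg | imm;
}
```

Two single-word instructions cost the same memory traffic as one instruction plus a trailing word of immediate data, which is the post's observation.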
mac@uvacs.UUCP (Alex Colvin) (02/22/86)
> So, just set one (or two) bits of every instruction aside to make it
> possible to conditionally execute any instruction.  Is it worth the
> trouble to decode it?

Instead of using the bits, make it a(n optional) prefix to the
instruction.  Then you've got a machine with conditional skips instead
of conditional jumps.  See the PDP8, Nova, &c.
crickman@umn-cs.UUCP (Robin Crickman) (02/23/86)
> function, and we thought about conditionality.  First of all, why
> can't a jump be treated as a special case of LOAD/STORE/MOVE, because
> most machines implement it that way anyway (MOV 1200, PC).  Of course,
> that makes conditional jumps impossible.  So, just set one (or two)
> bits of every instruction aside to make it possible to conditionally
> execute any instruction.  Is it worth the trouble to decode it?  It
> could be treated so that as the instruction is decoded, the condition
> field is checked, and if the condition is not true, the instruction
> is aborted (it should only take one machine cycle to check the flag,
> right?).  For one-cycle instructions, they would have to be executed
> only if the condition was "always execute", otherwise they would have
> to be delayed a cycle for the condition field to be decoded.  Is it
> worth it?  Does anyone care?
>
> Scott Dorsey
> ICS Programming Lab, Georgia Institute of Technology, Atlanta Georgia, 30332

How about a conditional SKIP instruction (if TRUE, skip the next
opcode, else execute it), which was the only kind of conditional
instruction on the pdp-8?  I believe the pdp-8 used the sort of jump
described above.  Was it a RISC machine?  It certainly had a reduced
instruction set!

John Hasler (guest of ...ihnp4!umn-cs!crickman)
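The skip-vs-branch idea can be shown with a toy interpreter (the opcode
names are invented for this sketch): a SKIP instruction tests a flag
and, when it is true, the next instruction is simply not executed.

```python
# Toy illustration of a PDP-8-style conditional skip: when the flag
# is set, SKIP_IF causes the following instruction to be passed over.

def run(prog, flag):
    acc, pc = 0, 0
    while pc < len(prog):
        op, arg = prog[pc]
        pc += 1
        if op == "SKIP_IF":      # skip the next instruction when flag set
            if flag:
                pc += 1
        elif op == "ADD":
            acc += arg
    return acc

prog = [("SKIP_IF", None),   # if flag: don't do the next ADD
        ("ADD", 10),
        ("ADD", 1)]
# run(prog, True)  -> 1   (the ADD 10 was skipped)
# run(prog, False) -> 11  (both ADDs executed)
```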
patc@tekcrl.UUCP (Pat Caudill) (02/23/86)
. .. . ... . (line eater kibbles)

The use of a stored branch as the way to effect a conditional jump is
not new.  In "First Draft of a Report on the EDVAC", Sec 11.3, John von
Neumann describes the conditional "branching" facilities of the EDVAC.
Some background: the EDVAC's ALU (arithmetic organ) had three
registers.  I and J were inputs and O was output.  Numbers were loaded
to I, with the previous value in I shifting to J.  When an operation
was performed, the result appeared in O.  There were operations which
allowed I or J to go directly to O, and O to I.  The conditional
operation transferred the contents of I or J unchanged to O, depending
on the previous value in O.  To conditionally branch to one of two
locations, the inputs would be loaded with two jump instructions, the
correct one taken as the result, stored in memory, and then jumped to.
(Often the clever programmer would arrange the store address to be the
next location.)  For example:

	if ( a < b ) then goto L1 else goto L2 ;

would code up as (my own notation -- von Neumann's is very cryptic, and
I don't have the Greek alphabet on this terminal):

	    LD   A
	    LD   B
	    SUB		; -
	    LD   #JMP+L1
	    LD   #JMP+L2
	    COND	; s
	    STO  L3
	L3: JMP  L3

Turing, for the Pilot-ACE (I think -- look in "Early British Computers"
from Digital Press for more details), just had an instruction which set
a negative accumulator to zero or 1 depending on its sign.  He
suggested for a conditional branch:

	lbl1 * condition + lbl2 * (1-condition) + jmp -> nextInstr

We now look down on self-modifying code, but at that time it was
considered a great step forward.  Ahh, those were the good old days.

			Tektronix!tekcrl!patc
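Turing's formula above can be checked directly: with the condition
normalized to 0 or 1, the branch target is computed arithmetically and
no conditional jump is needed.

```python
# Turing's Pilot-ACE trick, as quoted in the post: select between two
# branch targets with plain arithmetic on a 0-or-1 condition.

def select_target(lbl1, lbl2, condition):
    # condition must be exactly 0 or 1
    return lbl1 * condition + lbl2 * (1 - condition)

# select_target(100, 200, 1) -> 100  (condition true, take lbl1)
# select_target(100, 200, 0) -> 200  (condition false, take lbl2)
```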
jer@peora.UUCP (J. Eric Roskos) (02/24/86)
> I suggest looking very seriously at the addressing mode provided by
> Patterson for the RISC I and RISC II chips.  The form is
>
>	value in register + constant = effective address
>
> This allows absolute addressing with a 0 valued register, register
> indirect with a 0 valued constant, and register displaced when both
> are non-zero.

Then, I suggest looking very seriously at the "indexed" mode of the
Motorola 6800 family, 10 years or so older than the RISC chips!

Fortunately, from the example on how to load a constant into a register
that preceded this, I know the above comment was meant as a joke... :-)
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	  2486 Sand Lake Road, Orlando, FL 32809-7642
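The quoted register+constant form is easy to sketch: with register 0
hardwired to zero, the one mode subsumes absolute, register indirect,
and displaced addressing.  (The register file and helper below are
illustrative only.)

```python
# Effective address in the quoted RISC I/II form:
#     EA = value in register + constant

def effective_address(regs, base, disp):
    return regs[base] + disp

regs = [0] * 8           # regs[0] plays the hardwired-zero register
regs[3] = 0x1000

# absolute:           EA = 0      + 0x42   = 0x42
# register indirect:  EA = 0x1000 + 0      = 0x1000
# register displaced: EA = 0x1000 + 8      = 0x1008
```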
jer@peora.UUCP (J. Eric Roskos) (02/24/86)
> As far as I'm concerned, the test for RISCness should be: given any
> piece of source code, is there only one reasonable code sequence
> which can be output by the compiler?

How is that a test for "RISCness"?  Following the above scheme, you
could, for example, have a machine that directly (and optimally)
interpreted some concise representation of the source code... but
that's exactly what people are calling a CISC machine, and are claiming
is bad...
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	  2486 Sand Lake Road, Orlando, FL 32809-7642
mac@uvacs.UUCP (Alex Colvin) (02/25/86)
>
> The TRUE RISC would have only one instruction: subtract.  With only one
> instruction, you wouldn't need an op-code field; every instruction is
> just a pair of addresses (what to subtract from what).  An "add"
> subroutine would need 3 instructions.  With memory-mapped I/O and the
> above convention for conditional branching, you'd have it all.
>
> TRUE RISC is not a trademark of any known company.

See the New England Digital ABLE 60 processor.  It actually has one
instruction (move).  Arithmetic is done more or less by moving to the
ALU.  This processor is used for control systems, digital synthesizers
(Synclavier), and communications networks.
nather@utastro.UUCP (Ed Nather) (02/25/86)
In article <546@tekcrl.UUCP>, patc@tekcrl.UUCP (Pat Caudill) writes:
> We now look down on self modifying code, but at that time
> it was considered a great step forward.  Ahh, those were the good old days.
>
> Tektronix!tekcrl!patc

It will return; it's too neat an idea to stay dead.  What we lack is
the needed formal discipline to use it carefully.  Spaghetti code has
given way to the cleaner structured programs, now that we have
patterned code closer to the way we think.

I still have, and routinely use, a program that depends for its
successful operation on self-modifying code.  [Yay!  I'm out of the
closet at last!]  I was able to find no other way to meet the required
conditions (animated display using a slow computer) and no one else has
either -- though several have tried.
-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.UTEXAS.EDU
rmarti@sun.uucp (Bob Marti) (02/25/86)
> I suggest looking very seriously at the addressing mode provided by
> Patterson for the RISC I and RISC II chips.  The form is
>
>	value in register + constant = effective address
>
> This allows absolute addressing with a 0 valued register, register
> indirect with a 0 valued constant, and register displaced when both
> are non-zero.

I seem to recall the CDC 6000 line had exactly this feature.  There
were eight 18-bit index registers (B0 - B7) which were used by certain
instructions.  The address was computed using the value in a B-register
and adding a constant.  They even had the B0 register "hardwired" to
contain the value 0.  This was almost 20 years before the Berkeley
people "invented" RISC ...

--Bob Marti
jlg@lanl.ARPA (Jim Giles) (02/25/86)
In article <1982@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>> As far as I'm concerned, the test for RISCness should be: given any piece
>> of source code, is there only one reasonable code sequence which can be
>> output by the compiler?
>
>How is that a test for "RISCness"?  Following the above scheme, you could,
>for example, have a machine that directly (and optimally) interpreted some
>concise representation of the source code... but that's exactly what people
>are calling a CISC machine, and are claiming is bad...

That's just the problem.  The designers of CISC machines think that
they are making the compiler writer's job easier by supplying opcodes
to perform high-level semantics.  Unfortunately, NO ONE has yet
succeeded in developing an instruction set which can 'directly (and
optimally)' interpret the semantics of the source code.  What usually
happens is that the 'high level' instruction is not as efficient as a
sequence or loop of simpler instructions - or it is only efficient for
a subset of its possible uses.  (As an example, consider the famous
case of the VAX instruction for evaluating polynomials.  It was never
as fast as a loop of simpler code to do the same job, and it was
horribly inefficient in the case where the compiler could detect a
large number of zero coefficients.)  As a result, compilers must be
VERY complex to generate good code on CISC machines.

At best, the CISC machines provide some instructions which are
generally useful, as well as lots of instructions which are seldom or
never used (consider all the weird addressing modes of modern vaxen -
which most compilers never generate).  The idea of RISC is to keep
these instructions which are generally useful, and make them as fast as
possible by eliminating all the seldom-used instructions.

A useful side effect of RISC architecture is that it permits compiler
writers to implement high-level semantics with simple 'code skeletons',
since there is usually only one efficient sequence of code which
reflects the required meaning.

J. Giles
Los Alamos
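The "loop of simpler code" that beats the VAX polynomial instruction is
just Horner's rule: one multiply and one add per coefficient, and a
compiler that sees zero coefficients can fold them away entirely.  A
sketch:

```python
# Horner's rule for polynomial evaluation -- the simple loop that the
# post says outruns the VAX POLY instruction.

def horner(coeffs, x):
    """Evaluate a polynomial; coeffs[0] is the highest-order term."""
    acc = 0
    for c in coeffs:
        acc = acc * x + c
    return acc

# 2*x^2 + 3*x + 1 at x = 4:
# acc = 2, then 2*4+3 = 11, then 11*4+1 = 45
```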
kludge@gitpyr.UUCP (Scott Dorsey) (02/25/86)
In article <890@umn-cs.UUCP> crickman@umn-cs.UUCP (Robin Crickman) writes:
>How about a conditional SKIP instruction (if TRUE, skip the next opcode, else
>execute it), which was the only kind of conditional instruction on the pdp-8?
>I believe the pdp-8 used the sort of jump described above.  Was it a RISC
>machine?  It certainly had a reduced instruction set!
>
>John Hasler (guest of ...ihnp4!umn-cs!crickman)

My HP-34C calculator has just that instruction.  Seems a little better
than a short branch, but not all that much.  Conditional branches are
probably faster, even if they require a larger word.  Most branches
tend to be short, and I suspect you'd wind up using a lot of (execute
this branch if true) instructions.  I wouldn't call the PDP-8 a RISC
machine, but then, what do you call a machine that can't add?

-------
Disclaimer: Everything I say is probably a trademark of someone.  But
don't worry, I probably don't know what I'm talking about.

Scott Dorsey
ICS Programming Lab, Georgia Institute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge
jack@boring.UUCP (02/26/86)
In article <736@ism780c.UUCP> tim@ism780c.UUCP (Tim Smith) writes:
>In article <137@hhb.UUCP> mlopes@hhb.UUCP (Marco Lopes) writes:
>>
>> Functionally, there is no problem with register direct and
>> register indirect as the only addressing modes.  Any other modes
>> can be simulated with them.  Immediate addressing would be
>> simulated by allocating a memory location for the immediate
>> value, and loading its address into a register.  Now you you can
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> use register indirect to load the value into another register.
>
>But how do you load THAT register?

Ahh, that's simple.  You just use the trick Marco Lopes described
above......
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.
jeff1@garfield.UUCP (02/26/86)
It appears that most people agree on the need for immediate or absolute
addressing.  What we have (hopefully) decided on are:

	register direct
	register indirect + 4 bit constant

(This is only a 16-bit machine due to size constraints.  Also, this is
only for an undergrad course in VLSI design.  Who can design something
REALLY useful in three months anyway?)

This allows us to fake immediate addressing with two instructions:

	load	Rx,PC,2
	addq	PC,1
	data	...

Indirection is trivial.  Load the register, and then use the register
for the address.  You can do based addressing with an add instruction
and an extra register.

Branch instructions are also done neatly.  We have a mask specifying
which condition codes to check, and then a value mask specifying the
values to check for.  I.e., if (CC & mask) == (mask & value) then
branch.

It's nice to get all of these useful responses.  Thanks!

Jeff Sparkes	garfield!jeff1	jeff@garfield.mun.cdn <- ubc-vision or ubc.csnet
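Jeff's branch condition test is simple enough to sketch directly.  (The
condition-code bit assignments below are my assumption; the post
doesn't give them.)

```python
# The described branch scheme: a mask picks which condition codes to
# test, and a value mask gives the pattern they must match.

Z, N, C, V = 0b0001, 0b0010, 0b0100, 0b1000   # bit assignments assumed

def branch_taken(cc, mask, value):
    return (cc & mask) == (mask & value)

# branch if zero flag set:       branch_taken(cc, Z, Z)
# branch if Z clear and N set:   branch_taken(cc, Z | N, N)
# branch always (empty mask):    branch_taken(cc, 0, 0)
```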
ehj@mordor.UUCP (Eric H Jensen) (02/27/86)
In article <426@utastro.UUCP> nather@utastro.UUCP (Ed Nather) writes:
>In article <546@tekcrl.UUCP>, patc@tekcrl.UUCP (Pat Caudill) writes:
>> We now look down on self modifying code, but at that time
>> it was considered a great step forward.  Ahh, those were the good old days.
>>
>It will return; it's too neat an idea to stay dead.  What we lack is ...
> ... I was able to find no other way to meet
>the required conditions (animated display using a slow computer) and no
>one else has either -- though several have tried.

Let's hope it doesn't....  If you allow self-modifying code, everyone
would have to have a separate copy of their task.  Swap space is not
that cheap - with a lot of people running emacs and maybe a lisp
underneath it, kiss your swap area and probably system performance
(unless you buy more main memory) goodbye....

I can't see how you could effectively manage a POLICY of
"self-modifying code is okay" in a statically linked environment.
Every sub-system that linked in a routine that modified its own code
would inherit the 'impure' curse.  Perhaps this would not be such a
problem in a dynamically linked environment.

Hardware-wise, the key here is "... using a SLOW computer" (emphasis
mine).  Once a computer is targeted to perform above a certain level,
the design must resort to having separate instruction and data caches
(unless you can get those 3ns 16k rams that the Japanese are using in a
dual port configuration).  A virtually addressed, read-only (the piece
of logic that writes it is the miss handler) icache (instruction cache)
is a very simple and compact design.  Btw, the icache is usually in one
of the critical paths of the machine - pc setup for branches.  A store
into the icache is going to cost at the very least one dead cycle,
unless more memory chips are added to play odd-even games (i.e. access
the odd words while writing the even words), but now there is more hair
(deferred writes) and possibly a sacrifice in speed (longer logic
paths).

A cache coherency problem also arises - the write to the icache must
also be reflected in the dcache (data cache).  I can think of a number
of approaches to this problem, but either they are messy or they create
dead machine cycles (no computation takes place while the update is
made).  In any event the design gets a lot larger, and it is dubious
that any performance is gained.

I would opt for efficient indexed addressing modes and instructions
that keep the requirement for constant operands to a minimum (i.e. the
amount to shift by for the shift instruction).  Aren't these the items
that made self-modifying code passe?
-- 
eric h. jensen (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229  USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA: ehj@angband  UUCP: ...!decvax!decwrl!mordor!angband!ehj
rogerh@megaron.UUCP (02/27/86)
(Oh, no -- I've lost my reference for this!  If anyone knows what goes
in the blank ("[]") below, send me mail, thanks.  -rh)

This (self-modifying code) is the way [] does a bitblt; the entry code
checks for the base cases, otherwise it builds custom code to do the
bitblt and then jumps to it.  Very clever -- the case to build the code
is larger than the case within the bitblt would be, but performance is
much higher because the code that's built can be optimal.
clif@intelca.UUCP (Clif Purkiser) (02/27/86)
> In article <1982@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
> >> As far as I'm concerned, the test for RISCness should be: given any piece
> >> of source code, is there only one reasonable code sequence which can be
> >> output by the compiler?
> >
> >How is that a test for "RISCness"?  Following the above scheme, you could,
> >for example, have a machine that directly (and optimally) interpreted some
> >concise representation of the source code... but that's exactly what people
> >are calling a CISC machine, and are claiming is bad...
>
> That's just the problem.  The designers of CISC machines think that they
> are making the compiler writer's job easier by supplying opcodes to
> perform high-level semantics.  Unfortunately, NO ONE has yet succeeded
> in developing an instruction set which can 'directly (and optimally)'
> interpret the semantics of the source code.  What usually happens is
> that the 'high level' instruction is not as efficient as a sequence or
> loop of simpler instructions - or it is only efficient for a subset of
> its possible uses.  (As an example, consider the famous case of the VAX
> instruction for evaluating polynomials.  It was never as fast as a loop
> of simpler code to do the same job, and it was horribly inefficient in
> the case where the compiler could detect a large number of zero
> coefficients.)  As a result, compilers must be VERY complex to generate
> good code on CISC machines.
>
> At best, the CISC machines provide some instructions which are generally
> useful, as well as lots of instructions which are seldom or never used
> (consider all the weird addressing modes of modern vaxen - which most
> compilers never generate).  The idea of RISC is to keep these
> instructions which are generally useful, and make them as fast as
> possible by eliminating all the seldom-used instructions.  A useful
> side effect of RISC architecture is that it permits compiler writers
> to implement high-level semantics with simple 'code skeletons', since
> there is usually only one efficient sequence of code which reflects
> the required meaning.
>
> J. Giles
> Los Alamos

The problem with RISC, when taken to its logical conclusion, is that
you get rid of a lot of useful instructions and very useful addressing
modes.  The RISC machine I designed for my Computer Architecture course
(taught by Mr. RISC, Dave Patterson) had 24 instructions and two
addressing modes; it didn't even have a multiply.  While I was very
happy at the time that I didn't have to write microcode for a lot of
complex instructions (the previous class had to implement the Z8000
with TTL SSI and MSI parts!), my machine sure took a long time to do
useful work.  Needless to say, my RISC computer was a toy compared to
real RISC machines.

J. Giles' point is well taken that compiler writers have not yet
figured out a way of taking advantage of the numerous addressing modes
and instructions offered by CISC machines such as the 86, Z8K, 68K, or
32K families.  I would even concede that you can still have a RISC
machine that takes more than one clock to complete an instruction.
This allows an important instruction such as multiply to be included in
the instruction set.

At first glance many CISC instructions seem useless (some probably
are), but on closer inspection they generally have quite a few uses.
Take for example my favorite CISC machine (you guessed it), the 80386.
I am sure no good RISC designer would include such obscure instructions
as ASCII Adjust for Addition and Subtraction (AAA, AAS, etc.) or XLAT
(Table Look-up Translation).  Yet they all have their uses.  The ASCII
Adjust instructions are used by COBOL compilers (yes, people still use
it) and spreadsheet designers, because COBOL uses BCD and these
instructions speed the process up.  Likewise, XLAT can be used for
converting ASCII to EBCDIC in 5 clocks.

One disagreement I have with the RISC proponents is the theory that
everyone writes in a HLL.  It seems that despite years of trying to
force everyone to write in HLL languages, there will always be a few
assembly language programmers, because no matter how much performance
semiconductor and computer companies give programmers, they always want
their programs to run faster.  So these CISC instructions, while not
useful to compiler writers, are useful to assembly language jocks.

I also think that many of the addressing modes on CISC machines result
in higher performance than the simpler RISC addressing modes.  For
example, take

	long x, a[10], i;
	x = a[i];

This probably compiles into these RISC instructions, assuming x, a, and
i are in registers R1, R2, and R3 respectively:

	ShiftL	R4, R3, #2
	Add	R4, R4, R2
	Load	R1, [R4]

This is a single 4-clock instruction on the 80386 vs 3 clocks for the
RISC chip.  However, the RISC chip has had to fetch 3 instructions vs
one for the CISC processor.  So unless the RISC chip has a large
on-chip cache it will be slower.

I guess my main point is that a RISC designer, in the search to
simplify the chip's complexity, may easily throw out obscure
instructions which are useful to certain applications, or may eliminate
generally useful addressing modes.  I think that the good thing about
the RISC philosophy is that it will reduce the tendency of designers to
add new instructions or addressing modes just because they look whizzy
or Brand X has them.  If a complex way of doing something is slower
than a simple way, don't put it in.
-- 
Clif Purkiser, Intel, Santa Clara, Ca.
HIGH PERFORMANCE MICROPROCESSORS
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif

{standard disclaimer about how these views are mine and may not reflect
the views of Intel, my boss, or USENET goes here.}
jbs@mit-eddie.UUCP (Jeff Siegal) (02/27/86)
In article <5681@mordor.UUCP> ehj@mordor.UUCP (Eric H Jensen) writes:
>If you allow self-modifying code everyone would have to have a
>separate copy of their task.  Swap space is not that cheap - with a
>lot of people running emacs and maybe a lisp underneath it, kiss your
>swap area and probably system performance (unless you buy more main
>memory) goodbye....

hmmm.  What about copy-on-write access?

>I can't see how you could effectively manage a POLICY of
>"self-modifying code is okay" in a statically linked environment.
>Every sub-system that linked in a routine that modified its own code
>would inherit the 'impure' curse.  Perhaps this would not be such
>a problem in a dynamically linked environment.

I'm not sure I understand what you are saying.  Under Unix, at least,
sub-systems are not shared.  Please explain.

Jeff Siegal - MIT EECS
jer@peora.UUCP (J. Eric Roskos) (02/27/86)
> That's just the problem.  The designers of CISC machines think that they
> are making the compiler writer's job easier by supplying opcodes to
> perform high-level semantics.  Unfortunately, NO ONE has yet succeeded
> in developing an instruction set which can 'directly (and optimally)'
> interpret the semantics of the source code.

Now, this is exactly the point at which the RISC/CISC debate
degenerates into the sort of assertions more commonly found in
advertising than in research.  The fact that "NO ONE has yet succeeded"
in no way demonstrates or proves that it can't be done; this is a
simple fallacy!  It involves arguing that

	(a "good" CISC machine exists) -> (one can be built)
	~(a "good" CISC machine exists)
	:. ~(one can be built)

which would be like arguing, at the time that people were trying to
build flying machines and they were all falling apart, that this meant
that it was impossible to do.

Note that in my original posting, I didn't say "interpret the semantics
of the source code," I said interpret some *representation* of the
source code.  There is a lot of work done by compilers (e.g., resolving
symbolic references, translating infix arithmetic expressions with
ASCII constants into a more manageable form, etc.) which takes a lot of
time, and any machine (such as these microcomputers they are always
building projects for in Byte that directly execute BASIC) which takes
on that task will be much slower.

The problem with the practical examples of why "it won't work" is that
invariably one group of people designed an instruction set, and put in
features that they thought, or had been told indirectly, would be
"useful" for compilers; but this group of people knew almost nothing
about compiler writing, and so they simply did it wrong.
-- 
UUCP: Ofc:  jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peora, pesnta
US Mail:  MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
	  2486 Sand Lake Road, Orlando, FL 32809-7642
ehj@mordor.UUCP (Eric H Jensen) (02/27/86)
In article <906@megaron.UUCP> rogerh@megaron.UUCP (Roger Hayes) writes:
>(Oh, no -- I've lost my reference for this!
> If anyone knows what goes in the blank ("[]") below,
> send me mail, thanks. -rh)
>
>This (self-modifying code) is the way [] does a bitblt; the
>entry code checks for the base cases, otherwise it builds
>custom code to do the bitblt and then jumps to it.  Very
>clever -- the case to build the code is larger than the
>case within the bitblt would be, but performance is much
>higher because the code that's built can be optimal.

I would argue that this is not self-modifying code - new code is being
generated, the old code is not being modified.  We use this technique
to speed up our hardware simulator.  The user can tell the simulator to
generate code for every node in the net that will gather, operate on,
and scatter the data bits associated with that node.  The resulting
simulation is REAL fast.
-- 
eric h. jensen (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229  USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA: ehj@angband  UUCP: ...!decvax!decwrl!mordor!angband!ehj
mash@mips.UUCP (John Mashey) (02/28/86)
>M C Atkins, ukc!minster.uucp!martin writes:
> It is interesting to note that Acorn removed delayed branches when
> virtual memory support was added to the processor, because it wasn't clear
> how to restart the instruction following a branch!  Perhaps some of the
> other RISC machines have solved this, if they have I wonder if it was simple
> enough to be worth it, or not?  Does anyone know?

1) Other RISC machines have solved this.
2) Careful design is needed, but it turns out that the same careful
   design is needed to make heavily-pipelined machines work efficiently
   anyway, with clean exception handling, i.e., the virtual memory
   problem is just one example of a general class.
3) This can be done fairly simply:
   a) On an exception caused by an instruction in a branch-delay slot:
   b) suppress the branch & the following instruction.
   c) Point a register at the branch, not at the instruction following.
   d) Set a magic bit that says that's what happened (a convenience).
   e) Take the exception.
   f) On returning from the exception, re-execute the branch.
4) Was it worth it?  Yes.  With current technology, if you want to do
   heavily-pipelined, high-performance designs, delayed branches are
   Good Things.
-- 
-john mashey	UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.UUCP (John Mashey) (02/28/86)
Pat Caudill, Tektronix!tekcrl!patc writes:
> Under some circumstances you can get by with only register
> indirect load and store and load half register immediate....
> The IBM 801 allowed memory addressing only for load and store.
> But it allowed 16 bit immediate operands.  The instruction would
> operate on either the upper or lower half of the register.  At first
> I thought this was a crock until I realized it didn't take any more
> memory cycles than an instruction followed by a word of immediate
> data.  And of course it greatly simplified the instruction cycling.

1) Yes, it looks like a crock, but it works.  Sneaky compilers can even
do things like keeping the upper half of an address around in a
register for a while, thus lessening the number of fetches.

2) On this whole business of addressing modes [I'm just catching up,
since we moved buildings and were out of touch], I recommend:

	A Characterization of Processor Performance in the VAX 11/780,
	Joel S. Emer and Douglas W. Clark, both of DEC,
	Proc. 11th Ann. Symp. on Computer Architecture (Ann Arbor,
	Mich., June 5-7), IEEE, N.Y., 301-310.

These guys have done a number of excellent studies of real VAX
characteristics, and even better, have published them.  I urge people
who argue the merits of architectural features to look up some of these
papers.  For example, see Table 4 in the above paper, from which a few
columns are excerpted below:

	Operand specifier distribution:		(total)
	Register	R	41.0%	==>	41.0%

	Short literal	#n	15.8%
	Immediate	(PC)+	 2.4%	==>	18.2%

	Displacement	D(R)	25.0%
	Reg. Deferred	(R)	 9.2%
	Auto-inc	(R)+	 2.1%
	Disp. Deferred	@D(R)	 2.7%
	Absolute	@(PC)+	 0.6%
	Auto-inc-deferr	@(R)+	 0.3%
	Auto-dec	-(R)	 0.9%
				-----
				40.8%

What does this tell you?
1) Registers get used (41%).
2) So do literals (15.8% + 2.4% = 18.2%).
3) Of the modes that actually reference memory, 83% ((25.0+9.2)/40.8)
   are of the form D(R) or (R) (and the latter, of course, is 0(R)).

Note: don't read TOO much into these figures, but it is interesting to
see how little dynamic count is contributed by some of the lesser-used
modes.  Again, I recommend this paper highly.  Among other things, it
has a good analysis of average VAX instruction timing, showing that the
"average" VAX instruction uses 10.6 200ns cycles, i.e.:

	A "1-MIP" 11/780 IS REALLY A HALF-VAX-MIP CPU!

(This is why we always talk in performance relative to VAXen; counting
Mips only confuses you.)
-- 
-john mashey	UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
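The half-VAX-MIP claim is quick arithmetic from the figures quoted
above: a 200 ns cycle gives 5 million cycles per second, and at 10.6
cycles per average instruction that is roughly 0.47 native MIPS.

```python
# The arithmetic behind "a 1-MIP 11/780 is really a half-VAX-MIP CPU",
# using the paper's figures as quoted in the post.

cycle_time_ns = 200
cycles_per_instruction = 10.6

cycles_per_second = 1e9 / cycle_time_ns                    # 5,000,000
mips = cycles_per_second / cycles_per_instruction / 1e6    # ~0.47
```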
ehj@mordor.UUCP (Eric H Jensen) (03/01/86)
In article <1106@mit-eddie.UUCP> jbs@eddie.UUCP (Jeff Siegal) writes:
>In article <5681@mordor.UUCP> ehj@mordor.UUCP (Eric H Jensen) writes:
>>I can't see how you could effectively manage a POLICY of
>>"self-modifying code is okay" in a statically linked environment.
>>Every sub-system that linked in a routine that modified its own code
>>would inherit the 'impure' curse.  Perhaps this would not be such
>>a problem in a dynamically linked environment.
>
>I'm not sure I understand what you are saying.  Under Unix, at least,
>sub-systems are not shared.  Please explain.

Let's say csh uses a library routine FOOBAR which modifies its code
(assuming this is allowed).  Now consider that csh forks another csh
(called csh<2>).  csh<2> does not get the state of the csh executable
on the disk, it gets the state of its parent executable.  Furthermore,
assume that the parent does not wait for its child to exit but keeps on
running.  If csh and csh<2> share the executable and they both make use
of FOOBAR - good luck trying to find out what went wrong when your
system pukes.  So to rectify the situation, when you fork you bring in
a clean copy of the executable.

You mentioned a copy-on-write scheme.  Maybe on a machine that allows a
physical page to be mapped to more than one virtual page this could be
done right, but on a machine like the latest IBM risc, where a physical
page can only be mapped to one virtual page (this saves a (pipeline
stage) or (gobs of hardware) for determining if you hit or not), you
are forced to copy the executable into a new segment.

But the above really doesn't get at the issue.  If performance is a
goal, the hardware is going to be slower (or more expensive) if it has
to deal with self-modifying code, even if the software can handle it.
-- 
eric h. jensen (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229  USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA: ehj@angband  UUCP: ...!decvax!decwrl!mordor!angband!ehj
mc68020@gilbbs.UUCP (Tom Keller) (03/02/86)
In article <1989@peora.UUCP>, jer@peora.UUCP (J. Eric Roskos) writes:
> Note that in my original posting, I didn't say "interpret the semantics
> of the source code," I said interpret some *representation* of the source
> code.  There is a lot of work done by compilers (e.g., resolving symbolic
> references, translating infix arithmetic expressions with ASCII constants
> into a more manageable form, etc.) which takes a lot of time, and any
> machine (such as these microcomputers they are always building projects
> for in Byte that directly execute BASIC) which takes on that task will
> be much slower.

A minor point, perhaps, but in fact there is no microprocessor on the
market that "directly executes BASIC."  There are a few that include a
ROM-based BASIC interpreter on-chip.  I realize that some people might
consider me to be just a bit picky, but I don't consider this to be a
microprocessor that "directly executes BASIC."
-- 
====================================

tom keller
{ihnp4, dual}!ptsfa!gilbbs!mc68020

(* we may not be big, but we're small! *)
aglew@ccvaxa.UUCP (03/02/86)
> register indirect + 4 bit constant
4 bit constants hold only 20% of the offsets you're going to want, without
scaling, and only 30% with. These numbers are preliminary results of a
study I'm making. The distribution of the size of addressing constants
falls off roughly exponentially, with a 50% mark at 6 bits, and an 85% mark
at 8 bits.
What I'm trying to say is, you'll get much much better results if you provide
just a few more bits in the constant.
aglew@ccvaxa.UUCP (03/03/86)
We DO have the formal discipline to use self-modifying code: it's called
incremental compilation.  For some specific instance of a general routine
with all sorts of ifs depending on parameter values, take a general
prototype with all the ifs in it, compile only the parts you need into a
buffer, and then execute that buffer.

Note: this is at the module level, not the machine instruction level.
Self-modifying code at the instruction level will just always be too
machine dependent to become formalized as standard practice.  Doing
things at the module level doesn't mean that we're losing too much
efficiency, on the new machines that support function calls well.

By the way, such compilations into buffers DO NOT make programs
non-reentrant - you just have to have a concept of more than one code
(text) segment, some pure and shared, others not.
prl@ethz.UUCP (Peter Lamb) (03/04/86)
In article <1468@gitpyr.UUCP> kludge@gitpyr.UUCP (Scott Dorsey) writes:
>I wouldn't call the PDP-8 a RISC machine, but then, what do you call a machine
>that can't add?

Now that just isn't true: the PDP-8 _did_ have an add instruction.
What it was missing was any LOAD instruction (no, this is not a joke).

For those who have never seen a PDP-8 instruction set, here it is.
(I'm doing this because I think that the '8 _was_ a RISC, though the
motivations for the design choices seem to be different: this shouldn't
surprise anyone, since the first '8s were implemented with discrete
transistors - and that meant that ALUs and registers were _expensive_.)

Anyway, the 8 had a 3-bit opcode: 6 memory reference instructions, with
the other 2 opcodes used for register operate and I/O, where for these
two instructions the rest of the bits described what the instruction
actually did.  Remaining were 9 bits (12-bit word): 1 indirect bit,
1 base page/current page bit, and 7 address bits.  Fixed 128-word
pages, none of your fancy PC-relative stuff - that would have required
another _expensive_ adder, or extra cpu cycles.

The instructions were, then (some of the opcode names may be wrong,
it's been a long time...):

	TAD	Two's Complement Add		((acc) <- (acc)+(EA))
	SCA	Store and Clear Accumulator	((EA) <- (acc), (acc) <- 0)
	ANA	And Accumulator			((acc) <- (EA)&(acc))
	ISZ	Increment & Skip on Zero	((EA) <- (EA)+1, skip if zero)
	JMP	Jump				((PC) <- EA)
	JSR	Jump to subroutine		((EA) <- (PC), (PC) <- EA+1)

The accumulator operate instructions could do two accumulator
operations in one instruction: one subset in the first clock cycle of
the instruction, another in the next; generally they were arranged
sensibly.  For example, CMA (complement accumulator) was a cycle 1
opcode, and INC a cycle 2 opcode, so that negate could be done in one
instruction.  There were also opcodes for shift, rotate, and
conditional skip.
Autoincrement indirection was also there (though not as opcode bits):
there were certain locations in low memory which, when addressed
indirectly, incremented themselves (shudder.....).

Anyway, given the aim of producing a minimum-cost machine on the
available technology, I think that the PDP-8 was an _extremely good_
machine (fire-proof overalls on).  I have heard one rumor that the '8
was the first CPU to be sold for under $US10000, then later the first
under $US1000.  In its time, it was very popular indeed, and until
reasonably recently (order mid-late 1970's) there were still uProc
versions of the 8 inside DEC products.

The important thing, in this discussion, though, is that a similar
philosophy was at work - the instruction set was stripped down to a
bare minimum: here, to simplify the hardware and reduce cost by
reducing the device count; in modern RISC, to reduce design time and
(individual) instruction execution time.  But both with the aim (and
in the case of the '8, successfully) of getting better
cost-performance.

Mind you, I wouldn't want to program one in assembler though :-) .
--
Peter Lamb	({seismo|decvax}!mcvax!cernvax!ethz!prl)
Institut fur Informatik
ETH-Zentrum
8092 Zurich
david@ztivax.UUCP (03/04/86)
/* Written 6:57 am Feb 27, 1986 by clif@intelca in ztivax:net.arch */

I will try to keep from flaming, but clif@intelca, your claims are
wrong.  I do not work in your high performance processor lab, but I
sometimes have to write code for those awful things Intel makes.  Do
not let your loyalties to your employer cloud your thinking.

>> ... What usually happens is
>> that the 'high level' instruction is not as efficient as a sequence or
>> loop of simpler instructions

> I am sure no good RISC designer would include such obscure instructions
>as Ascii Adjust for Addition, Subtraction, AAA,AAS etc. or XLAT
>(Table Look-up Translation).

Yes, you are right.  This is a perfect example of a useless
instruction.  If you code the action in low-level 80*86 instructions,
you will have a faster loop.  You may claim that the instruction
itself is fast, but look at all that silly setup which must first be
done to even use this obscure example of brain damage.

> One disagreement I have with the RISC proponents is the theory that
> everyone writes in a HLL.  It seems that despite years of trying to force
> everyone to write in HLL languages there will always be a few assembly
> language programmers.  Because no matter how much performance Semiconductor
> and Computer companies give programmers they always want their programs
> to run faster.

When optimizing code, I, like most others, have found that 1) it's in
the algorithm, not the instructions, and 2) the super whiz-bang
instructions are expensive due to special setup and slow hardware /
micro-code implementation.  And, if I am going to write assembler, I
always find it easier and more maintainable if done using a regular
orthogonal instruction set, not using the obscure bells and whistles.

> So these CISC instructions while not useful to compiler writers are
> useful to assembly language jocks.

Really?
I have found, as have many others, that assembler programmers are like
high-level language programmers: they use convenient subsets of the
language.

And for the quick clock times you quote for the 80*86 family, you
should include all the time it takes to perform the loads of operands
into the specific registers required.  You will be surprised, and I
hope, enlightened.

.... End of flame

However, clif, a good question you have implied: Does anyone do
assembler programming on RISC machines?

For example: Is UNIX on RISC entirely in C?  How do the stack recovery
mechanisms work?  Interrupted / failed instructions?  CPU mode
register control manipulation (if applicable)?

David Smyth, Free and proud of it.
rudell@cad.UUCP (Richard Rudell) (03/04/86)
> ...... What usually happens is
> that the 'high level' instruction is not as efficient as a sequence or
> loop of simpler instructions - or it is only efficient for a subset of
> its possible uses. ( As an example, consider the famous case of the VAX
> instruction for evaluating polynomials. It was never as fast as a loop
                                             ^^^^^^^^^^^^^^^^^^^^^^^
> of simpler code to do the same job and it was horribly inefficient in
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> the case where the compiler could detect a large number of zero
> coefficients.) As a result, compilers must be VERY complex to generate
> good code on CISC machines.

When in doubt, run a test.  The test program follows the message.
I tested the following:

	1) VAX polyd instruction
	2) simple loop in C to evaluate a polynomial
	3) simple loop in assembler using register variables
	4) unrolled loop in assembler (i.e., no branches)

The test was run for a degree 4 polynomial with no zero coefficients.
Output from a VAX 11/785 with FPA:

	count is 1000000
	Time for null loop was 1.0 microseconds
	Time for polyd was 15.3 microseconds
	21.128271
	Time for poly in C was 36.8 microseconds
	21.128271
	Time for poly asm loop was 28.5 microseconds
	21.128271
	Time for poly asm loop unrolled was 22.6 microseconds
	21.128271

Certainly seems to me that the polyd runs faster than the best I can
do in assembly language on a VAX.

Results summarized from a VAX 8650 and a MicroVax-II (with FPA) are:

			VAX 8650	MicroVax-II
	VAX polyd	 5.1 us		30.4 us
	C code		 8.7 us		69.2 us
	asm loop	 7.5 us		57.9 us
	asm unrolled	 6.0 us		48.4 us

I have been told that polyd is NOT emulated on the MicroVax-II, and
these results seem to confirm this.  The difference is less on the
8650, but still noticeable.

I will not argue the point of whether polyd is a frequent enough
instruction to warrant being included in an instruction set, nor
whether a VAX might have a cycle time n% less if polyd weren't
included.  Nor will I comment on the efficiency or inefficiency of
evaluating a polynomial with many zero coefficients in this way.
Lastly, I will not comment on the difficulty of a compiler generating
code for an instruction which changes 6 of the 16 general purpose
registers.

My only point is that the VAX polyd instruction is FASTER than the
best one can do in VAX assembler given all nonzero coefficients.
Period.  So why is the VAX poly instruction so famous?

					Richard Rudell.

--------- program follows for those interested -----------

/* quick hack to time polynomial instruction */
/* WARNING: highly VAX 4.3bsd specific !!! */
#include <stdio.h>
#include <sys/types.h>
#include <sys/times.h>

double ptime()
{
    struct tms buffer;
    double time;
    times(&buffer);
    time = buffer.tms_utime / 60.0;
    return time;
}

#define TIME(string, action) {\
    double time = ptime(), timex; register int i_;\
    for(i_ = 0; i_ < count; i_++) {action;}\
    time = ptime() - time - offset;\
    printf("Time for %s was %3.1f microseconds\n", string, time/count*1e6);\
    if (strcmp(string, "null loop") == 0) offset = time;\
}

double coef[12] = {1.0,-2.0,3.0,-4.0,5.0,-6.0,7.0,-8.0,9.0,-10.0};

main(argc, argv)
int argc;
char **argv;
{
    double f,					/* -8(fp) */
	g=3.141512341222419283749239849839842,	/* -16(fp) */
	h=2.7182812454219283748238748392,	/* -24(fp) */
	offset=0.0;				/* -32(fp) */
    register int i, count;

    for(i = 0; i < 12; i++) coef[i] /= h;	/* mix things up a little */

    if (argc >= 2) count = atoi(argv[1]); else count = 100000;
    printf("count is %d\n", count);

    TIME("null loop", ;);

    TIME("polyd",
	asm("polyd -16(fp),$4,_coef");
	asm("movd r0,-8(fp)"));
    printf("%f\n", f);

    TIME("poly in C",
	f=coef[0]; for(i=1;i<=4;i++) f=f*g+coef[i];);
    printf("%f\n", f);

    TIME("poly asm loop",
	asm("moval _coef,r3");
	asm("movd (r3),r0");
	asm("movd -16(fp),r4");
	asm("movl $1,r2");
	asm("Lxx: muld2 r4,r0");
	asm("addd2 (r3)[r2],r0");
	asm("aobleq $4,r2,Lxx");
	asm("movd r0,-8(fp)"));
    printf("%f\n", f);

    TIME("poly asm loop unrolled",
	asm("moval _coef,r3");
	asm("movd -16(fp),r4");		/* move arg to r4 */
	asm("movd (r3)+,r0");
	asm("muld2 r4,r0");
	asm("addd2 (r3)+,r0");
	asm("muld2 r4,r0");
	asm("addd2 (r3)+,r0");
	asm("muld2 r4,r0");
	asm("addd2 (r3)+,r0");
	asm("muld2 r4,r0");
	asm("addd3 (r3),r0,-8(fp)"));
    printf("%f\n", f);
}
franka@mmintl.UUCP (Frank Adams) (03/04/86)
In article <187@anwar.UUCP> mlopes@anwar.UUCP (Marco Lopes) writes:
>In my earlier posting, I assumed the existence of absolute addressing.
>Therefore, a minimum set of addressing modes is:
>
>	register
>	register indirect
>	absolute
>
>Another functionally complete set would have immediate instead of absolute.

You should seriously consider using immediate instead of absolute.
One is inclined to assume it is inferior, but this is not obvious to me.

Actually, I would recommend using both immediate and absolute
addressing modes.  At this point you have two one-bit decisions:
register or immediate, and direct or indirect; absolute is equivalent
to immediate indirect.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108
franka@mmintl.UUCP (Frank Adams) (03/04/86)
In article <169@ubc-cs.UUCP> ludemann@ubc-cs.UUCP (Peter Ludemann) writes:
>As far as I'm concerned, the test for RISCness should be: given any
>piece of source code, is there only one reasonable code sequence which can be
>output by the compiler?

This is not a possible objective.  Consider sequences like:

	A = B[I];
	C = D[I];

On any machine which has registers, it is possible to do this by
loading I into a register, doing an address calculation for B[I]
(which may be combined with the next instruction), and moving the
value of B[I] to A (perhaps via a register).  One can then perform the
same calculation for the second statement.  Alternatively, one can
keep the value of I (possibly shifted) in a register after the first
statement, and use it again in the second.  Either of these is a
reasonable code sequence to be generated; it depends on how much
optimization you want to do.

This is only one simple example; there are arbitrarily complex ones.
The problem of code optimization is undecidable in general.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108
jlg@lanl.UUCP (03/05/86)
In article <78@cad.UUCP> rudell@cad.UUCP (Richard Rudell) writes:
...
>Lastly, I will not comment on the difficulty of a compiler generating
>code for an instruction which changes 6 of the 16 general purpose
>registers.
>
>My only point is that the VAX polyd instruction is FASTER than the best
>one can do in VAX assembler given all nonzero coefficients. Period.
>So why is the VAX poly instruction so famous ?

You would have convinced me if the benchmarks you did had included the
necessary set-up code for each different possible coding, as well as
use in real code where register scheduling conflicts must be weighed
in the balance.

I didn't claim that the individual instruction was always less
efficient (maybe I should have phrased it differently).  Sometimes it
is less efficient, but the hardware types wouldn't have provided it if
its failings were really obvious.  The problem is that the instruction
requires extra set-up that ordinary code sequences don't, and (as you
say) uses up 6 registers.  Since your benchmark doesn't include timing
of these in a real code environment, it doesn't really measure the
speed of the instruction.

J. Giles
Los Alamos
jdz@wucec2.UUCP (03/05/86)
> The TRUE RISC would have only one instruction: subtract. With only one
> instruction, you wouldn't need an op-code field; every instruction is just
> a pair of addresses (what to subtract from what). An "add" subroutine
> would need 3 instructions. With memory-mapped I/O and the above convention
> for conditional branching, you'd have it all.

Not quite.  You need two addresses, you see.  If we go to two
instructions, we can get away with only one address.  To wit:

	SUBS X	Subtract contents of location X from accumulator and
		store the result in both the accumulator and X
	JUMP X	Indirect jump; i.e. PC <- contents of location X

Proving this beast to be functionally complete is arduous but do-able.
We use it as an exercise in our Computer Architecture class.  The
machine is called the TWINC, for TWo INstruction Computer.  We have
surmised the existence of an OINC, but have not successfully proved
functional completeness (the conditional branch is usually the
problem).  You describe an OINC with two addresses - can you do it
with only one?

One student did a VLSI design of the TWINC for LSI design class; we're
thinking about fabrication by MOSIS.  Only did an 8-bit word, though.

I don't have the context for the comment about "the above convention
for conditional branching" - could someone mail it to me?
--
Jason D. Zions			...!{seismo,cbosgd,ihnp4}!wucs!wucec2!jdz
Box 1045 Washington University
St. Louis MO 63130 USA
(314) 889-6160
Nope, I didn't say nothing.  Just random noise.
hansen@mips.UUCP (Craig Hansen) (03/06/86)
> For those who have never seen a PDP-8 instruction set, here it is
> The instructions were, then (some of the opcode names may be wrong,
> it's been a long time...):
>	TAD	Two's Complement Add ((acc) <- acc+(EA))
>	SCA	Store and Clear Accumulator ((EA)<-(acc), (acc)<-0)
>	ANA	And Accumulator ((acc)<-(EA)&(acc))
>	ISZ	Increment & Skip on Zero ((EA)<-(EA)+1, skip if zero)
>	JMP	Jump (PC)<-EA
>	JSR	Jump to (EA)<-(PC),(PC)<-EA+1

More correctly, these should be:

	AND	And
	TAD	Two's Complement Add
	ISZ	Increment and Skip if Zero
	DCA	Deposit and Clear Accumulator
	JMP	Jump
	JMS	Jump to Subroutine

> Mind you, I wouldn't want to program one in assembler though :-) .
> --
> Peter Lamb	({seismo|decvax}!mcvax!cernvax!ethz!prl)
> Institut fur Informatik
> ETH-Zentrum
> 8092 Zurich

Actually, I found the machine really enjoyable to program in
assembler.  It was an instruction set that one could really learn
completely, down to knowing exactly what the assembler could generate.
It was common practice to locate the targets of AND instructions in
memory so that the AND instruction itself was a useful character
constant.  Using the "microcoded" operate instructions, you could
generate any of the constants 0,1,2,3,4,6,-1,-2,-3 in a single
instruction.

One of the more unusual programs written for the machine was a
two-voice music generator that generated its output on an FM receiver
placed close to the front panel of the machine!  Try that on your
favorite multi-user, virtual memory computer!

Because of the small page sizes, and the difficulty of generating
relocatable code, I never encountered a really practical compiler for
the PDP-8.  Most so-called "compilers" generated pseudo-code, or
virtual machine code, that was interpreted.

In the realm of exploring the extremes of the price/performance
relationship, DEC built the PDP-8/S, which was a bit-serial
implementation of the PDP-8!  In later years, the PDP-8 line suffered
from "creeping featurism."
For some reason, designers added an instruction which would swap the high and low-order 6 bits of the 12-bit accumulator, called "byte swap." I've got a lot of old code for the eight, but without a DECtape drive or a paper tape reader, I can't make too much use of it. Sigh. I could run a software emulator for the eight in real time nowadays, but it would never generate the same pattern of electrical noise on an FM radio.... Craig Hansen MIPS Computer Systems ...{decwrl,glacier}!mips!hansen
csg@pyramid.UUCP (Carl S. Gutekunst) (03/06/86)
In article <2900001@ztivax.UUCP> david@ztivax.UUCP writes:
>However, cliff, a good question you have implied: Does anyone do assembler
>programming on RISC machines?
>
>For example: Is UNIX on RISC entirely in C? How do
>the stack recovery mechanisms work? Interrupted / failed instructions?
>CPU mode register control manipulation (if applicable)?

For Pyramid's RISC CPU, we do need to do a little programming in
assembler -- very little, actually.  The first few words of all
interrupt handlers are written in assembler, as are a few error
handlers.  But most of these promptly call C functions to do the real
work.  Even the lowest-level I/O can be done from C.  I'd suspect the
same is true of other commercial RISC processors.

A microcosmic but interesting comment on RISC comes from our string
libc functions, which some of the local speed freaks have implemented
in assembler.  Even with just the peephole optimizer, the only real
difference between the compiler's versions and the optimal hand
assembly versions is a handful of extraneous instructions that get
executed once.  RISC machines just don't have any clever kludges in
the instruction set for moving bytes faster than what the compiler can
easily figure out for itself.  I've generally seen this carry over to
more complex tasks as well.

Whether this is a Good Thing or a Bad Thing, I'll let the rest of you
debate.  I'm one of those good ol' assembler hackers.  Programming in
assembler on a RISC machine is extremely boring.  But it seems to make
the compilers happy. :-)
--
Carl S. Gutekunst   {allegra,cmcl2,decwrl,hplabs,topaz,ut-sally}!pyramid!csg
Pyramid Technology Corp, Mountain View, CA  +1 415 965 7200
jer@peora.UUCP (J. Eric Roskos) (03/06/86)
> A minor point, perhaps, but in fact there is no microprocessor on the > market that "directly executes BASIC." I wondered if anybody would question my statement. Actually, I maintain that the particular microprocessor I'm thinking of (I forget the part number; it is some Zilog single-chip microcomputer which Steve Ciarcia writes a lot of his projects for) does indeed directly execute BASIC. It is true that there is a BASIC interpreter in ROM, but when you start up the computer, it executes this interpreter. This is no different from a more familiar microprogrammed computer, e.g. an IBM 370. Most (if not all) 370s execute an interpreter which interprets the 370 instruction set; they don't execute it directly in hardware. Of course, the Zilog machine has another instruction set in which the BASIC interpreter is written. So there's another level of interpretation involved. But that's also true of some other microprogrammed machines. It might be, in fact, that the microprogram for that instruction set is implemented in a PLA, as a finite state machine, so you get into this area of fine distinctions if you try to draw the line in terms of what the "language" looks like in which it's programmed, i.e., when is it "hardware" vs. "firmware" or "software". Now, you could argue that the Zilog machine "actually" executes the simpler instruction set, because you can jump into it from BASIC and have machine-language programs. I.e., there's a way to get to the lower-level machine-language, so that must be the "real" language of the machine. But the machine I'm writing this on executes an instruction set on which you can execute an "ECS" instruction and jump into a user-written *microcode* routine. 
I doubt many people would want to think of the microcode as the "real" language of this machine, though (though of course you could argue that it is, just as you could argue that the PLAs in the microprocessor provide the "real" language for that machine, since that's what the hardware executes). So I will have to stand by my position that the language directly executed by a machine is the language the machine presents to its user when it is first started, the language it interprets when it first goes out to its primary memory to begin executing a program. (Note that this is meant as an informal sort of definition, so if you have another definition of "directly executed" -- which may well be the case -- then you might disagree on my choice of definition. I.e., I'm not using someone else's definition of the term and trying to apply it to this particular machine. I'm just explaining the concept I meant in my original posting. Also, it *is* obviously just an informal definition, since you could argue different ways about a machine that supports multiple instruction sets, as well as a few other things I won't mention since they're minor points.) -- UUCP: Ofc: jer@peora.UUCP Home: jer@jerpc.CCUR.UUCP CCUR DNS: peora, pesnta US Mail: MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company) 2486 Sand Lake Road, Orlando, FL 32809-7642 LOTD(5)=O ---------------------- Amusing error message explaining reason for some returned mail recently: > 554 xxxxxx.xxxxxx.ATT.UUCP!xxx... Unknown domain address: Not a typewriter (The above message is true... only the names have been changed...)
david@ztivax.UUCP (03/06/86)
/* Written 4:27 pm Feb 24, 1986 by jer@peora in ztivax:net.arch */
>> As far as I'm concerned, the test for RISCness should be: given any piece
>> of source code, is there only one reasonable code sequence which can be
>> output by the compiler?

>How is that a test for "RISCness"?  Following the above scheme, you could,
>for example, have a machine that directly (and optimally) interpreted some
>concise representation of the source code... but that's exactly what people
>are calling a CISC machine, and are claiming is bad...

Sorry jer, but those CISC machines have some super instructions which
are of LIMITED SCOPE, and therefore CANNOT be directly and optimally
substituted for the source code.  Instead you must perform many
complicated what-ifs before deciding whether you can use the
instruction more easily than the low level instructions which can
ALWAYS be used.

For example, let's say you were going to do ANYTHING with an Intel
(aka Brain Damage (*$Q##!)) 80*8*.  Look through the book, you'll find
some super instruction.  Only problem is, how many registers must you
store (and where?) to set up for it, and how many do you need to
restore?

Sometimes (if you just happened to have only garbage in the
registers), it IS optimal to do character translations with that XLT
(or is it TLX, or #%&*%?).  Often (typically) it is not.

The thing is: It's got the fancy instructions, and the simple
instructions, and SOMETIMES the fancy ones fit the specific high level
functions, and SOMETIMES they don't.  And that is a pain in the ass.

david smyth		seismo!unido!ztivax!david
free and proud of it
david@ztivax.UUCP (03/06/86)
/* Written 3:11 pm Feb 27, 1986 by jer@peora in ztivax:net.arch */
>> ... Unfortunately, NO ONE has yet succeeded
>> in developing an instruction set which can 'directly (and optimally)'
>> interpret the semantics of the source code.

>Now, this is exactly the point at which the RISC/CISC debate degenerates
>into the sort of assertions more commonly found in advertising than in
>research.  The fact that "NO ONE has yet succeeded" in no way demonstrates
>or proves that it can't be done; this is a simple fallacy!

Sorry, but that is the way it is.  No, I am not flaming.  Software is
never bug-free unless it is small enough and concise enough to be
proven.  For example, the inner loop of a parser can be proven, as it
only has about 40 lines of code and 3-5 branch points.  The grammar
can be proven, because the finite number of states is small.  Typical
software has bugs because it's so damn complex.

No one will succeed in making a CISC which CAN interpret a "high level
language" without SIGNIFICANT BUGS because it CANNOT BE DONE without a
SIGNIFICANT breakthrough in mathematics and computer science.

Just ask Intel.  Whatever happened to the 432 (Ada in Silicon)?
Doesn't work, for the same reason all Ada compilers have bugs.  Too
big, too complex.  We (mankind, computer scientists, etc.) do not have
the technology.  It is not on the horizon.  (Well, maybe I have an
idea... soon I may publish...)

David Smyth		seismo!unido!ztivax!david
Free and proud of it
kludge@gitpyr.UUCP (Scott Dorsey) (03/07/86)
In article <5100016@ccvaxa> aglew@ccvaxa.UUCP writes:
>4 bit constants hold only 20% of the offsets you're going to want, without
>scaling, and only 30% with. These numbers are preliminary results of a
>study I'm making. The distribution of the size of addressing constants
>falls off roughly exponentially, with a 50% mark at 6 bits, and an 85% mark
>at 8 bits.
>
>What I'm trying to say is, you'll get much much better results if you provide
>just a few more bits in the constant.

Personally, as an assembly hacker, I'd love a larger constant.  The
question is whether the increase in power from the larger field is
worth its cost (a decrease in the number of other instructions able to
be decoded.  No flames from RISC advocates, please).  An 8-bit
constant would certainly be valuable, especially for character
manipulation work.  Unless you already had hardware on board to break
up 5/6/7-bit fields, these would be hard to work with, however.  Would
it be worth the extra stuff?  Would it be better to just design the
5/6/7-bit decoding into the instruction set properly?  I don't know...
but I enjoy the 32032, and would give up many things to have even a
four-bit field on the machines I am programming today.

------
Disclaimer: Everything I say is probably a trademark of someone.  But
don't worry, I probably don't know what I'm talking about.

Scott Dorsey		Kaptain_kludge
ICS Programming Lab, Rich 110,
Georgia Institute of Technology, Atlanta, Georgia 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge
USnail: Box 36681, Atlanta GA. 30332
throopw@dg_rtp.UUCP (03/07/86)
>> [discussion of self-modifying code for conditional branching]
> [discussion of effect of self-modification on address space]
> So to rectify the situation when you fork, you bring in
> a clean copy of the executable.

Uh... I'm not sure what "fork" does on whatever OS *you're* talking
about, but in Unix(*), the fork(2) system call does *not* (repeat
*not*) want to bring in a "clean copy" of the executable.  The address
space *as* *modified* *by* *the* *parent* must be seen by the child.
On the other hand, it is clear that memory containing modifiable bits,
code or not, cannot be in the "text" or "shareable" segment (unless
special fault-driven mechanisms are available for revocation of
sharing).

> You mentioned a copy on write scheme. Maybe on a machine that allows a
> physical page to be mapped to more than one virtual page this could be
> done right, but on a machine like the latest IBM risc a physical page
> can only be mapped to one virtual page (this saves a (pipeline
> stage) or (gobs of hardware) for determining if you hit or not) you
> are forced to copy the executable into a new segment.

Interesting.  The IBM risc can't do shared text segments?  How odd.
Or perhaps by "a physical page can only be mapped to one virtual page"
you mean "all virtual pages that are mapped to a given physical page
must have the same virtual address"?  This latter interpretation would
allow shared text, while the original doesn't seem to.

> But the above really doesn't get at the issue. If performance is a
> goal, the hardware is going to be slower (or more expensive) if it has
> to deal with self-modifying code even if the software can handle it.

I agree that this is so for *some* classes of hardware, and that these
classes of hardware include most widely-known-to-be-practical
high-performance designs.  Thus, though I think that some of what you
said in getting to your conclusion is inaccurate, I agree with your
conclusion.

> eric h. jensen   ehj@angband   ...!decvax!decwrl!mordor!angband!ehj

* Unix is a trademark of ATT
--
Wayne Throop at Data General, RTP, NC
<the-known-world>!mcnc!rti-sel!dg_rtp!throopw
debray@sbcs.UUCP (03/07/86)
> ...... What usually happens is
> that the 'high level' instruction is not as efficient as a sequence or
> loop of simpler instructions - or it is only efficient for a subset of
> its possible uses.

The problem is mostly that it's harder to provide efficient
implementations for "high level" instructions than it is for simple,
"low level" ones.  Some time back, when working on a Prolog machine
emulator on a Vax 780, we found that it was _significantly less
efficient_ to use a single "Test, Set and Branch" instruction than it
was to use three separate instructions to test, set and branch
respectively.  This seemed very counterintuitive until we realized
that instructions with 2 operands or fewer had their operands decoded
in hardware, while instructions with more than 2 operands (e.g. the
"test-set-branch" instruction) had to go to microcode for this.  I
guess the extra instruction fetches for the case using three separate
instructions overlapped sufficiently, due to pipelining, to give the
hardware decoding an overwhelming edge.

On the other hand, I think that if efficient implementations could be
provided for some of these "high level" instructions, it wouldn't be a
bad thing entirely.  In the same Prolog machine emulator, we looked at
the instruction sequences generated for various programs and coalesced
a couple of the more common instruction sequences to form
"macro-instructions".  The performance improvement from this was not
unimpressive.  As far as generating these complex instructions went,
our compiler simply generated the old, dumb code to begin with, then
converted to the smarter code in one pass of peephole optimization.
--
Saumya Debray
SUNY at Stony Brook

	uucp: {allegra, hocsd, philabs, ogcvax} !sbcs!debray
	arpa: debray%suny-sb.csnet@csnet-relay.arpa
	CSNet: debray@sbcs.csnet
mash@mips.UUCP (03/08/86)
David Smyth writes:
> However, cliff, a good question you have implied: Does anyone do assembler
> programming on RISC machines?
>
> For example: Is UNIX on RISC entirely in C? How do
> the stack recovery mechanisms work? Interrupted / failed instructions?
> CPU mode register control manipulation (if applicable)?

1) Does anyone do assembler on RISC machines? Yes. For what it's worth, a well-designed RISC is relatively easy to program in assembler when that makes sense. After all, in those cases, what you're doing is writing microcode (more vertical than horizontal, but with some idea of overlap).

2) Is UNIX on RISC entirely in C? No. On almost any machine, there are bound to be instructions that are hard to get to from C. Also, there are a few time-critical pieces of code where the programmer knows attributes of the problem that no compiler can reasonably deduce. Example: for string & block copy loops, there is usually a tradeoff between a faster loop with longer setup time, and a slower loop that gets started quicker. It's very hard for a compiler to guess the distribution.

3) Thus, there is still room for selected assembler code. However, from experience with assembler (on S/360s, PDP-11s, VAXen, 68K's, and our R2000), in most cases I'd rather have fast instructions that are easy to use and give me more access to the machine's pipeline, than most complex ones. Most memory-intensive multi-cycle instructions (like string instructions) run just as fast when hand-coded, because they're memory limited anyway; a lot of them don't really match what you need in some cases: for example, look at all of the mem* and str* routines in System V, and see how well the string operations of most machines do them. About the only win for complex instructions is when they support a datatype (like floating point) that is both frequently used in the application at hand, and difficult to simulate cheaply.
-- -john mashey UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash DDD: 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
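[Moderator's sketch: the setup-vs-startup tradeoff John describes for copy loops, modeled in Python. The 16-byte threshold and the 8-byte "word" size are invented numbers; slice assignment stands in for a word-at-a-time move.]

```python
def copy_bytes(dst, src, n, threshold=16):
    # Short copies use the slow loop, which starts immediately with no
    # setup; long copies pay the setup cost of chunked "word" moves.
    if n < threshold:
        for i in range(n):
            dst[i] = src[i]
        return
    chunk = 8                               # stand-in for the word size
    i = 0
    while n - i >= chunk:
        dst[i:i + chunk] = src[i:i + chunk]  # fast bulk move
        i += chunk
    for j in range(i, n):                    # leftover tail bytes
        dst[j] = src[j]
```

As the article says, picking the right threshold needs knowledge of the actual length distribution, which a compiler can rarely guess.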
aglew@ccvaxa.UUCP (03/09/86)
> /* Written 7:22 am Mar 6, 1986 by jer@peora.UUCP in ccvaxa:net.arch */ > It might be, in fact, that the microprogram for that instruction set is > implemented in a PLA, as a finite state machine, so you get into this area ^^^ > of fine distinctions if you try to draw the line in terms of what the > "language" looks like in which it's programmed, i.e., when is it "hardware" > vs. "firmware" or "software". Just thought that you might be amused to try macroprogramming in a PLA. I implemented the Quine-McCluskey PLA compression algorithm, with some tree search pruning techniques to speed things up, and was looking around for things to test it on. You never have a large FSA when you need one. For a lark, I applied it to a whole slew of Z80 assembler programs (actually, the machine code) that somebody had left lying around on my system: 30% compression. Of course, it took a week to do it.
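[Moderator's sketch: the core of Quine-McCluskey — repeatedly merging implicants that differ in exactly one bit — in a few lines of Python. This finds prime implicants only; the covering step and the tree-search pruning the article mentions are omitted.]

```python
def combine(a, b):
    # merge two terms (strings over '0', '1', '-') if they differ in
    # exactly one specified bit position
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diff) == 1 and '-' not in (a[diff[0]], b[diff[0]]):
        return a[:diff[0]] + '-' + a[diff[0] + 1:]
    return None

def prime_implicants(minterms, nbits):
    terms = {format(m, '0%db' % nbits) for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a in terms:
            for b in terms:
                c = combine(a, b)
                if c:
                    merged.add(c)
                    used.add(a)
                    used.add(b)
        primes |= terms - used          # unmerged terms are prime
        terms = merged
    return primes
```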
mahar@fear.UUCP (mahar) (03/10/86)
> can get away with only one address. To wit:
>
>	SUBS X	Subtract contents of location X from accumulator and
>		store results in accumulator and in X
>
>	JUMP X	Indirect jump; i.e. PC <= contents of location X
>
> Proving this beast to be functionally complete is arduous but do-able.
> We use it as an exercise in our Computer Architecture class. The machine is
> called the TWINC, for TWo INstruction Computer. We have surmised the existence
> of an OINC, but have not successfully proved functional completeness (the
> conditional branch is usually the problem). You describe OINC with two addr. -
> can you do it with only one?

Yes. The instruction is reverse subtract and skip if borrow. There is no opcode field, only the address field. The instruction subtracts the accumulator from the memory operand. The result is stored in both the accumulator and the memory operand, and the next instruction is skipped if a borrow occurred. Memory location 0 is the PC. A friend of mine built one of these out of TTL. He called it URISC, for Ultimate Reduced Instruction Set Computer. It uses about 8 standard 7400 series TTL packages. In this implementation the PC went backwards, but that is not necessary. He wrote a macro assembler which implemented a usable set of instructions. I know this design is functionally complete. I don't have a description with me, so I can't prove it now. If you're really interested I can mail you a listing of the macro package.
-- 
	Mike Mahar
	UUCP: {turtlevax, cae780}!weitek!mahar

Disclaimer: The above opinions are, in fact, not opinions. They are facts.
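[Moderator's sketch: a simulator for the URISC as Mike describes it. The 16-bit word width and the forward-moving PC are assumptions for clarity (his implementation ran the PC backwards); jumps would be done by targeting word 0, which this demo does not exercise.]

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1

def urisc(mem, steps):
    # Each instruction word is just an address X.  The machine computes
    # mem[X] - acc, stores the result in both acc and mem[X], and skips
    # the next word if the subtraction borrowed.  Word 0 is the PC.
    acc = 0
    for _ in range(steps):
        pc = mem[0]
        x = mem[pc]
        mem[0] = (pc + 1) & MASK            # advance PC past this word
        result = mem[x] - acc
        borrow = result < 0
        acc = mem[x] = result & MASK
        if borrow:
            mem[0] = (mem[0] + 1) & MASK    # skip the next instruction
    return mem, acc

mem = [2, 5, 1, 4, 0] + [0] * 11   # PC=2; constant 5 at word 1; zero at word 4
# program word 2 -> address 1: loads 5 into acc (acc starts at 0, no borrow)
# program word 3 -> address 4: 0 - 5 borrows, leaving acc = -5 mod 2**16
mem, acc = urisc(mem, steps=2)
```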
dalel@athena.UUCP (Dale Lehmann) (03/10/86)
------------
In article <1441@wucec2.UUCP> jdz@wucec2.UUCP (Jason D. Zions) writes:
>>
>> The TRUE RISC would have only one instruction: subtract....
>
>Not quite. You need two addresses, you see. If we go to two instructions, we
>can get away with only one address. To wit:
>
>	SUBS X	Subtract contents of location X from accumulator and
>		store results in accumulator and in X
>
>	JUMP X	Indirect jump; i.e. PC <= contents of location X
>
>Proving this beast to be functionally complete is arduous but do-able.
>We use it as an exercise in our Computer Architecture class....

When I was working as a lab instructor for a computer class at Oregon State University, two of the students in the class did something similar to this. They had run across the claim that the minimum instruction set a computer needed was "subtract one" and "branch on negative". They proceeded to write a simulator for the PDP-8 instruction set using only two instructions: ISZ (increment memory, skip if result is zero) and JMP. Talk about your extra-credit projects!

-- 
Dale Lehmann
Tektronix, Inc.
UUCP: ...!tektronix!teklds!dalel
USMail: P.O. Box 4600, Beaverton, Oregon 97075
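[Moderator's sketch: a toy two-instruction machine in the spirit of that exercise — ISZ (increment a 12-bit memory word, skip the next instruction if it wraps to zero) and JMP. Program and data memory are kept separate here for clarity, which the real PDP-8 of course did not do; treating a jump-to-self as a halt is also an invention.]

```python
def run(prog, data, max_steps=100000):
    pc = 0
    for _ in range(max_steps):
        op, a = prog[pc]
        if op == 'ISZ':
            data[a] = (data[a] + 1) & 0o7777
            pc += 2 if data[a] == 0 else 1   # skip on wrap to zero
        elif op == 'JMP':
            if a == pc:                      # jump-to-self: halt
                break
            pc = a
    return data

# add 3 to b by counting a negative counter up to zero
data = {'cnt': (-3) & 0o7777, 'b': 10}
prog = [('ISZ', 'b'),    # 0: b += 1 (b stays small, so never skips)
        ('ISZ', 'cnt'),  # 1: skips the JMP once cnt wraps to 0
        ('JMP', 0),      # 2: loop
        ('JMP', 3)]      # 3: halt
run(prog, data)
```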
ludemann@ubc-cs.UUCP (Peter Ludemann) (03/10/86)
In article <1163@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes: >In article <169@ubc-cs.UUCP> ludemann@ubc-cs.UUCP (Peter Ludemann) writes: >>As far as I'm concerned, the test for RISCness should be: given any >>piece of source code, is there only one reasonable code sequence which can be >>output by the compiler? > >This is not a possible objective. Consider sequences like: > > A = B[I]; > C = D[I]; > >On any machine which has registers, it is possible ... > (description of optimisations possible) You miss my point. The optimisations described (in this case common expression detection) are possible *regardless* of the target machine's architecture. Any good compiler should try to optimise code like this. However, once these "source level" and flow analysis optimisations are finished, there should be only one reasonable way of generating the machine code. Otherwise, the poor compiler writer has to do another set of optimisation analyses, all because of the target machine's instruction set. As an aside, the MVCL (move character long) instruction on the IBM/370 machines was actually slower than an equivalent loop which used a simple block move instruction. Although later models improved the situation, it is interesting to note that IBM's "optimizing" PL/1 compiler still generates a subroutine which does *exactly* what a MVCL instruction does (even with the same register usage conventions). -- -- Peter Ludemann ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann) ludemann@cs.ubc.cdn (ludemann@cs.ubc.cdn@ubc.mailnet) ludemann@ubc.csnet (ludemann%ubc.csnet@CSNET-RELAY.ARPA)
paulr@decwrl.DEC.COM (Paul Richardson) (03/10/86)
In article <90@sbcs.UUCP> debray@sbcs.UUCP (Saumya Debray) writes:
>Some time back, when working on a Prolog machine emulator on a Vax 780, we
>found that it was _significantly less efficient_ to use a single
>"Test, Set and Branch" instruction than it was to use three separate
>instructions to test, set and branch respectively. This seemed very
>counterintuitive until we realized that instructions with 2 operands or
>fewer had the operands decoded in hardware, while instructions with more
>than 2 operands (e.g. the "test-set-branch" instruction) had to go to
>microcode for this. I guess the extra instruction fetches for the case
>using three separate instructions overlapped sufficiently due to pipelining
>to give the hardware decoding an overwhelming edge.
>
>Saumya Debray
>SUNY at Stony Brook
>
>	uucp: {allegra, hocsd, philabs, ogcvax} !sbcs!debray
>	arpa: debray%suny-sb.csnet@csnet-relay.arpa
>	CSNet: debray@sbcs.csnet

While I cannot profess knowledge about the Vax 11/780 microcode, I do believe that the 780 is NOT pipelined. I stand to be corrected, but I am fairly sure that the 8600 was/is the first pipelined implementation of the Vax architecture.

P.S. - I could be wrong on the above; even Dec employees don't always know what is going on.
jer@peora.UUCP (J. Eric Roskos) (03/10/86)
> Sorry jer, but those CISC machines have some super instructions which are
> of LIMITED SCOPE, and therefore CANNOT be directly and optimally
> substituted for the source code.

You missed my point. It would be absurd to go backwards and try to show that there is a mapping from source code into any particular existing CISC instruction. Few people familiar with the issues feel that the existing CISC instructions that are usually cited as examples of bad CISC instructions are much good. Certainly I would never cite the Intel instructions you listed as good CISC instructions (but then, I always considered the Motorola 6800 a most elegant small instruction set design). Actually, a lot of the research on instruction set design has been in the area of providing mechanisms for determining how to create better instruction sets, or in providing mechanisms to create new instruction sets automatically.

Actually, it's even somewhat strange that I always seem to argue here on the side of CISC machines, since I generally favor the use of small classes of powerful primitive operations, whether it's for instruction sets or operating systems, and tend more often to disagree with people in the "real world" because I think their designs are too complex, not too simple. Rather, I tend to feel that the RISC philosophy for many people involves a concept of a "volkscomputer" -- a People's Machine by which those who dedicate their efforts to difficult-to-understand research can be maligned. Thus I tend to disagree with the tone in which the RISC approach is supported, rather than with the results of the research. On the other hand, you'll note that I rarely disagree with people (such as mashey@mips) who seem to understand the issues and state their arguments in a reasoned and verbally unbiased tone. (In fact, I often agree with them.)
-- 
Ofc: jer@peora.UUCP  Home: jer@jerpc.CCUR.UUCP  CCUR DNS: peo
S Mail: MS 795; CONCURRENT Computer Corp. SDC; (A Perkin-Elmer Company)
2486 Sand Lake Road, Orlando, FL 32809-7642
----------------------
Obviously if I supplied LOTD(7) you'd know what was going to happen next, wouldn't you?
mash@mips.UUCP (John Mashey) (03/11/86)
Peter Ludemann ludemann@ubc-cs.uucp (ubc-vision!ubc-cs!ludemann) writes: > In article <1163@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes: > >In article <169@ubc-cs.UUCP> ludemann@ubc-cs.UUCP (Peter Ludemann) writes: > >>As far as I'm concerned, the test for RISCness should be: given any > >>piece of source code, is there only one reasonable code sequence which can be > >>output by the compiler? > > > >This is not a possible objective. Consider sequences like: > > (description of optimisations possible) > > You miss my point. The optimisations described (in this case common > expression detection) are possible *regardless* of the target > machine's architecture. Any good compiler should try to optimise > code like this. However, once these "source level" and flow analysis > optimisations are finished, there should be only one reasonable way > of generating the machine code. Otherwise, the poor compiler writer > has to do another set of optimisation analyses, all because of the > target machine's instruction set. One more time: I urge people not to propose metrics that are exceedingly difficult to measure or comprehend or even define. As I said before, I'm not sure "RISCness" is something one necessarily wants to measure as a goal, in place of things that one really cares about like performance, price/performance, code size, etc, etc. However, if one wants to measure it, for whatever reason, PLEASE a) Use cycles/MIPS (normalized to some common unit of work), i.e., try: cycles/MIPS = (clock rate in Mhz) / (MIPS rating) so that, for example, if you think a VAX 11/780 with UNIX is a 1 Mips cpu, at 5Mhz, it gets a 5. If you think a good 68020 implementation is twice as fast, then it gets 16.67/2 = 8.33, etc. [Notice that this isn't cycles/instruction, but cycles/equivalent work.] OR b) Propose a different metric that has a reasonable way to measure it. Is it unreasonable to ask for an algorithm to compute the metric? 
[I suggest that the "only one way to do it" test is, in general, mechanically undecidable.] (Note: Peter's comment does get at something that usually happens, which is that simpler machines have fewer obvious code-selection possibilities at some level or other, but even then, there are almost always time-space tradeoffs to be made, which allow for different code sequences for even the RISCiest of machines.)
-- 
-john mashey UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD: 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rb@ccivax.UUCP (03/11/86)
In article <1161@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>In article <187@anwar.UUCP> mlopes@anwar.UUCP (Marco Lopes) writes:
>>In my earlier posting, I assumed the existence of absolute addressing.
>>Therefore, a minimum set of addressing modes is:
>>	register
>>	register indirect
>>Another functionally complete set would have immediate instead of absolute.
>You should seriously consider using immediate instead of absolute.

For any who might remember the RCA 1702 COSMAC VIP computer, there was a strange way to get immediate addressing. The VIP would let you switch which register would work as a program counter, and when that register was used for instruction fetches or anything else, it would cause the register to be incremented. If R7 was your PC register, you could say MOV R6,(R7), and the result was the same as a move immediate instruction, because R7 would be incremented automatically.

This may have been a bug, but it was a useful bug (all instructions had to be a uniform number of cycles, and immediate mode took too many cycles). Actually, the 1702 had a lot of "cute" features; it would be interesting to see a 32-bit version. Does anybody familiar with this chip consider it RISC? It was certainly a good learning tool.
rb@ccivax.UUCP (03/11/86)
In the previous article, I kept referring to the 1702, I meant 1802 Sorry about the confusion. There is no 1702 CPU.
stevev@tekchips.UUCP (03/12/86)
> You miss my point. The optimisations described (in this case common
> expression detection) are possible *regardless* of the target
> machine's architecture. Any good compiler should try to optimise
> code like this. However, once these "source level" and flow analysis
> optimisations are finished, there should be only one reasonable way
> of generating the machine code. Otherwise, the poor compiler writer
> has to do another set of optimisation analyses, all because of the
> target machine's instruction set.

Even though the correctness of an optimization is machine-independent, its effect on the quality of code is very machine-dependent. The selection of the ``best'' source-level optimizations will differ from machine to machine, even for RISCs.

Consider common-subexpression elimination. Sure, you can perform the optimization at source level, but for some machines it might be cheaper to redo the computation rather than to save the result, because saving the result requires a register, which could make register allocation poorer; in some cases, reperforming the computation could even be free on RISC machines, if it can be done in an instruction that would otherwise be a delayed-branch-induced NOP.

The point is that architectural attributes such as *the number of available registers* and *delayed branching* can affect which optimizations you should (or should not) do at the source level.

		Steve Vegdahl
		Computer Research Lab.
		Tektronix, Inc.
		Beaverton, Oregon
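[Moderator's sketch: the machine-independent half of the argument — *detecting* a common subexpression by counting structurally identical subtrees. Whether eliminating it pays off is then the machine-dependent question Steve raises. The expression encoding is invented for illustration.]

```python
from collections import Counter

def count_subexprs(e, counts):
    # count every compound subexpression, represented as a tuple
    # (op, operand, operand); leaves are strings or numbers
    if isinstance(e, tuple):
        counts[e] += 1
        for child in e[1:]:
            count_subexprs(child, counts)

def cse_candidates(expr):
    counts = Counter()
    count_subexprs(expr, counts)
    return {e for e, n in counts.items() if n > 1}

# the address computation i*4 shows up in both B[i] and D[i]
expr = ('pair',
        ('load', ('add', 'B', ('mul', 'i', 4))),
        ('load', ('add', 'D', ('mul', 'i', 4))))
# cse_candidates(expr) flags ('mul', 'i', 4) as the shared computation
```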
ron@brl-smoke.UUCP (03/13/86)
> >> The TRUE RISC would have only one instruction: subtract....
> >
> >Not quite. You need two address, you see. If we go to two instructions, we
>	"subtract one" and "branch on negative". They proceeded to

Somewhere (UTEXAS?) someone designed a MOVE PROCESSOR, which had only move-from-memory-to-memory instructions. To do math, you fed the operands to an ALU's memory-mapped addresses and pulled the result back from another address. The PC was another address. Sort of like having to program at the microcode level all the time.

-Ron
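[Moderator's sketch: Ron's move processor mocked up in a few lines. The ALU port addresses and the single "move" operation are invented for illustration; a real design would also map the PC somewhere, which this toy does not model.]

```python
# memory-mapped ALU ports: write operands, read the sum back
ALU_A, ALU_B, ALU_SUM = 0x100, 0x101, 0x102

def step(mem, src, dst):
    # the machine's only instruction: move mem[src] to mem[dst]
    mem[dst] = mem.get(src, 0)
    if dst in (ALU_A, ALU_B):               # writing an operand port
        mem[ALU_SUM] = mem.get(ALU_A, 0) + mem.get(ALU_B, 0)

# compute mem[12] = mem[10] + mem[11] with three moves
mem = {10: 3, 11: 4}
for src, dst in [(10, ALU_A), (11, ALU_B), (ALU_SUM, 12)]:
    step(mem, src, dst)
```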
greg@ncr-sd.UUCP (Greg Noel) (03/14/86)
In article <389@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>	cycles/MIPS = (clock rate in Mhz) / (MIPS rating)
>so that, for example, if you think a VAX 11/780 with UNIX is a 1 Mips cpu,
>at 5Mhz, it gets a 5. If you think a good 68020 implementation is
>twice as fast, then it gets 16.67/2 = 8.33, etc. [Notice that this
>isn't cycles/instruction, but cycles/equivalent work.]

I'm curious -- are these off-the-top-of-the-head numbers to illustrate the point, or do they approximately reflect reality? If the latter, are there any published results?
-- 
-- Greg Noel, NCR Rancho Bernardo	Greg@ncr-sd.UUCP or Greg@nosc.ARPA
mash@mips.UUCP (03/15/86)
Greg Noel (Greg@ncr-sd.UUCP or Greg@nosc.ARPA) writes:
> In article <389@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
> >	cycles/MIPS = (clock rate in Mhz) / (MIPS rating)
> >so that, for example, if you think a VAX 11/780 with UNIX is a 1 Mips cpu,
> >at 5Mhz, it gets a 5. If you think a good 68020 implementation is
> >twice as fast, then it gets 16.67/2 = 8.33, etc. [Notice that this
> >isn't cycles/instruction, but cycles/equivalent work.]
>
> I'm curious -- are these off-the-top-of-the-head numbers to illustrate the
> point, or do they approximately reflect reality? If the latter, are there
> any published results?

1) They're not off-the-top-of-the-head numbers, but do reflect reality, given enough caveats and specifications, always remembering that:
a) Reducing CPU performance to 1 number is always hard. This doesn't ever stop anybody in this business from trying.
b) One must always include software: thus, a more complete description of the above might be: ".. if you think a VAX 11/780 with 4.2BSD runs user programs like yacc, grep, diff, etc..., compiled with the regular C compiler, and runs them at a rate we'll call 1 Mips, at 5 Mhz, it gets a 5. If you think a 68020 (SUN 3/160, running their currently-released software) is twice as fast.... it gets 8.33."

2) SUN has published a bunch of comparisons as part of their SUN-3 announcement. I think Convergent has published theirs for the MightyFrame, and Altos for their 3068, and .....

3) I personally believe these numbers are probably good within +/- 20%; what I see personally, when I look (which is not real often or very carefully), is that the SUN is 2X or a little more faster for some things, especially little benchmarks, and is about 1.8X for many tasks.

4) As I've observed before here, IT IS VERY IMPORTANT TO COMPARE ACTUAL MACHINE PERFORMANCE AND NOT GET CONFUSED BY MIPS RATINGS.
In particular, many people do call a 780 a 1Mips (or thereabouts) machine, but it is known that for some fairly substantial runs (user and kernel), the 780 averages 10.6 (200ns) cycles / instruction, i.e., it does .5M VAX instructions/sec, including all of the degradations for memory, TLB misses, pipeline bubbles for branches, etc, etc. [There was a good 1984 paper by Clark and Emer of DEC on this.] Thus, it would appear that a machine that does 1 Million VAX instructions/sec (i.e., a bit faster than a 785) does an amount of work like what people usually call a 2Mips machine.

Thus, a machine that executes 1M of its own instructions/sec to do the same work as the 780's .5M VAX instructions can also be called a 1Mips machine, since people call the VAX a 1Mips cpu. Thus, there would seem to be twice as many, but "twice as simple" (whatever that means) instructions. So, in some sense, the instruction within the mythical "1Mips" is probably more like a RISC instruction than a VAX (CISC) instruction!

Again, what counts is having a scale that people understand and on which systems can be compared consistently. [At MIPS, we just always call the given flavor of 780 == 1 and be done with it.]
-- 
-john mashey UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD: 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
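[Moderator's note: working through the Clark and Emer figures John cites, to show where the ".5M" comes from.]

```python
# VAX 11/780: 200 ns cycle (5 MHz), averaging 10.6 cycles/instruction
clock_mhz = 5.0
cycles_per_instr = 10.6
native_mips = clock_mhz / cycles_per_instr
# about 0.47 million VAX instructions/sec -- roughly the ".5M" figure,
# even though everyone calls the machine a "1 Mips" cpu
```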
tim@amdcad.UUCP (Tim Olson) (03/17/86)
In article <460@ccivax.UUCP>, rb@ccivax.UUCP (rex ballard) writes: > For any who might remember the RCA 1702 COSMAC VIP computer, there was > a strange way to get immediate addressing. The VIP would let you switch > which register would work as a program counter and when that register > was used for instruction fetches or anything else, it would cause the > register to be incremented. If R7 was your PC register, you could say > MOV R6,(R7), and the result was the same as move imediate instruction, > because R7 would be incremented automatically. > This may have been a bug, but it was a useful bug (All instructions > had to be a uniform number of cycles, and immediate mode took too many > cycles). Actually, the 1702 had a lot of "cute" features, it would be > interesting to see a 32 bit version. Does anybody familiar with this > chip consider it RISC? It was certainly a good learning tool. I don't remember a 1702, but it sounds much like the RCA 1802. This processor had sixteen 16-bit general purpose registers, which could also be used as thirty-two 8-bit registers. However, except for INC and DEC, all operations (including moves) had to be performed 8 bits at a time using the accumulator. Two 4-bit registers, P and X, pointed to the general-purpose registers to be used for the program counter and index register, respectively. There were a few things that made the 1802 sort of "RISC-like". One was the large number of general-purpose registers available. Another was the small number of addressing modes: loads and stores were either register indirect or implied ("double-indirect" through the 4-bit X register), and all arithmetic and logical operations were performed using the accumulator (D) and an implied (double-indirect through X) memory operand. Note that the registers could NOT be used as operands in arithmetic or logical operations! :-( Another interesting feature was the absence of a subroutine call instruction. 
Instead, coroutines were implemented, each with a separate program counter. To call a coroutine, the P register was simply changed to point to the new program counter register. A subroutine was a coroutine that ran to completion, then jumped to an SEP (set P register) instruction just before the subroutine entry point. This instruction "returned" to the calling procedure, incrementing the subroutine's program counter to point to the entry point again.

This coroutine method also made for fast interrupt handling. The interrupt handler had a dedicated program counter (register 1). When an interrupt occurred, the 4-bit P and X registers were copied into an 8-bit T register, then P was forced to 1. A return instruction copied back the P and X registers.

In both these cases (subroutines and interrupts), only the basic primitives were supplied; the user could build the required subroutine call or interrupt convention from them, using these building blocks. This approach of building complex instructions from simple primitives can also be viewed as "RISC-like".

All of these "double-indirect" operations had a cost, however. Most instructions took 16 clock cycles to execute (8 for instruction fetch, 8 for execution), and a few took 24. To build a competitive 32-bit, single-cycle, pipelined RISC machine using this architecture would require a lot of complexity, especially in the register file, which would have to be a 7-port, 4-read, 3-write structure!

The 1802 certainly had some interesting concepts for its time, however.

	Tim Olson
	Advanced Micro Devices
	ihnp4!amdcad!tim
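[Moderator's sketch: the SEP trick — a subroutine that runs to completion, transfers control back, and loops to just before its own entry point so it can be called again — is essentially coroutine linkage. A loose analogy in Python, with a generator standing in for a register holding the subroutine's program counter; the squaring routine is invented for illustration.]

```python
def square_routine():
    result = None
    while True:            # loop back to the entry point, the way the
        x = yield result   # 1802 subroutine jumps back to its own SEP
        result = x * x

sub = square_routine()
next(sub)                  # advance to the entry point once
sub.send(3)                # "call": yields 9
sub.send(5)                # callable again with no re-setup: yields 25
```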
fwb@siemens.UUCP (03/17/86)
/* Written 11:12 am Mar 11, 1986 by rb@ccivax in siemens:net.arch */ /* ---------- "Re: Addressing modes (1702)" ---------- */ > For any who might remember the RCA 1702 COSMAC VIP computer, there was > a strange way to get immediate addressing. Actually, the CHIP was the 1802. I wrote, and helped to write, many K of code for this when I worked at RCA Laboratories. The native instruction set was miserable. I wrote a cross assembler with a Data General macro assembler. This let me add macro instructions to do some nice things like IF-THEN-ELSE, DO-WHILE, and 16-bit LOAD and STORE. A little program took the object module created by the DG assembler, and converted it to Intel HEX format for a prom burner. I rewrote these macros in Tek Development system macro assembler for use by other groups in the Laboratories. > If R7 was your PC register, you could say > MOV R6,(R7), and the result was the same as move imediate instruction, > because R7 would be incremented automatically. > This may have been a bug, but it was a useful bug This was not a bug. It was a feature. Really! The designers designed it as an immediate mode. > (All instructions > had to be a uniform number of cycles, and immediate mode took too many > cycles). Yes, there were 8 (or was it 16?) clocks per cpu cycle. "Why?" you ask. Well, I did too. The answer: The ALU was one bit wide! The next instruction was decoded while the computation was completed on the last one. > Actually, the 1702 had a lot of "cute" features, it would be > interesting to see a 32 bit version. Forget about a 32-bit version. I doubt that RCA (or the new GERCA) is going to make a big version based on this crummy architecture. One of the big problems with it was the D register. This was an 8-bit register which sat between memory and the 16 16-bit "scratchpad" register file. All memory references, arithmetic instructions, and I/O instructions went through this bottleneck. 
One of the things my macro instructions did was to hide this register as much as possible. It's too bad that I could not make it go away completely. > Does anybody familiar with this > chip consider it RISC? It was certainly a good learning tool. I guess you could say that it is a RISC if you consider the PDP-8 to be a RISC. It was an 8-bit CMOS micro which came out before the 8080. Silicon real estate was very expensive, so the designers had to make some compromises in the instruction set and operation. I hope you learned something useful. :-) ----------------------------------------------------- Frederic W. Brehm (ihnp4!princeton!siemens!fwb) Siemens Research and Technology Laboratories 105 College Road East Princeton, NJ 08540 (609) 734-3336
radzy@calma.UUCP (Tim Radzykewycz) (03/18/86)
In article <460@ccivax.UUCP> rb@ccivax.UUCP (What's in a name ?) writes: >For any who might remember the RCA 1702 COSMAC VIP computer, there was >a strange way to get immediate addressing. The VIP would let you switch >which register would work as a program counter and when that register >was used for instruction fetches or anything else, it would cause the >register to be incremented. If R7 was your PC register, you could say >MOV R6,(R7), and the result was the same as move imediate instruction, >because R7 would be incremented automatically. Actually, the fact that the 1702 (and 1802 for that matter) allowed you to change which register was used as the PC is irrelevant for this. The PDP-11 used the same method, although the assembler was smart enough to turn instructions like MOV #47,R2 into the word-sequence MOV (PC)+,R2 47 which is exactly what was described above for the 1702. Immediate is by definition just a form of PC-relative address, although it might be implemented as a special case. -- Tim (radzy) Radzykewycz, The Incredible Radical Cabbage. ARPA: calma!radzy@ucbvax.berkeley.edu GEnet: cc5vxa::cc5mxd::calma::radzy UUCP: {ucbvax,sun,csd-gould}!calma!radzy VOICE: "Hey! Radzy!"
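[Moderator's sketch: the (PC)+ mechanism radzy describes — the literal lives in the instruction stream, and fetching through the PC with autoincrement steps over it. Word-addressed memory and the register names are simplifying assumptions.]

```python
def fetch_autoinc(mem, regs, r):
    # fetch through register r, then bump it: the "+" in (PC)+
    val = mem[regs[r]]
    regs[r] += 1
    return val

mem = {100: 'MOV (PC)+,R2', 101: 47}   # opcode word, then the literal
regs = {'PC': 100, 'R2': 0}
fetch_autoinc(mem, regs, 'PC')              # fetch and decode the opcode
regs['R2'] = fetch_autoinc(mem, regs, 'PC') # operand fetch IS the immediate
# afterwards PC has stepped past the literal to word 102
```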
earle@smeagol.UUCP (Greg Earle) (03/18/86)
In article <10693@amdcad.UUCP>, tim@amdcad.UUCP (Tim Olson) writes: > ... The 1802 certainly had some interesting concepts > for its time, however. Interesting that Tim refers to the 1802 in the past tense ... Here at The Land That Time Forgot (erm, JPL that is), you folks out there might be interested to know that both Galileo, and the not-to-fly-till-at-least-1989 Venus Radar Mapper mission (Magellan) are *both* using rad-hardened 1802's as the main CPU in the Command Data Subsystem. There are Flight Software people here even now huddled over their RCA COSMAC development systems ... The people doing test software have to upload Hex'ed binary files from the COSMAC to an IBM PC, run them through a filter program, upload again to a Sun-2 which is the main front end to the Lab Test Set (LTS) which finally uploads it into a breadboard version (in VRM's case) of the CDS ... Flight Software may get all the glory, but gimme my leading-edge Suns anyday. Anyhow, next time you reminisce about the 1802, shed a tear for those people for whom the past is not as far back as you think :-) "Forward, into the Past!" (Overheard at a JPL Flight Software meeting :-) -- ::::::\::::::::: ::::' \: `:::: ::' /:\ `:: Anarchy, Peace, and Freedom :: / ::\ :: Devastate to Liberate :: /_____\ :: ::./:' :: `\..:: ::/:. :: .\::: :::::::::::::\:: Greg Earle sdcrdcf!smeagol!earle (UUCP) JPL ia-sun2!smeagol!earle@csvax.caltech.edu (ARPA)
king@kestrel.ARPA (Dick King) (03/21/86)
I'm sorry to bug the whole net, but nobody around here knows how to mail to UUCP from ARPA. [if anyone knows how, pls tell me.]

mahar@fear.UUCP, Please send me the macro package for the URISC.

Thanks...
-dick
cej@well.UUCP (Craig Jackson) (03/22/86)
I think that the source of some of this confusion might be the origin of the 'mip' term. I believe it first arose in the IBM mainframe world. Data center types love to talk in 'mips' all day, because they know little else. Anyway, when the 780 first came out, it was found to have roughly the same computing power as what IBM was then calling a 1-mip processor (370/145, as I remember). Since a VAX is significantly more CISC than a 370 (3-address instructions, etc.), that worked out to fewer than 1 million VAX instructions per second. -- Craig Jackson UUCP: {ihnp4!linus,seismo!harvard}!axiom!mhis!dricej {dual,ptsfa,lll-crg,hplabs}!well!cej BIX: cjackson