nathan@orstcs.UUCP (03/01/84)
We've seen a lot lately about RISC architectures: a bunch of people down at Berkeley did a chip, and Pyramid Computers has a box out that they claim far outperforms a Vax. I would like to make the claim that a RISC with a medium-to-large instruction cache could be called a "virtual-microcode" machine. Since on a RISC all the complicated operations are implemented with a somewhat brain-damaged subroutine call, the subroutines that implement what would be instructions on another architecture tend to be in the cache all the time. This includes the Enter and Exit operations, loop operations, and the like; esoterica such as long divides would sit out in main memory until needed.

Does anyone have any objection to this analogy?

	Nathan Myers
jvz@sdcsvax.UUCP (03/04/84)
I don't have any problems with your analogy, but beware that a machine like that was done by Burroughs Corp. -- the B1700/B1800 series. Those machines were microcoded, with an instruction cache so that only small amounts of expensive fast memory were required.

	John Van Zandt
nather@utastro.UUCP (Ed Nather) (03/09/84)
> I have advocated "cache-advisor" instructions for RISC's--instructions
> that don't *do* anything, from the programmer's point of view, but which
> inform the cache-manager of highly-probable-to-certain upcoming memory
> events.  This is another trade-off--a loss of code-density with a
> possible net gain in performance.  That need not complicate the
> architecture.  It could, in fact, simplify it--if cache-block-prefetching
> were more effective, then instructions could be looser and "fatter"
> (more easily decoded and executed) with less worry over the fetching
> costs.  Possible analogies: impurity doping of semiconductors to
> increase electron mobility, or adding minute amounts of certain minerals
> in steel-making to improve its tensile strength.  It's worth some
> experimentation, I think.
> --- Michael Turner (ucbvax!ucbesvax.turner)

This sounds like a splendid idea to me. If the programmer inserts a single instruction that serves as "advice" to the cache-manager, then the overhead involved is just 1 fetch-time (plus decoding time, but that's true for any instruction). It would act like a "pseudo-op" in an assembler program, helping manage things but generating no output. If such advice saves only 1 fetch it breaks even; everything after that is gravy.

I found that the "sieve" benchmark could be sped up by a factor of 2 on our Vax by advising the "C" compiler about what to put into registers. The "C" compiler is the only one we have humble enough to accept advice.

I find the thought of a much closer relationship between compiler and programmer, via direct "advice" where it can really matter, a marvelous one. Advising the computer directly could be even better!

-- 
Ed Nather
ihnp4!{ut-sally,kpno}!utastro!nather
Astronomy Dept., U. of Texas, Austin
guy@rlgvax.UUCP (Guy Harris) (03/10/84)
This is a spinoff from a comment about speeding up C programs by advising the C compiler what to put into registers. The idea is good, but there's one problem with its current implementation: there's no way to make those hints truly "machine-independent".

The problem is that you want to use all the registers you can (actually, there's even a tradeoff here; you get extra performance from instructions referencing registers, but you do have to save and restore those registers, and if you have complicated expressions you may run out of scratch registers and be forced to use temporaries in memory), but "all the registers you can" is machine-dependent. On the PDP-11, "all the registers you can" in the Ritchie and Johnson compilers is 3; on the VAX-11, it's 6; and on the M68000, it's (at least on our compiler) 6 registers not containing pointers and four registers containing pointers.

Just declaring everything you can to be "register" wins only if 1) there are enough registers to satisfy your requests or 2) the variables that will cause the greatest improvement when put in registers are assigned to registers first, and the less important ones cause the compiler to ignore the "register" declaration. You can sort of rig it to work right, on compilers that assign "register" variables to registers in the order they encounter them in the source text, by putting the "best" candidates first; however, 1) this means you may not be able to use the "register parameter" feature, 2) it makes your code a little less readable, and 3) most importantly, it requires you to know what the compiler does - but what happens if the compiler doesn't work that way?

Writing portable code to use registers efficiently is tricky at times.

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy
rpw3@fortune.UUCP (03/11/84)
And while all you guys are busy stuffing things in registers, let us not forget that process-switch time gets worse the more "short term" state there is to save/restore. Especially when using an operating system model of the Concurrent Euclid sort, or any other system in which the "interrupt handlers" are full processes in their own right, saving and restoring ("swapping") the enormous amount of state implied by all those registers can result in a net reduction of throughput. A properly designed non-write-through cache does not suffer from that problem, since many pieces of many interrupt processes can come to live in the cache.

Small aside: I really wish that the Motorola 68000 had only 8 general registers rather than 8 address + 8 data. For any instruction set of the VAX/68k/16k style, 16 registers is too many if you do a lot of process switching.

If the "registers" are implemented as main memory locations which are addressed with short addresses relative to some process base, and the identification of "register" addresses is fast enough so that cached "registers" compete with hardware register addressing, then the number of registers should be set solely by the number of bits we wish to use for them in the instructions. I still feel that even in this case the number of specially named "registers" (short addresses) will be quite small (32 or less).

An interesting example to look at for register naming is the old RCA-1602 (?) 4-bit microcontroller chip. It had a few registers (4?), but the neat thing was that the actual scratchpad cell that was used for a given register was run-time settable. To do a "process switch" you said "Use loc. 234 for the N register, 107 for the T reg, and (last) loc. 025 for the PC".

Now, following that through to bigger machines, we should look at decoupling the NAMING of "registers" (which is restricted by the desire to keep the instruction stream efficient) from the LOCATION of registers (which should be a large address space). The context-switch problem then becomes changing the name-location map quickly, as on the TI 9900. Additionally, with smart enough compilers (but don't ask me to program in assembly for THIS machine!) you can change the assignment of registers to locations as needed throughout a given module or even subroutine, to balance the "register working set". ("Quick Martha, call the nut house! He's gone and re-invented register page-maps!")

If the register page-map was actually of the translation-lookaside-cache style (with the same number of entries as registers), and was onboard the CPU chip, it could probably compete in performance with fixed hardwired register sets.

Just a little something to jog the discussion along...

Rob Warnock
UUCP:	{sri-unix,amd70,hpda,harpo,ihnp4,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphin Drive, Redwood City, CA 94065
hal@cornell.UUCP (Hal Perkins) (03/12/84)
The IBM 801, which also has a simple register-register instruction set like the RISC, has explicit instructions to control the cache. For example, if the program is about to store the registers into a block of memory, these instructions can be used to prevent the cache from fetching the old contents of the storage line when the first byte of the line is referenced.

Now, a question. The only thing I've seen on the 801 is the paper in the 1982 conference on hardware support for high-level languages (printed in a special issue of SIGPLAN Notices/SIGARCH News). This was later revised and printed in the IBM Journal of R&D. Has anyone seen any other information about this? I'd particularly like to see a "Principles of Operation" or architecture manual for the 801, but I'm afraid that information is probably still classified IBM Top Secret.

Hal Perkins			UUCP: {decvax|vax135|...}!cornell!hal
Cornell Computer Science	ARPA: hal@cornell  BITNET: hal@crnlcs
aaw@pyuxss.UUCP (Aaron Werman) (03/12/84)
To add my wish list - first, a note: the TMS9900 series was one of the slowest microprocessors ever publicly sold, because although context switching and process switching did not take much time, the lack of registers slowed it to a crawl.

How about an architecture with mixed memory - that is, have subroutines consisting of a small chunk of very fast memory trailed by the rest of the subroutine contained in normal memory. The obvious place for this "uncache" would be inside the memory controller. This scheme would allow fast context switching as well as real-time program execution.

The problem with nonregister architectures, though, is most visible in 0-address (pure stack) machines, which inherently clog memory busses rather than perform useful work. Some algorithms are enhanced by registers; others don't need 'em.

	{harpo,houxm,ihnp4}!pyuxss!aaw
	Aaron Werman
mark@umcp-cs.UUCP (03/13/84)
The Ridge computer has cache control instructions, so they say, as well as instructions to turn on and off updating the virtual memory translation look-aside buffer. Its branches also have advice to the cache about which way they expect to go. -- Spoken: Mark Weiser ARPA: mark@maryland CSNet: mark@umcp-cs UUCP: {seismo,allegra,brl-bmd}!umcp-cs!mark
tgg@hou5e.UUCP (03/14/84)
>The idea is good, but there's one problem with its current implementation;
>there's no way to make those hints truly "machine-independent". The problem
>is that you want to use all the registers you can (actually, there's even
>a tradeoff here; you get extra performance from instructions referencing
>registers, but you do have to save and restore those registers, and if you
>have complicated expressions you may run out of scratch registers and be
>forced to use temporaries in memory), but "all the registers you can" is
>machine-dependent. On the PDP-11, "all the registers you can" in the Ritchie
>and Johnson compilers is 3; on the VAX-11, it's 6, and on the M68000, it's
>(at least on our compiler) 6 registers not containing pointers and four
>registers containing pointers. Just declaring everything you can to be
>"register" wins only if 1) there are enough registers to satisfy your requests
>or 2) the variables that will cause the greatest improvement when put in
>registers are assigned to registers first, and the less important ones cause
>the compiler to ignore the "register" declaration.
>
>	Guy Harris
>	{seismo,ihnp4,allegra}!rlgvax!guy

No no no - you're taking the meaning of 'register' too literally! It does not mean that the compiler is forced to use a register for these variables; it just means that you're giving it a hint that these variables are more likely to be referenced than other variables. If that means that, for the machine at hand, the variables should be three's-complemented and stored in reverse bit order starting at memory location 3fd3.8 (octal), because that is the absolute best place to store an integer for speedy calculations, then the compiler should do that for 'register' variables. Let the compiler worry about what to do if it runs out of registers or other resources. The point is that 'register' exists to differentiate some variables from others so that a compiler can make some reasonable tradeoffs if possible.
Making every variable in a program 'register' isn't any good because they're *all* important. If you want overall optimization, make a better optimizer. Personally, I think that 'register' should have been changed to 'fast' or something like it long ago...

	Tom Gulvin	ABI - Holmdel
smk@axiom.UUCP (Steven M. Kramer) (03/15/84)
You have to take into account: 1) the number of register assignments allowed on your machine, and 2) the cost in terms of saving/restoring registers, as Guy Harris said. I usually order my register declarations so that the most-used variable comes first, so that if my program were ported to a PDP-11 it would still run sort of fast. I know that's sort of implementation-dependent, but all optimization is that way.

On the 68000 port we are using, there is a difference between address registers and data registers that ALSO must be taken into consideration. On the VAX and PDP-11 that's not so, but using 'register' on any machine should keep in mind all of the differences. The best code would #ifdef the differences (especially if another processor handled things differently), but few of us think that way (I don't; I can't anticipate all uses). So each new port has to look at register declarations. Few even use them to begin with, so ANY use is a plus.

-- 
	--steve kramer
	{allegra,genrad,ihnp4,utzoo,philabs,uw-beaver}!linus!axiom!smk	(UUCP)
	linus!axiom!smk@mitre-bedford	(MIL)
turner@ucbesvax.UUCP (03/20/84)
> /***** ucbesvax:net.arch / orstcs!nathan / 10:42 pm Mar 3, 1984 *****/
> I would like to make the claim that a RISC with a medium-to-large
> instruction cache could be called a "virtual-microcode" machine.
> ... the subroutines that implement what would be instructions on another
> architecture tend to be in the cache all the time.  This includes the
> Enter and Exit operations, loop operations, and the like; esoterica such
> as long divides would sit out in main memory until needed.  [Nathan Myers]

I think that's what they had in mind (I haven't kept up with the ever-shifting rationale for RISC). I still question whether the gains reaped from the various simplifications and optimizations possible in RISC architectures aren't lost in some ways.

RISC has no BCD or I/O-formatting instructions, because "compiler-writers don't use them." Fine. But programs nevertheless spend much time translating back and forth between ASCII/EBCDIC and binary. Instructions with a static frequency approaching zero might still have a dynamic frequency (and a cost for emulating them in software) that is quite high. So if "printf" is in the cache, rather than having much of it hardwired in microcode, this leaves less space for other code that competes for the cache. Also, since RAM has a lower bit-density than a ROM/PLA implementation, there is that much less space to compete for. But then again--when you are *not* using BCD/io-format heavily, on a RISC you don't have to pay for them. It's a trade-off.

I have advocated "cache-advisor" instructions for RISC's--instructions that don't *do* anything, from the programmer's point of view, but which inform the cache-manager of highly-probable-to-certain upcoming memory events. This is another trade-off--a loss of code-density with a possible net gain in performance. That need not complicate the architecture. It could, in fact, simplify it--if cache-block-prefetching were more effective, then instructions could be looser and "fatter" (more easily decoded and executed) with less worry over the fetching costs. Possible analogies: impurity doping of semiconductors to increase electron mobility, or adding minute amounts of certain minerals in steel-making to improve its tensile strength. It's worth some experimentation, I think.

--- Michael Turner (ucbvax!ucbesvax.turner)