crp@stcvax.UUCP (03/15/86)
> J. Eric Roskos @ Concurrent Computer Corp in Orlando says:
> One of the ongoing areas of research in microprogramming involves "vertical
> migration" -- analyzing sequences of code to determine things that can
> be migrated into the microcode, essentially to produce new instructions.
> From the RISC end you'd just go the other way; it's been argued that the
> cache does that "automatically," but it's hard to believe that
> in the long run, when the RISC approach has come to be seen as mundane,
> that someone doesn't start doing statistical analyses on RISC instruction
> sequences, and discovers that some sequences commonly occur,
> and makes new instructions out of those.
> But that's essentially identical to the vertical migration strategy.
> ... [a bit deleted] ...
> but I think the underlying approach is more or less the same.

One reason that vertical migration of frequent instruction sequences
"down" into a lower "interpretation layer" differs between microcoded
machines and RISC machines (as they are spoken about -- oops, I mean
thought about) today is that the RISC interpretation layer is hardware
(today).  If you really need hardware to implement an instruction, then
even though an instruction sequence is common, the price/performance
tradeoff may still favor executing the instruction sequence rather than
"interpreting" a single instruction with hardware.

For example, HP's new "Spectrum" machines don't have a multiply
instruction.  The reason, they say, is that it made a WHOLE lot of
difference in the amount of ALU hardware needed, and the most common
case, a multiplier that is a small constant, can be effected cleverly
and quickly with the remainder of their hardware and instruction set.
Apparently a general multiply subroutine is "fast enough" for the
remaining cases.
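Such a general multiply subroutine is typically the classic shift-and-add loop; here is a minimal sketch in C (my own illustration -- the posting doesn't describe HP's actual routine):

```c
#include <stdint.h>

/* A sketch of the kind of general multiply subroutine a machine
 * without a multiply instruction falls back on: one loop iteration
 * per bit of the multiplier, adding in the correspondingly shifted
 * multiplicand whenever the low bit is set.  Small multipliers
 * finish quickly, consistent with "fast enough" for the rest. */
uint32_t soft_multiply(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)            /* low bit of multiplier set */
            product += a;     /* add shifted multiplicand  */
        a <<= 1;              /* next bit position         */
        b >>= 1;
    }
    return product;
}
```

Note that the loop exits as soon as the multiplier is exhausted, so multiplying by a small value costs only a few iterations.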
An "Electronics" (March 3, 1986) article about Spectrum says that
"multiplications by both variables and constants are done in an average
of three or four cycles -- faster than the 20 or so cycles many
multiplications consume in other machines."

The following is motherhood, but it seems to bear repeated
articulation.  A system design, ANY system design, is done for some
particular
  - problem (program) domain and
  - customer (pocketbook) domain.
The "RISC ballet" (or is it a demolition derby?) is an optimization
problem with variables that include performance range required, cost
range, hardware technologies (CPU stuff and memories in particular),
compiler technology, programs from the application domain(s) supported,
and other things I either don't know or have forgotten at the moment.

I've noticed that people seem to want to talk about the instruction set
without considering the programs to be run.  You can't.  A decision to
leave out integer multiply, for instance, can't be made only on the
basis of what it does to the chip area or complexity.  HP measured a
"whole lot" of "real" programs from people who use their existing
machines (the mag says "hundreds of customer jobs for over 100
customers").  They designed a machine to support THIS APPLICATION MIX,
with some particular cost/performance requirements, and not
*necessarily* any other mix.  They simulated the effect of their design
decisions on programs from people who gave them money once and might be
induced to do so sometime again.  Apparently they were able to decide
that integer multiply wasn't important enough to enough people to hurt
their business.  For them, for now, leaving out multiply is a Good
Thing.

HP's Spectrum series has support for coprocessors -- like an FPU.  My
own speculation is that they found that a reasonable multiplier for the
core instruction set didn't satisfy the high-performance applications,
so they had to use a coprocessor (like the FPU) for that part of the
problem domain anyway.
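The small-constant case mentioned above is presumably handled by strength reduction of this sort -- a few shifts and adds in place of a multiply (my own illustration; the article doesn't give the actual Spectrum code sequences):

```c
#include <stdint.h>

/* Multiplication by a small constant decomposed into shifts and
 * adds, as a compiler for a multiply-less machine might emit it. */

/* 10*x = 8*x + 2*x: two shifts and an add. */
static uint32_t times10(uint32_t x) { return (x << 3) + (x << 1); }

/* 7*x = 8*x - x: one shift and a subtract. */
static uint32_t times7(uint32_t x) { return (x << 3) - x; }
```

On a machine where shifts and adds are single-cycle, either form is consistent with the article's "three or four cycles" figure.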
So are RISC designs going to be "general purpose enough"?  I think most
designers have taken a broad cross section of problem domains for their
design base, and the fundamental operations are similar enough for most
programs that this will hold true.  (Don't all programs spend their
time branching and looping?)  John Mashey (mips!mash) comments that
though MIPS hadn't targeted the business program domain specifically,
early experience is telling him that their machine does a reasonable
job with this kind of program.  I'm not surprised, since business
programs don't do anything remarkably different from text editors or
compilers.  On the other hand, I suspect that the MIPS machine isn't as
good (i.e., effective and cost-effective) as a CRAY X-MP (at the same
clock speed) at predicting tomorrow's weather for India.

Every computer is not the right solution for every problem.  As
technology changes, useful hardware/software solutions will change.
Architectures aren't forever -- especially today when [vu]lsi is moving
so fast.  One trick to survival for computer makers will be to very
carefully choose an architecture so that it makes sense with today's
technology and still makes sense with the technology available in the
medium future.

charlie
--
Charlie Price   {hao ihnp4 decvax}!stcvax!crp   (303) 673-5698
USnail: Storage Technology Corp - MD 3T / Louisville, CO / 80028
jer@peora.UUCP (J. Eric Roskos) (03/18/86)
Charlie Price at Storage Technology* writes, in response to my
suggesting that there are analogies between vertical migration and the
"optimizing CISC" description of RISC cache:

> One reason that vertical migration of frequent instruction sequences
> "down" into a lower "interpretation layer" differs between microcoded
> machines and RISC machines (as they are spoken about -- oops, I mean
> thought about) today is that the RISC interpretation layer is hardware
> (today).

While I do agree with the remainder of the posting, the above suggests
I wasn't clear about what I meant in making the analogy.

The goal of vertical migration is to identify frequently-occurring
sequences of operations and move them into the microcode.  The reason
for moving them into the microcode is the assumption that the microcode
is in a small, fast, but costly memory, and thus can be executed
faster.  In general, there's also the assumption that you can do more
in parallel in microcode, since horizontal microprograms are really
commands to control points in the machine rather than conventional,
familiar, sequential instructions; but how applicable this is depends
on how "horizontal" your microinstruction is.

The former assumption is the one used recently in here to claim that a
RISC is really an "optimizing CISC".  There is some moderate truth to
that, too, I think, although I have certain doubts at present.  A CISC
machine that has used the vertical migration techniques to give a good
CISC instruction set for a particular program (and note that indeed
there may be different instruction sets for different *programs*;
several years ago, when I was involved briefly in research in this area
under R. I. Winner, we were working on implementing dynamic switching
of control store under Unix in order to support this approach) consists
essentially of subroutine calls to RISC-like subroutines (actually
subroutines in the microprogram control store); the subroutine calls
are made out of a slower memory external to the CPU -- generally the
CPU treats this memory the way a user-level machine would treat a disk,
and has to explicitly address and fetch data out of the memory and
possibly wait for it to arrive -- and these calls are the CISC
instructions.

Now, in order for a RISC machine to work this way, i.e., in order for
the instructions in the cache to be somehow analogous to CISC
instructions in control store, there has to be both spatial locality
and locality of reference for the frequently-used instruction sequence.
At the time I made the original posting, I debated somewhat over
whether to raise this point, and decided not to, because it would be
somewhat difficult to determine exactly *how* similar this makes the
two kinds of machine.  But the basic problem is that, if you say the
RISC machine is an "optimizing CISC" because it keeps frequently
executed instruction sequences in cache, you have to have a compiler
that recognizes these frequently executed sequences and puts them in
one place (say, a subroutine that is repeatedly invoked).
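As a toy illustration of what such a transformation buys (the names and the operation sequence are hypothetical, my own sketch): a short sequence repeated inline at many sites is spread across memory, while the same sequence factored into one subroutine occupies a single small region the cache can keep resident:

```c
/* The recurring three-operation sequence, written once.  Every call
 * site now branches to this one region of memory, so after the
 * first miss the sequence stays cache-resident -- roughly what
 * vertical migration achieves by moving it into control store. */
static int scale_clamp(int x)
{
    x = x * 2 + 1;      /* the "migrated" sequence */
    if (x > 100)
        x = 100;
    return x;
}

/* The large outer loop invokes the factored-out subroutine instead
 * of carrying its own inline copy of the sequence at each site. */
int process(const int *data, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += scale_clamp(data[i]);
    return sum;
}
```

The behavior is identical to the inline version; only the placement in memory -- and hence what the cache can do with it -- changes.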
For example, having an identical sequence randomly distributed through
memory wouldn't work, because the cache works spatially; it doesn't
know anything about the instruction sequences per se, so you might
fetch the same sequence multiple times just because it was spread
through memory and thus not in the cache when you next needed it.**
On the other hand, if you put the sequence in a subroutine, and
aligned the subroutine so that it all fit into the cache and you could
keep it there while fetching instructions from somewhere else that
invoked that subroutine, then each time the subroutine was called it
would already be in the cache, and the operation would be analogous to
a CISC with vertical migration of the sequence that's in the
subroutine into microcode.

I'm not sure it's safe to say that makes the RISC with cache an
"optimizing CISC", though, since it requires as much information about
the original program as would be required to do the vertical migration
in the first place.  Which was why I said originally that I thought
the two areas of research probably had a lot in common and probably
would tend to converge.  However, the above is a fairly simplified
description, since there are a lot of questions in my own mind about
how comparable the two approaches actually are that would be hard to
address productively in this sort of discussion.

--------
* A company that makes amazing tape drives!  We have one of them here.

** Note that I am thinking here of the more general case in which you
just have a repeatedly occurring sequence you want to put in a faster
place; this might happen, for example, in a large program which has a
large loop that is repeated many times, such that the loop is too big
to fit into the cache at once, but contains sequences of operations
(maybe interspersed with various other differing operations of varied
lengths) that do recur repeatedly.
I didn't mention the obvious case in which you have a loop that can
fit entirely in the cache; and also I wasn't considering the benefits
of cache in a machine with a very wide bus, where you fetch 4 or 8
words into the cache all at the same time because it costs the same as
fetching 1 word.  Those are all special cases, they have been covered
a lot in here already, and it's been shown that they do provide
significant benefits themselves.
--
E. Roskos