franka@mmintl.UUCP (Frank Adams) (10/29/87)
[Not food]

I had an idea some time ago that I'm surprised I've never seen discussed.
Suppose, for example, that your instruction processor has four stages.  With
conventional pipelining, that means that four consecutive instructions from
the same program are at some stage of execution at the same time.

Instead, why not have four different execution threads being performed
simultaneously?  This eliminates the dependency checks and latency delays
inherent in "vertical" pipelining.  (Many RISCs put these into the compiler
instead of the architecture, but they're still there.)  On a multi-user
system with a reasonable load level, it seems to me that this should
represent a performance improvement.  Of course, it won't look good on the
standard benchmarks.

One drawback I can see is that multiple register sets must be kept active
simultaneously.  However, this doesn't seem like a major problem given
current technology.

Has anyone tried anything like this?  Is there some major problem I'm
overlooking?
--
Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108
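[To make the proposal concrete, a toy sketch of the issue rotation in C.
The thread count, the private register sets, and the one-line stand-in for
"execute" are all invented for illustration; this shows the discipline,
not any real machine's design:]

    /* Round-robin ("barrel") issue: one instruction per cycle, rotating
     * among four independent threads.  By the time a thread's next
     * instruction issues, four cycles later, its previous one has left
     * the four-stage pipe, so no interlocks or dependency checks apply. */
    #include <stdio.h>

    #define NTHREADS 4

    struct thread {
        int pc;         /* private program counter */
        int regs[16];   /* private register set: the duplicated state
                           the article admits as a drawback */
    };

    int main(void)
    {
        struct thread t[NTHREADS] = {{0}};
        int cycle;

        for (cycle = 0; cycle < 20; cycle++) {
            struct thread *cur = &t[cycle % NTHREADS]; /* barrel rotation */
            cur->regs[0] += cur->pc;     /* stand-in for "execute" */
            cur->pc++;
            printf("cycle %2d: thread %d issues its instruction %d\n",
                   cycle, cycle % NTHREADS, cur->pc);
        }
        return 0;
    }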
earl@mips.UUCP (Earl Killian) (11/01/87)
In article <2525@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
> Suppose, for example, that your instruction processor has four stages.  With
> conventional pipelining, that means that four consecutive instructions from
> the same program are at some stage of execution at the same time.
>
> Instead, why not have four different execution threads being performed
> simultaneously?

This is an old idea; it was done on the CDC 6600 i/o processors.  More
recently it was tried on the HEP, which wasn't very successful.  One of
the problems is that you end up with a multiprocessor built out of slow
uniprocessors, which is rarely successful when equivalent power
uniprocessors are available.

Besides N register sets, you also need N times larger caches to handle N
simultaneous working sets.  This is usually more expensive than
conventional pipelining.  Or you can eliminate the cache on the theory
that you'll run other tasks while you wait for memory, thereby providing
even slower uniprocessors (but perhaps more of them).
csg@pyramid.pyramid.com (Carl S. Gutekunst) (11/03/87)
In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>Instead, why not have four different execution threads being performed
>simultaneously?  This eliminates the dependency checks and latency delays
>inherent in "vertical" pipelining.

The early UNIVAC 1100-series processors did this.  The instruction fetch
rotated among the four pipelines, and when there were no dependency
problems between threads it executed them simultaneously.  There certainly
were dependency troubles; you could not have any thread manipulating the
same operands as any other thread.  I don't see how you could avoid this.

The UNIVAC 1110 had four pipelines; the 1100/60 and 1100/80 have two.  I
don't think the 1100/90 has more than one.  I don't know specifically why
Sperry dropped the four-stage parallel pipeline, although I can guess: it
was a lot of iron that was difficult to use effectively.

<csg>
rajiv@im4u.UUCP (Rajiv N. Patel) (11/04/87)
Summary: Horizontal pipelining may be good after all.
Distribution: World

In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>I had an idea some time ago that I'm surprised I've never seen discussed.
>Suppose, for example, that your instruction processor has four stages.  With
>conventional pipelining, that means that four consecutive instructions from
>the same program are at some stage of execution at the same time.
>
>Instead, why not have four different execution threads being performed
>simultaneously?  This eliminates the dependency checks and latency delays
>inherent in "vertical" pipelining.  (Many RISCs put these into the compiler
>instead of the architecture, but they're still there.)  On a multi-user
>system with a reasonable load level, it seems to me that this should
>represent a performance improvement.  Of course, it won't look good on the
>standard benchmarks.
>
>Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
>Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

I agree with Frank's reasoning.  Many RISCs have placed great emphasis on
software issues like an efficient compiler, yet it is still common to see
30-50% (a debatable figure) of the pipeline stages doing no fruitful work
(pipeline bubbles) due to data dependencies, latency delays, and branches
causing pipeline flushes.  Now if one were to introduce horizontal
pipelining, running more than one process in order to fill these pipeline
bubbles, I feel that though the single-process execution rate might go
down a little, the overall throughput of the processor would increase
dramatically.  Think of the hardware utilization available with such a
concept (I cannot quote figures, but have been told that hardware
utilization, in terms of active logic circuits per cycle on a chip, is
pretty low).

This concept may not appeal to those designers who want maximum throughput
for a single process, as in supercomputing problems, but it should
definitely appeal to designers of general-purpose chips, which are used
for everything from control applications to workstations that tend to run
many processes.

Benchmarking such architectures and comparing them to normal RISC/CISC
architectures is another big controversial issue.  I have still not been
able to figure out how to compare the two, but one way to make the
horizontally pipelined architecture look damn good is to compare hardware
utilization ratios, or to compare raw instructions executed per (some
million) cycle(s) for any process available to be executed.

Most of the comments I have made here are based on our studies here at UT
Austin on a computer architecture project which combines the RISC
philosophy with the concept of horizontal pipelining to give a
hardware-efficient processor with marginal hits on single-process
execution rates.

As mentioned by someone earlier on the net, the cache design for such an
architecture is the problem, as it has to cache multiple instruction
streams.  I feel that with progress in VLSI technology this problem will
not be as serious.  The Fairchild CLIPPER already has a 4K-byte cache
chip, and an 8K-byte cache chip should probably be a decent starting point
for a processor with, say, 2-4 processes able to execute concurrently.

Well, I have certainly raised a lot of issues here which many would like
to criticize or comment on.  Please feel free to do so; this may help us
here on our research work.

Rajiv Patel.
(rajiv@im4u.utexas.edu)
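[Putting illustrative numbers on that argument (made-up figures, not
measurements): suppose a single stream keeps only 60% of issue slots busy,
i.e. 40% bubbles.  Now interleave two streams, so that each thread's
consecutive instructions sit two slots apart and many short hazards
disappear.  If the bubble rate falls to 10%, aggregate throughput rises
from 0.6 to 0.9 instructions per cycle, a 50% gain, even though each
individual process now runs at only about 0.45 instructions per cycle.]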
mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) (11/05/87)
The HEP was (is? are the few that were built still running?) a damnfine
machine for several interesting classes of problems--especially problems
which can be cast in a reduction-like paradigm.  It was also a very
interesting cryptanalysis engine.

As a box, it did not so much "suffer" from "bad engineering" (the
engineering was damnfine considering the screwball funding or lack
thereof) as from corporate mismanagement.  First, the company brought in
new "leverage", "market-oriented" management just after the first couple
of production HEP-1s were delivered.  Synonyms: leverage==>what Boone
Pickens and Carl Icahn are good at; market-oriented==>sell the sizzle, not
the steak.  Second, the new management spent all their working funds on
managing, hiring more managers, and putting on shows to impress
themselves--rather than on evolving the proof-of-concept HEP-1 into a
really workable HEP-2.

The HEP-1 never lived up to the promise of the HEP concept because it was
never really intended or expected to do so, but it !did! prove the value
of the concept and (to a few folk) open the door to doing some new things.
The HEP-2 would have more than met the promise if it had not been starved
and bludgeoned to death before it was ever really born.

The last I heard of HEP's daddy, Burton Smith, he was at the Nat'l
Supercomputer Center, still (at least tentatively) pondering how
son-of-HEP might be brought into existence.

Anyone who wants to learn how to totally destroy a concept and a company
at the same time as (%&^%$%^&*&^&^$%&&^*&%)-ing some fine people should
carefully study the history of Denelcor and the HEP.  Anyone who wants to
learn how to come up with a better idea should do likewise, and also talk
to Burton.  Anyone who wants to get rich while making one heck of a
technological splash should buy the rights to the HEP, take them into an
intelligent organization, and get their tail in gear.
daveb@geac.UUCP (11/06/87)
In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
| I had an idea some time ago that I'm surprised I've never seen discussed.
| Suppose, for example, that your instruction processor has four stages.  With
| conventional pipelining, that means that four consecutive instructions from
| the same program are at some stage of execution at the same time.
|
| Instead, why not have four different execution threads being performed
| simultaneously?

A logically different but physically similar technique is used by the
Nippon Electric Company's DPS-90 series of processors: they keep three
pipelines around for pre-evaluating code down three possible branches.
This cuts down on so-called "pipeline breaks" most wonderfully in programs
containing lots of branch instructions.
--
David Collier-Brown.                  {mnetor|yetti|utgpu}!geac!daveb
Geac Computers International Inc.,    | Computer Science loses its
350 Steelcase Road, Markham, Ontario, | memory (if not its mind)
CANADA, L3R 1B3 (416) 475-0525 x3279  | every 6 months.
daveb@geac.UUCP (11/06/87)
In article <862@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes:
>In article <2525@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
>> Instead, why not have four different execution threads being performed
>> simultaneously?
>
>This is an old idea; it was done on the CDC 6600 i/o processors.  More
>recently it was tried on the HEP, which wasn't very successful.
> [explanation truncated]

Many moons ago, ICL (the British mainframers) tried what my boss called
"half-caches", which, as you might guess from the name, allowed a cheap
and easy swap from a running program to a ready-to-run program.  The
system was really quite elegant (i.e., it was far easier than it looked),
but I haven't heard anything on it since.  Any ICLians out there?
--
David Collier-Brown.                  {mnetor|yetti|utgpu}!geac!daveb
Geac Computers International Inc.,    | Computer Science loses its
350 Steelcase Road, Markham, Ontario, | memory (if not its mind)
CANADA, L3R 1B3 (416) 475-0525 x3279  | every 6 months.
franka@mmintl.UUCP (Frank Adams) (11/06/87)
In article <9398@pyramid.pyramid.com> csg@pyramid.UUCP (Carl S. Gutekunst) writes:
|In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
|>Instead, why not have four different execution threads being performed
|>simultaneously?  This eliminates the dependency checks and latency delays
|>inherent in "vertical" pipelining.
|
|The early UNIVAC 1100-series processors did this.  There certainly were
|dependency troubles; you could not have any thread manipulating the same
|operands as any other thread.  I don't see how you could avoid this.

There is a fairly standard programming model these days that says that
code is always in write-protected memory, and read-write memory is not
shared between processes.  Some machines more or less require this.  If
this model is used, there are no dependency problems.

|I don't know specifically why Sperry dropped the four-stage parallel
|pipeline, although I can guess: it was a lot of iron that was difficult
|to use effectively.

It doesn't seem like it would be that hard to use effectively in a
multi-user environment.  A bit of sophistication is required of the
operating system, but user programs should be unaffected.  My own guess is
that this kind of design suffers in the market because it does poorly on
standard performance measurements.
--
Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108
atbowler@orchid.UUCP (11/14/87)
In article <1782@geac.UUCP> daveb@geac.UUCP (Dave Collier-Brown) writes:
>In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>| Instead, why not have four different execution threads being performed
>| simultaneously?
>
> A logically different but physically similar technique is used by
>the Nippon Electric Company's DPS-90 series of processors: they keep
>three pipelines around for pre-evaluating code down three possible
>branches.  This cuts down on so-called "pipeline breaks" most
>wonderfully in programs containing lots of branch instructions.

It is amazing how people get excited over reinventing some really old
techniques.  "4 different execution threads" sounds like a conventional
multiprocessor system to me.  The GE-600 and its successors (including the
above-mentioned DPS-90) have been doing this since the early sixties.  The
multiple-pipeline technique within a single processor that Dave describes
is a somewhat newer technique (early seventies, when I heard IBM describe
it for the 370/168) for getting faster performance from a single
instruction thread (processor).

Besides higher system throughput, the multiprocessor approach increases
reliability, since it is easy to get the system going again if one of the
processors fails.  GCOS is perfectly happy to let a CPU be released,
serviced by the field engineer, and returned to use with the users never
seeing anything happen except some response-time degradation.  DEC
supported multiprocessors with the PDP-10 (a.k.a. system-20) and Univac
did it with the 1100 series.  IBM announced it several times, but until
the Sierra series never figured out how to make the operating system
handle it, so I suppose in the minds of many people multiprocessors are
brand new.
rick@svedberg.bcm.tmc.edu (Richard H. Miller) (11/16/87)
In article <11711@orchid.waterloo.edu>, atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
> Besides higher system throughput, the multiprocessor approach increases
> reliability, since it is easy to get the system going again if one of the
> processors fails.  GCOS is perfectly happy to let a CPU be released,
> serviced by the field engineer, and returned to use with the users never
> seeing anything happen except some response-time degradation.  DEC
> supported multiprocessors with the PDP-10 (a.k.a. system-20) and Univac
> did it with the 1100 series.  IBM announced it several times, but until
> the Sierra series never figured out how to make the operating system
> handle it, so I suppose in the minds of many people multiprocessors are
> brand new.

Two points about the topics raised in the above.

Unisys (a.k.a. Sperry) still does it with the 1100 architecture.  We can
have up to four processors sharing common memory and executing common code
(although an activity will only be running on one processor at a time).
If one processor fails, the system usually will be able to automatically
down the failing instruction processor without the user being aware that
there is a problem.  This also extends to the mass storage (memory, not
disk), IO processors, channels and controllers.  In fact, under certain
circumstances, different architectures can share common memory (the AVP on
the 1100/70 allows an OS/3 system to run using the same memory as the
1100, and the ISP [Integrated Scientific Processor] allows a vector
machine to run with the 1100/90 IP and mass storage).  Having been exposed
to many types of architectures and machines, my view is that the 1100 line
provides one of the nicest system environments for a large mainframe
system.  (I do have a bias towards this system since it is our primary
mainframe.)

The second point concerns the PDP-10.  As I remember, TOPS-20 was never
able to fully support symmetric multiprocessing, but the TOPS-10 system
did.  Until the 7.01 release of TOPS-10, the PDP-10 systems used a
master/slave relationship in multiprocessing.  As of this release, TOPS-10
was able to support true symmetric multiprocessing in that there was no
master CPU.  I/O could run from either CPU, and it provided a true use of
both processors.  (Up until this release, the second processor usually was
NOT as highly used as the master, since I/O could only be handled by the
master CPU, and unless a shop was very CPU-bound, it would not be able to
utilize the second CPU fully.)  The released product supported up to 4
CPUs, and several large TOPS-10 sites did run a quad system (although it
was never *officially* supported).  Since we dropped out of the DEC world
about 7 years ago, I never did hear if TOPS-20 ever got the capability.

It is interesting that what I have read in the trades indicates that the
VAX is to be given this capability that the TOPS-10 system had almost 7
years ago.  Sigh.

Richard H. Miller                   Email: rick@svedburg.bcm.tmc.edu
Head, System Support                Voice: (713)799-4511
Baylor College of Medicine          US Mail: One Baylor Plaza, 302H
                                             Houston, Texas 77030
daveb@geac.UUCP (11/16/87)
In article <11711@orchid.waterloo.edu> atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
> ...  The
>multiple-pipeline technique within a single processor that Dave
>describes is a somewhat newer technique (early seventies, when I
>heard IBM describe it for the 370/168) for getting faster performance
>from a single instruction thread (processor).

Hi, Alan!  I confess I don't remember IBM actually *having* the multiple
pipelines in the /168... although I do remember it being proposed when
UofWindsor still had one.  Did they get it shoehorned in?

--dave (formerly brown) c-b
--
David Collier-Brown.                  {mnetor|yetti|utgpu}!geac!daveb
Geac Computers International Inc.,    | Computer Science loses its
350 Steelcase Road, Markham, Ontario, | memory (if not its mind)
CANADA, L3R 1B3 (416) 475-0525 x3279  | every 6 months.
mrc@CAM.UNISYS.COM (McLean Research Center guest) (11/16/87)
In article <11711@orchid.waterloo.edu>, atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
> ...
> GCOS is perfectly happy to let a CPU be released, serviced by the
> field engineer, and returned to use with the users never seeing
> ...
> and Univac did it with the 1100 series.  IBM announced it several
> times, but until the Sierra series never figured out how to make
> the operating system handle it,...

Well, I must take exception: IBM did it in the System/360-40V (a
prototype), S/360-67 (production), S/370-158-AP, S/370-158-MP,
S/370-168-AP, and S/370-168-MP (all 4 as production).  Of course, in the
S/370s, they bollixed the I/O channel addressing scheme.

True, they never did generally release a true multi-CPU opsys in their
main product line.  DOS never would have had a chance of running
multi-CPU; OS/MFT (which became SVS) never would have had a chance,
either; OS/MVT (when it became MVS) could have had a chance, but they made
a political decision.

But they did have a system which did it all, namely TSS/360 --> TSS/370.
TSS was strictly a DATbox/relocating/virtual-memory system, and the
S/360-67 fit it (or vice versa) beautifully.  TSS was even so
open-structured that it was ported to S/370 in about a year (calendar) and
did everything that it did before, except for a couple of features
designed out of the S/370 boxes.

TSS was inherently n-plex: the released version could handle 4 CPUs and 4
I/O channel CONTROLLERS (that's what the left nibble of the two-byte I/O
address was really for), with up to 16 channels per controller; it
understood channels as logically standalone and switchable/shareable
across/between controllers, and up to 16 memory frames as partitionable,
re-base-addressable, and reconfigurable.

The most-plex customer system installed for S/360 had 2 CPUs, 3 CCUs, 8
MUs (I think).  The most-plex customer system installed for S/370 had 2
CPUs (168-MP), logically 2 CCUs (really each CPU frame's imbedded channel
set), logically 16? MUs (severable chunks within the CPU frames' imbedded
memory).  The most-plex ever run, up in the Mohansic Lab, and then at
Poughkeepsie (914 bldg?), is reputed to have been a 4x4x16 S/360-67.  And
it was, as some might say, a HUMMER!

Just for drill, consider the fact that a 2-CPU S/370-168-MP ran a
throughput of 2.1 to 2.25 over a 1-CPU system, depending on just what we
gave it for workload.

There ain't nothing new under the sun.

Regardz,
Ken
doug@edge.UUCP (11/19/87)
Very minor and unimportant historical correction:

>OS/MFT (which became SVS) never would have had a chance, either;
>OS/MVT (when it became MVS) could have had a chance, but they

OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for
OS/VS2.  Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2
Release 1 became known to history as SVS.
--
Doug Pardee -- Edge Computer Corp., Scottsdale, AZ -- ihnp4!oliveb!edge!doug
mrc@CAM.UNISYS.COM.UUCP (11/20/87)
In article <988@edge.UUCP> doug@edge.UUCP (Doug Pardee) writes:
>OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for
>OS/VS2.  Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2
>Release 1 became known to history as SVS.

I stand corrected.

Ken
dmt@ptsfa.UUCP (11/21/87)
In article <393@sdcjove.CAM.UNISYS.COM> mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) writes:
>In article <988@edge.UUCP> doug@edge.UUCP (Doug Pardee) writes:
>>OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for
>>OS/VS2.  Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2
>>Release 1 became known to history as SVS.
>

I thought that OS/VS1 was called SVS.

Locally, we called VS1, SVS and VS2 (release 1), MVS right from the start.
IBM didn't like the labels.  Finally, they called OS/VS2 Release 2, MVS.
--
Dave Turner     415/542-1299    {ihnp4,lll-crg,qantel,pyramid}!ptsfa!dmt
lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (11/21/87)
This discussion needs a new title.  It started with the Denelcor style of
shared-functional-unit multiprocessors; wandered all unbeknownst into
conventional multiprocessors (huh?!?); and now it's a history of IBM
nomenclature (pardon ????).  To get back to the original subject:

There are two reasons to share functional units.
 - cost, or, if you will, duty cycle.
 - simplicity (in the sense of RISCness).

The duty cycle argument says that if a unit is rarely used, then you get a
more effective design by sharing it among all the instruction-issue units.
Note that a lot of the average Cray sits idle while the rest is being
useful.  The counter-argument is that decreasing {prices, power
consumption, etc} make sharing less of a win.  Plus, sharing puts
constraints on packaging - you have to get there from here.

The simplicity argument says that since successive clocks are on behalf of
different threads, the pipelines need no interlocks.  This should lead to
lean, mean pipelines with good clock rates.  The problem is that to do
this, you would have to put interlocks on the pipe entrances, to resolve
the asynchronous demands for service.  Denelcor solved that by sharing the
instruction-issue unit, using a queue.  (When an answer came out, its
thread became eligible for another issue.)

The problem is that any single sequential program is now unable to issue
instructions at the full rate.  So, the machine is only a win for
timesharing loads, or for multithread applications.  Obviously, you can do
a sort of "fork" in a few clocks.  Denelcor argued that fine-grained fork
would pick up wins, and Alliant seems to be getting mileage from such
forking.

If you assume a single-chip CPU, I guess it's a bad idea.
--
Don             lindsay@k.gp.cs.cmu.edu    CMU Computer Science
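[A sketch of that issue-by-queue discipline, in C.  The thread count, pipe
depth, and bookkeeping here are assumptions made for illustration, not
Denelcor's actual design:]

    /* Threads wait in a ready queue; at most one issues per cycle.  A
     * thread is re-enqueued only when its answer emerges from the pipe,
     * so the pipe itself needs no interlocks, and a single thread can
     * never issue faster than the pipe's round-trip time. */
    #include <stdio.h>

    #define NTHREADS   8
    #define PIPE_DEPTH 4    /* cycles from issue to answer (assumed) */

    static int queue[NTHREADS], head, tail, count;
    static int inflight[PIPE_DEPTH]; /* thread whose answer returns at
                                        this slot, or -1 if none */

    static void enqueue(int tid)
    { queue[tail] = tid; tail = (tail + 1) % NTHREADS; count++; }

    static int dequeue(void)
    { int t = queue[head]; head = (head + 1) % NTHREADS; count--; return t; }

    int main(void)
    {
        int cycle, i;

        for (i = 0; i < PIPE_DEPTH; i++) inflight[i] = -1;
        for (i = 0; i < NTHREADS; i++) enqueue(i);  /* all start ready */

        for (cycle = 0; cycle < 24; cycle++) {
            int slot = cycle % PIPE_DEPTH;
            if (inflight[slot] >= 0)      /* an answer came out, so that */
                enqueue(inflight[slot]);  /* thread is eligible again    */
            if (count > 0) {
                inflight[slot] = dequeue();
                printf("cycle %2d: issue for thread %d\n",
                       cycle, inflight[slot]);
            } else {
                inflight[slot] = -1;      /* nothing ready: a bubble */
            }
        }
        return 0;
    }

[Set NTHREADS to 1 and the lone thread issues only once every PIPE_DEPTH
cycles, which is exactly the single-stream penalty described above.]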
mash@mips.UUCP (John Mashey) (11/22/87)
In article <380@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
>This discussion needs a new title...
>
>There are two reasons to share functional units.
> - cost, or, if you will, duty cycle.
> - simplicity (in the sense of RISCness).
>
>The duty cycle argument says that if a unit is rarely used, then you get a
>more effective design by sharing it among all the instruction-issue units.
>Note that a lot of the average Cray sits idle while the rest is being
>useful.  The counter-argument is that decreasing {prices, power
>consumption, etc} make sharing less of a win.  Plus, sharing puts
>constraints on packaging - you have to get there from here.
>
>If you assume a single-chip CPU, I guess it's a bad idea.

That's the critical observation, and observe that an increasing piece of
the computing spectrum is being dominated by single-chip CPUs, whose
design tradeoffs are very different from having boards full of [TTL, ECL,
etc] logic.  For example, if you want to micro-time-slice N processes, you
must provide N sets of the highest-speed state in the memory hierarchy
[registers], and in fact, you'd probably want N sets of caches also.
[Think about having N processes thrashing around interleaved in the same
cache: it is hard to see how this will help your hit rates very much.
TLBs likewise.]  If you were building CPUs that were multiple boards
anyway, it might not be impossible to replicate the registers without
incurring awful speed penalties: there will be a limit, but certainly,
successful systems have been built this way, if only to minimize context
switching time.  Board yields don't drop like a stone just because you
used a little more space.  On the other hand, if it's VLSI, you can be up
against serious limits, and you have to think hard about what's on the
chips.

Finally, here are the reasons why the "single-chip" observation is the
critical one.  I might be accused of bias on the following conjectures,
but I don't think they're too far out of line:

1) Each year, an increasing proportion of newly-installed computers (both
   units and $$) will be based on single-chip CPUs.
2) Single-chip solutions already dominate the low end, and they keep
   moving up.  The only way some of the existing architectures compete
   there is by VLSIing as quickly as possible [microVAXen, for example].
3) Solutions that are not single-chip (or very small chip count) will
   increasingly be:
	a) Highest-end supercomputers
	b) Upward extensions of existing product lines that didn't start
	   life as single-chip CPUs
	c) "Unusual" architectures in the mini-super arena, which can
	   often support anything if it solves some class of problem
	   enough more cost-effectively than other available ones.
4) It's hard to believe there will be ANY more new computer architectures
   in the low-to-mid range of computing that aren't single-chip VLSI
   micros.  (Oops: qualify that: SUCCESSFUL architectures.)  Note that
   low-to-mid range means shippable 10-mips uniprocessors in 1987, 20-mips
   in 1988, >40 in 1989.
5) To summarize: for general-purpose computing, the time-slicing hardware
   approach seems doomed to niches at best, because it runs right against
   the likely design trends of the next few years.  This does leave the
   question of identifying the niches that might be possible.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) (11/23/87)
Having had a couple of replies re my discussion of IBM OS/VS etymology...

MFT (Multiprogramming, Fixed number of Tasks) became OS/VS1 and stayed.
MVT (" Variable...) became OS/VS2, and then...
  ...thru release 1.highest, used only a _single_ problem-state address
     space and was thus known as SVS (Single Virtual Storage);
  ...with release 2.0 and later, used (supposedly) _multiple_
     problem-state address spaces and was thus known as MVS (Multiple...).

Sokay?

Regardz,
Ken
kenm@sci.UUCP (Ken McElvain) (11/24/87)
In article <958@winchester.UUCP>, mash@mips.UUCP (John Mashey) writes:
> In article <380@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
+ >This discussion needs a new title...
+ >
+ >There are two reasons to share functional units.
+ > - cost, or, if you will, duty cycle.
+ > - simplicity (in the sense of RISCness).
+ >
+ >The duty cycle argument says that if a unit is rarely used, then you get a
+ >more effective design by sharing it among all the instruction-issue units.
+ >Note that a lot of the average Cray sits idle while the rest is being
+ >useful.  The counter-argument is that decreasing {prices, power
+ >consumption, etc} make sharing less of a win.  Plus, sharing puts
+ >constraints on packaging - you have to get there from here.
+
+ >If you assume a single-chip CPU, I guess it's a bad idea.
+
+ That's the critical observation, and observe that an increasing piece of
+ the computing spectrum is being dominated by single-chip CPUs, whose
+ design tradeoffs are very different from having boards full of [TTL, ECL,
+ etc] logic.  For example, if you want to micro-time-slice N processes, you
+ must provide N sets of the highest-speed state in the memory hierarchy
+ [registers], and in fact, you'd probably want N sets of caches also.
+ [Think about having N processes thrashing around interleaved in the same
+ cache: it is hard to see how this will help your hit rates very much.
+ TLBs likewise.]  If you were building CPUs that were multiple boards
+ anyway, it might not be impossible to replicate the registers without
+ incurring awful speed penalties: there will be a limit, but certainly,
+ successful systems have been built this way, if only to minimize context
+ switching time.  Board yields don't drop like a stone just because you
+ used a little more space.  On the other hand, if it's VLSI, you can be up
+ against serious limits, and you have to think hard about what's on the
+ chips.

I agree that cache [or TLB] hit rates will almost certainly go down.
However, miss penalties will also drop.  It is quite possible that a cache
fill could happen in the time it takes for the barrel to turn around.  A
ten-stage barrel processor running at 25 MHz would easily allow over
300 ns for a cache fill before it cost another instruction slot.  The
performance limit here is likely to be the bandwidth of the cache fill
mechanism.

Another issue is the instruction set.  It's not clear that you want a
bunch of registers.  It may be much better to do more of a memory-to-memory
architecture.  (I would recommend keeping some base registers.)
A number of other areas also have some surprising tradeoffs.

Ken McElvain
Silicon Compiler Systems
decwrl!sci!kenm
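[The arithmetic behind that claim, using only the numbers already given:
at 25 MHz a cycle is 40 ns, so a 10-slot barrel returns to any given
thread every 10 x 40 = 400 ns.  On one plausible reading, a miss detected
in a thread's own slot leaves the other nine slots, 360 ns, for the fill
to complete before that thread's next turn; hence "over 300 ns" before the
miss costs an instruction slot.]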
mash@mips.UUCP (John Mashey) (11/25/87)
In article <11444@sci.UUCP> kenm@sci.UUCP (Ken McElvain) writes:
>In article <958@winchester.UUCP>, mash@mips.UUCP (John Mashey) writes:
>I agree that cache [or TLB] hit rates will almost certainly go down.
>However, miss penalties will also drop.  It is quite possible that a cache
>fill could happen in the time it takes for the barrel to turn around.  A
>ten-stage barrel processor running at 25 MHz would easily allow over
>300 ns for a cache fill before it cost another instruction slot.  The
>performance limit here is likely to be the bandwidth of the
>cache fill mechanism.
 ^^^^^^^^^^^^^^^^^^^^^^
yes.  I believe that there is more interference than you might think,
although it would be nice to see simulation numbers, since I don't have
any.  Let's try a few quick assumptions.

Assume we're using split I & D caches.  Assume that the cache line is N
words long, filled 1 word/cycle after a latency of L cycles.  One would
expect that efficient cache designs have L <= N.  When filling an I-cache
miss, you can do L more barrel slots, then you must stall for N slots (or
equivalent), because it doesn't make sense to have the I-cache run faster
than the chip (if it did, you would run the chip faster).  Putting an
I-cache on the chip just moves the problem around.  Assuming L <= N, this
says that when you hit an I-cache miss, you get at most 50% of the total
refill time (L+N) in which you can actually initiate new instructions.

D-cache refill is a little less painful, in that only 30% of the
instructions are loads/stores (on our systems, but typical), so that you
don't block, or skip something, until you hit a load/store.  I'm not sure
what you do when you execute something that causes a cache miss while
you're already in a cache miss; maybe just block.  Of course, I & D cache
refills run into each other, and if you're using write-thru caches, writes
run into refills also.

These numbers seem to indicate that maybe a 2-way barrel might be
possible, but much above that, very little benefit can be gained from
overlapping cache refill with execution.

>Another issue is the instruction set.  It's not clear that you want a
>bunch of registers.  It may be much better to do more of a memory-to-memory
>architecture.  (I would recommend keeping some base registers.)
>A number of other areas also have some surprising tradeoffs.
-----------------------------------^^^^
Please explain some more.  Note that in a memory-memory architecture, what
I said about 30% load/store above also gets worse.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
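[A worked instance of that 50% bound, with invented values: take L = 4 and
N = 8.  After an I-cache miss, issue continues for L = 4 slots, then
stalls for N = 8 while the line fills: 4 useful slots out of L + N = 12,
about 33%.  Even in the best case L = N (say 8 and 8), you get 8 of 16
slots, exactly 50%.]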
jay@splut.UUCP (Jay Maynard) (11/25/87)
In article <3801@ptsfa.UUCP>, dmt@ptsfa.UUCP (Dave Turner) writes:
> >In article <988@edge.UUCP> doug@edge.UUCP (Doug Pardee) writes:
> >>OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for
> >>OS/VS2.  Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2
> >>Release 1 became known to history as SVS.
> I thought that OS/VS1 was called SVS.
> Locally, we called VS1, SVS and VS2 (release 1), MVS right from the start.
> IBM didn't like the labels.  Finally, they called OS/VS2 Release 2, MVS.

Actually, Doug had it right.  OS/VS2 release 1 was called SVS because it
provided a single virtual address space (0-16 MB).  OS/VS2 release 2 (did
anyone actually run it, or was release 3 the first to see actual use?  I
was never clear on that point) provided multiple virtual address spaces,
with each user's storage (except for defined common areas - the nucleus
and common work areas) separate and unaddressable between address spaces.
One address space was (normally) assigned to each TSO user, each batch
job, and each started task.  There are, in later releases, several
auxiliary address spaces for system use.

Source for most of this: IBM manual _MVS System Overview_ (no, I don't
have the number - the book's at the office, and I'm at home).

OS/VS1 was simply called VS1, except for those people who called it bleh...
--
Jay Maynard, K5ZC (@WB5BBW)...>splut!< | uucp: uunet!nuchat!splut!jay
Never ascribe to malice that which can | or: academ!uhnix1!--^
adequately be explained by stupidity.  | GEnie: JAYMAYNARD CI$: 71036,1603
The opinions herein are shared by none of my cats, much less anyone else.
bcase@apple.UUCP (Brian Case) (11/25/87)
In article <11444@sci.UUCP> kenm@sci.UUCP (Ken McElvain) writes:
[Seems to be talking about something like the PPUs of the old Cybers.]
>I agree that cache [or TLB] hit rates will almost certainly go down.
>However, miss penalties will also drop.  It is quite possible that a cache
>fill could happen in the time it takes for the barrel to turn around.  A
>ten-stage barrel processor running at 25 MHz would easily allow over
>300 ns for a cache fill before it cost another instruction slot.  The
>performance limit here is likely to be the bandwidth of the cache fill
>mechanism.

Yes, but if a fair fraction of the processors in the barrel are causing
misses (say 3 or so), then your memory system will have to be multiported
(or very fast, in which case why not just one fast processor?).  This
doesn't invalidate what you are saying; just an observation.

>Another issue is the instruction set.  It's not clear that you want a
>bunch of registers.  It may be much better to do more of a memory-to-memory
>architecture.  (I would recommend keeping some base registers.)
>A number of other areas also have some surprising tradeoffs.

I fail to see why memory-memory would be better than registers.  Can you
give some proof?  Also, what other areas have surprising tradeoffs, and
what are they?
franka@mmintl.UUCP (Frank Adams) (11/26/87)
In article <958@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
|In article <380@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
|>If you assume a single-chip CPU, I guess it's a bad idea.
|
|That's the critical observation, and observe that an increasing piece
|of the computing spectrum is being dominated by single-chip CPUs,
|whose design tradeoffs are very different from having boards full of
|[TTL, ECL, etc] logic.

This is a good point, but unless I am missing something, it is only a
temporary one.  Surely we will reach the point, not too many years from
now, when the logic which now fills many boards will all fit on one chip.
At that point, the arguments for horizontal pipelining on a single-chip
CPU will be as strong as they are today for multi-chip CPUs.  Won't they?

Another trend which might doom the idea is the one towards individual
(single-user) computers.  The future of multi-tasking on such machines is
very much in question; if it becomes a big thing, there is no problem.
Otherwise, we are left with the relatively few (but, on average,
higher-powered) time-shared systems.
--
Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108
mash@mips.UUCP (11/30/87)
In article <2581@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>|That's the critical observation, and observe that an increasing piece
>|of the computing spectrum is being dominated by single-chip CPUs,
>|whose design tradeoffs are very different from having boards full of
>|[TTL, ECL, etc] logic.
>This is a good point, but unless I am missing something, it is only a
>temporary one.  Surely we will reach the point, not too many years from
>now, when the logic which now fills many boards will all fit on one chip.
>At that point, the arguments for horizontal pipelining on a single-chip
>CPU will be as strong as they are today for multi-chip CPUs.  Won't they?

Maybe, maybe not: it seems to me there are packaging issue differences.
(I'm not a packaging expert; maybe someone who is will contribute.)
However, I'd observe that I've seen a lot of multi-board designs that were
able to afford to have lots of busses, and there didn't seem to be huge
discontinuities in cost when you needed to make something a little bigger,
standard form factors notwithstanding.  On the other hand, it seems that
with VLSI design, at any given point in time:
a) Adding pins can lead to big discontinuities in costs.
b) # signal pins and/or speeds may be severely limited by the testers
   available.
c) So far, there are obvious good uses for additional pins.
Maybe cheap, testable packages get big enough sometime that we can have as
many pins as we want....

>Another trend which might doom the idea is the one towards individual
>(single-user) computers.  The future of multi-tasking on such machines is
>very much in question; if it becomes a big thing, there is no problem.

Hopefully, multi-tasking will some year come to single-user computers :-)

>Otherwise, we are left with the relatively few (but, on average,
>higher-powered) time-shared systems.

Note: perhaps my earlier posting was not clear enough; let me add some
more.  I'd said that an increasing part of computing will be taken over by
VLSI small-#-of-chip solutions, either in single-CPU units or in
multiprocessors.  The main reason for this belief wasn't technical, but
economic.  To make a barrel-style processor economically viable (versus
ganging micros together and riding their performance curves at low cost),
it needs to have a truly compelling cost/performance advantage.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
elg@killer.UUCP (12/06/87)
In article <1006@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <2581@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>Another trend which might doom the idea is the one towards individual
>>(single-user) computers.  The future of multi-tasking on such machines is
>>very much in question; if it becomes a big thing, there is no problem.
>
>Hopefully, multi-tasking will some year come to single-user computers :-)

Actually, multi-tasking single-user computers have been available for
years.  OS-9 on the TRS-80 Color Computer, for example, and AmigaDOS on
the Commodore Amiga.  Just because the IBM PEE-CEE and Apple Macintosh
don't have a multitasking operating system doesn't mean that the rest of
the world is stuck with single-tasking (and note that both IBM and Apple
intend to introduce multitasking OS's Real Soon Now).  I think that we'll
see the demise of ancient CP/M-derived operating systems Real Soon Now (as
Marketing would say :-).
--
Eric Green   elg@usl.CSNET       Snail Mail P.O. Box 92191
{cbosgd,ihnp4}!killer!elg        Lafayette, LA 70509

Hello darkness my old friend, I've come to talk with you again....