[comp.arch] Horizontal pipelining

franka@mmintl.UUCP (Frank Adams) (10/29/87)

[Not food]

I had an idea some time ago that I'm surprised I've never seen discussed.
Suppose, for example, that your instruction processor has four stages.  With
conventional pipelining, that means that four consecutive instructions from
the same program are at some stage of execution at the same time.

Instead, why not have four different execution threads being performed
simultaneously?  This eliminates the dependency checks and latency delays
inherent in "vertical" pipelining.  (Many RISCs put these into the compiler
instead of the architecture, but they're still there).  On a multi-user
system with a reasonable load level, it seems to me that this should
represent a performance improvement.  Of course, it won't look good on the
standard benchmarks.
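A toy sketch of the round-robin scheme (the four-stage pipeline and thread
numbering are assumptions, purely for illustration):

```python
# Toy model of "horizontal" pipelining: each cycle, the issue stage takes
# an instruction from the next thread in round-robin order.  With as many
# threads as stages, no thread ever has two instructions in flight at
# once, so no intra-thread dependency checks or bypasses are needed.

NUM_STAGES = 4

def barrel_schedule(num_threads, num_cycles):
    """Return, per cycle, the thread id occupying each pipeline stage."""
    pipeline = [None] * NUM_STAGES      # stage i holds a thread id (or None)
    trace = []
    for cycle in range(num_cycles):
        # a new instruction enters stage 0; everything else advances one stage
        pipeline = [cycle % num_threads] + pipeline[:-1]
        trace.append(list(pipeline))
    return trace

trace = barrel_schedule(num_threads=4, num_cycles=8)
for stages in trace:
    active = [t for t in stages if t is not None]
    assert len(set(active)) == len(active)   # no thread appears twice
```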

One drawback I can see is that multiple register sets must be kept active
simultaneously.  However, this doesn't seem like a major problem given
current technology.

Has anyone tried anything like this?  Is there some major problem I'm
overlooking?
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

earl@mips.UUCP (Earl Killian) (11/01/87)

In article <2525@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
> Suppose, for example, that your instruction processor has four stages.  With
> conventional pipelining, that means that four consecutive instructions from
> the same program are at some stage of execution at the same time.
> 
> Instead, why not have four different execution threads being performed
> simultaneously?

This is an old idea; it was done on the CDC 6600 i/o processors.  More
recently it was tried on the HEP, which wasn't very successful.

One of the problems is that you end up with a multiprocessor built out
of slow uniprocessors, which is rarely successful when equivalent power
uniprocessors are available.

Besides N register sets, you also need N times larger caches to handle
N simultaneous working sets.  This is usually more expensive than
conventional pipelining.  Or you can eliminate the cache on the theory
that you'll run other tasks while you wait for memory, thereby
providing even slower uniprocessors (but perhaps more of them).

csg@pyramid.pyramid.com (Carl S. Gutekunst) (11/03/87)

In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>Instead, why not have four different execution threads being performed
>simultaneously?  This eliminates the dependency checks and latency delays
>inherent in "vertical" pipelining.

The early UNIVAC 1100-series processors did this. The instruction fetch rotated
among the four pipelines, and when there were no dependency problems
between threads it executed them simultaneously. There certainly were
dependency troubles; you could not have any thread manipulating the same
operands as any other thread. I don't see how you could avoid this.

The UNIVAC 1110 had four pipelines; the 1100/60 and 1100/80 have two. I don't
think the 1100/90 has more than one. I don't know specifically why Sperry
dropped the four-stage parallel pipeline, although I can guess: it was a lot
of iron that was difficult to use effectively.

<csg>

rajiv@im4u.UUCP (Rajiv N. Patel) (11/04/87)

Summary: Horizontal pipelining may be good after all.

Distribution: World


In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>I had an idea some time ago that I'm surprised I've never seen discussed.
>Suppose, for example, that your instruction processor has four stages.  With
>conventional pipelining, that means that four consecutive instructions from
>the same program are at some stage of execution at the same time.
>
>Instead, why not have four different execution threads being performed
>simultaneously?  This eliminates the dependency checks and latency delays
>inherent in "vertical" pipelining.  (Many RISCs put these into the compiler
>instead of the architecture, but they're still there).  On a multi-user
>system with a reasonable load level, it seems to me that this should
>represent a performance improvement.  Of course, it won't look good on the
>standard benchmarks.
>
>Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
>Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108


I agree with the reasoning given by Frank. Many RISCs have placed
great emphasis on software issues like an efficient compiler, yet it is still
common to see 30-50% (a debatable figure) of the pipeline stages doing no
fruitful work (pipeline bubbles) due to data dependencies, latency delays,
and branches causing a pipeline flush. Now if one were to introduce
horizontal pipelining, running more than one process in order to fill these
pipeline bubbles, I feel that though a single process's execution rate may go
down a little bit, the overall throughput of the processor would
increase dramatically. Think about the hardware utilization available with such
a concept (I cannot quote figures, but I have been told that hardware
utilization, in terms of active logic circuits per cycle on a chip, is pretty low).

This concept may not appeal to those designers who want to get the maximum
throughput for a single process, as in supercomputing problems, but it would
definitely appeal to designers of general-purpose chips, which could be used
efficiently for anything from control applications to workstations, which tend
to have applications requiring many processes to be run.

Benchmarking such architectures and comparing them to normal RISC/CISC
architectures is another big, controversial issue. I have still not been able
to figure out how to compare the two, but one way to make the horizontally
pipelined architecture look damn good is to compare hardware utilization
ratios, or to compare raw instructions executed per (some million) cycle(s) for
any process available to be executed.

Most of the comments I have made here are based on our studies here at
UT Austin on a computer architecture project which combines the RISC
philosophy with the concept of horizontal pipelining to give a
hardware-efficient processor with marginal hits on single-process execution rates.
As mentioned by someone earlier on the net, the cache design for such an
architecture is a problem, as it has to cache multiple instruction streams.
I feel that with progress in VLSI technology this problem will not be as
serious. The Fairchild CLIPPER already has a 4K-byte cache chip, and an 8K-byte
cache chip should probably be a decent starting point for a processor with
say 2-4 processes able to execute concurrently.

Well, I have certainly raised a lot of issues here which many would like to
criticize or comment on. Please feel free to do so; this may help us in
our research work.


Rajiv Patel.
(rajiv@im4u.utexas.edu)

mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) (11/05/87)

The HEP was (is? are the few that were built still running?) a damnfine
machine for several interesting classes of problems--especially problems
which can be cast in a reduction-like paradigm.  It was also a very interesting
cryptanalysis engine.
 
As a box, it did not so much "suffer" from "bad engineering" (the engineering
was damnfine considering the screwball funding or lack thereof) as from
corporate mismanagement.

First, the company brought in new "leverage", "market-oriented" management
just after the first couple of production HEP-1s were delivered.  Synonyms:
leverage==>what Boone Pickens and Carl Icahn are good at;
market-oriented==>sell the sizzle, not the steak.

Second, the new management spent all their working funds on managing, hiring
more managers, and putting on shows to impress themselves--rather than on
evolving the proof-of-concept HEP-1 into a really workable HEP-2.

The HEP-1 never lived up to the promise of the HEP concept because it was
never really intended or expected to do so; but it !did! prove the value of the
concept and (to a few folk) open the door to doing some new things.  The HEP-2
would have more than met the promise if it had not been starved and
bludgeoned to death before it was ever really born.

The last I heard of HEP's daddy, Burton Smith, he was at the Nat'l 
Supercomputer Center, still (at least tentatively) pondering how
son-of-HEP might be brought into existence.

Anyone who wants to learn how to totally destroy a concept and a company at
the same time as (%&^%$%^&*&^&^$%&&^*&%)-ing some fine people, should carefully
study the history of Denelcor and the HEP.

Anyone who wants to learn how to come up with a better idea should do likewise,
and also talk to Burton.

Anyone who wants to get rich while making one heck of a technological splash
should buy the rights to HEP, bring them into an intelligent organization, and
get their tail in gear.

daveb@geac.UUCP (11/06/87)

In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
| I had an idea some time ago that I'm surprised I've never seen discussed.
| Suppose, for example, that your instruction processor has four stages.  With
| conventional pipelining, that means that four consecutive instructions from
| the same program are at some stage of execution at the same time.
| 
| Instead, why not have four different execution threads being performed
| simultaneously?

  A logically different but physically similar technique is used by
the Nippon Electric Company's DPS-90 series of processors: they keep
three pipelines around for pre-evaluating code down three possible
branches.  This cuts down on so-called "pipeline breaks" most
wonderfully in programs containing lots of branch instructions.
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

daveb@geac.UUCP (11/06/87)

In article <862@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes:
>In article <2525@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
>> Instead, why not have four different execution threads being performed
>> simultaneously?
>
>This is an old idea; it was done on the CDC 6600 i/o processors.  More
>recently it was tried on the HEP, which wasn't very successful.
> [explanation truncated]

  Many moons ago, ICL (the British mainframers) tried what my boss
called "half-caches", which, as you might guess from the name,
allowed a cheap and easy swap from a running program to a
ready-to-run program.  The system was really quite elegant (i.e., it was
far easier than it looked), but I haven't heard anything on it since.
  Any ICLians out there?


-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

franka@mmintl.UUCP (Frank Adams) (11/06/87)

In article <9398@pyramid.pyramid.com> csg@pyramid.UUCP (Carl S. Gutekunst) writes:
|In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
|>Instead, why not have four different execution threads being performed
|>simultaneously?  This eliminates the dependency checks and latency delays
|>inherent in "vertical" pipelining.
|
|The early UNIVAC 1100-series processors did this.  There certainly were
|dependency troubles; you could not have any thread manipulating the same
|operands as any other thread. I don't see how you could avoid this.

There is a fairly standard programming model these days that says that code
is always in read-protected memory, and read-write memory is not shared
between processes.  Some machines more or less require this.  If this model
is used, there are no dependency problems.

|I don't know specifically why Sperry dropped the four-stage parallel
|pipeline, although I can guess: it was a lot of iron that was difficult
|to use effectively.

It doesn't seem like it would be that hard to use effectively in a
multi-user environment.  A bit of sophistication is required by the
operating system, but user programs should be unaffected.

My own guess is that this kind of design suffers in the market because it
does poorly on standard performance measurements.
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

atbowler@orchid.UUCP (11/14/87)

In article <1782@geac.UUCP> daveb@geac.UUCP (Dave Collier-Brown) writes:
>In article <2525@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>| Instead, why not have four different execution threads being performed
>| simultaneously?
>
>  A logically different but physically similar technique is used by
>the Nippon Electric Company's DPS-90 series of processors: they keep
>three pipelines around for pre-evaluating code down three possible
>branches.  This cuts down on so-called "pipeline breaks" most
>wonderfully in programs containing lots of branch instructions.

It is amazing how people get excited over reinventing some really old
techniques.  "4 different execution threads" sounds like a conventional
multiprocessor system to me.  The GE-600 and its successors (including the
above-mentioned DPS-90) have been doing this since the early sixties.  The
multiple-pipeline technique within a single processor that Dave
describes is a somewhat newer technique (early seventies, when I
heard IBM describe it for the 370/168) for getting faster performance
from a single instruction thread (processor).  Besides higher system
throughput, the multiprocessor approach increases reliability, since
it is easy to get the system going again if one of the processors fails.
GCOS is perfectly happy to let a CPU be released, serviced by the
field engineer, and returned to use without the users ever seeing
anything happen except some response-time degradation.
DEC supported multiprocessors with the PDP-10 (a.k.a. system-20)
and Univac did it with the 1100 series.  IBM announced it several
times, but until the Sierra series never figured out how to make
the operating system handle it, so I suppose in the minds of many
people multiprocessors are brand new.

rick@svedberg.bcm.tmc.edu (Richard H. Miller) (11/16/87)

In article <11711@orchid.waterloo.edu>, atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:

> Besides higher system throughput, the multiprocessor approach increases
> reliability, since it is easy to get the system going again if one of the
> processors fails. GCOS is perfectly happy to let a CPU be released,
> serviced by the field engineer, and returned to use without the users ever
> seeing anything happen except some response-time degradation.
> DEC supported multiprocessors with the PDP-10 (a.k.a. system-20)
> and Univac did it with the 1100 series.  IBM announced it several
> times, but until the Sierra series never figured out how to make
> the operating system handle it, so I suppose in the minds of many
> people multiprocessors are brand new.


Two points about the topics raised in the above. Unisys (a.k.a. Sperry) still
does it with the 1100 architecture. We can have up to four processors sharing
common memory and executing common code (although an activity will only be
running on one processor). If one processor fails, the system usually will be
able to automatically down the failing instruction processor without the user
being aware that there is a problem. This also extends to the mass storage
(memory, not disk), IO processors, channels and controllers. In fact, under
certain circumstances, different architectures can share common memory (the
AVP on the 1100/70 allows an OS/3 system to run using the same memory as the
1100, and the ISP [Integrated Scientific Processor] allows a vector machine to
run with the 1100/90 IP and mass storage). Having been exposed to many types of
architecture and machines, my view is that the 1100 line provides one of the
nicest system environments for a large mainframe system. (I do have a bias
towards this system since it is our primary mainframe.)

The second point concerns the PDP-10. As I remember, TOPS-20 was never able
to fully support symmetric multiprocessing, but the TOPS-10 system did. Until
the 7.01 release of TOPS-10, the PDP-10 systems used a master/slave
relationship in multiprocessing. As of that release, TOPS-10 was able to support
true symmetric multiprocessing in that there was no master CPU. I/O could
run from either CPU, and it provided a true use of both processors. (Up until
this release, the second processor usually was NOT as highly used as the
master, since I/O could only be handled by the master CPU, and unless a shop was
very CPU-bound, it would not be able to utilize the second CPU fully.) The
released product supported up to 4 CPUs, and several large TOPS-10 sites did
run a quad system (although it was never *officially* supported). Since we
dropped out of the DEC world about 7 years ago, I never did hear if TOPS-20
ever got the capability. It is interesting that what I have read in the trades
indicates that the VAX is to be given this capability that the TOPS-10 system
had almost 7 years ago. Sigh.


Richard H. Miller                 Email: rick@svedburg.bcm.tmc.edu
Head, System Support              Voice: (713)799-4511
Baylor College of Medicine        US Mail: One Baylor Plaza, 302H
                                           Houston, Texas 77030

daveb@geac.UUCP (11/16/87)

In article <11711@orchid.waterloo.edu> atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
> ... The
>multiple pipeline technique within a single processor that Dave
>describes is a somewhat newer technique (early seventies when I
>heard IBM describe it for the 370/168) for getting faster performance
>from a single instruction thread (processor).

Hi, Alan!

  I confess I don't remember IBM actually *having* the multiple
pipelines in the /168... although I do remember it being proposed
when UofWindsor still had one.
  Did they get it shoehorned in?

 --dave (formerly brown) c-b
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

mrc@CAM.UNISYS.COM (McLean Research Center guest) (11/16/87)

In article <11711@orchid.waterloo.edu>, atbowler@orchid.waterloo.edu (Alan T. Bowler [SDG]) writes:
> ...
> GCOS is perfectly happy to let a CPU be released serviced by the
> field engineer, and returned to use with the user's never seeing
> ...
> and Univac did it with the 1100 series.  IBM announced it several
> times, but until the Sierra series never figured out how to make
> the operating system handle it,...

Well, I must take exception:

IBM did it in the System/360-40V (a prototype), S/360-67 (production),
S/370-158-AP, S/370-158-MP, S/370-168-AP, S/370-168-MP (all 4 as production).

Of course, in the S/370s, they bollixed the I/O channel addressing scheme.

True, they never did generally release a true multi-CPU opsys in their
main product line.  DOS never would have had a chance of running multi-CPU;
OS/MFT (which became SVS) never would have had a chance, either;
OS/MVT (when it became MVS) could have had a chance, but they
made a political decision.

But, they did have a system which did it all, namely TSS/360 --> TSS/370.
TSS was strictly a DATbox/relocating/virtual-memory system, and the S/360-67
fit it (or vice versa) beautifully.  TSS was even so open-structured that
it was ported to S/370 in about a year (calendar) and did everything that
it did before, except for a couple of features designed out of the S/370 boxes.

TSS was inherently n-plex: the released version could handle 4 CPUs and 4 I/O
channel CONTROLLERS (that's what the left nibble of the two-byte I/O address
was really for), with up to 16 channels per controller; understood channels as
logically standalone and switchable/shareable across/between controllers; and
handled up to 16 memory frames as partitionable, re-base-addressable, and
reconfigurable.

The most-plex customer system installed for S/360 had 2 CPUs, 3 CCUs, 8 MUs (I
think).  The most-plex customer system installed for S/370 had 2 CPUs (168-MP),
logically 2 CCUs (really each CPU frame's embedded channel set), logically
16? MUs (severable chunks within the CPU frames' embedded memory).

The most-plex ever run, up in the Mohansic Lab, and then at Poughkeepsie
(914 bldg?)  is reputed to have been a 4x4x16 S/360-67.  And it was, as
some might say, a HUMMER!

Just for drill, consider the fact that a 2-CPU S/370-168-MP ran a throughput
of 2.1 to 2.25 over a 1-CPU system, depending on just what we gave it for
workload.

There ain't nothing new under the sun.

Regardz,
Ken

doug@edge.UUCP (11/19/87)

Very minor and unimportant historical correction:

>OS/MFT (which became SVS) never would have had a chance, either;
>OS/MVT (when it became MVS) could have had a chance, but they

OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for OS/VS2.
Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2 Release 1 became
known to history as SVS.
-- 
Doug Pardee -- Edge Computer Corp., Scottsdale, AZ -- ihnp4!oliveb!edge!doug

mrc@CAM.UNISYS.COM.UUCP (11/20/87)

In article <988@edge.UUCP> doug@edge.UUCP (Doug Pardee) writes:
>OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for OS/VS2.
>Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2 Release 1 became
>known to history as SVS.

I stand corrected.

Ken

dmt@ptsfa.UUCP (11/21/87)

In article <393@sdcjove.CAM.UNISYS.COM> mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) writes:
>In article <988@edge.UUCP> doug@edge.UUCP (Doug Pardee) writes:
>>OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for OS/VS2.
>>Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2 Release 1 became
>>known to history as SVS.
>

I thought that OS/VS1 was called SVS.

Locally, we called VS1 "SVS", and VS2 (Release 1) "MVS", right from the start.
IBM didn't like the labels. Finally, they called OS/VS2 Release 2 "MVS".




-- 
Dave Turner	415/542-1299	{ihnp4,lll-crg,qantel,pyramid}!ptsfa!dmt

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (11/21/87)

This discussion needs a new title. It started with the Denelcor style of
shared-functional-unit multiprocessors; wandered all unbeknownst into
conventional multiprocessors (huh?!?); and now it's a history of IBM
nomenclature (pardon ????).

To get back to the original subject:

There are two reasons to share functional units.
 - cost, or, if you will, duty cycle.
 - simplicity ( in the sense of RISCness ).

The duty cycle argument says that if a unit is rarely used, then you get a
more effective design by sharing it among all the instruction-issue units.
Note that a lot of the average Cray sits idle while the rest is being
useful.  The counter-argument is that decreasing {prices, power consumption,
etc} make sharing less of a win. Plus, sharing puts constraints on packaging
- you have to get there from here.

The simplicity argument says that since successive clocks are on behalf of
different threads, the pipelines need no interlocks. This should
lead to lean, mean pipelines with good clock rates. The problem is that to
do this, you would have to put interlocks on the pipe entrances, to resolve
the asynchronous demands for service. Denelcor solved that by sharing the
instruction-issue unit, using a queue. (When an answer came out, its thread
became eligible for another issue.) The problem is that any single
sequential program is now unable to issue instructions at the full rate.
So the machine is only a win for timesharing loads, or for multithreaded
applications. Obviously, you can do a sort of "fork" in a few clocks.
Denelcor argued that fine-grained forking would pick up wins, and Alliant seems
to be getting mileage from such forking.
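The queue-based issue scheme described above can be sketched as follows (a
rough model; the pipe depth and cycle counts are my assumptions, not HEP's
actual numbers):

```python
# Sketch of a shared instruction-issue unit with a ready queue: a thread
# re-enters the queue only when its in-flight instruction's answer comes
# out, so a lone thread cannot issue at the full rate.
from collections import deque

PIPE_DEPTH = 8   # assumed latency from issue until the answer comes out

def issued_counts(num_threads, cycles):
    """Count issues per thread over the given number of clocks."""
    ready = deque(range(num_threads))
    in_flight = []                       # (finish_cycle, thread) pairs
    issued = [0] * num_threads
    for now in range(cycles):
        for entry in [e for e in in_flight if e[0] == now]:
            in_flight.remove(entry)
            ready.append(entry[1])       # answer out: eligible to issue again
        if ready:                        # shared unit: at most one issue/clock
            t = ready.popleft()
            issued[t] += 1
            in_flight.append((now + PIPE_DEPTH, t))
    return issued

# A lone sequential program issues only once per PIPE_DEPTH clocks...
assert issued_counts(1, 80) == [10]
# ...while PIPE_DEPTH threads together sustain one issue every clock.
assert sum(issued_counts(8, 80)) == 80
```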

If you assume a single-chip CPU, I guess it's a bad idea.
-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

mash@mips.UUCP (John Mashey) (11/22/87)

In article <380@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
>This discussion needs a new title...
>
>There are two reasons to share functional units.
> - cost, or, if you will, duty cycle.
> - simplicity ( in the sense of RISCness ).
>
>The duty cycle argument says that if a unit is rarely used, then you get a
>more effective design by sharing it among all the instruction-issue units.
>Note that a lot of the average Cray sits idle while the rest is being
>useful.  The counter-argument is that decreasing {prices, power consumption,
>etc} make sharing less of a win. Plus, sharing puts constraints on packaging
>- you have to get there from here.

>If you assume a single-chip CPU, I guess it's a bad idea.

That's the critical observation, and observe that an increasing piece
of the computing spectrum is being dominated by single-chip CPUs,
whose design tradeoffs are very different from having boards full of
[TTL, ECL, etc] logic.  For example, if you want to micro-time-slice N
processes, you must provide N sets of the highest-speed state in the
memory hierarchy [registers], and in fact, you'd probably want
N sets of caches also.  [Think about having N processes thrashing
around interleaved in the same cache: it is hard to see how this
will help your hit rates very much. TLBs likewise]  If you were building CPUs
that were multiple boards anyway, it might not be impossible to replicate
the registers without incurring awful speed penalties: there will be
a limit, but certainly, successful systems have been built this way,
if only to minimize context switching time. Board yields don't drop
like a stone just because you used a little more space.
On the other hand, if it's VLSI, you can be up against serious limits,
and you have to think hard about what's on the chips.

Finally, here are the reasons why the "single-chip" observation is
the critical one.  I might be accused of bias on the following conjectures,
but I don't think they're too far out of line:

1) Each year, an increasing proportion of newly-installed computers
(both units and $$) will be based on single-chip CPUs.

2) Single-chip solutions already dominate the low-end, and they keep
moving up.  The only way some of the existing architectures compete
there is by VLSIing as quickly as possible [microVAXen, for example].

3) Solutions that are not single-chip (or very small chip count)
will increasingly be:
	a) Highest-end supercomputers
	b) Upward extensions of existing product lines that didn't start life
	as single-chip CPUs
	c) "Unusual" architectures in the mini-super arena, which can often
	support anything if it solves some class of problem enough more
	cost-effectively than other available ones.

4) It's hard to believe there will be ANY more new computer architectures
in the low-to-mid range of computing that aren't single-chip VLSI micros.
(Oops: qualify that: SUCCESSFUL architectures).  Note that low-to-mid
range means shippable 10-mips uniprocessors in 1987, 20-mips in 1988,
>40 in 1989.

5) To summarize: for general-purpose computing, the time-slicing hardware
approach seems doomed to niches at best, because it runs right against
the likely design trends of the next few years.  This does leave the
question of identifying the niches that might be possible.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mrc@CAM.UNISYS.COM (Ken Leonard --> ken@oahu.mcl.unisys.com@sdcjove.cam.unisys.com) (11/23/87)

Having had a couple of replies re my discussion of IBM OS/VS etymology...
   MFT (Multiprogramming, Fixed number of Tasks) became OS/VS1 and stayed.
   MVT ("                 Variable...) became OS/VS2, and then...
   ...thru release 1.highest, used only a _single_ problem-state address
      space and was thus known as SVS (Single Virtual Storage);
   ...with release 2.0 and later, used (supposedly) _multiple_ problem-
      state address spaces and was thus known as MVS (Multiple...).
Sokay?
Regardz,
Ken

kenm@sci.UUCP (Ken McElvain) (11/24/87)

In article <958@winchester.UUCP>, mash@mips.UUCP (John Mashey) writes:
> In article <380@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
+ >This discussion needs a new title...
+ >
+ >There are two reasons to share functional units.
+ > - cost, or, if you will, duty cycle.
+ > - simplicity ( in the sense of RISCness ).
+ >
+ >The duty cycle argument says that if a unit is rarely used, then you get a
+ >more effective design by sharing it among all the instruction-issue units.
+ >Note that a lot of the average Cray sits idle while the rest is being
+ >useful.  The counter-argument is that decreasing {prices, power consumption,
+ >etc} make sharing less of a win. Plus, sharing puts constraints on packaging
+ >- you have to get there from here.
+ 
+ >If you assume a single-chip CPU, I guess it's a bad idea.
+ 
+ That's the critical observation, and observe that an increasing piece
+ of the computing spectrum is being dominated by single-chip CPUs,
+ whose design tradeoffs are very different from having boards full of
+ [TTL, ECL, etc] logic.  For example, if you want to micro-time-slice N
+ processes, you must provide N sets of the highest-speed state in the
+ memory hierarchy [registers], and in fact, you'd probably want
+ N sets of caches also.  [Think about having N processes thrashing
+ around interleaved in the same cache: it is hard to see how this
+ will help your hit rates very much. TLBs likewise]  If you were building CPUs
+ that were multiple boards anyway, it might not be impossible to replicate
+ the registers without incurring awful speed penalties: there will be
+ a limit, but certainly, successful systems have been built this way,
+ if only to minimize context switching time. Board yields don't drop
+ like a stone just because you used a little more space.
+ On the other hand, if it's VLSI, you can be up against serious limits,
+ and you have to think hard about what's on the chips.
+ 

I agree that cache [or TLB] hit rates will almost certainly go down.
However, miss penalties will also drop.  It is quite possible that
a cache fill could happen in the time it takes for the barrel
to turn around.

A ten-stage barrel processor running at 25 MHz would easily allow
over 300 ns for a cache fill before it cost another instruction slot.
The performance limit here is likely to be the bandwidth of the
cache-fill mechanism.
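A quick check of the arithmetic above (my restatement of the numbers):

```python
# At 25 MHz a cycle is 40 ns; in a ten-stage barrel, a given thread's
# issue slot comes around once every 10 cycles, i.e. every 400 ns.
clock_hz = 25e6
cycle_ns = 1e9 / clock_hz                 # 40 ns per cycle
barrel_slots = 10
slot_period_ns = barrel_slots * cycle_ns  # 400 ns between a thread's slots
# A miss resolved within one turn of the barrel (less the slot itself)
# costs the thread no extra instruction slot:
fill_budget_ns = slot_period_ns - cycle_ns
assert cycle_ns == 40.0
assert fill_budget_ns == 360.0            # comfortably "over 300 ns"
```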

Another issue is the instruction set.  It's not clear that you want
a bunch of registers.  It may be much better to do more of a memory
to memory architecture.  (I would recommend keeping some base registers).
A number of other areas also have some surprising tradeoffs.

Ken McElvain
Silicon Compiler Systems
decwrl!sci!kenm

mash@mips.UUCP (John Mashey) (11/25/87)

In article <11444@sci.UUCP> kenm@sci.UUCP (Ken McElvain) writes:
>In article <958@winchester.UUCP>, mash@mips.UUCP (John Mashey) writes:

>I agree that cache [or TLB] hit rates will almost certainly go down.
>However, miss penalties will also drop.  It is quite possible that
>a cache fill could happen in the time it takes for the barrel
>to turn around.

>A ten stage barrel processor running at 25Mhz would easily allow
>over 300ns for a cache fill before it cost another instruction slot.
>The performance limit here is likely to be the bandwidth of the
>cache fill mechanism.
^^^^^^^^^^^^^^^^^^^^^^ yes.
I believe that there is more interference than you might
think, although it would be nice to see simulation numbers, since
I don't have any.  Let's try a few quick assumptions.
Assume we're using split I & D caches. Assume that the cache line
is N words long, filled 1 word/cycle after a latency of L cycles.
One would expect that efficient cache designs have L <= N.
When filling an I-cache miss, you can do L more barrel slots,
then you must stall for N slots (or equivalent), because it doesn't
make sense to have the I-cache run faster than the chip (if it did,
you would run the chip faster). Putting an I-cache on the chip just
moves the problem around. Assuming L <= N, this says that when you hit
an I-cache miss, you get at most 50% of the total refill time (L+N)
that you can actually initiate new instructions.
D-cache refill is a little less painful,
in that only 30% of the instructions are loads/stores (on our systems,
but typical), so that you don't block, or skip something, until you
hit a load/store.  I'm not sure what you do when you execute something
that causes a cache miss while you're already in a cache miss,
maybe just block. Of course, I & D cache refills run into each other,
and if you're using write-thru caches, writes run into refills also.

These numbers seem to indicate that a 2-way barrel might be possible,
but much above that, very little benefit can be gained from overlapping
cache refill with execution.
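The 50% bound can be sketched in a few lines (L and N are as defined
above; the example (L, N) pairs are assumptions for illustration):

```python
# Sketch of the refill-overlap bound above: with miss latency L cycles
# and an N-word line filled at one word per cycle, the barrel can issue
# for L slots and then stalls for N, so the usable fraction of the
# (L + N)-cycle refill window is L / (L + N).
def usable_fraction(L, N):
    return L / (L + N)

# For any efficient design with L <= N, at most half the refill time
# can initiate new instructions (example pairs are assumptions):
for L, N in [(2, 4), (4, 4), (8, 16)]:
    assert usable_fraction(L, N) <= 0.5
```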


>Another issue is the instruction set.  It's not clear that you want
>a bunch of registers.  It may be much better to do more of a memory
>to memory architecture.  (I would recommend keeping some base registers).
>A number of other areas also have some surprising tradeoffs.
-----------------------------------^^^^
Please explain some more.  Note that in a memory-memory architecture,
what I said about 30% load/store above also gets worse.
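To put rough numbers on that: an illustrative traffic comparison, where only
the 30% load/store figure comes from above and the two-memory-operand
instruction format is a hypothetical:

```python
# Illustrative data-traffic comparison (operand counts are assumptions;
# only the 30% load/store figure comes from the article).
# Register machine: one I-fetch per instruction, plus one data
# reference on the 30% of instructions that are loads/stores.
reg_refs_per_instr = 1.0 + 0.30 * 1

# Hypothetical two-address memory-to-memory machine: assume the other
# 70% of instructions each touch two memory operands.
memmem_refs_per_instr = 1.0 + 0.70 * 2

# Memory-memory roughly doubles the references the caches must absorb,
# so refills collide with execution more often, not less.
assert memmem_refs_per_instr > reg_refs_per_instr
```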
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jay@splut.UUCP (Jay Maynard) (11/25/87)

In article <3801@ptsfa.UUCP>, dmt@ptsfa.UUCP (Dave Turner) writes:
> >In article <988@edge.UUCP> doug@edge.UUCP (Doug Pardee) writes:
> >>OS/MFT became (was the basis for) OS/VS1.  OS/MVT was the basis for OS/VS2.
> >>Release 2 of OS/VS2 was the first MVS.  Inevitably, OS/VS2 Release 1 became
> >>known to history as SVS.
> I thought that OS/VS1 was called SVS.
> Locally, we called VS1, SVS and VS2 (release 1), MVS right from the start.
> IBM didn't like the labels. Finally, they called OS/VS2 Release 2, MVS.

Actually, Doug had it right. OS/VS2 release 1 was called SVS, because it
provided a single virtual address space (0-16 MB). OS/VS2 release 2 (did
anyone actually run it, or was release 3 the first to see actual use? I was
never clear on that point) provided multiple virtual address spaces, with
each user's storage (except for defined common areas - the nucleus and
common work areas) separate and unaddressable between address spaces. One
address space was (normally) assigned to each TSO user, each batch job, and
each started task. There are, in later releases, several auxiliary address
spaces for system use.

Source for most of this: IBM manual _MVS System Overview_ (no, I don't have
the number - the book's at the office, and I'm at home.)

OS/VS1 was simply called VS1, except for those people who called it bleh...

-- 
Jay Maynard, K5ZC (@WB5BBW)...>splut!< | uucp: uunet!nuchat!splut!jay
Never ascribe to malice that which can | or:  academ!uhnix1!--^
adequately be explained by stupidity.  | GEnie: JAYMAYNARD  CI$: 71036,1603
The opinions herein are shared by none of my cats, much less anyone else.

bcase@apple.UUCP (Brian Case) (11/25/87)

In article <11444@sci.UUCP> kenm@sci.UUCP (Ken McElvain) writes:

   [Seems to be talking about something like the PPUs of the old Cybers.]

>I agree that cache [or TLB] hit rates will almost certainly go down.
>However, miss penalties will also drop.  It is quite possible that
>a cache fill could happen in the time it takes for the barrel
>to turn around.
>A ten stage barrel processor running at 25Mhz would easily allow
>over 300ns for a cache fill before it cost another instruction slot.
>The performance limit here is likely to be the bandwidth of the
>cache fill mechanism.

Yes, but if a fair fraction of the processors in the barrel are causing
misses (say 3 or so) then your memory system will have to be multiported
(or very fast, in which case why not just one fast processor?).
This doesn't invalidate what you are saying, just an observation.
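One way to gauge how many ports that takes is Little's law (the 25 MHz slot
rate and 400 ns rotation come from earlier in the thread; the miss rates
below are assumptions):

```python
# Little's-law sketch of how many refills are in flight at once.
# Slot rate and refill time come from earlier in the thread; the miss
# rates below are assumptions for illustration.
slot_rate_hz = 25e6      # one instruction slot per 40 ns cycle
refill_s = 400e-9        # one barrel rotation to service a miss

def outstanding_misses(miss_rate):
    # Mean refills in flight = arrival rate * service time (Little's law)
    return slot_rate_hz * miss_rate * refill_s

# A modest 5% miss rate keeps about half a refill outstanding; a
# barrel-degraded 30% rate needs roughly three concurrent refills,
# i.e. a multiported or heavily interleaved memory system.
for rate, approx in [(0.05, 0.5), (0.30, 3.0)]:
    assert abs(outstanding_misses(rate) - approx) < 1e-6
```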

>Another issue is the instruction set.  It's not clear that you want
>a bunch of registers.  It may be much better to do more of a memory
>to memory architecture.  (I would recommend keeping some base registers).
>A number of other areas also have some surprising tradeoffs.

I fail to see why memory-memory would be better than registers.  Can
you give some proof?  Also, what other areas have surprising tradeoffs,
and what are they?

franka@mmintl.UUCP (Frank Adams) (11/26/87)

In article <958@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
|In article <380@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
|>If you assume a single-chip CPU, I guess it's a bad idea.
|
|That's the critical observation, and observe that an increasing piece
|of the computing spectrum is being dominated by single-chip CPUs,
|whose design tradeoffs are very different from having boards full of
|[TTL, ECL, etc] logic.

This is a good point, but unless I am missing something, it is only a
temporary one.  Surely we will reach the point, not too many years from now,
when the logic which now fills many boards will all fit on one chip.  At
that point, the arguments for horizontal pipelining on a single chip CPU
will be as strong as they are today for multi-chip CPUs.  Won't they?

Another trend which might doom the idea is that towards individual
(single-user) computers.  The future of multi-tasking on such machines is
very much in question; if it becomes a big thing, there is no problem.
Otherwise, we are left with the relatively few (but, on average,
higher-powered) time-shared systems.
-- 

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

mash@mips.UUCP (11/30/87)

In article <2581@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:

>|That's the critical observation, and observe that an increasing piece
>|of the computing spectrum is being dominated by single-chip CPUs,
>|whose design tradeoffs are very different from having boards full of
>|[TTL, ECL, etc] logic.

>This is a good point, but unless I am missing something, it is only a
>temporary one.  Surely we will reach the point, not too many years from now,
>when the logic which now fills many boards will all fit on one chip.  At
>that point, the arguments for horizontal pipelining on a single chip CPU
>will be as strong as they are today for multi-chip CPUs.  Won't they?
Maybe, maybe not: it seems to me there are packaging issue differences.
(I'm not a packaging expert: maybe someone who is will contribute).
However, I'd observe that I've seen a lot of multi-board designs that
were able to afford to have lots of busses, and there didn't seem to
be huge discontinuities in cost when you needed to make something a little
bigger, standard form factors notwithstanding.  On the other hand,
it seems that with VLSI design, at any given point in time:
	a) Adding pins can lead to big discontinuities in costs.
	b) # signal pins and/or speeds maybe severely limited by
	the testers available
	c) So far, there are obvious good uses for additional pins
Maybe cheap, testable packages get big enough sometime that we can
have as many pins as we want....

>Another trend which might doom the idea is that towards individual
>(single-user) computers.  The future of multi-tasking on such machines is
>very much in question; if it becomes a big thing, there is no problem.

Hopefully, multi-tasking will some year come to single-user computers :-)

>Otherwise, we are left the relatively few (but, on average, higher-powered)
>time-shared systems which are left.

Note: perhaps my earlier posting was not clear enough: let me add some more.
I'd said that an increasing part of computing will be taken over by
VLSI small-#-of-chip solutions, either in single-cpu units, or
in multi-processors.  The main reason for this belief wasn't technical,
but economic.  To make a barrel-style processor economically viable
(versus ganging micros together and riding their performance curves
at low cost), it needs to have a truly compelling cost/performance
advantage.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

elg@killer.UUCP (12/06/87)

In article <1006@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <2581@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>Another trend which might doom the idea is that towards individual
>>(single-user) computers.  The future of multi-tasking on such machines is
>>very much in question; if it becomes a big thing, there is no problem.
>
>Hopefully, multi-tasking will some year come to single-user computers :-)

Actually, multi-tasking single-user computers have been available for years.
OS-9 on the TRS-80 Color Computer, for example, and AmigaDOS on the Commodore
Amiga. Just because the IBM PEE-CEE and Apple Macintosh don't have a
multitasking operating system doesn't mean that the rest of the world is stuck
with single tasking (and note that both IBM and Apple intend to introduce
multitasking OS's Real Soon Now).

I think that we'll see the demise of ancient CP/M-derived operating systems
Real Soon Now (as Marketing would say :-).

--
     Eric Green   elg@usl.CSNET      Snail Mail P.O. Box 92191       
     {cbosgd,ihnp4}!killer!elg       Lafayette, LA 70509             
Hello darkness my old friend, I've come to talk with you again....