[comp.arch] Second-generation RISC

gor@cs.strath.ac.uk (Gordon Russell) (03/15/91)

Hi there.....

   Does anyone out there have any additional information concerning the
PgC7600 microprocessor, other than the overview given by BYTE (March 1991)?

   Just to recap, it is a fast RISC processor, created by a new British
company named PGC, which has been designed as a second generation RISC
processor. The PgC7600 will reportedly run at 160 MIPS, a speed made possible
by integrating numerous support chips into the processor. The core of the
device runs asynchronously (unclocked), communicating to the outside
world by on-chip interface devices.

   I am interested in its register windowing system, its actual performance,
whether it actually exists yet, and anything else about the device.

   The review in BYTE is a little difficult to find, but it is a sizable
review, and can be located on page 90IS-109.

   Any replies welcome.

   Gordon Russell, PhD Student
        University of Strathclyde
        Glasgow
        Scotland.

coy@ssc-vax (Stephen B Coy) (03/20/91)

In article <6128@baird.cs.strath.ac.uk> gor@cs.strath.ac.uk (Gordon Russell) writes:
>   The review in BYTE is a little difficult to find, but it is a sizable
>review, and can be located on page 90IS-109.

The review is extremely difficult to find.  Unfortunately it's not
included in the US version of BYTE.  Could you please summarize the
review for those of us on this side of the "pond"?

Stephen Coy
coy@ssc-vax.UUCP

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (03/20/91)

In article <6128@baird.cs.strath.ac.uk>
	gor@cs.strath.ac.uk (Gordon Russell) writes:

>Subject: Re: Second-generation RISC

I think that the only existing second generation RISC is HP's PA-RISC 1.1.

>   Does anyone out their have any additional information concerning the
>PgC7600 microprocessor, other than the overview given by BYTE (March 1991).

I have no information. The other second generation RISC I know of is MIPS's
R4000, which has only been announced.

>The PgC7600 will reportedly ru at 160 MIPS, a speed made possible
>by integrating numerous support chips into the processor.

I think that a second generation RISC should:

	1) be faster than 50 MIPS (by whatever reasonable measure,
	   absolutely not Dhrystone MIPS)

	2) be able to have large (>=0.5Mbyte) cache

	3) have large (>4Gbyte) address space

	4) have a FLOPS rate at least as fast as its external clock.

Moreover, I think they will soon have (or already have) dual-level caches
and dual-level TLBs.

I also think that the compilers for second generation RISCs will
be tuned for SPECmark.

Does anyone have any comments about second generation RISC?

							Masataka Ohta

kds@blabla.intel.com (Ken Shoemaker) (03/23/91)

My personal feeling?  The current generation of RISC processors was defined
under the implementation constraints of at least 5 years ago.  Since then,
the implementation technology has changed considerably.  The number of
transistors you can fit on a chip has grown, of course, but the
disparity between on-chip and off-chip device speeds has grown much more.
Though you can get increased bandwidth by increasing the number of pins
(another development is packages with huge numbers of pins), you have a hard
time pushing the speeds of the interchip interfaces.

But this is just one example of how the technology has changed.  In general,
you can try to adapt existing architectures to the new technology, but you
inevitably create some kind of mismatch which requires more device
complexity to address.  This camp includes both superscalar and
superpipelined implementations of existing architectures, including
implementations of first generation RISC processors.  None of these
implementations can be as clean as the implementation of an architecture
designed with the new constraints from the start.  This was one of the
primary reasons for this whole family of processors.  One of the many
possible acronyms that RISC was described to mean even had something to do
with responses to semiconductor technology.

You can even try to expand or generalize your architecture to try to
encompass the new opportunities that a different implementation technology
presents.  But you are still encumbered by the constraints imposed by your
existing architecture.  Which isn't to say that there isn't money to be made
in this camp!  But the closer your architecture is tied to a technology, the
better the chance that it will become obsolete when the underlying technology
changes.

So any "second generation" RISC would necessarily look different than a
first generation RISC in that it would be tuned to a different technology.
Register set sizes might get larger, but my guess is that they would get
smaller and wider.  Multiple execution units would be standard, and would
impose an entirely new set of constraints on instruction sequences.  Delayed
branches will probably go away as they really are an artifact of having a
short, fixed length pipeline.  But my opinion is worth, of course, exactly 
what you paid for it.  I certainly don't want to start a flame war, and I
admit that this may sound like the pot calling the kettle black!
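
If it helps to see the mechanism, here is a toy C model of a single-slot
delayed branch -- purely illustrative, not any real machine's pipeline:

/* Toy model of a one-slot delayed branch.  The instruction after
 * the branch (the delay slot) always executes, because by the time
 * the branch resolves, the fetch stage has already pulled it in.
 */
#include <stdio.h>

enum op { ADD, BRANCH, HALT };
struct insn { enum op op; int arg; };

int main(void)
{
    struct insn prog[] = {
        { BRANCH, 3 },   /* branch to index 3... */
        { ADD,    10 },  /* ...but this delay-slot ADD still runs */
        { ADD,    99 },  /* skipped: the branch has taken effect */
        { ADD,    1  },
        { HALT,   0  },
    };
    int acc = 0, pc = 0, pending = -1;

    while (prog[pc].op != HALT) {
        struct insn i = prog[pc++];
        if (pending >= 0) { pc = pending; pending = -1; }  /* redirect lands one insn late */
        if (i.op == ADD)    acc += i.arg;
        if (i.op == BRANCH) pending = i.arg;
    }
    printf("acc = %d\n", acc);  /* 11: delay-slot ADD (10) + target ADD (1) */
    return 0;
}

Stretch the pipeline, or issue more than one instruction per cycle, and the
single architectural slot no longer matches what the hardware needs -- which
is exactly why I expect the feature to disappear.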
-------------------
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
	kds@mipos2.intel.com

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/24/91)

In article <3189@inews.intel.com> kds@mipos2.intel.com (Ken Shoemaker) writes:
>Though you can get increased bandwidth by increasing the number of pins
>you have a hard time pushing the speeds of the interchip interfaces.

As clock frequencies go up, some funny things start happening.  A row
of bus drivers starts looking like a phased array - and I'm talking
about 150 MHz, not 1 GHz. So, you double rail all the lines, and
interleave with power and ground lines, and add P/G planes, and you
may still have to perturb the board layout to avoid parallel traces.

One view is that we should just accustom ourselves to inevitable
nuisances and drawbacks. The alternate view is that there's physics
out in them thar hills, and we should be looking for it.

In particular, optical communication is looking better and better.
Over long distances, it takes less energy to push photons than to
push electrons. One Bell Labs estimate puts the crossover distance at
200 microns. That is not a typo. The crossover is not at 200 meters
(LAN size) but nearer 200 microns (on-chip size).

But is it fast? Well, the speed record for transistors is in the low
picoseconds.  The record for short optical pulses is in the low
femtoseconds. The record optical pulse repetition period is 600 fs
- 1600 GHz - also not a typo.

But can it be compact?  Well, surface-emitting lasers can be as little
as a micron across. Yes, they have been successfully coupled to
optical fibers.

Cheap?  A wafer of thousands of surface-emitting lasers is a few
thousand dollars. Yields are 85-90%. The only difficulty with
individually modulating each of them at a gigahertz is the wiring and
signal distribution headache.

It doesn't really matter that practical difficulties remain. As clock
rates rise, all possible approaches have practical difficulties. Why
not go for a win instead of a draw?




-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

jdarcy@seqp4.ORG (Jeffrey d'Arcy) (03/25/91)

kds@blabla.intel.com (Ken Shoemaker) writes:
>Delayed
>branches will probably go away as they really are an artifact of having a
>short, fixed length pipeline.

As long as there's at least one pipe stage devoted to instruction fetch and
decode, I think delayed branches would make sense.  If this part of the 
pipeline ceases to be fixed length, maybe we'll see multiple-instruction
branch delays, with a *variable* number of instructions after the branch.

ICK!  8]

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/27/91)

In article <705@seqp4.UUCP> jdarcy@seqp4.ORG (Jeffrey d'Arcy) writes:

| As long as there's at least one pipe stage devoted to instruction fetch and
| decode, I think delayed branches would make sense.  If this part of the 
| pipeline ceases to be fixed length, maybe we'll see multiple-instruction
| branch delays, with a *variable* number of instructions after the branch.

Sure, if you want a disgusting thought, how about a few bits in the
branch instruction to indicate the number of cycles (minimum) to delay?

This is a truly disgusting thought...
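
For concreteness, a completely made-up encoding -- field positions and
widths invented purely for illustration:

/* Hypothetical 32-bit branch word: bits 31..26 opcode,
 * bits 25..24 minimum delay count, the rest displacement.
 */
#include <stdio.h>
#include <stdint.h>

#define DELAY_SHIFT 24
#define DELAY_MASK  0x3u            /* two bits: 0..3 delay cycles */

static unsigned delay_count(uint32_t insn)
{
    return (insn >> DELAY_SHIFT) & DELAY_MASK;
}

int main(void)
{
    uint32_t branch = (0x11u << 26) | (2u << DELAY_SHIFT) | 0x0040u;
    printf("minimum delay: %u cycles\n", delay_count(branch));  /* 2 */
    return 0;
}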
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

hamilton@siberia.rtp.dg.com (Eric Hamilton) (03/27/91)

In article <705@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeffrey d'Arcy) writes:
|> kds@blabla.intel.com (Ken Shoemaker) writes:
|> >Delayed
|> >branches will probably go away as they really are an artifact of having a
|> >short, fixed length pipeline.
|> 
|> As long as there's at least one pipe stage devoted to instruction fetch and
|> decode, I think delayed branches would make sense.  If this part of the 
|> pipeline ceases to be fixed length, maybe we'll see multiple-instruction
|> branch delays, with a *variable* number of instructions after the branch.
|> 
This is a slightly backwards way of looking at the problem.  The question
is not how many branch delays are implied by the pipeline structure; that
will change from implementation to implementation.  The question is how
many branch delays can be profitably filled by compilers.

Suppose, for example, that we discover that 99% of the time the compiler
can profitably fill three branch delay slots.  Then, regardless of the
pipeline structure of the machine, the delayed branches should support three
delay slots.  Instead of writing:

	instr a
	instr b		(executes in four clocks on a non-superscalar
	branch.n foo     machine with one fetch/decode stage)
	instr c

we write:

	branch.n foo
	instr a		(also executes in four clocks on a non-superscalar
	instr b		 machine with one fetch/decode stage)
	instr c

Since, by hypothesis, the compiler can profitably use all three slots nearly
all the time, we can live with the occasional no-op inserted when all three
slots cannot be filled.

But look at what happens when we move this code to a two-instructions-per-cycle
superscalar...
The first sequence now wastes one entire cycle, or two instruction times,
and has an effective execution time of three clocks.  The second sequence
wastes nothing and has an effective execution time of two clocks.  The second
sequence, in other words, executes on a given implementation at its best
possible speed, regardless of the pipeline structure.

The moral of the story is that the delayed branching should be designed around
the best that the compiler can do, not the idiosyncrasies of a particular
implementation.  The compiler should be able to generate code that uses all
the available pipelining, without worrying about precisely how much pipelining
that is.
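
As a sanity check on those numbers, here is a small C model of the issue
timing.  The model is mine and deliberately crude: issue width w, and a
taken branch casts a hardware shadow of one cycle's worth of issue slots,
which architectural delay slots may fill.

#include <stdio.h>

/* Clocks to issue: the insns before the branch, the branch itself,
 * and max(delay slots, hardware shadow) slots after it. */
static int clocks(int before, int slots, int width)
{
    int shadow = width;                    /* shadow = one full cycle */
    int used = before + 1 + (slots > shadow ? slots : shadow);
    return (used + width - 1) / width;     /* ceil(used / width) */
}

int main(void)
{
    printf("a,b,branch,c  scalar: %d clocks\n", clocks(2, 1, 1));  /* 4 */
    printf("a,b,branch,c  2-wide: %d clocks\n", clocks(2, 1, 2));  /* 3 */
    printf("branch,a,b,c  scalar: %d clocks\n", clocks(0, 3, 1));  /* 4 */
    printf("branch,a,b,c  2-wide: %d clocks\n", clocks(0, 3, 2));  /* 2 */
    return 0;
}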

----------------------------------------------------------------------
Eric Hamilton				+1 919 248 6172
Data General Corporation		hamilton@dg-rtp.rtp.dg.com
62 Alexander Drive			...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC  27709, USA

piziali@convex.com (Andy Piziali) (03/27/91)

Regarding arbitrary delay slot lengths, the instruction encoding of the
Evans and Sutherland ES-1 supercomputer included a bit named the "split bit."
The split bit told the processor to switch instruction execution from the
current stream to the pending branch target stream if the pending branch was
taken.  As with other branch delay slot schemes, the branch target instruction
stream was prefetched between the branch instruction and the split instruction.
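
Here is a toy C interpreter of the scheme as I read it; the details (for
instance, whether the split instruction itself executes) are my guess,
not ES-1 gospel:

#include <stdio.h>

/* br_target < 0 means "not a branch". */
struct insn { int add; int split; int br_target; };

int main(void)
{
    struct insn prog[] = {
        { 0, 0,  4 },  /* branch to index 4 (taken in this toy) */
        { 1, 0, -1 },  /* between branch and split: always executes */
        { 2, 1, -1 },  /* split bit set: switch streams after this */
        { 4, 0, -1 },  /* fall-through: skipped, branch was taken */
        { 8, 0, -1 },  /* branch target stream */
        { 0, 0, -1 },
    };
    int acc = 0, pc = 0, pending = -1;

    for (int n = 0; n < 5; n++) {              /* run a few instructions */
        struct insn i = prog[pc++];
        acc += i.add;
        if (i.br_target >= 0) pending = i.br_target;
        if (i.split && pending >= 0) { pc = pending; pending = -1; }
    }
    printf("acc = %d\n", acc);  /* 1 + 2 + 8 = 11; index 3 never runs */
    return 0;
}

The point being: the compiler, not the pipeline depth, decides how far the
split instruction sits from the branch.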
--
Home:	           andy@piziali.lonestar.org                |
{convex,egsner,frontier,laczko}!piziali!andy  ________------+------________
Office:                   piziali@convex.com               / \     
           {sun,texsun,uunet}!convex!piziali              *---*

krolnik@convex.COM (Adam Krolnik) (03/28/91)

Two interesting architecture papers:

WISQ: A Restartable Architecture using Queues
   and
PIPE: A VLSI Decoupled Architecture

Both of these specified explicitly which instructions made up the delay slot.

PIPE had a count that specified the number of instructions to execute
regardless of the outcome of the branch.

WISQ had a mask that specified which instructions to invalidate if the
processor predicted the wrong path to execute.
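
A rough C sketch of the two schemes as I remember the papers (field
names and widths are mine):

#include <stdio.h>

#define SLOTS 4

/* PIPE: a count of post-branch instructions that execute no matter
 * which way the branch goes. */
static int pipe_executes(int count, int i)
{
    return i < count;
}

/* WISQ: a mask saying which post-branch instructions to squash when
 * the predicted path turns out wrong. */
static int wisq_executes(unsigned mask, int i, int mispredicted)
{
    return !(mispredicted && (mask & (1u << i)));
}

int main(void)
{
    for (int i = 0; i < SLOTS; i++)
        printf("slot %d: PIPE(count=2) %d   WISQ(mask=0xC, mispredict) %d\n",
               i, pipe_executes(2, i), wisq_executes(0xCu, i, 1));
    return 0;
}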



      Adam Krolnik
      Design Verification Engineer  214-497-4578
      Convex Computer Corp. Richardson, Tx 75080

prener@arnor.UUCP (Dan Prener) (03/28/91)

In article <1991Mar26.210643.2052@dg-rtp.dg.com>, hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
|> 
|> Suppose, for example, that we discover that 99% of the time the compiler
|> can profitably fill three branch delay slots.  Then, regardless of the
|> pipeline structure of the machine, the delayed branches should support three
|> delay slots.  Instead of writing:
|> 
|> 	instr a
|> 	instr b		(executes in four clocks on a non-superscalar
|> 	branch.n foo     machine with one fetch/decode stage)
|> 	instr c
|> 
|> we write:
|> 
|> 	branch.n foo
|> 	instr a		(also executes in four clocks on a non-superscalar
|> 	instr b		 machine with one fetch/decode stage)
|> 	instr c
|> 
|> Since, by hypothesis, the compiler can profitably use all three slots nearly
|> all the time, we can live with the occasional no-op inserted when all three
|> slots cannot be filled.
|> 
That argument ignores the second-order effects, which arise from the memory
hierarchy.  On a machine that doesn't really need three branch delay slots,
there will be no gain from having delayed branches with three slots.  But
there can be dramatic losses.  Think of the (admittedly uncommon) cases in which
the fetch of a no-op that padded out the third delay slot causes a cache miss,
or, even worse, a page fault.  So the expected value of the three slot delay
on this machine is negative (zero with large probability, some significantly
non-zero negative number with small probability).  On the future
superscalar implementation, there is some positive contribution and some
negative contribution to the expected value of the delay.  So, even there,
it is far from clear that it wins.
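
To put made-up numbers on it (every probability and penalty here is
purely illustrative):

#include <stdio.h>

int main(void)
{
    double p_miss  = 0.01;     /* padding no-op causes a cache miss    */
    double p_fault = 0.0001;   /* ...or, far more rarely, a page fault */
    double c_miss  = 20.0;     /* cycles lost on a cache miss          */
    double c_fault = 100000.0; /* cycles lost on a page fault          */

    /* Gain from the unneeded slots on this machine: zero.  Expected
     * cost of fetching the padding no-ops: */
    double e_cost = p_miss * c_miss + p_fault * c_fault;
    printf("expected cost per padded branch: %.2f cycles\n", e_cost);
    /* 0.01*20 + 0.0001*100000 = 10.2 cycles: a net loss. */
    return 0;
}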
-- 
                                   Dan Prener (prener @ watson.ibm.com)

herrickd@iccgcc.decnet.ab.com (03/29/91)

In article <3291@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
> bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
>         "Most of the VAX instructions are in microcode,
>          but halt and no-op are in hardware for efficiency"

But, when you tell your VAX to do nothing, it does it RIGHT NOW!

dan herrick
herrickd@iccgcc.decnet.ab.com