[comp.arch] instruction stack

rmbult01@ulkyvx.bitnet (Robert M. Bultman) (02/26/91)

The CDC 6600, 7600, STAR 100, and IBM 360/195 made use of an
instruction stack.  The implementation varied, but generally they were
used to speed up operation of loops.  If a backwards branch referenced
an instruction contained within the instruction stack, the instruction
was obtained from the instruction stack rather than main store.  This
allowed the CPU<->Memory path to be used for data access rather than
instruction fetch.  I have several questions regarding this:

    1) Do any recent (> 1985) machines use this, or has this been
       obviated by the use of cache?

    2) Was this mechanism software controllable, or was it a
       hardware-only mechanism?  (The reference (see below) regarding
       this indicated it was hardware only using an address comparison
       to determine if the backwards jump fell within the limit of the
       instruction stack.  This sounds like a very small i-cache but
       is contained within the CPU.)

    3) Would such a mechanism help RISCs? (Why? Why not?)
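My mental model of the address comparison in (2), as a toy sketch (hypothetical, not any specific machine's logic): a branch hits the stack iff its target falls inside the address window currently resident.

```python
# Toy model of an instruction stack with hardware-only address
# comparison: a branch target "hits" if it lies within the window
# of addresses currently held in the stack.  (Hypothetical sketch,
# not any real machine's implementation.)

class InstructionStack:
    def __init__(self, size_words):
        self.size = size_words
        self.base = None          # lowest address currently resident

    def fill(self, addr):
        """Model a sequential fetch that loads the stack starting at addr."""
        self.base = addr

    def hit(self, target):
        """Hardware address comparison: is target inside the stack window?"""
        return (self.base is not None
                and self.base <= target < self.base + self.size)

stack = InstructionStack(size_words=8)
stack.fill(100)                 # loop body fetched at addresses 100..107
assert stack.hit(103)           # backward branch into the window: no memory fetch
assert not stack.hit(96)        # target below the window: refetch from main store
```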

Reference:
Topham, Nigel P., Omondi, Amos, and Ibbett, Roland N., "On the Design
and Performance of Conventional Pipelined Architectures", The Journal
of Supercomputing, Vol. 1, No. 4, Aug 1988.


Robert Bultman, University of Louisville, Speed Scientific School

kirchner@informatik.uni-kl.de (Reinhard Kirchner) (03/01/91)

From article <1991Feb26.112611.808@ulkyvx.bitnet>, by rmbult01@ulkyvx.bitnet (Robert M. Bultman):
> 
>     1) Do any recent (> 1985) machines use this, or has this been
>        obviated by the use of cache?
> 
>     2) Was this mechanism software controllable, or was it a
>        hardware-only mechanism?  (The reference (see below) regarding
>        this indicated it was hardware only using an address comparison
> 
>     3) Would such a mechanism help RISCs? (Why? Why not?)
> 
The already-mentioned Hyperstone does something like this, even if it is
called a 'cache'.  It loads a circular buffer with the executed instructions
and has them at hand in a loop.  It does automatic prefetch, and also has
an instruction to start a given amount of prefetching (max. 8 words).
This seems to be implementable with very few transistors and seems to
yield a lot of speed (the Hyperstone has only 85k transistors).  Since the
processor has special means for sequential access within the same RAM page,
prefetching can exploit this and avoid intermixed instruction and data
fetches.
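A rough software model of such a circular buffer (my sketch, not Hyperstone's actual logic): the last N fetched words stay in a ring, so a short backward loop re-executes from the buffer instead of memory.

```python
# Rough model of a circular instruction buffer: the most recently
# fetched words stay in a ring, so a short backward loop is served
# from the buffer rather than over the CPU<->memory path.
# (Sketch only, not Hyperstone's design.)

from collections import OrderedDict

class RingIBuf:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = OrderedDict()       # addr -> instruction word

    def fetch(self, addr, memory):
        if addr in self.buf:           # loop target still resident: hit
            return self.buf[addr], True
        word = memory[addr]            # miss: goes over the memory path
        self.buf[addr] = word
        if len(self.buf) > self.capacity:
            self.buf.popitem(last=False)   # overwrite oldest (circular)
        return word, False

mem = {a: f"insn@{a}" for a in range(16)}
ibuf = RingIBuf(capacity=8)
for a in range(4):                      # first pass through the loop: all misses
    ibuf.fetch(a, mem)
_, hit = ibuf.fetch(0, mem)             # backward branch to the loop head
assert hit                              # second pass served from the buffer
```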

Reinhard Kirchner
Univ. Kaiserslautern, Germany
kirchner@uklirb.informatik.uni-kl.de

hrubin@pop.stat.purdue.edu (Herman Rubin) (03/01/91)

In article <1991Feb26.112611.808@ulkyvx.bitnet>, rmbult01@ulkyvx.bitnet (Robert M. Bultman) writes:
> The CDC 6600, 7600, STAR 100, and IBM 360/195 made use of an
> instruction stack.  The implementation varied, but generally they were
> used to speed up operation of loops.  If a backwards branch referenced
> an instruction contained within the instruction stack, the instruction
> was obtained from the instruction stack rather than main store.  This
> allowed the CPU<->Memory path to be used for data access rather than
> instruction fetch.  I have several questions regarding this:

Forward references were also sped up.  On the 6600/7600 this was not as
common as on some of the others, but some of them brought in a forward stack
as well upon reaching boundaries.

>     1) Do any recent (> 1985) machines use this, or has this been
>        obviated by the use of cache?

I believe the ETA10 has this, and I do not know which others.  On some
machines, this is referred to as an instruction cache; the uses of
"cache" and "stack" here are essentially the same.  Presumably, one
could even have different levels of this.

>     2) Was this mechanism software controllable, or was it a
>        hardware-only mechanism?  (The reference (see below) regarding
>        this indicated it was hardware only using an address comparison
>        to determine if the backwards jump fell within the limit of the
>        instruction stack.  This sounds like a very small i-cache but
>        is contained within the CPU.)

As far as I know, this was hardware only.  I do not know of any instructions
on these machines, or on some others, that would allow software control.

>     3) Would such a mechanism help RISCs? (Why? Why not?)

Unless memory access is as fast as stack access, of course it would help.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

gary@chpc.utexas.edu (Gary Smith) (03/02/91)

In article <1991Feb26.112611.808@ulkyvx.bitnet>, rmbult01@ulkyvx.bitnet (Robert M. Bultman) writes:
|> The CDC 6600, 7600, STAR 100, and IBM 360/195 made use of an
|> instruction stack.  The implementation varied, but generally they were
|> used to speed up operation of loops.  If a backwards branch referenced
|> an instruction contained within the instruction stack, the instruction
|> was obtained from the instruction stack rather than main store.  This
|> allowed the CPU<->Memory path to be used for data access rather than
|> instruction fetch.  I have several questions regarding this:
|> 
|>     1) Do any recent (> 1985) machines use this, or has this been
|>        obviated by the use of cache?
|> 
|>     2) Was this mechanism software controllable, or was it a
|>        hardware-only mechanism?  (The reference (see below) regarding
|>        this indicated it was hardware only using an address comparison
|>        to determine if the backwards jump fell within the limit of the
|>        instruction stack.  This sounds like a very small i-cache but
|>        is contained within the CPU.)
|> 
|>     3) Would such a mechanism help RISCs? (Why? Why not?)
|> 
|> Reference:
|> Topham, Nigel P., Omondi, Amos, and Ibbett, Roland N., "On the Design
|> and Performance of Conventional Pipelined Architectures", The Journal
|> of Supercomputing, Vol. 1, No. 4, Aug 1988.
|> 
|> Robert Bultman, University of Louisville, Speed Scientific School
 
1) The CRAY Y-MP systems use the instruction stack concept: each CPU of this
shared memory multiprocessor system contains 4 instruction buffers which are
loaded through port D (shared with I/O, ports A&B are for vector loads, with
port C for vector stores).  Instruction fetch has priority over I/O transfer
in port D.  The buffers are used round-robin and each contains thirty-two
64-bit words (or 128 16-bit parcels).  The first instruction in any of the four
groups must have a word address which is a multiple of 32.  Each instruction
buffer has an IBAR (high-order 17 bits of P) which is compared with P and if
an in-buffer condition exists, no access to CM occurs.  One "neat" aspect of
non-coincidence recovery is that the word containing the parcel accessed is
fetched from memory first, with potential circular fill for the remaining 31.
During fetch sequence, instructions are retrieved from CM to an instruction
buffer at a rate of one 64-bit word per clock period (6 ns).  It takes
nineteen clocks for the first word to be available, and 3 more clocks for the
first instruction to get to CIP.  Memory conflicts can of course lengthen an
instruction fetch beyond the 19+31 = 50 clock period best case.  I suppose I
would call this a cache, although it's not obvious to me where on the
continuum of direct-mapped to fully-associative one would put CRAY's concept.  As
is typically the case, it's easier for me to just describe it as I understand
it.  Categorizing is quite difficult for me, but that's not germane I guess.
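The in-buffer check as I understand it can be sketched in a toy model (mine, not CRAY's logic): four 32-word buffers, each holding a 32-word-aligned block, the high-order bits of P compared against each buffer's IBAR, and round-robin fill on a miss.

```python
# Toy model of the Y-MP-style in-buffer check: four instruction
# buffers, each holding one 32-word-aligned block; the high-order
# bits of P (the IBAR) are compared against each buffer, and on a
# miss the next buffer is filled round-robin.  (My sketch, not the
# actual hardware.)

class YmpIBufs:
    WORDS = 32                         # words per buffer

    def __init__(self, nbufs=4):
        self.ibars = [None] * nbufs    # block number held by each buffer
        self.next = 0                  # round-robin fill pointer

    def fetch(self, p):
        """Return True on an in-buffer condition (no access to CM)."""
        block = p // self.WORDS        # model of the IBAR comparison
        if block in self.ibars:
            return True
        self.ibars[self.next] = block  # miss: fill the next buffer
        self.next = (self.next + 1) % len(self.ibars)
        return False

bufs = YmpIBufs()
assert not bufs.fetch(0)      # cold miss fills buffer 0 with words 0..31
assert bufs.fetch(31)         # same 32-word block: in-buffer condition
assert not bufs.fetch(32)     # next block: miss, fills buffer 1
```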

It's "interesting" to observe the effects of running the following program in
one or more CPU's of an X-MP which had a similar instruction cache (although a
VERY different implementation - instruction fetch on those machines grabbed
all CPUs' paths to CM so that fetch could be completed in 14+3 clocks (8.5 ns)
rather than 14+31 clocks).  I once measured "slow down" (more CPU time charged
to user) of 32% on an X by running the following program in one of the CPU's:

	IDENT	LUM	LOCK UP MEMORY
LUM	ENTER	NP=0,MODE=USER
	S2	0	(FOR XPAK DISPLAY)
	S1	1
LUM1	J	LUM2	JUMP FOREVER, USING ALL INSTRUCTION BUFFERS
	ALIGN		(EQUIVALENT TO BSS 31)
LUM2	J	LUM3
	ALIGN
LUM3	J	LUM4
	ALIGN
LUM4	J	LUM5
	ALIGN
LUM5	S2	S2+S1
	J	LUM1
	END
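The thrash is easy to reproduce in a toy model (mine, not CAL): the program jumps through five 32-word-aligned blocks, but there are only four instruction buffers, so round-robin replacement evicts each block just before it is needed again.

```python
# Toy demonstration of why LUM locks up memory: five 32-word-aligned
# blocks touched in rotation through four round-robin buffers means
# every single fetch misses.  (My sketch of the replacement behavior.)

def run(blocks_touched, nbufs=4, iters=100):
    bufs, nxt, misses = [None] * nbufs, 0, 0
    for _ in range(iters):
        for b in blocks_touched:
            if b not in bufs:
                misses += 1
                bufs[nxt] = b          # round-robin fill
                nxt = (nxt + 1) % nbufs
    return misses

assert run([0, 1, 2, 3]) == 4          # four blocks fit: only cold misses
assert run([0, 1, 2, 3, 4]) == 500     # five blocks: every fetch misses
```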

2) The CDC 6600 instruction stack consisted of eight 60-bit instruction
registers, labeled I0, I1, ..., I7.  An additional 60-bit register known as
the Input Register served as a buffer between CM and the stack.  There were
some restrictions, but fundamentally this allowed thirty-two 15-bit parcels
to be resident in "cache" on a given clock.  In-stack loops required a 30-bit
instruction for the branch, and jumps could not be executed from I0, the
bottom register of the stack, due to a process called "inching" the stack
(which caused discarding of I7).  All this fine hardware was contained on
Chassis 5, but I'm lapsing into something not germane again.  Sorry.
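A toy model of the inching I have in mind (my sketch, not the Chassis 5 logic): each new word enters at the bottom of the stack, pushing older words up and discarding I7.

```python
# Toy model of "inching" the 6600 instruction stack: a new 60-bit
# word enters at I0 (the bottom), older words shift up one register,
# and the contents of I7 (the top) are discarded.
# (My sketch, not the actual hardware behavior.)

def inch(stack, new_word):
    """stack[0] is I0 (bottom, newest); stack[7] is I7 (top, oldest)."""
    return [new_word] + stack[:-1]     # old I7 falls off the top

stack = [f"word{i}" for i in range(8)]   # initial contents of I0..I7
stack = inch(stack, "new")
assert stack[0] == "new"                 # new word now in I0
assert "word7" not in stack              # old I7 discarded
assert len(stack) == 8                   # still eight registers
```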

3) The CDC 6600 was a "RISC" machine in my opinion, as is today's CRAY.  In
fact, and this is an oversimplification of course, the CRAY simply adds one
bit to the 6-bit op code of the 6600.  If the bit's 0, you get a scalar op;
if it's 1, you get a vector op.  How much this can/does help other RISC
systems depends on too many factors to list here.  It gets into general
memory hierarchy questions.  I believe the reason for memory hierarchies is
speeding up execution times of REAL applications, and I certainly don't know
of any complete solutions to that challenge.


-- 
Randolph Gary Smith <gary@chpc.utexas.edu>
Systems Group, Center for High Performance Computing
The University of Texas System
Commons Building 1.151C, Balcones Research Center
10100 Burnet Road
Austin, TX 78758
(512) 471-2411

gah@mt-xia.watson.ibm.com (Gary A. Hoffman) (03/04/91)

The AMD 29000 has something we call a Branch-Target-Table (BTT) in a patent.
The CMOS version of the ROMP processor in the deceased IBM/RT had
a loop buffer in it.  There are other examples.

Generally, small BTTs and other mechanisms work better than small caches.
A good deal of the value comes from separating instructions from data with
this form of associativity.  

etc, etc, etc.
-- 
g

jesup@cbmvax.commodore.com (Randell Jesup) (03/05/91)

In article <1991Mar4.153151.26216@arnor.uucp> gah@ibm.com (Gary A. Hoffman) writes:
>AMD 29000 has something we call a Branch-Target-Table (BTT) in a patent.
>The CMOS version of the ROMP processor in the deceased IBM/RT had
>a loop buffer in it.  There are other examples.

	BTW, there may be prior art on BTC's: The RPM-40 has them, and had
them in the design before the 29000 was announced (well before, I think).
We (the rpm-40 team) may well have gotten the idea from even earlier work,
a lot of the ideas in the rpm-40 came from other architectures, such as
transputers, the MIPS and RISC chips, etc.

	I also could be wrong and we might have gotten it from the 29000,
but I'm fairly certain that's not the case (I remember when it was announced,
and I was already on the team - the BTC's were there long before I joined it).

	Note: I know nothing (and care less) about legal matters and lawyers.
I also may have my dates confused; it was something like 4+ years ago.

>Generally, small BTTs and other mechanisms work better than small caches.
>A good deal of the value comes from seperating instructions from data with
>this form of associativity.  

	So long as I-mem can keep up with the bandwidth required for
sequential operation, BTC's can be very small and give a big win.  You can
also make good use of BTC's to handle branches and a large external I-cache
to interface to memory too slow to handle the required bandwidth.
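	One way to picture that split, as a toy sketch (mine, not the RPM-40's
or 29000's design): a tiny BTC holds the first few instructions at each
recently taken branch target, so the pipeline keeps going while the I-cache
refills the sequential stream.

```python
# Toy model of a small branch-target cache (BTC): on a taken branch,
# the first few instructions at the target are supplied from the BTC,
# hiding the latency of restarting the sequential fetch stream.
# (My sketch, not any shipped design.)

class BTC:
    def __init__(self, entries=4, words_per_entry=2):
        self.entries = entries
        self.wpe = words_per_entry
        self.table = {}                # target addr -> first instructions

    def taken_branch(self, target, memory):
        if target in self.table:       # hit: target words already on hand
            return self.table[target], True
        words = [memory[target + i] for i in range(self.wpe)]
        if len(self.table) >= self.entries:
            self.table.pop(next(iter(self.table)))  # evict oldest entry
        self.table[target] = words
        return words, False

mem = {a: f"i{a}" for a in range(32)}
btc = BTC()
_, hit = btc.taken_branch(8, mem)
assert not hit                         # first time around: fill the entry
_, hit = btc.taken_branch(8, mem)
assert hit                             # loop branch now hits the BTC
```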

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/07/91)

On 5 Mar 91 02:05:26 GMT, jesup@cbmvax.commodore.com (Randell Jesup) said:

jesup> In article <1991Mar4.153151.26216@arnor.uucp> gah@ibm.com (Gary A. Hoffman) writes:

gah> AMD 29000 has something we call a Branch-Target-Table (BTT) in a patent.
gah> The CMOS version of the ROMP processor in the deceased IBM/RT had
gah> a loop buffer in it.  There are other examples.

jesup> 	BTW, there may be prior art on BTC's: The RPM-40 has them, and had
jesup> them in the design before the 29000 was announced (well before, I think).
jesup> We (the rpm-40 team) may well have gotten the idea from even earlier work,
jesup> a lot of the ideas in the rpm-40 came from other architectures, such as
jesup> transputers, the MIPS and RISC chips, etc.

What about the MU5 branch target cache?  Documented in the vital and
little-known book "The MU5 Computer System", by Ibbett & Morris,
Macmillan.  Early-seventies stuff...
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk