rmbult01@ulkyvx.bitnet (Robert M. Bultman) (02/26/91)
The CDC 6600, 7600, STAR 100, and IBM 360/195 made use of an instruction stack. The implementation varied, but generally they were used to speed up operation of loops. If a backwards branch referenced an instruction contained within the instruction stack, the instruction was obtained from the instruction stack rather than main store. This allowed the CPU<->Memory path to be used for data access rather than instruction fetch. I have several questions regarding this:

1) Do any recent (> 1985) machines use this, or has this been obviated by the use of cache?

2) Was this mechanism software controllable, or was it a hardware-only mechanism? (The reference (see below) regarding this indicated it was hardware only, using an address comparison to determine if the backwards jump fell within the limit of the instruction stack. This sounds like a very small i-cache, but one contained within the CPU.)

3) Would such a mechanism help RISCs? (Why? Why not?)

Reference: Topham, Nigel P., Omondi, Amos, and Ibbett, Roland N., "On the Design and Performance of Conventional Pipelined Architectures", The Journal of Supercomputing, Vol. 1, No. 4, Aug 1988.

Robert Bultman, University of Louisville, Speed Scientific School
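[A minimal sketch of the hardware mechanism described above, in Python. The buffer depth and FIFO replacement here are illustrative assumptions, not the behavior of any particular machine: a small buffer of recently fetched words services a backward branch from the buffer when the target address still compares in range.]

```python
# Hypothetical sketch of an instruction stack / loop buffer.
# Instructions stream through a small buffer of the most recently
# fetched words; a backward branch whose target is still buffered
# is serviced from the buffer instead of main store.

from collections import deque

class InstructionStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.buf = {}            # address -> instruction word
        self.order = deque()     # FIFO of buffered addresses

    def fetch(self, addr, memory):
        if addr in self.buf:     # address comparison: in-stack hit
            return self.buf[addr], True
        word = memory[addr]      # miss: use the CPU<->Memory path
        if len(self.order) == self.depth:
            self.buf.pop(self.order.popleft())
        self.buf[addr] = word
        self.order.append(addr)
        return word, False

memory = {a: f"insn@{a}" for a in range(100)}
stack = InstructionStack(depth=8)
for a in range(10, 16):            # first pass through a short loop: misses
    stack.fetch(a, memory)
_, hit = stack.fetch(10, memory)   # backward branch to the loop head
print(hit)                         # loop body still buffered -> True
```

On the second and later trips around the loop, every fetch hits the stack, leaving the memory path free for operand traffic — which is exactly the benefit claimed for the 6600/7600-style designs.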
kirchner@informatik.uni-kl.de (Reinhard Kirchner) (03/01/91)
From article <1991Feb26.112611.808@ulkyvx.bitnet>, by rmbult01@ulkyvx.bitnet (Robert M. Bultman):

> 1) Do any recent (> 1985) machines use this, or has this been obviated by the use of cache?
>
> 2) Was this mechanism software controllable, or was it a hardware-only mechanism? (The reference (see below) regarding this indicated it was hardware only using an address comparison
>
> 3) Would such a mechanism help RISCs? (Why? Why not?)

The already-mentioned Hyperstone does something like this, even if it is called a 'cache'. They load a circular buffer with the executed instructions and so have them at hand in a loop. They do an automatic prefetch and also have an instruction to start a certain amount of prefetching (max. 8 words). This seems to be implementable with a very small number of transistors and seems to generate a lot of speed (the Hyperstone has only 85k transistors). Since the processor has special means for sequential access within the same RAM page, prefetching can exploit this and avoids intermixed instruction and data fetches.

Reinhard Kirchner
Univ. Kaiserslautern, Germany
kirchner@uklirb.informatik.uni-kl.de
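[A rough sketch of the circular prefetch buffer described above. The ring size, tagging, and prefetch behavior are illustrative assumptions for exposition, not the Hyperstone's actual design.]

```python
# Illustrative circular instruction buffer in the spirit of the
# Hyperstone description: sequential words are prefetched into a
# small ring; a short backward loop re-executes out of the ring
# without touching memory again.

class RingBuffer:
    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size   # slots[addr % size] holds (addr, word)

    def prefetch(self, start, count, memory):
        # Start prefetching up to `count` sequential words (cf. max. 8).
        for a in range(start, start + min(count, self.size)):
            self.slots[a % self.size] = (a, memory[a])

    def fetch(self, addr, memory):
        entry = self.slots[addr % self.size]
        if entry is not None and entry[0] == addr:
            return entry[1], True            # hit: no memory access
        self.slots[addr % self.size] = (addr, memory[addr])
        return memory[addr], False

memory = {a: a * 2 for a in range(64)}
ring = RingBuffer(size=8)
ring.prefetch(16, 8, memory)        # explicit prefetch of 8 words
word, hit = ring.fetch(20, memory)  # within the prefetched window
print(word, hit)                    # 40 True
```

Because the prefetch reads sequential words, it can ride the fast same-page access mode Kirchner mentions, and loop bodies no larger than the ring never compete with data fetches.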
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/01/91)
In article <1991Feb26.112611.808@ulkyvx.bitnet>, rmbult01@ulkyvx.bitnet (Robert M. Bultman) writes:

> The CDC 6600, 7600, STAR 100, and IBM 360/195 made use of an instruction stack. The implementation varied, but generally they were used to speed up operation of loops. If a backwards branch referenced an instruction contained within the instruction stack, the instruction was obtained from the instruction stack rather than main store. This allowed the CPU<->Memory path to be used for data access rather than instruction fetch. I have several questions regarding this:

Forward references were also speeded up. In the 6600/7600 this was not as common as on some of the others, but some of them brought in a forward stack as well upon reaching boundaries.

> 1) Do any recent (> 1985) machines use this, or has this been obviated by the use of cache?

I believe the ETA10 has this, and I do not know which others. On some machines this is referred to as an instruction cache; the uses of "cache" and "stack" here are essentially the same. Presumably one could even have different levels of this.

> 2) Was this mechanism software controllable, or was it a hardware-only mechanism? (The reference (see below) regarding this indicated it was hardware only using an address comparison to determine if the backwards jump fell within the limit of the instruction stack. This sounds like a very small i-cache but is contained within the CPU.)

As far as I know, this was hardware only. I do not know of any instructions on these machines, and some others, which would allow software control.

> 3) Would such a mechanism help RISCs? (Why? Why not?)

Unless memory access is as fast as stack access, of course it would help.

--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin (UUCP)
gary@chpc.utexas.edu (Gary Smith) (03/02/91)
In article <1991Feb26.112611.808@ulkyvx.bitnet>, rmbult01@ulkyvx.bitnet (Robert M. Bultman) writes:

|> The CDC 6600, 7600, STAR 100, and IBM 360/195 made use of an instruction stack. The implementation varied, but generally they were used to speed up operation of loops. If a backwards branch referenced an instruction contained within the instruction stack, the instruction was obtained from the instruction stack rather than main store. This allowed the CPU<->Memory path to be used for data access rather than instruction fetch. I have several questions regarding this:
|>
|> 1) Do any recent (> 1985) machines use this, or has this been obviated by the use of cache?
|>
|> 2) Was this mechanism software controllable, or was it a hardware-only mechanism? (The reference (see below) regarding this indicated it was hardware only using an address comparison to determine if the backwards jump fell within the limit of the instruction stack. This sounds like a very small i-cache but is contained within the CPU.)
|>
|> 3) Would such a mechanism help RISCs? (Why? Why not?)
|>
|> Reference: Topham, Nigel P., Omondi, Amos, and Ibbett, Roland N., "On the Design and Performance of Conventional Pipelined Architectures", The Journal of Supercomputing, Vol. 1, No. 4, Aug 1988.
|>
|> Robert Bultman, University of Louisville, Speed Scientific School

1) The CRAY Y-MP systems use the instruction stack concept: each CPU of this shared-memory multiprocessor system contains 4 instruction buffers which are loaded through port D (shared with I/O; ports A & B are for vector loads, port C for vector stores). Instruction fetch has priority over I/O transfer in port D. The buffers are used round-robin and each contains thirty-two 64-bit words (or 128 16-bit parcels). The first instruction in any of the four groups must have a word address which is a multiple of 32.
Each instruction buffer has an IBAR (the high-order 17 bits of P) which is compared with P; if an in-buffer condition exists, no access to CM occurs. One "neat" aspect of non-coincidence recovery is that the word containing the parcel accessed is fetched from memory first, with [potential] circular fill for the remaining 31. During the fetch sequence, instructions are retrieved from CM to an instruction buffer at a rate of one 64-bit word per clock period (6 ns). It takes nineteen clocks for the first word to be available, and 3 more clocks for the first instruction to get to CIP. Memory conflicts can of course lengthen an instruction fetch beyond the 19+31 = 50 clock period best case.

I suppose I would call this a cache, although it's not obvious to me where on the continuum of direct-mapped to fully-associative one would put CRAY's concept. As is typically the case, it's easier for me to just describe it as I understand it. Categorizing is quite difficult for me, but that's not germane I guess.

It's "interesting" to observe the effects of running the following program in one or more CPUs of an X-MP, which had a similar instruction cache (although a VERY different implementation - instruction fetch on those machines grabbed all CPUs' paths to CM so that fetch could be completed in 14+3 clocks (8.5 ns clock) rather than 14+31 clocks). I once measured a "slow down" (more CPU time charged to the user) of 32% on an X by running the following program in one of the CPUs:

         IDENT  LUM            LOCK UP MEMORY
LUM      ENTER  NP=0,MODE=USER
         S2     0              (FOR XPAK DISPLAY)
         S1     1
LUM1     J      LUM2           JUMP FOREVER, USING ALL INSTRUCTION BUFFERS
         ALLIGN                (EQUIVALENT TO BSS 31)
LUM2     J      LUM3
         ALLIGN
LUM3     J      LUM4
         ALLIGN
LUM4     J      LUM5
         ALLIGN
LUM5     S2     S2+S1
         J      LUM1
         END

2) The CDC 6600 instruction stack consisted of eight 60-bit instruction registers, labeled I0, I1, ..., I7. An additional 60-bit register known as the Input Register served as a buffer between CM and the stack.
There were some restrictions, but fundamentally this allowed thirty-two 15-bit parcels to be resident in "cache" on a given clock. In-stack loops required a 30-bit instruction for the branch, and jumps could not be executed from I0, the bottom register of the stack, due to a process called "inching" the stack (which caused discarding of I7). All this fine hardware was contained on Chassis 5, but I'm lapsing into something not germane again. Sorry.

3) The CDC 6600 was a "RISC" machine in my opinion, as is today's CRAY. In fact, and this is an oversimplification of course, the CRAY simply adds one bit to the 6-bit op code of the 6600. If the bit's 0, you get a scalar op; if it's 1, you get a vector op. How much this can/does help other RISC systems depends on too many factors to list here. It gets into general memory hierarchy questions. I believe the reason for memory hierarchies is speeding up execution times of REAL applications, and I certainly don't know of any complete solutions to that challenge.

--
Randolph Gary Smith <gary@chpc.utexas.edu>
Systems Group, Center for High Performance Computing
The University of Texas System
Commons Building 1.151C, Balcones Research Center
10100 Burnet Road
Austin, TX 78758
(512) 471-2411
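[The Y-MP in-buffer (IBAR) test described above can be sketched roughly as follows. This is a simplified model for exposition: it keeps only the 32-word alignment, the four-buffer round-robin replacement, and the base-address comparison, and ignores parcel addressing and the circular fill.]

```python
# Rough sketch of the Y-MP in-buffer test: each of four instruction
# buffers holds 32 words aligned on a 32-word boundary; the buffer's
# base address (IBAR, the high-order bits of P) is compared with the
# program address P to decide whether CM must be accessed.

BUF_WORDS = 32
NUM_BUFS = 4

class InstructionBuffers:
    def __init__(self):
        self.ibar = [None] * NUM_BUFS    # base word address of each buffer
        self.next = 0                    # round-robin replacement pointer

    def in_buffer(self, p):
        base = p - (p % BUF_WORDS)       # high-order bits of P
        return any(b == base for b in self.ibar)

    def fetch(self, p):
        if self.in_buffer(p):
            return True                  # in-buffer: no access to CM
        self.ibar[self.next] = p - (p % BUF_WORDS)   # non-coincidence: fill
        self.next = (self.next + 1) % NUM_BUFS
        return False

bufs = InstructionBuffers()
bufs.fetch(100)          # miss: loads the buffer covering words 96..127
print(bufs.fetch(96))    # same 32-word group -> in-buffer, True
print(bufs.fetch(130))   # next group -> non-coincidence, False
```

The LUM program above defeats exactly this test: five jump targets spaced 32 words apart cycle through all four buffers, so every branch is a non-coincidence and memory is fetched continuously.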
gah@mt-xia.watson.ibm.com (Gary A. Hoffman) (03/04/91)
The AMD 29000 has something we call a Branch-Target-Table (BTT) in a patent. The CMOS version of the ROMP processor in the deceased IBM/RT had a loop buffer in it. There are other examples.

Generally, small BTTs and other mechanisms work better than small caches. A good deal of the value comes from separating instructions from data with this form of associativity. Etc, etc, etc.

-- g
jesup@cbmvax.commodore.com (Randell Jesup) (03/05/91)
In article <1991Mar4.153151.26216@arnor.uucp> gah@ibm.com (Gary A. Hoffman) writes:
>AMD 29000 has something we call a Branch-Target-Table (BTT) in a patent.
>The CMOS version of the ROMP processor in the deceased IBM/RT had
>a loop buffer in it. There are other examples.

BTW, there may be prior art on BTCs: the RPM-40 has them, and had them in the design before the 29000 was announced (well before, I think). We (the RPM-40 team) may well have gotten the idea from even earlier work; a lot of the ideas in the RPM-40 came from other architectures, such as transputers, the MIPS and RISC chips, etc. I also could be wrong and we might have gotten it from the 29000, but I'm fairly certain that's not the case (I remember when it was announced, and I was already on the team - the BTCs were there long before I joined it). Note: I know nothing (and care less) about legal matters and lawyers. I also may have my dates confused; it was something like 4+ years ago.

>Generally, small BTTs and other mechanisms work better than small caches.
>A good deal of the value comes from separating instructions from data with
>this form of associativity.

So long as I-mem can keep up with the bandwidth required for sequential operation, BTCs can be very small and give a big win. You can also make good use of BTCs to handle branches, with a large external I-cache to interface to memory too slow to handle the required bandwidth.

--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup

The compiler runs
Like a swift-flowing river
I wait in silence.
(From "The Zen of Programming")  ;-)
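[A hypothetical sketch of the branch target cache (BTC) idea discussed above. The table size, entry width, and replacement policy are illustrative assumptions; the point is only the mechanism: cache the first few instructions at a branch's target so the pipeline has work while sequential fetch restarts.]

```python
# Hypothetical branch target cache (BTC): a small table mapping a
# branch's address to the first few instructions at its target, so
# that taken branches don't stall while I-fetch redirects.

BTC_LINES = 4
WORDS_PER_ENTRY = 2   # illustrative: enough to cover refill latency

class BranchTargetCache:
    def __init__(self):
        self.table = {}   # branch addr -> list of target-path words

    def lookup(self, branch_addr):
        return self.table.get(branch_addr)

    def fill(self, branch_addr, target, memory):
        self.table[branch_addr] = [memory[target + i]
                                   for i in range(WORDS_PER_ENTRY)]
        if len(self.table) > BTC_LINES:            # crude FIFO replacement
            self.table.pop(next(iter(self.table)))

memory = {a: f"i{a}" for a in range(64)}
btc = BranchTargetCache()
if btc.lookup(40) is None:     # first time the branch at 40 is taken: miss
    btc.fill(40, 8, memory)    # cache the instructions at target 8
print(btc.lookup(40))          # ['i8', 'i9'] on subsequent takes
```

This matches Jesup's observation: as long as sequential fetch bandwidth is adequate, the BTC only has to hide the redirect latency at taken branches, so a handful of short entries suffices.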
pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/07/91)
On 5 Mar 91 02:05:26 GMT, jesup@cbmvax.commodore.com (Randell Jesup) said:
jesup> In article <1991Mar4.153151.26216@arnor.uucp> gah@ibm.com (Gary A. Hoffman) writes:
pcg> AMD 29000 has something we call a Branch-Target-Table (BTT) in a patent.
pcg> The CMOS version of the ROMP processor in the deceased IBM/RT had
pcg> a loop buffer in it. There are other examples.
jesup> BTW, there may be prior art on BTC's: The RPM-40 has them, and had
jesup> them in the design before the 29000 was announced (well before, I think).
jesup> We (the rpm-40 team) may well have gotten the idea from even earlier work,
jesup> a lot of the ideas in the rpm-40 came from other architectures, such as
jesup> transputers, the MIPS and RISC chips, etc.
What about the MU5 branch target cache? Documented in the vital and
little known book "The MU5 Computer System", by Ibbett & Morris,
Macmillan. Early-seventies stuff...
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk