[comp.arch] loadable control store, an idea whose time has gone

henry@zoo.toronto.edu (Henry Spencer) (11/01/90)

In article <2817@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  Loadable control store is a great idea, and can really improve the
>performance of a program...

Well, if you're using a microprogrammed CPU with a control store in the
first place.  Nobody in his right mind designs a high-performance system
that way any more, given a choice.  You improve the performance of the
program even more by going to a RISC CPU which has a cache instead of
a control store and runs user code at one instruction per cycle.
-- 
"I don't *want* to be normal!"         | Henry Spencer at U of Toronto Zoology
"Not to worry."                        |  henry@zoo.toronto.edu   utzoo!henry

pcg@cs.aber.ac.uk (Piercarlo Grandi) (11/02/90)

On 1 Nov 90 04:45:15 GMT, henry@zoo.toronto.edu (Henry Spencer) said:

henry> In article <2817@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com
henry> (bill davidsen) writes:

davidsen> Loadable control store is a great idea, and can really improve
davidsen> the performance of a program...

henry> Well, if you're using a microprogrammed CPU with a control
henry> store in the first place.  Nobody in his right mind designs a
henry> high-performance system that way any more, given a choice.

Uhmmmm. We remain to be convinced. What has been proven so far is that
general purpose microprogrammed instruction sets are not a win because
the high level instructions that you can then implement are mostly
useless in a general purpose environment. But this thread was about the
usefulness of having multiple high level instruction sets, each tailored
to oen particular purpose. There is no reason for this not to work, and
it results in impressive code size reductions.

henry> You improve the performance of the program even more by going
henry> to a RISC CPU which has a cache instead of a control store and
henry> runs user code at one instruction per cycle.

Only if you have unlimited real memory...

But what you say is still mostly true -- in the sense that this implies
inlining the ad-hoc high level instructions, and this seems better than
offlining them in the control store. This works well because most time
in a program is spent in loops, and loops can fit, even inlined, in the
cache, and we can build caches that are fast enough and do not steal
bandwidth from data accesses (Harvard architectures).

But here we have a tradeoff -- the same effect may be achieved by having
a single high level instruction (tailored for the purpose -- one could
even have a tool to generate an ad hoc microcode for the specific
program) that expands to a call to an offline sequence of micro
instructions in control store, or with an already expanded sequence of
simple instructions in an I cache, but the performance implications are
very different.

In the offline case we have extra dispatch time, but even more direct
access to the innards of the CPU/ALU in the micro instructions. In the
inline case we have direct execution, but the simple instructions are
more abstract.

Little can be done to avoid the extra dispatch time to the control
store, except implementing the simplest high level instructions as
special, direct-execution cases; on the other hand we could have very
low level, microprogram-like instructions at the architecture level
(e.g. VLIW), but then so much of the CPU/ALU innards is exposed that
recompilation becomes necessary across architecture implementations,
which has been a no-no since the System/360 days.

There are also the system wide implications -- better code density makes
for smaller working sets, and even small improvements in code locality
mean much lower page fault rates, and given the relative cost of a page
fault, this may be important.

Currently code density is not reckoned important, and the extra dispatch
time to the control store is. Maybe offlining will become more important
with superscalars (it is already important with vector machines).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

peter@ficc.ferranti.com (Peter da Silva) (11/04/90)

Microcode, macros, inline code. What are we talking about here?

Fast subroutine calls with low overhead that don't blow the cache, right?

How about having a shared memory segment in very fast memory (as fast as
the cache) that's read-only to the process. Since it's fast, bypass the
cache. Write your micros as some sort of low-overhead subroutines (trap,
maybe, or just copy the return address to a register and jump... they're
sure as hell not gonna be recursive!) and put them in there.

This would give you most of the advantage of writable control store, and
you could implement it on current CPUs.

(be sort of like programming the 1802)
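[A minimal sketch of the linkage Peter describes -- copy the return address
to a register and jump, as with the 1802's SEP trick or a RISC jump-and-link,
so there is no stack and no recursion. The tiny register machine, its opcodes,
and register names below are all invented for illustration:]

```python
# Tiny register-machine sketch: subroutine calls via a link register
# ("copy return address to a register and jump").  No stack, so the
# "micros" cannot recurse.  Encoding and register names are made up.

def run(program, entry=0):
    regs = {"r0": 0, "r1": 0, "link": 0}
    pc = entry
    while True:
        op, *args = program[pc]
        if op == "halt":
            return regs
        elif op == "li":            # load immediate
            regs[args[0]] = args[1]
            pc += 1
        elif op == "add":           # rd = rs1 + rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
            pc += 1
        elif op == "jal":           # jump-and-link: save return address
            regs["link"] = pc + 1
            pc = args[0]
        elif op == "ret":           # jump through the link register
            pc = regs["link"]

# A "micro" at address 4 doubles r0 into r1, then returns.
prog = [
    ("li", "r0", 21),           # 0
    ("jal", 4),                 # 1: call the micro
    ("halt",),                  # 2
    ("halt",),                  # 3: padding
    ("add", "r1", "r0", "r0"),  # 4: r1 = r0 + r0
    ("ret",),                   # 5
]

print(run(prog)["r1"])  # -> 42
```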
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com

johnl@esegue.segue.boston.ma.us (John R. Levine) (11/04/90)

In article <PCG.90Nov2153545@odin.cs.aber.ac.uk> you write:
>But this thread was about the usefulness of having multiple high level
>instruction sets, each tailored to one particular purpose. There is no reason
>for this not to work, and it results in impressive code size reductions.

The Burroughs 1700/1800 series had a microcode interpreter for each language.
It was extremely slow.  Perhaps it's because the people writing the microcode
didn't try hard enough to make it fast, but I suspect that the problem was
more that microcode interpretation was still too slow.

>Little can be done to avoid the problem with extra dispatch time to the
>control store, except implementing the simplest high level instructions
>as special, direct execution, cases; on the other hand we could have
>very low level, microprogram like, instructions at the architecture
>level (e.g. VLIW), but then so much of the CPU/ALU innards are exposed
>that recompilation becomes necessary across the architecture
>implementations, which is a no-no since the system/360 days.

Funny you should mention that.  The IBM System/38 and its sequel the AS/400
use an interesting scheme.  Compilers (most commonly RPG III, but keep
reading anyway) compile into a high-level macrocode object code.  Among other
things the macrocode has huge addresses, implementing a single-level store.
At the time a program is loaded, the program loader translates it into the
actual microcode for the processor upon which it is running, including
binding addresses into the machine's actual address space, typically 32 bits.
You can even put breakpoints into the macrocode and the translator makes it
all work in the translated code, installing and removing breakpoints and
translating the state at a break back into macrocode terms.

Different processors have different microcode, but since object code
portability is at the macrocode level, nobody cares.  Clearly, the microcode
has to be designed to support the macrocode, so there are fairly strong
constraints on the design of the microengine, but it does permit a wide
range of different performance implementations without the interpretation
overhead that conventional microcoded machines have.

Regards,
John Levine, johnl@iecc.cambridge.ma.us, {spdcc|ima|world}!esegue!johnl

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (11/05/90)

In article <PCG.90Nov2153545@odin.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
> There is no reason for this not to work, and
> it results in impressive code size reductions.

I'm reminded of the old idea of "throw-away compiling".
It seems to me that it might work at the instruction-set level.
The throw-away compiling idea was a hybrid between compiling BASIC
and interpreting it.  What you do is you leave each statement in
tokenised form until you execute it (in fact the first "instruction"
of a tokenised line is a call to the compiler).  Compilation is done
into a "buffer".  When the buffer fills up, it is just thrown away
(hence the name) and compilation starts over with the current statement.

Now, imagine a very compact form of code (byte-codes?) held on disc.
When you call a procedure, if it isn't already expanded, you expand it.
The expansion is exactly the kind of work that a micro-coded interpreter
would do anyway, except that you're storing the results into a buffer
instead of executing it.  This may involve fetching the byte-code string
from disc.  When a code page is to be "paged out", you just throw it away,
preferring to keep the byte-code strings in memory.
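[The scheme above can be sketched in a few lines: byte-codes stay resident,
expanded code lives in a fixed-size buffer that is discarded wholesale when
it fills. The byte-code names and the trivial "expansion" are invented:]

```python
# Throw-away compiling sketch: compact byte-codes are always resident;
# expanded code lives in a fixed-size buffer that is simply discarded
# (never paged out) when it fills, then refilled on demand.

BUFFER_CAPACITY = 8          # expanded-code slots; tiny for illustration

bytecodes = {                # the compact, always-resident form
    "double": ["PUSH_ARG", "PUSH_ARG", "ADD"],
    "square": ["PUSH_ARG", "PUSH_ARG", "MUL"],
}

expanded = {}                # name -> expanded code (the buffer)

def expand(name):
    """The work a microcoded interpreter would do per execution,
    done once: each byte-code becomes an 'expanded' instruction."""
    return ["native_" + bc.lower() for bc in bytecodes[name]]

def call(name):
    global expanded
    if name not in expanded:                 # not yet expanded
        cost = len(bytecodes[name])
        used = sum(len(v) for v in expanded.values())
        if used + cost > BUFFER_CAPACITY:
            expanded = {}                    # buffer full: throw it away
        expanded[name] = expand(name)
    return expanded[name]                    # "execute" the expanded form

call("double"); call("square")   # both fit (3 + 3 <= 8 slots)
call("double")                   # hit: no re-expansion needed
```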
-- 
The problem about real life is that moving one's knight to QB3
may always be replied to with a lob across the net.  --Alasdair Macintyre.