[comp.arch] VISC - A way to speed up moto cisc mpu's?

mcdonald@Daisy.EE.UND.AC.ZA (Bruce J McDonald) (05/15/91)

Variant Instruction Set Computers - VISC
----------------------------------------

A way to speed up future Motorola CISC MPUs could be:

Widen the data bus to 64 bits and make the internal data paths and
ALU width all 64 bits.  Introduce an exclusive mode switch instruction
which would switch in an enhanced, RISC-like micro-engine with a
totally new instruction set geared for 64-bit operations.  Superset
the existing 16 x 32-bit register file up to an n x 64-bit register
file so that the new 64-bit mode could access the old-style register
file as part of the new, larger register file.  Access to the FPU,
cache and MMU (and any other functional units) would be maintained
transparently, as would the same pipeline stages (this would be harder
to do..) as the old CISC core.  Notice that the RISC-like enhancements
to the CISC core should be dropped and the CISC core kept for downward
compatibility only - all speedy execution should be handled by the
RISC core.

This opens up the interesting option of, say, implementing a SPARC RISC
core, or an HP-PA core, which would mean that an existing 680x0 product
would run HP-PA code upon executing the mode switch instruction.

This would mean that new compilers would have to be written which would
be able to switch the MPU into the new mode for enhanced performance.  I
would think this means an additional CCR bit, but since there are slots
available, it should be no problem.
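The proposed mode switch can be sketched as a toy interpreter (in
anachronistic Python; the status-register bit position, opcode names,
and register-file sizes below are all hypothetical illustrations, not
real 680x0 details):

```python
# Toy model of the VISC idea: one status-register bit selects which
# decoder interprets opcodes; both modes share one register file, so
# switching modes destroys no data.  All names here are hypothetical.

MODE_BIT = 1 << 12  # an assumed free slot in the status register

class ViscCPU:
    def __init__(self):
        self.sr = 0            # mode bit clear = old CISC mode
        self.regs = [0] * 32   # n x 64-bit file; the old 16 x 32-bit
                               # file is a subset of it

    def switch_mode(self):
        # the proposed exclusive instruction: flips the mode and
        # touches nothing else (no reset, no data loss)
        self.sr ^= MODE_BIT

    def execute(self, op, d, s):
        if self.sr & MODE_BIT:          # RISC-like 64-bit engine
            ops = {"add64": lambda a, b: (a + b) & (2**64 - 1)}
        else:                           # legacy 32-bit CISC engine
            ops = {"add32": lambda a, b: (a + b) & (2**32 - 1)}
        self.regs[d] = ops[op](self.regs[d], self.regs[s])

cpu = ViscCPU()
cpu.regs[0], cpu.regs[1] = 7, 5
cpu.execute("add32", 0, 1)      # old mode: regs[0] = 12
cpu.switch_mode()
cpu.execute("add64", 0, 1)      # new mode sees the same registers: 17
print(cpu.regs[0])
```

The key property the article asks for is visible in `switch_mode`:
flipping the bit leaves the register file intact across the switch.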

What I do not like about this scheme is that it resorts to kludging in
the same fashion that Intel used to upgrade the 8080 into the 80486,
adding bits and pieces which destroyed the orthogonality of the original
design (except that the 8080 wasn't a great design to begin with).  The
mode switch should be possible without having to reset or destroy data
in the CPU, as opposed to the real <-> protected mode switch horrors of
the 80x86s.

Comments please ... ( flames to /dev/null )

BJ McDonald, University of Natal, Durban, King George V Ave, South Africa.

sef@kithrup.COM (Sean Eric Fagan) (05/16/91)

In article <1991May15.110000.25800@Daisy.EE.UND.AC.ZA> mcdonald@Daisy.EE.UND.AC.ZA (Bruce J McDonald) writes:
>Variant Instruction Set Computers - VISC
>
>ALU width all 64 bits.  Introduce an exclusive mode switch instruction
>which would switch in an enhanced, RISC-like micro-engine, with totally
>new instruction set geared for 64bit operations.

Uhm, it would probably be better to devote all that chip space to the "RISC"
processor and ship a software emulator.  If one puts enough cache on the
chip, one might even be able to make the entire emulator fit in cache (see
Henry Spencer).

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

martin@adpplz.UUCP (Martin Golding) (05/17/91)

In <1991May15.110000.25800@Daisy.EE.UND.AC.ZA> mcdonald@Daisy.EE.UND.AC.ZA (Bruce J McDonald) writes:

>Variant Instruction Set Computers - VISC
>----------------------------------------
[description of hypothetical future machine with 64 bit risc and 68xxx modes]

Mode switching is Vax. Extending the instruction set is Eagle. 
Plus ca change, plus ce la meme chose (My French is as good as my c).
For the MOST exciting variable instruction set, consider the Burroughs
1700, with a byte width of 1, a variable word width up to 24, and microcode
swapping to adapt the instruction set to the program that was running.
(PROOF that Cobol isn't a real language: on the B1700, Cobol and RPG ran
on the _same virtual machine_. Peugh.)

And while we're busy making current computers into old-fashioned computers:
Don't forget the Cyber trick of running multiple tasks on multiple
memory buses with a single CPU (adapt cheap memory to fast RISC chips)
or the IBM fancy that decoded 360 instructions for a dataflow processor
(For some instruction sequences, dataflow beats scoreboard).


Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         |  He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp

plinio@turing.seas.ucla.edu (Plinio Barbeito) (05/17/91)

In article <1991May15.110000.25800@Daisy.EE.UND.AC.ZA> mcdonald@Daisy.EE.UND.AC.ZA (Bruce J McDonald) writes:
>A way to speed up future Motorola CISC MPUs could be:
>
...
>core.  Notice that the RISC-like enhancements to the CISC core should 
>be dropped and the CISC core kept for downward compatibility only - all
>speedy execution should be handled by the RISC core.

Does having a RISC core in itself guarantee fast execution?  I thought
the reason RISC was fast was the great amount of space it freed up on
the chip that could be used to speed up basic operations.

I think it would be more in line with RISC philosophy to rip out as
much of the CISC core as possible, leaving close to the bare minimum of 
what is needed to emulate via software traps those addressing modes that 
would be deleted (most) and those instructions that would be deleted 
(anything that compilers are staying away from, up to the neck of the 
curve).
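Plinio's trap-based emulation can be sketched as a toy interpreter (in
anachronistic Python; the opcode names and the split between "hardware"
and "trapped" instructions are invented for illustration, and real 68040
trap mechanics are more involved):

```python
# Toy sketch of instruction emulation via software traps: the
# "hardware" implements only a few simple ops; anything else raises
# an illegal-instruction trap, and a software handler emulates it.

HARDWARE_OPS = {
    "move": lambda a, b: b,
    "add":  lambda a, b: a + b,
}

def emulate_muls(a, b):
    # software trap handler: multiply via repeated hardware adds
    total = 0
    for _ in range(abs(b)):
        total = HARDWARE_OPS["add"](total, a)
    return -total if b < 0 else total

TRAP_HANDLERS = {"muls": emulate_muls}

def execute(op, a, b):
    if op in HARDWARE_OPS:
        return HARDWARE_OPS[op](a, b)
    if op in TRAP_HANDLERS:          # illegal-instruction trap taken
        return TRAP_HANDLERS[op](a, b)
    raise ValueError("illegal instruction: %s" % op)

print(execute("add", 3, 4))    # hardware path -> 7
print(execute("muls", 6, -7))  # trapped and emulated -> -42
```

Old binaries keep working, at the price of a trap plus many cycles for
each deleted instruction, which is why it only pays for rare opcodes.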

This is how many FPU ops in the 68040 are implemented, and the strategy
seems to have been successful, if SPEC numbers are worth their salt.
Keeping the old CISC core would hog chip real estate that could be
better applied to speeding up other things, like the ubiquitous
'move's, IMHO.

As to whether they should go load/store, I think it would be less
of a compatibility-kludge nightmare to keep the move's but enlarge
the cache to help outweigh whatever benefits that approach might have had.
Besides, I've always savored the fact that 68k programs have been 
consistently and significantly smaller than many equivalent RISC 
binaries.  This helps load-time performance if processes are I/O bound, 
or use slow I/O devices.  It also saves space on mass-storage devices, 
but I haven't seen these issues dealt with extensively in this group.  
Apparently, everyone buys enough RAM so that they never page fault :-)

Regarding 64-bits, in my opinion, it depends.  In the short term it 
looks like it would disproportionately raise costs and complicate 
compatibility issues.  However, if memory speeds continue to lag 
behind CPU speeds, it may eventually become the only feasible solution.
Comments?  Is everyone jumping on the 64-bit bandwagon?

>This would mean that new compilers would have to be written which would
>be able to switch the MPU into the new mode for enhanced performance.  I
>would think this means an additional CCR bit, but since there are slots
>available, it should be no problem.

The advantage of doing it the other way is that you don't have to rewrite
any software, except maybe a new optimizer for your compilers.

Interesting side-note: Since it would be cheap to add new instructions
via traps, how about putting in an opcode to speed up string comparisons 
to deny intel the ability to claim higher performance in any
benchmark category? (Flamesuit on)


plin
--
To mak wridin mo eficiend, i sujes de folouin janjs: drop deleder 'c', as 
'k' uil do jus fin.  gt rid of endn 'e', sins ids nevr pronncd aniuai.  als, 
't' is nevr nedd; us 'd'.  repeddv knsnnds shd b nls bpp ngbbl rr...01011101

preston@ariel.rice.edu (Preston Briggs) (05/18/91)

plinio@turing.seas.ucla.edu (Plinio Barbeito) writes:

>Does having a RISC core in itself guarantee fast execution?  I thought
>the reason RISC was fast was the great amount of space it freed up on
>the chip that could be used to speed up basic operations.
>
>I think it would be more in line with RISC philosophy to rip out as
>much of the CISC core as possible, leaving close to the bare minimum of 
>what is needed to emulate via software traps those addressing modes that 
>would be deleted (most) and those instructions that would be deleted 
>(anything that compilers are staying away from, up to the neck of the 
>curve).

I think an important part of the "risc philosophy" is to expose low-level
operations to the compiler.  If you bundle them up into cisc-like globs,
the optimizer loses many opportunities.

Emulation should probably be restricted to maintaining object compatibility,
with the understanding that recompilation is always preferable in terms
of performance.

Preston Briggs

torek@elf.ee.lbl.gov (Chris Torek) (05/22/91)

[Context: someone suggested adding/deleting/changing 680x0 instructions
for newer 680x0s, with a compatibility mode in the status register or
some such.]

In article <1991May15.183328.22820@kithrup.COM> sef@kithrup.COM
(Sean Eric Fagan) writes:
>Uhm, it would probably be better to devote all that chip space to the "RISC"
>processor and ship a software emulator. ...

Why not just build a multiprocessor system with completely different
processors?  I.e., ship a system that contains, say, one 68040 and one
or more 88x00s.  There is no particular reason that the O/S cannot run
the proper binary on the proper CPU automatically.

Of course, this takes more board space unless the 68040 and 88100 are
in the same package (and if that is the case you might have pin problems).
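The dispatch Chris describes can be sketched in (anachronistic) Python;
the magic numbers, CPU list, and binary format below are invented for
illustration:

```python
# Sketch of a heterogeneous-MP exec path: the O/S inspects a binary's
# architecture tag and queues it on a matching processor.  The magic
# numbers and CPU names are hypothetical.

CPUS = {"68040": [], "88100": []}        # each CPU's run queue

MAGIC_TO_ARCH = {0xCAFE0040: "68040", 0xCAFE8810: "88100"}

def os_exec(binary):
    arch = MAGIC_TO_ARCH.get(binary["magic"])
    if arch is None:
        raise ValueError("unknown binary format")
    CPUS[arch].append(binary["name"])    # run it on the proper CPU
    return arch

os_exec({"name": "oldapp",    "magic": 0xCAFE0040})
os_exec({"name": "raytracer", "magic": 0xCAFE8810})
print(CPUS["68040"], CPUS["88100"])
```

The point is that nothing in the binary has to change; the O/S simply
routes each existing object format to the CPU that can run it.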
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

plinio@turing.seas.ucla.edu (Plinio Barbeito) (05/23/91)

In article <1991May23.210000.8152@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
>In article <13445@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>>Why not just build a multiprocessor system with completely different
>>processors?  I.e., ship a system that contains, say, one 68040 and one
>>or more 88x00s.  There is no particular reason that the O/S cannot run
>>the proper binary on the proper CPU automatically.

Interesting... this could lead to a more intelligent distributed process
server.  One that not only takes which CPU is least loaded into
consideration when exec'ing new processes, but also how fast a given
piece of code will run on that system or CPU.  For example, in such a
system, the O/S could use the 040 to move blocks of memory around and do
floating point, the 486 to do the strcmp's, the RIOS to do the published
Dhrystone benchmark, :-) and the i860 to provide an excuse for the
frequent down-time and delivery delays :-) :-)

(I went a bit overboard with that one...)  Maybe not just binaries, but 
different library functions could be assigned to different
processors.  I can't help but think that there would be *something* a
CISC processor would be consistently better at over its RISC 
contemporary.
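That "more intelligent process server" might weigh both current load and
per-architecture speed; a toy cost model in (anachronistic) Python, with
every CPU name, workload class, and speed figure invented:

```python
# Toy scheduler for a heterogeneous MP: pick the CPU minimizing an
# expected completion time = (queue length + 1) x work / relative
# speed of this CPU on this kind of code.  All figures are invented.

SPEED = {   # relative throughput of each CPU on each workload class
    "68040": {"blockmove": 2.0, "string": 1.0, "float": 1.5},
    "486":   {"blockmove": 1.0, "string": 2.0, "float": 0.8},
}

def pick_cpu(workload, work, queue_len):
    def cost(cpu):
        return (queue_len[cpu] + 1) * work / SPEED[cpu][workload]
    return min(SPEED, key=cost)

queues = {"68040": 0, "486": 0}
print(pick_cpu("blockmove", 100, queues))  # -> 68040
print(pick_cpu("string", 100, queues))     # -> 486
queues["486"] = 3
print(pick_cpu("string", 100, queues))     # loaded 486 loses -> 68040
```

Even this crude model captures the idea: the "best" CPU for a job
depends jointly on what the code does and on who is busy.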

Based on previous experience in this group, Somebody Has Already Thought 
Of This.   If it's true that SHATOT, then please share it with us.

>Because then you wouldn't get any speedup on your old programs.  I guess.
>Historically, such ventures have not done too well.  (Anyone remember the
>machine, many years ago, that had a 68k, a 6502, a Z80, and possibly one or
>two other processors?  Dimension, mayhap?  Anyway, it failed.)

Yes, but how many Apple II's (ages ago) didn't have a Z80 card so
people could run CP/M?  Also, Intel's 'vision of the future', as they
have explained it, is that the i860 family is not intended to
compete with or replace the 80x86 line for desktop systems, but that it
will accompany (or so they would hope) most of them as a card to
"speed up operations".  Then again, maybe they don't really think this
would fly and are just using it as a ploy to calm investors paranoid of
self-competition hurting immediate maximal profits.

Possibly, the strategy of combining different processors would meet more
consistent success if each processor were *needed* there to run a given
operating system (or other suitable, presently existing investment in
a body of software), and if the parts were cheap enough, and the
signals compatible enough (the latter of which is why I think Chris must
have mentioned the 040 together with the 88k's).  A nice example of this
might be a Mac server, with the 040 running the Mac OS and the other
processors running an unhindered version of Unix, serving files out,
etc.



plin
--
----- ---- --- -- ------ ---- --- -- - -  -  plinio@seas.ucla.edu 
Para-noia will destroy-yaaaaa...

gdtltr@brahms.udel.edu (gdtltr@limbo.org (The Befuddled One)) (05/23/91)

In article <13445@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
=>Why not just build a multiprocessor system with completely different
=>processors?  I.e., ship a system that contains, say, one 68040 and one
=>or more 88x00s.  There is no particular reason that the O/S cannot run
=>the proper binary on the proper CPU automatically.

   There was a paper on something like this in Operating Systems Review
about a year ago. The system was called AAMP (I think) and was written
by someone from Sequent. The system arranged resource management in a
hierarchical structure, with the root actually managing memory and I/O.
Client operating systems perform basic I/O functions by passing messages
to their immediate servers, and on up the tree. A server has access to
the memory spaces of its clients, which facilitates message passing and
allows for easy debugging of client operating systems.
   The example system was a Sequent Symmetry running several copies of
a modified Dynix, but the paper made its point that a heterogeneous,
multi-OS multiprocessor is possible.

                                        Gary Duzan
                                        Time  Lord
                                    Third Regeneration



-- 
                            gdtltr@brahms.udel.edu
   _o_                      ----------------------                        _o_
 [|o o|]                   To be is to be networked.                    [|o o|]
  |_o_|        Disclaimer: I have no idea what I am talking about.       |_o_|

peter@ficc.ferranti.com (peter da silva) (05/23/91)

In article <1991May23.210000.8152@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
> In article <13445@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
> >There is no particular reason that the O/S [on a heterogeneous MP] cannot run
> >the proper binary on the proper CPU automatically.

> Because then you wouldn't get any speedup on your old programs.  I guess.

If your older programs are *the* critical thing to speed up, then that's a
problem. But in that case you're unlikely to abandon the 68000, 8086, VAX,
or whatever family anyway.

For a concrete example... if I could get an 88000 in my Amiga, and I just had
to recompile one of the PD raytracers to get it to use it, it might well be
worthwhile. Especially with the fine-grained multitasking the Amiga supports.

On another point, heterogeneous networks that operate this way have existed
for some time.
-- 
Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180;
Sugar Land, TX  77487-5012;         `-_-' "Have you hugged your wolf, today?"

sef@kithrup.COM (Sean Eric Fagan) (05/24/91)

In article <13445@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>Why not just build a multiprocessor system with completely different
>processors?  I.e., ship a system that contains, say, one 68040 and one
>or more 88x00s.  There is no particular reason that the O/S cannot run
>the proper binary on the proper CPU automatically.

Because then you wouldn't get any speedup on your old programs.  I guess.
Historically, such ventures have not done too well.  (Anyone remember the
machine, many years ago, that had a 68k, a 6502, a Z80, and possibly one or
two other processors?  Dimension, mayhap?  Anyway, it failed.)

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

aduane@urbana.mcd.mot.com (Andrew Duane) (05/29/91)

In article <21621@brahms.udel.edu> gdtltr@brahms.udel.edu (gdtltr@limbo.org (The Befuddled One)) writes:
>In article <13445@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>=>Why not just build a multiprocessor system with completely different
>=>processors?  I.e., ship a system that contains, say, one 68040 and one
>=>or more 88x00s.  There is no particular reason that the O/S cannot run
>=>the proper binary on the proper CPU automatically.
>
>   There was a paper on something like this in Operating Systems Review
>about a year ago. The system was called AAMP (I think) and was written
>by someone from Sequent.

Perhaps this is the XA/MP architecture from Intel? I worked on some
whiteboard-type research to make a project proposal based on this
architecture last year. It was a combination of the 80x86 (where
'x' probably == 4) and the i860. Our instance of this would have
run Mach or something like it.

We looked at several problems with a heterogeneous architecture,
and (as long as byte order was the same between CPUs), the actual
selection of a processor to run a thread on was pretty simple.
We even figured out how to induce the compiler to emit both flavors
of object code, and let the exec facility select the right one.
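That "both flavors of object code" trick is essentially a fat binary;
a minimal sketch of the exec-side selection in (anachronistic) Python,
where the header layout and architecture names are invented:

```python
# Minimal sketch of a "fat" binary: the compiler emits one image per
# architecture, and exec selects the slice matching the host CPU.
# The header layout and arch names are hypothetical illustrations.

def make_fat(slices):
    # slices: {arch_name: code_bytes}, one entry per compiled flavor
    return {"magic": "FAT!", "slices": slices}

def exec_fat(fat, host_arch):
    if fat.get("magic") != "FAT!":
        raise ValueError("not a fat binary")
    try:
        return fat["slices"][host_arch]   # hand this image to the CPU
    except KeyError:
        raise ValueError("no slice for %s" % host_arch)

fat = make_fat({"80486": b"\x90\xc3", "i860": b"\xa0\x00\x00\x40"})
print(exec_fat(fat, "i860"))
```

The cost is roughly doubled binary size; the benefit is that one file
runs natively on whichever processor the scheduler hands it to.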


Andrew L. Duane (JOT-7)  w:(408)366-4935
Motorola Microcomputer Design Center	 decvax!cg-atla!samsung!duane
10700 N. De Anza Boulevard			  uunet/
Cupertino, CA   95014			 duane@samsung.com

Only my cat shares my opinions, and she's too heavy to care.

mac@gold.kpc.com (Mike McNamara) (05/29/91)

	I've worked on two different commercially unsuccessful
heterogeneous-processor machines.  I don't think that I am the common
thread of failure ;-)
	The first was the Cydrome Cydra 5, which had a 50/25 MFLOP ECL
supercomputer wedded to a symmetric six-CPU 68020 system running our
MP'ized System V R3.2.  This machine was introduced in 1987.

	The second was the Stardent Stiletto, which had two MIPS
R3000s and four Intel i860 processors.  Each R3000 had a tightly
coupled i860 which was connected as a vector processor, via our
implementation of the canonical MIPS write-buffer chip.  This is the
same model we used to build the Ardent Titan super graphics
workstations, although we implemented the vector unit with gate arrays
and floating-point cores.

	The other two i860s were on a separate graphics board, with
associated pixel processing power.

	One common problem with both machines was that the benchmark
results did not justify the cost of the machine.  This isn't a problem
with the design of the machine, per se: instead, it is just hard to
write a benchmark that can effectively use all the power available
on a heterogeneous machine:

	SPEC ran on Stiletto just as well as it would run on any
machine with two 33MHz R3000/R3010 cpus.  SPEC didn't "notice" that
3D graphics could go on concurrently with no performance loss.
[Mashey: add a gSPEC to the mix: you must rotate a teapot while
gcc'ing ;-]

	Another common problem with both machines was twice the
technology-risk exposure.  At Cydrome, the ECL supercomputer was
ready, and demonstrated at a trade show, one year before the general-
purpose 68020 system was ready.  At Ardent/Stardent, the dual-R3000
system was ready 10 months before all the bugs were ironed out of the
i860 chips and our software.

	So, yes, Virginia, SHATOT (twice), and did not succeed.

> Based on previous experience in this group, Somebody Has Already Thought
> Of This.   If it's true that SHATOT, then please share it with us.

	Of course that by no means indicates that it is not a viable
proposition.  However, there are bones littered on the side of the
trail...


	-mac  \___/^^\___/
	       ---|oo|---
		   ||
		   \/
--
+-----------+-----------------------------------------------------------------+
|mac@kpc.com| Increasing Software complexity lets us sell Mainframes as       |
|           | personal computers. Carry on, X windows/Postscript/emacs/CASE!! |
+-----------+-----------------------------------------------------------------+

martin@adpplz.UUCP (Martin Golding) (05/31/91)

In <MAC.91May29095345@gold.kpc.com> mac@gold.kpc.com (Mike McNamara) writes:


>	I've worked on two different commercially unsuccessful
>heterogeneous processor machines.  I don't think that I am the common
>thread of failure ;-)

[description of two machines canceled due to lack of interest]

>	One common problem with both machines is that the benchmark
>results did not justify the cost of the machine. 

AHA, I just realised that I have useful information to contribute here.
Here at ADP we built a _successful_ dual-processor system; it ran a
(disk-based) RT-11-like operating system and (virtual-memory-based)
Reality, simultaneously.  For about 4 years it was our most popular
single model.

Our machine sold because we had just oodles of software for _each_
processor, and the cost and risk of rewriting it was worse than the
engineering for the computer.  (Two systems types and a hardware engineer
for the computer, versus 200 programmers working 5 years for the
software.)  If the stuff had _all_ been in _one_ (preferably popular)
language, there wouldn't have been any point.

Moral: it's only worth building the dual-processor machine if you
_already_ have software that _can't otherwise_ be ported.  See the
interesting DOS and CP/M add-ins for all kinds of interesting computers.
Note also that binary converters and interpreters are gaining strength.

Drawback: interesting software is currently produced for DOS and Unix.
DOS machines are DOS machines, eh?  And Unix computers sell based on
how many interesting packages get ported.  So the demand for a RISC
computer with a 68xxx binary coprocessor is probably not worth the
engineering.  Besides, we get better stuff for our 88k's than for our
68k's; I think the software-porting types have a personal fondness
for RISC systems that biases the results.

All of this has strayed very wide of the question of long-term RISC vs.
CISC performance, which is heavily language dependent; unless someone
wants to make vectorising compilers that extract string functions
from C.


Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         |  He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp