[comp.arch] RISC v. CISC

cory@gloom.UUCP (Cory Kempf) (10/17/88)

A while back, I was really hot on the idea of RISC.  Then a friend 
pointed out a few things that set me straight...

First, there is no good reason that all of the cache and pipeline
enhancements cannot be put on to a CISC processor.

Second, to perform a complex task, a RISC chip will need more
instructions than a CISC chip.

Third, given the same level of technology for each (ie caches, pipelines,
etc), a microcode fetch is faster than a memory fetch.

As an aside, the 68030 can do a 32 bit multiply in about (If I remember 
correctly -- I don't have the book in front of me) 40 cycles.  A while
back, I tried to write a 32 bit multiply macro that would take less 
than the 40 or so that the '030 took.  I didn't even come close (even 
assuming lots of registers and a 32 bit word size (which the 6502 
doesn't have)).  



-- 
Cory Kempf
UUCP: {decvax, bu-cs}!encore!gloom!cory
revised reality... available at a dealer near you.

usenet@cps3xx.UUCP (Usenet file owner) (10/18/88)

In article <156@gloom.uucp>, Cory Kempf (decvax!encore!gloom!cory) writes:
>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

  This is definitely true.  Look at the caching on the 68030, or the
Z80,000 for instance.  The advantage a RISC gives you is more space
for caching logic, though--so you can have a bigger cache (or more
registers, or possibly both).

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

  Right!  Unless you add special hardware to help it with the most
common complex tasks, in which case you're heading right back to CISC.
Nick Tredennick gives an interesting characterization of RISC in his
paper from the IEEE CompCon '86 panel on RISC vs. CISC:

	Cut a MC68000 in half across the middle just below the control
	store.  Throw away the part with the instruction decoders,
	control store, state machine, clock phase generators, branch
	control, interrupt handler, and bus controller.  What you will
	have left is a RISC "microprocessor."  All the instructions
	execute in one cycle.  The design is greatly simplified.  The
	chip is smaller.  And the apparent performance is vastly
	improved. [stuff omitted] ...try to build a system using this
	wonderful new chip.  You have to rebuild on the card the parts
	you just cut off.  Good luck trying to service the microcode
	interface at the 'microprocessor' clock rate.

  I think this is a great characterization of a particular segment of
the debate: that which talks about chip complexity.  Now, instruction
set complexity is a bit different, and I'm not convinced one way or
the other on that yet (though I lean toward CISC).  The recent
discussion of "the 68030 is RISCier than the 68020" and "a RISC
compatible with the 68020" doesn't have anything to do with the
instruction set--just the chip design.  Maybe there's a better term
for it than RISC....
  Just my thoughts...

		Anton

Disclaimer: I'm into software, not hardware!

+----------------------------------+------------------------+
| Anton Rang (grad student)	   | "UNIX: Just Say No!"   |
| Michigan State University	   | rang@cpswh.cps.msu.edu |
+----------------------------------+------------------------+

tim@crackle.amd.com (Tim Olson) (10/18/88)

In article <156@gloom.UUCP> cory@gloom.UUCP (Cory Kempf) writes:
| A while back, I was really hot on the idea of RISC.  Then a friend 
| pointed out a few things that set me straight...

I guess we are going to have to reset you straight, again! ;-)

| First, there is no good reason that all of the cache and pipeline
| enhancements cannot be put on to a CISC processor.

If it is a microcoded processor, then the CISC machine will have to
perform this pipelining at both the microinstruction and
macroinstruction level, in order to be able to execute simple
instructions in a single cycle.  This costs more than if the
micro and macro levels were the same (RISC).

| Second, to perform a complex task, a RISC chip will need more
| instructions than a CISC chip.

This is true, although it is typically only 30% more from dynamic
measurements, not the "3 to 5 times" that some people report.

| Third, given the same level of technology for each (ie caches, pipelines,
| etc), a microcode fetch is faster than a memory fetch.

Also true.  However, this only buys you anything if most of your
instructions take multiple cycles.  Unfortunately (?), most programs use
simple instructions which should execute in a single cycle.  If a CISC
processor is to compete effectively, it must also be able to execute the
most-used instructions in a single cycle.  Therefore, it must also have
the off-chip instruction bandwidth or on-chip cache bandwidth that RISC
requires.  With this requirement, it doesn't matter that microcode may
be slightly faster than a cache access -- the cache is the limiting
factor.

| As an aside, the 68030 can do a 32 bit multiply in about (If I remember 
| correctly -- I don't have the book in front of me) 40 cycles.  A while
| back, I tried to write a 32 bit multiply macro that would take less 
| than the 40 or so that the '030 took.  I didn't even come close (even 
| assuming lots of registers and a 32 bit word size (which the 6502 
| doesn't have)).  

Most (if not all) RISCs address this by

	a) using existing floating-point multiply hardware (i.e. 32x32
	multiplier array) for integer multiply (1 - 4 cycles)

or
	b) having multiply sequencing or step operations that perform
	1-2 bits at a time (16 - 40 cycles)

so they are no slower than the current crop of CISC processors.  In
addition, if step operations are used, inexpensive "early-out"
calculations will allow the average multiply time to drop quite a bit
(because the distribution of runtime multiplies leans heavily towards
multipliers of 8 bits or less).
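
For the curious, here is a minimal C sketch of the step-multiply idea
with an early-out (assuming 32-bit unsigned longs); it is only an
illustration of the technique, not any particular chip's sequencer:

	/* One bit of the multiplier is retired per loop iteration; the
	   loop exits as soon as no multiplier bits remain, so small
	   multipliers finish in just a few steps ("early-out"). */
	unsigned long
	mulstep(unsigned long a, unsigned long b)
	{
		unsigned long p = 0;

		while (b != 0) {
			if (b & 1)
				p += a;
			a <<= 1;
			b >>= 1;
		}
		return p;
	}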

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

bcase@cup.portal.com (Brian bcase Case) (10/20/88)

>A while back, I was really hot on the idea of RISC.  Then a friend 
>pointed out a few things that set me straight...
>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

True for the simple instructions in the CISC instruction set.  Not so
true for the ones with complex addressing modes, etc.  Fixed instruction
size, format, and prevention of page-boundary crossings are very good
things to do.  This limits the CISCyness of an instruction set, or the
instructions will need to be very long, or they will need to be two-address
instead of three-address, or worse, one-address, or....

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

This is simply an exaggeration.  Yeah, maybe 1.2 to 1.5 times as many,
but this is usually not a big deal.  If it is (it might be for some),
then CISC away.

>Third, given the same level of technology for each (ie caches, pipelines,
>etc), a microcode fetch is faster than a memory fetch.

But not much faster than a cache fetch.  And the cache will have the
"macros" that the program actually uses, not the ones that the instruction
set designers assumed the application would use.  The problem is that it
is presumptuous to think that you know exactly how the procedure linkage,
run-time addressing model, etc. is going to be implemented by the language
and operating system designers.  Once it's in uCode, it's there for a long
time.  And if the microcode routines are longer than one instruction, you
no longer have single cycle instructions.  But this is a complex issue.

>As an aside, the 68030 can do a 32 bit multiply in about (I don't have
>the book in front of me) 40 cycles. I tried to write a 32 bit multiply
>macro that would take less than the 40 or so that the '030 took.  I
>didn't even come close (even assuming lots of registers and a 32 bit word
>size (which the 6502 doesn't have)).  

First of all, the 6502 is simple, but it is very far from a RISC.  Maybe
you mistyped.  Second, if multiply is important, which it typically isn't
in systems code, implement it combinatorially, in parallel.  Third, you
probably failed to check for the reduced cases in your macro; by checking
for small operands, etc. you can get the cycle count down.  And multiplies
by small constants (the most frequent case in system codes) can be done in
very few cycles using shifts, adds, and subtracts.
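
To make the small-constant case concrete, here are a couple of
illustrative C macros (the names and constants are just examples)
showing the kind of shift/add/subtract strength reduction a compiler
or a hand-written macro might do:

	#define MUL10(x)	(((x) << 3) + ((x) << 1))	/* 8x + 2x = 10x */
	#define MUL7(x)		(((x) << 3) - (x))		/* 8x -  x =  7x */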

Disclaimer:  Everyone is entitled to an opinion.

bcase@cup.portal.com (Brian bcase Case) (10/20/88)

>>Second, to perform a complex task, a RISC chip will need more
>>instructions than a CISC chip.

For most purposes the difference is not important, maybe 20% more with
the top at 40%.  But the working set is the important issue where caches
are concerned.  Is the RISC's cache working set bigger than the CISC's?
Maybe, I don't know for sure.  I wrote a fairly big program, 40K lines
(an incrementally-compiling simulator for the 68000, fun!) last year.
The code size was 15% bigger on the 29K than on the Vax. Admittedly, the
29K compiler is much better than the Vax's.  But does anyone ever ask
*why* it is much better?

Slightly paraphrased Nick Tredennick:
>	Cut a MC68000 in half; Throw away the instruction decoders,
>	control store, state machine, clock phase generators, branch
>	control, interrupt handler, and bus controller.  What you will
>	have left is a RISC "microprocessor."  All the instructions

Not true.  RISCs do not throw away the bus controller, the interrupt
handler, the instruction decoder, branch control, etc. etc.  He is just
griping because it is so small on the chip that it looks like it has
been thrown out.  :-) :-)
>	execute in one cycle.  The design is greatly simplified.  The
>	chip is smaller.  And the apparent performance is vastly
>	improved. [stuff omitted] ...try to build a system using this
>	wonderful new chip.  You have to rebuild on the card the parts
>	you just cut off.  Good luck trying to service the microcode
>	interface at the 'microprocessor' clock rate.
There are existence proofs:  several systems are doing it.  What more 
could he want?

>The recent discussion of "the 68030 is RISCier than the 68020" and "a RISC
>compatible with the 68020" doesn't have anything to do with the
>instruction set--just the chip design.  Maybe there's a better term
>for it than RISC....

The 68030 core is *exactly the same* (maybe a Moto guy can comment?) as
the 68020's core.  They shrunk it and added the data cache.  The bus
controller now supports 4-word bursts.  The cache line size changed.

What has been left out of this discussion is the software side of the
issue.  The almighty Compiler can save us from our sins!  It is our
saviour!  Long live common subexpression elimination!  Hail to the code
reorganizer!  Praise the register allocator!  Jim Bakker, watch out!

aglew@urbsdc.Urbana.Gould.COM (10/20/88)

>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

Space.

It's less of a reason now, which is why the phase of RISC may pass.

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

There aren't many complex tasks. Code size inflation is usually due to
lack of memory-to-register ops, not sophisticated instructions.

>Third, given the same level of technology for each (ie caches, pipelines,
>etc), a microcode fetch is faster than a memory fetch.

I used to work for a company where, for straight line code, a microfetch
was the same speed as the memory fetch. Plus, access to main memory from
microcode was 2 to 4 times *more* expensive than access to memory from
an instruction (dedicated hardware for instruction memory accesses,
hiding most load/stores).

>As an aside, the 68030 can do a 32 bit multiply in about (If I remember 
>correctly -- I don't have the book in front of me) 40 cycles.  A while
>back, I tried to write a 32 bit multiply macro that would take less 
>than the 40 or so that the '030 took.  I didn't even come close (even 
>assuming lots of registers and a 32 bit word size (which the 6502 
>doesn't have)).  

There do exist RISCs with multiply instructions. In fact, real 
multiplies, with full multiplier arrays taking lots of space that
might otherwise have had to be used for microcode.

>Cory Kempf

Andy Glew

grzm@zyx.SE (Gunnar Blomberg) (10/20/88)

In article <156@gloom.UUCP> cory@gloom.UUCP (Cory Kempf) writes:
>[...]
>
>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.
>
>[...]

   Is this really the widely accepted truth?  It seems to me that a typical
well-designed RISC chip should actually need *fewer* instructions
(statically and dynamically) to perform most tasks than your typical CISC
chip, for the following reasons:

   * The RISC chip has more registers.

   * The RISC chip has a more orthogonal instruction set.

   * The RISC chip has three-operand instructions.

   I am assuming something like a 680x0 or an 80386 as the CISC here, ie
something that suffers heavily from non-orthogonality and lack of
registers.  A memory-to-memory CISC with a really orthogonal instruction
set is quite a different animal.

   What this boils down to is that a well-designed orthogonal instruction
set should give fewer instructions for most tasks than your typical Complex
Instruction Set, even taking into account all the strange instructions for
function calls and other things.  I would *much* rather program an HP-PA
RISC than any CISC I have ever seen (with the possible exception of the
PDP-10), and the same is true for the SPARC chip, though less emphatically
so.  Thank heaven chip designers (finally) realized the value of a clean,
orthogonal instruction set!

   On the other hand, since most RISC encodings use a fixed instruction
size, the program will probably be bigger.  Maybe this is what is meant
above?

-- 
Disguised as a Dutch mathematician, | Gunnar Blomberg
Brow [the alien] had advanced the   | ZYX Sweden AB, Bangardsg 13,
destructive mathematical philosophy | S-753 20  Uppsala, Sweden
called intuitionism --Rudy Rucker  | email: grzm@zyx.SE

elg@killer.DALLAS.TX.US (Eric Green) (10/21/88)

>>A while back, I was really hot on the idea of RISC.  Then a friend 
>>pointed out a few things that set me straight...
>>First, there is no good reason that all of the cache and pipeline
>>enhancements cannot be put on to a CISC processor.
> 
> True for the simple instructions in the CISC instruction set.  Not so
> true for the ones with complex addressing modes, etc.  Fixed instruction
> size, format, and prevention of page-boundary crossings are very good
> things to do.  This limits the CISCyness of an instruction set, or the

For example, the high-end Vaxen have a pipelined MICROARCHITECTURE. It
is almost impossible to effectively pipeline the macroarchitecture of
a Vax, because of the multitude of instruction set formats (almost as
bad as the 680x0). 

What this seems to mean is that the difference between CISC and RISC
lies more in instruction format than in number of instructions (as
someone else on this list pointed out). I suspect that you could have
a "CISC" just as fast as a "RISC" IF the instruction format is fairly
regular (i.e., no "expanding opcode" hyper-compressed formats need
apply). True, a compact opcode takes less memory space. BUT, it has to
be UNcompacted before it's used...

Someone else on this newsgroup mentioned that Seymour Cray's secret to
a fast computer was to put as few gates as possible in critical paths.
Anybody got a reference for where Cray said that? In any event, on too
many CISCs, instruction decode is that critical path...

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
It's understandable that Mike Dukakis thinks he can walk on water.
He's used to walking on Boston harbor.

chris@mimsy.UUCP (Chris Torek) (10/22/88)

In article <5863@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green)
writes:
>For example, the high-end Vaxen have a pipelined MICROARCHITECTURE. It
>is almost impossible to effectively pipeline the macroarchitecture of
>a Vax, because of the multitude of instruction set formats (almost as
>bad as the 680x0). 

While the 680x0 have a number of formats (and thus lengths), one of the
nice properties of its instruction set is that the first word tells you
the length of the entire instruction.  This is not true of the Vax
instruction set: on the Vax, the first byte is simply the opcode, and you
must read all of the operand bytes to discover the location of the next
instruction.  In other words, you must (almost) fully decode the current
instruction before you can begin decoding the next.
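
A rough C sketch of the difference, using hypothetical helper names
(fetch8, fetch16, length_from_first_word, operand_count, and
specifier_length are invented here, just to show the shape of the two
decode loops):

	extern unsigned fetch8(unsigned long pc);
	extern unsigned fetch16(unsigned long pc);
	extern unsigned length_from_first_word(unsigned w);
	extern unsigned operand_count(unsigned opcode);
	extern unsigned specifier_length(unsigned long addr);

	/* 68000-style: the first word alone gives the instruction length. */
	unsigned long
	next_68k(unsigned long pc)
	{
		return pc + length_from_first_word(fetch16(pc));
	}

	/* VAX-style: the opcode byte only says how many operands follow;
	   each operand specifier must be examined to learn its own size
	   before the next instruction's address is known. */
	unsigned long
	next_vax(unsigned long pc)
	{
		unsigned opcode = fetch8(pc);
		unsigned long next = pc + 1;		/* one opcode byte */
		unsigned i;

		for (i = 0; i < operand_count(opcode); i++)
			next += specifier_length(next);	/* specifier may be several bytes */
		return next;
	}
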
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

robert@pvab.UUCP (Robert Claeson) (10/22/88)

In article <310@lynx.zyx.SE>, grzm@zyx.SE (Gunnar Blomberg) writes:

> It seems to me that a typical
> well-designed RISC chip should actually need *fewer* instructions
> (statically and dynamically) to perform most tasks than your typical CISC
> chip, for the following reasons:
> 
>    * The RISC chip has more registers.

The more registers, the more to save at every context switch in a typical
OS (such as UNIX). Which will slow things down if you have many processes
running.
-- 
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
Tel: +46 758-202 50   Fax: +46 758-197 20
Email: robert@pvab.se (soon rclaeson@erbe.se)

bcase@cup.portal.com (Brian bcase Case) (10/25/88)

>The more registers, the more to save at every context switch in a typical
>OS (such as UNIX). Which will slow things down if you have many processes
>running.
>-- 
>Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden

What data do you have to substantiate this claim?  This is another popular
misconception, I think.

I used to work at Pyramid Technology.  They make a machine with 512 32-bit
words of register windows (16, 16, 16 organization).  When we were porting
UNIX to the machine, we wondered what the register file save/restore on
context switch was costing.  Turns out that the average save spent 40 usec
saving registers.  The other 200 usec (this is a guess, I can't remember
the total context switch time) were taken by the implementation of the
context-switch *mechanism* inherent in UNIX.  And this is on top of the
fact that some of the critical loops, e.g., the one that decides what
process is next to run, were hand coded.  Pyramid's UNIX has (at least
had) an incredibly fast response.  Customers noticed that this was so.

Two things:  if the memory system and save/restore mechanism are designed
with some care, they can go fast, and, except for real-time systems, where
save/restore is indeed a critical factor, context-switch time is dominated
by everything but the register save/restore.

At 200 context-switches per second (an unusually high maximum on a machine like
the 780), saving 128 registers on every switch takes, as a percentage
of total available processor time:

    ((200/sec) * (128 registers) * (2 cycles/register)) / (25 Mega cycles/sec)

which is 0.20 percent of the total available CPU time.  I don't think
this is significant.  For some implementations, it is more like 1 cycle
per register saved.  The other side of the equation is register restore:
but on machines with register windows (or work-alikes), only a small
number of registers, say 32, need to be restored (since any others will
be faulted in).  Thus, the total might be 0.30 percent.  It is even less
on machines with flat files of 32 registers, e.g. MIPS.  By speeding
up the machine on general code, the registers more than make up for this
cost.
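
Just to double-check the arithmetic above, the same assumed numbers
(200 switches/sec, 128 registers, 2 cycles/register, 25 MHz) dropped
into a throwaway C program:

	#include <stdio.h>

	int
	main()
	{
		double save_cycles = 200.0 * 128.0 * 2.0;	/* cycles/sec spent saving */
		double total_cycles = 25.0e6;			/* cycles/sec available */

		printf("%.2f%%\n", 100.0 * save_cycles / total_cycles);
		return 0;	/* prints 0.20% */
	}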

matloff@bizet.Berkeley.EDU (Norman Matloff) (10/25/88)

In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes:
>In article <310@lynx.zyx.SE>, grzm@zyx.SE (Gunnar Blomberg) writes:

*> It seems to me that a typical
*> well-designed RISC chip should actually need *fewer* instructions
*> (statically and dynamically) to perform most tasks than your typical CISC
*> chip, for the following reasons:
*> 
*>    * The RISC chip has more registers.
*
*The more registers, the more to save at every context switch in a typical
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*OS (such as UNIX). Which will slow things down if you have many processes,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*running.
^^^^^^^^

Based on parameters of Berkeley RISC I or II, the register-saving
might take on the order of 0.1 msec.  If the quantum size is set to
be in the range claimed to be typical in the Peterson and Silberschatz
OS book, i.e. 10 to 100 msec, then we see that the register-saving
issue for a RISC with lots of registers has probably been greatly
overemphasized.

Comments?

   Norm Matloff

mash@mips.COM (John Mashey) (10/25/88)

ARGH! I'm away for a week and comp.arch goes crazy! :-)
Rather than try to post multiple replies to the hordes of RISC-CISC stuff,
I've glommed them together:

>Article 6913 of comp.arch:
>From: cory@gloom.UUCP (Cory Kempf)

>A while back, I was really hot on the idea of RISC.  Then a friend 
>pointed out a few things that set me straight...

[1] At least one of the things is pretty misleading:

>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

Cache and pipeline enhancements help CISCs also;  people are always
working on making CISCs go faster by better (deeper) pipelining and caching.
(The literature has plenty of examples of the efforts of CISC implementors
to make their existing architectures go faster.)

However, there are some FUNDAMENTAL ways that most popular CISC architectures
differ from the higher-performance RISCs.  Here are a few, and why they
matter:
			CISCs			RISCs
EFFICIENT (DEEP)	possible, but		designed for this
PIPELINE		expensive in hardware
			and/or design time

	example:	variable-size instrs,	32-bit instrs
			sequential decode (VAX)	
			complex side-effects	at most simple side-effects
	example:	conditional branches	delayed-branches
			(tricky, much hardware)

SEPARATE I&D cache	maybe, but sometimes	usually, and don't support
			must support old code	store-into-instr stream w/o
			that does store-into-	explicit info
			instr-stream (this is
			a royal pain, since you pay hardware in the fastest
			part of the machine for something that seldom happens.
			(Yes, I was bad too: a popular S/360 program I wrote
			almost 20 years ago used this "feature".  sigh.)

	examples:	comparators in Amdahls watching for I-stream stores

ADDRESSING MODES	can be very complex,	usually just load/store with
			including side-effects	at most indexing & auto- +/-
			and page-crossings	with no page-crossings
	examples:	VAX; new modes in 68020
			Note that complex addressing modes can interact horribly
			with deep pipelining, because the very things you want
			to do to make it go fast add complexity and/or state
			in the fastest parts of the machines.

DESIGNED FOR		maybe, maybe not	yes
OPTIMIZERS
	examples:	registers either	32 or more GP registers
			insufficient, or	available for allocation
			split up in odd	
			ways.
	example:	When you count general-purpose regs available for
			general allocation, a 386 gives you about 5-6,
			I think, compared to maybe 26-28 on an R3000,
			SPARC, HP PA, etc.  No amount of caching and
			pipelining makes 5 look like 26 to an optimizer.
			(This is not to say a good optimizer won't HELP,
			and in fact, Prime bought our compilers because it
			will help them; it just doesn't help as much.)

EXPOSED PIPELINE	usually not		usually some
	example:	It helps to reorganize code on CISCs (like S/360s)
			to cover load latencies, and spread settings of
			condition codes apart from the branch-conditions
			(on some models), but RISCs usually cater to these.
			Note that machines with complex address-modes built
			into the instructions are hard to do this with,
			i.e., the compilers can't easily split instructions with
			memory-indirect loads, for example, to get a smoother
			pipeline flow.

EXCEPTION-HANDLING	can get complex		relatively simpler
	example:	Exception-handling in heavily-pipelined CISCs
			not designed for that can either get very tricky,
			take a while to design and get right, or burn
			a lot of hardware, or all 3.

Note that hardware complexity is especially an issue in VLSI:
it is relatively easy to get dense regular structures on a given
amount of silicon (registers, MMUs, caches), but complex logic burns
it up fast, and routing can get tricky.

These are a few of the salient areas that illustrate a common
principle: there's hardly anything [except perhaps cleanly
increasing the simultaneously-available registers] that you
can do in a RISC that you can't also do in a CISC.
HOWEVER, IT MAY TAKE YOU SO LONG TO GET IT RIGHT, OR COST YOU SO MUCH,
THAT IT DOESN'T MAKE COMPETITIVE SENSE TO DO IT!!!!
More than one large, competent computer company has discovered this
fact, which is why you often see multiprocessors being popular at
certain times, i.e., because it's easier to gang them together than
it is to make them go faster.
The problems often show up in 2 places:
	bus interface (including MMU)
	exception-handling
I'm sure any OS person out there who's dealt with early samples of
32-bit micros still has nightmares over some of these [How about
some postings on the chip bugs you remember worst!  I'll start with one:
UNIX always seems to find these $@!% things, which somehow have slipped
thru diags. Our 1973 PDP 11/45 had a bug which was only seen on UNIX,
because it used the MMU differently than DEC did, and the C compiler
often used some side-effectful addressing mode that DEC didn't often:
as I recall, if you accessed the stack with a particular sequence,
and a page boundary got crossed, and a trap resulted, something bad
happened.]

Making CISCs go faster is an interesting and worthy art in its own
right, and is certainly a good idea for anybody with a serious
installed base.  However, it does get hard: one of the architects
of a popular CISC system once told me that making it go much faster
(other than with circuit speedups) seemed beyond human comprehensibility
to do in a reasonable timeframe.

>Article 6914 of comp.arch:
>Subject: Re: RISC v. CISC
>Reply-To: rang@cpswh.cps.msu.edu (Anton Rang)

>In article <156@gloom.uucp>, Cory Kempf (decvax!encore!gloom!cory) writes:
>>First, there is no good reason that all of the cache and pipeline....

>  This is definitely true.  Look at the caching on the 68030, or the
>Z80,000 for instance.  The advantage a RISC gives you is more space
>for caching logic, though--so you can have a bigger cache (or more
>registers, or possibly both).

[2] Again, there is no good reason not to use caches, but there are good
reasons why deeper CISC pipelines sometimes get very expensive.

Re: Z80,000: is that a real chip?  [Real = actually shipping to people
in at least large sample quantities; would be nice to see UNIX running, etc].
Note: you can find magazine articles describing it in detail, as though
it were imminently available....the problem is, some of those articles
are now 4 years old...If it doesn't really exist as a product, how can
it be cited as an example to prove anything? (If it is really out there
in use, please post some more to that effect and this comment will go away.)

>Article 6918 of comp.arch:
>From: baum@Apple.COM (Allen J. Baum)
>Subject: Re: RISC v. CISC --more misconceptions

[3] (Allen properly replies to many of the original misconceptions,
omitting only the discussion in [1] above on difficulty of deep pipelining
on some CISCs.)

>Article 6920 of comp.arch:
>From: sbw@naucse.UUCP (Steve Wampler)
>Subject: CISCy RISC? RISCy CISC?

>Just what is it about RISC vs. CISC that really sets them apart?
>... Other than that, I doubt I would care
>whether my machine is RISC or CISC, if I can even tell them apart.

[4] ABSOLUTELY RIGHT!  Most people should care less whether it's RISC
or CISC, just whether it does the job needed, goes fast, and is cheap.

>A case in point.  I know of a not-yet-announced machine (perhaps
>never to be announced machine) that has just about the largest
>instruction set I can imagine (not to mention the 15+ addressing
>modes)....
>The result is a 12.5MHz machine that runs 25000 (claimed)
>dhrystones using what I would call a 'throwaway' C compiler....

As you note, not-yet-announced.  On the other hand, MIPS R3000s
do 42K Dhrystones, and they're already in real machines, and vendors
are quoting the CPUs at $10/mip, i.e., $200 for 25MHz parts.

>Now, I've missed most of the RISC/CISC wars, but these seem to
>me to be very fine numbers, at least compared with the
>uVAXen I've played with (all of which cost more).
But uVAXen are real...
>How do they compare to current RISCs?  I'd bet pretty much the same.
>I personally couldn't care which machine I'd own (not that I can
>afford any).  When the really fast chips come in, I bet the RISC
>machines are the first to come out, but still, is there something
>that will keep CISC from catching up?
See the discussion in [1] above.  Also, note, in a time when the
design cycle is 12-18 months, and people double performance in that
period, being that far behind means a factor of 2X in performance....

>Article 6936 of comp.arch:
>From: daveh@cbmvax.UUCP (Dave Haynie)
>>>It seems that the NeXT machine may have a few problems:

>>>1) Outdated Processor Technology: NeXT just missed the wave of fast RISC 
>>>   processors.  The 5 MIPS 68030 is completely out performed by the currently
>>>   available RISC chips (Motorola, MIPS, Sparc) that run at approximately
>>>   20 VAX (they claim) MIPS.  In a year or two, ECL versions of some of these
>>>   RISC chips will be running at 40 to 50 MIPS.

>Priced the ~8 MIPS Sun 4 lately?  Or the ~14 MIPS 88K chipset.  How about
>an Apollo 10K?  RISC machines are starting to get fast, and they're even
>starting to get down in price, but these two directions haven't met yet.

[5] Actually, this is the wrong reason: you can put together MIPS chipsets
at similar (or even slightly better) cost/performance levels (have you
priced a 68882 lately, for example?)  However, be fair to Jobs & co:
when they started, none of the RISC chips was generally available;
some of them [88K] are not yet generally available in volume.  Try drawing
a timeline sometime of a) when you get first specs on a chip, b) when
you can design it in c) when you can make enough to get the software act
together d) when you can actually ship in volume.  IT TAKES A WHILE!
(I've commented earlier on ECL desktop hovercrafts.)

Also, betting on a new architecture at the beginning of a cycle [i.e.,
in the Z8000/68K/X86/X32 etc wars in the early 80s, and the current
BRAWL (Big RISC Architecture War & Lunacy)] is very exciting, and probably
not something a startup should do.  Consider, choosing an architecture
is like an odd form of Russian Roulette: you pick a chip and pull the
trigger, then wait a year or two to see if you've blown your brains out.
(An awful lot of workstation startups picked wrong the last time, and
they're gone, for example.)
Fortunately, the BRAWL will be over before the end of the year,
which will make life saner.

>Since the VAST majority of Suns sold to universities are Sun 3s (68020 based)
>and below (believe it or not, folks STILL use Sun 2s here and there), I don't
>think a 68030 based system, even NeXT's, which isn't an especially fast 68030
>system (they're running its memory at about 1/2 the possible speed), will have
>any trouble competing with the installed 68020 systems.  Or a $25,000-$50,000 

>RISC based workstation.
I still think there's nothing wrong with NeXT using a 68030; there will however
be both SPARC and MIPS-based workstations a lot cheaper than $25-50K,
in volume, by the time the NeXT boxes are out in volume.

>Article 6964 of comp.arch:
>From: wkk@wayback.UUCP (W.Kapalow)
>Subject: RISC realities
>
>I have used, programmed, and evaluated most of the current crop of
>RISC chipsets.....

[6]....some reasonable analysis, from somebody with fewer axes to grind
than most of us, thank goodness!

>Chips like the Amd29000 are trying to make things better by having
>an onboard branch-target cache and blockmode instruction fetches.  Try
>getting 1-2 cycles/instruction with a R2000 with dynamic memory and no
>cache, the 29000 does much better.

Yep, although R3000s with some of the new cache-chip variants
will get to be an interesting fight here, i.e., since the R2000/R3000
has all of the cache control on-chip, and there are new cheap, small
FIFO parts that eliminate the write buffers.

>....  Look at the AT&T CRISP processor,  ....
Worth doing: some interesting ideas, regardless of commercial issues.

>Article 6968 of comp.arch:
>From: peter@ficc.uu.net (Peter da Silva)
>Subject: RISC/CISC and the wheel of life.

>I have noticed one very interesting thing about RISCs lately... they are
>getting quite sophisticated instruction sets. 3-address operations and
>addressing modes aren't what I used to associate with RISC, but if you look
>at them they turn out to be refinements of older RISCs.

[7]  This is very confusing.  Most RISCs use 3-address operations, i.e.,
	reg3 = reg1 OP reg2.
			rather than just 2-address ops:
	reg1 = reg1 OP reg2

Certainly, these include, but are not limited to: IBM 801, HP PA,
MIPS R2000, SPARC, 29K, 88K.

>What's happening, of course, is that the chips are so much faster than any
>sort of affordable RAM that it's worthwhile to put more into the instructions.
>The speed of the system as a whole goes up, since the chip can still handle
>all three register references in one external clock. No point in fetching
>instructions any faster than that...

I think this obfuscates the issue.  Any reasonable design has a register
file that has at least 2 read-ports and 1 write-port, i.e., can do
2 register reads and 1 write per cycle.  BOTH 3-address and 2-address
forms need to do those 2 reads & 1 write; the only difference is that
the 2-address form allows a denser instruction encoding, but the
base hardware is rather similar.
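
Written out as C, with variables standing in for registers (an
illustration only, not any particular instruction set), the encoding
difference looks like this:

	int r1, r2, r3;			/* stand-ins for registers */

	void
	three_address(void)
	{
		r3 = r1 + r2;		/* one operation */
	}

	void
	two_address(void)
	{
		r3 = r1;		/* extra move when r1 must survive */
		r3 += r2;		/* destination doubles as a source */
	}

Either way, each operation still reads two registers and writes one.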

>Article 6970 of comp.arch:
>From: guy@auspex.UUCP (Guy Harris)
>Subject: Re: The NeXT Problem

[8]....Guy gives some reasonable comments....

>>Not trying to start a flame war, but 030's are faster than Sun 4's.

>To which '030 machine, and to which Sun-4, are you referring?  At the
>time the Sun-4/260 came out, no available '030 machine was faster
>because there weren't any '030 machines.....
>Also, you might compare '030s against MIPS-based machines; are they
>faster than them, as well?
No.

>>I puke trying to write assembly on RISC machines.

>Fortunately, I rarely had to do so, and these days fewer and fewer
>people writing applications have to do so.
>These days, "ease of writing assembly code" is less and less of a
>figure of merit for an instruction set.

100% agree; however, most people who've used our RISCs think they're
easier to deal with in assembler anyway, although they observe there's
less opportunity for writing unbearably obscure/clever code....

>Article 6974 of comp.arch:
>From: daveh@cbmvax.UUCP (Dave Haynie)
>Subject: Re: "Compatible" (was Re: The NeXT Problem)

>> In article <5941@winchester.mips.COM> John Mashey writes:
>>> This defies all logic.
>>> a) If it's compatible with an 030, it's not a RISC.
>
>> I agree with John, completely.
>
>For an example of an architecture that's 68000 compatible and RISCy to
>the point of executing most instructions in a single clock cycle, look
>no farther than the Edge computer.  However, if you want this on a 
>single chip, instead of a bunch of gate arrays, you'll have to wait.

This gets back to the point in [1]: you can throw an immense pile of
hardware and design time at an architecture to make it go faster,
but that doesn't make it a RISC architecture.  Maybe it makes it
a RISCier, or less CISCy, implementation [which is
what I meant when I said the 030 was RISCier than 020, which caused
a lot of confusion.  sorry.]  Another example is the way that
the MicroVAX chipset is a RISCier implementation of a VAX (and this
is more true than the Edge example, i.e., the MicroVAX gets by with
less hardware by moving some of the less frequent ops to software.)

>> the MC680X0's instruction set would NOT be a RISC instruction set.
>....  Consider that most
>of the RISCy CPUs on the market have been done as little baby chips,
>by ASIC houses (SPARC, MIPS).

Wrong.  the first SPARCs were gate arrays, but the
Cypress SPARCs are coming.  MIPS chips have NEVER been done
in ASICs, although LSI Logic is working on ASIC cores of them.
In our case, the CPU+MMU is about 100K transistors, which is
NOT as large as a 386 or 030, but not a "little baby chip" either.
AMD 29Ks are definitely not little baby chips either, and they're real, too.

>Article 6975 of comp.arch:
>From: daveh@cbmvax.UUCP (Dave Haynie)
>Subject: Re: The NeXT Problem

(AMD 29K prices from Tim Olson).
>> 	16MHz	$174
>> 	20MHz	$230
>> 	25MHz	$349
>
>> I'm sure that LSI Logic could also show you very low prices on their
>> RISC chips.  Last I heard, the 68030 was in the $300+ price range.

>A lot of it depends on quantity.  I'm sure NeXT and Apple are buying their
>68030s more than 100 at a time.  Many of the ASIC houses making RISCs are
>output limited.  And with most of the RISC designs, once you pay the 
>additional cost of caches and MMUs, you're way out of the 68030 league,
>cost wise.  Complete systems I've seen with both MIPS and 88k put you
>at around $1000 for the CPU subsystem.

All of these depend on quantity, and what it is you're trying to build.
Admittedly, it's hard for us to build anything less than about 6 VUPs.
I suspect you can build a CPU (+ FPU) subsystem like that for around $500,
given large quantities, maybe $400-$500 as the new cache chips come out.

>Article 6977 of comp.arch:
>From: jsp@b.gp.cs.cmu.edu (John Pieper)

>Actually, I heard a guy from Motorola talking about their n+1st generation
>680X0 machine -- they run an internal clock at 2X the external clock, and
>play some other tricks to get 14 MIPS effective, 25 MIPS max @ 25 MHz. Seems
>to me that CISC designers could do this very effectively to get ahead of the
>RISC types (modulo the design time).

[10] But remember that existing RISCs, shipping now, get 20 MIPS @ 25 MHz,
so it's hard to see how that's getting CISCs ahead. [It still is perfectly
reasonable to do, i.e., a 68040.  Plenty will get sold.]

>BTW, as far as design time goes, you have to take the RISC argument with a
>grain of salt. The 68030 is only a little different than the 68020, but with
>technology advances and just a few man-years they more than tripled the
>speed of the initial 68020 release (in 82?). The 68040 will take the same
>basic ALU design, and add the FPU. This shouldn't require too much redesign.
>The point is that a good CISC design can be modified (added to) as quickly
>as a major redesign of a RISC chip. What really counts is who can sell their
>instruction set.

Starting from scratch in 1984, and getting the first systems in mid-1986,
the high-performance VLSI RISC  [i.e., MIPS as example] is:
	1986	5 MIPS
	1987	10 MIPS
	1988	20 MIPS

But the last comment is really right: what really counts is who sells
the instruction set.  That's why the battle is pretty ferocious over
who gets to be the RISC standard (or standards), because everybody
knows there can only be a few, at most.

>Article 6987 of comp.arch:
>From: rsexton@uceng.UC.EDU (robert sexton)
>Subject: Re: The NeXT Problem

>While RISC may be cheaper(smaller design, less silicon) what you are really
>doing is shifting the cost burden onto the rest of the system.  The high
>memory bandwidth of the RISC design means more high speed memory, bigger
>high-speed caches.  With a CISC design, you put all of the high speed silicon
>on one chip, lowering the cost of all the support circuitry and memory.

[11] This is not a reasonable conclusion.  You can put caches on-chip
in either case.  A fast machine, in either case, will need a lot of
memory bandwidth: observe, for example, that the data-bandwidth should
be about the same for both.  Finally, note that people are generally
adding external caches to X86s and 68Ks to push the performance up,
for all the same reasons as RISCs. 

>Article 7005 of comp.arch:
>From: phil@diablo.amd.com (Phil Ngai)
>Subject: Re: RISC realities

[12]....reasonable discussion about burst mode I-fetchs, VRAMS, etc.

>I don't think the R2000 or the Mc88000 support this, but that's not
>an inherent limitation of RISC architectures.

Nope, we don't do this, or at least not exactly.  R3000s support
"instruction-streaming", whereby when you have an I-cache miss,
you do multi-word refill into the cache, but you execute the relevant
instructions, as they go by.  Typical designs use page-mode DRAM access.
Note, of course, that in the next rounds of design across the industry,
where almost everybody goes for on-chip I-cache with burst-mode refill
(I.e., 486; >= 68030, etc), the distinction disappears.

>Article 7013 of comp.arch:
>From: malcolm@Apple.COM (Malcolm Slaney)
>Subject: Re: CISCy RISC? RISCy CISC?
>P.S.  An interesting question is whether Symbolics/TI/LMI will fail because 
>the market is too small to support a processor designed for Lisp and GC or
>because CISC's are a mistake.

[13] The evidence so far is that neither reason is the most likely reason for
potential failure.  The more general reason is that special-purpose
processors that don't get real serious volume get hurt sooner or later,
for one of several reasons:
	a) A more general part ends up getting more volume, which keeps
		costs down.
	b) It's hard to stay on the technology curve without the volume.

>Article 7033 of comp.arch:
>From: eric@snark.UUCP (Eric S. Raymond)
>Subject: Re: RISC/CISC and the wheel of life.

>My understanding of RISC philosophy suggests that 3-address ops and fancy
>addressing modes are only regarded as *symptoms* of the RISC problem -- poor
>match of instructions to compiler code generator capabilities, excessive
>microcode-interpretation overhead in both cycles and chip real estate.
>
>If your compiler can make effective use of three-address instructions, and
>you've got CAD tools smart enough to gen logic for them onto an acceptably
>small % of the chip area (so that you don't have to give up on more important
>features like a big windowed register file and on-chip MMU), then I don't see
>any problem with calling the result a RISC.

[14] As noted in [7] above, 3-address instructions are NATURAL matches
to typical register-file designs; people shouldn't be assuming that
there is some big cost to having them (in terms of logic complexity).

>Article 7040 of comp.arch:
>From: doug@edge.UUCP (Doug Pardee)
>Subject: Re: CISCy RISC? RISCy CISC?
>Organization: Edge Computer Corporation, Scottsdale, AZ

>The incorrect assumption here is that you would want to build a mainframe
>using RISC technology -- that RISC technology has anything to offer at
>that price/cost level.
Well, M/2000s act like 5860s, and we think next year's M/xxxx will
make 5990s sweat some.  Why wouldn't we want to build RISC-based mainframes?
Lots of people do.

>As we at Edgcore have shown, it is both possible and practical to implement
>CISC instruction sets at speeds faster than RISC.  But -- it doesn't all fit
>on one chip.  Yet.

Could you cite some benchmarks for the newest machines?  [I don't believe
that the current production ones are faster than 25MHz R3000s, but I could
be convinced.]

>In a mainframe design, who cares if it fits on one chip?  Jeez, in our E2000
>system we need an entire triple-high VME card jam-packed with surface-mount
>parts just to hold the *caches* that we need to have to keep from starving
>the CPU.  The complexity and board area of the CPU itself is insignificant
>compared to that required by mainframe-sized multi-level memory systems.

I sort-of agree, in the sense that if you're building a physically
large/expensive box anyway, then the CPU is a small piece of the action.
On the other hand:
People who want to put mainframe (CPU performance) on desktop/deskside
systems care; weirdly enough, a whole lot of people expect to do this.

How big are the caches?  It does surprise me they're a whole big VME card,
unless they're absolutely immense.  We get 20-VUPS performance with 128K
cache, which fits with the CPU+FPU+write buffers on about a 6" x 6" square.

>Article 7041 of comp.arch:
>From: pardo@june.cs.washington.edu (David Keppel)
>Subject: Re: LISPMs not RISC? - Re: CISCy RISC? RISCy CISC?

>Oh, heck, there's some (relatively) new supercomputer being produced
>by some subsidiary of CDC (I think?) that was written up in "digital

[16] ETA is the reference.  One could argue about whether to call
it CISC or RISC, depending on what you generally think vector machines
really are.

>Also, while CISC is out of vogue in new industry designs at the
>moment, there are plenty of Universities building microcoded
>processors (read "CISC"?).

Of course, this proves little about commercial reality [that is not good
or bad; it is not the job of universities to do that.], but quite a few
folks think there is more to RISC than "being in vogue".

Whew!
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

chrisj@cup.portal.com (Christopher T Jewell) (10/25/88)

In <14112@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes

>In article <5863@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green)
>writes:
>>For example, the high-end Vaxen have a pipelined MICROARCHITECTURE. It
>>is almost impossible to effectively pipeline the macroarchitecture of
>>a Vax, because of the multitude of instruction set formats (almost as
>>bad as the 680x0). 
>
>While the 680x0 have a number of formats (and thus lengths), one of the
>nice properties of its instruction set is that the first word tells you
>the length of the entire instruction.

Only if x <= 1.  On the '020 and '030, an indexed addressing mode
(specified by the opcode word) can require from 1 to 5 extension words
(specified by the first extension word for that operand).  An instruction
whose opcode word specifies `MOVE (ix,An),(ix,An)' can be from 6 to 22
bytes long.
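
Spelling out the arithmetic (2 bytes per word, one opcode word plus two
operands at 1 to 5 extension words each):

	minimum:  2 + 2*(1*2) =  6 bytes
	maximum:  2 + 2*(5*2) = 22 bytes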

Christopher T Jewell   chrisj@cup.portal.com   sun!cup.portal.com!chrisj
"Sure I'm an egomaniac---like everyone else, I'm the only god there is."
				Spinrad, _Riding_the_Torch_

paul@unisoft.UUCP (n) (10/25/88)

In article <15964@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes:
>
>*The more registers, the more to save at every context switch in a typical
>*OS (such as UNIX). Which will slow things down if you have many processes,
>*running.
>
>Based on parameters of Berkeley RISC I or II, the register-saving
>might take on the order of 0.1 msec.  If the quantum size is set to
>be in the range claimed to be typical in the Peterson and Silberschatz
>OS book, i.e. 10 to 100 msec, then we see that the register-saving
>issue for a RISC with lots of registers has probably been greatly
>overemphasized.

Actually some modern chips run faster than this; the 29k, for example,
has 192 registers which it can save with a single burst write to memory.
A 25MHz part takes:

		192 x 40 ns = 7.68 us

Since the stack cache isn't always full, and because the OS uses some of
these registers for itself, the total save time is usually actually less
(and of course the 30MHz parts can save even faster). Of course, compared
with a quantum size in the milliseconds range this is virtually
nonexistent. In fact, compared with the normal Unix process switch overhead
it's not really a big deal.

	Paul

-- 
Paul Campbell, UniSoft Corp. 6121 Hollis, Emeryville, Ca ..ucbvax!unisoft!paul  
Nothing here represents the opinions of UniSoft or its employees (except me)

	"Where was George?" - Nudge, nudge say no more

tim@crackle.amd.com (Tim Olson) (10/26/88)

In article <15964@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
| Based on parameters of Berkeley RISC I or II, the register-saving
| might take on the order of 0.1 msec.  If the quantum size is set to
| be in the range claimed to be typical in the Peterson and Silberschatz
| OS book, i.e. 10 to 100 msec, then we see that the register-saving
| issue for a RISC with lots of registers has probably been greatly
| overemphasized.
| 
| Comments?

Actually, the register saving is more likely to be on the order of 10 to
20 microseconds (order of magnitude less than the 0.1 msec you suggest). 
Comparing 100 context-switches per second to 350,000 procedure calls per
second, it isn't hard to see where to concentrate your optimization
efforts...


	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

peter@ficc.uu.net (Peter da Silva) (10/26/88)

In article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> [7]  This is very confusing.  Most RISCs use 3-address operations, i.e.,
> 	reg3 = reg1 OP reg2.
> 			rather than just 2-address ops:
> 	reg1 = reg1 OP reg2

> Certainly, these include, but are not limited to: IBM 801, HP PA,
> MIPS R2000, SPARC, 29K, 88K.

I've been out of things for a while, but didn't RISCs use to use either
stack or load-store architecture? Or was that just RISC-1?

Anyway, I brought up two CISCy features I'd read about here recently. That
was one. Addressing modes are the other.

And addressing modes... even just indexing and autoincrement... are pretty
CISCy.

Just pointing out that RISC isn't a religion... it's a technique.
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation
"Have you hugged  U  your wolf today?"     uunet.uu.net!ficc!peter
Disclaimer: My typos are my own damn business.   peter@ficc.uu.net

mat@amdahl.uts.amdahl.com (Mike Taylor) (10/26/88)

In article <15964@agate.BERKELEY.EDU>, matloff@bizet.Berkeley.EDU (Norman Matloff) writes:
> ^^^^^^^^
> 
> Based on parameters of Berkeley RISC I or II, the register-saving
> might take on the order of 0.1 msec.  If the quantum size is set to
> be in the range claimed to be typical in the Peterson and Silberschatz
> OS book, i.e. 10 to 100 msec, then we see that the register-saving
> issue for a RISC with lots of registers has probably been greatly
> overemphasized.
> 
> Comments?
> 
>    Norm Matloff

I have trouble with milliseconds, but it depends on the workload and the
OS variant. How about transaction processing, where there may be as few
as (say) 4K cycles between process switches in a message-oriented
environment. (I know this has nothing to do with NeXT).  Then cache
and register effects may be very significant - particularly if you dump
a large register file into a cache.
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

fotland@hpihoah.HP.COM (Dave Fotland) (10/26/88)

>Based on parameters of Berkeley RISC I or II, the register-saving
>might take on the order of 0.1 msec.  If the quantum size is set to
>be in the range claimed to be typical in the Peterson and Silberschatz
>OS book, i.e. 10 to 100 msec, then we see that the register-saving
>issue for a RISC with lots of registers has probably been greatly
>overemphasized.

>Comments?

>   Norm Matloff
----------

This assumes all your processes are compute bound and run for the whole
time slice.  In commercial applications there are very few instructions
between system calls and these frequently block, causing a context switch.

If you only execute 10,000 instructions between context switches (about
1 msec) then a .1 msec overhead for saving and restoring the registers is
a big deal (roughly 10% of the CPU).

If you are only interested in workstations running mainly single 
compute bound jobs then register windows don't cost very much performance,
but if you want to build a general purpose architecture that can also be
used for large commercial systems then you probably want to leave them out.

Also, if you want to build a general purpose system that can be used for
real-time applications, that .1 msec in your interrupt latency could be
a problem.

-David Fotland

matloff@bizet.Berkeley.EDU (Norman Matloff) (10/26/88)

In article <23367@amdcad.AMD.COM> tim@crackle.amd.com (Tim Olson) writes:

>Actually, the register saving is more likely to be on the order of 10 to
>20 microseconds (order of magnitude less than the 0.1 msec you suggest). 
>Comparing 100 context-switches per second to 350,000 procedure calls per
>second, it isn't hard to see where to concentrate your optimization
>efforts...

My computation was a conservative one, assuming (e.g.) the slow 400 ns
cycle time on RISC I, and taking into account that LOAD/STOREs take an
extra cycle.

But the point is, and you seem to agree, that the often-voiced (and
recently brought up in comp.arch) claim that context switches would
make multiple-window-register-file-based RISC's unsuitable for
timeshare applications is just simply not borne out by the data.

  Norm

mash@mips.COM (John Mashey) (10/26/88)

In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>In article <23367@amdcad.AMD.COM> tim@crackle.amd.com (Tim Olson) writes:
>

>But the point is, and you seem to agree, that the often-voiced (and
>recently brought up in comp.arch) claim that context switches would
>make multiple-window-register-file-based RISC's unsuitable for
>timeshare applications is just simply not borne out by the data.

In the following, it must be noted that I am NOT biased in favor of
register windows:

1) It is CLEAR that in a typical UNIX environment, saving/restoring
a SPARC or 29K register file is not, in and of itself, particularly important,
compared with typical UNIX scheduling. [Other people got there first
and ran the numbers, I see.]  Of course it costs something, and
even little things add up, but I doubt that this is a dominant effect.

2) It is CLEAR that one MIGHT care about this in
	a) Real-time applications that require guaranteed minimal latency.
	(Note that the real real-time folks would consider anathema
	the solution of keeping one window free and then faulting the
	others in as needed.) [These folks like things like locking
	things in caches, for example.]
	b) Heavy transaction-oriented systems (as Mike Taylor noted);
	these could either be big database systems or
	things like electronic switching systems, which have (effectively)
	numerous small processes.
In both cases, one would have to run the numbers and see, and this is much more
instance-specific than a general UNIX environment.

3) The UNIX kernel can sometimes be painful for register windows (as in
SPARC, but NOT as in the non-window styles of 29K or CRISP) as follows:
	Register window design (as in UCB) used certain kinds of programs
	to create the statistics to support the design. User programs
	often bounce around in a fairly shallow window-count.
	UNIX kernels are worse.  They often zoom up and down 10-12 levels
	very quickly, causing window faults like crazy.  I have to believe
	the SunOS folks have been working hard to tune for this.

4) Finally, issues of multi-user performance on Sun-4s is a completely
separate matter.  As discussed in various USENIX papers, the kind of virtual
cache used in Sun-[34]/2xx needs substantial cache-flushing on
context switch [the UNIX u-area, particularly].  This can be gotten around,
but takes a while to do.  Of course, this effect has nothing at all to
do with register windows.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (10/26/88)

In article <2005@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>In article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
>> [7]  This is very confusing.  Most RISCs use 3-address operations, i.e.,
>> 	reg3 = reg1 OP reg2.
>> 			rather than just 2-address ops:
>> 	reg1 = reg1 OP reg2

>I've been out of things for a while, but didn't RISCs use to use either
>stack or load-store architecture? Or was that just RISC-1?
Maybe I misread what you meant, but RISCs are mostly load/store
designs, where a single load/store accesses one memory object, which
generally can't cross page (or even naturally-aligned object)
boundaries.  Some of them allow for simple indexed and/or
auto-increment/decrement addressing.

I don't know of any RISCs that have instructions that touch 3 addresses
in memory, so I assume you were asking about the 3-operand forms
(in registers), which are used by most RISCs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (10/26/88)

In article <6865@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>As you note, not-yet-announced.  On the other hand, MIPS R3000s
>do 42K Dhrystones, and they're already in real machines, and vendors
>are quoting the CPUs at $10/mip, i.e., $200 for 25MHz parts.

>Starting from scratch in 1984, and getting the first systems in mid-1986,
>the high-performance VLSI RISC  [i.e., MIPS as example] is:
>	1986	5 MIPS
>	1987	10 MIPS
>	1988	20 MIPS

The above two paragraphs aren't here for any good reason.  I just
liked them.  (Remember that an Amdahl 5890 [the second fastest scalar
processor in the world...:-] does on the order of 42 or 43K dhrystones.)

>>From: doug@edge.UUCP (Doug Pardee)
>>The incorrect assumption here is that you would want to build a mainframe
>>using RISC technology -- that RISC technology has anything to offer at
>>that price/cost level.
>Well, M/2000s act like 5860s, and we think next year's M/xxxx will
>make 5990s sweat some.  Why wouldn't we want to build RISC-based mainframes?
>Lots of people do.

A couple things.  At Amdahl, people do think about things like building
a RISC based mainframe processor.  The big problem that arises is in
guaranteeing object-code compatibility for old COBOL binaries that do
ugly things like use self-modifying code.  But mainframe people are
definitely interested in RISC technology, and are working on ways
to take advantage of it.

John Mashey brings up a point that I've never had a satisfactory
answer to.  If we assume that RISC-based manufacturers can build
machines that outperform mainframes, where will companies like Amdahl
make their money?  When I asked this question around Amdahl, the
answer was "I/O bandwidth.  I/O bandwidth!"  

To what extent would next year's M/xxxx (40 Mips?) processor really
make a 5990 sweat?  I'll concede that on some programs, this processor-
to-come will be as fast as a 5990.  But let's look at the kinds of
processing that are common on mainframes:  database processing.
A 5990 can be equipped with 256 Megabytes of 55nanosecond static ram.
(That's its main memory, not its cache.)  That kind of memory costs
a whole lot, and if you need that kind of memory (for your huge
database and 3000 users), it's going to cost, even on a RISC based
mainframe.

The 5990 also has lots of I/O bandwidth.  (Anyone want to help me
with the numbers here?)  I believe that you can hook up something
like 32 4.5Megabyte (byte, not bit) per second channels to one of these
beasties.  That kind of I/O bandwidth costs.  (For comparison,
a diskless Sun has about 1.25 Megabytes per second of bandwidth
[10 Megabit Ethernet], and a diskful Sun probably doesn't have much
more than 4 Megabytes per second.  So a mainframe can do something
like 30 times as much I/O as a workstation...)

(People at Amdahl would also mention that when you build a mainframe,
it has to be highly reliable and extremely serviceable.  Apparently,
a fair amount of hardware and money goes into increasing
the reliability and serviceability of a mainframe.)

So, the basic claim that I want to make, and that I'd like to hear
counter-arguments to, is that if you build a RISC-based mainframe,
it's still going to cost $10,000,000.

(Random thoughts...  People at Amdahl are starting to worry that
the next generation of Amdahl mainframes might be able to support
64K concurrent processes, or at least enough processes to make
pid's wrap way too frequently.  Has MIPS started worrying about
the problem of 16-bit pid's yet?  Seems like MIPS might run into
trouble in 1990 or 1991...)  (16-bit major/minor device numbers
are already too small for a 5890 [have you ever tried to configure
3000 terminal devices in an 8-bit field?]  How much trouble is
MIPS having with this 16-bit limit?)

-- Chuck

csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (10/26/88)

In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>But the point is, and you seem to agree, that the often-voiced (and
>recently brought up in comp.arch) claim that context switches would
>make multiple-window-register-file-based RISC's unsuitable for
>timeshare applications is just simply not borne out by the data.
>
>  Norm

If I remember the arguments from MIPS correctly (want to help me out
John?), there's a stronger objection to multiple-window-register-files.
I think it's something to the effect that register-windows cause the
load/store access time to be slower.  I think there also may be some
argument that a good compiler makes multiple-windows relatively
unnecessary.

Could one of you nice people address something like the above and
help me clarify my thinking?

-- Thanks, Chuck

jack@cwi.nl (Jack Jansen) (10/26/88)

In article <15964@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
>
>Based on parameters of Berkeley RISC I or II, the register-saving
>might take on the order of 0.1 msec.  If the quantum size is set to
>be in the range claimed to be typical in the Peterson and Silberschatz
>OS book, i.e. 10 to 100 msec, then we see that the register-saving
>issue for a RISC with lots of registers has probably been greatly
>overemphasized.
>
>Comments?
>
>   Norm Matloff
Well, 100 usec might be fine for standard unix, but it is definitely
not fine for operating systems supporting light-weight threads.

In Amoeba, our distributed system, thread-to-thread switch time
is on the order of 20-50 usec, and on a fast machine like an R2000
it would probably be down to 5-20 usec, not counting the register
save.

What I would like is some help from the architecture, like dirty bits on
groups of registers or something. 
Actually, I'm not *that* familiar with the R2000 (or the other risc
chips, for that matter); do any of them provide a feature for this?

Also, does anyone know  thread switch times for Mach or other systems
that support light-weight threads, and how these would be affected
on machines with large register files?
--
Fight war, not wars			| Jack Jansen, jack@cwi.nl
Destroy power, not people! -- Crass	| (or mcvax!jack)

daveh@cbmvax.UUCP (Dave Haynie) (10/26/88)

in article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) says:

>>Priced the ~8 MIPS Sun 4 lately?  Or the ~14 MIPS 88K chipset.  How about
>>an Apollo 10K?  RISC machines are starting to get fast, and they're even
>>starting to get down in price, but these two directions haven't met yet.

> [5] Actually, this is the wrong reason: you can put together MIPS chipsets
> at similar (or even slightly better) cost/performance levels (have you
> prices a 68882 lately, for example?)  

They probably pay less than $50 for the 25MHz part, if they are able to
convince Motorola to give them good volume pricing.

> All of these depend on quantity, and what it is you're trying to build.
> Admittedly, it's hard for us to build anything less than about 6 VUPs.
> I suspect you can build a CPU (+ FPU) subsystem like that for around $500,
> given large quantities, maybe $400-$500 as the new cache chips come out.

Which is still more than twice as expensive as a 68030-based system's chips
will cost, in reasonable quantities.  Perhaps twice the performance, or
better, if the caches work well and you can live with the same-priced memory
the 68030 system can use.  If you're really trying to design a workstation,
maybe you should consider RISC at this point, because all the PCs are
starting to use '020s, '030s, and '386s.  Still, it's a problem of choosing
the RISC of the week and betting that it will succeed.

> -john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession

guy@auspex.UUCP (Guy Harris) (10/26/88)

>>The more registers, the more to save at every context switch in a typical
>>OS (such as UNIX). Which will slow things down if you have many processes
>>running.

>What data do you have to substantiate this claim?  This is another popular
>misconception, I think.

There appears to be a belief out there that register windows slow
context switches down on Sun-4s; this may be the source of the claim. 
It isn't true; what slows them down is the expense of flushing the
entries for the U area from the virtual address cache, since the U area
is at the same address in kernel virtual space in all processes, and the
context number is the same for all those processes, so you can't rely on
the context number to distinguish between the virtual address of the U
areas in different processes.  The fix is to put them at different
virtual addresses.... 
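To make the fix concrete, here is a minimal sketch (the constants and the
routine name are invented for illustration, not Sun's actual layout) of
deriving a distinct U-area virtual address from the context number, so the
virtual cache can tell the entries apart without a flush:

	/* Hypothetical constants -- not Sun's actual memory layout. */
	#define UAREA_BASE	0xF0000000UL	/* start of the U-area region */
	#define UAREA_SIZE	0x2000UL	/* one U area, rounded up to pages */

	/* Each context gets its own U-area virtual address, so cache
	   entries for different processes' U areas never alias. */
	unsigned long uarea_vaddr(unsigned context)
	{
		return UAREA_BASE + (unsigned long) context * UAREA_SIZE;
	}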

mash@mips.COM (John Mashey) (10/27/88)

In article <468@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:

>>>From: doug@edge.UUCP (Doug Pardee)
>>>The incorrect assumption here is that you would want to build a mainframe
>>>using RISC technology -- that RISC technology has anything to offer at
>>>that price/cost level.
>>Well, M/2000s act like 5860s, and we think next year's M/xxxx will
>>make 5990s sweat some.  Why wouldn't we want to build RISC-based mainframes?
>>Lots of people do.

In general, I agree 100% with Chuck: CPU performance doesn't necessarily
imply I/O performance (which I've said numerous times), and if I'd not
been in catchup mode, I would have said "sweat some on uniprocessor CPU
performance".  Actually, in terms of market conflict, as far as I can
tell, despite managing to bump into lots of other people, Amdahl is
one we don't, and probably never will. [Why? 1) Most people who buy
from Amdahl have already chosen their architecture, based on existing
applications, 2) They pick Amdahl over other PCMs or IBM for a variety
of reasons, including cost/performance or smart features like the
mulitple-domain thing, 3) Their customers tend to be very loyal,
as they appear to be treated well.  (These comments arise from having
spoken at an Amdahl User's Group meeting not long ago and spending a lot
of time talking to their customers.)]

>A couple things.  At Amdahl, people do think about things like building
>a RISC based mainframe processor.  The big problem that arises is in
>guaranteeing object-code compatibility for old COBOL binaries that do
>ugly things like use self-modifying code.  But mainframe people are
>definitely interested in RISC technology, and are working on thinking
>up ways to take advantage of the technology.

As noted elsewhere, it makes perfect sense, once you have some base for
it, to keep pushing an architecture further. S/360 and its descendants
are clearly a fertile area for this.

>John Mashey brings up a point that I've never had a satisfactory
>answer to.  If we assume that RISC-based manufacturers can build
>machines that outperform mainframes, where will companies like Amdahl
>make their money?  When I asked this question around Amdahl, the
>answer was "I/O bandwidth.  I/O bandwidth!"  
This is a legitimate technical answer, as it certainly distinguishes
things with mainframe-class CPU performance from real, large mainframes.
(Actually, I think the other issues mentioned above are at least as important.)

>To what extent would next year's M/xxxx (40 Mips?) processor really
>make a 5990 sweat?  I'll concede that on some programs, this processor-
>to-come will be as fast as a 5990.  But let's look at the kinds of
>processing that are common on mainframes:  database processing.
>A 5990 can be equipped with 256 Megabytes of 55nanosecond static ram.
>(That's its main memory, not its cache.)  That kind of memory costs
>a whole lot, and if you need that kind of memory (for your huge
>database and 3000 users), it's going to cost, even on a RISC based
>mainframe.
Yep, absolutely.  My guess is that it will be a while before people
build RISC-based systems that can capture these sorts of applications:
	a) You do have to build memories with a lot of bandwidth.
	b) You have to build I/O, spend a lot of $ on reliability &
	serviceability.
	c) You have to move the applications. [IMS? CICS? hmmm.]
	d) You have to be a company of such size and nature that those
	folks will trust those applications to you....and some of those
	folks have only recently noticed that companies like DEC or
	Amdahl are substantial enough to consider :-)
On the other hand, some mainframe cycles go towards engineering applications,
or towards general time-sharing, and other less immediately "mission-critical"
applications, and some of those we actually get a chance to fight for.
(Actually, quite a few MIPS machines are used in multi-user database
environments, but not in the same ones that Amdahls would be used in.)

>The 5990 also has lots of I/O bandwidth.  (Anyone want to help me
>with the numbers here?)  I believe that you can hook up something
>like 32 4.5Megabyte (byte, not bit) per second channels to one of these
>beasties.  That kind of I/O bandwidth costs.  (For comparison,
>a diskless Sun has about 1.25 Megabytes per second of bandwidth
>[10 Megabit Ethernet], and a diskful Sun probably doesn't have much
>more than 4 Megabytes per second.  So a mainframe can do something
>like 30 times as much I/O as a workstation...)

Yes, I believe we won't have quite that bandwidth next year, although
the I/O will be quite respectable at the price.  Of course,
we worry about the issue in general: CPU performance is going up so fast
right now, it's clearly leaving cost-equivalent I/O behind.
On the other hand, there is interesting work going on in the world towards,
for example, farms of small disks, which can get some good bandwidth
rather cheaply.

>(People at Amdahl would also mention that when you build a mainframe,
>it has to be highly reliable and extremely serviceable.  Apparently,
>a fair amount of hardware and money goes into increasing
>the reliability and serviceability of a mainframe.)
Yes.  Note that here there is some edge for the RISCs, just because the
basic hardware is simpler in the first place; it's less work to air-cool
them, etc.  The CPU+cache can be 1 board, etc.  Again, this only applies to
the CPU subsystem, but that's certainly one of the more stressful areas.

>So, the basic claim that I want to make, and that I'd like to hear
>counter-arguments to, is that if you build a RISC-based mainframe,
>it's still going to cost $10,000,000.

When you get to really large configurations, it's clear that very little
of the money is in the CPU any more.  On the other hand, sometimes you can
trade CPU performance for some kinds of I/O gear (i.e., a small example
would be having cheaper serial-i/o support because you can afford to
have more CPU overhead per interrupt, because the CPU is faster).
I'll have to think about the number: it will be a long time if ever before
we build something that costs that much.

>(Random thoughts...  People at Amdahl are starting to worry that
>the next generation of Amdahl mainframes might be able to support
>64K concurrent processes, or at least enough processes to make
>pid's wrap way too frequently.  Has MIPS started worrying about
>the problem of 16-bit pid's yet?  Seems like MIPS might run into
>trouble in 1990 or 1991...)  (16-bit major/minor device numbers
>are already too small for a 5890 [have you ever tried to configure
>3000 terminal devices in an 8-bit field?]  How much trouble is
>MIPS having with this 16-bit limit?)

We haven't done a lot in that direction, mainly because:
	a) It's more likely to get solved as part of the general UNIX
	evolution, I think.  You guys have just run into it earlier
	than most people.
	b) We don't need to quite yet.  Although we have some M/1000s
	that have 60-100 users on them, and M/2000s that will
	have more, I suspect we won't have 3000 for a while.
	Among other things, people get greedy, if the cycles are cheap.
	(We now have the spectacle of people having gotten used to having
	a 20-mips machine by themselves, and wanting it all of the time.
	Of course, we have people who'd barely be satisfied with Cray-YMPs
	on their desks, so that's probably not surprising.)

Anyway, mainframe I/O is definitely in a different league right now.
In fact, it might be instructive for the newsgroup for somebody
to post a description of what a 5990 memory hierarchy looks like in
more detail.  This newsgroup argues more about microprocessors
than mainframes.  If you review computing history, you find that
each wave [mainframe, mini, micro] has tended to repeat much of the
evolution of the earlier waves.  Now that VLSI micros are getting
supermini & up performance, many of the same old issues will arise.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (10/27/88)

In article <469@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:

>If I remember the arguments from MIPS correctly (want to help me out
>John?), there's a stronger objection to multiple-window-register-files.
>I think it's something to the effect that register-windows cause the
>load/store access time to be slower.  I think there also may be some
>argument that a good compiler makes multiple-windows relatively
>unnecessary.

1) Register-windows make load/store access slower: I don't particularly
believe this.  I believe that people sometimes think that windows
may reduce the numbers of loads and stores enough that they think
they can get away with slower loads and stores.  I believe that whether
that's true or not depends on the application; it certainly is not true
for some applications.  As far as I know, the slower loads/stores on
existing SPARCs have nothing to do with having windows, but with the
cache design, and are not intrinsic to the architecture, but rather
to the implementation.

2) Good compilers + enough registers do reduce the benefits gained
by register windows; over many benchmarks, we find that the average
number of registers saved/restored is about 1.5-2.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jjw@celerity.UUCP (Jim ) (10/27/88)

In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes:
>The more registers, the more to save at every context switch in a typical
>OS (such as UNIX). Which will slow things down if you have many processes
>running.

One solution to this is to have a "register cache" which holds the register
sets for several processes.  Context switches among the loaded processes are
then very fast.  The save/restore penalty need be paid only when a new
process is brought into the mix.  Given that there is a "locality" of
processes (processes which just ran are likely to be run again soon), this
significantly reduces the context switch cost.
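As a rough illustration only (the slot count, the LRU policy, and the
spill_regset/fill_regset helpers are all invented here, not any real
machine's interface), such a register cache might behave like this:

	#define NSLOTS 8			/* register-set slots kept on chip */

	struct regset { unsigned long r[32]; };

	extern void spill_regset(int pid, struct regset *rs);	/* hypothetical */
	extern void fill_regset(int pid, struct regset *rs);	/* hypothetical */

	static struct regset slot[NSLOTS];
	static int owner[NSLOTS];		/* 0 = free; real pids assumed > 0 */
	static unsigned stamp[NSLOTS], now;

	/* Return the slot holding pid's registers, loading them only on a miss. */
	int regcache_lookup(int pid)
	{
		int i, victim = 0;

		for (i = 0; i < NSLOTS; i++)
			if (owner[i] == pid) {		/* hit: the switch is cheap */
				stamp[i] = ++now;
				return i;
			}
		for (i = 1; i < NSLOTS; i++)		/* miss: evict the LRU slot */
			if (stamp[i] < stamp[victim])
				victim = i;
		if (owner[victim] != 0)			/* only now pay the save... */
			spill_regset(owner[victim], &slot[victim]);
		fill_regset(pid, &slot[victim]);	/* ...and restore penalty */
		owner[victim] = pid;
		stamp[victim] = ++now;
		return victim;
	}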

mash@mips.COM (John Mashey) (10/27/88)

In article <5112@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
>in article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) says:
>> [5] Actually, this is the wrong reason: you can put together MIPS chipsets
>> at similar (or even slightly better) cost/performance levels (have you
>> prices a 68882 lately, for example?)  

>They probably pay less than $50 for the 25MHz part, if they are able to
>convince Motorola to give them good volume pricing.

Unfortunately, all of this depends on what things really cost, when they
cost what they cost, and at what volumes those costs apply.  I suspect it
may be a while until the 68882 costs $50, but then I haven't bought any
lately, so I could be wrong.  It's often very hard to get real numbers.

>the 68030 system can use.  If you're really trying to design a workstation,
>maybe you should consider RISC at this point, because all the PCs are
>starting to use '020s, '030s, and '386s.  Still it's a problem of choosing
>the RISC of the week and betting that it will succeed.

Yep.  The latter is the real issue.  Mark Linimon suggested a paraphrase
of the Russian roulette analogy especially appropriate to RISC:
	Russian roulette in the delay slot.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

henry@utzoo.uucp (Henry Spencer) (10/27/88)

In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes:
>The more registers, the more to save at every context switch in a typical
>OS (such as UNIX). Which will slow things down if you have many processes
>running.

This one comes up regularly, sigh...  Whether it gives you a net slowdown
or not depends on how much context-switching is going on, how long a
process runs between context switches (i.e. how much chance it has to
take advantage of having that data in registers), and how much you care
about interrupt latency.  If context switches are not *too* common and
latency is not a big deal, lots of registers can be a huge net win even
if it does slow context switching.  The same comment applies to non-
writethrough caches.
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

matloff@bizet.Berkeley.EDU (Norman Matloff) (10/27/88)

In article <469@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
*In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
*>But the point is, and you seem to agree, that the often-voiced (and
*>recently brought up in comp.arch) claim that context switches would
*>make multiple-window-register-file-based RISC's unsuitable for
*>timeshare applications is just simply not borne out by the data.

*If I remember the arguments from MIPS correctly (want to help me out
*John?), there's a stronger objection to multiple-window-register-files.
*I think it's something to the effect that register-windows cause the
*load/store access time to be slower.  I think there also may be some
*argument that a good compiler makes multiple-windows relatively
*unnecessary.

A compiler which automatically inlines procedure calls might be able
to do what you are saying.  However, there may be some unpleasant
side effects, depending on the language.  Actually, one of my grad
students is making a thesis out of this, so I'll try to report more
on it later.

   Norm

dharvey@wsccs.UUCP (David Harvey) (10/27/88)

In article <10194@cup.portal.com>, bcase@cup.portal.com (Brian bcase Case) writes:
> 
> What has been left out of this discussion is the software side of the
> issue.  The almighty Compiler can save us from our sins!  It is our
> saviour!  Long live common subexpression elimination!  Hail to the code
> reorganizer!  Praise the register allocator!  Jim Bakker, watch out!

This view is typical of hardware types.  By all means, let's pass the
buck to the next guy.  So the compiler writer has his (her) share of
nightmares actually getting the thing to compile some code.  And then
the systems programmer comes along and inserts a few
more kludges to make the machine purr.  Did he document them?  I hope
so.  Now it is the application programmer's turn to s***w things up.
If my memory serves me correctly, it is much easier to get something
up and running on a Motorola 68000 than on an Intel 8086 (very nasty,
those beasty little segments).  And miracle of miracles, we learn that
over 70% of computing costs are software.  It seems like hardware types
should be designing their end of the deal to reduce it at the other end.


dharvey@wsccs

What do I know...I don't design the d**n things, I just use them.

koopman@a.gp.cs.cmu.edu (Philip Koopman) (10/27/88)

In article <468@oracle.UUCP>, csimmons@hqpyr1.oracle.UUCP (Charles Simmons) writes:
> In article <6865@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
> >As you note, not-yet-announced.  On the other hand, MIPS R3000s
> >do 42K Dhrystones, and they're already in real machines, and vendors
> >are quoting the CPUs at $10/mip, i.e., $200 for 25MHz parts.

Hey, wait a minute.  You can't just spec the price of the CPU itself,
you need to include the cost of other required chips (like cache controllers,
MMU's, or whatever) when you say how much the CPU costs.  On some machines,
you can't run without these extra components.

  Phil Koopman                koopman@maxwell.ece.cmu.edu   Arpanet
  5551 Beacon St.
  Pittsburgh, PA  15217    
PhD student at CMU and sometime consultant to Harris Semiconductor.

guy@auspex.UUCP (Guy Harris) (10/27/88)

>> [7]  This is very confusing.  Most RISCs use 3-address operations, i.e.,
>> 	reg3 = reg1 OP reg2.
>> 			rather than just 2-address ops:
>> 	reg1 = reg1 OP reg2

>I've been out of things for a while, but didn't RISCs use to use either
>stack or load-store architecture? Or was that just RISC-1?

They still do.  Note that the 2-address and 3-address operations he
lists all have "regN" as the operands; RISCs tend to use load-store
operations as their only *memory-reference* operations, but (unless you
have magic "memory" locations that do arithmetic) you generally need
arithmetic operations as well to make a useful computer.  RISCs tend to
have only register-to-register arithmetic operations, and they tend to
be 3-"address" in the sense that they operate on two registers and stick
the result in a third, with none of them obliged to be the same register.

>Anyway, I brought up two CISCy features I'd read about here recently. That
>was one. Addressing modes are the other.
>
>And addressing modes... even just indexing and autoincrement... are pretty
>CISCy.

Umm, if indexing is "pretty CISCy", then just about every machine out
there is a CISC, which makes "CISCy" pretty much uninteresting as an
adjective, unless you can show an interesting machine that lacks
indexing.

"Indexing" generally refers to forming an effective address by adding
the values in one or more registers to a constant offset, and both the
MIPS Rn000 (OK, John, what's the term you use to refer to the R2000 and
R3000, or are they different enough that such a term wouldn't be
useful?) and SPARC, to name just two machines generally thought of as
RISCs, support indexing in that sense (register+offset on MIPS,
register+register+offset on SPARC).
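To make that concrete, a simple array reference needs an index computation
plus a single load on either style of machine; the assembly in the comment
is illustrative pseudo-code in roughly each vendor's flavor, not actual
compiler output:

	int fetch(int *a, int i)
	{
		return a[i];		/* byte offset is i*4 for 32-bit ints */
	}

	/* Roughly, on a MIPS-like machine (register+offset addressing):
	 *	sll	t0, i, 2	# t0 = i * 4
	 *	addu	t0, t0, a	# t0 = address of a[i]
	 *	lw	v0, 0(t0)	# load a[i]
	 * and on a SPARC-like machine (register+register addressing):
	 *	sll	%i1, 2, %o1		! %o1 = i * 4
	 *	ld	[%i0 + %o1], %o0	! load a[i]
	 */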

guy@auspex.UUCP (Guy Harris) (10/27/88)

>	Register window design (as in UCB) used certain kinds of programs
>	to create the statistics to support the design. User programs
>	often bounce around in a fairly shallow window-count.
>	UNIX kernels are worse.  They often zoom up and down 10-12 levels
>	very quickly, causing window faults like crazy.  I have to believe
>	the SunOS folks have been working hard to tune for this.

While I was at Sun, I don't remember there ever having been any effort
to reduce the depth of the kernel call stack in order to speed things up
on SPARC-based Suns.  (Remember, they have to make it run sufficiently
fast on three architectures, not one - four, if you count 370/XA and
compatibles, and even more, if you consider that a lot of Sun code is
going into S5R4....)

mash@mips.COM (John Mashey) (10/28/88)

In article <7681@boring.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>Well, 100 usec might be fine for standard unix, it is definitely not
>fine for operating systems supporting light-weight threads.

>In amoeba, our distributed system, thread-to-thread switch time
>is in the order of 20-50usec, and on a fast machine like a R2000
>it would probably be down to 5-20usec, not counting the register
>save.

>What I would like is some help from the architecture, like dirty bits on
>groups of registers or something. 
>Actually, I'm not *that* familiar with the R2000 (or the other risc
>chips, for that matter); do any of them provide a feature for this?

There are two styles of doing this, most typically associated with
the floating-point register file.
	a) Keep a dirty bit.
	b) Keep a "useable" bit, where you trap if somebody issues an
	FP instruction.

In case a), on a context switch from task 1 to 2:
	if 1's registers are dirty, save them
	load 2's state into the registers
	switch

In case b), for the same context switch:
	maintain an "owner" for the FP regs, which is either a task (X),or empty
		note that 1 may well not own the FP regs at this point
	before switching to 2:
		if 2 is the owner of the FP regs, turn useability on
		if 2 is not the owner, turn useability off
	switch to 2
	if 2 uses an FP op, trap it
		save the FP state into X's context
		load up 2's FP state into the registers
		owner = 2
	there are variant strategies, depending on how fancy you want to get.

MIPS has a useability bit for each coprocessor; we also actually keep bits
in the executables that say which registers got used.  [we put these in just
in case, although more for special-purpose environments.  They turn out
not to be very useful: the optimizers are too good at grabbing every register.]

SPARC uses a similar technique, I think.  Clipper uses a dirty bit.
Various other micros do one or the other.
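For concreteness, here is a minimal sketch of style b) in C; the task
structure and the set_fpu_usable/save_fp_state/restore_fp_state/dispatch
routines are hypothetical names, not any particular vendor's kernel
interface:

	struct task {
		double	fpregs[32];	/* saved FP state for this task */
		/* ... */
	};

	extern void set_fpu_usable(int on);		/* hypothetical */
	extern void save_fp_state(double *regs);	/* hypothetical */
	extern void restore_fp_state(double *regs);	/* hypothetical */
	extern void dispatch(struct task *t);		/* hypothetical */

	static struct task *fp_owner;	/* task whose state is live in the FPU */

	void context_switch(struct task *next)
	{
		/* No FP registers are moved here: just arrange for next to
		   trap on its first FP op unless it already owns the FPU. */
		set_fpu_usable(next == fp_owner);
		dispatch(next);
	}

	void fp_unusable_trap(struct task *curr)
	{
		if (fp_owner != 0)
			save_fp_state(fp_owner->fpregs);	/* spill the old owner */
		restore_fp_state(curr->fpregs);			/* load the new owner */
		fp_owner = curr;
		set_fpu_usable(1);
		/* return from the trap and re-execute the faulting FP op */
	}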

BTW: it is not instantly obvious that one would add a bit in for just this
purpose.  On a 16.7MHz M/120, it takes something like 4-30 microseconds
to save 32 registers and restore 32 registers [the 4 is all cache hit,
the 30 is all cache miss].  On a 25MHz M/2000, it takes 3-10 microseconds,
even with a large (i.e., inherently longer-latency) memory system:
note that block refill of the caches helps a lot in that case.
I'd guess that "typical" numbers in a general-purpose environment,
especially one with a lot of context switching, would be on the order
of 15 & 7 microseconds, respectively.  [In a real-time environment, one would
gimmick some of the things to avoid I-cache misses.]
Thus, a useability bit might save this for you, some of the time.
We actually put it in for several reasons:
	a) Symmetry: we actually use a useability bit on coprocessor 0,
	which subsumes what would otherwise be privileged ops.
	b) Simplicity of handling systems without an FPU.
	c) and, finally, the ability to avoid FP context switches as described.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)

In article <10447@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>The more registers, the more to save at every context switch in a typical
:
>What data do you have to substantiate this claim?  This is another popular
>misconception, I think.

(Interesting data on Pyramid study omitted)

>which is 0.20 percent of the total available CPU time.  I don't think
>this is significant.  For some implmentations, it is more like 1 cycle

I agree with this point and would like to add that there may be some
simple things which can be added to hardware to speed up context
switching.  CDC has typically used a "save everything" approach with
the complete save taking place with a single hardware instruction.
This instruction is easy to implement in a RISC machine as well; it
trades some extra memory bandwidth for a potential payoff in less code
executed to do a context switch.  However, it may be that picking the
next runnable process dominates the cost of a context switch by far.
Is there any hard data out there?

-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)

In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:

>But the point is, and you seem to agree, that the often-voiced (and
>recently brought up in comp.arch) claim that context switches would
>make multiple-window-register-file-based RISC's unsuitable for
>timeshare applications is just simply not borne out by the data.

Correct.  Register windows seem to be a bad idea, but not because of
increased context switch time.  Rather, they seem to yield marginally
less performance gain than other uses of the equivalent
silicon/gates/etc.


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

mash@mips.COM (John Mashey) (10/28/88)

In article <313@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>>	Register window design (as in UCB) used certain kinds of programs
>>	to create the statistics to support the design. User programs
>>	often bounce around in a fairly shallow window-count.
>>	UNIX kernels are worse.  They often zoom up and down 10-12 levels
>>	very quickly, causing window faults like crazy.  I have to believe
>>	the SunOS folks have been working hard to tune for this.

>While I was at Sun, I don't remember there ever having been any effort
>to reduce the depth of the kernel call stack in order to speed things up
>on SPARC-based Suns.  (Remember, they have to make it run sufficiently
>fast on three architectures, not one - four, if you count 370/XA and
>compatibles, and even more, if you consider that a lot of Sun code is
>going into S5R4....)

I'm surprised that nobody was doing this.  Note, however, that this is
NOT the "standard UNIX optimized for SPARC" issue, i.e., although
squishing the call tree wouldn't particularly help the other architectures,
it wouldn't hurt them that much either, and it might help SPARC some.
Of course, no sensible software engineer would do terrible distortions to
the code to do this, but I can imagine where people might put some effort
into (machine-independent) level-squishing.
As usual, if Guy says it wasn't being done, it probably wasn't;
certainly my comment was speculation.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (10/28/88)

In article <3404@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
>In article <468@oracle.UUCP>, csimmons@hqpyr1.oracle.UUCP (Charles Simmons) writes:
>> In article <6865@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>> >As you note, not-yet-announced.  On the other hand, MIPS R3000s
>> >do 42K Dhrystones, and they're already in real machines, and vendors
>> >are quoting the CPUs at $10/mip, i.e., $200 for 25MHz parts.

>Hey, wait a minute.  You can't just spec the price of the CPU itself,
>you need to include the cost of other required chips (like cache controllers,
>MMU's, or whatever) when you say how much the CPU costs.  On some machines,
>you can't run without these extra components.

100% agree: I just don't know the rest of the numbers offhand.
In our case, the MMU & cache control are on-chip; these days, you
add an FPU (probably costs 1.5-2X the corresponding CPU),
and SRAM (which is where most of the money is), plus some fairly cheap
glue parts for the external memory interface.

There was no intention to make people think the entire CPU core costs $10/mip,
and if readers thought that, unthink it. I'd guess a whole CPU core
(CPU + FPU + MMU + cache control + cache + external memory interface)
currently costs about $70-$10/VUP for the kind of things you build in
systems [less for embedded systems, especially as the 4Kx16 & similar-shaped
SRAMs become more available.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mat@amdahl.uts.amdahl.com (Mike Taylor) (10/28/88)

In article <7038@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In fact, it might be instructive for the newsgroup for somebody
> to post a description of what a 5990 memory hierarchy looks like in
> more detail.


Your wish is my command.  Each processor has a 64K byte instruction cache
and a 64K byte operand cache, equipped with their own TLBs. Implemented
in 16K chips, 2.8 ns. access (chips also have 1200 logic gates).
From memory, I think it is organized as 128-byte lines,
4-way set associative, with about 15 cycles miss penalty.
Main storage is up to 512 megabytes of 55ns. 256K SRAM, accessed as cache
lines and interleaved. Below main storage, there is expanded storage
and I/O. Expanded storage is up to 2GB of 1M DRAM, accessed as pages.
I/O consists of up to 128 channels. 2 are byte-multiplexor channels,
another 30 are either, and the remainder are only block-multiplexor
channels. Block mux channels go up to 4.5 MB/sec. A typical
configuration has about 5 GB/MIPS of DASD installed. So an average
5990-1400 at (say) 105 370 MVS MIPS would have about 500GB of DASD.
Many of our customers have DASD farms in the terabyte league, however,
plus tape, non-volatile electronic storage ("EDAS"), etc.
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

walsh@endor.harvard.edu (Bob Walsh) (10/28/88)

I know that the TCP/IP in the (BSD derivative) kernel has fewer call
levels than some others (the BBN TCP/IP) because it was felt that
subroutine call overhead was excessive (on the VAX).  Though the
original design decision was not based on a RISC chip, I would not
be surprised if gprof were used on the (RISC) kernel by various
implementors to see where time is spent, with the side effect that
various routines are brought in-line or turned into macros.  It is
the sort of thing done in the privacy of one's office; a product is
not wholly the result of company policy/non-policy.

peter@ficc.uu.net (Peter da Silva) (10/28/88)

In article <312@auspex.UUCP>, guy@auspex.UUCP (Guy Harris) writes:
> Umm, if indexing is "pretty CISCy", then just about every machine out
> there is a CISC, which makes "CISCy" pretty much uninteresting as an
> adjective, unless you can show an interesting machine that lacks
> indexing.

Well, blithely stepping over the autoincrement question, what about the
Cosmac 1802? The first CMOS microprocessor, the first micro with an orthogonal
instruction set, the first micro with a real-time operating system. It
had all sorts of RISCy features, such as a load-store architecture, gobs
of registers, and orthogonal instructions.  The PC was just a general
register, pointed to by the 4-bit P register, and so was the SP; so for
a microcontroller application you could do a context switch in two
instructions:

	SEX	n	; X = n: use register Rn as the data/index pointer
	SEP	n	; P = n: use register Rn as the program counter

It was widely used in embedded controller applications where low power was
important well into the early '80s. If Cosmac has been able to support and
expand it it'd be a decent contender to the intel chips today. It's a much
saner design than the 8080, and its sucessors wouldn't have been the
monstrosities that the 8080 has visited upon us.
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation
"Have you hugged  U  your wolf today?"     uunet.uu.net!ficc!peter
Disclaimer: My typos are my own damn business.   peter@ficc.uu.net

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)

In article <7180@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <7681@boring.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>>Well, 100 usec might be fine for standard unix, it is definitely not
:
>BTW: it is not instantly obvious that one would add a bit in for just this
>purpose.  On a 16.7MHz M/120, it takes something like 4-30 microseconds
>to save 32 registers and restore 32 registers [the 4 is all cache hit,
>the 30 is all cache miss].  On a 25MHz M/2000, it takes 3-10 microseconds,

On a 50MHz Cyber 205, it takes approximately 200 minor cycles = 4 microseconds
to swap the entire processor context.  128 of those cycles are used to
swap all 256 general-purpose registers.

Sometimes people confuse procedure call time (which requires saving only
those registers which will be used - small for small procedures) with
the context switching time (time to switch to another user/process).

There is no need for context switching time to be tiny in an ordinary
(no fine-grained parallelism) system, since even on a very large machine
context switches shouldn't occur more frequently than several thousand/second.

Procedure calls, obviously, must be very fast, since they may occur
several thousand times more often than context switches do.

The Cray-2 is the only machine that I know of that has a real problem with
context switching.  The reason is that it has an extraordinarily large
user context in the form of "local memory".


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)

In article <17208@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
>Correct.  Register windows seem to be a bad idea, but not because of

I seem to have overstated my case.  I should have said that I consider
it unproven that register windows are a good idea relative to
other equivalent uses of the same real estate.  The only "problem"
that I have with register windows is that, like "RISC" (whatever that is),
some people make extraordinary unsubstantiated claims.


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

henry@utzoo.uucp (Henry Spencer) (10/29/88)

In article <313@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>While I was at Sun, I don't remember there ever having been any effort
>to reduce the depth of the kernel call stack in order to speed things up
>on SPARC-based Suns...

Um, Guy, a lot of us Sun customers concluded quite a while ago that
performance (except on benchmarks) is not high on Sun's list of priorities.
"Just buy a faster machine, we'll be happy to sell you one."
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

alan@pdn.UUCP (Alan Lovejoy) (10/29/88)

In article <28200218@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
<>As an aside, the 68030 can do a 32 bit multiply in about (If I remember 
<>correctly -- I don't have the book in front of me) 40 cycles.  A while
<>back, I tried to write a 32 bit multiply macro that would take less 
<>than the 40 or so that the '030 took.  I didn't even come close (even 
<>assuming lots of registers and a 32 bit word size (which the 6502 
<>doesn't have)).  
<
<There do exist RISCs with multiply instructions. In fact, real 
<multiplies, with full multiplier arrays taking lots of space that
<might otherwise have had to be used for microcode.
<
<>Cory Kempf
<
<Andy Glew

The 88k does a 32-bit integer multiply in 4 cycles (r3000 takes 13
cycles, I believe).  A 32-bit integer divide takes the 88k 39 cycles
(r3000 takes 36 cycles, I believe).  Of course, if either of the
division operands is negative (signed division opcode), the 88k has to
trap to a software routine to finish the division.  In other words,
not all RISCs are wimps just because they don't have "complex
instructions". 

-- 
Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida.
Disclaimer: Do not confuse my views with the official views of Paradyne
            Corporation (regardless of how confusing those views may be).
Motto: Never put off to run-time what you can do at compile-time!  

aglew@urbsdc.Urbana.Gould.COM (10/30/88)

> Umm, if indexing is "pretty CISCy", then just about every machine out
> there is a CISC, which makes "CISCy" pretty much uninteresting as an
> adjective, unless you can show an interesting machine that lacks
> indexing.

How about the AMD 29000?  I'm sure that Brian Case will comment
on this.

larus@paris.Berkeley.EDU.berkeley.edu (James Larus) (10/30/88)

In article <17260@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
>In article <17208@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
>>Correct.  Register windows seem to be a bad idea, but not because of
>
>I seem to have overstated my case.  I should have said that I consider
>it unproven that register windows are a good idea relative to
>other equivalent uses of the same real estate.  The only "problem"
>that I have with register windows is that, like "RISC" (whatever that is),
>some people make extraordinary unsubstantiated claims.

You might be interested in David Wall's paper "Register Windows vs.
Register Allocation," in the 1988 PLDI Conf.  He found that register
windows were generally as good as large register files (64 or more
entries) together with interprocedural register allocation.  However,
Wall argued that load/store in a large register file should be faster
so that the overall performance would be better without windows.
Given that the last assumption is questionable and that windows have
significant advantages for simpler compilers and for languages that
don't allow interprocedural allocation (e.g., Lisp), it is surprising
that more machines don't use them.

/Jim

aglew@urbsdc.Urbana.Gould.COM (10/31/88)

>In article <313@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>>While I was at Sun, I don't remember there ever having been any effort
>>to reduce the depth of the kernel call stack in order to speed things up
>>on SPARC-based Suns...
>
>Um, Guy, a lot of us Sun customers concluded quite a while ago that
>performance (except on benchmarks) is not high on Sun's list of priorities.
>"Just buy a faster machine, we'll be happy to sell you one."
>-- 
>The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
>but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

Yep, SUN keeps making liars out of the people who say that the CPU
has ceased to be a bottleneck. :-)

guy@auspex.UUCP (Guy Harris) (11/01/88)

>This view is typical of hardware types.  By all means, let's pass the
>buck to the next guy.  So the compiler writer has his (her) share of
>nightmares actually getting the thing to compile some code.

Umm, I think compiler writers for CISC have their own headaches....

>And then the systems programmer comes along and inserts a few
>more kludges to make the machine purr.

Ditto....

>Now it is the application programmer's turn to s***w things up.
>If my memory serves me correctly, it is much easier to get something
>up and running on a Motorola 68000 than on an Intel 8086 (very nasty,
>those beasty little segments).

Well, perhaps a better comparison there would be between the 68K and the
80386; in that case, you can avoid dealing with the segments.  Given
that comparison, I don't see why it matters to the application program -
or, in a lot of cases, to the *systems* programmer; I've written OS code
that runs on the 68K, SPARC, IBM 370, and 80386 (and that would probably
run on a boatload of other architectures), and I didn't have to do any
extra work to make it work on them all - the C compiler did the work for
me. 

>And miracle of miracles, we learn that over 70% of computing costs are
>software.

A more interesting figure would be "for a given system, how much of the
*design* costs are hardware and how much are software."  I suspect a lot
of the "expensive software and cheap hardware" types are comparing the
*production* costs of the hardware with the *development* costs of the
software, which doesn't yield interesting results - why not compare the
production costs of the hardware and of the software, which would prove
that software is cheap and hardware is expensive....

(Then again, there's the question of whether microcode is software,
hardware, or both....)

>It seems like hardware types should be designing their end of the deal
>to reduce it at the other end.

No, both types should be designing their ends to reduce it at the bottom
line, which is, after all, what really counts.

guy@auspex.UUCP (Guy Harris) (11/01/88)

>Yep, SUN keeps making liars out of the people who say that the CPU
>has ceased to be a bottleneck. :-)

s/the CPU/memory/ - most of the slowness I've seen on Suns has been due
to paging.

bcase@cup.portal.com (Brian bcase Case) (11/01/88)

I wrote:
>> The almighty Compiler can save us from our sins!

>This view is typical of hardware types.  By all means, let's pass the
>buck to the next guy.

If that makes the system better (cheaper, faster, more reliable), then
we should pass the buck.

>And miracle of miracles, we learn that over 70% of computing costs
>are software.

The architecture has little if anything to do with this.  The compiler
is a comparatively-meager one-time investment, and getting a good one
is very important.  A good optimizing compiler is probably easier to
write for a simple machine (e.g. RISC) than for a complex machine.

>It seems like hardware types should be designing their end of the
>deal to reduce it at the other end.

This is the kind of thinking that got us machines like the VAX:  "Well,
I just *know* those ol' compiler guys are gonna love me 'cause I'm
giving them these bitchin' addressing modes and memory to memory
operations.  They won't mind that I'm squeezing out registers to save
code size and fit in the microcode.  Sure it's hard work, but I don't
mind, besides my boss keeps telling me I'm supposed to be designing my
end of the deal to reduce their work."  Never mind that the hardware guy
is thwarting their efforts to write a good compiler:  the next version
of the machine has instruction timings that completely change the
trade-offs of code generation.  Compiler guy:  "What do you mean
that addressing mode is now twice as slow!?  I spent the better part
of six months making the compiler use it to save code space!  I
think I'm gonna go work on ol' Lazy Larry's machine, at least I know
each instruction's execution time, and he won't change it next year!"

mph@praxis.co.uk (Martin Hanley) (11/01/88)

With the Acorn RISC Machine (ARM), *EVERY* instruction can be
conditional, not just the jumps.  If the instruction's condition is not
satisfied by the condition flags, the instruction is ignored.  Also, a
bit in the instruction flags whether or not to update the condition
codes when executing the instruction.

This setup has obvious advantages when it comes to preserving
pipelines, since the major bugbear of said pipelines is that every
jump causes them to be broken. This is circumvented to some extent by
the provision of delayed jumps (which the ARM also has), but not
entirely.
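A small, hedged example of what this buys: a compiler for such a machine
can turn a short if-then into a compare plus a conditionally-executed
instruction, so no branch (and no pipeline break) is needed.  The C
function and the ARM-style lines in the comment are illustrative only,
not actual compiler output:

	int iabs(int x)
	{
		return x < 0 ? -x : x;
	}

	/* On the ARM this can become, branch-free:
	 *	CMP	R0, #0		; set the condition flags from x
	 *	RSBLT	R0, R0, #0	; R0 = 0 - R0, executed only if x < 0
	 */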

Does any other machine have this feature? Anybody have comments on it?

							mph.

-----------------------------------------------------------------------------
"I'm not a god, I was misquoted"   - Lister, Red Dwarf

These are, of course, my opinions. Who else would want them?

My home:      mph@praxis.co.uk
-----------------------------------------------------------------------------

cprice@mips.COM (Charlie Price) (11/02/88)

In article <4759@pdn.UUCP> alan@pdn.UUCP (0000-Alan Lovejoy) writes:
>
>The 88k does a 32-bit integer multiply in 4 cycles (r3000 takes 13
>cycles, I believe).  A 32-bit integer divide takes the 88k 39 cycles
>(r3000 takes 36 cycles, I believe).  Of course, if either of the
>division operands is negative (signed division opcode), the 88k has to
>trap to a software routine to finish the division.  In other words,
>not all RISCs are wimps just because they don't have "complex
>instructions". 
>Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida.

The MIPS R2000 and R3000 have integer multiply/divide instructions,
but they are unlike the other main CPU instructions.
The source operands are in general purpose registers and
the result (64-bit product or 32-bit quotient and 32-bit remainder)
is written to a special pair of registers named HI and LOW.
There are instructions (MFHI MFLO) to move from HI and LOW to a
general register.
So why do it this (seemingly odd) way?

From the architecture spec:

  Multiply and divide operations are performed by a separate,
  autonomous execution unit.  After a multiply or divide operation
  is started, execution of other instructions may continue in parallel.
  The multiply/divide unit continues to operate during cache miss and
  other delaying cycles in which no instructions are executed.

  The number of cycles required for multiply/divide operations is
  implementation-dependent.  The MFHI and MFLO instructions are
  interlocked so that any attempt to read them before operations
  have completed will cause execution of instructions to be delayed
  until the operations finish.

  The table below gives the number of cycles required between a
  MULT, MULTU, DIV or DIVU operation and a subsequent MFHI or MFLO
  operation, in order that no interlock or stall occurs.

		MULT	MULTU	DIV	DIVU
  R2000		12	12	33	33
  R3000		12	12	33	33

Clearly in order to do something useful you
need to pick up at least one 32-bit portion of the result,
so in the best case you get a 13 cycle multiply and a
34 cycle divide.  If a stall occurs, it may complicate
restarting the pipeline and add an additional cycle.
By the way, it is worth noting that the 88000 4-cycle multiply
mentioned above only generates a 32-bit result...

The "why" of the MIPS architecture is that integer multiply/divide
is a sort-of coprocessor.
When a full multiply is necessary, it can be done faster than
with software-only and it may be possible to get other useful
work done while waiting for the result.
All of the work determining that this was a worthwhile feature
to add to the architecture was done long before I came to MIPS so
I can't comment on the basis for this decision (perhaps mash
will comment on that).

In practice, many things that seem to require multiply instructions
get turned into some sequence of inline shifts and adds.
Obviously the compiler makes some sort of decision about which
is "better" to use.
-- 
Charlie Price    cprice@mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

jkrueger@daitc.daitc.mil (Jonathan Krueger) (11/02/88)

In article <359@auspex.UUCP>, guy@auspex (Guy Harris) writes:
>I suspect a lot of the "expensive software and cheap hardware" types
>are comparing the *production* costs of the hardware with the
>*development* costs of the software

This is a good point.  However, a complete analysis adds up life cycle
costs and divides by number of runs or copies sold, arriving at each
unit's amortized costs.  This makes either custom hardware or software
look very expensive indeed.  Yet it may cost no more than general
purpose hardware or software, where the R&D is amortized over many
sales or runs.  More generally, I suspect that many "expensive
software and cheap hardware" types are comparing the per-unit costs of
the hardware with the unamortized costs of the software.

But since most of us customize software rather than hardware, this may
be a valid comparison.  If a vendor can sell you a workstation for
$10K, including its payback on his R&D, but it costs you $20K to
develop the software that will run on it, it seems valid to say that
the software cost more than the hardware.  If you run the same
software on ten such workstations, of course the opposite is true.

-- Jon
-- 

guy@auspex.UUCP (Guy Harris) (11/02/88)

>But since most of us customize software rather than hardware, this may
>be a valid comparison.

Although I doubt it is in this particular case.  Compiler and OS
development costs tend to be amortized over a base close in size to the
base over which hardware development costs are amortized, and that's
what the original poster was talking about.  (Arguably, the base is even
larger, since many hardware development costs are amortized over an
implementation of an architecture, rather than over the entire
architecture.)

I've not seen any good evidence that the added software
development/production/maintenance cost of a simpler architecture,
over that architecture's life cycle, outweighs the added hardware
development/production/maintenance cost of a more complex architecture
over its life cycle.

I'm also not convinced that there's a major added application
development cost for developing in sufficiently high-level languages on
RISC machines (I've heard some flames that there is, but many of those
problems appear to be due to sloppy coding habits combined with luck on
the part of the developers in the choice of CISC machines they ported
to).  I know that I had no great extra trouble getting C code, whether
kernel-mode or user-mode, working on SPARC as well as 68K; it tended to
run on the 80386 and 370 as well.  In cases where code failed on SPARC,
it often failed for reasons that would cause it to fail on some
CISC-based systems as well.... 

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/03/88)

In article <3264@newton.praxis.co.uk> mph@praxis.co.uk (Martin Hanley) writes:
>
>With the Acorn RISC Machine (ARM), *EVERY* instruction can be
>conditional, not just the jumps. If the condition flags are not set,
>then the instruction is ignored. Also, a bit in the instruction flags
>whether or not to reset the condition codes when executing the
>instruction.
>
>This setup has obvious advantages when it comes to preserving
>pipelines, since the major bugbear of said pipelines is that every
>jump causes them to be broken. This is circumvented to some extent by
>the provision of delayed jumps (which the ARM also has), but not
>entirely.
>
>Does any other machine have this feature? Anybody have comments on it?

The late Cydra 5 had 'em.  The condition bits could be (and often were)
set based on the data (thus the name "directed dataflow").  Since the
Cydra 5 had multiple instructions (6/7, depending on who counted) per
clock, and very long pipes (26 for memory fetch), these conditional
execution features were very valuable.
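
For anyone who hasn't seen it done, here is a small illustrative C
fragment (mine, not Keith's or Martin's): with fully conditional
instructions the compiler can "if-convert" the branch away, so the
pipeline never sees a taken jump.

	int iabs(int x)
	{
	    if (x < 0)   /* a branch-only machine must jump around the negate */
	        x = -x;  /* a fully conditional ISA can predicate the negate
	                    itself (e.g. ARM's RSBLT), leaving the pipe alone */
	    return x;
	}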


Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (11/03/88)

In article <754@wsccs.UUCP> dharvey@wsccs.UUCP (David Harvey) writes:
>In article <10194@cup.portal.com>, bcase@cup.portal.com (Brian bcase Case) writes:
>> The almighty Compiler can save us from our sins! 
>
>If my memory serves me correctly, it is much easier to get something
>up and running on a Motorola 68000 than on an Intel 8086 (very nasty,
>those beasty little segments).  And miracle of miracles, we learn that
>over 70% of computing costs are software.  It seems like hardware types
>should be designing their end of the deal to reduce it at the other end.
>
>dharvey@wsccs

Hmmm...  Bad example.  The 8086 is an extremely unorthogonal
architecture.  The 68000 isn't very orthogonal (there are two
different kinds of registers).  RISC chips tend to be extremely
orthogonal.  Thus, this example would suggest that RISC designers
are reducing the software complexity, and that it would be easier
to get something up and running on a RISC than on a 68000 (much
less an Intel chip).

-- Chuck

grow@druhi.ATT.COM (Gary Oblock) (11/04/88)

In article <7472@winchester.mips.COM>, cprice@mips.COM (Charlie Price) writes:
> In article <4759@pdn.UUCP> alan@pdn.UUCP (0000-Alan Lovejoy) writes:
> >
> >The 88k does a 32-bit integer multiply in 4 cycles (r3000 takes 13
> >cycles, I believe).  A 32-bit integer divide takes the 88k 39 cycles
> >(r3000 takes 36 cycles, I believe).  Of course, if either of the
> >division operands is negative (signed division opcode), the 88k has to
> >trap to a software routine to finish the division.  In other words,
> >not all RISCs are wimps just because they don't have "complex
> >instructions". 
> >Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida.
> 
> The MIPS R2000 and R3000 have integer multiply/divide instructions,
> but they are unlike the other main CPU instructions.
> The source operands are in general purpose registers and
> the result (64-bit product or 32-bit quotient and 32-bit remainder)
> is written to a special pair of registers named HI and LOW.
> There are instructions (MFHI MFLO) to move from HI and LOW to a
> general register.
> So why do it this (seemingly odd) way?
> 
       :
    Deleted (a bunch of hardware reasons)
> 
> In practice, many things that seem to require multiply instructions
> get turned into some sequence of inline shifts and adds.
> Obviously the compiler makes some sort of decision about which
> is "better" to use.
> -- 
> Charlie Price    cprice@mips.com        (408) 720-1700
> MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

 Another very good reason to do things this way is that your register
 allocation scheme does not have to deal with allocating register
 pairs.  When register allocation takes place at the intermediate
 code level (e.g. the Stanford U-Code compiling systems) the
 problem is especially evident.  As Fred Chow said in his section
 on the limitations of allocating at the intermediate level
 [page 71; A Portable Machine-Independent Global Optimizer--
 Design and Measurement; Chow, Frederick Chi-Tak; PhD 1984; Stanford
 University]:

   The requirements and effects of individual machine instructions
   cannot be taken into account. Such uses of registers arising out
   of instruction selection by code generators are not necessarily
   related to the register allocation decisions. When registers are
   globally allocated by the optimizer, intermixing of registers used
   by the optimizer and registers used by the code generator is not
   possible. ....stuff about redundant copies being introduced...

 In plain old-fashioned register allocators on machines that require
 the use of register pairs for multiplication and division, you have
 to treat these registers as special cases. This is done by reserving
 a pair of scratch registers for these operations and/or using clumsy
 heuristics that attempt to keep register pair(s) free when needed for
 these operations.

 All things considered, allocating registers on a machine with a
 special pair of result registers for multiplication and division
 (i.e. the MIPS) should be much easier and much more effective.

 Gary Oblock -- Compiler consultant to Bell Laboratories -- Denver, CO
                    (303)538-4169 -- att!druhi!grow

 Disclaimer -- I'm pretty sure they'll agree with me about this at MIPS.
               But,...  they're not my employers!

mash@mips.COM (John Mashey) (11/08/88)

In article <473@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
>In article <754@wsccs.UUCP> dharvey@wsccs.UUCP (David Harvey) writes:
>>......  And miracle of miracles, we learn that
>>over 70% of computing costs are software.  It seems like hardware types
>>should be designing their end of the deal to reduce it at the other end.

>Hmmm...  Bad example.  The 8086 is an extremely unorthogonal
>architecture.  The 68000 isn't very orthogonal (there are two
>different kinds of registers).  RISC chips tend to be extremely
>orthogonal.  Thus, this example would suggest that RISC designers
>are reducing the software complexity, and that it would be easier
>to get something up and running on a RISC than on a 68000 (much
>less an Intel chip).

Yes, for sure: it's absolutely weird that people have decided that
you must work much harder in software for RISCs:

IT'S NOT THAT YOU NEED MORE COMPLICATED SOFTWARE FOR A RISC THAN A
CISC, IT'S THAT WELL-DESIGNED RISCS OFFER MORE AND EASIER OPPORTUNITIES
FOR OPTIMIZATION THAN SOME CISCs:
	a) You spend less time on weird-case selection and analysis.
		(should I do an add, or should I use weird-address-mode XX?)
		(registers x,y,z, and Q are needed for funny-instruction ZZ)
	b) You usually have more registers, and more orthogonal ones,
		hence global register allocation has more to work with,
		and you can think about things like interprocedural
		allocation (because you have a useful number of registers),
		whereas if you've only got half-a-dozen, you don't
		even think about it.
	c) It is usually easier to do pipeline reorganization, given the
		above, plus lack of things like condition codes.  You don't
		HAVE to do much of this, but you can.

At MIPS, the first reorganizer was written in a couple of weeks.
We had C/Pascal compilers generating reasonable code BEFORE the architecture
was completely frozen (work really got started December 84/January 85,
and we could run generated+linked executables through our MIPS->VAX
object code converter around mid-year).  A compiled UNIX was running on
a simulator 11/85, and the compilers bootstrapped through themselves
successfully about the same time, including lots of optimization.
Admittedly, MIPS didn't start from scratch, but used the work from
Stanford.  Nevertheless, reasonable code was generated pretty early.

As a datapoint, consider that an 8MHz R2000 (in a "5-mips" M/500),
with global optimization omitted, and NoRegs, yields 8,800 1.1 Dhrystones,
which is nowhere near the 13,000 it gets with -O3....but still
faster than most CISC micro implementations.  So, good optimization
is worth having, but what you get without it isn't bad. (You also
get 5900 DP Kwhets on a 12.5MHz R2000, which likewise is not awful.)


In addition, let me observe that, certainly on machines of the
	MIPS, HP PA, 88K, SPARC ilk:
	a) It is probably simpler to write assembler code (at least compared
		with S/360s, PDP-11s, Vaxen, 3Bs, and 68Ks, in my experience
		and that of other people, not at MIPS, who've had experience
		with both MIPS and other architectures)
	b) The simpler machines often help things like:
		debuggers
		the object-code-to-object-code translators like those we
			or Ardent use.  (The profiling and architectural
			analysis of these are incredibly productive;
			they may exist on CISCs, but if so, not many.)
		object formats (like dense line number tables, not so
			easy to do with variable-length instructions)

Anyway, the bottom line is:
	In many RISCs, some hardware functionality has moved to software,
	but the REQUIRED software is mostly straightforward and modular
	(like doing * or / in software); it can get pretty tricky
	(like the more complex versions of doing these), but
	it's still modular.
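
For the curious, "doing * in software" is nothing exotic; a minimal
shift-and-add sketch (illustrative only, not anyone's actual library
routine) looks like this:

	unsigned soft_mul(unsigned a, unsigned b)
	{
	    unsigned product = 0;

	    while (b != 0) {
	        if (b & 1)            /* low multiplier bit set: add in the   */
	            product += a;     /* correspondingly shifted multiplicand */
	        a <<= 1;
	        b >>= 1;
	    }
	    return product;
	}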

	Clean RISCs give optimizing compilers more leverage, but it is
	pretty easy to generate at least reasonable code for them.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

martin@minster.york.ac.uk (11/09/88)

In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes:
>In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>...
>where n=[1,5].  To store a value from X<n>, you load the address into A<n>,
	 ^^^^^
actually it was n=[3,4,5] A1 and A2 were also general purpose.

>where n=[6,7].  A0 has no special values, and B0 is a hardwired 0.
>
>Ok, a few from the Wonderful World of the Cyber:
>
>	Count the set bits in a word (register or memory, it doesn't
>matter).  Very useful for some trivial applications (such as playing
>Othello), but I haven't seen much else done with it.  It was put in, it
>looks like, because the hardware was already there (in the form of parity
>checking), but I could be wrong.
>
>Sean Eric Fagan  | "Engineering without management is *ART*"
>seanf@sco.UUCP   |     Jeff Johnson (jeffj@sco)
>(408) 458-1422   | Any opinions expressed are my own, not my employers'.

The history of the population instruction is interesting: I was told (by
someone from CDC, whom I have every reason to believe knew what he was
talking about) that when Seymour Cray first designed the 6600 it did
not have the population instruction; when they showed the the machine to
the people at Los Alamos they said ``Great! We'll have 6, but only if
you add this instruction we need!''. Every machine that Seymour Cray
designed since has included the population instruction. (that was up to
the time I was told the story - can anyone verify this for the Cray-2, etc?)
It is interesting to note that the instruction set is quite nice and
regular (at the bit level), but the population instruction does not
fit into the pattern, also suggesting that it was an afterthought.
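
For those who haven't met it: population count just counts the one bits
in a word.  Done in software it is a short loop like the sketch below
(my illustration); the hardware instruction does the whole word at once.
Counting the bits of an exclusive-or of two words gives their Hamming
distance, which is one of the textbook uses.

	unsigned popcount(unsigned long x)
	{
	    unsigned n = 0;

	    while (x != 0) {
	        x &= x - 1;   /* clear the lowest set bit */
	        n++;
	    }
	    return n;
	}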

Note that the Cyber 170 is compatible with the 6600. It was an interesting
machine to program in assembler, as can probably be guessed from the above
description. However, PP (Peripheral Processor) programming was much more
fun, since you have unlimited access to the main memory of the Central
Processors - the base/limit memory protection only affected CP programs.
(I'm speaking in the past tense, not because there are no more of these
machines, but because, fortunately, I don't have to program one any more!
I've still got my Compass manuals though!!)

Martin C Atkins
...!ukc!minster!martin

PS
  I leave it as an exercise for the reader (is this wise?) to guess what
Los Alamos wanted the population instruction for! But they weren't interested
in Hamming distances, rather in making bits of Plutonium go bang!!

seanf@sco.COM (Sean Fagan) (11/20/88)

In article <595030314.2944@minster.york.ac.uk> martin@minster.york.ac.uk writes:
>In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes:
>>where n=[1,5].  To store a value from X<n>, you load the address into A<n>,
>	 ^^^^^
>actually it was n=[3,4,5] A1 and A2 were also general purpose.

Nope, sorry.  'SA1 X1' would load the first argument into X1 (assuming you
were using FORTRAN calling conventions).  Let's hope, however, that you only
had one argument 8-).  It would, btw, also load the address of the first
argument into A1, so that, when you were done, you do:
	BX6	X1
	SA6	A1
and the argument was stored.  Nice.  (also explains why a 5 can become a 3
and totally screw people up 8-).)

>>where n=[6,7].  A0 has no special values, and B0 is a hardwired 0.
>>
>Note that the Cyber 170 is compatible with the 6600. It was an interesting
>machine to program in assembler, as can probably be guessed from the above
>description. However PP (Peripheral Processor) programming was much more
>fun, since you have unlimited access to the main memory of the Central
>Processors - the base/limit memory protection only affected CP programs.
>(I'm speaking in the past tense, not because there are no more of these
>machines, but because, fortunately, I don't have to program one any more!
>I've still got my Compass manuals though!!)

True, PP's do (note the present tense 8-)) have unlimited access to the main
memory.  However, you had to have the System Access (or some such) bit set in
your Protection word (NOS has, I believe, several 60-bit words for
permissions.  Many permissions are in the form of one-bit values).  Then,
you rebuild the libraries (aka the system), and your PP program could be
loaded.  However:  PP's are 12-bit machines, with an Accumulator only.  To
address any arbitrary word, you a) had to build the address using the
P register and the K register (P was 12 bits, K was 6;  I think the names
are right), and b) had to make five passes, since you only had 12-bit words
in the PP.

Lastly, why wouldn't you want to program these machines?  They are
*wonderful*.  Anybody who has to learn assembly language should learn these
machines (they're about as regular as a PDP, but are much faster, and
prepare people for RISC)!

-- 
Sean Eric Fagan  | "Engineering without management is *ART*"
seanf@sco.UUCP   |     Jeff Johnson (jeffj@sco)
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

rik@june.cs.washington.edu (Rik Littlefield) (11/21/88)

In article <1762@scolex>, seanf@sco.COM (Sean Fagan) writes:
> In article <595030314.2944@minster.york.ac.uk> martin@minster.york.ac.uk writes:
> >In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes:
> >>where n=[1,5].  To store a value from X<n>, you load the address into A<n>,
> >	 ^^^^^
> >actually it was n=[3,4,5] A1 and A2 were also general purpose.
> 
> Nope, sorry.  'SA1 X1' would load the first argument into X1 (assuming you
> were using FORTRAN calling conventions).  Let's hope, however, that you only
> had one argument 8-).  It would, btw, also load the address of the first
> argument into A1, so that, when you were done, you do:
> 	BX6	X1
> 	SA6	A1
> and the argument was stored.  Nice.  (also explains why a 5 can become a 3
> and totally screw people up 8-).)

Not quite.  Sean is quite correct about A1-A5 being used for fetches.  But
the sequence SA1 X1 / BX6 X1 / SA6 A1 is a no-op as far as memory is
concerned -- it simply fetches a value from the address originally stored in
X1, then stores that value back *into the same address*.  Perhaps Sean meant
to imply that some other things happened to X1 in between the SA1 and the
BX6.  That would be inefficient, since the result could have been left in X6
directly, but I've seen it done.  BTW, that's the "FTN" calling sequence.
The earlier "RUN" Fortran compiler was completely different and used mostly
the B-registers.

> Lastly, why wouldn't you want to program these machines?  They are
> *wonderful*.  Anybody who has to learn assembly language should learn these
> machines (they're about as regular as a PDP, but are much faster, and
> prepare people for RISC)!

I agree completely -- no complicated addressing modes, a simple regular
instruction set, and (for their day) they ran like scalded dogs.  Having
just 18-bit addresses did get in the way, though.  Especially since only 17
of them were actually usable, and most machines weren't even that big!
(Also, as an aside, it's a bit amusing to see the number of posters who
revel in the simplicity of the instruction set ... and then get their
examples wrong ;-)

--Rik

martin@minster.york.ac.uk (11/21/88)

I wrote:
> In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes:
> >In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >>...
> >where n=[1,5].  To store a value from X<n>, you load the address into A<n>,
> 	 ^^^^^
> actually it was n=[3,4,5] A1 and A2 were also general purpose.

I'm sorry - I wrote this before checking the manual (always a bad move!).
The original poster was correct: A1 and A2 also load the X registers.

Many apologies to all concerned, and thanks to those who were kind enough
to point out my mistake by mail.

Martin

eriks@cadnetix.COM (Eriks Ziemelis) (11/22/88)

In article <6475@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>In article <1762@scolex>, seanf@sco.COM (Sean Fagan) writes:
>
>> Lastly, why wouldn't you want to program these machines?  They are
>> *wonderful*.  Anybody who has to learn assembly language should learn these
>> machines (they're about as regular as a PDP, but are much faster, and
>> prepare people for RISC)!
>
>I agree completely -- no complicated addressing modes, a simple regular
>instruction set, and (for their day) they ran like scalded dogs.  Having
>just 18-bit addresses did get in the way, though.  Especially since only 17
>of them were actually usable, and most machines weren't even that big!
>
>--Rik


Hear, hear! Even though my experience with the Cyber family was in college
(two 6500s and one 6600; one of the systems had the ECS), I loved it.  Still have 
copies of the manuals, and after having worked with Vaxen, 68K, PDP, et al.,
I still want to program a Cyber.  Almost took a job out of college doing just
that for an SDI (Star Wars) research company.  Oh well, we all make 
mistakes.

To anyone at Purdue: I heard a rumor that the assembly language programming
course (CS 300) is no longer Compass. Is this true?


Eriks A. Ziemelis


Internet:  eriks@cadnetix.com
UUCP:  ...!{uunet,boulder}!cadnetix!eriks
U.S. Snail: Cadnetix Corp.
	    5775 Flatiron Pkwy
	    Boulder, CO 80301
Baby Bell: (303) 444-8075 X221

smryan@garth.UUCP (Steven Ryan) (11/24/88)

today's trivia question:

what does compass abbreviate?
-- 
                                                   -- s m ryan
--------------------------------------------------------------------------------
As loners, Ramdoves are ineffective in making intelligent decisions, but in
groups or wings or squadrons or whatever term is used, they respond with an
esprit de corps, precision, and, above all, a ruthlessness...not hatefulness,
that implies a wide ranging emotional pattern, just a blind, unemotional
devotion to doing the job.....

dik@cwi.nl (Dik T. Winter) (11/24/88)

In article <1973@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
 > today's trivia question:
 > 
 > what does compass abbreviate?
 > -- 
COMPrehensive ASSembler?
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

seanf@sco.COM (Sean Fagan) (11/30/88)

In article <1973@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>today's trivia question:
>what does compass abbreviate?

COMPrehensive ASSembler.  Truly trivial.
Next trivia question:
What happens if you unplug the wire at row 13, column 43, in a Cyber
170/760?

(btw: 8-))
-- 
Sean Eric Fagan  | "Engineering without management is *ART*"
seanf@sco.UUCP   |     Jeff Johnson (jeffj@sco)
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

smryan@garth.UUCP (Steven Ryan) (11/30/88)

> > today's trivia question:
> > 
> > what does compass abbreviate?
> > -- 
>COMPrehensive ASSembler?

COMPrehensive ASsembly System, which is why the product ident is CPn instead
of CAn. (Product ident is CDC's 3 letter name for a product.)
-- 
                                                   -- s m ryan
+------------------------------------------------------------------------------+
|Good day-eh.                                                   Je me souviens.|
+------------------------------------------------------------------------------+

bcase@cup.portal.com (Brian bcase Case) (12/01/88)

>What happens if you unplug the wire at row 13, column 43, in a Cyber
>170/760?

Uh, it runs twice as fast?
Er, it emulates a 370?
Hmm, several city blocks may be affected?
Let's see, all student jobs fail?
I know!  The console displays "Greetings Seymour, it has been a long time."

phil@aimt.UUCP (Phil Gustafson) (12/03/88)

In article <1795@scolex>, seanf@sco.COM (Sean Fagan) writes:
> What happens if you unplug the wire at row 13, column 43, in a Cyber
> 170/760?

Well, you remove a disable line from a gate.  You enable a feature
whose name I forget (Environment Swap or something), gain tens
of thousands of 1965 dollars' worth of feature, and guarantee that
your service contract is null and void.

							phil
--
Opinions outside attributed quotations are mine alone.
Satirical material may not be labeled as such.
--
-- 
				Phil Gustafson, Graphics/UN*X Consultant
				{uunet,ames!coherent}!aimt!phil phil@aimt.uu.net
				1550 Martin Ave, San Jose, Ca 95126