[comp.arch] Today's dumb question...

greg@xios.XIOS.UUCP (Greg Franks) (03/24/88)

I really hate to break up this interesting discussion on MIPS and VIPS
et al, but a question has come to mind....

RISC generally implies single instruction per clock cycle, and a
load/store type architecture.  Now for _tightly coupled_
multiprocessing, one needs some sort of atomic test-and-set instruction. 
How do the various RISC chips provide this function, with LOCK prefixes,
or with some other technique?

Sign me
   Just curious...
-- 
Greg Franks                   XIOS Systems Corporation, 1600 Carling Avenue,
utzoo!dciem!nrcaer!xios!greg  Ottawa, Ontario, Canada, K1Z 8R8. (613) 725-5411.
       "Those who stand in the middle of the road get
               hit by trucks coming from both directions." Evelyn C. Leeper.

ard@pdn.UUCP (Akash Deshpande) (03/30/88)

In article <503@xios.XIOS.UUCP>, greg@xios.XIOS.UUCP (Greg Franks) writes:
>                                Now for _tightly coupled_
> multiprocessing, one needs some sort of atomic test-and-set instruction. 
> How do the various RISC chips provide this function, with LOCK prefixes,
> or with some other technique?
> Greg Franks

	RISC people (as I discovered at ASPLOS II, San Jose, Oct 87) would
	rather not speak of parallel processing. Reminds me of the ostrich.
	Ask them - "how are you going to maintain cache coherency, TLB
	flushing, access integrity, etc in a parallel processing system?"
	and they will say "why do you want parallel processing when one
	RISC machine is so much faster than even parallel CISCs?"

	I would prefer a philosophy that allows for clean parallelisability
	over any single cpu speedups.

							Akash
-- 
Akash Deshpande					Paradyne Corporation
{gatech,rutgers,attmail}!codas!pdn!ard		Mail stop LF-207
(813) 530-8307 o	 			Largo, Florida 34649-2826
Like certain orifices, every one has opinions. I haven't seen my employer's!

petolino%joe@Sun.COM (Joe Petolino) (03/31/88)

>RISC generally implies single instruction per clock cycle, and a
>load/store type architecture.  Now for _tightly coupled_
>multiprocessing, one needs some sort of atomic test-and-set instruction. 
>How do the various RISC chips provide this function, with LOCK prefixes,
>or with some other technique?

SPARC provides this function with an atomic test-and-set instruction.  From a
processor chip's point of view, 'store-but-look-at-the-old-contents' is not
much different from a simple 'store', so RISC does not necessarily preclude a
test-and-set instruction.  The really difficult part of interprocessor
synchronization comes at the system level, outside of the processor chip.
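
For what it's worth, a spin lock built on such an instruction looks roughly
like this (a C sketch; test_and_set() is just my stand-in for whatever atomic
primitive the hardware gives you, and the C function itself is only a model of
the semantics, not something that is actually atomic):

/* Model of the hardware primitive: atomically return the old contents
   of *lock and set it to 1.  The real atomicity comes from the
   instruction; this C function only shows the semantics. */
static int test_and_set(volatile int *lock)
{
    int old = *lock;
    *lock = 1;
    return old;
}

static void acquire(volatile int *lock)
{
    while (test_and_set(lock) != 0)
        ;                   /* spin until the old value was 0 (lock free) */
}

static void release(volatile int *lock)
{
    *lock = 0;              /* an ordinary store is enough to unlock */
}

int main(void)
{
    static volatile int lock;
    acquire(&lock);
    /* ... critical section ... */
    release(&lock);
    return 0;
}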

-Joe

garner@gaas.Sun.COM (Robert Garner) (03/31/88)

> Now for _tightly coupled_multiprocessing, one needs some sort of 
> atomic test-and-set instruction.  How do the various RISC chips
> provide this function, with LOCK prefixes, or with some other technique?

The SPARC instruction set includes two instructions for this purpose:

        LDSTUB  -  Load/Store Unsigned Byte  -  reads a byte from memory
                   and then rewrites the same byte to all ones (0xFF).
        SWAP    -  exchanges an integer register and a memory word.

LDSTUB and SWAP are currently implemented as multi-cycle operations.
Between the load and the store, the processor asserts a signal to
the memory (or I/O) bus that prevents intervening accesses.
(A precise requirement is that, if these instructions are
issued by more than one processor, they must execute in some serial order.)
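
In C-ish terms, the semantics are roughly the following (these are only
models -- the real instructions execute each function body as a single
indivisible operation, which plain C obviously does not):

/* Models of what the two instructions do. */
static unsigned char ldstub(volatile unsigned char *p)
{
    unsigned char old = *p;
    *p = 0xFF;              /* rewrite the byte to all ones */
    return old;             /* an old value of 0 means the lock was free */
}

static int swap(volatile int *p, int r)
{
    int old = *p;
    *p = r;                 /* exchange register value and memory word */
    return old;
}

int main(void)
{
    static volatile unsigned char lock;
    volatile int word = 5;

    while (ldstub(&lock) != 0)          /* LDSTUB as a test-and-set lock */
        ;
    lock = 0;                           /* release is a plain store of 0 */

    return swap(&word, 7) == 5 ? 0 : 1; /* SWAP hands back the old word */
}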

Assuming a specialized memory system that includes an arithmetic unit,
SWAP can also implement the Fetch_and_Add instruction.


On another subject, I recall some confusion in an old msg about SPARC's
multiply-step instruction (MULScc).  The author thought that
MULScc was limited to signed multiplies.  This is certainly not
true:  MULScc implements both signed and unsigned 32x32 multiplies.

[BTW, I noticed that the Am29000 has three instructions--Multiply Step
(MUL), Multiply Last Step (MULL), and Multiply Step Unsigned (MULU).
MULU and MULL seem unnecessary since the fix up for
32x32 unsigned multiplies (or a negative multiplier
in the case of signed multiplies) requires only 3 cycles.] 
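
For the curious, here is the algebra behind the unsigned fix-up, written as
plain C rather than as multiply steps (the widths and the function name are
mine, not the Am29000's):

#include <stdint.h>
#include <stdio.h>

/* The unsigned 32x32->64 product recovered from the signed product plus
   two conditional adds -- the fix-up in question. */
static uint64_t umul32_via_signed(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)((int64_t)(int32_t)a * (int32_t)b);
    if ((int32_t)a < 0) p += (uint64_t)b << 32;  /* correct for a's sign bit */
    if ((int32_t)b < 0) p += (uint64_t)a << 32;  /* correct for b's sign bit */
    return p;
}

int main(void)
{
    uint32_t a = 0xFFFFFFF0u, b = 0x80000003u;
    puts(umul32_via_signed(a, b) == (uint64_t)a * b ? "fix-up ok"
                                                    : "fix-up wrong");
    return 0;
}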


On yet another subject, am I still correct in believing that AMD's
Am29C327 floating-point coprocessor does NOT directly execute
the Am29000 floating-point instruction set?  In other words,
must Am29000 instructions such as FMUL and FEQ be emulated
via a trap handler?  Wouldn't this make them too slow?
-----------------------------------
          Robert Garner

ARPA:  garner@sun.com
UUCP:  {ucbvax,decvax,decwrl,seismo}!sun!garner
Phone: (415) 960-1300  or  (415) 691-2125

webber@porthos.rutgers.edu (Bob Webber) (03/31/88)

In article <47649@sun.uucp>, garner@gaas.Sun.COM (Robert Garner) writes:
> ...
> On yet another subject, am I still correct in believing that AMD's
> Am29C327 floating-point coprocessor does NOT directly execute

How does the 327 differ from the Am29027?  The floating-point
instructions trapping on the Am29000 looked a bit odd to me.  Is the
notion that it is important to standardize the interface to
floating-point stuff so that people can buy floating-point chips later
and not have to recompile?  Or is it that one would want to
``hardwire'' the coprocessor interactions that are currently being
done at trap when the chip space becomes available?

--- BOB (webber@athos.rutgers.edu ; rutgers!athos.rutgers.edu!webber)

[By the way, I have been pondering the SPARC and Am29000 chips for a
while now trying to figure out if it is plausible to build a simple
home computer around them.  If anyone has references that talk about
the sort of glue that holds together a board with such a processor, 1
or 2 SCSI ports, 1 or 2 serial ports, and some static ram, I would
certainly be interested.]

walter@garth.UUCP (Walter Bays) (04/01/88)

In article <2676@pdn.UUCP> ard@pdn.UUCP (Akash Deshpande) writes:
>	RISC people (as I discovered at ASPLOS II, San Jose, Oct 87) would
>	rather not speak of parallel processing. Reminds me of the ostrich.
>	Ask them - "how are you going to maintain cache coherency, TLB
>	flushing, access integrity, etc in a parallel processing system?"
>	and they will say "why do you want parallel processing when one
>	RISC machine is so much faster than even parallel CISCs?"

Most RISC computers are, in a limited sense, multiprocessors, because
you'll want at least an 80286 or something for an IOP.  The Clipper has
bus-watch hardware in the CAMMUs (Cache and Memory Management Units) to
assure cache and TLB consistency in copy-back mode.  And yes, there is
a test-and-set instruction.  Of course, you can't put too many
Clippers (or any fast CPUs) on a bus before it saturates...  I'd rather
not talk about it :-)
-- 
------------------------------------------------------------------------------
Any similarities between my opinions and those of the
person who signs my paychecks is purely coincidental.
E-Mail route: ...!pyramid!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
Phone: (415) 852-2384
------------------------------------------------------------------------------

jpa@celerity.UUCP (Jeff Anderson) (04/01/88)

In article <2676@pdn.UUCP> ard@pdn.UUCP (Akash Deshpande) writes:
>	RISC people (as I discovered at ASPLOS II, San Jose, Oct 87) would
>	rather not speak of parallel processing. Reminds me of the ostrich.
>	Ask them - "how are you going to maintain cache coherency, TLB
>	flushing, access integrity, etc in a parallel processing system?"
>	and they will say "why do you want parallel processing when one
>	RISC machine is so much faster than even parallel CISCs?"

The Celerity 6000 has many RISC attributes, and was designed as a
multiprocessor.  The instruction set of all of Celerity's processors
includes separate fetch and receive instructions, which provide an easy
mechanism for implementing semaphores in hardware without "special"
instructions.  

Basically there is a semaphore "box" in the memory address space of the
processor (which incidentally is implemented as 4 semaphores per 16KB
page, in order that they can be assigned 4-at-a-time to the virtual
address space of a user process).  Each semaphore in the box has two
"registers", a "content register" for reading and initializing, and an
"access register" for P'ing and V'ing.  V's are stores to the access
register, and P's are fetches from the access register.  The data returned
indicates the semaphore state for determining whether the processor
should wait.  I won't give you any more specifics since it is patentable,
but it's pretty fast to execute, and easy to implement.
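
To give a feel for the shape of the interface (and only the shape -- the
layout, the address, and the meaning of the returned value below are invented
for illustration, not our actual design), software might see something like:

#include <stdint.h>

/* Invented illustration of a memory-mapped semaphore -- NOT the actual
   (patentable) Celerity layout. */
struct sem {
    volatile uint32_t content;   /* read / initialize the count */
    volatile uint32_t access;    /* a load here is a P, a store is a V */
};

#define SEM_BOX ((struct sem *)0x80001000)   /* hypothetical box address */

int P(struct sem *s)
{
    /* The single fetch is the whole P; the value it returns tells the
       processor whether it got the semaphore or should wait. */
    return s->access != 0;       /* 1: proceed, 0: caller must wait */
}

void V(struct sem *s)
{
    s->access = 1;               /* any store to the access register is a V */
}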

This semaphore implementation doesn't violate any RISC principle since
the instruction set does not need to be tweaked and most of the work is
done in special off-processor memory.  Cache coherency and TLB flushing
present problems which, understandably, RISC chip designers would rather
not commit to silicon;  but there ARE fast, off-chip solutions.  Maybe
you're talking to the wrong group?

We don't buy the argument that RISC people "don't care" about
multiprocessing.  Celerity does.

			-- The "J" Team at Celerity Computing
					JJ Whelan
					Jeff Anderson

					ucsd!celerity!jpa
					ucsd!celerity!jjw

					619-271-9940

mikep@amdcad.AMD.COM (Mike Parker) (04/02/88)

In article <Mar.31.04.58.35.1988.1216@porthos.rutgers.edu> webber@porthos.rutgers.edu (Bob Webber) writes:
>
>How does the 327 differ from the Am29027?  The floating-point
>instructions trapping on the Am29000 looked a bit odd to me.  Is the
>notion that it is important to standardize the interface to
>floating-point stuff so that people can buy floating-point chips later
>and not have to recompile?  

Yes.

>Or is it that one would want to
>``hardwire'' the coprocessor interactions that are currently being
>done at trap when the chip space becomes available?
>
>--- BOB (webber@athos.rutgers.edu ; rutgers!athos.rutgers.edu!webber)
>
No comment, except that if one were to find the chip space and hardwire
what are now traps, wouldn't that relate back to the "not have to
recompile" question?

>[By the way, I have been pondering the SPARC and Am29000 chips for a
>while now trying to figure out if it is plausible to build a simple
>home computer around them.  If anyone has references that talk about
>the sort of glue that holds together a board with such a processor, 1
>or 2 SCSI ports, 1 or 2 serial ports, and some static ram, I would
>certainly be interested.]


Oh, no.  SRAM is too expensive for a PC.  The Am29000 memory design handbook
ought to be available real soon with a couple of inexpensive memory systems
that still provide better performance than a Sun 4.  (Given some SRAM to
build a cache, the Am29000 averages twice the performance of the Sun 4,
including floating-point with those "slow" traps).  Talk to phil@amdcad.amd.com
about his $1200 PC accelerator performance.

mikep

-- 
-------------------------------------------------------------------------

 UUCP: {ucbvax,decwrl,ihnp4,allegra}!amdcad!mike
 ARPA: amdcad!mike@decwrl.dec.com

dougj@rosemary.Berkeley.EDU.berkeley.edu (Doug Johnson) (04/02/88)

 >In article <503@xios.XIOS.UUCP>, greg@xios.XIOS.UUCP (Greg Franks) writes:
 >>                                Now for _tightly coupled_
 >> multiprocessing, one needs some sort of atomic test-and-set instruction. 
 >> How do the various RISC chips provide this function, with LOCK prefixes,
 >> or with some other technique?
 >> Greg Franks

 >RISC people (as I discovered at ASPLOS II, San Jose, Oct 87) would
 >rather not speak of parallel processing. Reminds me of the ostrich.
 >Ask them - "how are you going to maintain cache coherency, TLB
 >flushing, access integrity, etc in a parallel processing system?"
 >and they will say "why do you want parallel processing when one
 >RISC machine is so much faster than even parallel CISCs?"

 >I would prefer a philosophy that allows for clean parallelisability
 >over any single cpu speedups.

Take a look at the SPUR project being done at UC Berkeley.  (There is
an overview article in the November 1986 issue of Computer.)  It is a
RISC designed to do tightly coupled, coarse-grained multiprocessing.
It addresses cache coherency (with snoopy caches) and TLB flushing (there
is no TLB; address translation is done in the cache), and it has a
test_and_set, etc.   -- Doug

vandys@hpindda.HP.COM (Andy Valencia) (04/02/88)

	From all I can tell from the various MPs which have been
built (or died trying... :->), the real point isn't which atomic
ops you offer; it's how you make multiple CPUs live together
in the same memory domain--cache consistency, bus bandwidth, and
memory cycle time.  I don't recall ever reading a post-mortem which
said "if we'd only used synchronization mechanism X instead of Y,
we would have been golden".  Most of them instead talked about the
unpleasant bus and memory traffic characteristics which show up
when you try to get a respectable number of processors going
simultaneously.  And never underestimate what cache consistency
is going to cost you.

				Andy Valencia

hjm@cernvax.UUCP (hjm) (05/09/88)

Dear All,

     I see the thorny subjects of RISC v. CISC and scalar v. vector have reared
their ugly heads again, but in a different guise - multiprocessing!

     Allow me to point out some of the ENGINEERING issues involved:

	- the cost of a computing system is primarily a function of size, 
	  weight and the number of chips or pins;

	- to go really fast and to be efficient, the hardware should be simple;

     So what am I trying to point out?  Merely that a large amount of hardware
in present-day machines is there because of difficulties in software.  For
example, take the common-place example of your local UNIX or VMS box.  Inside
these beasts is a *lot* of hardware to keep one user away from his fellow
hackers.  An equally large amount of hardware is provided for the demand-paged
virtual memory system.  Add to that a healthy(?) helping of cache chippery and
what do you get - yes, a machine built upon boards the size of a small squash
court!  None of this hardware is simple, and applies to both the uniprocessor
and the multiprocessor case.

     Now, add in the magic multiprocessor devices and all hell breaks loose on
the hardware front (not to mention the software - groan).  Everyone's favourite
trick seems to be finding ever more complicated ways of getting large numbers of
CPUs to talk to the memory all at once.  Just imagine an ever-increasing number
of waiters trying to get in and out of the same kitchen all at once through one
door, and you can see the mess.  OK, let's increase the number of doors ... 
in hardware terms this means separating the memory into several pages which can
be accessed simultaneously, thereby increasing the effective bandwidth of 
the memory.  Is this really admitting that shared memory is not necessary?
Surely the highest bandwidth is achieved when each processor has its own memory
which it shares with no one else?  It also makes the hardware a lot smaller.

     To summarise all of this in a few points:

	- virtual memory is useful only when an application won't fit in
	  physical memory.  But memory is cheap, so with lots of Mbytes
	  who needs it, especially if the program is written well.

	- multi-user machines are too complicated to be both fast and simple.

	- shared-memory is not necessary; it's a software issue that shouldn't
	  be solved in hardware.

     For example, 10 MIPS of computation with 4 MB of ECC RAM can be placed on
a single 4" x 6" Eurocard.  Add multi-user support, virtual memory or multiple
CPUs and the board looks like a football pitch in comparison.  Guess which is
cheaper as well!

Remember,

  S I M P L E    =   F A S T   =   E F F I C I E N T   =   C H E A P.

------------------------------------------------------------------------------

	Hubert Matthews (software junkie, surprisingly enough)

------------------------------------------------------------------------------

#include <disclaimer.h>

crowl@cs.rochester.edu (Lawrence Crowl) (05/10/88)

In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>	- virtual memory is useful only when an application won't fit in
>	  physical memory.  But memory is cheap, so with lots of Mbytes
>	  who needs it, especially if the program is written well.

Here are some counter-examples.  Others can provide more.  The benefit varies
with your application.

inter-process protection - If a process cannot address memory belonging to
    another, then it cannot trash it.  Even when security is not an issue,
    correctness is.  I prefer not to see errant programs trashing others.
    There are other approaches.

copy on write - Virtual memory allows one to implement copy semantics without
    actually copying the data.  This makes Unix fork() and the passing of very
    large messages more efficient.

single level store - One can manage huge amounts of data within a virtual
    address space without having to write file access code.  The virtual
    address space becomes an easy file system.  The operating system ensures
    that the portions I am actually working with are present.  Does this mean
    my application "won't fit"?  Well, if I knew I had to use real memory, I
    would use memory in a much more conservative manner.  Since virtual memory
    affects my programming style, it is not a does/does not fit question.

tagged addresses - Some implementers of Lisp use some bits of a large sparse
    address space to implement tags.  If the address space were physical, it
    would require gigabytes of physical storage.  Memory is not that cheap.
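
A sketch of the trick (the particular tag field and tag assignments here are
invented for illustration):

#include <stdint.h>
#include <stdio.h>

/* A Lisp-style tag kept in the top byte of a 32-bit virtual address.
   Each tag selects a 16 MB region of a large, sparse address space;
   only the pages actually touched need physical memory behind them. */
#define TAG_SHIFT  24
#define ADDR_MASK  0x00FFFFFFu
#define TAG_FIXNUM 0x01u
#define TAG_CONS   0x02u

static uint32_t make_ref(uint32_t tag, uint32_t addr)
{
    return (tag << TAG_SHIFT) | (addr & ADDR_MASK);
}

static uint32_t tag_of(uint32_t ref)  { return ref >> TAG_SHIFT; }
static uint32_t addr_of(uint32_t ref) { return ref & ADDR_MASK; }

int main(void)
{
    uint32_t r = make_ref(TAG_CONS, 0x001234u);
    printf("tag %u, offset 0x%06x\n", (unsigned)tag_of(r), (unsigned)addr_of(r));
    return 0;
}
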
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

grunwald@uiucdcsm.cs.uiuc.edu (05/10/88)

VM offers more than protection from errant programs. You can use it to:

	+ do cheap heap-based allocation. See the DEC-SRC report by Li
	  and Appel (and someone else) on a heap-based allocation
	  scheme which uses page-level protection to make heap allocation
	  almost as efficient as stack-based allocation.  (A sketch of
	  the guard-page idea follows this list.)

	+ Process migration. See Zayas (sp?) in the last SOSP. Using
	  demand paging for process migration *even given crappy hardware*
	  was a big win.

	+ Cheap memory copies, less memory fragmentation, etc.

	+ You *need* paging hardware for the software solutions to
	  shared memory hardware.
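
Here's the promised sketch of the guard-page trick from the first item, using
modern mmap/mprotect/signal calls for convenience.  It shows only the shape of
the idea, not the actual mechanism from the report:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocation is a pointer bump; running off the end of the heap hits a
   protected guard page and traps into a handler where a collector (or
   heap growth) would go. */
static char *heap, *next;
static const size_t HEAP_SIZE = 1 << 20;

static void on_fault(int sig)
{
    static const char msg[] = "hit guard page: collect or grow heap here\n";
    (void)sig;
    write(2, msg, sizeof msg - 1);
    _exit(1);
}

static void *alloc(size_t n)
{
    void *p = next;
    next += (n + 7) & ~(size_t)7;        /* bump pointer, 8-byte aligned */
    return p;
}

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);

    heap = mmap(NULL, HEAP_SIZE + pg, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED) { perror("mmap"); return 1; }
    mprotect(heap + HEAP_SIZE, pg, PROT_NONE);   /* the guard page */
    signal(SIGSEGV, on_fault);
    next = heap;

    for (;;)                             /* allocate until we hit the guard */
        ((char *)alloc(64))[0] = 1;
}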

Also, I want a single-user, multi-programming machine. That means I *still*
need protection -- from myself.

As for caches being expensive: well, if you presume that SRAM costs drop
faster than cache-controller costs, maybe. Even if you put 32Mb in a
system, it's not going to help if you can only afford 32Mb of 150ns DRAM
but you need 25ns SRAM to pump your system.

glennw@nsc.nsc.com (Glenn Weinberg) (05/11/88)

In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>
>     To summarise all of this in a few points:
>
>	- virtual memory is useful only when an application won't fit in
>	  physical memory.  But memory is cheap, so with lots of Mbytes
>	  who needs it, especially if the program is written well.
>
>	- multi-user machines are too complicated to be both fast and simple.
>
>	- shared-memory is not necessary; it's a software issue that shouldn't
>	  be solved in hardware.
>
>     For example, 10 MIPS of computation with 4 MB of ECC RAM can be placed on
>a single 4" x 6" Eurocard.  Add multi-user support, virtual memory or multiple
>CPUs and the board looks like a football pitch in comparison.  Guess which is
>cheaper as well!
>

I beg your pardon?  There are VME boards available today that contain an
NS32532 (a 10-MIPS processor) with 64KB of cache and 4MB of memory, support
Unix* System V Release 3 (which is a multi-user system, of course), virtual
memory, and can be combined into a multiprocessor configuration.  I do
believe that a double-height VME board is slightly smaller than a "football
pitch" (the actual dimensions are 6" x 9").  Furthermore, put the board
into a VME chassis with a SCSI controller, a 5-1/4" hard disk, a cartridge
tape and an Ethernet board and you have one hell of a system in a box that's
about the size of the proverbial breadbox for less than $20,000.

Sure, you can argue that supporting multi-user environments and virtual
memory costs you something, but there are very, very few real-world
situations in which you have no need to interact with other systems and
people.  You simply can't do that unless you have a system which both allows
you to have that interaction and protects you from the (un)intentional
dangers of the outside world.  Not to mention the other benefits you
get from a multi-user, multi-tasking operating system such as Unix.

In summary, unless your system is used only as a dedicated processor
that does no interaction with human beings, the advantages of a multi-user
virtual memory (or at least memory-protected) environment significantly
make up for any increase in cost or board space.

-- 
Glenn Weinberg					Email: glennw@nsc.nsc.com
National Semiconductor Corporation		Phone: (408) 721-8102

lgy@pupthy2.PRINCETON.EDU (Larry Yaffe) (05/11/88)

In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
	[[ much stuff about "simple" machines deleted ]]

>	- virtual memory is useful only when an application won't fit in
>	  physical memory.  But memory is cheap, so with lots of Mbytes
>	  who needs it, especially if the program is written well.

    I find this claim completely bogus.
    Especially when discussing future architectures for
    high performance machines (a major topic of this newsgroup).
    Real, worthwhile uses of more memory than you will
    ever be able to afford exist in many, many areas.

    My view is that "memory is ALWAYS expensive".
    The price in $/Mb is completely irrelevant, since 
    cheaper memory simply increases the range of interesting
    problems which become practical to pursue.
    I would include this statement as one of the "laws" of computer science.

    Certainly, when designing new machines/software/languages, I would
    argue that the goal should always be to accommodate applications
    larger than are practical today.  (For this reason, I find
    "dataflow" languages hopeless - they waste too much memory.)

>	Hubert Matthews (software junkie, surprisingly enough)

------------------------------------------------------------------------
Laurence G. Yaffe			lgy@pupthy.princeton.edu
Department of Physics			lgy@pucc.bitnet
Princeton University			...!princeton!pupthy!lgy
PO Box 708, Princeton NJ 08544		609-452-4371 or -4400

bertil@carola.uucp (Bertil Reinhammar) (05/11/88)

In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>
>	- the cost of a computing system is primarily a function of size, 
>	  weight and the number of chips or pins;
>
From the hardware engineer's point of view, yes, but not when considering
a complete system including S/W.
>
>	- to go really fast and to be efficient, the hardware should be simple;
>
As a matter of fact, pipelines can be faster and more efficient with added
delay units which don't really simplify matters...
>
>     So what am I trying to point out?  Merely that a large amount of hardware
>in present-day machines is there because of difficulties in software. ...
Hmmm.
>... Inside
>these beasts is a *lot* of hardware to keep one user away from his fellow
>hackers.  An equally large amount of hardware is provided for the demand-paged
>virtual memory system.  Add to that a healthy(?) helping of cache chippery...
>
You imply that memory management hardware can securely be replaced by a good
piece of S/W, had we the appropriate tools?  The same comment applies to VM!
And do you really mean that software can provide the efficiency gained from a
cache!?  Either I'm pretty stupid or you must restate your points more clearly.
I don't get ANY point...
>
>	- virtual memory is useful only when an application won't fit in
>	  physical memory.  But memory is cheap, so with lots of Mbytes
>	  who needs it, especially if the program is written well.
>
And what if I have a lot of 'concurrent' processes?  I DON'T like swap-time
delays, but disk is by far cheaper than RAM (don't you know?).  Also, I like
to provide the entire address space to each of the running processes. This
requires VM regardless of the quality of the program(mer)s.
>
>	- multi-user machines are too complicated to be both fast and simple.
>
You hit right on the usual tradeoff stuff.
>
>	- shared-memory is not necessary; it's a software issue that shouldn't
>	  be solved in hardware.
>
!!!!

My opinion:

 - A computer program is really a virtual machine. The real machine (H/W)
   actually kind of interprets your object code. OK, well-known stuff.
   So how do you expect it to be more efficient to execute a number of
   instructions to manage memory/protection/speed/etc. problems with all
   semaphores and such, when a piece of hardware can fix it in a few cycles?

 - In general: the basic reason (as I see it) to have software at all is
   FLEXIBILITY. Special-purpose hardware is ALWAYS faster than general-purpose
   hardware in solving the intended problem.
   Software is not cheap to produce (just calculate on your own salary :-).
   So the real trick is TRADEOFF. We have a price/performance ratio to take
   care of. Just having good software tools and languages will not solve that
   part.


-- 
Dept. of Electrical Engineering	     ...!uunet!mcvax!enea!rainier!bertil
University of Linkoping, Sweden	     bertil@rainier.se, bertil@rainier.UUCP

hankd@pur-ee.UUCP (Hank Dietz) (05/12/88)

In article <674@cernvax.UUCP>, hjm@cernvax.UUCP (hjm) writes:
> 	- shared-memory is not necessary; it's a software issue that shouldn't
> 	  be solved in hardware.

Shared memory's full name is "shared memory address space" -- it means
simply that some portion of the memory is addressible by more than one
processor.  In other words, it says that although memory may be physically
distributed, and may have access times which depend on the physical
structure as well as on bus/network traffic conditions, the WAY in which it
is referenced appears as a conventional load/store on an address.

The alternative is to create a MESSAGE which REQUESTS THAT SOMETHING ELSE
REFERENCE the desired memory location.  How do you create a message?  Well,
maybe a GET/PUT instruction or somesuch, but the key idea is that you're
sending the message TO SOME ACTIVE ELEMENT, not to a memory address.
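
To make the contrast concrete, here's a toy single-process C illustration of
the two styles (the "nodes", the request format, and do_request() are
inventions for the sketch, not any real message-passing interface):

#include <stdio.h>

#define NODES 2
static int memory[NODES][4];            /* each node's own memory */

/* Shared address space: a non-local word is read with an ordinary
   load (its latency may just be longer). */
static int shared_read(int node, int word)
{
    return memory[node][word];
}

/* Message passing: build a GET request and hand it to the active
   element that owns the word; it does the load and replies. */
enum op { GET };
struct request { enum op op; int word; };

static int do_request(int owner, struct request r)
{
    return r.op == GET ? memory[owner][r.word] : -1;
}

static int message_read(int owner, int word)
{
    struct request r = { GET, word };
    return do_request(owner, r);        /* "send", then "receive" reply */
}

int main(void)
{
    memory[1][2] = 42;
    printf("shared-memory read: %d\n", shared_read(1, 2));
    printf("message-based read: %d\n", message_read(1, 2));
    return 0;
}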

As for which is better: because most message-passing systems only use
messages to access non-local memory, one must distinguish between local and
non-local references at compile time to generate efficient code.
Unfortunately, this is not always possible, and since the shared-memory
model doesn't require that this distinction be made at compile time, it is in
some sense more powerful.  The implementation difficulty usually depends on
how you connect to memory:  for word transfers, shared-memory is easier; for
longer block transfers, messages are easier...  for the obvious reasons.

Shared memory DOES NOT MEAN CONSTANT ACCESS TIME independent of memory cell
addressed -- if that's your "software" definition of shared-memory, forget
it, because no highly-parallel machine will *ever* support that.  Ok, maybe
your software would still run if you wrote it with that assumption, but it
ain't gonna run fast, and that is what parallel processing is all about.

						-hankd

jim@belltec.UUCP (Mr. Jim's Own Logon) (05/13/88)

In article <674@cernvax.UUCP>, hjm@cernvax.UUCP (hjm) writes:
> 
> Dear All,
> 
>      I see the thorny subjects of RISC v. CISC and scalar v. vector have reared
> 
> 	- the cost of a computing system is primarily a function of size, 
> 	  weight and the number of chips or pins;
> 
> 	- to go really fast and to be efficient, the hardware should be simple;
> 
> 

    This is quite incorrect. The cost breakdown of any standard computer 
system is  power supply, hard disk, enclosure, memory (if you have a lot), 
burdened assembly cost, processor, other.  The actual cost difference 
between a Z80 and a 386 is minimal. They cost so much more because the 
entire system is upscaled: bigger power supply, large hard disk, etc.  A
386 PC which costs $1000 to build, only has $100 or so in logic (discounting
the CPU and memory). None of the major cost has anything to do with 
multiprocessing.  (I won't even waste time arguing that you can do 
multiprocessing on a single chip micro. It's the software that is complex
for multiprocessing, not the hardware.)

						-Jim Wall
						Bell Technologies Inc.

					...{ames,pyramid}!pacbell!belltec!jim

jkrueger@daitc.ARPA (Jonathan Krueger) (05/13/88)

>In article <674@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>>	- virtual memory is useful only when an application won't fit in
>>	  physical memory.  But memory is cheap, so with lots of Mbytes
>>	  who needs it, especially if the program is written well.

In article <9558@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>Here are some counter-examples.  Others can provide more.

Two more:

avoiding memory fragmentation - virtual memory management provides a
way for multiple processes to share the physical store, cleanly and
without performance bottlenecks.  New processes start and grow all the
time, multiple requirements for space vary dynamically, each is
satisfied efficiently to the limits of available physical memory.
Even when physical memory is cheap, processor time to manage it is
not.

preventing unnecessary i/o - virtual memory systems need not load in
an entire image, thus performing fewer disk-to-memory reads per
execution, an advantage in a development cycle, among other places.
Even when physical memory is cheap, i/o bandwidth to fill it with
stuff copied from disk is not.

-- Jon

terry@wsccs.UUCP (Every system needs one) (05/17/88)

In article <674@cernvax.UUCP>, hjm@cernvax.UUCP (Hubert Matthews) writes:
> Dear All,
> 
>      I see the thorny subjects of RISC v. CISC and scalar v. vector have
> reared their ugly heads again, but in a different guise - multiprocessing!
> 
>      Allow me to point out some of the ENGINEERING issues involved:
> 
> 	- the cost of a computing system is primarily a function of size, 
> 	  weight and the number of chips or pins;
> 
> 	- to go really fast and to be efficient, the hardware should be simple;
> 
>      So what am I trying to point out?  Merely that a large amount of hardware
> in present-day machines is there because of difficulties in software.

Let me point out that the entire purpose of hardware is to run the software.
Further, difficulties in software can generally be divided into 2 parts:

	1)	Difficulties based in the inherent complexity of a
		software structure.  (We will ignore this, as some
		structures cannot be easily reduced.)
	2)	Difficulties caused by inadequate/bad hardware design...
		such as multiple instructions to cause a commonly
		desired result rather than a single instruction, badly
		implemented flags/overflow/branching/testing, etc.

> For example, take the common-place example of your local UNIX or VMS box.
> Inside these beasts is a *lot* of hardware to keep one user away from his
> fellow hackers.

Let me point out that this is largely a policy and/or design issue, NOT one
of software.

	1)	The concept of "keeping a user away" from data is one more
		of philosophy than of necessity.  Those people who
		refer to themselves as hackers, and those that refer to
		them as hackers using the word correctly, generally are not
		the security problem on systems, unless sensitive data needs
		to be protected from outside eyes as well as damage.  People
		who damage/destroy/alter data are not known as hackers; they
		are known as assholes.  There are many good reasons why data
		should be kept away from all but a select group of people:
		national security, to name one.
	2)	The requirement of "a *lot* of hardware" is a silly one which
		has been imposed by hardware designers refusing to attend to
		security issues which are better relegated to hardware.
		Instead, security is generally implemented as additional
		hardware not because it is necessary to do so, but because
		hardware designers have yet to see the merits of VLSI when
		applied to anything other than a CPU or its support chips;
		most hardware security measures could easily be implemented
		in a single chip or in software.

> An equally large amount of hardware is provided for the demand-paged
> virtual memory system.

I think you are mixing your models here.  A virtual memory system need not
support demand paging, and a demand paging system need not imply virtual
memory.  If hardware designers understood what the machine was supposed to
do when it was put together more clearly than is made apparent by this
statement, perhaps there would not be this problem.  In addition, most
hardware designers use MMU chips to alleviate this problem entirely.

> Add to that a healthy(?) helping of cache chippery

I believe the memory cache was invented by a HARDWARE company so that their
HARDWARE would appear faster than their competitors'.  Caching is, at times,
helpful; however, it can also be a great inconvenience, especially when one
is trying to design a multiprocessor system or implement memory-mapped I/O.
When you are using dual-ported RAM to communicate with other hardware because
the designer was unable to cause the hardware to run at a reasonable rate
if the communication took place via interrupts, it is horribly inconvenient
to have your I/O cached, and perhaps lost.

[I realize I'm going to get a lot of flak here from people who love DMA
 and hate interrupts, such as the designers of message-passing operating
 systems, but consider this: a well designed computer system with an
 interrupt-based architecture cannot lose data as long as you stay within
 system performance limits.  Synchronization of data flow in software is
 always subject to a number of failure modes, not the least of which is
 directly related to prioritization of tasks.]

> and what do you get - yes, a machine built upon boards the size of a small
> squash court!

You work for IBM, right? ;-)

> None of this hardware is simple, and applies to both the uniprocessor
> and the multiprocessor case.

I agree, but I have to modify this with the statement that this is only
true in the case of badly designed hardware.

>      Now, add in the magic multiprocessor devices and all hell breaks loose on
> the hardware front (not to mention the software - groan).

Exactly... "groan".  Software is always more complicated than hardware -
that's why software takes longer.  Add into this the apparent inability
of hardware designers to comprehend what is necessary for software AND
hardware to be smaller/faster/sooner, and you have machines which are so
radically different in operational concept from what needs to be done
that programmers who have to deal with the hardware are more often than not
prone to mistakes.  And built upon this are the users' concepts of what "needs
to be there"... the actual bottom line.  To get from hardware to bottom line
can often take 7 or more layers of software fixing or bypassing bad (or
worse, ill-informed) design decisions made at the hardware level.  This is
the sort of atrocity that made it impossible to write decent terminal
emulators on CP/M systems: the UART hardware was often capable of receiving
characters at rates in excess of 19200 baud; the screen was often capable of
displaying characters at rates in excess of 38400 baud.  The bottleneck was
that when the screen scrolled, the hardware locked out serial interrupts,
thus causing
lost data from the serial channel.  There were exceptions, but not many.

> Everyone's favourite trick seems to be finding ever more complicated ways
> of getting large numbers of CPUs to talk to the memory all at once.  Just
> imagine an ever-increasing number of waiters trying to get in and out of
> the same kitchen all at once through one door, and you can see the mess.
> OK, let's increase the number of doors ... in hardware terms this means
> separating the memory into several pages which can be accessed simultaneously,
> thereby increasing the effective bandwidth of the memory.

Not everyone's.  This is only true in cases where bad implementations have
occurred, in hardware and/or software.  Extreme parallelism is only
useful for things which lend themselves to parallelism, and even then the
only truly useful things to emerge from the whole parallel mess have been
data-flow architectures, such as Goodyear's, for use in finite-element
modelling or fluid dynamics, and things such as the Sequent/NCR/Sperry/Multiflow
systems when applied to large numbers of users or online transaction
processing.  In these instances, it is, with very little intercommunication,
like having a number of computers with shared resources in the same box...
not a bad idea, if you want to save money.  I have yet to sit down at one
of these machines and type "make" and have a separate processor allocated
to each compile.

> Is this really admitting that shared memory is not necessary?

Shared memory is *not* something to just throw at a problem.  As you admitted,
increased bandwidth improves performance.

> Surely the highest bandwidth is achieved when each processor has its own
> memory which it shares with no one else?  It also makes the hardware a lot
> smaller.

Absolutely not.  Think of an infinite memory plane with processors "crawling"
over it, doing what needs to be done... somewhat like many small spiders
cooperating to perfect a web which none could complete by themselves.  This,
I think, is a representation of the ideal dataflow machine.  It might even be
useful, if we could figure out how to talk to it.

> 	- virtual memory is useful only when an application won't fit in
> 	  physical memory.  But memory is cheap, so with lots of Mbytes
> 	  who needs it, especially if the program is written well.

Memory is no longer cheap.

> 	- multi-user machines are too complicated to be both fast and simple.

Due to inappropriate hardware implementation, yes.

> 	- shared-memory is not necessary; it's a software issue that shouldn't
> 	  be solved in hardware.

How do you propose it be resolved in software, given hardware "protection"?
This is an inane idea, and it perpetuates the major obstacle to turning
software into a true field of engineering.  Given the complexity of
software as compared to hardware, it may be impossible to derive a formulaic
method of producing software; perhaps it will always be an art.  But there
are things which could be done in hardware to make it easier and less arcane
than it currently is.  The current problem is one of hardware not resembling
the solution which is the goal.


| Terry Lambert           UUCP: ...{ decvax, ihnp4 } ...utah-cs!century!terry |
| @ Century Software        OR: ...utah-cs!uplherc!sp7040!obie!wsccs!terry    |
| SLC, Utah                                                                   |
|                   These opinions are not my companies, but if you find them |
|                   useful, send a $20.00 donation to Brisbane Australia...   |
| 'Admit it!  You're just harrasing me because of the quote in my signature!' |

jack@cwi.nl (Jack Jansen) (05/19/88)

In article <53@daitc.ARPA> jkrueger@daitc.UUCP (Jonathan Krueger) writes:
[Giving reasons for virtual memory]
>avoiding memory fragmentation - virtual memory management provides a
>way for multiple processes to share the physical store, cleanly and
>without performance bottlenecks.

Not providing virtual memory doesn't mean that segments in memory
need to be contiguous. There might be a point for copy-on-write
or zero-fill-on-demand, but I'm not sure that these alone are worth
the trouble.
>
>preventing unnecessary i/o - virtual memory systems need not load in
>an entire image, thus performing fewer disk-to-memory reads per
>execution, an advantage in a development cycle, among other places.

Again, given heaps of memory and fast communication this doesn't
matter anymore. If my file server can keep a copy of emacs in its
cache and download it at 700Kb/s, I'm up and running in a second.
Quite acceptable, I think.

Of course, there will always be applications that *do* benefit from
virtual memory, but I guess within reasonable time VM will fall in the
same class as vector processors or other esoteric features that can
be found on specialized machines, not on everyday workstations.
-- 
	Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
	The shell is my oyster.