[comp.arch] parallel systems

hhd0@GTE.COM (Horace Dediu) (10/18/89)

In article <20336@princeton.Princeton.EDU>, mg@notecnirp.Princeton.EDU (Michael Golan) writes:
> This came from various people - the references are so confusing I removed them
> so as not to put the wrong words in someone's mouth:
> 
> >>>Supercomputers of the future will be scalable multiprocessors made of many
> >>>hundreds to thousands of commodity microprocessors.
> >>
> >This is the stuff of research papers right now, and rapid progress is being
> >made in this area.  The key issue is not having the components which establish
> >the interconnect cost much more than the micros, their off chip caches,
> >I currently lean to scalable coherent cache systems which minimize programmer
> >effort.  The exact protocols and hardware implementation which work best
> >for real applications is a current research topic. 
> 
> 1) There is no parallel machine currently that works faster than non-parallel
> machines for the same price. The "fastest" machines are also non-parallel - 
> these are vector processors.
> 

Consider the 8k processor NCUBE 2--"The World's Fastest Computer." 
(yes, one of those).  According to their literature:
"8,192 64 bit processors each equivalent to one VAX 780.  It delivers
60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
performance of any other currently available or recently announced
supercomputer."  It's distributed memory .5MB per processor, runs UNIX, 
and is a hypercube.

I don't know the price, but I bet it's less than a Cray.  Interesting to
talk about GigaFLOPS.  This is fast.

> 2) A lot of research is going on - and went on for over 10 years now. As far
> as I know, no *really* scalable parallel architecture with shared memory exists
> that will scale far above 10 processors (i.e. 100). And it does not seem to
> me this will be possible in the near future.

Who cares about shared memory?  Distributed is the only way to scale.
Everybody realizes this since it can be proven.
The only reason shared memory machines exist is because we don't yet know
how to make good distributed machines.  (Yeah, right! tell that to Ncube) 
IMHO shared memory is a hack using available bus technology while waiting for 
the real parallel machines to come.  (they're already here)

> 3) personally I feel parallel computing has no real future as the single cpu
> gets a 2-4 fold performance boost every few years, and parallel machine
> construction just can't keep up with that. It seems to me that for at least 
> the next 10 years, non-parallel machines will still give the best performance 
> and the best performance/cost.

This is very ambiguous.  Parallel machines can use off-the-shelf CPUs.  If a
fast micro is available then you can design a parallel machine around it just
as you would any workstation.  The other problem: if CPUs increase 2-4 fold
every few years and if this can be maintained for 10 years, you can only expect
about a 32-fold increase (doubling every two years for ten years is 2^5 = 32).
This is nothing.  You can't expect problems to stay that small.  If you expect
to go beyond that you'll hit a wall at the fundamental boundary of the speed
of light.  You can't drive clock rates to infinity.  The only way to speed up
is to do it in parallel.  Sure it's hard to program, but it's a new field, the
tools are rudimentary and only hardware people are involved in their
development.  If enough effort is put into it, parallel machines should not be
any harder to program than your basic workstation.

> time that really matters. And while computers get faster, it seems software
> complexity and the need for faster and faster machines are growing even more
> rapidly.

Of course.  To solve hard problems you *need* parallel execution.
It's no secret that every big iron maker and every supercomputer shop is
developing parallel machines.  These are still modest efforts (<100 cpu's),
but the leading edge is now in the 10k coarse-grained, 64k fine-grained
processors.  This should scale nicely to 1M processors in the next decade.
After that we can expect some kind of new barriers to come up.  

>  Michael Golan
>  mg@princeton.edu

-- 
Horace Dediu \"That's the nature of research--you don't know |GTE Laboratories
(617) 466-4111\  what in hell you're doing." `Doc' Edgerton  |40 Sylvan Road
UUCP:  ...!harvard!bunny!hhd0................................|Waltham, MA 02254
Internet: hhd0@gte.com or hhd0%gte.com@relay.cs.net..........|U. S. A.

yodaiken@freal.cs.umass.edu (victor yodaiken) (10/19/89)

In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:
>Consider the 8k processor NCUBE 2--"The World's Fastest Computer." 
>(yes, one of those).  According to their literature:
>"8,192 64 bit processors each equivalent to one VAX 780.  It delivers
>60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
>performance of any other currently available or recently announced
>supercomputer."  It's distributed memory .5MB per processor, runs UNIX, 
>and is a hypercube.
>
>I don't know the price, but I bet it's less than a Cray. 

I'd like to see the delivered price of an 8k processor system.

>Interesting to
>talk about GigaFLOPS.  This is fast.
>
This sounds like one of those total b.s. measures obtained by
multiplying the number of processors by the max mips/mflops rate
per processor. 
>
>Who cares about shared memory?  Distributed is the only way to scale.
>Everybody realizes this since it can be proven.

Proof citation? Sketch?  

There is a lot of mythologizing about parallelism. Parallel processing
is a standard technique for speed which is used in every carry-lookahead
adder, every bus, etc. It seems reasonable to believe that parallelism
will be an important technique in the future. It seems POSSIBLE that 
using multiple cpu's will be a useful technique. On the other hand
there is no reason why this technique must work, and it seems at least
as possible that cpu's should not be the basic unit of parallel
computation. 

>It's no secret that every big iron maker and every supercomputer shop is
>developing parallel machines.  These are still modest efforts (<100 cpu's),
>but the leading edge is now in the 10k coarse-grained, 64k fine-grained
>processors.  This should scale nicely to 1M processors in the next decade.
>After that we can expect some kind of new barriers to come up.  

I admire your confidence, but am unconvinced.  Evidence?

victor

jdarcy@multimax.UUCP (Jeff d'Arcy) (10/19/89)

From article <7651@bunny.GTE.COM>, by hhd0@GTE.COM (Horace Dediu):
> Who cares about shared memory?  Distributed is the only way to scale.

I'd like to see you back this one up with some *real* proof.  I could
possibly agree with the statement that distributed is the *best* way
to scale, or that distribution is necessary for *large* (>~30) scale
multiprocessing.  I think that shared memory architectures will still
be viable for a long time, perhaps as a component of a distributed
environment.  If you disagree please provide reasons.

Jeff d'Arcy		jdarcy@encore.com		"Quack!"
 Encore has provided the medium, but the message remains my own

tomlic@yoda.ACA.MCC.COM (Chris Tomlinson) (10/19/89)

From article <20416@princeton.Princeton.EDU>, by mg@notecnirp.Princeton.EDU (Michael Golan):
> In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:
>>
>>Consider the 8k processor NCUBE 2--"The World's Fastest Computer." 
>>(yes, one of those).  According to their literature:
>>"8,192 64 bit processors each equivalent to one VAX 780.  It delivers
>>60 billion instructions per second, 27 billion scalar FLOPS, exceeding the
> 
> This implies a VAX 780 is a 7 MIPS machine?

The architecture of the processor is similar to the VAX ISA, not the performance.

> 
>>performance of any other currently available or recently announced
>>supercomputer."  It's distributed memory .5MB per processor, runs UNIX, 
>                                       ^^^^^^^^^^^^^^^^^^^^^^
>>and is a hypercube.
> 
> .5MB ? And this is faster than a Cray? How many problems you can't even

I understand that NCUBE makes provisions for up to 64MB per node on
those systems using the 64 bit processors. They also apparently have
incorporated a through-routing capability in the processors similar to
that found on the Symult mesh-connected machines.

> solve on this? And for how many, a 32Mb single VAX 780 will beat ?!
> One of the well-known problems with Hypercubes is that if you look at a job
> that uses the whole memory (in this case 4Gb = Big Cray), a single machine 
> with the same performance of one processor (and all memory) will be almost 
> as good and sometimes even better.

The current trends in distributed memory MIMD machines are towards very
low communication latencies by comparison with the first-generation
machines that used software routing and slow communication hardware.
This tends to drive the machines towards shared-memory-like access
times.  Of course, physical limitations mean that a DM-MIMD machine is a
scalable way of approximating shared memory, one that approximates it
worse and worse as the machine gets larger; but at least the machine can
get larger.

> 
> My original point was that MIMD, unless it has shared memory, is very hard
> to make use of with typical software/algorithms. Some problems can be solved
> nicely on a Hypercube, but most of them can not! And the state of the art

The state-of-the-art in parallel algorithm development is advancing rapidly
as machines become available to experiment on.  It is more of an issue of
algorithm design than parallelizing sequential codes.  There are quite a
number of problems that are tackled on Crays because of superior scalar
performance that do not make significant use of the SIMD vector capabilities.
I would point to the development of BLAS-2 and -3 as indications that even
on current supercomputers compiler technology just doesn't carry the day by
itself.

> in compilers, while having some luck with vectorized code, and less luck
> with shared memory code, has almost no luck with message-passing machines.
> 
> 
>  Michael Golan
>  mg@princeton.edu
> My opinions are my own. You are welcome not to like them.

Chris Tomlinson
tomlic@MCC.COM
--opinions....

david@cs.washington.edu (David Callahan) (10/19/89)

In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:

>Who cares about shared memory?  Distributed is the only way to scale.

Perhaps you forgot a smiley? Or perhaps when you say "shared" you mean
"centralized"?

"Shared" memory is part of the virtual machine and clearly can be
implemented on a machine with "distributed" packaging of memory with
processors. The BBN Butterfly and the RP3 are both "distributed"
memory machines in the sense that memory is packaged with processors
and the hardware takes care of building "messages" for every memory
request.

From a programming point of view, machines like the NCUBE have three
(IMHO) serious faults: message passing is done in software and
therefore has orders of magnitude more latency than a "shared" memory
machine; data movement now requires software-controlled cooperation on
both processors; and finally, the programmer must determine the
location of the "most recent" value of every variable and which
processor was the last to write it or next to use it.

I care about shared memory --- it makes parallel machines much easier
to program.
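
To make the contrast concrete, here is a rough sketch in C.  The msg_send()
and msg_recv() calls are placeholders of my own invention, not any vendor's
actual library; the point is only how much choreography the message-passing
version needs for the same data movement.

#include <stdio.h>

/* Toy stand-ins for a message-passing library: msg_send() "delivers" to
   the neighbour (discarded here) and msg_recv() pretends the neighbour
   has already sent us x[1] = 3.0.  A real system would enter the kernel
   on both ends for each of these calls.                                */
static double mailbox = 3.0;
static void msg_send(int dest, const double *buf) { (void)dest; (void)buf; }
static void msg_recv(int src, double *buf)        { (void)src;  *buf = mailbox; }

/* Shared memory: processor "me" just loads its neighbour's element.    */
static double update_shared(const double *x, int me)
{
    return 0.5 * (x[me] + x[me + 1]);        /* one load, no choreography */
}

/* Message passing: both sides must cooperate, each must post matching
   calls, and the programmer must know which node owns the fresh copy.  */
static double update_message(double my_x, int me)
{
    double neighbour_x;
    msg_send(me + 1, &my_x);        /* neighbour needs a matching recv   */
    msg_recv(me + 1, &neighbour_x); /* and a matching send back to us    */
    return 0.5 * (my_x + neighbour_x);
}

int main(void)
{
    double x[2] = { 1.0, 3.0 };
    printf("shared:  %g\n", update_shared(x, 0));      /* prints 2 */
    printf("message: %g\n", update_message(x[0], 0));  /* prints 2 */
    return 0;
}

Multiply that by every shared data structure in a real program and the
programming cost should be obvious.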

>Everybody realizes this since it can be proven.
>The only reason shared memory machines exist is because we don't yet know
>how to make good distributed machines.  (Yeah, right! tell that to Ncube) 
>IMHO shared memory is a hack using available bus technology while waiting for 
>the real parallel machines to come.  (they're already here)

Shared memory has nothing to do with busses --- it has to do with
programming. 

Disclaimer: I work for a company designing a multiprocessor that
supports shared memory programming.

-- 
David Callahan  (david@tera.com, david@june.cs.washington.edu,david@rice.edu)
Tera Computer Co. 	400 North 34th Street  		Seattle WA, 98103

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (10/20/89)

Since I care about parallel machines, my two cents worth:

FACT: we have proof-by-existence that

- massively parallel machines can be built, can be reliable, etc.
  (defining massive as "more than 1000 processor chips").
- they can have aggregate properties ( GIPS, GFLOPS, GB, GB/s IO)
  that are in the supercomputer league. Yes, I have details.
- they allow memory-intensive algorithms, since they can use, in main
  memory, the slower/cheaper DRAMs that Cray uses only in backing
  memory. Yes, I can back this up.
- for selected large applications, these machines already are the 
  fastest hardware, and the cheapest hardware. Yes, both.
- for selected applications, these machines aren't that hard to program.

IT SEEMS AGREED THAT:

- MIMD machines can have automatic load balancing, timesharing, etc.
  (Actually, the timesharing is called "spacesharing".)
- MIMD machines with virtual memory could conveniently fault pages 
  around between nodes.
- some applications could use a Connection Machine with millions
  of processors.
- programming isn't as easy as we'd like.

RESEARCHERS HAVE FOND HOPES THAT, SOME DAY,

- automatic parallelization onto these machines will become practical
  for many applications.
- shared-memory/cache-coherency will be cost-effective on large MIMDs.

MY OPINION:

- most supercomputer applications will wind up on massively
  parallel machines.
- there will always be a market for the fastest possible single CPU.
- MIMD machines don't want the fastest possible node, because
  (so far) the money is better spent buying several cheaper nodes.
- conventional instruction set architectures are suitable bases for 
  MIMD nodes.
- "scaling laws" are sophistry when 8K node MIMDs are here now.

Sorry to be so wordy. 
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

mbutts@mentor.com (Mike Butts) (10/20/89)

From article <10164@encore.Encore.COM>, by jdarcy@multimax.UUCP (Jeff d'Arcy):
> From article <7651@bunny.GTE.COM>, by hhd0@GTE.COM (Horace Dediu):
>> Who cares about shared memory?  Distributed is the only way to scale.
> 
> I'd like to see you back this one up with some *real* proof.  I could
> possibly agree with the statement that distributed is the *best* way
> to scale, or that distribution is necessary for *large* (>~30) scale
> multiprocessing.  I think that shared memory architectures will still
> be viable for a long time, perhaps as a component of a distributed
> environment.  If you disagree please provide reasons.

Distributed architectures are obviously most desirable from a hardware point of
view, because they are simple and (nearly) arbitrarily scalable.  IMHO there
are two reasons why shared memory architectures will continue to be more
important for many of us for a long time to come.

1) Shared memory systems are *much* easier to program, and software development
costs *much* more than hardware nowadays.  I say this based on my experience as
a hardware engineer in a mostly software engineering environment.

2) Many problems have proven inefficient so far on distributed memory
architectures.  Distributed machines succeed beautifully on problems where some
real physical space can be mapped onto processor/memory nodes which are
arranged in a regular topology which is similar to the topology of the problem.
Modeling physical media, such as solids, liquids or gases, takes advantage of
the fact that the state of one parcel only directly affects the state of its
nearby neighbors.

Problems with irregular topology, such as electronic circuits, are very much
harder to solve efficiently, because the state of one gate or transistor may
affect the state of another at a great distance. Communication is far more
irregular and expensive, so speedups suffer.  Static load balancing among the
processors is also much harder.  Shared architectures need not statically
partition the problem, and communication speed depends much less on distance.
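
Here is a toy C sketch of the difference (the data layouts are invented; only
the access patterns matter): a mesh-style relaxation touches nothing but i-1
and i+1, which sit on the same or a neighboring node, while a gate-level
circuit walks a fanout list that can point anywhere in the machine.

#include <stdio.h>

#define CELLS 8
#define GATES 4

/* Regular problem: each cell depends only on its two neighbours, so a
   block of consecutive cells (plus a halo) maps onto one node and the
   communication pattern is fixed and local.                           */
void relax(const double v[CELLS], double out[CELLS])
{
    for (int i = 1; i < CELLS - 1; i++)
        out[i] = 0.5 * (v[i - 1] + v[i + 1]);
}

/* Irregular problem: a gate drives whatever its netlist says it drives.
   The fanout indices below are arbitrary, so on a distributed machine
   each update may need a message to a far-away node, and the load per
   node depends on the circuit rather than on the partitioning.        */
struct gate { int fanout[2]; };

void propagate(const struct gate g[GATES], const int val[GATES], int next[GATES])
{
    for (int i = 0; i < GATES; i++)
        for (int j = 0; j < 2; j++)
            next[g[i].fanout[j]] |= val[i];       /* scattered writes */
}

int main(void)
{
    double v[CELLS] = {0, 1, 2, 3, 4, 5, 6, 7}, out[CELLS] = {0};
    struct gate g[GATES] = {{{3, 1}}, {{2, 0}}, {{0, 3}}, {{1, 2}}};
    int val[GATES] = {1, 0, 1, 0}, next[GATES] = {0};

    relax(v, out);
    propagate(g, val, next);
    printf("out[1]=%g next[3]=%d\n", out[1], next[3]);
    return 0;
}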

I'm aware of new algorithmic technology being developed to attack these
problems, such as distributed time discrete event simulation techniques, but 
there's still a lot of work ahead.

I agree that distributed memory is the only way to scale if you can, but there
are important problems which are much more readily solved on shared
architectures, at least so far.  A hybrid architecture, with physically
distributed but logically shared memory, of which several examples have been
built, may be the best transition path.
-- 
Michael Butts, Research Engineer       KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton, OR 97005
!{sequent,tessi,apollo}!mntgfx!mbutts         mbutts@pdx.MENTOR.COM
Opinions are my own, not necessarily those of Mentor Graphics Corp.

bzs@world.std.com (Barry Shein) (10/22/89)

Why oh why are people searching around for a philosopher's stone of
computing architectures?  This isn't science, this is aberrant
psychology.

There exist problems which are best run on (pick one or more) MIMD,
SIMD, scalar, vector-processors, and/or hybrids. There also exist
problems which don't care what they're run on, at least not much.

For example, time-sharing and typical database environments seem to
run very well on MIMD systems for very little re-programming effort
(in the case of time-sharing, none.)  This is because of the large
granularity of the applications and because their measure of "runs well" is
mere response time.

There are other examples for MIMD and examples for SIMD and other
types of processors.

To say (MIMD/SIMD) processors are a bad idea because there exists some
large set of problems which are either impossible or very hard to
optimize for those architectures is so goddamn stupid it boggles the
mind.

MIMD processors are relatively easy and cheap to build out of mostly
commodity parts.

SIMD processors appear to be very, very good at certain classes of
problems which are very important to some people. Important enough
that they'll buy a SIMD box just to run those problems and tell people
with other problems that don't fit so well to go fly a kite (which is
the mathematically correct answer to them.)

We've had multi-processing and hardware optimizations almost since
computing began. What do you think makes a mainframe a mainframe?
Multi-processing, particularly in the I/O channels because mainframes
are bought for high I/O throughput. Most DP shops aren't CPU bound,
they're I/O bound, so they buy their CPUs in the channels.

It continues to astound me how, particularly in the academic computer
science community, some dodo will stand in front of an audience, show
that there exists a class of problems which don't run well on parallel
architectures, and conclude that therefore parallel architectures are
bad.

What *frightens* me is that N people will sit in the audience and nod
their heads in agreement and go out to spread this gospel (as we've
seen on this list) instead of riding the dodo out on the first rail.

LOOK, there are folks out there using all these architectures and
winning big. Consider that before you attempt to prove on paper again
that it's impossible for a honeybee to fly.

What would be *useful* would be a taxonomy of algorithms classified by
how well they map (and how hard they are to adapt) to various architectures.

But it's so much easier to throw peanut shells from the bleachers.
-- 
        -Barry Shein

Software Tool & Die, Purveyors to the Trade         | bzs@world.std.com
1330 Beacon St, Brookline, MA 02146, (617) 739-0202 | {xylogics,uunet}world!bzs

mccalpin@masig3.masig3.ocean.fsu.edu (John D. McCalpin) (10/22/89)

In article <1989Oct21.234930.905@world.std.com> bzs@world.std.com
(Barry Shein) writes:
>Why oh why are people searching around for a philosopher's stone of
>computing architectures?  This isn't science, this is aberrant
>psychology.

While there may be some small component of "holy grail"-itis in this
thread, I think that most people are discussing a different problem
here.  This isn't science --- it is technology plus market forces....

>There exist problems which are best run on (pick one or more) MIMD,
>SIMD, scalar, vector-processors, and/or hybrids. There also exist
>problems which don't care what they're run on, at least not much.

I don't think that anyone disputes this.  The question is whether or
not each of these types of architectures can acquire a sufficient
market to be competitive with whatever architecture is selling best,
and which therefore has the most R&D money and the best economies of
scale. 

>To say (MIMD/SIMD) processors are a bad idea because there exists some
>large set of problems which are either impossible or very hard to
>optimize for those architectures is so goddamn stupid it boggles the
>mind.

On the other hand, it is a perfectly reasonable thing to decide that
it is not worth my time to learn how to work with/program on/optimize
on a particular architecture because it is not likely to be
commercially successful.  It is easy enough to be wrong about the
commercial success aspects, but it is an unavoidable question.

>It continues to astound me how, particularly in the academic computer
>science community, some dodo will stand in front of an audience, show
>that there exists a class of problems which don't run well on parallel
>architectures, and conclude that therefore parallel architectures are
>bad.
>What *frightens* me is that N people will sit in the audience and nod
>their heads in agreement and go out to spread this gospel (as we've
>seen on this list) instead of riding the dodo out on the first rail.

I believe that it is far more common for people to conclude that
parallel architectures are not an effective approach for the class of
problems being discussed.  Since many of us out here are _users_,
rather than designers, it is hardly surprising that we would downplay
the potential usefulness of architectures that are believed to be
unhelpful in our chosen work.  This is quite a reasonable response --
it reminds me of democracy and enlightened self-interest and all of
that stuff.... :-)

>LOOK, there are folks out there using all these architectures and
>winning big. Consider that before you attempt to prove on paper again
>that it's impossible for a honeybee to fly.

It is a good point to note here that lots of the people who think that
parallel architectures are not useful in their field are wrong.  

>But it's so much easier to throw peanut shells from the bleachers.

And so much more fun! ;-)
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
		   mccalpin@scri1.scri.fsu.edu
		   mccalpin@delocn.udel.edu

crowl@cs.rochester.edu (Lawrence Crowl) (10/23/89)

In article <7651@bunny.GTE.COM> hhd0@GTE.COM (Horace Dediu) writes:
>Who cares about shared memory?  Distributed is the only way to scale.
>Everybody realizes this since it can be proven.  The only reason shared
>memory machines exist is because we don't yet know how to make good
>distributed machines.  IMHO shared memory is a hack using available bus
>technology while waiting for the real parallel machines to come.

You are mixing two concepts --- memory architecture (as the processors see it)
and communication interconnect.  Commonly available shared memory systems tend
to use a bus interconnect, so people assume that this is the only interconnect
for shared memory.  This assumption is wrong.  SHARED MEMORY DOES NOT IMPLY A
BUS INTERCONNECT.  The BBN Butterfly, IBM RP3, and NYU Ultracomputer all
supported a shared memory implemented over a multistage (FFT/butterfly) interconnect without busses.
The Butterfly is commercially available with up to 512 processors.  The RP3
is a research machine designed for as many as 512 processors, though I don't
know if IBM has configured one that large.  I don't recall the Ultracomputer
size.

SHARED MEMORY IS SCALABLE.  If the system supports a scalable interconnection,
and processors have local memory, then a shared memory system is scalable.
With local memory, only information that is truly shared need be communicated
between processors.  This is exactly the information that must be communicated
on a distributed memory system via message passing.

SHARED MEMORY IS DESIRABLE.  The latency of a remote memory access is typically
two orders of magnitude lower than that of message passing on distributed memory.
Applications with small messages sent on an infrequent basis will see
significant performance improvements.  For instance, a shared memory system
can increment a shared counter far faster than any distributed memory system.
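
To make the counter example concrete, here is a minimal sketch in C (the
atomics notation is modern, and remote_increment() is a strawman of my own,
not any real machine's interface):

#include <stdatomic.h>
#include <stdio.h>

/* Shared memory: the counter sits at an address every processor can load
   and store.  One interlocked add costs roughly a remote-memory round
   trip (order of a microsecond on the machines being discussed).        */
static atomic_long shared_counter;

static long bump_shared(void)
{
    return atomic_fetch_add(&shared_counter, 1) + 1;
}

/* Message passing: the counter lives on one owner node.  Every other node
   sends a request and waits for a reply, paying kernel entry and exit on
   both ends in both directions.  The "network" is faked with a static
   variable here just so the sketch runs.                                 */
static long owner_counter;

static long remote_increment(int owner_node)
{
    (void)owner_node;
    /* request message to owner; owner increments; reply message back */
    return ++owner_counter;
}

int main(void)
{
    printf("shared:  %ld\n", bump_shared());
    printf("message: %ld\n", remote_increment(7));
    return 0;
}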

SHARED MEMORY HAS A COST.  Implementing shared memory over a scalable
interconnect may require a larger aggregate bandwidth than that of distributed
memory systems.  I don't think there has been enough research here to know
the real tradeoff, but such a result would not surprise me.
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

brooks@vette.llnl.gov (Eugene Brooks) (10/24/89)

In article <1989Oct23.152120.25967@cs.rochester.edu> crowl@snipe.cs.rochester.edu (Lawrence Crowl) writes:
>SHARED MEMORY IS DESIRABLE.  The latency of a remote memory access is typically
>two orders of magnitude lower than that of message passing on distributed memory.
Given equivalent performance interconnect, which rarely occurs because the
message passing machines tend to get short changed on the communication hardware,
I have found the "shared memory" systems to have much better communication
performance.  This is because the communication between processors is
directly supported in the memory management hardware.  In the message passing
machines sending a message invokes a "kernel call" on both the sending and
receiving ends.  This system call overhead is much greater than the hardware
latency itself, amounting to a factor of 5 or more.  One could try for complex
hardware support of messaging, but a better solution is to just memory map it...

Please note:  I am not talking about the really horrible interrupt handling
of message forwarding here.  This only compounds a bad situation for kernel
overhead.
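
For anyone who wants to feel the kernel-call cost on their own workstation,
here is a crude C sketch (plain POSIX, my own toy, nothing to do with any
particular multicomputer).  It compares pushing a small message through a
system call with simply storing it into a buffer standing in for memory
mapped communication space; the gap between the two is the kind of overhead
I am talking about.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define N   100000
#define MSG 64

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) * 1e-3;
}

int main(void)
{
    static char dst[16 * MSG];   /* stand-in for a window of mapped remote memory */
    char msg[MSG] = "hello";
    struct timespec t0, t1, t2;
    int fd = open("/dev/null", O_WRONLY);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)          /* "send" = one kernel call per message   */
        (void)write(fd, msg, MSG);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int i = 0; i < N; i++)          /* "send" = a plain store into the window */
        memcpy(dst + (i % 16) * MSG, msg, MSG);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("kernel call per message:  %.2f us\n", elapsed_us(t0, t1) / N);
    printf("store into mapped space:  %.2f us\n", elapsed_us(t1, t2) / N);
    printf("(ignore this: %d)\n", dst[0]);  /* keep the stores from being optimized away */
    close(fd);
    return 0;
}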

brooks@maddog.llnl.gov, brooks@maddog.uucp

ingoldsb@ctycal.UUCP (Terry Ingoldsby) (10/24/89)

In article <7651@bunny.GTE.COM>, hhd0@GTE.COM (Horace Dediu) writes:
> The only reason shared memory machines exist is because we don't yet know
> how to make good distributed machines.  (Yeah, right! tell that to Ncube) 
> IMHO shared memory is a hack using available bus technology while waiting for 
> the real parallel machines to come.  (they're already here)
....
> 
> Of course.  To solve hard problems you *need* parallel execution.
> It's no secret that every big iron maker and every supercomputer shop is
> developing parallel machines.  These are still modest efforts (<100 cpu's),
> but the leading edge is now in the 10k coarse-grained, 64k fine-grained
> processors.  This should scale nicely to 1M processors in the next decade.
> After that we can expect some kind of new barriers to come up.  

It is true that the only hope for the kind of performance improvement that will
be required for the next generation of software is parallel processing.  It is
also true that many different parallel processing machines exist.  Here is where
I start to be less optimistic about how quickly we can adopt parallel processing.

The basic problem is *NOT* the software.  Although awkward to write, it can be
done and it is reasonable to assume that techniques will be developed as the
years go by.  The (IMHO) fundamental problem is that different problems partition
differently.  By this I mean that to be executed across many P.E.s (processing
elements) a problem must be partitioned.  One criteria for partitioning the
problem is to subdivide it in such a way that the different P.E.s do as little
interP.E. communication as is possible.  Generally speaking, the more you divide
the problem, the more P.E.s need to talk to each other.  This, of itself, is
not necessarily bad (at least, it is tolerable).  The problem is that different
classes of problems subdivide in different ways.  Thus it is difficult to
set up any *efficient* communication strategy that will let P.E.s talk to
other P.E.s (without going through lots of intermediary P.E.s) for general
problems.  Until someone addresses this issue, I am hard pressed to believe
that a *general purpose* parallel machine will be developed.  I can easily
foresee a time when there will be a custom (parallel) vision processor in
a computer, a custom speech recognition processor, a custom database processor,
... and so on.  I cannot see a given parallel architecture doing all of the
above.

My 2 cents!

-- 
  Terry Ingoldsby                       ctycal!ingoldsb@calgary.UUCP
  Land Information Systems                           or
  The City of Calgary         ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

vorbrueg@bufo.usc.edu (Jan Vorbrueggen) (10/24/89)

In article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:

>Given equivalent performance interconnect, which rarely occurs because the
>message passing machines tend to get short changed on the comm. hardware,
>I have found the "shared memory" systems to have much better communication
>performance.  This is because the communication between processors is
>directly supported in the memory management hardware.  In the message passing
>machines sending a message invokes a "kernel call" on both the sending and
>receiving ends.  This system call overhead is much greater than the hardware
>latency itself, amounting to a factor of 5 or more.  One could try for complex
>hardware support of messaging, but a better solution is to just memory map it.
>
>Please note:  I am not talking about the really horrible interrupt handling
>of message forwarding here.  This only compounds a bad situation for kernel
>overhead.

Eugene, ever seen a transputer? Overhead for receiving or sending a 
message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
unidirectional or 2.4 MByte/s bidirectional. At 4 links/transputer
this gives 9.6 Mbytes/s, close to what most memory interfaces will
allow. Of course, very short messages will limit your transfer rate;
however, at 128 Bytes/message you see about 80% of the maximum rate.
There is no system call involved - the compiler just generates the 
necessary instruction.

Message forwarding isn't so difficult either. I've read of a system
requiring less than 10 us overhead per through-route (this probably
is for the destination link being available). No interrupt handling
involved here - that part is all handled in hardware.

Next generation (i.e., promised for start of 1991) will have 100 Mbit/s
per link and the possibility of hardware routing (a la wormhole).
The CPU will be faster by a factor of 4 or so, with memory bandwidth to match.

-- Jan Vorbrueggen

brooks@vette.llnl.gov (Eugene Brooks) (10/24/89)

In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>Eugene, ever seen a transputer? Overhead for receiving or sending a 
Yes I have; a group here is trying to use them in an image processing project.

>Message forwarding isn't so difficult either. I've read of a system
>requiring less than 10 us overhead per through-route (this probably
>is for the destination link being available). No interrupt handling
>involved here - that part is all handled in hardware.
This level of overhead for each hop is completely intolerable.


brooks@maddog.llnl.gov, brooks@maddog.uucp

peter@ficc.uu.net (Peter da Silva) (10/24/89)

In article <1989Oct23.152120.25967@cs.rochester.edu> crowl@snipe.cs.rochester.edu (Lawrence Crowl) writes:
> SHARED MEMORY HAS A COST.  Implementing shared memory over a scalable
> interconnect may require a larger aggregate bandwidth than that of distributed
> memory systems.  I don't think there has been enough research here to know
> the real tradeoff, but such a result would not surprise me.

Also, the job of programming a shared memory system is a lot harder than
programming a system with messages as the communication medium. The number
of successful message-based operating systems demonstrates this.

Of course you can implement messages in shared memory (a trivial proof is the
aforementioned operating systems), and gain a performance improvement. The
question is whether this offsets the extra bandwidth.

Ah well, I'd much rather have both.
-- 
Peter da Silva, *NIX support guy @ Ferranti International Controls Corporation.
Biz: peter@ficc.uu.net, +1 713 274 5180. Fun: peter@sugar.hackercorp.com. `-_-'
"That particular mistake will not be repeated.  There are plenty of        'U`
 mistakes left that have not yet been used." -- Andy Tanenbaum (ast@cs.vu.nl)

slackey@bbn.com (Stan Lackey) (10/24/89)

In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>In article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>
>>Given equivalent performance interconnect, which rarely occurs because the
>>message passing machines tend to get short changed on the comm. hardware,
>>I have found the "shared memory" systems to have much better communication
>>performance ...

>Eugene, ever seen a transputer? Overhead for receiving or sending a 
>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
>unidirectional or 2.4 MByte/s bidirectional.

1) Although it has high? peak bandwidth, the latency is still there.
The interconnect system waits for the processor to do an access, then
the processor waits for the interconnect to get the data.  Many
microseconds go by between the time the CPU needs dependent data and
it is usable.

2) This thread has been leaning toward the 'many killer micros in
parallel' being the supercomputers of the future.  I agree, but I
think it is still far away; commodity micros just don't include what
they need to support massively parallel systems, in the sense of
providing ease-of-use in a general purpose way.  And they never will,
if the #1 priority is to run whetstone and dhrystone fast.  Providing
these capabilities requires assumptions all the way from application
development, through the compiler, the OS, the chip interface, and on
down to the bare silicon.  It is Real Tough for the guys who design
commodity chips to anticipate what the vast range of users are going
to want, and they're just going to end up not pleasing everybody.

Please recall that even the hypercubes (well, Inmos, NCUBE and Thinking
Machines anyway) do not use commodity microprocessors, but proprietary
ones that do the extras in hardware that they need.

-Stan    Disclaimer: not necessarily the views of my organization.

a186@mindlink.UUCP (Harvey Taylor) (10/24/89)

In Msg-ID: <47279@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
|In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen)
| writes:
|>In article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov
|(Eugene Brooks) writes:
|>
|>> Given equivalent performance interconnect, which rarely occurs
|>> because the message passing machines tend to get short changed on
|>> the comm. hardware, I have found the "shared memory" systems to
|>> have much better communication performance ...
|
|>Eugene, ever seen a transputer? Overhead for receiving or sending a
|>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
|>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
|>unidirectional or 2.4 MByte/s bidirectional.
|
|1) Although it has high? peak bandwidth, the latency is still there.
|The interconnect system waits for the processor to do an access, then
|the processor waits for the interconnect to get the data.  Many
|microseconds go by between the time the CPU needs dependent data and
|it is usable.

  This is getting close to something I have wondered about. How
 applicable is Amdahl's Law to the transputer? How might it be
 affected by the network topology?

  <-Harvey

 "Insanity: A perfectly rational adjustment to an insane world."
                                                -RD Laing
      Harvey Taylor      Meta Media Productions
       uunet!van-bc!rsoft!mindlink!Harvey_Taylor
               a186@mindlink.UUCP

hsu@uicsrd.csrd.uiuc.edu (William Tsun-Yuk Hsu) (10/25/89)

In article <6655@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>
>Also, the job of programming a shared memory system is a lot harder than
>programming a system with messages as the communication medium. 

Umm, have you tried? 

When I took a parallel programming class a couple years ago, everybody
griped about the problems of coding Gaussian elimination for the
hypercube. Hardly anybody griped about implementing it for the
Sequent Balance. (This single example of course doesn't prove
anything, but I would like to see some examples of problems that
are clearly easier to code for a message-passing system.)
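
For what it's worth, the shared-memory version of the elimination step really
is about as simple as the serial code.  Here is a rough C sketch; the
parallel-for pragma is modern notation used only to mark the loop each
processor would take a slice of, and the hypercube version would additionally
have to distribute the rows and broadcast each pivot row by hand.

#include <stdio.h>

#define N 3

/* One elimination step on a shared-memory machine: every processor can
   read the pivot row a[k][*] directly, so you just split the i loop
   across processors.  No data distribution, no hand-coded pivot-row
   broadcast, no matching send/recv pairs.                              */
static void eliminate(double a[N][N + 1])
{
    for (int k = 0; k < N; k++) {
        #pragma omp parallel for          /* split rows below the pivot */
        for (int i = k + 1; i < N; i++) {
            double m = a[i][k] / a[k][k];
            for (int j = k; j <= N; j++)
                a[i][j] -= m * a[k][j];
        }
    }
}

int main(void)
{
    /* Augmented matrix for a little system whose solution is (1, 2, 3). */
    double a[N][N + 1] = {
        { 2, 1, 1,  7 },
        { 1, 3, 2, 13 },
        { 1, 1, 4, 15 },
    };
    double x[N];

    eliminate(a);
    for (int i = N - 1; i >= 0; i--) {    /* back substitution */
        x[i] = a[i][N];
        for (int j = i + 1; j < N; j++)
            x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
    printf("x = %g  y = %g  z = %g\n", x[0], x[1], x[2]);
    return 0;
}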

Bill

priol@irisa.irisa.fr (Thierry Priol,TB131,Equipe Hypercubes,9936200-547,) (10/25/89)

A software user's point of view:

Sometimes it is interesting to simulate a shared memory on a
distributed memory parallel computer.  We have found that a
ray-tracing algorithm is more efficient on a DMPC when it
is implemented with a shared memory programming model.  This is
due to the fact that it is difficult to partition the database
a priori (for example with a geometric partition).  A software
shared memory service performs a dynamic data distribution (when
caches are used) during the computation.  A paper which describes
our experiments will be available soon.  Maybe other algorithms
are in the same situation?

I hope that such a software shared memory service will be available
on DMPCs in future operating systems.

Thierry PRIOL

PS: I apologize for my poor English.

---------------------------------------------------------------------
Thierry PRIOL                                Phone:  99 36 20 00
IRISA / INRIA U.R. Rennes                    Fax:    99 38 38 32

roger@wraxall.inmos.co.uk (Roger Shepherd) (10/25/89)

In article <47279@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>
>>Eugene, ever seen a transputer? Overhead for receiving or sending a 
>>message is 19 cycles (630 ns for a 30 MHz part). The actual transfer
>>is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
>>unidirectional or 2.4 MByte/s bidirectional.
>
>1) Although it has high? peak bandwidth, the latency is still there.
>The interconnect system waits for the processor to do an access, then
>the processor waits for the interconnect to get the data.  Many
>microseconds go by between the time the CPU needs dependent data and
>it is usable.
>

This is true BUT the time between a request being made and the data
returning is NOT WASTED. It can be used to execute other processes. This
use of `excess' parallelism is important precisely because it can hide
latency. This is one reason why the transputer not only incorporates
high performance communication links but also a hardware scheduler. The
use of excess parallelism to hide latency is not new, it was used in the
HEP.

>Please recall that even the hypercubes (well, Inmos, NCUBE and Thinking
>Machines anyway) do not use commodity microprocessors, but proprietary
>ones that do the extras in hardware that they need.
>

I'm not sure what you mean by commodity microprocessors.  The Inmos
transputers ARE commodity microprocessors; they are freely available to
anyone who wants to buy them.  The communication system which is
provided with every transputer is there because it is generically
useful in multiprocessor systems (and you might be surprised just how
many electronic systems are multiprocessors - my PC has at least 3
microprocessors in it, before I plug in my transputer card)!





Roger Shepherd, INMOS Ltd   JANET:    roger@uk.co.inmos 
1000 Aztec West             UUCP:     ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury                 INTERNET: roger@inmos.com
+44 454 616616              ROW:      roger@inmos.com OR roger@inmos.co.uk

daver@dg.dg.com (Dave Rudolph) (10/25/89)

In article <36662@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>>Message forwarding isn't so difficult either. I've read of a system
>>requiring less than 10 us overhead per through-route (this probably
>>is for the destination link being available). No interrupt handling
>>involved here - that part is all handled in hardware.
>This level of overhead for each hop is completely intolerable.

And what is the overhead for each interconnect level in a "shared
memory" machine such as the ultracomputer?  Keep in mind that in such a
machine, every non-local memory access must go through each level and 
back while the processor waits.

slackey@bbn.com (Stan Lackey) (10/25/89)

In article <2658@ganymede.inmos.co.uk> roger@inmos.co.uk (Roger Shepherd) writes:
>In article <47279@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>[in a hypercube]
>>the interconnect system waits for the processor to do an access, then
>>the processor waits for the interconnect to get the data.  Many
>>microseconds go by between the time the CPU needs dependent data and
>>it is usable.
>This is true BUT the time between a request being made and the data
>returning is NOT WASTED. It can be used to execute other processes.

The context of my posting was of getting high speedups on a single
application on a massively parallel computer in an easy-to-use,
general purpose way.  My statements concerned the use of commodity
micros without special architectural mechanisms intended to support
massively parallel processing.  By 'commodity' I meant true general
purpose micros produced in massive volumes and with multiple sources.

>The communication system which is
>provided with every transputer is there because it is generically
>useful in multiprocessor systems (and you might be surprised just how
>many electronic systems are multiprocessors - my PC has at least 3
>microprocessors in it, before I plug in my transputer card)!

If the three micros in your PC don't provide a speedup for Lotus by
running the Lotus code in parallel, it is outside the scope of this
discussion.  Which should probably be in comp.parallel anyway.

I'm not saying that the communication mechanism in a transputer is
'bad' or unnecessary; I mentioned it as a supportive example of
architectural extensions at the processor level for making massively
parallel systems more effective.
-Stan

stan@Solbourne.COM (Stan Hanks) (10/25/89)

In article <6655@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>Also, the job of programming a shared memory system is a lot harder than
>programming a system with messages as the communication medium. The number
>of successful message-based operating systems demonstrates this.

Not necessarily so. It just demonstrates that there is more hardware 
available with which to build message-passing operating systems.

It's bunches easier to build the hardware needed for a message passing 
system -- all you need is CPUs with memory and some kind of messaging
medium connecting them, right? You can use PCs with serial lines in the trivial
case, on up to some sort of massively parallel system in the more 
complex. You can't even begin to count the number of ways that you 
can construct such a system.

It is however substantially more complicated to construct a shared memory
system. If you don't believe me, count the number available (relative
to the number of systems which have the hardware necessary to build
a messaging system). Or, try building one yourself sometime.... 8{)

Regards,

-- 
Stanley P. Hanks  Science Advisor                      Solbourne Computer, Inc.
Phone:            Corporate: (303) 772-3400             Houston: (713) 964-6705
E-mail:           ...!{boulder,sun,uunet}!stan!stan          stan@solbourne.com 

david@cs.washington.edu (David Callahan) (10/26/89)

In article <224@dg.dg.com> uunet!dg!daver (David Rudolph) writes:
>In article <36662@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>>>Message forwarding isn't so difficult either. I've read of a system
>>>requiring less than 10 us overhead per through-route
>>This level of overhead for each hop is completely intolerable.

>And what is the overhead for each interconnect level in a "shared
>memory" machine such as the ultracomputer?  Keep in mind that in such a
>machine, every non-local memory access must go through each level and 
>back while the processor waits.

How does two clock ticks per node sound? One for routing and one on
the wires. If your processor is slow enough it might be only one :-)

Your second statement has two false assumptions in it. First, no need
to build a "dancehall" machine: embed the processors in a 3d mesh so
that each processor is closer to some memory than the rest. Say 256
processors in a 16x16x16 cube gives an average distance of 12 hops to
a memory module. Note that this is half again log(256)=8 but there are
packing issues, message flux issues, and redundant path issues that
make the comparison difficult.

The second false assumption is that the processor waits. Sure, with
memory 50 cycles away it's easy to build a machine that waits, but it is
also possible to build a machine that multiplexes 50 independent
instruction streams on one ALU, so with sufficient parallelism the
memory latency is a non-issue, as in the HEP. Of course, you need 50
streams.
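
The arithmetic behind that last figure is just Little's law.  A throwaway C
fragment (my own numbers, nothing machine-specific) spells it out:

#include <stdio.h>

int main(void)
{
    /* Little's law: to keep one ALU busy you need about
       (memory latency in cycles) x (instructions issued per cycle) x
       (fraction of instructions that wait on remote memory)
       independent streams in flight.                                  */
    double latency_cycles  = 50.0;   /* remote memory, in CPU clocks   */
    double issue_per_cycle = 1.0;    /* one instruction per clock      */
    double frac_remote     = 1.0;    /* worst case: every one of them  */

    double streams = latency_cycles * issue_per_cycle * frac_remote;
    printf("independent streams needed: %.0f\n", streams);

    /* With caches only the misses pay the latency; at a 10% remote
       reference rate the requirement drops to about 5 streams.        */
    printf("at 10%% remote references:   %.0f\n", streams * 0.10);
    return 0;
}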

-- 
David Callahan  (david@tera.com, david@june.cs.washington.edu,david@rice.edu)
Tera Computer Co. 	400 North 34th Street  		Seattle WA, 98103

khoult@bbn.com (Kent Hoult) (10/26/89)

In article <9600@june.cs.washington.edu> david@june.cs.washington.edu (David Callahan) writes:
>In article <224@dg.dg.com> uunet!dg!daver (David Rudolph) writes:
>>In article <36662@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>>>In article <20764@usc.edu> vorbrueg@bufo.usc.edu (Jan Vorbrueggen) writes:
>>>>Message forwarding isn't so difficult either. I've read of a system
>>>>requiring less than 10 us overhead per through-route
>>>This level of overhead for each hop is completely intolerable.
>
>>And what is the overhead for each interconnect level in a "shared
>>memory" machine such as the ultracomputer?  Keep in mind that in such a
>>machine, every non-local memory access must go through each level and 
>>back while the processor waits.

In a machine like the BBN TC-2000, with a butterfly-type switch, all remote
memory is equidistant from every processor.  Each switch stage takes 25 ns in
each direction, and there are at most 3 stages (in a 504-node machine).

The normal time for a 32-bit read or write is around 1.5 us to any other
node.

The time for local references tends to be more in the 200 ns area.


Kent Hoult
TEL: (617) 873-4385     ARPA: khoult@bbn.com

brooks@maddog.llnl.gov (Eugene Brooks) (10/26/89)

In article <224@dg.dg.com> uunet!dg!daver (David Rudolph) writes:
>And what is the overhead for each interconnect level in a "shared
>memory" machine such as the ultracomputer?  Keep in mind that in such a
>machine, every non-local memory access must go through each level and 
>back while the processor waits.
A good figure is a two-clock pipeline delay per stage.  A network using
8x8 switch nodes will have 4 stages for a 4096 processor system (why
think small).  Round trip is 16 clocks.  Considering a 40 MHz clock this
would be a round trip transit time of .4 microseconds.  You would add to
this of course the memory chip cycle time, pipeline delays associated with
getting on and off each end of the network, and the number of clocks
required to wormhole the message through the available wires.  The memory chip
cycle time and "fixed delays" associated with getting on and off the network
amount to more than the pipeline delay through the network.  A one microsecond
total delay for a remote memory reference (cache miss) is not out of the
question.  A two microsecond delay is trivial to achieve.  The CPU does not have
to wait for each remote reference to complete before starting another; we
certainly did not stall the processor on every shared memory reference in the
Cerberus multiprocessor simulator, but current KILLER MICROS have a very
limited capability for multiple outstanding memory requests.  Hopefully
we can keep them alive with cache hits for now, and they will get better
about multiple outstanding requests in the future.
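
For anyone who wants to play with the numbers, here is a back-of-the-envelope
C fragment of my own; the fixed costs are adjustable guesses rather than
measurements from any particular machine.

#include <stdio.h>

int main(void)
{
    /* Multistage network built from radix-8 switches: 8^4 = 4096 ports. */
    double clock_mhz        = 40.0;
    double clocks_per_stage = 2.0;     /* pipeline delay per switch stage */
    int    stages           = 4;

    double network_rt_clocks = 2 * stages * clocks_per_stage;   /* out and back */
    double network_rt_us     = network_rt_clocks / clock_mhz;

    /* Guesses for the "fixed" costs: DRAM cycle plus getting on and off
       the network at both ends, plus wormholing the message itself.     */
    double dram_cycle_us = 0.2;
    double endpoint_us   = 0.3;
    double wormhole_us   = 0.1;

    printf("network round trip: %4.1f clocks = %.2f us\n",
           network_rt_clocks, network_rt_us);
    printf("total remote miss:  %.2f us\n",
           network_rt_us + dram_cycle_us + endpoint_us + wormhole_us);
    return 0;
}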


brooks@maddog.llnl.gov, brooks@maddog.uucp