[net.arch] Cube designs

peterb@pbear.UUCP (02/17/85)

	I have not looked deeply into the design of the 'Cube' as people
have called it (i.e. taking n**3 CPU/memory units and hooking them up in
parallel to accomplish a given task), so bear with me.

	First of all, rating the speed of any highly parallel system is
difficult at best. You have to take your benchmarks with a grain of salt.
If I have a matrix problem that can be decomposed into two semi-independent
processes, then a VAX-11/782 would execute that program about twice as fast
as a VAX-11/780. But on the other hand, if the program to be benchmarked
is highly sequential in nature (e.g. nth-order numerical analysis of
differential equations), then the 782 and the 780 are going to run at about
the same speed. This applies to any parallel architecture.
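
	As a rough sketch of that arithmetic (the fraction p and the figures
below are made up for illustration; they are not measurements of any VAX),
the effect looks something like this:

#include <stdio.h>

/* Illustrative speedup arithmetic: a fraction p of the work runs in
 * parallel on n processors, the rest stays sequential.  The numbers
 * are invented for illustration, not VAX measurements. */
static double speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    /* matrix problem split into two nearly independent halves */
    printf("p=0.95, n=2: speedup = %.2f\n", speedup(0.95, 2));
    /* highly sequential ODE solver: almost nothing to split   */
    printf("p=0.05, n=2: speedup = %.2f\n", speedup(0.05, 2));
    return 0;
}

With p near 1 the second processor nearly doubles throughput; with p near 0
it buys almost nothing, no matter how the second processor is attached.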
	So a new standard of speed measurement is required. I think
that something along the lines of Data Flow Operations/Second (DFO's
-or- Doofoh's) would fit the bill for benchmarking these types of machines.
Then if you take the Cube and the Cray and put them both on a scale
that reflects their architectures, you can compare speeds; otherwise you
are comparing apples to oranges.

	Second, any type of parallel machine relies heavily upon the
distribution of data from one machine to another. This figures into the
overall speed of the machine, since it is the exchange of data between
computing devices that drives any type of parallel architecture. The
interconnect can be low speed such as an Ethernet (serial), high speed such
as a backplane (parallel), or even a combination of the two (e.g. eight
processing units on a backplane with a serial line to connect to other
backplanes). This was demonstrated by Cm*, built at CMU. Their data showed
severe OVERALL system data-transfer degradation as the amount of non-local
I/O increased. This is obvious to almost everybody. Cm* was limited only by
the speed of its backplane.
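
	A crude mix model shows the shape of the Cm* result (the latency
figures below are placeholders picked for illustration, not the published
Cm* numbers):

#include <stdio.h>

/* Mix model: overall transfer time degrades as the fraction of
 * non-local references grows.  Latencies are placeholders, not
 * measured Cm* figures. */
int main(void)
{
    const double local_us  = 3.0;   /* reference over the local backplane    */
    const double remote_us = 30.0;  /* reference that crosses to another box */
    double f;

    for (f = 0.0; f <= 1.0001; f += 0.25) {
        double avg = (1.0 - f) * local_us + f * remote_us;
        printf("non-local fraction %.2f -> avg reference %5.1f us\n", f, avg);
    }
    return 0;
}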

	Third, some type of control facility has to run the entire mess.
This can be slower than the other elements, since it does not require the
massive data throughput of a processing element, but it still must have a
clean, quick architecture that lends itself to controlling "devices" in a
quick and clean manner. Some form of overgrown/homebrew bit-slice seems
optimal in this situation, since some instructions have to be general enough
for scheduling algorithms, processing I/O, fielding interrupts, etc., but
quick and clean enough to service the resource requests of each data element.
Whether this control facility should be distributed or centralized is up for
debate these days.  Different groups have different ideas regarding this.

	The idea of a cube is nice, and I think that it is about the fastest
architecture around for what it is designed for, but in no way will it
compete with a Cray at sequential MIPS. In parallel MIPS the cube would have
to be large, but the size would be manageable.

	In order to increase/control data throughput, I think that a bus
architecture that combines the best of serial and parallel is in order.  I
think that each processing unit should be hooked to three busses: one for
the X direction, one for the Y direction, and one for the Z direction. This
would require 3n**2 (n = size of the cube on one side) busses, each with n
elements on it (i.e. 8 data elements, a cube 2 on each side, require 4
busses in the X direction, 4 busses in the Y direction and 4 busses in the
Z direction, giving a total of 12 data busses). Since each of the n**3
elements sits on three busses, there would be a total of 3n**3 bus
connections within the cube. The advantage of this is that data has to pass
through at most two data elements on its way from source to destination.
Other paths can be created to get the data from node to node, especially if
each bus connection had a FIFO on it to queue up transfers. Also, a data
element can pass information along from one bus to another with very little
overhead.
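
	To make the bookkeeping concrete, here is a small sketch of the bus
counts and the worst-case route under this scheme (the node numbering and
the function names are mine, invented for illustration):

#include <stdio.h>

#define N 2                       /* cube edge length (n) */

/* A node is named by its (x, y, z) coordinates, 0..N-1 each.  Each
 * node sits on three busses: one spanning the X direction (fixed y
 * and z), one spanning Y, and one spanning Z. */
struct node { int x, y, z; };

/* Route from src to dst one coordinate at a time; each step uses one
 * shared bus, so a path needs at most two intermediate nodes (when
 * all three coordinates differ). */
static int bus_transfers(struct node src, struct node dst)
{
    int h = 0;
    if (src.x != dst.x) h++;
    if (src.y != dst.y) h++;
    if (src.z != dst.z) h++;
    return h;                     /* intermediate nodes = h - 1 */
}

int main(void)
{
    struct node a = {0, 0, 0}, b = {1, 1, 1};
    int h = bus_transfers(a, b);

    printf("busses: %d (3*n^2), bus connections: %d (3*n^3)\n",
           3 * N * N, 3 * N * N * N);
    printf("corner to corner: %d bus transfers, %d intermediate node(s)\n",
           h, h - 1);
    return 0;
}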

	I know this is rough, but if the net kicks around the idea, we may
all one day (as a collective group) file for a patent (but I doubt it...)


						Peter Barada
						ima!pbear!peterb


PS      "it's a long day, and it ain't going any faster..."

ian@loral.UUCP (Ian Kaplan) (02/28/85)

In article <7306@watrose.UUCP> cdshaw@watrose.UUCP (Chris Shaw) writes:
>
>The average communication length is 3, with no bus contention along the way.
>The contention is really hidden in the point-to-point waiting for receive and
>transmit.
>

  I think that things are a little more complex than this.  Contention is
  not entirely nonexistent.  Two processors can contend for the bus which
  connects them.  This is probably a minor issue, but it could be a problem
  in applications with heavy nearest neighbor communication.

  Other Problems with Networked Parallel Processors

  Before you start to plan your network-connected parallel processor, which
  will threaten Cray's dominance of the supercomputer market, here are a
  few things to consider.

  1. Network routing

  One problem which has not been addressed in the discussion of the cube
  architectures is the complexity of managing communication in the cube
  network.  I have not seen a description of the devices which handle
  the interprocessor busses, but I assume that they simply handle access
  and contention.  Once a message arrives, it must be routed to another
  I/O channel if it is not addressed to that processor.  This seems to
  require processor intervention.  After all, routing tables must be
  examined to determine the port through which the message should be sent.
  This processing will introduce some lag at each stop in the network.
  If a separate network processor is not provided to handle network routing,
  it must be done by the CPU which is executing the application code.
  This will be overhead that slows down application execution.

  Lag can also be introduced when several processors route their messages
  through the same intermediate processor.  Presumably the network router
  cannot handle them all at once, and they will have to be queued.
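
  To make the per-hop cost concrete, here is a sketch of the kind of
  routing decision each node would have to make, assuming the usual binary
  n-cube scheme in which the local and destination node numbers are XORed
  and the message is forwarded along a dimension in which they differ.
  This is a textbook scheme, not a description of the actual iNTEL
  hardware or software:

#include <stdio.h>

#define DIMS 6                      /* a 64-node binary cube */

/* One routing decision at a node: XOR the local and destination
 * addresses and forward along the lowest differing dimension.  Each
 * such decision costs processing at that node, and messages arriving
 * faster than they can be forwarded must wait in a per-channel queue. */
static int next_port(unsigned here, unsigned dest)
{
    unsigned diff = here ^ dest;
    int d;

    if (diff == 0)
        return -1;                  /* message is for this node */
    for (d = 0; d < DIMS; d++)
        if (diff & (1u << d))
            return d;               /* forward on channel d */
    return -1;
}

int main(void)
{
    unsigned here = 0, dest = 052;  /* arbitrary (octal) node numbers */
    int port;

    while ((port = next_port(here, dest)) >= 0) {
        printf("node %02o forwards on channel %d\n", here, port);
        here ^= (1u << port);
    }
    printf("delivered at node %02o\n", here);
    return 0;
}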

  2. Broadcasting

  Broadcasting in a network system is difficult.  There must be a special
  message header for messages which are broadcast.  When a processor gets
  such a message, it forwards it to all of the processors it is connected
  to.
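
  One way to keep such a flood from delivering duplicate copies, assuming
  a binary n-cube, is to have each node re-send the broadcast only on
  dimensions higher than the one it arrived on.  The sketch below is a
  standard spanning-tree trick, not a description of any existing cube
  software:

#include <stdio.h>

#define DIMS 3                       /* 8-node cube, small enough to trace */

/* A node that received the broadcast on dimension d re-sends it only
 * on dimensions greater than d, so every node gets exactly one copy. */
static void bcast(unsigned node, int from_dim)
{
    int d;

    printf("node %u has the message\n", node);
    for (d = from_dim + 1; d < DIMS; d++)
        bcast(node ^ (1u << d), d);
}

int main(void)
{
    bcast(0, -1);                    /* originate at node 0 */
    return 0;
}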
  
  All cube processors are connected to an ethernet, but I wonder if
  multicasting is supported.  Even if multicasting is supported, this
  divides communication between two media, one where multicasting is
  supported and one where it is not.  This adds still more complexity to
  the system.

  3. Software
  
  People have been building parallel processors for some time and it is now
  widely realized that the hard part is not constructing the machine, but
  programming it.  Below are listed several areas which must be addressed
  in the software for a network system.

  A. Associating Logical Communication Channels with Physical Communication
     Channels.

  I have not yet received anything from iNTEL on the software for the iNTEL
  cube, so I am still uncertain about how they program it.  According to the
  ACM article on the Cosmic Cube, Cal. Tech. uses communicating parallel
  processes.  I assume that iNTEL does the same.  When process A
  communicates with process B, it does so via a logical communication
  channel.  No processor address is specified.  When the program is loaded
  onto the cube, there must be a mapping between the logical communication
  channels and the physical processor addresses.
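
  The loader might build something like the table sketched below.  The
  structure and the names in it are invented for illustration, since, as
  noted above, I have no information on the actual iNTEL software:

#include <stdio.h>
#include <string.h>

/* Illustrative mapping from a logical channel name, as used by the
 * application processes, to the physical node where the loader placed
 * the destination process.  None of these names come from the iNTEL
 * or Cal. Tech. software. */
struct channel_map {
    const char *logical_name;        /* what process A writes to       */
    unsigned    physical_node;       /* where process B actually lives */
};

static const struct channel_map table[] = {
    { "fft-stage-1", 012 },
    { "fft-stage-2", 013 },
    { "collector",   0   },
};

static int lookup(const char *name, unsigned *node)
{
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].logical_name, name) == 0) {
            *node = table[i].physical_node;
            return 0;
        }
    return -1;                       /* no such logical channel */
}

int main(void)
{
    unsigned node;
    if (lookup("fft-stage-2", &node) == 0)
        printf("logical channel 'fft-stage-2' -> physical node %02o\n", node);
    return 0;
}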

  B. Programming Parallel Processes is Difficult

  A number of people have commented that programming applications composed
  of parallel processes is more like writing operating systems than
  applications as we usually regard them.  A particularly good discussion
  of the problems encountered in programs composed of parallel processes is
  provided in [1].  In the Cosmic Cube article no example was given of an
  application which had heavy interprocess communication.  By not
  discussing this, the author sidestepped an important issue.  If the
  iNTEL cube follows the Cal. Tech. model, I think that it will be very
  hard to program.

  [1] "Exploiting Multiprocessors: Issues and Options" by James McGraw and
      Timothy S. Axelrod, Preprint - Lawrence Livermore National Labs.  
      UCRL-91734 Available from - Technical Information Dept. LLNL, 
      Livermore, CA 94550

  
			 Ian Kaplan
			 Loral Instrumentation
			 (619) 560-5888 x4812
		 USENET: {decvax, ucbvax,ihnp4}!sdcsvax!sdcc6!loral!ian
		 ARPA:   sdcc6!loral!ian@UCSD
		 USPS:   8401 Aero Dr. San Diego, CA 92123

gottlieb@cmcl2.UUCP (Allan Gottlieb) (03/02/85)

Although I would agree that programming parallel processors is
inherently more difficult than programming serial machines, the
additional effort required need not be as great as some contributors
have indicated.  There is significant architectural variation among
parallel machines, and some are easier to program than others.  We
believe that with shared-memory MIMD machines, application programs do
not resemble operating systems.  For example, we have modified the NASA
"weather code" (the GISS atmospheric simulation program).  This is a
several-thousand-line FORTRAN program that doubtless represents
multiple man-years of development.  Our conversion required
considerably under one man-year.  We have done many other largish
scientific codes and have never required nearly as much time to
parallelize the code as the original development took.  Thus for those
problems, the effort of producing parallel programs is bounded by twice
the serial effort.

There is a price to pay for this relative programming ease.
Our design (the NYU Ultracomputer) includes a sophisticated
(and expensive) processor to memory interconnection network
that, among other things, can satisfy multiple simultaneous
memory references to the same word in just the time required
for one such reference.  Since this network will have a
non-negligible component count, the same amount of money will
buy more computing power if spent on a cube or some other
non-shared-memory design.
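
As an illustration of why that network matters to the programmer, consider
a self-scheduled loop in which every processor grabs its next iteration by
performing a fetch-and-add on a single shared counter; the combining
network lets all of them hit that counter at once without serializing.
The sketch below simulates the primitive serially in C, and the names are
mine; only the usage pattern is the point:

#include <stdio.h>

#define N      1000                  /* loop iterations to hand out */
#define CHUNK  1

static int next_index = 0;           /* shared counter */

/* Serial stand-in for the hardware fetch-and-add that the network
 * would satisfy for many processors simultaneously. */
static int fetch_and_add(int *addr, int increment)
{
    int old = *addr;
    *addr += increment;
    return old;
}

/* Each processor runs this loop: grab the next undone iteration with
 * one fetch-and-add, do the work, repeat.  No master processor hands
 * out work, so the code reads like an ordinary loop rather than a
 * piece of an operating system. */
static void worker(double a[])
{
    int i;
    while ((i = fetch_and_add(&next_index, CHUNK)) < N)
        a[i] = a[i] * 2.0;           /* stands in for the real loop body */
}

int main(void)
{
    static double a[N];
    int i;

    for (i = 0; i < N; i++) a[i] = i;
    worker(a);                       /* one "processor" here; more would share next_index */
    printf("a[10] = %.1f\n", a[10]);
    return 0;
}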

On the subject of operating systems, our currently operational
"symmunix" is largely parallel (concurrently executed by multiple
processors) and symmetric (no master-slave relationships).
Ask its principal author (cmcl2!edler) for details.
-- 
Allan Gottlieb
GOTTLIEB@NYU
{floyd,ihnp4}!cmcl2!gottlieb   <---the character before the 2 is an el