peterb@pbear.UUCP (02/17/85)
I have not looked deeply into the design of the 'Cube', as people have called it (i.e. taking on the order of n**3 CPU/memory units and hooking them up in parallel to accomplish a given task), so bear with me.

First of all, rating the speed of any highly parallel system is difficult at the least. You have to take your benchmarks in stride. If I have a matrix problem that can be decomposed into two semi-independent processes, then a VAX-11/782 would execute that program about twice as fast as a VAX-11/780. But if the program to be benchmarked is highly sequential in nature (e.g. nth-order numerical analysis of differential equations), then the 782 and the 780 are going to run at about the same speed. This applies to any parallel architecture, so a new standard of speed measurement is required. I think something along the lines of Data Flow Operations/Second (DFO's -or- Doofoh's) would fit the bill for benchmarking these types of machines. Then if you take the Cube and the Cray and put them both on the same scale, one that reflects their architectures, you can compare speeds; otherwise you are comparing apples to oranges.

Second, any type of parallel machine relies heavily upon the distribution of data from one machine to another. This figures into the overall speed of the machine, since it is the exchange of data between computing devices that drives any type of parallel architecture. The interconnect can be a high- or low-speed architecture such as an Ethernet (serial) or a backplane (parallel), or even a combination of the two (i.e. eight processing units on a backplane with a serial line to connect to other backplanes). This was demonstrated by Cm*, built at CMU: their data showed a severe OVERALL system data-transfer degradation as the amount of non-local I/O increased. This is obvious to almost everybody; Cm* was limited only by the speed of its backplane.

Third, some type of control facility has to run the entire mess.
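The 782-versus-780 observation above is essentially what later became known as Amdahl's law: speedup is limited by the fraction of the work that can actually run in parallel. A minimal sketch (the fractions 0.95 and 0.05 are hypothetical figures chosen to mirror the two examples in the post, not measured values):

```python
# Speedup of a program on p processors when a fraction f of its work
# can run in parallel (Amdahl's law). The serial part (1 - f) runs at
# single-processor speed no matter how many processors you add.

def speedup(f, p):
    """Amdahl speedup: 1 / (serial part + parallel part divided among p)."""
    return 1.0 / ((1.0 - f) + f / p)

# A matrix job that splits into two nearly independent halves:
print(speedup(0.95, 2))   # close to 2x on a two-processor 782

# A highly sequential job (e.g. a stepwise ODE integration):
print(speedup(0.05, 2))   # barely faster than a single 780
```

This is why a single MIPS or FLOPS number misleads for parallel machines: the same hardware scores very differently depending on f.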
This control facility can be slower than the other elements, since it does not require the massive data throughput of a processing element, but it still must have a clean, quick architecture that lends itself to controlling "devices" in a quick and clean manner. Some form of overgrown/homebrew bit-slice design seems optimal in this situation, since some instructions have to be general enough for scheduling algorithms, processing I/O, fielding interrupts, etc., but quick and clean enough to service the resource requests of each data element. Whether this control facility should be distributed or centralized is up for debate these days; different groups have different ideas regarding this.

The idea of a cube is nice, and I think that it is about the fastest architecture around for what it is designed for, but in no way will it compete with a Cray at sequential MIPS. In parallel MIPS the cube would have to be large, but the size would be manageable.

In order to increase and control data throughput, I think that a bus architecture that combines the best of serial and parallel is in order. Each processing unit would be hooked to three busses: one for the X direction, one for the Y direction, and one for the Z direction. This requires 3n**2 busses (n = size of the cube on one side), each with n elements on it. For example, 8 data elements (a cube 2 on each side) require 4 busses in the X direction, 4 in the Y direction, and 4 in the Z direction, for a total of 12 data busses. There would be a total of 3n**3 bus connections within the cube (three per processing element). The advantage of this is that data has to pass through at most two intermediate data elements on its way from source to destination (at most one hop each along X, Y, and Z). Other paths can be created to get the data from node to node, especially if each bus connection had a FIFO on it to queue up transfers. Also, a data element can pass information along from one bus to another with very little overhead.
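The bus counts and the hop bound above can be checked with a short sketch. The helper names here are hypothetical; the counts follow the formulas in the post (3n**2 busses, three bus taps per element), and the route follows the natural dimension-order scheme (fix X, then Y, then Z):

```python
# Sketch of the X/Y/Z bus scheme for an n x n x n cube of elements.

def bus_counts(n):
    """Elements, busses, and bus taps for a cube n on a side."""
    elements = n ** 3
    busses = 3 * n ** 2        # n**2 busses per axis, 3 axes
    taps = 3 * elements        # each element sits on one X, Y, and Z bus
    return elements, busses, taps

def route(src, dst):
    """Dimension-order route: correct X, then Y, then Z.
    Returns the list of nodes visited, including src and dst."""
    path = [src]
    cur = list(src)
    for axis in range(3):
        if cur[axis] != dst[axis]:
            cur[axis] = dst[axis]
            path.append(tuple(cur))
    return path

print(bus_counts(2))                # (8, 12, 24)
print(route((0, 0, 0), (1, 1, 1)))  # worst case: two intermediate elements
```

The worst-case route touches a node in all three axes, so at most two elements sit between source and destination.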
I know this is rough, but if the net kicks the idea around, we may all one day (as a collective group) file for a patent (but I doubt it...)

Peter Barada
ima!pbear!peterb

PS "it's a long day, and it ain't going any faster..."
ian@loral.UUCP (Ian Kaplan) (02/28/85)
In article <7306@watrose.UUCP> cdshaw@watrose.UUCP (Chris Shaw) writes:
>The average communication length is 3, with no bus contention along the way.
>The contention is really hidden in the point-to-point waiting for receive and
>transmit.

I think that things are a little more complex than this. Contention is not entirely nonexistent: two processors can contend for the bus which connects them. This is probably a minor issue, but it could be a problem in applications with heavy nearest-neighbor communication.

Other Problems with Networked Parallel Processors

Before you start to plan the network-connected parallel processor which will threaten Cray's dominance of the supercomputer market, here are a few things to consider.

1. Network routing

One problem which has not been addressed in the discussion of the cube architectures is the complexity of managing communication in the cube network. I have not seen a description of the devices which handle the interprocessor busses, but I assume that they simply handle access and contention. Once a message arrives, it must be routed to another I/O channel if it is not addressed to that processor. This seems to require processor intervention: routing tables must be examined to determine the port through which the message should be sent. This processing will introduce some lag at each stop in the network. If a separate network processor is not provided to handle network routing, it must be done by the CPU which is executing the application code, and that overhead will slow down application execution. Lag can also be introduced when several processors route their messages through the same intermediate processor; presumably the network router cannot handle them all at once, and they will have to be queued.

2. Broadcasting

Broadcasting in a network system is difficult. There must be a special message header for messages which are broadcast.
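For a binary hypercube like the Cosmic Cube, the routing step described in point 1 need not consult a large table: node addresses are bit strings, and a simple rule (sometimes called e-cube or dimension-order routing) forwards a message along the lowest-numbered dimension in which the current address differs from the destination. This is one plausible scheme, not necessarily what iNTEL implemented:

```python
# Dimension-order routing in a d-dimensional binary hypercube.
# Nodes are integers 0 .. 2**d - 1; neighbors differ in exactly one bit.

def next_hop(current, dest):
    """Return the neighbor to forward to, or None if already at dest."""
    diff = current ^ dest
    if diff == 0:
        return None
    low_bit = diff & -diff      # lowest dimension where addresses differ
    return current ^ low_bit

def path(src, dest):
    """Full route from src to dest, one bit corrected per hop."""
    hops = [src]
    while hops[-1] != dest:
        hops.append(next_hop(hops[-1], dest))
    return hops

print(path(0b000, 0b101))   # [0, 1, 5] in a 3-cube: two hops
```

The per-hop decision is a couple of bit operations, but Kaplan's point stands: even this cheap decision must run on something, and if that something is the application CPU, every forwarded message steals application cycles.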
When a processor gets such a message it rebroadcasts it to all of the processors it is connected to. All cube processors are connected to an Ethernet, but I wonder if multicasting is supported. Even if multicasting is supported, this divides communication between two media, one where multicasting is supported and one where it is not, which adds still more complexity to the system.

4. Software

People have been building parallel processors for some time, and it is now widely realized that the hard part is not constructing the machine, but programming it. Below are several areas which must be addressed in the software for a network system.

A. Associating Logical Communication Channels with Physical Communication Channels

I have not yet received anything from iNTEL on the software for the iNTEL cube, so I am still uncertain about how they program it. According to the ACM article on the Cosmic Cube, Cal Tech uses communicating parallel processes, and I assume that iNTEL does the same. When process A communicates with process B, it does so via a logical communication channel; no processor address is specified. When the program is loaded onto the cube, there must be a mapping between the logical communication paths and the physical processor addresses.

B. Programming Parallel Processes is Difficult

A number of people have commented that programming applications composed of parallel processes is more like writing operating systems than applications as we usually regard them. A particularly good discussion of the problems encountered in programs composed of parallel processes is provided in [1]. The Cosmic Cube article gave no example of an application with heavy interprocess communication; by not discussing this, the author sidestepped an important issue. If the iNTEL cube follows the Cal Tech model, I think that it will be very hard to program.

[1] "Exploiting Multiprocessors: Issues and Options" by James McGraw and Timothy S.
Axelrod, Preprint - Lawrence Livermore National Labs, UCRL-91734. Available from: Technical Information Dept., LLNL, Livermore, CA 94550.

Ian Kaplan
Loral Instrumentation
(619) 560-5888 x4812
USENET: {decvax,ucbvax,ihnp4}!sdcsvax!sdcc6!loral!ian
ARPA: sdcc6!loral!ian@UCSD
USPS: 8401 Aero Dr. San Diego, CA 92123
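The flooding broadcast Kaplan describes in his point 2 has a well-known efficient form on a hypercube, recursive doubling: in step k, every node that already holds the message forwards it across dimension k, so all 2**d nodes are reached in d steps. A sketch under that assumption (this is not necessarily the Cosmic Cube's actual protocol):

```python
# Recursive-doubling broadcast in a d-dimensional binary hypercube.
# Nodes are integers 0 .. 2**d - 1; neighbors differ in one bit.

def broadcast_steps(d, root=0):
    """Return (per-step send lists, set of nodes reached)."""
    have = {root}
    steps = []
    for k in range(d):
        # Every node that has the message sends across dimension k.
        sends = [(n, n ^ (1 << k)) for n in sorted(have)]
        have |= {receiver for _, receiver in sends}
        steps.append(sends)
    return steps, have

steps, reached = broadcast_steps(3)
print(len(reached))   # 8: all nodes of a 3-cube, in 3 steps
```

The point about dual media remains: a scheme like this lives entirely on the point-to-point links, while an Ethernet multicast would be a second, separate mechanism with its own failure and ordering behavior.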
gottlieb@cmcl2.UUCP (Allan Gottlieb) (03/02/85)
Although I would agree that programming parallel processors is inherently more difficult than programming serial machines, the additional effort required need not be as great as some contributors have indicated. There is significant architectural variation in parallel machines, and some are easier to program than others.

We believe that with shared-memory MIMD machines, application programs do not resemble operating systems. For example, we have modified the NASA "weather code" (the GISS atmospheric simulation program). This is a several-thousand-line FORTRAN program that doubtless represents multiple man-years of development; our conversion required considerably under one man-year. We have done many other largish scientific codes and have never required nearly as much time to parallelize the code as the original development took. Thus, for those problems, producing parallel programs is bounded by twice the serial effort.

There is a price to pay for this relative programming ease. Our design (the NYU Ultracomputer) includes a sophisticated (and expensive) processor-to-memory interconnection network that, among other things, can satisfy multiple simultaneous memory references to the same word in just the time required for one such reference. Since this network will have a non-negligible component count, the same amount of money will buy more computing power if used for a cube, or other non-shared-memory design.

On the subject of operating systems, our currently operational "symmunix" is largely parallel (concurrently executed by multiple processors) and symmetric (no master-slave relationships). Ask its principal author (cmcl2!edler) for details.
--
Allan Gottlieb
GOTTLIEB@NYU
{floyd,ihnp4}!cmcl2!gottlieb   <--- the character before the 2 is an el
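The programming ease Gottlieb describes comes largely from self-scheduling loops: each processor atomically grabs the next undone loop index (the Ultracomputer's combining network makes the fetch-and-add itself contention-free in hardware), so the parallel code stays close to the serial FORTRAN loop. A sketch with Python threads, where a lock stands in for the hardware fetch-and-add:

```python
# Self-scheduling parallel loop via fetch-and-add.
import threading

N = 1000
results = [0] * N
counter = 0
counter_lock = threading.Lock()

def fetch_and_add():
    """Atomically return the old counter value and increment it.
    On the Ultracomputer this would be a single combined memory op."""
    global counter
    with counter_lock:
        old = counter
        counter += 1
        return old

def worker():
    while True:
        i = fetch_and_add()       # claim the next undone iteration
        if i >= N:
            return
        results[i] = i * i        # the loop body (a toy computation here)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[:5])   # [0, 1, 4, 9, 16]
```

Nothing here resembles an operating system: the transformation from the serial loop is mechanical, which is the contrast Gottlieb draws with message-passing designs.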