[comp.parallel] Karp/Gordon Bell Award for 1987

eugene@pioneer.arpa (Eugene N. Miya) (04/06/88)

The following is reprinted with permission.

The following file comes from Alan Karp, who graciously presents it to
the net.  The file contains the press release, title, abstract, summary,
etc., as well as directives for (human) typesetters and editorial
direction.  I've removed a few of the latter.

>From the Rock of Ages Home for Retired Hackers:

--eugene miya, NASA Ames Research Center, eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
==========================================================================

Three places to send the press release would be

Science - you can find the address
Physics Today, American Institute of Physics, 500 Sunnyside Blvd.,
    Woodbury, NY 11797
Computers in Physics - same as above

In addition, there are a number of Society published magazines.
Check with your resident chemists, biologists, engineers, etc.
to get their favorites.

Addresses are:

Jack Dongarra
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue
Argonne, Illinois 60439

Alan Karp
IBM Systems Center
18100 Frederick Pike
Gaithersburg, Maryland 20879

-----------------------------------------------------------------------

==========================================


AWARDS.TYP
Gordon Bell Awards
IEEE Software
May 1988
22225





Dongarra, Karp, and Kennedy judged the 1987 Gordon Bell Awards.
Dongarra is a numerical analyst at the Mathematics and Computer Science
Division at Argonne National Laboratory; Karp, who chaired
the judging committee, is a staff member at the IBM Palo Alto Scientific
Center currently on assignment at the IBM Washington Systems Center in
Gaithersburg, Maryland;
Kennedy is a professor of computer science at Rice University.

1987 Gordon Bell Awards: Winners achieved speedup of 400
Jack Dongarra, Alan Karp, and Ken Kennedy

     The Gordon Bell Awards recognize outstanding achievement in
the application of supercomputers to scientific and engineering problems.
In 1987, these awards were for the largest speedup on multiple-
instruction, multiple-data computers
in two categories of machines: general- and special-purpose.
Six of the nine entries met the rules for the competition.
     Gordon Bell, vice president of engineering at Ardent Computer in
Sunnyvale, Calif., will sponsor two $1,000 awards each year for 10
years to promote practical parallel-processing research. "Utility is
the key," he said at this year's awards presentation, held March 1 in
San Francisco at the Compcon conference.
     He made the offer to sponsor the awards during an interview with
[io]IEEE Software[ix] conducted while Bell managed the Computer and
Information Science Directorate at the National Science Foundation
("Parallel, Distributed Processing Lead NSF Software Research
Directions," Soft News, July, pp. 102-104). [io]IEEE Software[ix]
administers the awards and judging.
     The winning entry in the general-purpose computer category was
submitted by Robert Benner, John Gustafson, and Gary Montry of the
Parallel Processing Division of Sandia National Laboratory in
Albuquerque, N.M. They used a 1,024-processor NCUBE to run three
production applications with speedups ranging from 400 to more than
600 times. The judges had expected the winning speedup to be about 50.
The applications were a beam stress analysis, a surface-wave
simulation, and an unstable fluid-flow model. Bell presented a $1,000
check to the Sandia team, which has donated the award to The Association
for Retarded Citizens of Albuquerque.
     The type of work that won the prize is not normally undertaken
at Sandia. Montry told the judges that the possibility of
winning the Bell Award and Karp's challenge (see sidebar) were
key factors in convincing his management to expend resources on
this research.
     There were no acceptable entries in the special-purpose category,
so Bell decided to divide the money among several groups, although some
of them did not meet the criteria specified in the rules. "It's my money,
and I can do whatever I want with it," he said.
     Second place in the competition went to Marina Chen of Yale
University, Erik Benedictus of AT&T Bell Laboratories, Geoffrey Fox of
the California Institute of Technology, Jingke Li of Yale, and David
Walker of Caltech for work done on the Caltech and NCUBE hypercubes.
They submitted results from three programs: a computational kernel
with a speedup of 98, a quantum-chromodynamics calculation with a
speedup of 458, and a circuit simulation with a speedup of 39. Bell
decided to give $300 to this group as a special award.
     Bell gave one special award for $500 to Robert Chervin of the
National Center for Atmospheric Research in Boulder, Colo., for a
global ocean-climate model running at 450 Mflops on all four
processors of a Cray X-MP/48. This work is impressive in two respects:
(1) It is a production code running in parallel and (2) it achieves a
significant fraction of the peak performance of the machine. The
judges decided this entry deserved an honorable mention.
     Although the rules for the competition limited consideration to
multiple-instruction, multiple-data machines, both the judges and Bell
thought that the submission by Stavros Zenios of the Wharton School at
the University of Pennsylvania was worthy of recognition. Zenios used
a 16,384-processor Connection Machine to solve nonlinear network-
optimization problems. Although it is impossible to measure parallel
speedup on such a machine, Zenios was able to solve a problem in 1.5
seconds that ran longer than 92 seconds on one processor of an IBM
3090-600. Bell decided to give Zenios a special award of $200.

     [bo]First place.[bx] The Sandia researchers did their work on an
NCUBE hypercube multicomputer with 1,024 processors. Each processor
has 512 Kbytes of private memory and is directly connected to 10 other
processors. Data is shared among processors by explicitly sending
messages from one processor to another. Each processor is capable of
about 80 Kflops.
     One problem the Sandia team addressed is measuring speedup.
Speedup is the time it takes to solve the problem on one processor
divided by the time it takes to solve the problem on all the
processors. But there are two difficulties with this definition.
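     (In symbols, using our notation rather than the article's: the
speedup on P processors is S(P) = T(1)/T(P), where T(P) is the time to
solve the problem on P processors. Both difficulties below concern
obtaining a meaningful T(1).)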
     To get a reliable measurement, the program must run for at least
a few minutes. However, a problem that runs one minute on 1,024
processors would take tens of hours on one processor. The Sandia team
solved this problem by running a program that takes at least 15
seconds on 1,024 processors and about five hours on one processor.
     The second problem is more severe. Although the total memory of a
multicomputer is large, each processor has access to only a modest
amount of the total. Thus, the largest problem that fits on one
processor uses only a small part of the full configuration.
     A key factor in the performance of a multicomputer is the ratio
of the time spent computing to the time spent communicating;
the machine is only doing useful work when it is computing.
Large problems typically have a larger fraction of computation
than do small problems. Therefore, a problem small enough to fit on
one processor will be compute-bound on a small number of processors but
communication-bound on the full system.
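     A rough surface-to-volume argument (ours, not the entrants') shows
why. For a two-dimensional stencil code with an n-by-n subgrid on each
processor, the arithmetic per time step grows as n^2 while the boundary
data exchanged grows only as n, so

     T(compute) / T(communicate)  ~  n^2 / n  =  n,

and the ratio improves as the per-processor problem gets larger.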
     The Sandia team solved this problem by following two curves (which
they called the research line and application line) in the problem-
size versus number-of-processors plane. The research line runs the
largest problem that fits on one processor on different numbers of
processors. The numbers submitted for the competition follow this
research line. These speedup numbers represent a lower bound on what
is achievable because the problem is very small relative to the full
system.
     The Sandia team also presented timings for the application line,
where problem size increases as more processors are used. The
application line is more representative of how multicomputers are used
in production work. By measuring the idle time of each processor, they
could estimate the speedup they would get if the large problem fit on
one processor. They have validated this model and have found the
predictions are correct to within a few percent.
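     A minimal sketch of how such an estimate can be formed (ours, in
Python with made-up numbers; the Sandia team's actual model is not
reproduced in the article):

    # Two ways of looking at speedup on a P-processor multicomputer.
    # Illustrative only; the numbers below are invented, not contest data.

    def fixed_speedup(t_one, t_all):
        """Classic fixed-size speedup: T(1) / T(P) for the same problem."""
        return t_one / t_all

    def scaled_speedup_from_idle(idle_fractions):
        """Estimate along the 'application line': if processor k is busy a
        fraction (1 - idle_k) of the run, it contributes that much of a
        fully used processor to the effective speedup."""
        return sum(1.0 - f for f in idle_fractions)

    print(fixed_speedup(18000.0, 40.0))                 # ~450
    print(scaled_speedup_from_idle([0.004] * 1024))     # ~1019.9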
     All three programs submitted by the Sandia team showed a speedup
of more than 400. While one such result might be considered a special
case, the weight of evidence is significant. The judges were
particularly impressed with these results because they expected the
winning speedup to be 50 or less.
     One program, the beam strain-analysis problem, looked at the two-
dimensional deflection of a beam fixed at one end. The program uses a
finite-element formulation with a matrix-free, preconditioned,
conjugate-gradient algorithm. This algorithm is frequently used on
supercomputers when memory constraints prevent the use of more
efficient methods. The most time-consuming part of this approach is
the accumulation of dot products in which two vectors are
multiplied element by element and the resulting products added to
produce a scalar.
     The beam is divided into 2,048 bilinear finite elements -- the
largest problem that will fit on one node. Each processor is assigned
a rectangular subdomain. Synchronization occurs at two points during
each iteration. First, each processor must exchange boundary
information with the four processors sharing edges with it. Second,
five dot products must be computed and the results distributed to each
processor.
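     The dot products are what force global communication: each processor
can form its partial sum locally, but the partials must then be combined
across the whole machine. A minimal Python sketch of the idea (a serial
simulation, ours, not NCUBE message-passing code):

    import numpy as np

    P = 8                                    # pretend processors
    x = np.random.rand(1024)
    y = np.random.rand(1024)

    # Each "processor" holds a slice of both vectors and forms a local
    # partial dot product with no communication at all.
    partials = [np.dot(a, b)
                for a, b in zip(np.array_split(x, P), np.array_split(y, P))]

    # The combine step is the synchronization point; on a hypercube it is
    # log2(P) pairwise exchanges, here simply a serial sum.
    result = sum(partials)

    assert np.isclose(result, np.dot(x, y))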
     Following the research line, the Sandia team found a speedup of
more than 452 for this code.  Equally interesting was the
application line. Running a problem with a 64[xm]32 grid on one
processor takes 1,614 seconds. A problem with 64[xm]32 finite elements
per processor on 1,024 processors (2 million elements) takes 1,619
seconds. The nearly constant execution time as the problem size
increases indicates that their scaled speedup of 1,021 is reasonably
accurate.
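     As a quick consistency check (our arithmetic, assuming the parallel
overhead stays roughly constant as the problem is scaled up): 1,024 x
1,614 s / 1,619 s is approximately 1,021, which matches the scaled
speedup quoted above.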
     The second application submitted by the Sandia team tracks two-
dimensional waves as they move past a set of barriers. The program can
handle any configuration of barriers and wave velocities that depend
nonlinearly on position and wave state.
     The wave equation is discretized using a five-point stencil in
space and an explicit time-step scheme. The domain is divided into
rectangular blocks, and boundary information is passed to adjoining
processors on each time step. To reduce communications overhead, the
Sandia team used a special quadruple buffering technique.
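     For readers unfamiliar with the method, here is a serial sketch of
one explicit five-point-stencil time step (ours; constant wave speed and
periodic boundaries for brevity, unlike the Sandia code):

    import numpy as np

    def wave_step(u_prev, u, c, dt, dx):
        """One explicit central-difference step of u_tt = c^2 (u_xx + u_yy).
        The five-point Laplacian needs each point's four neighbors, which
        is why a parallel version must first exchange block edges."""
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u) / dx**2
        return 2.0 * u - u_prev + (c * dt)**2 * lap

    u_prev = np.zeros((64, 64))
    u = np.zeros((64, 64))
    u[32, 32] = 1.0                          # an initial disturbance
    for _ in range(10):
        u_prev, u = u, wave_step(u_prev, u, c=1.0, dt=0.1, dx=1.0)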
     Because this program does relatively few computations per
communication, the researchers used some assembly language to improve
the speedup. First, they wrote a routine to transfer data. This
routine helped most when noncontiguous data was to be transferred.
Second, they wrote the inner loop of the update in assembly language
to reduce the relative importance of loop start-up.
     Following the research line, the Sandia team reported a speedup
of 637, while the scaled results along the application line indicated
a speedup of 1,020. These scaled results appear to be reliable because
a 4,096-point problem run on one node took 12,780 seconds while a
4,096-point-per-node problem (4 million points) took only 12,824
seconds.
     The third problem submitted by the Sandia team was a
computational fluid-dynamics program that uses a flux-corrected
transport algorithm. This approach uses second- or fourth-order spatial
discretizations where the flow is smooth and a low-order scheme in
regions of large gradients. The time behavior is followed
explicitly. The test problem follows the development of a Kelvin-
Helmholtz instability in a shear flow.
     This code is nearly perfectly parallel. As in the previous
problems, each processor must share boundary information with its
neighbors on each time step. In addition, each processor determines
the allowed time step based on the grid points in its subdomain. These
local time steps must be combined to produce a single step size, and
this value is broadcast to all processors.
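     The step-size logic is a global minimum reduction followed by a
broadcast. A trivial simulated sketch (ours):

    # Each "processor" proposes the largest stable time step for its own
    # subdomain (for example, from a CFL condition).  The step everyone
    # actually takes is the global minimum; on the NCUBE the combine and
    # the broadcast each cost log2(P) message hops.

    local_dts = [0.42, 0.40, 0.45, 0.39]     # made-up per-subdomain limits
    dt = min(local_dts)                      # every processor then uses 0.39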
     Speedup along the research line reached 519, while the scaled
result along the application line is 1,016. A 32[xm]32 grid run on one
processor took 28,039 seconds; a 32[xm]32 grid for each of 1,024
processors took 28,258 seconds. As before, the overhead of running in
parallel is small.

     [bo]Second place.[bx] Second place went to the group from Yale,
Bell Labs, and Caltech. This group submitted three programs running on
different hypercubes. One program was a computational kernel,
an LU decomposition used to solve a system of linear equations,
and will not be described here, since kernels were explicitly excluded
from consideration in the rules.
     Another calculation submitted was the computation of the
potential energy of a quark and an antiquark using quantum-
chromodynamics theory to model the quark behavior.
The computational solution to the problem uses lattice-gauge theory.
A gauge theory is one that is defined in terms of a metric,
a measure of length that is independent of the observer's frame
of reference.
With a gauge theory,
a problem, such as computing the mass of the proton,
can be reduced to a multidimensional integral with each
dimension of the integral
representing a different path a particle can take through space-time.
     The numerical solution to a problem is obtained by
discretizing space-time on a four-dimensional grid,
the lattice of the lattice-gauge theories.
Each lattice
point has a set of variables representing the quarks associated with
it; the links between points contain information on gluons (which bind
quarks).
     A step of the Monte Carlo procedure to evaluate multidimensional
integrals involves randomly changing the variables on one link. This
change modifies the corresponding lattice variables. After many changes
have been made, it is possible to determine several physical
quantities from the statistical fluctuations of these variables.
     Each processor is assigned a part of the lattice. At each step,
all links that are not connected to a common lattice point can be
updated simultaneously. Because each processor runs at its own rate,
the only way to assure that this condition is met is to synchronize at
each step. Furthermore, if the link data is held by one processor and
the lattice point variables by another processor, a communications
step is needed. When run on many processors, data on most links is
sent between a pair of
processors at least once on each sweep over the lattice.
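     The constraint that simultaneously updated links share no lattice
point can be met by coloring the links. A toy one-dimensional
illustration (ours; the real code colors links of a four-dimensional
lattice):

    # Links of a periodic 1-D lattice with N sites: link i joins sites i
    # and i+1 (mod N).  Even-numbered and odd-numbered links form two
    # groups; within a group no two links touch a common site, so each
    # group can be updated concurrently, with a synchronization between
    # the groups.

    N = 8
    links = [(i, (i + 1) % N) for i in range(N)]
    groups = [links[0::2], links[1::2]]

    for batch in groups:
        sites = [s for link in batch for s in link]
        assert len(sites) == len(set(sites))   # no site appears twice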
     A further synchronization is needed because of an artifact of the
implementation. Each update involves multiplying 3[xm]3 complex
matrices. To save memory and time, these calculations are carried out
using 16-bit integers instead of floating-point numbers. The round-off
errors associated with the multiplications are large enough to destroy
the accuracy of the calculation unless a correction is applied. This
correction is applied after every second sweep through the lattice.
     The Yale/Bell Labs/Caltech researchers submitted a problem
discretized on a 24[xm]24[xm]24[xm]48-point lattice. They calculated
how the potential energy of a quark/antiquark pair varies with their
separation. They showed that, when the quarks are close, the potential
energy follows Coulomb's Law, as would a proton pair or electron pair.
(Coulomb's Law says the force of repulsion between two like
charges of electricity concentrated at two points is proportional to
the product of their magnitudes and inversely proportional to the
square of the distance between them.)  At larger separations, the
potential energy rises logarithmically with separation. Such an energy
law implies that it would take an infinite amount of energy to
separate these particles. A speedup of 458 was obtained on a 512-
processor Caltech hypercube.
     Their second application was a circuit simulation that predicts the
time behavior of a circuit in response to a set of inputs. At each time
step, each circuit element takes its input voltages and computes a set
of output voltages. The problem is made parallel by dividing the
circuit elements among the processors. The difficult part -- making
sure that each circuit element uses the correct input voltages -- is
handled by using queues.
     The calculation cannot proceed completely asynchronously because
some circuit elements require more computation than others to update
their output voltages. During an arbitrarily long run, the input
queues of the slowest elements would certainly overflow. One way to
handle this problem is to run the simulation in segments of some fixed
number of time steps. At the end of each segment, the processors
synchronize (without passing any data) and then continue. Using such a
programming style, the Yale/Bell Labs/Caltech group reported a speedup
of 39 on a 127-processor NCUBE hypercube.
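     A heavily simplified sketch of that programming style (ours; the
real simulator spreads elements across processors and passes voltages in
messages rather than through shared queues):

    from collections import deque

    class Element:
        """A circuit element with an input queue and downstream elements."""
        def __init__(self):
            self.inputs = deque()
            self.outputs = []                # elements this one drives

        def step(self):
            v_in = self.inputs.popleft() if self.inputs else 0.0
            v_out = -v_in                    # toy "inverter" model
            for elem in self.outputs:
                elem.inputs.append(v_out)

    def run(elements, segments, steps_per_segment):
        for _ in range(segments):
            for _ in range(steps_per_segment):
                for e in elements:           # concurrent on the hypercube
                    e.step()
            # Barrier here: no data is exchanged, but no element may start
            # the next segment early, so no input queue grows without bound.

    a, b = Element(), Element()
    a.outputs.append(b)
    a.inputs.extend([1.0, 0.0, 1.0])
    run([a, b], segments=3, steps_per_segment=1)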

     [bo]Honorable mention.[bx] Robert Chervin of the National Center
for Atmospheric Research won an honorable mention for the global
ocean-climate model he parallelized with the help of Albert Semtner of
the Naval Postgraduate School in Monterey, Calif. This program is used
to study such large-scale ocean phenomena as El Niño episodes in
the Pacific Basin. (An El Niño is an unusual change in sea-surface
temperatures that shifts cloud-cover and rainfall patterns across the
Pacific Ocean, resulting in drought, flooding, and altered fish
migration.) The program is also used to produce input data for the climate
models run to predict such things as the long-term impact of the
greenhouse effect (the rise in global air temperature caused by the
retention of solar energy by carbon dioxide, which is added to the
atmosphere by industrial processes).
     The ocean-model program uses a standard, second-order, finite-
difference discretization in three spatial dimensions and a leapfrog-
explicit time-step procedure.
In a leapfrog procedure, the unknowns at time n+1 are obtained from the
unknowns at time n-1 plus a change computed from the solution at time n.
This method has higher accuracy than a single-step method but may
become unstable.
The leapfrog method also
requires more memory than single-step methods
since space for the solution
is needed at three time levels (n-1, which supplies the base values to
which the change is added; n, from which the change is computed; and
n+1, which is the desired result).
These disadvantages are more than compensated for by the larger time
steps allowed by the higher accuracy.
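     Written out (our notation), the leapfrog update for an unknown u
with rate of change F is

     u[n+1] = u[n-1] + 2 * dt * F(u[n]),

which is why storage for the three time levels n-1, n, and n+1 must be
held at once.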
     The problem for which Chervin was recognized used a
grid made up of a point every 0.5
degree in latitude and longitude and 20 points in height for a
total of 4 million grid points. Four variables are computed at each
grid point.
Because the time-stepping procedure requires data from three
consecutive times, the problem is far too large to be contained in the
main memory of a Cray. The program uses 6 Mwords of main memory and 60
Mwords on the Cray SSD, a large, relatively slow memory used as a fast
I/O device. On each time step, more than 100 Mwords are transferred to
and from the SSD.
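     A rough consistency check (our arithmetic, assuming one word per
variable per grid point): 4 million points x 4 variables x 3 time levels
is about 48 Mwords, which fits in the 6 Mwords of main memory plus 60
Mwords of SSD but comes nowhere near fitting in main memory alone.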
     In spite of the large amount of I/O, this program shows a speedup
of 3.74 on a four-processor Cray X-MP/48 using Cray microtasking. For
the runs cited in the awards entry, Chervin used 280 tasks. The
sustained processing rate of 450 Mflops is more than half the
theoretical peak of 880 Mflops. This speedup is considerably higher
than others have obtained.
     One of the limiting factors for parallel performance on the Cray
is usually memory-bank conflicts that occur when one task accesses a
memory bank needed by another task. Most researchers believe that
nothing can be done about this problem, but Chervin was able to
eliminate virtually all such bank conflicts.
The trick is to set up the data so each task uses a
distinct set of memory banks. It may be necessary to do some extra
data movement, as Chervin does, but the gain in memory access rates
clearly exceeds the cost.
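     A toy model of the idea (ours; real Cray banking and the actual data
layout Chervin used are considerably more involved):

    # Words are interleaved across NBANKS memory banks: bank = address mod
    # NBANKS.  If two tasks walk through address sequences that map to
    # disjoint sets of banks, they never contend; if the sequences map to
    # the same banks, every simultaneous access is a potential conflict.

    NBANKS = 16

    def banks_touched(start, count, stride=1):
        return {(start + i * stride) % NBANKS for i in range(count)}

    # Layout A: both tasks sweep through every bank -- maximal overlap.
    print(len(banks_touched(0, 1000) & banks_touched(512, 1000)))    # 16

    # Layout B: strides and offsets chosen so each task owns its own bank.
    print(len(banks_touched(0, 1000, NBANKS) &
              banks_touched(1, 1000, NBANKS)))                       # 0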
     Chervin's success with this program is not an isolated event. He
has also achieved a speedup of 3.7 on a climate model that uses a
spectral method for the horizontal spatial discretization. Here, too,
he was able to eliminate most of the intertask memory bank conflicts
by carefully distributing his data. This spectral algorithm is not
vectorizable, so the raw speed is not as impressive as the ocean
model's. On the other hand, this program is 99.5-percent parallel.

     [bo]Special award.[bx] A special award was given to a program
that did not meet the eligibility criteria. However, the judges and
Bell were so impressed with the results that they awarded a special
prize to Stavros Zenios of the Wharton School at the University of
Pennsylvania. Zenios solved several problems in nonlinear network
optimization on a single-instruction, multiple-data CM-1 Connection
Machine.
     Nonlinear network optimization involves minimizing a strictly
convex, continuously differentiable, real-valued function subject to a
set of linear constraints. These problems are difficult because the
spatial dimension is high, making the resulting matrices extremely
large. Fortunately, these matrices are also extremely sparse. This
means that if one variable is changed, only a small subset of the
equations must be reevaluated to account for the change.
     A standard approach is to change one variable at a time
(coordinate-wise minimization) until a global minimum is achieved. You
can think of the nonlinear system as a graph. Each node in the graph
represents an unknown. An arc joins two nodes, say [io]i[ix] and
[io]j[ix], if variable [io]j[ix] appears in equation [io]i[ix]. When
variable [io]j[ix] is changed, only equations explicitly containing it
must be updated.
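     A small serial sketch of coordinate-wise minimization that exploits
the sparsity described above (ours, with a toy objective; it is not
Zenios's algorithm):

    import random

    # Toy objective: f(x) = sum over arcs (i, j) of (x[i] - x[j] - b[i,j])^2.
    # Only the arcs incident to a variable involve it, so changing one
    # variable requires re-evaluating only those terms.
    arcs = [(0, 1), (1, 2), (2, 3), (0, 3)]
    b = {a: random.uniform(-1.0, 1.0) for a in arcs}
    x = [0.0, 0.0, 0.0, 0.0]
    touching = {v: [a for a in arcs if v in a] for v in range(len(x))}

    def local_cost(v, xv):
        """Cost of the arcs incident to v with x[v] set to a trial value."""
        total = 0.0
        for (i, j) in touching[v]:
            xi = xv if i == v else x[i]
            xj = xv if j == v else x[j]
            total += (xi - xj - b[(i, j)]) ** 2
        return total

    for sweep in range(100):
        for v in range(len(x)):             # one coordinate at a time
            trials = [x[v] - 0.05, x[v], x[v] + 0.05]
            x[v] = min(trials, key=lambda t: local_cost(v, t))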
     Zenios used this fact to map the problem onto the CM-1. Each arc
in the graph is associated with two processors. On each iteration, all
variables are updated simultaneously. Next, each equation is evaluated
by combining the results of all processors associated with each
unknown using a special function on the CM-1 called a segmented-plus-
scan, which forms a set of running sums. Finally, the results of this
scan are distributed to all the nodes.
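     Segmented plus-scan is well defined off the Connection Machine too;
a small serial Python version (ours) shows what it computes:

    def segmented_plus_scan(values, segment_starts):
        """Running (inclusive) sums that restart wherever segment_starts is
        True.  On the CM-1 this runs in parallel across the processors; the
        serial loop here only shows the result it produces."""
        out, running = [], 0
        for v, start in zip(values, segment_starts):
            running = v if start else running + v
            out.append(running)
        return out

    # Two segments, [1, 2, 3] and [10, 20]:
    print(segmented_plus_scan([1, 2, 3, 10, 20],
                              [True, False, False, True, False]))
    # -> [1, 3, 6, 10, 30]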
     This procedure appears to be inefficient in that it uses two
processors per unknown and does some computations twice. However, the
resulting reduction in communication overhead makes it run very fast.
About 35 percent of the time is spent communicating. On three
different problems, the CM-1 needed one, eight, and 29 seconds, while
an IBM 3090 uniprocessor needed 93, 109, and 83 seconds for the same
jobs. The CM-2 Connection Machine, with floating-point hardware, should
be even faster.

     [bo]Other entries.[bx] Four other entries were judged eligible
for consideration. Although they did not win, they represent work
worthy of commendation.
     Paul Fischer of the Massachusetts Institute of Technology
submitted measurements of NEKTON, a commercial fluid-dynamics and
heat-flow package, running on an Intel iPSC-VX hypercube. The package
uses the so-called "p" finite-element method, sometimes called the
spectral-element method. Convergence is obtained by increasing the
order of approximation on each finite element instead of increasing
the number of finite elements as done in the "h" finite-element
methods. The resulting linear systems are solved with a diagonally
preconditioned conjugate-gradient algorithm.
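     For reference, a compact serial, dense sketch of a diagonally
preconditioned conjugate-gradient solver (ours; NEKTON's matrix-free
spectral-element version is far more elaborate):

    import numpy as np

    def diag_precond_cg(A, b, tol=1e-10, max_iter=200):
        """Jacobi (diagonal) preconditioned CG for A x = b, with A symmetric
        positive definite.  Dense and serial purely for clarity."""
        x = np.zeros_like(b)
        r = b - A @ x
        minv = 1.0 / np.diag(A)              # the diagonal preconditioner
        z = minv * r
        p = z.copy()
        rz = r @ z
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                break
            z = minv * r
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(diag_precond_cg(A, b))             # about [0.0909, 0.6364]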
     Fischer's results point out one of the problems with the contest
rules. He got a speedup of 7.2 running with 32 vector processors. When
he turned off vector processing, he achieved a speedup of 16. However,
the execution time of the nonvector job was almost 100 times longer
than the vector job. The judges have revised the rules to take this
into account; the revised rules appear in the [bo]box on p. XX[bx].
     Richard Roloff of the Center for Supercomputing Research and
Development at the University of Illinois at Urbana-Champaign
parallelized a program used heavily by the university's Chemical
Engineering Department. It models turbulent, incompressible flow with
a pseudospectral algorithm. The compiler did most of the
parallelization by vectorizing inner loops and parallelizing outer
loops. In addition, compiler directives were used so subroutine calls
could be executed in parallel. Roloff achieved a 6.5 speedup on an
eight-processor Alliant FX/8.
     Richard Hessel of Alliant Computer Systems submitted measurements
of ANSYS, a general-purpose finite-element package, on an Alliant
FX/8. He ran a standard benchmark data set called S4. This linear,
static-analysis problem requires 4,000 elements and has more than
24,000 unknowns.
     Hessel achieved a speedup of 5.0 on the eight-processor Alliant.
Most of the parallelism was extracted by the compiler, with the
exception of one routine. He recoded the rank-[io]n[ix] update routine
to use a block algorithm better suited to the Alliant architecture.
     As is often true, optimizing for parallelism leads to other
improvements. Hessel reports that the modified program runs 1.5 times
faster on one processor than the best previous code.
     David George and Frederica Darema-Rogers of IBM Research in
Hawthorne, N.Y., submitted a parallelized version of AIR3D, a three-
dimensional fluid code from the National Aeronautics and Space
Administration. This program models flow past a complex surface,
including boundary layers. It is heavily used to model the space
shuttle.
     The most time-consuming part is the solution of large linear
systems, which are solved using an alternating-direction-implicit
scheme on planes. On each time step, the system is decoupled into a
set of two-dimensional problems by assuming the unknowns in the
[io]x[ix] direction are known and solving the resulting set of two-
dimensional problems. Next, the unknowns in [io]y[ix] are fixed, and a
set of two-dimensional problems solved. Finally, the [io]z[ix]
unknowns are fixed. This solution is done for each iteration of the
nonlinear solution step.
     The IBM group used EPEX, an experimental system for use within
IBM, to parallelize at the loop level. On each of the three steps of
the iteration, the solutions of the two-dimensional problems are
independent and were run in parallel. They ran on an IBM 3090 with six
vector processors and achieved a speedup of 4.9.



[hs5]1988 rules

[ts4][ll19]    The Gordon Bell Awards recognize achievements in large-
scale scientific computing. The 1988 Awards will be given in two of
three categories:
     1. Performance: The entrant will be expected to convince the
judges that the submitted program is running faster than any other
comparable engineering or scientific application. Suitable evidence
will be the megaflop rate based on actual operation counts, or the
solution of the same problem with a properly tuned code on a machine
of known performance such as a Cray X-MP. If neither of these
measurements can be made, the submitter should document the
performance claims as well as possible.
     2. Price/performance: The entrant must show that the
performance of the application divided by the cost of the system is
better than any other entry. Performance measurement will be evaluated
as for the performance prize. For the purposes of this contest, cost
will be the list price of the computational engine (CPUs, including
enough memory to run the problem). Peripherals and software need not
be included in the price.
     3. Compiler parallelization: The compiler/application that
generates the best speedup will be the winner. Speedup will be
measured by dividing the execution time of the program compiled
without automatic parallelization by the execution time with automatic
parallelization. A third run may be submitted that uses compiler
directives to improve parallelization. The judges will weight these
two parallel runs to determine the winner.
     There are some general conditions:
     1. The submitted program must have utility -- it must solve a
problem that is considered a routine production run, such as making
daily weather predictions or solving an important engineering or
scientific problem. It should not be a contrived or experimental
problem that is intended to show speedup.
     2. Entrants in the price/performance category must submit the
output from the Linpack 100[xm]100 benchmark showing a speed of at
least 5 Mflops.
     3. In all cases, the burden of proof is on the contestants.
The judges will make an honest effort to compare results of different
programs running on different machines, but will depend primarily on
the submitted material.
     4. Contestants should describe their entries in a three- to
four-page executive summary and submit this to Ted Lewis, Editor-in-
Chief, [io]IEEE Software[ix], c/o Computer Science Dept., Oregon State
University, Corvallis, OR 97331. The deadline is December 31, 1988. If
they need additional information, the judges will contact the
contestants.


[hs5]Sandia team wins Karp challenge
[ts4][ll25]    In 1985, Alan Karp of IBM issued a challenge in
a number of places, including NANET (the Numerical Analysis Network),
the SIGNUM Newsletter, and the ACM Forum,
saying he would pay $100 to the first person to demonstrate a
speedup of at least 200 on a general-purpose, multiple-instruction,
multiple-data computer used for scientific applications. The offer was
good through Dec. 31, 1995.
     In March, Karp awarded the $100 to the Sandia National Laboratory
team that won the 1987 Gordon Bell Award for parallel-processing
speedup. They donated the award to the Association for Retarded
Citizens in Albuquerque, N.M.