[comp.parallel] Parallel Paradigms for BBN Butterfly

huff@svax.cs.cornell.edu (Richard Huff) (02/23/90)

About a week ago I sent out a request for information on alternative
parallel processing paradigms/packages for the BBN Butterfly.  (That is,
alternatives to the Uniform System.)  Here is a summary of what I learned.

*******************************************************************************

From: reiher@onyx.Jpl.Nasa.Gov (Peter Reiher)

The Time Warp Operating System runs on the Butterfly, currently only under
Chrysalis.  It runs discrete event simulations in parallel, though you could
probably use it for a few other things, if they fit the paradigm.  Basically,
you have to be able to decompose your problem into subcomponents (objects) that 
communicate only via messages, and you have to be able to assign a timestamp
to each event performed (messages cause events) indicating the event's 
relative order in the simulation.  Time Warp then runs all objects on a node
independently of all other nodes' objects.  This can cause out-of-order
processing, in which case Time Warp uses rollback and message cancellation
to automatically correct its mistake.

TWOS can be obtained through NASA's Cosmic software distribution system, but
we were recently told that they intend to charge $1500 for a copy, which doesn't
seem reasonable to us.  I'm not sure if we can get them to lower it.  The
price does include source code.

The most important papers are

"Virtual Time", David Jefferson, ACM Transactions on Programming Languages
and Systems, Vol. 7, No. 3, July 1985.

and

"Distributed Simulation and the Time Warp Operating System", Jefferson et
al., ACM Operating Systems Review, Vol. 20, No. 4, 1987.

The idea is basically to synchronize solely by rollback and message
cancellation, never by blocking.  Each node independently chooses which of
its local objects to run next, basing its decision on which local object
has the message with the earliest timestamp.  Other nodes may simultaneously
be choosing work with an earlier timestamp, work that may produce further
messages to be sent to this node.  Those messages could have earlier 
timestamps than the message this node just chose, in which case the work
about to be done will eventually be rolled back, and all of its effects
undone.

It is an unusual way to do things, but, with suitable constraints, it works,
and works quite well.

*******************************************************************************

From: mrg@Princeton.EDU (Mark Greenstreet)

You might want to contact Robert Fowler at the University of Rochester.
They have an operating system called "PLATINUM" that deals with object
migration to lower memory latency.  See
    "The Implementation of a Coherent Memory Abstraction on a NUMA
     Multiprocessor: Experiences with PLATINUM", Alan L. Cox and Robert
     J. Fowler.  University of Rochester Department of Computer Science
     Technical report 263, 1989.

*******************************************************************************

Several people mentioned the C Threads library available from Ohio State
University.  You can get the library from tut.cis.ohio-state.edu in the
/pub/Threads directory.  The directory contains source code for both
Chrysalis and Mach 1000 versions of the Butterfly, as well as some man
pages.  In addition, pick up files 1.Z through 5.Z.  When they are
uncompressed and concatenated, you will have the PostScript source to a
report for last year's Butterfly Users Group.

The library was written by Yiannis Samiotakis, who then wrote a Master's
Thesis on the experience, "A Thread Library for a Non Uniform Memory
Access Multiprocessor", finished in 1989.

Thanks to:
	george@cis.ohio-state.edu (George M. Jones)
	slim@cosimo.osgp.osc.edu (Scott Whitman)
	john_r_mudd@cis.ohio-state.edu

*******************************************************************************

Thanks for everyone's help.

huff@svax.cs.cornell.edu (Richard Huff)

scott@cs.rochester.edu (03/03/90)

Several programming packages of potential interest have been developed
here at Rochester.  A good overview of these can be found in

%A T. J. LeBlanc
%A M. L. Scott
%A C. M. Brown
%L LeBlanc, Scott, and Brown, 1988
%T Large-Scale Parallel Programming: Experience with the BBN Butterfly
Parallel Processor
%K psyche
%J Proceedings of the ACM SIGPLAN PPEALS 1988 \(em Parallel Programming:
Experience with Applications, Languages, and Systems
%C New Haven, CT
%D 19-21 July 1988
%P 161-172
%O Also available as BPR 22, Computer Science Department, University of
Rochester, September 1988

More details can be found in other reports from the (now closed) Butterfly
Project Report series:

%A L. Bukys
%A T. J. LeBlanc
%T Getting Started with the BBN Butterfly Parallel Processor
%R BPR 1 (second edition)
%I Computer Science Department, University of Rochester
%D October 1986
%X This report is an introductory user's manual for the BBN Butterfly
multiprocessor at the University of Rochester.
It is intended to help novice users learn to access and use the various
Butterfly machines and the associated software environment.
Its primary purpose is to illustrate how to compile and execute simple
programs, and to describe what other documentation is available and where to
find it.

%A M. Fanty
%T A Connectionist Simulator for the BBN Butterfly Multiprocessor
%R TR 164, BPR 2
%I Computer Science Department, University of Rochester
%D January 1986
%X This report describes the implementation of a connectionist simulator
on the BBN Butterfly Multiprocessor.
It presents the model of connectionist networks used by the simulator
and gives a detailed account of the simulator's implementation.
Performance data for some sample networks is included.
Some plans for future improvements are discussed.

%A T. J. LeBlanc
%T Shared Memory Versus Message-Passing in a Tightly-Coupled Multiprocessor:
A Case Study
%P 463-466
%J Proceedings of the 1986 International Conference on Parallel Processing
%C St. Charles, IL
%D 19-22 August 1986
%O Expanded version available as BPR 3, Computer Science Department,
University of Rochester, January 1986.
%X Conceptually, the BBN Butterfly Parallel Processor can support a
model of computation based on either shared memory or message passing.
Recently, much of the work on the Butterfly, including finite element
analysis and computer vision algorithms, has assumed the shared-memory model.
In this paper we describe the results of our experiments with the
message-passing model.
The goal of the experiments was to analyze the tradeoffs between the
shared-memory model, as exemplified by the BBN Uniform System package,
and a simple message-passing model.
Important factors to be considered were performance, scalability, and
ease of programming.
We compare the two models with respect to these criteria and conclude
that the particular model of computation used is less important than how
well it is matched to the application.

%A T. J. Olson
%T Modula-2 on the BBN Butterfly Parallel Processor
%R BPR 4
%I Computer Science Department, University of Rochester
%D January 1986
%X This report describes the current state of the Rochester Butterfly
Modula-2 compiler and runtime environment.
It covers use of the compiler, the standard runtime environment, and the
(experimental) Butterfly environment.
It also describes methods of extending the experimental environment and
obtaining access to operating system functions.

%A E. Hinkelman
%T NET: A Utility for Building Regular Process Networks on the BBN
Butterfly Parallel Processor
%R BPR 5
%I Computer Science Department, University of Rochester
%D February 1986
%X Coarse-grained parallel architectures like the BBN Butterfly are
difficult to use without tools for building networks of communicating
processes.
Net is a tool for creating a useful class of process networks.
This report describes the Net utility, an example application, and some
design decisions and experiences.

%A M. L. Scott
%T The Interface Between Distributed Operating System and
High-Level Programming Language
%P 242-249
%J Proceedings of the 1986 International Conference on Parallel Processing
%C St. Charles, IL
%D 19-22 August 1986
%O Also published in the \f2University of Rochester
1986-87 Computer Science and Computer Engineering Research Review\fP, and
available as TR 182 and BPR 6, Department of Computer Science,
University of Rochester, September 1986 (revised)
%X A distributed operating system provides a process abstraction and
primitives for communication between processes.
A distributed programming language regularizes the use of the
primitives, making them both safer and more convenient.
The level of abstraction of the primitives dictates the division of
labor between the operating system and the language support routines,
with serious ramifications for both efficiency and flexibility.
%X LYNX, now available on the Butterfly, is a high-level language for
parallel programming.
Previous implementations of LYNX were designed for the Crystal
multicomputer at the University of Wisconsin and for Jon Kepecs's SODA.
Comparison of the three implementations provides important insights into
the nature of the language/operating system interface.
Among other things, the comparison suggests that a lower-level interface
is best; the Butterfly implementation is the simplest and the fastest of
the three.

%A M. L. Scott
%T LYNX Reference Manual
%R BPR 7
%I Computer Science Department, University of Rochester
%D August 1986 (revised)
%X LYNX is a message-based distributed programming language with novel
facilities for communication between processes and for management of
context within processes.
LYNX was first implemented on the Crystal multicomputer at the
University of Wisconsin.
It has subsequently been ported to the Butterfly Parallel Processor at
the University of Rochester.
%X This manual is intended for serious users of the Butterfly
implementation.
At the time of its writing it constitutes the \f2de facto\fP standard on
the syntax and semantics of LYNX.
It also describes ways in which the Butterfly implementation differs
from the standard, and in which that implementation resolves issues that
the standard leaves undefined.

%A T. J. LeBlanc
%A N. M. Gafter
%A T. Ohkami
%T SMP: A Message-Based Programming Environment for the BBN Butterfly
%R BPR 8
%I Computer Science Department, University of Rochester
%D July 1986
%X SMP is a message-based programming environment for the BBN Butterfly
similar in flavor and scope to the BBN Uniform System package.
As such, SMP provides an alternative to the shared memory model of the
Uniform System.
SMP supports the construction of \f2process families\fP, a fixed set of
asynchronous processes that communicate, using messages, according to a
given interconnection pattern.
A dynamic hierarchy of such process families is possible.
In this report we describe the SMP user interface and an implementation
on the Butterfly.

%A T. J. LeBlanc
%T Structured Message Passing on a Shared-Memory Multiprocessor
%J Proceedings of the 21st Annual Hawaii International Conference
on System Sciences
%K smp hicss
%D January 1988
%C Kailua-Kona, HI
%P 188-194

%A T. J. Olson
%T An Image Processing Package for the BBN Butterfly Parallel Processor
%R BPR 9
%I Computer Science Department, University of Rochester
%D September 1986
%X The University of Rochester computer vision group uses IFF (Image
File Format) as its internal standard for representing and manipulating
image data.
This report describes an adaptation of the IFF libraries and programming
methodology to the BBN Butterfly Multiprocessor.
It offers guidelines for programmers using the library and describes
early experience writing utilities with the system.

%A T. J. Olson
%T Finding Lines with the Hough Transform
on the BBN Butterfly Parallel Processor
%R BPR 10
%I Computer Science Department, University of Rochester
%D September 1986
%X The Hough transform is a popular method of finding parametric curves
in noisy data.
It is a key primitive in many current vision systems.
In order to evaluate the suitability of the Butterfly for this type of
computation, we have implemented two line-finding Hough transform
algorithms under the Uniform System software package.
We conclude that for these particular algorithms the Butterfly/Uniform
System combination works very well, although memory contention may be a
problem for the more general case.

%A L. Bukys
%T Connected Component Labeling and Border Following on the BBN
Butterfly Parallel Processor
%R BPR 11
%I Computer Science Department, University of Rochester
%D October 1986
%X Multiprocessor architectures present the problem of efficiently
merging the results of separate (parallel) subcomputations.
This report discusses an approach to this problem based on the
UNION-FIND algorithm, and presents experimental results obtained on the
BBN Butterfly Parallel Processor running an assortment of benchmarks.

%A T. J. LeBlanc
%A J. M. Mellor-Crummey
%T Debugging Parallel Programs with Instant Replay
%J IEEE Transactions on Computers
%V C-36
%N 4
%D April 1987
%P 471-482
%O Also available as BPR 12,
Computer Science Department, University of Rochester,
September 1986
%X The debugging cycle is the most common methodology for finding and
correcting errors in sequential programs.
Cyclic debugging is effective because sequential programs are usually
deterministic.
Debugging parallel programs is considerably more difficult because
successive executions of the same program often do not produce the same
results.
In this paper we present a general solution for reproducing the
execution behavior of parallel programs, termed \f2Instant Replay\fP.
During program execution we save the relative order of significant
events as they occur, not the data associated with such events.
As a result, our approach requires less time and space to save the
information needed for program replay than other methods.
Our technique is not dependent on any particular form of interprocess
communication.
It provides for replay of an entire program, rather than individual
processes in isolation.
No centralized bottlenecks are introduced and there is no need for
synchronized clocks or a globally-consistent logical time.
We describe a prototype implementation of Instant Replay on the BBN
Butterfly Parallel Processor, and discuss how it can be incorporated
into the debugging cycle for parallel programs.

%A C. M. Brown
%A R. J. Fowler
%A T. J. LeBlanc
%A M. L. Scott
%A M. Srinivas
%A et\0al.
%T DARPA Parallel Architecture Benchmark Study
%J DARPA Architecture Workshop
%C McLean, Virginia
%D October 1986
%O Available as BPR 13, Department of Computer Science,
University of Rochester
%X In intensive work over a four-week period in the summer of 1986,
seven problems were studied and implemented on the Butterfly.
The problems were inspired by various capabilities in computer vision,
and were proposed as benchmarks for a DARPA workshop on parallel
architectures.
They were: convolution and zero-crossing detection for edges, edge
tracking, connected component labeling, Hough transform, three
computational geometry problems (convex hull, Voronoi diagram, and
minimum spanning tree), three-dimensional visibility calculations,
subgraph isomorphism and minimum cost path calculation.
BPRs 10, 11, and 14 are detailed reports of three of the problems.
BPR 13 contains the conclusions of the study and writeups of the work
not covered in other BPRs.

%A J. Costanzo
%A L. Crowl
%A L. Sanchis
%A M. Srinivas
%T Subgraph Isomorphism on the BBN Butterfly Multiprocessor
%R BPR 14
%I Computer Science Department, University of Rochester
%D October 1986
%X This report describes an algorithm for finding subgraph isomorphisms
for a restricted class of graphs and a parallel implementation of the
algorithm on the BBN Butterfly Multiprocessor.
This effort was part of a larger project to assess the suitability of
the Butterfly architecture for a variety of machine vision tasks.
Our algorithm searches a tree in which each node represents a partial
assignment of vertices in the smaller graph to vertices in the larger
graph.
The algorithm prunes the search tree using properties of the two graphs
as constrained by the partial mapping.
These properties are vertex connectivity, distance between vertices and
the local topology of vertex clusters.
By carefully balancing the computational load and the contention for
shared resources, our algorithm achieves almost linear speedup in the
processing rate of search tree nodes.
However, the speedup of isomorphism detection rate is poor when looking
for a few isomorphisms, and good only when looking for many
isomorphisms.
We present an analysis of why we believe this property is intrinsic to
algorithms that parallelize the tree search without parallelizing the
node computations.
We also discuss the effectiveness of the Butterfly architecture and
programming environment in implementing such parallel algorithms.

%A L. Crowl
%T Chrysalis++
%R BPR 15
%I Computer Science Department, University of Rochester
%D December 1986
%X The standard Butterfly Parallel Processor's programming environment
uses the C programming language and the Chrysalis operating system.
This report describes Chrysalis++, a C++ programming language interface
to Chrysalis.
%X The binding strategy for the design of Chrysalis++ was to re-cast the
error-prone, explicit Chrysalis object management in C into the implicit
object management that occurs routinely in C++.
Chrysalis++ merges creation and access of Chrysalis objects into the
declaration of the variable used to represent them.
This strategy provides the entire Chrysalis++ programming environment
with a single object management strategy.
%X The development of Chrysalis++ highlighted some of the difficulties
in mixing a single process high level language with a low level
multiprocessor operating system.
Bringing C++ and Chrysalis together demonstrated weaknesses in the C++
class mechanism and showed inconsistencies in Chrysalis object treatment.
%X This report is composed of three parts.
The first part gives a background for this project and describes the
resulting Chrysalis++.
The second part describes the experience of developing Chrysalis++,
porting C++ to the Butterfly, and some of the weaknesses discovered in C++.
Finally, the third part provides the details of Chrysalis++.
This last part is intended only for actual users of Chrysalis++.

%A L. A. Crowl
%T Shared Memory Multiprocessors and Sequential Programming Languages:
A Case Study
%J Proceedings of the 21st Annual Hawaii International Conference
on System Sciences
%D January 1988

%A J. Low
%T Experiments with Remote Procedure Call on the Butterfly
%R BPR 16
%I Computer Science Department, University of Rochester
%D December 1986
%X Several remote procedure call protocols were implemented and timed on
the Butterfly.
These implementations used either Chrysalis synchronization facilities
(Dual Queues and Events) or shared memory.
Analysis of some of the costs, benefits and deficiencies of each of the
techniques is presented.

%A M. L. Scott
%A A. L. Cox
%T An Empirical Study of Message-Passing Overhead
%J Proceedings of the Seventh International Conference on Distributed
Computing Systems
%K bpr17 lynx dcs
%C Berlin, West Germany
%D 21-25 September 1987
%P 536-543
%O Also published as BPR 17, Computer Science Department, University of
Rochester, December 1986
%X Conventional wisdom holds that message-passing is orders of magnitude
more expensive than shared memory for communication between parallel
processes.
Differences in the speed of underlying hardware mechanisms fail to
account for a substantial portion of the performance gap.
The remainder is generally attributed to the ``inevitable cost'' of
higher-level semantics, but a deeper understanding of the factors that
contribute to message-passing overhead has not been forthcoming.
%X In this paper we provide a detailed performance analysis of one
message-passing system: the implementation for the BBN Butterfly Parallel
Processor of the LYNX distributed programming language.
The case study includes a description of the implementation, an
explanation of optimizations employed to improve its performance, and a
detailed breakdown of remaining costs.
The data provide a direct measure of the expense of individual features
in LYNX.
They also provide insight into the likely costs of other message-passing
systems, both present and future.
Lessons gained from our experience should be of use to other researchers
in performing similar studies.

%A P. Dibble
%T Benchmark Results for Chrysalis Functions
%R BPR 18
%I Computer Science Department, University of Rochester
%D December 1986
%X This paper presents the results of benchmarks for the important
Chrysalis functions and macros.
In most cases, the Chrysalis code that implements the function or the
assembly code generated by the macro is analyzed as well.
It is intended to help Butterfly programmers choose efficient ways to
use Chrysalis, and to guide system programmers in enhancing Chrysalis.

%A M. L. Scott
%A T. J. LeBlanc
%T Psyche: A General-Purpose Operating System for Shared-Memory
Multiprocessors
%R BPR 19, TR 223
%I Computer Science Department, University of Rochester
%D July 1987
%X The Psyche project at the University of Rochester aims to develop
a high-performance operating system to support a wide variety of
models for parallel programming on a shared-memory multiprocessor.
It is predicated on the assumption that no one model of process
state or style of communication will prove appropriate for all
applications, but that a shared-memory machine can and should support
all models.
Conventional approaches, such as shared memory or
message passing, can be regarded as points
on a continuum that reflects the degree of sharing between processes.
Psyche enables fully dynamic sharing
by providing a user interface
based on passive data abstractions in a uniform virtual address space.
It ensures that users pay for protection only when it is required
by permitting lazy evaluation of protection using keys and access lists.
The data abstractions define conventions for sharing the uniform
address space;
the tradeoff between protection and performance determines
the degree to which those conventions are enforced.
In the absence of protection boundaries,
access to a shared abstraction can be as efficient as
a procedure call or a pointer dereference.

%A M. L. Scott
%A T. J. LeBlanc
%A B. D. Marsh
%T Design Rationale for Psyche, a General-Purpose Multiprocessor Operating
System
%J Proceedings of the 1988 International Conference on Parallel Processing
%K icpp
%C St. Charles, IL
%D 15-19 August 1988
%P 255-262
%V II \(mi Software
%X Also published in the \f2University of Rochester
1988-89 Computer Science and Computer Engineering Research Review\fP.

%A J. M. Mellor-Crummey
%A T. J. LeBlanc
%A L. A. Crowl
%A N. M. Gafter
%A P. C. Dibble
%T Elmwood \(em An Object-Oriented Multiprocessor Operating System
%R BPR 20
%I Computer Science Department, University of Rochester
%D July 1987
%O Also published in the \f2University of Rochester
1987-88 Computer Science and Computer Engineering Research Review\fP, and
submitted for journal publication
%X Elmwood is an object-oriented, multiprocessor operating system designed as a
group project at the University of Rochester.
An Elmwood object, consisting of code and data, represents an instance of an
abstract data type.
Only the code associated with an object may access its data; interaction
between objects is via remote procedure call.
Access to an object requires that the caller provide an appropriate logical
object name, which denotes a kernel-protected pair containing an object
reference and a context value.
Elmwood provides only basic mechanisms for protection and synchronization; the
object itself supplies and interprets the context value, thereby implementing
its own policies for protection and synchronization.
We describe the Elmwood design, a multiprocessor implementation for the BBN
Butterfly Parallel Processor, and our experiences in building a
functionally-complete operating system as a group project in four months.

%A T. J. LeBlanc
%A J. M. Mellor-Crummey
%A N. M. Gafter
%A L. A. Crowl
%A P. C. Dibble
%T The Elmwood Multiprocessor Operating System
%K elmwood
%J Software \(em Practice and Experience
%V 19
%N 11
%D November 1989
%P 1029-1056

%A M. L. Scott
%A K. R. Jones
%T Ant Farm: A Lightweight Process Programming Environment
%R BPR 21
%I Computer Science Department, University of Rochester
%D August 1988
%X Many parallel algorithms require a collection of processes whose
number is linear (or worse) in the size of the problem to be solved.
Few programming environments support such flagrant parallelism.  Most
often each virtual process of a program corresponds to a single,
expensive heavyweight entity supplied by the operating system.  Ant
Farm is a library package for the BBN Butterfly Parallel Processor that
allows programs to split computational effort across many lightweight
threads of control.  On each node of the Butterfly, Ant Farm provides
for the execution and management of dozens, even hundreds, of threads.
A full set of mechanisms is provided for location-transparent
communication, sharing, and synchronization.  This report describes the
Ant Farm model, its implementation, and its programming interface.
-- 
Michael L. Scott
Computer Science Dept.		(716) 275-7745, 5671
University of Rochester		scott@cs.rochester.edu
Rochester, NY  14627		...!rochester!scott