dfk@romeo.cs.duke.edu (David F. Kotz) (03/28/89)
(also posted to comp.parallel)

This is a summary of the responses I received regarding my posting about
I/O for parallel computers:

>I am doing some research in the area of disk I/O and filesystems for
>parallel processors. Does anyone have any references, addresses, or
>relevant information on the I/O capabilities and strategies taken by
>the various commercial multiprocessors? (eg Encore, Sequent,
>hypercubes, Connection Machine, etc).
>
>Also, any references regarding disk striping, disk caching and
>prefetching, parallel disk I/O, etc would be most welcome.

This is a heavily edited summary of the comments I received.  A
compilation of references in BibTeX format will follow.  Thanks for all
of your help!  (And I welcome more information!)

David Kotz
Department of Computer Science, Duke University, Durham, NC 27706 USA
ARPA: dfk@cs.duke.edu    CSNET: dfk@duke    UUCP: decvax!duke!dfk

===============================

Eugene Miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov:

Some people at MCC have done some work, as has Pixar.  I think Cray and
CDC have done the best commercial work, but this is artifact work, with
few really serious studies.  We have a CM2 with a DataVault, but again,
everyone plays their cards close.  IBM has also done some work; I have
one or two TRs, but they never reference some of the other stuff
(Garcia-Molina, for example).  Jean Bell at the Colorado School of Mines
might also have some data organization material.

This is certainly one of the weakest areas in parallel computing.  You
basically have to ask to find the art.  Until we get a little less
protective I would not expect progress in this area.  Most people spend
their time on the error and fault-tolerance aspects.

MCC is the Microelectronics and Computer Technology Corp., Austin,
Texas.  The Colorado School of Mines is in Golden, CO (near Coors Beer).
I don't think they are on the net, but they do have big CDC hardware.

----
From: Goetz Graefe <graefe@ogccse.ogc.edu>

Consider database machine architectures, e.g., GAMMA (VLDB 86, SIGMOD
88, SIGMOD 89), Teradata (a commercial product, a backend for IBMs,
same company name), and Bubba (an MCC project).

Intel's hypercube uses I/O nodes, which are 80386 CPUs with SCSI but
without DMA.  Sequent uses DCCs (dual-channel disk controllers), up to
4 on a Symmetry, for up to 32 drives.  They say their shared bus does
not impede I/O performance; they offer several DBMS products on the
machine.  Virtually all companies are working on it; call their
technical marketing departments.

----
From: Brett Fleisch <brett@CS.UCLA.EDU>

See the article on Rash in the USENIX Proceedings, Winter 89.

----
From: ephraim@Think.COM (Ephraim Vishniac)
Organization: Thinking Machines Corporation, Cambridge MA, USA

You can get manuals for the DataVault (TMC's parallel storage system
for the CM) by contacting Thinking Machines.  Call 617-876-1111 or
write to us at 245 First Street, Cambridge, MA 02142-1214.  The manual
is not terribly informative about DV internals, but it's a start.

If you'd like to *use* a DataVault (and a CM, of course), you can apply
for a free account by writing to cmns-manager@think.com.  CMNS is the
Connection Machine Network Server, a DARPA-sponsored project.  The
critical requirement for a CMNS account is the ability to telnet to
cmns.think.com, the CMNS CM host.

----
From: kathya@june.cs.washington.edu (Katherine J. Armstrong)

I'm just starting work on the same topic.

----
From: William Tsun-Yuk Hsu <hsu@uicsrd.csrd.uiuc.edu>

I'm working on research in the same area.
I only started looking at parallel I/O a few months ago, and so have
nothing worth writing up yet.  I'm more interested in the memory system
as a whole: how paging behavior affects secondary storage and I/O,
especially in a shared-memory environment (since that's what most
people in the Cedar group are interested in).  I have additional
references on parallel program behavior if you're interested.  (I am
also looking at how to model multiprogramming in a parallel
environment; we would like to obtain realistic traces from either real
systems or simulators to evaluate different I/O configurations.  Wish
we had access to something like Butterfly traces.)

----
From: Rajeev Jog <jog%hpda@hp-sde.sde.hp.com>

I'm working on the performance of multiprocessor architectures and I/O
systems.  Of late I've been focusing on tightly coupled shared-memory
systems.

----
From: rpitt@sol.engin.umich.edu

We have a Stellar GS1000.  It is one of the fastest graphics systems
around; performance is 20-25 MIPS (it is a four-processor machine).
Though our installation uses standard SCSI drives, the machine can do
striping on 3 drives (like Maxtor 4380s).  It also does not use a
standard filesystem.  I do not know exactly how the filesystem is
implemented, but you may want to look into it.  The system runs a very
clean implementation of Sys V with BSD extensions, as well as a
shared-memory approach to X11 client/server communication.  I have not
used many very high-end graphics machines, just the Stellar, the Apollo
DN10000, and the Ardent Titan (there are the Sun-4s, but they aren't in
the same class).

----
From: Russell R. Tuck <rrt@cs.duke.edu>

I believe Stellar Computer provides disk striping capabilities with
their workstations.  Since I'm not sure and can't find much about it,
all I can offer is their address and phone:

    Stellar Computer, Inc.
    75 Wells Avenue
    Newton, MA 02159
    (617) 964-1000

----
From: jsw@sirius.ctr.columbia.edu (John White)

I am also interested in this area.  In particular, I am interested in
the filesystems and strategies used for transputer networks, their
implementation, and their performance.

----
From: Dave Poplawski <pop@whitefish.mtu.edu>

There were several talks and poster sessions at the hypercube
conference this past week.  Papers corresponding to them will be
published in the proceedings, which we were told would be available by
the end of April.

----
From: John.Pieper@B.GP.CS.CMU.EDU

I have been working on this problem for some time.

----
From: Shannon Nelson <shannon@intelisc.intel.com>

[This person is a contact at Intel.]

----
From: Tom Jacob <convex!ntvax!jacob@mcnc.org>  (Univ. of North Texas)

Here is (part of) a discussion on disk striping that was on the net a
couple of years ago.

/* Written 4:08 pm Aug 3, 1987 by pioneer.UUCP!eugene in ntvax:comp.arch */
/* ---------- "Re: Disk Striping (description and" ---------- */

I meant to say DISK STRIPING.  This is the distribution of data across
multiple "spindles" in order to 1) increase total bandwidth, 2) provide
fault tolerance (like Tandem's), and 3) other miscellaneous reasons.
Very little work has been done on the subject, yet a fair number of
companies have implemented it: Cray, CDC/ETA, the Japanese
manufacturers, Convex, and Pyramid (so I am informed), and I think
Tandem, etc.  Now for important perspective: striping over 3-4 disks,
as in a personal computer, is a marginal proposition.  Striping over 40
disks, now there is some use.  The break-even point is probably between
8 and 16 disks (excepting the fault-tolerance case).  A person I know
at Amdahl boiled the problem down to 3600 RPM running against a 60 Hz
wall clock: the mechanical bottlenecks of getting data into and out of
a CPU from a disk.  The work is not as glamorous as making CPUs, yet it
is just as difficult (consider the possibility of losing just one
spindle).

The two most cited papers I have seen are:

%A Kenneth Salem
%A Hector Garcia-Molina
%T Disk Striping
%R TR 332
%I EE CS, Princeton University
%C Princeton, NJ
%D December 1984

%A Miron Livny
%A Setrag Khoshafian
%A Haran Boral
%T Multi-Disk Management Algorithms
%R DB-146-85
%I MCC
%C Austin, TX
%D 1985

Both of these are pretty good reports, but more work needs to be done
in this area; hopefully one or two readers might take it up seriously.
The issue is not simply one of sequentially writing bits out to
sequentially lined-up disks.  I just received:

%A Michelle Y. Kim
%A Asser N. Tantawi
%T Asynchronous Disk Interleaving
%R RC 12496 (#56190)
%I IBM TJ Watson Research Center
%C Yorktown Heights, NY
%D Feb. 1987

This looks good, but what is interesting is that it does not cite
either of the two reports above, but quite a few others (RP^3- and
Ultracomputer-based).  Kim's PhD dissertation is on synchronous disk
interleaving, and she has a paper in IEEE TOC.  Another paper I have is
Arvin Park's paper on IOStone, an I/O benchmark.  Park is also at
Princeton under Garcia-Molina (massive-memory VAXen).  I have other
papers, but these are the major ones.  Just start thinking terabytes
and terabytes.

From a badge I got at ACM/SIGGRAPH:  Disk Space: The Final Frontier

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
/* End of text from ntvax:comp.arch */
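[To make the mapping concrete, here is a minimal sketch of the kind of
round-robin block-to-spindle mapping striping implies.  The stripe unit
and spindle count are invented for illustration and are not any
particular vendor's layout.]

    /* Hypothetical round-robin striping of a logical block address
     * across NDISKS spindles.  Parameters are made up. */
    #include <stdio.h>

    #define NDISKS      8     /* spindles in the stripe group (assumed) */
    #define STRIPE_UNIT 16    /* consecutive blocks placed per spindle  */

    struct location { int disk; long physblock; };

    struct location map_block(long logical)
    {
        struct location loc;
        long stripe = logical / STRIPE_UNIT;      /* which stripe unit  */
        long offset = logical % STRIPE_UNIT;      /* offset within unit */

        loc.disk      = (int)(stripe % NDISKS);   /* round-robin spindle */
        loc.physblock = (stripe / NDISKS) * STRIPE_UNIT + offset;
        return loc;
    }

    int main(void)
    {
        long b;
        for (b = 0; b < 40; b++) {
            struct location l = map_block(b);
            printf("logical %3ld -> disk %d, block %ld\n",
                   b, l.disk, l.physblock);
        }
        return 0;
    }

[A large sequential request then keeps all NDISKS spindles transferring
at once, which is where the bandwidth win comes from; with only 3 or 4
spindles, seek and rotation overheads tend to eat the gain, which is
the break-even argument above.]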
/* Written 7:08 pm Aug 3, 1987 by shukra.UUCP!ram in ntvax:comp.arch */
/* ---------- "Re: Disk Striping (description and" ---------- */

In article <2432@ames.arpa>, eugene@pioneer.arpa (Eugene Miya N.) writes:
> I meant to say DISK STRIPING.  This is the distribution of data across
> multiple "spindles" in order to 1) increase total bandwidth, 2) for
> reasons of fault-tolerance (like Tandems), 3) other miscellaneous
> reasons.

So, my guess was right.  Does the CM use such a feature to obtain a
large BW?  How come IBM's disk farms don't have these, or do they?

Renu Raman                     ARPA: ram@sun.com
Sun Microsystems               UUCP: {ucbvax,seismo,hplabs}!sun!ram
M/S 5-40, 2500 Garcia Avenue, Mt. View, CA 94043
/* End of text from ntvax:comp.arch */

/* Written 3:03 am Aug 5, 1987 by pyrltd.UUCP!bejc in ntvax:comp.arch */
/* ---------- "Re: Disk Striping (description and" ---------- */

In article <2432@ames.arpa>, eugene@pioneer.arpa (Eugene Miya N.) writes:
> Very little work has been done on the subject yet a fair number of
> companies have implemented it: Cray, CDC/ETA, the Japanese
> manufacturers, Convex, and Pyramid (so I am informed), and I think
> Tandem, etc.  [...]

Pyramid has been shipping "striped disks" as part of OSx 4.0 since
early this year.  "Striped disk" is one of four "virtual disk" types
offered under OSx, the others being "concatenated", "mirrored", and
"simple".  A full description of the techniques and their
implementation was given by Tom Van Baak of Pyramid Technology
Corporation at the February Usenix/Uniforum meeting in Washington.

The principal reason for using "striped disk" is performance.  The
ability to place interleaved clusters of data on different spindles can
be a winner in cases where the disk throughput rate is approaching the
saturation point of a single disk, and you have a disk controller
intelligent enough to know where every disk head is at any given time.

To take a case in point, ICC, a company based in the City of London,
supplies financial data from an 8-Gbyte database to dial-up
subscribers.  One index in the database is >800 Mbytes long and had
been set up on a "concatenated virtual disk" made up of two 415-Mbyte
Eagles.  When the setup was switched to the "striped virtual disk"
model, a throughput increase of 34% was measured.  This doesn't mean
that "striped" disks are going to answer everybody's disk performance
problems, but they can provide significant improvements in certain
cases.

Both Tom and I have produced papers on virtual disks and would be happy
to answer any further questions that you have.  Tom can be contacted at
pyramid!tvb, while my address is given below.

Brian E.J. Clark               Phone : +44 276 63474
Pyramid Technology Ltd         Fax   : +44 276 685189
                               Telex : 859056 PYRUK G
                               UUCP  : <england>!pyrltd!bejc
/* End of text from ntvax:comp.arch */

/* Written 7:44 am Aug 7, 1987 by cc4.bbn.com.UUCP!lfernand in ntvax:comp.arch */

> Isn't this what IBM uses in their Airline Control Program?  There was that
> article in CACM a while back about the TWA reservation system, and it said
> something about spreading files over a large number of spindles for greater
> throughput.
>                haynes@ucscc.ucsc.edu

What ACP does isn't what we are calling disk striping in this
newsgroup.  ACP has an option to write each record to two different
disks at the same time.  This doesn't increase throughput but does have
several benefits:

1/ It creates a backup copy, so the data will not be lost if one disk
   crashes.

2/ It allows ACP a choice of where to read the data from.  ACP will
   read the data from the disk with the shortest queue, reducing the
   access delay.

                               ...Lou
                               lfernandez@bbn.com
                               ...!decwrl!bbncc4!lfernandez
/* End of text from ntvax:comp.arch */
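[What Lou describes is record duplexing plus queue-aware read
scheduling.  Here is a minimal sketch of the idea; the structures and
names are invented, and this is not ACP code.]

    #include <stdio.h>

    /* Sketch of writing every record to two disks and reading from
     * whichever copy currently has the shorter queue. */
    struct disk { const char *name; int queue_len; };

    static void issue(struct disk *d, const char *op, long rec)
    {
        d->queue_len++;                   /* request joins the queue */
        printf("%s record %ld on %s (queue now %d)\n",
               op, rec, d->name, d->queue_len);
    }

    /* writes go to BOTH disks: a backup copy survives a single crash */
    void dup_write(struct disk *a, struct disk *b, long rec)
    {
        issue(a, "write", rec);
        issue(b, "write", rec);
    }

    /* reads go to whichever copy has the shorter queue right now */
    void dup_read(struct disk *a, struct disk *b, long rec)
    {
        issue(a->queue_len <= b->queue_len ? a : b, "read", rec);
    }

    int main(void)
    {
        struct disk d0 = { "disk0", 0 }, d1 = { "disk1", 0 };

        dup_write(&d0, &d1, 42);
        d0.queue_len += 3;                /* pretend disk0 got busy */
        dup_read(&d0, &d1, 42);           /* goes to disk1          */
        return 0;
    }

[The write cost is doubled, but reads see the shorter of the two
queues, which is where the access-delay reduction comes from.]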
/* Written 4:50 am Aug 9, 1987 by amdcad.UUCP!rpw3 in ntvax:comp.arch */

In article <310@cc4.bbn.com.BBN.COM> lfernand@cc4.bbn.com.BBN.COM
(Louis F. Fernandez) writes:
| What ACP does isn't what we are calling disk striping in this newsgroup.
| ACP has an option to write each record to two different disks at the
| same time.  [...]

Sorry, the original poster is correct (sort of).  ACP *does* have disk
striping, in addition to the redundant writing you mentioned, but still
it isn't quite the same as we are talking about here.  They spread a
file across several disks, all right, but the allocation of records
(all fixed length -- this is *old* database technology!) is such that
the disk drive number is a direct-mapped hash of some key in the
record!  What this does is spread accesses to similar records (like
adjacent seats on the same flight) across many disks (sometimes up to
100 packs!).

Rob Warnock
Systems Architecture Consultant
UUCP:    {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3
DDD:     (415) 572-2607
USPS:    627 26th Ave, San Mateo, CA 94403
/* End of text from ntvax:comp.arch */
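[A minimal sketch of that kind of key-to-pack mapping.  The key format,
hash function, and pack count are all invented for illustration; this
is not ACP code.]

    #include <stdio.h>

    /* Sketch of hashing a record key directly to a drive (pack) number
     * so that "adjacent" records, e.g. seats on one flight, scatter
     * across many packs.  Parameters are made up. */

    #define NPACKS 100                 /* spindles in the farm (assumed) */

    int pack_for_key(const char *key)
    {
        unsigned long h = 0;
        while (*key)
            h = h * 31 + (unsigned char)*key++;   /* simple string hash */
        return (int)(h % NPACKS);       /* direct-mapped: key -> pack   */
    }

    int main(void)
    {
        const char *keys[] = { "FL845-12A", "FL845-12B",
                               "FL845-12C", "FL846-01A" };
        int i;

        for (i = 0; i < 4; i++)
            printf("%s -> pack %d\n", keys[i], pack_for_key(keys[i]));
        return 0;
    }

[Records that would otherwise sit next to each other on one pack land
on different packs, so concurrent transactions hit different spindles.]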
----
From: Donald.Lindsay@K.GP.CS.CMU.EDU  (August 1988)

Monty Denneau's TF-1 is supposedly going to have several thousand disk
drives in parallel.  (The TF-1 will have 32K CPUs.)  He implied he'd
use error-correcting techniques, down the vector of drives, so that a
drive failure wouldn't matter.  Also, NCUBE offers a drive-per-node
arrangement.  denneau@ibm on CSNET/BITNET can tell you what he's
published.

I should also have mentioned the DataVault for the Connection Machine
(actually, the CM-2) at Thinking Machines.  Since they've actually
shipped some of these, they may have live info.  (They store 32-bit
words across 39 disk drives.)

----
From: Bob Knighten <pinocchio!knighten@xenna.encore.com>

The Encore Multimax presents essentially a plain vanilla Unix view of
I/O, with the addition that the I/O devices are attached to the system
bus, so all processors have equal access to them.  I know of no
technical papers that deal specifically with the I/O system.  This is
because to date there has been little work aimed specifically at the
I/O system; rather, the work has gone into making the kernel truly
symmetrical and reducing lock contention.  For example, here are some
early timings done on the Multimax using a master/slave kernel and
using the multithreaded kernel.  (This is for a simple program that
performs 1000 read operations from random locations in a large file.)

    copies of program    multithreaded    master/slave
            1                5 secs           8 secs
            4                5 secs          25 secs
            8                9 secs          50 secs
           12               13 secs          77 secs

There is no other difference at all in the I/O subsystem, which is
standard 4.2BSD.

Bob Knighten
Encore Computer Corp.
257 Cedar Hill St.
Marlborough, MA 01752
(508) 460-0500 ext. 2626
Arpanet: knighten@multimax.arpa OR knighten@multimax.encore.com
Usenet:  {bu-cs,decvax,necntc,talcott}!encore!knighten

----
From: Joseph Boykin <boykin@calliope.encore.com>

Encore currently ships four models of computer: 310, 320, 510, and 520.
The 310/320 use the NS32332 CPU (2 MIPS/CPU).  The 510/520 use the
NS32532 CPU (8.5 MIPS/CPU).  The 310/510 use a single "low-boy"
cabinet; the 320/520 use dual-bay, full-size cabinets.  The 320/520
also have more slots for memory, CPU, and I/O cards than the 310/510.

All systems use SCSI direct to the disk/tape drive.  Some systems are
still going out with single-ended SCSI, some with differential
synchronous.  Within the next few months, all systems will be
differential synchronous.  The difference is not noticeable for a
single transfer across the bus, but when the (SCSI) bus gets busy,
differential synchronous is significantly faster.

Disk drives are CDC Syber for the X20 (1.1 GB formatted per drive).
You can fit either 6 or 8 in the cabinet; more if you add cabinets.
The maximum is 64 drives.  We use the CDC Wren IV (?) (about 3XX
MB/drive) on the X10.  We'll be going to the Wren V (?) (6XX MB/drive)
when available.

We currently ship three OS's: System V, 4.2BSD, and MACH.  All three
have been parallelized.  I.e., if two processes make simultaneous disk
requests, they will get as far as enqueueing the request to the
controller before they have to take a lock.  They may need some locks
in the meantime for the same inode, or to do disk allocation, but for
the most part it's a pretty clean path.

The "main" CPUs don't really get too involved with I/O.  I/O requests
are passed to either an "EMC" (Ethernet Mass Storage Card) or an MSC
(Mass Storage Card).  These contain processors and memory of their own
and talk directly to the SCSI bus.

Joe Boykin
Encore Computer Corp.

----
From: david@symult.com (David Lim)

Our machine, the System 2010, does parallel I/O by having a VME bus on
every node, to which you can attach disk drives and tape drives.  For
that matter, you can attach any VME device you want.  Our current file
system is Unix-like, and the file system is transparent to any program
running on any of the nodes.  I.e., if a node program does
open("filename",...), the OS will automatically open the file on the
correct disk even if the disk is not attached to the node the program
is running on.  Although not truly NFS, the FS is transparent to all
the nodes.  An enhancement to the file system that is coming soon will
give you true NFS capabilities; i.e., the host would simply NFS-mount
the disks on the nodes, and vice versa.

David Lim, david@symult.com
Symult Systems Corp. (formerly Ametek Computer Research Division)

----
From: moore@Alliant.COM (Craig Moore)

Look into the Alliant computer.  It's the most practical and
highest-performance machine on the market.

----
From: rab%sprite.Berkeley.EDU@ginger.Berkeley.EDU (Robert A. Bruce)

David,

I am involved in two research projects here at the University of
California that relate to filesystems for parallel computers.  The
Sprite Network Operating System is designed to run on multiprocessors,
and uses prefetching and caching to improve filesystem performance.  It
is a fully functional operating system; I am writing this letter from a
Sun-3 running Sprite.

The other project is `RAID', which stands for `redundant arrays of
inexpensive disks'.  A large array of small, slow, cheap disks is used
in order to provide cheap, reliable mass storage.  By using large
caches and log-based filesystems, performance should still be good.
This project is still just starting up, so we won't have any solid
results for a while.
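[For context on how such an array can get its reliability back: one
common approach is to keep a parity block that is the XOR of the data
blocks in a group, so the contents of any single failed disk can be
rebuilt from the survivors.  The sketch below only illustrates that
idea; the group width and block size are invented, and this is not
Sprite or Berkeley RAID code.]

    #include <stdio.h>
    #include <string.h>

    #define NDATA 4      /* data disks in a parity group (assumed)  */
    #define BSIZE 16     /* bytes per block, tiny for illustration  */

    /* parity block = XOR of the data blocks in the group */
    void compute_parity(unsigned char data[NDATA][BSIZE],
                        unsigned char parity[BSIZE])
    {
        int d, i;
        memset(parity, 0, BSIZE);
        for (d = 0; d < NDATA; d++)
            for (i = 0; i < BSIZE; i++)
                parity[i] ^= data[d][i];
    }

    /* rebuild the block on disk `lost` by XORing survivors + parity */
    void reconstruct(unsigned char data[NDATA][BSIZE],
                     unsigned char parity[BSIZE], int lost)
    {
        int d, i;
        memcpy(data[lost], parity, BSIZE);
        for (d = 0; d < NDATA; d++)
            if (d != lost)
                for (i = 0; i < BSIZE; i++)
                    data[lost][i] ^= data[d][i];
    }

    int main(void)
    {
        unsigned char data[NDATA][BSIZE], parity[BSIZE], saved[BSIZE];
        int d;

        for (d = 0; d < NDATA; d++)
            memset(data[d], 'a' + d, BSIZE);  /* fill with dummy data */
        compute_parity(data, parity);

        memcpy(saved, data[2], BSIZE);        /* remember disk 2      */
        memset(data[2], 0, BSIZE);            /* "lose" disk 2        */
        reconstruct(data, parity, 2);

        printf("disk 2 %s\n",
               memcmp(saved, data[2], BSIZE) == 0 ? "reconstructed"
                                                  : "NOT reconstructed");
        return 0;
    }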
----
Department of Computer Science, Duke University, Durham, NC 27706 USA
ARPA: dfk@cs.duke.edu    CSNET: dfk@duke    UUCP: decvax!duke!dfk