dfk@romeo.cs.duke.edu (David F. Kotz) (03/28/89)
(also posted to comp.parallel)

This is a summary of the responses I received regarding my posting about
I/O for parallel computers:

>I am doing some research in the area of disk I/O and filesystems for
>parallel processors. Does anyone have any references, addresses, or
>relevant information on the I/O capabilities and strategies taken by
>the various commercial multiprocessors? (eg Encore, Sequent,
>hypercubes, Connection Machine, etc).
>
>Also, any references regarding disk striping, disk caching and
>prefetching, parallel disk I/O, etc would be most welcome.

This is a heavily edited summary of the comments I received.  A
compilation of references in BibTeX format will follow.  Thanks for all
of your help!  (And I welcome more information!)

David Kotz
Department of Computer Science, Duke University, Durham, NC 27706 USA
ARPA: dfk@cs.duke.edu    CSNET: dfk@duke    UUCP: decvax!duke!dfk

===============================

Eugene Miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov:

Some people at MCC have done some work, as has Pixar.  I think Cray and
CDC have done the best commercial work, but this is artifact work, with
few really serious studies.  We have a CM2 with a DataVault, but again,
everyone plays their cards close.  IBM has also done some work; I have
one or two TRs, but they never reference some of the other stuff
(Garcia-Molina, for example).  Jean Bell at the Colorado School of Mines
might also have some data organization material.

This is certainly one of the weakest areas in parallel computing.  You
basically have to ask to find the art.  Until we get a little less
protective I would not expect progress in this area.  Most people spend
their time on the error and fault-tolerance aspects.

MCC is the Microelectronics and Computer Technology Corp., Austin,
Texas.  The Colorado School of Mines is in Golden, CO (near Coors Beer).
I don't think they are on the net, but they do have big CDC hardware.

----
From: Goetz Graefe <graefe@ogccse.ogc.edu>

Consider database machine architectures, e.g., GAMMA (VLDB 86, SIGMOD
88, SIGMOD 89), Teradata (a commercial product, a backend for IBMs,
same company name), and Bubba (an MCC project).

Intel's hypercube uses I/O nodes, which are 80386 CPUs with SCSI but
without DMA.  Sequent uses DCCs (dual-channel disk controllers), up to
4 on a Symmetry, for up to 32 drives.  They say their shared bus does
not impede I/O performance; they offer several DBMS products on the
machine.  Virtually all companies are working on it; call their
technical marketing departments.

----
From: Brett Fleisch <brett@CS.UCLA.EDU>

See the article on Rash in the USENIX Proceedings, Winter 89.

----
From: ephraim@Think.COM (Ephraim Vishniac)
Organization: Thinking Machines Corporation, Cambridge MA, USA

You can get manuals for the DataVault (TMC's parallel storage system
for the CM) by contacting Thinking Machines.  Call 617-876-1111 or
write to us at 245 First Street, Cambridge, MA 02142-1214.  The manual
is not terribly informative about DV internals, but it's a start.

If you'd like to *use* a DataVault (and a CM, of course), you can apply
for a free account by writing to cmns-manager@think.com.  CMNS is the
Connection Machine Network Server, a DARPA-sponsored project.  The
critical requirement for a CMNS account is the ability to telnet to
cmns.think.com, the CMNS CM host.

----
From: kathya@june.cs.washington.edu (Katherine J. Armstrong)

I'm just starting work on the same topic.

----
From: William Tsun-Yuk Hsu <hsu@uicsrd.csrd.uiuc.edu>

I'm working on research in the same area.
I only started looking at parallel I/O a few months ago, and so have
nothing worth writing up yet.  I'm more interested in the memory system
as a whole: how paging behavior affects secondary storage and I/O,
especially in a shared-memory environment (since that's what most
people in the Cedar group are interested in).  I have additional
references on parallel program behavior if you're interested.  (I am
also looking at how to model multiprogramming in a parallel
environment; we would like to obtain realistic traces from either real
systems or simulators to evaluate different I/O configurations.  Wish
we had access to something like Butterfly traces.)

----
From: Rajeev Jog <jog%hpda@hp-sde.sde.hp.com>

I'm working on the performance of multiprocessor architectures and I/O
systems.  Of late I've been focusing on tightly coupled shared-memory
systems.

----
From: rpitt@sol.engin.umich.edu

We have a Stellar GS1000.  It is one of the fastest graphics systems
around; performance is 20-25 MIPS (it is a four-processor machine).
Though our installation uses standard SCSI drives, the machine can do
striping on 3 drives (like Maxtor 4380s).  It also does not use a
standard filesystem.  I do not know exactly how the filesystem is
implemented, but you may want to look into it.  The system runs a very
clean implementation of Sys V with BSD extensions, as well as a
shared-memory approach to X11 client/server communication.  I have not
used many very high-end graphics machines, just the Stellar, the Apollo
DN10000, and the Ardent Titan (there are the Sun-4s, but they aren't in
the same class).

----
From: Russell R. Tuck <rrt@cs.duke.edu>

I believe Stellar Computer provides disk striping capabilities with
their workstations.  Since I'm not sure and can't find much about it,
all I can offer is their address and phone:

    Stellar Computer, Inc.
    75 Wells Avenue
    Newton, MA 02159
    (617) 964-1000

----
From: jsw@sirius.ctr.columbia.edu (John White)

I am also interested in this area.  In particular, I am interested in
the filesystems and strategies used for transputer networks, their
implementation, and their performance.

----
From: Dave Poplawski <pop@whitefish.mtu.edu>

There were several talks and poster sessions at the hypercube
conference this past week.  Papers corresponding to them will be
published in the proceedings, which we were told would be available by
the end of April.

----
From: John.Pieper@B.GP.CS.CMU.EDU

I have been working on this problem for some time.

----
From: Shannon Nelson <shannon@intelisc.intel.com>

[This person is a contact at Intel.]

----
From: Tom Jacob <convex!ntvax!jacob@mcnc.org>  (Univ. of North Texas)

Here is (part of) a discussion on disk striping that was on the net a
couple of years ago.

/* Written 4:08 pm Aug 3, 1987 by pioneer.UUCP!eugene in ntvax:comp.arch */
/* ---------- "Re: Disk Striping (description and" ---------- */

I meant to say DISK STRIPING.  This is the distribution of data across
multiple "spindles" in order to 1) increase total bandwidth, 2) provide
fault tolerance (like Tandem's), and 3) other miscellaneous reasons.
Very little work has been done on the subject, yet a fair number of
companies have implemented it: Cray, CDC/ETA, the Japanese
manufacturers, Convex, and Pyramid (so I am informed), and I think
Tandem, etc.  Now for important perspective: striping over 3-4 disks,
as in a personal computer, is a marginal proposition.  Striping over 40
disks, now there is some use.  The break-even point is probably between
8 and 16 disks (excepting the fault-tolerance case).  A person I know
at Amdahl boiled the problem down to 3600 RPM running against a 60 Hz
wall clock: the mechanical bottlenecks of getting data into and out of
a CPU from a disk.  The work is not as glamorous as making CPUs, yet it
is just as difficult (consider the possibility of losing just one
spindle).

The two most cited papers I have seen are:

%A Kenneth Salem
%A Hector Garcia-Molina
%T Disk Striping
%R TR 332
%I EE CS, Princeton University
%C Princeton, NJ
%D December 1984

%A Miron Livny
%A Setrag Khoshafian
%A Haran Boral
%T Multi-Disk Management Algorithms
%R DB-146-85
%I MCC
%C Austin, TX
%D 1985

Both of these are pretty good reports, but more work needs to be done
in this area; hopefully one or two readers might take it up seriously.
The issue is not simply one of sequentially writing bits out to
sequentially lined-up disks.  I just received:

%A Michelle Y. Kim
%A Asser N. Tantawi
%T Asynchronous Disk Interleaving
%R RC 12496 (#56190)
%I IBM TJ Watson Research Center
%C Yorktown Heights, NY
%D Feb. 1987

This looks good, but what is interesting is that it does not cite
either of the two reports above, but quite a few others (RP^3- and
Ultracomputer-based).  Kim's PhD dissertation is on synchronous disk
interleaving, and she has a paper in IEEE TOC.  Another paper I have is
Arvin Park's paper on IOStone, an I/O benchmark.  Park is also at
Princeton under Garcia-Molina (massive-memory VAXen).  I have other
papers, but these are the major ones.  Just start thinking terabytes
and terabytes.

From a badge I got at ACM/SIGGRAPH:  Disk Space: The Final Frontier

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
/* End of text from ntvax:comp.arch */
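[To make the mapping concrete, here is a minimal sketch of the kind of
round-robin block-to-spindle mapping striping implies.  The stripe unit
and spindle count are invented for illustration and are not any
particular vendor's layout.]

    /* Hypothetical round-robin striping of a logical block address
     * across NDISKS spindles.  Parameters are made up. */
    #include <stdio.h>

    #define NDISKS      8     /* spindles in the stripe group (assumed) */
    #define STRIPE_UNIT 16    /* consecutive blocks placed per spindle  */

    struct location { int disk; long physblock; };

    struct location map_block(long logical)
    {
        struct location loc;
        long stripe = logical / STRIPE_UNIT;      /* which stripe unit  */
        long offset = logical % STRIPE_UNIT;      /* offset within unit */

        loc.disk      = (int)(stripe % NDISKS);   /* round-robin spindle */
        loc.physblock = (stripe / NDISKS) * STRIPE_UNIT + offset;
        return loc;
    }

    int main(void)
    {
        long b;
        for (b = 0; b < 40; b++) {
            struct location l = map_block(b);
            printf("logical %3ld -> disk %d, block %ld\n",
                   b, l.disk, l.physblock);
        }
        return 0;
    }

[A large sequential request then keeps all NDISKS spindles transferring
at once, which is where the bandwidth win comes from; with only 3 or 4
spindles, seek and rotation overheads tend to eat the gain, which is
the break-even argument above.]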
/* Written 7:08 pm Aug 3, 1987 by shukra.UUCP!ram in ntvax:comp.arch */
/* ---------- "Re: Disk Striping (description and" ---------- */

In article <2432@ames.arpa>, eugene@pioneer.arpa (Eugene Miya N.) writes:
> I meant to say DISK STRIPING.  This is the distribution of data across
> multiple "spindles" in order to 1) increase total bandwidth, 2) for
> reasons of fault-tolerance (like Tandems), 3) other miscellaneous
> reasons.

So, my guess was right.  Does the CM use such a feature to obtain a
large BW?  How come IBM's disk farms don't have these, or do they?

Renu Raman                     ARPA: ram@sun.com
Sun Microsystems               UUCP: {ucbvax,seismo,hplabs}!sun!ram
M/S 5-40, 2500 Garcia Avenue, Mt. View, CA 94043
/* End of text from ntvax:comp.arch */

/* Written 3:03 am Aug 5, 1987 by pyrltd.UUCP!bejc in ntvax:comp.arch */
/* ---------- "Re: Disk Striping (description and" ---------- */

In article <2432@ames.arpa>, eugene@pioneer.arpa (Eugene Miya N.) writes:
> Very little work has been done on the subject yet a fair number of
> companies have implemented it: Cray, CDC/ETA, the Japanese
> manufacturers, Convex, and Pyramid (so I am informed), and I think
> Tandem, etc.  [...]

Pyramid has been shipping "striped disks" as part of OSx 4.0 since
early this year.  "Striped disk" is one of four "virtual disk" types
offered under OSx, the others being "concatenated", "mirrored", and
"simple".  A full description of the techniques and their
implementation was given by Tom Van Baak of Pyramid Technology
Corporation at the February Usenix/Uniforum meeting in Washington.

The principal reason for using "striped disk" is performance.  The
ability to place interleaved clusters of data on different spindles can
be a winner in cases where the disk throughput rate is approaching the
saturation point of a single disk, and you have a disk controller
intelligent enough to know where every disk head is at any given time.

To take a case in point, ICC, a company based in the City of London,
supplies financial data from an 8-Gbyte database to dial-up
subscribers.  One index in the database is >800 Mbytes long and had
been set up on a "concatenated virtual disk" made up of two 415-Mbyte
Eagles.  When the setup was switched to the "striped virtual disk"
model, a throughput increase of 34% was measured.  This doesn't mean
that "striped" disks are going to answer everybody's disk performance
problems, but they can provide significant improvements in certain
cases.

Both Tom and I have produced papers on virtual disks and would be happy
to answer any further questions that you have.  Tom can be contacted at
pyramid!tvb, while my address is given below.

Brian E.J. Clark               Phone : +44 276 63474
Pyramid Technology Ltd         Fax   : +44 276 685189
                               Telex : 859056 PYRUK G
                               UUCP  : <england>!pyrltd!bejc
/* End of text from ntvax:comp.arch */

/* Written 7:44 am Aug 7, 1987 by cc4.bbn.com.UUCP!lfernand in ntvax:comp.arch */

> Isn't this what IBM uses in their Airline Control Program?  There was that
> article in CACM a while back about the TWA reservation system, and it said
> something about spreading files over a large number of spindles for greater
> throughput.
>                haynes@ucscc.ucsc.edu

What ACP does isn't what we are calling disk striping in this
newsgroup.  ACP has an option to write each record to two different
disks at the same time.  This doesn't increase throughput but does have
several benefits:

1/ It creates a backup copy, so the data will not be lost if one disk
   crashes.

2/ It allows ACP a choice of where to read the data from.  ACP will
   read the data from the disk with the shortest queue, reducing the
   access delay.

                               ...Lou
                               lfernandez@bbn.com
                               ...!decwrl!bbncc4!lfernandez
/* End of text from ntvax:comp.arch */
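[What Lou describes is record duplexing plus queue-aware read
scheduling.  Here is a minimal sketch of the idea; the structures and
names are invented, and this is not ACP code.]

    #include <stdio.h>

    /* Sketch of writing every record to two disks and reading from
     * whichever copy currently has the shorter queue. */
    struct disk { const char *name; int queue_len; };

    static void issue(struct disk *d, const char *op, long rec)
    {
        d->queue_len++;                   /* request joins the queue */
        printf("%s record %ld on %s (queue now %d)\n",
               op, rec, d->name, d->queue_len);
    }

    /* writes go to BOTH disks: a backup copy survives a single crash */
    void dup_write(struct disk *a, struct disk *b, long rec)
    {
        issue(a, "write", rec);
        issue(b, "write", rec);
    }

    /* reads go to whichever copy has the shorter queue right now */
    void dup_read(struct disk *a, struct disk *b, long rec)
    {
        issue(a->queue_len <= b->queue_len ? a : b, "read", rec);
    }

    int main(void)
    {
        struct disk d0 = { "disk0", 0 }, d1 = { "disk1", 0 };

        dup_write(&d0, &d1, 42);
        d0.queue_len += 3;                /* pretend disk0 got busy */
        dup_read(&d0, &d1, 42);           /* goes to disk1          */
        return 0;
    }

[The write cost is doubled, but reads see the shorter of the two
queues, which is where the access-delay reduction comes from.]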
/* Written 4:50 am Aug 9, 1987 by amdcad.UUCP!rpw3 in ntvax:comp.arch */

In article <310@cc4.bbn.com.BBN.COM> lfernand@cc4.bbn.com.BBN.COM
(Louis F. Fernandez) writes:
| What ACP does isn't what we are calling disk striping in this newsgroup.
| ACP has an option to write each record to two different disks at the
| same time.  [...]

Sorry, the original poster is correct (sort of).  ACP *does* have disk
striping, in addition to the redundant writing you mentioned, but still
it isn't quite the same as we are talking about here.  They spread a
file across several disks, all right, but the allocation of records
(all fixed length -- this is *old* database technology!) is such that
the disk drive number is a direct-mapped hash of some key in the
record!  What this does is spread accesses to similar records (like
adjacent seats on the same flight) across many disks (sometimes up to
100 packs!).

Rob Warnock
Systems Architecture Consultant
UUCP:    {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3
DDD:     (415) 572-2607
USPS:    627 26th Ave, San Mateo, CA 94403
/* End of text from ntvax:comp.arch */
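[A minimal sketch of that kind of key-to-pack mapping.  The key format,
hash function, and pack count are all invented for illustration; this
is not ACP code.]

    #include <stdio.h>

    /* Sketch of hashing a record key directly to a drive (pack) number
     * so that "adjacent" records, e.g. seats on one flight, scatter
     * across many packs.  Parameters are made up. */

    #define NPACKS 100                 /* spindles in the farm (assumed) */

    int pack_for_key(const char *key)
    {
        unsigned long h = 0;
        while (*key)
            h = h * 31 + (unsigned char)*key++;   /* simple string hash */
        return (int)(h % NPACKS);       /* direct-mapped: key -> pack   */
    }

    int main(void)
    {
        const char *keys[] = { "FL845-12A", "FL845-12B",
                               "FL845-12C", "FL846-01A" };
        int i;

        for (i = 0; i < 4; i++)
            printf("%s -> pack %d\n", keys[i], pack_for_key(keys[i]));
        return 0;
    }

[Records that would otherwise sit next to each other on one pack land
on different packs, so concurrent transactions hit different spindles.]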
----
From: Donald.Lindsay@K.GP.CS.CMU.EDU  (August 1988)

Monty Denneau's TF-1 is supposedly going to have several thousand disk
drives in parallel.  (The TF-1 will have 32K CPUs.)  He implied he'd
use error-correcting techniques, down the vector of drives, so that a
drive failure wouldn't matter.  Also, NCUBE offers a drive-per-node
arrangement.  denneau@ibm on CSNET/BITNET can tell you what he's
published.

I should also have mentioned the DataVault for the Connection Machine
(actually, the CM-2) at Thinking Machines.  Since they've actually
shipped some of these, they may have live info.  (They store 32-bit
words across 39 disk drives.)

----
From: Bob Knighten <pinocchio!knighten@xenna.encore.com>

The Encore Multimax presents essentially a plain vanilla Unix view of
I/O, with the addition that the I/O devices are attached to the system
bus, so all processors have equal access to them.  I know of no
technical papers that deal specifically with the I/O system.  This is
because to date there has been little work aimed specifically at the
I/O system; rather, the work has gone into making the kernel truly
symmetrical and reducing lock contention.  For example, here are some
early timings done on the Multimax using a master/slave kernel and
using the multithreaded kernel.  (This is for a simple program that
performs 1000 read operations from random locations in a large file.)

    copies of program    multithreaded    master/slave
            1                5 secs           8 secs
            4                5 secs          25 secs
            8                9 secs          50 secs
           12               13 secs          77 secs

There is no other difference at all in the I/O subsystem, which is
standard 4.2BSD.

Bob Knighten
Encore Computer Corp.
257 Cedar Hill St.
Marlborough, MA 01752
(508) 460-0500 ext. 2626
Arpanet: knighten@multimax.arpa OR knighten@multimax.encore.com
Usenet:  {bu-cs,decvax,necntc,talcott}!encore!knighten

----
From: Joseph Boykin <boykin@calliope.encore.com>

Encore currently ships four models of computer: 310, 320, 510, and 520.
The 310/320 use the NS32332 CPU (2 MIPS/CPU).  The 510/520 use the
NS32532 CPU (8.5 MIPS/CPU).  The 310/510 use a single "low-boy"
cabinet; the 320/520 use dual-bay, full-size cabinets.  The 320/520
also have more slots for memory, CPU, and I/O cards than the 310/510.

All systems use SCSI direct to the disk/tape drive.  Some systems are
still going out with single-ended SCSI, some with differential
synchronous.  Within the next few months, all systems will be
differential synchronous.  The difference is not noticeable for a
single transfer across the bus, but when the (SCSI) bus gets busy,
differential synchronous is significantly faster.

Disk drives are CDC Syber for the X20 (1.1 GB formatted per drive).
You can fit either 6 or 8 in the cabinet; more if you add cabinets.
The maximum is 64 drives.  We use the CDC Wren IV (?) (about 3XX
MB/drive) on the X10.  We'll be going to the Wren V (?) (6XX MB/drive)
when available.

We currently ship three OS's: System V, 4.2BSD, and MACH.  All three
have been parallelized.  I.e., if two processes make simultaneous disk
requests, they will get as far as enqueueing the request to the
controller before they have to take a lock.  They may need some locks
in the meantime for the same inode, or to do disk allocation, but for
the most part it's a pretty clean path.

The "main" CPUs don't really get too involved with I/O.  I/O requests
are passed to either an "EMC" (Ethernet Mass Storage Card) or an MSC
(Mass Storage Card).  These contain processors and memory of their own
and talk directly to the SCSI bus.

Joe Boykin
Encore Computer Corp.

----
From: david@symult.com (David Lim)

Our machine, the System 2010, does parallel I/O by having a VME bus on
every node, to which you can attach disk drives and tape drives.  For
that matter, you can attach any VME device you want.  Our current file
system is Unix-like, and the file system is transparent to any program
running on any of the nodes.  I.e., if a node program does
open("filename",...), the OS will automatically open the file on the
correct disk even if the disk is not attached to the node the program
is running on.  Although not truly NFS, the FS is transparent to all
the nodes.  An enhancement to the file system that is coming soon will
give you true NFS capabilities; i.e., the host would simply NFS-mount
the disks on the nodes, and vice versa.

David Lim, david@symult.com
Symult Systems Corp. (formerly Ametek Computer Research Division)

----
From: moore@Alliant.COM (Craig Moore)

Look into the Alliant computer.  It's the most practical and
highest-performance machine on the market.

----
From: rab%sprite.Berkeley.EDU@ginger.Berkeley.EDU (Robert A. Bruce)

David,

I am involved in two research projects here at the University of
California that relate to filesystems for parallel computers.  The
Sprite Network Operating System is designed to run on multiprocessors,
and uses prefetching and caching to improve filesystem performance.  It
is a fully functional operating system; I am writing this letter from a
Sun-3 running Sprite.

The other project is `RAID', which stands for `redundant arrays of
inexpensive disks'.  A large array of small, slow, cheap disks is used
in order to provide cheap, reliable mass storage.  By using large
caches and log-based filesystems, performance should still be good.
This project is still just starting up, so we won't have any solid
results for a while.
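[For context on how such an array can get its reliability back: one
common approach is to keep a parity block that is the XOR of the data
blocks in a group, so the contents of any single failed disk can be
rebuilt from the survivors.  The sketch below only illustrates that
idea; the group width and block size are invented, and this is not
Sprite or Berkeley RAID code.]

    #include <stdio.h>
    #include <string.h>

    #define NDATA 4      /* data disks in a parity group (assumed)  */
    #define BSIZE 16     /* bytes per block, tiny for illustration  */

    /* parity block = XOR of the data blocks in the group */
    void compute_parity(unsigned char data[NDATA][BSIZE],
                        unsigned char parity[BSIZE])
    {
        int d, i;
        memset(parity, 0, BSIZE);
        for (d = 0; d < NDATA; d++)
            for (i = 0; i < BSIZE; i++)
                parity[i] ^= data[d][i];
    }

    /* rebuild the block on disk `lost` by XORing survivors + parity */
    void reconstruct(unsigned char data[NDATA][BSIZE],
                     unsigned char parity[BSIZE], int lost)
    {
        int d, i;
        memcpy(data[lost], parity, BSIZE);
        for (d = 0; d < NDATA; d++)
            if (d != lost)
                for (i = 0; i < BSIZE; i++)
                    data[lost][i] ^= data[d][i];
    }

    int main(void)
    {
        unsigned char data[NDATA][BSIZE], parity[BSIZE], saved[BSIZE];
        int d;

        for (d = 0; d < NDATA; d++)
            memset(data[d], 'a' + d, BSIZE);  /* fill with dummy data */
        compute_parity(data, parity);

        memcpy(saved, data[2], BSIZE);        /* remember disk 2      */
        memset(data[2], 0, BSIZE);            /* "lose" disk 2        */
        reconstruct(data, parity, 2);

        printf("disk 2 %s\n",
               memcmp(saved, data[2], BSIZE) == 0 ? "reconstructed"
                                                  : "NOT reconstructed");
        return 0;
    }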
----
Department of Computer Science, Duke University, Durham, NC 27706 USA
ARPA: dfk@cs.duke.edu    CSNET: dfk@duke    UUCP: decvax!duke!dfk