dfk@romeo.cs.duke.edu (David F. Kotz) (03/28/89)
This is a summary of the responses I received regarding my posting about very large data sets.

>A number of you have or envision applications for computers that
>involve large amounts of data. Perhaps these are databases, matrices,
>images, VLSI circuit designs, etc. My research involves filesystems
>for parallel computers, and I am unsure what kinds and arrangements of
>data might be used by parallel computers.
>
>I am interested in the arrangement of the data on disk, and how the
>application reads it into memory. Does it read a lot or write a lot,
>or both? Does it read and write the same file? Does it read/write it
>all at once, or throughout the run of the application? Does it
>read/write sequentially or randomly? How might you modify the
>application for a parallel computer, and specifically, for parallel
>I/O?

This is a heavily edited summary of the comments I received. Thanks for all of your help!! (And I welcome more information!)

David Kotz
Department of Computer Science, Duke University, Durham, NC 27706 USA
ARPA: dfk@cs.duke.edu   CSNET: dfk@duke   UUCP: decvax!duke!dfk

==============================================

From: ephraim@Think.COM (Ephraim Vishniac)
Organization: Thinking Machines Corporation, Cambridge MA, USA

I don't know if this is the kind of application you have in mind, but my office-mates and I are the implementers of a large-scale document-retrieval system for the Connection Machine. This system is now in commercial use at Dow Jones News Retrieval, as the back end of their "DowQuest" service. When the system is full, the database consists of roughly a gigabyte of text.

If you'd like to know more, write: either to me (ephraim@think.com) or to the group (newsroom@think.com). There may be restrictions on what we can disclose; I'll have to check on this.

----

From: chn%a@LANL.GOV (Charles Neil)

I have become interested in the subject you're discussing. In our applications, we frequently spend 40-60% of the time accessing equation-of-state data. A colleague and I have begun to investigate the possibility of spooling such calculations to the Lab's newly acquired Connection Machine. However, the databases we access can hold a large amount of information, more, we fear, than the 8 Kbytes of memory each of the 64K processors gets. So we're wondering how to section the data efficiently. This effort on our part is speculative at this point, and such questions may be old hat for the experts, but I would be happy to discuss it further.
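As a rough illustration of the arithmetic behind Charles's concern: 64K processors at 8 Kbytes each give only about 512 Mbytes of aggregate memory, so a table larger than that has to be split into sections and streamed from disk over several passes. Here is a minimal sketch of that sectioning calculation (mine, not his; the 4-Gbyte table size is invented for illustration):

    /* Sectioning a large equation-of-state table across 64K processors
     * with 8 Kbytes each: one "section" is whatever fits in aggregate
     * memory, and the table is processed in that many passes over disk. */
    #include <stdio.h>

    #define NPROC      65536L          /* 64K processors            */
    #define MEM_PER_PE 8192L           /* 8 Kbytes of memory per PE */

    int main(void)
    {
        long   section = NPROC * MEM_PER_PE;   /* 512 Mbytes per pass     */
        double table   = 4.0e9;                /* hypothetical table size */
        long   passes  = (long)((table + section - 1) / section);

        printf("bytes per section (one pass): %ld\n", section);
        printf("bytes held by each processor: %ld\n", MEM_PER_PE);
        printf("disk passes for whole table:  %ld\n", passes);
        return 0;
    }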
----

From: MOBY <gvw%etive.edinburgh.ac.uk@NSS.Cs.Ucl.AC.UK> (Greg Wilson)

Hi there. For the last few years a group in Molecular Biology at the University of Edinburgh have been doing protein sequencing using a SIMD machine called the DAP (Distributed Array Processor), originally built by ICL and now being built and marketed by a spinoff company called AMT. They have *very* large data sets, installed once on the disk and then read repeatedly as unidentified sequences are matched. If you would like more information, you should contact Dr. John Collins (J.Collins@uk.ac.ed, plus gumph to get across the Atlantic).

----

From: S. Sivaramakrishnan <sivaram%hpda@hp-sde.sde.hp.com>

Most seismic data processing algorithms involve extremely large amounts of data - several Gbytes - and a lot of computation. However, most of the big machines page-fault so badly trying to bring large data sets in and out that they end up being totally I/O bound. The data storage and the access pattern depend on the actual algorithm used. Most seismic algorithms are pretty well structured, and the regular data access patterns, frequently sequential, may be used to advantage.

The new "VISC" (Visualization in Scientific Computing) needs are far more I/O bound, as they involve real-time I/O. Further, the access patterns are more irregular, though somewhat localized. Here again the data sets are in Gbytes. Some fault-tolerant applications, like fault-testing algorithms, may also involve large data sets.

There are two seismic data processing algorithms that I am fairly familiar with, both "migration" algorithms. It seems to me that both offer exactly the kind of spatial and temporal parallelism in the transfer of large-scale data sets you are after, depending of course on the kind of machine you are actually working with. There are different kinds of data sets, all fairly large, and they can be varied according to need. The algorithms are regular, and so are the data access patterns. Efficient data layout can improve the sequentiality of access. Since some data sets are independent of others, they may be fetched in parallel. Also, the regularity of the computation allows the data fetches to be pipelined, and prefetching data sets dramatically improves performance in terms of I/O bandwidth; the synchronization points and the differences in processing time between pipeline stages encourage such tactics. Some data sets are read-only and others read/write, and some actions need to be semaphored. Finally, memory constraints force efficient computation-to-I/O ratios. All of the above seems to be exactly what you are looking for. There is, however, no partial file access (though in a slightly different algorithm that might be possible) and no truly random data access. As I mentioned earlier, the system (a multiprocessor, I would guess) you are running on would greatly affect what you do.
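The pipelined fetch/prefetch Sivaram describes amounts to double buffering: start the read of block k+1 and process block k while that read is in flight. A minimal sketch, assuming POSIX asynchronous I/O; the file name, block size, and process() routine are placeholders of mine, not anything from his migration codes:

    /* Double-buffered read loop: while process() works on one block, the
     * read of the next block is already outstanding.  Link with -lrt on
     * systems that keep the POSIX AIO routines in librt. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK (1 << 20)                       /* 1-Mbyte blocks (arbitrary) */

    static void process(const char *buf, ssize_t n)
    {
        (void)buf; (void)n;                       /* computation on one block goes here */
    }

    int main(void)
    {
        static char buf[2][BLOCK];
        struct aiocb cb;
        off_t offset = 0;
        int cur = 0;

        int fd = open("traces.dat", O_RDONLY);    /* hypothetical seismic trace file */
        if (fd < 0) { perror("open"); return 1; }

        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_nbytes = BLOCK;
        cb.aio_buf    = buf[cur];
        cb.aio_offset = offset;
        aio_read(&cb);                            /* start the first read */

        for (;;) {
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS) /* wait for the block in flight */
                aio_suspend(list, 1, NULL);
            ssize_t n = aio_return(&cb);
            if (n <= 0)
                break;                            /* end of file or an error */

            int next = 1 - cur;
            offset += n;
            cb.aio_buf    = buf[next];            /* kick off the next read ...   */
            cb.aio_offset = offset;
            aio_read(&cb);

            process(buf[cur], n);                 /* ... while this block is used */
            cur = next;
        }
        close(fd);
        return 0;
    }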
VISC: this is less predictable than the seismic computation. However, there is potential for prefetch, pipelining, and parallel data access, including, to an extent, random access. The sizes of the data accesses may truly differ.

You may find detailed discussions of the above in a few technical reports at The University of Texas at Austin. You may wish to send email to Dr. J. C. Browne, who used to be my supervising professor; you can use my name. You may wish to outline your research interests and ask him for a copy of the reports on "Evaluation of Logical External Memory Architectures" and on the "Parallel Formulation of I/O Intensive Computation Structures", both submitted to IBM for a research project. The first report details the seismic algorithm, while the latter is more detailed about VISC. My Masters thesis has the same title as the former report and specifically addresses the modeling of seismic code on a particular machine. Dr. Browne's email: browne@cs.utexas.edu. I can give you further details if you wish.

Sivaram.  Ph: (408) 447-4204 (off)  (408) 554-0584 (res)

----

From: Patrick Odent <prlb2!imec!odent@uunet.UU.NET>

Circuit simulation is one of the VLSI design applications developed on a parallel computer system, a SEQUENT BALANCE 8000, at our institution. As you are interested in the form of disk I/O in such an application, the following flowchart summarizes the file accesses in the system:

    read circuit description from file
    read simulation commands from file
    simulate circuit
    print simulation results in a file

This program needs three different files: two input files and one output file. These files are simple sequential ASCII files. Their size can be rather large: 1 Mbyte for the circuit description and up to 10 Mbytes for the simulation results. The large output file is also used by a graphical postprocessor. Writing and reading this file can take a large amount of time, reducing the overall system performance.

We observed a more serious problem concerning disk I/O. If the parallel program is used to simulate a big circuit with, for example, 10 processors, the performance is very bad. The reason for this degradation is page faulting: the memory needed for the circuit description and intermediate results is so large that paging to disk becomes necessary, and when 10 processors need a new page at the same time, the system cannot serve all the requests at once. This makes the parallel execution very inefficient. I think a parallel disk I/O system would be very useful for solving this problem. Do you have another solution?

Patrick Odent
IMEC vzw
Kapeldreef 75
3030 Leuven
Belgium

----

From: hearn@tcgould.TN.CORNELL.EDU (hearn)

One of the biggest users of large data sets is the oil industry, which collects reflection seismic data; it is also the number one supercomputer user for that reason. Even a simple filter becomes very time consuming with large data volumes. I work in academic deep seismic reflection profiling with the COCORP project at Cornell. A typical 2d data set for us is half a gigabyte; industry uses 3d data sets of substantially larger volume. We use the Cornell Supercomputer Facility (two IBM 3090s), including some of their parallel processing capabilities. Our job times are typically at least 50% i/o. Even with a potential 12-fold parallel speedup (12 cpus is the maximum here), the i/o bottleneck restricts us to half that speedup. Unfortunately, neither the parallel fortran nor the CMS operating system allows parallel or asynchronous i/o, so I don't know much about that except that we could sure use it.

The fact is that for large data sets, high-speed, large disk drives (no more tape-to-tape processing, please) are the first step in speeding life up. Another major feature we take advantage of is the 999-Mbyte extended addressing (that's IBMese for virtual memory) with 512 Mbytes of real memory. Such space eliminates the need for temporary files and even changes the actual way in which we can process the data. Hope you find this useful.

Tom Hearn
hearn%bat.tn.cornell.edu
hearn@geology.tn.cornell.edu
erdj@cornellf.bitnet

----

Department of Computer Science, Duke University, Durham, NC 27706 USA
ARPA: dfk@cs.duke.edu   CSNET: dfk@duke   UUCP: decvax!duke!dfk