dfk@romeo.cs.duke.edu (David F. Kotz) (03/15/89)
A number of you have, or envision, applications for computers that involve large amounts of data: databases, matrices, images, VLSI circuit designs, etc. My research involves filesystems for parallel computers, and I am unsure what kinds and arrangements of data parallel computers might use. I am interested in how the data is arranged on disk, and how the application reads it into memory:

  - Does it read a lot, write a lot, or both?
  - Does it read and write the same file?
  - Does it read/write it all at once, or throughout the run of the application?
  - Does it read/write sequentially or randomly?
  - How might you modify the application for a parallel computer, and specifically for parallel I/O?

Can I discuss this with you? Do you have some data that I can analyze to determine, for example, the size of each row of a matrix, or each cell of a chip?

Thanks,
David Kotz
Department of Computer Science, Duke University, Durham, NC 27706
ARPA: dfk@cs.duke.edu   CSNET: dfk@duke   UUCP: decvax!duke!dfk
eugene@eos.arc.nasa.gov (Eugene Miya) (03/16/89)
I have discussed some of this with David by mail, but some of the comments might be useful for network discussion.

Really large data sets won't fit on a disk, even on existing disk systems. Some examples from my experience, and from watching others talk about this, include planetary imaging data (Landsat, Voyager, Seasat). Some of it is 2-D, so you take rows (scan lines) and "splice" them onto a tape (frequently 90 MB per frequency, 4-7 bandwidths, etc.). Some of the data is strictly linear and time-varying, such as some seismic data; again, tapes. Voyager, on a single planetary encounter, generates well over ten thousand magnetic tapes. Such images are typically 1K pixels on a side, and higher resolutions are planned (10^18 bits for a future mission isn't uncommon). The physical manifestation of such a DB is a warehouse. Most of it, of course, never gets looked at (remember the very end of Raiders of the Lost Ark? ;-) Want to know how the ozone hole was overlooked?). You don't want to look at the data yourself; you get a machine to do it. In fact, most raster systems can't hold complete images; you only view sections, so you end up always looking at portions.

Now, the numbers I mention (90 MB, etc.) might all sound small. These can fit; that's not the point: roughly 10 images in a GB. But that's only bandwidth, and that's only images, and images are TOO easy for computer people to understand. Deceptively easy. Some signal processing involves high-dimension FFT (Fast Fourier Transform) processing. You might want to look at the frequency space, or, for non-linear dynamics, at the phase space. FFTs have been proposed that would involve a 1K-per-side, 4-D transform. To get at data like that, most users just resort to simple linear sweeps (varying one index) through big arrays. It's not completely clear how proposals like RAID will help this (nor images, for that matter). Grids and meshes: same thing.
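A hypothetical sketch of the "linear sweep" access pattern mentioned above: fix all but one index of a multi-dimensional array and vary the remaining one. On a row-major layout, which index you sweep determines whether the elements touched on disk are contiguous or widely strided, which is exactly what matters for parallel I/O. The shapes and helper names here are illustrative assumptions, not anything from the posts.

```python
# Sketch: "linear sweeps" through a 4-D array stored row-major.
# Sweeping the last index touches contiguous elements; sweeping the
# first index touches elements separated by a large stride.
# Sizes are tiny, illustrative stand-ins for a 1K-per-side grid.

def strides(shape):
    """Row-major element strides for an array of the given shape."""
    s = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        s[i] = s[i + 1] * shape[i + 1]
    return s

def sweep_offsets(shape, axis, fixed):
    """Flat offsets touched when sweeping `axis`, other indices fixed.

    `fixed` maps each non-swept axis to its fixed index value.
    """
    st = strides(shape)
    base = sum(st[i] * fixed[i] for i in range(len(shape)) if i != axis)
    return [base + st[axis] * k for k in range(shape[axis])]

shape = (4, 4, 4, 4)
# Sweep the innermost index: consecutive offsets (sequential I/O).
inner = sweep_offsets(shape, 3, {0: 1, 1: 2, 2: 3})
# Sweep the outermost index: offsets 64 elements apart (strided I/O).
outer = sweep_offsets(shape, 0, {1: 2, 2: 3, 3: 0})
```

The point of the contrast: both sweeps read the same number of elements, but only one of them maps onto sequential disk reads.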
Try to visualize 4-space, 5, 6, 7 (the problems exist); we have developed only crude mechanisms to look at the simple cases (Tufte). Oh yes, some data compression can help, but it must be done carefully.

A different way of organizing the data might be like the airlines' (SABRE). There's a CACM paper and a conference on this topic, and there are some meetings, IEEE and others (one in Oregon at this moment), which cover some of these issues: distributed databases, transaction processing.

Another gross generalization from
--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
resident cynic at the Rock of Ages Home for Retired Hackers:
"Mailers?! HA!", "If my mail does not reach you, please accept my apology."
Domains, the zip codes of networks.
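On the "compression can help, but must be done carefully" remark above: a minimal sketch of lossless run-length encoding, the kind of simple scheme sometimes applied to scan-line image data. This is an illustration of the general idea, not any scheme used on the missions mentioned; note that run-length coding can *expand* data that has no runs, which is one reason the care is needed.

```python
# Minimal lossless run-length encoding for a scan line of pixel values.
# Helps on long runs (flat regions of an image); can expand data with
# no runs at all, so it must be applied with care.

def rle_encode(line):
    """Collapse `line` into (value, run_length) pairs."""
    out = []
    for v in line:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

def rle_decode(pairs):
    """Invert rle_encode: expand (value, run_length) pairs to a list."""
    line = []
    for v, n in pairs:
        line.extend([v] * n)
    return line
```

Because decode exactly inverts encode, nothing is lost; the only question is whether the encoded form is actually smaller for the data at hand.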