[comp.parallel] Very Large Data Sets - what are they?

dfk@romeo.cs.duke.edu (David F. Kotz) (03/15/89)

A number of you have or envision applications for computers that
involve large amounts of data. Perhaps these are databases, matrices,
images, VLSI circuit designs, etc.  My research involves filesystems
for parallel computers, and I am unsure what kinds and arrangements of
data might be used by parallel computers. 

I am interested in the arrangement of the data on disk, and how the
application reads it into memory. Does it read a lot or write a lot,
or both? Does it read and write the same file?  Does it read/write it
all at once, or throughout the run of the application?  Does it
read/write sequentially or randomly? How might you modify the
application for a parallel computer, and specifically, for parallel
I/O? 
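For concreteness, one arrangement these questions point at can be sketched in code (purely an illustration; the block size and the interleaved-striping layout are my assumptions, not anything from a real parallel filesystem):

```python
# Hypothetical sketch: a file declustered into fixed-size blocks, with
# each of P processes reading every P-th block (an interleaved stripe).
import os
import tempfile

BLOCK = 4096  # assumed block size; real filesystems vary


def read_stripe(path, rank, nprocs, block=BLOCK):
    """Read the blocks assigned to process `rank` out of `nprocs`.

    Each process seeks to its own blocks, so P processes together
    cover the whole file with no overlap.
    """
    chunks = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        nblocks = (size + block - 1) // block
        for b in range(rank, nblocks, nprocs):
            f.seek(b * block)
            chunks.append(f.read(block))
    return chunks
```

Reassembling block b of the file means asking process b % P for its (b // P)-th chunk; whether such a layout actually helps depends on exactly the access patterns asked about above.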

Can I discuss this with you? Do you have some data that I can analyze
to determine, for example, what is the size of each row of the matrix,
or cell of the chip, etc? 

Thanks,
David Kotz
Department of Computer Science, Duke University, Durham, NC 27706
ARPA:	dfk@cs.duke.edu
CSNET:	dfk@duke        
UUCP:	decvax!duke!dfk

eugene@eos.arc.nasa.gov (Eugene Miya) (03/16/89)

I have discussed some of this with David by mail, but some of the comments
might be useful for network discussion.

Well, really large data sets won't fit on a disk, not even on existing
disk systems.  Some examples, from my own experience and from watching
others talk about this, include planetary imaging data (Landsat, Voyager,
Seasat).  Some of this is 2-d, so you make rows (scan lines) and "splice"
them onto a tape (frequently 90 MB per frequency, 4-7 bandwidths, etc.).
Some of the data is strictly linear and time-varying, such as some seismic
data, again on tapes.  Voyager, during a single planetary encounter,
generates well over 10 thousand magnetic tapes.  Such images are typically
1K pixels on a side, and higher resolutions are planned (10^18 bits for
a future mission isn't uncommon).  The physical manifestation of such a DB
is a warehouse.  Most of it, of course, never gets looked at (remember the
very end of Raiders of the Lost Ark? ;-)  Want to know how the ozone hole
was overlooked?).  You don't want to look at the data yourself; you get a
machine to do it.
In fact most raster systems can't hold complete images; you only view
sections, so you end up always looking at portions.  Now, the numbers
I mention (90 MB, etc.) might all sound small.  These can fit;
that's not the point (okay, roughly 10 images in a GB).  But that's
only bandwidth.  And that's only images.  And it is TOO easy for
computer people to understand.
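The figures above do check out with back-of-the-envelope arithmetic (the tape capacity below is my assumption, chosen only for order-of-magnitude flavor):

```python
# Rough arithmetic behind the numbers above.
frame_mb = 90                # one ~90 MB image frame per frequency band
print(1000 // frame_mb)      # ~11 frames per GB: "10 images in a GB roughly"

future_bits = 10**18         # projected archive for a future mission
tape_mb = 200                # ASSUMED capacity of one magnetic tape
tapes = future_bits / 8 / 1e6 / tape_mb
print(int(tapes))            # hundreds of millions of tapes: a warehouse
```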

Deceptively easy.  Some signal processing involves high-dimensional
FFT (Fast Fourier Transform) work.  You might want to look at the
frequency space, or, for non-linear dynamics, at the phase space.
FFTs have been proposed which would involve a 4-D FFT at 1K points
per dimension.  To get at that, most users just resort to simple linear
sweeps (varying one index).  Just big arrays.  It's not completely
clear how proposals like RAID will help this out (nor images, for that
matter).  Grids and meshes: same thing.  Try to visualize 4-space,
then 5, 6, 7 (the problems exist); we have developed only crude
mechanisms for looking at the simple cases (Tufte).
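The "linear sweep" access pattern can be sketched like so (a toy 16^4 grid standing in for the proposed 1K-per-side array; NumPy is used purely for illustration):

```python
# Sweep one index of a 4-D array: hold three indices fixed, vary one,
# and transform each 1-D line of the data in turn.
import numpy as np

grid = np.random.rand(16, 16, 16, 16)   # stand-in for a huge 4-D data set

# Sweep along the last axis: every (i, j, k) picks out one contiguous line.
swept = np.empty(grid.shape, dtype=complex)
for i in range(grid.shape[0]):
    for j in range(grid.shape[1]):
        for k in range(grid.shape[2]):
            swept[i, j, k, :] = np.fft.fft(grid[i, j, k, :])

# The full 4-D transform repeats such sweeps along every axis in turn;
# np.fft.fftn expresses that directly.
full = np.fft.fftn(grid)
```

At 1K points per dimension the grid alone is 10^12 elements, which is exactly why the I/O layout of those lines on disk matters so much.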

Oh yes, some data compression can help.  It must be done carefully, though.
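One aspect of "carefully": for scientific data you generally want lossless schemes, which guarantee bit-exact recovery. A minimal sketch, using zlib from the Python standard library (the data here is a synthetic stand-in):

```python
# Lossless compression round-trip: the decompressed bytes must be
# bit-for-bit identical to the original.
import zlib

raw = bytes(range(256)) * 1024          # 256 KB stand-in for sensor data
packed = zlib.compress(raw, level=9)

assert zlib.decompress(packed) == raw   # bit-exact recovery
print(len(raw), "->", len(packed))      # highly regular data shrinks a lot
```

Lossy schemes buy much better ratios, but whether the discarded bits matter is a question only the science can answer.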

A different way of organizing the data might be like the airlines'
(SABRE).  There's a CACM paper and a whole conference on this topic.
There are some meetings (IEEE) and others (one in Oregon at this moment)
which cover some of these issues: distributed databases, transaction
processing.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  Domains, the zip codes of networks.