[comp.parallel] Very Large Data Sets

dfk@romeo.cs.duke.edu (David F. Kotz) (03/28/89)

This is a summary of the responses I received regarding my posting
about very large data sets.

>A number of you have or envision applications for computers that
>involve large amounts of data. Perhaps these are databases, matrices,
>images, VLSI circuit designs, etc.  My research involves filesystems
>for parallel computers, and I am unsure what kinds and arrangements of
>data might be used by parallel computers. 
>
>I am interested in the arrangement of the data on disk, and how the
>application reads it into memory. Does it read a lot or write a lot,
>or both? Does it read and write the same file?  Does it read/write it
>all at once, or throughout the run of the application?  Does it
>read/write sequentially or randomly? How might you modify the
>application for a parallel computer, and specifically, for parallel
>I/O? 

This is a heavily edited summary of the comments I received.

Thanks for all of your help!! (and I welcome more information!)

David Kotz
Department of Computer Science, Duke University, Durham, NC 27706 USA
ARPA:	dfk@cs.duke.edu
CSNET:	dfk@duke        
UUCP:	decvax!duke!dfk

==============================================

From: ephraim@Think.COM  (Ephraim Vishniac)
Organization: Thinking Machines Corporation, Cambridge MA, USA

I don't know if this is the kind of application you have in mind, but
my office-mates and I are the implementers of a large-scale
document-retrieval system for the Connection Machine.  This system is
now in commercial use at Dow Jones News Retrieval, as the back end of
their "DowQuest" service.  When the system is full, the database
consists of roughly a gigabyte of text.  If you'd like to know more,
write.  You can either write to me (ephraim@think.com) or to the group
(newsroom@think.com).

There may be restrictions on what we can disclose.  I'll have to check
on this.

----

From: chn%a@LANL.GOV (Charles Neil)
I have become interested in the subject you're discussing.  In our
applications, we frequently spend 40-60% of the time accessing equation-of-
state data.  A colleague and I have begun to investigate the possibility of
spooling such calculations to the Lab's newly acquired Connection Machine.
However, the databases we access can hold a large amount of information, more,
we fear, than the 8 Kbytes of memory each of the 64K processors gets.  So
we're wondering how to section the data efficiently.  This effort on our
part is speculative at this point, and such questions may be old hat for
the experts, but I would be happy to discuss it further.
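
As a purely illustrative sketch, here is one way such a table could be
sectioned, assuming a simple block decomposition with a small interpolation
halo; the per-entry size, halo width, and function name are my assumptions,
not details of the actual codes:

    # Hypothetical sketch: block-decompose a large equation-of-state (EOS)
    # table across SIMD processors with small local memories, so that each
    # processor holds only the rows it needs plus a halo for interpolation.
    # Sizes and names are illustrative, not taken from any real code.

    BYTES_PER_PROC = 8 * 1024      # 8 Kbytes of local memory per processor
    NUM_PROCS      = 64 * 1024     # 64K processors
    ROW_BYTES      = 32            # assumed size of one EOS table entry

    def partition(num_rows, num_procs=NUM_PROCS, halo=1):
        """Give each processor a contiguous block of table rows plus a small
        halo of neighbouring rows, so local interpolation needs no remote
        fetch."""
        rows_per_proc = (num_rows + num_procs - 1) // num_procs  # ceiling
        if (rows_per_proc + 2 * halo) * ROW_BYTES > BYTES_PER_PROC:
            raise ValueError("block plus halo exceeds local memory; "
                             "the table must be sectioned further")
        blocks = []
        for p in range(num_procs):
            lo = p * rows_per_proc
            if lo >= num_rows:             # more processors than blocks
                blocks.append((0, 0))      # empty assignment
                continue
            hi = min(lo + rows_per_proc, num_rows)
            blocks.append((max(0, lo - halo), min(num_rows, hi + halo)))
        return blocks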

----

From: MOBY <gvw%etive.edinburgh.ac.uk@NSS.Cs.Ucl.AC.UK> (Greg Wilson)

Hi there.  For the last few years a group in Molecular Biology
at the University of Edinburgh have been doing protein sequencing
using a SIMD machine called the DAP (distributed array processor),
originally built by ICL and now being built/marketed by a spinoff
company called AMT.  They have *very* large data sets, installed
once on the disk and then read repeatedly as unidentified sequences
are matched.  If you would like more information, you should contact
Dr. John Collins (J.Collins@uk.ac.ed, plus gumph to get across
the Atlantic).

----

From: S. Sivaramakrishnan <sivaram%hpda@hp-sde.sde.hp.com>

Most seismic data processing algorithms involve extremely large amounts of
data - several Gbytes.  There is a lot of computation involved.  However,
most of the big machines page-fault so heavily bringing large data sets in
and out that they end up being totally I/O bound.  The data layout and the
access pattern depend on the actual algorithm used.  Most seismic
algorithms are pretty well structured, and their regular, frequently
sequential, data access patterns can be used to advantage.

The new "VISC" -Visualization in Scientific Computing - needs are far more
I/O bound as it involves real-time I/O.  Further the access patterns are more
irregular, though a little-localized.  Here again the data sets are in Gbytes.

Some fault-tolerant applications - like the fault-testing algorithms may also
involve large data sets.


There are two seismic data processing algorithms, both "Migration"
algorithms, that I am fairly familiar with.  It seems to me that both of
them offer exactly the kind of spatial and temporal parallelism in the
transfer of large-scale data sets, depending of course on the kind of
machine you are actually working with.

There are different kinds of data sets, all fairly large, and they can be
varied according to need.  The algorithms are regular, and so are the data
access patterns.  An efficient data layout can improve the sequentiality of
access.  As some data sets are independent of others, they may be fetched in
parallel.  Also, the regularity of the computation allows the data fetch to
be pipelined with the computation, and pre-fetching data sets dramatically
improves performance in terms of I/O bandwidth.  There are synchronization
points, and differences in the processing times of the pipeline stages,
which encourage such tactics.  Some data sets would be read-only and others
read/write; some actions need to be semaphored.  Finally, memory constraints
would force efficient computation-to-I/O ratios.  All of the above seem to
be exactly what you are looking for.  There is, however, no partial file
access - though in a slightly different algorithm that may be possible - and
there is no truly random data access.  As I mentioned earlier, the system
(a multiprocessor, I would guess) you are running on would greatly affect
your choices.
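
As a small illustrative sketch of the fetch/compute pipelining described
above, assuming simple double buffering (read_block and migrate_block are
hypothetical stand-ins for the real seismic routines):

    # Hypothetical sketch of prefetch pipelining: while one block of traces
    # is being processed, the next block is read in the background.
    # read_block() and migrate_block() stand in for the real seismic code.

    import threading

    def process_dataset(num_blocks, read_block, migrate_block):
        buffer = {}

        def fetch(i):
            buffer["next"] = read_block(i)     # sequential read of block i

        fetch(0)                               # prime the pipeline
        for i in range(num_blocks):
            current = buffer["next"]
            t = None
            if i + 1 < num_blocks:
                t = threading.Thread(target=fetch, args=(i + 1,))
                t.start()                      # overlap next read with compute
            migrate_block(current)             # compute on the block in memory
            if t is not None:
                t.join()                       # synchronization point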

VISC - This is less predictable than the seismic computation.  However, there
is potential for pre-fetching, pipelining, and parallel data access, including,
to an extent, random access.  The sizes of data accesses may vary widely.

You may find detailed discussions on the above in a few technical reports
at The University of Texas at Austin.

You may wish to send email to Dr. J. C. Browne, who used to be my supervising
professor; you can use my name.  You may wish to outline your research
interests and ask him for a copy of the reports on "Evaluation of Logical
External Memory Architectures" and on the "Parallel Formulation of I/O
Intensive Computation Structures" - both submitted to IBM for a research
project.  The first report details the seismic algorithm, while the latter
is more detailed about VISC.  My Master's thesis has the same title as
the former report and specifically addresses the modeling of seismic
code on a particular machine.

Dr JC Browne's email: browne@cs.utexas.edu

I can give you further details if you wish.

Sivaram.
Ph: (408) 447-4204 (off)
    (408) 554-0584 (res)

----

From: Patrick Odent <prlb2!imec!odent@uunet.UU.NET>

Circuit simulation is one of the VLSI design applications developed
on a parallel computer system, a SEQUENT BALANCE 8000, at our institution.
As you are interested in the form of disk I/O in such an application, the
following flowchart summarizes the file access in the system:

     read circuit description from file
    
     read simulation commands from file

     simulate circuit 

     print simulation results in a file

This program needs three different files: two input files and one
output file. These files are simple sequential ASCII files.
The files can be rather large: 1 Mbyte for the circuit
description and up to 10 Mbytes for the simulation results.

The large output file is also used by a graphical postprocessor. 
Writing and reading this file can take a large amount of time,
reducing the overall system performance.

We observed a more serious problem concerning disk I/O. If the parallel program
is used for the simulation of a big circuit with, for example, 10 processors,
the performance is very bad. The reason for this degradation is "page
faulting": the memory needed for the circuit description and intermediate
results is so large that paging to disk becomes necessary. When 10
processors need a new page at the same time, the system cannot serve all
the requests at once. This makes the parallel execution very inefficient.
I think a parallel disk I/O system would be very useful for solving this
problem. Do you have another solution?
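
One direction, sketched purely for illustration (the file name, worker
count, and use of Unix-style pread are my assumptions, not a description of
the Balance 8000 setup), is to have each process read only its own disjoint
slice of the data with explicit large reads, rather than faulting pages of a
shared image on demand:

    # Illustrative sketch only: each worker opens the (hypothetical) circuit
    # file itself and reads just its own byte range with os.pread, so the
    # workers issue a few large, disjoint reads instead of competing
    # demand-paging requests.

    import os
    from multiprocessing import Pool

    CIRCUIT_FILE = "circuit.dat"   # hypothetical file name
    NUM_WORKERS  = 10

    def read_partition(worker_id):
        size   = os.path.getsize(CIRCUIT_FILE)
        chunk  = (size + NUM_WORKERS - 1) // NUM_WORKERS
        offset = worker_id * chunk
        length = max(0, min(chunk, size - offset))
        fd = os.open(CIRCUIT_FILE, os.O_RDONLY)
        try:
            data = os.pread(fd, length, offset)   # explicit, disjoint read
        finally:
            os.close(fd)
        return worker_id, len(data)

    if __name__ == "__main__":
        with Pool(NUM_WORKERS) as pool:
            for wid, nbytes in pool.map(read_partition, range(NUM_WORKERS)):
                print("worker %d read %d bytes" % (wid, nbytes))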

Patrick Odent
IMEC vzw
Kapeldreef 75
3030 Leuven
Belgium

----


From: hearn@tcgould.TN.CORNELL.EDU (hearn)

One of the biggest users of large data sets is the oil industry,
which collects reflection seismic data. It is also the number one
supercomputer user for that reason. Even a simple filter becomes very
time consuming with large data volumes.
	I work in academic deep seismic reflection profiling with
the COCORP project at Cornell. A typical 2d data set for us
is half a gigabyte. Industry uses 3d data sets of substantially
larger volumes.
	We use the Cornell Supercomputer Facility (two IBM 3090's),
including some of their parallel processing capabilities.
Our job times are typically at least 50% I/O. Even with a potential
12-fold parallel speedup, the I/O bottleneck restricts us to half
that speedup.
Unfortunately, neither the parallel Fortran nor the CMS operating
system allows parallel or asynchronous I/O, so I don't know much about
that except that we could sure use it. The fact is that for large
data sets, high-speed, large disk drives (no more tape-to-tape processing,
please) are the first step in speeding life up. Another major feature
we take advantage of is the 999-Mbyte extended addressing (that's
IBMese for virtual memory) with 512 Mbytes of real memory. Such
space eliminates the need for temporary files and even changes the
way in which we can process the data.
	Hope you find this useful.

(The use of 12 CPUs is the maximum speedup available here.)
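
As a rough illustration of how an I/O fraction like this caps parallel
speedup, here is a small Amdahl-style calculation; the 50% figure and the
12-way compute speedup come from the message above, while the assumption
that the I/O is serial and unoverlapped is mine (the real figures depend on
how much I/O the system manages to overlap with computation):

    # Amdahl-style sketch: if a fraction f of the run is I/O that does not
    # speed up, and the remaining compute speeds up by a factor p, then
    #     overall speedup = 1 / (f + (1 - f) / p)
    # The serial-I/O assumption is illustrative, not a measurement.

    def overall_speedup(io_fraction, compute_speedup):
        return 1.0 / (io_fraction + (1.0 - io_fraction) / compute_speedup)

    if __name__ == "__main__":
        p = 12                           # 12-way parallel compute
        for f in (0.50, 0.25, 0.10):     # candidate unoverlapped I/O fractions
            print("I/O fraction %.2f: overall speedup %.1fx"
                  % (f, overall_speedup(f, p)))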

	     Tom Hearn
	     hearn%bat.tn.cornell.edu
	     hearn@geology.tn.cornell.edu
	     erdj@cornellf.bitnet

----