[comp.os.research] Extremely Fast File Systems

craig@BBN.COM (Craig Partridge) (07/26/90)

I'm curious.  Has anyone done research on building extremely
fast file systems, capable of delivering 1 gigabit or more of data
per second from disk into main memory?  I've heard rumors, but no
concrete citations.

I'm interested because I think we'll need such fast file systems as
we build distributed systems over gigabit networks, and I'm somewhat
curious to learn what, if anything, has been done so far in this area.

Craig Partridge
craig@bbn.com

%
% There is the RAID project at Berkeley, and some disk array work going on at
% IBM Almaden.  There is also Michael Scott's work at Rochester. --DL
%

solworth@uicbert.eecs.uic.edu (Jon Solworth) (07/27/90)

Saving a gigabit/second to disk is going to take a lot of disks.
If a disk can write at 6 MB/sec, then some 20 disks are needed just
to accept data at network speed (and this assumes that the disks are
doing essentially pure sequential access).
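
(A back-of-the-envelope sketch of that arithmetic in C; taking a
gigabit as 10^9 bits and the 6 MB/sec figure from above, the count
comes out at 21, i.e. "some 20 disks":)

    /* disks.c -- rough count of sequential disks needed to absorb
       a gigabit/second network stream at the write rate above.
       Compile with:  cc disks.c -lm                              */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double net_mb  = 1e9 / 8.0 / 1e6;  /* 1 Gb/sec = 125 MB/sec */
        double disk_mb = 6.0;              /* sequential write rate */

        printf("disks needed: %.0f\n", ceil(net_mb / disk_mb));  /* 21 */
        return 0;
    }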

Add in any kind of random seeks, and the number of disks can skyrocket.
In addition to Berkeley RAID and Sprite projects (which essentially
use disk arrays as striped disks), disk caching of writes is an
alternative way of getting near sequential access rates.
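
(For concreteness, a minimal sketch of the striping idea in C; this is
my own illustration, not code from the RAID or Sprite projects:)

    /* stripe.c -- map a logical block onto an N-way striped array.
       Sequential logical blocks land on the disks round-robin, so
       all N spindles stream in parallel on a big sequential write. */
    #define NDISKS 20

    void stripe_map(long lblock, int *disk, long *pblock)
    {
        *disk   = (int)(lblock % NDISKS);  /* which spindle */
        *pblock = lblock / NDISKS;         /* block within the spindle */
    }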

(C. Orji and I have a paper in the 1990 Sigmod entitled
"Write-only disk caching")


Jon Solworth
UIC

lm@snafu.Eng.Sun.COM (Larry McVoy) (07/28/90)

In article <5465@darkstar.ucsc.edu> craig@BBN.COM (Craig Partridge) writes:
>
>I'm curious.  Has anyone done research on building extremely
>fast file systems, capable of delivering 1 gigabit or more of data
>per second from disk into main memory?  I've heard rumors, but no
>concrete citations.
>
>I'm interested because I think we'll need such fast file systems as
>we build distributed systems over gigabit networks, and I'm somewhat
>curious to learn what, if anything, has been done so far in this area.

Why not: Drives
---------------

Let's approximate a gigabit by 107 megabytes.
Let's assume that a nice drive rotates at 7200 RPM, and has 64KB / track,
and has the ability to read two heads at once (this is about twice as
good as any commonly available drive such as SCSI, IPI, or XD).

Let's do some math:

	7200 revs / minute = 120 revs / sec = 8.33 milliseconds / rev.

	64KB * 2 heads * 120 revs / sec = 15,360 KB / sec.


This means that the most that you can expect from a drive like this is 15MB
per second.  This assumes that you have a filesystem that can run the drive
at the platter speed (which isn't such a bad assumption - I run SCSI's at
the platter speed with a hacked version of UFS).

These numbers are wildly optimistic.  I worked on supercomputer drives a
couple of years ago and they could do 12MB / sec and drives cost about
$75K each.  I think you'll see cheap 2MB / sec drives in a year or two.
It will be a long time before you see cheap 15MB / sec drives.


Why not: Busses
---------------

A good bus these days runs at about 80 MB / sec flat out.  We can make them
faster, but it gets harder and harder to do so while keeping them any
reasonable size.


Why not: CPU speed
------------------

A Sun 4/490 is a reasonably fast machine.  Moving I/O requires a copy.
The 490 has copy hardware that maxes out at 25 MB / sec in the kernel
and 14 MB / sec in user space.  
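
(If you want to estimate the user-space copy ceiling on your own
machine, here is a crude sketch; timer granularity and cache effects
make it rough at best:)

    /* copybw.c -- crude user-space memory copy bandwidth test */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define MB     (1024 * 1024)
    #define BUFSZ  (4 * MB)
    #define PASSES 64

    static char src[BUFSZ], dst[BUFSZ];

    int main(void)
    {
        clock_t t0;
        double  secs;
        int     i;

        t0 = clock();
        for (i = 0; i < PASSES; i++)
            memcpy(dst, src, BUFSZ);        /* the copy under test */
        secs = (clock() - t0) / (double)CLOCKS_PER_SEC;
        printf("%.1f MB / sec\n", (PASSES * 4.0) / secs);  /* 64 x 4 MB */
        return 0;
    }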

Conclusion:
-----------

I think the answer is (a) yes, we've thought about it, but (b) no, it won't
happen with any conventional hardware soon.  I suspect that you'll need
parallel busses, CPU's, and disks in order to get that kind of I/O.
Furthermore, the I/O requests have to be large (megabytes) in order to
get all those parts working at the same time.  You'll *never* see a
magnetic disk that delivers > 1 MB / sec when timed over 10K-byte requests.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

eugene@wilbur.nas.nasa.gov (Eugene N. Miya) (07/30/90)

As a start, consider the Cray SSD (Solid State Disk).  It gets over 1 GB/sec.

--e. nobuo miya, NASA Ames Research Center, eugene@orville.nas.nasa.gov
  {uunet,mailrus,other gateways}!ames!eugene

craig@BBN.COM (Craig Partridge) (07/31/90)

In article <5512@darkstar.ucsc.edu> lm@snafu.Eng.Sun.COM (Larry McVoy) writes:
>
>Why not: CPU speed
>------------------
>
>A Sun 4/490 is a reasonably fast machine.  Moving I/O requires a copy.
>The 490 has copy hardware that maxes out at 25 MB / sec in the kernel
>and 14 MB / sec in user space.  

Larry:

    I've heard the discussions about busses and disk drives before
but this is the first time someone's said CPUs will be a problem.

    Mostly I've heard the reverse argument -- CPUs are gonna gobble data
as fast as the network and disks can feed it.  For example, several researchers
are muttering about 250 MIP CPUs in the next couple of years -- one person I
know at DEC is talking about a 1 BIP workstation by 1995.

    Those CPUs will have modest memory caches that run at CPU speed --
so close to the CPU you'll have a system chomping on gigabits of data
per second (consider a 32-bit instruction with one 32-bit memory operand
 -- that's 250 MIPS * 64 bits = 16 gigabits/second of data flowing through
the CPU -- and that's clearly low [I haven't factored in where the operands'
contents go]).  So I think CPUs will be capable of moving gigabits around.

Craig

davecb@nexus.yorku.ca (David Collier-Brown) (07/31/90)

In <5465@darkstar.ucsc.edu> Craig Partridge writes:
|  I'm curious.  Has anyone done research on building extremely
|  fast file systems, capable of delivering 1 gigabit or more of data
|  per second from disk into main memory?  I've heard rumors, but no
|  concrete citations.

puder@zeno.informatik.uni-kl.de (Arno Puder) writes:
| Tanenbaum (ast@cs.vu.nl) has developed a distributed system called
| AMOEBA. Along with the OS-kernel there is a "bullet file server".
| (BULLET because it is supposed to be pretty fast).
| 
| Tanenbaum's philosophy is that memory is getting cheaper and cheaper,
| so why not load the complete file into memory? This makes the server
| extremely efficient. Operations like OPEN or CLOSE on files are no
| longer needed (i.e. the complete file is loaded for each update).

  Er, sorta...  You could easily write an interface that did reads
or writes without opens or closes, for some specific subset of uses.
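
  (A minimal sketch of what such an open-less, whole-file read might
look like to the caller; my illustration, not Amoeba's actual Bullet
interface:)

    /* slurp.c -- read an entire file into memory in one shot, in
       the spirit of a whole-file (Bullet-style) interface. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    char *slurp(const char *path, long *len)
    {
        struct stat st;
        FILE *f;
        char *buf = NULL;

        if (stat(path, &st) < 0 || (f = fopen(path, "r")) == NULL)
            return NULL;
        if ((buf = malloc(st.st_size)) != NULL)
            *len = (long)fread(buf, 1, st.st_size, f);
        fclose(f);
        return buf;                 /* caller frees */
    }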

| 
| The benchmarks are quite impressive although I doubt that this
| concept is useful (especially when thinking about transaction
| systems in databases).

  Well, I have something of the opposite view: a system like Bullet makes a
very good substrate for a database system.  

  The applicable evidence is in the article "Performance of an OLTP
Application on Symmetry Multiprocessor System", in the 17th Annual
International Symposium on Computer Architecture, ACM SIGARCH Vol. 18,
Number 2, June 1990. (see, a reference (:-))
  The article uses an all-in-memory database in the TP1 benchmark as a
limiting case while investigating the OS and architectural support that
is necessary for good Transaction Processing speeds, and the speeds
are up in the range that Craig may find interesting...

  My speculation is that a bullet-like file system with a relation-
allocating layer (call it the Uzi filesystem? the speedloader filesystem??)
on top would make a very good platform for a relational database.  Certainly
the behavior patterns of an in-memory, load-whole-relation database would be
easy to reason about, and interesting to investigate.

| You can download Tanenbaum's original paper (along with a "complete"
| description about AMOEBA) via anonymous ftp from midgard.ucsc.edu
| in ftp/pub/amoeba.


--dave
-- 
David Collier-Brown,  | davecb@Nexus.YorkU.CA, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | "And the next 8 man-months came up like
CANADA. 416-223-8968  |   thunder across the bay" --david kipling

aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/31/90)

>>Why not: CPU speed
>>------------------
>>
>>A Sun 4/490 is a reasonably fast machine.  Moving I/O requires a copy.
>>The 490 has copy hardware that maxes out at 25 MB / sec in the kernel
>>and 14 MB / sec in user space.  
>
>Larry:
>
>    I've heard the discussions about busses and disk drives before
>but this is the first time someone's said CPUs will be a problem.

Actually, it's not CPU speed, but the CPU-memory interface that's the
problem.
    The CPU-memory interface is increasing in speed, but nowhere near
as fast as CPUs.  Caches do not help large copies from I/O to user, if
I/O is uncached (even if cached it can be a problem, with a single
data port on the cache).  Burst protocols and wider busses seem to be
the favoured solutions.
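
(As a toy illustration of why width and bursting are the levers; all
three numbers below are assumptions, not any particular machine:)

    /* busbw.c -- rough bus bandwidth = width * clock * burst efficiency */
    #include <stdio.h>

    int main(void)
    {
        double width_bytes = 8;     /* 64-bit data path (assumed) */
        double clock_mhz   = 25;    /* bus clock (assumed) */
        double burst_eff   = 0.5;   /* arbitration/turnaround loss (assumed) */

        printf("%.0f MB / sec\n", width_bytes * clock_mhz * burst_eff);
        return 0;
    }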

--
Andy Glew, andy-glew@uiuc.edu

Propaganda:
    
    UIUC runs the "ph" nameserver in conjunction with email. You can
    reach me at many reasonable combinations of my name and nicknames,
    including:

    	andrew-forsyth-glew@uiuc.edu
    	andy-glew@uiuc.edu
    	sticky-glue@uiuc.edu

    and a few others. "ph" is a very nice thing which more USEnet
    sites should use.  UIUC has ph wired into email and whois (-h
    garcon.cso.uiuc.edu).  The nameserver and full documentation are
    available for anonymous ftp from uxc.cso.uiuc.edu, in the net/qi
    subdirectory.

narten@cs.albany.edu (Thomas Narten) (07/31/90)

In article <5555@darkstar.ucsc.edu> craig@BBN.COM (Craig Partridge) writes:
       I've heard the discussions about busses and disk drives before
   but this is the first time someone's said CPUs will be a problem.

Take a look at John Ousterhout's paper "Why Aren't Operating Systems
Getting Faster As Fast as Hardware?" in the June USENIX proceedings.
He reports on a number of benchmarks, and one of his conclusions is
that memory bandwidth is not keeping up with processor speed in RISC
machines.  
--
Thomas Narten
narten@cs.albany.edu

craig@BBN.COM (Craig Partridge) (08/01/90)

In article <5582@darkstar.ucsc.edu> narten@cs.albany.edu (Thomas Narten) writes:
>
>In article <5555@darkstar.ucsc.edu> craig@BBN.COM (Craig Partridge) writes:
>       I've heard the discussions about busses and disk drives before
>   but this is the first time someone's said CPUs will be a problem.
>
>Take a look at John Ousterhout's paper "Why Aren't Operating Systems
>Getting Faster As Fast as Hardware?" in the June USENIX proceedings.
>He reports on a number of benchmarks, and one of his conclusions is
>that memory bandwidth is not keeping up with processor speed in RISC
>machines.

I've read Ousterhout's paper and don't disagree with it (as far as
I recall).  My sense, however, is that even though memory is getting
faster more slowly than CPUs are, when we look at gigabit computing,
CPU and memory speeds, while a nuisance, aren't gonna be problems in
the same league as busses or disks.

Craig

lm@snafu.Eng.Sun.COM (Larry McVoy) (08/01/90)

In article <5555@darkstar.ucsc.edu> craig@BBN.COM (Craig Partridge) writes:
>
>    I've heard the discussions about busses and disk drives before
>but this is the first time someone's said CPUs will be a problem.
>
>    Mostly I've heard the reverse argument -- CPUs are gonna gobble data
>as fast as the network and disks can feed it.  For example, several researchers
>are muttering about 250 MIP CPUs in the next couple of years -- one person I
>know at DEC is talking about a 1 BIP workstation by 1995.
>
>    Those CPUs will have modest memory caches that run at CPU speed --
>so close to the CPU you'll have a system chomping on gigabits of data
>per second (consider a 32-bit instruction with one 32-bit memory operand
> -- thats 250 MIPS * 64 bits = 16 gigabits/second of data flowing through
>the CPU -- and that's clearly low [I haven't factored where the operands
>contents go]).  So I think CPUs will be capable of moving gigabits around.

You have to feed the cache.  The question was "are there or will there be
file systems capable of gigabit transfer rates?" (paraphrased).  When you
are talking about I/O rates, you can forget the CPU cache - first
of all, the data won't be there to start, and second of all, it doesn't
get reused; it has to be fetched from memory, network, disk, wherever.  

You have to look at the whole path:

    disk			network
    disk controller		network controller
    bus				bus
    memory			memory
    cpu				cpu

		and back.

That path isn't likely to do gigabit any time soon.   Sure, you can make
CPU's that do it, and you can make memory that can do it, and you can
make a bus that can do it, etc.  But you have to have all of them together.
It's very similar to buying a stereo system.  You don't go out and buy
nine zillion dollars of equipment and use extension cords as speaker wire.

This is the classic application of Amdahl's law to performance.  Whenever
you fix one bottleneck, the system improves a little and then hits a 
different one.
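
(A toy version of the whole-path point; the rates are the rough
figures from earlier in the thread, not measurements:)

    /* path.c -- the end-to-end rate is the minimum over the stages,
       so fixing one bottleneck just exposes the next one (Amdahl). */
    #include <stdio.h>

    int main(void)
    {
        static struct { char *name; double mb; } stage[] = {
            { "disk",        15.0 },   /* optimistic drive, above   */
            { "bus",         80.0 },   /* flat out                  */
            { "kernel copy", 25.0 },   /* 4/490 copy hardware       */
            { "user copy",   14.0 },
        };
        int i, slow = 0, n = sizeof stage / sizeof stage[0];

        for (i = 1; i < n; i++)
            if (stage[i].mb < stage[slow].mb)
                slow = i;
        printf("whole path: %.0f MB / sec, limited by the %s "
               "(a gigabit wants ~107)\n", stage[slow].mb, stage[slow].name);
        return 0;
    }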
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

fouts@unix.sri.com (Martin Fouts) (08/10/90)

In article <5512@darkstar.ucsc.edu> lm@snafu.Eng.Sun.COM (Larry McVoy) writes:

   In article <5465@darkstar.ucsc.edu> craig@BBN.COM (Craig Partridge) writes:
   >
   >I'm curious.  Has anyone done research on building extremely
   >fast file systems, capable of delivering 1 gigabit or more of data
   >per second from disk into main memory?  I've heard rumors, but no
   >concrete citations.
   >
   >I'm interested because I think we'll need such fast file systems as
   >we build distributed systems over gigabit networks, and I'm somewhat
   >curious to learn what, if anything, has been done so far in this area.

   [Good technical shootdown removed.  Summary of removed material:
    It's too expensive/difficult to do with existing technology ]

As part of the contract for the first Cray 2 at NASA Ames, CRI was
required to satisfy a 10 Mbyte/sec per-drive demonstration of a simple
file transfer which copied the entire contents of a source drive to a
destination drive.  They were required to demonstrate 20 simultaneous
transfers (using 40 drives).  Doing the math, 20x10x8
(instances*rate*bits/byte) = 1.6 Gigabits/second.  They ran that demo
for me at Ames five years ago.  The programs were written in C and
made ordinary read/write calls on files opened in ordinary ways.  (I've
run the identical source code on a huge number of Unix file systems.)
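
(The shape of such a copy program is about as plain as Unix code
gets; a sketch, not the actual CRI demo source:)

    /* diskcopy.c -- copy one device to another with large sequential
       reads and writes; the demo ran 20 of these simultaneously. */
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK (4 * 1024 * 1024)  /* big requests keep the drive streaming */
    static char buf[CHUNK];

    int main(int argc, char **argv)
    {
        int  in, out;
        long n;

        if (argc != 3 ||
            (in = open(argv[1], O_RDONLY)) < 0 ||
            (out = open(argv[2], O_WRONLY)) < 0)
            return 1;
        while ((n = read(in, buf, CHUNK)) > 0)
            write(out, buf, n);
        return 0;
    }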

Tim Hoel et al. at CRI had designed a fast file system for the Cray
2, which is in production use, and was used for this test.  With a
machine like the 2, the file system required very little cleverness to
pass this test.  Had the test been rewritten to use striping, it could
have been accomplished with a single transfer on a single file system.

BTW, that was a >$20M Cray 2 using fast expensive disk drives.  It was
also running a compute bound workload while running the copy test.

We aren't going to see a Gb/s file system on a PC clone in the near
future, but there are a number of mainframes capable of sufficient
aggregate disk performance now.

We've again reached the point where high performance I/O is a bigger
bottleneck than CPU horsepower.
--
Martin Fouts

 UUCP:  ...!pyramid!garth!fouts (or) uunet!ingr!apd!fouts
 ARPA:  apd!fouts@ingr.com
PHONE:  (415) 852-2310            FAX:  (415) 856-9224
 MAIL:  2400 Geng Road, Palo Alto, CA, 94303

Moving to Montana;  Goin' to be a Dental Floss Tycoon.
  -  Frank Zappa