[comp.arch] Unix File System Performance

martin@felix.UUCP (Martin McKendry) (09/10/87)


Given that there are many different disk formats and access
algorithms possible for a Unix file system, how do you decide
what improvements are best for a given system?  And having decided
how to invest your engineering dollars, how do you quantify
the improvements you have made?  Are there any standard tests
or benchmarks that can be run?  For example, what is the answer
to the question: "How much better is the 4.2 file system than
4.1?".  Note that "lots" is kind of a weak response.

Are there filesystem Dhrystones & Whetstones?  Of course, a single
application cannot give the whole picture -- you need to know how things
compare with multiple concurrent users beating on unrelated files.  None
of this "we can make our Unix faster than you can make yours".

It's also interesting to consider how you would compare the performance
of two file systems without factoring in CPU speeds, even though the
machines being compared use different CPUs at different speeds.
Maybe you would want to factor out channel speeds also.  Even if
you view the whole machine as indivisible, it's non-trivial.  A 
very high proportion of programs today are I/O bound -- a proportion 
that will increase as we get faster processors.  It seems to me that 
filesystem performance is the next big area for competition.  After 
all, that's what makes a mainframe a mainframe, right?

Comments?

--
Martin S. McKendry;    FileNet Corp;	{hplabs,trwrb}!felix!martin
Strictly my opinion; all of it

hammond@faline.UUCP (09/12/87)

In article <> martin@felix.UUCP (Martin McKendry) writes:
>
> ...  A 
>very high proportion of programs today are I/O bound -- a proportion 
>that will increase as we get faster processors.  It seems to me that 
>filesystem performance is the next big area for competition.  After 
>all, that's what makes a mainframe a mainframe, right?
>
>Comments?

About 98% of the programs run on our systems use <2 secs of 780 CPU time,
nor do they use very much I/O.  There are only a few I/O or CPU hogs.
That's based on ~10 million process records using modified 4.2 BSD
accounting.  Where do you get the idea that a high proportion are I/O bound?

On machines with 32+ MB of memory, I'm willing to bet that a large
proportion of all accesses are satisfied from the in-core buffers,
i.e. my edit-compile-run-edit cycles probably all run out of the
in-core buffers once I've completed a cycle.  If the system were smart
enough to use all of memory as disk buffer rather than 10% of it, I'm
certain that my stuff would just stay in core.

I'll agree that file system performance could be improved, but I'm
inclined to believe that improving the use of main memory as buffers
would be a bigger general win than any changes to the disk layout.
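
(A quick way to see how much the buffer cache already buys you: read the
same file twice and compare the timings.  A minimal sketch -- the second
pass only stays fast if the file fits in the cache, and other activity on
the machine will perturb the numbers.)

/* cachetest.c -- read a file twice and compare elapsed times.
 * If the file fits, the second pass is served largely from the in-core
 * buffer cache; a rough illustration, not a rigorous measurement.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double readall(char *path)
{
    char buf[8192];
    struct timeval a, b;
    int fd, n;

    fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); exit(1); }
    gettimeofday(&a, NULL);
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;                                   /* discard the data */
    gettimeofday(&b, NULL);
    close(fd);
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    printf("first pass:  %.2f s\n", readall(argv[1]));   /* mostly disk  */
    printf("second pass: %.2f s\n", readall(argv[1]));   /* mostly cache */
    return 0;
}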

Does anybody have records for their general use systems that prove
that the systems are I/O bound?  I want at least a continuous month's
worth of records, no one or two day or "peak" samples.

Rich Hammond	Bell Communications Research	hammond@bellcore.com

preece@ccvaxa.UUCP (09/13/87)

  hammond@faline.bellcore.com:
> About 98% of the programs run on our systems use <2 secs of 780 CPU
> time, nor do they use very much I/O.  There are only a few I/O or CPU
> hogs.  That's based on ~10 million process records using modified 4.2
> BSD accounting.  Where do you get the idea that a high proportion are
> I/O bound?

> Does anybody have records for their general use systems that prove that
> the systems are I/O bound?  I want at least a continuous month's worth
> of records, no one or two day or "peak" samples.
----------
So what's "general use" and who cares about it anyway?  Every machine
has its own workload profile; a filesystem optimized for one is very
likely to be inadequate, or at least sub-optimal, for many others.
While the "traditional Unix" load may indeed tend towards short
execution times and light filesystem activity, but a great many Unix
machines today are sold to entirely different kinds of users.

A user with a large office automation system, heavily dependent on an
underlying database, is probably going to want a wildly different
filesystem implementation (probably different enough that it will have
to be custom built and avoid Unix I/O entirely because the Unix I/O
strategy is inadequate for its needs).  A real-time system that needs
to generate a megabyte per second of status and checkpoint information
is not going to want to run it through the buffer cache.

There are a lot of users who want to KNOW when their i/o has actually
finished (not just been accepted by the system), so they know when the
filesystem is safe.  There are users who know that they need to generate
lots and lots of tiny files and there are users who know that they need
to generate one or two huge files.
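
(For what it's worth, the 4.2BSD kernel does at least give those users
fsync(2): write() only hands the data to the buffer cache, but fsync() does
not return until the file's blocks are on the disk.  A minimal sketch -- the
file name and record are made up for illustration.)

/* syncwrite.c -- write a record and wait until it is actually on disk. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char *rec = "checkpoint record\n";
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0) { perror("journal.dat"); return 1; }
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }  /* now it is really out there */
    close(fd);
    return 0;
}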

The buffer cache and the BSD fast filesystem are very nice for some
kinds of workloads (including the general software development load that
probably characterizes the bulk of the readers of this notesfile), but
to believe that they are all that is necessary is to grossly restrict the
market for your systems.

-- 
scott preece
gould/csd - urbana
uucp:	ihnp4!uiucdcs!ccvaxa!preece
arpa:	preece@Gould.com

ron@topaz.rutgers.edu (Ron Natalie) (09/14/87)

If you have a large proportion of short-lived programs, wouldn't
you say that you spend a lot of time just bringing the program in
from disk?

-Ron

martin@felix.UUCP (Martin McKendry) (09/15/87)

In article <1384@faline.bellcore.com> hammond@faline.UUCP (Rich A. Hammond) writes:
>In article I wrote:
>>
>> ...  A 
>>very high proportion of programs today are I/O bound -- a proportion 
>>that will increase as we get faster processors.  It seems to me that 
>>filesystem performance is the next big area for competition.  After 
>>all, that's what makes a mainframe a mainframe, right?
>>
>>Comments?
>
>About 98% of the programs run on our systems use <2 secs of 780 CPU time,
>nor do they use very much I/O.  There are only a few I/O or CPU hogs.
>That's based on ~10 million process records using modified 4.2 BSD
>accounting.  Where do you get the idea that a high proportion are I/O bound?
>
From extensive workload analysis.  Try putting in a make.  Look at
your CPU utilization.  If it's not 100%, you are waiting on I/O when you
could be processing.  Depending on how much you like to wait on I/O,
you are I/O bound.
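
(If you want a number instead of eyeballing a load meter, time the make and
compare its CPU seconds against the wall-clock seconds -- essentially what
time(1) reports.  A rough sketch; it assumes the machine is otherwise idle,
and the command run is just whatever you pass it.)

/* iowait.c -- run a command (e.g. "iowait make") and report what fraction
 * of the elapsed time was spent on the CPU (user + system, command and its
 * children).  On an otherwise idle machine the rest is, to a first
 * approximation, time spent waiting -- mostly on I/O for a make.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    struct timeval start, end;
    struct rusage ru;
    double wall, cpu;
    int status;

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    gettimeofday(&start, NULL);
    if (fork() == 0) {
        execvp(argv[1], &argv[1]);
        perror(argv[1]);
        _exit(127);
    }
    wait(&status);
    gettimeofday(&end, NULL);
    getrusage(RUSAGE_CHILDREN, &ru);

    wall = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    cpu  = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

    printf("wall %.2f s, cpu %.2f s, utilization %.0f%%\n",
           wall, cpu, wall > 0 ? 100.0 * cpu / wall : 0.0);
    return 0;
}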

To look at a single 780 is hardly representative of the world.  Most
of the world's data processing is production commercial data processing.
We do image processing.  Don't assume that your load is everyone's.  

In a previous life, I worked with extensive analyses of commercial
customer workloads taken from real customer sites.  Based on simulation
results and real benchmarks, we found that you could make changes
by large factors (2-5) in either direction in CPU performance
without seeing anything like the same change in throughput (total
time to run benchmark).  Like a factor of 4 or 5 faster in CPU
for only a factor of 2 change in throughput.   Idle time on the faster
CPU goes up as expected.   This on batch processing
with no terminal I/O.  If that's not I/O bound, I don't know what is. 
Since CPU speed/$ is improving at a faster rate than the corresponding
figure for disk, I'd expect the class of problems for which this
occurs to increase.

>On machines with 32+ MB of memory, I'm willing to bet that a large
>proportion of all accesses are satisfied from the in-core buffers,
>i.e. my edit-compile-run-edit cycles probably all run out of the
>in-core buffers once I've completed a cycle.  If the system were smart
>enough to use all of memory as disk buffer rather than 10% of it, I'm
>certain that my stuff would just stay in core.
>
What if I want to support 400 users from one server, each of whom
wants 50Kb of data every 15 seconds.  Or if I have to process/merge two
or three 60Mb data files?   What if I don't want to ship 32 M on all
machines?
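
(To put rough numbers on that first case -- a back-of-the-envelope sketch;
the 30 ms average seek-plus-rotation figure is an assumption of mine, not a
measurement.  The bandwidth is modest, but the positioning overhead alone
comes close to saturating a single spindle.)

/* envelope.c -- back-of-the-envelope demand for the 400-user case above.
 * The request pattern is the one stated in the article; the 30 ms average
 * seek + rotation per request is an assumed figure.
 */
#include <stdio.h>

int main(void)
{
    double users = 400.0;
    double bytes_per_req = 50.0 * 1024.0;   /* 50 Kb per user per request */
    double period = 15.0;                   /* seconds between requests   */
    double overhead = 0.030;                /* assumed seek + rotation, s */
    double req_per_sec = users / period;                 /* ~26.7 req/s */
    double bytes_per_sec = req_per_sec * bytes_per_req;  /* ~1.3 MB/s   */
    double seek_load = req_per_sec * overhead;           /* disk busy fraction */

    printf("%.1f requests/s, %.2f MB/s transferred\n",
           req_per_sec, bytes_per_sec / (1024.0 * 1024.0));
    printf("%.2f seconds of positioning per second of real time\n",
           seek_load);                      /* near 1.0: one disk can't keep up */
    return 0;
}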

>I'll agree that file system performance could be improved, but I'm
>inclined to believe that improving the use of main memory as buffers
>would be a bigger general win than any changes to the disk layout.
>
What if I am planning to do both, and the incremental costs are worth it?

>Does anybody have records for their general use systems that prove
				     ***********
By whose definition?

>that the systems are I/O bound?  I want at least a continuous month's
>worth of records, no one or two day or "peak" samples.
					*****
Why not?  Often it's the peaks I want to handle.  I can already handle the
regular loads.

I don't care for your tone.  I don't think my posting warranted it.

>Rich Hammond	Bell Communications Research	hammond@bellcore.com
--
Martin S. McKendry;    FileNet Corp;	{hplabs,trwrb}!felix!martin
Strictly my opinion; all of it

aeusesef@csun.UUCP (09/16/87)

In article <7327@felix.UUCP> martin@felix.UUCP (Martin McKendry) writes:
>In article <1384@faline.bellcore.com> hammond@faline.UUCP (Rich A. Hammond) writes:
>>In article I wrote:
>>> ...  A 
>>>very high proportion of programs today are I/O bound -- a proportion 
>>>that will increase as we get faster processors.  It seems to me that 
>>>filesystem performance is the next big area for competition.  After 
>>>all, that's what makes a mainframe a mainframe, right?
>>
>>About 98% of the programs run on our systems use <2 secs of 780 CPU time,
>>nor do they use very much I/O.  There are only a few I/O or CPU hogs.
>>Where do you get the idea that a high proportion are I/O bound?
>>
>(some stuff about programs and cpu load)
One of the I/O hogs happens to be the operating system itself (or do you
only support 1 user and 1 task?).  The example I'm always being given (and
have started to give myself 8-)) is a Cyber.  A Cyber 170/760 is a FAST
machine (8-10 M{I,FLO}PS [yeah, they're pretty much the same]), which also
happens to have *VERY* fast I/O (on the order of megawords a second, I
forget just how many).  (It does this by having a separate I/O processor,
which handles all i/o:  the cpu just tells it what it wants.)  There is also
a Cyber 180/830, another fast machine.  It, however, gets only 1
M{I,FLO}PS (they added microcode 8-().  Therefore, it gets about the same
instruction speed as a VAX (more or less, I won't quibble too much).
However, it can support, nicely, about 30 or 40 users (moderately nicely; it
starts to slow down at about 10 to 15), whereas a VAX (this is a 780
equivalent machine, more or less) dies at that many.  Reason?  Jobs rolling
in and out of memory use up a lot of i/o bandwidth.

>Based on simulation
>results and real benchmarks, we found that you could make changes
>by large factors (2-5) in either direction in CPU performance
>without seeing anything like the same change in throughput (total
>time to run benchmark).  Like a factor of 4 or 5 faster in CPU
>for only a factor of 2 change in throughput.   Idle time on the faster
>CPU goes up as expected.   This on batch processing
>with no terminal I/O.  If that's not I/O bound, I don't know what is. 
>>(some stuff about large memories) (32M large?  I use 96 myself...)
>>If the system were smart
>>enough to use all of memory as disk buffer rather than 10% of it, I'm
>>certain that my stuff would just stay in core.
Unfortunately, there are two problems:
 1)  Processes tend to use this memory themselves, for code/data.  Sure, you
can swap (demand-paged vm), but that doesn't seem much more efficient than
just using the memory to hold the code.
 2)  That data has to be written to disk SOMETIME.  I think I would rather
put up with slow I/O than have to worry about the machine corrupting my
data.  This can be cured if you have large memory AND a separate I/O
processor (buffer, and while the CPU is busy, have the iop write the data),
but I'm not sure that is done very often.
>>
>(some stuff about large number of users/data transfers)
>>I'll agree that file system performance could be improved, but I'm
>>inclined to believe that improving the use of main memory as buffers
>>would be a bigger general win than any changes to the disk layout.
This Cyber I'm talking about (830) has roughly 16 Mbytes of main memory.  It
tends to use the memory to store jobs, and it gets better performance that
way than our 760 (with an equivalent load [5 or 6 times as many users]),
which can only have 256 KWords and has to roll jobs into/out of memory.
>>
>What if I am planning to do both, and the incremental costs are worth it?
>>Rich Hammond	Bell Communications Research	hammond@bellcore.com
>Martin S. McKendry;    FileNet Corp;	{hplabs,trwrb}!felix!martin
>Strictly my opinion; all of it
The Cybers are 20+ years old; Cray had the right idea when he designed them.
(But I HATE NOS!)


 -----

 Sean Eric Fagan          Office of Computing/Communications Resources
 (213) 852 5742           Suite 2600
 1GTLSEF@CALSTATE.BITNET  5670 Wilshire Boulevard
                          Los Angeles, CA 90036
{litvax, rdlvax, psivax, hplabs, ihnp4}!csun!aeusesef

hammond@faline.bellcore.com (Rich A. Hammond) (09/16/87)

Martin@felix.UUCP (Martin McKendry) responded to my comments
on his original posting about file-system performance.

First, I must apologize for the tone of my article; I didn't mean
to offend Martin.
>>> Martin claimed a high proportion of jobs were I/O bound.
>>I asked: ... Where do you get the idea that a high proportion are I/O bound?
>>
>From extensive workload analysis.  Try putting in a make.  Look at
>your CPU utilization.  If it's not 100%, you are waiting on I/O when you
>could be processing.  Depending on how much you like to wait on I/O,
>you are I/O bound.

This isn't realistic.  Given that disks have both seek and rotational
delays, the only way to get rid of ALL disk I/O time for a single job is
to prefetch the data into main memory.  Only if you can predict what
file I want before I ask for it can you have 100% CPU utilization.
If you can do that, you can make a lot of money in the stock market. :-)
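
(For strictly sequential files you can get part of the way there with
user-level read-ahead: overlap the read of the next block with the
processing of the current one.  A minimal sketch using a reader child
feeding a pipe -- the block size and process() are placeholders, and the
pipe's buffering is what provides the crude lookahead.)

/* readahead.c -- overlap disk reads with processing for a sequential file.
 * A child process reads ahead and stuffs the data into a pipe; the parent
 * processes blocks as they arrive.  The child's pending read plus the pipe
 * buffer give a crude form of prefetching.  process() is a placeholder.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

#define BLK 8192

static long process(char *buf, int n)       /* stand-in for real work */
{
    long sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

int main(int argc, char **argv)
{
    int p[2], fd, n;
    char buf[BLK];
    long total = 0;

    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    if (pipe(p) < 0) { perror("pipe"); return 1; }

    if (fork() == 0) {                      /* child: read ahead into the pipe */
        close(p[0]);
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror(argv[1]); _exit(1); }
        while ((n = read(fd, buf, BLK)) > 0)
            write(p[1], buf, n);
        _exit(0);
    }
    close(p[1]);                            /* parent: process as data arrives */
    while ((n = read(p[0], buf, BLK)) > 0)
        total += process(buf, n);
    wait(NULL);
    printf("checksum %ld\n", total);
    return 0;
}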

>To look at a single 780 is hardly representative of the world.  Most
>of the world's data processing is production commercial data processing.
>We do image processing.  Don't assume that your load is everyone's.  

I agree, but I thought we were talking about UNIX file system performance.

>In a previous life, I worked with extensive analyses of commercial
>customer workloads taken from real customer sites.  Based on simulation
>results and real benchmarks, we found that you could make changes
>by large factors (2-5) in either direction in CPU performance
>without seeing anything like the same change in throughput (total
>time to run benchmark).  Like a factor of 4 or 5 faster in CPU
>for only a factor of 2 change in throughput.   Idle time on the faster
>CPU goes up as expected.   This on batch processing
>with no terminal I/O.  If that's not I/O bound, I don't know what is. 
>Since CPU speed/$ is improving at a faster rate than the corresponding
>figure for disk, I'd expect the class of problems for which this
>occurs to increase.

No terminal I/O - were these UNIX systems? I am quite willing
to concede that UNIX and "real, commercial data processing" aren't the
same.  I'm not sure that we want them to become the same.  I can show
you UNIX systems where doubling the CPU speed doubles throughput.
But neither of our anecdotes should be generalized to the whole world.

Regarding my claims that larger main memory will help, Martin replies:

>What if I want to support 400 users from one server, each of whom
>wants 50Kb of data every 15 seconds.  Or if I have to process/merge two
>or three 60Mb data files?   What if I don't want to ship 32 M on all
>machines?

I'll agree, those could benefit from more I/O bandwidth.  But do they
need to be done under UNIX?  Wouldn't a dedicated OS to handle the
disks and communications work better in the first case, even if the
clients were UNIX systems?  Merging files might better be left to IBM MVS?
As for shipping 32 M on all systems, this is a tradeoff between your
development time and the incremental memory cost * # systems shipped.
If you only ship a few systems and the 32 M solves the problem adequately,
you'd be better off sticking it in.  Is the first one a real situation?

Regarding my claim that improving the use of main memory for buffering
would help, Martin pointed out that he could do both that and disk layout.
I said that improving the buffering would have a better payoff, so that's
what I would look at first.  It wasn't clear from Martin's note that he
had already considered it.

Regarding my asking for long-term measurements of I/O demand, not peak
measurements, Martin said:
>Why not?  Often it's the peaks I want to handle.  I can already handle the
>regular loads.

I'm looking for the largest average payoff, which I perceive
as the regular loads.  Working to alleviate the peaks may not gain you
much if the result has little effect on regular loads.
For example, if the regular load runs pretty much out of the in-core
buffer pool and the only large amounts of I/O are the peaks, then spending
man-months or man-years reworking the I/O may not save your customer much
per development dollar.   Engineering is a tradeoff;
I was saying that you have to know your work load and tailor your
efforts to extract the greatest gain.

Martin made a claim "that the vast majority of jobs were I/O bound"
which I didn't think was justified in the UNIX environment.  His reply
to my comments indicated that he wasn't thinking of the UNIX
environment and that he had specific applications in mind.
Fine, but I thought that we were interested in improving UNIX in general
not Martin's product in particular.  In that context I claimed that
there are other things that might give a better payoff.

Don't take this to mean that I'm against file system I/O improvements;
I'll welcome any that come along.

In summary, we were talking at cross purposes and I apologize for the
resulting bad feelings.

Rich Hammond	hammond@bellcore

rbl@nitrex.UUCP (09/19/87)

In article <14704@topaz.rutgers.edu> ron@topaz.rutgers.edu (Ron Natalie) writes:
>If you have a large proportion of short-lived programs, wouldn't
>you say that you spend a lot of time just bringing the program in
>from disk?
>
>-Ron


And a lot of time bringing in the Unix utility programs that the applications
may use...  Sugit Kumar did his Ph.D. dissertation at Case Western Reserve
(about 7 yr. ago) looking at the role of solid-state disks (experiments
were done on a PDP-11/45) in UNIX performance.  The best speed-ups came
from allocating one SSD to the system and utility programs and another SSD
to /usr/tmp.  If you're working against the same data files time and time
again in the application, copy them to SSD (/usr/tmp).  The speed-up could
theoretically be about 17,000-fold, but the device drivers place a lower
bound on the access time.

What does this mean about file system performance?  ...  Simply that the
overhead of the device driver and the file system itself drops proportionately
if larger block sizes are used.  One trade-off is "wasted" disk space for
smaller files.
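
(The trade-off is easy to put numbers on once you know your file-size
distribution -- the 4.2BSD block/fragment scheme exists largely to soften it.
A sketch; the file sizes below are made-up examples, not measurements from
any real system.)

/* fragcalc.c -- internal fragmentation ("wasted" space) vs. block size.
 * The sample file sizes are made up for illustration; substitute your
 * own distribution to see what a larger block really costs you.
 */
#include <stdio.h>

int main(void)
{
    static long sizes[] = { 120, 800, 3000, 4096, 9500, 60000 };  /* bytes */
    static long blks[]  = { 512, 1024, 4096, 8192 };
    int nsizes = sizeof sizes / sizeof sizes[0];
    int nblks  = sizeof blks / sizeof blks[0];
    int i, j;

    for (j = 0; j < nblks; j++) {
        long used = 0, alloc = 0;
        for (i = 0; i < nsizes; i++) {
            long nb = (sizes[i] + blks[j] - 1) / blks[j];   /* blocks needed */
            used  += sizes[i];
            alloc += nb * blks[j];
        }
        printf("block %5ld: %6ld bytes allocated for %6ld bytes of data (%.0f%% waste)\n",
               blks[j], alloc, used, 100.0 * (alloc - used) / alloc);
    }
    return 0;
}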

Rob Lake
-- 
Rob Lake
{decvax,ihnp4!cbosgd}!mandrill!nitrex!rbl

dwc@homxc.UUCP (D.CHEN) (09/21/87)

> >>> Martin claimed a high proportion of jobs were I/O bound.
> >>I asked: ... Where do you get the idea that a high proportion are I/O bound?
> >>
> >From extensive workload analysis.  Try putting in a make.  Look at
> >your CPU utilization.  If it's not 100%, you are waiting on I/O when you
> >could be processing.  Depending on how much you like to wait on I/O,
> >you are I/O bound.
> 
> This isn't realistic.  Given that disks have both seek and rotational
> delays, the only way to get rid of ALL disk I/O time for a single job is
> to prefetch the data into main memory.  Only if you can predict what
> file I want before I ask for it can you have 100% CPU utilization.
> If you can do that, you can make a lot of money in the stock market. :-)

actually, from a bottleneck point of view, only SYSTEMs can be I/O
bound, not workloads.  although one job may always have i/o delay,
multiprocessing can allow the system to run without i/o delay.
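
(the textbook approximation of that: if each job run alone keeps the cpu
busy a fraction p of the time, and the jobs' i/o waits are independent, then
with k jobs the cpu is idle only when all k are waiting, so utilization is
roughly 1 - (1-p)^k.  a quick sketch -- p = 0.4 is an arbitrary example, and
real jobs are of course not independent.)

/* util.c -- classic multiprogramming approximation: with k independent
 * jobs, each cpu-busy a fraction p of the time when run alone, the cpu
 * is idle only when all k are waiting, so utilization ~ 1 - (1-p)^k.
 * p = 0.4 is an arbitrary example, not a measurement.
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double p = 0.4;
    int k;

    for (k = 1; k <= 8; k++)
        printf("%d jobs: cpu utilization ~ %.0f%%\n",
               k, 100.0 * (1.0 - pow(1.0 - p, (double)k)));
    return 0;
}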

> 
> I'll agree, those could benefit from more I/O bandwidth.  But do they
> need to be done under UNIX?  Wouldn't a dedicated OS to handle the
> disks and communications work better in the first case, even if the
> clients were UNIX systems?  Merging files might better be left to IBM MVS?
> As for shipping 32 M on all systems, this is a tradeoff between your
> development time and the incremental memory cost * # systems shipped.
> If you only ship a few systems and the 32 M solves the problem adequately,
> you'd be better off sticking it in.  Is the first one a real situation?
> 
why not do them under UNIX?  i'm sure that the other "special"
OSs must have also evolved under workload analysis.

danny chen
homxc!dwc

scc@cl.cam.ac.uk (Stephen Crawley) (10/03/87)

In article <7075@felix.UUCP> martin@felix.UUCP (Martin McKendry) writes:
>Given that there are many different disk formats and access
>algorithms possible for a Unix file system, how do you decide
>what improvements are best for a given system?  And having decided
>how to invest your engineering dollars, how do you quantify
>the improvements you have made?  [...]

Bob Hagman of Xerox PARC has done some very interesting work in this
area.  I believe that he will be presenting a paper at this year's 
SOSP conference on his redesign of the Cedar file system.  I have 
mislaid my copy of his paper, so I'll say no more except that it is
very pertinent.

-- Steve