[comp.arch] Mixing paging and IO is inefficient

moss@cs.umass.edu (Eliot Moss) (06/27/90)

In article <1990Jun26.152907.979@cbnewsi.att.com> yam@cbnewsi.att.com (toshihiko.yamakami) writes:

   Dangling pointers?
   In general computation, it might be true.
   In case of temporary compiler output files,
   isn't it enough just try to "make" again?
   If something wrong, let's try "make clean" and "make".

   Or did I miss some other important points?

The problem is that the file system structure itself (inodes, free block info,
etc.) can get somewhat scrambled, such that it is not even a legal file system
structure. An earlier poster pointed out that synchronous writes are stronger
than what is needed; a weaker sufficient condition is ordered writes: certain
blocks should be written before others. Whether they lag reality may not be
important in many cases, and in the cases where it is, one can do a "sync" (or
similar) operation, perhaps on a single file, set of files, directory, etc.,
rather than a whole file system. We may end up moving more towards the
functionality provided by transaction processing systems, with ordered logging
and so forth, but with most actions running with forced commits of data to
disk (a significant difference from the transaction processing world). We
probably have about beat this to death though .....		Eliot
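
For the single-file case the hook already exists as fsync(2); a minimal sketch,
assuming a POSIX-ish environment and abbreviating error handling:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write a buffer to a file and force it (data and metadata) to disk
       with fsync(2), rather than sync()ing every dirty block in the system. */
    int write_and_commit(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t) len || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }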
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu

ian@sibyl.eleceng.ua.OZ (Ian Dall) (06/27/90)

In article <137770@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
}>My approach is not doing sync write on /tmp file system. By specifying
}>"delay" option with mount, most sync write to the file system is
}>replaced by delayed write.
}
}I'm really getting sick of this thread.  Those who understand file system
}semantics dismissed this idea as flawed from the start.  The synchronous
}nature of certain file system writes are *required* for file system
}reliability.

A really simple solution would be to make /tmp a separate partition and turn
off sync writes. A mkfs (or newfs) takes on the order of the same time as
an fsck anyway. /tmp is usually cleared on a reboot, so there is no problem.
-- 
Ian Dall     life (n). A sexually transmitted disease which afflicts
                       some people more severely than others.       

lm@snafu.Sun.COM (Larry McVoy) (07/02/90)

I said:
>>... The synchronous
>>nature of certain file system writes are *required* for file system
>>reliability.  Just so you understand: consider what happens when you create
>>a file.  You allocate an inode and add a directory entry.  Think about the
>>steps and the order of operations.  If you do it wrong, and the system
>>crashes, you leave dangling pointers...

And Henry said:
>The requirement here, however, is not that writes be done synchronously,
>but that certain constraints on the *order* of writes be preserved, so
>that the disk is always adequately consistent.  

And Henry is correct.  A fairly straightforward way to improve performance
would be to make sure that certain writes were done in the proper order.
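
For concreteness, a rough sketch of what "proper order" means for file
creation; the structures and the disk_write() helper are stand-ins, not any
real kernel's interfaces:

    #include <string.h>

    struct dinode   { int di_number; int di_mode; int di_nlink; };
    struct direntry { int d_inum; char d_name[28]; };

    /* Synchronous write of one disk block; a stand-in, not a real interface. */
    void disk_write(const void *buf, unsigned long blkno);

    /* Write the initialized inode BEFORE the directory entry that names it.
       A crash between the two steps leaves at worst an orphaned inode for
       fsck to reclaim, never a name pointing at an uninitialized inode. */
    void create_file_ordered(struct dinode *ip, unsigned long inode_blk,
                             struct direntry *de, unsigned long dir_blk,
                             const char *name)
    {
        ip->di_mode  = 0100644;             /* plain file, rw-r--r-- */
        ip->di_nlink = 1;
        disk_write(ip, inode_blk);          /* step 1: inode reaches disk */

        de->d_inum = ip->di_number;         /* step 2: only now make it visible */
        strncpy(de->d_name, name, sizeof de->d_name);
        disk_write(de, dir_blk);
    }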
Henry goes on:
>Look at the
>Ousterhout paper in the latest Usenix, comparing Sprite (delayed write)
>vs. NFS (synchronous write) performance, with performance factors of
>up to 50 in favor of Sprite.  Ousterhout commented "almost every file
>you touch is a temporary file", and added that this applies to more
>than just /tmp, so it is well worth postponing most writes in hopes
>that the file will go away before you have to write it.  

This is a concocted benchmark.  Ousterhout admitted as much.  This
business about "every file you touch is a temp file" is (A) not true
and (B) a red herring.  

I'd like to see some data, taken from a whatever is a "normal" site,
that shows the temp file stuff.  My data indicates that a percentage
of files created are temp, but certainly not all, not even 50%.  It
really depends what you are doing.

Anyway, it's a moot point.  Only the control information is written 
synchronously and that is such a minor part of what is going on that
it's not really worth optimizing.

Don't believe me, huh?  Well, I don't entirely either.  Here's the deal:
things like compiles are not bound by the file system.  A coworker was
writing up a paper on tmpfs recently and benchmarked a kernel build
with /tmp over tmpfs vs /tmp over ufs.  It came out to 30 seconds 
difference.  Over 45 minutes.  Big deal.

On the other hand, there are times when delaying makes a lot of sense.
"rm *" for example, results in a lot of repeated writes to the same
directory block.  This is pretty stupid and should be fixed.
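
Roughly, the fix amounts to coalescing those writes; in this sketch
remove_entry() and disk_write() are stand-ins, not real code:

    #include <stddef.h>

    struct dirblock { int db_dirty; unsigned long db_blkno; /* entries ... */ };

    void remove_entry(struct dirblock *db, const char *name);   /* stand-in */
    void disk_write(const void *buf, unsigned long blkno);      /* stand-in */

    /* Remove n names from one directory block with a single delayed write,
       instead of one synchronous directory write per unlink. */
    void unlink_batch(struct dirblock *db, const char **names, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++) {
            remove_entry(db, names[i]);
            db->db_dirty = 1;               /* just mark it dirty */
        }
        if (db->db_dirty) {
            disk_write(db, db->db_blkno);   /* one write covers all n removals */
            db->db_dirty = 0;
        }
    }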

Just don't believe everything that is implied in Ousterhout's paper.
You can put in his file system and maybe that will make you feel
better, but your program won't run much faster, if at all.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/03/90)

In article <138208@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:

>Subject: Re: Mixing paging and IO is inefficient (was Re:  Compiler partions)

So, perhaps, you are talking about the topic of mixing paging and IO.

Or, maybe, you are still confused.

>I said:
>>>... The synchronous
>>>nature of certain file system writes are *required* for file system
>>>reliability.  Just so you understand: consider what happens when you create
>>>a file.  You allocate an inode and add a directory entry.  Think about the
>>>steps and the order of operations.  If you do it wrong, and the system
>>>crashes, you leave dangling pointers...

As several people pointed out, that is not a problem on ordinary UNIX.
Do you want to say it will be a problem if paging and IO are overly mixed?

Then, though I can't understand why, it is another defect of mixing.

>Don't believe me, huh?  Well, I don't entirely either.  Here's the deal:
>things like compiles are not bound by the file system.  A coworker was
>writing up a paper on tmpfs recently and benchmarked a kernel build
>with /tmp over tmpfs vs /tmp over ufs.  It came out to 30 seconds 
>difference.  Over 45 minutes.  Big deal.

Do you want to say SUN's tmpfs is useless because of mixed IO and paging?

Then, though I can't understand why, it is another defect of mixing.

By the way, on an ordinary UNIX with a state-of-the-art CPU (say a 20MHz
R3000), a delayed (thus fast) /tmp file system brings more than a 20%
improvement in the performance of a kernel recompilation.

>Just don't believe everything that is implied in Ousterhout's paper.
>You can put in his file system and maybe that will make you feel
>better, but your program won't run much faster, if at all.

If you are using a slow CPU, most programs are CPU bound, so they won't
run much faster.

So, use a fast CPU, and then many programs will be IO bound.

					Masataka Ohta

colin@array.UUCP (Colin Plumb) (07/04/90)

In article <137770@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>I'm really getting sick of this thread.  Those who understand file system
>semantics dismissed this idea as flawed from the start.  The synchronous
>nature of certain file system writes are *required* for file system
>reliability.

Depending on the failure modes you're considering, the Unix file
system simply isn't reliable.  If you assume that block writes are
atomic, then enforcing sequencing will prevent thoroughly bogus
file system structures, although protection violations (file
gets extended, new pointer gets added to inode, system crashes
before the block which used to contain unencrypted passwords gets
overwritten) are certainly still possible.

> Just so you understand: consider what happens when you create
>a file.  You allocate an inode and add a directory entry.  Think about the
>steps and the order of operations.  If you do it wrong, and the system
>crashes, you leave dangling pointers.  Before you tell me that systems don't
>crash very often and this isn't a problem, think back to what things were
>like when we all knew how to use fsdb.  (If you never knew fsdb, you have no
>business in this discussion).  The reason that this isn't a problem anymore
>is that we fixed it.  You are suggesting that we undo the fix.  That might
>be acceptable in your environment (How is that ETA, anyway, it's been a
>while since I've logged on) but it is not acceptable for most customers.
>Most sites want both performance and robustness - if there is a conflict
>they will sacrifice performance for robustness.

In that particular case, if I'm allowed to assume that inodes are
marked "unused" when deleted, the order doesn't matter.  On
power-up fsck, I can either see an unreferenced inode (If I
allocate, then update the directory), or a directory entry pointing
to a free inode (which I can either abort, by removing the
directory entry, or commit by marking the inode as in use.)
Adding a data block to a directory might be a better example.
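
Roughly, the recovery decision comes down to this (the names are made up,
not real fsck internals):

    /* What fsck can do for each crash window described above.  Either way,
       the damage is mechanically repairable. */
    enum fsck_fix { FIX_NONE, FIX_RECLAIM_INODE, FIX_CLEAR_OR_COMMIT_ENTRY };

    enum fsck_fix check_create_crash(int inode_in_use, int entry_present)
    {
        if (inode_in_use && !entry_present)
            return FIX_RECLAIM_INODE;           /* allocated but never named */
        if (entry_present && !inode_in_use)
            return FIX_CLEAR_OR_COMMIT_ENTRY;   /* abort (remove the entry) or
                                                   commit (mark inode in use) */
        return FIX_NONE;                        /* consistent */
    }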

/tmp is a special case, I hope you'll admit - it routinely gets cleaned
out at reboot, anyway.  If you arrange for vi backup files to go
somewhere else, you can just mkfs on every reboot if you feel like it.

>Trying to solve this problem in the manner described in Ohta's Usenix paper 
>is a mistake.  It's an inappropriate solution to the problem.

I think it's a weird hybrid of a ramdisk and stable storage, and I
think it's ugly, but I can't say it's *wrong*.

>The way that this problem is solved is via hardware.  I'll go out a limb
>and predict what the disk drive of the future looks like:  Every drive will
>have some non volatile memory, into which go all writes.  (The size of the
>memory is derivable from file system traces  - since I/O is consistantly
>bursty, the memory has to be big enough to handle a burst.)  If the system
>crashes, the drive could care less, it keeps dribbling out the writes.
>If power fails, the drive finishes the writes when it is powered back up.

I'd certainly like the NVRAM approach, as it's nice and fast,
but...

>If you must have a software solution now, I'm afraid that you are stuck with
>the tmpfs method of doing business.

See "Reimplementing the Cedar File System Using Logging and Group
Commit", Robert Hagmann, Proc. 11th ACM Symp. on Operating Systems
Principles, also known as ACM Operating Systems Review vol.21 no.5
(1987).  You don't need extra hardware.
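
The core trick there is group commit: buffer the log records in memory and
commit a whole batch with one synchronous write.  A rough sketch of the idea,
not the Cedar code (log_disk_write() is a stand-in):

    #include <string.h>

    #define LOGSIZE 8192

    static char   logbuf[LOGSIZE];
    static size_t loglen;

    void log_disk_write(const void *buf, size_t len);   /* one synchronous write; stand-in */

    /* Flush the whole batch with a single synchronous write. */
    void group_commit(void)
    {
        if (loglen == 0)
            return;
        log_disk_write(logbuf, loglen);     /* one write commits many operations */
        loglen = 0;
    }

    /* Metadata updates are appended to an in-memory log instead of being
       written to disk synchronously one by one. */
    void log_append(const void *rec, size_t len)
    {
        if (loglen + len > LOGSIZE)         /* log full: commit the group so far */
            group_commit();
        memcpy(logbuf + loglen, rec, len);
        loglen += len;
    }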
-- 
	-Colin

renglish@hpcupt1.HP.COM (Robert English) (07/04/90)

> / mohta@necom830.cc.titech.ac.jp (Masataka Ohta) / 12:19 am  Jul  3, 1990 /

> >Don't believe me, huh?  Well, I don't entirely either.  Here's the deal:
> >things like compiles are not bound by the file system.  A coworker was
> >writing up a paper on tmpfs recently and benchmarked a kernel build
> >with /tmp over tmpfs vs /tmp over ufs.  It came out to 30 seconds 
> >difference.  Over 45 minutes.  Big deal.

> Do you want to say SUN's tmpfs is useless because of mixed IO and paging?...

> By the way, on an ordinary UNIX with a state-of-the-art CPU (say 20MHz
> R3000), delayed (thus fast) /tmp file system brings more than 20%
> improvement for the performance of a kernel recomlilation.

There are lots of interpretations possible here.  Maybe Sun's file
system is written well enough that tmpfs is no longer useful.  There
are certainly many techniques that would allow that.  Maybe Sun's
compilers make copious use of virtual memory and don't use temporary
files at all.

Maybe the 20% kernel compilation improvement on the R3000 is not
available on Suns because it's already been obtained through other
mechanisms.  That doesn't mean that tmpfs is not useful for other
applications, nor does it mean that a delayed-write /tmp is not useful.
By itself, it doesn't mean much more than "tmpfs doesn't help
compilation."

--bob--
renglish@hpda

yodaiken@freal.cs.umass.edu (victor yodaiken) (07/05/90)

In article <103@array.UUCP> colin@array.UUCP (Colin Plumb) writes:
>In article <137770@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>>I'm really getting sick of this thread.  Those who understand file system
>>semantics dismissed this idea as flawed from the start.  The synchronous
>>nature of certain file system writes are *required* for file system
>>reliability.
...
>>If you must have a software solution now, I'm afraid that you are stuck with
>>the tmpfs method of doing business.
>
>See "Reimplementing the Cedar File System Using Logging and Group
>Commit", Robert Hagmann, Proc. 11th ACM Symp. on Operating Systems
>Principles, also known as ACM Operating Systems Review vol.21 no.5
>(1987).  You don't need extra hardware.
>-- 
>	-Colin

Also see the paper by Borg et al 10th ACM Symp on O.S. on the Auragen
Fault Tolerant Unix, and see the paper on hints for developers by
Butler Lampson in the same proceedings (might be the 9th come to think of
it). If you maintain a valid fs on the disk at all times, you can avoid
synchronous writes and still maintain fs integrity.  Auragen used a simple
technique: the fs root block (super-block) was duplicated. Block A pointed
to a safe version of the FS; block B was kept in memory and written to disk
using the free blocks of A. When B was safe, i.e. when everything
necessary was on disk, B would be copied to A. If the system died while
B was not safe, on reboot A would point to a safe fs, with new data written
on its free list (where it would be ignored). Logging was also used to allow
recovery. Detailed information on this fs is in a tech report that I might
be able to find, if anyone is interested.
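
A rough sketch of that dual-root idea (disk_write() and the layout are
stand-ins, not Auragen's actual code):

    struct fsroot {
        unsigned long r_rootdir;     /* block of the root directory */
        unsigned long r_freelist;    /* head of the free block list */
        unsigned long r_generation;
    };

    void disk_write(const void *buf, unsigned long blkno);   /* stand-in */

    #define SLOT_A 1    /* always points to a consistent ("safe") image */
    #define SLOT_B 2    /* working copy built in blocks A considers free */

    /* Called only once every block the working root references is on disk.
       A crash before the final write leaves SLOT_A's image intact; the new
       data sits in blocks that image considers free and is simply ignored. */
    void checkpoint(struct fsroot *working)
    {
        working->r_generation++;
        disk_write(working, SLOT_B);    /* stage the new root */
        disk_write(working, SLOT_A);    /* then publish it as the safe copy */
    }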

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/05/90)

In article <3830011@hpcupt1.HP.COM>
	renglish@hpcupt1.HP.COM (Robert English) writes:

>> >Don't believe me, huh?  Well, I don't entirely either.  Here's the deal:
>> >things like compiles are not bound by the file system.  A coworker was
>> >writing up a paper on tmpfs recently and benchmarked a kernel build
>> >with /tmp over tmpfs vs /tmp over ufs.  It came out to 30 seconds 
>> >difference.  Over 45 minutes.  Big deal.

>> By the way, on an ordinary UNIX with a state-of-the-art CPU (say 20MHz
>> R3000), delayed (thus fast) /tmp file system brings more than 20%
>> improvement for the performance of a kernel recomlilation.

>There are lots of interpretations possible here.

Yes. Especially when you don't have enough knowledge.

>Maybe Sun's file
>system is written well enough that tmpfs is no longer useful.  There
>are certainly many techniques that would allow that.

You should understand when a memory disk brings a performance
improvement. Do you know what a sync write is?

>Maybe Sun's
>compilers make copious use of virtual memory and don't use temporary
>files at all.

Kernel recompilation was chosen by SUN as a benchmark for tmpfs.
Why? Of course, the compilation process heavily uses /tmp.

						Masataka Ohta