[comp.arch] FLAME ON: Re: Sun bogosities, including MMU thrashing

lm@slovax.Eng.Sun.COM (Larry McVoy) (02/05/91)

Ready flame throwers, take aim, .... fire!

In article <PCG.91Feb4191423@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>On 28 Jan 91 21:10:30 GMT, lm@slovax.Berkeley.EDU (Larry McVoy) said:
		Thank you, xrn ...    ^^^^^^^^^^^^
>
>A lot of bizarre nonobjections (more on those later) that seem to hint
>that contrary to evidence BSD/SunOS memory and file management is as
>good as it can be, contrary to the arguments demonstrated in one of my

Ahem.  Nobody said, or even hinted, that things are as good as they can be.
I meant to say that I think you are, as usual, flaming things about which
you know a lot less than you should in order to flame said things.

>postings.  I had sent you a copy of my article a long time ago, for
>comments, in case I had made factual errors about Sun Architectures, and
>never got a reply.  Another Sun person bothered to reply, and made
>interesting remarks, after which I toned down considerably my draft.
>
>Even more bizarre, I have seen an article about some other McVoy (the
>evil twin? :->) reporting that he (or she) has just published a paper in

I'm the evil twin.  I'm me.  Male, too, I think. Me@berkeley is thanks
to xrn that thinks the entire world consists of the Berkeley.edu
domain.  A bit provincial, that.

>the Winter Usenix which shows that the old idea of dynamic clustering of
>small blocks works as well or better and is simpler than the BSD/SunOS
>FFS

Damn.  You did it again.  You obviously did not go read the paper.  The
changes were added to Sun's version of the FFS.  I needed the FFS
allocator; it's the only thing that can place the blocks contiguously
once the file system gets a bit scrambled.

>pcg> Ah, for /tmp it works [ ... it provides greater default clustering
>pcg> and read ahead ... ] The problems come when loads are not
>pcg> "typical". [ ... then the extra default clustering implied in a
>pcg> larger block size imply great time and space waste ... ]
>
>lm> It would appear that you are suggesting that the file system always
>lm> does read ahead and clustering.
>
>Only to somebody that is prejudiced, I am afraid, in the sense you
>mention. 

Hmm.  I must be stupid because it still looks to me like you are rebutting
with the /tmp gunk.  Silly me.

>Follow me: on sequential access, when block size is 1KB, you get 1KB
>read now and 1KB asynchronously read ahead; when it is 8KB, you get 1KB
>read now, which however cannot be accessed until an extra "synchronous"
>read ahead of 7KB is also completed, and then you get 8KB of
>asynchronously read ahead block.

Chuckle.  Snortle.  Chortle, even.  Exit, stage left, Mr Armchair warrior.
This is a really good argument that is all theory, no practice.  In the
clustering changes I made to SunOS, I kept the read ahead algorithm.  If
you had read the bloody paper, you would know that the read ahead code has
an extra heuristic that always reads ahead block 1 upon the first read of
block 0.  In SunOS 4.1.1, you may think of blocks as being 56KB long for
purposes of sequential access.  Since I kept the read ahead alg, I'm now
doing 111KB of useless reads by your definition.  (You seem to be assuming
that 1KB is some nifty magic size.)
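
For the terminally curious, here's a toy sketch of the kind of heuristic I'm
talking about -- this is NOT the SunOS source, just an illustration, and the
names (file_state, start_io) are made up.  Remember the last logical block
read; treat the first read of block 0 as sequential; kick off the next
cluster asynchronously.  A real kernel checks its buffer cache first and
skips blocks that are already in memory; this toy just prints what it would
request.

#include <stdio.h>

#define CLUSTER 7                       /* blocks to read ahead; illustrative only */

struct file_state {
        long last_read;                 /* last logical block read, -1 if none */
};

/* stand-in for queueing a disk read */
static void start_io(long blk, int nblks, int async)
{
        printf("read blocks %ld..%ld (%s)\n", blk, blk + nblks - 1,
            async ? "async read-ahead" : "synchronous");
}

/* read one logical block, possibly triggering read-ahead */
static void read_block(struct file_state *fs, long blk)
{
        int sequential = (blk == 0 || blk == fs->last_read + 1);

        start_io(blk, 1, 0);                    /* the caller waits for this one */
        if (sequential)
                start_io(blk + 1, CLUSTER, 1);  /* the caller does not wait for these */
        fs->last_read = blk;
}

int main(void)
{
        struct file_state fs = { -1 };
        long blk;

        for (blk = 0; blk < 4; blk++)           /* a sequential reader */
                read_block(&fs, blk);
        read_block(&fs, 42);                    /* a random read: no read-ahead */
        return 0;
}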

Funny thing, Mr Armchair, is that not a soul has complained.  Furthermore,
when I did *my* homework, unlike some, I found that almost every application
benefited from the large read ahead.  I couldn't find a single benchmark
that was annoyed (head, file, and mh's scan are a bit peeved).  I tend to
agree that 111KB "extra" is a bit much.  What I don't agree with is that it
is "extra".  Almost everyone wants that data.

Now, I personally think that 100KB is a bit excessive and I intend to be
a bit more conservative in future releases; I'm going to squeeze the cluster
size down to the smallest thing that gives the sequential apps what they
need without blowing any revolutions.  Making the applications prove that
they were sequential hurt every benchmark I could find, so that's not
an option.

>Extending by 15 times the read ahead, half of which is "synchronous",
>pays only if the CPU bandwidth of the application and of the disk are
>such that their ratio gives a 1/(N+1) (N is the number of IO processors)
>duty cycle. Knuth discusses the joys (not the billjoys :->) of buffering

Don't flame Bill until you've done something nearly as significant.  I hate
being forced into defending Bill since I wouldn't have done everything the
way he did given what I know now (hindsight being what it is).  Even so,
I like vi, I like csh, I like BSD-based Unix, and Bill had a lot to do with
those.  What have you produced other than a bunch of noise?  What gives you
the right?  Hmm?

>for sequential access in the ACP series volume 1, "Fundamental
>algorithms", a book that seems to have been classified "Top Secret" by
>the NSA.
>
>Now for the quantitatively minded: do some back of the envelope
>calculations, and find out which bandwidth the application must have,
>given current IO subsystem bandwidths, to give the optimal duty cycles
>with 1KB and 8KB blocks.  Compare with actual application CPU bandwidths
>(less than 1KB/MIPS for troff/TeX to several hundred KB/MIPS for BSD
>ld).

Now for the intelligent people: figure out how slow your application
will go if you read things in 1KB chunks.  And don't tell me that the
application should prove sequential access and switch to large blocks
later until you consider that many files are small.  Suppose we assume
that all files are 8KB or less.  Suppose we take on faith that 90% of
the applications traverse the files sequentially.  Now consider how
stupid you are if you made an operating system that knew all that and
*still* forced you to prove it before giving you your damn data.
Pretty stupid, in my opinion.
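
Here's the back of the envelope, since we're doing math.  All numbers are
assumptions (10ms to position the heads, roughly 2MB/s off the platter) --
substitute your own disk:

#include <stdio.h>

int main(void)
{
        /* assumed, illustrative disk parameters */
        double position_ms = 10.0;      /* average seek + rotational latency */
        double rate_kb_per_ms = 2.0;    /* roughly 2MB/s media rate */
        double file_kb = 8.0;           /* the whole file */

        /* one 8KB request: position once, transfer 8KB */
        double one_big = position_ms + file_kb / rate_kb_per_ms;

        /* eight 1KB requests, no read-ahead: worst case you position every time */
        double small_worst = 8 * (position_ms + 1.0 / rate_kb_per_ms);

        /* eight 1KB requests queued fast enough to catch the same rotation */
        double small_best = position_ms + file_kb / rate_kb_per_ms;

        printf("one 8KB read:               %5.1f ms\n", one_big);
        printf("8 x 1KB, worst case:        %5.1f ms\n", small_worst);
        printf("8 x 1KB, perfectly queued:  %5.1f ms\n", small_best);
        return 0;
}

The "perfectly queued" line only happens if something issues the next request
before the disk rotates past it, which is exactly the job read-ahead and
clustering are doing for you.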

>Conclude then, unless you work for Sun, DEC, AT&T, or another major Unix
>vendor, 

Yeah, I'd hate to work for the guys who made the machine on which Unix
was developed, the guys that developed Unix, or the guys that have done
almost all of the significant Unix development since the early BSD
releases.  What a bummer.  I'd rather work at a University and suggest
that I'm the only one in the world that can think straight.

>As to random access, having an 8KB block size means that instead of
>reading just the 1KB you needed, you read those and then a "synchronous"
>read ahead of 7KB, which is pointless if your record size is under 1KB,
>which seems fairly typical. 

So make a small block file system.  Read the mkfs, newfs, or tunefs man
pages lately?  It seems like all the world but you is happy with the
default.  You are welcome to go make a 1K file system if that's
your pleasure.

>Actually you not only waste 7KBs of space,
>you also waste the (usually fortunately relatively small) time needed to
>wait for the useless 7KB to be read in.

Hmm, so, unlike the rest of the world, you have a reaction time in the
single milliseconds, huh?  Well, that figures.  For the rest of us,
waiting a couple more milliseconds to get the whole block doesn't seem
too troublesome.  In fact, waiting 25 milliseconds to get a 56KB block
isn't noticeable.

>lm> you would find that assuming sequential access is a win almost all
>lm> of the time
>
>This is what most papers on file system characterization show, and I am
>happy that you seem to have read them, at least as to this.

Yup, I'm a good boy.  Read the paper.  Read the code.  Hacked the code.

>lm> How about you?  Do you have a better algorithm to suggest?
>
>Well, it is clearly detailed in the rest of my article. Too bad you seem
>not to read articles before shooting from the hip (on the other hand my
>articles are so long, to anticipate the trivial objections of wise guys
>and billjoys, that you may be excused).

Why, thank you.

>I read, in my delusionary world rotating around my armchair, that bitmap
>based filesystems based on 1KB blocks give essentially the same
>performance as BSD 8KB filesystems, simply because using a bitmap gives
>dynamically and almost by default the clustering that the BSD FFS
>imposes statically with its large block sizes and complicated
>positioning algorithms.

Oh, it does, does it?  And how, pray tell, does it manage to allocate
the blocks contiguously?  Have you thought about that rather thorny
issue?  If they are doing the same thing, then they have almost all of
the complexity of the FFS.  If they are not, then their performance is
likely to get very poor due to poor allocation techniques.  You can't
have it both ways, at least I don't understand how you can.  Perhaps
you'd like to explain it to me?
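
Just so we're arguing about the same thing, here is a toy of what "dynamic
clustering out of a bitmap" has to do at allocation time: scan for a run of
free blocks near where you want them.  This is a sketch, not anybody's file
system; a real one also worries about cylinder groups, rotational position,
and what the map looks like after six months of use.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 64

static unsigned char freemap[NBLOCKS];  /* 1 = free, 0 = allocated */

/*
 * Find "want" contiguous free blocks at or after "near".
 * Returns the starting block, or -1 if no such run exists.
 */
static int alloc_run(int near, int want)
{
        int start, len;

        for (start = near; start + want <= NBLOCKS; start++) {
                for (len = 0; len < want && freemap[start + len]; len++)
                        ;
                if (len == want) {
                        memset(freemap + start, 0, want);       /* claim the run */
                        return start;
                }
                start += len;   /* skip past the block that broke the run */
        }
        return -1;
}

int main(void)
{
        memset(freemap, 1, sizeof freemap);     /* everything starts out free */
        freemap[10] = freemap[11] = 0;          /* somebody else's file */

        printf("8-block run near block 8 allocated at block %d\n",
            alloc_run(8, 8));
        return 0;
}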

>Maybe you are too busy to read comp.unix.sv386, but that newsgroup has
>frequent postings with IO speed comparisons, and the Sv386 FFS versions
>floating around are for the most part simple bitmap based ones on top of
>the V7 FS. In particular the ISC one is a very clever retrofit idea on
>the V7 one, and it gives excellent performance.  Clearly not designed by
>a billjoy. The MUSS filesystem had, many years ago, a design similar to
>that of the ISC one, and it also gave very very good performance. Not

Numbers, please.  And the file systems had better be delivering the platter
speed of the drive or you get to eat your words.  In fact, I'll bet you
that not a one delivers more than 1/2 the platter speed.  Any takers?

>pcg> This is what I use to call "billjoyism", designing something that
>pcg> works well 80% of the time and breaks down badly in the remaining
>pcg> 20%, when a little more hard thinking (some call it "engineering")
>pcg> would find a more flexible solution, like, in the example above,
>pcg> dynamic adaptive clustering (which happens almost by default if one
>pcg> switches to a free list ordered by position instead of time, e.g. a
>pcg> bitmap based one) instead of static predictive clustering like
>pcg> raising the block size.
>
>lm> More drivel.  The file system does this.  Has for years.
>
>Oh, you have read too the BSD FFS paper! Too bad that you failed to
>notice that the BSD/SunOS FFS starts with a base block size that gives a
>large static clustering built in, specially on Sun and SunOS based
>machines with their regrettable 4KB or 8KB page size that prevents the
>use of in memory fragments.

That's it?  That's all you have to say?  You are offended that Sun chooses
an 8KB block size?  You need to try harder.  If you are so sure that this
is a problem, go make a 4K file system and show us all some results that say
this is better.  I'll even let you define better.

>pcg> The billjoys will say "so what, sequential access is 80%, so we go
>pcg> for it, and damn the rest!". Unfortunately there are two problems:
>
>pcg> 1) random access to small files is fairly common in UNIX, and
>pcg> cannot be so easily dismissed: directories, inode pages. Some
>pcg> little dbm, too.

I missed this one the first time around.  The file system clusters both
directories and inode pages such that you are likely to get a group of
related inodes or directories in a single read.  Big performance boost -
there is an old paper somewhere about a guy who did inode reorgs in the V7
file system so that a page of inodes contained adjacent inodes.  Nice hack,
worked great.  The BSD FFS does this sort of thing automatically, but you
know that already, right?
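
To put a number on it: with the traditional 128 byte on-disk inode (I'm
assuming that size here; check your own headers) an 8KB block of the inode
area hands you 64 inodes in one read, so stat-ing a whole directory's worth
of files can touch a handful of inode blocks instead of one per file.

#include <stdio.h>

int main(void)
{
        int blocksize = 8192;   /* the FFS block size under discussion */
        int dinode = 128;       /* assumed size of the on-disk inode */

        printf("inodes per %d byte block: %d\n", blocksize, blocksize / dinode);
        return 0;
}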

>lm> It is almost always necessary to copy the data when editing.
>
>There are interesting counter examples to that.

Such as?

>pcg> Also, the *contents* of directories are accessed sequentially, when
>pcg> a single B-tree based directory file (Mac, Cedar) or similar could
>pcg> be faster and more compact.
>
>lm> Again, have you tried this or is this yet another one of those "I
>lm> believe it to be true therefor it is true" deals?  I looked at this
>lm> a while ago (OS/2 does this, or did when I played with it) and it
>lm> sucked.
>
>Let me guess: a large, large percentage of your 4KB-8KB buffers or pages
>that are used for IO contain files or directories whose median size is
>well under twice that of a page, and a fairly large percentage contain
>files that are smaller than half a page. This, let me speculate :-),
>adds up to a wastage of about half the active (totally unused buffer
>slots or pages don't count) real memory used for buffering or pages.

Well, my friend, you're wrong.  Here's a histogram of my mail directory.  I
use mh, so *each message* is a file.  I don't know where you'll find a
better collection of small files.  Kinda looks like 4KB is a better number
than 1KB here, huh?

$ sizes -l ~/Mail
size log2 histogram
   1
   8 #####
  16 #
  32 ##
  64
 128 ####
 256 ###
 512 ########
  1K #########
  2K ##########
  4K ##########
  8K #########
 16K #######
 32K ######
 64K #####
128K ###
256K ##
512K
  1M
2733 files in 11M
$

Don't buy that?  OK, how about /usr/include?  That should be a bunch of small
files, right?

$ sizes -l /usr/include
size log2 histogram
  16 ####
  64
 128
 256 #######
 512 #######
  1K #######
  2K ########
  4K ########
  8K ########
 16K ######
 32K ####
930 files in 3M
$

A little worse, but notice the lack of a knee at 1KB.

>
>I have carefully looked at some Sun 3s with SunOS 3 and 4, using nothing
>more complicated than pstat(8), and even if they were lightly loaded I
>seemed to notice patterns not entirely dissimilar from my speculations.
>I invite you and every reader to have a go with pstat -i, just for
>starters.

All those directories that you are so worried about?  We still have a buffer
cache, into which we read inodes and directories.  We were worried about
fragmentation, you know.  And this buffer cache for UFS metadata has been
here since the beginning; it was designed in.  So, after throwing out all
the directories, devices, and 0 length files, we have:

size Number of files that fit into this power of two bucket
   1 #
   2 #
   8 ##
  16 ####
  32 ##
 512 #
  1K #
  2K ###
  4K ##
  8K ###
 16K ###########
 32K ##################
 64K #####
128K #####
256K #######
512K ######
  2M ##

Funny.  Not a lot of stuff around 1K.  By the way, this is a server, 32 MB
of ram, 4GB of disk, used for kernel builds (one was going on while I took
the stats).  The perl script is attached below, feel free to check it out
yourself.


So, let's see.  You are still worried.  The argument is that fragmentation
is bad.  OK, I can buy that.  So, lemme see now, OK - here's the 5 second
solution: smaller pages, yeah, that's it.  1KB pages, yeah, no
fragmentation, that's the ticket.

Don't worry about those page faults, now that you take 8X as many.
Don't worry about those page tables, now that they are 8X as big.
Don't worry about that TLB, now that it is 1/8th as large.
In fact, don't consider any tradeoff, just worry about fragmentation,
yeah, that's the ticket. Yup, that'll make it better, faster, cheaper.
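
Here's the arithmetic behind those 8X's, with every number an assumption
(32MB of mapped memory, 4 byte PTEs, a 64 entry TLB -- pick your own MMU):

#include <stdio.h>

int main(void)
{
        long mem = 32L * 1024 * 1024;   /* 32MB of mapped memory (assumed) */
        int pte_size = 4;               /* 4 byte page table entries (assumed) */
        int tlb_entries = 64;           /* 64 entry TLB (assumed) */
        long pagesizes[] = { 1024, 8192 };
        int i;

        for (i = 0; i < 2; i++) {
                long ps = pagesizes[i];
                long pages = mem / ps;

                printf("%ldKB pages: %6ld pages, %4ldKB of PTEs, TLB covers %4ldKB\n",
                    ps / 1024, pages, pages * pte_size / 1024,
                    tlb_entries * ps / 1024);
        }
        return 0;
}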

>Frankly, Mr. McVoy, I have not seen a single technical objection in your
>posting; I have written detailed analyses supported by references to
>well known and established results, you have replied by saying "how can
>you say that?" without bothering to offer any data, analysis or
>reference to well known and established sources.

Those who can, do.  Those who can't, whine.


I'd be happy to have you come to work for me and prove that you can
actually do something; in fact, I'd enjoy it, I'd run you ragged.  I get
the impression that you have a very bright mind, and that's what annoys
me the most.  I hate to see good brain cells go to waste in an armchair;
I'd love to see you do some work.  Blathering on usenet, fun though it
is, is not work; it's what you do while you are waiting for the kernel to
compile, your thesis to print, whatever.

Speaking of which, I'm busily billjoying the file system, kernel just
finished, gotta go.

Cheers,
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

lm@slovax.Eng.Sun.COM (Larry McVoy) (02/05/91)

In article <417@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>The perl script is attached below, feel free to check it out yourself.

Oops, forgot it.  Here it is:

#!/bin/perl

#
# Figure out how big the files are that are in memory.
#
open(IN, "pstat -i|") || die "pstat";
$_ = <IN>;	# eat title
while (<IN>) {
	@_ = split;	# old Perl split into @_ implicitly; be explicit for modern perl
	next if $#_ == 13;		# skip devices
	$d = $_[5];
	while (length($d) > 5) {	# trim leading mode bits
		$d =~ s/.//o;
	}
	next unless length($d) == 5;
	next if $d =~ /^4/;		# skip directories
	$sizes[&lg($_[8])]++;
}

for ($ix = 0; $ix <= $#sizes; ++$ix) {
	next unless defined($sizes[$ix]);
	printf "%4.4s ", &base2(1 << $ix);
	for ($j = $sizes[$ix]; $j-- > 0; ) {
		print '#';
	}
	print "\n";
}
exit 0;

#
# log base 2
#
sub lg {
	local($val, $ix) = @_;

	for ($ix = 0; (1 << $ix) < $val; ++$ix) {}
	return ($ix);
}

#
# print a number that will fit in 4 spaces, number is power of two.
#
sub base2 {
	local($num, $ix) = @_;
	if ($num < 1024) {
		return "$num";
	}
	if ($num < 1024*1024) {
		$num >>= 10;
		return "$num" . "K";
	}
	$num >>= 20;
	return "$num" . "M";
}
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/12/91)

On 5 Feb 91 10:16:52 GMT, lm@slovax.Eng.Sun.COM (Larry McVoy) said:

lm> Ready flame throwers, take aim, .... fire!

It's snowing out there -- thanks for the warmth :-).

lm> In article <PCG.91Feb4191423@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
lm> (Piercarlo Grandi) writes:

pcg> Even more bizarre, I have seen an article about some other McVoy
pcg> (the evil twin? :->) reporting that he (or she) has just published
pcg> a paper in

lm> I'm the evil twin.  I'm me.

I happened to know that. Wasn't fooled by a little NNTP bogosity. I was
just teasing you (and your evil twin :->) a bit. Idle minds tend to be a
bit puckish.

lm> Me@berkeley is thanks to xrn that thinks the entire world consists
lm> of the Berkeley.edu domain.  A bit provincial, that.

Just have a chat with your postmaster about what happens to From: lines
that happen to pass thru Sun.COM. Tsk tsk. Probably UCB are just
retaliating in kind. Header wars! :-).

In the following (a little more technical than the previous one! not at
all offensive! I wish all flames were like this!) posting I have trimmed
a lot of your points that are not essential (I have regrettably given
little attention to non-sequential IO), barely suppressing my urge to
gnaw at them too.

But before going into the bloody long detail, here is a summary in a
couple dozen lines:

  Central to our argument is whether better performance transferring
  (sequentially) a given amount of data is obtained by having large IO
  transactions and a FIFO queue size of two or by having small transactions
  and a variable (usually small, but possibly larger than two) queue size.

  Example: is it better to read 128KB as one synchronous 56KB transaction
  plus one asynchronous 56KB one, or as one synchronous 1KB transaction plus
  N asynchronous 1KB ones?

  Larry McVoy advocates something like the first alternative, claiming that
  in practice he gets better benchmark times with 56KB+56KB (actually he
  thinks that this is a bit excessive, and would be happy with lower
  values), and I claim that 40 years of computer science research
  conclusively demonstrate that 1KB+N*1KB is better, and that N can be
  fairly small (even if usually 1 is good, according to Knuth, as the Unix
  authors knew), offering detailed examples.

  I also claim that even 8KB+8KB is bad: the finer the block/page
  granularity, the greater the adaptability to fine granularity files
  (median sizes around 1KB and in any case under 4KB) and to extremely
  variable loads (spanning three orders of magnitude), while performance
  need not suffer because dynamic clustering can be employed *if* needed.
  Larry McVoy reckons instead that coarser granularity and static
  clustering win most of the time and that the rest doesn't matter much.
  In other words, I say that with 8KB+1*8KB (even worse for 56KB+56KB) the
  8KB granule is too large, or rather not as small as it could be, and the
  1 read ahead may be either too *small* or too large, depending on highly
  variable contextual conditions.

  I also try to show that the better benchmark times experienced by Larry
  McVoy come from larger IO transaction sizes circumventing some bogosity
  in SunOS, not from any intrinsic value of theirs, and also depend on the
  assumption of effectively unlimited real memory, as in his example of a
  kernel build on a 32MB machine.

  I observe that these two assumptions explain the benchmarks, help the Sun
  bottom line, and explain why nobody complains -- just pop in more memory,
  and the bogosities are circumvented, even if not solved, just like raising
  the number of MMU cache slots.
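
For the quantitatively minded, here is a toy model of the two alternatives
in the summary above.  Every parameter is an assumption (per-request
overhead, media rate, application CPU cost per KB) -- plug in your own
numbers.  The outcome turns almost entirely on the per-request overhead and
on the CPU/disk ratio: set the overhead to zero and 1KB+N*1KB keeps up with
56KB+56KB, as the classical double buffering analysis says; leave it at a
few milliseconds per request and the large transactions win, which is, I
suspect, what the benchmarks are really measuring.

#include <stdio.h>

/*
 * Toy model: move "total" KB off the disk in "chunk" KB transactions,
 * double buffered, i.e. the application chews on one chunk while the
 * disk fetches the next.  All parameters are assumptions.
 */
static double elapsed_ms(double total, double chunk, double overhead_ms,
    double rate_kb_ms, double cpu_ms_per_kb)
{
        double disk = overhead_ms + chunk / rate_kb_ms; /* per transaction */
        double cpu = chunk * cpu_ms_per_kb;             /* per transaction */
        double n = total / chunk;                       /* transactions (pro rata) */

        /* the first chunk cannot be overlapped; after that the slower of
           disk and CPU sets the pace; the last chunk still has to be chewed */
        return disk + (n - 1) * (disk > cpu ? disk : cpu) + cpu;
}

int main(void)
{
        double total = 128;             /* KB to read, as in the example above */
        double rate = 2.0;              /* KB/ms media rate (assumed) */
        double overhead = 5.0;          /* ms per request: seek, rotation, setup */
        double cpus[] = { 0.01, 0.1, 1.0 };
        int i;

        for (i = 0; i < 3; i++)
                printf("app cost %.2f ms/KB: 56KB chunks %6.1f ms, 1KB chunks %6.1f ms\n",
                    cpus[i],
                    elapsed_ms(total, 56, overhead, rate, cpus[i]),
                    elapsed_ms(total, 1, overhead, rate, cpus[i]));
        return 0;
}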

The full discussion will appear as a separate article, with a subject line
like "A treatise on ... " :-).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk