lm@slovax.Eng.Sun.COM (Larry McVoy) (02/05/91)
Ready flame throwers, take aim, .... fire!

In article <PCG.91Feb4191423@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
(Piercarlo Grandi) writes:

>On 28 Jan 91 21:10:30 GMT, lm@slovax.Berkeley.EDU (Larry McVoy) said:
                                      ^^^^^^^^^^^^
Thank you, xrn ...

>A lot of bizarre nonobjections (more on those later) that seem to hint
>that contrary to evidence BSD/SunOS memory and file management is as
>good as it can be, contrarily to the arguments demonstrated in one of my

Ahem.  Nobody said, or even hinted, that things are as good as they can
be.  I meant to say that I think you are, as usual, flaming things about
which you know a lot less than you should in order to flame said things.

>postings. I had sent you a copy of my article a long time ago, for
>comments, in case I had made factual errors about Sun Architectures, and
>never got a reply. Another Sun person bothered to reply, and made
>interesting remarks, after which I toned down considerably my draft.
>
>Even more bizarre, I have seen an article about some other McVoy (the
>evil twin? :->) reporting that he (or she) has just published a paper in

I'm the evil twin.  I'm me.  Male, too, I think.  Me@berkeley is thanks
to xrn, which thinks the entire world consists of the Berkeley.edu
domain.  A bit provincial, that.

>the Winter Usenix which shows that the old idea of dynamic clustering of
>small blocks works as well or better and is simpler than the BSD/SunOS
>FFS

Damn.  You did it again.  You obviously did not go read the paper.  The
changes were added to Sun's version of the FFS.  I needed the FFS
allocator; it's the only thing able to place the blocks contiguously
once the file system gets a bit scrambled.

>pcg> Ah, for /tmp it works [ ... it provides greater default clustering
>pcg> and read ahead ... ] The problems come when loads are not
>pcg> "typical". [ ... then the extra default clustering implied in a
>pcg> larger block size imply great time and space waste ...
>pcg> ]

>lm> It would appear that you are suggesting that the file system always
>lm> does read ahead and clustering.
>
>Only to somebody that is prejudiced, I am afraid, in the sense you
>mention.

Hmm.  I must be stupid, because it still looks to me like you are
rebutting with the /tmp gunk.  Silly me.

>Follow me: on sequential access, when block size is 1KB, you get 1KB
>read now and 1KB asynchronously read ahead; when it is 8KB, you get 1KB
>read now, which however cannot be accessed until an extra "synchronous"
>read ahead of 7KB is also completed, and then you get 8KB of
>asynchronously read ahead block.

Chuckle.  Snortle.  Chortle, even.  Exit, stage left, Mr Armchair
Warrior.  This is a really good argument that is all theory, no
practice.  In the clustering changes I made to SunOS, I kept the read
ahead algorithm.  If you had read the bloody paper, you would know that
the read ahead code has an extra heuristic that always reads ahead
block 1 upon the first read of block 0.  In SunOS 4.1.1, you may think
of blocks as being 56KB long for purposes of sequential access.  Since
I kept the read ahead algorithm, I'm now doing 111KB of useless read by
your definition.  (You seem to be assuming that 1KB is some nifty magic
size.)  Funny thing, Mr Armchair, is that not a soul has complained.

Furthermore, when I did *my* homework, unlike some, I found that almost
every application benefited from the large read ahead.  I couldn't find
a single benchmark that was annoyed (head, file, and mh's scan are a
bit peeved).  I tend to agree that 111KB "extra" is a bit much.  What I
don't agree with is that it is "extra"; almost everyone wants that
data.  Now, I personally think that 100K is a bit excessive, and I
intend to be a bit more conservative in future releases; I'm going to
squeeze the cluster size down to the smallest thing that gives the
sequential apps what they need without blowing any revolutions.
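The heuristic Larry describes (always prefetch block 1 on the first read of block 0, then keep one block of read ahead going while access stays sequential) can be sketched as follows.  The function and its shape are my own illustration of the idea, not the SunOS code:

```python
def readahead_decision(last_blk, cur_blk):
    """Decide which block, if any, to prefetch after a read of cur_blk,
    given that the previous read was of last_blk (-1 if none yet).
    A sketch of the heuristic described above, not the real algorithm."""
    if cur_blk == 0:
        # First read of block 0: always prefetch block 1, betting that
        # most files are read sequentially from the start.
        return 1
    if cur_blk == last_blk + 1:
        # Access is still sequential: keep the read ahead going.
        return cur_blk + 1
    # Random access: don't prefetch anything.
    return None
```

With 56KB clusters standing in for "blocks", this is roughly how a single sequential read of a new file ends up with two clusters in flight.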
Making the applications prove that they were sequential hurt every
benchmark I could find, so that's not an option.

>Extending by 15 times the read ahead, half of which is "synchronous",
>pays only if the CPU bandwidth of the application and of the disk are
>such that their ratio gives a 1/(N+1) (N is the number of IO processors)
>duty cycle. Knuth discusses the joys (not the billjoys :->) of buffering

Don't flame Bill until you've done anything nearly as significant.  I
hate being forced into defending Bill, since I wouldn't have done
everything the way he did given what I know now (hindsight being what
it is).  Even so, I like vi, I like csh, I like BSD based Unix; Bill
had a lot to do with those.  What have you produced other than a bunch
of noise?  What gives you the right?  Hmm?

>for sequential access in the ACP series volume 1, "Fundamental
>algorithms", a book that seems to have been classified "Top Secret" by
>the NSA.
>
>Now for the quantitatively minded: do some back of the envelope
>calculations, and find out which bandwidth the application must have,
>given current IO subsystem bandwidths, to give the optimal duty cycles
>with 1KB and 8KB blocks. Compare with actual application CPU bandwidths
>(less than 1KB/MIPS for troff/TeX to several hundred KB/MIPS for BSD
>ld).

Now for the intelligent people: figure out how slow your application
will go if you read things in 1KB chunks.  And don't tell me that the
application should prove sequential access and switch to large blocks
later until you consider that many files are small.  Suppose we assume
that all files are 8KB or less.  Suppose we take on faith that 90% of
the applications traverse the files sequentially.  Now consider how
stupid you are if you made an operating system that knew all that and
*still* forced you to prove it before giving you your damn data.
Pretty stupid, in my opinion.
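The back-of-the-envelope challenge above can be made concrete with a toy double-buffering model: with one block of read ahead, the disk is fully hidden only when the time to consume a block (block size over application bandwidth) is at least the time to fetch it (per-request overhead plus block size over disk bandwidth).  All the numbers here, the 20ms per-request overhead, the 1MB/s disk, and the per-application bandwidths, are illustrative assumptions, not measurements:

```python
def io_bound(block_kb, app_kb_per_s, disk_kb_per_s, overhead_ms=20.0):
    """True if the application stalls on IO despite one block of read
    ahead, under a simple double-buffering model.  Illustrative only."""
    consume_ms = block_kb / app_kb_per_s * 1000.0   # time to chew a block
    fetch_ms = overhead_ms + block_kb / disk_kb_per_s * 1000.0
    return fetch_ms > consume_ms

# troff/TeX at ~1 KB/s per MIPS on a 10 MIPS machine (~10 KB/s) never
# stalls, even on 1KB blocks.  A fast consumer like ld (~500 KB/s)
# stalls on 1KB blocks, where the fixed overhead dominates, but not on
# 56KB ones.
```

This is consistent with both positions: slow applications don't care about block size, and fast ones are the reason the big transfers win benchmarks.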
>Conclude then, unless you work for Sun, DEC AT&T or other major Unix
>vendor,

Yeah, I'd hate to work for the guys who made the machine on which Unix
was developed, the guys that developed Unix, or the guys that have done
almost all of the significant Unix development since the early BSD
releases.  What a bummer.  I'd rather work at a university and suggest
that I'm the only one in the world who can think straight.

>As to random access, having an 8KB block size means that instead of
>reading just the 1KB you needed, you read those and then a "synchronous"
>read ahead of 7KB, which is pointless if your record size is under 1KB,
>which seems fairly typical.

So make a small block file system.  Read the mkfs, newfs, or tunefs man
pages lately?  It seems like all the world but you is happy with the
default.  You are welcome to go make a 1K file system if that's your
pleasure.

>Actually you not only waste 7KBs of space,
>you also waste the (usually fortunately relatively small) time needed to
>wait for the useless 7KB to be read in.

Hmm, so, unlike the rest of the world, you have a reaction time in the
single milliseconds, huh?  Well, that figures.  For the rest of us,
waiting a couple more milliseconds to get the whole block doesn't seem
too troublesome.  In fact, waiting 25 milliseconds to get a 56KB block
isn't noticeable.

>lm> you would find that assuming sequential access is a win almost all
>lm> of the time
>
>This is what most papers on file system characterization show, and I am
>happy that you seem to have read them at least as to this.

Yup, I'm a good boy.  Read the paper.  Read the code.  Hacked the code.

>lm> How about you? Do you have a better algorithm to suggest?
>
>Well, it is clearly detailed in the rest of my article. Too bad you seem
>not to read articles before shooting from the hip (on the other hand my
>articles are so long, to anticipate the trivial objections of wise guys
>and billjoys, that you may be excused).

Why, thank you.
>I read, in my delusionary world rotating around my armchair, that bitmap
>based filesystems based on 1KB blocks give essentially the same
>performance as BSD 8KB filesystems, simply because using a bitmap gives
>dynamically and almost by default the clustering that the BSD FFS
>imposes statically with its large block sizes and complicated
>positioning algorithms.

Oh, it does, does it?  And how, pray tell, does it manage to allocate
the blocks contiguously?  Have you thought about that rather thorny
issue?  If they are doing the same thing, then they have almost all of
the complexity of the FFS.  If they are not, then their performance is
likely to get very poor due to poor allocation techniques.  You can't
have it both ways, at least I don't understand how you can.  Perhaps
you'd like to explain it to me?

>Maybe you are too busy to read comp.unix.sv386, but that newsgroup has
>frequent postings with IO speed comparisons, and the Sv386 FFS versions
>floating around are for the most part simple bitmap based ones on top of
>the V7 FS. In particular the ISC one is a very clever retrofit idea on
>the V7 one, and it gives excellent performance. Clearly not designed by
>a billjoy. The MUSS filesystem had, many years ago, a design similar to
>that of the ISC one, and it also gave very very good performance. Not

Numbers, please.  And the file systems had better be delivering the
platter speed of the drive, or you get to eat your words.  In fact,
I'll bet you that not a one delivers more than 1/2 the platter speed.
Any takers?

>pcg> This is what I use to call "billjoyism", designing something that
>pcg> works well 80% of the time and breaks down badly in the remaining
>pcg> 20%, when a little more hard thinking (some call it "engineering")
>pcg> would find a more flexible solution, like, in the example above,
>pcg> dynamic adaptive clustering (which happens almost by default if one
>pcg> switches to a free list ordered by position instead of time, e.g.
>pcg> a bitmap based one) instead of static predictive clustering like
>pcg> raising the block size.
>
>lm> More drivel. The file system does this. Has for years.
>
>Oh, you have read the BSD FFS paper too! Too bad that you failed to
>notice that the BSD/SunOS FFS starts with a base block size that gives a
>large static clustering built in, especially on Sun and SunOS based
>machines with their regrettable 4KB or 8KB page size that prevents the
>use of in memory fragments.

That's it?  That's all you have to say?  You are offended that Sun
chooses an 8KB block size?  You need to try harder.  If you are so sure
that this is a problem, go make a 4K file system and show us all some
results that say this is better.  I'll even let you define better.

>pcg> The billjoys will say "so what, sequential access is 80%, so we go
>pcg> for it, and damn the rest!". Unfortunately there are two problems:
>
>pcg> 1) random access to small files is fairly common in UNIX, and
>pcg> cannot be so easily dismissed: directories, inode pages. Some
>pcg> little dbm, too.

I missed this one the first time around.  The file system clusters both
directories and inode pages such that you are likely to get a group of
related inodes or directories in a single read.  Big performance boost;
there is an old paper somewhere about a guy who did inode reorgs in the
V7 file system so that a page of inodes contained adjacent inodes.
Nice hack, worked great.  The BSD FFS does this sort of thing
automatically, but you know that already, right?

>lm> It is almost always necessary to copy the data when editing.
>
>There are interesting counter examples to that.

Such as?

>pcg> Also, the *contents* of directories are accessed sequentially, when
>pcg> a single B-tree based directory file (Mac, Cedar) or similar could
>pcg> be faster and more compact.
>
>lm> Again, have you tried this or is this yet another one of those "I
>lm> believe it to be true therefore it is true" deals?
>lm> I looked at this a while ago (OS/2 does this, or did when I played
>lm> with it) and it sucked.
>
>Let me guess: a large, large percentage of your 4KB-8KB buffers or pages
>that are used for IO contain files or directories whose median size is
>well under twice that of a page, and a fairly large percentage contain
>files that are smaller than half a page. This, let me speculate :-),
>adds up to a wastage of about half the active (totally unused buffer
>slots or pages don't count) real memory used for buffering or pages.

Well, my friend, you're wrong.  Here's a histogram of my mail
directory.  I use mh, so *each message* is a file.  I don't know where
you'll find a better collection of small files.  Kinda looks like 4KB
is a better number than 1KB here, huh?

$ sizes -l ~/Mail
size  log2 histogram
   1
   8  #####
  16  #
  32  ##
  64
 128  ####
 256  ###
 512  ########
  1K  #########
  2K  ##########
  4K  ##########
  8K  #########
 16K  #######
 32K  ######
 64K  #####
128K  ###
256K  ##
512K
  1M
2733 files in 11M
$

Don't buy that?  OK, how about /usr/include?  That should be a bunch of
small files, right?

$ sizes -l /usr/include
size  log2 histogram
  16  ####
  64
 128
 256  #######
 512  #######
  1K  #######
  2K  ########
  4K  ########
  8K  ########
 16K  ######
 32K  ####
930 files in 3M
$

A little worse, but notice the lack of a knee at 1KB.

>I have carefully looked at some Sun 3s with SunOS 3 and 4, using nothing
>more complicated than pstat(8), and even if they were lightly loaded I
>seemed to notice patterns not entirely dissimilar from my speculations.
>I invite you and every reader to have a go with pstat -i, just for
>starters.

All those directories that you are so worried about?  We still have a
buffer cache, into which we read inodes and directories.  We were
worried about fragmentation, you know.  And this buffer cache for UFS
meta data has been here since the beginning; it was designed in.
So, after throwing out all the directories, devices, and 0 length
files, we have:

size  Number of files that fit into this power of two bucket
   1  #
   2  #
   8  ##
  16  ####
  32  ##
 512  #
  1K  #
  2K  ###
  4K  ##
  8K  ###
 16K  ###########
 32K  ##################
 64K  #####
128K  #####
256K  #######
512K  ######
  2M  ##

Funny.  Not a lot of stuff around 1K.  By the way, this is a server:
32MB of ram, 4GB of disk, used for kernel builds (one was going on
while I took the stats).  The perl script is attached below, feel free
to check it out yourself.

So, let's see.  You are still worried.  The argument is that
fragmentation is bad.  OK, I can buy that.  So, lemme see now, OK -
here's the 5 second solution: smaller pages, yeah, that's it.  1KB
pages, yeah, no fragmentation, that's the ticket.  Don't worry about
those page faults, now that you take 8X as many.  Don't worry about
those page tables, now that they are 8X as big.  Don't worry about that
TLB, now that it covers 1/8th as much memory.  In fact, don't consider
any tradeoff, just worry about fragmentation, yeah, that's the ticket.
Yup, that'll make it better, faster, cheaper.

>Frankly, Mr. McVoy, I have not seen a single technical objection in your
>posting; I have written detailed analyses supported by references to
>well known and established results, you have replied by saying "how can
>you say that?" without bothering to offer any data, analysis or
>reference to well known and established sources.

Those who can, do.  Those who can't, whine.  I'd be happy to have you
come to work for me and prove that you can actually do something; in
fact, I'd enjoy it, I'd run you ragged.  I get the impression that you
have a very bright mind, and that's what annoys me the most.  I hate to
see good brain cells go to waste in an armchair; I'd love to see you do
some work.  Blathering on usenet, fun though it is, is not work, it's
what you do while you are waiting for your kernel to compile, your
thesis to print, whatever.
Speaking of which, I'm busily billjoying the file system; kernel just
finished, gotta go.

Cheers,
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com
lm@slovax.Eng.Sun.COM (Larry McVoy) (02/05/91)
In article <417@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>The perl script is attached below, feel free to check it out yourself.

Oops, forgot it.  Here it is:

#!/bin/perl
#
# Figure out how big the files are that are in memory.
#
open(IN, "pstat -i|") || die "pstat";
$_ = <IN>;			# eat title
while (<IN>) {
	split;
	next if $#_ == 13;	# skip devices
	$d = $_[5];
	while (length($d) > 5) {	# trim leading mode bits
		$d =~ s/.//o;
	}
	next unless length($d) == 5;
	next if $d =~ /^4/;	# skip directories
	$sizes[&lg($_[8])]++;
}
for ($ix = 0; $ix <= $#sizes; ++$ix) {
	next unless defined($sizes[$ix]);
	printf "%4.4s ", &base2(1 << $ix);
	for ($j = $sizes[$ix]; $j-- > 0; ) {
		print '#';
	}
	print "\n";
}
exit 0;

#
# log base 2
#
sub lg {
	local($val, $ix) = @_;

	for ($ix = 0; (1 << $ix) < $val; ++$ix) {}
	return ($ix);
}

#
# print a number that will fit in 4 spaces, number is power of two.
#
sub base2 {
	local($num, $ix) = @_;

	if ($num < 1024) {
		return "$num";
	}
	if ($num < 1024*1024) {
		$num >>= 10;
		return "$num" . "K";
	}
	$num >>= 20;
	return "$num" . "M";
}
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com
pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/12/91)
On 5 Feb 91 10:16:52 GMT, lm@slovax.Eng.Sun.COM (Larry McVoy) said:
lm> Ready flame throwers, take aim, .... fire!
It's snowing out there -- thanks for the warmth :-).
lm> In article <PCG.91Feb4191423@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
lm> (Piercarlo Grandi) writes:
pcg> Even more bizarre, I have seen an article about some other McVoy
pcg> (the evil twin? :->) reporting that he (or she) has just published
pcg> a paper in
lm> I'm the evil twin. I'm me.
I happened to know that. Wasn't fooled by a little NNTP bogosity. I was
just teasing you (and your evil twin :->) a bit. Idle minds tend to be a
bit pukish.
lm> Me@berkeley is thanks to xrn that thinks the entire world consists
lm> of the Berkeley.edu domain. A bit provincial, that.
Just have a chat with your postmaster about what happens to From: lines
that happen to pass thru Sun.COM. Tsk tsk. Probably UCB are just
retaliating in kind. Header wars! :-).
In the following (a little more technical than the previous one! not at
all offensive! I wish all flames were like this!) posting I have trimmed
a lot of your points that are not essential (I have regrettably given
little attention to non sequential IO), barely suppressing my urge to
gnaw at them too.
But before going into the bloody long detail, here is a summary in a
couple dozen lines:
Central to our argument is whether better performance transferring
(sequentially) a given amount of data is obtained by having large IO
transactions and a FIFO queue size of two or by having small transactions
and a variable (usually small, but possibly larger than two) queue size.
Example: is it better to read 112KB as one synchronous 56KB transaction
plus one asynchronous 56KB one, or as one synchronous 1KB transaction
plus N asynchronous 1KB ones?
Larry McVoy advocates something like the first alternative, claiming that
in practice he gets better benchmark times with 56KB+56KB (actually he
thinks that this is a bit excessive, and would be happy with lower
values), and I claim that 40 years of computer science research
conclusively demonstrate that 1KB+N*1KB is better, and that N can be
fairly small (even if usually 1 is good, according to Knuth, as the Unix
authors knew), offering detailed examples.
I also claim that even 8KB+8KB is bad, because the finer the block/page
granularity, the greater the adaptability both to fine granularity files
(median sizes around 1KB and in any case under 4KB) and to extremely
variable loads (spanning three orders of magnitude), while performance
need not suffer because dynamic clustering can be employed *if* needed;
Larry McVoy, by contrast, reckons that coarser granularity and static
clustering win most of the time and the rest doesn't matter much. In
other words, I say that with 8KB+1*8KB (worse still with 56KB+56KB) the
8KB granule is too large, or rather not as small as it could be, and the
one-block read ahead may be either too *small* or too large, depending
on highly variable contextual conditions.
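The dynamic clustering claimed for a position-ordered free list can be sketched as a scan of a free-block bitmap for a contiguous run at or after a goal block.  This is a hypothetical illustration of the idea, not the ISC or MUSS code:

```python
def alloc_cluster(bitmap, goal, want):
    """Find up to `want` contiguous free blocks at or after `goal` in a
    free-block bitmap (True = free).  Returns (start, length) of the
    first free run found, or None if no free block remains past `goal`.
    A sketch of position-ordered allocation, not real filesystem code."""
    n = len(bitmap)
    i = goal
    while i < n:
        if bitmap[i]:
            start = i
            # Extend the run while blocks stay free, up to `want`.
            while i < n and bitmap[i] and i - start < want:
                i += 1
            return (start, i - start)
        i += 1
    return None
```

Note that taking the *first* run, as this sketch does, is exactly where fragmentation creeps in; getting FFS-like contiguity out of a bitmap needs a smarter placement policy, which is the "you can't have it both ways" objection.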
I also try to show that the better benchmark times experienced by Larry
McVoy depend on larger IO transaction sizes obviating some bogosity in
SunOS, not on their intrinsic value, and also on the assumption of
unlimited real memory, as in his example of a kernel build on a 32MB
machine.
I observe that these two assumptions explain the benchmarks, help the Sun
bottom line, and explain why nobody complains -- just pop in more memory,
and the bogosities are circumvented, even if not solved, just like raising
the number of MMU cache slots.
The full discussion will appear as a separate article, with a subject line
like "A treatise on ... " :-).
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk