[comp.arch] Sun bogosities, including MMU thrashing

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/18/91)

On 16 Jan 91 19:28:27 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:

richard> I've several times heard about the "knee" in the performance of
richard> Suns as the number of concurrent processes increases, so I wrote
richard> a simple benchmark. [ ... ] Here are some results. [ ... ]

The reason for the appallingly bad results is simply the appallingly bad
design of virtually *all* the MMUs and schedulers of SunOS since the
very beginning. Note that I said *design*, not implementation -- the Sun
people seem able to do fairly clever hardware, after having come up with
appallingly bad architecture and software (hackers!).

I have been observing catastrophic design mistakes in SUN (and Dec, ...)
hardware or software for a while; since the machines I am most familiar
with are Suns (and 386s) here is a summary I wrote long ago of their
"problems", including the MMU ones.

Briefly, the MMU problem is that for some unfathomable reason (I know
which one, by the way, but I don't agree it is defensible) Sun decided
originally to cache entire page tables in the MMU instead of page tables
*entries*, and then to match an LRU page table replacement policy with a
FIFO scheduler, thus *guaranteeing* thrashing as soon as the number of
active page tables (note: the *total* number, not the *most frequently
active* ones) exceeds the MMU capacity.

Here are my notes, in a crescendo of astonishing bogosity, so that the
MMU notes are near the end.

I also wrote some notes on the wonders of NFS and of using Ethernet as
an IO bus and Ethernet adapters as bus access chips; this will have to
wait a bit for a posting, after the flames for this posting have sizzled
me :-).

========================================================================

I have decided to tell you some funny myths and other tales about Sun
systems. As such, some of them are quite subtle; by no means are they
limited to Sun systems, or to BSD based systems. I could tell you
equally incredible stories about System V (what? it still does expansion
swaps?) or DEC (what? the *pager* has a memory leak?).  Most of this
discussion is historical; the latest SunOS release, 4.1.1, and the
various Sun 4 models may have entirely different bogosities. Somehow I
don't feel inclined to believe that Sun will ever cease to amaze me with
ever new and incredible ones.

Beware: some of the material herein is speculative, but I hope at least
engagingly credible :-).

		BACKGROUND ON FILESYSTEMS

The crew that did 4BSD observed that the performance of the V7 Unix
filesystem is poor when demanding applications are run on modern machines
and discs. The problems are poor locality of i-nodes, poor locality of disc
blocks, poor layout of disc blocks, and insufficient clustering of transfers.

A poor way of addressing some of these problems is to first double the
block size from one 512-byte sector to two (as in 2BSD), which is not
in itself bad (even if criticized by the Unix authors with compelling
arguments), and then to raise the block size to 16 sectors, as in
SunOS. Surely this does improve the clustering of blocks (you force 16
sectors of a file at a time to be contiguous), and it also improves the
clustering of transfers (when you read the first sector of a block you
also get the next 15).

It is very stupid because it is a poor approximation to the right
answer, which is to do *dynamic* and static clustering of runs of blocks
(actually the BSD guys did implement very complicated forms of static
clustering). It also has some big disadvantages, hinted at by
Thompson&Ritchie in the V7 papers (they advised against doubling the
block size from 1 to 2 sectors with terse and cogent reasoning).

The most obvious of these disadvantages is that Unix files on average
are quite a bit smaller than 8KB, and that having a uniform 8KB block
size would waste in internal fragmentation at least 50-60% of the disc
space on a typical Unix filesystem.

The BSD people therefore decided, complicating matters greatly, and in
the kernel, where things should be as simple as possible, to have
various block sizes, from 1KB to 8KB. Block sizes from 1KB to 7KB were
called "fragments".  A file would be made up of a number (usually zero!)
of 8KB blocks and then a single tail fragment.

This greatly complicates life, especially as the kernel, up to SunOS 3,
has a cache of blocks. Having variable block sizes means having variable
buffer sizes (which complicates life quite a bit, as you must then put
the buffer cache in virtual memory). Each buffer is described by a
buffer header, and now this not only contains its address but also the
block size.

The number of buffer headers and the number of KBytes allocated to the
buffer pool are fixed at boot time. The ratio of buffer cache size to
buffer headers should therefore approximate the average expected size
of a block. Now, you recall from the above discussion that
fragments were introduced precisely because it was expected that a
*large* proportion of files would be smaller than a full block. It can
therefore be expected that a *large* proportion of buffers would contain
an entire file quite smaller than a full block; I would even venture to
suggest, in the absence of hard data, that probably the average buffer
size is around 2KB.

    In particular, directories are frequently accessed, and they had better
    be small. The only frequently accessed files of significant size are
    going to be probably only executables, and BSD executable fetching
    bypasses the buffer cache, thanks to demand paging from the filesystem
    itself.

		THE SunOS 3 BUFFER CACHE	

Well, the default for SunOS 3 is to allocate a buffer header every 8KB
of buffer cache. This means that the assumption is made that all buffers
will contain full blocks, or in other words, that internal fragmentation
should be avoided on discs but not in memory.

In practice, it is my expectation that the typical SunOS 3 buffer cache
will be about 70-80% unused, because the supply of buffer headers will
be exhausted well before that of buffer cache space. Notice that not
only all available statistics seem to point out that most files are well
under 8KB worth, but also a buffer header is under 100 bytes, while a
buffer cache slot is 1KB long; I don't know about you, but I would more
easily overallocate and potentially waste 100-byte resources than
1000-byte ones...
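
As a back-of-the-envelope check (a sketch of mine, not SunOS code; the
2KB mean block size is just the guess made above, and the 2MB cache is
an arbitrary example):

    /* Rough utilization estimate: one buffer header per 8KB of cache
     * (the SunOS 3 default) against an assumed mean cached block of
     * about 2KB.  Both figures are assumptions for illustration. */
    #include <stdio.h>

    int main(void)
    {
        long cache_kb   = 2048;          /* say, 2MB of buffer cache    */
        long headers    = cache_kb / 8;  /* default: one header per 8KB */
        long avg_buf_kb = 2;             /* assumed mean block size     */
        long usable_kb  = headers * avg_buf_kb;

        printf("%ld headers hold ~%ldKB of %ldKB; %.0f%% of the cache idle\n",
               headers, usable_kb, cache_kb,
               100.0 * (cache_kb - usable_kb) / cache_kb);
        return 0;
    }

With those (assumed) numbers the headers run out after only a quarter
of the cache space is in use, which is where the 70-80% figure above
comes from.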

Notice that the *only* way to override this ridiculous default is to
*patch the kernel binary*, an operation that most think is high wizardry
and would not even dare consider.

    Actually I strongly suspect that even this will not work. The ability to
    have variable length buffers depends on the ability to map small pages;
    since most Suns have an 8KB VM page, probably fragment buffers are not
    supported at all, and therefore raising the number of buffer headers
    will be pretty useless. This of course is not very nice in itself...

Notice also that Unix is often heavvvvily IO bound, and a large buffer
cache is tremendously effective, especially if the users tend to edit
and compile a number of small files repeatedly, as they typically do.

Notice also that as another default only 10% of memory is reserved for
the buffer cache, which is, for your typical timesharing or program
development usage pattern, way too small.

It is no wonder that Sun recommends adding more memory to solve
performance problems...

		THE SunOS 4 BUFFERING SCHEME

SunOS 4 no longer uses the buffer cache. All files are accessed via the
virtual memory technology. This should help tremendously in reducing
system overhead, as for most files after the initial 'open()' no system
call is required until the 'close()'. It also means that file pages
compete with process pages for main memory, as there is no longer a
buffer cache, and all of free memory is dedicated to the most
(supposedly) recently used pages, be their process local or memory
mapped files.

    For some reason there is still an 'nbuf' variable in the kernel to set
    the number of buffer headers. It would be interesting to know what role
    have buffer headers in the new architecture. I have this suspicion that
    'nbuf' is just a *limit*.
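
To make the contrast with the old read()/write() path concrete, here is
a minimal user-level sketch of file access through the VM system; it is
plain mmap() usage, not Sun's internal implementation, and the error
handling is trimmed:

    /* Read a whole file through the VM system: after mmap() returns,
     * no further system calls are needed until munmap()/close(); each
     * touch of a not-yet-resident page is satisfied by a page fault
     * rather than by a read() call.  Sketch only. */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    long checksum_file(const char *path)
    {
        struct stat st;
        char *p;
        off_t i;
        long sum = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0)
            return -1;
        p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == (char *)-1)             /* MAP_FAILED in modern headers */
            return -1;

        for (i = 0; i < st.st_size; i++) /* page faults, not read()s     */
            sum += (unsigned char)p[i];

        munmap(p, st.st_size);
        close(fd);
        return sum;
    }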

Unfortunately, the Sun virtual memory technology has an 8KB page size.
Sun justifies this *extravagantly large* page size with the idea that
nowadays memory is cheaper.

This means that under SunOS 4 you effectively no longer have variable
sized buffers against the problem of internal fragmentation.  All of
central memory uses only 8KB pages as the allocation unit. We can thus
contemplate the ridiculous situation that internal fragmentation is a
concern for disc space wastage, but not for memory space.

    Something tells me that as soon as Sun looks again at the problem, they
    will remove the code that handles short blocks (fragments) from the
    filesystem as well, because now disc storage is cheap as well :->.

If memory is cheaper, it would be my naive idea to use it to *reduce*
internal fragmentation, by having a small page size; a small page size
implies a larger page table, but memory is supposedly cheap, isn't it?
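
To put rough numbers on that trade-off (all of them my own assumptions,
chosen only to illustrate the direction of the argument):

    /* Halving the page size roughly halves internal fragmentation
     * (about half a page wasted per mapped object) while multiplying
     * the number of page table entries.  Object count, mapped size
     * and PTE size below are illustrative guesses, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        long objects   = 200;        /* mapped files/segments, assumed */
        long vm_kb     = 16 * 1024;  /* 16MB mapped in total, assumed  */
        int  pte_bytes = 4;          /* one word per PTE, assumed      */
        int  sizes[]   = { 1, 4, 8 };    /* candidate page sizes in KB */
        int  i;

        for (i = 0; i < 3; i++) {
            long frag_kb = objects * sizes[i] / 2;  /* ~half page each */
            long pte_kb  = (vm_kb / sizes[i]) * pte_bytes / 1024;
            printf("%dKB pages: ~%ldKB lost to fragmentation, ~%ldKB of page tables\n",
                   sizes[i], frag_kb, pte_kb);
        }
        return 0;
    }

On those guesses, dropping from 8KB to 1KB pages recovers some 700KB of
fragmentation at the price of about 56KB of extra page tables; if memory
is cheap, that is the side of the bargain to take.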

		THE Sun VIRTUAL MEMORY CACHE

Actually Sun have made another wonderful design decision. Most virtual
memory technologies support multiple page tables, and cache entries
(supposedly the most frequently used ones) from such tables (mostly in
inane ways, some in more intelligent ones). Sun have chosen to cache not
just selected *entries*, but the most frequently used *tables* in their
entirety.  A typical Sun VM cache has 8 slots, and each slot contains
the page table for a process.

When there are more than 8 active processes, a context switch may cause
a page table to be written back to memory, and another to be loaded into
that slot. This costs *a lot*. In practice, this severely limits the
size of a page table, and thus of the address space of a process, and
also puts a lower limit on the size of a page. Sun MMU technology works
well only for address spaces that are small and densely allocated,
not large and sparse as more modern programming technologies (notably
threading, but also memory mapped files...) would suggest.

The Sun 4 SPARC MMU instead does not cache entire page tables, but
contiguous subsets of these, called 'pmeg's. Each pmeg more or less maps
a region of the address space, such as text, data, bss, stack, or shared
segment. The idea of caching contiguous submaps is not as bad as that of
caching entire maps; in particular it allows the use of smaller pages,
like 4096 bytes, which is still fairly large (32-64 pages per pmeg), but
not too much so. The problem seems to be that SunOS does not share the
pmegs that map shared segments, so that even if a lot of processes are
executing the same executable image or mapping the same shared library
(both extremely likely events) a pmeg for each region will be consumed
by each process.

Unfortunately there are not that many pmeg cache slots around; in a
typical implementation there are about 128 pmeg cache slots, for a
total of 4096-8192 pages or 16-32 megabytes. This seems enough, until
you realize that each process consumes at least 4 pmegs and potentially
many more, as they are not shared. The total working set of the current
machine load may well be under 16-32 megabytes, but there will often not
be enough pmegs to cover it, because usually a large fraction of those
16-32 megabytes will be shared, thus requiring multiple pmegs.
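
A rough count, taking the figures above (128 slots, at least 4 unshared
pmegs per process) as stated rather than as measured Sun-4 parameters:

    /* How quickly 128 pmeg slots run out when pmegs are per-process.
     * With shared libraries and mapped files the per-process count is
     * usually higher than the minimum of 4 used here. */
    #include <stdio.h>

    int main(void)
    {
        int pmeg_slots     = 128;
        int pmegs_per_proc = 4;   /* text, data, bss/heap, stack */
        int nproc;

        for (nproc = 8; nproc <= 64; nproc *= 2) {
            int needed = nproc * pmegs_per_proc;  /* no sharing */
            printf("%2d processes need %3d pmegs: %s\n",
                   nproc, needed,
                   needed <= pmeg_slots ? "fits" : "stealing begins");
        }
        return 0;
    }

With the minimum of 4 the slots are gone at 32 processes; assume a more
realistic 6-8 pmegs per process and they are gone at 16-21.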

Interestingly, if SunOS runs out of pmegs it will steal them from an
existing process; it may well happen that a resident process has all its
pmegs stolen. If this happens, its pages will be marked unmapped AND
THEREFORE free and swappable, and they will often be swapped out even if
there is no memory shortage.

There is another interesting aspect of both the Sun 3 and Sun 4
MMU schemes; as far as I remember, the virtual memory cache slots are
managed with some approximation of LRU, i.e.  essentially LIFO, while
the scheduler dispatches processes with prioritized round robin, i.e.
essentially FIFO.  Unfortunately a FIFO access pattern to a LIFO cache
guarantees a cache miss on *every* access if the FIFO is longer than the
LIFO, such as if there are more than 8 active processes. This guarantees
a collapse in throughput. Note that the load average may well be under
8, because the load average counts essentially CPU bound active
processes, while there may well be IO bound active processes.
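
A toy simulation makes the knee visible (pure illustration, not the
actual SunOS scheduler or MMU code; 8 slots and LRU replacement are the
assumptions stated above):

    /* Round-robin (FIFO) dispatch over an N-slot context cache with
     * LRU replacement: once the number of processes exceeds N, every
     * context switch misses. */
    #include <stdio.h>

    #define SLOTS 8

    int main(void)
    {
        int cache[SLOTS], stamp[SLOTS], clock = 0, misses = 0;
        int nproc = SLOTS + 1;           /* one process too many */
        int i, t, switches = 10 * nproc;

        for (i = 0; i < SLOTS; i++) { cache[i] = -1; stamp[i] = 0; }

        for (t = 0; t < switches; t++) {
            int proc = t % nproc, hit = 0, victim = 0;
            for (i = 0; i < SLOTS; i++)
                if (cache[i] == proc) { hit = 1; stamp[i] = ++clock; break; }
            if (!hit) {
                misses++;
                for (i = 1; i < SLOTS; i++)      /* evict the LRU slot */
                    if (stamp[i] < stamp[victim]) victim = i;
                cache[victim] = proc;
                stamp[victim] = ++clock;
            }
        }
        printf("%d switches, %d context-cache misses\n", switches, misses);
        return 0;
    }

With nproc equal to SLOTS the misses stop after the first pass; one
extra active process flips the cache from nearly all hits to a miss on
every switch, which is exactly the "knee" Richard measured.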

		THE SunOS 4 OPTIMIZED ACCESS TO FILES

This brings us to another subject. With SunOS 4 files are memory mapped,
i.e. file access is integrated with virtual memory. Virtual memory is
usually managed with some (often poor...)  approximation of LRU, because
virtual memory accesses tend to be clustered in time and space in some
way.

Access to files by contrast is often sequential; in particular,
especially under the BSD filesystem, sequential access is very much
favoured, so Unix applications tend to use sequential access even when
other file structures and access patterns could be used (copying a file
in its entirety is often preferred to updating it in place).

In particular, the original V7 filesystem did provide read ahead and
write behind, and the large block sizes introduced by BSD essentially
provide more of the same.

Unfortunately a FIFO (sequential file) access pattern tends to go
against the grain of a LIFO (LRU approximating) virtual memory policy.
In particular, when reading a file sequentially, the most recently
accessed block is the one least likely to be used again in the near
future, while the virtual memory subsystem assumes exactly the opposite.

BSD and SunOS, by the way, do provide a system call to advise the paging
subsystem that pages recently referenced will not be reused shortly, but
major applications don't use it at all (e.g.  not 'cp'). For example,
the 'stdio' library could profitably use it on all those files that are
not opened for read/write, and in particular those that are opened for
write or append only, as available statistics show that this happens
frequently and such a file is usually accessed strictly sequentially.
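
For example, a sequential copier could do something like the following.
madvise() and the MADV_* advice values exist in SunOS 4 and other
BSD-derived systems, but their exact names and semantics vary by
release, so treat this as a hedged sketch rather than as the one true
incantation:

    #include <sys/types.h>
    #include <sys/mman.h>

    /* Tell the VM system a mapped file will be read sequentially, and
     * release the part already consumed so those pages do not push
     * more useful ones out of memory. */
    void advise_sequential(char *base, size_t len, size_t done)
    {
        madvise(base, len, MADV_SEQUENTIAL);
        if (done > 0)
            madvise(base, done, MADV_DONTNEED);
    }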

There used to be a system call to suppress not just virtual memory keep
behind, but also to ask for fault-in ahead, i.e. to knowingly circumvent
the 'on-demand' principle of virtual memory management.  It has
apparently disappeared, replaced by ever larger pages, which give some
form of read ahead, but are a big lose for small files and random access
(as Thompson & Ritchie observed long ago).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

guy@auspex.auspex.com (Guy Harris) (01/21/91)

>A poor way of addressing some of these problems is to first double the
>block size from one 512 byte sector to two (like in 2BSD), which is
>however not bad (even if criticized by the Unix authors with compelling
>arguments), and then raising the block size to 16 sectors, like in
>SunOS.

I presume you're not saying that SunOS has fixed 16-sector block
sizes; it has standard 8K-block, 1K-fragment BSD-style file systems, of
the sort you describe later in your posting.

>It also has some big disadvantages, hinted at by Thompson&Ritchie in
>the V7 papers (they advised against doubling the block size from 1 to
>2 sectors with terse and cogent reasoning).

Which paper was that?

>This greatly complicates life, especially as the kernel, up to SunOS 3,
>has a cache of blocks. Having variable block sizes means having variable
>buffer sizes (which complicates life quite a bit, as you must then put
>the buffer cache in virtual memory).

Which makes the buffer chunk size the page size; each buffer is backed
by at least one page of physical memory.

>Notice also that as another default only 10% of memory is reserved for
>the buffer cache, which is, for your typical timesharing or program
>development usage pattern, way too small.

That is, BTW, standard BSD behavior (actually, 4.3 uses 10% of the first
2MB, 5% of the remaining memory).

>    For some reason there is still an 'nbuf' variable in the kernel to set
>    the number of buffer headers. It would be interesting to know what role
>    have buffer headers in the new architecture.

From "SunOS Virtual Memory Implementation", in some EUUG proceedings:

	8.3 UFS Control Information

	     Another difficult issue related to the UFS file system and
	the VM system is dealing with the control information that the
	"vnode" driver uses to manage the logical file.  For the UFS
	implementation, the control information consists of the
	"inodes", indirect blocks, cylinder groups, and super blocks. 
	The control information is not part of the logical file and thus
	the control information  still needs to be named by the block
	device offsets, not the logical file offsets.  To provide the
	greatest flexibility we decided to retain the old buffer cache
	code with certain modifications for optional use by file system
	implementations.  The biggest driving force behind this is that
	we did not want to rely on the system page size being smaller
	than or equal to the size of the control information for all
	file system implementations.  Other reasons for maintaining
	parts of the old buffer cache code included some compatibility
	issues for customer written drivers and file systems.  In
	current versions of SunOS, what's left of the old buffer cache
	is used strictly for UFS control buffers.  We did improve the
	old buffer code so that buffers are allocated and freed
	dynamically.  If no file system types choose to use the old
	buffer cache code (e.g., a diskless system), then no physical
	memory will be allocated to this pool.  When the buffer cache is
	being used (e.g., for control information for UFS file systems),
	memory allocated to the buffer pool will be freed when demand
	for these system resources decreases.

>Unfortunately, the Sun virtual memory technology has an 8KB page size.

At least on Sun-3s, Sun-3xs, and Sun-4s.  The "desktop SPARC" machines
have a 4KB page size.  I don't remember what the page size on the 386i
was.  (The page size on the Sun-2 was 2KB.)

>This means that under SunOS 4 you effectively no longer have variable
>sized buffers against the problem of internal fragmentation.  All of
>central memory uses only 8KB pages as the allocation unit. We can thus
>contemplate the ridiculous situation that internal fragmentation is a
>concern for disc space wastage, but not for memory space.

Yeah, I thought so too, once; I asked Rob Gingell about it, and he
pointed out that SunOS 4.x is no different from 4.2andupBSD in this
regard, as noted above.  Both of them allocate page-sized chunks for
buffers.  4.2andupBSD just happens to have originally been done on a
machine with 512-byte physical pages, and used 1024-byte "logical"
pages.  The 4.2BSD buffer pool code in SunOS 3.x used 8KB pages as the
allocation unit on Sun-3s....

>A typical Sun VM cache has 8 slots, and each slot contains
>the page table for a process.

Probably true, given that most Suns out there are *probably* "Desktop
SPARC" machines with 8 contexts, although the bigger SPARC machines have
16 or 64 contexts.

>The Sun 4 SPARC MMU instead does not cache entire page tables, but
>contiguous subsets of these, called 'pmeg's.

That's actually not unique to "the Sun4 SPARC MMU"; all Sun MMUs work
that way (as opposed to the Motorola MMU on Sun-3xs or the Intel MMU on
Sun386is).

>Each pmeg more or less maps a region of the address space, such as
>text, data, bss, stack, or shared segment.

More or less.  Each pmeg maps a chunk of the address space, but it may
take more than one pmeg to map such a region; the code that handles the
MMU doesn't really know about text or data/bss (they're combined) or
stack or....

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/22/91)

On 20 Jan 91 20:13:55 GMT, guy@auspex.auspex.com (Guy Harris) said:

pcg> [ ... raising the default block size is not a clever move ... ]

pcg> It also has some big disadvantages, hinted at by Thompson&Ritchie in
pcg> the V7 papers (they advised against doubling the block size from 1 to
pcg> 2 sectors with terse and cogent reasoning).

guy> Which paper was that?

One of "Unix Implementation" or "Unix IO system", or "A retrospective".

The argument was that a lot of Unix files are very small (directories as
a rule); BSD fragments get around the space overhead problem, but not
around the buffer cache hit rate problems. Doubling the block size is
based on the expectation that each IO transaction will read in or write
out twice the number of "useful" bytes as before. For purely sequential
access this is approximately true, but the same effect can be achieved
more efficiently by using dynamic clustering and extending read ahead to
N blocks instead of just 1 as in V7, and for random access one loses badly.

guy> [ ... noting that a lot of SunOS's questionable design decisions
guy> are taken straight off from BSD ... ]

Well, as 4.2BSD was being completed, some key OS designer was still
working with UCB CSRG as well as working at Sun itself.

pcg> For some reason there is still an 'nbuf' variable in the kernel to set
pcg> the number of buffer headers. It would be interesting to know what role
pcg> have buffer headers in the new architecture.

guy> From "SunOS Virtual Memory Implementation", in some EUUG proceedings:

Yes, I had read that paper (I referred to "some reason"). The reasons
seem weak, given that, for example, the interior of the kernel has been
changed quite a bit anyhow. Still, the power of backward compatibility!

pcg> Unfortunately, the Sun virtual memory technology has an 8KB page size.

guy> At least on Sun-3s, Sun-3xs, and Sun-4s.  The "desktop SPARC" machines
guy> have a 4KB page size.  I don't remember what the page size on the 386i
guy> was.  (The page size on the Sun-2 was 2KB.)

Ah yes. At least 2KB was reasonable, and was a function of the Sun-2
having a limited address space. But the choice of caching entire page
tables meant that when Sun had to enlarge the virtual address space they
had to enlarge the page size as well. I think that there are some good
arguments both pro and con the Sun MMU design for the Sun 1; maybe also
for the Sun 2; but in the Sun 3 case I would not rate the "pro"
arguments as good.

The 386i was 4KB, like all 386s, and the SPARCs. Barely tolerable, but
still too large: IO problems apart (for those you want to use dynamic
clustering, not bigger page sizes), the smaller the page size, the
better. IMNHO a page size of 1024 bytes is more or less the right one,
given prevailing IO technology etc.

pcg> This means that under SunOS 4 you effectively no longer have variable
pcg> sized buffers against the problem of internal fragmentation.  All of
pcg> central memory uses only 8KB pages as the allocation unit. We can thus
pcg> contemplate the ridiculous situation that internal fragmentation is a
pcg> concern for disc space wastage, but not for memory space.

guy> Yeah, I thought so too, once; I asked Rob Gingell about it, and he
guy> pointed out that SunOS 4.x is no different from 4.2andupBSD in this
guy> regard, as noted above.

No surprise again :-). More surprising (but maybe not, considering human
nature) is that Sun has never corrected the design mistakes inherited
from 4.xBSD.

It's just like System V.3.2 still having an expansion swap policy
carried over from V7, where it made sense on slow-CPU, fast-IO PDPs, a
situation 100% the opposite of current machines.

guy> Both of them allocate page-sized chunks for buffers.  4.2andupBSD
guy> just happens to have originally been done on a machine with
guy> 512-byte physical pages, and used 1024-byte "logical" pages.  The
guy> 4.2BSD buffer pool code in SunOS 3.x used 8KB pages as the
guy> allocation unit on Sun-3s....

Oh yes, and this is the crazy thing. It is yet another nefarious
consequence of the botched Sun 3 MMU architecture: that the page size is
as large as the disk block size, not as small as the disk fragment size.

Shifting the blame to 4.2BSD is too easy; a large page size is a
decision that Sun took with the Sun 3 MMU, against all reasonableness
and known properties of VM systems.


In summary, I agree that a lot of Sun's misdesigns are inherited from
BSD (some are even inherited from V7/32V!); that Sun appropriated them
enthusiastically, and stuck to them for years and through several OS
releases and product generations is an impressive demonstration of
something.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

dennis@gpu.utcs.utoronto.ca (Dennis Ferguson) (01/22/91)

In article <PCG.91Jan21160353@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>On 20 Jan 91 20:13:55 GMT, guy@auspex.auspex.com (Guy Harris) said:
>
>pcg> [ ... raising the default block size is not a clever move ... ]
>
>pcg> It also has some big disadvantages, hinted at by Thompson&Ritchie in
>pcg> the V7 papers (they advised against doubling the block size from 1 to
>pcg> 2 sectors with terse and cogent reasoning).
>
>guy> Which paper was that?
>
>One of "Unix Implementation" or "Unix IO system", or "A retrospective".

While I'm unwilling to dig through the references to determine whether
Thompson and Ritchie actually said this, if they did I do think they may
have changed their minds about it.  The file system used (on Vaxes) by
Version 8 was essentially the V7 file system with the block size increased
to 4096.  If memory serves, the stated reason this was done was simply
that it made the file system run 8 times faster under typical loads.
And I distinctly remember arguments being made at the time to the effect
that the speed of the Berkeley fast file system (still a fairly recent
innovation then) was almost exclusively due to the larger block size, and
that the block clustering algorithm, which makes the supporting code complex
and relatively CPU-intensive when writing, really was unnecessary.

While I can't judge the merit of the rest of your arguments, I do think
that the documentation with later versions of research Unix not only doesn't
support your opinion, but rather directly contradicts it.

Dennis Ferguson

mike@thor.acc.stolaf.edu (Mike Haertel) (01/23/91)

In article <1991Jan21.225211.17757@gpu.utcs.utoronto.ca> dennis@gpu.utcs.utoronto.ca (Dennis Ferguson) writes:
>While I'm unwilling to dig through the references to determine whether
>Thompson and Ritchie actually said this, if they did I do think they may
>have changed their minds about it.  The file system used (on Vaxes) by
>Version 8 was essentially the V7 file system with the block size increased
>to 4096.  If memory serves, the stated reason this was done was simply
>that it made the file system run 8 times faster under typical loads.
>And I distinctly remember arguments being made at the time to the effect
>that the speed of the Berkeley fast file system (still a fairly recent
>innovation then) was almost exclusively due to the larger block size, and
>that the block clustering algorithm, which makes the supporting code complex
>and relatively CPU-intensive when writing, really was unnecessary.

My ninth edition manual documents 2 types of file systems:

1.  a version-7 style file system with the block size increased to 1K.
2.  a bitmapped file system with 4K blocks.

I suspect that increasing the block size of a v7 file system from 1K to 4K
does not give nearly as dramatic a performance improvement as going to a
bit mapped free list with a localized allocation policy.

The bit map for the 4K v9 file system is held entirely in the super block,
in an array of 961 longs, and hence the file system cannot be larger
than 961*32*4096 == ~125 MB.

Note that this does not necessarily contradict your post:  It's possible
that v8 had a 4K v7-style file system, and that the bit mapped file system
did not appear until later, but I don't have a v8 manual and so can't check.
--
Mike Haertel <mike@stolaf.edu>
"He's a tie with the ambition to become a full-blown suit." -- Jon Westbrock

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/24/91)

On 21 Jan 91 22:52:11 GMT, dennis@gpu.utcs.utoronto.ca (Dennis Ferguson) said:

dennis> In article <PCG.91Jan21160353@odin.cs.aber.ac.uk>
dennis> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> [ ... there are good reasons explained by the Unix authors why it
pcg> would have been a bad idea to double the block size fromn 512 to
pcg> 1024 bytes ... ]

dennis> While I'm unwilling to dig through the references to determine
dennis> whether Thompson and Ritchie actually said this, if they did I
dennis> do think they may have changed their minds about it.  The file
dennis> system used (on Vaxes) by Version 8 was essentially the V7 file
dennis> system with the block size increased to 4096.

This is the filesystem designed for the /tmp partitions. It is described
in some BSTJ issue thick with database-under-UNIX articles.

dennis> If memory serves, the stated reason this was done was simply
dennis> that it made the file system run 8 times faster under typical
dennis> loads.

Ah, for /tmp it works -- almost all file accesses on /tmp are
sequential, and raising the block size is a stupid but effective way of
achieving greater physical clustering and read-ahead, at the expense of
latency, when it is known that "typical loads" are mostly sequential access.

The problems come when loads are not "typical". Then performance suffers
horribly, because the mandatory extra clustering and read ahead is not
only of no benefit, it is highly damaging: it lowers buffer cache hit
rates and adds internal access fragmentation.

This is what I have come to call "billjoyism": designing something that works
well 80% of the time and breaks down badly in the remaining 20%, when a
little more hard thinking (some call it "engineering") would find a more
flexible solution, like, in the example above, dynamic adaptive
clustering (which happens almost by default if one switches to a free
list ordered by position instead of time, e.g. a bitmap based one)
instead of static predictive clustering like raising the block size.
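
As a sketch of why a bitmap gives dynamic clustering essentially for
free (my own illustration, not anybody's shipping allocator): the
allocator simply hands out the free block nearest the block last
written, so consecutively written blocks of a file end up physically
adjacent without any fixed large block size:

    #define NBLOCKS 1024

    /* One bit per block; a set bit means the block is free.  A real
     * file system would build this map at mkfs time and set bits back
     * when blocks are freed. */
    static unsigned char freemap[NBLOCKS / 8];

    static int  testbit(int b) { return freemap[b >> 3] &  (1 << (b & 7)); }
    static void clrbit(int b)  {        freemap[b >> 3] &= ~(1 << (b & 7)); }

    /* Allocate the free block closest to 'goal', searching outwards. */
    int alloc_near(int goal)
    {
        int d;
        for (d = 0; d < NBLOCKS; d++) {
            if (goal + d < NBLOCKS && testbit(goal + d)) {
                clrbit(goal + d); return goal + d;
            }
            if (goal - d >= 0 && testbit(goal - d)) {
                clrbit(goal - d); return goal - d;
            }
        }
        return -1;          /* no free blocks left */
    }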

dennis> And I distinctly remember arguments being made at the time to
dennis> the effect that the speed of the Berkeley fast file system
dennis> (still a fairly recent innovation then) was almost exclusively
dennis> due to the larger block size, and that the block clustering
dennis> algorithm, which makes the supporting code complex and
dennis> relatively CPU-intensive when writing, really was unnecessary.

Yes, on that type of machine (stupid disc controllers, timesharing)
that's true. The question is whether the merit of the improvement comes
from fixed static clustering as such, or from its consequences in the
case of sequential access....


The billjoys will say "so what, sequential access is 80%, so we go for
it, and damn the rest!". Unfortunately there are two problems:

1) random access to small files is fairly common in UNIX, and cannot be
so easily dismissed: directories, inode pages. Some little dbm, too.

2) a lot of the sequential access preponderance in UNIX is an historical
consequence of the fact that it was especially optimized in the original
design, and that it has become even more so with time.


Thus sequential access is used also where under a more balanced design
random access would be used. For example UNIX editors traditionally
copied the file to edit twice sequentially on every edit. Now they load
it into memory and write it back from there, which gives most VM
subsystems the fits, for other similar reasons. Also, the *contents* of
directories are accessed sequentially, when a single B-tree based
directory file (Mac, Cedar) or similar could be faster and more compact.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

phil@brahms.amd.com (Phil Ngai) (01/25/91)

In article <1991Jan21.225211.17757@gpu.utcs.utoronto.ca> dennis@gpu.utcs.utoronto.ca (Dennis Ferguson) writes:
|While I'm unwilling to dig through the references to determine whether
|Thompson and Ritchie actually said this, if they did I do think they may

If I recall correctly, the argument was that sequential look ahead on
reads has all the advantages provided by bigger block size without the
fragmentation problems. This probably assumed reads were much more
frequent than writes, which may have some dependency on the use of
program swapping since a program which is in swap space only needs to
be read in, not written out.

--
When someone drinks and drives and hurts someone, the abuser is blamed.
When someone drinks and handles a gun and hurts someone,
the media calls for a gun ban.

suitti@ima.isc.com (Stephen Uitti) (01/25/91)

In article <PCG.91Jan23183334@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>On 21 Jan 91 22:52:11 GMT, dennis@gpu.utcs.utoronto.ca (Dennis Ferguson) said:
>dennis> And I distinctly remember arguments being made at the time to
>dennis> the effect that the speed of the Berkeley fast file system
>dennis> (still a fairly recent innovation then) was almost exclusively
>dennis> due to the larger block size, and that the block clustering
>dennis> algorithm, which makes the supporting code complex and
>dennis> relatively CPU-intensive when writing, really was unnecessary.
>
>Yes, on that type of machine (stupid disc controllers, timesharing)
>that's true. The question: is the merit of the improvement because of
>fixed static clustering or because of its consequences in the case of
>sequential access? ....

Interactive Systems has a file system based on the V.3.2
(essentially V7) file system, but with bitmapped block
allocation.  You can increase the static block size, but that
doesn't buy anything, anymore.  It improves file system
performance by a factor of 10 to 15, on good hardware by my
benchmarks.  There is a vast difference between a good disk
controller and a lesser disk controller.  The drivers make use of
track buffering, if available.  A controller which can read a
track in one revolution performs as if rotational latency is not
an issue.  With read-ahead, and contiguous or near contiguous
files, sequential access is very fast.  Random access is still
pretty fast, mostly because the blocks are still close to each
other.  The overhead of figuring out where they are is still an issue.

The straight V.3.2 file system was so slow that a 386/25 with a
300 MB 17 millisecond average access disk drive was good for only
a small number of engineers, about four.  With the software
change, you can really run thirty.  Of course, we don't do that.
We run X windows instead.

I used to think that using a file system as complicated as the
Berkeley Fast File System would not be as fast as a simpler file
system with a bitmap access.  If it is a nontrivial exercise to
figure out where the next block of a file might be, you won't do
it quickly.  V.4 shows us that if you throw all of your memory at
buffering, then you can make it fast.  I can't help but believe
there is some better use for memory.  Perhaps a relatively simple
file system using sorted extent lists would provide a faster
system for sequential and random access.  If the file system were
implemented in a user process under UNIX, you'd be able to
profile the system, and figure out what the latencies really are
for each part.  Then if you made interprocess communication
fast...

It would also be nice if the inode information were near something
else.  For that matter it would be nice if inodes had a larger
name space than 16 bits (V7) and could be allocated dynamically,
rather than statically (at mkfs time).  A good spot would be in
directories.  This makes hard links more difficult.  This should
not be a deterrent.

>The billjoys will say "so what, sequential access is 80%, so we go for
>it, and damn the rest!".

In the billjoys' defense, before this change, it was 100% slow, rather than
just 20% slow.

>Thus sequential access is used also where under a more balanced design
>random access would be used. For example UNIX editors traditionally
>copied the file to edit twice sequentially on every edit. Now they load
>it into memory and write it back from there, which gives most VM
>subsystems the fits, for other similar reasons.

I wrote a text editor for my record-oriented calculator.  Memory
and file space are shared in this environment.  I decided that I
couldn't afford two copies of the file.  I only read enough of
the file for the current display (not a lot).  When the user makes
a change, I remember the change, where it goes, and what was
deleted.  I only write the new file at the end, if the user wants
it.  I write the file by deleting all records that were marked
for deletion, creating the new file, then cleaning up the
remainder.  I decided that a similar system could be used for a
block oriented editor for larger systems.  It would handle
arbitrarily large files, be able to make changes to single files
that nearly fill disks, have very fast startup (since it only has
to read the first screen), would not chew up the VM needlessly,
etc.  One day, I'll write the editor's file handling subroutine library.
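
For what it is worth, the core data structure of such an editor can be
as small as this (the field names and layout are mine, purely
illustrative of the idea, not the code I actually use):

    /* The original file is never copied; edits are remembered as a
     * sorted list of change records and the new file is materialized
     * only when the user saves. */
    #include <stddef.h>

    struct change {
        long           offset;    /* position in the original file      */
        long           del_len;   /* bytes of the original dropped here */
        char          *ins_text;  /* replacement text (may be empty)    */
        size_t         ins_len;
        struct change *next;      /* kept sorted by offset              */
    };

    struct edit_session {
        const char    *path;      /* original file, opened read-only    */
        struct change *changes;   /* empty until the first edit         */
    };

Displaying a screen means reading the relevant span of the original and
splicing in whatever change records overlap it; saving means walking
the list once, sequentially.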

Stephen.
suitti@ima.isc.com
"We Americans want peace, and it is now evident that we must be
prepared to demand it.  For other peoples have wanted peace, and
the peace they received was the peace of death." - the Most Rev.
Francis J. Spellman, Archbishop of New York.  22 September, 1940

srg@quick.com (Spencer Garrett) (01/26/91)

In article <PCG.91Jan18142616@teachk.cs.aber.ac.uk>,
	pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

> This greatly complicates life, especially as the kernel, up to SunOS 3,
> has a cache of blocks. Having variable block sizes means having variable
> buffer sizes (which complicates life quite a bit, as you must then put
> the buffer cache in virtual memory). Each buffer is described by a
> buffer header, and now this not only contains its address but also the
> block size.

You've made an incorrect assumption here, and the rest of your story
about the buffer cache bogosities therefore falls wide of the mark.
The buffer cache is a *disk* cache, not a *file* cache.  Each buffer
and its associated buffer header handle a fixed-size chunk of disk
which may or may not be related to the filesystem blocksize on that disk.
When the filesystem code gets a transfer request (read or write) it
translates the file offset into an absolute sector number and
*then* goes to the disk cache to find that sector (for efficiency's
sake it actually does ranges of sectors at a time, but the idea
is the same).  Remember that you can read the disk directly, without
going through the filesystem, and still use the buffer cache.  This
is what the "cooked" disk device is for.  Lots of files are smaller
than 8k, and one disk buffer may well hold several of them.

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/27/91)

On 24 Jan 91 19:34:58 GMT, suitti@ima.isc.com (Stephen Uitti) said:

suitti> In article <PCG.91Jan23183334@odin.cs.aber.ac.uk>
suitti> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> On 21 Jan 91 22:52:11 GMT, dennis@gpu.utcs.utoronto.ca (Dennis
pcg> Ferguson) said:

dennis> And I distinctly remember arguments being made at the time to
dennis> the effect that the speed of the Berkeley fast file system
dennis> (still a fairly recent innovation then) was almost exclusively
dennis> due to the larger block size, and that the block clustering
dennis> algorithm, which makes the supporting code complex and
dennis> relatively CPU-intensive when writing, really was unnecessary.

pcg> Yes, on that type of machine (stupid disc controllers, timesharing)
pcg> that's true. The question: is the merit of the improvement because
pcg> of fixed static clustering or because of its consequences in the
pcg> case of sequential access? ....

On rereading it I think I should have qualified my comment a bit. There
are difficult issues like the interplay between sophisticated rotational
latency clustering, stupid controllers, using a bitmap for the free
list, the hash-based search for free space, and so on.

suitti> Interactive Systems has a file system based on the V.3.2
suitti> (essentially V7) file system, but with bitmapped block
suitti> allocation. [ ... and its wonderful performance, especially
suitti> with clever disk controllers ... ]

Well, yes, this is one of the cleverer approaches, because as you hint
using a bitmap essentially gives you dynamic instead of static
clustering. Thus just the use of a bitmap buys you almost all the
performance improvement of the BSD FFS for a fraction of the cost and
with just 1024 bytes per block and without causing problems for random
access. As Stephen Uitti says later there are still problems with
inodes, but for the time being we can be happy.

suitti> The straight V.3.2 file system was so slow that a 386/25 with a
suitti> 300 MB 17 millisecond average access disk drive was good for only
suitti> a small number of engineers, about four.  With the software
suitti> change, you can really run thirty.

Don't put the V7 filesystem down that much, for all its (later)
regrettable aspects. It did have some good performance, if freshly
rebuilt (the freelist was sorted for physical clustering).

pcg> The billjoys will say "so what, sequential access is 80%, so we go
pcg> for it, and damn the rest!".

suitti> In billjoys defense, before this change, it was 100% slow,
suitti> rather than just 20% slow.

Here I pick a medium size nit, just for the sake of historical accuracy:
as hinted above the performance of the traditional Unix filesystem was
not that bad, if it was regularly straightened out, as the free list
initially was in physical address order.

The problem is that nobody did the straightening regularly. I remember
reading Stonebraker in the Ingres retrospective and his repentance over
using static ISAM indexes that needed periodic reorganization, instead
of dynamic VSAM style ones that are self reorganizing, all because if
properly reorganized static indexes have marginally better performance.
He found out that nobody bothered to refresh the indexes periodically...

suitti>	[ ... describing a log based editor instead of a copy based one
suitti> like, in different ways, vi or emacs ... ]

Precisely what I think is the best organization! Especially for virtual
memory. Another nice organization was used by the TSS/370 editor, which
used VSAM (then called another way) for (memory mapped) files.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

dhinds@elaine19.stanford.edu (David Hinds) (01/27/91)

In article <1991Jan25.185333.607@quick.com> srg@quick.com (Spencer Garrett) writes:
>In article <PCG.91Jan18142616@teachk.cs.aber.ac.uk>,
>	pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>
>You've made an incorrect assumption here, and the rest of your story
>about the buffer cache bogosities therefore falls wide of the mark.
>The buffer cache is a *disk* cache, not a *file* cache.  Each buffer
>and its associated buffer header handle a fixed-size chunk of disk
>which may or may not be related to the filesystem blocksize on that disk.
>When the filesystem code gets a transfer request (read or write) it
>translates the file offset into an absolute sector number and
>*then* goes to the disk cache to find that sector (for efficiency's
>sake it actually does ranges of sectors at a time, but the idea
>is the same).  Remember that you can read the disk directly, without
>going through the filesystem, and still use the buffer cache.  This
>is what the "cooked" disk device is for.  Lots of files are smaller
>than 8k, and one disk buffer may well hold several of them.

But how often would you actually want those particular small files that
just happened to be in the same disk blocks that were read in for some
other purpose?  OK, so the extra stuff stored in the buffer cache isn't
guaranteed to be useless, but in practice, it would seem to have very
little value.  Or is the file system smart enough to group tiny files
together based on correlations in their access patterns?

 -David Hinds
  dhinds@cb-iris.stanford.edu

minich@unx2.ucc.okstate.edu (Robert Minich) (01/27/91)

suitti>	[ ... describing a log based editor instead of a copy based one
suitti> like, in different ways, vi or emacs ... ]

Grandi> Precisely what I think is the best organization! Especially for
Grandi> virtual memory. Another nice organization was used by the TSS/370
Grandi> editor, which used VSAM (then called another way) for (memory
Grandi> mapped) files.

With a log based editor, what happens if someone changes the file while
you're editing it? Are the files locked while in use?
-- 
|_    /| | Robert Minich            |
|\'o.O'  | Oklahoma State University| "I'm not discouraging others from using
|=(___)= | minich@d.cs.okstate.edu  |  their power of the pen, but mine will
|   U    | - "Ackphtth"             |  continue to do the crossword."  M. Ho

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/28/91)

On 25 Jan 91 18:53:33 GMT, srg@quick.com (Spencer Garrett) said:

srg> In article <PCG.91Jan18142616@teachk.cs.aber.ac.uk>,
srg> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

	[ ... the BSD FFS has block and fragment sizes from
	1KB to 8KB in 1KB increments, but pages are fixed size
	to 8KB, and the buffer cache is VM mapped in 4BSD/SunOS ... ]

pcg> This greatly complicates life, especially as the kernel, up to SunOS 3,
pcg> has a cache of blocks. Having variable block sizes means having variable
pcg> buffer sizes (which complicates life quite a bit, as you must then put
pcg> the buffer cache in virtual memory). Each buffer is described by a
pcg> buffer header, and now this not only contains its address but also the
pcg> block size.

srg> You've made an incorrect assumption here, and the rest of your story
srg> about the buffer cache bogosities therefore falls wide of the mark

I cannot see where I made an incorrect assumption. Maybe I have not been
clear.  Maybe you just have not understood what I wrote. Others did,
though.

srg> The buffer cache is a *disk* cache, not a *file* cache.

The well known fact that the UNIX cache is a disk cache has no effect
whatsoever on my argument. Incidentally, as far as I know no USG/BSD
derived Unix kernel uses a file cache, in the sense in which OS/360 and
others used to. Amoeba has some kind of a file cache, but its FS design
is very unconventional, taking to an extreme the idea of having the
first block of a file stored in the inode (it stores the *entire* file
in the inode, in a sense, as all files are made up of a single block,
and blocks are variable sized, and block sizes coincide with file
sizes). But Amoeba is not Unix, even if it can emulate its API quite
closely. Another system that does file caching I know is CMU's Virtue,
but again in a quite different spirit and setting from Unix.

srg> When the filesystem code gets a transfer request (read or write) it
srg> translates the file offset into an absolute sector number and
srg> *then* goes to the disk cache to find that sector (for efficiency's
srg> sake it actually does ranges of sectors at a time, but the idea is
srg> the same).

No, it does not. All *filesystem* IO is done in block (System V) or
fragment (BSD) units; only the driver and the cache manager actually
see sector numbers. The filesystem code is carefully written using
macros that abstract away from the physical sector number.


I may well have been hallucinating for the past dozen years, but reading
the kernel code, the Unix IO system paper, the BSD FFS paper, and the
4.3 book, gives me a completely different story from yours:

srg> Lots of files are smaller than 8k, and one disk buffer may well
srg> hold several of them.

I am quite sure this does not happen under 4BSD (or under *any* other
UNIX implementation derived from USG or BSD sources) and under SunOS in
particular, which on most machines simply does not have buffers smaller
than 8KB because they are VM mapped, VM pages are 8KB long, and buffers
are made up of *multiple* pages. Please reread more carefully the above
mentioned papers and books. I will do the same, just in case.

You may be confused in that it does happen that the same 8KB disk block
may indeed become split into 8 1KB fragments *on the disk*, each holding
a different file.

But in the buffer cache each of the fragments will be allocated a
different buffer slot, which has a minimum size of one page, which is
8KB on many Suns (or 4KB on many others). Since as you seem to correctly
remember *most* files are under 8KB, this means that, as I have
remarked, internal fragmentation is countered by fragments on disk, but
not in the buffer cache. This is particularly bad as the percentage of
files that are well under 8KB is greater for the dynamic case than for
the static case.

It may be thought that a disk block of 8KB containing 8 1KB fragments
could be mapped eight times as 8 different buffer slots; in practice
this cannot happen because:

1) each fragment would be at a different offset in the buffer slot,
complicating matters even more;

2) it would be hard to make the file grow, as this would require
taking care to unmap the old block and remap the new one, and copying,
and so on.

The BSD FFS/buffer cache design was meant for architectures where the VM
page size was about the same size as a fragment, not as a block, so
growing the last (direct?) block of a file from 1KB to 8KB could be done
just by mapping new pages in the same buffer slot, without any copying
at all. The BSD assumption was not bad -- the smaller the page size the
better, with dynamic clustering for spreading IO latencies and
overheads.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

lm@slovax.Berkeley.EDU (Larry McVoy) (01/29/91)

In article <PCG.91Jan21160353@odin.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|> On 20 Jan 91 20:13:55 GMT, guy@auspex.auspex.com (Guy Harris) said:
|> pcg> For some reason there is still an 'nbuf' variable in the kernel to set
|> pcg> the number of buffer headers. It would be interesting to know what role
|> pcg> have buffer headers in the new architecture.

nbuf is the upper bound of a dynamically sized (yes, piercarlo, dynamically
sized) cache in the kernel.  The cache is used by UFS to read in meta
data such as super blocks, inodes, indirect blocks, etc.  Cranking up
nbuf will use up more memory but make lookups go faster.

|> pcg> Unfortunately, the Sun virtual memory technology has an 8KB page size.
|> 
|> guy> At least on Sun-3s, Sun-3xs, and Sun-4s.  The "desktop SPARC" machines
|> guy> have a 4KB page size.  I don't remember what the page size on the 386i
|> guy> was.  (The page size on the Sun-2 was 2KB.)
|> 
|> Ah yes. At least 2KB was reasonable, and was a funtion of the Sun-2
|> having a limited address space. But the choice of caching entire page
|> tables meant that when Sun had to enlarge the virtual address space they
|> had to enlarge the page size as well. I think that there are some good
|> arguments both pro and con the Sun MMU design for the Sun 1; maybe also
|> for the Sun 2; but in the Sun 3 case I would not rate the "pro"
|> arguments as good.
|> 
|> The 386i was 4KB, like all 386s, and the SPARCs. Barely tolerable, but
|> still too large: but for IO problems, because of which you want to use
|> dynamic clustering, not bigger page sizes, the smaller is the page size,
|> the better. IMNHO a page size of 1024 bytes is more or less the right
|> one, given prevailing IO technology etc.

Maybe.  Maybe not.  Instead of claiming that 1K is the right size, how
about sharing what evidence you have that led you to this conclusion?
If you have no evidence then you are just playing armchair OS architect.
I can do that too but it is a meaningless pursuit.

---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

lm@slovax.Berkeley.EDU (Larry McVoy) (01/29/91)

In article <PCG.91Jan23183334@odin.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|> On 21 Jan 91 22:52:11 GMT, dennis@gpu.utcs.utoronto.ca (Dennis Ferguson) said:
|> dennis> If memory serves, the stated reason this was done was simply
|> dennis> that it made the file system run 8 times faster under typical
|> dennis> loads.
|> 
|> Ah, for /tmp it works -- almost all file accesses on /tmp are
|> sequential, and raising the block size is a stupid but effective way of
|> achieving greater physical clustering and read-ahead, at the expense of
|> latency, when it is known that "typical laods" are about sequential access.
|> 
|> The problems come when loads are not "typical". Then performance suffers
|> horribly because the mandatory extra clustering and read ahead not only
|> is not of benefit, it is highly damaging, because of lower buffer cache
|> hit rates, and of extra internal access fragmentation.

It would appear that you are suggesting that the file system always
does read ahead and clustering.  This is not the case.  The read ahead
algorithm is enabled *dynamically* when the kernel detects sequential
access.  Only for the first block of the file does the file system
assume sequential access.  If you were to go out and measure usage
(when's the last time you spent time gathering and analyzing some file
system traces?) you would find that assuming sequential access is a win
almost all of the time.  Yes, it adds an extra block to the system when
the guess is wrong.  I'd rather lose that extra block 20% of the time
than not have my read ahead block 80% of the time.  How about you?  Do
you have a better algorithm to suggest?
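
In outline the trick is tiny; the sketch below is in the spirit of the
classic bread()/breada() path, with made-up names rather than the
actual SunOS or BSD routines:

    /* Read-ahead is issued only when the current logical block number
     * follows the one read previously for this file. */
    struct rafile {
        long lastr;                       /* last logical block read */
    };

    extern void start_read(long lblk);       /* synchronous read      */
    extern void start_readahead(long lblk);  /* asynchronous prefetch */

    void read_block(struct rafile *fp, long lblk)
    {
        start_read(lblk);
        if (lblk == fp->lastr + 1)        /* looks sequential so far  */
            start_readahead(lblk + 1);    /* so prefetch the next one */
        fp->lastr = lblk;
    }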

|> This is what I use to call "billjoyism", designing something that works
|> well 80% of the time and breaks down badly in the remaining 20%, when a
|> little more hard thinking (some call it "engineering") would find a more
|> flexible solution, like, in the example above, dynamic adaptive
|> clustering (which happens almost by default if one switches to a free
|> list ordered by position instead of time, e.g. a bitmap based one)
|> instead of static predictive clustering like raising the block size.

More drivel.  The file system does this.  Has for years.

|> The billjoys will say "so what, sequential access is 80%, so we go for
|> it, and damn the rest!". Unfortunatley there are two problems:
|> 
|> 1) random access to small files is fairly common in UNIX, and cannot be
|> so easily dismissed: directories, inode pages. Some little dbm, too.
|> 
|> 2) a lot of the sequential access preponderance in UNIX is an historical
|> consequence of the fact that it was especially optimized in the original
|> design, and that it is has become even more so with time.

Wake up and smell the coffee, please.  Much of what the BSD fast file
system (aka UFS, also aka FFS) did was address these issues.  Why don't
you go off and measure the system for a while and then come back and
tell me that the system stinks.  Oh, yeah, please tell me how you would
improve the system and send me the simulation results that show that
your improvements are well founded.

Just as a teaser, I'll predict that a system using the FFS can do
without disksort() entirely.  I believe that the file system is doing
such a good job of grouping related requests that sorting them in the
disk queue won't buy you more than a 10% speedup and usually won't buy
you anything at all.

|> Thus sequential access is used also where under a more balanced design
|> random access would be used. For example UNIX editors traditionally
|> copied the file to edit twice sequentially on every edit. Now they load
|> it into memory and write it back from there, which gives most VM
|> subsystems the fits, for other similar reasons.

It is almost always necessary to copy the data when editing.  Think
about inserting a line and tell me how you are going to do that w/o
copying data.  If you end up w/ a read only copy of the original file
and a bunch of data structures describing changes (the usual case) then
the stupid editor should just map in the file copy on write.
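
Concretely that is one mmap() call; the function below is a sketch of
mine, but MAP_PRIVATE copy-on-write mappings are standard mmap()
behavior:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map a file for editing: modified pages become private copies,
     * unchanged pages are never duplicated, and the file on disk is
     * never touched.  Caller must check for a (caddr_t)-1 return. */
    char *map_for_editing(const char *path, size_t *lenp)
    {
        struct stat st;
        char *p;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0)
            return (char *)-1;
        *lenp = st.st_size;
        p = mmap(0, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        close(fd);                  /* the mapping survives the close */
        return p;
    }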

|> Also, the *contents* of
|> directories are accessed sequentially, when a single B-tree based
|> directory file (Mac, Cedar) or similar could be faster and more compact.

Again, have you tried this or is this yet another one of those "I believe it to be true therefore it is true" deals?  I looked at this a while ago (OS/2 does this, or did when I played with it) and it sucked.  There are cases where this sort of structure has nasty behavior (an exercise left for the reader; if you can't figure it out, don't argue the point).
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/29/91)

On 27 Jan 91 00:38:21 GMT, minich@unx2.ucc.okstate.edu (Robert Minich) said:

suitti>	[ ... describing a log based editor instead of a copy based one
suitti> like, in different ways, vi or emacs ... ]

Grandi> Precisely what I think is the best organization! Especially for
Grandi> virtual memory. Another nice organization was used by the TSS/370
Grandi> editor, which used VSAM (then called another way) for (memory
Grandi> mapped) files.

minich> With a log based editor, what happens if someone changes the file while
minich> you're editing it? Are the files locked while in use?

Well, the problem exists in any editor organization; if two users edit
the same file, the end result will be either one of the two modified
copies or some in-between mix, with no guarantee as to which (if two
copy-based editors write back at the same time, funny things will be
written). The only way to guarantee consistency is to use locking, under
any editor structure.

It is arguable whether locking should be in the editor itself. One can
use RCS or SCCS for example to ensure atomic updates, or even branches.
Or one can use something like the technology used in Locus.

Going back to VM, it is interesting to note that copy-on-write may help
here. Also note the particular implementation of the MasPar and its way
of duplicating pages and then merging them back. There are many ways to
solve the concurrent modification problem.

The semantics of concurrent modifications are an architectural issue
orthogonal to how the editor is implemented.

The two interact, but only performance-wise. Some editor organizations
are better in some cases, some in others. Copy-based editors are
attractive under Unix mainly because of the heavy preference given to
sequential reads and writes through the filesystem. Now that Unix
implementations are moving towards Multics/TSS370 like memory mapped
files, with a very poor performance profile for sequential access, maybe
things will change.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

itcp@praxis.co.uk (Tom Parke) (01/29/91)

suitti@ima.isc.com (Stephen Uitti) writes:

>It would also be nice if the inode information were near something
>else.  For that matter it would be nice if inodes had a larger
>name space than 16 bits (V7) and could be allocated dynamically,
>rather than statically (at mkfs time).  A good spot would be in
>directories.  This makes hard links more difficult.  This should
>not be a deterrent.

It would also seem to make inode lookup less efficient - because
inodes are in a single contiguous space, the value of any given inode
can be fetched in a single disk access. I once designed and
implemented a file store where I did everything right (bit map for
free blocks, nice simple algorithm to cluster blocks allocated to
files, all management data held in the middle of the disk to reduce
average seek) except that I let the inode space be dynamic; adding the
extra indirection killed any advantages I might otherwise have gained.

	Tom
-- 

Tom Parke (my opinions and spelling are strictly temporary) 
itcp@praxis.co.uk
Praxis, 20 Manvers St., Bath BA1 1PX, UK 

renglish@hplabsz.HP.COM (Bob English) (01/30/91)

In article <PCG.91Jan28111718@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>Since as you seem to correctly
>remember *most* files are under 8KB, this means that, as I have
>remarked, internal fragmentation is countered by fragments on disk, but
>not in the buffer cache.

That all depends on what you think of as fragmentation.  In most BSD
systems, each buffer is allocated an 8KB virtual slot, which may or may
not be backed by physical memory.  The virtual space allocated is
larger than the physical space used by the buffers.  The virtual memory
ends up highly fragmented, but that doesn't cost very much.  Physical
memory ends up much less fragmented, and if integrated into the global
memory pool, can be used to hold program data as well as file system
data.

--bob--
renglish@hplabs.hp.com

guy@auspex.auspex.com (Guy Harris) (01/31/91)

>Incidentally, as far as I know no USG/BSD derived Unix kernel uses a
>file cache, in the sense in which OS/360 and others used to.

What sense is that?  Caching entire files?

SunOS 4.x and S5R4's page cache is, of course, keyed by vnode pointer
and offset into vnode, rather than by disk device and block on disk; the
traditional buffer cache in SunOS 2.x and 3.x can either be keyed by
that or by vnode and offset, the latter being needed for NFS (the Sun
NFS source distribution, as used by some other vendors in their NFS
implementations, works that way as well).  It doesn't cache files in
their entirety, although it may end up caching all the blocks of a file,
so it's not a file cache in that sense.

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/01/91)

On 29 Jan 91 19:59:53 GMT, renglish@hplabsz.HP.COM (Bob English) said:

renglish> In article <PCG.91Jan28111718@odin.cs.aber.ac.uk>
renglish> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> Since as you seem to correctly remember *most* files are under 8KB,
pcg> this means that, as I have remarked, internal fragmentation is
pcg> countered by fragments on disk, but not in the buffer cache.

renglish> That all depends on what you think of as fragmentation.

It all depends on how much of an article you read before shooting from
the hip. :-).

renglish> In most BSD systems, each buffer is allocated an 8KB virtual
renglish> slot, which may or may not be backed by physical memory.  The
renglish> virtual space allocated is larger than the physical space used
renglish> by the buffers. [ ... and so on ... ]

I see that, instead of reading the rest of my (as always excessively
verbose, I admit) original article, in which I explain the same thing
more clearly, and in particular that on many a Sun (the thread title
begins with "Sun bogosities") with 8KB pages each buffer slot occupies
8KB of physical as well as virtual memory, you have decided to show that
you are a wise guy, and that you have read the BSD FFS paper too. I am
duly impressed. Well, here you are... :-)
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/01/91)

On 29 Jan 91 15:09:29 GMT, itcp@praxis.co.uk (Tom Parke) said:

itcp> I once designed and implemented a file store where I did
itcp> everything right (bit map for free blocks, nice simple algorithm
itcp> to cluster blocks allocated to files, all management data held in
itcp> the middle of the disk to reduce average seek) except that I let
itcp> the inode space be dynamic; adding the extra indirection killed
itcp> any advantages I might otherwise have gained.

Ah, this is interesting. I guess that the reason is not the indirection
in itself, but that you forgot (naughty, naughty!) about physical
clustering of inodes and directory entries, so that the indirection
really implied an extra seek, not just a pointer chase in memory.
Putting the first block of the file in the inode itself is another way
of eliminating an indirection that often resolves into a seek.
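
To make the idea concrete, a hypothetical on-disk layout (the field
names and sizes are invented, not taken from any real filesystem):

/*
 * Hypothetical layout, only to make the idea concrete: clustering the
 * inode with the data it describes, so the common "stat then read a
 * small file" sequence touches one disk region instead of two.
 */
#include <stdint.h>
#include <stdio.h>

#define FIRST_BLOCK_BYTES 512   /* first data block stored inside the inode */
#define NDIRECT           10

struct clustered_inode {
    uint16_t mode;
    uint16_t nlink;
    uint32_t size;
    uint32_t direct[NDIRECT];           /* remaining direct block pointers */
    uint32_t indirect;                  /* single indirect block */
    uint8_t  first_block[FIRST_BLOCK_BYTES]; /* tiny files need no extra seek */
};

int main(void)
{
    printf("clustered inode size: %zu bytes\n",
           sizeof(struct clustered_inode));
    return 0;
}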

The BSD FFS does use some technology to try to cluster inodes near to
the data blocks they describe, and related inodes together, or at least
in the same cylinder.

I surmise that, in your design, this would have counteracted the
indirection (not that BSD uses indirection). I would also surmise that
putting the inode *in the directory* (maybe putting the first block
there as well -- but that is far more controversial) would really be the
right thing to do; it guarantees that directory entries and inodes are
physically clustered, and does away with one indirection anyhow. Hard
links become quite a bit more difficult, though. Multics, which used an
organization like this, only had soft links and synonyms (multiple names
for the same directory entry, like hard links, but only within the same
directory) for this reason.

All this reminds me that another lost art is physical clustering. Only
the people that work on real machines (Big Iron!) seem to care about
physical clustering anymore, except maybe those that are rediscovering
it in another form at another level in the storage hierarchy, for
example in the cache of the RS/6000.

Another of my favourite themes, related to what you write above, is that
a directory system is just an index from a string (the full pathname) to
the start of the file blocks (the inumber is merely a level of
indirection), and that directories as we know them on Unix are a very
poor index, having about the worst possible physical clustering
characteristics, being an index tree which is tall and thin and
unbalanced, with by construction a very low fanout (you can have lots
of entries per Unix directory, but this creates much worse problems).
I'd rather use a B-tree or a hash/signature file (some will remember
that I'd like keyword based filesystems), which have much better
physical clustering properties, being short and squat and balanced, with
as good a fanout as you can get.
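
A back-of-envelope comparison of the two index shapes, with invented but
plausible parameters (it ignores inode reads and caching, so it is only
a rough illustration of the fanout argument):

/*
 * Rough comparison: a component-at-a-time Unix lookup reads at least
 * one directory block per path component, while a balanced B-tree over
 * full pathnames touches about log_fanout(N) index blocks in total.
 * All numbers below are assumptions, not measurements.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double total_names     = 50000.0;  /* names in the whole filesystem */
    double path_components = 6.0;      /* typical pathname depth (assumed) */
    double entries_per_dir = 30.0;     /* average directory size (assumed) */
    double dir_entry_bytes = 16.0;     /* V7-style directory entry */
    double block_bytes     = 4096.0;
    double btree_fanout    = 100.0;    /* keys per 4KB index node (assumed) */

    /* Unix-style: at least one directory block read per path component */
    double blocks_per_dir = ceil(entries_per_dir * dir_entry_bytes / block_bytes);
    double unix_reads = path_components * blocks_per_dir;

    /* B-tree over full pathnames: height of a tree with that fanout */
    double btree_reads = ceil(log(total_names) / log(btree_fanout));

    printf("Unix-style lookup: ~%.0f directory block reads per pathname\n",
           unix_reads);
    printf("B-tree lookup:     ~%.0f index block reads per pathname\n",
           btree_reads);
    return 0;
}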
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

boyd@necisa.ho.necisa.oz.au (Boyd Roberts) (02/01/91)

In article <1991Jan21.225211.17757@gpu.utcs.utoronto.ca> dennis@gpu.utcs.utoronto.ca (Dennis Ferguson) writes:
>
>While I'm unwilling to dig through the references to determine whether
>Thompson and Ritchie actually said this, if they did I do think they may
>have changed their minds about it.  The file system used (on Vaxes) by
>Version 8 was essentially the V7 file system with the block size increased
>to 4096.  If memory serves, the stated reason this was done was simply
>that it made the file system run 8 times faster under typical loads.

The standard V8 file-system was the 4.1BSD 1k file-system.  Just like
V7 except the block size was 1k.  The 4k file-system was Weinberger's
fast file-system (aka bitfs) that had a bit-mapped free block list.
Its underlying structure was V7 with 4k blocks and the bit-map.

The speed up was that you got 8 times more data per I/O.  Making the
blocksize bigger speeds up I/O but at the cost of media wastage.

Check out the BSD fast file-system paper.  It analysed media wastage based
on file-system block size.  By the time you got to 8k blocks a staggering
half of the file-system was wasted.  God knows why they added frags.  It
would've been better to either bite the bullet or compromise on a smaller
block size.  The complexities of that file-system far outweigh its advantages.


Boyd Roberts			boyd@necisa.ho.necisa.oz.au

``When the going gets weird, the weird turn pro...''

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/01/91)

On 30 Jan 91 18:45:04 GMT, guy@auspex.auspex.com (Guy Harris) said:

pcg> Incidentally, as far as I know no USG/BSD derived Unix kernel uses a
pcg> file cache, in the sense in which OS/360 and others used to.

guy> What sense is that?  Caching entire files?

No, in the sense that caching is done by assigning a specific buffer
pool to each file as it is opened. Traditional Unix has a central buffer
cache shared by all processes, and logical file IO (byte stream, ...).

As a matter of principle (and depending on the access method; VSAM, for
example, is completely different, and more SunOS 4/SVR4 like), OS/360
does not quite have a filesystem in the Unix sense; it does not even
have logical file IO. Files are actually slices of a disk (each "file"
can have a different sector size, and the sector size is not invisible
to programs!), and reading and writing are done via library subroutines
that issue physical (well, not quite) IO operations.

In this environment a central shared buffer cache does not make any
sense; every time you open a file you say how many buffers you want to
assign to this particular opening of that file, and so on.

This may or may not be a win; for example it is a win when one can
closely tailor the buffering mechanism to the expected access patterns,
which is indeed frequently possible in certain environments. It is a
loss when loads of people open the same files. In other words, it
depends on whether you are running batch jobs with specialized access
patterns and low latency requirements or a time sharing system with
shared access patterns and high latency profile.

guy> SunOS 4.x and S5R4's page cache is, of course, keyed by vnode
guy> pointer and offset into vnode, rather than by disk device and block
guy> on disk;

Well, I would not say that a "page cache" exists in SunOS 4 and SVR4;
this is pure nominalism. What you have is segments ("vnodes"), and then
pages belong to the segment and are either only on disk or also in
memory/swap. No possible comparison with the traditional Unix shared IO
buffer cache (that was used to "cache" even tape blocks! -- well, now
tape drivers can steal pages from the page free list).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

henry@zoo.toronto.edu (Henry Spencer) (02/02/91)

In article <2006@necisa.ho.necisa.oz.au> boyd@necisa.ho.necisa.oz.au (Boyd Roberts) writes:
>The speed up was that you got 8 times more data per I/O.  Making the
>blocksize bigger speeds up I/O but at the cost of media wastage.
>
>Check out the BSD fast file-system paper...
>God knows why they added frags.  It
>would've been better to either bite the bullet or compromise on a smaller
>block size.  The complexities of that file-system far outweigh its advantages.

Check out the McVoy+Kleiman paper in the latest Usenix for the *correct*
answer to the performance problem:  don't mindlessly crank up the block
size, keep the smaller blocks but lay them out contiguously so the system
can read many at once.  Pity Berkeley didn't think of that.
-- 
If the Space Shuttle was the answer,   | Henry Spencer at U of Toronto Zoology
what was the question?                 |  henry@zoo.toronto.edu   utzoo!henry

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/05/91)

On 28 Jan 91 20:50:42 GMT, lm@slovax.Berkeley.EDU (Larry McVoy) said:

lm> In article <PCG.91Jan21160353@odin.cs.aber.ac.uk>, pcg@cs.aber.ac.uk
lm> (Piercarlo Grandi) writes:

pcg> [ ... most Sun models have had 8KB, now 4KB is prevalent, as Guy
pcg> Harris observes, which is IMNHO ... ] Barely tolerable, but still
pcg> too large: but for IO problems, because of which you want to use
pcg> dynamic clustering, not bigger page sizes, the smaller is the page
pcg> size, the better. IMNHO a page size of 1024 bytes is more or less
pcg> the right one, given prevailing IO technology etc.

lm> Maybe.  Maybe not.  Instead of claiming that 1K is the right size,
lm> how about sharing what evidence you have that lead you to this
lm> conclusion?

Please read, for example, Shaw, "The logical design of operating
systems". There are plenty of statistics in the literature that show
that the smaller the page size, the smaller the working set. In theory
the optimal page size is of the same order of magnitude as the average record
size, which is a few dozen bytes.

Naturally a physical IO transaction size of a few dozen bytes is not
efficient with current mass storage technology (CCDs and bubbles would
have solved the problem, but they went down the drain), so physical
clustering has to occur, and there are several schemes for it.

There are several papers that show that 1KB is a fine compromise, given
current hw technology, between having a small page size and large IO
transaction size, especially if dynamic clustering is used to further
boost the IO transaction size when this is possible, e.g. for FIFO
access patterns.

All these are parts of well established scientific knowledge that need
not be proven again every time they are discussed, just as one need not
prove every time that perpetual motion machines cannot exist.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/05/91)

On 28 Jan 91 21:10:30 GMT, lm@slovax.Berkeley.EDU (Larry McVoy) said:

A lot of bizarre non-objections (more on those later) that seem to hint
that, contrary to evidence and to the arguments laid out in one of my
postings, BSD/SunOS memory and file management is as good as it can be.
I had sent you a copy of my article a long time ago, for comments, in
case I had made factual errors about Sun architectures, and never got a
reply.  Another Sun person bothered to reply, and made interesting
remarks, after which I toned down my draft considerably.

Even more bizarre, I have seen an article about some other McVoy (the
evil twin? :->) reporting that he (or she) has just published a paper in
the Winter Usenix which shows that the old idea of dynamic clustering of
small blocks works as well as or better than the BSD/SunOS FFS, and is
simpler (the article is slightly reformatted for readability):

  From: henry@zoo.toronto.edu (Henry Spencer)
  Subject: Re: Sun bogosities, including MMU thrashing
  Newsgroups: comp.arch Date: 1 Feb 91 16:48:18 GMT

  In article <2006@necisa.ho.necisa.oz.au> boyd@necisa.ho.necisa.oz.au
  (Boyd Roberts) writes:

    >The speed up was that you got 8 times more data per I/O.  Making the
    >blocksize bigger speeds up I/O but at the cost of media wastage.
    >Check out the BSD fast file-system paper...
    >God knows why they added frags.  It would've been better to either bite
    >the bullet or compromise on a smaller block size.  The complexities of
    >that file-system far outweigh its advantages.

  Check out the McVoy+Kleiman paper in the latest Usenix for the *correct*
  answer to the performance problem:  don't mindlessly crank up the block
  size, keep the smaller blocks but lay them out contiguously so the system
  can read many at once.  Pity Berkeley didn't think of that.

Bah. Mysteries of the Unix world. Now, let's consider the technical
arguments:

lm> In article <PCG.91Jan23183334@odin.cs.aber.ac.uk>, pcg@cs.aber.ac.uk
lm> (Piercarlo Grandi) writes:

On 21 Jan 91 22:52:11 GMT, dennis@gpu.utcs.utoronto.ca (Dennis Ferguson)
said:

	[ ... rationale for BSD/SunOS upping the default block
	size from 512/1024 bytes to 4096/8192 bytes ... ]

dennis> If memory serves, the stated reason this was done was simply
dennis> that it made the file system run 8 times faster under typical
dennis> loads.

pcg> Ah, for /tmp it works [ ... it provides greater default clustering
pcg> and read ahead ... ] The problems come when loads are not
pcg> "typical". [ ... then the extra default clustering implied in a
pcg> larger block size imply great time and space waste ... ]

lm> It would appear that you are suggesting that the file system always
lm> does read ahead and clustering.

Only to somebody who is prejudiced, I am afraid, in the sense you
mention. I mentioned very clearly in the rest of my posting that the IO
subsystem of a traditional Unix kernel detects sequential access and
does asynchronous read ahead of one block only. I have even mentioned
the original V7 paper in which this is clearly explained, along with
the stern warning that incrementing the block size would give problems.

It is rather clear from the rest of my posting that my criticism is that
having an 8KB block size gives de facto static clustering and read ahead
of an extra 15 times on sequential access versus a 1KB block size (well,
I have had comments based on this correct reading of my posting).

Follow me: on sequential access, when block size is 1KB, you get 1KB
read now and 1KB asynchronously read ahead; when it is 8KB, you get 1KB
read now, which however cannot be accessed until an extra "synchronous"
read ahead of 7KB is also completed, and then you get an 8KB block of
asynchronous read ahead.

Extending by 15 times the read ahead, half of which is "synchronous",
pays only if the CPU bandwidth of the application and of the disk are
such that their ratio gives a 1/(N+1) (N is the number of IO processors)
duty cycle. Knuth discusses the joys (not the billjoys :->) of buffering
for sequential access in the ACP series volume 1, "Fundamental
algorithms", a book that seems to have been classified "Top Secret" by
the NSA.

Now for the quantitatively minded: do some back of the envelope
calculations, and find out which bandwidth the application must have,
given current IO subsystem bandwidths, to give the optimal duty cycles
with 1KB and 8KB blocks.  Compare with actual application CPU bandwidths
(less than 1KB/MIPS for troff/TeX to several hundred KB/MIPS for BSD
ld).
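
For the truly lazy, here is one way to do the envelope arithmetic in a
few lines of C; the disk parameters are assumptions (and adjustable),
not measurements of any particular drive. With one block of read ahead,
the CPU and the disk are fully overlapped when the application consumes
data at exactly the rate the disk delivers one block, i.e. at
B / (per-request overhead + B / transfer rate).

/*
 * Back-of-envelope duty-cycle calculation with invented disk numbers:
 * at what application bandwidth do CPU and disk balance, for 1KB and
 * 8KB transfer units?
 */
#include <stdio.h>

int main(void)
{
    double overhead_ms   = 20.0;    /* assumed seek + rotational latency */
    double xfer_kb_per_s = 1000.0;  /* assumed sustained transfer rate */
    double sizes_kb[]    = { 1.0, 8.0 };
    int i;

    for (i = 0; i < 2; i++) {
        double b = sizes_kb[i];
        double per_block_ms = overhead_ms + b / xfer_kb_per_s * 1000.0;
        double breakeven_kb_per_s = b / per_block_ms * 1000.0;
        printf("%2.0fKB blocks: %5.1f ms per request, "
               "CPU/disk balanced at ~%3.0f KB/s of application bandwidth\n",
               b, per_block_ms, breakeven_kb_per_s);
    }
    return 0;
}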

Conclude then, unless you work for Sun, DEC, AT&T or another major Unix
vendor, that only a dynamically adaptable read ahead policy, based on
smaller rather than larger block sizes, can give you optimal
performance. "Research" idea: extend Knuth's analysis to multi
processors. Explain why having more CPUs than IO processors may be a bad
idea, and when (almost always, for general purpose applications :->).

    Note for people who want to understand computers and OS design:
    really try to work out some examples on the back of an envelope or on
    a paper napkin. Consider humans at a terminal as very high latency
    peripherals.

As to random access, having an 8KB block size means that instead of
reading just the 1KB you needed, you read that plus a "synchronous" read
ahead of 7KB, which is pointless if your record size is under 1KB, which
seems fairly typical. Actually you not only waste 7KB of space, you also
waste the (usually, fortunately, relatively small) time needed to wait
for the useless 7KB to be read in. Even assuming that your record size
is, for your particular application, larger than 1KB, it would be better
to just read N 1KB blocks, which have been clustered together physically
in a dynamic way, than to force every application to read in 8KB blocks.
Please, please note that in the BSD FFS only the last block of a file
can be a fragment.

lm> you would find that assuming sequential access is a win almost all
lm> of the time

This is what most papers on file system characterization show, and I am
happy that you seem to have read them, at least on this point. This
prevalence of sequential access is also an effect of the poor
performance of non-sequential access. But, as a true billjoy, you add:

lm> I'd rather lose that extra block 20% then not have my read ahead
lm> block 80% of the time.

Fortunately you can win both ways -- you get excellent sequential access
performance 80% of the time, and do not waste a lot of space (and time)
the remaining 20% of the time -- if you do dynamic clustering instead of
the static one implied in large block sizes, and you advise the
pager/buffer manager of your expected access patterns. This is really
nothing new;
there are enlightening pages on filesystem implementation in the very
valuable "Database design" by Wiederhold. Also, the book "Ingres papers"
by Stonebraker has many interesting lessons for the OS designer. An
interesting discussion is also present in the BLTJ issue of some years
ago on databases under Unix.
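
As a sketch of what "advising the pager" can look like, here is a
fragment using an madvise()-style call; SunOS 4 documents one, though
the exact flag names and prototype may differ from the modern ones used
below, and the file name is obviously made up.

/*
 * Sketch only: map a file and tell the VM system what access pattern to
 * expect, so read ahead is turned up for a scan and down for lookups.
 * Error handling is minimal.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/example.dat";      /* hypothetical file */
    struct stat st;
    int fd = open(path, O_RDONLY);
    char *p;

    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* a scan pass: read ahead aggressively, recycle pages behind us */
    madvise(p, st.st_size, MADV_SEQUENTIAL);
    /* ... sequential pass over p[0 .. st.st_size) ... */

    /* a lookup pass: read ahead would only pollute memory */
    madvise(p, st.st_size, MADV_RANDOM);
    /* ... scattered accesses into p ... */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}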

lm> How about you?  Do you have a better algorithm to suggest?

Well, it is clearly detailed in the rest of my article. Too bad you seem
not to read articles before shooting from the hip (on the other hand my
articles are so long, to anticipate the trivial objections of wise guys
and billjoys, that you may be excused).

I read, in my delusionary world rotating around my armchair, that bitmap
based filesystems based on 1KB blocks give essentially the same
performance as BSD 8KB filesystems, simply because using a bitmap gives
dynamically and almost by default the clustering that the BSD FFS
imposes statically with its large block sizes and complicated
positioning algorithms.
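
A minimal sketch of such position-ordered allocation (names and sizes
are invented, and a byte per block stands in for a real packed bitmap):
hand out the first free block at or after the file's previous block, so
consecutive logical blocks tend to come out physically adjacent without
any special clustering machinery.

/*
 * Sketch of allocation from a position-ordered free "bitmap": allocate
 * the free block nearest to (and not before) the goal block.
 */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 64

static unsigned char freemap[NBLOCKS];  /* 1 = free, 0 = allocated */

static int alloc_near(int goal)
{
    int b;

    for (b = goal; b < NBLOCKS; b++)
        if (freemap[b]) { freemap[b] = 0; return b; }
    for (b = 0; b < goal; b++)                  /* wrap around if needed */
        if (freemap[b]) { freemap[b] = 0; return b; }
    return -1;                                  /* disk full */
}

int main(void)
{
    int prev = 10, i, b;

    memset(freemap, 1, sizeof freemap);
    freemap[12] = 0;                            /* pretend block 12 is taken */

    /* grow a file by four blocks: they come out 10, 11, 13, 14 */
    for (i = 0; i < 4; i++) {
        b = alloc_near(prev);
        printf("allocated block %d\n", b);
        prev = b + 1;
    }
    return 0;
}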

Maybe you are too busy to read comp.unix.sv386, but that newsgroup has
frequent postings with IO speed comparisons, and the Sv386 FFS versions
floating around are for the most part simple bitmap based ones on top of
the V7 FS. In particular the ISC one is a very clever retrofit idea on
the V7 one, and it gives excellent performance.  Clearly not designed by
a billjoy. The MUSS filesystem had, many years ago, a design similar to
that of the ISC one, and it also gave very, very good performance. Not
designed by a billjoy either. You can also have a look at the summer
Usenix proceedings of 3-4 years ago, where some bloke from DEC has
implemented, under their BSD derived Ultrix, VMS and MSDOS structured
filesystems as well, and gives some small detail of their relative
performance.

pcg> This is what I use to call "billjoyism", designing something that
pcg> works well 80% of the time and breaks down badly in the remaining
pcg> 20%, when a little more hard thinking (some call it "engineering")
pcg> would find a more flexible solution, like, in the example above,
pcg> dynamic adaptive clustering (which happens almost by default if one
pcg> switches to a free list ordered by position instead of time, e.g. a
pcg> bitmap based one) instead of static predictive clustering like
pcg> raising the block size.

lm> More drivel.  The file system does this.  Has for years.

Oh, you have read the BSD FFS paper too! Too bad that you failed to
notice that the BSD/SunOS FFS starts with a base block size that gives a
large static clustering built in, especially on Sun and SunOS based
machines, whose regrettable 4KB or 8KB page size prevents the use of
in-memory fragments.

pcg> The billjoys will say "so what, sequential access is 80%, so we go
pcg> for it, and damn the rest!". Unfortunately there are two problems:

pcg> 1) random access to small files is fairly common in UNIX, and
pcg> cannot be so easily dismissed: directories, inode pages. Some
pcg> little dbm, too.

pcg> 2) a lot of the sequential access preponderance in UNIX is an
pcg> historical consequence of the fact that it was especially optimized
pcg> in the original design, and that it has become even more so with
pcg> time.

pcg> Thus sequential access is used also where under a more balanced
pcg> design random access would be used. For example UNIX editors
pcg> traditionally copied the file to edit twice sequentially on every
pcg> edit. Now they load it into memory and write it back from there,
pcg> which gives most VM subsystems the fits, for other similar reasons.

lm> It is almost always necessary to copy the data when editing.

There are interesting counter examples to that.

lm> If you end up w/ a read only copy of the original file and a bunch
lm> of data structures describing changes (the usual case) then the
lm> stupid editor should just map in the file copy on write.

It would be a really stupid editor. Such an implementation, for most
edits, would result in extremely poor locality. You would fault and
duplicate a page for a change to just one line. Studies on editing, and
day to day practice, show that in most edit sessions changes are fairly
localized, and most of the file is simply not touched. Not to mention
that copy-on-write is a bogosity in itself, a hazardous hack around more
fundamental difficulties that ought to be solved directly. If you want,
however, I can admit that copy-on-write is an inflexible application of
change logs/journaling where the modified chunk size is mandatorily page
sized.

pcg> Also, the *contents* of directories are accessed sequentially, when
pcg> a single B-tree based directory file (Mac, Cedar) or similar could
pcg> be faster and more compact.

lm> Again, have you tried this or is this yet another one of those "I
lm> believe it to be true therefore it is true" deals?  I looked at this
lm> a while ago (OS/2 does this, or did when I played with it) and it
lm> sucked.

Well, *my* impression is that the Unix directories suck. Why don't you,
given that you hint that you have gathered trace data, tell us what
percentage of the in-core pages in a running SunOS 4 system are
directory pages (there are plenty of published statistics on the
*static* average size and percentage of directories in a Unix FS), and
how full they are on average? Please, I want to hear it from you. I
would also like to hear from you the average and median size of
non-executable and executable files open (under SunOS 3) or memory
mapped (under SunOS 4) on a Sun 3 and a Sun 4. I am sure you and Sun
have extensively analyzed the system and do not mind releasing this
little amusing tidbit of information.

Let me guess: a large, large percentage of your 4KB-8KB buffers or pages
that are used for IO contain files or directories whose median size is
well under twice that of a page, and a fairly large percentage contain
files that are smaller than half a page. This, let me speculate :-),
adds up to a wastage of about half the active (totally unused buffer
slots or pages don't count) real memory used for buffering or pages.

I have carefully looked at some Sun 3s with SunOS 3 and 4, using nothing
more complicated than pstat(8), and even if they were lightly loaded I
seemed to notice patterns not entirely dissimilar from my speculations.
I invite you and every reader to have a go with pstat -i, just for
starters.

lm> There are cases where this sort of structure has nasty behavior (an
lm> exercise left for the reader; if you can't figure it out, don't argue
lm> the point).

Every algorithm has pathological cases. The comparison is on the average
performance, and on how bad the worst case is. A lot of literature shows
that a B-tree is provably a much better indexing structure than a low
fanout tree like Unix style directories, especially for disk based
structures. This is not subject to discussion.  There are plenty of
examples of B-tree based directories; their performance profiles have
been well studied. I have even seen an interesting Stanford Ph.D. thesis
on an attribute based filesystem based on better indexing structures
that was competitive with the directory based one, despite numerous
disadvantages because of the experimental nature of the implementation.


Frankly, Mr. McVoy, I have not seen a single technical objection in your
posting; I have written detailed analyses supported by references to
well known and established results, while you have replied by saying
"how can you say that?" without bothering to offer any data, analysis or
reference to well known and established sources.

You also seem to have grossly misunderstood the gist of my argument
about the static clustering implied in a block size that is 8 times
larger than otherwise. I am not impressed... I will try to get a copy of
the Usenix paper by the other McVoy, which seems to support my
arguments, and see if I am better impressed by your evil twin :-).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (02/07/91)

In article <417@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:

>Subject: Re: FLAME ON: Re: Sun bogosities, including MMU thrashing

>Ready flame throwers, take aim, .... fire!

Would you stop posting such things unless you want to be the living
evidence of Sun bogosity?

>Since I kept the read ahead alg, I'm now
>doing a 111KB of useless read by your definition.  (You seem to be assuming
>that 1KB is some nifty magic size.)

Everyone will admit it works very well 80% of the time.

>>Let me guess: a large, large percentage of your 4KB-8KB buffers or pages
>>that are used for IO contain files or directories whose median size is
>>well under twice that of a page, and a fairly large percentage contain
>>files that are smaller than half a page.

>A little worse, but notice that lack of a knee at 1KB.

No one claimed there is a knee at 1KB.

						Masataka Ohta

PS

I much prefer a good (or even bad) armchair architect to a bad
architect developing some real product.