[comp.arch] Incremental sync

mo@messy.bellcore.com (Michael O'Dell) (03/08/91)

Incremental sync() is a good idea in some ways, but not all.

When George Goble did the famous dual-headed VAX with 32 megabytes of
memory, one of the first things noticed was that once every 30 seconds,
things got very slow as update flushed out memory.  One of the things
he did was to put a clock-hand style thing in the kernel so the equivalent
of update could push out the pages in a slow, steady stream, instead of
in gigantic clumps of dirty disk blocks.
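
[A minimal sketch in C of such a clock-hand flusher, for illustration
only; the names, pool size, and per-tick batch are assumptions, not
Goble's actual code:]

    /*
     * Clock-hand flusher: instead of update's sync() every 30 seconds,
     * a hand sweeps the buffer pool from the clock interrupt and pushes
     * out a few dirty buffers per tick, smoothing the write load.
     */
    #define NBUF  2048                 /* illustrative pool size */
    #define BATCH 8                    /* buffers pushed per tick */

    struct buf { int dirty; /* ... */ };
    extern struct buf buffer[NBUF];
    extern void start_write(struct buf *bp);   /* asynchronous write-out */

    static int hand;                   /* clock-hand position */

    void flush_tick(void)              /* called on each clock tick */
    {
        int pushed = 0, scanned = 0;
        while (pushed < BATCH && scanned < NBUF) {
            struct buf *bp = &buffer[hand];
            hand = (hand + 1) % NBUF;
            scanned++;
            if (bp->dirty) {
                bp->dirty = 0;
                start_write(bp);       /* queue it; don't wait */
                pushed++;
            }
        }
    }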

However, the assumption that there is any disk idle time is basically wrong.
On large, heavily-loaded systems, the disks run pretty constantly.
Drizzling things out at a controlled rate rather than in big lumps can
still be useful, but sometimes it doesn't help, either.
One real problem is that the mapped file may have semantics which require
the user program to not terminate until the write to disk has finished
because the program wants to be quite certain the data is out there.
And if the references to the file were quite random (like a large database
hash table index), then there's a good chance that incremental page pushing
did NOT clean some substantive fraction of the dirty pages, still producing
the impulse load on the disk queue.
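
[For concreteness, the mapped-file case in C, using the later-standardized
msync(2)/MS_SYNC interface; the file name and size are illustrative:]

    /*
     * The program must not report success until its randomly-dirtied
     * pages are actually on disk; the MS_SYNC call below is where the
     * impulse load lands, however much the pager dribbled out earlier.
     */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 1 << 20;          /* file assumed >= 1 MB */
        int fd = open("hashidx", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        char *p = mmap(0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[12345] = 1;                  /* random dirtying, as in a hash index */

        if (msync(p, len, MS_SYNC) < 0) { perror("msync"); return 1; }
        return 0;                      /* only now is the data known safe */
    }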

One important observation is that no VM system can run faster than it can
get clean pages - if there is an adequate supply in memory, then fine.
Otherwise, how quickly you can turn the pages to disk is the limiting factor
for overall throughput (jobs completed per hour, or conversely, time for
a single large VM-bound job).  This factor most directly affects the
level of multiprogramming a given system can sustain, assuming the workload
isn't all just small jobs which trivially fit in memory and do no substantial
I/O. (IBM MVS I/O systems are often tuned to maximize the sustainable
paging rate without thrashing.  Think about it.)

One serious elevator anomaly in fast machines is as follows.  Assume several
processes trying to read different files, each one broken into several
reasonably large chunks (could easily be extents, so that doesn't fix it).
Further, assume that one process's file is broken into several more chunks
than the others with these smaller chunks spread over the same distance
as the other files.  Finally, assume the machine is enough faster than the
disks that each process can complete its processing BEFORE the heads
can seek past the next request.  So, using the standard elevator which
resorts the queue on every request, the process with the large number
of small extents (fortuitously laid out in request order) will completely
monopolize the disk!  Because the standard elevator will let you "get back
into the boat" even after it's left, the process gets its data, and
spins out another request before the head finishes another request in the
same neighborhood, so it zooms to the front of the elevator queue and
gets to do another read.  The poor processes with their blocks toward the
end of the run starve to death by comparison. (Note we are assuming a
one-way elevator; it's worse with a 2-way.)  What do you do?  One scheme
used successfully is "generation scheduling".  This collects an elevator-load
of requests, sorts them, and then services them WITHOUT admitting
anyone to the car along the way.  This is a way of ensuring fairness.
It also turns out that this scheme can be modified to alleviate SOME of
the big memory-flush problem.  Getting the details
right is complex, but the general approach is as follows.  There is a
"generation counter" which increments at some rate like 2-5x a request
service time.  Each request is marked with the generation when it
arrives. Further, you use a 2-dimensional queue, sorting the subqueues
by generation and within the subqueue keeping the original FIFO arrival order.
(There is some discussion about whether to sort the sub-queues; the jury is still out.)
You now load the elevator car across the subqueues, always getting at least
one request from every generation pending, more with some weighting
function like queue length. [You can implement "must do in this order"
requests by including a special generation queue which always loads the
car first and is not sorted in the car.]   A further enhancement is
to add priorities to requests.  For instance, LOW, MEDIUM, and HIGH,
plus FIFO looks good.  LOW is page turning when the pager isn't frantic
for clean pages, MEDIUM for normal reads and writes which you want to
complete soon, and HIGH for things like directory searches in namei()
and some metadata updates, and FIFO for critical metadata updates.
Couple this with the full generational scheduling described above
and you will go a long way toward making the low-level stuff stable
in the face of large (and truly unavoidable) impulse loads....
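
[A sketch in C of the two-level generation queue described above; the
weighting, sizes, and representation are illustrative assumptions, not
Prisma's code:]

    /*
     * Generation scheduling: requests are stamped with the current
     * generation on arrival and kept FIFO within their generation.
     * The car is loaded across all pending generations, then sorted
     * and serviced with no further admissions.
     */
    #include <stddef.h>

    #define MAXGEN 64                  /* generations tracked (wraps) */

    struct req {
        unsigned int gen;              /* generation stamp at arrival */
        long         blkno;
        struct req  *next;             /* FIFO within its generation */
    };

    struct genq { struct req *head, *tail; int len; };

    static struct genq queues[MAXGEN];
    static unsigned int curgen;        /* bumped 2-5 times per service time */

    void enqueue(struct req *r)
    {
        struct genq *q = &queues[curgen % MAXGEN];
        r->gen = curgen;
        r->next = NULL;
        if (q->tail) q->tail->next = r; else q->head = r;
        q->tail = r;
        q->len++;
    }

    /* Load the car: at least one request from every pending generation,
     * extras weighted by subqueue length.  Caller sorts car[0..n-1] by
     * blkno and services it without admitting new arrivals. */
    int load_car(struct req **car, int carsize)
    {
        int n = 0;
        unsigned int g;
        for (g = 0; g < MAXGEN && n < carsize; g++) {
            struct genq *q = &queues[g];
            int take = q->len ? 1 + q->len / 4 : 0;   /* weighting */
            while (take-- > 0 && q->head && n < carsize) {
                car[n++] = q->head;
                q->head = q->head->next;
                if (!q->head) q->tail = NULL;
                q->len--;
            }
        }
        return n;
    }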

Oh yes - much of the thinking about this low-level scheduling stuff
came from Bill Fuller, Brian Berliner, and Tony Schene, then of Prisma.

	-Mike O'Dell

torek@elf.ee.lbl.gov (Chris Torek) (03/09/91)

In article <1991Mar8.142031.9098@bellcore.bellcore.com>
mo@messy.bellcore.com (Michael O'Dell) writes:
>[much about disk queue sorting]

Of course, certain disk controllers (whose names I shall not name)
have a tendency to swallow up a whole bunch of I/O requests and do
them in their own order, rather than listen to / allow you to adjust
yours....

Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

torek@elf.ee.lbl.gov (Chris Torek) (03/09/91)

In article <10773@dog.ee.lbl.gov> I wrote:
>Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,

It seems people know exactly what I mean here... I got several replies
to this in the span of a few hours.  A quote from one of them:

>I think it was Rob Pike who once pointed out there is a real
>difference between "smart" and "smart-ass" controllers, and that
>in his estimation (one which I agree with completely), most
>controllers which claim to be smart, are in fact, in the other group.

This nails it down pretty well.  So: we should call it the `SO SAD
Coalition', where `SO SAD' stands for `Stamp Out Smart-Ass Devices'.

Note that there is nothing wrong with (truly) intelligent controllers,
provided that they do not sacrifice something important to attain this
intelligence.  Important things that tend to get sacrificed include
both speed and flexibility; and in fact, speed and flexibility can get
in each other's way, particularly with programmable devices.

(For instance, many SCSI controllers take several milliseconds---not
microseconds, *milli*seconds---to react to a command.  At 3600 RPM, 3
milliseconds is almost 1/5 of a disk revolution.  This is a serious
delay.  Another example is certain Ethernet chips, where the FIFOs are
just a bit too short, and when a collision occurs, they goof up the
recovery because they cannot `back up' their DMA, so they simply
restart with garbage.)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

kinch@no31sun.csd.uwo.ca (Dave Kinchlea) (03/10/91)

In article <10773@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:


   Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
   -- 
   In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
   Berkeley, CA		Domain:	torek@ee.lbl.gov

Actually I have been having quite the opposite thoughts lately.  It seems to me
that it would be highly advantageous (in the general case) to take all of
the filesystem information out of the kernel and give it to the I/O controller.

I don't just mean what requests are satisfied first (although this would
be one of its tasks) but one which also supports an abstract filesystem.
This would take a lot of logic out of the kernel; it needn't spend time
in namei() et al.; let an intelligent controller do that.

Am I missing something important here, other than the fact that no operating
system I am aware of has a concept of an abstract filesystem (except at
the user level)?  There is still some logic needed re virtual memory and
possible page-outs etc., but I think it could work.  Any comments?

cheers kinch
Dave Kinchlea, cs grad student at UWO (that's in London, Ontario)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/11/91)

In article <10773@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:

| Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,

  I think I heard this argument before when operating systems started
buffering i/o... I guess what you're looking for is MS-DOS, where you're
sure how things work (slowly).

  As long as there's a working sync or other means for my process to do
ordered writes in that less than one percent of the time when I care, I
am delighted to have things done in the fastest possible way the rest
of the time. The only time I ever care is when doing something like
database or T.P. where order counts in case of error. If I'm doing a
compile, or save out of an editor, or writing a report, as long as what
I read comes back as the same data in the same order, I really don't
care about write order (or byte order, bit order, etc) on the disk.

  There are cases when order is important, but as long as those rare
cases are satisfied, any smarts which improve performance are welcome on
my system.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (03/12/91)

In article <KINCH.91Mar9170121@no31sun.csd.uwo.ca> kinch@no31sun.csd.uwo.ca (Dave Kinchlea) writes:
>In article <10773@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>
>
>   Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
>   -- 
>   In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
>   Berkeley, CA		Domain:	torek@ee.lbl.gov
>
>Actually I have been having quite the opposite thoughts lately.  It seems to me
>that it would be highly advantageous (in the general case) to take all of
>the filesystem information out of the kernel and give it to the I/O controller.
>
>I don't just mean what requests are satisfied first (although this would 
>be one of its tasks) but one which also supports an abstract filesystem. 
>This would take a lot of logic out of the kernel; it needn't spend time
>in namei() et al.; let an intelligent controller do that. 
>
>Am I missing something important here, other than the fact that no operating
>system I am aware of has a concept of an abstract filesystem (except at 
>the user level)?  There is still some logic needed re virtual memory and 
>possible page-outs etc., but I think it could work.  Any comments?
>
>cheers kinch
>Dave Kinchlea, cs grad student at UWO (that's in London, Ontario)

I see some problems with transaction processing systems which rely on
being able to absolutely control the timing of disk writes.  Some (the
more efficient ones) need only do this for their logs/journals, while others
want to flush out all changes made by a transaction and ensure that
it all got there before sending a terminal reply or dispensing the
ATM cash.  There may be more problems with the more efficient systems
because although they don't insist on flushing out all database changes
to disk on termination of each transaction, they RELY ON NOT HAVING ANY
UNCOMMITTED (UNFINISHED) CHANGES WRITTEN TO DISK. That is, if the system
crashed, then an advanced transaction system would expect to see NONE
of the changes made by any incomplete transactions from before the crash.
 
If a file system cannot accommodate this kind of use then the transaction
system implementors will again be forced into using raw I/O - to
avoid the file system. 

Alas, RAW I/O is still the answer for most database/transaction systems.
They keep their own set of buffers and file structures. It need not be
so if the file system incorporates the semantic needs of transaction/database
systems.
-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
                 (408) 746-6108

mmm@iconsys.icon.com (Mark Muhlestein) (03/12/91)

Dave Kinchlea, cs grad student at UWO writes:
>
>Actually I have been having quite the opposite thoughts lately.  It seems to me
>that it would be highly advantageous (in the general case) to take all of
>the filesystem information out of the kernel and give it to the I/O controller.
>
>I don't just mean what requests are satisfied first (although this would 
>be one of its tasks) but one which also supports an abstract filesystem. 
>This would take a lot of logic out of the kernel; it needn't spend time
>in namei() et al.; let an intelligent controller do that. 
>
>Am I missing something important here, other than the fact that no operating
>system I am aware of has a concept of an abstract filesystem (except at 
>the user level)?  There is still some logic needed re virtual memory and 
>possible page-outs etc., but I think it could work.  Any comments?
>

At Sanyo/Icon we actually implemented this idea approximately five years
ago in our first Unix port.  We used a dedicated 68020 with an expandable
cache (up to 128MB!) to run the filesystem code.

It has worked reasonably well for us, but, in retrospect, I'm not at all
sure I would do it again.  Although some things are a win, such as
being able to overlap relatively CPU-intensive functions such as namei()
with other processing, we discovered very soon that a naive approach
had several severe performance and maintenance problems.

For one thing, some normally trivial things like lseek and updating times
on files become heavy-weight operations involving message passing to the
filesystem processor.  Also, very small reads and writes were extremely
inefficient compared to a normal unix implementation.  Another problem was
keeping the filesystem code in sync with new versions of the rest of the
kernel.  Enhancements and new features were much more
difficult to implement.  For example, since the filesystem processor
handles all filesystem requests, it is necessary to teach it about NFS,
file system types, etc.  It goes far beyond just an "intelligent" controller.

We solved these problems using various techniques that relied on the fact
that the filesystem processor shared memory with the processor running the
rest of the kernel.  These techniques don't really make sense for, say,
a SCSI-based controller.

From my experience, a good SCSI controller with scatter-gather DMA capability
has about the right amount of intelligence.  It doesn't require a lot
of hand-holding to keep it going, and it gets the job done with a
minimum of hassling with low-level details like device geometry,
bad block management, or retries.
-- 

	Mark Muhlestein @ Sanyo/Icon

uunet!iconsys!mmm

dhepner@hpcuhc.cup.hp.com (Dan Hepner) (03/12/91)

From: rbw00@ccc.amdahl.com (  213  Richard Wilmot)

>I see some problems with transaction processing systems which rely on
>being able to absolutely control the timing of disk writes.  Some (the
>more efficient ones) need only do this for their logs/journals, while others
>want to flush out all changes made by a transaction and ensure that
>it all got there before sending a terminal reply or dispensing the
>ATM cash.

All common DBMS software relies on being notified when a write
has in fact completed, although it may be willing to write the log
immediately and the rest of it more asynchronously.  That such
notification is not available seems to be a real deficiency in
the current crop of caching disk controllers.

>There may be more problems with the more efficient systems
>because although they don't insist on flushing out all database changes
>to disk on termination of each transaction, they RELY ON NOT HAVING ANY
>UNCOMMITTED (UNFINISHED) CHANGES WRITTEN TO DISK. That is, if the system
>crashed, then an advanced transaction system would expect to see NONE
>of the changes made by any incomplete transactions from before the crash.

Agreed.
The drives we're familiar with do in fact support synchronous access,
either by request or "setting the controller in that mode".  For all
database usage, we intend that any controller caching be bypassed.  This
does however leave an assumption that "somebody else" must be able to
make use of that caching, because OLTP sure can't.  It's also worth
noting that a battery backed up controller cache might turn out to
be vastly more interesting.

>If a file system cannot accommodate this kind of use then the transaction
>system implementors will again be forced into using raw I/O - to
>avoid the file system. 
>Alas, RAW I/O is still the answer for most database/transaction systems.

This is perhaps the biggest trap of all.  Using raw IO has nothing
to do with the behavior of a disk controller unless one has specifically
modified one's kernel to do something special, such as post all
raw writes as synchronous writes.  The default behavior will be for
raw writes to be treated like any other write; the disk controller 
doesn't know or care where this write came from.

>They keep their own set of buffers and file structures. It need not be
>so if the file system incorporates the semantic needs of transaction/database
>systems.

Do you actually recommend this, for which configurations, and for
which reasons? 

>  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
>                 (408) 746-6108

Dan Hepner
Not a statement of the Hewlett Packard Co.

xxremak@csduts1.lerc.nasa.gov (David A. Remaklus) (03/13/91)

In article <107340003@hpcuhc.cup.hp.com> dhepner@hpcuhc.cup.hp.com (Dan Hepner) writes:
>From: rbw00@ccc.amdahl.com (  213  Richard Wilmot)
>
>>because although they don't insist on flushing out all database changes
>>to disk on termination of each transaction, they RELY ON NOT HAVING ANY
>>UNCOMMITTED (UNFINISHED) CHANGES WRITTEN TO DISK. That is, if the system
>>crashed, then an advanced transaction system would expect to see NONE
>>of the changes made by any incomplete transactions from before the crash.
>
>Agreed.
>The drives we're familiar with do in fact support synchronous access,
>either by request or "setting the controller in that mode".  For all
>database usage, we intend that any controller caching be bypassed.  This
>does however leave an assumption that "somebody else" must be able to
>make use of that caching, because OLTP sure can't.  It's also worth
>noting that a battery backed up controller cache might turn out to
>be vastly more interesting.
>

Amdahl's latest update to the 6100 caching disk controller includes just
that: battery backed-up cache memory for disk writes (as does Big Blue's
3990 controller).  Before this battery-supported cache (called Non-Volatile
Store), all writes were automatically write-through (I/O completion was
not signalled until the data was written to the disk).

>>If a file system cannot accommodate this kind of use then the transaction
>>system implementors will again be forced into using raw I/O - to
>>avoid the file system. 
>>Alas, RAW I/O is still the answer for most database/transaction systems.
>
>This is perhaps the biggest trap of all.  Using raw IO has nothing
>to do with the behavior of a disk controller unless one has specifically
>modified one's kernel to do something special, such as post all
>raw writes as synchronous writes.  The default behavior will be for
>raw writes to be treated like any other write; the disk controller 
>doesn't know or care where this write came from.
>

The latest release of the UTS (Amdahl mainframe U*ix) includes a new
file system type that enables an application to specify, via fcntl
calls, that I/O to a particular file is to be synchronous, i.e., control
is not returned until the data has successfully been written to the
I/O device.  Normal U*ix system calls to normal U*ix files can be
used for the DBMS files.  This feature was specifically added for DBMS
support.
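
[Roughly what such use looks like; the exact UTS flag names are not given
above, so this sketch assumes the equivalent O_SYNC open flag from later
standard practice:]

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        /* Synchronous writes: write() does not return until the data
         * is on the device.  The file name is illustrative. */
        int fd = open("dbfile", O_WRONLY | O_CREAT | O_SYNC, 0666);
        if (fd < 0) { perror("open"); return 1; }

        const char rec[] = "committed record\n";
        int n = sizeof rec - 1;
        if (write(fd, rec, n) != n) { perror("write"); return 1; }

        /* Here the record is known to be on disk; safe to reply. */
        close(fd);
        return 0;
    }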

( stuff deleted )

--
David A. Remaklus		   Currently at: NASA Lewis Research Center
Amdahl Corporation				 MS 142-4
(216) 642-1044					 Cleveland, Ohio  44135
(216) 433-5119					 xxremak@csduts1.lerc.nasa.gov

dhepner@hpcuhc.cup.hp.com (Dan Hepner) (03/13/91)

From: davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr)

>  I think I heard this argument before when operating systems started
>buffering i/o... I guess what you're looking for is MS-DOS, where you're
>sure how things work (slowly).

The most striking irritation for those who pioneered the use
of Un*x for commercial applications has come from
implementations which provided no way around such buffering.
Many of these offerings were still deficient well after the need
for such control had been established.

>  As long as there's a working sync or other means for my process to do
>ordered writes in that less than one percent of the time when I care, I
>am delighted to have things done in the fastest possible way the rest
>of the time.

Unless your kernel vendor has provided some non-ordinary behavior
for you to deal correctly with buffered disk drives, you're back
in the same boat of users before O_SYNC.  Does your O_SYNC do 
the right thing with such a drive?  What about fsync(2)?  What, 
for that matter, happens when you write to a "raw disk"? 

>The only time I ever care is when doing something like
>database or T.P. where order counts in case of error.
>bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)

It might be worth pointing out that your mix, namely 99% something
else and 1% this kind, is the direct opposite of a lot of people's,
whose usage is 99% this kind and 1% things which it is OK to
buffer indefinitely.  Agreed, not many of them post to the net.

Dan Hepner
Not a statement of the Hewlett Packard Co.

henry@zoo.toronto.edu (Henry Spencer) (03/13/91)

In article <3236@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>| Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
>  There are cases when order is important, but as long as those rare
>cases are satisfied, any smarts which improve performance are welcome on
>my system.

The trouble is, what if the "smarts" *don't* improve performance, on your
workload?

The only sensible place to put smarts is in host software, where it can
be changed to match the workload and to keep up with new developments.
"Smart" hardware almost always makes policy decisions, which is a mistake.
The money spent on "smartening" the hardware is usually better spent
on making the main CPU faster so that you can get the same performance
with smart *software*... especially since a faster CPU is good for a
lot of other things too.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

dparter@shorty.cs.wisc.edu (David Parter) (03/13/91)

In article <1991Mar12.010155.3268@iconsys.icon.com> mmm@iconsys.icon.com (Mark Muhlestein) writes:
>Dave Kinchlea, cs grad student at UWO writes:
>>
>>Actually I have been having quite the opposite thoughts lately.  It seems to me
>>that it would be highly advantageous (in the general case) to take all of
>>the filesystem information out of the kernel and give it to the I/O controller.
>At Sanyo/Icon we actually implemented this idea approximately five years
>ago in our first Unix port.  We used a dedicated 68020 with an expandable
>cache (up to 128MB!) to run the filesystem code.
>
>It has worked reasonably well for us, but, in retrospect, I'm not at all
>sure I would do it again.  Although some things are a win, such as
>being able to overlap relatively CPU-intensive functions such as namei()
>with other processing, we discovered very soon that a naive approach
>had several severe performance and maintenance problems.
>
> [interesting examples of flaws deleted]

[Note: I keep saying "they" in the following section.  We didn't have an
"us" vs. "them" situation; once we started talking to each other about
the project, the two groups worked well together.]

At a place I used to work (before returning to school), I participated
in discussions with a group that wanted to do just this (put the f/s
code into a controller). However, the scheme had the following flaws
(among others):
	* Since not everyone would have the integrated f/s and
	  controller, the entire file system code would have to remain
	  in the kernel as well, for those using traditional controllers.

	* For RFS and NFS, at least part of the file system code had
	  to remain in the kernel.

	* The proposed solution assumed 1) the performance bottlenecks
	  were in the code that would be off-loaded and 2) either the
	  dedicated processor (on the controller) could run the code
	  faster than the existing (main) cpu, or there was something
	  more useful that the main processor could be doing while the
	  controller was running the file system code. None of these 
	  assumptions had been studied.

The best implementation (from a software point of view) would be to use
the vnode switch (Sun's virtual file system layer) and implement the
integrated system as a new file system type -- sort of running NFS over
the bus, but not really (we wouldn't use UDP, we'd use the
message-passing already in place for communicating with "smart" i/o
devices). This wasn't quite what they had in mind, but that is because
they didn't know about NFS (they were a hardware/firmware group, not
part of the Unix group originally).

In addition, it was proposed to run a stripped-down unix on the
controller, so that the unix filesystem code would "just run" (thus
gaining uniformity of operation and semantics, and keeping the code in
lock-step with the kernel code with little software effort). They then
pointed out that other i/o could be handled by this super-controller as
well...

The paper monster that evolved was an odd multiprocessor configuration:
two identical (or nearly identical) CPUs, one running user processes
and most of the kernel, the other running various parts of the i/o
system (with some duplication, such as a buffer cache in both places,
etc.).  A more general multiprocessor solution (where the additional CPU(s)
could be used either for I/O or computation) made more sense.

	--david
-- 
david parter					dparter@cs.wisc.edu

henry@zoo.toronto.edu (Henry Spencer) (03/13/91)

In article <KINCH.91Mar9170121@no31sun.csd.uwo.ca> kinch@no31sun.csd.uwo.ca (Dave Kinchlea) writes:
>... it would be highly advantageous (in the general case) to take all of
>the filesystem information out of the kernel and give it to the I/O controller.

Which filesystem?  System V's?  That's what you'd get, you know...

That aside, the most probable result is that your big expensive main host
CPU, which could undoubtedly run that code a lot faster, will spend all
its time waiting for the dumb little I/O controller to run the filesystem.
This is not a cost-effective use of hardware resources.

[Incidentally, is there some reason why twits (or readers written by twits)
keep saying "Followup-To: comp.protocols.nfs", when this topic is only
marginally related to NFS and highly related to architecture?  It's quite
annoying to have to keep manually fixing this.]
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)

In article <1991Mar12.194704.17859@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

| The only sensible place to put smarts is in host software, where it can
| be changed to match the workload and to keep up with new developments.
| "Smart" hardware almost always makes policy decisions, which is a mistake.
| The money spent on "smartening" the hardware is usually better spent
| on making the main CPU faster so that you can get the same performance
| with smart *software*... especially since a faster CPU is good for a
| lot of other things too.

  And in a perfect world the faster CPU to provide the boost will cost
the same as the smart controller. Unfortunately I don't live there, and
I suspect most readers don't either.

  The incremental price to get i/o improvement is *in most cases* much
smaller to upgrade a controller (add cache, whatever) than to upgrade
the CPU speed and all the support circuits that implies.  For a
multiprocessor system this is all the more true.

  There's also an issue of reliability. For any hardware or software
failure other than power failure the smart controller seems more likely
to complete moving the data from cache to disk than the kernel to move
it from disk buffers to disk.  That's a gut feeling; the smart controller
may have a higher failure rate than the dumb controller, but it seems
likely that the smaller hardware and software content of a controller
will make it more reliable than the CPU, memory, and any other
controllers which could do something to make the o/s crash.

  The interesting thing is that in systems with multiple CPUs, if one
CPU is handling all the interrupts it has a tendency to become an
extremely expensive smart controller. Yes, it can do more for the CPU
bound jobs, but is that cost effective for any load other than heavy
CPU? 

  I see no reason why an expensive CPU should be used to handle scatter
gather, remap sectors, log errors and issue retries, etc. It doesn't
take much in the way of smarts to do this stuff. It certainly doesn't
take floating point or vector hardware, processor cache, or an MMU
which supports paging.

  A CPU designed for embedded use can have a small interrupt controller
and some parallel i/o built into the chip. This lowers chip count,
connections, and latency, which means smaller, less expensive, and more
reliable devices. And i/o buffers can be made from cheap slow memory
and still stay ahead of the physical devices.

  Moving the filesystem to another CPU isn't really a "smart controller";
it's "distributed processing," more or less.  That's certainly not what I
mean by smart controller, at any rate, so maybe the term is being used
loosely. I'm all in favor of having the decisions made by the o/s, but
when it comes time to actually move data from memory to disk, I'll find
something better for my CPU to do than keep an eye on the process.

  If I can issue an i/o and tell when it's done, and if the controller
is configured to ensure that data don't sit in the cache for more than
time X (you define X), then I don't see any problem providing ordered
writes as needed for data security, and good performance as needed by
loaded and i/o bound machines. That's what I mean by a smart controller
and that's what I think is optimal for both performance and cost
effectiveness.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)

In article <1991Mar12.202238.19586@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

| That aside, the most probable result is that your big expensive main host
| CPU, which could undoubtedly run that code a lot faster, will spend all
| its time waiting for the dumb little I/O controller to run the filesystem.
| This is not a cost-effective use of hardware resources.

  This is the heart of the matter, and I agree completely. What I can't
see is how anyone can feel that the main CPU should be wasted in error
logging and retries, bad sector mapping, and handling multiple interrupts.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)

In article <107340004@hpcuhc.cup.hp.com> dhepner@hpcuhc.cup.hp.com (Dan Hepner) writes:
| From: davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr)

| >  As long as there's a working sync or other means for my process to do
| >ordered writes in that less than one percent of the time when I care, I
| >am delighted to have things done in the fastest possible way the rest
| >of the time.
| 
| Unless your kernel vendor has provided some non-ordinary behavior
| for you to deal correctly with buffered disk drives, you're back
| in the same boat of users before O_SYNC.  

  O_SYNC, fsync, whatever.  I don't consider having system calls work to
be "non-ordinary behavior."  I said in the posting you quoted that there
has to be a way to ensure an i/o is complete, and you seem to be saying
the same thing as if you were disagreeing with me.

| It might be worth pointing out that your mix, namely 99% something
| else, and 1% this kind, is a direct opposite of a lot of people,
| whose usage is 99% this kind, and 1% things which it is ok to
| indefinitely buffer.  Agreed, not many of them post to the net.

  Less than it appears. When doing a transaction, you might do something
like this:

	save old records (log file or whatever)
	SYNC
	lock records
	update records
	SYNC
	delete log or whatever (ie mark done)
	SYNC
	report as complete

  Note that this seems to be groups of i/o with SYNC points, and that if
I write five records, I shouldn't care what order the writes are done in, as
long as they are *all* done when the SYNC returns.  And if I care about the
order of record writes, then I should put in a SYNC here and there, no?

  Even on a system used for lots of TP, there are a lot of i/o which can
be buffered as long as you have a way to checkpoint and ensure that all
changes are complete.  And quite honestly, I believe that a lot of
existing database and TP software makes assumptions which standard UNIX
(and other systems) don't always meet.

  I am using SYNC to represent any system call that blocks until i/o is
complete, be it sync, fsync, O_SYNC, etc.
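
  [As a concrete sketch, the transaction above in C, with fsync(2)
standing in for SYNC; the names and record layout are illustrative only:]

    /*
     * Within each group the order of writes doesn't matter; ordering
     * is enforced only at the fsync() boundaries, as argued above.
     */
    #include <sys/types.h>
    #include <unistd.h>

    int do_transaction(int logfd, int datafd,
                       const char *undo, int undolen,
                       const char *newrec, int newlen, off_t where)
    {
        /* save old records (log file or whatever) */
        if (write(logfd, undo, undolen) != undolen) return -1;
        if (fsync(logfd) < 0) return -1;              /* SYNC */

        /* lock records, update records (locking elided) */
        if (lseek(datafd, where, SEEK_SET) < 0) return -1;
        if (write(datafd, newrec, newlen) != newlen) return -1;
        if (fsync(datafd) < 0) return -1;             /* SYNC */

        /* delete log or whatever (ie mark done) */
        if (ftruncate(logfd, 0) < 0) return -1;
        if (fsync(logfd) < 0) return -1;              /* SYNC */

        return 0;                                     /* report as complete */
    }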
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (03/13/91)

>The only sensible place to put smarts is in host software, where it can
>be changed to match the workload and to keep up with new developments.
>"Smart" hardware almost always makes policy decisions, which is a mistake.
>The money spent on "smartening" the hardware is usually better spent
>on making the main CPU faster so that you can get the same performance
>with smart *software*... especially since a faster CPU is good for a


I don't know about "usually" - depends on how you define "smart".  You 
can't get much main CPU for the couple of dollars more it costs to 
have smart(er) serial ports (which can provide significant performance 
increases).  

Same with smart keyboards, smart graphics controllers, smart terminals, etc.

Smart hardware is usually quite effective for small simple jobs.

-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (03/14/91)

In article <107340003@hpcuhc.cup.hp.com> dhepner@hpcuhc.cup.hp.com (Dan Hepner) writes:
>From: rbw00@ccc.amdahl.com (  213  Richard Wilmot)
>
>>I see some problems with transaction processing systems which rely on
>>being able to absolutely control the timing of disk writes.  Some (the
>>more efficient ones) need only do this for their logs/journals, while others
>>want to flush out all changes made by a transaction and ensure that
>>it all got there before sending a terminal reply or dispensing the
>>ATM cash.
>
>All common DBMS software relies on being notified when a write
>has in fact completed, although it may be willing to write the log
>immediately and the rest of it more asynchronously.  That such
>notification is not available seems to be a real deficiency in
>the current crop of caching disk controllers.
>
>>There may be more problems with the more efficient systems
>>because although they don't insist on flushing out all database changes
>>to disk on termination of each transaction, they RELY ON NOT HAVING ANY
>>UNCOMMITTED (UNFINISHED) CHANGES WRITTEN TO DISK. That is, if the system
>>crashed, then an advanced transaction system would expect to see NONE
>>of the changes made by any incomplete transactions from before the crash.
>
>Agreed.
>The drives we're familiar with do in fact support synchronous access,
>either by request or "setting the controller in that mode".  For all
>database usage, we intend that any controller caching be bypassed.  This
>does however leave an assumption that "somebody else" must be able to
>make use of that caching, because OLTP sure can't.  It's also worth
>noting that a battery backed up controller cache might turn out to
>be vastly more interesting.

I was more worried about file systems than disk controllers, but disk
controllers can be worrisome if they don't allow bypassing of cache
functions or make the OLTP system pay a significant performance
penalty for doing so. OLTP system performance is particularly sensitive
to WRITE latency for logging (recovery journal information). This
performance can be greatly augmented through appropriate use of non-
volatile memory in the controller so that writes as well as reads can
be cached. In fact it is then generally easier to cache writes than reads.

>
>>If a file system cannot accommodate this kind of use then the transaction
>>system implementors will again be forced into using raw I/O - to
>>avoid the file system. 
>>Alas, RAW I/O is still the answer for most database/transaction systems.
>
>This is perhaps the biggest trap of all.  Using raw IO has nothing
>to do with the behavior of a disk controller unless one has specifically
>modified one's kernel to do something special, such as post all
>raw writes as synchronous writes.  The default behavior will be for
>raw writes to be treated like any other write; the disk controller 
>doesn't know or care where this write came from.
>
>>They keep their own set of buffers and file structures. It need not be
>>so if the file system incorporates the semantic needs of transaction/database
>>systems.
>
>Do you actually recommend this, for which configurations, and for
>which reasons? 

What I recommend is that file systems be constructed so as to support
truly synchronous operation when required. Many file systems DO NOT
REALLY SUPPORT SYNCHRONOUS I/O OR DO IT INAPPROPRIATELY. A synch operation
which merely adds your request to a software queue to be done as soon
as convenient does not solve the problem.  Some I/O in some applications
(e.g., Online Transaction Processing, OLTP) is crucial to correct system
operation, and interference with such requirements by the file system
software or disk controller hardware/software will lead to the file system
not being used for those applications.

I will consider the problem addressed when most OLTP/DBMS software
vendors always use the operating system supplied file system. As another
post from my organization notes, we are trying to provide the structure
to allow those vendors to do just that. Other providers of file systems
and/or disk controllers are well advised to consider these same needs.
It will help customers who must implement and manage online transaction
and database systems.

>
>>  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
>>                 (408) 746-6108
>
>Dan Hepner
>Not a statement of the Hewlett Packard Co.


-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
                 (408) 746-6108

pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) (03/14/91)

On 12 Mar 91 22:36:00 GMT, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) said:

davidsen> In article <1991Mar12.202238.19586@zoo.toronto.edu>
davidsen> henry@zoo.toronto.edu (Henry Spencer) writes:

henry> That aside, the most probable result is that your big expensive
henry> main host CPU, which could undoubtedly run that code a lot
henry> faster, will spend all its time waiting for the dumb little I/O
henry> controller to run the filesystem.  This is not a cost-effective
henry> use of hardware resources.

This is the low-level performance side of the "down with smart devices"
argument.  The more important side, as you well know, but let's restate
it, is that smart devices have their own "smart" policies that are not
necessarily (euphemism) anywhere near being flexible and efficient
enough.  In other words, system performance should be treated as a whole;
it cannot be achieved by building assumptions into each component of the
system.

The extreme example is those caching controllers that have the
structure of a DOS volume built into their optimization patterns...

davidsen> What I can't see is how anyone can feel that the main CPU
davidsen> should be wasted in error logging and retries, bad sector
davidsen> mapping, and handling multiple interrupts.

Ahhhh, but then who should handle them? The CPU on the controller, of
course. The real *performance* question then is not about smart
controller vs. dumb controller.

Any architecture with asynchronous IO is a de facto multiprocessor; the
question is whether some of the CPUs in a multiprocessors should be
*specialized* or not, and which is the optimal power to assign to each
CPU if they are specialized.

You speak of "main" CPU, thus *assuming* that you have one main CPU and
some "smart" slave processors. The alternative is really multiple main
CPUs, whose function floats.

As to the specific examples you make, diagnostics (error logging and
retries, bad sector mapping) should all be done by software in the
"main" CPU OS anyhow; of all things, assumptions about error recovery
strategies surely should not be embedded in the drive, because
different OSes may well have very different fault models and fault
recovery policies.

Handling command chaining (multiple interrupts) can indeed be performed
fairly efficiently by the main CPU in well designed OS kernels that
offer lightweight interrupt handling and threading.

Unfortunately industry standard OS kernels are very poorly written, so
much so that for example on a 6 MIPS 386 running System V it is faster
to have command chaining handled by the 8085 on the 1542 SCSI
controller.

As to me, I'd rather have multiple powerful CPUs on an equal footing
doing programmed IO on very stupid devices than to have smart
controllers, which seems to be the idea behind Henry Spencer's thinking.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

henry@zoo.toronto.edu (Henry Spencer) (03/14/91)

In article <3254@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  This is the heart of the matter, and I agree completely. What I can't
>see is how anyone can feel that the main CPU should be wasted in error
>logging and retries, bad sector mapping, and handling multiple interrupts.

How often do your disks get errors or bad sectors?  How much CPU time is
*actually spent* on doing this?  Betcha it's just about zero.  You lose
some infinitesimal fraction of your CPU, and in return you gain a vast
improvement in *how* such problems can be handled, because the software
on the main CPU has a much better idea of the context of the error and
has more resources available to resolve it.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

henry@zoo.toronto.edu (Henry Spencer) (03/14/91)

In article <ZJT-JQ+@b-tech.uucp> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
> ...You
>can't get much main CPU for the couple of dollars more it costs to 
>have smart(er) serial ports (which can provide significant performance 
>increases).  

What do you mean by "smart(er)"?  If you just mean throwing in some FIFOs
to ease latency requirements and make it possible to move more than one byte
per interrupt, I agree.  I was assuming that the base for discussion
was dumb i/o devices, not brain-damaged ones.  If you mean DMA, that
does *not* cost a mere "couple of dollars more" if it's the first DMA
device on your system (or, for that matter, if it's the second), and
it can actually hurt performance.  (As a case in point, the DMA on the
LANCE Ethernet chip ties up your memory far longer than data transfers
by a modern CPU would.)

>Same with smart keyboards, smart graphics controllers, smart terminals, etc.

I'm not at all sure what you mean by "smart keyboards"; if you mean having
a keyboard encoder chip to do the actual key-scanning, that does not require
any form of "smartness" -- see comments above on dumb vs. brain-damaged.
Keyboards did that long before keyboards had micros in them.  The micros
replaced dedicated keyboard encoders because they were cheaper and a bit
more flexible, not because they added useful "smartness".

"Smart" graphics controllers are useful only if they actually bring
specialized hardware resources into the graphics operations.  All too
many "smart" graphics controllers are slower and less flexible than doing
it yourself in software.  Just *talking* to them to tell them what you
want to do can take more time than doing it yourself.  (This is a
particularly common vice of "smart" devices.)

"Smart" terminals are useful only if they are programmable.

>Smart hardware is usually quite effective for small simple jobs.

Small simple jobs don't need smart hardware.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

jesup@cbmvax.commodore.com (Randell Jesup) (03/14/91)

In article <3254@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <1991Mar12.202238.19586@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>
>| That aside, the most probable result is that your big expensive main host
>| CPU, which could undoubtedly run that code a lot faster, will spend all
>| its time waiting for the dumb little I/O controller to run the filesystem.
>| This is not a cost-effective use of hardware resources.
>
>  This is the heart of the matter, and I agree completely. What I can't
>see is how anyone can feel that the main CPU should be wasted in error
>logging and retries, bad sector mapping, and handling multiple interrupts.

	I agree also that FS code should be kept in the main CPU.  Device-
handling code, though, should be pushed off as much as possible into
smart devices or auxiliary processors.  A good modern example of this is
the NCR 53c700/710.  These scsi chips are essentially scsi-processors.  They
can take a major amount of interrupt and bus-twiddling code off of the main
processor, they can handle gather/scatter, they can bus-master, they can
process queues of requests, etc.  They only interrupt the main processor
on IO completion or on nasty errors.

	Perhaps my 100 mips super-mega-pipelined processor might be able to
execute some of the code faster.  But it has to talk to an IO chip that has
a maximum access speed far slower than the processor; it has to handle a
bunch of interrupts; and it requires more instructions to deal with things
like state transitions, etc.  Meanwhile, it could be happily executing some
user process while a smart IO device like the 53c710
is handling a series of IO requests.  IO is far less influenced by processor
speed than many things - interrupt speed and the number of interrupts are
often more important (assuming some level of DMA in hardware).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

kinch@no17sun.csd.uwo.ca (Dave Kinchlea) (03/14/91)

In article <1991Mar12.202238.19586@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:


   [Incidentally, is there some reason why twits (or readers written by twits)
                                           ^^^^^
   keep saying "Followup-To: comp.protocols.nfs", when this topic is only
   marginally related to NFS and highly related to architecture?  It's quite
   annoying to have to keep manually fixing this.]

As I started this particular thread I suppose I am to blame; I plead guilty
and throw myself at the mercy of USENET.

cheers kinch

alex@vmars.tuwien.ac.at (Alexander Vrchoticky) (03/14/91)

henry@zoo.toronto.edu (Henry Spencer) writes:
>it can actually hurt performance.  (As a case in point, the DMA on the
>LANCE Ethernet chip ties up your memory far longer than data transfers
>by a modern CPU would.)

Sigh ... tell me about it.  We have been conducting some measurements of
the DMA overhead of a single-board computer used for real-time applications.
Almost 50 percent of the memory cycles get burned by the LANCE.
The aim of the measurements was to see whether we could guarantee 
that a reasonable and, above all, predictable, amount of CPU power was 
available for application tasks.

In the end we concluded that we'd have to design a dual-processor board
with one CPU being dedicated to I/O handling. Which we did. 

[BTW, I can't see any connection to NFS here, therefore I removed 
that newsgroup from the Newsgroups-line.]

--
Alexander Vrchoticky            | alex@vmars.tuwien.ac.at
TU Vienna, CS/Real-Time Systems | +43/222/58801-8168
"those who feel they're touched by madness, sit down next to me" (james)

toon@news.sara.nl (03/15/91)

In article <1991Mar12.202238.19586@zoo.toronto.edu>,
	henry@zoo.toronto.edu (Henry Spencer) writes:
> In article <KINCH.91Mar9170121@no31sun.csd.uwo.ca>
>	kinch@no31sun.csd.uwo.ca (Dave Kinchlea) writes:
>>... it would be highly advantageous (in the general case) to take all of
>>the filesystem information out of the kernel and give it to the I/O controller.
> 
> Which filesystem?  System V's?  That's what you'd get, you know...
>
Actually, it sounds more like the CDC 6600 .. Cyber 170 series: one or
two CPUs and 10 to 20 Peripheral Processors, the latter designed to do
the I/O in the broad sense (handling everything from disk block assignment
to actual channel I/O).
 
> That aside, the most probable result is that your big expensive main host
> CPU, which could undoubtedly run that code a lot faster, will spend all
> its time waiting for the dumb little I/O controller to run the filesystem.
> This is not a cost-effective use of hardware resources.

In the Cyber series this was 'solved' by assigning the compute-intensive
parts to the CPU (e.g., converting a Record Block Number (logical disk
block) to cylinder/track/sector triples and vice versa).
> 
> [Incidentally, is there some reason why twits (or readers written by twits)
> keep saying "Followup-To: comp.protocols.nfs", when this topic is only
> marginally related to NFS and highly related to architecture?  It's quite
> annoying to have to keep manually fixing this.]

I don't know.  IMHO the performance problems of NFS are of a far different
nature - think of all your I/O spread over 512-byte UDP packets and the
interrupt rate this generates on your favorite U*X system (can you say:
Cray Y-MP?)
> -- 
> "But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
> for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry
-- 

Toon Moene, SARA - Amsterdam (NL)
Internet: TOON@SARA.NL

/usr/lib/sendmail.cf: Do.:%@!=/

glew@pdx007.intel.com (Andy Glew) (03/15/91)

>[Dave Parter]
>the paper monster that evolved was an odd multiprocessor configuration:

Actually, it was prototyped...
--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/15/91)

In article <PCG.91Mar13180706@aberdb.test.aber.ac.uk> pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:

| You speak of "main" CPU, thus *assuming* that you have one main CPU and
| some "smart" slave processors. The alternative is really multiple main
| CPUs, whose function floats.

  The CPU with the expensive cache, float, and maybe vector capability,
as opposed to an 8-bit CPU with integrated interrupt controller, some
parallel i/o, etc.  And whether one or many, I can usually find better use
for their capabilities than manipulating status bytes.

| As to the specific examples you make, diagnostics (error logging and
| retries, bad sector mapping) should all be done by software in the
| "main" CPU OS anyhow, as of all things surely assumptions on error
| recovery strategies should not be embedded in the drive, because
| different OSes may well have very different fault models and fault
| recovery policies.

  What have decisions got to do with implementation?  The CPU running the
o/s can decide how many retries (including none, if there's ever a disk
that doesn't need one now and then), and what to do with the count of
retries returned by the smart controller.  But to have the retries
actually done by the CPU, which could be doing something more useful?
To what gain?

| Handling command chaining (multiple interrupts) can indeed be performed
| fairly efficiently by the main CPU in well designed OS kernels that
| offer lightweight interrupt handling and threading.

  Remember the fastest way to do something is to avoid having to do it.
Every interrupt will require a context switch in and out of the
interrupt handler. The only real low cost way to do this is to have a
set of dedicated interrupt registers (like the 2nd register set of the
Z80), and I bet no one will suggest that a CPU should dedicate area to a
set of registers just to avoid a smart controller.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

peter@ficc.ferranti.com (Peter da Silva) (03/15/91)

In article <1991Mar13.194527.28164@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
> some infinitesimal fraction of your CPU, and in return you gain a vast
> improvement in *how* such problems can be handled, because the software
> on the main CPU has a much better idea of the context of the error and
> has more resources available to resolve it.

Of course, in practice this becomes "print an error message on the console,
and return an error indication to the process requesting the action. If the
kernel requested it, panic".
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

peter@ficc.ferranti.com (Peter da Silva) (03/15/91)

Here we're talking about putting file systems in smart processors. How about
putting other stuff there?

	Erase and kill processing.	(some PC smart cards do this,
					 as did the old Berkeley Bussiplexer)
	Window management.		(all the way from NeWS servers
					 with Postscript in the terminal,
					 down through X terminals and Blits,
					 to the 82786 graphics chip)
	Network processing.		(Intel, at least, is big on doing
					 lots of this in cards, to the point
					 where the small memory on the cards
					 becomes a problem... they do tend
					 to handle high network loads nicely)
	Tape handling.			(Epoch-1 "infinite storage" server,
					 etc...)

What else? The Intel 520 has multiple 80186 and 80286 CPUs on its smart
CPU cards, and seems to do quite an impressive job for a dinky little
CISC based machine.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

moss@cs.umass.edu (Eliot Moss) (03/15/91)

I would like to make a small observation here:

   Ensuring that things are done in a particular ORDER is not the same as
   ensuring that they are done NOW.

Sync features address the latter need without directly addressing the former.
I think this may be suboptimal. For example, when writing a log, it may be
important (to the recovery code) that it be written in order ALL THE TIME, but
it only needs to be forced (synced) at particular times (checkpoints, say, or
possibly commit points (there are MANY variations on database resiliency
techniques)).

On a slightly different note, there are certainly occasions where a database
application may need to read a number of records and does not care about the
order in which they are delivered. Having a smart system that allows MANY
outstanding read requests and satisfies them in an order that is most efficient
at the low level is also a good idea.
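
A sketch of the flavor of interface I mean (aread() and await_any() are
invented names, not any existing call):

    #include <sys/types.h>

    /* Queue all the reads at once, then take completions in whatever
     * order the device found cheapest.  Everything here is invented
     * for illustration. */
    struct read_req { int fd; off_t offset; char *buf; size_t len; };

    extern int aread(struct read_req *r);     /* queue it, don't wait */
    extern struct read_req *await_any(void);  /* next done, any order */
    extern void process(struct read_req *r);

    void fetch_records(struct read_req req[], int n)
    {
        int i;
        for (i = 0; i < n; i++)
            aread(&req[i]);         /* n requests outstanding at once */
        for (i = 0; i < n; i++)
            process(await_any());   /* delivery order is the disk's choice */
    }
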
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu

henry@zoo.toronto.edu (Henry Spencer) (03/16/91)

In article <3265@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>Every interrupt will require a context switch in and out of the
>interrupt handler. The only really low-cost way to do this is to have a
>set of dedicated interrupt registers (like the 2nd register set of the
>Z80), and I bet no one will suggest that a CPU should dedicate area to a
>set of registers just to avoid a smart controller.

Nonsense.  If the handling of the interrupt is sufficiently trivial,
several modern CPUs -- e.g. the 29k -- can do it without a full context
switch, by having a small number of registers dedicated to it.  This is
a very cost-effective use of silicon, adding a small amount to the CPU
to avoid the hassle and complexity of smart controllers.

Efficient fielding of simple interrupts (ones that require no decision
making) is, in any case, a solved problem even for older CPUs.  It just
takes some work and some thought.  Blindly taking a context switch for
such trivial servicing is a design mistake.
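
To sketch what "trivial" means here (acknowledge_disk() stands in for
the device-specific poke; the point is that nothing from the
interrupted context is touched, so a few dedicated registers suffice):

    /* A trivial interrupt, in outline.  On a CPU with registers
     * reserved for interrupt level (the 29k style), this needs no
     * save/restore of the interrupted context at all. */
    extern void acknowledge_disk(void);   /* hypothetical device poke */

    volatile int disk_done;       /* completion flag the driver polls */

    void disk_intr(void)          /* entered at interrupt level */
    {
        acknowledge_disk();       /* quiet the controller */
        disk_done = 1;            /* record completion; no dispatch */
    }
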
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/16/91)

In article <MOSS.91Mar15090837@ibis.cs.umass.edu> moss@cs.umass.edu writes:
| I would like to make a small observation here:
| 
|    Insuring that things are done in a particular ORDER is not the same as
|    insuring that they are done NOW.
| 
| Sync features address the latter need without directly addressing the former.

  If you need A to be written before B you would have to do a SYNC after
A, true. If you were able to promise that writes from a single process
would be done in order, without SYNC that might be all you need. While I
can visualize a simple way to do this, I can't claim to have seen it
implemented in an interface to a controller.

  What needs to be done is to add one i/o request from a process to the
queue, and either prevent any other i/o from that process from being
queued, or force it to be later in the actual service queue. This means
keeping track of the order in which things will be done.
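
A sketch of the simple way I have in mind (all names invented):

    /* Remember the last request each process has queued, and chain
     * the new one behind it.  The elevator may then reorder freely,
     * subject to the pred links.  Completion code must clear
     * last_req[proc] when it retires that process's last request. */
    #define NPROC 64              /* process slots, for illustration */

    struct ioreq {
        struct ioreq *next;       /* service-queue link */
        struct ioreq *pred;       /* must complete before this starts */
        int done;
        /* block number, buffer, count, ... */
    };

    extern void queue_insert(struct ioreq *r);   /* elevator insert */

    struct ioreq *last_req[NPROC];

    void enqueue_ordered(int proc, struct ioreq *r)
    {
        r->done = 0;
        r->pred = last_req[proc];  /* NULL if nothing outstanding */
        last_req[proc] = r;
        queue_insert(r);
    }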

| I think this may be suboptimal. For example, when writing a log, it may be
| important (to the recovery code) that it be written in order ALL THE TIME, but
| it only needs to be forced (synced) at particular times (checkpoints, say, or
| possibly commit points (there are MANY variations on database resiliency
| techniques)).
|
| On a slightly different note, there are certainly occasions where a database
| application may need to read a number of records and does not care about the
| order in which they are delivered. Having a smart system that allows MANY
| outstanding read requests and satisfies them in an order that is most efficient
| at the low level is also a good idea.

  And unless your implementation writes in exactly one-sector blocks
(highly non-portable to other devices), some writes will span sectors,
heads, and cylinders, even if your disk allocation is contiguous. This
means there are lots of interrupts which can be handled by the
controller.

  It would seem that logical i/o which spans several physical i/o
operations could be done by the controller, that reads could be ordered
by the controller as long as all requests get serviced in one pass
through the queue, and that many writes can be done in arbitrary order.
A way to sync is absolutely required, and some way to order writes is
needed. It's not clear to me whether that means writes by a process to
all files need to be ordered, or just writes to each individual file.
In any case this can be insured by use of sync.
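
In the meantime the portable recipe is the blunt one (error checks
omitted):

    #include <unistd.h>

    /* Get A on disk before B is even queued: force A all the way
     * out, then issue B. */
    void write_a_then_b(int fd, const char *a, const char *b, size_t n)
    {
        write(fd, a, n);
        fsync(fd);          /* wait for A to reach the platter */
        write(fd, b, n);    /* only now may B enter the queue */
    }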

  Some of the systems I check spend more time waiting on i/o than in
the kernel, and have no idle CPU to measure. Anything which will make
the i/o faster is great, and if it saves some CPU for the user
processes, that's a bonus.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

renglish@hplabsz.HP.COM (Bob English) (03/16/91)

In article <1991Mar12.202238.19586@zoo.toronto.edu>
henry@zoo.toronto.edu (Henry Spencer) writes:

> That aside, the most probable result is that your big expensive
> main host CPU, which could undoubtedly run that code a lot
> faster, will spend all its time waiting for the dumb little I/O
> controller to run the filesystem.  This is not a cost-effective
> use of hardware resources.

This is only true if the dumb little I/O controller is not fast enough
to keep up with the device it's controlling.  As long as it can, and as
long as the device is designed so that its "intelligence" doesn't
appreciably increase its latency, it doesn't really matter whether it's
as fast as the host CPU.  If it can offload functions, the system wins.

With uPs continuing to surge ahead of disks in the performance race,
disk controllers are getting ever faster compared to the disks they
control, and there is certainly potential there to exploit.  In spite of
arguments that CPU power is more effectively concentrated in a central
location, the CPU power to build intelligent controllers will soon be,
effectively, free.  When the performance of cheap, embedded
microcontrollers reaches 20 MIPS or so, arguments about effective
concentration of CPU power will no longer be relevant.

As Piercarlo points out, however, device controllers often make poor
decisions and end up penalizing the system rather than helping it, in
part because the standard interfaces used to talk to disk drives do not
provide the information necessary to make good decisions and in part
because most "intelligent" devices are programmed by disk drive
designers focused on raw access times rather than total system
performance.

I don't want to sound like I'm hammering on the disk drive guys,
however.  The centralized CPU (cluster) and its operating system are in
no better shape.  Standardized file systems are based on idealized models
of disks that no longer reflect actual structures and performance
characteristics of real disks.  The 4.2BSD file system, for example,
assumes that all tracks and cylinders are the same size and that all
tracks start at the same rotational position, assumptions that are false
in many cases.  There may be file systems available that take advantage
of information such as precise, current head position and settle time in
order to optimize requests, but if so, they are rare.
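
The model in question is roughly the following (a sketch of the
assumption, not the actual 4.2BSD source):

    /* Idealized geometry: every cylinder has the same number of
     * tracks and every track the same number of sectors, so position
     * is pure arithmetic.  Zoned drives, with more sectors on outer
     * tracks, quietly violate this. */
    struct geom { int nsect; int ntrak; }; /* sectors/track, tracks/cyl */

    void block_to_chs(const struct geom *g, long bno,
                      int *cyl, int *head, int *sect)
    {
        *sect = (int)(bno % g->nsect);
        *head = (int)((bno / g->nsect) % g->ntrak);
        *cyl  = (int)(bno / ((long)g->nsect * g->ntrak));
    }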

The fundamental problem here is that the controller and the CPU do not
communicate well enough to cooperate.  The protocols by
which the CPU could keep the controller informed on its global state
cheaply and efficiently do not currently exist, certainly not in a
standard way, and neither do the protocols by which the controller could
keep the CPU informed of its current state.  If anything, current
standards seem to be progressing in the opposite direction.

Where does this put intelligent disk controllers?  I must confess that I
don't know.  There seem to me to be opportunities for improved performance
there, but the continuing decline in DRAM prices and the consistent
performance gap between DRAM and disk systems makes me wonder whether
that potential will ever be economically important.  Sometime in the
next 10-20 years, DRAM and disk storage prices are projected to cross.
After that, disks will be important only for persistence and power
consumption reasons, and will have to compete with other technologies on
those bases.  Perhaps a new, cheaper technology will come along with
better access times than disks.

It may be more interesting to think of intelligent storage servers.  As
uP-based systems become more powerful, their data and storage appetites
will increase as well.  In the not-too-distant future, it may be common
for storage servers to have to manage terabytes of data and transfer it
at gigabit or gigabyte rates to and from a large number of clients.  I
doubt that any current filesystems or file access protocols are really
ready for such an environment.

--bob--
renglish@hplabs.hp.com

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (03/17/91)

In article <3268@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
>  ....
>   If you need A to be written before B you would have to do a SYNC after
> A, true. If you were able to promise that writes from a single process
> would be done in order, without SYNC that might be all you need. While I
> can visualize a simple way to do this, I can't claim to have seen it
> implemented in an interface to a controller.
> ...


Years ago, at a start-up, I led the development of a small, multiuser,
multiprocessor system.  We built a radically smart disk controller--it had
a whole 8085.  The controller patrolled a linked list of requests, with each
request containing the obvious unit number, word count, disk block number
(not sector/head/track--unusual in that era), completion bit and status,
etc.  Each request also contained an optionally non-null pointer to
another, "predecessor" request.  The controller was allowed to service
requests in any order it thought good, subject only to the prior completion
of the predecessor.
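
In outline it looked something like this (a sketch from memory; the
field names are approximate, and closer_to_heads() stands in for
whatever seek heuristic the firmware applied):

    #include <stddef.h>

    struct creq {
        struct creq *next;    /* list the controller patrols */
        struct creq *pred;    /* optional predecessor, or NULL */
        int unit;             /* drive number */
        long block;           /* disk block, not sector/head/track */
        unsigned count;       /* word count */
        volatile int done;    /* completion bit */
        int status;
    };

    extern int closer_to_heads(struct creq *a, struct creq *b);

    /* The controller may start any request not yet done whose
     * predecessor, if any, has completed. */
    struct creq *pick_next(struct creq *head)
    {
        struct creq *r, *best = NULL;
        for (r = head; r != NULL; r = r->next)
            if (!r->done && (r->pred == NULL || r->pred->done))
                if (best == NULL || closer_to_heads(r, best))
                    best = r;   /* any order it thinks good */
        return best;
    }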

It took lots of energy to convince the people working on the controller
that something more than a fixed array of word-count/sector/head/track was
possible, let alone a good idea.  We sold a few of the controllers to other
companies, but no one else ever seemed to think the predecessor idea was
worthwhile.  The company is long dead and forgotten.  Sigh--the universe is
unfair.


Vernon Schryver,   vjs@sgi.com

andrew@frip.WV.TEK.COM (Andrew Klossner) (03/22/91)

[]

		"... your big expensive main host CPU will spend all
		its time waiting for the dumb little I/O controller
		..."

	"This is only true if the dumb little I/O controller is not
	fast enough to keep up with the device it's controlling."

It must also be fast enough to hold up its end of the conversation when
it communicates with the host.  I worked on a system with a 68020 host,
talking over a SCSI channel to a disk whose controller used a Z8.
There's a lot of back-and-forth in the SCSI protocol, and you could
just about fall asleep waiting for that Z8.  It was so slow that the
68020 spent some serious time waiting, but not so slow that it would
have paid off to dismiss from interrupt at each step of the
conversation.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

ingoldsb@ctycal.UUCP (Terry Ingoldsby) (03/26/91)

In article <1991Mar15.165124.18039@zoo.toronto.edu>, henry@zoo.toronto.edu (Henry Spencer) writes:
> In article <3265@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> >Every interrupt will require a context switch in and out of the
> >interrupt handler. The only really low-cost way to do this is to have a
...
> Nonsense.  If the handling of the interrupt is sufficiently trivial,
> several modern CPUs -- e.g. the 29k -- can do it without a full context
> switch, by having a small number of registers dedicated to it.  This is

It doesn't even have to be modern!

Perhaps not as elegant as what you are referring to, but the 8-bit MC6809
has a FIRQ (Fast Interrupt Request) in which only a very few registers
are saved.  This lets you take the interrupt, store one or two registers
explicitly, do your thing, and get back out quickly.



-- 
  Terry Ingoldsby                ingoldsb%ctycal@cpsc.ucalgary.ca
  Land Information Services                 or
  The City of Calgary       ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

mlord@bwdls58.bnr.ca (Mark Lord) (03/27/91)

In article <PCG.91Mar13180706@aberdb.test.aber.ac.uk> pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:
<
<As to me, I'd rather have multiple powerful CPUs on an equal footing
<doing programmed IO on very stupid devices than to have smart
<controllers, which seems to be the idea behind Henry Spencer's thinking.

I vote for smart device controllers, to which the O/S can download
software.  This gives the O/S complete control, and still allows it
to optimally offload tedious tasks.  Sort of like channel processors
on mainframes...  What?  You mean it's already been done?

mike@sojurn.UUCP (Mike Sangrey) (04/21/91)

In article <ZU=9R=8@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>Here we're talking about putting file systems in smart processors. How about
>putting other stuff there?
>

The Unisys A-series computers even put process task switching (if I
remember rightly) in separate hardware.

-- 
   |   UUCP-stuff:  devon!sojurn!mike     |  "It muddles me rather"     |
   |   Slow-stuff:  832 Strasburg Rd.     |             Winnie the Pooh |
   |                Paradise, Pa.  17562  |    with apologies to        |
   |   Fast-stuff:  (717) 442-8959        |             A. A. Milne     |