mo@messy.bellcore.com (Michael O'Dell) (03/08/91)
Line-eater fodder. "laser-guided line-eater" fodder.

Incremental sync() is a good idea in some ways, but not all. When George Goble did the famous dual-headed VAX with 32 megabytes of memory, one of the first things noticed was that once every 30 seconds, things got very slow as update flushed out memory. One of the things he did was to put a clock-hand style thing in the kernel so the equivalent of update could push out the pages in a slow, steady stream, instead of the gigantic clumps of dirty disk blocks. However, the assumption that there is any disk idle time is basically wrong. On large, heavily-loaded systems, the disks run pretty constantly. This isn't to say that drizzling things out at a controlled rate rather than in big lumps isn't useful, but sometimes it doesn't help, either.

One real problem is that the mapped file may have semantics which require the user program not to terminate until the write to disk has finished, because the program wants to be quite certain the data is out there. And if the references to the file were quite random (like a large database hash table index), then there's a good chance that incremental page pushing did NOT clean some substantive fraction of the dirty pages, still producing the impulse load on the disk queue.

One important observation is that no VM system can run faster than it can get clean pages - if there is an adequate supply in memory, then fine. Otherwise, how quickly you can turn the pages to disk is the limiting factor for overall throughput (jobs completed per hour, or conversely, time for a single large VM-bound job). This factor most directly affects the level of multiprogramming a given system can sustain, assuming the workload isn't all just small jobs which trivially fit in memory and do no substantial I/O. (IBM MVS I/O systems are often tuned to maximize the sustainable paging rate without thrashing. Think about it.)

One serious elevator anomaly in fast machines is as follows. Assume several processes trying to read different files, each one broken into several reasonably large chunks (could easily be extents, so that doesn't fix it). Further, assume that one process's file is broken into several more chunks than the others, with these smaller chunks spread over the same distance as the other files. Finally, assume the machine is enough faster than the disks that each process can complete its processing BEFORE the heads can seek past the next request. So, using the standard elevator which resorts the queue on every request, the process with the large number of small extents (fortuitously laid out in request order) will completely monopolize the disk! Because the standard elevator will let you "get back into the boat" even after it's left, the process gets its data, and spins out another request before the head finishes another request in the same neighborhood, so it zooms to the front of the elevator queue and gets to do another read. The poor processes with their blocks toward the end of the run starve to death by comparison. (Note we are assuming a one-way elevator; it's worse with a 2-way.)

What do you do? One scheme used successfully is "generation scheduling". This collects an elevator-load of requests, sorts them, and then services them WITHOUT admitting anyone to the car along the way. This is a way of insuring fairness. It also turns out that this scheme can be modified to alleviate SOME of the problems with the big memory flush. Getting the details right is complex, but the general approach is as follows.
There is a "generation counter" which increments at some rate like 2-5x the request service time. Each request is marked with the current generation when it arrives. Further, you use a 2-dimensional queue, sorting the subqueues by generation and within each subqueue keeping the original FIFO arrival order. (There is some discussion about sorting the sub-queues; the jury is still out.) You now load the elevator car across the subqueues, always getting at least one request from every generation pending, more with some weighting function like queue length. [You can implement "must do in this order" requests by including a special generation queue which always loads the car first and is not sorted in the car.]

A further enhancement is to add priorities to requests. For instance, LOW, MEDIUM, and HIGH, plus FIFO, looks good: LOW is page turning when the pager isn't frantic for clean pages, MEDIUM is for normal reads and writes which you want to complete soon, HIGH is for things like directory searches in namei() and some metadata updates, and FIFO is for critical metadata updates. Couple this with the full generational scheduling described above and you will go a long way toward making the low-level stuff stable in the face of large (and truly unavoidable) impulse loads....

Oh yes - much of the thinking about this low-level scheduling stuff came from Bill Fuller, Brian Berliner, and Tony Schene, then of Prisma.

	-Mike O'Dell
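[For readers who want to see the shape of the scheme, here is a minimal C sketch of the generation-scheduled queue O'Dell describes. The structure names, the ring of MAXGEN generations, and the crude length-based weighting are assumptions made for illustration; they are not taken from the Prisma code or any real driver.]

#include <stdlib.h>

#define MAXGEN   8      /* generations kept pending at once (assumed) */
#define CARSIZE  32     /* requests serviced per elevator sweep (assumed) */

struct ioreq {
    long          blkno;    /* target block on the disk */
    int           gen;      /* generation stamped at arrival */
    struct ioreq *next;     /* FIFO link within its generation */
};

struct genq {
    struct ioreq *head, *tail;  /* FIFO arrival order preserved */
    int           len;
};

static struct genq pending[MAXGEN];
static int curgen;      /* bumped by a timer at roughly 2-5x the service time */

/* Arrival: stamp the request with the current generation, keep FIFO order.
   (A real driver would also handle the ring slot wrapping; ignored here.) */
void
ioreq_arrive(struct ioreq *r)
{
    struct genq *q = &pending[curgen % MAXGEN];

    r->gen = curgen;
    r->next = NULL;
    if (q->tail)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
    q->len++;
}

static int
cmp_blkno(const void *a, const void *b)
{
    const struct ioreq *x = *(const struct ioreq *const *)a;
    const struct ioreq *y = *(const struct ioreq *const *)b;

    return (x->blkno > y->blkno) - (x->blkno < y->blkno);
}

/* Load one elevator car: take at least one request from every pending
   generation (oldest first), a few more from long queues, then sort the
   car by block number and service it WITHOUT admitting later arrivals. */
int
load_car(struct ioreq *car[])
{
    int n = 0, g;

    for (g = 0; g < MAXGEN && n < CARSIZE; g++) {
        struct genq *q = &pending[(curgen + 1 + g) % MAXGEN];
        int take = 1 + q->len / 4;      /* crude length-based weighting */

        while (take-- > 0 && q->head != NULL && n < CARSIZE) {
            struct ioreq *r = q->head;

            q->head = r->next;
            if (q->head == NULL)
                q->tail = NULL;
            q->len--;
            car[n++] = r;
        }
    }
    qsort(car, n, sizeof(car[0]), cmp_blkno);   /* one-way sweep order */
    return n;       /* caller services these n requests, then reloads */
}

[The point of load_car() is simply that once the car is loaded and sorted, nothing can "get back into the boat" until the next sweep.]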
torek@elf.ee.lbl.gov (Chris Torek) (03/09/91)
In article <1991Mar8.142031.9098@bellcore.bellcore.com> mo@messy.bellcore.com (Michael O'Dell) writes:
>[much about disk queue sorting]

Of course, certain disk controllers (whose names I shall not name) have a tendency to swallow up a whole bunch of I/O requests and do them in their own order, rather than listen to / allow you to adjust yours....

Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
torek@elf.ee.lbl.gov (Chris Torek) (03/09/91)
In article <10773@dog.ee.lbl.gov> I wrote: >Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices, It seems people know exactly what I mean here... I got several replies to this in the span of a few hours. A quote from one of them: >I think it was Rob Pike who once pointed out there is a real >difference between "smart" and "smart-ass" controllers, and that >in his estimation (one which I agree with completely), most >controllers which claim to be smart, are in fact, in the other group. This nails it down pretty well. So: we should call it the `SO SAD Coalition', where `SO SAD' stands for `Stamp Out Smart-Ass Devices'. Note that there is nothing wrong with (truly) intelligent controllers, provided that they do not sacrifice something important to attain this intelligence. Important things that tend to get sacrificed include both speed and flexibility; and in fact, speed and flexibility can get in each other's way, particularly with programmable devices. (For instance, many SCSI controllers take several milliseconds---not microseconds, *milli*seconds---to react to a command. At 3600 RPM, 3 milliseconds is almost 1/5 of a disk revolution. This is a serious delay. Another example is certain Ethernet chips, where the FIFOs are just a bit too short, and when a collision occurs, they goof up the recovery because they cannot `back up' their DMA, so they simply restart with garbage.) -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
kinch@no31sun.csd.uwo.ca (Dave Kinchlea) (03/10/91)
In article <10773@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
Actually I have been having quite the opposite thoughts lately. It seems to me
that it would be highly advantageous (in the general case) to take all of
the filesystem information out of the kernel and give it to the I/O controller.
I don't just mean deciding which requests are satisfied first (although this would
be one of its tasks) but a controller which also supports an abstract filesystem.
This would take a lot of logic out of the kernel; it needn't spend time
in namei() et al., let an intelligent controller do that.
Am I missing something important here, other than the fact that no operating
system I am aware of has a concept of an abstract filesystem (except at
the user level)? There is still some logic needed re virtual memory and
possible page-outs etc., but I think it could work. Any comments?
cheers kinch
Dave Kinchlea, cs grad student at UWO (that's in London Ontario)
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/11/91)
In article <10773@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
| Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,

I think I heard this argument before when operating systems started buffering i/o... I guess what you're looking for is MS-DOS, where you're sure how things work (slowly).

As long as there's a working sync or other means for my process to do ordered writes in that less than one percent of the time when I care, I am delighted to have things done in the fastest possible way the rest of the time. The only time I ever care is when doing something like database or T.P. where order counts in case of error. If I'm doing a compile, or a save out of an editor, or writing a report, as long as what I read comes back as the same data in the same order, I really don't care about write order (or byte order, bit order, etc) on the disk.

There are cases when order is important, but as long as those rare cases are satisfied, any smarts which improve performance are welcome on my system.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (03/12/91)
In article <KINCH.91Mar9170121@no31sun.csd.uwo.ca> kinch@no31sun.csd.uwo.ca (Dave Kinchlea) writes:
>In article <10773@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>> Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
>
>Actually I have been having quite the opposite thoughts lately. It seems to me that it would be highly advantageous (in the general case) to take all of the filesystem information out of the kernel and give it to the I/O controller.
>
>I don't just mean deciding which requests are satisfied first (although this would be one of its tasks) but a controller which also supports an abstract filesystem. This would take a lot of logic out of the kernel; it needn't spend time in namei() et al., let an intelligent controller do that.
>
>Am I missing something important here, other than the fact that no operating system I am aware of has a concept of an abstract filesystem (except at the user level)? There is still some logic needed re virtual memory and possible page-outs etc., but I think it could work. Any comments?
>
>cheers kinch
>Dave Kinchlea, cs grad student at UWO (that's in London Ontario)

I see some problems with transaction processing systems which rely on being able to absolutely control the timing of disk writes. Some (the more efficient ones) only need do this for their logs/journals, while others want to flush out all changes made by a transaction and ensure that it all got there before sending a terminal reply or dispensing the ATM cash. There may be more problems with the more efficient systems, because although they don't insist on flushing out all database changes to disk on termination of each transaction, they RELY ON NOT HAVING ANY UNCOMMITTED (UNFINISHED) CHANGES WRITTEN TO DISK. That is, if the system crashed, then an advanced transaction system would expect to see NONE of the changes made by any incomplete transactions from before the crash.

If a file system cannot accommodate this kind of use then the transaction system implementors will again be forced into using raw I/O - to avoid the file system.

Alas, RAW I/O is still the answer for most database/transaction systems. They keep their own set of buffers and file structures. It need not be so if the file system incorporates the semantic needs of transaction/database systems.
-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
  (408) 746-6108
mmm@iconsys.icon.com (Mark Muhlestein) (03/12/91)
Dave Kinchlea, cs grad student at UWO writes:
>>Actually I have been having quite the opposite thoughts lately. It seems to me that it would be highly advantageous (in the general case) to take all of the filesystem information out of the kernel and give it to the I/O controller.

At Sanyo/Icon we actually implemented this idea approximately five years ago in our first Unix port. We used a dedicated 68020 with an expandable cache (up to 128MB!) to run the filesystem code.

It has worked reasonably well for us, but, in retrospect, I'm not at all sure I would do it again. Although some things are a win, such as being able to overlap relatively CPU-intensive functions such as namei() with other processing, we discovered very soon that a naive approach had several severe performance and maintenance problems.

For one thing, some normally trivial things like lseek and updating times on files become heavy-weight operations involving message passing to the filesystem processor. Also, very small reads and writes were extremely inefficient compared to a normal unix implementation.

Another problem was keeping the filesystem code in sync with new versions of the rest of the kernel. Enhancements and new features were relatively much more difficult to implement. For example, since the filesystem processor handles all filesystem requests, it is necessary to teach it about NFS, file system types, etc. It goes far beyond just an "intelligent" controller.

We solved these problems using various techniques that relied on the fact that the filesystem processor shared memory with the processor running the rest of the kernel. These techniques don't really make sense for, say, a SCSI-based controller.

From my experience, a good SCSI controller with scatter-gather DMA capability has about the right amount of intelligence. It doesn't require a lot of hand-holding to keep it going, and it gets the job done with a minimum of hassling with low-level details like device geometry, bad block management, or retries.
-- 
Mark Muhlestein @ Sanyo/Icon	uunet!iconsys!mmm
dhepner@hpcuhc.cup.hp.com (Dan Hepner) (03/12/91)
From: rbw00@ccc.amdahl.com ( 213 Richard Wilmot)
>I see some problems with transaction processing systems which rely on being able to absolutely control the timing of disk writes. Some (the more efficient ones) only need do this for their logs/journals, while others want to flush out all changes made by a transaction and ensure that it all got there before sending a terminal reply or dispensing the ATM cash.

All common DBMS SW relies on being notified when the write is in fact completed, although it may be willing to write the log immediately and the rest of it more asynchronously. That such notification is not available seems to be a real deficiency in the current crop of caching disk controllers.

>There may be more problems with the more efficient systems because although they don't insist on flushing out all database changes to disk on termination of each transaction, they RELY ON NOT HAVING ANY UNCOMMITTED (UNFINISHED) CHANGES WRITTEN TO DISK. That is, if the system crashed, then an advanced transaction system would expect to see NONE of the changes made by any incomplete transactions from before the crash.

Agreed. The drives we're familiar with do in fact support a synchronous access, either by request or by "setting the controller in that mode". For all database usage, we intend that any controller caching be bypassed. This does however leave an assumption that "somebody else" must be able to make use of that caching, because OLTP sure can't. It's also worth noting that a battery backed up controller cache might turn out to be vastly more interesting.

>If a file system cannot accommodate this kind of use then the transaction system implementors will again be forced into using raw I/O - to avoid the file system. Alas, RAW I/O is still the answer for most database/transaction systems.

This is perhaps the biggest trap of all. Using raw IO has nothing to do with the behavior of a disk controller unless one has specifically modified one's kernel to do something special, such as post all raw writes as synchronous writes. The default behavior will be for raw writes to be treated like any other write; the disk controller doesn't know or care where this write came from.

>They keep their own set of buffers and file structures. It need not be so if the file system incorporates the semantic needs of transaction/database systems.

Do you actually recommend this, for which configurations, and for which reasons?

> Dick Wilmot | I declaim that Amdahl might disclaim any of my claims.
> (408) 746-6108

Dan Hepner
Not a statement of the Hewlett Packard Co.
xxremak@csduts1.lerc.nasa.gov (David A. Remaklus) (03/13/91)
In article <107340003@hpcuhc.cup.hp.com> dhepner@hpcuhc.cup.hp.com (Dan Hepner) writes:
>From: rbw00@ccc.amdahl.com ( 213 Richard Wilmot)
>>because although they don't insist on flushing out all database changes to disk on termination of each transaction, they RELY ON NOT HAVING ANY UNCOMMITTED (UNFINISHED) CHANGES WRITTEN TO DISK. That is, if the system crashed, then an advanced transaction system would expect to see NONE of the changes made by any incomplete transactions from before the crash.
>
>Agreed. The drives we're familiar with do in fact support a synchronous access, either by request or "setting the controller in that mode". For all database usage, we intend that any controller caching be bypassed. This does however leave an assumption that "somebody else" must be able to make use of that caching, because OLTP sure can't. It's also worth noting that a battery backed up controller cache might turn out to be vastly more interesting.

Amdahl's latest update to the 6100 Caching disk controller includes just that, battery backed-up cache memory for disk writes (as does Big Blue's 3990 controller). Before this battery-supported cache (called Non-Volatile Store), all writes were automatically write-through (I/O completion was not signalled until the data was written to the disk).

>>If a file system cannot accommodate this kind of use then the transaction system implementors will again be forced into using raw I/O - to avoid the file system. Alas, RAW I/O is still the answer for most database/transaction systems.
>
>This is perhaps the biggest trap of all. Using raw IO has nothing to do with the behavior of a disk controller unless one has specifically modified one's kernel to do something special, such as post all raw writes as synchronous writes. The default behavior will be for raw writes to be treated like any other write; the disk controller doesn't know or care where this write came from.

The latest release of the UTS (Amdahl mainframe U*ix) includes a new file system type that enables an application to specify via fcntl calls that I/O to a particular file is to be synchronous, i.e., control is not returned until the data has successfully been written to the I/O device. Normal U*ix system calls to normal U*ix files can be used for the DBMS files. This feature was specifically added for DBMS support.

( stuff deleted )
-- 
David A. Remaklus		Currently at: NASA Lewis Research Center
Amdahl Corporation		MS 142-4
(216) 642-1044			Cleveland, Ohio 44135
				(216) 433-5119
xxremak@csduts1.lerc.nasa.gov
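[As an illustration of the kind of interface Remaklus describes, here is a hedged sketch using the standard fcntl()/O_SYNC calls. The file name is made up; whether a given kernel honours O_SYNC set through F_SETFL, as opposed to at open() time, is implementation-dependent, and this is not the actual UTS interface, only the general shape of per-file synchronous I/O.]

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int
write_log_record(const char *buf, size_t len)
{
    int fd, flags;
    ssize_t n;

    fd = open("/db/journal", O_WRONLY | O_APPEND);  /* file name is made up */
    if (fd < 0)
        return -1;

    /* Ask for synchronous writes on this one descriptor.  Whether O_SYNC
       can be turned on after the fact via F_SETFL is implementation-
       dependent; where it isn't honoured, open with O_SYNC instead. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_SYNC) < 0) {
        close(fd);
        return -1;
    }

    /* With O_SYNC in effect, write() does not return until the data is
       on the device, so the DBMS can acknowledge the transaction after
       this call completes. */
    n = write(fd, buf, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}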
dhepner@hpcuhc.cup.hp.com (Dan Hepner) (03/13/91)
From: davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr)
> I think I heard this argument before when operating systems started buffering i/o... I guess what you're looking for is MS-DOS, where you're sure how things work (slowly).

The most striking irritation of those who pioneered the usage of Un*x for commercial applications has come from implementations which provided no way around such buffering. Many of these offerings were still deficient well after the need for such had been established.

> As long as there's a working sync or other means for my process to do ordered writes in that less than one percent of the time when I care, I am delighted to have things done in the fastest possible way the rest of the time.

Unless your kernel vendor has provided some non-ordinary behavior for you to deal correctly with buffered disk drives, you're back in the same boat as users before O_SYNC. Does your O_SYNC do the right thing with such a drive? What about fsync(2)? What, for that matter, happens when you write to a "raw disk"?

> The only time I ever care is when doing something like database or T.P. where order counts in case of error.
> bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)

It might be worth pointing out that your mix, namely 99% something else and 1% this kind, is the direct opposite of a lot of people, whose usage is 99% this kind and 1% things which it is ok to indefinitely buffer. Agreed, not many of them post to the net.

Dan Hepner
Not a statement of the Hewlett Packard Co.
henry@zoo.toronto.edu (Henry Spencer) (03/13/91)
In article <3236@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>| Sometimes I think we need a Coalition to Stamp Out `Smart' I/O Devices,
>
> There are cases when order is important, but as long as those rare cases are satisfied, any smarts which improve performance are welcome on my system.

The trouble is, what if the "smarts" *don't* improve performance on your workload? The only sensible place to put smarts is in host software, where it can be changed to match the workload and to keep up with new developments. "Smart" hardware almost always makes policy decisions, which is a mistake. The money spent on "smartening" the hardware is usually better spent on making the main CPU faster so that you can get the same performance with smart *software*... especially since a faster CPU is good for a lot of other things too.
-- 
"But this *is* the simplified version  | Henry Spencer @ U of Toronto Zoology
for the general public."      -S. Harris | henry@zoo.toronto.edu  utzoo!henry
dparter@shorty.cs.wisc.edu (David Parter) (03/13/91)
In article <1991Mar12.010155.3268@iconsys.icon.com> mmm@iconsys.icon.com (Mark Muhlestein) writes:
>Dave Kinchlea, cs grad student at UWO writes:
>>Actually I have been having quite the opposite thoughts lately. It seems to me that it would be highly advantageous (in the general case) to take all of the filesystem information out of the kernel and give it to the I/O controller.
>
>At Sanyo/Icon we actually implemented this idea approximately five years ago in our first Unix port. We used a dedicated 68020 with an expandable cache (up to 128MB!) to run the filesystem code.
>
>It has worked reasonably well for us, but, in retrospect, I'm not at all sure I would do it again. Although some things are a win, such as being able to overlap relatively CPU-intensive functions such as namei() with other processing, we discovered very soon that a naive approach had several severe performance and maintenance problems.
>
> [interesting examples of flaws deleted]

[Note: I keep saying "they" in the following section. We didn't have an "us" vs "them" situation; once we started talking to each other about the project, the two groups worked well together.]

At a place I used to work (before returning to school), I participated in discussions with a group that wanted to do just this (put the f/s code into a controller). However, the scheme had the following flaws (among others):

* Since not everyone would have the integrated f/s and controller, the entire file system code would have to remain in the kernel as well, for those using traditional controllers.

* For RFS and NFS, at least part of the file system code had to remain in the kernel.

* The proposed solution assumed 1) the performance bottlenecks were in the code that would be off-loaded and 2) either the dedicated processor (on the controller) could run the code faster than the existing (main) cpu, or there was something more useful that the main processor could be doing while the controller was running the file system code. None of these assumptions had been studied.

The best implementation (from a software point of view) would be to use the vnode switch (Sun's virtual file system layer) and implement the integrated system as a new file system type -- sort of running NFS over the bus, but not really (we wouldn't use UDP, we'd use the message-passing already in place for communicating with "smart" i/o devices). This wasn't quite what they had in mind, but that is because they didn't know about NFS (they were a hardware/firmware group, not part of the Unix group originally). In addition, it was proposed to run a stripped-down unix on the controller, so that the unix filesystem code would "just run" (thus gaining uniformity of operation and semantics, and keeping the code in lock-step with the kernel code with little software effort).

They then pointed out that other i/o could be handled by this super-controller as well... the paper monster that evolved was an odd multiprocessor configuration: two identical (or nearly identical) CPUs, one running user processes and most of the kernel, the other running various parts of the i/o system (with some duplication, such as a buffer cache in both places, etc). A more general multiprocessor solution (where the additional cpu/s could be used either for I/O or computation) made more sense.

--david
-- 
david parter
dparter@cs.wisc.edu
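[To make the vnode-switch suggestion above concrete, here is a minimal sketch of two file system types hanging off one operations vector, with the off-loaded type reduced to message passing. Every name here (struct vnode_ops, send_fs_message, and so on) is invented for illustration; this is not the SunOS VFS interface, only its general shape.]

#include <stddef.h>

struct vnode;                           /* forward declaration */

struct vnode_ops {
    int (*vop_lookup)(struct vnode *dir, const char *name, struct vnode **res);
    int (*vop_read)(struct vnode *vp, void *buf, long len, long off);
};

struct vnode {
    int                     handle;     /* controller's token for this file */
    const struct vnode_ops *v_ops;      /* per-filesystem-type operations */
};

/* Stand-in for the message channel to the I/O processor. */
enum fs_msg { FS_LOOKUP, FS_READ };

static int
send_fs_message(enum fs_msg op, struct vnode *vp, void *a1, void *a2)
{
    (void)op; (void)vp; (void)a1; (void)a2;
    return 0;                           /* pretend the controller replied OK */
}

/* Operations for the controller-resident file system type: every call is
   just a message; the kernel never walks the on-disk structures itself. */
static int
offload_lookup(struct vnode *dir, const char *name, struct vnode **res)
{
    return send_fs_message(FS_LOOKUP, dir, (void *)name, res);
}

static int
offload_read(struct vnode *vp, void *buf, long len, long off)
{
    (void)len; (void)off;
    return send_fs_message(FS_READ, vp, buf, NULL);
}

static const struct vnode_ops offload_fs_ops = { offload_lookup, offload_read };

/* namei() and the rest of the kernel only ever call through v_ops, so a
   traditional in-kernel file system registers a different vector and the
   two types coexist behind the same switch. */
int
vfs_read(struct vnode *vp, void *buf, long len, long off)
{
    return vp->v_ops->vop_read(vp, buf, len, off);
}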
henry@zoo.toronto.edu (Henry Spencer) (03/13/91)
In article <KINCH.91Mar9170121@no31sun.csd.uwo.ca> kinch@no31sun.csd.uwo.ca (Dave Kinchlea) writes: >... it would be highly advantagous (in the general case) to take all of >the filesystem information out of the kernel and give it to the I/O controller. Which filesystem? System V's? That's what you'd get, you know... That aside, the most probable result is that your big expensive main host CPU, which could undoubtedly run that code a lot faster, will spend all its time waiting for the dumb little I/O controller to run the filesystem. This is not a cost-effective use of hardware resources. [Incidentally, is there some reason why twits (or readers written by twits) keep saying "Followup-To: comp.protocols.nfs", when this topic is only marginally related to NFS and highly related to architecture? It's quite annoying to have to keep manually fixing this.] -- "But this *is* the simplified version | Henry Spencer @ U of Toronto Zoology for the general public." -S. Harris | henry@zoo.toronto.edu utzoo!henry
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)
In article <1991Mar12.194704.17859@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
| The only sensible place to put smarts is in host software, where it can be changed to match the workload and to keep up with new developments. "Smart" hardware almost always makes policy decisions, which is a mistake. The money spent on "smartening" the hardware is usually better spent on making the main CPU faster so that you can get the same performance with smart *software*... especially since a faster CPU is good for a lot of other things too.

And in a perfect world the faster CPU to provide the boost will cost the same as the smart controller. Unfortunately I don't live there, and I suspect most readers don't either. The incremental price to get i/o improvement is *in most cases* much smaller to upgrade a controller (add cache, whatever) than to upgrade the CPU speed and all the support circuits that implies. For a multiprocessor system this becomes increasingly true.

There's also an issue of reliability. For any hardware or software failure other than power failure the smart controller seems more likely to complete moving the data from cache to disk than the kernel to move it from disk buffers to disk. That's a gut feeling; the smart controller may have a higher failure rate than the dumb controller, but it seems likely that the smaller hardware and software content of a controller will make it more reliable than the CPU, memory, and any other controllers which could do something to make the o/s crash.

The interesting thing is that in systems with multiple CPUs, if one CPU is handling all the interrupts it has a tendency to become an extremely expensive smart controller. Yes, it can do more for the CPU bound jobs, but is that cost effective for any load other than heavy CPU? I see no reason why an expensive CPU should be used to handle scatter gather, remap sectors, log errors and issue retries, etc. It doesn't take much in the way of smarts to do this stuff. It certainly doesn't take floating point or vector hardware, processor cache, or an MMU which supports paging. A CPU designed for embedded use can have a small interrupt controller and some parallel i/o built into the chip. This lowers chip count, connections, and latency, which means smaller, less expensive, and more reliable devices. And i/o buffers can be made from cheap slow memory and still stay ahead of the physical devices.

Moving the filesystem to another CPU isn't really a "smart controller", it's "distributed processing" more or less. That's certainly not what I mean by smart controller, at any rate, so maybe the term is being used loosely. I'm all in favor of having the decisions made by the o/s, but when it comes time to actually move data from memory to disk, I'll find something better for my CPU to do than keep an eye on the process. If I can issue an i/o and tell when it's done, and if the controller is configured to insure that data don't sit in the cache for more than time X (you define X), then I don't see any problem providing ordered writes as needed for data security, and good performance as needed by loaded and i/o bound machines. That's what I mean by a smart controller and that's what I think is optimal for both performance and cost effectiveness.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)
In article <1991Mar12.202238.19586@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: | That aside, the most probable result is that your big expensive main host | CPU, which could undoubtedly run that code a lot faster, will spend all | its time waiting for the dumb little I/O controller to run the filesystem. | This is not a cost-effective use of hardware resources. This is the heart of the matter, and I agree completely. What I can't see is how anyone can feel that the main CPU should be wasted in error logging and retries, bad sector mapping, and handling multiple interrupts. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)
In article <107340004@hpcuhc.cup.hp.com> dhepner@hpcuhc.cup.hp.com (Dan Hepner) writes:
| From: davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr)
| > As long as there's a working sync or other means for my process to do ordered writes in that less than one percent of the time when I care, I am delighted to have things done in the fastest possible way the rest of the time.
|
| Unless your kernel vendor has provided some non-ordinary behavior for you to deal correctly with buffered disk drives, you're back in the same boat as users before O_SYNC.

O_SYNC, fsync, whatever. I don't consider having system calls work to be "non-ordinary behavior." I said in the posting you quoted that there has to be a way to insure an i/o is complete, and you seem to be saying the same thing as if you were disagreeing with me.

| It might be worth pointing out that your mix, namely 99% something else, and 1% this kind, is a direct opposite of a lot of people, whose usage is 99% this kind, and 1% things which it is ok to indefinitely buffer. Agreed, not many of them post to the net.

Less than it appears. When doing a transaction, you might do something like this:

	save old records (log file or whatever)
	SYNC
	lock records
	update records
	SYNC
	delete log or whatever (i.e. mark done)
	SYNC
	report as complete

Note that this seems to be groups of i/o with SYNC points, and that if I write five records, I shouldn't care what order the writes are done in as long as they are *all* done when the SYNC returns. And if I care about order of record writes, then I should put in a SYNC here and there, no?

Even on a system used for lots of TP, there are a lot of i/o which can be buffered as long as you have a way to checkpoint and insure that all changes are complete. And quite honestly I believe that a lot of existing database and TP software makes assumptions which standard UNIX (and other systems) don't always meet. I am using SYNC to represent any system call forcing block until i/o complete, be it sync, fsync, O_SYNC, etc.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
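[A small sketch of the checkpointed sequence above, using fsync(2) as the SYNC barrier. Record locking is omitted, and the file descriptors, record layout, and "DONE" marker are made up; the only point is where the barriers fall and that ordering inside each group is left to the system.]

#include <sys/types.h>
#include <unistd.h>

int
do_transaction(int logfd, int datafd, const char *oldrec,
               const char *newrec, size_t len, off_t off)
{
    /* 1. save the old records to the log, then SYNC */
    if (write(logfd, oldrec, len) != (ssize_t)len || fsync(logfd) < 0)
        return -1;

    /* 2. update the records in place, then SYNC
          (record locking is omitted from this sketch) */
    if (lseek(datafd, off, SEEK_SET) == (off_t)-1 ||
        write(datafd, newrec, len) != (ssize_t)len || fsync(datafd) < 0)
        return -1;

    /* 3. mark the transaction done, then SYNC; only now report completion.
          Within each group the kernel and controller may order the writes
          however they like; the fsync() calls are the only ordering the
          application actually needs. */
    if (write(logfd, "DONE\n", 5) != 5 || fsync(logfd) < 0)
        return -1;

    return 0;
}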
zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (03/13/91)
>The only sensible place to put smarts is in host software, where it can >be changed to match the workload and to keep up with new developments. >"Smart" hardware almost always makes policy decisions, which is a mistake. >The money spent on "smartening" the hardware is usually better spent >on making the main CPU faster so that you can get the same performance >with smart *software*... especially since a faster CPU is good for a I don't know about "usually" - depends on how you define "smart". You can't get much main CPU for the couple of dollars more it costs to have smart(er) serial ports (which can provide significant performance increases). Same with smart keyboards, smart graphics controllers, smart terminals, etc. Smart hardware is usually quite effective for small simple jobs. -- Jon Zeeff (NIC handle JZ) zeeff@b-tech.ann-arbor.mi.us
rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (03/14/91)
In article <107340003@hpcuhc.cup.hp.com> dhepner@hpcuhc.cup.hp.com (Dan Hepner) writes:
>From: rbw00@ccc.amdahl.com ( 213 Richard Wilmot)

( earlier exchange deleted )

>Agreed. The drives we're familiar with do in fact support a synchronous access, either by request or "setting the controller in that mode". For all database usage, we intend that any controller caching be bypassed. This does however leave an assumption that "somebody else" must be able to make use of that caching, because OLTP sure can't. It's also worth noting that a battery backed up controller cache might turn out to be vastly more interesting.

I was more worried about file systems than disk controllers, but disk controllers can be worrisome if they don't allow bypassing of cache functions or make the OLTP system pay a significant performance penalty for doing so. OLTP system performance is particularly sensitive to WRITE latency for logging (recovery journal information). This performance can be greatly augmented through appropriate use of non-volatile memory in the controller so that writes as well as reads can be cached. In fact it is then generally easier to cache writes than reads.

( stuff deleted )

>>They keep their own set of buffers and file structures. It need not be so if the file system incorporates the semantic needs of transaction/database systems.
>
>Do you actually recommend this, for which configurations, and for which reasons?

What I recommend is that file systems be constructed so as to support truly synchronous operation when required. Many file systems DO NOT REALLY SUPPORT SYNCHRONOUS I/O OR DO IT INAPPROPRIATELY. A sync operation which merely adds your request to a software queue to be done as soon as convenient does not solve the problem. Some I/O in some applications (e.g. Online Transaction Processing, OLTP) is crucial to correct system operation, and interference with such requirements by the file system software or disk controller hardware/software will lead to not using it for those applications.

I will consider the problem addressed when most OLTP/DBMS software vendors always use the operating system supplied file system. As another post from my organization notes, we are trying to provide the structure to allow those vendors to do just that. Other providers of file systems and/or disk controllers are well advised to consider these same needs. It will help customers who must implement and manage online transaction and database systems.
-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
  (408) 746-6108
pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) (03/14/91)
On 12 Mar 91 22:36:00 GMT, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) said:
davidsen> In article <1991Mar12.202238.19586@zoo.toronto.edu>
davidsen> henry@zoo.toronto.edu (Henry Spencer) writes:
henry> That aside, the most probable result is that your big expensive
henry> main host CPU, which could undoubtedly run that code a lot
henry> faster, will spend all its time waiting for the dumb little I/O
henry> controller to run the filesystem. This is not a cost-effective
henry> use of hardware resources.
This is the low level performance side of the "down with smart devices"
argument. The more important one, as you well know, but let's restate
it, is that smart devices have their own "smart" policies that are not
necessarily (euphemism) anywhere near flexible and efficient enough. In
other words, system performance should be treated as a whole; it cannot
be achieved by building assumptions into each component of the system.
The extreme example is those caching controllers that have the
structure of a DOS volume built into their optimization patterns...
davidsen> What I can't see is how anyone can feel that the main CPU
davidsen> should be wasted in error logging and retries, bad sector
davidsen> mapping, and handling multiple interrupts.
Ahhhh, but then who should handle them? The CPU on the controller, of
course. The real *performance* question then is not about smart
controller vs. dumb controller.
Any architecture with asynchronous IO is a de facto multiprocessor; the
question is whether some of the CPUs in a multiprocessors should be
*specialized* or not, and which is the optimal power to assign to each
CPU if they are specialized.
You speak of "main" CPU, thus *assuming* that you have one main CPU and
some "smart" slave processors. The alternative is really multiple main
CPUs, whose function floats.
As to the specific examples you make, diagnostics (error logging and
retries, bad sector mapping) should all be done by software in the
"main" CPU OS anyhow, as assumptions on error recovery strategies, of
all things, surely should not be embedded in the drive, because
different OSes may well have very different fault models and fault
recovery policies.
Handling command chaining (multiple interrupts) can indeed be performed
fairly efficiently by the main CPU in well designed OS kernels that
offer lightweight interrupt handling and threading.
Unfortunately industry standard OS kernels are very poorly written, so
much so that for example on a 6 MIPS 386 running System V it is faster
to have command chaining handled by the 8085 on the 1542 SCSI
controller.
As to me, I'd rather have multiple powerful CPUs on an equal footing
doing programmed IO on very stupid devices than to have smart
controllers, which seems to be the idea behind Henry Spencer's thinking.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
henry@zoo.toronto.edu (Henry Spencer) (03/14/91)
In article <3254@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > This is the heart of the matter, and I agree completely. What I can't >see is how anyone can feel that the main CPU should be wasted in error >logging and retries, bad sector mapping, and handling multiple interrupts. How often do your disks get errors or bad sectors? How much CPU time is *actually spent* on doing this? Betcha it's just about zero. You lose some infinitesimal fraction of your CPU, and in return you gain a vast improvement in *how* such problems can be handled, because the software on the main CPU has a much better idea of the context of the error and has more resources available to resolve it. -- "But this *is* the simplified version | Henry Spencer @ U of Toronto Zoology for the general public." -S. Harris | henry@zoo.toronto.edu utzoo!henry
henry@zoo.toronto.edu (Henry Spencer) (03/14/91)
In article <ZJT-JQ+@b-tech.uucp> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
> ...You can't get much main CPU for the couple of dollars more it costs to have smart(er) serial ports (which can provide significant performance increases).

What do you mean by "smart(er)"? If you just mean throwing in some FIFOs to ease latency requirements and make it possible to move more than one byte per interrupt, I agree. I was assuming that the base for discussion was dumb i/o devices, not brain-damaged ones. If you mean DMA, that does *not* cost a mere "couple of dollars more" if it's the first DMA device on your system (or, for that matter, if it's the second), and it can actually hurt performance. (As a case in point, the DMA on the LANCE Ethernet chip ties up your memory far longer than data transfers by a modern CPU would.)

>Same with smart keyboards, smart graphics controllers, smart terminals, etc.

I'm not at all sure what you mean by "smart keyboards"; if you mean having a keyboard encoder chip to do the actual key-scanning, that does not require any form of "smartness" -- see comments above on dumb vs. brain-damaged. Keyboards did that long before keyboards had micros in them. The micros replaced dedicated keyboard encoders because they were cheaper and a bit more flexible, not because they added useful "smartness".

"Smart" graphics controllers are useful only if they actually bring specialized hardware resources into the graphics operations. All too many "smart" graphics controllers are slower and less flexible than doing it yourself in software. Just *talking* to them to tell them what you want to do can take more time than doing it yourself. (This is a particularly common vice of "smart" devices.) "Smart" terminals are useful only if they are programmable.

>Smart hardware is usually quite effective for small simple jobs.

Small simple jobs don't need smart hardware.
-- 
"But this *is* the simplified version  | Henry Spencer @ U of Toronto Zoology
for the general public."      -S. Harris | henry@zoo.toronto.edu  utzoo!henry
jesup@cbmvax.commodore.com (Randell Jesup) (03/14/91)
In article <3254@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <1991Mar12.202238.19586@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>| That aside, the most probable result is that your big expensive main host CPU, which could undoubtedly run that code a lot faster, will spend all its time waiting for the dumb little I/O controller to run the filesystem. This is not a cost-effective use of hardware resources.
>
> This is the heart of the matter, and I agree completely. What I can't see is how anyone can feel that the main CPU should be wasted in error logging and retries, bad sector mapping, and handling multiple interrupts.

I agree also that FS code should be kept in the main CPU. Device-handling code, though, should be pushed off as much as possible into smart devices or auxiliary processors.

A good modern example of this is the NCR 53c700/710. These scsi chips are essentially scsi-processors. They can take a major amount of interrupt and bus-twiddling code off of the main processor, they can handle gather/scatter, they can bus-master, they can process queues of requests, etc. They only interrupt the main processor on IO completion or on nasty errors.

Perhaps my 100 mips super-mega-pipelined processor might be able to execute some of the code faster. But it has to talk to an IO chip that has a maximum access speed far slower than the processor; it has to handle a bunch of interrupts; it requires more instructions to deal with things like state transitions, etc, etc. Meanwhile, it could be happily executing some user process while a smart IO device like the 53c710 handles a series of IO requests. IO is far less influenced by processor speed than many things - interrupt speed and the number of interrupts are often more important (assuming some level of DMA in hardware).
-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)
kinch@no17sun.csd.uwo.ca (Dave Kinchlea) (03/14/91)
In article <1991Mar12.202238.19586@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
[Incidentally, is there some reason why twits (or readers written by twits)
^^^^^
keep saying "Followup-To: comp.protocols.nfs", when this topic is only
marginally related to NFS and highly related to architecture? It's quite
annoying to have to keep manually fixing this.]
As I started this particular thread I suppose I am to blame; I plead guilty
and throw myself on the mercy of USENET.
cheers kinch
alex@vmars.tuwien.ac.at (Alexander Vrchoticky) (03/14/91)
henry@zoo.toronto.edu (Henry Spencer) writes:
>it can actually hurt performance. (As a case in point, the DMA on the LANCE Ethernet chip ties up your memory far longer than data transfers by a modern CPU would.)

Sigh ... you're telling me. We have been conducting some measurements of the DMA overhead of a single-board computer used for real-time applications. Almost 50 percent of the memory cycles get burned by the LANCE. The aim of the measurements was to see whether we could guarantee that a reasonable and, above all, predictable amount of CPU power was available for application tasks. In the end we concluded that we'd have to design a dual-processor board with one CPU being dedicated to I/O handling. Which we did.

[BTW, I can't see any connection to NFS here, therefore I removed that newsgroup from the Newsgroups-line.]
-- 
Alexander Vrchoticky | alex@vmars.tuwien.ac.at
TU Vienna, CS/Real-Time Systems | +43/222/58801-8168
"those who feel they're touched by madness, sit down next to me"  (james)
toon@news.sara.nl (03/15/91)
In article <1991Mar12.202238.19586@zoo.toronto.edu>, henry@zoo.toronto.edu (Henry Spencer) writes: > In article <KINCH.91Mar9170121@no31sun.csd.uwo.ca> > kinch@no31sun.csd.uwo.ca (Dave Kinchlea) writes: >>... it would be highly advantagous (in the general case) to take all of >>the filesystem information out of the kernel and give it to the I/O controller. > > Which filesystem? System V's? That's what you'd get, you know... > Actually, it sounds more like the CDC Cyber 6600 .. 170 series: one or two CPU's and 10 to 20 Peripheral Processors, the latter designed to do the I/O in broad sense (handling all from disk block assignment to actual channel I/O). > That aside, the most probable result is that your big expensive main host > CPU, which could undoubtedly run that code a lot faster, will spend all > its time waiting for the dumb little I/O controller to run the filesystem. > This is not a cost-effective use of hardware resources. In the Cyber series this was 'solved' by assigning the compute intensive parts to the CPU (e.g.: converting Record Block Number (logical disk block) to cylinder/track/sector triples v.v.) > > [Incidentally, is there some reason why twits (or readers written by twits) > keep saying "Followup-To: comp.protocols.nfs", when this topic is only > marginally related to NFS and highly related to architecture? It's quite > annoying to have to keep manually fixing this.] I don't know. IMHO performance problems of NFS are of a far different nature - think of all your I/O spread over 512 byte UDP packets and the interrupt rate this generates on your favorite U*X system (can you say: Cray Y-MP ?) > -- > "But this *is* the simplified version | Henry Spencer @ U of Toronto Zoology > for the general public." -S. Harris | henry@zoo.toronto.edu utzoo!henry -- Toon Moene, SARA - Amsterdam (NL) Internet: TOON@SARA.NL /usr/lib/sendmail.cf: Do.:%@!=/
glew@pdx007.intel.com (Andy Glew) (03/15/91)
>[Dave Parter] >the paper monster that evolved was an odd multiprocessor configuration: Actually, it was prototyped... -- --- Andy Glew, glew@ichips.intel.com Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, Hillsboro, Oregon 97124-6497 This is a private posting; it does not indicate opinions or positions of Intel Corp.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/15/91)
In article <PCG.91Mar13180706@aberdb.test.aber.ac.uk> pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:
| You speak of "main" CPU, thus *assuming* that you have one main CPU and some "smart" slave processors. The alternative is really multiple main CPUs, whose function floats.

The CPU with the expensive cache, float, and maybe vector capability. As opposed to an 8 bit CPU with integrated interrupt controller, some parallel i/o, etc. And one or many, I can usually find better use for their capabilities than manipulating status bytes.

| As to the specific examples you make, diagnostics (error logging and retries, bad sector mapping) should all be done by software in the "main" CPU OS anyhow, as assumptions on error recovery strategies, of all things, surely should not be embedded in the drive, because different OSes may well have very different fault models and fault recovery policies.

What have decisions got to do with implementation? The CPU running the o/s can decide how many retries (including none, if there's ever a disk that doesn't need one now and then), and what to do with the count of retries returned by the smart controller. But why have the retries actually done by the CPU, which can do more? To what gain?

| Handling command chaining (multiple interrupts) can indeed be performed fairly efficiently by the main CPU in well designed OS kernels that offer lightweight interrupt handling and threading.

Remember the fastest way to do something is to avoid having to do it. Every interrupt will require a context switch in and out of the interrupt handler. The only real low cost way to do this is to have a set of dedicated interrupt registers (like the 2nd register set of the Z80), and I bet no one will suggest that a CPU should dedicate area to a set of registers just to avoid a smart controller.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
peter@ficc.ferranti.com (Peter da Silva) (03/15/91)
In article <1991Mar13.194527.28164@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: > some infinitesimal fraction of your CPU, and in return you gain a vast > improvement in *how* such problems can be handled, because the software > on the main CPU has a much better idea of the context of the error and > has more resources available to resolve it. Of course, in practice this becomes "print an error message on the console, and return an error indication to the process requesting the action. If the kernel requested it, panic". -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
peter@ficc.ferranti.com (Peter da Silva) (03/15/91)
Here we're talking about putting file systems in smart processors. How about putting other stuff there? Erase and kill processing. (some PC smart cards do this, as did the old Berkeley Bussiplexer) Window management. (all the way from NeWS servers with Postscript in the terminal, down through X terminals and Blits, to the 82786 graphics chip) Network processing. (Intel, at least, is big on doing lots of this in cards, to the point where the small memory on the cards becomes a problem... they do tend to handle high network loads nicely) Tape handling. (Epoch-1 "infinite storage" server, etc...) What else? The Intel 520 has multiple 80186 and 80286 CPUs on its smart CPU cards, and seems to do quite an impressive job for a dinky little CISC based machine. -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
moss@cs.umass.edu (Eliot Moss) (03/15/91)
I would like to make a small observation here:

Insuring that things are done in a particular ORDER is not the same as insuring that they are done NOW. Sync features address the latter need without directly addressing the former. I think this may be suboptimal. For example, when writing a log, it may be important (to the recovery code) that it be written in order ALL THE TIME, but it only needs to be forced (synced) at particular times (checkpoints, say, or possibly commit points (there are MANY variations on database resiliency techniques)).

On a slightly different note, there are certainly occasions where a database application may need to read a number of records and does not care about the order in which they are delivered. Having a smart system that allows MANY outstanding read requests and satisfies them in an order that is most efficient at the low level is also a good idea.
-- 
J. Eliot B. Moss, Assistant Professor
Department of Computer and Information Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA  01003
(413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu
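[A hedged sketch of the "many outstanding reads, any completion order" idea, written with POSIX aio, which postdates this discussion and is used here purely for illustration; error handling is abbreviated and the sizes are arbitrary.]

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define NREQ   16
#define RECSZ  4096

int
read_records_any_order(int fd, const off_t offsets[NREQ])
{
    static char bufs[NREQ][RECSZ];      /* static just to keep the sketch short */
    struct aiocb cbs[NREQ];
    const struct aiocb *list[NREQ];
    int i, done = 0;

    memset(cbs, 0, sizeof cbs);

    /* Issue all the reads up front; completion order is left entirely
       to the kernel and the controller. */
    for (i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = RECSZ;
        cbs[i].aio_offset = offsets[i];
        list[i] = &cbs[i];
        if (aio_read(&cbs[i]) < 0)
            return -1;                  /* error handling abbreviated */
    }

    /* Consume completions in whatever order they arrive. */
    while (done < NREQ) {
        aio_suspend(list, NREQ, NULL);  /* block until something finishes */
        for (i = 0; i < NREQ; i++) {
            if (list[i] != NULL && aio_error(&cbs[i]) != EINPROGRESS) {
                (void) aio_return(&cbs[i]);  /* bytes read, or -1 on error */
                /* ... process bufs[i] here ... */
                list[i] = NULL;         /* don't wait on this one again */
                done++;
            }
        }
    }
    return 0;
}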
henry@zoo.toronto.edu (Henry Spencer) (03/16/91)
In article <3265@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: >Every interrupt will require a context switch in and out of the >interrupt handler. The only real low cost way to do this is to have a >set of dedicated interrupt registers (like the 2nd register set of the >Z80), and I bet no one will suggest that a CPU should dedicate area to a >set of registers just to avoid a smart controller. Nonsense. If the handling of the interrupt is sufficiently trivial, several modern CPUs -- e.g. the 29k -- can do it without a full context switch, by having a small number of registers dedicated to it. This is a very cost-effective use of silicon, adding a small amount to the CPU to avoid the hassle and complexity of smart controllers. Efficient fielding of simple interrupts (ones that require no decision making) is, in any case, a solved problem even for older CPUs. It just takes some work and some thought. Blindly taking a context switch for such trivial servicing is a design mistake. -- "But this *is* the simplified version | Henry Spencer @ U of Toronto Zoology for the general public." -S. Harris | henry@zoo.toronto.edu utzoo!henry
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/16/91)
In article <MOSS.91Mar15090837@ibis.cs.umass.edu> moss@cs.umass.edu writes:
| I would like to make a small observation here:
|
| Insuring that things are done in a particular ORDER is not the same as
| insuring that they are done NOW.
|
| Sync features address the latter need without directly addressing the former.

If you need A to be written before B you would have to do a SYNC after A,
true. If you were able to promise that writes from a single process would
be done in order, without SYNC, that might be all you need. While I can
visualize a simple way to do this, I can't claim to have seen it
implemented in an interface to a controller. What needs to be done is to
add one i/o request from a process to the queue, and either prevent any
other i/o from that process from being queued, or force it to be later in
the actual service queue. This means keeping track of the order in which
things will be done.

| I think this may be suboptimal. For example, when writing a log, it may be
| important (to the recovery code) that it be written in order ALL THE TIME, but
| it only needs to be forced (synced) at particular times (checkpoints, say, or
| possibly commit points (there are MANY variations on database resiliency
| techniques)).
|
| On a slightly different note, there are certainly occasions where a database
| application may need to read a number of records and does not care about the
| order in which they are delivered. Having a smart system that allows MANY
| outstanding read requests and satisfies them in an order that is most efficient
| at the low level is also a good idea.

And unless your implementation writes in blocks of exactly one sector
(highly non-portable to other devices), some writes will span sectors,
heads, and cylinders, even if your disk allocation is contiguous. This
means there are lots of interrupts which can be handled by the
controller. It would seem that logical i/o which spans several physical
i/os could be done by the controller, that reads could be ordered by the
controller as long as all requests get serviced in one pass through the
queue, and that many writes can be done in arbitrary order.

A way to sync is absolutely required, and some way to order writes is
needed. It's not clear to me whether that means writes by a process to
all files need to be ordered, or just writes to each individual file. In
any case this can be insured by use of sync.

Some of the systems I check spend more time waiting on i/o than in the
kernel, and have no idle CPU to measure. Anything which will make the i/o
faster is great, and if it saves some CPU for the user processes, that's
a bonus.
--
bill davidsen  (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "Most of the VAX instructions are in microcode,
   but halt and no-op are in hardware for efficiency"
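The simpler of the two options -- prevent any further i/o from that
process until its previous write completes -- might look like the sketch
below. Everything here (the NPROC table, sleep_on/wakeup, disk_strategy)
is hypothetical pseudocode in the old Unix style, not an existing
interface.

    /*
     * Sketch: per-process write ordering by serialization.  A process
     * may not queue a new write until its previous one has completed.
     * All names are hypothetical.
     */
    struct ioreq {
        struct ioreq *next;          /* driver service queue link */
        volatile int  done;          /* set at completion time    */
        /* device, block number, buffer, count, ... */
    };

    #define NPROC 64                             /* table size, assumed */

    extern void sleep_on(void *chan);            /* hypothetical: block  */
    extern void disk_strategy(struct ioreq *r);  /* hand off to driver   */

    static struct ioreq *last_write[NPROC];      /* per-process          */

    void ordered_write(int pid, struct ioreq *r)
    {
        /* Wait for this process's previous write to finish. */
        while (last_write[pid] != 0 && !last_write[pid]->done)
            sleep_on(last_write[pid]);   /* wakeup() done at completion */

        r->done = 0;
        last_write[pid] = r;
        disk_strategy(r);                /* hand to the normal queue    */
    }

The other option -- queue the request immediately but mark it as having
to follow its predecessor -- is essentially what Vernon Schryver
describes a controller doing, a few articles down.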
renglish@hplabsz.HP.COM (Bob English) (03/16/91)
In article <1991Mar12.202238.19586@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
> That aside, the most probable result is that your big expensive
> main host CPU, which could undoubtedly run that code a lot
> faster, will spend all its time waiting for the dumb little I/O
> controller to run the filesystem. This is not a cost-effective
> use of hardware resources.

This is only true if the dumb little I/O controller is not fast enough to
keep up with the device it's controlling. As long as it can, and as long
as the device is designed so that its "intelligence" doesn't appreciably
increase its latency, it doesn't really matter whether it's as fast as
the host CPU. If it can offload functions, the system wins.

With uPs continuing to surge ahead of disks in the performance race, disk
controllers are getting ever faster compared to the disks they control,
and there is certainly potential there to exploit. In spite of arguments
that CPU power is more effectively concentrated in a central location,
the CPU power to build intelligent controllers will soon be, effectively,
free. When the performance of cheap, embedded microcontrollers reaches
20 MIPS or so, arguments about effective concentration of CPU power will
no longer be relevant.

As Piercarlo points out, however, device controllers often make poor
decisions and end up penalizing the system rather than helping it, in
part because the standard interfaces used to talk to disk drives do not
provide the information necessary to make good decisions, and in part
because most "intelligent" devices are programmed by disk drive designers
focused on raw access times rather than total system performance.

I don't want to sound like I'm hammering on the disk drive guys, however.
The centralized CPU (cluster) and its operating system are in no better
shape. Standardized file systems are based on idealized models of disks
that no longer reflect the actual structures and performance
characteristics of real disks. The 4.2BSD file system, for example,
assumes that all tracks and cylinders are the same size and that all
tracks start at the same rotational position, assumptions that are false
in many cases. There may be file systems available that take advantage of
information such as precise, current head position and settle time in
order to optimize requests, but if so, they are rare.

The fundamental problem here is that the controller and the CPU do not
communicate well enough to cooperate effectively. The protocols by which
the CPU could keep the controller informed of its global state cheaply
and efficiently do not currently exist, certainly not in a standard way,
and neither do the protocols by which the controller could keep the CPU
informed of its current state. If anything, current standards seem to be
progressing in the opposite direction.

Where does this put intelligent disk controllers? I must confess that I
don't know. There seem to me to be opportunities for improved performance
there, but the continuing decline in DRAM prices and the consistent
performance gap between DRAM and disk systems make me wonder whether that
potential will ever be economically important. Sometime in the next 10-20
years, DRAM and disk storage prices are projected to cross. After that,
disks will be important only for persistence and power consumption
reasons, and will have to compete with other technologies on those bases.
Perhaps a new, cheaper technology will come along with better access
times than disks.
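To illustrate the "idealized model" point: the traditional mapping from
block number to disk address assumes a fixed geometry, roughly as in the
sketch below (a caricature, not the actual 4.2BSD code). On a
zone-recorded drive the real sectors-per-track varies across the disk,
so rotational optimizations computed from this model are optimizing a
disk that isn't there.

    /*
     * Sketch of the idealized disk model: every track has the same
     * number of sectors, every cylinder the same number of tracks.
     * Zoned recording, spared sectors, and controller caches all
     * break these assumptions.
     */
    struct diskaddr {
        unsigned cyl, head, sector;
    };

    struct geometry {
        unsigned nsect;      /* sectors per track   (assumed constant) */
        unsigned ntrack;     /* tracks per cylinder (assumed constant) */
    };

    struct diskaddr
    block_to_addr(unsigned long blkno, const struct geometry *g)
    {
        unsigned long spc = (unsigned long)g->nsect * g->ntrack;
        struct diskaddr a;

        a.cyl    = blkno / spc;
        a.head   = (blkno % spc) / g->nsect;
        a.sector = blkno % g->nsect;
        return a;
    }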
It may be more interesting to think of intelligent storage servers. As
uP-based systems become more powerful, their data and storage appetites
will increase as well. In the not-too-distant future, it may be common
for storage servers to have to manage terabytes of data and transfer it
at gigabit or gigabyte rates to and from a large number of clients. I
doubt that any current filesystems or file access protocols are really
ready for such an environment.

--bob--
renglish@hplabs.hp.com
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (03/17/91)
In article <3268@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
> ....
> If you need A to be written before B you would have to do a SYNC after
> A, true. If you were able to promise that writes from a single process
> would be done in order, without SYNC that might be all you need. While I
> can visualize a simple way to do this, I can't claim to have seen it
> implemented in an interface to a controller.
> ...

Years ago, at a start-up, I led the development of a small, multiuser,
multiprocessor system. We built a radically smart disk controller--it had
a whole 8085. The controller patrolled a linked list of requests, with
each request containing the obvious unit number, word count, disk block
number (not sector/head/track--unusual in that era), completion bit and
status, etc. Each request also contained an optionally non-null pointer
to another, "predecessor" request. The controller was allowed to service
requests in any order it thought good, subject only to the prior
completion of the predecessor.

It took lots of energy to convince the people working on the controller
that something more than a fixed array of word-count/sector/head/track
was possible, let alone a good idea. We sold a few of the controllers to
other companies, but no one else ever seemed to think the predecessor
idea was worthwhile. The company is long dead and forgotten.

Sigh--the universe is unfair.

Vernon Schryver, vjs@sgi.com
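A sketch of what such a request might look like, reconstructed from the
description above (the field names and widths are guesses, not the real
interface):

    /*
     * Request descriptor with an optional predecessor.  The controller
     * patrols the list and may service requests in any order, except
     * that a request with a non-null predecessor must wait for it.
     */
    struct dreq {
        struct dreq  *next;      /* link in the patrolled list            */
        struct dreq  *pred;      /* must complete before this one, or 0   */
        unsigned      unit;      /* drive number                          */
        unsigned long blkno;     /* logical block, not sector/head/track  */
        unsigned      wcount;    /* word count                            */
        void         *buf;       /* memory address for the transfer       */
        volatile int  done;      /* completion bit, set by controller     */
        volatile int  status;    /* valid once done is set                */
    };

    /* Host side: A must precede B; everything else may be reordered. */
    void order_pair(struct dreq *a, struct dreq *b)
    {
        a->pred = 0;
        b->pred = a;
    }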
andrew@frip.WV.TEK.COM (Andrew Klossner) (03/22/91)
[]

    "... your big expensive main host CPU will spend all its time
    waiting for the dumb little I/O controller ..."

    "This is only true if the dumb little I/O controller is not fast
    enough to keep up with the device it's controlling."

It must also be fast enough to hold up its end of the conversation when
it communicates with the host. I worked on a system with a 68020 host,
talking over a SCSI channel to a disk whose controller used a Z8. There's
a lot of back-and-forth in the SCSI protocol, and you could just about
fall asleep waiting for that Z8. It was so slow that the 68020 spent some
serious time waiting, but not so slow that it would have paid off to
dismiss from interrupt at each step of the conversation.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]
ingoldsb@ctycal.UUCP (Terry Ingoldsby) (03/26/91)
In article <1991Mar15.165124.18039@zoo.toronto.edu>, henry@zoo.toronto.edu (Henry Spencer) writes:
> In article <3265@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> >Every interrupt will require a context switch in and out of the
> >interrupt handler. The only real low cost way to do this is to have a
...
> Nonsense. If the handling of the interrupt is sufficiently trivial,
> several modern CPUs -- e.g. the 29k -- can do it without a full context
> switch, by having a small number of registers dedicated to it. This is

It doesn't even have to be modern! Perhaps not as elegant as what you are
referring to, but the 8-bit MC6809 had a FIRQ (Fast Interrupt Request) in
which only a very few registers were saved. This let you take the
interrupt, store one or two registers explicitly, do your thing, and get
back out quickly.
--
  Terry Ingoldsby                ingoldsb%ctycal@cpsc.ucalgary.ca
  Land Information Services                 or
  The City of Calgary      ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
mlord@bwdls58.bnr.ca (Mark Lord) (03/27/91)
In article <PCG.91Mar13180706@aberdb.test.aber.ac.uk> pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:
<
<As to me, I'd rather have multiple powerful CPUs on an equal footing
<doing programmed IO on very stupid devices than to have smart
<controllers, which seems to be the idea behind Henry Spencer's thinking.
I vote for smart device controllers, to which the O/S can download
software. This gives the O/S complete control, and still allows it
to optimally offload tedious tasks. Sort of like channel processors
on mainframes... What? You mean it's already been done?
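For the sake of illustration, the kind of thing one would download in
the channel-processor spirit might look like the sketch below. The
command set, structure layout, and chaining flag are invented here to
make the idea concrete, not taken from any real controller.

    /*
     * Hypothetical "channel program": a command list the host builds in
     * memory and hands to a smart controller, which runs it on its own
     * and interrupts only when the whole chain finishes or fails.
     */
    enum ccmd { C_NOP, C_SEEK, C_READ, C_WRITE, C_STOP };

    struct ccw {                     /* one controller command word */
        enum ccmd     cmd;
        unsigned long blkno;         /* block address                */
        unsigned      count;         /* blocks to transfer           */
        void         *buf;           /* host memory for the transfer */
        unsigned      flags;
    };

    #define CF_CHAIN 0x1             /* more commands follow */

    /* Read three scattered extents with one host/controller exchange.
     * The buffer pointers would be filled in with real addresses.    */
    struct ccw prog[] = {
        { C_READ, 1040, 16, 0, CF_CHAIN },
        { C_READ, 2200,  8, 0, CF_CHAIN },
        { C_READ, 9115, 32, 0, 0        },
    };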
mike@sojurn.UUCP (Mike Sangrey) (04/21/91)
In article <ZU=9R=8@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>Here we're talking about putting file systems in smart processors. How about
>putting other stuff there?
>

The A-series computers from Unisys even put process task switching (if I
remember rightly) on separate hardware.
--
| UUCP-stuff: devon!sojurn!mike    |  "It muddles me rather"  |
| Slow-stuff: 832 Strasburg Rd.    |   Winnie the Pooh        |
|             Paradise, Pa. 17562  |   with apologies to      |
| Fast-stuff: (717) 442-8959       |   A. A. Milne            |