rcd@ico.isc.com (Dick Dunn) (03/11/90)
Carrying on the discussion about SCSI hiding drive geometry... (I see comp.periphs.scsi is in the discussion; should we also move it out of .unix.aix to something more general?)...

mjacob@wonky.Sun.COM (Matt Jacob) writes:
...
> My own personal opinion is that geometry based filesystems are
> getting to be a bad microoptimization...

But SCSI is not the only interface around, and I think there are some open questions about how much device-sensitivity you want in the mid level of the file/disk system. That is, if you've got a more traditional disk interface (some of which are pretty high performance) you need to deal with geometry. Do you want to ignore geometry some of the time? It gets harder and harder to know how/where to make the cut.

(My own personal opinion, not necessarily well substantiated, is that SCSI was at best premature, and at worst wrong, in trying to hide drive geometry from the host system.)

>...With the coming of SCSI-2
> multiple command targets, it seems to me that one should just
> concentrate on getting requests out to the target as quickly
> as possible and let the microprocessor on the drive figure out
> the best order do them in.

This raises a sticky issue of who's in control of the disk system. Consider reliability issues. Two examples come to mind. First, in a UNIX file system, you probably want to have some control over the order of operations so that you can have some reasonable assurance that operations on inodes, indirect blocks, directories, and data happen in a way that will allow you a good chance for recovery if you crash while there are operations in the queue. Second, in a database it is essential that you be able to control the sequencing of operations so that commits really commit, journaling happens when you expect, etc.

Frankly, I don't want to trust J Random Microcoder to give a disk-write-reordering algorithm that won't screw things up. Even if I'm assured of some sort of "fair" algorithm, trying to sequence things in the kernel to compensate for all the possible variants of reordering sounds like a pain. (It's also redundant in a perverse way: You have to write code to un-do decisions which are going to be made for you that you don't want.)

I think it would make the job of kernel folks a lot easier if they could deal with interfaces which just attempt to be fast in a predictable way, instead of trying to be smart.
--
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...Relax...don't worry...have a homebrew.
jesup@cbmvax.commodore.com (Randell Jesup) (03/12/90)
In article <1990Mar11.045128.17732@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>mjacob@wonky.Sun.COM (Matt Jacob) writes:
>...
>> My own personal opinion is that geometry based filesystems are
>> getting to be a bad microoptimization...
>
>But SCSI is not the only interface around, and I think there are some open
>questions about how much device-sensitivity you want in the mid level of
>the file/disk system. That is, if you've got a more traditional disk
>interface (some of which are pretty high performance) you need to deal with
>geometry. Do you want to ignore geometry some of the time? It gets harder
>and harder to know how/where to make the cut.

You can easily separate the levels when you have a "traditional" disk interface. Under AmigaDos, on the A590 SCSI/"PC Bus Drive" HD interface, you can send direct SCSI commands to either the SCSI bus, or the "PC Bus" drives (the driver deals with them).

>(My own personal opinion, not necessarily well substantiated, is that SCSI
>was at best premature, and at worst wrong, in trying to hide drive geometry
>from the host system.)

Ah, but it doesn't! Use READ_CAPACITY in the "tell me where the next slowdown in read is" mode. This allows you to build a list of groups of sectors that are "fast", and know where the breaks are. Note that this handles Zone-Recorded drives quite well, while still allowing the FS to know the geometry (who ever said disks had to be regular arrays? Even the old PET disks used Zone recording...)

>This raises a sticky issue of who's in control of the disk system.
>Consider reliability issues. Two examples come to mind. First, in a UNIX
>file system, you probably want to have some control over the order of
>operations so that you can have some reasonable assurance that operations
>on inodes, indirect blocks, directories, and data happen in a way that will
>allow you a good chance for recovery if you crash while there are
>operations in the queue. Second, in a database it is essential that you be
>able to control the sequencing of operations so that commits really commit,
>journaling happens when you expect, etc.

You can still do this under SCSI, though it may be slightly less simple than straight Read/Write commands (though I think you can force serialization of writes pretty easily).

>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

The interfaces are only as smart as you want them to be. Filesystems are the "customer" of essentially all SCSI drives; and they're set up pretty well to make things nice for filesystems (and drivers). Also, every SCSI drive I've seen has defaults that are "safe" - no reordering of writes, etc.
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"
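
The zone-walking idea in the post above can be sketched roughly as follows. This assumes the "next slowdown" mode is READ CAPACITY with the PMI (partial medium indicator) bit set, which reports the last block reachable from a given address before a substantial transfer delay. The `read_capacity_pmi` callable and the fake drive are hypothetical stand-ins for illustration, not a real driver interface.

```python
from bisect import bisect_left


def fast_groups(read_capacity_pmi, last_lba):
    """Walk the disk and return (first, last) block ranges with no break inside."""
    groups = []
    lba = 0
    while lba <= last_lba:
        end = read_capacity_pmi(lba)  # last block before the next slowdown
        groups.append((lba, end))
        lba = end + 1                 # continue just past the break
    return groups


def make_fake_drive(boundaries):
    """Hypothetical drive: `boundaries` is a sorted list of each group's last block."""
    def read_capacity_pmi(lba):
        return boundaries[bisect_left(boundaries, lba)]
    return read_capacity_pmi


# Toy zone-recorded disk: outer groups hold more blocks than inner ones.
drive = make_fake_drive([39, 79, 109, 139, 159])
groups = fast_groups(drive, last_lba=159)
print(groups)
```

The group sizes come out unequal (40, 40, 30, 30, 20 blocks), which is the point: the host learns where the breaks are without assuming a regular cylinder/head/sector array.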
taylor@anthrax.Solbourne.COM (Dick Taylor) (03/13/90)
In article <1990Mar11.045128.17732@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>...
>mjacob@wonky.Sun.COM (Matt Jacob) writes:
>>...With the coming of SCSI-2
>> multiple command targets, it seems to me that one should just
>> concentrate on getting requests out to the target as quickly
>> as possible and let the microprocessor on the drive figure out
>> the best order do them in.
>
>This raises a sticky issue of who's in control of the disk system.
>...
>
>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>reordering algorithm that won't screw things up.

And this is the root of the debate. It's a question of trust and shared authority. It balances the definite benefit of farming out the grunt work (do you REALLY want per-sector interrupts in the operating system?) against the loss of critical control over the order of operations and error recovery.

Multiprocessor systems (and anything that has a CPU and a separate SCSI disk drive is a multiprocessor system, like it or not) have advantages in speed and disadvantages in complexity and potential for trouble. The disadvantages are normally mitigated by careful design.

When you're adding a SCSI device to a UNIX filesystem, however, you're denied a lot of things that would be useful. As another poster pointed out, UNIX has certain things (inodes, user database information, and so on) where the order of operations makes a critical difference. It also has data where the order of writes may be very unimportant. NONE of this information about the data is passed down through the driver level to the drive. Without that, optimization algorithms can make guesses (based on buffer header contents, size and location of requests, and context within an operation), but the guesses are never guaranteed. Add in the indifferent way that many companies seem to implement their firmware and there's not a lot of room for trust.

Nonetheless, there are companies (including one which I used to work for) that have made a reputation and quite a chunk of change improving the speed of the UNIX filesystem. The benefits, which can be substantial given the partially brain-dead way that UNIX generates I/O requests, outweigh the problems.

>...
>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

Speaking as a kernel folk, I'd have to agree, with a major addition. I'd rather have a device that's fast than one that tries to be smart. But I'd really rather have one that IS smart and that can take some of the load off of my CPU, which has better things to do than optimize I/O requests.

SCSI, good or bad, hides the drive geometry from the kernel. It also gives the drive a lot of control over the actual execution of a request. Given this, I think that Mr. Jacob's original statement is a better way of thinking about the role of the OS between the filesystem and the drive, and that we need to concentrate where we can on improving UNIX's ability to handle a multiprocessor filesystem.
mats@alruna.UUCP (Mats Wichmann) (03/14/90)
mjacob@wonky.Sun.COM (Matt Jacob) writes:
>>...With the coming of SCSI-2 multiple command targets, it seems to me that
>> one should just concentrate on getting requests out to the target as quickly
>> as possible and let the microprocessor on the drive figure out the best
>> order do them in.

rcd@ico.isc.com (Dick Dunn) writes:
>This raises a sticky issue of who's in control of the disk system.
>Consider reliability issues. Two examples come to mind. First, in a UNIX
>file system, you probably want to have some control over the order of
>operations so that you can have some reasonable assurance that operations
>on inodes, indirect blocks, directories, and data happen in a way that will
>allow you a good chance for recovery if you crash while there are
>operations in the queue. Second, in a database it is essential that you be
>able to control the sequencing of operations so that commits really commit,
>journaling happens when you expect, etc.

>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>reordering algorithm that won't screw things up. Even if I'm assured of
>some sort of "fair" algorithm, trying to sequence things in the kernel to
>compensate for all the possible variants of reordering sounds like a pain.
>(It's also redundant in a perverse way: You have to write code to un-do
>decisions which are going to be made for you that you don't want.)

>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

Smart is in the eyes of the beholder. Microcomputers still suffer from I/O problems, compared to "big" machines. One commonly quoted "difference" between micros and big iron is that micros still wait for the right sector to come around before reading it. Big Iron always has the I/O system doing something.

So say you want to build a controller that takes a request, seeks to the right track, and starts pulling data into the buffer right there, up until the right sector spins around. That doesn't cost you anything more - otherwise you just wait - and it might gain something. Then, just for kicks, if there is no other pending request, or if another request in the queue is in the same area, you can go on and read the rest of the track. If there is something more important to do, you go do that right away (we built such controllers at Dual Systems long ago (long ago? six years?) - where, incidentally, Matt Jacob also worked).

That's not a "geometry-based file system" (although you can do things to your file system to make this scheme work better), but it's something you can't do on SCSI because your controller (Host Adaptor, to be more precise) doesn't get to know enough about the geometry. Instead, you have to decide ahead of time which sector numbers you want and ask for them; can't ever pull the trick of reading _before_ the target sector.

Maybe that idea isn't really current any more; I've also worked with some controller people who felt that the "only thing that mattered" was getting data into the kernel buffer cache as quickly as possible - bypassing controller buffering except a small amount to serve as a FIFO. So no value judgement here...

What worries me, like Dick, is that in SCSI, the real "controller" is on the drive. If Imprimis or whoever decides to make a drive that they expect to sell the vast majority of to the UNIX-box market (whatever that is), and they hire programmers (and specifiers) who really understand what that market is, maybe we get something that matches the needs of a UNIX vendor. If they think that 92% of their drives are going to go into DOS or OS/2 machines, we don't.

Then you start worrying about different groups within the "UNIX market" - traditional AT&T filesystems, BSD "Fast" File systems, file systems enhanced for "commercial" use (TP, and such like). Do you really want to leave this up to the drive vendors, rather than people like, say, Interphase, who supposedly has more detailed expertise about a smaller segment of the overall drive market?

>Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
>   ...Relax...don't worry...have a homebrew.

Mmm, good idea. I've still got some of that Black Death Stout in the back...

-mats wichmann
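
The read-before-the-target trick described in the post above can be modelled in a few lines. This is only a toy model under stated assumptions: time is counted in whole sector times, the head settles exactly on a sector boundary, and the function names are hypothetical rather than taken from any real controller.

```python
def plain_read(head, target, sectors_per_track):
    """Wait for the target sector, then read it.

    Returns (sector times elapsed, sectors captured).
    """
    wait = (target - head) % sectors_per_track
    return wait + 1, [target]


def buffered_read(head, target, sectors_per_track):
    """Start reading as soon as the head settles, keeping everything up to the target."""
    wait = (target - head) % sectors_per_track
    captured = [(head + i) % sectors_per_track for i in range(wait + 1)]
    return wait + 1, captured


# Head settles over sector 5 of an 8-sector track; the request is for sector 2.
plain = plain_read(5, 2, 8)
buffered = buffered_read(5, 2, 8)
print(plain)
print(buffered)
```

Both reads take the same six sector times, but the buffering controller ends up holding sectors 5, 6, 7, 0 and 1 as well as the requested sector 2. Doing this needs to know where the head landed, which is exactly the information a SCSI host adaptor does not see.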
bjstaff@zds-ux.UUCP (Brad Staff) (03/14/90)
What critical item did I miss in the discussion about SCSI-2 devices sorting their work lists? As near as I can tell, disk drivers (System V/386 3.2 at least) do this all the time. In fact, the System V kernel provides a routine, disksort(), for this very purpose!

After looking around in os/bio.c, I found three routines the kernel uses for writing buffers to disk: bwrite(), bdwrite(), and bawrite().

bwrite() initiates the write by calling strategy() on the buffer, and then waits for the write to complete by calling iowait() on the buffer.

bdwrite() sets the B_DELWRI and B_DONE flags in the buffer and then releases it. It will be written out at some later time.

bawrite() initiates the write by calling strategy() on the buffer, but doesn't wait for the write to complete.

When the System V kernel really cares about the order of writes, it uses bwrite(). Otherwise, it might use bdwrite() or bawrite(). I don't see any problem here.
--
Brad Staff                 |
Zenith Data Systems        | "A government that can forbid certain
616-982-5791               |  psychoactive drugs can mandate others."
...!uunet!zds-ux!bjstaff   |                        - Russell Turpin
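
The three write paths described above can be sketched as a toy model. The method names follow the routines named in the post; everything else is an assumption made for illustration: the `ToyBufferCache` class is hypothetical, buffers are reduced to bare block numbers, `sorted()` stands in for disksort(), and draining the queue stands in for iowait().

```python
class ToyBufferCache:
    """Toy model of synchronous, asynchronous and delayed writes."""

    def __init__(self):
        self.queue = []    # requests handed to the driver, not yet on disk
        self.delayed = []  # delayed-write (B_DELWRI-style) blocks still in the cache
        self.on_disk = []  # block numbers in the order they reached the disk

    def strategy(self, blkno):
        self.queue.append(blkno)

    def drain(self):
        # The driver may sort whatever is queued at this moment.
        self.on_disk.extend(sorted(self.queue))
        self.queue.clear()

    def bwrite(self, blkno):
        self.strategy(blkno)
        self.drain()  # stand-in for iowait(): return only once the block is written

    def bawrite(self, blkno):
        self.strategy(blkno)  # start the write, do not wait for it

    def bdwrite(self, blkno):
        self.delayed.append(blkno)  # just mark it; it is written at some later time

    def flush_delayed(self):
        for blkno in self.delayed:
            self.strategy(blkno)
        self.delayed.clear()
        self.drain()


cache = ToyBufferCache()
cache.bawrite(90)  # order unimportant
cache.bawrite(10)  # order unimportant
cache.bwrite(50)   # must be on disk before the next call starts
cache.bwrite(5)
print(cache.on_disk)
```

Block 5 lands after block 50 even though a pure sort would put it first, because the second bwrite() is not issued until the first has completed. The guarantee rests entirely on "complete" meaning "on the medium".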
jlohmeye@entec.Wichita.NCR.COM (John Lohmeyer) (03/15/90)
In article <1990Mar11.220934.23771@light.uucp> bvs@light.UUCP (Bakul Shah) writes:
>In article <1990Mar11.045128.17732@ico.isc.com> rcd@ico.isc.com (Dick Dunn)
>writes:
>> [deleted]
>>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>>reordering algorithm that won't screw things up. Even if I'm assured of
>>some sort of "fair" algorithm, trying to sequence things in the kernel to
>>compensate for all the possible variants of reordering sounds like a pain.
>>(It's also redundant in a perverse way: You have to write code to un-do
>>decisions which are going to be made for you that you don't want.)
>
>I will second that.
>
>Some more points:
> [Interesting points about why the os can do it better omitted in follow-up]

You guys really ought to read the SCSI-2 draft standard before complaining about "smart" disks and controllers. There are methods to control or not control these features to your heart's content.

If you want to control queue re-ordering, use an ORDERED queue tag.

If you want to see the drive geometry, there are gobs of controls in the mode pages. You can even deal with notched drives (a.k.a., zone bit recording), but it is ugly. Most people would rather let the drive deal with the variable number of blocks per cylinder than try to manage it.

There are controls to enable/disable automatic sparing. There is a READ LONG command that you can use to try to recover trashed blocks. You can even control whether or not error recovery is employed.

In short, if you really want to manage these things, you can do so. If you would rather spend your time on other things and let the drive manage itself, you can do that.

Please send me email if there are any controls we left out -- there is always SCSI-3. :-)
--
John Lohmeyer       J.Lohmeyer@Wichita.NCR.COM
NCR Corp.           uunet!ncrlnk!ncrwic!entec!jlohmeye
3718 N. Rock Rd.    Voice: 316-636-8703
Wichita, KS 67226   SCSI BBS 316-636-8700 300/1200/2400 24 hours
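
How an ORDERED queue tag constrains a reordering drive can be illustrated with a small model. It assumes the usual reading of SCSI-2 tagged queuing: commands with simple tags may be executed in any order the target likes, while a command with an ordered tag executes after everything received before it and before everything received after it. The function below is a hypothetical model of a target, with `sorted()` standing in for whatever reordering the firmware does; head-of-queue tags are left out.

```python
def target_execution_order(commands):
    """Model a target's execution order.

    commands: list of (tag, lba) in arrival order; tag is "SIMPLE" or "ORDERED".
    Returns the lbas in the order the modelled target executes them.
    """
    order, batch = [], []
    for tag, lba in commands:
        if tag == "ORDERED":
            order.extend(sorted(batch))  # target reorders the simple commands freely
            batch = []
            order.append(lba)            # the ordered command acts as a barrier
        else:
            batch.append(lba)
    order.extend(sorted(batch))
    return order


arrivals = [
    ("SIMPLE", 700),
    ("SIMPLE", 20),
    ("ORDERED", 500),  # e.g. a commit record that must follow the two writes above
    ("SIMPLE", 300),
    ("SIMPLE", 10),
]
executed = target_execution_order(arrivals)
print(executed)
```

The two writes ahead of the barrier are swapped for the drive's convenience, and so are the two behind it, but nothing crosses block 500.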
mjacob@wonky.Sun.COM (Matt Jacob) (03/15/90)
[ Sorry- my machine was down for a couple of days so I am late in responding to this.. ]

>...
>> My own personal opinion is that geometry based filesystems are
>> getting to be a bad microoptimization...
>
>But SCSI is not the only interface around, and I think there are some open
>questions about how much device-sensitivity you want in the mid level of
>the file/disk system. That is, if you've got a more traditional disk
>interface (some of which are pretty high performance) you need to deal with
>geometry. Do you want to ignore geometry some of the time? It gets harder
>and harder to know how/where to make the cut.
>
>(My own personal opinion, not necessarily well substantiated, is that SCSI
>was at best premature, and at worst wrong, in trying to hide drive geometry
>from the host system.)
>

Ah, but SCSI wasn't premature- it was/is an extension of the IBM channel concept to smaller lower-cost machines. Granted, more 'traditional' disk interfaces need and should allow the main CPU to know and take advantage of disk geometry. However, the 256-512kb of code to handle the 4.3 filesystem can be considered *wasted* main CPU cycles if you can offload the processing.

>>...With the coming of SCSI-2
>> multiple command targets, it seems to me that one should just
>> concentrate on getting requests out to the target as quickly
>> as possible and let the microprocessor on the drive figure out
>> the best order do them in.
>
>This raises a sticky issue of who's in control of the disk system.
>Consider reliability issues. Two examples come to mind. First, in a UNIX
>file system, you probably want to have some control over the order of
>operations so that you can have some reasonable assurance that operations
>on inodes, indirect blocks, directories, and data happen in a way that will
>allow you a good chance for recovery if you crash while there are
>operations in the queue. Second, in a database it is essential that you be
>able to control the sequencing of operations so that commits really commit,
>journaling happens when you expect, etc.

There are quite adequate mechanisms in SCSI to handle this (e.g., the *real* use of linked commands, which provide means for specifying atomic operations w.r.t. multiple sets of i/o from a single initiator). It is true that Unix itself does not provide good hooks for reliability or database sequencing, but to criticize SCSI for allowing you to do things your OS can't handle well to begin with is the tail wagging the dog.

>
>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>reordering algorithm that won't screw things up. Even if I'm assured of
>some sort of "fair" algorithm, trying to sequence things in the kernel to
>compensate for all the possible variants of reordering sounds like a pain.
>(It's also redundant in a perverse way: You have to write code to un-do
>decisions which are going to be made for you that you don't want.)
>

Now this is a valid point, in a way. I've gone over this issue in several different contexts (having been a microcoder in my dim past). In the case where you have more than one decision maker, *one* must make the choice decisions as to optimal i/o ordering, etc., else chaos results. In the case of distributed I/O subsystems (SCSI or otherwise), I have found that you *have* to do things like *not* disksort on the stub cpu side of things. If you have the BSD filesystem, you *must* specify things like 0 rotational delay, etc., in order to *not* have the filesystem and the i/o subsystem cancel each other out.

Ideally, one would like a filesystem to form requests that have precedence, priority, and cache-retention parameters. That is, the filesystem associates with each piece of data it wants transferred loose statements like:

	"Write this *NOW*"
	"Write this, and hang on to it, 'coz I'll likely ask for it back soon."
	"Write this *before* Reading *that*"

and so on. I feel that we (as in the Unix commercial marketplace) are very far from that (flame on, everyone!)....

>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

For about two years at Sun, I had posted on my office door a one-page printout (well, it was small font) entitled "The Ideal and Perfect Driver". It was for the PDP-11 RK05 removable 2.5mb drive. Also, I have kicking around at home a 200-odd word pdp-11 assembler language rm03 driver I wrote for RT-11. These are *very* simple. Unfortunately, I have not been able to beg, plead, extort, bribe, or otherwise convince hardware engineers to take such simple interfaces and run them up to a decent speed. Ergo, complexity in s/w has been a natural result.

-matt