[comp.periphs.scsi] SCSI hiding geometry

rcd@ico.isc.com (Dick Dunn) (03/11/90)

Carrying on the discussion about SCSI hiding drive geometry...(I see
comp.periphs.scsi is in the discussion; should we also move it out of
.unix.aix to something more general?)...

mjacob@wonky.Sun.COM (Matt Jacob) writes:
...
> My own personal opinion is that geometry based filesystems are
> getting to be a bad microoptimization...

But SCSI is not the only interface around, and I think there are some open
questions about how much device-sensitivity you want in the mid level of
the file/disk system.  That is, if you've got a more traditional disk
interface (some of which are pretty high performance) you need to deal with
geometry.  Do you want to ignore geometry some of the time?  It gets harder
and harder to know how/where to make the cut.

(My own personal opinion, not necessarily well substantiated, is that SCSI
was at best premature, and at worst wrong, in trying to hide drive geometry
from the host system.)

>...With the coming of SCSI-2
> multiple command targets, it seems to me that one should just
> concentrate on getting requests out to the target as quickly
> as possible and let the microprocessor on the drive figure out
> the best order to do them in.

This raises a sticky issue of who's in control of the disk system.
Consider reliability issues.  Two examples come to mind.  First, in a UNIX
file system, you probably want to have some control over the order of
operations so that you can have some reasonable assurance that operations
on inodes, indirect blocks, directories, and data happen in a way that will
allow you a good chance for recovery if you crash while there are
operations in the queue.  Second, in a database it is essential that you be
able to control the sequencing of operations so that commits really commit,
journaling happens when you expect, etc.

Frankly, I don't want to trust J Random Microcoder to give a disk-write-
reordering algorithm that won't screw things up.  Even if I'm assured of
some sort of "fair" algorithm, trying to sequence things in the kernel to
compensate for all the possible variants of reordering sounds like a pain.
(It's also redundant in a perverse way:  You have to write code to un-do
decisions which are going to be made for you that you don't want.)

I think it would make the job of kernel folks a lot easier if they could
deal with interfaces which just attempt to be fast in a predictable way,
instead of trying to be smart.
-- 
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...Relax...don't worry...have a homebrew.

jesup@cbmvax.commodore.com (Randell Jesup) (03/12/90)

In article <1990Mar11.045128.17732@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>mjacob@wonky.Sun.COM (Matt Jacob) writes:
>...
>> My own personal opinion is that geometry based filesystems are
>> getting to be a bad microoptimization...
>
>But SCSI is not the only interface around, and I think there are some open
>questions about how much device-sensitivity you want in the mid level of
>the file/disk system.  That is, if you've got a more traditional disk
>interface (some of which are pretty high performance) you need to deal with
>geometry.  Do you want to ignore geometry some of the time?  It gets harder
>and harder to know how/where to make the cut.

	You can easily separate the levels when you have a "traditional" disk
interface.  Under AmigaDos, on the A590 SCSI/"PC Bus Drive" HD interface,
you can send direct SCSI commands to either the SCSI bus, or the "PC Bus"
drives (the driver deals with them).

>(My own personal opinion, not necessarily well substantiated, is that SCSI
>was at best premature, and at worst wrong, in trying to hide drive geometry
>from the host system.)

	Ah, but it doesn't!  Use READ_CAPACITY in the "tell me where the next
slowdown in read is" mode.  This allows you to build a list of groups of
sectors that are "fast", and know where the breaks are.  Note that this 
handles Zone-Recorded drives quite well, while still allowing the FS to
know the geometry (who ever said disks had to be regular arrays? Even the
old PET disks used Zone recording...)
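
A rough sketch of what that query looks like on the wire, assuming the
"next slowdown" mode means READ CAPACITY with the PMI (Partial Medium
Indicator) bit set. The function names are made up for illustration; only
the opcode and byte layout come from the SCSI command definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define READ_CAPACITY_10 0x25  /* SCSI READ CAPACITY opcode */
#define PMI_BIT          0x01  /* Partial Medium Indicator, bit 0 of CDB byte 8 */

/* Fill a 10-byte CDB asking: "starting at lba, what is the last block
 * I can reach before a substantial delay?"  (hypothetical helper name) */
static void build_read_capacity_pmi(uint8_t cdb[10], uint32_t lba)
{
    assert(cdb != NULL);
    memset(cdb, 0, 10);
    cdb[0] = READ_CAPACITY_10;
    cdb[2] = (uint8_t)(lba >> 24);   /* LBA is big-endian in bytes 2..5 */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[8] = PMI_BIT;
}

/* Decode one big-endian 32-bit field of the 8-byte reply:
 * bytes 0..3 hold the returned LBA, bytes 4..7 the block length. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Calling this repeatedly, each time starting one block past the LBA the
drive returned, would give the list of "fast" groups described above.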

>This raises a sticky issue of who's in control of the disk system.
>Consider reliability issues.  Two examples come to mind.  First, in a UNIX
>file system, you probably want to have some control over the order of
>operations so that you can have some reasonable assurance that operations
>on inodes, indirect blocks, directories, and data happen in a way that will
>allow you a good chance for recovery if you crash while there are
>operations in the queue.  Second, in a database it is essential that you be
>able to control the sequencing of operations so that commits really commit,
>journaling happens when you expect, etc.

	You can still do this under SCSI, though it may be slightly less
simple than straight Read/Write commands (though I think you can force
serialization of writes pretty easily).

>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

	The interfaces are only as smart as you want them to be.  Filesystems
are the "customer" of essentially all SCSI drives; and they're set up pretty
well to make things nice for filesystems (and drivers).  Also, every SCSI
drive I've seen has defaults that are "safe" - no reordering of writes, etc.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

taylor@anthrax.Solbourne.COM (Dick Taylor) (03/13/90)

In article <1990Mar11.045128.17732@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>...
>mjacob@wonky.Sun.COM (Matt Jacob) writes:
>>...With the coming of SCSI-2
>> multiple command targets, it seems to me that one should just
>> concentrate on getting requests out to the target as quickly
>> as possible and let the microprocessor on the drive figure out
>> the best order to do them in.
>
>This raises a sticky issue of who's in control of the disk system.
>...
>
>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>reordering algorithm that won't screw things up. 

And this is the root of the debate.  It's a question of trust and shared 
authority.  It balances the definite benefit of farming out the grunt work
(do you REALLY want per-sector interrupts in the operating system?) against 
the loss of critical control over the order of operations and error recovery.

Multiprocessor systems (and anything that has a CPU and a separate SCSI disk
drive is a multiprocessor system, like it or not) have advantages in speed 
and disadvantages in complexity and potential for trouble.  The disadvantages
are normally mitigated by careful design.  When you're adding a SCSI device
to a UNIX filesystem, however, you're denied a lot of things that would be
useful.

As another poster pointed out, UNIX has certain things (inodes, user database 
information, and so on) where the order of operations makes a critical 
difference.  It also has data where the order of writes may be very 
unimportant.  NONE of this information about the data is passed down through 
the driver level to the drive.  Without that, optimization algorithms can 
make guesses (based on buffer header contents, size and location of requests,
and context within an operation), but the guesses are never guaranteed.  Add 
in the indifferent way that many companies seem to implement their firmware 
and there's not a lot of room for trust.

Nonetheless, there are companies (including one which I used to work for) that
have made a reputation and quite a chunk of change improving the speed of the 
UNIX filesystem.  The benefits, which can be substantial given the partially 
brain-dead way that UNIX generates I/O requests, outweigh the problems.

>...
>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

Speaking as a kernel folk, I'd have to agree, with a major addition.  I'd 
rather have a device that's fast than one that tries to be smart.  But I'd 
really rather have one that IS smart and that can take some of the load off 
of my CPU, which has better things to do than optimize I/O requests.

SCSI, good or bad, hides the drive geometry from the kernel.  It also
gives the drive a lot of control over the actual execution of a request.
Given this, I think that Mr. Jacob's original statement is a better way
of thinking about the role of the OS between the filesystem and the drive,
and that we need to concentrate where we can on improving UNIX's ability
to handle a multiprocessor filesystem.

mats@alruna.UUCP (Mats Wichmann) (03/14/90)

mjacob@wonky.Sun.COM (Matt Jacob) writes:

>>...With the coming of SCSI-2 multiple command targets, it seems to me that
>> one should just concentrate on getting requests out to the target as quickly
>> as possible and let the microprocessor on the drive figure out the best 
>> order to do them in.

rcd@ico.isc.com (Dick Dunn) writes:

>This raises a sticky issue of who's in control of the disk system.
>Consider reliability issues.  Two examples come to mind.  First, in a UNIX
>file system, you probably want to have some control over the order of
>operations so that you can have some reasonable assurance that operations
>on inodes, indirect blocks, directories, and data happen in a way that will
>allow you a good chance for recovery if you crash while there are
>operations in the queue.  Second, in a database it is essential that you be
>able to control the sequencing of operations so that commits really commit,
>journaling happens when you expect, etc.

>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>reordering algorithm that won't screw things up.  Even if I'm assured of
>some sort of "fair" algorithm, trying to sequence things in the kernel to
>compensate for all the possible variants of reordering sounds like a pain.
>(It's also redundant in a perverse way:  You have to write code to un-do
>decisions which are going to be made for you that you don't want.)

>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

Smart is in the eyes of the beholder.  Microcomputers still suffer from I/O
problems, compared to "big" machines.  One commonly quoted "difference"
between micros and big iron is that micros still wait for the right
sector to come around before reading it. Big Iron always has the I/O system
doing something.   So say you want to build a controller that takes a
request, seeks to the right track, and starts pulling data into the buffer
right there, up until the right sector spins around.  That doesn't cost you
anything more - otherwise you just wait - and it might gain something.
Then, just for kicks, if there is no other pending request, or if another
request in the queue is in the same area, you can go on and read the rest
of the track.  If there is something more important to do, you go do that
right away (we built such controllers at Dual Systems long ago (long ago?
six years?) - where, incidentally, Matt Jacob also worked).
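
To put a number on the "doesn't cost you anything" part, here is a small
sketch of the arithmetic. It assumes sectors are numbered 0..spt-1 in
rotation order and that the head can start reading the moment it settles;
the function name is invented for the example.

```c
#include <assert.h>

/* Number of sectors that pass under the head between arriving on the
 * track and reaching the requested sector.  These can be pulled into
 * the buffer during time that would otherwise be spent waiting. */
static unsigned free_sectors(unsigned arrival, unsigned target, unsigned spt)
{
    assert(spt > 0 && arrival < spt && target < spt);
    return (target + spt - arrival) % spt;
}
```

With 17 sectors per track, landing at sector 3 while wanting sector 10
gives 7 sectors read for free; landing at sector 10 while wanting sector 3
gives 10.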

That's not a "geometry-based file system" (although you can do things to
your file system to make this scheme work better), but it's something you
can't do on SCSI because your controller (Host Adaptor, to be more precise)
doesn't get to know enough about the geometry.  Instead, you have to decide
ahead of time which sector numbers you want and ask for them; you can't ever
pull the trick of reading _before_ the target sector.  Maybe that idea
isn't really current any more; I've also worked with some controller people
who felt that the "only thing that mattered" was getting data into the
kernel buffer cache as quickly as possible - bypassing controller buffering
except a small amount to serve as a FIFO.  So no value judgement here...

What worries me, like Dick, is that in SCSI, the real "controller" is on
the drive.  If Imprimis or whoever decides to make a drive that they expect
to sell the vast majority of to the UNIX-box market (whatever that is), and
they hire programmers (and specifiers) who really understand what that
market is, maybe we get something that matches the needs of a UNIX vendor.
If the vendor thinks that 92% of its drives are going to go into DOS or OS/2
machines, we don't.  Then you start worrying about different groups within
the "UNIX market" - traditional AT&T filesystems, BSD "Fast" File systems,
file systems enhanced for "commercial" use (TP, and such like).  Do you
really want to leave this up to the drive vendors, rather than people like,
say, Interphase, who supposedly has more detailed expertise about a smaller
segment of the overall drive market?

>Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
>   ...Relax...don't worry...have a homebrew.
Mmm, good idea. I've still got some of that Black Death Stout in the back...

-mats wichmann

bjstaff@zds-ux.UUCP (Brad Staff) (03/14/90)

What critical item did I miss in the discussion about SCSI-2 devices
sorting their work lists?  As near as I can tell, disk drivers (System
V/386 3.2 at least) do this all the time.  In fact, the System V kernel
provides a routine, disksort(), for this very purpose!
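
For readers who haven't looked at this kind of sorting, here is a minimal
sketch of the general one-way elevator idea. It is not the System V
disksort() source; the names and the array-based interface are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* One-way elevator ordering: cylinders at or beyond the head position
 * are served on the current sweep (ascending); cylinders behind the
 * head wait for the next sweep (also ascending). */
static int before(int a, int b, int head)
{
    int a_next = a < head;   /* 1 if a must wait for the next sweep */
    int b_next = b < head;
    if (a_next != b_next)
        return a_next < b_next;
    return a < b;
}

/* Insertion sort of pending cylinder numbers into elevator order. */
static void elevator_sort(int *cyl, size_t n, int head)
{
    assert(cyl != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        int v = cyl[i];
        size_t j = i;
        while (j > 0 && before(v, cyl[j - 1], head)) {
            cyl[j] = cyl[j - 1];
            j--;
        }
        cyl[j] = v;
    }
}
```

With the head at cylinder 50 and requests for 10, 95, 40, 60, 5, the
resulting order is 60, 95, 5, 10, 40.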

After looking around in os/bio.c, I found three routines the kernel
uses for writing buffers to disk:  bwrite(), bdwrite(), and bawrite().
bwrite() initiates the write by calling strategy() on the buffer, and
then waits for the write to complete by calling iowait() on the buffer.
bdwrite() sets the B_DELWRI and B_DONE flags in the buffer and then
releases it.  It will be written out at some later time.
bawrite() initiates the write by calling strategy() on the buffer, but
doesn't wait for the write to complete.

When the System V kernel really cares about the order of writes, it uses
bwrite().  Otherwise, it might use bdwrite() or bawrite().  I don't see
any problem here.
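
A toy user-space model of the three routines as described above, for
anyone who wants to see why bwrite() gives an ordering guarantee. This is
a sketch, not the os/bio.c code: the FIFO "device", device_complete(), and
the on_disk log are invented for the illustration, and real buffers carry
much more state.

```c
#include <assert.h>
#include <stddef.h>

#define B_DONE   0x01   /* I/O finished, or buffer contents valid */
#define B_DELWRI 0x02   /* delayed write: dirty, not yet queued */
#define B_ASYNC  0x04   /* caller does not wait for completion */

struct buf { int blkno; int flags; };

#define QMAX 16                  /* toy limit on total requests */
static struct buf *queue[QMAX];  /* requests handed to the driver */
static int q_head, q_tail;
static int on_disk[QMAX];        /* block numbers in completion order */
static int on_disk_len;

/* Toy strategy(): just queue the request. */
static void strategy(struct buf *bp)
{
    assert(q_tail < QMAX);
    queue[q_tail++] = bp;
}

/* Toy device: completes the oldest queued request, if any. */
static void device_complete(void)
{
    if (q_head < q_tail) {
        struct buf *bp = queue[q_head++];
        on_disk[on_disk_len++] = bp->blkno;
        bp->flags |= B_DONE;
    }
}

static void iowait(struct buf *bp)
{
    while (!(bp->flags & B_DONE))
        device_complete();
}

/* Synchronous: start the write and wait until it is done. */
static void bwrite(struct buf *bp)
{
    bp->flags &= ~(B_DONE | B_DELWRI | B_ASYNC);
    strategy(bp);
    iowait(bp);
}

/* Asynchronous: start the write, do not wait. */
static void bawrite(struct buf *bp)
{
    bp->flags &= ~(B_DONE | B_DELWRI);
    bp->flags |= B_ASYNC;
    strategy(bp);
}

/* Delayed: mark dirty and release; written at some later time. */
static void bdwrite(struct buf *bp)
{
    bp->flags |= B_DELWRI | B_DONE;
}
```

The point of the model: because bwrite() does not return until the block
is on the disk, nothing the caller issues afterwards can land ahead of it,
no matter how the driver or drive orders its queue.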
-- 
Brad Staff               |
Zenith Data Systems      | "A government that can forbid certain
616-982-5791             |  psychoactive drugs can mandate others."
...!uunet!zds-ux!bjstaff |	- Russell Turpin

jlohmeye@entec.Wichita.NCR.COM (John Lohmeyer) (03/15/90)

In article <1990Mar11.220934.23771@light.uucp> bvs@light.UUCP (Bakul Shah) writes:
>In article <1990Mar11.045128.17732@ico.isc.com> rcd@ico.isc.com (Dick Dunn)
>writes:
>>    [deleted]
>>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>>reordering algorithm that won't screw things up.  Even if I'm assured of
>>some sort of "fair" algorithm, trying to sequence things in the kernel to
>>compensate for all the possible variants of reordering sounds like a pain.
>>(It's also redundant in a perverse way:  You have to write code to un-do
>>decisions which are going to be made for you that you don't want.)
>
>I will second that.
>
>Some more points:
>
 [Interesting points about why the os can do it better omitted in follow-up]

You guys really ought to read the SCSI-2 draft standard before complaining
about "smart" disks and controllers.  There are methods to control or not 
control these features to your heart's content.

If you want to control queue re-ordering, use an ORDERED queue tag. If you
want to see the drive geometry, there are gobs of controls in the mode
pages.  You can even deal with notched drives (a.k.a., zone bit recording),
but it is ugly.  Most people would rather let the drive deal with the variable
number of blocks per cylinder than try to manage it.
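
As a small illustration of the queue tag control, here is a sketch of the
two-byte SCSI-2 queue tag messages. The message codes are the ones in the
SCSI-2 message table; the helper names and the "metadata gets ORDERED"
policy are assumptions for the example, not anything the standard
mandates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SCSI-2 two-byte queue tag messages: message code, then the tag value. */
enum queue_tag_msg {
    MSG_SIMPLE_QUEUE_TAG  = 0x20,  /* target may reorder this command */
    MSG_HEAD_OF_QUEUE_TAG = 0x21,  /* run ahead of already-queued commands */
    MSG_ORDERED_QUEUE_TAG = 0x22   /* run after all earlier commands and
                                      before all later ones */
};

/* Build the two message bytes sent along with a tagged command. */
static void build_queue_tag(uint8_t msg[2], enum queue_tag_msg kind, uint8_t tag)
{
    assert(msg != NULL);
    msg[0] = (uint8_t)kind;
    msg[1] = tag;
}

/* Hypothetical host policy: writes whose order matters (inodes,
 * directories, log records) go out ORDERED, everything else SIMPLE. */
static enum queue_tag_msg tag_for_write(int order_matters)
{
    return order_matters ? MSG_ORDERED_QUEUE_TAG : MSG_SIMPLE_QUEUE_TAG;
}
```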

There are controls to enable/disable automatic sparing.  There is a READ LONG
command that you can use to try to recover trashed blocks.  You can even
control whether or not error recovery is employed.

In short, if you really want to manage these things, you can do so.  If you
would rather spend your time on other things and let the drive manage itself,
you can do that.

Please send me email if there are any controls we left out -- there is always
SCSI-3.  :-)

-- 
John Lohmeyer         J.Lohmeyer@Wichita.NCR.COM
NCR Corp.             uunet!ncrlnk!ncrwic!entec!jlohmeye
3718 N. Rock Rd.      Voice: 316-636-8703
Wichita, KS 67226     SCSI BBS 316-636-8700 300/1200/2400 24 hours

mjacob@wonky.Sun.COM (Matt Jacob) (03/15/90)

[ Sorry- my machine was down for a couple of days so I am late in responding
  to this.. ]

>...
>> My own personal opinion is that geometry based filesystems are
>> getting to be a bad microoptimization...
>
>But SCSI is not the only interface around, and I think there are some open
>questions about how much device-sensitivity you want in the mid level of
>the file/disk system.  That is, if you've got a more traditional disk
>interface (some of which are pretty high performance) you need to deal with
>geometry.  Do you want to ignore geometry some of the time?  It gets harder
>and harder to know how/where to make the cut.
>
>(My own personal opinion, not necessarily well substantiated, is that SCSI
>was at best premature, and at worst wrong, in trying to hide drive geometry
>from the host system.)
>

Ah, but SCSI wasn't premature- it was/is an extension of the IBM channel
concept to smaller lower-cost machines.

Granted, more 'traditional' disk interfaces need and should allow the
main CPU to know and take advantage of disk geometry. However, the
256-512kb of code to handle the 4.3 filesystem can be considered *wasted*
main CPU cycles if you can offload the processing.

>>...With the coming of SCSI-2
>> multiple command targets, it seems to me that one should just
>> concentrate on getting requests out to the target as quickly
>> as possible and let the microprocessor on the drive figure out
>> the best order to do them in.
>
>This raises a sticky issue of who's in control of the disk system.
>Consider reliability issues.  Two examples come to mind.  First, in a UNIX
>file system, you probably want to have some control over the order of
>operations so that you can have some reasonable assurance that operations
>on inodes, indirect blocks, directories, and data happen in a way that will
>allow you a good chance for recovery if you crash while there are
>operations in the queue.  Second, in a database it is essential that you be
>able to control the sequencing of operations so that commits really commit,
>journaling happens when you expect, etc.

There are quite adequate mechanisms in SCSI to handle this (e.g., the *real*
use of linked commands, which provide means for specifying atomic operations
w.r.t. multiple sets of i/o from a single initiator).

It is true that Unix itself does not provide good hooks for reliability
or database sequencing, but to criticize SCSI for allowing you to do
things your OS can't handle well to begin with is the tail wagging the
dog.

>
>Frankly, I don't want to trust J Random Microcoder to give a disk-write-
>reordering algorithm that won't screw things up.  Even if I'm assured of
>some sort of "fair" algorithm, trying to sequence things in the kernel to
>compensate for all the possible variants of reordering sounds like a pain.
>(It's also redundant in a perverse way:  You have to write code to un-do
>decisions which are going to be made for you that you don't want.)
>

Now this is a valid point, in a way. I've gone over this issue in several
different contexts (having been a microcoder in my dim past). In the
case where you have more than one decision maker, *one* must make the
decisions as to optimal i/o ordering, etc., else chaos results.

In the case of distributed I/O subsystems (SCSI or otherwise), I have
found that you *have* to do things like *not* disksort on the stub cpu
side of things. If you have the BSD filesystem, you *must* specify
things like 0 rotational delay, etc., in order to *not* have the
filesystem and the i/o subsystem cancel each other out.

Ideally, one would like a filesystem to form requests that have
precedence, priority, and cache-retention parameters. That is, the
filesystem associates with each piece of data it wants transferred loose
statements like:

	"Write this *NOW*"

	"Write this, and hang on to it, 'coz I'll likely ask for it back soon."

	"Write this *before* Reading *that*"

and so on. I feel that we (as in the Unix commercial marketplace) are
very far from that (flame on, everyone!)....
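
To make those loose statements concrete, here is one way such a request
might be represented. Everything in it is hypothetical (the struct, the
flag names, and the pick_next() policy); no Unix kernel or SCSI interface
is being described, and request ids are assumed to be small integers that
index the done[] array.

```c
#include <assert.h>
#include <stddef.h>

enum io_hint {
    IO_WRITE_NOW    = 1 << 0,  /* "Write this *NOW*" */
    IO_RETAIN_CACHE = 1 << 1   /* "Write this, and hang on to it" */
};

struct io_req {
    int      id;        /* index into done[] */
    int      blkno;
    unsigned hints;     /* bitwise OR of enum io_hint values */
    int      after_id;  /* "this *after* that": id that must finish
                           first, or -1 for no constraint */
};

/* Return the index of the next request the I/O subsystem may start,
 * or -1 if none is eligible.  A request is eligible when it is not yet
 * done and its predecessor (if any) is done.  WRITE_NOW requests jump
 * the queue; otherwise the first eligible request wins. */
static int pick_next(const struct io_req *reqs, size_t n, const int *done)
{
    int best = -1;
    assert(reqs != NULL && done != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct io_req *r = &reqs[i];
        if (done[r->id])
            continue;
        if (r->after_id >= 0 && !done[r->after_id])
            continue;
        if (r->hints & IO_WRITE_NOW)
            return (int)i;
        if (best < 0)
            best = (int)i;
    }
    return best;
}
```

The subsystem stays free to reorder anything that carries no constraint,
which is the division of labor the paragraph above is asking for.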


>I think it would make the job of kernel folks a lot easier if they could
>deal with interfaces which just attempt to be fast in a predictable way,
>instead of trying to be smart.

For about two years at Sun, I had posted on my office door
a one-page printout (well, it was small font) entitled
"The Ideal and Perfect Driver". It was for the PDP-11 RK05
removable 2.5mb drive.

Also, I have kicking around at home a 200-odd word pdp-11 assembler
language rm03 driver I wrote for RT-11.

These are *very* simple. Unfortunately, I have not been able to
beg, plead, extort, bribe, or otherwise convince hardware engineers
to take such simple interfaces and run them up to a decent speed.
Ergo, complexity in s/w has been a natural result.

-matt