[comp.sys.sun] SCSI & IPI rates

lm@sun.eng.sun.com (Larry McVoy) (05/26/90)

In article <7563@brazos.Rice.edu> I wrote:
$ > So the question is, can a SCSI disk/controller that is rated at 4
$ >MB/s handle 2 MB/s for at least one minute? Preferably through the file
$ >system, but we could live with writes to the raw disk if necessary. And
$ >what about SMD and IPI (or IPI-2)?
$ 
$ I do I/O performance engineering.  You won't see 2meg / sec through SCSI
$ on any 4.1 system.  The fastest that I have seen is 1.3 on an IBM 320 meg
$ drive (nice drives - we don't sell them yet).  This was with a modified
$ kernel - your rates will be lower.

I should have been more careful here.  While it is true that it's hard to
get to 2Mbyte/sec through SCSI, it is by no means impossible.  You need a
drive with a data rate that high - you can compute an upper bound on the
data rate as follows [note: this assumes only one head active at a time, a
reasonable assumption for many drives]:

	Kbytes/sec = (nsects / 2) * (rpm / 60)

where nsects (sectors per track, 512 bytes each, hence the divide by two)
and rpm are available in /etc/format.dat.  Note that this is
an upper bound, not a lower bound.  Many drives reserve some of those
sectors for sector slipping, effectively reducing the platter speed of the
drive.  For example, I found an entry (commented out; what that means I
can only guess) for a CDC Wren VII 94601-12G that rotates at 3596 RPM and
has 80
sectors per track.  That works out to a max platter speed of 2398
Kbytes/second.  Getting 2 megs/sec off of this drive should be no sweat.
Putting it on is a little harder, since SCSI typically doesn't do 0
latency writes.  If you do your writes in larger chunks, say 120K at a
time, then you don't blow revolutions as often and can approach the 2 meg
rate.  It is critical that the writes be large;  naive OS implementations
send writes down to the drive in 8K (file system block size) chunks.  The
difference between that and 120K chunks is almost a factor of two.  On
experimental versions of SunOS I've seen rates go from 800K to 1300K just
by increasing the size of the writes sent to the drive.
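
To make the arithmetic concrete, here is a quick sketch of that
calculation in C, plugging in the Wren VII numbers quoted above (and
assuming 512-byte sectors and one active head, as noted):

    /* Upper bound on the platter transfer rate, from /etc/format.dat
     * parameters.  The numbers below are the Wren VII 94601-12G entry
     * mentioned above; 512-byte sectors assumed (hence nsect / 2).
     */
    #include <stdio.h>

    int
    main()
    {
        int    nsect = 80;      /* sectors per track (nsect in format.dat) */
        int    rpm   = 3596;    /* rotational speed  (rpm in format.dat)   */
        double kbs;

        kbs = (nsect / 2.0) * (rpm / 60.0);     /* Kbytes per second */
        printf("max platter rate: %.0f Kbytes/sec\n", kbs);
        return (0);
    }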

The high order bit here is that you have to look at your drive to see what
data rate you can expect.  And expect writes to be slower than reads, since
most drives have track buffers that are write-through caches (i.e., they
help only on reads).
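
If you'd rather measure than calculate, something along these lines will
give you the sequential write rate on the raw device.  This is only a
rough sketch - the device name, chunk size, and run length are made-up
examples, and it scribbles over the partition, so point it at scratch
space.  Switch the open() mode and write() to read() to compare reads
against writes.

    /* Rough sequential-throughput test: write big chunks to a raw
     * device and report Kbytes/sec.  The run below is 512 x 120K, or
     * 60 Mbytes.  Some raw drivers want a block-aligned buffer;
     * adjust if write() complains.
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define CHUNK   (120 * 1024)    /* large writes, per the discussion above */
    #define NCHUNKS 512

    static char buf[CHUNK];         /* contents don't matter for a rate test */

    int
    main()
    {
        struct timeval start, stop;
        double         secs;
        int            fd, i;

        fd = open("/dev/rsd1c", O_WRONLY);  /* example raw partition */
        if (fd < 0) {
            perror("open");
            return (1);
        }
        gettimeofday(&start, NULL);
        for (i = 0; i < NCHUNKS; i++)
            if (write(fd, buf, CHUNK) != CHUNK) {
                perror("write");
                return (1);
            }
        gettimeofday(&stop, NULL);
        secs = (stop.tv_sec - start.tv_sec) +
               (stop.tv_usec - start.tv_usec) / 1e6;
        printf("%.0f Kbytes/sec\n", NCHUNKS * (CHUNK / 1024.0) / secs);
        return (0);
    }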

$ IPI will do what you want - I've seen 2 megs through them with vanilla
$ 4.1.

I lied.  I went and tried it.  On SunOS 4.1 PSRA (shipping currently) with
CDC IPI 9720 drives, it's easy to get 2.2 or 2.3 megs / second on a single
drive, reads or writes.  Those drives have smart controllers and I believe
they have zero-latency write ability, so they don't have the problem of
blowing revs between each write.

$ Note that the SunOS VM system is cool - your perceived performance will be
$ much higher than the disk rate for small (< 4meg) writes since the kernel
$ just copies in the data and tells you that it is done and then proceeds to
$ dribble it out to the drive.

Someone sent me mail and complained about NFS saying that this wasn't so.
He's right.  Doing writes to an NFS file is similar to doing writes to a
UFS file where the file was opened with O_SYNC (the difference is that an
NFS file will cache small writes for a short time).  Write performance
over NFS suffers due to the stateless nature of NFS.  It is a requirement
for correct operation that the data be on the server's drive before the
server says OK to the client.  If this were not so then you would be in
serious trouble each time a server crashed.
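
You can get a feel for what that requirement costs by comparing a local
UFS file opened with and without O_SYNC, which is roughly the comparison
above.  A minimal sketch (the path and sizes are just examples; run it
under time(1) both ways):

    /* Write 1 Mbyte to a local file, 8K at a time.  With O_SYNC each
     * write() returns only after the data is on the disk, which is
     * roughly what an NFS server must do for every client write;
     * change O_SYNC to 0 to see the normal write-behind case.
     * (Error checking on write() omitted for brevity.)
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main()
    {
        char buf[8192];             /* one file system block */
        int  fd, i;

        fd = open("/tmp/synctest", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) {
            perror("open");
            return (1);
        }
        for (i = 0; i < 128; i++)   /* 128 x 8K = 1 Mbyte */
            write(fd, buf, sizeof(buf));
        close(fd);
        return (0);
    }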

Larry McVoy, Sun Microsystems    (415) 336-7627       ...!sun!lm or lm@sun.com

glenn@uunet.uu.net (Glenn Herteg) (05/29/90)

lm@sun.eng.sun.com (Larry McVoy) writes:
> Those drives have smart controllers and I believe they have zero
> latency write ability so they don't have the problem of blowing
> revs in between each write.

It seems to me that, for most file uses, you don't *want* zero-latency
writes.  You'd like the performance, of course, but the downside is
increased risk of damage should the machine crash.  As I recall, adding
"ordered writes" was a Feature of the first System V release, intended to
add robustness to file system consistency should the machine crash during
the file transfer.  Early UNIX file systems often had many, many problems
dredged up by fsck after a crash; these days, they're fairly rare, and I
think this forced consistency has a lot to do with it.  Databases, in
particular, need some kind of write-ordering semantics to guarantee proper
recording of transactions (isn't this what they call two-phase commit?).
There was a Bell System Technical Journal article some years ago that
discussed this issue and its relationship to UNIX file systems.
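
To illustrate the idea (at user level only - this is not what the System V
kernel does internally), a transaction has to force its log record to disk
before its commit record, e.g. with fsync().  A crude sketch, with made-up
file names:

    /* Write-ahead ordering for one "transaction" (error checks omitted;
     * file names are made up).  The point is only the ordering of the
     * fsync() calls, not the record format.
     */
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>

    int
    main()
    {
        char begin[]  = "begin:  update account 42\n";
        char commit[] = "commit: update account 42\n";
        int  log, data;

        log  = open("/tmp/txn.log",  O_WRONLY | O_CREAT | O_APPEND, 0644);
        data = open("/tmp/accounts", O_RDWR   | O_CREAT, 0644);

        write(log, begin, strlen(begin));
        fsync(log);                 /* 1: the intent is on disk */

        /* ... modify the data file here ... */
        fsync(data);                /* 2: the new data is on disk */

        write(log, commit, strlen(commit));
        fsync(log);                 /* 3: only now does the transaction count */
        return (0);
    }

A crash between steps 1 and 3 leaves a log entry that recovery can undo,
but never a commit record without the data behind it - which is the
ordering guarantee being discussed.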

> It is a requirement for correct operation that the data be on the
> server's drive before the server says OK to the client.  If this were
> not so then you would be in serious trouble each time a server crashed.

Exactly my point, but it doesn't just apply to NFS file systems.  Frankly,
given the number of bugs in UNIX software (yes, even SunOS 4.1 has its
share [*]), system crashes (or hangs, with user-forced reboots) are still
rather too common to ignore this issue of file system repair.  It's
certainly no fun poring through a massive fsck output listing and
wondering how much ancillary damage you might incur by choosing the wrong
order in which to repair the individual things that are bad in the file
system data structures, especially when you can't see the damage in
enough detail to understand what *really* went wrong (and which files
you'll need to restore from backup).  And if you're not already a UNIX
guru at the moment the machine crashes, you don't stand a chance of
deciphering all that gobbledygook about inodes anyway.

I suppose that, with the file system interfaces becoming more flexible,
you might eventually be able to substitute your own kind of file system in
a particular disk partition for recording bulk data (say, from a fast A/D
converter), not caring about recovering such data in the event of a crash.
Then you'd want to add ioctl()s (mount options) to the device driver to
tell it when to perform zero-latency writes and when not to, on a
per-partition basis.  Easy in theory, but you wouldn't want to endanger
the rest of your
file systems by insufficient testing of such capabilities.

I'm curious about a related aspect of hard disk drives and device drivers,
especially for SCSI devices for which you have this extra microcomputer
embedded on the drive and interposed between you and your media.  When
fsync() or msync() returns, has the data merely been written over the SCSI
bus into the drive's cache, or has it actually been written to the media?
Does this depend on the drive
manufacturer, or is there some standard SCSI command used on *all* drives
that probes for this kind of command completion?
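
For reference, the two calls in question look like this (a minimal sketch;
the path and sizes are examples, and the exact msync() flags vary by
release - check your man pages).  Both are supposed to block until the
kernel has pushed the data out; whether the drive's own cache has made it
to the media is exactly the question above.

    /* The two synchronization paths in question.  Error checks are
     * kept minimal.
     */
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <sys/mman.h>

    int
    main()
    {
        char buf[8192];
        char *p;
        int  fd;

        fd = open("/tmp/synctest", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return (1);

        /* write()/fsync() path */
        write(fd, buf, sizeof(buf));
        fsync(fd);                  /* should not return until the   */
                                    /* data has left the kernel      */

        /* mmap()/msync() path */
        p = mmap(0, sizeof(buf), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return (1);
        memset(p, 0, sizeof(buf));
        msync(p, sizeof(buf), MS_SYNC);     /* MS_SYNC: wait for the I/O */
        return (0);
    }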

[*] Sun is already reporting patches to SunOS 4.1 -- including one for a
problem in which file system blocks show up in files to which they do not
belong!  Just when will 4.1.1 be out? :-)