[comp.arch] Slow SCSI

ccplumb@rose.waterloo.edu (Colin Plumb) (11/18/89)

In article <35985@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> 1) Why are SCSI disk subsystems *so* slow.  On sequential reads, ~300 KB/sec.
> is typical.  The filesystems and all hardware interfaces should be able to
> easily sustain 3X as much on sequential reads of large, unfragmented files.  
> (Which SCSI disk subsystems you ask?  I would rather turn it around, and say,
> are there any exceptions to the above sweeping generalization?  :-)  I use
> SMD disks as a baseline...)

It's a common question.  I worked on the design of one SCSI subsystem where
we benchmarked a few local machines (VAX, Sequent) and got quite awful
performance figures.  250K/sec read/write average or so.

Of course, all the Amiga fans out there get to repeat what they said a month
or two ago and point out that a 7.14 MHz 68000 somehow manages to leave a
Sun 4/xxx in its dust, in this one area at least.  800 KB/sec is typical
for contiguous files with decent hardware.

One common reason is that the file systems are bloody awful.  The BSD
"Fast File System" will get, on a good day, 25% of the available disk
bandwidth.  A lot of it is just fragmentation, doing reads a block at
a time (so you have to turn around the SCSI bus and issue a new command
to get the next 8K, during which time, you might just miss the next
sector), and copying in the buffer cache.  Copying large expanses of
bytes around is just stupid if it can be avoided.
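
To get a feel for how much of that is per-request overhead, here's a
minimal user-level sketch (hypothetical, nothing to do with any particular
kernel): read a file in fixed-size chunks and discard the data, so the
system time you accumulate is syscall and buffer-cache-copy overhead.

/* readchunk.c -- hypothetical sketch, not from any real system.
 * Reads a file in fixed-size chunks and throws the data away.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char *buf;
    long chunk, total = 0;
    ssize_t n;
    int fd;

    if (argc != 3) {
        fprintf(stderr, "usage: %s file chunk-size\n", argv[0]);
        return 1;
    }
    chunk = atol(argv[2]);
    buf = malloc(chunk);
    fd = open(argv[1], O_RDONLY);
    if (buf == NULL || chunk <= 0 || fd < 0) {
        perror(argv[1]);
        return 1;
    }
    while ((n = read(fd, buf, chunk)) > 0)  /* one syscall per chunk */
        total += n;
    printf("read %ld bytes in %ld-byte chunks\n", total, chunk);
    close(fd);
    free(buf);
    return 0;
}

Time it with 8K chunks and then with 64K chunks; the difference in system
time is pure per-request cost.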

Did you have a look at the system time for copying a 1 meg file to
/dev/null?  On the 4.3 BSD microvax I'm on, it's 0.7 seconds.
(5 seconds real, but the load average is 2.7 and I shouldn't be reading
news.)  Why on earth does it take 700,000 instructions to find 256 4K blocks?
I assume "cp" and /dev/null aren't doing anything grotesquely stupid, but
I may be wrong... anyway, *something* takes far too long.

> 2) Is the reason lack of gather/scatter on (inexpensive) SCSI controllers?

It's complicated.  Requiring that blocks be above a certain size helps, and
most systems just want the page size anyway, but scatter/gather adds
complexity beyond the simple incrementer usually hooked up to the address bus.

I like it, however.  It saves the system architect from having to either
subvert the page allocator to get contiguous physical pages for a long
data transfer (tricky on writes when you don't know beforehand that the
user will be writing from that memory), do lots of data copying, or issue
I/O requests a page at a time.
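
For what it's worth, scatter/gather just means the controller takes a list
of (physical address, length) pairs instead of a single base address and a
count.  A sketch of the idea -- the descriptor format and the virt_to_phys()
hook here are made up, not any real controller's or kernel's:

/* Hypothetical scatter/gather descriptor list.  The DMA engine walks
 * the list, so one SCSI transfer can span non-contiguous physical pages.
 */
#include <stddef.h>

struct sg_entry {
    unsigned long phys_addr;   /* physical address of this fragment */
    unsigned long length;      /* bytes to transfer from/to it      */
};

/* Build a descriptor list covering 'nbytes' starting at a virtual
 * address, one entry per (possibly non-contiguous) physical page.
 * virt_to_phys() and page_size stand in for whatever the host's VM
 * system actually provides.
 */
size_t build_sg_list(struct sg_entry *sg, size_t max_entries,
                     char *vaddr, size_t nbytes,
                     unsigned long (*virt_to_phys)(char *),
                     size_t page_size)
{
    size_t n = 0;

    while (nbytes > 0 && n < max_entries) {
        size_t in_page = page_size - ((unsigned long)vaddr % page_size);
        size_t len = nbytes < in_page ? nbytes : in_page;

        sg[n].phys_addr = virt_to_phys(vaddr);
        sg[n].length = len;
        vaddr += len;
        nbytes -= len;
        n++;
    }
    return n;   /* number of descriptors the controller should walk */
}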

> 3) Are there implementation limitations of the new RISC-based systems which 
> make I/O cost more than on older systems?  (e.g. VAX or 68K).  Is there
> something about SCSI in particular that is a problem?

No, it's just that the processors cost *less*, so I/O cost gets squeezed
by the same amount, but the high-volume chips that enable performance at
low cost aren't as easy to find.  Also, people have discussed the
microcomputer origins of a lot of RISC designs, while good I/O subsystems
are the realm of Big Iron and all things IBM-ish.

Asynchronous SCSI isn't blindingly fast, but it seems to be faster than the
bit clocks on a lot of small (300 MB and under - isn't it wonderful how
terms change?) 5.25" hard drives.  It's true that SCSI hardware also has
a microcomputer heritage, so people rave about how much faster it is than
ST506 and don't bother to notice how much slower it is than an IBM channel.

> 5) (Not really a comp.arch question, but related to the above - aside:
> Does anyone make synchronous SCSI disks which really perform?)

I don't know.  If you find one, tell me.  I want something that will feed
me a track at 2.5MB/sec.  This is related to the above in that a lot of
hard drives don't run their bit clocks at over 15 MHz, making it tricky
to get 2MB/sec out of them...  Otherwise, I have to play with RAID.

Arrays of cheap disks is one nice idea, like the CM's data vault.
Want 100 MB/sec?  Get 100 SCSI drives and run them in parallel.
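
The layout behind that is plain striping; roughly like this (a made-up
mapping, not how the data vault actually lays things out):

/* Minimal striping sketch: logical block i lives on drive (i % ndrives)
 * at block (i / ndrives), so a long sequential transfer keeps all the
 * drives busy at once.
 */
struct stripe_loc {
    int drive;          /* which drive holds the block  */
    long block;         /* block number on that drive   */
};

struct stripe_loc stripe_map(long logical_block, int ndrives)
{
    struct stripe_loc loc;

    loc.drive = (int)(logical_block % ndrives);
    loc.block = logical_block / ndrives;
    return loc;
}

A long sequential transfer then touches every drive about equally, which
is where the factor of 100 comes from, until the bus or the controller
becomes the bottleneck.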
-- 
	-Colin

casey@lll-crg.llnl.gov (Casey Leedom) (11/19/89)

  Ack.  This is probably a mistake since it's just this far away from a
flame, but sometimes you just have to do what you have to do ...

| From: ccplumb@rose.waterloo.edu (Colin Plumb)
| 
| One common reason [that disk transfer is so slow] is that the file
| systems are bloody awful.  The BSD "Fast File System" will get, on a good
| day, 25% of the available disk bandwidth.  A lot of it is just
| fragmentation, doing reads a block at a time (so you have to turn around
| the SCSI bus and issue a new command to get the next 8K, during which
| time, you might just miss the next sector), and copying in the buffer
| cache.  Copying large expanses of bytes around is just stupid if it can
| be avoided.
| 
| Did you have a look at the system time for copying a 1 meg file to
| /dev/null?  On the 4.3 BSD microvax I'm on, it's 0.7 seconds.  (5 seconds
| real, but the load average is 2.7 and I shouldn't be reading news.)  Why
| on earth does it take 700,000 instructions to find 256 4K blocks?  I
| assume "cp" and /dev/null aren't doing anything grotesquely stupid, but I
| may be wrong... anyway, *something* takes far too long.

  Please read ``A Fast File System for UNIX'', McKusick, Joy, Leffler,
and Fabry, Computer Systems Research Group, Computer Science Division,
Department of Electrical Engineering and Computer Science, University of
California, Berkeley.

  You're just so wrong I don't even want to start on addressing this
other than to point it out so that people won't think that it's true.

	1. Study the problem.
	2. Make hypothesis.
	3. Design test of hypothesis.
	4. Perform test.
	5. Analyze test results.
	6. Go to 1.

I.e. the ``scientific method'' (I've probably misstated the exact outline
of the "scientific method" as described in many high school science
texts.  Sorry.)

  Two points in particular should be addressed however:

    1.	Your test of .7 seconds of system time is far too short.  CPU
	time is attributed to the various states of the system by sampling
	the state at discrete intervals, so what you have is a sampled
	data set, and your sample is far too small.  Moreover, you have
	to be extremely careful when testing to make sure you're only
	measuring what you want to measure and not program load time and
	other factors that may crop up.  I.e. the shell ``time'' command
	is not going to work in most cases.  (A sketch of one way to get
	a steadier number follows this list.)

    2.	The Fast File System was tuned for ``typical'' UNIX work loads.
	What's typical for UNIX has changed over the years with people
	using it for scientific computing, business, etc.  Two areas where
	the Fast File System does fall down are in the areas of non-
	sequential file access (note - I did *NOT* say random) and large
	file support.  These are areas of current research.  (There will be
	a couple of papers presented at this Winter's USENIX on these
	topics.)

	There are undoubtedly other types of access which are not handled
	optimally by the Fast File System, but we have to design to what
	we know the systems are going to be used for and provide for
	other activities if we can by appropriate generalization and
	flexibility.
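
  As for point 1, here's a sketch of the sort of measurement I mean (an
illustration only): repeat the copy many times, wrap getrusage() around
just the I/O loop, and divide by the repeat count.  Note that after the
first pass the file is coming out of the buffer cache, so what you measure
is the file system and copying overhead, not the disk.

/* Sketch: system time for repeated reads of one file, measured with
 * getrusage() around only the I/O loop.  Repeating the copy gives the
 * sampled CPU clock something to sample; dividing by REPS gives a
 * steadier per-copy figure.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

#define REPS  100
#define BUFSZ 8192

int main(int argc, char **argv)
{
    char buf[BUFSZ];
    struct rusage start, end;
    double sys_secs;
    int i, fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    getrusage(RUSAGE_SELF, &start);
    for (i = 0; i < REPS; i++) {
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror(argv[1]); return 1; }
        while (read(fd, buf, BUFSZ) > 0)
            ;                          /* discard, like /dev/null */
        close(fd);
    }
    getrusage(RUSAGE_SELF, &end);

    sys_secs = (end.ru_stime.tv_sec  - start.ru_stime.tv_sec) +
               (end.ru_stime.tv_usec - start.ru_stime.tv_usec) / 1e6;
    printf("system time per copy: %g seconds\n", sys_secs / REPS);
    return 0;
}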

  Sorry for the flame-like nature of this posting.

Casey

casey@gauss.llnl.gov (Casey Leedom) (11/19/89)

  Urp.  The editor ate the line that said ``The Fast File System is an
example of the [scientific method].''

iyengar@grad1.cis.upenn.edu (Anand Iyengar) (11/20/89)

In article <18292@watdragon.waterloo.edu> ccplumb@rose.waterloo.edu (Colin Plumb) writes:
>Did you have a look at the system time for copying a 1 meg file to
>/dev/null?  On the 4.3 BSD microvax I'm on, it's 0.7 seconds.
>(5 seconds real, but the load average is 2.7 and I shouldn't be reading
>news.)  Why on earth does it take 700,000 instructions to find 256 4K blocks?
	This is an oversimplification of the issue (!).  There's a *lot*
more going on here than simple division will account for.  The amount of time
this takes can depend on a whole slew of other things (where the drive heads
happen to be seeking vs. where your file is, file contiguity, what other
processes are running and what they're trying to do, phase of the
moon...:-).  Also, microvaxes were never known for raw horsepower.

							Anand.  
--
"I've got more important things to waste my time on."
{arpa | bit}net: iyengar@eniac.seas.upenn.edu
uucp: !$ | uunet
--- Lbh guvax znlor vg'yy ybbx orggre ebg-guvegrrarg? ---

henry@utzoo.uucp (Henry Spencer) (11/21/89)

In article <38912@lll-winken.LLNL.GOV> casey@lll-crg.llnl.gov (Casey Leedom) writes:
>    2.	The Fast File System was tuned for ``typical'' UNIX work loads.

Only if you think a "typical" Unix workload is a single-user machine with
its own disks.  All the benchmarks were run single-user.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

cgn@leo.UUCP (Chris Nieves) (11/22/89)

In article <35985@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> 1) Why are SCSI disk subsystems *so* slow.  On sequential reads, ~300 KB/sec.
> 2) Is the reason lack of gather/scatter on (inexpensive) SCSI controllers?
> 3) Are there implementation limitations of the new RISC-based systems which
> make I/O cost more than on older systems?  (e.g. VAX or 68K).  Is there
> something about SCSI in particular that is a problem?
> 5) (Not really a comp.arch question, but related to the above - aside:
> Does anyone make synchronous SCSI disks which really perform?)

The current drives being shipped by HP, Maxtor, Imprimis, etc., in the 700+MB
range all have about the same performance coming off the disk:

                 seek - track to track    ~3-4 ms
                 seek - average            ~16 ms
                 rotation                 3600 rpm    (HP 4000 rpm)
                 rotational latency       8.33 ms     (HP 7.47 ms)
                 data xfer rate off disk  ~15.5 mbit/sec  =  ~1.4 mbytes/sec
                 scsi bus transfer rate   4 mbytes/sec (sync burst)

Since the controller can get data off the disk at a maximum rate of about
1.4 mbytes/sec, it cannot transfer on the SCSI bus at 4 mbytes/sec without
playing tricks.  Some of the tricks are 64K buffers, continuing to read data
from the disk after the current read data is in the controller's buffer (just
in case you're doing a sequential read), and disconnecting until most of the
data is in the buffer.
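
For reference, the latency figures above are just half a revolution at the
spindle speed, and the raw media rate is the bit clock over eight (the
~1.4 mbytes/sec usable figure is lower, presumably formatting and ECC
overhead).  A quick sketch of the arithmetic:

/* Quick arithmetic behind the table above: average rotational latency
 * is half a revolution, and the raw media rate is the bit clock / 8.
 */
#include <stdio.h>

int main(void)
{
    double rpm[] = { 3600.0, 4000.0 };
    double bit_clock_mbit = 15.5;
    int i;

    for (i = 0; i < 2; i++) {
        double rev_ms = 60.0 * 1000.0 / rpm[i];   /* one revolution */
        printf("%4.0f rpm: avg rotational latency %.2f ms\n",
               rpm[i], rev_ms / 2.0);
    }
    printf("raw media rate: %.2f mbytes/sec\n", bit_clock_mbit / 8.0);
    return 0;
}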

It seems the high-performance HDAs are getting cheaper and SCSI is getting
higher performance.  With drive companies announcing 1+ GB drives with
SCSI interfaces, you'll see better performance.  Give SCSI a break, it's only
starting to be used in high-performance applications.  Remember, SCSI stands
for Small Computer System Interface.

In article <18292@watdragon.waterloo.edu>, ccplumb@rose.waterloo.edu (Colin Plumb) writes:
> Arrays of cheap disks is one nice idea, like the CM's data vault.
> Want 100 MB/sec?  Get 100 SCSI drives and run them in parallel.

Sounds great! But what do I do when one of these inexpensive drives breaks
and how do I go about backing up 100 SCSI drives (~70 GB of data)?


-------------------------------------------------------------------------
Chris Nieves
UUCP : ccicpg!leo!cgn  or cgn@leo.ccicpg
USPS : ICL North America, 9801 Muirlands Blvd., Irvine, CA 92718-2521
PHONE: (714) 458-7282
-------------------------------------------------------------------------

news@haddock.ima.isc.com (overhead) (11/23/89)

In article <48398@leo.UUCP> cgn@leo.UUCP (Chris Nieves) writes:
>The current drives being shipped by HP, Maxtor, Imprimis, etc, in the 700+MB
>range all have about the same performance coming off the disk:
>[discussion of speeds (with real data) & system performance deleted.]
>...Give SCSI a break, its only
>starting to be used in high performance applications.  Remember SCSI stands
>for Small Computer System Interface.

If you're buying new stuff, you can buy whatever is on the
market.  If you have (for example) a Mac, it might be much
cheaper to go SCSI (since you don't need a controller).  On the
other hand, I've seen SCSI systems perform at similar speeds as
ESDI systems with similar bus, CPU, and OS at the high end
(386's).  We were getting 300 KB/sec transfers (similar to a VAX
780 with RA 81's running 4.3 BSD).

>In article <18292@watdragon.waterloo.edu>, ccplumb@rose.waterloo.edu (Colin Plumb) writes:
>> Arrays of cheap disks is one nice idea, like the CM's data vault.
>> Want 100 MB/sec?  Get 100 SCSI drives and run them in parallel.
>
>Sounds great! But what do I do when one of these inexpensive drives breaks
>and how do I go about backing up 100 SCSI drives (~70 GB of data)?

Let's say you have 32 drives providing data.  Some number of
additional drives run in parallel for ECC.  There might be 48
drives total, providing 2-bit correction: for any 32-bit word,
2 bits could be wrong and the word could still be reliably
recomputed from the total data.  Thus, you could have head
crashes on two drives and still recover all the data.  You just
remove the two drives, put in two new, formatted drives, and tell
the CM to rebuild them.

Thirty-two 760 MB drives provide about 24 gigabytes of data.  There are 2+
Gigabyte helical scan (video tape) backup devices on the market.
A dozen tapes could be used to back up the entire system.  The
tapes and drives are relatively cheap.  If you are spending a
million dollars on the rest of the system, having ten tape drives
will not be an unreasonable percentage of the total cost.  I
recall dimly that the data vault can be accessed without going
through the CM, so you don't have to tie up the CM for backup.

The CM that I saw used fairly small drives - under 100 MB each.
They indicated that the system was able to use larger drives, but
of course, the smaller drives were cheaper.  If they had needed
the space, they would have spent the money.  I don't recall which
interface the CM uses.  It might not have been SCSI.

Many large shops (insurance companies) have huge disk farms.
I wonder if these shops have started moving towards the newer
low cost media.

Stephen.
suitti@haddock.ima.isc.com

desnoyer@apple.com (Peter Desnoyers) (11/23/89)

In article <15249@haddock.ima.isc.com> news@haddock.ima.isc.com (overhead) 
writes:
> >In article <18292@watdragon.waterloo.edu>, ccplumb@rose.waterloo.edu (
> >Colin Plumb) writes:
> >> Arrays of cheap disks is one nice idea, like the CM's data vault.
> >> Want 100 MB/sec?  Get 100 SCSI drives and run them in parallel.
> >
> >Sounds great! But what do I do when one of these inexpensive drives breaks
> >and how do I go about backing up 100 SCSI drives (~70 GB of data)?
> 
> Let's say you have 32 drives providing data.  Some number of
> additional drives run in parallel for ECC correction.  There
> might be 48 drives total, providing 2 bit correction.  Thus for
> any 32 bit word - 2 bits could be wrong and yet still be reliably
> computable from the total data.  

You can get even better performance by relying on the fact that hard 
errors are erasure errors rather than random errors - i.e. the disk tells 
you that it can't read the data, instead of silently corrupting it. For 
instance, you can correct all 1-bit errors with a single parity bit. 
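
A sketch of what that buys you (illustrative only, not the data vault's
actual coding): with one parity drive holding the XOR of the data drives,
the block from any single drive the controller has flagged as bad is just
the XOR of the survivors.

/* Minimal erasure-recovery sketch: one parity drive holds the XOR of
 * the data drives.  When the disk *tells* you a drive is bad (an
 * erasure, not a silent error), its block is the XOR of all the
 * surviving blocks, parity included.
 */
#include <stddef.h>

void rebuild_block(unsigned char **blocks, /* ndrives block pointers   */
                   int ndrives,            /* data drives + 1 parity   */
                   int failed,             /* index of the bad drive   */
                   size_t blocksize,
                   unsigned char *out)     /* reconstructed block      */
{
    size_t i;
    int d;

    for (i = 0; i < blocksize; i++) {
        unsigned char x = 0;
        for (d = 0; d < ndrives; d++)
            if (d != failed)
                x ^= blocks[d][i];
        out[i] = x;   /* XOR of survivors == the missing block */
    }
}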

You still need a backup system - for instance, to recover from a fire.
However, that is a problem no matter how you go about creating 
a gigabyte file system. 

                                      Peter Desnoyers
                                      Apple ATG
                                      (408) 974-4469

daveh@cbmvax.UUCP (Dave Haynie) (11/23/89)

in article <15249@haddock.ima.isc.com>, news@haddock.ima.isc.com (overhead) says:
> Keywords: Peripheral Controllers, Gather, Scatter, I/O Architecture

> On the other hand, I've seen SCSI systems perform at similar speeds
> as ESDI systems with similar bus, CPU, and OS at the high end
> (386's).  We were getting 300 KB/sec transfers (similar to a VAX
> 780 with RA 81's running 4.3 BSD).

Not surprising, considering it's not at all unreasonable for the SCSI
device to actually have an ESDI hard disk talking to the other end
of its SCSI controller.  The Real Nice Thing about a high-level
protocol like SCSI, vs. something low-level like ESDI, is that you
can take advantage of advances in drive technology without throwing
out your controller.  With our system and the latest Wren VI drives
we're getting better than a megabyte/sec, at the filesystem level.
This is still with asynchronous SCSI.  Last year the record was more
like 900 KB/sec.

> Stephen.

-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                    Too much of everything is just enough