ccplumb@rose.waterloo.edu (Colin Plumb) (11/18/89)
In article <35985@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> 1) Why are SCSI disk subsystems *so* slow. On sequential reads, ~300 KB/sec.
> is typical. The filesystems and all hardware interfaces should be able to
> easily sustain 3X as much on sequential reads of large, unfragmented files.
> (Which SCSI disk subsystems you ask? I would rather turn it around, and say,
> are there any exceptions to the above sweeping generalization? :-) I use
> SMD disks as a baseline...)

It's a common question. I worked on the design of one SCSI subsystem where we benchmarked a few local machines (VAX, Sequent) and got quite awful performance figures: 250 KB/sec average, read or write. Of course, all the Amiga fans out there get to repeat what they said a month or two ago and point out that a 7.14 MHz 68000 somehow manages to leave a Sun 4/xxx in its dust, in this one area at least. 800 KB/sec is typical for contiguous files with decent hardware.

One common reason is that the file systems are bloody awful. The BSD "Fast File System" will get, on a good day, 25% of the available disk bandwidth. A lot of it is just fragmentation, doing reads a block at a time (so you have to turn the SCSI bus around and issue a new command to get the next 8K, during which time you might just miss the next sector), and copying in the buffer cache. Copying large expanses of bytes around is just stupid if it can be avoided.

Did you have a look at the system time for copying a 1 meg file to /dev/null? On the 4.3 BSD MicroVAX I'm on, it's 0.7 seconds. (5 seconds real, but the load average is 2.7 and I shouldn't be reading news.) Why on earth does it take 700,000 instructions to find 256 4K blocks? I assume "cp" and /dev/null aren't doing anything grotesquely stupid, but I may be wrong... anyway, *something* takes far too long.

> 2) Is the reason lack of gather/scatter on (inexpensive) SCSI controllers?

It's complicated.
Requiring blocks to be above a certain size helps, and most systems just want the page size anyway, but it adds complexity to the incrementer usually hooked up to the address bus. I like it, however. It saves the system architect from having to subvert the page allocator to get contiguous physical pages for a long data transfer (tricky on writes, when you don't know beforehand that the user will be writing from that memory), do lots of data copying, or issue I/O requests a page at a time.

> 3) Are there implementation limitations of the new RISC-based systems which
> make I/O cost more than on older systems? (e.g. VAX or 68K). Is there
> something about SCSI in particular that is a problem?

No, it's just that the processors cost *less*, so I/O cost gets squeezed the same amount, but the high-volume chips that enable performance at low cost aren't as easy to find. Also, people have discussed the microcomputer origins of a lot of RISC designs, while good I/O subsystems are the realm of Big Iron and all things IBM-ish.

Asynchronous SCSI isn't blindingly fast, but it seems to be faster than the bit clocks on a lot of small (300 MB and under - isn't it wonderful how terms change?) 5.25" hard drives. It's true that SCSI hardware also has a microcomputer heritage, so people rave about how much faster it is than ST506 and don't bother to notice how much slower it is than an IBM channel.

> 5) (Not really a comp.arch question, but related to the above - aside:
> Does anyone make synchronous SCSI disks which really perform?)

I don't know. If you find one, tell me. I want something that will feed me a track at 2.5 MB/sec. This is related to the above in that a lot of hard drives don't run their bit clocks at over 15 MHz, making it tricky to get 2 MB/sec out of them... Otherwise, I have to play with RAID. An array of cheap disks is one nice idea, like the CM's data vault. Want 100 MB/sec? Get 100 SCSI drives and run them in parallel.
--
-Colin
casey@lll-crg.llnl.gov (Casey Leedom) (11/19/89)
Ack. This is probably a mistake since it's just this far away from a flame, but sometimes you just have to do what you have to do ...

| From: ccplumb@rose.waterloo.edu (Colin Plumb)
|
| One common reason [that disk transfer is so slow] is that the file
| systems are bloody awful. The BSD "Fast File System" will get, on a good
| day, 25% of the available disk bandwidth. A lot of it is just
| fragmentation, doing reads a block at a time (so you have to turn around
| the SCSI bus and issue a new command to get the next 8K, during which
| time, you might just miss the next sector), and copying in the buffer
| cache. Copying large expanses of bytes around is just stupid if it can
| be avoided.
|
| Did you have a look at the system time for copying a 1 meg file to
| /dev/null? On the 4.3 BSD microvax I'm on, it's 0.7 seconds. (5 seconds
| real, but the load average is 2.7 and I shouldn't be reading news.) Why
| on earth does it take 700,000 instructions to find 256 4K blocks? I
| assume "cp" and /dev/null aren't doing anything grotesquely stupid, but I
| may be wrong... anyway, *something* takes far too long.

Please read ``A Fast File System for UNIX'', McKusick, Joy, Leffler, and Fabry, Computer Systems Research Group, Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley. You're just so wrong I don't even want to start addressing this, other than to point it out so that people won't think it's true.

1. Study the problem.
2. Make a hypothesis.
3. Design a test of the hypothesis.
4. Perform the test.
5. Analyze the test results.
6. Go to 1.

I.e., the ``scientific method''. (I've probably misstated the exact outline of the "scientific method" as described in many high school science texts. Sorry.)

Two points in particular should be addressed, however:

1. Your test of .7 seconds of system time is far too short.
   Since CPU time is attributed to the various states of the system by sampling the state at discrete intervals, you have a digital data set, and your sample set is way too small. Moreover, you have to be extremely careful when testing to make sure you're measuring only what you want to measure, and not program load time and other factors that may crop up. I.e., the shell ``time'' command is not going to work in most cases.

2. The Fast File System was tuned for ``typical'' UNIX work loads. What's typical for UNIX has changed over the years, with people using it for scientific computing, business, etc. Two areas where the Fast File System does fall down are non-sequential file access (note - I did *NOT* say random) and large file support. These are areas of current research. (There will be a couple of papers presented at this Winter's USENIX on these topics.) There are undoubtedly other types of access which are not handled optimally by the Fast File System, but we have to design to what we know the systems are going to be used for, and provide for other activities if we can by appropriate generalization and flexibility.

Sorry for the flame-like nature of this posting.

Casey
casey@gauss.llnl.gov (Casey Leedom) (11/19/89)
Urp. The editor ate the line that said ``The Fast file system is an example of the [scientific method].''
iyengar@grad1.cis.upenn.edu (Anand Iyengar) (11/20/89)
In article <18292@watdragon.waterloo.edu> ccplumb@rose.waterloo.edu (Colin Plumb) writes:
>Did you have a look at the system time for copying a 1 meg file to
>/dev/null? On the 4.3 BSD microvax I'm on, it's 0.7 seconds.
>(5 seconds real, but the load average is 2.7 and I shouldn't be reading
>news.) Why on earth does it take 700,000 instructions to find 256 4K blocks?

This is an oversimplification of the issue (!). There's a *lot* more going on here than simple division will account for. The amount of time this takes can depend on a whole slew of other things (where the drive heads happen to be seeking vs. where your file is, file contiguity, what other processes are running and what they're trying to do, the phase of the moon... :-). Also, MicroVAXes were never known for raw horsepower.

Anand.
--
"I've got more important things to waste my time on."
{arpa | bit}net: iyengar@eniac.seas.upenn.edu
uucp: !$ | uunet
--- Lbh guvax znlor vg'yy ybbx orggre ebg-guvegrrarg? ---
henry@utzoo.uucp (Henry Spencer) (11/21/89)
In article <38912@lll-winken.LLNL.GOV> casey@lll-crg.llnl.gov (Casey Leedom) writes:
> 2. The Fast File System was tuned for ``typical'' UNIX work loads.

Only if you think a "typical" Unix workload is a single-user machine with its own disks. All the benchmarks were run single-user.
--
A bit of tolerance is worth a  |  Henry Spencer at U of Toronto Zoology
megabyte of flaming.           |  uunet!attcan!utzoo!henry henry@zoo.toronto.edu
cgn@leo.UUCP (Chris Nieves) (11/22/89)
In article <35985@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> 1) Why are SCSI disk subsystems *so* slow. On sequential reads, ~300 KB/sec.
> 2) Is the reason lack of gather/scatter on (inexpensive) SCSI controllers?
> 3) Are there implementation limitations of the new RISC-based systems which
> make I/O cost more than on older systems? (e.g. VAX or 68K). Is there
> something about SCSI in particular that is a problem?
> 5) (Not really a comp.arch question, but related to the above - aside:
> Does anyone make synchronous SCSI disks which really perform?)

The current drives being shipped by HP, Maxtor, Imprimis, etc., in the 700+MB range all have about the same performance coming off the disk:

    seek - track to track       ~3-4 ms
    seek - average              ~16 ms
    rotation                    3600 rpm (HP 4000 rpm)
    rotational latency          8.33 ms (HP 7.47 ms)
    data xfer rate off disk     ~15.5 Mbit/sec = ~1.4 Mbytes/sec
    SCSI bus transfer rate      4 Mbytes/sec (sync burst)

Since the controller can get data off the media at a maximum rate of 1.4 Mbytes/sec, it cannot transfer on the SCSI bus at 4 Mbytes/sec without playing tricks. Some of the tricks are 64K buffers, continuing to read data from the disk after the current read data is in the controller's buffer (just in case you're doing a sequential read), and disconnecting until most of the data is in the buffer.

It seems the high performance HDAs are getting cheaper and SCSI is getting higher performance. With drive companies announcing 1+ GB drives with SCSI interfaces, you'll see better performance. Give SCSI a break, it's only starting to be used in high performance applications. Remember, SCSI stands for Small Computer System Interface.

In article <18292@watdragon.waterloo.edu>, ccplumb@rose.waterloo.edu (Colin Plumb) writes:
> Arrays of cheap disks is one nice idea, like the CM's data vault.
> Want 100 MB/sec? Get 100 SCSI drives and run them in parallel.

Sounds great!
But what do I do when one of these inexpensive drives breaks, and how do I go about backing up 100 SCSI drives (~70MB of data)?
-------------------------------------------------------------------------
Chris Nieves
UUCP : ccicpg!leo!cgn or cgn@leo.ccicpg
USPS : ICL North America, 9801 Muirlands Blvd., Irvine, CA 92718-2521
PHONE: (714) 458-7282
-------------------------------------------------------------------------
news@haddock.ima.isc.com (overhead) (11/23/89)
In article <48398@leo.UUCP> cgn@leo.UUCP (Chris Nieves) writes:
>The current drives being shipped by HP, Maxtor, Imprimis, etc, in the 700+MB
>range all have about the same performance coming off the disk:
>[discussion of speeds (with real data) & system performance deleted.]
>...Give SCSI a break, its only
>starting to be used in high performance applications. Remember SCSI stands
>for Small Computer System Interface.

If you're buying new stuff, you can buy whatever is on the market. If you have (for example) a Mac, it might be much cheaper to go SCSI (since you don't need a controller). On the other hand, I've seen SCSI systems perform at speeds similar to ESDI systems with similar bus, CPU, and OS at the high end (386's). We were getting 300 KB/sec transfers (similar to a VAX 780 with RA81's running 4.3 BSD).

>In article <18292@watdragon.waterloo.edu>, ccplumb@rose.waterloo.edu (Colin Plumb) writes:
>> Arrays of cheap disks is one nice idea, like the CM's data vault.
>> Want 100 MB/sec? Get 100 SCSI drives and run them in parallel.
>
>Sounds great! But what do I do when one of these inexpensive drives breaks
>and how do I go about backing up 100 SCSI drives (~70MB of data)?

Let's say you have 32 drives providing data. Some number of additional drives run in parallel for ECC correction. There might be 48 drives total, providing 2-bit correction. Thus, for any 32-bit word, 2 bits could be wrong and the word would still be reliably computable from the total data. So you could have head crashes on two drives and still recover everything: you just remove the two drives, put in two new formatted drives, and tell the CM to rebuild them.

Thirty-two 760 MB drives provide 24 gigabytes of data. There are 2+ gigabyte helical scan (video tape) backup devices on the market. A dozen tapes could be used to back up the entire system. The tapes and drives are relatively cheap.
If you are spending a million dollars on the rest of the system, having ten tape drives will not be an unreasonable percentage of the total cost. I recall dimly that the data vault can be accessed without going through the CM, so you don't have to tie up the CM for backup.

The CM that I saw used fairly small drives - under 100 MB each. They indicated that the system was able to use larger drives, but of course, the smaller drives were cheaper. If they had needed the space, they would have spent the money. I don't recall which interface the CM uses. It might not have been SCSI.

Many large shops (insurance companies) have huge disk farms. I wonder if these shops have started moving towards the newer low cost media.

Stephen.
suitti@haddock.ima.isc.com
desnoyer@apple.com (Peter Desnoyers) (11/23/89)
In article <15249@haddock.ima.isc.com> news@haddock.ima.isc.com (overhead) writes:
> >In article <18292@watdragon.waterloo.edu>, ccplumb@rose.waterloo.edu (
> >Colin Plumb) writes:
> >> Arrays of cheap disks is one nice idea, like the CM's data vault.
> >> Want 100 MB/sec? Get 100 SCSI drives and run them in parallel.
> >
> >Sounds great! But what do I do when one of these inexpensive drives breaks
> >and how do I go about backing up 100 SCSI drives (~70MB of data)?
>
> Let's say you have 32 drives providing data. Some number of
> additional drives run in parallel for ECC correction. There
> might be 48 drives total, providing 2 bit correction. Thus for
> any 32 bit word - 2 bits could be wrong and yet still be reliably
> computable from the total data.

You can get even better performance by relying on the fact that hard errors are erasure errors rather than random errors - i.e. the disk tells you that it can't read the data, instead of silently corrupting it. For instance, you can correct all 1-bit errors with a single parity bit.

You still need a backup system - for instance, to recover from a fire. However, that is a problem no matter how you go about creating a gigabyte file system.

Peter Desnoyers
Apple ATG
(408) 974-4469
daveh@cbmvax.UUCP (Dave Haynie) (11/23/89)
in article <15249@haddock.ima.isc.com>, news@haddock.ima.isc.com (overhead) says:
> On the other hand, I've seen SCSI systems perform at similar speeds
> as ESDI systems with similar bus, CPU, and OS at the high end
> (386's). We were getting 300 Kb/sec transfers (similar to a VAX
> 780 with RA 81's running 4.3 BSD).

Not surprising, considering it's not at all unreasonable for the SCSI device to actually have an ESDI hard disk talking to the other end of its SCSI controller. The Real Nice Thing about a high level protocol like SCSI, vs. something low level like ESDI, is that you can take advantage of advances in drive technology without throwing out your controller.

With our system and the latest Wren VI drives we're getting better than a megabyte/sec, at the filesystem level. This is still with asynchronous SCSI. Last year the record was more like 900 KB/sec.
--
Dave Haynie  Commodore-Amiga (Systems Engineering)  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh   PLINK: hazy   BIX: hazy
   Too much of everything is just enough