rayan@utegc.UUCP (09/07/87)
With the faster CPUs we've been seeing lately, we have noticed that too many processes spend their wall clock time in disk waits. The disk I/O subsystem seems to be one of the top two bottlenecks (along with memory starvation) for our systems. We've been looking for something better than our trusty Xylogics 451s, which have been the default controller around here until now. To our knowledge, there are three companies in the market with high-performance controllers that are/will be used in Suns, namely Interphase, Ciprico, and Xylogics. The contending products are:

    Interphase:  4200 "Cheetah"
    Ciprico:     3200 "Rimfire"
    Xylogics:    75*   (Sun default)

At this time, I can release my evaluation of two of these boards. What follows are my impressions of the products, along with a few numbers for people who like to see the mundane stuff.

The boards compared are the Cheetah and the Rimfire. In summary, the Cheetah is very fast on flat-out reads (at 2MB/sec, it approaches the theoretical limit of 2.4MB/sec for this disk, especially considering sector control data). The Cheetah firmware fell flat on its face when asked to do I *and* O interleaved, and degraded to about Xy451 performance levels. In comparison, the Rimfire did 1.8MB/sec on a flat-out read, but did much better on mixed I/O. Based on this gross error in the Cheetah, and the results of the benchmarks, I'd give the current overall performance edge to the Ciprico board.

Here are my recollections of the products:

=====
Cheetah:

No problem with installation once I figured out what the bits meant... This controller has a control byte which is apparently stored in the sector header of each sector upon a format. The bits include flags like Runt Sector Enable, Spare Sector Enable (sector slipping), Cache Enable, Status Change, Retry, etc. The Status Change bit is supposed to send reports when the disk goes offline/online; unfortunately it doesn't work too well... almost any kind of disk access with this bit turned on would crash the driver we got for the Cheetah. I figured out the appropriate bits (RSE, SSE, CE, ...) and started formatting the disk. Disk maintenance is through a diag utility (modified Sun diag).

While the format was going on I read some more in the manual and discovered that the board does sector slipping and bad track forwarding, but not bad sector forwarding. If sector slipping is enabled, bad track forwarding is done after the spare sector on the track has been used up. Still, I think this is rather wasteful of the spare cylinders at the end of the disk. In my limited experience, bad spots on the disk don't usually render much of a track unusable.

The format and 5 verify passes finally finished after 75 minutes (cf. 6 hours for an Xy451/SuperEagle). During the format, the controller did bad track forwarding (if I recall right) on two bad spots: one around cylinder 8 and one somewhere in the middle of the h partition. The forwarded tracks didn't measurably affect the benchmarks, although degraded performance was observable from diag/read. The disk was formatted using the same parameters as the other board (5 sector skew, gap1/gap2 = 16/19).

Boy, this controller screams! If you're into doing image copies of your disk, this thing is Real Quick. I was therefore rather surprised when I tried some mixed I/O and saw performance no better than an Xy451. We called up Interphase to make sure our configuration and setup were correct. It seems that if one gets 2MB/sec reads, one's configuration is correct.
The problem seems to be silly firmware that flushes the read cache on every write. The engineer we spoke to said they were actively working with Sun on new firmware that would fix this ("improve performance 30%"); unfortunately, at this time the new PROMs aren't yet available. Incidentally, we got a note from Interphase mentioning an upgrade program for people with old firmware, so they do keep their customers in mind. Most benchmarks reinforced this initial impression (great at reading, lousy at reading *and* writing). One more thing I noted was that performance seems to degrade excessively when one accesses different partitions, e.g. in a disk-to-disk copy or with parallel activity on two filesystems. Since it happens in this latter case too, I doubt it has much to do with the cache algorithm syndrome.

Addendum: we got the new firmware PROMs but couldn't get the disk formatted properly. The Interphase engineer mentioned something about the runt sector possibly being too small due to a bug in the experimental PROMs we got. Eventually we decided things (the entire evaluation period for all boards) had dragged on long enough, and we sent the board back. I have heard they are now shipping 'final production' 4200s. Even if the performance on some of the tests were upped 30%, it wouldn't make much overall difference.

=====
Rimfire:

This one was real easy to set up the first time around, because one of Ciprico's globetrotting engineers flew up to install the board and make sure everything was OK. That first time, things went by in a bit of a blur, but I then had to duplicate the installation on my own later. Downtime is minimal when installing a Rimfire, because all disk maintenance is done online with a special utility program. This is a full-screen maintenance program from which one can do all the usual diag stuff, and which in addition has some cache twiddling and reporting functions.

To set up the disk, I chose a disk template from a list of a score or so that included the popular Fujitsu, NEC, CDC, and Toshiba disks, and some miscellany. I then modified the format parameters (interleave on, headskew=5, cylskew=21, hgskew=0, recovery=0, idpre=16, datapre=19) to the values recommended by the Ciprico engineer. The idpre and datapre parameters are apparently the synchronization gap sizes (a.k.a. gap1/gap2). The parameters not present on the Cheetah are cylinder skew and 'hg' skew, and the recovery options on the Rimfire are slightly better. This setup information is then passed to the driver (a sketch of roughly what it amounts to appears below), and I could start formatting. Nine (9) (sic!) minutes afterwards, the format was done. The five verify passes take another 30-odd minutes, but who is complaining...

When the Ciprico fellow did this in front of me, everything worked fine. When I did it by myself, the verify went through 7 cylinders and then happily stopped as if it were all done. I tracked it down to a compiler bug (removing a 'register' in a declaration fixed things) that showed up with the pre-release 'rfutil' we got. There were a few other problems with that version, but a later and more mature program we received worked beautifully, with no bugs and no surprises.
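The setup information handed to the driver amounts to something like the following (a sketch only: the field names are mine, not rfutil's; the values are the ones the Ciprico engineer recommended):

    /* Hypothetical layout of the format parameter block. */
    struct rf_fmt {
        int f_interleave;   /* on */
        int f_headskew;     /* 5  -- sector skew, head to head */
        int f_cylskew;      /* 21 -- sector skew, cylinder to cylinder */
        int f_hgskew;       /* 0  -- 'hg' (head group?) skew */
        int f_recovery;     /* 0  -- error recovery options */
        int f_idpre;        /* 16 -- ID preamble, a.k.a. gap1 */
        int f_datapre;      /* 19 -- data preamble, a.k.a. gap2 */
    };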
As regards handling bad spots: sector slipping, bad track mapping, and bad sector mapping are all supported. The caching algorithm and parameters are specified by the driver at disk access time, and the driver can differentiate between three kinds of accesses: raw disk I/O, filesystem I/O in multiples of the 4k block size, and filesystem I/O not in 4k multiples. The caching scheme, read-ahead size, and retry options are specified separately for each kind of access. Read-ahead, when enabled, can be up to 244 sectors. The control bits for caching are: search cache (SEA), save read data in cache (CRD), read-ahead priority (RAP), save write data in cache (CWT), sort reads (SRD), sort writes (SWT), read ahead across heads (XHD), and read ahead across cylinders (XCY).

The boot-time parameters are hardwired into the kernel, in a table in a header file (a sketch of what such a table might look like appears at the end of this section). The defaults (unless otherwise mentioned) are SEA|XHD for filesystem I/O and SEA|XHD|XCY for raw I/O, both with 100-sector read-ahead. These parameters are defaulted from an array in the kernel, can be modified online, and cache statistics can be examined (the best we got was a 99.97% hit rate on a long raw disk read).

The Rimfire is slightly slower than the Cheetah in terms of maximum throughput, coming in at 1.8MB/sec maximum read speed. It has much better behavior when doing mixed reads/writes (or rather, it has 'expected' behavior, whereas the Cheetah had broken firmware). In my tests it degraded linearly when kept busy on different parts of the disk. Essentially, it is consistent with expectations about the overall improvement relative to an Xy451.

A pastime with this controller is fiddling with the cache control bits to see what effect they have on things. It turned out that several of the cache parameters are quite expensive if you don't actually need them. I found read-ahead a good thing on file I/O, and pretty much irrelevant on long raw I/O. Since fsck is short raw I/O, I ended up with the parameters 301-10/101-100 as pretty much the optimal ones for my purposes (301-10 == SEA|XHD|XCY with 10 sectors read-ahead for raw I/O; 101-100 == SEA|XHD with 100 sectors read-ahead for file I/O).

There is of course overhead in the firmware, depending on how much extra work you ask the controller to do by setting the cache options. One of my tuning experiments was with the more-or-less-random-access fsck, where the time taken would range from three quarters of the Cheetah's to twice the Cheetah's. Note that the parameters that are optimal for fsck may be lousy for dd, and vice versa (adaptive algorithms would be nice). Turning on fancy options, for example cross-cylinder read-ahead or sorts for file I/O, may be a mistake.

After speaking with some engineers from various companies, I get the impression the Ciprico board is the only one around that does NOT flush the read cache on any write operation. To do this they do some data structure manipulation in firmware, which is not easy in a hardware cache design. I don't design these things for a living, but invalidating the cache on every write operation is a *bad* thing to do.
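For concreteness, the boot-time table might look something like this. A sketch only: the names are made up, and the bit values are inferred from the octal parameters quoted above (0301 = SEA|XHD|XCY, 0101 = SEA|XHD); the real header no doubt differs.

    #define SEA 0001    /* search cache before going to the disk */
    #define CRD 0002    /* save read data in cache */
    #define RAP 0004    /* read-ahead priority */
    #define CWT 0010    /* save write data in cache */
    #define SRD 0020    /* sort reads */
    #define SWT 0040    /* sort writes */
    #define XHD 0100    /* read ahead across head boundaries */
    #define XCY 0200    /* read ahead across cylinder boundaries */

    struct rf_cache {
        unsigned char c_bits;      /* cache control bits, from above */
        unsigned char c_rahead;    /* read-ahead in sectors (max 244) */
    };

    /* one entry per access type; the values are the ones I settled on */
    struct rf_cache rf_cache[3] = {
        { SEA|XHD|XCY, 10  },   /* raw disk i/o             ("301-10")  */
        { SEA|XHD,     100 },   /* file i/o, 4k multiple    ("101-100") */
        { SEA|XHD,     100 },   /* file i/o, other (assumed the same)   */
    };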
=====
Comparison:

The impression I get is that the Ciprico is about 1.9-2.0 times an Xy451 on most things, across the board. The Cheetah is 2.1-2.2 times an Xy451 on a certain set of activities, and degrades to 1.0-1.1 times an Xy451 on heavy read/write I/O mixes. The software with the Ciprico is nicer and gives finer control than the Interphase's (a Cache Enable bit with nothing else doesn't give much choice). However, one uses the setup program once, when the disk is installed, and then forgets about it; online maintenance is nice, of course. Improving the Interphase by 30% or so, as promised by the new firmware, is not likely to change the picture much, apart from making the Interphase more palatable.

The Rimfire is comparatively "involved" when it comes to setting up the disk and tuning the cache properly, whereas the Cheetah is a turn-key kind of product as far as configuration goes. We have firsthand experience with another board, but are still under a non-disclosure agreement.

In a nutshell: if you have a read-only disk, the Cheetah is for you; otherwise, look into the Rimfire board. We're buying and recommending Rimfires for the foreseeable future. All the companies we have dealt with have been very professional and attentive to problems. We have the most (and best) experience with Ciprico, since they sent someone up here to babysit the installation and listen to reactions/concerns.

Finally, neither I personally, nor the Computer Systems Research Institute through which the evaluations were arranged, nor anyone in it, has any connection with any of the companies mentioned, except as potential (and in some cases present) customers. Use the information in this message at your own risk. You may redistribute it as you see fit. (Of course, if you do cause us to get some better service by referring to the UofToronto name and this test here and there, we won't complain :-)

rayan

Rayan Zachariassen, AI group, University of Toronto  (rayan@ai.toronto.edu)

=====
Data:

First, the test configuration: a Sun 3/180S (normally configured with two Eagles on an Xy451) with a Swallow-II (Fujitsu 2344 -- great disk btw, cutest thing you ever saw!) on loan for the duration of the test. All timing results are the minima of several runs in multiuser mode on an otherwise idle system (normal daemons running) at -20 priority.

Interphase 4200 Cheetah (firmware revision X0E).
Ciprico 3200 Rimfire (firmware rev 14 of 87/04/13, engineering rev level 38).

Disk layout is illustrated by this partial dkinfo output (the 2344 was set up for 69 sectors per track -- that's 67 data, 1 spare, 1 runt):

    620 cylinders    27 heads    67 sectors/track
    a:   18090 sectors  (10 cyls)    starting cylinder 0
    b:   36180 sectors  (20 cyls)    starting cylinder 10
    c: 1121580 sectors (620 cyls)    starting cylinder 0
    d:  280395 sectors (155 cyls)    starting cylinder 30
    f:  280395 sectors (155 cyls)    starting cylinder 185
    h:  506520 sectors (280 cyls)    starting cylinder 340

The filesystems used for the file I/O tests were set up with 'tunefs -a 40 -d 0'. I mounted d, f, and h on /u1, /u2, and /u3 respectively. The 'bigfile' mentioned was approx. 8.5 megabytes (12 * /vmunix). All dd file I/O was done on a recently newfs'ed filesystem. /tmp and /local are on the Xy451 of the Sun, on an Eagle (2351). Throughput figures are in kilobytes per second, based on real time of course. 1 cylinder == 1809 sectors.

NB! These measurements are not scientific or controlled to any degree, nor are they claimed as such. They satisfy our purposes. If you have measurements for other controllers, or can fill in the blanks, I'd be happy to act as a collection point for such data.

Format                  format     5 verify   total
                        minutes    minutes    minutes
----------------------------------------------------------------------------
Ciprico 3200            9          27         36
Interphase 4200         15         60         75

Command / blocksize     8k         1 track    1 cylinder
                        kb/s       kb/s       kb/s

dd if=/dev/rXX0a of=/dev/null
----------------------------------------------------------------------------
Ciprico 3200            1116       1447       1781
Interphase 4200         1304       ?          ?

dd if=/dev/rXX0f of=/dev/null
----------------------------------------------------------------------------
Ciprico 3200            1129       1447       1824
Interphase 4200         ?          ?          2005

dd if=/u2/bigfile of=/u3/bigfile
----------------------------------------------------------------------------
Ciprico 3200            737        745        866
Interphase 4200         260        ?          560

dd if=/u2/bigfile of=/u2/bigfile.new
----------------------------------------------------------------------------
Ciprico 3200            733        775        889
Interphase 4200         270        ?          530

dd if=/u2/bigfile of=/dev/null
----------------------------------------------------------------------------
Ciprico 3200            1216       1212       1191
Interphase 4200         1435       ?          1536

dd if=/u2/bigfile of=/tmp/bigfile
----------------------------------------------------------------------------
Ciprico 3200            910        ?          ?
Interphase 4200         796        ?          ?

dd if=/dev/rXX0a of=/dev/rXX0b
----------------------------------------------------------------------------
Ciprico 3200            400        990        1447
Interphase 4200         370        ?          1460

Miscellaneous:

Command                                  blocksize  Cheetah         Rimfire
----------------------------------------------------------------------------
dd if=/u2/bin/gnuemacs of=/dev/null      8k         2.2*Xy451       1.9*Xy451
cd /u2; dump 0f - /local | restore rf -             1524.2 real<1>  1490.0 real<6>
cd /u1; dump 0f - /u2 | restore rf -                1782.0 real<2>  1585.2 real<7>
fsck /dev/rXX0f                                     40.0 real<3>    29.1 real<8>
4 parallel fsck's, offset 5 secs each, of ip0f      159.4 real<4>   117.0 real<9>
2 parallel fsck's, of ip0f and ip0d                 100.7 real<5>   57.3 real<10>

footnotes:

1)  1524.2 real   60.7 user  275.4 sys
2)  1782.0 real   58.5 user  270.0 sys
3)    40.0 real    8.0 user    3.3 sys
4)   159.4 real   25.4 user   11.4 sys
5)   100.7 real    8.6 user    3.7 sys

Cheetah dump/restore/fsck test filesystem:

Filesystem            kbytes    used   avail  capacity  Mounted on
/dev/xy0g             110631   94633    4934      95%   /local
/dev/ip0f             137460   95010   28704     77%    /u2

6)  1490.0 real   59.2 user  263.1 sys
7)  1585.2 real   60.3 user  258.9 sys
8)                                        Cache bits       Readahead
      48.4 real    7.9 user    3.3 sys    0                60
      37.0 real    7.0 user    3.3 sys    SEA|CRD          60
      29.1 real    7.8 user    3.3 sys    SEA|CRD          10
      37.0 real    8.0 user    2.4 sys    SEA|CRD|RAP|SRD  60
      79.5 real    8.1 user    2.8 sys    SEA|XHD(|XCY)    60
9)   117.0 real   31.9 user   15.3 sys
10)   57.3 real    8.4 user    3.5 sys

Rimfire dump/restore/fsck test filesystem:

Filesystem            kbytes    used   avail  capacity  Mounted on
/dev/xy0g             110631   90478    9089      91%   /local
/dev/rf0f             137460   90838   32876     73%    /u2

As you may have noticed, the test filesystems unfortunately aren't identical. The table above should therefore really include a 5% or so adjustment in one column, but this is not significant.
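Incidentally, the kb/s figures come down to nothing fancier than timing a sequential read; if you want comparable numbers without dd, a loop like the following will do. This is a sketch, not the actual harness used for the tables above:

    /*
     * rspeed.c -- time flat-out reads from a file or raw device,
     * roughly what "time dd if=... of=/dev/null bs=..." measures.
     *
     *     cc -o rspeed rspeed.c
     *     ./rspeed /dev/rxy0f 8192
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        int fd, n, bs;
        long total = 0;
        char *buf;
        double secs;
        struct timeval t0, t1;

        if (argc != 3) {
            fprintf(stderr, "usage: %s file blocksize\n", argv[0]);
            return 1;
        }
        bs = atoi(argv[2]);
        if ((buf = malloc(bs)) == NULL) {
            fprintf(stderr, "out of memory\n");
            return 1;
        }
        if ((fd = open(argv[1], O_RDONLY)) < 0) {
            perror(argv[1]);
            return 1;
        }
        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bs)) > 0)    /* flat-out sequential read */
            total += n;
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1.0e6;
        printf("%ld bytes in %.2f sec = %.0f kb/s\n",
               total, secs, total / secs / 1024.0);
        return 0;
    }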
ron@topaz.rutgers.edu (Ron Natalie) (09/07/87)
I have the Ciprico Rim Fire 2220 (Multibus II) controller and it seems to be pretty nice. One interesting aspect of the driver they provide is the absence of any I/O queue in the driver. Since you can tag Multibus II requests/completions with an arbitrary number (like the buffer pointer), they just stuff them all into the controller as they come in and then just iodone 'em when the completion comes back (roughly as in the sketch below). Amusing.

Multibus II message passing is very handy for UNIX drivers. Everything falls into place nicely.

-RON
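The scheme boils down to something like this (a sketch with invented rf_* names, not Ciprico's actual driver; the buf fields and iodone() are the usual 4BSD machinery):

    /*
     * Queue-less driver sketch: every request goes straight to the
     * controller, tagged with the buf pointer itself, so the completion
     * message tells us directly which request finished.  rf_post(),
     * rf_nextcompl() and the rf_req/rf_compl structures are invented
     * for illustration.
     */
    rfstrategy(bp)
        register struct buf *bp;
    {
        struct rf_req r;

        r.r_blkno = bp->b_blkno;
        r.r_addr  = bp->b_un.b_addr;
        r.r_count = bp->b_bcount;
        r.r_read  = (bp->b_flags & B_READ) != 0;
        r.r_tag   = (long)bp;          /* arbitrary tag: the buf itself */
        rf_post(&r);                   /* hand it straight to the board */
    }                                  /* no driver queue, no disksort() */

    rfintr()
    {
        struct rf_compl c;
        register struct buf *bp;

        while (rf_nextcompl(&c)) {     /* drain completion messages */
            bp = (struct buf *)c.c_tag;
            if (c.c_status)
                bp->b_flags |= B_ERROR;
            iodone(bp);                /* and that's all there is */
        }
    }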
dan@rna.UUCP (Dan Ts'o) (09/08/87)
In article <8709062258.AA22818@ephemeral.ai.toronto.edu> rayan@utegc.UUCP writes:
>
>With the faster CPUs we've been seeing lately, we have noticed that too
>many processes spend their wall clock time in disk waits. The disk i/o
>subsystem indeed seems to be one of the top 2 bottlenecks (along with
>memory starvation) for our systems. We've been looking for something better
>than our trusty Xylogics 451's, which has been the default controller
>around here until now. To our knowledge, there are three companies in the
>market with high-performance controllers that are/will be used in Suns,
>namely Interphase, Ciprico, and Xylogics. The contending products are:
>
>Interphase: 4200 "Cheetah"
>Ciprico: 3200 "Rimfire"
>Xylogics: 75* Sun default
>
>At this time, I can release my evaluation of two of these boards.
>What follows are my impressions of the products, along with a few numbers
>for people who like to see the mundane stuff.
>
>The boards compared are the Cheetah and the Rimfire. In summary, the
>Cheetah is very fast on flat out reads (at 2MB/sec read, it approaches
>the theoretical limit of 2.4MB/Sec for this disk, especially considering
>sector control data). The Cheetah firmware fell flat on its face when asked
>to do I *and* O interleaved, and degraded to about Xy451 performance levels.
>In comparison, the Rimfire did 1.8MB/sec on a flat out read, but did much
>better on mixed I/O. Based on this gross error in the Cheetah, and the
>results of the benchmarks, I'd give the current overall performance edge to
>the Ciprico board.

I went through a similar decision cycle for the Sun several months ago. (For better or worse) we decided on the Xylogics 752. Well, we've had to wait a long time, something I might not have done had I known. But we just did get a 752 with the Sun driver and boot. Actually, the hardware has been ready for at least 3 months, but they contracted Sun consulting to do the software, and it has been slow in coming. In any case, the installation has been nearly flawless (a very minor glitch occurred, nothing substantial), and we have been running for a little while.

I'll try to digest your long posting to see if I can arrive at comparable stats for the 752. If you have any canned shell scripts or distilled benchmarks, I'd be happy to run them.

Anyways, I think the Xylogics (now) is very much a contender in the VME disk controller race. Certainly Sun decided on them. We almost went with Interphase, but the promise of continued Sun compatibility persuaded us to wait (what we thought would be just a short while).

Cheers,
Dan Ts'o
Dept. Neurobiology        212-570-7671
Rockefeller Univ.         ...cmcl2!rna!dan
1230 York Ave.            rna!dan@nyu.arpa
NY, NY 10021