dave@calgary.UUCP (Dave Mason) (05/26/88)
We are planning to replace 2 of our Vax 11/780s with 2 Sun 4/280s. Each vax
has 6 Mbytes of memory, 2 RA80 and 1 RA81, and 40 terminals. The vaxes are
currently running 4.3 BSD + NFS (from Mt Xinu). Each sun is planned to have
32 Mbytes of memory, 2 of the new NEC disk drives and will be running the same
40 terminals. The vaxes are being used by undergrads doing pascal, f77 and C
programming (compile and bomb). Most students use umacs (micro-emacs) as their
text editor.

What I was wondering is: has anyone done a similar type of switchover? Is
there a horrendous degradation of response when the load average gets
sufficiently high, or does it degrade linearly with respect to load average?
Is overall performance of a Sun 4/280 better/worse/the same as a similarly
loaded vax 11/780 (as configured above)? Were there any surprises when you did
the switchover?

My personal feeling is that we will win big, but the local DEC salesman is
making noises about Sun 4/280 performance, especially with > 15 users. I just
want to confirm whether my opinion of the local DEC sales office is well
founded :-). Please mail your responses. If there is sufficient interest I'll
post a summary to the net. Thanks in advance for any comments.

Dave Mason
University of Calgary
{ubc-cs,alberta,utai}!calgary!dave
weiser.pa@xerox.com (05/28/88)
What your DEC salesperson may have heard, undoubtedly very indirectly, is that
there is a knee in the performance curve of the Sun-4/280 at > 15 processes
ready-to-run. This has nothing to do with > 15 users: more like a load average
of > 15. Do your vaxes ever run with a load average of > 15? If not, ok. But
if they EVER hit 16 or 17, watch out on the Sun-4's: I can trivially get my
Sun-4 completely wedged, so that I have to reboot with L1-A, just by starting
19 little processes which sleep for 100ms, wake up, and sleep again. This
doesn't even raise the load average (but it amounts to a load average of 19 to
the context-switching mechanism, although not to the CPU).

And the Sun-3's are no better: the knee there is >7 processes.

-mark
olson@modular.UUCP (Jon Olson) (05/29/88)
> What your DEC salesperson may have heard, undoubtedly very indirectly, is that
> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
> ready-to-run. This has nothing to do with > 15 users: more like a load average
> of > 15. Do your vaxes ever run with a load average of > 15? If not, ok. But,
> if they EVER hit 16 or 17, watch out on the Sun-4's: I can trivially get my
> Sun-4 completely wedged so I have to reboot with L1-A by just starting 19 little
> processes which sleep for 100ms, wake-up and sleep again. This doesn't even
> raise the load average (but amounts to a load average of 19 to the context
> switching mechanism, although not to the cpu).
>
> And the Sun-3's are no better: the knee there is >7 processes.
>
> -mark

Nonsense, I just tried forking 32 copies of the following program on my Sun
3/60 workstation. Each one sleeps for 100 milliseconds, wakes up, and sleeps
again. With 32 copies of it running, I could notice no difference in response
time, and a `ps aux' showed none of them using a significant amount of CPU
time. Maybe you are just running out of memory and doing a lot of swapping?

What I have noticed on our Vax 11/780, running VMS, is that it is often
equally slow with 1 user or 20 users. Possibly VMS avoids the `knee' by
raising the priority of the NULL task when there aren't many people on the
machine???
---------------------------------------------------
#include <sys/time.h>

main()
{
	struct timeval tv;

	tv.tv_sec = 0;
	tv.tv_usec = 100000;
	for( ;; )
		select( 0, 0, 0, 0, &tv );
}
---------------------------------------------------
-- 
Jon Olson, Modular Mining Systems
USENET: {ihnp4,allegra,cmcl2,hao!noao}!arizona!modular!olson
INTERNET: modular!olson@arizona.edu
olson@modular.UUCP (Jon Olson) (05/29/88)
I also tried forking 32 `for(;;) ;' loops on a 3/60 with 8-mb. Each process
got about 3 percent of the CPU and the response was still quite good for
interactive work. This stuff about a `knee' at 7 processes just isn't real...
-- 
Jon Olson, Modular Mining Systems
USENET: {ihnp4,allegra,cmcl2,hao!noao}!arizona!modular!olson
INTERNET: modular!olson@arizona.edu
stan@sdba.UUCP (Stan Brown) (05/31/88)
> What your DEC salesperson may have heard, undoubtedly very indirectly, is that
> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
> ready-to-run. This has nothing to do with > 15 users: more like a load average
> of > 15. Do your vaxes ever run with a load average of > 15? If not, ok. But,
> if they EVER hit 16 or 17, watch out on the Sun-4's: I can trivially get my
> Sun-4 completely wedged so I have to reboot with L1-A by just starting 19 little
> processes which sleep for 100ms, wake-up and sleep again. This doesn't even
> raise the load average (but amounts to a load average of 19 to the context
> switching mechanism, although not to the cpu).
>
> And the Sun-3's are no better: the knee there is >7 processes.
>
> -mark

Realizing that if this *is* true on the RoadRunner it will be true at a much
lower number, does anyone know whether such a thing is true on it?
-- 
Stan Brown S. D. Brown & Associates 404-292-9497
(uunet gatech)!sdba!stan "vi forever"
arosen@eagle.ulowell.edu (MFHorn) (06/01/88)
In article <601@modular.UUCP> olson@modular.UUCP (Jon Olson) writes:
> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
> And the Sun-3's are no better: the knee there is >7 processes.

Some time ago I saw a Sun 3/280 with a load average of 17+. There were 17
'extra' jobs running. I don't know what they were doing (they weren't mine),
but there was no [noticeable] degradation in response time at all.

Andy Rosen | arosen@hawk.ulowell.edu | "I got this guitar and I
ULowell, Box #3031 | ulowell!arosen | learned how to make it
Lowell, Ma 01854 | | talk" -Thunder Road
RD in '88 - The way it should be
weiser.pa@xerox.com (06/01/88)
--------------------
Nonsense, I just tried forking 32 copies of the following program
on my Sun 3/60 workstation. Each one sleeps for 100 milliseconds,
wakes up, and sleeps again. With 32 copies of it running, I could
notice no difference in response time and a `ps aux' showed none
of them using a significant amount of CPU time. Maybe you are just
running out of memory and doing alot of swapping?
What I have noticed on our Vax 11/780, running VMS, is that it is
often equally slow with 1 user or 20 users. Possibly VMS avoids the
`knee' by raising the priority of the NULL task when there aren't many
people on the machine???
#include <sys/time.h>
main()
{
struct timeval tv;
tv.tv_sec = 0;
tv.tv_usec = 100000;
for( ;; )
select( 0, 0, 0, 0, &tv );
}
--------------------
No, not nonsense. I changed 100000 to 25000, and ran 18 of these on my
Sun-4/260 with 120MB swap and 24MB ram, with very little else going on.
Perfmeter shows no disk activity, ps aux shows each of the 18 using almost no
cpu. (And each of the 18 has more than a millisecond to get in and out of
select, which is certainly enough). And the system is on its knees! (If it
doesn't work for you, try 19 or 20 or 21). Window refreshes take tens of
seconds. If I kill off 3 of these, all is back to normal.
I don't have a 60C to try this on. But, try reducing that delay factor and see
if you don't also see a knee in the performance curve well before the cpu should
be swamped. (And in any case, swamped cpu doesn't need to imply knee in the
curve...)
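If you want to vary the count and the delay without editing the program each
time, a driver along these lines should do it. (This is a sketch typed
straight into this message rather than lifted from a working test, and the
name `sleepers' is made up; the arguments are the number of processes and the
per-sleep delay in milliseconds.)

/*
 * sleepers.c -- sketch of a driver for the test above: fork N copies
 * of a process that sleeps <ms> milliseconds in select(), wakes up,
 * and sleeps again.  Kill the process group to clean up afterwards.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
	int  n  = (argc > 1) ? atoi(argv[1]) : 19;	/* how many sleepers  */
	long ms = (argc > 2) ? atol(argv[2]) : 100;	/* sleep length, msec */
	struct timeval tv;
	int  i;

	for (i = 0; i < n; i++) {
		if (fork() == 0) {			/* child: nap forever */
			for (;;) {
				tv.tv_sec  = ms / 1000;
				tv.tv_usec = (ms % 1000) * 1000;
				select(0, 0, 0, 0, &tv);
			}
		}
	}
	for (;;)
		pause();				/* parent just sits here */
	return 0;
}

Something like `sleepers 19 25' gives the 19-process, 25 ms case described
above.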
-mark
bzs@bu-cs.BU.EDU (Barry Shein) (06/01/88)
Although I don't disagree with the original claim of Suns having knees
(related to NeXT being pronounced Knee-zit? never mind) the discussion can
lose sight of reality here.

A 780 cost around $400K* and supported around 20 logins; a Sun4 or even
Sun3/280 probably comes close to that in support for around 1/5 the price or
less, and the CPU is much faster when a job gets it. If your Vax was horribly
overloaded and had 32 users, just buy more than one system and split the
community. You'll also double the I/O paths that way and probably have at
least one system up almost all the time (we NFS'd everything between our Suns
in Math/Computer Science and Information Technology here so they can log into
any of them, although that does mean that if your home dir is on a down system
you lose.) Also the cost of things like memory is so much lower that you can
cheat like hell on getting performance. Who ever had a 32MB 780? That's
practically a minimum config for a Sun4 server.

The best use for a Sun server as a time-sharer is if a) you don't expect rapid
growth in the number of logins (eg. doubling in a year) that will outgrow the
machine and b) you expect a lot of the community using the system to migrate
from dumb terminals to workstations in the reasonably near future. That way,
voila, you have the server, especially if each new workstation means one less
time-sharer and it converges fairly rapidly. It's a nice way to give them time
to get their financial act together to buy workstations. For example, for our
CS and Math Faculty here having 3 servers worked out very well; many of the
users have now grown into workstations and the server facilities were "just
there". Another rationale of course is that you're looking for just a little
system for perhaps a dozen or so peak load people; I don't know any system
off-hand that can do that as nicely as a system like the above for the money.

If your needs are much more in the domain of traditional time-sharing (eg.
hordes of students that never cease growing term to term, dumb terminals and
staying that way for the next few years [typically, if you ever get them
workstations you'll put an appropriate, separate, server in *that* budget])
then you probably want to look at something more expandable/upgradeable. I
find Encores and (no direct experience but I hear good things) Sequents pretty
close to perfect for that kind of usage. I'm sure there are others that will
suffice but we don't use them so I can't comment (we have 7 Encores and over
100 Suns here.)

Anyhow, seat-of-the-pants systems analysis on the net is probably a precarious
thing at best. I hope I've pointed out that the issues are several and that
small differences in two groups' needs can make any recommendation
inapplicable. All I can say is we have quite a few Sun 3 servers here doing
something resembling traditional time-sharing and everyone seems very happy
with it. Given the right conditions it works out well, given the wrong ones no
doubt it would be a nightmare, so what else is new?

-Barry Shein, Boston University

P.S. I have no vested interest in any of the above mentioned companies,
although I am on the Board of Directors of the Sun Users Group; I doubt that
would be considered "vested".

* Yes I realize that it's been almost 10 years since the 780 came out, but
that was the original question.
guy@gorodish.Sun.COM (Guy Harris) (06/01/88)
> Realizing that if this *is* true on the RoadRunner it will be
> true at a much lower number,

No, not true. The RR has about the same raw CPU speed as a 3/200-series
machine. Furthermore, it has a different memory management unit; it appears
the MMU may be the crux of the biscuit here.

> does anyone know if such a thing is true on it ?

If, as would be indicated by the number of processes at which the knee occurs,
the knee is caused by running out of MMU contexts (the Sun-3 MMU has 7
contexts available for user processes, the Sun-4 has 15), I would tend not to
expect the same phenomenon on an RR; the '386 has a fairly conventional
in-memory-page-table MMU.

DISCLAIMER: This is just an educated guess. I don't have any numbers to back
this up. Don't take this as gospel truth; if you *do* get numbers, let us all
know, the results may be interesting (especially if they *don't* back this
guess up).
jfh@rpp386.UUCP (John F. Haugh II) (06/02/88)
In article <7331@swan.ulowell.edu> arosen@hawk.ulowell.edu (MFHorn) writes:
>In article <601@modular.UUCP> olson@modular.UUCP (Jon Olson) writes:
>> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
>> And the Sun-3's are no better: the knee there is >7 processes.
>
>Some time ago I saw a Sun 3/280 with a load average of 17+. There were
>17 'extra' jobs running. I don't know what they were doing (they weren't
>mine), but there was no [noticeable] degradation in response time at all.

our plexus p/95 (20MHz 68020, vme bus, 8MB ram, esdi controller) knees at
about 20 users with a load average of 10+. on the few occasions the machine
has been to 13+ it has crashed shortly thereafter.

the p/55 (12.5MHz 68020, multi bus, 4MB ram, scsi? controller) knees at about
10 users. i don't know the load average off hand but it has been up around 10
without crashing. it just gets painfully slow.

i suggest the Big Problem is with the disk/controller combinations. my '386
can't run an expire and an rn together because the disk saturates. same seems
to be true with the plexus machines. the p/55 has a single controller and a
single drive. the p/95 has a single (faster) controller with two drives. once
the i/o on either plexus is saturated (the famous popcorn noise is my general
working definition), regardless of the number of processes, adding one more
seriously dogs the system.

- john.
-- 
John F. Haugh II | "If you aren't part of the solution,
River Parishes Programming | you are part of the precipitate."
UUCP: ihnp4!killer!rpp386!jfh | -- long since forgot who
DOMAIN: jfh@rpp386.uucp |
weiser.pa@xerox.com (06/03/88)
Andy Rosen writes:
"Some time ago I saw a Sun 3/280 with a load average of 17+. There were
17 'extra' jobs running. I don't know what they were doing (they weren't
mine), but there was no [noticable] degradation in response time at all."
I just tried this on my Sun-4/280 by running 30 cpu-eating processes.
("while(1);"). Sure enough, even with the load at 30, I got much better
response than I did with only 20 of the little 50 ms sleeper programs I posted a
day or so ago. One way to interpret this is that when Sun's scheduler knows
that it has 30 processes on the queue, it does a better job of sharing the
limited resource of contexts, than if it thinks there is nothing to do, but
every 50ms 20 jobs all suddenly leap up and call for attention... But I don't
know for sure.
Perhaps the 50ms. sleeper test is a red herring, and that pathological state is
not one that is ever seen under normal user loads.
But in any case, we got on to this topic by something a DEC salesperson said to
discourage Sun purchases, and I think, because the knee is real but perhaps only
in pathological cases that no one really cares about, we have exactly a
salesperson sort of "fact". Mystery resolved.
-mark
m5@lynx.UUCP (Mike McNally) (06/04/88)
Re: small processes that sleep-wakeup-sleep-wakeup... I tried this on my
Integrated Solutions 68020 thing and got results similar to those of the Sun;
that is, up to about 6 or 7 of them the system works fine, but after that
everything gets real slow (I can't test it too much because everybody gets mad
here when the machine freezes up).

I tried the same thing under LynxOS, our own BSD-compatible real-time OS, and
didn't notice very much degradation at all. A major difference between our
machine and the Integrated Solutions is the MMU: even though our platform is a
68010, our MMU is 16K of static RAM that holds all the page tables all the
time. Context switch time is thus real small. Also, I think it's possible that
the mechanism for dealing with the timeout in select() is different internally
under LynxOS as opposed to Unix. Of course, under the real-time OS, a
high-priority CPU-bound task gets the whole CPU, no questions asked. That's a
great way of degrading editor response :-).

As a somewhat related side question, what does the Sun 4/SPARC MMU look like?
Are lookaside buffer reloads done in software like on the MIPS R[23]000? (Is
that really true about the R[23]000 anyhow?)
-- 
Mike McNally of Lynx Real-Time Systems
uucp: lynx!m5 (maybe pyramid!voder!lynx!m5 if lynx is unknown)
hedrick@athos.rutgers.edu (Charles Hedrick) (06/05/88)
I've played around with our Sun 4's a bit (and also with a VAX 750) to
duplicate the various tests. I can confirm that with many processes waiting
for very small times, there is in fact a very sharp "knee". It happened for me
at something like 19 processes. Vmstat is probably the best tool for watching
this. With 18 processes, vmstat showed over 90% of the system idle. Start one
more and suddenly 3% idle and over 90% of the CPU spent in system state.
Killing and restarting that last process would cause the system to toggle
between the two states. It was very dramatic.

In retrospect it is very clear what is going on. There are a finite number of
hardware contexts in the MMU. Presumably (assuming rational system
programmers) they are managed much like virtual memory. That is, when a
process is to be activated, its MMU info must be put in one of the contexts.
If it isn't there already, some algorithm (maybe LRU?) is used to decide which
process' information to remove. Every time a process is activated and its
information isn't already in a context register, some work has to be done
(which seems to take about 1 msec). Problems are going to occur when new
processes have to be put in context registers at a rate that is more than
about 100/sec. This requires not only a lot of processes, but also a lot of
process activations. That is, you are always OK if the number of active
processes is less than 15, since those will fit into the hardware context
registers. But you are also OK if you have more than 15 processes, as long as
they aren't being activated at a high rate. Even if you have 100 CPU-bound
processes, the problem won't occur as long as the scheduler gives them fairly
long runtime slices. This is the reason that changing the amount of sleep time
in the tests was so critical.

It's hard to know offhand exactly when this problem will show up in practice,
but I have to believe that somebody at Sun has done simulation studies with
reasonable job mixes, since that's the way the game is played these days. But
it is not the case that your system will come to a screaming halt when you
activate the 16th process, and it certainly is not limited to 15 users. On the
other hand, nobody is claiming that the Sun 4's are intended for 100 users.
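For a rough back-of-the-envelope using the figures above (take this as an
illustration of the shape of the problem, not a measurement), consider the
earlier test of 20 sleepers at 50 ms each:

    20 processes x 20 wakeups/sec              =  400 activations/sec
    if most of those miss in the 15 user contexts,
    400 reloads/sec x ~1 msec per reload       =  ~400 msec of reload work
                                                  per second, i.e. ~40% of
                                                  the CPU on MMU housekeeping

With 15 or fewer such processes, nearly every activation finds its context
still resident and the reload cost is close to zero, which is why the
transition looks like a cliff rather than a slope.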
egisin@watmath.waterloo.edu (Eric Gisin) (06/05/88)
In article <15875@brl-adm.ARPA>, weiser.pa@xerox.com writes:
> I just tried this on my Sun-4/280 by running 30 cpueating processes.
> ("while(1);"). Sure enough, even with the load at 30, I got much better
> response than I did with only 20 of the little 50 ms sleeper programs I posted a
> day or so ago. One way to interpret this is that when Sun's scheduler knows
> that it has 30 processes on the queue, it does a better job of sharing the
> limited resource of contexts, than if it thinks there is nothing to do, but
> every 50ms 20 jobs all suddenly leap up and call for attention... But I don't
> know for sure.

The scheduler doesn't know that it has 30 processes on the queue. With the 20
50ms sleeper jobs there will be 20*20 = 400 context switches per second. The
16 (or whatever) sets of MMU mappings can't hold all the active processes.
With 30 "while(1);" jobs, the scheduler reschedules a compute-bound job every
N clock ticks, or a few times a second. If you are using a screen editor, it
is likely the process's context stays in the MMU between keystrokes in the
latter case, resulting in quick interactive response. Or it could be that the
50 ms jobs wake up with a high priority relative to the interactive process.
mangler@cit-vax.Caltech.Edu (Don Speck) (06/05/88)
In article <15875@brl-adm.ARPA>, weiser.pa@xerox.com writes:
> Perhaps the 50ms. sleeper test is a red herring, and that pathological state is
> not one that is ever seen under normal user loads.

When I started using 4.3 BSD /etc/dump with two tape drives on this 780, I was
getting 250 context switches per second among 8 processes.

For another curious VAX/SUN comparison, notice how pipes on Sun-3's run half
as fast when the buffer address is odd. No such penalty on vaxen.

Don Speck speck@vlsi.caltech.edu {amdahl,ames!elroy}!cit-vax!speck
mash@mips.COM (John Mashey) (06/05/88)
In article <3859@lynx.UUCP> m5@lynx.UUCP (Mike McNally) writes:
...
>As a somewhat related side question, what does the Sun 4/SPARC MMU look
>like? Are lookaside buffer reloads done in software like on the MIPS
>R[23]000? (Is that really true about the R[23]000 anyhow?)

The Sun-4 MMU, like earlier Suns, doesn't use a TLB, but has SRAMs for memory
maps (16 contexts' worth, compared to 8 in Sun-3/200, for example).

The R[23]000 indeed do TLB-miss refill handling in software; this is not
unusual in RISC machines: HP Precision and AMD 29K (at least) do this also.
The overall cost is typically 1% or less of CPU time, which is fairly
competitive with hardware refill, especially since one of the larger costs on
faster machines is the accumulated cache-miss penalty for fetching PTEs from
memory.
-- 
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
greg@xios.XIOS.UUCP (Greg Franks) (06/06/88)
>I also tried forking 32 `for(;;) ;' loops on a 3/60 with 8-mb.
>Each process got about 3 percent of the CPU and the response was
>still quite good for interactive work. This stuff about a `knee'
>at 7 processes just isn't real...

However, the 32 processes do nothing but chew up CPU cycles. Add some disk
I/O, other random interrupts, and a desire for memory to your test.
-- 
Greg Franks XIOS Systems Corporation, 1600 Carling Avenue,
utzoo!dciem!nrcaer!xios!greg Ottawa, Ontario, Canada, K1Z 8R8. (613)725-5411.
ACME Electric: When you can't find your short, call us!
rick@seismo.CSS.GOV (Rick Adams) (06/07/88)
Last year when seismo (a Sun 3/160) was still passing mail around, there was a VERY obvious performance degradation when the 8th or 9th sendmail became active. (No we didn't run out of memory. That happened at about 14 sendmails) I have always attributed it to the 7 user contexts. ---rick
brw@jim.UUCP (06/08/88)
At the risk of adding to this pointless discussion, why are we comparing a
10-year-old machine with what I think is the latest Sun? How about a
comparison of the Sun 4/280 and a similarly configured MicroVax III (Yes,
that's 3, not 2!). I think they are comparable in price; are they in
performance? (I was quoted 50K aust for a diskless MV3 a while ago; what is a
Sun 4/280 worth?)
-- 
Brian Wallis (brw@jim.odr.oz) O'Dowd Research P/L.
(03) 562-0100 Fax: (03) 562-0616, Telex: Jacobs Radio (Bayswater) 152093
bzs@bu-cs.BU.EDU (Barry Shein) (06/09/88)
From: brw@jim.odr.oz (Brian Wallis)
>At the risk of adding to this pointless discussion, why are we
>comparing a 10 year old machine with what I think is the latest Sun.

Mostly because that's how the question was originally posed, something like
"if I replace my 780 with a Sun4/280 will my community be serviced
better/worse/same?" It's a reasonable question if that's precisely the
situation you are facing, as apparently many are. You want to know if the
replacement will work out.

>How about a comparison of the Sun 4/280 and a simmilarly configured
>MicroVax III (Yes thats 3, not 2!). I think they are comparable in
>price, are they in performance (I was quoted 50K aust for a diskless
>MV3 a while ago, what is a sun 4/280 worth?)

A different, reasonable question. I think a Sun4/280 will run a bit less in
list price, but perhaps not so much different as to affect the comparison
(there are other, more important considerations when you're only talking
about price differences on items under US$100K anyhow.)

-Barry Shein, Boston University
allbery@ncoast.UUCP (Brandon S. Allbery) (06/10/88)
As quoted from <2282@rpp386.UUCP> by jfh@rpp386.UUCP (John F. Haugh II):
+---------------
| our plexus p/95 (20MHz 68020, vme bus, 8MB ram, esdi controller) knees at
| about 20 users with a load average of 10+. on the few occasions the
| machine has been to 13+ it has crashed shortly thereafter.
|
| the p/55 (12.5MHz 68020, multi bus, 4MB ram, scsi? controller) knees at
| about 10 users. i don't know the load average off hand but it has been
| up around 10 without crashing. it just gets painfully slow.
+---------------

The P/55 has a dumb SMD controller, unless you bought the EMSP, which is a
smart SMD controller identical to that used on at least some Sun-3's. Query:
how did you calculate load average? Is the P/95 *really* ESDI? I thought they
used a VMEbus Xylogics SMD... but I don't really know that much about the '95.

4MB RAM is not the best way to run if you have 10 users. This is from
experience. You swap *way* too much under SVR2; experience with 2MB on a 386
box with SVR3.1 and 8 heavy database users shows way too much paging.

P/60 with 12.5MHz 68020, multibus, 7MB RAM, EMSP (Xylogics 451 SMD) disk
interface: ran 18 users at a load average (courtesy my /etc/avenrun) of 2.
[Note that I've never been certain of the reliability of /etc/avenrun as
compared to BSD, since the actual code isn't mine and *definitely* isn't in
the kernel where it should be to be accurate.] No problems whatsoever; lots of
DBMS, some WP and at least one C compile -- often two concurrent (that was me
;-) and the system still responded quite well. Usage has changed since I left
the company that has that configuration; I can check on the current
statistics. (N.B. The configuration described above is actually a P/60 which
has been upgraded to a P/75, except for the serial/parallel I/O controllers.)

As for the 386 box mentioned above: at 4MB, 8 users all doing database (or 7
users doing database and one compile, again me), it *said* that the kernel
load average was 72. The "load average" it reports, however, seems to be not
what we usually call load average; you have to divide by the number of
processes, typically 75-80 = load average slightly less than 1. Performance?
Let's put it this way: at full load, it was faster than ncoast (68000, 2MB
RAM, dumb SMD controller: P/35) with *one* user.
-- 
Brandon S. Allbery | "Given its constituency, the only
{uunet!marque,sun!mandrill}!ncoast!allbery | thing I expect to be "open" about
Delphi: ALLBERY MCI Mail: BALLBERY | [the Open Software Foundation] is
comp.sources.misc: ncoast!sources-misc | its mouth." --John Gilmore
mangler@cit-vax.Caltech.Edu (Don Speck) (06/13/88)
I am reminded of this article from comp.arch:

In article <44083@beno.seismo.CSS.GOV>, rick@seismo.CSS.GOV (Rick Adams) writes:
> Well, to start with I've got a Vax 11/780 with 7 6250 bpi 125 ips
> tape drives on it. It performs adequately when they are all running.
> I STILL haven't found anything to replace it with for a reasonable amount
> of money. Nothing in the Sun price range can handle that I/O volume.

I've seen a PDP-11/70 with eight tape drives, too. And as Barry Shein said,
"An IBM mainframe is an awesome thing...". One weekend, noticing the 4341
spinning a pair of GCR drives at over half their rated 275 ips, I was shocked
to learn that it was reading the disk file-by-file, not track at a time. BSD
filesystems just can't compare to what this 2-MIPS machine could do with
apparent ease.

How do they get that kind of throughput? I refuse to believe that it's all
hardware. Mainframe disks rotate at 3600 RPM like everybody else's and their
3 MB/s transfer rate is only slightly higher than a SuperEagle. A 2-MIPS CPU
would be inadequate to run a BSD filesystem at those speeds, so obviously
their software overhead is a lot lower, while at the same time wasting no disk
time. What is VM doing efficiently that Unix does inefficiently?

Don Speck speck@vlsi.caltech.edu {amdahl,ames!elroy}!cit-vax!speck
bzs@bu-cs.BU.EDU (Barry Shein) (06/13/88)
>How do they get that kind of throughput? I refuse to believe that it's >all hardware. Mainframe disks rotate at 3600 RPM like everybody else's >and their 3 MB/s transfer rate is only slightly higher than a SuperEagle. >A 2-MIPS CPU would be inadequate to run a BSD filesystem at those speeds, >so obviously their software overhead is a lot lower, while at the same >time wasting no disk time. What is VM doing efficiently that Unix does >inefficiently? > >Don Speck speck@vlsi.caltech.edu {amdahl,ames!elroy}!cit-vax!speck I think a lot of it *is* hardware. I know the big mainframes better than the small ones. I/O devices are attached indirectly thru channel controllers. Channels have their own paths to/from memory (that's critical, multiple DMAs simultaneously.) Also, channels are intelligent, I remember people saying the channels for the 370/168 had roughly the same computing power as the 370/158 (ie. one model down, sort of like saying that Sun3/280's use Sun3/180's as disk controllers, actually the compute power is very similar in that comparison.) Channels execute channel commands directly out of memory, sort of linked list structs in C lingo, with commands, offsets etc embedded in them (this has become more common in the mini market also, the UDA is similar tho I don't know if it's quite as general.) Channels can also do things like search disks for particular keys, hi/lo/equal, without involving the central processor. I don't know how much this is used in the various filesystems, obviously a general data base thing. The channels themselves aren't all that fast, around 3MB/sec, but 16 of them pumping simultaneously to/from different blocks of memory can certainly make it feel fast. I heard IBM recently announced a new addition to the 3381 disk series (these are multi-GB disks) with 256MB (1/4 GB) of cache in the disk. Rich or poor it's better to be rich. The file systems tend to be much simpler (they avoid indirection at the lower levels), at least in OS, which I'm sure contributes to the performance, I/O is very asynchronous from a software perspective so starting multiple I/Os is a natural way to program and sit back waiting for completions. Note that RMS in VMS tries to mimic this kind of architecture, but no one ever accused a Vax of having fast I/O. A lot of what we would consider application code is in the OS I/O code, known as "access methods", so reading various file formats (zillions, actually, VSAM, ISAM, BDAM, BSAM...) and I/O disciplines (VTAM etc) can be optimized at the "kernel" level (there's also microcode assist on various machines for various operations), it also tends to push applications programmers towards "being kind" to the OS, things like pre-allocation of resources is pretty much enforced so a lot of the dynamic resource management is just not done during execution. There is little doubt that to get a lot of this speedup on Unix systems you'd have to give up niceties like tree'd directories, extending files whenever you feel like, dynamic file opening during run-time (OS tends to do deadlock avoidance rather than detection or recovery so it needs to know what files you plan to use before your jobs starts, that explains a *lot* of what JCL is all about, pre-allocation of resources), etc. 
You probably wouldn't like it, it would look just like MVS :-) You'd also have to give up what we call "terminals" in most cases, IBM terminals (327x's) on big systems are much more like disks, half-duplex, fill in a screen locally and then blast entire screens to/from memory in one block I/O operation, no per-char I/O. Emacs would die. It helps, especially when you have a lot of terminals. I read about an IBM transaction system with 15,000 terminals logged in, I said a lot of terminals. But don't underestimate raw, frothing, manic hardware. It's a big trade-off, large IBM mainframes are to I/O what Crays are to floating point, but you really have to have the problem to want the cure, for most folks it's unnecessary, MasterCard etc excepted. -Barry Shein, Boston University
dmr@alice.UUCP (06/14/88)
After describing a lot of the grot you have to go through to get 3MB/s out of
an MVS system, Barry Shein wrote,

> But don't underestimate raw, frothing, manic hardware.
> It's a big trade-off, large IBM mainframes are to I/O what Crays are
> to floating point...

Crays are better at I/O, too. For example, I made a 9947252-byte file by
catting 4 copies of the dictionary and read it:

3K$ time dd bs=172032 </tmp/big >/dev/null
57+1 blocks in
57+1 blocks out
seconds elapsed 1.251356 user 0.000639 sys 0.300725

which is a cool 8MB/s read from an ordinary Unix file in competition with
other processes on the machine. (OK, I gave it a big buffer.) The big guys
would complain that they didn't get the full 10 or 12 MB/s that the disks
give. They would really be annoyed that I could get only 50 MB/s when I read
the file from the SSD, which runs at 1000MB/s, but to get it to go at full
speed you need to resort to non-standard Unix things.

The disk format on Unicos (Cray's version of SVr2) is an extent-based scheme
supporting the full Unix semantics except that they don't handle files with
holes (that is, the holes get filled in). In an early version, a naive
allocation algorithm sometimes created files ungrowable past a certain point,
but I think they've worked on the problem since then.

Dennis Ritchie
bzs@bu-cs.BU.EDU (Barry Shein) (06/14/88)
Dennis Ritchie points out that his Cray observes disk I/O speeds that compare
favorably to those claimed for large IBM mainframes; thus, in contrast to my
claim, Crays may indeed be the "Crays" of I/O.

I think the proper question is sort/merging a disk farm and doing 1000
transactions/sec or more while keeping 8 or 12 tapes turning at or near their
rated 200 ips, not pushing bits thru a single channel (if we're talking Crays
then we're talking 3090's.) If the Cray can keep pumping the I/O under those
conditions (typical job stream for a JC Penney's or Mastercard) then we all
better short IBM. Software or price would be no object if the Cray could do it
better (and more reliably, I guess that *is* an issue, but let's skip that for
now.)

Then again, who knows? Old beliefs die hard, far be it from me to defend the
Itsy Bitsy Machine company. Mayhaps the Amdahl crew can provide some
appropriate viciousness at this point :-) Oh, please do!

-Barry Shein, Boston University
terryl@tekcrl.TEK.COM (06/14/88)
In article <6926@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
>And as Barry Shein said, "An IBM mainframe is an awesome thing...".
>One weekend, noticing the 4341 spinning a pair of GCR drives at over
>half their rated 275 ips, I was shocked to learn that it was reading
>the disk file-by-file, not track at a time. BSD filesystems just
>can't compare to what this 2-MIPS machine could do with apparent ease.
>
>How do they get that kind of throughput? I refuse to believe that it's
>all hardware. Mainframe disks rotate at 3600 RPM like everybody else's
>and their 3 MB/s transfer rate is only slightly higher than a SuperEagle.
>A 2-MIPS CPU would be inadequate to run a BSD filesystem at those speeds,
>so obviously their software overhead is a lot lower, while at the same
>time wasting no disk time. What is VM doing efficiently that Unix does
>inefficiently?

Well, it might be partially due to hardware. Remember the dedicated I/O
channels the 360-370 systems have??? Do the 4341's have anything similar????
Similar to CDC Cyber's peripheral processors. Tape drives on a Cyber are
capable of blindingly fast things, but then, I've seen a tape drive on a Cyber
that could read a tape faster than ANY tape drive could rewind under UNIX
(caveat: I'm talking mainly DEC tape drives here).

Also, reading file by file; to quote a good joke from many moons ago: "That
man must be a lawyer. The information he gave is 100% accurate, but totally
useless." We need a little (actually quite a lot) more information before we
can say anything. What's the layout of the file on the disk??? What type of
file is it??? Is it extent-based, or something different? If it's
extent-based, what are the sizes of the extents??? Is there really a file
system on the disk in question, or is it just that one file???? etc.....

Boy Do I Hate Inews !!!!
mangler@cit-vax.Caltech.Edu (Don Speck) (06/16/88)
In article <23326@bu-cs.BU.EDU>, bzs@bu-cs.BU.EDU (Barry Shein) writes:
> I think the proper question is sort/merging a disk farm and doing 1000
> transactions/sec or more while keeping 8 or 12 tapes turning at or
> near their rated 200 ips, not pushing bits thru a single channel

The hard part of this is getting enough disk throughput to feed even one of
those 200-ips tape drives. The rest is replication.

Channels sound like essentially moving the disk driver into an I/O processor,
with lists of channel control blocks being analogous to lists of struct buf's.
This makes it feasible to do more optimizations, even real-time stuff like
scatter-gather, chaining, and rotational scheduling. Barry mentions the UDA-50
as being similar. But its processor is an 8085, and DMA speed is only 0.8
MB/s, making it much slower than a dumb controller. And the driver ends up
spending as much time constructing the channel control blocks as it would
spend tending a dumb controller like the Emulex SC7003. The Xylogics 450,
Xylogics 472, and DEC TS11 are like this too. I find them all disappointingly
slow.

I suspect the real reason for channel processors is to reduce interrupts,
which are so costly on big CPU's. It makes sense for terminals; people have
made I/O processors that talk to Unix in clists (KMC-11's, etc) which cuts the
total interrupt rate by a large fraction. But I don't think it's necessary, or
necessarily desirable, to inflict this on disks & tapes, and certainly not
unless the channel processor can talk in struct buf's.

For all the optimizations that these I/O processors are supposed to do, Unix
rarely gives them the chance. Unless there's more than two requests
outstanding at once, once they finish one, there's only one request to choose
from. Unix has minimal readahead, so that's as many requests as a single
process can generate. Raw I/O is even worse. Asynchronous reads would be the
obvious way to get enough requests in the queue to optimize, but that seems
unlikely to happen. Rather, explicit read commands are giving way to
memory-mapped files (in Mach and SunOS 4.0) where readahead becomes synonymous
with prepaging. It remains to be seen whether much attention is put into this.

Barry credits the asynchronous nature of I/O on mainframe OS's to the access
methods, like RMS on VMS. People avoid those when they want speed (imagine
using dbm to do sequential reads). For instance, the VMS "copy" command
bypasses RMS when copying disk-to-disk, with the curious result that it's
faster to copy to a disk than to the null device, because the null device is
record-oriented, requiring RMS.

As DMR demonstrates, parallel-transfer disks are great for big files. They're
horrendously expensive though, and it's hard enough to find controllers that
keep up with even 3 MB/s, much less 10 MB/s. But they can be simulated with
ordinary disks by striping across multiple controllers, *if* the disks rotate
as one. Does anyone know of a cost-effective disk that can phase-lock its
spindle motor to that of a second disk, or perhaps with the AC line? With
direct-drive electronically-controlled motors becoming common, this should be
possible. The Eagle has such a motor, but no provision for external sync. I
recall stories of Cray's using phase-locked disks to advantage.

Of course, to get the most from high transfer rates, you need large
blocksizes; DMR's example looked like about one revolution. Hence the
extent-based file allocation of mainframe OS's, etc.
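For concreteness, striping here just means dealing logical blocks out
round-robin across the drives. A toy sketch (nothing like a real driver; the
function name and block size are arbitrary):

/* Toy illustration of two-way block striping: logical block i lives
 * on drive i % 2, at physical block i / 2.  A real implementation
 * would keep requests outstanding on both controllers at once. */
#include <sys/types.h>
#include <unistd.h>

#define BSIZE 8192

int
read_striped(int fd[2], long lblock, char *buf)
{
	int   drive = lblock % 2;			/* which spindle */
	off_t where = (off_t)(lblock / 2) * BSIZE;	/* offset there  */

	if (lseek(fd[drive], where, SEEK_SET) == (off_t)-1)
		return -1;
	return read(fd[drive], buf, BSIZE);
}

The win only comes if both drives can actually be kept busy at the same time,
which is the same too-few-outstanding-requests problem as above.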
Perhaps it's time to pester Berkeley to double MAXBSIZE to 16384 bytes? It
would use 0.3% of memory for additional kernel page tables on a VAX, but
proportionately less on machines with larger page sizes. 8192 is practically
the *minimum* blocksize on Suns, these days.

The one point that nobody mentioned is that you don't want the CPU copying the
data around between kernel and user address spaces when there's a lot! (Maybe
it was just too obvious).

Don Speck speck@vlsi.caltech.edu {amdahl,ames!elroy}!cit-vax!speck
rbj@cmr.icst.nbs.gov (Root Boy Jim) (06/16/88)
? From: Barry Shein <bzs@bu-cs.bu.edu> ? Mayhaps the Amdahl crew can provide some appropriate viciousness at ? this point :-) Oh, please do! I suspect that Amdahl only says nice things about HAL. Don't bite the hand that feeds. ? -Barry Shein, Boston University (Root Boy) Jim Cottrell <rbj@icst-cmr.arpa> National Bureau of Standards Flamer's Hotline: (301) 975-5688 The opinions expressed are solely my own and do not reflect NBS policy or agreement Careful with that VAX Eugene!
friedl@vsi.UUCP (Stephen J. Friedl) (06/19/88)
In article <6963@cit-vax.Caltech.Edu>, mangler@cit-vax.Caltech.Edu (Don Speck) writes: > > But they [parallel-head disks, I think] can be simulated with > ordinary disks by striping across multiple controllers, *if* the > disks rotate as one. Does anyone know of a cost-effective disk > that can phase-lock its spindle motor to that of a second disk, > or perhaps with the AC line? I think Micropolis has done this with a handful of their 700MB drives. They put nine drives into a rack, use a special synchronizing controller, and give you striped drives. The ninth drive was for parity, and if a drive failed, you could pull out the bad one and drop in the new one: the controller would reconstruct the parity drive. Sorry, I don't have a reference... -- Steve Friedl V-Systems, Inc. (714) 545-6442 3B2-kind-of-guy friedl@vsi.com {backbones}!vsi.com!friedl attmail!vsi!friedl Nancy Reagan on the Mac-II architecture: "Just say Nu"