[comp.unix.wizards] Vax 11/780 performance vs Sun 4/280 performance

dave@calgary.UUCP (Dave Mason) (05/26/88)

We are planning to replace 2 of our Vax 11/780s with 2 Sun 4/280s.
Each vax has 6 Mbytes of memory, 2 RA80s and 1 RA81, and 40 terminals.
The vaxes are currently running 4.3 BSD + NFS (from Mt Xinu).
Each sun is planned to have 32 Mbytes of memory, 2 of the new NEC
disk drives and will be running the same 40 terminals. The vaxes
are being used by undergrads doing pascal, f77 and C programming
(compile and bomb).  Most students use umacs (micro-emacs) as
their text editor.

What I was wondering is: has anyone done a similar type of switchover?
Is there a horrendous degradation of response when the load average gets
sufficiently high, or does it degrade linearly with respect to load average?
Is overall performance of a Sun 4/280 better/worse/the same as a
similarly loaded vax 11/780 (as configured above)?
Were there any surprises when you did the switchover?

My personal feeling is that we will win big, but the local DEC salesman
is making noises about Sun 4/280 performance, especially with > 15 users.
I just want to confirm if my opinion of the local DEC sales office is well
founded :-).

Please mail your responses. If there is sufficient interest I'll post
a summary to the net.

Thanks in advance for any comments.


				Dave Mason
				University of Calgary
				{ubc-cs,alberta,utai}!calgary!dave

weiser.pa@xerox.com (05/28/88)

What your DEC salesperson may have heard, undoubtedly very indirectly, is that
there is a knee in the performance curve of the Sun-4/280 at > 15 processes
ready-to-run.  This has nothing to do with > 15 users: more like a load average
of > 15.  Do your vaxes ever run with a load average of > 15?  If not, ok.  But,
if they EVER hit 16 or 17, watch out on the Sun-4's:  I can trivially get my
Sun-4 completely wedged so I have to reboot with L1-A by just starting 19 little
processes which sleep for 100ms, wake-up and sleep again.  This doesn't even
raise the load average (but amounts to a load average of 19 to the context
switching mechanism, although not to the cpu).

And the Sun-3's are no better: the knee there is >7 processes.

-mark

olson@modular.UUCP (Jon Olson) (05/29/88)

> What your DEC salesperson may have heard, undoubtedly very indirectly, is that
> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
> ready-to-run.  This has nothing to do with > 15 users: more like a load average
> of > 15.  Do your vaxes ever run with a load average of > 15?  If not, ok.  But,
> if they EVER hit 16 or 17, watch out on the Sun-4's:  I can trivially get my
> Sun-4 completely wedged so I have to reboot with L1-A by just starting 19 little
> processes which sleep for 100ms, wake-up and sleep again.  This doesn't even
> raise the load average (but amounts to a load average of 19 to the context
> switching mechanism, although not to the cpu).
> 
> And the Sun-3's are no better: the knee there is >7 processes.
> 
> -mark

Nonsense, I just tried forking 32 copies of the following program
on my Sun 3/60 workstation.  Each one sleeps for 100 milliseconds,
wakes up, and sleeps again.  With 32 copies of it running, I could
notice no difference in response time and a `ps aux' showed none
of them using a significant amount of CPU time.  Maybe you are just
running out of memory and doing a lot of swapping?

What I have noticed on our Vax 11/780, running VMS, is that it is
often equally slow with 1 user or 20 users.  Possibly VMS avoids the
`knee' by raising the priority of the NULL task when there aren't many
people on the machine???

---------------------------------------------------
#include <sys/time.h>

main()
  {
  struct timeval tv;

  /* select() with no descriptors is the usual BSD way to sleep for
     less than a second; sleep(3) only has one-second resolution. */
  tv.tv_sec = 0;
  tv.tv_usec = 100000;		/* 100 ms */
  for( ;; )
    select( 0, 0, 0, 0, &tv );
  }

-- 
Jon Olson, Modular Mining Systems
USENET:     {ihnp4,allegra,cmcl2,hao!noao}!arizona!modular!olson
INTERNET:   modular!olson@arizona.edu

olson@modular.UUCP (Jon Olson) (05/29/88)

I also tried forking 32 `for(;;) ;' loops on a 3/60 with 8 MB.
Each process got about 3 percent of the CPU and the response was
still quite good for interactive work.  This stuff about a `knee'
at 7 processes just isn't real...
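
For the curious, the sort of driver I mean looks something like this (a
minimal sketch, not the exact program I ran; NPROC is just a number to
play with):

#include <unistd.h>

#define NPROC	32			/* number of cpu-eating children to fork */

int
main()
{
	int i;

	for (i = 0; i < NPROC; i++)
		if (fork() == 0)	/* child: pure compute-bound loop */
			for (;;)
				;
	pause();			/* parent just waits; ^C kills the lot */
	return 0;
}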
-- 
Jon Olson, Modular Mining Systems
USENET:     {ihnp4,allegra,cmcl2,hao!noao}!arizona!modular!olson
INTERNET:   modular!olson@arizona.edu

stan@sdba.UUCP (Stan Brown) (05/31/88)

> What your DEC salesperson may have heard, undoubtedly very indirectly, is that
> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
> ready-to-run.  This has nothing to do with > 15 users: more like a load average
> of > 15.  Do your vaxes ever run with a load average of > 15?  If not, ok.  But,
> if they EVER hit 16 or 17, watch out on the Sun-4's:  I can trivially get my
> Sun-4 completely wedged so I have to reboot with L1-A by just starting 19 little
> processes which sleep for 100ms, wake-up and sleep again.  This doesn't even
> raise the load average (but amounts to a load average of 19 to the context
> switching mechanism, although not to the cpu).
> 
> And the Sun-3's are no better: the knee there is >7 processes.
> 
> -mark

	Realizing that if this *is* true on the RoadRunner it will be
	true at a much lower number, does anyone know if such a thing
	is true on it?


-- 
Stan Brown	S. D. Brown & Associates	404-292-9497
(uunet gatech)!sdba!stan				"vi forever"

arosen@eagle.ulowell.edu (MFHorn) (06/01/88)

In article <601@modular.UUCP> olson@modular.UUCP (Jon Olson) writes:
> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
> And the Sun-3's are no better: the knee there is >7 processes.

Some time ago I saw a Sun 3/280 with a load average of 17+.  There were
17 'extra' jobs running.  I don't know what they were doing (they weren't
mine), but there was no [noticeable] degradation in response time at all.

Andy Rosen           | arosen@hawk.ulowell.edu | "I got this guitar and I
ULowell, Box #3031   | ulowell!arosen          |  learned how to make it
Lowell, Ma 01854     |                         |  talk" -Thunder Road
                   RD in '88 - The way it should be

weiser.pa@xerox.com (06/01/88)

--------------------
Nonsense, I just tried forking 32 copies of the following program
on my Sun 3/60 workstation.  Each one sleeps for 100 milliseconds,
wakes up, and sleeps again.  With 32 copies of it running, I could
notice no difference in response time and a `ps aux' showed none
of them using a significant amount of CPU time.  Maybe you are just
running out of memory and doing a lot of swapping?

What I have noticed on our Vax 11/780, running VMS, is that it is
often equally slow with 1 user or 20 users.  Possibly VMS avoids the
`knee' by raising the priority of the NULL task when there aren't many
people on the machine???

#include <sys/time.h>

main()
  {
  struct timeval tv;

  tv.tv_sec = 0;
  tv.tv_usec = 100000;
  for( ;; )
    select( 0, 0, 0, 0, &tv );
  }

--------------------

No, not nonsense.  I changed 100000 to 25000, and ran 18 of these on my
Sun-4/260 with 120MB swap and 24MB ram, with very little else going on.
Perfmeter shows no disk activity, and ps aux shows each of the 18 using almost no
cpu.  (And each of the 18 has more than a millisecond to get in and out of select,
which is certainly enough.)  And the system is brought to its knees!  (If it doesn't
work for you, try 19 or 20 or 21.)  Window refreshes take tens of seconds.  If I
kill off 3 of these, all is back to normal.

I don't have a 60C to try this on.  But, try reducing that delay factor and see
if you don't also see a knee in the performance curve well before the cpu should
be swamped.  (And in any case, swamped cpu doesn't need to imply knee in the
curve...)

-mark

bzs@bu-cs.BU.EDU (Barry Shein) (06/01/88)

Although I don't disagree with the original claim of Suns having knees
(related to NeXT being pronounced Knee-zit? never mind) the discussion
can lose sight of reality here.

A 780 cost around $400K* and supported around 20 logins; a Sun4 or
even Sun3/280 probably comes close to that in support for around 1/5
the price or less, and the CPU is much faster when a job gets it.  If
your Vax was horribly overloaded and had 32 users, just buy more than
one system and split the community; you'll also double the I/O paths
that way and probably have at least one system up almost all the time
(we NFS'd everything between our Suns in Math/Computer Science and
Information Technology here so they can log into any of them, although
that does mean that if your home dir is on a down system you lose.)

Also the cost of things like memory is so much lower that you can
cheat like hell on getting performance. Who ever had a 32MB 780?
That's practically a minimum config for a Sun4 server.

The best use for a Sun server as a time-sharer is if a) you don't
expect rapid growth in the number of logins (eg. doubling in a year)
that will outgrow the machine, and b) you expect a lot of the community
using the system to migrate from dumb terminals to workstations in the
reasonably near future; that way, voila, you already have the server,
especially if each new workstation means one less time-sharer, so it
converges fairly rapidly.  It's a nice way to give them time to get
their financial act together to buy workstations.  For example, for our
CS and Math Faculty here having 3 servers worked out very well; many
of the users have now grown into workstations and the server
facilities were "just there".

Another rationale of course is that you're looking for just a little
system for perhaps a dozen or so peak-load people; I don't know of any
system off-hand that can do that as nicely as a system like the above
for the money.

If your needs are much more in the domain of traditional time-sharing
(eg. hordes of students that never cease growing term to term, dumb
terminals and staying that way for the next few years [typically, if
you ever get them workstations you'll put an appropriate, separate,
server in *that* budget]) then you probably want to look at something
more expandable/upgradeable. I find Encores and (no direct experience
but I hear good things) Sequents pretty close to perfect for that kind
of usage. I'm sure there are others that will suffice but we don't use
them so I can't comment (we have 7 Encores and over 100 Suns here.)

Anyhow, seat-of-the-pants systems analysis on the net is probably a
precarious thing at best; I hope I've pointed out that the issues are
several and that small differences in two groups' needs can make any
recommendation inapplicable.

All I can say is we have quite a few Sun 3 servers here doing
something resembling traditional time-sharing and everyone seems very
happy with it. Given the right conditions it works out well, given the
wrong ones no doubt it would be a nightmare, so what else is new?

	-Barry Shein, Boston University

P.S. I have no vested interest in any of the above mentioned companies
although I am on the Board of Directors of the Sun Users Group, I
doubt that would be considered "vested".

* Yes I realize that it's been almost 10 years since the 780 came out,
but that was the original question.

guy@gorodish.Sun.COM (Guy Harris) (06/01/88)

> 	Realizing that if this *is* true on the RoadRuner it will be
> 	true at a much lower number,

No, not true.  The RR has about the same raw CPU speed as a 3/200-series
machine.  Furthermore, it has a different memory management unit; it appears
the MMU may be the crux of the biscuit here.

> does anyone know if such a thing is true on it ?

If, as would be indicated by the number of processes at which the knee occurs,
the knee is caused by running out of MMU contexts (the Sun-3 MMU has 7 contexts
available for user processes, the Sun-4 has 15), I would tend not to expect the
same phenomenon on an RR; the '386 has a fairly conventional
in-memory-page-table MMU.

DISCLAIMER: This is just an educated guess.  I don't have any numbers to back
this up.  Don't take this as gospel truth; if you *do* get numbers, let us all
know, the results may be interesting (especially if they *don't* back this
guess up).

jfh@rpp386.UUCP (John F. Haugh II) (06/02/88)

In article <7331@swan.ulowell.edu> arosen@hawk.ulowell.edu (MFHorn) writes:
>In article <601@modular.UUCP> olson@modular.UUCP (Jon Olson) writes:
>> there is a knee in the performance curve of the Sun-4/280 at > 15 processes
>> And the Sun-3's are no better: the knee there is >7 processes.
>
>Some time ago I saw a Sun 3/280 with a load average of 17+.  There were
>17 'extra' jobs running.  I don't know what they were doing (they weren't
>mine), but there was no [noticable] degradation in response time at all.

our plexus p/95 (20MHz 68020, vme bus, 8MB ram, esdi controller) knees at
about 20 users with a load average of 10+.  on the few occasions the
machine has been to 13+ it has crashed shortly thereafter.

the p/55 (12.5MHz 68020, multi bus, 4MB ram, scsi? controller) knees at
about 10 users.  i don't know the load average off hand but it has been
up around 10 without crashing.  it just gets painfully slow.

i suggest the Big Problem is with the disk/controller combinations.  my
'386 can't run an expire and an rn together because the disk saturates.
same seems to be true with the plexus machines.  the p/55 has a single
controller and a single drive.  the p/95 has a single (faster) controller
with two drives.  once the i/o on either plexus is saturated (the famous
popcorn noise is my general working definition), regardless of the number
of processes, adding one more seriously dogs the system.

- john.
-- 
John F. Haugh II                 | "If you aren't part of the solution,
River Parishes Programming       |  you are part of the precipitate."
UUCP:   ihnp4!killer!rpp386!jfh  | 		-- long since forgot who
DOMAIN: jfh@rpp386.uucp          | 

weiser.pa@xerox.com (06/03/88)

Andy Rosen writes:
"Some time ago I saw a Sun 3/280 with a load average of 17+.  There were
17 'extra' jobs running.  I don't know what they were doing (they weren't
mine), but there was no [noticable] degradation in response time at all."

I just tried this on my Sun-4/280 by running 30 cpu-eating processes.
("while(1);").  Sure enough, even with the load at 30, I got much better
response than I did with only 20 of the little 50 ms sleeper programs I posted a
day or so ago.  One way to interpret this is that when Sun's scheduler knows
that it has 30 processes on the queue, it does a better job of sharing the
limited resource of contexts, than if it thinks there is nothing to do, but
every 50ms 20 jobs all suddenly leap up and call for attention...  But I don't
know for sure.

Perhaps the 50ms. sleeper test is a red herring, and that pathological state is
not one that is ever seen under normal user loads.

But in any case, we got onto this topic because of something a DEC salesperson said
to discourage Sun purchases, and I think, because the knee is real but perhaps only
in pathological cases that no one really cares about, we have exactly a
salesperson sort of "fact".  Mystery resolved.

-mark

m5@lynx.UUCP (Mike McNally) (06/04/88)

Re: small processes that sleep-wakeup-sleep-wakeup...

I tried this on my Integrated Solutions 68020 thing and got results
similar to those of the Sun; that is, up to about 6 or 7 of them the
system works fine, but after that everything gets real slow (I can't
test it too much because everybody gets mad here when the machine
freezes up).

I tried the same thing under LynxOS, our own BSD-compatible real-time
OS, and didn't notice very much degradation at all.  A major difference
between our machine and the Integrated Solutions is the MMU: even
though our platform is a 68010, our MMU is 16K of static RAM that holds
all the page tables all the time.  Context switch time is thus real
small.  Also, I think it's possible that the mechanism for dealing with
the timeout in select() is different internally under LynxOS as opposed
to Unix.

Of course, under the real-time OS, a high-priority CPU-bound task gets
the whole CPU, no questions asked.  That's a great way of degrading
editor response :-).

As a somewhat related side question, what does the Sun 4/SPARC MMU look
like?  Are lookaside buffer reloads done in software like on the MIPS
R[23]000?  (Is that really true about the R[23]000 anyhow?)

-- 
Mike McNally of Lynx Real-Time Systems

uucp: lynx!m5 (maybe pyramid!voder!lynx!m5 if lynx is unknown)

hedrick@athos.rutgers.edu (Charles Hedrick) (06/05/88)

I've played around with our Sun 4's a bit (and also with a VAX 750) to
duplicate the various tests.  I can confirm that with many processes
waiting for very small times, there is in fact a very sharp "knee".
It happened for me at something like 19 processes.  Vmstat is probably
the best tool for watching this.  With 18 processes, vmstat showed
over 90% of the system idle.  Start one more and suddenly 3% idle and
over 90% of the CPU spent in system state.  Killing and restarting
that last process would cause the system to toggle between the two
states.  It was very dramatic.

In retrospect it is very clear what is going on.  There are a finite
number of hardware contexts in the MMU.  Presumably (assuming rational
system programmers) they are managed much like virtual memory.  That
is, when a process is to be activated, its MMU info must be put in one
of the contexts.  If it isn't there already, some algorithm (maybe
LRU?) is used to decide which process' information to remove.  Every
time a process is activated and its information isn't already in a
context register, some work has to be done (which seems to take about
1 msec).  Problems are going to occur when new processes have to be
put in context registers at a rate that is more than about 100/sec.
This requires not only a lot of processes, but also a lot of process
activations.  That is, you are always OK if the number of active
processes is less than 15, since those will fit into the hardware
context registers.  But you are also OK if you have more than 15
processes, as long as they aren't being activated at a high rate.
Even if you have 100 CPU-bound processes, the problem won't occur as
long as the scheduler gives them fairly long runtime slices.  This is
the reason that changing the amount of sleep time in the tests was so
critical.

It's hard to know offhand exactly when this problem will show up in
practice, but I have to believe that somebody at Sun has done
simulation studies with reasonable job mixes, since that's the way the
game is played these days.  But it is not the case that your system
will come to a screaming halt when you activate the 16th process, and
it certainly is not limited to 15 users.  On the other hand, nobody is
claiming that the Sun 4's are intended for 100 users.
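
A quick back-of-the-envelope check, using the rough 1 msec figure above
and assuming every activation has to reload a context (just arithmetic,
not a measurement):

#include <stdio.h>

int
main()
{
	double nproc = 20.0;		/* sleeper processes */
	double interval = 0.050;	/* seconds each one sleeps between wakeups */
	double reload = 0.001;		/* rough cost of one context load, in seconds */
	double rate;

	rate = nproc / interval;	/* process activations per second */
	printf("%.0f activations/sec, roughly %.0f%% of the cpu spent loading contexts\n",
	    rate, 100.0 * rate * reload);
	return 0;
}

With 20 sleepers waking every 50 msec that works out to 400 activations
per second, or something like 40% of the machine doing nothing but
reloading contexts, before counting any of the other costs of a context
switch.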

egisin@watmath.waterloo.edu (Eric Gisin) (06/05/88)

In article <15875@brl-adm.ARPA>, weiser.pa@xerox.com writes:
> I just tried this on my Sun-4/280 by running 30 cpueating processes.
> ("while(1);").  Sure enough, even with the load at 30, I got much better
> response than I did with only 20 of the little 50 ms sleeper programs I posted a
> day or so ago.  One way to interpret this is that when Sun's scheduler knows
> that it has 30 processes on the queue, it does a better job of sharing the
> limited resource of contexts, than if it thinks there is nothing to do, but
> every 50ms 20 jobs all suddenly leap up and call for attention...  But I don't
> know for sure.
> 

The scheduler doesn't know that it has 30 processes on the queue.

With the 20 50ms sleeper jobs there will be 20*20 = 400 context switches
per second. The 16 (or whatever) sets of MMU mappings can't hold
all the active processes.

With 30 "while(1);" jobs, the scheduler reschedules a compute
bound job every N clock ticks, or a few times a second.

If you are using a screen editor, it is likely the process's
context stays in the MMU between keystrokes in the latter case,
resulting in quick interactive response.
Or it could be that the 50 ms jobs wake up with a high priority
relative to the interactive process.

mangler@cit-vax.Caltech.Edu (Don Speck) (06/05/88)

In article <15875@brl-adm.ARPA>, weiser.pa@xerox.com writes:
> Perhaps the 50ms. sleeper test is a red herring, and that pathological state is
> not one that is ever seen under normal user loads.

When I started using 4.3 BSD /etc/dump with two tape drives on this
780, I was getting 250 context switches per second among 8 processes.

For another curious VAX/SUN comparison, notice how pipes on Sun-3's run
half as fast when the buffer address is odd.  No such penalty on vaxen.

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

mash@mips.COM (John Mashey) (06/05/88)

In article <3859@lynx.UUCP> m5@lynx.UUCP (Mike McNally) writes:
...
>As a somewhat related side question, what does the Sun 4/SPARC MMU look
>like?  Are lookaside buffer reloads done in software like on the MIPS
>R[23]000?  (Is that really true about the R[23]000 anyhow?)

The Sun-4 MMU, like earlier Suns, doesn't use a TLB, but has SRAMs
for memory maps (16 contexts' worth, compared to 8 in Sun-3/200, for
example).

The R[23]000 indeed do TLB-miss refill handling in software;
this is not unusual in RISC machines: HP Precision and AMD 29K (at least)
do this also.  The overall cost if typically 1% or less of CPU time,
which is fairly competitive with hardware refill, especially since one
of the larger costs on faster machines is the accumulated cache-miss penalty
for fetching PTEs from memory.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

greg@xios.XIOS.UUCP (Greg Franks) (06/06/88)

>I also tried forking 32 `for(;;) ;' loops on a 3/60 with 8-mb.
>Each process got about 3 percent of the CPU and the reponse was
>still quote good for interactive work.  This stuff about a `knee'
>at 7 processes just isn't real...

However, the 32 processes do nothing but chew up CPU cycles.  Add some
disk I/O, other random interrupts, and a desire for memory to your test. 

-- 
Greg Franks                   XIOS Systems Corporation, 1600 Carling Avenue,
utzoo!dciem!nrcaer!xios!greg  Ottawa, Ontario, Canada, K1Z 8R8. (613)725-5411.
   ACME Electric: When you can't find your short, call us!

rick@seismo.CSS.GOV (Rick Adams) (06/07/88)

Last year when seismo (a Sun 3/160) was still passing mail around,
there was a VERY obvious performance degradation when the 8th or
9th sendmail became active. (No we didn't run out of memory. That
happened at about 14 sendmails)

I have always attributed it to the 7 user contexts.

---rick

brw@jim.UUCP (06/08/88)

At the risk of adding to this pointless discussion, why are we
comparing a 10-year-old machine with what I think is the latest Sun?
How about a comparison of the Sun 4/280 and a similarly configured
MicroVax III (Yes, that's 3, not 2!)?  I think they are comparable in
price; are they in performance?  (I was quoted 50K Aust. for a diskless
MV3 a while ago; what is a Sun 4/280 worth?)

-- 
Brian Wallis (brw@jim.odr.oz)		    O'Dowd Research P/L.
	(03) 562-0100 Fax: (03) 562-0616,
	Telex: Jacobs Radio (Bayswater) 152093

bzs@bu-cs.BU.EDU (Barry Shein) (06/09/88)

From: brw@jim.odr.oz (Brian Wallis)
>At the risk of adding to this pointless discussion, why are we
>comparing a 10 year old machine with what I think is the latest Sun.

Mostly because that's how the question was originally posed, something
like "if I replace my 780 with a Sun4/280 will my community be
serviced better/worse/same?"

It's a reasonable question if that's precisely the situation you are
facing as apparently many are. You want to know if the replacement
will work out.

>How about a comparison of the Sun 4/280 and a similarly configured
>MicroVax III (Yes, that's 3, not 2!). I think they are comparable in
>price, are they in performance (I was quoted 50K aust for a diskless
>MV3 a while ago, what is a sun 4/280 worth?)

A different, reasonable question. I think a Sun4/280 will run a bit
less in list price, but perhaps not so much different as to affect the
comparison (there are other, more important considerations when you're
only talking about price differences on items under US$100K anyhow.)

	-Barry Shein, Boston University

allbery@ncoast.UUCP (Brandon S. Allbery) (06/10/88)

As quoted from <2282@rpp386.UUCP> by jfh@rpp386.UUCP (John F. Haugh II):
+---------------
| our plexus p/95 (20MHz 68020, vme bus, 8MB ram, esdi controller) knees at
| about 20 users with a load average of 10+.  on the few occasions the
| machine has been to 13+ it has crashed shortly thereafter.
| 
| the p/55 (12.5MHz 68020, multi bus, 4MB ram, scsi? controller) knees at
| about 10 users.  i don't know the load average off hand but it has been
| up around 10 without crashing.  it just gets painfully slow.
+---------------

The P/55 has a dumb SMD controller, unless you bought the EMSP, which is a
smart SMD controller identical to that used on at least some Sun-3's.

Query:  how did you calculate load average?

Is the P/95 *really* ESDI?  I thought they used a VMEbus Xylogics SMD... but
I don't really know that much about the '95.

4MB RAM is not the best way to run if you have 10 users.  This is from
experience.  You swap *way* too much under SVR2; experience with 2MB on a
386 box with SVR3.1 and 8 heavy database users shows way too much paging.

P/60 with 12.5MHz 68020, multibus, 7MB RAM, EMSP (Xylogics 451 SMD) disk
interface:  ran 18 users at a load average (courtesy my /etc/avenrun) of 2.
[Note that I've never been certain of the reliability of /etc/avenrun as
compared to BSD, since the actual code isn't mine and *definitely* isn't in
the kernel where it should be to be accurate.]  No problems whatsoever; lots
of DBMS, some WP and at least one C compile -- often two concurrent (that
was me ;-) and the system still responded quite well.  Usage has changed
since I left the company that has that configuration; I can check on the
current statistics.  (N.B.  The configuration described above is actually a
P/60 which has been upgraded to a P/75, except for the serial/parallel I/O
controllers.)

As for the 386 box mentioned above:  at 4MB, 8 users all doing database (or
7 users doing database and one compile, again me), it *said* that the kernel
load average was 72.  The "load average" it reports, however, seems to be
not what we usually call load average; you have to divide by the number of
processes, typically 75-80, giving a load average slightly less than 1.  Performance?
Let's put it this way:  at full load, it was faster than ncoast (68000, 2MB
RAM, dumb SMD controller:  P/35) with *one* user.
-- 
Brandon S. Allbery			  | "Given its constituency, the only
{uunet!marque,sun!mandrill}!ncoast!allbery | thing I expect to be "open" about
Delphi: ALLBERY	       MCI Mail: BALLBERY | [the Open Software Foundation] is
comp.sources.misc: ncoast!sources-misc    | its mouth."  --John Gilmore

mangler@cit-vax.Caltech.Edu (Don Speck) (06/13/88)

I am reminded of this article from comp.arch:

In article <44083@beno.seismo.CSS.GOV>, rick@seismo.CSS.GOV (Rick Adams) writes:
> Well, to start with I've got a Vax 11/780 with 7 6250 bpi 125 ips
> tape drives on it. It performs adequately when they are all running.
> I STILL haven't found anything to replace it with for a reasonable amount
> of money. Nothing in the Sun price range can handle that I/O volume.

I've seen a PDP-11/70 with eight tape drives, too.

And as Barry Shein said, "An IBM mainframe is an awesome thing...".
One weekend, noticing the 4341 spinning a pair of GCR drives at over
half their rated 275 ips, I was shocked to learn that it was reading
the disk file-by-file, not track at a time.  BSD filesystems just
can't compare to what this 2-MIPS machine could do with apparent ease.

How do they get that kind of throughput?  I refuse to believe that it's
all hardware.  Mainframe disks rotate at 3600 RPM like everybody else's
and their 3 MB/s transfer rate is only slightly higher than a SuperEagle.
A 2-MIPS CPU would be inadequate to run a BSD filesystem at those speeds,
so obviously their software overhead is a lot lower, while at the same
time wasting no disk time.  What is VM doing efficiently that Unix does
inefficiently?

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

bzs@bu-cs.BU.EDU (Barry Shein) (06/13/88)

>How do they get that kind of throughput?  I refuse to believe that it's
>all hardware.  Mainframe disks rotate at 3600 RPM like everybody else's
>and their 3 MB/s transfer rate is only slightly higher than a SuperEagle.
>A 2-MIPS CPU would be inadequate to run a BSD filesystem at those speeds,
>so obviously their software overhead is a lot lower, while at the same
>time wasting no disk time.  What is VM doing efficiently that Unix does
>inefficiently?
>
>Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

I think a lot of it *is* hardware. I know the big mainframes better
than the small ones. I/O devices are attached indirectly thru channel
controllers. Channels have their own paths to/from memory (that's
critical, multiple DMAs simultaneously.) Also, channels are
intelligent; I remember people saying the channels for the 370/168 had
roughly the same computing power as the 370/158 (ie. one model down,
sort of like saying that Sun3/280's use Sun3/180's as disk
controllers; actually the compute power is very similar in that
comparison.)

Channels execute channel commands directly out of memory, sort of
linked list structs in C lingo, with commands, offsets etc embedded in
them (this has become more common in the mini market also, the UDA is
similar tho I don't know if it's quite as general.) Channels can also
do things like search disks for particular keys, hi/lo/equal, without
involving the central processor. I don't know how much this is used in
the various filesystems, obviously a general data base thing.

The channels themselves aren't all that fast, around 3MB/sec, but 16
of them pumping simultaneously to/from different blocks of memory can
certainly make it feel fast.

I heard IBM recently announced a new addition to the 3381 disk series
(these are multi-GB disks) with 256MB (1/4 GB) of cache in the disk.
Rich or poor it's better to be rich.

The file systems tend to be much simpler (they avoid indirection at
the lower levels), at least in OS, which I'm sure contributes to the
performance.  I/O is very asynchronous from a software perspective, so
starting multiple I/Os and then sitting back waiting for completions
is a natural way to program.  Note that RMS in VMS tries to mimic this
kind of architecture, but no one ever accused a Vax of having fast I/O.

A lot of what we would consider application code is in the OS I/O
code, known as "access methods", so reading various file formats
(zillions, actually: VSAM, ISAM, BDAM, BSAM...) and I/O disciplines
(VTAM etc) can be optimized at the "kernel" level (there's also
microcode assist on various machines for various operations).  It also
tends to push applications programmers towards "being kind" to the OS;
things like pre-allocation of resources are pretty much enforced, so a
lot of the dynamic resource management is just not done during
execution.

There is little doubt that to get a lot of this speedup on Unix
systems you'd have to give up niceties like tree'd directories,
extending files whenever you feel like, dynamic file opening during
run-time (OS tends to do deadlock avoidance rather than detection or
recovery so it needs to know what files you plan to use before your
jobs starts, that explains a *lot* of what JCL is all about,
pre-allocation of resources), etc. You probably wouldn't like it, it
would look just like MVS :-)

You'd also have to give up what we call "terminals" in most cases; IBM
terminals (327x's) on big systems are much more like disks:
half-duplex, you fill in a screen locally and then blast entire screens
to/from memory in one block I/O operation, with no per-char I/O.  Emacs
would die.  It helps, especially when you have a lot of terminals.  I
read about an IBM transaction system with 15,000 terminals logged in;
I said a lot of terminals.

But don't underestimate raw, frothing, manic hardware.

It's a big trade-off, large IBM mainframes are to I/O what Crays are
to floating point, but you really have to have the problem to want the
cure, for most folks it's unnecessary, MasterCard etc excepted.

	-Barry Shein, Boston University

dmr@alice.UUCP (06/14/88)

After describing a lot of the grot you have to go through to get
3MB/s out of an MVS system, Barry Shein wrote,

> But don't underestimate raw, frothing, manic hardware.
> It's a big trade-off, large IBM mainframes are to I/O what Crays are
> to floating point...

Crays are better at I/O, too.  For example,
I made a 9947252-byte file by catting 4 copies of the dictionary
and read it:

3K$ time dd bs=172032 </tmp/big >/dev/null
57+1 blocks in
57+1 blocks out
	seconds
elapsed	1.251356
user	0.000639
sys	0.300725

which is a cool 8MB/s read from an ordinary Unix file in competition
with other processes on the machine.  (OK, I gave it a big buffer.)
The big guys would complain that they didn't get the full 10 or 12
MB/s that the disks give.  They would really be annoyed that I could
get only 50 MB/s when I read the file from the SSD, which runs at
1000MB/s, but to get it to go at full speed you need to resort to
non-standard Unix things.

The disk format on Unicos (Cray's version of SVr2) is an extent-based
scheme supporting the full Unix semantics except that they don't handle
files with holes (that is, the holes get filled in).  In an early
version, a naive allocation algorithm sometimes created files
ungrowable past a certain point, but I think they've worked on the
problem since then. 

				Dennis Ritchie

bzs@bu-cs.BU.EDU (Barry Shein) (06/14/88)

Dennis Ritchie points out that his Cray observes disk I/O speeds that
compare favorably to those claimed for large IBM mainframes; thus,
contrary to my claim, Crays may indeed be the "Crays" of I/O.

I think the proper question is sort/merging a disk farm and doing 1000
transactions/sec or more while keeping 8 or 12 tapes turning at or
near their rated 200 ips, not pushing bits thru a single channel (if
we're talking Crays then we're talking 3090's.)

If the Cray can keep pumping the I/O under those conditions (typical
job stream for a JC Penney's or Mastercard) then we all better short
IBM. Software or price would be no object if the Cray could do it
better (and more reliably, I guess that *is* an issue, but let's skip
that for now.)

Then again, who knows? Old beliefs die hard; far be it from me to
defend the Itsy Bitsy Machine company.

Mayhaps the Amdahl crew can provide some appropriate viciousness at
this point :-) Oh, please do!

	-Barry Shein, Boston University

terryl@tekcrl.TEK.COM (06/14/88)

In article <6926@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
>And as Barry Shein said, "An IBM mainframe is an awesome thing...".
>One weekend, noticing the 4341 spinning a pair of GCR drives at over
>half their rated 275 ips, I was shocked to learn that it was reading
>the disk file-by-file, not track at a time.  BSD filesystems just
>can't compare to what this 2-MIPS machine could do with apparent ease.
>
>How do they get that kind of throughput?  I refuse to believe that it's
>all hardware.  Mainframe disks rotate at 3600 RPM like everybody else's
>and their 3 MB/s transfer rate is only slightly higher than a SuperEagle.
>A 2-MIPS CPU would be inadequate to run a BSD filesystem at those speeds,
>so obviously their software overhead is a lot lower, while at the same
>time wasting no disk time.  What is VM doing efficiently that Unix does
>inefficiently?


     Well, it might be partially due to hardware. Remember the dedicated
I/O channels the 360-370 systems have??? Do the 4341's have anything
similar???? Similar to CDC Cyber's peripheral(sp?) processors. Tape
drives on a Cyber are capable of blindingly fast things, but then, I've
seen a tape drive on a Cyber that could read a tape faster than ANY tape
drive could rewind under UNIX (caveat: I'm talking mainly DEC tape drives
here).

     Also, reading file by file; to quote a good joke from many moons
ago: "That man must be a lawyer. The information he gave is 100% accurate,
but totally useless." We need a little (actually quite a lot) more
information before we can say anything. What's the layout of the file on the
disk??? What type of file is it??? Is it extent-based, or something
different? If it's extent-based, what are the sizes of the extents???
Is there really a file system on the disk in question, or is it just that
one file???? etc.....


Boy
Do
I
Hate
Inews
!!!!

mangler@cit-vax.Caltech.Edu (Don Speck) (06/16/88)

In article <23326@bu-cs.BU.EDU>, bzs@bu-cs.BU.EDU (Barry Shein) writes:
> I think the proper question is sort/merging a disk farm and doing 1000
> transactions/sec or more while keeping 8 or 12 tapes turning at or
> near their rated 200 ips, not pushing bits thru a single channel

The hard part of this is getting enough disk throughput to feed even
one of those 200-ips tape drives.  The rest is replication.

Channels sound like essentially moving the disk driver into an I/O
processor, with lists of channel control blocks being analogous to
lists of struct buf's.	This makes it feasible to do more optimizations,
even real-time stuff like scatter-gather, chaining, and rotational
scheduling.

Barry mentions the UDA-50 as being similar.  But its processor is an
8085, and DMA speed is only 0.8 MB/s, making it much slower than a dumb
controller.  And the driver ends up spending as much time constructing
the channel control blocks as it would spend tending a dumb controller
like the Emulex SC7003.  The Xylogics 450, Xylogics 472, and DEC TS11
are like this too.  I find them all disappointingly slow.

I suspect the real reason for channel processors is to reduce interrupts,
which are so costly on big CPU's.  It makes sense for terminals; people
have made I/O processors that talk to Unix in clists (KMC-11's, etc)
which cuts the total interrupt rate by a large fraction.  But I don't
think it's necessary, or necessarily desirable, to inflict this on disks
& tapes, and certainly not unless the channel processor can talk in
struct buf's.

For all the optimizations that these I/O processors are supposed to do,
Unix rarely gives them the chance.  Unless there's more than two requests
outstanding at once, once they finish one, there's only one request to
choose from.  Unix has minimal readahead, so that's as many requests as
a single process can generate.	Raw I/O is even worse.

Asynchronous reads would be the obvious way to get enough requests in
the queue to optimize, but that seems unlikely to happen.  Rather,
explicit read commands are giving way to memory-mapped files (in Mach
and SunOS 4.0) where readahead becomes synonymous with prepaging.  It
remains to be seen whether much attention is put into this.
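
For concreteness, sequential reading through a mapped file looks roughly
like this (a sketch from memory of the SunOS 4.0-style mmap interface;
don't take the exact headers, types, or flags as gospel):

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	int fd;
	long i, sum = 0;
	char *p;
	struct stat st;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	fstat(fd, &st);
	p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	for (i = 0; i < st.st_size; i++)
		sum += p[i];	/* touching a new page faults it in; any
				   "readahead" is up to the pager */
	printf("%ld\n", sum);
	return 0;
}

Note that whether the kernel prefetches the pages ahead of the one being
touched (the moral equivalent of readahead) is entirely the pager's
business, which is exactly the point above.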

Barry credits the asynchronous nature of I/O on mainframe OS's to the
access methods, like RMS on VMS.  People avoid those when they want
speed (imagine using dbm to do sequential reads).  For instance, the
VMS "copy" command bypasses RMS when copying disk-to-disk, with the
curious result that it's faster to copy to a disk than to the null
device, because the null device is record-oriented, requiring RMS.

As DMR demonstrates, parallel-transfer disks are great for big files.
They're horrendously expensive though, and it's hard enough to find
controllers that keep up with even 3 MB/s, much less 10 MB/s.  But
they can be simulated with ordinary disks by striping across multiple
controllers, *if* the disks rotate as one.  Does anyone know of a cost-
effective disk that can phase-lock its spindle motor to that of a second
disk, or perhaps with the AC line?  With direct-drive electronically-
controlled motors becoming common, this should be possible.  The Eagle
has such a motor, but no provision for external sync.  I recall stories
of Cray's using phase-locked disks to advantage.

Of course, to get the most from high transfer rates, you need large
blocksizes; DMR's example looked like about one revolution.  Hence
the extent-based file allocation of mainframe OS's, etc.  Perhaps
it's time to pester Berkeley to double MAXBSIZE to 16384 bytes?
It would use 0.3% of memory for additional kernel page tables on a
VAX, but proportionately less on machines with larger page sizes.
8192 is practically the *minimum* blocksize on Suns, these days.

The one point that nobody mentioned is that you don't want the CPU
copying the data around between kernel and user address spaces when
there's a lot!	(Maybe it was just too obvious).

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

rbj@cmr.icst.nbs.gov (Root Boy Jim) (06/16/88)

? From: Barry Shein <bzs@bu-cs.bu.edu>

? Mayhaps the Amdahl crew can provide some appropriate viciousness at
? this point :-) Oh, please do!

I suspect that Amdahl only says nice things about HAL.
Don't bite the hand that feeds.

? 	-Barry Shein, Boston University

	(Root Boy) Jim Cottrell	<rbj@icst-cmr.arpa>
	National Bureau of Standards
	Flamer's Hotline: (301) 975-5688
	The opinions expressed are solely my own
	and do not reflect NBS policy or agreement
	Careful with that VAX Eugene!

friedl@vsi.UUCP (Stephen J. Friedl) (06/19/88)

In article <6963@cit-vax.Caltech.Edu>, mangler@cit-vax.Caltech.Edu (Don Speck) writes:
>
> But they [parallel-head disks, I think] can be simulated with
> ordinary disks by striping across multiple controllers, *if* the
> disks rotate as one.  Does anyone know of a cost-effective disk
> that can phase-lock its spindle motor to that of a second disk,
> or perhaps with the AC line?

I think Micropolis has done this with a handful of their 700MB drives.
They put nine drives into a rack, use a special synchronizing controller,
and give you striped drives.  The ninth drive was for parity, and if
a drive failed, you could pull out the bad one and drop in the new
one: the controller would reconstruct the parity drive.  Sorry, I don't
have a reference...
-- 
Steve Friedl    V-Systems, Inc. (714) 545-6442      3B2-kind-of-guy
friedl@vsi.com     {backbones}!vsi.com!friedl    attmail!vsi!friedl

Nancy Reagan on the Mac-II architecture: "Just say Nu"