[comp.unix.ultrix] 58n0

mf@ircam.ircam.fr (Michel Fingerhut) (09/30/90)

Since "DEC" is listening to this group, and sometimes even responding,
what about a response about this very serious problem (58n0 under
Ultrix 4.0)?  I.e.:

1.  What are the problems as known to DEC today (so that we're less
    pissed off when we encounter them).  That would be *some* help.
    I am rather upset that the local support tells me, when I call them
    with this problem "oh yeah we knew this from the start, why don't
    you just turn all but one CPU off until further notice...".  If
    they knew it from the start, don't send us 4.0 or warn us.

2.  What they do or intend to do in order to solve them (other than
    suggesting we buy another machine and/or go to another vendor, as
    has already been suggested here).

Michael Fingerhut

alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) (10/01/90)

In article <1990Sep30.123350.14441@ircam.ircam.fr>, mf@ircam.ircam.fr (Michel Fingerhut) writes:
> Since "DEC" is listening to this group, and sometimes even responding,
> what about a response about this very serious problem (58n0 under
> Ultrix 4.0)?  I.e.:

	"DEC" is a very big company and many of us that are
	listening don't have access to the wide variety of
	hardware needed to test customer problems.  Many of
	those that respond do so because we happen to know
	the answer.  Please don't confuse us with the people
	in the development group whose job it is to test and
	fix these sorts of problems.  Occasionally someone in
	Engineering does respond, but usually they're working
	on the bug fixes and new features of the next version.

	If you have a problem the appropriate way to report it is
	to submit a Software Performance Report and/or go through
	the Customer Support Center nearest you.
> 
> 1.  What are the problems as known to DEC today (so that we're less
>     pissed off when we encounter them).  That would be *some* help.
>     I am rather upset that the local support tells me, when I call them
>     with this problem "oh yeah we knew this from the start, why don't
>     you just turn all but one CPU off until further notice...".  If
>     they knew it from the start, don't send us 4.0 or warn us.
> 
	Most problems that we know about go into the release notes,
	but sometimes the problems aren't found until after the
	release notes have been printed.  It would be nice if there
	were a nice easy way to report verified problems back to
	you.

> 2.  What they do or intend to do in order to solve them (other than
>     suggesting we buy another machine and/or go to another vendor, as
>     has already been suggested here).

	Hopefully fix the problem once we know what's wrong.  Of
	course until the people that own the problem know about
	it they can't do anything.  If they don't happen to be
	reading this newsgroup then it might be a while before
	they find out about it.  I won't report a problem to them
	until >>>I<<< can verify it.  Since I don't have a 58xx
	to test with there isn't much I can do.

	One thing that would help is a better description of "slow".  
	What is the program doing?  Lots of system calls, disk I/O,
	network I/O, lots of memory use, paging?  I suggest looking
	at cpustat(1), iostat(1), netstat(1) and vmstat(1).  One of
	these days I'll see if I can put a source archive of monitor
	for V4 on gatekeeper.dec.com.
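
One way to put a number on "slow" is to compare the wall-clock time of a small command with the CPU time it actually consumed. The sketch below is only an illustration and is not something posted in this thread: the function names `time_command` and `report` are hypothetical, and it assumes a Unix-like host with Python 3. A large gap between wall time and CPU time suggests the command spent its time waiting (on the scheduler, disk, or network) rather than computing.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv once and return (wall_seconds, child_cpu_seconds)."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    wall = time.monotonic() - start
    after = os.times()
    # CPU time (user + system) charged to children we have waited for.
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return wall, cpu


def report(argv, runs=5):
    """Time argv several times and return one summary line per run."""
    lines = []
    for _ in range(runs):
        wall, cpu = time_command(argv)
        waiting = max(wall - cpu, 0.0)
        lines.append(f"wall={wall:.3f}s cpu={cpu:.3f}s waiting={waiting:.3f}s")
    return lines
```

Calling `report(["ls", "/tmp"])` on an idle machine and again under load would give figures that can be quoted in a problem report.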
> 
> Michael Fingerhut


-- 
Alan Rollow				alan@nabeth.enet.dec.com

mf@ircam.ircam.fr (Michel Fingerhut) (10/01/90)

To Alan Rollow (alan@shodha.enet.dec.com): you miss the point.  One
would assume that DEC would have checked Ultrix 4.0 on 58n0 (n>1)
*before* shipping it out, and would have realized that such commands as
"ls" take several *seconds* for small directories.  This is IMMEDIATELY
noticeable.  One would also have assumed that if a problem had been
found then, it would have appeared either in the release notes, or in
mandatory patches, or in a special page added to the release (as
sometimes happens).

Well, this was not the case.  So *either* the software was not tried on
such configurations (hard to believe, but this would not be the first
time, eh?  Remember the GT62?) or *else* customers were not informed
(which I believe is the case, since the support center was well aware
of the problem when I called them).

As to reporting the problem: we can do it only by phone, are given a
call number and most of the time hear that it will get in the next
release, hopefully.

But to this particular problem, I was also told it was a much more
serious problem, namely design flaws in the 58n0.  That was DEC's
response.  So you bet I'm worried.

aem@aber-cs.UUCP (Alec D.E. Muffett) (10/01/90)

In article <1990Sep30.123350.14441@ircam.ircam.fr> mf@ircam.ircam.fr (Michel Fingerhut) writes:

>1.  What are the problems as known to DEC today (so that we're less
>    pissed off when we encounter them).

Here in Aberystwyth we are running 2x DEC 5830's with Rev 179 Ultrix 4.0.
We have observed that the Symmetric Multi-Processing behaves badly under
a low machine load and are therefore permanently running 2 low-priority
cpu-burning jobs which sleep for bursts of 15 seconds if the load
average goes >4.0.

When these jobs are running, the performance is greatly improved, we
believe this is because the presence of the two jobs (1 per spare CPU)
solves some sort of ordering problem in the scheduler.  DEC definitely
DO know about this, it has appeared on a list of SPR's sent to us.  No
solution is yet forthcoming, but we live in hope...

It's not the perfect solution, because the two jobs tend to eat away at
the cpu, and if some user puts a heavily i/o bound job up as well, the
machine starts to groan.  Then we just kill them fast and put them back
later... 

So, DEC have given us the ultimate reciprocal machine...  the more load
you put on it, the faster it goes...  8)

alec (and robert :-) )
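
The workaround described in the post above (low-priority CPU-burning jobs that sleep for 15 seconds when the load average exceeds 4.0) can be sketched roughly as follows. This is not the Aberystwyth code, which was not posted. The names are hypothetical, the use of the 1-minute load average and the 1-second spin interval are assumptions, and it presumes a Unix-like host with Python 3.

```python
import os
import time

LOAD_LIMIT = 4.0     # threshold from the post
SLEEP_SECONDS = 15   # back-off period from the post
BURST_SECONDS = 1.0  # assumption: how long to spin between load checks


def should_back_off(load, limit=LOAD_LIMIT):
    """True when the load average is above the limit."""
    return load > limit


def burn(seconds):
    """Busy-loop for roughly `seconds` of wall-clock time; return iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def run_forever():
    """Spin at low priority, sleeping whenever the machine is busy."""
    os.nice(19)  # raise niceness so real work is scheduled in preference
    while True:
        one_minute_load = os.getloadavg()[0]  # assumption: 1-minute average
        if should_back_off(one_minute_load):
            time.sleep(SLEEP_SECONDS)
        else:
            burn(BURST_SECONDS)
```

Following the post, one such process would be started per spare CPU by calling `run_forever()`, and killed by hand when an I/O-heavy job needs the machine.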

rosenblg@cmcl2.NYU.EDU (Gary J. Rosenblum) (10/02/90)

What also gets me is that we have a 5820 running Ultrix 4.0 Rev 179, and
have never received one SPR from DEC about the problem!  It got so bad two
weeks ago that we went to the top at DEC to get things straightened out!  
I've put in a request to DSIN, and when I get an answer, I'll post it here.

					Gary

Gary J. Rosenblum	
UNIX Systems Manager			rosenblg@nyu.edu
New York University			gary@nyu.edu

jmg@cernvax.UUCP (mike gerard) (10/02/90)

In article <1990Oct1.080535.17017@ircam.ircam.fr> mf@ircam.ircam.fr (Michel Fingerhut) writes:
>As to reporting the problem: we can do it only by phone, are given a
>call number and most of the time hear that it will get in the next
>release, hopefully.

We suffer from the same restrictions, except for the fact that normally
we don't even hear such pleasant news as "fixed in next release".
In addition, "official" DEC channels refuse to comment on problems
mentioned in places like this: they ask you to submit a bug report if
you have a problem.

It is ridiculous that there is (apparently) no data base of known
problems and, where possible, available patches. I know that there are
various patches available, some of which seem to be considered "mandatory".
However, access seems only to be for those people having wasted their
time identifying a known problem on their own systems.
-- 
 _ _  o |            __                    |    jmg@cernvax.uucp
| | |   |     _     /  \  _   __  _   __  _|    jmg@cernvax.bitnet
| | | | |_)  /_)    |  __/_) | (___\ | (_/ |  J. M. Gerard, Div. DD, CERN,
| | |_|_| \_/\___   \__/ \___|   (_|_|   \_|_ 1211 Geneva 23, Switzerland

rosenblg@cmcl2.NYU.EDU (Gary J. Rosenblum) (10/03/90)

Of course, with a problem of this magnitude, DEC won't say if there is
a problem with SMP (I can see their stock plunging, heads rolling, etc)
since it is a problem of HUGE magnitude.  (I'm not saying I agree, BTW).

I got a call from DEC today, the person said that he will forward this to
the local office.  However, he said there are two things to look for:
Is the configuration on an HSC, and if so, how many disks/requestors are
configured?  The second was a suggestion - changing bufcache in your config
to 25 instead of the 10%.  If you make the change, let me know how it goes.

Gary J. Rosenblum	
UNIX Systems Manager			rosenblg@nyu.edu
New York University			gary@nyu.edu

alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) (10/04/90)

In article <1990Oct1.080535.17017@ircam.ircam.fr>, mf@ircam.ircam.fr (Michel Fingerhut) writes:
> To Alan Rollow (alan@shodha.enet.dec.com): you miss the point.  One
> would assume that DEC would have checked Ultrix 4.0 on 58n0 (n>1)
> *before* shipping it out, and would have realized that such commands as
> "ls" take several *seconds* for small directories.  This is IMMEDIATELY
> noticeable.  One would also have assumed that if a problem had been
> found then, it would have appeared either in the release notes, or in
> mandatory patches, or in a special page added to the release (as
> sometimes happens).
> 

	Actually I do get the point.  You mention two possible
	variations of the problem:

	1.  We didn't test the configuration.

	2.  We knew about the problem and shipped V4.0 anyway.

	I propose a 3rd.  The problem doesn't occur on all systems
	and didn't occur on the systems we tested.  Now I don't know
	exactly how our engineering group does their testing, but I
	KNOW that the DECsystem 5810, 5820, 5830 and 5840 were all
	tested.  Furthermore, I've heard from people I trust that 5840's 
	used internally aren't having the problem.

	So, please provide us with as much information as possible to
	help us solve the problem.  The official reporting channel
	for these things is an SPR.  As you're aware you can submit
	them through the CSC or mail one of the stupid zillion carbon
	things to the address listed on it (*).

	The sorts of things we need to know.  

	    o   Characterization of your work load.  

		All interactive users?  Doing what?  NFS server?
		How many clients?  Diskless workstation server?
		How many clients?  What sort of workload on the
		clients?  Local or remote paging?

	    o   Configuration.  How much memory?  Which Ethernet
		controller?  Disk controllers?  Disks?  What version
		of ULTRIX installed?  Is the Mandatory patch installed
		and has the kernel been rebuilt (Rev. 179)?  If the
		disks are connected via an HSC what version of HSC
		code?

	    o   Load information.  Collect what you can from iostat,
		vmstat, netstat and cpustat.  Or, if you can, get the
		sources for Monitor V1.3, now available on
		gatekeeper.dec.com in pub/DEC/monitor_v4src.tar.Z.
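
A minimal sketch of gathering the load information asked for in the list above into one report. The helper names `snapshot` and `format_report` are hypothetical. Each tool is run with no arguments, because the right options differ between systems, and cpustat is ULTRIX-specific and may simply be missing elsewhere. It assumes Python 3.7 or later is available on the host.

```python
import shutil
import subprocess

# Tools named in the post; any that are not installed are reported as missing.
TOOLS = ["cpustat", "iostat", "netstat", "vmstat"]


def snapshot(tools=TOOLS, timeout=30):
    """Run each tool with no arguments; map name -> output, or None."""
    results = {}
    for name in tools:
        path = shutil.which(name)
        if path is None:
            results[name] = None
            continue
        try:
            proc = subprocess.run([path], capture_output=True, text=True,
                                  errors="replace", timeout=timeout)
            results[name] = proc.stdout
        except (OSError, subprocess.TimeoutExpired):
            # Treat a tool that cannot run or does not finish as unavailable.
            results[name] = None
    return results


def format_report(results):
    """Join the collected outputs under one heading per tool."""
    parts = []
    for name, output in results.items():
        body = output if output is not None else "(not available on this host)\n"
        parts.append(f"==== {name} ====\n{body}")
    return "\n".join(parts)
```

Printing `format_report(snapshot())` once while the machine is idle and once while it is slow gives two snapshots to attach to an SPR.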

	(*) My personal opinion is that we should allow submitting SPRs 
	via e-mail, but I'm only a system manager in the back waters 
	of Colorado Springs.  Who's going to listen to me?

-- 
Alan Rollow				alan@nabeth.enet.dec.com

mf@ircam.ircam.fr (Michel Fingerhut) (10/08/90)

alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) writes:
>	The problem doesn't occur on all systems and didn't occur on
>	the systems we tested.  Now I don't know exactly how our engineering
>	group does their testing...

... so please don't say it did not occur.  DEC acknowledges the problem
occurs, and that one of the problems is a design flaw in the scheduler which
causes lousy response time for small interactive jobs or commands (such
as ls) but which makes the 5820 a great machine for batch.  Too bad.  Should
have gotten an IBM.

>	So, please provide us with as much information as possible to
>	help us solve the problem.

Response to all the information I gave was "next release".

>	The official reporting channel for these things is an SPR.

No, at least not this side of the ocean.

>	(*) My personal opinion is that we should allow submitting SPRs 
>	via e-mail

Yeah, mine too, with automatic acknowledgment, and the possibility to consult
an online database of bug reports.

>	but I'm only a system manager 

What about some responses from people at DEC who KNOW what's happening?

Michael Fingerhut