[comp.unix.microport] 386/ix: cpio to floppy panics kernel

campbell@redsox.UUCP (Larry Campbell) (10/01/88)

I'm running Interactive's 386/ix on a Dell 310.  On the whole it all works
just fine, but there's one showstopper of a bug:  using cpio to back up
the disk to floppies panics the kernel with a memory parity (NMI) trap!

Now, I don't really suspect the memory since everything else works fine.
Strangely, I can format and mount floppies, and read and write to them all
day, without a panic.  Only writing (not reading) to the raw floppy
device, as opposed to the block device, seems to panic the kernel.
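
For anyone trying to narrow this down without cpio in the loop, here is a
minimal sketch that writes fixed-size blocks straight to a path, so the same
write pattern can be pointed at the raw device and then at the block device.
Everything in it is an assumption rather than something from the report: the
function names are hypothetical, the 5120-byte block mirrors what cpio uses
with -B, and device names such as /dev/rdsk/f0q15dt (raw) and
/dev/dsk/f0q15dt (block) are guesses that should be checked against the
local /dev tree.  The dry run only touches a scratch file.

```python
import os
import tempfile


def write_blocks(path, block_size=5120, count=10, fill=b"\xa5"):
    """Write `count` blocks of `block_size` bytes to `path`.

    Returns the total number of bytes written.  `path` must already
    exist (a device node, or a scratch file for a dry run).
    """
    block = fill * block_size
    fd = os.open(path, os.O_WRONLY)
    total = 0
    try:
        for _ in range(count):
            view = memoryview(block)
            # os.write may write less than requested, so loop until
            # the whole block has gone out.
            while view:
                n = os.write(fd, view)
                view = view[n:]
                total += n
    finally:
        os.close(fd)
    return total


def dry_run(block_size=5120, count=4):
    """Exercise write_blocks against a temporary file, not a device."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        written = write_blocks(path, block_size, count)
        return written, os.path.getsize(path)
    finally:
        os.remove(path)
```

Pointing write_blocks at the raw node and then at the block node (hypothetical
names above) with the same block size would show whether the panic follows the
device type or the cpio access pattern.  On the affected machine, expect the
raw-device run to take the system down, so sync first.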

Interactive say they can't reproduce the problem, but they also haven't got
a Dell 310 to try it on.  Is anyone out there successfully running 386/ix
on a Dell 310?  Has anyone on any kind of machine ever had this problem?
Although the system works on the whole, I'm getting a bit nervous, because
I can't do backups...  I can feel Murphy breathing down my neck...

debra@alice.UUCP (Paul De Bra) (10/02/88)

In article <455@redsox.UUCP> campbell@sushi.UUCP (Larry Campbell) writes:
>I'm running Interactive's 386/ix on a Dell 310.  On the whole it all works
>just fine, but there's one showstopper of a bug:  using cpio to back up
>the disk to floppies panics the kernel with a memory parity (NMI) trap!
>
>Now, I don't really suspect the memory since everything else works fine.
>...
>I can't do backups...  I can feel Murphy breathing down my neck...

You are damn right to feel Murphy breathing down your neck, because this
kind of problem DOES indicate a memory problem with the Dell 310.

The problem (in general) is that not all memory accesses are equally
critical. Accessing memory in some specific order can generate parity
errors or worse if the memory subsystem does not give the memory chips
enough time to respond. All these new cranked-up 286 and 386 boxes are
pushing things BEYOND their limit. You can run DOS for years, or memory
diagnostics for years and never find a problem, yet your unix tries just
the kind of access which fails. It is not surprising that Interactive
cannot reproduce the problem on another machine. I have seen this problem
before, many times I must add, and in general there are only two fixes:
either replace the memory chips with faster ones or lower the clock frequency.
Your system seems not to run safely at its top speed, because the MMU
sometimes does not give the memory chips enough time to respond. (Our
supplier has often been able to solve our problems by replacing the memory.
It REALLY works.)
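
The timing argument above can be put into rough numbers.  This is a
back-of-the-envelope sketch, not a model of any particular board: it assumes
a bus cycle of two clocks plus one clock per wait state (the usual
non-pipelined 386 figure), and the 20 MHz clock, the chip speeds, and the
30 ns of decode/buffer overhead are illustrative values, not measurements of
the Dell 310.  It also ignores precharge, refresh, caches and interleaving.

```python
def bus_cycle_ns(clock_mhz, wait_states=0, clocks_per_cycle=2):
    """Length of one memory bus cycle in nanoseconds.

    Assumes `clocks_per_cycle` clocks for a zero-wait-state cycle and
    one extra clock per wait state.
    """
    return (clocks_per_cycle + wait_states) * 1000.0 / clock_mhz


def has_margin(clock_mhz, wait_states, dram_access_ns, overhead_ns):
    """True if chip access time plus board overhead fits in the cycle."""
    return dram_access_ns + overhead_ns <= bus_cycle_ns(clock_mhz, wait_states)


# Illustrative figures only (see the assumptions above).
baseline = has_margin(20, 0, dram_access_ns=80, overhead_ns=30)      # 110 ns needed, 100 ns available
faster_chips = has_margin(20, 0, dram_access_ns=60, overhead_ns=30)  # 90 ns needed, 100 ns available
slower_clock = has_margin(16, 0, dram_access_ns=80, overhead_ns=30)  # 110 ns needed, 125 ns available
```

With these made-up numbers the baseline has no margin, and either remedy
(faster chips or a lower clock) restores it, which is the whole argument in
miniature.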

I am not blaming Dell specifically here, because many companies make the
same error. Since memory is expensive they just put in slower chips than
the machine really needs. I hope they are listening?????????

Paul.

|--------------------------------------------------------------------------
|Paul De Bra              | I am completely surrounded by giant bugs !    |
|debra@research.att.com   | There's millions of them, all over this code! |
|uunet!research!debra     | Beam me up quickly...Please...                |
|--------------------------------------------------------------------------

james@bigtex.uucp (James Van Artsdalen) (10/06/88)

In article <8254@alice.UUCP>, debra@alice.UUCP (Paul De Bra) wrote:

[ discussion of bizarre memory problems with floppy/hard disk/DMA ... ]

> All these new cranked-up 286 and 386 boxes are pushing things BEYOND
> their limit.

I think the 310 designer would be willing to argue that point.

> You can run DOS for years, or memory diagnostics for years and never
> find a problem, yet your unix tries just the kind of access which
> fails.

Which is precisely why unix/Xenix is part of the standard test suite,
along with OS/2, Windows, and lots of other exotic stuff.

> [...] Your system seems not to run safely at its top speed, because
> the MMU sometimes does not give the memory chips enough time to
> respond.

That's a simple design flaw.  There's no excuse for it.

> (Our supplier has often been able to solve our problems by replacing
> the memory.  It REALLY works.)

Then your supplier has demonstrated that it is a design flaw.

I don't think the question is just related to too-slow RAM.  There may
be subtle design flaws in the various motherboard chipsets that don't
show up unless more than one DMA channel is running.  Perhaps it would
be a good thing to modify the memory test to run more than one DMA
channel while doing the RAM test.  I'll have to consult some engineers
on what the worst cases really are...
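
The shape of such a test, a RAM pattern check running while other I/O is in
flight, might look like the sketch below.  This is only a user-space analogue
under heavy assumptions: a user program cannot program DMA channels or pick
physical addresses, so a background thread doing synced file writes stands in
for "another channel running".  All names here are hypothetical, and a real
diagnostic would have to live in firmware or the kernel.

```python
import os
import tempfile
import threading

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def pattern_test(buf):
    """Fill `buf` with each pattern, read it back, count bad bytes."""
    errors = 0
    n = len(buf)
    for p in PATTERNS:
        buf[:] = bytes([p]) * n
        errors += n - buf.count(bytes([p]))
    return errors


def io_load(path, stop, block=b"\xa5" * 4096):
    """Keep rewriting one block to `path` until `stop` is set."""
    with open(path, "wb") as f:
        while not stop.is_set():
            f.seek(0)
            f.write(block)
            f.flush()
            os.fsync(f.fileno())


def test_under_io(size=1 << 16, passes=4):
    """Run the pattern test while the background I/O load is active."""
    stop = threading.Event()
    fd, path = tempfile.mkstemp()
    os.close(fd)
    worker = threading.Thread(target=io_load, args=(path, stop))
    worker.start()
    try:
        buf = bytearray(size)
        return sum(pattern_test(buf) for _ in range(passes))
    finally:
        stop.set()
        worker.join()
        os.remove(path)
```

On sound hardware test_under_io() should report zero mismatches.  The point
is the structure, load and check running together, rather than any claim
that this catches the fault described in this thread.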

> Since memory is expensive they just put in slower chips than the
> machine really needs. I hope they are listening?????????

If you let the marketing/purchasing/finance people run wild, that
probably would happen.  But Systems Validation would never sign off on
it.  In our case (Dell), there is the additional threat of paying for
on-site service (and paying for the 800 line time).  There are ways to
cut corners without compromising reliability, but you'll hurt raw
performance - and you don't do things to hurt the prime selling point
for your machine.
-- 
James R. Van Artsdalen   ...!uunet!utastro!bigtex!james   "Live Free or Die"
Home: 512-346-2444 Work: 338-8789   10926 Jollyville Rd #901 Austin TX 78759