[comp.unix.microport] lots of panics in uPort V -> Problems @ 10MHz

fortin@zap.UUCP (Denis Fortin) (03/14/88)

In article <829@ddsw1.UUCP> karl@ddsw1.UUCP (Karl Denninger) writes:
>In article <115@hawkmoon.UUCP> det@hawkmoon.UUCP (Derek E. Terveer) writes:
>>[running SysV/386]  The problem is that
>>i seem to be having lots of (relatively) unexplainable panics and i was
>>wondering if anyone else with the 386 version was also having numbers of these
>>panics, like "kernel mode traps" (type e), "user mode traps" (type 2 and 8),
>>and "iupdat - iaddress >2^24" panics.  Plus i keep getting a number of "NMI in
>>system mode" messages.  
>
>The key one is the NMI message.
>
>This can only be generated one way -- if your memory board(s) generate a
>parity fault.

Hmmm.  I have been having similar problems, so I guess this is a good time
to post about them...

I'm running Microport System V/AT 2.3 on a 6/10MHz AT-class machine.

I can run the system with no problems at 6MHz, and I can run it at 10MHz
without any problems under both DOS and IBM XENIX 1.0 (which I don't use
anymore since I have uPort).  Unfortunately, whenever I attempt to run
SV/AT at 10MHz, it crashes after a few minutes (sometimes even before that).

In a few cases, I've noticed the message "NMI in system mode" at boot time,
*but* my 2MB of RAM are all 100ns chips, which *should* be more than 
sufficient for 10MHz operation!!!
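
A rough timing check of that expectation, as a sketch only.  It assumes a
zero-wait-state 80286 bus cycle takes two processor clocks; the constant and
function names are made up for illustration.  Access time is only one of
several DRAM timing limits, and buffers and address decoding eat part of the
budget, so a positive margin here does not prove the memory is fast enough.

```python
# Clock period and minimum bus cycle for the two speeds mentioned in the post.
# Assumption: a zero-wait-state 80286 bus cycle lasts two processor clocks.
CLOCKS_PER_BUS_CYCLE = 2
DRAM_ACCESS_NS = 100  # the "100ns chips" from the post


def clock_period_ns(mhz):
    """Length of one processor clock in nanoseconds."""
    return 1000.0 / mhz


def bus_cycle_ns(mhz, wait_states=0):
    """Length of one memory bus cycle; each wait state adds one clock."""
    return (CLOCKS_PER_BUS_CYCLE + wait_states) * clock_period_ns(mhz)


for mhz in (6, 10):
    cycle = bus_cycle_ns(mhz)
    print(f"{mhz} MHz: clock {clock_period_ns(mhz):.1f} ns, "
          f"bus cycle {cycle:.1f} ns, "
          f"margin over chip access time {cycle - DRAM_ACCESS_NS:.1f} ns")
```

Under these assumptions the margin shrinks considerably between 6MHz and
10MHz but stays positive, which fits the observation that DOS and XENIX run
at 10MHz on the same memory.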

(Question: my video board is an original IBM EGA board.  Could it be the
culprit?  Can it handle 10MHz?)

In general, the system will boot without any problems, and after a few
minutes, the response time slows down a lot and ultimately I get the
following message on the console:

	user=0xC7E
	cs=0x208 ds=0x220 es=0x220 ss=0x213 di=0x400 si=0x5BE0
	bp=0x2C0 bx=0x7 dx=0xA1 cx=0x0 ax=0x7 ip=0x5807 flags=0x202
	trap type 0xD
	err=0x210
	stack frame address = E830270
	Double panic: Software detects double fault

I have also seen "user=0x10 ... err=0x8173".
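
For what it is worth, the "trap type" and "err" fields can be read against
Intel's 80286/80386 exception numbering.  The sketch below assumes the kernel
prints the raw vector number and the raw selector-style error code, which
nothing in the documentation confirms; TRAP_NAMES and decode_selector_error
are illustrative names, not anything from Microport.

```python
# A few protected-mode exception vectors, using Intel's numbering.
TRAP_NAMES = {
    0x2: "NMI",
    0x8: "double fault",
    0xA: "invalid TSS",
    0xB: "segment not present",
    0xC: "stack fault",
    0xD: "general protection fault",
    0xE: "page fault (80386 only)",
}


def decode_selector_error(err):
    """Split a selector-style error code into its fields.

    Assumed layout: bit 0 = EXT, bit 1 = IDT, bit 2 = TI, bits 3-15 = index.
    """
    return {
        "external": bool(err & 0x1),  # fault caused by an external event
        "idt": bool(err & 0x2),       # index refers to the IDT
        "ldt": bool(err & 0x4),       # LDT if set, GDT if clear (when idt is clear)
        "index": err >> 3,            # descriptor table index
    }


print(TRAP_NAMES.get(0xD, "unknown"))
print(decode_selector_error(0x210))
```

On that reading, trap type 0xD is a general protection fault and err=0x210
names GDT entry 66, the same entry that ss=0x213 selects (0x213 >> 3 is also
66).  What that says about the cause is not established here.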

I know that this is a bit cryptic, but none of my requests for help from 
Microport on this issue (even when my SysV/AT was still under warranty) 
have yielded any result.  (In most cases, I was told that the info was
transferred to someone else ... who never got back to me.)

I currently have an update contract (I still think that uPort is a pretty
good product), but I have not purchased a technical support contract because
from what I have seen during my warranty period, their technical service
won't help me much with this problem (note: this was about 1.5 years ago).

I guess my biggest problem is that I have really no way of knowing what
the register dump really means...  Also, I'm very puzzled by the fact
that IBM Xenix 1.0 will run on my machine at 10MHz (I can understand why
DOS works: it's not as demanding interrupt-wise on the machine, but XENIX
*does* work and that annoys me!)

Anyway, if anybody has an idea about things I could try, I would appreciate
it *** A LOT ***.  Running my AT at 6MHz is definitely not the same as
running it at 10MHz (SIGH!).

(Also, could anybody post a description of what those "user=n" messages
 mean?  It's not anywhere in the documentation.  (I understand that you
 get vanilla-flavored SysV documentation, but I still feel that it's
 quite annoying to run software that generates hexadecimal error messages
 with no explanations!))
-- 
Denis Fortin                            | fortin@zap.UUCP
CAE Electronics Ltd                     | philabs!micomvax!zap!fortin
The opinions expressed above are my own | fortin%zap.uucp@uunet.uu.net

det@hawkmoon.MN.ORG (Derek E. Terveer) (03/17/88)

In article <421@zap.UUCP>, fortin@zap.UUCP (Denis Fortin) writes:
> In article <829@ddsw1.UUCP> karl@ddsw1.UUCP (Karl Denninger) writes:
> >In article <115@hawkmoon.UUCP> det@hawkmoon.UUCP (Derek E. Terveer) writes:
> >>[running SysV/386]  The problem is that
> >>i seem to be having lots of (relatively) unexplainable panics and i was
> >>wondering if anyone else with the 386 version was also having numbers of these
> >>panics, like "kernel mode traps" (type e), "user mode traps" (type 2 and 8),
> >>and "iupdat - iaddress >2^24" panics.  Plus i keep getting a number of "NMI in
> >>system mode" messages.  
> >
> >The key one is the NMI message.
> >
> >This can only be generated one way -- if your memory board(s) generate a
> >parity fault.

When i <det@hawkmoon> posted the original message, above, my machine was not
only in panic mode --> so was i.  I had just spent a not inconsiderable amount
of money on a '386 machine plus the uport software to run on it.  It installed
and then pretty much from day 2 started crashing from 1 to 5 times a day.
Imagine my horror at witnessing these events!  So of course i was worried and
hoped that someone on the net would be able to help.  Karl <karl@ddsw1> was the
one that tipped me off.  He stated that the "NMI in system mode" messages only
happened in this plane of existence when parity errors occurred.  I thought
about it for a little bit -- there should not have been *ANY* parity errors
from the chips themselves; it was a brand new board and i tested the board and
all the chips some 70+ times.  I was confident that it couldn't be the board.
Therefore,
drawing on my experience with running some other unix machines (vax 11/780s), i
inspected my environment.  Power?  I had my pc plugged into an outlet with one
of those little-itsy-bitsy noise filters that plug into a two outlet wall
receptacle and provide three somewhat filtered outlets.  I also had my stereo
and a lamp plugged into the same outlet - considering the house is 40+ years
old and they didn't use electricity back then (:-) i only have two outlets in
my room!  I decided to test whether or not the stereo and lamp were somehow
dirtying the power to my pc and moved them to another outlet, courtesy of a
long extension cord.  I ran for a couple of days -- no problems.

I have now run for TWO weeks with not a *single* problem!!!!  Obviously, i was
getting some sort of substandard power with the other stuff plugged in.
Especially upon retrospect i realized that my pc only seemed to crash when my
stereo was on.

So an apology is in order here from me to microport.  They may or may not have
problems, but my PC crashing all the time is now not one of them.  I am very
pleased with the stability of the system now (now if only i had more memory it
would be a little faster (:-)).

Moral of the story:  When you start getting errors like the ones i described
(esp. NMI errors), check the environment and isolate your pc if you can.

Now all i have to do is figure out where to put my damn extension cord
connecting my stereo and lamp.....

> [..]
> I can run the system with no problems at 6MHz, and I can run it at 10MHz
> without any problems under both DOS and IBM XENIX 1.0 (which I don't use
> anymore since I have uPort).  Unfortunately, whenever I attempt to run
> SV/AT at 10MHz, it crashes after a few minutes (sometimes even before that).
> [..]
> In a few cases, I've noticed the message "NMI in system mode" at boot time,
> *but* my 2MB of RAM are all 100ns chips, which *should* be more than 
> sufficient for 10MHz operation!!!

I got the same message within about 5 minutes or less of boot time, but i don't
think it has to do with the speed of the chips.  Karl pointed out that these
are parity errors.  You could run your board diagnostics, if you have any, and
see if there are any problems there.  Also, a lot of times the speed of the
BUS is different than the speed of the cpu/motherboard.  For example, i have a
16MHz cpu, but my bus speed, to which i have attached 2MB of memory on an Intel
Above Board, is *only* ~8MHz.  You didn't state your config as far as memory
goes, so if you only need the barest minimum of memory, like 640K, to run
uport, perhaps you could take out the extra memory if it's on a separate board
and see if you can then run uport at 10MHz.  If you can, then there's obviously
a problem with your extra memory.
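
To see why a disturbed bit shows up as a parity error even when the chips
themselves are good, here is a minimal sketch of a parity check over one byte.
Even parity is chosen only for illustration; which polarity the memory boards
actually use is not known here, and the function names are hypothetical.

```python
def parity_bit(byte):
    """Return the extra bit that makes the total count of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Model a memory write: keep the byte together with its parity bit."""
    return (byte & 0xFF, parity_bit(byte))


def read_ok(stored):
    """Model a memory read: recompute parity and compare with the stored bit."""
    byte, parity = stored
    return parity_bit(byte) == parity


cell = store(0x5A)
one_flip = (cell[0] ^ 0x08, cell[1])   # a single bit disturbed after the write
two_flips = (cell[0] ^ 0x18, cell[1])  # two bits disturbed

print(read_ok(cell))       # clean read passes
print(read_ok(one_flip))   # single-bit error is caught
print(read_ok(two_flips))  # an even number of flips slips through
```

The check cannot tell a bad chip from noise on the bus or the power rail; it
only reports that what came back differs from what was written.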

Your suggestion about the graphics board is a valid one, but i don't know enough
about that to comment further.

Finally, check the power.

Hope this helps...!

> (Also, could anybody post a description of what those "user=n" messages
>  mean?

Yes, yes, please!

derek
-- 
Derek Terveer	det@hawkmoon.MN.ORG	uunet!rosevax!elric!hawkmoon!det

karl@ddsw1.UUCP (Karl Denninger) (03/17/88)

In article <421@zap.UUCP> fortin@zap.UUCP (Denis Fortin) writes:
>In article <829@ddsw1.UUCP> karl@ddsw1.UUCP (Karl Denninger) writes:
>>
>>The key one is the NMI message.
>>
>>This can only be generated one way -- if your memory board(s) generate a
>>parity fault.
>
>Hmmm.  I have been having similar problems, so I guess this is a good time
>to post about them...

[Some detail deleted]

>In general, the system will boot without any problems, and after a few
>minutes, the response time slows down a lot and ultimately I get the
>following message on the console:
>
>	user=0xC7E
>	cs=0x208 ds=0x220 es=0x220 ss=0x213 di=0x400 si=0x5BE0
>	bp=0x2C0 bx=0x7 dx=0xA1 cx=0x0 ax=0x7 ip=0x5807 flags=0x202
>	trap type 0xD
>	err=0x210
>	stack frame address = E830270
>	Double panic: Software detects double fault

Aha!  Now we're talking.  A crash dump (well, sorta)!

To find the routine in the kernel which caused the panic, you do this:

	nm -x /system5 >/tmp/xxxxx  (dump list of kernel to file)

Now, go looking for the address you panic'd at.  You put the 'cs' and 'ip'
values together to get this number (code segment & instruction pointer).  
In this case, you get 0x02085807.
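
Putting the two values together can be sketched as below, assuming cs and ip
are each 16 bits wide with cs as the high half, which gives an eight-digit hex
address.  panic_address is a made-up helper name.  The decimal form is printed
as well, since a symbol listing may show its values in decimal.

```python
def panic_address(cs, ip):
    """Combine a 16-bit code segment and 16-bit instruction pointer."""
    return ((cs & 0xFFFF) << 16) | (ip & 0xFFFF)


addr = panic_address(0x208, 0x5807)
print(f"hex: {addr:08X}  decimal: {addr}")
# prints: hex: 02085807  decimal: 34101255
```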

Find the routine (use 'vi' or another editor to search; looking by eye will
take ALL DAY; this is a huge file!) which has the largest address LESS THAN
the panic address.

This is the routine which was executing when the system crashed.

From the numbers above, I'll guess that the routine you'll find will be
'rmsd'.  IF SO - get on the phone and yell loudly -- you have a manifestation 
of the very-common SIO crash which has plagued us poor '286 Microport owners 
for over a year!

If it's NOT 'rmsd' then please post the name of the routine (or mail it to
me), as it's probably a new one... and might give all us net.gurus some ideas!

>I have also seen "user=0x10 ... err=0x8173".
>
>I know that this is a bit cryptic, but none of my requests for help from 
>Microport on this issue (even when my SysV/AT was still under warranty) 
>have yielded any result.  (In most cases, I was told that the info was
>transferred to someone else ... who never got back to me.)

This is interesting -- they didn't even tell you how to get the address of
the routine where you panicked?   Sheesh!  A master list at Uport doesn't help
anyone with this, as the addresses move if you use the link kit.

>I currently have an update contract (I still think that uPort is a pretty
>good product), but I have not purchased a technical support contract because
>from what I have seen during my warranty period, their technical service
>won't help me much with this problem (note: this was about 1.5 years ago).
>
>I guess my biggest problem is that I have really no way of knowing what
>the register dump really means...  Also, I'm very puzzled by the fact
>that IBM Xenix 1.0 will run on my machine at 10MHz (I can understand why
>DOS works: it's not as demanding interrupt-wise on the machine, but XENIX
>*does* work and that annoys me!)

It ought to annoy you... it does us as well.

-----
Karl Denninger		       |  Data: +1 312 566-8912
Macro Computer Solutions, Inc. | Voice: +1 312 566-8910
...ihnp4!ddsw1!karl	       | "Quality solutions for work or play"

root@uwspan.UUCP (John Plocher) (03/17/88)

+---- fortin@zap.UUCP (Denis Fortin) writes in <421@zap.UUCP> ----
| 	user=0xC7E
| 	cs=0x208 ds=0x220 es=0x220 ss=0x213 di=0x400 si=0x5BE0
| 	bp=0x2C0 bx=0x7 dx=0xA1 cx=0x0 ax=0x7 ip=0x5807 flags=0x202
| 	trap type 0xD
| 	err=0x210
| 	stack frame address = E830270
| 	Double panic: Software detects double fault
| 
| I have also seen "user=0x10 ... err=0x8173".
| 
| I guess my biggest problem is that I have really no way of knowing what
| the register dump really means
+----

The user= and the err= don't really tell you anything; the ones you are
interested in are cs= and ip=.

First you need a copy of the symbol table from the kernel.  You get this by
executing the following command:
	nm /system5 > system5.nm
system5.nm is LARGE - several hundred K - so be sure you don't put it in (your
small) root filesystem.  It looks like this:

Symbols from /system5:
Name                  Value   Class        Type         Size   Line  Section
gdt.s               |        | file |                  |      |     |
sludge              |35656025|static|                  |      |     |.data
tfsbot              |35658780|static|                  |      |     |.data
tfstack             |35659804|static|                  |      |     |.data
wnsbot              |35659892|static|                  |      |     |.data
wnstack             |35660916|static|                  |      |     |.data
conf.c              |        | file |                  |      |     |
prfintr             |33554432|extern|            int( )|     6|     |.text
emul_present        |33554438|extern|            int( )|    11|     |.text
sioscan             |33554449|extern|           long( )|    12|     |.text
lomem.c             |        | file |                  |      |     |
linesw.c            |        | file |                  |      |     |
buffers.c           |        | file |                  |      |     |
...
(but your numbers WILL be different than these :-)

Then you get the CS:IP address from the above register dump:
    | 	cs=0x208
    | 	                                      ip=0x5807
and form them into a full address:
	02085807

At this point you need to find out where the panic happened:
    vi system5.nm	- start the editor
    /020858  		- search for the MSBs of the address
			-- NOTE that the LSBs are not looked for
			  (Most Significant Bits, Least Significant Bits)
			- There may be a few places where this search succeeds,
			  look for the place that is the closest one with a
			  value SMALLER than the address calculated above.
			  eg. in choosing between 02085800 and 02085810, you
			  would choose the first one.  This is because you want
			  to find the name of the routine which was executing
			  when the panic happened, not the name of the one
			  just after it in memory.

At this point you should be able to find a class text address near the one
calculated from CS and IP above.  NOTE that the routine "sioscan" is
located near 33554440, as is "emul_present" and "prfintr".  We can also
tell from this symbol table that these routines can be found in the file
conf.c.
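
The lookup described above can be sketched as follows.  The sample is part of
the excerpt shown earlier in this post, whose values are in decimal, so a
CS:IP address has to be converted to decimal (0x02085807 is 34101255) before
comparing.  parse_symbols and routine_at are illustrative names, and the
parsing assumes the pipe-separated layout of that excerpt.

```python
import bisect

SAMPLE = """\
gdt.s               |        | file |                  |      |     |
sludge              |35656025|static|                  |      |     |.data
conf.c              |        | file |                  |      |     |
prfintr             |33554432|extern|            int( )|     6|     |.text
emul_present        |33554438|extern|            int( )|    11|     |.text
sioscan             |33554449|extern|           long( )|    12|     |.text
lomem.c             |        | file |                  |      |     |
"""


def parse_symbols(listing):
    """Return sorted (address, name) pairs for the .text symbols only."""
    symbols = []
    for line in listing.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip headers, file entries, and anything outside the text section.
        if len(fields) != 7 or fields[6] != ".text" or not fields[1].isdigit():
            continue
        symbols.append((int(fields[1]), fields[0]))
    return sorted(symbols)


def routine_at(symbols, address):
    """Name of the symbol with the largest address not above `address`."""
    addresses = [a for a, _ in symbols]
    i = bisect.bisect_right(addresses, address) - 1
    return symbols[i][1] if i >= 0 else None


symbols = parse_symbols(SAMPLE)
print(routine_at(symbols, 33554440))
```

The excerpt is far too short to resolve the real panic address; 33554440 is
used only because it falls inside the excerpt, where it lands in emul_present.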


- Hope this isn't too confusing,
    John

-- 
Comp.Unix.Microport is now unmoderated!  Use at your own risk :-)

david@bdt.UUCP (David Beckemeyer) (03/18/88)

In article <421@zap.UUCP> fortin@zap.UUCP (Denis Fortin) writes:
>I know that this is a bit cryptic, but none of my requests for help from 
>Microport on this issue (even when my SysV/AT was still under warranty) 
>have yielded any result.  (In most cases, I was told that the info was
>transferred to someone else ... who never got back to me.)

Typical of my experiences with uport too.  I've tried calling. I've
tried "official" bug-reports. I've tried everything short of driving
down there and pounding on the front desk until somebody reacts.

It is very frustrating to try to deal with uport.  They seem very
unorganized.  I have the two drive bug, among others.

We will be developing our products for SCO Xenix, methinks.
-- 
David Beckemeyer			| "To understand ranch lingo all yuh
Beckemeyer Development Tools		| have to do is to know in advance what
478 Santa Clara Ave, Oakland, CA 94610	| the other feller means an' then pay
UUCP: ...!ihnp4!hoptoad!bdt!david 	| no attention to what he says"

caf@omen.UUCP (Chuck Forsberg WA7KGX) (03/19/88)

In article <126@hawkmoon.MN.ORG> det@hawkmoon.MN.ORG (Derek E. Terveer) writes:
:                           Power?  I had my pc plugged into an outlet with one
:of those little-itsy-bitsy noise filters that plug into a two outlet wall
:receptacle and provide three somewhat filtered outlets.  I also had my stereo
:and a lamp plugged into the same outlet - considering the house is 40+ years
:old and they didn't use electricity back then (:-) i only have two outlets in
:my room!  I decided to test whether or not the stereo and lamp were somehow
:dirtying the power to my pc and moved them to another outlet, courtesy of a
:long extension cord.  I ran for a couple of days -- no problems.

Unless your voltage is rather low, a good computer supply should be able
to operate properly even when the light(s) visibly dim.  Furthermore,
the power supply should stop processing and reset the computer if it
does run out of stored energy.

So, I should suspect either a marginal power supply or a marginal EMC
(ElectroMagnetic Compatibility) problem that was fixed by rearranging
the grounds et al.

(EMC is the reverse of Radio Frequency Interference (RFI).  EMC relates
to the ability of external noise to "talk" to the computer.)  Poor EMC
is what sells anti static treatments for DP rooms.

Chuck Forsberg WA7KGX          ...!tektronix!reed!omen!caf 
Author of YMODEM, ZMODEM, Professional-YAM, ZCOMM, and DSZ
  Omen Technology Inc    "The High Reliability Software"
17505-V NW Sauvie IS RD   Portland OR 97231   503-621-3406
TeleGodzilla BBS: 621-3746   CIS: 70007,2304    Genie: CAF