[unix-pc.general] Which chip

sbw@naucse.UUCP (Steve Wampler) (08/11/89)

My 3b1 has been suffering from an intermittent parity error ever since
I added a 1.5Meg combo card (2Meg on the motherboard).  I finally was
able to record some of the errors in unix.log (normally the system just
crashed).  Could some kind soul tell me *exactly* which chip(s) are
involved?  I wouldn't object to an explanation on how to read the
address (what does *hpte mean?) given in the error message, nor an
explanation on how the chips correspond to memory addresses.  I know
this was posted a while back, but I never thought *my machine* would
need it.

My eternal gratitude.

Herewith, the error messages:

NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug  8 17:25:24 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug  8 17:25:25 1989
NMI (parity error) at 0x81000 (*hpte: 0x4251) Tue Aug  8 18:00:01 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24B) Fri Aug 11 04:25:27 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xC24B) Fri Aug 11 04:25:29 1989
NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
NMI (parity error) at 0x81000 (*hpte: 0x4238) Fri Aug 11 06:04:45 1989

-- 
	Steve Wampler
	{....!arizona!naucse!sbw}

jbm@uncle.UUCP (John B. Milton) (08/16/89)

In article <1648@naucse.UUCP> sbw@naucse.UUCP (Steve Wampler) writes:
[ how do you map the /usr/adm/unix.log NMI parity errors to memory chips? ]

>address (what does *hpte mean?) given in the error message, nor an
Hardware Page Table Entry, the physical memory mapping RAM.

>NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug  8 17:25:24 1989
>NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug  8 17:25:25 1989
>NMI (parity error) at 0x81000 (*hpte: 0x4251) Tue Aug  8 18:00:01 1989
>NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24B) Fri Aug 11 04:25:27 1989
>NMI (parity error) at 0x2FF2E6 (*hpte: 0xC24B) Fri Aug 11 04:25:29 1989
>NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
>NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
>NMI (parity error) at 0x81000 (*hpte: 0x4238) Fri Aug 11 06:04:45 1989

The first number is the virtual address, the number your program sees. The
*hpte is not the address of the mapping RAM entry used, but the contents of the
mapping RAM for the virtual page in question. The UNIXpc page size is 4k (2^12).
The virtual address is useless in tracking down the memory chip. The RAM which
does virtual to physical mapping on the UNIXpc is 16 bits wide. During a virtual
memory cycle, what goes into the mapping RAM is address bits A12 to A21. What
comes out are mapped address bits MA12 to MA21. The MA12 to MA21 together with
A1 to A11 (and UDS/LDS) lines form the actual physical address used to access
memory.
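
To make the split concrete, here is a minimal Python sketch that pulls a
virtual address apart into the 10 bits that index the mapping RAM (A12 to A21)
and the 12 bits that pass through unmapped. The function and constant names
are made up for illustration; nothing here comes from the kernel source.

```python
PAGE_SHIFT = 12        # 4k pages (2^12)
OFFSET_MASK = 0xFFF    # low 12 bits: unmapped byte offset within the page
INDEX_MASK = 0x3FF     # 10 bits: A12 to A21, the index into the mapping RAM


def split_virtual(vaddr):
    """Return (map_index, offset) for a virtual address."""
    return (vaddr >> PAGE_SHIFT) & INDEX_MASK, vaddr & OFFSET_MASK


index, offset = split_virtual(0x2FF2E6)
print(hex(index), hex(offset))
```

Only the offset half survives into the physical address; the index half is
replaced by whatever the mapping RAM holds for that page.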

Ok, that's 10 bits for address mapping. That means that 1024 pages of 4k each
can be mapped by the mapping hardware, for a total of 4M. This is the unusual
part. The UNIXpc has a maximum physical memory limit of 4M, which is exactly
the same as the maximum virtual memory limit. On most machines with virtual
memory, the virtual space is quite large, typically 4G. On these machines,
the maximum physical memory limit is far smaller than the maximum virtual limit.
Ok, 10 bits, that leaves 6 more bits: page status PS0 to PS4, and WE. Only two (PS0
and PS1) of the page status bits are currently used. Hmmm. This is the encoding
of PS0 and PS1:

PS0 PS1 Status of page
 0   0  Page is not present (or not allowed)
 0   1  Page is present, but has not been accessed
 1   0  Page has been read but not written (clean)
 1   1  Page has been written to (dirty)

These page status bits are what make virtual memory work. They are
automatically updated by hardware during each memory access. When an access is
made to a page that is not present, a fault is generated. It is then up to the
UNIX kernel to decide whether the process is really allowed to access that
page or not, and what to do about it. When it is time for a page belonging to
a process to be removed from memory, the page status bits are checked to see
what has happened to the page. If the page has been written to, it is assumed
that the page was changed and is now different from the page in /dev/swap. It
must, therefore, be written to disk. If the page was read but not written, then
the version on disk is the same, so no write is needed.
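
The decision described above can be sketched in a few lines of Python. This is
only an illustration of the PS0/PS1 encoding from the table, not the kernel's
actual pageout code; the names PAGE_STATUS and needs_writeback are invented
for the example.

```python
# (PS0, PS1) -> meaning, copied from the table above
PAGE_STATUS = {
    (0, 0): "not present (or not allowed)",
    (0, 1): "present, not accessed",
    (1, 0): "read, not written (clean)",
    (1, 1): "written (dirty)",
}


def needs_writeback(ps0, ps1):
    """True if the page must go to swap before its memory is reused."""
    return (ps0, ps1) == (1, 1)


for bits, meaning in PAGE_STATUS.items():
    action = "write to swap" if needs_writeback(*bits) else "no write needed"
    print(bits, meaning, "->", action)
```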

This is the mapping of the bits in the mapping RAM when read/written:

Bit Desc    Bit Desc    Bit Desc    Bit Desc
 0  MA12     1  MA13     2  MA14     3  MA15
 4  MA16     5  MA17     6  MA18     7  MA19
 8  MA20     9  MA21    10  PS4     11  PS3
12  PS2     13  PS0     14  PS1     15  WE
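
Taking the bit table at face value, a 16-bit *hpte value can be pulled apart
like this. It is a rough Python sketch; decode_hpte and the dictionary keys
are names chosen for the example, not anything the kernel defines.

```python
def decode_hpte(hpte):
    """Split a 16-bit mapping RAM word into the fields from the table."""
    return {
        "page": hpte & 0x3FF,       # bits 0-9: MA12 to MA21
        "PS4": (hpte >> 10) & 1,
        "PS3": (hpte >> 11) & 1,
        "PS2": (hpte >> 12) & 1,
        "PS0": (hpte >> 13) & 1,
        "PS1": (hpte >> 14) & 1,
        "WE": (hpte >> 15) & 1,
    }


fields = decode_hpte(0xE24D)
print(hex(fields["page"]), fields)
```

The "page" field is the physical page number; shifted left by 12 it becomes
the upper part of the physical address.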


We can use the above table to decode the error messages to get a physical
address. Remember that the low 12 bits of the physical address are not mapped,
so they are the same as the virtual address. In the first case from above,
The virtual address is 0x2FF2E6, and the contents of the mapping RAM for that
virtual address is 0xE24D, so the physical address is

(0x2FF2E6 & 0xFFF) + ((0xE24D & 0x3FF) << 12)
(0x2E6) + (0x24D000)
0x24D2E6
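
The same arithmetic as a small Python function, in case you have a pile of
these to do. The name physical_address is just for the example.

```python
def physical_address(vaddr, hpte):
    """Combine the mapped page bits from *hpte with the unmapped low 12 bits."""
    return ((hpte & 0x3FF) << 12) | (vaddr & 0xFFF)


print(hex(physical_address(0x2FF2E6, 0xE24D)))   # 0x24d2e6
```

The | is equivalent to the + in the hand calculation, since the two halves
never overlap.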

Well, that was the easy part! Now for the more difficult part of finding which
memory chip that is. All the memory for the UNIXpc produced by Convergent, AT&T
and others has consisted of 64k or 256k dynamic memory chips, organized as
64k by 1 or 256k by 1. All memory for the UNIXpc MUST have a valid parity bit,
because the UNIXpc has a scheme to TEST the parity bits. The CPU chip used in
the UNIXpc is the Motorola 68010, which has a 16 bit data bus. The 68010 can
also access, for read or write, either the odd or even byte exclusively. The
UNIXpc thus has memory which is organized in groups of 18 chips, 16 for data
and two for parity. The trouble with transient (one shot) parity errors when
running UNIX is that you can't tell whether it was the even, the odd, or both
bytes with the given information.

The first thing we can tell right off the bat is that this memory location is
not on the mother board. Internal memory addresses range from 0x000000 to
0x1FFFFF, and external from 0x200000 to 0x3FFFFF.  Here is a sorted, uniqed
table of the addresses:

Virtual  *hpte  Physical
0x301004 0xC22F 0x22F004
0x081000 0x4238 0x238000
0x2FF2E6 0xE24B 0x24B2E6
0x2FF2E6 0xE24D 0x24D2E6
0x081000 0x4251 0x251000
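
A table like this can be built straight from unix.log. The following Python
sketch assumes the log lines look exactly like the ones quoted in this
article; the regular expression and the function name are made up for the
example.

```python
import re

LOG = """\
NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug  8 17:25:24 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug  8 17:25:25 1989
NMI (parity error) at 0x81000 (*hpte: 0x4251) Tue Aug  8 18:00:01 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24B) Fri Aug 11 04:25:27 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xC24B) Fri Aug 11 04:25:29 1989
NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
NMI (parity error) at 0x81000 (*hpte: 0x4238) Fri Aug 11 06:04:45 1989
"""

PATTERN = re.compile(
    r"NMI \(parity error\) at (0x[0-9A-Fa-f]+) \(\*hpte: (0x[0-9A-Fa-f]+)\)"
)


def unique_physical(log_text):
    """Return sorted (physical, (virtual, hpte)) pairs, one per physical address."""
    seen = {}
    for vaddr, hpte in PATTERN.findall(log_text):
        v, h = int(vaddr, 16), int(hpte, 16)
        phys = ((h & 0x3FF) << 12) | (v & 0xFFF)
        seen.setdefault(phys, (v, h))   # keep the first entry seen per address
    return sorted(seen.items())


for phys, (v, h) in unique_physical(LOG):
    print(f"0x{v:06X} 0x{h:04X} 0x{phys:06X}")
```

Entries that differ only in the status bits of *hpte (0xE24B and 0xC24B here)
collapse to one row, since they name the same physical location.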

Hmm. They all seem fairly well clustered together. In this case we know that
the expansion memory is on a Combo card. The Combo card has three sets of
18 chips, with these addresses:

Chip coords.	Address range
5A to  6K	0x200000 to 0x27FFFF
7A to  8K	0x280000 to 0x2FFFFF
9A to 10K	0x300000 to 0x37FFFF
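
Looking up a bank is just a range check against this table. A minimal Python
sketch, with COMBO_BANKS and combo_bank being names invented for the example:

```python
# (chip coordinates, first address, last address), from the table above
COMBO_BANKS = [
    ("5A to 6K", 0x200000, 0x27FFFF),
    ("7A to 8K", 0x280000, 0x2FFFFF),
    ("9A to 10K", 0x300000, 0x37FFFF),
]


def combo_bank(phys):
    """Return the chip coordinates of the Combo card bank holding phys."""
    for chips, first, last in COMBO_BANKS:
        if first <= phys <= last:
            return chips
    return None   # not on the Combo card


for phys in (0x22F004, 0x238000, 0x24B2E6, 0x24D2E6, 0x251000):
    print(hex(phys), combo_bank(phys))
```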

All the addresses I got from the error messages range from 0x22F004 to 0x251000.
So, given the slim information we got from the kernel, it's the first bank.
Without further info, there's no way to tell which of the 18 chips may be bad.
Note that I say MAY. There is no guarantee that there is anything wrong with
any of the memory on the board. The simple addition of the board to the system may
have put the power requirements over what the power supply can handle. It may
be that one (or more) of the chips in the first bank of the Combo card was the
first to show the signs of insufficient power. Note that expansion cards will
be the first to show signs of low voltage, since they are furthest away from the
power connector. One way to get around the power problem is to clean the contact
on the mother board where the power supply ribbon cable plugs in. Just pull it
off and put it on a couple of times, making sure it's all the way on the last
time.

The next thing I would try is to run the memory test on the diagnostics disk
all night long to try to catch the bad location. You will then run into one of
the more annoying things about the memory diagnostics: you can't test a
subsection of memory, you have to test all of it. This is especially bad when
you have 2M on the mother board and you're looking for problems with expansion
memory.

If the diagnostics can catch a memory problem, you will get a lot more
information. As far as memory on the Combo boards goes, the boards were laid out
so they could be repaired easily. Row K is all parity chips, alternating low
high. Column A to K is D0 to D7 for odd (5, 7, 9) or D8 to D15 for even.
So, if the diagnostics say bad parity chip, 0x293475 high, that would be the
second bank (7A to 8K), parity (row K), high (column 8): 8K.
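
The parity chip case is simple enough to sketch in Python. This assumes the
layout exactly as described above (row K parity, odd column low, even column
high) and only handles parity chips; the names are invented for the example.

```python
BANK_SIZE = 0x80000
# (bank base address, odd column, even column) for the three Combo banks
BANK_COLUMNS = [(0x200000, 5, 6), (0x280000, 7, 8), (0x300000, 9, 10)]


def combo_parity_chip(phys, half):
    """half is 'low' or 'high', as reported by the diagnostics."""
    if half not in ("low", "high"):
        raise ValueError("half must be 'low' or 'high'")
    for base, odd_col, even_col in BANK_COLUMNS:
        if base <= phys < base + BANK_SIZE:
            column = odd_col if half == "low" else even_col
            return f"{column}K"   # row K holds the parity chips
    return None   # address is not on the Combo card


print(combo_parity_chip(0x293475, "high"))   # 8K
```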

The layout on the mother board is much the same, the major difference
being in what kind of memory chips are on your mother board. If a memory chip
completely fails in the first bank of memory on the mother board, UNIX will
panic and you won't be able to boot anything. What was that repair center
number again?

The mother board memory is organized much the same as the Combo board. Mother
board memory goes from 2A (front right) to 10H (back left). Column 10 is all
parity chips, alternating low high, front to back. Column 2 to 9 is D0 to D7
for rows A, C, E, G or D8 to D15 for rows B, D, F, H.

Chip coords.	Address range (64k)	Address range (256k)
2G to 10H	0x000000 to 0x01FFFF	0x000000 to 0x07FFFF
2E to 10F	0x020000 to 0x03FFFF	0x080000 to 0x0FFFFF
2C to 10D	0x040000 to 0x05FFFF	0x100000 to 0x17FFFF
2A to 10B	0x060000 to 0x07FFFF	0x180000 to 0x1FFFFF

For example, 0x056789 would be one of the 18 chips at 2C to 10D if you have a
512k machine fully populated with 64k RAM chips, or one of the 18 chips at
2G to 10H if you have a half populated 1M mother board or a 2M mother board.
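
The mother board table boils down to dividing by the bank size. Here is a
rough Python sketch; the names are made up, and it does not know whether a
given bank is actually populated on your board, so treat the answer
accordingly.

```python
# Banks in address order, lowest first, from the table above
MOTHERBOARD_BANKS = ["2G to 10H", "2E to 10F", "2C to 10D", "2A to 10B"]
# Bytes per bank: 16 data chips of 64k or 256k bits each
BANK_BYTES = {"64k": 0x20000, "256k": 0x80000}


def motherboard_bank(phys, chip_size):
    """Return the chip coordinates of the mother board bank holding phys."""
    bank = phys // BANK_BYTES[chip_size]
    if bank >= len(MOTHERBOARD_BANKS):
        return None   # beyond what four banks of this chip size can hold
    return MOTHERBOARD_BANKS[bank]


print(motherboard_bank(0x056789, "64k"))
print(motherboard_bank(0x056789, "256k"))
```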

Well, I've been working from schematics up to now, let me take a guess at the
512k/2M AT&T RAM Expansion card.

**** WARNING THIS IS A GUESS ****

Row K parity, odd columns low, even columns high.

Chip coords.	Address range (64k)	Address range (256k)
 5A to  6K	0x000000 to 0x01FFFF	0x000000 to 0x07FFFF
 7A to  8K	0x020000 to 0x03FFFF	0x080000 to 0x0FFFFF
 9A to 10K	0x040000 to 0x05FFFF	0x100000 to 0x17FFFF
11A to 12K	0x060000 to 0x07FFFF	0x180000 to 0x1FFFFF

Well, finding a bad memory chip and replacing it are two different things. If you
have the skill and go after your mother board, PUT SOCKETS IN when you start
putting things back together. If you have bad 64k chips on a 512k mother board,
consider replacing ALL the chips and doing an upgrade to 2M. If you have a Combo
board (read sockets), replace the offending bank with new chips. If you can't,
swap the chips in the bad bank with another bank (assuming you have more than
one bank installed), then see if the memory problems move.

The continuity may be a little bad on this one folks; I've got a cold and this
was done in three sessions.

John
-- 
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!