sbw@naucse.UUCP (Steve Wampler) (08/11/89)
My 3b1 has been suffering from an intermittent parity error ever since I added a 1.5Meg combo card (2Meg on the motherboard). I finally was able to record some of the errors in unix.log (normally the system just crashed).

Could some kind soul tell me *exactly* which chip(s) are involved? I wouldn't object to an explanation on how to read the address (what does *hpte mean?) given in the error message, nor an explanation on how the chips correspond to memory addresses. I know this was posted a while back, but I never thought *my machine* would need it.

My eternal gratitude. Herewith, the error messages:

NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug 8 17:25:24 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug 8 17:25:25 1989
NMI (parity error) at 0x81000 (*hpte: 0x4251) Tue Aug 8 18:00:01 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24B) Fri Aug 11 04:25:27 1989
NMI (parity error) at 0x2FF2E6 (*hpte: 0xC24B) Fri Aug 11 04:25:29 1989
NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
NMI (parity error) at 0x81000 (*hpte: 0x4238) Fri Aug 11 06:04:45 1989

--
Steve Wampler
{....!arizona!naucse!sbw}
jbm@uncle.UUCP (John B. Milton) (08/16/89)
In article <1648@naucse.UUCP> sbw@naucse.UUCP (Steve Wampler) writes:
[ how do you map the /usr/adm/unix.log NMI parity errors to memory chips? ]

>address (what does *hpte mean?) given in the error message, nor an

Hardware Page Table Entry, the physical memory mapping RAM.

>NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug 8 17:25:24 1989
>NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24D) Tue Aug 8 17:25:25 1989
>NMI (parity error) at 0x81000 (*hpte: 0x4251) Tue Aug 8 18:00:01 1989
>NMI (parity error) at 0x2FF2E6 (*hpte: 0xE24B) Fri Aug 11 04:25:27 1989
>NMI (parity error) at 0x2FF2E6 (*hpte: 0xC24B) Fri Aug 11 04:25:29 1989
>NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
>NMI (parity error) at 0x301004 (*hpte: 0xC22F) Fri Aug 11 05:26:30 1989
>NMI (parity error) at 0x81000 (*hpte: 0x4238) Fri Aug 11 06:04:45 1989

The first number is the virtual address, the number your program sees. The *hpte value is not the address of the mapping RAM entry used, but the contents of the mapping RAM for the virtual page in question. The UNIXpc page size is 4k (2^12). By itself, the virtual address is useless in tracking down the memory chip.

The RAM which does virtual to physical mapping on the UNIXpc is 16 bits wide. During a virtual memory cycle, what goes into the mapping RAM is address bits A12 to A21. What comes out are mapped address bits MA12 to MA21. The MA12 to MA21 lines together with the A1 to A11 (and UDS/LDS) lines form the actual physical address used to access memory.

Ok, that's 10 bits for address mapping. That means that 1024 pages of 4k each can be mapped by the mapping hardware, for a total of 4M. This is the unusual part. The UNIXpc has a maximum physical memory limit of 4M, which is exactly the same as the maximum virtual memory limit. On most machines with virtual memory, the virtual space is quite large, typically 4G. On these machines, the maximum physical memory limit is far smaller than the maximum virtual limit.
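
For anyone who wants to check the page arithmetic above, here is a minimal sketch in Python. It only restates the numbers given in the text (4k pages, A12 to A21 going into the mapping RAM); the constant and function names are made up for illustration and are not kernel or hardware names.

```python
# Illustrative only: names here are assumptions, not UNIXpc kernel symbols.
PAGE_SHIFT = 12                  # 4k pages: 2^12 bytes
PAGE_SIZE = 1 << PAGE_SHIFT
MAP_BITS = 10                    # address bits A12..A21 index the mapping RAM
NUM_PAGES = 1 << MAP_BITS        # 1024 mapping RAM entries

def split_virtual(vaddr):
    """Return (page index presented to the mapping RAM, offset within the page)."""
    page = (vaddr >> PAGE_SHIFT) & (NUM_PAGES - 1)
    offset = vaddr & (PAGE_SIZE - 1)
    return page, offset

page, offset = split_virtual(0x2FF2E6)
print("page 0x%03X, offset 0x%03X" % (page, offset))
print("mappable space: %dM" % (NUM_PAGES * PAGE_SIZE // (1 << 20)))
```

The page index is what selects the mapping RAM entry; the offset passes through unmapped.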

Ok, 10 bits, that leaves 6 page status bits, PS0 to PS4 and WE. Only two (PS0 and PS1) of the page status bits are currently used. Hmmm. This is the encoding of PS0 and PS1:

PS0 PS1  Status of page
 0   0   Page is not present (or not allowed)
 0   1   Page is present, but has not been accessed
 1   0   Page has been read but not written (clean)
 1   1   Page has been written to (dirty)

These page status bits are what makes virtual memory work. They are automatically updated by hardware during each memory access. When an access is made to a page that is not present, a fault is generated. It is then up to the UNIX kernel to decide whether the process is really allowed to access that page or not, and what to do about it.

When it is time for a page belonging to a process to be removed from memory, the page status bits are checked to see what has happened to the page. If the page has been written to, it is assumed that the page was changed and is now different from the page in /dev/swap. It must therefore be written to disk. If the page was read but not written, then the version on disk is the same, so no write is needed.

This is the mapping of the bits in the mapping RAM when read/written:

Bit Desc    Bit Desc    Bit Desc    Bit Desc
 0  MA12     1  MA13     2  MA14     3  MA15
 4  MA16     5  MA17     6  MA18     7  MA19
 8  MA20     9  MA21    10  PS4     11  PS3
12  PS2     13  PS0     14  PS1     15  WE

We can use the above table to decode the error messages to get a physical address. Remember that the low 12 bits of the physical address are not mapped, so they are the same as in the virtual address. In the first case from above, the virtual address is 0x2FF2E6, and the contents of the mapping RAM for that virtual address is 0xE24D, so the physical address is

    (0x2FF2E6 & 0xFFF) + ((0xE24D & 0x3FF) << 12)
    (0x2E6) + (0x24D000)
    0x24D2E6

Well, that was the easy part! Now for the more difficult part of finding which memory chip that is.
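
The decoding above is easy to get wrong by hand, so here is a minimal sketch of it in Python. It assumes the bit layout exactly as given in the mapping RAM table (bits 0 to 9 are MA12 to MA21, bit 13 is PS0, bit 14 is PS1, bit 15 is WE); the function name and the dictionary keys are made up for illustration.

```python
# Illustrative sketch; decode_hpte and its result keys are assumed names.
def decode_hpte(vaddr, hpte):
    """Split one unix.log entry into a physical address and raw status bits."""
    return {
        # Low 12 bits come straight from the virtual address (unmapped);
        # the upper bits come from MA12..MA21 in the mapping RAM word.
        "phys": ((hpte & 0x3FF) << 12) | (vaddr & 0xFFF),
        "ps0": (hpte >> 13) & 1,
        "ps1": (hpte >> 14) & 1,
        "we": (hpte >> 15) & 1,
    }

# (virtual address, *hpte) pairs taken from the error messages quoted above.
log = [(0x2FF2E6, 0xE24D), (0x081000, 0x4251), (0x2FF2E6, 0xE24B),
       (0x2FF2E6, 0xC24B), (0x301004, 0xC22F), (0x081000, 0x4238)]

for vaddr, hpte in log:
    d = decode_hpte(vaddr, hpte)
    print("0x%06X 0x%04X -> 0x%06X  PS0=%d PS1=%d WE=%d"
          % (vaddr, hpte, d["phys"], d["ps0"], d["ps1"], d["we"]))
```

The status bits are returned raw rather than translated, so they can be looked up in the PS0/PS1 table by hand.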

All the memory for the UNIXpc produced by Convergent, AT&T and others has consisted of 64k or 256k dynamic memory chips, organized as 64k by 1 or 256k by 1. All memory for the UNIXpc MUST have a valid parity bit, because the UNIXpc has a scheme to TEST the parity bits. The CPU chip used in the UNIXpc is the Motorola 68010, which has a 16 bit data bus. The 68010 can also access, for read or write, either the odd or even byte exclusively. The UNIXpc thus has memory which is organized in groups of 18 chips, 16 for data and two for parity. The trouble with transient (one shot) parity errors when running UNIX is that you can't tell whether it was the even, the odd, or both bytes with the given information.

The first thing we can tell right off the bat is that this memory location is not on the mother board. Internal memory addresses range from 0x000000 to 0x1FFFFF, and external from 0x200000 to 0x3FFFFF. Here is a sorted, uniqed table of the addresses:

Virtual    *hpte    Physical
0x301004   0xC22F   0x22F004
0x081000   0x4238   0x238000
0x2FF2E6   0xE24B   0x24B2E6
0x2FF2E6   0xE24D   0x24D2E6
0x081000   0x4251   0x251000

Hmm. They all seem fairly well clustered together. In this case we know that the expansion memory is on a Combo card. The Combo card has three sets of 18 chips, with these addresses:

Chip coords.   Address range
5A to 6K       0x200000 to 0x27FFFF
7A to 8K       0x280000 to 0x2FFFFF
9A to 10K      0x300000 to 0x37FFFF

All the addresses I got from the error messages range from 0x22F004 to 0x251000. So, given the slim information we got from the kernel, it's the first bank. Without further info, there's no way to tell which of the 18 chips may be bad.

Note that I say MAY. There is no guarantee that anything is wrong with any of the memory on the board. The simple addition of the board to the system may have put the power requirements over what the power supply can handle.
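
The bank lookup for the Combo card can be sketched the same way. This is a minimal Python example that only copies the three address ranges from the Combo card table; the names are made up for illustration, and it says nothing about which of the 18 chips in a bank is at fault.

```python
# Illustrative sketch; COMBO_BANKS and combo_bank are assumed names.
# Ranges are the three Combo card banks listed in the table above.
COMBO_BANKS = [
    ("5A to 6K",  0x200000, 0x27FFFF),
    ("7A to 8K",  0x280000, 0x2FFFFF),
    ("9A to 10K", 0x300000, 0x37FFFF),
]

def combo_bank(phys):
    """Return the chip coordinates of the bank holding phys, or None."""
    for coords, low, high in COMBO_BANKS:
        if low <= phys <= high:
            return coords
    return None  # not on the Combo card (e.g. mother board memory)

# Physical addresses decoded from the error messages.
for phys in (0x22F004, 0x238000, 0x24B2E6, 0x24D2E6, 0x251000):
    print("0x%06X -> %s" % (phys, combo_bank(phys)))
```

A result of None just means the address is outside the three Combo banks.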

It may be that one (or more) of the chips in the first bank of the Combo card was the first to show the signs of insufficient power. Note that expansion cards will be the first to show signs of low voltage, since they are furthest away from the power connector. One way to get around the power problem is to clean the contact on the mother board where the power supply ribbon cable plugs in. Just pull it off and put it on a couple of times, making sure it's all the way on the last time.

The next thing I would try is to run the memory test on the diagnostics disk all night long to try to catch the bad location. You will then run into one of the more annoying things about the memory diagnostics: you can't test a subsection of memory, you have to test all of it. This is especially bad when you have 2M on the mother board and you're looking for problems with expansion memory.

If the diagnostics can catch a memory problem, you will get a lot more information. As far as memory on the Combo boards goes, the boards were laid out so they could be repaired easily. Row K is all parity chips, alternating low high. Column A to K is D0 to D7 for odd (5, 7, 9) or D8 to D15 for even. So, if the diagnostics say bad parity chip, 0x293475 high, that would be the second bank (7A to 8K), parity (row K), high (column 8): 8K.

The layout on the mother board is much the same, the major difference being in what kind of memory chips are on your mother board. If a memory chip completely fails in the first bank of memory on the mother board, UNIX will panic and you won't be able to boot anything. What was that repair center number again?

The mother board memory is organized much the same as the Combo board. Mother board memory goes from 2A (front right) to 10H (back left). Column 10 is all parity chips, alternating low high, front to back. Column 2 to 9 is D0 to D7 for rows A, C, E, G or D8 to D15 for rows B, D, F, H.

Chip coords.   Address range (64k)    Address range (256k)
2G to 10H      0x000000 to 0x01FFFF   0x000000 to 0x07FFFF
2E to 10F      0x020000 to 0x03FFFF   0x080000 to 0x0FFFFF
2C to 10D      0x040000 to 0x05FFFF   0x100000 to 0x17FFFF
2A to 10B      0x060000 to 0x07FFFF   0x180000 to 0x1FFFFF

For example, 0x056789 would be one of the 18 chips at 2C to 10D if you have a 512k machine fully populated with 64k RAM chips, or one of the 18 chips at 2G to 10H if you have a half populated 1M mother board or a 2M mother board.

Well, I've been working from schematics up to now; let me take a guess at the 512k/2M AT&T RAM Expansion card.

**** WARNING THIS IS A GUESS ****

Row K parity, odd columns low, even columns high.

Chip coords.   Address range (64k)    Address range (256k)
5A to 6K       0x000000 to 0x01FFFF   0x000000 to 0x07FFFF
7A to 8K       0x020000 to 0x03FFFF   0x080000 to 0x0FFFFF
9A to 10K      0x040000 to 0x05FFFF   0x100000 to 0x17FFFF
11A to 12K     0x060000 to 0x07FFFF   0x180000 to 0x1FFFFF

Well, finding a bad memory chip and replacing it are two different things. If you have the skill and go after your mother board, PUT SOCKETS IN when you start putting things back together. If you have bad 64k chips on a 512k mother board, consider replacing ALL the chips and doing an upgrade to 2M. If you have a Combo board (read: sockets), replace the offending bank with new chips. If you can't, swap the chips in the bad bank with another bank (assuming you have more than one bank installed), then see if the memory problems move.

The continuity may be a little bad on this one folks; I've got a cold and this was done in three sessions.

John
--
John Bly Milton IV, jbm@uncle.UUCP, n8emr!uncle!jbm@osu-cis.cis.ohio-state.edu
(614) h:294-4823, w:785-1110; N8KSN, AMPR: 44.70.0.52; Don't FLAME, inform!