barry@muddcs.UUCP (Barry Lustig) (08/14/84)
<Got them line eater blues>

I want to thank everyone who sent mail and/or posted to the net trying
to help me with our memory problems. My original question to the net
was how to interpret the mcr errors that pop up on the console
(e.g. mcr0: soft ecc addr 1265 syn 75). Our memory is National
Semiconductor NS753.

We finally managed to track down our bad chip, but it took quite a bit
of work. The first time through we ran DEC's ECKAM diagnostic. We ran
the long test because it seemed that it should be the most
comprehensive. After letting it run all night we found, in the morning,
that we did indeed have a single bit error. Unfortunately, the
diagnostic didn't give either the Data Bit or the Syndrome. After a
call to National I found out that only the ECKAM QUICK test will give
the actual Data Bit or Syndrome (Makes a lot of sense huh :-)).

Another problem I ran into was interpreting the syndrome. The National
memory chart didn't even list the syndrome that the console popped up
with! When I looked at the chart for Trendata memory on a nearby 780 it
had our syndrome and pointed to the right chip.

Down below are the responses I received nicely packaged for your
consumption!!!

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From scgvaxd!mkp Sat Aug 4 17:27:01 1984
Subject: re: your memory errors

Enclosed is an rcsdiff of /sys/vax/machdep.c. It addresses some
specific 750 memory problems, including the infamous "translation
buffer parity error crash". This diff is between machdep.c 6.2, which
is the original 4.2bsd source code, and our local version at 6.3.
Hope this helps.

Mike

1a2
> /* $Header: machdep.c,v 6.3 84/05/03 13:12:11 rcs Exp $ */
480,481c481,491
<         printf("mcr%d: soft ecc addr %x syn %x\n",
<             m, M750_ADDR(&amcr), M750_SYN(&amcr));
---
>         /*
>          * modified to distinguish hard and soft errors
>          * (W. Sebok astrovax!wls 3/7/83)
>          */
>         if (M750_ERR(mcr)&M750_UNCORR) {
>             printf("mcr%d: hard error",m);
>         } else {
>             printf("mcr%d: soft ecc",m);
>         }
>         printf(" addr %x syn %x\n",
>             M750_ADDR(&amcr), M750_SYN(&amcr));
810c820
<     if ((mcf->mc5_mcesr&0xf) == MC750_TBPAR) {
---
>     if ((mcf->mc5_mcesr&0xe) == MC750_TBPAR) {

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From scgvaxd!allegra!watmath!kwlalonde Sun Aug 5 16:56:17 1984
Subject: Re: mcr: errors
References: <169@muddcs.UUCP>

The same thing was plaguing one of our 780's last week. The nice man
from DEC came in and switched a board. No change - two crashes three
days later. Turns out he replaced the wrong board.

The console message comes from /sys/vax/machdep.c, routine memerr().
The macro MS780C_ADDR (sp?) in /sys/vax/mem.h prints out the memory
address, which contains the board number in bits 24-27. Replace that
board. (I'm writing this from memory - see your 780 System guide from
DEC for details.)

I'm going to change memerr() to spell out what is wrong a bit more
clearly, so I don't have to dig through the manual again.

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From: scgvaxd!trwrb!sdcrdcf!hplabs!ucbvax!RWS@MIT-XX.ARPA
>From: Robert W. Scheifler <hplabs!ucbvax!RWS@MIT-XX.ARPA>
Subject: Re: mcr: errors

The "addr" field is in hex and goes up 0x200 for each 256K bytes.
-------

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From: scgvaxd!trwrb!sdcrdcf!hplabs!ucbvax!jcp@BRL-TGR.ARPA
>From: Joe Pistritto <hplabs!ucbvax!jcp@BRL-TGR.ARPA>
Subject: Re: mcr: errors

Well, the two numbers you have are the address (probably in pages for
4.2BSD) and the error syndrome. The syndrome is a unique 8 bit code
which identifies which chip in a 64K bank is bad. The address tells you
which bank it is. The mapping of '1265' to a physical address requires
looking at the source, to find out what the units are.
(probably Kbytes, but I'm not sure). Going from the syndrome to a bit #
requires looking in the hardware reference for your memory boards.
There is most likely a table in there relating Syndromes to bit #s.
(There are 22 bits, I believe, although only 16 of these contain data).

-JCP-

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From: scgvaxd!trwrb!sdcrdcf!hplabs!hao!seismo!uwvax!brian
>From: hplabs!hao!seismo!uwvax!brian (Brian Pinkerton)
Subject: Re: mcr: errors
References: <169@muddcs.UUCP>

The location of the chip is dependent on the brand of board you have.
The easiest way to find the chip would be to look in the manual you got
with the memory. Or, if it's DEC memory, call them or run memory
diagnostics yourself (these will tell you what array is failing, then
you have to use the address to locate the chip). DEC memory is also
hard to fix because the chips are soldered right onto the board.

Brian Pinkerton @ wisconsin
...!{allegra,heurikon,ihnp4,seismo,sfwin,ucbvax,uwm-evax}!uwvax!brian
brian@wisc-rsch.arpa

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From: scgvaxd!trwrb!sdcrdcf!hplabs!ucbvax!LASH@COLUMBIA-20.ARPA
>From: Alan Lash <hplabs!ucbvax!LASH@COLUMBIA-20.ARPA>
Subject: Re: mcr: errors
In-Reply-To: Message from "hplabs!sdcrdcf!trwrb!scgvaxd!muddcs!barry@UCB-VAX.ARPA" of Wed 1 Aug 84 18:37:20-EDT

It would be appreciated if you could forward any responses that you get
to me. We have on occasion had similar problems with no solutions.

Thanks
AL LASH
LASH@COLUMBIA
-------

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From: scgvaxd!trwrb!sdcrdcf!sdcsvax!dcdwest!ittvax!qubix!msc ()
>From: sdcsvax!dcdwest!qubix!msc (Mark Callow)
Subject: Re: mcr: errors
In-Reply-To: your article <169@muddcs.UUCP>

Read your Vax hardware manual. Make sure you use the table for the 750
not the 780.
The error message contains all the information you need to find the bad
chip using this table. It's listed under memory controller, believe it
or not.

Mark

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From: scgvaxd!decvax!decwrl!amd!amdcad!resonex!nancy (Nancy Blachman)
Subject: Re: mcr: errors
References: <169@muddcs.UUCP>

I would be interested in seeing the responses you receive since I have
seen a similar thing on my 750 running 4.2.

Nancy Blachman
{allegra,amd,hplabs,inhp4,sun}!resonex!nancy
Resonex, Sunnyvale, CA (408)720 8600 x26

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From: scgvaxd!trwrb!sdcrdcf!hplabs!ucbvax!rbbb@rice.ARPA
>From: David Chase <hplabs!ucbvax!rbbb@rice.ARPA>
Subject: Re: mcr: errors

I will give this a look, if you would like. I once wrote an awk script
to decode the CSRC from 780 memory errors into chip and board
locations. Yes, it was a big help.

To help you, or to help you help yourself, I need the following
information:

1) who makes the memory, and what is the type of the board?
2) is there any variety of maintenance/overview manual?
3) I would feel most comfortable with the "CSR0" register, or very
   precise knowledge of how the numbers "addr" and "syn" were derived
   and printed (for instance, are those octal, decimal, or
   hexadecimal?) I try not to be a unix wizard, so I would like to
   avoid this if you have source.

For instance, Mostek provided a technical manual for their MK8016 780
memory with charts to help you find chips. This is what I coded into an
awk script. I also have the installation guide for a National NS753,
and it contains similar information. Hexadecimal is the preferred
representation for this information. If you find the charts and feel
unsure about this, I will gladly try to walk you through them; it's a
little scary the first time around (especially because the charts often
contain errors - you can tell because there is a break in the pattern).
Also, there is your good buddy ECKAM, to give you definitive values for
"addr" and "syndrome". "Addr" should be the address of the 512-byte
page containing the sick chip. "Syn" should be the error syndrome.

Good luck,
David Chase

<<<<<<<<<<<<<<<<<<<<<<<<<<<===================>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
From scgvaxd!ihnp4!uw-beaver!ssc-vax!jeff Sun Aug 12 08:39:37 1984
From: scgvaxd!ihnp4!uw-beaver!ssc-vax!jeff (Jeffrey Jongeward)
Subject: Re: mcr: errors
References: <169@muddcs.UUCP>

Here is a chart someone (don't remember who) submitted to the net a
while ago. (I "%s/^H/##/" for the sake of some mailers, so you will
have to switch 'em back.) It may help if you have either DEC memory.
The info in the NSC book is better than this chart, however, if you
have that kind of memory.

ssc-vax!jeff
jeff jongeward

DIGITAL MS780 (M8210) MEMORY ERROR MAP

Error is reported as:  soft ecc addr GFEDC syn BA

    G = bits 27:24   <-- array board
    F = bits 23:20
    E = bits 19:16
    D = bits 15:12
    C = bits 11:08
    B = bits 07:04
    A = bits 03:00

27:24 = array board in error

    0000 =0= board 01
    0001 =1=   "   02
    0010 =2=   "   03
    0011 =3=   "   04
    0100 =4=   "   05
    0101 =5=   "   06      DEC Memory
    0110 =6=   "   07
    0111 =7=   "   08
    1000 =8=   "   09
    1001 =9=   "   10
    1010 =A=   "   11
    1011 =B=   "   12
    ---------------------------
    1100 =C=   "   13
    1101 =D=   "   14      NSC Memory
    1110 =E=   "   15
    1111 =F=   "   16

23    = Array bank in error.  0 = lower; 1 = upper
22:09 = 16K chip address in error
08    = Word in error.  0 = lower; 1 = upper
07:00 = Error syndrome

March 3, 1983

ERROR SYNDROME CHIP LOCATION CHART

+----------------+------------+--------------+--------+--------+
| ERROR SYNDROME | LONG WORD  | BIT IN ERROR | BANK 0 | BANK 1 |
+----------------+------------+--------------+--------+--------+
| 01             | CHECK BYTE | C00          | E2     | E6     |
| 02             | CHECK BYTE | C01          | E3     | E7     |
| 04             | CHECK BYTE | C02          | E4     | E8     |
| 08             | CHECK BYTE | C03          | E5     | E9     |
| 10             | CHECK BYTE | C04          | E115   | E119   |
| 19             | LOWER      | 01           | E12    | E16    |
| 1A             | LOWER      | 02           | E13    | E17    |
| 1C             | LOWER      | 04           | E20    | E24    |
| 1F             | LOWER      | 07           | E23    | E27    |
| 20             | CHECK BYTE | C05          | E117   | E120   |
| 38             | LOWER      | 00           | E11    | E15    |
| 3B             | LOWER      | 03           | E14    | E18    |
| 3D             | LOWER      | 05           | E21    | E25    |
| 3E             | LOWER      | 06           | E22    | E26    |
| 40             | CHECK BYTE | C06          | E117   | E121   |
| 49             | LOWER      | 09           | EE30   | E34    |
| 4A             | LOWER      | 10           | E31    | E35    |
| 4C             | LOWER      | 12           | E38    | E42    |
| 4F             | LOWER      | 15           | E41    | E45    |
| 51             | LOWER      | 17           | E48    | E52    |
| 52             | LOWER      | 18           | E49    | E53    |
| 54             | LOWER      | 20           | E56    | E60    |
| 57             | LOWER      | 23           | E59    | E63    |
| 58             | LOWER      | 24           | E65    | E69    |
| 5B             | LOWER      | 27           | E68    | E72    |
| 5D             | LOWER      | 29           | E75    | E79    |
| 5E             | LOWER      | 30           | E76    | E80    |
| 68             | LOWER      | 08           | E29    | E33    |
| 6B             | LOWER      | 11           | E32    | E36    |
| 6D             | LOWER      | 13           | E39    | E43    |
| 6E             | LOWER      | 14           | E40    | E44    |
| 70             | LOWER      | 16           | E47    | E51    |
| 73             | LOWER      | 19           | E50    | E54    |
| 75             | LOWER      | 21           | E57    | E61    |
| 76             | LOWER      | 22           | E58    | E62    |
| 79             | LOWER      | 25           | E66    | E70    |
| 7A             | LOWER      | 26           | E67    | E71    |
| 7C             | LOWER      | 28           | E74    | E78    |
| 7F             | LOWER      | 31           | E77    | E81    |
| 80             | CHECK BYTE | C07          | E118   | E122   |
| 89             | UPPER      | 01           | E125   | E128   |
| 8A             | UPPER      | 02           | E126   | E130   |
| 8C             | UPPER      | 04           | E133   | E137   |
| 8F             | UPPER      | 07           | E136   | E140   |
| 91             | UPPER      | 09           | E143   | E147   |
| 92             | UPPER      | 10           | E144   | E148   |
| 94             | UPPER      | 12           | E151   | E155   |
| 97             | UPPER      | 15           | E154   | E158   |
| 98             | UPPER      | 16           | E160   | E164   |
| 9B             | UPPER      | 19           | E163   | E167   |
| 9D             | UPPER      | 21           | E170   | E174   |
| 9E             | UPPER      | 22           | E171   | E175   |
| A8             | UPPER      | 00           | E124   | E129   |
| AB             | UPPER      | 03           | E127   | E131   |
| AD             | UPPER      | 05           | E134   | E138   |
| AE             | UPPER      | 06           | E135   | E139   |
| B0             | UPPER      | 08           | E142   | E146   |
| B3             | UPPER      | 11           | E145   | E149   |
| B5             | UPPER      | 13           | E152   | E156   |
| B6             | UPPER      | 14           | E153   | E157   |
| B9             | UPPER      | 17           | E161   | E165   |
| BA             | UPPER      | 18           | E162   | E166   |
| BC             | UPPER      | 20           | E169   | E173   |
| BF             | UPPER      | 23           | E172   | E176   |
| C1             | UPPER      | 25           | E179   | E183   |
| C2             | UPPER      | 26           | E180   | E184   |
| C4             | UPPER      | 28           | E187   | E191   |
| C7             | UPPER      | 31           | E190   | E194   |
| E0             | UPPER      | 24           | E178   | E182   |
| E3             | UPPER      | 27           | E181   | E185   |
| E5             | UPPER      | 29           | E188   | E192   |
| E6             | UPPER      | 30           | E189   | E193   |
+----------------+------------+--------------+--------+--------+

NOTE: 1.
All error syndromes in this table have an odd number of bits equal to
a "1" and are correctable.
Example: syndrome 38=00111000 this syndrome has 3 "1"'s

NOTE: 2. Error syndromes with an even number of bits equal to "1" mean
double bit error. Double bit errors are not correctable.

---------------------------------------------------------

MOTOROLA MMS780 MEMORY ERROR MAP

Error is reported as:  soft ecc addr GFEDC syn BA

    G = bits 27:24
    F = bits 23:20
    E = bits 19:16
    D = bits 15:12
    C = bits 11:08
    B = bits 07:04
    A = bits 03:00

27:24 = array board in error

    0000 =0= board 01
    0001 =1=   "   02
    0010 =2=   "   03
    0011 =3=   "   04
    0100 =4=   "   05
    0101 =5=   "   06
    0110 =6=   "   07
    0111 =7=   "   08
    1000 =8=   "   09
    1001 =9=   "   10
    1010 =A=   "   11
    1011 =B=   "   12
    1100 =C=   "   13
    1101 =D=   "   14
    1110 =E=   "   15
    1111 =F=   "   16

23    = Array bank in error.  0 = lower; 1 = upper
22:09 = 16K chip address in error
08    = Word in error.  0 = lower; 1 = upper
07:00 = Error syndrome

ERROR SYNDROME CHIP LOCATION CHART

+----------------+------------+--------------+--------+--------+
| ERROR SYNDROME | LONG WORD  | BIT IN ERROR | BANK 0 | BANK 1 |
+----------------+------------+--------------+--------+--------+
| 01             | CHECK BYTE | C00          | 2C     | 2G     |
| 02             | CHECK BYTE | C01          | 2D     | 2H     |
| 04             | CHECK BYTE | C02          | 2E     | 2I     |
| 08             | CHECK BYTE | C03          | 2F     | 2J     |
| 10             | CHECK BYTE | C04          | 16C    | 16G    |
| 19             | LOWER      | 01           | 3D     | 3H     |
| 1A             | LOWER      | 02           | 3E     | 3I     |
| 1C             | LOWER      | 04           | 4C     | 4G     |
| 1F             | LOWER      | 07           | 4F     | 4J     |
| 20             | CHECK BYTE | C05          | 16D    | 16H    |
| 38             | LOWER      | 00           | 3C     | 3G     |
| 3B             | LOWER      | 03           | 3F     | 3J     |
| 3D             | LOWER      | 05           | 4D     | 4H     |
| 3E             | LOWER      | 06           | 4E     | 4I     |
| 40             | CHECK BYTE | C06          | 16E    | 16I    |
| 49             | LOWER      | 09           | 5D     | 5H     |
| 4A             | LOWER      | 10           | 5E     | 5I     |
| 4C             | LOWER      | 12           | 6C     | 6G     |
| 4F             | LOWER      | 15           | 6F     | 6J     |
| 51             | LOWER      | 17           | 7D     | 7H     |
| 52             | LOWER      | 18           | 7E     | 7I     |
| 54             | LOWER      | 20           | 8C     | 8G     |
| 57             | LOWER      | 23           | 8F     | 8J     |
| 58             | LOWER      | 24           | 9C     | 9G     |
| 5B             | LOWER      | 27           | 9F     | 9J     |
| 5D             | LOWER      | 29           | 10D    | 10H    |
| 5E             | LOWER      | 30           | 10E    | 10I    |
| 68             | LOWER      | 08           | 5C     | 5G     |
| 6B             | LOWER      | 11           | 5F     | 5J     |
| 6D             | LOWER      | 13           | 6D     | 6H     |
| 6E             | LOWER      | 14           | 6E     | 6I     |
| 70             | LOWER      | 16           | 7C     | 7G     |
| 73             | LOWER      | 19           | 7F     | 7J     |
| 75             | LOWER      | 21           | 8D     | 8H     |
| 76             | LOWER      | 22           | 8E     | 8I     |
| 79             | LOWER      | 25           | 9D     | 9H     |
| 7A             | LOWER      | 26           | 9E     | 9I     |
| 7C             | LOWER      | 28           | 10C    | 10G    |
| 7F             | LOWER      | 31           | 10F    | 10J    |
| 80             | CHECK BYTE | C07          | 16F    | 16J    |
| 89             | UPPER      | 01           | 17D    | 17G    |
| 8A             | UPPER      | 02           | 17E    | 17I    |
| 8C             | UPPER      | 04           | 18C    | 18G    |
| 8F             | UPPER      | 07           | 18F    | 18J    |
| 91             | UPPER      | 09           | 19D    | 19H    |
| 92             | UPPER      | 10           | 19E    | 19I    |
| 94             | UPPER      | 12           | 20C    | 20G    |
| 97             | UPPER      | 15           | 20F    | 20J    |
| 98             | UPPER      | 16           | 21C    | 21G    |
| 9B             | UPPER      | 19           | 21F    | 21J    |
| 9D             | UPPER      | 21           | 22D    | 22H    |
| 9E             | UPPER      | 22           | 22E    | 22I    |
| A8             | UPPER      | 00           | 17C    | 17H    |
| AB             | UPPER      | 03           | 17F    | 17J    |
| AD             | UPPER      | 05           | 18D    | 18H    |
| AE             | UPPER      | 06           | 18E    | 18I    |
| B0             | UPPER      | 08           | 19C    | 19G    |
| B3             | UPPER      | 11           | 19F    | 19J    |
| B5             | UPPER      | 13           | 20D    | 20H    |
| B6             | UPPER      | 14           | 20E    | 20I    |
| B9             | UPPER      | 17           | 21D    | 21H    |
| BA             | UPPER      | 18           | 21E    | 21I    |
| BC             | UPPER      | 20           | 22C    | 22G    |
| BF             | UPPER      | 23           | 22F    | 22J    |
| C1             | UPPER      | 25           | 23D    | 23H    |
| C2             | UPPER      | 26           | 23E    | 23I    |
| C4             | UPPER      | 28           | 24C    | 24G    |
| C7             | UPPER      | 31           | 24F    | 24J    |
| E0             | UPPER      | 24           | 23C    | 23G    |
| E3             | UPPER      | 27           | 23F    | 23J    |
| E5             | UPPER      | 29           | 24D    | 24H    |
| E6             | UPPER      | 30           | 24E    | 24I    |
+----------------+------------+--------------+--------+--------+

NOTE: 1. All error syndromes in this table have an odd number of bits
equal to a "1" and are correctable.
Example: syndrome 38=00111000 this syndrome has 3 "1"'s

NOTE: 2. Error syndromes with an even number of bits equal to "1" mean
double bit error. Double bit errors are not correctable.

March 3, 1983

-- 
Barry Lustig
Harvey Mudd College
UUCP:  {ihnp4,allegra,seismo}!scgvaxd!muddcs!barry
PHONE: At the moment --- (714) 621-8000 x8023
When the revolution comes kill all the lawyers first!
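
For anyone who would rather not pick the fields apart by hand, the 780
layout given in the charts above (board in bits 27:24, bank in bit 23,
chip address in bits 22:09, word in bit 08, syndrome in bits 07:00) can
be sketched in C roughly as follows. This is only an illustration of
the chart as posted, not code from machdep.c or from any vendor. The
names mcr_fields, mcr_decode, mcr_print and count_ones are made up
here, it assumes "addr" is the five hex digits GFEDC and "syn" the two
digits BA, and it has not been checked against real hardware. It does
not apply to 750 messages, which need a different table.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the fields named in the posted MS780 chart. */
struct mcr_fields {
    unsigned board;      /* array board 1..16 (bits 27:24, plus one) */
    unsigned bank;       /* bit 23: 0 = lower, 1 = upper */
    unsigned chip_addr;  /* bits 22:09 */
    unsigned word;       /* bit 08: 0 = lower, 1 = upper */
    unsigned syndrome;   /* bits 07:00 */
    unsigned ones;       /* number of 1 bits in the syndrome */
};

/* Count the 1 bits in the low 8 bits of v. */
unsigned count_ones(unsigned v)
{
    unsigned n = 0;

    for (v &= 0xFFu; v != 0; v >>= 1)
        n += v & 1u;
    return n;
}

/*
 * Rebuild the 28-bit value from the two printed numbers (assumed layout:
 * addr = bits 27:08, syn = bits 07:00) and split it as the chart describes.
 */
struct mcr_fields mcr_decode(unsigned long addr, unsigned syn)
{
    unsigned long reg = ((addr & 0xFFFFFUL) << 8) | (syn & 0xFFu);
    struct mcr_fields f;

    f.board     = (unsigned)((reg >> 24) & 0xF) + 1;
    f.bank      = (unsigned)((reg >> 23) & 1);
    f.chip_addr = (unsigned)((reg >> 9) & 0x3FFF);
    f.word      = (unsigned)((reg >> 8) & 1);
    f.syndrome  = (unsigned)(reg & 0xFF);
    f.ones      = count_ones(f.syndrome);
    return f;
}

/* Print the decoded fields; odd/even wording follows NOTE 1 and NOTE 2. */
void mcr_print(const struct mcr_fields *f)
{
    printf("board %02u, %s bank, chip address %04x, %s word, syndrome %02x: %s\n",
           f->board,
           f->bank ? "upper" : "lower",
           f->chip_addr,
           f->word ? "upper" : "lower",
           f->syndrome,
           (f->ones % 2) ? "odd number of 1s (single bit, correctable)"
                         : "even number of 1s (double bit per NOTE 2)");
}

/* Sanity check against the chart's own example: 38 hex has three 1 bits. */
void mcr_selftest(void)
{
    assert(count_ones(0x38) == 3);
}
```

Called from your own main(), mcr_decode(0x38000, 0x38) comes out as
board 04, upper bank, lower word, with a three-bit syndrome; the chip
itself still has to be looked up under syndrome 38 in the chart for
your brand of memory. A syndrome of 00 is not covered by the notes and
is not treated specially here.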
dmmartindale@watcgl.UUCP (Dave Martindale) (08/16/84)
Please, people, when posting a summary of responses to a question, just post the informative answers! Posting the messages which answer the question may very well prove valuable to someone else. But posting the letters that just say "Please tell me what you find out" to the rest of the net is pointless.