mdapoz@hybrid.uucp (Mark Dapoz) (02/05/91)
Well, I knew it was bound to happen sooner or later; unfortunately it was sooner than I had hoped. It seems my trusty 'ole 3b1 has packed it in: it now just sits there with the number 1 & 3 LEDs on when I turn it on. Looking in the hardware tech ref, this indicates that test 5, the dynamic RAM test, is failing. Great, one, if not more, of my 72 RAM chips has died. I already have 2 meg on the motherboard so I don't want to go ripping them all out just to find out which one is bad. Does anyone have any idea how I can go about narrowing down exactly which RAM chip is bad? I can't even boot a diag disk so none of those fancy tests will do me any good. I remember some talk recently about making up some new diag ROMs to help track problems like this down. Did anyone ever get this working?

It's really bizarre how the machine just gave out for no apparent reason. There I was, working on my Sun while the 3b1 was madly unpacking a few meg of news, and then...... silence. I figured the kernel just hung for some bizarre reason so I reset the machine. It cleared the screen as usual and then it went to alternating black and white lines, uh oh. I quickly reset it again... same thing. Powered it down completely... same thing.

Luckily I have a second 3b1 for times just like this. I even added the second drive expansion socket to it just in case I had to swap boards. After opening up both machines and swapping boards I now have my main machine back up and running, but the spare one is now dead. Any suggestions on how to fix this dead board would be greatly appreciated.
--
Managing a software development team   | Mark Dapoz
is a lot like being on the psychiatric | mdapoz%hybrid@cs.toronto.edu
ward. -Mitch Kapor, San Jose Mercury   | mdapoz@torvm3.iinus1.ibm.com
dt@yenta.alb.nm.us (David B. Thomas) (02/06/91)
mdapoz@hybrid.uucp (Mark Dapoz) writes:

> Does anyone have any idea how
> I can go about narrowing down exactly which ram chip is bad?

At the BOF, I remember Craig Votava mentioning that someone had figured out a way to jumper things on the motherboard so you can trick the hardware into thinking that any 512k bank you like is the ONLY 512k in the system.

That would sure help! Anybody know how to do that?

little david
--
Computer interfaces and user interfaces are as different as night and 1.
rmfowler@texrex.uucp (Rex Fowler) (02/07/91)
In article <1991Feb6.025147.22371@yenta.alb.nm.us> dt@yenta.alb.nm.us (David B. Thomas) writes:
>mdapoz@hybrid.uucp (Mark Dapoz) writes:
>
>> Does anyone have any idea how
>> I can go about narrowing down exactly which ram chip is bad?
>
>At the BOF, I remember Craig Votava mentioning that someone had figured out
>a way to jumper things on the motherboard so you can trick the hardware
>into thinking that any 512k bank you like is the ONLY 512k in the system.
>

The someone was Peter Fales <psfales@ihlpb.att.com>. I sent mail to him requesting his instructions but have received no response. If anyone has these instructions, please email me a copy.
--
Rex Fowler <rmfowler%texrex@cirr.com> UUCP: egsner!texrex!rmfowler
njc@rick.att.com (Neil Cherry) (02/07/91)
I just tried to compile the CDRAW program from OSU, but found that I didn't have the C Bindings installed. After searching around I found the manual but no DISK! Has anybody out there got a copy? I may be able to find it at the hotline, but I doubt it since 3B1 users are fewer and fewer these days.

NJC
botton@i88.isc.com (Brian D. Botton) (02/12/91)
In article <1991Feb5.070902.1260@hybrid.UUCP> mdapoz@hybrid.uucp (Mark Dapoz) writes:
>Well, I knew it was bound to happen sooner or later, unfortunately it was
>sooner than I had hoped. It seems my trusty 'ole 3b1 has packed it in, it
>now just sits there with the number 1 & 3 led's on when I turn it on.
>Looking in the hardware tech ref this indicates that test 5, the dynamic
>ram test is failing. Great, one, if not more of my 72 ram chips has died.
>I already have 2 meg on the motherboard so I don't want to go ripping them
>all out just to find out which one is bad. Does anyone have any idea how
>I can go about narrowing down exactly which ram chip is bad? I can't even
>boot a diag disk so none of those fancy tests will do me any good. I
>remember some talk recently about making up some new diag roms to help
>track problems like this down. Did anyone ever get this working?

I was the one that brought that subject up and yes, I have some results to report. I too had a memory problem; however, it wasn't a hard failure like yours. My machine was able to boot up UNIX or the diag disk and then it would crap out. The problem was with low memory, and the diag disk doesn't check low memory.

I did several things to solve my problem, which I will try to enumerate in proper order. I also have a second machine, from work, so I had a good system to work on. I loaded the devrom device driver (John Milton) and read the ROM object into a file. I then tried to use dis to disassemble the object, but ran into trouble because the object module wasn't a COFF file, of course. So, being of stout heart and maybe a brick or two short, I spent the next week disassembling the object module by hand. On the surface you might think that was a stupid thing to do; however, I think it worked rather well. I relearned 68000 assembler, the silly AT&T syntax, and became very familiar with how the ROM was put together.
During this process, I figured out how to use the -n option of ld(1) to build an object module linked at any physical address I desired. I also wrote a quick and dirty program that would take this object module and create two Intel HEX files, one for the high bytes, the other for the low bytes. Using these somewhat primitive tools, I was able to assemble and link my own version of the boot ROM.

After making sure I could make an exact binary copy of the boot ROM, I started the job of commenting the code. I commented the assembly through all the steps of initializing the hardware after a reset and then to the main loop that causes the squares to be drawn on the screen. It was very obvious that the code from the main loop on was the output of a C compiler. Things got a little tougher here because the 3.51 compiler puts out similar, but sometimes drastically different, code.

Anyway, during this time I was sending regular updates to Craig Votava, and he suggested I take a look at the source for the extended diagnostic disk (don't ask, I'm not at liberty to share, ;-(). What a stroke of luck, because the disk manipulation routines were very identifiable with my ROM assembly! Anyway, by this time, I could make my own ROM and manipulate the MMU code, which is critical to my next step.

Having the source for the diagnostics, I had already tried linking it so that it ran in high memory. Unfortunately that didn't work, because the ROM only maps the bottom .5Meg of RAM, probably because this is the minimum RAM size possible. When the kernel boots it sets up the rest of the memory pages. Anyway, when the loader placed the diagnostics at high memory, it was placed at whatever page mappings happened to be in the MMU after the page map test, which wasn't good.

The solution has two parts. First, because I knew low memory was bad, and I have a 1.5Meg combo board, I modified the initialization code so that the MMU was no longer unity mapped.
I placed page 0, not at physical address 0, but at physical address 0x200000, page 1 after page 0, and so on. What this did for me is that now when the loader was read off the floppy, it was placed into known good memory. Second, I fixed all the places in the diagnostic source where the MMU page tables are set up. This happens anytime there is going to be some kind of RAM test. I modified this code so it continued to map high physical memory as low virtual memory. I also mapped low physical memory into high virtual memory. Once I did this, I booted the floppy and ran the EXTENDED MEMORY TEST! All of this took a couple of days' work over New Year's vacation.

After running my new diagnostic for a few hours, the memory address test started to fail. This started to be reproducible every few minutes, which is very good news. Address bit A1 was bad, which points to one of the address multiplexor chips. I took out my trusty O'scope and started poking around, trying to see a bad signal. While I was doing this, the stupid thing started to fail even the simple power-on diagnostics the boot ROM runs, giving me the dashed lines on the screen. Sound familiar, Mark?! So now I couldn't even boot my special diagnostic disk, ;-(. I did the only thing a reasonable person could do; I said S@#$ and a few other choice words and went to bed.

The next day the machine was still dead in the water. The only option now was to do diagnostics in ROM. Normally the ROM's RAM diagnostic draws the lines on the screen and then executes a stop instruction. Instead of doing that, I decided to display the address, the data read, and the data that it should have been, all in bit codes on the screen, with 1s represented by a square and 0s by a space. To help keep track of which bit was which, I first drew 24 squares for the address, and then 2 sets of 16 squares for the data. So when the memory test was run, it drew the 24+16+16 squares on the screen, with a few blank spaces in between.
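The bit display described above can be sketched in a few lines. This is only an illustration, not the ROM code: the function names, the use of "#" for a square, the two-space gap between fields, and the most-significant-bit-first ordering are all assumptions.

```python
# Hypothetical sketch of the "squares" error display; not the actual ROM code.
# Assumptions: '#' stands for a square, fields are separated by two blanks,
# and bits are shown most significant first.

def bits_to_squares(value, width, on="#", off=" "):
    """Render the low `width` bits of value, most significant bit first."""
    return "".join(on if (value >> i) & 1 else off
                   for i in range(width - 1, -1, -1))

def error_row(address, data_read, data_expected):
    """One display row: 24 address bits, 16 bits read, 16 bits expected."""
    fields = [
        bits_to_squares(address, 24),
        bits_to_squares(data_read, 16),
        bits_to_squares(data_expected, 16),
    ]
    return "  ".join(fields)

def guide_row():
    """Reference row of 24+16+16 squares, used to locate bit positions."""
    return error_row(0xFFFFFF, 0xFFFF, 0xFFFF)

if __name__ == "__main__":
    print(guide_row())
    # Made-up failure values, for illustration only.
    print(error_row(0x000002, 0x0000, 0x0002))
```

Printing the guide row above the error row makes it easy to count positions: any column where the "read" and "expected" fields differ is a suspect data bit.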
When it found an error, instead of stopping, it drew out the address, the data read, and the data it should have been. Finally, to help troubleshoot, it went into an infinite loop reading and writing the bad location.

With yet another ROM in hand, I was getting consistent failures at physical address 2, which points to a particular 512K bank of RAM and 2 particular address multiplexor chips. What I really needed now was a logic analyzer, because the old O'scope just can't capture those once in a thousand glitches, which happen several times a second. So the only alternative was to trace the address select logic from the 68010 all the way to the RAM chips. After spending about 4 hours doing this I discovered a BAD SOCKET!!!!!!

When I upgraded from 1Meg to 2Meg on the motherboard I bought cheap sockets. Three of us went in together and did our motherboards at the same time; the other two haven't had any socket problems. Anyway, one of the multiplexor chips' sockets had a bad ground pin that had anywhere from 50 to 50K ohms of resistance, depending on how you stressed the socket! I continued my quest and discovered a couple of the memory chips had > 5 ohms resistance at the +5V pins, and another one of the multiplexor chips had a ground line with > 5 ohms. These problems, I suspect, were caused by too much heat when removing solder from the holes. A few lengths of wire-wrap wire fixed these problems.

A week later, after an order of good quality machined sockets arrived and were installed, I booted up my special diagnostics disk. It ran for about 18 hours and one of those #$%&* multiplexor chips went bad! Replaced it and ran diagnostics for ~5.5 days. Put on the hard disk out of the borrowed machine (mine had been transplanted into that machine long ago) and booted UNIX for the first time since April 1990. It ran for ~7 days without a glitch. Returned my hard disk to my machine and it's been up ever since, about 3 weeks' worth.

So, what did I learn?

1.
DON'T EVEN THINK ABOUT USING CHEAP SOCKETS. They just aren't worth the aggravation.

2. Even though you may be a good solderer (I repaired circuit boards when I was in the Air Force), it isn't that hard to damage a multilayer PC board. Ground and power leads are especially hard to work with because they sink a lot of heat. A better solder sucker would have helped.

3. Had a great, although painful, time disassembling the ROM.

I don't feel comfortable giving out the ROM code. If I had done it completely on my own I might consider it, but I did have help. What I am planning on doing is converting the initial assembler into C code and fixing up the main loop; these I'll post. The disk routines I want to update with the 3.51 loader because the loader supports the P5.1 disks. I also want to include the disk writing routines; the ROM has only the disk reading routines. These I'll have to put in a library and distribute in object form.

Since the IHV (Independent Hardware Vendor) diagnostic disk is public, I want to modify the code so status messages are in ASCII instead of "marching squares." With proper disk, screen and keyboard support, I think some very interesting things can be done. BTW, I did look at the disk routines in the IHV kit and, unless I missed something, they are extremely limited.

By the time I finished converting assembler into C, I had written a new object to Intel HEX conversion program. This time I programmed in the code to read the .text, .data and .bss headers. The .data and .text sections are placed at the proper address in ROM and the .bss section is ignored because it is in RAM. The current boot ROM doesn't clear the .bss section like it should, but it would be relatively easy to implement as part of the startup code.

I know there are people who are having problems with their memory and might want to use my special ROMs and diagnostic disk.
I am open to helping them out, but I don't really want to get too bogged down supporting something that I want to put in a full-featured ROM. If you are really having problems and could benefit from my stuff, send E-mail and we'll work something out. Otherwise, I think it would be a good idea to wait until I can get something more useful out to the net.

I am taking ideas for the "enhanced ROM", and I have had a volunteer to help me out. That's right David K., I haven't forgotten about you! So please keep sending ideas; hopefully in the next few weeks I'll post a summary. Just as an idea, I like two suggestions I've received: that I put the loader and the diagnostics in ROM. I also want to put a monitor in there.

BTW, the first square is always drawn on the screen. Then the floppy and then the hard disks are checked to see if they are ready to be read. If neither is, another square is drawn and the process repeats.

>It's really bizarre how the machine just gave out for no apparent reason.

Mine did the same thing, except it limped along for a month before I declared it brain damaged. If you need some help let me know.
--
... ___ *** _][_n_n___i_i ________ ******* Brian D. Botton (____________I_I______I_I_______I laidbak!botton or /ooOOOO OOOOoo oo oooo oo oo laidbak!bilbo!brian
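For readers who want to try the high/low byte split mentioned in the post, here is a minimal sketch. It is not Brian's program and does not parse COFF headers: it assumes you already have a flat binary ROM image, that each half fits in 64K so plain 16-bit Intel HEX addresses suffice, and that even-addressed bytes go to the "high" EPROM (the 68000 family puts even addresses on D15-D8). All function names here are hypothetical.

```python
# Hypothetical sketch: split a flat ROM image into two Intel HEX files for a
# pair of byte-wide EPROMs on a 16-bit bus. Not the program from the post.

def hex_record(address, rectype, payload):
    """Build one Intel HEX record: ':' + length + address + type + data + checksum."""
    body = bytes([len(payload), (address >> 8) & 0xFF, address & 0xFF, rectype]) + payload
    checksum = (-sum(body)) & 0xFF  # two's complement of the byte sum
    return ":" + (body + bytes([checksum])).hex().upper()

def to_intel_hex(data, chunk=16):
    """Return Intel HEX lines for data starting at EPROM address 0."""
    if len(data) > 0x10000:
        raise ValueError("image too large for 16-bit record addresses")
    lines = [hex_record(off, 0x00, data[off:off + chunk])
             for off in range(0, len(data), chunk)]
    lines.append(hex_record(0, 0x01, b""))  # end-of-file record
    return lines

def split_for_16bit_rom(image):
    """Return (high_lines, low_lines): even bytes are high, odd bytes are low."""
    if len(image) % 2:
        raise ValueError("image length must be even")
    return to_intel_hex(image[0::2]), to_intel_hex(image[1::2])

if __name__ == "__main__":
    # Two 68000 opcodes (NOP, RTS) as a tiny stand-in for a ROM image.
    high, low = split_for_16bit_rom(bytes([0x4E, 0x71, 0x4E, 0x75]))
    print("\n".join(high))
    print("\n".join(low))
```

A fuller converter along the lines described in the post would read the .text and .data sections from the object file and place each at its ROM address before splitting; that step is left out here because the object file layout is not given in the thread.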