[comp.sys.3b1] 3b1 memory problems

mdapoz@hybrid.uucp (Mark Dapoz) (02/05/91)

Well, I knew it was bound to happen sooner or later; unfortunately it was
sooner than I had hoped.  It seems my trusty ol' 3b1 has packed it in; it
now just sits there with the number 1 & 3 LEDs on when I turn it on.
According to the hardware tech ref, this indicates that test 5, the dynamic
RAM test, is failing.  Great, one, if not more, of my 72 RAM chips has died.
I already have 2 meg on the motherboard so I don't want to go ripping them
all out just to find out which one is bad.  Does anyone have any idea how
I can go about narrowing down exactly which ram chip is bad?  I can't even
boot a diag disk so none of those fancy tests will do me any good.  I
remember some talk recently about making up some new diag roms to help
track problems like this down.  Did anyone ever get this working?

It's really bizarre how the machine just gave out for no apparent reason.
There I was, working on my Sun while the 3b1 was madly unpacking a few meg
of news, and then...... silence.  I figured the kernel just hung for some
bizarre reason so I reset the machine.  It cleared the screen as usual and
then it went to alternating black and white lines.  Uh oh.  I quickly reset
it again... same thing.  Powered it down completely... same thing.  Luckily
I have a second 3b1 for times just like this.  I even added the second drive
expansion socket to it just in case I had to swap boards.  After opening up
both machines and swapping boards I now have my main machine back up and
running but the spare one is now dead.  Any suggestions on how to fix this
dead board would be greatly appreciated.
-- 
Managing a software development team 	|   Mark Dapoz  
is a lot like being on the psychiatric	|   mdapoz%hybrid@cs.toronto.edu
ward.  -Mitch Kapor, San Jose Mercury	|   mdapoz@torvm3.iinus1.ibm.com

dt@yenta.alb.nm.us (David B. Thomas) (02/06/91)

mdapoz@hybrid.uucp (Mark Dapoz) writes:

> Does anyone have any idea how
> I can go about narrowing down exactly which ram chip is bad?

At the BOF, I remember Craig Votava mentioning that someone had figured out
a way to jumper things on the motherboard so you can trick the hardware
into thinking that any 512k bank you like is the ONLY 512k in the system.

That would sure help!  Anybody know how to do that?

					little david
-- 
Computer interfaces and user interfaces are as different as night and 1.

rmfowler@texrex.uucp (Rex Fowler) (02/07/91)

In article <1991Feb6.025147.22371@yenta.alb.nm.us> dt@yenta.alb.nm.us (David B. Thomas) writes:
>mdapoz@hybrid.uucp (Mark Dapoz) writes:
>
>> Does anyone have any idea how
>> I can go about narrowing down exactly which ram chip is bad?
>
>At the BOF, I remember Craig Votava mentioning that someone had figured out
>a way to jumper things on the motherboard so you can trick the hardware
>into thinking that any 512k bank you like is the ONLY 512k in the system.
>

The someone was Peter Fales <psfales@ihlpb.att.com>.  I sent mail to
him requesting his instructions but have received no response.  If
anyone has these instructions, please email me a copy.

-- 
Rex Fowler <rmfowler%texrex@cirr.com>
UUCP:  egsner!texrex!rmfowler

njc@rick.att.com (Neil Cherry) (02/07/91)

I just tried to compile the CDRAW program from OSU, but found that I didn't
have the C Bindings installed. After searching around I found the manual but
no DISK! Anybody out there got a copy? I may be able to find it at the hotline
but I doubt it since 3B1 users are fewer and fewer these days.

NJC

botton@i88.isc.com (Brian D. Botton) (02/12/91)

In article <1991Feb5.070902.1260@hybrid.UUCP> mdapoz@hybrid.uucp (Mark Dapoz) writes:
>Well, I knew it was bound to happen sooner or later; unfortunately it was
>sooner than I had hoped.  It seems my trusty ol' 3b1 has packed it in; it
>now just sits there with the number 1 & 3 LEDs on when I turn it on.
>According to the hardware tech ref, this indicates that test 5, the dynamic
>RAM test, is failing.  Great, one, if not more, of my 72 RAM chips has died.
>I already have 2 meg on the motherboard so I don't want to go ripping them
>all out just to find out which one is bad.  Does anyone have any idea how
>I can go about narrowing down exactly which ram chip is bad?  I can't even
>boot a diag disk so none of those fancy tests will do me any good.  I
>remember some talk recently about making up some new diag roms to help
>track problems like this down.  Did anyone ever get this working?

  I was the one that brought that subject up and yes, I have some results
to report.  I too had a memory problem, however it wasn't a hard failure
like yours.  My machine was able to boot up UNIX or the diag disk and then
it would crap out.  The problem was with low memory and the diag disk
doesn't check low memory.  I did several things to solve my problem, which
I will try to enumerate in proper order.

  I also have a second machine, from work, so I had a good system to work
on.  I loaded the devrom device driver (John Milton) and read the ROM
object into a file.  I then tried to use dis to disassemble the object,
but ran into trouble because the object module wasn't a COFF file, of
course.  So, being of stout heart and maybe a brick or two short, I spent
the next week disassembling the object module by hand.  On the surface
you might think that was a stupid thing to do, however I think it worked
rather well.  I relearned 68000 assembler, the silly AT&T syntax, and
became very familiar with how the ROM was put together.
  During this process, I figured out how to use the -n option of ld(1) to
build an object module linked at any physical address I desired.  I also
wrote a quick and dirty program that would take this object module and
create two Intel HEX files, one for high, the other for low, bytes.
Using these somewhat primitive tools, I was able to assemble and link
my own version of the boot ROM.
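
A minimal sketch of the high/low byte split described above, written in
Python for illustration rather than taken from the original tool.  It
assumes the ROM image is a flat binary, that even offsets hold the high
byte of each 16-bit word (the 68000 is big-endian), and that each half
fits in 64K so no extended-address records are needed.  The function names
are hypothetical.

```python
def hex_record(address, rectype, data=b""):
    """Format one Intel HEX record (count, address, type, data, checksum)."""
    body = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, rectype]) + data
    checksum = (-sum(body)) & 0xFF  # two's complement of the byte sum
    return ":" + (body + bytes([checksum])).hex().upper()


def to_intel_hex(data, chunk=16):
    """Emit data records starting at address 0, then the end-of-file record."""
    if len(data) > 0x10000:
        raise ValueError("sketch only handles 16-bit record addresses")
    lines = [hex_record(off, 0x00, data[off:off + chunk])
             for off in range(0, len(data), chunk)]
    lines.append(hex_record(0, 0x01))
    return lines


def split_rom_image(image):
    """Split a 16-bit-wide image into (high, low) byte streams."""
    if len(image) % 2:
        raise ValueError("image length must be even")
    return image[0::2], image[1::2]


# Three 68000 words: NOP (0x4E71), then STOP #0x2700 (0x4E72 0x2700).
image = bytes([0x4E, 0x71, 0x4E, 0x72, 0x27, 0x00])
high, low = split_rom_image(image)
high_hex = to_intel_hex(high)
low_hex = to_intel_hex(low)
```

Each list of records would then be written to its own file, one per EPROM.
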
  After making sure I could make an exact binary copy of the boot ROM, I
started the job of commenting the code.  I commented the assembly through
all the steps of initializing the hardware after a reset and then to
the main loop that causes the squares to be drawn on the screen.  It was
very obvious that code from the main loop on was the output of a C compiler.
Things got a little tougher here because the 3.51 compiler puts out similar,
but sometimes drastically different code.  Anyway, during this time I was
sending regular updates to Craig Votava, and he suggested I take a look at
the source for the extended diagnostic disk (don't ask, I'm not at liberty
to share, ;-().  What a stroke of luck, because the disk manipulation
routines were easy to match against my ROM disassembly!  Anyway, by this time,
I could make my own ROM and manipulate the MMU code, which is critical to
my next step.
  Having the source for the diagnostics, I had already tried linking it
so that it ran in high memory.  Unfortunately that didn't work, because
the ROM only maps the bottom .5Meg of RAM, probably because this is the
minimum RAM size possible.  When the kernel boots it sets up the rest of
the memory pages.  Anyway, when the loader placed the diagnostics at
high memory, it was placed at whatever page mappings happened to be in
the MMU after the page map test, which wasn't good.
  The solution has two parts, first, because I knew low memory was bad, and
I have a 1.5Meg combo board, I modified the initialization code so that
the MMU was no longer unity mapped.  I placed page 0, not at physical
address 0, but at physical address 0x200000.  Page 1 after page 0, and so
on.  What this did for me is that now when the loader was read off the
floppy, it was placed into known good memory.
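
The remapping can be pictured with a toy page map.  This is only a sketch:
the 4K page size is an assumption, the real page map entries also carry
status bits, and the helper names are hypothetical.  The 0x200000 base is
the one given above, presumably where the combo board RAM begins.

```python
PAGE_SIZE = 0x1000      # assumed 4K pages; check the hardware reference
REMAP_BASE = 0x200000   # physical base used for page 0 in the post


def build_page_map(num_pages, phys_base):
    """Entry n is the physical address that backs virtual page n."""
    return [phys_base + n * PAGE_SIZE for n in range(num_pages)]


def translate(page_map, vaddr):
    """Translate a virtual address through the page map."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    return page_map[page] + offset


# 128 pages of 4K cover the bottom .5Meg that the ROM maps.
unity = build_page_map(128, 0x000000)       # unity (identity) mapping
remapped = build_page_map(128, REMAP_BASE)  # same virtual range, high RAM
```

With the second map, anything the loader writes at virtual address 0x1234
lands at physical 0x201234, away from the suspect low memory.
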
  Second, I fixed all the places in the diagnostic source where the MMU
page tables are set up.  This happens anytime there is going to be some
kind of RAM test.  I modified this code so it continued to map high physical
memory as low virtual memory.  I also mapped low physical memory into high
virtual memory.  Once I did this, I booted the floppy and ran the EXTENDED
MEMORY TEST!
  All of this took a couple of days' work over New Year's vacation.  After
running my new diagnostic for a few hours, the memory address test started
to fail.  This started to be reproducible every few minutes, which is
very good news.  Address bit A1 was bad, which means one of the address
multiplexor chips.  I took out my trusty O'scope and started poking around,
trying to see a bad signal.  While I was doing this, the stupid thing
started to fail even the simple power on diagnostics the boot ROM runs,
giving me the dashed lines on the screen, sound familiar Mark?!
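
For readers unfamiliar with address tests, here is a sketch of one common
approach, run against a simulated RAM.  It is not the algorithm used by the
diagnostic disk, which is not shown in this thread, and it models a hard
fault (an address line stuck low) rather than the intermittent failure
described above.  All names are hypothetical.

```python
class SimulatedRam:
    """Toy RAM model; optionally one address bit is stuck low."""

    def __init__(self, size, stuck_low_bit=None):
        self.cells = [0] * size
        # A stuck-low bit is simply masked out of every address.
        self.mask = ~(1 << stuck_low_bit) if stuck_low_bit is not None else ~0

    def write(self, addr, value):
        self.cells[addr & self.mask] = value

    def read(self, addr):
        return self.cells[addr & self.mask]


def find_bad_address_bits(ram, address_bits):
    """Walk a single 1 across the address lines; report bits aliasing to 0."""
    bad = []
    for bit in range(address_bits):
        ram.write(0, 0x55)
        ram.write(1 << bit, 0xAA)
        if ram.read(0) != 0x55:  # the second write landed on address 0
            bad.append(bit)
    return bad


good = SimulatedRam(256)
faulty = SimulatedRam(256, stuck_low_bit=1)
```

In this model a dead A1 shows up as address 2 aliasing onto address 0,
since address 2 is the only address with just A1 set.
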
  So now I couldn't even boot my special diagnostic disk, ;-(.  I did the
only thing a reasonable person could do; I said S@#$ and a few other choice
words and went to bed.  The next day the machine was still dead in the water.
The only option now was to do diagnostics in ROM.  Normally the ROM's RAM
diagnostic draws the lines on the screen and then executes a stop
instruction.  Instead, I decided to display the address, the data read,
and the data it should have been, all as bit patterns on the screen, with
1s shown as a square and 0s as a space.  To help keep track of which bit
was which, I first drew 24 squares for the address, and then 2 sets of 16
squares for the data.  So when the memory test was run, it drew the 24+16+16
squares on the screen, with a few blank spaces in between.  When it found
an error, instead of stopping, it drew out the address, data read, and the
data it should have been.  Finally, to help troubleshoot, it went into an
infinite loop reading and writing the bad location.
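
The display layout can be mocked up as text.  This is a sketch only: "#"
stands in for a square, the most-significant-bit-first order and the gap
width are assumptions, and the function names are hypothetical.

```python
def bits_to_squares(value, width, one="#", zero=" "):
    """Render value as width characters, most significant bit first."""
    return "".join(one if (value >> i) & 1 else zero
                   for i in range(width - 1, -1, -1))


def error_line(address, read, expected):
    """24 address bits, 16 bits read, 16 bits expected, with gaps between."""
    return "  ".join([bits_to_squares(address, 24),
                      bits_to_squares(read, 16),
                      bits_to_squares(expected, 16)])


def failing_bits(read, expected):
    """Data bit positions where the value read differs from the expected one."""
    diff = read ^ expected
    return [i for i in range(16) if (diff >> i) & 1]


# Reference row of 24+16+16 squares, then a made-up error at address 2.
reference = error_line(0xFFFFFF, 0xFFFF, 0xFFFF)
example = error_line(0x000002, 0xFFFB, 0xFFFF)
```

Comparing the last two fields by eye is the same as the XOR in
failing_bits(): any column where one has a square and the other a space is
a suspect data bit.
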
  With yet another ROM in hand, I was getting consistent failures at physical
address 2, which points to a particular 512K bank of RAM and 2 particular
address multiplexor chips.  What I really needed now was a logic analyzer
because the old O'scope just can't capture those once in a thousand glitches,
which happen several times a second.  So the only alternative was to trace
the address select logic from the 68010 all the way to the RAM chips.  After
spending about 4 hours doing this I discovered a BAD SOCKET!!!!!!
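
The arithmetic behind "a particular 512K bank" can be sketched as below.
The bank size, the 2 meg of motherboard RAM, and the 72-chip count come
from this thread; the numbering of banks in address order is an assumption,
and which physical row of chips corresponds to which bank has to come from
the schematics.

```python
BANK_SIZE = 512 * 1024           # bank size mentioned in the thread
MOTHERBOARD_RAM = 2 * 1024 * 1024
TOTAL_CHIPS = 72                 # chip count from the original question


def bank_of(phys_addr):
    """0-based index, in address order, of the 512K bank holding phys_addr."""
    return phys_addr // BANK_SIZE


banks = MOTHERBOARD_RAM // BANK_SIZE      # number of 512K banks
chips_per_bank = TOTAL_CHIPS // banks     # chips that make up one bank
```

That works out to 4 banks of 18 chips each, which would be consistent with
16 data bits plus 2 parity bits per bank, though that split is an inference
and not something stated in this thread.  Physical address 2 falls in the
lowest bank.
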
  When I upgraded from 1Meg to 2Meg on the motherboard I bought cheap sockets.
Three of us went in together and did our motherboards at the same time; the
other two haven't had any socket problems.  Anyway, one of the multiplexor
chip sockets had a bad ground pin with anywhere from 50 to 50K ohms of
resistance,
depending on how you stressed the socket!  I continued my quest and discovered
a couple of the memory chips had > 5 ohms resistance at the +5V pins, and
another one of the multiplexor chips had a ground line with > 5 ohms.  These
problems, I suspect, were caused by too much heat when removing solder from
the holes.  A few lengths of wire-wrap wire fixed these problems.
  A week later, after an order of good quality machined sockets arrived and
were installed, I booted up my special diagnostics disk.  It ran for about
18 hours and one of those #$%&* multiplexor chips went bad!  Replaced it
and ran diagnostics for ~5.5 days.  Put on the hard disk out of the borrowed
machine (mine had been transplanted into that machine long ago) and booted
UNIX for the first time since April 1990.  It ran for ~7 days without a
glitch.  Returned my hard disk to my machine and it's been up ever since,
about 3 weeks worth.

  So, what did I learn?

1.	DON'T EVEN THINK ABOUT USING CHEAP SOCKETS.  They just aren't
	worth the aggravation.

2.	Even though you may be a good solderer (I repaired circuit
	boards when I was in the Air Force), it isn't that hard to
	damage a multilayer PC board.  Ground and power leads are
	especially hard to work with because they sink a lot of heat.
	A better solder sucker would have helped.

3.	Had a great, although painful, time disassembling the ROM.

  I don't feel comfortable giving out the ROM code.  If I had done it
completely on my own I might consider it, but I did have help.  What I
am planning on doing is converting the initial assembler into C code and
fixing up the main loop, these I'll post.  The disk routines I want to
update with the 3.51 loader because the loader supports the P5.1 disks.
I also want to include the disk writing routines, the ROM has only the
disk reading routines.  These I'll have to put in a library and distribute
in object form.  Since the IHV (Independent Hardware Vendor) diagnostic
disk is public, I want to modify the code so status messages are in ASCII
instead of "marching squares."  With proper disk, screen and keyboard
support, I think some very interesting things can be done.
  BTW, I did look at the disk routines in the IHV kit and unless I missed
something, they are extremely limited.

  By the time I finished converting assembler into C, I had written a new
object to Intel HEX conversion program.  This time I programmed in the
code to read the .text, .data and the .bss headers.  The .data and .text
sections are placed at the proper address in ROM and the .bss section is
ignored because it is in RAM.  The current boot ROM doesn't clear the .bss
section like it should, but it would be relatively easy to implement as
part of the startup code.
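
A sketch of the section handling described above, leaving out the parsing
of the object file headers.  Sections are given here as hypothetical
(name, address, bytes) tuples, and the ROM base address and the .bss
address in the example are placeholders, not values from the post.

```python
def build_rom_image(sections, rom_base, rom_size, fill=0xFF):
    """Copy .text and .data into a ROM-sized buffer; .bss lives in RAM."""
    image = bytearray([fill]) * rom_size
    for name, addr, payload in sections:
        if name == ".bss":
            continue  # nothing to burn; startup code should clear it
        start = addr - rom_base
        if start < 0 or start + len(payload) > rom_size:
            raise ValueError(name + " does not fit in the ROM")
        image[start:start + len(payload)] = payload
    return bytes(image)


def clear_bss(ram, bss_start, bss_size):
    """What the startup code would do: zero the .bss range in RAM."""
    ram[bss_start:bss_start + bss_size] = bytes(bss_size)


ROM_BASE = 0x800000  # placeholder link address for this example
sections = [
    (".text", ROM_BASE, b"\x4e\x71\x4e\x75"),   # NOP; RTS
    (".data", ROM_BASE + 4, b"\x12\x34"),
    (".bss", 0x070000, b""),                    # placeholder RAM address
]
image = build_rom_image(sections, ROM_BASE, 8)

ram = bytearray(b"\xaa" * 16)   # stand-in for uninitialized RAM
clear_bss(ram, 4, 8)
```

The resulting image is what would then be split into high and low bytes
and written out as Intel HEX.
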

  I know there are people who are having problems with their memory and might
want to use my special ROMs and diagnostic disk.  I am open to helping them
out, but I don't really want to get too bogged down supporting something that
I want to put in a full-featured ROM.  If you are really having problems and
could benefit from my stuff, send E-mail and we'll work something out.
Otherwise, I think it would be a good idea to wait until I can get something
more useful out to the net.
  I am taking ideas for the "enhanced ROM", and I have had a volunteer to
help me out.  That's right David K., I haven't forgotten about you!
So please keep sending ideas, hopefully in the next few weeks I'll post
a summary.  For example, I like two suggestions I've received: put the
loader and the diagnostics in ROM.  I also want to put a monitor
in there.

  BTW, the first square is always drawn on the screen.  Then the floppy and
then hard disks are checked to see if they are ready to be read.  If neither
of them is, another square is drawn and the process repeats.
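
The loop just described can be sketched as follows.  The callbacks stand in
for the real hardware checks and are hypothetical, and max_tries exists
only so the sketch terminates; nothing in the post says whether the ROM
ever gives up.

```python
def boot_poll(floppy_ready, hard_disk_ready, draw_square, max_tries):
    """Draw a square, check floppy then hard disk; repeat until one is ready."""
    for _ in range(max_tries):
        draw_square()
        if floppy_ready():
            return "floppy"
        if hard_disk_ready():
            return "hard disk"
    return None


# Example: no floppy inserted, hard disk becomes ready on the third check.
squares = []
hard_disk_polls = iter([False, False, True])
device = boot_poll(lambda: False,
                   lambda: next(hard_disk_polls),
                   lambda: squares.append("#"),
                   max_tries=10)
```

Counting the squares on the screen therefore tells you how many times the
ROM has gone around waiting for a drive.
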

>
>It's really bizarre how the machine just gave out for no apparent reason.

  Mine did the same thing, except it limped along for a month before I
declared it brain damaged.

  If you need some help let me know.
--
     ...     ___	     ***
   _][_n_n___i_i ________  *******		Brian D. Botton
  (____________I_I______I_I_______I		laidbak!botton  or
  /ooOOOO OOOOoo  oo oooo  oo   oo		laidbak!bilbo!brian