[comp.sys.sun] Various Notes

rgreene@bnr.ca (Robert D. Greene) (04/18/91)
Hi everyone. Sorry for the big delays over the last week. We had a couple of 
interesting problems that kept me pretty busy (not that I'm not pretty 
busy in general anyway).

The first experience was that our Sun 3/60, which I use to do all the
Sunspots processing, suffered a very bad hard disk crash (lots of bad
blocks everywhere that couldn't be read). The 3/60 would not even boot
up 'cause fsck couldn't repair the disks. So, a coworker and I
used the tried and true "Bob Greene" method of repairing disks on the fly.
I've done this a couple of times, and really, I'm just curious how 
the rest of the world deals with this - i.e., am I setting myself up for
a fall sometime in the far future? (and please, no "If you just kept 
backups you wouldn't have to do this" messages - I know, I know. :) :)). 

Here's the problem: when fsck'ing the disk, it comes up with a variety of
errors over several blocks (usually hard data errors and other good stuff).
Fsck is unable to repair it and eventually dies. So, the way I repair
this (this is pretty destructive, but the idea is to destroy as little of
the disk data as possible) is to boot miniroot off the tape and then use
format to analyze the disk. I run this in read mode on the entire disk
to get a list of the bad blocks. If I'm lucky (which I wasn't) the
format repairs the blocks. Once that fails, I gradually escalate up
the list of analyze options (setting up my format options to only work
on one bad block at a time) until I get to purge. The purge has the same
effects as reformatting the entire bad block, and after a purge, 
I am able to successfully issue a read on that block. Interestingly
enough, purging some blocks allows read to fix other blocks. Anyway,
so I purge however many blocks I need to and then exit format and 
run fsck on the newly repaired disk. Of course, fsck finds lots of 
missing inodes and other fun things, so I just nuke them, noting the
filenames that it says it is destroying. After this, I can reboot
the machine (fsck says disk is clean) and then restore the missing
files. I've had to do this a total of three times to date. So, the 
question is: just how bad is this and what's the "better answer"?
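
For anyone who wants the escalation spelled out, here is a minimal Python
sketch of the loop described above. It only simulates the procedure against
a fake in-memory disk: the FakeDisk class, the repair_block and repair_disk
helpers, the block numbers, and the exact ordering of the analyze passes are
all assumptions made for illustration, not format's actual interface.

```python
# Hypothetical simulation of the "escalate until purge" repair loop.
# Nothing here touches a real disk; names and pass ordering are assumed.

# Analyze passes ordered from least to most destructive (assumed order).
PASSES = ["read", "refresh", "test", "write", "compare", "purge"]


class FakeDisk:
    """Toy disk: each bad block maps to the first pass able to fix it."""

    def __init__(self, bad_blocks):
        self.bad_blocks = dict(bad_blocks)

    def analyze(self, block, mode):
        """Run one pass on one block; return True if the block now reads."""
        needed = self.bad_blocks.get(block)
        if needed is None:
            return True  # block was never bad, or is already repaired
        if PASSES.index(mode) >= PASSES.index(needed):
            del self.bad_blocks[block]
            return True
        return False


def repair_block(disk, block):
    """Try each pass in turn on a single block; return the pass that worked."""
    for mode in PASSES:
        if disk.analyze(block, mode):
            return mode
    return None  # even the last pass failed; the block stays bad


def repair_disk(disk, blocks):
    """Repair blocks one at a time, recording which pass fixed each."""
    return {block: repair_block(disk, block) for block in blocks}


disk = FakeDisk({1204: "read", 1377: "purge", 20981: "write"})
result = repair_disk(disk, [1204, 1377, 20981])
for block, mode in result.items():
    print(f"block {block}: fixed by {mode}")
```

Working one block at a time and stopping at the first pass that succeeds is
what keeps the damage down: a block that a plain read can fix never sees a
purge.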

Okay, the next link in the story of trials and tribulations. My own
personal Sparcstation 1+ finally bought the farm yesterday morning. About
2 months ago, we lost air conditioning in our building and during that
day I experienced about 5 panics (it wasn't THAT hot - about 70F or so).
Since that time, the Sparcstation has been sporadically panicking with
"synctodr: failed to read clock chip" errors and SCSI select timeouts.
During the last week when I was away from my desk, the Sparcstation
supposedly repeatedly panicked with SCSI select errors. Yesterday morning,
it finally gave out totally: first, it wouldn't even power on with the 
CD-ROM attached and powered on on the SCSI bus. I disconnected all 
SCSI devices and tried rebooting. At that point it came up with an
Ethernet address of "2:2:2:2:2:2" (very bogus). The self tests completed
without any indication of system failure. Over a period of several minutes,
I tried to reboot from sd() several times - the responses were either 
"SCSI Select failed" or "Memory alignment error". A diag "scsi-probe"
sometimes showed the SCSI disk and sometimes did not. In any event,
Sun Support wants to swap out the CPU board (this seems like the 
standard approach - I keep remembering those horror stories I've seen
of 26 CPU board swaps later still having the problem.. :)). It looks
a lot like a SCSI Bus failure to me, although the mangled Ethernet
address looks like the PROMs may be mucked up as well.

But.... luckily, I was not dead totally. Later the same day, I was
fortunate enough to receive a brand spankin' new Tatung Sparcstation 
clone for evaluation. This is the slow (12.5 MIPS I think) color desktop 
machine that they advertise at around $8000. Mine came with 8Megs of 
memory, 19" color monitor (looks like it might be a Sony Monitor - have
to check the documentation) and 207 Mbytes of disk space. I found actually
setting up the hardware was relatively easy - the setup is very similar
to a Sparcstation, and it readily accepted a variety of scavenged 
peripherals (Tape and CD-ROM) from my Sparcstation. The one bad point
was that the design of the case made it difficult to correctly seat
the AUI cable on the ethernet connector. Additionally, the slide type
fastener was noticeably inferior to others I have seen.

Their own version of SunOS (Sparc/OS) was already installed on the
internal drive; however, I opted to reinstall from tape to take advantage
of the "quick configuration" options. The quick configuration loaded
without any problems and during execution looked very similar to 
Sun's own install program. At one point during the installation, a lot
of garbage was written to the screen (it looks like they reused a 
string without clearing its old contents). Besides this, the installation
went very smoothly and had me up and running with NIS and NFS operational
within an hour. 

As installed, the Tatung had only Sunview (or should I say something that
looked very much like Sunview? I'm not sure how much of the system is
actually real Sun routines and how much is just stuff that looks like
their Sun counterparts) installed. I ftp'd direct from another Sparcstation
all of our local routines and X11R4. Note that I transferred only
the binaries and libraries - I did not recompile at all. 

So, I logged in as myself and tried to run X11R4. Total failure. The
X11R4 processes started up (i.e., the screen initialized and the mouse
appeared) but then the message "Unable to BIND Unix socket" appeared.
I thought about it awhile and took the easy way out (the wonders
of being in a relatively secure environment) - changing the 
permissions of /usr/bin/X11/Xsun and /usr/bin/X11/xinit to setuid root.
This fixed the problem. I need to investigate why this is needed,
though, since it isn't on our Sparcstations. Restarting X, I now
was able to get a Console window; however, all other processes were
dying - unable to write to /tmp. I checked and sure enough, I couldn't 
touch a file in /tmp. For some reason, /tmp looked like:

   7 drwxrwxr-T   3   root      6656  Apr 17 13:43 /tmp

So, I changed its permissions back to what they should be
and presto! everything was happy. I have been using the system since that
time for everything I normally do, and it's not that bad. The keyboard
has a different feel (it's *very* loud and clacky, especially for fast
typists), but all in all it's a pretty nice clone. Although it's officially
supposed to be slower (12.5 vs 15.8 MIPS), I haven't noticed the speed
difference. It's still much faster than a 3/60. :) 
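
The /tmp listing above is easier to read once the mode string is decoded.
Here is a small Python sketch of that. The describe helper is a name made up
for this example, and the sketch assumes the listing corresponds to octal
mode 1774 and that the conventional mode for /tmp is 1777.

```python
import stat


def describe(mode):
    """Return the ls-style string plus the two facts that matter for /tmp."""
    return {
        "listing": stat.filemode(mode),
        "sticky": bool(mode & stat.S_ISVTX),
        "world_writable": bool(mode & stat.S_IWOTH),
    }


broken = stat.S_IFDIR | 0o1774  # assumed to match "drwxrwxr-T"
usual = stat.S_IFDIR | 0o1777   # the conventional mode for /tmp

print(describe(broken))
print(describe(usual))
```

The capital T means the sticky bit is set while execute (search) permission
for others is missing. With write missing for others as well, ordinary users
cannot create files in the directory, which matches the failure described.
On a real system the fix would be something like "chmod 1777 /tmp" as root.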

Within a week or so, I should have an OPUS clone in as well (plus
hopefully my Sparcstation 1+ will be back on its feet) and then I'll
give you guys some side by side by side comparisons. 

Bob Greene            Sunspots (comp.sys.sun) Moderator     ESN 446-7396
LAN/WAN Engineering and Support                            (214) 907-7396
Bell Northern Research, Richardson, Texas, USA             rgreene@bnr.ca

PS: I'm always tempted to sign this 
    "Please email me direct as I don't read this group often."
    and see how many people take me seriously. :) :)