rgreene@bnr.ca (Robert D. Greene) (04/18/91)
Hi everyone. Sorry for the big delays over the last week. We had a couple of interesting problems that kept me pretty busy (not that I'm not pretty busy in general anyway).

The first was that the Sun 3/60 I use to do all the Sunspots processing suffered a very bad hard disk crash (lots of bad blocks everywhere that couldn't be read). The 3/60 would not even boot up 'cause fsck couldn't repair the disks. So a coworker and I used the tried and true "Bob Greene" method of repairing disks on the fly. I've done this a couple of times, and really, I'm just curious how the rest of the world deals with this - ie: am I setting myself up for a fall sometime in the far future? (And please, no "If you just kept backups you wouldn't have to do this" messages - I know, I know. :) :))

Here's the problem: when fsck'ing the disk, it comes up with a variety of errors over several blocks (usually hard data errors and other good stuff). Fsck is unable to repair them and eventually dies. So, the way I repair this (it's pretty destructive, but the idea is to destroy as little of the disk data as possible) is to boot miniroot off the tape and then use format to analyze the disk. I run analyze in read mode on the entire disk to get a list of the bad blocks. If I'm lucky (which I wasn't), format repairs the blocks. When that fails, I gradually escalate up the list of analyze options (setting up my format options to only work on one bad block at a time) until I get to purge. The purge has the same effect as reformatting the entire bad block, and after a purge, I am able to successfully issue a read on that block. Interestingly enough, purging some blocks allows read to fix other blocks.

Anyway, I purge however many blocks I need to, then exit format and run fsck on the newly repaired disk. Of course, fsck finds lots of missing inodes and other fun things, so I just nuke them, noting the filenames that it says it is destroying.
After this, I can reboot the machine (fsck says the disk is clean) and then restore the missing files. I've had to do this a total of three times to date. So, the question is: just how bad is this and what's the "better answer"?

Okay, the next link in the story of trials and tribulations. My own personal Sparcstation 1+ finally bought the farm yesterday morning. About 2 months ago, we lost air conditioning in our building and during that day I experienced about 5 panics (it wasn't THAT hot - about 70F or so). Since that time, the Sparcstation has been sporadically panic'ing with "synctodr: failed to read clock chip" errors and SCSI select timeouts. During the last week when I was away from my desk, the Sparcstation supposedly repeatedly panic'ed with SCSI select errors.

Yesterday morning, it finally gave out totally: first, it wouldn't even power on with the CD-ROM attached to the SCSI bus and powered on. I disconnected all SCSI devices and tried rebooting. At that point it came up with an Ethernet address of "2:2:2:2:2:2" (very bogus). The self tests completed without any indication of system failure. Over a period of several minutes, I tried to reboot from sd() several times - the responses were either "SCSI Select failed" or "Memory alignment error". A diag "scsi-probe" sometimes showed the SCSI disk and sometimes did not.

In any event, Sun Support wants to swap out the CPU board (this seems like the standard approach - I keep remembering those horror stories I've seen of people still having the problem 26 CPU board swaps later.. :)). It looks a lot like a SCSI bus failure to me, although the mangled Ethernet address looks like the PROMs may be mucked up as well.

But... luckily, I was not totally dead. Later the same day, I was fortunate enough to receive a brand spankin' new Tatung Sparcstation clone for evaluation. This is the slow (12.5 MIPS I think) color desktop machine that they advertise at around $8000.
Mine came with 8 Megs of memory, a 19" color monitor (looks like it might be a Sony monitor - have to check the documentation) and 207 Mbytes of disk space. I found actually setting up the hardware was relatively easy - the setup is very similar to a Sparcstation, and it readily accepted a variety of scavenged peripherals (tape and CD-ROM) from my Sparcstation. The one bad point was that the design of the case made it difficult to correctly seat the AUI cable on the Ethernet connector. Additionally, the slide type fastener was noticeably inferior to others I have seen.

Their own version of SunOS (Sparc/OS) was already installed on the internal drive; however, I opted to reinstall from tape to take advantage of the "quick configuration" options. The quick configuration loaded without any problems and during execution looked very similar to Sun's own install program. At one point during the installation, a lot of garbage was written to the screen (it looks like they reused a string without clearing its old contents). Besides this, the installation went very smoothly and had me up and running with NIS and NFS operational within an hour.

As installed, the Tatung had only Sunview (or should I say something that looked very much like Sunview? I'm not sure how much of the system is actually real Sun routines and how much is just stuff that looks like its Sun counterparts). I ftp'd all of our local routines and X11R4 directly from another Sparcstation. Note that I transferred only the binaries and libraries - I did not recompile at all.

So, I logged in as myself and tried to run X11R4. Total failure. The X11R4 processes started up (ie: the screen initialized and the mouse appeared) but then the message "Unable to BIND Unix socket" appeared. I thought about it awhile and took the easy way out (the wonders of being in a relatively secure environment) - changing the permissions of /usr/bin/X11/Xsun and /usr/bin/X11/xinit to setuid root. This fixed the problem.
I need to investigate why this is needed, though, since it isn't on our Sparcstations.

Restarting X, I now was able to get a Console window; however, all other processes were dying - unable to write to /tmp. I checked and sure enough, I couldn't touch a file in /tmp. For some reason, /tmp looked like:

    7 drwxrwxr-T  3 root  6656 Apr 17 13:43 /tmp

So, I changed its permissions back to what they should be and presto! everything was happy.

I have been using the system since that time for everything I normally do, and it's not that bad. The keyboard has a different feel (it's *very* loud and clacky, especially for fast typists), but all in all it's a pretty nice clone. Although it's officially supposed to be slower (12.5 vs 15.8 MIPS), I haven't noticed the speed difference. It's still much faster than a 3/60. :)

Within a week or so, I should have an OPUS clone in as well (plus hopefully my Sparcstation 1+ will be back on its feet) and then I'll give you guys some side by side by side comparisons.

Bob Greene
Sunspots (comp.sys.sun) Moderator    ESN 446-7396
LAN/WAN Engineering and Support      (214) 907-7396
Bell Northern Research, Richardson, Texas, USA    rgreene@bnr.ca

PS: I'm always tempted to sign this "Please email me direct as I don't read this group often." and see how many people take me seriously. :) :)
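
A note on the /tmp listing quoted in the post: the mode string drwxrwxr-T can be decoded by hand, and doing so suggests why processes running as an ordinary user could not write there (assuming that user was not in the directory's group). The sketch below is an illustration in Python, not something from the original post. parse_mode is a hypothetical helper written for this note, and 1777 (drwxrwxrwt) is assumed to be the "what they should be" value, since that is the conventional mode for /tmp.

```python
import stat


def parse_mode(listing: str) -> int:
    """Convert a 10-character ls-style mode string into permission bits.

    Hypothetical helper for illustration; the file-type character is ignored.
    """
    if len(listing) != 10:
        raise ValueError("expected a 10-character mode string like 'drwxrwxrwt'")
    perms = listing[1:]
    # (read, write, execute, special bit, special letter) for owner, group, other
    triads = [
        (stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR, stat.S_ISUID, "s"),
        (stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP, stat.S_ISGID, "s"),
        (stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH, stat.S_ISVTX, "t"),
    ]
    bits = 0
    for i, (r, w, x, special, letter) in enumerate(triads):
        rc, wc, xc = perms[3 * i:3 * i + 3]
        if rc == "r":
            bits |= r
        if wc == "w":
            bits |= w
        if xc in ("x", letter):
            # 'x' or a lowercase special letter means the execute bit is set
            bits |= x
        if xc in (letter, letter.upper()):
            # either case of the special letter means setuid/setgid/sticky is set
            bits |= special
    return bits


broken = parse_mode("drwxrwxr-T")   # the mode shown in the post
usual = parse_mode("drwxrwxrwt")    # assumed conventional mode for /tmp

print(oct(broken), bool(broken & stat.S_IWOTH))  # 0o1774 False
print(oct(usual), bool(usual & stat.S_IWOTH))    # 0o1777 True
print(stat.filemode(stat.S_IFDIR | usual))       # drwxrwxrwt
```

The capital T means the sticky bit is set while the "other" execute bit is not, and the "other" write bit is missing as well, which matches the symptom of not being able to touch a file in /tmp. The post does not say which command was used for the fix; chmod 1777 /tmp is the usual way to set that mode.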