dkf@helios.iec.ufl.edu (Dan FitzPatrick) (06/12/90)
SUMMARY OF PROBLEM:

Commands such as "df" and "w" fail with the following message, and some of them also dump core:

    machine% w
    Illegal instruction

"vmstat" locks the console. Other than these *minor* problems, the system is happily chugging away performing its file and mail server duties.

System: HCX-9 running HCX/UX 3.0C

SUMMARY OF CORRECTIVE STEPS TAKEN:

1) The disk drive with the root file system had experienced some corruption recently. It was first assumed that this might have corrupted some of the /dev entries. The drive was reformatted and reloaded with a known working version of / and /usr. The problem persisted.

2) Run the HCX System Level Tests (specifically sys401). The results of these diagnostics were:

The "sys401" program came up with 63 errors, 62 of which had the same "Illegal Instruction" message and no test diagnostic message, i.e., the test exited before that point. However, the "fpp3" test exited with a "data compare error" and identified the probable source of the failure as the FPP hardware. It was not able to distinguish between the Floating Sum (FS) and Floating Multiply (FM) boards.

The system was rebooted, paying careful attention to the console messages, and the following flashed by:

    FPP POC
    dsk(4,0,0,0)/fppoc
    ? CP FPP POC error 0004

So, I guess this kinda pinpoints some problems with either the FS or FM boards of the FPP hardware because they are not passing the power-on-confidence checks. However, the Console Processor Reference manual states that when this test fails, the CP assumes the FPP hardware does not exist (which implies that the FPP hardware is disabled). This might also imply that the only way to detect FPP hardware problems, other than running diagnostics, is by noting the above message on full boots or by sensing that the system is running a bit sluggish.

There being logical conflicts, proceed a bit further to step number...
4) Run the HCX CPU and Memory Standalone Diagnostics tests - actually all the tests in the "fall_s" script. The results here were similar:

The /fppoc test completed with an Error Code (on the control board) of 0x53, which implies an error with single precision floating point multiplies. (To avoid any interpretation or documentation error: the actual LED values top-to-bottom were 10100011, which indicates a bit order of 45673210 top-to-bottom.)

OK, so the FPP hardware at this point would be highly suspect. But some vague areas remain, so go one more step...

5) Physically remove the FPP hardware, and for added measure disable the FPP hardware with the "y100" Console Processor command. Rerun the HCX CPU and Memory Standalone Diagnostics tests, this time using the "all_s" script, which does not run any FPP hardware diagnostics.

Assumption: Removing the FPP hardware required no setting of jumpers, dip switches, or whatever. This was essentially verified with the HCX Processor System Installation Manual.

Well, this time all the tests passed with flying colors. Went to full boot the system and it came up successfully, but the problem STILL REMAINS.

QUESTIONS:

1) Is only physically removing the FPP hardware all that is required? i.e., the installation manual indicates no additional steps for the installation of these optional products, so removal should be just as easy, correct? I am assuming here that on a cold boot, the system actually tests for the presence of the hardware and enables it through the completion of a successful test.

2) If the FPP hardware is not suspect, then what would be causing the diagnostics to indicate that it was? I would (like to) assume that the standalone diagnostics tests that must be passed prior to those that test the FPP hardware would rule anything else like this out.

3) Where is the actual source of the message "Illegal Instruction"? I have run strings on the OS and did not find it there.
However, the System Level tests did identify it as a SIGILL signal.

If anyone has had similar experiences with this or other Tahoe machines, or has any advice, I would very much appreciate hearing from you.

Thanks in advance.

--Dan
-- 
Dan FitzPatrick    dkf@iec.ufl.edu
339 Larsen Hall, Integrated Electronics Center
University of Florida, Gainesville, FL 32611    (904) 392-8935
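The LED reading in step 4 can be checked by hand. The sketch below assumes, as the post states, that the LEDs read top-to-bottom carry bits 4, 5, 6, 7, 3, 2, 1, 0 of the error code; the names LED_BIT_ORDER and decode_leds are made up for illustration and do not come from any HCX documentation.

```python
# Bit number carried by each LED, listed top to bottom (taken from the post).
LED_BIT_ORDER = [4, 5, 6, 7, 3, 2, 1, 0]


def decode_leds(leds_top_to_bottom, bit_order=LED_BIT_ORDER):
    """Convert a string of LED states ('1' = lit) into an error code."""
    if len(leds_top_to_bottom) != len(bit_order):
        raise ValueError("expected one LED state per bit")
    value = 0
    for led, bit in zip(leds_top_to_bottom, bit_order):
        if led == "1":
            value |= 1 << bit  # place this LED at its bit position
    return value


print(hex(decode_leds("10100011")))  # 0x53
```

With that ordering the pattern 10100011 does come out as 0x53, so the reported LED values and the reported error code agree with each other.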
amos@taux01.nsc.com (Amos Shapir) (06/12/90)
In article <23515@uflorida.cis.ufl.EDU> dkf@helios.iec.ufl.edu (Dan FitzPatrick) writes:
> 1) Is only physically removing the FPP hardware all that is
>required? i.e., the installation manual indicates no additional steps
>for the installation of these optional products, so removal should
>be just as easy, correct?

Since not all Tahoes have an FPP, the system catches any attempt to execute an illegal instruction. If it is an FPP instruction, it is emulated; otherwise, a SIGILL signal is generated, which usually causes the process to die.

> 3) Where is the actual source of the message "Illegal Instruction"
>I have run strings on the OS and did not find it here. However, the
>System Level tests did identify it as a SIGILL signal.

This string is not in the OS but in the shell (your command interpreter, usually /bin/sh or /bin/csh). It receives the SIGILL indication through the system call 'wait' (see wait(2)) and prints the appropriate error message.

-- 
Amos Shapir    amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522408  TWX: 33691, fax: +972-52-558322  GEO: 34 48 E / 32 10 N
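The mechanism described above can be sketched on any current Unix-like system. This is only an illustration in Python, not the code of /bin/sh or /bin/csh: the parent forks a child, the child kills itself with SIGILL (standing in for a program that executed an illegal instruction), and the parent decodes the status returned by waitpid and builds the message, the way a shell would. The function names are hypothetical, and the exact message text comes from the platform's signal description table.

```python
import os
import signal


def describe_exit(status):
    """Turn a wait status into (signal number, shell-style message)."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        text = signal.strsignal(sig) or f"Signal {sig}"
        if os.WCOREDUMP(status):
            text += " (core dumped)"
        return sig, text
    return 0, ""  # child exited normally; a shell prints nothing


def run_child_that_dies_with_sigill():
    pid = os.fork()
    if pid == 0:
        # Child: restore the default action, then send ourselves SIGILL.
        signal.signal(signal.SIGILL, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGILL)
        os._exit(1)  # only reached if the signal was not delivered
    # Parent: collect the child's status, as a shell does with wait(2).
    _, status = os.waitpid(pid, 0)
    return describe_exit(status)


if __name__ == "__main__":
    sig, text = run_child_that_dies_with_sigill()
    print(text)
```

The point of the sketch is that the kernel only reports a signal number in the wait status; the human-readable text is produced by whichever program called wait, which is why running strings on the kernel image does not find it.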
turner@udecc.engr.udayton.edu (Bob Turner) (06/13/90)
In article <23515@uflorida.cis.ufl.EDU> dkf@helios.iec.ufl.edu (Dan FitzPatrick) writes:
>
>FPP POC
>dsk(4,0,0,0)/fppoc
>? CP FPP POC error 0004
>
>So, I guess this kinda pinpoints some problems with either the FS
>or FM boards of the FPP hardware because they are not passing the
>power-on-confidence checks. However, the Console Processor Reference
>manual states that when this test fails, the CP assumes the FPP
>hardware does not exist (implies that the FPP hardware is disabled).
>This might also imply that the only way to detect FPP hardware problems,
>other than running diagnostics, is by noting the above message on
>full boots or by sensing that the system was running a bit sluggish.
>

Congratulations, you are the proud owner of a dead FPP. We have had an HCX-9 for about the last 3 years and cooked about 4 FPPs. We are real happy with it otherwise, though.

>
>There being logical conflicts, proceed a bit further to step number...
>
> 1) Is only physically removing the FPP hardware all that is
>required? i.e., the installation manual indicates no additional steps
>for the installation of these optional products, so removal should
>be just as easy, correct? I am assuming here that on a cold boot,
>the system actually tests for the presence of the hardware and enables
>it through the completion of a successful test.
>

Yep. That's all, folks. We have the process down to a drill for the most part. The only interesting thing is that we remove the floating point emulation routines (option FPE) from the kernel for normal operation. Why do I do that? I like small kernels that don't chew up core. But it's awfully hard to do floating point without either the FPP or the emulation routines, so I have to cut in the backup kernel that has the routines in it. I keep two unixes in the root partition, so if the FPP craps out, we pull the FPP cards, call the FE, boot, select the FPE kernel, and reboot.
> 2) If the FPP hardware is not suspect, then what would be
>causing the diagnostics to indicate that it was? I would (like to)
>assume that the standalone diagnostics tests that must be passed
>prior to those that test the FPP hardware would rule anything else
>like this out.
>

It is a known bug that the FPP is prone to dying. If you can, get your FE to install the boards with plastic carriers. Believe it or not, there was a heat conduction problem with the ceramic chips. (I would have guessed the other way around.)

If you need more info, call me......

Bob
-- 
====================================================================
Bob Turner    Network Manager, School of Engineering    513-229-3171
turner@udecc.engr.udayton.edu
Univ. of Dayton, Engineering Computing Center-KL211, Dayton OH 45469