[comp.sys.tahoe] HCX-9: "df" dumps core w/ "Illegal instruction"

dkf@helios.iec.ufl.edu (Dan FitzPatrick) (06/12/90)

SUMMARY OF PROBLEM:

Commands such as "df" and "w" fail with the following message, and
some of them also dump core:

         machine% w
         Illegal instruction

"vmstat" locks the console.

Other than these *minor* problems, the system is happily chugging
away performing its file and mail server duties.

System:  HCX-9 running HCX/UX 3.0C

SUMMARY OF CORRECTIVE STEPS TAKEN:

	1)  The disk drive with the root file system had experienced
some corruption recently.  It was first assumed that this might
have corrupted some of the /dev entries.  The drive was re-formatted
and reloaded with a known working version of / and /usr.  The problem
persisted.

	2)  Run the HCX System Level Tests (specifically sys401).  The
results of these diagnostics were:

The "sys401" program came up with 63 errors, 62 of which had the
same "Illegal Instruction" message and no test diagnostic message,
i.e., the test exited before that point.  However, the "fpp3" test
exited with a "data compare error" and identified the probable
source of the failure as the FPP hardware.  It was not able to
distinguish between the Floating Sum (FS) and Floating Multiply (FM)
boards.

The system was rebooted, paying careful attention to the console
messages and the following flashed by:

FPP POC
dsk(4,0,0,0)/fppoc
? CP FPP POC error 0004

So, I guess this kinda pinpoints some problems with either the FS
or FM boards of the FPP hardware because they are not passing the
power-on-confidence checks.  However, the Console Processor Reference
manual states that when this test fails, the CP assumes the FPP
hardware does not exist (implies that the FPP hardware is disabled).
This might also imply that the only way to detect FPP hardware problems,
other than running diagnostics, is by noting the above message on
full boots or by sensing that the system was running a bit sluggish.

There being logical conflicts, proceed a bit further to step number...

	4)  Run the HCX CPU and Memory Standalone Diagnostics tests -
actually all the tests in the "fall_s" script.  The results here 
were similar:

The /fppoc test completed with an Error Code (on the control board)
of 0x53, which implies an error with single precision floating point
multiplies.  To avoid any interpretation or documentation error: the
actual LED values top-to-bottom were 10100011, which indicates a bit
order of 45673210 top-to-bottom.
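
The LED reading and the 0x53 error code quoted above can be checked
against each other with a few lines of Python.  This is only a sketch:
the function name decode_leds is made up for illustration, and the bit
order is the one stated in the post, not something taken from the HCX
documentation.

```python
def decode_leds(pattern: str, bit_order: str = "45673210") -> int:
    """Turn a top-to-bottom LED reading into a byte value.

    pattern   -- LED states read top-to-bottom, '1' = lit, '0' = dark
    bit_order -- which bit number each LED position stands for,
                 also read top-to-bottom (as given in the post)
    """
    if len(pattern) != len(bit_order):
        raise ValueError("pattern and bit_order must have the same length")
    value = 0
    for led, bit in zip(pattern, bit_order):
        if led == "1":
            # Set the bit that this LED position represents.
            value |= 1 << int(bit)
    return value


print(hex(decode_leds("10100011")))  # prints 0x53
```

With that bit order the lit LEDs are bits 6, 4, 1 and 0, i.e.
64 + 16 + 2 + 1 = 83 = 0x53, which matches the reported error code.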

OK, so the FPP hardware at this point would be highly suspect.  But
some vague areas remain, so go one more step...

	5)  Physically remove the FPP hardware, and for added measure
disable the FPP hardware with the "y100" Console Processor command.
Rerun the HCX CPU and Memory Standalone Diagnostics tests, this time
using the "all_s" script which does not run any FPP hardware
diagnostics.

Assumption:  Removing the FPP hardware required no setting of jumpers,
dip switches, or whatever.  This was essentially verified with the HCX 
Processor System Installation Manual.

Well, this time all the tests passed with flying colors.  I then did
a full boot of the system and it came up successfully, but the problem
STILL REMAINS.


QUESTIONS:

	1)  Is only physically removing the FPP hardware all that is 
required?  i.e., the installation manual indicates no additional steps
for the installation of these optional products, so removal should
be just as easy, correct?  I am assuming here that on a cold boot,
the system actually tests for the presence of the hardware and enables
it through the completion of a successful test.

	2)  If the FPP hardware is not suspect, then what would be 
causing the diagnostics to indicate that it was?  I would (like to)
assume that the standalone diagnostics tests that must be passed 
prior to those that test the FPP hardware would rule anything else 
like this out.

	3)  Where is the actual source of the message "Illegal Instruction"?
I have run strings on the OS and did not find it there.  However, the
System Level tests did identify it as a SIGILL signal.


If anyone has had similar experiences with this or other Tahoe machines,
or has any advice, I would very much appreciate hearing from you.

Thanks in advance.

--Dan


--
Dan FitzPatrick                                dkf@iec.ufl.edu
339 Larsen Hall, Integrated Electronics Center
University of Florida, Gainesville, FL  32611   (904) 392-8935

amos@taux01.nsc.com (Amos Shapir) (06/12/90)

In article <23515@uflorida.cis.ufl.EDU> dkf@helios.iec.ufl.edu (Dan FitzPatrick) writes:
>	1)  Is only physically removing the FPP hardware all that is 
>required?  i.e., the installation manual indicates no additional steps
>for the installation of these optional products, so removal should
>be just as easy, correct?

Since not all Tahoes have FPP, the system catches any attempt to execute
an illegal instruction, and if it is an FPP instruction it is emulated;
otherwise, a SIGILL signal is generated, which usually causes the process
to die.

>	3)  Where is the actual source of the message "Illegal Instruction"?
>I have run strings on the OS and did not find it there.  However, the
>System Level tests did identify it as a SIGILL signal.

This string is not in the OS but in the shell (your command interpreter,
usually /bin/sh or /bin/csh).  It receives the SIGILL indication through
the system call 'wait' (see man 2) and prints the appropriate error message.
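
The mechanism described above can be sketched in Python on a modern
POSIX system (not on HCX/UX itself): a parent forks a child, the child
dies of SIGILL, and the parent decodes the status returned by wait and
builds the message itself, much as a shell does.  The helper names
run_and_report and die_of_sigill are hypothetical, and the exact message
text comes from the local C library, so it may differ between systems.

```python
import os
import resource
import signal


def run_and_report(child_action):
    """Fork, run child_action in the child, and report how it ended.

    Returns (signal_number, message) if the child was killed by a
    signal, or (None, None) if it exited normally.
    """
    pid = os.fork()
    if pid == 0:
        # Child: ask for no core file, then run the action.
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
        child_action()
        os._exit(0)

    # Parent: this is the wait(2) step; the status word encodes
    # whether the child was terminated by a signal and which one.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        message = signal.strsignal(signum)
        if os.WCOREDUMP(status):
            message += " (core dumped)"
        return signum, message
    return None, None


def die_of_sigill():
    """Stand-in for a program that hits an illegal instruction."""
    signal.signal(signal.SIGILL, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGILL)


if __name__ == "__main__":
    signum, message = run_and_report(die_of_sigill)
    print(message)
```

The point of the sketch is that the text is produced by the parent
from the signal number, which is why running strings on the kernel
image would not turn it up.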

-- 
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522408  TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N

turner@udecc.engr.udayton.edu (Bob Turner) (06/13/90)

In article <23515@uflorida.cis.ufl.EDU> dkf@helios.iec.ufl.edu (Dan FitzPatrick) writes:
>
>FPP POC
>dsk(4,0,0,0)/fppoc
>? CP FPP POC error 0004
>
>So, I guess this kinda pinpoints some problems with either the FS
>or FM boards of the FPP hardware because they are not passing the
>power-on-confidence checks.  However, the Console Processor Reference
>manual states that when this test fails, the CP assumes the FPP
>hardware does not exist (implies that the FPP hardware is disabled).
>This might also imply that the only way to detect FPP hardware problems,
>other than running diagnostics, is by noting the above message on 
>full boots or by sensing that the system was running a bit sluggish.
>

Congratulations, you are the proud owner of a dead FPP. We have had an
HCX-9 for about the last 3 years and cooked about 4 FPPs. We are
real happy with it otherwise though.

>
>There being logical conflicts, proceed a bit further to step number...
>
>	1)  Is only physically removing the FPP hardware all that is 
>required?  i.e., the installation manual indicates no additional steps
>for the installation of these optional products, so removal should
>be just as easy, correct?  I am assuming here that on a cold boot,
>the system actually tests for the presence of the hardware and enables
>it through the completion of a successful test.
>
Yep. That's all, folks.

We have the process down to a drill for the most part. The only
interesting thing is we remove the floating point emulation routines
from the kernel for normal operation (option FPE). Why do I do that?
I like small kernels that don't chew up core. But it's awfully hard
to do floating point without either the FPP or the emulation routines,
so I have to cut in the backup kernel that has the routines in it.

I keep two unixes in the root partition, so if the FPP craps out we
pull the FPP cards, call the FE, boot and select the FPE kernel and
reboot.
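
As a toy model only, the trap handling described earlier in this thread
and the FPE kernel option can be put together as below.  Everything here
is an assumption made for illustration: the opcode names are invented
(they are not the real Tahoe instruction set), and the sketch assumes
that a kernel built without FPE has no emulation path, so a
floating-point instruction with no working FPP ends in SIGILL.

```python
import signal

# Hypothetical opcode names, for illustration only.
FPP_OPCODES = {"ADDF", "MULF", "CVLF"}


def handle_illegal_instruction(opcode, emulation_in_kernel=True):
    """Model what happens when the CPU traps on an instruction it
    cannot execute (no FPP present, or FPP disabled after a failed
    power-on check).

    Returns "emulate" if the kernel would run the instruction in
    software, otherwise the signal that would be sent to the process.
    """
    if opcode in FPP_OPCODES and emulation_in_kernel:
        return "emulate"
    return signal.SIGILL


# Kernel with the FPE routines: floating point still works, slowly.
print(handle_illegal_instruction("MULF", emulation_in_kernel=True))
# Kernel without them: the process gets SIGILL.
print(handle_illegal_instruction("MULF", emulation_in_kernel=False))
```

Under these assumptions, any program that touches floating point would
die with "Illegal instruction" on a kernel without the emulation
routines once the FPP is gone or disabled.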

>	2)  If the FPP hardware is not suspect, then what would be 
>causing the diagnostics to indicate that it was?  I would (like to)
>assume that the standalone diagnostics tests that must be passed 
>prior to those that test the FPP hardware would rule anything else 
>like this out.
>

The FPP dying is a known bug. If you can,
get your FE to install the boards with plastic carriers. Believe
it or not, there was a heat conduction problem with the ceramic chips.
(I would have guessed the other way around.)

If you need more info call me......

						Bob

 



-- 
====================================================================
Bob Turner                    Network Manager, School of Engineering
513-229-3171                           turner@udecc.engr.udayton.edu
Univ. of Dayton, Engineering Computing Center-KL211, Dayton OH 45469