[comp.bugs.4bsd] crash in namei

salex@linc.cis.upenn.edu (Scott Alexander) (05/22/87)

With much help from DEC Field Service, we seem to have successfully
tracked down the "crash in namei()" problem that people have been
seeing just after converting to 4.3BSD or Ultrix 1.2.

It turns out to be a hardware problem; my best guess is that fast
sequential access through the namei cache (e.g., with a miss rate near
100%) is tickling the bug.  The typical symptom seems to be that when
you run find at night, the machine crashes in iget(), which is called
from namei().  If you use adb -k, namei() will appear just below
Xtransflt() in the stack trace; tracing back the address that the
kernel prints on the console will show that you are in iget().
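
A rough sketch of the kind of workload described above, for
illustration only and not part of the original diagnosis.  It assumes
that what matters is looking up every name in a tree exactly once, in
order, so that the name cache rarely gets a hit (roughly what find
does).  The function name walk_and_stat is made up for this example.

```python
import os


def walk_and_stat(root):
    """Visit every entry under root once and lstat() it.

    Assumption: each lstat() forces a pathname lookup in the kernel,
    and touching each name only once keeps the name cache miss rate
    high.  Returns the number of entries successfully examined.
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                # Entry vanished or is unreadable; skip it.
                continue
            count += 1
    return count
```

Calling walk_and_stat("/usr"), for instance, would sweep one large
tree; whether this actually provokes the crash on an affected machine
is untested.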

If you see this symptom, the next thing to try is running the machine
with "set clock slow"; if you have this problem, the crashes should
stop.  We ran for a weekend with the slow clock without a crash.  When
we set the clock back to normal the following week, the nightly
crashes returned.

If you see these problems, you should get Field Service to consider
upgrading you to a Revision E 785 (FCO785-E-R003, I think).  They will
be somewhat reluctant to do this, since it involves either adding and
deleting 100 wires to/from the backplane or replacing the backplane.
It is listed as a 10-hour FCO.

We had this FCO performed on Monday, May 11, and have not crashed
since.  We have currently been up for 8 days.

There is believed to be a similar FCO for 780s, but I don't know any
details on that.

One item worth noting if you do go through this FCO:  when everything
was put back together, Field Service ran into a problem in which
micro diag #4 kept failing after step 7.  The fix for this is to run
the latest rev of micro diag #4 (version 3.0, I think).

If you want help or more details, send mail, or call if you have
trouble getting mail through.

Scott Alexander
University of Pennsylvania
salex@linc.cis.upenn.edu
+1 215 898 5617