salex@linc.cis.upenn.edu (Scott Alexander) (05/22/87)
With much help from DEC Field Service, we seem to have successfully tracked down the "crash in namei()" problem that people have been seeing just after converting to 4.3 or Ultrix 1.2. It turns out to be a hardware problem; my best guess is that fast sequential access through the namei cache (e.g., a miss rate near 100%) is tickling the bug.

The typical symptom seems to be that when you run find at night, the machine crashes in iget(), which is called from namei(). If you use adb -k, namei() will appear just below Xtransflt() in the stack trace; tracing back the address the kernel prints on the console will show that you are in iget().

If you see this symptom, the next thing to do is to try running the machine with set clock slow; if you have this problem, the crashes should go away. We ran for a weekend without crashes with a slow clock. When we set the clock back to normal the next week, we went back to nightly crashes.

If you see these problems, you should get Field Service to consider upgrading you to a Revision E 785 (FCO785-E-R003, I think). They will be somewhat reluctant to do this, since it involves either adding and deleting 100 wires to/from the backplane or replacing the backplane. It is listed as a 10 hour FCO. We had this FCO performed on Monday, May 11, and have not crashed since. Currently, we have been up for 8 days. There is believed to be a similar FCO for 780s, but I don't know any details of it.

An item worth noting if you do go through this FCO: when everything got glued back together, Field Service ran into a problem in which micro diag #4 kept failing after step 7. The fix for this is to run the latest rev of micro 4 (version 3.0, I think).

If you want help or more details, send mail, or call if you have trouble getting mail through.

Scott Alexander
University of Pennsylvania
salex@linc.cis.upenn.edu
+1 215 898 5617