[comp.arch] Anecdote about dereferencing null pointers

jimv@omepd.UUCP (02/22/87)

I spent last Thursday at a sawmill watching a friend pull a bug out of
the system that he programmed for them over 10 years ago.  In many ways
this is really a story about core memory, but the recent discussion
about mistaken code that dereferences null pointers prompts me to tell
it.

The log cutting at this mill is managed by a Nova-2 that collects
scanner data about the logs coming in, and controls a variety of
conveyor belts and saws to sort, move, and cut the logs to maximize
profit.  The computer suggests cuts to the operator, who confirms the
suggested cut, or overrides the computer and runs the cuts manually.
To keep production at a profitable level, the logs must be kept flying,
which means the computer has to be almost always right.  (As an aside,
running the line is very much like playing a video game for 10 hours a
day, with every screw-up costing not a quarter, but tens or hundreds of
dollars.)

The program dynamically allocates a data packet for each log it sees
and passes these packets around, shifting them from one queue to
another as the log is processed.  At any point in time, there can be
several logs on each of many different conveyor belts (really chains).
The loading of all these belts, the saw usage, and so on are all
displayed on a monitor for the operator.  The logs that haven't had
their cuts confirmed yet are marked with a `-', and the current log
that the control panel is `operating' on is marked with a `*'.  (This
is in effect a cursor.)
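
A minimal sketch of this packet-and-queue arrangement, in C rather than
whatever the Nova actually ran.  The names (LogPacket, Queue, log_arrive,
log_advance) and the single log_id field are made up for illustration;
the real packets surely carried scanner data and cut decisions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical packet: one per log seen by the scanner. */
typedef struct LogPacket {
    int log_id;
    struct LogPacket *next;
} LogPacket;

/* One FIFO queue per processing stage (belt, saw, ...). */
typedef struct {
    LogPacket *head, *tail;
    int count;
} Queue;

void queue_push(Queue *q, LogPacket *p) {
    assert(q != NULL && p != NULL);
    p->next = NULL;
    if (q->tail) q->tail->next = p; else q->head = p;
    q->tail = p;
    q->count++;
}

LogPacket *queue_pop(Queue *q) {
    assert(q != NULL);
    LogPacket *p = q->head;
    if (!p) return NULL;
    q->head = p->next;
    if (!q->head) q->tail = NULL;
    p->next = NULL;
    q->count--;
    return p;
}

/* Allocate a packet for a newly scanned log and queue it. */
LogPacket *log_arrive(Queue *q, int log_id) {
    LogPacket *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->log_id = log_id;
    queue_push(q, p);
    return p;
}

/* Move the oldest packet from one stage's queue to the next. */
LogPacket *log_advance(Queue *from, Queue *to) {
    LogPacket *p = queue_pop(from);
    if (p) queue_push(to, p);
    return p;
}
```

The point of the design is that a packet is allocated once and only its
queue membership changes as the log moves down the line.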

The problem reported was that the cursor wasn't popping up on new logs
when they were entering the system.  The cursor had worked fine for
many years.  This failure meant that the operator had to move the
cursor onto the log before confirming or changing the cut, and was
causing serious production problems.  When the cursor first started
failing, the electrician (yes, electrician; the mill keeps an
electrician on full-time for every shift to handle problems as they
come up) swapped the computer's memory board with one of his spares,
and everything worked again.  For a while.  Eighteen memory boards and
a few years later, the spares had all been used up and the cursor
problem was back again.  So the contracting company was called about
this hardware problem with memory boards going bad, and my friend was
sent down to diagnose the problem.

The problem turned out to be a piece of code that counted on a
reference to a non-existent memory location to return 0.  This worked
when the system had 24K of memory, but once the memory was upgraded to
32K (needed for the ply-block handling code added 2 years later),
that memory location had real memory
associated with it.  When the logs started getting backed up, this area
of memory got used for a log packet, and the value of the crucial
memory location was set to non-zero.  And then the cursor stopped
working.
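
A minimal sketch of the failure, as a C simulation rather than the
actual Nova code.  The address FLAG_ADDR, the Machine model, and the
function names are all assumptions for illustration; the only things
taken from the story are the 24K and 32K sizes and the reliance on a
read of non-existent memory returning 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a 32K-word, word-addressed address space. */
#define ADDRESS_SPACE 32768u
#define FLAG_ADDR     0x6100u  /* made-up address just above 24K (0x6000) */

typedef struct {
    uint16_t words[ADDRESS_SPACE];
    uint32_t installed;        /* words backed by real memory */
} Machine;

void machine_init(Machine *m, uint32_t installed_words) {
    assert(installed_words <= ADDRESS_SPACE);
    memset(m->words, 0, sizeof m->words);
    m->installed = installed_words;
}

/* In this model, reads of non-existent memory return 0. */
uint16_t mem_read(const Machine *m, uint32_t addr) {
    assert(addr < ADDRESS_SPACE);
    return addr < m->installed ? m->words[addr] : 0;
}

/* Writes to non-existent memory are silently dropped. */
void mem_write(Machine *m, uint32_t addr, uint16_t value) {
    assert(addr < ADDRESS_SPACE);
    if (addr < m->installed) m->words[addr] = value;
}

/* The buggy test: the cursor pops up only if the stray word reads 0. */
int cursor_pops_up(const Machine *m) {
    return mem_read(m, FLAG_ADDR) == 0;
}
```

With 24576 words installed, nothing written to FLAG_ADDR can stick, so
cursor_pops_up() always succeeds.  With 32768 words installed it succeeds
only until some log packet happens to land on that word.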

One entertaining part of this bug was that the electrician was right:
the core memory boards *were* broken when the cursor problem showed up,
since once that memory location was set nonzero, it stayed nonzero --
even when the board was removed from the system.


On modern systems with memory management, the equivalent of this
problem might be a program that references unused memory at the end of
the last allocated page, counting on it to be zero.  The lesson is that
any pointer, and not just null pointers, might mistakenly count on
dereferencing to return 0.  I wouldn't be surprised if there are many
programs with off-by-one errors that have this behavior.
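
A small sketch of that kind of off-by-one, in C.  The function names and
the "record" layout are hypothetical; the buggy version only works while
the byte just past the record happens to be 0, and the bounded version
is one possible fix, not the only one.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy: scans until it sees a 0, so it silently depends on the byte
 * just past the record being 0. */
size_t record_len_unbounded(const unsigned char *rec) {
    assert(rec != NULL);
    size_t n = 0;
    while (rec[n] != 0) n++;
    return n;
}

/* Safer: never looks past the record's known capacity. */
size_t record_len_bounded(const unsigned char *rec, size_t cap) {
    assert(rec != NULL);
    size_t n = 0;
    while (n < cap && rec[n] != 0) n++;
    return n;
}
```

If a 4-byte record sits at the start of a larger zeroed buffer, both
functions agree.  Once the bytes after the record are reused for
something nonzero, the unbounded version runs on into its neighbour
while the bounded one still stops at the capacity.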


To finish my story: we patched the offending instruction (while the
system was running), and went down and played the "log game" with the
operator.  For an hour or two we chatted about strikes, basketball, and
deficiencies in the user interface while we batted logs around.  It was
very satisfying.
--
Jim Valerio	{verdix,uoregon,intelca!mipos3}!omepd!jimv

amos@instable.UUCP (02/24/87)

In article <428@omepd> jimv@omepd.UUCP (Jim Valerio) writes:
>On modern systems with memory management, the equivalent of this
>problem might be a program that references unused memory at the end of
>the last allocated page, counting on it to be zero.  The lesson is that
>any pointer, and not just null pointers, might mistakenly count on
>dereferencing to return 0.  I wouldn't be surprised if there are many
>programs with off-by-one errors that have this behavior.

Some known bugs of that nature:
- The vanilla 'curses' library has exactly such a bug.
- BSD4.2 keeps a 'red zone' protected page to guard the end of the kernel
 stack, but it's in the wrong place. (I don't think any configuration has
 ever overflowed the stack into the red zone, though.)
- The Bourne shell has a bug that crashes only on systems that do not have 0
 in location 0 *and* clear released memory pages.
-- 
	Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel
(011-972) 52-522261  amos%nsta@nsc.com 34.48'E 32.10'N