jimv@omepd.UUCP (02/22/87)
I spent last Thursday at a sawmill watching a friend pull a bug out of the system that he programmed for them over 10 years ago. In many ways this is really a story about core memory, but the recent discussion about mistaken code that dereferences null pointers prompts me to tell it.

The log cutting at this mill is managed by a Nova-2 that collects scanner data about the logs coming in, and controls a variety of conveyor belts and saws to sort, move, and cut the logs to maximize profit. The computer suggests cuts to the operator, who confirms the suggested cut, or overrides the computer and runs the cuts manually. To keep production at a profitable level, the logs must be kept flying, which means the computer has to be almost always right. (As an aside, running the line is very much like playing a video game for 10 hours a day, with every screw-up costing not a quarter, but tens or hundreds of dollars.)

The program dynamically allocates a data packet for each log it sees and passes these packets around, shifting them from one queue to another as the log is processed. At any point in time, there can be several logs on each of many different conveyor belts (really chains). The loading of all these belts, the saw usage, and so on are all displayed on a monitor for the operator. The logs that haven't had their cuts confirmed yet are marked with a `-', and the current log that the control panel is `operating' on is marked with a `*'. (This is in effect a cursor.)

The problem reported was that the cursor wasn't popping up on new logs when they entered the system. The cursor had worked fine for many years. This failure meant that the operator had to move the cursor onto the log before confirming or changing the cut, and it was causing serious production problems.
When the cursor first started failing, the electrician (yes, electrician; the mill keeps an electrician on full-time for every shift to handle problems as they come up) swapped the computer's memory board with one of his spares, and everything worked again. For a while. Eighteen memory boards and a few years later, the spares had all been used up and the cursor problem was back again. So the contracting company was called about this hardware problem with memory boards going bad, and my friend was sent down to diagnose the problem.

The problem turned out to be a piece of code that counted on a reference to a non-existent memory location to return 0. This worked when the system had 24K of memory, but when the memory was upgraded to 32K (needed to support the new ply-block handling code that was added 2 years later), that memory location now had real memory associated with it. When the logs started getting backed up, this area of memory got used for a log packet, and the value of the crucial memory location was set to non-zero. And then the cursor stopped working.

One entertaining part of this bug was that the electrician was right: the core memory boards *were* broken when the cursor problem showed up, since once that memory location was set nonzero, it stayed nonzero -- even when the board was removed from the system.

On modern systems with memory management, the equivalent of this problem might be a program that references unused memory at the end of the last allocated page, counting on it to be zero. The lesson is that any pointer, and not just null pointers, might mistakenly count on dereferencing to return 0. I wouldn't be surprised if there are many programs with off-by-one errors that have this behavior.

To finish my story: we patched the offending instruction (while the system was running), and went down and played the "log game" with the operator. For an hour or two we chatted about strikes, basketball, and deficiencies in the user interface while we batted logs around. It was very satisfying.

--
Jim Valerio {verdix,uoregon,intelca!mipos3}!omepd!jimv
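
The failure mode in the story above can be modelled in a few lines of C. This is only a sketch under stated assumptions, not the mill's actual Nova code (which the post does not show): the `Machine` type, the function names, and the address `FLAG_ADDR` are all hypothetical, and "24K" and "32K" are assumed to mean words. Memory is simulated with an array, so nothing here touches real unmapped addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDRESS_SPACE_WORDS (32u * 1024u) /* assumed size of the address space */
#define FLAG_ADDR           (30u * 1024u) /* hypothetical: above 24K, below 32K */

typedef struct {
    uint16_t core[ADDRESS_SPACE_WORDS]; /* backing store for fitted memory */
    uint32_t installed_words;           /* how much memory is actually fitted */
} Machine;

void machine_init(Machine *m, uint32_t installed_words) {
    assert(installed_words <= ADDRESS_SPACE_WORDS);
    memset(m->core, 0, sizeof m->core);
    m->installed_words = installed_words;
}

/* Reads of unpopulated addresses yield 0: the behaviour the buggy
 * code is described as relying on. */
uint16_t read_word(const Machine *m, uint32_t addr) {
    return addr < m->installed_words ? m->core[addr] : 0;
}

/* Writes to unpopulated addresses are silently lost. */
void write_word(Machine *m, uint32_t addr, uint16_t value) {
    if (addr < m->installed_words) {
        m->core[addr] = value;
    }
}

/* The buggy check (hypothetical form): the cursor pops up only
 * while the word at FLAG_ADDR happens to read as 0. */
int cursor_pops_up(const Machine *m) {
    return read_word(m, FLAG_ADDR) == 0;
}

/* A backed-up queue eventually places a log packet in high memory,
 * overwriting the word the check depends on. */
void allocate_packet_high(Machine *m) {
    write_word(m, FLAG_ADDR, 0x1234);
}
```

With `machine_init(&m, 24u * 1024u)` the check keeps succeeding no matter how many packets are allocated, because the write to `FLAG_ADDR` is lost. With `machine_init(&m, 32u * 1024u)` it succeeds only until the first call to `allocate_packet_high(&m)`. Core memory keeps its contents without power, which in this model corresponds to never re-running `machine_init` on a board: once the word is nonzero, it stays that way.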
amos@instable.UUCP (02/24/87)
In article <428@omepd> jimv@omepd.UUCP (Jim Valerio) writes:
>On modern systems with memory management, the equivalent of this
>problem might be a program that references unused memory at the end of
>the last allocated page, counting on it to be zero. The lesson is that
>any pointer, and not just null pointers, might mistakenly count on
>dereferencing to return 0. I wouldn't be surprised is there are many
>programs with off-by-one errors that have this behavior.

Some known bugs of that nature:

- The vanilla 'curses' library has exactly such a bug.
- BSD4.2 keeps a 'red zone' protected page to guard the end of the kernel stack, but it's in the wrong place. (I don't think any configuration has ever overflowed the stack into the red zone, though.)
- The Bourne shell has a bug that crashes only on systems that do not have 0 in location 0 *and* clear released memory pages.

--
Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel
(011-972) 52-522261 amos%nsta@nsc.com 34.48'E 32.10'N