rang@cs.wisc.edu (Anton Rang) (08/13/90)
[ This isn't really a C issue; I'm redirecting it to comp.misc. ]

In article <49041@seismo.CSS.GOV> stead@beno.CSS.GOV (Richard Stead) writes:
>Do VAX-CISC programmers spend their days branching to random data?

Nobody that I know of actually does this on *purpose*. However, it's not that hard to screw up a program so that it does this. For instance, passing the wrong argument (or an uninitialized pointer) to a routine expecting a function pointer can do it. Or trashing the stack by going past array bounds. Lots of possibilities.

>Or if I ever do, I would fix it pretty damn quick.

Same here. Nobody's saying that branching to random data is good.

>Who could possibly care that a random instruction sequence crashes a risc box?

Well...if I have my own workstation, and always have all my work in progress saved before I run a program (generally true, since I'm paranoid), I don't care that much, though it's still annoying. If I'm sharing the machine, I *don't* want it to crash twice a day because the guy down the hall is trying to find the stack-clobbering bug in his code.

Educational environments are even more prone to this. If you have a Sun-4/490 shared between 200 users, you don't want somebody to be able to crash the machine at will (a "denial of service" attack). There are students out there who will think this is a great joke....

I hope that the problems shown up by this test can all be fixed in software; I expect that they can, and that the vendors will do so. In general, nothing a user-mode process can do should be able to crash the machine...otherwise, what's the point of privileged instructions, kernel mode, etc.?

	Anton

+---------------------------+------------------+-------------+
| Anton Rang (grad student) | rang@cs.wisc.edu | UW--Madison |
+---------------------------+------------------+-------------+
gordoni@chook.adelaide.edu.au (Gordon Irlam) (08/14/90)
I've managed to track down the cause of one of the crashes on a Sun4. The following C program crashes a 4/330 running SunOS 4.03.

---- start of crash_sun.c ----
main = 0xbfafffff;
---- end of crash_sun.c ----

This is a floating point compare instruction with an invalid type of value for the comparison (i.e. not a single, double, or extended precision value). Presumably the instruction gets passed to the floating point unit, causing it to panic in some way, which in turn results in the CPU crashing. I imagine that the CPU/FPU interface is one of the most common areas for such bugs, due to the complexities caused by its heavily asynchronous nature.

The program does not crash a 4/60 running SunOS 4.1. But I can't tell whether I am looking at a hardware problem or an operating system problem. Could someone with a 4/330 running SunOS 4.1, or a 4/60 running SunOS 4.03, please try this program so that the cause of the problem can be determined?

Speculating on likely causes of such crashes, I would imagine that the vast majority are simply O/S bugs. A few might be system design bugs, although most of these can probably be made safe by clever O/S programming. These include neglecting things such as the possibility of a page fault being caused by pre-fetching an annulled instruction, and any number of bugs in the various bits of design glue that hold a modern system together. I think the possibility of a CPU bug is quite unlikely. In fact, I think that executing random sequences of code is one of the common tests used to check a new CPU design.

I don't think there is any real merit to the claim that such bugs are more likely on RISC than CISC machines - in fact the simplicity of RISC machines could be used to argue the other way. Any differences that are seen between such machines can be more than explained by the considerable difference in the age of the O/S ports for the respective architectures. Most of the O/S bugs on the CISC machines have probably been fixed long ago.
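
For readers who want to check the claim about crash_sun.c by hand, here is a minimal sketch in C that splits the word 0xbfafffff into instruction fields. It assumes the SPARC V8 format-3 layout for floating-point operate instructions (op, rd, op3, rs1, opf, rs2) and the V8 opcode values for FPop2 (op3 = 0x35) and the six defined compare opf codes. The struct and function names are made up for this illustration and do not come from the original post or from any Sun header.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: fields of a SPARC format-3 FP instruction word. */
struct sparc_fpop {
    unsigned op;   /* bits 31:30 */
    unsigned rd;   /* bits 29:25 */
    unsigned op3;  /* bits 24:19 */
    unsigned rs1;  /* bits 18:14 */
    unsigned opf;  /* bits 13:5  */
    unsigned rs2;  /* bits 4:0   */
};

/* Split a 32-bit instruction word into its format-3 fields. */
struct sparc_fpop sparc_decode(uint32_t word)
{
    struct sparc_fpop f;
    f.op  = (word >> 30) & 0x3;
    f.rd  = (word >> 25) & 0x1f;
    f.op3 = (word >> 19) & 0x3f;
    f.rs1 = (word >> 14) & 0x1f;
    f.opf = (word >> 5)  & 0x1ff;
    f.rs2 = word & 0x1f;
    return f;
}

/* Nonzero if the word is in the FPop2 group (the FP compares). */
int sparc_is_fpop2(const struct sparc_fpop *f)
{
    return f->op == 2 && f->op3 == 0x35;
}

/* Nonzero if opf is one of the compare encodings defined in SPARC V8. */
int sparc_is_defined_fcmp(const struct sparc_fpop *f)
{
    switch (f->opf) {
    case 0x51: case 0x52: case 0x53:   /* fcmps, fcmpd, fcmpq    */
    case 0x55: case 0x56: case 0x57:   /* fcmpes, fcmped, fcmpeq */
        return sparc_is_fpop2(f);
    default:
        return 0;
    }
}

/* Print the decoded fields, one per line. */
void sparc_print(const struct sparc_fpop *f)
{
    printf("op=%u rd=%u op3=0x%02x rs1=%u opf=0x%03x rs2=%u\n",
           f->op, f->rd, f->op3, f->rs1, f->opf, f->rs2);
}
```

Calling sparc_decode(0xbfafffff) and passing the result to sparc_is_fpop2() and sparc_is_defined_fcmp() should show a word that lands in the compare group but carries an opf value that is not one of the defined precisions, which fits the description above. This only decodes the word; it says nothing about why a particular machine crashed on it.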

An important thing to realize, given the state of software engineering today, is that there is a serious tradeoff between functionality and reliability. And most vendors seem to be putting functionality first. Based on my experience, reliable NFS is an oxymoron, but that hasn't stopped it from being adopted by almost the entire unix speaking world.

A distinction also needs to be made between bugs and simply strange quirks of the hardware. This distinction isn't always clear. And in many cases failure to understand a hardware quirk by the O/S designers can result in a bug.

SPARC has several quirks/bugs which I am aware of:

1) The read status register, write status register sequence is interruptible. Also, it is not possible to write only particular fields in the psr. This means that it is not possible to use this sequence to clear the trap enable bit and thereby disable traps, since between reading and writing the psr an interrupt trap may have occurred, causing the cwp field of the psr to have changed value. This almost certainly wasn't realised when the architecture was designed; instead it follows as a natural consequence of other design decisions. It is not possible to alter the architecture, and so this quirk stays. Fortunately, some fairly complicated ways exist that allow you to get around this quirk and disable traps.

2) Setting the interrupt level and trap enable fields of the psr simultaneously can cause spurious interrupts with the Fujitsu chip set. This is now documented in the SPARC manuals, and so it is no longer a bug. It just requires using two separate instructions where one would have been used previously.

3) Early versions of SPARC only had an atomic swap memory with 0xff instruction. This is sufficient for implementing semaphores, etc., but is no good if the semantics of the swapped value are determined by external hardware, such as is the case with page tables. New versions of the architecture, and the Cypress chip set, include an atomic swap memory with register instruction.

4) Current versions of SPARC do not have multiply or divide instructions. Sun seems to be worried about the marketing implications of this, and so opcodes for these operations have been assigned. This isn't really a bug, but the marketing people behave as if it is, rather than justifying the original design decision.

5) An opcode exists for flushing an internal instruction cache, but not for flushing an internal data cache. Perhaps there is a good reason for this, but I can't see one. It is not clear if future processors will use the one instruction to flush both instruction and data caches, or whether a new instruction will be added.

----

A note for people running crashme.c. Typically you have to run it with many different arguments before you encounter a crash. And on Suns, at least, the image frequently gets terminated long before the specified number of iterations have completed. (Someone mentioned that a 4/60 did not crash. But after playing around with one for a while I managed to crash it, under SunOS 4.1.)

	Gordon Irlam
	gordoni@cs.adelaide.edu.au
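
To illustrate quirk 3 above, here is a rough sketch in C of why a swap-with-0xff primitive is enough for a simple lock. It does not use the SPARC instruction itself (LDSTUB in the SPARC manuals); as an assumption for portability, it imitates it with a C11 atomic exchange that always stores 0xff into a byte. The type and function names (byte_lock, byte_lock_try and so on) are invented for this example.

```c
#include <stdatomic.h>

/* Hypothetical lock type: one byte, 0 = free, 0xff = held. */
typedef struct {
    _Atomic unsigned char byte;
} byte_lock;

#define BYTE_LOCK_INIT { 0 }

/* Atomically swap 0xff into the byte, as a swap-with-0xff instruction
 * would. If the old value was 0, the lock was free and is now ours.
 * Returns 1 on success, 0 if somebody else already holds the lock. */
int byte_lock_try(byte_lock *l)
{
    unsigned char old =
        atomic_exchange_explicit(&l->byte, 0xff, memory_order_acquire);
    return old == 0;
}

/* Spin until the swap observes a free lock. */
void byte_lock_acquire(byte_lock *l)
{
    while (!byte_lock_try(l)) {
        /* busy-wait; a real kernel would back off or sleep here */
    }
}

/* Release is an ordinary store of 0; no atomic swap is needed. */
void byte_lock_release(byte_lock *l)
{
    atomic_store_explicit(&l->byte, 0, memory_order_release);
}
```

The limitation described in the list shows up directly in byte_lock_try(): the value written is fixed at 0xff, so the primitive cannot atomically install an arbitrary new value, which is what updating something like a page table entry would need. A swap with a register removes that restriction.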
gordoni@chook.adelaide.edu.au (Gordon Irlam) (08/15/90)
In a previous article I mentioned that the following C program crashes a 4/330 running SunOS 4.03.

---- start of crash_sun.c ----
main = 0xbfafffff;
---- end of crash_sun.c ----

It looks like this is a bug in SunOS 4.03, and not a hardware problem. Unless, perhaps, it is an underlying hardware flaw that the operating system is capable of masking.

Thanks to John M. Blasik (john@mlb.semi.harris.com) for running it on a 4/330 under SunOS 4.1, where it didn't crash. And to Vick Khera (khera@cs.duke.edu) for running it on a 4/60 under SunOS 4.03, where it did crash.

	Gordon Irlam
	(gordoni@cs.adelaide.edu.au)