[comp.os.vms] Microcode problem on 8600 processors

NED@YMIR.BITNET (Ned Freed) (01/17/88)

Recently I found a microcode bug on our VAX 8600. When we reported it to DEC
they provided us with a microcode update that fixed the problem immediately.
This bug is not especially esoteric, as the following example MACRO program
shows:

        .entry start,^m<r2,r3>

        movl    first,-(sp)
        movl    second,r0
        mulf2   r0,(sp)
        movl    (sp)+,r0
        ret

first:  .float  1.0e-14
second: .float  1.0e-27

        .end    start

The operation performed by the program is quite simple. One small floating
point value is loaded onto the stack and another is loaded into R0. These two
values are then multiplied, with the result stored on the stack. This operation
underflows, so the result should be 0. This result is then popped off the
stack into R0 and the program returns. The net result should then be a
"NONAME-W-NOMSG, Message number 000000" message reported as the program exits.
And yes indeed, the program does just this on the VAX-11/750, the uVAX-II and
the VAX 8700.
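
For anyone who wants to check the arithmetic without a VAX, here is a rough
Python sketch of why the multiply underflows. It assumes the smallest positive
F_floating magnitude is 2**-128 (about 2.9e-39) and that an underflow yields 0
when the underflow trap is disabled. The helper name f_floating_multiply is
made up for illustration, and the sketch ignores F_floating's 24-bit rounding.

```python
# Assumption: the smallest positive F_floating magnitude is 2**-128
# (fraction 0.5 with the minimum biased exponent of 1, excess-128).
F_FLOAT_MIN = 2.0 ** -128


def f_floating_multiply(a, b):
    """Hypothetical helper: multiply, returning 0.0 on F_floating underflow.

    Python doubles stand in for F_floating here, so precision differs;
    only the range check is modelled.
    """
    product = a * b
    if product != 0.0 and abs(product) < F_FLOAT_MIN:
        return 0.0  # underflow with the trap disabled gives zero
    return product


# The two constants from the MACRO program: the true product is about 1e-41.
result = f_floating_multiply(1.0e-14, 1.0e-27)
```

The true product is far below F_FLOAT_MIN, so result comes out as 0.0, which
is the value the program is expected to hand back in R0.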

But not on our 8600. On our 8600 the program returns a status value of 1 and
not 0! If you carefully single step the program in the debugger the reason for
this will become clear -- for some reason the "mulf2 r0,(sp)" instruction
DECREMENTS the stack pointer by 4. Thus you end up picking up some random
value off the stack that turns out to be a 1.
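
The effect of the stray decrement can be modelled in a few lines of Python.
This is only a toy model under stated assumptions: the starting stack address,
the dictionary standing in for memory, and the stale 1 sitting below the stack
top are all invented for illustration. Only the instruction sequence comes
from the MACRO program.

```python
def run(buggy):
    """Toy model of the test program; returns the final value of R0.

    buggy=True models a mulf2 that wrongly decrements SP by 4.
    """
    memory = {}          # hypothetical sparse memory: address -> longword
    sp = 0x1000          # arbitrary starting stack pointer (assumption)
    memory[sp - 8] = 1   # stand-in for whatever stale value lies below

    # movl first,-(sp)
    sp -= 4
    memory[sp] = 1.0e-14

    # movl second,r0
    r0 = 1.0e-27

    # mulf2 r0,(sp): the product underflows, so 0 is stored at (sp)
    memory[sp] = 0
    if buggy:
        sp -= 4          # the erroneous extra decrement

    # movl (sp)+,r0
    r0 = memory[sp]
    sp += 4
    return r0


good_status = run(buggy=False)
bad_status = run(buggy=True)
```

In this model good_status is 0 and bad_status is 1: with the extra decrement,
the pop reads the longword below the one that mulf2 just wrote.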

This problem has been verified on two different 8600s at different sites, both
under DEC maintenance, so don't assume that YOUR microcode is up to date. Our
8600 recently had a whole slew of hardware problems and almost every part of it
was replaced and checked, but the microcode was not updated until I found this
problem.

Here are a few additional technical points:

(1) Almost ANY change to the program will cause the problem to go away. For
    example, everything works fine if you add a "nop" just after the "mulf2", or
    remove the "movl (sp)+,r0", or do almost anything else.

(2) The problem appears not to be sensitive to the floating point values
    involved; anything that causes an underflow will cause the problem. The
    program works properly if the multiply does not underflow.

(3) The problem does not appear to exist when using floating point types other
    than F_floating.

(4) Despite the clear indications that the error has something to do with
    the handling of floating underflow in the 8600 pipeline, the error does
    manifest itself even when single stepping in the debugger. I think this
    is especially strange.

(5) I have not tried this program on an 8650, and I would be very interested
    to find out if this problem exists on that CPU. In fact, I would appreciate
    receiving reports of the results people get when they run this program on
    their systems, regardless of what type of CPU they have.

This hardware error has been plaguing our local software for more than two
years, causing a whole series of access violations and divide-by-zero errors. I
have been looking for the cause off and on for quite a while, but it just
didn't occur to me that a microcode bug could be to blame!

I am somewhat upset that DEC knew about this problem and didn't see fit to
distribute a fix for it. It is quite conceivable that this problem could
manifest itself in such a way that a program would report no obvious errors but
would return erroneous results.

                                Ned Freed
                                ned@ymir.bitnet

gkn@SDS.SDSC.EDU (Gerard K. Newman) (01/17/88)

	From:	 Ned Freed <NED%YMIR.BITNET@CUNYVM.CUNY.EDU>
	Subject: Microcode problem on 8600 processors
	Date:	 Sat, 16 Jan 88 21:35 PST

	[long and lucid description of problem omitted ... gkn]

	(5) I have not tried this program on an 8650, and I would be very interested
	    to find out if this problem exists on that CPU. In fact, I would appreciate
	    receiving reports of the results people get when they run this program on
	    their systems, regardless of what type of CPU they have.

I just ran it on an 8650 and it produced the correct results.

Regards,

gkn
----------------------------------------
Internet: GKN@SDS.SDSC.EDU
Bitnet:   GKN@SDSC
Span:	  SDSC::GKN (27.1)
USPS:	  Gerard K. Newman
	  San Diego Supercomputer Center
	  P.O. Box 85608
	  San Diego, CA 92138-5608
AT&T:	  619.534.5076

levy@ttrdc.UUCP (Daniel R. Levy) (01/19/88)

In article <8801170735.AA23357@ucbvax.Berkeley.EDU>, NED@YMIR.BITNET (Ned Freed) writes:
#>         .entry start,^m<r2,r3>
#> 
#>         movl    first,-(sp)
#>         movl    second,r0
#>         mulf2   r0,(sp)
#>         movl    (sp)+,r0
#>         ret
#> 
#> first:  .float  1.0e-14
#> second: .float  1.0e-27
#> 
#>         .end    start
#> 
#> The net result should then be a
#> "NONAME-W-NOMSG, Message number 000000" message reported as the program exits.
#> 
#> (5) I have not tried this program on an 8650, and I would be very interested
#>     to find out if this problem exists on that CPU. In fact, I would appreciate
#>     receiving reports of the results people get when they run this program on
#>     their systems, regardless of what type of CPU they have.

This works fine on our 8650 under VMS 4.5.
-- 
|------------Dan Levy------------|  Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
|         an Engihacker @        |  	<most AT&T machines>}!ttrdc!ttrda!levy
| AT&T Computer Systems Division |  Disclaimer?  Huh?  What disclaimer???
|--------Skokie, Illinois--------|

ZWARTS@HGRRUG51.BITNET (01/19/88)

>  (5) I have not tried this program on an 8650, and I would be very interested
>      to find out if this problem exists on that CPU. In fact, I would
>      appreciate receiving reports of the results people get when they run
>      this program on their systems, regardless of what type of CPU they have.

I have tried it on our VAX 8300, VAXstation 2000 and MicroVAX I. On all these
machines it runs correctly.

        F. Zwarts                               Phone:          (+31)50-633619
        Kernfysisch Versneller Instituut        Bitnet/Earn:    ZWARTS@HGRRUG51
        Zernikelaan 25                          Surfnet:        KVIANA::ZWARTS
        9747 AA  Groningen                      Telefax:        (+31)50-634003
        The Netherlands                         Telex:          53410 rugro nl

SYSTEM@CRNLNS.BITNET (01/20/88)

Ned,

Both of our 8600s run your microcode bug detector correctly,
returning a 0.

We have been running our current microcode for quite a while
(many months).

I don't want to stir up any trouble, but I think you need to talk to your
DEC field service management about bringing your systems' FCOs up to date
in a timely fashion, including installing new console software when it
becomes available.

The file "NOTICE.NEW" on our console packs claims that it is for
both 8600s and 8650s and that it includes CI microcode rev 7.0.
This last may have a bearing. Since our systems are
clustered, I made a point of telling our Field Service rep that
VMS v4's release notes mentioned the desirability of using CI rev 7.

I hope this helps.

Selden E. Ball, Jr.
(Wilson Lab's network and system manager)

Cornell University                 NYNEX: +1-607-255-0688
Laboratory of Nuclear Studies     BITNET: SYSTEM@CRNLNS
Wilson Synchrotron Lab          Internet: SYSTEM%CRNLNS.BITNET@CUNYVM.CUNY.EDU
Judd Falls & Dryden Road     HEPnet/SPAN: LNS61::SYSTEM = 44283::SYSTEM
Ithaca, NY, USA  14853