[comp.os.msdos.desqview] Qemm slows floating point

feg@cbnewsb.cb.att.com (forrest.e.gehrke) (05/14/91)

I have a Gateway 386/33 with Micronics Asic motherboard
and no coprocessor.  4MB ram, 32 ram cache.

When I load qemm.sys and run a Whetstone benchmark (or any
floating point operations program), I find that there is
about a 30% reduction in speed.  This occurs even with 
no features of qemm in use, i.e. merely loading qemm.sys
with the line   device=c:\qemm.sys with no other parameters
in the config.sys file, and no programs loaded into
high memory.

Removing that line from config.sys will show a 30%
improvement in floating point operations speed.
For example, Gateway includes with the Micronics
motherboard a program QAPLUS which has the Whetstone
benchmark.  Without QEMM.SYS loaded this benchmark
will report 202.1 whetstones/sec.  With QEMM.SYS loaded
this will drop to 154 whetstones/sec.
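
The "30%" figure depends on which run is taken as the baseline. A quick check of the arithmetic, using only the two Whetstone numbers quoted above (the variable names are illustrative, not from QAPLUS):

```python
# Whetstone figures quoted in the post above (QAPLUS, no coprocessor).
without_qemm = 202.1
with_qemm = 154.0

# Slowdown measured against the faster (no-QEMM) baseline.
reduction = (without_qemm - with_qemm) / without_qemm * 100

# Speedup measured against the slower (QEMM-loaded) baseline.
improvement = (without_qemm - with_qemm) / with_qemm * 100

print(f"reduction with QEMM loaded:    {reduction:.1f}%")
print(f"improvement with QEMM removed: {improvement:.1f}%")
```

On these numbers the drop is about 24% relative to the unloaded run, and the gain from removing QEMM is about 31% relative to the loaded run, so "about 30%" matches the second way of counting.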

I can not envision any reason for these results unless
QEMM inserts useless code to be executed when floating
point operations are involved.  The same QAPLUS includes
the Dhrystone benchmark; there is no difference in its
report with or without QEMM.SYS loaded.

Has anyone else noticed this?  Is there a solution
for it?

Forrest Gehrke feg@floyd.att.com

dmurdoch@watstat.waterloo.edu (Duncan Murdoch) (05/14/91)

In article <1991May14.123233.17734@cbfsb.att.com> feg@cbnewsb.cb.att.com (forrest.e.gehrke) writes:
>I have a Gateway 386/33 with Micronics Asic motherboard
>and no coprocessor.  4MB ram, 32 ram cache.
>
>When I load qemm.sys and run a Whetstone benchmark (or any
>floating point operations program), I find that there is
>about a 30% reduction in speed.  This occurs even with 
>no features of qemm in use, i.e. merely loading qemm.sys
>with the line   device=c:\qemm.sys with no other parameters
>in the config.sys file, and no programs loaded into
>high memory.
...
>I can not envision any reason for these results unless
>QEMM inserts useless code to be executed when floating
>point operations are involved.  The same QAPLUS includes
>the Dhrystone benchmark; there is no difference in its
>report with or without QEMM.SYS loaded.
>
>Has anyone else noticed this?  Is there a solution
>for it?

I haven't noticed it, but haven't done any tests.  Here's a guess at why
it's happening:  

Most compilers use interrupt calls in place of each floating point instruction
to jump to the emulator when there's no FPU installed.  (Borland's TP does
this on the first call whether the emulator is installed or not; MS languages
seem to do it on the first call only if there's an emulator there.  Both
of them patch the code back to an FPU instruction if there's an '87 there, so
you only execute the interrupt once.)

QEMM runs the 386 in protected mode, with your session running in V86
mode.  I don't have 386 timings handy, but I think the INT instruction is
much slower in V86 mode than in real mode.

So it appears the only solution is to buy a 387 - then your program will
be slower on the first pass through each instruction, but will go full
speed after that.

Duncan Murdoch
dmurdoch@watstat.waterloo.edu

reisert@mast.enet.dec.com (Jim Reisert) (05/15/91)

In article <1991May14.142323.1929@maytag.waterloo.edu>, dmurdoch@watstat.waterloo.edu (Duncan Murdoch) writes...
>In article <1991May14.123233.17734@cbfsb.att.com> feg@cbnewsb.cb.att.com (forrest.e.gehrke) writes:
>>I have a Gateway 386/33 with Micronics Asic motherboard
>>and no coprocessor.  4MB ram, 32 ram cache.
>>
>>When I load qemm.sys and run a Whetstone benchmark (or any
>>floating point operations program), I find that there is
>>about a 30% reduction in speed.
>
>So it appears the only solution is to buy a 387 - then your program will
>be slower on the first pass through each instruction, but will go full
>speed after that.

This doesn't seem right.  I have a Cyrix coprocessor in my 386 box, and I
suffer similar speed penalties as Forrest, when using programs that make
heavy use of the coprocessor.  It must be something else.

- Jim

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

"The opinions expressed here in no way represent the views of Digital
 Equipment Corporation."

James J. Reisert                Internet:  reisert@mast.enet.dec.com
Digital Equipment Corp.         UUCP:      ...decwrl!mast.enet!reisert
146 Main Street                 Voice:     508-493-5747
Maynard, MA  01754              FAX:       508-493-0395

feg@cbnewsb.cb.att.com (forrest.e.gehrke) (05/15/91)

In article <22671@shlump.lkg.dec.com> reisert@mast.enet.dec.com (Jim Reisert) writes:
>
>In article <1991May14.142323.1929@maytag.waterloo.edu>, dmurdoch@watstat.waterloo.edu (Duncan Murdoch) writes...
>>In article <1991May14.123233.17734@cbfsb.att.com> feg@cbnewsb.cb.att.com (forrest.e.gehrke) writes:
>>>I have a Gateway 386/33 with Micronics Asic motherboard
>>>and no coprocessor.  4MB ram, 32 ram cache.
>>>
>>>When I load qemm.sys and run a Whetstone benchmark (or any
>>>floating point operations program), I find that there is
>>>about a 30% reduction in speed.
>>
>>So it appears the only solution is to buy a 387 - then your program will
>>be slower on the first pass through each instruction, but will go full
>>speed after that.
>
>This doesn't seem right.  I have a Cyrix coprocessor in my 386 box, and I
>suffer similar speed penalties as Forrest, when using programs that make
>heavy use of the coprocessor.  It must be something else.
>
>- Jim

Several people have responded directly and on this net to the effect
that QEMM is trapping a floating point exception interrupt with
each instruction, causing this slowdown.

My results using a whetstone benchmark were 202 whetstones/second
without QEMM and 154 with QEMM.  However, one of the people here at
BTL with a machine identical to mine except with an Intel 387
installed reports 2650 whetstones/second with or without QEMM.
Apparently the CPU only generates one interrupt at the beginning
and then the 387 goes its merry way without any more interrupts
for QEMM to handle.

One person suggested compiling the source with MSC using the
/FPa option, which links the alternate math library in place of
the 387 emulator.  He speculates that this will operate in the
same way as having a 387 installed.
Of course, this is only useful if one has the C source for
the program.

Forrest Gehrke feg@dodger.att.com

dmurdoch@watstat.waterloo.edu (Duncan Murdoch) (05/15/91)

In article <22671@shlump.lkg.dec.com> reisert@mast.enet.dec.com (Jim Reisert) writes:
>
>In article <1991May14.142323.1929@maytag.waterloo.edu>, dmurdoch@watstat.waterloo.edu (Duncan Murdoch) writes...
>>In article <1991May14.123233.17734@cbfsb.att.com> feg@cbnewsb.cb.att.com (forrest.e.gehrke) writes:
>>>When I load qemm.sys and run a Whetstone benchmark (or any
>>>floating point operations program), I find that there is
>>>about a 30% reduction in speed.
>>
>>So it appears the only solution is to buy a 387 - then your program will
>>be slower on the first pass through each instruction, but will go full
>>speed after that.
>
>This doesn't seem right.  I have a Cyrix coprocessor in my 386 box, and I
>suffer similar speed penalties as Forrest, when using programs that make
>heavy use of the coprocessor.  It must be something else.

Very mysterious.  I wrote a little test program in Turbo Pascal (it's attached
below), that just runs loops multiplying 16 bit integers, 32 bit integers, and
64 bit doubles.  I calibrated it under Desqview so that each loop took 5 seconds
on my 486-25 (which has the coprocessor built in).

Times under various conditions are shown below; it sure looks as though QEMM
has no effect when the floating point instructions are used, but causes a big
slowdown when they're emulated.

Duncan Murdoch

                        Time in seconds
 Type           Desqview       QEMM         Clean      Number of cycles

 Integer          5.0           4.0          4.0            4150000
 Longint          4.9           4.0          4.0             990000
 FPU Doubles      5.0           4.0          4.0            2210000
 Emul Doubles     6.4           5.2          3.7              60000
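
To make the table easier to compare across rows, the timings can be converted to an average time per loop iteration and a QEMM-to-clean ratio. The sketch below uses only the numbers in the table; the function and dictionary names are illustrative, and the per-iteration figures include loop overhead, so they are rough.

```python
# Timings (seconds) and loop counts copied from the table above.
# Columns: under Desqview, QEMM only, clean boot (no QEMM), iterations.
rows = {
    "Integer":      (5.0, 4.0, 4.0, 4_150_000),
    "Longint":      (4.9, 4.0, 4.0,   990_000),
    "FPU Doubles":  (5.0, 4.0, 4.0, 2_210_000),
    "Emul Doubles": (6.4, 5.2, 3.7,    60_000),
}

def per_iteration_us(seconds, iterations):
    """Average time per loop iteration, in microseconds."""
    return seconds / iterations * 1e6

results = {}
for name, (dv, qemm, clean, n) in rows.items():
    results[name] = {
        "clean_us": per_iteration_us(clean, n),
        "qemm_us": per_iteration_us(qemm, n),
        "qemm_vs_clean": qemm / clean,
    }
    r = results[name]
    print(f"{name:13s} clean {r['clean_us']:7.2f} us  "
          f"QEMM {r['qemm_us']:7.2f} us  ratio {r['qemm_vs_clean']:.2f}")
```

On these figures only the emulated row differs between the QEMM and clean columns: roughly 62 microseconds per iteration clean against roughly 87 with QEMM, a ratio of about 1.4, while the other three rows have a ratio of 1.0.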

  program timecalc;

  { Times integer and floating point arithmetic }

  uses opdos;  { Object Professional unit; supplies the TimeMS millisecond timer }

  var
    i,j : integer;
    i1,i2 : integer;
    l1,l2 : longint;
    d1,d2 : double;
    start,stop : longint;
  begin
    start := timems;

    i1 := 5;
    for i:=1 to 10000 do
      for j:=1 to 415 do
        i2 := i1*i1;
    stop := timems;
    writeln('Integers took ',(stop-start)/1000:8:3,' seconds.');

    start := timems;

    l1 := 5;
    for i:=1 to 10000 do
      for j:=1 to 99 do
        l2 := l1*l1;
    stop := timems;
    writeln('Longints took ',(stop-start)/1000:8:3,' seconds.');

    start := timems;

    d1 := 5;
    for i:=1 to 10000 do
      for j:=1 to 6 do  { use 6 when run with SET 87=N (emulator), 221 with SET 87=Y (FPU) }
        d2 := d1*d1;
    stop := timems;
    writeln('Doubles took  ',(stop-start)/1000:8:3,' seconds.');

  end.