[comp.os.msdos.programmer] SUMMARY - 32-bit registers without a DOS extender

nelson@bolyard.wpd.sgi.com (Nelson Bolyard) (10/24/90)
In my original posting, I told about my application that runs in Turbo-C's
"tiny" model (CS==DS==ES==SS, all fits in 64 k bytes) but intensively uses
LONG integers.  Performance of 8086 instructions was bad.  I was looking
for a way to use the 32-bit "extended" registers of the 386, in a DOS
environment, without a "DOS extender" if possible.  I wanted speed, not
more memory.

After my summary of replies, I will explain the final results, and how I got
them.  The final code ran 4.3 times faster, without any extender.

I got 32 responses, including some from people who don't read this
newsgroup (comp.os.msdos.programmer) but instead are on the
386-Users@udel.edu mailing list maintained by Bill Davidsen, who is also
the moderator of comp.binaries.ibm.pc.  About half the responses simply 
told me about various extenders.  Since (it turned out) an extender is not
necessary for my purposes, I have not included any of those responses.
There were far too many responses to repeat them all here, so I will truly
summarize them.  I asked 4 questions, which were:

1. Do I need a DOS extender to be able to use the 32-bit registers?

>No. The 32 bit register instructions are available via an escape code byte
>on each instruction (not as efficient as running in protected mode, but
>still much better than doing 16 bit arithmetic). 

>No.  The 32-bit registers are available in both "real" and "virtual-86"
>modes, which DOS would use.  A DOS extender is primarily used to allow
>you to write your program to use the "protected" mode of the processor.

>The 32 bit registers can be accessed simply by putting a "size" prefix
>byte in front of the instruction:  The prefix byte causes what would
>normally be a 16 bit register access to become a 32 bit register access
>instead.  (In 386 protected mode, the default access width can either
>be 16 bit or 32 bit:  the prefix is used to access the "other" size,
>whichever that is.)

>The 386 is a strange chameleon beast.  All the addressing stuff that used
>to be 16 bits can be made 32 bits instead by putting an override byte in
>front of an individual instruction.  Similarly but separately, all of the
>operand stuff that used to be 16 bits can be made 32 bit, by putting an
>override byte in front of an individual instruction.
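To make the prefix bytes concrete, here is a minimal C sketch (not taken
from any of the replies) that encodes a register-to-register ADD the way
an assembler would for a 16-bit segment.  The function name
encode_add_reg_reg is made up for illustration; the byte values 66h
(operand-size override) and 01h /r (ADD r/m,reg) are the standard 386
encodings.  The separate address-size override, not shown, is 67h.

```c
#include <assert.h>
#include <stddef.h>

#define OPSIZE_PREFIX 0x66   /* operand-size override byte */

/* Encode "add dst,src" (register to register) for a 16-bit segment.
 * dst and src are register numbers: 0=AX/EAX, 1=CX/ECX, 2=DX/EDX,
 * 3=BX/EBX, and so on.  If wide32 is nonzero the 32-bit registers are
 * used, which costs one prefix byte.  Returns the number of bytes
 * written to out (2 or 3). */
size_t encode_add_reg_reg(unsigned char *out, unsigned dst, unsigned src,
                          int wide32)
{
    size_t n = 0;

    assert(dst < 8 && src < 8);
    if (wide32)
        out[n++] = OPSIZE_PREFIX;
    out[n++] = 0x01;                                      /* ADD r/m,reg */
    out[n++] = (unsigned char)(0xC0 | (src << 3) | dst);  /* ModRM byte  */
    return n;
}
```

With dst=0 and src=3, "add ax,bx" comes out as 01 D8 and "add eax,ebx"
as 66 01 D8, i.e. one extra byte per 32-bit instruction.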

It turns out that TASM puts the necessary prefix bytes into the
instructions automatically when you use the .386 directive, and the segment
is declared as a "USE16" segment.  All the 32-bit addressing modes of the
386 work in "real mode", provided that the most significant 16-bits of the
address are zero.  A few edits were required.  I'll show this below.
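The condition on the upper address bits can be sketched in C.  This only
illustrates the rule as stated above; the function names are
hypothetical, and the code models the offset arithmetic alone, not
segment translation or the width of the access.

```c
#include <assert.h>
#include <stdint.h>

/* Offset produced by a 386 addressing mode of the form
 * disp[base + index*scale].  The sum wraps modulo 2^32, so a negative
 * displacement is passed as its two's-complement value. */
uint32_t effective_address(uint32_t base, uint32_t index, unsigned scale,
                           uint32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + disp;
}

/* Nonzero if the offset fits in 16 bits, i.e. its upper half is zero. */
int usable_in_real_mode(uint32_t ea)
{
    return (ea >> 16) == 0;
}
```

For instance, base 1000h with index 4, scale 4 and displacement 12 gives
101Ch, which is fine, while base FFF0h with index 8 and scale 4 gives
10010h, which is not.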

2. Where can I get a C compiler that will generate code that uses the 32-bit
   386 registers efficiently, but still use DOS for I/O?  

The responses to this question fell into three categories:

a. >I don't know.  I'd like to get one of these myself.  

b. >There are three compilers I know of that generate 32-bit code that
   >you run under a DOS extender: Zortech, Watcom, and MetaWare.

c. Four people suggested that MetaWare had a compiler that would generate
   code that would use 32-bit extended registers without an extender.
   I looked into this thoroughly, with results described below.

3.  Turbo C has a compiler option to generate 186/286 instructions, that
    I've never used.  What does that do for me?

>There are a few additional available instructions such as push immediate,
>push/pop all, shift by other than 1 bit, multiply immediate. Turn the option
>on, it does reduce the amount of code.

>That uses certain instructions that are available on the 186/286 but
>are not available on 8088's or 8086's.  These are primarily stack frame
>instructions such as "PUSH constant", "ENTER", and "LEAVE".  These "extra"
>instructions are available in both "real" and "protected" modes of any
>Intel 80186 or better processor (and on the V20, as well). Because C
>spends a lot of time fiddling with stacks and stack frames, this can
>make some improvement in both size and speed.  It isn't that significant
>an improvement, though, and reduces the portability of the binary.

>It does almost nothing for you.  You sacrifice compatibility with 8086/8
>machines for only slightly better code size and slightly better speed.
>The real gain would be from an ability to use the 32 bit registers on
>the '386, and no compiler vendor that I know of offers this.  I had
>high hopes for the recent C compiler from Topspeed, whose ads mentioned
>an ability to generate '386 specific code.  But a phone conversation
>with their technical staff revealed a singular lack of knowledge about
>this reputed ability, and I won't buy a pig in a poke.

In short, the 186 code option wasn't the solution.

4.  Do you have any suggestions?

>Get your compiler to output assembler (-Fa with Microsoft C) and
>hack the assembler to use 32 bit instructions. 
>Or buy the 32 bit compiler and runtime (works great but costs $900). 
>OR just program in assembler. 
>I have done all three of the above at various times.
>I think it is just plain stupid that nobody makes a C compiler that
>runs in 8086 mode but uses 32 bit [registers for] ints!!

>you can write the code in assembler using TASM which understands
>the 32 bit instruction mnemonics.

>Bug your compiler's vendor.  I have done so, with little effect.  I bet
>others have done so too, with little effect.  If enough people want it,
>maybe they'll do it.  For my purposes (and I suspect for many other
>people's purposes, too), the ability of a compiler to generate code
>specific for the '386 in full generality is not necessary.  A simple
>change that would simply add inline support for 32 bit arithmetic in
>32 bit registers would go a long way towards satisfying me.  I am
>willing to accept the code size penalty (one byte per 32 bit instruction),
>especially since one instruction will often do the work of many 16 bitters.

>It makes me mad that the compiler vendors just won't do this.  I would
>guess that the ability to generate '286 code is bolted onto existing
>compilers by peephole optimizing 8086 code, and not by a separate code
>generator.  The '386 code that I need generated for C involves long
>arithmetic and long comparisons, something that a '386 supports almost
>directly, even in 16 bit mode.  This could all be implemented via
>peephole optimization, just as the '286 code generation is.  I don't
>need 32 bit addressing; I don't insist that a variable of type "int"
>be more than 16 bits long.  I just want "long" variables to be treated
>as efficiently as they can be by the CPU.
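To show what "long" arithmetic costs on 16-bit registers, here is a
small C model of the add/adc and two-step compare sequences an 8086
compiler has to emit, checked against the single 32-bit operation a 386
can do.  The type pair16 and the function names are invented for this
sketch; it models the arithmetic only, not any particular compiler's
output.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit value held as two 16-bit halves, like DX:AX. */
typedef struct { uint16_t lo, hi; } pair16;

pair16 split32(uint32_t v)
{
    pair16 p;
    p.lo = (uint16_t)v;
    p.hi = (uint16_t)(v >> 16);
    return p;
}

uint32_t join32(pair16 p)
{
    return ((uint32_t)p.hi << 16) | p.lo;
}

/* 8086 style: add the low halves, then add the high halves plus carry. */
pair16 add32_as_pair(pair16 a, pair16 b)
{
    pair16 r;
    uint32_t low = (uint32_t)a.lo + b.lo;            /* add ax,...  */
    r.lo = (uint16_t)low;
    r.hi = (uint16_t)(a.hi + b.hi + (low >> 16));    /* adc dx,...  */
    /* 386 style: the same result from one 32-bit add. */
    assert(join32(r) == (uint32_t)(join32(a) + join32(b)));
    return r;
}

/* 8086 style unsigned "less than": compare high halves, then low. */
int less32_as_pair(pair16 a, pair16 b)
{
    if (a.hi != b.hi)
        return a.hi < b.hi;
    return a.lo < b.lo;
}
```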

>Cheapest try: If the speed problems are in just a few hot spots, just
>hand-code them in assembler using TASM or whatever so you can generate the
>32-bit instructions using prefix-byte overrides.  The prefix bytes won't
>slow you down much, at a guess.  This will be totally dos-portable, cheap,
>compatible, and probably go as fast as you can go.

Actual solution (results of my investigations):

At the suggestion of several respondents, I contacted Watcom and MetaWare
(makers of the "HIGH C" compiler).  Watcom said it was necessary to use a
DOS extender to use any code generated by their 386 compilers.  They had no
product that would generate 386 code that would run without an extender.

MetaWare makes a 386 compiler whose code must be run with an extender, and
an 8086 compiler with a "386 flag" that causes it to generate code that uses
some (a very few) 386 instructions.  They offered to take some of my C code
and compile it with 3 compilers (8086, 8086 with 386 flag, 386 protected
mode).  So I sent them one large routine, and after a while they sent back
the three .ASM files.  I publicly thank them right here for doing that.

I had high hopes for the "8086 compiler with 386 flag".  Unfortunately, the
differences between the code generated by the 8086 and 8086 with 386-flag
compilers seemed minor. The 386-flag compiler did not use the 32-bit
registers for long arithmetic, but instead used the AX and DX registers for
32-bit operations, just as the 8086 compiler does.  Here are some sample
side-by-side code differences between the two .ASM files:

        386-flag code                           8086 code (no 386 flag)
---------------------------------------------------
        or     -4[bp],dx                        or     -4[bp],dx
        or     -2[bp],ax                        or     -2[bp],ax
        movzbw ax,2[si]               |         mov    al,2[si]
        cwd                           |         sub    ah,ah
                                      >         sub    dx,dx
---------------------------------------------------
        mov    cx,8                   <
.L002c:                               |         mov    dh,dl
        shl    ax,1                   |         mov    dl,ah
        rcl    dx,1                   |         mov    ah,al
        loop   002c                   |         sub    al,al
---------------------------------------------------
                                      >         mov    dl,dh
                                      >         mov    dh,bl
                                      >         mov    bl,bh
                                      >         sub    bh,bh
        mov    cx,14                  |         mov    cx,6
.L0109:                                  .L0109:
        shr    bx,1                             shr    bx,1
        rcr    dx,1                             rcr    dx,1
        loop   0109                             loop   0109
---------------------------------------------------
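The second pair of fragments above are two ways of shifting a 32-bit
value in DX:AX left by 8 bits.  The C model below, written for this
summary and not produced by either compiler, mimics both and shows that
each equals a single 32-bit shift, which is what one "shl eax,8" would
do.  The function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* 386-flag style: "count" passes of  shl ax,1 / rcl dx,1. */
uint32_t shl_dxax_loop(uint32_t value, unsigned count)
{
    uint16_t ax = (uint16_t)value;
    uint16_t dx = (uint16_t)(value >> 16);

    assert(count <= 32);
    while (count-- > 0) {
        unsigned carry = (ax >> 15) & 1u;      /* bit leaving AX       */
        ax = (uint16_t)(ax << 1);              /* shl ax,1             */
        dx = (uint16_t)((dx << 1) | carry);    /* rcl dx,1 (takes CF)  */
    }
    return ((uint32_t)dx << 16) | ax;
}

/* 8086 style: shift by 8 using byte-register moves. */
uint32_t shl8_byte_moves(uint32_t value)
{
    uint8_t al = (uint8_t)value;
    uint8_t ah = (uint8_t)(value >> 8);
    uint8_t dl = (uint8_t)(value >> 16);
    uint8_t dh;

    dh = dl;        /* mov dh,dl */
    dl = ah;        /* mov dl,ah */
    ah = al;        /* mov ah,al */
    al = 0;         /* sub al,al */
    return ((uint32_t)dh << 24) | ((uint32_t)dl << 16)
         | ((uint32_t)ah << 8) | al;
}
```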

The code produced by the 386 compiler was, by comparison, delightful to read.  
It used 32-bit registers both for addressing and for integer arithmetic.

The idea occurred to me to take the 386 protected mode code and assemble it
with TASM for linking with other "real mode" code, after editing it
slightly.  That idea worked.  I took the sample 386 assembly code that
MetaWare sent me, and made a few edits, shown below.  

****
OLD:         extrn	_mwargstack	; unreferenced
OLD: CGROUP  group	_text
OLD: _text   segment
NEW:         .386	; tell TASM to generate 386 code
NEW: _TEXT   segment	DWORD PUBLIC USE16 'CODE'
NEW: DGROUP  group	_TEXT		; Turbo C segment naming conventions
NEW:         assume	cs:_TEXT,ds:DGROUP,ss:DGROUP
Explanation:  USE16 tells TASM to generate the prefix bytes needed to run in
real mode whenever it encounters extended registers (e.g. eax).
****
OLD: _f386   proc    	near
NEW: _f386   proc    	far
Explanation: see (*) below.
****
OLD:         shr     	eax
OLD:         shl     	edx
NEW:         shr     	eax,1
NEW:         shl     	edx,1
Explanation: add missing ",1" to all one bit shifts.
****
OLD:         dec     	-8[ebp]
NEW:         dec     	dword ptr -8[ebp]
Explanation: dword pointer is not the default in TASM.
****
OLD:         leave
NEW:         mov     	esp,ebp
NEW:         pop     	ebp
Explanation:  I could not get TASM to put the necessary prefix byte in front
of the leave instruction, so I coded the equivalent two instructions.
****
OLD: _text   ends
NEW: _f386   endp	; missing
NEW: _TEXT   ends	; Turbo C naming conventions
****
(*) Explanation:  The protected mode code assumes all parameters in the stack
are 32 bits.  Near pointers are 32-bit segment offsets.  The (near) return 
address is 32-bits.  All the stack offsets (e.g. 12[ebp]) in the code are 
computed with these assumptions.  I didn't want to go through all the
assembler code and change all the stack offsets, so I changed the way this
procedure is called from Turbo C, to make sure it matched the conventions.
To do that, I made it a far procedure (even though it's in the tiny model).
That ensured that 32-bits of return address got pushed.  I made the
following change in the Turbo C declaration of the function so that the
parameters on the stack would be as expected by the assembler code.

#ifdef old	/* before converting to 386 asm code */
int func1(
	unsigned char *  p1,
	unsigned char *  p2,
	unsigned long *  p3,
	int              p4);
#else		/* code to match 386 asm calling conventions */
long far f386(
	unsigned long     p1,
	unsigned long     p2,
	unsigned long     p3,
	long              p4);
/* FP_OFF is declared in <dos.h>; it yields the 16-bit offset, widened here */
#define n2ul(a) ((unsigned long) FP_OFF( &(a)[0] ))
#define func1(a,b,c,d) f386( n2ul(a), n2ul(b), n2ul(c), (d) )
#endif
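The stack arithmetic behind the (*) note can be checked with a few lines
of C.  This sketch assumes the MetaWare code opens with the usual
"push ebp / mov ebp,esp" prologue (the prologue is not shown in this
post, though the leave epilogue suggests it), so the saved EBP takes 4
bytes.  param_offset is a hypothetical helper, not part of any library.

```c
#include <assert.h>

enum { SAVED_EBP_BYTES = 4, PARAM_BYTES = 4 };

/* Offset from EBP of the index-th (0-based) 32-bit parameter, given how
 * many bytes of return address the call instruction pushed:
 *   2 = real-mode near call (IP only)
 *   4 = real-mode far call (CS and IP), or 32-bit near call (EIP) */
unsigned param_offset(unsigned return_addr_bytes, unsigned index)
{
    assert(return_addr_bytes == 2 || return_addr_bytes == 4);
    return SAVED_EBP_BYTES + return_addr_bytes + PARAM_BYTES * index;
}
```

With a far call the first parameter sits at 8[ebp] and the second at
12[ebp], the same as in 32-bit protected mode.  With a real-mode near
call the first parameter would be at 6[ebp], and every stack offset in
the file would be off by two.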

Summary:  Generate 32-bit code that will run under DOS, and can be linked
with Turbo C code in the tiny and small models, as follows:
1. Compile with protected mode 386 compiler into a .ASM file.
2. Edit the .ASM file, changing segment naming conventions (especially USE16),
   pointer sizes, far proc, and the leave instruction.
3. Assemble with TASM (or MASM, I suppose)
4. link with other code.
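Of the step 2 edits, adding the missing ",1" to one-bit shifts is
mechanical enough to script.  The following C function is a rough
sketch of such a filter, not the procedure I actually used (I edited by
hand).  The name add_shift_count is invented, and the sketch assumes
lower-case mnemonics with no label or comment on the same line.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy one line of assembler source to out, appending ",1" if it is a
 * shift or rotate with a single operand (e.g. "shr eax").  Trailing
 * white space, including any newline, is dropped from a rewritten line.
 * Returns 1 if the line was rewritten, 0 if it was copied unchanged. */
int add_shift_count(const char *line, char *out, size_t out_size)
{
    static const char *const shifts[] = {
        "shl", "shr", "sal", "sar", "rol", "ror", "rcl", "rcr"
    };
    const size_t nshifts = sizeof shifts / sizeof shifts[0];
    const char *p = line;
    size_t i, len;

    assert(out_size >= strlen(line) + 3);   /* room for ",1" and NUL */
    strcpy(out, line);

    while (isspace((unsigned char)*p))      /* skip leading blanks */
        p++;
    for (i = 0; i < nshifts; i++)
        if (strncmp(p, shifts[i], 3) == 0 && isspace((unsigned char)p[3]))
            break;
    if (i == nshifts)
        return 0;                           /* not a shift or rotate */

    p += 3;
    while (isspace((unsigned char)*p))
        p++;
    if (*p == '\0' || strchr(p, ',') != NULL)
        return 0;                           /* no operand, or count given */

    len = strlen(out);
    while (len > 0 && isspace((unsigned char)out[len - 1]))
        len--;
    strcpy(out + len, ",1");
    return 1;
}
```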

I want to thank the following respondents for their comments:
> 6600m00@nucsbuxa.ucsb.edu (Rob)
> Anthony Scian <afscian@watmsg.waterloo.edu>
> Jeff Prothero <jsp@milton.u.washington.edu>
> Mark Alexander <alexande@dri.com>
> Norbert Schlenker <nfs@Princeton.EDU>
> g9023690@wolfen.cc.uow.edu.au (Phillip Secker)
> jme@pacer.Pacer.COM (John Eikanger)
> jpn@genrad.com (John P. Nelson)
> mcdonald@aries.scs.uiuc.edu (Doug McDonald)
> ralerche@lindy.Stanford.EDU (Robert A. Lerche)
> shaban@bu-pub.bu.edu (Marwan Shaban)
> toma@tekgvs.labs.tek.com (Tom Almy)
> uchida@flab.fujitsu.co.jp (Yoshiaki Uchida)
-----------------------------------------------------------------------------
Nelson Bolyard      nelson@sgi.COM      {decwrl,sun}!sgi!whizzer!nelson
Disclaimer: Views expressed herein do not represent the views of my employer.
-----------------------------------------------------------------------------