[comp.sys.ibm.pc] MASM quirk, is this a bug or my problem?

lotto@wjh12.UUCP (04/05/87)

The following code fragment was assembled using MASM 4.0, linked with the
linker provided in that package and then exe2bin'ed to a .COM file with
the DOS 3.2 incarnation of that utility.

	Title	FOO

Cseg	segment para 'CODE'
	org	100H
	assume	CS:Cseg, DS:Cseg, ES:Cseg
Start:	jmp	Begin

Label	db	'Post no bills'

Begin:	lea	DX, Cseg:Label
	.
	.
	.
Cseg	ends
	end	Start

When I looked at the resulting code, the line at label Begin had become:
	LEA	DX, [0003]
which is 100H before the actual location of Label. The same (incorrect)
results were obtained with the source line:
Begin:	lea	DX, CS:Label
Removing the explicit segment reference to make:
Begin:	lea	DX, Label
gave the correct result:
	LEA	DX, [0103]
My understanding of the ASSUME directive is that labels located
in Cseg should assume the use of CS: as the segment register.
No changes are observed with the segment declaration, assume
directive and org assignment in any order!
Why is this a problem? Thanks.

doug@edge.UUCP (Doug Pardee) (04/14/87)

> My understanding of the ASSUME directive is that labels located
> in Cseg should assume the use of CS: as the segment register.

It's been a few months since I've had MASM available, so I can't check
into the original question.  But I can provide some basic info:

Caution:  This area is extremely complicated.  I don't pretend to understand
/all/ of it.

Basic premise:  any memory location in a PC (or other iAPX86 CPU) can be
addressed by any of 4096 different combinations of segment:offset addresses.
Whenever you try to reference "the address of a memory byte", you are forcing
the assembler to select one of those 4096 possible addresses.  Which one it
chooses is of great importance if your code treats the segment and offset
values separately, as it does in the nearly-every-instruction practice of
keeping a segment value in a segment register and then referencing the
data item by offset.
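
To make the "4096 combinations" figure concrete, here is a small Python sketch of the address arithmetic. It assumes the real-mode 8086 rule (physical address = segment * 16 + offset, wrapped to 20 bits); the function name `aliases` is made up for this illustration and is not part of any tool.

```python
def aliases(phys):
    """List every (segment, offset) pair that reaches physical address `phys`.

    Assumes 8086 real-mode addressing: (segment * 16 + offset) mod 2**20.
    """
    pairs = []
    # The offset must agree with the address in its low 4 bits.
    for off in range(phys & 0xF, 0x10000, 16):
        seg = ((phys - off) & 0xFFFFF) >> 4
        pairs.append((seg, off))
    return pairs


pairs = aliases(0x12345)
print(len(pairs))                 # number of segment:offset aliases
print((0x1234, 0x0005) in pairs)  # 1234:0005 is one of them
print((0x1000, 0x2345) in pairs)  # and so is 1000:2345
```

Each alias has a different offset, which is why the choice matters as soon as the segment value and the offset are handled separately.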

Fortunately, things aren't quite as bleak as this looks.  There are usually
only 2 of the 4096 "aliases" that the assembler will pick.  Each symbol which
is associated with a memory item has a default Segment:Offset based on the
segment in which it was defined.  [There are two variations, discussed later.]
And, if the default segment is declared as part of a GROUP, then there is
also the associated Group:Offset combination which could be used as an address.

The method used by the assembler to determine whether to use Segment:Offset,
Group:Offset, or some other oddball thing is based on the type of instruction.

 a) Branching instructions cannot have segment prefix bytes, so always use CS.
    Consequently, the offsets are computed based upon whatever was ASSUMEd for
    CS, *even if this is unusable*.  Make *dang* sure that any branch
    instruction has the appropriate ASSUME CS: ahead of it.
 b) The LEA is a special instruction.  Its operand is always an offset, and
    that offset is computed based upon whatever was ASSUMEd for DS, even if
    this is unusable.  This can be overridden.
 c) Normal instructions just referencing memory -- Take the first usable offset
    from this list:  ASSUME DS:Group, ASSUME ES:Group, ASSUME SS:Group,
    ASSUME CS:Group, ASSUME DS:Segment, ASSUME ES:Segment, ASSUME SS:Segment,
    ASSUME CS:Segment. (Or is SS before ES? I forget).  This can be overridden.
 d) Instructions referencing SEG x or OFFSET x, like MOV AX,OFFSET memloc --
    Use Segment:Offset.  This can be overridden.  And you had darned well
    better override it if the data item is in a GROUP, because then it's
    almost certain you wanted Group:Offset, not Segment:Offset.
 e) Data definitions (address constants), same as (d) above.  Same warnings.
 f) Operands with segment overrides are taken as written.  If a segment
    register is specified, then that forces the Segment Prefix byte as well
    as causing the address to be computed based on whatever was ASSUMEd for
    that segment register, even if this is unusable.
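
Rule (c) above amounts to a priority search, which can be sketched in Python as follows. This only models the list as recalled in this post (including the uncertain DS, ES, SS, CS order); it is not taken from MASM documentation, and `pick_register` and its arguments are hypothetical names for the illustration.

```python
# Register order as recalled in rule (c); the ES/SS order is uncertain.
ORDER = ["DS", "ES", "SS", "CS"]


def pick_register(assumes, segment, group=None):
    """Return (register, base) for the first usable ASSUME, or None.

    assumes: dict mapping a segment register name to the segment or
             group name it was ASSUMEd to hold.
    segment: name of the segment the symbol was defined in.
    group:   name of the group containing that segment, if any.
    """
    # Group-based ASSUMEs are tried first, then segment-based ones.
    for base in (group, segment):
        if base is None:
            continue
        for reg in ORDER:
            if assumes.get(reg) == base:
                return reg, base
    return None


# The ASSUME line from the original posting: CS, DS and ES all on Cseg.
assumes = {"CS": "Cseg", "DS": "Cseg", "ES": "Cseg"}
print(pick_register(assumes, "Cseg"))
```

With the ASSUME line from the original posting, this model picks DS, so no segment prefix byte would be needed for an ordinary memory reference.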

When I say that a computed offset is "usable" or "unusable", I refer to the
fact that the assembler actually leaves much of this work up to the linker.
It basically tells the linker, "This instruction references the memory
location which can be called Segment:Offset, but it uses XXXX instead of
segment, so figure out what the offset for XXXX:Offset is and use that
value instead."  This can cause bizarre happenings, because the linker is
perfectly happy referencing something in segment Y but using the totally
unrelated segment X as a base, as long as Y is after X and the referenced
location is within 64K of the beginning of X.  One day you change something,
and there's more than 64K, and you get a linker error message whose source
you can't figure out :-)
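
The arithmetic behind "figure out what the offset for XXXX:Offset is" can be sketched in Python. This is only the address calculation described above, not a description of how LINK is actually implemented; `offset_from_frame` and the paragraph numbers are made up for the example.

```python
def offset_from_frame(target_para, target_off, frame_para):
    """Offset of target_para:target_off measured from paragraph frame_para.

    Raises ValueError when the target lies before the frame or more
    than 64K past its start, the case where a linker has to complain.
    """
    linear = target_para * 16 + target_off
    delta = linear - frame_para * 16
    if not 0 <= delta <= 0xFFFF:
        raise ValueError("target is not within 64K of the frame")
    return delta


# Hypothetical layout: segment Y starts 100H bytes after segment X.
X, Y = 0x1000, 0x1010
print(hex(offset_from_frame(Y, 0x0003, Y)))  # offset relative to Y itself
print(hex(offset_from_frame(Y, 0x0003, X)))  # same byte, relative to X
```

The same byte gets a different offset depending on which base is used, and the reference only works while the distance from that base stays under 64K.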

How can this happen?  Well, from an explicit segment override, for one thing.
But a sneakier way is from one of those 2 variations I mentioned some
paragraphs back, about how each memory-type symbol has a default segment
associated with it.

External symbols in MASM come in 2 types.  Regular externals do *not* have
a segment associated with them; the assembler cannot compute any offsets and
must leave the whole thing to the linker.  These are fine for procedure
labels, but trying to reference them as data items is a b**ch, because you
gotta deal with both the segment and offset addresses whenever you want to
reference it.  ASSUME statements can't help you here at all.

For most data items, you the programmer know which segment they're in.  In
fact, for smaller programs, you might well have all of your data, including
some or all externals, accessible off of DGROUP.  So by placing the "EXTRN"
directive *inside* the confines of the appropriate "SEGMENT/ENDS" pair, you
tell the assembler that it can presume it knows which segment the symbol is
in.  Now you can reference the symbol directly (if DS is ASSUMEd to that
segment or group) with no fuss.  But if you *lied* to the assembler about
which segment the memory was in, you can have linker trouble.

The other variation?  Labels defined with a colon (jump-type labels) are
not exactly defined to be in the current segment as defined by "SEGMENT/ENDS".
They're defined to be in the segment which is ASSUMEd for CS.  Usually,
you've assumed the current segment in CS, so you don't notice the difference.

If you followed all this, you're better than I am...

-- Doug Pardee -- Edge Computer Corp. -- Scottsdale, Arizona

bill@hpcvlo.UUCP (04/16/87)

I just tried assembling your code using Microsoft MASM 4.0 and LINK 3.51,
and it worked just fine.  Here's the listing that masm generated; I also
verified that the resulting .COM file, after going through EXE2BIN, was
correct ...


Microsoft (R) Macro Assembler  Version 4.00                 4/16/87 07:57:53

foo                                                         Page     1-1
                                                            

                                	title	foo 
 0000                           cseg	segment	para 'code' 
 0100                           	org	100h 
                                	assume	cs:cseg,ds:cseg,es:cseg 
                                 
 0100  EB 0E 90                 start:	jmp	begin 
                                 
 0103  50 6F 73 74 20 6E 6F     label	db	'Post no bills' 
       20 62 69 6C 6C 73        
                                 
 0110  8D 16 0103 R             begin:	lea	dx,cseg:label 
 0114                           cseg	ends 
                                	end	start 
Microsoft (R) Macro Assembler  Version 4.00                 4/16/87 07:57:53

foo                                                         Symbols-1
                                                             

Segments and Groups:

                N a m e         	Size	Align	Combine Class

CSEG . . . . . . . . . . . . . .  	0114	PARA	NONE	'CODE'

Symbols:            

                N a m e         	Type	Value	Attr         

BEGIN  . . . . . . . . . . . . .  	L NEAR	0110	CSEG

LABEL  . . . . . . . . . . . . .  	L BYTE 	0103	CSEG

START  . . . . . . . . . . . . .  	L NEAR	0100	CSEG


     12 Source  Lines
     12 Total   Lines
     26 Symbols

  42226 Bytes symbol space free

      0 Warning Errors
      0 Severe  Errors




Bill Frolik
hp-pcd!bill
Hewlett-Packard Portable Computer Division
Corvallis, Oregon