[comp.sys.ibm.pc] SEGMENT:OFFSET Madness

cliffhanger@cup.portal.com (Cliff C Heyer) (01/12/90)

I wish Microsoft and MSDOS book writers would get their 
terms straight. For example, let's take "segment:offset". 
The sensible thing to do would be to use these words 
the way they are usually used in the English language, 
rather than invent "new" meanings. Specifically, segment 
means a 64K "segment" and offset means "offset within a 
64K segment." But I guess this was "too easy" for 
whoever made the word decision. So I'll take this 
opportunity now to set the matter straight.

First of all, "segment" in segment:offset is NOT a 
segment. In fact, the number is meaningless in itself. 
The same goes for "offset". Let me explain.

To get the "real" segment, you must multiply (base 10 in 
this example) the so-called segment by 16 and add the 
offset. Then you must subtract 65,536 from this total 
repeatedly and stop just before you get a negative 
number. The number of times you subtract is the "real" 
segment number, and your remainder is the "real" offset 
address.

In hex, you multiply the so-called segment by 10h and 
add the offset. The first hex digit of your five-digit 
result is the "real" segment, and the "real" offset is 
the remaining four hex digits. To be true to standard 
"segment:offset" word use, 39D3:0ECE should be written 
3:ABFE or 3:44030 base 10.
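
The arithmetic above can be checked with a short sketch. Python is used
only for illustration, and the names linear_address and split_64k are
made up for this example. The 20-bit mask models the 8086's address bus.

```python
def linear_address(segment, offset):
    """20-bit real-mode address: segment * 10h + offset."""
    return ((segment << 4) + offset) & 0xFFFFF  # the 8086 has 20 address lines

def split_64k(address):
    """Split a linear address into a 64K block number and the offset inside it."""
    return divmod(address, 0x10000)

addr = linear_address(0x39D3, 0x0ECE)
block, rest = split_64k(addr)
print(f"{addr:05X}")             # 3ABFE
print(f"{block:X}:{rest:04X}")   # 3:ABFE
print(rest)                      # 44030
```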

If people are smart enough to invent computers, they 
ought to be smart enough to get word use straight. I 
don't know what to call segment:offset, but certainly 
NOT segment:offset. 

I realize that it is more efficient to convert 
"segment:offset" in its present form to a 20-bit real 
address than to store the "real" segment number 1-16 in 
the segment register. This is because currently only a 
bit-shift need be done to multiply the segment by 10h. 
With the real segment number in there, you'd have to 
multiply it by 65536 each time to get the segment 
address and consume more CPU cycles. However, this does 
not justify continued use of the same words.

Perhaps an astute observer could offer an explanation 
that would more easily allow conceptualization of what 
the current "segment:offset" really represents.

Ralf.Brown@B.GP.CS.CMU.EDU (01/12/90)

In article <25821@cup.portal.com>, cliffhanger@cup.portal.com (Cliff C Heyer) wrote:
}I realize that it is more efficient to convert 
}"segment:offset" in its present form to a 20-bit real 
}address than to store the "real" segment number 1-16 in 
}the segment register. This is because currently only a 
}bit-shift need be done to multiply the segment by 10h. 
}With the real segment number in there, you'd have to 
}multiply it by 65536 each time to get the segment 
}address and consume more CPU cycles. However, this does 
}not justify continued use of the same words.

Multiply by 65536 is also a bit-shift (16 bits).
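
As a quick check of that point, here is a minimal sketch (Python, chosen
only for illustration; the variable names are arbitrary) showing that
both multiplications are plain left shifts:

```python
segment = 0x39D3   # a paragraph number, as held in a segment register
block = 0x3        # a hypothetical "real" 64K block number

assert segment * 0x10 == segment << 4    # times 10h is a 4-bit shift
assert block * 65536 == block << 16      # times 65536 is a 16-bit shift
print(hex(segment << 4), hex(block << 16))
```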

}Perhaps an astute observer could offer an explanation 
}that would more easily allow conceptualization of what 
}the current "segment:offset" really represents.

Many mainframes have what is known as "base registers".  To specify a memory
address, you give a base register and an offset from that base register.
This is precisely what the Intel 80x86 processors do.  The segment register
is loaded with a base address, and the offset specifies the distance (up to
64K) from that base address.  That is also why programs don't use a 20-bit
linear address, because they would simply have to split it up again every
time it is needed.

A major reason for having base registers (and one reason for the Intel
segment registers) is to support an address space which takes more bits than
are present in a register.  If Intel had defined the segment to represent
"address * 256", the 8086 would have been able to support 16M of address
space at the cost of more wasted RAM (due to the graininess of only being
able to start a segment every 256 bytes).
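
The tradeoff can be put in numbers with a small sketch. This is only an
illustration in Python; segment_scheme is a made-up helper, and it assumes
a 16-bit segment register whose value is shifted left before the offset is
added.

```python
REGISTER_BITS = 16

def segment_scheme(shift):
    """Return (address space in bytes, segment granularity in bytes) when the
    segment register is shifted left by `shift` bits before adding the offset."""
    address_bits = REGISTER_BITS + shift
    return 1 << address_bits, 1 << shift

for shift in (4, 8):   # 4 is what the 8086 does; 8 is the "times 256" alternative
    space, grain = segment_scheme(shift)
    print(f"shift {shift}: {space // 1024}K address space, "
          f"segments every {grain} bytes")
```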

Another major reason for having base registers is to allow easy relocation
of code.  If programs always used linear addresses, every memory reference
in a program would have to be patched by the program loader to allow the
program to execute at a different position in memory (on machines with
virtual memory, you can play with the memory mapping instead, but the Intel
family in real mode do not have virtual memory).  When you have base registers,
the program loader can simply set the base register to the starting location
of the program in memory, and the program automatically references the proper
real memory locations by specifying the offset from the base register.  This
is how tiny model (.COM) programs work under MSDOS.  Other memory models
(.EXE) need to be patched by the loader, but only need the segment register
loads patched, not every single memory reference.
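
The relocation argument can be illustrated with a toy model. The sketch
below is Python and purely hypothetical: MEMORY, load and read_byte are
invented names, and the "program" is just five bytes. It relies only on
the fact that a .COM image begins at offset 100h of its segment.

```python
MEMORY = bytearray(1 << 20)   # 1 MB of real-mode address space

def load(image, segment, offset=0x100):
    """Copy a program image to segment:offset."""
    start = (segment << 4) + offset
    MEMORY[start:start + len(image)] = image

def read_byte(segment, offset):
    """What a program reads at `offset` through its segment register."""
    return MEMORY[(segment << 4) + offset]

image = b"HELLO"
for load_segment in (0x1000, 0x2345):   # two arbitrary load addresses
    load(image, load_segment)
    # The offsets used by the program never change; only the segment does.
    print(hex(load_segment),
          chr(read_byte(load_segment, 0x100)),
          chr(read_byte(load_segment, 0x104)))
```

Nothing inside the image had to be patched between the two loads, which
is the tiny-model case described above.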

As for naming, "segment" implies a portion.  From the American Heritage
Dictionary:
  ENTRY     segment (SEG'muhnt) n.
  MEANING   1. Any of the parts into which something can be divided.  2. Math. A
As used by Intel for the 8086, "segment" means any portion of memory up to 64K
in size, starting at any address which is a multiple of 16 (under protected
mode on the 80286, a segment has a descriptor which specifies its 24-bit linear
starting address [it can start on any byte] and its length [1 byte to 64K
bytes], and trying to access beyond the defined length causes a protection
violation error).
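
A rough model of that protected-mode check, as a sketch only: the class
and exception names below are invented, and the descriptor is reduced to
the two fields mentioned above (real descriptors carry more, such as
access rights).

```python
class ProtectionViolation(Exception):
    pass

class Descriptor:
    def __init__(self, base, length):
        assert 0 <= base < (1 << 24)      # 24-bit start, any byte address
        assert 1 <= length <= 0x10000     # 1 byte to 64K bytes
        self.base = base
        self.length = length

    def translate(self, offset):
        """Return the linear address for `offset`, or raise if it is out of range."""
        if not 0 <= offset < self.length:
            raise ProtectionViolation(
                f"offset {offset:#x} outside segment of {self.length:#x} bytes")
        return self.base + offset

seg = Descriptor(base=0x123457, length=0x200)   # not on a 16-byte boundary
print(hex(seg.translate(0x1FF)))                # 0x123656
```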

--
UUCP: {ucbvax,harvard}!cs.cmu.edu!ralf -=- 412-268-3053 (school) -=- FAX: ask
ARPA: ralf@cs.cmu.edu  BIT: ralf%cs.cmu.edu@CMUCCVMA  FIDO: Ralf Brown 1:129/46
"How to Prove It" by Dana Angluin              Disclaimer? I claimed something?
14. proof by importance:
    A large body of useful consequences all follow from the proposition in
    question.

johnl@esegue.segue.boston.ma.us (John R. Levine) (01/12/90)

In article <25821@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>To get the "real" segment, you must multiply (base 10 in 
>this example) the so-called segment by 16 and add the 
>offset. Then you must subtract 65,536 from this total 
>repeatedly and stop just before you get a negative 
>number. The number of times you subtract is the "real" 
>segment number, and your remainder is the "real" offset 
>address.

I suppose that's one way to look at it, but I find it more convenient
to write my programs so that the segments I use are disjoint, keeping
the knowledge of how segments map to linear addresses restricted to
the memory allocator.  Intel in its programming manuals has always
urged people to do that.

If you look at the 286 and 386, you'll find that in protected mode
the segments really are segments, and there is no architectural
relationship between segment N and segment N+1 (well, actually N+8,
but that's a separate argument.)  If your programs are written believing
that segments are segments, they are relatively straightforward to port
to a protected environment.  If you wire in knowledge of the 8086's
16-byte paragraphs, you're in trouble.
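
The "N+8" remark follows from the selector layout: the low three bits of
a protected-mode selector hold the table indicator and the requested
privilege level, so the descriptor index starts at bit 3. A minimal
sketch (Python for illustration; decode_selector is a made-up name):

```python
def decode_selector(selector):
    """Split a 16-bit protected-mode selector into its three fields."""
    return {
        "index": selector >> 3,                       # descriptor table entry
        "table": "LDT" if selector & 0x4 else "GDT",  # table indicator bit
        "rpl": selector & 0x3,                        # requested privilege level
    }

# Selectors 8 apart name consecutive descriptors in the same table.
for sel in (0x0008, 0x0010, 0x0018):
    print(f"{sel:#06x}", decode_selector(sel))
```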

I am no fan of 16-bit segments and offsets, but if you have to deal
with them, you might as well make the best of it.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."

pipkins@qmsseq.imagen.com (Jeff Pipkins) (01/13/90)

What you are saying is true, for the 8086.  I suppose a better term
would be base:displacement.  But any way you slice it, the terminology is
the least of the things they *SCREWED*UP*, imho.

With the advent of protected mode on the '286 and V86 mode on the '386,
the value in the segment register really is a segment number, but a segment
is no longer (necessarily) 64k bytes.  It is a segment selector.  This also
presents real problems (no pun intended).

On the 8086, before there was an 80286, it was perfectly legitimate to 
normalize the address so that the segment was between 0 and F.  Programs
that use this technique will not run under '286 protected mode or V86
mode on the '386.  This is the main reason that the term "DOS INcompatibility
box" was coined.

[My employer may not share my opinions.  Insert your favorite disclaimer here.]

jdudeck@polyslo.CalPoly.EDU (John R. Dudeck) (01/13/90)

In article <25821@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>Perhaps an astute observer could offer an explanation 
>that would more easily allow conceptualization of what 
>the current "segment:offset" really represents.

I share your frustration with trying to understand the 80x86 architecture.
A lot of why it is this way has to do with the wonders of evolution...

Mentally I think of the offset as being exactly what it sounds like, the 
displacement into the segment starting from some base address.

And a segment can be visualized as a 64k "window" into the address space of
the memory.  The starting address of that window is the value in the
segment register being used.  Since segment registers are only 16 bits long,
and since the address space is 20 bits wide (in "real" mode), the segment
register just contains the 16 most significant bits of the base address of
the "window", and the 4 lsb's are always 0, resulting in the situation
that segments are aligned on 16-byte boundaries.
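
The "window" picture can be sketched in a few lines. This is an
illustration in Python with invented helper names; it ignores wraparound
at the top of the 1 MB space. It also shows that windows overlap, so many
segment:offset pairs reach the same byte.

```python
def window(segment):
    """First and last linear address reachable through `segment`."""
    base = segment << 4          # 16 msb's from the register, 4 lsb's are 0
    return base, base + 0xFFFF

def aliases(address, count=3):
    """A few of the segment:offset pairs that reach the same linear address."""
    top = address >> 4
    return [(top - i, (address & 0xF) + (i << 4)) for i in range(count)]

first, last = window(0x39D3)
print(f"{first:05X}-{last:05X}")    # 39D30-49D2F
for seg, off in aliases(0x3ABFE):
    print(f"{seg:04X}:{off:04X}")
```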

Of course when you go into "protected" mode, or to the 386, the picture 
changes again...

Really, I don't see too much point in bashing the design decisions made
by the designers.  Every cpu ever designed is a combination of tradeoff
decisions.  I do feel it was too bad that IBM chose the 8088 for the PC.
The National Semiconductor 16032 would have been a much better choice...
or the 32032 even better yet!

-- 
John Dudeck                           "You want to read the code closely..." 
jdudeck@Polyslo.CalPoly.Edu             -- C. Staley, in OS course, teaching 
ESL: 62013975 Tel: 805-545-9549          Tanenbaum's MINIX operating system.

rob@prism.TMC.COM (01/16/90)

pipkins@qmsseq.UUCP writes:

>With the advent of protected mode on the '286 and V86 mode on the '386,
>the value in the segment register really is a segment number, but a segment
>is no longer (necessarily) 64k bytes.  It is a segment selector.  This also
>presents real problems (no pun intended).

>On the 8086, before there was an 80286, it was perfectly legitimate to 
>normalize the address so that the segment was between 0 and F.  Programs
>that use this technique will not run under '286 protected mode or V86
>mode on the '386. 

   Perhaps a nit, but in V86 mode on the 386/486, the segment registers
aren't treated as protected-mode style selectors, but as real-mode style
pointers to memory paragraphs (though the physical memory they point to
can be altered via the paging tables). That's the 'magic' of V86 mode;
it lets real mode programs run under protected mode by using real-mode
segment translation. As a result, the normalization you mentioned (which 
is used in accessing 'huge' arrays) is acceptable under V86 mode, though 
not under standard 286/386/486 protected mode.
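
For reference, here is a sketch of one common form of real-mode
normalization, in which the offset is reduced to the range 0..Fh. It is
written in Python for illustration and the function names are made up;
it assumes 8086-style paragraph arithmetic, which is exactly what
protected-mode selectors do not support.

```python
def normalize(segment, offset):
    """Fold as much of the offset as possible into the segment (offset 0..Fh)."""
    address = ((segment << 4) + offset) & 0xFFFFF
    return address >> 4, address & 0xF

def advance(segment, offset, nbytes):
    """Step a far pointer forward by nbytes and renormalize."""
    return normalize(segment, offset + nbytes)

seg, off = normalize(0x39D3, 0x0ECE)
print(f"{seg:04X}:{off:04X}")       # 3ABF:000E
seg, off = advance(seg, off, 0x10000)
print(f"{seg:04X}:{off:04X}")       # 4ABF:000E
```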

   Ideally, address normalization should be unnecessary under 386/486 
protected mode anyway, since it allows segments to be arbitrarily large.
A type of normalization can be performed under 286 protected mode if
the operating system sets up descriptor tables the right way (OS/2
does this with the various DosGetHuge...() functions). It's slow and
cumbersome, but it works.
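
The idea behind that descriptor-table arrangement can be sketched as
follows. This is a hypothetical model, not the OS/2 API: huge_element and
SELECTOR_STEP are invented names, and the step of 8 is an assumption
(adjacent descriptors in one table); a real system would obtain the
spacing from the operating system.

```python
SELECTOR_STEP = 8   # assumption: consecutive descriptors, same table and RPL

def huge_element(base_selector, byte_index, step=SELECTOR_STEP):
    """Map a byte index into a multi-segment object to (selector, offset)."""
    chunk, offset = divmod(byte_index, 0x10000)   # which 64K piece, and where
    return base_selector + chunk * step, offset

sel, off = huge_element(0x0047, 0x2ABCD)
print(f"{sel:04X}:{off:04X}")    # 0057:ABCD
```

Pointer arithmetic has to go through a helper like this instead of adding
to the segment value directly, which is part of why it is slow.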

cliffhanger@cup.portal.com (Cliff C Heyer) (01/17/90)

Well, I guess I was blowin' off some steam when I posted 
that article. Gotta be more careful.

I was the victim of vague books (Howard Sams Co.) 
which discuss a "segment" as a 64KB chunk only. 
Nowhere do they say that a segment refers to 16B 
chunks and that "offset" is NOT the offset within 
the "current segment" but is really just a pointer 
to any location above the base "address" of the 
current segment (segment X 10h).

Thanks for your assistance.

Cliff