maarten@uva.uucp (Maarten Carels) (08/11/86)
After our previous posting, we found out a lot more about the format of
MS Word document files. By now, all parts of the file are known, not
only the function of the part, but also the almost complete internal
structure. A new version of the document (BinHexed MS Word format) is
posted to net.sources.mac. For those of you who do not have MS Word
(what could they do with this description -:)), a text only version
follows this posting.
New parts discovered are mostly related to the division structure of a
MS Word document. Almost all details related to this part are described.
Some small errors in the first version of the document were corrected.
We encourage everyone to make use of the information provided, but we
expect you to be fair and place every program you develop based upon
this information in the public domain, or at least make it a shareware
product. (Free for us of course !). Also we would like anybody who
discovers new facts about the format, or who finds errors in our
explanation to inform us.
(-------------- Cut Here ------------------)
MicroSoft Word file format
second, revised edition
M.J. Carels,
A.G. Starreveld
Department of Computer Science
University of Amsterdam
e-mail: {decvax, philabs, seismo}!mcvax!uva!{maarten,dolf}
s-mail: Kruislaan 409, 1098 SJ Amsterdam, The Netherlands
1. Introduction.
This document describes the structure of the files produced by MicroSoft
WORD (versions 1.00 and 1.05). These files are of type TWDBNU. This
knowledge has been gathered by looking at such files, and trying to
interpret the bytes in the file. All information is believed to be correct,
but no responsibility for errors or omissions can be taken. Please let us
know if you find errors in this document, or if you find more about the
structure of the file. The file format used by MS Word for other computers
than Macs may, or may not be the same.
We will describe all structures of the file format by means of C structure
declarations, as this is a convenient way to describe things. In the
structures some basic types appear. These are:
ubyte 8 bits unsigned number
byte 8 bits signed number
ushort 16 bits unsigned number
short 16 bits signed number
ulong 32 bits unsigned number
long 32 bits signed number
No alignment is to be assumed, the only bytes present are the ones
described. In all structures byte addresses (decimal) are included as
comments.
2. General file structure.
A MS Word file of type TWDBNU can be divided into seven different main
parts. We will call these parts the RheaderS, RtextS, Rcharacter formatsS,
Rparagraph formatsS, Rdivision blocksS, Rdivision listS and Rpage listS
respectively. These parts appear in the above mentioned order in the file's
data fork. The resource fork is always empty, i.e. not allocated.
The file can be seen as being built from basic blocks, each 128 bytes long.
This is the case for all parts of the file, although it does not appear to be
significant for the text part. This implies that the size of an MS Word file
is always a multiple of 128 bytes, though within each of the parts
mentioned, the last bytes in the last block belonging to a certain part may
(and usually will) contain garbage.
In several sections sizes, dimensions and distances are present. In all such
fields these dimensions are given in Tbasic unitsU. A MS Word basic unit is
1/20 of a point, or 1/1440 of an inch (a point equals 1/72 of an inch).
3. Header.
The header part consists of a single block of 128 bytes. It contains
pointers to most parts of the file. The header can be defined in terms of a
C structure as follows:
/*
* This structure starts each MS-WORD 'WDBN' file.
*/
struct _header {
/* 00 */ ushort h_1 ; /* always 0xfe32 */
/* 02 */ ushort h_2 ; /* always 0 */
/* 04 */ ushort h_3 ; /* always 0xab00 */
/* 06 */ short h_unk1[4] ; /* always 0 */
/* 14 */ ulong h_ET ; /* Position of byte past text */
/* 18 */ ushort h_par ; /* first paragraph block # */
/* 20 */ ushort h_div ; /* first division info block # */
/* 22 */ ushort h_div1 ; /* same */
/* 24 */ ushort h_divlist ; /* first division list block # */
/* 26 */ ushort h_pagelist ; /* first page list block # */
/* 28 */ ushort h_unalloc ; /* first unallocated block # */
/* 30 */ short h_unk2[17] ; /* always 0 */
/* 64 */ ulong h_tlength ; /* Length of text */
/* 68 */ ulong h_tlength1 ; /* same */
/* 72 */ short h_unk3[28] ; /* always 0 */
};
typedef struct _header MS_Head;
The h_ET field contains the address within the file of the byte just past
the last character in the document text, i.e. the RtextS part. The h_tlength
field contains the length of the RtextS part in bytes. The h_par field gives
the block number (remember a MS Word block is 128 bytes long) of the
first block that contains paragraph formats. Every division has its own
block containing margins, page number and that kind of stuff. The h_div
field contains the block number for the first division block. For some
reason it is stored twice. The connection between the text and the division
blocks is made through the division list. The h_divlist field gives the block
number for the first block in the division list. The last block in the file
contains the page list. In the page list the position of the first character
in the first line of each page is stored. It is this list which is updated
when you issue a RRepaginateS (COMMAND-J) command. The small T=U signs
in the margins come also from this list. The h_pagelist field gives the
first block for this list. The block number of the first RunallocatedS block
is stored in the h_unalloc field. The file is also h_unalloc blocks long.
4. Text part.
The text part contains a complete representation of the text in the
document, including running heads, footnotes and pictures. The text is
represented in the order in which it occurs in the document, in the
extended Macintosh ascii character set. Some ascii values have special
meaning however:
0x01 page number ((page) glossary)
0x05 auto numbered footnote reference ((footnote) glossary)
0x0b Forced new line within paragraph
0x0c End of division or forced new page
0x0d End of paragraph
0x1f Optional hyphen
The above implies that to extract a text only version of an MS Word file
one only needs to extract the text part of the file, possibly replacing some
of the special characters with others, depending on what you want. If you
do nothing, you will certainly get very long lines, since you will get a
newline character only at the end of each paragraph, so perhaps you want
to do some line folding. Pictures are stored within the text part, along
with a header. The picture is a single paragraph by itself. The paragraph
format run pointing to the picture has a bit set to indicate the
corresponding paragraph is a picture.
5. Format runs.
Everything related to the layout of the text is stored in what we will call
Rformat runsS and Rformat descriptorsS. A format run consists of several
bytes of formatting information, described below (section 6 and 7). A
format descriptor consists of 6 bytes. It is described by the following
structure:
/*
* This structure defines a format descriptor.
*/
struct _fdescriptor {
/* 00 */ ulong fd_start; /* start of text for next run */
/* 04 */ short fd_run ; /* pointer to this format run */
};
Each format block starts with the offset in the text part where the
formats of this block start. After the initial start a number of format
descriptors follow. The rest of the format block contains format runs.
Both the format run and the format descriptor must be contained in the
same block. A new block is allocated if either one does not fit.
The last byte in a format block (offset 0x7f) contains the number of
format descriptors present in the block.
The format runs are stored preceded by a byte count. This bytecount gives
the number of bytes in the format run that are actually stored in the file.
The other (not stored) bytes of the format run contain the default value.
File size is reduced by not storing seldomly used fields.
The fd_start field is a pointer in the text part of the document. The next
format applies from there. The fd_run field is an offset (relative to byte 4
in the format block) to the format run.
6. Character formats.
The character format runs define how the characters in the text look. This
includes properties like the font, size and style of the characters. The
character format runs are 6 bytes long, although not all 6 need be stored.
One field needs special attention. The font number is split in (at least)
two pieces. The low order 6 bits are in the cf_font field. This fits most
standard fonts, as they have small numbers. More exotic fonts have larger
numbers. The extra bits are stored in the high order 3 bits of the cf_flags2
field. The meaning of the bytes in the format run is:
/*
* This structure defines character formats.
*/
struct _cformat {
/* 00 */ ubyte cf_unknown ; /* seems always 0x80 or 0x00 */
/* 01 */ ubyte cf_font ; /* font number, some flags */
/* 02 */ ubyte cf_pointsize ; /* times 2, 0 = default */
/* 03 */ ubyte cf_flags1 ; /* more flags */
/* 04 /* ubyte cf_flags2 ; /* more flags, more font # */
/* 05 */ byte cf_position ; /* > 0 super, < 0 sub script */
};
typedef struct _cformat MS_CFmt;
/* macro for extracting the font number */
#define CHF_FONT(x) (((x)->cf_font&0x3f) | (((x)->cf_flags2&0xe0) << 1))
/* values for the cf_font field */
#define CHF_BOLD 0x80 /* bold bit */
#define CHF_ITAL 0x40 /* italic bit */
/* values for the flags1 and flags2 fields */
#define F1_UL 0x80 /* underlined */
#define F1_SC 0x0c /* Small Caps */
#define F2_OL 0x10 /* outline */
#define F2_SH 0x08 /* shadow */
The first field in a character formats format run seems to take only the
values 0x00 and 0x80. The meaning of this field is unknown.
7. Paragraph formats.
The fourth part of the file contains the paragraph formats. The format runs
start with normal paragraph formatting information. Thereafter follow the
Rtab definitionsS. As many tab definitions as needed will be in the format
run. The tab definition is described by the structure below:
/*
* This structure defines tabs
*/
struct _tformat {
/* 00 */ ushort t_position ; /* tab position */
/* 02 */ ushor t t_flags ; /* type of tab stop */
};
typedef struct _tformat MS_Tab;
/* values for the t_flags field */
#define T_ALIGNMS K 0x600 0 /* mask for tab alignment */
/* alignment values: */
#define T_LEFT 0x0000 /* left aligning tab */
#define T_CENTER 0x2000 /* center aligning tab */
#define T_RIGHT 0x4000 /* right aligning tab */
#define T_DECIMAL 0x6000 /* decimal tab */
#define T_LEADMSK 0x0c00 /* mask for tab leader */
/* leader values: */
#define T_BLANK 0x0000 /* blank leader */
#define T_DOTS 0x0400 /* dotted leader */
#define T_DASH 0x0800 /* dashed leader */
#define T_LINE 0x0c00 /* line leader */
The format run is defined by the structure below:
/*
* This structure defines paragraph formats.
*/
struct _pformat {
/* 00 */ ushort p_flags; /* some flags */
/* 02 */ ushort p_unk1; /* always 0 */
/* 04 */ ushort p_right; /* right indent */
/* 06 */ ushort p_left; /* left indent */
/* 08 */ ushort p_first; /* first indent */
/* 10 */ ushort p_line_spacing; /* line spacing (0 = auto) */
/* 12 */ ushort p_before; /* space before */
/* 14 */ ushort p_after; /* space after */
/* 16 */ ushort p_rhead_pict; /* running head & picture info */
/* 18 */ ushort p_unk2; /* always 0 */
/* 20 */ ushort p_unk3; /* always 0 */
/* 22 */ MS_Tab p_tabs [0]; /* list of tab descriptors */
};
typedef struct _pformat MS_Fmt;
/* values for the p_flags field */
#define PF_FOOTMASK 0x7f00 /* mask for footnote info */
/* values unknown */
#define PF_JUSTMASK 0x00c0 /* mask for justification */
/* justification values: */
#define PF_LEFT 0x0000 /* left justifiedparagraph */
#define PF_CENTER 0x0040 /* centered paragraph */
#define PF_RIGHT 0x0080 /* right justified paragraph */
#define PF_JUST 0x00c0 /* justified paragraph */
#define PF_KEEP 0x0010 /* keep with next paragraph */
#define PF_KEEPL 0x0020 /* keep lines together */
/* values for the running_head field */
#define RH_MASK 0xf000 /* mask for running head info */
/* running head values: */
#define RH_FIRST 0x8000 /* appears on first page */
#define RH_EVEN 0x4000 /* on even pages */
#define RH_ODD 0x2000 /* on odd pages */
#define RH_BOTTOM 0x1000 /* appears on bottom of page */
#define RH_PICT 0x0800 /* this paragraph is a picture */
The p_flags field has sometimes the high order bit set (0x8000). The
meaning of this could be the same as the setting of this bit in the
character formats, where it is also sometimes set. What this bit indicates
is unknown.
In the paragraph format run only the tab descriptors needed for explicitely
defined tabs are stored. The list of tab descriptors is ended by a tab
descriptor with the t_position field equal to 0. This final tab descriptor is
stored in the file, although this seems not necessary.
A picture is identified by having the RH_PICT bit set. In the text part of
the document the picture is present, preceded by a 6 byte header. The
actual picture data follows after the header. It is encoded in standard PICT
format (see tech note #21). The picture header is defined by the structure
given below:
/*
This structure defines a picture header
*/
struct _phead {
/* 00 */ short ph_offset; /* offset from left margin */
/* 02 */ short ph_xdist; /* distortion in x direction */
/* 04 */ short ph_ydist; /* distortion in y direction */
};
typedef struct _phead MS_Pict;
The ph_xdist and ph_ydist fields are used to store the distortion of the
picture. If both are zero, the picture is undistorted. The exact meaning of
the values stored here is unknown.
8. Division blocks.
The next (fifth) part of the file contains the division blocks. One block
(128 bytes!) is allocated for every division. The block is filled with the
following structure, preceded by a byte count indicating how many bytes
are actually stored:
/*
* This structure describes division formats
*/
struct _dformat {
/* 00 */ ushort d_flags; /* flags */
/* 02 */ ushort d_pap_len; /* total paper length */
/* 04 */ ushort d_pap_wit; /* total paper width */
/* 06 */ ushort d_p_start; /* start page # */
/* 08 */ ushort d_top; /* top margin */
/* 10 */ ushort d_bot; /* bottom margin, from top paper */
/* 12 */ ushort d_left; /* left margin */
/* 14 */ ushort d_right; /* right margin, from left paper */
/* 16 */ ushort d_flag_col; /* some flags, number of columns */
/* 18 */ ushort d_r_top; /* top run.head pos, from top paper */
/* 20 */ ushort d_r_bot; /* bottom run.head pos, from top paper */
/* 22 */ ushort d_colsp; /* column spacing */
/* 24 */ ushort d_gutter; /* gutter */
/* 26 */ ushort d_pag_top; /* page number position from top */
/* 28 */ ushort d_pag_left; /* page number position from left */
/* 30 */ ushort d_unk1;
/* 32 */ ushort d_rbot; /* seems runn. head pos, from bottom */
/* 34 */ short d_unk2[34];
};
typedef struct _dformat MS_Div;
#define DFB_MASK 0x00e0 /* mask for break */
/* values for break: */
#define DFB_CONT 0x0000 /* continuous */
#define DFB_COL 0x0020 /* column */
#define DFB_PAGE 0x0040 /* page (default) */
define DFB_ODD 0x0060 /* odd */
#define DFB_EVEN 0x0080 /* even */
#define DFP_MASK 0x001c /* mask for the page # format */
/* values for page # format: */
#define DFP_NUM 0x0000 /* numeric (1, 2...)*/
#define DFP_ROM 0x0004 /* roman, upper case (I, II...) */
#define DFP_rom 0x0008 /* roman, lower case (i, ii...) */
#define DFP_ALF 0x000c /* alphabetic, upper case (A, B...) */
#define DFP_alf 0x0010 /* alphabetic, lower case (a, b...) */
#define DF_DIV 0x0001 /* division layout present */
#define DEFAULT_PAG 0xffff /* default page number */
/* mask for flags in the d_flag_col field */
/* values: */
#define DCF_AUTO 0x0200 /* auto page numbering on */
#define DCF_FOOT 0x0100 /* 1=footnote at end of division */
#define DCF_COL 0x00ff /* mask for number of columns */
Some of the values stored in the division blocks seem to have no relation
to a division, but rather to the document as a whole. These values are
present in all division blocks, but only the value in the first division block
is used. The values in the other blocks are just ignored. As usual, all
dimensions are in basic units.
The DF_DIV bit used to indicate whether the division layout information is
stored (DF_DIV = 1) or only the paper dimensions (DF_DIV = 0). If the
DF_DIV bit is 0, only 16 bytes of the division block are used.
The division blocks do not contain any information on which part of the
text they apply to. This information is stored in the next part of the file,
the division list:
9. Division list.
The division list links the division blocks to the text. It consists of
division descriptors. These are defined by the structure:
/*
* This structure describes a division descriptor
*/
struct _pdiv {
/* 00 */ ulong pd_text; /* where the division starts */
/* 04 */ ushort pd_unk; /* unknown */
/* 06 */ ulong pd_block; /* there is the div block */
};
typedef struct _pdiv MS_DivD;
The pd_text field gives the place in the text where the division ends,
relative to the start of the text part. Add 0x80 to get the offset from the
start of the file. The pd_block field gives the offset in the file where the
division block starts. If the pd_block field is 0xffffffff, this division has
no division block allocated. This seems a division with all default values.
The meaning of the pd_unk field is unclear.
The structure described in section 8 is present in the corresponding
division block, preceded by a bytecount. As with the paragraph and
character formats, bytes not stored contain default values. As many
division descriptors are in the division list as there are divisions. The
division list holds some more information, as can be seen in the structure
definition:
/*
* This structure describes the division list
*/
struct _divlist {
/* 00 */ ushort dl_count; /* the number of descriptors */
/* 02 */ ushort dl_unk; /* some unknown counter */
/* 06 */ MS_DivD dl_list [0]; /* as many as needed */
};
typedef struct _divlist MS_Div;
If one block (128 bytes) is not sufficient to store the division list, the list
can span block boundaries.
The division list is also the way to se if a 0x0c character stored within
the file is a Tforced new pageU (TSHIFT-ENTERU) or a Tend of divisionU
character. If there is no entry for the 0x0c character in the division list,
it must be only a new page.
10. Page list.
The last part of the file is the page list. This list contains information
about the pages in the document. This information is used to process the
RGo ToIS command, and to show the small T=U signs in the margin. The page
list is updated when the document is repaginated by means of the
RRepaginateS command. The items in the page list are described by this
structure:
/*
* This structure describes a page list item
*/
struct _page {
/* 00 */ ushort pg_num; /* the page number */
/* 02 */ ulong pg_text; /* where it starts */
};
typedef struct _page MS_Page;
The page numbers seem to be always in numerical order. The pg_text field
is the offset in the text part where the page starts. Add 0x80 to get the
position relative to the start of the file. All page list items are contained
in the page list, which fills the last part of the file:
/*
* This structure describes the page list
*/
struct _plist {
/* 00 */ ushort pl_count; /* the number list items */
/* 02 */ ushort pl_unk; /* some other count */
/* 04 */ MS_Page pl_list [0]; /* the list */
};
typedef struct _plist MS_PList;
11. End of file.
Just after the page list the end of file is present. The file contains an
integral number of MS Word blocks. As a MS Word block is smaller than a
physical disk block, the end of file may be in the middle of the last
physical disk block.
Dolf Starreveld / Maarten Carels
Department of Computer Science, UvA
Usenet: {dolf,maarten}@uva.uucp
{seismo,decvax,philabs}!mcvax!uva!{dolf,maarten}
Snail mail: Dolf Starreveld
Department of Computing Science
University of Amsterdam
Kruislaan 409
NL-1098 SJ Amsterdam
The Netherlands
Telefone: In Holland: 020-592 5137/5022
International: 31-20-592 5137 or 31-20-592 5022
Telex: 10262 HEF NL