maarten@uva.uucp (Maarten Carels) (08/11/86)
After  our previous posting, we found out a lot more about the format of
MS  Word  document  files.  By now, all parts of the file are known, not
only  the  function  of  the part, but also the almost complete internal
structure.  A  new  version of the document (BinHexed MS Word format) is
posted  to  net.sources.mac.  For  those  of you who do not have MS Word
(what  could  they  do  with  this description -:)), a text only version
follows this posting.
New  parts  discovered are mostly related to the division structure of a
MS Word document. Almost all details related to this part are described.
Some small errors in the first version of the document were corrected.
We  encourage  everyone  to make use of the information provided, but we
expect  you  to  be  fair and place every program you develop based upon
this  information  in the public domain, or at least make it a shareware
product.  (Free  for  us  of  course  !). Also we would like anybody who
discovers  new  facts  about  the  format,  or  who  finds errors in our
explanation to inform us.
(-------------- Cut Here ------------------)
MicroSoft Word file format
second, revised edition
M.J. Carels,
A.G. Starreveld
Department of Computer Science 
University of Amsterdam
e-mail: {decvax, philabs, seismo}!mcvax!uva!{maarten,dolf}
s-mail: Kruislaan 409, 1098 SJ  Amsterdam, The Netherlands
1. Introduction.
This document describes the structure of the files produced by MicroSoft
WORD (versions 1.00 and 1.05). These files are of type 'WDBN'. This
knowledge has been gathered by looking at such files, and trying to
interpret the bytes in the file. All information is believed to be correct,
but no responsibility for errors or omissions can be taken. Please let us
know if you find errors in this document, or if you find more about the
structure of the file. The file format used by MS Word on computers other
than the Macintosh may or may not be the same.
We will describe all structures of the file format by means of C structure
declarations, as this is a convenient way to describe things. In the
structures some basic types appear. These are:
ubyte	8 bits unsigned number
byte	8 bits signed number
ushort	16 bits unsigned number
short	16 bits signed number
ulong	32 bits unsigned number
long	32 bits signed number
No alignment is to be assumed: the only bytes present are the ones
described. In all structures byte addresses (decimal) are included as
comments.
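The table above does not say in which order the bytes of a multi-byte
number are stored. The following is a minimal sketch of reader functions,
assuming the files follow the 68000 convention of the Macintosh (most
significant byte first). That byte order is an assumption, not something
established in this document, and the names rd_u16, rd_s16 and rd_u32 are
hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: numbers are stored most significant byte first. */

/* Read a ushort (16 bits unsigned) stored at p. */
uint16_t rd_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read a short (16 bits signed) stored at p. */
int16_t rd_s16(const uint8_t *p)
{
    int32_t v = rd_u16(p);
    return (int16_t)(v >= 0x8000 ? v - 0x10000 : v);
}

/* Read a ulong (32 bits unsigned) stored at p. */
uint32_t rd_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Reading through byte pointers like this also avoids any dependence on the
alignment or padding rules of the compiler used.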
2. General file structure.
An MS Word file of type 'WDBN' can be divided into seven different main
parts. We will call these parts the "header", "text", "character formats",
"paragraph formats", "division blocks", "division list" and "page list"
respectively. These parts appear in the above mentioned order in the file's
data fork. The resource fork is always empty, i.e. not allocated.
The file can be seen as being built from basic blocks, each 128 bytes long.
This is the case for all parts of the file, although it does not appear to be
significant for the text part. This implies that the size of an MS Word file
is always a multiple of 128 bytes, though within each of the parts
mentioned, the last bytes in the last block belonging to a certain part may
(and usually will) contain garbage.
In several sections sizes, dimensions and distances are present. In all such
fields these dimensions are given in 'basic units'. An MS Word basic unit is
1/20 of a point, or 1/1440 of an inch (a point equals 1/72 of an inch).
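As a small illustration of the arithmetic, the conversion from basic units
could be written as below. The function and macro names are hypothetical.

```c
/* One basic unit is 1/20 of a point, or 1/1440 of an inch. */
#define BU_PER_POINT 20
#define BU_PER_INCH  1440

/* Convert a dimension in basic units to points. */
double bu_to_points(int bu)
{
    return (double)bu / BU_PER_POINT;
}

/* Convert a dimension in basic units to inches. */
double bu_to_inches(int bu)
{
    return (double)bu / BU_PER_INCH;
}
```

A margin stored as 1440 is thus one inch, or 72 points.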
3. Header.
The header part consists of a single block of 128 bytes. It contains
pointers to most parts of the file. The header can be defined in terms of a
C structure as follows:
/*
 *	This structure starts each MS-WORD 'WDBN' file.
 */
struct	_header	{
/* 00 */	ushort	h_1 ;	/* always 0xfe32 */
/* 02 */	ushort	h_2 ;	/* always 0 */
/* 04 */	ushort	h_3 ;	/* always 0xab00 */
/* 06 */	short	h_unk1[4] ;	/* always 0 */
/* 14 */	ulong	h_ET ;	/* Position of byte past text */
/* 18 */	ushort	h_par ;	/* first paragraph block # */
/* 20 */	ushort	h_div ;	/* first division info block # */
/* 22 */	ushort	h_div1 ;	/* same */
/* 24 */	ushort	h_divlist ;	/* first division list block # */
/* 26 */	ushort	h_pagelist ;	/* first page list block # */
/* 28 */	ushort	h_unalloc ;	/* first unallocated block # */
/* 30 */	short	h_unk2[17] ;	/* always 0 */
/* 64 */	ulong	h_tlength ;	/* Length of text */
/* 68 */	ulong	h_tlength1 ;	/* same */
/* 72 */	short	h_unk3[28] ;	/* always 0 */
};
typedef struct _header MS_Head;
The h_ET field contains the address within the file of the byte just past
the last character in the document text, i.e. the "text" part. The h_tlength
field contains the length of the "text" part in bytes. The h_par field gives
the block number (remember, an MS Word block is 128 bytes long) of the
first block that contains paragraph formats. Every division has its own
block containing margins, page number and that kind of stuff. The h_div
field contains the block number of the first division block. For some
reason it is stored twice. The connection between the text and the division
blocks is made through the division list. The h_divlist field gives the block
number of the first block in the division list. The last block in the file
contains the page list. In the page list the position of the first character
in the first line of each page is stored. It is this list which is updated
when you issue a "Repaginate" (COMMAND-J) command. The small '=' signs
in the margins also come from this list. The h_pagelist field gives the
first block of this list. The block number of the first "unallocated" block
is stored in the h_unalloc field. The file is also h_unalloc blocks long.
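The header fields can be turned into file offsets as in the sketch below.
It assumes that numbers are stored most significant byte first (the 68000
convention, not confirmed in this document) and that blocks are numbered
from 0 with the header as block 0, which is what "the file is h_unalloc
blocks long" suggests. The names msw_layout and msw_read_header are
hypothetical.

```c
#include <stdint.h>

#define MSW_BLOCK 128u   /* size of one MS Word block in bytes */

/* ASSUMPTION: numbers are stored most significant byte first. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical summary of where the parts of the file are. */
struct msw_layout {
    uint32_t text_end;      /* h_ET: file position just past the text */
    uint32_t text_len;      /* h_tlength */
    uint32_t par_off;       /* file offset of first paragraph block */
    uint32_t div_off;       /* file offset of first division block */
    uint32_t divlist_off;   /* file offset of the division list */
    uint32_t pagelist_off;  /* file offset of the page list */
    uint32_t file_size;     /* h_unalloc blocks */
};

/*
 * Fill *out from the 128-byte header block.
 * Returns 0 on success, -1 if h_1 or h_3 do not have the expected value.
 */
int msw_read_header(const uint8_t *hdr, struct msw_layout *out)
{
    if (be16(hdr + 0) != 0xfe32 || be16(hdr + 4) != 0xab00)
        return -1;
    out->text_end     = be32(hdr + 14);
    out->text_len     = be32(hdr + 64);
    out->par_off      = be16(hdr + 18) * MSW_BLOCK;
    out->div_off      = be16(hdr + 20) * MSW_BLOCK;
    out->divlist_off  = be16(hdr + 24) * MSW_BLOCK;
    out->pagelist_off = be16(hdr + 26) * MSW_BLOCK;
    out->file_size    = be16(hdr + 28) * MSW_BLOCK;
    return 0;
}
```

A reader might additionally compare file_size with the actual size of the
data fork before trusting the other offsets.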
4. Text part.
The text part contains a complete representation of the text in the
document, including running heads, footnotes and pictures. The text is
represented in the order in which it occurs in the document, in the
extended Macintosh ascii character set. Some ascii values have special
meaning however:
0x01	page number ((page) glossary)
0x05	auto numbered footnote reference ((footnote) glossary)
0x0b	Forced new line within paragraph
0x0c	End of division or forced new page
0x0d	End of paragraph
0x1f	Optional hyphen
The above implies that to extract a text only version of an MS Word file
one only needs to extract the text part of the file, possibly replacing some
of the special characters with others, depending on what you want. If you
do nothing, you will certainly get very long lines, since you will get a
newline character only at the end of each paragraph, so perhaps you want
to do some line folding. Pictures are stored within the text part, along
with a header. The picture is a single paragraph by itself. The paragraph
format run pointing to the picture has a bit set to indicate the
corresponding paragraph is a picture.
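A text only extraction along these lines might look like the sketch below.
The choice of replacements is ours (new line for 0x0b and 0x0d, form feed
for 0x0c, the others dropped), and the function name msw_plain_text is
hypothetical. Picture paragraphs are not skipped here, since recognizing
them requires the paragraph formats; line folding is also left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n bytes of the text part to out, translating the special
 * characters. out must have room for at least n bytes.
 * Returns the number of bytes written.
 */
size_t msw_plain_text(const uint8_t *text, size_t n, char *out)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        switch (text[i]) {
        case 0x0b:              /* forced new line within paragraph */
        case 0x0d:              /* end of paragraph */
            out[o++] = '\n';
            break;
        case 0x0c:              /* end of division or forced new page */
            out[o++] = '\f';
            break;
        case 0x01:              /* page number */
        case 0x05:              /* auto numbered footnote reference */
        case 0x1f:              /* optional hyphen */
            break;              /* dropped in this sketch */
        default:
            out[o++] = (char)text[i];
            break;
        }
    }
    return o;
}
```

Characters above 0x7f are passed through unchanged, so the result is still
in the extended Macintosh character set.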
5. Format runs.
Everything related to the layout of the text is stored in what we will call
"format runs" and "format descriptors". A format run consists of several
bytes of formatting information, described below (sections 6 and 7). A
format descriptor consists of 6 bytes. It is described by the following
structure:
/*
 *	This structure defines a format descriptor.
 */
struct	_fdescriptor	{
/* 00 */	ulong	fd_start;	/* start of text for next run */
/* 04 */	short	fd_run ;	/* pointer to this format run */
};
Each format block starts with the offset in the text part where the
formats of this block start. After the initial start a number of format
descriptors follow. The rest of the format block contains format runs.
Both the format run and the format descriptor must be contained in the
same block. A new block is allocated if either one does not fit. 
The last byte in a format block (offset 0x7f) contains the number of
format descriptors present in the block.
The format runs are stored preceded by a byte count. This bytecount gives
the number of bytes in the format run that are actually stored in the file.
The other (not stored) bytes of the format run contain the default value.
File size is reduced by not storing seldom used fields.
The fd_start field is a pointer in the text part of the document. The next
format applies from there. The fd_run field is an offset (relative to byte 4
in the format block) to the format run. 
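Putting the pieces of this section together, one format block could be
walked as in the sketch below. It assumes most significant byte first
storage, and it reads descriptor i as covering the text from the previous
start up to its own fd_start; both are our interpretation. An fd_run value
that is negative or points outside the block is simply rejected, as its
meaning is not covered by this description. The names fmt_run and
fb_get_run are hypothetical.

```c
#include <stdint.h>

#define FB_COUNT 0x7f   /* offset of the descriptor count */
#define FB_DESC  4      /* offset of the first descriptor */
#define FD_SIZE  6      /* size of one format descriptor */

/* ASSUMPTION: numbers are stored most significant byte first. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

struct fmt_run {
    uint32_t from;          /* text offset where this run starts */
    uint32_t to;            /* fd_start: where the next run starts */
    const uint8_t *bytes;   /* stored bytes, after the byte count */
    unsigned len;           /* the byte count */
};

/* Fetch run i of a 128-byte format block; returns 0, or -1 on error. */
int fb_get_run(const uint8_t *blk, unsigned i, struct fmt_run *r)
{
    unsigned count = blk[FB_COUNT];
    const uint8_t *d;
    unsigned off;

    if (i >= count || count > (FB_COUNT - FB_DESC) / FD_SIZE)
        return -1;
    d = blk + FB_DESC + i * FD_SIZE;
    r->from = (i == 0) ? be32(blk) : be32(d - FD_SIZE);
    r->to = be32(d);
    off = be16(d + 4);                      /* fd_run */
    if (off >= 0x8000u || FB_DESC + off >= FB_COUNT)
        return -1;
    r->len = blk[FB_DESC + off];            /* byte count before the run */
    if (FB_DESC + off + 1 + r->len > FB_COUNT)
        return -1;
    r->bytes = blk + FB_DESC + off + 1;
    return 0;
}
```

The bytes returned are only the stored part of the run; the caller still
has to supply default values for the remainder.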
6. Character formats.
The character format runs define how the characters in the text look. This
includes properties like the font, size and style of the characters. The
character format runs are 6 bytes long, although not all 6 need be stored.
One field needs special attention. The font number is split in (at least)
two pieces. The low order 6 bits are in the cf_font field. This fits most
standard fonts, as they have small numbers. More exotic fonts have larger
numbers. The extra bits are stored in the high order 3 bits of the cf_flags2
field. The meaning of the bytes in the format run is:
/*
 *	This structure defines character formats.
 */
struct	_cformat	{
/* 00 */	ubyte	cf_unknown ;	/* seems always 0x80 or 0x00 */
/* 01 */	ubyte	cf_font ;	/* font number, some flags */
/* 02 */	ubyte	cf_pointsize ;	/* times 2, 0 = default */
/* 03 */	ubyte	cf_flags1 ;	/* more flags */
/* 04 */	ubyte	cf_flags2 ;	/* more flags, more font # */
/* 05 */	byte	cf_position ;	/* > 0 super, < 0 sub script */
};
typedef struct _cformat MS_CFmt;
/* macro for extracting the font number */
#define CHF_FONT(x) (((x)->cf_font&0x3f) | (((x)->cf_flags2&0xe0) << 1))
/* values for the cf_font field */
#define	CHF_BOLD	0x80	/* bold bit */
#define	CHF_ITAL	0x40	/* italic bit */
/* values for the flags1 and flags2 fields */
#define	F1_UL	0x80	/* underlined */
#define	F1_SC	0x0c	/* Small Caps */
#define	F2_OL	0x10	/* outline */
#define	F2_SH	0x08	/* shadow */
The first field in a character formats format run seems to take only the
values 0x00 and 0x80. The meaning of this field is unknown.
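The sketch below expands a stored character format run to its full 6
bytes and extracts the properties described above. It assumes that the
default value of every byte not stored is 0; the document only says that
such bytes "contain the default value", so this is a guess. The names
char_style and cf_decode are hypothetical. Small caps is left out.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of a character format run. */
struct char_style {
    unsigned font;          /* font number, both pieces combined */
    unsigned half_points;   /* cf_pointsize: size times 2, 0 = default */
    int bold, italic, underline, outline, shadow;
    int position;           /* > 0 superscript, < 0 subscript */
};

/* Decode len stored bytes (the part after the byte count). */
void cf_decode(const uint8_t *stored, unsigned len, struct char_style *s)
{
    uint8_t cf[6] = {0};    /* ASSUMPTION: all defaults are 0 */

    if (len > sizeof cf)
        len = sizeof cf;
    if (len > 0)
        memcpy(cf, stored, len);

    s->font        = (cf[1] & 0x3fu) | ((cf[4] & 0xe0u) << 1);
    s->bold        = (cf[1] & 0x80) != 0;   /* CHF_BOLD */
    s->italic      = (cf[1] & 0x40) != 0;   /* CHF_ITAL */
    s->half_points = cf[2];
    s->underline   = (cf[3] & 0x80) != 0;   /* F1_UL */
    s->outline     = (cf[4] & 0x10) != 0;   /* F2_OL */
    s->shadow      = (cf[4] & 0x08) != 0;   /* F2_SH */
    s->position    = cf[5] >= 0x80 ? (int)cf[5] - 256 : (int)cf[5];
}
```

The font computation is the same as the CHF_FONT macro, applied to the
expanded bytes.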
7. Paragraph formats.
The fourth part of the file contains the paragraph formats. The format runs
start with normal paragraph formatting information. Thereafter follow the
"tab definitions". As many tab definitions as needed will be in the format
run. The tab definition is described by the structure below:
/*
 *	This structure defines tabs
 */
struct	_tformat {
/* 00 */	ushort	t_position ;	/* tab position */
/* 02 */	ushort	t_flags ;	/* type of tab stop */
};
typedef struct _tformat MS_Tab;
/* values for the t_flags field */
#define	T_ALIGNMSK	0x6000	/* mask for tab alignment */
/* alignment values: */
#define	T_LEFT	0x0000	/* left aligning tab */
#define	T_CENTER	0x2000	/* center aligning tab */
#define	T_RIGHT	0x4000	/* right aligning tab */
#define	T_DECIMAL	0x6000	/* decimal tab */
#define	T_LEADMSK	0x0c00	/* mask for tab leader */
		/* leader values: */
#define	T_BLANK	0x0000	/* blank leader */
#define	T_DOTS	0x0400	/* dotted leader */
#define	T_DASH	0x0800	/* dashed leader */
#define	T_LINE	0x0c00	/* line leader */
The format run is defined by the structure below:
/*
 *	This structure defines paragraph formats.
 */
struct	_pformat	{
/* 00 */	ushort	p_flags;	/* some flags */
/* 02 */	ushort	p_unk1;	/* always 0 */
/* 04 */	ushort	p_right;	/* right indent */
/* 06 */	ushort	p_left;	/* left indent */
/* 08 */	ushort	p_first;	/* first indent */
/* 10 */	ushort	p_line_spacing;	/* line spacing (0 = auto) */
/* 12 */	ushort	p_before;	/* space before */
/* 14 */	ushort	p_after;	/* space after */
/* 16 */	ushort	p_rhead_pict;	/* running head & picture info */
/* 18 */	ushort	p_unk2;	/* always 0 */
/* 20 */	ushort	p_unk3;	/* always 0 */
/* 22 */	MS_Tab	p_tabs [0];	/* list of tab descriptors */
};
typedef	struct _pformat	MS_Fmt;
/* values for the p_flags field */
#define	PF_FOOTMASK	0x7f00	/* mask for footnote info */
/* values unknown */
#define	PF_JUSTMASK	0x00c0	/* mask for justification */
/* justification values: */
#define	PF_LEFT	0x0000	/* left justified paragraph */
#define	PF_CENTER	0x0040	/* centered paragraph */
#define	PF_RIGHT	0x0080	/* right justified paragraph */
#define	PF_JUST	0x00c0	/* justified paragraph */
#define	PF_KEEP	0x0010	/* keep with next paragraph */
#define	PF_KEEPL	0x0020	/* keep lines together */
/* values for the p_rhead_pict field */
#define	RH_MASK	0xf000	/* mask for running head info */
		/* running head values: */
#define	RH_FIRST	0x8000	/* appears on first page */
#define	RH_EVEN	0x4000	/* on even pages */
#define	RH_ODD	0x2000	/* on odd pages */
#define	RH_BOTTOM	0x1000	/* appears on bottom of page */
#define	RH_PICT	0x0800	/* this paragraph is a picture */
The p_flags field sometimes has the high order bit set (0x8000). The
meaning of this could be the same as that of the high order bit in the
first field of the character formats, which is also sometimes set. What
this bit indicates is unknown.
In the paragraph format run only the tab descriptors needed for explicitly
defined tabs are stored. The list of tab descriptors is ended by a tab
descriptor with the t_position field equal to 0. This final tab descriptor is
stored in the file, although this does not seem necessary.
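Collecting the tab stops from a stored paragraph format run could be done
as in the sketch below. It assumes most significant byte first storage and
stops either at the terminating descriptor or at the end of the stored
bytes, whichever comes first. The names tab_stop and pf_tabs are
hypothetical.

```c
#include <stdint.h>

#define PF_TABS_OFF 22      /* offset of p_tabs in the format run */
#define T_ALIGNMSK  0x6000  /* mask for tab alignment */
#define T_LEADMSK   0x0c00  /* mask for tab leader */

/* ASSUMPTION: numbers are stored most significant byte first. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

struct tab_stop {
    unsigned position;  /* t_position, in basic units */
    unsigned align;     /* t_flags & T_ALIGNMSK */
    unsigned leader;    /* t_flags & T_LEADMSK */
};

/*
 * Collect up to max tab stops from the len stored bytes of a
 * paragraph format run. Returns the number of tab stops found.
 */
int pf_tabs(const uint8_t *run, unsigned len, struct tab_stop *tabs, int max)
{
    unsigned off = PF_TABS_OFF;
    int n = 0;

    while (off + 4 <= len && n < max) {
        unsigned pos   = be16(run + off);
        unsigned flags = be16(run + off + 2);

        if (pos == 0)           /* terminating descriptor */
            break;
        tabs[n].position = pos;
        tabs[n].align    = flags & T_ALIGNMSK;
        tabs[n].leader   = flags & T_LEADMSK;
        n++;
        off += 4;
    }
    return n;
}
```

The align and leader values can be compared directly with T_LEFT, T_DOTS
and the other constants defined above.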
A picture is identified by having the RH_PICT bit set. In the text part of
the document the picture is present, preceded by a 6 byte header. The
actual picture data follows after the header. It is encoded in standard PICT
format (see tech note #21). The picture header is defined by the structure
given below:
/*
	This structure defines a picture header
*/
struct	_phead {
/* 00 */	short	ph_offset;	/* offset from left margin */
/* 02 */	short	ph_xdist;	/* distortion in x direction */
/* 04 */	short	ph_ydist;	/* distortion in y direction */
};
typedef struct _phead MS_Pict;
The ph_xdist and ph_ydist fields are used to store the distortion of the
picture. If both are zero, the picture is undistorted. The exact meaning of
the values stored here is unknown. 
8. Division blocks.
The next (fifth) part of the file contains the division blocks. One block
(128 bytes!) is allocated for every division. The block is filled with the
following structure, preceded by a byte count indicating how many bytes
are actually stored:
/*
 *	This structure describes division formats
 */
struct _dformat {
/* 00 */	ushort	d_flags;	/* flags */
/* 02 */	ushort	d_pap_len;	/* total paper length */
/* 04 */	ushort	d_pap_wit;	/* total paper width */
/* 06 */	ushort	d_p_start;	/* start page # */
/* 08 */	ushort	d_top;	/* top margin */
/* 10 */	ushort	d_bot;	/* bottom margin, from top paper */
/* 12 */	ushort	d_left;	/* left margin */
/* 14 */	ushort	d_right;	/* right margin, from left paper */
/* 16 */	ushort	d_flag_col;	/* some flags, number of columns */
/* 18 */	ushort	d_r_top;	/* top run.head pos, from top paper */
/* 20 */	ushort	d_r_bot;	/* bottom run.head pos, from top paper */
/* 22 */	ushort	d_colsp;	/* column spacing */
/* 24 */	ushort	d_gutter;	/* gutter */
/* 26 */	ushort	d_pag_top;	/* page number position from top */
/* 28 */	ushort	d_pag_left;	/* page number position from left */
/* 30 */	ushort	d_unk1;
/* 32 */	ushort	d_rbot;	/* seems runn.  head pos, from bottom */
/* 34 */	short	d_unk2[34];
};
typedef struct _dformat MS_Div;
#define	DFB_MASK	0x00e0	/* mask for break */
		/* values for break: */
#define	DFB_CONT	0x0000	/* continuous */
#define	DFB_COL	0x0020	/* column */
#define	DFB_PAGE	0x0040	/* page (default) */
#define	DFB_ODD	0x0060	/* odd */
#define	DFB_EVEN	0x0080	/* even */
#define	DFP_MASK	0x001c	/* mask for the page # format */
		/* values for page # format: */
#define	DFP_NUM	0x0000	/* numeric (1, 2...)*/
#define	DFP_ROM	0x0004	/* roman, upper case (I, II...) */
#define	DFP_rom	0x0008	/* roman, lower case (i, ii...) */
#define	DFP_ALF	0x000c	/* alphabetic, upper case (A, B...) */
#define	DFP_alf	0x0010	/* alphabetic, lower case (a, b...) */
#define	DF_DIV	0x0001	/* division layout present */
#define	DEFAULT_PAG	0xffff	/* default page number */
/* mask for flags in the d_flag_col field */
		/* values: */
#define	DCF_AUTO	0x0200	/* auto page numbering on */
#define	DCF_FOOT	0x0100	/* 1=footnote at end of division */
#define	DCF_COL	0x00ff	/* mask for number of columns */
Some of the values stored in the division blocks seem to have no relation
to a division, but rather to the document as a whole. These values are
present in all division blocks, but only the value in the first division block
is used. The values in the other blocks are just ignored. As usual, all
dimensions are in basic units. 
The DF_DIV bit is used to indicate whether the division layout information
is stored (DF_DIV = 1) or only the paper dimensions (DF_DIV = 0). If the
DF_DIV bit is 0, only 16 bytes of the division block are used.
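As an illustration of how the masks above are meant to be applied, the
sketch below decodes the break type from d_flags and the column count from
d_flag_col. It works on values that have already been read from the block;
the function names are hypothetical.

```c
#include <stdint.h>

#define DFB_MASK 0x00e0     /* mask for break */
#define DCF_AUTO 0x0200     /* auto page numbering on */
#define DCF_COL  0x00ff     /* mask for number of columns */

/* Name of the division break encoded in d_flags. */
const char *div_break_name(uint16_t d_flags)
{
    switch (d_flags & DFB_MASK) {
    case 0x0000: return "continuous";   /* DFB_CONT */
    case 0x0020: return "column";       /* DFB_COL */
    case 0x0040: return "page";         /* DFB_PAGE */
    case 0x0060: return "odd";          /* DFB_ODD */
    case 0x0080: return "even";         /* DFB_EVEN */
    default:     return "unknown";
    }
}

/* Number of columns encoded in d_flag_col. */
unsigned div_columns(uint16_t d_flag_col)
{
    return d_flag_col & DCF_COL;
}

/* Nonzero if auto page numbering is on according to d_flag_col. */
int div_auto_pagenum(uint16_t d_flag_col)
{
    return (d_flag_col & DCF_AUTO) != 0;
}
```

The page number format can be decoded from d_flags in the same way, using
DFP_MASK.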
The division blocks do not contain any information on which part of the
text they apply to. This information is stored in the next part of the file,
the division list.
9. Division list.
The division list links the division blocks to the text. It consists of
division descriptors. These are defined by the structure:
/*
 *	This structure describes a division descriptor
 */
struct	_pdiv {
/* 00 */	ulong	pd_text;	/* where the division starts */
/* 04 */	ushort	pd_unk;	/* unknown */
/* 06 */	ulong	pd_block;	/* where the div block is */
};
typedef struct _pdiv MS_DivD;
The pd_text field gives the place in the text where the division ends,
relative to the start of the text part. Add 0x80 to get the offset from the
start of the file. The pd_block field gives the offset in the file where the
division block starts. If the pd_block field is 0xffffffff, this division has
no division block allocated. This seems a division with all default values.
The meaning of the pd_unk field is unclear.
The structure described in section 8 is present in the corresponding
division block, preceded by a bytecount. As with the paragraph and
character formats, bytes not stored contain default values. As many
division descriptors are in the division list as there are divisions. The
division list holds some more information, as can be seen in the structure
definition:
/*
 *	This structure describes the division list
 */
struct _divlist {
/* 00 */	ushort	dl_count;	/* the number of descriptors */
/* 02 */	ushort	dl_unk;	/* some unknown counter */
/* 06 */	MS_DivD	dl_list [0];	/* as many as needed */
};
typedef struct _divlist MS_DivL;
If one block (128 bytes) is not sufficient to store the division list, the list
can span block boundaries.
The division list is also the way to see if a 0x0c character stored within
the file is a 'forced new page' ('SHIFT-ENTER') or an 'end of division'
character. If there is no entry for the 0x0c character in the division list,
it must be only a new page.
10. Page list.
The last part of the file is the page list. This list contains information
about the pages in the document. This information is used to process the
"Go To..." command, and to show the small '=' signs in the margin. The page
list is updated when the document is repaginated by means of the
"Repaginate" command. The items in the page list are described by this
structure:
/*
 *	This structure describes a page list item
 */
struct _page {
/* 00 */	ushort	pg_num;	/* the page number */
/* 02 */	ulong	pg_text;	/* where it starts */
};
typedef struct _page MS_Page;
The page numbers seem to be always in numerical order. The pg_text field
is the offset in the text part where the page starts. Add 0x80 to get the
position relative to the start of the file. All page list items are contained
in the page list, which fills the last part of the file:
/*
 *	This structure describes the page list
 */
struct _plist {
/* 00 */	ushort	pl_count;	/* the number of list items */
/* 02 */	ushort	pl_unk;	/* some other count */
/* 04 */	MS_Page	pl_list [0];	/* the list */
};
typedef struct _plist MS_PList;
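Looking up an item in the page list could be sketched as follows. It
assumes most significant byte first storage and that the caller has the
whole page list in one contiguous buffer. The name pl_get is hypothetical.

```c
#include <stdint.h>

#define PL_LIST_OFF 4       /* offset of pl_list in the page list */
#define PG_SIZE     6       /* size of one page list item */
#define TEXT_START  0x80u   /* file offset of the text part */

/* ASSUMPTION: numbers are stored most significant byte first. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Get item i of the page list starting at pl.
 * Stores the page number and the file offset of its first character.
 * Returns 0 on success, -1 if i is not below pl_count.
 */
int pl_get(const uint8_t *pl, unsigned i, unsigned *page, uint32_t *file_off)
{
    const uint8_t *item;

    if (i >= be16(pl))                      /* pl_count */
        return -1;
    item = pl + PL_LIST_OFF + i * PG_SIZE;
    *page = be16(item);                     /* pg_num */
    *file_off = be32(item + 2) + TEXT_START; /* pg_text + 0x80 */
    return 0;
}
```

The 0x80 added here is the size of the header block, which precedes the
text part in the file.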
11. End of file.
Just after the page list the end of file is present. The file contains an
integral number of MS Word blocks. As an MS Word block is smaller than a
physical disk block, the end of file may be in the middle of the last
physical disk block.
			Dolf Starreveld / Maarten Carels
			Department of Computer Science, UvA
Usenet:			{dolf,maarten}@uva.uucp
			{seismo,decvax,philabs}!mcvax!uva!{dolf,maarten}
Snail mail:		Dolf Starreveld
			Department of Computing Science
			University of Amsterdam
			Kruislaan 409
			NL-1098 SJ  Amsterdam
			The Netherlands
Telephone:		In Holland:    020-592 5137/5022
			International: 31-20-592 5137 or 31-20-592 5022
Telex:			10262 HEF NL