[comp.os.minix] compression

ast@cs.vu.nl (Andy Tanenbaum) (07/16/89)

I have gone through the net postings and found a number of useful programs.
I have this fear that the next distribution is going to take 20 disks and cost
$149.  I have also noticed that compress only seems to win a factor of 2.
I wonder if better compression of C programs is possible.  In particular,
suppose we had a two pass compression program.  Pass 1 collected all the
strings in the program and their frequencies.  Pass 2 replaced the most
important string by ASCII code 128, the next most important one by 129 etc,
sort of like libpack.c does, only dynamically instead of using fixed strings.
Most important means that the product of # occurences times length is
maximum.  The compressed file would include regular ASCII characters 0-127,
and 120 or so characters denoting strings.  The last 8 could be for
1 tab, 2 tabs, 3 tabs, etc., and a two byte sequence for encoding runs of
blanks.  The dictionary would have to go too.  One would also have to think
carefully about what a string is, i.e., is "while" a string, or "while ("
a better choice?  It is my suspicion that such a program could compress
better than a factor of 2 on C programs.

Does anyone know of such a program, or want to write one?

Andy Tanenbaum (ast@cs.vu.nl)

nick_andrew%713.602@fidogate.fido.oz (Nick Andrew) (07/18/89)

Original to: ast@cs_vu_nl
        I fear these sorts of compression methods have
already been long discussed. The possibility Andy mentions
sounds a lot like Compress, i.e. common substring
replacement.

        Onto Andy's problem, which is "Minix distribution
too big", I can see that MUCH more is and will be available
than can fit onto a small distribution (10 disks) no matter
how well compressed.

        I would suggest then, the formation of a set of
Minix TOOLS, comprising tested postings from the net, and
available separately to the Minix OS.  Ideally, these tools
disks would be PD, but perhaps PH could make a reasonable
return on distribution.

        This way, OS students studying Minix need not buy a
whole heap of stuff that they will not need during their
course. Minix hackers who want to turn Minix into a
full-blown Unix can get the Minix Tools, and install
everything.  There would also be a place there for useful
utilities written for Minix which are not in V7 or other
versions.

        Regards, Nick.
--- Zeta
 * Origin: Zeta: Unix, Minix, Xenix support (02) 627-4177 (3:713/602)

henry@utzoo.uucp (Henry Spencer) (07/19/89)

In article <2888@ast.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:
>I wonder if better compression of C programs is possible...
>sort of like libpack.c does, only dynamically instead of using fixed strings.
>... It is my suspicion that such a program could compress
>better than a factor of 2 on C programs.

Andy, I just ran some quick tests using some C-analysis stuff I've got,
and I doubt that a simple approach will give you more than a factor of 2-3.
I ran a few large C programs through a tokenizer (one which retains white
space), and counted both the number of tokens (approximating the number of
output codewords, ignoring limits on codeword size) and the size of the
output after "sort -u" (approximating the size of the codeword dictionary).
This is actually an optimistic estimate because of the limits on codeword
size and the fact that my tokenizer essentially eliminates comments.  Best
case was about a factor of 3.  A quick look at eliminating all white space
(i.e. we assume a C-specific compressor whose decompressor includes a
paragrapher) suggests that this might perhaps get it to a factor of 4 in
favorable cases.  All in all, it doesn't seem a promising approach.
-- 
$10 million equals 18 PM       |     Henry Spencer at U of Toronto Zoology
(Pentagon-Minutes). -Tom Neff  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

dtynan@altos86.Altos.COM (Dermot Tynan) (07/19/89)

In article <2888@ast.cs.vu.nl>, ast@cs.vu.nl (Andy Tanenbaum) writes:

	[About the number of disks needed to distribute Minix, and
	 a compression algorithm].

If I remember correctly, the scheme you mentioned is the same one used by
'pack' on AT&T System V.  It doesn't come close to compress-16, in terms
of space savings.  However, a full 16-bit compress may be difficult to
implement.  In the past, I have thought about generating a version of
'compress' which used a disk-file, instead of memory.  However, such a
version would be *extremely* slow.  Particularly on a 4.77MHz 8088.

	For initial distribution of the Minix code, this probably isn't
too bad.  I guess if I were going to do it, I would link a miniature
version of the OS (with the BIOS-wini driver, no RS232, etc), which would
boot on just about any PC (IBM, that is, something similar for Atari, etc).
Then, just include the binaries for those utilities which are needed to
boot the system.  Then, archive everything, to increase the disk efficiency,
and use a 16-bit compress to cut it down to about 35% of its original size.

Now, what you have is an "inflatable" operating system.  The user would
boot the mini-root, and run a shell script called "install", which would
run for three or four weeks (:-), and generate a complete installation on
his/her hard-disk.  The big problem with a scheme like this, is the
install script has to be pretty bullet-proof.  Comments?
						- Der
-- 
	dtynan@altos86.Altos.COM		(408) 946-6700 x4237
	Dermot Tynan,  Altos Computer Systems,  San Jose, CA   95134

    "Far and few, far and few, are the lands where the Jumblies live..."

ast@cs.vu.nl (Andy Tanenbaum) (07/19/89)

In article <1989Jul18.174647.19537@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Andy, I just ran some quick tests using some C-analysis stuff I've got,
>and I doubt that a simple approach will give you more than a factor of 2-3.

If you have measured it, I'm inclined to believe it.  Still, intuitively
I would have thought that with typical tokens being 5 or 6 characters,
one should have gotten more.  I guess too many tokens are of the 1
character variety.

Maybe another approach is standard layout, e.g. 'while (' counts as 1
token.  Even if you write "while(' you get 'while (' on the other end.

Andy Tanenbaum (ast@cs.vu.nl)

ast@cs.vu.nl (Andy Tanenbaum) (07/19/89)

In article <3545@altos86.Altos.COM> dtynan@altos86.Altos.COM (Dermot Tynan) writes:
>The big problem with a scheme like this, is the
>install script has to be pretty bullet-proof.  Comments?

I agree it is essential.  For a distribution, what I would like is to boot
MINIX in the usual way, then on the /usr diskette have a small shell
script that copies things to hard disk and decompresses them.  
The decompression should just be a regular program, but there is no
objection to it's using disk files, as MINIX will be running when it is
called.

ANdy Tanenbaum (ast@cs.vu.nl)

meulenbr@cstw01.prl.philips.nl (Frans Meulenbroeks) (07/19/89)

Mmm. I don't think such a compression scheme would help very much.
If you really want to do something like that, the best way seems to
be to tokenize your source and apply Huffman coding on it.

I did a small test using compress.c
src: 38550 bytes
16 bit compress: 18189 bytes
13 bit compress: 18590 bytes
12 bit compress: 20776 bytes
arc (crunched): 20754 bytes
arc (squashed): 18618 bytes
I don't have the zoo compressor so I can't give figures on that one.

I think the main problem with the compression scheme that you suggest is
that there is a lot of comment in the code which is unprocessed
in your compression scheme.

By the way: proper punctuation helps a little.
After applying cb on compress.c the 16 bit compress was only
18030 bytes.

I think that indent would even format a little better but badly enough
my indent had some troubles with compress.c

note: all tests done on a SUN using the SUNOS (BSD) utilities.

Frans Meulenbroeks        (meulenbr@cst.prl.philips.nl)
	Centre for Software Technology
	( or try: ...!mcvax!phigate!prle!cst!meulenbr)

fnh@coyote.philips.com (Fletcher Holmquist;6483;4.85;$0351) (07/20/89)

I have done some preliminary tests along the lines of ast's suggestions.  By
replacing recurring strings by a three character token, I reduced much of
the GNU C source by about 36%.  The resulting file is printable ASCII text,
and compress can is used as a follow-up with good results.  The token
replacement seems to be quite worthwhile.

My preferred solution is a Huffman encoding of sequences, with slots for
keywords, identifiers, macro names, and common C sequences (", ", "{\n",
"/*", <four spaces>}.  This could be a static list to make make things
quick.  There would also be slots for the tokens as used above for recurring
strings not already recognized.  The process has two passes, one to hit the
strings and one to produce a Huffman code.

The savings for C code with <20% comments should be 60-80%.

I am writing an initial version now, and will mail things to ast when done.
If others are working the problem, please send me mail to see if we can
combine the good points, or at least argue.

					fletch (fnh@philabs.philips.com)
--
fletch holmquist			robotics and flexible automation
fnh@philabs.philips.com			philips laboratories
(914) 945-6483				345 scarborough road
					briarcliff manor, ny, 10510

hugh@dgp.toronto.edu (D. Hugh Redelmeier) (08/20/89)

| From: ast@cs.vu.nl (Andy Tanenbaum)
| Subject: compression
| Message-ID: <2888@ast.cs.vu.nl>
| 
| I have gone through the net postings and found a number of useful programs.
| I have this fear that the next distribution is going to take 20 disks and cost
| $149.  I have also noticed that compress only seems to win a factor of 2.
| I wonder if better compression of C programs is possible.  In particular,
| suppose we had a two pass compression program.  Pass 1 collected all the
| strings in the program and their frequencies.  Pass 2 replaced the most
| ...

Over a decade ago I wrote a Huffman compression program for UNIX.
It worked on "tokens" rather than characters.  I put the program in
the public domain, but few people noticed (there was no USENET then).

Since "compress" is the standard with which everyone is familiar, I
will contrast my program "huf" with compress.

- huf often compresses a little bit more than 16-bit compress.
  huf is quite a bit better than 12-bit compress.

 39614 compress.c	[my copy of compress]
 22842 compress.c.a	[arced with arc 5.21]
 21826 compress.c.Z12	[compressed with -b12]
 19871 compress.c.Z13	[compressed with -b13]
 19141 compress.c.Z16	[compressed without -b]
 17996 compress.cH	[huffed]

  And since turnabout is fair play:

 41053 huf.c
 21693 huf.c.a
 22713 huf.c.Z12
 20913 huf.c.Z13
 19610 huf.c.Z16
 18047 huf.cH

- huf was intended to compress source programs and natural language
  (English) documents.  It only accepts ASCII text (no characters
  above 0177).  It cannot handle texts with a large number of long
  distinct tokens (which is what uuencode output looks like).

- huf runs in 64k+64k space (UNIX only ran on PDP-11s when it was
  written).  In fact, it can be happy in much less.

- huf is slower on compression, but faster on decompression (last
  time I checked).

- compress is much more widely accepted, an important virtue.

Huf has been very stable.  The format of compressed files has never
changed.  I have never lost data because of bugs in it (but there
was an obscure bug with such a potential discovered in 1986).

Huf works on V7, 4.2BSD, and SystemVr2.  I presume it would work on
MINIX.  Files are not affected by byte order, but it is assumed that
bytes are eight bits.  I think that two's complement arithmetic is
required.

Here are some ideas for improvements.  I am not planning to make
these changes but welcome anyone else who would like to.  If huf
becomes widespread, changing the format of the compressed file
becomes a large logistics problem.

- In the dictionary, characters are represented in a 6-bit code.  A
  few percent improvement could be achieved by encoding these
  characters in a one-character-per-symbol Huffman code.  This would
  also make it cheap to allow non-ASCII characters in the text.

- With some care, huf could handle currently-pathological texts.  If
  the lexical analysis module noticed a problem (non-ascii text or
  an excessive number of distinct tokens) it could switch to another
  mode that would parse the text into tokens differently (say,
  single-byte tokens).

- One major oportunity for compression is external fragmentation.
  In UNIX, a file is stored in an integral number of blocks (or
  fragments, in BSD).  The last block, on average, is less than half
  full.  This is especially bad for short files.  This can be
  eliminated if the compression program also creates a more compact
  simulation of a file system.  arc does this.  tar does not -- it
  uses more blocks than the UNIX file system.  Would MINIX ar do
  this?

- The hashing function could be sped up.  A new hash function would
  not affect the portability of compressed files, but the code
  implementing the hash function might not be portable.  (Multiplies
  were relatively cheap on the machines on which huf was developed.)

- Since the hashing function multiplies two ints, it is probably
  particularly slow on Atari ST MINIX.  If the compiler knows how to
  do short*short with a single machine instruction, changing the
  hashing function to use shorts should make it run much faster.

Putting HUF up on MINIX
=======================

- Since I don't have MINIX, Mark Moraes very kindly tested huf on
  his ST MINIX.  This testing has not been extensive.

- huf.c uses varargs.h.  This is not part of older versions of MINIX.
  I have included the one distributed by AST in the newsgroup as part
  of the Version 1.4 upgrade.

- To configure huf.c, give the flag -DV7_16 to the compilation of
  huf.c Mark says -O worked for him.  A chmem will be needed: Mark
  found huf could compress itself with a chmem +500; more would be
  wise.  -DV7_16 specifies that huf is being compiled for and on a
  Version 7 UNIX system with 16-bit ints and pointers.  MINIX is
  close enough to Version 7 for this purpose.  "16-bit ints and
  pointers" is correct for MINIX on a PC-clone.  It is a useful lie
  on the ST: it should cause the hashing multiply to be a little
  faster, without causing any problems.

- I don't know how MINIX man-pages are distributed.  I am including
  a UNIX man-page.  If can be formatted on unix using:
	[nt]roff -man huf.1

Hugh Redelmeier
{utcsri, yunexus, uunet!attcan, utzoo, hcr}!redvax!hugh
When all else fails: hugh@csri.toronto.edu
+1 416 482-8253

#!/bin/sh
# To unbundle, sh this file
echo huf.1 1>&2
sed 's/^-//' >huf.1 <<'!'
-.TH HUF 1 Local
-.SH NAME
huf \- encode and decode Huffman coded text files
-.SH SYNOPSIS
-.B huf
-[
-\fB\-npuvw\fIn\fR
-] [ name ] ...
-.SH DESCRIPTION
-.IR Huf
is a program for compressing ASCII text files.
Text compression depends on the repetition of
-\fIwords\fP, that is, strings of alphanumeric characters.
On average, a compressed file is about half the size of the original.
The argument list of this command is a sequence of file
names and flags.
-.PP
A file whose name ends in ``H'' is decoded into a file
with the same path name, minus the ``H'' (unless the
-``\fBp\fP'' option is used).
-.PP
A file whose name does not end in ``H'' is encoded
into a file with the same path name, plus an ``H''
-(unless the ``\fBp\fP'' option is used).  The encoding process
takes two passes, requiring that the file not be a device or pipe.
The file must not be modified while being encoded.
-.PP
Several attributes of the newly created file are set to match those of the
file it was derived from: owner, group, permissions, modification
time, and access time.
-.PP
A flag operand starts with ``\-'' and continues with any number
of option names.
Each option name (except \fBw\fP)
complements the current setting of the corresponding option.
Each option is initially off.
-.TP 8
-.B \-n
-(for noisy) option causes various statistics
to be printed on the diagnostic output.
-.TP
-.B \-p
causes the output to be printed on the
standard output (even if this output is encoded).
-.TP
-.B \-u
causes the input file to be unlinked.
If any error is detected(!) during translation, the file
is not unlinked.
-.TP
-.B \-v
causes the output file to be verified:
the reverse transformation is performed and compared with the input.
Verification has not been necessary.
-.TP
-.BI \-w n
sets the upper bound on the
number of unique words allowed in a file to be encoded.
If this bound is exceeded, the amount of compression is reduced.
This bound is set to the smallest prime number greater than
-\fIn\fP that is known to the program.
The default bound is about 3000; the maximum is about 7000.
-.SH "LOCAL INFO"
Written at the University of Toronto by D. Hugh Redelmeier.
!
echo varargs.h 1>&2
sed 's/^-//' >varargs.h <<'!'
-/*  varargs.h  */

typedef char *va_list;

-#define  va_dcl		int va_alist;
-#define  va_start(p)	(p) = (va_list) &va_alist;
-#define  va_arg(p,type)	( (type *) ((p)+=sizeof(type)) )[-1]
-#define  va_end(p)

-#define vfprintf 	_doprintf
-#define vprintf(fmt,args)	vfprintf(stdout,fmt,args)
!
echo huf.c 1>&2
sed 's/^-//' >huf.c <<'!'
-/* Huf - encode and decode Huffman coded text files.
- *
- *	D. Hugh Redelmeier  1978 June 24
- *
- * To Compile (See "Portability Issues" below):
- *
- *	cc -s -O -i -z -Dsystem %
- *	-s: strip
- *	-O: optimize
- *	-i: split I and D on PDP-11 etc.
- *	-z: trap references through NULL on System Vr2
- *	-Dsystem: chose configuration (CUSTOM, V7_16, V7_32, BSD_32, or SYSV_32)
- *
- * To Execute:
- *
- *	The arguments to this command are a sequence of file names
- *	and flags.  These aruments are processed left to right.
- *
- *	A flag starts with "-" and continues with any number of
- *	options.
- *
- *	The "n" (for noisy) option causes various statistics to
- *	be printed on the diagnostic output.
- *	The flag complements the option which is initially off.
- *
- *	The "l" (for local) option causes the output file to
- *	be placed in the current directory (with the appropriate
- *	filename).  The flag complements the option which is
- *	initially off.
- *
- *	The "p" option causes the output to be Printed on the
- *	standard output (even if this output is encoded).
- *	The flag complements the option which is initially off.
- *
- *	The "u" option causes the input file to be Unlinked.
- *	If any error is detected(!) during translation, the file
- *	is not unlinked.  For safety reasons, this option may not
- *	be use with the print option.  The flag complements the
- *	option which is initially off.
- *
- *	The "v" option causes the new file to be verified after
- *	creation.  The verification occurs before any unlink operation
- *	implied by the "u" option.  Verification is not be possible
- *	with the "p" option.  Verification has never proven
- *	necessary with working hardware.  The flag complements
- *	the option which is initially off.
- *
- *	The "w<number>" option sets the upper bound on the
- *	number of unique Words allowed in a file to be encoded.
- *	To reduce hashing collisions, ten percent more words than
- *	specified are reserved but not used.  The resulting number
- *	is moved up to the nearest prime known to the program.
- *
- *	The "d" (for dump) option causes huf to abort when an error
- *	is detected.
- *
- *	A file whose name ends in <suffix> is decoded into a file
- *	with the same path name, minus the <suffix> (unless the
- *	"p" option is used).
- *
- *	A file whose name does not end in <suffix> is encoded
- *	into a file with the same path name, plus <suffix>
- *	(unless the "p" option is used).  The encoding process
- *	takes two passes, requiring that the file not be a device.
- *	The file must not be modified while being encoded.
- *
- * Structure of encoded file:
- *
- *	bits(byteLen) magicNum&byteMask, (magicNum>>byteLen)&byteMask;
- *	bits(nwLen) numWords;
- *	bits(specLen) numSpecs-1;
- *	bits(specLen) specs[numSpecs];	// in frequency order
- *	wc words[numWords];	// in frequency order
- *	bit spec_tree_building_decisions[];
- *	bit word_tree_building_decisions[];
- *	bit starts_with_word?;
- *	bit hufman_code[];
- */

-/* Modification History:
- *
- * 1981?; DHR: Ported to V7 on a VAX (much wailing & gnashing of bits).
- * 1984 Sept; DHR: Copy modtime of original file to derivative.
- * 1984 Nov; DHR: Make overflow symbols into solitaries.
- *	Added "leap" to speed decoding.
- *	Improved the hash function and avoided clustering.
- * 1984 Dec; DHR: Skip bad files (instead of quitting).
- *	Try to copy owner and group of original file.
- * 1986 Oct 31; DHR: Fix off-by-one bug in lex() affecting tabs.
- *	The effect of this bug is to turn certain newlines into
- *	'`' (if the preceding newline was followed by more than
- *	12 HTabs, and this newline is followed by none).  In a
- *	similar circumstance, semicolon followed by newline would
- *	be turned into '@').
- *	Sped up hashing by eliminating one multiply per word.
- *	Added the l option.
- *	Added #ifdef for VMSUNITY.
- * 1987 May 12; DHR: Improve portability:
- *	Use varargs.h for errf().
- *	Allow malloc to substitute for brk/sbrk (but not well).
- *	Declare procedures as returning void.
- *	Cut down number of write() calls from errf().
- * 1987 September 27; DHR: Make ANSI compatible
- *	Yet another change to the diagnostic stuff:
- *	- diagfmt() formats a message
- *	- diagpr() prints in on stderr
- *	- error() no longer takes a variable number of args;
- *	  if its arg is NULL, it finds the message in the diagnostic
- *	  buffer (formatted by diagfmt(), but not yet printed).
- *	- added function prototypes.
- * 1989 August 14; DHR: Make squeaky clean.
- *	- Conditionally use ANSI's stdarg.h.
- *	- added "static" to all non-exports.
- *	- A few more things are "unsigned".
- *	- A few more coecions are explicit.
- *	- Parameters of compar() are "pointer" (not "tree *")
- *	- Fixed bug in error().
- *	- Avoid MINIX C bug [would not accept x++ -> y].
- *	- Tinkered with formatting.
- */

-/* Portability Issues:
- *
- *	Memory allocation can be done using sbrk() or malloc().
- *	sbrk() is the more efficient, and the original technique.
- *	malloc() has been hacked in recently.  To use malloc(),
- *	define the symbol USEMALLOC.  The version using malloc()
- *	is quite crude and wasteful: it allocates a maximum
- *	amount of memory and aborts if this is not enough.
- *	The amount allocated is the value of the symbol USEMALLOC!
- *
- *	If memory allocation is done using sbrk(), no mallocs may
- *	be used, even by library routines (hence errf() is defined
- *	and used instead of fprintf(stderr, ...): stdio is not safe).
- *
- *	The overlaying of various memory structures is dicey, but
- *	it is probably portable.
- *
- *	The size of an int must be at least twice the size of a byte
- *	and greater than nwLen.
- *
- *	"hashMult" is sensitive to hardware wordlength, and will be
- *	wrong unless the machine is a byte machine with exactly 32-bit
- *	longs.  This will not be a disaster.
- *
- *	"bufSize" should be a multiple of the file system block size:
- *	bigger is faster, but wastes space.
- *
- *	The value of nwLen reflects the size of the PDP-11 address
- *	space.  To enable larger address spaces to be exploited,
- *	it should be increased (larger entries would have to be added
- *	to primes[]).  Changing nwLen would make old encoded files
- *	incompatible so the limit has not been changed.
- *
- *	The format of the encoded file should be universal unless
- *	one of the constants mentioned in the format description
- *	is changed.
- *
- *	Under System V, we define a dummy routine "_cleanup".
- *	This maybe wrong for some systems.  See the explanation
- *	where _cleanup is defined.
- *
- *	Berkeley renames utime(2) as utimes(2).  If running
- *	under Berkeley, define the preprocessor symbol "BSD".
- *
- *	Under Berkeley 4.2, file name components of path names are
- *	not limited in length.  Define the symbol LONGFILENAMES
- *
- *	"HZ", the clock frequency, must be a multiple of 10.
- *	Otherwise, the times reported by noisy mode will be funny
- *	(big deal).
- *
- *	VMS/Unity is H.C.R.'s emulation of System V under DEC's
- *	VAX/VMS system.  It does not support chown() and utime().
- */

-/* The following include is needed for <sys/times.h>.
- * It is included early because (under BSD) it defines size_t.
- */
-#include <sys/types.h>

-/* ---------------- Custom Configuration ---------------- */
-#ifdef	CUSTOM
-/* #define	BSD	1	// defined iff running under Berkeley */
-/* #define	LONGFILENAMES	1	// defined iff no bound on filename */
-/* #define	SYSV	1	// defined iff running under System V */
-/* #define	VMSUNITY	1	// defined iff running under VMSUNITY */
-/* #define	USEMALLOC 65534	// for systems on which sbrk() doesn't work */
-/* typedef int void;	// needed for older compilers (e.g. V7) */
-/* typedef unsigned int size_t;	// type of result of sizeof */

-/* hashMult is the inverse golden ratio scaled by the word length.
- * See Knuth Volume 3, 6.4.
- * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1
- */
-/* #define hashMult	20251	// 16-bit int */
-/* #define hashMult	1327217885	// 32-bit int */

-#define	HZ	60	/* clock frequency */
-#define bufSize	1024	/* bigger is faster; no other effect */
-#endif	/* Custom Configuration */

-/* ---------------- V7, 16-bit int and char * ---------------- */
-#ifdef	V7_16
typedef int void;	/* needed for older compilers (e.g. V7) */
typedef unsigned int size_t;	/* type of result of sizeof */

-/* hashMult is the inverse golden ratio scaled by the word length.
- * See Knuth Volume 3, 6.4.
- * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1
- */
-#define hashMult	20251	/* 16-bit int */

-#define	HZ	60	/* clock frequency */
-#define bufSize	1024	/* bigger is faster; no other effect */
-#endif	/* V7 on 16-bit machine */

-/* ---------------- V7, 32-bit int and char * ---------------- */
-#ifdef	V7_32
typedef int void;	/* needed for older compilers (e.g. V7) */
typedef unsigned int size_t;	/* type of result of sizeof */

-/* hashMult is the inverse golden ratio scaled by the word length.
- * See Knuth Volume 3, 6.4.
- * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1
- */
-#define hashMult	1327217885	/* 32-bit int */

-#define	HZ	60	/* clock frequency */
-#define bufSize	1024	/* bigger is faster; no other effect */
-#endif	/* V7 on 32-bit machine */

-/* ---------------- BSD, 32-bit int and char * ---------------- */
-#ifdef	BSD_32
-#define	BSD	1	/* defined iff running under Berkeley */
-#define	LONGFILENAMES	1	/* defined iff no bound on filename */
-/* size_t defined in <sys/types.h> */

-/* hashMult is the inverse golden ratio scaled by the word length.
- * See Knuth Volume 3, 6.4.
- * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1
- */
-#define hashMult	1327217885	/* 32-bit int */

-#define	HZ	100	/* clock frequency */
-#define bufSize	1024	/* bigger is faster; no other effect */
-#endif	/* BSD version */

-/* ---------------- System V, 32-bit int and char * ---------------- */
-#ifdef	SYSV_32
-#define	SYSV	1	/* defined iff running under System V */
typedef unsigned int size_t;	/* type of result of sizeof */

-/* hashMult is the inverse golden ratio scaled by the word length.
- * See Knuth Volume 3, 6.4.
- * 1327217885 == ((sqrt(5)-1)/2)*(2**31) | 1
- */
-#define hashMult	1327217885	/* 32-bit int */

-#define	HZ	100	/* clock frequency */
-#define bufSize	1024	/* bigger is faster; no other effect */
-#endif	/* SysV on VAX version */

-/* end of system specific definitions */

-#define null		0

-#ifdef	__STDC__
-#define	proto(p)	p	/* use prototype information */
typedef void *pointer;	/* universal pointer type */
-#else	/*!__STDC__*/
-#define	proto(p)	()	/* ignore prototype information */
typedef char *pointer;	/* universal pointer type */
-#endif	/*!__STDC__*/


-/* A node pointer is a pointer to one of:
- * - an Eleaf (an encoding leaf)
- * - a Dleaf (a decoding leaf -- a character string)
- * - an intrn (an internal node)
- *
- * The way these are discriminated between is by where the pointer points.
- * This is not a union of pointers (because no matter which is being
- * pointed to, the pointer representation must be identical so compares
- * work).  Nor is it a pointer to a union (because a union would be
- * aligned on a boundary suitable for its most-aligned component, and
- * this might change the pointer representation).  It must be represented
- * as a universal pointer, and cast when it must be used as a pointer
- * to a particular kind.
- */

typedef pointer nodePtr;

typedef struct {	/* Huffman tree internal node */
-	nodePtr left;
-	nodePtr right;
-} intrn;

typedef struct {	/* Huffman tree root */
-	unsigned Tcount;	/* Tree frequency count */
-	nodePtr root;		/* top node */
-} tree;	/* must be no bigger than Eleaf */

typedef union {	/* Huffman Encoding tree leaf */
-	unsigned Lcount;	/* Leaf frequency count */
-	long code;	/* Huffman code for this leaf */
-	tree FUDGE3;	/* force size requirement */
-} Eleaf;	/* must be as big as tree */

typedef char Dleaf;	/* decoding leaf: a flag-terminated string */

-#define numAscii	128
-#define asciiMask	(numAscii-1)
-#define aFlag		0200
-#define perNl		'0'
-#define commaSp	'1'
-#define commaNl	'2'
-#define eof		'9'
static nodePtr eofPtr;	/* Eleaf or Dleaf for eof */
-#define maxTabs	(('z'-'a')/2)
-#define nlTabs		('a'+maxTabs)
-#define scNlTabs	('A'+maxTabs)
static int curTabs;

-#define numWc		64
-#define wcLen		6
-#define wcMask		(numWc-1)
static char wcToA[numWc];
static char aToWc[numAscii];
static void initWordCode proto((void));

-#define wfBias		numAscii
-#define maxSpecs	(numAscii+wfBias)
-#define specLen	8
-#define specMask	(maxSpecs-1)

-#define solitary	maxSpecs
static nodePtr solPtr;	/* Eleaf or Dleaf for solitary */

-#define byteLen	8
-#define byteMask	((1<<byteLen) - 1)
-#define longLen	32
-#define lenLen	5
-#define lenMask	((1<<lenLen) - 1)

static char *mem;
static char *memRoof;
-#ifdef	USEMALLOC
extern pointer malloc proto((size_t));
-#else	/*!USEMALLOC*/
extern pointer sbrk proto((int));
extern int brk proto((pointer));
-#endif	/*!USEMALLOC*/

-/* nwLen must be less than the number of bits in "bits", an unsigned */
-#define nwLen		13
-#define nwMask		((1<<nwLen) - 1)
static unsigned maxWords	=	3121;	/* prime */

static char *chars;
static char *nxtChar;
-#define charRoof	memRoof
-#define charChunk	2048

-#define magicNum	((unsigned) 0145405)

extern int	creat proto((char *, int));
extern int	unlink proto((char *));
extern int	open proto((char *, int));
extern int	close proto((int));

extern int	write proto((int, char *, size_t));
extern int	read proto((int, char *, size_t));
extern long lseek proto((int fildes, long offset, int whence));

extern void qsort proto((pointer base, size_t nel, size_t, int (*compar)(pointer, pointer)));

static int inf;
static char inbuf[bufSize];
static char *inpos;
static char *inend;
static long inLen;
static int inLine	=	0;

static int outf;
-#define suffix		'H'
-#define pnSize		200
-#ifdef	LONGFILENAMES
-#define fnSize		pnSize
-#else
-#define fnSize		14
-#endif
static char pname[pnSize+2];	/* pathname (includes room for "H\0") */
static char *fname;
static char *fnEnd;
static char *inName;
static char *outName;
static char outbuf[bufSize];
static char *outpos;
static long outLen;
static void putbuf proto((void));

static int bits;
static int nxtBit;

static intrn *intrns;

static char *myName;
static char **curParm;
static int parmsLeft;
static int noisy	=	0;	/* print statistics */
static int local	=	0;	/* local: put output file in . */
static int print	=	0;	/* use standard output, not a created file */
static int unl		=	0;	/* unlink input file at end */
static int dump	=	0;	/* dump on error */
static int verify	=	0;	/* verify result */
static int verifying;			/* true if verifying now */
static char verbuf[bufSize];		/* buffer for verification */

extern void	abort proto((void));
extern void exit proto((int));

-#define	MAXDIAG	200
static char	DiagMess[MAXDIAG];
static char	*DiagPtr;
static void diagfmt proto((char *fmt, ...));
static void diagpr proto((void));
static void crash proto((char *cause));
static void error proto((char *m));

-#ifdef	SYSV
-/* Avoiding stdio seems to be very difficult under System V.
- * At least on the version I used, exit() called _cleanup()
- * which forced stdio to be linked in.  Here we define a dummy
- * version to avoid stdio.  I wonder about the portability of this.
- */
void
-_cleanup()
-{
-}
-#endif	/*SYSV*/

-#include <setjmp.h>
static jmp_buf skipFileJmpBuf;
static int retCode =	0;

static void encode proto((void));
static void initLex proto((void));
static void censusPass proto((void));
static void encWord proto((Eleaf *w));
static void assignCodes proto((nodePtr nodeArg, long c, unsigned cl));

static void decode proto((void));

-#include <sys/times.h>
-#ifdef	SYSV
extern long times proto((struct tms *));
-#else	/*!SYSV*/
-#ifdef	BSD
extern long times proto((struct tms *));
-#else	/*!BSD*/
extern void times proto((struct tms *));	/* V7 is in the minority */
-#endif	/*!BSD*/
-#endif	/*!SYSV*/
static struct tms tbuffer;
-#ifdef	BSD
-#define	utimes	utime
-#endif
struct utimbuf {	/* under V7 & BSD: should be an array */
-	time_t actime;	/* access time */
-	time_t modtime;	/* modification time */
-};
extern int utime proto((char *, struct utimbuf *));
static struct utimbuf ftimes;
-#include <sys/stat.h>

extern int	stat proto((char *, struct stat *));
extern int	fstat proto((int, struct stat *));
extern int	chown proto((char *, short, short));

static struct stat statbuf;

-#define hashMask	0777	/* smaller than every prime[] */
static unsigned primes[]={	607,	947,	1109,	1511,
-		1913,	2393,	2707,	3121,
-		3529,	3911,	4363,	4721,
-		5167,	5521,	5939,	6311,
-		6719,	7121,	7541,	0 };

int
main(argc,argv)
-	int argc;
-	char *argv[];
-{
-	register char *p;
-	time_t putime, pstime;

-	parmsLeft=argc;
-	curParm=argv;
-	myName = *curParm;
-	initWordCode();
-#ifdef	USEMALLOC
-	mem= (char *) malloc(USEMALLOC);
-	if (mem==null)
-		error("can't malloc at all");
-	memRoof=mem+USEMALLOC;
-#else	/*!USEMALLOC*/
-	memRoof=mem=(char *) sbrk(0);
-#endif	/*!USEMALLOC*/
-#ifdef	SYSV
-	(void)
-#endif
-	times(&tbuffer);
-	while (--parmsLeft != 0) {
-		p = *++curParm;
-		if (*p=='-') {
-			while (*++p) {
-				switch (*p) {
-				case 'd':
-					dump = !dump;
-					break;
-				case 'n':
-					noisy = !noisy;
-					break;
-				case 'l':
-					local = !local;
-					break;
-				case 'p':
-					print = !print;
-					break;
-				case 'u':
-					unl = !unl;
-					break;
-				case 'v':
-					verify = !verify;
-					break;
-				case 'w':
-					{
-					unsigned n;
-					unsigned *pp;

-					for (n=0; '0'<= p[1] && p[1]<='9';)
-						n=n*10+(*++p-'0');
-					for (pp=primes; *pp<n; pp++)
-						if (*pp==0) {
-			diagfmt("Number must be less than %d", (*--pp/11)*10);
-							error((char *) null);
-						}
-					maxWords = *pp;
-				}
-					break;
-				default:
-					diagfmt("Bad flag: \"%c\"",*p);
-					error((char *) null);
-				}
-			}
-		} else {
-			if (print && (verify || unl))
-				error("-p may not be used with -v or -u");
-			fnEnd=fname=pname;
-			while (*p != '\0') {
-				if (*p == '/') {
-					do ; while (*++p == '/');
-					if (*p != '\0') {
-						*fnEnd++ = '/';
-						fname = fnEnd;
-					}
-				} else {
-					if (fnEnd == fname+fnSize
-					||  fnEnd == pname+pnSize)
-						error("File name too long");
-					*fnEnd++ = *p++;
-				}
-			}
-			*fnEnd = '\0';
-			if (setjmp(skipFileJmpBuf)) {
-				if (inf>=0)
-					(void) close(inf);
-			} else {
-				inName=pname;
-				for(verifying=0;verifying<=verify;verifying++) {
-					inf=open(inName,0);
-					if (inf<0)
-						error("+Can't open");
-					if (fstat(inf,&statbuf) == -1)
-						crash("Can't fstat input file");
-					if ((statbuf.st_mode&S_IFMT)!=S_IFREG
-					&& (statbuf.st_mode&S_IFMT)!=S_IFBLK)
-					{
-		diagfmt("+Invalid file mode (0%o)", statbuf.st_mode&S_IFMT);
-						error((char *) null);
-					}
-					ftimes.modtime = statbuf.st_mtime;
-					if (noisy) {
-						diagfmt("%s: ",inName);
-						diagpr();
-					}
-					outpos=outbuf;
-					outLen=0;
-					bits=nxtBit=0;
-					if (fnEnd!=fname && fnEnd[-1]==suffix)
-						decode();
-					else
-						encode();
-					inLine=0;
-					putbuf();
-					if (fstat(inf,&statbuf) == -1)
-						crash("Can't fstat input file");
-					if (ftimes.modtime != statbuf.st_mtime)
-						error(
-					"Input file modified while processing");
-					ftimes.actime = statbuf.st_atime;
-					(void) close(inf);
-					if (outf!=1) {
-						(void) close(outf);
-#ifndef	VMSUNITY
-					if (!verifying) {
-			if (utime(outName,&ftimes)!=0)
-				error("Can't set modification time");
-			(void) chown(outName,statbuf.st_uid,statbuf.st_gid);
-					}
-#endif
-					}
-					if (noisy) {
-						diagfmt(
-						"%D bytes in; %D bytes out.\n",
-							inLen,outLen);
-						diagpr();
-						putime=tbuffer.tms_utime;
-						pstime=tbuffer.tms_stime;
-#ifdef	SYSV
-						(void)
-#endif
-						times(&tbuffer);
-						diagfmt(
-						"%Dds user; %Dds system.\n",
-					(tbuffer.tms_utime-putime)/(HZ/10),
-					(tbuffer.tms_stime-pstime)/(HZ/10));
-						diagpr();
-					}
-					inName = outName;
-				}
-				if (unl) {
-					if (unlink(*curParm) == -1)
-						error("Can't unlink");
-				}
-			}
-		}
-	}
-	return retCode;
-}

static void
initWordCode()
-{
-	register int i;
-	register int c;

-	for (c=i=0; i!=numAscii; i++) {
-		if ('a'<=i && i<='z' || 'A'<=i && i<='Z'
-		|| '0'<=i && i<='9' || i=='_')
-		{
-			c++;
-			if (c>=numWc)
-				crash("Bad wc");
-			wcToA[c]=i;
-			aToWc[i]=c;
-		} else {
-			aToWc[i]=0;
-		}
-	}
-}

-/*
- *	Input/Output Routines
- */

-/* diagfmt() is much like sprintf to DiagMess, however it avoids stdio.
- * This avoidance is because stdio takes memory & uses malloc.
- */

static void
diagpr()
-{
-	if (DiagPtr<DiagMess || &DiagMess[MAXDIAG]<=DiagPtr)
-		crash("bad DiagPtr");
-	if (DiagPtr != DiagMess)
-		write(2, DiagMess, (size_t) (DiagPtr-DiagMess));
-}

static void
errc(c)
-	char c;
-{
-	if (DiagPtr<DiagMess || &DiagMess[MAXDIAG]<=DiagPtr)
-		crash("bad DiagPtr");
-	*DiagPtr++ = c;
-}

static void
errs(s)
-	char *s;
-{
-	register char *p;

-	p=s;
-	while (*p != '\0')
-		errc(*p++);
-}

-#ifdef	__STDC__
-#include <stdarg.h>
static void
diagfmt(char *fmt, ...)
-{
-#else	/*!__STDC__*/
-#include <varargs.h>
-/*VARARGS1*/ static void
diagfmt(va_alist)
-	va_dcl
-{
-#endif	/*!__STDC__*/
-	va_list argp;
-	register char *fp;
-	register char *fnp;
-	register int b;
-	register long n;
-	static char fnum[20];
-	register int neg;

-#ifdef	__STDC__
-	fp = fmt;
-	va_start(argp, fmt);
-#else	/*!__STDC__*/
-	va_start(argp);
-	fp = va_arg(argp, char *);
-#endif	/*!__STDC__*/
-	DiagPtr = DiagMess;
-	for (;;) {
-		b = *fp++;
-		if (b == '\0') {
-			break;
-		} else if (b == '%') {
-			switch (*fp++) {
-			case '%':
-				errc('%');
-				continue;
-			case 'c':
-				errc(va_arg(argp, int));
-				continue;
-			case 's':
-				errs(va_arg(argp, char *));
-				continue;
-			case 'o':
-				b = 8;
-				n = va_arg(argp, unsigned);
-				break;
-			case 'O':
-				b = 8;
-				n = va_arg(argp, long);
-				break;
-			case 'l':
-				b = 10;
-				n = va_arg(argp, unsigned);
-				break;
-			case 'd':
-				b = 10;
-				n = va_arg(argp, int);
-				break;
-			case 'D':
-				b = 10;
-				n = va_arg(argp, long);
-				break;
-			default:
-				crash("bad format code");
-			}
-			fnp = &fnum[sizeof fnum];
-			*--fnp = '\0';
-			if ((neg = (n<0)) != 0)
-				n = -n;
-			do {
-				*--fnp = n%b + '0';
-				n /= b;
-			} while (n != 0);
-			if (neg)
-				*--fnp = '-';
-			errs(fnp);
-		} else {
-			errc(b);
-		}
-	}
-	va_end(argp);
-}

-/*ARGSUSED*/ static void
crash(cause)
-	char *cause;	/* read this with a debugger */
-{
-	(void) abort();
-	exit(2);	/* just in case! */
-}

static void
error(m)
-	char *m;
-{
-	int errSkip;
-	char	mess[MAXDIAG+1];

-	if (m == (char *) null) {
-		register char	*from = DiagMess;
-		register char	*to = mess;
-		m = mess;
-		while (from != DiagPtr)
-			*to++ = *from++;
-		*to = '\0';
-	}
-	if ((errSkip = *m=='+') != 0)
-		m++;
-	diagfmt("%s(%s)",myName,*curParm);
-	diagpr();
-	if (inLine) {
-		diagfmt(" in line %l",inLine);
-		diagpr();
-	}
-	diagfmt(": %s%s\n",m,errSkip? " -- file skipped." : ".");
-	diagpr();
-	if (dump)
-		crash("Error.");
-	if (errSkip) {
-		retCode = 2;
-		longjmp(skipFileJmpBuf,1);
-	}
-	exit(3);
-}

static void
nextBlock()
-{
-	register int len;
-	if (inpos>inend)
-		error("Unexpected EOF");
-	inpos=inend=inbuf;
-	len=read(inf,inbuf,(size_t) bufSize);
-	if (len<=0) {
-		if (len<0)
-			error("Input I/O error");
-		*inend = '\0';
-	}
-	inLen += len;
-	inend += len;
-}

-#define needByte()	{ if (inpos>=inend) nextBlock(); }

static void
addByte()
-{
-#ifndef	I8086	/* avoid hardware(!) bug */
-	if (bits>>nxtBit != 0)
-		crash("Bad bits");
-#endif
-	needByte();
-	bits |= (*inpos++ & byteMask) << nxtBit;
-	nxtBit += byteLen;
-}

-#define needBits(n)	{ if(nxtBit<(n)) addByte(); nxtBit -= (n); }

static void
croutf()
-{
-	register unsigned mode;

-	if (verifying) {
-		outName = pname;
-		outf = open(outName,0);
-	} else {
-		outName = local? fname : pname;
-		mode=statbuf.st_mode;
-		if (stat(outName,&statbuf) != -1) {
-			diagfmt("\"%s\" already exists; I will not overwrite",
-				outName);
-			error((char *) null);
-		}
-		outf=creat(outName,(int)mode);
-	}
-	if (outf == -1) {
-		diagfmt("Can't create \"%s\"",outName);
-		error((char *) null);
-	}
-}

static void
putbuf()
-{
-	register size_t i;

-	i = outpos-outbuf;
-	outLen += i;
-	if (verifying) {
-		if (read(outf,verbuf,(size_t) bufSize) != i)
-			error("Bad length on verify");
-		while (i--)
-			if (verbuf[i] != outbuf[i])
-				error("Verification failed");
-	} else {
-		if (write(outf,outbuf,i) != i)
-			error("Output I/O error");
-	}
-	outpos=outbuf;
-}

-#define outch(c)	{ *outpos++ = (c); if(outpos==outbuf+bufSize) putbuf(); }

static void
gotByte()
-{
-	outch(bits);
-	bits >>= byteLen;
-	nxtBit -= byteLen;
-}

-#define gotBits(n)	{ nxtBit += (n); if(nxtBit>=byteLen) gotByte(); }


-/*	Encoder dynamic memory structures:
- *
- *	Eleaf leaves[maxSyms];
- *		leaves, eofPtr, solPtr, leavesRoof.
- *	tree forest[maxSyms] overlaying leaves;
- *		(so we require sizeof(tree) <= sizeof(leaf))
- *		forest synonym for leaves.
- *	intrn intrns[maxWords];
- *		intrns.
- *	char *words[maxWords] overlaying start of intrns;
- *		words.
- *	char chars[] overlaying end of intrns and extending beyond;
- *		chars, nxtChar, charRoof (synonym for memRoof).
- */

static Eleaf *leaves, *leavesRoof;
-#define forest ((tree *) leaves)
static char **words;
static unsigned wordLimit;
static unsigned wordsLeft;
static unsigned solCnt;
static unsigned oflowCnt, oflowChars;
static unsigned maxSyms;
static unsigned maxCodeLen;
static long tabLen;
static long symLookups,hashProbes;
static long bytesSpecials,bytesWords;
static char *charCeil;	/* charRoof-1 */
static Eleaf *lex proto((void));	/* declare forward */

static void
encode()
-{
-	register int i;
-	pointer memNeed;
-	register Eleaf *symbol;
-	register long lbits;

-	wordLimit=maxWords-maxWords/11;	/* consider table full at 90% */
-	maxSyms=maxSpecs+1+maxWords;
-	leaves = (Eleaf *) mem;
-	solPtr = (nodePtr) &leaves[solitary];
-	eofPtr = (nodePtr) &leaves[eof];
-	leavesRoof = &leaves[maxSyms];
-	intrns = (intrn *) leavesRoof;
-	memNeed = (pointer) &intrns[maxWords];
-	words = (char **) leavesRoof;
-	chars = (char *) &words[maxWords];
-	if (memNeed < (pointer)chars)
-		memNeed = (pointer)chars;
-	if ((pointer)memRoof < memNeed) {
-#ifdef	USEMALLOC
-		error("out of fixed memory allocation");
-#else	/*!USEMALLOC*/
-		if (brk(memNeed) == -1)
-			error("Can't get memory needed");
-		memRoof=memNeed;
-#endif	/*!USEMALLOC*/
-	}
-	charCeil=charRoof-1;
-	maxCodeLen=0;
-	hashProbes=0;
-	symLookups=0;
-	oflowChars=0;
-	censusPass();
-	initLex();
-	bits |= (aToWc[*inpos&asciiMask]!=0) << nxtBit;
-	gotBits(1);
-	do {
-		symbol=lex();
-		lbits=symbol->code;
-		if (lbits == -1L)
-			lbits=((Eleaf *) solPtr)->code;
-#ifdef	DEBUG
-		diagfmt("sym %o outpos %d nxtBit %d bits %o code %O\n",
-			symbol,outpos-outbuf,nxtBit,bits,lbits);
-		diagpr();
-#endif	/*DEBUG*/
-		i=lbits&lenMask;
-		lbits = lbits>>lenLen<<nxtBit | bits;
-		nxtBit += i;
-		while (nxtBit>=byteLen) {
-			outch( (char) lbits );
-			lbits >>= byteLen;
-			nxtBit -= byteLen;
-		}
-		bits=lbits;
-		if ((nodePtr)symbol==solPtr || symbol->code==-1L)
-			encWord(symbol);
-	} while ((nodePtr)symbol!=eofPtr);
-	if (bits!=0) {
-		outch(bits);
-		if (bits<0)
-			crash("Signed bits");
-	}
-	if (noisy) {
-		diagfmt("%d of %d words containing %d characters.\n",
-			wordLimit-wordsLeft,maxWords,nxtChar-chars);
-		diagpr();
-		diagfmt("%d solitaries", solCnt-oflowCnt);
-		diagpr();
-		if (oflowCnt != 0) {
-			diagfmt("; %d words of %d characters overflowed",
-				oflowCnt, oflowChars);
-			diagpr();
-		}
-		diagfmt(".\n%D symbol lookups;  %D hash probes.\n",
-			symLookups,hashProbes);
-		diagpr();
-		diagfmt("%D bytes of tables; %d bits in longest code.\n",
-			tabLen,maxCodeLen);
-		diagpr();
-		diagfmt("%D bytes of coded specials; %D bytes of coded words.\n",
-			bytesSpecials,bytesWords);
-		diagpr();
-	}
-}

static long buildTree proto((tree *, tree *, Eleaf*));
static int compar proto((pointer, pointer));

static void
censusPass()
-{
-	register Eleaf *symbol;
-	tree *wf;		/* word forest */
-	register tree *wfn;	/* word forest next */
-	register tree *sfn;	/* spec forest next */
-	tree *treep;

-	initLex();
-	for (symbol=leaves; symbol!=leavesRoof; symbol++)
-		symbol->Lcount=0;
-	do {
-		symbol=lex();
-#ifdef	DEBUG
-		diagfmt("Symbol %d\n",symbol-leaves);
-		diagpr();
-#endif	/*DEBUG*/
-		if (++(symbol->Lcount) <= 0) {
-			diagfmt("Counter overflow. Symbol 0%o",symbol-leaves);
-			error((char *) null);
-		}
-	} while ((nodePtr)symbol!=eofPtr);
-	inLine=0;
-	sfn=forest;
-	for (symbol=leaves; (nodePtr)symbol!=solPtr; symbol++) {
-		if (symbol->Lcount!=0) {
-			sfn->Tcount=symbol->Lcount;
-			sfn->root = (nodePtr) symbol;
-			sfn++;
-		}
-	}
-	oflowCnt = solCnt = ((Eleaf *)solPtr)->Lcount;
-	wf = wfn = (tree *) symbol;
-	while (++symbol != leavesRoof) {
-		if (symbol->Lcount != 0) {
-			if (symbol->Lcount==1) {
-				if(++solCnt <= 0)
-					error("Solitary counter overflow.");
-			} else {
-				wfn->Tcount=symbol->Lcount;
-				wfn->root = (nodePtr)symbol;
-				wfn++;
-			}
-		}
-	}
-	if (solCnt!=0) {
-		wfn->Tcount=solCnt;
-		wfn->root = solPtr;
-		wfn++;
-	}
-	outf=1;
-	if (!print) {
-		if (fnEnd>=fname+fnSize)
-			error("+File name too long");
-		*fnEnd++ = suffix;
-		*fnEnd = 0;
-		croutf();
-	}
-	outch((char) (magicNum&byteMask));
-	outch((char) (magicNum>>byteLen & byteMask));
-	bits=wfn-wf;	/* nxtBit must be zero or else! */
-	if (bits > nwMask) {
-		diagfmt("Too many unique words: %d", bits);
-		error((char *) null);
-	}
-	gotBits(nwLen);
-	bits |= (sfn-forest-1) << nxtBit;
-	gotBits(specLen);
-	qsort((pointer) forest, (size_t) (sfn-forest), sizeof(tree), compar);
-	for (treep=forest; treep!=sfn; treep++) {
-		bits |= ((Eleaf *)treep->root - leaves) << nxtBit;
-		gotBits(specLen);
-	}
-	if (wfn!=wf) {
-		qsort((pointer) wf, (size_t) (wfn-wf), sizeof(tree), compar);
-		for (treep=wf; treep!=wfn; treep++) {
-			if (treep->root != solPtr) {
-				encWord((Eleaf *) treep->root);
-			} else {
-				/* output sol, a zero-length word */
-				/* bits |= 0<<nxtBit; */
-				gotBits(wcLen);
-			}
-		}
-	}
-	bytesSpecials=buildTree(forest,sfn,(Eleaf *)solPtr);
-	bytesWords=0L;
-	if (wfn!=wf)
-		bytesWords=buildTree(wf,wfn,leavesRoof);
-	tabLen = outLen + (outpos-outbuf);
-}

static long
buildTree(f,ofn,lr)
-	tree *f;	/* forest (and leaves) */
-	tree *ofn;	/* old forest next */
-	Eleaf *lr;	/* leaves roof */
-{
-	register intrn *nxtIntrn;
-	tree *of;	/* old forest */
-	tree *mf;	/* merge forest */
-	register tree *mfn;	/* merge forest next */
-	int treeCount;
-	int decision;
-	nodePtr topNode;
-	long bitsCode;

-	if (f[0].Tcount>ofn[-1].Tcount)
-		crash("Sort failed");
-	nxtIntrn=intrns;
-	bitsCode=0L;
-	mf=mfn=of=f;
-	for (treeCount=ofn-f; --treeCount; ) {
-		register tree *p;

-		decision = of==ofn || mf==mfn;
-		if (of==ofn || (mf!=mfn && (mf->Tcount)<(of->Tcount))) {
-			p=mf++;
-			decision += 2;
-		} else {
-			p=of++;
-		}
-		nxtIntrn->left = p->root;
-		mfn->Tcount = p->Tcount;
-		if (of==ofn || (mf!=mfn && (mf->Tcount)<(of->Tcount))) {
-			p=mf++;
-			decision += 4;
-		} else {
-			p=of++;
-		}
-		nxtIntrn->right = p->root;
-		mfn->Tcount += p->Tcount;
-		mfn->root = (nodePtr) nxtIntrn;
-		if (mfn->Tcount<p->Tcount && treeCount!=1) {
-			diagfmt("Counter overflow. %d trees left", treeCount);
-			error((char *) null);
-		}
-		bitsCode += mfn->Tcount;
-		if ((decision&1)==0) {
-			if (decision == 2) {
-				/* turn into decision=4 */
-				nxtIntrn->right = nxtIntrn->left;
-				nxtIntrn->left = p->root;
-				decision=4;
-			}
-			if (decision==4) {
-				/* bits |= 0<<nxtBit; */
-				gotBits(1);
-			} else /* decision==0 || decision==6 */ {
-				bits |= (decision&2|1) << nxtBit;
-				gotBits(2);
-			}
-		}
-		nxtIntrn++;
-		mfn++;
-	}
-	topNode = (of!=ofn? of:mf)->root;
-	{
-		register Eleaf *p;

-		for (p = (Eleaf *) f; p!=lr; p++)
-			p->code = -1L;
-}
-	assignCodes(topNode,0L,(unsigned)0);
-	return bitsCode/byteLen;
-}

static int
compar(p,q)
-	pointer p;
-	pointer q;
-{
-	if (((tree *)p)->Tcount < ((tree *)q)->Tcount)
-		return -1;
-	else if (((tree *)p)->Tcount > ((tree *)q)->Tcount)
-		return 1;
-	else
-		return 0;
-}

static void
encWord(w)
-	Eleaf *w;
-{
-	register char *p;

-	if (w == (Eleaf *)solPtr)
-		p = nxtChar;	/* output ephemeral word */
-	else
-		p=words[(w-leaves)-(solitary+1)];
-	do {
-		bits |= aToWc[*p&asciiMask] << nxtBit;
-		gotBits(wcLen);
-	} while ((*p++ & aFlag) == 0);
-	/* bits |= 0<<nxtBit; */
-	gotBits(wcLen);
-}

static void
assignCodes(nodeArg,c,cl)
-	nodePtr nodeArg;
-	long c;
-	register unsigned cl;
-{
-	register nodePtr node;

-	node=nodeArg;
-#ifdef	DEBUG
-	diagfmt("Assign %d %O %d\n",node,c,cl);
-	diagpr();
-#endif	/*DEBUG*/
-	if (node < (nodePtr) intrns) {
-		if (cl>maxCodeLen) {
-			maxCodeLen=cl;
-			if (cl>longLen-1-lenLen)
-				error("A code would be too long");
-		}
-		((Eleaf *)node)->code = c<<lenLen | cl;
-	} else {
-		assignCodes(((intrn *) node)->left, c, cl+1);
-		assignCodes(((intrn *) node)->right, c | 1L<<cl, cl+1);
-	}
-}

static void
initLex()
-{
-	register char **p;
-	register int i;

-	nxtChar=chars;
-	p=words;
-	i=maxWords;
-	do {
-		*p++ = null;
-	} while (--i);
-	if (lseek(inf,0L,0)<0)
-		error("Seek failed");
-	wordsLeft=wordLimit;
-	inpos=inend=inbuf;
-	inLen=0;
-	inLine=1;
-	curTabs=0;
-	needByte();
-}

static Eleaf *
lex()
-{
-	register char c;
-	register char *p;
-	register unsigned i;
-	register char *inp, *endWord;

-	/* assume needByte() already done */
-	c = *inpos++;
-	if ((c&~asciiMask)!=0) {
-		diagfmt("+Non-ASCII character: \\%o",c&byteMask);
-		error((char *) null);
-	}
-	if (aToWc[c]!=0) {
-		register unsigned step;

-		i=c;
-		p=nxtChar;
-		for (;;) {
-			if (p>=charCeil) {
-#ifdef	USEMALLOC
-				error("out of fixed memory allocation");
-#else	/*!USEMALLOC*/
-				memRoof += charChunk;
-				if (brk((pointer) memRoof) == -1)
-					error("Out of memory");
-				charCeil=charRoof-1;
-#endif	/*!USEMALLOC*/
-			}
-			*p++ = c;
-			needByte();
-			c = *inpos++;
-			if ((c&~asciiMask)!=0 || aToWc[c]==0)
-				break;
-			i = i*hashMult+c;	/* ignore overflow */
-		}
-		inpos--;
-		p[-1] |= aFlag;
-		p[0] = 0;	/* force string compare termination */
-		endWord=p;
-		symLookups++;
-		step = (i&hashMask) + 1;
-		/* note: signed divide faster than unsigned on VAX */
-		i = ((int)(i & ~(-1<<((sizeof i)*byteLen - 1)))) % maxWords;
-		for (;;) {
-			hashProbes++;
-			if ((p=words[i]) == null) {
-				if (wordsLeft == 0) {
-					oflowChars += endWord-nxtChar;
-					return (Eleaf *)solPtr;
-				}
-				words[i]=nxtChar;
-				nxtChar=endWord;
-				wordsLeft--;
-				break;
-			}
-			if (*p == *nxtChar) {
-				inp=nxtChar;
-				do; while(*++p == *++inp);
-				/* note: *endWord=='\0' but *p cannot */
-				if (inp==endWord)
-					break;
-			}
-			i += step;
-			if (i >= maxWords)
-				i -= maxWords;
-		}
-		i += maxSpecs+1;
-	} else {
-		if (inpos>inend) {
-			/* inpos--; */
-			return leaves+eof;	/* avoid lookahead */
-		} else {
-			needByte();
-			switch (c) {
-			case ';':
-				if (*inpos!='\n')
-			break;
-				inpos++;
-				/* FALLTHROUGH */
-			case '\n':
-				inLine++;
-				for (i=0; ; i++) {
-					needByte();
-					if (*inpos!='\t' || i>=maxTabs)
-				break;
-					inpos++;
-				}
-				c = (c==';'? scNlTabs : nlTabs)+i-curTabs;
-				curTabs=i;
-				break;
-			case ',':
-				if (*inpos==' ') {
-					inpos++;
-					c=commaSp;
-				} else if (*inpos=='\n') {
-					inpos++;
-					inLine++;
-					c=commaNl;
-				}
-				break;
-			case '.':
-				if (*inpos=='\n') {
-					inpos++;
-					inLine++;
-					c=perNl;
-				}
-			}
-		}
-		needByte();
-		i=aToWc[*inpos&asciiMask]!=0? c+wfBias : c;
-	}
-	return leaves+i;
-}

-/*	Decoder dynamic memory structures:
- *
- *	intrn intrns[numSpecs+numWords];
- *		intrns.
- *	char chars[num_chars];
- *		chars, eofPtr, solPtr, nxtChar,
- *			charRoof (synonym of memRoof).
- */

-#define leapLen	8

struct leap {	/* how to leap to right part of decoding tree */
-	nodePtr leapTo;
-	int leapDepth;
-};

static struct leap specLeap[1<<leapLen];
static struct leap wordLeap[1<<leapLen];

static void getTree proto((intrn *of, intrn *ofn, struct leap *leapTable));
static void buildLeap proto((struct leap *table, int bit, nodePtr nodeArg, int depth));

static char *decWord proto((void));	/* declare forward */

static void
decode()
-{
-	register nodePtr node;
-	register int c;
-	register struct leap *leap;
-	unsigned numSpecs;
-	unsigned numWords;

-	inpos=inend=inbuf;
-	inLen=0;
-	curTabs=0;
-	needByte();
-	c = *inpos++ & byteMask;
-	needByte();
-	c |= (*inpos++ & byteMask) << byteLen;
-	needByte();
-	numWords = *inpos++ & byteMask;
-	needBits(nwLen-byteLen);
-	numWords |= bits<<byteLen & nwMask;
-	bits >>= nwLen-byteLen;
-	needBits(specLen);
-	numSpecs=(bits&specMask)+1;
-	bits >>= specLen;
-	if ((unsigned)c!=magicNum || numSpecs>maxSpecs)
-		error("Input is not Hufman encoded");
-	intrns = (intrn *) mem;
-	chars = (char *) &intrns[numSpecs+numWords];
-	eofPtr = (nodePtr) &chars[eof];
-	solPtr = (nodePtr) &chars[solitary];
-	nxtChar = (char *)solPtr + 1;
-	if (memRoof<nxtChar) {
-#ifdef	USEMALLOC
-		error("out of fixed memory allocation");
-#else	/*!USEMALLOC*/
-		memRoof=nxtChar+charChunk;
-		if (brk((pointer)memRoof) == -1)
-			error("Can't get needed memory");
-#endif	/*!USEMALLOC*/
-	}
-	outf=1;
-	if (!print) {
-		*--fnEnd=0;
-		if (fnEnd<=fname)
-			error("File name too short");
-		croutf();
-	}
-#ifdef	DEBUG
-	diagfmt("start spec list %d %d %o\n",inpos-inbuf,nxtBit,bits);
-	diagpr();
-#endif	/*DEBUG*/
-	node = (nodePtr) intrns;
-	for (c=numSpecs; c!=0; c--) {
-		needBits(specLen);
-		((intrn *) node)->left = (nodePtr) &chars[bits&specMask];
-		bits >>= specLen;
-		node = (nodePtr) (((intrn *) node) + 1);
-	}
-#ifdef	DEBUG
-	diagfmt("got spec list %d %d %o\n",inpos-inbuf,nxtBit,bits);
-	diagpr();
-#endif	/*DEBUG*/
-	for (c=numWords; c!=0; c--) {
-		((intrn *) node)->left = (nodePtr) nxtChar;
-		nxtChar=decWord();
-		if ((nodePtr)nxtChar == ((intrn *) node)->left)
-			((intrn *) node)->left = solPtr;
-		node = (nodePtr) (((intrn *) node) + 1);
-	}
-#ifdef	DEBUG
-	diagfmt("got word list %d %d %o\n",inpos-inbuf,nxtBit,bits);
-	diagpr();
-#endif	/*DEBUG*/
-	getTree(intrns,intrns+numSpecs,specLeap);
-#ifdef	DEBUG
-	diagfmt("got spec tree %d %d %o\n",inpos-inbuf,nxtBit,bits);
-	diagpr();
-#endif	/*DEBUG*/
-	if (numWords!=0)
-		getTree(intrns+numSpecs,(intrn *)node,wordLeap);
-#ifdef	DEBUG
-	diagfmt("got word tree %d %d %o\n",inpos-inbuf,nxtBit,bits);
-	diagpr();
-#endif	/*DEBUG*/
-	needBits(1);
-	if ((bits&1)==0)
-		leap=specLeap;
-	else
-		leap=wordLeap;
-	bits >>= 1;
-	for (;;) {
-#ifdef	DEBUG
-		diagfmt("Node = 0%o\n",node);
-		diagpr();
-#endif	/*DEBUG*/
-		c = bits;
-		if (nxtBit < leapLen) {
-			needByte();
-			c |= (*inpos++ & byteMask) << nxtBit;
-			nxtBit += byteLen;
-		}
-		leap += c & ((1<<leapLen)-1);
-		c >>= leap->leapDepth;
-		nxtBit -= leap->leapDepth;
-		node = leap->leapTo;
-		while (node<(nodePtr)chars) {
-			if (--nxtBit<0) {
-				needByte();
-				c = *inpos++ & byteMask;
-				nxtBit=byteLen-1;
-			}
-			if ((c&1)==0)
-				node=((intrn *) node)->left;
-			else
-				node=((intrn *) node)->right;
-			c >>= 1;
-		}
-		bits = c;
-		if (node >= solPtr) {
-			if (node == solPtr) {
-				/* decode the word, but don't keep it */
-				(void) decWord();
-				node = (nodePtr) nxtChar;
-			}
-			for (;;) {
-				outch(*(Dleaf *)node & asciiMask);
-				if ((*(Dleaf *)node & aFlag) != 0)
-					break;
-				node = (nodePtr) ((Dleaf *) node + 1);
-			}
-			leap = specLeap;
-		} else {
-			if (node == eofPtr)
-				break;
-			c=((Dleaf *)node - chars)&~wfBias;
-			if (aToWc[c]) {
-				if (c==perNl) {
-					outch('.');
-					outch('\n');
-				} else if (c==commaSp) {
-					outch(',');
-					outch(' ');
-				} else if (c==commaNl) {
-					outch(',');
-					outch('\n');
-				} else {
-					if (nlTabs-maxTabs<=c
-					&&  c<=nlTabs+maxTabs) {
-						c -= nlTabs;
-					} else {
-						c -= scNlTabs;
-						outch(';');
-					}
-					outch('\n');
-					curTabs = c += curTabs;
-					while (c--)
-						outch('\t');
-				}
-			} else {
-				outch(c);
-			}
-			if (node < (nodePtr) (chars+wfBias))
-				leap = specLeap;
-			else
-				leap = wordLeap;
-		}
-	}
-	if (inpos<inend)
-		error("Unexpected EOF code.");
-}

static void
getTree(of,ofn,leapTable)
-	register intrn *of;	/* used as a copy */
-	intrn *ofn;
-	struct leap *leapTable;
-{
-	intrn *mf;
-	register intrn *mfn;
-	register nodePtr p;
-	int treeCount;
-	int decision;

-	mf=mfn=of;
-	for (treeCount=ofn-of; --treeCount; ) {
-		if (mfn>of)
-			crash("Tree builder");
-		if (of!=ofn && mf!=mfn) {
-			/* both lists are non-empty, so read "decision" */
-			needBits(1);
-			if ((bits&1)==0) {
-				decision=4;
-			} else {
-				bits >>= 1;
-				needBits(1);
-				decision = (bits&1)==0? 0 : 6;
-			}
-			bits >>= 1;
-		}
-		if (of==ofn || (mf!=mfn && (decision&2)!=0))
-			p = (nodePtr) mf++;
-		else
-			p = (of++) -> left;
-		mfn->left = p;
-		if (of==ofn || (mf!=mfn && (decision&4)!=0))
-			p = (nodePtr) mf++;
-		else
-			p = (of++) -> left;
-		mfn->right = p;
-		mfn++;
-	}
-	buildLeap(leapTable, 1, of!=ofn? of->left : (nodePtr) mf, 0);
-}

static void
buildLeap(table, bit, nodeArg, depth)
-	register struct leap *table;
-	register int bit;
-	nodePtr nodeArg;
-	int depth;
-{
-	register nodePtr node;
-	register int i;

-	node = nodeArg;
-	while (bit!=(1<<leapLen) && node<(nodePtr)chars) {
-		depth++;
-		buildLeap(table, bit<<1, ((intrn *) node)->left, depth);
-		node = ((intrn *) node)->right;
-		table += bit;
-		bit <<= 1;
-	}
-	for (i=0; i!=(1<<leapLen); i+=bit) {
-		table->leapTo = node;
-		table->leapDepth = depth;
-		table += bit;
-	}
-}

static char *
decWord()
-{
-	register char *p;
-	register char c;

-	p=nxtChar;
-	for (;;) {
-		if (nxtBit < wcLen) {
-			needByte();
-			bits |= (*inpos++ & byteMask) << nxtBit;
-			nxtBit += byteLen;
-		}
-		c=bits&wcMask;
-		bits >>= wcLen;
-		nxtBit -= wcLen;
-		if (c==0) {
-			if (p!=nxtChar)
-				p[-1] |= aFlag;
-			return p;
-		}
-		if (p>=charRoof) {
-#ifdef	USEMALLOC
-			error("out of fixed memory allocation");
-#else	/*!USEMALLOC*/
-			memRoof += charChunk;
-			if (brk((pointer)memRoof) == -1)
-				error("Out of memory for characters");
-#endif	/*!USEMALLOC*/
-		}
-		*p++ = wcToA[c];
-	}
-}
!
echo shar unpacked fully
exit