[comp.sys.apple] zork decoding

saponara@batcomputer.tn.cornell.edu (John Saponara) (11/03/87)

A few weeks ago someone asked how to decode Infocom's "Zork" data.  Well, I
just got the answer in the mail.  "Computist" magazine has an enhancement
for an Infocom text reader in their latest issue (#47).  The original text
reader is in their issue #34.  This magazine is mostly a "how to copy disks"
and "programs to look into adventure programs and change parameters" kind of
thing, if you're interested in that.  The back issue #34 is $4.75 from
Computist, PO Box 110846-T, Tacoma, WA  98411 (I don't have it, otherwise I'd
pass on the basics).

Eric Haines

dmw3@ur-tut.UUCP (David M Walsh Jr.) (11/04/87)

   Well I heard a few years back that Infocom used a 3/2 encoding scheme.
They put 3 characters into 2 bytes by only using 5 bits per character.
This leaves 1 extra bit for God knows what.
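
The arithmetic can be sketched in a few lines of Python.  This is an
illustration only: the function name and the choice to put the spare
bit at the top of the word are assumptions made for the example.

```python
def pack_three(c1, c2, c3, spare=0):
    """Pack three 5-bit values and one spare bit into a 16-bit word.

    Assumed (hypothetical) layout: spare bit on top, then c1, c2, c3.
    """
    for c in (c1, c2, c3):
        if not 0 <= c < 32:
            raise ValueError("each value must fit in 5 bits (0-31)")
    return ((spare & 1) << 15) | (c1 << 10) | (c2 << 5) | c3


word = pack_three(1, 2, 3)
print(f"{word:016b}")                  # 0000010001000011: 15 bits used, 1 spare
print(word.to_bytes(2, "big").hex())   # 0443: three values in two bytes
```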

  Unfortunately I have no idea what they included in their table of 32
characters (2^5=32), so I can't help in getting correct decoding, but
this should help any hackers out there to finish the job.  I would
assume that they do use the "left-over" bit because they do use more than
32 characters (I think.)  The alphabet is 26, the space key makes 27,
adding punctuation can take it over 32...

  Oh well, hope this helps...

  Dave Walsh - an old Apple ][+ hacker

kamath@reed.UUCP (Sean Kamath) (11/05/87)

Actually, it would be rather interesting to find out how to do it.
If I recall correctly, Infocom uses a rather nice packing technique to
get all that text on the disks.  If anyone knows the basic concept of
this, I'd like to hear about it.  It would be nice to shrink strict text
files with it. . .

Sean Kamath

-- 
UUCP:  {decvax allegra ucbcad ucbvax hplabs ihnp4}!tektronix!reed!kamath
CSNET: reed!kamath@Tektronix.CSNET  ||  BITNET:  reed!kamath@Berkeley.BITNET
ARPA:  tektronix!reed!kamath@Berkeley <or> reed!kamath@hplabs
US Snail: 3934 SE Boise, Portland, OR  97202 (I hate 4 line .sigs!)

jra1@ur-tut.UUCP (The Super Abuser) (11/06/87)

I get the impression that the basic coding technique is that
all the words that are used in Zork's descriptions are stored in some mildly
compressed format on the disk (probably "3-2" packing of characters), and the
text descriptions are arrays of pointers to the words on disk.

Of course, I've never tested this theory but it seems reasonable.


Hope this helps.

--- Jem Radlow, former rogueaholic
"God. Life without rogue is like... life without rogue."

dmw3@ur-tut.UUCP (David M Walsh Jr.) (11/06/87)

In article <434@ur-tut.UUCP> jra1@tut.cc.rochester.edu.UUCP (The Super Abuser) writes:
>
>I get the impression that the basic coding technique is that
>all the words that are used in Zork's descriptions are stored in some mildly
>compressed format on the disk (probably "3-2" packing of characters), and the
>text descriptions are arrays of pointers to the words on disk.
>
>Of course, I've never tested this theory but it seems reasonable.
>

Yes, I think that I've said this before (maybe it was in a direct reply though)
that I am almost positive that they use the 3-2 packing method: 3 chars per
2 bytes.  The only problem here is that we don't know exactly how they mapped
the alphabet.  The 3-2 packing method easily allows 32 characters in your
alphabet, but you have an extra bit to diddle with, and I don't know if they
played with it.

If I have offended anyone or made a mistake I am sorry, but I'm pretty sure
that this info is correct as far as it goes...

If anyone figures out how it truly works (like if they use that extra bit or 
not...) I'd love to hear from you.

Have a nice day

  David Walsh - an old Apple ][ hacker, slowly becoming a Mac hacker

rassilon@mit-eddie.UUCP (11/07/87)

In article <434@ur-tut.UUCP> jra1@tut.cc.rochester.edu.UUCP (The Super Abuser) writes:
> I get the impression that the basic coding technique is that all the words
> that are used in Zork's descriptions are stored in some mildly compressed
> format on the disk (probably "3-2" packing of characters), and the text
> descriptions are arrays of pointers to the words on disk. 

I don't know how Infocom stores anything but I can help.  There is a program
called ZorkTools which will allow you to play your game as normal but, at any
point, you may do such things as looking at all the text descriptions, get a
list of valid verbs and nouns, make a copy suitable for installing on a hard
disk, etc.  The program runs on IBM PC's and compatibles and I will, if there
is enough interest, post it here.

					Shar and Enjoy!

DISCLAIMER: I have never tried the aforementioned program as I don't have
any Infocom games for PC's.

-- 
Rassilon (Brian Preble)
UUCP: ...{ihnp4 | decvax!genrad}!mit-eddie!rassilon
Internet: rassilon@eddie.mit.edu

dyon@batcomputer.tn.cornell.edu (Dyon Anniballi) (11/07/87)

  Back in the old days, a friend and I made some headway in disassembling
the Zork interpreter.  If I recall correctly, they used 5 bits per char.
for encoding text.  Only 32 combinations you say?  Well, the first 26
were the lower case alphabet, and the other 6 were for switching character
sets.  That is, one would switch to upper case, another to punctuation type
stuff, etc.  I don't remember the specifics though, sorry.

-- 
--Dyon Anniballi

dyon@batcomputer.tn.cornell.edu     |  dyon%batcomputer@crnlcs.bitnet
rochester!cornell!batcomputer!dyon  |  "No time for romantic escape...."

john@moncol.UUCP (John Ruschmeyer) (11/08/87)

In article <7364@eddie.MIT.EDU> rassilon@eddie.MIT.EDU (Brian Preble) writes:
>In article <434@ur-tut.UUCP> jra1@tut.cc.rochester.edu.UUCP (The Super Abuser) writes:
>> I get the impression that the basic coding technique is that all the words
>> that are used in Zork's descriptions are stored in some mildly compressed
>> format on the disk (probably "3-2" packing of characters), and the text
>> descriptions are arrays of pointers to the words on disk. 
>
>I don't know how Infocom stores anything but I can help.  There is a program
>called ZorkTools which will allow you to play your game as normal but, at any
>point, you may do such things as looking at all the text descriptions, get a
>list of valid verbs and nouns, make a copy suitable for installing on a hard
>disk, etc.  The program runs on IBM PC's and compatibles and I will, if there
>is enough interest, post it here.

One note: the only version of ZorkTools that I have seen works only for the
older copy-protected games.  Recently, Infocom has begun shipping their
games unprotected, with an install program to copy the correct files, set
up directories, etc.  (They also now do all screen I/O through ANSI.SYS.)

I have not heard of a version of ZorkTools which will work with these
games.

-- 
Name:		John Ruschmeyer
US Mail:	Monmouth College, W. Long Branch, NJ 07764
Phone:		(201) 571-3557
UUCP:		...!vax135!petsd!moncol!john	...!princeton!moncol!john
						   ...!pesnta!moncol!john

	Give computers an artificial intelligence and the
	next thing you know they want to use the same bathroom.
					- Allan Dean Foster

keith@apple.UUCP (Keith Rollin) (11/08/87)

In article <2846@batcomputer.tn.cornell.edu> dyon@tcgould.tn.cornell.edu (Dyon Anniballi) writes:
>
>  Back in the old days, a friend and I made some headway in disassembling
>the Zork interpreter.  If I recall correctly, they used 5 bits per char.
>for encoding text.  Only 32 combinations you say?  Well, the first 26
>were the lower case alphabet, and the other 6 were for switching character
>sets.  That is, one would switch to upper case, another to punctuation type
>stuff, etc.  I don't remember the specifics though, sorry.
>

It doesn't seem like any of us can! I, too, spent some of the vigor of my
youth trying to comment the assembly code of one of InfoCom's interpreters.
One feature that (I think) I found concerned this switching that Dyon just 
mentioned. Some of the more common words (about 180 in Hitchhiker's) are
stored separate from the rest of the encoded text. These special words
(common objects like 'towel', 'Arthur', etc. and words like 'the', 'into',
and so on) are then referenced by one of these 5 bit symbols. That way,
you can squeeze a whole word into just 5 bits.

To make matters worse, InfoCom has developed many different interpreters
(I assume they have at least 7, as I have seen versions A, B, and G). With
any bit of bad luck, they will have changed the encoding algorithm somewhere
along the way...

-- 

Keith Rollin                                               amdahl\
Sales Technical Support                               pyramid!sun !apple!keith
Apple Computer                                             decwrl/

Disclaimer: I read this board for fun, not profit. Anything I say is from the
            result of reading magazines, hacking, and soaking my head in acid.

denbeste@bgsuvax.UUCP (11/09/87)

in article <400@ur-tut.UUCP>, dmw3@ur-tut.UUCP (David M Walsh Jr.) says:
>    Well I heard a few years back that Infocom used a 3/2 encoding scheme.
> They put 3 characters into 2 bytes by only using 5 bits per character.
> This leaves 1 extra bit for God knows what.

That extra bit is used to tell if the 3/2 encoding scheme is used, or if the
next 7 bits contain a straight ascii character.

> assume that they do use the "left-over" bit because they do use more than
> 32 characters (I think.)  The alphabet is 26, the space key makes 27,
> adding punctuation can take it over 32...

The lowercase alphabet and space are in there.  I don't know about the
other 5 characters.  All of the stories that I know of use both upper and 
lower case text.  For machines, such as the //+, that don't have lower case,
the text is converted on the fly.  Word wrap is also handled on the fly.

I have a friend that spent some time looking into this, but the accuracy
of the above is subject to the reliability of my memory.

---
          William C. DenBesten | CSNET denbeste@research1.bgsu.edu

reeves@amd.AMD.COM (James Reeves) (11/10/87)

In article <2846@batcomputer.tn.cornell.edu> dyon@tcgould.tn.cornell.edu (Dyon Anniballi) writes:
>
>  Back in the old days, a friend and I made some headway in disassembling
>the Zork interpreter.  If I recall correctly, they used 5 bits per char.
>for encoding text.  Only 32 combinations you say?  Well, the first 26
>were the lower case alphabet, and the other 6 were for switching character
>sets.  That is, one would switch to upper case, another to punctuation type
>stuff, etc.  I don't remember the specifics though, sorry.
>
>-- 
>--Dyon Anniballi

I'm sorry to get involved in this, but this last statement was truly
ridiculous.  With 2^5 you have 32 combinations.  That's it.  Done.  No more.

I have spent hours trying to decode the Infocom files and have found some very
interesting things.  First of all, I am almost positive that they DO NOT USE
a 5-bit encoding scheme.  (Actually, I should say that they do not solely
use a 5-bit encoding scheme.)  I have played around with some of the bits in
the .DAT file and watched how the text changed.  It definitely didn't react
in a way that would indicate a 5-bit encoding scheme.  I think that there
is a tokenizing scheme as well as a compression method.  When I attempted
to debug the code I found some nasty GOTCHAs.  The loop for creating a text
sentence apparently pushes a whole lot of SH*T on the stack and then
goes through some calculations as it pops it off.  I also seem to remember
that there is some initial code that scans memory for INT 3 instructions,
which makes it very difficult to place breakpoints in the code.

Anyway, the person who wrote ZorkTools should be able to help.  Where is
he/she when you need them?

James

rbl@nitrex.UUCP ( Dr. Robin Lake ) (11/11/87)

In article <1363@bgsuvax.UUCP> denbeste@bgsuvax.UUCP (William C. DenBesten) writes:
>in article <400@ur-tut.UUCP>, dmw3@ur-tut.UUCP (David M Walsh Jr.) says:
>>    Well I heard a few years back that Infocom used a 3/2 encoding scheme.
>> They put 3 characters into 2 bytes by only using 5 bits per character.
>> This leaves 1 extra bit for God knows what.
>
Has anyone looked into DEC's "Rad50" coding as used on the PDP-11?
It was a 3/2 scheme.
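
For comparison, a rough sketch of RAD50-style packing.  RAD50 treats
three characters as digits of a base-40 number, so it fits 40 symbols
per position in 16 bits rather than 32 symbols plus a spare bit.  The
character table below follows the usual PDP-11 description, but treat
it as an assumption (the slot shown as "%" is listed as unused in some
references); the function names are made up for the example.

```python
# 40 symbols: space, A-Z, $, ., %, 0-9 (assumed PDP-11 ordering)
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"


def rad50_pack(text):
    """Pack exactly three characters into one 16-bit value (base 40)."""
    if len(text) != 3:
        raise ValueError("need exactly three characters")
    value = 0
    for ch in text:
        value = value * 40 + RAD50_CHARS.index(ch)
    return value


def rad50_unpack(value):
    """Recover the three characters from a packed 16-bit value."""
    chars = []
    for _ in range(3):
        value, idx = divmod(value, 40)
        chars.append(RAD50_CHARS[idx])
    return "".join(reversed(chars))


print(rad50_pack("ABC"))        # 1683
print(rad50_unpack(1683))       # ABC
print(rad50_pack("999") < 2**16)  # True: 40**3 = 64000 fits in 16 bits
```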

-- 
Rob Lake
{decvax,ihnp4!cbosgd}!mandrill!nitrex!rbl

Robert_Bob_Freed@cup.portal.com (11/13/87)

In article <2804@batcomputer.tn.cornell.edu>,
saponara@batcomputer.tn.cornell.edu (John Saponara) writes:

> A few weeks ago someone asked how to decode Infocom's "Zork" data...

Attention Zork addicts and dedicated Infocommies:  This posting has
touched off a barrage of speculative replies and vague recollections,
but no definitive answers as to how Infocom encodes textual data in its
game files.  Considering the amount of apparent interest in this topic,
here are the EXACT details, both general and specific.  I hope this
will lay to rest this subject (if such is ever possible on the net).

General:

Infocom text is not encrypted.  Rather, it is coded using a packing
scheme that results in a high degree of data compression.  (Although,
considering the difficulty people have found in decoding text, the
scheme functions as a pretty fair encryption mechanism also.)

Text is encoded using a 5-bit, 3-level code.  Although the 5-bit code
permits only 32 characters, 26 of these are multiplexed three ways by
means of two "shift" codes, resulting in a 78-character set.  This
includes the 52 letters of the upper/lower-case alphabet, 10 digits,
and 15 common punctuation characters, including New Line.  Also, one
code is used as an escape prefix for a 3-code sequence to specify any
arbitrary 7-bit ASCII character not included in the set.

Four codes are common to all three levels.  One of these is a space
character.  The remaining three codes are prefixes for a 2-code token,
which references one of 96 common text substrings.  These substrings
are packed using the same encoding scheme and are referenced by means
of a pointer table, the address of which is at a fixed location in the
game file.  The particular set of 96 token substrings is game-dependent
and is chosen automatically by the Infocom game language compiler for
maximum overall compression, via analysis of all text strings in a
game.  Output of a token substring involves a single level of recursion
by the text string output processor.

Text strings are packed into 16-bit words, with three 5-bit codes per
word.  The extra bit is used as an end-of-string flag.  This same
packing scheme is used for ALL text, including both game output and the
vocabulary table used to match user input.

The same scheme is used by ALL Infocom games, including the newer
Interactive Fiction "Plus" games, such as Bureaucracy and Beyond Zork.
I have personally disassembled many versions of the Infocom interpreter
program for several different computer systems, and I can thus verify
the accuracy of this description.

Note for would-be game hackers:  The interpreter programs (e.g. the
.COM files provided for CP/M and MS-DOS systems) are game-independent.
All game-specific text is contained within the game data (e.g. .DAT)
files, and is typically processed "on-the-fly" by the interpreter
program.  Thus, you cannot "cheat" by disassembling the interpreter
itself, and examining memory while the interpreter is running will
yield at most the last (buffered) line of output text in ASCII code.

Specific:

Infocom game files consist of an ordered, addressable sequence of 8-bit
bytes (maximum 128K bytes in the "classic" games, 256K bytes in the
newer "plus" games).  Text strings are packed into 16-bit words, with
the most significant byte first (lower address) in the game file.  Text
strings may (and do) start at arbitrary (odd or even) byte addresses.

The msb of each word is 0 in each intermediate word of a string, 1 in
the final word.  This is followed by three 5-bit codes, which are
processed in left-to-right order.  I.e., numbering bits 15-to-0 from
msb-to-lsb:

        Bit   15   = end-of-string flag
        Bits 14-10 = code 1
        Bits  9-5  = code 2
        Bits  4-0  = code 3
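
The word layout above can be sketched in Python as follows.  The
function names are hypothetical, and the two-word sample is made up
for the example rather than taken from any game file.

```python
def unpack_word(word):
    """Split a 16-bit word into (end_flag, code1, code2, code3)."""
    end_flag = (word >> 15) & 1
    return end_flag, (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F


def read_codes(data, addr):
    """Collect 5-bit codes from the words starting at byte `addr`,
    stopping after the word whose end-of-string flag is set."""
    codes = []
    while True:
        word = (data[addr] << 8) | data[addr + 1]   # high byte first
        end_flag, c1, c2, c3 = unpack_word(word)
        codes.extend((c1, c2, c3))
        addr += 2
        if end_flag:
            return codes


# Made-up sample: two words, the second with bit 15 set.
sample = bytes([0x35, 0x51, 0xC6, 0x85])
print(read_codes(sample, 0))    # [13, 10, 17, 17, 20, 5]
```

Note that nothing here requires `addr` to be even, matching the remark
that strings may start at arbitrary byte addresses.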

Codes are interpreted at one of three levels, which we denote 0, 1, 2:

        Level 0 = lower-case alphabet
        Level 1 = upper-case alphabet
        Level 2 = digits and punctuation

Level 0 is the default (initial) level at the start of a string, and at
the start of a tokenized substring (processed recursively from the main
string).

Codes 0-3 are common to all levels, and do not affect the current level:

        Code 0 = space character
        Code 1 = prefix for token substrings  0-31 (next code)
        Code 2 = prefix for token substrings 32-63 (next code + 32)
        Code 3 = prefix for token substrings 64-95 (next code + 64)

The token substrings are normally stored starting at byte address 40
hex in the game file, but these are properly accessed via a 96-word
pointer table whose address is contained in the word at bytes 18-19 hex.
All token substrings begin on even byte addresses, and the pointer
table contains the substring word addresses (i.e. byte addresses
divided by two).  All 16-bit addresses and pointers are stored
high-byte-first.
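
A minimal sketch of the pointer-table lookup, under one assumption the
description above does not settle: the word at bytes 18-19 hex is
treated here as a byte address of the table.  The names are
hypothetical and the "file" is synthetic, built only for the demo.

```python
TABLE_PTR_OFFSET = 0x18   # word at bytes 18-19 hex locates the table
TOKEN_COUNT = 96


def read_word(data, addr):
    """Read a 16-bit value stored high-byte-first."""
    return (data[addr] << 8) | data[addr + 1]


def token_byte_addresses(data):
    """Return the byte address of each of the 96 token substrings."""
    table = read_word(data, TABLE_PTR_OFFSET)   # assumed: a byte address
    # Entries are word addresses, so double them to get byte addresses.
    return [2 * read_word(data, table + 2 * i) for i in range(TOKEN_COUNT)]


# Synthetic 0x200-byte image: table at 0x100, tokens from 0x40 upward.
demo = bytearray(0x200)
demo[0x18:0x1A] = (0x0100).to_bytes(2, "big")
for i in range(TOKEN_COUNT):
    entry = (0x40 + 2 * i) // 2                 # pretend one word per token
    demo[0x100 + 2 * i:0x102 + 2 * i] = entry.to_bytes(2, "big")

addrs = token_byte_addresses(demo)
print(hex(addrs[0]), hex(addrs[95]))            # 0x40 0xfe
```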

Codes 4-5 are used to shift levels:

        Code 4, level 0 = shift to level 1 for next code only
        Code 5, level 0 = shift to level 2 for next code only

        Code 4, level 1 = permanent shift to level 1
        Code 5, level 1 = permanent shift back to level 0

        Code 4, level 2 = permanent shift back to level 0
        Code 5, level 2 = permanent shift to level 2

Note that, from level 0, the shift codes affect the NEXT code only.
(Code 4 normally precedes the first, capitalized word of a sentence.)
Two identical shift codes in a row effect a shift-lock to a new level,
and the alternate shift code is then used to restore level 0.

Code 5 is also used to end-pad the last word of a text string which
does not contain a multiple of three codes.
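
The shift rules can be traced with a small state machine.  This is a
sketch under two assumptions: a one-shot shift expires after the next
code of any kind, and the code following a token prefix (1-3) is a raw
index that is never treated as a shift.  The names are hypothetical.

```python
def trace_levels(codes):
    """Yield (level, code) for each code 6-31, following the shift rules.

    `locked` is the level in force once any one-shot shift has expired.
    """
    locked = 0
    level = 0
    skip_next = False            # True when the next code is a token index
    for code in codes:
        if skip_next:
            skip_next = False
            level = locked
        elif code in (4, 5):
            if level == 0:
                level = 1 if code == 4 else 2     # shift for next code only
            elif (level == 1) == (code == 4):
                locked = level                    # same shift again: lock
            else:
                locked = level = 0                # alternate shift: level 0
        else:
            if code in (1, 2, 3):
                skip_next = True
            elif code >= 6:
                yield level, code
            level = locked                        # one-shot shift expires


# 4 = one-shot upper case; 5,5 = lock level 2; 4 = back to level 0
print(list(trace_levels([4, 13, 10, 5, 5, 8, 9, 4, 6])))
# [(1, 13), (0, 10), (2, 8), (2, 9), (0, 6)]
```

Trailing pad codes (5) only change the level and never produce output,
which is why code 5 is safe as end padding.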

Finally, codes 6-31 generate printable characters:

        Code     Level 0     Level 1     Level 2
        ----     -------     -------     -------
          6         a           A      (see below)
          7         b           B        New Line
          8         c           C           0
          9         d           D           1
         10         e           E           2
         11         f           F           3
         12         g           G           4
         13         h           H           5
         14         i           I           6
         15         j           J           7
         16         k           K           8
         17         l           L           9
         18         m           M           .
         19         n           N           ,
         20         o           O           !
         21         p           P           ?
         22         q           Q           _
         23         r           R           #
         24         s           S           '
         25         t           T           "
         26         u           U           /
         27         v           V           \
         28         w           W           -
         29         x           X           :
         30         y           Y           (
         31         z           Z           )
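
The table maps directly onto three 26-character strings indexed by
(code - 6).  A sketch, with hypothetical names; the NUL character is
used here only as a placeholder for the escape slot at level 2.

```python
LEVELS = (
    "abcdefghijklmnopqrstuvwxyz",             # level 0
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",             # level 1
    "\x00\n0123456789.,!?_#'\"/\\-:()",       # level 2 (\x00 = escape slot)
)


def lookup(level, code):
    """Return the character for a code 6-31 at the given level.

    Returns None for code 6 at level 2, which is the escape prefix.
    """
    if not 6 <= code <= 31:
        raise ValueError("only codes 6-31 map to characters")
    ch = LEVELS[level][code - 6]
    return None if ch == "\x00" else ch


print(lookup(0, 13), lookup(1, 31), lookup(2, 18), repr(lookup(2, 7)))
# h Z . '\n'
```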

Code 6 at level 2 is a special escape prefix, which may be used to
generate an arbitrary ASCII code via a 3-code sequence.  This is
specified by the next two following codes, which contain the high two
bits and low five bits, respectively, of the desired 7-bit ASCII code.
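
Putting the pieces together, a compact decoder might look like the
sketch below.  It is not Infocom's code: all names are hypothetical,
the header word at 18 hex is assumed to hold a byte address, a one-shot
shift is assumed to expire after the next code of any kind, and the
test image is synthetic.

```python
A0 = "abcdefghijklmnopqrstuvwxyz"
A2 = "\x00\n0123456789.,!?_#'\"/\\-:()"     # slot 0 = escape prefix
ALPHABETS = (A0, A0.upper(), A2)


def word_at(data, addr):
    return (data[addr] << 8) | data[addr + 1]        # high byte first


def codes_at(data, addr):
    """Yield 5-bit codes up to and including the end-flagged word."""
    while True:
        w = word_at(data, addr)
        yield from ((w >> 10) & 31, (w >> 5) & 31, w & 31)
        if w & 0x8000:
            return
        addr += 2


def decode(data, addr, allow_tokens=True):
    out = []
    locked = level = 0
    it = codes_at(data, addr)
    for code in it:
        if code in (4, 5):                           # shift codes
            if level == 0:
                level = 1 if code == 4 else 2
            elif (level == 1) == (code == 4):
                locked = level
            else:
                locked = level = 0
            continue
        if code == 0:
            out.append(" ")
        elif code <= 3:                              # token prefix
            index = 32 * (code - 1) + next(it, 0)    # 0 if string is truncated
            if allow_tokens:                         # single level of recursion
                table = word_at(data, 0x18)
                start = 2 * word_at(data, table + 2 * index)
                out.append(decode(data, start, allow_tokens=False))
        elif level == 2 and code == 6:               # ASCII escape
            hi, lo = next(it, 0), next(it, 0)
            out.append(chr(((hi & 3) << 5) | lo))
        else:
            out.append(ALPHABETS[level][code - 6])
        level = locked                               # one-shot shift expires
    return "".join(out)


def pack(codes):
    """Test helper: pack codes into words, padding the last with code 5."""
    codes = list(codes) + [5] * (-len(codes) % 3)
    out = bytearray()
    for i in range(0, len(codes), 3):
        w = (codes[i] << 10) | (codes[i + 1] << 5) | codes[i + 2]
        if i + 3 == len(codes):
            w |= 0x8000                              # end-of-string flag
        out += w.to_bytes(2, "big")
    return bytes(out)


image = bytearray(0x200)                             # synthetic image
image[0x18:0x1A] = (0x0100).to_bytes(2, "big")       # token table at 0x100
image[0x40:0x42] = pack([25, 13, 10])                # token 0 = "the"
image[0x100:0x102] = (0x40 // 2).to_bytes(2, "big")  # entry 0 -> byte 0x40
image[0x1C0:0x1C8] = pack([4, 13, 14, 0, 1, 0, 5, 20, 5, 6, 2, 0])
print(decode(image, 0x1C0))                          # Hi the!@
```

In the sample, the final three codes (6, 2, 0) at level 2 form the
escape sequence: (2 << 5) | 0 = 64, the ASCII code for "@".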

Note that SOME text output is generated by single ASCII codes, which
are not packed using the scheme described here.  However, the majority
of all textual data, including the input vocabulary, is packed.  (I'll
reserve a description of how to locate the input vocabulary table for a
later posting, if there is sufficient interest in same.)
 
In conclusion:

I hope all this satisfies some curiosity, particularly from the
standpoint of understanding an interesting and effective text
compression technique.  I personally fail to see how anyone could enjoy
an Infocom story by using this information to (partially) decode the
game data file.  But to each his own.  Happy adventuring!

-- Bob Freed

Internet:  Robert_Freed@cup.portal.com
    Uucp:  sun!portal!cup.portal.com!Robert_Freed