saponara@batcomputer.tn.cornell.edu (John Saponara) (11/03/87)
References: A few weeks ago someone asked how to decode Infocom's "Zork" data. Well, I just got the answer in the mail. "Computist" magazine has an enhancement for an Infocom text reader in their latest issue (#47). The original text reader is in their issue #34. This magazine is mostly a "how to copy disks" and "programs to look into adventure programs and change parameters" kind of thing, if you're interested in that. The back issue #34 is $4.75 from Computist, PO Box 110846-T, Tacoma, WA 98411 (I don't have it, otherwise I'd pass on the basics). Eric Haines
dmw3@ur-tut.UUCP (David M Walsh Jr.) (11/04/87)
Well I heard a few years back that Infocom used a 3/2 encoding scheme. They put 3 characters into 2 bytes by only using 5 bits per character. This leaves 1 extra bit for God knows what. Unfortunately I have no idea what they included in their table of 32 characters (2^5=32), so I can't help in getting correct decoding, but this should help any hackers out there to finish the job. I would assume that they do use the "left-over" bit because they do use more than 32 characters (I think.) The alphabet is 26, the space key makes 27, adding punctuation can take it over 32... Oh well, hope this helps... Dave Walsh - an old Apple ][+ hacker
kamath@reed.UUCP (Sean Kamath) (11/05/87)
Actually, it would be rathering interesting to find out how to do it. If I recall correctly, Infocom uses a rather nice packing technique to get all that text on the disks. If anyone know the basic concept of this, I'd like to hear about it. It would be nice to shrink strict text files with it. . . Sean Kamath -- UUCP: {decvax allegra ucbcad ucbvax hplabs ihnp4}!tektronix!reed!kamath CSNET: reed!kamath@Tektronix.CSNET || BITNET: reed!kamath@Berkeley.BITNET ARPA: tektronix!reed!kamath@Berkeley <or> reed!kamath@hplabs US Snail: 3934 SE Boise, Portland, OR 97202 (I hate 4 line .sigs!)
jra1@ur-tut.UUCP (The Super Abuser) (11/06/87)
I get the impression that the basic coding technique is that all the words that are used in Zork's descriptions are stored in some mildly compressed format on the disk (probably "3-2" packing of characters), and the text descriptions are arrays of pointers to the words on disk. Of course, I've never tested this theory but it seems reasonable. Hope this helps. --- Jem Radlow, former rogueaholic "God. Life without rogue is like... life without rogue."
dmw3@ur-tut.UUCP (David M Walsh Jr.) (11/06/87)
In article <434@ur-tut.UUCP> jra1@tut.cc.rochester.edu.UUCP (The Super Abuser) writes: > >I get the impression that the basic coding technique is that >all the words that are used in Zork's descriptions are stored in some mildly >compressed format on the disk (probably "3-2" packing of characters), and the >text descriptions are arrays of pointers to the words on disk. > >Of course, I've never tested this theory but it seems reasonable. > Yes, I think that I've said this before (maybe it was in a direct reply though) that I am almost positive that they use the 3-2 packing method: 3 chars per 2 bytes. The only problem here is that we don't know exactly how they mapped the alphabet. The 3-2 packing method easily allows 32 characters in your alphabet, but you have an extra bit to diddle with, and I don't know if they played with it. If I have offended anyone or made a mistake I am sorry, but I'm pretty sure that this info is correct as far as it goes... If anyone figures out how it truly works (like if they use that extra bit or not...) I'd love to hear from you. Have a nice day David Walsh - an old Apple ][ hacker, slowly becoming a Mac hacker
rassilon@mit-eddie.UUCP (11/07/87)
In article <434@ur-tut.UUCP> jra1@tut.cc.rochester.edu.UUCP (The Super Abuser) writes: > I get the impression that the basic coding technique is that all the words > that are used in Zork's descriptions are stored in some mildly compressed > format on the disk (probably "3-2" packing of characters), and the text > descriptions are arrays of pointers to the words on disk. I don't know how Infocom stores anything but I can help. There is a program called ZorkTools which will allow you to play your game as normal but, at any point, you may do such things as looking at all the text descriptions, get a list of valid verbs and nouns, make a copy suitable for installing on a hard disk, etc. The program runs on IBM PC's and compatibles and I will, if there is enough interest, post it here. Shar and Enjoy! DISCLAIMER: I have never tried the aforementioned program as I don't have any Infocom games for PC's. -- Rassilon (Brian Preble) UUCP: ...{ihnp4 | decvax!genrad}!mit-eddie!rassilon Internet: rassilon@eddie.mit.edu
dyon@batcomputer.tn.cornell.edu (Dyon Anniballi) (11/07/87)
Back in the old days, a friend and I made some headway in disassembling the Zork interpreter. If I recall correctly, they used 5 bits per char. for encoding text. Only 32 combinations you say? Well, the first 26 were the lower case alphabet, and the other 6 were for switching character sets. That is, one would switch to upper case, another to punctuation type stuff, etc. I don't remember the specifics though, sorry. -- --Dyon Anniballi dyon@batcomputer.tn.cornell.edu | dyon%batcomputer@crnlcs.bitnet rochester!cornell!batcomputer!dyon | "No time for romantic escape...."
john@moncol.UUCP (John Ruschmeyer) (11/08/87)
In article <7364@eddie.MIT.EDU> rassilon@eddie.MIT.EDU (Brian Preble) writes: >In article <434@ur-tut.UUCP> jra1@tut.cc.rochester.edu.UUCP (The Super Abuser) writes: >> I get the impression that the basic coding technique is that all the words >> that are used in Zork's descriptions are stored in some mildly compressed >> format on the disk (probably "3-2" packing of characters), and the text >> descriptions are arrays of pointers to the words on disk. > >I don't know how Infocom stores anything but I can help. There is a program >called ZorkTools which will allow you to play your game as normal but, at any >point, you may do such things as looking at all the text descriptions, get a >list of valid verbs and nouns, make a copy suitable for installing on a hard >disk, etc. The program runs on IBM PC's and compatibles and I will, if there >is enough interest, post it here. One note, the only version of ZorkTools that I have seen only works for the older copy-protected games. Recently, Infocom has begun shipping thwir games unprotected, with in install program to copy the correct files, set up directories, etc. (They also now do all screen I/O through ANSI.SYS. I have not heard of a version of ZorkTools which will work with these games. -- Name: John Ruschmeyer US Mail: Monmouth College, W. Long Branch, NJ 07764 Phone: (201) 571-3557 UUCP: ...!vax135!petsd!moncol!john ...!princeton!moncol!john ...!pesnta!moncol!john Give computers an artificial intelligence and the next thing you know they want to use the same bathroom. - Allan Dean Foster
keith@apple.UUCP (Keith Rollin) (11/08/87)
In article <2846@batcomputer.tn.cornell.edu> dyon@tcgould.tn.cornell.edu (Dyon Anniballi) writes: > > Back in the old days, a friend and I made some headway in disassembling >the Zork interpreter. If I recall correctly, they used 5 bits per char. >for encoding text. Only 32 combinations you say? Well, the first 26 >were the lower case alphabet, and the other 6 were for switching character >sets. That is, one would switch to upper case, another to punctuation type >stuff, etc. I don't remember the specifics though, sorry. > It doesn't seem like any of us can! I, too, spent some of the vigor of my youth trying to comment the assembly code of one of InfoCom's interpreter. One feature that (I think) I found concerned this switching that Dyon just mentioned. Some of the more common words (about 180 in Hitchhiker's) are stored separate from the rest of the encoded text. These special words (common objects like 'towel', 'Arthur', etc. and words like 'the', 'into', and so on) are then referenced by one of these 5 bit symbols. That way, you can squeeze a whole word into just 5 bits. To make matters worse, InfoCom has developed many different interpreters (I assume tthey have at least 7, as I have seen versions A, B, and G). With any bit of bad luck, they will have changed the encoding algorithm somewhere along the way... -- Keith Rollin amdahl\ Sales Technical Support pyramid!sun !apple!keith Apple Computer decwrl/ Disclaimer: I read this board for fun, not profit. Anything I say is from the result of reading magazines, hacking, and soaking my head in acid.
denbeste@bgsuvax.UUCP (11/09/87)
in article <400@ur-tut.UUCP>, dmw3@ur-tut.UUCP (David M Walsh Jr.) says: > Xref: bgsuvax rec.games.misc:816 comp.sys.apple:2495 > > > Well I heard a few years back that Infocom used a 3/2 encoding scheme. > They put 3 characters into 2 bytes by only using 5 bits per character. > This leaves 1 extra bit for God knows what. That extra bit is used to tell if the 3/2 encoding scheme is used, or if the next 7 bits contain a straight ascii character. > assume that they do use the "left-over" bit because they do use more than > 32 characters (I think.) The alphabet is 26, the space key makes 27, > adding punctuation can take it over 32... The lowercase alphabet and space are in there. I don't know about the other 5 characters. All of the stories that I know of use both upper and lower case text. For machines, such as the //+, that don't have lower case, the text is converted on the fly. Word wrap is also handled on the fly. I have a friend that spent some time looking into this, but the accuracy of the above is subject to the reliability of my memory --- William C. DenBesten | CSNET denbeste@research1.bgsu.edu
reeves@amd.AMD.COM (James Reeves) (11/10/87)
In article <2846@batcomputer.tn.cornell.edu> dyon@tcgould.tn.cornell.edu (Dyon Anniballi) writes: > > Back in the old days, a friend and I made some headway in disassembling >the Zork interpreter. If I recall correctly, they used 5 bits per char. >for encoding text. Only 32 combinations you say? Well, the first 26 >were the lower case alphabet, and the other 6 were for switching character >sets. That is, one would switch to upper case, another to punctuation type >stuff, etc. I don't remember the specifics though, sorry. > >-- >--Dyon Anniballi I'm sorry to get involved in this but this last statement was truly ridiculous. with 2^5 you have 32 combinations. That's it. Done. No more. I have spent hours trying to decode the Infocom files and have found some very interesting thing. First of all I am almost positive that they DO NOT USE A 5 bit encodeing scheme. (Actually I should say that they do not solely use a 5 nit encoding scheme) I have played around weth some of the bits in the .DAT file and watched how the text changed. It definitely didn't react in a way that would indicate a 5 bit encoding scheme. It think that there is a tokenizing scheme as well as a compression method. When I attempted to debug the code I found some nasty GOTCHAs in the code. The loop for creating a text sentence apparently pushes a whole lot of SH*T on the stack and then goes through some calculations as it pops it off. I also seem to remeber that there is some initial code that scans the memory for INT 3 instructions that make it very difficult to place breakpoints in the code. Anyway, the person who wrote ZORK Tools should be able to help where is he/she when you need them. James
rbl@nitrex.UUCP ( Dr. Robin Lake ) (11/11/87)
In article <1363@bgsuvax.UUCP> denbeste@bgsuvax.UUCP (William C. DenBesten) writes: >in article <400@ur-tut.UUCP>, dmw3@ur-tut.UUCP (David M Walsh Jr.) says: >> Xref: bgsuvax rec.games.misc:816 comp.sys.apple:2495 >> >> >> Well I heard a few years back that Infocom used a 3/2 encoding scheme. >> They put 3 characters into 2 bytes by only using 5 bits per character. >> This leaves 1 extra bit for God knows what. > Has anyone looked into DEC's "Rad50" coding as used on the PDP-11? It was a 3/2 scheme. -- Rob Lake {decvax,ihnp4!cbosgd}!mandrill!nitrex!rbl
Robert_Bob_Freed@cup.portal.com (11/13/87)
In article <2804@batcomputer.tn.cornell.edu>, saponara@batcomputer.tn.cornell.edu (John Saponara) writes: > A few weeks ago someone asked how to decode Infocom's "Zork" data... Attention Zork addicts and dedicated Infocommies: This posting has touched off a barage of speculative replies and vague recollections, but no definitive answers as to how Infocom encodes textual data in its game files. Considering the amount of apparent interest in this topic, here are the EXACT details, both general and specific. I hope this will lay to rest this subject (if such is ever possible on the net). General: Infocom text is not encrypted. Rather, it is coded using a packing scheme that results in a high degree of data compression. (Although, considering the difficulty people have found in decoding text, the scheme functions as a pretty fair encryption mechanism also.) Text is encoded using a 5-bit, 3-level code. Although the 5-bit code permits only 32 characters, 26 of these are multiplexed three ways by means of two "shift" codes, resulting in a 78-character set. This includes the 52 letters of the upper/lower-case alphabet, 10 digits, and 15 common punctuation characters, including New Line. Also, one code is used as an escape prefix for a 3-code sequence to specify any arbitrary 7-bit ASCII character not included in the set. Four codes are common to all three levels. One of these is a space character. The remaining three codes are prefixes for a 2-code token, which references one of 96 common text substrings. These substrings are packed using the same encoding scheme and are referenced by means of a pointer table, the address of which is at a fixed location in the game file. The particular set of 96 token substrings is game-dependent and is chosen automatically by the Infocom game language compiler for maximum overall compression, via analysis of all text strings in a game. Output of a token substring involves a single level of recursion by the text string output processor. Text strings are packed into 16-bit words, with three 5-bit codes per word. The extra bit is used as an end-of-string flag. This same packing scheme is used for ALL text, including both game output and the vocabulary table used to match user input. The same scheme is used by ALL Infocom games, including the newer Interactive Fiction "Plus" games, such as Bureaucracy and Beyond Zork. I have personally disassembled many versions of the Infocom interpreter program for several different computer systems, and I can thus verify the accuracy of this description. Note for would-be game hackers: The interpreter programs (e.g. the .COM files provided for CP/M and MS-DOS systems) are game-independent. All game-specific text is contained within the game data (e.g. .DAT) files, and is typically processed "on-the-fly" by the interpreter program. Thus, you cannot "cheat" by disassembling the interpreter itself, and examining memory while the interpreter is running will yield at most the last (bufferred) line of output text in ASCII code. Specific: Infocom game files consist of an ordered, addressable sequence of 8-bit bytes (maximum 128K bytes in the "classic" games, 256K bytes in the newer "plus" games). Text strings are packed into 16-bit words, with the most significant byte first (lower address) in the game file. Text strings may (and do) start at arbitrary (odd or even) byte addresses. The msb of each word is 0 in each intermediate word of a string, 1 in the final word. This is followed by three 5-bit codes, which are processed in left-to-right order. I.e., numbering bits 15-to-0 from msb-to-lsb: Bit 15 = end-of-string flag Bits 14-10 = code 1 Bits 9-5 = code 2 Bits 4-0 = code 3 Codes are interpreted at one of three levels, which we denote 0, 1, 2: Level 0 = lower-case alphabet Level 1 = upper-case alphabet Level 2 = digits and punctuation Level 0 is the default (initial) level at the start of a string, and at the start of a tokenized substring (processed recursively from the main string). Codes 0-3 are common to all levels, and do not affect the current level: Code 0 = space character Code 1 = prefix for token substrings 0-31 (next code) Code 2 = prefix for token substrings 32-63 (next code + 32) Code 3 = prefix for token substrings 64-95 (next code + 64) The token substrings are normally stored starting at byte address 40 hex in the game file, but these are properly accessed via a 96-word pointer table whose address is contained in the word at bytes 18-19 hex. All token substrings begin on even byte addresses, and the pointer table contains the substring word addresses (i.e. byte addresses divided by two). All 16-bit addresses and pointers are stored high-byte-first. Codes 4-5 are used to shift levels: Code 4, level 0 = shift to level 1 for next code only Code 5, level 0 = shift to level 2 for next code only Code 4, level 1 = permanent shift to level 1 Code 5, level 1 = permanent shift back to level 0 Code 4, level 2 = permanent shift back to level 0 Code 5, level 2 = permanent shift to level 2 Note that, from level 0, the shift codes affect the NEXT code only. (Code 4 normally precedes the first, capitalized word of a sentence.) Two identical shift codes in a row effect a shift-lock to a new level, and the alternate shift code is then used to restore level 0. Code 5 is also used to end-pad the last word of a text string which does not contain a multiple of three codes. Finally, codes 6-31 generate printable characters: Code Level 0 Level 1 Level 2 ---- ------- ------- ------- 6 a A (see below) 7 b B New Line 8 c C 0 9 d D 1 10 e E 2 11 f F 3 12 g G 4 13 h H 5 14 i I 6 15 j J 7 16 k K 8 17 l L 9 18 m M . 19 n N , 20 o O ! 21 p P ? 22 q Q _ 23 r R # 24 s S ' 25 t T " 26 u U / 27 v V \ 28 w W - 29 x X : 30 y Y ( 31 z Z ) Code 6 at level 2 is a special escape prefix, which may be used to generate an arbitrary ASCII code via a 3-code sequence. This is specified by the next two following codes, which contain the high two bits and low five bits, respectively, of the desired 7-bit ASCII code. Note that SOME text output is generated by single ASCII codes, which are not packed using the scheme described here. However, the majority of all textual data, including the input vocabulary, is packed. (I'll reserve a description of how to locate the input vocabulary table for a later posting, if there is sufficient interest in same.) In conclusion: I hope all this satisfies some curiosity, particularly from the standpoint of understanding an interesting and effective text compression technique. I personally fail to see how anyone could enjoy an Infocom story by using this information to (partially) decode the game data file. But to each his own. Happy adventuring! -- Bob Freed Internet: Robert_Freed@cup.portal.com Uucp: sun!portal!cup.portal.com!Robert_Freed