[sci.math] Encryption of Huffman-coded text

kent@xanth.UUCP (Kent Paul Dolan) (01/01/70)

I am currently involved (for a communications package) in a FORTRAN
implementation of the recently published (CACM V30N6 June 1987
p520-540) adaptive arithmetic data compression algorithm.  (It works
fine; I needed to talk across a line that would only accept printable
ASCII characters, so I adapted the algorithm to encode 13 bits at a
time and then pass 6.5 bits per message byte.  (A pair of printable
ASCII characters gives 95*95 = 9025 combinations, which holds 13 bits
of information quite conveniently.))  This seems to be
even more interesting than the adaptive Huffman encoding, for purposes
of confusing decrypters, because individual bit boundaries don't match
with individual character boundaries, NOR VICE VERSA!  A particular,
non-standard start-up encounter count list for the characters (in the
adaptive form, which is the only interesting one) might be a
sufficient encryption scheme by itself to "lend confusion to the
enemy".  In any case, I'm pretty sure that every bit in the compressed
message depends at least slightly on every character up to then in the
uncompressed message.  I don't have the math skills to analyze this,
but I'd sure like to see comments on this one from the net!
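[Editor's note: in modern terms, the 13-bits-into-two-printable-characters packing Kent describes might look like the following Python sketch (not his FORTRAN; the names `pack13`/`unpack13` are mine):]

```python
# Packing 13-bit groups into pairs of printable ASCII characters,
# since 95 * 95 = 9025 >= 2**13 = 8192.

PRINTABLE = 95   # printable ASCII characters, 0x20 (space) .. 0x7E ('~')
BASE = 0x20

def pack13(v):
    """Encode a 13-bit value as two printable ASCII characters."""
    assert 0 <= v < 1 << 13
    return chr(BASE + v // PRINTABLE) + chr(BASE + v % PRINTABLE)

def unpack13(pair):
    """Recover the 13-bit value from a two-character group."""
    hi, lo = (ord(c) - BASE for c in pair)
    return hi * PRINTABLE + lo

# Round-trip check over the edge cases.
for v in (0, 1, 4242, 8191):
    assert unpack13(pack13(v)) == v
```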

Kent, the man from xanth.
--
Kent Paul Dolan, LCDR, NOAA, Retired; ODU MSCS grad student	 // Yet
UUCP  :  kent@xanth.UUCP   or    ...{sun,harvard}!xanth!kent	// Another
CSNET :  kent@odu.csnet    ARPA  :  kent@xanth.cs.odu.edu   \\ // Happy
USPost:  P.O. Box 1559, Norfolk, Virginia 23501-1559	     \// Amigan!
Voice :  (804) 587-7760    -=][> Last one to Ceres is a rotten egg! -=][>

I code reactor power plant control in C.
I add "count_of_recent_alarms" to "count_of_rods_to_lift".
C has weak type checking; the compiler doesn't notice.
A major coolant valve sticks, a spate of alarms occur.
All die.
Oh, the embarrassment!

palmer@tybalt.caltech.edu (David Palmer) (08/07/87)

If you are going to encrypt a message by the standard breakable techniques
(Caesar cypher, Ultra, Purple, NBS DES, whatever technique the NSA is
selling to our allies :-), is decryption more difficult if you use
Huffman coding (including Huffman coding of common combinations such as 'th')
on the plaintext before encipherment?  I am assuming that the people doing
the attack know that you are doing Huffman coding and know what the code is.

(For those who are not familiar with the technique, Huffman coding uses
variable-length characters, the more common characters requiring fewer
bits.  This usually decreases the total length of a message (in bits),
assuming that it is non-pathological and not overly concerned with zebu
and oxen in Zanzibar and Qom.)
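[Editor's note: for concreteness, a minimal letter-by-letter Huffman coder in Python (a sketch to accompany the description above; the sample text is illustrative):]

```python
# Minimal Huffman coder: common symbols get short bit strings, rare ones long.

import heapq
from collections import Counter

def huffman_codes(freqs):
    """Map each symbol to its bit string, given symbol -> count."""
    # Heap entries: (count, unique tiebreaker, {symbol: code-so-far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]

text = "the theory of these thermal thresholds"
codes = huffman_codes(Counter(text))
# A more frequent letter never gets a longer code than a rarer one.
assert len(codes["e"]) <= len(codes["f"])
bits = "".join(codes[c] for c in text)
assert len(bits) < 8 * len(text)   # shorter than 8-bit ASCII
```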

It seems to me that Huffman coding makes the plaintext look more 'random'
and, so, less amenable to statistical attack.  Is this so?

I don't know how to crack anything harder than a simple substitution
cypher, so details would be welcome.


		David Palmer
		palmer@tybalt.caltech.edu
		...rutgers!cit-vax!tybalt.caltech.edu!palmer

bs@augusta.UUCP (Burch Seymour) (08/10/87)

in article <3515@cit-vax.Caltech.Edu>, palmer@tybalt.caltech.edu (David Palmer) says:
>
> It seems to me that Huffman coding makes the plaintext look more 'random'
> and, so, less amenable to statistical attack.  Is this so?

And in a similar vein, it would seem that preceding the encryption with
a simpleminded substitution, say to 17-bit characters, which flattened out
the distribution of use over the alphabet, i.e. mapping more cipher
characters to 'e' than to 'q', would also make the result difficult to
attack.  Any comments?
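[Editor's note: this is the classical homophonic substitution idea. A Python sketch of it below; the frequency table is an illustrative assumption (rough English letter frequencies), not anything from the thread:]

```python
# Homophonic substitution sketch: give each letter a number of cipher
# codes roughly proportional to its English frequency, and pick among
# them at random, flattening the ciphertext's symbol distribution.

import random

# Approximate English letter frequencies -> number of homophones (assumed).
HOMOPHONE_COUNTS = {"e": 13, "t": 9, "a": 8, "o": 8, "n": 7, "i": 7,
                    "s": 6, "r": 6, "h": 6, "d": 4, "l": 4, "u": 3,
                    "c": 3, "m": 3, "w": 2, "f": 2, "g": 2, "y": 2,
                    "p": 2, "b": 2, "v": 1, "k": 1, "j": 1, "x": 1,
                    "q": 1, "z": 1}

def build_tables(seed=0):
    """Deal out shuffled code numbers; return (encipher, decipher) tables."""
    rng = random.Random(seed)
    codes = list(range(sum(HOMOPHONE_COUNTS.values())))
    rng.shuffle(codes)
    enc, dec, i = {}, {}, 0
    for letter, n in HOMOPHONE_COUNTS.items():
        enc[letter] = codes[i:i + n]
        for c in enc[letter]:
            dec[c] = letter
        i += n
    return enc, dec

def encipher(text, enc, rng):
    """Replace each letter by one of its homophones, chosen at random."""
    return [rng.choice(enc[c]) for c in text if c in enc]

enc, dec = build_tables()
ct = encipher("meetmeatthezoo", enc, random.Random(1))
assert "".join(dec[c] for c in ct) == "meetmeatthezoo"
```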

-Burch Seymour-

jgy@hropus.UUCP (John Young) (08/10/87)

David Palmer writes:
> If you are going to encrypt a message by the standard breakable techniques,
> (Caesar cypher, Ultra, Purple, NBS DES, whatever technique the NSA is
> selling to our allies :-)  Is decryption more difficult if you use
> Huffman coding (including Huffman coding of common combinations such as 'th')
> on the plaintext before encipherment?  I am assuming that the people doing
> the attack know that you are doing Huffman coding and know what the code is.
> .
> .
> .
> It seems to me that Huffman coding makes the plaintext look more 'random'
> and, so, less amenable to statistical attack.  Is this so?
> 

My first reaction would be to say "yes, it would be less amenable to
statistical attack", but then I wondered: if it's known you're doing
Huffman coding (you say it is), then that algorithm has already done
the frequency analysis for the code breaker, and the partial cypher
may now be in a "stricter" form.  Comments from experts??? (I'm not one either)

jim@randvax.UUCP (08/10/87)

In article <3515@cit-vax.Caltech.Edu> palmer@tybalt.caltech.edu.UUCP (David Palmer) writes:
>If you are going to encrypt a message by the standard breakable techniques,
>(Caesar cypher, Ultra, Purple, NBS DES, whatever technique the NSA is
>selling to our allies :-)  Is decryption more difficult if you use
>Huffman coding (including Huffman coding of common combinations such as 'th')
>on the plaintext before encipherment?  I am assuming that the people doing
>the attack know that you are doing Huffman coding and know what the code is.

Data compression methods such as Huffman coding will normally make
the cryptanalysis more difficult: the variable-length strings of the
underlying plaintext will make it hard for the cryptanalyst to spot
repetitions that are important for breaking in.
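[Editor's note: the hidden-repetitions point can be illustrated with an assumed fixed code table (not from the thread): after Huffman coding, a repeated word starts at bit offsets that are not multiples of 8, so a block cipher sees different bytes each time.]

```python
# A small prefix-free code (assumed for illustration) over {a, t, c, k, space}.
CODES = {"a": "0", "t": "10", "c": "1100", "k": "1101", " ": "111"}

def bit_offset(text, index):
    """Bit position where text[index] begins in the Huffman bit stream."""
    return sum(len(CODES[c]) for c in text[:index])

msg = "attack at attack"
# The first "attack" starts at bit 0; the second starts mid-byte,
# so the two identical words produce differently aligned ciphertext.
assert bit_offset(msg, 0) == 0
assert bit_offset(msg, 10) % 8 != 0
```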

However, a common implementation of Huffman coding involves computing the
optimal tree of assignments of bit strings to ASCII characters before
encoding, then prepending this tree to the compressed text.  The tree
is in a standard format, which we assume is available to the cryptanalyst;
this might hand him a good fraction of several hundred bytes of known
material at the start of the message.  So rather than improving the
security, this implementation would actually give the cracker a leg up...

Better would be the adaptive flavor of Huffman coding, which starts with a
standard English (or C code or assembler) decoding tree and changes the
tree on the fly as the symbol frequencies move away from the starting
tree.  This would eliminate the initial cribs, leaving the cryptanalyst
with only the usual tools, such as guessing or discovering pieces of the
underlying plaintext, Huffmanizing them with a standard tree, and testing
the result for consistency.
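[Editor's note: a toy sketch of the adaptive idea, rebuilding the tree from symbol counts after every character. Real adaptive Huffman algorithms (FGK, Vitter) update the tree incrementally instead, but the point stands: both sides start from the same agreed counts, so no tree header is sent.]

```python
# Adaptive Huffman, toy version: encoder and decoder each start from
# flat counts and update them identically, so their tables stay in step.

import heapq

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def codes_from_counts(counts):
    """Deterministically rebuild the Huffman table from symbol counts."""
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)
        n1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]

def encode(text):
    counts = {s: 1 for s in ALPHABET}   # the agreed starting "tree"
    out = []
    for ch in text:
        out.append(codes_from_counts(counts)[ch])
        counts[ch] += 1                 # both sides update identically
    return "".join(out)

def decode(bits):
    counts = {s: 1 for s in ALPHABET}
    out, i = [], 0
    while i < len(bits):
        table = {b: s for s, b in codes_from_counts(counts).items()}
        j = i + 1
        while bits[i:j] not in table:   # prefix-free: first match is right
            j += 1
        ch = table[bits[i:j]]
        out.append(ch)
        counts[ch] += 1
        i = j
    return "".join(out)

msg = "attack at dawn attack at dusk"
assert decode(encode(msg)) == msg
```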
-- 
	Jim Gillogly
	{hplabs, ihnp4}!sdcrdcf!randvax!jim
	jim@rand-unix.arpa

karn@faline.bellcore.com (Phil R. Karn) (08/10/87)

> Is decryption more difficult if you use Huffman coding...?

Yes, it is well known that increasing the entropy ("randomness") of the
plaintext before encryption is an excellent way to thwart cryptanalysis.

It's easy to see this. Suppose you have LOTS of CPU cycles available,
maybe even a parallel array of a million DES chips, but you need to know
when you've found the answer. If you've got matched ciphertext and
plaintext, no problem. But if you've only got ciphertext, you'll need to
guess the expected properties of the plaintext to know when you've found
the right answer. If the plaintext is ASCII-encoded English text, it
will leap right out at you with even the simplest statistical analysis
(all parity bits off, typical English character distribution, etc).
However, if the plaintext has been compressed, these statistics go away
(assuming the compression algorithm is doing its job). You simply don't
know when you stumble on the right answer.
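[Editor's note: the "leap right out at you" test is trivial to mechanize. A Python sketch of the parity-bit check Phil mentions; the XOR "wrong key" is just a stand-in for a bad trial decryption:]

```python
# ASCII plaintext betrays itself instantly; random-looking bytes do not.

def looks_like_ascii_text(block):
    """Crude recognizer: every byte has its top bit clear (7-bit ASCII)."""
    return all(b < 0x80 for b in block)

plain = b"The quick brown fox jumps over the lazy dog."
assert looks_like_ascii_text(plain)

# A wrong trial key yields essentially random bytes; the chance that a
# 44-byte block passes this one test by accident is (1/2)**44.
wrong_guess = bytes(b ^ 0xA7 for b in plain)   # stand-in for a bad decryption
assert not looks_like_ascii_text(wrong_guess)
```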

I've heard from someone who should know that "knowing when you've found
the answer" is indeed perhaps THE hardest part of cryptanalysis.

Practical point: many compression utilities put "magic numbers" at the
beginning of their compressed output so they can tell if they're being
fed random garbage when decompressing. The cryptanalyst could look for
these magic numbers, though if they're small compared to the block size
of the encryption algorithm he would get a very large number of false
alarms.  (E.g., a 1-byte magic number with 8-byte DES blocks would yield
one false alarm every 256 trial keys, on average.)
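[Editor's note: that one-in-256 figure is easy to check by simulation, modeling each wrong-key decryption as a uniformly random block:]

```python
# If a wrong key produces effectively random bytes, a 1-byte magic
# number matches about once every 256 trial keys.

import random

MAGIC = 0x1F                      # an arbitrary 1-byte magic number
rng = random.Random(42)
trials = 200_000
hits = sum(1 for _ in range(trials)
           if rng.randrange(256) == MAGIC)   # first byte of a random block
rate = hits / trials
assert abs(rate - 1 / 256) < 1 / 1024        # close to one in 256
```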

Phil

biep@cs.vu.nl (J. A. "Biep" Durieux) (08/12/87)

In article <1034@faline.bellcore.com> karn@faline.bellcore.com (Phil R. Karn) writes:
>> Is decryption more difficult if you use Huffman coding...?
>
>Yes, it is well known that increasing the entropy ("randomness") of the
>plaintext before encryption is an excellent way to thwart cryptanalysis.
>I've heard it from someone that should know that "knowing when you've found
>the answer" is indeed perhaps THE hardest part of cryptanalysis.

That sounds as if it's true, but I don't think Huffman encoding will make
that so difficult (that is, Huffman encoding on a letter-by-letter basis),
since English is also rather easily recognisable (from random bits) by its
inter-character statistics, i.e. the relative frequencies of groups of
letters.  It will be a bit more work, though, and a *real good* compression
algorithm will take advantage of these regularities too (if it's there for
multiple uses; otherwise telling the recipient how to decode might be more
work than sending the plaintext).
-- 
						Biep.  (biep@cs.vu.nl via mcvax)
	To be the question or not to be the question, that is.

jim@randvax.UUCP (Jim Gillogly) (08/13/87)

>In article <1034@faline.bellcore.com> karn@faline.bellcore.com (Phil R. Karn) writes:
>>...  it is well known that increasing the entropy ("randomness") of the
>>plaintext before encryption is an excellent way to thwart cryptanalysis.
>>I've heard it from someone that should know that "knowing when you've found
>>the answer" is indeed perhaps THE hardest part of cryptanalysis.

In article <851@klipper.cs.vu.nl> biep@cs.vu.nl (J. A. "Biep" Durieux) responds:
>That sounds as if it's true, but I don't think Huffman encoding will make
>that so difficult (that is, Huffman encoding on a letter-by-letter basis),
>since English is also rather easily recognisable (from random bits) by its
>inter-character statistics, i.e. the relative frequencies of groups of letters.

I agree with Phil, assuming the decoding tree is not known.  Huffman
coding (letter-by-letter or not) results in variable-length bit strings,
so that looking at a random piece of output you can't tell where the
letters (or groups of letters) begin and end.  This gives considerably
more room for error.  It's not terribly easy to decrypt Huffman-coded
stuff (even before a superencryption) without the decoding tree.
	    ------
If you know or guess that a particular word appears at a certain place
in the output, you can try to use that to build up the decryption.  That's
how I'd try to attack it anyway.  But I wouldn't expect to have it done
before lunch.

-- 
	Jim Gillogly
	{hplabs, ihnp4}!sdcrdcf!randvax!jim
	jim@rand-unix.arpa