[net.math] Digram statistics question

sibley (05/10/83)
Relay-Version:version B 3/9/83; site harpo.UUCP
Message-ID:<1245@psuvax.UUCP>
Date:Tue, 10-May-83 11:09:35 EDT


I am just learning about cryptography and have been reading "Cryptanalysis
for Microcomputers" by C.C. Foster (Hayden Book Co.).  It seems to be a
reasonable but very elementary introduction to amateur cryptanalysis.
Of course, all the programs are in Basic.

One of the problems he treats concerns finding the best "column matches"
based on digram (letter pair) frequencies.  For instance, suppose you have
a cyphertext arranged as
	0 1 2 3 4 5 6
	M N Y A E T E
	K A I N C G T
	B N R O A A O
	O F N T R X D
	H U R E S I R
	R F C E O M N
	S N N O T W E
and you know that the plaintext is obtained by permuting the columns and
then reading the rows of the result.  In practice this is usually not
difficult to do by hand.  For this one, the first row can be rearranged
to read
"ENEMYAT".  Making the same permutation for whole columns readily reveals
the message.  Foster proposes a way to use a computer to help find the
correct column permutation.  One considers each possible ordered pair of
columns and calculates a number which is to measure how well these two fit
together.  For the pair (0, 2) of columns we see digrams MY, KI, BR, etc.
Each digram occurs in English with a certain known frequency (he provides
a table of these).  The number he computes for a pair of columns is the
geometric mean of the frequencies of the digrams in the pair.  The higher
this number is, the better the two columns match.
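
The scoring described above can be sketched in Python rather than Basic.
This is only a rough sketch, not Foster's program: his digram table is not
reproduced here, so `freq` is a table the caller must supply (digram ->
count per 10000 letters), and the `floor` value used for digrams missing
from the table is an assumption of the sketch.  All function names are
made up for illustration.

```python
from math import exp, log

# The example grid from above, one string per row.
ROWS = ["MNYAETE", "KAINCGT", "BNROAAO", "OFNTRXD",
        "HURESIR", "RFCEOMN", "SNNOTWE"]

def permute_columns(rows, order):
    """Read each row with its columns taken in the given order."""
    return ["".join(row[c] for c in order) for row in rows]

def column_digrams(rows, left, right):
    """Digrams formed by putting column `left` directly before `right`."""
    return [row[left] + row[right] for row in rows]

def pair_score(rows, left, right, freq, floor=0.5):
    """Geometric mean of the digram frequencies for an ordered column pair.

    `floor` (an assumption) replaces digrams absent from `freq`, because a
    single zero would force the whole geometric mean to zero.
    """
    values = [freq.get(d, floor) for d in column_digrams(rows, left, right)]
    return exp(sum(log(v) for v in values) / len(values))

def rank_pairs(rows, freq):
    """All ordered pairs of distinct columns, best score first."""
    n = len(rows[0])
    pairs = [(pair_score(rows, a, b, freq), a, b)
             for a in range(n) for b in range(n) if a != b]
    return sorted(pairs, reverse=True)
```

With the column order 6, 1, 4, 0, 2, 3, 5, `permute_columns(ROWS, ...)`
gives "ENEMYAT" as its first row, and `column_digrams(ROWS, 0, 2)` starts
MY, KI, BR as in the text.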

Now this does not work very well, and the example above shows why.  Some
of the best numbers belong to incorrect matches, and some of the correct
matches give poor numbers.  It seems to me that the trouble is
that digram frequencies are the wrong thing to use for this sort of
analysis.  For instance, the digram TH is the most common in English,
occurring about 297 times per 10000 letters of text, while QU is pretty
rare, occurring only 11 times in 10000 letters.  Thus, in the analysis
above, a digram TH contributes 297 while QU contributes 11.  Of course, Q
followed by anything other than U is extremely rare (but does occur, e.g.,
in proper nouns, like "Iraqi").  It seems to me that QU should be weighted
as much more important than TH.
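
To put a number on that imbalance, here is a small Python calculation
using only the two frequencies quoted above and the seven rows of the
example.  It asks how much the geometric mean changes if one digram in a
column pair is a TH instead of a QU, everything else being equal.  The
function name is made up for illustration.

```python
# Frequencies per 10000 letters, as quoted above.
TH, QU = 297, 11
ROWS = 7  # number of rows in the example grid

def geometric_mean_factor(high, low, rows):
    """Factor by which the geometric mean of `rows` values rises when one
    of them changes from `low` to `high`."""
    return (high / low) ** (1 / rows)

factor = geometric_mean_factor(TH, QU, ROWS)
print(round(factor, 2))
```

The factor is the seventh root of 27, about 1.6, so a pair containing a TH
outscores an otherwise identical pair containing a QU by roughly 60
percent, even though the QU is the stronger evidence of a correct match.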

Part of the problem is that the digram frequencies also reflect the letter
frequencies.  The TH is frequent partly because T is a common letter,
while QU is rare mostly because Q is rare.  However, we already know that
we have a T and a Q, so we shouldn't consider statistics involving
their frequencies.
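
One possible reading of this, offered only as a guess: a digram frequency
is the frequency of its first letter times the chance that the second
letter follows it, so dividing out the first letter leaves a conditional
frequency.  The Python sketch below estimates that from a sample of
plaintext supplied by the caller.  The function name and the choice to
ignore everything except letters are assumptions of the sketch, and it is
not a claim about what the proper statistic is.

```python
from collections import Counter

def conditional_frequencies(text):
    """Estimate P(second letter | first letter) for each digram in `text`.

    Non-letters are dropped first, so digrams run across word boundaries,
    as they do in a transposition cyphertext.
    """
    letters = [ch for ch in text.upper() if ch.isalpha()]
    digrams = Counter(a + b for a, b in zip(letters, letters[1:]))
    # Every letter except the last one starts exactly one digram.
    firsts = Counter(letters[:-1])
    return {d: n / firsts[d[0]] for d, n in digrams.items()}

# Toy sample, far too small for real use.
cond = conditional_frequencies("QUIT THE QUAY")
print(cond["QU"], cond["TH"])
```

In this toy sample every Q is followed by U, so QU gets 1.0, while T is
followed by H only half the time, so TH gets 0.5.  Dividing by the
frequency of the second letter as well would be another option.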

So what is the "correct" way to analyze a pair of columns for goodness of
fit?  I have some ideas, but they're just guesses.  Does anyone out there
know the proper statistics?

Dave Sibley
...psuvax!sibley