[net.crypt] Digram statistics question

sibley (05/10/83)

Relay-Version:version B 3/9/83; site harpo.UUCP
Message-ID:<1245@psuvax.UUCP>
Date:Tue, 10-May-83 11:09:35 EDT


I am just learning about cryptography and have been reading "Cryptanalysis
for Microcomputers" by C.C. Foster (Hayden Book Co.).  It seems to be a
reasonable but very elementary introduction to amateur cryptanalysis.
Of course, all the programs are in Basic.

One of the problems he treats concerns finding the best "column matches"
based on digram (letter pair) frequencies.  For instance, suppose you have
a cyphertext arranged as
	0 1 2 3 4 5 6
	M N Y A E T E
	K A I N C G T
	B N R O A A O
	O F N T R X D
	H U R E S I R
	R F C E O M N
	S N N O T W E
and you know that the plaintext is obtained by permuting the columns and
then reading the rows of the result.  In practice this is usually not
difficult.  For this one, the first row can be rearranged to read
"ENEMYAT".  Making the same permutation for whole columns readily reveals
the message.  Foster proposes a way to use a computer to help find the
correct column permutation.  One considers each possible ordered pair of
columns and calculates a number which is to measure how well these two fit
together.  For the pair (0, 2) of columns we see digrams MY, KI, BR, etc.
Each digram occurs in English with a certain known frequency (he provides
a table of these).  The number he computes for a pair of columns is the
geometric mean of the frequencies of the digrams in the pair.  The higher
this number is, the better the two columns match.

Now this does not work very well.  In fact, the above is a pretty good
example.  Some of the best numbers are incorrect matches and some of the
correct matches give poor numbers.  It seems to me that the trouble is
that digram frequencies are the wrong thing to use for this sort of
analysis.  For instance, the digram TH is the most common in English,
occurring about 297 times per 10000 letters of text, while QU is pretty
rare, occuring only 11 times in 10000 letters.  Thus, In the analysis
above, a digram TH contributes 297 while QU contributes 11.  Of course, Q
followed by anything other than U is extremely rare (but does occur, e.g.,
in proper nouns, like "Iraqi").  It seems to me that QU should be weighted
as much more important than TH.

Part of the problem is that the digram frequencies also reflect the letter
frequencies.  The TH is frequent partly because T is a common letter,
while QU is rare mostly because Q is rare.  However, we already know that
we have a T and a Q, so we shouldn't consider statistics involving
their frequencies.

So what is the "correct" way to analyze a pair of columns for goodness of
fit?  I have some ideas, but they're just guesses.  Does anyone out there
know the proper statistics?

Dave Sibley
...psuvax!sibley

kdh (05/12/83)

One possible attack for this problem would be to compile stats on

"Given this letter, what letter most frequently follows it?"

This would have the desired effect with "QU", since a "U" follows a "Q"
in almost all cases.  Given the columns:

	M N Y A E T E 

the operation would then consist of "which letter will probably follow the M"
type decisions, rather than "which pair is most common".
Other than that, the algorithm would probably be very similar.

					Kevin Hunter
					American Bell
					houca!kdh

outer (05/20/83)

The idea is to use digrams to aid in the anagramming of a complete columnar
transposition.  Foster goes a little wild, but he's basicly right according to
the "classical" literature.  EG:
	Friedman, Elements of Cryptanalysis: "The method is to select a column
	which has a good assortment of high-frequency letters and try to find
	columns which can be added before and after the selected column to
	build up high-frequency digraphs and trigraphs."

	Sinkov, Elementary Cryptanalysis: [After failing to reconstruct the
	first line by sight] "Then the next step should be to try to fit two
	columns together so as to get good digraphs.... including, in
	particular...occurences of TH."

	Gaines, Cryptanalysis: "Sometimes this can be decided by simple
	observation.  Otherwise, the combinations can be subjected to a digram
	test.  This is made by setting down beside each digram, as formed by 
	each pair of columns, its frequency as taken from a digram chart....
	and the supposition is that the combination furnishing the highest 
	frequency-total will be the correct one..."

Why?  Because it is the redundancy of the frequent digrams that is the key to
putting the puzzle together.  "qu" and such are useful because they are so rare
as to be almost certain, but in general I think what you want to do is maximize
the latent patterns.  You look for what you expect (hope?) to find.  I know
that doesn't explain it very well, so I won't even guess at what sort of
probabilities are summed up in a digram chart.  They are not independent: in
"THE MAN" you take TH, HE, EM, MA & AN, sometimes omitting EM as it cuts across
words.  What does that make them?

	- richard outerbridge / UofT / CSRG
	  (...decvax!utzoo!utcsrgv!outer)