sibley (05/10/83)
Relay-Version:version B 3/9/83; site harpo.UUCP Message-ID:<1245@psuvax.UUCP> Date:Tue, 10-May-83 11:09:35 EDT I am just learning about cryptography and have been reading "Cryptanalysis for Microcomputers" by C.C. Foster (Hayden Book Co.). It seems to be a reasonable but very elementary introduction to amateur cryptanalysis. Of course, all the programs are in Basic. One of the problems he treats concerns finding the best "column matches" based on digram (letter pair) frequencies. For instance, suppose you have a cyphertext arranged as 0 1 2 3 4 5 6 M N Y A E T E K A I N C G T B N R O A A O O F N T R X D H U R E S I R R F C E O M N S N N O T W E and you know that the plaintext is obtained by permuting the columns and then reading the rows of the result. In practice this is usually not difficult. For this one, the first row can be rearranged to read "ENEMYAT". Making the same permutation for whole columns readily reveals the message. Foster proposes a way to use a computer to help find the correct column permutation. One considers each possible ordered pair of columns and calculates a number which is to measure how well these two fit together. For the pair (0, 2) of columns we see digrams MY, KI, BR, etc. Each digram occurs in English with a certain known frequency (he provides a table of these). The number he computes for a pair of columns is the geometric mean of the frequencies of the digrams in the pair. The higher this number is, the better the two columns match. Now this does not work very well. In fact, the above is a pretty good example. Some of the best numbers are incorrect matches and some of the correct matches give poor numbers. It seems to me that the trouble is that digram frequencies are the wrong thing to use for this sort of analysis. For instance, the digram TH is the most common in English, occurring about 297 times per 10000 letters of text, while QU is pretty rare, occuring only 11 times in 10000 letters. Thus, In the analysis above, a digram TH contributes 297 while QU contributes 11. Of course, Q followed by anything other than U is extremely rare (but does occur, e.g., in proper nouns, like "Iraqi"). It seems to me that QU should be weighted as much more important than TH. Part of the problem is that the digram frequencies also reflect the letter frequencies. The TH is frequent partly because T is a common letter, while QU is rare mostly because Q is rare. However, we already know that we have a T and a Q, so we shouldn't consider statistics involving their frequencies. So what is the "correct" way to analyze a pair of columns for goodness of fit? I have some ideas, but they're just guesses. Does anyone out there know the proper statistics? Dave Sibley ...psuvax!sibley