sibley (05/10/83)
Relay-Version:version B 3/9/83; site harpo.UUCP Message-ID:<1245@psuvax.UUCP> Date:Tue, 10-May-83 11:09:35 EDT I am just learning about cryptography and have been reading "Cryptanalysis for Microcomputers" by C.C. Foster (Hayden Book Co.). It seems to be a reasonable but very elementary introduction to amateur cryptanalysis. Of course, all the programs are in Basic. One of the problems he treats concerns finding the best "column matches" based on digram (letter pair) frequencies. For instance, suppose you have a cyphertext arranged as 0 1 2 3 4 5 6 M N Y A E T E K A I N C G T B N R O A A O O F N T R X D H U R E S I R R F C E O M N S N N O T W E and you know that the plaintext is obtained by permuting the columns and then reading the rows of the result. In practice this is usually not difficult. For this one, the first row can be rearranged to read "ENEMYAT". Making the same permutation for whole columns readily reveals the message. Foster proposes a way to use a computer to help find the correct column permutation. One considers each possible ordered pair of columns and calculates a number which is to measure how well these two fit together. For the pair (0, 2) of columns we see digrams MY, KI, BR, etc. Each digram occurs in English with a certain known frequency (he provides a table of these). The number he computes for a pair of columns is the geometric mean of the frequencies of the digrams in the pair. The higher this number is, the better the two columns match. Now this does not work very well. In fact, the above is a pretty good example. Some of the best numbers are incorrect matches and some of the correct matches give poor numbers. It seems to me that the trouble is that digram frequencies are the wrong thing to use for this sort of analysis. For instance, the digram TH is the most common in English, occurring about 297 times per 10000 letters of text, while QU is pretty rare, occuring only 11 times in 10000 letters. Thus, In the analysis above, a digram TH contributes 297 while QU contributes 11. Of course, Q followed by anything other than U is extremely rare (but does occur, e.g., in proper nouns, like "Iraqi"). It seems to me that QU should be weighted as much more important than TH. Part of the problem is that the digram frequencies also reflect the letter frequencies. The TH is frequent partly because T is a common letter, while QU is rare mostly because Q is rare. However, we already know that we have a T and a Q, so we shouldn't consider statistics involving their frequencies. So what is the "correct" way to analyze a pair of columns for goodness of fit? I have some ideas, but they're just guesses. Does anyone out there know the proper statistics? Dave Sibley ...psuvax!sibley
kdh (05/12/83)
One possible attack for this problem would be to compile stats on "Given this letter, what letter most frequently follows it?" This would have the desired effect with "QU", since a "U" follows a "Q" in almost all cases. Given the columns: M N Y A E T E the operation would then consist of "which letter will probably follow the M" type decisions, rather than "which pair is most common". Other than that, the algorithm would probably be very similar. Kevin Hunter American Bell houca!kdh
outer (05/20/83)
The idea is to use digrams to aid in the anagramming of a complete columnar transposition. Foster goes a little wild, but he's basicly right according to the "classical" literature. EG: Friedman, Elements of Cryptanalysis: "The method is to select a column which has a good assortment of high-frequency letters and try to find columns which can be added before and after the selected column to build up high-frequency digraphs and trigraphs." Sinkov, Elementary Cryptanalysis: [After failing to reconstruct the first line by sight] "Then the next step should be to try to fit two columns together so as to get good digraphs.... including, in particular...occurences of TH." Gaines, Cryptanalysis: "Sometimes this can be decided by simple observation. Otherwise, the combinations can be subjected to a digram test. This is made by setting down beside each digram, as formed by each pair of columns, its frequency as taken from a digram chart.... and the supposition is that the combination furnishing the highest frequency-total will be the correct one..." Why? Because it is the redundancy of the frequent digrams that is the key to putting the puzzle together. "qu" and such are useful because they are so rare as to be almost certain, but in general I think what you want to do is maximize the latent patterns. You look for what you expect (hope?) to find. I know that doesn't explain it very well, so I won't even guess at what sort of probabilities are summed up in a digram chart. They are not independent: in "THE MAN" you take TH, HE, EM, MA & AN, sometimes omitting EM as it cuts across words. What does that make them? - richard outerbridge / UofT / CSRG (...decvax!utzoo!utcsrgv!outer)