[sci.crypt] how do you tell encrytped data from random data?

leonard@bucket.UUCP (Leonard Erickson) (01/06/88)

An interesting question has crossed my mind. If someone presents you with
an allegedly encrypted message, How can you tell if it really is encrypted
as opposed to being a bunch of random characters?

I know that transposition and *simple* substitution can be detected by
letter frequency analysis. But is a "flat" distibution evidence of random
data?

For my purposes, both "one-time pad" ciphers and anything that operates on
units other than characters can be considered random! If it is that complex,
then I'm not likely to crack it!

This is inspired by my fiinding some notes I made a year or so ago when I
and some friends were exchanging encrypted msgs over a public BBS. A few
of our friends cracked or first efforts and a typical "arms race" of better
ciphers followed by better analytic techniques ensued.

We finally came (after about 5 changes) came up with a scheme that only
one outsiders could read part of. Some of the other users started using
their own ciphers. The sysop didn't mind until someone started posting
HUGE messages which he suspected were just garbage. But the challenge
was, were they garbage or real messages?

Under his rules, messages (even encrypted!) were ok, junk wasn't. I made
a few attempts to analyze them, but while I was fairly certain, I could
never be sure. (when an 8k msg uses the 26 letters so evenly that the
spread better most used and least used is 12, you get *real* suspicious :-)

-- 
Leonard Erickson		...!tektronix!reed!percival!bucket!leonard
CIS: [70465,203]
"I used to be a hacker. Now I'm a 'microcomputer specialist'.
You know... I'd rather be a hacker."

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/08/88)

In article <660@bucket.UUCP> leonard@bucket.UUCP (Leonard Erickson) writes:
>An interesting question has crossed my mind. If someone presents you with
>an allegedly encrypted message, How can you tell if it really is encrypted
>as opposed to being a bunch of random characters?

The only foolproof way is to decrypt it.

>I know that transposition and *simple* substitution can be detected by
>letter frequency analysis. But is a "flat" distibution evidence of random
>data?

No, most really good encryption systems pretty much randomizes the cipher
stream, insofar as simple statistical tests can determine.  Conversely,
just because some putative cipher stream appears nonrandom does not mean
that it has meaningful underlying plaintext.  It could be a hoax.  There
are those who think the two undeciphered Beale cryptograms are hoaxes.

>For my purposes, both "one-time pad" ciphers and anything that operates on
>units other than characters can be considered random! If it is that complex,
>then I'm not likely to crack it!

You can't crack a true one-time pad system by analysis anyway, no matter
how competent a cryptanalyst you are.

>(when an 8k msg uses the 26 letters so evenly that the
>spread better most used and least used is 12, you get *real* suspicious :-)

Flatter-than-random distributions are not necessarily hoaxes.
They're actually pretty easy to obtain; Mil Cryp has an example.

mjr@osiris.UUCP (Marcus J. Ranum) (01/10/88)

	Reminds me of a wonderful concept in a book by Stanitslaw Lem
(Memoirs Found in a Bathtub):

	The underground of the Pentagon is all that remains after the end of
the world, but the spooks are sure they have been infiltrated. One of the best
tools they have is a super-dooper computer than can decrypt *ANYTHING* if it
tries hard enough. In fact, it can take plain text, and find the secret and
subversive messages hidden in it - even if there weren't any to begin with...

	Another problem with Gywn's suggestion that the only way to tell if it
is encrypted (decrypt it) - how can you tell that it is really "meaningful" ?
Anyone who has listened to first year latin students trying to "decrypt"
latin will realize that garbage in does not always imply the same garbage out.

--mjr();
-- 
------------------------------------------------------------------------------
...ich bin in einem dusenjet ins jahr 53 vor chr...ich lande im antiken Rom...
                     einige gladiatoren spielen scrabble...ich rieche PIZZA...

hes@ecsvax.UUCP (Henry Schaffer) (01/10/88)

These days it is quite common to compress text (and other files too.)
The various compressions schemes such as Huffman encoding and Lempel-
Ziv-Welch tend to even out (smooth out) the character distribution
(i.e., they hope to maximize the information transmitted per character,
or per byte).  If this file is then encrypted, the smoothness may
end up in the encrypted data.  

Conversly, random data need not have a smooth character distribution.
(It is common to confuse "random" with "equi-probable".  However, even
though most examples given in courses of "random" are "equi-probable
there is no necessary connection.)

--henry schaffer  n c state univ

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/11/88)

In article <1499@osiris.UUCP> mjr@osiris.UUCP (Marcus J. Ranum) writes:
>Another problem with Gywn's suggestion that the only way to tell if it
>is encrypted (decrypt it) - how can you tell that it is really "meaningful"?

It's not that hard in practice.  Witness the following, which have been
uniquely cryptanalyzed into plaintext which all involved agreed was
correct (and in cases where it could be checked with the encryptor,
its correctness was verified):
	(a)  the Chaucer transposition with spelling variations that
	     was posted to this newsgroup not long ago;
	(b)  "challenger" cryptograms
Some of the latter are extremely short, and have plaintext that nobody
in his right mind would really use.  I've personally broken quite short
messages, some 25 letters long, encrypted by some of the simpler
methods.  Usually, student cryppies would all get essentially the same
result (sometimes off in one or two letters) for cryptanalysis of
practice cryptograms, if they were able to crack them at all.

It is extremely hard to make up a cryptogram of modest length that
has two valid but different decryptions.  In fact, I don't know of
any of length more than a couple of dozen letters, although I can
imagine some ways to try to do this for somewhat longer messages.
This would make an interesting puzzle for someone.

On the other hand, there are people such as those who find Baconian
ciphers in Shakespeare's works, who put the message there themselves.
It's pretty easy to tell when this happens; either the decipherment
method is ambiguous, so the message could never have been
realistically expected to be read even by its intended recipient,
or so many "corrections" and otherwise unjustified special twists
on the decipherment scheme have been introduced that the
information content of the decipherment rivals that of the supposed
message being extracted.  It is interesting to note that William
F. Friedman, perhaps the greatest cryptanalyst of all time, and his
future wife got their start in the field by being employed at "Col."
Fabyan's Riverbank Labs, which was trying to find just such hidden
messages in Shakespeare.  The Friedmans showed how the same
methodology could be used to produce messages with precisely
opposite meaning, i.e. the methodology was making up the whole
content of the messages based on the analyst's preconceptions
about what was to be found.  On the other side of this issue,
several rum-runner codes (generally much more ambiguous to crack
than ciphers) were broken by the Friedmans, and they were able to
demonstrate their methodology satisfactorily to the federal courts.

karn@faline.bellcore.com (Phil R. Karn) (01/12/88)

Re random and encrypted messages: There is a strong connection between
randomness and cryptography. "Randomness" is in the eye of the beholder.
All it means for you to say that something is "random" is that *you*
can't predict what the next element in the sequence is, not that it
doesn't contain some underlying redundancy that someone else with
additional information (or computing power) might discover.  It
therefore follows that any strong encryption algorithm would necessarily
produce what appeared to be totally random ciphertext. Feed a counter
into DES and you have an excellent random number generator;
unfortunately it is usually too slow to be practical.

> On the other hand, there are people such as those who find Baconian
> ciphers in Shakespeare's works, who put the message there themselves.
> It's pretty easy to tell when this happens; either the decipherment
> method is ambiguous, so the message could never have been
> realistically expected to be read even by its intended recipient,
> or so many "corrections" and otherwise unjustified special twists
> on the decipherment scheme have been introduced that the
> information content of the decipherment rivals that of the supposed
> message being extracted.

Quite true. I know I'm going to get flamed for this, but this is *precisely*
what people do all the time with the Bible.

Phil

chapman@sco.COM (Brian Chapman Mx321) (01/19/88)

In article <660@bucket.UUCP> leonard@bucket.UUCP (Leonard Erickson) writes:
< An interesting question has crossed my mind. If someone presents you with
< an allegedly encrypted message, How can you tell if it really is encrypted
< as opposed to being a bunch of random characters?

It is the goal of any good encryption technique, in this class
of techniques, to make the data look random.
Any sort of pattern in the encrypted data is a crack you can
drive a decryption wedge into.

< I know that transposition and *simple* substitution can be detected by
< letter frequency analysis. But is a "flat" distibution evidence of random
< data?
< For my purposes, both "one-time pad" ciphers and anything that operates on
< units other than characters can be considered random! If it is that complex,
< then I'm not likely to crack it!

Yet UNIX crypt is a simple substitution code with the extra twists
that the offset in the file is added to the character (mod 256)
and you change substution tables every 256 characters.
This tends to flatten out a naively taken distrubution.

< (when an 8k msg uses the 26 letters so evenly that the spread
< better most used and least used is 12, you get *real* suspicious :-)
  ^^^^^^ you mean "between"?

You mean the character counts vary between 309 and 321.
Sounds about right to me for random looking data.

Artificialy flat distributions mean that the data is
either a fraud or the encryption method pads with extra
characters.

-- 
			   uunet!-\
Brian Chapman		microsof!-->sco!chapman
			   ihnp4!-/
Pay no attention to the man behind the curtain!  -- The Great & Powerfull Oz

hes@ecsvax.UUCP (Henry Schaffer) (01/20/88)

In article <275@sysco>, chapman@sco.COM (Brian Chapman Mx321) writes:
> In article <660@bucket.UUCP> leonard@bucket.UUCP (Leonard Erickson) writes:
> < An interesting question has crossed my mind. If someone presents you with
> < an allegedly encrypted message, How can you tell if it really is encrypted
> < as opposed to being a bunch of random characters?
> ... 
> You mean the character counts vary between 309 and 321.
> Sounds about right to me for random looking data.
  
I'm taking "random" as meaning multinomial with equi-probability.  In
this case - looking at any one character should be found about 8k/26=307
times, with a standard deviation of about 17.  That makes a spread of 12
between the most and least frequent of the 26 look quite small, indeed.

> Artificialy flat distributions mean that the data is
> either a fraud or the encryption method pads with extra
> characters.

Hmm, is this correct?  I don't know how flat the output of something like
Huffman encoding can be.  I know it isn't encryption - but perhaps something
like this in conjunction with encryption would give a flatter-than-random
character distribution without character padding.
> -- 
> Brian Chapman		microsof!-->sco!chapman

--henry schaffer  n c state univ

dunc%moria@Sun.COM (duncs home) (01/20/88)

In article <660@bucket.UUCP> leonard@bucket.UUCP (Leonard Erickson) writes:
>                               ...
>I know that transposition and *simple* substitution can be detected by
>letter frequency analysis. But is a "flat" distibution evidence of random
>data?

No.  Consider this simple substitution cipher:

'a' = 'abcdefghijklmnopqrstuvwxyz'
'b' = 'bcdefghijklmnopqrstuvwxyza'
'c' = 'cdefghijklmnopqrstuvwxyzab'
             ...
'z' = 'zabcdefghijklmnopqrstuvwxy'

The frequency distribution of the letters will be absolutely flat, yet it
does contain a message.  This shows that a flat distribution does not mean
absence of a message (and also that flat distribution, by itself, does not
guarantee a good cipher!)

mlm@NL.CS.CMU.EDU (Michael Mauldin) (01/21/88)

In article <275@sysco>, chapman@sco.COM (Brian Chapman Mx321) writes:
> Artificialy flat distributions mean that the data is
> either a fraud or the encryption method pads with extra
> characters.

The encryption algorithm might very easily generate purposefully "flat"
distributions by a variety of methods.  You could switch tables after
every block to shuffle "overused" bytes to less commonly used meanings.
You can even guarantee perfectly flat counts by not altering the
"meaning" of each byte.  This turns out to require extra bytes, since
once you make a decision that "I've used \035 enough, now", the
remaining bytes encode fewer bits.  Still, the result would be neither
fraudulent nor padded with extra characters (the extra bytes would be
"necessary").

If you got a really good encryption algorithm, this would be a
pointless add-on feature, but that doesn't mean that seomone else isn't
doing it.

Michael L. Mauldin (Fuzzy)		Department of Computer Science
ARPA: Michael.Mauldin@NL.CS.CMU.EDU	Carnegie-Mellon University
Phone: (412) 268-3065			Pittsburgh, PA  15213-3890

jackm@devvax.JPL.NASA.GOV (Jack Morrison) (01/23/88)

In article <39394@sun.uucp> dunc@sun.UUCP (duncs home) writes:
>In article <660@bucket.UUCP> leonard@bucket.UUCP (Leonard Erickson) writes:
>>                               ...
>>I know that transposition and *simple* substitution can be detected by
>>letter frequency analysis. But is a "flat" distibution evidence of random
>>data?
>
>No.  Consider this simple substitution cipher:
>
>'a' = 'abcdefghijklmnopqrstuvwxyz'
>'b' = 'bcdefghijklmnopqrstuvwxyza'
>'c' = 'cdefghijklmnopqrstuvwxyzab'
>             ...
>'z' = 'zabcdefghijklmnopqrstuvwxy'
>
>The frequency distribution of the letters will be absolutely flat, yet it
>does contain a message.  This shows that a flat distribution does not mean
>absence of a message (and also that flat distribution, by itself, does not
>guarantee a good cipher!)

Cute, but keep in mind that a "distribution" can be more than just a histogram
of letter frequencies; the example cipher would have a very lopsided
distibution in other measurements, for example digram frequencies.
(I realize the proposal was not meant as a good cipher, and I'm just 
adding to the discussion; this is NOT a flame, I just don't see where to
stick the :-) !)

[What?
[The
[followup
[has
[to
[be
[longer
[than
[the
[original?
[Than
[eat
[this!]-- 
Jack C. Morrison	Jet Propulsion Laboratory
(818)354-1463		jackm@jpl-devvax.jpl.nasa.gov
"The paycheck is part government property, but the opinions are all mine."