[comp.lang.c] New

alanf@bruce.OZ (Alan Grant Finlay) (03/06/90)

I have written an encription system for personal files on an IBM-PC compatible.
The algorithm is designed to be fast enough for small files and totally secure
if a key is chosen properly.  I mean it should be secure against known plain
text attack but not chosen plain text attack where large numbers of attempts are
permitted.  Included in this article are a source program for the system in C,
a user manual, and a formal description of the algorithm.  This is my first 
non trivial C program so I welcome any constructive criticism.  I would also 
like to know if the formal description is implemented correctly in C.  I am
an amateur at cyphers so I am not certain that my claims about security are
correct.  I would appreciate any comments by those more knowledgeable about
the security of this system.  An archive with all relevant files including
executables will be posted soon in comp.binaries.ibm.pc, included is a challenge
file which is an encription of the source file using a key known only to me.
If anyone would like to attempt a chosen plaintext attack I am willing to use
this same key to encrypt a limited number of files.  I am not sure if I need to
supply a secret file encrypted using this key to make the challenge more 
meaningful.  The program is currently called CRYPT 2.0, I apologise for using a
name that has probably already been used but it's a bit late to change now.

----------------------------------------------------------------------------
                         User manual for Crypt
                         ---------------------

The source code "crypt.c" can be used to produce two  executable  files  which
are  assumed  to be called "encrypt" and "decrypt" by the program itself.  The
source is in Turbo C (Trademark) and can be  compiled  with  the  tiny  memory
model.   Since  use  is  made  of system dependent functions for obtaining the
input file size for encrypt, the program is not portable as is.

To use encrypt and decrypt simply type the command name followed by the source
and  destination  file  names.  You will be prompted for a password and status
information is produced.  The password is  echoed  since  for  the  length  of
password  intended it would be too easy to mess it up.  The password should be
a long sequence of characters such as a sentence with some  strange  words  or
characters  in  it.  The maximum length is the same as the maximum block array
width (100 as supplied).  The password is scrolled off the  screen  after  the
line is terminated.

The algorithm used was inspired by the Rubik's cube.  The core algorithm works
as follows:

   First a fixed block size is determined which is the product of two numbers
   "wide" and "high".  The data is processed in a two dimensional array with
   these dimensions.  If the last block cannot be entirely filled it is
   padded with junk characters.  The password is used to determine a key on
   a column per character basis.  The key value for each column (modulo the
   height) is the amount this column is shifted down.  The part which emerges
   from the bottom is inserted into the top (i.e. a cyclic shift).  The rows
   are similarly shifted but using the values 1,2,3,...,etc.  This two stage
   process is repeated a fixed number of times (10+10) times in the original
   system).  The result is a permutation of the array determined by the
   password alone.

There is one major problem with  this  simple  scheme.   The  use  of  a  pure
permutation  is  compromised by the need to pad the last block.  Especially if
the last block is mostly padding.  It is very difficult to choose padding that
cannot  be  distinguished  from the intentional contents without revealing too
much.  The upshot of all this is that the scheme is  modified  slightly.   For
the first one half of the passes, after the usual column and row shifts, every
second row is replaced by the exclusive-or of  itself  and  the  previous  row
character  by  character.   This  means the shuffling process is now more than
just a pure permutation but alters values according  to  the  history  of  the
shuffle.   As  a  precaution  the rest of the passes are pure permutations.  A
crude  analogy  would  be  shuffling  cards  with  the  printing  still   wet.
Fortunately  the  process  is  reversible when done digitally.  In addition to
this precaution the block size is usually chosen so that the last block is  at
most  a  quarter  padding  (the rare exceptions are reported when they occur).
Finally the padding is  chosen  by  randomly  selecting  bytes  from  the  old
contents of the array (usually bytes from this or a previous pass).

If the input file is smaller than 200 bytes it is rejected.  This may seem  to
be  over cautious but remember that if a password is discovered by cracking an
easy case, every file encoded with that same password  is  available.   It  is
intended  that  users  will  not use many distinct passwords since they should
never be written down and need to be long.  The security of the system is  due
more  to  the choice of password and particularly its length.  Passwords which
are  too  short  are  rejected.   Naturally  the  password  in  not   recorded
permanently  anywhere  by the encryption program.  The easiest way to choose a
good password is to use a whole sentence (spaces and punctuation are  allowed)
with  some  strange  characters  in  it.  It is relatively easy to try out all
short sentences with words from a standard dictionary.  Brute  force  cracking
of  an  encrypted  file  with  n blocks of block size b, a password of p units
chosen from a set of q values and with s shuffles per block requires O(s * b *
(q^p)) operations.  Obviously increasing q and/or p pays great dividends.

For a pure permutation it would also be possible to try all b!  rearrangements
of  a  block.   This  is ruled out by the minimum block size being 100.  It is
clear that increasing s is not an effective way of  improving  security.   The
time  taken by the algorithm is O(s * b * n).  Since (b*n) cannot be changed s
should be chosen as small as is considered safe.  The chosen value is as small
as  seems  secure  but  is  not  chosen  for any real theoretically determined
reason.  Note that s is a compile time constant.

When choosing block dimensions adequate width is more important than  adequate
height.   If  the  width  is  shorter  than  the password then the tail of the
password is not used.  The minimum block dimensions are approximately 24  *  8
so  that  the  algorithm  has  enough  height  to minimise the chance of short
cycles.  Every shuffle is identical and could contain a small number of  short
cycles  (i.e. a small set of positions whose contents always remain within the
set).  Experiments seem to indicate this is not likely with  reasonably  sized
blocks  (i.e.  >200  bytes).   No  theoretical analysis of this issue has been
attempted.  Note however that the row shifts are chosen as 1..n for  a  column
or row of n+1 positions hence there is never a zero row shift.

So far the algorithm's strength depends critically upon  the  variety  of  the
input.   As  described  so for if you input a file with just space characters,
that is mostly what you will get back.  This applys on a block by block basis.
To  avoid  this  problem  the  algorithm  is  given  one final twist, a simple
substitution is performed before the first shuffle.  This substitution  should
suffice to eliminate most simple patterns.

Some final operational advice.  Confidential documents should be created using
a  ram  disk or a floppy which will be destroyed.  Note that deleting and even
overwriting magnetic media does not prevent the data's  recovery.   Watch  out
for  editors  which keep temporary copies of files which they are working upon
in places of their own choice.  A  deleted  temporary  file  is  often  easily
recovered and you may not even be aware of its existence.

-------------------------------------------------------------------------
/***********************************************************************/
/* (c) Copyright 1989, 1990, Alan Finlay.         Beta version 2.0 .    */
/* This program is intended for Public Domain, and may not be sold or    */
/* marketed in any form without the permission and written consent of     */
/* the author.  I retain all copyrights to this program, in either the    */
/* original or modified forms, and no violation, deletion, or change of   */
/* the copyright notice is allowed.  Every effort has been made to ensure */
/* this product provides a secure encryption system.  Mistakes can be     */
/* made however both in the theoretical basis and implementation of such  */
/* a system.  See the associated documentation for a discussion of known  */
/* limitations.  The source code has been provided so you can inspect it  */
/* and use it at your own risk.  I will accept no responsibility for any  */
/* loss, damage, or compromised data caused directly or indirectly by     */
/* this program.  It is released on an "as is" basis with no warranty    */
/* as to its being fit or suitable for encrypting any form of data.     */
/***********************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <dos.h>
#include <time.h>
#include <sys\stat.h>

#define ENCRYPT 1        /* Choose encrypt (1) or decrypt (0) */
#define DEBUG 0          /* Show page after each shuffle if non zero */
#define TRUE -1
#define FALSE 0
#define LIMIT 100
#define SECURE 10

typedef unsigned char line[LIMIT];

char copyright[40] = "(c) copyright 1989,1990, Alan Finlay";

unpat(page,wide,high) /* Simple substitution to avoid patterns */
   line page[LIMIT];  /* [width,height] */
   int wide,high;
   {
   int i,j,k;
   k = 0;
   for (i=0;i<wide;i++) for (j=0;j<high;j++) {
      k = (k+7)%128;
      page[i][j] = page[i][j] ^ k;
      }
   }

#if ENCRYPT
shuffle(page1,code,wide,high)
   line page1[LIMIT];  /* [width,height] */
   line code;
   int wide,high;
   {
   int i,j,k,key,shift;
   line *mix1,*mix2;
   line *oldline,page2[LIMIT];  /* [height,width] */
   for (k=0;k<(SECURE*2);k++) { /* See mixup below (2 security phases) */
#if DEBUG
      show(page1,wide,high);
#endif
      /* Shift columns */
      for (i=0;i<wide;i++) {
         oldline = page1[i];
         key = (int) code[i];
         for (j=0;j<high;j++) page2[j][i] =(*oldline)[(j+key)%high];
         }
      /* Mixup */
      if (k<SECURE) {
	 for (j=1;j<high;j+=2) {
            mix1 = page2[j-1];
            mix2 = page2[j];
	    for (i=0;i<wide;i++) (*mix2)[i] = (*mix2)[i]^(*mix1)[i];
            }
         }
      /* Shift rows */
      for (j=0;j<high;j++) {
         oldline = page2[j];
         shift = (j%(wide-1))+1;
         for (i=0;i<wide;i++) page1[i][j] = (*oldline)[(i+shift)%wide];
         }
      }
   show(page1,wide,high);  /* For text, user can check for blunders */
   }

#else
unshuffle(page1,code,wide,high)
   line page1[LIMIT];  /* [width,height] */
   line code;
   int wide,high;
   {
   int i,j,k,key,shift;
   line *mix1,*mix2;
   line *newline,page2[LIMIT];  /* [height,width] */
   for (k=0;k<(SECURE*2);k++) {  /* Check this times 2 as in shuffle */
#if DEBUG
      show(page1,wide,high);
#endif
      /* Shift rows back */
      for (j=0;j<high;j++) {
	 newline = page2[j];
         shift = wide-(j%(wide-1))-1;
         for (i=0;i<wide;i++) (*newline)[i] = page1[(i+shift)%wide][j];
         }
      /* Reverse mixup */
      if (k>=SECURE) {    /* Note matching code in shuffle */
	 for (j=1;j<high;j+=2) {
            mix1 = page2[j-1];
            mix2 = page2[j];
	    for (i=0;i<wide;i++) (*mix2)[i] = (*mix2)[i]^(*mix1)[i];
            }
         }
      /* Shift columns back */
      for (i=0;i<wide;i++) {
         newline = page1[i];
         key = (int) code[i];
         for (j=0;j<high;j++) (*newline)[(j+key)%high] = page2[j][i];
         }
      }
   }
#endif

show(page,wide,high)
   line page[LIMIT];
   int wide,high;
   {
   int i,j;
   puts("\n");
   for (j=0;j<high;j++) {
      putc('\n',stdout);
      for (i=0;i<wide;i++) {
         if (page[i][j]<30) putc('*',stdout);
         else putc(page[i][j],stdout);
         }
      }
   }

main (argc,argv)
   int argc;
   char *argv[];
{
   FILE *infile,*outfile;
   int wide,high,i,j,k;    /* Block width and height, loop counters */
   int blkn = 1;           /* Block counter */
   int clen;               /* Password code length */
   int ch = 0;
   int invers;             /* Version of input file for decrypt */
   int vers = 2;           /* Version of this program */
   line page[LIMIT],code;
#if ENCRYPT
   int chrcnt;       /* Character counter */
   long fsize;       /* Input file size */
   int blocksize,nblocks;
   struct time t;  /* For system time */
   struct stat st; /* For input file stats */
   /* Randomise the rand() function */
   gettime(&t);
   srand(t.ti_min*400+t.ti_sec*100+t.ti_hund); /* random seed <30000 */
   /* Check the input arguments */
   if (argc!=3) {puts("\nUsage is: encrypt src dst\n"); exit(1);}
#else
   int blkcnt;
   /* Check the input arguments */
   if (argc!=3) {puts("\nUsage is: decrypt src dst\n"); exit(1);}
#endif
   if ((infile = fopen(argv[1],"rb")) == NULL) {
      printf("\n%s",argv[1]); perror(" "); exit(1);}
   if ((outfile = fopen(argv[2],"wb")) == NULL) {
      printf("\n%s",argv[2]); perror(" "); exit(1);}
#if ENCRYPT
   /* Get input file size */
   if (stat(argv[1],&st)!=0) {perror(" "); exit(1);}
   fsize = st.st_size;
   printf("The input file size is %ld\n",fsize);
   /* Choose block size accordingly */
   if (fsize<200) {puts("Input file is too small"); exit(1);}
   if (fsize<(LIMIT*LIMIT)) blocksize = (int) fsize;
   else {
      nblocks = (int) (fsize/(LIMIT*LIMIT)+1);
      blocksize = (int) (fsize/nblocks+1);
      }
   wide = ((int) sqrt((double)blocksize))+10; if (wide>LIMIT) wide = LIMIT;
   high = blocksize/wide+1; if (high>LIMIT) high = LIMIT;
   while (1) {
      blocksize = wide*high;
      if (fsize<(long) blocksize) break;
      else {
         /* Multiple blocks, check for last block too small */
         if (((fsize-1)%blocksize)>(blocksize*3/4)) break;
         /* (fsize-1) is used above so perfect fit is accepted! */
         high--; wide--; /* Try a smaller block */
         }
      if (wide<50) {
         puts("It was necessary to use a short last block");
         puts("To correct this change the input file size");
         puts("Otherwise there is a minor security compromise");
         break;
         }
      }
   printf("The width and height are (%d,%d)\n",wide,high);
   printf("The last block is %ld bytes\n",((fsize-1)%blocksize)+1);
   fprintf(outfile,"%d,%d,%d,",vers,wide,high);
#else
   fscanf(infile,"%d,%d,%d,",&invers,&wide,&high);
   if (invers!=vers) {puts("Input file is version incompatible");exit(1);}
#endif
   /* Get password */
   while(1) {
      puts("\nPlease enter your password");
      gets(code);
      clen = strlen(code);
      if (clen>LIMIT) {puts("Password too long - abort"); exit(1);}
      if (clen>9) break;
      puts("Insecure password, try a longer one.");
      puts("For security do not use a name or word in any dictionary.");
      puts("For example use something like \"Dazed and Konfuzed\"");
      }
   for (i=0;i<25;i++) puts(" ");   /* Clear the screen */
   if (clen>wide) puts("Warning: tail of password ignored");
   /* Extend password to possible limit */
   for (i=clen;i<LIMIT;i++) code[i] = code[i%clen] ^ '\152';
   do { /* crypt a page */
#if ENCRYPT
      chrcnt = 0;
#else
      if (fscanf(infile,"%d,",&blkcnt)==EOF) goto NOBLOCK;
#endif
      for (j=0;j<high;j++) {
         for (i=0;i<wide;i++) {
            if ((ch = getc(infile)) != EOF) {
               page[i][j] = ch;
#if ENCRYPT
               chrcnt++;}
            else if (i==0 && j==0) goto NOBLOCK; /* EOF at start of block! */
            /* Pad the last block with existing junk */
            else page[i][j] = page[rand()%wide][rand()%high];
#else
               ;}
            else {puts("Error: unexpected end of file"); goto NOBLOCK;}
#endif
            }
         }
#if ENCRYPT
      fprintf(outfile,"%d,",chrcnt);
      unpat(page,wide,high);
      shuffle(page,code,wide,high);
      for (j=0;j<high;j++) for (i=0;i<wide;i++) putc(page[i][j],outfile);
#else
      unshuffle(page,code,wide,high);
      unpat(page,wide,high);
      for (j=0;j<high;j++) for (i=0;i<wide;i++)
         if ((j*wide+i)<blkcnt) putc(page[i][j],outfile);
#endif
      printf("\nFinished block number %d\n",blkn++);
      }
   while (ch != EOF);
NOBLOCK:                  /* Jump here to avoid writing an empty block */
   fclose(infile);
   fclose(outfile);
}
----------------------------------------------------------------------------

                   Formal description of the algorithm
                   -----------------------------------
A block size is determined according to the size of the input file.
For small files there will be only one block.  Each block has two dimensions -
wide and high.  The last block usually needs some padding which is chosen from
whatever happens to be in the array already.  The junk is selected randomly 
though the random number generator has only about 30,000 possible seeds.  
This is probably much ado about nothing, the padding could probably be just
null characters without any loss in security.  The password or key is a string
supplied by the user, this is extended to the maximum length (100) by 
repetition however the repeated strings are modified by exclusive-or '\152'
to help extract the maximum information from the supplied string (since later
only the remainder mod high will be utilised).

For each block, let us call it page[x,y] where 0<=x<wide and 0<=y<high.
-----------------------------------------------------------------------
	1) Perform a simple substitution:
           page[x,y] := page[x,y] XOR ((x*high+y)*7 mod 128)

	2) Do 10 special shuffles: (each line is for all x and y concurrently)
           page[x,y] := page[x,(y+key[x]) mod high]
	   if y is odd then page[x,y] := page[x,y-1] XOR page[x,y]
           page[x,y] := page[(x+(y mod (wide-1))+1) mod wide,y]

	3) Do 10 normal shuffles: (each line is for all x and y concurrently)
           page[x,y] := page[x,(y+key[x]) mod high]
           page[x,y] := page[(x+(y mod (wide-1))+1) mod wide,y]

Each step in the process is easily reversible.
-------------------------------------------------------------------------

The encrypted file format is:
	1) three 16 bit integers: program version number (= 2), wide, high.
        3) for each block: 
		one 16 bit integer = number of chars in block (non padding).
		the block of bytes:
			for j:=0..high-1
			   for i:=0..wide-1
			      page[i,j]

----------------------------<end>-------------------------------------

hal@gateway.mitre.org (Hal Feinstein) (03/07/90)

This is kind of a silly question.  I recently sat through a 
college level course on telecommunications security that
had a lot of elementary code breaking stuff in it.
Of all the subjects to get hung up on, someone in the class
asked if in counting 2-grams (bigrams) you shoud count every pair
of letters or to tally each pair by sliding a pointer down one
character at a time and taking the character pointed to and the one
to its immediate right as a pair. Well, the class divided.
The heck with traffic analysis and clandestine intercept, this was
really important!  Since then, blackboards have been filled with
math and the college halls filled with people shouting themselves
hoarse over this, the great bigram affair. Really!

Now, I put it to you in networld: which is the "correct" way to
count bigrams? 
 

cme@zircon.sw.stratus.com (Carl Ellison) (03/08/90)

In article <100295@linus.UUCP> hal@gateway.mitre.org (Hal Feinstein) writes:
>Of all the subjects to get hung up on, someone in the class
>asked if in counting 2-grams (bigrams) you shoud count every pair
>of letters or to tally each pair by sliding a pointer down one
>character at a time and taking the character pointed to and the one
>to its immediate right as a pair. [...]
>
>Now, I put it to you in networld: which is the "correct" way to
>count bigrams? 
> 


The answer is:  BOTH

If the resulting statistics are not identical (by standard statistical
tests), then you've discovered that your source text is naot stationary
-- a very interesting result which should be investigated.

--Carl

michael@xanadu.com (Michael McClary) (03/10/90)

In article <1877@bruce.OZ> alanf@bruce.OZ (Alan Grant Finlay) writes:
>I have written an encription system for personal files on an IBM-PC compatible.
>[]  Included in this article are a source program for the system in C,
...
>The source code "crypt.c"

You should be aware that "crypt" is used on unix systems to refer to
a particular utility (a one-rotor Enigma variant) and a particular
subroutine (the DES-variant used for password encryption).

Naming your source file "crypt.c" invites the same sort of confusion as
naming a home-brew editor "vi", "ed", or "emacs", and may tend to bother
unix hax.  You might want to find another name.

michael@xanadu.com (Michael McClary) (03/10/90)

In article <100295@linus.UUCP> hal@gateway.mitre.org (Hal Feinstein) asks
the "correct" way to count bigrams:
 - Each odd character, followed by its successor.
 - Each character, followed by its successor.

If what you're looking for is bigram frequency, to check for caeser
cypers and the like, the second is the way to go.  Consider that the
first approach will miss every instance of "th" that starts on an
even-numbered position.

In cyphers where each character is treated separately and similarly,
you can expect the two methods to produce similar results, but the
first to require twice as much cyphertext to give you an equivalent
amount of information.

If the cypher is chunking the data (with an even number of characters
per chunk), or otherwise treating odd and even cyphertext characters
differently, the first method might give you more useful information.

hal@gateway.mitre.org (Hal Feinstein) (03/13/90)

I'd like to thank all those who responded with the "correct" way to
count bigrams. Interestingly, the standard reference work
on this subject, Kullback's Statistical Methods in Cryptanalysis 
is silent on the subject. Here's a summary of what people said
on the subject:

The debate in our telecommunications security class was over
how to do the 2-gram correlation technique for solving 
monoalphabet ciphers suggested in Konheim p. 79 (Cryptography,
A Primer). At that point we had not considered  more complex
ciphers  but got stuck (for a while) trying to figure this out.
At least one reply said it was obvious. Well, some of us are
still learning.  I can't say it obvious right now.   

To paraphrase what most people said.  (1)If the symbols in the
underlying alphabet are single letters (simple transposition for
example) then sliding one character at a time is OK.  
(2) If the symbols in the underlying
alphabet are blocks of characters, for example two characters 
per symbol, then slide by the block length.  This makes sense
since forming bigrams by sliding one character a time across
a bigram substitution cipher fractures the bigrams, 
probabily wrecking the accuracy of the tally.

The problem is what to do when confronted with a unknown
cryptogram?  How should you do the tally?  One reply stated that
if a null had been inserted as the first character and you counted
on even boundaries, you would get junk. So best to take both types
of counts:  One starting on the odd boundary and the other on an 
even boundary. 

Another said: just discover how your bigram table was counted and
make sure you count in the same way. Hmmm...

One respondent advised using Markov analysis to discern hidden 
processes as a way to discover the underlying cipher. OK, seems 
reasonable.  Another reply suggested that the ciphertext is really
generated by a non-stationary statistical source which  explains why
decimated bigram averages would differ.   

cooper@cadsys.enet.dec.com (03/14/90)

This is in regards to whether to count discrete or overlapping digrams.

I'm surprised that no one seems to have pointed out the "answer" to
this, since it is
part of standard statistics.

It depends on the nature of the tests which you are performing on them. 
Overlapping
digrams give you twice as much data, but the frequencies obtained are
not independent.
For example, it is not valid to use a chi-square test on overlapping
digrams to determine
if the digrams deviate significantly from chance distribution, since the
chi-square
test assumes that the frequencies are independent.  On the other hand it
is perfectly
valid to estimate the probability that an X will be followed by a Y on
the basis of
overlapping digrams.

In practice, except for a few hard-and-fast yes/no automatic procedures,
in cryptographic
work the increase in information from using overlapping digrams (and
trigrams, etc.)
will generally overshadow the slight loss in pure validity, since the
information
collected is likely to be leavened with a large dose of intuitive,
trial-and-error
adjustment by the cryptoanalyist in later phases of the work.

						Topher Cooper