[sci.crypt] Design for a DES-breaker

jbn@glacier.STANFORD.EDU (John B. Nagle) (10/13/87)

     It's been 10 years since 1977, so let's rough out the design of a
brute-force DES cracker, designed to try all 2^56 possible keys for
a known-plaintext attack.  No cryptoanalytic cleverness here, just
current commercial electronics technology.

THE CHIP

     We will need a custom VLSI chip.  Our unit will contain a pipelined
DES encryptor able to perform one encryption of 64 bits into 64 bits using
a 56 bit key, registers to hold the known plaintext and cyphertext,
comparators to detect a match between the plaintext and the output of the
encryptor, and counter logic to advance the key on each unsuccesful try.
The unit halts when a comparison is succesful, indicating that the correct
key has been found.

     The chip thus explores a portion of the keyspace by itself, without
any need for communication with the outside world while running.  Only
during startup, testing, and checking for stops is communication necessary,
and the bandwidth required is quite modest.  So inter-unit communication 
is very low.  Let's use an 8-bit bus to connect the units.  Each unit
will have device registers, like an I/O device, and will be controlled
by writing values into these device registers and queried by reading
from them.

     Let's assume VLSI technology at the M68020 level; current technology,
air-cooled parts, 20Mhz clock.  Conservatively we can fit 16 units
on a chip; maybe we could do better, but let's assume 16.

     Let's package the chip in a 28-pin dual-inline package.  The wide 28-pin
package should have enough room for our chip, and it's a standard size
that can be fabricated cheaply; packaging can take place on standard
low-cost packaging lines used for RAM chips.

     We'll give the chip a pinout very similar to that of a standard
byte-wide static RAM.  Thus, we can treat it electrically like a memory
device, and address the device registers as memory.

     The chip will not generate interrupts, since it looks like a RAM.
But it will have a device register which indicates whether any unit
has stopped with a match.  This register must be polled periodically
to see if a solution has been found.

     So that's our chip.  Each chip can try 20Mhz x 16 keys per second,
or 320 x 10^6 keys per second.

THE BOARD

     Since our chip looks like a RAM chip, our board will look like a
memory board, a large array of densely packed sockets and some modest
interface electronics to the rest of the world.  Let's assume Eurocard
packaging, with 9U-high extra-deep cards, similar to those used in SUN
machines.  We can fit 128 chips on such a board, and still have room for
some interface electronics.  For the sake of convenience, let's make
our board look like a memory board on a VMEbus, using the 16-bit
VMEbus interface.  The board is a pure bus slave, and does not generate
interrupts; it must be polled to determine when a find has been made.

     So our board is about 15 inches square, and can try 128 x 320
x 10^6 keys a second, or 4 x 10^10 keys per second.  There are about
7.2 x 10^16 possible DES keys, so one board can break a key in about
2 x 10^6 seconds, or about 500 hours.  On the average, one would find
the key after searching half the keyspace, so 250 hours, or about 10
days, is a reasonable estimate for a one-board system.

THE SYSTEM

      A practical system needs to be a bit larger.  A 20-slot Eurocard
cage, with 19 slots of custom cards and some standard 68000-based VME
processor and memory card in the last slot for control, would be
a useful configuration.  Equipped with fans and power supplies of
suitable size, and a 3 1/2" disk drive to provide some storage for
the control machine, the entire system should fit in a 48-inch rack.
At least two async ports should be provided, one for a control terminal
to sit on top of the rack, and one for connection to a larger system.
Software would be quite modest; some simple M68000 operating system
without memory protection, such as M-DOS, or even CP/M 68K, would
be sufficient.  Alternatively, one could use a "PC-on a VME card"
board and run MS-DOS, which would simplify software development.

      One job of the software is to test all the units periodically.
Since each unit is independently addressable, and each can be started
on a test problem separately, the test system simply loads up each unit
with a known test problem which should result in a stop within a few
seconds, and checks back after an interval to see if the unit properly
found the solution.  Any unit which fails is reported and not used again.

      Failures during operation can be handled gracefully without difficulty.
During runs, the system should stop all units every ten minutes or so, save the
current key values they are working on, check them against what they were
loaded with, mark as dead any units with incorrect values, and run the
test sets, again marking any failed units as down.  Units still up
are reloaded with their key values and restarted.  The aborted
searches are rerun later on other units of the array.  Unit failure
is thus not a serious problem. 

CONCLUSION

     A DES-breaking machine of modest size and complexity is well within
the current state of the art.  While it would be necessary to develop 
one specialized VLSI chip, everything else is available off-the-shelf
in commercial-grade electronics.  The VLSI chip required is well within
the capabilities of many fabricators and would not require any exotic
technology.  Development costs are difficult to estimate, but should
comparable to those of developing a new and complex graphics board for
a commercial microcomputer.  Total manufacturing costs for a complete
system should not exceed $250K in quantities of 10.

					John Nagle

karn@faline.bellcore.com (Phil R. Karn) (10/15/87)

>      So that's our chip.  Each chip can try 20Mhz x 16 keys per second,
> or 320 x 10^6 keys per second.

I do not know how much chip real estate a DES engine takes, but John's
figures seem a tad optimistic when compared to existing technology.

Typical commercial DES chips, e.g., the AMD 9518/Z8068, do NOT pipeline
their calculations.  The 9518 has one set of S and P boxes and iterates
16 times. Including overhead, it takes a total of 18 clock cycles to do
one complete encryption or decryption.  Key loading isn't particularly
fast either. The designers probably considered a low pin count to be
more important than making the chip especially useful for key cracking.

Pipelining the DES engine would clearly increase its area at least 16
times, since not only do you need 16 copies of the S and P boxes but
also 16 key registers. You also need a way to extract the proper subkey
in parallel at each of the 16 stages.  John further hypothesizes that
each chip would contain sixteen of these complete DES engines, for a
total complexity increase of at least 256x over existing DES chips.  I
don't know if the RAM comparison is valid, at least inside the chip,
since a DES engine is much less regular than a RAM. True, the 9518 is a
few years old, implemented in N-MOS technology, and not specifically
designed for key cracking. Still, I don't know if a complexity increase
of 256x combined with a speed increase of 5x is reasonable in 1987 CMOS.
Perhaps somebody can comment on this.

Not that any of this really takes away from the John's valid argument
that the DES key is too small. An extra factor of x16 or even x256
really isn't all that much protection given inexorable improvements in
technology.  Cracking DES through brute force is still out of range of a
plug-in card for your PC, but certainly not out of the capability of the
NSA with custom VLSI and much more computer room floor space than John
allowed for his machine.

Phil

devine@vianet.UUCP (Bob Devine) (10/16/87)

In article <17195@glacier.STANFORD.EDU>, jbn@glacier.STANFORD.EDU (John B. Nagle) writes:
>      It's been 10 years since 1977, so let's rough out the design of a
> brute-force DES cracker, designed to try all 2^56 possible keys for
> a known-plaintext attack.  No cryptoanalytic cleverness here, just
> current commercial electronics technology.
> 
> Each chip can try 20Mhz x 16 keys per second, or 320 x 10^6 keys per second.
> [...]
> our board can try 128 x 320 x 10^6 keys a second, or 4 x 10^10 keys per
> second.  There are about > 7.2 x 10^16 possible DES keys, so one board
> can break a key in about 2 x 10^6 seconds, or about 500 hours. (10 days)
> [...]
> Total manufacturing costs for a system should not exceed $250K in
> quantities of 10.

  I'm no EE, but, I doubt that you could come up with a chip that can
calculate a new DES value per clock-cycle even with a lot of pipelining.
And putting 16 processing units on a single chip sounds to me like it
will be Rev Z before all bugs are worked out.  Yes it might be possible
but not for the price you gave.  [But then, that's why we give the CIA
billions of bucks every year...]

  An off-the-shelf approach would be to take an existing DES chip (our
parent company has a 6 MHz chip for under $20) and gang them on a board
with some controller and fast memory.  Using a large number, say 1K,
of processors for a system, it looks like the key search would take:

1. each chip calculates and compares = 7.2 * 10^16 keys / 1K chips
                                     ~=  5 * 10^13 keys / chip
2. assume "average" number of searches needed ~= 2 * 10^13 keys/chip
3. each calc/comp iteration takes (reasonable guess?) 100 microseconds
   [if someone wants, I can get the exact figure]

4. Time until average solution = 2 * 10^13  *  10^-4 seconds
                               = 2 * 10^9 seconds
                               ~= 60 years

  So, unless someone sees a problem with my calculations, a DES cracker
would need about 1000 such systems to bring the time down into the
range of months.  I'd guess that the cost per system is approximately
$20K for the DES chips plus $30K for the "glue" chips for $50K per box.

  To speed this up an obvious path to choose is a faster DES chip. 
There may be some on the market now it's just that I had the press
release for the WD20C03 chip in my file.

Bob Devine
[ BTW, 7.2 * 10^16 == 72 quadrillion ]

john@hpcvlo.HP.COM (John Eaton) (10/16/87)

<<<<
<     It's been 10 years since 1977, so let's rough out the design of a
< brute-force DES cracker, designed to try all 2^56 possible keys for
< a known-plaintext attack.  No cryptoanalytic cleverness here, just
< current commercial electronics technology.
----------

Ok, so its possible to break a DES encrypted message for a reasonable amount
of money in real time. Your analysis assumes that the message was only
encrypted once with a single key and can be detected once it has been broken.
The classic NSA response to machines like yours is that if your that paranoid
then encrypt it a second time with a different key. You can also do things
like compress the text before shifting or rotating the bits in each byte.
These may not be very effective by themselves but they greatly increase the
complexity of your detection circuits that you must build thousands of.


John Eaton
!hplabs!hp-pcgh egh

baum@apple.UUCP (Allen J. Baum) (10/17/87)

--------
[]
>I do not know how much chip real estate a DES engine takes, but John's
>figures seem a tad optimistic when compared to existing technology.
>

>Pipelining the DES engine would clearly increase its area at least 16
>times, since not only do you need 16 copies of the S and P boxes but
>also 16 key registers...

I'm not sure that this is true. What happens at each stage is a permutation
and an xor. If each stage doesn't have to share the permutation logic, then
its simply wires, not lots of wires + a 16 way mux

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

jim@randvax.UUCP (Jim Gillogly) (10/17/87)

In response to John Nagle's DES-cracking proposal Phil R. Karn writes:
>>      So that's our chip.  Each chip can try 20Mhz x 16 keys per second,
>> or 320 x 10^6 keys per second.
>
>Typical commercial DES chips, e.g., the AMD 9518/Z8068, do NOT pipeline
>their calculations.  The 9518 has one set of S and P boxes and iterates
>16 times. Including overhead, it takes a total of 18 clock cycles to do
>one complete encryption or decryption.
> ...
>Not that any of this really takes away from the John's valid argument
>that the DES key is too small.

Not being a hardware type, I'm not sure about the argument that a 20 MHz
clock means 20M decryptions per second, so I'll fall back on the famous
USENET "argument by appeal to authority".

In Crypto 84 (more formally, Advances in Cryptology: Proceedings of
Crypto 84, Ed. Blakley & Chaum, Springer-Verlag 1985) some Belgian scientists
(F.  Hoornaert, J.  Goubert and Y.  Desmedt) came up with a proposed
hardware architecture for fast DES operation, including pipelining.  They
assumed a 20 MHz clock rate and gave the following timing for a single
encryption:

	48 (=16 iterations) + 2 (pipeline delay) + 3 (I/O) = 53 cycles.

They expected throughput in ordinary DES operation of about 20 Mbit/sec.
They included a description of a configuration of the proposed chip that
would deal with about 1.13 x 10^6 keys/sec in a brute force key hacking
mode.  They proposed a system that would brute force a key in a
week or two, within an order of magnitude of John Nagle's proposal.  I
don't remember the costs that they estimated, but I think they expected
the chips themselves to be about $40 each in quantity.  The article is
"Efficient hardware implementation of DES; also design for exhaustive key
search machine", pp 147 ff.

Of course, it would be even more satisfying if one of us could find a
healthy key-folding algorithm and crack it elegantly...
-- 
	Jim Gillogly
	{hplabs, ihnp4}!sdcrdcf!randvax!jim
	jim@rand-unix.arpa

karn@faline.bellcore.com (Phil R. Karn) (10/18/87)

> I'm not sure that this is true. What happens at each stage is a permutation
> and an xor. If each stage doesn't have to share the permutation logic, then
> its simply wires, not lots of wires + a 16 way mux

Each round in DES involves an expansion, exclusive-OR, a ROM table
lookup (the S-boxes) followed by a permutation (P-boxes) and then
another exclusive-OR.  Except for the Initial Permutation (IP) and
Reverse Initial Permutation (IP-1) which you could factor out of the
algorithm and do once externally, *all* of the work in encrypting and
decrypting has to be repeated at each round.  You also need to generate
different sub-keys for each round. So I think an estimate of 16x chip
area for a pipelined DES chip over a non pipelined version is reasonable.

While wires may be "free" in terms of propagation time, they are not at all
free in terms of chip area. Complicated wire crossings like the P boxes
in DES may take up a considerable amount of real estate, making the chip
much less dense than a regular device like a RAM.

Phil

jim@randvax.UUCP (Jim Gillogly) (10/19/87)

In article <1483@faline.bellcore.com> karn@faline.bellcore.com (Phil R. Karn) writes:
>Each round in DES involves an expansion, exclusive-OR, a ROM table
>lookup (the S-boxes) followed by a permutation (P-boxes) and then
>another exclusive-OR.

If it helps, the S-box step can be done with Boolean operators: each of
the 4 output bits of each S box can be represented as a Boolean function
of the 6 input bits.  Some of them are simpler than one might expect, as
has been pointed out in the literature.  I suspect this would be faster in
hardware than a ROM lookup, but it would presumably take more chip area
and reinforce the bottom line of Phil's argument.
-- 
	Jim Gillogly
	{hplabs, ihnp4}!sdcrdcf!randvax!jim
	jim@rand-unix.arpa