[sci.nanotech] The Electrostatic Potential of the HIV Genome

stein@pixelpump.osc.edu (Rick 'Transputer' Stein) (06/29/89)

I've been investigating the computational complexity of evaluating the
electrical potential around a protein molecule in solution.  It seems
that the Poisson-Boltzmann equation describes this system:

div( e(x) grad(phi(x)) ) - k(x)^2 sinh(phi(x)) + 4 pi rho(x) = 0

where x is a vector (x,y,z),
e(x) is the dielectric function,
phi(x) is the electrostatic potential,
k(x) is the modified Debye-Huckel parameter,
and rho(x) is the charge density.

The PB equation is one of your gnarlier non-linear PDEs!

Now one typically solves this kind of equation via finite differences
over some cubic volume of space.  So, I guesstimated that the HIV genome,
which consists of ~ 4000 bases (or ~ 50K atoms), would fit into a box of,
say, 400 angstroms on a side.  When you create a grid over the volume,
you want about 4 points/angstrom in the x, y, and z directions.  That's
1600 points per side, so you have 1600^3 ~ 4.1e9 pts (call it 2^32) at
which you must evaluate the field.

To store this quantity of information:

phi(x) needs 2^32 pts @ 24 bytes/pt (double precision) = 98.3 Gbytes
rho(x), e(x), and k(x) @ 8 bytes each * 3= 98.3 Gbytes

So, about 200 Gbytes of physical memory just for the data.
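A quick Python back-of-the-envelope, taking the 400-angstrom box, 4
pts/angstrom, and the byte counts above as given, confirms the storage
figures:

```python
# Storage check for the grid described above.
pts_per_side = 400 * 4                 # 400 angstrom box, 4 pts/angstrom
npts = pts_per_side ** 3               # 1600^3 = 4.096e9, roughly 2^32
phi_gbytes = npts * 24 / 1e9           # phi at 24 bytes/pt, as quoted
aux_gbytes = npts * 3 * 8 / 1e9        # rho, e, k at 8 bytes each
total_gbytes = phi_gbytes + aux_gbytes
print(npts, phi_gbytes, aux_gbytes, total_gbytes)  # ~4.1e9 pts, ~197 Gbytes
```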

I have an iterative matrix solver (Gauss-Seidel with overrelaxation),
and I know that for a 2D grid, the number of iterations required goes
like 3 times the grid size (as a general rule, to obtain convergence to
a 1e-6 error tolerance).  Since we are now in 3D, take the number of
iterations to go like 3 times the square of the grid size.  On a
transputer, it takes about 120 microsecs to evaluate one pt.  So, I have
the following formula for guesstimating the convergence time (for a
single cpu):

T(days) ~ K * 120 usec * npts * 3 * gridsize^2

where K = 1.15e-5 days/sec (~1/86400), npts = gridsize^3 ~ 4.1e9, and
gridsize = 1600.

This works out to be: 43.3e6 days! (assuming a single transputer could
address this much memory, which it can't, but who cares, it's the
principle of the thing).
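For concreteness, here's a toy sketch of the SOR sweep in Python -- my
sketch, not code from this solver: constant dielectric, no sinh term
(i.e., the plain Poisson limit), and a 16^3 grid instead of 1600^3.  The
full PB update would fold the sinh(phi) and e(x) terms into the same
pointwise stencil, so the iteration structure (and the cost model) is
unchanged:

```python
import numpy as np

# Gauss-Seidel with overrelaxation (SOR) for laplacian(phi) = -4 pi rho
# on a tiny cubic grid with phi = 0 on the boundary.
n = 16                      # grid points per side (vs. 1600 for the HIV box)
h = 1.0                     # grid spacing
omega = 1.7                 # overrelaxation factor, 1 < omega < 2
phi = np.zeros((n, n, n))
rho = np.zeros((n, n, n))
rho[n // 2, n // 2, n // 2] = 1.0   # a single point charge in the middle

tol, max_iter = 1e-6, 5000
for it in range(max_iter):
    max_change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                # 7-point stencil: average of 6 neighbors plus source term
                new = (phi[i-1, j, k] + phi[i+1, j, k] +
                       phi[i, j-1, k] + phi[i, j+1, k] +
                       phi[i, j, k-1] + phi[i, j, k+1] +
                       4.0 * np.pi * h * h * rho[i, j, k]) / 6.0
                change = omega * (new - phi[i, j, k])
                phi[i, j, k] += change
                max_change = max(max_change, abs(change))
    if max_change < tol:    # converged to the 1e-6 tolerance
        break
```

Each pass touches every grid point once, which is exactly the per-point
cost the 120 usec figure prices out.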

Now, if you build a multicomputer with 64K nodes, assume linear speedup
by partitioning the dataset among the 64K nodes, and the communications
are all nearest neighbor (which they are in this case), then the time
to compute the potential surface is reduced to about 662 days.
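Plugging the numbers into the formula (with the 3 * gridsize^2
iteration rule taken as given):

```python
# FD timing estimate: serial, then 64K-way parallel.
K = 1.15e-5                 # days per second
t_pt = 120e-6               # seconds per point update on one T800
gridsize = 1600
npts = gridsize ** 3
iters = 3 * gridsize ** 2   # the 3 * gridsize^2 convergence rule
serial_days = K * t_pt * npts * iters
parallel_days = serial_days / (64 * 1024)
print(serial_days, parallel_days)   # ~43e6 days serial, ~662 days on 64K nodes
```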

The cost of a node, with 4 Mbytes of memory and a T800 transputer, is
~ $1K.  At 64K nodes that's ~ $65M just for iron; chuck in wire and
cooling fans plus about 1 Terabyte of disk here and there, and you've
got roughly $70M for the machine.

The problem is that reliability will get you.  1000 T800s in a box
will experience a failure about once every 5 years.  So 64K nodes will
experience a failure about once every four weeks (roughly 40 times too
often, though I'll add that very few chips can match this kind of
reliability standard).
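Scaling the quoted failure rate (one failure per 5 years per 1000
nodes, taken at face value) comes out nearer four weeks than three:

```python
# Reliability scaling from the quoted 1000-node figure.
node_mtbf_years = 1000 * 5.0            # implied per-node MTBF
nodes = 64 * 1024
machine_mtbf_days = node_mtbf_years / nodes * 365.0
print(machine_mtbf_days)                # ~28 days, a failure about monthly
```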

So, any alternatives?  Yes, if I were smart enough to figure out how
to use Greengard's algorithm (an O(n) solution, instead of the O(n^4) from
FD), then I could solve the problem on a single transputer in (43.3e6)^.25,
or about 81 days.  A 256-node multicomputer could do this problem in
about 8 hrs or so.  This is reasonable.
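Taking the fourth-root scaling at face value (it's a heuristic, not a
derivation -- the days-to-days exponent only works because the constants
happen to be near 1 in these units):

```python
# Fourth-root heuristic for the O(n) Greengard estimate.
fd_days = 43.3e6                        # FD time on a single transputer
fmm_days = fd_days ** 0.25              # ~81 days on one transputer
fmm_hours_256 = fmm_days / 256 * 24     # ~7.6 hrs on 256 nodes
print(fmm_days, fmm_hours_256)
```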

Oh, I might add that if you substitute the transputer's 120 usec with
the .15 usec of the local supervector ghettoblaster with 200 Gbytes of
physical memory and 64 processors, then the time to solve is about:

T(days) ~ K * .15 usec * npts * 3 * gridsize^2 / 64

or roughly 850 days via FD, or ~5.4 days with an O(n) solution.
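Same formula, supervector numbers:

```python
# FD estimate for the 64-processor supervector, plus the same
# fourth-root heuristic for the O(n) case.
K = 1.15e-5                 # days per second
gridsize = 1600
npts = gridsize ** 3
iters = 3 * gridsize ** 2
fd_days = K * 0.15e-6 * npts * iters / 64
fmm_days = fd_days ** 0.25  # same fourth-root rule as above
print(fd_days, fmm_days)    # ~850 days FD, ~5.4 days O(n)
```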

I guess I like the multicomputer better anyway, since its
cost-performance still comes out comfortably ahead.
-=-
Richard M. Stein (aka Rick 'Transputer' Stein)
Concurrent Software Specialist @ The Ohio Supercomputer Center
Trollius: The Official Multicomputer OS of the 1992 Olympic Games
Internet: stein@pixelpump.osc.edu, Ma Bell Net: 614-292-4122