[comp.graphics] Mandelbrot/Julia optimizations

athos@otto.bf.rmit.oz.au (David Burren [Athos]) (12/01/89)

In article <234@xochitl.UUCP>, cheeks@edsr.eds.com (Mark Costlow) writes:
>In article <7106@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes:
>> One thing that can be done to speed up calculations of the mandelbrot set is
>> to use memory. After you have determined that a point is in the set, you know
>> that all the other points visited in determining that are in the set as well,
>> so you can mark them black. Similarly, if you determine a point is not in the
>> set you know that all the other points you visited are not in the set as
>> well, so you can mark them white (or whatever colors you're using). This is
>> a pretty safe optimisation.
> 
> Hmmm ... maybe I'm missing something here, but that doesn't seem to be a 
> valid algorithm.  Since the Mandelbrot algorithm is Z(n+1) = Z(n)^2 + C, then
> each iteration for a given C defines a "path" through the plane, stopping
> at several (possibly infinite) discrete points.  But, isn't the path different
> for each different C?  I mean, just because you saw point P when iterating
> with C1, doesn't mean that if you see point P when iterating on C2 that
> the C2 iteration will behave identically to the C1 iteration.

The term "path" seems to have mislead you a little.  The expression
Z(n+1) = Z(n)^2 + C  says nothing about Z(n-1). If you get to point X from any
previous point then you WILL proceed to the same next point. Of course, this
is subject, like everything else, to rounding errors. The many-to-few property
of the expression is outlined by the fact that complex numbers have two
square roots each.

However the search time to determine if you have found a point previously
makes Peter's scheme sound rather daunting. This would however be dependent
That would depend on the memory architecture, however.

My own work on blotting (several years ago someone here at RMIT made a
comparison of a Mandelbrot image to an ink blot, and the name stuck :-) has
largely centered around using parallel hardware and optimising the code for
specific machines. Things like cache make enormous differences to performance.

I try to avoid memory accesses of any sort in the inner per-point loop, keeping
data in registers and instructions in cache. Machines with 8 FP registers can
keep all the data on-chip, although brain-dead compilers that refuse to
optimise, or that insist on keeping one register free for intermediate
expression values, often do their best to get in the way.
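To give the flavour of it, the per-point loop ends up looking something like
this (a simplified sketch, not the code I actually use):

    /* Sketch only: zr, zi, zr2, zi2, cr and ci are all the FP state
     * there is, so a machine with 8 FP registers can keep the whole
     * loop on-chip and never touch memory.
     */
    int dwell(double cr, double ci, int maxit)
    {
        double zr = 0.0, zi = 0.0, zr2 = 0.0, zi2 = 0.0;
        int n;

        for (n = 0; n < maxit && zr2 + zi2 <= 4.0; n++) {
            zi  = 2.0 * zr * zi + ci;   /* uses the old zr              */
            zr  = zr2 - zi2 + cr;       /* uses the old squares         */
            zr2 = zr * zr;              /* cache the squares for reuse  */
            zi2 = zi * zi;
        }
        return n;
    }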

People often seem to ignore the performance gains they can achieve with
relatively little effort. Playing with compiler options for a Pascal program
running on a Cyber 170 got the inner loop to sit within the cache (just) with
a performance jump of close to an order of magnitude.
Even algorithms presented in books such as "The Science of Fractal Images" have
(what seem to me now) glaring inefficiencies, such as calculating parts of
expressions twice.
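The sort of thing I mean, paraphrased from memory rather than quoted from any
particular book:

    /* zr*zr and zi*zi are each evaluated twice per pass; caching the
     * squares, as in the loop sketched above, saves two multiplies per
     * iteration for free.
     */
    int dwell_naive(double cr, double ci, int maxit)
    {
        double zr = 0.0, zi = 0.0, temp;
        int n = 0;

        while (n < maxit && zr * zr + zi * zi <= 4.0) {
            temp = zr * zr - zi * zi + cr;
            zi   = 2.0 * zr * zi + ci;
            zr   = temp;
            n++;
        }
        return n;
    }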

I've seen many fancy "optimizations" suggested over the years, but although
they reduce the "number of calculations", the "work per calculation" seems
to increase too much to make it worth it.
Comments anyone?

Call me a hacker, call me obsessive, but optimising for specific architectures
(which can take some time with some of the weirder ones) is worth every hour
of saved run-time. It takes the exploration of the Mandelbrot and Julia sets
out of the realm of batch jobs and into that of user interaction.
Even simple HLL code optimizations can make significant improvements.

I ended up with a set of C routines for Unix machines, with #define's for
various architectural features. The code is well optimised by standard C
compilers for HP-9000/300s, Sun-3s, Cyber-180s, VAXen, Multimaxes, Iris 4Ds.
There are essentially two versions: one for 2-register instruction machines
such as Motorola and NatSemi FPUs, and one for 3-register-instruction beasts
such as VAXen and MIPS boxes.
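To give a contrived example of what the two variants look like (the names and
the code here are invented for illustration, not lifted from the real files):

    /* One iteration step, written two ways.  TWO_OP_FPU is a made-up name. */
    static void step(double *zrp, double *zip, double cr, double ci)
    {
        double zr = *zrp, zi = *zip;
    #ifdef TWO_OP_FPU
        double t, u;
        t   = zr;  t  *= zr;               /* t  = old zr squared       */
        u   = zi;  u  *= zi;               /* u  = old zi squared       */
        zi *= zr;  zi += zi;  zi += ci;    /* zi = 2*zr*zi + ci         */
        zr  = t;   zr -= u;   zr += cr;    /* zr = t - u + cr           */
    #else
        double zr2 = zr * zr, zi2 = zi * zi;
        zi = 2.0 * zr * zi + ci;
        zr = zr2 - zi2 + cr;
    #endif
        *zrp = zr;  *zip = zi;
    }

The 2-register version leans on compound assignments that map directly onto
two-operand FP instructions; the 3-register version just writes the
expressions out and lets the compiler schedule them.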

The Cyber-170 work produced a function written in COMPASS (assembly) callable
from Pascal to calculate single points. The code fits within less than half
the 170-760's cache and takes advantage of the multiple functional units for
parallelism. It takes just over two minutes to calculate a 1280x1024 image of
the full set (-2,-1.25 .. 0.5,1.25) with a dwell limit of 256, and using 60-bit
FP numbers. This compares with several hours for the older Pascal programs.
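(Back of the envelope: 1280 x 1024 is about 1.3 million points, so just over
two minutes averages out to very roughly 10,000 points per second, with
anything up to 256 iterations hiding behind each one.)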

The code was written to be optimal for a wide variety of machines because I
have an unfinished distributed "fractal server" intended to use the
coarse-grained parallelism inherent in the Internet.
I intend to rewrite this to use RPC. I recently saw mention in comp.sys.sun
of an RPC-based Mandelbrot program on the 1987 SUG tape. Can anyone here
tell me an easy way to get hold of that?
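
The work units themselves would be simple; something like this (a sketch of
the idea only, with invented names, not the real code):

    /* One rectangular chunk of the image, handed to a remote worker. */
    struct chunk_request {
        double x0, y0;          /* complex plane coordinates of the origin */
        double dx, dy;          /* step per pixel in each direction        */
        int    width, height;   /* chunk size in pixels                    */
        int    max_dwell;       /* iteration limit                         */
    };

The reply is just width * height dwell values to be pasted into the image.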

These programs have been untouched for several months, and are in dire need of
cleaning up, which I intend to do very soon. Anyone interested in the code can
get them from me, although I'd prefer to work on them a bit more first.
_______________________________________________________________________________
David Burren (Athos),			       ACSnet: athos@otto.rmit.oz
Survey Computing Consultants, Hawthorn, Vic.   

doug@xdos.UUCP (Doug Merritt) (12/02/89)

In article <601@otto.bf.rmit.oz.au> athos@otto.bf.rmit.oz.au (David Burren [Athos]) writes:
>In article <234@xochitl.UUCP>, cheeks@edsr.eds.com (Mark Costlow) writes:
>>In article <7106@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes:
>>> After you have determined that a point is in the set, you know
>>>that all the other points visited in determining that are in the set as well,
>>  But, isn't the path different for each different C?
>If you get to point X from any
>previous point then you WILL proceed to the same next point. Of course, this
>is subject, like everything else, to rounding errors. [...]
>However the search time to determine if you have found a point previously
>makes Peter's scheme sound rather daunting.

About four years ago, when I was doing lots and lots of Mandelbrot graphics,
I tried this optimization, but with the definition of "point" being "pixel".
This failed rather miserably, because it drastically reduced the effective
precision: many, many distinct paths passed through the same pixel, even at
a zoom of 1X.

I never followed up with a comparison with the actual floating point
coordinate of the pixel, but it still seems to me that Peter's scheme
might be a win. All you need is a 2 dimensional array corresponding to
the screen, with each cell containing two floats and a state marker
(not-visited, visited-and-in-set, visited-and-not-in-set). Call it
17 bytes per cell (2 double precision plus 1 byte state). The resulting
array for a 1Kx1K screen is a mere 17 megabytes, which will easily fit
into memory on *someone's* computer out there, and hence won't incur swapping.
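In C the whole thing is just (a sketch, with the state values picked
arbitrarily):

    /* One cell per screen pixel: the exact coordinate of the last
     * trajectory point seen here, plus what we know about it.
     */
    enum state { NOT_VISITED = 0, IN_SET, NOT_IN_SET };

    struct cell {
        double x, y;        /* coordinates of the visiting point          */
        char   state;       /* one of the three states above              */
    };                      /* 8 + 8 + 1 = 17 bytes, ignoring padding     */

    static struct cell grid[1024][1024];   /* ~17 Mbyte of payload        */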

My current Sun only has 12 Meg so I can't make this work. Well, ok,
I could use single precision and cram the array into 9 Meg, yeah, that
would work, at the expense of a smaller workable maximum zoom.

Anyway, if you've got the memory, this'll cost only 4 floating point
comparisons (assuming +/- check for equality to avoid mismatch in the
least significant digit(s)) on each step of the trajectory. Ok, on average
how many recurrences of the trajectory need to be calculated before finding
a previously-hit point? Well, let's say the whole screen is in the
interior of the set. Let's also say we're real near one of the strange
attractors in the interior, so that the trajectory tends to follow a
tight loop, usually staying on screen throughout. In the best case, with
a dwell of 256, a mere 4K starting pixels (1M pixels divided by 256 steps per
trajectory) need be calculated before every pixel on the screen has been
visited. Thereafter, every single step of the calculation will hit a visited
pixel on the first iteration.
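
The four comparisons are just a +/- window around the stored coordinate,
something like this, using the cell struct above (EPS is a guess and would
have to scale with the zoom):

    #define EPS 1e-12   /* tolerance for "same point"; tune to taste */

    /* Four floating point comparisons: is (x,y) within +/- EPS of the
     * point recorded in this cell?
     */
    static int cell_hit(struct cell *c, double x, double y)
    {
        return c->state != NOT_VISITED
            && x > c->x - EPS && x < c->x + EPS
            && y > c->y - EPS && y < c->y + EPS;
    }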

Hmm, but the floating point values will probably differ pretty often,
and meanwhile we've ensured that the data cache misses on every instance
of the inner loop. Maybe we can keep the instruction cache filled, though.

And of course, if the screen is mostly outside the Mandelbrot set, there's
no win at all, because we need to calculate the whole trajectory in order
to color the pixel by the dwell.

Ok, I give up, I can't see any way for it to win after all. The only
way to ensure more hits on the floating point comparisons would be to
keep more than one trajectory result per pixel, but that quickly
increases memory demands. A supercomputer with 512Meg of RAM could
store about 64 trajectory cells per pixel. Calculate the +/- delta
on the fly rather than storing it, and cut the storage requirements in
half, yielding 128 trajectory cells per pixel.
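(512 Meg spread over a 1K x 1K screen is 512 bytes per pixel, i.e. 64 cells
of 8 bytes each, or 128 cells if each one shrinks to 4 bytes.)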

Maybe that's enough for a win. Anyone know, on average, how many different
trajectories pass through a given pixel?
	Doug
-- 
Doug Merritt		{pyramid,apple}!xdos!doug
Member, Crusaders for a Better Tomorrow		Professional Wildeyed Visionary

athos@otto.bf.rmit.oz.au (David Burren [Athos]) (12/08/89)

Only one person has pointed out to me the rather basic flaw in my previous
note to this group.  To avoid people being misled, here it is.

In article <601@otto.bf.rmit.oz.au>, I wrote:
>The term "path" seems to have mislead you a little.  The expression
>Z(n+1) = Z(n)^2 + C  says nothing about Z(n-1). If you get to point X from any
>previous point then you WILL proceed to the same next point. Of course, this

Of course, I was completely wrong, as C will change from one loop to the next.
My brain must have been on holiday. Please excuse me while I commit suicide :-(

Thanks to Kevin Radus for pointing this out to me.

(Most embarrassed)

scott@tekcrl.LABS.TEK.COM (Scott Huddleston) (12/13/89)

>In article <234@xochitl.UUCP>, cheeks@edsr.eds.com (Mark Costlow) writes:
>>In article <7106@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes:
>>> After you have determined that a point is in the set, you know
>>>that all the other points visited in determining that are in the set as well,
Not in the Mandelbrot set, because, as Mark asks,
>>  But, isn't the path different for each different C?
yes, it is.
The idea has merit for rendering Julia sets, though.  It also works for
points outside a Julia set, if you're rendering escape count.  Just
save your iteration history and when an iteration escapes, you know the
escape count for every point in the sequence.
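Roughly like this (a sketch only, untested): c is fixed for a Julia set, so
if the orbit of a starting point escapes at step n, the point it visited at
step k is known to escape n - k steps later.

    /* Escape-count memoization for a Julia set (fixed cr, ci).
     * mark() is whatever maps a coordinate back to a pixel and records
     * its escape count.
     */
    #define MAXIT 256

    static void julia_point(double zr, double zi, double cr, double ci,
                            void (*mark)(double x, double y, int count))
    {
        double ox[MAXIT], oy[MAXIT];
        int n, k;

        for (n = 0; n < MAXIT; n++) {
            ox[n] = zr;  oy[n] = zi;        /* remember the orbit        */
            if (zr * zr + zi * zi > 4.0)
                break;                      /* escaped at step n         */
            {
                double t = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = t;
            }
        }
        if (n < MAXIT)                      /* escaped: whole orbit known */
            for (k = 0; k <= n; k++)
                mark(ox[k], oy[k], n - k);
        /* if it never escapes, the whole orbit can be marked as inside */
    }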
-- 
Scott Huddleston
scott@crl.labs.tek.com

me@karl.pcs.com (Michael Elbel) (12/14/89)

In article <601@otto.bf.rmit.oz.au> athos@otto.bf.rmit.oz.au (David Burren [Athos]) writes:

   Call me a hacker, call me obsessive, but optimising for specific architectures
   (which can take some time with some of the weirder ones) is worth every hour
   of saved run-time. It takes the exploration of the Mandelbrot and Julia sets
   out of the realm of batch jobs and into that of user interaction.
   Even simple HLL code optimizations can make significant improvements.

   I ended up with a set of C routines for Unix machines, with #define's for
   various architectural features. The code is well optimised by standard C
   compilers for HP-9000/300s, Sun-3s, Cyber-180s, VAXen, Multimaxes, Iris 4Ds.
   There are essentially two versions: one for 2-register instruction machines
   such as Motorola and NatSemi FPUs, and one for 3-register-instruction beasts
   such as VAXen and MIPS boxes.

   The Cyber-170 work produced a function written in COMPASS (assembly) callable
   from Pascal to calculate single points. The code fits within less than half
   the 170-760's cache and takes advantage of the multiple functional units for
   parallelism. It takes just over two minutes to calculate a 1280x1024 image of
   the full set (-2,-1.25 .. 0.5,1.25) with a dwell limit of 256, and using 60-bit
   FP numbers. This compares with several hours for the older Pascal programs.

I'd like to ask whether there are people out there who have measured
Mandelbrot programs on larger screens.

We are developing a box based on an Intel i860 here at PCS. For a demo we
ported a sample program from Intel to the 1280 x 1024 pixel, 8-bit
framebuffer we have attached to the beast.

The kernel routine to calculate the points is written in assembly language
to take advantage of the 'multiple-instructions-per-clock-cycle' mode of
the i860.

I wonder how the results I got compare with those from other machines.

Here are the timings for some interesting areas of the plane.


    X0     |    Y0     |    X1     |    Y1     | max. depth |   time
===============================================================================
 -2.25     | -1.5      |  0.75     |  1.5      |    156     | 0 min 11.7 sec
 -0.1992   |  1.0148   | -0.12954  |  1.06707  |    256     | 0 min  8.7 sec
 -0.713    |  0.49216  | -0.4082   |  0.71429  |    256     | 0 min 21.6 sec
 -0.75104  |  0.10511  | -0.7408   |  0.11536  |   1024     | 1 min 35.0 sec
 -0.74758  |  0.10671  | -0.74624  |  0.10779  |   2048     | 2 min 30.0 sec

The times include output to the framebuffer; the program was running
standalone and the processor was clocked at 33 MHz.
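(Back of the envelope: 1280 x 1024 is about 1.3 million points, so the 11.7
second figure for the widest view works out to something over 100,000 points
per second.  The region and depth limit differ from the Cyber figures quoted
above, so treat any comparison loosely.)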

Unfortunately I don't have a version for Unix that is even somewhat
optimized (even so, it's faster than a tricky program I have for the ST,
although there the resolution is only 320 x 200 and the depth 30 :-) ).

Hoping, Michael
--
Michael (X) Elbel - me@dude.PCS.COM for the World, me@dude.PCS.DE for Europe