[comp.sys.acorn] My ARM2's faster than an ARM3. Waaaa.

hughesmp@vax1.tcd.ie (02/13/91)

I have a few questions about the internals of the Arc. I have a 440i, and wrote
a silly demo, !Ba, on it a while ago. I wrote it for a MEMC1a and ARM2, but
presumed that I would only have to cater for the MEMC1 being slower. Not so:
the ARM3 also goes slower. The probable reason is that the demo uses a lot of
in-line code with few loops, and shuffles large amounts of data around. The
ARM3 has to refill its cache constantly as a result, so allowing for it to
synchronize with the 8MHz (?) external system, it effectively runs much
slower. Switching caching off is also useless, because rather than operating
at the logical 8MHz, it operates at 30/4 MHz = 7.5 MHz, I think. If this is the
case, _why_ did VLSI not make it go just a little bit faster, say 32 MHz, which
would give identical performance to the ARM2 with the cache off?

Also, how difficult would it be to clock the _entire_ system at 30MHz, or would
memory chips not be able to handle that? Alternatively, how difficult would it
be to get a decent-sized cache, say 64k or 256k or something? In my case, the
bits of the demo that use in-line code use it as the obvious implementation of
the problem, ie 200-pixel-diameter sphere-wrapping (in the new version),
because it just looks like the fastest implementation. Even if I put this into
a loop, it operates on so much data (on the order of 250k per frame sync or
something) that the code would be overwritten in the cache as the data is
loaded, and would need to be cached again, causing similar problems. Can I fix
the processor so that it doesn't load accessed _data_ into the cache? And if I
can't, why does the processor put the data in a pseudo-random location in the
cache? Surely sequential locations would be more logical, because it would
take longer for code to be overwritten by accessed data. Is it that a) VLSI
have their reasons, or b) it isn't more logical?

Another question... Some people say my demo works on the ARM3, others say it
doesn't. I assume both parties have 30 MHz ARM3s... Could it be that i/o from
podules slows down the machine, even when the podule isn't in use? Although
some of the machines it slows down on don't have many podules to speak of
plugged in. Or are there other pseudo-random elements to the processor's
operation?

I don't have an ARM3; I have just glanced at the VLSI chipset manual. Any help
would be greatly appreciated...

Tracy.
SICK. We use ARM2s. They're faster.

as@prg.ox.ac.uk (Andrew Stevens) (02/19/91)

hughesmp@vax1.tcd.ie writes ...

>Also, how difficult would it be to clock the _entire_ system at 30MHz, ...
Very *expensive*.  At those kinds of clock rates, propagating clocks and
signals (etc) any kind of distance becomes soooo much harder.  Current
affordable RAMs top out at around 16MHz (give or take some).

> Alternatively, how difficult would it
> be to get a decent sized cache, say 64k or 256k or something?
Difficult enough to make it rather costly: a nice big, fat, hacky custom
chip plus the necessary static RAMs.   Caches on PCs are affordable
because standard chippery already exists for the job...

>For my case, the bits of the demo that use in-line code,
>used it as an obvious implementation of
>the problems, ie 200 pixel diameter sphere-wrapping (on the new version),
>because it just looks like the fastest implementation.

Here lies the heart of your problem...
This is the old gotcha that naive in-lining does not necessarily improve
performance on processors with caches.  If you make your
loops big enough, you blow away the cache, and lose in a major
way.  Although in-lining cuts execution time to

	loop body / (loop body + loop overhead)

of the original, it also inflates it by a factor of

	cache degradation * cache clock / ram clock

plus, probably, a significant bit extra, since caches typically
refill multiple words at a time, may need to sync with external clocks,
etc etc.

Thus, since on modern CPUs loop overhead is usually small,
and (cache clock / ram clock) is large, the technique is rarely worth
pursuing on machines with small caches.

>Even putting this into
>a loop, the amount of data it operates on, in the order of 250k per frame sync
>or something, the code would be overwritten as the data is loaded in,
>and so it would need to be cached again, causing similar problems.

I am surprised by this - did you try the loopy version?  Even
assuming the ARM-3 has a very naive direct-mapped cache, you would still
expect the cache to retain a good proportion of the loop code, unless
the loop is big enough to be a good part of the size of the cache.  Data
access should not blow it away that badly - surely the innermost loop
cannot access more than a few hundred bytes per iteration?  If data-driven
cache flushing is a problem, you might find that being a bit cunning about
how you lay out and traverse the data in memory helps a lot.  Furthermore,
I *think* the ARM-3's cache is associative: several cache entries with
different high address bits can be distinguished.  Data really shouldn't
mash it *that* badly.  Even TeX and Prolog cache o.k.ish, and they're
pretty pathological.

> ... And if I can't, why does the processor put the data in a
>pseudo-random location en-cache?
>Surely sequential locations would be more logical, because it would
>take longer for code to be overwritten by accessed data, is it a) VLSI have 
>their reasons, or b) it isn't more logical?

I take it what you mean is why doesn't the cache evict on a purely
first-in first-out basis, oldest entries first.  The short answer is that
despite being a sort-of-o.k. strategy (most of the time), it would be
hopelessly slow and area-guzzling to implement in silicon.  Better overall
performance is achieved by using a ``stupider'' cache that can be made
usefully large and fast.  Furthermore, if you don't randomize cache
eviction a bit, it is very easy to run into disastrous pathological cases.
E.g. a loop just slightly bigger than the cache would get no benefit at
all under a strict first-in first-out strategy.

        Andrew Stevens                  
      Programming Research Group        JANET: Andrew.Stevens@uk.ac.oxford.prg
 Oxford University Computing Laboratory INTERNET: Andrew.Stevens@prg.ox.ac.uk
     11 Keble Road, Oxford, England     UUCP:  ...!uunet!mcvax!ukc!ox-prg!as
     OX1 3QD