hughesmp@vax1.tcd.ie (02/13/91)
I have a few questions about the internals of the Arc. I have a 440i, and a while ago I wrote a silly demo, !Ba, on it. I wrote it for a MEMC1a and ARM2, and presumed I would only have to cater for the MEMC1 being slower. Not so. The ARM3 also goes slower. The probable reason is that the demo uses a lot of in-line code, with few loops, and involves shuffling large amounts of data around. The ARM3 constantly has to refill its cache as a result, so allowing for it synchronizing with the 8 MHz (?) external system, it effectively runs much slower.

Switching caching off is also useless, because rather than operating at the logical 8 MHz, it operates at 30/4 MHz = 7.5 MHz, I think. If this is the case, _why_ did VLSI not make it go just a little bit faster, say 32 MHz, which would give identical performance to the ARM2 with the cache off? Also, how difficult would it be to clock the _entire_ system at 30 MHz, or would memory chips not be able to handle that? Alternatively, how difficult would it be to get a decent-sized cache, say 64k or 256k or something?

In my case, the bits of the demo that use in-line code used it as the obvious implementation of the problem, i.e. 200-pixel-diameter sphere-wrapping (in the new version), because it just looks like the fastest implementation. Even putting this into a loop, given the amount of data it operates on, on the order of 250k per frame sync or something, the cached code would be overwritten as the data is loaded in, and so it would need to be cached again, causing similar problems. Can I fix the processor so that it doesn't load accessed _data_ into the cache? And if I can't, why does the processor put the data in a pseudo-random location in the cache? Surely sequential locations would be more logical, because it would take longer for code to be overwritten by accessed data. Is it that a) VLSI have their reasons, or b) it isn't more logical?

Another question... Some people say my demo works on the ARM3, others say it doesn't.
I assume that both parties have 30 MHz ARM3s... Could it be that I/O from podules slows down the machine, even when the podule isn't in use? Although some of the machines it slows down on don't have many podules to speak of plugged in. Or are there other pseudo-random elements to the processor's operation? I don't have an ARM3; I have just glanced at the VLSI chipset manual. Any help would be greatly appreciated...

Tracy. SICK. We use ARM2s. They're faster.
as@prg.ox.ac.uk (Andrew Stevens) (02/19/91)
hughesmp@vax1.tcd.ie writes ...

>Also, how difficult would it be to clock the _entire_ system at 30MHz, ...

Very *expensive*. At those kinds of clock rates, propagating clocks and signals (etc.) any kind of distance becomes so much harder. Currently affordable RAMs top out at around 16 MHz (give or take some).

> Alternatively, how difficult would it
> be to get a decent sized cache, say 64k or 256k or something?

Difficult enough to make it rather costly. A nice big fat, hacky, custom chip plus the necessary static RAMs. Caches on PCs are affordable because standard chippery already exists for the job...

>For my case, the bits of the demo that use in-line code,
>used it as an obvious implementation of
>the problems, ie 200 pixel diameter sphere-wrapping (on the new version),
>because it just looks like the fastest implementation.

Here lies the heart of your problem... This is the old gotcha that naive in-lining does not necessarily improve performance on processors using caches. If you make your loops big enough, you blow away the cache, and lose in a major way. Although in-lining speeds things up by a factor of

    loop body / (loop body + loop overhead)

it also slows things down by a factor of

    cache degradation * (cache clock / ram clock)

plus, probably, a significant bit extra, since caches typically refill multiple words at a time, may need to sync with external clocks, etc. etc. Thus, since on modern CPUs loop overhead is usually small, and (cache clock / ram clock) large, the technique is rarely worth pursuing on machines with small caches.

>Even putting this into
>a loop, the amount of data it operates on, in the order of 250k per frame sync
>or something, the code would be overwritten as the data is loaded in,
>and so it would need to be cached again, causing similar problems.

I am surprised by this - did you try the loopy version?
Even assuming the ARM-3 has a very naive direct-mapped cache, you would still expect the cache to retain a good proportion of the loop code, unless the loop is big enough to be a good part of the size of the cache. Data access should not blow it away that badly; the innermost loop, surely, cannot access more than a few hundred bytes per iteration? If data-driven cache flushing is a problem, you might find a bit of cunning about how you lay out / traverse the data in memory helps a lot. Furthermore, I *think* the ARM-3's cache is associative: several cache entries with different high address bits can be distinguished. Data really shouldn't mash it *that* badly. Even TeX and Prolog cache o.k.-ish, and they're pretty pathological.

> ... And if I can't, why does the processor put the data in a
>pseudo-random location en-cache?
>Surely sequential locations would be more logical, because it would
>take longer for code to be overwritten by accessed data, is it a) VLSI have
>their reasons, or b) it isn't more logical?

I take it what you mean is: why doesn't the cache replace entries on a purely sequential basis? The short answer is that despite being a sort-of-o.k. strategy (most of the time), it would be hopelessly slow and area-guzzling to implement in silicon. Better overall performance is achieved by using a ``stupider'' cache that can be made usefully large and fast. Furthermore, if you don't randomize cache replacement a bit, it is very easy to run into disastrous pathological cases. E.g. a loop slightly larger than the cache would get no benefit at all under a strictly sequential (first-in, first-out) replacement strategy.

Andrew Stevens                          JANET:    Andrew.Stevens@uk.ac.oxford.prg
Programming Research Group              INTERNET: Andrew.Stevens@prg.ox.ac.uk
Oxford University Computing Laboratory  UUCP:     ...!uunet!mcvax!ukc!ox-prg!as
11 Keble Road, Oxford, England OX1 3QD