[comp.arch] icache size

billms@zip.eecs.umich.edu (Bill Mangione-Smith) (10/12/90)

In article <1990Oct12.042251.18884@cs.cmu.edu> spot@TR4.GP.CS.CMU.EDU (Scott Draves) writes:
>shouldn't loop unrolling burn lots of registers also?  when unrolling,
>which ceiling will you hit first, the number of registers, or the size
>of the I-cache?

I don't understand why people are still consumed by code size with (most)
aggressive loop optimizations.  Loop unrolling and polycyclic scheduling
do increase code size.  That just isn't important anymore.  Give me a 4k
icache.  That's usually 1k instructions, right?  I've been using the 
Astronautics ZS-1, which almost always unrolls loops and does a very good
job of picking the 'correct' unrolling depth.  Yet the loops I've looked
at are almost never expanded to over 100 instructions, let alone 1k.

When you are unrolling loops, you only need a certain number of instructions
(dependent on fu and mem latencies) to work with.  Even a small icache,
i.e. 1-4k, can hold the required number of instructions.
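
For concreteness, here's the flavor of the transformation in generic
C -- this is not ZS-1 output, and the unroll depth of 4 is pulled out
of the air, where a real compiler would derive it from the functional
unit and memory latencies:

    /* rolled: one useful add per loop test, branch, and index update */
    void vadd(double a[], double b[], double c[], int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    /* unrolled 4 ways: four useful adds per trip around the loop;
       assumes n % 4 == 0 -- a real compiler also emits cleanup
       code for the leftover iterations */
    void vadd4(double a[], double b[], double c[], int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            a[i]   = b[i]   + c[i];
            a[i+1] = b[i+1] + c[i+1];
            a[i+2] = b[i+2] + c[i+2];
            a[i+3] = b[i+3] + c[i+3];
        }
    }

Even at depth 4 the unrolled version is something like a few dozen
machine instructions, which is why the loops I see stay so far
under 1k.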

Granted, this issue might still be important for current cpus with
very small icaches, but those are quickly being replaced.

>Scott Draves	

bill
-- 
-------------------------------
	Bill Mangione-Smith
	billms@eecs.umich.edu

clc5q@shamash.cs.Virginia.EDU (Clark L. Coleman) (10/25/90)

In article <1990Oct12.135814.14346@zip.eecs.umich.edu> billms@zip.eecs.umich.edu (Bill Mangione-Smith) writes:
>In article <1990Oct12.042251.18884@cs.cmu.edu> spot@TR4.GP.CS.CMU.EDU (Scott Draves) writes:
>>shouldn't loop unrolling burn lots of registers also?  when unrolling,
>>which ceiling will you hit first, the number of registers, or the size
>>of the I-cache?
>
>I don't understand why people are still consumed by code size with (most)
>aggressive loop optimizations.  Loop unrolling and polycyclic scheduling
>do increase code size.  That just isn't important anymore.  Give me a 4k
>icache.  That's usually 1k instructions, right? 

You want a 4k icache, and so do 100 other users.  When every context
switch throws some of your lines out of the icache to make room for
someone else's code, you will see an initialization transient of
misses every time the timesharing comes back around to you.
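
To put rough numbers on it (purely illustrative, no particular
machine in mind): a 4k icache with 16-byte lines is 256 lines.  If a
miss costs on the order of 10 cycles, coming back to a stone-cold
cache means roughly 2500 cycles of refill before you run at full
speed again -- every quantum, for every process that got flushed.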

If you prefer a different example, you could look at a networked
workstation with only 2-3 users, but with each user running several
processes. With compilers unrolling loops and otherwise gobbling up
the icache, your 4k starts to look small.

No problem, you say. In a short time, we will all have several times 4k
in every machine. This is a variation on one of the oldest themes in
computer science: it always seems that all our resource limitations
are about to be solved, as semiconductor technology keeps improving so
rapidly. But we just get more and more ambitious in our use of computers.
Many computer programmers in the 1950's were certain that programming
would be easy within a decade, because they would have more CPU and
storage resources. And it would be easy today, if all we were trying
to do was program the same problems as in the 1950's. Now that we have
multitasking X Windows applications chewing up gobs of resources, we
find ourselves awaiting the next generation of hardware in order to
leave behind our current resource limitations. Who knows what icache
size we will have tomorrow; I do know that we will figure out a way
to chew it up, and then some. That's the message of history. There
is no excess of resources waiting around the corner that will excuse
our inefficiencies in CPU usage, icache usage, memory usage, etc.



-----------------------------------------------------------------------------
"We cannot talk of freedom unless we have private property." -- Gavriil Popov,
Mayor of Moscow, September 11, 1990. |||  clc5q@virginia.edu

cprice@mips.COM (Charlie Price) (10/26/90)

In article <1990Oct12.135814.14346@zip.eecs.umich.edu> billms@zip.eecs.umich.edu (Bill Mangione-Smith) writes:
>
>I don't understand why people are still consumed by code size with (most)
>aggressive loop optimizations.  Loop unrolling and polycyclic scheduling
>do increase code size.  That just isn't important anymore.  Give me a 4k
>icache.  That's usually 1k instructions, right?  I've been using the 
>Astronautics ZS-1, which almost always unrolls loops and does a very good
>job of picking the 'correct' unrolling depth.  Yet the loops I've looked
>at are almost never expanded to over 100 instructions, let alone 1k.
>
>When you are unrolling loops, you only need a certain number of instructions
>(dependent on fu and mem latencies) to work with.  Even a small icache,
>i.e. 1-4k, can hold the required number of instructions.
>
>Granted, this issue might still be important for modern cpus that still
>have very small icaches, but they are quickly being replaced.

I'll take a shot at this since I haven't seen anybody else try yet...

There ain't no such thing as a free lunch.

Unrolling a loop makes things faster by amortizing the per-iteration
bookkeeping instructions (the loop test, the adds to addresses) and
the conditional branch over more "useful" instructions.
There are overhead costs to loop unrolling due to I-cache
considerations, and these *can* make things slower -- or at least
change the ideas of good limits.
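
To make the amortization concrete (invented numbers): if the loop
body is 2 useful instructions plus 3 of bookkeeping, the rolled loop
spends 3 of every 5 issue slots on overhead.  Unrolled 4 ways it is
8 useful plus 3 overhead, about 3 in 11.  Each doubling of the depth
buys a smaller improvement, while the code size grows linearly.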

Consider the following:

1)  It costs some overhead to get instructions into the cache.
    The larger the code loop, the higher the pure overhead
    cost of fetching the instructions into the I-cache.
    As the processor becomes faster, the overhead per instruction
    almost certainly becomes relatively larger.
    The unrolling has to pay this direct overhead cost back before
    it even breaks even -- yet the more a loop is unrolled, the
    fewer trips you are *guaranteed* to make through it, leaving
    fewer trips over which to recoup the cost.
    For the fast processor (with the smaller cache?) it takes more
    loop traversals just to break even (see the rough arithmetic
    after this list).
2)  A more-unrolled loop, being bigger, displaces more instructions
    from the I-cache than a less-unrolled loop.
    Some of those instructions were useful!
    That means that in addition to incurring a direct overhead cost
    by being fetched in the first place, an unrolled loop incurs an
    indirect cost by forcing the processor to re-fetch some
    instructions that would have stayed in cache otherwise.
    The savings needed to break even goes up still more.
3)  Additional "real" memory traffic is nearly always bad
    in some limit.  It makes easy-to-program machines like
    bus-based shared-memory machines poop out at a lower number
    of processors.
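
Rough arithmetic on point 1 (invented numbers again): say unrolling
saves 2 cycles per original iteration but adds 60 instructions of
code costing about 40 cycles of extra miss time to fetch.  The loop
must run 20 iterations before the unrolling pays for itself, and
more than that once you count the useful lines it displaced (point
2).  A loop with a short trip count can come out a net loss.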

Nothing seems to scale very well: loop unrolling that works nicely
on one machine may not be a win on something either 5 times faster
or 5 times slower.

Loop unrolling isn't necessarily as big a win
as it might seem for all situations.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650