billms@zip.eecs.umich.edu (Bill Mangione-Smith) (10/12/90)
In article <1990Oct12.042251.18884@cs.cmu.edu> spot@TR4.GP.CS.CMU.EDU (Scott Draves) writes:
>shouldn't loop unrolling burn lots of registers also? when unrolling,
>which ceiling will you hit first, the number of registers, or the size
>of the I-cache?

I don't understand why people are still consumed by code size with (most)
aggressive loop optimizations. Loop unrolling and polycyclic scheduling
do increase code size. That just isn't important anymore. Give me a 4k
icache. That's usually 1k instructions, right? I've been using the
Astronautics ZS-1, which almost always unrolls loops and does a very good
job of picking the 'correct' unrolling depth. Yet the loops I've looked
at are almost never expanded to over 100 instructions, let alone 1k.

When you are unrolling loops, you only need a certain number of instructions
(dependent on fu and mem latencies) to work with. Even a small icache,
i.e. 1-4k, can hold the required number of instructions.

Granted, this issue might still be important for modern cpus that still
have very small icaches, but they are quickly being replaced.

>Scott Draves

bill
--
-------------------------------
Bill Mangione-Smith
billms@eecs.umich.edu
clc5q@shamash.cs.Virginia.EDU (Clark L. Coleman) (10/25/90)
In article <1990Oct12.135814.14346@zip.eecs.umich.edu> billms@zip.eecs.umich.edu (Bill Mangione-Smith) writes:
>In article <1990Oct12.042251.18884@cs.cmu.edu> spot@TR4.GP.CS.CMU.EDU (Scott Draves) writes:
>>shouldn't loop unrolling burn lots of registers also? when unrolling,
>>which ceiling will you hit first, the number of registers, or the size
>>of the I-cache?
>
>I don't understand why people are still consumed by code size with (most)
>aggressive loop optimizations. Loop unrolling and polycyclic scheduling
>do increase code size. That just isn't important anymore. Give me a 4k
>icache. That's usually 1k instructions, right?

You want a 4k icache, and so do 100 other users. When every context switch
eventually leads to some of your cache lines being thrown out of the cache
to make room for someone else's code, you will see an initialization
transient every time the timesharing comes back around to you. If you
prefer a different example, consider a networked workstation with only 2-3
users, but with each user running several processes. With compilers
unrolling loops and otherwise gobbling up the icache, your 4k starts to
look small.

No problem, you say. In a short time, we will all have several times 4k in
every machine. This is a variation on one of the oldest themes in computer
science: it always seems that all our resource limitations are about to be
solved, as semiconductor technology keeps improving so rapidly. But we
just get more and more ambitious in our use of computers. Many computer
programmers in the 1950's were certain that programming would be easy
within a decade, because they would have more CPU and storage resources.
And it would be easy today, if all we were trying to do was program the
same problems as in the 1950's. Now that we have multitasking X Windows
applications chewing up gobs of resources, we find ourselves awaiting the
next generation of hardware in order to leave behind our current resource
limitations.
Who knows what icache size we will have tomorrow; I do know that we will
figure out a way to chew it up, and then some. That's the message of
history. There is no excess of resources waiting around the corner that
will excuse our inefficiencies in CPU usage, icache usage, memory usage,
etc.
-----------------------------------------------------------------------------
"We cannot talk of freedom unless we have private property."
    -- Gavriil Popov, Mayor of Moscow, September 11, 1990.
||| clc5q@virginia.edu
cprice@mips.COM (Charlie Price) (10/26/90)
In article <1990Oct12.135814.14346@zip.eecs.umich.edu> billms@zip.eecs.umich.edu (Bill Mangione-Smith) writes:
>
>I don't understand why people are still consumed by code size with (most)
>aggressive loop optimizations. Loop unrolling and polycyclic scheduling
>do increase code size. That just isn't important anymore. Give me a 4k
>icache. That's usually 1k instructions, right? I've been using the
>Astronautics ZS-1, which almost always unrolls loops and does a very good
>job of picking the 'correct' unrolling depth. Yet the loops I've looked
>at are almost never expanded to over 100 instructions, let alone 1k.
>
>When you are unrolling loops, you only need a certain number of instructions
>(dependent on fu and mem latencies) to work with. Even a small icache,
>i.e. 1-4k, can hold the required number of instructions.
>
>Granted, this issue might still be important for modern cpus that still
>have very small icaches, but they are quickly being replaced.

I'll take a shot at this since I haven't seen anybody else try yet...

There ain't no such thing as a free lunch. Unrolling a loop makes things
faster by amortizing the per-iteration bookkeeping instructions (the
loop-test and address-update adds) and the conditional branches over more
"useful" instructions. But there are overhead costs for loop unrolling
due to I-cache considerations, and these *can* make things slower -- or
at least change the ideas of good limits. Consider the following:

1) It costs some overhead to get instructions into cache. The larger the
   loop body, the higher the pure overhead cost of fetching the
   instructions into the I-cache. As the processor becomes faster, the
   overhead per instruction almost certainly becomes relatively larger.
   Loop unrolling has to pay this direct overhead cost back before it
   even breaks even -- yet the more a loop is unrolled, the fewer times
   you are *guaranteed* to run through it. For the fast processor (with
   the smaller cache?) it takes more loop traversals just to break even.

2) A more-unrolled loop, being bigger, displaces more instructions from
   the I-cache than a less-unrolled loop. Some of those instructions were
   useful! So in addition to incurring a direct overhead cost by being
   fetched in the first place, an unrolled loop incurs an indirect cost
   by forcing the processor to re-fetch instructions that would have
   stayed in cache otherwise. The savings needed to break even go up
   still more.

3) Additional "real" memory traffic is nearly always bad in some limit.
   It makes easy-to-program machines like bus-based shared-memory
   machines poop out at a lower number of processors.

Nothing seems to scale very well: loop unrolling that works nicely on one
machine may not be a win on something either 5 times faster or 5 times
slower. Loop unrolling isn't necessarily as big a win as it might seem
for all situations.
--
Charlie Price    cprice@mips.mips.com    (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650