[comp.arch] loop unrolling

gg@jetsun.weitek.COM (01/15/91)

In article <PCG.91Jan13174042@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>
>If you have *some limited* degree of pipelining, as in contemporary
>implementations, such as the classic three-four stage pipeline that
>overlaps some computation with some control, and especially if this
>pipeline is exposed with things like delayed branches, then unrolling
>buys you nothing at all in time, and loses code space.
>

On the contrary: it can give you bigger basic blocks in the critical loops, thus
making more room for instruction scheduling to minimize delays.

A different problem with loop unrolling arises when you have an instruction
cache: if the unrolled loop's code exceeds the size of the instruction cache
(while the rolled loop fits in it), then your cache miss rate for that loop
will increase.
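
For illustration, a minimal C sketch of both points (the summation body,
the function names, and the 4-way factor are assumed for the example, not
taken from the thread): the unrolled version gives the scheduler four
independent load/add pairs in one basic block, at the cost of roughly four
times the loop body in code space.

	double sum_rolled(double *a, int n)
	{
	    int i;
	    double s = 0.0;

	    /* one load feeding one add per iteration: little for the
	       scheduler to overlap with a load delay or a branch slot */
	    for (i = 0; i < n; i++)
	        s += a[i];
	    return s;
	}

	double sum_unrolled(double *a, int n)   /* assumes n % 4 == 0 */
	{
	    int i;
	    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

	    for (i = 0; i < n; i += 4) {
	        /* four independent loads and adds in a single basic
	           block; the loop test and branch are paid once per
	           four elements */
	        s0 += a[i];
	        s1 += a[i + 1];
	        s2 += a[i + 2];
	        s3 += a[i + 3];
	    }
	    return s0 + s1 + s2 + s3;
	}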

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/17/91)

On 14 Jan 91 21:54:01 GMT, gg@jetsun.weitek.COM said:

gg> In article <PCG.91Jan13174042@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
gg> (Piercarlo Grandi) writes:

pcg> If you have *some limited* degree of pipelining, as in contemporary
pcg> implementations, such as the classic three-four stage pipeline that
pcg> overlaps some computation with some control, and especially if this
pcg> pipeline is exposed with things like delayed branches, then
pcg> unrolling buys you nothing at all in time, and loses code space.

gg> On the contrary: it can give you bigger basic blocks in the critical
gg> loops, thus making more room for instruction scheduling to minimize
gg> delays.

Sorry, we are still on different planets: if this is of benefit it is
because the internal degree of parallelism is higher than what I have
assumed above. There is no need for instruction scheduling across loop
iterations if all the pipeline stages are kept busy with a single
iteration of the loop.

Unrolling is only of benefit if there are more functional units/pipe
stages than a single iteration can keep busy. In most contemporary
microprocessor implementations that have three-to-four stage pipelines,
the computation in a single loop iteration PLUS the control of the next
iteration normally keeps all functional units busy. If your
implementation has greater internal parallelism, and your application
can take advantage of it, more power to you. If not, check your
assumptions, man :-).


At times this discussion reminds me of old discussions on the optimal
degree of multiprogramming, which is limited by (number of CPUs)+(number
of possibly outstanding IO operations) or something like that. For most
Unix machines of the PDP/VAX/Sun class this limited the optimal degree of
multiprogramming to something like 2-4 (the Law of Four).

I got the same type of reactions (no, it is 16 on my iPSC; no, it is
larger if you use a more parallelizing scheduler, ...). Bah! Pah!
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

graeme@labtam.labtam.oz (Graeme Gill) (01/18/91)

In article <PCG.91Jan16195742@teacho.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
> 
> Unrolling is only of benefit if there are more functional units/pipe
> stages than a single iteration can keep busy. In most contemporary
> microprocessor implementations that have three-to-four stage pipelines,
> the computation in a single loop iteration PLUS the control of the next
> iteration normally keeps all functional units busy. If your
> implementation has greater internal parallelism, and your application
> can take advantage of it, more power to you. If not, check your
> assumptions, man :-).
> 
> 
	There seems to be an assumption here that the loop conditions are
very simple, i.e. decrement a loop counter, jump if non-zero, etc.
Many real problems involve loops with much more complicated tests. The
tests may involve significant processing in themselves, and in this
situation loop unrolling is the only way of reducing that overhead
significantly. Delay slots etc. just don't give you enough to cover it up.
	I think it is misleading to consider only a subset of typical
problems. Real-world code involves more than simple loops of logical or
arithmetic processing of integer-sized elements. Some problems involve
data movement and processing of elements larger than integers. Talk of 4
registers being sufficient for 'typical' problems is a joke in many cases.
For instance, copying a pattern into memory will double in speed if the
pattern can be cached in registers. A pattern may be of arbitrary length,
and once the pattern size exceeds the register resources, the performance
of the code must drop from fill speed to copy speed. Loops using modulus
arithmetic to keep track of pattern repeat boundaries involve several
variables themselves, and even a 10-20% reduction in code performance,
because the next-to-inner loop count variables are not in registers, is
unacceptable in a competitive situation. Even with a wonderful compiler,
32 registers are rarely enough for some problems.
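
To make the pattern-fill point concrete, a minimal sketch (the function
name, word-sized elements, and the four-word fast path are assumptions for
the example, not Graeme's actual code):

	/* fill dst[0..n-1] with a repeating pattern; patlen > 0 assumed */
	void pattern_fill(unsigned *dst, int n, unsigned *pat, int patlen)
	{
	    int i;

	    if (patlen == 4 && n % 4 == 0) {
	        /* small pattern: hold it in locals the compiler can keep
	           in registers, so the inner loop is pure stores
	           (fill speed) */
	        unsigned p0 = pat[0], p1 = pat[1], p2 = pat[2], p3 = pat[3];

	        for (i = 0; i < n; i += 4) {
	            dst[i]     = p0;
	            dst[i + 1] = p1;
	            dst[i + 2] = p2;
	            dst[i + 3] = p3;
	        }
	    } else {
	        /* pattern too large to keep in registers: every store is
	           paired with a load, and the index for the repeat
	           boundary adds its own bookkeeping (copy speed) */
	        int j = 0;

	        for (i = 0; i < n; i++) {
	            dst[i] = pat[j];
	            if (++j == patlen)
	                j = 0;
	        }
	    }
	}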

	Graeme Gill
	Electronic Design Engineer
	Labtam Australia

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/20/91)

On 19 Jan 91 12:08:51 GMT, pcg@cs.aber.ac.uk (Piercarlo Grandi) said:

pcg> 	for (i = 0; i < K; i++)		[ ... ]
pcg> 	    /* body 1 */;

pcg> These can be rewritten as (4-way-unrolling):

pcg> 	for (i = 0; (i+4) < K; i += 4)	[ ... ]
pcg> 	{
pcg> 	    /* body 1 */;
pcg> 	    /* body 1 */;
pcg> 	    /* body 1 */;
pcg> 	    /* body 1 */;
pcg> 	}
pcg> 	if (i < K) switch (K-i)
pcg> 	{
pcg> 	case 3: /* body 1 */;
pcg> 	case 2: /* body 1 */;
pcg> 	case 1: /* body 1 */;
pcg> 	}

OOPS: sorry, I forgot that 'i' might be used in 'body 1' or after the
loop end, and the loop test should be '(i+4) <= K' so that a leftover of
exactly four iterations is not dropped; the corrected (hope no more
errors!) version is:

	for (i = 0; (i+4) <= K; i++)
	{
	    /* body 1 */; i++;
	    /* body 1 */; i++;
	    /* body 1 */; i++;
	    /* body 1 */;
	}
	if (i < K) switch (K-i)
	{
	case 3: /* body 1 */; i++;
	case 2: /* body 1 */; i++;
	case 1: /* body 1 */; i++;
	}

This does not really change the argument in which this example was used,
though.
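
For completeness, a self-contained instance of the corrected unrolling
with an illustrative body (the function name and the sum-the-elements
body are assumptions, not part of the original posting):

	long unrolled_sum(long *a, int K)
	{
	    int i;
	    long total = 0;

	    for (i = 0; (i+4) <= K; i++) {
	        total += a[i]; i++;     /* body 1 */
	        total += a[i]; i++;     /* body 1 */
	        total += a[i]; i++;     /* body 1 */
	        total += a[i];          /* body 1; the for's i++ steps past it */
	    }
	    if (i < K) switch (K-i) {   /* 1 to 3 iterations left over */
	    case 3: total += a[i]; i++;
	    case 2: total += a[i]; i++;
	    case 1: total += a[i]; i++;
	    }
	    return total;
	}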
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

baum@Apple.COM (Allen J. Baum) (01/21/91)

[]
>In article <PCG.91Jan19120851@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>2) registers are not free. Every doubling in size of the register file
>costs you a lengthening of the cycle time of a few percent; even worse,
>a lot of complication because such register files are flat and need to
>be multiported all the way. The same resources that go into doubling a
>register file might well be more productively employed elsewhere, just
>like all the real memory wasted because of the appalling inadequacy of
>the VM policies mentioned above.

I couldn't let this pass by. Doubling a register file's size slows down
register access by a few percent. It may or may not slow down the cycle
time, depending on whether register access is in the critical path.


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/23/91)

On 21 Jan 91 03:46:27 GMT, baum@Apple.COM (Allen J. Baum) said:

baum> In article <PCG.91Jan19120851@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
baum> (Piercarlo Grandi) writes:

pcg> 2) registers are not free. Every doubling in size of the register file
pcg> costs you a lengthening of the cycle time of a few percent;

baum> I couldn't let this pass by. Doubling a register file's size slows
baum> down register access by a few percent. It may or may not slow down
baum> the cycle time, depending on whether register access is in the critical path.

Conceded, yet maybe you could have let it pass; how useful is it to
consider the case where the register file *isn't* in the critical path?
I mean, register files are typically multiported on *the* critical
internal buses of the machine.

I can only say that it is hard for me to imagine, not being a hardware
designer, relevant cases where the register file is not in the critical
path or does not influence it. I'd like the opinion of someone who has
dealt firsthand with this kind of issue. I think that if there are
relevant cases, they could teach us something.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

tim@proton.amd.com (Tim Olson) (01/23/91)

In article <PCG.91Jan22172644@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| I can only say that it is hard for me to imagine, not being an hardware
| designer, relevant cases where the register file is not in the critical
| path or does not influence it. I'd like the opinion of someone who has
| dealt firsthand with this kind of issue. I think that if there are
| relevant cases, they could teach us something.

Register file access is not on the most critical path in the Am29000
(and we have 192 3-ported registers!).  Usually cache lookups and TLB
matching tend to be more critical, because they involve an immediate
comparison of tags after the access, and the arrays are usually larger
than register file arrays.


--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)