[comp.arch] String Handling

greg@utcsri.UUCP (04/12/87)

This string-op stuff gave me an idea. A run-time library could contain
a function called 'mov200words' looking like this:

mov200words:	mov	(a0)+,(a1)+
		mov	(a0)+,(a1)+
		.....	200 mov's in all
		mov	(a0)+,(a1)+
		rts

Then if, say, a 64-word struct needed to be copied, the compiler would load
the pointers and call mov200words+(200-64)*2 [ or whatever; each mov here is
one 2-byte instruction ] to do the copy. This would give unrolled-loop speed
with only one unrolled loop in the whole executable. [ Call it more than once
for >200 words ].  Presumably this would be faster than a loop on a PDP-11 or
a 68000, but it might lose on a machine with an instruction cache, which could
run a copy loop entirely on-chip. A wizzo block copy instruction may or may
not run faster than the unrolled loop.
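
For readers who want to play with the idea without an assembler, here is a
rough C analogue. It is only a sketch, not the routine from the post: a switch
with fall-through stands in for calling into the middle of the unrolled
sequence, UNROLL = 8 stands in for the 200 moves, and the names mov_unrolled,
entry_offset and copy_words are made up for illustration. entry_offset just
restates the (200-64)*2 arithmetic, assuming every move is one 2-byte
instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UNROLL = 8 };  /* stands in for the 200 moves in the post */

/* Copy n words (0 <= n <= UNROLL) by entering the unrolled sequence
 * partway through: case k skips the first UNROLL - k moves. */
void mov_unrolled(uint16_t *dst, const uint16_t *src, size_t n)
{
    assert(n <= UNROLL);
    switch (n) {
    case 8: *dst++ = *src++; /* fall through */
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    }
}

/* Byte offset of the entry point for an n-word copy, assuming each
 * move is a single 2-byte instruction (hypothetical helper). */
size_t entry_offset(size_t total_moves, size_t n)
{
    return (total_moves - n) * 2;
}

/* Copies longer than UNROLL words call the routine more than once. */
void copy_words(uint16_t *dst, const uint16_t *src, size_t n)
{
    while (n > UNROLL) {
        mov_unrolled(dst, src, UNROLL);
        dst += UNROLL;
        src += UNROLL;
        n -= UNROLL;
    }
    mov_unrolled(dst, src, n);
}
```

In C the switch costs a dispatch that the assembly version does not pay, since
there the compiler knows the entry address at compile time and simply calls it.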

The only advantage I am claiming over other unrolled-loop techniques is that
almost nothing gets executed except the moves that do the real work, while
still avoiding a large chunk of code at every place a copy is done.

Of course, this must have been done before :-)

-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

firth@sei.cmu.edu (Robert Firth) (04/20/87)

In article <4558@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>This string-op stuff gave me an idea. A run-time library could contain
>a function called 'mov200words' looking like this:
>
>mov200words:	mov	(a0)+,(a1)+
>		mov	(a0)+,(a1)+
>		.....	200 mov's in all
>		mov	(a0)+,(a1)+
>		rts
>
>Then if, say, a 64-word struct needed to be copied, the compiler would load
>the pointers and call mov200words+(200-64)*2 [ or whatever; each mov here is
>one 2-byte instruction ] to do the copy. This would give unrolled-loop speed
>with only one unrolled loop in the whole executable. [ Call it more than once
>for >200 words ].  Presumably this would be faster than a loop on a PDP-11 or
>a 68000, but it might lose on a machine with an instruction cache, which could
>run a copy loop entirely on-chip. A wizzo block copy instruction may or may
>not run faster than the unrolled loop.

Great idea, Gregory!  I saw this implemented in a Pascal/PDP-11 compiler,
and it fascinated me then.  Whether it's faster or slower than a block
move or a loop depends on the machine.  For instance, on a RISCy machine
with a separate I-bus, the limiting factor is data accesses anyway, so
everything takes about the same time.  On the VAX-11/780, MOVC3 seems
to take almost the same time as the equivalent number of MOVLs, and rather
more time than the equivalent MOVQs.  Since it also destroys 6 registers
(R0-R5), it should be avoided.

Of course, for small structures you generate the sequence inline; on a VAX
maybe 7 or 8 MOVQs is OK, and beyond that it's better to go to the subroutine
(called by JSB of course).  Has anyone published statistics on the size distribution
of Pascal arrays & records?
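
The inline-versus-subroutine choice above could be sketched in a compiler's
code generator roughly as follows. This is an illustration only: the names
choose_copy, COPY_INLINE and COPY_SHARED_ROUTINE are hypothetical, and the
limit of 8 quadwords is just the "7 or 8 MOVQs" guess from the text, not a
measured figure.

```c
#include <stddef.h>

enum copy_strategy { COPY_INLINE, COPY_SHARED_ROUTINE };

/* Assumed cut-off: the "7 or 8 MOVQs" guess, taken at its upper end. */
enum { INLINE_LIMIT_QUADWORDS = 8 };

/* Pick a strategy for copying a structure of size_bytes bytes,
 * counting in 8-byte quadword moves and rounding up. */
enum copy_strategy choose_copy(size_t size_bytes)
{
    size_t quadwords = (size_bytes + 7) / 8;
    return quadwords <= INLINE_LIMIT_QUADWORDS ? COPY_INLINE
                                               : COPY_SHARED_ROUTINE;
}
```

Statistics on real structure sizes would be the way to tune that limit.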