greg@utcsri.UUCP (04/12/87)
This string op-stuff gave me an idea. A run-time library could contain
a function called 'mov200words' looking like this :

mov200words:    mov     (a0)+,(a1)+
                mov     (a0)+,(a1)+
                .....                   200 mov's in all
                mov     (a0)+,(a1)+
                rts

Then, if, say, a 64-word struct needed to be copied, the compiler would get
the pointers and then call mov200words+(200-64)*2 [ or whatever ] to do the
copy. This would provide unrolled-loop speed with only one loop unrolled in
the whole executable. [ Call it more than once for >200 words ]. Presumably
this would be faster than a loop on a PDP-11 or a 68000, but might lose on a
machine with an instruction cache, that could run a copy loop on-chip. A wizzo
block copy instruction may or may not run faster than the unrolled loop.

The only advantage I am claiming over other unrolled-loop techniques is the
almost complete lack of anything but payoff move operations in the above,
whilst avoiding large amounts of code whenever a copy is done.

Of course, this must have been done before :-)

--
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
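A rough C analogue of the entry-point trick, offered only as a sketch. C cannot call into the middle of a function, so a switch with fall-through cases stands in for "call mov200words+offset". That is a swapped-in technique, not the assembly routine itself. The names mov_unrolled, copy_words and UNROLL are made up for illustration, and the unrolled length is cut from 200 to 8 to keep the listing short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { UNROLL = 8 };   /* hypothetical; stands in for the 200 of 'mov200words' */

/* Copy n words (n <= UNROLL) by entering the unrolled sequence part-way
 * down, so that exactly n moves run before the function returns. */
static void mov_unrolled(uint16_t *dst, const uint16_t *src, size_t n)
{
    assert(n <= UNROLL);
    switch (n) {
    case 8: *dst++ = *src++; /* fall through */
    case 7: *dst++ = *src++; /* fall through */
    case 6: *dst++ = *src++; /* fall through */
    case 5: *dst++ = *src++; /* fall through */
    case 4: *dst++ = *src++; /* fall through */
    case 3: *dst++ = *src++; /* fall through */
    case 2: *dst++ = *src++; /* fall through */
    case 1: *dst++ = *src++; /* fall through */
    case 0: break;
    }
}

/* Copies longer than UNROLL words: call the full sequence repeatedly,
 * then enter part-way down for whatever is left. */
void copy_words(uint16_t *dst, const uint16_t *src, size_t n)
{
    while (n > UNROLL) {
        mov_unrolled(dst, src, UNROLL);
        dst += UNROLL;
        src += UNROLL;
        n   -= UNROLL;
    }
    mov_unrolled(dst, src, n);
}
```

The switch costs a computed jump on entry, which the assembly version avoids when the compiler knows the struct size and can emit the entry address as a constant.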
firth@sei.cmu.edu (Robert Firth) (04/20/87)
In article <4558@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>This string op-stuff gave me an idea. A run-time library could contain
>a function called 'mov200words' looking like this :
>
>mov200words:    mov     (a0)+,(a1)+
>                mov     (a0)+,(a1)+
>                .....                   200 mov's in all
>                mov     (a0)+,(a1)+
>                rts
>
>Then, if, say, a 64-word struct needed to be copied, the compiler would get
>the pointers and then call mov200words+(200-64)*2 [ or whatever ] to do the
>copy. This would provide unrolled-loop speed with only one loop unrolled in
>the whole executable. [ Call it more than once for >200 words ]. Presumably
>this would be faster than a loop on a PDP-11 or a 68000, but might lose on a
>machine with an instruction cache, that could run a copy loop on-chip. A wizzo
>block copy instruction may or may not run faster than the unrolled loop.

Great idea, Gregory! I saw this implemented in a Pascal/PDP-11 compiler,
and it fascinated me then.

Whether it's faster or slower than a block move or a loop depends on the
machine. For instance, on a RISCy machine with a separate I-bus, the
limiting factor is data accesses anyway, so everything takes about the
same time.

On the Vax-11/780, the MOVC3 seems to take almost the same time as the
equivalent number of MOVLs, and rather more time than MOVQs. Since it
also destroys 6 registers, it should be avoided.

Of course, for small structures you generate the sequence inline; on a Vax
maybe 7 or 8 MOVQs is OK, after that better go to the subroutine (called
by JSR of course).

Has anyone published statistics on the size distribution of Pascal arrays
& records?
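For what it's worth, the compile-time choice described above (inline for small structures, otherwise a call into the unrolled routine at a computed entry point) might be sketched in C roughly as follows. Everything named here is hypothetical: plan_copy, copy_plan and the three constants are illustrative only. The 2-bytes-per-move figure follows the "*2 [ or whatever ]" in the quoted post, and the inline limit of 8 moves is just one reading of "7 or 8 MOVQs"; both would need adjusting for a real target.

```c
#include <stddef.h>

enum {
    TABLE_WORDS    = 200, /* moves in the unrolled routine (from the post) */
    BYTES_PER_MOVE = 2,   /* assumed size of one move instruction */
    INLINE_LIMIT   = 8    /* assumed cutoff for generating moves inline */
};

typedef struct {
    int    inline_copy;   /* nonzero: emit the moves inline, no call */
    size_t full_calls;    /* calls to the routine's first instruction */
    size_t tail_words;    /* words left for one final partial call */
    size_t tail_offset;   /* byte offset of that entry; valid if tail_words > 0 */
} copy_plan;

/* Decide how a code generator could copy 'words' words. */
copy_plan plan_copy(size_t words)
{
    copy_plan p = {0, 0, 0, 0};

    if (words <= INLINE_LIMIT) {
        p.inline_copy = 1;            /* small structure: inline sequence */
        return p;
    }
    p.full_calls = words / TABLE_WORDS;
    p.tail_words = words % TABLE_WORDS;
    if (p.tail_words > 0) {
        /* Skip the first (TABLE_WORDS - tail_words) moves of the routine. */
        p.tail_offset = (TABLE_WORDS - p.tail_words) * BYTES_PER_MOVE;
    }
    return p;
}
```

With these assumed constants, plan_copy(64) asks for no full calls and one call at byte offset (200-64)*2 = 272, matching the 64-word example in the quoted post.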