srm@Matrix.COM (Steve Morris) (11/08/90)
Some folks have claimed that
            Loop:
                ...                             ; whatever
                subq    #1,<count_reg>          ; decrement counter reg
                bne     Loop                    ; branch if not zero
is faster than
            Loop:
                ...                             ; whatever
                dbra    <count_reg>,Loop        ; branch if not -1
because the next instruction following the 'dbra' instruction
is always prefetched by the 68020 (and never cached).
Can someone shed some light on this?
-- 
 _____________________________________________________________
| Steve Morris                         srm@matrix.com         |
| Matrix Corporation                   mcnc!matrx!srm         |
| 1203 New Hope Road                   (919)231-8000 Telephone|

ron@motmpl.UUCP (Ron Widell) (11/14/90)
Steve Morris (srm@matrx.matrix.com) writes:
>
> Some folks have claimed that
>
>             Loop:
>                 ...                             ; whatever
>                 subq    #1,<count_reg>          ; decrement counter reg
>                 bne     Loop                    ; branch if not zero
>
> is faster than
>
>             Loop:
>                 ...                             ; whatever
>                 dbra    <count_reg>,Loop        ; branch if not -1
>

Let's examine a certain pathological case to see if this can be true.
If, in a paged/virtual memory system (which necessarily presumes the
use of a 68851 PMMU or other hardware to trap page boundary
violations) we have:

Logical Page N, physically resident----------------------------------
|             Loop:
|                 ...                             ; whatever
|                 dbra    <count_reg>,Loop        ; branch if not -1
---------------------------------------------------------------------
Logical Page N+1, not physically resident----------------------------
| instruction logically following dbra (via PC increment)
---------------------------------------------------------------------

We discover that a Bus Error will be generated by the PMMU (or other
hardware). However, a Bus Error exception *WILL NOT* be taken at this
time; rather, the valid bit in the tag field of the cache line will be
cleared. So we can ignore exception processing as a source of overhead.

Also, since the execution unit and the bus controller are pretty well
decoupled, and neither of the branch instructions will generate any
bus traffic, the branch will complete and the sequencer will issue the
next instruction (the top of the loop) while the prefetch to
non-resident memory is taking place, IFF the branch is < 256 bytes
(we'll discuss the other case later). Thus, the prefetch does not
stall the pipe, so (counting cycles) we find from the manual that the
relevant instruction times are:

                            Best Case    Cache Case    Worst Case
    subq.w  #1,Dn               0            2             3
    bne     Loop (taken)        3            6             9
vs.
    dbra    Dn,Loop             3            6             9

It is much more likely that the dbra instruction can take advantage of
overlap due to bus activity from a previous instruction, since any
overlap in the first case would really show up during the subq
instruction (and the I-pipe is only 3 words deep). Thus for cases
where the loop fits entirely in cache, I would expect case #2 to be at
least as fast as #1, perhaps faster.

In those cases where the loop *DOES NOT* fit entirely in cache, we
will have additional latency for both cases because we will wait for
the prefetch cycle to complete (via the *BERR signal) prior to
initiating the fetch for the instruction at the top of the loop.

Note that in this example we will get a page fault inside the loop for
case #1, rather than after the loop as in case #2. Here I would really
expect #1 to be slower, as we *WILL* do Bus Error exception
processing. Also note that *BOTH* instructions have prefetch, not just
dbra.

>
> because the next instruction following the 'dbra' instruction
> is always prefetched by the 68020 (and never cached).
>

Prefetching is always occurring on the 68020 (assuming memory
bandwidth is available), except when you use the 'sync' instruction
(officially known as NOP). And if a valid access to memory occurs, the
instruction is cached, provided the cache is enabled both by hardware
and software.

An additional case where #1 may be faster is where the branch
displacement is such that we can use a byte (seven bits plus sign)
displacement value. But this was not suggested by your example.

> Can someone shed some light on this?
>

Hopefully, this helped.

Regards,
-- 
Ron Widell, Field Applications Eng.     |UUCP: {...}mcdchg!motmpl!ron
Motorola Semiconductor Products, Inc.,  |Voice:(612)941-6800
9600 W. 76th St., Suite G               | I'm from Silicon Tundra,
Eden Prairie, Mn. 55344-3718            | what could I know?
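
As a rough check on the cycle figures quoted in the reply above, here is a
minimal Python sketch that totals the per-iteration loop overhead for the
two forms. It assumes the counts simply add, with no overlap between
instructions, which the reply itself says is not strictly true on the
68020. The names (SUBQ, BNE_TAKEN, DBRA_TAKEN, overhead) are made up for
illustration; only the numbers come from the thread.

```python
# Cycle counts copied from the table in the reply above.
# Columns: (best case, cache case, worst case).
SUBQ = (0, 2, 3)        # subq.w #1,Dn
BNE_TAKEN = (3, 6, 9)   # bne Loop (taken)
DBRA_TAKEN = (3, 6, 9)  # dbra Dn,Loop (branch taken)


def overhead(*instructions):
    """Sum cycle counts column by column, ignoring any overlap (assumption)."""
    return tuple(sum(column) for column in zip(*instructions))


case1 = overhead(SUBQ, BNE_TAKEN)  # loop closed with subq/bne
case2 = overhead(DBRA_TAKEN)       # loop closed with dbra

for label, c1, c2 in zip(("best", "cache", "worst"), case1, case2):
    print(f"{label:5s}  subq+bne = {c1:2d}   dbra = {c2:2d}")
```

Under that no-overlap assumption the dbra form never costs more per
iteration than subq/bne, which agrees with the expectation stated for the
in-cache case. It says nothing about the byte-displacement case or about
loops that spill out of the cache.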