vixie@decwrl.dec.com (Paul A Vixie) (01/29/89)
[Jan Gray]
# This went around comp.arch a while ago.
#     has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
# e.g. "test if there were any borrows as a result of the bytewise subtracts"
#
# Using this trick on the '386, strlen on long strings can be made about 30%
# faster than using the dedicated string instruction "rep scasb"!  (Except this
# will cause many instruction fetches that will keep your bus busy.)

Wait a minute...  This is probably a silly question, more so since it comes
from the moderator of <info-386ix@vixie.sf.ca.us>, but... doesn't the '386
have a cache for recently used instructions?
--
Paul Vixie
Work:    vixie@decwrl.dec.com    decwrl!vixie    +1 415 853 6600
Play:    paul@vixie.sf.ca.us     vixie!paul      +1 415 864 7013
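For reference, the quoted has_NUL expression can be written as a small standalone C function. This is only a sketch: the names has_nul and has_nul_selfcheck are made up for illustration, and uint32_t stands in for the 32-bit quantity "i" in the quoted expression. Subtracting 1 from every byte sets a byte's top bit when that byte was zero, and the "& ~i" term discards bytes whose top bit was already set, so the result is nonzero exactly when some byte of the word is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero iff at least one of the four bytes of w is 0x00.
 * Hypothetical helper name; the expression is the one quoted in the thread. */
static int has_nul(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* A few hand-checked cases, written as assertions. */
static void has_nul_selfcheck(void)
{
    assert(!has_nul(0x41424344u));  /* no zero byte */
    assert( has_nul(0x41004344u));  /* one zero byte in the middle */
    assert(!has_nul(0x80818283u));  /* top bits set, but no zero byte */
    assert(!has_nul(0xFFFFFFFFu));  /* all ones */
    assert( has_nul(0x00000000u));  /* all zero */
}
```

Note that the test only says whether a zero byte is present; locating it within the word takes a second step.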
jangr@microsoft.UUCP (Jan Gray) (01/31/89)
[Paul A Vixie]
# [Jan Gray]
# # This went around comp.arch a while ago.
# #     has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
# # e.g. "test if there were any borrows as a result of the bytewise subtracts"
# #
# # Using this trick on the '386, strlen on long strings can be made about 30%
# # faster than using the dedicated string instruction "rep scasb"!  (Except this
# # will cause many instruction fetches that will keep your bus busy.)
#
# Wait a minute...  This is probably a silly question, more so since it comes
# from the moderator of <info-386ix@vixie.sf.ca.us>, but... doesn't the '386
# have a cache for recently used instructions?

The parenthetical comment was speculation on my part, as I haven't actually
taken a data analyzer to any of my 386 boxes.

The 386 itself has no instruction cache, although there are many high
performance 386 systems with 32K or 64K external caches.  The 386 does have
an instruction prefetch buffer of some size, but I don't think it is of any
benefit to loops a la 68010 loop mode.

I would not advocate changing your 386 str* library, because the 30% figure
is really the asymptotic speedup.  For typical strings of less than ten bytes
you wouldn't get much improvement once you pay for longword alignment (we
don't want to page fault off the last data page) and for finding which byte
within the longword is the null.

The payoff is more dramatic on the 68020.  The "by the book" timings for
traditional byte-at-a-time strlen (unrolled 4 times) vs. one iteration of
the longword scan:

                              Cycles:  best   cache   worst    Bus reads
  Traditional (scan 4 bytes)            22      42      50         4
  Longword scan                          7      22      31         1

No, MS software engineers *don't* spend their days counting clock cycles!

Jan Gray  uunet!microsoft!jangr  Microsoft Corp., Redmond Wash.  206-882-8080
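The three costs mentioned in this post (aligning to a longword, scanning a longword at a time, then finding which byte is the null) can be sketched in C as follows. This is an illustration under stated assumptions, not the code anyone in the thread timed: the name strlen_longword is hypothetical, a longword is taken to be uint32_t, and the loop deliberately reads the whole aligned word that contains the terminator. An aligned 4-byte read never crosses a page boundary, which is the point of aligning first, but reading past the end of the string's object is outside what ISO C guarantees, so real libraries do this in assembly or with compiler-specific support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a sketch of a longword-at-a-time strlen.
 * ASSUMPTION: reading the aligned 4-byte word that holds the terminator
 * is safe on the target, even if it extends past the end of the string. */
size_t strlen_longword(const char *s)
{
    const char *p = s;
    uint32_t w;

    /* Step 1: byte-at-a-time until p is longword aligned. */
    while ((uintptr_t)p % sizeof w != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Step 2: one aligned longword per iteration, using the has_NUL test. */
    for (;;) {
        memcpy(&w, p, sizeof w);  /* aligned load without aliasing issues */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;                /* some byte of this word is zero */
        p += sizeof w;
    }

    /* Step 3: find which byte within the flagged longword is the null. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

For short strings, steps 1 and 3 account for most of the work, which is consistent with the 30% figure being an asymptotic speedup rather than a typical one.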