[comp.arch] Endian wars - really 386 question

vixie@decwrl.dec.com (Paul A Vixie) (01/29/89)

[Jan Gray]
# This went around comp.arch a while ago.
#	has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
# e.g. "test if there were any borrows as a result of the bytewise subtracts"
#
# Using this trick on the '386, strlen on long strings can be made about 30%
# faster than using the dedicated string instruction "rep scasb"! (Except this
# will cause many instruction fetches that will keep your bus busy.)
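
A minimal C sketch of the quoted test, assuming 32-bit longwords. The names has_nul and has_nul_selfcheck are illustrative only and do not come from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if any of the four bytes of w is 0x00. The name has_nul is
   illustrative; the expression is the one quoted above. */
static int has_nul(uint32_t w)
{
    /* Subtracting 1 from a byte leaves bit 7 set if the byte was 0x00
       (it wraps to 0xFF) or was above 0x80; "& ~w" discards bytes that
       had bit 7 set to begin with. A borrow can spill into higher bytes,
       but only out of a byte that really was 0x00, so the word-level
       answer is exact. */
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Small self-check with hand-picked values. */
void has_nul_selfcheck(void)
{
    assert(!has_nul(0x41424344u));  /* "ABCD": no zero byte */
    assert(has_nul(0x41420044u));   /* one zero byte */
    assert(!has_nul(0x80808080u));  /* high bits set, but no zero byte */
}
```

Because of the borrow, the mask can also flag bytes above the first zero byte, so it answers "is there a NUL in this longword" reliably but not "which byte".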

Wait a minute... This is probably a silly question, more so since it comes
from the moderator of <info-386ix@vixie.sf.ca.us>, but... doesn't the '386
have a cache for recently used instructions?
--
Paul Vixie
Work:    vixie@decwrl.dec.com    decwrl!vixie    +1 415 853 6600
Play:    paul@vixie.sf.ca.us     vixie!paul      +1 415 864 7013

jangr@microsoft.UUCP (Jan Gray) (01/31/89)

[Paul A Vixie]
# [Jan Gray]
# # This went around comp.arch a while ago.
# #	has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
# # e.g. "test if there were any borrows as a result of the bytewise subtracts"
# #
# # Using this trick on the '386, strlen on long strings can be made about 30%
# # faster than using the dedicated string instruction "rep scasb"! (Except this
# # will cause many instruction fetches that will keep your bus busy.)
# 
# Wait a minute... This is probably a silly question, more so since it comes
# from the moderator of <info-386ix@vixie.sf.ca.us>, but... doesn't the '386
# have a cache for recently used instructions?

The parenthetical comment was speculation on my part, as I haven't actually
taken a data analyzer to any of my 386 boxes.

The 386 itself has no instruction cache, although there are many high
performance 386 systems with 32K or 64K external caches.  The 386 does have
an instruction prefetch buffer of some size, but I don't think it is of any
benefit to loops a la 68010 loop mode.

I would not advocate changing your 386 str* library, because the 30%
figure is really the asymptotic speedup.  For typical strings of less
than ten bytes you wouldn't get much improvement once you pay for
aligning to a longword boundary (we don't want to page fault by reading
past the last data page) and for finding which byte within the longword
is the null.
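
The three phases described above (align, scan by longword, locate the null) can be sketched in C as follows. This is an illustration under stated assumptions, not the 386 or 68020 assembly that was timed: the name strlen_longword is hypothetical, longwords are taken to be 32 bits, and the aligned load may read up to three bytes past the terminator, which real hardware tolerates but strict C does not sanction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a longword-at-a-time strlen. The name strlen_longword is
   hypothetical; it is not taken from any shipped library. */
size_t strlen_longword(const char *s)
{
    const char *p = s;
    uint32_t w;

    assert(s != NULL);

    /* Phase 1: byte loop until p is longword aligned. An aligned 4-byte
       load never straddles a page boundary, so it cannot fault on a page
       the string does not touch. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Phase 2: one load and one test per four bytes. The load may read
       bytes beyond the terminator (assumption: the platform allows it). */
    for (;;) {
        memcpy(&w, p, sizeof w);
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;
        p += 4;
    }

    /* Phase 3: the test only says "somewhere in this longword", so walk
       the bytes in address order to find the first null. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

For a string of a few bytes, phases 1 and 3 do most of the work and the longword loop runs at most once or twice, which is why the asymptotic figure does not carry over to typical short strings.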

The payoff is more dramatic on the 68020.  The "by the book" timings for
traditional byte-at-a-time strlen (unrolled 4 times) vs. one iteration
of the longword scan:
                                Cycles                  Bus reads
                                best   cache   worst
Traditional (scan 4 bytes)       22      42      50         4
Longword scan                     7      22      31         1

No, MS software engineers *don't* spend their days counting clock cycles!

Jan Gray  uunet!microsoft!jangr  Microsoft Corp., Redmond Wash.  206-882-8080