[comp.protocols.tcp-ip] TCP performance limitations

AUERBACH@CSL.SRI.COM (Karl Auerbach) (09/29/87)

Someone asked me today what are the performance limits on a TCP connection.
The situation he posited was one in which there are no intervening
IP routers, massive availability of underlying bandwidth, massive computing
resources in the hosts, and low noise.  It was further posited that
the hosts could be using acknowledgement strategies highly tuned for
this specific situation.

How to make such an estimate is well outside my expertise.  Can anyone
give me any pointers?

			Thanks, --karl--
-------

Mills@UDEL.EDU (09/29/87)

Karl,

Geez, how can I respond to such a wonder? Given your scenario, I would
posit something like X-band speeds, which works out to 10^10 bps, assuming
you use the right waveguide. Now, if you are phond of photons, we can
escalate that a few megatons. Seriously, existing Unix implementations
might horse up to a couple of megabits, but Craymonsters on Hyperbolts
might up that a magnitude. Meanwhile, ask Dave Clark at MIT about the
NETBLT performance, which is currently dimming the lights on FATNET
at megabit speeds over real satellite channels.

Dave

lazear@GATEWAY.MITRE.ORG (09/29/87)

Karl,
	It sounds like you've removed the major bottlenecks and
throughput ought to approach infinity!  :-)  Without queue delays,
transmission delays, checksum computing delays, and retransmissions,
there's not much left to stall data.

	Walt

DCP@QUABBIN.SCRC.SYMBOLICS.COM (David C. Plummer) (09/29/87)

Given your assumptions, TCP is not the limiting factor.  The limiting
factors are the rates at which producers can produce data and consumers
can consume it.  It is quite easy to multibuffer TCP, and I suspect most
implementations have.  Given the horsepower of the consumer and the
producer, and the channel bandwidth, and the multibuffering methodology,
you can probably calculate the minimum window size that ensures the pipe
will always be full.

CASNER@VENERA.ISI.EDU (Stephen Casner) (09/29/87)

Karl,
	For all the variables you mention, you've specified values that
would pose no limit to TCP performance.  However, you left out one
critical variable:  delay.  If the bandwidth and processing power are
infinite, the sender will immediately transmit as many octets as the
receiver's window will allow.  The sender cannot transmit any more until
an ACK is returned, and the receiver cannot send an ACK until the data
gets there.  Therefore, the maximum throughput is one window's worth of
octets every round-trip delay time.

	The window size is determined by the amount of buffer space
available in the receiver.  If you assume buffer space is also
unlimited, then you bump into the first real limit:  the maximum TCP
window size is 64K octets because it is carried in a 16-bit field.  To
cite a real example, the Wideband Satellite Network has a raw data rate
of 3Mb/s, but it also has a round-trip delay time of 1.8 seconds.  The
maximum TCP throughput is 64K octets per 1.8 seconds or 290Kb/s.

	There have been proposals by the INENG Task Force and others to
define a TCP option to carry a larger window-size field.  If you assume
this option field could be as large as necessary, then the only limits
left are physical:  the speed of light and the distance between hosts.

							-- Ste.


kent@DECWRL.DEC.COM (09/29/87)

There are some protocol details (like the IP identification field) and
common MTUs that will limit the data rate.

chris

braden@BRADEN.ISI.EDU (09/30/87)

	
	Karl,
		It sounds like you've removed the major bottlenecks and
	throughput ought to approach infinity!  :-)  Without queue delays,
	transmission delays, checksum computing delays, and retransmissions,
	there's not much left to stall data.
	
		Walt
____________________________
	
The max TCP window size (2**16) divided by the round-trip delay.

    Bob Braden

mmolle@Q2.ICS.UCI.EDU.UUCP (09/30/87)

Hey, wait a minute!  Even assuming no errors, arbitrarily high data rates
and arbitrarily fast processors interpreting the protocols, there is still
propagation time across the channel to contend with.  Assuming a window
size of one (a.k.a. "stop and wait") you get at most one frame per round
trip time, i.e., a finite maximum data rate.  A larger (but still finite)
window size of W outstanding and unacknowledged frames raises that bound
by a factor of W, but it certainly doesn't give you infinite throughput.

Does that clarify the situation, or am I missing something deeper?

lamaster@pioneer.arpa (Hugh LaMaster) (09/30/87)

In article <8709290008.AA00477@ucbvax.Berkeley.EDU> AUERBACH@CSL.SRI.COM (Karl Auerbach) writes:

>Someone asked me today what are the performance limits on a TCP connection.
>The situation he posited was one in which there are no intervening
:
>resources in the hosts, and low noise.  It was further posited that

An interesting question. It depends on HOW low noise your connection is.
Because of acknowledgement and retransmission requirements, the faster the
link, the lower noise it has to be to maintain a high delivered fraction of
the raw channel speed.  This is in addition to the question of the interaction
of acknowledgement delay and window size, which some have mentioned, and which
is also a big problem.  To realize the high bandwidth in practice requires
host software smart enough to adaptively distinguish between a high bandwidth
low noise channel, and something like Arpanet, and adjust its behavior
appropriately to either situation.  Offhand, I am not sure how to do this.
Anybody have any ideas?





  Hugh LaMaster, m/s 233-9,  UUCP {topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

(Disclaimer: "All opinions solely the author's responsibility")

JBVB@AI.AI.MIT.EDU ("James B. VanBokkelen") (09/30/87)

    ... what are the performance limits on a TCP connection .... in which
    there are no intervening IP routers, massive availability of underlying
    bandwidth, massive computing resources in the hosts, and low noise.
    ... the hosts could be using acknowledgement strategies highly tuned for
    this specific situation.

    			Thanks, --karl--

My initial estimate would be based on the media bandwidth and the memory
bandwidth of the host.  If the memory bandwidth of the host limits, then
I would expect the data rate will be on the order of the memory bandwidth
times the number of times the data must be copied on the way out (count DMA
and the checksum, please).  If the net bandwidth limits, I would guess
somewhere between 40% and 80% of the network bandwidth, depending on the
architecture (CSMA/CD maybe towards the low side, well-implemented Token
Passing maybe towards the high side?).

Memory bandwidth definitely dominates on the implementation I am most
familiar with.  We recently unrolled the TCP checksum loop, and a ~35%
speed improvement there produced a ~15% overall throughput increase on
memory-to-memory TCP.  This got us up to 1.4 Mbits/sec between two 8 MHz
AT clones on a lightly-loaded Ethernet (with 3rd-generation boards - LAN
controller chip & multiple packet buffers, but no processor).  40% of a
4.3-modified Sun (3.5 Mbits/sec per a recent posting)...

James B. VanBokkelen
FTP Software Inc.

CERF@A.ISI.EDU.UUCP (09/30/87)

Karl,

I have been doing some work in this area to predict potential
performance at 100 Mb/s FDDI rates. First, you can expect a
TCP to execute about 1000-1200 instructions per packet,
assuming all is well. Some of this is checksum. In fact, I
recall some early statistics in which the checksum algorithm took up 40% of the
CPU cycles for processing incoming segments of TCP. I am lumping
IP level processing into TCP in the 1000-1200. This is absolutely
a WAG - so anyone with some hard data on instruction count is
most welcome to provide better info.

Roundtrip time and window size limitations affect performance
with respect to total throughput, all other things ignored.

Suppose you were operating at 10MB (megabytes)/second and
had a prop. delay of 10 microseconds on the ring.

With no constraints, and using 1000 byte packets, you would
be sending 10,000 packets/second to achieve full throughput.
That gives you 100 microseconds worth of instruction time.
At 14 MIPs, that is 1400 instructions. So, if you did nothing
else but TCP/IP, you might make it, with respect to
instruction rate.

Window size limitations: you can have 65,536 bytes outstanding
and at 10 MB/sec, it takes 6.5 ms to send them. Assuming you
can send the ack in the 100 microseconds you have to "process"
the packet, it takes 10 microseconds + 100 + 10 = 120 usec
(2 X propagation delay plus processing time) to get the ACK.
Even assuming another 100 usec to process the ACK, that's 220 usec
worth of window needed, minimum. That's only 2200 bytes - so
even a factor of 10 increase is well within the window capacity
of the TCP.

The killer in all this is propagation and round-trip delay. It
doesn't take much to cause the window requirement for full
bandwidth to exceed the maximum allowed window of outstanding
bytes. For example, a round-trip time of 10 ms (the prop
delay of only 1000 miles, one way!) requires a window of 
over 100KB to maintain full capacity at 10MB/sec.

What these very sketchy and rough numbers suggest is that
window-based schemes won't be very satisfactory for long haul,
long delay, very high speed nets. Flow control based on rates is needed,
rather than on round-trip acks/permissions - of course, this is precisely
the kind of thinking that I believe underlies the work that
Dave Clark has been doing with NETBLT (Dave, holler if I
have put words/thoughts into your work inappropriately).

To sum up, TCP can probably be made to work well in low delay
systems at pretty high speeds, maybe up to 100 Mbit/sec, but
probably not at a gigabit and probably not at long haul cross
country at 100's of megabits/sec.

By the way, most of these same considerations apply to ISO TP,
in my estimation.

Vint

lamaster@pioneer.arpa (Hugh LaMaster) (09/30/87)

In article <[A.ISI.EDU]30-Sep-87.09:40:42.CERF> CERF@A.ISI.EDU writes:

>
>What these very sketchy and rough numbers suggest is that
>window-based schemes won't be very satisfactory for long haul,
>long delay, very high speed nets. Flow control based on rates is needed,
>rather than on round-trip acks/permissions - of course, this is precisely
>the kind of thinking that I believe underlies the work that
>Dave Clark has been doing with NETBLT (Dave, holler if I
>have put words/thoughts into your work inappropriately).
>

I have seen/heard references to rate based flow control before - but I haven't
seen (or maybe remembered) what the idea is.  Is there a summary somewhere or
could someone summarize the basic idea?  (Is there an RFC? etc.)





  Hugh LaMaster, m/s 233-9,  UUCP {topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

(Disclaimer: "All opinions solely the author's responsibility")

CSYSMAS@OAC.UCLA.EDU.UUCP (10/02/87)

> I have been doing some work in this area to predict potential
> performance at 100 Mb/s FDDI rates. First, you can expect a
> TCP to execute about 1000-1200 instructions per packet,
> assuming all is well. Some of this is checksum. In fact, I
> recall some early statistics in which the checksum algorithm took up 40% of the
> CPU cycles for processing incoming segments of TCP. I am lumping
> IP level processing into TCP in the 1000-1200. This is absolutely
> a WAG - so anyone with some hard data on instruction count is
> most welcome to provide better info.

The instruction count sounds low to me; how about 10 times more?

(A 1000 byte packet sounds like it would take 500 adds just to
compute the TCP checksum, not to mention a 64K packet).

Speaking of checksums, it seems to me that the IP header checksum
could be replaced with a "packet" level CRC at the link level and
done by hardware.  Most (all?) HDLC type chips provide this
without any extra hardware (or effort).

Unfortunately for a TCP connection, most of the checksum overhead
is in the TCP checksum (which is an end-to-end check) and this
sounds harder to move off of the general purpose CPU.  The idea
would be to let your general purpose 14 MIP CPU do general
purpose work rather than adding up checksums.

> With no constraints, and using 1000 byte packets, you would
> be sending 10,000 packets/second to achieve full throughput.
> That gives you 100 microseconds worth of instruction time.
> At 14 MIPs, that is 1400 instructions. So, if you did nothing
> else but TCP/IP, you might make it, with respect to
> instruction rate.

I have been thinking of how to design a T3 (45Mb) type speed
packet switch (just thinking) and there are some real problems
with doing IP packet header processing when you need to process a
packet every 6 us.  (Voice packets want to be about 100 bytes so
you need to be able to handle about 56K packets/sec).

Virtual circuits sure seem easier at the packet level at this speed
(smaller packet overhead too).  Of course, a virtual circuit could
carry embedded IP packets.

howard@cos.UUCP (10/02/87)

In article <2922@ames.arpa>, lamaster@pioneer.arpa (Hugh LaMaster) writes:
> In article <8709290008.AA00477@ucbvax.Berkeley.EDU> AUERBACH@CSL.SRI.COM (Karl Auerbach) writes:
> 
> >Someone asked me today what are the performance limits on a TCP connection.
> >The situation he posited was one in which there are no intervening
> :
> >resources in the hosts, and low noise.  It was further posited that
> 
> An interesting question. It depends on HOW low noise your connection is.
> Because of acknowledgement and retransmission requirements, the faster the
> link, the lower noise it has to be to maintain a high delivered fraction of
> the raw channel speed.  This is in addition to the question of the interaction
> of acknowledgement delay and window size, which some have mentioned, and which
> is also a big problem.  To realize the high bandwidth in practice requires
> host software smart enough to adaptively distinguish between a high bandwidth
> low noise channel, and something like Arpanet, and adjust its behavior
> appropriately to either situation.  Offhand, I am not sure how to do this.
> Anybody have any ideas?

     A mechanism for implementing adaption to noise could be
   drawn from the SNA technique for varying the window size
   on FID4 Transmission Groups.  Essentially, a given path
   starts out with a default window size, which can be changed
   by nodes along the way based on their buffer availability
   status.  For severe congestion, a node can reset the
   window size to 1, for minor congestion, a node can decrease
   the window size BY 1.  If a node feels it can handle more
   traffic, it can also set a bit indicating it would like
   to increase the window size by 1.

     While this mechanism is intended more for congestion control
   adaption rather than error adaption, I did use it in a
   protocol for a network management system which used both
   dedicated and dial lines for control channels.  I started
   with a value assumed for dial lines, and gradually increased
   the frame and window size based on modified error-free 
   interval length.  By "modified," I mean that I kept an error
   counter which was decremented and incremented differently
   for retransmissions and for successfully transmitted sequences --
   retransmissions decremented the error-free interval counter
   by a lesser value than did long sequences of successful
   transmission.  This modification protected the frame and
   window sizes from radical changes due to error bursts.
   In general, those sizes were changed occasionally by 1,
   at thresholds determined by simulation, to give the
   best mixture of parameters for the encountered error conditions.  
-- 
-- howard(Howard C. Berkowitz) @cos.com
 {uunet,  decuac, sun!sundc, hadron, hqda-ai}!cos!howard
(703) 883-2812 [ofc] (703) 998-5017 [home]
DISCLAIMER:  I explicitly identify COS official positions.

narten@PURDUE.EDU.UUCP (10/02/87)

Does anyone have pointers to work being done in the performance of
transport protocols (including TCP), when communications links are not
reliable?  E.g., how do various protocols behave in the presence of
lost datagrams?  I am asking this more from the theoretical point of
view than from the angle of tuning an existing implementation. For
instance, if 10% of the packets are lost, what happens to the
throughput of TCP? I realize that there are a lot of variables that go
into this, but it is still interesting to fix various parameters
while varying packet loss rate, or to observe how window size, RTT,
and packet lossage interact.

Thomas

CERF@A.ISI.EDU (10/03/87)

As you move upward in the bandwidth range, you have decreasing amounts of
time to process each object passing through - the high speed switching fabrics
developed at Bell Labs and Bell Core have the characteristic that very little
per packet processing is being done and, as you surmise, a kind of virtual
circuit is set up; however, the switching fabric is able to share the
bandwidth of the transmission resources, despite the VC set-up, because
the VCs are just table entries and not reservations of actual capacity
within each trunk.

At the terminations of such a switching fabric, however, I think one
still will need some end/end checking. Moreover, if we contemplate
the kind of linking of networks we have today (vastly different internal
operation), we may still need the end/end checking that TCP does. Rather
than putting the TCP in the mainframe, anymore, though, it is fair to
consider the sort of DMA interface which permits the TCP and perhaps
other layers of protocol to be housed on an external board, equipped with
special purpose logic or at least dedicated processing, placing data or
taking data from the main processor's memory. Communications becomes a
matter of placing buffers in memory and possibly signalling their arrival
to the communications processor.

I suspect that this description is not far away from the kinds of
equipment already made and sold by companies like EXCELAN and ACC.
Forgive me if I failed to mention the dozens of other companies whose
products may work this way - I'm not up to speed on all the commercial
products now flowering in the TCP/IP fields.

At an IP switch point, you are still faced with at least full IP
level processing and handling 45 Mb/s and up at that point is
still a big processing load, unless we re-think the IP level and
try to find a way to fabricize it, as the Bell folks have with
the lower level packet switching. Hmm.

Vint

MAB@CORNELLC.BITNET (Mark Bodenstein) (10/05/87)

>  ...           We recently unrolled the TCP checksum loop, and a ~35%
>speed improvement there produced a ~15% overall throughput increase on
>memory-to-memory TCP.
>  ...
>James B. VanBokkelen
>FTP Software Inc.
>

Could you provide more detail on how you unrolled this loop?

(The complication being that the length of the loop is determined by
the length of the data.  Some alternatives I can think of would be:

1. to keep checking to see if you're done

2. to unroll the loop for each possible data length, and choose and
   execute the appropriate unrolled loop

3. to pad the data with zeros to the maximum length to be processed

(One could also try various combinations of these three.)

1. seems inefficient, and thus defeats the purpose of unrolling the loop.
2. seems efficient, but perhaps excessive.
3. seems to penalize short segments (e.g. ACKs), of which there are
   many.

Or am I missing something?)

Thanks.

Mark Bodenstein      (mab@cornellc.bitnet@wiscvm.wisc.edu)
Cornell University

CLYNN@G.BBN.COM (10/05/87)

Mark,
	One could, given suitable hardware, etc. compute the checksum more
than 16 bits at a time - e.g., use 32-bit arithmetic (or some other higher
number).  There are at least two ways - adding 16 bits into a wider accumulator
which reduces/eliminates checking for carry/overflow, or adding 32 bits at a
time into the accumulator, reducing the number of memory fetches/adds/carry
checks by a factor of 2 (more if you have 64 bit arithmetic, etc.).
	One can also compute the checksum as the data is being placed
into the packet, instead of computing it after the packet has been "built".
	On retransmissions/repacketizations, one can "update" the checksum
(subtract out the old (header) values and add in the new) instead of
recomputing the sum of the data each time (the checksum is weak enough that
byte-pair position, order of summation, etc. doesn't matter).

jon@Cs.Ucl.AC.UK (Jon Crowcroft) (10/06/87)

 Since TCP is a transport protocol, it has to reach the processes
lower protocols can't reach. This means (in most OS's (eg
Used Never In Xanadu)), at least one buffer copy, and one or two context
switches per packet. 

However good your windowing, and lossless the net, that bites 
bad. If the hardware could keep track of the actual user buffers (and scatter
gather them too) straight in and out of user processes, then you are down
to DMA speeds. The trick is to design clever enough hardware that the OS 
designers will trust not to trash the OS. 

You lose two ways because of current hardware - packet overrun (see many
writings by Dave Cheriton) causing lower throughput, and higher latency.

Another win would be hardware that 
provided the higher level protocols with a CRC [Say buffer descriptor
list contains pointer to where to put CRC in outgoing packets, and how to
do it for incoming packets].

It's instructive to look at fast gateways - two lance ethernet chips on a 
multibus can blow away the bus bandwidth - really fast boxes have no bus,
or have a binary tree bus a la buttergates...

Jon

DCP@QUABBIN.SCRC.SYMBOLICS.COM (David C. Plummer) (10/06/87)

    Date:         Mon, 5 Oct 1987 11:24:26 EDT
    From:         Mark Bodenstein <MAB%CORNELLC.BITNET@wiscvm.wisc.edu>

    Could you provide more detail on how you unrolled this loop?

    (The complication being that the length of the loop is determined by
    the length of the data.  Some alternatives I can think of would be:

	...

There is a fourth way that we (Symbolics) have done which you did not
mention:

(a) Pick a compile-time unrolling factor, usually a power of 2, say 16 = 2^4.
(b) Divide the data length by the unrolling factor, obtaining a quotient
    and remainder.  When the unrolling factor is a power of two, the
    quotient is a shift and the remainder is a logical AND.
(c) Write an unrolled loop whose length is the unrolling factor.  Execute
    this loop <quotient> times.
(d) Write an un-unrolled loop (whose length is therefore 1).  Execute
    this loop <remainder> times.

mminnich@udel.EDU (Mike Minnich) (10/06/87)

In article <8710051601.AA14825@ucbvax.Berkeley.EDU>, MAB@CORNELLC.BITNET (Mark Bodenstein) writes:
> Could you provide more detail on how you unrolled this loop?
> 
> (The complication being that the length of the loop is determined by
> the length of the data.  Some alternatives I can think of would be:
> 
> 2. to unroll the loop for each possible data length, and choose and
>    execute the appropriate unrolled loop
> 

A simple technique that has worked well for me in the past is to unroll
the loop for the longest possible length and then compute a jump into
the unrolled loop based on the length of the data to be
checksummed/copied/etc.  Only one version of the unrolled loop is
needed in this case.

mike
-- 
Mike Minnich

ihm@nrcvax.UUCP (Ian H. Merritt) (10/07/87)

>The instruction count sounds low to me, how about 10 times more?

This sounds more reasonable.

>
>(A 1000 byte packet sounds like it would take 500 adds just to
>compute the TCP checksum, not to mention a 64K packet).

Since the 1's complement arithmetic is so flexible, this could of
course be improved on 32 bit processors, for example, but still
represents significantly more than 1000 instructions...
	.
	.
	.

>Unfortunately for a TCP connection, most of the checksum overhead
>is in the TCP checksum (which is an end-to-end check) and this
>sounds harder to move off of the general purpose CPU.  The idea
>would be to let your general purpose 14 MIP[S] CPU do general
>purpose work rather than adding up checksums.
>
	.
	.
	.

>I have been thinking of how to design a T3 (45Mb) type speed
>packet switch (just thinking) and there are some real problems
>with doing IP packet header processing when you need to process a
>packet every 6 us.  (Voice packets want to be about 100 bytes so
>you need to be able to handle about 56K packets/sec).
>
>Virtual circuits sure seem easier at the packet level at this speed
>(smaller packet overhead too).  Of course, a virtual circuit could
>carry embedded IP packets.

This suggests a dedicated packet concentrator arrangement such that
long-haul very high speed (VHS (:->) ) circuits could be utilized by
grouping large numbers of small (i.e. ip/tcp-size) packets bound for
the same region together into mostly-reliable megapackets shipped over
the high-speed link to be unpacked and sent off to gateways at the
far end.

TCP's reliability features would be sufficient to permit this to work
with little or no megapacket-level error handling, but such
functionality could be included by using nonflexible (therefore
simple) algorithms implemented in VLSI that operate on the entire
megapacket or by cutting it up into arbitrary segments without any
content-specific knowledge.  The common-channel interoffice signalling
(CCIS) protocol used by the bell system incorporated a similar scheme
(albeit on a smaller (and slower) scale), allowing selective
retransmission of portions of a megapacket while verifying the
validity of the rest of the data.
					--i

ihm@nrcvax.UUCP (Ian H. Merritt) (10/08/87)

>There is a fourth way that we (Symbolics) have done which you did not
>mention:
>
>(a) Pick a compile-time unrolling factor, usually a power of 2, say 16 = 2^4.
>(b) Divide the data length by the unrolling factor, obtaining a quotient
>    and remainder.  When the unrolling factor is a power of two, the
>    quotient is a shift and the remainder is a logical AND.
>(c) Write an unrolled loop whose length is the unrolling factor.  Execute
>    this loop <quotient> times.
>(d) Write an un-unrolled loop (whose length is therefore 1).  Execute
>    this loop <remainder> times.

Or if you have memory to burn (which is fast becoming a common
condition), just unroll the loop for the maximum condition and branch
into it at the appropriate point to process the length of the actual
packet.
					--i

CERF@A.ISI.EDU (10/10/87)

thanks for the observations - what you describe also bears an uncanny
similarity to the internals of the early TYMNET which aggregated bytes
from terminals and grouped them into frames which were sent reliably
on a hop-by-hop basis, broken down at each hop and re-formed based on
new traffic arriving at that node and the fact that the bytes in the
frame might be destined to flow out on different trunks and so had
to be re-marshalled into new frames.

Vint

DCP@QUABBIN.SCRC.SYMBOLICS.COM (David C. Plummer) (10/13/87)

    Date: 8 Oct 87 16:05:47 GMT
    From: csustan!csun!psivax!nrcvax!ihm@LLL-WINKEN.ARPA  (Ian H. Merritt)

    >There is a fourth way that we (Symbolics) have done which you did not
    >mention:
    >
    >(a) Pick a compile-time unrolling factor, usually a power of 2, say 16 = 2^4.
    >(b) Divide the data length by the unrolling factor, obtaining a quotient
    >    and remainder.  When the unrolling factor is a power of two, the
    >    quotient is a shift and the remainder is a logical AND.
    >(c) Write an unrolled loop whose length is the unrolling factor.  Execute
    >    this loop <quotient> times.
    >(d) Write an un-unrolled loop (whose length is therefore 1).  Execute
    >    this loop <remainder> times.

    Or if you have memory to burn (which is fast becoming a common
    condition), just unroll the loop for the maximum condition and branch
    into it at the appropriate point to process the length of the actual
    packet.

First of all, that's 65535 octets for TCP.  Second, I believe that was
one of the three techniques mentioned by the person to whom I was
replying.  Third, we (Symbolics) can't do that without playing some
really nasty games with the compiler.  You see, we're of the opinion
that assembly language is a thing of the past, and there aren't any good
Lisp constructs for the kind of computed GO necessary to pull this trick
off.  I can't think of any good tricks in FORTRAN, either.  I'm not
familiar enough with Pascal, Ada, or C to know whether those higher-level
languages allow such things.  Fourth, depending on your CPU architecture, there
may be ideal unrolling constants which would keep the unrolled loop
inside an instruction prefetch buffer; complete unrolling would actually
be a degradation.

henry@utzoo.UUCP (10/23/87)

> ... Fourth, depending on your CPU architecture, there
> may be ideal unrolling constants which would keep the unrolled loop
> inside an instruction prefetch buffer; complete unrolling would actually
> be a degradation.

In particular, almost any CPU with a cache -- which means most anything above
the PC level nowadays -- will have an optimum degree of unrolling for loops
that iterate a given number of times.  It's not just a question of whether
the loop will fit; eventually the extra main-memory fetches needed to get
a larger loop into the cache wipe out the gains from reduced loop-control
overhead.  For straightforward caches (with a loop that will *fit* in the
cache!), elapsed time versus degree of unrolling is a nice smooth curve with
a quite marked minimum.  Based on the look I took at this, if the ratio of
your cache speed to memory speed isn't striking, and your loop control is
not grossly costly (due to e.g. pipeline breaks), the minimum has a good
chance of falling at a fairly modest unrolling factor, maybe 8 or 16.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

PAP4@AI.AI.MIT.EDU ("Philip A. Prindeville") (10/26/87)

Why is this starting to look like `net.arch' to me?  Could it be
the academic throttling of an overly esoteric subject?

I think we've all made our basic points -- let's leave something
as an exercise for the (already bored) reader.

Heading for the fall-out shelters...

-Philip