AUERBACH@CSL.SRI.COM (Karl Auerbach) (09/29/87)
Someone asked me today what are the performance limits on a TCP connection. The situation he posited was one in which there are no intervening IP routers, massive availability of underlying bandwidth, massive computing resources in the hosts, and low noise. It was further posited that the hosts could be using acknowledgement strategies highly tuned for this specific situation. How to make such an estimate is well outside my expertise. Can anyone give me any pointers? Thanks, --karl--
Mills@UDEL.EDU (09/29/87)
Karl, Geez, how can I respond to such a wonder? Given your scenario, I would posit something like X-band speeds, which works out to 10^10 bps, assuming you use the right waveguide. Now, if you are phond of photons, we can escalate that a few megatons. Seriously, existing Unix implementations might horse up to a couple of megabits, but Craymonsters on Hyperbolts might up that a magnitude. Meanwhile, ask Dave Clark at MIT about the NETBLT performance, which is currently dimming the lights on FATNET at megabit speeds over real satellite channels. Dave
lazear@GATEWAY.MITRE.ORG (09/29/87)
Karl, It sounds like you've removed the major bottlenecks and throughput ought to approach infinity! :-) Without queue delays, transmission delays, checksum computing delays, and retransmissions, there's not much left to stall data. Walt
DCP@QUABBIN.SCRC.SYMBOLICS.COM (David C. Plummer) (09/29/87)
Given your assumptions, TCP is not the limiting factor. The limiting factors are the rates at which producers can produce data and consumers can consume it. It is quite easy to multibuffer TCP, and I suspect most implementations do. Given the horsepower of the consumer and the producer, the channel bandwidth, and the multibuffering methodology, you can probably calculate the minimum window size that ensures the pipe will always be full.
CASNER@VENERA.ISI.EDU (Stephen Casner) (09/29/87)
Karl, For all the variables you mention, you've specified values that would pose no limit to TCP performance. However, you left out one critical variable: delay. If the bandwidth and processing power are infinite, the sender will immediately transmit as many octets as the receiver's window will allow. The sender cannot transmit any more until an ACK is returned, and the receiver cannot send an ACK until the data gets there. Therefore, the maximum throughput is one window's worth of octets every round-trip delay time.

The window size is determined by the amount of buffer space available in the receiver. If you assume buffer space is also unlimited, then you bump into the first real limit: the maximum TCP window size is 64K octets because it is carried in a 16-bit field. To cite a real example, the Wideband Satellite Network has a raw data rate of 3Mb/s, but it also has a round-trip delay time of 1.8 seconds. The maximum TCP throughput is 64K octets per 1.8 seconds or 290Kb/s.

There have been proposals by the INENG Task Force and others to define a TCP option to carry a larger window-size field. If you assume this option field could be as large as necessary, then the only limits left are physical: the speed of light and the distance between hosts. -- Steve
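To make the arithmetic concrete, here is a back-of-the-envelope sketch in C using the Wideband numbers above (an illustration, not a measurement):

    #include <stdio.h>

    int main(void)
    {
        double window = 65535.0;   /* largest 16-bit TCP window, octets */
        double rtt    = 1.8;       /* Wideband round-trip time, seconds */
        double bps    = window * 8.0 / rtt;

        /* one window per round trip: about 291,000 bits/sec */
        printf("max throughput: %.0f bits/sec\n", bps);
        return 0;
    }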
kent@DECWRL.DEC.COM (09/29/87)
There are some protocol details (like the IP identification field) and common MTUs that will limit the data rate. chris
braden@BRADEN.ISI.EDU (09/30/87)
> Karl, It sounds like you've removed the major bottlenecks and throughput
> ought to approach infinity! :-) Without queue delays, transmission delays,
> checksum computing delays, and retransmissions, there's not much left to
> stall data. Walt

The max TCP window size (2**16) divided by the round-trip delay. Bob Braden
mmolle@Q2.ICS.UCI.EDU.UUCP (09/30/87)
Hey, wait a minute! Even assuming no errors, arbitrarily high data rates and arbitrarily fast processors interpreting the protocols, there is still propagation time across the channel to contend with. Assuming a window size of one (a.k.a. "stop and wait") you get at most one frame per round trip time, i.e., a finite maximum data rate. A larger (but still finite) window size of W outstanding and unacknowledged frames raises that bound by a factor of W, but it certainly doesn't give you infinite throughput. Does that clarify the situation, or am I missing something deeper?
lamaster@pioneer.arpa (Hugh LaMaster) (09/30/87)
In article <8709290008.AA00477@ucbvax.Berkeley.EDU> AUERBACH@CSL.SRI.COM (Karl Auerbach) writes:
>Someone asked me today what are the performance limits on a TCP connection.
>The situation he posited was one in which there are no intervening
:
>resources in the hosts, and low noise. It was further posited that

An interesting question. It depends on HOW low noise your connection is. Because of acknowledgement and retransmission requirements, the faster the link, the lower the noise has to be to maintain a high delivered fraction of the raw channel speed. This is in addition to the question of the interaction of acknowledgement delay and window size, which some have mentioned, and which is also a big problem. To realize the high bandwidth in practice requires host software smart enough to adaptively distinguish between a high bandwidth low noise channel, and something like Arpanet, and adjust its behavior appropriately to either situation. Offhand, I am not sure how to do this. Anybody have any ideas?

Hugh LaMaster, m/s 233-9     UUCP {topaz,lll-crg,ucbvax}!ames!pioneer!lamaster
NASA Ames Research Center    ARPA lamaster@ames-pioneer.arpa
Moffett Field, CA 94035      ARPA lamaster@pioneer.arc.nasa.gov
Phone: (415)694-6117         (Disclaimer: "All opinions solely the author's responsibility")
JBVB@AI.AI.MIT.EDU ("James B. VanBokkelen") (09/30/87)
... what are the performance limits on a TCP connection .... in which there are no intervening IP routers, massive availability of underlying bandwidth, massive computing resources in the hosts, and low noise. ... the hosts could be using acknowledgement strategies highly tuned for this specific situation. Thanks, --karl--

My initial estimate would be based on the media bandwidth and the memory bandwidth of the host. If the memory bandwidth of the host limits, then I would expect the data rate will be on the order of the memory bandwidth divided by the number of times the data must be copied on the way out (count DMA and the checksum, please). If the net bandwidth limits, I would guess somewhere between 40% and 80% of the network bandwidth, depending on the architecture (CSMA/CD maybe towards the low side, well-implemented Token Passing maybe towards the high side?).

Memory bandwidth definitely dominates on the implementation I am most familiar with. We recently unrolled the TCP checksum loop, and a ~35% speed improvement there produced a ~15% overall throughput increase on memory-to-memory TCP. This got us up to 1.4 Mbits/sec between two 8 MHz AT clones on a lightly-loaded Ethernet (with 3rd-generation boards - LAN controller chip & multiple packet buffers, but no processor). That is 40% of a 4.3-modified Sun (3.5 Mbits/sec per a recent posting)...

James B. VanBokkelen
FTP Software Inc.
CERF@A.ISI.EDU.UUCP (09/30/87)
Karl, I have been doing some work in this area to predict potential performance at 100 Mb/s FDDI rates. First, you can expect a TCP to execute about 1000-1200 instructions per packet, assuming all is well. Some of this is checksum. In fact, I recall some early statistics in which the checksum algorithm took up 40% of the CPU cycles for processing incoming segments of TCP. I am lumping IP level processing into TCP in the 1000-1200. This is absolutely a WAG - so anyone with some hard data on instruction count is most welcome to provide better info.

Roundtrip time and window size limitations affect performance with respect to total throughput, all other things ignored. Suppose you were operating at 10 MB (megabytes)/second and had a prop. delay of 10 microseconds on the ring. With no constraints, and using 1000 byte packets, you would be sending 10,000 packets/second to achieve full throughput. That gives you 100 microseconds worth of instruction time. At 14 MIPS, that is 1400 instructions. So, if you did nothing else but TCP/IP, you might make it, with respect to instruction rate.

Window size limitations: you can have 65,536 bytes outstanding and at 10 MB/sec, it takes 6.5 ms to send them. Assuming you can send the ack in the 100 microseconds you have to "process" the packet, it takes 10 microseconds + 100 + 10 = 120 usec (2 x propagation delay plus processing time) to get the ACK. Even assuming another 100 usec to process the ACK, that's 220 usec worth of window needed, minimum. That's only 2200 bytes - so even a factor of 10 increase is well within the window capacity of the TCP.

The killer in all this is propagation and round-trip delay. It doesn't take much to cause the window requirement for full bandwidth to exceed the maximum allowed window of outstanding bytes. For example, a round-trip time of 10 ms (the prop delay of only 1000 miles, one way!) requires a window of over 100KB to maintain full capacity at 10 MB/sec.

What these very sketchy and rough numbers suggest is that window-based schemes won't be very satisfactory for long haul, long delay, very high speed nets. Flow control based on rates is needed, rather than on round-trip acks/permissions - of course, this is precisely the kind of thinking that I believe underlies the work that Dave Clark has been doing with NETBLT (Dave, holler if I have put words/thoughts into your work inappropriately).

To sum up, TCP can probably be made to work well in low delay systems at pretty high speeds, maybe up to 100 Mbit/sec, but probably not at a gigabit and probably not at long haul cross country at 100's of megabits/sec. By the way, most of these same considerations apply to ISO TP, in my estimation. Vint
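A quick program to reproduce the budget arithmetic above (the numbers are the ones in the text, not measurements):

    #include <stdio.h>

    int main(void)
    {
        double rate  = 10e6;    /* link rate: 10 megabytes/second */
        double pkt   = 1000.0;  /* packet size, bytes */
        double mips  = 14e6;    /* host speed: 14 MIPS */
        double rtt   = 0.010;   /* 10 ms round trip */

        double pps   = rate / pkt;   /* 10,000 packets/second */
        double usec  = 1e6 / pps;    /* 100 microseconds per packet */
        double insns = mips / pps;   /* 1400 instructions per packet */

        printf("%.0f pkts/sec, %.0f usec/pkt, %.0f instructions/pkt\n",
               pps, usec, insns);
        /* window needed to keep the pipe full: rate x RTT = 100 KB */
        printf("window for full capacity: %.0f bytes\n", rate * rtt);
        return 0;
    }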
lamaster@pioneer.arpa (Hugh LaMaster) (09/30/87)
In article <[A.ISI.EDU]30-Sep-87.09:40:42.CERF> CERF@A.ISI.EDU writes:
>What these very sketchy and rough numbers suggest is that
>window-based schemes won't be very satisfactory for long haul,
>long delay, very high speed nets. Flow control based on rates is needed,
>rather than on round-trip acks/permissions - of course, this is precisely
>the kind of thinking that I believe underlies the work that
>Dave Clark has been doing with NETBLT (Dave, holler if I
>have put words/thoughts into your work inappropriately).

I have seen/heard references to rate based flow control before - but I haven't seen (or maybe remembered) what the idea is. Is there a summary somewhere or could someone summarize the basic idea? (Is there an RFC? etc.)

Hugh LaMaster, m/s 233-9     UUCP {topaz,lll-crg,ucbvax}!ames!pioneer!lamaster
NASA Ames Research Center    ARPA lamaster@ames-pioneer.arpa
Moffett Field, CA 94035      ARPA lamaster@pioneer.arc.nasa.gov
Phone: (415)694-6117         (Disclaimer: "All opinions solely the author's responsibility")
CSYSMAS@OAC.UCLA.EDU.UUCP (10/02/87)
> I have been doing some work in this area to predict potential
> performance at 100 Mb/s FDDI rates. First, you can expect a
> TCP to execute about 1000-1200 instructions per packet,
> assuming all is well. Some of this is checksum. In fact, I
> recall some early statistics in which the checksum algorithm took up 40%
> of the CPU cycles for processing incoming segments of TCP. I am lumping
> IP level processing into TCP in the 1000-1200. This is absolutely
> a WAG - so anyone with some hard data on instruction count is
> most welcome to provide better info.

The instruction count sounds low to me, how about 10 times more? (A 1000 byte packet sounds like it would take 500 adds just to compute the TCP checksum, not to mention a 64K packet).

Speaking of checksums, it seems to me that the IP header checksum could be replaced with a "packet" level CRC at the link level and done by hardware. Most (all?) HDLC type chips provide this without any extra hardware (or effort). Unfortunately for a TCP connection, most of the checksum overhead is in the TCP checksum (which is an end-to-end check) and this sounds harder to move off of the general purpose CPU. The idea would be to let your general purpose 14 MIPS CPU do general purpose work rather than adding up checksums.

> With no constraints, and using 1000 byte packets, you would
> be sending 10,000 packets/second to achieve full throughput.
> That gives you 100 microseconds worth of instruction time.
> At 14 MIPS, that is 1400 instructions. So, if you did nothing
> else but TCP/IP, you might make it, with respect to
> instruction rate.

I have been thinking of how to design a T3 (45Mb) type speed packet switch (just thinking) and there are some real problems with doing IP packet header processing when you need to process a packet every 6 us. (Voice packets want to be about 100 bytes so you need to be able to handle about 56K packets/sec). Virtual circuits sure seem easier at the packet level at this speed (smaller packet overhead too). Of course, a virtual circuit could carry embedded IP packets.
howard@cos.UUCP (10/02/87)
In article <2922@ames.arpa>, lamaster@pioneer.arpa (Hugh LaMaster) writes:
> In article <8709290008.AA00477@ucbvax.Berkeley.EDU> AUERBACH@CSL.SRI.COM (Karl Auerbach) writes:
> >Someone asked me today what are the performance limits on a TCP connection.
> >The situation he posited was one in which there are no intervening
> :
> >resources in the hosts, and low noise. It was further posited that
>
> An interesting question. It depends on HOW low noise your connection is.
> Because of acknowledgement and retransmission requirements, the faster the
> link, the lower the noise has to be to maintain a high delivered fraction of
> the raw channel speed. This is in addition to the question of the interaction
> of acknowledgement delay and window size, which some have mentioned, and which
> is also a big problem. To realize the high bandwidth in practice requires
> host software smart enough to adaptively distinguish between a high bandwidth
> low noise channel, and something like Arpanet, and adjust its behavior
> appropriately to either situation. Offhand, I am not sure how to do this.
> Anybody have any ideas?

A mechanism for implementing adaptation to noise could be drawn from the SNA technique for varying the window size on FID4 Transmission Groups. Essentially, a given path starts out with a default window size, which can be changed by nodes along the way based on their buffer availability status. For severe congestion, a node can reset the window size to 1; for minor congestion, a node can decrease the window size BY 1. If a node feels it can handle more traffic, it can also set a bit indicating it would like to increase the window size by 1. (A sketch of this adjustment rule appears below.)

While this mechanism is intended more for congestion control adaptation than error adaptation, I did use it in a protocol for a network management system which used both dedicated and dial lines for control channels. I started with a value assumed for dial lines, and gradually increased the frame and window size based on modified error-free interval length. By "modified," I mean that I kept an error counter which was decremented and incremented differently for retransmissions and for successfully transmitted sequences -- retransmissions decremented the error-free interval counter by a lesser value than did long sequences of successful transmission. This modification protected the frame and window sizes from radical changes due to error bursts. In general, those sizes were changed occasionally by 1, at thresholds determined by simulation, to give the best mixture of parameters for the encountered error conditions.

-- howard (Howard C. Berkowitz) @cos.com
{uunet, decuac, sun!sundc, hadron, hqda-ai}!cos!howard
(703) 883-2812 [ofc] (703) 998-5017 [home]
DISCLAIMER: I explicitly identify COS official positions.
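For concreteness, the adjustment rule above might look like this in C (the names are invented for illustration and are not SNA's):

    /* Per-path send window, adjusted hop-by-hop as described:
     * reset to 1 on severe congestion, decrement by 1 on minor
     * congestion, increment by 1 when a node asks for more. */
    typedef struct {
        int window;        /* current window, in frames */
        int max_window;    /* configured upper bound */
    } path_state;

    void adjust_window(path_state *p,
                       int severe, int minor, int wants_more)
    {
        if (severe)
            p->window = 1;
        else if (minor && p->window > 1)
            p->window -= 1;
        else if (wants_more && p->window < p->max_window)
            p->window += 1;
    }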
narten@PURDUE.EDU.UUCP (10/02/87)
Does anyone have pointers to work being done in the performance of transport protocols (including TCP), when communications links are not reliable? E.g., how do various protocols behave in the presence of lost datagrams. I am asking this more from the theoretical point of view than from the angle of tuning an existing implementation. For instance, if 10% of the packets are lost, what happens to the throughput of TCP? I realize that there are a lot of variables that go into this, but it is still interesting to fix various parameters while varying packet loss rate, or to observe how window size, RTT, and packet lossage interact. Thomas
CERF@A.ISI.EDU (10/03/87)
As you move upward in the bandwidth range, you have decreasing amounts of time to process each object passing through - the high speed switching fabrics developed at Bell Labs and Bellcore have the characteristic that very little per packet processing is being done and, as you surmise, a kind of virtual circuit is set up; however, the switching fabric is able to share the bandwidth of the transmission resources, despite the VC set-up, because the VCs are just table entries and not reservations of actual capacity within each trunk.

At the terminations of such a switching fabric, however, I think one still will need some end/end checking. Moreover, if we contemplate the kind of linking of networks we have today (vastly different internal operation), we may still need the end/end checking that TCP does. Rather than putting the TCP in the mainframe, though, it is fair to consider the sort of DMA interface which permits the TCP and perhaps other layers of protocol to be housed on an external board, equipped with special purpose logic or at least dedicated processing, placing data or taking data from the main processor's memory. Communications becomes a matter of placing buffers in memory and possibly signalling their arrival to the communications processor. I suspect that this description is not far away from the kinds of equipment already made and sold by companies like EXCELAN and ACC. Forgive me if I failed to mention the dozens of other companies whose products may work this way - I'm not up to speed on all the commercial products now flowering in the TCP/IP fields.

At an IP switch point, you are still faced with at least full IP level processing and handling 45 Mb/s and up at that point is still a big processing load, unless we re-think the IP level and try to find a way to fabricize it, as the Bell folks have with the lower level packet switching. Hmm. Vint
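As a sketch of what that buffer handoff might look like from the host side (every name here is invented for illustration, not any vendor's actual interface):

    #include <stddef.h>

    #define HOST_OWNS  0
    #define BOARD_OWNS 1

    struct comm_buf {
        volatile int owner;        /* which side may touch the buffer */
        size_t       len;          /* bytes of user data */
        char         data[1500];
    };

    extern volatile int *doorbell; /* hypothetical board signal register */

    /* Hand a filled buffer to the outboard protocol processor and
     * signal its arrival. */
    void send_buf(struct comm_buf *b, const char *src, size_t n)
    {
        size_t i;
        for (i = 0; i < n && i < sizeof b->data; i++)
            b->data[i] = src[i];
        b->len = i;
        b->owner = BOARD_OWNS;
        *doorbell = 1;
    }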
MAB@CORNELLC.BITNET (Mark Bodenstein) (10/05/87)
> ... We recently unrolled the TCP checksum loop, and a ~35%
>speed improvement there produced a ~15% overall throughput increase on
>memory-to-memory TCP.
> ...
>James B. VanBokkelen
>FTP Software Inc.

Could you provide more detail on how you unrolled this loop?

(The complication being that the length of the loop is determined by the length of the data. Some alternatives I can think of would be:

1. to keep checking to see if you're done
2. to unroll the loop for each possible data length, and choose and execute the appropriate unrolled loop
3. to pad the data with zeros to the maximum length to be processed

(One could also try various combinations of these three.)

1. seems inefficient, and thus defeats the purpose of unrolling the loop. 2. seems efficient, but perhaps excessive. 3. seems to penalize short segments (e.g. ACKs), of which there are many. Or am I missing something?)

Thanks.

Mark Bodenstein (mab%cornellc.bitnet@wiscvm.wisc.edu)
Cornell University
CLYNN@G.BBN.COM (10/05/87)
Mark, One could, given suitable hardware, etc., compute the checksum more than 16 bits at a time - e.g., use 32-bit arithmetic (or some other higher number). There are at least two ways - adding 16 bits into a wider accumulator, which reduces/eliminates checking for carry/overflow, or adding 32 bits at a time into the accumulator, reducing the number of memory fetches/adds/carry checks by a factor of 2 (more if you have 64 bit arithmetic, etc.). One can also compute the checksum as the data is being placed into the packet, instead of computing it after the packet has been "built". On retransmissions/repacketizations, one can "update" the checksum (subtract out the old (header) values and add in the new) instead of recomputing the sum of the data each time (the checksum is weak enough that byte-pair position, order of summation, etc. doesn't matter).
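To illustrate both points, a sketch in C (illustration only; it assumes an even byte count and aligned data). The first routine adds 16-bit words into a 32-bit accumulator and folds the carries once at the end; the second updates an existing checksum when one 16-bit word of the packet changes:

    #include <stddef.h>

    /* Sum 16-bit words into a 32-bit accumulator; carries pile up
     * in the top half and are folded back in at the end, so there
     * is no per-add overflow check. */
    unsigned short cksum(const unsigned short *p, size_t len_bytes)
    {
        unsigned long sum = 0;
        size_t n = len_bytes / 2;

        while (n--)
            sum += *p++;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
    }

    /* Incremental update: subtract the old word's contribution and
     * add the new one, instead of re-summing the whole packet. */
    unsigned short cksum_update(unsigned short oldck,
                                unsigned short oldw,
                                unsigned short neww)
    {
        unsigned long sum = (unsigned short)~oldck;

        sum += (unsigned short)~oldw;   /* remove old word */
        sum += neww;                    /* add new word */
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
    }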
jon@Cs.Ucl.AC.UK (Jon Crowcroft) (10/06/87)
Since TCP is a transport protocol, it has to reach the processes lower protocols can't reach. This means (in most OS's (eg Used Never In Xanadu)), at least one buffer copy, and one or two context switches per packet. However good your windowing, and however lossless the net, that bites bad. If the hardware could keep track of the actual user buffers (and scatter/gather them too) straight in and out of user processes, then you are down to DMA speeds. The trick is to design clever enough hardware that the OS designers will trust not to trash the OS. You lose two ways because of current hardware - packet overrun (see many writings by Dave Cheriton), causing lower throughput and higher latency. Another win would be hardware that provided the higher level protocols with a CRC [say the buffer descriptor list contains a pointer to where to put the CRC in outgoing packets, and how to do it for incoming packets]. It's instructive to look at fast gateways - two LANCE Ethernet chips on a Multibus can blow away the bus bandwidth - really fast boxes have no bus, or have a binary tree bus a la buttergates... Jon
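A sketch of the descriptor-list idea (names invented for illustration): the driver hands the interface a scatter/gather chain of user buffers, plus a slot telling the hardware where to deposit or verify the CRC:

    struct buf_desc {
        char            *base;      /* fragment of the user's buffer */
        unsigned int     len;       /* bytes in this fragment */
        unsigned short  *crc_slot;  /* where hardware writes (outgoing)
                                       or checks (incoming) the CRC */
        struct buf_desc *next;      /* next fragment, or NULL */
    };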
DCP@QUABBIN.SCRC.SYMBOLICS.COM (David C. Plummer) (10/06/87)
Date: Mon, 5 Oct 1987 11:24:26 EDT
From: Mark Bodenstein <MAB%CORNELLC.BITNET@wiscvm.wisc.edu>

    Could you provide more detail on how you unrolled this loop?

    (The complication being that the length of the loop is determined by
    the length of the data. Some alternatives I can think of would be: ...

There is a fourth way that we (Symbolics) have done which you did not mention:

(a) Pick a compile-time unrolling factor, usually a power of 2, say 16 = 2^4.
(b) Divide the data length by the unrolling factor, obtaining a quotient and remainder. When the unrolling factor is a power of two, the quotient is a shift and the remainder is a logical AND.
(c) Write an unrolled loop whose length is the unrolling factor. Execute this loop <quotient> times.
(d) Write an un-unrolled loop (whose length is therefore 1). Execute this loop <remainder> times.
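For concreteness, here is what the scheme might look like in C with an unrolling factor of 4 (the text above uses 16; 4 keeps the sketch short). An illustration, not our actual code; it assumes an even byte count:

    #include <stddef.h>

    /* One's-complement checksum with the quotient/remainder
     * unrolling scheme: an unrolled-by-4 loop run <quotient>
     * times, then a 1-at-a-time loop run <remainder> times. */
    unsigned short cksum_unrolled(const unsigned short *p, size_t nwords)
    {
        unsigned long sum = 0;
        size_t q = nwords >> 2;   /* quotient: a shift, since 4 = 2^2 */
        size_t r = nwords & 3;    /* remainder: a logical AND */

        while (q--) {
            sum += p[0]; sum += p[1]; sum += p[2]; sum += p[3];
            p += 4;
        }
        while (r--)
            sum += *p++;          /* the un-unrolled loop */

        while (sum >> 16)         /* fold carries back into 16 bits */
            sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
    }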
mminnich@udel.EDU (Mike Minnich) (10/06/87)
In article <8710051601.AA14825@ucbvax.Berkeley.EDU>, MAB@CORNELLC.BITNET (Mark Bodenstein) writes:
> Could you provide more detail on how you unrolled this loop?
>
> (The complication being that the length of the loop is determined by
> the length of the data. Some alternatives I can think of would be:
>
> 2. to unroll the loop for each possible data length, and choose and
> execute the appropriate unrolled loop
>
A simple technique that has worked well for me in the past is to unroll the loop for the longest possible length and then compute a jump into the unrolled loop based on the length of the data to be checksummed/copied/etc. Only one version of the unrolled loop is needed in this case.

mike
--
Mike Minnich
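To flesh that out: in C, one way to compute the jump is a switch whose cases fall through into the unrolled body, in the style of Duff's device. A sketch (one's-complement summing as the loop body; assumes nwords > 0):

    #include <stddef.h>

    unsigned short cksum_jump(const unsigned short *p, size_t nwords)
    {
        unsigned long sum = 0;
        size_t passes = (nwords + 7) / 8;   /* 8-way unrolled body */

        switch (nwords % 8) {               /* jump into the loop */
        case 0: do { sum += *p++;
        case 7:      sum += *p++;
        case 6:      sum += *p++;
        case 5:      sum += *p++;
        case 4:      sum += *p++;
        case 3:      sum += *p++;
        case 2:      sum += *p++;
        case 1:      sum += *p++;
                } while (--passes);
        }

        while (sum >> 16)                   /* fold carries */
            sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
    }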
ihm@nrcvax.UUCP (Ian H. Merritt) (10/07/87)
>The instruction count sounds low to me, how about 10 times more?

This sounds more reasonable.

>(A 1000 byte packet sounds like it would take 500 adds just to
>compute the TCP checksum, not to mention a 64K packet).

Since the 1's complement arithmetic is so flexible, this could of course be improved on 32 bit processors, for example, but still represents significantly more than 1000 instructions...
. . .
>Unfortunately for a TCP connection, most of the checksum overhead
>is in the TCP checksum (which is an end-to-end check) and this
>sounds harder to move off of the general purpose CPU. The idea
>would be to let your general purpose 14 MIPS CPU do general
>purpose work rather than adding up checksums.
. . .
>I have been thinking of how to design a T3 (45Mb) type speed
>packet switch (just thinking) and there are some real problems
>with doing IP packet header processing when you need to process a
>packet every 6 us. (Voice packets want to be about 100 bytes so
>you need to be able to handle about 56K packets/sec).
>
>Virtual circuits sure seem easier at the packet level at this speed
>(smaller packet overhead too). Of course, a virtual circuit could
>carry embedded IP packets.

This suggests a dedicated packet concentrator arrangement such that long-haul very high speed (VHS (:->) ) circuits could be utilized by grouping large numbers of small (i.e. ip/tcp-size) packets bound for the same region together into mostly-reliable megapackets shipped over the high-speed link to be unpacked and sent off to gateways at the far end. TCP's reliability features would be sufficient to permit this to work with little or no megapacket-level error handling, but such functionality could be included by using nonflexible (therefore simple) algorithms implemented in VLSI that operate on the entire megapacket or by cutting it up into arbitrary segments without any content-specific knowledge. (A sketch of what such a megapacket might look like follows.) The common-channel interoffice signalling (CCIS) protocol used by the Bell System incorporated a similar scheme (albeit on a smaller (and slower) scale), allowing selective retransmission of portions of a megapacket while verifying the validity of the rest of the data. --i
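For concreteness, such a megapacket header might be laid out like this (field names and sizes are invented for illustration); the per-segment checks are what would let the far end request retransmission of only the damaged pieces:

    #define MAX_SEGS 64

    struct megapacket {
        unsigned short seq;                 /* megapacket sequence number */
        unsigned short nsegs;               /* small packets packed inside */
        unsigned short seg_len[MAX_SEGS];   /* length of each packet */
        unsigned short seg_sum[MAX_SEGS];   /* per-segment check */
        /* the packed packets follow */
    };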
ihm@nrcvax.UUCP (Ian H. Merritt) (10/08/87)
>There is a fourth way that we (Symbolics) have done which you did not
>mention:
>
>(a) Pick a compile-time unrolling factor, usually a power of 2, say 16 = 2^4.
>(b) Divide the data length by the unrolling factor, obtaining a quotient
> and remainder. When the unrolling factor is a power of two, the
> quotient is a shift and the remainder is a logical AND.
>(c) Write an unrolled loop whose length is the unrolling factor. Execute
> this loop <quotient> times.
>(d) Write an un-unrolled loop (whose length is therefore 1). Execute
> this loop <remainder> times.

Or if you have memory to burn (which is fast becoming a common condition), just unroll the loop for the maximum condition and branch into it at the appropriate point to process the length of the actual packet. --i
CERF@A.ISI.EDU (10/10/87)
Thanks for the observations - what you describe also bears an uncanny similarity to the internals of the early TYMNET, which aggregated bytes from terminals and grouped them into frames sent reliably on a hop-by-hop basis. Frames were broken down at each hop and re-formed based on new traffic arriving at that node; since the bytes in a frame might be destined to flow out on different trunks, they had to be re-marshalled into new frames. Vint
DCP@QUABBIN.SCRC.SYMBOLICS.COM (David C. Plummer) (10/13/87)
Date: 8 Oct 87 16:05:47 GMT
From: csustan!csun!psivax!nrcvax!ihm@LLL-WINKEN.ARPA (Ian H. Merritt)

    >There is a fourth way that we (Symbolics) have done which you did not
    >mention:
    >
    >(a) Pick a compile-time unrolling factor, usually a power of 2, say 16 = 2^4.
    >(b) Divide the data length by the unrolling factor, obtaining a quotient
    > and remainder. When the unrolling factor is a power of two, the
    > quotient is a shift and the remainder is a logical AND.
    >(c) Write an unrolled loop whose length is the unrolling factor. Execute
    > this loop <quotient> times.
    >(d) Write an un-unrolled loop (whose length is therefore 1). Execute
    > this loop <remainder> times.

    Or if you have memory to burn (which is fast becoming a common
    condition), just unroll the loop for the maximum condition and branch
    into it at the appropriate point to process the length of the actual
    packet.

First of all, that's 65535 octets for TCP. Second, I believe that was one of the three techniques mentioned by the person to whom I was replying. Third, we (Symbolics) can't do that without playing some really nasty games with the compiler. You see, we're of the opinion that assembly language is a thing of the past, and there aren't any good Lisp constructs for the kind of computed GO necessary to pull this trick off. I can't think of any good tricks in FORTRAN, either. I'm not familiar enough with Pascal, Ada or C to know if those higher-level languages allow such things. Fourth, depending on your CPU architecture, there may be ideal unrolling constants which would keep the unrolled loop inside an instruction prefetch buffer; complete unrolling would actually be a degradation.
henry@utzoo.UUCP (10/23/87)
> ... Fourth, depending on your CPU architecture, there
> may be ideal unrolling constants which would keep the unrolled loop
> inside an instruction prefetch buffer; complete unrolling would actually
> be a degradation.

In particular, almost any CPU with a cache -- which means most anything above the PC level nowadays -- will have an optimum degree of unrolling for loops that iterate a given number of times. It's not just a question of whether the loop will fit; eventually the extra main-memory fetches needed to get a larger loop into the cache wipe out the gains from reduced loop-control overhead. For straightforward caches (with a loop that will *fit* in the cache!), elapsed time versus degree of unrolling is a nice smooth curve with a quite marked minimum. Based on the look I took at this, if the ratio of your cache speed to memory speed isn't striking, and your loop control is not grossly costly (due to e.g. pipeline breaks), the minimum has a good chance of falling at a fairly modest unrolling factor, maybe 8 or 16.

Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry
PAP4@AI.AI.MIT.EDU ("Philip A. Prindeville") (10/26/87)
Why is this starting to look like `net.arch' to me? Could it be the academic throttling of an overly esoteric subject? I think we've all made our basic points -- let's leave something as an exercise for the (already bored) reader. Heading for the fall-out shelters... -Philip