[comp.protocols.tcp-ip] offloading the protocols

cheriton@PESCADERO.STANFORD.EDU ("David Cheriton") (03/16/88)

VMTP (RFC 1045) is specifically designed to work well with hardware support
on a network adaptor board with its own processing power.  In fact, we have
designed and are implementing such a board to support VMTP, and this process,
concurrent with the protocol design and refinement, has significantly
influenced the design of the protocol.  I would agree there are severe limits
on what such intelligent interfaces can provide with current protocols
independent of how well the boards are designed.  I would also agree that
most intelligent interfaces to date are slower than the dumb fast ones
when you look at transport-level performance.  However, my experience
with VMTP and the NAB (Network Adaptor Board) we are building convinces
me that this approach is essential for transport-level performance in the
same general range as the network when we go to 100 Mb networks and higher.
Moreover, offboarding the processing load of protocols seems to have
additional advantages on multiprocessor machines, because the interrupts
and cache demands of protocols lean on a critical shared resource, namely
the system bus.

Interested parties can write to my secretary (nevena@pescadero.stanford.edu)
for a copy of our draft paper on the NAB.  The VMTP spec is of course
available as an RFC - only the first 30 pages are really needed to get
a feeling for the protocol.

David Cheriton

CERF@A.ISI.EDU (03/17/88)

I'm not going to get into the front-end versus operating-system-resident
protocol argument (I argued against the front-end years ago on the same
basis you suggest Dave Clark argues, if anyone cares about my historical
biases).

However, it seems to me that as you approach the gigabit channels, you
really want to simplify the host's view of networking. An analogy might be
found in disk/file access and virtual memory. Years ago, an operating
system was designed at UCLA called the Sigma EXperimental system (SEX for
short - the user's manual was a popular item!). It ran on a Sigma 7 made
by Scientific Data Systems (later, Xerox Data Systems, later, R.I.P.).

The notion of associating ("coupling" - God, I never thought about how
suggestive that term was in connection with the operating system acronym)
virtual memory with pages of files was an essential design element.

One would associate a particular virtual page space with disk pages 
occupied by a file. This is not much different than virtual memory
linked to pages of a disk, except in this case, actions to the memory
content were reflected in changes to the FILE (not just changes to a 
disk page which happened to represent a page of virtual memory space).

So, the user's virtual memory space was mapped onto the file system. I
imagine Multics could be considered to have done something like that
only even more elegantly with its rich addressing structure.

Perhaps what is needed is a way to associate virtual memory with places
in the networking space. Writing to virtual memory would be like writing
to the network. PDP-11's had the concept of associating certain words of
memory with I/O channels. But what I am looking for is a notion that lets
very simple actions on memory be interpreted by outboard processors as
network-related actions.
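
To make that concrete, here is a purely hypothetical sketch of the kind of
interface I have in mind.  The names net_map() and NET_STREAM are invented
for this note; nothing like them exists today.

    #include <string.h>

    #define NET_STREAM 1   /* invented constant: "stream" style mapping */

    /* Hypothetical: ask an outboard processor to map a network "place"
     * into our virtual address space.  Stores into the region are
     * interpreted by the OP as transmissions; loads see arrived data. */
    extern char *net_map(const char *where, int how, unsigned long size);

    void example(void)
    {
        char *net = net_map("host-b", NET_STREAM, 64 * 1024);
        const char msg[] = "a record of application data";

        /* Writing to the network is then just writing to memory; the
         * outboard processor watches the mapped pages and does all the
         * protocol work. */
        memcpy(net, msg, sizeof msg);
    }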

Perhaps Dave Clark could expand on his theme which I view as related to
your question if not the rather poorly expressed ideas above.

Vint Cerf

kwe@bu-cs.BU.EDU (kwe@bu-it.bu.edu (Kent W. England)) (03/19/88)

In article <[A.ISI.EDU]17-Mar-88.05:12:56.CERF> CERF@A.ISI.EDU writes:
	[discussion of simplifying and speeding up network/host interface]
>
>Perhaps what is needed is a way to associate virtual memory with places
>in the networking space. Writing to virtual memory would be like writing
>to the network. PDP-11's had the concept of associating certain words of
>memory with I/O channels. 

	Doesn't Apollo's Domain [proprietary] networked operating
system do just that?

KASTEN@MITVMA.MIT.EDU (Frank Kastenholz) (03/19/88)

More important than devising protocols with OPs (outboard processors) in
mind is moving data directly from the user's space to the OP -
it should not go through some central network application.

A second (equally important) issue is to trust your local I/O
channel.

The things that really kill the protocol processing are checksums
and "administrative" I/O (separate acks, etc.).

By trusting the local I/O channel, you do not need to checksum packets
going between the OP and the host, ack them, etc, etc.
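
To see why the checksum term hurts so much: the Internet checksum has to
touch every 16-bit word of the data, so it is an O(n) pass all by itself.
A minimal sketch of the computation (RFC 1071 style, not taken from any
particular implementation):

    /* One's-complement sum of 16-bit words, folded and complemented.
     * The point is simply that every byte of the packet is read once -
     * this is the pass an OP could take off the host's hands. */
    unsigned short
    inet_checksum(const unsigned char *p, unsigned long len)
    {
        unsigned long sum = 0;

        while (len > 1) {               /* sum 16 bits at a time */
            sum += ((unsigned long)p[0] << 8) | p[1];
            p += 2;
            len -= 2;
        }
        if (len == 1)                   /* odd trailing byte */
            sum += (unsigned long)p[0] << 8;
        while (sum >> 16)               /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;
    }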


A very empirical model that I have dreamed up for a TCP file
transfer in a non-kernel TCP implementation (e.g. Wollongong, KNET,
etc.) is:

    The cost of moving the data from disk to user is X, from user
    to network application is X, running the TCP checksum is X and
    then moving the data from the network application to the I/O
    adaptor is X. The total cost is 4X and X seems to be O(n).

This model is not proven, but seems to be borne out by some empirical
testing for running file transfers through a VAX using TCP and UDP
(both had about the same throughput, but TCP took 100% and UDP
about 75% of the CPU - the transfers were done by FTP/TCP and NFS/RPC/
UDP - the only effective difference was the TCP checksum).


Moral of the story: if you cannot move the data from the user's space
to the OP directly (i.e. if you need to go through an application-level
network process first) you only save about 25%.
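
To spell out the arithmetic of the model (a toy calculation only, with X
normalized to 1):

    /* Cost model above, in units of X = one O(n) pass over the data. */
    #include <stdio.h>

    int main(void)
    {
        double disk_to_user = 1.0;  /* disk -> user space              */
        double user_to_app  = 1.0;  /* user -> network application     */
        double checksum     = 1.0;  /* TCP checksum pass               */
        double app_to_board = 1.0;  /* network application -> adaptor  */

        double total  = disk_to_user + user_to_app + checksum + app_to_board;
        double direct = total - user_to_app;   /* user space straight to OP */

        /* prints "saving: 25%" - one pass out of four is avoided */
        printf("saving: %.0f%%\n", 100.0 * (total - direct) / total);
        return 0;
    }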

Remember, this is all empirical!  Real testing needs to be done.

Frank Kastenholz
Atex Inc.

All opinions are mine - not my employer's.

jerry@twg.COM ("Jerry Scott") (03/22/88)

Frank,
	That is not the way data flows inside the Wollongong
software at all.  It uses the same style as 4.3BSD.
First, data is sent from the user into the kernel, where it is placed
into network buffers called mbufs.  Mbufs can be, and are, chained to
build packets (IP headers, TCP headers, data, etc.).  The mbuf
chains are passed between protocols, so no data is moved at all,
just the pointers to the data.  And once the data is in the
kernel, it never has to take a hop back to an application for
any further processing.
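
	For readers who have not looked inside 4.3BSD, the idea is
roughly the following (a much-simplified sketch - the real struct mbuf
has more fields and small/large buffer variants):

    #include <stdlib.h>
    #include <string.h>

    /* Simplified network buffer: some data plus a link to the next
     * buffer belonging to the same packet. */
    struct mbuf {
        struct mbuf *m_next;      /* next mbuf in this packet's chain */
        int          m_len;       /* bytes of data in this mbuf       */
        char         m_dat[112];  /* the data itself                  */
    };

    /* Prepend a protocol header (TCP, IP, ...) to a chain.  Only a new
     * mbuf and one pointer assignment are involved; the user data
     * already sitting in 'chain' is never copied again.  Assumes
     * hlen <= sizeof m->m_dat. */
    struct mbuf *
    prepend_header(struct mbuf *chain, const void *hdr, int hlen)
    {
        struct mbuf *m = malloc(sizeof *m);

        if (m == NULL)
            return NULL;
        memcpy(m->m_dat, hdr, hlen);
        m->m_len  = hlen;
        m->m_next = chain;        /* link, don't copy */
        return m;
    }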

	We are well aware of the overhead of moving data between
the kernel level and user level; that is why we have done
considerable work to prevent it from happening (e.g. the Telnet
server is kernel resident, and DEC ethernet controllers are shared
using the ALTSTART interface).  We have also been eagerly tracking
the good work by Van Jacobson and Mike Karels in the TCP area.
Our implementation allows us to use their public-domain code
without modification.

	I do agree with your assessment of the on-board TCP
solutions.  The overhead in the host must be minimal.  Data
must be moved from the user into a DMA area where the smart
controller can access it.  You must trust the data integrity
between the host and the controller performing the network
functions.  Now if you can get Van's and Mike's code down
onto these controllers...

Jerry

chris@gyre.umd.EDU (Chris Torek) (03/22/88)

On the other hand, putting the Jacobson/Karels TCP into the
board may produce something significantly slower than what you
get when you run the protocol on a Sun-3.

Even if the interface is right.

Even if you have a good DMA path.

No matter how low the overhead is.

The problem, you see, is that the Sun-3 CPU may be significantly
faster than the one on your protocol card.  That 68020 runs rings
around the 80x8x in some of those external protocol boards.  The
latest Ethernet chips from Intel and AMD are fast, but they are
not CPUs.  There may be some protocol boards that use fast hardware
---if they do not exist now they probably will soon---but I have
never seen one myself.

Chris

jerry@TWG.COM ("Jerry Scott") (03/23/88)

Chris,
	Agreed, the CPU on the board will definitely come
into play in terms of performance.  We see that here at
TWG when our host-resident software is compared against
on-board software on VAX 86xx or 88xx hardware.  The host-resident
code runs circles around the on-board code in these cases.

	I think the Jacobson/Karels code has more to offer
than blazing performance.  The code that I am distributing
does not yet have the performance hooks that Van explained
in some of his mail messages.  But it does have improved
congestion control that allows my connections to adapt to
line speeds during the life of the connection rather than
at the beginning.  Not only does this code save the net
the overhead of unnecessary retransmissions, but it prevents
connection timeouts as well.  The big improvement I
have seen is with Arpanet mail.  It used to be the case that
I would try to send mail to another host, make the connection,
and then lose the connection because of timeouts before the
mail could be transferred.  Now, even at peak packet times,
mail delivery is reliable.
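
The flavor of the adaptation, very roughly, is sketched below.  This is a
sketch of the published slow-start/congestion-avoidance idea, in units of
whole segments - not Van's or our actual code.

    /* Per-connection congestion state (start with cwnd = 1 and a
     * large ssthresh). */
    struct cong {
        unsigned cwnd;      /* congestion window, in segments   */
        unsigned ssthresh;  /* slow-start threshold, in segments */
        unsigned acked;     /* segments acked at current cwnd    */
    };

    /* Each acknowledged segment opens the window: quickly while below
     * ssthresh ("slow start"), then about one segment per round trip
     * ("congestion avoidance").  This is how the connection feels out
     * the line speed during its lifetime. */
    void on_ack(struct cong *c)
    {
        if (c->cwnd < c->ssthresh) {
            c->cwnd += 1;                  /* doubles once per round trip */
        } else if (++c->acked >= c->cwnd) {
            c->cwnd += 1;                  /* linear growth               */
            c->acked = 0;
        }
    }

    /* A retransmission timeout is taken as a sign of congestion:
     * remember half the window that got us into trouble, and start
     * over from one segment. */
    void on_timeout(struct cong *c)
    {
        c->ssthresh = c->cwnd / 2 > 2 ? c->cwnd / 2 : 2;
        c->cwnd  = 1;
        c->acked = 0;
    }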


Jerry

lamaster@ames.arpa (Hugh LaMaster) (03/24/88)

In article <8803160106.AA05728@Pescadero> cheriton@PESCADERO.STANFORD.EDU ("David Cheriton") writes:

>VMTP (RFC 1045) is specifically designed to work well with hardware support

I find it interesting to note that while some people are worrying about
the necessity for offloading protocol processing, Van Jacobson and Mike
Karels have contributed algorithms that together will push 10Mbits/sec
from a CPU with less than 2 MIPS.   In any reasonable model of the rates
of computation versus network traffic for any non-gateway host, it isn't
clear that there is any benefit at all to offloading protocol processing.
In fact, recent history seems to confirm that using a general-purpose CPU
is a better way to go - it is easier to install new algorithms and bug
fixes.  If more processing power is needed, multiple (general-purpose)
CPUs seem to be a much more cost-effective way to go.  I note that Ardent
Computer seems to be applying the same principle to graphics processing -
instead of special-purpose graphics engines, it builds the system with
multiple CPUs.

karn@thumper.bellcore.com (Phil R. Karn) (03/24/88)

I would like to draw extra attention to the fact that Van and Mike were
able to do what they did WITHOUT cheating, i.e., turning off TCP checksums.

Somebody should tell Sun that this puts the final nail in the coffin of their
argument that NFS can't tolerate UDP checksums for "performance" reasons... :-)

Phil