[comp.protocols.nfs] Offloading I/O [was: Incremental sync]

mash@mips.com (John Mashey) (03/18/91)

In article <ZU=9R=8@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>Here we're talking about putting file systems in smart processors. How about
>putting other stuff there?
>
>	Erase and kill processing.	(some PC smart cards do this,
>					 as did the old Berkeley Bussiplexer)
>	Window management.		(all the way from NeWS servers
>					 with Postscript in the terminal,
>					 down through X terminals and Blits,
>					 to the 82786 graphics chip)
>	Network processing.		(Intel, at least, is big on doing
>					 lots of this in cards, to the point
>					 where the small memory on the cards
>					 becomes a problem... they do tend
>					 to handle high network loads nicely)

These have been done, in various ways, and have at least fairly often
been good ideas, although some of the requirements keep changing
on you.
	In general, these OFTEN have the characteristics that
	make it reasonable to distribute the processing, although
	people argue about the window management case.

	The relevant characteristics are:
		1: The ratio of # of device interactions to bandwidth
		or processing is high, i.e., there are many interactions,
		but most of them neither move much data, nor require
		much processing.
		2: There is relatively little global data needed to
		process the device interactions.
		3: There is a reasonable protocol, such that the work
		can be split between I/O processor and main CPU(s),
		such that the main CPU(s) normally interact with
		the I/O processor much less frequently than the IOP
		interacts with the devices.  I.e., if the IOP does NOT
		lessen the frequency of interaction, it may well get in
		the road more than not.
		4: The devices demand lower response latency than
		can cost-effectively be done by the main CPU.
	[Note that my argument against doing this for disk file systems
	was that disk I/O was the opposite of 1 (much data, few requests),
	and disobeyed 2 (much global data).]

Consider some of the paths through which asynch serial I/O has gone
(for concreteness, on UNIX):

1. Simple UARTS for serial ports, with no special queueing.
	CPU polls for input or gets 1 interrupt per char
	CPU emits output on scheduled basis, or 1 interrupt per char
	Cheap, but needs a main CPU that can stand high interrupt rates
		if supporting many and/or fast lines.
	CPU does echoing
	EX: DEC DL11  (yes, that goes a way back)  (in this case just 1 line)
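
To make the per-character cost concrete, here is a minimal C sketch of
the host side (the register names and bits are invented, not the real
DL11 layout):

/* A hypothetical memory-mapped UART; names and bits are invented. */
#define RX_DONE 0x80                      /* assumed "receiver done" bit */

volatile unsigned char *uart_rcsr;        /* receiver status register    */
volatile unsigned char *uart_rbuf;        /* receiver data buffer        */

extern void tty_input(int c);             /* host tty code: echo, erase,
                                             kill, buffering, ...        */

/*
 * Called once per received character.  With many or fast lines, the
 * main CPU pays a full interrupt (save/restore, dispatch) for every
 * single byte, and also does the echo itself.
 */
void uart_rx_intr(void)
{
    if (*uart_rcsr & RX_DONE)
        tty_input(*uart_rbuf);
}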

2.  #1 plus input silo
	EX: DEC DJ11: 64-entry silo, handling 16 lines.
	CPU can either:
		poll the device regularly to see if there is any
		input available, gather 1-64 entries and parcel them
		out to the processes they belong to.
	OR
		ask for an interrupt whenever a character appears,
		thus possibly trading away overhead to minimize latency.
	CPU does echoing
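
A sketch of the poll-and-drain path, again with invented names rather
than the real DJ11 registers; one poll or interrupt can now retire up
to a silo's worth of characters, parceled out by line:

/* Invented silo entry format: valid flag + line number + character. */
#define SILO_VALID      0x8000
#define SILO_LINE(e)    (((e) >> 8) & 0x0f)        /* which of 16 lines  */
#define SILO_CHAR(e)    ((e) & 0xff)

volatile unsigned short *silo_reg;   /* reading it pops one entry        */

extern void tty_input_line(int line, int c);

/*
 * Run either from a periodic poll or from one interrupt: drain the
 * whole silo, so a single entry into the host can retire up to 64
 * characters instead of one.
 */
void silo_drain(void)
{
    unsigned short e;

    while ((e = *silo_reg) & SILO_VALID)
        tty_input_line(SILO_LINE(e), SILO_CHAR(e));
}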

3. #2 plus more control
	EX: DEC DH11: like DJ11, but for example, you could ask for
	an interrupt only if there were N characters in the silo.
	Also, it could do auto-echo

4. #3 plus DMA output
	EX: (I think): DEC DZ11 + DQS11 com processor
	(My memory gets vague on this one. The DQS was a com processor
	with good speed, albeit "interesting" to code for.)

5. Input silos with settable parameters, output silos; maybe DMA
	There have been lots of these: no special processing, but:
	a) On input, poll, or ask for an interrupt if the silo has N characters,
	or possibly, ask for an interrupt if even 1 character is there
	and M milliseconds have elapsed.  If something is set up as
	a comm processor geared for non-terminal use, it may well do DMA,
	to support fast lines.
	b) On output, either have deep silos that you can stuff many
	characters into in bunches, or else do DMA out.
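
The "N characters or M milliseconds" rule is worth writing down, since
getting it wrong either adds latency or brings back per-character
interrupts.  A sketch of the decision as board firmware might make it,
with the tunable names invented:

/* Hypothetical tunables, downloaded by the host driver. */
struct silo_params {
    int  hiwat;          /* interrupt when this many chars are queued    */
    long max_wait_ms;    /* ...or when the oldest char has waited this   */
};                       /*    long with the silo non-empty              */

struct silo_state {
    int  count;          /* characters currently queued                  */
    long oldest_ms;      /* age of the oldest undelivered character      */
};

/*
 * Called from the board's clock tick (and on each arrival): decide
 * whether to bother the host now.  hiwat bounds the interrupt rate
 * under load; max_wait_ms bounds how long a single typed character
 * can sit unseen.
 */
int should_interrupt_host(const struct silo_params *p,
                          const struct silo_state *s)
{
    if (s->count >= p->hiwat)
        return 1;
    if (s->count > 0 && s->oldest_ms >= p->max_wait_ms)
        return 1;
    return 0;
}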

6. Move echo, erase, and line kill into the IOP.
	This has been done in various commercial UNIXes (maybe people
	can post examples).
	For line-oriented input, this is not a bad idea, and it became
	much more reasonable as cheap micros became available.
	(Note that in the DQS/DZ era, anything like a micro was NOT cheap.)
	The CPU:
		needs a protocol to hand to the IOP definitions of
		erase & kill characters (if the user is allowed to change
		them), and in fact, any other parameters of relevance
	The IOP:
		needs more local storage (to hold parameters)
		needs even more local storage, because the natural unit of
		buffering is now a complete input line
		Must deal with all of the UNIX escape sequences and
		conversions, as well as interrupts.  (For example,
		if an interrupt comes, it may want to terminate
		output in progress to that line.)
	BENEFITS:
		CPU gets one interrupt for an entire line of input.
		CPU need not do echoing.
	DRAWBACKS:
		Often, for cost reasons, the IOP could just barely
		do what was needed, and then along would come additional
		requirements.  For example, suppose you had a definite
		idea of "Interrupt" and that was wired into the IOP,
		and then UNIX starts letting people change that, etc.
		In general, to make this work, the IOP must almost
		always be able to deal with every input character,
		without bothering the main CPU.
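
As a rough sketch of what "hand the IOP the parameters" and "buffer a
whole line" amount to, here is a hypothetical host-to-IOP parameter
block and the core of an IOP-side input loop (all names invented; real
boards differed in detail):

#define LINE_MAX 256

/* Downloaded from the host whenever the user changes tty settings. */
struct iop_line_params {
    unsigned char erase_c;    /* character erase, e.g. '#' or backspace  */
    unsigned char kill_c;     /* line kill, e.g. '@' or ^U               */
    unsigned char intr_c;     /* interrupt character, e.g. DEL           */
    int           echo;       /* nonzero: IOP echoes locally             */
};

struct iop_line {
    struct iop_line_params p;
    unsigned char buf[LINE_MAX];
    int           len;
};

/* Primitives assumed to exist elsewhere in the firmware. */
extern void iop_echo(struct iop_line *l, int c);
extern void iop_flush_output(struct iop_line *l);
extern void iop_send_line_to_host(struct iop_line *l);  /* one interrupt */
extern void iop_signal_interrupt(struct iop_line *l);

/*
 * One received character.  The host hears only about completed lines
 * and the interrupt character; erase, kill, and echo never leave the
 * board.
 */
void iop_rx_char(struct iop_line *l, int c)
{
    if (c == l->p.intr_c) {          /* must reach the host right away   */
        iop_flush_output(l);         /* stop output in progress          */
        l->len = 0;
        iop_signal_interrupt(l);
        return;
    }
    if (c == l->p.erase_c) {
        if (l->len > 0) {
            l->len--;
            if (l->p.echo)
                iop_echo(l, c);
        }
        return;
    }
    if (c == l->p.kill_c) {
        l->len = 0;
        if (l->p.echo)
            iop_echo(l, c);
        return;
    }
    if (l->p.echo)
        iop_echo(l, c);
    if (l->len < LINE_MAX)
        l->buf[l->len++] = (unsigned char)c;
    if (c == '\n' || l->len == LINE_MAX) {
        iop_send_line_to_host(l);
        l->len = 0;
    }
}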

Of course, historically, about the same time such things became widely
practical, they became almost useless in some environments, because
people were using CRTs, not with line mode, but with visually oriented
terminals.  At that point, everybody started running the terminals in
"raw" mode, thus bypassing all of the above processing!  So, the next
step:

7. Make the IOP smart enough to distinguish between "ordinary"
characters which should be echoed, and stuck in an input queue,
and other characters and/or sequences, which demanded response from
the CPU.  The theory here is that, for example, you sit typing text into
vi or emacs fairly efficiently, and still get the various control
sequences to do something, without making the IOP huge.

Of course, since every program might well use different special sequences,
this may require more interaction with the main CPU.
I think this one has been done.
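
A minimal sketch of that split, assuming the host downloads a 256-entry
"special" table per line (names invented): ordinary characters are
echoed and queued on the board, anything marked special flushes what
has accumulated and wakes the host.

#define QMAX 128

struct iop_raw_line {
    unsigned char special[256];   /* 1 = host wants this one right away  */
    unsigned char q[QMAX];        /* ordinary characters queued locally  */
    int           qlen;
    int           echo;
};

extern void iop_echo_char(int line, int c);
extern void iop_push_to_host(int line, const unsigned char *buf, int len,
                             int special_c);   /* special_c < 0: none    */

/*
 * Ordinary characters are echoed and queued on the board; anything the
 * host marked special flushes whatever has accumulated and wakes the
 * host, so vi or emacs still sees its command characters promptly
 * without a host interrupt per keystroke of plain text.
 */
void iop_rx(struct iop_raw_line *l, int line, int c)
{
    if (l->special[c & 0xff]) {
        iop_push_to_host(line, l->q, l->qlen, c);
        l->qlen = 0;
        return;
    }
    if (l->echo)
        iop_echo_char(line, c);
    l->q[l->qlen++] = (unsigned char)c;
    if (l->qlen == QMAX) {            /* buffer full: push anyway        */
        iop_push_to_host(line, l->q, l->qlen, -1);
        l->qlen = 0;
    }
}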

8. Put termcap in the IOP.
	Implications of this left as an exercise :-)


One clear lesson:
	When doing an IOP, be careful that the interface requirements don't
	change on you, especially if they often force you into a
	transparent pass-thru mode where the IOP really is doing nothing.
Another one: watch out if you're trying to make the same IOP handle both
	terminal I/O and fast serial lines to connect systems.

I believe that most systems tend to use devices in groups 1-5 these days,
because the processing requirements have gotten complex enough that
you might as well just have the main CPU do it, irritating though it
may be.

Anyway, those are a few samples. Maybe people would post examples
of these various approaches, or others, with comments on:
	a) How well they worked
	b) How long they lasted
	c) What the weirder bugs and problems were
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

cs450a03@uc780.umd.edu (03/18/91)

For the case of an IOP....

Line orientation is an approximation of what you want, but not exactly it.

Essentially, what you need is a way of saying "just stuff these
characters in the buffer, but when you get one of those characters,
send the buffer up to me for handling."   'Implicitly', if the buffer
gets near full, you'll have to read it anyway.

Information of this sort could be communicated by sending short lists
of characters (turn on pass through for these, turn off pass through
for these), sending blanket commands (turn off pass through for all),
or sending bitmaps (set passthrough as indicated, for all characters).
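
For the bitmap variant, the host side could be about this simple (the
ioctl request code and structure here are invented for illustration,
not an existing interface):

#include <string.h>
#include <sys/ioctl.h>

/* Invented request code and structure, purely for illustration. */
#define TIOCSPASSMAP  0x74f0

struct passmap {
    unsigned char map[256 / 8];   /* bit set = pass this char up at once */
};

static void passmap_set(struct passmap *pm, unsigned char c)
{
    pm->map[c >> 3] |= (unsigned char)(1 << (c & 7));
}

/* Mark all control characters plus DEL as pass-through; the board
   buffers (and perhaps echoes) everything else. */
int download_passmap(int ttyfd)
{
    struct passmap pm;
    int c;

    memset(&pm, 0, sizeof pm);
    for (c = 0; c < 0x20; c++)
        passmap_set(&pm, (unsigned char)c);
    passmap_set(&pm, 0x7f);
    return ioctl(ttyfd, TIOCSPASSMAP, &pm);
}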

As always, when you introduce a new class of feature, there will be
lots of programs that cannot take advantage of the feature.  Most
notably, programs where you have no source code access (nor a support
contract).

On the other hand, this mechanism is pretty general, and could be
implemented with a few kernel mods (in the case of unix), and a little
out-board hardware.

Note that mostly I'm thinking about buffering printing characters, and
sending control characters (including delete) through to the 'main
processor' .. so perhaps you could get away with a simple switch to
turn on/turn off such a feature.

How useful?  How many users?  What is communication overhead?

I think this would integrate pretty well with something like emacs..
though it is arguable how much cpu overhead you'd "save" in that case.

Which, I suppose, is the reasoning behind X-terminals.

Raul Rockwell

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/19/91)

In article <1088@spim.mips.COM> mash@mips.com (John Mashey) writes:

| 6. Move echo, erase, and line kill into the IOP.
| 	This has been done in various commercial UNIXes (maybe people
| 	can post examples).
| 	For line-oriented input, this is not a bad idea, and it became
| 	much more reasonable as cheap micros became available.
| 	(Note that in the DQS/DZ era, anything like a micro was NOT cheap.)
| 	The CPU:
| 		needs a protocol to hand to the IOP definitions of
| 		erase & kill characters (if the user is allowed to change
| 		them), and in fact, any other parameters of relevance
| 	The IOP:
| 		needs more local storage (to hold parameters)
| 		needs even more local storage, because the natural unit of
| 		buffering is now a complete input line
| 		Must deal with all of the UNIX escape sequences and
| 		conversions, as well as interrupts.  (For example,
| 		if an interrupt comes, it may want to terminate
| 		output in progress to that line.)
| 	BENEFITS:
| 		CPU gets one interrupt for an entire line of input.
| 		CPU need not do echoing.

  About 20 years ago we wrote an OS which ran on a GE 605 (enhanced 635)
using a DN355 as the front end processor. Based on some of the things we
did with that, here's my current thinking on what kind of interface
could be used.

  For each character and line there would be a table of actions caused
by the input character. You also need an output state, which at minimum
would be:
 1. pass output to the line
 2. hold output until the next input char (ixany)
 3. hold output

  Flags would be at minimum:
    -  pass this char to host
    -  interrupt host
    -  set output state 1
    -  set output state 3
    -  echo
    -  this is char delete
    -  this is line delete

  Input states would be used to handle whether or not to echo character
delete, what to do with char delete when the buffer is empty, etc. Special cases
would just pass the data through.

  This allows interrupt on single character input while providing line
buffering where it makes sense. Note that I don't claim this is
complete, just some of the things which make this possible.
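
That table translates naturally into a per-line array of action flags,
roughly like this sketch (flag names invented to match the list above):

/* Per-character action flags, one byte per possible input character,
   per line; the host downloads the table. */
#define ACT_PASS_TO_HOST   0x01   /* pass this char to host              */
#define ACT_INTR_HOST      0x02   /* interrupt host                      */
#define ACT_OUT_PASS       0x04   /* set output state 1                  */
#define ACT_OUT_HOLD       0x08   /* set output state 3                  */
#define ACT_ECHO           0x10   /* echo                                */
#define ACT_CHAR_DELETE    0x20   /* this is char delete                 */
#define ACT_LINE_DELETE    0x40   /* this is line delete                 */

enum output_state {
    OUT_PASS = 1,             /* 1. pass output to the line              */
    OUT_HOLD_UNTIL_INPUT,     /* 2. hold output until next input (ixany) */
    OUT_HOLD                  /* 3. hold output                          */
};

struct line_table {
    unsigned char     action[256];
    enum output_state ostate;
};

/* Example setup: '#' erases a character, '@' kills the line, DEL goes
   straight to the host and interrupts it. */
void example_setup(struct line_table *lt)
{
    lt->action['#']  = ACT_CHAR_DELETE | ACT_ECHO;
    lt->action['@']  = ACT_LINE_DELETE | ACT_ECHO;
    lt->action[0x7f] = ACT_PASS_TO_HOST | ACT_INTR_HOST;
    lt->ostate = OUT_PASS;
}
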
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

meissner@osf.org (Michael Meissner) (03/20/91)

In article <1088@spim.mips.COM> mash@mips.com (John Mashey) writes:

| 7. Make the IOP smart enough to distinguish between "ordinary"
| characters which should be echoed, and stuck in an input queue,
| and other characters and/or sequences, which demanded response from
| the CPU.  The theory here is that, for example, you sit typing text into
| vi or emacs fairly efficiently, and still get the various control
| sequences to do something, without making the IOP huge.
| 
| Of course, since every program might well use different special sequences,
| this may require more interaction with the main CPU.
| I think this one has been done.

In its proprietary operating system AOS/VS (and AOS/VS-II), Data
General has done this for about 11-12 years.  It started with
one outboard processor to control all character devices and was called
an IOP.  As terminal counts got higher, it turned into one processor
for every 8 or 16 lines called IAC's.  I think the current high end
system uses 4 68000's for an IAC instead of the wimpy bit-slice Nova
or Eclipse sorta work alikes from hell that they had previously used.
Originally, it would do just buffering, but later versions of the OS
added the AOS command line editing that the CLI had originally done
from user space.  Finally, the OS put a full blown interpretter for
it's word processing system in there.  At the same time, the buffer
sizes for the lines kept shrinking, and more stuff kept moving back to
the host for processing to keep within the 64K memory requirements.

After memory size issues, the main problem with this was inflexibility
(partly due to memory requirements).  The command line editing used
fixed characters, based on the original DG terminals.  With a great
deal of care (ie, you had to patch the IAC binary), you could support
maybe 10 different terminals, though in general only the 2 classes of
DG terminals were fully supported.  Also, only one copy of the
interpreted bytestream could be downloaded per controller, which led
to the system call that downloaded the program not being documented,
which led to only one application using it.

Another problem with IAC's (before the 68000 based IAC's which I never
used) was speed.  These things were cheap, and very slow.  You could
not reliably use a comm program like uucp at full 9600 baud if memory
serves, let alone more than one on the same IAC.

Finally, you have the problem of imported software.  If you bring
software in from outside, it would typically drop down to single
character I/O and do it itself.  Because of the length of the path between the
serial line and the process getting the data, it would slow things
down if the processor just woke up the host on every character.

| One clear lesson:
| 	When doing an IOP, be careful that the interface requirements don't
| 	change on you, especially if they often force you into a
| 	transparent pass-thru mode where the IOP really is doing nothing.

Yep.

| Another one: watch out if you're trying to make the same IOP handle both
| 	terminal I/O and fast serial lines to connect systems.

Yep.

| I believe that most systems tend to use devices in groups 1-5 these days,
| because the processing requirements have gotten complex enough that
| you might as well just have the main CPU do it, irritating though it
| may be.
| 
| Anyway, those are a few samples. Maybe people would post examples
| of these various approaches, or others, with comments on:
| 	a) How well they worked

They worked well in their original time frame, but were pushed beyond
their original specs.

| 	b) How long they lasted

Still in use I believe.

| 	c) What the weirder bugs and problems were
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?