[comp.arch] End-of-buffer interrupt instruction

sjc@key.COM (Steve Correll) (09/13/90)

Herman Rubin writes:
>There are quite a few situations in transaction processing which could use a
>few intelligent instructions, like reading from a buffer with an automatic
>interrupt when the buffer becomes empty.

> In article <4043@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>    "Reading from a buffer" in what sense?  Is this just an in-memory buffer
>    being read by a transaction processing application - in which case I
>    don't see how an *interrupt* would help, as a dumb old conditional
>    branch testing whether the number of characters left in the buffer was
>    zero would probably be *faster* than an interrupt (no tons of context to
>    save, etc.), or is it something else?

In article <STEPHEN.90Sep10191125@estragon.uchicago.edu>, stephen@estragon.uchicago.edu (Stephen P Spackman) writes:
> Er. You have an 8k buffer. Each branch takes 1 cycle to test, 1 cycle
> if not taken. There goes 16k cycles. You're telling me that servicing
> a trap is going to take 16k cycles?

First, let's be sure everybody understands what the proposed instruction is
meant to do. I assume we're filling, emptying, or copying a buffer using a
loop that deals with one character at a time, and we want a magic load or
store which will interrupt us when we try to load or store past the end of
the buffer. If that's wrong, please ignore the following and post a clarifying
definition.

I decided to perform a couple of crude experiments. First, to figure out how
long it takes to visit the Unix kernel, activate a user signal handler, and
return:

  #include <signal.h>
 
  static void
  handler(sig)
    int sig;
    {
    (void) signal(SIGTERM, handler);
    }

  int
  main(argc, argv)
    int argc;
    char **argv;
    {
    int i, mypid = getpid();

    (void) signal(SIGTERM, handler);

    for (i = 0; i < 100000; i++)	/* loop for better timing */
      (void) kill(mypid, SIGTERM);
    return 0;
    }

Then to figure out how long it takes to copy an 8K buffer using comparison
and conditional branch:

  # define BUFLEN 8192
  char buf0[BUFLEN], buf1[BUFLEN];

  void
  bar()
    {
    char *limit = buf0 + BUFLEN;
    char *p0 = buf0;
    char *p1 = buf1;
    while (p0 < limit)
      *p0++ = *p1++;
    }

  int
  main()
    {
    int i;

    for (i=1; i < 10000; i++)		/* loop for better timing */
      (void) bar();
    return 0;
    }

Then to figure out how much the proposed instruction would save us by
eliminating the comparisons, I unrolled the loop:

  # define BUFLEN 8192
  char buf0[BUFLEN], buf1[BUFLEN];

  void
  bar()
    {
    char *limit = buf0 + BUFLEN;
    char *p0 = buf0;
    char *p1 = buf1;
    while (p0 < limit)
      {
      *p0++ = *p1++;
      *p0++ = *p1++;
      *p0++ = *p1++;
      *p0++ = *p1++;
      }
    }

  int
  main()
    {
    int i;

    for (i=1; i < 10000; i++)		/* loop for better timing */
      (void) bar();
    return 0;
    }

Now it's tricky to discern from the second and third programs the savings due
to the proposed instruction. I compared the inner loops on a Sun 4:

LY1:					! [internal]
	ldsb	[%o4],%g1
	stb	%g1,[%o5]
	inc	%o5
	cmp	%o5,%o3
	bcs	LY1
	inc	%o4


LY1:					! [internal]
	ldsb	[%o4],%g1
	stb	%g1,[%o5]
	inc	%o4
	ldsb	[%o4],%g2
	inc	%o5
	stb	%g2,[%o5]
	inc	%o4
	ldsb	[%o4],%g3
	inc	%o5
	stb	%g3,[%o5]
	inc	%o4
	ldsb	[%o4],%g4
	inc	%o5
	stb	%g4,[%o5]
	inc	%o5
	cmp	%o5,%o3
	bcs	LY1
	inc	%o4

I believe that we want to figure out the cost of the "cmp" and not the "bcs",
because even with the new instruction, you have to retain a branch or you
don't have a loop; and on all but the last iteration, the branch is taken and
the instruction in the delay slot is useful. The difference in cost between
the second and third programs is (4 * (cost of branches + cost of compares))
- (1 * (cost of branches + cost of compares)), or 3 * (cost of branches +
cost of compares).

Assuming the cost of the branch equals the cost of the compare, then the
cost of the remaining compares in program 3 is (1/3) * (1/2) = 1/6 the
observed difference between program 2 and program 3.

(Note that I started my experiments on a MipsCo machine, but changed to SPARC
partly because the MipsCo architecture didn't require a separate compare
instruction at all when the comparison was for equality or inequality!)

I used cc -O on the Sun 4, measuring the sum of user and system times. The
average of three trials for each was:

Program 1:					31.5 sec
Program 2:					46.2 sec
Program 3:					41.4 sec

Program 2 - Program 3:				46.2 - 41.4 = 4.8 sec
Cost of remaining compares in Program 3:	4.8 / 6 = 0.8 sec
Cost of trips through signal handler:		2.8 sec	[note 1]
Cost of Program 3 with "new instruction":	41.4 - 0.8 + 2.8 = 43.4 sec

Decrease in execution time versus Program 2:	6%	[note 2]

[Note 1]: An additional experiment showed that the cost of the timing loop,
  kill(), and initialization code in program 1 is about 10% of the total, and
  we must divide by 10 because program 1 loops 10 times as often as the others,
  so the cost of trips through the signal handler is (0.9 * 31.5) / 10 = 2.8
  sec.

[Note 2]: Given the crudity of the experiment, I think one significant digit
  is the very most I can present without shame! I didn't try to eliminate the
  cost of the timing loop and initialization code in programs 2 and 3 because,
  since it's the same for both programs, it roughly cancels.

Readers can draw their own conclusions. One might contend that buffers are
usually bigger, magnifying the improvement. Or one might contend that realistic
examples perform more processing than just moving characters from place to
place, diminishing the percentage improvement. And someone who understands
implementations would have to tell us the costs of adding such an instruction.

My conclusions are (1) it's easy to spend quite a lot of time on a scenic tour
through the kernel and (2) it's interesting to measure actual seconds rather
than conceptual cycles.
-- 
sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc 		Steve Correll

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/13/90)

In article <2123@key.COM>, sjc@key.COM (Steve Correll) writes:
[about Herman Rubin's proposed support for reading from a buffer]

> I decided to perform a couple of crude experiments. First, to figure out how
> long it takes to visit the Unix kernel, activate a user signal handler, and
> return:

Why should Herman Rubin's proposed operation involve the UNIX kernel at all?
He never asked for that!  Imagine a scheme where we have

	struct BufHdr
	    {
		unsigned char *next;
		unsigned int  count;
		void (*refill)(struct BufHdr *);
		...
	    };

and the operation is

	unsigned char fetch(struct BufHdr *b)
	    {
		while (b->count == 0) (b->refill)(b);
		b->count--;
		return *(b->next++);
	    }

Now, if this is done in hardware:
    --	fetching the character and testing the count can proceed
	in parallel.
    --	decrementing the count and incrementing the pointer can
	proceed in parallel with the *next* instruction in the rest
	of the program.

So the time that the program would spend waiting for the FETCH
instruction would be the time it would spend on two memory references:
	FETCH r0, r1
on a hypothetical extended NS32532 would take the same time as
	MOVB 0(0(sp)), r1
Yes, more work would be done, but it could be done in parallel.
When the instruction had to trap, it would *not* have to involve the
kernel in signal handling, it would just do an ordinary procedure call.

This is a pretty substantial speedup for this operation.

The Xerox Lisp machines had micro-coded "get byte from buffer" and
"put byte into buffer" operations.  It paid off handsomely, on that
particular architecture, with that particular application mix.

With the advent of 16-bit character sets, and in particular of 16-bit
character sets encoded in 8-bit streams (see the introduction in the
SVID, or the relevant manual pages on a Sony NEWS machine), the cost
of fetching a byte becomes less significant, as one then has to check
whether this byte needs to be combined with the next.  At the moment,
there are several ways of doing this coding in actual use, so there isn't
one right one that could be built into an instruction.

My own belief is that a getc() instruction is not warranted.
But it deserves a fairer evaluation than it got.

-- 
Heuer's Law:  Any feature is a bug unless it can be turned off.

sjc@key.COM (Steve Correll) (09/17/90)

In article <3747@goanna.cs.rmit.oz.au>, ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> In article <2123@key.COM>, sjc@key.COM (Steve Correll) writes:
> [about Herman Rubin's proposed support for reading from a buffer]
> 
> Why should Herman Rubin's proposed operation involve the UNIX kernel at all?
> He never asked for that!  Imagine a scheme where we have
> 
	[scheme defined in C++ notation]

Well, Prof. Rubin _did_ ask for an "interrupt", and most Unix kernels insist
on being involved in any interrupts. And Unix is not unique in this.

I don't understand from the C++ code precisely what hardware instruction
you're suggesting--for example, how it will specify the procedure and the
count. But if you're interested, you could define the instruction precisely
and ask some of the experts among the readership to estimate its costs. For
example: If you use existing registers to specify the procedure address and
the count, does that complicate decoding? If you put a pointer in memory, does
that make it complicated for a pipelined processor to restart the instruction
in case of a page fault? If you use a special prefix register, are you adding
context which must be saved?
-- 
sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc 		Steve Correll

cik@l.cc.purdue.edu (Herman Rubin) (09/17/90)

In article <2128@key.COM>, sjc@key.COM (Steve Correll) writes:
> In article <3747@goanna.cs.rmit.oz.au>, ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> > In article <2123@key.COM>, sjc@key.COM (Steve Correll) writes:
> > [about Herman Rubin's proposed support for reading from a buffer]
> > 
> > Why should Herman Rubin's proposed operation involve the UNIX kernel at all?
> > He never asked for that!  Imagine a scheme where we have
> > 
> 	[scheme defined in C++ notation]
> 
> Well, Prof. Rubin _did_ ask for an "interrupt", and most Unix kernels insist
> on being involved in any interrupts. And Unix is not unique in this.

Let me clarify the situation.  There are natural situations arising in 
Monte Carlo where one cannot predict when a buffer will empty.  Steps
can be taken to make it rare.  Therefore, the "natural" way to want to
handle this is to have an interrupt, or exception if you prefer, when
this does happen.  Is there any good reason why the user cannot set up
conditions for interrupts, and handling procedures for them?  I believe
that some setups allow for these without necessarily invoking the kernel.

The advantage of an interrupt procedure over a test each time is that
the interrupt is rare, and does not have the costs of the test when it
is not invoked.  It is likely to lengthen the instruction, however.  On
the other hand, an interrupt certainly has much higher costs than the
test when it does cut in.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

aglew@crhc.uiuc.edu (Andy Glew) (09/17/90)

>Let me clarify the situation.  There are natural situations arising in 
>Monte Carlo where one cannot predict when a buffer will empty.  Steps
>can be taken to make it rare.  Therefore, the "natural" way to want to
>handle this is to have an interrupt, or exception if you prefer, when
>this does happen.  Is there any good reason why the user cannot set up
>conditions for interrupts, and handling procedures for them?  I believe
>that some setups allow for these without necessarily invoking the kernel.
>
>The advantage of an interrupt procedure over a test each time is that
>the interrupt is rare, and does not have the costs of the test when it
>is not invoked.  It is likely to lengthen the instruction, however.  On
>the other hand, an interrupt certainly has much higher costs than the
>test when it does cut in.

I think I understand.  It's not so much a block copy as a
read-the-next-item operation, where the reads are scattered all over.

I frequently encounter situations like this when I am instrumenting
code, eg. recording events into a trace buffer.  The branches to
detect buffer overflow are often much more expensive than the record
of the event itself.

In such circumstances I often want to ensure that my buffers are page
aligned, and that an access to one element past the end of the buffer
will cause a page fault that I can handle.  I've only done this once,
because I'm not always in the kernel.  Conceivably MACH's memory
handling facilities could let you do this.
    Recently, I encountered a situation where I did not want to trap
on trace buffer overflow, but only to silently go on. (Ie. stop
recording but do not disturb the rest of the computation).  Mapping
many virtual pages past the end of my buffer to the same physical page
would have accomplished this - even better would have been a /dev/null
page type.


--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

wayne@dsndata.uucp (Wayne Schlitt) (09/17/90)

In article <2567@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> Let me clarify the situation.  There are natural situations arising in 
> Monte Carlo where one cannot predict when a buffer will empty.  Steps
> can be taken to make it rare.  Therefore, the "natural" way to want to
> handle this is to have an interrupt, or exception if you prefer, when
> this does happen.  [ ... ] 
> 
> The advantage of an interrupt procedure over a test each time is that
> the interrupt is rare, and does not have the costs of the test when it
> is not invoked.  It is likely to lengthen the instruction, however.  On
> the other hand, an interrupt certainly has much higher costs than the
> test when it does cut in.


the more i think about it, the less i see any value of this type of
instruction.  IFF your processor has a deep pipeline and/or it doesnt
stall on loads from memory AND your buffer isnt in cache, then your
compare is going to be totally hidden by the delay of loading from
memory.  let's look at the code...


you want something like this:

top_of_loop:
        add     R1,R0                   # do something with R0        
        loadbuf R0,buffer,endofbuf      # fetch something from the buffer
                                        # check to make sure it isnt past
                                        # endofbuf and increment buffer
                                        # to the next item
        bra     top_of_loop             # do some more processing.


instead, what you are getting is:

top_of_loop:
        add     R1,R0                   # do something with R0
        load    R0,(buffer)             # fetch something from the buffer
        cmp     buffer,endofbuf         # see if we are at the end of buf
        ble     top_of_loop             # if not, go process it
        inc     buffer                  # ye, old, delay slot of the branch

since R0 isnt going to be ready to use for several instructions after
the load, your code is simply going to stall on the add at the top of
the loop.  the code you are getting is going to be doing the compare
instead of stalling.  i dont see what the difference is.  your loop
isnt going to run any faster...


-wayne

cik@l.cc.purdue.edu (Herman Rubin) (09/19/90)

In article <WAYNE.90Sep17125041@dsndata.uucp>, wayne@dsndata.uucp (Wayne Schlitt) writes:
> In article <2567@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

			......................

| > The advantage of an interrupt procedure over a test each time is that
| > the interrupt is rare, and does not have the costs of the test when it
| > is not invoked.  It is likely to lengthen the instruction, however.  On
| > the other hand, an interrupt certainly has much higher costs than the
| > test when it does cut in.
 
 
> the more i think about it, the less i see any value of this type of
> instruction.  IFF your processor has a deep pipeline and/or it doesnt
> stall on loads from memory AND your buffer isnt in cache, then your
> compare is going to be totally hidden by the delay of loading from
> memory.  let's look at the code...
 
 
> you want something like this:
 
> top_of_loop:
>         add     R1,R0                   # do something with R0        
>         loadbuf R0,buffer,endofbuf      # fetch something from the buffer
>                                         # check to make sure it isnt past
>                                         # endofbuf and increment buffer
>                                         # to the next item
>         bra     top_of_loop             # do some more processing.
> 
> 
> instead, what you are getting is:
 
> top_of_loop:
>         add     R1,R0                   # do something with R0
>         load    R0,(buffer)             # fetch something from the buffer
>         cmp     buffer,endofbuf         # see if we are at the end of buf
>         ble     top_of_loop             # if not, go process it
>         inc     buffer                  # ye, old, delay slot of the branch
 
> since R0 isnt going to be ready to use for several instructions after
> the load, your code is simply going to stall on the add at the top of
> the loop.  the code you are getting is going to be doing the compare
> instead of stalling.  i dont see what the difference is.  your loop
> isnt going to run any faster...

Why should I not run other instructions while waiting for the memory access?
I am not reading a whole buffer; I am reading one item, and doing something
with it.  If I were using such a short loop, I would definitely decide if
the number of items to be read is more or less than what is in the buffer,
and write my code accordingly.

I am quite aware of the problems of programming when instructions can run
in parallel, and I do write my code to take advantage of it, sometimes even
when the language does not have any way for me to tell the compiler to take
advantage of it.


-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

lyndon@cs.athabascau.ca (Lyndon Nerenberg) (09/19/90)

cik@l.cc.purdue.edu (Herman Rubin) writes:
>  Is there any good reason why the user cannot set up
>conditions for interrupts, and handling procedures for them?  I believe
>that some setups allow for these without necessarily invoking the kernel.

I'm reminded of the (undocumented) facility in MTS (Michigan Terminal
System) that allowed the user to intercept addressing exceptions without
going through the full blown exception handling mechanism. Very early
in the exception handling code, the kernel would look at the instruction
following the one that generated the exception. If this was a NOP, the
kernel would extract the address field and use it as a pointer to a user
specified exception handler in user VM space, and branch to it (in user
mode). Although this did involve descending into the kernel somewhat (there's
no way around that in a VM system), it avoided doing a full blown context
switch to supervisor mode.

Programs that liked to poke around in shared system memory (you thought
UNIX security was bad? :-) were littered with statements like:

	.
	.
	.
	MVC   mumble,foo	Move shared mem to user buffer
* Gack, where's my copy of Strubel?!? I can't even remember SS
* instruction syntax! :-(
	NOP   15,MYTRAP		Ack! Fell off shared memory
	.
	.
	.
MYTRAP	EQU   *		Routine to spin through non-existent memory
*			looking for the start of the next valid
*			chunk of VM

I sort of recall that the R field of the NOP instruction was interpreted
as a bitmask that you could use to flag to the supervisor which exceptions
you were willing to handle.

-- 
    Lyndon Nerenberg  VE6BBM / Computing Services / Athabasca University
        {alberta,cbmvax,mips}!atha!lyndon || lyndon@cs.athabascau.ca

      The only thing open about OSF is their mouth.  --Chuck Musciano

sjc@key.COM (Steve Correll) (09/20/90)

In article <AGLEW.90Sep17121944@dwarfs.crhc.uiuc.edu>, aglew@crhc.uiuc.edu (Andy Glew) writes:
> In such circumstances I often want to ensure that my buffers are page
> aligned, and that an access to one element past the end of the buffer
> will cause a page fault that I can handle.  I've only done this once,
> because I'm not always in the kernel.  Conceivably MACH's memory
> handling facilities could let you do this.

Unix System V Release 4 also appears to provide this service to the user
process via "mmap" and "mprotect" calls, though I don't have a system to try
them out on right now. Whatever one thinks about the feasibility of the
proposed instruction, it seems likely that one OS or another is going to
solve Prof. Rubin's problem long before any instruction set architect does!
-- 
sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc 		Steve Correll