sjc@key.COM (Steve Correll) (09/13/90)
Herman Rubin writes:
> There are quite a few situations in transaction processing which could use a
> few intelligent instructions, like reading from a buffer with an automatic
> interrupt when the buffer becomes empty.

In article <4043@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
> "Reading from a buffer" in what sense? Is this just an in-memory buffer
> being read by a transaction processing application - in which case I
> don't see how an *interrupt* would help, as a dumb old conditional
> branch testing whether the number of characters left in the buffer was
> zero would probably be *faster* than an interrupt (no tons of context to
> save, etc.), or is it something else?

In article <STEPHEN.90Sep10191125@estragon.uchicago.edu>, stephen@estragon.uchicago.edu (Stephen P Spackman) writes:
> Er. You have an 8k buffer. Each branch takes 1 cycle to test, 1 cycle
> if not taken. There goes 16k cycles. You're telling me that servicing
> a trap is going to take 16k cycles?

First, let's be sure everybody understands what the proposed instruction is
meant to do. I assume we're filling, emptying, or copying a buffer using a
loop that deals with one character at a time, and we want a magic load or
store which will interrupt us when we try to load or store past the end of
the buffer. If that's wrong, please ignore the following and post a
clarifying definition.

I decided to perform a couple of crude experiments.
First, to figure out how long it takes to visit the Unix kernel, activate a
user signal handler, and return:

    #include <signal.h>

    static void handler()
    {
        (void) signal(SIGTERM, handler);
    }

    int main(argc, argv)
    int argc;
    char **argv;
    {
        int i, mypid = getpid();

        (void) signal(SIGTERM, handler);
        for (i = 0; i < 100000; i++)    /* loop for better timing */
            (void) kill(mypid, SIGTERM);
        return 0;
    }

Then to figure out how long it takes to copy an 8K buffer using comparison
and conditional branch:

    #define BUFLEN 8192
    char buf0[BUFLEN], buf1[BUFLEN];

    void bar()
    {
        char *limit = buf0 + BUFLEN;
        char *p0 = buf0;
        char *p1 = buf1;

        while (p0 < limit)
            *p0++ = *p1++;
    }

    int main()
    {
        int i;

        for (i = 1; i < 10000; i++)     /* loop for better timing */
            (void) bar();
        return 0;
    }

Then to figure out how much the proposed instruction would save us by
eliminating the comparisons, I unrolled the loop:

    #define BUFLEN 8192
    char buf0[BUFLEN], buf1[BUFLEN];

    void bar()
    {
        char *limit = buf0 + BUFLEN;
        char *p0 = buf0;
        char *p1 = buf1;

        while (p0 < limit) {
            *p0++ = *p1++;
            *p0++ = *p1++;
            *p0++ = *p1++;
            *p0++ = *p1++;
        }
    }

    int main()
    {
        int i;

        for (i = 1; i < 10000; i++)     /* loop for better timing */
            (void) bar();
        return 0;
    }

Now it's tricky to discern from the second and third programs the savings
due to the proposed instruction. I compared the inner loops on a Sun 4:

    LY1:                    ! [internal]
            ldsb    [%o4],%g1
            stb     %g1,[%o5]
            inc     %o5
            cmp     %o5,%o3
            bcs     LY1
            inc     %o4

    LY1:                    ! [internal]
            ldsb    [%o4],%g1
            stb     %g1,[%o5]
            inc     %o4
            ldsb    [%o4],%g2
            inc     %o5
            stb     %g2,[%o5]
            inc     %o4
            ldsb    [%o4],%g3
            inc     %o5
            stb     %g3,[%o5]
            inc     %o4
            ldsb    [%o4],%g4
            inc     %o5
            stb     %g4,[%o5]
            inc     %o5
            cmp     %o5,%o3
            bcs     LY1
            inc     %o4

I believe that we want to figure out the cost of the "cmp" and not the
"bcs", because even with the new instruction, you have to retain a branch or
you don't have a loop; and on all but the last iteration, the branch is
taken and the instruction in the delay slot is useful.
The difference in cost between the second and third programs is

    (4 * (cost of branches + cost of compares))
      - (1 * (cost of branches + cost of compares))
      =  3 * (cost of branches + cost of compares)

Assuming the cost of the branch equals the cost of the compare, the cost of
the remaining compares in program 3 is (1/3) * (1/2) = 1/6 of the observed
difference between program 2 and program 3. (Note that I started my
experiments on a MipsCo machine, but changed to SPARC partly because the
MipsCo architecture didn't require a separate compare instruction at all
when the comparison was for equality or inequality!)

I used cc -O on the Sun 4, measuring the sum of user and system times. The
average of three trials for each was:

    Program 1:                                        31.5 sec
    Program 2:                                        46.2 sec
    Program 3:                                        41.4 sec
    Program 2 - Program 3:        46.2 - 41.4       =  4.8 sec
    Cost of remaining compares
      in Program 3:               4.8 / 6           =  0.8 sec
    Cost of trips through
      signal handler:                                  2.8 sec  [note 1]
    Cost of Program 3 with
      "new instruction":          41.4 - 0.8 + 2.8  = 43.4 sec
    Decrease in execution time
      versus Program 2:                                6%       [note 2]

[Note 1]: An additional experiment showed that the cost of the timing loop,
kill(), and initialization code in program 1 is about 10% of the total, and
we must divide by 10 because program 1 loops 10 times as often as the
others, so the cost of trips through the signal handler is
(0.9 * 31.5) / 10 = 2.8 sec.

[Note 2]: Given the crudity of the experiment, I think one significant
digit is the very most I can present without shame! I didn't try to
eliminate the cost of the timing loop and initialization code in programs 2
and 3 because, since it's the same for both programs, it roughly cancels.

Readers can draw their own conclusions. One might contend that buffers are
usually bigger, magnifying the improvement. Or one might contend that
realistic examples perform more processing than just moving characters from
place to place, diminishing the percentage improvement.
And someone who understands implementations would have to tell us the costs
of adding such an instruction.

My conclusions are (1) it's easy to spend quite a lot of time on a scenic
tour through the kernel and (2) it's interesting to measure actual seconds
rather than conceptual cycles.
-- 
sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc  Steve Correll
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/13/90)
In article <2123@key.COM>, sjc@key.COM (Steve Correll) writes:
[about Herman Rubin's proposed support for reading from a buffer]
> I decided to perform a couple of crude experiments. First, to figure out how
> long it takes to visit the Unix kernel, activate a user signal handler, and
> return:

Why should Herman Rubin's proposed operation involve the UNIX kernel at all?
He never asked for that! Imagine a scheme where we have

    struct BufHdr {
        unsigned char *next;
        unsigned int count;
        void (*refill)(struct BufHdr *);
        ...
    };

and the operation is

    unsigned char fetch(struct BufHdr *b)
    {
        while (b->count == 0)
            (b->refill)(b);
        b->count--;
        return *(b->next++);
    }

Now, if this is done in hardware:
-- fetching the character and testing the count can proceed in parallel.
-- decrementing the count and incrementing the pointer can proceed in
   parallel with the *next* instruction in the rest of the program.

So the time that the program would spend waiting for the FETCH instruction
would be the time it would spend for two memory references:

    FETCH   r0, r1

on a hypothetical extended NS32532 would take the same time as

    MOVB    0(0(sp)), r1

Yes, more work would be done, but it could be done in parallel. When the
instruction had to trap, it would *not* have to involve the kernel in signal
handling; it would just do an ordinary procedure call. This is a pretty
substantial speedup for this operation.

The Xerox Lisp machines had micro-coded "get byte from buffer" and "put byte
into buffer" operations. It paid off handsomely, on that particular
architecture, with that particular application mix.

With the advent of 16-bit character sets, and in particular of 16-bit
character sets encoded in 8-bit streams (see the introduction in the SVID,
or the relevant manual pages on a Sony NEWS machine), the cost of fetching a
byte becomes less significant, as one then has to check whether this byte
needs to be combined with the next.
At the moment, there are several ways of doing this encoding in actual use,
so there isn't one right one that could be built into an instruction.

My own belief is that a getc() instruction is not warranted. But it deserves
a fairer evaluation than it got.
-- 
Heuer's Law: Any feature is a bug unless it can be turned off.
sjc@key.COM (Steve Correll) (09/17/90)
In article <3747@goanna.cs.rmit.oz.au>, ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> In article <2123@key.COM>, sjc@key.COM (Steve Correll) writes:
> [about Herman Rubin's proposed support for reading from a buffer]
>
> Why should Herman Rubin's proposed operation involve the UNIX kernel at all?
> He never asked for that! Imagine a scheme where we have
> [scheme defined in C++ notation]

Well, Prof. Rubin _did_ ask for an "interrupt", and most Unix kernels insist
on being involved in any interrupts. And Unix is not unique in this.

I don't understand from the C++ code precisely what hardware instruction
you're suggesting--for example, how it will specify the procedure and the
count. But if you're interested, you could define the instruction precisely
and ask some of the experts among the readership to estimate its costs. For
example: If you use existing registers to specify the procedure address and
the count, does that complicate decoding? If you put a pointer in memory,
does that make it complicated for a pipelined processor to restart the
instruction in case of a page fault? If you use a special prefix register,
are you adding context which must be saved?
-- 
sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc  Steve Correll
cik@l.cc.purdue.edu (Herman Rubin) (09/17/90)
In article <2128@key.COM>, sjc@key.COM (Steve Correll) writes:
> In article <3747@goanna.cs.rmit.oz.au>, ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> > In article <2123@key.COM>, sjc@key.COM (Steve Correll) writes:
> > [about Herman Rubin's proposed support for reading from a buffer]
> >
> > Why should Herman Rubin's proposed operation involve the UNIX kernel at all?
> > He never asked for that! Imagine a scheme where we have
> > [scheme defined in C++ notation]
>
> Well, Prof. Rubin _did_ ask for an "interrupt", and most Unix kernels insist
> on being involved in any interrupts. And Unix is not unique in this.

Let me clarify the situation. There are natural situations arising in Monte
Carlo where one cannot predict when a buffer will empty. Steps can be taken
to make it rare. Therefore, the "natural" way to handle this is to have an
interrupt, or exception if you prefer, when this does happen. Is there any
good reason why the user cannot set up conditions for interrupts, and
handling procedures for them? I believe that some setups allow for these
without necessarily invoking the kernel.

The advantage of an interrupt procedure over a test each time is that the
interrupt is rare, and does not have the costs of the test when it is not
invoked. It is likely to lengthen the instruction, however. On the other
hand, an interrupt certainly has much higher costs than the test when it
does cut in.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!cik (UUCP)
aglew@crhc.uiuc.edu (Andy Glew) (09/17/90)
>Let me clarify the situation. There are natural situations arising in
>Monte Carlo where one cannot predict when a buffer will empty. Steps
>can be taken to make it rare. Therefore, the "natural" way to
>handle this is to have an interrupt, or exception if you prefer, when
>this does happen. Is there any good reason why the user cannot set up
>conditions for interrupts, and handling procedures for them? I believe
>that some setups allow for these without necessarily invoking the kernel.
>
>The advantage of an interrupt procedure over a test each time is that
>the interrupt is rare, and does not have the costs of the test when it
>is not invoked. It is likely to lengthen the instruction, however. On
>the other hand, an interrupt certainly has much higher costs than the
>test when it does cut in.

I think I understand. It's not so much a block copy as a read of the next
item, where the reads are scattered all over.

I frequently encounter situations like this when I am instrumenting code,
e.g. recording events into a trace buffer. The branches to detect buffer
overflow are often much more expensive than the record of the event itself.
In such circumstances I often want to ensure that my buffers are page
aligned, and that an access to one element past the end of the buffer will
cause a page fault that I can handle. I've only done this once, because I'm
not always in the kernel. Conceivably MACH's memory handling facilities
could let you do this.

Recently, I encountered a situation where I did not want to trap on trace
buffer overflow, but only to silently go on (i.e. stop recording but do not
disturb the rest of the computation). Mapping many virtual pages past the
end of my buffer to the same physical page would have accomplished this -
even better would have been a /dev/null page type.
-- 
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
wayne@dsndata.uucp (Wayne Schlitt) (09/17/90)
In article <2567@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> Let me clarify the situation. There are natural situations arising in
> Monte Carlo where one cannot predict when a buffer will empty. Steps
> can be taken to make it rare. Therefore, the "natural" way to
> handle this is to have an interrupt, or exception if you prefer, when
> this does happen. [ ... ]
>
> The advantage of an interrupt procedure over a test each time is that
> the interrupt is rare, and does not have the costs of the test when it
> is not invoked. It is likely to lengthen the instruction, however. On
> the other hand, an interrupt certainly has much higher costs than the
> test when it does cut in.

the more i think about it, the less i see any value in this type of
instruction. IFF your processor has a deep pipeline and/or it doesnt stall
on loads from memory AND your buffer isnt in cache, then your compare is
going to be totally hidden by the delay of loading from memory. let's look
at the code... you want something like this:

    top_of_loop:
        add     R1,R0               # do something with R0
        loadbuf R0,buffer,endofbuf  # fetch something from the buffer,
                                    # check to make sure it isnt past
                                    # endofbuf, and increment buffer
                                    # to the next item
        bra     top_of_loop         # do some more processing.

instead, what you are getting is:

    top_of_loop:
        add     R1,R0           # do something with R0
        load    R0,(buffer)     # fetch something from the buffer
        cmp     R0,endofbuf     # see if we are at the end of buf
        ble     top_of_loop     # if not, go process it
        inc     buffer          # ye, old, delay slot of the branch

since R0 isnt going to be ready to use for several instructions after the
load, your code is simply going to stall on the add at the top of the loop.
the code you are getting is going to be doing the compare instead of
stalling. i dont see what the difference is. your loop isnt going to run
any faster...

        -wayne
cik@l.cc.purdue.edu (Herman Rubin) (09/19/90)
In article <WAYNE.90Sep17125041@dsndata.uucp>, wayne@dsndata.uucp (Wayne Schlitt) writes:
> In article <2567@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> ......................
> > The advantage of an interrupt procedure over a test each time is that
> > the interrupt is rare, and does not have the costs of the test when it
> > is not invoked. It is likely to lengthen the instruction, however. On
> > the other hand, an interrupt certainly has much higher costs than the
> > test when it does cut in.
>
> the more i think about it, the less i see any value in this type of
> instruction. IFF your processor has a deep pipeline and/or it doesnt
> stall on loads from memory AND your buffer isnt in cache, then your
> compare is going to be totally hidden by the delay of loading from
> memory. let's look at the code... you want something like this:
>
>     top_of_loop:
>         add     R1,R0               # do something with R0
>         loadbuf R0,buffer,endofbuf  # fetch something from the buffer,
>                                     # check to make sure it isnt past
>                                     # endofbuf, and increment buffer
>                                     # to the next item
>         bra     top_of_loop         # do some more processing.
>
> instead, what you are getting is:
>
>     top_of_loop:
>         add     R1,R0           # do something with R0
>         load    R0,(buffer)     # fetch something from the buffer
>         cmp     R0,endofbuf     # see if we are at the end of buf
>         ble     top_of_loop     # if not, go process it
>         inc     buffer          # ye, old, delay slot of the branch
>
> since R0 isnt going to be ready to use for several instructions after
> the load, your code is simply going to stall on the add at the top of
> the loop. the code you are getting is going to be doing the compare
> instead of stalling. i dont see what the difference is. your loop
> isnt going to run any faster...

Why should I not run other instructions while waiting for the memory
access? I am not reading a whole buffer; I am reading one item, and doing
something with it.
If I were using such a short loop, I would definitely decide whether the
number of items to be read is more or less than what is in the buffer, and
write my code accordingly. I am quite aware of the problems of programming
when instructions can run in parallel, and I do write my code to take
advantage of it, sometimes even when the language does not have any way for
me to tell the compiler to take advantage of it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!cik (UUCP)
lyndon@cs.athabascau.ca (Lyndon Nerenberg) (09/19/90)
cik@l.cc.purdue.edu (Herman Rubin) writes:
> Is there any good reason why the user cannot set up
> conditions for interrupts, and handling procedures for them? I believe
> that some setups allow for these without necessarily invoking the kernel.

I'm reminded of the (undocumented) facility in MTS (Michigan Terminal
System) that allowed the user to intercept addressing exceptions without
going through the full-blown exception handling mechanism. Very early in
the exception handling code, the kernel would look at the instruction
following the one that generated the exception. If this was a NOP, the
kernel would extract the address field and use it as a pointer to a
user-specified exception handler in user VM space, and branch to it (in
user mode). Although this did involve descending into the kernel somewhat
(there's no way around that in a VM system), it avoided doing a full-blown
context switch to supervisor mode.

Programs that liked to poke around in shared system memory (you thought
UNIX security was bad? :-) were littered with statements like:

            .
            .
            .
            MVC   mumble,foo    Move shared mem to user buffer
    * Gack, where's my copy of Strubel?!? I can't even remember SS
    * instruction syntax! :-(
            NOP   15,MYTRAP     Ack! Fell off shared memory
            .
            .
            .
    MYTRAP  EQU   *             Routine to spin through non-existent memory
    *                           looking for the start of the next valid
    *                           chunk of VM

I sort of recall that the R field of the NOP instruction was interpreted as
a bitmask that you could use to flag to the supervisor which exceptions you
were willing to handle.
-- 
Lyndon Nerenberg  VE6BBM / Computing Services / Athabasca University
{alberta,cbmvax,mips}!atha!lyndon || lyndon@cs.athabascau.ca
    The only thing open about OSF is their mouth.  --Chuck Musciano
sjc@key.COM (Steve Correll) (09/20/90)
In article <AGLEW.90Sep17121944@dwarfs.crhc.uiuc.edu>, aglew@crhc.uiuc.edu (Andy Glew) writes:
> In such circumstances I often want to ensure that my buffers are page
> aligned, and that an access to one element past the end of the buffer
> will cause a page fault that I can handle. I've only done this once,
> because I'm not always in the kernel. Conceivably MACH's memory
> handling facilities could let you do this.

Unix System V Release 4 also appears to provide this service to the user
process via "mmap" and "mprotect" calls, though I don't have a system to try
them out on right now. Whatever one thinks about the feasibility of the
proposed instruction, it seems likely that one OS or another is going to
solve Prof. Rubin's problem long before any instruction set architect does!
-- 
sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc  Steve Correll