mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (08/03/89)
Well, I haven't heard anything about the i860 lately on this group, so maybe I should start things up again....

I have been re-reading various postings about the i860 floating-point unit, and my current impression is that it is seriously bizarre. The pipelined floating-point instructions are definitely not vector instructions, since one instruction is required for every operation. (Some instructions use both the adder and the multiplier, but a separate instruction is still required for each input data pair.) So it looks like an optimized fp code would be something like the code below - apologies to those who know what the instruction format _really_ is. :-) All the src? and dest? are register addresses, pointing to the 16 64-bit floating-point registers.

	f_mul  src1, src2, dest0    % begin loading pipe
	f_mul  src3, src4, dest0    % load second stage of pipe
	pf_mul src5, src6, dest1    % specifies dest for 1st multiply!!!
	pf_mul src7, src8, dest2    % specifies dest for 2nd multiply
	pf_mul dum0, dum0, dest3    % begin unloading pipe
	pf_mul dum0, dum0, dest4    % end unloading pipe

At this point I am finished, as I have used 8 source registers and 4 destination registers. Three more registers are available to store other scalars and such, but since loads are by 128-bit quantities, it is convenient to work with an even number of elements. (Note that only 15 registers can be used, since fp0 is a fixed floating-point zero.)

So I have done 4 floating-point operations in 6 cycles, and now I have to store the results to memory and grab some new sources from memory. I can load/store in pipelined mode, too, with two 64-bit operands transferred every cycle and a 3-cycle pipeline length. So I would need two store instructions (4 cycles) and four load instructions (6 cycles) to re-fill the registers that I used. Overall, then, it takes something like 16 cycles to deliver 4 64-bit floating-point operations, assuming no data or instruction cache misses.
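The cycle count above tallies up as follows (a Python sketch of the arithmetic in the estimate, nothing more - the numbers are copied from the text, not measured):

```python
# Tallying the cycle estimate above (numbers from the text, not measurements).
fp_cycles = 6       # 2 f_mul + 4 pf_mul to push 4 results through the pipe
store_cycles = 4    # two pipelined 128-bit stores, 3-cycle pipeline
load_cycles = 6     # four pipelined loads to refill the 8 source registers
total_cycles = fp_cycles + store_cycles + load_cycles
flops = 4
print(total_cycles, flops / total_cycles)   # 16 cycles, 0.25 FLOP/cycle
```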
To get this level of performance requires loop unrolling to a depth of 4 for this simple operation. More complicated loops may not be unrollable to even this depth, since only 16 registers are available.

Lots of questions:
(1) Am I missing something obvious?
(2) Can more things be overlapped than this?
(3) The programmers manual refers to instructions that use both the
    adder and the multiplier, but most of these look like accumulate
    functions (e.g. dot product). Is there a "linked triad" instruction
    which takes 3 operands and does x = y + q*z, where 4 registers are
    used?

Comments welcome....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
                   mccalpin@delocn.udel.edu
chase@Ozona.orc.olivetti.com (David Chase) (08/04/89)
In article <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>Well, I haven't heard anything about the i860 lately on this group, so
>maybe I should start things up again....
>I have been re-reading various postings about the i860 floating-point unit
>and my current impression is that it is seriously bizarre.
>Lots of questions:
>(1) Am I missing something obvious?
>(2) Can more things be overlapped than this?
>(3) The programmers manual refers to instructions that use both the
>    adder and the multiplier, but most of these look like accumulate
>    functions (e.g. dot product). Is there a "linked triad" instruction
>    which takes 3 operands and does x = y + q*z, where 4 registers are
>    used?

(1) I would say, no, you are not missing anything obvious.
(2) Yes, quite a bit.
(3) Sort of. I bet you'd like an explanation of that, wouldn't you?

The best way to make the chip scream is to get it into dual-instruction mode, issuing one dual-operation floating-point instruction and one core instruction at each clock tick. This is how you get the 120 MOPS figure for a 40MHz chip (I have no idea what clock rate is actually being shipped, if any, but 40 is a nice round number). Anyhow, to do this, and actually be doing something useful, you have to (1) come up with some computation that can actually make use of the many strange ways that the adder and multiplier pipes can be hooked up and (2) figure out how to amortize the load-store bandwidth of one floating-point instruction onto ONE core unit instruction.
There are three problems that illustrate this, sort of (assume row-major storage, i.e., not Fortran, in the examples that follow):

(1) row-row dot product        inner loop of A * B-transpose
    (This example appears in the i860 PRM in several forms)
(2) row-column dot product     inner loop of A * B
(3) elementary row operations  Gaussian elimination, among other things

For a row-row dot product, the inner loop looks like (in LaTeX):

  d.ml2apm.ss f4,  f12, f0 & fld.q  16(r29)++, f8       & Load $A$ \\
  d.ml2apm.ss f5,  f13, f0 & pfld.d 8(r24)++,  f16      & Load $B$ \\
  d.ml2apm.ss f6,  f14, f0 & pfld.d 8(r24)++,  f18      & Load $B$ \\
  d.ml2apm.ss f7,  f15, f0 & fld.q  16(r29)++, f4       & Load $A$ \\
  d.ml2apm.ss f8,  f16, f0 & nop                        & \\
  d.ml2apm.ss f9,  f17, f0 & pfld.d 8(r24)++,  f12      & Load $B$ \\
  d.ml2apm.ss f10, f18, f0 & bla r27,r28, inner\_loop   & \\
  d.ml2apm.ss f11, f19, f0 & pfld.d 8(r24)++,  f14      & Load $B$

The idea is that the row of matrix A is stored in the cache, but the row of matrix B is not. The loop is unrolled 8 times in order to get the required separation (for top speed) between the instructions which load the registers and the instructions which use them. The 8 floating-point instructions require 16 input operands, and these must somehow be loaded in 7 core unit instructions. This is achieved by using quad-word loads from cache (8/4 = 2) and double-word pipelined loads (8/2 = 4). I have omitted the gory details of the start-up code, and of which load puts its results where, and when.

One detail that should not be omitted is the floating-point instruction used here. "ml2apm.ss" advances the (3-stage) multiplier pipeline one step and advances the (3-stage) adder pipeline one step. The inputs to the multiplier are the two register operands; the inputs to the adder are the output of the multiplier and the output of the adder. In all the instructions above the output register is f0, meaning "discard".
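The chaining that ml2apm.ss performs can be sketched behaviorally in Python (my own model, not anything from Intel's tools): each step, the multiplier takes two fresh operands, and the adder takes the value leaving the multiplier plus the value recirculating out of the adder itself. Note the three partial sums left circulating at the end, which the edge-condition code must combine.

```python
def ml2apm_dot(a, b, stages=3):
    """Behavioral sketch of an ml2apm.ss-style dot product.
    Each step: two operands enter the 3-stage multiplier; the value
    leaving the multiplier plus the value leaving the adder enters
    the 3-stage adder.  Results written to f0 are discarded."""
    mul_pipe = [0.0] * stages            # values in flight in the multiplier
    add_pipe = [0.0] * stages            # values in flight in the adder
    for x, y in zip(a, b):
        prod = mul_pipe.pop()            # product leaving the multiplier
        mul_pipe.insert(0, x * y)        # new operands enter the multiplier
        acc = add_pipe.pop()             # sum leaving the adder
        add_pipe.insert(0, prod + acc)   # chained back into the adder
    # flush: the last 3 products still have to travel into the adder
    for _ in range(stages):
        prod = mul_pipe.pop()
        mul_pipe.insert(0, 0.0)
        acc = add_pipe.pop()
        add_pipe.insert(0, prod + acc)
    # three partial sums now circulate in the adder; the edge code
    # (omitted on the real chip's exit path) must combine them
    return sum(add_pipe)

# e.g. ml2apm_dot([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]) == 91.0
```

Running it on a vector dotted with itself gives the sum of squares, but only after the explicit flush-and-combine - which is exactly the start-up/wind-down bookkeeping glossed over above.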
(Those of you who care about re-association of your floating point arithmetic should be a little worried by now.) Anyhow, the chip runs at full speed once the row of A is in the cache. Since only that row is cached, it can be pretty large.

(2) is much more subtle. The problem is that however you unroll that loop, the elements of the column of B are not stored consecutively, and thus must be loaded one at a time. For an unrolling by N, this means that N (load B) + N/4 (load A) + 1 (branch) core instructions are required for N floating point instructions. That is, if N = 8 then the chip can only operate at 8/11 (72%) of potential MFlops. However, matrix multiplication (of the non-Strassen sort) is O(N-cubed), and matrix transposition is only O(N-squared). Asymptotically, you can win that way.

However however, there is a way to do it in place. Observe that so far we have been multiplying one row of A by one column of B, and we are choking on the instructions to load B. Obviously, all we need to do is reuse those loads a few times. We can do this by multiplying 3 rows of A by one column of B. Why 3? The adder has three stages, and when running as an accumulator we can treat those as registers. It is claimed that the following code will do the trick:

  d.ml2apm.ss f6,  f16, f0 & nop                       & \\
  d.ml2apm.ss f10, f16, f0 & nop                       & \\
  d.ml2apm.ss f14, f16, f0 & pfld.s r20(r19)++, f18    & \\
  \\
  d.ml2apm.ss f7,  f17, f0 & fld.q 16(r16)++, f4       & \\
  d.ml2apm.ss f11, f17, f0 & fld.q 16(r17)++, f8       & \\
  d.ml2apm.ss f15, f17, f0 & fld.q 16(r18)++, f12      & \\
  \\
  d.ml2apm.ss f4,  f18, f0 & pfld.s r20(r19)++, f19    & \\
  d.ml2apm.ss f8,  f18, f0 & nop                       & \\
  d.ml2apm.ss f12, f18, f0 & pfld.s r20(r19)++, f16    & \\
  \\
  d.ml2apm.ss f5,  f19, f0 & nop                       & \\
  d.ml2apm.ss f9,  f19, f0 & bla r21, r22, inner\_loop & \\
  d.ml2apm.ss f13, f19, f0 & pfld.s r20(r19)++, f17    & \\

In this case, the floating point unit is actually computing three inner products simultaneously.
Doing this also means that the additions in the inner product are not reordered, which is a small win. This might help a little. Each iteration of the loop will multiply these registers, loaded from A

  f6  & f7  & f4  & f5
  f10 & f11 & f8  & f9
  f14 & f15 & f12 & f13

by these registers, loaded from B

  f16   f17   f18   f19

updating three quantities circulating around in the adder pipeline. At the end of the loop, the pipelines look like this:

  |--------Multiplier----------|-------------Adder----------------|

                                   +------------------------------+
                                   V                              |
  f13*f19 -> f9*f19 -> f5*f19 ---> f12*f18 --> f8*f18 --> f4*f18 -+-> f0
                                   +f15*f17    +f11*f17   +f7*f17
                                   +f14*f16    +f10*f16   +f6*f16
                                   +etc        +etc       +etc

Obviously, this is not obvious, and all the edge conditions are enough to make you tear your hair. Also note that those delightful load operations prefer that their operands be aligned (at least, that's how I understood it). Anyhow, once you cross all the i's and dot all the t's, you should be able to multiply matrices in the "usual way" at full speed, i.e. 80 MFlops for a 40MHz chip. I'll post more, later, if people are interested. Dinner and a movie are waiting.

Problem (3) is not so easy, because of bandwidth constraints (more operands), but I haven't tried to rearrange the algorithms yet.

David
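The 3-rows-by-one-column trick can be modeled the same way (again, a behavioral Python sketch of my own, with made-up function names): products from the three rows of A are interleaved into the multiplier, so the three values circulating in the adder pipeline become three independent inner products rather than three partial sums of one.

```python
def three_dots(rows, col, stages=3):
    """Sketch of computing 3 inner products at once by using the 3 adder
    pipeline stages as accumulators.  Assumes len(rows) == stages and
    each row has len(col) elements."""
    mul_pipe = [0.0] * stages
    add_pipe = [0.0] * stages
    for j, c in enumerate(col):          # walk down the column of B
        for row in rows:                 # interleave the three rows of A
            prod = mul_pipe.pop()
            mul_pipe.insert(0, row[j] * c)
            acc = add_pipe.pop()
            add_pipe.insert(0, prod + acc)
    for _ in range(stages):              # flush the multiplier
        prod = mul_pipe.pop()
        mul_pipe.insert(0, 0.0)
        acc = add_pipe.pop()
        add_pipe.insert(0, prod + acc)
    # each adder stage now holds one complete inner product
    return list(reversed(add_pipe))

# e.g. three_dots([[1, 0], [0, 1], [1, 1]], [2, 3]) == [2.0, 3.0, 5.0]
```

Because each accumulator only ever sees the products of one row, the interleaving by 3 is what makes a 3-deep adder pipeline safe to use as three scalar accumulators.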
preston@titan.rice.edu (Preston Briggs) (08/05/89)
In article <MCCALPIN.89Aug3124057@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>The pipelined floating-point instructions are definitely not vector
>instructions, since one instruction is required for every operation.

However, vectorizing compiler technology can be readily (has been, in fact) adapted to make good code for the i860. They treat the on-chip cache as vector registers of reasonable length, and use tight, highly pipelined loops to operate on these "registers." The scheme is actually more flexible than using "real" vector registers, since they can do arbitrarily complex operations, not just simple adds or multiplies.

>So it looks like an optimized fp code would be something like the code below -
>All the src? and dest? are register addresses, pointing to the 16 64-bit
>floating-point registers.

Actually, there are 32 32-bit FP registers. f0 and f1 are reserved. They can be treated as 16 64-bit registers if desired.

> f_mul  src1, src2, dest0    % begin loading pipe
> f_mul  src3, src4, dest0    % load second stage of pipe
> pf_mul src5, src6, dest1    % specifies dest for 1st multiply!!!
> pf_mul src7, src8, dest2    % specifies dest for 2nd multiply
> pf_mul dum0, dum0, dest3    % begin unloading pipe
> pf_mul dum0, dum0, dest4    % end unloading pipe

You're close. There's actually another stage in the pipeline. More accurate is

  pfmul.ss f2,f3,f0    start f2*f3 (single precision)
  pfmul.ss f4,f5,f0    dump garbage into f0
  pfmul.ss f6,f7,f0
  pfmul.ss f8,f9,f2    f2 gets result of f2*f3
  ...

>At this point I am finished, as I have used 8 source registers and 4
>destination registers. Three more registers are available to store
>other scalars and such, but since loads are by 128-bit quantities, it
>is convenient to work with an even number of elements. (Note that only
>15 registers can be used, since fp0 is a fixed floating-point zero.)
>So I have done 4 floating-point operations in 6 cycles, and now I have
>to store the results to memory and grab some new sources from memory.
>I can load/store in pipelined mode, too, with two 64-bit operands
>transferred every cycle, with a 3-cycle pipeline length. So I would
>need two store instructions (4 cycles), and four load instructions (6
>cycles) to re-fill the registers that I used.
>Overall, then, it takes something like 16 cycles to deliver 4 64-bit
>floating-point operations, assuming no data or instruction cache misses.
>To get this level of performance requires loop unrolling to a depth of
>4 for this simple operation. More complicated loops may not be
>unrollable to even this depth, since only 16 registers are available.
>Lots of questions:
>(1) Am I missing something obvious?
>(2) Can more things be overlapped than this?
>(3) The programmers manual refers to instructions that use both the
>    adder and the multiplier, but most of these look like accumulate
>    functions (e.g. dot product). Is there a "linked triad" instruction
>    which takes 3 operands and does x = y + q*z, where 4 registers are
>    used?
>John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
>                   mccalpin@delocn.udel.edu

Basically, more stuff can be overlapped.

1st, there is the "dual instruction mode". This allows you to simultaneously execute an integer ("core") instruction and a floating-point instruction. Core instructions include loading and storing FP registers, so we can overlap loading and storing with actual arithmetic. The i860 also has an auto-increment addressing mode for zapping quickly through vectors.

2nd, there are the "dual operation instructions". These are the ones that allow simultaneous addition and multiplication. Unfortunately, it isn't possible to be completely general when specifying the sources and destinations of the adder and multiplier.
However, it is possible to use both the dual instruction mode and the dual operation instructions at the same time.

3rd and so on, you can get more use out of the registers than you have suggested. After the 1st instruction (in my example), it would be possible to reuse registers f2 and f3 immediately. It's also possible to use techniques like "software pipelining" or "perfect pipelining" to help keep the pipelines full during loops.

For matrix multiply, it looks possible to achieve a rate of 1 FLOP/cycle without using the dual-operation instructions. Using the dual-operation instructions, it should be possible to approach 5/3 FLOP/cycle. This includes memory latency, and is perhaps achievable using automatic techniques.

All this will be fairly difficult for a simple compiler to use. However, by using vectorizing front-ends, they can take advantage of hand-built subroutines implementing many common operations. In addition, a really aggressive compiler will have lots of opportunities for optimization. I think that a fairly complex chip like the i860 will accentuate the difference between the best dependence-based optimizing compilers and the comparatively simple compilers common today. Instead of a 30% improvement, or even a factor of 2 improvement, over a PCC-like compiler, we'll see a fairly sizeable integer factor.

Optimistically yours,

Preston Briggs
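Preston's pfmul example above (the one with "another stage in the pipeline") can be paraphrased as a tiny simulation - my own illustrative model, not i860 code: with a 3-stage pipeline, the destination of each pfmul receives the result of the multiply issued three instructions earlier, and f0 simply discards whatever comes out.

```python
def run_pfmul(ops, regs, stages=3):
    """Behavioral sketch of a pipelined pfmul sequence.
    ops: list of (src1, src2, dest) register names.  Each instruction
    writes the value leaving the pipe to dest (f0 = discard) and
    pushes src1*src2 into the pipe."""
    pipe = [None] * stages               # in-flight results (garbage at reset)
    for s1, s2, dest in ops:
        entering = regs[s1] * regs[s2]   # read sources before writing dest
        leaving = pipe.pop()
        if dest != "f0":                 # f0 is the "discard" destination
            regs[dest] = leaving
        pipe.insert(0, entering)
    return regs

regs = {"f0": 0.0, "f2": 2.0, "f3": 3.0, "f4": 4.0, "f5": 5.0,
        "f6": 6.0, "f7": 7.0, "f8": 8.0, "f9": 9.0}
run_pfmul([("f2", "f3", "f0"),   # start f2*f3
           ("f4", "f5", "f0"),   # garbage still draining into f0
           ("f6", "f7", "f0"),
           ("f8", "f9", "f2")],  # f2 gets the result of f2*f3
          regs)
# regs["f2"] is now 6.0 (= 2.0 * 3.0)
```

The 3-instruction result lag is exactly why the loop-unrolling depth and the register-reuse tricks discussed above matter: a register can be reused as a source as soon as its multiply has entered the pipe, well before its result lands.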
chase@Ozona.orc.olivetti.com (David Chase) (08/05/89)
Continuing a previous follow-up:

Note -- the instruction written as "ml2apm" in the previous posting is probably really called "m12apm". If you're reading this in a true typewriter font, you won't see the difference.

>> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>> Well, I haven't heard anything about the i860 lately on this group, so
>> maybe I should start things up again....
>>(3) The programmers manual refers to instructions that use both the
>>    adder and the multiplier, but most of these look like accumulate
>>    functions (e.g. dot product). Is there a "linked triad" instruction
>>    which takes 3 operands and does x = y + q*z, where 4 registers are
>>    used?

>There are three problems that illustrate this, sort of (assume
>row-major storage, i.e., not Fortran, in the examples that follow)

[1 & 2 discussed in previous post]

(3) elementary row operations  Gaussian elimination, among other things

Elementary row operations are not as easy to get going fast as inner-product operations, because each instruction has (as written) two input operands and one output operand. (Actually, there are three input operands, but we're about to answer John McCalpin's question above.) Typically, also, it will happen that one of the input operands will be cached, but the other input operand and the output operand will not be cached (i.e., will be loaded and stored by the pipelined load/store operations). The difficulty with this is that the pipelined load/store operations have a maximum operand size of 64 bits (consider the width of the bus to memory, and you'll see why), or only two operands. Working in this style with a loop unrolled 8 times will give at best 8/11 of peak floating-point speed.

A digression -- typically, the repeated part of a row operation is of the form:

  Row1[k] := Row1[k] + factor * Row2[k]

for increasing k. That is, there are four operands, but one of them doesn't change.
The i860 FPU accommodates this by having a number of register-like things in the FPU named KR, KI, and T. KR and KI can act as inputs to the multiplier, and T can buffer a result from the multiplier before it is fed to the adder. That is, m12apm feeds the result of the multiplier directly to the adder (with the result of the adder used for the other adder input), while m12ttpa inserts T between the multiplier and the adder. All in all, I count 32 different ways to hook up the multiplier and the adder. In addition, the adder can also subtract, making for a grand total of 64 different dual-operation pipelined FPU instructions. For the example above, either r2p1 or i2p1 might do the trick. (End digression)

Anyhow, row operations don't go at full speed if the updated row is stored in memory and the operands are single precision. Things get even slower in double precision, since even more instructions are necessary to load the operands. I'm still thinking about how this might be sped up. It might be possible to double- or triple-up the row operations (they're usually part of a larger algorithm, like QR factorization or Gaussian elimination), but I haven't thought it through yet.

David
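For reference, the repeated row operation in question is just the following (plain Python, nothing i860-specific); the point of KR is that `factor` can be parked there once, so each element of the loop needs only two memory operands (load Row2[k], store Row1[k]'s update) instead of three:

```python
def row_operation(row1, row2, factor):
    """Elementary row operation: Row1[k] := Row1[k] + factor * Row2[k].
    On the i860 (as I read the digression above), factor would sit in
    KR, Row2[k] would feed the multiplier, and Row1[k] would feed the
    adder via one of the dual-operation hookups."""
    return [r1 + factor * r2 for r1, r2 in zip(row1, row2)]

# e.g. row_operation([1.0, 2.0], [3.0, 4.0], 2.0) == [7.0, 10.0]
```

Even with factor held on-chip, each element still costs a load and a store on the 64-bit path, which is the bandwidth constraint keeping row operations below the peak rate of the inner-product loops.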