daveb@geaclib.UUCP (David Collier-Brown) (11/05/88)
From article <7472@winchester.mips.COM>, by cprice@mips.COM (Charlie Price):
> From the architecture spec:
>
>    Multiply and divide operations are performed by a separate,
>    autonomous execution unit.  After a multiply or divide operation
>    is started, execution of other instructions may continue in parallel.
>    The multiply/divide unit continues to operate during cache miss and
>    other delaying cycles in which no instructions are executed.
>
>    The number of cycles required for multiply/divide operations is
>    implementation-dependent.  The MFHI and MFLO instructions are
>    interlocked so that any attempt to read them before operations
>    have completed will cause execution of instructions to be delayed
>    until the operations finishes.

  Some of you may recognize this set of behavior as substantially
identical to the "future" function in experimental lisp(s):

  Future takes a computation and returns a function which, when
called later, will do one of:
  1) return the result immediately, because it has been computed
     concurrently and the computation has completed, or
  2) block until the concurrent computation is finished, or
  3) compute it itself, if it could not be done concurrently.

  The significance of this should be apparent: here is a RISC
instruction which can be decomposed into two, independently useful
instructions:
  1) floating accumulator move (of course)
  2) wait for concurrent computation to complete...

  If the manufacturers see fit to write a generalized MFHI/MFLO
instruction, they have a compiler-visible "hook" for not just
special-purpose coprocessors, but general-purpose parallelism.

  Shall we call it MAHI/MALO?  Or FUTU?

--dave (them 88k guys is smart, harold) c-b
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |        --Joyce C-B
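[The three-way "future" behavior described above collapses naturally onto modern futures libraries; here is a minimal sketch using Python's concurrent.futures, where `slow_multiply` is an illustrative stand-in for the autonomous multiply unit.]

```python
# Sketch: a Lisp-style "future" via Python's concurrent.futures.
# submit() is the "start it" step (like issuing MULT); result() is the
# "wait only if you have to" step (like MFLO with its interlock).
from concurrent.futures import ThreadPoolExecutor

def slow_multiply(a, b):
    # Stands in for the autonomous multiply/divide unit.
    return a * b

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(slow_multiply, 6, 7)  # computation starts concurrently
    # ... other instructions execute in parallel here ...
    result = fut.result()  # returns at once if done, otherwise blocks
    print(result)
```

If the result is already available when `result()` is called, the caller proceeds immediately; otherwise it stalls, exactly mirroring the MFHI/MFLO interlock.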
wunder@hp-sde.SDE.HP.COM (Walter Underwood) (11/08/88)
> implementation-dependent.  The MFHI and MFLO instructions are
> interlocked so that any attempt to read them before operations
> have completed will cause execution of instructions to be delayed
> until the operations finishes.

   Some of you may recognize this set of behavior as substantially
   identical to the "future" function in experimental lisp(s):
   ...
   The significance of this should be apparent: here is a RISC
   instruction which can be decomposed into two, independently useful
   instructions:
   1) floating accumulator move (of course)
   2) wait for concurrent computation to complete...

As an ALU implementation, this technique dates from the CDC 6600.  In
that machine, even loads from memory were done that way -- stuff the
address of the desired location in a register, and some time later,
the data shows up in a corresponding register.

Machines were outfitted with a few add and multiply units, so that
several operations could be overlapped.  With careful coding, you
could get a lot of computation done in a short time.  No surprise that
DEC advertised the VAX-11/780 as "half the speed of a CDC 6600 for
floating point" at a time when the 6600 was a 15-year-old machine
(1978).

You can find a nice description of the CDC 6600 in Bell and Newell,
"Computer Structures: Readings and Examples".  The article is probably
in the revised version of that book (Siewiorek, Bell, and Newell) as
well, but I only have the original.

wunder
cjosta@taux01.UUCP (Jonathan Sweedler) (11/08/88)
In article <3385@geaclib.UUCP> daveb@geaclib.UUCP (David Collier-Brown) writes:
|From article <7472@winchester.mips.COM>, by cprice@mips.COM (Charlie Price):
|| From the architecture spec:
||
|| After a multiply or divide operation
|| is started, execution of other instructions may continue in parallel.
|| The MFHI and MFLO instructions are
|| interlocked so that any attempt to read them before operations
|| have completed will cause execution of instructions to be delayed
|| until the operations finishes.
|
|  Some of you may recognize this set of behavior as substantially
|identical to the "future" function in experimental lisp(s):
|  Future takes a computation and returns a function which, when
|called later will do one of:
|  1) return the result immediately, because it has been computed
|     concurrently and the computation has completed, or
|  2) block until the concurrent computation is finished, or
|  3) compute it itself if it could do it concurrently.
|
|  The significance of this should be apparent: here is a RISC
|instruction which can be decomposed into two, independently useful
|instructions:
|  1) floating accumulator move (of course)
|  2) wait for concurrent computation to complete...
|

Isn't this the same principle as scoreboarding in a pipelined
architecture, as used in many processors today?  In such a design,
an instruction is put in the pipe, but if another instruction tries
to read the destination register of the first instruction before the
result is written into the register, then the pipe is stalled.

Am I missing something?

--
Jonathan Sweedler  ===  National Semiconductor Israel
UUCP:    ...!{amdahl,hplabs,decwrl}!nsc!taux01!cjosta
Domain:  cjosta@taux01.nsc.com
daveb@geaclib.UUCP (David Collier-Brown) (11/10/88)
From article <917@taux01.UUCP>, by cjosta@taux01.UUCP (Jonathan Sweedler):
| Isn't this the same principle as scoreboarding in a pipelined
| architecture that is used in many processors today?  In such a design,
| an instruction is put in the pipe, but if another instruction tries to
| read the destination register of the first instruction before the
| result is written into the register, then the pipe is stalled.

  Yup, that's almost exactly what the lispians were generalizing
from.  It's finally become available on silicon that doesn't belong
to a $upercomputer, which tends to imply that it could become even
more "generally" useful.

--dave
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |        --Joyce C-B
mash@mips.COM (John Mashey) (11/12/88)
Speaking of Lisp support, a fairly recent Cypress databook has
some pages for the CY7C608 Floating Point Controller.
Amongst the features listed is:

    Artificial intelligence support

Does anybody know what AI support would be in an FPC?
(Or is this a leftover typo from another datasheet?)
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
daveb@geaclib.UUCP (David Collier-Brown) (11/12/88)
  I should have included a disclaimer in my response to Jonathan
Sweedler (in <917@taux01.UUCP>), saying that it is the same
**principle** as scoreboarding within the alu/pipeline, but quite a
different application of the principle.

  The principle is "start it, and only wait for it if you have to",
but in this case the application is to large-scale, possibly
programmer-visible coprocessors and parallel processors.  I propose
not that one make a scoreboarding instruction available (ie, a
special case), but instead a general case, which might take the form

    future  function,returnVector2
    ...
    ...
    await   r4,returnVector2

which could be done by a fairly simple trap scheme: it would execute
the function itself iff no coprocessor and return vector were
available, and would send it off to a coprocessor otherwise.  This
would cost some operating-system overhead (meaning that it would not
be a real RISC instruction, but extracode instead).

  The await (which I called FUTU in a previous article) would be a
real RISC instruction though, and one very straightforward to
implement: move from memory area to register iff memory access
controller permits.  This implies, of course, that you are prepared
to live with a small number of coprocessors, or a rather complex
memory access controller.

  Anyone want to comment on the difficulty of achieving this with
current memory management techniques?
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |        --Joyce C-B
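[The proposed future/await pair can be modeled with threads; this is a speculative sketch of the semantics, not the hardware: `ReturnVector` is an invented name for the result slot in the pseudo-assembly above, a thread plays the coprocessor, and the inline-compute fallback corresponds to the trap scheme.]

```python
# Sketch of the proposed future/await semantics, modeled in Python.
import threading

class ReturnVector:
    """One result slot; await blocks until the concurrent unit writes it."""
    def __init__(self):
        self._done = threading.Event()
        self._value = None

    def deliver(self, value):
        # The "coprocessor" write-back.
        self._value = value
        self._done.set()

    def await_result(self):
        # The proposed AWAIT/FUTU instruction: stall only if not delivered.
        self._done.wait()
        return self._value

def future(fn, vec):
    """The proposed FUTURE op: run fn concurrently if a 'coprocessor'
    (here, a thread) is available; else compute it in place (the trap)."""
    try:
        threading.Thread(target=lambda: vec.deliver(fn())).start()
    except RuntimeError:          # no thread available: compute inline
        vec.deliver(fn())

vec = ReturnVector()
future(lambda: 2 ** 10, vec)
# ... other instructions execute here ...
print(vec.await_result())
```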
david@june.cs.washington.edu (David Callahan) (11/14/88)
In article <3410@geaclib.UUCP> daveb@geaclib.UUCP (David Collier-Brown) writes:
>
>  The principle is "start it, and only wait for it if you have to",
>but in this case the application is to large-scale, possibly
>programmer-visible coprocessors and parallel processors.  I propose
>not that one make a scoreboarding instruction available (ie, a
>special case), but instead a general case, which might take the form
>
>    future  function,returnVector2
>    ...
>    ...
>    await   r4,returnVector2
>

This is similar to a mechanism Arvind at MIT has been talking about
for executing Dataflow programs on a RISC-ish processor.  He inverts
the order of operations:

    join    r1              ! wait for incoming token
    add     r1 = r1 + r2
    fork    r3              ! generate extra token

which is sort of what the coprocessor would see in your example.
Arvind's proposed dataflow processor (Monsoon) can be thought of as
multiplexing a large number of these small RISC instruction streams
on the same processor.

A different approach, used on the HEP machine from Denelcor, was to
multiplex more traditional "instruction streams" on one processor.
The processor could support 64 instruction streams and one instruction
stream could fork off a new instruction stream in one instruction
(assuming resources available, trap otherwise).
The HEP then provided Full/Empty bits on each word of memory to support "join"
operations.

>  Anyone want to comment on the difficulty of achieving this with
>current memory management techniques?

I'm not sure what you mean by memory management exactly.  The most
difficult problem I see is "parallelism" management, to achieve load
balancing.

David Callahan, Tera Computer Co., Seattle, WA.
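[The HEP's Full/Empty bits can be sketched as a one-word cell with a tag; this is an illustrative model of the discipline (reads of an empty word wait, writes set full, a consuming read resets to empty), not the HEP's actual microarchitecture.]

```python
# Sketch: a HEP-style Full/Empty-tagged memory word.
import threading

class FEWord:
    """A word tagged full/empty: empty reads wait, writes set full,
    and a consuming read resets the word to empty."""
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_full(self, value):
        with self._cond:
            while self._full:          # producer waits for empty
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def read_empty(self):
        with self._cond:
            while not self._full:      # "join": block until a token arrives
                self._cond.wait()
            self._full = False         # consuming read resets to empty
            self._cond.notify_all()
            return self._value

w = FEWord()
threading.Thread(target=lambda: w.write_full(99)).start()
print(w.read_empty())
```

A "join" is then just a read of a word that some other stream has yet to fill.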
mark@hubcap.UUCP (Mark Smotherman) (11/15/88)
In article <6410@june.cs.washington.edu>, david@june.cs.washington.edu (David Callahan) writes:
> In article <3410@geaclib.UUCP> daveb@geaclib.UUCP (David Collier-Brown) writes:
> >
> >  The principle is "start it, and only wait for it if you have to",
> >
> >    future  function,returnVector2
> >    ...
> >    ...
> >    await   r4,returnVector2
> >
>
>    join    r1              ! wait for incoming token
>    add     r1 = r1 + r2
>    fork    r3              ! generate extra token
>

An early machine, the Gamma 60, implemented fork and join.  A brief
sketch of the machine follows:


Bull Gamma 60 (1959)

History

    Compagnie des Machines Bull of France announced the Gamma 60 in
    1958.  Twenty systems were built.  The logical designers were
    Pierre Chanus, Jean Bosset, and J.P. Cottet (the ones credited in
    the Datamation article).

    The machine is composed of multiple processing units with
    exclusive processing differentiation (e.g. BCD ALU, binary ALU,
    comparison unit, code translation unit) and multiple I/O
    processing units (e.g. drum, tape, etc.), each of which has an
    identical instruction request and data request interface to a
    single supervisory control unit and to main memory.  Each unit is
    capable of independent, concurrent operation once it is selected
    and initiated by the supervisory control unit.

Highlights

    Noteworthy
    - Multiprocessing.
    - Hardware provision for asynchronous parallel processes (i.e.
      FORK and JOIN operations).
    - Synchronization by queueing for both hardware and software
      resources.
    - Processing unit status word.

    Peculiarities
    - Separation of comparison and arithmetic operations into
      distinct processing units.
    - Multiple-word, variable-length instructions with storage of
      control information in instruction sequence.
    - Process scheduling (which is apparently FIFO) is wired into the
      supervisory control unit.

Instruction Formats

    A complete instruction is composed of a series of 24-bit words.
    There may be 1-4 addresses depending on the processing unit.
    Word types are:
        A - address (includes opcode and 15-bit address field)
        B - branch (one appears to be SIMU - i.e. FORK), (15-bit
            address field)
        C - unit selection (misleadingly called an interrupt), (15-bit
            field used for unit number; another field contains the
            number of calls required to this unit before processing
            can begin - i.e. used as JOIN)
        D - operation specification (15-bit field used for length)

    An address word can: load a DAR (data address register); change
    its value by indexing or indirection; or cause it to be stored in
    memory.

    A unit selection word loads the PAR (program address register) of
    the unit with the next program word address (i.e. the address
    immediately following the address of the unit selection word).

    Postulated example (complete instructions are labeled):

        A:  branch SIMU to address D                    => FORK
            empty word used for queueing link word for SIMU?
        B:  unit select arithmetic, 1 call to proceed
            empty word used for call count and queueing link word
            address location of operand 1
            address location of operand 2
            address location of result
            operation add
        C:  branch unconditional to address E
        D:  unit select drum
            empty word
            address source address
            address destination address
            operation transfer drum to memory, N number of words
        E:  unit select translate, 2 calls to proceed   => JOIN
            empty word
            address source address
            address destination address
            operation block move, N number of words to transfer

    Thus the data flow is:

                  A
             +----+----+
        fork V         V
             BC        D
             +---+ +---+
                 V V      join
                  E

Spaces and Addressing

    24-bit word, 4-bit BCD digits, 6-bit characters

    Working and control storage

        The first 128 locations contain sets of registers for each
        unit.  Each set has:
            PAR (program address register) that also contains the
                status of the unit (busy/available)
            unit queue head pointer
            unit queue tail pointer
            one or more DAR's (data address registers) depending on
                unit

    Memory name-space and hierarchy

        32K words, first 128 locations reserved as special registers.
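[The postulated data flow above (A forks the BC and D sequences; E requires 2 "calls" before proceeding, i.e. joins them) can be sketched with threads; the unit names and operations below come from the postulated example, and a thread join stands in for the call-counting unit selection.]

```python
# Sketch of the Gamma 60's postulated fork/join data flow.
import threading

log = []

def unit_bc():
    log.append("BC: add")             # arithmetic unit sequence

def unit_d():
    log.append("D: drum transfer")    # drum unit sequence

# A: SIMU branch => fork the two program sequences.
t1 = threading.Thread(target=unit_bc)
t2 = threading.Thread(target=unit_d)
t1.start()
t2.start()

# E: unit selection with "2 calls to proceed" => join on both sequences.
t1.join()
t2.join()
log.append("E: block move")
print(log[-1])
```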
Address calculation and addressing modes

    Provision for relative and indirect addressing, 4 index registers
    in arithmetic unit, autoindexing by incrementing or decrementing
    depending on operation.

Address mapping

    None

Data Formats

    Character - 6-bit code similar to ASCII

    Logical - 24-bit binary word

    Fixed-point

        apparently: 24-bit binary word (for machinehood - addresses
        and loop counts)

        BCD fixed-point - 10 digits with decimal point location
        specified

            +-+-------+--------------+-------------------+
            |s|pt.loc.|  number ...  |    (2nd word)     |
            +-+-------+--------------+-------------------+
            # bits:       1 / 7 / 16 / 24
            # bcd digits:     2 / 4 / 6

    Floating-point

        BCD digits, similar to fixed point, exponent in point
        location field (0-79 with bias of 40), double precision
        possible with 11-19 digits in four words (point location
        field of third word is apparently used as length).

Operations

    Data handling - translate, edit, block transfer

    Logic

    Fixed-point - includes loop counts in binary

    Floating-point - 13 operations for arithmetic unit

Sequencing

    Normal sequencing is provided by one PAR for each processing unit
    and a queue of waiting processes (i.e. addresses of instruction
    streams).  The supervisory control unit advances the PAR's and
    moves the process (i.e. instruction address stream) to the queue
    of the appropriate processing unit whenever a unit selection word
    is encountered.

    The SIMU branch instruction causes the current program sequence to
    be queued on the supervisory control unit (in much the same manner
    as queueing is done for the processing units) for later execution.
    The supervisory control unit then starts the execution of a new
    program sequence at the branch address.  This can occur several
    times, so that an indefinite number of parallel processes may be
    initiated.  The supervisory control unit will perform unit
    selection (and if the unit is free, then the DAR
    loading/modification and command initiation) for each created
    process in turn.
    Decision - conditional branch on unit status

    Iteration - apparently some form of automatic loop counting

    Delegation

        Shared subroutines are treated as virtual units to provide
        mutual exclusion.  A subroutine call then appears as a unit
        selection word to a virtual unit number, and queueing is used
        just as in the case of physical units.

Supervision

    Control switching

        On errors, the currently executing program is queued (at
        front?), and a diagnosis program is automatically initiated.

    Integrity - no memory protection

    Process interaction, multiprocessing

        The empty word after each unit selection word is used to hold
        the queue links to the waiting unit selection words; the queue
        has FIFO handling (apparently the provision for waiting on a
        certain number of calls, i.e. unit selections, means that the
        queue must be completely searched for any previously queued
        calls at the same instruction address).

Input/Output

Implementation Notes

    Each processing unit (including each I/O device) has a control
    unit for interfacing to the busses for instruction requests, data
    requests, data distribution, and data collection.  Each unit
    requests instructions and data over these special busses, which
    are priority arbitrated.

    Apparently, normal operation is: request an instruction over the
    inst.req bus, receive a command at some point on the inst.dist
    bus, request input data over the data.req bus, accept input data
    from the data.dist bus (note that the control unit registers have
    the addresses), perform the operation, request output data
    transfer, send output data over the data.collect bus, and then
    repeat the cycle.

Comments

    As mentioned in section 7.4 of the text, the "granularity" of
    concurrent process specification in the Gamma 60 is too fine.
    Each instruction, rather than each logical unit of work, is
    defined to be a separate process that the machine must initiate
    and coordinate.  This results in lots of overhead.
    The cause of this can be traced to the memory-oriented design, as
    opposed to a traditional CPU-oriented design; everything possible
    was done to use every memory cycle.  Unfortunately, by attempting
    to meet this goal through concurrent operation of multiple
    processing units at such a low level, it appears that many memory
    cycles were wasted due to the supervisory overhead of processes
    switching back and forth between units, or went idle due to
    congestion at critical, bottleneck units.

References

    "France's Gamma 60: A step forward in data processing?"
    Datamation 4, 3 (May-June 1958), 34-35.

    P. Dreyfus, "System design of the Gamma 60," Proc. WJCC 1958,
    LA, CA, pp. 130-133.

    P. Dreyfus, "Programming design features of the Gamma 60
    computer," Proc. EJCC 1959, pp. 174-181.

    M. Bataille, "The Gamma 60: The computer that was ahead of its
    time," Honeywell Computer Journal 5, 3 (1971), 99-105.  [This is
    the best reference; contains refs to French and German papers as
    well as a Univ. Mich. report by Dreyfus.]
-- 
Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: mark@hubcap.clemson.edu    UUCP: gatech!hubcap!mark
rro@handel.colostate.edu. (Rod Oldehoeft) (11/19/88)
In article <6410@june.cs.washington.edu> david@uw-june.UUCP (David Callahan) writes:
>
>A different approach used on the HEP machine from Denelcor
>was to multiplex more traditional "instruction streams"
>on one processor.  The processor could support 64 instruction
>streams and one instruction stream could fork off a new instruction
>stream in one instruction (assuming resources available, trap otherwise).
>The HEP then provided Full/Empty bits on each word of memory to support "join"
>operations.
>

Mitch Alsup, 88k person, was a compiler guru at Denelcor; no surprise
that HEP-like features appear.
daveb@geaclib.UUCP (David Collier-Brown) (11/19/88)
From article <3543@hubcap.UUCP>, by mark@hubcap.UUCP (Mark Smotherman):
> An early machine, the Gamma 60, implemented fork and join.  A brief
> sketch of the machine follows:
...
>     Compagnie des Machines Bull of France announced the Gamma 60 in
>     1958.  Twenty systems were built.  The logical designers were
>     Pierre Chanus, Jean Bosset, and J.P. Cottet (the ones credited in
>     the Datamation article).

  Oooohhh!  Is my face red.

-dave (formerly of Honeywell, now Honeywell-Bull) c-b
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |        --Joyce C-B
richard@aiva.ed.ac.uk (Richard Tobin) (11/22/88)
In article <8075@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Does anybody know what AI support would be in an FPC?

Probably does fuzzy arithmetic :-)

-- Richard
-- 
Richard Tobin,                    JANET: R.Tobin@uk.ac.ed
AI Applications Institute,         ARPA: R.Tobin%uk.ac.ed@nss.cs.ucl.ac.uk
Edinburgh University.              UUCP: ...!ukc!ed.ac.uk!R.Tobin
"And to think he was once a famous professor at Yale University!"
"That's life, Robin."