[comp.arch] Lisp "future" instruction in 88k hardware.

daveb@geaclib.UUCP (David Collier-Brown) (11/05/88)

From article <7472@winchester.mips.COM>, by cprice@mips.COM (Charlie Price):
> From the architecture spec:
> 
>   Multiply and divide operations are performed by a separate,
>   autonomous execution unit.  After a multiply or divide operation
>   is started, execution of other instructions may continue in parallel.
>   The multiply/divide unit continues to operate during cache miss and
>   other delaying cycles in which no instructions are executed.
> 
>   The number of cycles required for multiply/divide operations is
>   implementation-dependent.  The MFHI and MFLO instructions are
>   interlocked so that any attempt to read them before operations
>   have completed will cause execution of instructions to be delayed
>   until the operation finishes.

  Some of you may recognize this set of behavior as substantially
identical to the "future" function in experimental lisp(s):
  Future takes a computation and returns a function which, when
called later, will do one of:
	1) return the result immediately, because it has been computed
	   concurrently and the computation has completed, or
	2) block until the concurrent computation is finished, or
	3) compute the result itself, if it could not be run concurrently.
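  The three cases can be sketched in a few lines of Python (a
hypothetical illustration, not any actual Lisp implementation;
the class and method names are made up here):

```python
import threading

class Future:
    """Sketch of a Lisp-style future.  force() either returns the
    already-computed result, blocks until the concurrent worker
    finishes, or (if no worker was spawned) computes it on demand."""

    def __init__(self, thunk, spawn=True):
        self._thunk = thunk
        self._result = None
        self._done = threading.Event()
        self._thread = None
        if spawn:                        # cases 1/2: run concurrently
            self._thread = threading.Thread(target=self._run)
            self._thread.start()

    def _run(self):
        self._result = self._thunk()
        self._done.set()

    def force(self):
        if self._thread is None:         # case 3: compute it ourselves
            if not self._done.is_set():
                self._result = self._thunk()
                self._done.set()
            return self._result
        self._done.wait()                # case 1 returns at once; case 2 blocks
        return self._result
```

Calling force() on a finished future is exactly the interlocked
MFHI/MFLO read: it returns immediately if the "unit" is done, and
stalls otherwise.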

  The significance of this should be apparent: here is a RISC
instruction which can be decomposed into two, independently useful
instructions:
	1) floating accumulator move (of course)
	2) wait for concurrent computation to complete...

  If the manufacturers see fit to write a generalized MFHI/MFLO
instruction, they have a compiler-visible "hook" for not just
special-purpose coprocessors, but general-purpose parallelism.
  Shall we call it MAHI/MALO? Or FUTU?

--dave (them 88k guys is smart, harold) c-b
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |       --Joyce C-B

wunder@hp-sde.SDE.HP.COM (Walter Underwood) (11/08/88)

   >   implementation-dependent.  The MFHI and MFLO instructions are
   >   interlocked so that any attempt to read them before operations
   >   have completed will cause execution of instructions to be delayed
   >   until the operation finishes.

     Some of you may recognize this set of behavior as substantially
   identical to the "future" function in experimental lisp(s):

   ...

     The significance of this should be apparent: here is a RISC
   instruction which can be decomposed into two, independently useful
   instructions:
	   1) floating accumulator move (of course)
	   2) wait for concurrent computation to complete...

As an ALU implementation, this technique dates from the CDC 6600.  In
that machine, even loads from memory were done that way -- stuff the
address of the desired location in a register, and some time later,
the data shows up in a corresponding register.  Machines were
outfitted with a few add and multiply units, so that several
operations could be overlapped.  With careful coding, you could
get a lot of computation done in a short time.

No surprise that DEC advertised the VAX-11/780 as "half the speed of a
CDC 6600 for floating point" at a time when the 6600 was a 15 year old
machine (1978).

You can find a nice description of the CDC 6600 in Bell and Newell,
"Computer Structures: Readings and Examples".  The article is probably
in the revised version of that book (Siewiorek, Bell, and Newell) as
well, but I only have the original.

wunder

cjosta@taux01.UUCP (Jonathan Sweedler) (11/08/88)

In article <3385@geaclib.UUCP> daveb@geaclib.UUCP (David Collier-Brown) writes:
|From article <7472@winchester.mips.COM>, by cprice@mips.COM (Charlie Price):
|| From the architecture spec:
|| 
||   After a multiply or divide operation
||   is started, execution of other instructions may continue in parallel.
||   The MFHI and MFLO instructions are
||   interlocked so that any attempt to read them before operations
||   have completed will cause execution of instructions to be delayed
||   until the operation finishes.
|
|  Some of you may recognize this set of behavior as substantially
|identical to the "future" function in experimental lisp(s):
|  Future takes a computation and returns a function which, when
|called later will do one of:
|	1) return the result immediately, because it has been computed
|	   concurrently and the computation has completed, or
|	2) block until the concurrent computation is finished, or
|	3) compute it itself if it could do it concurrently.
|
|  The significance of this should be apparent: here is a RISC
|instruction which can be decomposed into two, independently useful
|instructions:
|	1) floating accumulator move (of course)
|	2) wait for concurrent computation to complete...
|

Isn't this the same principle as scoreboarding in a pipelined
architecture that is used in many processors today?  In such a design,
an instruction is put in the pipe, but if another instruction tries to
read the destination register of the first instruction before the
result is written into the register, then the pipe is stalled.
Am I missing something?
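(The stall rule described above reduces to a tiny bookkeeping
structure; here is a toy Python model, purely illustrative, with
invented names:)

```python
class Scoreboard:
    """Toy model of register scoreboarding: issuing an instruction
    marks its destination register busy; any later instruction that
    reads a busy register must stall until writeback clears it."""

    def __init__(self):
        self.busy = set()

    def issue(self, dest):
        self.busy.add(dest)        # result is now "in flight"

    def writeback(self, dest):
        self.busy.discard(dest)    # result has reached the register

    def must_stall(self, sources):
        return any(r in self.busy for r in sources)
```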
-- 
Jonathan Sweedler  ===  National Semiconductor Israel
UUCP:    ...!{amdahl,hplabs,decwrl}!nsc!taux01!cjosta
Domain:  cjosta@taux01.nsc.com

daveb@geaclib.UUCP (David Collier-Brown) (11/10/88)

From article <917@taux01.UUCP>, by cjosta@taux01.UUCP (Jonathan Sweedler):

| Isn't this the same principle as scoreboarding in a pipelined
| architecture that is used in many processors today?  In such a design,
| an instruction is put in the pipe, but if another instruction tries to
| read the destination register of the first instruction before the
| result is written into the register, then the pipe is stalled.

   Yup, that's almost exactly what the lispians were generalizing
from.  It's finally become available on silicon that doesn't belong
to a $upercomputer, which tends to imply that it could become
even more "generally" useful.

--dave
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |       --Joyce C-B

mash@mips.COM (John Mashey) (11/12/88)

Speaking of Lisp support, a fairly recent Cypress databook has some pages
for the CY7C608 Floating Point Controller.  Amongst the features listed is:

Artificial intelligence support

Does anybody know what AI support would be in an FPC?
(or is this a leftover typo from another datasheet).
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

daveb@geaclib.UUCP (David Collier-Brown) (11/12/88)

  I should have included a disclaimer in my response to Jonathan
Sweedler (in <917@taux01.UUCP>) by saying that it is the same
**principle** as scoreboarding within the ALU/pipeline, but quite a
different application of the principle.

  The principle is "start it, and only wait for it if you have to",
but in this case the application is to large-scale, possibly
programmer-visible coprocessors and parallel processors.  I propose
not that one make a scoreboarding instruction available (ie, a
special case), but instead a general case, which might take the form

	future function,returnVector2
	...
	...
	await r4,returnVector2

  which could be done by a fairly simple trap scheme, which would
execute the function locally iff no coprocessor and return vector
were available, and would send it off to a coprocessor otherwise.
This would cost some operating system overhead (meaning that it
would not be a real RISC instruction, but extracode instead).  The
await (which I called FUTU in a previous article) would be a real
RISC instruction, though, and one very straightforward to implement:
move from memory area to register iff the memory access controller
permits.
  This implies, of course, that you are prepared to live with a
small number of coprocessors, or a rather complex memory access
controller.
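  (The dispatch-or-trap behavior of the proposed pair can be
sketched in Python; this is a speculative model of the proposal,
and all of the names here are invented for illustration:)

```python
import threading

class CoprocessorPool:
    """Sketch of the proposed future/await pair: FUTURE dispatches
    a function to a coprocessor if one is free, otherwise "traps to
    extracode" and runs it inline; AWAIT blocks only if the result
    has not yet arrived in the return vector."""

    def __init__(self, n_coprocessors):
        self._free = threading.Semaphore(n_coprocessors)

    def future(self, fn, *args):
        vector = {"done": threading.Event(), "value": None}
        if self._free.acquire(blocking=False):   # a coprocessor is free
            def run():
                vector["value"] = fn(*args)
                vector["done"].set()
                self._free.release()
            threading.Thread(target=run).start()
        else:                                    # trap: compute inline
            vector["value"] = fn(*args)
            vector["done"].set()
        return vector

    @staticmethod
    def await_(vector):
        vector["done"].wait()    # stall iff result not yet written
        return vector["value"]
```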

  Anyone want to comment on the difficulty of achieving this with
current memory management techniques?
	
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |       --Joyce C-B

david@june.cs.washington.edu (David Callahan) (11/14/88)

In article <3410@geaclib.UUCP> daveb@geaclib.UUCP (David Collier-Brown) writes:
>
>  The principle is "start it, and only wait for it if you have to",

>programmer-visible coprocessors and parallel processors.  I propose
>not that one make a scoreboarding instruction available (ie, a
>special case), but instead a general case, which might take the form
>
>	future function,returnVector2
>	...
>	...
>	await r4,returnVector2
>
This is similar to a mechanism Arvind at MIT has been talking
about for executing dataflow programs on a RISC-ish processor.
He inverts the order of operations:

	join  r1		! wait for incoming token
	add r1 =  r1 + r2 
	fork r3			! generate extra token

Which is sort of what the coprocessor would see in your example.
Arvind's proposed dataflow processor (Monsoon) can be thought of
as multiplexing a large number of these small RISC instruction
streams on the same processor.

A different approach used on the HEP machine from Denelcor
was to multiplex more traditional "instruction streams" 
on one processor. The processor could support 64 instruction
streams and one instruction stream could fork off a new instruction
stream in one instruction (assuming resources available, trap otherwise).
The HEP then provided Full/Empty bits on each word of memory to support "join" 
operations.
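(A HEP-style full/empty bit amounts to a one-word synchronizing
cell; here is a minimal Python sketch under that interpretation,
with made-up method names modeled on the usual read-full-set-empty
and write-empty-set-full semantics:)

```python
import threading

class FEWord:
    """Sketch of a memory word with a full/empty bit, HEP-style:
    a read blocks until the word is full, then marks it empty;
    a write blocks until the word is empty, then marks it full."""

    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value):       # wait-until-empty, then fill
        with self._cv:
            while self._full:
                self._cv.wait()
            self._value, self._full = value, True
            self._cv.notify_all()

    def read_fe(self):               # wait-until-full, then empty
        with self._cv:
            while not self._full:
                self._cv.wait()
            self._full = False
            self._cv.notify_all()
            return self._value
```

A "join" then falls out naturally: the consumer stream simply does
a read_fe on the word the producer stream will write.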

>  Anyone want to comment on the difficulty of achieving this with
>current memory management techniques?
>	
I'm not sure what you mean by memory management exactly.  The
most difficult problem I see is "parallelism" management
to achieve load balancing.

>-- 
> David Collier-Brown.  | yunexus!lethe!dave
> Interleaf Canada Inc. |
> 1550 Enterprise Rd.   | HE's so smart he's dumb.
> Mississauga, Ontario  |       --Joyce C-B

David Callahan, Tera Computer Co.,  Seattle, WA.

mark@hubcap.UUCP (Mark Smotherman) (11/15/88)

In article <6410@june.cs.washington.edu>, david@june.cs.washington.edu (David Callahan) writes:
> In article <3410@geaclib.UUCP> daveb@geaclib.UUCP (David Collier-Brown) writes:
> >
> >  The principle is "start it, and only wait for it if you have to",
> 
> >	future function,returnVector2
> >	...
> >	...
> >	await r4,returnVector2
> >
> 
> 	join  r1		! wait for incoming token
> 	add r1 =  r1 + r2 
> 	fork r3			! generante extra token
> 

An early machine, the Gamma-60, implemented fork and join.  A brief
sketch of the machine follows:




Bull Gamma 60 (1959)


History

Compagnie des Machines Bull of France announced the Gamma 60 in
1958.  Twenty systems were built.  The logical designers were
Pierre Chanus, Jean Bosset, and J.P. Cottet (the ones credited in
the Datamation article).

The machine is composed of multiple processing units with
exclusive processing differentiation (e.g. BCD ALU, binary ALU,
comparison unit, code translation unit) and multiple I/O
processing units (e.g. drum, tape, etc.), each of which has an
identical instruction request and data request interface to a
single supervisory control unit and to main memory.  Each unit is
capable of independent, concurrent operation once it is selected
and initiated by the supervisory control unit.  


Highlights

Noteworthy
 -   Multiprocessing.
 -   Hardware provision for asynchronous parallel processes (i.e.
     FORK and JOIN operations).
 -   Synchronization by queueing for both hardware and software
     resources.
 -   Processing unit status word.

Peculiarities
 -   Separation of comparison and arithmetic operations into
     distinct processing units.
 -   Multiple word, variable length instructions with storage of
     control information in instruction sequence.
 -   Process scheduling (which is apparently FIFO) is wired into
     the supervisory control unit.


Instruction Formats

A complete instruction is composed of a series of 24-bit words. 
There may be 1-4 addresses depending on the processing unit.  

Word types are:
     A - address (includes opcode and 15-bit address field)
     B - branch (one appears to be SIMU - i.e. FORK), (15-bit
          address field)
     C - unit selection (misleadingly called an interrupt),
          (15-bit field used for unit number, another field
          contains number of calls required to this unit before
          processing can begin - i.e. used as JOIN)
     D - operation specification (15-bit field used for length)
An address word can: load a DAR (data address register); change
its value by indexing or indirection; or, cause it to be stored
in memory.  A unit selection word loads the PAR (program address
register) of the unit with the next program word address (i.e. the
address immediately following the address of the unit selection
word).

postulated example (complete instructions are labeled):

 A:  branch         SIMU to address D => FORK
     empty word     used for queueing link word for SIMU?

 B:  unit select    arithmetic, 1 call to proceed
     empty word     used for call count and queueing link word
     address        location of operand 1
     address        location of operand 2
     address        location of result
     operation      add

 C:  branch         unconditional to address E

 D:  unit select    drum
     empty word
     address        source address
     address        destination address
     operation      transfer drum to memory, N number of words

 E:  unit select    translate, 2 calls to proceed => JOIN
     empty word
     address        source address
     address        destination address
     operation      block move, N number of words to transfer

          thus the data flow is:        A
                                   +----+----+ fork
                                   V         V
                                   BC        D
                                   +---+ +---+ join
                                       V V
                                        E
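The join-by-call-count mechanism in the example above (unit E
waits for 2 calls) can be modeled in a few lines of Python; this
is an illustrative sketch of the counting rule only, with invented
names, not a simulation of the actual supervisory control unit:

```python
class UnitSelect:
    """Sketch of the Gamma 60 join-by-call-count: a unit-selection
    word records the number of calls (arriving program sequences)
    required before the selected unit may begin processing."""

    def __init__(self, calls_required):
        self.calls_required = calls_required
        self.calls_seen = 0

    def call(self):
        """One program sequence reaches this selection word.
        Returns True once enough sequences have joined."""
        self.calls_seen += 1
        return self.calls_seen >= self.calls_required
```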


Spaces and Addressing

24-bit word, 4-bit BCD digits, 6-bit characters

Working and control storage
The first 128 locations contain sets of registers for each unit.
Each set has:
     PAR (program address register) that also contains the status
          of the unit (busy/available)
     unit queue head pointer
     unit queue tail pointer
     one or more DAR's (data address registers) depending on unit

Memory name-space and hierarchy
32K words, first 128 locations reserved as special registers.

Address calculation and addressing modes
Provision for relative and indirect addressing, 4 index registers
in arithmetic unit, autoindexing by incrementing or decrementing
depending on operation.

Address mapping
None


Data Formats

Character - 6-bit code similar to ASCII

Logical - 24-bit binary word

Fixed-point
apparently: 24-bit binary word (for machinehood - addresses and
loop counts)

BCD fixed-point - 10 digits with decimal point location specified
               +-+-------+--------------+-------------------+
               |s|pt.loc.|   number    ...   (2nd word)     |
               +-+-------+--------------+-------------------+
  # bits:       1/  7    /         16   /         24
  # bcd digits:     2    /         4    /         6

Floating-point
BCD digits, similar to fixed point, exponent in point location
field (0-79 with bias of 40), double precision possible with 11-
19 digits in four words (point location field of third word is
apparently used as length).


Operations

Data handling - translate, edit, block transfer
Logic
Fixed-point - includes loop counts in binary
Floating-point - 13 operations for arithmetic unit


Sequencing

Normal sequencing is provided by one PAR for each processing unit
and a queue of waiting processes (i.e. addresses of instruction
streams).  The supervisory control unit advances the PAR's and
moves the process (i.e. instruction address stream) to the queue
of the appropriate processing unit whenever a unit selection word
is encountered.

The SIMU branch instruction causes the current program sequence
to be queued on the supervisory control unit (in much the same
manner as queueing is done for the processing units) for later
execution.  The supervisory control unit then starts the
execution of a new program sequence at the branch address.  This
can occur several times, so that an indefinite number of parallel
processes may be initiated.  The supervisory control unit will
perform unit selection (and if the unit is free, then the DAR
loading/modification and command initiation) for each created
process in turn.

Decision - conditional branch on unit status

Iteration - apparently some form of automatic loop counting

Delegation
Shared subroutines are treated as virtual units to provide mutual
exclusion.  A subroutine call then appears as a unit selection
word to a virtual unit number, and queueing is used just as in
the case of physical units. 


Supervision

Control switching
On errors, the currently executing program is queued (at front?),
and a diagnosis program is automatically initiated.

Integrity - no memory protection

Process interaction, multiprocessing
The empty word after each unit selection word is used to hold the
queue links to the waiting unit selection words; the queue has
FIFO handling (apparently the provision for waiting on a certain
number of calls, i.e. unit selections, means that the queue must
be completely searched for any previously queued calls at the
same instruction address).


Input/Output


Implementation Notes

Each processing unit (including each I/O device) has a control
unit for interfacing to the busses for instruction requests, data
requests, data distribution, and data collection.  Each unit
requests instructions and data over these special busses, which
are priority arbitrated.  Apparently, normal operation is:
request an instruction over the inst.req bus, receive a command
at some point on the inst.dist bus, request input data over the
data.req bus, accept input data from the data.dist bus (note that
the control unit registers have the addresses), perform the
operation, request output data transfer, send output data over
the data.collect bus, and then repeat the cycle.


Comments

As mentioned in section 7.4 of the text, the "granularity" of
concurrent process specification in the Gamma 60 is too fine. 
Each instruction, rather than each logical unit of work, is
defined to be a separate process that the machine must initiate
and coordinate.  This results in lots of overhead.  The cause of
this can be traced to the memory-oriented design, as opposed to a
traditional CPU-oriented design; everything possible was done to
use every memory cycle.  Unfortunately, by attempting to meet
this goal by using concurrent operation of multiple processing
units at such a low level, it appears that many memory cycles
were wasted due to the supervisory overhead of processes
switching back and forth between units or went idle due to
congestion at critical, bottleneck units.


References

"France's Gamma 60: A step forward in data processing?"
Datamation 4, 3 (May-June 1958), 34-35.

P. Dreyfus, "System design of the Gamma 60," Proc. WJCC 1958, LA,
CA, pp. 130-133.

P. Dreyfus, "Programming design features of the Gamma 60
computer," Proc. EJCC 1959, pp. 174-181.

M. Bataille, "The Gamma 60: The computer that was ahead of its
time," Honeywell Computer Journal 5, 3 (1971) 99-105.
     [This is the best reference, contains refs to French and
     German papers as well as a Univ. Mich. report by Dreyfus]
-- 
Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: mark@hubcap.clemson.edu    UUCP: gatech!hubcap!mark

rro@handel.colostate.edu. (Rod Oldehoeft) (11/19/88)

In article <6410@june.cs.washington.edu> david@uw-june.UUCP (David Callahan) writes:
>
>A different approach used on the HEP machine from Denelcor
>was to multiplex more traditional "instruction streams" 
>on one processor. The processor could support 64 instruction
>streams and one instruction stream could fork off a new instruction
>stream in one instruction (assuming resources available, trap otherwise).
>The HEP then provided Full/Empty bits on each word of memory to support "join" 
>operations.
>
Mitch Alsup, 88k person, was a compiler guru at Denelcor;
no surprise that HEP-like features appear.

daveb@geaclib.UUCP (David Collier-Brown) (11/19/88)

From article <3543@hubcap.UUCP>, by mark@hubcap.UUCP (Mark Smotherman):
> An early machine, the Gamma-60, implemented fork and join.  A brief
> sketch of the machine follows:
...
> Compagnie des Machines Bull of France announced the Gamma 60 in
> 1958.  Twenty systems were built.  The logical designers were
> Pierre Chanus, Jean Bosset, and J.P. Cottet (the ones credited in
> the Datamation article).

  Oooohhh! Is my face red.

-dave (formerly of Honeywell, now Honeywell-Bull) c-b
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | HE's so smart he's dumb.
 Mississauga, Ontario  |       --Joyce C-B

richard@aiva.ed.ac.uk (Richard Tobin) (11/22/88)

In article <8075@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Does anybody know what AI support would be in an FPC?

Probably does fuzzy arithmetic :-)

-- Richard
-- 
Richard Tobin,                         JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,             ARPA:  R.Tobin%uk.ac.ed@nss.cs.ucl.ac.uk
Edinburgh University.                  UUCP:  ...!ukc!ed.ac.uk!R.Tobin

"And to think he was once a famous professor at Yale University!"
"That's life, Robin."