[comp.lang.apl] APL Machines

raulmill@usc.edu (Raul Deluth Rockwell) (09/04/89)

I am interested in the idea of an APL machine.  It is my understanding
that some machines have been built along these lines, but their
performance has been degraded with the overhead of checking type and
rank on generic variables.  I think that this difficulty could be
overcome with some attention to this in hardware.

What I am currently working on is "what is a good intermediate form
for expressing apl".  In other words, APL is a generalized array
language, is there a vector language which can efficiently implement
APL?  (It seems to me that some intermediate form would be required
for expressions, to reduce the cost of a number of operations, such as
optimization, idiom recognition and execution).

By the way, although I say APL, I mean something like APL0 (no state
variables, such as []IO or []CT or []PP, nested arrays present, etc.)
I am also thinking of including some sort of hierarchical storage
structure, to give named storage (variables/functions) "operating
system functionality" to the language (no--I haven't considered all
the implications of this yet).

Any comments or ideas??  (on APL as non-Von Neumann (sp?) hardware or
as an operating system.)

--
Raul Rockwell                                      !
INTERNET:   raulmill@usc.edu                       !
UUCP:       ...uunet!usc!raulmill                  !  55 mph = 82 nc
U.S.SNAIL:  721 E Windsor #4,  GLENDALE CA  91205  !

daniel@ziebmef.mef.org (Daniel Albano) (09/06/89)

There was an APL machine (by that name) produced by a corporation
somewhere on the east coast of the US.  It was based on 68000 
processors, back when that was one of the new, high speed CPU
chips.  I saw it at a computer show in Toronto, and it seemed 
quite capable.

As for performance, type checking and rank checking should not
be a major concern, as you should be able to do it once for
each reference to a symbol.  If the symbols are a decent size,
the actual overhead wrt. computation should be insignificant 
(or rather, if the data pointed to by the symbols are a decent
size).
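
To make that concrete, here is a rough C sketch of the kind of thing I
have in mind -- the type and rank checks run once per operand, and only
the element loop grows with the data.  (The names and layout here are
invented for illustration, not taken from any real interpreter.)

    #include <stddef.h>

    typedef enum { T_BOOL, T_INT, T_FLOAT, T_CHAR } atype;

    typedef struct {
        atype   type;       /* element type             */
        int     rank;       /* number of axes           */
        size_t  count;      /* total number of elements */
        size_t *shape;      /* length == rank           */
        void   *data;       /* the ravel of the array   */
    } array;

    int add_float(const array *l, const array *r, array *z)
    {
        /* one-time checks, independent of element count */
        if (l->type != T_FLOAT || r->type != T_FLOAT) return -1; /* DOMAIN */
        if (l->rank != r->rank)                       return -2; /* RANK   */
        for (int i = 0; i < l->rank; i++)
            if (l->shape[i] != r->shape[i])           return -3; /* LENGTH */

        const double *a = l->data, *b = r->data;
        double *c = z->data;
        for (size_t i = 0; i < l->count; i++)   /* only this grows with data */
            c[i] = a[i] + b[i];
        return 0;
    }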

I suspect that there is no advantage in embedding APL arrays in
an internal vector representation.  This is just a suspicion, but
if the data are really 3-arrays, why not store them as such (which
are really vectors anyway, but why confuse ourselves! :-) ).

If it is to be a real APL, I don't think you can omit []CT and
[]IO, though you could make a case for omitting []PP as it
"only" affects the display of values, not the values calculated
themselves.  As for nested arrays, I have never found a particularly
compelling need for them, but I am sure opinions differ.

Daniel Albano
Toronto, Ontario

raulmill@usc.edu (Raul Deluth Rockwell) (09/10/89)

In article <1989Sep5.153557.9841@ziebmef.mef.org> daniel@ziebmef.mef.org (Daniel Albano) writes:

   There was an APL machine (by that name) produced by a corporation
   somewhere on the east coast of the US.  It was based on 68000 
   processors ...

I suppose I wasn't very explicit on this point.  By APL machine, I
meant a machine whose CPU architecture was oriented towards APL data
structures and operations.  I didn't mean to imply that I was talking
about machines which 'merely' run APL.

Also, note that there is no need for APL to be the only language that
runs on the machine.  The important point is that the machine
architecture is an optimal one for APL.

   As for performance, type checking and rank checking should not be a
   major concern ... (... if the data pointed to by the symbols are a
   decent size).

That is exactly the point.  If your algorithm must use small data
structures with any frequency, you lose.  I do not think that you can
make the assumption that small data structures are unimportant.
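
To put rough numbers on "you lose", here is a toy cost model in C (the
constants are pure guesses, just to show the shape of the problem):

    #include <stdio.h>

    int main(void)
    {
        double setup    = 50.0;  /* cycles per primitive: checks, dispatch,
                                    allocation (assumed)                    */
        double per_elem =  1.0;  /* cycles per element (assumed)            */

        for (int n = 1; n <= 1024; n *= 4) {
            double total = setup + per_elem * n;
            printf("n=%5d  overhead fraction = %.0f%%\n",
                   n, 100.0 * setup / total);
        }
        return 0;
    }

    /* With these made-up constants a scalar operation is ~98% overhead,
       and you need on the order of a thousand elements before the
       overhead drops below 5%. */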

   I suspect that there is no advantage in embedding APL arrays in an
   internal vector representation.  ... if the data are really
   3-arrays, why not store them as such (which are really vectors
   anyway, but why confuse ourselves! :-) ).

Now I'm confused 8*).

Seriously, the advantage of vectors over arrays is that they require
less complexity in the machine that implements them. I don't think
that I could design a generalized array language that didn't HEAVILY
rely on the structure of a vector.  Wouldn't you gain something by
considering this in a design?
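
For what it's worth, here is the sort of thing I mean, sketched in C (the
names are mine): an array of any rank reduces to a flat ravel plus a shape
vector, and all the indexing is ordinary vector arithmetic.

    #include <stddef.h>

    /* position of element [idx[0];idx[1];...;idx[rank-1]] in the ravel,
       assuming row-major order */
    size_t ravel_index(int rank, const size_t *shape, const size_t *idx)
    {
        size_t p = 0;
        for (int i = 0; i < rank; i++)
            p = p * shape[i] + idx[i];
        return p;
    }

    /* e.g. for a 2 3 4 array, element [1;2;3] lands at ravel position
       ((1*3)+2)*4+3 = 23, the last element of the 24-element ravel. */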

   If it is to be a real APL, I don't think you can omit []CT and
   []IO ...

Not in the interpreter, sure.  But that doesn't mean that you couldn't
implement the interpreter in an APL-like language which doesn't have
these.  For example, you could implement = as:

∇ RESULT ← LEFT EQUAL RIGHT
  RESULT ← (LEFT ⌈ RIGHT) < (LEFT ⌊ RIGHT) × 1 + ⎕CT ∇

Just in case:
∇ del    ← assignment    ⌈ ceiling    × times    ⌊ floor    ⎕CT quad-CT

I appreciate the comments, but this is getting a bit off of the track
of the original article.  Can I redirect the track onto the idea of an
efficient implementation of APL in terms of machine hardware?
--

jaxon@uicsrd.csrd.uiuc.edu (09/11/89)

> There was an APL machine (by that name) produced by a corporation
> somewhere on the east coast of the US.  It was based on 68000 
> processors,...

You're thinking of Analogic Corporation's 68000/CATscanner hybrid.  The
68000s were just control processors for the interpreter, most primitives
were microcoded on a 12-bit x 8 element Array Processor which had been
scavenged from one of Analogic's medical imaging instruments.  I believe
it's still on the market, although no plans exist to upgrade it.  The
interpreter and user interface are excellent.   The CAT scanner part is
a merciless number cruncher, so fast that you won't notice the effects
of long vector lengths until well past 1000 elements.  (i.e. it is several
hundred times faster than a 68000).  

There is a lesson in the APL Machine design, though.  The array processor
is not enough!  Despite the implementer's heavy use of the AP (e.g. linear
search on the AP always outperformed hash table lookup on the 68000),  the
68000 was a constant damper on program speed.  Memory management of small
vectors and scalars is not an especially parallelizable aspect of APL. 
The uniprocessors responsible for serial sections of the interpreter, and
whatever features are used to synchronize the parallel sections are absolutely
critical elements in an APL supercomputer.
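
The arithmetic behind "absolutely critical" is just Amdahl's law: with a
serial fraction s, speedup on N parallel elements is bounded by
1/(s + (1-s)/N).  A throwaway C illustration (the fractions are invented):

    #include <stdio.h>

    int main(void)
    {
        double N = 8.0;                         /* e.g. an 8-element AP */
        double serial[] = { 0.05, 0.20, 0.50 }; /* assumed serial shares */

        for (int i = 0; i < 3; i++) {
            double s = serial[i];
            printf("serial fraction %.0f%% -> speedup limit %.2fx\n",
                   100.0 * s, 1.0 / (s + (1.0 - s) / N));
        }
        return 0;
    }

    /* Even at 20% serial work (interpreter overhead, memory management
       of small objects), an 8-way array processor buys barely 3x. */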

> The "equals" primitive could be written in APL : ...[buggy code omitted]

Several interpreters have tried using "magic functions" to produce new 
primitives from old.   It is never as easy as it looks, and it is NEVER fast -
it's not even tolerably slow.   

1) Your definition is wrong -- you must take absolute values before comparing
   the arguments (a corrected sketch appears below).

2) Once the correct definition is written, you must make it work even when
   the intermediate terms exceed the number system's limits.

3) By now you've got a function that works for simple scalars. To call it
   you'll have to create two scalar APL objects, and a stack frame (that's
   invisible in the caller's ")SI").  You'll have to make a class of APL
   function capable of returning into a primitive algorithm at the correct
   place.  

This does not compare favorably with a single instruction for Tolerant Equals.
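
For concreteness, here is roughly the computation a Tolerant Equals has to
perform -- the usual definition, sketched in C; not any particular
interpreter's code:

    #include <math.h>

    /* a and b are tolerantly equal when their difference is within
       #CT scaled by the larger magnitude -- absolute values first. */
    int tolerant_equal(double a, double b, double ct)
    {
        double mag = fmax(fabs(a), fabs(b));
        return fabs(a - b) <= ct * mag;
    }

    /* Point 2 is about the edges: a - b can overflow when a and b have
       opposite signs near the top of the number system, which is why
       building this out of separate primitives is less safe than it looks. */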

I'm not a great fan of #CT and its consequences, but it is STANDARD and heavily
relied upon, and it is one more language-specific hardware feature that APL
could really use.


> "Dictionary APL"  NOT the "APL2" dialect.

Firstly I'd urge any APL designer to become deeply familiar with BOTH these
language definitions, and to really use BOTH systems.  The dictionary approach
("function rank") is really a wonderful perfection of the original APL array
processing ideas.  I suspect it is more efficient to implement, because it is
a little less powerful than the equivalent features in APL2.  In "Dictionary"
APL, operators are much more powerful.  If operators can really manipulate
the functions (e.g. pipelining them, carrying temporary results in registers,
etc.)  then I'd say the dictionary approach is best suited to today's 
vector supercomputers.   

But "function rank" seems tied to homogeneous arrays (am I wrong here?)

There are real limits on programmers' ability to foresee what a function
expression will do.  In the APL2 approach, all the data decompositions are
explicitly written out, you can enter the subexpressions and watch what's
happening to your data.  You can also do all kinds of unorthodox decompositions
of your data, which stand no chance of being vectorizable.

And "vectorizing" is not the only hope for APL anyway!  Multiple instruction
Multiple Data parallel machines are growing in number and power, these don't
require that "one function" be in control, that "two arguments" be in memory
and that "one type" of result is expected.  I think the APL2 approach will
provide very rich ground for parallel machine designers.

Thanks in advance for any replies!

greg jaxon -- jaxon@uicsrd.csrd.uiuc.edu
Univ. of Ill. Center for Supercomputing R&D

raulmill@usc.edu (Raul Deluth Rockwell) (09/13/89)

In article <49700014@uicsrd.csrd.uiuc.edu> jaxon@uicsrd.csrd.uiuc.edu writes:

   There is a lesson in the APL Machine design, though.  The array
   processor is not enough!  . . . The uniprocessors responsible for
   serial sections of the interpreter, and whatever features are used
   to synchronize the parallel sections are absolutely critical
   elements in an APL supercomputer.

Anybody have any comments on how these should be implemented (i.e.
deviations from commercial general-purpose CPU design)?  Has anyone
done any work to analyze the most common (and/or least used)
operations in the serial/scalar/small vector portions of the machine?

   > The "equals" primitive could be written in APL : ...[buggy code
   > omitted]

   Several interpreters have tried using "magic functions" to produce
   new primitives from old.  It is never as easy as it looks, and it
   is NEVER fast - it's not even tolerably slow.

The point was not to write APL in APL, but to write APL in a generic
vector language.  However, let me say that you are probably right:  it
is too easy to implement []CT at the hardware level (especially if
[]CT is limited to negative-integer-powers of two) to justify leaving
it out of a design.
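
For example, if []CT is pinned to a negative-integer power of two, the
scaling multiply collapses to an exponent adjustment, which is about as
cheap as hardware gets.  A C sketch of the idea (k is my own name for the
exponent, nothing standard):

    #include <math.h>

    /* []CT = 2^-k: the scaling is just an exponent shift (ldexp),
       no general multiply needed. */
    int tolerant_equal_pow2(double a, double b, int k)
    {
        double mag = fmax(fabs(a), fabs(b));
        return fabs(a - b) <= ldexp(mag, -k);
    }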

   1) Your definition is wrong -- you must take absolute values before
      comparing the arguments.

I don't see that my definition was unacceptably wrong.  The deviation from
the ideal test is on the order of []CT squared (my version scales the
tolerance by the smaller of the two arguments rather than the larger), and
for small values of []CT this is trivial.  (I am assuming that I am allowed
to limit []CT to small values.)

   2) Once the correct definition is written, you must make it work
      even when the intermediate terms exceed the number system's
      limits.

Is good point.

   3) By now you've got a function that works for simple scalars. To
      call it you'll have to create two scalar APL objects, and a
      stack frame (that's invisible in the caller's ")SI").  You'll
      have to make a class of APL function capable of returning into a
      primitive algorithm at the correct place.

I'm not sure I'm not taking this level of complexity for granted in my
earlier posting.  I mean, if I do implement APL in some vector language,
won't I have to deal with this complexity for most of the primitives?
Or, to rephrase:  what level of complexity should be in hardware, what
in software (feel free to elaborate on what you are expecting in
hardware).

   . . .

   The dictionary approach ("function rank") is really a wonderful
   perfection of the original APL array processing ideas.  . . . If
   operators can really manipulate the functions (e.g. pipelining
   them, carrying temporary results in registers, etc.) then I'd say
   the dictionary approach is best suited to today's vector
   supercomputers.

   But "function rank" seems tied to homogeneous arrays (am I wrong
   here?)

Function rank is tied to what the function can handle.  If the
primitive function can handle heterogeneous arrays, then function rank
is applicable to heterogeneous arrays.  Since we are talking about
APL machine architecture here, primitive means machine primitive, not
language primitive.  This leads into your idea of MIMD machines.

   There are real limits on programmers' ability to foresee what a
   function expression will do.  . . . You can also do all kinds of
   unorthodox decompositions of your data, which stand no chance of
   being vectorizable.

   And "vectorizing" is not the only hope for APL anyway!  Multiple
   instruction Multiple Data parallel machines are growing in number
   and power, these don't require that "one function" be in control,
   that "two arguments" be in memory and that "one type" of result is
   expected.

You are probably right.  I have been ignoring MIMD because I need to
think more about instruction propagation in this type of machine.
Know of any good references on the general subject?

   I think the APL2 approach will provide very rich ground for
   parallel machine designers.

Make that the APL approach, and I will be happy.  I do not consider
APL2 a complete design at this point.  (I do agree with including most
of APL2's functionality in a design, but there are a number of issues
I still have problems with.)

--
btw, does anyone know any good reference works on Heisenberg's
original approach to quantum physics?
--
Raul Rockwell                                      !
INTERNET:   raulmill@usc.edu                       !
UUCP:       ...uunet!usc!raulmill                  !  55 mph = 82 nc
U.S.SNAIL:  721 E Windsor #4,  GLENDALE CA  91205  !

Pesch@cup.portal.com (Roland Henry Pesch) (09/16/89)

[most of excellent article by Greg Jaxon omitted]

>> "Dictionary APL"  NOT the "APL2" dialect.
>
>Firstly I'd urge any APL designer to become deeply familiar with BOTH these
>language definitions, and to really use BOTH systems.  The dictionary approach
>("function rank") is really a wonderful perfection of the original APL array
>processing ideas.  I suspect it is more efficient to implement, because it is
>a little less powerful than the equivalent features in APL2.  In "Dictionary"
>APL, operators are much more powerful.  If operators can really manipulate
>the functions (e.g. pipelining them, carrying temporary results in registers,
>etc.)  then I'd say the dictionary approach is best suited to today's 
>vector supercomputers.   
>
>But "function rank" seems tied to homogeneous arrays (am I wrong here?)

Wrong (just that last sentence, not the most insightful paragraph I couldn't 
resist quoting first), but not surprisingly: Ken Iverson took a long time 
to add heterogeneous arrays to the Dictionary.  The version published in 
Quote Quad includes them, I think, though with little fanfare.

Moreover, we've included them in SAX (an APL interpreter and associated
programs for the Unix environment, produced by I.P. Sharp Palo Alto in
collaboration with STSC), which is to my knowledge the most complete implemented
subset of Dictionary APL.  (Unfortunately it's also kind of hard to get---
we try to concentrate on OEM sales, for overhead reasons).  But there's
nothing in Dictionary APL which conflicts with mixed arrays; they're just
not essential, as they are in APL2 (in Dictionary APL, it's largely a matter
of extending the domain of a few primitives like catenate ---formally, that
is, internally naturally there's a fair bit of work--- but in APL2 they're
a degenerate case of the enclosed universe, so they're essential).
 
>There are real limits on programmers' ability to forsee what a function
>expression will do.  In the APL2 approach, all the data decompositions are
>explicitly written out, you can enter the subexpressions and watch what's
>happening to your data.  You can also do all kinds of unorthodox decompositions
>of your data, which stand no chance of being vectorizable.

I'm not entirely sure whether that paragraph is supposed to be part of
the text contrasting Dictionary APL and APL2.  If so, I should point out
that although the statement is quite true, Dictionary APL does include
the "box" and "open" primitives (similar to enclose and disclose).  We find
that it's frequently easier to write an algorithm using them in the first
place, while you're groping around trying to understand it; then later,
when you know what you're doing, it's often possible to dispense with box and
open and do something elegant with operators and flat arrays, instead.

[much more great stuff omitted]

>Thanks in advance for any replies!
>
>greg jaxon -- jaxon@uicsrd.csrd.uiuc.edu
>Univ. of Ill. Center for Supercomputing R&D

You're most welcome---and it's very good to hear from you!  I'd lost track
of where you were...

                              /Roland Pesch
                               pesch@pa.reuter.com  (or: pesch@cup.portal.com)

                               SAX Development Group
                               I.P. Sharp Associates
                               425 Sherman Ave., Suite 200
                               Palo Alto, CA   94306

daniel@ziebmef.mef.org (Daniel Albano) (09/19/89)

The question of "hardware optimization" for APL machines 
requires some consideration of how they will be used.  Are
we speaking here of single user or multi-user machines?
Are we looking at multiple user tasks, or single-workspace
machines with synchronous processing (single task/workspace)?
Is the machine going to run "APLish" code (data embedded
in arrays, large "logically parallel" operations); or will 
it be used for "third generation language" processing style
as tends to predominate in COBOL, dBase, and PASCAL applications
(mixed data types stored as "records" or in structures accessed
through index operations)?
 
Choices in how it will be used - and who is going to be 
footing the bill :-)  will largely determine design choices.
Exxon crunching seismic data will have a much different view
of the "optimal machine" than I will; probably by a few million
dollars.
 
I tend to discount the overheads in handling small (several 
element to scalar) data structures, because if that is all
there is, then the application is unlikely to be large, and
there should be lots of spare machine cycles.  When the work
grows, if the application is well designed, so does the size
of your data entities.
 
Making APL hardware could involve implementing a lot of APL
as "machine instructions", but that really means microcode -
programs written in a constrained environment to squeeze as
much as possible out of very elementary operations.  Given the
raw hardware capabilities (single cycle instruction execution,
pipelining, branch lookahead, etc.) the idea of building 
what could be termed an ECISC machine (Exceedingly Complicated
Instruction Set Computing - and I do mean complicated, not
complex) flies in the face of current conventional wisdom
on performance.  Of course, this would not be the first time
that conventional wisdom could be wrong.  The structuring of
the execution environment by the APL interpreter would be an 
advantage here - but it is also an advantage in generating 
highly tuned object code as well - and perhaps an assembler
would be easier to work with than a micro-assembler.  As is 
often the case, a middle ground may work well.  Perhaps a 
more or less conventional CISC (or RISC?) machine with a few
well chosen enhancements to the instruction set to improve
fundamental APL operations would offer advantages without too
much sacrifice of development effort and system complexity.
 
The nature of the enhancements depends very much on what you
consider the key APL level operations to be.  My own list of 
crucial favourites would include reshape, dyadic iota, 
compress, reduce, and generalized inner products (or at least
and.equals).  Operations on Boolean arrays would also be very 
high on the list, as would comparisons, especially those for
integer and character data.  Both base value and representation
are also important from a performance view in that they should
not be overly slow, but good system design can keep their use 
far below that of the previously mentioned operations.  One 
operation implicit in any APL that is crucial is the allocation
of storage for data entities.  Function calling, including the
creation of local symbols for user defined functions is another
thing that can happen very often in a well-structured system.
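 
To pin down one item on that list, here is plain C for compress along a
vector -- nothing clever, just the reference behaviour; the bit-packed and
multi-axis forms are where hardware help would really pay off:

    #include <stddef.h>

    /* z gets the elements of x selected by mask; returns how many */
    size_t compress(size_t n, const unsigned char *mask,
                    const double *x, double *z)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++)
            if (mask[i])
                z[k++] = x[i];
        return k;
    }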
 
On the other hand, I don't really care that much how fast 
the trig functions or domino run, nor that much how fast 
exponentiation and the like are.  Someone who does more 
"calculation" rather than "byte bashing" (commercial/logical
systems) would again have a different set of priorities.  If
I were heavily into graphics in a particular instance I might
care about floating point, but in most systems I have seen,
comparisons and data manipulation far outweigh calculation.
 
The most crucial item in a lot of cases is just workspace size.
An application that can get most or all of its data in a 
workspace and have enough left over for transient storage for
intermediate results will be not only faster, but easier to 
write and to maintain.  I have used workspaces from 32K to 
several megabytes, and would not willingly go back to the 
small ones.  Of course, my home system has a 750K workspace,
so most of my recent code has been very easy - I just don't
seem to have too much data in my life :-) ... of course the 
data compression / storage reduction techniques learned in
those 32K workspaces still do a great job in maximizing the
capabilities of something more than twenty times larger.
 
Personally, I don't think (single user machine) that you have
to provide multiple "control streams" or simultaneous processing
of user level tasks, but some parallelism in the execution of
array operations - or a high speed vector processing scheme -
could provide a major boost in performance.  In many operations
on conformal arrays, the (logical) structure of the data is
not important during the actual computation step.  If you could
generalize (should be easy) for scalar/array operations, then
you have a large part of the problem in hand.  Many other 
operations could be decomposed into a number of vector, or
"quasi-vector" operations.  By quasi-vector, I mean that there
are a number of elements, regularly and repetitively spaced 
through the machine's memory, with (temporarily) irrelevant
elements interspersed. 
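 
In conventional terms a quasi-vector operation is just a strided loop,
something like this C sketch (the names are mine):

    #include <stddef.h>

    /* add n elements of x and y into z, each spaced "stride" apart */
    void strided_add(double *z, const double *x, const double *y,
                     size_t n, size_t stride)
    {
        for (size_t i = 0; i < n; i++)      /* touch every stride-th slot */
            z[i * stride] = x[i * stride] + y[i * stride];
    }

    /* e.g. adding two columns of a 100-by-8 matrix stored in row-major
       order is a strided_add with n = 100 and stride = 8. */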
 
In real world terms, another major performance boost is a true
native APL file system - one that stores the components in their
internal representation.  I did work with one system that only
stored what were essentially character representations of the 
data (two dimensional? memory is mercifully weak on the details)
and a lot of time was spent converting to and from "file format".
This is probably too peripheral for your interests, as it should
be primarily a "software file server" issue, but it does touch
on (perhaps) some details of the hardware/software interface.
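 
At its simplest, storing a component in internal representation is just
writing the descriptor followed by the raw ravel, so nothing is converted
in either direction.  A sketch only -- no real file system implied:

    #include <stdio.h>
    #include <stdint.h>

    /* dump one component: type word, rank, shape vector, then the ravel */
    int write_component(FILE *f, uint32_t type, uint32_t rank,
                        const uint32_t *shape, const void *ravel,
                        size_t ravel_bytes)
    {
        if (fwrite(&type, sizeof type, 1, f) != 1) return -1;
        if (fwrite(&rank, sizeof rank, 1, f) != 1) return -1;
        if (rank && fwrite(shape, sizeof *shape, rank, f) != rank) return -1;
        if (ravel_bytes &&
            fwrite(ravel, 1, ravel_bytes, f) != ravel_bytes) return -1;
        return 0;
    }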
 
The idea of making APL the operating system, and giving it 
complete and direct control over the system had a considerable
appeal.  The benefits could be enormous.
 
 

Speaking of Analogic, are they still around?  Do they still have
any APL products?  And, for that matter, is there a good APL for
Intel (Msdos) machines that does not cost the better part of a 
thousand dollars?

	

Daniel Albano
Toronto, Ontario

raulmill@usc.edu (Raul Deluth Rockwell) (09/20/89)

Quick question (I hope).  Does anyone know of a reasonably systematic
(and fast) way of implementing the gamma and beta functions?
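(The only route I know of is via log-gamma -- a Stirling or Lanczos series --
with beta done in log space to dodge overflow; is there something more
systematic?  A C sketch, assuming the library provides lgamma:)

    #include <math.h>

    /* B(a,b) = Gamma(a)*Gamma(b)/Gamma(a+b), computed via logs so the
       intermediate gammas can't overflow */
    double beta(double a, double b)
    {
        return exp(lgamma(a) + lgamma(b) - lgamma(a + b));
    }

    /* lgamma returns log|Gamma(x)|, which is all you need for x > 0 */
    double gamma_fn(double x)
    {
        return exp(lgamma(x));
    }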
--

raulmill@usc.edu (Raul Deluth Rockwell) (09/22/89)

In article <1989Sep19.111751.7613@ziebmef.mef.org>
daniel@ziebmef.mef.org (Daniel Albano) writes:
;> The question of "hardware optimization" for APL machines 
;> requires some consideration of how they will be used.

Let's see.  Selecting concepts out of your original posting, I am
thinking of a multi-user machine, running "APLish code".  Let's say
something with the capacity of what currently costs around $50,000.
Price should drop for that capability by the time I am anywhere near
completion.

I haven't decided how to deal with resource locking on multi-tasking
conflicts.  (Not that I don't have ideas.  I just haven't convinced
myself that they are that good.)

The basic architecture I am trying to thrash out passes data objects
around with a checklist of what needs to happen to them.  There is a
bandwidth penalty here, but it allows for quite a bit of extensibility
(I hope).

;> . . .  I tend to discount the overheads in handling small (several
;> element to scalar) data structures, because if that is all there
;> is, then the application is unlikely to be large, and there should
;> be lots of spare machine cycles.  When the work grows, if the
;> application is well designed, so does the size of your data
;> entities.

I am not convinced that this is true for all applications.  Also, there
generally need to be several design iterations before an application
is "well designed".   It is true that MANY (probably MOST)
applications can be handled quite nicely using an APLish approach.  

;> Making APL hardware could involve implementing a lot of APL
;> as "machine instructions", but that really means microcode -
;> . . . [much deleted]

I think that you are talking about a conventional von Neumann
architecture (a CPU with a bus leading to memory in which instructions
and data are stored).  I am being a little more "blue sky" than
that.  (I can almost afford to:  I'm a college student 8^)

;> The nature of the enhancements depend very much on what you
;> consider the key APL level operations to be.  My own list of 
;> crucial favourites would include reshape, dyadic iota, 
;> compress, reduce, and generalized inner products (or at least
;> and.equals).

A nice selection.

;> Operations on Boolean arrays would also be very high on the list,
;> as would comparisons, especially those for integer and character
;> data.

Given a fairly decent architecture these SHOULD be fast.  Look at the
occurrence of blitters in recent personal computers.  (fast bit
manipulation does nice things to performance where applicable).

;> . . .  One operation implicit in any APL that is crucial is the
;> allocation of storage for data entities.  . . .

That is why I am trying for an architecture which is based on a
communication model.  There are penalties, but there should be big
advantages.  The question (a big one) is how to organize things.

;> On the other hand, I don't really care that much how fast 
;> the trig functions or domino run, nor that much how fast 
;> exponentiation and the like are.  . . .

I do.  Transcendental functions and domino do nice things to
applications which would otherwise require massive looping.  Could I
get you to agree that base 2 logarithms should be fast?

;> The most crucial item in a lot of cases is just workspace size.

Considering the current state of technology, I have been assuming a
reasonably massive memory resource, though with varying levels of
caching/mass-storage.

;> Personally, I don't think (single user machine) that you have
;> to provide multiple "control streams" or simultaneous processing
;> of user level tasks, but some parallelism in the execution of
;> array operations - or a high speed vector processing scheme -
;> could provide a major boost in performance.

There are two kinds of parallelism I am considering.  (1) User-level
parallelism (coarse-grained) and (2) User-invisible (fine-grained).
(1) is more of a software issue, and (2) is more of a hardware issue.

;> In many operations on conformal arrays, the (logical) structure of
;> the data is not important during the actual computation step.

I don't quite follow you here.  What are you saying?

;> [description of practical vector operation: skipping over parts of
;> memory] . . .

;> In real world terms, another major performance boost is a true
;> native APL file system - one that stores the components in their
;> internal representation.  . . .  The idea of making APL the operating
;> system, and giving it complete and direct control over the system
;> had a considerable appeal.  The benefits could be enormous.

exactly!

I hope I am not being too foggy on details here.  If anyone feels I
should elaborate on any of the points here (assuming anyone is
interested) don't hesitate to tell me.

--
Raul
--