raulmill@usc.edu (Raul Deluth Rockwell) (09/04/89)
I am interested in the idea of an APL machine. It is my understanding that some machines have been built along these lines, but their performance has been degraded by the overhead of checking type and rank on generic variables. I think that this difficulty could be overcome with some attention to it in hardware.

What I am currently working on is "what is a good intermediate form for expressing APL?" In other words, APL is a generalized array language; is there a vector language which can efficiently implement APL? (It seems to me that some intermediate form would be required for expressions, to reduce the cost of a number of operations, such as optimization, idiom recognition and execution.)

By the way, although I say APL, I mean something like APL0 (no state variables, such as []IO or []CT or []PP; nested arrays present, etc.). I am also thinking of including some sort of hierarchical storage structure, to give named storage (variables/functions) "operating system functionality" to the language (no--I haven't considered all the implications of this yet).

Any comments or ideas?? (on APL as non-von Neumann hardware or as an operating system.)
--
Raul Rockwell  !  INTERNET: raulmill@usc.edu  !  UUCP: ...uunet!usc!raulmill  !  55 mph = 82 nc
U.S.SNAIL: 721 E Windsor #4, GLENDALE CA 91205  !
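[To make the "intermediate form" question concrete, here is a minimal sketch in C; the struct and field names are hypothetical and not taken from any actual design. The idea is that every value in the vector-level form carries its own type and shape header, and an expression compiles to a short list of whole-ravel operations.]

    #include <stddef.h>

    /* Hypothetical descriptor for one APL array in a vector-level
       intermediate form.  Every value carries its own type and shape,
       so type/rank checks become a test on a small header rather than
       a per-element cost. */
    enum elem_type { ET_BIT, ET_INT, ET_FLOAT, ET_CHAR, ET_BOXED };

    typedef struct array_desc {
        enum elem_type type;   /* element type                           */
        int            rank;   /* number of axes                         */
        size_t         count;  /* total element count (product of shape) */
        size_t        *shape;  /* shape[rank]                            */
        void          *data;   /* ravel of the array, row-major          */
    } array_desc;

    /* A vector-level instruction: one primitive applied to whole ravels.
       An expression such as  A + B x C  would compile to two of these. */
    typedef struct vop {
        int               opcode;  /* e.g. OP_ADD, OP_MUL, OP_RESHAPE    */
        const array_desc *left;    /* NULL for monadic operations        */
        const array_desc *right;
        array_desc       *result;
    } vop;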
daniel@ziebmef.mef.org (Daniel Albano) (09/06/89)
There was an APL machine (by that name) produced by a corporation somewhere on the east coast of the US. It was based on 68000 processors, back when that was one of the new, high speed CPU chips. I saw it at a computer show in Toronto, and it seemed quite capable.

As for performance, type checking and rank checking should not be a major concern, as you should be able to do it once for each reference to a symbol. If the symbols are a decent size, the actual overhead wrt. computation should be insignificant (or rather, if the data pointed to by the symbols are a decent size).

I suspect that there is no advantage in embedding APL arrays in an internal vector representation. This is just a suspicion, but if the data are really 3-arrays, why not store them as such (which are really vectors anyway, but why confuse ourselves! :-) ).

If it is to be a real APL, I don't think you can omit []CT and []IO, though you could make a case for omitting []PP as it "only" affects the display of values, not the values calculated themselves. As for nested arrays, I have never found a particularly compelling need for them, but I am sure opinions differ.

Daniel Albano
Toronto, Ontario
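[A minimal sketch of the "check once per reference" point above, in C; the helper and type names are hypothetical, not from any actual interpreter. Type and rank are validated once against the array header, and the inner loop over the ravel runs unchecked, so the per-element cost of generic typing vanishes for arrays of a decent size.]

    #include <stddef.h>

    typedef struct { int type; int rank; size_t count; double *data; } array_hdr;

    enum { T_FLOAT = 2 };

    static int add_float_arrays(const array_hdr *a, const array_hdr *b,
                                array_hdr *z)
    {
        /* one-time checks per symbol reference */
        if (a->type != T_FLOAT || b->type != T_FLOAT) return -1;   /* type error  */
        if (a->rank != b->rank || a->count != b->count) return -1; /* rank/shape  */

        /* unchecked element loop; this is where the time actually goes */
        for (size_t i = 0; i < a->count; i++)
            z->data[i] = a->data[i] + b->data[i];
        return 0;
    }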
raulmill@usc.edu (Raul Deluth Rockwell) (09/10/89)
In article <1989Sep5.153557.9841@ziebmef.mef.org> daniel@ziebmef.mef.org (Daniel Albano) writes:
There was an APL machine (by that name) produced by a corporation
somewhere on the east coast of the US. It was based on 68000
processors ...
I suppose I wasn't very explicit on this point. By APL machine, I
meant a machine whose CPU architecture was oriented towards APL data
structures and operations. I didn't mean to imply that I was talking
about machines which 'merely' run APL.
Also, note that there is no need for APL to be the only language that
runs on the machine. The important point is that the machine
architecture is an optimal one for APL.
As for performance, type checking and rank checking should not be a
major concern ... (... if the data pointed to by the symbols are a
decent size).
That is exactly the point. If your algorithm must use small data
structures with any frequency, you lose. I do not think that you can
make the assumption that small data structures are unimportant.
I suspect that there is no advantage in embedding APL arrays in an
internal vector representation. ... if the data are really
3-arrays, why not store them as such (which are really vectors
anyway, but why confuse ourselves! :-) ).
Now I'm confused 8*).
Seriously, the advantage of vectors over arrays is that they require
less complexity in the machine that implements them. I don't think
that I could design a generalized array language that didn't HEAVILY
rely on the structure of a vector. Wouldn't you gain something by
considering this in a design?
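[For instance, here is a minimal sketch in C of that reliance on vectors; the helper is hypothetical, not part of any particular design. An array of any rank is carried as its ravel (a plain vector) plus a shape, and indexing collapses to one Horner-style accumulation over the axes.]

    #include <stddef.h>

    /* Position of A[index] within the ravel of a row-major array,
       given its rank and shape.  All higher-rank structure reduces
       to arithmetic on this single flat vector. */
    static size_t ravel_index(int rank, const size_t *shape,
                              const size_t *index)
    {
        size_t offset = 0;
        for (int axis = 0; axis < rank; axis++)
            offset = offset * shape[axis] + index[axis];
        return offset;
    }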
If it is to be a real APL, I don't think you can omit []CT and
[]IO ...
Not in the interpreter, sure. But that doesn't mean that you couldn't
implement the interpreter in an APL-like language which doesn't have
these. For example, you could implement = as:
      V RESULT <- LEFT EQUAL RIGHT
        RESULT <- (LEFT | RIGHT) < (LEFT | RIGHT) x 1 + QUADCT
      V

Just in case (the APL overstrikes don't survive ASCII): V is del, <- is
assignment, the first | is ceiling (maximum), the second | is floor
(minimum), < is less-than-or-equal, x is times, and QUADCT is []CT.
I appreciate the comments, but this is getting a bit off the track of
the original article. Can I steer the discussion back toward an
efficient implementation of APL in terms of machine hardware?
--
jaxon@uicsrd.csrd.uiuc.edu (09/11/89)
> There was an APL machine (by that name) produced by a corporation
> somewhere on the east coast of the US.  It was based on 68000
> processors,...

You're thinking of Analogic Corporation's 68000/CATscanner hybrid. The 68000s were just control processors for the interpreter; most primitives were microcoded on a 12-bit x 8 element Array Processor which had been scavenged from one of Analogic's medical imaging instruments. I believe it's still on the market, although no plans exist to upgrade it. The interpreter and user interface are excellent. The CAT scanner part is a merciless number cruncher, so fast that you won't notice the effects of long vector lengths until well past 1000 elements (i.e. it is several hundred times faster than a 68000).

There is a lesson in the APL Machine design, though. The array processor is not enough! Despite the implementer's heavy use of the AP (e.g. linear search on the AP always outperformed hash table lookup on the 68000), the 68000 was a constant damper on program speed. Memory management of small vectors and scalars is not an especially parallelizable aspect of APL. The uniprocessors responsible for serial sections of the interpreter, and whatever features are used to synchronize the parallel sections, are absolutely critical elements in an APL supercomputer.

> The "equals" primitive could be written in APL : ...[buggy code omitted]

Several interpreters have tried using "magic functions" to produce new primitives from old. It is never as easy as it looks, and it is NEVER fast - it's not even tolerably slow.

1) Your definition is wrong -- you must take absolute values before
   comparing the arguments.

2) Once the correct definition is written, you must make it work even
   when the intermediate terms exceed the number system's limits.

3) By now you've got a function that works for simple scalars. To call
   it you'll have to create two scalar APL objects, and a stack frame
   (that's invisible in the caller's ")SI"). You'll have to make a class
   of APL function capable of returning into a primitive algorithm at
   the correct place.

This does not compare favorably with a single instruction for Tolerant Equals. I'm not a great fan of #CT and its consequences, but it is STANDARD and heavily relied upon, and it is one more language-specific hardware feature that APL could really use.

> "Dictionary APL" NOT the "APL2" dialect.

Firstly I'd urge any APL designer to become deeply familiar with BOTH these language definitions, and to really use BOTH systems. The dictionary approach ("function rank") is really a wonderful perfection of the original APL array processing ideas. I suspect it is more efficient to implement, because it is a little less powerful than the equivalent features in APL2. In "Dictionary" APL, operators are much more powerful. If operators can really manipulate the functions (e.g. pipelining them, carrying temporary results in registers, etc.) then I'd say the dictionary approach is best suited to today's vector supercomputers.

But "function rank" seems tied to homogeneous arrays (am I wrong here?)

There are real limits on programmers' ability to foresee what a function expression will do. In the APL2 approach, all the data decompositions are explicitly written out; you can enter the subexpressions and watch what's happening to your data. You can also do all kinds of unorthodox decompositions of your data, which stand no chance of being vectorizable. And "vectorizing" is not the only hope for APL anyway!

Multiple Instruction Multiple Data parallel machines are growing in number and power; these don't require that "one function" be in control, that "two arguments" be in memory and that "one type" of result is expected. I think the APL2 approach will provide very rich ground for parallel machine designers.

Thanks in advance for any replies!

greg jaxon -- jaxon@uicsrd.csrd.uiuc.edu
Univ. of Ill. Center for Supercomputing R&D
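[A minimal C sketch of the Tolerant Equals point raised above, using one common definition of tolerant equality (the difference is compared against []CT times the larger magnitude); the function name is hypothetical and this is a software stand-in, not anyone's actual microcode.]

    #include <math.h>
    #include <stdbool.h>

    /* left and right are tolerantly equal when their difference is no
       larger than quad_ct times the larger magnitude.  Taking absolute
       values first handles negative arguments, and the explicit
       equality test makes exact equality pass even when quad_ct is 0. */
    static bool tolerant_equal(double left, double right, double quad_ct)
    {
        if (left == right)
            return true;
        double mag = fmax(fabs(left), fabs(right));
        return fabs(left - right) <= quad_ct * mag;
    }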
raulmill@usc.edu (Raul Deluth Rockwell) (09/13/89)
In article <49700014@uicsrd.csrd.uiuc.edu> jaxon@uicsrd.csrd.uiuc.edu writes:

   There is a lesson in the APL Machine design, though.  The array
   processor is not enough! . . . The uniprocessors responsible for
   serial sections of the interpreter, and whatever features are used
   to synchronize the parallel sections, are absolutely critical
   elements in an APL supercomputer.

Anybody have any comments on how these should be implemented (i.e. deviations from commercial general-purpose CPU design)? Has anyone done any work to analyze the most common (and/or least used) operations in the serial/scalar/small-vector portions of the machine?

   > The "equals" primitive could be written in APL : ...[buggy code
   > omitted]
   Several interpreters have tried using "magic functions" to produce
   new primitives from old.  It is never as easy as it looks, and it
   is NEVER fast - it's not even tolerably slow.

The point was not to write APL in APL, but to write APL in a generic vector language. However, let me say that you are probably right: it is too easy to implement []CT at the hardware level (especially if []CT is limited to negative integer powers of two) to justify leaving it out of a design.

   1) Your definition is wrong -- you must take absolute values before
   comparing the arguments.

I don't see that my definition was unacceptably wrong. Deviation from ideal is []CT squared, and for small values of []CT this is trivial. (I am assuming that I am allowed to limit []CT to small values.)

   2) Once the correct definition is written, you must make it work
   even when the intermediate terms exceed the number system's limits.

Is good point.

   3) By now you've got a function that works for simple scalars.  To
   call it you'll have to create two scalar APL objects, and a stack
   frame (that's invisible in the caller's ")SI").  You'll have to make
   a class of APL function capable of returning into a primitive
   algorithm at the correct place.

I'm not sure I'm not taking this level of complexity for granted in my earlier posting. I mean, if I do implement APL in some vector language, won't I have to deal with this complexity for most of the primitives? Or, to rephrase: what level of complexity should be in hardware, and what in software? (Feel free to elaborate on what you are expecting in hardware.)

   . . . The dictionary approach ("function rank") is really a
   wonderful perfection of the original APL array processing ideas.
   . . . If operators can really manipulate the functions (e.g.
   pipelining them, carrying temporary results in registers, etc.)
   then I'd say the dictionary approach is best suited to today's
   vector supercomputers.  But "function rank" seems tied to
   homogeneous arrays (am I wrong here?)

Function rank is tied to what the function can handle. If the primitive function can handle heterogeneous arrays, then function rank is applicable to heterogeneous arrays. Since we are talking about APL machine architecture here, primitive means machine primitive, not language primitive. This leads into your idea of MIMD machines.

   There are real limits on programmers' ability to foresee what a
   function expression will do. . . . You can also do all kinds of
   unorthodox decompositions of your data, which stand no chance of
   being vectorizable.  And "vectorizing" is not the only hope for APL
   anyway!  Multiple Instruction Multiple Data parallel machines are
   growing in number and power; these don't require that "one
   function" be in control, that "two arguments" be in memory and that
   "one type" of result is expected.

You are probably right. I have been ignoring MIMD because I need to think more about instruction propagation in this type of machine. Know of any good references on the general subject?

   I think the APL2 approach will provide very rich ground for parallel
   machine designers.

Make that the APL approach, and I will be happy. I do not consider APL2 a complete design at this point. (I do agree with including most of APL2's functionality in a design, but there are a number of issues I still have problems with.)

btw, does anyone know any good reference works on Heisenberg's original approach to quantum physics?
--
Raul Rockwell  !  INTERNET: raulmill@usc.edu  !  UUCP: ...uunet!usc!raulmill  !  55 mph = 82 nc
U.S.SNAIL: 721 E Windsor #4, GLENDALE CA 91205  !
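[A small C sketch of why restricting []CT to negative integer powers of two is attractive at the hardware level; this is hypothetical illustration code, a software stand-in for what would be an exponent-field operation. With []CT = 2^-k, the product []CT x max(|L|,|R|) is just the larger magnitude with its exponent reduced by k, so the tolerant-compare path needs no multiplier.]

    #include <math.h>
    #include <stdbool.h>

    static bool tolerant_equal_pow2(double left, double right, int k)
    {
        if (left == right)
            return true;
        double mag = fmax(fabs(left), fabs(right));
        return fabs(left - right) <= ldexp(mag, -k);   /* mag * 2^-k */
    }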
Pesch@cup.portal.com (Roland Henry Pesch) (09/16/89)
[most of excellent article by Greg Jaxon omitted]

>> "Dictionary APL" NOT the "APL2" dialect.
>
>Firstly I'd urge any APL designer to become deeply familiar with BOTH these
>language definitions, and to really use BOTH systems. The dictionary approach
>("function rank") is really a wonderful perfection of the original APL array
>processing ideas. I suspect it is more efficient to implement, because it is
>a little less powerful than the equivalent features in APL2. In "Dictionary"
>APL, operators are much more powerful. If operators can really manipulate
>the functions (e.g. pipelining them, carrying temporary results in registers,
>etc.) then I'd say the dictionary approach is best suited to today's
>vector supercomputers.
>
>But "function rank" seems tied to homogeneous arrays (am I wrong here?)

Wrong (just that last sentence, not the most insightful paragraph I couldn't resist quoting first), but not surprisingly: Ken Iverson took a long time to add heterogeneous arrays to the Dictionary. The version published in Quote Quad includes them, I think, though with little fanfare.

Moreover, we've included them in SAX (an APL interpreter and associated programs for the Unix environment, produced by I.P. Sharp Palo Alto in collaboration with STSC), which is to my knowledge the most complete implemented subset of Dictionary APL. (Unfortunately it's also kind of hard to get---we try to concentrate on OEM sales, for overhead reasons.)

But there's nothing in Dictionary APL which conflicts with mixed arrays; they're just not essential, as they are in APL2. (In Dictionary APL, it's largely a matter of extending the domain of a few primitives like catenate---formally, that is; internally, naturally, there's a fair bit of work---but in APL2 they're a degenerate case of the enclosed universe, so they're essential.)

>There are real limits on programmers' ability to foresee what a function
>expression will do. In the APL2 approach, all the data decompositions are
>explicitly written out, you can enter the subexpressions and watch what's
>happening to your data. You can also do all kinds of unorthodox decompositions
>of your data, which stand no chance of being vectorizable.

I'm not entirely sure whether that paragraph is supposed to be part of the text contrasting Dictionary APL and APL2. If so, I should point out that although the statement is quite true, Dictionary APL does include the "box" and "open" primitives (similar to enclose and disclose). We find that it's frequently easier to write an algorithm using them in the first place, while you're groping around trying to understand it; then later, when you know what you're doing, it's often possible to dispense with box and open and do something elegant with operators and flat arrays instead.

[much more great stuff omitted]

>Thanks in advance for any replies!
>
>greg jaxon -- jaxon@uicsrd.csrd.uiuc.edu
>Univ. of Ill. Center for Supercomputing R&D

You're most welcome---and it's very good to hear from you! I'd lost track of where you were...

/Roland Pesch                     pesch@pa.reuter.com
SAX Development Group             (or: pesch@cup.portal.com)
I.P. Sharp Associates
425 Sherman Ave., Suite 200
Palo Alto, CA 94306
daniel@ziebmef.mef.org (Daniel Albano) (09/19/89)
The question of "hardware optimization" for APL machines requires some consideration of how they will be used. Are we speaking here of single-user or multi-user machines? Are we looking at multiple user tasks, or single-workspace machines with synchronous processing (single task/workspace)? Is the machine going to run "APLish" code (data embedded in arrays, large "logically parallel" operations), or will it be used for the "third generation language" processing style that tends to predominate in COBOL, dBase, and PASCAL applications (mixed data types stored as "records" or in structures accessed through index operations)? Choices in how it will be used - and who is going to be footing the bill :-) - will largely determine design choices. Exxon crunching seismic data will have a much different view of the "optimal machine" than I will; probably by a few million dollars.

I tend to discount the overheads in handling small (several element to scalar) data structures, because if that is all there is, then the application is unlikely to be large, and there should be lots of spare machine cycles. When the work grows, if the application is well designed, so does the size of your data entities.

Making APL hardware could involve implementing a lot of APL as "machine instructions", but that really means microcode - programs written in a constrained environment to squeeze as much as possible out of very elementary operations. Given the raw hardware capabilities (single-cycle instruction execution, pipelining, branch lookahead, etc.), the idea of building what could be termed an ECISC machine (Exceedingly Complicated Instruction Set Computing - and I do mean complicated, not complex) flies in the face of current conventional wisdom on performance. Of course, this would not be the first time that conventional wisdom could be wrong. The structuring of the execution environment by the APL interpreter would be an advantage here - but it is also an advantage in generating highly tuned object code as well - and perhaps an assembler would be easier to work with than a micro-assembler. As is often the case, a middle ground may work well. Perhaps a more or less conventional CISC (or RISC?) machine with a few well-chosen enhancements to the instruction set to improve fundamental APL operations would offer advantages without too much sacrifice of development effort and system complexity.

The nature of the enhancements depends very much on what you consider the key APL-level operations to be. My own list of crucial favourites would include reshape, dyadic iota, compress, reduce, and generalized inner products (or at least and.equals). Operations on Boolean arrays would also be very high on the list, as would comparisons, especially those for integer and character data. Both base value and representation are also important from a performance view in that they should not be overly slow, but good system design can keep their use far below that of the previously mentioned operations. One operation implicit in any APL that is crucial is the allocation of storage for data entities. Function calling, including the creation of local symbols for user-defined functions, is another thing that can happen very often in a well-structured system.

On the other hand, I don't really care that much how fast the trig functions or domino run, nor that much how fast exponentiation and the like are. Someone who does more "calculation" rather than "byte bashing" (commercial/logical systems) would again have a different set of priorities. If I were heavily into graphics in a particular instance I might care about floating point, but in most systems I have seen, comparisons and data manipulation far outweigh calculation.

The most crucial item in a lot of cases is just workspace size. An application that can get most or all of its data in a workspace and have enough left over for transient storage for intermediate results will be not only faster, but easier to write and to maintain. I have used workspaces from 32K to several megabytes, and would not willingly go back to the small ones. Of course, my home system has a 750K workspace, so most of my recent code has been very easy - I just don't seem to have too much data in my life :-) ... of course the data compression / storage reduction techniques learned in those 32K workspaces still do a great job in maximizing the capabilities of something more than twenty times larger.

Personally, I don't think (single user machine) that you have to provide multiple "control streams" or simultaneous processing of user level tasks, but some parallelism in the execution of array operations - or a high speed vector processing scheme - could provide a major boost in performance. In many operations on conformal arrays, the (logical) structure of the data is not important during the actual computation step. If you could generalize (should be easy) for scalar/array operations, then you have a large part of the problem in hand. Many other operations could be decomposed into a number of vector, or "quasi-vector", operations. By quasi-vector, I mean that there are a number of elements, regularly and repetitively spaced through the machine's memory, with (temporarily) irrelevant elements interspersed.

In real world terms, another major performance boost is a true native APL file system - one that stores the components in their internal representation. I did work with one system that only stored what were essentially character representations of the data (two dimensional? memory is mercifully weak on the details) and a lot of time was spent converting to and from "file format". This is probably too peripheral for your interests, as it should be primarily a "software file server" issue, but it does touch on (perhaps) some details of the hardware/software interface.

The idea of making APL the operating system, and giving it complete and direct control over the system had a considerable appeal. The benefits could be enormous.

Speaking of Analogic, are they still around? Do they still have any APL products? And, for that matter, is there a good APL for Intel (MS-DOS) machines that does not cost the better part of a thousand dollars?

Daniel Albano
Toronto, Ontario
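[A minimal C sketch of the "quasi-vector" idea above; the routines are hypothetical. The operation touches elements regularly spaced through memory and skips the interspersed, temporarily irrelevant ones - taking a column out of a row-major matrix is the classic case.]

    #include <stddef.h>

    /* Add count elements of src into dst, stepping by stride each time. */
    static void strided_add(double *dst, const double *src,
                            size_t count, size_t stride)
    {
        for (size_t i = 0; i < count; i++)
            dst[i * stride] += src[i * stride];
    }

    /* Example: add column 2 of matrix b into column 2 of matrix a,
       where both are 4-row by 5-column arrays stored row-major. */
    static void add_column_2(double a[4][5], const double b[4][5])
    {
        strided_add(&a[0][2], &b[0][2], 4, 5);
    }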
raulmill@usc.edu (Raul Deluth Rockwell) (09/20/89)
Quick question (I hope). Does anyone know of a reasonably systematic (and fast) way of implementing the gamma and beta functions? --
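[One standard route, sketched in C as a suggestion rather than an answer from the thread: compute the log-gamma function (C's lgamma(), typically built on Stirling/Lanczos-style approximations) and derive gamma and beta from it, which also postpones overflow for moderate arguments. Function names here are hypothetical.]

    #include <math.h>

    /* Gamma via exp(lgamma); adequate where gamma(x) is positive. */
    static double gamma_fn(double x)
    {
        return exp(lgamma(x));
    }

    /* Beta(a,b) = Gamma(a)Gamma(b)/Gamma(a+b), computed in log space
       so the intermediate gammas never overflow individually. */
    static double beta_fn(double a, double b)
    {
        return exp(lgamma(a) + lgamma(b) - lgamma(a + b));
    }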
raulmill@usc.edu (Raul Deluth Rockwell) (09/22/89)
In article <1989Sep19.111751.7613@ziebmef.mef.org> daniel@ziebmef.mef.org (Daniel Albano) writes:

;> The question of "hardware optimization" for APL machines
;> requires some consideration of how they will be used.

Let's see. Selecting concepts out of your original posting, I am thinking of a multi-user machine, running "APLish code". Let's say something with the capacity of what currently costs around $50,000. Price should drop for that capability by the time I am anywhere near completion. I haven't decided how to deal with resource locking on multi-tasking conflicts. (Not that I don't have ideas. I just haven't convinced myself that they are that good.)

The basic architecture I am trying to thrash out passes data objects around with a checklist of what needs to happen to them. There is a bandwidth penalty here, but it allows for quite a bit of extensibility (I hope).

;> . . . I tend to discount the overheads in handling small (several
;> element to scalar) data structures, because if that is all there
;> is, then the application is unlikely to be large, and there should
;> be lots of spare machine cycles.  When the work grows, if the
;> application is well designed, so does the size of your data
;> entities.

I am not convinced that this is true for all applications. Also, there generally need to be several design iterations before an application is "well designed". It is true that MANY (probably MOST) applications can be handled quite nicely using an APLish approach.

;> Making APL hardware could involve implementing a lot of APL
;> as "machine instructions", but that really means microcode -
;> . . . [much deleted]

I think that you are talking about a conventional von Neumann architecture (CPU with a bus leading to memory in which instructions and data are stored). I am being a little more "blue sky" than that. (I can almost afford to: I'm a college student 8^)

;> The nature of the enhancements depends very much on what you
;> consider the key APL level operations to be.  My own list of
;> crucial favourites would include reshape, dyadic iota,
;> compress, reduce, and generalized inner products (or at least
;> and.equals).

A nice selection.

;> Operations on Boolean arrays would also be very high on the list,
;> as would comparisons, especially those for integer and character
;> data.

Given a fairly decent architecture these SHOULD be fast. Look at the occurrence of blitters in recent personal computers. (Fast bit manipulation does nice things to performance where applicable.)

;> . . . One operation implicit in any APL that is crucial is the
;> allocation of storage for data entities. . . .

That is why I am trying for an architecture which is based on a communication model. There are penalties, but there should be big advantages. The question (a big one) is how to organize things.

;> On the other hand, I don't really care that much how fast
;> the trig functions or domino run, nor that much how fast
;> exponentiation and the like are. . . .

I do. Transcendental functions and domino do nice things to applications which would otherwise require massive looping. Could I get you to agree that base 2 logarithms should be fast?

;> The most crucial item in a lot of cases is just workspace size.

Considering the current state of technology, I have been assuming a reasonably massive memory resource, though with varying levels of caching/mass-storage.

;> Personally, I don't think (single user machine) that you have
;> to provide multiple "control streams" or simultaneous processing
;> of user level tasks, but some parallelism in the execution of
;> array operations - or a high speed vector processing scheme -
;> could provide a major boost in performance.

There are two kinds of parallelism I am considering: (1) user-level parallelism (coarse-grained) and (2) user-invisible parallelism (fine-grained). (1) is more of a software issue, and (2) is more of a hardware issue.

;> In many operations on conformal arrays, the (logical) structure of
;> the data is not important during the actual computation step.

I don't quite follow you here. What are you saying?

;> [description of practical vector operation: skipping over parts of
;> memory] . . .
;> In real world terms, another major performance boost is a true
;> native APL file system - one that stores the components in their
;> internal representation. . . .  The idea of making APL the operating
;> system, and giving it complete and direct control over the system
;> had a considerable appeal.  The benefits could be enormous.

Exactly!

I hope I am not being too foggy on details here. If anyone feels I should elaborate on any of the points here (assuming anyone is interested) don't hesitate to tell me.
--
Raul
--
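[To make the "data objects with a checklist" idea above slightly more concrete, here is a minimal C sketch; the names, field layout, and operation codes are entirely hypothetical and do not describe any actual design. Each object travels with a short queue of pending operations, and whichever processing element receives it performs the next step it can, then forwards the object.]

    #include <stddef.h>

    enum pending_op { P_RESHAPE, P_ADD, P_REDUCE_PLUS, P_STORE };

    typedef struct data_object {
        int     type;            /* element type tag                   */
        int     rank;
        size_t  shape[8];        /* up to rank 8, for the sketch       */
        double *ravel;           /* element data                       */

        /* the "checklist": operations not yet applied, in order */
        enum pending_op todo[16];
        int             todo_count;
        int             next;    /* index of the next operation to do  */
    } data_object;

    /* A processing element would loop roughly like this (pseudocode):
       receive(&obj);
       while (obj.next < obj.todo_count && can_do(obj.todo[obj.next]))
           perform(&obj);
       send_onward(&obj);                                              */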