vanroy@bellatrix.uucp (02/22/89)
Does anyone know of any comparisons of algorithms implemented in symbolic languages (e.g., Prolog or Lisp) without any special annotations versus in an imperative language? This means writing the algorithm in a natural style in both languages and using high-performance (perhaps experimental) compilers to compare execution speeds. The results depend on many factors: (1) the choice of algorithm, (2) programming style, (3) the quality of the compilers, and (4) the suitability of the architecture for the symbolic language.

I know of one comparison done at Argonne in which a theorem prover was written in both Prolog and C. The C version consistently outperformed the Prolog version by an order of magnitude. This was due to (1) the poor implementation of arrays in the Prolog version and (2) the awkwardness of the architecture for Prolog (I believe it was a Sun-3). Has anyone done any other such comparisons?

Present-day general-purpose architectures will naturally favor the imperative language in a comparison of this sort, because the traditional notion of a "general-purpose" architecture is not truly general-purpose. It emphasizes numeric calculation and neglects the operations that symbolic languages require, such as tag manipulation and multiway branching. Therefore I propose that the notion of "general-purpose" be extended to include support for symbolic languages.

At Berkeley, we are developing such an architecture for the Prolog language. We are designing it from the start as a complete system, i.e. the compiler, instruction set architecture, and (single-chip) implementation are being developed together with feedback in all directions. In this way we hope to overcome some of the limitations of previous Prolog implementations.

	Thanks for any pointers,
	Peter Van Roy
	vanroy@bellatrix.berkeley.edu
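For concreteness, here is what the imperative side of such a comparison might look like: naive reverse (NREV, the standard Prolog LIPS benchmark) coded in C in a deliberately natural, non-destructive style (cons cells, copying append), so that both languages express the same algorithm. This is a sketch of the methodology, not code from any published comparison; all names are mine.

```c
#include <stdlib.h>

/* A cons cell, mirroring a Prolog list term. */
typedef struct cell { int head; struct cell *tail; } cell;

static cell *cons(int h, cell *t) {
    cell *c = malloc(sizeof *c);
    c->head = h;
    c->tail = t;
    return c;
}

/* append/3 in a natural (copying) style, as a Prolog compiler
 * would execute it -- not the destructive C idiom. */
static cell *append(const cell *a, cell *b) {
    return a ? cons(a->head, append(a->tail, b)) : b;
}

/* nrev/2: naive reverse, O(n^2) -- the classic benchmark inner loop. */
static cell *nrev(const cell *l) {
    return l ? append(nrev(l->tail), cons(l->head, NULL)) : NULL;
}
```

Timing this against nrev/2 under a good native-code Prolog compiler is exactly the head-to-head comparison asked about; factors (1)-(4) above all come into play.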
wilson@carcoar.Stanford.EDU (Paul Wilson) (02/22/89)
I too am interested in such issues.

One thing I've been wondering about is the compatibility of RISC-style caches and Lisp Machine-style forwarding pointers. On Lisp Machines, the time required to check for invisible forwarding pointers is generally hidden in the time required for cache operation, but I'm unsure of the details. Do the new cache architectures still allow this? (I know they don't generally support such snazzy features; the question is whether other design decisions directly conflict with invisible forwarding.)

Also, is anybody doing any work on coordinating cache operation with garbage collection? With a large cache (100 KB or larger), you should be able to have a small Newest generation (say, 50 KB) that is scavenged completely within the cache -- no thrashing. Even if the Newest generation won't fit in the cache, you should be able to avoid swapping a lot of cache blocks out by telling the cache hardware when they contain only garbage (i.e., force them to appear clean so they won't be written back, or tell the cache to "create" a block without actually reading in the corresponding block from main memory). It seems that this would cut cache-memory traffic by approximately half (?) for the Newest generation.

Anybody out there doing such work?

	Paul R. Wilson
	Human-Computer Interaction Laboratory    lab ph.: (312) 413-0042
	U. of Ill. at Chi. EECS Dept. (M/C 154)
	Box 4348   Chicago, IL 60680
	wilson@carcoar.stanford.edu
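To make the check in question concrete, here is a software rendering of following invisible forwarding pointers, under assumed conventions (3-bit low tags, an invented TAG_FORWARD value -- not the actual Lisp Machine encoding). On stock hardware every load that might see such a pointer pays this explicit test-and-branch; this is exactly the cost Lisp Machine read hardware hides inside the memory access.

```c
/* Tagged machine word: low 3 bits hold the type tag (assumed layout). */
typedef unsigned long word;

#define TAG_MASK    0x7UL
#define TAG_FORWARD 0x7UL   /* hypothetical "invisible pointer" tag */

/* Follow invisible forwarding pointers until a real datum is reached.
 * A Lisp Machine does this in the read path; a stock RISC must issue
 * the mask, compare, and branch explicitly on every suspect load. */
static word read_follow(word w) {
    while ((w & TAG_MASK) == TAG_FORWARD) {
        const word *target = (const word *)(w & ~TAG_MASK);
        w = *target;          /* one extra memory reference per hop */
    }
    return w;
}
```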
jwmills@iuvax.cs.indiana.edu (Jonathan Mills) (02/22/89)
In article <28113@ucbvax.BERKELEY.EDU> vanroy@bellatrix.uucp () writes:
>Present-day general-purpose architectures will naturally favor the imperative
>language in a comparison of this sort. This is because the traditional notion
>of "general-purpose" architecture is not truly general-purpose. It emphasizes
>numeric calculation and does not do the things that symbolic languages require
>such as tag manipulation and multiway branching. Therefore I propose that the
>notion of "general-purpose" be extended to include support for symbolic
>languages.
>At Berkeley, we are developing such an architecture for the Prolog language.
>We are designing it from the start as a complete system, i.e. compiler,
>instruction set architecture, and (single-chip) implementation are being
>developed together with feedback in all directions. In this way we hope to
>overcome some of the limitations of previous implementations of Prolog.
> Thanks for any pointers,

In response to the queries about extending the definition of "general purpose" architectures to include symbolic processing: it's been done, and the results are very exciting. I have a single-chip design for a Prolog machine, and frankly I think it's pretty good (but we know it could be improved, don't we?). Because the design was closely tied to discussions and experiments with Kevin Buettner, the author of the Applied Logic Systems Prolog compiler, the architecture is the first "native code" Prolog processor below the level of the WAM that is tied to the optimizations a Prolog compiler would typically perform.

I've been flogging the notion of a Prolog RISC since 1986, and have evolved a design which we are now building at Indiana University with some support from Motorola. Frank McCabe & British Aerospace were influenced in some small part by my earliest LOW RISC I, and have independently come up with a similar design (the DLM, a very nice machine based on the little-appreciated sigma machine).
Because of the similarity in designs, and because there is only so much you can do in a limited instruction set, I'd argue that we already have a good notion of what a general-purpose processor needs. First some history:

1985 - WAM taken apart; pieces used to build a Sub-WAM for Ken Bowen's ALS PC Prolog compiler. Fast implementation; taught us how to compile Prolog to native code.

1986 - LOW RISC I. Arcane 7 opcodes. Verified the need for integrated tag & value processing; had problems with excessive branching & deref loops.

1987 - LOW RISC II. Simulated, nominal speeds 300 - 800 KLIPS. Still had excessive branching problems (particularly in unification).
     - 88000e. Brian Short extended the 88000 to a tagged architecture. Similar performance to LOW RISC II. His thesis is an excellent piece of work, but you have to sign a non-disclosure to get it. Try calling Sam Daniel, Motorola GEG, (602) 897-3010 if you want a copy. It might not be releasable until late 1990, though.

1988 - LIBRA. A "balanced" RISC with 35 instructions. ("Balance" is a concept described by Michael Flynn of Stanford and others - I recall a BISC described at a DoD conference.) The LIBRA runs WAM macros with 2.3x fewer cycles than the PLM, and code optimizations can improve this even more. Every opcode is conditioned to reduce branch frequency (a la the ARM), and branch frequency is further reduced by a single-cycle partial unifier. It is possible that an ECL gate array version with an instruction prefetch buffer, an ECL cache, and interleaved 100ns RAM could nominally run over 1,000,000 inferences per second, and could max out on the inner loop of NREV at more than 5.7 MLIPS (but not much more).

The architecture of the LIBRA is discussed in more detail in the following posting, derived from an article being written for IEEE Micro.
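The "deref loops" blamed above are easy to exhibit. Below is a minimal software version of WAM-style dereferencing, under an assumed 3-bit low-tag layout (the tag names and macros are mine): each hop down a reference chain costs a load, a mask, a compare, and a branch on a stock processor, which is exactly the cost the LOW RISC designs were trying to bury in hardware.

```c
/* Assumed 3-bit low tags; TAG_REF marks a reference cell. */
typedef unsigned long term;

#define TAG_MASK 0x7UL
#define TAG_REF  0x0UL
#define TAG_INT  0x1UL

#define tag_of(t) ((t) & TAG_MASK)
#define ptr_of(t) ((term *)((t) & ~TAG_MASK))
#define mk_int(n) (((term)(n) << 3) | TAG_INT)

/* Follow REF chains until a non-REF term or an unbound variable
 * (a REF cell pointing to itself, the usual WAM convention). */
static term deref(term t) {
    while (tag_of(t) == TAG_REF && *ptr_of(t) != t)
        t = *ptr_of(t);       /* load + tag test + branch, per hop */
    return t;
}
```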
jwmills@iuvax.cs.indiana.edu (Jonathan Mills) (02/22/89)
LIBRA: LOGICAL INFERENCE BALANCED RISC ARCHITECTURE (name by Tony Faustini)

The LIBRA is a 40-bit 4-stage pipelined processor with four major functional units, each pipelined & operating synchronously in parallel:

1. Value ALU. Contains separate hardware for arithmetic/logic, dereferencing, and bounds checking.
2. Tag ALU. Tag comparison; interface to the partial unifier.
3. GC ALU. Support for mark & sweep GC algorithms.
4. PC ALU. Next-instruction fetching; many branches are simple counter loads, but of a partial field. The fastest branches are within a page. (Yes, this is a paged machine, at least for now. It doesn't bother the compiler writer I work with, but I do recognize the problems for multi-processing.)

The machine is microcoded, but uses only one control word per machine instruction. An alternate microcode address is latched after every instruction that sets condition codes (more on this under PARTIAL UNIFY). The 40-bit data word is divided into a 3-bit type, 1 bit each for the mark & reverse garbage-collect flags, and a 35-bit value which can be further typed for use with 32-bit numeric host processors.

NON-ORTHOGONAL REGISTER FILE

There are 32 registers: 4 are stack pointers (see why in a minute), 24 are general purpose, and 4 are link registers for subroutine calling (or continuation pointers a la the WAM). The register bank is non-orthogonal to allow single-cycle instructions that can: push register A using register B as a pointer to memory, increment/decrement register B, and store register B into register C, overwriting the tag in C with an immediate. These instructions make use of parts of the datapath, provided for forwarding, that would otherwise be idle, and route their contents back to the register file; this is useful in a structure-copying implementation of Prolog.

PARTIAL UNIFY

Microcode addressing is typically provided or modified by flags, fields in the microinstruction, and a microprogram counter. In the LIBRA, concatenated tags are used for microcode addressing.
Using this technique, the microcode is viewed as a two-dimensional array of control words. The two tags saved from the previous instruction's operands are used to index the microcode array and select the control word for a replaceable instruction. A single-cycle partial unify performs either a nop, a store, a call, or a branch, thus condensing three to five tag-checking instructions into one. The partial unify instruction can eliminate as many as 30% of the subroutine calls performed by a general-purpose RISC running Prolog.

By applying this operation to the LIBRA's partial unification and switch instructions, dereferencing can be made a conditional operation, performed only when necessary. More significantly, the pair of checks for a bound variable (one is needed for each operand being unified) can be included in the operation of the partial unification instruction by simply enlarging the unifier table in ROM, thus removing as many as eight instructions (which include two branches) from the direct execution sequence. The modified partial unify instruction thus reduces the number of instruction cycles spent dereferencing objects reached in eight or fewer indirections; this optimization improves the performance of unification by 60% to 100% by conditionally omitting dereferencing as well as explicit checks for bound variables. Here's an example:

Label   Condition   Instr   sc  Operands        Comment
--------------------------------------------------------------------
loop:   if bound1   ld          Xn 0 Xn
        if bound2   ld          Ai 0 Ai
enter:              sub     sc  Xn Ai r27
                    unify   sc  Xn Ai loop exit  ;no mode split
        if trail1   push+       TR Xn
        if trail2   push+       TR Ai
exit:

SMART CACHE

Tags can also be used to improve cache hit rates by preventing transient data, such as intermediate elements in a pointer chain, from entering the cache. This implies that dereferencing should take place outside the cache, rather than at the chip (as suggested in 1986 by Mats Carlson).
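The tag-pair dispatch can be sketched in software. Here the two operand tags index a small table whose entry names the action the unify instruction takes this cycle; the tag set and table contents are illustrative only (the real unifier table lives in microcode ROM), but the point survives: one indexed fetch replaces three to five tag tests and branches.

```c
/* Illustrative tag and action sets -- not the LIBRA's actual encoding. */
enum tag    { T_REF, T_INT, T_ATOM, T_STRUCT, NTAGS };
enum action { A_BIND, A_COMPARE, A_CALL_UNIFY, A_FAIL };

/* Two concatenated tags index the table, the way the saved operand
 * tags index the LIBRA's two-dimensional microcode array. */
static const unsigned char unify_dispatch[NTAGS][NTAGS] = {
    /*             REF      INT        ATOM       STRUCT       */
    /* REF    */ { A_BIND,  A_BIND,    A_BIND,    A_BIND       },
    /* INT    */ { A_BIND,  A_COMPARE, A_FAIL,    A_FAIL       },
    /* ATOM   */ { A_BIND,  A_FAIL,    A_COMPARE, A_FAIL       },
    /* STRUCT */ { A_BIND,  A_FAIL,    A_FAIL,    A_CALL_UNIFY },
};

/* The whole branchy "switch on both tags" collapses to one fetch. */
static int partial_unify_action(unsigned t1, unsigned t2) {
    return unify_dispatch[t1][t2];
}
```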
If a "dereferenced choice point" is built once per procedure, then the bottleneck that this would otherwise form can be reduced. Also, the only hardware interlock built into the LIBRA is used to stall the pipeline during dereferencing; moving dereferencing into the cache would clean up the design. It is also possible to do a CDC 6600-like trick and move trailing and failing out of the CPU to a peripheral processor (also done by McCabe in the DLM). If cache entries are monitored by the trail-fail processor, and reset when requested by the CPU, the failure of a previous goal can be overlapped with the execution of the subsequent goal, thus improving backtracking performance.

SYMBOLIC BOUNDS CHECKING & TRAIL-CHECK SCOREBOARDING

Automatic bounds checks are provided for trail and current-environment checking. Some bounds checks are "sticky", to allow detection of an "almost stack collision" condition; garbage collection can then be initiated at programmer-defined points. This saves stack-collision checking at every call, and further reduces branching.

Scoreboarding is generalized to include marking pre-processed data as well as data waiting to be processed. The trail check is moved to occur during a load or during the final stage of a dereference. If the check shows that an unbound variable is being loaded -- one that will need to be trailed if it is instantiated later -- the register loaded is marked by setting a trail-check flag in the scoreboard.

SYMBOLIC CONDITIONING OF EVERY INSTRUCTION

Conditional instruction execution is used to implement preferred branches, and is also used with numeric conditions to control the execution of every instruction in the Acorn RISC Machine (Acorn Computers Limited 1986, 1987). The LIBRA architecture extends this concept by adding symbolic conditions, allowing operations which formerly needed several test-and-branch instructions to be coded instead as a sequence of instructions, all of which are conditionally nops.
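The trail check that the scoreboard absorbs is, in software, a compare on every binding: a cell needs trailing only if it is older than the newest choice point, here represented by the WAM's HB (heap backtrack) pointer. The struct layout is mine, for illustration.

```c
/* Assumed miniature machine state: a trail stack and the heap-backtrack
 * pointer HB recorded by the newest choice point (WAM convention). */
typedef unsigned long term;

typedef struct {
    term **trail_top;   /* next free slot in the trail stack */
    term  *hb;          /* cells below this predate the choice point */
} machine;

/* Bind a variable cell, trailing it only when backtracking would need
 * to undo the binding.  The LIBRA moves this compare out of the
 * instruction stream and into a scoreboard flag set at load time. */
static void bind(machine *m, term *cell, term value) {
    *cell = value;
    if (cell < m->hb)
        *m->trail_top++ = cell;   /* conditional trail push */
}
```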
This reduces the branch frequency of the LIBRA, and improves its ability to use interleaved memory. Symbolic conditions include certain variable types (bound1, bound2), trail and environment checks (trail1, trail2, env1, env2), and collision checks ("sticky overflow"). Overall, the use of conditional instruction execution, memory interleaving, tag-controlled caching, instruction prefetching and loop buffering (lookahead, look-behind) allows the LIBRA to execute short inner loops and shallow backtracking at speeds limited primarily by the cache memory cycle time rather than by the hit rate.

A GENERAL PURPOSE ARCHITECTURE?

Well, the LIBRA was compared only on Prolog, but it was compared to the PLM, the SPUR, the 68K and the 88K, and it outperformed them all. Truly, Peter Van Roy's assertion that general-purpose architectures should support symbolic processing is one I have been pushing for years. The argument that has always been thrown back at me is, "We will have 50 MIPS machines in a year or two; the performance advantage of a special-purpose processor is not worth it." I disagree with that rebuttal, and consider Motorola's interest (which varies over time) to indicate that it might - just might - be possible to extract the essentials of the LIBRA and put them into place as an SFU in an 88000. Toward this end I have designed a 32-bit LIBRA with 16 instructions that encodes the WAM in 2.16x fewer cycles than the PLM. As an SFU this beast would be capable of interpreting 32-bit data as tagged cells, and would be able to take advantage of all of the 88K's numeric and O/S-specific support; thus it could deal with symbolic-processing variants such as Constraint Logic Programming as well as Prolog (and presumably Lisp - after talking to the Schemers here at IU I can find nothing they do that the LIBRA can't encode).
CLOSING WORD: I have probably left out some minor but interesting things, but if you are really interested & are free early this summer I will give a 5-week seminar beginning around May 20th at IU on symbolic architectures, with an emphasis on their use in automated deduction systems. E-mail me for the details.
Johnson@tilde.ti.com (Doug) (02/23/89)
In article <7051@polya.Stanford.EDU> you write:
>I too am interested in such issues.
>
>One thing I've been wondering about is the compatibility of RISC-style
>caches and Lisp Machine-style forwarding pointers.
>
>On Lisp Machines, the time required to check for invisible forwarding
>pointers is generally hidden in time required for cache operation,
>but I'm unsure of the details. Do the new cache architectures still
>allow this. (I know they don't generally support such snazzy features.
>The question is whether other design decisions directly conflict with
>invisible forwarding.)

Forwarding pointers do not involve the cache on either the Explorer I or the Explorer II. The Explorer I doesn't have a cache; it uses microcode to check the type of the data read and do the indirection if necessary. It's strictly serial. The Explorer II uses hardware in the read path to check the data type and cause the CPU to do the indirection. The major advantage is that it eliminates the inline microcode check. The Explorer II has a cache, but the cache knows nothing about forwarding pointers -- they're just a series of memory references.

The essence of the problem is that the indirection is done based on the type of the data fetched. This is right in the critical path of the memory interface. Plus or minus a bit, the type check can't start until the data could be presented to the CPU. You either have to slow the memory interface by the type-check time, or have a way of saying "I didn't mean it" after making the data available. Incidentally, there is an analogous problem with implementing a read barrier for incremental garbage collection: you'd like to raise an exception if a pointer to old space is read.

>
>Also, is anybody doing any work on coordinating cache operation with
>garbage collection? With a large cache (100 KB or larger), you should
>be able to have a small Newest generation (say, 50 KB) that is scavenged
>completely within the cache -- no thrashing.
>
>Even if the Newest generation won't fit in the cache, you should be
>able to avoid swapping a lot of cache blocks out by telling the cache
>hardware if they only contain garbage. (i.e., force them to appear
>clean, so they won't be written back, or tell the cache to "create"
>a block without actually reading in the corresponding block from
>main memory). It seems that this would cut cache-memory traffic by
>approximately half (?) for the Newest generation.
>
>Anybody out there doing such work?

Bob Courts has a paper in the September 1988 CACM titled "Improving Locality of Reference in a Garbage Collecting Memory Management System". It focuses on the interaction between GC and virtual memory, but I suspect much of the work applies to caches as well.

I strongly suspect that it's not worth worrying about for caches. (Courts shows the significant advantages of worrying about virtual memory.) Unless your average cache lifetimes are long compared to the interval between GC flips, the cache flushes caused by flips will be small compared to those caused by normal running. This Explorer has been up for 236 hours, and generation 0 (the youngest) has flipped 688 times -- averaging a flip every 21 minutes. I couldn't turn up any data on cache lifetimes in a quick look through my bookcase, but I have the impression it's much shorter. (Thought experiment: a 5 MIPS processor with a 128K-byte cache, a 20% load ratio, a 1% miss rate, and uniformly distributed misses will have an average cache life of about 3 seconds.) Obviously, this comparison has a lot of holes in it, but we're an order of magnitude apart.

-- Doug
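The thought experiment's arithmetic can be reproduced, assuming word-sized (4-byte) cache lines -- an assumption of mine, but it is what makes the numbers come out near 3 seconds; larger lines would shorten the figure proportionally.

```c
/* Average cache-line lifetime under uniformly distributed misses:
 * lines in the cache divided by miss-fills per second. */
static double cache_lifetime_s(double ips, double load_ratio,
                               double miss_rate, double cache_bytes,
                               double line_bytes) {
    double misses_per_s = ips * load_ratio * miss_rate;
    double lines = cache_bytes / line_bytes;
    return lines / misses_per_s;
}

/* 5 MIPS * 20% loads * 1% misses = 10,000 misses/s, replacing the
 * cache's 128K/4 = 32,768 lines in about 3.3 seconds -- versus a
 * GC flip every 21 minutes. */
```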
ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi) (02/23/89)
There have been several discussions about special architectures for symbolic languages such as Lisp and Prolog. I used to think that a Lisp-specific architecture was the way to go, and the same for Prolog. This is probably true *IF* you are talking about prototype/research architectures. But if you are going to sell a software/hardware product for (or using) symbolic languages, stock processors seem to be the way to go.

Designing a CPU, memory system, etc., plus an O/S, compiler, interpreter, and environment, takes an awful lot of time and money. And by the time they are ready, companies like MIPS and Motorola (and maybe Intel, now with the N10) have far more powerful processors built with cutting-edge technologies. So, if your software product in a symbolic language is to live more than a couple of years, pick stock hardware that seems likely to live long and prosper, as well as stay competitive in the future (a "scalable" architecture, as Sun calls it?), and build for that. Then, every time they come up with a faster processor, port that compiler of yours, with constant improvements. If you are rebuilding from scratch every time, you cannot fine-tune your stuff. It may even be better to compile into the source language of one of those honed compilers (such as MIPSCo's C) rather than bothering with machine language; you can take advantage of their optimizations... (Not yet, probably....)

==============================================================================
Any opinion expressed here is my own.
------------------------------------------------------------------------------
Ryutarou Ohbuchi	"Life's rock." "Climb now, work later."
ohbuchi@cs.unc.edu <on csnet>
Department of Computer Science, University of North Carolina at Chapel Hill
==============================================================================