chase@Ozona.orc.olivetti.com (David Chase) (11/01/88)
In article <28200219@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>If anyone connected with the ARM wishes to start a meaningful
>discussion, I would probably join in, with questions like:
>did you intend to sell it into the low market, or did you design
>the chip and have it just happen that way? did you make the right
>tradeoffs? etc.? Is there anyone out there that can lead this?

and several people wrote, so I thought I'd provide the gory details
for anyone interested.

I can't say that I'm "connected with the ARM", but I do have a copy of
the CPU software manual, and I've considered what it would be like to
write a "RISC-targeted" compiler for the ARM. A colleague has in fact
written several compilers for the ARM, and Norcroft in England
apparently has a respectable ANSI-C compiler for the machine.

================
Rough machine description

16 32-bit "general" registers, R15 = PC, R14 = link register

There are 9 additional 32-bit registers for use in interrupt
processing; FIQ (fast interrupt) mode has a private copy of registers
10-14, IRQ mode has a private copy of 13 and 14, and SVC mode has a
private copy of 13 and 14. The intention is that a fast interrupt will
be processed with extremely little overhead, since 4 registers are
immediately available without having to be saved first, and (I think)
information from the last FIQ can be stashed in the FIQ registers.

All instructions have a 4 bit conditional execution code; for short
runs of conditional code this is faster than taking a branch. The
conditions test four flags:

  N (negative/signed less-than)
  Z (zero)
  C (carry/not borrow/rotate extend)
  V (overflow)

Memory access is either "sequential" (burst mode) or "non-sequential"
(slower than burst mode). I'm not sure how much of this is supported
by the chip and how much is supported by the memory controller.

All instructions are 32 bits.

----------------
Branch and subroutine call instructions are unsurprising.
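To put a rough number on the conditional-execution claim above, here is
a minimal sketch in Python (not from the manual). It assumes the cycle
costs published further down in this post: an S cycle of 150ns, an N
cycle of 300ns (20MHz divided by 6), a taken branch at 2S 1N, and one S
cycle for any skipped instruction. It also assumes the conditional run
consists only of single-cycle arithmetic instructions. The function
names are made up for illustration.

```python
# Assumed cycle times, derived from the 20MHz clock quoted later in the post.
S_NS = 150                          # sequential (S) cycle
N_NS = 300                          # non-sequential (N) cycle
BRANCH_TAKEN_NS = 2 * S_NS + N_NS   # "branch and call: 2S 1N"
SKIPPED_NS = S_NS                   # "all skipped instructions take one S cycle"


def cost_with_branch(run_length, body_runs):
    """Cost of guarding a run of 1S instructions with a conditional branch."""
    if body_runs:
        # The branch itself is skipped (1S), then the run executes.
        return SKIPPED_NS + run_length * S_NS
    # The branch is taken and the run is jumped over.
    return BRANCH_TAKEN_NS


def cost_conditional(run_length, body_runs):
    """Cost when every instruction in the run carries the condition itself."""
    # Executed 1S arithmetic and skipped instructions both cost one S cycle,
    # so body_runs does not change the total under these assumptions.
    return run_length * S_NS


for k in range(1, 6):
    print(k, cost_conditional(k, False), cost_with_branch(k, False))
```

Under these assumptions, skipping a run of up to three instructions is
cheaper than branching around it, the two break even at four, and when
the run does execute the conditional form saves the one S cycle spent
on the untaken branch.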
----------------
"Arithmetic" operations take the form

  Rd := Rn OP OPERAND

where OPERAND is one of

  1) a zero-extended 8 bit quantity rotated right Shf*2 bits, where
     Shf has 4 bits.
  2) a register, SHIFTed. SHIFT is one of

       SL  0-31
       SLR 1-32
       SAR 1-32
       RR  1-31
       RR1 w/ extension to carry

     (the following take an extra cycle)

       SL  Rs
       SLR Rs
       SAR Rs
       RR  Rs

The shifts and non-extended rotates put the appropriate bit into C.

All operations have a bit that controls setting of the
condition-codes.

The versions of the CPU that I know of lack on-chip multiply or
divide, but the shifted operands and conditional instructions take a
lot of the curse out of that. The indexing modes on the load and store
instructions described below also help with this.

Instructions include

  add with carry, add, and, bit clear, xor, move, move negated, or,
  subtract, reverse subtract, subtract with carry, reverse subtract
  with carry, compare, compare negated, test equal, test complement
  equal

----------------
There are single-register data transfer instructions, with an
astounding array of choices. I'll try to summarize as best I can.

  1) Pre-indexed by constant
       Rn := Rn +/- (12 bit) offset;
       Rd := *Rn; (load)   OR   *Rn := Rd; (store)
  2) Pre-indexed by SHIFTed register (see arithmetic for shift)
       Rn := Rn +/- SHIFT Rm;
       Rd := *Rn; (load)   OR   *Rn := Rd; (store)
  3) Post-indexed by constant
  4) Post-indexed by shifted register

The modification of Rn is optional for pre-indexed modes.

----------------
Block register save/restore instructions (these are pretty
non-surprising)

----------------
================

Note that the following information is somewhat dated (1985); for all
I know the chip may have a multiply instruction now, which would also
up the transistor count a bit. They may have shrunk the design a bit,
too.

Other notable features -- the cpu is dinky -- I believe that it only
uses 25,000 transistors. It is thus also rather cheap, and appropriate
for use as a mega-cell.
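Going back to operand form (1) in the arithmetic section above, the
"8 bit quantity rotated right Shf*2 bits" rule is easier to see with a
small example. This is a minimal sketch in Python, not code from the
manual; the function names (ror32, decode_immediate, encode_immediate)
are hypothetical, and the only thing taken from the description is the
rule itself: an 8 bit value, zero-extended to 32 bits, rotated right by
twice a 4 bit field.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def decode_immediate(imm8, shf):
    """Expand an (8 bit value, 4 bit rotate field) pair to a 32-bit constant."""
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in 8 bits")
    if not 0 <= shf <= 0xF:
        raise ValueError("shf must fit in 4 bits")
    return ror32(imm8, 2 * shf)


def encode_immediate(value):
    """Return some (imm8, shf) that decodes to value, or None if none exists."""
    value &= 0xFFFFFFFF
    for shf in range(16):
        # Rotating left by 2*shf undoes the rotate right done by the decoder.
        imm8 = ror32(value, 32 - 2 * shf)
        if imm8 <= 0xFF:
            return imm8, shf
    return None


print(hex(decode_immediate(0xFF, 4)))   # 0xFF rotated right by 8 bits
print(encode_immediate(0x3FC))          # an 8 bit pattern at an even offset
print(encode_immediate(0x101))          # set bits too far apart for 8 bits
```

The practical consequence for a compiler is that any constant whose
set bits fit in an 8 bit window at an even bit position is available as
an immediate; anything else has to be built in more than one
instruction or loaded from memory.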
The Archimedes machine (based on a 4-chip set which includes the ARM)
runs a killer Basic, for instance (I think there were benchmarks of
this in Byte of October 1987, give or take a month). The memory
controller normally paired with the ARM supports at most 128 pages
(these may be large, however).

As far as trade-offs and design decisions go, it appears that the chip
was intended to be small, cheap, and simple. This is less of an
advantage nowadays with high memory prices and gluttonous software,
but there are other applications for which this might be appropriate.
I would not be surprised if I found it being used in some peripheral
or controller.

As far as compilation goes, there is the Norcroft compiler I mentioned
above, and it seems that the baroque instructions (which remind me
somewhat of microcode) should provide opportunities either for a good
peephole optimizer or a super-optimizer. A version of Modula-2 was
done at the Acorn Research Center*, though I'm not sure if this is
generally available.

* now the Olivetti Software Technology Laboratory

The version of the chip described in the October '85 version of the
CPU software manual was done in 2 micron something-MOS, with a 20MHz
clock. The clock is divided by 3 for non-memory (S) cycles, by 6 for
memory (N) cycles. Published timings are:

                           cycles      notes  ordinary-case time
  arithmetic instructions  1S          * +    150ns
  loads                    2S 1N       * +    600ns
  stores                   2N          *      600ns
  load multiple            (n+1)S 1N   +
  store multiple           (n-1)S 2N
  branch and call          2S 1N              600ns
  software interrupt       2S 1N

  *  adds 1S for SHIFT by register (not constant) amount
  +  adds 1N if PC modified

All skipped instructions take one S cycle.

So, depending upon your instruction mix, you could have expected to
see (in 1985) somewhere between a 6.7 and 1.3 "MIPS" machine, based on
actual instruction counts. (I'm assuming that the PC is rarely
modified by loads, stores, or arithmetic.)
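The timings in the table follow directly from the clock figures, and
the arithmetic can be checked with a few lines of Python. This is a
minimal sketch rather than anything from the manual: all names are
invented for illustration, and the choice of "worst case" (a load whose
index is shifted by a register amount, 2S 1N plus the extra 1S) is an
assumption about where the 1.3 figure comes from, since the post does
not say.

```python
CLOCK_MHZ = 20
TICK_NS = 1000 // CLOCK_MHZ   # 50ns per clock tick
S_NS = 3 * TICK_NS            # clock divided by 3 -> 150ns S cycle
N_NS = 6 * TICK_NS            # clock divided by 6 -> 300ns N cycle


def cycles_ns(s=0, n=0):
    """Time in ns for a given number of S and N cycles."""
    return s * S_NS + n * N_NS


# Ordinary-case times from the table (no register shift, PC untouched).
ORDINARY_NS = {
    "arithmetic": cycles_ns(s=1),
    "load": cycles_ns(s=2, n=1),
    "store": cycles_ns(n=2),
    "branch and call": cycles_ns(s=2, n=1),
}


def load_multiple_ns(nregs):
    """Load multiple: (n+1)S 1N."""
    return cycles_ns(s=nregs + 1, n=1)


def store_multiple_ns(nregs):
    """Store multiple: (n-1)S 2N."""
    return cycles_ns(s=nregs - 1, n=2)


def mips(ns_per_instruction):
    """Instructions per microsecond, i.e. millions per second."""
    return 1000 / ns_per_instruction


best = mips(ORDINARY_NS["arithmetic"])
# Assumed worst case: a load with a register-shifted index (+1S).
worst = mips(ORDINARY_NS["load"] + S_NS)
print(f"{best:.1f} MIPS best case, {worst:.1f} MIPS assumed worst case")
print(load_multiple_ns(5), "ns to load 5 registers in one instruction")
```

The same functions make it easy to try other instruction mixes, for
example weighting loads and stores at 600ns against arithmetic at
150ns.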
Note, too, the substantial effect that the burst-mode memory access
has; a single register load takes 600ns, while 5 registers load in
1200ns (2.5 times as fast per register). Clever loop unrolling wins
really really big there, but this is not the sort of thing that
compilers will usually do for you (certainly not C compilers), even if
you unroll the loops for them. Cleverly crafted libraries, however,
can take advantage of this.

Lacking a cache and all the usual split I/D madness, it is also
possible to generate code on the fly for things like Bit Blit. This
clearly wins, what with the 2:1 advantage in doing constant shifts.
Again, this is a good thing to do in a library.

David
daveh@cbmvax.UUCP (Dave Haynie) (11/02/88)
in article <31748@oliveb.olivetti.com>, chase@Ozona.orc.olivetti.com (David Chase) says:
> Memory access is either "sequential" (burst mode) or "non-sequential"
> (slower than burst mode). I'm not sure how much of this is supported
> by the chip and how much is supported by the memory controller.

As I recall from a similarly dated (circa 1985) discussion on the
hardware, it's the custom memory controller that's managing burst. The
burst fetching works something like a 68030's burst, in that you pay
for one long cycle and get the next three fast. According to what I
read, the ARM chip itself is given the clock modified by the memory
controller. In a non-burst cycle, the chip is physically clocked at
4MHz; for the burst fetches, 8MHz. Pretty 6502ish if you ask me. I
never heard the details on this, but you'd have to assume that [A]
there's nothing like internal pipelining going on, so that this
fast/slow business doesn't slow down other operations in progress, [B]
internal stuff does get hit depending on burst status, or [C] internal
operations work from their own clock, perhaps an always-8 meg clock.

--
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession