[comp.arch] Some ARM data

chase@Ozona.orc.olivetti.com (David Chase) (11/01/88)

In article <28200219@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>If anyone connected with the ARM wishes to start a meaningful 
>discussion, I would probably join in, with questions like:
>did you intend to sell it into the low market, or did you design
>the chip and have it just happen that way? did you make the right
>tradeoffs? etc.?  Is there anyone out there that can lead this?

and several people wrote, so I thought I'd provide the gory details
for anyone interested.

I can't say that I'm "connected with the ARM", but I do have a copy of
the CPU software manual, and I've considered what it would be like to
write a "RISC-targeted" compiler for the ARM.  A colleague has in fact
written several compilers for the ARM, and Norcroft in England
apparently has a respectable ANSI-C compiler for the machine.

================
Rough machine description

16 32-bit "general" registers, R15 = PC, R14 = link register

There are 9 additional 32-bit registers for use in interrupt
processing; FIQ (fast interrupt) mode has a private copy of registers
10-14, IRQ mode has a private copy of 13 and 14, and SVC mode has a
private copy of 13 and 14.  The intention is that a Fast interrupt
will be processed with extremely little overhead, since 4 registers
are immediately available without any register save overhead, and (I
think) information from the last FIQ can be stashed in the FIQ
registers.

All instructions have a 4 bit conditional execution code; for short
runs of conditional code this is faster than taking a branch.  The
conditions test these flags:
N (negative/signed less-than)
Z (zero)
C (carry/not borrow/rotate extend)
V (overflow)
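
As a rough illustration of how a 4-bit condition field could be checked
against those four flags, here is a minimal sketch.  The mnemonic table
(EQ, NE, ... NV) is the standard ARM one and is not taken from the manual
summary above; the function names are made up for this example.

```python
# Sketch: evaluate a 4-bit condition field against the N, Z, C, V flags.
# The condition table is the standard ARM one (an assumption here, since
# the post itself only lists the flags).

CONDITIONS = {
    0x0: ("EQ", lambda n, z, c, v: z),
    0x1: ("NE", lambda n, z, c, v: not z),
    0x2: ("CS", lambda n, z, c, v: c),
    0x3: ("CC", lambda n, z, c, v: not c),
    0x4: ("MI", lambda n, z, c, v: n),
    0x5: ("PL", lambda n, z, c, v: not n),
    0x6: ("VS", lambda n, z, c, v: v),
    0x7: ("VC", lambda n, z, c, v: not v),
    0x8: ("HI", lambda n, z, c, v: c and not z),
    0x9: ("LS", lambda n, z, c, v: (not c) or z),
    0xA: ("GE", lambda n, z, c, v: n == v),
    0xB: ("LT", lambda n, z, c, v: n != v),
    0xC: ("GT", lambda n, z, c, v: (not z) and n == v),
    0xD: ("LE", lambda n, z, c, v: z or n != v),
    0xE: ("AL", lambda n, z, c, v: True),
    0xF: ("NV", lambda n, z, c, v: False),
}


def flags_after_compare(a, b):
    """Return (N, Z, C, V) as set by a 32-bit compare, i.e. a - b."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = bool(result >> 31)
    z = result == 0
    c = a >= b  # carry means "not borrow" for a subtraction
    # Signed overflow: operands differ in sign and the result's sign
    # differs from the first operand's.
    v = bool(((a ^ b) & (a ^ result)) >> 31)
    return n, z, c, v


def condition_passes(cond, flags):
    """True if an instruction carrying this condition field would execute."""
    _name, test = CONDITIONS[cond & 0xF]
    return bool(test(*flags))
```

An instruction whose condition fails is simply skipped, which is what
makes short conditional runs cheaper than a branch around them.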

Memory access is either "sequential" (burst mode) or "non-sequential"
(slower than burst mode).  I'm not sure how much of this is supported
by the chip and how much is supported by the memory controller.

All instructions are 32 bits.
----------------
Branch and subroutine call instructions are unsurprising.
----------------
"Arithmetic" operations take the form

    Rd := Rn OP OPERAND

where OPERAND is one of

1) a zero-extended 8 bit quantity rotated right Shf*2 bits,
   where Shf has 4 bits.

2) a register, SHIFTed.

SHIFT is one of
   SL  0-31
   SLR 1-32
   SAR 1-32
   RR  1-31
   RR1 w/ extension to carry
       (the following take an extra cycle)
   SL  Rs
   SLR Rs
   SAR Rs
   RR  Rs

The shifts and non-extended rotates put the appropriate bit into C.
All operations have a bit that controls setting of the condition-codes.
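
To make operand form 1 concrete, here is a small sketch of which 32-bit
constants fit in "8 bits rotated right by Shf*2".  The function names are
invented for illustration; only the encoding rule comes from the
description above.

```python
MASK32 = 0xFFFFFFFF


def rotate_right(value, amount):
    """Rotate a 32-bit value right by amount bits (amount taken mod 32)."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def decode_immediate(imm8, shf):
    """Operand form 1: zero-extended 8-bit value rotated right Shf*2 bits."""
    assert 0 <= imm8 <= 0xFF and 0 <= shf <= 0xF
    return rotate_right(imm8, shf * 2)


def encode_immediate(constant):
    """Return (imm8, shf) if the constant fits form 1, otherwise None."""
    constant &= MASK32
    for shf in range(16):
        # Rotating by 32 - Shf*2 undoes the rotate right done on decode.
        imm8 = rotate_right(constant, 32 - shf * 2)
        if imm8 <= 0xFF:
            return imm8, shf
    return None
```

For example, 0xFF000000 encodes as (0xFF, 4), while 0x101 and 0x1FE do
not fit: the first spans nine bits and the second sits on an odd bit
boundary.  A compiler would have to build such constants in more than one
instruction or load them from memory.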

The versions of the CPU that I know of lack on-chip multiply or
divide, but the shifted operands and conditional instructions take a
lot of the curse out of that.  The indexing modes on the load and
store instructions described below also help with this.
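
A sketch of why the shifted operand helps: multiplying by a known
constant reduces to a few "Rd := Rn + (Rm SL k)" steps.  This is an
illustration of the general shift-and-add idea, not the output of any
particular ARM compiler, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def multiply_by_constant(x, constant):
    """Multiply x by a non-negative constant, one shifted add per set bit.

    Returns (product mod 2**32, number of add-with-shift steps used).
    """
    acc = 0
    steps = 0
    k = 0
    while constant >> k:
        if (constant >> k) & 1:
            # Models one instruction: acc := acc + (x SL k)
            acc = (acc + (x << k)) & MASK32
            steps += 1
        k += 1
    return acc, steps


def times_ten(x):
    """x * 10 in two operations, using the shifter on both."""
    t = (x + (x << 2)) & MASK32  # t := x + (x SL 2), i.e. 5 * x
    return (t << 1) & MASK32     # result := t SL 1 (a move with a shift)
```

With constant shifts each step costs the same as a plain add, so small
constant multiplies (array indexing, for instance) stay cheap even
without a multiply instruction.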

Instructions include
add with carry, add, and, bit clear, xor, move, move negated, or,
subtract, reverse subtract, subtract with carry, reverse subtract with
carry, compare, compare negated, test equal, test complement equal

----------------
There are single-register data transfer instructions, with an
astounding array of choices.  I'll try to summarize as best I can.

1) Pre-indexed by constant

   Rn := Rn +/- (12 bit) offset;
   Rd := *Rn; (load ) OR *Rn := Rd; (store)

2) Pre-indexed by SHIFTed register (see arithmetic for shift)

   Rn := Rn +/- SHIFT Rm;
   Rd := *Rn; (load ) OR *Rn := Rd; (store)

3) Post-indexed by constant
4) Post-indexed by shifted register

The modification of Rn is optional for pre-indexed modes.
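
The four modes can be modelled in a few lines.  This is only a sketch of
the semantics as summarized above (memory is a dict keyed by address, and
the offset is whatever constant or shifted-register value the instruction
supplies); the names are invented for the example.

```python
MASK32 = 0xFFFFFFFF


def shifted(value, shift_left=0):
    """A register operand shifted left by a constant amount."""
    return (value << shift_left) & MASK32


def load(regs, memory, rd, rn, offset, pre_indexed=True, write_back=False):
    """Model a single-register load.

    Pre-indexed: the address is Rn + offset; Rn is updated only if
    write_back is set.  Post-indexed: the address is the old Rn, and Rn
    is then updated to Rn + offset.
    """
    base = regs[rn]
    target = (base + offset) & MASK32
    if pre_indexed:
        address = target
        if write_back:
            regs[rn] = target
    else:
        address = base
        regs[rn] = target
    regs[rd] = memory[address]


# Walking an array of words with a post-indexed load (offset 4):
memory = {0x100: 11, 0x104: 22, 0x108: 33}
regs = [0] * 16
regs[1] = 0x100
load(regs, memory, rd=0, rn=1, offset=4, pre_indexed=False)
# regs[0] now holds 11 and regs[1] points at the next word, 0x104.
```

Indexing an array by a register is the pre-indexed form with a shifted
offset, e.g. offset=shifted(regs[2], 2) for word-sized elements.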
----------------
Block register save/restore instructions

(these are pretty non-surprising)
----------------
================
Note that the following information is somewhat dated (1985); for all
I know the chip may have a multiply instruction now, which would also
up the transistor count a bit.  They may have shrunk the design a bit,
too.

Other notable features -- the cpu is dinky -- I believe that it only
uses 25,000 transistors.  It is thus also rather cheap, and
appropriate for use as a mega-cell.  The Archimedes machine (based on
a 4-chip set which includes the ARM) runs a killer Basic, for instance
(I think there were benchmarks of this in Byte of October 1987, give
or take a month).  The memory controller normally paired with the ARM
supports at most 128 pages (these may be large, however).

As far as trade-offs and design decisions go, it appears that the chip
was intended to be small, cheap, and simple.  This is less of an
advantage nowadays with high memory prices and gluttonous software,
but there are other applications for which this might be appropriate.
I would not be surprised if I found it being used in some peripheral
or controller.  As far as compilation goes, there is the Norcroft
compiler I mentioned above, and it seems that the baroque instructions
(which remind me somewhat of microcode) should provide opportunities
either for a good peephole optimizer or a super-optimizer.  A version
of Modula-2 was done at the Acorn Research Center*, though I'm not sure
if this is generally available.

* now the Olivetti Software Technology Laboratory

The version of the chip described in the October '85 version of the
CPU software manual was done in 2 micron something-MOS, with a 20MHz
clock.  The clock is divided by 3 for non-memory (S) cycles, by 6 for
memory (N) cycles.  Published timings are:

                                           ordinary-case times

arithmetic instructions          1S    * +     150ns
loads                            2S 1N * +     600ns
stores                              2N *       600ns
load multiple                (n+1)S 1N   +
store multiple               (n-1)S 2N
branch and call                  2S 1N         600ns
software interrupt               2S 1N 

* add 1S for SHIFT by register (not constant) amount
+ add 1N if PC modified
All skipped instructions take one S cycle.

So, depending upon your instruction mix, you could have expected to
see (in 1985) somewhere between a 6.7 and 1.3 "MIPS" machine, based on
actual instruction counts.  (I'm assuming that the PC is rarely
modified by loads, stores, or arithmetic.)
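
The arithmetic behind those figures can be checked with a short sketch.
The cycle times follow from the 20MHz clock and the divide-by-3 and
divide-by-6 figures above; the instruction mix at the end is purely
hypothetical and only shows how one would weight the table.

```python
CLOCK_MHZ = 20
S_NS = 3 * 1000 / CLOCK_MHZ  # clock divided by 3: 150 ns per S cycle
N_NS = 6 * 1000 / CLOCK_MHZ  # clock divided by 6: 300 ns per N cycle

# (S cycles, N cycles) for the ordinary case of each instruction class.
TIMINGS = {
    "arithmetic": (1, 0),
    "load": (2, 1),
    "store": (0, 2),
    "branch": (2, 1),
}


def instruction_ns(s_cycles, n_cycles):
    """Time for one instruction given its S and N cycle counts."""
    return s_cycles * S_NS + n_cycles * N_NS


def mips(ns_per_instruction):
    """Millions of instructions per second at a given time per instruction."""
    return 1000.0 / ns_per_instruction


def average_ns(mix):
    """Weighted average instruction time for a mix of class -> fraction."""
    return sum(frac * instruction_ns(*TIMINGS[name]) for name, frac in mix.items())


# Best case: all arithmetic, 150 ns each, about 6.7 "MIPS".
best = mips(instruction_ns(1, 0))
# A 600 ns instruction with one extra S cycle (the "*" footnote) takes
# 750 ns, about 1.3 "MIPS", which appears to be where the low figure
# comes from.
worst = mips(instruction_ns(3, 1))

# Hypothetical mix, not measured data.
example_mix = {"arithmetic": 0.6, "load": 0.2, "store": 0.1, "branch": 0.1}
```

With that made-up mix the average instruction takes about 330 ns, or
roughly 3 "MIPS".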

Note, too, the substantial effect that the burst-mode memory access
has; a single register load takes 600ns, while 5 registers load in
1200ns (2.5 times as fast per register).  Clever loop unrolling wins
really really big there, but this is not the sort of thing that
compilers will usually do for you (certainly not C compilers), even if
you unroll the loops for them.  Cleverly crafted libraries, however,
can take advantage of this.
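
The per-register saving is easy to tabulate from the published cycle
counts.  A minimal sketch, assuming the 150 ns S cycle and 300 ns N cycle
from the timing table and ignoring the PC-modified case; the function
names are made up here.

```python
S_NS = 150  # sequential cycle
N_NS = 300  # non-sequential cycle


def load_multiple_ns(n):
    """Load multiple of n registers: (n+1) S cycles plus 1 N cycle."""
    return (n + 1) * S_NS + N_NS


def store_multiple_ns(n):
    """Store multiple of n registers: (n-1) S cycles plus 2 N cycles."""
    return (n - 1) * S_NS + 2 * N_NS


def single_loads_ns(n):
    """n separate single-register loads at 2S + 1N each."""
    return n * (2 * S_NS + N_NS)


def speedup_per_register(n):
    """How much faster per register a load multiple is than single loads."""
    return single_loads_ns(n) / load_multiple_ns(n)


for n in (1, 2, 5, 8):
    print(n, load_multiple_ns(n), load_multiple_ns(n) / n, speedup_per_register(n))
```

For n = 5 this gives 1200 ns in total, 240 ns per register, and the 2.5x
figure quoted above.  A block copy built from an 8-register load multiple
and store multiple would move 8 words in 3300 ns by these counts, against
9600 ns for eight single loads and eight single stores.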

Lacking a cache and all the usual split I/D madness, it is also
possible to generate code on the fly for things like Bit Blit.  This
clearly wins, what with the 2:1 advantage in doing constant shifts.
Again, this is a good thing to do in a library.

David

daveh@cbmvax.UUCP (Dave Haynie) (11/02/88)

in article <31748@oliveb.olivetti.com>, chase@Ozona.orc.olivetti.com (David Chase) says:

> Memory access is either "sequential" (burst mode) or "non-sequential"
> (slower than burst mode).  I'm not sure how much of this is supported
> by the chip and how much is supported by the memory controller.

As I recall from a similarly dated (circa 1985) discussion on the hardware,
it's the custom memory controller that's managing burst.  The burst fetching
works something like a 68030's burst, in that you pay for one long cycle and
get the next three fast.  According to what I read, the ARM chip itself is
given the clock modified by the memory controller.  In a non-burst cycle,
the chip is physically clocked at 4MHz; for the burst fetches, 8MHz.  Pretty
6502ish if you ask me.  I never heard the details on this, but you'd have to
assume that [A] there's nothing like internal pipelining going on, so that
this fast/slow business doesn't slow down other operations in progress, [B]
internal stuff does get hit depending on burst status, or [C] internal
operations work from their own clock, perhaps an always-8 meg clock.
-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession