[comp.sys.acorn] 32bit immediate load in ARM code

csuwr@warwick.ac.uk (Derek Hunter) (05/16/91)

I was trying to cut down the number of labels my C compiler produces,
 (having finally allowed the thing to access globals beyond the 4095 range),
 and I (re)invented this:

You can do 	Ldr Rn,VERY_FAR_AWAY	; with:

		Ldr Rn,[PC]
		Ldr Rn,[PC,Rn]
		DCD VERY_FAR_AWAY-P%	; if V_F_A precedes this code

		Ldr Rn,[PC]
		Ldr Rn,[PC,-Rn]		;   (You /can/ do -Rn, can't you?)
		DCD P%-VERY_FAR_AWAY    ; if it doesn't

	. . . and they are still relative addresses cunningly enough.
		(In fact, I think they exceed the addressing space!)

32 bit immediate constants can be read with

		Ldr Rn,[PC]
		Bic Rn,Rn,# (( number >> 28 ) EOR 15) << 28
		DCD number OR &F0000000

            or equivalent, (but Bic impresses people, because no-one knows
                            what it does, and those shift-28s are luvverly).

(If you didn't guess, the DCDs are all -ve and the top nybble is set to &F
 making the DCD value pretend to be an op code with the NeVer condition set
 so the processor can run through it without so much as an `undefined
 instruction', unless someone's been *really* dense and told it to complain
 on UndefinedNV. Blast! That's potentially spoiled my whole strategy!
 Is it OK for top /byte/ = &FF?)
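
The arithmetic of the constant-loading trick can be checked outside the
assembler. The following is a minimal sketch in Python rather than ARM code;
the helper names (dcd_word, bic_mask, load_constant) are made up for
illustration, and it models only the 32-bit arithmetic, not the pipeline or
the NV behaviour.

```python
MASK32 = 0xFFFFFFFF

def dcd_word(number):
    """Word placed after the code: the constant with its top nibble forced to &F."""
    return (number | 0xF0000000) & MASK32

def bic_mask(number):
    """Immediate for the BIC: the top-nibble bits that have to be cleared again."""
    return ((((number & MASK32) >> 28) ^ 15) << 28) & MASK32

def load_constant(number):
    """Model of  Ldr Rn,[PC]  followed by  Bic Rn,Rn,#mask."""
    rn = dcd_word(number)                     # Ldr Rn,[PC]
    return rn & ~bic_mask(number) & MASK32    # Bic Rn,Rn,#mask

for n in (0, 0x12345678, 0x0FFFFFFF, 0xDEADBEEF, 0xFFFFFFFF):
    assert load_constant(n) == n
    assert dcd_word(n) >> 28 == 0xF   # the data word carries the NV condition
```

Under these assumptions every 32-bit value round-trips, because the BIC
clears exactly those top-nibble bits that the OR with &F0000000 set.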

I don't know whether this is well known and/or despised, but it seemed like
 quite a good trick at the time.

My main point of interest is this: On an ARM 3, would the DCD be read into
 cache in an s cycle during the final stage of the Ldr Rn,[PC], or does it
 take an n cycle all of its very own?

                   Is this nice on an ARM 3's cache?

                           Is it nice at all?

                   Was this the intentional use of NV?

                           Has Acorn used it?     

                 Is the UndefinedNV really a problem?

   Will this latter be supported in future releases of the hardware?


	- For a full list of questions pertaining to Life, the Universe
	  and rugby socks, write to oracle@iuvax.cs.indiana.edu with
	  `help' in the subject line.

nbvs@cl.cam.ac.uk (Nicko van Someren) (05/16/91)

In article <+|Q_L||@warwick.ac.uk> csuwr@warwick.ac.uk (Derek Hunter) writes:
>My main point of interest is this: On an ARM 3, would the DCD be read into
> cache in an s cycle during the final stage of the Ldr Rn,[PC], or does it
> take an n cycle all of its very own?
>
>                   Is this nice on an ARM 3's cache?
>
>                           Is it nice at all?
>
>                   Was this the intentional use of NV?
>
>                           Has Acorn used it?     
>
>                 Is the UndefinedNV really a problem?
>
>   Will this latter be supported in future releases of the hardware?

Indeed, on an ARM3 the constant will get cached because the cache always reads
in 'lines' of four words.  Since, by the time the LDR instruction is being
executed, the other end of the pipeline will be being loaded with the NeVer
instruction, the data must be in the cache and no extra external cycles will
happen on an
ARM3.  IMHO it is nice on a cache, nice at all, a useful use of NV but I don't
know if Acorn have used it. I think if you read the chip spec carefully
you will find aborts can only occur on instructions with valid condition codes.
Having said that, I bet they will remove the NV option on future chips just to
wind us up. After all, if you look at the statistics compilers never use that
option so they might as well take it out!

Nicko
+-----------------------------------------------------------------------------+
| Nicko van Someren, nbvs@cl.cam.ac.uk, (44) 223 358707 or (44) 860 498903    |
+-----------------------------------------------------------------------------+

Gavin.Flower@comp.vuw.ac.nz (Gavin Flower) (05/17/91)

Have read in several places,
that the condition code NV
is *NOT* forward compatible.

Specifically that Acorn have
reserved the right to reuse
it for something else, at
some future stage!

To the best of my knowledge
it is still ok with ARM3.

****could someone at Acorn
confirm this, also could
they suggest what it might 
be earmarked for?

-Gavin

-- 
The main "user" of well brought up, and educated, children is the community
at large.  So if you really believe in "user pays", charge the correct users
- stop overloading parents with financial penalties.
******* These comments have no known correlation with dept. policy! *******

kers@hplb.hpl.hp.com (Chris Dollin) (05/17/91)

Derek Hunter gives a cunning trick for loading far globals and big constants:

   You can do 	Ldr Rn,VERY_FAR_AWAY	; with:

		   Ldr Rn,[PC]
		   Ldr Rn,[PC,Rn]
		   DCD VERY_FAR_AWAY-P%	; if V_F_A precedes this code

I'd generate

    LDR Rn, nnn[PC]	(or however the s*d assembler notates it)
    LDR Rm, mmm[Rn]

where nnn[PC] holds a pointer to the item you need. Of course this item is
reusable for each far reference (so long as you stay ``near enough'' in the
code). It's not relocatable; if it were, you'd have to do

    LDR Rm,[PC,Rn]

as Derek does. However, if you *can* get away with fixed (or self-updating)
code, then the former method allows you to group far globals together and
access them using the one pointer (perhaps one per procedure).

Derek adds:

   32 bit immediate constants can be read with

		   Ldr Rn,[PC]
		   Bic Rn,Rn,# (( number >> 28 ) EOR 15) << 28
		   DCD number OR &F0000000

I'd definitely use an out-of-line constant in a table at the procedure head. It
reduces the code to one instruction at the cost of a store reference (which I
think Derek's costs in any case, but I'm not familiar enough with the timing
details), and the constant is sharable.

Of course, if you're prepared to spend 3 instructions on building the
constant, then anything composable from 3 8-bit fields can be done by:

    MOV Rn, #XXX
    ORR Rn, Rn, #YYY
    ORR Rn, Rn, #ZZZ

with XXX, YYY, ZZZ being suitable components of the full value. Do Acorn's
object formats permit this to be expressed when the full value is some constant
not known until link time?
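
As a rough illustration of the MOV/ORR approach, the Python sketch below
splits a constant into pieces that each fit an ARM data-processing immediate
(an 8-bit value rotated right by an even amount). The function names are
hypothetical, and the greedy split ignores fields that wrap around bit 31, so
it is correct but not always the shortest possible sequence.

```python
MASK32 = 0xFFFFFFFF

def is_arm_immediate(value):
    """True if value is an 8-bit quantity rotated right by an even amount."""
    value &= MASK32
    for rot in range(0, 32, 2):
        # Rotating left by rot undoes a rotate right by rot.
        rotated = ((value << rot) | (value >> (32 - rot))) & MASK32
        if rotated < 256:
            return True
    return False

def split_into_immediates(value):
    """Greedy split: take 8-bit fields starting at the lowest set bit (even-aligned)."""
    value &= MASK32
    parts = []
    while value:
        low = (value & -value).bit_length() - 1   # index of the lowest set bit
        low &= ~1                                  # round down to an even position
        chunk = value & ((0xFF << low) & MASK32)
        parts.append(chunk)
        value &= ~chunk & MASK32
    return parts

def emit(value, reg="Rn"):
    """Return the MOV/ORR sequence as text, one instruction per field."""
    parts = split_into_immediates(value) or [0]
    lines = ["MOV %s, #&%X" % (reg, parts[0])]
    lines += ["ORR %s, %s, #&%X" % (reg, reg, p) for p in parts[1:]]
    return lines

print("\n".join(emit(0x00FF00FF)))
```

A value made of three byte-sized fields, such as &FFFFFF, comes out as one MOV
and two ORRs; an arbitrary 32-bit value may need a fourth instruction.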

--

Regards, Kers.      | "You're better off  not dreaming of  the things to come;
Caravan:            | Dreams  are always ending  far too soon."

athomas (Alasdair Thomas) (05/17/91)

 Derek Hunter (csuwr@warwick.ac.uk) asks:

>                 Was this the intentional use of NV?
>                 Is the UndefinedNV really a problem?

 The intention is to redefine the NV class of instructions in the future to
enhance ARM's instruction set - there are no firm plans yet as to how that
instruction space will be redefined, but if you wish your code to run on
future generations of ARM, you should _not_ use the NV instructions in your
code.
 [Note: It is recommended that the instruction "MOV R0,R0" be used as a
general purpose NOP.]

 Whilst on the topic of programming the ARM, the document below summarises
the instruction sequences to be avoided.

 -- Alasdair Thomas
     Advanced RISC Machines Ltd.

 ********************************************************************
 
                 IMPORTANT RULES FOR ARM CODE WRITERS
                  ====================================
  Date: 17/5/91
 Issue: 2.5

 Every effort has been made to ensure that the information in this document
is true and correct at the date of issue. Products described in this
document, however, are subject to continuous development and improvements
and Advanced RISC Machines Ltd (and other contributors) reserve the right to
change their specifications at any time. Advanced RISC Machines Ltd cannot
accept liability for any loss or damage arising from the use of any
information or particulars in this document.


                          ================
                          = Introduction =
                          ================
 The ARM processor family uses Reduced Instruction Set (RISC) techniques to
maximise performance; as such, the instruction set allows some instructions
and code sequences to be constructed that will give rise to unexpected (and
potentially erroneous) results. These cases must be avoided by all machine
code writers and generators if correct program operation across the whole
range of ARM processors is to be obtained.

 In order to be upwards compatible with future versions of the ARM processor
family NEVER use any of the undefined instruction formats: both those shown
in the manual as "Undefined" which the processor traps AND those which are
not shown in the manual and which don't trap (for example a Multiply
instruction where bit 5 or 6 of the instruction is set). In addition the
"NV" (never executed) instruction class should not be used [It is
recommended that the instruction "MOV R0,R0" be used as a general purpose
NOP].

 This document lists the instruction code sequences to be avoided. It is
*STRONGLY* recommended that you take the time to familiarise yourself with
these cases because some will only fail under particular circumstances which
may not arise during testing.


              ============================================
              = Instructions and code sequences to avoid =
              ============================================
 The instructions and code sequences are split into a number of categories.
Each category starts with a recommendation or warning, and indicates which
of the two main ARM variants (ARM2, ARM3) it applies to. The text then goes
on to explain the conditions in more detail and to supply examples where
appropriate.

 Unless a program is being targeted SPECIFICALLY for a single version of the
ARM processor family, all of these recommendations should be adhered to.


1) TSTP/TEQP/CMPP/CMNP: Changing mode
-------------------------------------
  ####################################################################
  # When the processor's mode is changed by altering the mode bits   #
  # in the PSR using a data processing operation, care must be taken #
  # not to access a banked register (R8-R14) in the following        #
  # instruction. Accesses to the unbanked registers (R0-R7,R15) are  #
  # safe.                                                            #
  ####################################################################
  # Applicability: ARM2                                              #
  ####################################################################

 The following instructions are affected, but note that mode changes can
only be made when the processor is in a non-user mode:-

   TSTP Rn,<Op2>
   TEQP Rn,<Op2>
   CMPP Rn,<Op2>
   CMNP Rn,<Op2>

 These are the only operations that change all the bits in the PSR
(including the mode bits) without affecting the PC. (An instruction that
writes the PC forces a pipeline refill, during which time the register bank
select logic settles.)

e.g. Assume processor starts in Supervisor mode in each case:-

 a)  TEQP PC,#0
     MOV  R0,R0            SAFE: NOP added between mode change and access
     ADD  R0,R1,R13_usr          to a banked register (R13_usr).

 b)  TEQP PC,#0
     ADD  R0,R1,R2         SAFE: No access made to a banked register

 c)  TEQP PC,#0
     ADD  R0,R1,R13_usr    *FAILS*: Data NOT read from Register R13_usr!

 The safest default is always to add a NOP (e.g. MOV R0,R0) after a mode
changing instruction; this will guarantee correct operation regardless of
the code sequence that follows it.



2) LDM/STM: Forcing transfer of the user bank (Part 1)
------------------------------------------------------
  ###################################################################
  # Don't use write back when forcing user bank transfer in LDM/STM #
  ###################################################################
  # Applicability: ARM2,ARM3                                        #
  ###################################################################

 For STM instructions the S bit is redundant as the PSR is always stored
with the PC whenever R15 is in the transfer list. In user mode programs the
S bit is ignored, but in other modes it has a second interpretation. S=1 is
used to force transfers to take values from the user register bank instead
of the current register bank. This is useful for saving the user state on
process switches.
 Similarly, in LDM instructions the S bit is redundant if R15 is not in the
transfer list. In user mode programs, the S bit is ignored, but in non-user
mode programs where R15 is not in the transfer list, S=1 is used to force
loaded values to go to the user registers instead of the current register
bank.
 In both cases where the processor is in a non-user mode and transfer
to/from the user bank is forced by setting the S bit, write back of the base
will also be to the user bank though the base will be fetched from the
current bank. Therefore don't use write back when forcing user bank transfer
in LDM/STM.

e.g. In all cases, the processor is assumed to be in a non-user mode and
     <Rlist> is assumed not to include R15:-

     STMxx Rn!,<Rlist>   SAFE: Storing non-user registers with write back to
                               the non-user base register

     LDMxx Rn!,<Rlist>   SAFE: Loading non-user registers with write back to
                               the non-user base register

     STMxx Rn,<Rlist>^   SAFE: Storing user registers, but no base
                               write-back

     STMxx Rn!,<Rlist>^  *FAILS*: Base fetched from non-user register, but
                                   written back into user register

     LDMxx Rn!,<Rlist>^  *FAILS*: Base fetched from non-user register, but
                                   written back into user register




3) LDM: Forcing transfer of the user bank (Part 2)
--------------------------------------------------
  ######################################################################
  # When loading user bank registers with an LDM in a non-user mode,   #
  # care must be taken not to access a banked register (R8-R14) in the #
  # following instruction. Accesses to the unbanked registers          #
  # (R0-R7,R15) are safe.                                              #
  ######################################################################
  # Applicability: ARM2,ARM3                                           #
  ######################################################################

 Because the register bank switches from user mode to non-user mode during
the first cycle of the instruction following an "LDM Rn,<Rlist>^", an
attempt to access a banked register in that cycle may cause the wrong
register to be accessed.

e.g. In all cases, the processor is assumed to be in a non-user mode and
     <Rlist> is assumed not to include R15:-

   LDM Rn,<Rlist>^
   ADD R0,R1,R2             SAFE: Access to unbanked registers after LDM^

   LDM Rn,<Rlist>^
   MOV R0,R0                SAFE: NOP inserted before banked register used
   ADD R0,R1,R13_svc              following an LDM^

   LDM Rn,<Rlist>^
   ADD R0,R1,R13_svc        *FAILS*: Accessing a banked register immediately
                                     after an LDM^ returns the wrong data!

   ADR   R14_svc, saveblock
   LDMIA R14_svc, {R0 - R14_usr}^
   LDR   R14_svc, [R14_svc,#15*4]   *FAILS*: Banked base register (R14_svc)
   MOVS   PC, R14_svc                        used immediately after the LDM^

   ADR   R14_svc, saveblock
   LDMIA R14_svc, {R0 - R14_usr}^
   MOV   R0,R0                      SAFE: NOP inserted before banked
   LDR   R14_svc, [R14_svc,#15*4]         register (R14_svc) used
   MOVS   PC, R14_svc


NOTE:
 The ARM2 and ARM3 processors *usually* give the expected result, but cannot
be guaranteed to do so under all circumstances. Therefore this code sequence
should be avoided in future.



4) SWI/Undefined Instruction trap interaction
---------------------------------------------
  ######################################################################
  # Care must be taken when writing an undefined instruction handler   #
  # to allow for an unexpected call from a SWI instruction.            #
  # The erroneous SWI call should be intercepted and redirected to the #
  # software interrupt handler                                         #
  ######################################################################
  # Applicability: ARM2                                                #
  ######################################################################

 The implementation of the CDP instruction on ARM2 causes a Software
Interrupt (SWI) to take the Undefined Instruction trap if the SWI was the
next instruction after the CDP.
e.g.
    SIN F0,F1
    SWI &11        *FAILS*: ARM2 will take the undefined instruction trap
                            instead of software interrupt trap.

 All Undefined Instruction handler code should check the failed instruction
to see if it is a SWI, and if so pass it over to the software interrupt
handler.
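
 The check described above can be sketched as follows. This is a Python model
rather than handler code; the function names are hypothetical, and how a real
handler locates the failed instruction word is deliberately left out. It
relies only on the SWI encoding having bits 27:24 all set.

```python
def is_swi(instruction):
    """A SWI has bits 27:24 = 1111; the low 24 bits are the SWI number."""
    return ((instruction >> 24) & 0xF) == 0xF

def undefined_instruction_trap(instruction, swi_handler, undefined_handler):
    """Pass a misdirected SWI on to the software interrupt handler."""
    if is_swi(instruction):
        return swi_handler(instruction & 0x00FFFFFF)
    return undefined_handler(instruction)

# SWI &11 with the AL condition, and an arbitrary coprocessor data
# operation word (bits 27:24 = 1110) standing in for a genuine trap.
result_swi = undefined_instruction_trap(
    0xEF000011, lambda n: ("swi", n), lambda i: ("undefined", i))
result_cdp = undefined_instruction_trap(
    0xEE000100, lambda n: ("swi", n), lambda i: ("undefined", i))
```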




5) Undefined instruction/Prefetch abort trap interaction
--------------------------------------------------------
  ######################################################################
  # Care must be taken when writing the Prefetch abort trap handler to #
  # allow for an unexpected call due to an undefined instruction       #
  ######################################################################
  # Applicability: ARM2,ARM3                                           #
  ######################################################################

 When an undefined instruction is fetched from the last word of a page,
where the next page is absent from memory, the undefined instruction will
cause the undefined instruction trap to be taken, and the following
(aborted) instructions will cause a prefetch abort trap. One might expect
the undefined instruction trap to be taken first, then the return to the
succeeding code will cause the abort trap. In fact the prefetch abort has a
higher priority than the undefined instruction trap, so the prefetch abort
handler is entered _before_ the undefined instruction trap, indicating a
fault at the address of the undefined instruction (which is in a page which
is actually present). A normal return from the prefetch abort handler (after
loading the absent page) will cause the undefined instruction to execute and
take the trap correctly. However the indicated page is already present, so
the prefetch abort handler may simply return control, causing an infinite
loop to be entered.
 Therefore, the prefetch abort handler should check whether the indicated
fault is in a page which is actually present. If so, the above condition
must be present and so control should be passed to the undefined instruction
handler. This will restore the expected sequential nature of the execution
sequence; a normal return from the undefined instruction handler will cause
the next instruction to be fetched (which will abort), the prefetch abort
handler will be reentered (with an address pointing to the absent page), and
execution can proceed normally.



                       ========================
                       = Other points to note =
                       ========================

 This section highlights some obscure cases of ARM operation which should be
borne in mind when writing code.

1) Use of R15
-------------
 *************************************************************************
 * WARNING: When the PC is used as a destination, operand, base or shift *
 *          register, different results will be obtained depending on    *
 *          the instruction and the exact usage of R15                   *
 *************************************************************************
 * Applicability: ARM2,ARM3                                              *
 *************************************************************************

  Full details of the value derived from or written into R15+PSR for each
instruction class are given in the datasheet. Care must be taken when using
R15 because small changes in the instruction can yield significantly
different results.

e.g. Consider data operations of the type:-
                <opcode>{cond}{S} Rd,Rn,Rm
           or   <opcode>{cond}{S} Rd,Rn,Rm,<shiftname> Rs
    a) When R15 is used in the Rm position, it will give the value of the PC
       together with the PSR flags.
    b) When R15 is used in the Rn or Rs positions, it will give the value of
       the PC without the PSR flags (PSR bits replaced by zeros).

    MOV R0,#0
    ORR R1,R0,R15   ; R1:=PC+PSR  (bits 31:26,1:0 reflect PSR flags)
    ORR R2,R15,R0   ; R2:=PC      (bits 31:26,1:0 set to zero)
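
The difference between the two operand positions can be modelled with bit
masks. This is a Python sketch under the 26-bit layout described above (PSR
in bits 31:26 and 1:0, PC in bits 25:2); the function names are illustrative,
and the argument simply stands for whatever R15 holds at that point.

```python
PSR_MASK = 0xFC000003   # bits 31:26 (flags, I, F) and 1:0 (mode)
PC_MASK  = 0x03FFFFFC   # bits 25:2

def r15_as_rm(r15):
    """R15 in the Rm position: PC together with the PSR bits."""
    return r15 & 0xFFFFFFFF

def r15_as_rn_or_rs(r15):
    """R15 in the Rn or Rs position: PSR bits read as zeros."""
    return r15 & PC_MASK

r15 = 0x80000000 | 0x00008008 | 0x3   # N flag set, PC = &8008, mode bits = 11
r1 = 0 | r15_as_rm(r15)               # ORR R1,R0,R15 with R0 = 0
r2 = r15_as_rn_or_rs(r15) | 0         # ORR R2,R15,R0 with R0 = 0
```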

NOTE:
 The relevant instruction description in the ARM datasheets should be
consulted for full details of the behaviour of R15.


2) STM: Inclusion of the base in the register list
--------------------------------------------------
 ***********************************************************************
 * WARNING: In the case of a STM with writeback that includes the base *
 *          register in the register list, the value of the base       *
 *          register stored depends upon its position in the register  *
 *          list                                                       *
 ***********************************************************************
 * Applicability: ARM2,ARM3                                            *
 ***********************************************************************

 During a STM, the first register is written out at the start of the second
cycle of the instruction. When writeback is specified, the base is written
back at the end of the second cycle. A STM which includes storing the base
with the base as the first register to be stored will therefore store the
unchanged value, whereas with the base second or later in the transfer
order, it will store the modified value.

 e.g.
    MOV   R5,#&1000
    STMIA R5!,{R5-R6}  ;  Stores value of R5=&1000

    MOV   R5,#&1000
    STMIA R5!,{R4-R5}  ;  Stores value of R5=&1008
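
 The rule can be modelled in a few lines of Python. This is only a sketch of
the behaviour as described in this section (writeback taking effect after the
first register has been stored); the function name and the dictionary
representation of the registers are assumptions made for illustration.

```python
def stmia_writeback(regs, base, reglist):
    """Model STMIA Rbase!,{reglist}; regs maps register number to value."""
    order = sorted(reglist)            # lowest-numbered register goes first
    address = regs[base]
    final_base = (regs[base] + 4 * len(order)) & 0xFFFFFFFF
    memory = {}
    for i, r in enumerate(order):
        if i == 1:
            regs[base] = final_base    # writeback has happened by the second store
        memory[address] = regs[r]
        address += 4
    regs[base] = final_base
    return memory

regs_a = {4: 0x44, 5: 0x1000, 6: 0x66}
mem_a = stmia_writeback(regs_a, 5, [5, 6])   # base first: stores &1000

regs_b = {4: 0x44, 5: 0x1000, 6: 0x66}
mem_b = stmia_writeback(regs_b, 5, [4, 5])   # base second: stores &1008
```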




3) MUL/MLA: Register restrictions
---------------------------------
  ****************************************************
  *   Given  MUL Rd,Rm,Rs                            *
  *      or  MLA Rd,Rm,Rs,Rn                         *
  *                                                  *
  *   Then   Rd & Rm must be different registers     *
  *               Rd must not be R15                 *
  ****************************************************
  * Applicability: ARM2,ARM3                         *
  ****************************************************

 Due to the way that Booth's algorithm has been implemented, certain
combinations of operand registers should be avoided. (The assembler will
issue a warning if these restrictions are overlooked.)
 The destination register (Rd) should not be the same as the Rm operand
register, as Rd is used to hold intermediate values and Rm is used
repeatedly during the multiply. A MUL will give a zero result if Rm=Rd, and
a MLA will give a meaningless result.
 The destination register (Rd) should also not be R15. R15 is protected from
modification by these instructions, so the instruction will have no effect,
except that it will put meaningless values in the PSR flags if the S bit is
set.
 All other register combinations will give correct results, and Rd, Rn and
Rs may use the same register when required.
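
 A code generator could apply these restrictions with a check along the
following lines. This is a Python sketch; check_multiply is a hypothetical
helper, not part of any assembler, and it covers only the two rules stated in
this section.

```python
def check_multiply(rd, rm):
    """Return a list of rule violations for MUL/MLA with the given Rd and Rm."""
    problems = []
    if rd == rm:
        problems.append("Rd and Rm must be different registers")
    if rd == 15:
        problems.append("Rd must not be R15")
    return problems

assert check_multiply(1, 2) == []   # MUL R1,R2,Rs is fine
```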



4) LDM/STM: Address Exceptions
------------------------------
  ************************************************************************
  * WARNING: Illegal addresses formed during a LDM or STM operation will *
  *          not cause an address exception                              *
  ************************************************************************
  * Applicability: ARM2,ARM3                                             *
  ************************************************************************

 Only the address of the first transfer of a LDM or STM is checked for an
address exception; if subsequent addresses over- or under-flow into illegal
address space they will be truncated to 26 bits but will not cause an
address exception trap.

e.g. Assume processor is in a non-user mode & MEMC being accessed:-
       {these examples are very contrived}

    MOV R0,#&04000000  ; R0=&04000000
    STMIA R0,{R1-R2}   ; Address exception reported (base address illegal)

    MOV R0,#&04000000
    SUB R0,R0,#4       ; R0=&03FFFFFC
    STMIA R0,{R1-R2}   ; No address exception reported (base address legal)
                       ; code will overwrite data at address &00000000

NOTE:
 The exact behaviour of the system depends upon the memory manager to which
the processor is attached; in some cases, the wraparound may be detected and
the processor aborted.
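
 The address sequence in the two examples can be reproduced with a small
Python model. It is a sketch only: the names are invented, and it assumes
exactly what the text states, namely that the first address alone is checked
and later addresses are truncated to 26 bits. It does not model any check a
memory manager might add.

```python
ADDRESS_MASK = 0x03FFFFFF   # 26-bit address space

class AddressException(Exception):
    pass

def stm_addresses(base, count):
    """Addresses used by an STMIA of count registers starting at base."""
    if base & ~ADDRESS_MASK & 0xFFFFFFFF:
        raise AddressException("base address &%08X is illegal" % base)
    # Only the first address was checked; the rest silently wrap.
    return [(base + 4 * i) & ADDRESS_MASK for i in range(count)]

wrapped = stm_addresses(0x03FFFFFC, 2)   # second word lands at &00000000
```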



5) LDC/STC: Address Exceptions
------------------------------
  ************************************************************************
  * WARNING: Illegal addresses formed during a LDC or STC operation will *
  *          not cause an address exception (affects LDF/STF)            *
  ************************************************************************
  * Applicability: ARM2,ARM3                                             *
  ************************************************************************

 The coprocessor data transfer operations act like STM and LDM with the
processor generating the addresses and the coprocessor supplying/reading the
data. As with LDM/STM, only the address of the first transfer of an LDC or
STC is checked for an address exception; if subsequent addresses over- or
under-flow into illegal address space they will be truncated to 26 bits but
will not cause an address exception trap.
 Note that the floating point LDF/STF instructions are forms of LDC & STC!

e.g. Assume processor is in a non-user mode & MEMC being accessed:-
       {these examples are very contrived}

    MOV R0,#&04000000  ; R0=&04000000
    STC CP1,CR0,[R0]   ; Address exception reported (base address illegal)

    MOV R0,#&04000000
    SUB R0,R0,#4       ; R0=&03FFFFFC
    STFD F0,[R0]       ; No address exception reported (base address legal)
                       ; code will overwrite data at address &00000000

NOTE:
 The exact behaviour of the system depends upon the memory manager to which
the processor is attached; in some cases, the wraparound may be detected and
the processor aborted.



6) LDC: Data transfers to a coprocessor fetch more data than expected
---------------------------------------------------------------------
 ***************************************************************************
 * Data to be transferred to a coprocessor with the LDC instruction should *
 * never be placed in the last word of an addressable chunk of memory, nor *
 * in the word of memory immediately preceding a read-sensitive memory     *
 * location                                                                *
 ***************************************************************************
 * Applicability: ARM3                                                     *
 ***************************************************************************

 Due to the pipelining introduced into the ARM3 coprocessor interface, an
LDC operation will cause one extra word of data to be fetched from the
internal cache or external memory by ARM3 and then discarded; if the extra
data is fetched from an area of external memory marked as cacheable, a whole
line of data will be fetched and placed in the cache.
 A particular case in point is that an LDC whose data ends at the last word
of a memory page will load and then discard the first word (and hence the
first cache line) of the next page. A minor effect of this is that it may
occasionally cause an unnecessary page swap in a virtual memory system. The
major effect of it is that (whether in a virtual memory system or not), the
data for an LDC should never be placed in the last word of an addressable
chunk of memory: the LDC will attempt to read the immediately following
non-existent location and thus produce a memory fault.

e.g. Assume processor is in a non-user mode, FPU hardware attached and MEMC
     being accessed:-
       {this example is very contrived}

    MOV  R13,#&03000000 ; R13=Address of I/O space
    STFD F0,[R13,#-8]!  ; Store F.P. register 0 at top of physical memory
                        ; (two words of data transferred)
    LDFD F1,[R13],#8    ; Load F.P. register 1 from top of physical memory
                        ; but THREE words of data are transferred, and the
                        ; third access will read from I/O space which may be
                        ; read sensitive!  *** BEWARE ***

john@acorn.co.uk (John Bowler) (05/17/91)

In article <+|Q_L||@warwick.ac.uk> csuwr@warwick.ac.uk (Derek Hunter) writes:
>I was trying to cut down the number of labels my C compiler produces,
> (having finally allowed the thing to access globals beyond the 4095 range),
> and I (re)invented this:
>
>You can do 	Ldr Rn,VERY_FAR_AWAY	; with:
>
>		Ldr Rn,[PC]
>		Ldr Rn,[PC,Rn]
>		DCD VERY_FAR_AWAY-P%	; if V_F_A precedes this code
>
>		Ldr Rn,[PC]
>		Ldr Rn,[PC,-Rn]		;   (You /can/ do -Rn, can't you?)

Yes, this is valid.

>		DCD P%-VERY_FAR_AWAY    ; if it doesn't
>
>	. . . and they are still relative addresses cunningly enough.
>		(In fact, I think they exceed the addressing space!)

This is three instructions to read a given memory location with an offset
of up to +/- 28 bits (4 bits set to F to give the NV condition code).  How
about:-

	ADD	Rn, PC, #x LSL 12	; PC without PSR bits, 8 bit constant
	ADD	Rn, Rn, #y LSL 20
	LDR	Rn, [Rn, #z]		; 12 bit offset

which gives a 28 bit (positive) PC offset, similarly using SUB will allow
generation of a negative offset.  This sequence of instructions has the
advantage that it is faster (it avoids one LDR) on non-ARM3 machines.  Also,
some offsets can be represented more efficiently - in particular an
arbitrary 20 bit offset only requires two instructions.
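
The way a positive offset divides among the three instructions can be
sketched in Python. The function and field names below are invented for
illustration, and the offset is taken to be measured from the value the PC
reads as in the first emitted instruction; a real code generator would fold
in the usual pipeline adjustment itself.

```python
def split_offset(offset):
    """Split a positive 28-bit offset into two 8-bit ADD fields and a 12-bit LDR field."""
    if not 0 <= offset < (1 << 28):
        raise ValueError("offset does not fit in 28 bits")
    low12 = offset & 0xFFF            # LDR immediate offset
    mid8 = (offset >> 12) & 0xFF      # first ADD, shifted left by 12
    high8 = (offset >> 20) & 0xFF     # second ADD, shifted left by 20
    return mid8, high8, low12

def instructions(offset):
    """Emit the sequence as text, dropping any ADD whose field is zero."""
    mid8, high8, low12 = split_offset(offset)
    lines, src = [], "PC"
    if mid8:
        lines.append("ADD Rn, %s, #&%X" % (src, mid8 << 12))
        src = "Rn"
    if high8:
        lines.append("ADD Rn, %s, #&%X" % (src, high8 << 20))
        src = "Rn"
    lines.append("LDR Rn, [%s, #&%X]" % (src, low12))
    return lines

print("\n".join(instructions(0x1234567)))
```

Any offset below 2^20 leaves the upper field zero, which is why such offsets
need only two instructions.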

Notice that *neither* approach allows link time relocation - apparently
Derek's algorithm would do this, but, in practice, the compiler would not
know whether the value was positive or negative, so could not generate the
correct LDR instruction.  (This is all because of a deficiency in the
current AOF and A.OUT object module formats, which do not allow the
appropriate relocation forms for the instructions which would be needed.)

>
>32 bit immediate constants can be read with
>
>		Ldr Rn,[PC]
>		Bic Rn,Rn,# (( number >> 28 ) EOR 15) << 28
>		DCD number OR &F0000000
>
>            or equivalent, (but Bic impresses people, because no-one knows
>                            what it does, and those shift-28s are luvverly).

Again, three instructions will generate any +/-24 bit constant (obviously),
and (additionally) a very large number of the others.  (Possibly even all
of them, given that there are 36 immediate value bits in the three
instructions, plus quite a lot of bits corresponding to the selection of
different alu instruction types).

>My main point of interest is this: On an ARM 3, would the DCD be read into
> cache in an s cycle during the final stage of the Ldr Rn,[PC], or does it
> take an n cycle all of its very own?

The cache is mixed instruction + data, so the LDR causes no memory access
other than that which occurs as a result of the instruction loading.

>                   Is this nice on an ARM 3's cache?

Yes.

>                           Is it nice at all?

Hum.  See below.

>                   Was this the intentional use of NV?

I don't think so.  We have used NV when patching binaries (to remove
instructions we don't want :-)) and have recommended its use as a NOOP
(after processor mode changes for example).

>                           Has Acorn used it?     

Not in this way as far as I know.

>                 Is the UndefinedNV really a problem?

Currently anythingNV is ignored; for example co-processors don't see the
instruction, no instruction decoding takes place (I think...)

>   Will this latter be supported in future releases of the hardware?

NV instructions are hardly ever used.  There is a lot of pressure on the
instruction space; there are very few slots left in it, yet NV accounts for
1/16 of all the instructions in the ARM instruction set!  It seems to me
that it is very likely that future developments will use NV instructions in
some way, which would cause the above to cease to work.  Given that the
actual advantage of the suggested code is very small (at most one extra
instruction for some very rarely used 32 bit constants) it is probably worth
avoiding.
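
The 1/16 figure comes from the 4-bit condition field in the top nybble of every ARM instruction word. A small Python sketch of that decoding follows; the function name is made up for illustration.

```python
# Condition mnemonics indexed by the top four bits of an instruction word.
CONDITIONS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
              "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"]

def condition_of(word):
    """Return the condition mnemonic encoded in bits 31..28 of a 32-bit word."""
    return CONDITIONS[(word >> 28) & 0xF]
```

Any word with a top nybble of &F, such as the DCD values in the suggested code, decodes as NV, which is one of the sixteen possible condition values.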

John Bowler (jbowler@acorn.co.uk)

hughesmp@vax1.tcd.ie (05/18/91)

In article <+|Q_L||@warwick.ac.uk>, csuwr@warwick.ac.uk (Derek Hunter) writes:
> I was trying to cut down the number of labels my C compiler produces,
>  (having finally allowed the thing to access globals beyond the 4095 range),
>  and I (re)invented this:
> 
> You can do 	Ldr Rn,VERY_FAR_AWAY	; with:
> 
> 		Ldr Rn,[PC]
> 		Ldr Rn,[PC,Rn]
> 		DCD VERY_FAR_AWAY-P%	; if V_F_A preceeds this code
> 
> 		Ldr Rn,[PC]
> 		Ldr Rn,[PC,-Rn]		;   (You /can/ do -Rn, can't you?)
> 		DCD P%-VERY_FAR_AWAY    ; if it doesn't

Ldr - 4 cycles
Ldr - 4 cycles
Nop - 1 cycle
--------------
      9 cycles

Another problem - I'm not sure, but should some lines be...
DCD V_F_A-P%   ---> DCD V_F_A-P%-4 ?
DCD P%-V_F_A   ---> DCD P%+4-V_F_A ?
As the R15 would be pointing at the instruction following the DCD?
(I may be wrong, it could work; I'm not sure)

It is also slow - much better is

Add Rn,Pc,#(within 4096 of address) \ This is a multi-instruction add -
                                    \ several adds to make the full value
Ldr Rn,[Rn,#(the error margin)]

Worst case here is realistically 6 cycles, possibly 7.

This can be implemented for the BASIC assembler as a FN, but it
is impossible (without <n>-pass assembly that may never terminate)
for the assembler to work out the optimum number of Adds because it
may not know the destination on Pass 1, but it must take up the
instruction space then... Thus the FN must be implemented as:

FNldr(n,a,o) - LDR Rn,a taking up an additional o instructions.

> 32 bit immediate constants can be read with
> 
> 		Ldr Rn,[PC]
> 		Bic Rn,Rn,# (( number >> 28 ) EOR 15) << 28
> 		DCD number OR &F0000000

Ldr - 4 cycles
Bic - 1 cycle
Nop - 1 cycle
--------------
      6 cycles

It is faster to do...

MOV Rn,#x AND &FF
ORR Rn,Rn,#x AND &FF00
ORR Rn,Rn,#x AND &FF0000
ORR Rn,Rn,#x AND &FF000000

- 4 cycles, and there won't be the problems you speculate on, in
possible future CPUs... Again with this, you can optimise it
further if you know certain bits of your number will be 0; it
might be faster to do...
MOV Rn,#x AND &F000000F:ORR Rn,Rn,#x AND &FF00 \ 2 cycles
depending on your numbers; write a FN that will work out the
optimal code, if you are using the BASIC assembler - then you can
just say
FNmov(n,x)
and it will do the fastest possible implementation...
Such a FN would be fairly trivial to implement.
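
A rough idea of what such a function has to compute, sketched in Python rather than BASIC. The names `split_constant` and `fnmov` are hypothetical, the search only considers MOV followed by ORRs (no MVN, ADD or SUB tricks), and it is a greedy search, so it is not guaranteed to be the fastest possible sequence in every case.

```python
MASK32 = 0xFFFFFFFF

def ror32(v, n):
    """Rotate a 32-bit value right by n bits."""
    n %= 32
    return ((v >> n) | (v << (32 - n))) & MASK32

def rol32(v, n):
    """Rotate a 32-bit value left by n bits."""
    return ror32(v, (32 - n) % 32)

def is_arm_immediate(v):
    """True if v is an 8-bit value rotated right by an even amount."""
    return any(rol32(v, r) < 0x100 for r in range(0, 32, 2))

def split_constant(x):
    """Split x into 8-bit windows at even bit positions, trying every
    even starting rotation and keeping the shortest list found."""
    x &= MASK32
    best = None
    for start in range(0, 32, 2):
        rotated = ror32(x, start)      # view x with bit `start` as bit 0
        chunks = []
        pos = 0
        while pos < 32:
            if (rotated >> pos) & 3 == 0:
                pos += 2               # skip an empty bit pair
                continue
            chunk = rotated & (0xFF << pos) & MASK32
            chunks.append(rol32(chunk, start))
            pos += 8
        if best is None or len(chunks) < len(best):
            best = chunks
    return best or [0]

def fnmov(n, x):
    """Return assembler text loading x into Rn with one MOV and some ORRs."""
    chunks = split_constant(x)
    lines = ["MOV R%d,#&%X" % (n, chunks[0])]
    lines += ["ORR R%d,R%d,#&%X" % (n, n, c) for c in chunks[1:]]
    return lines
```

A constant whose set bits lie in bits 28-31, 0-3 and 8-15 comes out as two instructions, because the top and bottom nybbles share one rotated 8-bit window. An arbitrary value never needs more than four.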

Incidentally, we have a BASIC library which implements all these
FNs, including FNadr (same limitations as FNldr apply), FNsp
(assign space ; equivalent to ]P%+=sp:[OPT pass), FNfi (assigns
space to load in the given file, and loads the file in on pass 2)
, and a fair few others methinks... We'll post them to c.s.a. if
there would be any interest... One limitation, because there is
no standard 'pass' instruction, we assume it is called 'pass' -
this is easily changed (Although really, you should follow our
example ;-) - a function FNpass is used, which we have as...
DEFFNpass=pass
Just change the pass to whatever you use...

Merlin,
--SICK--
You suffer... But why?

csuwr@warwick.ac.uk (Derek Hunter) (05/18/91)

In article <+|Q_L||@warwick.ac.uk> csuwr@warwick.ac.uk (I) write . . .

. . . about these 32 bit loads.

 I'm afraid I got it a bit wrong, the DCD for the extended loads should
  add four to P% before anything is done to it.

ie		DCD  VERY_FAR_AWAY-(P%+4)
or		DCD  (P%+4)-VERY_FAR_AWAY
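
The reason for the extra 4 can be checked with a small address model. This is a Python sketch, not ARM code; it assumes that PC, when read by an instruction, is that instruction's address plus 8, and it ignores the top-nybble masking of the DCD. The function names are invented for illustration.

```python
def far_load_address_plus(code_addr, target_addr):
    """Address read by the second Ldr in the [PC,Rn] form."""
    second_ldr = code_addr + 4
    dcd_addr = code_addr + 8                 # P% when the DCD is assembled
    assert code_addr + 8 == dcd_addr         # Ldr Rn,[PC] fetches the DCD
    stored = target_addr - (dcd_addr + 4)    # DCD VERY_FAR_AWAY-(P%+4)
    pc_seen = second_ldr + 8                 # PC as seen by the second Ldr
    return pc_seen + stored

def far_load_address_minus(code_addr, target_addr):
    """Address read by the second Ldr in the [PC,-Rn] form."""
    second_ldr = code_addr + 4
    dcd_addr = code_addr + 8
    stored = (dcd_addr + 4) - target_addr    # DCD (P%+4)-VERY_FAR_AWAY
    pc_seen = second_ldr + 8
    return pc_seen - stored
```

Both forms land exactly on the target, because the second Ldr sees PC one word beyond the DCD. Without the +4 the load would miss by four bytes.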

In case the NV code has less precedence than an illegal instruction code,
 (sorry, I haven't tested it yet), you can set the top byte to 0xFF which
 I think takes it into the SWI range of instructions, where there should
 be no possibility of illegal instructions existing, merely lots of never
 executed undefined SWIs. (Not the same thing at all, really).

 Of course you then have to make the obvious modification to the Bic in
 that case, correcting the top byte rather than just the top nybble, but
 for current memory levels, the extended offset addressing shouldn't foul
 up. By the time it does, we'll be into RISC OS 27. (Any rumours available
 about /that/ one Acorn? :-)
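
 The arithmetic behind the Bic, for both the top-nybble and the top-byte
 variants of the 32 bit constant load, can be sketched as follows. This is
 Python rather than ARM code, and the function names are made up for the
 purpose of the example.

```python
MASK32 = 0xFFFFFFFF

def bic(value, mask):
    """Model of BIC: clear in value every bit that is set in mask."""
    return value & ~mask & MASK32

def stored_nybble(number):
    """DCD number OR &F0000000"""
    return (number | 0xF0000000) & MASK32

def bic_mask_nybble(number):
    """(( number >> 28 ) EOR 15) << 28"""
    return (((number >> 28) ^ 0xF) << 28) & MASK32

def stored_byte(number):
    """DCD number OR &FF000000"""
    return (number | 0xFF000000) & MASK32

def bic_mask_byte(number):
    """(( number >> 24 ) EOR 255) << 24"""
    return (((number >> 24) ^ 0xFF) << 24) & MASK32
```

 In both cases the mask holds exactly the bits that were forced on in the
 DCD but are clear in the wanted number, so clearing them recovers it, and
 both masks fit an 8-bit rotated immediate.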

	- I was typing from memory, sorry I got it wrong folks.

	- Derek Hunter		csuwr@cu.warwick.ac.uk

csuwr@warwick.ac.uk (Derek Hunter) (05/18/91)

I seem to be posting quite a lot at the moment. This is probably my last
 thing on this topic:

It was nice to have had an original idea, shame it wasn't an optimal
 solution and won't work with proposed machines.

As far as knowing which Ldr to produce, all my globals come before any
 program code, so they're all -ve offset.

 Are there as yet any notes on a correction to the aof deficiency?
 Is there any information further to the PRMs in this area I could
  grab hold of?

For the moment, I'm going to leave this system in my compiler because it
 works with the current machines, and because the BASIC C compiler is going
 to be retired once I've rewritten it in C, (without all the design features
 (read: `design bugs') that I `forget' to mention in my Final Year Project
 Report), and I can snaffle John's suggestions into this latter.

------

If anyone's interested in the final compiler, it should code to m/c or
 Basic assembly as RQ'd (I haven't got the ARM assembler) with a linker
 which will work out what needs recompiling a la `make', using the #includes
 (and #pragmas for the extra info it'll probably need when someone out
 there points out why .h and .o timestamps are not sufficient).

 It will also have a nicer interface to RMA moduling than Norcroft's (unless
 I've been misled about how nasty that is), and will make those interrupt
 surrounds for the routines that you're all itching to write much easier.
 I also want to WIMP it.
 (It currently has a prettier swi interface too, but that might have to
  change when I do register colouring. Either that, or a file with the
  in/output register information.)

 I doubt it will ever be as efficient/fast as Norcroft's thing, but that can
 only improve, and it /will/ only be about 1/3..1/4 as much (Shareware).

		- I hope sh/w plugs don't transgress NETiquette.

		- Derek Hunter		csuwr@cu.warwick.ac.uk


 I thought I might and I tried, but I stopped when I found out I couldn't.

john@acorn.co.uk (John Bowler) (05/21/91)

In article <1991May16.200200.24636@comp.vuw.ac.nz> gavin@comp.vuw.ac.nz (Gavin Flower) writes:
>
>Have read in several places,
> that the condition code NV
>is *NOT* forward compatible.
>
>Specifically that Acorn have
>reserved the right to reuse
>it for something else, at
>some future stage!

I was careful not to say this, only to point out that it is a very obvious
thing to want to reuse; however, Alasdair Thomas's message seems unambiguous
to me.

>To the best of my knowledge
>it is still ok with ARM3.

>****could someone at Acorn
>confirm this, also could
>they suggest what it might 
>be earmarked for?

NV works as specified on ARM2 and ARM3!  Whether and how NV is (re)used is
up to ARM Ltd (*not* Acorn Computers Ltd, although I would hope Acorn has
some influence :-).  The body of Alasdair's message seems to correspond to
the rules which we attempt to follow when generating code which will work
on all processors.  Certainly I (personally) would suggest that you attempt
to follow those rules both within hand-written ARM assembler and within
compiler generated code.

In article <7~R_RA`@warwick.ac.uk> csuwr@warwick.ac.uk (Derek Hunter) writes:
> Are there as yet any notes on a correction to the aof deficiency?

The deficiency would require extensions to AOF to allow use of a PC relative
symbol value (currently only allowed in the special context of B/BL
relocations if I remember correctly) and the insertion of bits derived from
such a value into bit fields in a word.  I don't see this being fixed on
its own - I would anticipate a more widespread review of AOF.  Obviously
since I am commenting on this I don't know about any current developments
in this area :-).

John Bowler (jbowler@acorn.co.uk)