[comp.sys.mips] MIPS assembler question

mark@quintus.UUCP (Mark Spotswood) (06/06/90)

I have a question about memory initialization in MIPS assembler. I would
like to initialize a memory location to contain a value which is the
difference between the addresses of two other labels, something like this:

a:
	.word    b-c

where b and c are two other labels. If I use the above syntax, the MIPS
assembler will signal an error saying that the symbol 'c' must be an
absolute value.

The MIPS assembler will allow things like:

a:
	.word    b

or

a:
	.word    b-2

If the assembler can figure out what b and b-2 will be, why can't it figure
out what b-c will be?  Is there a way to do what I want in MIPS assembler?

-mark
(mark@quintus.uucp)

meissner@osf.org (Michael Meissner) (06/06/90)

In article <1380@quintus.UUCP> mark@quintus.UUCP (Mark Spotswood)
writes:

| I have a question about memory initialization in MIPS assembler. I would
| like to initialize a memory location to contain a value which is the
| difference between the addresses of two other labels, something like this:
| 
| a:
| 	.word    b-c
| 
| where b and c are two other labels. If I use the above syntax, the MIPS
| assembler will signal an error saying that the symbol 'c' must be an
| absolute value.
| 
| The MIPS assembler will allow things like:
| 
| a:
| 	.word    b
| 
| or
| 
| a:
| 	.word    b-2
| 
| If the assembler can figure out what b and b-2 will be, why can't it figure
| out what b-c will be?  Is there a way to do what I want in MIPS assembler?

Not that I know of.  I was prototyping OSF/1 shared libraries with the
MIPS assembler, and wanted to get the difference of an item in .data
from the start of .data (for which I had a pointer).  I finally gave
up, and had GCC calculate the offset itself.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

dave@imax.com (Dave Martindale) (06/12/90)

In article <1380@quintus.UUCP> mark@quintus.UUCP (Mark Spotswood)
writes:

| I have a question about memory initialization in MIPS assembler. I would
| like to initialize a memory location to contain a value which is the
| difference between the addresses of two other labels, something like this:
| 
| a:
| 	.word    b-c
| 
| where b and c are two other labels. If I use the above syntax, the MIPS
| assembler will signal an error saying that the symbol 'c' must be an
| absolute value.
| 
| The MIPS assembler will allow things like:
| 
| a:
| 	.word    b
| 
| or
| 
| a:
| 	.word    b-2
| 
| If the assembler can figure out what b and b-2 will be, why can't it figure
| out what b-c will be?  Is there a way to do what I want in MIPS assembler?

I don't know if this is true of the MIPS software specifically, but it
is a limitation with some systems:

Initializing a location in memory with the difference between two external
addresses requires some way for the assembler to tell the linker that it
should calculate the difference between two external symbols and store
the result in this location.  Some object file formats simply have
no way of specifying this computation.

For example, suppose you have an object file format in which every
word of code generated has an associated tag that says one
of:

	- this word is absolute, do not relocate
	- this word is an offset from external symbol #N, add the value
	  of that external symbol at link time
	- this word is an offset from the beginning of the current module;
	  add in this module's starting address at link time

The offsets or absolute values are stored in the instruction stream,
and the relocation information is stored elsewhere.  This format is
simple, and has the ability to handle most of the normal sorts of
relocation that are needed.   However, since each chunk of relocation
information specifies at most *one* external symbol whose link-time value
can be added to the corresponding instruction-stream word, there is no
way to specify that the value of two symbols should be subtracted.

To allow the assembler and linker to handle constant expressions that
contain more than one reference to an address or size that is determined
at link time, the object file format must allow almost arbitrary expressions
to be passed between the assembler and the linker.
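
The tag scheme described above can be sketched as a toy linker.  This is
only an illustration of the hypothetical format in this post, not of MIPS
COFF; the Word class, the tag names "abs"/"ext"/"mod", and all addresses
are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    value: int                    # offset or absolute value stored in the object file
    tag: str                      # "abs", "ext", or "mod" (the three tags listed above)
    symbol: Optional[str] = None  # the single external symbol, used only when tag == "ext"

def link(words, symtab, module_base):
    """Resolve each word the way the hypothetical linker would."""
    resolved = []
    for w in words:
        if w.tag == "abs":
            resolved.append(w.value)                     # no relocation
        elif w.tag == "ext":
            resolved.append(w.value + symtab[w.symbol])  # add one symbol's value
        elif w.tag == "mod":
            resolved.append(w.value + module_base)       # add module start address
        else:
            raise ValueError(f"unknown tag {w.tag!r}")
    return resolved

symtab = {"b": 0x1000, "c": 0x0F00}   # made-up link-time addresses

obj = [
    Word(0, "ext", "b"),    # .word b
    Word(-2, "ext", "b"),   # .word b-2
    Word(8, "mod"),         # a label 8 bytes into this module
    Word(42, "abs"),        # a plain constant
]
resolved = link(obj, symtab, module_base=0x2000)

# .word b-c has no encoding here: it would need "+b" and "-c" on one word,
# but a Word carries at most one symbol and the linker only ever adds it.
```

With these made-up addresses, b and b-2 resolve to 0x1000 and 0x0FFE, while
b-c cannot be written down as a Word at all.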

meissner@osf.org (Michael Meissner) (06/12/90)

In article <1990Jun11.213554.15606@imax.com> dave@imax.com (Dave
Martindale) writes:

| In article <1380@quintus.UUCP> mark@quintus.UUCP (Mark Spotswood)
| writes:
| 
| | I have a question about memory initialization in MIPS assembler. I would
| | like to initialize a memory location to contain a value which is the
| | difference between the addresses of two other labels, something like this:
| | 
| | a:
| | 	.word    b-c
| | 
| | where b and c are two other labels. If I use the above syntax, the MIPS
| | assembler will signal an error saying that the symbol 'c' must be an
| | absolute value.
| | 
| | The MIPS assembler will allow things like:
| | 
| | a:
| | 	.word    b
| | 
| | or
| | 
| | a:
| | 	.word    b-2
| | 
| | If the assembler can figure out what b and b-2 will be, why can't it figure
| | out what b-c will be?  Is there a way to do what I want in MIPS assembler?
| 
| I don't know if this is true of the MIPS software specifically, but it
| is a limitation with some systems:
| 
| Initializing a location in memory with the difference between two external
| addresses requires some way for the assembler to tell the linker that it
| should calculate the difference between two external symbols and store
| the result in this location.  Some object file formats simply have
| no way of specifying this computation.

Note, the MIPS assembler does not even allow for subtraction when the
items are in the same section, and are constant.  For example:

	.data
b:	.word	0
	.word	1
	.word	2
	#...
diff:	.word	(.-b)/4

I suspect that part of the reason may be that the MIPS assembler
reorganizes the code, and the first pass of the assembler doesn't have
the means of telling the second pass to do the appropriate back
patching after any rearrangement.

I've also gnashed my teeth over the fact that the MIPS assembler does
not allow instructions to be put into the data section.

Finally, we just discovered the hard way that the MIPS assembler
screws up line numbers if you put non-instructions (such as the table
of labels for implementing a switch statement) into .text.  This is
because the line number information is based on a delta from the
previous line, and the assembler doesn't count the non-instructions in
forming the deltas.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

sjc@key.COM (Steve Correll) (06/12/90)

In article <1380@quintus.UUCP> mark@quintus.UUCP (Mark Spotswood)
writes:
| a:
| 	.word    b-c
| 
| If the assembler can figure out what b and b-2 will be, why can't it figure
| out what b-c will be?  Is there a way to do what I want in MIPS assembler?

In article <1990Jun11.213554.15606@imax.com>, dave@imax.com (Dave Martindale) writes:
> Initializing a location in memory with the difference between two external
> addresses requires some way for the assembler to tell the linker that it
> should calculate the difference between two external symbols and store
> the result in this location.  Some object file formats simply have
> no way of specifying this computation.

Indeed, the MIPS COFF object format does not provide relocation for b-c.
However, the assembler can't even subtract non-relocatable labels, because it
tries to process the operand of .word long before it performs instruction
scheduling and reordering, and therefore it doesn't know how many nops it's
going to insert, etc. If you need the subtraction to figure out the offset to
a field in a data structure, the .struct directive may well help; but if you
need the difference of two labels in an instruction stream, sorry about that.
-- 
...{sun,pyramid}!pacbell!key!sjc 				Steve Correll

ph@ama-1.ama.caltech.edu (Paul Hardy) (12/01/90)

I've just started programming in MIPS assembler, and I've written a routine
to perform fast matrix multiplies.  I am sustaining a rate of approximately
6.7 Mflops on a DECstation 5000, an IRIS 4D (using only one processor), and
an ESV 10.  All use the 25 MHz MIPS R3000.  This is a higher number of Mflops
than the vendors claim their machines can do, so I guess I should be pretty
happy.  However, I'm wondering why it's not going faster.  This is probably
a question for comp.arch.mips, but there's no such newsgroup.

The main body of the multiply is a triplet of instructions: simultaneously,
a load, add, and multiply are being performed on different registers.  Since
they're not using each others' registers, they should all execute together.
According to the MIPS book, a single-precision floating-point multiply takes
6 cycles, but during the last two cycles another multiply can begin, so
effectively it takes four cycles if many multiplies occur back-to-back.
In reality, about 7 cycles elapse between multiplies.  The code looks
something like (where A, B*, C, D, E, F are single-precision floating point
registers, and offset is a hard-coded constant):

                   mul.s    A, A, B1
                   lwc1     C, offset($BASE)
                   add.s    E, E, D
                   mul.s    C, C, B2    ## 1 cycle stall if load takes 2 cycles
                   etc.

A stalled load will hold up the following multiply if it takes more than
three cycles to perform.  Stalling the add shouldn't affect speed at all,
since it's working on other data.  Sticking nops above all the mul.s
instructions didn't make any difference, so I took them out again.  It would
seem that loads are taking a long, long time.  This is unfortunate, because
all data is in cache.  The only machine that page faulted during 100,000
iterations of the loop was the E&S machine: 9 times -- fairly insignificant.
This is a trial with 10 x 10 matrices, so all of the data fits in one 1k page.
All loads in the loop of the operation occur from sequential memory locations.
This was done with hopes of decreasing access time on subsequent lookups from
the same bank in a cache RAM.  I write results in integer registers; they
don't get written back into the cache until I'm out of registers (I hold about
20 values, so I perform one write every 380 floating point operations for a
10 x 10 matrix).
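
The figures quoted so far can be cross-checked with a little arithmetic.
The sketch below assumes that each result element of an n x n product costs
n multiplies and n-1 adds, that one multiply issues per triple, and that
loop overhead is negligible; the variable names are invented for the
example.

```python
n = 10                                   # matrix dimension from the post
flops_per_element = n + (n - 1)          # n multiplies plus n-1 adds = 19
held = 20                                # results held in registers between writes
flops_per_writeback = held * flops_per_element

clock_hz = 25e6                          # 25 MHz R3000
cycles_per_multiply = 7                  # observed spacing between multiplies

# One result element takes n multiplies, hence n * 7 cycles under this assumption.
cycles_per_element = n * cycles_per_multiply
mflops = flops_per_element / cycles_per_element * clock_hz / 1e6

print(flops_per_writeback)               # 380
print(round(mflops, 2))                  # 6.79
```

Under those assumptions, 7 cycles per multiply predicts about 6.8 Mflops,
which is close to the measured 6.7.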

Does anyone have any experience with this?  Where are the extra 3 cycles going?
How long does it _really_ take to load a value from cache?  If it does take a
lot more than 2 cycles, then I could relax and make the subroutine a lot more
flexible.

By the way, this is a very nice assembler language to program in!


                                  --Paul

rowen@mips.COM (Chris Rowen) (12/05/90)

Paul Hardy (ph@ama-1.ama.caltech.edu) writes: 
>The main body of the multiply is a triplet of instructions: simultaneously,
>a load, add, and multiply are being performed on different registers.  Since
>they're not using each others' registers, they should all execute together.
>According to the MIPS book, a single-precision floating-point multiply takes
>6 cycles, but during the last two cycles another multiply can begin, so
>effectively it takes four cycles if many multiplies occur back-to-back.
>In reality, about 7 cycles elapse between multiplies.  The code looks
>something like (where A, B*, C, D, E, F are single-precision floating point
>registers, and offset is a hard-coded constant):
>
>                   mul.s    A, A, B1
>                   lwc1     C, offset($BASE)
>                   add.s    E, E, D
>                   mul.s    C, C, B2    ## 1 cycle stall if load takes 2 cycles
>                   etc.
>
>Does anyone have any experience with this?  Where are the extra 3 cycles going?
>How long does it _really_ take to load a value from cache?  If it does take a
>lot more than 2 cycles, then I could relax and make the subroutine a lot more
>flexible.

As I recall, the relevant pipelining rules of the R3010 are the following:

1) An ADD cannot start or finish in a cycle in which a MUL starts or finishes
2) Only one instruction can start in any cycle
3) A load can finish in any cycle

This means that the add cannot start until the multiply has completed.

Pipelining of instructions as coded:

CYCLE     1      2      3      4      5      6      7      8      9      10     11
mul.s   START  ------ ------ RESULT
lwc1           START  RESULT
add.s                               START  RESULT
mul.s                                             START  ------ ------ RESULT
lwc1                                                     START  RESULT
add.s                                                                         START

This is six cycles per triple.

If you can reorder the code a little, it should get faster:

CYCLE     1      2      3      4      5      6      7      8      9      10     11
mul.s   START  ------ ------ RESULT
add.s          START  RESULT
lwc1                  START  RESULT
mul.s                        START  ------ ------ RESULT
add.s                               START  RESULT
lwc1                                       START  RESULT

This is three cycles per triple.
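
Both figures can be reproduced with a toy in-order scheduler that applies
the three rules literally.  This is a rough sketch, not a model of the real
R3010: the latencies in SPAN are read off the diagrams above, data
dependencies are ignored, and the function names are invented for the
example.

```python
# Cycles from START to RESULT, read off the diagrams above.
SPAN = {"mul.s": 3, "add.s": 1, "lwc1": 1}

def schedule(program):
    """Return the start cycle of each instruction under the three rules."""
    # Cycles in which some MUL (or ADD) starts or finishes.
    busy = {"mul.s": set(), "add.s": set()}
    starts = []
    cycle = 0
    for op in program:
        cycle += 1                       # rule 2: one start per cycle, in order
        if op in busy:
            other = busy["add.s" if op == "mul.s" else "mul.s"]
            # rule 1: slide forward until neither end collides with the other unit
            while cycle in other or cycle + SPAN[op] in other:
                cycle += 1
            busy[op].update((cycle, cycle + SPAN[op]))
        starts.append(cycle)             # rule 3: loads are never delayed here
    return starts

def cycles_per_triple(order, repeats=4):
    """Steady-state distance between successive mul.s starts."""
    program = order * repeats
    starts = schedule(program)
    muls = [s for s, op in zip(starts, program) if op == "mul.s"]
    return muls[-1] - muls[-2]

print(schedule(["mul.s", "lwc1", "add.s"] * 2))        # [1, 2, 5, 7, 8, 11]
print(cycles_per_triple(["mul.s", "lwc1", "add.s"]))   # 6
print(cycles_per_triple(["mul.s", "add.s", "lwc1"]))   # 3
```

The start cycles printed for the original order match the first diagram,
and the reordered triple settles at three cycles, as in the second.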

Chris Rowen

ph@ama-1.ama.caltech.edu (Paul Hardy) (12/05/90)

In article <43786@mips.mips.COM> rowen@mips.COM (Chris Rowen) writes:

   1) An ADD cannot start or finish in a cycle in which a MUL starts or finishes
   2) Only one instruction can start in any cycle
   3) A load can finish in any cycle   

   This means that the add cannot start until the multiply has completed
   ...
   If you can reorder the code a little, it should get faster
   ...
[preferred order: mul.s, add.s, lwc1, mul.s, add.s, lwc1, etc.]

   Chris Rowen				

Someone else at MIPS, Mark Johnson, mentioned this to me yesterday.
The bottom line is that the floating-point adder, multiplier, and divider
circuits all share one exponent adder.  I had erroneously assumed that
they each had their own.  For operations using the floating-point adder,
multiplier or divider, this exponent adder is used during the first cycle
for exponent approximation, and the last cycle for normalization.

Therefore, these operations should be arranged so that they don't end
on the same cycle, and so that one does not begin on the same cycle
that another one ends.  Otherwise, the pending floating-point operation
will stall.  This wasn't obvious from Kane's book.

Mark also pointed me to an excellent article (written by him, Chris,
and Paul Ries): "The MIPS R3010 Floating-Point Coprocessor" IEEE Micro,
June 1988, pp. 53-62.  I recommend this to anyone who wants to write
floating-point assembly code for the R3010; I have a much better
understanding of the chip after reading this article.

Thanks to both of you for your very helpful advice on this problem.

                              --Paul