[comp.arch] M88000 power dissipation as a function of programming

wca@oakhill.UUCP (william anderson) (07/09/88)

INTRODUCTION

In a private communication, Landon Dyer <landon@claris.UUCP> asks the
following question in response to the power considerations discussed
in article <1362@oakhill.UUCP>:

-> Is it possible that a worst-case instruction-stream address sequence
-> (from a program designed to maximize address-bit changes while minimizing
-> cache hits) would cause an 88100 to overheat?

That is to say, can a malevolent programmer find an instruction sequence
for the MC88100 which is equivalent to the legendary HCF (Halt and Catch
Fire) instruction?



PATHOLOGICAL EXAMPLE

Assume we have the following register contents:

	r2:	0x55555555
	r3:	0xAAAAAAAA

Consider the following M88K "program", with the above register contents:

0xFFFFFFFC:
	st	r2,r3,r0
	br.n	-1
	st	r3,r2,r0

On the MC88100, r0 is hardwired to 0.  Consider what this loop does:

 1 - The first st stores all 5s at address = AAAAAAAA.  The word address
     of this instruction is -1 (byte address 0xFFFFFFFC).

 2 - The br.n branches back to the first st instruction while executing
     another store in the delay slot.  The word address of this
     instruction is 0.

 3 - The second st stores all As at address = 55555555.  The word address
     of this instruction is 1.

Stores are the cycles in which the MC88100 must drive its own data pins.
A loop in which stores run nearly back to back, with highly uncorrelated
states on the data and data address busses and on the instruction
address bus, is therefore the worst case for MC88100 power dissipation.

The relevant M88K P bus states in this loop look like (using
hexadecimal notation):

I-Address	D-Address	Data		Byte Strobe
(30 bits)	(30 bits)	(32 bits)	(4 bits)
--------	--------	--------	-
3FFFFFFF	
00000000	
00000001	
3FFFFFFF	
00000000	AAAAAAAA	55555555	F
00000001	AAAAAAAA	55555555	0
3FFFFFFF	55555555	AAAAAAAA	F
00000000	AAAAAAAA	55555555	F
00000001	AAAAAAAA	55555555	0
3FFFFFFF	55555555	AAAAAAAA	F

and so on.  Note that the I-address pins go from all 1s to all 0s and
back to all 1s once every three cycles, and that the D-address and Data
pins alternate 1s and 0s on every pin at the same rate.  Each pin
therefore makes two transitions every three cycles.



POWER CONSIDERATIONS FOR PATHOLOGICAL EXAMPLE

Now, the AC power dissipation (as discussed in article <1362@oakhill.UUCP>)
is given by:

                           P =  .5*C*V**2*F*N,          [ equation 1 ]
where:
                    P = AC power dissipation (W)
                    C = load capacitance (F),
                    V = voltage swing (V),
                    F = frequency (Hz), and
                    N = number of pins which make transitions.

      (For the MC88100, V = 3.8 volts (TTL logic levels) and F = 20 MHz)

NOTE:  This formula was INCORRECTLY posted as P = 2*C*V**2*F*N in
       the aforementioned article.  We sincerely apologize for any
       inconvenience that this might have caused.  However, the
       derivation of this formula takes about 60 seconds and is
       left as an exercise for the reader (use the formulae
       P = V*I and I = C*dV/dt, and average over one clock to do
       the derivation).
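
For readers who would rather check the exercise than do it, one possible
derivation is sketched below.  It assumes that each pin drives an ideal
capacitive load C through the full swing V, and that N counts the pins
making a transition within one clock period.  The symbols E (energy per
transition), v (instantaneous pin voltage) and T = 1/F (clock period)
are introduced here for illustration and do not appear in the original
article.

```latex
% Energy moved through one pin during one transition (0 -> V), using
% P = V*I and I = C*dV/dt:
\[
  E \;=\; \int P\,dt
    \;=\; \int v\,C\,\frac{dv}{dt}\,dt
    \;=\; C\int_{0}^{V} v\,dv
    \;=\; \tfrac{1}{2}\,C\,V^{2}
\]
% Averaging over one clock period T = 1/F, with N pins making a
% transition in that period:
\[
  P \;=\; \frac{N\,E}{T} \;=\; \tfrac{1}{2}\,C\,V^{2}\,F\,N
\]
```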

If we plug in the numbers, using:

		    C = 85 pF		(70 pF maximum output load capacitance
					 plus 15 pF internal output capacitance)

		    V = 3.8 V		(see above)
		    N = 2/3 * 96 = 64	(2/3 due to transition frequency)
we get:
		    P = .79 W		(F = 20 MHz)
		    P = .98 W		(F = 25 MHz)

This result is well below the maximum power dissipation of 1.5 W given
for the MC88100.  In general, the AC power dissipation of the MC88100
dominates any DC power dissipation.
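
The arithmetic above can be checked with a few lines of C.  This is only
a sketch of equation 1 with the values quoted in this article; the
function names ac_power() and print_pathological_power() are
hypothetical and chosen for illustration.

```c
#include <stdio.h>

/* Equation 1: P = .5*C*V**2*F*N
 *   c_farads : load capacitance per pin (F)
 *   v_swing  : voltage swing (V)
 *   f_hz     : clock frequency (Hz)
 *   n_pins   : effective number of pins making a transition per clock
 */
double ac_power(double c_farads, double v_swing, double f_hz, double n_pins)
{
	return 0.5 * c_farads * v_swing * v_swing * f_hz * n_pins;
}

/* Print the pathological-example figures at both clock rates. */
void print_pathological_power(void)
{
	const double c = 85e-12;          /* 70 pF load + 15 pF internal */
	const double v = 3.8;             /* TTL-level swing */
	const double n = 2.0 / 3.0 * 96;  /* 64 effective pins */

	printf("20 MHz: %.2f W\n", ac_power(c, v, 20e6, n));
	printf("25 MHz: %.2f W\n", ac_power(c, v, 25e6, n));
}
```

Calling print_pathological_power() from a main() of your own should
reproduce the .79 W and .98 W figures given above.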

Clearly, this program is pathological:

 - No work gets done and the code never terminates; therefore, we can
   make the MC88100 halt (i.e. quit doing useful work) but we cannot make
   it catch fire.

 - The instruction addresses are contrived (in fact, this program
   won't work with the MC88200 CMMUs since addresses >= FFF00000 are
   hardwired for I/O address space).



REALISTIC EXAMPLE WITH OPTIONAL POP-QUIZ

Perhaps a more interesting example would be where the M88K is doing
memory-intensive work in a useful manner.  Let's consider an example
which might be useful in a Unix(R) kernel: wordmove().  A common example
of the C source code for wordmove() might be:

void
wordmove_1(d, s, n) long *d, *s; unsigned int n;
{
	while ( n-- ) *d++ = *s++;
}

OR:

void
wordmove_2(d, s, n) long *d, *s; unsigned int n;
{
     register i;
     for(i = -n; i; i++) d[n+i] = s[n+i];
}

OR:

void
wordmove_3(d, s, n) long *d, *s; unsigned int n;
{
     register i;
     for(i = 0; i<n; i++) d[i] = s[i];
}

( Multiple-choice pop-quiz for C programmers or compiler writers: which
  C code should generate the fastest loop on the MC88100?  Answer follows
  on next page. )



In the worst case, a word move causes more pin transitions per cycle,
and therefore dissipates more power, than (for example) a byte move.
Using a 'free source code with restricted redistribution (but not
public domain)' ANSI C compiler, the answer to the above "pop-quiz" is
wordmove_2() (with wordmove_3() second-best and wordmove_1() worst).
The inner loop code from this compiler, after being scheduled with the
Motorola scheduling filter, is:

top:
	lda.b	r2,r4,r3
	ld	r7,r6[r2]
	lda.b	r3,r3,1
	bcnd.n	ne0,r3,top
	st	r7,r5[r2]

This is quite efficient code for a short (rolled) loop, running at
20 MIPS (for a 20 MHz MC88100) and moving 16 Mbytes/second (five
instructions, and hence five clocks, per 4-byte word).  We now
assume the worst case for the data being moved (we use 0xAAAAAAAA and
0x55555555 for alternate words as above).  We also assume the worst case
for the relationship between source address (s) and destination address
(d): every address pin changes state whenever the D-Address switches
between the two.  The MC88100 P bus states in this loop look like (here,
******** represents the data pins on a load; since the cache memory
system [e.g. MC88200] does the driving on a load, this doesn't
represent power loading on the MC88100):

I-Address	D-Address	Data		Byte Strobe
(30 bits)	(30 bits)	(32 bits)	(4 bits)
--------	--------	--------	-
top	
top+1	
top+2		s				F
top+3		s		********	0
top+4		d		55555555	F
top		d		55555555	0
top+1		d		55555555	0
top+2		s+1		55555555	F
top+3		s+1		********	0
top+4		d+1		AAAAAAAA	F
top		d+1		AAAAAAAA	0
top+1		d+1		AAAAAAAA	0
top+2		s+2		AAAAAAAA	F
top+3		s+2		********	0
top+4		d+2		55555555	F

and so on, ad nauseam.



POWER CONSIDERATIONS FOR REALISTIC EXAMPLE

We now reapply equation 1 above, using the same values for C, V, and F,
but we adjust N (effective number of pins changing per cycle) to get
the AC power dissipation for the more realistic example:

			N = 2 (I-Address) +
			    4*32/10 (D-Address) +
			    2*32/10 (Data) +
			    8*4/10 (Byte Strobe)

			  = 24.4 (effective pin transitions per clock)

and this gives:

			P = .30 W	(F = 20 MHz)
			P = .37 W	(F = 25 MHz)

or roughly one-fifth of the maximum power rating for the 20 MHz part.
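
The N used here can be rebuilt term by term.  The sketch below reads
each term as (full-width transitions per pattern period) * (pins) /
(cycles per period), with a period of 10 cycles (two passes through the
five-instruction loop, one per data word of the alternating pattern).
That reading, and the function names, are assumptions made for
illustration; the constants are the ones given in this article.

```c
/* Effective pins changing per clock contributed by one bus. */
double effective_pins(double transitions, double pins, double period_cycles)
{
	return transitions * pins / period_cycles;
}

/* N for the word-move loop, term by term as in the text above. */
double wordmove_effective_n(void)
{
	const double period = 10.0;              /* two 5-instruction passes */
	double n = 2.0;                          /* I-Address   */

	n += effective_pins(4.0, 32.0, period);  /* D-Address   */
	n += effective_pins(2.0, 32.0, period);  /* Data        */
	n += effective_pins(8.0, 4.0, period);   /* Byte Strobe */
	return n;
}

/* Equation 1 with C = 85 pF and V = 3.8 V, for the word-move loop. */
double wordmove_ac_power(double f_hz)
{
	const double c = 85e-12;
	const double v = 3.8;

	return 0.5 * c * v * v * f_hz * wordmove_effective_n();
}
```

wordmove_ac_power(20e6) and wordmove_ac_power(25e6) should come out
close to the .30 W and .37 W quoted above.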



CONCLUSION

In this article, we have analyzed the AC power dissipation of the
MC88100 as a function of the code that it runs.  A pathological code
example was found to be the worst-case with regard to power
dissipation: this code segment caused power dissipation to be roughly
one-half the rated maximum (at 20 MHz).  A more realistic code example
(a move-word routine) was examined and the power dissipation for this
code was found to be about one-fifth the rated maximum (again, at 20
MHz).  Clearly, a low average power dissipation should have a highly
desirable effect on the reliability and longevity of any design using
a CMOS VLSI microprocessor.

We have also (again) considered the output from an ANSI C compiler and
have implicitly shown that, in the case of the MC88100, pointer arithmetic
is less efficient than array indexing, due in part to the scaled-indexed
addressing mode of the part.  This may have some impact on the way system
code is most efficiently written for the M88K.

Finally, we have used an AC power dissipation equation (equation 1) which
has general applicability to all microprocessors, since it doesn't depend
upon internal details of the chip architecture but instead depends upon
the simple physics of the microprocessor/memory interface.  We hope that
the readers of this article can use this equation to enlighten us with
regard to the power characteristics of their products.

ACKNOWLEDGMENTS

Thanks to Mitch Alsup for his valuable advice and motivation.

The statements and opinions presented in this article are my own.
They should not be interpreted as being the opinions or policy,
official or otherwise, of Motorola Inc.

       /\        /\ 		William C. Anderson
      //\\      //\\		Member of the Motorola 88000 Design Group
     ///\\\    ///\\\		Motorola Microprocessor Division
    //    \\  //    \\		Oak Hill, TX
   /        \/        \
  /                    \

andrew@frip.gwd.tek.com (Andrew Klossner) (07/12/88)

A few nits ...

>	r2:	0x55555555
>	r3:	0xAAAAAAAA
> 0xFFFFFFFC:
>	st	r2,r3,r0
>	br.n	-1
>	st	r3,r2,r0

Under normal circumstances, each store will cause a misaligned data
access exception because the target addresses are not longword-aligned.
You can disable this exception by manipulating a bit in the PSR, but
I've never been able to figure out just what happens in that case,
except that a 68020-style unaligned longword store (straddling two
adjacent aligned longwords) doesn't.

> I-Address	D-Address	Data		Byte Strobe
> (30 bits)	(30 bits)	(32 bits)	(4 bits)
> --------	--------	--------	-
> 3FFFFFFF	55555555	AAAAAAAA	F

You can't get 55555555 in 30 bits.  The D-address will be 15555555.
Will the byte strobe really be F for this misaligned access?

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]

wca@oakhill.UUCP (07/12/88)

In article <10157@tekecs.TEK.COM>, andrew@frip.gwd.tek.com (Andrew Klossner) writes:

> A few nits ...

That's OK, Andrew, I deserve it.

>	r2:	0x55555555
>	r3:	0xAAAAAAAA
> 0xFFFFFFFC:
>	st	r2,r3,r0
>	br.n	-1
>	st	r3,r2,r0
>
> Under normal circumstances, each store will cause a misaligned data
> access exception because the target addresses are not longword-aligned.
> You can disable this exception by manipulating a bit in the PSR, but
> I've never been able to figure out just what happens in that case,
> except that a 68020-style unaligned longword store (straddling two
> adjacent aligned longwords) doesn't.

When the misaligned access exception is disabled (by setting the
appropriate bit in the Processor Status Register) and a misaligned
access is attempted, the MC88100 rounds the address *down* to a
consistent boundary.  In the above case, a word access addressed for
the location 0xAAAAAAAA will access (in this case, store) a full word
of data (that is, byte strobe = 0xF) at location 0xAAAAAAA8.  Clearly,
this could create serious problems (!), so the moral of the story is
that any programmer who disables the misaligned access exception better
know what he/she is doing.  Note that one use of this feature is in a
tagged architecture application.
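
As a rough model of the rounding just described (for word accesses
only), the two low-order address bits are simply dropped.  The helper
names below are hypothetical; the second one assumes that the 30-bit
P-bus address is the byte address shifted right by two, which is how the
tables in this thread are written.

```c
#include <stdint.h>

/* Round a byte address down to the enclosing 32-bit word boundary,
 * as described above for a word access with the exception disabled. */
uint32_t word_align_down(uint32_t byte_addr)
{
	return byte_addr & ~(uint32_t)3;
}

/* 30-bit word address as it would appear on the address pins
 * (assumption: byte address with the two low-order bits removed). */
uint32_t pbus_word_address(uint32_t byte_addr)
{
	return byte_addr >> 2;
}
```

With these, word_align_down(0xAAAAAAAA) gives 0xAAAAAAA8, matching the
example above.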

> I-Address	D-Address	Data		Byte Strobe
> (30 bits)	(30 bits)	(32 bits)	(4 bits)
> --------	--------	--------	-
> 3FFFFFFF	55555555	AAAAAAAA	F
> 
> You can't get 55555555 in 30 bits.  The D-address will be 15555555.
> Will the byte strobe really be F for this misaligned access?

You're right, Andrew, the D-Address should be 15555555 (and the next
D-Address in the cycle should be 2AAAAAAA).  And the byte strobe is 0xF
for this access, as mentioned above.

In my enthusiasm to flip as many bits (and burn as many mW) as possible
(and to keep the sample program as simple as possible), I blithely
ignored the misaligned problem and committed the gaffe.  If I use two
more registers for the addresses 0x55555554 and 0xAAAAAAA8 (keeping r2
and r3 for the data as above), then I can write a program that does
flip as many bits as possible and one gets essentially the same results
as in the previous article (as far as power dissipation goes).  That is
to say, if we have:

	r2:	0x55555555
	r3:	0xAAAAAAAA
	r4:	0x55555554
	r5:	0xAAAAAAA8

and we run the code:

0xFFFFFFFC:
	st	r2,r5,r0
	br.n	-1
	st	r3,r4,r0

then we get P-bus states that look like:

I-Address	D-Address	Data		Byte Strobe
(30 bits)	(30 bits)	(32 bits)	(4 bits)
--------	--------	--------	-
3FFFFFFF	
00000000	
00000001	
3FFFFFFF	
00000000	2AAAAAAA	55555555	F
00000001	2AAAAAAA	55555555	0
3FFFFFFF	15555555	AAAAAAAA	F
00000000	2AAAAAAA	55555555	F
00000001	2AAAAAAA	55555555	0
3FFFFFFF	15555555	AAAAAAAA	F

and so on.  Or, I could disable misaligned access exceptions and use
the original code for identical results (as far as the P-bus and the
MC88100 are concerned; after all, this was the pathological example!)
The remainder of the article (in particular, the power dissipation
analysis for the M88K) is not affected by this correction in any
substantial way.
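
For anyone who wants to count the flipped bits rather than take my word
for it, here is a small sketch.  It treats the steady-state rows of the
table above as a repeating three-cycle pattern and counts, bus by bus,
how many pins differ between consecutive cycles.  The function names are
hypothetical, and the model ignores the start-up cycles before the first
store.

```c
#include <stdint.h>

/* Number of set bits in x (portable; no compiler builtins assumed). */
int popcount32(uint32_t x)
{
	int n = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Total pin transitions over one pass through a repeating sequence of
 * bus states, including the wrap from the last state to the first. */
int transitions_per_period(const uint32_t *states, int len)
{
	int total = 0;
	int i;

	for (i = 0; i < len; i++)
		total += popcount32(states[i] ^ states[(i + 1) % len]);
	return total;
}

/* Average pins changing per clock for the corrected pathological loop. */
double pathological_effective_n(void)
{
	static const uint32_t iaddr[]  = { 0x3FFFFFFF, 0x00000000, 0x00000001 };
	static const uint32_t daddr[]  = { 0x15555555, 0x2AAAAAAA, 0x2AAAAAAA };
	static const uint32_t data[]   = { 0xAAAAAAAA, 0x55555555, 0x55555555 };
	static const uint32_t strobe[] = { 0xF, 0xF, 0x0 };
	int total;

	total  = transitions_per_period(iaddr, 3);
	total += transitions_per_period(daddr, 3);
	total += transitions_per_period(data, 3);
	total += transitions_per_period(strobe, 3);
	return total / 3.0;	/* three clocks per pass through the loop */
}
```

The per-bus totals work out to 60, 60, 64 and 8 transitions per three
clocks, so pathological_effective_n() returns 64, the same N used in the
power calculation of the previous article.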

>   -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
>                         (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]

Thanks again for the correction, Andrew.

The statements and opinions presented in this article are my own.
They should not be interpreted as being the opinions or policy,
official or otherwise, of Motorola Inc.

       /\        /\ 		William C. Anderson
      //\\      //\\		Member of the Motorola 88000 Design Group
     ///\\\    ///\\\		Motorola Microprocessor Division
    //    \\  //    \\		Oak Hill, TX
   /        \/        \
  /                    \