[comp.arch] Architectural analysis of RPM-40 for general usage very long

mash@mips.COM (John Mashey) (03/11/88)

Randall Jesup says the RPM40 stuff has been beaten to death, and I agree,
but I do have some data that may be useful, which I'd have posted before,
but has taken a while to write up, given that I've been busy.

As requested a while back by Mr. O'Connor, an analysis of the "40-MIPS" RPM-40
for workstation/general-purpose use (and only that):

Summary:
	40MIPS peak -> 20MIPS-cycles -> 14-15 VUP -> 12 VUP:
	(matches 7-9X 68020, or 2-6X Sun-3/260 estimates given by GE folks)

40MIPS	(peak native mips, no load delays, pipeline hazards, etc)

20MIPS-cycles	(assuming no MMU, cache, or other memory system degradation),
	i.e., what a 20MHz MIPS R2000 would be without those effects.

14-15VUP (assuming MMU, cache, memory degradation same as MIPS R2000,
	and compiler technology identical to MIPSco's)
	[VUP: VAX 11/780 with good optimizers = 1 VUP, integer+fp].

12 VUP	realistic expectation of what one would run like in an actual
	general-purpose system, with the more vanilla compiler
	technology that is likely to be available. (correct me if wrong)

The rest of this long posting:
1. Issues of general-purpose UNIX systems
2. Doing an analysis of a machine you only have a spec of
3. Detailed analysis
4. Conclusion

------STOP NOW UNLESS YOU ARE A GLUTTON FOR DETAILED ARCHITECTURAL ANALYSIS---

1. Fundamental issues, in which general-purpose UNIX systems may differ
from dedicated, small-memory, embedded systems:
	a) Programs are much bigger, i.e., they may easily have
	many megabytes of code, and 100's of megabytes of data, for
	single processes. (lots already = 16-32MB, at least)
	Since they have lots of data, they also do many loads and stores.

	b) The UNIX kernel is real nasty for high-performance machines:
	it has many sequences that are very hard to pipeline/reorg, it touches
	little bits of data here and there, and it sometimes has behavior
	that causes trouble for on-chip assists of otherwise reasoanble
	size (like branch-target buffers & register windows).
	The kernel also uses byte and halfword data, (both signed and
	unsigned), heavily-intermixed, because it uses heavily-packed
	data structures, which, for various reasons are difficult to
	change.  (Most user programs use halfwords less often).

	c) For general use, you sometimes have to make worst-case
	assumptions on the sizes of addresses, for example.  You usually
	must assume that addresses are "large", rather than "small",
	because you have no idea how big something is until it's linked,
	and because you may be compiling code for libraries where you
	can't know how big the final objects will be.  Hence, shortcuts
	often good in dedicated environments (i.e., "short" addresses are
	OK) don't work.

2. Doing the analysis.
	How do you get an estimate of performance for a machine you
	have no actual measurement data for?
	Here's a possible process:
	1) Find one or more machines that:
		-you have substantial data for (in this case, MIPS R2000)
		-are related to the machine to be estimated, where related means
		that it's fairly easy to do direct conversions to the
		machine to be estimated, and compute the architectural impacts.
	2) Feature by feature, do the recalibration, assuming the same
	software.  I.e., start with a base cycle count, and degrade as
	appropriate, or improve as appropriate.
	3) Having done the instruction-cycle analysis, degrade for cache,
	MMU, and RISC->VAX ratio effects.
	3) Finally, make some guesses on the impact on software of the
	architecture [after all, RISC design is half software.]

	In general, we'll make best-case assumptions, then come back
	and do guesses to recalibrate the optimism.

3. Detailed Analysis
The following lists features, with assumptions about them (if I'm wrong
about the assumptions, I'm happy to be corrected, and if somebody else
wants to use the data here to redo an analysis, that's fine.  I'm just using
what's published, plus the additional comments).  This analysis looks at the
raw cycle counts of the RPM40 as compared with a MIPS R2000,
(as the latter is the only one that I have detailed enough data for;
I might have used Stanford MIPS, as it is somewhat closer,
but I don't have the data).

I used a set of substantial user programs (as, ccom, doducd, espresso,
gnuchess, hspice, nroff, optimizer, code generator, timberwolf), written
in C, PASCAL, or FORTRAN.  I don't have kernel numbers handy, but I'll
occasionally note how the kernel differs.  Note that we first analyze the
instruction cycle counts, without regard to MMU and cache-miss behavior.
The "impact" number is the cycle-count degradation expected,
i.e., the "expansion" in number of cycles (over 1) due to the feature,
as compared with a "mythical" 40MHZ R2000.
Items marked * have a little more (but educated) guesswork than the others,
which are computed with high confidence from extensive data.
Items marked "+" are relevant to the 16-vs-32 issue.

A1.	.336 <= 3 cycles of load latency
	R2000 data: 21% of instruction cycles are loads (16-29%)
		68% of the (single) load-delay slots are filled (48-88%)
		(The usual numbers that I remember, but can't find the
		reference (and it wasn't bcase's posting), are that the
		expected fill rates for load-delays are:
		1: 70% (we got 68% on this bunch, and it contained one nasty
			program that only filled 48%).
		2: 30%
		3: 10%

	A1A: assuming the RPM has no more loads than an R2000
	A1B: assuming that similar-quality optimizers and reorganizers are used

	cycle-expansion = .21 * ( (1-.3) + (1-.1)) = .336
	(remember, we already counted the R2000's load-nops: the RPM40 would
	have equivalent stalls, in the same places.  These are the extras.
	(kernel is worse on this)

A2.+	.279 <= Loads/stores use 4-bit immediates
	R2000 data: 30.1% of instructions are loads+stores (22-38%)
		7.1% of the loads/stores could use a 4-bit offset (3.2-18%)
		R2000 load/stores have 16-bit offsets, which require the RPM40
		2 cycles to obtain:
	cycle-expansion = .301 * (1-.071) = .279

	Note: one might argue that there are clever code generation methods
	to reduce the number of places that addresses need to be calculated,
	and there are.  The MIPS compilers already do most of them,
	such as computing addresses, or parts of addresses, and saving them
	in registers, moving them outside of loops, etc.  The extra bits for
	larger offsets show up as LUIs (later).

A3.+	.035 <= Add/sub immediates [.050]
	R2000 data: 8.4% (5.8-12.6%) are add/subtract immediates
		3.5-5% of the total instructions are add/sub immediates
		whose constants don't fit into 4 bits.
		The 3.5% assumes there are both add/sub immediate (i.e.,
		4 bits + implied sign).  If there is really just add-immed,
		and you get 3 bits + sign, use the number 5%

	cycle-expansion = .035 * 1 = .035

A4.+	.013 <= Compare immediates (COND)
	R2000 data: About 1.7% [.2 - 4.8%] are compare-immediates or equivalent
		About 1.3% (of the total instructions) need >4 bits.
		About 1.3% fit in 16-bits, but not 4-bits, and thus
		require 1 PFX:

	cycle-expansion = .013 * 1 = .013

A5.+	.013 <= Logical immediates
	R2000 data: about 1.7% [.4-5.7%] are logical-immediates
		About 1.3% fit in 16-bits, but not 4-bits, and thus
		require 1 PFX:
	cycle-expansion = 0.013 * 1 = 0.013

	Note: kernel or other systems-type programs have higher rates of this.

A6.+	.011 <= Load-immediate
	This is used to stick constants in registers for arguments,
	compares, etc. (it's actually an add to zero, or something like that).
	I assume the RPM-40 has an equivalent.
	R2000 data: about 2.1% (.5-4.7%) are load-immediates
		About 1.1% (of total) fit in 16-bits, but not 4-bits, and thus
		require 1 PFX:
	cycle-expansion = 0.011 * 1 = 0.011

A7.+	.018 <= Load-upper-immediate
	This puts zeroes in low bits, and 16 bits in the top of a register.
	It's often used for setting up 32-bit address, for example,
	or a long constant.
	R2000 data: 1.9% (.5-6.9%) are LUIs
		About 1.8% (of the total) need the top 4 bits, and thus
		would need an 2 PFXs, rather than 1 (or 4, rather than 3):
	cycle-expansion = .018 * 1 = .018

A8.	.024 <= Jump-and-link
	JAL is done in the RPM-40 by MOV (PC) somplace; BRA...
	R2000 data: 1.2% of the instruction cycles (.3->2.0%) are
		calls (JAL, JALR), which give 28-bits of byte addressing.
	On the RPM-40,  a JAL is done by a 2-instruction pair, which
		I assume is a MOV of  the PC to the return-address register,
		followed by a branch, or something equivalent, although I
		haven't seen the exact sequence.  I assume that the return
		jump is a single cycle.
	Assuming programs of the size above, the natural code to
		generate would be:
		MOV (to save PC)
		PFX
		BRA
	i.e., because 12-bits of displacement are enough for local branches,
	(99+% on R2000), but not for globals. This adds 2 cycles/call.
	Assuming that branches get an extra bit of addressabiltiy (halfword
	alignment), the RPM gets 13 (BRA) + 12 (PFX) bits, for 32MB of code,
	which is adequate for all of the cited programs, although not so
	for others.  If by some chance the RPM uses a byte offset, getting
	only 12+12, (16MB), then it would, in normal use in general
	environment, probably take a hit of another 1.2% for a 2nd PFX,
	since there are already programs bigger than that floating around.

	cycle-expansion: .012 * 2 = .024

A9*.	.010 <= Load/store hazard
	The RPM40 is supposed to have a load/store hazard if a store occurs
	at the exact right number of cycles after a load.  This is not data
	we keep (since we don't have this hazard), and it's hard to compute.
	Here's a quick guess:
	R2000 data: about 10% of the instructions are stores, and 20% loads.
	Assume a 1-cycle hazard and a random distribution of instructions.
	
	Then, on the average, a store would find a load in the wrong spot 20%
	of the time, i.e.
	
	cycle-expansion: .1 (fraction of stores) * .2 (fraction of loads) = .02

	Let's assume the reorganizer can fix half of these (optimistic, given
	that we've found loads/stores are the hardest to move around).
	Thus, we get the cycle penalty as .01.  If the hazard is multi-cycle,
	this goes up, or if the reorganizer has a hard time, it goes up.

A10*.+	.098 <= Conditional branch
	Since the RPM40 has no conditional branch that has an address,
	the R2000's equivalents would be done with COND/BRA pairs.
	This requires a little estimation, as it is not something we track
	directly, i.e., those cases where the COND is able to skip just
	1-instruction-equivalent (in the RPM40 sense), without having to
	change into the COND/BRA pair.
	R2000 data: 11.4% (6-17%) of instruction cycles are conditional
		branches.  Approximately 30% of branches are backwards
		branches, which of course cannot be done with just COND.
		To get a quick estimate of the rest, I disassembled a few
		programs, and eyeballed the forward conditional branches,
		and assumed that conditional forward branches that jumped
		no more than 8 instructions might end up being CONDs that
		just skipped.  A few quick looks convinced me this was
		less than 10%.  However, I'll be conservative and allow
		20%, i.e., .8 (of the 70% forward branches) cost a cycle.

	cycle expansion: .114 * (.3 + .7 * .8) * 1 = .098

A11*.+	.010 <= Partial-word load/store
	The RPM40 does load/store of partial-word by setting a
	register (SR2) with a field that gives the type of access,
	then doing the load/store with "non-word" type.
	R2000 data: 6% of instruction cycles [0-19% !!-varies heavily] are
		partial-word load/stores.  Systems-type code is usually
		heavier, numerical code lighter on these.
	Assuming that it costs just 1 cycle to set SR2 (and it might be
	2, if it requires an OR/XOR pair or equivalent), we can guess how
	often this has to happen:
	Simple code generator: every time: 6% hit.
	Smarter code generator: track within basic block
	Global optimizer: track everywhere in a function.
		Under the best case, a function must treat SR2 as a register
		to be saved/restored if modified (they may well already do
		this), or else, it can be considered "dead" across function
		calls, and reset appropriately.
	Systems code, especially with lots of small functions, will feel
	a hit from this more than numeric code.
	I'll guess that there is a cycle hit (either from doing unnecessary
	save/restore, or having to reset this 1/6 times):
	cycle expansion: .01 * 1 = .01
	Note: this would definitely be worse in the kernel.

A12*.	.033 <= Misses due to branch-target-cache misse
	According to the talk, there was supposed to be a hit-rate >90%.
	I have no data on the kinds of programs used to calibrate that.
	Big programs are clearly worse than little programs in this regard;
	for example, a call thru the UNIX kernel (like read) misses most of
	the time, because it often goes 10 levels deep (10 function entries +
	10 return-sites = 20, leaving only 12 others for all other branches).

	R2000 data: over a set of similar benchmarks (including most of
	those above), 11.4% of the instructions are non-sequential,
	i.e., taken conditional branches plus all unconditional branches.
	Assuming a 3-cycle penalty for a miss (10% of the time), this gives:
	cycle expansion = .114 * (.1 * 3) = .033

A13*.	.050 <= Lack of ALU forwarding
	From all of the discussion, I can't tell how much bypassing the
	RPM40 does, or doesn't do.  From the various hints, it sounds like:
	a) you can store the output of an ALU op with no delay
	b) you can't otherwise use the result of an immediately-preceding
	ALU op.
	It would be nice if people said what is really going on, but given the
	sprited defense of non-bypassing, this might be a reasonable guess.
	This is another one we don't keep detailed statistics on (since we
	bypass everything). I disassembled several programs, picked randomly-
	chosen sequences of code (typically about 200 instructions), and
	eyeballed them to see what % of them would take 1-cycle hits if
	there were no bypassing, and otherwise following what would be the
	RPM40 rules, i.e., using PFX instead of R2000's immediates, etc.
	I eliminated cases where it looked like a reorganizer could get
	rid of them (not many, this is already heavily reorged code,
	and since many programs have only 5-10 instructions per basic block,
	and it is difficult to rearrange this stuff thru basic blocks,
	it is fairly easy to make a good guess on local information).
	Of course the numbers are static, not dynamic.  On the other hand,
	they're probably a reasonable guess (and note: this is not the
	bypassing hit an R2000 would take, it's the hit it would take if
	its code were transformed into an RPM40's):  The 3 samples I took
	gave 2.5, 5.6, and 7% hits from this, mostly because code doesn't
	easily migrate amongst basic blocks, other than at the boundaries.
	NOTE: after I did this, I see Earl Killian got the dynamic numbers
	for the R2000, which are as expected, much higher, given that we
	have forwarding and use it everywhere. 
	I'll guess 5%, which I think is reasonably conservative:
	cycle expansion: .05

A14*.	.020 <= Multiply-divide
	I'm not exactly sure how this is implemented on the RPM,
	although it probably doesn't have a fast multiplier on the CPU.
	(Maybe it's on the FPU, in which case some of this would go away).
	R2000 data: across the benchmarks, about 2.7% of the time was
	spent in multiply/divide interlock cycles [we have a 12/35 cycle
	multiple/divider], which includes the effects of having some
	instructions scheduled into the latency of an asynchronous unit.
	Assume that the interlock cycles are split 50/50 (faster mults,
	but more mults than divides, and that RPM's mults take about 3X,
	i.e., +2 factor, you get:
	cycle expansion: .013 * 2 = .026, which I'll round down to 2.

A15*.+	.040 <= 2-address registers, rather than 3-address ones
	This is another one that took some guessing, and is especially
	hard to know, given the willingness of our optimizers to keep
	things in registers rather than refetching them.
	From a static eyeballing of dis-assembled code, I saw 3-7%
	of 3-address ALU ops that looked like they needed to be that way,
	i.e., where a 2-address op would take a 1-cycle hit.
	I'll guess conservatively, 4%.
	cycle expansion: .04

A16*.	.039 <= Less registers (21 instead of 32)
	This definitely takes some guessing, as it has been a while since we
	bounced the compiler's register allocation around.  We do know we
	were getting good use of the extra registers up through 24-28,
	which is the number we actually left for normal code.
	(We hear that similar results were gotten at IBM & HP, but have no
	reference, as the "hearing" was from bar conversations.)
	Guess: assume having less registers (and not as symmetric) costs
	5% more loads/stores, which may well be (the R2000 has just enough
	registers that it seldom saves/restores any on a call to a leaf).
	Assuming 20% loads and 10% stores (close), we get:
	
	cycle expansion = .05 * (.2 + .1) = .015 (basic instrs)
		plus (the load-delays as in A1, except now the .3 unfilled
		1st delay slots are extra):
		.05 * (.3 + .7 + .9) = .01
		plus the 93% of ld+st that need >4 bits offset:
		.015 * .93 = .014
	Total: .015 + .01 + .014= .039

	Note: this is more of an effect for large, complex programs than
	small ones.  A cross-check is that you get the same number if
	you assume even 1 more register is saved/restored on the average per
	function call, (PFX+SW, PFX+LW for 4 cycles), with an average
	of 100 instructions/call (representative).

A17*.	.020 <= Architectural Reorganization issues
	Several of the times above related to reorganization:(A1, A9, A13).
	A number of factors appear to make the RPM40 more difficult to
	reorganize for:
	1) PFX instructions are difficult, if not impossible to move
	around away from the instructions prefixed (unlike the R2000's
	style of using a bypassed GP register).
	2) The instruction that sets the SR2 to get partial-word operations
	is hard to move "too far" away from the instruction(s) that need it.
	3) The load/store pipeline hazard must be taken care of.
	4) If there is no forwarding, that has to be reorganized also.
	Reorganization is very important for many RISC processors.
	The RPM40 has a some extra things to worry about, and one less
	(R2000 branch delay slot).  I'd guess the overall hit to be 2%.
	(Note: this is the ripple effect of all of this together, not of
	the individual pieces, which we've already counted.)

A18*.	?? <= Coprocessor issues
	I haven't really touched on this very much, as we don't know much,
	except that even a few cycles extra latency getting to a floating-point
	unit can hurt a lot, except in applications that naturally pipeline
	very well, or if the FPU has long cycle-count operations in the first
	place.  XPLD without XPST is a little puzzling. 

A19*.	-.067 <= Contraction issue (R2000 branch nops)
	The R2000 loses 6-7% to unfilled branch delay slots,
	which the RPM40 does not. (of course, the RPM40 takes hits in other
	areas of branching, but we've already included them).

A20*	.010 <= Miscellaneous
	There are a bunch of integer-related issues that I can only guess
	at, but observing that there are 4 bits in the opcode field for
	ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's
	ops are found in the RPM, although I don't know which ones they
	might be.  Also, if the immediate field encodes 16-bit shifts,
	that will help, and hurt, if not.

Bottom line, given everything I know:
A1.	.336 <= 3 cycles of load latency @
A2.+	.279 <= Loads/stores use 4-bit immediates @
A3.+	.035 <= Add/sub immediates @
A4.+	.013 <= Compare immediates @
A5.+	.013 <= Logical immediates @
A6.+	.011 <= Load-immediate @
A7.+	.018 <= Load-upper-immediate @
A8.+	.024 <= Jump-and-link @
A9*.	.010 <= Load/store hazard
A10*.+	.098 <= Conditional branch
A11*.+	.010 <= Partial-word load/store @
A12*.	.033 <= Misses due to branch-target-cache misse
A13*.	.050 <= Lack of ALU forwarding
A14*.	.020 <= Multiply-divide
A15*.+	.040 <= 2-address registers, rather than 3-address ones
A16*.+	.039 <= Less registers (21 instead of 32, sort of 
A17*.	.020 <= Architectural Reorganization issues
A18*.	?? <= Coprocessor issues
A19*.	-.067 <= Contraction issue (R2000 branch nops)
A20*	.010 <= Miscellaneous

Total	0.992	cycle expansion

As can be seen, a few items account for big pieces, but the little ones
add up to a lot of cycles if there are enough of them (as Everett Dirksen
once said, referring to the US budget, "a billion here, a billion there,
sooner or later it adds up to real money.") (or real cycles)

Grossly, what this means is that an RPM40, with equivalent compiler
technology, takes about (1+.99) (i.e. about 2, well within the
accuracy of this analysis!) instruction cycles more than an R2000, RUNNING LARGE
PROGRAMS. Some of the analysis may be wrong, but the fundamentals are there.
	Thus, for cycle counts (ignoring cache-miss & MMU overhead),
	a 40MHz RPM would act more-or-less like a 20MHz R2000, i.e.,
	it would run twice as many (instruction cycles + delay cycles).

	Now, a 16.7Hz R2000 [the fastest we've actually shipped],
	with 64K I + 64K D-caches, acts like a 12-VUP (VAX-mips,
	versus 11/780 with good compilers, integer+float) machine, running
	realistic programs.  Let us note that these were in production
	4Q87, and do require a low-latency memory system (as in the
	Whitechapel Workstation version, for example).
	Thus, the ratio is 16.7MHz/12 VUP = 1.39 cycles/VUP, to go from
	raw cycles (including delays, but not MMU or cache-overhead),
	to actual measured performance.

	To estimate the performance of either a 20MHz R2000 or
	40MHz RPM40, and assuming that the RPM40's floating-point
	is as good as the R2000+R2010, and that it handles MMU and
	cache control as well as the R2000 (which does have these built in),
	and generally ignoring the scale-up effects of relatively slower
	main memory for both of them:

	20 / 1.39 = 14.4 VUP

which is well inside  "7-9X a 16.7MHz 68020 or 2-6X a Sun-3/260" estimates
given by various of the GE folks.

So far, all of this has been architectural, i.e., assuming that the
RPM40's software was as close to the R2000's as possible, i.e.,
what would it be if they had our compilers. It is hard to compare
against something you don't know, but I would observe:
	a) Our compiler suite took years of effort by people with
	tons of experience, including people on their 2nd or 3rd
	set of RISC compilers.
	b) This is not an advertisement, but those who've seen
	our compiler technology have some respect for it. In addition,
	the RPM40 will be slightly more difficult to do a good global
	optimizer for:
		assymetries in the register set(i.e., the ones above 15)
		the flag bits for word-type turn into something that
			needs to be tracked
	and some of the trickeries we used (like the
	global-register pointer) are less useful on a machine with
	less registers and shorter offsets.  I don't think anyone who's
	actually seen our compilers go would complain too much if I
	knocked off 25% for more vanilla compiler technology [this is
	well within our experiences looking at different levels of
	optimization.]  This one is arguable, but gets at the "if you
	had this machine today, how would it really act? comparison.

	14.4 VUP (for RPM40) * .75 = 12 VUP

	Given that architectural tradeoffs are what they are, it's always
	hard to separate a single feature out, but I trust this has
	given some ideas on the 16vs32-bit instruction issue, and backs up
	Craig Hansen's thoughts about cycle-count increases on 16-bitters.

4. Conclusion

This analysis is only relevant to running substantial programs
of the kinds found on general-purpose machines.  The RPM may well
do relatively better in more embedded-systems environments where the
tradeoffs work better, and there is nothing wrong/irrelevant with those
environments; they just aren't workstation environments, and whatever
one learns in either one doesn't necessarily translate to the other,
BECAUSE THE KINDS OF PROGRAMS YOU'RE RUNNING MAY HAVE DIFFERENT STATISTICS.

I apologize for any errors in this analysis, which took a fair
amount of time to put together, given the sketchiness of the info,
but which, I believe, is accurate to within the correctness
of the assumptions, and does reflect a modest amount of experience
with such things.  Feel free to fix it if it's wrong, and if it matters,
but as Mr. Jesup says, this has been beaten to death.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jesup@pawl8.pawl.rpi.edu (Randell E. Jesup) (03/11/88)

In article <1840@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Randall Jesup says the RPM40 stuff has been beaten to death, and I agree,
>but I do have some data that may be useful, which I'd have posted before,
>but has taken a while to write up, given that I've been busy.

	And I thank you for it.  Numbers/real comparison stuff is great,
"my chip is better than yours" stuff get old fast.  I'll keep my comments
to a minimum, given the length of the original posting.

>Summary:
>	40MIPS peak -> 20MIPS-cycles -> 14-15 VUP -> 12 VUP:
>	(matches 7-9X 68020, or 2-6X Sun-3/260 estimates given by GE folks)

>------STOP NOW UNLESS YOU ARE A GLUTTON FOR DETAILED ARCHITECTURAL ANALYSIS---

>	a) Programs are much bigger, i.e., they may easily have
>	many megabytes of code, and 100's of megabytes of data, for
>	single processes. (lots already = 16-32MB, at least)

	The average program run isn't anywhere near that big, though.

>	c) For general use, you sometimes have to make worst-case
>	assumptions on the sizes of addresses, for example.  You usually
>	must assume that addresses are "large", rather than "small",
>	because you have no idea how big something is until it's linked,
>	and because you may be compiling code for libraries where you
>	can't know how big the final objects will be.  Hence, shortcuts
>	often good in dedicated environments (i.e., "short" addresses are
>	OK) don't work.

	Here I have to disagree with you.  How about this:  The object code
is compressed assembler, the linker links (with compressed assembler libraries)
and does reorganization/assembly.  This allows you to always know the process
instruction memory size (though the data side is a bit tougher).

>Items marked * have a little more (but educated) guesswork than the others,
>which are computed with high confidence from extensive data.
>Items marked "+" are relevant to the 16-vs-32 issue.

>A1.	.336 <= 3 cycles of load latency
>	R2000 data: 21% of instruction cycles are loads (16-29%)
>		expected fill rates for load-delays are:
>		1: 70% (we got 68% on this bunch, and it contained one nasty
>			program that only filled 48%).
>		2: 30%
>		3: 10%

	Here another affect come into play: two address ALU instructions.  When
the compiler wants 3-address, it has to generate 2 ALU instructions.  This
means a slightly higher percentage of ALU ops to load/store ops, and therefor
better filling of load delays.  That should increase the 30% and 10% a fair
amount, but I don't know how much.  Also, ALU ops with prefixes will help
here to, filling otherwise useless cycles.

>	A1B: assuming that similar-quality optimizers and reorganizers are used

	See note above re: linker; otherwise OK.

>A2.+	.279 <= Loads/stores use 4-bit immediates
>		7.1% of the loads/stores could use a 4-bit offset (3.2-18%)
>		R2000 load/stores have 16-bit offsets, which require the RPM40
>		2 cycles to obtain:

	Note:  for word load/stores, that 4 bit immediate is shifted left 2,
effectively 6 bits in byte addressing.  Halfwords (minor) get shift left 1.
Also, what about >16 bit offsets?  Obviously, the R2000 will do a mov; add
immediate 32-bit; load sequence (well, a guess), so it hard to compare here.
Luckily, >16 bit offsets are rare (note the shifting comes into play again
here.)

>A3.+	.035 <= Add/sub immediates [.050]
>		The 3.5% assumes there are both add/sub immediate (i.e.,
>		4 bits + implied sign).  If there is really just add-immed,
>		and you get 3 bits + sign, use the number 5%

	I'm not sure I understand you, for reference the leftmost bit of the
immediate is extended.  All ALU ops can use immediates.

>A4.+	.013 <= Compare immediates (COND)
>		About 1.3% fit in 16-bits, but not 4-bits, and thus
>		require 1 PFX:

	Once again, what about >16 bits?  (Not TOO common, but not totally
uncommon either.

>A6.+	.011 <= Load-immediate
>	This is used to stick constants in registers for arguments,
>	compares, etc. (it's actually an add to zero, or something like that).
>	I assume the RPM-40 has an equivalent.

	It's MOV with the source immediate. (Remember, no 3-addr ALU ops)

>		About 1.1% (of total) fit in 16-bits, but not 4-bits, and thus
>		require 1 PFX:

	> 16 Bits?

>A7.+	.018 <= Load-upper-immediate
>	This puts zeroes in low bits, and 16 bits in the top of a register.
>	It's often used for setting up 32-bit address, for example,
>	or a long constant.
>	R2000 data: 1.9% (.5-6.9%) are LUIs
>		About 1.8% (of the total) need the top 4 bits, and thus
>		would need an 2 PFXs, rather than 1 (or 4, rather than 3):

	One NEVER needs more than 3.  3 PFX's + 4 bits in the instruction =
40 bits.  2 PFX's + 4 bits = 28.  1 PFX + 4 bits = 16.  (this is for ALU ops
and loads/stores.  For branches/XP stuff, the instruction has more immediate
bits available (12).

>A8.	.024 <= Jump-and-link
>	JAL is done in the RPM-40 by MOV (PC) somplace; BRA...

	Close, no cigar.  BRA; <fill>; MOV or STW.  This helps fill branch
delay, which otherwise might be tough (subroutine calls are worse than
other types of branches, usually).

>	R2000 data: 1.2% of the instruction cycles (.3->2.0%) are
>		calls (JAL, JALR), which give 28-bits of byte addressing.
>	On the RPM-40,  a JAL is done by a 2-instruction pair, which
>		I assume is a MOV of  the PC to the return-address register,
>		followed by a branch, or something equivalent, although I
>		haven't seen the exact sequence.  I assume that the return
>		jump is a single cycle.
>	Assuming programs of the size above, the natural code to
>		generate would be:
>		MOV (to save PC)
>		PFX
>		BRA
>	i.e., because 12-bits of displacement are enough for local branches,
>	(99+% on R2000), but not for globals. This adds 2 cycles/call.

	For internal calls, it's often enough.  Do you have data on call
distances?  I would suspect even in moderate sized programs it's at least
10%, maybe as high as 25%; and in small utilities 25%+.

>	Assuming that branches get an extra bit of addressabiltiy (halfword
>	alignment), the RPM gets 13 (BRA) + 12 (PFX) bits, for 32MB of code,
>	which is adequate for all of the cited programs, although not so
>	for others.  If by some chance the RPM uses a byte offset, getting

	Yup, 32 Meg, you got it.  Why generate bits that are always 0?

>	cycle-expansion: .012 * 2 = .024

	Net result: .012 * 1 = .012

>A9*.	.010 <= Load/store hazard
>	The RPM40 is supposed to have a load/store hazard if a store occurs
>	at the exact right number of cycles after a load.  This is not data
>	we keep (since we don't have this hazard), and it's hard to compute.
>	Here's a quick guess:

	It's a bit wierder than you suppose, but the number sounds OK.

>A12*.	.033 <= Misses due to branch-target-cache misse
>	According to the talk, there was supposed to be a hit-rate >90%.
>	I have no data on the kinds of programs used to calibrate that.
>	Big programs are clearly worse than little programs in this regard;

	Our data coimes from BIG vax application programs, if I remember
correctly.  (BIG!)
	With small programs it approaches 99%.
	I'm assuming this was talked about at ISSCC, since the cache in general
was: with a good optimizer/reorganizer, you can override the hardware control
of replacement of targets, and take over <some, all> of the targets for 
whatever addresses you like.  What this actually gets you in practice is a
good question.

	MAJOR BUG:
	Your number left something out: you must subtract whatever the R2000's
cache miss times penalty from the RPM40's.
	I'll assume it all equals out (we MAY win, may not).

>A13*.	.050 <= Lack of ALU forwarding
>	From all of the discussion, I can't tell how much bypassing the
>	RPM40 does, or doesn't do.  From the various hints, it sounds like:
>	a) you can store the output of an ALU op with no delay
>	b) you can't otherwise use the result of an immediately-preceding
>	ALU op.

	Correct.

>	I'll guess 5%, which I think is reasonably conservative:
>	cycle expansion: .05

	Seems OK.  As you said, prefixes help here.

>A14*.	.020 <= Multiply-divide
>	I'm not exactly sure how this is implemented on the RPM,
>	although it probably doesn't have a fast multiplier on the CPU.
>	(Maybe it's on the FPU, in which case some of this would go away).

	I won't say how we do it (yet), but 32x32 multiplies are 15 or 16
(can't remember exactly) cycles.  Divides take longer but about equivalent
to R2000.

>	R2000 data: across the benchmarks, about 2.7% of the time was
>	spent in multiply/divide interlock cycles [we have a 12/35 cycle
>	multiple/divider], which includes the effects of having some
>	instructions scheduled into the latency of an asynchronous unit.
>	Assume that the interlock cycles are split 50/50 (faster mults,
>	but more mults than divides, and that RPM's mults take about 3X,
>	i.e., +2 factor, you get:
>	cycle expansion: .013 * 2 = .026, which I'll round down to 2.

	So this drops to just a little.

>A15*.+	.040 <= 2-address registers, rather than 3-address ones

	Sounds OK, tough to figure numbers on (basic architecture diff).

>A16*.	.039 <= Less registers (21 instead of 32)
>	Note: this is more of an effect for large, complex programs than
>	small ones.  A cross-check is that you get the same number if
>	you assume even 1 more register is saved/restored on the average per
>	function call, (PFX+SW, PFX+LW for 4 cycles), with an average
>	of 100 instructions/call (representative).

	This also seems reasonable.

>A17*.	.020 <= Architectural Reorganization issues
>	Several of the times above related to reorganization:(A1, A9, A13).
>	A number of factors appear to make the RPM40 more difficult to
>	reorganize for:
>	1) PFX instructions are difficult, if not impossible to move
>	around away from the instructions prefixed (unlike the R2000's
>	style of using a bypassed GP register).

	True, though they do help a bit in filling load delays.

>	2) The instruction that sets the SR2 to get partial-word operations
>	is hard to move "too far" away from the instruction(s) that need it.

	Also true, luckily most subroutines only use one of {signed,unsigned}
{halfword,byte}, if they use any at all.  Also, global optimizers allow
better knowlege of what subroutine calls do to Sr2.

>	3) The load/store pipeline hazard must be taken care of.

	True, but it's easy.  In practice it doesn't hurt much.

>	4) If there is no forwarding, that has to be reorganized also.
>	Reorganization is very important for many RISC processors.
>	The RPM40 has a some extra things to worry about, and one less
>	(R2000 branch delay slot).  I'd guess the overall hit to be 2%.

	Huh?  We have a two-cycle branch delay to fill (did you think we
had none???)  Experience on RPM40 shows it can be filled fairly well (good
place for stores, for example.)

>A18*.	?? <= Coprocessor issues
>	I haven't really touched on this very much, as we don't know much,
>	except that even a few cycles extra latency getting to a floating-point
>	unit can hurt a lot, except in applications that naturally pipeline
>	very well, or if the FPU has long cycle-count operations in the first
>	place.  XPLD without XPST is a little puzzling. 

	Note that load = approx 2x store frequency.  Should be obvious (and
there are other ways to load/store XP stuff.)  Suffice it to say, given the
envirionment we designed for, XP stuff was VERY carefully designed.  There
should be no loss, and probably even a gain, vs R2000.

>A19*.	-.067 <= Contraction issue (R2000 branch nops)
>	The R2000 loses 6-7% to unfilled branch delay slots,
>	which the RPM40 does not. (of course, the RPM40 takes hits in other
>	areas of branching, but we've already included them).

	Drop this (though we do get better than average stats for branch
fills, I predict, especially on CALL's).

>A20*	.010 <= Miscellaneous
>	There are a bunch of integer-related issues that I can only guess
>	at, but observing that there are 4 bits in the opcode field for
>	ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's
>	ops are found in the RPM, although I don't know which ones they
>	might be.  Also, if the immediate field encodes 16-bit shifts,
>	that will help, and hurt, if not.

	Actually, we had leftovers we had to figure out what to use for.
(one became RADD (I think this was my idea; Dennis, do you remember?))
	Shifts/rotates are like all other ALU ops re: immediates.

>Bottom line, given everything I know:

	I'll show results modified re: the above comments.

>A1.	.336 <= 3 cycles of load latency @
	.300 guess
>A2.+	.279 <= Loads/stores use 4-bit immediates @
	.240 guessed at % that could use a 6 bit immediate (effective)
>A3.+	.035 <= Add/sub immediates @
>A4.+	.013 <= Compare immediates @
>A5.+	.013 <= Logical immediates @
>A6.+	.011 <= Load-immediate @
>A7.+	.018 <= Load-upper-immediate @
	.001 for a process smaller than 28 bits of instructions, we don't
	     ever need to load the top 4 bits for address constants.  For
	     integer constants, if the top 5 are all 1 or 0, we don't need the
	     top 4 (almost always the case).
>A8.+	.024 <= Jump-and-link @
	.012 see above
>A9*.	.010 <= Load/store hazard
	.005 I think we can do better at avoiding the problem, loads tend to
	     cluster near beginnings of block, stores at the end.  Not big
	     either way.
>A10*.+	.098 <= Conditional branch
>A11*.+	.010 <= Partial-word load/store @
>A12*.	.033 <= Misses due to branch-target-cache misse
	.000 mistake - R2000 has misses too.
>A13*.	.050 <= Lack of ALU forwarding
>A14*.	.020 <= Multiply-divide
	.005 See above
>A15*.+	.040 <= 2-address registers, rather than 3-address ones
>A16*.+	.039 <= Less registers (21 instead of 32, sort of 
>A17*.	.020 <= Architectural Reorganization issues
>A18*.	?? <= Coprocessor issues
>A19*.	-.067 <= Contraction issue (R2000 branch nops)
	.000 Misconception
>A20*	.010 <= Miscellaneous
	.000 I don't really think we lose anything signifigant here
>
>Total	0.992	cycle expansion

Amusing: I came out with (after my mods above):
	0.892
Pretty close to what you have, all in all.

>	Thus, for cycle counts (ignoring cache-miss & MMU overhead),
>	a 40MHz RPM would act more-or-less like a 20MHz R2000, i.e.,
>	it would run twice as many (instruction cycles + delay cycles).

	Effectively, I get close to the same (1.9 RPM cycles = 1 R2000 cycle).

>	20 / 1.39 = 14.4 VUP
>
>which is well inside  "7-9X a 16.7MHz 68020 or 2-6X a Sun-3/260" estimates
>given by various of the GE folks.

	Wow.  I guess we are pretty good guessers (actually, the numbers
came from things like this, but against 68000, 1750, etc, etc.)

>So far, all of this has been architectural, i.e., assuming that the
>RPM40's software was as close to the R2000's as possible, i.e.,
>what would it be if they had our compilers. It is hard to compare
>against something you don't know, but I would observe:
[stuff about the good MIPS compilers, estimates that equals 25% loss to rpm40]

	Well, since "GE is not in the computer business", I doubt we'll
ever know for sure.  But I'm not interested in what GE/USG does with it,
just what the architecture wins or loses.

>4. Conclusion

>This analysis is only relevant to running substantial programs
>of the kinds found on general-purpose machines.  The RPM may well
>do relatively better in more embedded-systems environments where the
>tradeoffs work better, and there is nothing wrong/irrelevant with those
>environments; they just aren't workstation environments, and whatever
>one learns in either one doesn't necessarily translate to the other,
>BECAUSE THE KINDS OF PROGRAMS YOU'RE RUNNING MAY HAVE DIFFERENT STATISTICS.

	Yup!  I suspect that for the instruction mixes we optimized for,
we do somewhat better than the 1.9x I arrived at vs R2000 (BIG programs, but
different types of BIG programs (hint hint)).  Maybe much better (1x).

>I apologize for any errors in this analysis, which took a fair
>amount of time to put together, given the sketchiness of the info,
>but which, I believe, is accurate to within the correctness
>of the assumptions, and does reflect a modest amount of experience
>with such things.  Feel free to fix it if it's wrong, and if it matters,
>but as Mr. Jesup says, this has been beaten to death.

	The errors were minor, and helped to correct a few misconceptions
about what the rpm-40 instruction set does (ex: branches and CALL).  I
thank you for all the work you did on this, this is the type of thing I
read comp.arch for.  It is pretty definitive, no need to quibble over
minor points.  As I said in a previous message, we didn't optimize for
workstations.  We can do them, but it's a toss-up (confirmed by this
article) whether we're better in a workstation envirionment than the R2000.
Of course, in different cases we do better (even large cases, just different
problems/instruction mixes).  No suprise.  The R2000 does very well at
what it was designed for, Unix boxes and workstations, and we read whatever
articles you had published when we were designing the Rpm-40.

>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
>UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
>DDD:  	408-991-0253 or 408-720-1700, x253
>USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/11/88)

This posting is gonna be short on real info, but...
 I just had to thank John Mashey for his analysis. It looks
 very good. I'll go over it in detail and e-mail any
 comments I have back to him.

Nice work, John. WE should have done it. Unfortunately,
since GE "is not in the computer business", the
broad-based and thorough knowledge of UNIX and compilers
of which you are possesed is not directly available to
us : We only know what we read in the papers, essentially.
Sigh.

And yes, I have to be truthful : although I'd like
to think that the architecture has SOMETHING to do
with RPM40 performance, maybe just a little comes
from the VHSIC-compatable 1.25 micron AVLSI process.
Well, maybe more than a little :-)

--
 Dennis O'Connor   oconnor%sungod@steinmetz.UUCP  ARPA: OCONNORDM@ge-crd.arpa
         ( I wish I could be civil all the time, like Eugene Miya )
  (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

mash@mips.COM (John Mashey) (03/14/88)

In article <9895@steinmetz.steinmetz.UUCP> oconnor%sungod@steinmetz.UUCP writes:
>This posting is gonna be short on real info, but...
> I just had to thank John Mashey for his analysis. It looks
> very good. I'll go over it in detail and e-mail any
> comments I have back to him....

thanx for the kind words.  I'll be looking forward to the comments.
You may want to send me a test message though: I've tried several times
to send mail to ....!steinmetz!.., by various routes, and the net demons
are still winning.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (03/16/88)

In article <514@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
...
>	And I thank you for it.  Numbers/real comparison stuff is great,
>"my chip is better than yours" stuff get old fast.  I'll keep my comments
>to a minimum, given the length of the original posting.

Yes, this is much more useful.  Thanx for the clarifications.  This posting
is mostly comments in response to your comments where I wasn't sufficiently
clear on the first posting.

>>A1.	.336 <= 3 cycles of load latency
..
>	Here another affect come into play: two address ALU instructions.  When
>the compiler wants 3-address, it has to generate 2 ALU instructions.  This
>means a slightly higher percentage of ALU ops to load/store ops, and therefor
>better filling of load delays.  That should increase the 30% and 10% a fair
>amount, but I don't know how much.  Also, ALU ops with prefixes will help
>here to, filling otherwise useless cycles.
Yes, this should help a little. The specific cases are the R2000 ones where
an lnop is immediately followed by:
	a) a 3-register ALU op
	b) an ALU op that uses an immediate >4 bits.
I have no data to figure out how often those are.  From looking at code,
it is sad, but true, that many lnops are structurally there almost no
matter what the architecture&compiler system are (with possible exceptions
of the VLIW systems).  I.e., they are the things that result from code like:
	if ((p != NULL) && (p->thing)) p = p->nextlink;
which gives code like:
	lw	reg1,p
	nop
	beq	reg1,0,1f
	nop
	lw	reg2,thingoffset(p1)
	nop
	beq	reg2,0,1f
	nop
	lw	reg1,nextlink(reg1)
	nop
1:
	maybe you get to fill the branch delays by grabbing the instruction
	at 1: in, (in fact, we often find a LUI or LI there and move it),
	but after looking at a lot of code, I can't for the life
	of me figure out how you frequently get to fill most of the load-nops,
	in an R2000, an RPM-40, or most other RISCs I've seen...UNIX kernel
	code is just filled with this kind of thing.

>>A2.+	.279 <= Loads/stores use 4-bit immediates

>	Note:  for word load/stores, that 4 bit immediate is shifted left 2,
>effectively 6 bits in byte addressing.  Halfwords (minor) get shift left 1.
>Also, what about >16 bit offsets?  Obviously, the R2000 will do a mov; add
>immediate 32-bit; load sequence (well, a guess), so it hard to compare here.
>Luckily, >16 bit offsets are rare (note the shifting comes into play again
>here.)
(Actually, what we do is:  lui reg1,address<<16; lw reg2,address>>16(reg1).
There was an article in ASPLOS this summer that talked about addressing).
I already counted this in the LUI analysis that was later, i.e., I had no
way to disentangle why LUI's were there, and only have the data on
an instruction-by-instruction basis for the offsets.  The extra 2 bits for
words definitely help (especially in non-stack, non-GP references),
so I'd buy the .240.

>>A3.+	.035 <= Add/sub immediates [.050]
>>		The 3.5% assumes there are both add/sub immediate (i.e.,
>>		4 bits + implied sign).  If there is really just add-immed,
>>		and you get 3 bits + sign, use the number 5%
>	I'm not sure I understand you, for reference the leftmost bit of the
>immediate is extended.  All ALU ops can use immediates.
That helps clarify it: the number should be .050.  The .035 included the
possibility that for add/subtract, you interpreted the 4-bit immediate as
a zero-extended (i.e., positive) number, in order to get 1 more bit of
immediate field, since the add/sub immediates overlap if you sign-extend
the immediate.  The .050 guess assumes you can add [-8..7] and subtract
[-8..7], or add/subtract [-8..8] by flipping add/sub as needed.  The .035
number assumed that you could get to [-15..15].

>	One NEVER needs more than 3.  3 PFX's + 4 bits in the instruction =
>40 bits.  2 PFX's + 4 bits = 28.  1 PFX + 4 bits = 16.  (this is for ALU ops
>and loads/stores....
Oops, I computed it right, but said it wrong: (I was thinking of the 4-bits
as a PFX)....

>>A8.	.024 <= Jump-and-link
>>	JAL is done in the RPM-40 by MOV (PC) somplace; BRA...
>	Close, no cigar.  BRA; <fill>; MOV or STW.  This helps fill branch
>delay, which otherwise might be tough (subroutine calls are worse than
>other types of branches, usually).

(Actually, JAL's fill pretty well with argument-prep instructions.)
Thanx for the clarification: am I still wrong to think that there
are 3 cycles (of actual work) needed for a large-size branch-and-link?
(Ignoring branch-delay slots for the moment).
...
>	For internal calls, it's often enough.  Do you have data on call
>distances?  I would suspect even in moderate sized programs it's at least
>10%, maybe as high as 25%; and in small utilities 25%+.
Call distances is something I don't have in the standard reports.
...
>	MAJOR BUG:
>	Your number left something out: you must subtract whatever the R2000's
>cache miss times penalty from the RPM40's.
> 	I'll assume it all equals out (we MAY win, may not).

(I included the generic cache-miss overhead in the factor that converted
cycles->VUPS, and assuming that the RPM was using SRAM as a cache,
as needed in a general-purpose environment. See the summary for more info.
The R2000's single branch-delay slot covers the time to access to I-cache.)
...
>>	2) The instruction that sets the SR2 to get partial-word operations
>>	is hard to move "too far" away from the instruction(s) that need it.
>
>	Also true, luckily most subroutines only use one of {signed,unsigned}
>{halfword,byte}, if they use any at all.  Also, global optimizers allow
>better knowlege of what subroutine calls do to Sr2.

Many user programs don't use halfwords.  Systems-type programs mix bytes
and halfwords a lot.  (experience, but no data, unfortunately).

>>	4) If there is no forwarding, that has to be reorganized also.
>>	Reorganization is very important for many RISC processors.
>>	The RPM40 has a some extra things to worry about, and one less
>>	(R2000 branch delay slot).  I'd guess the overall hit to be 2%.

>	Huh?  We have a two-cycle branch delay to fill (did you think we
>had none???)  Experience on RPM40 shows it can be filled fairly well (good
>place for stores, for example.)
(I thought you might have one, but I didn't have any data.  What I've heard
(from Stanford MIPS-X) is that the 2nd branch-delay slot is fairly
hard to fill.  We agree that stores often go into the branch delay.)

As can be seen, reorganization statistics are nontrivial to estimate!
...
>	Note that load = approx 2x store frequency.  Should be obvious (and
>there are other ways to load/store XP stuff.)  Suffice it to say, given the
>envirionment we designed for, XP stuff was VERY carefully designed.  There
>should be no loss, and probably even a gain, vs R2000.

(Possible: we'll know when we see it, although the R2000/R2010 is a very
low latency design, with the coprocessor watching the instruction stream,
and doing direct loads/stores at the right times.)

>>A20*	.010 <= Miscellaneous
>>	There are a bunch of integer-related issues that I can only guess
>>	at, but observing that there are 4 bits in the opcode field for
>>	ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's
>>	ops are found in the RPM, although I don't know which ones they
>>	might be.  Also, if the immediate field encodes 16-bit shifts,
>>	that will help, and hurt, if not.
>
>	Actually, we had leftovers we had to figure out what to use for.
>(one became RADD (I think this was my idea; Dennis, do you remember?))
>	Shifts/rotates are like all other ALU ops re: immediates.

What I meant was: can you do left-shift 16 or right-shift 16 in a single
cycle, with no PFX (depends on how you interpret the immediate field)?
Not that it matters a lot, but it probably goes up to .005 if you can do
shifts by 15, but not 16, in one cycle.  Rather than wasting space on
torturous reasoning on the opcode field, I'll wait until the opcode
list can be published.

>>A1.	.336 <= 3 cycles of load latency @
>	.300 guess [OK]
>>A2.+	.279 <= Loads/stores use 4-bit immediates @
>	.240 guessed at % that could use a 6 bit immediate (effective) [OK]
>>A3.+	.050 <= Add/sub immediates @
>>A4.+	.013 <= Compare immediates @
>>A5.+	.013 <= Logical immediates @
>>A6.+	.011 <= Load-immediate @
>>A7.+	.018 <= Load-upper-immediate @
>	.001 for a process smaller than 28 bits of instructions, we don't
>	     ever need to load the top 4 bits for address constants.  For
>	     integer constants, if the top 5 are all 1 or 0, we don't need the
>	     top 4 (almost always the case). 
	Actually, almost all of these are for data-address, or logical
	constants/masks, etc. They're almost never for instructions, so I'd
	stick with the original number: .018.
>>A8.+	.024 <= Jump-and-link @
>	.012 see above
	I still didn't quite understand the change here, as I still think
	there are 3 cycles (PFX;BRA;fill;MOV)  where the R2000 would have
	(JAL;fill), with both architectures having the same sorts of things
	moving into the fill.  There is room for discussion on the fill issue,
	and on if you could use short branches in practice. Try:
	.020

>>A9*.	.010 <= Load/store hazard
>	.005 I think we can do better at avoiding the problem, loads tend to
>	     cluster near beginnings of block, stores at the end.  Not big
>	     either way.
		OK, hard to guess.
>>A10*.+	.098 <= Conditional branch
>>A11*.+	.010 <= Partial-word load/store @
>>A12*.	.033 <= Misses due to branch-target-cache misse
>	.000 mistake - R2000 has misses too.
	Actually not, or rather: the R2000 has cache misses in its external
	cache; the RPM has cache misses in its internal cache + misses in
	the external cache (this was assuming a design where the 64KI+64KD
	memories were caches, rather than the only memory.  The .033 here was
	the EXTRA penalty for taking internal cache misses (which might well
	be external cache hits).  I subsumed all of the external cache missing
	into the cycles->VUP conversion, since I didn't have a better way
	to get at it, and all of the cycle counts so far were independent
	of external cache designs. (Ask again if this doesn't make sense).
	As noted, the .033 number depended on the 90% rate, and would go up
	or down depending on the environment, but the number definitely is
	not zero, unless I misunderstand how the RPM works.
	.033

>>A13*.	.050 <= Lack of ALU forwarding
>>A14*.	.020 <= Multiply-divide
>	.005 See above
		OK, I can buy this.
>>A15*.+	.040 <= 2-address registers, rather than 3-address ones
>>A16*.+	.039 <= Less registers (21 instead of 32, sort of 
>>A17*.	.020 <= Architectural Reorganization issues
>>A18*.	?? <= Coprocessor issues
>>A19*.	-.067 <= Contraction issue (R2000 branch nops)
>	.000 Misconception
		OK.
>>A20*	.010 <= Miscellaneous
>	.000 I don't really think we lose anything signifigant here
	[OK, if you can do shift left/right 16 in 1 cycle, else .05].
>>Total	0.992	cycle expansion
>Amusing: I came out with (after my mods above):
>	0.892
>Pretty close to what you have, all in all.

Your revisions, plus my revisions to your revisions come out at .965.
My guess is that, unless there's some fundamental misunderstanding left,
that there is probably 90% confidence that the most programs of the sorts
modeled, would lie in the range 0.9 - 1.0 (to 1 sig. digit!).
Thanx for the corrections; I think A8 & A12 are the only spots that
might need more clarification, if this last wasn't right.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jesup@pawl17.pawl.rpi.edu (Randell E. Jesup) (03/18/88)

In article <1878@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
:In article <514@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
...
:>>A1.	.336 <= 3 cycles of load latency
:..
:>	Here another affect come into play: two address ALU instructions.  When
:>the compiler wants 3-address, it has to generate 2 ALU instructions.
...
:Yes, this should help a little. The specific cases are the R2000 ones where
:an lnop is immediately followed by:
:	a) a 3-register ALU op
:	b) an ALU op that uses an immediate >4 bits.

	I think you mean a load followed by an a) or b) and a nop.

:I have no data to figure out how often those are.  From looking at code,
:it is sad, but true, that many lnops are structurally there almost no
:matter what the architecture&compiler system are (with possible exceptions
:of the VLIW systems).  I.e., they are the things that result from code like:
:	if ((p != NULL) && (p->thing)) p = p->nextlink;
:which gives code like:
:	lw	reg1,p
:	nop
:	beq	reg1,0,1f
:	nop
:	lw	reg2,thingoffset(p1)
:	nop
:	beq	reg2,0,1f
:	nop
:	lw	reg1,nextlink(reg1)
:	nop
	I think you left out:
	sw	p,reg1
:1:
	Here's how I'd do it on the rpm40 (and I hope the reorganizer would to)
	ldw	.1, p
	nop
	nop
	nop
	cond	neq, .1, .0
	ldw	.2, thingoffset[.1]
	cond	neq, .1, .0
	ldw	.1, nextlink[.1]
	nop
	cond	neq, .2, .0
	pfx	<for the store, if not needed, move cond down one>
	stw	p, .1

Note that there are NO branches in it.  The first load is hard to fill, you
have to hope that you can do something for the next statement in those nops
(I would bet you could use 1 of them, maybe 2, depends a lot on the code.  Or
you could move stuff down from above, like a store.)

This is a good example how the cond instruction can help avoid branches
(and thus branch delays and branch misses.)

:>>A2.+	.279 <= Loads/stores use 4-bit immediates
:
:>	Note:  for word load/stores, that 4 bit immediate is shifted left 2,

:  The extra 2 bits for
:words definitely help (especially in non-stack, non-GP references),

	Why non-stack?  Stack refs are usually small offsets.  I agree about
globals, even with a global pointer to reduce the number of PFXs.

:>>A8.	.024 <= Jump-and-link

:>	Close, no cigar.  BRA; <fill>; MOV or STW.  This helps fill branch

:Thanx for the clarification: am I still wrong to think that there
:are 3 cycles (of actual work) needed for a large-size branch-and-link?
:(Ignoring branch-delay slots for the moment).

	You're correct, if the call is to greater than 2K instructions
away from the current address.

:The R2000's single branch-delay slot covers the time to access to I-cache.)

	Does that mean you can get an instruction 1 cycle after you output
it's address on the address lines?

:(I thought you might have one, but I didn't have any data.  What I've heard
:(from Stanford MIPS-X) is that the 2nd branch-delay slot is fairly
:hard to fill.  We agree that stores often go into the branch delay.)

	One thing about the PFXs in the rpm40 is that often you see this:
BRA; PFX; STW.  The stanford numbers make a number of assumptions that
aren't valid here, though 2 slots are harder to fill than one in general.

:>>A20*	.010 <= Miscellaneous

:What I meant was: can you do left-shift 16 or right-shift 16 in a single
:cycle, with no PFX (depends on how you interpret the immediate field)?
:Not that it matters a lot, but it probably goes up to .005 if you can do
:shifts by 15, but not 16, in one cycle.

	Ok, I understand now.  I think for shifts it acts as an unsigned number
if I remember correctly (Dennis?)

:>>A7.+	.018 <= Load-upper-immediate @
:>	.001 for a process smaller than 28 bits of instructions, we don't
:>	     ever need to load the top 4 bits for address constants.  For
:>	     integer constants, if the top 5 are all 1 or 0, we don't need the
:>	     top 4 (almost always the case). 
:	Actually, almost all of these are for data-address, or logical
:	constants/masks, etc. They're almost never for instructions, so I'd
:	stick with the original number: .018.

	Once again, if the process has less than 28 bits of data memory, you
don't need it.  Few programs require 256 Meg of data.  And of course 99+% of
constants are less than 28 bits long, if you sign-extend.  So I'll stick with
the low number.

:>>A12*.	.033 <= Misses due to branch-target-cache misse
:>	.000 mistake - R2000 has misses too.
:	Actually not, or rather: the R2000 has cache misses in its external
:	cache; the RPM has cache misses in its internal cache + misses in
:	the external cache (this was assuming a design where the 64KI+64KD
:	memories were caches, rather than the only memory.  The .033 here was
:	the EXTRA penalty for taking internal cache misses (which might well
:	be external cache hits).  I subsumed all of the external cache missing
:	into the cycles->VUP conversion, since I didn't have a better way
:	to get at it, and all of the cycle counts so far were independent
:	of external cache designs. (Ask again if this doesn't make sense).
:	As noted, the .033 number depended on the 90% rate, and would go up
:	or down depending on the environment, but the number definitely is
:	not zero, unless I misunderstand how the RPM works.
:	.033

	90% is something of lower bound, typical number for large programs
should be more like 95%, 99%+ for small ones.
	I take it the r2000 never misses on a branch if the destination
is in it's 64K cache?

:Your revisions, plus my revisions to your revisions come out at .965.
:My guess is that, unless there's some fundamental misunderstanding left,
:that there is probably 90% confidence that the most programs of the sorts
:modeled, would lie in the range 0.9 - 1.0 (to 1 sig. digit!).

	I'll agree with you here, though I'll emphasize (as Dennis and I have
said before) that the rpm-40 was optimized for a different range of
applications than the r2000, and we're comparing here in the r2000's specialty.
I suspect that for embedded systems and their programs, they would be
close to equal (on a per-cycle basis), which would mean the 40Mhz RPM-40
would perform as well as a 35+Mhz r2000 (note: total specualtion here!)
The RPM-40 can do large systems, but may not be the ultimate choice for
such, though it can do well.

:-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
:UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
:DDD:  	408-991-0253 or 408-720-1700, x253
:USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Thanks again for such good analysis.  I think this has been pretty well
covered now; if we continue, lets start discussing specific RISC features
and their plus/minuses (abstract, not rpm-40 vs r2000).

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

kers@otter.hple.hp.com (Christopher Dollin) (03/22/88)

"davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen)" said both:

|As nearly as I can determine, workstations are used for graphics,
|software development, word processing, ...

and:

| ... generally the programs run are less than 1 min cpu, less than 
| 2MB memory ...

I'm doing software development on a workstation. There are two or three
Poplog windows (mail, X-window playthings, project work); each of these is
at least 1.5Mb [overkill for mail and X, but there it is]. Compiling the source
of the program[*1] (Lisp -> loaded binary) takes about 5min - of course that's
only when I want to start afresh - and running the program on a reasonably 
large piece of data has, on occasion, taken as long as four hours; a rapid 
bout of optimisation followed! The program can eat VM like C eats sanity. A
quick check: popmemused=523244 longwords; that's 2Mb heap in use, not counting
the Poplog kernel, the Common Lisp susbsystem, and the loaded code of the
program, all of which are locked down and don't count as active heap.

Does using LaTeX count as word processing? If so, I have a document here that
takes about 5min elapsed (last time I counted) to format. That's probably
over 1min CPU - at least it's not far off.

I wouldn't say that this was "really large" stuff, but it beats the 1min 2Mb
limit. 

[*1] Sorry to be vague, but you know how it is.




Regards,
Kers                                    | "Why Lisp if you can talk Poperly?"