[comp.arch] Uses for unusual instructions

cik@l.cc.purdue.edu (Herman Rubin) (11/25/89)

Some time ago, I posted to this group certain instructions which are cheap
in hardware and expensive in software.  I wish to discuss these and give
partial examples; the full examples are too lengthy and would provide little
insight.

The two instructions are (1) transfer conditionally on a bit, and then
forget about that bit, and (2) find the spacing between ones in a random
bit stream.  There is also the question of the relative frequency of these
operations; I will give adequate information on both points.

For the conditional transfer (ct), four instructions are needed in general
in software, though it can often be arranged that one of them is omitted.
A typical sequence would be

	count -= 1;
	if (count < 0) call getmorebits;
	shift bits by 1;
	transfer on {overflow, negative, low-order bit};

It is unnecessary to clear the bit; it will not be looked at again.
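
To make the cost concrete, here is a minimal C sketch of that four-step
sequence, assuming a global bit buffer and a hypothetical getmorebits()
routine that refills it; the proposed ct instruction would collapse all
of this into a single transfer.

        extern void getmorebits(void);  /* hypothetical: refill bitbuf,
                                           reset bitcount                */
        static unsigned long bitbuf;    /* buffer of random bits         */
        static int bitcount;            /* bits remaining in bitbuf      */

        static int ct(void)             /* 1 = take the transfer         */
        {
            int bit;

            if (--bitcount < 0)
                getmorebits();
            bit = (int)(bitbuf & 1);    /* low-order bit decides         */
            bitbuf >>= 1;               /* shift by 1; the bit is never
                                           looked at again               */
            return bit;
        }

A use such as "if (ct()) goto loop;" then stands for the single hardware
instruction being proposed.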

For the spacing between bits, doing it in a straightforward manner
involves similar problems.  On the machines I have seen, the instructions
would be

	i = j;
	clear *i; 	/* clear the bit */
	j = &( next one);
	if (not found) goto rareprocedure;
	result = j-i;

There are ways of eliminating the second operation in many cases, and there
are even ways to produce a buffer more quickly.
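
Staying with the shift-buffer style of the ct sketch above (rather than
the pointer version just shown), a software spacing operation is a scan
loop, again assuming the hypothetical getmorebits() refill:

        static int spacing(void)        /* distance to the next one bit */
        {
            int d = 1;

            for (;;) {
                if (--bitcount < 0)
                    getmorebits();
                if (bitbuf & 1) {
                    bitbuf >>= 1;       /* consume the one bit          */
                    return d;
                }
                bitbuf >>= 1;           /* consume a zero bit           */
                d++;
            }
        }

The per-bit loop is exactly what makes this expensive in software; a
find-first-one instruction does the whole scan in one step.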

Now as for use.  Here is a computationally moderately efficient method of
generating exponential random variables from uniform input.

loop:	i = spacing between ones;
	switch (i){
	case 1: ct loop;	/* on a random bit, re-enter the loop ...  */
		int += 1;	/* ... otherwise add 1 to the integer part */
		goto loop;
	case 2: goto nomask;	/* output integer part + unmasked uniform  */
	case 3: j = spacing;	/* output integer part + masked uniform    */
		goto quickmask;
	case 4: k = spacing;
		etc.
	}

This procedure uses case 1 50% of the time, case 2 25%, etc.  For 25% of
the time, 1 is added to the integer part; for 43% of the time, an exponential
is produced by adding the integer part to a fraction, which is a masked
uniform except in case 2, where there is no mask; and 32% of the time the
loop is reentered with nothing done.  The expected number of uses of spacing
is greater than 1.43, and 50% of the time ct is used.
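
Here is a quick simulation check of those frequencies.  It is not the
generator itself, and rand()'s higher-order bits are only a crude stand-in
for a genuine random bit stream; spacing i should occur about 2^-i of the
time, so case 1 near 50%, case 2 near 25%, and the mean number of bits
consumed per spacing near 2.

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            long hits[4] = {0, 0, 0, 0};
            long trials = 1000000L, bits = 0L, t;

            srand(1);
            for (t = 0; t < trials; t++) {
                int i = 1;                      /* spacing to next one bit */
                while (!((rand() >> 8) & 1))    /* avoid poor low bits     */
                    i++;
                bits += i;
                if (i <= 3)
                    hits[i]++;
            }
            printf("case 1 %.3f  case 2 %.3f  case 3 %.3f  bits/spacing %.2f\n",
                (double)hits[1] / trials, (double)hits[2] / trials,
                (double)hits[3] / trials, (double)bits / trials);
            return 0;
        }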

The procedure above is not so efficient.  A more efficient one tries to get
two at once.  It starts out with a spacing, and the most common branch
(8/15 probability) has two occurrences of ct, the branches being equally
likely to add 1 to the integer part or to go to the nomask output procedure.
In the next most common branch (4/15), there are also two uses of ct, one
of which is as above, and the other selects a random order between this and
quickmask.

The point I am making is that these procedures, which only look at a few
bits before going to a final output stage, rely on operations which are
quite simple in hardware and rather expensive in software.  The spacing
procedure uses an average of 2 bits, and ct only one, yet both are costly
without hardware support.  The use of hundreds of thousands or even
hundreds of millions of exponential random variables is quite common in
simulations.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

mikew@fxgrp.fx.com (Mike Wexler) (11/28/89)

>cik@l.cc.purdue.edu (Herman Rubin) writes:
>Some time ago, I posted to this group certain instructions which are cheap
>in hardware and expensive in software.  I wish to discuss these and give
>partial examples; the full examples are too lengthy and would provide little
>insight.
It sounds like some special purpose hardware might help solve your problems
                                                             ^^^^
more quickly. I propose an experiment: 
1. Design a CPU with these instructions added and one without (you might be
   able to start with some kind of publicly available designs from a 
   university project).
2. Modify your codes to take advantage of these instructions. Run your code
   through a good simulator and see how much performance you gain.
3. Run some benchmarks or even some real applications that are used by a
   large number of people through both simulations.
4. Tweak the compilers to take advantage of the new instructions.
5. Do step 3 over.

If step 2 shows a big performance win, go to an ASIC house and pay them to
build a chip for you.

If steps 3 or 5 show a big performance win, provide the data to people who
make microprocessors, and you may be able to convince them to add some new
instructions.

BTW, I understand that the above process is time-consuming and expensive,
but you also seem to have the most to gain from it.

Mike Wexler (mikew@fx.com)

desnoyer@apple.com (Peter Desnoyers) (11/29/89)

In article <1989Nov27.173555.19673@fxgrp.fx.com> mikew@fxgrp.fx.com (Mike 
Wexler) writes:
> >cik@l.cc.purdue.edu (Herman Rubin) writes:
> >Some time ago, I posted to this group certain instructions which are cheap
> >in hardware and expensive in software.  I wish to discuss these and give
> >partial examples; the full examples are too lengthy and would provide 
> >little insight.
> It sounds like some special purpose hardware might help solve your problems
>                                                               ^^^^
> more quickly. I propose an experiment: 
> 1. Design a CPU with these instructions added and one without (you might be
>    able to start with some kind of publicly available designs from a 
>    university project).
> 2. Modify your codes to take advantage of these instructions. Run your code
>    through a good simulator and see how much performance you gain.

Alternatively:
Assume that a memory-mapped I/O device exists that takes data in (and 
possibly instruction bits on the low-order address lines), implements your 
instructions, and can return results on the next memory read cycle. 
Simulate performance of your code with this hardware. If it gives you a 
big boost, implement it in a programmable chip and put it in your system.
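
In C, driving such a device might look like the sketch below; the base
address and the operation encoding are, of course, entirely made up.

        #define BITDEV_BASE     ((volatile unsigned long *) 0xFFF00000)
        #define BITDEV_CT       0       /* operation selected by the   */
        #define BITDEV_SPACING  1       /* low-order address bits      */

        static unsigned long bitdev(int op, unsigned long operand)
        {
            volatile unsigned long *reg = BITDEV_BASE + op;

            *reg = operand;     /* write latches the data and the operation */
            return *reg;        /* result comes back on the next read cycle */
        }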

                                      Peter Desnoyers
                                      Apple ATG
                                      (408) 974-4469

news@haddock.ima.isc.com (overhead) (11/30/89)

>cik@l.cc.purdue.edu (Herman Rubin) writes:
>Some time ago, I posted to this group certain instructions which are cheap
>in hardware and expensive in software.

The idea of adding hardware to your machine is practical if you have
access to the machine.  Putting a device into your ETA-10 (did Purdue
get theirs?) might be difficult.  Adding it to your PC clone when you
could just use a supercomputer probably won't be a win.  Of course, if
portability is a concern...

The examples I've seen so far are finding or counting bits in a word.

In assembly:
			/ word bits to count in r0
	clr r1		/ count = 0
loop:	inc r1		/ count++
	shrr r0		/ shift word right, into carry
	jnc loop	/ continue if bit is zero
			/ r1 has a bit count

This is three instructions per loop.  Given a 32-bit word size, up to 96
instructions may be executed (and if the loop is entered with all bits
zero, it never terminates, since the carry is never set).

About a decade ago, I was involved with a project that did
scheduling.  It used a bit per time slot, and there were enough
bits in a word to handle the number of time slots required.
Packing the bits seemed like the right data structure because
boolean logic let you quickly compute whether any possibilities
remained.  However, finding out which slot was available (which
bit was still on) required a loop as above.  When the program was
run, it tended to spend a large fraction of its time in that loop.
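
In C terms the idea was roughly the following; the names and the 32-bit
word are illustrative, since the original was 36-bit PDP-10 assembly.

        /* One bit per time slot; a one bit means the slot is still free. */
        int pick_slot(unsigned long *free_slots, unsigned long wanted)
        {
            unsigned long m = *free_slots & wanted;
            int slot = 0;

            if (m == 0)
                return -1;              /* no possibility remains       */
            while (!(m & 1)) {          /* find which bit is still on;  */
                m >>= 1;                /* this loop is where the time  */
                slot++;                 /* went                         */
            }
            *free_slots &= ~(1UL << slot);      /* claim the slot       */
            return slot;
        }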

The machine, a PDP-10, had an instruction, jffo (jump if find
first one), which had strange semantics, but which let us do what
we wanted.  When the loop was changed to use the instruction, the
overall execution speed increased noticeably, and the program no
longer seemed to spend all its time in the one routine.

Was the application of general use?  Yes.  Was the application
portable?  No - it was written in assembly.  Further, the data
structures were highly dependent on word length (36 bits).  Could
this task be done efficiently without the instruction?  Probably.
I've since seen some constant-time algorithms for bit counting.
(Anyone care to post the C macro version?)  I haven't run any
tests to determine the real speed of these beasts, however.
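
For what it's worth, one version that gets passed around looks like this
(for 32-bit words; the macro evaluates its argument several times, and
again I make no claims about its speed):

        #define BX_(x)          ((x) - (((x) >> 1) & 0x77777777)  \
                                     - (((x) >> 2) & 0x33333333)  \
                                     - (((x) >> 3) & 0x11111111))
        /* per-nibble counts, then nibble pairs, then sum the bytes mod 255 */
        #define BITCOUNT(x)     (((BX_(x) + (BX_(x) >> 4)) & 0x0F0F0F0F) % 255)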

The nice thing about portability is that when the next machine
comes out, you can often simply get a speedup for your
application.  The PDP-10 from nine years ago cost a million
dollars.  Two years ago, I bought a Mac II for home, which is
faster, and cost under $10K.  These systems are quite comparable:

		Dec System 20		Macintosh II
RAM		2 Megawords= ~10 MB	5 MB (max 8 MB, for additional $400)
Disk		40 Millisec access	20 Millisec access
		500 MB			300 MB
User Interface	Good line oriented	Good graphics oriented
OS		Multi user		Single user/multitask
Printer		Uppercase band printer	300 DPI laser printer
		6 pages/minute		6 pages/minute
Best utility	EMACS			MS Word 4.0 (or is 3.02 better?)
Best game	VT Trek			Crystal Quest II
User media	DECtape			800K floppies
Misc		40 terminals		two serial ports
					two graphics screens
					graphics digitizer
					sound digitizer
					stereo sound out
					MIDI controller
Cost		$1M			$10K

All right, so it's apples and orange-colored things...

The moral?  Algorithmic research is important.  Portability is
often achievable.

In fact, maybe it is time to rewrite the scheduler for the Mac.
There's probably a market.

Stephen.
suitti@haddock.ima.isc.com

henry@utzoo.uucp (Henry Spencer) (12/01/89)

In article <15300@haddock.ima.isc.com> suitti@anchovy.UUCP (Stephen Uitti) writes:
>...[bit counting] Could
>this task be done efficiently without the instruction?  Probably.

Certainly.  Use table lookup a byte at a time.  "Never recompute what
you can precompute."
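
A sketch of that approach for 32-bit words: build the table once, and
each count is then four indexed loads and three adds.

        static unsigned char bits_in[256];      /* bits_in[i] = ones in byte i */

        void init_bits_in(void)
        {
            int i;

            for (i = 1; i < 256; i++)
                bits_in[i] = bits_in[i >> 1] + (i & 1);
        }

        unsigned bitcount(unsigned long w)      /* assumes 32-bit longs */
        {
            return bits_in[w & 0xff] + bits_in[(w >> 8) & 0xff]
                 + bits_in[(w >> 16) & 0xff] + bits_in[(w >> 24) & 0xff];
        }
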
-- 
Mars can wait:  we've barely   |     Henry Spencer at U of Toronto Zoology
started exploring the Moon.    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu