[comp.arch] Memory access functional units

conor@lion.inmos.co.uk (Conor O'Neill) (05/13/91)
In article <1991May8.155455.14491@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>Andrew Glew, of Intel, writes
>>    So: why not combine vector memory access instructions that convey
>>access pattern, with scalar computational operations?
>
>Yes, yes, my point exactly.  Please do it!  I'll take 10.  
>In fact, I think a fruitful area for extending the current
>architectures is in a more general model for pre-loading
>the cache.  Doing pre-loads by constant strides rather
>than by contiguous lines would already be a big
>performance boost, but I could also imagine more
>general solutions, where you gave the memory subsystem a 
>vector of addresses which specified the pre-load pattern.
>
>I'm just a user though.  I'm not at all clear how hard these
>things would be to implement in hardware.

A recent ASPLOS-IV paper addressed this:
"Code Generation for Streaming: an Access/Execute Mechanism",
by Manuel E. Benitez and Jack W. Davidson, Dept. of Computer Science,
University of Virginia, Charlottesville, VA 22903, USA.

Despite the title of the paper, which doesn't give much away,
their stuff involves separate functional units for accessing memory.

You provide a base address, stride, length, etc.
The memory unit then goes away and streams the data to or from memory
via a FIFO, and can make its own decisions about caching, prefetching, etc.
The inner loop then doesn't need to do any address calculation;
the dedicated hardware does that automatically.
They had 4 input and 4 output streams.
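
To make the idea concrete, here is a rough software model in C of one
input stream feeding a dot product. This is only a sketch of the
base/stride/length idea described above, not the paper's actual
interface: the type in_stream and the functions stream_start,
stream_done, stream_pop and dot_streamed are names I made up, and
stream_pop merely stands in for reading the head of the hardware FIFO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one input stream; names and layout are
   illustrative only, not taken from the paper. */
typedef struct {
    const double *base;   /* address of the first element        */
    ptrdiff_t     stride; /* step between elements, in elements  */
    size_t        length; /* number of elements to deliver       */
    size_t        next;   /* number of elements delivered so far */
} in_stream;

/* Describe the access pattern once, outside the inner loop. */
static void stream_start(in_stream *s, const double *base,
                         ptrdiff_t stride, size_t length)
{
    s->base = base;
    s->stride = stride;
    s->length = length;
    s->next = 0;
}

/* True once every element of the stream has been consumed. */
static int stream_done(const in_stream *s)
{
    return s->next == s->length;
}

/* Stands in for popping the FIFO: in the hardware scheme the address
   arithmetic below would be done by the memory unit, not by the loop. */
static double stream_pop(in_stream *s)
{
    assert(s->next < s->length);
    double v = s->base[(ptrdiff_t)s->next * s->stride];
    s->next++;
    return v;
}

/* Dot product whose inner loop contains no explicit indexing. */
static double dot_streamed(const double *a, ptrdiff_t stride_a,
                           const double *b, ptrdiff_t stride_b, size_t n)
{
    in_stream x, y;
    double sum = 0.0;

    stream_start(&x, a, stride_a, n);
    stream_start(&y, b, stride_b, n);
    while (!stream_done(&x))
        sum += stream_pop(&x) * stream_pop(&y);
    return sum;
}
```

For example, calling dot_streamed(a, 2, b, 1, 3) walks every second
element of a against consecutive elements of b. In a real
implementation the two stream_start calls would become setup
instructions, and each stream_pop would be a register read.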

The paper also discusses the sort of compiler work needed to make
use of this.

While this obviously works for vector-type code, they also claimed a
noticeable speedup for general-purpose code, even UNIX utilities
such as nroff and yacc: "The uses included copying strings and structures,
searching a decoding tree, searching a data structure for a specific item,
and initializing an array".

Some numbers:

Program         Percent reduction in cycles executed
banner            5
bubblesort       18
cal              17
dhrystone        39
dot-product      43
iir              13
quicksort         1
sieve            18
whetstone         3

The SPECratio for their box improved from 4.0 to 4.3.
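
To relate "percent reduction in cycles" to the more familiar speedup
factor: if the clock rate is unchanged (my assumption, not something
stated in the figures above), a reduction of r percent corresponds to
a speedup of 100/(100-r). A small C helper, with a name made up for
illustration:

```c
/* Convert a percent reduction in cycles executed into a speedup
   factor, assuming run time is proportional to cycle count.
   Valid for 0 <= percent < 100. */
static double speedup_from_reduction(double percent)
{
    return 100.0 / (100.0 - percent);
}
```

By that formula the 43 percent reduction for dot-product is roughly
a 1.75x speedup, and going from 4.0 to 4.3 on the SPECratio is a
7.5 percent improvement.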

I thought that it was quite a good paper, but it didn't seem to
create much discussion at the conference.
---
Conor O'Neill, Software Group, INMOS Ltd., UK.
UK: conor@inmos.co.uk		US: conor@inmos.com
"It's state-of-the-art" "But it doesn't work!" "That is the state-of-the-art".