[comp.arch] VLIW assembly

rodman@mfci.UUCP (Paul Rodman) (07/07/89)

In article <3243@alliant.Alliant.COM> lewitt@Alliant.COM (Martin E. Lewitt) writes:
>
>Maybe some VLIWs out there are more difficult because they are pushing
>the technology harder, trying to encode more in an instruction word or
>something.  They might sacrifice some of the generality that their
>bus structure diagram would lead you to believe was there.  I'm curious
>about these experiences with other VLIW architectures.

Ok, as the person who wrote the FFT package for the Trace 7, 14 and
28/300 machines, I can throw in my $.02 worth. Not much assembly has
been written for the Trace machines, as the compiler gets you very
close to peak performance for considerably less effort! :-) However,
with an undergraduate physics background I personally wanted to max
out FFTs by writing hand-code. [I also wanted to counter all those
I've heard say "1024 bit VLIWs can't be hand-coded"!]

Writing assembly for a 1024 bit instruction word, heavily pipelined
machine is not "easy", but what makes it hard is NOT the size of the
instruction per se, so I agree with Mr. Lewitt.

What makes it hard is the fact that you, the programmer, have so much
hardware at your fingertips that you refuse to allow a single unneeded
instr to creep into the algorithm. Sometimes this means you might be
juggling a few more balls than you thought...:-)

In general though, the flexibility of the VLIW instr word, i.e. no
funny conflicts and interdependencies in the encodings, is a breath
of fresh air compared to the typical CISC microword. And you ALWAYS
have another ALU or constant when you need it! A static resource
checker keeps you from making dumb resource conflict errors.
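
[For readers who haven't met one, here is a minimal sketch of what a
static resource checker does, in Python. It is an illustration only,
not the Trace tool: the unit names and per-word limits in UNIT_LIMITS
and the (unit, mnemonic) op format are hypothetical, chosen to show
the idea of counting the functional units one wide word asks for.]

```python
from collections import Counter

# Hypothetical per-word unit limits; NOT the real Trace configuration.
UNIT_LIMITS = {"ialu": 2, "falu": 1, "mem": 1}

def check_word(ops, limits=UNIT_LIMITS):
    """Return a list of conflict messages for one wide instruction word.

    ops is a list of (unit, mnemonic) pairs packed into the same word.
    An empty list means the word fits the available hardware.
    """
    used = Counter(unit for unit, _ in ops)
    errors = []
    for unit, count in sorted(used.items()):
        allowed = limits.get(unit, 0)  # unknown units are never available
        if count > allowed:
            errors.append(f"{unit}: {count} ops requested, {allowed} available")
    return errors

# One word that fits, and one that oversubscribes the float unit.
ok_word = [("ialu", "add"), ("ialu", "sub"), ("falu", "fmul"), ("mem", "load")]
bad_word = [("falu", "fmul"), ("falu", "fadd"), ("mem", "load")]

if __name__ == "__main__":
    print(check_word(ok_word))
    print(check_word(bad_word))
```

[A real checker would also have to track pipeline latencies and bus
usage across words; this sketch only counts units within one word.]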

Of course, it isn't worth hand-coding things very often. Not many
programs have a profile unbalanced enough to make it worthwhile.

At the end of my FFT endeavour I had a kernel of 76
instructions that could be run through the M4 macro processor to
generate the FFT library for all three machines, and for single or
double precision. The performance, needless to say, is as good as the
hardware can do:

     28/300:
     1 dimension, complex, 32 bit fft, 1024 point  = 520  microseconds.
     1 dimension, complex, 64 bit fft, 1024 point  = 930  microseconds.

     1 dimension, complex, 32 bit fft, 1e6 point   = 901  milliseconds.
     1 dimension, complex, 64 bit fft, 1e6 point   = 1768 milliseconds.

     2 dimension, complex, 32 bit fft, 1k x 1k     = 970  milliseconds.
     2 dimension, complex, 64 bit fft, 1k x 1k     = 1800 milliseconds.
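
[To put the 1024 point timings in more familiar units, the sketch
below converts them to a nominal MFLOPS figure. It assumes the usual
benchmarking convention of 5*N*log2(N) floating point operations for
a complex FFT of N points; that is a nominal count, not a statement
of how many operations the hand-coded kernel actually executes. The
function name fft_mflops is made up for this illustration.]

```python
import math

def fft_mflops(n_points, seconds):
    """Nominal MFLOPS for a complex FFT, assuming 5*N*log2(N) flops."""
    flops = 5 * n_points * math.log2(n_points)
    return flops / seconds / 1e6

if __name__ == "__main__":
    # Timings taken from the 28/300 table above, 1024 point cases.
    print(round(fft_mflops(1024, 520e-6), 1))  # 32 bit
    print(round(fft_mflops(1024, 930e-6), 1))  # 64 bit
```

[Under that convention the two 1024 point cases work out to roughly
98 and 55 nominal MFLOPS respectively.]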

Gratifying results. Try the large cases on your workstation.

-pkr

[A non-quiche eater. :-)]