[comp.arch] block copy & VAX MOVC

chris@mimsy.UUCP (Chris Torek) (09/07/88)

In article <5654@june.cs.washington.edu> pardo@june.cs.washington.edu
(David Keppel) writes:
>I believe that the VAX "movc" command takes arbitrary pointers and
>does the following:
>
>* If both are word-aligned, do a word copy (I mean a 4-byte word).
>* If both are non-aligned and could be aligned with 1, 2, or 3 bytes
>  of byte-copy at either end, then do a byte copy at either end and do
>  a word copy down the middle.
>* If neither aligned then ??
>
>Unfortunately, my VAX hardware reference is out of town for a couple
>of weeks, so I can't ask him about neither aligned.  Anybody know?

I do not *know*, but I predict that the answer is machine-dependent:
that BI machines use octaword transfers, while SBI machines use
quadword transfers and CMI machines use longword transfers.  I believe
that at least the 780 and faster VAXen have an alignment network, and
that the microcode can use this directly, so that even if the two
addresses cannot become simultaneously aligned, the copy can proceed as
if they were, with intermediate 64-bit results accumulated in a series
of latches behind the alignment network.

Incidentally, the microcode has a harder job than simply aligning:
The formats of the two instructions are

	movc3	count.rw,src.ab,dst.ab
and
	movc5	srclen.rw,src.ab,fill.rb,dstlen.rw,dst.ab

(r = read-reference, a = address-reference; b = byte, w = word; these
tell how the argument is used and what increments and shifts are
applied to postincrement, predecrement, and indexed addressing modes).
In both cases, if the source and destination overlap, the copy is done
in whichever direction is nondestructive.  Alas, since the count
(movc3) and length (movc5) fields are only read as words, one
instruction can move at most 65535 bytes.  To make these work as a
general copy routine one must surround these with loops which also must
determine the appropriate direction; moreover, since the results are
left in specific registers (r0..r5) the loops must be carefully written
so as to hold the source and destination fields in the appropriate
result-registers to avoid unnecessary moves.
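
A minimal C sketch of such a wrapper, for illustration only:
movc3_model and copy_chunked are hypothetical names, and memmove merely
stands in for the instruction.  A real version would be VAX assembly
that keeps the running pointers in the result registers rather than
recomputing them as this does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOVC3_MAX 65535u   /* largest count a 16-bit word operand holds */

/* Stand-in for a single movc3: at most 65535 bytes, overlap-safe. */
static void movc3_model(size_t count, const unsigned char *src,
                        unsigned char *dst)
{
    assert(count <= MOVC3_MAX);
    memmove(dst, src, count);
}

/* Copy n bytes of any size, choosing the chunk order so that
 * overlapping regions are never clobbered before they are read. */
void *copy_chunked(void *dst_v, const void *src_v, size_t n)
{
    unsigned char *dst = dst_v;
    const unsigned char *src = src_v;

    if ((uintptr_t)dst < (uintptr_t)src) {
        /* Destination is lower: move the low chunks first. */
        while (n > 0) {
            size_t chunk = n < MOVC3_MAX ? n : MOVC3_MAX;
            movc3_model(chunk, src, dst);
            src += chunk;
            dst += chunk;
            n -= chunk;
        }
    } else if (dst != src) {
        /* Destination is higher: move the high chunks first. */
        while (n > 0) {
            size_t chunk = n < MOVC3_MAX ? n : MOVC3_MAX;
            n -= chunk;
            movc3_model(chunk, src + n, dst + n);
        }
    }
    return dst_v;
}
```

The direction test is done once, outside the loops, so each pass costs
only the chunk-size computation and the pointer updates.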

(Fortunately, a C compiler has the sizes of structures available
directly, and can generate the proper series of movc3 instructions for
a structure assignment, but in fact the 4BSD PCC cheats and assumes
that no structure contains more than 65535 bytes.  You get a compiler
error if you try to assign one larger than this.  Well, at least it
does not generate bad code....)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

cjh@hpausla.HP.COM (Clifford Heath) (09/26/88)

I played with Duff's device on an HP 9000/850 (RISC machine), and got
some interesting results.  Duff's is faster than the comparable
non-unrolled loop, but only by about 20-30%.  memcpy was heaps faster,
so I looked at the (memcpy) assembly code using a debugger.  As a result
of this I changed the unrolling factor in Duff's to 4 (not much change),
changed the auto-incr pointer addressing to short offset indexing (using
a pointer adjustment before the loop and a single increment before the
while) and got about 30% more.  The 850 has auto-increment, but it still
takes time that doesn't need to be wasted.  It also has a good global
optimizer, which seemed to do sensible things even for this strange
device.
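
For readers who have not seen it, here is a sketch of the two forms
being compared.  It is a reconstruction for illustration, not the code
that was timed: the function names are hypothetical, the element type
is assumed to be int, and the indexed form uses a biased index where
the original adjusted the pointers themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Duff's device, memory to memory, unrolled 8 times with
 * auto-increment addressing. */
void duff_copy8(int *to, const int *from, size_t count)
{
    size_t passes = (count + 7) / 8;

    if (count == 0)
        return;
    assert(to != NULL && from != NULL);
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}

/* Unrolled 4 times with fixed offsets and one increment per pass.
 * The index starts biased downward so that the partial first pass
 * lands on elements 0..rem-1. */
void duff_copy4_indexed(int *to, const int *from, size_t count)
{
    size_t passes = (count + 3) / 4;
    size_t rem = count % 4;
    ptrdiff_t i = rem ? (ptrdiff_t)rem - 4 : 0;

    if (count == 0)
        return;
    assert(to != NULL && from != NULL);
    switch (rem) {
    case 0: do { to[i]     = from[i];
    case 3:      to[i + 1] = from[i + 1];
    case 2:      to[i + 2] = from[i + 2];
    case 1:      to[i + 3] = from[i + 3];
                 i += 4;
            } while (--passes > 0);
    }
}
```

The second form updates one index per pass instead of two pointers per
element, which is where the saving over auto-increment comes from.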

Duff's was STILL slower than memcpy by about 50%, and couldn't handle
byte-size moves, non-aligned moves, etc.

Duff's is really only a way of saving the code size required to perform
the additional moves left after the unrolled loop has run, which is a
fairly poor excuse for using a device that's so hard to read.  The only
additional benefit is that the extra instructions may be in the I-cache,
which isn't really such a big deal.

The memcpy on the 850 is quite an astonishing effort, using word moves
with double register 8/16/24 bit shifts for unequally non-aligned moves.
It also has a very small setup time, so that small moves get caught
early and handled quickly.  Congratulations to the coder, a very good
effort.  Before this experiment, I was convinced that C with a good
optimizer could get within 10% of assembly code for anything.  I now
have a convincing counter-example.
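
The shift-and-merge idea can be sketched in portable C.  This is an
illustrative guess at the technique, not the HP routine: shift_copy,
merge and little_endian are hypothetical names, a 4-byte word is
assumed, and the two regions must not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Run-time byte-order probe; the shift direction depends on it. */
static int little_endian(void)
{
    const uint32_t one = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &one, 1);
    return first_byte == 1;
}

/* Build the word that starts off bytes (1..3) into `first` and ends
 * in `second`, using an 8/16/24 bit shift on each half. */
static uint32_t merge(uint32_t first, uint32_t second, unsigned off,
                      int little)
{
    unsigned s = 8 * off;
    assert(off >= 1 && off <= 3);
    return little ? (first >> s) | (second << (32 - s))
                  : (first << s) | (second >> (32 - s));
}

void shift_copy(void *dst_v, const void *src_v, size_t n)
{
    unsigned char *dst = dst_v;
    const unsigned char *src = src_v;
    const int little = little_endian();
    unsigned off;

    /* Byte moves until the destination is word-aligned. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) { *dst++ = *src++; n--; }

    off = (unsigned)((uintptr_t)src & 3u);
    if (off == 0) {
        /* Equally aligned: plain word moves. */
        for (; n >= 4; n -= 4, dst += 4, src += 4)
            memcpy(dst, src, 4);
    } else if (n >= 8 - off) {
        unsigned char head[4] = {0, 0, 0, 0};
        const unsigned char *ws = src + (4 - off); /* next aligned word */
        uint32_t first, second, out;

        /* Prime `first` without reading below the source buffer. */
        memcpy(head + off, src, 4 - off);
        memcpy(&first, head, 4);
        /* Loop only while the next aligned word lies inside the source. */
        while (n >= 8 - off) {
            memcpy(&second, ws, 4);          /* aligned word load  */
            out = merge(first, second, off, little);
            memcpy(dst, &out, 4);            /* aligned word store */
            first = second;
            ws += 4; src += 4; dst += 4; n -= 4;
        }
    }
    /* Whatever is left goes a byte at a time. */
    while (n > 0) { *dst++ = *src++; n--; }
}
```

Each source word is loaded once and reused as the low half of the next
merge, so the unaligned case costs one load, two shifts, an OR and one
store per word.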

In short, use the system-supplied routines for preference, and if they
prove to be slow, replace them yourself AND SEND THE CODE to the company
that wrote it.  They'll probably be grateful.

Clifford Heath, Hewlett Packard Australian Software Operation.
(UUCP: hplabs!hpfcla!hpausla!cjh, ACSnet: cjh@hpausla.oz)

mcdonald@uxe.cso.uiuc.edu (09/28/88)

>In short, use the system-supplied routines for preference, and if they
>prove to be slow, replace them yourself AND SEND THE CODE to the company
>that wrote it.  They'll probably be grateful.
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

They won't be grateful. They (particularly IBM) won't even look at it.
IF you send code to IBM it gets looked at by a special person whose
job it is to see if the code is USER WRITTEN APPLICATIONS CODE
illustrating a bug in THEIR software. If it is that, this person then
sends a description off to the responsible group. If, on the
other hand, you send in a proposed improvement in THEIR software,
two things may happen: one is that the special filter-person shreds
your suggestions and then goes off to a special super-secret room where,
using the fruits of super-secret research, his brain is wiped of
all memory of the event. OR, he sends it to their legal department
for legal action: they sue the sender for having looked inside their
software to find the bad code. Big companies aren't interested
in suggestions for improvements direct from customers. They are afraid
that if they were to even look at it, someone else might have used
it in the past and could sue them. They want to code THEIR WAY and
only their way. Indirectly, of course, they must know whether their
code is any good (through benchmarks and user comments to support
reps).

jps@wucs1.wustl.edu (James Sterbenz) (10/03/88)

In article <46500026@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:


>>In short, use the system-supplied routines for preference, and if they
>>prove to be slow, replace them yourself AND SEND THE CODE to the company
>>that wrote it.  They'll probably be grateful.
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>They won't be grateful. They (particularly IBM) won't even look at it.
>IF you send code to IBM it gets looked at by a special person whose
>job it is to see if the code is USER WRITTEN APPLICATIONS CODE
>illustrating a bug in THEIR software. If it is that, this person then
>sends a description off to the responsible group. If, on the
>other hand, you send in a proposed improvement in THEIR software,
>two things may happen: one is that the special filter-person shreds
>your suggestions and then goes off to a special super-secret room where,
>using the fruits of super-secret research, his brain is wiped of
>all memory of the event. OR, he sends it to their legal department
>for legal action: they sue the sender for having looked inside their
>software to find the bad code. Big companies aren't interested
>in suggestions for improvements direct from customers. They are afraid
>that if they were to even look at it, someone else might have used
>it in the past and could sue them. They want to code THEIR WAY and
>only their way. ...



There are various official mechanisms for suggesting improvements to products
of most companies, IBM included.  For IBM it's called (I believe) a PASR.
Much of IBM's source code is licensed, in which case (assuming you're
licensed for the code you're using) there's nothing wrong with looking
at, modifying, and making suggestions for improvement of code.

If, on the other hand, you've disassembled an OCO (object code only)
program, that might be another matter.

There are a lot of programs IBM licenses that were WRITTEN by users.
These are in a category called program offerings (used to be
installed user programs).  These are normally offered 'as-is', but
if IBM likes them enough, they will take over full support and
development.
-- 
James Sterbenz  Computer and Communications Research Center
                Washington University in St. Louis 314-726-4203
INTERNET:       jps@wucs1.wustl.edu
UUCP:           wucs1!jps@uunet.uu.net