[comp.arch] Complex Instructions

firth@sei.cmu.edu (Robert Firth) (04/13/89)

In article <5064@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
 [quoting Clark & Strecker]

>  "Anecdotal accounts of irrational implementations are certainly
>   interesting.  Is it *typical*, however, that composite instructions
>   run more slowly than equivalent sequences of simple instructions?

In my experience, yes, it is indeed typical.  Taking only the example
of the VAX, I have found naive code faster than the block operations
(MOVC3, CMPC3), the loop instruction (ACB), INDEX, and, of course, the
call instructions.

Moving to RISC machines, one point to note is that the authors' assertion
that, other things being equal, several instructions cannot be faster than
one is false: the several instructions can be interleaved by the compiler
with other code, so as to overlap delays with useful work.
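
To make the interleaving point concrete (a portable C sketch, not actual VAX code or compiler output): a block move written as an ordinary loop gives the compiler material it can unroll and schedule around neighboring instructions, whereas a single MOVC3 is an opaque unit.

```c
#include <stddef.h>

/* A naive replacement for a block-move instruction such as MOVC3:
 * just a copy loop.  Because it is ordinary code, the compiler can
 * unroll it and overlap the loads and stores with surrounding work;
 * a single microcoded instruction cannot be interleaved that way. */
static void naive_move(char *dst, const char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```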

slackey@bbn.com (Stan Lackey) (04/13/89)

In article <3169@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In article <5064@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
> [quoting Clark & Strecker]
>
>>  "Anecdotal accounts of irrational implementations are certainly
>>   interesting.  Is it *typical*, however, that composite instructions
>>   run more slowly than equivalent sequences of simple instructions?
>
>In my experience, yes, it is indeed typical.  Taking only the example
>of the VAX, I have found naive code faster than the block operations
>(MOVC3, CMPC3), the loop instruction (ACB), INDEX, and, of course, the
>call instructions.

You should include ALL of the data here.  The proper way of saying
this is: "The implementation of the subset that I need in a particular
case is faster than the full operation."  For example, the VAX CALL
includes aligning the stack and saving enough state so it can be fully
restored, saving the condition codes, pushing a minimum of five
longwords, picking up a mask at the called procedure, and pushing only
those registers with corresponding bits set in the mask.  I'll bet
your substitution code did not do all that.  The string instructions
have logic in them to make sure overlapping strings work, zero-length
strings work, etc., regardless of the case.  Did your code do that?
INDEX was another example where some of the exception checking was
left out of the emulation.  I guess it USUALLY works!
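
(For reference, the overlap and zero-length handling amounts to something like the following portable sketch, in the style of memmove, not the actual microcode:)

```c
#include <stddef.h>

/* Sketch of the extra logic a full block move must carry: pick a copy
 * direction so overlapping source and destination still work, and fall
 * through cleanly on zero length.  This is exactly the work a naive
 * forward-only copy loop leaves out. */
static void safe_move(char *dst, const char *src, size_t n)
{
    if (n == 0 || dst == src)
        return;                        /* degenerate cases */
    if (dst < src) {
        while (n--) *dst++ = *src++;   /* copy forward */
    } else {
        dst += n;
        src += n;
        while (n--) *--dst = *--src;   /* copy backward: overlap-safe */
    }
}
```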

CALL was never intended to be faster than JSR; it was designed to ease
the programming burden of a consistent calling standard, so
cross-language calls would just work.  Now if there were no JSR, there
would be something to complain about.  But all the simple instructions
are present, and implemented in an optimized way!  So the user has a
choice: use the fully functional instructions if he wants the machine
to do the work, or reinvent all the way down to the bare metal.  With
RISC, there is no choice.
So there! :-) Stan

joel@pandora.pa.dec.com (Joel McCormack) (04/14/89)

I too can vouch for the fact that sequences of simple VAX instructions are often faster than a single ``smart'' instruction.  The WRL Modula-2 compiler generates decent, but not great, code.  Nonetheless, it is quite competitive against most of the

bcase@cup.portal.com (Brian bcase Case) (04/15/89)

>You should include ALL of the data here.  The proper way of saying
>this is: "The implementation of the subset that I need in a particular
>case is faster than the full operation."  For example, ...
[examples...]
>CALL was never intended to be faster than JSR; it was designed to ease
>the programming burden of a consistent calling standard, so
>cross-language calls would just work.  Now if there were no JSR, there
>would be something to complain about.  But all the simple instructions
>are present, and implemented in an optimized way!  So the user has a
>choice: use the fully functional instructions if he wants the machine
>to do the work, or reinvent all the way down to the bare metal.  With
>RISC, there is no choice.

You make a valid point:  the complex instructions do a lot of work, and
that work is sequenced in a transparent way so that instruction fetch
overhead is eliminated and advantage of low-level special cases can be
taken in microcode (well, you didn't say all that; I guessed you were
thinking it).  Unfortunately, all the work that the complex instructions
do is rarely needed.  And in the case of the call instruction, I find
the VAX complex instruction sorta insulting:  assuming that compiler
writers are too stupid to adhere to a common spec for inter-language call
protocol.  Having the microcode do it is no better than having a subroutine
in the compiler emit the sequence of instructions that implements the
agreed-upon protocol.  And the agreed-upon protocol can do just what is
needed.  Maybe compiler writers are that stupid, in general, or were at
the time of the VAX's conception.  You are also right when you say with
RISC there is no choice:  one is FORCED to use the fast instructions!  :-)

ok@quintus.UUCP (Richard A. O'Keefe) (04/16/89)

In article <17167@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>And in the case of the call instruction, I find the VAX complex instruction
>sorta insulting:  assuming that compiler writers are too stupid to adhere to
>a common spec for inter-language call protocol.

Try calling Pascal from PL/I under MVS sometime.

bcase@cup.portal.com (Brian bcase Case) (04/17/89)

>>And in the case of the call instruction, I find the VAX complex instruction
>>sorta insulting:  assuming that compiler writers are too stupid to adhere to
>>a common spec for inter-language call protocol.
>
>Try calling Pascal from PL/I under MVS sometime.

I don't want this to get out of hand, and it might belong elsewhere anyway.

BUT, was calling Pascal from PL/I under MVS a design goal?  If so, then why
doesn't the existence of the VAX's CALLS instruction guarantee this goal
will be met?  I guess this is yet another point:  if the CALLS instruction
was included so that inter-language calls "would just work," then it fails
even at that!

leichter@CS.YALE.EDU (Jerry Leichter (LEICHTER-JERRY@CS.YALE.EDU)) (04/17/89)

In article <17296@cup.portal.com>, bcase@cup.portal.com (Brian bcase Case) writes...
>>
>>Try calling Pascal from PL/I under MVS sometime.
> 
>I don't want this to get out of hand, and it might belong elsewhere anyway.
> 
>BUT, was calling Pascal from PL/I under MVS a design goal?  If so, then why
>doesn't the existence of the VAX's CALLS instruction guarantee this goal
>will be met?  I guess this is yet another point:  if the CALLS instruction
>was included so that inter-language calls "would just work," then it fails
>even at that!

Ahem.  Last time I checked, MVS didn't run on VAXes; it ran on IBM mainframes.

VMS runs on VAXes.  It was a design goal of VMS that Pascal code be callable
from PL/I code.  It is.  (Yes, there are complications when you cross over
into an area where both languages specify "how the world should look" - for
example, the I/O system.)

I have no idea what goals MVS may have had in this area.

							-- Jerry

ok@quintus.UUCP (Richard A. O'Keefe) (04/18/89)

In article <17296@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>>And in the case of the call instruction, I find the VAX complex instruction
>>>sorta insulting:  assuming that compiler writers are too stupid to adhere to
>>>a common spec for inter-language call protocol.
>>
>>Try calling Pascal from PL/I under MVS sometime.
>
>I don't want this to get out of hand, and it might belong elsewhere anyway.
>
>BUT, was calling Pascal from PL/I under MVS a design goal?  If so, then why
>doesn't the existence of the VAX's CALLS instruction guarantee this goal
>will be met?  I guess this is yet another point:  if the CALLS instruction
>was included so that inter-language calls "would just work," then it fails
>even at that!

I am *sick* of people not reading what I wrote.  Watch carefully now.
EM.  VEE.  ESS.  I did not say "try calling Pascal from PL/I under <VMS>
sometime."  To the best of my knowledge that works perfectly well and has
done ever since it was possible to try.  And yes, the CALLS/CALLG
instructions and VIA were important in achieving that.

My point was that IBM have a software convention for System/370, and that
their Pascal compiler (which is not a fully supported product, but it is
the compiler they will provide to customers who want Pascal) was written
after that convention was well established and described in hundreds of
books, and it _does_ _not_ follow that convention.  My point, then, is in
DEC's defence:  yes there _are_ compiler writers who will not adhere to a
common spec for inter-language call protocol.  That's not to say that they
are stupid.  They had to trade off incompatibility with other languages
against inefficiency for Pascal.  DEC tried to make that a non-issue.

On the subject of calling conventions, I just had my first look at a
68030 manual.  I know, I'm way behind the times, but it was supposed to
be just more of the same.  Why did Motorola delete CALLM?

richt@breakpoint.ksr.com (Rich Title) (04/18/89)

In article <57252@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter (LEICHTER-JERRY@CS.YALE.EDU)) writes:

> [Discussion of VMS call conventions & stack unwinding...]

>Now, you can argue about whether such a condition handling mechanism is impor-
>tant.  Unix gets by without it...

In UNIX, setjmp/longjmp are the mechanism for stack unwinding. UNIX didn't
make the VMS mistake of getting stack-unwind mixed up with condition
handling: longjmp can be executed from anywhere (though it is most commonly
done from within a handler). 

setjmp simply saves an entire register set, so longjmp need not do the 
frame-by-frame unwind that VMS's $UNWIND must do. 

The advantage of the UNIX method of stack unwinding is that it is 
relatively portable, and doesn't make demands on the calling conventions.
The disadvantage is that the establish-an-unwind-point (the setjmp call)
is more expensive: an entire register set must be saved, as opposed to
just putting a word on the stack as in VMS.  So there are pros and cons to
each approach.
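
The UNIX mechanism can be shown in a few lines of standard C (a minimal sketch; a real condition handler would typically call longjmp from a signal handler):

```c
#include <setjmp.h>

/* Minimal sketch of UNIX stack unwinding: setjmp() pays the cost up
 * front by saving the register set; longjmp() then cuts the stack back
 * in one step, with no frame-by-frame walk as in VMS's $UNWIND. */
static jmp_buf unwind_point;

static void deeply_nested_failure(void)
{
    longjmp(unwind_point, 42);        /* may be executed from anywhere */
}

static int run(void)
{
    int code = setjmp(unwind_point);  /* the expensive part */
    if (code != 0)
        return code;                  /* arrived here via longjmp */
    deeply_nested_failure();
    return 0;                         /* never reached */
}
```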

>In fact, what often gets overlooked in discussing the VAX architecture was
>that it was NOT designed by hardware people alone.  It was the result of a
>lot of give and take between hardware and software people.  It was the soft-
>ware people who said "we want a condition handling facility which provides
>certain features (in particular, no cost to code which doesn't use it...

Well, to rebut that one, let me simply quote you again (!)

>Here's a simple example:  It's common for procedures to have two different
>paths, one for a simple case, one for a complex case.  It may very well be
>that the simple case path uses fewer registers than the complex case path.
>So you'd be tempted to say "on entry (i.e., using the procedure entry mask)
>just save the registers that the simple case uses (or that both use); if the
>complex case occurs, save and restore those explicitly".  This is illegal
>under the procedure call standard, for a simple reason:  If the stack gets
>unwound through the complex case's code, the registers it had saved "by hand"
>would not be restored...

Aha! So there *is* a cost to the code that is not using it after all...

    - Rich

rob@tolerant.UUCP (Rob Kleinschmidt) (04/19/89)

Once in a while, the atomic nature of a complex instruction can have
real advantages over a sequence of fast single cycle instructions.
Because the complex instruction cannot be interrupted in mid sequence,
there is no need to modify processor priority. This can allow better
manipulation of shared data items (linked lists, counters, etc.)
especially by user level processes. Anybody got a good way for a user to
execute an uninterrupted sequence of a dozen RISC instructions?

chase@Ozona.orc.olivetti.com (David Chase) (04/19/89)

In article <4101@tolerant.UUCP> rob@tolerant.UUCP (Rob Kleinschmidt) writes:
>Anybody got a good way for a user to execute an uninterrupted sequence
>of a dozen RISC instructions?

Saw (via cable) a talk at Stanford by one of the 80860 people from
Intel where he said (I believe) that the chip provided the ability to
do just that.  Upon examining the Programmer's Reference Manual, I
find that up to 32 user-mode instructions can be protected by a LOCK
instruction (no branching allowed, must be restartable from the LOCK
in case of traps).  That sounds like what you want.

David

tim@crackle.amd.com (Tim Olson) (04/19/89)

In article <4101@tolerant.UUCP> rob@tolerant.UUCP (Rob Kleinschmidt) writes:
| Once in a while, the atomic nature of a complex instruction can have
| real advantages over a sequence of fast single cycle instructions.
| Because the complex instruction cannot be interrupted in mid sequence,
| there is no need to modify processor priority. This can allow better
| manipulation of shared data items (linked lists, counters, etc.)
| especially by user level processes. Anybody got a good way for a user to
| execute an uninterrupted sequence of a dozen RISC instructions?


I believe the i860 has a Lock bit which can be set by the user,
disabling interrupts and setting an external Lock pin.  It is tracked by
a watchdog timer, so that after a certain number of cycles (10?) it
is automatically reset (to prevent the user from disabling interrupts
for too long).

On the Am29000, the on-chip timer is disabled separately from the
external interrupts.  Thus, an operating system could supply a
user-invokable disable() routine which masked interrupts and set the timer
to interrupt in a small number of cycles.  The user can then perform a
series of operations, then call an enable() routine (resetting the timer
and the interrupt mask).  If enable() were not called in time, the timer
would catch it and force the interrupt mask off.

One other (software) method is to associate a semaphore with each
critical region (or a collection of critical regions), and manipulate it
with an atomic load/set instruction (which many RISC processors have). 
Interrupts don't have to be disabled if all users of the critical region
obey the semaphore.  I guess one may then have to worry about deadlock
conditions, though...
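
The semaphore approach is essentially what C11 later standardized as an atomic flag; a sketch (the names here are illustrative):

```c
#include <stdatomic.h>

/* Guard a critical region with a semaphore manipulated by an atomic
 * load/set (test-and-set) primitive.  No interrupt masking is needed
 * as long as every user of the region obeys the semaphore. */
static atomic_flag region_busy = ATOMIC_FLAG_INIT;

static void region_enter(void)
{
    while (atomic_flag_test_and_set(&region_busy))
        ;                     /* spin: flag was already set by someone */
}

static void region_leave(void)
{
    atomic_flag_clear(&region_busy);
}
```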

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

tim@crackle.amd.com (Tim Olson) (04/19/89)

In article <531@ksr.UUCP> richt@ksr.UUCP (Rich Title) writes:
| In article <57252@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter (LEICHTER-JERRY@CS.YALE.EDU)) writes:
| The advantage of the UNIX method of stack unwinding is that it is 
| relatively portable, and doesn't make demands on the calling conventions.
| The disadvantage is that the establish-an-unwind-point (the setjmp call)
| is more expensive: an entire register set must be saved, as opposed to
| just putting a word on the stack as in VMS. So there are pros and cons to
| each approach.

Not necessarily.  As has been pointed out here in the past,
register-windowed machines can get by with very simple
setjmp()/longjmp() implementations.  Setjmp() only needs to save the
current stack pointer and stack-cache bounds in the jmpbuf; longjmp()
restores these values (the standard return from a function makes sure
that the frame being returned into is entirely in the stack cache). 

This implementation also makes "register" variables act like memory
variables -- after a longjmp() both contain the same values as of the
time of the longjmp() call, rather than having register values suddenly
restored to their contents at the time of the setjmp() call.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

rpw3@amdcad.AMD.COM (Rob Warnock) (04/19/89)

[Although the implementation discussed below is Am29000-specific, the general
notion is probably feasible on the other current RISCs, and maybe even CISCs.]

In article <25280@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
+---------------
| On the Am29000, the on-chip timer is disabled separately from the
| external interrupts.  Thus, an operating system could supply a
| user-invokable disable() routine which masked interrupts and set the timer
| to interrupt in a small number of cycles.  The user can then perform a
| series of operations, then call an enable() routine (resetting the timer
| and the interrupt mask).  If enable() were not called in time, the timer
| would catch it and force the interrupt mask off.
+---------------

[Funny you should mention... Our internal 29k/4.3bsd Unix port group was
talking about this just yesterday. ;-} ]

Actually, Tim, you can do a lot better than that, by remembering to optimize
for the *expected* case, which is: There will be no attempt to break the lock.

The main thing you are trying to protect against is rescheduling and
"signals" due to external events such as typing <^C>. For that, you
don't need a separate enable() call. Instead, you can use a temp_disable()
call [which can be a *very* lightweight sys call, on the 29k] which
needs to do just two things:

1. Check a u_page flag for a pending event that may have been held off
   by a previous temp_disable() call. [This prevents the user from
   keeping the scheduler locked out.]

2. Store the current value of the CPU timer in the u_page.

In 29000 assembler, this is:

	; User does assert instruction "asne V_UIOFF,gr1,gr1", which traps
	; here through vector V_UIOFF, since gr1 *does* equal gr1.
	user_ioff:
		add	it0,upage,BRK_UIOFF_OFS	; find u.u_break_uioff
		load	0, 0, it0, it0		; see if we need a resched
		mfsr	it1, TCV		; get current timer value
		add	it1, it1, 1		; make sure non-zero
		add	it2,upage,UIOFF_OFS	; find u.u_uioff
		jmpt	it0, user_ioff_broken	; take traps first...
		  store	0, 0, it1, it2		; save timer+1 in u.u_uioff
		iret

Assuming hits in the BTC [user doing this a lot], this is 9+8+2+2 = 21 cycles
or just under one microsecond on a 25 MHz machine [with 2-first/1-burst memory].

Now, when you get an external signal like <^C>, you do all the normal Unix
signal stuff, but "issig()" will check the u.u_uioff and, if it's non-zero
and not more than "delta" more than the current timer value, will just set
u.u_break_uioff and *not* ask for the signal. Likewise, "sched()" will delay
rescheduling a uioff'd process. The timer interrupt code checks the current
process to see if the u.u_uioff word is "old" [mod 2^32], and if so, clears
it (and causes a resched if needed).

I'm sure there are more places to touch than that, but you get the idea.

That works for user-mode code, but for the kernel you *would* need hardware.
What we used back at Dig. Comm. Assoc. (in a Z-80 comm system) was a counter
chip, wired so it counted down to zero and hung there until reloaded. Whenever
it was *not* zero, interrupts and DMA requests were held off (a simple gate). The
trick is that the store which loads the counter *enables* interrupts and DMA
[at least long enough for them to "take"], thus preventing infinite lockouts.

Hmmm... You could do the same thing for user mode, if your MMU allows mapping
user pages into I/O space. [The 29k does.] Dedicate a whole I/O page to the
"temp_ioff" counter, put it in every process's address space, including the
kernel's. Cost: A 25-cent external chip. [I assume there are enough misc. I/O
decodes lying around in a PAL somewhere.]


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

alan@rnms1.paradyne.com (Alan Lovejoy) (04/20/89)

In article <17363@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>On the subject of calling conventions, I just had my first look at a
>>68030 manual.  I know, I'm way behind the times, but it was supposed to
>>be just more of the same.  Why did Motorola delete CALLM?
>
>Did they really do this?  Maybe it is because the PMMU subset on the 030
>can't support the protection semantics, but I'm not sure about that.  Maybe
>it's because nobody used the instruction.  :-) :-)

I heard that they checked, and nobody was using the instruction.  I don't  
remember now what the motivation for dropping CALLM was, other than lack
of use.  That seems like sufficient justification in itself.


Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
_________________________Design Flaws Travel In Herds_________________________
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.

guy@auspex.auspex.com (Guy Harris) (04/21/89)

>You should include ALL of the data here.  The proper way of saying
>this is: "The implementation of the subset that I need in a particular
>case is faster than the full operation."  For example, the VAX CALL
>includes aligning the stack and saving enough state so it can be fully
>restored, saving the condition codes, pushing a minimum of five longwords,
>picking up a mask at the called procedure, and pushing only
>those registers with corresponding bits set in the mask.  I'll bet
>your substitution code did not do all that.

If it didn't *have* to, it probably didn't.  Does VAX compiled code
really care about having the condition code saved and restored? Does it
care about all five of those longwords?  The compiled code in the callee
can save the registers itself - or the designer of the calling sequence
can choose to have the caller save the registers (which, as I remember,
Firth has advocated in the past).

>INDEX was another example where some of the exception checking was
>left out of the emulation.  I guess it USUALLY works!

In a lot of cases the exception checking *can* be left out, if the
compiler already knows that the subscript won't be out of bounds. 
Admittedly, I think IBM may have a patent on one way of doing this, so
unless you do it differently that may be an extra cost :-).

schow@bnr-public.uucp (Stanley Chow) (04/21/89)

In article <134@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>
>As a result of this difficulty, architects began to build in
>instructions to handle data structures more directly.  However,
>few of these instructions can be compiled since the C compilers
						    ^^^^
>have difficulty deciphering the code well enough to know when
>to use the "insert into circular queue" instruction as opposed to
>smaller operations to manipulate pointers.  Also, few programmers
>are sufficiently versed in the underlying hardware to generate
>compatible data structures which could allow a very smart compiler
>to generate the instructions in the first place.
>

There are many other compilers that can take advantage of complex
instructions. Actually, even in C, it is relatively easy to set up
include files that will declare the right data types and inline
procedures [assuming your compiler lets you].

Throwing out architectural features just because C can't use them?
There are other languages.  How about throwing out C instead? :-)

Also, it is not necessary for all programmers to know enough to use
the special instructions. Some guru can set it up and everyone can
then use the pre-built modules. Depending on how you set everything
up, the compiler does not have to be smart at all.
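
A sketch of what such an include file might contain (hypothetical names; the portable C body stands in where a machine-specific instruction, e.g. a VAX INSQUE, would be emitted):

```c
/* Hypothetical guru-written wrapper: declare the data type and an
 * inline procedure once; users just call it, without knowing whether
 * the body is a single queue instruction or plain C. */
typedef struct qnode {
    struct qnode *next, *prev;
} qnode;

static inline void queue_insert(qnode *pred, qnode *item)
{
    /* insert item after pred in a circular doubly linked queue;
     * on a VAX this whole body could compile to one INSQUE */
    item->next = pred->next;
    item->prev = pred;
    pred->next->prev = item;
    pred->next = item;
}
```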

For example, we have a system with really wild instructions that get
generated even though users don't know what they are and the
compiler does only peephole optimization. As it turns out, having
atomic operations for free is really nice.


>I personally prefer the way DG/UX handles things.  (I'm biased, I 
>work for DG.)  Each virtual processor contains a value called "atomic
>operation count" which can be raised or lowered (somewhat like parentheses
>can be nested).  Whenever the AOC is > 0, the processor cannot be suspended
>for anything but a simple interrupt service.  This guarantees that atomic
>operations can complete without interference.  If more stringent control
>is required, then some form of event counter solution is used.

I like this design. Sounds like it is implemented by the kernel for virtual
processors. How do you handle the case of the kernel not wanting to be
interrupted?



Stanley Chow   ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public


My opinions are mine alone. Since there has been no interest in others
being represented by them, I am lowering the representation fee.

cik@l.cc.purdue.edu (Herman Rubin) (04/21/89)

In article <5965@pdn.paradyne.com>, alan@rnms1.paradyne.com (Alan Lovejoy) writes:
> In article <17363@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
> >>On the subject of calling conventions, I just had my first look at a
> >>68030 manual.  I know, I'm way behind the times, but it was supposed to
> >>be just more of the same.  Why did Motorola delete CALLM?
> >
> >Did they really do this?  Maybe it is because the PMMU subset on the 030
> >can't support the protection semantics, but I'm not sure about that.  Maybe
> >it's because nobody used the instruction.  :-) :-)
> 
> I heard that they checked, and nobody was using the instruction.  I don't
> remember now what the motivation for dropping CALLM was, other than lack
> of use.  That seems like sufficient justification in itself.

Could this be another example of the vicious cycle?  The language does not
use the instruction, therefore the instruction does not get used, therefore
the instruction gets dropped.  Which algorithm I use depends on the hardware
instructions; if the instruction is not there, I may very well know that 
simulating the instruction is too costly to consider, so it would even be
impossible to tell from a careful examination of code that it would be
desirable.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

barmar@think.COM (Barry Margolin) (04/21/89)

In article <134@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>As a result of this difficulty, architects began to build in
>instructions to handle data structures more directly.  However,
>few of these instructions can be compiled since the C compilers
>have difficulty deciphering the code well enough to know when
>to use the "insert into circular queue" instruction as opposed to
>smaller operations to manipulate pointers.  Also, few programmers
>are sufficiently versed in the underlying hardware to generate
>compatible data structures which could allow a very smart compiler
>to generate the instructions in the first place.

The solution to this is to hide these instructions away in subroutine
libraries and/or compiler intrinsics.  In addition to making it easy
for the compiler to recognize these operations and optimize them
appropriately, it also makes it easier to port the code and still get
good performance.

This is how computer languages and processors evolve in general.
Consider floating point and trigonometric functions.  Once upon a time
programmers might have been required to code these things themselves.
But this would make it difficult to use hardware which has these
operations built in.  So just about every language provides them as
standard features, with provisions for open-coding them.  What's the
difference between operations on floating point numbers and operations
on sophisticated data structures such as queues?

Barry Margolin
Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

rpw3@amdcad.AMD.COM (Rob Warnock) (04/22/89)

In article <1244@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
+---------------
| In article <5965@pdn.paradyne.com>, alan@rnms1.paradyne.com writes:
| > In article <17363@cup.portal.com> bcase@cup.portal.com writes:
| > >                       ...  Why did Motorola delete CALLM?
| > > Maybe it's because nobody used the instruction.  :-) :-)
| > don't remember now what the motivation for dropping CALLM was, other
| > than lack of use.  That seems like sufficient justification in itself.
| Could this be another example of the vicious cycle?  The language does not
| use the instruction, therefore the instruction does not get used, therefore
| the instruction gets dropped...
+---------------

Gerry Weinberg, in his neat little book "Re-thinking Systems Analysis",
calls this "The Railroad Paradox". The 2:30pm train to the city zooms
on by without stopping, but some suburban wives would like to take that
one into the city sometimes so they can shop before meeting hubby for
dinner. They write several notes and petitions, finally getting back a
polite letter from the railroad which says, "We have stationed an observer
on the platform at your stop for a week now who reports that there is
never anyone waiting for the 2:30pm train to stop. Therefore, we must
regretfully deny your request." He also gives computer-related examples...


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

henry@utzoo.uucp (Henry Spencer) (04/23/89)

In article <1011@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>My point was that IBM have a software convention for System/370, and that
>their Pascal compiler (which is not a fully supported product, but it is
>the compiler they will provide to customers who want Pascal) was written
>after that convention was well established and described in hundreds of
>books, and it _does_ _not_ follow that convention...

Might this have something to do with how ill-designed that convention is?
"If you want people to use standards, make it easier and faster to be
standard than to be non-standard."
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

cik@l.cc.purdue.edu (Herman Rubin) (04/24/89)

In article <39609@think.UUCP>, barmar@think.COM (Barry Margolin) writes:
> In article <134@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
  
			......................

> The solution to this is to hide these instructions away in subroutine
> libraries and/or compiler intrinsics.  In addition to making it easy
> for the compiler to recognize these operations and optimize them
> appropriately, it also makes it easier to port the code and still get
> good performance.
> 
> This is how computer languages and processors evolve in general.
> Consider floating point and trigonometric functions.  Once upon a time
> programmers might have been required to code these things themselves.
> But this would make it difficult to use hardware which has these
> operations built in.  So just about every language provides them as
> standard features, with provisions for open-coding them.  What's the
> difference between operations on floating point numbers and operations
> on sophisticated data structures such as queues?

And suppose you want multiple precision.  In that case you need good 
coding AND GOOD INTEGER INSTRUCTIONS.  There are multiple precision
packages written in floating point.  They simulate integer arithmetic
in floating point, and are necessarily VERY slow.  Current hardware on
most machines throws real obstacles into efficient multiple precision
coding.  Slight differences in the multiplication and division
instructions can cause a factor of 2-20 or more in these.
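
The point about integer primitives can be made concrete: each limb of a multiple-precision add needs only the carry out of the previous word, which an add-with-carry instruction gives for free. A portable sketch using a double-width temporary (illustrative names and width):

```c
#include <stdint.h>

#define LIMBS 4   /* illustrative: 4 x 32 bits = 128-bit integers */

/* Multiple-precision addition, least-significant limb first.  The
 * `carry = t >> 32` step is what a hardware add-with-carry provides
 * directly; simulated in floating point it costs far more. */
static void mp_add(uint32_t sum[LIMBS],
                   const uint32_t a[LIMBS], const uint32_t b[LIMBS])
{
    uint64_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;   /* low word */
        carry  = t >> 32;       /* carry into the next limb */
    }
}
```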

It is not true that all languages provide procedures for open coding.
Most provide some procedures for attempting to open-code, but most
implementations I have seen discourage this.  
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

prem@crackle.amd.com (Prem Sobel) (04/25/89)

Many years ago, there was a machine called the Interdata Model 70 which
had instructions for atomically adding or removing items from circular
double-ended queues. The data structure was defined reasonably efficiently
and the machine was microcoded.

Yet no one, and no compiler, seriously used these instructions. The reason
was, amazingly, that individual instructions were faster!!! I never looked
at the microcode, so I cannot comment on why that was.


Prem

schow@bnr-public.uucp (Stanley Chow) (04/27/89)

Since this discussion of complex instructions is venturing outside of
architectural issues and getting into language/OS issues, I have cross
posted to what I think are appropriate groups. Hopefully, we can get some
other viewpoints on this issue. Please limit followups to the appropriate group(s).


In article <25384@amdcad.AMD.COM> prem@crackle.amd.com (Prem Sobel) writes:
>Many years ago, there was a machine called the Interdata Model 70 which
>had instructions for atomically adding or removing items from circular
>double-ended queues. The data structure was defined reasonably efficiently
>and the machine was microcoded.
>
>Yet no one, and no compiler, seriously used these instructions. The reason was,
>amazingly, that individual instructions were faster!!! I never looked at
>the microcode, so I cannot comment on why that was.

Strangely enough, we have a proprietary machine that has micro-coded
instructions for much the same functions. The queueing instructions happen
to be at the top of the usage list.

Even more amazing, micro-coding of frequently used instruction sequences
essentially doubled performance. Since I wrote much of the micro-code (and
did much of the analysis to begin with), I can state that the main reasons are:
  - reduced program bandwidth
  - better pipelining of program and data access
  - better parallelism for using the hardware units.

All this is done with a peephole optimizer! And *all* the instructions 
fitted into 4K by 40 bits of micro-code!


You think the VAX procedure calling instructions are big? We have special
instructions for swapping processes in and out! The code for swapping a process
is something like:
     
     SaveRegisters();                   ; one instruction.
                                        ; old process is implicit
     RestoreRegisters(new_process);     ; another instruction

These instructions play with the hardware registers, firmware registers, help 
the scheduler do software stuff, calculate the CPU time spent in the current
process, save/restore the runtime stacks and some other things that I cannot
remember off hand.

The end result is that process swapping happens at data memory bandwidth. We
looked at the options, and concluded that even with absolutely no program store
wait-states, it is impossible for any software (compiled or hand-tuned) to
even come close to this performance.



Note that this is on a machine designed for micro-coding in the early 70's,
so the comparisons may not be valid for current machines. Considering that it
uses only MSI TTL on 4-layer boards, we get very good throughput. [We have
already come out with a 68K based replacement and are working on more fun
stuff, more about that next decade.]

A word of caution for people who want to look into micro-coding: get control
of your operating system and compiler before you try it. There is no point in
micro-coding instructions for your application unless you can make the OS and
the compiler like it. (I managed to introduce new syntax into the language and
changed whole chunks of the OS to support some of the fancy micro-code.)



Stanley Chow    ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
                (613) 763-2831

Disclaimer: Since I am only talking about an old system, and all the information
	    has already been published in one form or another, I don't think my
	    employer minds me talking about it. That does not mean I represent
	    anyone.
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

aglew@mcdurb.Urbana.Gould.COM (04/29/89)

>Once in a while, the atomic nature of a complex instruction can have
>real advantages over a sequence of fast single cycle instructions.
>Because the complex instruction cannot be interrupted in mid sequence,
>there is no need to modify processor priority. This can allow better
>manipulation of shared data items (linked lists, counters, etc.)
>especially by user level processes. Anybody got a good way for a user
>to execute an un-interrupted sequence of a dozen RISC instructions?

Don't rely on "the atomic nature of a complex instruction".

1) they may be restartable or continuable across interrupts,
   but they may not necessarily guarantee atomicity across an
   interrupt - ie. if the interrupt routine messes with the
   data structures the complex instruction is manipulating,
   atomicity may be lost.
      Simple example: what does atomicity of a block copy mean to
   you? It might mean that exactly what was present in memory when
   the instruction was first issued should be copied - which is
   seldom true (in fact, I know of RT systems that depended
   on interrupt routines being able to modify copies in progress.)
      For smaller operations, like compare-and-swap, processors usually
   do guarantee atomicity wrt interrupts. But not always...

2) much more frequently, although complicated instructions may be
   atomic wrt interrupts, they are *not* atomic wrt other
   processors in the system. 

jkrueger@daitc.daitc.mil (Jonathan Krueger) (05/03/89)

In article <426@bnr-fos.UUCP>, schow@bnr-public (Stanley Chow) writes:
>it is not necessary for all programmers to know enough to use
>the special instructions. Some guru can set it up and everyone can
>then use the pre-built modules. Depending on how you set everything
>up, the compiler does not have to be smart at all.

Of course, but you lose the performance advantage if the modules get
called frequently and require significant setup (e.g. can't maintain
state for multiple callers or sequences) with respect to work
performed.  Or to put it another way, That Way Lies Threaded Code.
However, this isn't an argument for Better Living Through Inlining,
rather an appeal for analysis and measurement of the tradeoffs.

>For example, we have a system with really wild instructions that get
>generated even though the users don't know what they are and the
>compiler does only peephole optimization. As it turns out, having
>atomic operations for free is really nice.

Sounds like you've managed to let your users express what they want in
terms familiar to them but also understandable to the language
translator, following which optimization to particular hardware can be
done in a reasonable way.  By any chance, would this be done by way of
a function library?

-- Jon
-- 

cik@l.cc.purdue.edu (Herman Rubin) (05/03/89)

In article <504@daitc.daitc.mil>, jkrueger@daitc.daitc.mil (Jonathan Krueger) writes:
> In article <426@bnr-fos.UUCP>, schow@bnr-public (Stanley Chow) writes:
> >it is not necessary for all programmers to know enough to use
> >the special instructions. Some guru can set it up and everyone can
> >then use the pre-built modules. Depending on how you set everything
> >up, the compiler does not have to be smart at all.

I agree it is not necessary for ALL programmers to know how to really
use the computer, any more than it is necessary for the house painter
to be an artist.  But the artist cannot set up the modules for a
house painter to produce a work of art.

It is also not true that even the use of modules will handle what is
needed.  I may see tomorrow an ingenious use of the hardware that I do
not see today.  It is quite common when I see a different architecture
that I see that some instruction, intended for a totally different
purpose, does what I want.  Also, I can see that some easily implemented
hardware will do not only what I want, but what others can use.

> Of course, but you lose the performance advantage if the modules get
> called frequently and require significant setup (e.g. can't maintain
> state for multiple callers or sequences) with respect to work
> performed.  Or to put it another way, That Way Lies Threaded Code.
> However, this isn't an argument for Better Living Through Inlining,
> rather an appeal for analysis and measurement of the tradeoffs.
What I ask for here is flexibility.  The guru is not a universal guru.

< >For example, we have a system with really wild instructions that get
< >generated even though the users don't know what they are and the
< >compiler does only peephole optimization. As it turns out, having
< >atomic operations for free is really nice.
< 
< Sounds like you've managed to let your users express what they want in
< terms familiar to them but also understandable to the language
< translator, following which optimization to particular hardware can be
< done in a reasonable way.  By any chance, would this be done by way of
< a function library?
< 

And what if your users find that the language is inadequate?  I know of
no adequate one.  Do not cripple the programmer.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

les@unicads.UUCP (Les Milash) (05/03/89)

In article <1277@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>What I ask for here is flexibility.  The guru is not a universal guru.

the guru that is not universal is not the true guru. 0 :-)-|X

cdshaw@alberta.UUCP (Chris Shaw) (05/04/89)

In article <1277@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <504@daitc.daitc.mil>, jkrueger@daitc.daitc.mil (Jonathan Krueger) writes:
>< >For example, we have a system with really wild instructions that get
>< >generated even though the users don't know what they are and the
>< >compiler does only peephole optimization. As it turns out, having
>< >atomic operations for free is really nice.
>< 
>< Sounds like you've managed to let your users express what they want in
>< terms familiar to them but also understandable to the language
>< translator, following which optimization to particular hardware can be
>< done in a reasonable way.  By any chance, would this be done by way of
>< a function library?
>
>And what if your users find that the language is inadequate?  I know of
>no adequate one.  Do not cripple the programmer.

What if your stock Ford Escort doesn't do 180MPH on the straights? Don't cripple
the driver! What if your 2 micron CMOS gate array doesn't give 0.3 nanosecond
inverter propagation delay with fanout of 8? Don't cripple the logic designer! 
Get it?

The point I'm illustrating is that if you want to do something that 99% of the
rest of the world is unwilling to pay for, it should come as no surprise
that the stock solutions fall short. Griping and complaining that the stock
solutions are inadequate is simply vacuous. If you invest the time and money,
you can get a Ford Escort to go 180. But it won't be stock. If you invest the
time and money, you can get 2.0 micron CMOS to give 0.3 ns propagation delay.
But it won't be a gate array, it'll be full custom.

Frankly, I'm getting very tired of reading Herman Rubin's complaints about C,
because he never seems to have a coherent solution. All I see is calls
for someone else to solve his particular problem, coupled with the unspoken
attitude that his particular complaint should have been taken care of long ago.
"The meter's running, buddy! Get to work!"

So what?

Maybe people do seriously listen, but come to the conclusion that it isn't
worth the time and effort to give Herman Rubin what he wants. I don't know, 
but it seems to me that Herman Rubin is wasting his breath. Time for action, I
think. Time for Herman Rubin to write "SL", for "Statistical Language", and
to report on its results. Calling other people stupid is valuable only as 
long as they are willing to listen. For me, time was up long ago.

>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

-- 
Chris Shaw    cdshaw@alberta.UUCP (or via watmath or ubc-vision)
University of Alberta
CatchPhrase: Bogus as HELL !

cik@l.cc.purdue.edu (Herman Rubin) (05/04/89)

In article <2249@pembina.UUCP>, cdshaw@alberta.UUCP (Chris Shaw) writes:
> In article <1277@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >In article <504@daitc.daitc.mil>, jkrueger@daitc.daitc.mil (Jonathan Krueger) writes:

			.......................

> >And what if your users find that the language is inadequate?  I know of
> >no adequate one.  Do not cripple the programmer.
> 
> What if your stock Ford Escort doesn't do 180MPH on the straights? Don't cripple
> the driver! What if your 2 micron CMOS gate array doesn't give 0.3 nanosecond
> inverter propagation delay with fanout of 8? Don't cripple the logic designer! 

This is the wrong analogy.  YOU are saying that I should not complain if the
hardware is not capable of doing everything I want it to.  I do complain 
about this, but not very loudly.  I do not ask for technological impossi-
bilities, or things requiring great cost.  I may ask for better brakes, but
not to stop the car from 100mph in 50 feet.

But my real complaint has nothing to do with the hardware.  You would provide
me with detailed instructions about how to get from one location to another,
but you would not allow me to use a road map to change the route.  You would
not even allow me to back the car into the driveway to clean out the garage.

The car has a steering wheel.  I do not wish to put it under your control.

> The point I'm illustrating is that if you want to do something that 99% of the
> rest of the world is unwilling to pay for, 

			.........................

What I am asking for is cheap.  I am asking that you unchain the steering wheel.

> Frankly, I'm getting very tired of reading Herman Rubin's complaints about C,
> because he never seems to have a coherent solution. All I see is calls
> for someone else to solve his particular problem, coupled with the unspoken
> attitude that his particular complaint should have been taken care of long ago.

You jump to conclusions and assume more than was stated.  I have consistently
stated that the best that can be produced is a fair language, and that to
produce such a language, the producers must be aware that that is all that
is possible.  Also, one person cannot produce a language.

I am not in a CS department.  Funding for people in mathematics and statistics
to hire programmers is difficult and niggardly, even to hire students.

I have made one concrete suggestion for something which is not a language, but
which would make for versatile relatively easy programming.  It is a versatile
macro processor; maybe one should call it a user-controlled semi-compiler.  Let
me give an example for the use of a single machine instruction on the CYBER 
205.  The format of the way I would like to write the macro is

	c{'z} ={t} {-}{|} a{'x} OP{mod} {|} b{'y} { /\ {~} w}

where the {} delineate optional fields.  The interpretation of this depends on
the types of the arguments, and covers more than 40 instructions, some of
which have up to 8 optional bits.  For starters, I would like to translate
expressions like this into other statements which are in the syntax recognized
by a processor.  Of course, macros could be recursive.

Such a processor would allow me to produce a customized means of writing
machine instructions, or sequences of machine instructions, instead
of the clumsy assembler mnemonics.  With this type of macro assembler, I
believe it would be easy to produce good semi-portable code.  Since I am
customizing the way I write instructions, I expect to have to provide you
with the description of what I am doing.  We may get groups of people to
agree on some of these; that could be called a partial language.

I am not asking for the perfect language.  I am asking for a means whereby
a programmer can use a reasonable notation to tell the modified assembler
what instructions to use.  It should be possible to use more complicated
macros to achieve what the current languages do, and more.  The current
languages are specialized macro statement languages.  
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

chuck@melmac.harris-atd.com (Chuck Musciano) (05/04/89)

In article <2249@pembina.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>Frankly, I'm getting very tired of reading Herman Rubin's complaints about C,
>because he never seems to have a coherent solution. All I see is calls
>for someone else to solve his particular problem, coupled with the unspoken
>attitude that his particular complaint should have been taken care of long ago.
>"The meter's running, buddy! Get to work!"
>
>So what?
>
>Maybe people do seriously listen, but come to the conclusion that it isn't
>worth the time and effort to give Herman Rubin what he wants. I don't know, 
>but it seems to me that Herman Rubin is wasting his breath. Time for action, I
>think. Time for Herman Rubin to write "SL", for "Statistical Language", and
>to report on its results. Calling other people stupid is valuable only as 
>long as they are willing to listen. For me, time was up long ago.

     Hear!  Hear!  I was just thinking this morning about how the statistical
world would feel about a bunch of programmers announcing that they were
completely dissatisfied with the current state of statistical analysis, and
demanding that statisticians begin "doing statistics right!"

     I anxiously await release 1.0 of SL.  I'll need an implementation for
my Sun-3/60, of course, and I'm running SunOS 3.4, if that helps.

     I would suggest that Mr. Rubin get a firm grounding before beginning his 
project.  I would suggest as a basic curriculum a compilers course, including 
study of lexical analysis, grammar theory, top down and bottom up parsing, LL,
LR, and LALR grammar generation, basic code generation, code optimizations, 
and flow analysis.  Then, dive into a semester of machine architecture, 
including the fundamentals of stack, one, two, and three address machines, 
pipelining, the merits of CISC and RISC, and instruction set design. Since 
languages have users, you'll need some time in a few psychology courses, to 
study how people think about problems and perceive tools.  It is also helpful 
to take a language survey course, and write a few programs in about 10 or 15 
languages, just to get a feel for what is out there. A few operating systems 
courses will bring you up to speed on linkers and loaders, resource 
management, device interfacing, and I/O issues.  Finally, some training in 
software methodology, so that your compilers will be easy to build, modify, and
port, and so that you'll be able to include good design features in your 
language. 

     After you have done all this, like I and countless other computing
professionals have, then you are qualified to make the hard tradeoff decisions
that arise in language design and implementation.  There is no perfect decision,
and you can't please all users.  Like my old office mate used to say, you
can't be all tools for all fools.

     This may seem a little harsh, and I guess it is because I'm a little
peeved.  The point is, until you fully understand an issue, you are only
qualified to make suggestions.  When a more qualified professional overrides
your suggestion, you should defer to his or her opinion.  I don't try to
tell auto engineers how to build cars.  I might have some ideas about how
my car could be better, but if they say there is a good reason why not, that
pretty much closes the issue for me.  I'm simply not informed enough to 
dispute them.

     Chris Shaw also raises the point of effort versus payoff for a large
group of users.  I think Mr. Rubin is in a small majority of users, and it
simply does not behoove system designers to expend the effort to support
his desires.  Classic case of supply and demand in a free market.

Chuck Musciano			ARPA  : chuck@trantor.harris-atd.com
Harris Corporation 		Usenet: ...!uunet!x102a!trantor!chuck
PO Box 37, MS 3A/1912		AT&T  : (407) 727-6131
Melbourne, FL 32902		FAX   : (407) 727-{5118,5227,4004}

chuck@melmac.harris-atd.com (Chuck Musciano) (05/05/89)

In article <2024@trantor.harris-atd.com> chuck@trantor.harris-atd.com (Chuck Musciano) writes:
>group of users.  I think Mr. Rubin is in a small majority of users, and it
                                                  ^^^^^^^^

     Ahem.  Make that "minority".

Chuck Musciano			ARPA  : chuck@trantor.harris-atd.com
Harris Corporation 		Usenet: ...!uunet!x102a!trantor!chuck
PO Box 37, MS 3A/1912		AT&T  : (407) 727-6131
Melbourne, FL 32902		FAX   : (407) 727-{5118,5227,4004}

lfoard@wpi.wpi.edu (Lawrence C Foard) (05/06/89)

In article <2024@trantor.harris-atd.com> chuck@trantor.harris-atd.com (Chuck Musciano) writes:
>In article <2249@pembina.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>.....
>     This may seem a little harsh, and I guess it is because I'm a little
>peeved.  The point is, until you fully understand an issue, you are only
>qualified to make suggestions.  When a more qualified professional overrides
>your suggestion, you should defer to his or her opinion.  I don't try to
>tell auto engineers how to build cars.  I might have some ideas about how
>my car could be better, but if they say there is a good reason why not, that
>pretty much closes the issue for me.  I'm simply not informed enough to 
>dispute them.
>
>....

Come on, who defines "qualified professional"??? Of course it will depend on how
many expensive slips of paper you have (degrees). History is full of great
ideas that were at first rejected by the "qualified professionals". If the
car makers' only explanation is "it isn't done that way", there is no reason
why you should accept that for an answer. There may be a hundred reasons why
SL might be a bad idea, but you didn't name any; you just stated that you were
a qualified professional and that peons aren't allowed to question you! This
isn't the dark ages where the pope was infallible; people, no matter how well
trained or professional, can still make mistakes, and many times it takes an
outsider to see them. If you had brought the idea of RISC up to the "qualified
professionals" on the VAX development team, I wonder what the answer would have
been?


-- 
Disclaimer: My school does not share my views about FORTRAN.
            FORTRAN does not share my views about my school.

cdshaw@alberta.UUCP (Chris Shaw) (05/06/89)

In article <1281@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
me>In article <2249@pembina.UUCP>, cdshaw@alberta.UUCP (Chris Shaw) writes:
HR>>In article <1277@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes
HR>>>And what if your users find that the language is inadequate? I know of
HR>>>no adequate one. Do not cripple the programmer.

me>> <Silly automotive analogy. Don't cripple the driver, etc.>

HR>This is the wrong analogy. I do not ask for technological impossi-
HR>bilities, or things requiring great cost.

On the contrary, special cases that HR desires DO come at great cost. Read on.

HR>The car has a steering wheel. I do not wish to put it under your control.

Granted. Of course, cars aren't computers, so chasing this analogy is a waste of
time. Explaining an analogy is worse than explaining a joke.

me>>The point I'm illustrating is that if you want to do something that 99% of 
me>>the rest of the world is unwilling to pay for....

HR>What I am asking for is cheap. I am asking that you unchain the steering 
HR>wheel....Also, one person cannot produce a language....

I dispute this. Many one-person languages exist and are in use today. A one 
person language has the advantage of conceptual wholeness. Now I admit, it won't
be easy to create SL all by yourself, but if it was easy, your solution would
be for sale on PC's for $75. Or you could hack it in an afternoon.

HR>I am not in a CS department. Funding for people in mathematics and statis-
HR>tics to hire programmers is difficult and niggardly, even to hire students.

I suppose the real thrust of my argument is that you can do anything you like,
but if nobody else is interested, expect to do it yourself! HR states that
he has no money for programmers, and I infer from this that he has either no
time or no ability to do this himself. Perhaps he has neither, I don't know.
But the latter statement from HR about needing $$ to implement his plan 
contradicts his earlier statement that he's not looking for something at 
great cost. In a nutshell: HR wants a niche product, and tacitly admits that
substantial design and development time are required to produce the desired
product. It's going to be expensive, because you can't amortize the development
cost over a lot of customers.

HR>I have made one concrete suggestion for something which is not a language,but
HR>which would make for versatile relatively easy programming. It is a versatile
HR>macro processor; maybe one should call it a user-controlled semi-compiler. 
HR>...an example for the use of a single machine instruction on the CYBER 205...

I think that HR's goals conflict. He wants something portable but at the same
time able to maximize machine performance. In other words, you need an 
expression syntax that is able to encompass all available modifications of
a particular 205 instruction, plus all mods for Cray, plus SX-1, etc.
But at the same time, portability is desired. What if 205 option A is not
available on Cray? What do you do? Cut the expression into inefficient little 
bits? How is this different from normal C code?

I guess my point is that if you only want nifty expressions for the 205, you're
simply asking for a spoonful of syntactic sugar, assuming that the #asm
thing allows proper data structure access. If the compiler-asm link is
inadequate, then this is probably a straightforward system fix. I'd wager that
the grail -- portable and maximally efficient -- is impossible.

HR>I am asking for a means whereby a programmer can use a reasonable notation to
HR>tell the modified assembler what instructions to use. 

Machine specific, unportable, and therefore window dressing, unless you are
content with giving hints only. But this means you have to say the same thing
twice. A hard problem.

HR>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
--
Chris (not a guru) Shaw    cdshaw@alberta.UUCP (or via watmath or ubc-vision)
University of Alberta
CatchPhrase: Bogus as HELL !

milne@ics.uci.edu (Alastair Milne) (05/06/89)

prem@crackle.amd.com (Prem Sobel) writes
>Many years ago, there was a machine called the Interdata Model 70 which
>had instructions for atomically adding or removing items from circular
>double-ended queues. The data structure was defined reasonably efficiently
>and the machine was microcoded.
>
>Yet no one, and no compiler, seriously used these instructions. The reason was,
>amazingly, that individual instructions were faster!!! I never looked at
>the microcode, so I cannot comment on why that was.

    Is this sort of finding not precisely the reason behind RISC architecture?
    The realisation that many an instruction to handle a relatively
    sophisticated job turned out to be slower than a series of very simple
    instructions to do the equivalent thing?

    I myself am not quite clear yet on why this should be (is there perhaps a
    reason that larger, more sophisticated microcode routines can't be made
    optimally fast?).  Yet it does seem to be a fact of life.


    Alastair Milne

cik@l.cc.purdue.edu (Herman Rubin) (05/06/89)

In article <2253@pembina.UUCP>, cdshaw@alberta.UUCP (Chris Shaw) writes:
> In article <1281@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> me>In article <2249@pembina.UUCP>, cdshaw@alberta.UUCP (Chris Shaw) writes:
> HR>>In article <1277@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes
> HR>>>And what if your users find that the language is inadequate? I know of
> HR>>>no adequate one. Do not cripple the programmer.
> 
> me>> <Silly automotive analogy. Don't cripple the driver, etc.>
> 
> HR>This is the wrong analogy. I do not ask for technological impossi-
> HR>bilities, or things requiring great cost.
> 
> On the contrary, special cases that HR desires DO come at great cost. Read on.
> 
> HR>The car has a steering wheel. I do not wish to put it under your control.
> 
> Granted. Of course, cars aren't computers, so chasing this analogy is a waste of
> time. Explaining an analogy is worse than explaining a joke.
> 
> me>>The point I'm illustrating is that if you want to do something that 99% of 
> me>>the rest of the world is unwilling to pay for....

So we should not train professional sports figures.  We should not produce
Ph.D.'s.  We should not produce professional programmers.  We should not
produce doctors.  More than 99% of the population is unable to use the
training and education for these fields.  Therefore we should not provide
it for those who can use it.

> HR>What I am asking for is cheap. I am asking that you unchain the steering 
> HR>wheel....Also, one person cannot produce a language....
> 
> I dispute this. Many one-person languages exist and are in use today. A one 
> person language has the advantage of conceptual wholeness. Now I admit, it won't
> be easy to create SL all by yourself, but if it was easy, your solution would
> be for sale on PC's for $75. Or you could hack it in an afternoon.

If you think I want a language for statistics, you are wrong.  I want a
language in which a lot of things in mathematics and statistics can be done.
And I do not want the language to restrict my thinking.  You want a program
to be portable.  But what if only junk is portable?  You would insist that
a good program, easy to envision on one machine, not be used because another
machine cannot run it efficiently.

> HR>I am not in a CS department. Funding for people in mathematics and statis-
> HR>tics to hire programmers is difficult and niggardly, even to hire students.

It is not the problem of getting people to agree it is worthwhile.  I can do
that.  It is the fact that universities will not fund things in mathematics
and statistics; they insist that the money be obtained from the government.
And the NSF budget does not permit the funding of enough faculty, and graduate
students are more expensive than faculty.  We have no problem getting computing
time, but no faculty member in our department can get non-routine programming
help.  And we are considered one of the top statistics departments.

>. It's going to be expensive, because you can't amortize the development
> cost over a lot of customers.
> 
> HR>I have made one concrete suggestion for something which is not a language,but
> HR>which would make for versatile relatively easy programming. It is a versatile
> HR>macro processor; maybe one should call it a user-controlled semi-compiler. 

I believe that I could do this with one CS hacker in a few months.

> HR>...an example for the use of a single machine instruction on the CYBER 205...
> 
> I think that HR's goals conflict. He wants something portable but at the same
> time able to maximize machine performance. In other words, you need an 
> expression syntax that is able to encompass all available modifications of
> a particular 205 instruction, plus all mods for Cray, plus SX-1, etc.
> But at the same time, portability is desired. What if 205 option A is not
> available on Cray? What do you do? Cut the expression into inefficient little 
> bits? How is this different from normal C code?
> 
> I guess my point is that if you only want nifty expressions for the 205, you're
> simply asking for a spoonful of syntactic sugar, assuming that the #asm
> thing allows proper data structure access. If the compiler-asm link is
> inadequate, then this is probably a straightforward system fix. I'd wager that
> the grail -- portable and maximally efficient -- is impossible.

I agree.  But semi-portable and close to efficient is not impossible.  In
some cases an algorithm may have to be abandoned entirely on a given
machine.  This is especially the case
in generating random variables, where it is the method which must be correct,
NOT the answer.  The CRAY-1 is so bad I consider it a basket case for
vectorization.  I would suggest connecting a mini, or even a micro, to
do this part of the job.  The algorithm I would use on the 205 differs 
from what I would use on the IBM 3090; the 205 algorithm could be implemented
on the 3090, but would be decidedly more costly.  Implementing the 3090
algorithm on the 205 would require breaking up the vector into short pieces,
which is required on vector register machines such as the 3090 and CRAY, but
is not required on vector stream machines such as the 205.  But to me, these
considerations are immediate and obvious.
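
The strip mining Rubin alludes to can be sketched in C.  This is a generic
illustration of the vector-register style, not his actual 205 or 3090 code;
the name `vadd_strip` and the strip length are made up here:

```c
#include <assert.h>
#include <stddef.h>

#define STRIP 64  /* illustrative vector-register length, e.g. 64 on a CRAY */

/* Add two long vectors in strips of STRIP elements, the way code is
   written for a vector-register machine.  Each inner loop is what would
   become a single vector add on registers of length STRIP. */
void vadd_strip(double *c, const double *a, const double *b, size_t n)
{
    size_t i, j;
    for (i = 0; i < n; i += STRIP) {
        size_t len = (n - i < STRIP) ? n - i : STRIP;
        for (j = 0; j < len; j++)      /* one vector-register operation */
            c[i + j] = a[i + j] + b[i + j];
    }
}
```

On a vector-stream machine such as the 205, the whole loop would instead map
to a single memory-to-memory instruction, which is why code tuned for one
style tends to be clumsy on the other.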

> HR>I am asking for a means whereby a programmer can use a reasonable notation to
> HR>tell the modified assembler what instructions to use. 
> 
> Machine specific, unportable, and therefore window dressing, unless you are
> content with giving hints only. But this means you have to say the same thing
> twice. A hard problem.

There is a heck of a lot of portability which can be achieved, but not in any
of the current HLLs.  The problem of adding vectors is portable, but the C
code to do it efficiently on one machine is inefficient on another.  In this
respect C is too low level.  The same holds for string management.

But C does not handle transfers too well.  There are situations where what is
wanted is a branch to subroutine, not a subroutine call.  C does not handle
this at all.  Here C is too high level.  C can do recursion, but not
coroutines.  And what about a recursion in which one wishes to force
the use of quantities unchanged in registers?
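
Rubin's complaint about coroutines is fair as far as portable C goes.  The
closest standard C comes is a generator whose suspended state lives in static
storage; the function below (`next_square`, a made-up example) fakes a single
coroutine but cannot be instantiated twice or nested:

```c
#include <assert.h>

/* Not a true coroutine -- standard C has none -- but the usual
   workaround: the "suspended" state is kept in a static variable,
   so each call resumes where the previous one left off. */
static unsigned next_square(void)
{
    static unsigned i = 0;   /* persists across calls */
    i++;
    return i * i;
}
```

Each call yields the next value in the sequence; a second, independent
instance of the generator would require threading the state through
explicitly, which is exactly the machinery a real coroutine facility hides.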

The ideal language does not exist, and never will.  Portability means using
the same strategies with the automobile and airplane as with the horse and
buggy.  There may be similarities, but full portability is totally incompatible
with even fair efficiency.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

cdshaw@alberta.UUCP (Chris Shaw) (05/07/89)

HR>In article <1287@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
me>In article <2253@pembina.UUCP>, cdshaw@alberta.UUCP (Chris Shaw) writes:
me>The point I'm illustrating is that if you want to do something that 99% of
me>the rest of the world is unwilling to pay for....

HR>So we should not train professional sports figures.  We should not produce
HR>Ph.D.'s.  We should not produce professional programmers.  We should not
HR>produce doctors.  More than 99% of the population is unable to use the
HR>training and education for these fields.  Therefore we should not provide
HR>it for those who can use it.

This analogy is too fallacious to be taken seriously. The training of 
professionals has demonstrable value to society, so society as a whole pays
some of the expense for such training. By contrast, SL is an unproven concept.
It's not even clear that its goals are achievable. The burden of proof for SL
therefore rests upon its advocates. Therefore, they must pay all costs.

HR>But what if only junk is portable?  

We agree on the technical aspects of SL, but we fundamentally
disagree on its worth. HR wants a language and compiler system that programs as
well in assembler as is humanly possible. He wants maximum efficiency. If
necessary, machine-specific detail must be provided, since the paramount goal
is speed. My point is that with a specification such as this, HR is really
looking for a more friendly assembler, since the machine specific statements 
that MUST be provided for (say) the 205 are useless on the 3090. In other words,
one must state the algorithm differently for each target machine. This is
called unportability. So why not just use 205 assembler and not bother with 
SL at all? Well, I can see a few reasons why you'd want SL -- control, storage
allocation, etc. But if half your code is 205 specific, then you're wasting
your time with a high level language. A syntactically-sugared assembler is
still an assembler.

Why is portability important? Because one might want to run a program
on a machine different from the one it was originally written for.

HR>You (CDS) would insist that a good program on a machine, easy to envision, 
HR>not be used because another machine cannot use it efficiently.

I have no idea where HR got this from what I was saying. What I'm saying is that
this mythical easily-envisioned program must be written in one language that
can be compiled on all machines. If for the sake of speed, the program is 
really 7 programs mutually excluded by #ifdefs, then this program could be
better written in the assembly language for each of those 7 machines.

In fact, if it's 7 different programs, then the choice of language is moot. The 
specifics of the compiler and the machine matter a lot more than the features
of the language itself.

HR>It is not the problem of getting people to agree it is worthwhile. I can do
HR>that. (but) the NSF budget does not permit the funding of enough faculty..

Well, it would seem that SL is low on the list of priorities. A pity, perhaps,
but I think that HR is tilting at windmills.

me>I'd wager that the grail --portable and maximally efficient-- is impossible.

HR>I agree.  But semi-portable and close to efficient is not impossible. You 
HR>may have to abandon an algorithm completely on some machines.  

Then this is UNportable, and for good reason. HR's example of correct
random numbers shows that you can't do some things portably. Fine. This is no
different than device driver code. But if HALF your code is device driver code,
why bother with portability at all? You can have portability or speed. Not
both. The boundaries are being pushed, but I don't think that a syntactic
patch-up job will do what you want.

HR>There is a heck of a lot of portability which can be achieved, but not in any
HR>of the current HLLs.  The problem of adding vectors is portable, but the C
HR>code to do it efficiently on one machine is inefficient on another.

I think this might have a lot to do with a lack of consensus on what is a "good"
vector add instruction. Or maybe the C compiler sucks on one of the machines.
Sue the compiler manufacturer. It's not the language's fault.
Anyway, QUANTIFY "heck of a lot". Quantify that across 3 machines. You may be
able to hand compile your C program nicely, but can a program (or 3 programs)
do it?

HR>But C does not handle transfers too well.  There are situations where what is
HR>wanted is a branch to subroutine, not a subroutine call.  C does not handle
HR>this at all. And what about a recursion in which
HR>one wishes to force the use of quantities unchanged in registers?

This is bogus. The implementation of a subroutine call is up to the compiler
writer, and a bad implementation does not imply that C is bad.

HR>There may be similarities, but full portability is totally incompatible
HR>with even fair efficiency.

Exactly. So what's your point? You want your cake and eat it too, it seems,
and are crying because nobody will deliver. Tough bananas. Prove how stupid the
rest of the world is, and come up with a better alternative. I doubt you can
do it, because the maximal use of a particular machine probably requires too
much specific and hence non-portable information about that machine.

If portability is not your goal, then who cares? Assembler fits your bill.

>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

-- 
Chris Shaw    cdshaw@alberta.UUCP (or via watmath or ubc-vision)
University of Alberta
CatchPhrase: Bogus as HELL !

dave@lethe.UUCP (Dave Collier-Brown) (05/07/89)

In article <1281@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
HR>I have made one concrete suggestion for something which is not a language,but
HR>which would make for versatile relatively easy programming. It is a versatile
HR>macro processor; maybe one should call it a user-controlled semi-compiler. 
HR>...an example for the use of a single machine instruction on the CYBER 205...

In article <2253@pembina.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>I guess my point is that if you only want nifty expressions for the 205, you're
CS>simply asking for a spoonful of syntactic sugar, assuming that the #asm
CS>thing allows proper data structure access. If the compiler-asm link is
CS>inadequate, then this is probably a straightforward system fix. I'd wager that
CS>the grail -- portable and maximally efficient -- is impossible.

  I'm inclined to agree that portability and maximal "fit" to the machine are
probably incompatible, but I rather like the idea of using a general-purpose
macroprocessor to mock up a mini-language.

  Andrew Tanenbaum (of Minix fame) used to suffer from similar financial
strictures, and used the same solution. He carefully misused a macroprocessor.
(The article, I believe, was in SP&E many moons ago).

--dave (1: this is getting awfully far from architecture
        2: I'd use lisp to write the translator, myself) c-b


-- 
David Collier-Brown,  | {toronto area...}lethe!dave
72 Abitibi Ave.,      |  Joyce C-B:
Willowdale, Ontario,  |     He's so smart he's dumb.
CANADA. 223-8968      |

dean@columbine.Berkeley.EDU (R. Drew Dean) (05/07/89)

In article <1287@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>I agree.  But semi-portable and close to efficient is not impossible.  You 
>may have to abandon an algorithm completely on some machines.  In some cases
>the algorithm may have to be entirely abandoned.  This is especially the case
>in generating random variables, where it is the method which must be correct,
>NOT the answer.  The CRAY-1 is so bad I consider it a basket case for
>vectorization.  I would suggest connecting a mini, or even a micro, to
>do this part of the job.  The algorithm I would use on the 205 differs 
>from what I would use on the IBM 3090; the 205 algorithm could be implemented
>on the 3090, but would be decidedly more costly.  Implementing the 3090
>algorithm on the 205 would require breaking up the vector into short pieces,
>which is required on vector register machines such as the 3090 and CRAY, but
>is not required on vector stream machines such as the 205.  But to me, these
>considerations are immediate and obvious.
>
>-- 
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

Can you please provide examples of the different algorithms you use on these
different machines ?
I refer you to the ACM Turing Lecture book, where I paraphrase from (I forget
whose lecture this was, the book is in the library where I'm not, but if
my memory hasn't dropped too many refresh cycles it's McCarthy's...)
	Look at generating Fibonacci numbers.  The obvious algorithm
	F(n) = F(n-1) + F(n-2) generates a process with exponential
	running time, but this can be transformed into a process with
	linear running time...[new algorithm omitted]  I don't care
	how great your "optimizing" compiler is -- a compiler which
	sees this transformation (reducing the order of growth), generating
	not necessarily perfect code, will trounce your compiler....
Note: Ok, this would be a really difficult compiler to write....I don't know
of anybody doing it yet.  But it's the principle that matters here:
It's _not_ how micro-efficient your code is that matters.  If it's
too slow, rethink the algorithm.  That's where the real difference is --
if you reduce the order of growth, wonderful things start happening.
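
Dean's order-of-growth point can be made concrete with the Fibonacci example
itself.  A sketch (function names mine), with a counter to expose the
exponential blowup of the obvious recursion:

```c
#include <assert.h>

static unsigned long calls;  /* counts work done by the naive version */

/* The obvious recursive definition: exponential time. */
static unsigned long fib_naive(unsigned n)
{
    calls++;
    return (n < 2) ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

/* The same function, linear time: carry the last two values forward. */
static unsigned long fib_linear(unsigned n)
{
    unsigned long a = 0, b = 1, t;
    while (n--) { t = a + b; a = b; b = t; }
    return a;
}
```

Both return the same answers, but fib_naive(10) already makes 177 calls where
fib_linear makes ten loop iterations; no peephole cleverness closes that gap.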

If your random variable generator uses an algorithm that executes in the
least known time, then please forgive this posting....

Drew Dean
Internet: dean@xcssun.berkeley.edu
UUCP: ...!ucbvax!xcssun!dean
FROM Disclaimers IMPORT StandardDisclaimer;

cik@l.cc.purdue.edu (Herman Rubin) (05/07/89)

In article <13396@pasteur.Berkeley.EDU>, dean@columbine.Berkeley.EDU (R. Drew Dean) writes:
> In article <1287@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
< <I agree.  But semi-portable and close to efficient is not impossible.  You 
< <may have to abandon an algorithm completely on some machines.  In some cases
< <the algorithm may have to be entirely abandoned.  This is especially the case
< <in generating random variables, where it is the method which must be correct,
< <NOT the answer.  The CRAY-1 is so bad I consider it a basket case for
< <vectorization.  I would suggest connecting a mini, or even a micro, to
< <do this part of the job.  The algorithm I would use on the 205 differs 
< <from what I would use on the IBM 3090; the 205 algorithm could be implemented
< <on the 3090, but would be decidedly more costly.  Implementing the 3090
< <algorithm on the 205 would require breaking up the vector into short pieces,
< <which is required on vector register machines such as the 3090 and CRAY, but
< <is not required on vector stream machines such as the 205.  But to me, these
< <considerations are immediate and obvious.

> Can you please provide examples of the different algorithms you use on these
> different machines ?

I am in the process of producing a report on this.  I will send a copy to
anyone who requests it.  It is possible I may post (not to this group) an
abbreviated source form.  

> I refer you to the ACM Turing Lecture book, where I paraphrase from (I forget
> whose lecture this was, the book is in the library where I'm not, but if
> my memory hasn't dropped too many refresh cycles it's McCarthy's...)
> 	Look at generating Fibonacci numbers.  The obvious algorithm
> 	F(n) = F(n-1) + F(n-2) generates a process with exponential
> 	running time, but this can be transformed into a process with
> 	linear running time...[new algorithm omitted]  I don't care
> 	how great your "optimizing" compiler is -- a compiler which
> 	sees this transformation (reducing the order of growth), generating
> 	not necessarily perfect code, will trounce your compiler....
> Note: Ok, this would be a really difficult compiler to write....I don't know
> of anybody doing it yet.  But it's the principle that matters here:
> It's _not_ how micro-efficient your code is that matters.  If it's
> too slow, rethink the algorithm.  That's where the real difference is --
> if you reduce the order of growth, wonderful things start happening.

This is definitely the case if you want only F(1384257).  I am not sure
about F(100).  But if you want F(1), ..., F(100), I suggest you
might want to use the obvious algorithm.  Vector machines are another
problem; I have no idea how to vectorize this algorithm.  On some vector
machines, I can come up with a safe way to do it, but I do not know how
good it is.

> If your random variable generator uses an algorithm that executes in the
> least known time, then please forgive this posting....

Given sufficient resources, I can come arbitrarily close to the algorithm
which uses the least possible expected number of bits.  Possibly even the
"arbitrarily close" can be removed in an algorithm with a finite expected
time.  The algorithms which are bit efficient are not likely to be time
efficient, although I have some hope that some can do a fair job of both.
Technical reports on this are available, although incomplete.
> Drew Dean
> Internet: dean@xcssun.berkeley.edu
> UUCP: ...!ucbvax!xcssun!dean
> FROM Disclaimers IMPORT StandardDisclaimer;


-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

mike@thor.acc.stolaf.edu (Mike Haertel) (05/07/89)

In article <1287@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There is a heck of a lot of portability which can be achieved, but not in any
>of the current HLLs.  The problem of adding vectors is portable, but the C
>code to do it efficiently on one machine is inefficient on another.  In this
>respect C is too low level.  The same holds for string management.
>
>But C does not handle transfers too well.  There are situations where what is
>wanted is a branch to subroutine, not a subroutine call.  C does not handle
>this at all.  Here C is too high level.  C can do recursion, but not
>coroutines.  And what about a recursion in which one wishes to force
>the use of quantities unchanged in registers?

You seem to have the language C confused with the existing implementations
of C.  Granted, coroutines cannot be expressed in C.  But there is no
reason a C compiler couldn't generate more efficient procedure calls by
using the simpler instructions, or do smart register allocation across
multiple procedures.

Saying "C is inadequate because it leads to inefficient code" is like
saying "Fortran is good because it generates efficient numeric code."
I bet I could write a Fortran compiler that generated awful numeric code!

The Lisp community has understood this difference for years.  As long
as anyone can remember people have been saying "Lisp is slow."  Well,
the MACLISP compiler (on the PDP-10) generated better numeric code than
DEC's Fortran compiler.

Please try not to confuse a language itself (an abstraction, really)
with the implementations you have seen.
-- 
Mike Haertel <mike@stolaf.edu>
main() ??< printf("hello, world??/n"); ??>

aglew@mcdurb.Urbana.Gould.COM (05/08/89)

>prem@crackle.amd.com (Prem Sobel) writes
>>Many years ago, there was a machine called the Interdata Model 70 which
>>had instructions for atomically adding or removing items from circular
>>double ended queues. The data structure was defined reasonable effeciently
>>and the machine was microcoded.
>>
>>Yet no one, no compiler seriously used these instructions. The reason was,
>>amazingly, that individual instructions were faster!!! I never looked at
>>the microcode, so I cannot comment on why that was.
>
>    Is this sort of finding not precisely the reason behind RISC architecture?
>    The realisation that many an instruction to handle a relatively
>    sophisticated job turned out to be slower than a series of very simple
>    instructions to do the equivalent thing?
>
>    I myself am not quite clear yet on why this should be ( is there perhaps a
>    reason that larger, more sophisticated microcode routines can't be made
>    optimally fast?).  Yet it does seem to be a fact of life.  
>
>
>    Alastair Milne

The Gould NP1 had queue manipulation instructions that provided atomicity.
And, yes, the OS avoided using them, because a hand-crafted set of individual
instructions were faster.

But the CISCy queue manipulation instructions were left in because the hand
crafted sequences used privileged operations not callable from user mode.
Creating a system call to do the same operation was more expensive than 
leaving the microcoded instructions - especially since the system call would
have had to do privilege checks that the hand-crafted inside-the-OS version
didn't.
    Providing user-level access to these instructions was only important
because Gould's Real-Time Simulator customers regularly did explicit parallel
programming (the queue operations were message primitives, if you will)
- and they *did* *not* want to have to run their entire application 
privileged.

Now, I would much rather have decomposed these CISCy operations into a RISCy
set of primitives that might be used to implement other multiprocessing
primitives. But, the normal arithmetic and logic primitives are not sufficient.
Several times I have attempted to start discussion in comp.arch about what
such primitives might be - here is what I remember of the previous
discussions (I have the highlights archived, but not easily accessible).
It may be time for another go-around.

RISCy PRIMITIVES FOR PARALLEL PROCESSING
========================================

Active Memory
-------------
    Some manufacturers implement special kinds of memory that provide extended
semantics suitable for multiprocessing - like Sequent's MULTIBUS semaphore
locations.  Somebody at MIPS said something similar, but it wasn't clear
if the approach was actually used.
    PRO: no extra CPU operations.
    CON: resource allocation problem - especially if you want to make
these accessible to ordinary user processes - and if the active memory is
limited in size (the HEP's full/empty bits on all memory locations wouldn't
suffer this).

Atomicity Boundaries
--------------------
    Letting the user block interrupts, but only for a maximum amount of time
(erroring if still blocked at the end of the max time) has been tried, e.g.
in Honeywell and Norsk Data(?) machines.
    Similarly, architecting that interrupts may occur only at, say, modulo
N instruction boundaries (either static or dynamic - static requiring special
provision for loops) can let the user specify some uninterruptable
(atomic wrt interrupts on the same processor) operations, with help from
language system to locate operations between the appropriate boundaries.
    Apparently the ARM does something similar, only checking for interrupts 
at branches.
    
Obviously, approaches similar to those for interrupts can be applied to 
multiple processor synchronization.
    
Split Synchronization
---------------------
    My contribution has been to point out that such synchronization operations
may take a long time, and may be split into START-SYNCHRONIZATION and
STOP-SYNCHRONIZATION operations - and that other instructions may be placed 
between these synchronization points. So you do not necessarily have to
stop the entire processor when performing one of these operations.
    Recently a similar idea appeared in ASPLOS III in the paper titled
"Fuzzy Barriers..."
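
A minimal sketch of the split idea in modern terms, using C11 atomics.  The
names `barrier_t`, `barrier_start`, and `barrier_stop` are invented for
illustration: arrival and completion are separate calls, so independent work
can be scheduled between them.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_int arrived;   /* how many participants have started */
    int        nproc;     /* total number of participants */
} barrier_t;

/* START-SYNCHRONIZATION: announce arrival, but do not wait. */
void barrier_start(barrier_t *b)
{
    atomic_fetch_add(&b->arrived, 1);
}

/* STOP-SYNCHRONIZATION: now wait until everyone has arrived. */
void barrier_stop(barrier_t *b)
{
    while (atomic_load(&b->arrived) < b->nproc)
        ;  /* spin; a real implementation would back off or yield */
}
```

Between barrier_start and barrier_stop a processor overlaps useful
instructions with the synchronization latency - the same overlap the
"Fuzzy Barrier" paper formalizes.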


Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
   Motorola Microcomputer Division, Champaign-Urbana Design Center
	   1101 E. University, Urbana, Illinois 61801, USA.
   
My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

schow@bnr-public.uucp (Stanley Chow) (05/08/89)

In article <504@daitc.daitc.mil> jkrueger@daitc.daitc.mil (Jonathan Krueger) writes:
>In article <426@bnr-fos.UUCP>, schow@bnr-public (Stanley Chow) writes:
>>it is not necessary for all programmers to know enough to use
>>the special instructions. Some guru can set it up and everyone can
>>then use the pre-built modules. Depending on how you set everything
>>up, the compiler does not have to be smart at all.
>
>Of course, but you lose the performance advantage if the modules get
>called frequently and require significant setup (e.g. can't maintain
>state for multiple callers or sequences) with respect to work
>performed.  Or to put it another way, That Way Lies Threaded Code.
>However, this isn't an argument for Better Living Through Inlining,
>rather an appeal for analysis and measurement of the tradeoffs.
>

This depends a lot on the system and language you use. We spend
considerable time making sure the utilities are suited to the
application. (This is easy, since this whole system is a single
application, and we have measured it to death).

In our operating system, maintaining states for multiple callers
is easy and essentially free. Most of the funny opcodes can be
easily generated by a peepholer and the really strange opcodes are
what we call Intrinsics - half way between macros and inlining.


>>For example, we have a system with really wild instructions that gets
>>generated even though the user don't know what they are and the
>>compiler does only peephole optimization. As it turns out, having
>>atomic operations for free is really nice.
>
>Sounds like you've managed to let your users express what they want in
>terms familiar to them but also understandable to the language
>translator, following which optimization to particular hardware can be
>done in a reasonable way.  By any chance, would this be done by way of
>a function library?

We use just about every trick in the book and some that aren't. Since
we have a project that is big enough to support our own language and
operating system, we have a lot of freedom. Usually, we figure out the
hardware/firmware limits, decide how we want it in the language, then
change the language/compiler and operating system to suit. Sorry I 
can't give details.


Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
I am just a small cog in a big machine. I don't represent nobody.

schow@bnr-public.uucp (Stanley Chow) (05/08/89)

Since this is moving from architecture to language, I have set
Follow-up: to comp.lang.misc.

In article <1277@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <504@daitc.daitc.mil>, jkrueger@daitc.daitc.mil (Jonathan Krueger) writes:
>> In article <426@bnr-fos.UUCP>, schow@bnr-public (Stanley Chow) writes:
>> >it is not necessary for all programmers to know enough to use
>> >the special instructions. Some guru can set it up and everyone can
>> >then use the pre-built modules. Depending on how you set everything
>> >up, the compiler does not have to be smart at all.
>
>I agree it is not necessary for ALL programmers to know how to really
>use the computer, any more than it is necessary for the house painter
>to be an artist.  But the artist cannot set up the modules for a
>house painter to produce a work of art.
>

This is probably a major difference in opinion between groups. You
are an artist and feel constrained by the languages that are designed
for painting houses. Much of the programming in this world is done by
house painters as opposed to artists. 

In the case of our project, all "artists" who want or need to are turned
into gurus and set up utilities for themselves and others to use. The
system is by no means frozen.         


I can understand your pleas for flexibility. Fortunately, I have quite
a bit of it. I am allowed to change (or at least influence) everything
from hardware, firmware, language, compiler, operating system through to
any application code. If someone sees a new use for some instructions,
we change the utilities or compiler as needed. New hardware that will
be useful? We will just throw it into the next revision.


Having said that, I must point out that we are (very?) unusual. Most
projects are not big enough to support this style of working. If you
have to rely on a commercial compiler, chances are the compiler was
not optimized for your needs. This means the compiler writer had to
trade off many different demands. Unfortunately (from your point
of view), portability is high priority, a good debugger is important,
and good object code from portable source is compulsory. Since most
people don't care about flexibility to take advantage of different
architectures, you don't get that from most commercial compilers.


Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
I am just a small cog in a big machine. I don't represent nobody.

dietz@cs.rochester.edu (Paul Dietz) (05/09/89)

In article <13396@pasteur.Berkeley.EDU> dean@columbine.Berkeley.EDU (R. Drew Dean) writes:

>	Look at generating Fibonacci numbers.  The obvious algorithm
>	F(n) = F(n-1) + F(n-2) generates a process with exponential
>	running time, but this can be transformed into a process with
>	linear running time...[new algorithm omitted]

You can generate the nth Fibonacci number using O(log n) arithmetic
operations (albeit operations on numbers with O(n) bits).

Hint:  if A is the 2x2 matrix ((0 1) (1 1)), the lower right corner
of A^n contains Fib(n+1).
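
Coded out as a minimal C sketch (names mine): exponentiation by squaring
does O(log n) 2x2 matrix multiplies.  The off-diagonal entry of A^n is
Fib(n), and the lower right corner is Fib(n+1) as stated.

```c
#include <assert.h>

typedef struct { unsigned long long m[2][2]; } mat2;

/* Multiply two 2x2 matrices. */
static mat2 mat_mul(mat2 x, mat2 y)
{
    mat2 r;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            r.m[i][j] = x.m[i][0] * y.m[0][j] + x.m[i][1] * y.m[1][j];
    return r;
}

/* Fib(n) via A^n, where A = ((0 1) (1 1)), by repeated squaring. */
static unsigned long long fib(unsigned n)
{
    mat2 a = {{{0, 1}, {1, 1}}};
    mat2 r = {{{1, 0}, {0, 1}}};   /* identity */
    for (; n > 0; n >>= 1) {
        if (n & 1) r = mat_mul(r, a);
        a = mat_mul(a, a);
    }
    return r.m[0][1];              /* off-diagonal of A^n holds Fib(n) */
}
```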

	Paul F. Dietz
	dietz@cs.rochester.edu

josh@klaatu.rutgers.edu (J Storrs Hall) (05/09/89)

Paul F. Dietz writes:

    You can generate the nth Fibonacci number using O(log n) arithmetic
    operations (albeit operations on numbers with O(n) bits).

    Hint:  if A is the 2x2 matrix ((0 1) (1 1)), the lower right corner
    of A^n contains Fib(n+1).

Or, using the same trick, generate all the fibonacci numbers up to
the nth in log n time in arithmetic vector operations.  

However, there is a catch:  the arithmetic operations are multiply
rather than add.  And since we're talking about architecture here,
remember that for any of this to make any difference at all you 
must be considering values of n where the size of the number is way
over any reasonable word size.  Every arbitrary-precision arithmetic
package I've seen has linear add and quadratic multiply.  Given
that, both algorithms are quadratic in n.

--JoSH

chuck@melmac.harris-atd.com (Chuck Musciano) (05/10/89)

In article <2223@wpi.wpi.edu> lfoard@wpi.wpi.edu (Lawrence C Foard) writes:
>Come on, who defines qualified professional??? Of course it will depend on how
>many expensive slips of paper you have (degrees). History is full of great
>ideas that were at first rejected by the "qualified professionals". If the
>car maker's only explanation is "it isn't done that way" there is no reason
>why you should accept that for an answer. There may be a hundred reasons why
>SL might be a bad idea but you didn't name any; you just stated that you were
>a qualified professional and that peons aren't allowed to question you! This
>isn't the dark ages where the pope was infallible; people, no matter how well
>trained or professional, can still make mistakes, and many times it takes an
>outsider to see them. If you had brought the idea of RISC up to the "qualified
>professionals" in the VAX development team, I wonder what the answer would have
>been?

     "Qualified professional" is a term defined by other qualified 
professionals.  If they came from real schools, those "expensive slips of
paper" represent large amounts of knowledge.  When you have the slips of
paper, you are a graduate, nothing more.  After you go into the world and do
something with the knowledge, you begin to attain the rank of professional.

     In my example, the car maker doesn't necessarily say "it isn't done
that way", he will probably say "it can't be done that way".  I don't have
enough knowledge to dispute him.  If he is a responsible professional, he'll
mull over my suggestions, and perhaps modify his thinking.  If I'm an
interested third party, I'll educate myself to better understand the problem.

     I didn't say anything about SL.  I can't, since it doesn't exist.  The
point I was trying to make is that before one begins taking aim at language
designers, one should learn a lot about language design.  Without sufficient
background, you simply cannot comment.  I can't count the times people have 
come to me with some new language feature, not realizing the impact on the
rest of the language.  Lots of things that people think are simple represent
massive work, based upon the way a particular compiler has been built.  Other
features are completely incompatible with the language philosophy and 
aesthetics.  Still others are not extensible to include existing language
features.  My job as a professional is to sift the wheat from the chaff,
and do the things that are right.

     Anyone can feel free to question my work.  The weight I give their
questions is commensurate with their professional background.  I try to
give an acceptable answer to anyone who asks me things, couched in a way
that matches their knowledge.  When my Mom asks how computers work, I
don't launch into a discussion of the impact of cache loading on pipeline
stalls.  When a coworker asks why I chose a particular algorithm in some
piece of code, I'll go into great detail, as needed.

     My wife is an industrial engineer.  I don't presume to tell her how to
design automated factories.  She doesn't tell me how to build compilers.
I might offer some input when she is designing assembly method sheets, since
the psychology of instruction writing is somewhat similar to the psychology
of user interface design.  In that realm, we are qualified peers.

     Certainly, professionals make mistakes.  I've made my share.  Almost
all of them have been pointed out by other, equally qualified professionals.
This field is so complex, and the issues so detailed, that it is difficult for
laymen to understand the subtle shadings of particular issues.

     While the merits of the Vax architecture design have been debated in this
group ad infinitum, I would still hazard the guess that the Vax team would
have correctly rejected RISC had it been presented to them.  They were filling
a particular market niche that RISC, at that time, did not meet.  That isn't
to say that they would not have given it a lot of thought, and perhaps begun
to think of their next machine which might use RISC technology.

Chuck Musciano			ARPA  : chuck@trantor.harris-atd.com
Harris Corporation 		Usenet: ...!uunet!x102a!trantor!chuck
PO Box 37, MS 3A/1912		AT&T  : (407) 727-6131
Melbourne, FL 32902		FAX   : (407) 727-{5118,5227,4004}

jkrueger@daitc.daitc.mil (Jonathan Krueger) (05/10/89)

In article <2254@pembina.UUCP>, cdshaw@alberta (Chris Shaw) writes:
>HR>It is not the problem of getting people to agree it is worthwhile. I can do
>HR>that. (but) the NSF budget does not permit the funding of enough faculty..
>
>Well, it would seem that SL is low on the list of priorities. A pity, perhaps,
>but I think that HR is tilting at windmills.

Well, let's just say that Knuth's ideas about computer typesetting are
better understood with reference to TeX.  It exists, it runs, and it
can be critically examined.  And he didn't complain on the net until
someone else built it.

-- Jon
--