[comp.arch] ABI and growth

hutch@fps.com (Jim Hutchison) (01/16/90)

When a "thing" gets off the drawing board and into practice, there always
seem to be one or two new features that people need almost immediately
(within days, weeks, or months).  This is certainly understandable; for the
sake of simplicity we could call it growth.

Now, here we have the ABI(s) and people already want them to grow to include
FPAs, Application Processors, and such.  Sounds good.  Unfortunately it seems
that when these neat things are added, the result gets called a "new" ABI.
With many of the vendors making "new" ABIs for their new features, it would
seem that the purpose of the ABI will be lost.  Clearly that result would be bad.

So here is a thought I was toying with.  How about allowing for library calls
to trap and re-write their calls as the appropriate instruction?  I can see
some problems right away with handling shared pages neatly, and having space
and parameters to re-write the call with.  Also, some FPAs need a setup call
to get themselves in order when the game starts; this could probably be done
in the "trap" handler, presuming that it had a way to get the information it
needed.  It's a thought.  What does anyone else think about this?
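
A minimal sketch of the trap-and-rewrite idea, as a toy model in Python rather
than real machine code.  Everything here is hypothetical: `Machine`, `trap`,
`soft_fadd`, and `fpa_fadd` are made-up names, a "call site" is just a slot in
a list, and the one-time FPA setup is reduced to setting a flag.

```python
# Hypothetical model: a "program" is a list of call-site slots.  Each slot
# starts unpatched (None); the trap handler rewrites it on first use.

def soft_fadd(a, b):
    """Stand-in for the library routine any ABI-conforming binary can call."""
    return a + b

def fpa_fadd(a, b):
    """Stand-in for a single FPA instruction on machines that have one."""
    return a + b

class Machine:
    def __init__(self, has_fpa):
        self.has_fpa = has_fpa
        self.fpa_ready = False
        self.traps = 0

    def trap(self, program, site):
        """Runs on the first execution of a call site and rewrites it."""
        self.traps += 1
        if self.has_fpa:
            if not self.fpa_ready:
                self.fpa_ready = True   # one-time FPA setup done in the handler
            program[site] = fpa_fadd    # rewrite the call as the "instruction"
        else:
            program[site] = soft_fadd   # no FPA: bind to the library routine
        return program[site]

    def run(self, program, site, a, b):
        target = program[site]
        if target is None:              # unpatched site: take the trap
            target = self.trap(program, site)
        return target(a, b)

program = [None, None]                  # two unpatched call sites
m = Machine(has_fpa=True)
results = [m.run(program, 0, 1.5, 2.0) for _ in range(3)]
```

Only the first execution of a given site pays for the trap; a site that is
never reached is never rewritten.  The sketch ignores the shared-page and
space problems raised above.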

--
/*    Jim Hutchison   		{dcdwest,ucbvax}!ucsd!celerity!hutch  */
/*    Disclaimer:  I am not an official spokesman for FPS computing   */

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/16/90)

In article <6186@celit.fps.com> hutch@fps.com (Jim Hutchison) writes:

| Now, here we have the ABI(s) and people already want them to grow to include
| FPAs, Application Processors, and such.  Sounds good.  Unfortunately it seems
| that when these neat things are added, the result gets called a "new" ABI.

  That is the problem in a nutshell.

| So here is a thought I was toying with.  How about allowing for library calls
| to trap and re-write their calls as the appropriate instruction?  I can see
| some problems right away with handling shared pages neatly, and having space
| and parameters to re-write the call with.

  It could work if you have copy on write. Or maybe not, since the page
will still be valid for all processes using it. I think this requires
some careful thought about doing stuff with the MMU and paging. The usual
practice is to overwrite a code page rather than swap it, since it
(usually) hasn't been modified. I suppose this wouldn't be a problem,
since if a new, unmodified, copy came back in it would become a modified
copy quickly.

  You would need to lock the page while writing it in a multi-CPU
system, which might require a lot of MMU extensions.
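
  A toy illustration of the copy-on-write concern, in Python. This is one
reading of the point above, not a model of any real MMU: `Process` and `patch`
are hypothetical names, and a "page" is just a bytearray.

```python
class Process:
    """Hypothetical process holding one mapping of a shared text page."""

    def __init__(self, shared_page):
        self.page = shared_page      # initially mapped to the shared page
        self.private = False

    def patch(self, offset, value):
        if not self.private:         # first write: copy the page, then write
            self.page = bytearray(self.page)
            self.private = True
        self.page[offset] = value

shared_text = bytearray(b"\x01\x02\x03\x04")
a = Process(shared_text)
b = Process(shared_text)
a.patch(0, 0xFF)                     # process a rewrites one call site
```

  After the patch, `a` runs its own private copy while `b` still maps the
original, unpatched page. So with copy on write each process pays for its own
patch and loses the sharing; without it, one process's patch would be seen by
everyone mapping the page.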

  This is an old problem. The VAX gets around some of it by allowing
"writable control store" (user-defined microcode for certain
instructions). A friend did a master's thesis based on implementing an
FFT instruction in microcode. It was almost twice as fast as doing it
with discrete instructions.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

desnoyer@apple.com (Peter Desnoyers) (01/17/90)

In article <2020@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
> | So here is a thought I was toying with.  How about allowing for library calls
> | to trap and re-write their calls as the appropriate instruction?  I can see
> | some problems right away with handling shared pages neatly, and having space
> | and parameters to re-write the call with.
> 

This sounds like a variation on run-time loading. I'm not sure of the 
specifics, but I believe some variation on {link in call to loader, load 
function on first call, patch stub to jump to loaded function} is used in 
OS/2, the Macintosh, and TOPS-20. The patching is easier in this case, 
however, as you are always patching the same run-time load stub instead of 
arbitrary code.
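
A rough sketch of the {call loader, load on first call, patch stub} pattern, in
Python. It is not how OS/2 or the Macintosh actually do it; `LIBRARY`, `stubs`,
`make_stub`, and `call` are hypothetical names, and "loading" is reduced to a
dictionary lookup.

```python
import math

# Hypothetical "library on disk": name -> function to be loaded on demand.
LIBRARY = {"sqrt": math.sqrt, "floor": math.floor}

stats = {"loads": 0}   # counts how often the loader actually runs
stubs = {}             # the one well-known stub table that gets patched

def make_stub(name):
    def stub(*args):
        stats["loads"] += 1
        func = LIBRARY[name]     # "load" the function on first call
        stubs[name] = func       # patch the stub slot to jump straight there
        return func(*args)
    return stub

for _name in LIBRARY:
    stubs[_name] = make_stub(_name)

def call(name, *args):
    """Callers always go through the stub table, never to the function."""
    return stubs[name](*args)

first = call("sqrt", 16.0)       # goes through the loader stub
second = call("sqrt", 25.0)      # goes directly to the loaded function
```

The only thing ever rewritten is the stub table, which is why this is easier
than patching arbitrary call sites in shared code.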

>   It could work if you have copy on write. Or maybe not, since the page
> will still be valid for all processes using it. I think this requires
> some careful thought about doing stuff with the MMU and paging. The usual
> practice is to overwrite a code page rather than swap it, since it
> (usually) hasn't been modified. I suppose this wouldn't be a problem,
> since if a new, unmodified, copy came back in it would become a modified
> copy quickly.

If the code page is used infrequently enough that it gets swapped out, 
then whether or not the optimized version of the instruction gets used is 
a moot point.

                                      Peter Desnoyers
                                      Apple ATG
                                      (408) 974-4469

henry@utzoo.uucp (Henry Spencer) (01/17/90)

In article <2020@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  This is an old problem. The VAX gets around some of it by allowing
>"writable control store" (user-defined microcode for certain
>instructions). A friend did a master's thesis based on implementing an
>FFT instruction in microcode. It was almost twice as fast as doing it
>with discrete instructions.

Which is actually fairly impressive, since the VAX WCS seems to have been
an afterthought and there was substantial overhead involved in getting to
it.  I guess his FFT instruction did enough work to amortize the overhead
pretty well.
-- 
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1990: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mrc@Tomobiki-Cho.CAC.Washington.EDU (Mark Crispin) (01/17/90)

In article <6197@internal.Apple.COM> desnoyer@apple.com (Peter Desnoyers) writes:
>This sounds like a variation on run-time loading. I'm not sure of the 
>specifics, but I believe some variation on {link in call to loader, load 
>function on first call, patch stub to jump to loaded function} is used in 
>[...] TOPS-20.

Not that I know of, and I was a TOPS-20 OS programmer for 10 years.
It was possible for user programs to do something like this, and I
think the FORTRAN overlay system (used mostly on the older TOPS-10
operating system) did, but it was not at all general practice.
TOPS-20 is a demand-paged virtual memory operating system.  Usually
your program would have everything lunk in with it, or call a
well-defined library segment (usually a TOPS-10 style "high segment")
that would be separately loaded.  "Loading" an executable program (as
opposed to linking relocatable binaries) merely involved setting the
swap pointers for the appropriate process page(s) to that particular
file on the disk; this made sharing of pure pages easy (and in fact
impossible to avoid).
 _____     ____ ---+---   /-\   Mark Crispin           Atheist & Proud
 _|_|_  _|_ ||  ___|__   /  /   6158 Lariat Loop NE    R90/6 pilot
|_|_|_| /|\-++- |=====| /  /    Bainbridge Island, WA  "Gaijin! Gaijin!"
 --|--   | |||| |_____|   / \   USA  98110-2098        "Gaijin ha doko ka?"
  /|\    | |/\| _______  /   \  +1 (206) 842-2385      "Niichan ha gaijin."
 / | \   | |__| /     \ /     \ mrc@CAC.Washington.EDU "Chigau. Gaijin ja nai.
kisha no kisha ga kisha de kisha-shita                  Omae ha gaijin darou."
sumomo mo momo, momo mo momo, momo ni mo iroiro aru    "Iie, boku ha nihonjin."
uraniwa ni wa niwa, niwa ni wa niwa niwatori ga iru    "Souka. Yappari gaijin!"

ddb@ns.network.com (David Dyer-Bennet) (01/17/90)

In article <5344@blake.acs.washington.edu> mrc@Tomobiki-Cho.CAC.Washington.EDU (Mark Crispin) writes:
:In article <6197@internal.Apple.COM> desnoyer@apple.com (Peter Desnoyers) writes:
:>This sounds like a variation on run-time loading. I'm not sure of the 
:>specifics, but I believe some variation on {link in call to loader, load 
:>function on first call, patch stub to jump to loaded function} is used in 
:>[...] TOPS-20.
:
:Not that I know of, and I was a TOPS-20 OS programmer for 10 years.

I did implement something like this for TOPS-20 dynamic library
support, which was actually used with the TOPS-20 version of Datatrieve
(did that ever ship?) but not, so far as I know, with anything else.
It didn't require any changes in the OS (it used the PDV facility that
appeared in version whatever).  Each dynamic library was loaded into
its own segment (it started out as a one-weekend hack, ok?).
-- 
David Dyer-Bennet, ddb@terrabit.fidonet.org
or ddb@network.com
or Fidonet 1:282/341.0, (612) 721-8967 9600hst/2400/1200/300
or terrabit!ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!terrabit!ddb

johnl@esegue.segue.boston.ma.us (John R. Levine) (01/17/90)

In article <6186@celit.fps.com> hutch@fps.com (Jim Hutchison) writes:
|So here is a thought I was toying with.  How about allowing for library calls
|to trap and re-write their calls as the appropriate instruction?

This sort of thing has been done since time immemorial (which in this
business is about 25 years).  A few years back I was programming an HP 1000,
which is sort of an overgrown 16-bit PDP-8.  When they added floating point
instructions, they made them take their arguments in the place where they'd
be for a call to the floating whatever routine, so the linker could patch the
call into the appropriate instruction.
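
A toy sketch of that kind of link-time patch, in Python.  The opcode and
routine names (`CALL`, `ARG`, `FADD`, `float_add`) are invented for
illustration and are not HP 1000 mnemonics; the only point is that the
argument words stay where they were, so swapping the call for an instruction
changes nothing else.

```python
# Hypothetical object code: (opcode, operand) pairs.  The argument word
# follows the call in-line, where the routine would expect to find it.
OBJECT_CODE = [
    ("CALL", "float_add"),   # call to the software floating-add routine
    ("ARG", "x"),            # argument word following the call
    ("CALL", "print"),       # unrelated call, must be left alone
    ("ARG", "x"),
]

# Routines that have a hardware instruction taking arguments the same way.
HARDWARE_OPS = {"float_add": "FADD", "float_mul": "FMUL"}

def link(code, has_fp_hardware):
    """Return linked code; patch eligible calls only when hardware exists."""
    linked = []
    for opcode, operand in code:
        if has_fp_hardware and opcode == "CALL" and operand in HARDWARE_OPS:
            linked.append((HARDWARE_OPS[operand], None))  # same slot, same args
        else:
            linked.append((opcode, operand))
    return linked

with_fp = link(OBJECT_CODE, has_fp_hardware=True)
without_fp = link(OBJECT_CODE, has_fp_hardware=False)
```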

More recently, PC/IX had a wonderful hack, again for floating point.  On the
8088, floating point instructions in the absence of a floating point unit do
nothing in particular, so if you generate in-line instructions and there's no
8087, you lose.  On the other hand, if you do have an 8087, the overhead of
library calls is considerable.  One of the PC/IX kernel guys noticed that
every FP instruction was preceded by a one-byte WAIT instruction, and the
first byte of every floating instruction is in the range D8 through DF.
Instead of generating the WAIT, they generated the first byte of an INT
instruction, so when the program got to that point it would generate an
interrupt on one of the vectors from D8 through DF.  If the machine didn't
have an 8087, it could then pick up the rest of the instruction and emulate
it.  If there was an 8087, it would patch the INT to a WAIT, back up the
instruction pointer, and return.  The first time through any piece of code,
you get an interrupt on each floating instruction, but if there's an 8087,
after the first time through any loop the machine runs at full speed.
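
A toy simulation of the trick in Python, to make the byte-level mechanics
concrete.  It is a sketch under simplifying assumptions, not the PC/IX code:
every floating instruction is taken to be exactly three bytes (prefix, escape
byte, one operand byte), whereas real 8087 instructions vary in length, and
"emulating" or "executing" an instruction just means skipping over it.  The
opcode values 0xCD (INT) and 0x9B (WAIT) are the real 8086 encodings.

```python
INT_OPCODE = 0xCD    # first byte of the 8086 "INT n" instruction
WAIT_OPCODE = 0x9B   # 8086 WAIT

def run(code, has_8087):
    """Execute a toy byte stream; return how many interrupts were taken."""
    interrupts = 0
    ip = 0
    while ip < len(code):
        if code[ip] == INT_OPCODE and 0xD8 <= code[ip + 1] <= 0xDF:
            interrupts += 1
            ip += 2                          # CPU has consumed "INT n"
            if has_8087:
                code[ip - 2] = WAIT_OPCODE   # patch the INT to a WAIT
                ip -= 2                      # back up and re-execute natively
            else:
                ip += 1                      # emulate: consume the operand byte
        elif code[ip] == WAIT_OPCODE:
            ip += 3                          # WAIT + escape + operand, native
        else:
            raise ValueError("unexpected byte in toy stream")
    return interrupts

program = bytearray([0xCD, 0xD8, 0xC1, 0xCD, 0xD9, 0xC0])
first_pass = run(program, has_8087=True)     # one interrupt per instruction
second_pass = run(program, has_8087=True)    # already patched: none

emulated = bytearray([0xCD, 0xD8, 0xC1])
emu_first = run(emulated, has_8087=False)    # no 8087: interrupt every time
emu_second = run(emulated, has_8087=False)
```

With an 8087 the code is patched in place on the first pass and takes no
interrupts afterwards; without one, the bytes are left alone and every pass
goes through the emulator.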

I don't propose exactly this hack for future ABIs, but there is a lot of
mileage to be gained by designing your extensions so that you can patch
call instructions on the fly, getting practically full machine speed while
maintaining binary compatibility with older systems.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."