[comp.arch] Superoptimiser.

smryan@garth.UUCP (Steven Ryan) (06/29/88)

In article <7570@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <12171@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
> >          It came up with a four-68000-instruction `signum' function,
> > and its versions of min and max (and minmax together) are also quite
> > surprising.  None of them have any branches.
>(What is signum?)
>Not so surprising for min and max (and minmax).  However, they require a
>sign extending shift, so that implementation is not possible on all
>machines.  Also, if a branch takes only one cycle (with delay slot),
>you do not gain anything (in general).

The article was interesting, but  the technique does not appear ready for
compilers. In the article, the optimiser was used as standalone program that
was fed small pieces of code.

Branches tend to be deadly to fast machines. Delay slots or no, it can still
give the cache/instruction stack indigestion.

By the way, as we continue forward into the past, the Cyber 170 macros are
filled with handcoded sequences to avoid branches.

chuck@amdahl.uts.amdahl.com (Charles Simmons) (06/30/88)

In article <834@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
 >The article was interesting, but  the technique does not appear ready for
 >compilers. In the article, the optimiser was used as standalone program that
 >was fed small pieces of code.

yeah...  The authors basically suggested using the results of the
standalone program to do things similar to peephole optimization,
or writing code sequences for the GNU backend, or providing help
in figuring out cute instruction sequences to use in your innermost
handcoded loop.

 >Branches tend to be deadly to fast machines. Delay slots or no, it can still
 >give the cache/instruction stack indigestion.

Gosh!  You guys aren't thinking big enough.  How about multiple
parallel pipelines to compute all the various instruction threads
in parallel and just keep the results of the one that is actually
taken?  Takes Big Bucks?  Sure.  But when you're using gallium
arsenide, who cares?

aglew@urbsdc.UUCP (07/01/88)

> >Branches tend to be deadly to fast machines. Delay slots or no, it can still
> >give the cache/instruction stack indigestion.
>
>Gosh!  You guys aren't thinking big enough.  How about multiple
>parallel pipelines to compute all the various instruction threads
>in parallel and just keep the results of the one that is actually
>taken?  Takes Big Bucks?  Sure.  But when you're using gallium
>arsenide, who cares?

The IBM 360 model 97 (I always get the model number wrong;
anyway, one of the early 360s) did this.

smryan@garth.UUCP (07/01/88)

> >Branches tend to be deadly to fast machines. Delay slots or no, it can still
> >give the cache/instruction stack indigestion.
>
>Gosh!  You guys aren't thinking big enough.  How about multiple
>parallel pipelines to compute all the various instruction threads
>in parallel and just keep the results of the one that is actually
>taken?

Actually, it is a 170 (aka 6600) and 205 coding technique (read: trick).
Given a conditional expression like

       if p then a else b

if a and b are side-effect and fault-free, define a mask

       mask(p) = all ones if p true
                 all zeros if p false

then the above conditional is encoded as

       mask(p) and a  or  not mask(p) and b

For example, the following 170 sequence folds Ascii 8/12 control characters
into blanks:

        MX0  0         +0.
        SX7  40B       8/12 space character code.
        IX6  X5-X7     X5=original character.
        IX6  X6+X0     get rid of -0. If control, X5<X7, X5-X7<0, so sign set.
        AX6  59        all ones if X5 is a control character, otherwise zeros.
        BX5  -X6*X5    zero if control character, otherwise original character.
        BX6  X6*X7     space character if control, otherwise zero.
        BX5  X5+X6     X5 is a control folded to space or a printing character.

If I hadn't run out of registers (this is a macro), I could have interleaved
the next sequence which folds lower to upper case, so that it would have four
parallel streams for two conditionals.

We guys do think big. Cheaply.

mash@mips.COM (John Mashey) (07/02/88)

In article <28200172@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>> >Branches tend to be deadly to fast machines. Delay slots or no, it can still
>> >give the cache/instruction stack indigestion.

>>Gosh!  You guys aren't thinking big enough.  How about multiple
>>parallel pipelines to compute all the various instruction threads
>>in parallel and just keep the results of the one that is actually
>>taken?  Takes Big Bucks?  Sure.  But when you're using gallium
>>arsenide, who cares?

>The IBM 360 model 97 (I always get the model number wrong;
>anyway, one of the early 360s) did this.

360/91, of which maybe (on the order of) 20 were built;
there was also a 360/95 that was a little faster.
It is instructive to examine this, compared with, for example, the
360/85, which was built at aboutthe same time (late 60s).
The complexity of the 360/91 occurred because the CPU was too much faster
than the memories.
The 360/85 used a cache instead, was more cost-effective.  Certainly
the later machines derive more from its ideas than from the 91.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

baum@Apple.COM (Allen J. Baum) (07/02/88)

[]
>In article <28200172@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>>>Branches tend to be deadly to fast machines. Delay slots or no, it can still
>>>give the cache/instruction stack indigestion.
>>
>>Gosh!  You guys aren't thinking big enough.  How about multiple
>>parallel pipelines to compute all the various instruction threads
>>in parallel and just keep the results of the one that is actually
>>taken?  Takes Big Bucks?  Sure.  But when you're using gallium
>>arsenide, who cares?
>
>The IBM 360 model 97 (I always get the model number wrong;
>anyway, one of the early 360s) did this.

I don't believe that any model of IBM 360 or 370 (including /91, /92, or /195)
ever executed multiple threads and threw away the unused one. The did have
a high degree of overlap, and out-of-order execution, but that is all (and
was enough). The only rumors I've heard about multiple threading machines
from IBM was a machine that was never fully built, killed by the promise of
the 360/91, which (I believe) John Cocke worked on, called the ACS. I've heard
rumors of all kinds of features that this machine was supposed to have, some
of which were pretty hairy. If only it had ben 360 compatible, it may have
seen the light of day.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

mat@amdahl.uts.amdahl.com (Mike Taylor) (07/02/88)

> 
> I don't believe that any model of IBM 360 or 370 (including /91, /92, or /195)
> ever executed multiple threads and threw away the unused one.

As I recall, the 3033 did this ( and I think maybe the 168 too, but I'm not
sure about that), in a limited sense.  The 3033 was the "tiger team" response
to Amdahl's 470 V/6. It had a three(two?)-pipe IPPF (Instruction Pre-Processing
Facility) that got three(two?) instruction streams ready for processing - i.e.
operands checked and ready to go, etc. However, I don't think any machine has
had multiple E-units, as you point out.
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

pl@tut.fi (Pertti Lehtinen) (07/04/88)

From article <834@garth.UUCP>, by smryan@garth.UUCP (Steven Ryan):
> 
> Branches tend to be deadly to fast machines. Delay slots or no, it can still
> give the cache/instruction stack indigestion.
> 
> By the way, as we continue forward into the past, the Cyber 170 macros are
> filled with handcoded sequences to avoid branches.
>
	When avoiding branches, interesting technique is adopted in
	ARM-processor (Acorn Risc Machine).  All instructions are
	conditional, which makes it possible to avoid short branches
	very easily.


-- 
pl@tut.fi			! All opinions expressed above
Pertti Lehtinen			! are preliminary and in subject
N 61 26' E 23 50'		! to change without any further notice.

urjlew@ecsvax.uncecs.edu (Rostyk Lewyckyj) (07/05/88)

In article <13157@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
> >>Gosh!  You guys aren't thinking big enough.  How about multiple
> >>parallel pipelines to compute all the various instruction threads
> >>in parallel and just keep the results of the one that is actually
> >>taken?  
> >
> >The IBM 360 model 97 (I always get the model number wrong;
> >anyway, one of the early 360s) did this.
> 
> I don't believe that any model of IBM 360 or 370 (including /91, /92, or /195)
> ever executed multiple threads and threw away the unused one. The did have

If I remember correctly, the machine would decode the addresses of
the instructions on both sides of the branch, and then decode the
addresses of their operands, and begin fetching these. As soon as    
the decision on which side of the branch would be taken was decided
the machine would abort, throw away, the fetching for that thread.
No actual instruction executions were done in parallel. i.e. if
one side of the branch was a multiply and the other an add, neither
the multiplication nor the addition would be started, just the     
operands would be sent for if possible.

-----------------------------------------------
  Reply-To:  Rostyslaw Jarema Lewyckyj
             urjlew@ecsvax.UUCP ,  urjlew@tucc.bitnet
       or    urjlew@tucc.tucc.edu    (ARPA,SURA,NSF etc. internet)
       tel.  (919)-962-9107

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (07/05/88)

In article <28200172@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:

| The IBM 360 model 97 (I always get the model number wrong;
| anyway, one of the early 360s) did this.

  Was that the "stretch" model? I can't remember the model number, but
very few were ever made. It supposedly was a reasearch item which went
into production but was never marketed, so you could buy one if you
could find out it existed (or something bizarre like that).

  I find it hard to believe that IBM used to have a machine which they
built for sale but didn't market. Now, of course they have come 180
degrees from that position.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

rw@beatnix.UUCP (Russell Williams) (07/05/88)

In article <2530@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>360/91, of which maybe (on the order of) 20 were built;
>there was also a 360/95 that was a little faster.
>It is instructive to examine this, compared with, for example, the
>360/85, which was built at aboutthe same time (late 60s).
>The complexity of the 360/91 occurred because the CPU was too much faster
>than the memories.
>The 360/85 used a cache instead, was more cost-effective.  Certainly
>the later machines derive more from its ideas than from the 91.

   The 195 was basically a 91 with a cache.  The 91 spent its time waiting
on memory unless you did mostly FP.  SLAC does, & a friend of mine there
said that for her work, the 91 was much faster than the 3033. 

Russell Williams
..uunet!elxsi!rw
..ucbvax!sun!elxsi!rw

hankd@pur-ee.UUCP (Hank Dietz) (07/06/88)

In article <4014@korppi.tut.fi>, pl@tut.fi (Pertti Lehtinen) writes:
> 	When avoiding branches, interesting technique is adopted in
> 	ARM-processor (Acorn Risc Machine).  All instructions are
> 	conditional, which makes it possible to avoid short branches
> 	very easily.

A similar technique is used in the Cydra machine...  as I understand it,
this machine performs VLIW-like scheduling of blocks of code (relatively
conventional control-flow stuff) and simply enables/disables output arcs
(i.e., stores) based on condition inputs for each block.  In other words,
you can execute both sides of a branch intermingled, then decide to ignore
the results of either.

I hav
e only a single literature sheet on this architecture, however.  Anyone
have more details?

						-hankd

root@cca.ucsf.edu (Computer Center) (07/06/88)

In article <11458@steinmetz.ge.com>, davidsen@steinmetz.ge.com (William E. Davidsen Jr) writes:
| In article <28200172@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
| 
| | The IBM 360 model 97 (I always get the model number wrong;
| | anyway, one of the early 360s) did this.
| 
|   Was that the "stretch" model? I can't remember the model number, but
| very few were ever made. It supposedly was a reasearch item which went


The Stretch was the 7030; it established technology which later appeared
in the 7090 and 7094. This was several years before the 360 series and
"Solid Logic Technology" (SLT). The 7000 series machines were discrete
transistor based.

Much of the technology was downgraded going from the 7030 to the 7090;
e.g. the Stretch disk ran the disk access comb in parallel while the
7090 etc. version (1301) did a head select and read from one track
at a time.

I don't believe there was a Model 97; the high speed scientific machine
was the Model 91; the fast general purpose machine was the Model 85.
There was also a Model 195 about which I have no details. IBM allegedly
lost money on each 91 and built them only to keep CDC out of prestige
accounts.

Thos Sumner       (thos@cca.ucsf.edu)   BITNET:  thos@ucsfcca
(The I.G.)        (...ucbvax!ucsfcgl!cca.ucsf!thos)

OS|2 -- an Operating System for puppets.

#include <disclaimer.std>

eugene@pioneer.arc.nasa.gov.arpa (Eugene N. Miya) (07/07/88)

In article <1300@ucsfcca.ucsf.edu> root@cca.ucsf.edu (Computer Center) writes:
>In article <11458@steinmetz.ge.com>, davidsen@steinmetz.ge.com (William E. Davidsen Jr) writes:
>IBM allegedly lost money on each 91 and built them only to keep CDC
>out of prestige accounts.

Do you think it is any different today?  Do you think IBM reads the
comp.arch? [I don't, well maybe a few at ibmpa.. ;-)].  Only today it's
not just CDC, it's all those little micro companies.  Mainframe people
seem really depressed of late.

--enm

paul@unisoft.UUCP (n) (07/07/88)

In article <1300@ucsfcca.ucsf.edu> root@cca.ucsf.edu (Computer Center) writes:
>
>The Stretch was the 7030; it established technology which later appeared
>in the 7090 and 7094. This was several years before the 360 series and
>"Solid Logic Technology" (SLT). The 7000 series machines were discrete
>transistor based.
>

	One of the best stories about the Stretch (I think I read about it
in CACM maybe 5 years ago) was about an early problem in the core memory:
I order to keep the temperature down the core stacks were embedded in
moving oil for cooling, an engineer was trying to find an intermittent
failure, this lasted for several weeks untill finally someone noticed that
the failure tended to move through memory addresses sequentially, finally
the problem was diagnosed - a piece of solder in the oil. This was the
first (and probably last) computer to be repaired by giving it an oil 
change .....


		Paul


-- 
Paul Campbell, UniSoft Corp. 6121 Hollis, Emeryville, Ca
	E-mail:		..!{ucbvax,hoptoad}!unisoft!paul  
Nothing here represents the opinions of UniSoft or its employees (except me)
"Nuclear war doesn't prove who's Right, just who's Left" (ABC news 10/13/87)

sxr@purdue.UUCP (07/13/88)

In article <11458@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>In article <28200172@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>| The IBM 360 model 97 (I always get the model number wrong;
>| anyway, one of the early 360s) did this.
>
>  Was that the "stretch" model? I can't remember the model number, but
>very few were ever made. It supposedly was a reasearch item which went
>into production but was never marketed, so you could buy one if you
>could find out it existed (or something bizarre like that).
>
>  I find it hard to believe that IBM used to have a machine which they
>built for sale but didn't market. Now, of course they have come 180
>degrees from that position.
>-- 
>	bill davidsen		(wedu@ge-crd.arpa)

The Stretch belonged to an earlier generation than the 360.  Stretch
was the IBM 7030 which IBM withdrew from the market in 1961.  Its
performance was spectacular for its time, but not as good as had been
promised.  IBM agreed to deliver eight systems at a reduced price at
which it would lose money on every system delivered.  See "IBM's Early
Computers" by Bashe, Johnson, Palmer, and Pugh.  MIT Press 1986.

There was no 360 model 97.  IBM did introduce a model 90 series of
very high performance 360 systems, at least in part to help it compete
with Control Data.  This was the basis of an anti-trust suit by CDC
against IBM.  IBM built and delivered a few model 91 systems, and at
least one model 95.  They offered and sold a few model 91 systems to
University Computing centers at clearance prices, and withdrew them
from the market when the model 85 was introduced, and then the top of
the line 195.