[comp.arch] quest for breakthroughs

bb@tetons.UUCP (Bob Blau) (02/11/89)

  This newsgroup is stuck in an endian rut and strung out on history.
How about an exercise in creativity?  The goal is to use computer
architecture as the basis for directing future technological advances.
 
  Imagine that you are the enlightened head of Imaginary Computer Corp's
architecture department.  Your job is to tell us (the enlightened
scientists at the research lab) exactly what sort of technological
breakthrough would help you the most.

  What are your assumptions?
End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
             MiniSuper, Mainframe, Super, ... 
Architecture:RISC, CISC, Vector, Massively parallel, VLIW, Shared memory
             multiprocessor, ...
Application: Home, Business, Engineering, Scientific, Manufacturing, ...
Timeframe:   Next year, in 5 years, in 10 years, ...

  What problems are you trying to solve?
- Performance, Cost, Complexity, Size, Reliability, ...

  Breakthrough examples:
- Practical X-Ray lithography creating 1 million gate CMOS chips
- Quantum transistors and high temperature superconductor "metallization"
  layers creating femtosecond propagation delays
- Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths
- Optical disc advances creating terabyte 3.5" optical discs
- Nanotechnology inventions creating microscopic computers
- ...

  Keep in mind system constraints like memory and IO bandwidth, power
and cooling, packaging limitations, reasonable economic assumptions, 
and balanced system design.  For instance, an engineering workstation
manufacturer may dream of femtosecond gates in 10 years, but the
development and manufacturing costs would probably make the technology
prohibitively expensive for a workstation in that timeframe.  The memory
and IO to support that kind of cycle time also would not fit with
workstation cost, size, and software.
  
  Try not to expect too many breakthroughs at once; that magically
eliminates bothersome constraints, but it also makes the scenario
implausible.  For instance, femtosecond propagation delay gates are
highly unlikely in commercial products in the next 10 years. First, 
they will require breakthroughs in the commercialization of quantum
transistors. Second, in order to derive real benefit from them,
breakthroughs in reducing on-chip and off-chip wire delays will also
be necessary.  The combination of two breakthroughs at the same time
strains credibility (but isn't impossible.)

  At what point do technology changes affect the architecture? At what
point do you get diminishing returns?

  Chip density: 50K gates -> 100K -> 500K -> 1M -> 5M -> ?
  Chip pinout: 250 pins -> 400 -> 800 -> 1000 -> 2000 -> ?
  Propagation delays: 500ps -> 100ps -> 20ps -> 5ps -> 500fs -> ?
  Chip power dissipation: 1 Watt/chip -> 10 -> 20 -> 50 -> 100 -> ?
  Cable bandwidth: 100 Mbits/sec -> 500M -> 1G -> 10G -> 100G -> ?
  ...

  The intent of this exercise is to discuss what kind of technology advances
really benefit a particular computer architecture. You may want to attack
the problem from the other side, starting with a breakthrough and
determining which architecture would be most suitable.


Apply: Standard disclaimers
-- 
  Bob Blau       Amdahl Corporation    143 N. 2 E., Rexburg, Idaho 83440
  UUCP:{ames,decwrl,sun,uunet}!amdahl!tetons!bb           (208) 356-8915
  INTERNET: bb@tetons.idaho.amdahl.com

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/14/89)

In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:
>  Imagine that you are the enlightened head of Imaginary Computer Corp's

I need a cheaper/smaller multiport memory interface technology.

>  What are your assumptions?

>End Product: 

Workstation, Mini

>Architecture:RISC, CISC, Vector
>             multiprocessor, ...

All of the above: a fixed-instruction-format, 64-bit, multiprocessor
supermicro for workstations.

>Application: 

Engineering, Scientific

>Timeframe:   

Next year

>  What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...

Yes.

>- Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths
>  Chip pinout: 250 pins -> 400 -> 800 -> 1000 -> 2000 -> ?
>  Cable bandwidth: 100 Mbits/sec -> 500M -> 1G -> 10G -> 100G -> ?

I want to build a supermicro that is architecturally similar to a big
vector/parallel machine - so, in order to do that, I need a technology
that will lower the complexity/cost/size of a multi-port memory interface
to around $1K.  For example, suppose you had a bus-like box with 8 processor
ports and 8 Memory Bank ports.  You could plug 4 CPU's in, with 2 IO
Controllers, a Video Controller, and a spare, and each processor would
be able to get full memory bandwidth, barring bank conflicts.
Clock cycle time: should support clock cycle times of 20 ns.
Bus width: 64 data bits + 14 bits ECC minimum.  Up to 512 data bits wide could
potentially be useful on a more expensive model (I am not sure how you would
do the bus connections, but the bandwidth is useful.)
Total memory bandwidth: 3-25GBytes/sec total, depending on width (64-512 bits)
(I know this sounds expensive  - that is why a breakthrough is needed.)
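(Arithmetic check, assuming all 8 processor ports busy every 20 ns
cycle: 8 ports x 8 bytes / 20 ns = 3.2 GBytes/sec at 64 data bits;
8 ports x 64 bytes / 20 ns = 25.6 GBytes/sec at 512 - hence 3-25.)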
In other words, something like a Cray Y-MP multiport memory
interface, slowed down a little bit and packaged very inexpensively and
compactly, is what is desired.






-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

gillies@p.cs.uiuc.edu (02/14/89)

O.k., here is an idea, perhaps based on a fallacy -----

"Currently, the quality of the code from several commercial compilers
and hand-optimization for the same machine (e.g. 68000) might vary
tremendously.  The best (compiled or hand-optimized) code might easily
be X (X >= 10?)  times faster than the worst.  What is this constant
X, for most commercial compilers?  Why do compilers vary so much?
Clearly, reducing X bootstraps the performance of *all*
compilers/architectures.  Here's a plan:

1.  Discover X  (a toy measurement sketch appears after this plan)

2.  Discover the three contributors to X with the most variance.  For
    example, here are 6 possible contributors to X (PLEASE NO FLAMES 
    THESE ARE IMAGINARY REASONS)

     (1) Most compilers do poor register scheduling.
     (2) The architecture is so
	  (a) non-orthogonal?  => compiler optimization is too hard
	  (b) orthogonal?      => simple implementations are microcoded & slow
	  (c) powerful?	       => combinatorial search is necessary to
				  discover optimal code sequences
     (3) Most compiler writers don't
	  (a) have enough time
	  (b) have enough knowledge / experience
  	....
    Of these, reason (1) might have a variance of 1.5, reason 2(a) might
    result in a variance of 2.0, 3(b) might be 4.0, etc.  (If those
    factors were independent and multiplicative, they alone would give
    X = 1.5 * 2.0 * 4.0 = 12.)

3.  Propose a technical solution for the top-3 contributors to the
variance of X.
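
A toy sketch of step 1, under the (big) assumption that one kernel and
one timer generalize; kernel_a/kernel_b are invented stand-ins for the
worst and best code for the same task:

	/* Toy harness for "Discover X": time two builds of the same
	 * kernel and report the ratio.  kernel_a()/kernel_b() are
	 * invented scaffolding; a real study needs many programs. */
	#include <stdio.h>
	#include <time.h>

	extern void kernel_a(void);   /* e.g. worst compiler's code  */
	extern void kernel_b(void);   /* e.g. hand-optimized version */

	static double seconds(void (*f)(void), int reps)
	{
	    clock_t t0 = clock();
	    while (reps--)
	        f();
	    return (double)(clock() - t0) / CLOCKS_PER_SEC;
	}

	int main(void)
	{
	    double worst = seconds(kernel_a, 1000);
	    double best  = seconds(kernel_b, 1000);
	    printf("X = %.2f\n", worst / best);
	    return 0;
	}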

------------------------------------------------------------

Also, I've always wondered if we could design a simple architecture +
language, from a theoretical standpoint, for which provably optimal
code could be generated.

At the moment this problem is ill-defined and requires a lot of formulation.

You can perhaps start with some ultra-simple architecture (Turing
Machine), and simple language (Turing Machine Language), and try
abstracting the two farther apart to see how far you can get.

This problem is not for the faint-hearted.



Don Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801      
ARPA: gillies@cs.uiuc.edu   UUCP: {uunet,harvard}!uiucdcs!gillies

wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) (02/15/89)

Well, I'll jump.

I'm interested in computer animation.  The biggest problem
I face is memory bandwidth.  Compared to that, everything
else is a minor second-order effect.  So here is what I'd
like to see:

   1. RISC is nice since it has lots of registers, but bad
      because too many instructions have to be executed
      to do anything.

   2. CISC is nice in that one little instruction can do a
      lot of things, but you only get a handful of registers.
      So you're always hitting memory to get your data.

   3. Memory caches are nice, but there is very little control  
      over what stays in the cache. 

So what I'd like to see is a twofold solution:

    1. RCISC - Register-enhanced CISC.
       Take something like a 680xx and paste in a couple kbytes
       of registers that can also hold code.  Use a few of the
       unimplemented opcodes to give access to the new memory
       bank.  This would let one really reduce memory accesses
       during intense interactive graphics by putting some of
       the routines into the RAM bank; common data would go in
       there too.  Also, since this would be on-chip RAM,
       programs would execute much quicker there and data
       access would be fast.  Along with this would have to be
       a tagging system letting tasks grab chunks of the RAM
       bank, so that old-fashioned multitasking programs could
       still run.  (See the sketch after this list.)

    2. OS cops to stop all OS writers from using the RAM bank.
       Power to the programmers, keep the OS to just the 
       standard registers.  This would keep overhead low, and
       even allow cheap context switches.  Let's not waste
       great hardware on PIG-style OSs.  After all, the only good
       OS is one that doesn't get in your way.
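
To make item 1 concrete, here is a hypothetical sketch in C.  The
rambank_* calls are invented names for whatever the new opcodes (or
a compiler intrinsic) would provide:

	/* Hypothetical use of the on-chip RAM bank for an animation
	 * inner loop.  rambank_alloc()/rambank_free() are invented
	 * names standing in for the new "tagging" opcodes. */
	#include <string.h>

	extern void *rambank_alloc(unsigned nbytes);  /* invented */
	extern void  rambank_free(void *p);           /* invented */

	extern void blit_span(short *dst, short *src, int n);

	void animate(short *frame, short *sprites, int nspans)
	{
	    /* Stage the hot data in the fast bank once, so the inner
	     * loop never touches main memory for it. */
	    short *fast = rambank_alloc(2048);
	    int i;

	    memcpy(fast, sprites, 2048);
	    for (i = 0; i < nspans; i++)
	        blit_span(frame + 64 * i, fast, 64);
	    rambank_free(fast);
	}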


                                    Wayne Knapp 

daveb@geaclib.UUCP (David Collier-Brown) (02/15/89)

In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:
|  Imagine that you are the enlightened head of Imaginary Computer Corp's
	I'd like a very inexpensive dcache kit, much like the
	bit-slice processor kits of yore. 
	I don't mind if they're a bit limited in how much underlying
	memory they can handle, or if they're made affordable by simplifying
 	a little too much, because I want to use a **rather** large
	number of them.

|  What are your assumptions?
|End Product: 
	A large-memory mini, for recalcitrant/large problems that 
 	companies without gigabucks still have to solve.  It should
	fit in a small rack or large under-desk enclosure.

|Architecture:RISC, CISC, Vector
	RISC and moderately-CISC processors, like 68xxx's, MIPses and
	such-like. Standard state-of-practice stuff.

 
|Application: 
 	Scientific, logic-processing, symbolic computing.
 
|  What problems are you trying to solve?
|- Performance, Cost, Complexity, Size, Reliability, ...
	Size and complexity, primarily. 

	I want a machine with performance somewhat better than my
Stunned, despite my stuffing it to bursting with slow main-memory
chips.  I want an architecture which will deal with large working
sets and ill-behaved (not very local) reference patterns without
having to sweep the slowdown under the rug by running large numbers
of users/processes in a timesharing system.
	I'd like to be able to do this at finite cost in the near
future by biting the bullet and putting lots of cache in front of
a rather huge but not-too-fast main memory.  I
really do want lots of memory, and am betting that the difference in
memory speed/cost versus processor speed/cost makes investing in
cache worthwhile.  

	The assumption I'm making, you see, is that it's worthwhile
to allocate particular cache chips to particular blocks of memory
chips, probably along paging boundaries.  This in turn means that
I'm betting that cache invalidation/reloading on process change is a
major bottleneck when one has large amounts of cache.  This in turn
means that I'm still betting on simple multiprocessing for this kind
of application domain.  
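
	As a sketch of that static assignment (all numbers invented),
the page-to-cache-kit mapping could be as dumb as:

	/* Each physical page is served by one fixed cache kit, so a
	 * process switch needn't invalidate anything.  PAGE_SHIFT and
	 * NKITS are invented example values. */
	#define PAGE_SHIFT 12   /* 4K pages */
	#define NKITS      16   /* number of cache kits */

	int cache_kit_for(unsigned long physaddr)
	{
	    return (int)((physaddr >> PAGE_SHIFT) % NKITS);
	}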

	Of course, if one put several processors in front of this
memory hierarchy, but made sure that they didn't do strange things
like use two different page/cache sets for the same data, then it
might extend to multiprocessors...

--dave (no, we're not planning an AI document processor) c-b
[ the assumptions are basically a guess, based on the truism about
micro/mini systems repeating the history of mainframes, complete
with sillinesses and obvious bottlenecks]
-- 
 David Collier-Brown.  | yunexus!lethe!dave
 Interleaf Canada Inc. |
 1550 Enterprise Rd.   | He's so smart he's dumb.
 Mississauga, Ontario  |       --Joyce C-B

kleonard@PRC.Unisys.COM (Ken Leonard) (02/15/89)

* ... 
* architecture department. Your job is to tell us (the enlightened 
* scientists at the research lab) exactly what sort of technological 
* breakthrough would help you the most. 
* ... 
* Breakthrough examples: 
* ... 
* - Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths 
* ... 
* 
Is it OK that we're already building one? 1Gb/s per host pair in 
local or campus net, all host pairs concurrent at that rate. And 1Gb/s 
per fiber on a trunk as long as you care to pay for. (Note, that is 
1,000,000,000 bits of the user's data we're talking about.) 
 
Regardz, 
Ken Leonard 
Duke of West Nantmeal; Hereditary Captain-in-Chief and Master of Gunnery, 
His Lordship's Loyal Company of Freebooters Cannoners and Military Enginers. 
--- 
A Navy Colt beats a full house, every time. 

colwell@mfci.UUCP (Robert Colwell) (02/16/89)

In article <76700063@p.cs.uiuc.edu> gillies@p.cs.uiuc.edu writes:
>
>O.k., here is an idea, perhaps based on a fallacy -----
>
>"Currently, the quality of the code from several commercial compilers
>and hand-optimization for the same machine (e.g. 68000) might vary
>tremendously.  The best (compiled or hand-optimized) code might easily
>be X (X >= 10?)  times faster than the worst.  What is this constant
>X, for most commercial compilers?  Why do compilers vary so much?
>Clearly, reducing X bootstraps the performance of *all*
>compilers/architectures.  Here's a plan:

I propose that part of the reason is that there is a lot of money
chasing compiler solutions for part of the "compilation space", and 
not much chasing it in other parts.  So a skew develops between how
good the compiler is in some parts of that space and how good it is
in other parts.  Handcoders are reasonably adept in all parts of the
space.  An example is how well compilers can vectorize code these days
vs. how well they do on garbage/serial code.

>1.  Discover X
>
>2.  Discover the three contributors to X with the most variance.
>    ...
>3.  Propose a technical solution for the top-3 contributors to the
>    variance of X.

To even have a prayer of getting anything reasonably interesting from
this line of inquiry, I suggest you need to focus on one architecture,
one language, and maybe even just a few benchmarks.  If you did that
the problem would be more tractable, and you'd have the opposite 
problem of extrapolating your results back out to other machines
and other languages at the end.  But at least you'd have something
concrete to say.

>Also, I've always wondered if we could design a simple architecture +
>language, from a theoretical standpoint, for which provably optimal
>code could be generated.
>
>You can perhaps start with some ultra-simple architecture (Turing
>Machine), and simple language (Turing Machine Language), and try
>abstracting the two farther apart to see how far you can get.

Nah, at that low a level, you could do what Henry Massalin did with
the "Superoptimizer" he built at Columbia (ASPLOS-II proceedings).
(His program tried every instruction sequence combination up to 
sequences of several instructions to find the ones that computed a
given function the quickest, including use of side-effects and
bizarre uses for carry bits and the like.  Fascinating experiment.)

But you could micro-optimize the entire program and still end up with slower
code than a handcoder or a good compiler for various reasons.  For instance,
the handcoder might have recognized some trig identity inherent to the
algorithm that wouldn't make economic sense to wire into the compiler's
knowledge base (since it doesn't come along often enough, and there's always
bigger fish to fry).  On the other hand, the compiler might recognize a code
transformation that entirely removes the section of code you were planning
to micro-optimize.  I think the problem is a lot bigger than your message
implies.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090

suitti@haddock.ima.isc.com (Stephen Uitti) (02/16/89)

In article <3780@tekig5.PEN.TEK.COM> wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes:
=>I'm interested in computer animation.  The biggest problem is
=>memory bandwidth.
=>So what I'd like to see is a two fold solution:
=>    1. RCISC - Register enhanced CISC.
=>       Take something like 680xx, paste in a couple kbytes of
=>       registers that can also hold code.  Use a few of the
=>       unimplemented opcodes to give access to the new memory bank.
=>                                    Wayne Knapp 

Don't invent a new memory store type.  Just put a few pages of
physical RAM on the chip.  This was done for some 8 bit controller
type chips way back when (for different reasons).

The OS can be told to provide access to these special pages to the
programmer, perhaps using existing calls (depending on the OS) to map
them.  The OS may also provide other services which may be handy, such
as real time stuff.

This has the advantages of:
1) no new instructions.  True compatibility.
2) current instruction power is available.
3) no new assembler/compiler tools required.
4) comparatively simple (localized) OS changes.
5) can be expanded dynamically with chip technology.

Intel's separate I/O instructions (for example on the 8080) show how
painful a second address space can be.
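
A minimal sketch of what those "existing calls" might look like on a
Unix-ish system; the /dev/fastram device name and its semantics are
invented for illustration:

	/* Map the on-chip RAM pages into the process's address space
	 * through an ordinary device driver - no new instructions.
	 * "/dev/fastram" is an invented name. */
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	char *map_fast_pages(unsigned len)
	{
	    char *p;
	    int fd = open("/dev/fastram", O_RDWR);

	    if (fd < 0)
	        return 0;
	    p = mmap(0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	    close(fd);          /* the mapping survives the close */
	    return p == MAP_FAILED ? 0 : p;
	}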

Of course, it may be that a peripheral (glorified DMA engine(s))
could do everything you want.  That would only require a "simple
driver"... which you - the programmer could write.

	Stephen.

w-colinp@microsoft.UUCP (Colin Plumb) (02/17/89)

suitti@haddock.ima.isc.com (Stephen Uitti) wrote:
> Don't invent a new memory store type.  Just put a few pages of
> physical RAM on the chip.  This was done for some 8 bit controller
> type chips way back when (for different reasons).

Also, on the Inmos Transputer.  4K of on-chip memory in the T800,
16K on the T810.  On the T810, it's 64 bits wide, too.

I'd still rather have cache.  It's a pain to allocate that space
unless you take over the whole processor with your application
and do it at compile-time.
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do."

gillies@p.cs.uiuc.edu (02/18/89)

/* Written 10:01 am  Feb 15, 1989 by colwell@mfci.UUCP in p.cs.uiuc.edu:comp.arch */
> Nah, at that low a level, you could do what Henry Massalin did with
> the "Superoptimizer" he built at Columbia (ASPLOS-II proceedings).
> (His program tried every instruction sequence combination up to 
> sequences of several instructions to find the ones that computed a
> given function the quickest, including use of side-effects and
> bizarre uses for carry bits and the like.  Fascinating experiment.)

That's precisely the article I was thinking about when I originally
thought of the problem -- I should have mentioned this.  Are there
architectures where combinatorial search can be used for practical
compilation?  Are these architectures useful?

> But you could micro-optimize the entire program and still end up with slower
> code than a handcoder or a good compiler for various reasons.....because
> of various reasons.....

I think you need an appropriate definition of "compilation".  You
can't expect the compiler to go off and implement new algorithms.  You
cannot assume there is a library of compiled subroutines.

You might be able to define an optimal compilation as one where every
state of the "important" state variables in the source (programming)
language is also represented somewhere, at some time, in the target
(assembly) language program during its execution.

Under this definition, code transformations are not allowed to remove
redundant code, but they are allowed to combine redundant
computations...

You also must enforce a space-time tradeoff, so the compiler doesn't
just precompute everything and use a lookup table.  The ASPLOS paper
required the programs to be as short as possible.

aglew@mcdurb.Urbana.Gould.COM (02/18/89)

>/* ---------- "quest for breakthroughs (long)" ---------- */
>
>  This news group is stuck in an endian rut and strung out on history.
>How about an exercise in creativity?  The goal is to use computer
>architecture as the basis for directing future technological advances.
> 
>  Imagine that you are the enlightened head of Imaginary Computer Corp's
>architecture department.  Your job is to tell us (the enlightened
>scientists at the research lab) exactly what sort of technological
>breakthrough would help you the most.

OK, I'll bite. Actually, I'll probably bite several times, with
different sets of starting assumptions/markets, as time permits
in the next few weeks.

First, I'll think about my dream of building a truly useful
personal computer system...


>  What are your assumptions?
>End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
>             MiniSuper, Mainframe, Super, ... 

PC: 
  Subclasses - desktop (where most PCs are now), briefcase,
	wristwatch, house.

>Architecture:RISC, CISC, Vector, Massively parallel, VLIW, Shared memory
>             multiprocessor, ...

Whatever it takes. Probably RISC/CISC single processor.
Actually, I don't think that processor technology is all that important
for this market, although performance increases in a single
processor may help in getting cost down by eliminating extra
components.

>Application: Home, Business, Engineering, Scientific, Manufacturing, ...

Home, Engineering, Scientific (basically, the personal computer I want)

>Timeframe:   Next year, in 5 years, in 10 years, ...

5-10 years.


>  What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...

I'm sure I'm going to get told "You want a personal scientific
computer? What about NeXT?" My response is that the NeXT is much too
expensive. I want a really cheap computer, under $2,000, preferably
small enough to wear on my wrist, with the ability to interface to the
standard I/O modules.
    A 33MHz 68030 based system with 32 MB of memory is the first
small computer system that I've found acceptable to work on,
so I don't think processor performance improvements are much
needed for my "realizable" dream machine
(although it still falls well short of a Gould NP1 with 256 MB;
that's chiefly I/O).

So, here's what I want to solve:

  Performance comparable to today's top of the line workstations
    with considerably reduced component count.
    With a view to this, performance sufficient in a single 
    processor to eliminate the need for separate graphics coprocessors,
    etc. (or, multiple processors per package - chip or hybrid).

  Cost: get it *way* down!

  Complexity: reduce component count to reduce cost

  Size: 
     Second priority - make it small enough to fit in a briefcase, on a wrist...

  Reliability:
     Third priority - reliability comparable to one of today's PCs
     would be acceptable (but the cost of repair should also stay the same,
     which it probably won't - it'll increasingly be "Buy the whole
     thing")

>  Breakthrough examples:
>- Practical X-Ray lithography creating 1 million gate CMOS chips

  Anything that increases densities is good.

>- Quantum transistors and high temperature superconductor "metallization"
  Probably not applicable to this domain.
  I'm not going to carry liquid nitrogen around.

>  layers creating femtosecond propagation delays
  Probably not too applicable to the performance domain I'm
  talking about in this note (will be to others)

>- Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths
  High priority.

  What I want is a wristwatch computer that I can plug into a base
  unit to upload/download stuff from disk, control display devices
  with, etc. (Or, I want a base unit PC that can download stuff
  to a wristwatch unit - but I would really prefer that the 
  wristwatch, or Walkman, unit have the intelligence, with enough
  non-volatile memory to be "THE" system, with everything bigger
  a peripheral).

  Yes, I'm a portable computer freak. I bought one of the first
  luggables, and lugged it thousands of miles. At the moment I 
  don't have a portable; I do have one of those multifunction
  memory watches. But, these watches are sadly limited, with far
  too little memory and I/O capability for what I need.
     I would have bought one of those SEIKO memory systems that
  had a PC interface, except that I didn't have a PC compatible.
  I think that they missed the market with that device - instead 
  of having a simplified ASCII download capability, perhaps with 
  a stripped down RS232 i/f (I think that you could get that
  down to wristwatch profile) they went for PC users.  PC users
  don't want wristwatch computing.  The people who want wristwatch
  computing are people like me who use bigger machines all day
  long and want the ability to manipulate their wristwatch schedules,
  appointment books, etc., with considerably more convenience than
  a wristwatch keyboard allows (you know how much pain it is to type
  things in on such a keyboard?  Give me download!)
     I want a personal computer that can replace the notebook
  I carry around all day, that can interface to other devices and
  computers. TABLET is close, but I don't want to have to carry that
  around.

>- Optical disc advances creating terabyte 3.5" optical discs
  
  Yes! Appears possible. Myself, I would just wait for that to
  come out for the desktop crowd; I can't see auxiliary storage
  being wristwatch sized in the near future. Walkman sized maybe.
  In my own company I would concentrate on the wristwatch processor
  memory and I/O element that nobody seems to be working on.

>- Nanotechnology inventions creating microscopic computers

  Sure. But I don't see this being rewarding in the timespan I'm
  talking about.

>  Keep in mind system constraints like memory and IO bandwidth, power
>and cooling, packaging limitations, reasonable economic assumptions, 
>and balanced system design.  For instance an engineering workstation
>manufacturer may dream of femtosecond gates in 10 years, but the
>development and manufacturing costs would probably make the technology
>prohibitively expensive for a workstation in that timeframe.  The memory
>and IO to support that kind of cycle time also would not fit with
>workstation cost, size, and software.

  What I want is basically a present day processor, 8-32MB of
  memory, and an "optical SCSI"  in a 1"x0.25"x0.10" package.
  Everything else in this "realizable dream" I see on the
  drawing boards. 
     What's needed is increased density; technology that permits
  logic, memory, and optics to be on the same chip; a good way
  of coupling optics to a chip (that doesn't require a relatively
  large shroud); packaging to protect this beast; and manufacturing
  technology to make this beast buildable (I can't see it being
  at all practical without robotics or *very* cheap labour);
  and, of course, reductions in power consumption so that I could
  wear such a device, or attach it to my belt.
 
  The worst part about this dream - although I think it's within
  reach, I don't think that it would be a big profit maker.
  After all, the idea would be to make something *cheap*.
  It'll probably arrive 5-10 years after I would like it,
  via excess competition in the PC market forcing people into
  new niches, or (more likely) via intelligent bank cards.

>  Try not to expect too many breakthroughs at once, magically
>eliminating bothersome constraints, but also making the scenario
>implausible.  For instance, femtosecond propagation delay gates are
>highly unlikely in commercial products in the next 10 years. First, 
>they will require breakthroughs in the commercialization of quantum
>transistors. Second, in order to derive real benefit from them,
>breakthroughs in reducing on-chip and off-chip wire delays will also
>be necessary.  The combination of two breakthroughs at the same time
>strains credibility (but isn't impossible.)

I'm trying to be mid-range practical. Please tell me where I fall down
(economics is the biggest constraint, I suspect).


>  At what point do technology changes affect the architecture? At what
>point do you get diminishing returns?

>
>  Chip density: 50K gates -> 100K -> 500K -> 1M -> 5M -> ?
     Say 24K gates for the processor, + 8M (minimum) of RAM.
     O(16M) gates necessary if all on same chip;
     anything less integrated would require a huge pinout.

>  Chip pinout: 250 pins -> 400 -> 800 -> 1000 -> 2000 -> ?
     If memory and processor could live on the same chip - space,
     heat, and process-wise - then low pinouts would suffice.
     Direct coupling to 1 or 2 optic fibers on the same chip is desirable.

>  Propagation delays: 500ps -> 100ps -> 20ps -> 5ps -> 500fs -> ?
     Not applicable.

>  Chip power dissipation: 1 Watt/chip -> 10 -> 20 -> 50 -> 100 -> ?
     I don't want to wear anything that hot on my wrist!
     How about fewer watts per device!
     (which is probably where I start needing new technology
     - I suppose that there are physical limits here)

>  Cable bandwidth: 100 Mbits/sec -> 500M -> 1G -> 10G -> 100G -> ?
     The basic thing I need for this wrist configuration is the
     ability to do bitmapped video, say for a colour megapixel
     at 60 Hz: 24 bits x 1M pixels x 60 Hz is roughly 1.5G, call it
     O(2G) bits per second.  Plus the rest of I/O traffic, which is
     relatively small.
        Biggest need is miniaturization of terminators.
>
>  The intent of this exercise is to discuss what kind of technology advances
>really benefit a particular computer architecture. You may want to attack
>the problem from the other side, starting with a breakthrough and
>determining which architecture would be most suitable.
>
>
>Apply: Standard disclaimers
>-- 
>  Bob Blau       Amdahl Corporation    143 N. 2 E., Rexburg, Idaho 83440
>  UUCP:{ames,decwrl,sun,uunet}!amdahl!tetons!bb           (208) 356-8915
>  INTERNET: bb@tetons.idaho.amdahl.com


This was a good idea. Writing this has been fun, and clarified a few
ideas in my head.

If I get the time I may try to write similar things for more realistic
/ commercially viable systems. 

david@indetech.UUCP (David Kuder) (02/19/89)

In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:
>  What are your assumptions?
>End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
>             MiniSuper, Mainframe, Super, ... 
	Backup: fast and plentiful!
>Architecture:RISC, CISC, Vector, Massively parallel, VLIW, Shared memory
	Anything
>Application: Home, Business, Engineering, Scientific, Manufacturing, ...
	All of the above!
>Timeframe:   Next year, in 5 years, in 10 years, ...
	Yesterday; today I can have a ton of disk under my desk and no
	way to back it up easily.
>  What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...
	Time and space of backups.

Every other suggestion under this topic has involved more memory and
processing it faster.  Any concern for the permanence of those
bigger, faster results should lead straight to backup.

I can currently stick one GB of disk on my workstation with little trouble,
but the small form factor backup devices that I could then use (removable
diskette, 1/4" cartridge, 8mm cartridge) are either incapable of holding
that much data or transferring it in a small portion of a workday.  Various
optical disks aren't any better.  A tape drive that could transfer the
data at a decent rate would be a little large to put in my office and still
couldn't do the job without tape hangs.

Now consider real world (tm) applications, where a database holds tens
to thousands of GB of data: backup is THE problem.  How many 6250 tape
drives and tapes would it take to back up 25GB in an 8 hour shift?  What
happens when you have to have your database online 24 hours a day?  My bank
has enough problems with its database and ATMs without having to worry
about backup.  I understand from one go 'round with them that ATM backup
is paper printout in the machine -- this isn't just audit, it's backup!
If the computer goes down the transactions are recovered by entering the
paper printout by hand.
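
(To put rough numbers on the 6250 question: assuming about 170 MB per
2400-foot reel and perhaps 500 KB/sec sustained, 25GB is some 150 reels,
and one drive moves only about 14 GB in an 8 hour shift.  So: at least
two drives running flat out, with an operator mounting a fresh reel
every few minutes.)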

Solve the backup problem and you'll be the next (NeXT) Jobs.  Build another
fast CPU and you'll be just one of the guys in Silicon Valley.
-- 
____*_  David A. Kuder              {sun,sharkey,pacbell}!indetech!david
\  / /  Independence Technologies
 \/ /   42705 Lawrence Place        FAX: 415 438-2034
  \/    Fremont, CA 94538           Voice: 415 438-2003

csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (02/21/89)

In article <76700068@p.cs.uiuc.edu> gillies@p.cs.uiuc.edu writes:
>
>/* Written 10:01 am  Feb 15, 1989 by colwell@mfci.UUCP in p.cs.uiuc.edu:comp.arch */
>> Nah, at that low a level, you could do what Henry Massalin did with
>> the "Superoptimizer" he built at Columbia (ASPLOS-II proceedings).
>> (His program tried every instruction sequence combination up to 
>> sequences of several instructions to find the ones that computed a
>> given function the quickest, including use of side-effects and
>> bizarre uses for carry bits and the like.  Fascinating experiment.)
>
>That's precisely the article I was thinking about when I originally
>thought of the problem -- I should have mentioned this.  Are there
>architectures where combinatorial search can be used for practical
>compilation?  Are these architectures useful?

Somewhat related:  I recently had the pleasure of porting about
300 instructions of 68020 assembler to the MIPS processor.  I took the
68020 algorithm and translated it into C.  I then had the MIPS
compiler compile the C code.  With the exception of 2 instructions,
the output of the MIPS compiler was exactly what I would have
written if I had been writing in assembler.

My conclusion is that for an elegant architecture, such as the R2000,
combinatorial searches won't buy you much.  On less elegant
architectures, such as the 680x0 and 80x86, there are a sufficient
number of special cases that a combinatorial search can find cute
code sequences.  (Alternatively, the R2000 instruction set has very few
instructions that aren't easily accessible from C.  The 680x0 and 80x86
have numerous instructions that aren't accessible from C.  For example,
C compilers normally won't generate rotate instructions.  The 'bfffo'
instruction on the 68020 also tends to be inaccessible to compilers.)
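
For instance, a 32-bit rotate has no C operator; the usual portable
spelling is the shift pair below, and whether it collapses into a
single rotate instruction is entirely up to the compiler:

	/* Rotate x left by n bits, for 0 < n < 32.  The masks keep the
	 * result 32 bits wide even where unsigned long is wider. */
	unsigned long rotl32(unsigned long x, int n)
	{
	    x &= 0xffffffffUL;
	    return ((x << n) | (x >> (32 - n))) & 0xffffffffUL;
	}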

Contrariwise, if you do have a problem on which you want to perform
a combinatorial search, since the R2000 has far fewer special cases
than, say, the 68020, the combinatorial search won't explode quite
so quickly.  It will also be relatively easy to decide when one sequence
of instructions is better than another.

(One of my favorite puzzles:  You have two 32-bit registers
containing some pattern of bits.  The most significant bit of a register
is numbered "0".  You want to move bits 0, 2, 4, and 16 from the
first register, and bit 0 of the second register into some subset
of the low-order eight bits of some register.  The high-order 24 bits
of the destination register must end up as zero.  The three bits in
the low-order byte of the destination register which are not copies
of the five bits of interest may have any value at all.)

(The places where the MIPS compiler didn't produce optimal
code were generated by source of the form:

	register unsigned long a, b, c;
	c &= ~(a | b);

Clearly, we would like for this to generate the code:

	nor $temp, $a, $b
	and $c, $c, $temp

The actual code generated was:

	or $temp, $a, $b
	nor $temp, $temp, $0
	and $c, $c, $temp

Since the 'nor' instruction doesn't directly map into the C language,
I didn't really expect the compiler to handle this minor special case.)

-- Chuck

kds@blabla.intel.com (Ken Shoemaker) (02/21/89)

Actually, what I would like is a three-dimensional read/write optical storage 
device.  Imagine gigabytes of storage in a 1" lucite cube.  Doubles as a
paperweight.
------------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds

henry@utzoo.uucp (Henry Spencer) (02/23/89)

In article <671@oracle.oracle.com> csimmons@oracle.UUCP (Charles Simmons) writes:
>My conclusion is that for an elegant architecture, such as the R2000,
>combinatorial searches won't buy you much.  On less elegant
>architectures, such as the 680x0 and 80x86, there are a sufficient
>number of special cases that a combinatorial search can find cute
>code sequences...

I haven't read the Massalin paper yet -- possibly it addresses this --
but I am compelled to wonder whether those cute code sequences are really
faster than straightforward ones, given the attention that the chip
designers usually pay to optimizing the most common (i.e. simple) cases.
There have been surprises in this area in the past.
-- 
The Earth is our mother;       |     Henry Spencer at U of Toronto Zoology
our nine months are up.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

rpw3@amdcad.AMD.COM (Rob Warnock) (02/23/89)

In article <671@oracle.oracle.com> csimmons@oracle.UUCP writes:
+---------------
| (One of my favorite puzzles:  You have two 32-bit registers
| containing some pattern of bits.  The most significant bit of a register
| is numbered "0".  You want to move bits 0, 2, 4, and 16 from the
| first register, and bit 0 of the second register into some subset
| of the low-order eight bits of some register.  The high-order 24 bits
| of the destination register must end up as zero.  The three bits in
| the low-order byte of the destination register which are not copies
| of the five bits of interest may have any value at all.)
+---------------

O.k., I'll bite. (Byte? Nybble? Chomp at the bit? ;-} ;-} )

On the Am29000, you can do this in five instructions of straight-line code,
involving a non-obvious use of the 29k "extract" instruction (which extracts
a 32-bit field from a 64-bit source: any two of the 32-bit regs). Let "x"
be the first source reg, "y" be the second, "t1" & "t2" temp regs, and "z"
the result.  (Depending on where the result goes and whether either/both of
the source regs may be destroyed, one or both of the temps may be unneeded.)
The code is:

	srl	t1,x,27		; t1<27:31> = x<0:4>, t1<0:26> = 0
	sll	t2,x,16		; t2<0> = x<16>   (t2<1:15> = "don't care")
	mtsrim	FC,1		; condition Funnel-shifter Count
	extract	t1,t1,t2	; t1<0:31> = t1<1:31> cat t2<0>
	extract	z,t1,y		; z<0:31> = t1<1:31> cat y<0>

The result (BigEndian bit numbers) satisfies the given conditions
(bits 24, 26, & 28 are the "don't care" bits):

  0:23      24      25      26      27      28      29      30      31
+- - - -+-------+-------+-------+-------+-------+-------+-------+-------+
| ...0  |    0  |  x<0> |  x<1> | x<2>  |  x<3> |  x<4> | x<16> |  y<0> |
+- - - -+-------+-------+-------+-------+-------+-------+-------+-------+

"Extract" is also handy in 29k code when shifting multi-word quantities.
A highly optimized version of "memcpy()" does non-aligned byte copies
with inner loops of "load_multiple, extract, extract..., store_multiple".

(Without using "extract", the best I could do was seven instructions,
which happened to give the same result pattern.)
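
For comparison, here is the same result pattern in plain C (my sketch,
assuming 32-bit unsigned longs and the puzzle's BigEndian bit numbers);
the three shift-mask-or terms also come to about seven instructions on
a typical RISC:

	/* Gather x<0:4>, x<16>, and y<0> into the low byte, with the
	 * high 24 bits zero - the same layout as the table above. */
	unsigned long gather(unsigned long x, unsigned long y)
	{
	    return ((x >> 25) & 0x7c)   /* x<0:4> -> bits 25:29 */
	         | ((x >> 14) & 0x02)   /* x<16>  -> bit  30    */
	         | ((y >> 31) & 0x01);  /* y<0>   -> bit  31    */
	}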


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

w-colinp@microsoft.UUCP (Colin Plumb) (02/24/89)

csimmons@oracle.UUCP (Charles Simmons) wrote:
> [The desired code was]:
> 
> 	nor $temp, $a, $b
> 	and $c, $c, $temp
> 
> The actual code generated [by the MIPS compiler] was:
> 
> 	or $temp, $a, $b
> 	nor $temp, $temp, $0
> 	and $c, $c, $temp
> 
> Since the 'nor' instruction doesn't directly map into the C language,
> I didn't really expect the compiler to handle this minor special case.)

Really?  I do: the sequence "or $c, $a, $b; not $c, $c" is
exactly equivalent to "nor $c, $a, $b" and easily recognised by a
peephole optimiser.  If the code generator frequently uses extra
temporary registers (say the first sequence ends in "not $d, $c"),
you have to complicate this by checking to see if $c is dead after
this point, but I expect MIPS have already fixed this.
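
The rule itself is tiny.  Here's a toy version over a three-address
intermediate form (the insn struct and the liveness flag are invented
scaffolding, not anyone's real optimiser):

	/* Toy peephole rule: rewrite "or t,a,b ; nor d,t,$0" as
	 * "nor d,a,b".  Legal when d == t, or t is dead afterwards. */
	enum op { OP_OR, OP_NOR };
	struct insn { enum op op; int dst, src1, src2; };

	/* Returns 1 after fusing p[0] and p[1] into one nor in p[0];
	 * the caller then deletes p[1]. */
	int fuse_or_not(struct insn *p, int t_dead_after)
	{
	    if (p[0].op == OP_OR && p[1].op == OP_NOR &&
	        p[1].src1 == p[0].dst && p[1].src2 == 0 /* $0 */ &&
	        (p[1].dst == p[0].dst || t_dead_after)) {
	        p[0].op  = OP_NOR;
	        p[0].dst = p[1].dst;
	        return 1;
	    }
	    return 0;
	}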

Well, guys?  Have you? :-)
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do."

mo@prisma (02/24/89)

Going very very fast is very very hard, if for no other reason than
that time-of-flight delays across realizable circuit-board material,
plus the delays going on and off chips, reduce the "1 foot per
nanosecond" speed-of-light-in-a-vacuum to more like "1 inch per
nanosecond."
It ain't quite that bad, but it is a good number to plan with.
In theory, one *can* make multilayer circuit boards out of laminated
teflon, TFE, and gold conductors, but they would be astronomically
expensive.

This alone makes many parts of the machine more than one clock away
(at 1 inch per nanosecond, a 10-inch trace eats half of a 20 ns clock),
and you'd be surprised at how badly that complicates things.

Fairly quickly one comes to understand that the speed of the parts
is NOT the driver of the machine clock speed.

	-Mike

wbeebe@bilver.UUCP (bill beebe) (02/25/89)

In article <3607@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>Actually, what I would like is a three-dimensional read/write optical storage 
>device.  Imagine gigabytes of storage in a 1" lucite cube.  Doubles as a
>paperweight.

Well, you should read the article in the February 1989 _BYTE_, "Digital
Paper", by Dick Pountain.  Bernoulli Optical Systems Corp (BOSCO, the
chocolate syrup people) is using a new type of optical storage medium to
create a one-gig floppy disk.  Paper is something of a misnomer, as the
material is a flexible polymer sandwich; it is thin, though - about the
thickness of a stiff piece of paper.  BOSCO says they can put the
equivalent of a double-sided WORM drive with a gigabyte of storage
capacity in the same space as a 5.25" half-height.  *Removable*.
Besides, if you get a long enough cable from the PC, it can still
double as a paperweight.

jesup@cbmvax.UUCP (Randell Jesup) (02/25/89)

In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:
>  What are your assumptions?
>End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
>             MiniSuper, Mainframe, Super, ... 

	Home PCs to low-end workstations.

>Architecture:RISC, CISC, Vector, Massively parallel, VLIW, Shared memory
>             multiprocessor, ...

	CISC 680x0 family

>Application: Home, Business, Engineering, Scientific, Manufacturing, ...

	Home, Business (esp graphics/video/sound), plus others.

>Timeframe:   Next year, in 5 years, in 10 years, ...

	Next 2-3 years.

>  What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...

	Cost, Performance

> Examples:

	(all of these can be used more or less alone)

	Cheap large ram.
	Cheap VRAM.

	Reasonably priced (<$500) 1280x1024 or better color monitors.
	
	Reasonably priced (<$500) 2Kx2K grey-scale monitors.

	Cheaper small SCSI drives (say <$300 for 100Meg 28ms 3.5").

	ASICs that allow many pins (>>100), a 68000/68010 core, and
	LOTS of other logic around it (currently 3 low-tech VLSI
	chips plus a gate array).  (Not bloody likely in the given
	time frame :-)

	Cheap ram (did I say that already? :-)

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

cik@l.cc.purdue.edu (Herman Rubin) (02/25/89)

In article <730@microsoft.UUCP>, w-colinp@microsoft.UUCP (Colin Plumb) writes:
> csimmons@oracle.UUCP (Charles Simmons) wrote:
> > [The desired code was]:
> > 
> > 	nor $temp, $a, $b
> > 	and $c, $c, $temp
> > 
> > The actual code generated [by the MIPS compiler] was:
> > 
> > 	or $temp, $a, $b
> > 	nor $temp, $temp, $0
> > 	and $c, $c, $temp
> > 
> > Since the 'nor' instruction doesn't directly map into the C language,
> > I didn't really expect the compiler to handle this minor special case.)
> 
> Really?  I do.  Since the sequence "or $c, $a, $b; not $c, $c" is
> exactly equivalent to "nor $c, $a, $b" and easily recognised by a
> peephole optimiser.  If the code generator frequently uses extra
> temporary registers (say the first sequence ends in "not $d, $c"),
> you have to complicate this by checking to see if $c is dead after
> this point, but I expect MIPS have already fixed this.

There is no question that in some cases a peephole optimizer can catch
things like this.  But this requires that someone has anticipated the
problem and built it into the optimizer.  And how big is the peephole?

This is a point that I have been bringing up for a long time.  The
language gurus are not aware of the hardware, and of the "natural
hardware" instructions that Charles and I want to use.  At least they
do not seem to be.  Now if I give you a complete list of such
instructions today, the list may not be complete tomorrow.  So you
should provide a way for me to add such instructions to the language,
and to communicate the necessary information to the compiler.

I remind you that I have no difficulty with machine instructions on a
wide variety of machines, and a new machine is, for me, easier than a
new language.  I have difficulty with the clumsy assembler languages;
I understand how they got to be that way, and the advantages FOR THE
ASSEMBLER to having them that way.  I understand and appreciate the
advantages of the HLLs, and their drawbacks.

Possibly part of the solution can be done by having the intermediate
code available to the programmer, and allowing the editing of it.  This
is generally not the case, and the intermediate code is rarely documented.
> Well, guys?  Have you? :-)
> -- 
> 	-Colin (uunet!microsoft!w-colinp)
> 
> "Don't listen to me.  I never do."


-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)