bb@tetons.UUCP (Bob Blau) (02/11/89)
This news group is stuck in an endian rut and strung out on history.
How about an exercise in creativity?  The goal is to use computer
architecture as the basis for directing future technological advances.

Imagine that you are the enlightened head of Imaginary Computer Corp's
architecture department.  Your job is to tell us (the enlightened
scientists at the research lab) exactly what sort of technological
breakthrough would help you the most.

 What are your assumptions?
End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
             MiniSuper, Mainframe, Super, ...
Architecture: RISC, CISC, Vector, Massively parallel, VLIW, Shared memory
             multiprocessor, ...
Application: Home, Business, Engineering, Scientific, Manufacturing, ...
Timeframe: Next year, in 5 years, in 10 years, ...

 What problems are you trying to solve?
- Performance, Cost, Complexity, Size, Reliability, ...

 Breakthrough examples:
- Practical X-Ray lithography creating 1 million gate CMOS chips
- Quantum transistors and high temperature superconductor "metallization"
  layers creating femtosecond propagation delays
- Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths
- Optical disc advances creating terabyte 3.5" optical discs
- Nanotechnology inventions creating microscopic computers
- ...

 Keep in mind system constraints like memory and IO bandwidth, power
and cooling, packaging limitations, reasonable economic assumptions,
and balanced system design.  For instance, an engineering workstation
manufacturer may dream of femtosecond gates in 10 years, but the
development and manufacturing costs would probably make the technology
prohibitively expensive for a workstation in that timeframe.  The
memory and IO to support that kind of cycle time also would not fit
with workstation cost, size, and software.

 Try not to expect too many breakthroughs at once, magically
eliminating bothersome constraints but also making the scenario
implausible.  For instance, femtosecond propagation delay gates are
highly unlikely in commercial products in the next 10 years.  First,
they will require breakthroughs in the commercialization of quantum
transistors.  Second, in order to derive real benefit from them,
breakthroughs in reducing on-chip and off-chip wire delays will also
be necessary.  The combination of two breakthroughs at the same time
strains credibility (but isn't impossible).

 At what point do technology changes affect the architecture?  At what
point do you get diminishing returns?

 Chip density:           50K gates -> 100K -> 500K -> 1M -> 5M -> ?
 Chip pinout:            250 pins -> 400 -> 800 -> 1000 -> 2000 -> ?
 Propagation delays:     500ps -> 100ps -> 20ps -> 5ps -> 500fs -> ?
 Chip power dissipation: 1 Watt/chip -> 10 -> 20 -> 50 -> 100 -> ?
 Cable bandwidth:        100 Mbits/sec -> 500M -> 1G -> 10G -> 100G -> ?
 ...

 The intent of this exercise is to discuss what kind of technology
advances really benefit a particular computer architecture.  You may
want to attack the problem from the other side, starting with a
breakthrough and determining which architecture would be most suitable.

Apply: Standard disclaimers
--
 Bob Blau   Amdahl Corporation   143 N. 2 E., Rexburg, Idaho 83440
 UUCP: {ames,decwrl,sun,uunet}!amdahl!tetons!bb   (208) 356-8915
 INTERNET: bb@tetons.idaho.amdahl.com
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/14/89)
In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:

> Imagine that you are the enlightened head of Imaginary Computer Corp's

I need a cheaper/smaller multiport memory interface technology.

> What are your assumptions?
>End Product: Workstation, Mini
>Architecture:RISC, CISC, Vector
>             multiprocessor, ...

All of the above: a fixed instruction format 64 bit supermicro for
workstations, with multiple processors.

>Application: Engineering, Scientific
>Timeframe: Next year
> What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...

Yes.

>- Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths
> Chip pinout: 250 pins -> 400 -> 800 -> 1000 -> 2000 -> ?
> Cable bandwidth: 100 Mbits/sec -> 500M -> 1G -> 10G -> 100G -> ?

I want to build a supermicro that is architecturally similar to a big
vector/parallel machine - so, in order to do that, I need a technology
that will lower the complexity/cost/size of a multi-port memory
interface to around $1K.

For example, suppose you had a bus-like box with 8 processor ports and
8 memory bank ports.  You could plug in 4 CPUs, 2 IO controllers, a
video controller, and a spare, and each processor would be able to get
full memory bandwidth, barring bank conflicts.

Clock cycle time:  should support clock cycle times of 20 ns.
Bus width:  64 data bits + 14 bits ECC minimum.  Up to 512 data bits
  wide could potentially be useful on a more expensive model (I am not
  sure how you would do the bus connections, but the bandwidth is
  useful).
Total memory bandwidth:  3-25 GBytes/sec total, depending on width
  (64-512 bits).  (I know this sounds expensive - that is why a
  breakthrough is needed.)

In other words, something like a Cray Y-MP multiport memory interface,
slowed down a little bit and packaged very inexpensively and
compactly, is what is desired.
--
  Hugh LaMaster, m/s 233-9,    UUCP ames!lamaster
  NASA Ames Research Center    ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035      Phone:  (415)694-6117
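[A back-of-the-envelope check of the bandwidth figures above.  The 20 ns
cycle, 64-512 bit widths, and 8 ports are LaMaster's numbers; the
assumption that each port completes one transfer per cycle, counting
data bits only (the 14 ECC bits as overhead), is mine.]

```python
# Sanity-check the "3-25 GBytes/sec" total memory bandwidth claim.
# Assumed model: each of the 8 ports moves one full-width word of
# data bits every 20 ns cycle; ECC bits are not counted as bandwidth.

CYCLE_NS = 20          # clock cycle time, ns (from the post)
PORTS = 8              # processor ports on the interface (from the post)

def total_bandwidth_gbytes(data_bits):
    """Aggregate bandwidth in GBytes/sec across all ports."""
    bytes_per_cycle = data_bits / 8
    cycles_per_sec = 1e9 / CYCLE_NS        # 50 million cycles/sec
    return PORTS * bytes_per_cycle * cycles_per_sec / 1e9

print(total_bandwidth_gbytes(64))    # narrow model: 3.2 GB/s
print(total_bandwidth_gbytes(512))   # wide model:   25.6 GB/s
```

These land at 3.2 and 25.6 GBytes/sec, consistent with the 3-25 range
quoted in the post.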
gillies@p.cs.uiuc.edu (02/14/89)
O.k., here is an idea, perhaps based on a fallacy -----

"Currently, the quality of the code from several commercial compilers
and hand-optimization for the same machine (e.g. 68000) might vary
tremendously.  The best (compiled or hand-optimized) code might easily
be X (X >= 10?) times faster than the worst."  What is this constant
X, for most commercial compilers?  Why do compilers vary so much?
Clearly, reducing X bootstraps the performance of *all*
compilers/architectures.  Here's a plan:

1. Discover X.

2. Discover the three contributors to X with the most variance.  For
   example, here are 6 possible contributors to X (PLEASE NO FLAMES,
   THESE ARE IMAGINARY REASONS):

   (1) Most compilers do poor register scheduling.
   (2) The architecture is so
       (a) non-orthogonal? => compiler optimization is too hard
       (b) orthogonal?     => simple implementations are microcoded & slow
       (c) powerful?       => combinatorial search is necessary to
                              discover optimal code sequences
   (3) Most compiler writers don't
       (a) have enough time
       (b) have enough knowledge / experience
   ....

   Of these, reason (1) might have a variance of 1.5, reason 2(a)
   might result in a variance of 2.0, 3(b) might be 4.0, etc.

3. Propose a technical solution for the top-3 contributors to the
   variance of X.

------------------------------------------------------------

Also, I've always wondered if we could design a simple architecture +
language, from a theoretical standpoint, for which provably optimal
code could be generated.  At the moment this problem is ill-defined
and requires a lot of formulation.  You can perhaps start with some
ultra-simple architecture (Turing Machine) and a simple language
(Turing Machine Language), and try abstracting the two farther apart
to see how far you can get.  This problem is not for the
faint-hearted.

Don Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801
ARPA: gillies@cs.uiuc.edu   UUCP: {uunet,harvard}!uiucdcs!gillies
wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) (02/15/89)
Well, I'll jump.  I'm interested in computer animation.  The biggest
problem I face is memory bandwidth.  Compared to that, everything else
is a minor second-order effect.  So here is what I'd like to see:

1. RISC is nice since it has lots of registers, but bad because too
   many instructions have to be executed to do anything.

2. CISC is nice in that one little instruction can do a lot of things,
   but you only get a handful of registers, so you are always hitting
   memory to get your data.

3. Memory caches are nice, but there is very little control over what
   stays in the cache.

So what I'd like to see is a twofold solution:

1. RCISC - Register-enhanced CISC.  Take something like a 680xx and
   paste in a couple kbytes of registers that can also hold code.  Use
   a few of the unimplemented opcodes to give access to the new memory
   bank.  This would allow one to really reduce memory accesses during
   intense interactive graphics by putting some of the routines into
   the RAM bank; common data would go in there too.  Also, since this
   would be on-chip RAM, programs would execute much quicker there and
   data access would be fast.  Along with this would have to be a
   tagging system to let tasks grab chunks of the RAM bank, so that
   old-fashioned multitasking programs could still run.

2. OS cops to stop all OS writers from using the RAM bank.  Power to
   the programmers; keep the OS to just the standard registers.  This
   would keep overhead low and even allow cheap context switches.
   Let's not waste great hardware on PIG-style OSs.  After all, the
   only good OS is one that doesn't get in your way.

                               Wayne Knapp
daveb@geaclib.UUCP (David Collier-Brown) (02/15/89)
In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:

| Imagine that you are the enlightened head of Imaginary Computer Corp's

I'd like a very inexpensive dcache kit, much like the bit-slice
processor kits of yore.  I don't mind if they're a bit limited in how
much underlying memory they can handle, or if they're made affordable
by simplifying a little too much, because I want to use a **rather**
large number of them.

| What are your assumptions?
|End Product:

A large-memory mini, for recalcitrant/large problems that companies
without gigabucks still have to solve.  It should fit in a small rack
or large under-desk enclosure.

|Architecture: RISC, CISC, Vector

RISC and moderately-CISC processors, like 68xxx's, MIPSes and
such-like.  Standard state-of-practice stuff.

|Application:

Scientific, logic-processing, symbolic computing.

| What problems are you trying to solve?
|- Performance, Cost, Complexity, Size, Reliability, ...

Size and complexity, primarily.  I want a machine with performance
somewhat better than my Stunned, despite my stuffing it to bursting
with slow main-memory chips.  I want an architecture which will deal
with large working sets and ill-behaved (not very local) reference
patterns without having to sweep the slowdown under the rug by running
large numbers of users/processes in a timesharing system.  I'd like
the opportunity to do this at finite cost in the near future by biting
the bullet and putting in lots of cache in front of a rather huge but
not-too-fast main memory.

I really do want lots of memory, and am betting that the difference in
memory speed/cost versus processor speed/cost makes investing in cache
worthwhile.  The assumption I'm making, you see, is that it's
worthwhile to allocate particular cache chips to particular blocks of
memory chips, probably along paging boundaries.  This in turn means
that I'm betting that cache invalidation/reloading on process change
is a major bottleneck when one has large amounts of cache.

This in turn means that I'm still betting on simple multiprocessing
for this kind of application domain.  Of course, if one put several
processors in front of this memory hierarchy, but made sure that they
didn't do strange things like use two different page/cache sets for
the same data, then it might extend to multiprocessors...

--dave (no, we're not planning an AI document processor) c-b
[ the assumptions are basically a guess, based on the truism about
  micro/mini systems repeating the history of mainframes, complete
  with sillynesses and obvious bottlenecks ]
--
David Collier-Brown.  | yunexus!lethe!dave
Interleaf Canada Inc. |
1550 Enterprise Rd.   | He's so smart he's dumb.
Mississauga, Ontario  |     --Joyce C-B
kleonard@PRC.Unisys.COM (Ken Leonard) (02/15/89)
* ...
* architecture department.  Your job is to tell us (the enlightened
* scientists at the research lab) exactly what sort of technological
* breakthrough would help you the most.
* ...
* Breakthrough examples:
* ...
* - Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths
* ...

Is it OK that we're already building one?  1Gb/s per host pair in a
local or campus net, all host pairs concurrent at that rate.  And
1Gb/s per fiber on a trunk as long as you care to pay for.  (Note,
that is 1,000,000,000 bits of the user's data we're talking about.)

Regardz,
Ken Leonard
Duke of West Nantmeal; Hereditary Captain-in-Chief and Master of
Gunnery, His Lordship's Loyal Company of Freebooters Cannoners and
Military Enginers.
---
A Navy Colt beats a full house, every time.
colwell@mfci.UUCP (Robert Colwell) (02/16/89)
In article <76700063@p.cs.uiuc.edu> gillies@p.cs.uiuc.edu writes:
>
>O.k., here is an idea, perhaps based on a fallacy -----
>
>"Currently, the quality of the code from several commercial compilers
>and hand-optimization for the same machine (e.g. 68000) might vary
>tremendously.  The best (compiled or hand-optimized) code might easily
>be X (X >= 10?) times faster than the worst.  What is this constant
>X, for most commercial compilers?  Why do compilers vary so much?
>Clearly, reducing X bootstraps the performance of *all*
>compilers/architectures.  Here's a plan:

I propose that part of the reason is that there is a lot of money
chasing compiler solutions for part of the "compilation space", and
not much chasing it in other parts.  So a skew develops between how
good the compiler is in some parts of that space and how good it is in
other parts.  Handcoders are reasonably adept in all parts of the
space.  An example is how well compilers can vectorize code these days
vs. how well they do on garbage/serial code.

>1. Discover X
>
>2. Discover the three contributors to X with the most variance.
> ...
>3. Propose a technical solution for the top-3 contributors to the
>   variance of X.

To even have a prayer of getting anything reasonably interesting from
this line of inquiry, I suggest you need to focus on one architecture,
one language, and maybe even just a few benchmarks.  If you did that
the problem would be more tractable, and you'd have the opposite
problem of extrapolating your results back out to other machines and
other languages at the end.  But at least you'd have something
concrete to say.

>Also, I've always wondered if we could design a simple architecture +
>language, from a theoretical standpoint, for which provably optimal
>code could be generated.
>
>You can perhaps start with some ultra-simple architecture (Turing
>Machine), and simple language (Turing Machine Language), and try
>abstracting the two farther apart to see how far you can get.

Nah, at that low a level, you could do what Henry Massalin did with
the "Superoptimizer" he built at Columbia (ASPLOS-II proceedings).
(His program tried every instruction sequence combination up to
sequences of several instructions to find the ones that computed a
given function the quickest, including use of side-effects and bizarre
uses for carry bits and the like.  Fascinating experiment.)

But you could micro-optimize the entire program and still end up with
slower code than a handcoder or a good compiler for various reasons.
For instance, the handcoder might have recognized some trig identity
inherent to the algorithm that wouldn't make economic sense to wire
into the compiler's knowledge base (since it doesn't come along often
enough, and there's always bigger fish to fry).  On the other hand,
the compiler might recognize a code transformation that entirely
removes the section of code you were planning to micro-optimize.  I
think the problem is a lot bigger than your message implies.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090
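[A toy sketch of the exhaustive-search idea described above: enumerate
every instruction sequence up to a length bound and keep the shortest
one that matches the target function on a set of probe inputs.  The
three-"instruction" machine here is invented for illustration; the real
Superoptimizer searched actual machine code, side-effects and all.]

```python
# Minimal brute-force superoptimizer over an invented register machine.
# Not Massalin's program - just the shape of the search it performed.
from itertools import product

OPS = {
    "inc":    lambda x: x + 1,    # add 1
    "double": lambda x: x * 2,    # shift left by 1
    "neg":    lambda x: -x,       # negate
}

def superoptimize(target, probes, max_len=4):
    """Shortest op sequence matching `target` on all probe inputs."""
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            def run(x, seq=seq):
                for op in seq:
                    x = OPS[op](x)
                return x
            if all(run(x) == target(x) for x in probes):
                return list(seq)
    return None

best = superoptimize(lambda x: 2 * x + 2, probes=range(-8, 9))
print(best)   # ['inc', 'double']
```

The search finds inc-then-double (two ops), beating the naive
double/inc/inc translation (three ops) - exactly the kind of
non-obvious equivalence exhaustive search turns up, though the real
thing had to cope with flags and side-effects as well.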
suitti@haddock.ima.isc.com (Stephen Uitti) (02/16/89)
In article <3780@tekig5.PEN.TEK.COM> wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes:
=>I'm interested in computer animation. The biggest problem is
=>memory bandwidth.
=>So what I'd like to see is a two fold solution:
=> 1. RCISC - Register enhanced CISC.
=> Take something like 680xx, paste in a couple kbytes of
=> registers that can also hold code. Use a few of the
=> unimplemented opcodes to give access to the new memory bank.
=> Wayne Knapp
Don't invent a new memory store type. Just put a few pages of
physical RAM on the chip. This was done for some 8 bit controller
type chips way back when (for different reasons).
The OS can be told to provide access to these special pages to the
programmer, perhaps using existing calls (depending on the OS) to map
them. The OS may also provide other services which may be handy, such
as real time stuff.
This has the advantages of:
1) no new instructions. True compatibility.
2) current instruction power is available.
3) no new assembler/compiler tools required.
4) comparatively simple (localized) OS changes.
5) can be expanded dynamically with chip technology.
Intel's separate I/O instructions (for example on the 8080) show how
painful a second address space can be.
Of course, it may be that a peripheral (glorified DMA engine(s))
could do everything you want.  That would only require a "simple
driver"... which you, the programmer, could write.
Stephen.
w-colinp@microsoft.UUCP (Colin Plumb) (02/17/89)
suitti@haddock.ima.isc.com (Stephen Uitti) wrote:

> Don't invent a new memory store type.  Just put a few pages of
> physical RAM on the chip.  This was done for some 8 bit controller
> type chips way back when (for different reasons).

Also on the Inmos Transputer: 4K of on-chip memory in the T800, 16K on
the T810.  On the T810, it's 64 bits wide, too.

I'd still rather have cache.  It's a pain to allocate that space
unless you take over the whole processor with your application and do
the allocation at compile time.
--
    -Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do."
gillies@p.cs.uiuc.edu (02/18/89)
/* Written 10:01 am Feb 15, 1989 by colwell@mfci.UUCP in p.cs.uiuc.edu:comp.arch */
> Nah, at that low a level, you could do what Henry Massalin did with
> the "Superoptimizer" he built at Columbia (ASPLOS-II proceedings).
> (His program tried every instruction sequence combination up to
> sequences of several instructions to find the ones that computed a
> given function the quickest, including use of side-effects and
> bizarre uses for carry bits and the like.  Fascinating experiment.)

That's precisely the article I was thinking about when I originally
thought of the problem -- I should have mentioned this.  Are there
architectures where combinatorial search can be used for practical
compilation?  Are these architectures useful?

> But you could micro-optimize the entire program and still end up
> with slower code than a handcoder or a good compiler for various
> reasons.....

I think you need an appropriate definition of "compilation".  You
can't expect the compiler to go off and implement new algorithms.  You
cannot assume there is a library of compiled subroutines.

You might be able to define an optimal compilation as one where every
state of the "important" state variables in the source (programming)
language is also represented somewhere, at some time, in the target
(assembly) language program during its execution.  Under this
definition, code transformations are not allowed to remove redundant
code, but they are allowed to combine redundant computations...  You
also must enforce a space-time tradeoff, so the compiler doesn't just
precompute everything and use a lookup table.  The ASPLOS paper
required the programs to be as short as possible.
aglew@mcdurb.Urbana.Gould.COM (02/18/89)
>/* ---------- "quest for breakthroughs (long)" ---------- */
>
> This news group is stuck in an endian rut and strung out on history.
>How about an exercise in creativity?  The goal is to use computer
>architecture as the basis for directing future technological advances.
>
> Imagine that you are the enlightened head of Imaginary Computer Corp's
>architecture department.  Your job is to tell us (the enlightened
>scientists at the research lab) exactly what sort of technological
>breakthrough would help you the most.

OK, I'll bite.  Actually, I'll probably bite several times, with
different sets of starting assumptions/markets, as time permits in the
next few weeks.  First, I'll think about my dream of building a truly
useful personal computer system...

> What are your assumptions?
>End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
>             MiniSuper, Mainframe, Super, ...

PC.  Subclasses: desktop (where most PCs are now), briefcase,
wristwatch, house.

>Architecture: RISC, CISC, Vector, Massively parallel, VLIW, Shared
>             memory multiprocessor, ...

Whatever it takes.  Probably RISC/CISC single processor.  Actually, I
don't think that processor technology is all that important for this
market, although performance increases in a single processor may help
in getting cost down by eliminating extra components.

>Application: Home, Business, Engineering, Scientific, Manufacturing, ...

Home, Engineering, Scientific (basically, the personal computer I want).

>Timeframe: Next year, in 5 years, in 10 years, ...

5-10 years.

> What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...

I'm sure I'm going to get told "You want a personal scientific
computer?  What about NeXT?"  My response is that NeXT is much too
expensive.  I want a really cheap computer, <$2,000, preferably small
enough to wear on my wrist, with the ability to interface to the
standard I/O modules.

A 33MHz 68030 based system with 32 MB of memory is the first small
computer system that I've found acceptable to work on, so I don't
think processor performance improvements are too much needed for my
"realizable" dream machine (although it still falls a ways short of a
Gould NP1 with 256 MB, that's chiefly I/O).  So, here's what I want to
solve:

Performance: comparable to today's top of the line workstations, with
  considerably reduced component count.  With a view to this,
  performance sufficient in a single processor to eliminate the need
  for separate graphics coprocessors, etc. (or, multiple processors
  per package - chip or hybrid).
Cost: get it *way* down!
Complexity: reduce component count to reduce cost.
Size: second priority - make it small enough to fit in a briefcase, on
  a wrist...
Reliability: third priority - reliability comparable to one of today's
  PCs would be acceptable (but cost of repair should also stay the
  same, which it probably won't - it'll increasingly be "buy the whole
  thing").

> Breakthrough examples:
>- Practical X-Ray lithography creating 1 million gate CMOS chips

Anything that increases densities is good.

>- Quantum transistors and high temperature superconductor "metallization"
>  layers creating femtosecond propagation delays

Probably not applicable to this domain.  I'm not going to carry liquid
nitrogen around, and femtosecond delays are probably not too
applicable to the performance domain I'm talking about in this note
(they will be to others).

>- Fiberoptic advances creating 1 Gigabyte/sec cable or bus bandwidths

High priority.  What I want is a wristwatch computer that I can plug
into a base unit to upload/download stuff from disk, control display
devices with, etc.  (Or, I want a base unit PC that can download stuff
to a wristwatch unit - but I would really prefer that the wristwatch,
or Walkman, unit have the intelligence, with enough non-volatile
memory to be "THE" system, with everything bigger a peripheral.)

Yes, I'm a portable computer freak.  I bought one of the first
luggables, and lugged it thousands of miles.  At the moment I don't
have a portable; I do have one of those multifunction memory watches.
But these watches are sadly limited, with far too little memory and
I/O capability for what I need.  I would have bought one of those
SEIKO memory systems that had a PC interface, except that I didn't
have a PC compatible.  I think that they missed the market with that
device - instead of having a simplified ASCII download capability,
perhaps with a stripped down RS232 i/f (I think that you could get
that down to wristwatch profile), they went for PC users.  PC users
don't want wristwatch computing.  The people who want wristwatch
computing are the people like me who use bigger machines all day long,
and want the ability to manipulate their wristwatch schedule,
appointment books, etc., with considerably more convenience than a
wristwatch keyboard allows (you know how much pain it is to type
things in on such a keyboard?  Give me download!).  I want a personal
computer that can replace the notebook I carry around all day, and
that can interface to other devices and computers.  TABLET is close,
but I don't want to have to carry that around.

>- Optical disc advances creating terabyte 3.5" optical discs

Yes!  Appears possible.  Myself, I would just wait for that to come
out for the desktop crowd; I can't see auxiliary storage being
wristwatch sized in the near future.  Walkman sized, maybe.  In my own
company I would concentrate on the wristwatch processor, memory, and
I/O element that nobody seems to be working on.

>- Nanotechnology inventions creating microscopic computers

Sure.  But I don't see this being rewarding in the timespan I'm
talking about.

> Keep in mind system constraints like memory and IO bandwidth, power
>and cooling, packaging limitations, reasonable economic assumptions,
>and balanced system design.  For instance an engineering workstation
>manufacturer may dream of femtosecond gates in 10 years, but the
>development and manufacturing costs would probably make the technology
>prohibitively expensive for a workstation in that timeframe.  The memory
>and IO to support that kind of cycle time also would not fit with
>workstation cost, size, and software.

What I want is basically a present day processor, 8-32 MB of memory,
and an "optical SCSI" in a 1"x0.25"x0.10" package.  Everything else in
this "realizable dream" I see on the drawing boards.  What's needed:
increased density; technology that permits logic, memory, and optics
to be on the same chip; a good way of coupling optics to a chip (that
doesn't require a relatively large shroud); packaging to protect this
beast; manufacturing technology to make this beast buildable (I can't
see it being at all practical without robotics or *very* cheap
labour); and, of course, reductions in power consumption so that I
could wear such a device, or attach it to my belt.

The worst part about this dream: although I think it's within reach, I
don't think that it would be a big profit maker.  After all, the idea
would be to make something *cheap*.  It'll probably arrive 5-10 years
after I would like it, via excess competition in the PC market forcing
people into new niches, or (more likely) via intelligent bank cards.

> Try not to expect too many breakthroughs at once, magically
>eliminating bothersome constraints, but also making the scenario
>implausible.

I'm trying to be mid-range practical.  Please tell me where I fall
down (economics is the biggest constraint, I suspect).

> At what point do technology changes affect the architecture?  At what
>point do you get diminishing returns?
>
> Chip density: 50K gates -> 100K -> 500K -> 1M -> 5M -> ?

Say 24K gates for the processor, + 8 MB (minimum) of RAM.  O(16M)
gates necessary if all on the same chip; less would require great
pinout.

> Chip pinout: 250 pins -> 400 -> 800 -> 1000 -> 2000 -> ?

If memory and processor could live on the same chip - space-, heat-,
and process-wise - then only low pinouts are needed.  Direct coupling
to 1 or 2 optic fibers on the same chip is desirable.

> Propagation delays: 500ps -> 100ps -> 20ps -> 5ps -> 500fs -> ?

Not applicable.

> Chip power dissipation: 1 Watt/chip -> 10 -> 20 -> 50 -> 100 -> ?

I don't want to wear anything that hot on my wrist!  How about fewer
watts per device?  (Which is probably where I start needing new
technology - I suppose that there are physical limits here.)

> Cable bandwidth: 100 Mbits/sec -> 500M -> 1G -> 10G -> 100G -> ?

The basic thing I need for this wrist configuration is the ability to
do bitmapped video, say for a colour megapixel at 60 Hz =>
24 x 1M x 60 => O(2G) bits per second.  Plus the rest of the I/O
traffic, which is relatively small.  Biggest need is miniaturization
of terminators.

> The intent of this exercise is to discuss what kind of technology
>advances really benefit a particular computer architecture.

This was a good idea.  Writing this has been fun, and clarified a few
ideas in my head.  If I get the time I may try to write similar things
for more realistic / commercially viable systems.
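[The video bandwidth estimate in the post can be checked directly.  The
24-bit colour, megapixel, 60 Hz figures are the poster's; the remark
about overhead justifying the round-up to O(2G) is my addition.]

```python
# Check the wristwatch video bandwidth estimate: 24 bits/pixel,
# a megapixel display, 60 Hz refresh.
BITS_PER_PIXEL = 24
PIXELS = 1_000_000        # "a colour megapixel"
REFRESH_HZ = 60

raw_bits_per_sec = BITS_PER_PIXEL * PIXELS * REFRESH_HZ
print(raw_bits_per_sec / 1e9)   # 1.44 Gbit/s of raw pixel data

# Blanking intervals, framing, and protocol overhead on top of the
# raw 1.44 Gbit/s make the post's order-of-2-Gbit/s target reasonable.
```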
david@indetech.UUCP (David Kuder) (02/19/89)
In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:

> What are your assumptions?
>End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
>             MiniSuper, Mainframe, Super, ...

Backup: fast and plentiful!

>Architecture: RISC, CISC, Vector, Massively parallel, VLIW, Shared memory

Anything.

>Application: Home, Business, Engineering, Scientific, Manufacturing, ...

All of the above!

>Timeframe: Next year, in 5 years, in 10 years, ...

Yesterday.  Today I can have a ton of disk under my desk and no way to
back it up easily.

> What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...

Time and space of backups.  Every other suggestion under this topic
has involved more memory and processing it faster.  Any concern for
the permanence of the results that are done bigger and faster should
lead to backup.

I can currently stick one GB of disk on my workstation with little
trouble, but the small form factor backup devices that I could then
use (removable diskette, 1/4" cartridge, 8mm cartridge) are either
incapable of holding that much data or of transferring it in a small
portion of a workday.  Various optical disks aren't any better.  A
tape drive that could transfer the data at a decent rate would be a
little large to put in my office and still couldn't do the job without
tape hangs.

Now when we consider real world (tm) applications with a database of
tens to thousands of GB of data, backup is THE problem.  How many 6250
tape drives and tapes would it take to back up 25GB in an 8 hour
shift?  What happens when you have to have your database online 24
hours a day?  My bank has enough problems with its database and ATMs
without having to worry about backup.  I understand from one go 'round
with them that ATM backup is the paper printout in the machine -- this
isn't just audit, it's backup!  If the computer goes down, the
transactions are recovered by entering the paper printout by hand.
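[A rough answer to the 6250-tape question above.  The 25 GB and the
8-hour shift are the poster's; the drive figures - 6250 bpi recording,
a 125 ips transport, and roughly 170 MB usable per 2400-foot reel once
inter-record gaps are counted - are my assumptions for a typical
9-track setup of the era.]

```python
# How many 6250 drives and tapes to back up 25 GB in an 8-hour shift?
DATABASE_MB = 25_000           # 25 GB to back up
REEL_MB = 170                  # assumed usable capacity per reel
DRIVE_KB_PER_SEC = 6250 * 125 / 1000   # bpi * ips, ~781 KB/s streaming
SHIFT_HOURS = 8

reels = -(-DATABASE_MB // REEL_MB)                   # ceiling division
hours_one_drive = DATABASE_MB * 1000 / DRIVE_KB_PER_SEC / 3600
drives = -int(-hours_one_drive // SHIFT_HOURS)       # ceiling division

print(reels)                        # ~148 reels
print(round(hours_one_drive, 1))    # ~8.9 hours of pure streaming
print(drives)                       # at least 2 drives
```

Even ignoring mount, rewind, and swap time, a single drive can't quite
stream 25 GB in a shift, and an operator would be feeding it a new
reel every few minutes - which is the poster's point.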
Solve the backup problem and you'll be the next (NeXT) Jobs.  Build
another fast CPU and you'll be one of the guys in Silicon Valley.
--
  ____*_  David A. Kuder          {sun,sharkey,pacbell}!indetech!david
  \  / /  Independence Technologies
   \/ /   42705 Lawrence Place    FAX:   415 438-2034
    \/    Fremont, CA 94538       Voice: 415 438-2003
csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (02/21/89)
In article <76700068@p.cs.uiuc.edu> gillies@p.cs.uiuc.edu writes: > >/* Written 10:01 am Feb 15, 1989 by colwell@mfci.UUCP in p.cs.uiuc.edu:comp.arch */ >> Nah, at that low a level, you could do what Henry Massalin did with >> the "Superoptimizer" he built at Columbia (ASPLOS-II proceedings). >> (His program tried every instruction sequence combination up to >> sequences of several instructions to find the ones that computed a >> given function the quickest, including use of side-effects and >> bizarre uses for carry bits and the like. Fascinating experiment.) > >That's precisely the article I was thinking about when I originally >thought of the problem -- I should have mentioned this. Are there >architectures where combinatorial search can be used for practical >compilation? Are these architectures useful? Somewhat related: I recently had the pleasure of porting about 300 instructions of 68020 assembler to the MIPS processor. I took the 68020 algorithm and translated it into C. I then had the MIPS compiler compile the C code. With the exception of 2 instructions, the output of the MIPS compiler was exactly what I would have written if I had been writing in assembler. My conclusion is that for an elegant architechture, such as the R2000, combinatorial searches won't buy you much. On less elegant architechtures, such as the 680x0 and 80x86, there are a sufficient number of special cases that a combinatorial search can find cute code sequences. (Alternatively, the R2000 instruction set has very few instructions that aren't easily accessible from C. The 680x0 and 80x86 have numerous instructions that aren't accessible from C. For example, C compilers normally won't generate rotate instructions. The 'bfffo' instruction on the 68020 also tends to be inaccessible to compilers.) 
Contrariwise, if you do have a problem on which you want to perform a
combinatorial search, the R2000 has far fewer special cases than, say,
the 68020, so the search won't explode quite so quickly.  It will also
be relatively easy to decide when one sequence of instructions is
better than another.

(One of my favorite puzzles: You have two 32-bit registers containing
some pattern of bits.  The most significant bit of a register is
numbered "0".  You want to move bits 0, 2, 4, and 16 from the first
register, and bit 0 of the second register, into some subset of the
low-order eight bits of some register.  The high-order 24 bits of the
destination register must end up as zero.  The three bits in the
low-order byte of the destination register which are not copies of the
five bits of interest may have any value at all.)

(The places where the MIPS compiler didn't produce optimal code were
generated by source of the form:

	register unsigned long a, b, c;
	c &= ~(a | b);

Clearly, we would like this to generate the code:

	nor	$temp, $a, $b
	and	$c, $c, $temp

The actual code generated was:

	or	$temp, $a, $b
	nor	$temp, $temp, $0
	and	$c, $c, $temp

Since the 'nor' instruction doesn't directly map into the C language,
I didn't really expect the compiler to handle this minor special
case.)
-- Chuck
kds@blabla.intel.com (Ken Shoemaker) (02/21/89)
Actually, what I would like is a three-dimensional read/write optical
storage device.  Imagine gigabytes of storage in a 1" lucite cube.
Doubles as a paperweight.
------------
I've decided to take George Bush's advice and watch his press
conferences with the sound turned down...  -- Ian Shoales

Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds
henry@utzoo.uucp (Henry Spencer) (02/23/89)
In article <671@oracle.oracle.com> csimmons@oracle.UUCP (Charles Simmons) writes:
>My conclusion is that for an elegant architecture, such as the R2000,
>combinatorial searches won't buy you much.  On less elegant
>architectures, such as the 680x0 and 80x86, there are a sufficient
>number of special cases that a combinatorial search can find cute
>code sequences...

I haven't read the Massalin paper yet -- possibly it addresses this --
but I am compelled to wonder whether those cute code sequences are
really faster than straightforward ones, given the attention that chip
designers usually pay to optimizing the most common (i.e. simple)
cases.  There have been surprises in this area in the past.
-- 
The Earth is our mother;  | Henry Spencer at U of Toronto Zoology
our nine months are up.   | uunet!attcan!utzoo!henry  henry@zoo.toronto.edu
rpw3@amdcad.AMD.COM (Rob Warnock) (02/23/89)
In article <671@oracle.oracle.com> csimmons@oracle.UUCP writes:
+---------------
| (One of my favorite puzzles: You have two 32-bit registers
| containing some pattern of bits.  The most significant bit of a register
| is numbered "0".  You want to move bits 0, 2, 4, and 16 from the
| first register, and bit 0 of the second register into some subset
| of the low-order eight bits of some register.  The high-order 24 bits
| of the destination register must end up as zero.  The three bits in
| the low-order byte of the destination register which are not copies
| of the five bits of interest may have any value at all.)
+---------------

O.k., I'll bite.  (Byte?  Nybble?  Chomp at the bit?  ;-}  ;-} )

On the Am29000, you can do this in five instructions of straight-line
code, involving a non-obvious use of the 29k "extract" instruction
(which extracts a 32-bit field from a 64-bit source: any two of the
32-bit regs).  Let "x" be the first source reg, "y" the second, "t1"
& "t2" temp regs, and "z" the result.  (Depending on where the result
goes and whether either/both of the source regs may be destroyed, one
or both of the temps may be unneeded.)  The code is:

	srl	t1,x,27		; t1<27:31> = x<0:4>, t1<0:26> = 0
	sll	t2,x,16		; t2<0> = x<16> (t2<1:15> = "don't care")
	mtsrim	FC,1		; condition Funnel-shifter Count
	extract	t1,t1,t2	; t1<0:31> = t1<1:31> cat t2<0>
	extract	z,t1,y		; z<0:31> = t1<1:31> cat y<0>

The result (BigEndian bit numbers) satisfies the given conditions
(bits 24, 26, & 28 are the "don't care" bits):

    0:23     24      25      26      27      28      29      30      31
+- - - -+-------+-------+-------+-------+-------+-------+-------+-------+
| ...0  |   0   | x<0>  | x<1>  | x<2>  | x<3>  | x<4>  | x<16> | y<0>  |
+- - - -+-------+-------+-------+-------+-------+-------+-------+-------+

"Extract" is also handy in 29k code when shifting multi-word
quantities.  A highly optimized version of "memcpy()" does non-aligned
byte copies with inner loops of "load_multiple, extract, extract...,
store_multiple".
(Without using "extract", the best I could do was seven instructions,
which happened to give the same result pattern.)

Rob Warnock, Systems Architecture Consultant
UUCP:  {amdcad,fortune,sun}!redwood!rpw3	ATTmail: !rpw3
DDD:   (415)572-2607
USPS:  627 26th Ave, San Mateo, CA  94403
w-colinp@microsoft.UUCP (Colin Plumb) (02/24/89)
csimmons@oracle.UUCP (Charles Simmons) wrote:
> [The desired code was]:
>
> 	nor	$temp, $a, $b
> 	and	$c, $c, $temp
>
> The actual code generated [by the MIPS compiler] was:
>
> 	or	$temp, $a, $b
> 	nor	$temp, $temp, $0
> 	and	$c, $c, $temp
>
> Since the 'nor' instruction doesn't directly map into the C language,
> I didn't really expect the compiler to handle this minor special case.)

Really?  I do.  The sequence "or $c, $a, $b; not $c, $c" is exactly
equivalent to "nor $c, $a, $b" and is easily recognised by a peephole
optimiser.  If the code generator frequently uses extra temporary
registers (say the first sequence ends in "not $d, $c"), you have to
complicate this by checking whether $c is dead after this point, but I
expect MIPS have already fixed this.

Well, guys?  Have you?  :-)
-- 
-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do."
mo@prisma (02/24/89)
Going very very fast is very very hard, if for no other reason than
that time-of-flight delays across realizable circuit-board material,
plus the delays going on and off chips, reduce the "1 foot per
nanosecond" speed of light in a vacuum to more like "1 inch per
nanosecond."  It ain't quite that bad, but it is a good number to plan
with.  In theory, one *can* make multilayer circuit boards out of
laminated teflon, TFE, and gold conductors, but they would be
astronomically expensive.  This alone makes many parts of the machine
more than one clock away, and you'd be surprised at how badly that
complicates things.  Fairly quickly one comes to understand that the
speed of the parts is NOT the driver of the machine clock speed.

-Mike
wbeebe@bilver.UUCP (bill beebe) (02/25/89)
In article <3607@mipos3.intel.com> kds@blabla.UUCP (Ken Shoemaker) writes:
>Actually, what I would like is a three-dimensional read/write optical
>storage device.  Imagine gigabytes of storage in a 1" lucite cube.
>Doubles as a paperweight.

Well, you should read the article in the February 1989 _BYTE_,
"Digital Paper", by Dick Pountain.  Bernoulli Optical Systems Corp
(BOSCO, the chocolate syrup people) is using a new type of optical
storage medium to create a one-gig floppy disk.  "Paper" is something
of a misnomer, as the material is a flexible polymer sandwich.  When I
say thin, I mean about the thickness of a stiff piece of paper.  BOSCO
says they can put the equivalent of a double-sided WORM drive with a
gigabyte of storage capacity in the same space as a 5.25" half-height.
*Removable*.  Besides, if you get a cable long enough from the PC, it
can still double as a paperweight.
jesup@cbmvax.UUCP (Randell Jesup) (02/25/89)
In article <740@tetons.UUCP> bb@tetons.UUCP (Bob Blau) writes:
>	What are your assumptions?
>End Product: Embedded controller, PC, Workstation, Mini, SuperMini,
>	MiniSuper, Mainframe, Super, ...

	Home PCs to low-end workstations.

>Architecture: RISC, CISC, Vector, Massively parallel, VLIW, Shared memory
>	multiprocessor, ...

	CISC, 680x0 family.

>Application: Home, Business, Engineering, Scientific, Manufacturing, ...

	Home, Business (esp graphics/video/sound), plus others.

>Timeframe: Next year, in 5 years, in 10 years, ...

	Next 2-3 years.

>	What problems are you trying to solve?
>- Performance, Cost, Complexity, Size, Reliability, ...

	Cost, Performance.

>	Examples: (all of these can be used more or less alone)

	Cheap large RAM.  Cheap VRAM.
	Reasonably priced (<$500) 1280x1024 or better color monitors.
	Reasonably priced (<$500) 2Kx2K grey-scale monitors.
	Cheaper small SCSI drives (say <$300 for 100Meg 28ms 3.5").
	ASICs that allow both many pins (>>100) with a 68000/68010 core
	and LOTS of other logic around it (currently 3 low-tech VLSI
	chips plus a gate array).  (Not bloody likely in the given
	time frame :-)
	Cheap RAM (did I say that already? :-)
-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
cik@l.cc.purdue.edu (Herman Rubin) (02/25/89)
In article <730@microsoft.UUCP>, w-colinp@microsoft.UUCP (Colin Plumb) writes:
> csimmons@oracle.UUCP (Charles Simmons) wrote:
> > [The desired code was]:
> >
> > 	nor	$temp, $a, $b
> > 	and	$c, $c, $temp
> >
> > The actual code generated [by the MIPS compiler] was:
> >
> > 	or	$temp, $a, $b
> > 	nor	$temp, $temp, $0
> > 	and	$c, $c, $temp
> >
> > Since the 'nor' instruction doesn't directly map into the C language,
> > I didn't really expect the compiler to handle this minor special case.)
>
> Really?  I do.  Since the sequence "or $c, $a, $b; not $c, $c" is
> exactly equivalent to "nor $c, $a, $b" and easily recognised by a
> peephole optimiser.  If the code generator frequently uses extra
> temporary registers (say the first sequence ends in "not $d, $c"),
> you have to complicate this by checking to see if $c is dead after
> this point, but I expect MIPS have already fixed this.

There is no question that in some cases a peephole optimizer can catch
things like this.  But this requires that someone has anticipated the
problem and built it into the optimizer.  And how big is the peephole?

This is a point that I have been bringing up for a long time.  The
language gurus do not seem to be aware of the hardware, and of the
"natural hardware" instructions that Charles and I want to use.  Now
if I give you a complete list of such instructions today, the list may
not be complete tomorrow.  So you should provide a way for me to add
this to the language, and communicate the necessary information to the
compiler.

I remind you that I have no difficulty with machine instructions on a
wide variety of machines, and a new machine is, for me, easier than a
new language.  I do have difficulty with the clumsy assembler
languages; I understand how they got to be that way, and the
advantages FOR THE ASSEMBLER of having them that way.  I understand
and appreciate the advantages of the HLLs, and their drawbacks.
Possibly part of the solution would be to make the intermediate code
available to the programmer and allow editing of it.  This is
generally not the case, and the intermediate code is rarely
documented.

> Well, guys?  Have you?  :-)
> -- 
> -Colin (uunet!microsoft!w-colinp)
>
> "Don't listen to me.  I never do."
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054	hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)