mark@mips.COM (Mark G. Johnson) (12/19/89)
In article <276@leia.WV.TEK.COM> johnt@opus.WV.TEK.COM (John Theus) writes: > >As I stated in a previous posting, we expect to be building Futurebus+ >hardware in the coming year that can sustain 500 Mbytes/sec on a 64 bit >wide data path. The questions that immediately come to mind are either >this does meet your definition for "fast", or you don't believe we can >deliver this bandwidth. Could you explain what is meant by "sustain" above? At least two really cool things might be implied by this: (1) In the coming year you'll be building Futurebus+ hardware that can transfer 512 bytes (i.e. 128 words, or 64 bus-widths of data) in 1024 nanoseconds. This'd be a cache refill from main memory. (2) In the coming year you'll be building Futurebus+ hardware that can do a DMA transfer of 50 Megabytes in 0.1 second. Are either of the assertions above correct? {they're based on the assumption that (500 MB/sec / 8 bytes/transfer) = 62.5 Mtransfers/sec = 16ns/transfer is the bus cycle time in "sustained" operation} Also, folks might think it was a bit smelly to claim high throughput rates for a "bus" that only has two or three slots, or for a "bus" that doesn't involve a printed circuit backplane board having connectors or sockets for separate daughterboards. Just to lay these types of silly "oh-yeah?" questions to rest permanently, Does next-years-500MB/sec-Futurebus+ have a mother/daughterboard construction with N>6 connectors and card slots, and will it run at 500 MBytes/sec when fully populated? Thanks. -- -- Mark Johnson MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 (408) 991-0208 mark@mips.com {or ...!decwrl!mips!mark}
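The arithmetic behind Mark's two scenarios can be checked with a short sketch. It assumes one bus-width (8 bytes) moves per cycle with no overhead of any kind, which is the same idealization his braces describe; the variable and function names are illustrative only.

```python
# Inputs taken from the figures quoted in the post above.
SUSTAINED_BYTES_PER_SEC = 500e6   # the 500 Mbytes/sec claim
BUS_WIDTH_BYTES = 8               # 64-bit data path

transfers_per_sec = SUSTAINED_BYTES_PER_SEC / BUS_WIDTH_BYTES
ns_per_transfer = 1e9 / transfers_per_sec

def transfer_time_ns(n_bytes):
    """Time to move n_bytes at one bus-width per cycle, ignoring all overhead."""
    cycles = n_bytes / BUS_WIDTH_BYTES
    return cycles * ns_per_transfer

print(transfers_per_sec / 1e6, "Mtransfers/sec")
print(ns_per_transfer, "ns per transfer")
print(transfer_time_ns(512), "ns for a 512-byte cache refill, case (1)")
print(transfer_time_ns(50e6) / 1e9, "s for a 50 Mbyte DMA transfer, case (2)")
```

Both scenarios are the same rate seen at two very different time scales, which is why the meaning of "sustain" matters.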
johnt@opus.WV.TEK.COM (John Theus;685-2564;61-183;625-6654;hammer) (12/20/89)
In article <33845@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>
>Could you explain what is meant by "sustain" above? At least two
>really cool things might be implied by this:
>
>   (1) In the coming year you'll be building Futurebus+ hardware that
>       can transfer 512 bytes (i.e. 128 words, or 64 bus-widths of data)
>       in 1024 nanoseconds. This'd be a cache refill from main memory.
>
>   (2) In the coming year you'll be building Futurebus+ hardware that
>       can do a DMA transfer of 50 Megabytes in 0.1 second.
>
>Are either of the assertions above correct? {they're based on the assumption
>that (500 MB/sec / 8 bytes/transfer) = 62.5 Mtransfers/sec = 16ns/transfer
>is the bus cycle time in "sustained" operation}
>

By sustained, I mean that over a period of 0.1 second, the combined traffic of cache lines (64 bytes on Futurebus+) and large DMA blocks will move more than 50 Megabytes. The burst rate, the transfer rate within a single transaction, will be slightly higher.

>
>Also, folks might think it was a bit smelly to claim high throughput rates
>for a "bus" that only has two or three slots, or for a "bus" that doesn't
>involve a printed circuit backplane board having connectors or sockets for
>separate daughterboards. Just to lay these types of silly "oh-yeah?"
>questions to rest permanently,
>
>   Does next-years-500MB/sec-Futurebus+ have a mother/daughterboard
>   construction with N>6 connectors and card slots, and will it run
>   at 500 MBytes/sec when fully populated?
>
>Thanks.
>--
> -- Mark Johnson
>    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
>    (408) 991-0208 mark@mips.com {or ...!decwrl!mips!mark}

Yes, I'm talking about a more-or-less standard Futurebus+ backplane environment with more than 6 populated slots.

Futurebus+ uses BTL (Backplane Transceiver Logic), made by NSC, TI and Signetics. BTL on a daughtercard drives a 50 to 60 ohm stripline backplane that is terminated in 39 ohms to 2 volts at both ends.
As long as the edge speed stays longer than 1 nsec., this electrical environment is good for data periods of 10 nsec. or less on a 19 inch rack length backplane with 1 inch board spacing. The high speed data transfer protocol Futurebus+ uses is called packet mode and it was invented by Emil Hahn of Signetics. This protocol uses source synchronous transmission without transmitting any clock. Since there is no clock, there are no bus level set-up or hold times. The protocol is also not limited by signal skew, which turns out to be the biggest source of delay in more standard protocols. The bottom line is this protocol will not be a limiting factor in ultimate performance. The Futurebus+ spec requires a packet implementor to support a minimum packet speed of 60 MTransfers/sec or 480 MBytes/sec on a 64 bit bus. As I've tried to show, the electrical environment and the protocol will both support better than the 16 nsec/transfer rate that Mark asked about. The limitation on our performance this next year will be the silicon implementation. John Theus johnt@opus.wv.tek.com Futurebus+ Parallel Protocol Coordinator Tektronix, Inc. Interactive Technologies Div. - shipping the Futurebus-based XD88 workstations
yodaiken@freal.cs.umass.edu (victor yodaiken) (12/21/89)
In article <278@leia.WV.TEK.COM> johnt@opus.WV.TEK.COM (John Theus) writes: >The high speed data transfer protocol Futurebus+ uses is called packet mode >and it was invented by Emil Hahn of Signetics. This protocol uses source >synchronous transmission without transmitting any clock. Since there is >no clock, there are no bus level set-up or hold times. The protocol is >also not limited by signal skew, which turns out to be the biggest source >of delay in more standard protocols. The bottom line is this protocol will >not be a limiting factor in ultimate performance. How exactly does this work? References?
johnt@opus.WV.TEK.COM (John Theus;685-2564;61-183;625-6654;hammer) (01/04/90)
In article <7863@dime.cs.umass.edu> yodaiken@freal.cs.umass.edu (victor yodaiken) writes: >In article <278@leia.WV.TEK.COM> johnt@opus.WV.TEK.COM (John Theus) writes: >>The high speed data transfer protocol Futurebus+ uses is called packet mode >>and it was invented by Emil Hahn of Signetics. This protocol uses source >>synchronous transmission without transmitting any clock. ... > >How exactly does this work? References? The only references are the Futurebus+ spec itself, and the published working group meeting minutes where Emil presented the papers on his protocol. The packet data transport protocol was designed to move data as fast as possible with a minimum feature set. The protocol does not allow sub-word operations, only 32, 64, 128 or 256 bit wide words can be transferred. No lock operations can be done when using this protocol. Blocks are transferred of length 2, 4, 8, 16, 32 or 64 words long. The block length is signalled at the start of the transfer. The transfer protocol is very similar to the asynchronous protocol used on RS-232. If we just think about an individual bit for now, the sender transmits its data at the frequency of an on-board clock. As with RS-232, the frequency must be known by both the transmitter and the receiver in advance. The Futurebus+ protocols provide a mechanism for selecting one of two such frequencies on a transaction by transaction basis. To start data transmission, the sender transmits a sync bit which is a logic one. The data is encoded using NRZI, where a logic one is represented by an edge transition during a datum cell, and a logic zero is represented by no transition. Therefore to start a packet, an edge is sent followed by the encoded data, and concluded by an even longitudinal parity bit. When parity is correct, the signal line is left in the logic zero state. The receiver has its own on-board clock that runs at the same frequency as the sender. 
Both sender and receiver must have clock frequency tolerances of 0.01% or better. When the receiver sees the sync bit at the start of a packet, its logic sets a precision delay equal to the phase difference between the sync bit and its on-board clock. Thereafter, the logic uses the on-board clock plus the delay to define the datum cell positions for sampling the rest of the data. The maximum packet length is limited by the drift that occurs between the 2 clock sources. Now multiply the sending and receiving circuitry by the number of bits in a parallel word. Note that there is only 1 on-board clock source, but N (where N equals the number of bits/word) independently settable delays in the receiver. After the individual bits are captured in the receiver, additional stages of logic are used to synchronize the bits into a parallel word. Clearly this is not a protocol to implement in discrete logic, and silicon companies are hard at work building the parts necessary to run this protocol. The Futurebus+ spec requires a minimum clock frequency of 60 MHz, which translates to 60 Mtransfers/sec. We expect the first silicon to do better than this. The bandwidth utilization efficiency of this protocol varies greatly based on the packet length, from 50% for a 2 word packet to 97% for a 64 word packet. It is possible to sustain the 97% efficiency over transfers that are much longer than 64 words by using multiple packet mode. This protocol allows packets to be chained together back-to-back with no lost clocks; as long as a single source is transmitting all the packets. While a packet is being transmitted, the command, status and compelled handshake signals are used to request new packets and acknowledge new packets, including their cache attributes. The requesting process can occur asynchronously with respect to the packet currently being transmitted and also out of phase. By this I mean requests can be either in lock step with their packet transfer, or 1 or more packets ahead. 
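The efficiency figures quoted above (50% for a 2 word packet, 97% for a 64 word packet) can be reproduced with a very simple model. The sketch below assumes that each packet spends two overhead cells per line, one for the sync bit and one for the parity bit. That overhead count is an inference that happens to match the quoted numbers, not something the post states, and `packet_efficiency` is an illustrative name.

```python
# Assumption (not stated in the post): two overhead cells per packet,
# the leading sync bit and the trailing parity bit.
OVERHEAD_CELLS = 2
LEGAL_BLOCK_LENGTHS = (2, 4, 8, 16, 32, 64)  # words per packet, from the post

def packet_efficiency(words, overhead=OVERHEAD_CELLS):
    """Fraction of the cells in one packet that carry data."""
    return words / (words + overhead)

for words in LEGAL_BLOCK_LENGTHS:
    print(f"{words:2d} words: {packet_efficiency(words):.1%}")
```

Under this model the intermediate block lengths fall between the two quoted endpoints, and chaining 64 word packets back-to-back holds the efficiency at the 64 word figure.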
Cache coherence is maintained during multiple packet mode and intervention is also supported. During a single transaction there can be multiple packet sources due to intervention. When a packet source change is made, at least 1 clock is lost in the change-over. A good example of multiple packet sources during a single transaction would be flushing a dirty page back to a disk subsystem that has dirty lines in several different caches. This protocol allows a single transaction to remove the page from memory and the caches, and invalidate the caches. John Theus johnt@opus.wv.tek.com Futurebus+ Parallel Protocol Coordinator Tektronix, Inc. Interactive Technologies Div. - shipping the Futurebus-based XD88 workstations
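The per-line framing described in this post (sync bit, NRZI data, even longitudinal parity, line left at logic zero) can be sketched for a single bus line as follows. This is only an illustration under stated assumptions: the line idles at 0, cells are sampled perfectly, and the parity bit is chosen so that the total number of ones including the sync bit is even, which is what returns the line to zero. The function names are hypothetical and do not come from the Futurebus+ spec.

```python
def encode_line(data_bits):
    """Return the line level in each cell for one bus line.

    Cells are: sync, data..., parity. The line is assumed to idle at 0.
    NRZI: a 1 toggles the level, a 0 leaves it unchanged.
    """
    sync = 1
    # Parity makes the count of 1s (sync + data + parity) even, so the
    # number of transitions is even and the line ends back at 0.
    parity = (sync + sum(data_bits)) % 2
    level = 0
    levels = []
    for bit in [sync, *data_bits, parity]:
        level ^= bit
        levels.append(level)
    return levels

def decode_line(levels):
    """Recover the data bits from sampled levels; raise on a framing error."""
    previous = 0  # idle level before the sync edge
    bits = []
    for level in levels:
        bits.append(level ^ previous)  # a change of level decodes as 1
        previous = level
    if bits[0] != 1:
        raise ValueError("missing sync bit")
    if levels[-1] != 0:
        raise ValueError("parity error: line did not return to 0")
    return bits[1:-1]

word = [0, 0, 1, 1, 1, 0]
levels = encode_line(word)
print(levels)
print(decode_line(levels))
```

A real receiver would run one such decoder per bit of the parallel word, each with its own sampling delay, and then realign the captured bits into words.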
filbo@gorn.santa-cruz.ca.us (Bela Lubkin) (01/08/90)
In article <280@leia.WV.TEK.COM> John Theus writes:
>The transfer protocol is very similar to the asynchronous protocol used on
>RS-232. If we just think about an individual bit for now, the sender
>transmits its data at the frequency of an on-board clock. As with RS-232,
>the frequency must be known by both the transmitter and the receiver in
>advance. The Futurebus+ protocols provide a mechanism for selecting one
>of two such frequencies on a transaction by transaction basis.
>  [...]
>The receiver has its own on-board clock that runs at the same frequency as
>the sender. Both sender and receiver must have clock frequency tolerances
>of 0.01% or better. When the receiver sees the sync bit at the start of a
>packet, its logic sets a precision delay equal to the phase difference
>between the sync bit and its on-board clock. Thereafter, the logic uses
>the on-board clock plus the delay to define the datum cell positions for
>sampling the rest of the data. The maximum packet length is limited by
>the drift that occurs between the 2 clock sources.

Why isn't one more line used to transmit the sender's idea of the data clock?

    clock  +++---+++---+++---+++  "111111"
    data0  +++++++++---+++------  "001110"
    data1  +++------+++------+++  "101101"
      :
    dataN  ++++++------------+++  "010001"

The receiver could still choose to ignore the clock line and use the above method. It could also use the above method, but dynamically adjust the delay, tracking the guaranteed transition of the clock line at the start of each cell. This would eliminate the requirement for closely matched clock frequencies and would seem to provide much better reliability.

I'm not a bus designer; not even a hardware person. Maybe I'm missing something really obvious. If so, how about explaining it instead of flaming me to toast?
;-} Bela Lubkin * * // filbo@gorn.santa-cruz.ca.us CI$: 73047,1112 (slow) @ * * // belal@sco.com ..ucbvax!ucscc!{gorn!filbo,sco!belal} R Pentomino * \X/ Filbo @ Pyrzqxgl +408-476-4633 and XBBS +408-476-4945
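The waveforms in Bela's diagram follow the NRZI convention from the earlier post: a change of level between neighbouring cells reads as a 1, and no change reads as a 0. The sketch below decodes the four traces to check them against their labels. It assumes each cell is drawn three characters wide and that the first cell only sets the starting level; `decode_waveform` is an illustrative helper, not part of any real tool.

```python
CHARS_PER_CELL = 3  # assumed from the diagram: each cell is 3 characters wide

def decode_waveform(trace):
    """Turn a '+'/'-' trace into an NRZI bit string (1 = level changed)."""
    cells = [trace[i] for i in range(0, len(trace), CHARS_PER_CELL)]
    return "".join("1" if a != b else "0" for a, b in zip(cells, cells[1:]))

diagram = {
    "clock": "+++---+++---+++---+++",
    "data0": "+++++++++---+++------",
    "data1": "+++------+++------+++",
    "dataN": "++++++------------+++",
}

for name, trace in diagram.items():
    print(name, decode_waveform(trace))
```

The clock trace decodes to all ones because it changes level in every cell, which is the guaranteed transition the proposal relies on.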
johnt@opus.WV.TEK.COM (John Theus) (01/11/90)
In article <136.filbo@gorn.santa-cruz.ca.us> filbo@gorn.santa-cruz.ca.us (Bela Lubkin) writes:
>In article <280@leia.WV.TEK.COM> John Theus writes:
>>The receiver has its own on-board clock that runs at the same frequency as
>>the sender. Both sender and receiver must have clock frequency tolerances
>>of 0.01% or better. When the receiver sees the sync bit at the start of a
>>packet, its logic sets a precision delay equal to the phase difference
>>between the sync bit and its on-board clock. Thereafter, the logic uses
>>the on-board clock plus the delay to define the datum cell positions for
>>sampling the rest of the data. The maximum packet length is limited by
>>the drift that occurs between the 2 clock sources.
>
>Why isn't one more line used to transmit the sender's idea of the data
>clock?
>[...]
>

There are at least 2 major reasons why we don't ship a clock signal with the data. One is a fundamental performance limiter, while the other is related to the data encoding scheme we use. However, we didn't get to where we are today overnight, and in fact a little over a year ago we started out with a separate clock signal when I wrote the first non-compelled protocol proposal.

What we've learned from evaluating transfer protocols is that the fundamental performance limiter is caused by signal skew (assuming a clean electrical environment). Skew is the difference in time between the arrival of two signals from a common source. The major sources of skew are variations in the propagation delay through logic and through the physical environment.

In the Futurebus+ environment, it takes several bus transceiver chips to make a 32 bit wide data path. The limiting factor here is power dissipation. 9 bits is near the limit for present BTL transceivers with normal commercial cooling practices. The skew through these chips is their spec'd maximum propagation delay minus their minimum propagation delay. The best BTL transceivers available today have a skew of 5 nsec.
So just accounting for getting on and off the bus introduces 10 nsec of skew, which is all lost time. In addition, the bus itself introduces skew due mainly to differences in capacitive loading on each line. After including the skews from all the other parts in the logic path, you're left with pretty poor performance. Also notice that there is no difference here based on signal type. The skews exist for both clock to data and data to data.

We identified 2 classes of skew elimination techniques, which I'll call chip localized and bit independent.

The chip localized technique takes advantage of the fact that you can hold skews to a much smaller value on a single chip than across multiple chips. A proposal was made to have a clock signal per transceiver (8 bits + parity + clock), which localizes the skew to what can be done on a single chip. Numbers in the range of 1 nsec. of skew were believed possible. This technique was eventually discarded primarily due to its physical overhead. Although the silicon was very simple for this technique, the cost in power, pins and real estate was judged too high. We agreed that complex silicon was better than a more complex physical environment. Farther down the list was that this technique did not account for bus skew.

The bit independent techniques evolved a little more slowly. The first idea was to use an embedded clock such as one of the run length limited encodings. This idea didn't last long when people started thinking about building a phase locked loop per bit at several times the bit frequency. Eventually, Emil Hahn of Signetics realized that you don't need a clock in any form on the bus and he proposed the scheme that's in the Futurebus+ spec and which I talked about in an earlier posting.

The other point I want to make about transmitting the clock concerns the required bandwidth and signal fidelity.
When I previously talked about our minimum required clock rate of 60 MHz, that's the rate at which data is clocked onto the bus. The bandwidth of the data itself is one-half this frequency. I also previously stated that the limit for our packet protocol is the electrical environment, and somewhere below 10 nsec per word things start to fall apart. Putting these 2 bits of information together says you don't ship a single edge clock with the data, or you have to halve your data bandwidth due to the electrical limitations.

As your example showed, you can use a two edge clock, which we do for our slower compelled protocol. However, at high speeds the variation in a signal's propagation delay between its zero and one levels becomes very significant. This skew within the clock signal, or more precisely its duty cycle precision, becomes a limiting factor. The precision required by the Futurebus+ packet protocol prevents the use of a 2 edge clock. There are several approaches to solving this, including differential clocks and 2 half frequency clocks 180 degrees out of phase, but each has its own set of problems.

One final point: a 0.01% clock oscillator is an industry standard tolerance, and it's not a big deal.

John Theus     johnt@opus.wv.tek.com
Futurebus+ Parallel Protocol Coordinator
Tektronix, Inc. Interactive Technologies Div. - shipping the Futurebus-based XD88 workstations
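Two of the numerical arguments in this post can be put side by side in a small sketch. The first part is a skew budget: it subtracts the quoted transceiver skews from one datum cell at the 60 MHz minimum rate, treating skews as simply additive, which is a simplifying assumption. The second part compares the highest fundamental frequency of the data with that of a single edge and a two edge clock. All names are illustrative.

```python
# Figures quoted in the post; the additive skew model is an assumption.
CELL_NS = 1e9 / 60e6         # one datum cell at the 60 MHz minimum rate
TRANSCEIVER_SKEW_NS = 5.0    # best BTL transceiver: max minus min prop delay

def usable_window_ns(cell_ns, skews_ns):
    """Cell time left after subtracting every skew source in the path."""
    return cell_ns - sum(skews_ns)

# Getting on and off the bus each costs one transceiver skew.
shared_clock_ns = usable_window_ns(CELL_NS, [TRANSCEIVER_SKEW_NS] * 2)
# Clock-per-transceiver proposal: roughly 1 ns of on-chip skew at each end.
clock_per_chip_ns = usable_window_ns(CELL_NS, [1.0, 1.0])

def fundamental_mhz(cell_rate_mhz, max_edges_per_cell):
    """Highest fundamental of a signal with this many edges per cell."""
    return cell_rate_mhz * max_edges_per_cell / 2

data_mhz = fundamental_mhz(60, 1)               # at most one edge per cell
single_edge_clock_mhz = fundamental_mhz(60, 2)  # rise and fall in every cell
two_edge_clock_mhz = fundamental_mhz(60, 1)     # every edge marks a cell

print(f"window with shared clock:   {shared_clock_ns:.2f} ns")
print(f"window with clock per chip: {clock_per_chip_ns:.2f} ns")
print(data_mhz, single_edge_clock_mhz, two_edge_clock_mhz)
```

The single edge clock has to switch twice as fast as the fastest data pattern, which is the sense in which it would cost half the data bandwidth on an electrically limited backplane.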
pauls@apple.com (Paul Sweazey) (01/13/90)
FUTUREBUS EMBEDDED CLOCK: THE REAL SCOOP FROM A FALLEN ANGEL

We each have different views of history, and I have tried to stay out of these discussions, but the Futurebus discussion has led to issues that I used to live and breathe for a living.

In article <285@leia.WV.TEK.COM> johnt@opus.WV.TEK.COM (John Theus) writes:
> However, we didn't get to
> where we are today overnight, and in fact a little over a year ago we started
> out with a separate clock signal when I wrote the first non-compelled
> protocol proposal.

The parallel protocol spec that I wrote, which was based directly on your first non-compelled proposal, is dated 7 July 88.

> A proposal was made to have
> a clock signal per transceiver (8 bits + parity + clock), which localizes
> the skew to what can be done on a single chip.

I believe that this was first seriously and publicly proposed by RV Balakrishnan and Dave Hawley during the summer of 1988.

> The bit independent techniques evolved a little more slowly. The first
> idea was to use an embedded clock such as one of the run length limited
> encodings. This idea didn't last long when people started thinking about
> building a phase locked loop per bit at several times the bit frequency.
> Eventually, Emil Hahn of Signetics realized that you don't need a clock in
> any form on the bus and he proposed the scheme that's in the Futurebus+ spec
> and which I talked about in an earlier posting.

RV Balakrishnan suggested embedded-clock synchronization as the ultimate solution to skew in February 1988.

I devised and proposed embedded-clock synchronization to the SuperBus Study Group (now SCI) in March 1988, privately to Futurebus Committee members in May 1988, and at various times in Futurebus public forums through December 1988.

Emil Hahn devised a feasible implementation of embedded-clock synchronization between November 1988 and January 1989.

A HISTORY/PolySci LESSON:

In the fall of 1987 the Futurebus (IEEE896.1-1987) was just being finished.
I was serving as Coordinator of the Futurebus Cache Coherence Task Group. There was little active interest in speeding it up, but I could see that the real-world performance would not match the idealized theory or the marketing hype, so I started another IEEE project called the SuperBus Study Group.

In February 1988, before SuperBus had become SCI and when it was still assumed to be a bus, I proposed the use of a synchronizer (clock) per transceiver to eliminate interdevice skew. RV Balakrishnan of National Semiconductor (Balu, the inventor of BTL logic) was in attendance, and he said (half in jest) that the only way to do better would be to encode a clock in every bit. Until that day, this alternative had only been mentioned, along with optical fibers and radiation baths, as an unrealistic solution for a parallel bus. Since the stated bandwidth goal of SuperBus was 1 gigabyte per second, I began to pursue embedded clocking seriously.

(SuperBus is now IEEE P1596 Scalable Coherent Interconnect (SCI), chaired by Dave Gustavson of SLAC and co-chaired by Dave James of Apple. It is now a point-to-point interconnect of arbitrary topology, and it REALLY WILL reach 1 gigabyte per second.)

On April 22 I published a memo inside National Semiconductor (I worked there at the time), which I copied to some Futurebus committee members including the Futurebus committee chairman (also then a National employee). In it I described the theory, benefits, and implementation of embedded clock data transmission in an enhanced Futurebus.

One week later I published an expanded report on the subject, entitled "NSC Multiprocessing Performance Roadmap". The report described stages of enhancements to Futurebus that would allow the real-world performance to achieve the marketing hype. In it I estimated that burst rates of 250 to 300 megabytes per second (32 bits wide) would be achievable with the first generation of embedded clock silicon.
While the proposal was accepted as credible and viable within the NSC technical community, it was determined by the Futurebus Committee contingent at National to be heretical--"a threat to all that we have worked for"--because it implied that Futurebus-1987 could not reach those speeds without further enhancement (which, of course, was quite true).

My proposal for embedded-clock transceivers involved the use of precision delay elements and quadrature sampling of each bit stream, which did not require PLL locking to the bit streams.

By the Fall of 1988 I no longer held any committee office, and I was no longer directly involved in Futurebus product planning at work, leaving me free to concentrate on technical issues without regard to politics. I discussed technology freely, including embedded-clock data transfer, with many people, including Emil Hahn of Signetics.

Meanwhile the US Navy began a process of adopting Futurebus, pushing the need for it to become real SOON, and for it to deliver all of its promises.

In the December 1988 Futurebus meeting in San Diego, I gave a presentation offering two proposals: either (1) backward-compatible enhancements to Futurebus-1987 as Theus had proposed, or (2) more aggressive enhancements using either clock-per-chip or embedded-clock techniques. Because of new industry pressure that the Navy created, any changes had to be finalized within 8 weeks, so alternative (1) was chosen. Nevertheless, Hahn of Signetics and Balu of National agreed in that meeting to analyze both techniques and report back at a later meeting.

At the Santa Clara meeting in January 1989 they came back with two different answers, and Signetics won, based on a data recovery method, similar to mine but not the same, that Emil was confident he could implement. I was not involved in the decision making or analysis process; two weeks after the San Diego meeting I went to work for Apple Computer.
Emil's solution involves the use of dynamically settable delay elements, also uses no PLL locking to the bit streams, and may need as little as 1/4 of the FIFO storage of my proposed implementation. So why bring this all up now? I didn't get a patent for my embedded-clocking contributions, or a bonus check, or stock options, or a raise. So I'll settle for glory. Embedded clocking is debatably the breakthrough performance feature of the "last great backplane bus", and I would hope that the gang remembers that I helped get it started. To those of you with radical breakthrough ideas: be persistent but be very patient. To the receivers of those ideas: File, don't trash. There are gems among the gravel. Greeting to Theus, Balu, Hahn, Hawley, Gustavson, James, and the rest. They are the best in the bus business! Paul Sweazey Apple Computer, Inc. pauls@apple.com (408)-974-0253
johnt@opus.WV.TEK.COM (John Theus) (01/16/90)
In article <6149@internal.Apple.COM> pauls@apple.com (Paul Sweazey) writes:
>FUTUREBUS EMBEDDED CLOCK: THE REAL SCOOP FROM A FALLEN ANGEL
>
>We each have different views of history, and I have tried to stay out of
>these discussions, but the Futurebus discussion has led to issues that I
>used to live and breathe for a living.
>
> [...]
>RV Balakrishnan suggested embedded-clock synchronization as the ultimate
>solution to skew in February 1988.
>
>I devised and proposed embedded-clock synchronization to the SuperBus Study
>Group (now SCI) in March 1988, privately to Futurebus Committee members in
>May 1988, and at various times in Futurebus public forums through December
>1988.
>
>Emil Hahn devised a feasible implementation of embedded-clock
>synchronization between November 1988 and January 1989.
>
> [...]
>

This article along with a follow-up phone call to Paul cleared up some confusion I've had about who did what and when. Unfortunately, a lot of the events that Paul related were never published in the Futurebus minutes.

Part of my confusion comes from the term "embedded-clock" and I want to make sure this doesn't mislead anyone else. The Futurebus+ packet mode protocol does not actually use an embedded-clock protocol. At one time Paul said he called it "implied embedded-clock", which I think would have been more accurate. Paul is very good at inventing names and techniques to describe new concepts. Another name he had for this protocol that caught my ear was "packet beaming".

Typically, an embedded-clock protocol has the originating clock encoded into the data stream. A receiver is then capable of extracting the clock from the data following transmission. Good examples are the encoding schemes used for disk drives such as MFM.

The Futurebus+ packet protocol does not encode the clock into the data stream, but instead uses a starting sync bit to synchronize the receiver with the sender.
Both the sender and receiver have previously agreed upon the transmission frequency. John Theus johnt@opus.wv.tek.com Futurebus+ Parallel Protocol Coordinator Tektronix, Inc. Interactive Technologies Div. - shipping the Futurebus-based XD88 workstations
rpw3@rigden.wpd.sgi.com (Robert P. Warnock) (01/16/90)
In article <286@leia.WV.TEK.COM> johnt@opus.WV.TEK.COM (John Theus) writes: +--------------- | Part of my confusion comes from the term "embedded-clock" and I want to | make sure this doesn't mislead anyone else. The Futurebus+ packet mode | protocol does not actually use an embedded-clock protocol... | The Futurebus+ packet protocol does not encode the clock into the data | stream, but instead uses a starting sync bit to synchronize the receiver | with the sender. Both the sender and receiver have previously agreed upon | the transmission frequency. +--------------- Then shouldn't this be called "embedded phase"??? ;-} ;-} For those with a bit of deja vu about now: Yes, RS-232 async works this way, only slower... -Rob p.s. Or for another way to look at it, think of each bus line as having a >60 megabaud "UART" on it, with *big* "bytes". Note that the .01% clock spec means that practically you're limited to about 2000 bits per start bit (assuming 20% skew is acceptable, which it probably is if you have at least 5 clock phases to choose from -- that gives you a total of 40% skew), or about 16K bytes per burst on the bus (64-bit-wide bus). That's probably enough. ;-} ----- Rob Warnock, MS-9U/510 rpw3@sgi.com rpw3@pei.com Silicon Graphics, Inc. (415)335-1673 Protocol Engines, Inc. 2011 N. Shoreline Blvd. Mountain View, CA 94039-7311
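Rob's back-of-the-envelope limit can be reproduced in a few lines. The sketch follows his arithmetic directly: a 0.01% frequency error accumulates until it has used up a 20% slice of one cell, and the resulting cell count is multiplied across a 64-bit bus. Treating 0.01% as the relative error between the two clocks is his simplification; the function name is illustrative.

```python
CLOCK_TOLERANCE = 0.0001   # 0.01% oscillator tolerance, from the thread
DRIFT_BUDGET = 0.20        # fraction of a cell allowed to drift (Rob's 20%)
BUS_WIDTH_BITS = 64

def max_cells_per_sync(tolerance, budget):
    """Cells that can be sampled before accumulated drift uses the budget."""
    return round(budget / tolerance)

cells = max_cells_per_sync(CLOCK_TOLERANCE, DRIFT_BUDGET)
burst_bytes = cells * BUS_WIDTH_BITS // 8

print(cells, "cells per sync bit")
print(burst_bytes, "bytes per burst on a 64-bit bus")
```

If the sender and receiver clocks were each off by 0.01% in opposite directions, the relative error would double and the cell count would halve. Either figure is far above the 64 word maximum packet length mentioned earlier in the thread.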