garyb@iotek.UUCP (Gary Burrell) (07/06/89)
I am presently evaluating the various DSPs available, and I would be interested in any input from netland about the offerings out there. I am particularly interested in any benchmarks that have been performed, and their source if available. Of major interest are FIR filters, IIR filters, and FFTs, but any benchmark data would be welcome.

Please e-mail me (garyb@iotek.uucp) and I will summarize to the net if there is interest. Also, is there any group for DSP-related discussion?

Thanks,
Gary R. Burrell

Gary R. Burrell, Iotek Inc.
1127 Barrington St., Suite 100, Halifax, N.S., B3H 2P8, Canada
E-Mail: garyb@iotek.uucp
Fax: (902)420-0674
Phone: (902)420-1890
Damn it Jim, I'm a Doctor not a Computer Scientist!
garyb@iotek.UUCP (Gary Burrell) (07/13/89)
A while back I posted to the net looking for benchmarks on DSP chips. A number of people replied pointing me to sources of information, and a number asked me to pass on what I received. I did not receive much in the way of actual benchmarks, but I did receive pointers to several sources of information.

There seems to be some interest in DSP-related discussion. Is there enough interest to create a DSP-related group?

Some of the better sources of information I found or was told about are:

IEEE Micro, December 1988

    A special issue on DSP processors, containing detailed articles on the TMS320C30, DSP32C and DSP96002. These are fairly good articles about the various DSP processors. The only part of the issue I really question is the editors' afterword, in which they come up with some amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and DSP96000 (30 MFLOPS). I question these ratings because most supercomputers only get about 1/10 of their peak performance rating on the LINPACK benchmark. The chips may do better than that, but I for one would like to see some real numbers using real systems and real compilers. (Can anyone provide these?) I'll be surprised if these chips get anywhere near these figures on the LINPACK benchmark; I don't doubt that they can obtain these performances and better on multiply/accumulate benchmarks.

EDN, September 29, 1988

    A good article on benchmarking several DSP processors. Unfortunately the code for the benchmarks is not provided. You could order it from EDN up until April 1, 1989, but I missed that deadline. I got a reprint of this article from AT&T.

IEEE ASSP Magazine, October 1988 and January 1989

    A two-part article on the architectures of programmable DSP chips. A good article for those who are interested in the architecture of these processors.

Other articles looked at included:

    IEEE Spectrum, June 1987, on DSPs
    IEEE Spectrum, April 1989, on the i860
    Computer Design, May 1989, on DSPs

and the product information from:

    Intel on the i860
    AT&T on the DSP16/16A and DSP32/32C
    TI on the TMS320C30

I am still waiting for TI, Motorola, Fujitsu, etc. to send me more information. (Hopefully it's in the mail.) As well, Michael Slater is sending me a copy of his newsletter, in which they did an article on some DSP chips.

Thanks to those of you who sent me information; when I finish my report I'll see if I can summarize to the net.

Gary
mash@mips.COM (John Mashey) (07/15/89)
In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
>IEEE Micro December 1988
>
> A Special Issue on DSP processors, contains detailed articles
>on the TMS320C30, DSP32c and DSP96002. These are fairly good articles
>about the various DSP processors. The only part of the issue I really
>question is the editors' afterword in which they come up with some
>amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
>DSP96000 (30 MFLOPS). I question these ratings as most supercomputers
>only get about 1/10 of their peak performance rating on the LINPACK
>benchmark. The chips may do better than that but I for one would like
>to see some real numbers using real systems and real compilers. (Can
>anyone provide these) I'll be surprised if these chips get anywhere
>near these figures on the LINPACK benchmark, I don't doubt that they
>can obtain these performances and better on multiply/accumulate
>benchmarks.

I'm out of town, so I don't have that issue handy.

I conjecture that what they must mean is the inner-loop timing for the standard LINPACK code, but with zero-wait-state memory, i.e., something not particularly buildable. To be fair, it is not unreasonable to quote such numbers, if they are clearly labeled as such, because a chip vendor can't control what they're put into, although it would of course be more meaningful to quote them with a specified memory system. (What is unreasonable, I think, is to quote such numbers and compare them to measured numbers on real systems built with real memory :-)

Moral: there are as many flavors of mflops as there are of mips-ratings; if you compare apples and oranges too often you'll go bananas.

--
-john mashey  DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD:  408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
low@melair.UUCP (Rick Low) (07/23/89)
Sorry if I digress (I missed the original posting), but...

In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
>
> >IEEE Micro December 1988
> >
> > A Special Issue on DSP processors, contains detailed articles
> >on the TMS320C30, DSP32c and DSP96002. These are fairly good articles
> >about the various DSP processors. The only part of the issue I really
> >question is the editors' afterword in which they come up with some
> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
> >DSP96000 (30 MFLOPS). I question these ratings as most supercomputers
> >only get about 1/10 of their peak performance rating on the LINPACK
> >benchmark. The chips may do better than that but I for one would like
> >to see some real numbers using real systems and real compilers. (Can
> >anyone provide these) I'll be surprised if these chips get anywhere
> >near these figures on the LINPACK benchmark

Here are some real numbers. Well, simulated anyway. Not Linpack either.

I did a project for Bob Morris (one of the guest editors for this issue of Micro) in which I studied how to build efficient DFT algorithms for the 320C30. I had a good, long look at the C30 architecture and what it means to DFT algorithms, then I wrote a 1024-point, radix-4, complex, floating-point (obviously), looped (i.e. not inline coded) FFT for this beast.

I simulated this FFT using TI's C30 simulator, assuming zero wait states for external memory. This FFT ran in 2.71 ms for an average of about 17 MFLOPS. The control structure of this FFT (i.e. the non-butterfly code) took 18 percent of the total execution time.

> I conjecture that what they must mean is the inner-loop timing for
> the standard LINPACK code, but with zero-wait-state memory,
> i.e., something not particularly buildable.

TI's C30 User's Guide shows you how to build zero-wait-state memory and gives examples using Cypress CY7C164 25 ns SRAMs and IDT7198 25 ns SRAMs.

In any case, my FFT only needed external memory for storage of some control variables. The rest of the code and data resided in the 4K words of on-chip ROM (code and twiddle factors) and 2K words of on-chip RAM (data). Accesses to all three memory areas were arranged to cause no access-conflict pipeline delays: in effect, zero wait states for all memory accesses (internal and external), even with parallel memory accesses, e.g.

        ADDF3   *AR0,R3,R2
    ||  STF     R0,*+AR1(IR1)

Just more fuel for the fire.

Cheers.

Rick Low
MEL Defence Systems Limited, Ottawa, Canada
+1 613 836 6860
mitel!melair!low@uunet.UU.NET
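
A note on the arithmetic behind the "about 17 MFLOPS" figure: the rating depends on how the floating-point operations are counted, and the post does not say which count was used. The sketch below, in Python, compares two textbook conventions that are assumptions here and not taken from the post: the nominal 5 N log2(N) count for a complex FFT, and a radix-4 count of 34 real operations per butterfly with no trivial twiddle factors skipped.

```python
import math

N = 1024             # FFT length, from the post
T_SECONDS = 2.71e-3  # simulated run time, from the post
QUOTED_MFLOPS = 17   # "about 17 MFLOPS", from the post


def radix2_flops(n: int) -> float:
    """Nominal count often used for complex FFTs: 5 * N * log2(N).

    This convention is an assumption, not something the post states.
    """
    return 5 * n * math.log2(n)


def radix4_flops(n: int) -> float:
    """Assumed radix-4 count: N/4 butterflies per stage, log4(N) stages,
    34 real operations per butterfly (3 complex multiplies at 6 ops each
    plus 8 complex additions at 2 ops each), no trivial twiddles skipped.
    """
    stages = math.log2(n) / 2
    return (n / 4) * stages * 34


def mflops(flops: float, seconds: float) -> float:
    """Convert an operation count and a run time into MFLOPS."""
    return flops / seconds / 1e6


implied_flops = QUOTED_MFLOPS * 1e6 * T_SECONDS
print(f"operations implied by the quoted rating: {implied_flops:.0f}")
print(f"radix-2 convention: {mflops(radix2_flops(N), T_SECONDS):.1f} MFLOPS")
print(f"radix-4 convention: {mflops(radix4_flops(N), T_SECONDS):.1f} MFLOPS")
```

Under these assumptions the quoted figure falls between the two conventions, which is one more reason to state the operation count alongside any MFLOPS number.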
wsmith@mdbs.UUCP (Bill Smith) (07/27/89)
In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
>
> >IEEE Micro December 1988
> >
> > A Special Issue on DSP processors, contains detailed articles
> >on the TMS320C30, DSP32c and DSP96002. These are fairly good articles
> >about the various DSP processors. The only part of the issue I really
> >question is the editors' afterword in which they come up with some
> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
> >DSP96000 (30 MFLOPS). I question these ratings as most supercomputers
> >only get about 1/10 of their peak performance rating on the LINPACK
> >benchmark.
>
> I conjecture that what they must mean is the inner-loop timing for
> the standard LINPACK code, but with zero-wait-state memory,
> i.e., something not particularly buildable. To be fair, it is not
> unreasonable to quote such numbers, if they are clearly labeled as such,
> because a chip vendor can't control what they're put into,
> although it would be more meaningful to quote them with a specified memory
> system of course.
> -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Sorry this is so late a followup.

I haven't seen the article under discussion, nor information about the DSP chips other than the TMS320C30, but that information may invalidate John Mashey's comments. The TMS320C30 has a fairly large amount of program space on chip (something like 4K words) that is zero-wait-state and probably also multi-ported (but I might be getting the register file confused with the memory buffers).

If the inner loops fit within the on-chip memory, or can be paged in and out with the on-chip DMA logic, the TMS320C30 can do some pretty amazing things. The other chips may also have similarly sophisticated architectures, but I'm not sure. (In case you can't tell, I like the TMS320x0 chips.)

Don't incriminate a vendor about flakey benchmark numbers without verifying that the benchmarks are in fact flakey (which I haven't actually done, but...)

Bill Smith
pur-ee!mdbs!wsmith
(A long time ago in a universe far far away, I was wsmith@cs.uiuc.edu, but no longer :-)
garyb@iotek.UUCP (Gary Burrell) (07/27/89)
In article <1451@mdbs.UUCP> wsmith@mdbs.UUCP (Bill Smith) writes:
>In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
>> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
>>
>> >IEEE Micro December 1988
>> >
>> > A Special Issue on DSP processors, contains detailed articles
>> >on the TMS320C30, DSP32c and DSP96002. These are fairly good articles
>> >about the various DSP processors. The only part of the issue I really
>> >question is the editors' afterword in which they come up with some
>> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
>> >DSP96000 (30 MFLOPS). I question these ratings as most supercomputers
>> >only get about 1/10 of their peak performance rating on the LINPACK
>> >benchmark.
>
>> I conjecture that what they must mean is the inner-loop timing for
>> the standard LINPACK code, but with zero-wait-state memory,
>> ..... [continued, but deleted]
>
>Sorry this is so late a followup.
>
>I haven't seen the article under discussion nor information about the
>DSP chips except the TMS320C30, and that information may invalidate John
>Mashey's comments. The TMS320C30 has a fairly large amount of program
>space on chip (something like 4k words) that is zero-wait state and probably
>also multi-ported (but I might be getting the register file confused with
>the memory buffers).
>
>If the inner loops fit within the on-chip memory or can be paged in and out
>with the on-chip DMA logic, the TMS320C30 can do some pretty amazing things.
>The other chips may also have similarly sophisticated architectures, but
>I'm not sure. (In case you can't tell, I like the TMS320x0 chips.)
>
>Don't incriminate a vendor about flakey benchmark numbers without verifying
>that the benchmarks are in fact flakey (which I haven't actually done, but...)
>
>Bill Smith
>pur-ee!mdbs!wsmith (A long time ago in a universe far far away, I was
> wsmith@cs.uiuc.edu, but no longer :-)

Just to clarify things: I wasn't trying to incriminate the vendor about flakey benchmarks. What I was questioning was an article which compared benchmark data (on supercomputers) to estimated Linpack ratings on DSPs. There was no benchmark done (as far as I could tell). This violates one of the cardinal rules of benchmarking: simulated, estimated, and actual benchmark data should not be compared.

I like the TMS320C30, and it is an impressive chip capable of impressive performance, with a peak speed of 33.3 MFLOPS (this is the manufacturer's data). It should be clarified, though, that this performance is only available by using the multiply-and-accumulate cycle, and only while operating out of internal memory. The processor only :-) has a rated speed of 16.7 MIPS, and accesses to external memory can cause a high performance penalty. Because of these facts and others I questioned the validity of the estimated benchmarks. I could be wrong to do this; I have been known to be wrong before. :-(

I would be very interested in seeing actual benchmark data for the Linpack benchmark run on a real TMS320C30 (anyone from TI out there? :-) ), and if it actually gets 20 MFLOPS S.P. Linpack I will be the first one to jump on the bandwagon and shoot down my previous comments.

Gary R. Burrell

Disclaimer: I have nothing to do with TI, AT&T, Motorola, etc., and have nothing to gain by questioning these benchmarks, other than clarifying the facts. I like the TMS320 family and must state that any DSP selection should be judged on how well a particular chip performs for that application, not on how well it performs on the XYZ benchmark.
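
The disagreement above is largely arithmetic on figures quoted in this thread, so a small Python sketch may help. It uses only numbers that appear in the posts (33.3 MFLOPS peak, 16.7 MIPS, the 20 MFLOPS LINPACK estimate, and the "about 1/10 of peak" rule of thumb for supercomputers); the variable names are illustrative only.

```python
PEAK_MFLOPS = 33.3        # manufacturer's peak figure, as quoted in the thread
RATED_MIPS = 16.7         # rated instruction rate, as quoted in the thread
ESTIMATED_LINPACK = 20.0  # estimate attributed to the IEEE Micro afterword
TYPICAL_FRACTION = 0.10   # "about 1/10 of peak" rule of thumb from the thread

# Peak MFLOPS divided by MIPS gives the floating-point operations each
# instruction must deliver to reach peak.
flops_per_instruction = PEAK_MFLOPS / RATED_MIPS

# Fraction of peak that the LINPACK estimate would represent.
claimed_fraction = ESTIMATED_LINPACK / PEAK_MFLOPS

# What the 1/10-of-peak rule of thumb would predict instead.
rule_of_thumb_mflops = PEAK_MFLOPS * TYPICAL_FRACTION

print(f"flops per instruction at peak: {flops_per_instruction:.2f}")
print(f"LINPACK estimate as a fraction of peak: {claimed_fraction:.0%}")
print(f"1/10-of-peak figure: {rule_of_thumb_mflops:.1f} MFLOPS")
```

The first ratio comes out at about two, which matches the point that peak is reached only when every instruction performs both a multiply and an add. The second shows that the estimate corresponds to roughly 60 percent of peak, against the 10 percent rule of thumb quoted for supercomputers.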
jbuck@epimass.EPI.COM (Joe Buck) (08/01/89)
In article <277@melair.UUCP> low@melair.UUCP (Rick Low) writes:
>..., then I wrote a 1024-point, radix-4,
>complex, floating-point (obviously), looped (i.e. not inline coded)
>FFT for this beast. [ the TI TMS320C30 ].
>
>I simulated this FFT using TI's C30 simulator and assuming zero wait
>states for external memory. This FFT ran in 2.71 ms for an average
>of about 17 MFLOPs. The control structure of this FFT -- i.e. non-butterfly
>code -- took 18 percent of the total execution time.

I'm currently working on the real thing, not just the simulator.

Unless you took account of a bug in the C30 simulator, your number is a bit too optimistic: it always takes two cycles to write to external memory, even with zero wait states; the C30 simulator counts it as one. To get the true time, add a cycle for each external memory write cycle.

Yes, zero-wait-state external RAM is buildable, especially since not all of the external RAM has to have the same number of wait states. I'm currently writing code for a board with six C30s on it, with 8K of 0-wait-state external memory for each. It brings one back to the old days of computing, where you count cycles and count memory words.

--
-- Joe Buck  jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck
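
The correction described above (one extra cycle per external memory write) can be sketched in Python as follows. The cycle time is an assumption derived from the 16.7 MIPS rating quoted elsewhere in this thread, taking one instruction per cycle; the write counts in the loop are hypothetical, since the thread does not say how many external writes the simulated FFT performed.

```python
RATED_MIPS = 16.7  # rated instruction rate quoted elsewhere in the thread
# Assumption: one instruction per cycle, so the cycle time is 1 / 16.7 MHz.
CYCLE_SECONDS = 1 / (RATED_MIPS * 1e6)

SIMULATED_SECONDS = 2.71e-3  # simulated FFT time from Rick Low's post


def corrected_time(simulated_seconds: float,
                   external_writes: int,
                   cycle_seconds: float = CYCLE_SECONDS) -> float:
    """Add one cycle for every external memory write the simulator undercounted."""
    return simulated_seconds + external_writes * cycle_seconds


# Hypothetical write counts, for illustration only.
for writes in (0, 100, 1_000, 10_000):
    t = corrected_time(SIMULATED_SECONDS, writes)
    print(f"{writes:>6} external writes: {t * 1e3:.3f} ms")
```

Rick Low's post says his FFT used external memory only for some control variables, so the real correction may be small, but it cannot be computed without the actual write count.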
garyb@iotek.UUCP (Gary Burrell) (08/01/89)
In article <3469@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>In article <277@melair.UUCP> low@melair.UUCP (Rick Low) writes:
>>..., then I wrote a 1024-point, radix-4,
>>complex, floating-point (obviously), looped (i.e. not inline coded)
>>FFT for this beast. [ the TI TMS320C30 ].
>>
>>I simulated this FFT using TI's C30 simulator and assuming zero wait
>>states for external memory. This FFT ran in 2.71 ms for an average
>>of about 17 MFLOPs. The control structure of this FFT -- i.e. non-butterfly
>>code -- took 18 percent of the total execution time.
>
>I'm currently working on the real thing, not just the simulator.
>
>Unless you took account of a bug in the C30 simulator, your number is
>a bit too optimistic: it always takes two cycles to write to external
>memory, even with zero wait states; the C30 simulator counts it as one.
>To get the true time, add a cycle for each external memory write cycle.
>

This is one reason why I was questioning the original results in the afterword of the DSP issue of IEEE Micro, Dec 88. They were comparing estimated (not even simulated) data to real-world benchmarks on supercomputers and coming up with some amazing results (est. 20 MFLOPS single-precision Linpack for the TMS320C30).

IMHO one should not compare estimated, simulated and real data, as estimation and simulation often err on the side of optimism.

I repeat my challenge: can anyone show me 20 MFLOPS SP LINPACK on a real TMS320C30 system, or must I continue to be a Doubting Gary about this chip being able to perform that well? I'm not disputing that this is a great DSP chip; what I am saying is that the estimates seem to me to be too optimistic, and I want some "REAL NUMBERS" before I will accept them.

Any TI applications engineers ready to take up the challenge? :-)

Doubting Gary
mash@mips.COM (John Mashey) (08/02/89)
In article <344@venus.iotek.UUCP> garyb@venus.UUCP (Gary Burrell) writes:
>In article <3469@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
....
>>Unless you took account of a bug in the C30 simulator, your number is
>>a bit too optimistic: it always takes two cycles to write to external
>>memory, even with zero wait states; the C30 simulator counts it as one.
>>To get the true time, add a cycle for each external memory write cycle.

> This is one reason why I was questioning the original results
>in the afterword of DSP micro Dec 88. They were comparing estimated
>(not even simulated) data to real world benchmarks on super computers
>and coming up with some amazing results. (est 20 MFLOPS Single Prec.
>Linpack for the TMS320C30).
> IMHO one should not compare estimated, simulated and real data
>as estimation and simulation often err on the side of optimism.

It is often necessary to compare such things, in order to figure out whether something is worth building or not. I do think that it is very important to:

a) Precisely label every such number as measured, simulated, or estimated, and state the memory configuration assumed, i.e., be convincing that something is reasonably buildable.

b) Precisely label what kind of MFLOPs you're talking about. FFTs are not FORTRAN DP 100x100 LINPACK MFLOPs, for example.

Note that to get anything close to the peak rates on LINPACK, you probably need:

a) A vector machine, including a 3-pipe memory system.
OR
b) A scalar machine, with minimal-latency caches big enough to hold the array for LINPACK, the cache pre-loaded with all of the data, and a cache structure that doesn't end up generating more misses and that doesn't conflict with the different array sizes (201, etc.) of which the 100x100 is a subarray.
AND appropriate optimizing compilers.

Few micros are a) or b) .......

--
-john mashey
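
The point about holding the LINPACK array in fast memory can be checked against the memory sizes mentioned earlier in this thread. The Python sketch below is a rough estimate under stated assumptions: one 32-bit word per single-precision matrix element, the "8K" of zero-wait-state external memory taken to mean 8K words, no allowance for leading-dimension padding or the right-hand-side vector, and the usual nominal LINPACK operation count of 2/3 n^3 + 2 n^2.

```python
N = 100                         # LINPACK 100x100 case named in the post
ON_CHIP_RAM_WORDS = 2 * 1024    # "2K words of on-chip RAM" (Rick Low's post)
FAST_EXTERNAL_WORDS = 8 * 1024  # "8K of 0-wait-state external memory"
                                # (Joe Buck's post); assumed to mean words


def linpack_flops(n: int) -> float:
    """Nominal LINPACK operation count: 2/3 * n^3 + 2 * n^2."""
    return 2 * n**3 / 3 + 2 * n**2


# Assumption: one word per single-precision element, matrix only.
matrix_words = N * N

print(f"matrix size: {matrix_words} words")
print(f"fits in on-chip RAM: {matrix_words <= ON_CHIP_RAM_WORDS}")
print(f"fits in 8K fast external memory: {matrix_words <= FAST_EXTERNAL_WORDS}")

# Run time the quoted 20 MFLOPS estimate would imply for this problem size.
ESTIMATED_MFLOPS = 20.0
seconds = linpack_flops(N) / (ESTIMATED_MFLOPS * 1e6)
print(f"nominal operations: {linpack_flops(N):.0f}")
print(f"time at 20 MFLOPS: {seconds * 1e3:.1f} ms")
```

Under these assumptions the matrix alone exceeds both the on-chip RAM and the 8K of fast external memory, so a real run would depend on the speed of the larger external memory, which is exactly the configuration detail the post asks to have labeled.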