[comp.arch] Info on DSP chips

garyb@iotek.UUCP (Gary Burrell) (07/06/89)

	I am presently evaluating the various DSPs available, and I
would be interested in any input from netland about the various
offerings out there.  I am particularly interested in any benchmarks
that have been performed, and their source if available.  Of major
interest are such things as FIR filters, IIR filters, and FFTs, but
any benchmark data would be welcome.

Please e-mail me (garyb@iotek.uucp) and I will summarize to the net
if there is interest.

Also, is there any group for DSP-related discussion?

		Thanks

			Gary R. Burrell

			<<<<<<******>>>>>>
Gary R. Burrell, Iotek Inc,     |*| E-Mail: garyb@iotek.uucp	|*| 
1127 Barrington St., Suite 100, |*| Fax:    (902)420-0674	|*|   
Halifax, N.S., B3H 2P8, Canada  |*| Phone:  (902)420-1890	|*| 

Damn it, Jim,
  I'm a Doctor, not a Computer Scientist!
              
             *************************************

garyb@iotek.UUCP (Gary Burrell) (07/13/89)

A while back I posted to the net looking for benchmarks on DSP chips.
A number of people replied pointing me to sources of information and a
number asked me to pass on what information I received.  I really did
not receive too much in the way of actual benchmarks, but I did receive
pointers to several sources of information.

There seems to be some interest in DSP-related discussion.  Is there
enough interest to create a DSP-related group?


Some of the better sources of information I found or was told about
are:

IEEE Micro December 1988

	A special issue on DSP processors, containing detailed articles
on the TMS320C30, DSP32c and DSP96002.  These are fairly good articles
about the various DSP processors.  The only part of the issue I really
question is the editors' afterword, in which they come up with some
amazing figures for LINPACK ratings of the TMS320C30 (20 MFLOPS) and
DSP96000 (30 MFLOPS).  I question these ratings, as most supercomputers
only get about 1/10 of their peak performance rating on the LINPACK
benchmark.  The chips may do better than that, but I for one would like
to see some real numbers using real systems and real compilers.  (Can
anyone provide these?)  I'll be surprised if these chips get anywhere
near these figures on the LINPACK benchmark.  I don't doubt that they
can reach this performance, and better, on multiply/accumulate
benchmarks.

EDN September 29 1988

	A good article on benchmarking several DSP processors.
Unfortunately the code for the benchmarks is not provided.  You could
order it from EDN up till April 1, 1989 but I missed that deadline.  I
got a reprint of this article from AT&T.

IEEE ASSP Magazine  October 1988 and January 1989

	A two part article on the architectures of programmable DSP
chips.  A good article for those who are interested in the architecture
of these processors.

Other articles I looked at included:

IEEE Spectrum June 1987 on DSPs

IEEE Spectrum April 1989 on the i860

Computer Design May 1989 on DSPs

and the Product Information from:

	Intel on the i860
	AT&T on the DSP 16/16A and DSP 32/32C
	TI on the TMS320C30

I am still waiting for TI, Motorola, Fujitsu, etc. to send me more
information.  (Hopefully it's in the mail.)


As well, Michael Slater is sending me a copy of his newsletter, which
ran an article on some DSP chips.


Thanks to those of you who sent me information and when I finish my
report I'll see if I can summarize to the net.


				Gary


mash@mips.COM (John Mashey) (07/15/89)

In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:

>IEEE Micro December 1988
>
>	A Special Issue on DSP processors, contains detailed articles
>on the TMS320C30, DSP32c and DSP96002.  These are fairly good articles
>about the various DSP processors.  The only part of the issue I really
>question is the editors' afterword in which they come up with some
>amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
>DSP96000 (30 MFLOPS).  I question these ratings as most supercomputers
>only get about 1/10 of their peak performance rating on the LINPACK
>benchmark.  The chips may do better than that but I for one would like
>to see some real numbers using real systems and real compilers.  (Can
>anyone provide these) I'll be surprised if these chips get anywhere
>near these figures on the LINPACK benchmark, I don't doubt that they
>can obtain these performances and better on multiply/accumulate
>benchmarks.

I'm out of town, so I don't have that issue handy.
I conjecture that what they must mean is the inner-loop timing for
the standard LINPACK code, but with zero-wait-state memory,
i.e., something not particularly buildable.  To be fair, it is not
unreasonable to quote such numbers, if they are clearly labeled as such,
because a chip vendor can't control what they're put into,
although it would be more meaningful to quote them with a specified memory
system of course.
(What is unreasonable, I think, is to quote such numbers, and compare them
to measured numbers on real systems built with real memory :-)

moral: there are as many flavors of mflops as there are of mips-ratings;
if you compare apples and oranges too often you'll go bananas.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

low@melair.UUCP (Rick Low) (07/23/89)

Sorry if I digress (I missed the original posting), but...

In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
> 
> >IEEE Micro December 1988
> >
> >	A Special Issue on DSP processors, contains detailed articles
> >on the TMS320C30, DSP32c and DSP96002.  These are fairly good articles
> >about the various DSP processors.  The only part of the issue I really
> >question is the editors' afterword in which they come up with some
> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
> >DSP96000 (30 MFLOPS).  I question these ratings as most supercomputers
> >only get about 1/10 of their peak performance rating on the LINPACK
> >benchmark.  The chips may do better than that but I for one would like
> >to see some real numbers using real systems and real compilers.  (Can
> >anyone provide these) I'll be surprised if these chips get anywhere
> >near these figures on the LINPACK benchmark

Here are some real numbers.  Well, simulated anyway.  Not Linpack either.

I did a project for Bob Morris (one of the guest editors for this Micro)
in which I studied how to build efficient DFT algorithms for the
320C30.  I had a good, long look at the C30 architecture and what it
means to DFT algorithms, then I wrote a 1024-point, radix-4,
complex, floating-point (obviously), looped (i.e. not inline coded)
FFT for this beast.

I simulated this FFT using TI's C30 simulator and assuming zero wait
states for external memory.  This FFT ran in 2.71 ms for an average
of about 17 MFLOPs.  The control structure of this FFT -- i.e. non-butterfly
code -- took 18 percent of the total execution time.
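
As a rough cross-check of these figures, the arithmetic can be sketched in Python. The post does not say how floating-point operations were counted, so the 5*N*log2(N) convention used below is an assumption (a common nominal count for complex FFTs, not necessarily the one used here), and the function names are illustrative only.

```python
import math

def implied_flops(rate_mflops, time_ms):
    """Operation count implied by a quoted rate and run time."""
    return rate_mflops * 1e6 * time_ms * 1e-3

def nominal_fft_flops(n):
    """Nominal 5*N*log2(N) count for an N-point complex FFT (assumed convention)."""
    return 5 * n * math.log2(n)

def mflops(flops, time_ms):
    """Rate in MFLOPS for a given operation count and run time."""
    return flops / (time_ms * 1e-3) / 1e6

N = 1024
TIME_MS = 2.71            # simulated run time from the post
QUOTED_MFLOPS = 17.0      # "about 17 MFLOPs" from the post
CONTROL_FRACTION = 0.18   # non-butterfly share of run time from the post

print(f"implied operation count : {implied_flops(QUOTED_MFLOPS, TIME_MS):.0f}")
print(f"nominal 5*N*log2(N)     : {nominal_fft_flops(N):.0f}")
print(f"rate under nominal count: {mflops(nominal_fft_flops(N), TIME_MS):.1f} MFLOPS")
print(f"time in butterflies     : {TIME_MS * (1 - CONTROL_FRACTION):.2f} ms")
```

The quoted rate implies roughly 46,000 operations, somewhat fewer than the nominal count of 51,200. That would be consistent with a radix-4 algorithm, which needs fewer multiplies than radix-2, but the exact counting method is not stated in the post.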

> I conjecture that what they must mean is the inner-loop timing for
> the standard LINPACK code, but with zero-wait-state memory,
> i.e., something not particularly buildable.

TI's C30 User's Guide shows you how to build zero wait-state
memory and gives examples using Cypress CY7C164 25 ns SRAMs and
IDT7198 25 ns SRAMs.  In any case, my FFT only needed external
memory for storage of some control variables.  The rest of the
code and data resided in the 4K words of on-chip ROM (code and
twiddle factors) and 2K words of on-chip RAM (data).  Accesses
to all three memory areas were done in a way to cause no
access conflict pipeline delays -- in effect zero wait states
for all memory accesses (internal and external), even with parallel
memory accesses, e.g. ADDF3 *AR0,R3,R2 || STF R0,*+AR1(IR1).
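
The memory budget described above can be checked with a few lines of Python. This is only a sketch: it assumes one memory word per real or imaginary part and an in-place transform, neither of which is stated in the post, and the names are made up for illustration.

```python
WORDS_PER_COMPLEX = 2          # one word each for real and imaginary parts (assumption)
ON_CHIP_RAM_WORDS = 2 * 1024   # "2K words of on-chip RAM" from the post
ON_CHIP_ROM_WORDS = 4 * 1024   # "4K words of on-chip ROM" from the post

def data_words(n_points):
    """Words needed to hold n_points complex samples in place."""
    return n_points * WORDS_PER_COMPLEX

def fits_in_ram(n_points):
    """True if the data set fits entirely in on-chip RAM."""
    return data_words(n_points) <= ON_CHIP_RAM_WORDS

print(data_words(1024), fits_in_ram(1024))   # a 1024-point transform
print(data_words(2048), fits_in_ram(2048))   # the next size up
```

Under these assumptions a 1024-point complex data set needs 2048 words, which exactly fills the on-chip RAM, so a larger transform would have to keep some data in external memory.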

Just more fuel for the fire.  Cheers.
 __   __   _____   _
|  \ /  | |_____| | |
|   V   |  _____  | |       Rick Low
| |\_/| | |_____| | |       MEL Defence Systems Limited, Ottawa, Canada
| |   | |  _____  | |___    +1 613 836 6860
|_|   |_| |_____| |_____|   mitel!melair!low@uunet.UU.NET

wsmith@mdbs.UUCP (Bill Smith) (07/27/89)

In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
> 
> >IEEE Micro December 1988
> >
> >	A Special Issue on DSP processors, contains detailed articles
> >on the TMS320C30, DSP32c and DSP96002.  These are fairly good articles
> >about the various DSP processors.  The only part of the issue I really
> >question is the editors' afterword in which they come up with some
> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
> >DSP96000 (30 MFLOPS).  I question these ratings as most supercomputers
> >only get about 1/10 of their peak performance rating on the LINPACK
> >benchmark.  

> I conjecture that what they must mean is the inner-loop timing for
> the standard LINPACK code, but with zero-wait-state memory,
> i.e., something not particularly buildable.  To be fair, it is not
> unreasonable to quote such numbers, if they are clearly labeled as such,
> because a chip vendor can't control what they're put into,
> although it would be more meaningful to quote them with a specified memory
> system of course.
> -john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Sorry this is so late a followup.  

I haven't seen the article under discussion, nor information about the
DSP chips other than the TMS320C30, but that information may invalidate John
Mashey's comments.   The TMS320C30 has a fairly large amount of program 
space on chip (something like 4k words) that is zero-wait state and probably
also multi-ported (but I might be getting the register file confused with
the memory buffers). 

If the inner loops fit within the on-chip memory or can be paged in and out
with the on-chip DMA logic, the TMS320C30 can do some pretty amazing things.
The other chips may also have similarly sophisticated architectures, but
I'm not sure.  (In case you can't tell, I like the TMS320x0 chips.)

Don't incriminate a vendor about flakey benchmark numbers without verifying 
that the benchmarks are in fact flakey (which I haven't actually done, but...)

Bill Smith
pur-ee!mdbs!wsmith  (A long time ago in a universe far far away, I was 
			wsmith@cs.uiuc.edu, but no longer :-)

garyb@iotek.UUCP (Gary Burrell) (07/27/89)

In article <1451@mdbs.UUCP> wsmith@mdbs.UUCP (Bill Smith) writes:
>In article <23379@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
>> In article <337@venus.iotek.UUCP> garyb@iotek.UUCP (Gary R. Burrell) writes:
>> 
>> >IEEE Micro December 1988
>> >
>> >	A Special Issue on DSP processors, contains detailed articles
>> >on the TMS320C30, DSP32c and DSP96002.  These are fairly good articles
>> >about the various DSP processors.  The only part of the issue I really
>> >question is the editors' afterword in which they come up with some
>> >amazing figures for Linpack ratings of the TMS320C30 (20 MFLOPS) and
>> >DSP96000 (30 MFLOPS).  I question these ratings as most supercomputers
>> >only get about 1/10 of their peak performance rating on the LINPACK
>> >benchmark.  
>
>> I conjecture that what they must mean is the inner-loop timing for
>> the standard LINPACK code, but with zero-wait-state memory,
>> ..... Continued but deleted


>Sorry this is so late a followup.  
>
>I haven't seen the article under discussion nor information about the
>DSP chips except the TMS320C30, and that information may invalidate John
>Mashey's comments.   The TMS320C30 has a fairly large amount of program 
>space on chip (something like 4k words) that is zero-wait state and probably
>also multi-ported (but I might be getting the register file confused with
>the memory buffers). 
>
>If the inner loops fit within the on-chip memory or can be paged in and out
>with the on-chip DMA logic, the TMS320C30 can do some pretty amazing things.
>The other chips may also have similarly sophisticated architectures, but
>I'm not sure.  (In case you can't tell, I like the TMS320x0 chips.)
>
>Don't incriminate a vendor about flakey benchmark numbers without verifying 
>that the benchmarks are in fact flakey (which I haven't actually done, but...)
>
>Bill Smith
>pur-ee!mdbs!wsmith  (A long time ago in a universe far far away, I was 
>			wsmith@cs.uiuc.edu, but no longer :-)

Just to clarify things, I wasn't trying to incriminate the vendor about
flakey benchmarks.  What I was questioning was an article that
compared benchmark data (on supercomputers) to estimated LINPACK
ratings on DSPs.  There was no benchmark done (as far as I could
tell).  This violates one of the cardinal rules of benchmarking:
simulated, estimated, and actual benchmark data should not be
compared.


	I like the TMS320C30.  It is an impressive chip capable of
impressive performance, with a peak speed of 33.3 MFLOPS (this is the
manufacturer's data), but it should be clarified that this
performance is only available by using the multiply-and-accumulate
cycle, and only while operating out of internal memory.  The processor
only :-) has a rated speed of 16.7 MIPS, and
accesses to external memory can cause a high performance penalty.
Because of these facts and others, I questioned the validity of the
estimated benchmarks.  I could be wrong to do this; I have been known
to be wrong before. :-(  I would be very interested in seeing actual
benchmark data for the LINPACK benchmark run on a real TMS320C30 (anyone
from TI out there? :-) ), and if it actually gets 20 MFLOPS S.P.
LINPACK I will be the first one to jump on the bandwagon and shoot
down my previous comments.
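
The relationship between the figures above can be sketched as follows. The only inputs are numbers quoted in this thread; the assumption that a multiply-and-accumulate instruction counts as two floating-point operations is mine, and the function names are illustrative.

```python
RATED_MIPS = 16.7        # rated instruction rate quoted above
FLOPS_PER_MAC = 2        # one multiply plus one add per instruction (assumption)
QUOTED_PEAK = 33.3       # manufacturer's peak MFLOPS quoted above
ESTIMATED_LINPACK = 20.0 # estimated LINPACK MFLOPS from the afterword

def peak_mflops(mips, flops_per_instruction):
    """Peak rate if every instruction is a multiply-and-accumulate."""
    return mips * flops_per_instruction

def fraction_of_peak(rate, peak):
    """Share of peak that a given sustained rate represents."""
    return rate / peak

print(f"peak from MIPS rating : {peak_mflops(RATED_MIPS, FLOPS_PER_MAC):.1f} MFLOPS")
print(f"estimate / quoted peak: {fraction_of_peak(ESTIMATED_LINPACK, QUOTED_PEAK):.0%}")
```

On these numbers the estimated LINPACK rating is about 60 percent of peak, compared with the roughly 1/10 of peak cited earlier in the thread for supercomputers.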

			Gary R. Burrell

Disclaimer:  I have nothing to do with TI, AT&T, Motorola, etc., and have
nothing to gain by questioning these benchmarks, other than clarifying
the facts.  I like the TMS320 family and must state that any DSP selection
should be judged on how well a particular chip performs for that
application, not how well it performs on the XYZ benchmark.


jbuck@epimass.EPI.COM (Joe Buck) (08/01/89)

In article <277@melair.UUCP> low@melair.UUCP (Rick Low) writes:
>..., then I wrote a 1024-point, radix-4,
>complex, floating-point (obviously), looped (i.e. not inline coded)
>FFT for this beast. [ the TI TMS320C30 ].
>
>I simulated this FFT using TI's C30 simulator and assuming zero wait
>states for external memory.  This FFT ran in 2.71 ms for an average
>of about 17 MFLOPs.  The control structure of this FFT -- i.e. non-butterfly
>code -- took 18 percent of the total execution time.

I'm currently working on the real thing, not just the simulator.

Unless you took account of a bug in the C30 simulator, your number is
a bit too optimistic: it always takes two cycles to write to external
memory, even with zero wait states; the C30 simulator counts it as one.
To get the true time, add a cycle for each external memory write cycle.
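
A minimal sketch of that correction in Python, for anyone adjusting their own simulator output. The 60 ns cycle time is an assumption derived from the 16.7 MIPS rating quoted elsewhere in this thread, and the write count in the example is made up, since the thread does not give one.

```python
CYCLE_NS = 60.0  # assumed cycle time, from the 16.7 MIPS rating quoted in the thread

def corrected_cycles(simulated_cycles, external_writes):
    """Add one extra cycle per external memory write, per the correction above."""
    return simulated_cycles + external_writes

def cycles_to_ms(cycles, cycle_ns=CYCLE_NS):
    """Convert a cycle count to milliseconds."""
    return cycles * cycle_ns * 1e-6

# 2.71 ms of simulated time is about 45,167 cycles at 60 ns per cycle.
simulated = round(2.71e-3 / (CYCLE_NS * 1e-9))
hypothetical_writes = 500  # illustrative only; not a figure from the thread

fixed = corrected_cycles(simulated, hypothetical_writes)
print(simulated, fixed, f"{cycles_to_ms(fixed):.3f} ms")
```

The size of the correction depends entirely on how many external writes the code performs, so a program that keeps nearly all of its data on chip would see only a small change.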

Yes, zero wait state external RAM is buildable, especially since all the
external RAM doesn't have to have the same number of wait states.  I'm
currently writing code for a board with six C30's on it, with 8K of
0-wait-state external memory for each.  It brings one back to the old
days of computing, where you count cycles and count memory words.

-- 
-- Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck

garyb@iotek.UUCP (Gary Burrell) (08/01/89)

In article <3469@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>In article <277@melair.UUCP> low@melair.UUCP (Rick Low) writes:
>>..., then I wrote a 1024-point, radix-4,
>>complex, floating-point (obviously), looped (i.e. not inline coded)
>>FFT for this beast. [ the TI TMS320C30 ].
>>
>>I simulated this FFT using TI's C30 simulator and assuming zero wait
>>states for external memory.  This FFT ran in 2.71 ms for an average
>>of about 17 MFLOPs.  The control structure of this FFT -- i.e. non-butterfly
>>code -- took 18 percent of the total execution time.
>
>I'm currently working on the real thing, not just the simulator.
>
>Unless you took account of a bug in the C30 simulator, your number is
>a bit too optimistic: it always takes two cycles to write to external
>memory, even with zero wait states; the C30 simulator counts it as one.
>To get the true time, add a cycle for each external memory write cycle.
>

	This is one reason why I was questioning the original results
in the afterword of the DSP issue of IEEE Micro, Dec 88.  They were
comparing estimated (not even simulated) data to real-world benchmarks
on supercomputers and coming up with some amazing results (est. 20
MFLOPS single-precision LINPACK for the TMS320C30).

	IMHO one should not compare estimated, simulated and real data
as estimation and simulation often err on the side of optimism.

	I repeat my challenge: can anyone show me 20 MFLOPS SP LINPACK
on a real TMS320C30 system, or must I continue to be a Doubting
Gary about this chip being able to perform that well?

	I'm not disputing that this is a great DSP chip, but what I am
saying is that the estimates seem to me to be too optimistic, and I
want some "REAL NUMBERS" before I will accept them.  Any TI
applications engineers ready to take up the challenge? :-)

		
		Doubting

			  Gary


mash@mips.COM (John Mashey) (08/02/89)

In article <344@venus.iotek.UUCP> garyb@venus.UUCP (Gary Burrell) writes:
>In article <3469@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
....
>>Unless you took account of a bug in the C30 simulator, your number is
>>a bit too optimistic: it always takes two cycles to write to external
>>memory, even with zero wait states; the C30 simulator counts it as one.
>>To get the true time, add a cycle for each external memory write cycle.

>	This is one reason why I was questioning the original results
>in the afterword of DSP micro Dec 88.  They were comparing estimated
>(not even simulated) data to real world benchmarks on super computers
>and coming up with some amazing results.  (est 20 MFLOPS Single Prec.
>Linpack for the TMS320C30).

>	IMHO one should not compare estimated, simulated and real data
>as estimation and simulation often err on the side of optimism.

It is often necessary to compare such things, in order to figure out
whether something is worth building or not.  I do think that it is very
important to:
	a) Precisely label every such number as measured, simulated, or
	estimated, and, if simulated or estimated, state the assumed memory
	configuration, i.e., be convincing that the system is reasonably buildable.
	b) Precisely label what kind of MFLOPs you're talking about.
	FFTs are not FORTRAN DP 100x100 LINPACK MFLOPs, for example.
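
One way to make that labeling concrete is to carry the label along with every number. The record below is only a sketch; the type and field names are hypothetical, and the two example entries reuse figures quoted earlier in this thread.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    MEASURED = "measured"
    SIMULATED = "simulated"
    ESTIMATED = "estimated"

@dataclass(frozen=True)
class BenchmarkFigure:
    chip: str
    mflops: float
    provenance: Provenance
    workload: str        # which kind of MFLOPS this is
    memory_config: str   # what memory system the number assumes

def same_footing(a, b):
    """True only if two figures share both provenance and workload."""
    return a.provenance is b.provenance and a.workload == b.workload

def label(fig):
    """Render a figure with its full label attached."""
    return (f"{fig.chip}: {fig.mflops:g} MFLOPS "
            f"({fig.provenance.value}, {fig.workload}, {fig.memory_config})")

fft = BenchmarkFigure("TMS320C30", 17.0, Provenance.SIMULATED,
                      "1024-point radix-4 complex FFT",
                      "on-chip plus zero-wait-state external")
linpack = BenchmarkFigure("TMS320C30", 20.0, Provenance.ESTIMATED,
                          "SP LINPACK", "not stated")

print(label(fft))
print(label(linpack))
print(same_footing(fft, linpack))
```

Figures that are not on the same footing can still be put side by side when deciding whether something is worth building, but the label makes the mismatch visible.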
Note that to get anything close to the peak rates on LINPACK,
you probably:
	a) Have a vector machine, including a 3-pipe memory system.
OR
	b) A scalar machine, with minimal-latency caches big enough
	to hold the array for LINPACK, and the cache pre-loaded
	with all of the data, and a cache structure that doesn't
	end up generating more misses, and that doesn't conflict
	with the different array sizes (201, etc) of which the 100x100
	is a subarray.

AND
	appropriate optimizing compilers

Few micros are a) or b) .......
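
To illustrate why the memory system matters so much here, the sketch below counts the work in an AXPY loop (y = y + a*x), which is the kernel that dominates LINPACK run time. The counting helper is hypothetical; the operation-count formula is the standard LINPACK convention of 2n^3/3 + 2n^2.

```python
def axpy_counted(a, x, y):
    """Compute y[i] += a * x[i] in place, counting flops and memory operations."""
    flops = loads = stores = 0
    for i in range(len(x)):
        xi = x[i]
        yi = y[i]
        loads += 2            # one load from x, one from y
        y[i] = yi + a * xi
        flops += 2            # one multiply, one add
        stores += 1           # one store back to y
    return flops, loads, stores

def linpack_flops(n):
    """Standard LINPACK operation count for an n-by-n system."""
    return 2 * n**3 / 3 + 2 * n**2

y = [1.0, 1.0]
print(axpy_counted(2.0, [1.0, 2.0], y), y)
print(f"100x100 LINPACK: {linpack_flops(100):.0f} flops")
print(f"time at 20 MFLOPS: {linpack_flops(100) / 20e6 * 1e3:.1f} ms")
```

Each element costs two loads and one store for two flops, so sustaining one multiply-add per cycle needs three memory streams per cycle, which is where the 3-pipe memory system comes from.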