[comp.arch] SRAM vs. DRAM, 33MHz 386 UNIX-PC

cliffhanger@cup.portal.com (Cliff C Heyer) (09/07/89)

(The HARDWARE Q's are intermixed below.)

I'm planning to buy an 80386 PC for use with UNIX, MSDOS, OS/2,
and WINDOWS/386. After studying the trade papers and marketing
literature, I've made the following conclusions: (feel free to
comment)

1. Price: 33MHz hardware is in about the same ballpark as 25MHz hardware.

2. 33MHz hardware not yet reviewed in key areas of: 
bus speed, paged/interleaved memory, shadow 
(BIOS/video) RAM, disk cache (memory or controller), 
extended memory speed, wait states.

3. 100% of 33MHz hardware gives 10-20% better MIPS than 25MHz.

4. 33MHz hardware disk I/O only 0-5% better than 25MHz. In
other words, it might as well be the same.

5. 80386 portables are about the same price as 33MHz 
desktop hardware but are 50% slower in CPU and 70% slower in 
I/O.


QUESTIONS about 33MHz 80386 PCs...
(This is where I need the help!)

1. UNIX (or any multitasking OS) and the effects of 
the on-board cache:

 	While multitasking, does flushing the cache waste a 
measurable amount of run time or is it 
insignificant compared to swapping, paging, and/or 
other overhead? In other words, is the cache still 
beneficial even though it is being flushed? (I 
assume "yes" since minicomputers such as all VAX 
models have them.)

2. Is memory technology (cost/speed) lagging behind 
microprocessor technology? All the newest 33MHz 
80386 PCs are using 70+ ns DRAMs when the 386 is 
running at 30 ns and the on-board caches are rated 
at 25 ns. You can't get 0 wait states 100% of the 
time with this approach.

3. Is it impractical (cost and/or size) to put 40 256KB
25ns SRAMs (no refresh overhead and cycle 
time=access time) up for main memory? In other 
words, is it cheaper to implement paged (PMRAM, 
SCRAM) or interleaved schemes to reduce wait 
states rather than use 40 SRAMs? It seems like
a lot of trouble to go to...but how much do SRAMs cost?

4. Are any board makers making (or have made) 
motherboards with ESDI and/or SCSI interfaces ON 
BOARD to bypass the 8MHz AT bus? Also, hopefully 
this mfg. would include shadow RAM (BIOS & video) 
and extended/expanded memory that is as fast as 
main memory (e.g. add-on memory boards have the same 
cycle time as the first 2MB).

5. I assume the ONLY thing that makes the 33MHz PCs 
faster is the 25 ns cache. Otherwise, with 70 ns 
DRAM the BEST you could do would be to run as fast as 
a 16MHz 80386 PC (62 ns) but with lots of wait 
states. In other words, memory cycle time limits 
non-cache CPU performance to that of a 16MHz 80386.

6. If you whipped out your trusty soldering gun and 
anti-static gear and changed all your memory chips 
to 25 ns SRAMS(on a 33MHz machine w/no cache) would 
the wait states go away? OR is the timing part of the 
hardware architecture? (Pardon me for this seemingly
naive question, but I've been doing software for 8
years.)

7. The PC manufacturers never talk about parity error 
checked memory, ECC memory, Harvard A.(separate 
data/instruction cache), data write-thru cache, 
write buffers (CPU can go on after issuing write 
instruction only), and multi-word memory 
transfers. Are PC mfgs behind the times? Or is this
because of all the canned stuff from the east.

8. Is there ANY manufacturer who has fully exploited 
the power of the 80386 chip? That is, at 33MHz is 
there any hardware that...

   >can support sustained disk I/O >1MB/sec by
     bypassing the AT bus via on-board controllers, or using VME, etc.,

   >has "real" 100% zero wait state memory (probably SRAM), AND 
    expanded/extended memory boards (no wait states 100% of the time),

   >(for PCs) has shadow RAM (BIOS & video),

   >gives you several "real" 32-bit "backplane" slots 
    and controllers for them (Intel or Zeos?),

   >operates FCC class B.

Please POST your comments.

daveh@cbmvax.UUCP (Dave Haynie) (09/07/89)

in article <21936@cup.portal.com>, cliffhanger@cup.portal.com (Cliff C Heyer) says:

> 1. UNIX (or any multitasking OS) and the effects of 
> the on-board cache:

>  	While multitasking, does flushing the cache waste a measurable amount 
> of run time or is it insignificant compared to swapping, paging, and/or 
> other overhead? 

Generally, UNIX systems do cache flushes when they enter kernel space, and they
MUST between user tasks, since the address spaces are aliased.  Caching still
helps under UNIX, maybe even considerably depending on the main memory design,
but you won't get the maximum possible performance out of a system under UNIX.
Check out MIPS magazine; they tend to benchmark '386 systems under a variety
of UNIX derivatives and raw (program loaded in '386 native mode from MS-DOS);
the raw case always performs better.  Of course, chances are, a benchmarking
program will likely fit entirely in a 32k or 64k cache.  You have to ask 
yourself if a little better raw speed is enough to drive you from UNIX to 
MS-DOS :-).

Another issue may be disk drive speed.  On the 68030 systems I work on, I've
seen hard disk speed drop by a factor of 2-3 when going from our native OS to
UNIX.  I'm not sure any '386 PCs have fast enough disk hardware for this to 
make a difference, but it could.

> 2. Is memory technology (cost/speed ) lagging behind microprocessor 
> technology? 

Undoubtedly.  Try running a 50MHz '030 (real, sitting next to me on my desk
here) without wait states on any memory.

> 3. Is it impractical (cost and/or size) to put 40 256KB 25ns SRAMs (no 
> refresh overhead and cycle time=access time) up for main memory? 

Yes.  First, it'll cost a big bundle.  Second, I don't think you really want
to run UNIX with only 1 megabyte of real memory; I've found it's really not
what I'd call comfortably fast without about 4 megabytes, at least on our
68030 systems.  

> In other words, is it cheaper to implement paged (PMRAM, SCRAM) or 
> interleaved schemes to reduce wait states rather than use 40 SRAMs? 

By far.  

> It seems like alot of trouble to go to...but how much do SRAMs cost?

I think they're somewhere on the order of 5x the cost for the same amount
of memory, and they also take up at least twice the board space.  Also,
SRAM densities are a generation or so behind DRAM densities (256k x 4
SRAMs are almost as easy to come by today as 1 Meg x 4 DRAMs).

> 4. Are any board makers making (or have made) motherboards with ESDI and/or
> SCSI interfaces ON BOARD to bypass the 8MHz AT bus? 

Even that's not enough.  SCSI, for example, is still only 8 bits wide, and its
maximum transfer rate of 4 megs/second is in the same ballpark as the fastest
AT buses.  That's all a secondary issue as long as you're multitasking; the real
concern is how long disk I/O ties up the main bus.  The best thing to do here is
read a reasonable number of bytes over the SCSI bus into a FIFO, then transfer 
that data over your full bus width via DMA (another reason to avoid the AT 
bus).  This is what Amiga hard disk controllers do, and we get up to 900k/sec
through the filesystem using asynchronous SCSI (1.5 megs/sec).  I've heard of
PC systems (ALR I think) that get around 750k/sec, I suspect they do something
clever.

> 6. If you whipped out your trusty soldering gun and anti-static gear and 
> changed all your memory chips to 25 ns SRAMS(on a 33MHz machine w/no cache)
> would the wait states go away? 

Well, first of all, they wouldn't physically fit.  You'd really have to throw
out the whole memory subsystem.  There's undoubtedly some logic that sets the
memory cycle time.  And on top of that, even if the memory did fit, there's
no guarantee that a DRAM bus which cycles in 140ns would still work with SRAM
and 40ns cycle times.

> 7. The PC manufacturers never talk about parity error checked memory, 

Most PCs use parity checked memory, which lets them detect a memory error
(and increases the chance of such an error by 1/9), but doesn't let them do
anything about it.

> Harvard A.(separate data/instruction cache), 

This isn't an advantage as an external cache unless your CPU has external
I and D busses.  The only one I know of that does is the Motorola 88100.  While
68030s have an internal Harvard design with separate I and D caches, the '386
doesn't have any internal caching.  While it's no problem for UNIX, apparently
MS-DOS would have some great trouble with split caches, which is the main
reason given for the unified cache built into the 80486 (though I suspect the
other reason is that Intel might rather sell you a different chip for non-PC
applications, so it wouldn't do to make the 80486 overly fast, would it...).

> Are PC mfgs behind the times? Or is this because of all the canned stuff 
> from the east.

Most of the canned stuff comes from California chip fabs.  And I suspect many
PC makers are behind the times.  The rest of them are struggling with the
architecture.

>    >operates FCC class B.

Whaddya expecting, miracles?!?
-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                    Too much of everything is just enough

wilkes@mips.COM (John Wilkes) (09/08/89)

In article <7851@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
>
  {talking about flushing the cache}

>MUST between user tasks, since the address spaces are aliased.  Caching still
                                
i believe that this is true only for a virtual cache.  a physical cache
does not have the aliasing problem.

>> Harvard A.(separate data/instruction cache), 
>
>This isn't an advantage as an external cache unless your CPU has external
>I and D busses.  The only one I know of that does is the Motorola 88100.  While

not necessarily.  you might want your d cache to have a 128-byte line, and
your i cache to have an eight byte line, for example.  code references and
data references usually have different access patterns.

doesn't the amd 29000 have external i and d busses?  it's been a long time
since i looked at that chip...
-- 
-wilkes

wilkes@mips.com   -OR-   {ames, decwrl, pyramid}!mips!wilkes

davidb@inmos.co.uk (David Boreham) (09/08/89)

cliffhanger@cup.portal.com (Cliff C Heyer) posted some questions about
the speed of 386PC's (too long to repeat here). My response is that the
main thing he is missing is that you don't get performance for nothing.
386PC's and most other computers are designed by highly skilled engineers
who know all the tricks there are to extract the maximum performance for
a given number of dollars. The PC market especially is so competitive that
you can assume that if BYTE tells you that the fastest machine for $nK is
machine y, then most other machines sold for $nK will be within 20% of the
performance of the fastest one. There *are* no hidden easy answers to getting
extra performance---you just need to shell out more cash. Of course the 
technology moves on and new architectures are developed but generally everyone
is using the same technology as each other at any particular point in history.

Now, if MIPS or SUN or IBM could develop a 20ns SRAM which cost the same
as a 140ns cycle-time DRAM then they could wipe the floor with the
competition. However, the memory market (and the market for most other
components) is universal, and IBM or whoever don't have any great advantage
over the independent chip manufacturers.  




David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |      (us): uunet!inmos-c!davidb
+44 454 616616 ex 543        | Internet : @col.hp.com:davidb@inmos-c

diomidis@ecrcvax.UUCP (Diomidis Spinellis) (09/08/89)

In article <21936@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
[...]
>I'm planning to buy a 80386 PC for use with UNIX, MSDOS, OS/2,
>and WINDOWS/386. After studying the trade papers and marketing
>literature, I've made the following conclusions: 

[...]
>3. Is it impractical (cost and/or size) to put 40 256KB
>25ns SRAMs (no refresh overhead and cycle 
>time=access time) up for main memory? In other 
>words, is it cheaper to implement paged (PMRAM, 
>SCRAM) or interleaved schemes to reduce wait 
>states rather than use 40 SRAMs? It seems like
>alot of trouble to go to...but how much do SRAMs cost?

A lot.  I haven't got any prices for 25ns SRAMs handy, but at 100ns
256K of SRAM costs the same as 1MB of DRAM ($24.95 retail).  For 256K
the SRAM part is about 3 times as expensive as the DRAM one ($24.95 vs.
$7.25).  Note that both have the same speed.  Fast memories are VERY
expensive.

[...]
>5. I assume the ONLY thing that makes the 33MHz PCs 
>faster is the 25 ns cache. Otherwise, with 70 ns 
>DRAM the BEST you could do would be to run as fast as 
>a 16MHz 80386 PC (62 ns) but with lots of wait 
>states. In other words, memory cycle time limits 
>non-cache CPU performance to that of a 16MHz 80386.

This is not true.  Typically an instruction has to be fetched, fetch
some operands, do some processing and write some results back.  The
time taken for the processing decreases as the clock frequency
increases.  Instruction prefetching also contributes to a speed-up.
Viewing the microcode as a program and the memory accesses as I/O, your
question can be rephrased as:  ``Is the 80386 microcode execution I/O
(i.e. memory) or CPU (i.e. microCPU) bound?''  Your assertion would be
correct only if the 386 microcode execution was memory bound.  I think
this is not the case for typical instruction sequences.
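
As a back-of-the-envelope illustration of why, here is a toy model in C.
The two-clock zero-wait-state bus cycle is the 386's nominal figure; the
instruction mix (memory accesses and internal clocks per instruction) is
made up, and only meant to show the shape of the trade-off:

	#include <stdio.h>

	/* toy model: ns per "average" instruction = memory clocks plus   */
	/* internal clocks, all scaled by the clock period                */
	static double ns_per_instr(double mhz, int wait_states,
	                           double mem_accesses, double internal_clocks)
	{
	    double clock_ns   = 1000.0 / mhz;
	    double bus_clocks = 2.0 + wait_states;  /* nominal 2-clock bus cycle */

	    return (mem_accesses * bus_clocks + internal_clocks) * clock_ns;
	}

	int main(void)
	{
	    /* 1.5 memory accesses and 4 internal clocks per instruction:  */
	    /* an invented mix, not a measurement of any real code         */
	    printf("16MHz, 0 ws: %3.0f ns/instr\n", ns_per_instr(16.0, 0, 1.5, 4.0));
	    printf("33MHz, 2 ws: %3.0f ns/instr\n", ns_per_instr(33.0, 2, 1.5, 4.0));
	    return 0;
	}

Even when charged two wait states on every memory access, the 33MHz part
comes out well ahead here (about 300ns vs. 440ns per instruction) because
the internal clocks shrink with the clock period.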

[...]
>6. If you whipped out your trusty soldering gun and 
>anti-static gear and changed all your memory chips 
>to 25 ns SRAMS(on a 33MHz machine w/no cache) would 
>the wait states go away? OR is the timing part of the 
>hardware architecture? (Pardon me for this seemingly
>naive question, but I've been doing software for 8
>years.)

The timing is part of the machine design.  Replacing the components
with faster ones will most of the time not contribute to the speed of
your machine*.  In the memory components used in the PCs there is no
signal from the memory chip which is asserted when data is available.
This is a problem of synchronous designs.  (Conceptual solutions
exist, but require a total rethink of the way hardware is designed.
Read the 1989 ACM Turing award paper by Ivan Sutherland on
micropipelines for more details.)  I would expect the timing to be
hard-coded in the form of some jumpers from a PISO shift register, a
PAL or a PROM.  You could always modify these, but it would not be
trivial.

* An exception is when a processing element is replaced with a more 
efficient one e.g. when an Intel 8088 is replaced with a NEC V-20.

[...]
>7. The PC manufacturers never talk about parity error 
>checked memory, ECC memory, Harvard A.(separate 
>data/instruction cache), data write-thru cache, 
>write buffers (CPU can go on after issuing write 
>instruction only), and multi-word memory 
>transfers. Are PC mfgs behind the times? Or is this
>because of all the canned stuff from the east.

Parity error checked memory is available from some manufacturers.  (IBM
was one of them.)  I don't think that the target market for PCs could
bear the cost of ECC memory.  For a Harvard architecture you need support
from the processor; the 386 does not directly support it.

Diomidis
-- 
Diomidis Spinellis          European Computer-Industry Research Centre (ECRC)
Arabellastrasse 17, D-8000 Muenchen 81, West Germany        +49 (89) 92699199
USA: diomidis%ecrcvax.uucp@pyramid.pyramid.com   ...!pyramid!ecrcvax!diomidis
Europe: diomidis@ecrcvax.uucp                      ...!unido!ecrcvax!diomidis

hjm@cernvax.UUCP (Hubert Matthews) (09/08/89)

In article <21936@cup.portal.com> cliffhanger@cup.portal.com (Cliff C
Heyer) writes a lot of stuff about PCs and UNIX and RAMs.

To save on bandwidth, I shall not quote great tracts, but I hope
the context is obvious.

Why do you want zero-wait state RAM for all of your memory?  A data
cache is there to provide zero wait states for most accesses for most
sorts of program.  If the data cache hit rate is 90%, then a cached
machine already averages 0.9*1+0.1*3=1.2 cycles per access (assuming
3 cycle main memory and a one cycle cache), i.e. 0.2 wait-states, so
making all of main memory zero-wait-state buys you at most that 0.2.
Are you prepared to pay 5 times as much for your machine for a 20%
increase in performance on that type of program?
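
To make that arithmetic concrete, here is a throwaway sketch in C; the
hit rate and cycle counts are the same assumed numbers as above, not
measurements of any particular machine:

	#include <stdio.h>

	int main(void)
	{
	    double hit_rate  = 0.90;   /* assumed d-cache hit rate     */
	    double hit_cost  = 1.0;    /* cycles, one-cycle cache      */
	    double miss_cost = 3.0;    /* cycles, 3-cycle main memory  */
	    double cycles    = hit_rate * hit_cost
	                     + (1.0 - hit_rate) * miss_cost;

	    printf("average cycles per access: %.2f\n", cycles);  /* 1.20 */
	    return 0;
	}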

You may be running software that has a very low cache hit rate if you
are doing CAD work or scientific calculations.  Take this little loop
for example:

      SUM = 0.0
      DO 10 I = 1, 1000000
	SUM = SUM + VEC(I)
   10 CONTINUE

A data cache is *no use at all* for this problem.  You will get a
cache miss on every data access.  Similarly, copying data from one bit
of memory to another will be limited by the raw memory speed.  

For this sort of program, a fast main memory is essential.  The
factor of 5 in cost starts to look not too bad if you get about a
factor of 5 increase in performance.  Why do Crays have fast memories?
Because this is exactly the type of problem they are designed for.

I/O is another area where raw memory speed is important as a data
cache cannot help at all.  The AT bus is indeed slow, but so is the
memory attached to it.  If you speed up the bus, you need to speed up
the memory to go with it or your fast disk will wait for memory.

In essence, the PC architecture was a reasonably well balanced system
when the 8088 and the 8086 were used as CPUs, but most top-end
machines are now hopelessly imbalanced with a 386 inside them.  If you
really want the performance from a 386, you are going to have to pay
about 5 times as much to get the memory and the I/O to feed it at a
decent rate.  There ain't no such thing as a free lunch.

>Please POST your comments.

OK, you asked for war, you got it...



-- 
Hubert Matthews      ...helping make the world a quote-free zone...

hjm@cernvax.cern.ch   hjm@vxomeg.decnet.cern.ch    ...!mcvax!cernvax!hjm

daveh@cbmvax.UUCP (Dave Haynie) (09/08/89)

in article <27133@proton.mips.COM>, wilkes@mips.COM (John Wilkes) says:
> 
> In article <7851@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:

>>MUST between user tasks, since the address spaces are aliased.  Caching still

> i believe that this is true only for a virtual cache.  a physical cache
> does not have the aliasing problem.

Yup, I thought of that later.  Guess I'm used to virtual caches.  However, it
may not make much difference anyway.  You are guaranteed under today's UNIX
at least that two tasks won't be sharing any memory.  So whatever's cached up
by one task is on the chopping block as soon as you switch, and it's not going
to do any good, even if it's not overwritten, until the original task gets
swapped back in.  Especially since most of the external caches tend to be 
direct mapped.  Some of the new chips with cool 4-way set-associative physical 
caches ('040, 88k) might make this a bit more interesting.

>>> Harvard A.(separate data/instruction cache), 

>>This isn't an advantage as an external cache unless your CPU has external
>>I and D busses.  The only one I know of that does is the Motorola 88100.  While

> not necessarily.  you might want your d cache to have a 128-byte line, and
> your i cache to have an eight byte line, for example.  code references and
> data references usually have different access patterns.

Well, that's true.  You may also decide that it's more efficient to have an I
cache of size M and a D cache of size N than a unified cache, depending on 
the system.

> doesn't the amd 29000 have external i and d busses?  it's been a long time
> since i looked at that chip...

Bingo!  While they share a common set of address lines, the 29K does have 
separate data paths for I and D caches.  It's been awhile for me too, but 
the 29K is a neat chip.

> -wilkes
-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                    Too much of everything is just enough

mcdonald@uxe.cso.uiuc.edu (09/08/89)

>Generally, UNIX systems do cache flushes when they enter kernel space, and they
>MUST between user tasks, since the address spaces are aliased.  Caching still

I don't understand this. It looks to me that there could be two kinds of
RAM caches: one would cache addresses in user (or kernel) address space,
before translation of addresses to real physical addresses. These would
need a cache flush, if only for security reasons. 

But, if the cache works on physical addresses, it would seem to me
that it would not need to be flushed. Of course, it would likely have
useless data after a task switch, but would that not be refreshed
automatically as new stuff was needed? This of course would still take time.

Doug McDonald

david@cs.washington.edu (David Callahan) (09/09/89)

In article <1082@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>  Why do Crays have fast memories?
>Because [long vector] is exactly the type of problems they are designed for.

Actually the memories on Crays are commodity parts and hence are not
that fast.  The Cray-2 came out with 120ns DRAMs and later with 55ns
SRAMs, which in the context of this discussion of 33-50MHz parts doesn't
seem that fast.  Further, the nominal 55ns is only part of the latency of
a memory access; the change from 120ns DRAMs to 55ns SRAMs showed
only a 13% improvement in some benchmarks [reported in Supercomputing
88 by Simmons & Wasserman, I think].

Crays cope with long memory latencies by using vector loads,
effectively pipelining memory. So you were right about these machines
being designed for vector operations, but really a stronger statement
could be made: Cray machines require vector operations to achieve
anywhere near peak performance. 

-- 
David Callahan  (david@tera.com, david@june.cs.washington.edu,david@rice.edu)
Tera Computer Co. 	400 North 34th Street  		Seattle WA, 98103

cliffhanger@cup.portal.com (Cliff C Heyer) (09/10/89)

>I think SRAMs are somewhere on the order of 5x the cost for the same amount
>of memory, and they also take up at least twice the board space.  

Got some recent prices:

#1) SRAM 
#     4--64K (70--100ns)   $4
#     4--64K (25--55ns)    $10
#     256K (70--100ns)    $15
#     256K (25--35ns)     $30
#     1Mbit    $100
#
#2) DRAM 
#     256K     $4
#     1Mbit    $12
#     4Mbit    $100

For 1MB, looks like about an extra $300 to $1200 for the chips. BUT what about
the cost to engineer new boards for the high speed chips? $50,000 workstation
prices....+ extra power...

>> 4. Are any board makers making (or have made) motherboards with ESDI and/or
>> SCSI interfaces ON BOARD to bypass the 8MHz AT bus? 
>>Even that's not enough.  SCSI, for example, is still only 8 bits wide, and its
>maximum transfer rate of 4 megs/second is in the same ballpark as the fastest
>AT buses.  This is what Amiga hard disk controllers do (DMA), and we get up to 900k/sec
>through the filesystem using asynchronous SCSI (1.5 megs/sec). 

Sounds like Commodore is really *trying* to put out a "best" product...I wonder
about many others. For example, IBM PS/2s always score the lowest in disk I/O. I
assume this is because they want to *encourage* customers to shift the needed
I/O BW to IBM big iron.

What about SYNCHRONOUS SCSI in the Amiga? (4.0MB/sec) Or DMA & memory 
can't handle it at this speed(?)

Are sync & async about the same cost? Why use synchronous SCSI vs. 
async? 

Actually I must admit I don't know that much about the AT bus. 
What is the AT bus rated at in aggregate and effective MB/sec throughput?

 (For example, I assume it is
16 bits wide? Does this mean at 8 MHz its aggregate throughput is 16MB/sec?
But I suppose because of DOS, I/O is done via 8-bit bytes...which would give
8MB/sec. Still this sounds awfully high - A 15MHz ESDI drive transfers at
15Mbits/sec which is 1.5MB/sec, but this is said to be too fast for 
the AT bus.....and so the 10MHz ESDI is used.)

Cliff

kleinman@hplabsz.HPL.HP.COM (Bruce Kleinman) (09/10/89)

+-------
| You may be running software that has a very low cache hit rate if you
| are doing CAD work or scientific calculations.  Take this little loop
| for example:
| 
|    SUM = 0.0
|    DO 10 I = 1, 1000000
|      SUM = SUM + VEC(I)
| 10 CONTINUE
| 
| A data cache is *no use at all* for this problem.  You will get a
| cache miss on every data access.  Similarly, copying data from one bit
| of memory to another will be limited by the raw memory speed.
+-------
Unless your d-cache line size is wider than a word, and performs burst refills
from main memory.  The '486, for example, will burst fill the 4-word line if
suitably equipped with nibble-mode DRAMs.  Assuming two wait-states for the
initial access, and zero wait states for the subsequent three, the result is
0.5 wait-states for your example.  Granted, you're still busting the d-cache,
but you are able to take advantage of nibble-mode DRAMs (which tend to be
about the same price as regular DRAMs).
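
The 0.5 figure is just the average of the per-word wait states over the
four-word burst; a tiny C check, using the assumed 2-0-0-0 wait-state
pattern above:

	#include <stdio.h>

	int main(void)
	{
	    int waits[4] = { 2, 0, 0, 0 };  /* assumed: 2 ws lead-off, 0 ws bursts */
	    int i, total = 0;

	    for (i = 0; i < 4; i++)
	        total += waits[i];
	    printf("average wait states per word: %.2f\n", total / 4.0);  /* 0.50 */
	    return 0;
	}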

Bruce "and why would anyone be running FORTRAN code on a PC :-)" Kleinman

hjm@cernvax.UUCP (Hubert Matthews) (09/11/89)

In article <3934@hplabsz.HPL.HP.COM> kleinman@hplabs.hp.com (Bruce Kleinman) writes:
>[...I said that data caches aren't any good for long loops...]
>Unless your d-cache line size is wider than a word, and performs
>burst refills from main memory.

But I can always find some access pattern that makes a data cache
ineffective.  For instance, if the line size of the cache is N bytes,
then fetching data separated by >= N bytes renders the cache useless.
And unless you have a processor capable of streaming from the cache
(executing the instruction which started the data fetch whilst the
following bytes of the cache line are arriving), the cache prefetch
will slow things down in this case.  On top of this, there is the
problem of stomping all over the cache, so the effects of the loop are
felt for some time after it has ended.
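
For instance, a rough C rendering of the worst case (the 32-byte line
size is an assumption; any stride of at least one line gives the same
effect):

	#include <stdio.h>
	#define LINE 32                          /* assumed line size, bytes */
	#define N    100000
	double a[N];

	int main(void)
	{
	    int    stride = LINE / sizeof(double);  /* 4 elements = one line */
	    double sum    = 0.0;
	    int    i;

	    /* the stride in bytes equals the line size, so every reference */
	    /* touches a new line: one miss per access, and the rest of the */
	    /* burst-filled line is never used                              */
	    for (i = 0; i < N; i += stride)
	        sum += a[i];

	    printf("sum = %g\n", sum);
	    return 0;
	}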

Caches are effective for certain, albeit common, access patterns.  But
if the software you run doesn't conform to these patterns, then you
are stuck.  All you can do is get faster memory or wait for the
program to finish.

>Bruce "and why would anyone be running FORTRAN code on a PC :-)" Kleinman

If the "FORTRAN code" has long arrays in it like the above examples
(not untypical of scientific code), then you need a machine with
faster main memory to get any decent performance out of your '386,
which is, I believe, where this whole thread of discussion started.

-- 
Hubert Matthews      ...helping make the world a quote-free zone...

hjm@cernvax.cern.ch   hjm@vxomeg.decnet.cern.ch    ...!mcvax!cernvax!hjm

hjm@cernvax.UUCP (Hubert Matthews) (09/11/89)

In article <9149@june.cs.washington.edu> david@cs.washington.edu (David Callahan) writes:
>In article <1082@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>>  Why do Crays have fast memories?
>>Because [long vector] is exactly the type of problems they are designed for.

OK, if you want me to be more specific: Why do Crays have fast memory
systems?

>[Crays use commodity RAMs]

I know, but have you seen how much stuff goes around these jellybean
RAMs so that they can provide enough bandwidth?  Lots of banks,
interleaved to get the bandwidth up.  Not the latency, but the
bandwidth.  If you do vector loads, then the latency of the first load
is offset by the speed of the following loads.  If you get the vector
stride wrong (a power of 2, for example), then your Cray's memory will
crawl along as you will be accessing one bank continuously, rather
than interleaving accesses to several banks.
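
A quick sketch of why the stride matters, for a toy low-order-interleaved
memory (the 16-bank count and 8-element vector are made up, purely for
illustration):

	#include <stdio.h>
	#define NBANKS 16                 /* made-up bank count */

	static void show(int stride)
	{
	    int i;
	    printf("stride %2d hits banks:", stride);
	    for (i = 0; i < 8; i++)
	        printf(" %2d", (i * stride) % NBANKS);  /* low-order interleave */
	    printf("\n");
	}

	int main(void)
	{
	    show(1);    /* 0 1 2 3 4 5 6 7  -- accesses spread over all banks */
	    show(16);   /* 0 0 0 0 0 0 0 0  -- every access waits on bank 0   */
	    return 0;
	}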

>Cray's cope with long memory latencies by using vector loads,
>effectively pipelining memory. So you were right about these machines
>being designed for vector operations but really a stronger statement
>could be made: Cray machines require vector operations to acheive
>anywhere near peak performance. 

Get your interleaving wrong and you can kiss your MFLOPS goodbye
because of the memory system, which was, after all, my original point:
There is more to computing systems than just a CPU; if you don't have
the memory and the I/O to feed the CPU, then the CPU speed is wasted.
A computing system must be balanced, *with respect to the code it is
running*, if it is to be cost effective.  PCs are biased towards a
market where they offer a reasonable balance; Crays are biased towards
another sort of market.  Just don't expect me to buy a PC to crunch
a gigabyte of data or to buy a Cray to do word processing.

-- 
Hubert Matthews      ...helping make the world a quote-free zone...

hjm@cernvax.cern.ch   hjm@vxomeg.decnet.cern.ch    ...!mcvax!cernvax!hjm

colin@watcsc.waterloo.edu (Colin Plumb) (09/11/89)

>In article <7851@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
>> [Separate I & D caches] isn't an advantage as an external cache unless your
>> CPU has external I and D busses.  The only one I know of that does is the
>> Motorola 88100.

In article <27133@proton.mips.COM> wilkes@mips.COM (John Wilkes) writes:
> Doesn't the amd 29000 have external I and D busses?  It's been a long time
> since I looked at that chip...

Yes, it does.  The address bus, however, is shared.  Since the MMU can only
translate one address per cycle, it's no bandwidth loss, and things like
pipelined loads and burst mode reduce the number of addresses required.

The only on-chip cache, the branch target cache, stores 4 instructions at
the destination of a branch to give your i-fetch system time to re-establish
a burst, and the prefetch queue will handle bursts interrupted by page
boundaries.

I rather like it.  Note that a working BTC is a comparatively recent
development. :-)
-- 
	-Colin Plumb

hankd@pur-ee.UUCP (Hank Dietz) (09/11/89)

In article <1083@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>In article <3934@hplabsz.HPL.HP.COM> kleinman@hplabs.hp.com (Bruce Kleinman) writes:
>>[...I said that data caches aren't any good for long loops...]
...
>But I can always find some access pattern that makes a data cache
>ineffective.  For instance, if the line size of the cache is N bytes,

Yes for conventional cache management, not really for compiler-driven
cache management (as I've said many times before  ;-).  See:

	Chi-Hung Chi, "Compiler-Driven Cache Management Using A State Level
	Transition Model," PhD dissertation, Purdue University, May 1989.

						-hankd@ecn.purdue.edu

les@unicads.UUCP (Les Milash) (09/11/89)

In article <3934@hplabsz.HPL.HP.COM> kleinman@hplabs.hp.com (Bruce Kleinman) writes:
>[quote of a common way to make a d-cache useless]
>Unless your d-cache line size is wider than a word, and performs burst refills
>from main memory.  The '486, for example, will burst fill the 4-word line if
>suitably equipped with nibble-mode DRAMs.
a good point.

BUT:
	what is a nibble-mode dram?

	i do know what
		static ram
		VRAM
		page-mode dram
		static-column dram
	are (i think).
Ignorantly yours,
   xxx
Les Milash

firth@sei.cmu.edu (Robert Firth) (09/12/89)

In article <1082@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:

>      SUM = 0.0
>      DO 10 I = 1, 1000000
>	SUM = SUM + VEC(I)
>   10 CONTINUE
>
>A data cache is *no use at all* for this problem.

I don't see that.  Suppose the first reference causes a cache miss.
The machine then fetches a chunk of data from memory - 32 bytes, say.
The next reference finds the data in the cache, and runs right along.
If we assume double-precision FP at 8 bytes per datum, we have a 75%
cache hit rate.  Not stellar, but still pretty good.
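
In other words, with the line size and element size assumed above:

	#include <stdio.h>

	int main(void)
	{
	    int line_bytes  = 32;     /* assumed cache line size      */
	    int datum_bytes = 8;      /* double-precision element     */
	    int per_line    = line_bytes / datum_bytes;       /* 4    */

	    /* one miss fills the line, the next (per_line - 1) loads hit */
	    printf("hit rate at unit stride: %.0f%%\n",
	           100.0 * (per_line - 1) / per_line);        /* 75%  */
	    return 0;
	}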

Did you mean to write

	SUM = SUM + MAT(I,J)

varying J?  That, I grant, is a tough one.

katin@skylark.Sun.COM (Neil Katin) (09/13/89)

In article <7851@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
=in article <21936@cup.portal.com>, cliffhanger@cup.portal.com (Cliff C Heyer) says:
=> 1. UNIX (or any multitasking OS) and the effects of 
=> the on-board cache:
=
=>  	While multitasking, does flushing the cache waste a measurable amount 
=> of run time or is it insignificant compared to swapping, paging, and/or 
=> other overhead? 
=
=Generally, UNIX systems do cache flushes when they enter kernel space, and they
=MUST between user tasks, since the address spaces are aliased.  Caching still
=helps under UNIX, maybe even considerably depending on the main memory design,
=but you won't get the maximum possible performance out of a system under UNIX.

There is no need to flush the cache on task switches on the 80386 (the
subject of the original question) because the cache is a physical cache
-- there is no address aliasing in the cache because the cache works
with physical addresses (i.e. addresses that have already been
translated by the MMU).  This is one of the big advantages of caching
based on physical rather than virtual addresses.

You *do* need to flush the on-chip 68030 cache when doing task switches
because it is a virtual cache.  You probably don't need to flush any
off-chip 68030 caches because they are physical caches (since the 68030
has an on-chip MMU).

Neil Katin

jesup@cbmvax.UUCP (Randell Jesup) (09/13/89)

In article <22011@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>>> 4. Are any board makers making (or have made) motherboards with ESDI and/or
>>> SCSI interfaces ON BOARD to bypass the 8MHz AT bus? 
>>Even that's not enough.  SCSI, for example, is still only 8 bits wide, and it's
>>maximum transfer rate of 4 megs/second is in the same ballpark as the fastest
>>AT buses.  This is what Amiga hard disk controllers do (DMA), and we get up to 900k/sec
>>through the filesystem using asynchronous SCSI (1.5 megs/sec). 
>
>Sounds like Commodore is really *trying* to put out a "best" product...I wonder
>about many others. For example, IBM PS/2s always score the lowest in disk I/O. I
>assume this is because they want to *encourage* customers to shift the needed
>I/O BW to IBM big iron.

	I've never quite understood why msdos computers are so slow at FS
access and I/O in general.  Note that that 900K/sec is through the filesystem,
not just raw I/O speed (with a good drive I've seen 1.3-1.4 meg/sec through
the I/O system, maybe 1.0-1.1 through the filesystem, using asynch scsi (1.5
meg/sec max)).

	We try real hard to provide excellent HD/filesystem speeds compared
to any other micro (and often end up with 4-10x better on the same HD).

>What about SYNCHRONOUS SCSI in the Amiga? (4.0MB/sec) Or DMA & memory 
>can't handle it at this speed(?)
>
>Are sync & async about the same cost? Why use synchronous SCSI vs. 
>async? 

	Sure, synchronous scsi can be done, though the controller we're
discussing didn't have a fast enough clock on-board to use the 4 meg/sec
rate (7MHz clock available, need a 10(?) MHz clock to do 4 meg/sec - note
I'm just talking clocks, not bus speeds).  The current amiga bus bandwidth
is about 3.5 meg/sec, if I remember correctly.  If we had the clock we
could approach that.

>Actually I must admit I don't know that much about the AT bus. 
>What is the AT bus rated at in aggregate and effective MB/sec throughput?

	I suspect it's based on the speed of the processor (causes amusing
problems for boards designed for lower speeds, I suspect).

> (For example, I assume it is
>16 bits wide? Does this mean at 8 MHz it's aggregate throughput is 16MB/sec?
>But I suppose because of DOS I/O is done via 8 bit bytes...which would give
>8MB/sec. Still this sounds awfully high - A 15MHz ESDI drive transfers at
>15Mbits/sec which is 1.5MB/sec, but this is said to be too fast for 
>the AT bus.....and so the 10MHz ESDI  is used.)

	8MHz doesn't mean 8 million xfers of 16 bits per second.  For example,
the amiga bus clock is 7.16MHz, 16 bits wide.  However, bus transfers require
a number of cycles (on almost any bus/processor I've seen).
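
A rough way to see it: peak bytes/sec is width times clock divided by
clocks per transfer, and that last number is never 1 in practice.  The
cycle counts below are illustrative, not measured AT figures:

	#include <stdio.h>

	int main(void)
	{
	    double clock_mhz = 8.0;    /* nominal AT bus clock           */
	    double width     = 2.0;    /* bytes per transfer, 16-bit bus */
	    int    clocks;

	    for (clocks = 1; clocks <= 4; clocks++)
	        printf("%d clock(s)/transfer -> %4.1f MB/sec\n",
	               clocks, clock_mhz * width / clocks);
	    return 0;
	}

One clock per transfer gives the theoretical 16MB/sec; a few clocks per
transfer, plus arbitration and DMA overhead, is where the much lower
real-world numbers come from.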

>Cliff


-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

colin@watcsc.waterloo.edu (Colin Plumb) (09/13/89)

(This is off-topic for comp.arch; see the Followup-To: line)

In article <22011@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>What about SYNCHRONOUS SCSI in the Amiga? (4.0MB/sec) Or DMA & memory 
>can't handle it at this speed(?)

The Microbotics HardFrame can do synchronous SCSI whenever they get the
drivers written.  My problem is finding a drive that can do it.

>Are sync & async about the same cost? Why use synchronous SCSI vs. 
>async? 

Sync is considerably more complex, and chips to support it are a more recent
development.  But it's a few times faster, and nothing is ever too fast...
-- 
	-Colin Plumb

daveh@cbmvax.UUCP (Dave Haynie) (09/14/89)

in article <124529@sun.Eng.Sun.COM>, katin@skylark.Sun.COM (Neil Katin) says:

> You *do* need to flush the on-chip 68030 cache when doing task switches
> because it is a virtual cache.  You probably don't need to flush any
> off-chip 68030 caches because they are physical caches (since the 68030
> has an on-chip MMU).

Undoubtedly why Motorola went to physical caches on the 68040.  The main
problem with physical cache, of course, is speed, but they apparently got
that licked as of the 88k system.  You rarely have a choice with off-chip
caches -- they tend to be physical, slow, and/or expensive.

> Neil Katin
-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                    Too much of everything is just enough

dfields@urbana.mcd.mot.com (David Fields) (09/14/89)

In article <7857@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
>in article <27133@proton.mips.COM>, wilkes@mips.COM (John Wilkes) says:
>> 
>> In article <7851@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:

Context:
	[A discussion about Unix(TM) flushing caches between
		every context switch with virtual caches, and whether
		physical caches could retain useful data after
		a context switch.				MDF]

>You are guaranteed under today's UNIX
>at least that two tasks won't be sharing any memory.  So whatever's cached up

That's not quite correct.  Both 4.3BSD and SysV have shared text and
copy-on-write data, and several implementations of BSD have SysV shared
memory grafted onto them.  I'm not sure how much of this stuff remains
in cache on real systems, but ...
-- 
Dave Fields // Motorola MCD //  uunet!uiucdcs!mcdurb!dfields

jab@dasys1.UUCP (Jeff A Bowles) (09/16/89)

In article <21936@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
> 	While multitasking, does flushing the cache waste a 
>measurable amount of run time or is it 
>insignificant compared to swapping, paging, and/or 
>other overhead? In other words, is the cache still 
>beneficial even though it is being flushed? (I 
>assume "yes" since minicomputers such as all VAX 
>models have them.)

First, the old phrase I always heard was "real memory for real
performance."  But for a machine that is(strictly speaking)
CPU-bound, the cache will help you a lot.

Say you change contexts 16 times a second, i.e. you flush the
cache 16 times in a second --- think about how much you can
do in (an average of) 1/16 of a second, on a fast CPU. On a 4 MIPS
machine, for example, that's 256K instructions at a time that run
with benefit of a cache.
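
With the same made-up numbers:

	#include <stdio.h>

	int main(void)
	{
	    double instr_per_sec    = 4.0e6;  /* "4 MIPS", pulled out of the air */
	    double switches_per_sec = 16.0;   /* assumed context-switch rate     */

	    printf("instructions per quantum: %.0f\n",
	           instr_per_sec / switches_per_sec);  /* 250000 -- the same     */
	    return 0;                                  /* ballpark as the 256K   */
	}                                              /* quoted above           */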

All these are numbers pulled out of the air, but the upshot is that
if you are entirely CPU-bound, a cache will help; if you're spending
precious time waiting for a disk, thrashing for memory, or whatever,
the dynamic is different.

	Jeff Bowles
-- 
Jeff A Bowles
Big Electric Cat Public UNIX
..!cmcl2!{ccnysci,cucard,hombre}!dasys1!jab