[comp.sys.ibm.pc] Pagemode vs Cache Memory Architectures

krogers@esunix.UUCP (Keith Rogers) (07/22/89)

I'm sorry if this topic has already been discussed (i.e. beat to death)
before in this group, but I will soon be in the market for a 386 machine
and I'm confused about the two major memory architectures touted by
manufacturers: pagemode interleaved vs caches.

I believe the crux of my confusion rests upon my non-understanding of
how data is loaded into the CPU.  When an application is run, assuming
few branch instructions and mostly array related data, instructions and
data come in nice contiguous streams.  Herein lies the claim (I think)
that an interleaved architecture runs at nearly zero wait states,
because wait states are only incurred at page boundaries and refreshes.

What I don't understand is how the RAM's are interleaved.  An example
will help to explain my question: let's say I have a 386 with 4 Meg of
RAM composed of 32 1Meg x 1 DRAM chips.  Since the 386 has 32 data bits
going into it, it seems to me that only one read or write cycle is
needed to read or write, say, a long integer or short float value, each
of which is 4 bytes wide.  However, if the RAM's are interleaved, the
only likely interleave pattern I can fathom is to split the 32 RAM's
into 4 banks of 8 RAM's each, thus yielding only 8 of the 32 bits of our
long integer of data per read cycle, therefore, requiring 4 read cycles
to get the whole value into the processor.  A cache memory, however, can
present all 32 bits of the long integer in one read cycle (presuming a
hit, of course).  Is this correct?

I guess a related issue is: if I issue the following assembly command:

		MOV	EAX, LONG_INT

how does the 386 get the value?  If the memory is cached, all 32 bits
are loaded in the next state.  If the memory is interleaved, it seems
to me that 4 read cycles (states) are needed to present all 32 bits to
the CPU.  Thus while the interleaved memory may be running close to zero
wait states, it is still 4 times slower than the cached memory machine,
because data can only be accessed on byte boundaries rather than
longword boundaries.  I guess my question here is, are cached memories,
in fact, accessed on longword boundaries or byte boundaries?

These questions are important to me because 95% of my computing is
crunching floats, longs, and doubles in tight loops, and the faster
I can load a 4 (or 8) byte value into the CPU (or FPU) the better.
Any enlightenment on this topic would be appreciated but I will be more
interested in replies related to my number crunching needs.  I only run
two canned DOS programs (a word processor and a text editor), both of
which already run fast enough for me on my baby 8086 machine.  Thus, I
don't care which one is "best" for existing DOS programs, I want to know
how I can get 32 bit operands into the CPU as fast as possible so that I
can be more intelligent about how I write my own programs and which
memory architecture, and hence, machine I should buy.

An observation:

     Memory architecture, at least amongst manufacturers, seems to be
one of those religious issues which could start a prolonged, and nasty
flame-war.  I HATE FLAME-WARS!  Please restrict replies to a
professional tone backed by facts.

Thanks to all in advance who take some of their time to reply, thanks
indeed.

Keith Rogers    UUCP: utah-cs!esunix!krogers

johne@hpvcfs1.HP.COM (John Eaton) (07/24/89)

<<<<
< 
< What I don't understand is how are the RAM's interleaved?  An example
< will help to explain my question: let's say I have a 386 with 4 Meg of
< RAM composed of 32 1Meg x 1 DRAM chips.  
----------
Will not work.  32 1Meg x 1 parts form a single 32-bit-wide bank of
4 Meg, so there is nothing to interleave.  Use 32 256K x 4 RAMs
instead to give you four banks of 32 bit wide memory.  The four banks
are decoded off the LOWEST longword address bits (A2-A3), so that
reading consecutive 32 bit words causes accesses to sequential banks.


John Eaton
!hpvcfs1!johne