[comp.sys.m68k] 68000 vs. 68020 question.

meulenbr@cstw01.UUCP (Frans Meulenbroeks) (01/02/89)

Happy New Year.

I've been upgrading my 68000 system by replacing the 68000 with a 68020
using 16 bit bus cycles (and the addition of some glue to generate
things like E and VMA).
The system runs on the same clock frequency.
The upgrade works, but leaves the following question:
The expected performance gain from the cache is not realised.
I expected the 020 to run at about the same speed as the 68000 with the
cache disabled, and about 20-30% faster with it enabled.

However, this does not hold.
The system (an Atari ST) interleaves video memory accesses with those of
the CPU: every 4 clock cycles the video makes a memory access. Normally
this runs quite smoothly.

The upgraded system shows some performance degradation, due to different
timing.
Measurements with an analyser have led to the following theory:
Somewhere in the 020 user manual it is stated that instruction prefetch
is always a longword operation. On a 16 bit bus this results in two bus
cycles. In the simplest case two instructions are fetched. One is
executed during the prefetch. The other cannot complete in time and
needs an extra clock cycle. This extra cycle, however, delays the next
instruction prefetch, since it would otherwise collide with a video
access, and that gives additional delay.
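The timing theory above can be put into a few lines of Python as a toy model. Everything in it is an assumption for illustration, not a description of the ST's actual arbitration logic: the CPU may start a bus cycle only on a 4-clock boundary, one 16-bit bus cycle takes 4 clocks, and a longword prefetch is two back-to-back word cycles. The names `next_slot`, `longword_prefetch` and `run` are hypothetical.

```python
SLOT = 4       # assumption: CPU bus cycles may only start every 4 clocks
BUS_CYCLE = 4  # assumption: one 16-bit bus cycle occupies 4 clocks


def next_slot(t: int, slot: int = SLOT) -> int:
    """Round clock t up to the next CPU slot boundary."""
    return -(-t // slot) * slot


def longword_prefetch(start: int) -> int:
    """A longword prefetch on a 16-bit port: two word cycles, each on a slot."""
    t = next_slot(start) + BUS_CYCLE  # first word
    t = next_slot(t) + BUS_CYCLE      # second word
    return t


def run(n_prefetches: int, overrun: int) -> int:
    """Clocks for n prefetches when execution finishes `overrun` clocks
    after each prefetch and so holds off the next one."""
    t = 0
    for _ in range(n_prefetches):
        t = longword_prefetch(t) + overrun
    return t


print(run(10, 0), run(10, 1))  # 80 117
```

In this model a single clock of execution overrun per prefetch raises the steady-state cost from 8 to 12 clocks per prefetch, because the next word cycle has to wait for the following slot.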

My questions:
- is the above scenario a plausible explanation (the instruction used in
  the test was a NOP).
- will the 020 start instruction execution while the second bus cycle is
  still in progress?
- if I want to speed up the 020 by going to 16 MHz (synchronously with
  the system clock), which signals should I pay attention to?
  I know that I should look at DSACK/DTACK since they may be asserted
  before data are valid on a read cycle. Also AS looks like something
  to look at. Does anyone know of other signals to keep an eye on?
  Any suggestions about the problems that may occur on write cycles?

Many thanks!
-- 
Frans Meulenbroeks        (meulenbr@cst.prl.philips.nl)
	Centre for Software Technology
	( or try: ...!mcvax!philmds!prle!cst!meulenbr)

bjh@motsj1.UUCP (Brad Holtzinger) (01/03/89)

meulenbr@cst.UUCP writes:
>
>My questions:
>- is the above scenario a plausible explanation (the instruction used in
>  the test was a NOP).

Your choice of instruction is a poor one.  The nop instruction
in the 68020 does more than nothing; it is also used as a
pipeline synchronizer.  All possibilities of instruction overlap
disappear when the nop is executed.  Typically the bus
control state machine is decoupled from (allowed to operate in
parallel with) the execution unit, so it can complete the write
bus cycle of one instruction while execution of the next
instruction starts.  The next instruction would not be started
if its operand were located in memory (i.e., all reads are sync
points in the 68020 microcode).
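The decoupled bus controller described above can be sketched as a toy scheduler. It assumes, hypothetically, that each instruction is just an execution time, an optional write time, and a flag marking it as a sync point (a NOP, or an instruction that reads a memory operand). This is not the 68020's real microcode timing, only an illustration of why a NOP hides any overlap; `total_clocks` is a made-up name.

```python
from typing import List, Tuple

# (exec_clocks, write_clocks, sync): sync=True models a NOP or an
# instruction whose operand must be read from memory (hypothetical model).
Instr = Tuple[int, int, bool]


def total_clocks(program: List[Instr]) -> int:
    exec_free = 0  # clock at which the execution unit is idle
    bus_free = 0   # clock at which the bus controller is idle
    for exec_c, write_c, sync in program:
        # A sync point waits for all pending bus activity to drain.
        start = max(exec_free, bus_free) if sync else exec_free
        exec_free = start + exec_c
        if write_c:
            # The write is handed to the bus controller and runs in
            # parallel with whatever the execution unit does next.
            bus_free = max(exec_free, bus_free) + write_c
    return max(exec_free, bus_free)


overlapped = [(2, 3, False), (2, 3, False)]
serialized = [(2, 3, False), (2, 3, True)]
print(total_clocks(overlapped), total_clocks(serialized))  # 8 10
```

Two instructions that each execute in 2 clocks and write for 3 finish at clock 8 when the second may overlap the first's write, and at clock 10 when the second is a sync point.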

>- will the 020 start instruction execution while the second bus cycle is
>  still in progress?

It should, but your choice of instruction may negate any
measurable effect.

I hope that this is helpful.
-- 
Brad Holtzinger     Western Region Systems Engineering Manager
Motorola Microcomputer Division 
1150 Kifer Road, Sunnyvale, CA  94086, UUCP: {hplabs, mot, oakhill} !motsj1!bjh
Telephone:  +1 408-991-7340

bruce@blue.gwd.tek.com (Bruce Robertson) (01/04/89)

I've also worked with this sort of arrangement in the past, and I
found that you only just break even with the cache enabled, and lose
dramatically with the cache off.

One thing that will cause you trouble is instructions not aligned on
longword boundaries.  For example, the following instruction sequence
takes only two fetches with the 68000, but a total of 4 on the 68020
with 16 bit memory:

	0x02:	moveq	#1,d0
	0x04:	bra.b	<somewhere>

The words at locations 0x00 and 0x06 are fetched, even though they are
never used.  If the branch target is misaligned, you have yet another
wasted fetch.
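The fetch counts above can be reproduced with a small Python sketch. It assumes, as a simplification, that the 68020 on a 16-bit port always fetches an aligned longword as two word cycles and that nothing useful is already in the cache or prefetch queue; it ignores the possible prefetch past a taken branch mentioned next. The function names are hypothetical.

```python
def words_68000(start: int, end: int) -> list:
    """Word addresses a 68000 fetches for code bytes [start, end)."""
    return list(range(start & ~1, end, 2))


def words_68020_16bit(start: int, end: int) -> list:
    """Word bus cycles a 68020 on a 16-bit port runs for code bytes
    [start, end), assuming every fetch is an aligned longword (two word
    cycles) and nothing is already in the cache or prefetch queue."""
    first = start & ~3     # round down to a longword boundary
    last = (end + 3) & ~3  # round up to a longword boundary
    return list(range(first, last, 2))


# The example above: moveq at 0x02, bra.b at 0x04 (two bytes each).
print([hex(a) for a in words_68000(0x02, 0x06)])        # ['0x2', '0x4']
print([hex(a) for a in words_68020_16bit(0x02, 0x06)])  # ['0x0', '0x2', '0x4', '0x6']
```

A two-byte branch target at 0x12 gives `words_68020_16bit(0x12, 0x14) == [0x10, 0x12]`, which is the extra wasted fetch for a misaligned target.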

Also, I can't remember this for sure, but I think the 68020 may
prefetch the instruction past the branch, on the assumption that you
aren't going to branch, and that's two more 16-bit accesses wasted if
you *do* branch.

> - will the 020 start instruction execution while the second bus cycle is
>   still in progress?

I can't say for sure, but it seems extremely unlikely since it's the
bus interface that is doing the dynamic bus sizing, transparently to
the rest of the processor.
--
	Bruce Robertson
	bruce@blue.gwd.tek.com

daveh@cbmvax.UUCP (Dave Haynie) (01/04/89)

in article <322@cstw01.UUCP>, meulenbr@cstw01.UUCP (Frans Meulenbroeks) says:

> I've been upgrading my 68000 system by replacing the 68000 with a 68020
> using 16 bit bus cycles (and the addition of some glue to generate
> things like E and VMA).
> The system runs on the same clock frequency.
> The upgrade works, but leaves the following question:
> The expected performance gain from the cache is not realised.
> I expected the 020 to run at about the same speed as the 68000 with the
> cache disabled, and about 20-30% faster with it enabled.

No, with the cache disabled the '020 will actually run slower than the
68000 on a 16 bit bus in most cases.  As you guessed below, the cache
pre-fetch is to blame -- whether the cache is enabled or not, the '020
will always fetch a full longword for instruction fetches.  Half of the
pre-fetched longword may go unused, especially if the cache is turned
off.  With the cache on, you get the advantage of caching the whole longword,
plus the other usual cache benefits, which generally make the '020 come out
faster, though, depending on the application, maybe not by much.  The
'030 extends this "feature" to data fetches as well, so you will find that
the '030 with caches disabled on a 16 bit bus runs a little slower than
the '020.
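The cache effect described above can be illustrated with a toy fetch counter. The 68020's instruction cache holds 64 longword entries (256 bytes), direct-mapped; everything else here is a simplifying assumption: each miss costs one aligned longword (two word cycles on a 16-bit port), and with the cache off only the most recently fetched longword is reused. The loop addresses and the function name `word_cycles_020` are hypothetical.

```python
CACHE_ENTRIES = 64  # 68020 on-chip instruction cache: 64 longwords (256 bytes)


def word_cycles_020(fetches, cache_enabled: bool) -> int:
    """16-bit bus cycles for a stream of instruction-word addresses.
    Simplified: every miss costs one aligned longword (two word cycles),
    and with the cache off only the last fetched longword is reused."""
    tags = [None] * CACHE_ENTRIES  # direct-mapped: index = longword % 64
    last = None
    cycles = 0
    for addr in fetches:
        lw = addr >> 2  # longword number
        if cache_enabled and tags[lw % CACHE_ENTRIES] == lw:
            continue  # cache hit: no bus cycle
        if lw == last:
            continue  # other half of the longword just fetched
        cycles += 2
        last = lw
        if cache_enabled:
            tags[lw % CACHE_ENTRIES] = lw
    return cycles


# Hypothetical 8-word loop starting at 0x02, run 100 times.
loop = list(range(0x02, 0x12, 2)) * 100
print(len(loop), word_cycles_020(loop, False), word_cycles_020(loop, True))
# 800 1000 10
```

For the same loop a 68000 would make 800 word fetches, so this model shows the pattern described in the thread: more bus cycles than the 68000 with the cache off, far fewer once the loop sits in the cache. Real code with branches and data accesses will land somewhere in between.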

> - is the above scenario a plausible explanation (the instruction used in
>   the test was a NOP).

Yup.

> - will the 020 start instruction execution while the second bus cycle is
>   still in progress?

I don't think so.  I've never actually proven it, but based on the way I
think the cache fetch mechanism works, this would be impossible.

> - if I want to speed up the 020 by going to 16 MHz (synchronously with
>   the system clock), which signals should I pay attention to?
>   I know that I should look at DSACK/DTACK since they may be asserted
>   before data are valid on a read cycle. Also AS looks like something
>   to look at. Does anyone know of other signals to keep an eye on?
>   Any suggestions about the problems that may occur on write cycles?

You may be able to simplify the design based on Atari ST particulars, but
in general, you have to emulate important edges of the 68000.  These include:

	- Clock external /AS during S2, and /UDS-/LDS during S2 (READ) or 
	  S4 (WRITE).

	- Only sample /DTACK on the falling edge of S4.

	- Sample data on the falling edge of S6.  Depending on what the ST and
	  any peripherals may do, you may have to latch it.  Also, you'll have
	  to coordinate your assertion of /DSACK1 to this, making sure your
	  '020 will have data in time.

	- Always end your external cycles on S7.  I end my internal cycles
	  slightly after S7.

	- You may have to clock /BG on the 68000's 8MHz clock, depending on
	  the assumptions of the system (e.g., how dependent the external
	  system is on the 68000 specs).

	- You may have to sample the /IPL0-2 lines on an 8MHz clock, again
	  depending on the external system.
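The checklist above can be laid out against the 68000 bus-cycle states as a small Python table. It is a planning aid, not glue logic: it assumes the standard 68000 sequence S0 to S7 (one state per half clock), with wait states inserted between S4 and S5 for as long as /DTACK is not seen. `bus_states` and `cycle_plan` are hypothetical names, and the /BG and /IPL0-2 items are left out because they depend on the external system.

```python
def bus_states(wait_states: int = 0) -> list:
    """68000 bus-cycle states, one per half clock; each wait state adds
    two half-clock states between S4 and S5."""
    return ["S0", "S1", "S2", "S3", "S4"] + ["SW", "SW"] * wait_states + ["S5", "S6", "S7"]


def cycle_plan(write: bool = False, wait_states: int = 0) -> list:
    """Pair each state with what the glue logic should do, per the checklist."""
    plan = {
        "S2": ["assert external /AS"],
        "S4": ["sample /DTACK on the falling edge"],
        "S7": ["end the external cycle"],
    }
    if write:
        plan["S4"].insert(0, "assert /UDS-/LDS")  # strobes come before the /DTACK sample
    else:
        plan["S2"].append("assert /UDS-/LDS")
        plan["S6"] = ["sample (or latch) data on the falling edge; time /DSACK1 to it"]
    return [(state, plan.get(state, [])) for state in bus_states(wait_states)]


for state, actions in cycle_plan(write=False):
    print(state, "; ".join(actions))
```

`len(bus_states(n)) // 2` gives the cycle length in clocks: 4, plus one per wait state.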

That should be enough to get it doing something interesting.

-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession