[comp.sys.nsc.32k] NS32532 Patents

roger@nsc.UUCP (04/14/87)

To stimulate some more valuable discussion, let me lift some
of what is discussed in the 32532 overview brochure:

"  At least a dozen manufacturers have brought 32-bit solutions to the
marketplace.  While each design is similar in the broad view, the specifics
of each implementation can vary greatly.  And it is those specifics
that determine which is best for your needs.

   The specifics of the NS32532, however, are unprecedented in 32-bit
microprocessor architectures.  In fact National has applied for 
eight separate patents on the NS32532:

1.)  The method of detecting and handling memory-mapped I/O
     by a pipelined microprocessor.  ----- Think about
     that for a while.  The 32532 has a 1024 byte 2 way set
     associative data cache.  Without the special method
     of handling I/O, writing I/O drivers is somewhat problematic.

2.)  Maintaining coherence between a microprocessor's integrated cache
     and the external memory.  ------ Since both the Instruction
     and Data caches are physical caches, we were able to devise
     a means to provide "hardware" cache coherence hooks.  Coherency
     can be maintained without cumbersome software overhead and at
     no cost in performance.

3.)  Monitoring control flow in a microprocessor ----- in other 
     words, branch prediction.

4.)  The concept of a fully integrated cache, Memory Management Unit,
     and Instruction pipeline.

5.)  Method of simultaneous references to the cache and Bus Interface unit.

6.)  Method for completing instructions without waiting for writes. ----
     Yes, that's right.  Reads have priority over writes.  Writes are
     buffered in a 2 entry FIFO.  There is one exception to this
     rule ----- memory-mapped I/O as in patent #1 above.

7.)  Method of optimizing instruction fetches.

8.)  MMU that is accessible by the instruction unit, address unit
     and the execution unit.

   These unique and innovative architectural refinements give the
NS32532 key performance advantages in a variety of 32-bit applications."


I'm open to discussion on any of these unique attributes.

------- Roger 

shebanow@ji.Berkeley.EDU (Mike Shebanow) (04/14/87)

In article <4206@nsc.nsc.com> roger@nsc.nsc.com (Roger Thompson) writes:
>   The specifics of the NS32532, however, are unprecedented in 32-bit
>microprocessor architectures.  In fact National has applied for 
>eight separate patents on the NS32532:
>
>1.)  The method of detecting and handling memory-mapped I/O
>     by a pipelined microprocessor.  ----- Think about
>     that for a while.  The 32532 has a 1024 byte 2 way set
>     associative data cache.  Without the special method
>     of handling I/O, writing I/O drivers is somewhat problematic.

What happens in a VAX??? It has the same problem. How about the
680X0?? Same thing again. The simple solution is to have a bit in the
page table entry saying that this is I/O. That way, the data is uncached.
Is there something wrong with this solution?
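
A minimal sketch of that page-table-bit scheme in C (the bit names and
layout are invented for illustration, not taken from the VAX, the 680X0,
or the 32532):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical page table entry: one bit marks a page as uncacheable,
       e.g. because it maps I/O registers. */
    #define PTE_VALID    (1u << 0)
    #define PTE_NOCACHE  (1u << 1)    /* invented name: "don't cache this page" */

    typedef uint32_t pte_t;

    /* The cache-fill logic consults the translation: if the page is marked
       non-cacheable, the access goes to the bus and is never allocated in
       the on-chip cache. */
    static bool cacheable(pte_t pte)
    {
        return (pte & PTE_VALID) && !(pte & PTE_NOCACHE);
    }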

>2.)  Maintaining coherence between a microprocessor's integrated cache
>     and the external memory.  ------ Since both the Instruction
>     and Data caches are physical caches, we were able to devise
>     a means to provide "hardware" cache coherence hooks.  Coherency
>     can be maintained without cumbersome software overhead and at
>     no cost in performance.

This is new for a microprocessor, but not in general. Is what you are
doing a different method for cache coherency (Archibald and Baer in
the Nov. 86 ACM Transactions on Computer Systems has a good survey)?

>3.)  Monitoring control flow in a microprocessor ----- in other 
>     words, branch prediction.

This again is new for a micro, but not in general. What type of branch 
prediction are you doing (Lee and Smith IEEE Computer Jan. 84, McFarling
and Hennessy 13th ISCA June 86, J. E. Smith 8th ISCA 81, IBM <too many to list>)?

>6.)  Method for completing instructions without waiting for writes. ----
>     Yes, that's right.  Reads have priority over writes.  Writes are
>     buffered in a 2 entry FIFO.  There is one exception to this
>     rule ----- memory-mapped I/O as in patent #1 above.

Again, I don't understand what is new and unique. This is a well known
technique. Alan Smith, in his 1982 ACM Computing Surveys paper "Cache Memories,"
describes such a write buffer being used to improve write-through cache
performance.

Sorry for creating a flame letter, but maybe I am confused about
what is being patented here. Are you claiming that the concepts above
are patentable, or that the methods used to reduce the concepts to practice
are patentable?

Mike Shebanow
(shebanow@ji.berkeley.edu)

hansen@mips.UUCP (04/15/87)

In article <18308@ucbvax.BERKELEY.EDU>, shebanow@ji.Berkeley.EDU (Mike Shebanow) writes:
> In article <4206@nsc.nsc.com> roger@nsc.nsc.com (Roger Thompson) writes:
> >   The specifics of the NS32532, however, are unprecedented in 32-bit
> >microprocessor architectures.  In fact National has applied for 
> >eight separate patents on the NS32532:
> Sorry for creating a flame letter, but maybe I am confused about
> what is being patented here. Are you claiming that the concepts above
> are patentable, or that the methods used to reduce the concepts to practice
> are patentable?

Seems to me that all Roger said was that National has applied for these
patents. For all we know, the applications might be rejected because they
are considered either not "novel" or because they are judged to be "obvious
to one skilled in the state of the art."
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

davet@oakhill.UUCP (04/15/87)

In article <4206@nsc.nsc.com> roger@nsc.nsc.com (Roger Thompson) writes:
>
>   The specifics of the NS32532, however, are unprecedented in 32-bit
>microprocessor architectures.  In fact National has applied for 
>eight separate patents on the NS32532:

Only eight patents?  I'm just a software guy and I was associated with four
patent applications for the MC68020.

What follows is not a critique of the NS32532 at all, just a comment on
your highly touted list of architectural breakthroughs.

Remember, I'm only a software guy so Motorola may have already done
some of the other things in your list I don't address.  But if you take into
consideration all of the other microprocessor firms representing themselves
in this newsgroup, I would be rather surprised if your list doesn't turn to
zip.

>1.)  The method of detecting and handling memory-mapped I/O
>     by a pipelined microprocessor.  ----- Think about
>     that for a while.  The 32532 has a 1024 byte 2 way set
>     associative data cache.  Without the special method
>     of handling I/O, writing I/O drivers is somewhat problematic.

Motorola offers this via several means.  First, a non-cachable bit in our
MMU descriptor can be used to indicate I/O space.  Second, a class of
instructions which lock the bus automatically avoids using the on-chip cache.
Third, external hardware can signal any bus cycle to be non-cached, thus
forcing the next reference to again come out onto the external bus.

>3.)  Monitoring control flow in a microprocessor ----- in other 
>     words, branch prediction.

The MC68010 (out about 5 years now?) supported this for its DBcc set of
branch instructions (loop mode).  Yes, it was more primitive, but the idea
is the same.

>5.)  Method of simultaneous references to the cache and Bus Interface unit.

The MC68020 does this.  Instruction references go to both the on-chip cache
and to the bus controller.  The bus controller aborts its cycle if the cache
comes up with the data.

>6.)  Method for completing instructions without waiting for writes. ----
>     Yes, that's right.  Reads have priority over writes.  Writes are
>     buffered in a 2 entry FIFO.  There is one exception to this
>     rule ----- memory-mapped I/O as in patent #1 above.

The MC68020 has a one buffer write mechanism.  Intel claims that both their
286 and 386 chips support a one buffer write queue also.

>7.)  Method of optimizing instruction fetches.

Most latter-day microprocessors could make this claim.  Do you have unique
logic on the part to accomplish this?

>8.)  MMU that is accessible by the instruction unit, address unit
>     and the execution unit.

Again, unique logic on the part is necessary for a patent here.

>I'm open to discussion on any of these unique attributes.
>
>------- Roger 

First you need to establish just what is unique or not.

 -- Dave Trissel  Motorola Semiconductor Inc., Austin, Texas
	{ihnp4,seismo}!ut-sally!im4u!oakhill!davet

kenm@sci.UUCP (04/16/87)

In article <4206@nsc.nsc.com>, roger@nsc.nsc.com (Roger Thompson) writes:
> To stimulate some more valued discussions, let me lift some
> of what is discussed in the 32532 overview brochure;
> 
> "  At least a dozen manufacturers have brought 32-bit solutions to the
> marketplace.  While each design is similar in the broad view, the specifics
> of each implementation can vary greatly.  And it is those specifics
> that determine which is best for your needs.
> 
>    The specifics of the NS32532, however, are unprecedented in 32-bit
> microprocessor architectures.  In fact National has applied for 
> eight separate patents on the NS32532:

Introduction:
When I say "we" below I mean a group of CPU designers I was part of
at HP for about 4 years (80-84).

> 
> 1.)  The method of detecting and handling memory-mapped I/O
>      by a pipelined microprocessor.  ----- Think about
>      that for a while.  The 32532 has a 1024 byte 2 way set
>      associative data cache.  Without the special method
>      of handling I/O, writing I/O drivers is somewhat problematic.
Not clear just what the problem is.  Presumably the I/O addresses
can identify themselves, so the cache just has to pay attention.
> 
> 2.)  Maintaining coherence between a microprocessor's integrated cache
>      and the external memory.  ------ Since both the Instruction
>      and Data caches are physical caches, we were able to devise
>      a means to provide "hardware" cache coherence hooks.  Coherency
>      can be maintained without cumbersome software overhead and at
>      no cost in performance.
An extra tag set for the instruction cache so it can monitor all writes
to the data cache.  A simpler solution is to make it architecturally illegal
to write into your own instruction stream and to provide a mechanism
for flushing cache blocks.
> 
> 3.)  Monitoring control flow in a microprocessor ----- in other 
>      words, branch prediction.

We used a small special purpose cache for this.  The way it worked
was that the address of the conditional branch was hashed down to 9 bits
which were used to index a 512x2 bit ram.  The two bits were used to
implement a "slow learner" state machine that predicted which way the
branch would go.  We saw a 95% prediction rate if programs were allowed
to run long enough without a context switch.  With context switch effects this
dropped into the 80-85% rate for our test cases.  Being a slow learner
means that it only makes one mistake on the execution of a loop,
on the very last pass.  We also tried various 1,2, and 3 bit state machines
but none of them worked as well.  Credit for this goes to Mike Manlove at
HP.  There is also quite a bit of literature on the subject.
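
For the curious, a minimal C model of a table of 2-bit "slow learner"
counters like the one described above (the 512 x 2-bit size and the 9-bit
hash follow Ken's numbers; the hash function itself and everything else is
invented for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    /* 512-entry table of 2-bit saturating counters, indexed by a 9-bit
       hash of the conditional branch's address.  Counter values 0 and 1
       predict not-taken, 2 and 3 predict taken.  "Slow learner": a single
       surprise moves the counter but doesn't flip the prediction, so a
       loop branch is mispredicted only on its final pass. */
    static uint8_t counter[512];          /* each entry holds 0..3 */

    static unsigned hash9(uint32_t branch_addr)
    {
        return (branch_addr ^ (branch_addr >> 9) ^ (branch_addr >> 18)) & 0x1ff;
    }

    static bool predict_taken(uint32_t branch_addr)
    {
        return counter[hash9(branch_addr)] >= 2;
    }

    static void train(uint32_t branch_addr, bool taken)
    {
        uint8_t *c = &counter[hash9(branch_addr)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }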
> 
> 4.)  The concept of a fully integrated cache, Memory Management Unit,
>      and Instruction pipeline.
Pretty vague.  I have heard lots of "concepts" in this area.
> 
> 5.)  Method of simultaneous references to the cache and Bus Interface unit.
Ditto.  
> 
> 6.)  Method for completing instructions without waiting for writes. ----
>      Yes, that's right.  Reads have priority over writes.  Writes are
>      buffered in a 2 entry FIFO.  There is one exception to this
>      rule ----- memory-mapped I/O as in patent #1 above.

I remember reading about CDC machines back in the dark ages doing this.
Essentially, the output FIFO contained both addresses and data.  Each read
did a partial comparison (about 8 bits) of the read address against all the
write addresses in the FIFO; if a match was found, the data was grabbed out
of the FIFO and the writes had priority.
Virtual addressing might complicate this if aliasing is allowed.
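
A toy C model of that kind of write FIFO (the 8-bit partial compare is
from Ken's description; the depth, the toy memory, and the names are
invented).  On a partial match this version just gives the writes priority
and drains them; a full-address match could instead forward the buffered
data to the read, as Ken describes:

    #include <stdint.h>
    #include <stdbool.h>

    #define DEPTH 4

    struct wentry { uint32_t addr; uint32_t data; bool valid; };
    static struct wentry fifo[DEPTH];

    static uint32_t mem[1 << 16];                /* toy word-addressed memory */
    #define MEM(a) mem[(a) & 0xffff]

    static void drain_writes(void)               /* retire all buffered writes */
    {
        for (int i = 0; i < DEPTH; i++)
            if (fifo[i].valid) { MEM(fifo[i].addr) = fifo[i].data; fifo[i].valid = false; }
    }

    void store(uint32_t addr, uint32_t data)     /* writes just queue up */
    {
        for (int i = 0; i < DEPTH; i++)
            if (!fifo[i].valid) { fifo[i] = (struct wentry){ addr, data, true }; return; }
        drain_writes();                          /* FIFO full: make room */
        fifo[0] = (struct wentry){ addr, data, true };
    }

    uint32_t load(uint32_t addr)
    {
        /* Compare only the low 8 address bits against every buffered
           write.  No match: the read safely passes the writes.  A
           (possibly false) match: give the writes priority and drain
           them before reading. */
        for (int i = 0; i < DEPTH; i++)
            if (fifo[i].valid && ((fifo[i].addr ^ addr) & 0xff) == 0) {
                drain_writes();
                break;
            }
        return MEM(addr);
    }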

> 
> 7.)  Method of optimizing instruction fetches.
Instruction buffers.
Instruction caches.
Fetching multiple paths simultaneously.
Using branch prediction to fetch the probable path.
Putting the instruction decoder on the other side of the instruction
	cache.  (this takes the next address and branch target calculation
        out of the critical path)
...
> 
> 8.)  MMU that is accessible by the instruction unit, address unit
>      and the execution unit.
If it wasn't, how would the processor work?
> 
>    These unique and innovative architectural refinements give the
> NS32532 key performance advantages in a variety of 32-bit applications."
> 
> 
> I'm open to discussion on any of these unique attributes.
> 
> ------- Roger 

I'm going to be interested in how many of these National manages to patent.
I'm also sure a lot of good engineering work went into the 32532, but most
ideas that look new in this area aren't.

Ken McElvain
decwrl!sci!kenm

roger@nsc.nsc.com (Roger Thompson) (04/17/87)

In article <299@dumbo.mips.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:
  
> Seems to me that all Roger said was that National has applied for these
> patents. For all we know, the applications might be rejected because they
> are considered either not "novel" or because they are judged to be "obvious
> to one skilled in the state of the art."
> -- 

I haven't disappeared.  I have been working on responding to several
email requests.  I have been working through them sort of as a FIFO,
responding by time and date.  I have but one or two left.

As it relates to the patents, there is a difference between applying
for a patent and actually getting one. Our success rate is very, very
high.  But in the time between applying and actually being granted
a patent and having the text available from the Gov't printing
office ------- well, it could be late 1988 or early 1989.  In the
interim, the text is held by the patent office as confidential
material so I can't send drafts of it off to anyone. In fact
I haven't seen the full drafts myself.  What I can do however is
answer questions in almost any detail one wishes.  I may have to
search out an answer or two.

Roger

roger@nsc.nsc.com (Roger Thompson) (04/19/87)

In article <863@oakhill.UUCP>, davet@oakhill.UUCP (Dave Trissel) writes:
> some of the other things in your list I don't address.  But if you take into
> consideration all of the other microprocessor firms representing themselves
> in this newsgroup, I would be rather surprised if your list doesn't turn to
> zip.

We'll see what transpires ---- the list could even get longer, but
the length of the list won't change how the features of the 32532
operate together.

> Motorola offers this via several means.  First, a non-cachable bit in our
> MMU descriptor can be used to indicate I/O space.  Second, a class of
> instructions which lock the bus automatically avoids using the on-chip cache.
> Third, external hardware can signal any bus cycle to be non-cached, thus
> forcing the next reference to again come out onto the external bus.

I presume your references here are to the 68030, since the 020 doesn't
support a data cache.  The non-cachable bit is the classic solution.
Special classes of instructions which lock the bus???  I understand
the need for bus interlocks, and yes, in this case you wish to avoid
the cache ----- but how does this relate to memory-mapped I/O?

My comment relates to physical I/O devices and the mechanism we
have designed into the 532 to both force an external cycle (via 
hardware) and to serialize reads and writes.  This is required
since the internal pipeline normally prioritizes reads over writes.

> The MC68010 (out about 5 years now?) supported this for its DBcc set of
> branch instructions (loop mode).  Yes, it was more primitive, but the idea
> is the same.

Computer architecture continues to evolve -- I agree, and the designers
of tomorrow's micros will borrow from the past, BUT in the process
they will add new wrinkles.  What is the effectiveness of the 010's
prediction, and what overall performance gain/loss does it provide, since
it is only supported in one class of branch instructions?

> The MC68020 does this.  Instruction references go to both the on-chip cache
> and to the bus controller.  The bus controller aborts its cycle if the cache
> comes up with the data.
> 
The concept is quite similar, BUT far more complicated in the case
of the 532 since it also has an internal MMU.  Yes, you say the
030 will support that.  Yes --- but the 030 really only supports the
TLB, and even then the caches are virtual.  The 532 contains the
whole MMU, with a 64-entry TLB, and physical caches, both quite
a bit larger than on the 030.

> The MC68020 has a one buffer write mechanism.  Intel claims that both their
> 286 and 386 chips support a one buffer write queue also.

Agreed, but the new wrinkle here relates to how it reacts to bus errors,
interrupts, traps and other activities which affect the performance of
the pipeline.

> Most latter-day microprocessors could make this claim.  Do you have
> unique logic on the part to accomplish this?

Yes ---- the issue gets interesting since the 32532 supports dynamic
bus sizing.  There are situations where instructions are fetched
both sequentially and non-sequentially.

> Again, unique logic on the part is necessary for a patent here.
> 
That is the requirement of the patent office.


Sorry for the delay in responding, Dave --- but I think I'm caught
up now.

	     Roger

amos@instable.UUCP (Amos Shapir) (04/19/87)

Before you use the 'it has been done before' argument to flame National's
patents of the 32532, keep in mind that a specific implementation of a
combination of ideas may be patented even if each one of them has been
done before separately.
Are there any patent lawyers in the audience?
-- 
	Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel  Tel. (972)52-522261
amos%nsta@nsc.com {hplabs,pyramid,sun,decwrl} 34.48'E 32.10'N

baum@apple.UUCP (Allen J. Baum) (04/21/87)

--------
[]

I think the argument about whether National has something that is innovative
or patentable is not a question that can be answered by examining the claims
in some marketing literature. Obviously, products have been delivered and
patented that have all of the features claimed -- but that doesn't mean that
National did it the same way those products did, or that National didn't do it in a way
that has much better price/performance or functionality.

What National has patent applications for doesn't have to be something that
hasn't been done before, or even something better than has been done before;
it only has to do it in a different manner.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

roger@nsc.UUCP (04/22/87)

In article <4042@sci.UUCP>, kenm@sci.UUCP (Ken McElvain) writes:
> > 
> > 1.)  The method of detecting and handling memory-mapped I/O
> >      by a pipelined microprocessor.  ----- 
> Not clear just what the problem is.  Presumably the I/O addresses
> can identify themselves, so the cache just has to pay attention.

There are two hardware mechanisms.  One is a hand-shake protocol
using two signals, one called IOINH/ and one called IODEC/.  These
will both force references to the data cache to be non-cacheable
as well as force the proper sequencing of reads and writes.  The second
mechanism is to dedicate the upper 16 Mbytes of the memory map to I/O.
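
To make the second mechanism concrete, here is the idea in C; the
0xFF000000 boundary simply follows from "upper 16 Mbytes of a 32-bit
map," and the function name is invented:

    #include <stdint.h>
    #include <stdbool.h>

    /* Anything at or above 0xFF000000 (the top 16 Mbytes of the 4-Gbyte
       physical map) is treated as an I/O reference: it bypasses the data
       cache and is not reordered against other references. */
    static bool is_io_address(uint32_t phys_addr)
    {
        return phys_addr >= 0xff000000u;
    }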

> > 2.)  Maintaining coherence between a microprocessors integrated cache
> >      and the external memory.  ----
> An extra tag set for the instruction cache so it can monitor all writes
> to the data cache.  A simpler solution is to make it architecturally illegal
> to write into your own instruction stream and to provide a mechanism
> for flushing cache blocks.

The issue here is more related to providing hooks to allow hardware external
to the CPU to invalidate the internal caches.  There are 7 cache invalidate
address inputs and 4 control lines that allow external hardware
to invalidate either an entire cache, a set of a cache, or an
individual line (16 bytes) of a cache or set.
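
As a rough C model of the three granularities (the 1-Kbyte, 2-way,
16-byte-line geometry comes from the data cache described earlier; the
structures and function names are invented, and the real part does this
through dedicated pins rather than software):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define WAYS  2
    #define SETS  32        /* 1024 bytes / 16-byte lines / 2 ways */
    #define LINE  16

    struct cline { uint32_t tag; bool valid; };
    static struct cline cache[SETS][WAYS];

    static unsigned set_of(uint32_t addr) { return (addr / LINE) % SETS; }
    static uint32_t tag_of(uint32_t addr) { return addr / (LINE * SETS); }

    void invalidate_cache(void)              /* the entire cache */
    {
        memset(cache, 0, sizeof cache);
    }

    void invalidate_set(unsigned set)        /* one set, both ways */
    {
        for (int w = 0; w < WAYS; w++)
            cache[set % SETS][w].valid = false;
    }

    void invalidate_line(uint32_t addr)      /* one 16-byte line */
    {
        unsigned s = set_of(addr);
        for (int w = 0; w < WAYS; w++)
            if (cache[s][w].valid && cache[s][w].tag == tag_of(addr))
                cache[s][w].valid = false;
    }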

 
> > 3.)  Monitoring control flow in a microprocessor -----  
 
> We used a small special purpose cache for this.  The way it worked
> was that the address of the conditional branch was hashed down to 9 bits
> which were used to index a 512x2 bit ram.  The two bits were used to
> implement a "slow learner" state machine that predicted which way the
> branch would go.  We saw a 95% prediction rate if programs were allowed
> dropped into the 80-85% rate for our test cases.  Being a slow learner
> means that it only makes one mistake on the execution of a loop,
> on the very last pass.  We also tried various 1,2, and 3 bit state machines
> but none of them worked as well.  Credit for this goes to Mike Manlove at
> HP.  There is also quite a bit of literature on the subject.

Your approach is far more elaborate than the one we use. Part of the reason
is that the 32532 was/is targeted towards applications which are context
switch intensive.  Our approach takes into account that programs
typically have loops and that branches backward are taken more often
than not.  Our brochure is confusing in this area. The predictor section
of the chip has a separate address calculation unit so that this
can be done in parallel with other operations.  I will give a more
detailed response in this area in reference to a posting by Craig Hansen.
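
The heuristic itself fits in a couple of lines of C (this is only the
general "loops branch backward" idea, not the actual 32532 logic):

    #include <stdint.h>
    #include <stdbool.h>

    /* Predict a conditional branch taken if it jumps backward (negative
       displacement, probably a loop), otherwise predict not taken. */
    static bool predict_taken(int32_t branch_displacement)
    {
        return branch_displacement < 0;
    }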

> > 6.)  Method for completing instructions without waiting for writes. ----
 
> I remember reading about CDC machines back in the dark ages doing this.
> Essentially, the output FIFO contained both addresses and data.  Each read
> did a partial comparison (about 8 bits) of the read address against all the
> write addresses in the FIFO; if a match was found, the data was grabbed out
> of the FIFO and the writes had priority.
> Virtual addressing might complicate this if aliasing is allowed.
 
Our approach is not this elaborate.  Since the data cache is write-through,
the cache is always up to date and external writes can be delayed.
In addition to this, there are mechanisms that check whether a subsequent
instruction is reading an operand before it has been written even in the
cache; if so, the read will be delayed.  This is somewhat similar to how the
pipe handles register references.
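
A sketch of that read-after-write interlock in C (the structures and
names are invented; only the "delay the read if it overlaps an older,
not-yet-performed write" idea comes from the description above):

    #include <stdint.h>
    #include <stdbool.h>

    struct pending_store { uint32_t addr; unsigned len; bool valid; };

    static bool overlaps(uint32_t a, unsigned alen, uint32_t b, unsigned blen)
    {
        return a < (uint64_t)b + blen && b < (uint64_t)a + alen;
    }

    /* True if the load at (addr, len) must be delayed behind older stores. */
    bool load_must_stall(uint32_t addr, unsigned len,
                         const struct pending_store *older, int n)
    {
        for (int i = 0; i < n; i++)
            if (older[i].valid && overlaps(addr, len, older[i].addr, older[i].len))
                return true;
        return false;
    }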
 
> > 7.)  Method of optimizing instruction fetches.
> Instruction buffers.
> Instruction caches.
> Fetching multiple paths simultaneously.
> Using branch prediction to fetch the probable path.
> Putting the instruction decoder on the other side of the instruction
> 	cache.  (this takes the next address and branch target calculation
>         out of the critical path)

The reference here was more related to fetching the instruction opcode
itself.  Yes, we have buffers and caches, etc., as you list above, but
since the CPU supports dynamic bus sizing, instruction fetching
can be from 8-, 16- or 32-bit-wide memory.  There are scenarios
where both non-sequential and sequential fetching are supported.

Roger