[comp.arch] Multi-Processor Serializability

tomc@cloud9.Stratus.COM (Tom Clark) (02/01/89)

In  any computer system, the programmer expects operations in the source
code to be carried out in the order specified.  This is especially  true
in  a  multi-processor  system  where multiple processes may share data.
For example, process A may write two locations with x' and  y'.  Process
B,  seeing  y',  assumes  x' has been written and reads it.  This may be
extended causally by process B then writing  a  third  location  z'  and
process  C,  seeing  z', knowing that x' must have been written.  If for
some reason these writes are performed out of order, the model on  which
the  programmers  have  based their thinking is wrong.  Programmers have
been used to such a model  for  quite  a  while.  By  the  use  of  both
explicit  and  implicit  locks,  they  have  exploited  it  to arrive at
well-behaved high-performance MP systems.
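
A minimal C sketch of that implicit assumption (the names and values are
invented for illustration; nothing below actually forces the stores to
reach memory in program order):

	int x, y;               /* shared; y is the "ready" flag for x     */

	void process_A(void)    /* the producer                            */
	{
	    x = 42;             /* write x'                                */
	    y = 1;              /* write y' -- B assumes x' is seen first  */
	}

	int process_B(void)     /* the consumer                            */
	{
	    while (y == 0)      /* spin until y' is observed               */
	        ;               /* (an optimizer may even cache y here,    */
	                        /*  since nothing marks it volatile)       */
	    return x;           /* the implicit lock: x' is assumed valid  */
	}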

However, today compilers are optimizing and rearranging the order of the
operations specified in the program (especially for RISC).  In addition,
newer  high-performance  processors  will  reorder operations within the
chip to improve performance by the use of data-driven techniques.  Also,
newer computers have more complex busses (multiple  paths)  between  the
CPUs  and memories.  The problem of cache coherence also adds complexity
to the problem.

In  the  case  of  explicit  locks, the programmer can give hints to the
compiler and the hardware in  order  to  force  a  certain  sequence  of
operations.  These hints may take the form of a 20 pound sledge at times
(e.g.  forcing  serial  operation  of  the  CPU  or  disabling  compiler
optimization in some parts of code) but they will work.

The  problem  comes  with  implicit  locks.  An  implicit  lock  is  the
dependence on the ordering of data references (both reads  and  writes).
These  are  often  very  hard to find by inspection, even if one has the
time to examine all parts of the source code.  Have  people  thought  of
how  to  get  older  software  to  work on newer machines and compilers?
Obviously older applications and operating systems would  like  to  take
advantage  of  the new technology, but finding all these problems before
putting the system into production is very difficult.  The class of bugs
introduced is such that testing cannot be depended upon  to  find  them,
and they will likely occur when the system is very busy (cost of failure
is high).

I'd like to hear any suggestions for dealing with this problem.  Even if
you  can handle the issue for your own code, how do you train a customer
to do it for their code?  What techniques (hardware and software) can be
applied?  Is the problem solvable?



        - Tom Clark
        - Stratus Computer, Inc., Marlboro, MA
        - Disclaimer: My opinions, nobody elses.

billo@snacpac.npac.syr.edu (Bill O) (02/01/89)

In article <3492@cloud9.Stratus.COM> tomc@cloud9.Stratus.COM (Tom Clark) writes:
>In  any computer system, the programmer expects operations in the source
>code to be carried out in the order specified.
> ...
>However, today compilers are optimizing and rearranging the order of the
>operations specified in the program (especially for RISC).  In addition,
>newer  high-performance  processors  will  reorder operations within the
>chip to improve performance by the use of data-driven techniques.  Also,
>newer computers have more complex busses (multiple  paths)  between  the
>CPUs  and memories.  The problem of cache coherence also adds complexity
>to the problem.

But it is a problem that has been solved. There exist many protocols for
efficient coherent caches that correctly implement atomic lock operations.

> ...
>The  problem  comes  with  implicit  locks.  An  implicit  lock  is  the
>dependence on the ordering of data references (both reads  and  writes).
>These  are  often  very  hard to find by inspection, even if one has the
>time to examine all parts of the source code.  Have  people  thought  of
>how  to  get  older  software  to  work on newer machines and compilers?
>Obviously older applications and operating systems would  like  to  take
>advantage  of  the new technology,

No optimizing or parallelizing compiler worth its salt will reorder statements
(either through parallelization or global optimization) if such reordering
would change the meaning of the program. The technology of optimizing compilers
is about 30 years old, and is very robust. The technology of automatically
parallelizing compilers is, perhaps, 10 years old, but there are many examples
of success. A couple that we know about at NPAC are the parallelizing
Fortran compilers for the Alliant FX/80 and the Encore Multimax. Compilers
of this sort examine programs for loops that can be run in parallel on
separate processors, and insert synchronization points for any data-dependencies
that are found. The Alliant compiler also examines loops for vectorizability,
and will usually produce code that runs "concurrent outer, vector inner", which
means that an outer loop is having its iterations performed in parallel, while
the inner loop has been "unwound"  into vector operations. Both the Alliant
and Encore compiler also perform "good old" global optimization, and both
never NEVER perform an optimization if it would change the meaning of a program.
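
A rough sketch (in C rather than Fortran, purely for illustration) of the
two kinds of loop such a compiler distinguishes:

	#define N 1024

	void independent(double a[N], double b[N], double c[N])
	{
	    int i;
	    for (i = 0; i < N; i++)     /* no iteration uses another's result: */
	        a[i] = b[i] + c[i];     /* safe to run in parallel or as a     */
	}                               /* vector operation                    */

	void carried(double a[N])
	{
	    int i;
	    /* iteration i reads the result of iteration i-1, so the compiler */
	    /* must keep the iterations in order (or insert a synchronization */
	    /* point if it spreads them across processors)                    */
	    for (i = 1; i < N; i++)
	        a[i] = a[i - 1] * 0.5;
	}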

>
>I'd like to hear any suggestions for dealing with this problem.  Even if
>you  can handle the issue for your own code, how do you train a customer
>to do it for their code?

I have a good deal of experience with the Alliant compiler, so I'll
talk about it. The compiler, by its very nature, is conservative. It
does not perform an optimization or parallelization if it thinks
there's any chance that it could change the order of interdependent
computations.  Of course, it sometimes is too conservative, and will
fail to optimize where it really could have. In these cases it prints
an "informational message" which says what it thinks is the problem.
The programmer can then opt, if he/she feels the compiler has been too
conservative, to include a compiler directive in the code telling the
compiler to go ahead and optimize anyway. These informational messages
are tremendously helpful to our users. Being primarily a Fortran
engine, the FX/80 is used principally by "real" scientific Fortran
programmers -- not computer scientists -- yet we have had real success
in training our users how to interpret the messages, and when to try
compiler directives.


>What techniques (hardware and software) can be
>applied?

Well, automatically parallelizing compilers, as mentioned, are a
viable option. Perhaps the most aggressive compiler of this sort is
the Fortran compiler for the Multiflow Trace machines. The Trace
compilers move code even when it *will* affect meaning --
specifically, assumptions are made about the result of branch tests
before the test is performed. These assumptions allow more parallelism
to be exploited, and the compiler can "get away with it" because it
inserts extra instructions to "undo the damage" in cases when the
branch went a different way than predicted.  Just as with the Alliant
compiler, and with any good optimizing compiler, the overall semantics
of the program are not changed.  This is not handwaving. All of the
techniques used by such compilers are provably correct. (Naturally,
any compiler may have bugs, and optimizing compilers are no exception,
but that is a problem of software engineering.)

I should point out that Alliant is developing an
optimizing/vectorizing/parallelizing C compiler too, so the techniques
aren't limited to Fortran. It just so happens that Fortran is where
the demand is, so that is why so much effort has been directed at the
development of Fortran compilers.

As for RISC, perhaps Tom is thinking about RISC-specific techniques such
as inserting a branch sooner in the code than one would for a non-RISC
machine. But such branch instructions are always of the deferred branch
sort which guarantee that the semantics of the program will be preserved.
Of course, RISC compilers also may perform global optimization, but such
techniques are well known and understood.

Finally, concerning explicit locks, an optimizing compiler will never
move an explicit lock operation, either because the locking operation
is built into the language, and so it knows better, or because the
locking operation is performed by an external subroutine, which the
compiler will likely make worst-case assumptions about, thus preserving
language semantics.
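
For instance (a hypothetical fragment -- acquire_lock() and release_lock()
are made-up external routines): because the compiler cannot see their
bodies, it has to assume they might read or write any global, so it will
not move the access to the shared counter across the calls.  (This only
covers what the compiler does; it says nothing about reordering done by
the hardware itself.)

	extern void acquire_lock(void);   /* external: bodies not visible     */
	extern void release_lock(void);   /* to the compiler                  */

	int shared_count;                 /* global the lock is protecting    */

	void bump(void)
	{
	    acquire_lock();               /* call acts as an optimization     */
	    shared_count++;               /* barrier: the load/store of       */
	    release_lock();               /* shared_count stays between them  */
	}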

>Is the problem solvable?

YES.  But maybe I've just been rambling and haven't answered your question.

>
>        - Tom Clark
>        - Stratus Computer, Inc., Marlboro, MA
>        - Disclaimer: My opinions, nobody elses.

Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
(billo@cmx.npac.syr.edu)

brooks@vette.llnl.gov (Eugene Brooks) (02/01/89)

In article <3492@cloud9.Stratus.COM> tomc@cloud9.Stratus.COM (Tom Clark) writes:
>In  any computer system, the programmer expects operations in the source
>code to be carried out in the order specified.  This is especially  true
>in  a  multi-processor  system  where multiple processes may share data.

In ANSI C the tool to use to enforce reference order for reads and writes
is the keyword volatile, although I suppose that the purveyors of this keyword
did not have shared memory multiprocessors in mind when they inserted it into
the ANSI C definition.   If instruction reordering can be performed invisibly
by the system at lower levels than the compiler, the effect of these keywords
must trickle down into the hardware properly.  We use the volatile keyword
support in GNU C on our shared memory multiprocessors routinely in codes which
require certain references to actually go out to main memory for
synchronization purposes.
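
A sketch of the sort of use described above (the names are invented for
illustration).  Declaring the flag volatile keeps the compiler from
caching it in a register, so the spinning reader really re-reads main
memory on every test:

	volatile int ready;     /* set by another processor when data is valid */
	double result;          /* the data being handed off                   */

	double wait_for_result(void)
	{
	    while (ready == 0)  /* each test is a genuine load from memory     */
	        ;
	    return result;
	}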

hoefling@uicsrd.csrd.uiuc.edu (02/02/89)

Manufacturers of multi-processors provide program-restructurers
(sometimes called "parallelizers" or "vectorizers").  All of the
ones I know about accept a sequential version of the program and
produce a new version of the program containing vector statements
(if you have vector-processing hardware) and parallel constructs.

There are many such restructurers in use in universities and I know
of two commercial restructurers (KAP and VAST).  Typically, these
restructurers are used with Fortran, but some are beginning to be
used for C and Lisp.

Hope this answers your question.

Jay Hoeflinger
Center for Supercomputing Research and Development
University of Illinois at Urbana-Champaign

    UUCP:    {ihnp4,uunet,convex}!uiucuxc!uicsrd!hoefling
    ARPANET: hoefling%uicsrd@uxc.cso.uiuc.edu
    CSNET:   hoefling%uicsrd@uiuc.csnet
    BITNET:  hoefling@uicsrd.csrd.uiuc.edu

tomc@cloud9.Stratus.COM (Tom Clark) (02/02/89)

In article <19635@lll-winken.LLNL.GOV>, brooks@vette.llnl.gov (Eugene Brooks) writes:
> In article <3492@cloud9.Stratus.COM> tomc@cloud9.Stratus.COM (Tom Clark) writes:
> >In  any computer system, the programmer expects operations in the source
> >code to be carried out in the order specified.  This is especially  true
> >in  a  multi-processor  system  where multiple processes may share data.
> 
> In ANSI C the tool to use to enforce reference order for reads and writes
> is the keyword volatile, although I suppose that the purveyors of this keyword
> did not have shared memory multiprocessors in mind when they inserted it into
> the ANSI C definition.

Unfortunately volatile does not do it.  Volatile does indeed force a write
to memory instead of holding an intermediate result in a register, but it
says nothing about ordering of instructions.  Such a construct does not
exist in most languages, except for the disabling of optimization.  Sigh.
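
To illustrate the worry (a hypothetical fragment; flag is volatile but
data is an ordinary object): the compiler is free to move the store of
data past the volatile store of flag, and nothing in the language
constrains the order in which the memory system itself performs the two
stores.

	volatile int flag;      /* the guard another processor will test    */
	int data;               /* ordinary, non-volatile shared data       */

	void publish(int v)
	{
	    data = v;           /* may legally be moved below the next line */
	    flag = 1;           /* by the compiler; either store may also   */
	}                       /* be reordered on its way to memory        */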


        Tom Clark, Stratus Computer

cik@l.cc.purdue.edu (Herman Rubin) (02/02/89)

In article <3492@cloud9.Stratus.COM>, tomc@cloud9.Stratus.COM (Tom Clark) writes:
> In  any computer system, the programmer expects operations in the source
> code to be carried out in the order specified.

The programmer expects the results to be as if the operations were carried
out in the order specified.  I see no reason why it is necessary that the
actual computations be done in that order.  There are architectures where
that is not the case now.

>		 This is especially  true
> in  a  multi-processor  system  where multiple processes may share data.
> For example, process A may write two locations with x' and  y'.  Process
> B,  seeing  y',  assumes  x' has been written and reads it.  This may be
> extended causally by process B then writing  a  third  location  z'  and
> process  C,  seeing  z', knowing that x' must have been written.  If for
> some reason these writes are performed out of order, the model on  which
> the  programmers  have  based their thinking is wrong.  Programmers have
> been used to such a model  for  quite  a  while.  By  the  use  of  both
> explicit  and  implicit  locks,  they  have  exploited  it  to arrive at
> well-behaved high-performance MP systems.

One way of blocking this is to require that a write to a location, including
a register, put a block on that location.  Also, a pending read from a 
location must put a block on a write to that location.  For scalar
operations, this can be handled in hardware at some cost.  The only
cheap way is to have a write block all reads and writes, and a pending
read block all writes, but this will slow down code.

> However, today compilers are optimizing and rearranging the order of the
> operations specified in the program (especially for RISC).  In addition,
> newer  high-performance  processors  will  reorder operations within the
> chip to improve performance by the use of data-driven techniques.  Also,
> newer computers have more complex busses (multiple  paths)  between  the
> CPUs  and memories.  The problem of cache coherence also adds complexity
> to the problem.
> 
> In  the  case  of  explicit  locks, the programmer can give hints to the
> compiler and the hardware in  order  to  force  a  certain  sequence  of
> operations.  These hints may take the form of a 20 pound sledge at times
> (e.g.  forcing  serial  operation  of  the  CPU  or  disabling  compiler
> optimization in some parts of code) but they will work.
> 
> The  problem  comes  with  implicit  locks.  An  implicit  lock  is  the
> dependence on the ordering of data references (both reads  and  writes).
> These  are  often  very  hard to find by inspection, even if one has the
> time to examine all parts of the source code.  Have  people  thought  of
> how  to  get  older  software  to  work on newer machines and compilers?
> Obviously older applications and operating systems would  like  to  take
> advantage  of  the new technology, but finding all these problems before
> putting the system into production is very difficult.  The class of bugs
> introduced is such that testing cannot be depended upon  to  find  them,
> and they will likely occur when the system is very busy (cost of failure
> is high).
> 
> I'd like to hear any suggestions for dealing with this problem.  Even if
> you  can handle the issue for your own code, how do you train a customer
> to do it for their code?  What techniques (hardware and software) can be
> applied?  Is the problem solvable?

Here is a simple example of this problem, which requires knowledge of the
machine to handle.  It is an actual situation, not invented for this 
posting.

A good method for generating pseudo-random numbers is to form

	x[n] = x[n-j] OP x[n-k],

where there are several choices for OP (exclusive or is always one of them)
depending on the machine.  Now it is necessary that the values of x[n-j]
and x[n-k], for sufficiently large n, are the values computed by the
recurrence.  The only completely safe method I can think of is to make sure
that the number of new items generated at one time is at most min(j,k).
Whether more can be done depends on machine parameters.
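
As a concrete sketch (the lags and seeding are illustrative only, and OP
is taken to be exclusive or): keeping each batch no larger than min(j,k)
guarantees that every x[n-j] and x[n-k] read in the batch was stored by an
earlier batch, so the batch loop itself could safely be vectorized or run
in parallel.

	#define J      24
	#define K      55            /* lags, K > J                          */
	#define BATCH  24            /* = min(J,K): safe batch size          */

	static unsigned x[K];        /* last K values; must be seeded        */
	                             /* (not all zero) before the first call */
	static int pos;              /* index of the oldest value, x[n-K]    */

	void next_batch(unsigned out[BATCH])
	{
	    int i;
	    for (i = 0; i < BATCH; i++) {
	        /* x[pos] holds x[n-K]; x[(pos + K - J) % K] holds x[n-J] */
	        unsigned v = x[(pos + K - J) % K] ^ x[pos];
	        x[pos] = v;          /* x[n] overwrites x[n-K]               */
	        out[i] = v;
	        pos = (pos + 1) % K;
	    }
	}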

On the vector register machines in my ken, this is no problem, as the vector
lengths are fairly small, although it can rule out small values of j and k.
But on vector stream machines, if one knows, for example, that an item M
units back in the array must have been stored before reading, all that is
needed is that j and k are at least M.

Thus it takes interaction between the programmer and the compiler, and some
machine dependence.  It is inefficient not to allow the programmer to
override the safety features, but allowing it is also dangerous.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

brooks@vette.llnl.gov (Eugene Brooks) (02/03/89)

In article <3507@cloud9.Stratus.COM> tomc@cloud9.Stratus.COM (Tom Clark) writes:
>Unfortunately volatile does not do it.  Volatile does indeed force a write
>to memory instead of holding an intermediate result in a register, but it
>says nothing about ordering of instructions.  Such a construct does not
>exist in most languages, except for the disabling of optimization.  Sigh.

Reordering of instructions is fine as long as the meaning of the code
is not changed.  The volatile keyword is needed to take care of the
added dimension that memory cells may change spontaneously or must
be changed at the point of the write in order to effect communication
to other devices which the compiler is not managing.  The volatile
references must be performed in proper order, and I believe that ANSI
C specifies this, but any reordering of instructions which does not
affect the values written to volatile memory cells poses no problem.

cquenel@polyslo.CalPoly.EDU (88 more school days) (02/03/89)

brooks@vette.llnl.gov (Eugene Brooks) writes:
> In ANSI C the tool to use to enforce reference order for reads and writes
> is the keyword volatile, although I suppose that the purveyors of this keyword
> did not have shared memory multiprocessors in mind when they inserted it into
> the ANSI C definition.

tomc@cloud9.Stratus.COM (Tom Clark) writes:
>Unfortunately volatile does not do it.  Volatile does indeed force a write
>to memory instead of holding an intermediate result in a register, but it
>says nothing about ordering of instructions.  Such a construct does not
>exist in most languages, except for the disabling of optimization.  Sigh.

Wait a minute.  What exactly is "ordering of instructions" ?
You talk about it as if there is one true "order" for instructions
generated by the compiler, and optimizations *change* that order.

The compiler GENERATES the order.  Probably no two compilers will
generate the SAME code sequence for a significantly large program.
So which one is right ?  *Neither* obviously, right ?  It depends.

So, what you're trying to say is this : I want the compiler
to generate code within certain constraints.  I want it to
generate code that is what I expect, so that I can second-guess
the code and do things with my code that are not directly supported
by the language.

This is not a goal that I can frown on easily.

C's notable lack of support for multi-processor synchronization
makes it very tempting to try to wangle a way to *reliably*
synchronize processes on different processors; however, this
is not a part of the language.

Unless you can define exactly WHAT your constraints are, you
can't expect a compiler implementor to generate his code
by any rules other than those of supporting the definition
of the language, and NOT "what the language has always done before."

The reason "it's always worked before" is a Marketing reason for
doing something, not a Software/Computer Engineering reason for
doing it.

--chris
-- 
@---@  -----------------------------------------------------------------  @---@
\. ./  |Chris Quenelle (The First Lab Rat) cquenel@polyslo.calpoly.edu |  \. ./
 \ /   |  YOU can do away with mode-less editors in YOUR life time!!!  |   \ / 
==o==  -----------------------------------------------------------------  ==o==

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (02/03/89)

In article <1066@cmx.npac.syr.edu> billo@snacpac.npac.syr.edu.UUCP (Bill O'Farrell) writes:
>means that an outer loop is having its iterations performed in parallel, while
>the inner loop has been "unwound"  into vector operations. Both the Alliant
>and Encore compiler also perform "good old" global optimization, and both
>never NEVER perform an optimization if it would change the meaning of a program.


This is not quite true. Most very highly optimizing compilers have
(often undocumented) unsafe modes which intentionally change the
semantics of the code. On a CDC system with the u level of
optimization (0,1,2,3,U) it produced correct results about 70% of the
time, which made it a very useful tool.

In addition, even modest levels of optimization can cause subtle
changes in numerical results. In IEEE 64-bit mode this is seldom seen,
but in 32-bit mode it can often be observed.

Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

schow@bnr-public.uucp (Stanley Chow) (02/03/89)

The problem that Tom posted refers to compilers doing
optimizations that are correct for a single process but become wrong
when multiple processes are involved.
 
Specifically, if x and y are declared in shared memory (but not volatile),
then it is legal for the compiler to reorder writes to them. All 
semantics can be preserved (if the compiler did it right). The problem
comes from a second process looking at y and making inferences about
the value of x.  Reordered writes then become a problem.
 
Note that using 'volatile' is one possible solution but is incomplete
in two ways:
  - if all my data is shared and I put volatile on everything, it
    can be a major performance hit.
  - the ANSI definition of volatile is intended for dealing with
    I/O registers and such (at least what little I know of it). There
    can be circumstances when reordering of shared memory and my
    own process memory can cause errors. (I am not saying this is
    good or desirable, just that there are systems with this attribute).

The more aggressive optimizations mentioned above, where program semantics
can be violated momentarily (due to wrong branch prediction and/or vector
execution, but fixed up later), can be very nasty if an interrupt comes
at the wrong time and the interrupt handler (or scheduler, or trap
handler) was written assuming a certain ordering of the memory references.

Preemptive flame retardant: I am not defending this style of coding.
But the reality is that there exist large software bases that were written
relying on this ordering (perhaps because the software was written
before anybody knew about multi-processing).
 

Stanley Chow  ..!utgpu!bnr-vpa!schow%bnr-public
              (613) 763-2831
 
RISCy: We already know the ultimate RISC (the single instruction CPU)
       is a low performer. Why does anyone want to be RISCy?

bjornl@octopus.tds.kth.se (Bj|rn Lisper) (02/03/89)

In article <88088@sun.uucp> khb%chiba@Sun.COM (Keith Bierman - Sun Tactical
Engineering) writes:
>In article <1066@cmx.npac.syr.edu> billo@snacpac.npac.syr.edu.UUCP (Bill
>O'Farrell) writes:
>>					..... Both the Alliant
>>and Encore compiler also perform "good old" global optimization, and both
>>never NEVER perform an optimization if it would change the meaning of a
>>program.

>This is not quite true. Most very highly optimizing compilers have
>(often undocumented) unsafe modes which intentionally change the
>semantics of the code. ....

The problem is, of course, the difference between the semantics of the
programming language and what the programmer considers to be the meaning. In
scientific/numerical applications I think the intended meaning often is a
function (e.g. the solution vector to a system of linear equations, which is
a function of the system matrix and the right-hand side vector). The
semantics of an imperative program (read: Fortran) is, however, based on
states and state transitions. Since the state is comprised of the values of
all the variables in a program and the value of the program counter, this
means that an optimizing or parallelizing compiler really must take ALL
variables into account, including those just intended as temporary work
areas, if it is not to violate the semantics. Thus, a compiler for an
imperative language can either be very conservative and abide by the
semantics of the programming language, or try to guess the real intention of
the programmer in order to allow itself a richer set of transformations.

>In addition, even modest levels of optimization can cause subtle
>changes in numerical results. On ieee 64-bit mode this is seldom seen,
>but in 32-bits it can often be observed.

Yes, this is another problem. For instance, a typical parallelizing
operation is to use associativity to change a linear, inherently serial
n-chain of binary operations into a balanced tree with O(log n) height. Sums
can be treated this way PROVIDED THAT ADDITION IS ASSOCIATIVE. Floating
point addition is, however, NOT associative. Changing the order of summation
will give slightly different results, because round-off and cancellation
errors depend on the magnitude of the added numbers. One can, in fact,
easily construct examples where reordering of summations gives disastrous
results from a numerical point of view. Thus, a compiler should not use
algebraic laws for real numbers to transform floating-point expressions
without the consent of the programmer.
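
A two-line illustration in C (single precision, values chosen only to
make the effect obvious):

	#include <stdio.h>

	int main(void)
	{
	    float a = 1.0f, b = 1.0e8f, c = -1.0e8f;

	    float left  = (a + b) + c;   /* a is absorbed into b: result 0.0 */
	    float right = a + (b + c);   /* b and c cancel first: result 1.0 */

	    printf("%g %g\n", left, right);   /* typically prints 0 1        */
	    return 0;
	}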

Bjorn Lisper

eugene@eos.UUCP (Eugene Miya) (02/04/89)

Here we go again!  Reinventing the HEP.
Here we go again!  Reinventing the ILLIAC.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Post follow ups.  Contribute to network noise."

cik@l.cc.purdue.edu (Herman Rubin) (02/04/89)

In article <7650@polyslo.CalPoly.EDU>, cquenel@polyslo.CalPoly.EDU (88 more school days) writes:

			......................

> Wait a minute.  What exactly is "ordering of instructions" ?
> You talk about it as if there is one true "order" for instructions
> generated by the compiler, and optimizations *change* that order.
> 
> The compiler GENERATES the order.  Probably no two compilers will
> generate the SAME code sequence for a significantly large program.
> So which one is right ?  *Neither* obviously, right ?  It depends.
> 
> So, what you're trying to say is this : I want the compiler
> to generate code within certain constraints.  I want it to
> generate code that is what I expect, so that I can second-guess
                    DOES
> the code and do things with my code that are directly supported
> by the language.
> 
> This is not a goal that I can frown on easily.

Nor I.  But the problem is more difficult than it seems.  A "bug" was
found in the CDC6600 architecture.  Consider the following machine
instructions, which I am writing in pseudocode, as I do not expect the
readers to understand COMPASS.  Xi denotes X register i.

	X7 = X0/X1
	memloc = X7
	if(X3 < 0) goto LL
	...

LL	memloc = X6
	...

But the timing is such that if X3 < 0, then the "second" store is completed
before the "first" store begins.

The original architecture did not have a pending store block the issuance of
a store instruction, and this was changed to prevent this from happening.
Of course, this slows things down when this type of conflict does not occur.

THIS one could be caught by a compiler. But there are others which cannot.
The programmer could tell if different symbolic locations are used, but not
in all cases.  Suppose the first memloc was array1[B3] and the second
array2[B4].  It is much more difficult now.  One could have hardware blocks,
which requires the CPU to keep track of all pending load and store addresses.

The upshot is that one can put severe general restrictions on the compiler and
assembler, or one can try to get the programmer into the act at the machine
level.  I strongly suggest the latter.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

brooks@vette.llnl.gov (Eugene Brooks) (02/05/89)

In article <1120@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) notes
that failure of the hardware to enforce the ordering of side effects to
memory locations can be a problem.  We in fact ran into this problem with the
Cerberus multiprocessor simulator, in which accesses to shared memory are
fully pipelined through a packet-switched network.  Because conflicts in the
network could cause the actual stores to the memory banks to occur out of
order, the set of a flag guarding access to data being written could
beat the actual writes of the data.  In Cerberus, we dealt with this problem
by requiring return receipts for all memory accesses.  A counter in each
processor kept track of the number of pending memory operations issued by
the processor, and the instruction "sync" stalled further instruction
issue until all pending memory operations were complete.  The user explicitly
used "sync" in his algorithm before setting a flag to indicate that written
data was ready.  By separating the "sync" operation out as a distinct
instruction, one could fully pipeline the writes of an aggregate data object.
We actually found that the use of "sync" was required in this sort of data
transaction for a customized Gauss elimination algorithm; without the "sync",
the set of a flag would indeed beat the writes of the data it was protecting
in rare cases and cause incorrect execution of the algorithm.
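
In C-like pseudocode the protocol looks roughly like this (sync() here
stands in for the Cerberus "sync" instruction, not the UNIX call; the
array size and names are invented):

	extern void sync(void);     /* stall until every pending memory      */
	                            /* operation has returned its receipt    */

	volatile int ready;         /* guard flag tested by other processors */
	double row[256];            /* aggregate data being handed off       */

	void publish_row(const double *src)
	{
	    int i;
	    for (i = 0; i < 256; i++)   /* these stores may be fully          */
	        row[i] = src[i];        /* pipelined through the network      */
	    sync();                     /* wait until all of them have        */
	                                /* actually reached the memory banks  */
	    ready = 1;                  /* only now set the guard flag        */
	}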

brooks@maddog.llnl.gov or brooks@maddog.uucp

beyer@houxs.ATT.COM (J.BEYER) (02/06/89)

In article <3507@cloud9.Stratus.COM>, tomc@cloud9.Stratus.COM (Tom Clark) writes:
> > In ANSI C the tool to use to enforce reference order for reads and writes
> > is the keyword volatile, although I suppose that the purveyors of this keyword
> > did not have shared memory multiprocessors in mind when they inserted it into
> > the ANSI C definition.
> 
> Unfortunately volatile does not do it.  Volatile does indeed force a write
> to memory instead of holding an intermediate result in a register, but it
> says nothing about ordering of instructions.  Such a construct does not
> exist in most languages, except for the disabling of optimization.  Sigh.

The trick of disabling optimization to obtain a particular style of code
at the assembly-language level cannot be relied upon, because anything an
optimizer can do, a compiler can do as well. One need only imagine an ideal
compiler that produces perfect code without an optimizer. With such a compiler,
the concept of turning off optimization does not exist. It seems to me that if
the higher level language in use does not support the semantics desired, you
must either rely on the characteristics of the compiler in use (and risk
losing portability), change to a higher level language that does support the desired
semantics, or write in a lower level language. But beware, some assemblers
have optimizers in them, too.

-- 
Jean-David Beyer
A.T.&T., Holmdel, New Jersey, 07733
houxs!beyer

markhall@pyramid.pyramid.com (Mark Hall) (02/08/89)

In article <3507...>, tomc@cloud9.Stratus.COM (Tom Clark) writes:
> > In ANSI C the tool to use to enforce reference order for reads and writes
> > is the keyword volatile, 
> 
> Unfortunately volatile does not do it.  Volatile does indeed force a write
> to memory instead of holding an intermediate result in a register, but it
> says nothing about ordering of instructions.  

I must be wrong about this (cuz no one else has posted it yet)
but are you SURE ANSI volatile ``says nothing about the ordering
of instructions''?  Well, what does the part about ``the value
of the volatile object shall agree with that prescribed by the
abstract machine at all sequence points'' mean?  Dang, I thought it
meant that you couldn't move computations involving volatile
objects across sequence points, which to me means you have
constraints on ordering.  As for operator application order, for
the expression  a <op1> b <op2> c, the evaluation must proceed
`as if' op1 were applied first, and then op2, etc.

I don't get it (I'm sure someone will set me straight, though).

(I love getting into these discussions a month late).

-Mark Hall (smart mailer): markhall@pyramid.pyramid.com
	   (uucp paths): {ames|decwrl|sun|seismo}!pyramid!markhall

mash@mips.COM (John Mashey) (02/15/89)

In article <58218@pyramid.pyramid.com> markhall@pyramid.UUCP (Mark Hall) writes:
>In article <3507...>, tomc@cloud9.Stratus.COM (Tom Clark) writes:
>> > In ANSI C the tool to use to enforce reference order for reads and writes
>> > is the keyword volatile, 
>> 
>> Unfortunately volatile does not do it.  Volatile does indeed force a write
>> to memory instead of holding an intermediate result in a register, but it
>> says nothing about ordering of instructions.  
>
>I must be wrong about this (cuz no one else has posted it yet)
>but are you SURE ANSI volatile ``says nothing about the ordering
>of instructions''?  Well, what does the part about ``the value
>of the volatile object shall agree with that prescribed by the
>abstract machine at all sequence points''?  Dang, I thought it
>meant that you couldn't move computations involving volatile
>object across sequence points, which to me means you have
>constraints on ordering.  As for operator application order, for
>the expression  a <op1> b <op2> c, the evaluation must proceed
>`as if' op1 were applied first, and then op2, etc.

When you do global optimization on the UNIX kernel, including device
drivers, "volatile" better work "right", regardless of what the standard
says or doesn't say.  Specifically, if you declare something volatile:
	a) Take the completely unoptimized version of the code, and consider
	the sequence of loads and stores from/to volatile variables.
	b) The optimized version of the code had better do exactly the same
	number, in the same sequence, of such loads and stores.
If it doesn't work exactly this way, life will be hellish for the kernel folks,
especially those writing device drivers.
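
A hypothetical driver fragment showing why this matters (the address and
bit are invented): with the volatile declaration, the optimized code must
perform one real load of the status register per iteration, in the same
order as the unoptimized code; without it, the load could be hoisted out
of the loop and the driver would spin forever.

	#define DEV_READY 0x01

	static volatile unsigned *status =
	        (volatile unsigned *)0xBF000010;   /* invented device address */

	void wait_for_device(void)
	{
	    while ((*status & DEV_READY) == 0)
	        ;   /* each test is an actual load from the device register   */
	}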

BTW, doing this right does not necessarily mesh well with classical optimization
technology; it's very hard to get right and still have aggressive
optimization; bugs are very subtle.  [We used to do "binary search" on the
kernel, i.e., optimize half of it; if it still worked, optimize half of
the rest, etc., until the offending module was found.  This was not fun.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086