[comp.arch] Condition number of uniform random matrices; residuals

dgh@validgh.com (David G. Hough on validgh) (06/18/91)

jbs@ibm.com asserts, about solving linear equations with
matrices whose elements are drawn from a uniform random distribution:

> Problems
> of size 1000 appear to typically have condition number between 10**5
> and 10**6.

I haven't observed any evidence to support this.  If uniform random 1000x1000
matrices had such condition numbers then it would be impossible to solve
them accurately with Gaussian elimination.  I did report that 
Gaussian elimination without pivoting was somewhat unstable on matrices
of that size, but that only reflects on the stability of that method,
rather than the underlying condition of the problem, independent of the
method.
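
A quick way to check the disputed numbers empirically (an illustrative
sketch using numpy; the interval (-1,1) and the choice of norms are
assumptions, since the posts don't pin them down):

        import numpy as np

        rng = np.random.default_rng(0)
        for n in (100, 1000):
            for _ in range(3):
                A = rng.uniform(-1.0, 1.0, size=(n, n))   # interval is an assumption
                # 2-norm and 1-norm condition numbers; the posts don't say which is meant
                print(n, "%.3g" % np.linalg.cond(A), "%.3g" % np.linalg.cond(A, 1))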

For instance, the normalized residual computed in the linpack program,

        RESIDn = RESID/( N*NORMA*NORMX*EPS )

grows very slowly; typically 1.5-2 for 100x100 systems, and around 10
for 1000x1000 systems, using IEEE arithmetic and conventional partial
pivoting.  So the growth of RESID itself is only slightly faster than linear 
with N.  Of course the size of the residual b-Ax doesn't tell you much
about the size of the error in x in general.   Luck is with us in this case,
however: for single-precision matrices, the measured difference between the
computed x's and the nominally correct x's, whose elements are all exactly
one, grows in about the same ratio, e.g. about 1e-4 for 1000x1000 vs. 1e-5
for 100x100.
Of course the actual correct answer for this problem isn't a vector of 1's
since the right hand side b was computed by rounding the result of computing
A*(1 1 1 1 1 ...) and so the computed b differs from the one corresponding
to the vector of 1's, and correspondingly the correct x corresponding to
the right hand side b actually used is somewhat different - but fortunately
that doesn't matter much for this well-conditioned problem.
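
A rough recreation of that experiment (a sketch, not the LINPACK driver
itself; the uniform (-1,1) entries and the norm conventions below are
assumptions):

        import numpy as np

        def linpack_style_run(n, dtype=np.float32, seed=0):
            rng = np.random.default_rng(seed)
            A = rng.uniform(-1.0, 1.0, size=(n, n)).astype(dtype)
            b = A @ np.ones(n, dtype=dtype)        # rounded right-hand side, as noted above
            x = np.linalg.solve(A, b)              # LU with partial pivoting
            eps = np.finfo(dtype).eps
            resid = np.abs(b - A @ x).max()
            residn = resid / (n * np.abs(A).max() * np.abs(x).max() * eps)
            return residn, np.abs(x - 1.0).max()   # normalized residual, error vs. nominal x

        for n in (100, 1000):
            print(n, linpack_style_run(n))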

But this is a good illustration of a point I made previously: in general
you don't know the error, because that would be tantamount to knowing the
answer, and most people don't spend most of their time solving problems
with known answers.  So the normalized residual is a good indication of
whether an algorithm has solved a linear equations problem about as well as
can be expected in a particular floating-point precision.  If an answer
with a small error bound is desired,
regardless of the condition of the problem, something else will have to be
done.

This reminds me of a problem that was posed to me several days in advance
of my qualifying examination:

A very eminent applied mathematician and one of his students
had submitted a paper to a prestigious
journal showing experimentally that all the theory about iterative 
improvement must be wrong.

(Iterative improvement is a method for improving the accuracy of the answer
x for linear equation problems that are moderately ill-conditioned.
When the residual can be computed in higher precision, the accuracy of x
can often be improved to nearly full working precision.)
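
In code, the scheme being described looks roughly like this (a sketch under
stated assumptions: a single-precision factorization that is reused, the
residual accumulated in double precision, scipy used only for the reusable
LU factors; A32 and b32 are hypothetical single-precision inputs):

        import numpy as np
        from scipy.linalg import lu_factor, lu_solve

        def iterative_improvement(A32, b32, steps=3):
            lu, piv = lu_factor(A32)                   # factor once, in single precision
            x = lu_solve((lu, piv), b32)
            for _ in range(steps):
                # residual computed in higher (double) precision, as described above
                r = b32.astype(np.float64) - A32.astype(np.float64) @ x.astype(np.float64)
                x = x + lu_solve((lu, piv), r.astype(np.float32))
            return x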

The authors had computed solutions to problems from
several collections of test matrices of varying condition,
and to a number of random matrices, and had seldom seen iterative improvement
yield more than a digit of improvement.

What was really going on here?  I couldn't figure it out before the
qualifying exam, but for better or worse I passed anyway.
-- 

David Hough

dgh@validgh.com		uunet!validgh!dgh	na.hough@na-net.ornl.gov

jbs@WATSON.IBM.COM (06/19/91)

         David Hough said:
jbs@ibm.com asserts, about solving linear equations with
matrices whose elements are drawn from a uniform random distribution:
> Problems
> of size 1000 appear to typically have condition number between 10**5
> and 10**6.
I haven't observed any evidence to support this.  If uniform random 1000x1000
matrices had such condition numbers then it would be impossible to solve
them accurately with Gaussian elimination.  [...]
Of course the actual correct answer for this problem isn't a vector of 1's
since the right hand side b was computed by rounding the result of computing
A*(1 1 1 1 1 ...) and so the computed b differs from the one corresponding
to the vector of 1's [...]

         I tried 10 random matrices of size 1000 and observed condition
numbers from 97627 to 1124910.  I believe the actual 1000 by 1000 matrix
generated in the linpack benchmark has condition number 368554.
What do you observe it to be?
         A bad condition number does not guarantee you will be unable
to solve the problem accurately.  The matrix

        ( e  0 )
        ( 0  1 )

has condition number 1/e but there are no numeric problems.
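
In concrete terms (illustration only; the value of e is arbitrary):

        import numpy as np

        e = 1e-12
        A = np.diag([e, 1.0])
        print(np.linalg.cond(A))                         # about 1/e = 1e12
        print(np.linalg.solve(A, np.array([e, 1.0])))    # exactly [1. 1.]
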
         For the linpack matrix the entries are such (a/16384, with a an
integer between -32768 and 32768) that the row sums are generally computed
exactly (especially with the 64-bit formats), so (1,...,1) is the exact answer.
Otherwise comparisons between different floating-point formats would be unfair.
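
That exactness is easy to check (a sketch; the random integers below merely
stand in for the benchmark's actual generator):

        import numpy as np
        from fractions import Fraction

        rng = np.random.default_rng(0)
        a = rng.integers(-32768, 32769, size=1000)       # one row's numerators
        row = a.astype(np.float64) / 16384.0             # every entry is exact in binary
        exact = sum(Fraction(int(k), 16384) for k in a)
        print(row.sum() == float(exact))                 # True: the row sum is exact in 64-bit
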
         On the general question of how fast the condition number grows
with n, I found a somewhat relevant reference ("The Probability that a
Numerical Analysis Problem is Difficult", James W. Demmel, Mathematics of
Computation, Vol. 50, 1988, pp. 449-480).  This paper is a bit obscure, but
it seemed to be saying that if we consider complex matrices instead of real
ones, normal distributions about 0 instead of uniform, and a somewhat unusual
definition of condition number, then the median condition number grows
asymptotically like n**1.5.  It also suggested that the real case is worse
conditioned.  If anyone knows of better results I would be interested.
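
For what it's worth, a small experiment in that direction (a sketch; the
second quantity, the Frobenius norm of A times the 2-norm of its inverse, is
only a guess at the paper's non-standard condition number):

        import numpy as np

        rng = np.random.default_rng(0)
        for n in (50, 100, 200, 400):
            sample = [rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
                      for _ in range(20)]
            k2 = np.median([np.linalg.cond(A) for A in sample])
            kf = np.median([np.linalg.norm(A, 'fro') * np.linalg.norm(np.linalg.inv(A), 2)
                            for A in sample])
            print(n, "%.3g" % k2, "%.3g" % kf, "%.3g" % (kf / n**1.5))
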
         David Hough said (in a later post)
jbs@ibm.com says:
> I believe IEEE chopped will be easier to implement and perform
> better than IEEE rounded.
This is true, but only marginally.  The most important cost in IEEE (or VAX)
floating-point arithmetic is getting the correct unrounded result,
or at least enough of it to permit correct rounding.  This requires
provision of guard and (for chopping or unbiased rounding) sticky bits,
postnormalization,
development of full doubled-precision products,
careful division and sqrt, etc.
Once you've gotten the unrounded answer, rounding (as opposed to chopping)
costs an extra add time and dealing with the possibility of a carry out
of the most significant bit.  Although these are real design costs, I doubt
they've affected performance of modern pipelined IEEE designs compared to
the cost of getting the unrounded result.

         Well, as I pointed out in an earlier post, the performance of the
IBM RISC design was seriously affected by the need to round.  The performance
impact was reduced by the multiply-add instruction (which only rounds once)
and by the ability of the multiply-add unit to accept an operand with a
pending round.  Just adding a stage to the floating-point pipeline is not
satisfactory for a scalar design, since the latency is very important.
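
To make the chopping-versus-rounding step concrete, here is the difference
in toy form (a sketch operating on an exact integer significand with at
least one extra low-order bit still attached; not anyone's actual hardware):

        def chop_bits(sig, extra):
            # chopping: just discard the extra low-order bits
            return sig >> extra

        def round_to_nearest_even(sig, extra):
            # rounding: inspect the guard bit and a sticky OR of the rest,
            # then possibly add one (which may carry out of the top bit)
            kept = sig >> extra
            guard = (sig >> (extra - 1)) & 1
            sticky = (sig & ((1 << (extra - 1)) - 1)) != 0
            if guard and (sticky or (kept & 1)):
                kept += 1
            return kept

        # keep 3 of 6 significand bits: the discarded .101 is over half an ulp
        print(bin(chop_bits(0b101101, 3)), bin(round_to_nearest_even(0b101101, 3)))
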
         David Hough said:
In general, any process that involves repeated inexact conversions,
particularly binary->decimal->binary->decimal, or addition of many small
inexact quantities, will benefit from the better statistics of rounding
over chopping - drifting tendencies will be reduced, and error bounds
will be smaller.

         If your conversion routines are written properly and you output
sufficient decimal digits, binary->decimal->binary should not drift.  I
don't believe rounding generally gives smaller error bounds.  If you add
a bunch of positive numbers, both methods lead to an interval containing
the true answer.  Chopping gives the lower end of the interval, rounding
the middle.  The true answer is statistically much more likely to
lie near the center of the interval but there is no guarantee of this.
Also I believe the statistics of chopping improve if .5 ulp is added to
each operand before every operation.  I believe this is still cheaper
to implement than rounding.
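
The two positions can be put side by side with a toy experiment (a sketch;
Python's decimal module emulating a 7-digit machine with truncation versus
round-to-nearest-even):

        from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

        chopping = Context(prec=7, rounding=ROUND_DOWN)
        rounding = Context(prec=7, rounding=ROUND_HALF_EVEN)

        term = Decimal("0.0001234567")           # exact in 7 digits; the sums are not
        exact = 100000 * term                    # exact: an integer times a 7-digit value
        s_chop = s_round = Decimal(0)
        for _ in range(100000):
            s_chop = chopping.add(s_chop, term)
            s_round = rounding.add(s_round, term)
        # chopping always lands at or below the true sum; rounding straddles it
        print(exact, s_chop, s_round)
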
         David Hough said:
For instance, during the 1970's one of the minor regional stock
exchanges noticed that its market stock average was declining in a bull
market in which most of the component stocks were rising in value.
There were a number of factors at work in this phenomenon; using
chopping arithmetic was one of them.

         Do you have an exact reference for this story?
         David Hough said:
Using enough precision can sometimes overcome deficiencies of chopping
or dirty arithmetic, otherwise Seymour Cray would have done things
differently. But it isn't always enough, as most Cray customers find
out occasionally.

         Well, "sometimes" is really almost always.  Also, Cray arithmetic
is an extreme example; problems with Cray arithmetic would not necessarily
occur with chopping alone.
                       James B. Shearer

dik@cwi.nl (Dik T. Winter) (06/20/91)

Some initial comments:
I think fp arithmetic is good if it has the Brown model properties.
I.e. given a mathematical operation o and its fp approximation o', then
a.  There should never be a machine number between o' applied to its operands
    and o applied to its operands (i.e. the error is at most 1 ulp).
b.  If o applied to its operands is representable as a machine number, that
    machine number should be returned when o' is applied to its operands
    (i.e. the error is never exactly 1 ulp).
So in all cases the error is less than 1 ulp.
These properties allow truncation, rounding to nearest, random rounding, etc.
Many machines fail to satisfy these properties.
FP arithmetic is very good if rounding to nearest is done.
Round to nearest allows some neat algorithms, for instance T. J. Dekker's
routines (1966-1970 timeframe) to perform double-precision arithmetic using
only single precision.  (The resultant arithmetic is *not* properly rounded,
as in many cases more information is kept than would result from proper
rounding.  So the algorithms cannot be applied recursively.)
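
One small example of the kind of trick such algorithms rest on (a sketch of
Knuth's branch-free two-sum rather than Dekker's original; it depends on
round-to-nearest and the absence of overflow):

        def two_sum(a, b):
            # returns s, err with s = fl(a + b) and s + err == a + b exactly
            s = a + b
            t = s - a
            err = (a - (s - t)) + (b - t)
            return s, err

        print(two_sum(1.0, 1e-20))               # (1.0, 1e-20): the rounding error is recovered
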
Having said that:

In article <9106190421.AA02128@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes:
 > Also I believe the statistics of chopping improve if .5 ulp is added to
 > each operand before every operation.

This technique (or something similar) is known and used in the 60-bit CDC
Cybers and in Crays.  I have no CDC or Cray manuals here at home, so I do not
know the exact technique they use.  But:
Note first that it is not necessary to tack a bit onto both numbers; tacking
a bit onto only the operand with the larger magnitude differs only in those
cases where there is a tie between the nearest machine numbers.  Cray tacks
slightly more than 1 bit onto the mantissa of the operand with the larger
magnitude.  The reasoning is that, loosely speaking, 'this makes it
statistically approximate round to nearest', in some measure.  I have my
doubts.

The problem is that when you subtract and there is cancellation of high-order
bits, the tacked-on bit can become visible.  Try this on a t-bit machine,
subtracting 2^t - 1 from 2^t.  When tacking a 1 bit onto both numbers
the result is 1.5, which violates property b.  Using this method your machine
will only satisfy the above model with one bit of precision.
Try (1.0-0.75)-0.25.  The result is non-zero.
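
Both failures are easy to reproduce with a toy model (an illustrative
construction: T mantissa bits, exact rational arithmetic standing in for a
wide adder, half an ulp 'tacked' onto each operand, and truncation of the
exact result):

        from fractions import Fraction

        T = 24                                    # toy mantissa width (arbitrary)

        def ulp(x):
            # unit in the last place at the binade of the nonzero value x
            e, ax = 0, abs(x)
            while ax >= 2:
                ax, e = ax / 2, e + 1
            while ax < 1:
                ax, e = ax * 2, e - 1
            return Fraction(2) ** (e - (T - 1))

        def tack(x):
            # append a 1 bit just beyond the last mantissa place (half an ulp)
            return x if x == 0 else x + ulp(x) / 2 * (1 if x > 0 else -1)

        def chop(x):
            # truncate an exact result toward zero to T mantissa bits
            if x == 0:
                return x
            u = ulp(x)
            return (abs(x) // u) * u * (1 if x > 0 else -1)

        def sub(a, b):
            return chop(tack(a) - tack(b))

        print(sub(Fraction(2**T), Fraction(2**T - 1)))                # 3/2, true answer is 1
        print(sub(sub(Fraction(1), Fraction(3, 4)), Fraction(1, 4)))  # nonzero, true answer is 0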

However, worse than tacking 1 bits onto numbers is when the final result is
{truncated,rounded} first and normalized later (as many machines do).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl