[comp.software-eng] metrics and the SAT example

rcd@ico.isc.com (Dick Dunn) (05/23/91)

jls@netcom.COM (Jim Showalter) writes a bunch of generally on-target stuff,
but he uses one example that should caution us in our quest for decent
metrics...

> ...Think of metrics like the SAT college admissions test. It doesn't
> purport to measure intelligence, it just claims to be a reasonably accurate
> predictor of success in college. The evidence supports this claim: SOMETHING
> that has some bearing on success in college is being measured by the SAT's,
> since those with lower scores tend to do worse in college...

OK, I don't argue with the SAT's success there, but consider:  What is
"success in college"?  Generally it's a matter of satisfying another set of
metrics which purport to be related to the acquisition of knowledge and
skills.  HOWEVER, these metrics (grades, oversimplifying a bit) are also
indirect measures.  So, for the SAT to work, all it has to do is measure a
student's ability to perform well according to the college metrics; it may
not mean squat about what a student is actually going to get out of
college.  In particular, both the SAT and college grades often reflect
one's ability to take multiple-choice tests.  (One of my favorite examples
is that I've done well on a multiple-choice test in basic French, in spite
of never having learned the language...hell, I can barely read a Bordeaux
label.)

Now, note that this doesn't make the SAT inaccurate--it DOES predict what
it's supposed to predict (for whatever reason), just as Jim said.  But we
have to be careful that the metrics, particularly if two levels deep,
predict something useful in the end result.
-- 
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Simpler is better.

jls@netcom.COM (Jim Showalter) (05/23/91)

>Now, note that this doesn't make the SAT inaccurate--it DOES predict what
>it's supposed to predict (for whatever reason), just as Jim said.  But we
>have to be careful that the metrics, particularly if two levels deep,
>predict something useful in the end result.

Agreed. It may seem silly, but my basic point is that if it turned out
that there was a very strong correlation between a metric that measured
the monthly rutabaga consumption of a development team and that team's
success on a project, then it is quite arguable that rutabaga consumption
is a VALID metric. I realize this is a reductio ad absurdum, but consider
this: the SAT's only really measure one's ability to take multiple-choice
tests...and yet there is apparently a very strong correlation between
that ability and one's success in college (this may well say something very
nasty about the state of education in this country, but that's for another
newsgroup!). Hopefully we can even do BETTER than rutabaga consumption,
but if it works, what the hell...

This is all a rather murky area, really. My father pointed out one time
that the statement "it only provides symptomatic relief" was stupid: if
the symptoms of a broken arm are pain, bone jutting from muscle, and an
inability to lift objects with the arm, then relieving those symptoms
is the same as curing the problem--so what's the objection? Similarly,
if SAT's only measure the ability to take multiple-choice tests, but this
predicts success in college, then that's functionally equivalent to
directly measuring college-success-ish-ness. You've provided symptomatic
relief.

The real danger is in arguing backwards from a metric. For example, I
remember reading somewhere that 90% of all violent felons in prison
were determined to have eaten potatoes in one form or another within
the 48 hours preceding their commission of the crime for which they
were convicted. Obviously, then, potato consumption is a valid metric
for violent criminal behavior... ;-)
-- 
**************** JIM SHOWALTER, jls@netcom.com, (408) 243-0630 ****************
*Proven solutions to software problems. Consulting and training on all aspects*
*of software development. Management/process/methodology. Architecture/design/*
*reuse. Quality/productivity. Risk reduction. EFFECTIVE OO usage. Ada/C++.    *

frank@grep.co.uk (Frank Wales) (05/25/91)

In article <1991May23.014904.5896@netcom.COM> jls@netcom.COM (Jim Showalter) writes:
>It may seem silly, but my basic point is that if it turned out
>that there was a very strong correlation between a metric that measured
>the monthly rutabaga consumption of a development team and that team's
>success on a project, then it is quite arguable that rutabaga consumption
>is a VALID metric.

It must be possible to establish credible causality too, otherwise you can't
be sure what you're measuring.  Say you notice that levels of ice-cream 
consumption correlate strongly with deaths at the beach.  Does this mean
ice-cream is a killer?  Not if you realise that both variables have
a common influence, such as sunny weather.
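
[A numeric sketch of the confounder above, for the curious: sunshine is the hidden common cause, and neither simulated variable depends on the other. All coefficients are invented for illustration.]

```python
import random

random.seed(0)

days = 1000
# Hidden common influence: how sunny each day is.
sunshine = [random.random() for _ in range(days)]
# Ice-cream sales and beach deaths each track sunshine -- but not each other.
ice_cream = [10 * s + random.gauss(0, 1) for s in sunshine]
deaths = [5 * s + random.gauss(0, 1) for s in sunshine]

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Strongly positive, despite zero causal connection between the two.
print(corr(ice_cream, deaths))
```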

>My father pointed out one time
>that the statement "it only provides symptomatic relief" was stupid: if
>the symptoms of a broken arm are pain, bone jutting from muscle, and an
>inability to lift objects with the arm, then relieving those symptoms
>is the same as curing the problem--so what's the objection?

I take it your father wasn't a doctor, then? :-)  Taking Tylenol whenever
you have a headache doesn't cure your brain tumour.

>Obviously, then, potato consumption is a valid metric
>for violent criminal behavior... ;-)

Indeed.  Just like looking at local death rates makes hospitals hazardous.
--
Frank Wales, Grep Limited,             [frank@grep.co.uk<->uunet!grep!frank]
Kirkfields Business Centre, Kirk Lane, LEEDS, UK, LS19 7LX. (+44) 532 500303

jls@netcom.COM (Jim Showalter) (05/25/91)

>It must be possible to establish credible causality too, otherwise you can't
>be sure what you're measuring.  Say you notice that levels of ice-cream 
>consumption correlate strongly with deaths at the beach.  Does this mean
>ice-cream is a killer?

Not at all. But it DOES mean ice-cream is a good predictor for beach-deaths,
which was precisely my point. Thanks for providing another example to support
my thesis! :-)

>>My father pointed out one time
>>that the statement "it only provides symptomatic relief" was stupid: if
>>the symptoms of a broken arm are pain, bone jutting from muscle, and an
>>inability to lift objects with the arm, then relieving those symptoms
>>is the same as curing the problem--so what's the objection?

>I take it your father wasn't a doctor, then? :-)  Taking Tylenol whenever
>you have a headache doesn't cure your brain tumour.

My dad was only pointing it out for broken arms. It doesn't work for
everything.
-- 
**************** JIM SHOWALTER, jls@netcom.com, (408) 243-0630 ****************
*Proven solutions to software problems. Consulting and training on all aspects*
*of software development. Management/process/methodology. Architecture/design/*
*reuse. Quality/productivity. Risk reduction. EFFECTIVE OO usage. Ada/C++.    *

adamksh@ip2020.Berkeley.EDU (Adam Kao (KSh)) (05/26/91)

In article <1991May25.053304.10445@netcom.COM>, jls@netcom.COM (Jim Showalter) writes:

[attribution lost]
>>It must be possible to establish credible causality too, otherwise you can't
>>be sure what you're measuring.  Say you notice that levels of ice-cream 
>>consumption correlate strongly with deaths at the beach.  Does this mean
>>ice-cream is a killer?

>Not at all. But it DOES mean ice-cream is a good predictor for beach-deaths,
>which was precisely my point. Thanks for providing another example to support
>my thesis! :-)

No.  Please, take the argument one step further.  Imagine that a
high-level committee is formed for the purpose of reducing deaths at
the beach.  Upon discovering the correlation above, the committee
promptly imposes limits upon the consumption of ice-cream.
Surprisingly, nothing happens.  (More likely, the committee acts in
late September, and then declares victory as beach deaths decline.)

We don't go around discovering correlations just for fun.  We usually
wish to draw conclusions about actions we should take to reach a
desired outcome.  We don't want software metrics just to predict when
our project will fail (as we stand helpless); we want to be able to
prevent a possible project failure.

What most people don't understand is that we must establish causality
before we can know what action to take.  Correlations do not establish
causality.
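
[Adam's committee can be simulated too. In this toy model -- every number invented -- deaths are caused by sunshine alone; capping the correlated proxy (ice-cream) changes nothing, while acting on an actual cause (lifeguards) does.]

```python
import random

random.seed(1)

days = 1000
sunshine = [random.random() for _ in range(days)]

def total_deaths(lifeguards=False, ice_cream_cap=None):
    """Toy causal mechanism: deaths follow sunshine (more swimmers),
    reduced by lifeguards.  ice_cream_cap is accepted but deliberately
    unused -- the proxy has no place in the mechanism."""
    rescue = 2.0 if lifeguards else 0.0
    return sum(max(0.0, 5 * s - rescue + random.gauss(0, 1))
               for s in sunshine)

baseline = total_deaths()
banned = total_deaths(ice_cream_cap=0.0)  # committee acts on the proxy
guarded = total_deaths(lifeguards=True)   # someone acts on a cause

print(baseline, banned, guarded)  # banned ~ baseline; guarded is far lower
```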


>>>My father pointed out one time
>>>that the statement "it only provides symptomatic relief" was stupid: if
>>>the symptoms of a broken arm are pain, bone jutting from muscle, and an
>>>inability to lift objects with the arm, then relieving those symptoms
>>>is the same as curing the problem--so what's the objection?

>>I take it your father wasn't a doctor, then? :-)  Taking Tylenol whenever
>>you have a headache doesn't cure your brain tumour.

>My dad was only pointing it out for broken arms. It doesn't work for
>everything.

This is exactly the point.  Now that you admit symptomatic relief
doesn't work for everything, you must show that symptomatic relief
does work for software.


Adam

frank@grep.co.uk (Frank Wales) (05/28/91)

JS == jls@netcom.COM (Jim Showalter) == JS:
ME == me:
ME>It must be possible to establish credible causality too, otherwise you
ME>can't be sure what you're measuring.  Say you notice that levels of 
ME>ice-cream consumption correlate strongly with deaths at the beach.
ME>Does this mean ice-cream is a killer?

JS>Not at all. But it DOES mean ice-cream is a good predictor for
JS>beach-deaths, which was precisely my point. Thanks for providing another
JS>example to support my thesis! :-)

"It's dead, Jim!"  :-) Beach deaths are already an excellent measure of
beach deaths.  There is little point in obtaining others unless they buy
you something; for example, insight or understanding.  My concern here
is not with whatever other auxiliary numbers can be obtained that mean
the same thing as the original statistics; it's what people *do* with
these other numbers, and especially what they attempt to intuit from 
the relationship between them.  For example, if people attempt to cure the
"beach-death problem" by restricting ice-cream sales on the basis of
these data, they've only bought disappointment and frustration.

More to the point, such bogus applications of non-causal correlations
undermine similar, perhaps more valid, measures, and that is a real cost;
for a start, it makes it harder to convince people of the value of metrics.

JS>My father pointed out one time
JS>that the statement "it only provides symptomatic relief" was stupid: if
JS>the symptoms of a broken arm are pain, bone jutting from muscle, and an
JS>inability to lift objects with the arm, then relieving those symptoms
JS>is the same as curing the problem--so what's the objection?

ME>I take it your father wasn't a doctor, then? :-)  Taking Tylenol whenever
ME>you have a headache doesn't cure your brain tumour.

JS>My dad was only pointing it out for broken arms. It doesn't work for
JS>everything.

Indeed.  Like helping to 'cure' ailing software projects, for example.
--
Frank Wales, Grep Limited,             [frank@grep.co.uk<->uunet!grep!frank]
Kirkfields Business Centre, Kirk Lane, LEEDS, UK, LS19 7LX. (+44) 532 500303

hlavaty@CRVAX.Sri.Com (05/28/91)

In article <1991May24.192101.22317@grep.co.uk>, frank@grep.co.uk (Frank Wales) writes...
>In article <1991May23.014904.5896@netcom.COM> jls@netcom.COM (Jim Showalter) writes:
> 
>It must be possible to establish credible causality too, otherwise you can't
>be sure what you're measuring.  Say you notice that levels of ice-cream 
>consumption correlate strongly with deaths at the beach.  Does this mean
>ice-cream is a killer?  Not if you realise that both variables have
>a common influence, such as sunny weather.

While causality is important and helpful, I don't agree that its absence prevents
the use of the metric.  In your example, if you wanted to prevent deaths at the
beach, you could use the ice cream metric to alert you when more deaths at the
beach were imminent.  Then you could go to the beach and figure out what was
going on.  While a simple (and silly) example, the same principle applies to 
something more arcane like the health and status of an integration effort.  If
you have noticed that more overtime usually indicates that your effort is in
trouble, you certainly don't prevent people from working overtime.  You spend
some time going over your integration effort and try and figure out what the
problems are.  The major problem facing any software development is that so 
many potential things can go wrong that are *not* obvious, or that "seemed
like a good idea at the time" but turned out later to be short term thinking.
The quest for metrics is a search to find something (ANY something) that you
can demonstrate is a good indication of success or failure, OR allows you
more insight into the inner workings of the project.  An example of the 
former would be looking at overtime or number of compiles/day.  A good 
example of the latter would be tracking test cases completed over time and
comparing it to your original plan.
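
[The "alert, then investigate" use of a non-causal metric described above can be sketched in a few lines; the threshold and the hours are hypothetical, and the point is that crossing it triggers a closer look, never a ban on overtime.]

```python
# Hypothetical team-wide weekly overtime threshold (hours).
OVERTIME_ALERT_HOURS = 60

def weeks_to_investigate(weekly_overtime):
    """Return the indices of weeks whose overtime warrants a closer look."""
    return [i for i, hours in enumerate(weekly_overtime)
            if hours > OVERTIME_ALERT_HOURS]

print(weeks_to_investigate([10, 25, 70, 95]))  # weeks 2 and 3 trip the alert
```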

>Indeed.  Just like looking at local death rates makes hospitals hazardous.

Well, yes.  Being in a hospital is a good metric to indicate that your 
chances of dying in the near future MAY have gone up.  So you look closer.
Why am I here?  If it's because I broke my foot, I relax and conclude that
there is no cause for concern.  If it's because I am having an operation
that requires me to be knocked out, I get a little more concerned and may
look at this hospital's track record for 1) knocking people out and not 
killing them and 2) their overall success at this type of operation.

Applying the analogy directly to software development, let's say I have been
tracking the success rate of all my programmers  (How I am doing this or what
my definition of success is I will leave to future discussions).  Now, when I
realize that a particular module that I am concerned with is being worked on
by a programmer with a demonstrated poor performance, I get concerned and
proceed to look closer at the situation.  Has that programmer ever done 
something similar to this module?  How did he do on that one?  What did he 
learn (if anything) from his mistakes?  After going through this process I 
may decide that no action is warranted, or that some help is in order, or that
the module should be given to someone else entirely.  In fact, here I am not
concerned too much with causality.  Last time the programmer tried this it got
all bungled up.  I am now concerned, whether or not I know why he bungled it
up.  If I know why it got bungled, I can possibly make a more educated
decision.  If I don't know why it got bungled, I can decide to 1) watch the
development very closely this time and try and figure out why the programmer
has problems or 2) decide the module is too important to risk and give it 
to someone else.  The metric is more valuable to me with causality, but still
useful even without it.

Jim Hlavaty

jls@netcom.COM (Jim Showalter) (05/29/91)

>Beach deaths are already an excellent measure of
>beach deaths.

Granted, just as project failures are an excellent measure of project
failures. But so what? We are trying to intervene BEFORE the death
occurs or the project fails.

>More to the point, such bogus applications of non-causal correlations
>undermine similar, perhaps more valid, measures,

Also granted. Now, I invite you to provide me with a list of causal
correlations with respect to software projects. All the ones I know
of are based on experience and intuition into the process of software
development--and are good metrics--but none of them have, to the best
of my knowledge, ever been proven to be the CAUSE of a failed project.
-- 
**************** JIM SHOWALTER, jls@netcom.com, (408) 243-0630 ****************
*Proven solutions to software problems. Consulting and training on all aspects*
*of software development. Management/process/methodology. Architecture/design/*
*reuse. Quality/productivity. Risk reduction. EFFECTIVE OO usage. Ada/C++.    *