[mod.computers.vax] Performance measurement - long

Magill@upenn.CSNET (CETS Operations Manager) (05/15/86)

The subject of performance measurement is lengthy and detailed. I have
"ridden herd" over several different operating systems in my career in
computing (dating back to 1963 and an IBM 1620). I will not go into lots
of detail, but just give some typical things which people and machines
see differently.

First, response time. Typically, the definition that a person uses for response
time is "from when I hit return until I get my answer back". Typically, the
definition a machine uses (in programs such as MONITOR) is
"how long between the last time I serviced this turkey and the next time I
service this turkey." In other words, people look at the time it takes to get
the complete job done and machines look at the time it takes to do little
pieces of the job - apples and oranges. Obviously if you wish to measure the
human's view from a terminal you must measure AT THE TERMINAL. Any measurement
taken at the CPU ignores transmission and display times. There are black boxes
available to instrument this measurement. The "DynaMeter" is one which comes
to mind.
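
One way to see the apples-and-oranges gap is to compute both numbers from
the same session timeline. A minimal sketch (all timestamps and function
names below are hypothetical, illustrative only):

```python
# Hypothetical timeline for one command: the user hits return at t=0.0,
# the CPU services the process at t=0.4, 0.7, and 1.1, and the answer
# finishes painting on the terminal at t=1.6 (seconds).

def end_to_end_response(t_return, t_displayed):
    """What the human measures at the terminal: return key to complete
    answer, including transmission and display time."""
    return t_displayed - t_return

def service_intervals(service_times):
    """What a CPU-side monitor tends to report: the gaps between
    successive times the scheduler serviced this process."""
    return [b - a for a, b in zip(service_times, service_times[1:])]

human_view = end_to_end_response(0.0, 1.6)
machine_view = service_intervals([0.4, 0.7, 1.1])
print(human_view)    # one number covering the whole job
print(machine_view)  # several small numbers covering pieces of it
```

The two figures measure genuinely different things, which is why a number
taken at the CPU cannot substitute for an at-the-terminal measurement.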

What do people really mean by response time? Many studies, both formal and
informal, have shown that the most important factor for a system user is
not that a system responds in .1 seconds or 5 seconds, but that it ALWAYS
responds in .1 or 5 seconds! Consistency, predictability: these are the
things that people really mean by response time. (I am assuming, of course,
that you have a system where the work to be done can be accomplished in a
reasonable period of time.) They want to know that it will take 30 seconds
to inquire into a data base about someone's reservation, or that it will
take 5 seconds to load the mail program to send a one line message.
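
The point about consistency can be made with two hypothetical sets of
measured response times that share the same average:

```python
import statistics

# Two hypothetical systems, same mean response time, very different feel.
steady  = [5.0, 5.1, 4.9, 5.0, 5.0]    # ALWAYS about 5 seconds
erratic = [0.1, 12.0, 0.1, 12.0, 0.8]  # averages 5 seconds, unpredictable

for name, times in (("steady", steady), ("erratic", erratic)):
    print(name, statistics.mean(times), round(statistics.stdev(times), 2))
```

Both report a 5 second average, but only the first is the one users will
call good response time; the spread, not the mean, is what they notice.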

On systems capable of supporting only a few users (5-10 simultaneous) response 
time tends to be a direct function of what computing each user is doing at any 
given time. On systems capable of supporting 100-150 simultaneous users, 
response time tends to be related to the number of users logged on. This is a
particularly onerous situation. An 8650 with 10 people logged on will provide
virtually instantaneous response to an interactive session. But as soon
as you load the sucker up with 75+ users, response time increases to 1 second.
This change is VISIBLE to someone running a 9600 baud terminal and typing a
file to the screen. (I am assuming direct connections, networks are a separate
issue.) The person who is used to running on a computer all by himself
begins to complain because the other people who use it are giving him 1
second response time. Now load that system up to 150 users and response
time rises to 3 seconds. If some of those users stay on when the total
drops back to the 75 level, they are in heaven, because response time has
dropped from 3 seconds to 1. Several operating systems offer the concept
of "guaranteed MINIMUM response time" to deal with this situation.
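
Why response time degrades so visibly as a big machine fills up can be
shown with a toy single-server queueing model. This is a standard
textbook approximation, not a claim about how VMS actually schedules,
and every parameter below is made up:

```python
def toy_response_time(n_users, per_user_load=0.005, service_s=0.05):
    """Toy queueing estimate: response time grows nonlinearly as
    utilization approaches 1. Illustrative numbers only."""
    utilization = min(n_users * per_user_load, 0.99)  # avoid divide-by-zero
    return service_s / (1.0 - utilization)

for n in (10, 75, 150):
    print(n, "users ->", round(toy_response_time(n), 3), "s")
```

The exact figures mean nothing; the shape does. Going from 10 to 75 users
costs little, but each user past that point costs more than the last,
which is exactly why the change becomes VISIBLE at a 9600 baud terminal.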

Now let's look at the terms interactive and batch.
Interactive computer usage and batch usage have different effects on response
time and on system usage. A computer is a funny resource. If you don't use
a cpu cycle, you can't store it away and save it for later, it's gone.
Anyone who operates a computing facility is operating a utility. You are 
providing a service for someone or many someones. Work load profiles vary
depending upon your situation. If you are in an academic environment
providing support for student class-related computing (not research), you
discover that the system is unused 80-90% of the time. Classes take up
a portion of the day, meal contracts another portion, sleep another, and
so on. Students wait until the night before an assignment is due to start.
In this situation, interactive means editing and short (5-10 cpu second)
compiles with 20-30 second job runs. Typically one sees something on the 
order of 1-3 cpu seconds per hour of connect time for freshmen, 3-5 for
sophomores, etc. Naturally this all depends upon assignments, but you
get the idea. What you have is a utility confronted with a PEAK LOAD DEMAND.
In this type of situation, a true interactive environment, the cpu MUST be
idle at least 50% of the time or you have no reserve left for the day before
the assignment is due!
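
The reserve-capacity rule is simple arithmetic. A sketch, assuming the
only thing you know is how much worse the worst night is than the average
(the function name and ratio are mine, for illustration):

```python
def required_idle_fraction(peak_to_average):
    """If peak demand runs peak_to_average times the average demand,
    average utilization must stay below 1/peak_to_average -- i.e. the
    cpu must sit idle at least this fraction of an average day."""
    return 1.0 - 1.0 / peak_to_average

# If the night before an assignment is due doubles the average load,
# the cpu must be idle at least half the time the rest of the term.
print(required_idle_fraction(2.0))
```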

Now if you throw graduate students doing research into the mix things get to 
be fun. This class of user is suddenly using MINUTES of cpu time per connect
hour. And if you add in production jobs run interactively instead of batch
(things like muther AI stuff in LISP), arggg.

In a batch based environment, however, things are just the opposite. 
Response time becomes turnaround time. How long does it take to perform the
update run across accounts receivable? Typically in a batch environment,
the goal is to load the cpu up to 100% with no idle time, because PEOPLE
are not sitting around at a terminal waiting for a response back. In a
batch environment, if you need to "re-load" the cpu, you simply do not
submit certain jobs until after others have completed, or you "pend" an
executing batch job to permit another to complete.

How do you measure all this? Well, the first and most obvious, but least
done, thing is to KNOW your user community. What do they really do with
the computer,
when do they do it, how do they do it, and what external forces (lunch, dinner
contracts, sleep) modify or shape the times when they use the system.
Armed with this knowledge and your own 9600 baud vt terminal, you can SEE
many response time situations if you simply use the system yourself. 
For VMS I find the MONITOR utility, especially the new one-screen display
in version 4, to be quite descriptive. DEC also has a product, whose name
I forget (it costs $$), which generates a Kiviat plot (4 axes and 4
variables) of system resource usage. It is a very useful way to determine
the real culprit when your system acts up. The secret to all of these,
however, is that you must use them frequently, if not continuously, so
that you know what "normal" is for any given workload, time of day, etc.
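
Knowing what "normal" is can even be mechanized, at least crudely. A
sketch (the function and the 3-sigma threshold are my own invention, not
any DEC tool):

```python
import statistics

def is_abnormal(reading, baseline, k=3.0):
    """Flag a reading more than k standard deviations away from the
    history of readings taken under comparable conditions (same workload,
    same time of day). k=3 is an arbitrary illustrative threshold."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(reading - mu) > k * sigma

# Hypothetical history of a response-time metric at this hour of day:
history = [5.0, 5.1, 4.9, 5.2, 4.8]
print(is_abnormal(5.05, history))  # within the normal spread
print(is_abnormal(20.0, history))  # the system is acting up
```

The hard part, as above, is collecting the baseline continuously enough
that "comparable conditions" means something.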

Another useful tool is your accounting. It will tell you what your load
is - how many users, how much cpu time, etc. Of course you have to run it
daily and write the programs to give you operations statistics from the
data, but if you need to justify system expansion to management, that is
the first place to look. You may discover that you simply need to
convince one third of the secretaries and their bosses to work from 5 to 1,
and another third to work from 1 to 9, to get three times as much work out
of your system without spending any more money! (Management will usually
let you upgrade the CPU when presented with that option; it is much
cheaper!)
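
The back-of-the-envelope behind that claim, with hypothetical shift hours:

```python
# One day-shift office uses the machine 8 of 24 hours; three staggered
# 8-hour shifts (9-5, 5-1, 1-9) cover the whole day on the same hardware.
single_shift_hours = 8
staggered_shift_hours = 3 * 8

throughput_multiplier = staggered_shift_hours / single_shift_hours
print(throughput_multiplier)
```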

William H. Magill
Operations Manager
Computing and Educational Technology Services
(Formerly the Moore School Computing Facility, home of the ENIAC)
University of Pennsylvania