Magill@upenn.CSNET (CETS Operations Manager) (05/15/86)
The subject of performance measurement is a very lengthy and detailed one. I have "ridden herd" over several different operating systems in my career in computing (dating back to 1963 and an IBM 1620). I will not go into lots of detail, but just give some typical things which people and machines see differently.

First, response time. Typically, the definition that a person uses for response time is "from when I hit return until I get my answer back." Typically, the definition a machine uses (in programs like MONITOR and the like) is "how long between the last time I serviced this turkey and the next time I service this turkey." In other words, people look at the time it takes to get the complete job done, and machines look at the time it takes to do little pieces of the job - apples and oranges. Obviously, if you wish to measure the human's view from a terminal, you must measure AT THE TERMINAL. Any measurement taken at the CPU ignores transmission and display times. There are black boxes available to instrument this measurement; the "DynaMeter" is one which comes to mind.

What do people really mean by response time? Many studies, both formal and informal, have shown that the most important factor for a system user is not that a system responds in .1 seconds or 5 seconds, but that it ALWAYS responds in .1 or 5 seconds! Consistency, predictability - these are the things that people really mean by response time. (I am assuming, of course, that you have a system where the work to be done can be accomplished in a reasonable period of time.) They want to know that it will take 30 seconds to inquire into a data base about someone's reservation, or that it will take 5 seconds to load the mail program to send a one-line message.

On systems capable of supporting only a few users (5-10 simultaneous), response time tends to be a direct function of what computing each user is doing at any given time.
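To make the "measure AT THE TERMINAL" point concrete, here is a toy sketch (modern Python, purely illustrative; the function names are my own invention, not any vendor's interface). The human's clock starts at the carriage return and stops only when the complete answer is back, so transmission and display time land inside the measurement:

```python
import time

def terminal_response_time(send_request, receive_full_reply):
    """Time the HUMAN's view of response: from 'hit return' until the
    complete answer is back, transmission and display included."""
    start = time.perf_counter()
    send_request()        # the carriage return goes out
    receive_full_reply()  # block until the WHOLE answer is on the screen
    return time.perf_counter() - start

# Stand-in for a host plus line that takes roughly 50 ms to deliver an answer:
elapsed = terminal_response_time(lambda: None, lambda: time.sleep(0.05))
```

A measurement taken at the CPU, by contrast, sees only the slices between services - which is exactly why the black boxes have to sit at the terminal end of the wire.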
On systems capable of supporting 100-150 simultaneous users, response time tends to be related to the number of users logged on. This is a particularly onerous situation. An 8650 with 10 people logged on will provide virtually instantaneous response to an interactive session. But as soon as you load the sucker up with 75+ users, response time increases to 1 second. This change is VISIBLE to someone running a 9600 baud terminal and typing a file to the screen. (I am assuming direct connections; networks are a separate issue.) This person, who is used to running on a computer all by himself, begins to complain because the other people who use it are giving him 1-second response time. Now load that system up to 150 users and find that response time is up to 3 seconds; some of these users remain on when the total number of users drops back to the 75 level... these people are in heaven, because response time has dropped from 3 seconds to 1. Several operating systems have available the concept of a "guaranteed MINIMUM response time" to deal with this situation.

Now let's look at the terms interactive and batch. Interactive computer usage and batch usage have different effects on response time and on system usage. A computer is a funny resource: if you don't use a CPU cycle, you can't store it away and save it for later - it's gone. Anyone who operates a computing facility is operating a utility. You are providing a service for someone, or many someones. Work load profiles vary depending upon your situation. If you are in an academic environment providing support for student class-related computing (not research), you discover that the system is unused 80-90% of the time. Classes take up a portion of the day, meal contracts another portion, sleep another, and so on. Students wait until the night before an assignment is due to start. In this situation, interactive means editing and short (5-10 CPU second) compiles with 20-30 second job runs.
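The 10-user-versus-75-user behaviour described above can be caricatured with a crude processor-sharing model (my assumption for illustration, not a model of the 8650's actual scheduler): N active users time-slicing one CPU stretch a request's wall-clock time by roughly a factor of N. The 13 ms figure below is invented to line up with the 1-second observation; real systems degrade worse than linearly once memory and paging get involved, which is how 150 users can yield 3 seconds rather than 2:

```python
def shared_response_time(cpu_seconds_needed, active_users):
    """Crude processor-sharing estimate: N users time-slicing one CPU
    stretch a request's wall-clock completion by a factor of N."""
    return cpu_seconds_needed * max(1, active_users)

# A trivial request needing ~13 ms of CPU (a made-up figure):
alone     = shared_response_time(0.013, 1)    # ~0.013 s, feels instantaneous
busy      = shared_response_time(0.013, 75)   # ~1 s, visible at 9600 baud
saturated = shared_response_time(0.013, 150)  # ~2 s by this (optimistic) model
```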
Typically one sees something on the order of 1-3 CPU seconds per hour of connect time for freshmen, 3-5 for sophomores, etc. Naturally this all depends upon assignments and so on, but you get the idea. What you have is a utility confronted with a PEAK LOAD DEMAND. In this type of situation, a true interactive environment, the CPU MUST be idle at least 50% of the time, or you have no reserve left for the day before the assignment is due! Now, if you throw graduate students doing research into the mix, things get to be fun. This class of user is suddenly using MINUTES of CPU time per connect hour. And if you add in production jobs run interactively instead of batch (things like monster AI stuff in LISP)... arggg.

In a batch-based environment, however, things are just the opposite. Response time becomes turnaround time: how long does it take to perform the update run across accounts receivable? Typically in a batch environment, the goal is to load the CPU up to 100% with no idle time, because PEOPLE are not sitting around at a terminal waiting for a response back. In a batch environment, if you need to "re-load" the CPU, one simply does not submit certain jobs until after others have completed, or one "pends" an executing batch job to permit another to complete.

How to measure all this? Well, the first and most obvious, but least done, thing is to KNOW your user community. What do they really do with the computer, when do they do it, how do they do it, and what external forces (lunch, dinner contracts, sleep) modify or shape the times when they use the system? Armed with this knowledge and your own 9600 baud VT terminal, you can SEE many response time situations if you simply use the system yourself. For VMS I find the MONITOR utility, especially the new one-screen display in version 4, to be quite descriptive. DEC also has a product, whose name I forget (it costs $$), which generates a Kiviat plot (4 axes and 4 variables) of system resource usage.
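The connect-hour figures above turn into CPU load with a little arithmetic. A sketch (the head-counts are hypothetical; the per-class rates are the ones quoted above): each simultaneous session drawing s CPU-seconds per connect hour consumes s/3600 of one CPU, so the graduate students dominate even at a tenth the head-count:

```python
def cpu_utilization(simultaneous_users, cpu_seconds_per_connect_hour):
    """Fraction of one CPU consumed by a class of users, each drawing
    cpu_seconds_per_connect_hour of CPU time per hour logged on."""
    return simultaneous_users * cpu_seconds_per_connect_hour / 3600.0

freshmen = cpu_utilization(100, 3)    # 100 freshmen at 3 cpu-s/hr   -> ~8%
grads    = cpu_utilization(10, 300)   # 10 grads at 5 cpu-MINUTES/hr -> ~83%
# Together roughly 92% of the machine - and no reserve left for the
# night before the assignment is due, hence the 50%-idle rule of thumb.
```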
Such a plot is a very useful way to determine the real culprit when your system acts up. The secret to all of these tools, however, is that you must use them frequently, if not continuously, so that you know what "normal" is for any given workload, time of day, etc.

Another useful tool is your accounting. It will tell you what your load is - how many users, how much CPU time, etc. Of course, you have to run it daily and write the programs to give you operations statistics from the data, but if you need to justify system expansion to management, that is the first place to look. You may discover that you simply need to convince one third of the secretaries and their bosses to work from 5 to 1, and another third to work from 1 to 9, to get three times as much work out of your system without spending any more money! (Management will usually let you upgrade the CPU when presented with that option; it is much cheaper!)

William H. Magill
Operations Manager
Computing and Educational Technology Services
(Formerly the Moore School Computing Facility, home of the ENIAC)
University of Pennsylvania