[comp.unix.internals] Monitoring processes and machines

khouglan@autoc3.intel.COM (Kriss Hougland~) (03/29/91)

I'm interested in being able to keep tabs on our whole domain.  That way, when
people log off for the day, it's usable CPU time!  The unfortunate problem is
that sometimes the programs crash and burn by themselves, and sometimes ye old
operator does a kill -9 on them.

What I am wondering is:
1) Where can I find, say, the source code for a "ps" function so I don't have
to shell out from C to get the info?

2) I'm trying to find "ofiles" on a comp.sources.unix archive machine (no luck
so far).

3) I'm trying to figure out whether there is a way to swap the program out
completely (context or whatever) so I can resume execution later.  Or, at
worst, have a central program (daemon time) that will kill it remotely.  For
example, when someone comes in in the morning and logs on, I want either to
kill the process via a central program on another machine (I'm trying to use
sockets now) or to swap the program out, so people don't gripe and get the
operator to do a #9 on it.
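
For the "resume execution later" part of item 3, one approach that needs no
source for the number cruncher is job-control signals: SIGSTOP cannot be
caught or ignored, and SIGCONT lets the process pick up where it left off.
Below is a minimal sketch in C, assuming a system with BSD/POSIX job-control
signals and assuming the per-machine daemon starts the job itself, so that it
is the job's parent and may call waitpid() on it.  The function names
(launch_job, suspend_job, resume_job, job_alive) are made up for illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start the job as a child process.  Returns its pid, or -1 if fork failed. */
pid_t launch_job(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* only reached if the exec failed */
    }
    return pid;
}

/* Stop the job and wait until the kernel reports it as stopped.
 * Returns 0 once it is stopped, -1 if it already exited or a call failed.
 * waitpid() only works here because the caller is the job's parent. */
int suspend_job(pid_t pid)
{
    int status;

    if (kill(pid, SIGSTOP) == -1)
        return -1;
    if (waitpid(pid, &status, WUNTRACED) != pid)
        return -1;
    return WIFSTOPPED(status) ? 0 : -1;
}

/* Let a stopped job run again.  Returns 0 on success, -1 on error. */
int resume_job(pid_t pid)
{
    return kill(pid, SIGCONT);
}

/* Signal 0 delivers nothing; it only checks that the pid exists and that
 * we are allowed to signal it.  An exited child that has not been reaped
 * yet (a zombie) still counts as existing. */
int job_alive(pid_t pid)
{
    return kill(pid, 0) == 0;
}
```

The daemon would call suspend_job() when someone logs on and resume_job()
when the machine goes idle again.  A stopped job uses no CPU, but it keeps
its memory until the kernel decides to page it out, and it does not survive a
reboot; this is not a checkpoint.  How the central program tells each daemon
what to do (sockets or otherwise) is left open here.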

Currently, I don't have source for the number crunching programs.

Please post any comments or suggestions.  I hope I have not screwed up my
point, but I have a feeling that other people might be interested in
distributed computing the chunky way, by some means other than "at".

All rights given, All wrongs deserved!
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-
Addresses:		!Disclaimer:  All information is my own and is not that
khouglan@hopi.intel.com ! of my employer. "Opportunity came knocking, but I was
askah@acvax.inre.asu.edu!                 in the bathroom."  (ME)

hafner@mysost.cs.wisc.edu (Brian J. Hafner) (03/31/91)

In article <3545@inews.intel.com> khougland@sedona.intel.com writes:
>
>I'm interested in being able to keep tabs on our whole domain.  That way, when
>people log off for the day, it's usable CPU time!  The unfortunate problem is
>that sometimes the programs crash and burn by themselves, and sometimes ye old
>operator does a kill -9 on them.

You may be interested in "condor" from the Univ. of Wisconsin.
A portion of the condor_intro man page:

     Condor is a facility for executing UNIX jobs on a pool of
     cooperating workstations.  Jobs are queued and executed
     remotely on workstations at times when those workstations
     would otherwise be idle.  A transparent checkpointing
     mechanism is provided, and jobs migrate from workstation to
     workstation without user intervention.  When the jobs
     complete, users are notified by mail.

Condor may be obtained via anonymous FTP from shorty.cs.wisc.edu.

Brian J. Hafner
Computer Sciences Department
University of Wisconsin - Madison
hafner@cs.wisc.edu