[comp.unix.ultrix] Error Logging Requirements

cranston@guru.dec.com (Scott Cranston) (11/06/90)

 
This query is posted on both unix.admin and unix.ultrix.  Please feel free
to respond directly to me or to the net.
 
I am interested in finding out what system administrators and other users
expect of system error logging.  To this end I have several questions for your
consideration and reply. Please expand beyond these if you like.
 
Thank you for your time,
Scott
 
 
  1.  How is the error log used?
 
	-  Diagnose a specific problem on a specific device or system
	   component?
 
	-  Isolate a problem to a failing subsystem?
 
	-  High level monitoring of the system's health?
 
	-  System crash debugging?
 
	-  Are tools like grep and awk used to further reduce the data
	   and/or generate custom reports?
 
 
   2.  Who uses the error log?
 
	-  System manager
 
	-  Programmers
 
	-  General users
 
 
   3.  What information should the error log contain?
 
	-  Only summary information
 
	   For example:  /dev/ra189: unrecoverable hard error
			 /dev/ra189: bad block replacement, LBN 123456
			 Uncorrectable memory error, phys adrs: 0x123456
			 panic: duplicate inode
 
	-  Detailed error information. For example device or controller
	   register contents, error message packets, stack traces, etc.
 
	   Should this detailed data simply be the octal, or hex
	   representation of the data?  Or, does this detailed information
	   need to have a descriptive translation of the individual bits done?
 
	-  System context, such as time stamp, system ID, hardware type,
	   operating system type and version.
 
	-  Do different users (such as those in #2 above) have different
	   requirements?
 
 
   4.  What format should the error log data be in?
 
	-  Only Plain ASCII text
 
	-  Only Binary data which requires a separate bit-to-text report
	   generator tool.
 
	-  Separate error logs...summary info in plain ASCII text, highly
	   detailed in binary with report generator.
 
 
   5.  Compatibility with other systems?
 
	-  Is syslog a de facto standard?
 
	-  What are the system/vendor interoperability requirements of error
	   logging?
 
 
   6.  What requirements would you make of an error log system if you were
       designing it?
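
A minimal sketch of the grep/awk-style reduction asked about in question 1, written in Python for illustration.  It assumes device messages begin with "/dev/<name>:" as in the summary examples under question 3; the function name and the "other" bucket are made up here, not part of any existing tool.

```python
import re
from collections import Counter

# Sample lines copied from the summary-format examples in question 3.
SAMPLE_LOG = """\
/dev/ra189: unrecoverable hard error
/dev/ra189: bad block replacement, LBN 123456
Uncorrectable memory error, phys adrs: 0x123456
panic: duplicate inode
"""

# Assumed convention: a device message starts with "/dev/<name>: ".
DEVICE_RE = re.compile(r"^(/dev/\S+):\s+(.*)$")


def count_by_device(text):
    """Count log lines per device; lines without a /dev/ prefix go under 'other'."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = DEVICE_RE.match(line)
        counts[match.group(1) if match else "other"] += 1
    return counts


if __name__ == "__main__":
    # Print a small per-source report, one line per source.
    for source, n in sorted(count_by_device(SAMPLE_LOG).items()):
        print(f"{n:4d}  {source}")
```

A reduction like this only works if the summary lines are plain text with a predictable prefix, which bears on the format question in item 4.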

mf@ircam.ircam.fr (Michel Fingerhut) (11/12/90)

By order of importance (to me):

Compatibility with other systems?  

    syslog is a *de facto* standard, esp. in a heterogeneous environment.
    I'd say (for me) -- 4.3 syslog.  The fact that currently ultrix 4.0
    supports only 4.2 syslog is a major pain for us.  Any other system
    should be compatible with 4.3 syslog (and extend it, if at all
    possible).  A one-line report is the best place to start (but it
    should not be the only available info).

    As for LANs, centralizing reports from different machines in a
    common log may help resolve problems common to multiple machines
    which otherwise wouldn't be noticed (e.g., due to security, network,
    electrical problems, etc.).
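
To illustrate the point about a common log, here is a rough Python sketch that merges per-host logs by time and picks out messages seen on more than one machine.  The host names, the integer timestamps, and the (timestamp, host, message) tuple layout are all assumptions for the example, not anything syslog provides in this form.

```python
import heapq
from collections import defaultdict


def merge_logs(*per_host_logs):
    """Merge per-host lists of (timestamp, host, message), each already sorted by time."""
    return list(heapq.merge(*per_host_logs))


def shared_messages(merged):
    """Return {message: sorted host names} for messages reported by more than one host."""
    hosts = defaultdict(set)
    for _timestamp, host, message in merged:
        hosts[message].add(host)
    return {msg: sorted(names) for msg, names in hosts.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical hosts and events.
    alpha = [(100, "alpha", "nfs server not responding"), (160, "alpha", "ECC error")]
    beta = [(105, "beta", "nfs server not responding")]
    merged = merge_logs(alpha, beta)
    for message, names in shared_messages(merged).items():
        print(f"{message}: seen on {', '.join(names)}")
```

Looked at one host at a time, neither entry stands out; in the merged view the same message on two machines within seconds points at something shared, such as the network or a server.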

How is the error log used?

    An error logging mechanism should (a) alert and (b) log info that helps
    diagnose and repair a problem.  It should present both a low-level,
    nitty-gritty view of the problem and a "high level monitoring" view
    of the system's health.  Ideally, it should accumulate statistics and
    make them available to monitoring tools that would let one see changes
    in performance over time and raise alerts in such cases (e.g., disk
    effective throughput going down over time; load average constantly
    too high or rising; gettys on terminal lines eating too much CPU
    time; ECC errors increasing; etc.).
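
One very simple way a monitoring tool might flag such a trend is to compare the newer half of a series of samples against the older half.  The sketch below is only an illustration under that assumption; the 1.5 ratio and the minimum of four samples are arbitrary choices, and a real tool would likely want something more robust.

```python
def is_rising(samples, ratio=1.5):
    """Return True if the mean of the newer half is at least `ratio` times the older half's mean.

    `samples` is a list of per-interval counts (e.g. ECC errors per day), oldest first.
    """
    if len(samples) < 4:
        return False  # too little data to call it a trend
    mid = len(samples) // 2
    older = sum(samples[:mid]) / mid
    newer = sum(samples[mid:]) / (len(samples) - mid)
    if older == 0:
        return newer > 0
    return newer / older >= ratio


if __name__ == "__main__":
    ecc_errors_per_day = [1, 1, 2, 1, 3, 4, 5, 6]  # made-up numbers
    if is_rising(ecc_errors_per_day):
        print("alert: ECC error rate is rising")
```

The same check applies to any accumulated counter, which is why it helps if the logger keeps statistics rather than only individual messages.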

How it is reported (alerts)

    Alerts should, somewhat as with syslog, either be real-time alarm messages
    to the console and/or specific users, or else "trigger" a user-selectable
    program (e.g., mail, to notify the system manager, but also other
    programs which could take site-specific action).

    The extent of the "report" should be configurable, so that one could
    configure the same event to be reported differently to different audiences
    (= classes of users).
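
A sketch of what such per-audience configuration could look like, in Python.  The severity names and numbers follow 4.3BSD syslog; the routing table, the audience names, and the event dictionary layout are hypothetical and only meant to show the idea.

```python
# Severity levels as in 4.3BSD syslog: a lower number means more severe.
SEVERITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}

# Hypothetical routing table: (audience, least severe level still reported, detail level).
ROUTES = [
    ("console", "crit", "summary"),
    ("sysadmin-mail", "err", "full"),
    ("users", "emerg", "summary"),
]


def render(event, detail):
    """Format one event as a summary line, optionally followed by its detail."""
    if detail == "full":
        return f"{event['summary']} [{event['detail']}]"
    return event["summary"]


def route(event, routes=ROUTES):
    """Return (audience, text) pairs for every audience that should see this event."""
    deliveries = []
    for audience, limit, detail in routes:
        if SEVERITY[event["severity"]] <= SEVERITY[limit]:
            deliveries.append((audience, render(event, detail)))
    return deliveries


if __name__ == "__main__":
    event = {"severity": "err",
             "summary": "/dev/ra189: unrecoverable hard error",
             "detail": "IEREG=0x1234"}
    for audience, text in route(event):
        print(f"{audience}: {text}")
```

Here the same event reaches the system manager with full detail, while the console and general users see nothing until something more severe happens.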

    It should also be possible to do LAN-wide alerts.

    With the standardisation of X11, it would be nice to have popup alert
    windows too.

What information should the error log contain?

    See above.  For "specific bit meaning" -- I think this *should* be
    included.  I.e., an error report of the type IEREG (say) = 0x1234
    is useless unless you happen to know the meaning of all bits in all
    registers of all your devices.  Since hopefully the device driver
    knows it, it might as well elucidate.
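
As an illustration of the kind of translation meant here, a small Python sketch that turns a register value into a list of named bits.  IEREG is the made-up register from the paragraph above, and the bit names in the table are invented for the example; a real driver would take them from the device documentation.

```python
# Hypothetical bit assignments for the example register IEREG.
IEREG_BITS = {
    2: "parity error",
    4: "timeout",
    5: "data late",
    9: "drive not ready",
    12: "controller fault",
}


def decode_register(value, bit_names, width=16):
    """Return descriptions of the set bits, lowest bit first.

    Bits that are set but missing from `bit_names` are reported by number.
    """
    return [bit_names.get(bit, f"bit {bit} (undocumented)")
            for bit in range(width)
            if value & (1 << bit)]


if __name__ == "__main__":
    value = 0x1234
    print(f"IEREG = 0x{value:04x}")
    for description in decode_register(value, IEREG_BITS):
        print(f"    {description}")
```

Printing the raw hex value alongside the decoded bits keeps the report useful both to someone who knows the hardware and to someone who does not.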

The error log could be a short plain ASCII text report combined with a
detailed (binary) snapshot "somewhere else", with tools to decode that
info.
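
A rough sketch of the binary half of that arrangement, using Python's struct module.  The record layout (timestamp, unit number, register value) is an assumption made up for the example, not the format of any existing error logger.

```python
import struct

# Hypothetical record layout, big-endian:
#   unsigned 32-bit timestamp, unsigned 16-bit unit number, unsigned 16-bit register value.
RECORD = struct.Struct(">IHH")


def pack_record(timestamp, unit, register):
    """Build the binary snapshot that would be stored "somewhere else"."""
    return RECORD.pack(timestamp, unit, register)


def report(blob):
    """The report-generator side: turn one binary record back into readable text."""
    timestamp, unit, register = RECORD.unpack(blob)
    return f"t={timestamp} unit={unit} reg=0x{register:04x}"


if __name__ == "__main__":
    blob = pack_record(658000000, 189, 0x1234)
    print(len(blob), "bytes:", report(blob))
```

The binary record stays compact and complete, while the text report can be regenerated at whatever level of detail the reader wants.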

Michel Fingerhut

rusty@belch.Berkeley.EDU (rusty wright) (11/15/90)

Another thing that I'd like to see is some sort of version number in
the packets.  Client programs should be able to query the syslog
server and send it log packets in the appropriate format.  Or any new
stuff could be at the end of the packet and old servers could simply
ignore it.  Alternatively, they could send it to a syslog server on
the local machine and its config file could say to send it to another
syslog server, possibly with reformatting if it's a different version.
The fact that the 4.2 and 4.3 BSD syslogs don't work together is just
bad design.
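
A sketch of the "new stuff at the end of the packet" idea in Python.  This is a hypothetical binary layout invented for the example, not the actual syslog wire format; the sequence-number field and all function names are assumptions.

```python
import struct

# Hypothetical packet layout, big-endian:
#   header: version (u8), priority (u8), message length in bytes (u16)
#   body:   the message text
#   tail:   fields added by later versions; here version 2 adds a u32 sequence number
HEADER = struct.Struct(">BBH")
V2_TAIL = struct.Struct(">I")


def build_v1(priority, message):
    body = message.encode("ascii")
    return HEADER.pack(1, priority, len(body)) + body


def build_v2(priority, message, sequence):
    body = message.encode("ascii")
    return HEADER.pack(2, priority, len(body)) + body + V2_TAIL.pack(sequence)


def parse_v1(packet):
    """What an old server does: read the fields it knows and ignore trailing bytes."""
    version, priority, length = HEADER.unpack_from(packet, 0)
    body = packet[HEADER.size:HEADER.size + length]
    return version, priority, body.decode("ascii")


def parse_v2(packet):
    """What a new server does: also read the tail when the packet carries one."""
    version, priority, length = HEADER.unpack_from(packet, 0)
    end = HEADER.size + length
    message = packet[HEADER.size:end].decode("ascii")
    sequence = None
    if version >= 2 and len(packet) >= end + V2_TAIL.size:
        (sequence,) = V2_TAIL.unpack_from(packet, end)
    return version, priority, message, sequence


if __name__ == "__main__":
    packet = build_v2(3, "disk error", 42)
    print("old server sees:", parse_v1(packet))
    print("new server sees:", parse_v2(packet))
```

Because the message length is explicit, an old server never has to understand the tail, and a new server can still accept version 1 packets that lack it.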