On startup, statmgr reads the configuration file named on the command line. Commands in this file set up all parameters used in monitoring the health of an Earthworm system. In the control file, lines may begin with a valid statmgr command (listed below) or with one of 2 special characters:
# marks the line as a comment (example: # This is a comment).
@ allows control files to be nested; one control file can be accessed from another with the command "@" followed by a string representing the path name of the next control file (example: @model.d).
Command names must be typed in the control file exactly as shown in this document (upper/lower case matters!).
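The comment and nesting rules can be illustrated with a short Python sketch. This is not statmgr's own parser; the function name read_commands, the nesting-depth guard, and the choice to resolve nested paths relative to the including file are all assumptions made for illustration.

```python
import os
import shlex

def read_commands(path, depth=0, max_depth=10):
    """Return (command, args) tuples from a control file, following '@' lines."""
    if depth > max_depth:  # hypothetical guard against circular nesting
        raise RecursionError(f"control files nested too deeply: {path}")
    commands = []
    with open(path) as f:
        for line in f:
            # shlex drops '#' comments and keeps quoted strings together
            tokens = shlex.split(line, comments=True)
            if not tokens:
                continue  # blank line or comment-only line
            if tokens[0].startswith("@"):
                # assumption: nested paths are relative to the including file
                nested = os.path.join(os.path.dirname(path), tokens[0][1:])
                commands.extend(read_commands(nested, depth + 1, max_depth))
            else:
                # command names are kept exactly as typed (case matters)
                commands.append((tokens[0], tokens[1:]))
    return commands
```

Calling read_commands("statmgr.d") on a file like the example in section 1 would yield pairs such as ("RingName", ["HYPO_RING"]).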
Section 2 lists the commands recognized by statmgr, grouped by the function
they influence. Most of the commands are required.
Section 3 lists all configuration file commands in alphabetical order.
Each entry gives the command, its arguments, the name of the subroutine
that processes the command, and the function within the module that the
command influences. A detailed description of the command is also given.
Default values and example commands are listed after each command description.
All errors received by statmgr are written to its daily log file.
Each descriptor file specifies when error messages are to be reported via
email and pager. The default pager group name and a list of email recipients
are listed in statmgr's configuration file. A different pagegroup can be
listed in each module's descriptor file to override the default.
Here are the lines that make up a descriptor file:
instId inst
modId modId
modName modName
system sysname
pagegroup group
tsec: tsec page: npage mail: nmail
err: code nerr: nerr tsec: tsec page: npage mail: nmail
text: description
nerr and tsec specify the maximum allowable error rate.
If the error rate exceeds nerr errors per tsec seconds,
an email or pager message may be reported. To report all
errors, set nerr to 1 and tsec to 0.
npage is the maximum number of pager messages that will be
reported and nmail is the maximum number of email messages that
will be reported. If the page or mail limit is exceeded, no
further errors will be reported until statmgr is restarted.
description is the default text string (up to 79 characters) that
statmgr will report for this error code. Enclose the string in
double-quotes if it contains embedded blanks. Each module may
include a (hopefully more informative) text string in its error
message; if so, that string overrides the default description.
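The interaction between nerr, tsec, npage and nmail can be sketched in Python. This is an illustration only, not statmgr's implementation: the class name ErrorRateGate is hypothetical, and it assumes that "exceeding the rate" means nerr errors arriving within tsec seconds of each other.

```python
from collections import deque

class ErrorRateGate:
    """Decide whether one error code should trigger a page and/or an email."""

    def __init__(self, nerr, tsec, npage, nmail):
        self.nerr = nerr
        self.tsec = tsec
        self.pages_left = npage   # remaining pager messages
        self.mails_left = nmail   # remaining email messages
        self.recent = deque(maxlen=nerr)  # times of the last nerr errors

    def on_error(self, now):
        """Record an error at time `now` (seconds); return (page, mail) flags."""
        self.recent.append(now)
        # assumption: the rate is exceeded when the last nerr errors
        # all arrived within tsec seconds
        too_fast = (len(self.recent) == self.nerr
                    and now - self.recent[0] <= self.tsec)
        if not too_fast:
            return (False, False)
        page = self.pages_left > 0
        mail = self.mails_left > 0
        if page:
            self.pages_left -= 1
        if mail:
            self.mails_left -= 1
        return (page, mail)
```

With nerr set to 1 and tsec set to 0, every call to on_error reports until the npage and nmail limits are used up, which matches the "report all errors" setting described above.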
1. EXAMPLE CONFIGURATION FILE
# Status Manager Configuration File
# (statmgr.d)
#
# This file controls the notifications of earthworm error conditions.
# The status manager can send pager messages to a pageit system, and
# it can also send email messages to a list of recipients.
# Earthquake notifications are not handled by the status manager.
# In this file, comment lines are preceded by #.
#
MyModuleId MOD_STATMGR
# "RingName" specifies the name of the transport ring to check for
# heartbeat and error messages. Ring names are listed in file
# earthworm.h. Example -> RingName HYPO_RING
#
RingName HYPO_RING
# "GetStatusFrom" lists the installations & modules whose heartbeats
# and error messages statmgr should grab from transport ring:
#
# Installation Module Message Types
GetStatusFrom INST_MENLO MOD_WILDCARD # heartbeats & errors
# "LogFile" sets the switch for writing a log file to disk.
# Set to 1 to write a file to disk.
# Set to 0 for no log file.
# Set to 2 for module log file but no logging to stderr/stdout
#
LogFile 1
# "heartbeatPageit" is the time in seconds between heartbeats
# sent to the pageit system. The pageit system will report an error
# if heartbeats are not received from the status manager at regular
# intervals.
#
heartbeatPageit 60
# "pagegroup" is the pager group name.
# The pageit program maps this name to a list of pager recipients.
# This line is required. Individual modules can override this group
# by including the "pagegroup" command in their descriptor file.
#
pagegroup larva_test
# Specify the name of a computer to use as a mail server.
# This system must be alive for mail to be sent out.
# This parameter is used by Windows NT only.
#
MailServer andreas
# Any number (or none) of email recipients may be specified below.
# These lines are optional.
#
# Syntax
# mail emailAddress1
# mail emailAddress2
# ...
# mail emailAddressN
#
mail jdoe@yourmachine.edu
#
#
# Mail program to use, e.g. /usr/ucb/Mail (not required)
# If given, it must be a full pathname to a mail program
MailProgram /usr/ucb/Mail
#
# Subject line for the email messages. (not required)
#
Subject "This is an earthworm status message"
#
# Message Prefix - useful for paging systems, etc.
# this parameter is optional
#
MsgPrefix "(("
#
# Message Suffix - useful for paging systems, etc.
# this parameter is optional
#
MsgSuffix "))"
# Now list the descriptor files which control error reporting
# for earthworm modules. One descriptor file is needed
# for each earthworm module. If a module is not listed here,
# no errors will be reported for the module. The file name of a
# module may be commented out, if it is temporarily not to be used.
# To comment out a line, insert # at the beginning of the line.
#
Descriptor statmgr.desc
# Descriptor adsend_a.desc # Data source (adsend) on lardass
# Descriptor adsend_b.desc # Data source (adsend) on honker
# Descriptor picker_a.desc # Picker programs on redhot
# Descriptor picker_b.desc # Picker programs on redhot
# Descriptor coaxtoring.desc
# Descriptor diskmgr.desc
# Descriptor binder.desc
# Descriptor eqproc.desc
# Descriptor startstop.desc
# Descriptor pagerfeeder.desc
# Descriptor pick_client.desc
# Descriptor pick_server.desc
2. FUNCTIONAL COMMAND LISTING
Earthworm system setup:
    GetStatusFrom      required
    MyModuleId         required
    RingName           required
Monitor system:
    heartbeatPageit    required
    Descriptor         required
    mail
    pagegroup          required
Output Control:
    LogFile            required
3. ALPHABETIC COMMAND LISTING & DESCRIPTION
command          arg1          processed by      function
Descriptor       descfile      statmgr_config    Monitor system
Registers patients with the statmgr. descfile is the name of a file
(up to 29 characters long) that describes a module that statmgr will
monitor. One "Descriptor" command must give the name of statmgr's own
descriptor file (i.e., statmgr is a patient of itself). Up to
MAXDESC (currently defined as 15 in statmgr.h) "Descriptor" commands
may be issued. All descriptor files should live in the directory
specified by the EW_PARAMS environment variable. Each descriptor file
contains the patient module's name and ID, its heartbeat interval, and
all its possible error codes and what they mean. It also contains
information on how and how often the statmgr should notify system
operators when errors do occur (see section 4 for more details on the
descriptor files).
Default: none
Examples: Descriptor statmgr.desc
Descriptor "statmgr.desc"
GetStatusFrom    inst mod_id   statmgr_config    Earthworm setup
Controls the heartbeat and error messages input to statmgr. statmgr
will only process TYPE_HEARTBEAT and TYPE_ERROR messages that come
from module mod_id at installation inst. inst and mod_id are
character strings (valid strings are listed in earthworm.h/earthworm.d)
which are related to single-byte numbers that uniquely identify each
installation and module. Up to 2 "GetStatusFrom" commands may be
issued; wildcards (INST_WILDCARD and MOD_WILDCARD) will force statmgr
to process all heartbeat and error messages, regardless of their place
of origin.
Default: none
Calnet: GetStatusFrom INST_WILDCARD MOD_WILDCARD
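The filtering rule for "GetStatusFrom" can be sketched as a small Python function. This is an illustration under assumptions: it compares installation and module names as strings, whereas statmgr works with the single-byte numbers those names map to, and the function name accepts is hypothetical.

```python
INST_WILDCARD = "INST_WILDCARD"
MOD_WILDCARD = "MOD_WILDCARD"

def accepts(filters, inst, mod_id):
    """True if a message from (inst, mod_id) matches any GetStatusFrom pair."""
    return any(
        f_inst in (inst, INST_WILDCARD) and f_mod in (mod_id, MOD_WILDCARD)
        for f_inst, f_mod in filters
    )
```

For the example configuration in section 1, filters would be [("INST_MENLO", "MOD_WILDCARD")], so messages from any module at INST_MENLO are accepted and messages from other installations are ignored.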
heartbeatPageit  nsec          statmgr_config    Monitor system
Defines the number of seconds nsec between heartbeat messages
issued by statmgr to the Pageit computer. This heartbeat serves as
the heartbeat for the entire Earthworm system being monitored by
statmgr. A statmgr heartbeat is actually a TYPE_PAGE message that
contains a character string (example: "alive: sysname#"). statmgr
places this TYPE_PAGE message into shared memory where the pagerfeeder
module can find it and send it to the Pageit system via the serial
port. If the Pageit computer doesn't receive a heartbeat within a
specified time interval, it will issue an "obituary" page for the
Earthworm system.
Default: none
Calnet: heartbeatPageit 60
LogFile          switch        statmgr_config    Output control
Sets the on-off switch for writing a log file to disk. If switch
is 0, no log file will be written. If switch is 1, statmgr
will write daily log file(s) called statmgrxx.log_yymmdd
where xx is statmgr's module id (set with the "MyModuleId" command)
and yymmdd is the current UTC date (ex: 960123) on the system clock.
If switch is 2, the log file is written but nothing is logged to
stderr/stdout (see the example configuration file in section 1).
The file(s) will be written in the EW_LOG directory (environment
variable).
Default: none
mail             recipient     statmgr_config    Monitor system
Registers one recipient email address with the statmgr. As
configured by descriptor files, statmgr will send every recipient
an email message about patient-module errors and state of health
(dead/alive) changes. Up to MAXRECIP (currently defined as 10 in
statmgr.h) "mail" commands may be issued, but none are required.
Each recipient address can be up to 59 characters long.
Default: none
Example: mail jdoe@yourmachine.edu
MyModuleId       mod_id        statmgr_config    Earthworm setup
Sets the module id for labeling all outgoing messages. mod_id is
a character string (valid strings are listed in earthworm.d) that
relates (in earthworm.d) to a unique single-byte number.
Default: none
Calnet: MyModuleId MOD_STATMGR
pagegroup        group         statmgr_config    Monitor system
Registers a pager group (string up to 79 characters long) with the
statmgr. statmgr will address all of its TYPE_PAGE messages to
group unless the module's descriptor file includes its own
pagegroup command. When the paging system computer receives the
message, it maps group to a list of pager recipients and sends a
page to each one.
Only one "pagegroup" command is allowed and it is required.
Default: none
Example: pagegroup ew_operators
RingName         ring          statmgr_config    Earthworm setup
Tells statmgr which shared memory region to use for input/output.
ring is a character string (valid strings are listed in earthworm.d)
that relates (in earthworm.d) to a unique number for the key to the
shared memory region.
Default: none
Calnet: RingName HYPO_RING
4. DESCRIPTOR FILE DETAILS
Every module is registered with the statmgr by means of a "Descriptor" command in statmgr's configuration file. This command gives the name of the module's
"descriptor file" which contains details about the module's name and ID, its heartbeat rate, its error codes, and when/how to notify operators of any
problems. Statmgr processes each descriptor file in the function
statmgr_getdf().
instId inst
inst is the installation at which the patient module is running.
inst is a character string (valid strings are listed in earthworm.h)
that relates (in earthworm.h) to a unique single-byte number.
This line is required; inst and modId allow statmgr to match an
error message with its proper descriptor file instructions.
modId modId
modId is the module id of the patient module. modId is a
character string (valid strings are listed in earthworm.d) that
relates (in earthworm.d) to a unique single-byte number. modId
must match that used in the patient module's own configuration
file. This line is required; inst and modId allow statmgr to
match an error message with its proper descriptor file instructions.
modName modName
Gives the name of the patient module. modName is a text string
(up to 39 characters) which statmgr includes in each logged and
reported error message from this patient. This line is required.
system sysname
This is an optional parameter. sysname is a string (up to 29
characters) giving the name of the computer on which the patient
module is running. statmgr includes this text string in each
logged and reported error message from this patient. If the
"system" line is omitted, statmgr assumes the module is running
on the local computer and uses the environment variable, SYS_NAME,
in its place.
pagegroup group
This is an optional parameter. group is a string (up to 79
characters) to which statmgr will address all TYPE_PAGE messages regarding
this specific module.
If the "pagegroup" line is omitted here, statmgr uses the pagegroup
listed in its own configuration file.
tsec: tsec page: npage mail: nmail
If statmgr does not receive a heartbeat message every tsec
seconds from this patient module, an error will be reported
(LOCAL_time modName/sysname module dead). If statmgr receives
a heartbeat from a module that it has reported "dead," it will send
out an "alive" message (LOCAL_time modName/sysname module alive).
tsec is generally set to 2*(heartbeat-interval) of the patient
module. npage is the maximum number of pager messages that will
be reported and nmail is the maximum number of email messages that
will be reported. Each "dead" and "alive" message counts as a
separate message. If the page or mail limit is exceeded, no further
errors will be reported until the status manager is restarted.
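The dead/alive bookkeeping for one patient module can be sketched in Python. This is a simplified illustration, not statmgr's code: the class name HeartbeatWatch is hypothetical, the LOCAL_time prefix is left out, and the npage and nmail limits are not enforced here.

```python
class HeartbeatWatch:
    """Track one patient module's heartbeat and report state changes."""

    def __init__(self, mod_name, sys_name, tsec, start):
        self.label = f"{mod_name}/{sys_name}"
        self.tsec = tsec          # maximum allowed gap between heartbeats
        self.last_beat = start    # time of the most recent heartbeat
        self.dead = False

    def on_heartbeat(self, now):
        """Record a heartbeat; return an 'alive' message if the module was dead."""
        self.last_beat = now
        if self.dead:
            self.dead = False
            return f"{self.label} module alive"
        return None

    def check(self, now):
        """Return a 'dead' message once when the heartbeat is overdue."""
        if not self.dead and now - self.last_beat > self.tsec:
            self.dead = True
            return f"{self.label} module dead"
        return None
```

A monitoring loop would call check periodically and on_heartbeat whenever a TYPE_HEARTBEAT message arrives. For a module that beats every 30 seconds, tsec would be set to 60 following the 2*(heartbeat-interval) guideline.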
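err: code nerr: nerr tsec: tsec page: npage mail: nmail
text: description
code is the error code generated by the patient module.
Error codes can be any unsigned integer, not necessarily
sequential. nerr, tsec, npage, nmail, and description are
explained at the beginning of this document.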
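To show how the descriptor lines fit together, here is a Python sketch that parses a made-up descriptor file. Everything in EXAMPLE_DESCRIPTOR (module name, IDs, numbers, error text) is hypothetical, and the parser assumes that each "text:" line belongs to the "err:" line just before it. The real parsing is done by statmgr_getdf().

```python
import shlex

# Hypothetical descriptor contents, for illustration only.
EXAMPLE_DESCRIPTOR = '''
modName  example_mod
modId    MOD_EXAMPLE
instId   INST_MENLO
pagegroup ew_operators
tsec: 120  page: 0  mail: 5
err: 0  nerr: 1  tsec: 0  page: 2  mail: 10
text: "Example error description"
'''

def pairs(tokens):
    """Turn ['tsec:', '120', 'page:', '0'] into {'tsec': 120, 'page': 0}."""
    return {k.rstrip(":"): int(v) for k, v in zip(tokens[::2], tokens[1::2])}

def parse_descriptor(text):
    """Parse descriptor lines into a dict (a sketch, not statmgr's parser)."""
    desc = {"errors": []}
    for line in text.splitlines():
        tokens = shlex.split(line)  # keeps double-quoted strings together
        if not tokens:
            continue
        key = tokens[0]
        if key in ("instId", "modId", "modName", "system", "pagegroup"):
            desc[key] = tokens[1]
        elif key == "tsec:":
            desc["heartbeat"] = pairs(tokens)
        elif key == "err:":
            desc["errors"].append(pairs(tokens))
        elif key == "text:" and desc["errors"]:
            # assumption: text describes the most recent err line
            desc["errors"][-1]["text"] = tokens[1]
    return desc
```

The pagegroup override then reduces to a lookup with a fallback, for example parse_descriptor(EXAMPLE_DESCRIPTOR).get("pagegroup", default_group), where default_group stands for the group named in statmgr's own configuration file.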
Contact: Questions? Issues? Subscribe to the Earthworm List (earthw).