Reading the Gridmanager Log

Normal Logging Level

This section describes entries that always appear in the gridmanager log.

Like all HTCondor daemons, the gridmanager prints a banner when it starts up.

11/08/11 23:38:56 ******************************************************
11/08/11 23:38:56 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
11/08/11 23:38:56 ** /afs/cs.wisc.edu/unsup/condor-production/condor-7.6.4/i386_rhel5/sbin/condor_gridmanager
11/08/11 23:38:56 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
11/08/11 23:38:56 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
11/08/11 23:38:56 ** $CondorVersion: 7.6.4 Oct 17 2011 BuildID: 378655 PRE-RELEASE-UWCS $
11/08/11 23:38:56 ** $CondorPlatform: x86_rhap_5 $
11/08/11 23:38:56 ** PID = 15864
11/08/11 23:38:56 ** Log last touched 11/8 23:32:33
11/08/11 23:38:56 ******************************************************
11/08/11 23:38:56 Using config source: /p/condor/workspaces/jfrey/test.68/condor_config
11/08/11 23:38:56 Using local config sources:
11/08/11 23:38:56    /p/condor/workspaces/jfrey/test.68/condor_config.local
11/08/11 23:38:56 DaemonCore: command socket at <128.105.165.29:37201>
11/08/11 23:38:56 DaemonCore: private command socket at <128.105.165.29:37201>

This banner includes the HTCondor version and platform, daemon pid, when the log file was last touched, config files, and the daemon's command port.

It's possible for multiple gridmanger processes to be writing to the log at the same time. So after the banner, the gridmanager includes its pid in square brackets on every line.

When the gridmanager starts managing a job (whether newly submitted or released from hold, it writes this to the log:

11/08/11 23:38:59 [15864] Found job 716.0 --- inserting

Lines in the log that relate to a specific job will often include the job id in parentheses:

11/08/11 23:38:59 [15864] (716.0) doEvaluateState called: gmState GM_CLEAR_REQUEST, globusState 32

With GRAM, whenever the jobmanager notifies the gridmanager of a job status change, it's logged:

11/08/11 23:39:10 [15864] (716.0) gram callback: state 1, errorcode 0

When a GAHP command fails, you can usually find a line mentioning it:

11/08/11 23:42:40 [15864] (716.0) gmState GM_RESTART, globusState 128: globus_gram_client_job_request() returned Globus error 93

When the gridmanager exits normally, it prints a goodbye message:

11/08/11 23:42:44 [15864] **** condor_gridmanager (condor_GRIDMANAGER) pid 15864 EXITING WITH STATUS 0

Verbose Logging Level

When something goes wrong, it's useful to increase the verbosity of the gridmanager's logging to collect more information. When you do so, you may also want to increase the maximum log size

GRIDMANAGER_DEBUG = D_FULLDEBUG
MAX_GRIDMANAGER_LOG = 50000000

This section will describe some of the additional information that will appear in the log when D_FULLDEBUG is enabled.

GAHP

All data exchanged with GAHP servers is logged. The gridmanager communicates with a GAHP server over a pair of pipes. The GAHP acknowledges the issuance of each command. The results of the command are supplied later, in response to a RESULTS command.

11/03/11 16:51:54 [29399] GAHP[29509] <- 'GRAM_JOB_SIGNAL 10632 https://stress2.
chtc.wisc.edu:46861/16217854812539030796/8230609966348670593/ 10 NULL'
11/03/11 16:51:54 [29399] GAHP[29509] -> 'S'
...
11/03/11 16:51:54 [29399] GAHP[29509] <- 'RESULTS'
11/03/11 16:51:54 [29399] GAHP[29509] -> 'R'
11/03/11 16:51:54 [29399] GAHP[29509] -> 'S' '1'
11/03/11 16:51:54 [29399] GAHP[29509] -> '10632' '0' '0' '8'

In this example, 29509 is the pid of the GAHP server, GRAM_JOB_SIGNAL is the type of command being issued, 10632 is a unique id for the command, and the rest of the line is arguments for the command. To learn more about the GAHP protocol, see http://www.cs.wisc.edu/condor/gahp

The stderr of the GAHP server is also logged:

09/21/11 15:26:28 [4205] GAHP[4209] (stderr) -> 2011-09-21 15:26:28,909 INFO - CreamProxy_Delegate::execute() - Signing returned proxy with our proxy certificate [/tmp/x509up_u17063]...

Status and attribute changes

The gridmanager logs when it observes the job's status as reported by the remote system change:

11/09/11 00:03:25 [16311] (717.0) globus state change: ACTIVE -> STAGE_OUT
11/09/11 00:17:39 [16741] (718.0) cream state change: IDLE -> REALLY-RUNNING

The gridmanager's connections to its schedd are also logged. This includes every job attribute it modifies.

11/03/11 16:44:47 [29399] in doContactSchedd()
11/03/11 16:44:47 [29399] Updating classad values for 714.933:
11/03/11 16:44:47 [29399]    GlobusStatus = 64
11/03/11 16:44:47 [29399]    GridJobStatus = "STAGE_IN"
11/03/11 16:44:47 [29399]    LastRemoteStatusUpdate = 1320356686
11/03/11 16:44:47 [29399]    NumGlobusSubmits = 1
11/03/11 16:44:47 [29399] leaving doContactSchedd()

State machine

The gridmanager maintains a state machine for each job. Transitions between states are usually triggered by events related to the job. Most states involve some action being taken on the job, with a transition triggered by completion of the action.

Any time something happens that may affect a job, it's state machine is evaluated by calling the function doEvaluateState().

11/03/11 16:46:56 [29399] (714.933) doEvaluateState called: gmState GM_DONE_COMMIT, globusState 8

This may result in one or more transitions in the state machine.

11/03/11 16:46:56 [29399] (714.933) gm state change: GM_DONE_COMMIT -> GM_DELETE

The state machine is different for each scheduler type. However, there is a lot of similarity. Here are the states a GRAM or CREAM job will transition through during a normal job execution.

If an error occurs, the job may end up in one of these states: