How to scavenge cycles from PBS

Known to work in HTCondor version: 7.0

Overview

HTCondor is designed, among many other things, to scavenge compute cycles on desktop workstations when their interactive users are idle. The same concept applies to scavenging cycles from another batch system running on the same computer. Instead of configuring HTCondor to notice when an interactive user is idle, configure it to notice when the other batch system is idle on the machine. While the other system is idle, HTCondor is free to run jobs; as soon as the other batch system has work to do, HTCondor must preempt or checkpoint its current work. This page discusses how to configure HTCondor to do this with PBS, though the concept works for other batch systems as well.

HTCondor and PBS

First, configure the HTCondor startd to run jobs only when the attribute PBSRunning is False. The PBS prologue and epilogue will change this attribute dynamically with the condor_config_val -rset command.

On the worker nodes, define in the HTCondor config:

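# Allow condor_config_val -rset to change settings at run time
ENABLE_RUNTIME_CONFIG = TRUE
# Let clients with OWNER-level access change PBSRunning in the startd
STARTD_SETTABLE_ATTRS_OWNER = PBSRunning
# Default: assume PBS is idle on this node
PBSRunning = False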

# Only start jobs if PBS is not currently running a job
START_NOPBS = ( $(PBSRunning) == False )

START = $(START) && $(START_NOPBS)

so that HTCondor will start a job only if the original START expression is true and no PBS job is running on the node.
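
To check that the settings above were picked up, it can help to print the macros involved. The following is a minimal sketch, not part of the original recipe: the function name show_pbs_gate and the CONDOR_BIN variable are made up for this example, and the /opt/condor path is assumed from the scripts on this page. Note that a plain condor_config_val query reports what the configuration files say, not a runtime override made with -rset.

```shell
# Assumption: HTCondor binaries live under /opt/condor/bin, as elsewhere in
# this recipe. Override CONDOR_BIN if your installation differs.
CONDOR_BIN=${CONDOR_BIN:-/opt/condor/bin}

# show_pbs_gate: print the configured value of each macro used by the
# PBS gate (hypothetical helper name).
show_pbs_gate() {
    if [ ! -x "$CONDOR_BIN/condor_config_val" ]; then
        echo "condor_config_val not found in $CONDOR_BIN" >&2
        return 1
    fi
    for macro in PBSRunning START_NOPBS START; do
        printf '%s = ' "$macro"
        "$CONDOR_BIN/condor_config_val" "$macro"
    done
}
```

Running show_pbs_gate on a worker node should show START ending in the START_NOPBS clause if the configuration was read.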

On the PBS side, again on the worker nodes, have PBS tell HTCondor when it is about to run a job by adding the following to the PBS prologue. If HTCondor has claimed the node, the script also vacates the running HTCondor job.

        if [ -x /opt/condor/bin/condor_config_val ]; then
                # Tell the startd that PBS is about to run a job
                /opt/condor/bin/condor_config_val -rset -startd PBSRunning=True > /dev/null
                /opt/condor/sbin/condor_reconfig -startd > /dev/null
                sleep 2
                # If HTCondor has claimed this node, evict its job
                if ( /opt/condor/bin/condor_status -format '%s' Name -format '%s \n' State $(hostname) 2> /dev/null | grep -q Claimed )
                then
                        /opt/condor/sbin/condor_vacate > /dev/null
                        sleep 2
                fi
        fi
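
The prologue decides whether to vacate by searching the condor_status output for the word Claimed. A slightly stricter variant, sketched below, compares the state column exactly. This is an illustration rather than part of the original recipe: the function name any_slot_claimed is hypothetical, and it assumes the output is formatted as one "name state" pair per line, which requires a space between the two -format fields.

```shell
# any_slot_claimed: read "<slot name> <state>" lines on stdin and succeed
# (exit status 0) if at least one slot is in the Claimed state.
any_slot_claimed() {
    awk '$2 == "Claimed" { found = 1 } END { exit !found }'
}

# Possible use in the prologue (assumed format strings, note the space
# after the first %s):
#   /opt/condor/bin/condor_status -format '%s ' Name -format '%s\n' State \
#       "$(hostname)" 2> /dev/null | any_slot_claimed
```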

In the PBS epilogue, tell HTCondor that it is OK to use the machine again:

        if [ -x /opt/condor/bin/condor_config_val ]; then
                # Tell the startd that PBS has finished its job
                /opt/condor/bin/condor_config_val -rset -startd PBSRunning=False > /dev/null
                /opt/condor/sbin/condor_reconfig -startd > /dev/null
        fi
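
Since the prologue and epilogue share most of their logic, one option is to keep it in a single function that both scripts source. The sketch below only rearranges the commands shown above. The function name set_pbs_running and the CONDOR_BIN, CONDOR_SBIN, and SETTLE_SECS variables are hypothetical names introduced for this example, with defaults taken from the paths and delays used on this page.

```shell
# Assumed locations, matching the scripts in this recipe.
CONDOR_BIN=${CONDOR_BIN:-/opt/condor/bin}
CONDOR_SBIN=${CONDOR_SBIN:-/opt/condor/sbin}
# Seconds to wait for the startd to react (2 in the original scripts).
SETTLE_SECS=${SETTLE_SECS:-2}

# set_pbs_running True|False
set_pbs_running() {
    case "$1" in
        True|False) ;;
        *) echo "usage: set_pbs_running True|False" >&2; return 2 ;;
    esac
    # Nothing to do on nodes without HTCondor installed.
    [ -x "$CONDOR_BIN/condor_config_val" ] || return 0
    "$CONDOR_BIN/condor_config_val" -rset -startd "PBSRunning=$1" > /dev/null
    "$CONDOR_SBIN/condor_reconfig" -startd > /dev/null
    # Only the prologue case (True) needs to evict a running HTCondor job.
    [ "$1" = True ] || return 0
    sleep "$SETTLE_SECS"
    if "$CONDOR_BIN/condor_status" -format '%s' Name -format '%s \n' State \
            "$(hostname)" 2> /dev/null | grep -q Claimed
    then
        "$CONDOR_SBIN/condor_vacate" > /dev/null
        sleep "$SETTLE_SECS"
    fi
}
```

With this in place, the prologue would call set_pbs_running True and the epilogue set_pbs_running False.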

Acknowledgments

This is based on a recipe from Preston Smith of Purdue University. Thanks Preston!