Warning: DMTCP 1.2.8 or later required

DMTCP 1.2.8 or later is required. Earlier versions of DMTCP do not work with version 0.6 and later of our shim script, and earlier versions of the shim script (0.5 and earlier) do not work reliably under HTCondor.

(Information on why this was necessary is at #3747)
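As a rough illustration of the version requirement, the following Python sketch compares a DMTCP version string against the minimum listed for each shim script release in the Download table. The function names are hypothetical and not part of the integration package, and how you obtain the installed DMTCP version string is left to you.

```python
# Minimum DMTCP release for each shim script release, taken from the
# Download table on this page. Releases without a listed minimum are omitted.
MIN_DMTCP = {"0.6": "1.2.8", "0.5": "1.2.7"}


def parse_version(text):
    """Turn a dotted version string such as '1.2.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def dmtcp_is_new_enough(dmtcp_version, shim_version="0.6"):
    """Return True if dmtcp_version meets the minimum for shim_version.

    Hypothetical helper: assumes plain dotted numeric version strings.
    """
    minimum = MIN_DMTCP[shim_version]
    return parse_version(dmtcp_version) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("1.2.7", "1.2.8", "1.2.10"):
        print(candidate, dmtcp_is_new_enough(candidate))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "1.2.10" would sort before "1.2.8".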

Warning: Pools with different CPUs

Moving jobs between machines with different processors can cause the jobs to crash because of incompatibilities. For example, if your job checkpoints on a system that supports SSE4, glibc will cache that information and use SSE4-optimized code paths. If the job then moves to a machine lacking that support, it will crash the next time it calls into such a function (including many common string routines). We're working on an easier solution (see #3753), but for now, if you have a CPU-heterogeneous pool, you'll need to use your job's Requirements to check the CheckpointPlatform.

You can use

condor_status -af CheckpointPlatform | sort | uniq -c

to identify which machines you want to use. For example, you might decide that machines identified as "LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2" are the ones you want to target, in which case you would use something like this in your submit file:

requirements = CheckpointPlatform == "LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2"
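The same tally and the resulting submit file line can be sketched in Python. This is a minimal sketch, not part of the integration package: the function names are hypothetical, it assumes the condor_status output has one CheckpointPlatform value per line, and the second platform string in the sample is made up for illustration.

```python
from collections import Counter


def count_platforms(status_output):
    """Count how often each CheckpointPlatform value appears.

    status_output is assumed to be the text printed by
    `condor_status -af CheckpointPlatform`, one value per line.
    """
    lines = (line.strip() for line in status_output.splitlines())
    return Counter(line for line in lines if line)


def requirements_for(platform):
    """Build a submit file requirements line for one platform string.

    Assumes the platform string contains no double quotes.
    """
    return 'requirements = CheckpointPlatform == "{}"'.format(platform)


if __name__ == "__main__":
    # Sample output; the shorter platform string is hypothetical.
    sample = "\n".join([
        "LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2",
        "LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2",
        "LINUX X86_64 2.6.x normal N/A ssse3",
    ])
    counts = count_platforms(sample)
    for platform, count in counts.most_common():
        print(count, platform)
    # Target the most common platform in the pool.
    print(requirements_for(counts.most_common(1)[0][0]))
```

Targeting the most common platform is only one possible policy; you might instead choose the platform of the machines you most want your jobs to run on.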

Module Description

DMTCP is a third-party user-space checkpointing package, available at http://dmtcp.sourceforge.net/

The current integration model of DMTCP with HTCondor is a shim script plus additions to the submit description file for a vanilla universe job. This allows your unmodified vanilla universe job (for example, a dynamically linked executable, a process that forks, or a Matlab(tm) script) to checkpoint and migrate between machines.

The download tarball contains a README file with a manifest and more detailed instructions on how to use the integration software. This software is considered alpha, so the documentation and other parts may be rough for now.

Download

Download                        Released         Min DMTCP   Notes
dmtcp_condor_integration-0.6    Jul 11th, 2013   1.2.8       Critical fixes
dmtcp_condor_integration-0.5    Mar 25th, 2013   1.2.7       Updates to support more recent releases of DMTCP
dmtcp_condor_integration-0.4    Nov 1st, 2011
dmtcp_condor_integration-0.3    Oct 13th, 2011
dmtcp_condor_integration-0.2    Aug 26th, 2011
dmtcp_condor_integration-0.1    May 25th, 2011

Contact

Alan De Smet, an HTCondor project member, is the current maintainer of the DMTCP/HTCondor integration software. Inquiries, bug reports, or comments about the DMTCP/HTCondor integration package should be sent to condor-admin at cs.wisc.edu.

For help or bug reports relating to DMTCP itself, please subscribe to the dmtcp-forum mailing list located here: http://dmtcp.sourceforge.net/mailingLists.html

Homepage

The homepage for DMTCP is: http://dmtcp.sourceforge.net

The homepage for HTCondor is: http://www.cs.wisc.edu/condor

License

The DMTCP/HTCondor integration software is released under the Apache 2.0 License.

Contributors

People who have contributed to the source code of the DMTCP/HTCondor integration software include:

Other Notes

See DmtcpCondorDev for notes on developing the DMTCP support in HTCondor.