Map-Reduce Jobs under HTCondor

NOTE: If you want to try MapReduce with HTCondor, download the file "mrscriptVanilla.tar.gz".

Introduction

HTCondor provides support for starting Hadoop HDFS services, namely Name- and Datanodes. How HDFS data is accessed is up to the user's application, including through Hadoop's MapReduce framework. To that end, we provide a submit file generator script for submitting MapReduce jobs into the vanilla universe (1 Jobtracker and n Tasktrackers, where n is specified by the user).

Why run MapReduce jobs under HTCondor at all?

  1. HTCondor has powerful matchmaking capabilities built on its ClassAd mechanism. These capabilities can be exploited to implement policies for an MR cluster that go beyond the current capabilities of existing frameworks.

  2. The MR style of computation is not suitable for all applications or problems (e.g. ones that are inherently sequential). Support for multiple execution environments is needed, along with a different set of policies for each environment. HTCondor supports a wide variety of execution environments, including MPI-style jobs, VMware jobs, etc.

  3. Perhaps one of the bigger advantages is capacity management of a large shared MR cluster. Currently, the Hadoop MR framework has very limited support for managing users' job priorities.

Prerequisites

You need to have a distributed file system set up, e.g. the Hadoop Distributed File System (HDFS). Starting with version 7.5, HTCondor comes with a storage daemon that provides support for HDFS. More details about the HDFS daemon can be found in the HTCondor manual (see sections 3.3.23 and 3.13.2). Apart from this, Python version 2.4 or above is required on all the machines that are part of the pool.
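
As a point of reference, the HDFS daemon is configured through the ordinary HTCondor configuration files. Below is a minimal sketch for a machine acting as the name node; the macro names and values are assumptions to be verified against the manual sections cited above and your HTCondor version:

	# condor_config additions on the machine that should run the name node
	# (sketch; check macro names against manual sections 3.3.23 / 3.13.2)
	DAEMON_LIST       = $(DAEMON_LIST), HDFS
	HDFS_NODETYPE     = HDFS_NAMENODE
	HDFS_NAMENODE     = my.namenode.edu:54310
	HDFS_NAMENODE_DIR = /scratch/hdfs/name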

Submitting a Job

Getting required files

We have written a handy Python script that takes care of the many configuration steps involved in creating a job description file. It generates a specially crafted HTCondor submit file for your job, which can then be submitted to the HTCondor scheduler using the same script to get back the tracking URL of the job. This URL is where the JobTracker's embedded web server is running; it is published as a job ad by mrscriptVanilla once the JobTracker is set up. Using the script you can specify: the number of slots (CPUs) to use; MR cluster parameters, e.g. the capacity of each Tasktracker; the job jar file, or a script file if you are submitting a set of jobs; and HTCondor job parameters, e.g. the 'requirements' attribute.
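
If you prefer to inspect the job ad directly rather than relying on the script's output, something like the following should surface the tracking URL once the JobTracker is up. The cluster id 42 and the exact name of the advertised attribute are assumptions here, hence the broad grep:

	condor_submit job.desc
	# once the JobTracker is running, dump the full job ad and look for
	# the advertised tracking URL; the attribute name may vary
	condor_q -l 42.0 | grep -i -e url -e track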

This script will soon be part of the HTCondor distribution, but for now you can use the latest version attached to this wiki. The attached file (mrscriptVanilla.tar.gz) contains two files; the generator script, mrscriptVanilla.py, is the one you will invoke directly.

You will also need a copy of the Hadoop software on the machine from which you are submitting the job. This is required because all of the jar files needed to run the MR processes are copied to the machines selected for a given job. As different versions of Hadoop are not compatible with each other, make sure to download the same version as is used in your Hadoop distributed file system.
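
A quick way to check for a version mismatch is to ask the submit-side copy what it is (the installation path below is a placeholder):

	# should report the same version as the one running in your HDFS cluster
	/path/to/hadoop-0.21.0/bin/hadoop version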

Configuration Parameters

mrscriptVanilla.py requires you to specify certain configuration parameters, e.g. the number of CPU slots to request, the location of the Hadoop installation directory on your submit machine, etc. Below is a list of configuration variables whose values you should decide on before running the script to generate the submit file; a hypothetical worked set of values is sketched after the list.

  1. The URL of the Hadoop name-node server.
  2. The Java home directory.
  3. The Hadoop installation directory.
  4. The map capacity of each Tasktracker.
  5. The reduce capacity of each Tasktracker.
  6. The number of Tasktrackers that should be used for your job. There will only be one Jobtracker submitted!
  7. The jar file for your job.
  8. The parameters passed to your job.
  9. The number of mappers running per Tasktracker.
  10. The number of reducers running per Tasktracker.
  11. (Optional) The list of user files that should be sent with your job, beyond the ones that mrscriptVanilla.py is configured to send.
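
To make the list concrete, here is one hypothetical set of choices expressed as shell variables, mapped to the script options described in the next section. Every value is site-specific and made up for illustration:

	NAMENODE=my.namenode.edu:54310       # 1. name-node URL        (-n)
	JAVA_HOME=/usr/lib/jvm/java          # 2. Java home            (-j)
	HADOOP_HOME=/path/to/hadoop-0.21.0   # 3. Hadoop installation  (-p)
	MAP_CAPACITY=1                       # 4. map capacity         (-m)
	REDUCE_CAPACITY=2                    # 5. reduce capacity      (-r)
	NUM_TASKTRACKERS=10                  # 6. Tasktracker count    (-c)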

Generating job submit file

Once you have decided upon the values of all of the above-mentioned parameters, you are ready to generate the file. Use mrscriptVanilla.py to generate the job.desc file.

	./mrscriptVanilla.py -m <map> -r <reduce> -j <java> -p <hadoop> -c <count> -n <URL> -f <user-file1> -f <user-file2> -c <key-value pair> job-jar-file 'args for job-jar-file'
	where:
	  -m: map capacity of a single Tasktracker
	  -r: reduce capacity of a single Tasktracker
	  -j: location of the Java home directory on the local machine
	  -p: location of the Hadoop installation directory on the local machine
	  -c: number of machines used as Tasktrackers
	  -n: URL of the Hadoop name-node server
	  -f: an additional file that should go with your job; this option can be
	      used multiple times. Note that you do not need to specify any of the
	      Hadoop core jar files.
	  -c: a key-value pair that is placed in the appropriate Hadoop XML
	      configuration file when setting up the MR cluster; this option can
	      be used multiple times.
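
As an illustration of the key-value form, items 9 and 10 from the parameter list map naturally onto Hadoop's per-Tasktracker task limits. The key=value syntax and the property names below (which changed between Hadoop releases) are assumptions to verify against your Hadoop version:

	# hypothetical: limit each Tasktracker to 2 mappers and 2 reducers
	-c mapred.tasktracker.map.tasks.maximum=2
	-c mapred.tasktracker.reduce.tasks.maximum=2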

With the right set of parameters, the above command will generate a 'job.desc' file in the current directory. You can submit this file directly, although you may first want to add certain requirements (OpSys, Arch, Java version, ...), as sketched below.
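
For example, a minimal sketch of a requirements expression you could add to job.desc (above its queue statement) before submitting; the attribute names are standard HTCondor, while the values are assumptions about your pool:

	# add to job.desc before its queue statement (values are pool-specific)
	requirements = (OpSys == "LINUX") && (Arch == "X86_64")

	condor_submit job.desc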

Examples

Assume you have a jar file "wc.jar" (i.e. Hadoop's WordCount example). Then a possible call for creating your submit file could be:

	./mrscriptVanilla.py -p /path/to/hadoop-0.21.0/ -j /path/to/java/ -m 1 -r 2 -n my.namenode.edu:54310  wc.jar org.apache.hadoop.examples.WordCount hdfs://my.namenode.edu:54310/input hdfs://my.namenode.edu:54310/output
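
After the job completes, its output stays in HDFS rather than on the submit machine; it can be retrieved with the stock Hadoop shell (paths match the example above):

	# list and fetch the WordCount output from HDFS
	/path/to/hadoop-0.21.0/bin/hadoop fs -ls hdfs://my.namenode.edu:54310/output
	/path/to/hadoop-0.21.0/bin/hadoop fs -get hdfs://my.namenode.edu:54310/output ./output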

Attachments: mrscriptVanilla.tar.gz