Use of HTCondor Infrastructure
Suppose we have a small production HTCondor cluster composed of:
sn-01.cr.cnaf.infn.it
: a submit node for submission from a local UI
ce0X-htc.cr.cnaf.infn.it
: computing elements for Grid submission
htc-2.cr.cnaf.infn.it
: a central manager
From a UI, the user submits a job to the submit node, which routes it to the central manager; the central manager is responsible for dispatching jobs to the worker nodes.
Main HTCondor commands
The most used commands are listed below; example invocations are sketched after the list.
condor_submit
: Command for job submission. The following options are required:
-name sn-01.cr.cnaf.infn.it
: to correctly address the job to the submit node
-spool
: to transfer the input files and keep a local copy of the output files
The submit file (a .sub file containing the relevant information for the batch system, the equivalent of the .jdl file) must be given as the final argument.
condor_q
: Command to check the job status as it progresses. Always to be used with the -name sn-01.cr.cnaf.infn.it option.
condor_transfer_data
: Command to transfer the job output files locally once execution has finished. To be used with the -name sn-01.cr.cnaf.infn.it option, followed by the cluster id returned by the condor_submit command at submission.
condor_rm
: Command to remove a job. To be used with the -name sn-01.cr.cnaf.infn.it option, followed by the cluster id returned by the condor_submit command at submission.
condor_history
: Command to see detailed information about all your jobs.
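As a quick reference, a sketch of typical invocations follows, where 8938 stands for a hypothetical cluster id returned at submission (yours will differ):
# check the status of your jobs on the submit node
condor_q -name sn-01.cr.cnaf.infn.it
# retrieve the output files of a finished job
condor_transfer_data -name sn-01.cr.cnaf.infn.it 8938
# remove a job
condor_rm -name sn-01.cr.cnaf.infn.it 8938
# list detailed information about your past jobs
condor_history -name sn-01.cr.cnaf.infn.it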
How to submit Grid jobs
First, create the proxy:
voms-proxy-init --voms <vo name>
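For example, assuming membership in a hypothetical VO named myvo (replace it with your actual VO name), the proxy can be created and then inspected:
voms-proxy-init --voms myvo
voms-proxy-info --all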
then submit the job with the following command:
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool ce_testp308.sub
where ce_testp308.sub is the submit file:
-bash-4.2$ cat ce_testp308.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
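The status of a Grid job can then be checked by pointing condor_q at the CE; a minimal sketch, assuming the same GSI environment variable exported above:
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it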
How to submit local jobs
To submit jobs locally, i.e. from the CNAF UI, use the following command:
condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
where test.sub is the submit file. Example:
-bash-4.2$ cat test.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
-bash-4.2$ condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
Submitting job(s).
1 job(s) submitted to cluster 8938.
where 8938 is the cluster id.
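Beyond this minimal example, a submit file can also declare arguments and resource requests with standard HTCondor keywords; the following is a sketch, not a CNAF-specific template:
universe = vanilla
executable = /bin/sleep
arguments = 60
request_cpus = 1
request_memory = 1024
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1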
Monitoring
- To see all jobs launched by a user:
condor_q -submitter <user>
Example:
-bash-4.2$ condor_q -submitter ecorni
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:55:57
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ecorni ID: 6963     10/24 16:56      _      _      _      1      1 6963.0
ecorni ID: 6968     10/24 17:23      _      _      _      1      1 6968.0
ecorni ID: 8551     10/24 17:41      _      _      _      1      1 8551.0
ecorni ID: 8552     10/24 17:44      _      _      _      1      1 8552.0
ecorni ID: 8570     10/24 17:50      _      _      _      1      1 8570.0

Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$
- To get the list of held jobs and the held reason:
condor_q -submitter <user> -held
Example:
-bash-4.2$ condor_q -submitter ecorni -held
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:56:46
 ID      OWNER    HELD_SINCE  HOLD_REASON
6963.0   ecorni   10/24 16:56 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
6968.0   ecorni   10/24 17:23 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8551.0   ecorni   10/24 17:41 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8552.0   ecorni   10/24 17:45 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8570.0   ecorni   10/24 17:51 Failed to initialize user log to /home/TIER1/ecorni/logFile.log

Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$
- To get detailed information about a single job:
condor_q -better-analyze -name sn-01.cr.cnaf.infn.it <cluster id>
Example:
-bash-4.2$ condor_q -better-analyze -name sn-01.cr.cnaf.infn.it 8570.0

-- Schedd: sn-01.cr.cnaf.infn.it : <131.154.192.58:9618?...
The Requirements expression for job 8570.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.HasFileTransfer)

Job 8570.000 defines the following attributes:

    DiskUsage = 1
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 8570.000 reduces to these conditions:

         Slots
    Step    Matched  Condition
    -----  --------  ---------
    [0]         437  TARGET.Arch == "X86_64"
    [1]         437  TARGET.OpSys == "LINUX"
    [3]         437  TARGET.Disk >= RequestDisk
    [5]         437  TARGET.Memory >= RequestMemory
    [7]         437  TARGET.HasFileTransfer

8570.000:  Job is held.

Hold reason: Failed to initialize user log to /home/TIER1/ecorni/logFile.log

Last successful match: Thu Oct 24 17:51:15 2019

8570.000:  Run analysis summary ignoring user priority.  Of 72 machines,
      0 are rejected by your job's requirements
      3 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     69 are able to run your job
-bash-4.2$
Retrieve the output files
The job output files are not copied back automatically. The user has to launch:
condor_transfer_data -name sn-01.cr.cnaf.infn.it <cluster id>
NOTE: there is a limit of a few MB on the size of files that can be transferred this way. For larger files, the data management tools have to be used; see the next chapter.
Example:
-bash-4.2$ condor_transfer_data -name sn-01.cr.cnaf.infn.it 8938
Fetching data files...
-bash-4.2$ ls -lhtr
total 0
-rw-r--r-- 1 ecorni tier1  173 Oct 25 14:35 test.sub
-rw-r--r-- 1 ecorni tier1    0 Oct 25 14:37 errorFile.err
-rw-r--r-- 1 ecorni tier1 1.1K Oct 25 14:37 logFile.log
-rw-r--r-- 1 ecorni tier1   34 Oct 25 14:37 outputFile.out
Another way is to have the output files written to a path shared between the WN and the UI. To do so, the user needs to modify the submit file as follows:
-bash-4.2$ cat test.sub
universe = vanilla
executable = /bin/hostname
output = /storage/gpfs_ds50/darkside/users/fornarids/outputFile.out
error = /storage/gpfs_ds50/darkside/users/fornarids/errorFile.err
log = /storage/gpfs_ds50/darkside/users/fornarids/logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
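With this setup, once the job finishes the outputs should be readable directly from the shared area, with no need for condor_transfer_data, for example:
ls -lhtr /storage/gpfs_ds50/darkside/users/fornarids/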