===== Use of HTCondor Infrastructure =====
Suppose we have a small production HTCondor cluster composed of:
* ''sn-01.cr.cnaf.infn.it'': a submit node for submission from local UI
* ''ce0X-htc.cr.cnaf.infn.it'': computing elements for grid submission
* ''htc-2.cr.cnaf.infn.it'': a central manager
On a UI, the user can submit a job to the submit node, which then takes care of routing it to the central manager, responsible for dispatching the jobs to the worker nodes.
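As a quick sanity check of the pool, the available slots can be queried directly from the central manager. A minimal sketch (''condor_status'' contacts the collector running on the central manager):
condor_status -total -pool htc-2.cr.cnaf.infn.it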
==== Main HTCondor commands ====
The most used commands are listed below; a typical workflow combining them is sketched after the list.
* ''condor_submit'': Command for job submission. The following options are __required__:
  * ''-name sn-01.cr.cnaf.infn.it'': to correctly address the job to the submit node;
  * ''-spool'': to transfer the input files and keep a local copy of the output files;
  * the submission file (a //.sub// file containing the relevant information for the batch system, the equivalent of the //.jdl// file), to be given as an argument.
* ''condor_q'': Command to check the job status as it progresses. Always use it with the ''-name sn-01.cr.cnaf.infn.it'' option.
* ''condor_transfer_data'': Command to transfer the job output files locally once the execution has finished. To be used with the ''-name sn-01.cr.cnaf.infn.it'' option followed by the cluster id returned by ''condor_submit'' at submission.
* ''condor_rm'': Command to remove a job. To be used with the ''-name sn-01.cr.cnaf.infn.it'' option followed by the cluster id returned by ''condor_submit'' at submission.
* ''condor_history'': Command to see detailed information about all your jobs.
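Putting these commands together, a typical job lifecycle from the UI looks like the following sketch, where the cluster id //1234// stands for the value actually printed by ''condor_submit'':
condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub   # prints the cluster id, e.g. 1234
condor_q -name sn-01.cr.cnaf.infn.it 1234                   # check the job status
condor_transfer_data -name sn-01.cr.cnaf.infn.it 1234       # fetch the output files once done
condor_rm -name sn-01.cr.cnaf.infn.it 1234                  # remove the job, e.g. if stuck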
==== How to submit Grid jobs ====
First, create the VOMS proxy (replace ''<vo name>'' with the name of your VO):
voms-proxy-init --voms <vo name>
then submit the job with the following commands:
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool ce_testp308.sub
where //ce_testp308.sub// is the submit file:
-bash-4.2$ cat ce_testp308.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
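To monitor or manage the grid job afterwards, the same ''-pool'' and ''-name'' options used at submission have to be passed to the other commands as well. A sketch, assuming the job was submitted to //ce02-htc// as above (replace ''<cluster id>'' with the id returned by ''condor_submit''):
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>
condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>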
==== How to submit local jobs ====
To submit jobs locally, i.e. from the CNAF UI, use the following command:
condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
where //test.sub// is the submit file.
Example:
-bash-4.2$ cat test.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
-bash-4.2$ condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
Submitting job(s).
1 job(s) submitted to cluster 8938.
where //8938// is the cluster id.
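Real jobs usually run a user script rather than ''/bin/hostname''. A slightly richer submit file might look like the following sketch, where //analysis.sh// and //input.dat// are hypothetical user files: ''arguments'' passes command-line arguments to the executable, while ''transfer_input_files'' ships the input files to the worker node.
universe = vanilla
executable = analysis.sh
arguments = input.dat
transfer_input_files = input.dat
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1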
==== Monitoring ====
=== To see all jobs launched by a user ===
condor_q -submitter <username>
Example:
-bash-4.2$ condor_q -submitter ecorni
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:55:57
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
ecorni ID: 6963 10/24 16:56 _ _ _ 1 1 6963.0
ecorni ID: 6968 10/24 17:23 _ _ _ 1 1 6968.0
ecorni ID: 8551 10/24 17:41 _ _ _ 1 1 8551.0
ecorni ID: 8552 10/24 17:44 _ _ _ 1 1 8552.0
ecorni ID: 8570 10/24 17:50 _ _ _ 1 1 8570.0
Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$
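By default ''condor_q'' groups jobs into batches; to list every job on its own row, add the ''-nobatch'' option, e.g.:
condor_q -submitter ecorni -nobatch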
=== To get the list of held jobs and the hold reason ===
condor_q -submitter <username> -held
Example:
-bash-4.2$ condor_q -submitter ecorni -held
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:56:46
ID OWNER HELD_SINCE HOLD_REASON
6963.0 ecorni 10/24 16:56 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
6968.0 ecorni 10/24 17:23 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8551.0 ecorni 10/24 17:41 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8552.0 ecorni 10/24 17:45 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8570.0 ecorni 10/24 17:51 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$
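Once the cause of the hold has been fixed (here, the log path that could not be written), a held job can be put back in the queue with ''condor_release'', e.g. for the job above:
condor_release -name sn-01.cr.cnaf.infn.it 8570.0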
=== To get detailed information about a single job ===
condor_q -better-analyze -name sn-01.cr.cnaf.infn.it <job id>
Example:
-bash-4.2$ condor_q -better-analyze -name sn-01.cr.cnaf.infn.it 8570.0
-- Schedd: sn-01.cr.cnaf.infn.it : <131.154.192.58:9618?...
The Requirements expression for job 8570.000 is
(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
(TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
Job 8570.000 defines the following attributes:
DiskUsage = 1
ImageSize = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
The Requirements expression for job 8570.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 437 TARGET.Arch == "X86_64"
[1] 437 TARGET.OpSys == "LINUX"
[3] 437 TARGET.Disk >= RequestDisk
[5] 437 TARGET.Memory >= RequestMemory
[7] 437 TARGET.HasFileTransfer
8570.000: Job is held.
Hold reason: Failed to initialize user log to /home/TIER1/ecorni/logFile.log
Last successful match: Thu Oct 24 17:51:15 2019
8570.000: Run analysis summary ignoring user priority. Of 72 machines,
0 are rejected by your job's requirements
3 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
69 are able to run your job
-bash-4.2$
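For a quicker look at specific attributes of a job, without the full analysis, ''condor_q'' can print selected ClassAd attributes with the ''-af'' (autoformat) option, e.g.:
condor_q -name sn-01.cr.cnaf.infn.it 8570.0 -af JobStatus HoldReason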
==== Retrieve the output files ====
The job output files are not copied back automatically. To retrieve them, the user should launch:
condor_transfer_data -name sn-01.cr.cnaf.infn.it <cluster id>
NOTE: there is a limit of a few MB on the size of files that can be transferred this way. For larger files, the data management tools have to be used (see the next chapter).
Example:
-bash-4.2$ condor_transfer_data -name sn-01.cr.cnaf.infn.it 8938
Fetching data files...
-bash-4.2$ ls -lhtr
total 0
-rw-r--r-- 1 ecorni tier1 173 Oct 25 14:35 test.sub
-rw-r--r-- 1 ecorni tier1 0 Oct 25 14:37 errorFile.err
-rw-r--r-- 1 ecorni tier1 1.1K Oct 25 14:37 logFile.log
-rw-r--r-- 1 ecorni tier1 34 Oct 25 14:37 outputFile.out
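Note that jobs submitted with ''-spool'' typically remain in the queue in the //Completed// state after they finish; once the output has been fetched, the job can be removed, reusing the cluster id from the example above:
condor_rm -name sn-01.cr.cnaf.infn.it 8938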
Another way is to have the output files written to a path shared between the WN and the UI. To do so, the user needs to modify the submit file as follows:
-bash-4.2$ cat test.sub
universe = vanilla
executable = /bin/hostname
output = /storage/gpfs_ds50/darkside/users/fornarids/outputFile.out
error = /storage/gpfs_ds50/darkside/users/fornarids/errorFile.err
log = /storage/gpfs_ds50/darkside/users/fornarids/logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
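An equivalent, slightly tidier variant is to set ''initialdir'' once and keep the file names relative to it; relative ''output'', ''error'' and ''log'' paths are resolved against ''initialdir''. A sketch with the same shared path:
universe = vanilla
executable = /bin/hostname
initialdir = /storage/gpfs_ds50/darkside/users/fornarids
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1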