Use of HTCondor Infrastructure

Suppose we have a small production HTCondor cluster composed of:

  • sn-01.cr.cnaf.infn.it: a submit node for submission from a local user interface (UI)
  • ce0X-htc.cr.cnaf.infn.it: computing elements for grid submission
  • htc-2.cr.cnaf.infn.it: a central manager

From a UI, the user can submit a job to the submit node, which then routes it to the central manager, responsible for dispatching the jobs to the worker nodes.

Main HTCondor commands

The most used commands are listed below (a minimal end-to-end sketch follows the list):

  • condor_submit: Command for job submission. The following options are required:
    • -name sn-01.cr.cnaf.infn.it: to correctly address the job to the submit node
    • -spool: to transfer the input files and keep a local copy of the output files
    • the submit description file (a .sub file containing the relevant information for the batch system, the equivalent of the .jdl file), passed as an argument.
  • condor_q: Command to check the job status as it progresses. Always use it with the -name sn-01.cr.cnaf.infn.it option.
  • condor_transfer_data: Command to transfer the job output files locally once the execution has finished. Use it with the -name sn-01.cr.cnaf.infn.it option followed by the cluster id returned by condor_submit at submission.
  • condor_rm: Command to remove a job. Use it with the -name sn-01.cr.cnaf.infn.it option followed by the cluster id returned by condor_submit at submission.
  • condor_history: Command to see detailed information about all your jobs, including those that have already left the queue.
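
Putting these commands together, a typical local session looks like the following sketch (test.sub is a submit file like the examples shown below; 8938 stands for the cluster id printed by condor_submit at submission):

# submit the job described in test.sub through the submit node
condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
# check the status of the resulting cluster
condor_q -name sn-01.cr.cnaf.infn.it 8938
# when the job has finished, fetch the output files locally
condor_transfer_data -name sn-01.cr.cnaf.infn.it 8938
# finally, remove the job from the queue
condor_rm -name sn-01.cr.cnaf.infn.it 8938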

How to submit Grid jobs

First, create the proxy:

voms-proxy-init --voms <vo name>
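
The proxy can be verified with voms-proxy-info, which prints the VO attributes and the remaining lifetime:

voms-proxy-info --all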

then submit the job with the following command:

export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool ce_testp308.sub

where ce_testp308.sub is the submit file:

-bash-4.2$ cat ce_testp308.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
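
A grid job submitted this way lives on the CE schedd rather than on sn-01, so its status is checked by pointing condor_q at the same pool and schedd used at submission. A sketch, where 15 is a hypothetical cluster id:

export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 15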

How to submit local jobs

To submit jobs locally, i.e. from the CNAF UI, use the following command:

condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub

where test.sub is the submit file. Example:

-bash-4.2$ cat test.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1

-bash-4.2$ condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
Submitting job(s).
1 job(s) submitted to cluster 8938.

where 8938 is the cluster id.
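
The test.sub above only runs /bin/hostname. A slightly more realistic submit file, running a user script with arguments, an input file and explicit resource requests, might look like the sketch below (myscript.sh, input.dat and the request values are hypothetical and should be adapted):

universe = vanilla
# user script, transferred from the submission directory
executable = myscript.sh
arguments = input.dat 42
# additional input files to ship to the worker node
transfer_input_files = input.dat
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
# resources requested for the job
request_cpus = 1
request_memory = 2 GB
queue 1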

Monitoring

- To see all jobs launched by a user:

condor_q -submitter <user>

Example:

-bash-4.2$ condor_q -submitter ecorni
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:55:57
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ecorni ID: 6963    10/24 16:56      _      _      _      1      1 6963.0
ecorni ID: 6968    10/24 17:23      _      _      _      1      1 6968.0
ecorni ID: 8551    10/24 17:41      _      _      _      1      1 8551.0
ecorni ID: 8552    10/24 17:44      _      _      _      1      1 8552.0
ecorni ID: 8570    10/24 17:50      _      _      _      1      1 8570.0

Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$

- To get the list of held jobs and the hold reason:

condor_q -submitter <user> -held

Example:

-bash-4.2$ condor_q -submitter ecorni -held
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:56:46
 ID      OWNER          HELD_SINCE  HOLD_REASON
6963.0   ecorni         10/24 16:56 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
6968.0   ecorni         10/24 17:23 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8551.0   ecorni         10/24 17:41 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8552.0   ecorni         10/24 17:45 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8570.0   ecorni         10/24 17:51 Failed to initialize user log to /home/TIER1/ecorni/logFile.log

Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$
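
In this example the jobs were held, presumably because the log file path is not writable from the worker nodes. Once the cause of a hold has been fixed, a job can be put back in the queue with condor_release, used with the same options as condor_rm:

condor_release -name sn-01.cr.cnaf.infn.it <cluster id>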

- To get detailed information about a single job:

condor_q -better-analyze -name sn-01.cr.cnaf.infn.it <cluster id>

Example:

-bash-4.2$ condor_q -better-analyze -name sn-01.cr.cnaf.infn.it 8570.0
-- Schedd: sn-01.cr.cnaf.infn.it : <131.154.192.58:9618?...
The Requirements expression for job 8570.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
    (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 8570.000 defines the following attributes:

    DiskUsage = 1
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 8570.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         437  TARGET.Arch == "X86_64"
[1]         437  TARGET.OpSys == "LINUX"
[3]         437  TARGET.Disk >= RequestDisk
[5]         437  TARGET.Memory >= RequestMemory
[7]         437  TARGET.HasFileTransfer

8570.000:  Job is held.
Hold reason: Failed to initialize user log to /home/TIER1/ecorni/logFile.log
Last successful match: Thu Oct 24 17:51:15 2019

8570.000:  Run analysis summary ignoring user priority.  Of 72 machines,
      0 are rejected by your job's requirements
      3 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     69 are able to run your job
-bash-4.2$
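
Jobs that have finished and left the queue no longer show up in condor_q; they can still be inspected with condor_history, mentioned above. For example, to list all past jobs of a user on the submit node:

condor_history -name sn-01.cr.cnaf.infn.it <user>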

Retrieve the output files

The job output files are not copied back automatically. To retrieve them, the user should launch:

condor_transfer_data -name sn-01.cr.cnaf.infn.it <cluster id>

NOTE: there is a limit of a few MB on the size of files that can be transferred this way. For larger files, the data management tools have to be used. See the next chapter.

Example:

-bash-4.2$ condor_transfer_data -name sn-01.cr.cnaf.infn.it 8938
Fetching data files...
-bash-4.2$ ls -lhtr
total 0
-rw-r--r-- 1 ecorni tier1  173 Oct 25 14:35 test.sub
-rw-r--r-- 1 ecorni tier1    0 Oct 25 14:37 errorFile.err
-rw-r--r-- 1 ecorni tier1 1.1K Oct 25 14:37 logFile.log
-rw-r--r-- 1 ecorni tier1   34 Oct 25 14:37 outputFile.out
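
Once the output has been fetched, the completed job can be removed from the submit node's queue with condor_rm, as described above:

condor_rm -name sn-01.cr.cnaf.infn.it 8938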

Another way is to have the output files written to a path shared between the WN and the UI. To do so, the user needs to modify the submit file as follows:

-bash-4.2$ cat test.sub 
universe = vanilla
executable = /bin/hostname
output = /storage/gpfs_ds50/darkside/users/fornarids/outputFile.out
error = /storage/gpfs_ds50/darkside/users/fornarids/errorFile.err
log = /storage/gpfs_ds50/darkside/users/fornarids/logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1