===== Use of HTCondor Infrastructure =====
Suppose you have a small production HTCondor cluster composed of:

  * ''sn-01.cr.cnaf.infn.it'': a submit node for submission from a local UI
  * ''ce0X-htc.cr.cnaf.infn.it'': computing elements for Grid submission
  * ''htc-2.cr.cnaf.infn.it'': a central manager

From a UI the user can submit a job to the submit node, which then routes it to the central manager, responsible for dispatching the jobs to the worker nodes.
==== Main HTCondor commands ====
The most commonly used commands are:
  * ''condor_submit'': submits a job. The following options are __required__:
     * ''-name sn-01.cr.cnaf.infn.it'': to correctly address the job to the submit node
     * ''-spool'': to transfer the input files and keep a local copy of the output files
     * the submit file (a //.sub// file containing the relevant information for the batch system, the equivalent of the //.jdl// file), given as argument.

  * ''condor_q'': checks the job status. Always use it with the ''-name sn-01.cr.cnaf.infn.it'' option.

  * ''condor_transfer_data'': transfers the job output files locally once execution has finished. Use it with the ''-name sn-01.cr.cnaf.infn.it'' option followed by the cluster id returned by ''condor_submit'' at submission.

  * ''condor_rm'': removes a job. Use it with the ''-name sn-01.cr.cnaf.infn.it'' option followed by the cluster id returned by ''condor_submit'' at submission.

  * ''condor_history'': shows detailed information about all your jobs.

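When scripting submissions, the required options can be collected in one place. The following Python sketch builds the command lines for the commands above; the command names and options come from this page, while the helper function names are illustrative assumptions, not part of HTCondor. The resulting lists can be passed to ''subprocess.run''.

```python
# Hypothetical helpers: build argv lists for the HTCondor commands above.
# Only the command names and options come from this page; the function
# names are illustrative.
SUBMIT_NODE = "sn-01.cr.cnaf.infn.it"

def submit_cmd(sub_file):
    """condor_submit with the required -name and -spool options."""
    return ["condor_submit", "-name", SUBMIT_NODE, "-spool", sub_file]

def query_cmd(cluster_id):
    """condor_q against the submit node for a given cluster id."""
    return ["condor_q", "-name", SUBMIT_NODE, str(cluster_id)]

def transfer_cmd(cluster_id):
    """condor_transfer_data to fetch the job output files locally."""
    return ["condor_transfer_data", "-name", SUBMIT_NODE, str(cluster_id)]

def remove_cmd(cluster_id):
    """condor_rm to delete a job from the queue."""
    return ["condor_rm", "-name", SUBMIT_NODE, str(cluster_id)]
```

For example, ''submit_cmd("test.sub")'' returns the full ''condor_submit'' command line with the required options already in place.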
==== How to submit Grid jobs ====
First, create the proxy:
<code>
voms-proxy-init --voms <vo name>
</code>
then submit the job with the following command:
<code>
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool ce_testp308.sub
</code>
where //ce_testp308.sub// is the submit file:
<code>
-bash-4.2$ cat ce_testp308.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
</code>
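When many similar jobs have to be submitted, a submit file like the one above can also be generated programmatically. A minimal Python sketch that reproduces the file shown above (the function name is an illustrative assumption, not an HTCondor API):

```python
def render_submit(executable, outfile, errfile, logfile):
    """Return the text of a vanilla-universe submit file with the
    same attributes as the example above."""
    return "\n".join([
        "universe = vanilla",
        f"executable = {executable}",
        f"output = {outfile}",
        f"error = {errfile}",
        f"log = {logfile}",
        "ShouldTransferFiles = YES",
        "WhenToTransferOutput = ON_EXIT",
        "queue 1",
    ]) + "\n"
```

Writing the returned string to a //.sub// file yields the same content as //ce_testp308.sub//.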

==== How to submit local jobs ====
To submit jobs locally, i.e. from the CNAF UI, use the following command:
<code>
condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
</code>
where //test.sub// is the submit file. Example:
<code>
-bash-4.2$ cat test.sub
universe = vanilla
executable = /bin/hostname
output = outputFile.out
error = errorFile.err
log = logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1

-bash-4.2$ condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub
Submitting job(s).
1 job(s) submitted to cluster 8938.
</code>
where //8938// is the cluster id.
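In scripts, the cluster id can be extracted from the ''condor_submit'' output shown above. A small Python sketch of such a parser (the function name is an illustrative assumption):

```python
import re

def parse_cluster_id(stdout):
    """Extract the cluster id from condor_submit's output, e.g.
    "1 job(s) submitted to cluster 8938."; return None if absent."""
    m = re.search(r"submitted to cluster (\d+)", stdout)
    return int(m.group(1)) if m else None
```

The returned id is what ''condor_q'', ''condor_transfer_data'', and ''condor_rm'' expect as argument.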

==== Monitoring ====
=== To see all jobs launched by a user ===
<code>
condor_q -submitter <user>
</code>
Example:
<code>
-bash-4.2$ condor_q -submitter ecorni
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:55:57
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ecorni ID: 6963    10/24 16:56      _      _      _      1      1 6963.0
ecorni ID: 6968    10/24 17:23      _      _      _      1      1 6968.0
ecorni ID: 8551    10/24 17:41      _      _      _      1      1 8551.0
ecorni ID: 8552    10/24 17:44      _      _      _      1      1 8552.0
ecorni ID: 8570    10/24 17:50      _      _      _      1      1 8570.0

Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$
</code>

=== To get the list of held jobs and the hold reason ===
<code>
condor_q -submitter <user> -held
</code>
Example:
<code>
-bash-4.2$ condor_q -submitter ecorni -held
-- Submitter: ecorni@htc_tier1 : <131.154.192.58:9618?... : sn-01.cr.cnaf.infn.it @ 10/25/19 09:56:46
 ID      OWNER          HELD_SINCE  HOLD_REASON
6963.0   ecorni         10/24 16:56 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
6968.0   ecorni         10/24 17:23 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8551.0   ecorni         10/24 17:41 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8552.0   ecorni         10/24 17:45 Failed to initialize user log to /home/TIER1/ecorni/logFile.log
8570.0   ecorni         10/24 17:51 Failed to initialize user log to /home/TIER1/ecorni/logFile.log

Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended
Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended
-bash-4.2$
</code>
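The held-job table above has a fixed layout, so the hold reasons can also be collected programmatically, e.g. to spot recurring failures across many jobs. A Python sketch under that layout assumption (the function name is illustrative):

```python
import re

def parse_held(output):
    """Map job id -> hold reason from `condor_q -held` output lines
    shaped like: "6963.0   ecorni   10/24 16:56 <reason>"."""
    pattern = re.compile(r"^(\d+\.\d+)\s+\S+\s+\d+/\d+ \d+:\d+\s+(.*)$")
    jobs = {}
    for line in output.splitlines():
        m = pattern.match(line.strip())
        if m:
            jobs[m.group(1)] = m.group(2)
    return jobs
```

Header and summary lines do not match the pattern and are skipped automatically.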

=== To get detailed information about a single job ===
<code>
condor_q -better-analyze -name sn-01.cr.cnaf.infn.it <cluster id>
</code>
Example:
<code>
-bash-4.2$ condor_q -better-analyze -name sn-01.cr.cnaf.infn.it 8570.0
-- Schedd: sn-01.cr.cnaf.infn.it : <131.154.192.58:9618?...
The Requirements expression for job 8570.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
    (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 8570.000 defines the following attributes:

    DiskUsage = 1
    ImageSize = 1
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 8570.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         437  TARGET.Arch == "X86_64"
[1]         437  TARGET.OpSys == "LINUX"
[3]         437  TARGET.Disk >= RequestDisk
[5]         437  TARGET.Memory >= RequestMemory
[7]         437  TARGET.HasFileTransfer

8570.000:  Job is held.
Hold reason: Failed to initialize user log to /home/TIER1/ecorni/logFile.log
Last successful match: Thu Oct 24 17:51:15 2019

8570.000:  Run analysis summary ignoring user priority.  Of 72 machines,
      0 are rejected by your job's requirements
      3 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     69 are able to run your job
-bash-4.2$
</code>

==== Retrieve the output files ====
The job outputs are not copied back automatically. The user should launch:
<code>
condor_transfer_data -name sn-01.cr.cnaf.infn.it <cluster id>
</code>
NOTE: there is a limit of a few MB on the size of files that can be transferred in this way. For larger files, the data management tools have to be used; see the next chapter.

Example:
<code>
-bash-4.2$ condor_transfer_data -name sn-01.cr.cnaf.infn.it 8938
Fetching data files...
-bash-4.2$ ls -lhtr
total 0
-rw-r--r-- 1 ecorni tier1  173 Oct 25 14:35 test.sub
-rw-r--r-- 1 ecorni tier1    0 Oct 25 14:37 errorFile.err
-rw-r--r-- 1 ecorni tier1 1.1K Oct 25 14:37 logFile.log
-rw-r--r-- 1 ecorni tier1   34 Oct 25 14:37 outputFile.out
</code>
Another way is to have the output files written to a path shared between the WN and the UI. To do so, the user needs to modify the submit file as follows:
<code>
-bash-4.2$ cat test.sub
universe = vanilla
executable = /bin/hostname
output = /storage/gpfs_ds50/darkside/users/fornarids/outputFile.out
error = /storage/gpfs_ds50/darkside/users/fornarids/errorFile.err
log = /storage/gpfs_ds50/darkside/users/fornarids/logFile.log
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
</code>
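Since the held jobs in the monitoring examples failed with "Failed to initialize user log", it can be worth checking that the chosen output directory is visible and writable from the UI before submitting. A minimal Python sketch (the helper name is an illustrative assumption):

```python
import os

def output_dir_ok(path):
    """Return True if the directory of `path` exists and is writable,
    e.g. for the log/output/error paths on a shared filesystem."""
    d = os.path.dirname(path)
    return os.path.isdir(d) and os.access(d, os.W_OK)
```

Running this check on each of the ''output'', ''error'', and ''log'' paths before ''condor_submit'' avoids jobs going into the held state for this reason.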