progetti:htcondor-tf:using_htcondor
Differences
This shows you the differences between two versions of the page.
| progetti:htcondor-tf:using_htcondor [2020/01/31 10:46] – ecorni@infn.it | progetti:htcondor-tf:using_htcondor [2020/01/31 14:16] (current) – [Use of HTCondor Infrastructure] ecorni@infn.it | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ===== Use of HTCondor Infrastructure ===== | ||
| + | Suppose you have a small production HTCondor cluster composed of: | ||
| + | |||
| + | * '' | ||
| + | * '' | ||
| + | * '' | ||
| + | |||
| + | From a UI, the user submits a job to the submit node, which routes it to the central manager. The central manager is responsible for dispatching jobs to the worker nodes. | ||
| + | ==== Main HTCondor commands ==== | ||
| + | The most commonly used commands are: | ||
| + | * '' | ||
| + | * '' | ||
| + | * '' | ||
| + | * the submit file (a //.sub// file containing the information the batch system needs, equivalent to the //.jdl// file), passed as an argument. | ||
| + | |||
| + | * '' | ||
| + | |||
| + | * '' | ||
| + | |||
| + | * '' | ||
| + | |||
| + | * '' | ||
| + | |||
| + | ==== How to submit Grid jobs ==== | ||
| + | First, create the proxy: | ||
| + | < | ||
| + | voms-proxy-init --voms <vo name> | ||
| + | </ | ||
| + | Then submit the job with the following commands: | ||
| + | < | ||
| + | export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI | ||
| + | condor_submit -pool ce02-htc.cr.cnaf.infn.it: | ||
| + | </ | ||
| + | where // | ||
| + | < | ||
| + | -bash-4.2$ cat ce_testp308.sub | ||
| + | universe = vanilla | ||
| + | executable = / | ||
| + | output = outputFile.out | ||
| + | error = errorFile.err | ||
| + | log = logFile.log | ||
| + | ShouldTransferFiles = YES | ||
| + | WhenToTransferOutput = ON_EXIT | ||
| + | queue 1 | ||
| + | </ | ||
| + | |||
| + | ==== How to submit local jobs ==== | ||
| + | To submit jobs locally, i.e. from a CNAF UI, use the following command: | ||
| + | < | ||
| + | condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub | ||
| + | </ | ||
| + | where // | ||
| + | Example: | ||
| + | < | ||
| + | -bash-4.2$ cat test.sub | ||
| + | universe = vanilla | ||
| + | executable = / | ||
| + | output = outputFile.out | ||
| + | error = errorFile.err | ||
| + | log = logFile.log | ||
| + | ShouldTransferFiles = YES | ||
| + | WhenToTransferOutput = ON_EXIT | ||
| + | queue 1 | ||
| + | |||
| + | -bash-4.2$ condor_submit -spool -name sn-01.cr.cnaf.infn.it test.sub | ||
| + | Submitting job(s). | ||
| + | 1 job(s) submitted to cluster 8938. | ||
| + | </ | ||
| + | where //8938// is the cluster id. | ||
| + | |||
| + | ==== Monitoring ==== | ||
| + | === - To see all jobs launched by a user: === | ||
| + | < | ||
| + | condor_q -submitter < | ||
| + | </ | ||
| + | Example: | ||
| + | < | ||
| + | -bash-4.2$ condor_q -submitter ecorni | ||
| + | -- Submitter: ecorni@htc_tier1 : < | ||
| + | OWNER BATCH_NAME | ||
| + | ecorni ID: 6963 10/24 16:56 _ _ _ 1 1 6963.0 | ||
| + | ecorni ID: 6968 10/24 17:23 _ _ _ 1 1 6968.0 | ||
| + | ecorni ID: 8551 10/24 17:41 _ _ _ 1 1 8551.0 | ||
| + | ecorni ID: 8552 10/24 17:44 _ _ _ 1 1 8552.0 | ||
| + | ecorni ID: 8570 10/24 17:50 _ _ _ 1 1 8570.0 | ||
| + | |||
| + | Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended | ||
| + | Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended | ||
| + | -bash-4.2$ | ||
| + | </ | ||
| + | |||
| + | === - To get the list of held jobs and the hold reason: === | ||
| + | < | ||
| + | condor_q -submitter < | ||
| + | </ | ||
| + | Example: | ||
| + | < | ||
| + | -bash-4.2$ condor_q -submitter ecorni -held | ||
| + | -- Submitter: ecorni@htc_tier1 : < | ||
| + | | ||
| + | 6963.0 | ||
| + | 6968.0 | ||
| + | 8551.0 | ||
| + | 8552.0 | ||
| + | 8570.0 | ||
| + | |||
| + | Total for query: 5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended | ||
| + | Total for all users: 3937 jobs; 3880 completed, 0 removed, 9 idle, 0 running, 48 held, 0 suspended | ||
| + | -bash-4.2$ | ||
| + | </ | ||
| + | |||
| + | === - To get detailed information about a single job: === | ||
| + | < | ||
| + | condor_q -better-analyze -name sn-01.cr.cnaf.infn.it <cluster id> | ||
| + | </ | ||
| + | Example: | ||
| + | < | ||
| + | -bash-4.2$ condor_q -better-analyze -name sn-01.cr.cnaf.infn.it 8570.0 | ||
| + | -- Schedd: sn-01.cr.cnaf.infn.it : < | ||
| + | The Requirements expression for job 8570.000 is | ||
| + | |||
| + | (TARGET.Arch == " | ||
| + | (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer) | ||
| + | |||
| + | Job 8570.000 defines the following attributes: | ||
| + | |||
| + | DiskUsage = 1 | ||
| + | ImageSize = 1 | ||
| + | RequestDisk = DiskUsage | ||
| + | RequestMemory = ifthenelse(MemoryUsage =!= undefined, | ||
| + | |||
| + | The Requirements expression for job 8570.000 reduces to these conditions: | ||
| + | |||
| + | Slots | ||
| + | Step Matched | ||
| + | ----- -------- | ||
| + | [0] | ||
| + | [1] | ||
| + | [3] | ||
| + | [5] | ||
| + | [7] | ||
| + | |||
| + | 8570.000: | ||
| + | Hold reason: Failed to initialize user log to / | ||
| + | Last successful match: Thu Oct 24 17:51:15 2019 | ||
| + | |||
| + | 8570.000: | ||
| + | 0 are rejected by your job's requirements | ||
| + | 3 reject your job because of their own requirements | ||
| + | 0 match and are already running your jobs | ||
| + | 0 match but are serving other users | ||
| + | 69 are able to run your job | ||
| + | -bash-4.2$ | ||
| + | </ | ||
| + | |||
| + | ==== Retrieve the output files ==== | ||
| + | The job output files are not copied back automatically. To retrieve them, the user must run: | ||
| + | < | ||
| + | condor_transfer_data -name sn-01.cr.cnaf.infn.it <cluster id> | ||
| + | </ | ||
| + | NOTE: files transferred this way are limited to a few MB in size. For larger files, the data management tools have to be used; see the next chapter. | ||
| + | |||
| + | Example: | ||
| + | < | ||
| + | -bash-4.2$ condor_transfer_data -name sn-01.cr.cnaf.infn.it 8938 | ||
| + | Fetching data files... | ||
| + | -bash-4.2$ ls -lhtr | ||
| + | total 0 | ||
| + | -rw-r--r-- 1 ecorni tier1 173 Oct 25 14:35 test.sub | ||
| + | -rw-r--r-- 1 ecorni tier1 0 Oct 25 14:37 errorFile.err | ||
| + | -rw-r--r-- 1 ecorni tier1 1.1K Oct 25 14:37 logFile.log | ||
| + | -rw-r--r-- 1 ecorni tier1 34 Oct 25 14:37 outputFile.out | ||
| + | </ | ||
| + | Alternatively, the output files can be written to a path shared between the WN and the UI. To do so, the user needs to modify the submit file as follows: | ||
| + | < | ||
| + | -bash-4.2$ cat test.sub | ||
| + | universe = vanilla | ||
| + | executable = / | ||
| + | output = / | ||
| + | error = / | ||
| + | log = / | ||
| + | ShouldTransferFiles = YES | ||
| + | WhenToTransferOutput = ON_EXIT | ||
| + | queue 1 | ||
| + | </ | ||
| + | |||
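
The local submission and output retrieval steps described on this page could be wrapped in a few shell helpers. The following is only a minimal sketch, not part of the original page: the function names (`parse_cluster_id`, `submit_job`, `fetch_output`) and the `SCHEDD` variable are hypothetical, and it assumes the submit node `sn-01.cr.cnaf.infn.it` and the `condor_submit` output format shown in the examples above.

```shell
#!/bin/sh
# Sketch only: thin wrappers around the commands shown on this page.
# SCHEDD is a hypothetical variable; it defaults to the submit node
# used in the examples above.
SCHEDD="${SCHEDD:-sn-01.cr.cnaf.infn.it}"

# Read condor_submit output on stdin and print the cluster id.
# A line such as "1 job(s) submitted to cluster 8938." yields "8938";
# lines that do not match print nothing.
parse_cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Submit a .sub file with -spool and print the resulting cluster id.
# Empty output means the submission did not report a cluster id.
submit_job() {
    condor_submit -spool -name "$SCHEDD" "$1" | parse_cluster_id
}

# Fetch the spooled output files of a cluster into the current directory.
# The job must have completed, and the size limit noted above still applies.
fetch_output() {
    condor_transfer_data -name "$SCHEDD" "$1"
}
```

After sourcing the file, `cluster=$(submit_job test.sub)` stores the cluster id, and `fetch_output "$cluster"` retrieves the output once the job has completed. The sketch deliberately does not wait for completion; the job state can be checked with the `condor_q` commands from the Monitoring section.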
