progetti:icarus:production-guide
Previous revision: 2025/10/14 21:53, [Resubmit Raw Data Production], vpia@infn.it
Current revision: 2025/10/27 15:24, [Submit the batch of jobs], vpia@infn.it
 +====== Production Guide ======
 +
 +====== List of Productions ======
 +^ Date   ^ Type  ^ Tag                                 ^
 +| 10/25  | DATA  | run3-processing-cnaf-1025-v10_06_00_04p03  |
 +| 03/24  | MC    | mc-v09_84_00_01-202403-cnaf-corrsce        |
 +| 03/24  | DATA  | run2-v09_84_00_01-202403-cnaf              |
 +| 03/24  | MC    | mc-v09_84_00_01-202403-cnaf                |
 +| 02/24  | DATA  | data_run2-v09_83_01202402-cnaf             |
 +| 02/24  | MC    | mc_nucosm-v09_83_01202402-cnaf             |
 +| 12/23  | DATA  | run2-v09_72_00_06-122023-variables         |
 +
 +====== General Info ======
 +This page details the steps needed to submit and monitor production campaigns (hereafter campaigns) at CNAF. Two main types of campaigns are possible: **real** data and **MC** data. In both cases, an initial setup is needed for each new campaign. After that, the campaign is submitted in multiple steps, with each step submitting a batch of jobs. At the end, a final check of the completion of the campaign is required.
 +
 +Details are given below on the [[https://wiki.infn.it/progetti/icarus/production-guide#initial_setup|initial setup]], [[https://wiki.infn.it/progetti/icarus/production-guide#what_to_do_while_on_shift|what to do while on shift]] and how to [[https://wiki.infn.it/progetti/icarus/production-guide#check_the_completion_of_the_campaign|check the completion of the campaign]].
 +
 +====== Initial Setup ======
 +The first step is to download and set up all the needed scripts. This must be done **only once** per campaign: all the batches of jobs for the current campaign will be submitted with the **same** scripts. Each production request has its own configuration and is associated with a (//git//) tag (''<selected-tag>'') used to download the correct version of the scripts. Once the tag has been provided, the shifter has to create a working directory for it inside their own ''productions'' folder and move into it (''mkdir -p'' also creates the ''productions'' folder if it doesn't exist yet):
 +
 +  mkdir -p /storage/gpfs_data/icarus/local/users/$USER/productions/<selected-tag>
 +  cd /storage/gpfs_data/icarus/local/users/$USER/productions/<selected-tag>
 +
 +From this folder, download the correct version of the scripts,
 +  
 +  git clone https://baltig.infn.it/icarus/prod-scripts/ --recurse-submodules --branch <selected-tag>
 +
 +and move into the ''prod-scripts'' folder, from which all steps will be submitted.
 +
 +  cd prod-scripts
 +
 +The initial setup is now complete. Here is a complete example:
 +
 +  cd /storage/gpfs_data/icarus/local/users/vpia/productions
 +  mkdir run2-v09_72_00_06-122023-variables
 +  cd run2-v09_72_00_06-122023-variables
 +  git clone https://baltig.infn.it/icarus/prod-scripts/ --recurse-submodules --branch run2-v09_72_00_06-122023-variables
 +  cd prod-scripts
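
The setup steps can also be collected in a small helper. The following is only a sketch: the function name ''print_setup_commands'' is hypothetical (it is not part of ''prod-scripts''), and the default base directory simply mirrors the example above. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper (not part of prod-scripts): print the setup commands
# for a given campaign tag without executing them.
# Usage: print_setup_commands <selected-tag> [base-dir]
print_setup_commands() {
    tag="$1"
    # Assumption: the default base directory follows the example on this page.
    base="${2:-/storage/gpfs_data/icarus/local/users/$USER/productions}"
    if [ -z "$tag" ]; then
        echo "usage: print_setup_commands <selected-tag> [base-dir]" >&2
        return 1
    fi
    # mkdir -p creates missing parent folders and does not fail if they exist
    echo "mkdir -p $base/$tag"
    echo "cd $base/$tag"
    echo "git clone https://baltig.infn.it/icarus/prod-scripts/ --recurse-submodules --branch $tag"
    echo "cd prod-scripts"
}
```

For example, ''print_setup_commands run2-v09_72_00_06-122023-variables'' prints the four commands for that tag.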
 +
 +====== What to do while on shift ======
 +During the campaign, the shifter must check the status of the submitted jobs **every 6 hours** and submit a new batch of jobs if needed.
 +The steps, in sequential order, are:
 +
 +  - [[https://wiki.infn.it/progetti/icarus/production-guide#check_queue_s_status|Check queue's status]]
 +  - [[https://wiki.infn.it/progetti/icarus/production-guide#configure_the_next_job_submission|Configure the next job submission (either real or MC)]]
 +  - [[https://wiki.infn.it/progetti/icarus/production-guide#create_a_proxy_with_voms_extensions|Create a proxy with voms extensions]]
 +  - [[https://wiki.infn.it/progetti/icarus/production-guide#submit_the_batch_of_jobs|Submit the batch of jobs]]
 +
 +Once the submission of the production is complete, the shifter should [[https://wiki.infn.it/progetti/icarus/production-guide#check_the_completion_of_the_campaign|check the completion of the campaign]].
 +
 +
 +===== Check queue's status =====
 +The first step is to check the number of jobs in the //idle// state. This can be done as follows:
 +{{ :progetti:icarus:screenshot_2025-10-14_alle_11.45.27.png?800 |}}
 +    - Open the [[https://t1metria.cr.cnaf.infn.it/d/db182c5e-6570-4b36-b47b-2cade2437b60-4/htcondor-job-overview?orgId=18&refresh=3m&var-retention=one_week&var-Cluster=All&var-VO=icarus&from=now-7d&to=now|grafana page]]
 +    - Scroll to the bottom of the page and check the //Idle// number in the Job Status icarus section
 +
 +If this number is smaller than **300**, go to the [[https://wiki.infn.it/progetti/icarus/production-guide#configure_the_next_job_submission|next step]].
 +If you don't see the //Idle// section, go to the [[https://wiki.infn.it/progetti/icarus/production-guide#configure_the_next_job_submission|next step]] as well.
 +If there are 300 or more //Idle// jobs, repeat this check in 6 hours.
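
The decision rule above can be written as a tiny shell function. This is a minimal sketch: ''can_submit'' is a hypothetical name, and the idle count is assumed to be read by hand from the grafana page and passed as an argument.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when a new batch can be submitted.
# Usage: can_submit <idle-count> [threshold]
can_submit() {
    idle="$1"
    threshold="${2:-300}"   # threshold quoted on this page
    case "$idle" in
        ''|*[!0-9]*)
            echo "idle count must be a non-negative integer" >&2
            return 2
            ;;
    esac
    # Submit only when the number of idle jobs is strictly below the threshold
    [ "$idle" -lt "$threshold" ]
}
```

Usage: ''if can_submit 120; then echo "submit a new batch"; else echo "check again in 6 hours"; fi''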
 +
 +
 +===== Configure the next job submission =====
 +If the current number of //idle// jobs is smaller than **300**, a new batch of jobs can be submitted. First, the shifter has to configure the job submission from the ''prod-scripts'' folder inside the working area created in the initial setup section. Example:
 +
 +  cd /storage/gpfs_data/icarus/local/users/vpia/productions/run2-v09_72_00_06-122023-variables/prod-scripts
 +
 +Then, the shifter has to configure the job submission by editing the ''variables.sh'' file. This step differs depending on the production type: [[https://wiki.infn.it/progetti/icarus/production-guide#configure_the_job_submission_for_real_data_production|real]] or [[https://wiki.infn.it/progetti/icarus/production-guide#configure_the_job_submission_for_mc_data_production|MC]] data.
 +
 +==== Configure the job submission for real data production ====
 +Here, the ''variables.sh'' file must be updated with the details of the batch of jobs to be submitted. The **only** variable to be modified is **YOUR_CUSTOM_RUN_LIST**, which should contain the list of run numbers to submit in the batch.
 +
 +=== From a Google Sheet document ===
 +The list of runs is provided in the first column of the sheet.
 +
 +Here is what to do for each batch:
 +  * open the //variables.sh// file with an editor (e.g. ''vim'', ''nano'', ''emacs'' or whatever you like) and check the **YOUR_CUSTOM_RUN_LIST** variable (it should be empty the first time)
 +  * in the sheet, look for runs with the **Submitted** column set to **no**, and select runs among these so that the total number of files is about 1000
 +  * copy the selected run numbers into the **YOUR_CUSTOM_RUN_LIST** variable, between quotes and separated by spaces
 +  * save //variables.sh//
 +
 +{{ :progetti:icarus:google_sheet.png?800 |}}
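
Selecting runs until the total reaches about 1000 files can be sketched as follows. This assumes the run number and the number of files of each unsubmitted run have been copied from the sheet into plain text, one ''<run> <files>'' pair per line; that input format and the name ''select_runs'' are illustrative assumptions, not something provided by ''prod-scripts''.

```shell
#!/bin/sh
# Hypothetical helper: pick runs, in order, until the running total of files
# reaches the target. Reads "<run> <number-of-files>" pairs from stdin and
# prints the selected run numbers on one line, separated by spaces.
# Usage: select_runs [target] < pairs.txt
select_runs() {
    target="${1:-1000}"
    awk -v target="$target" '
        # Keep adding runs while the total is still below the target
        NF >= 2 && total + 0 < target + 0 {
            runs = runs (runs == "" ? "" : " ") $1
            total += $2
        }
        END { print runs }
    '
}
```

The printed list is what goes between the quotes in ''variables.sh'', for example ''YOUR_CUSTOM_RUN_LIST="11806 11812 11817"''.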
 +
 +==== Configure the job submission for MC data production ====
 +Here, the //variables.sh// file must be updated with the details of the batch of jobs to be submitted. Two variables need to be modified:
 +  * **STARTING_RUN**
 +  * **NUMBER_OF_RUNS**
 +
 +The values of both variables for each step are provided in the //batch.info// file, located in the same folder created during the setup. If no //batch.info// file is present, the list of batches should have been provided by other means.
 +
 +Regardless of how the list was distributed, here is what to do for each batch:
 +  * open the //variables.sh// file with an editor (e.g. ''vim'', ''nano'', ''emacs'' or whatever you like) and check the **STARTING_RUN** and **NUMBER_OF_RUNS** variables (they should both be 0 the first time)
 +  * open the //batch.info// file (or the resource provided with this information) with an editor and find the step corresponding to the current values of the **STARTING_RUN** and **NUMBER_OF_RUNS** variables
 +  * go to the next step in the list and copy its values into the **STARTING_RUN** and **NUMBER_OF_RUNS** variables
 +  * save //variables.sh//
 +
 +{{ :progetti:icarus:mc-step.png?direct&1000 |}}
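
Before submitting, the two values can be sanity-checked with a short function. This is only a sketch: ''check_mc_batch'' is a hypothetical name, and the check relies solely on the fact stated above that both variables are 0 before the first batch.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check the two MC variables after editing
# variables.sh.
# Usage: check_mc_batch <STARTING_RUN> <NUMBER_OF_RUNS>
check_mc_batch() {
    start="$1"
    count="$2"
    for value in "$start" "$count"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "expected a non-negative integer, got '$value'" >&2
                return 1
                ;;
        esac
    done
    # Both variables are 0 before the first batch, so a zero NUMBER_OF_RUNS
    # most likely means the file has not been updated yet.
    if [ "$count" -eq 0 ]; then
        echo "NUMBER_OF_RUNS is 0: variables.sh does not look updated" >&2
        return 1
    fi
    echo "STARTING_RUN=$start NUMBER_OF_RUNS=$count"
}

# Possible use, assuming variables.sh can be sourced and defines both names:
#   . ./variables.sh && check_mc_batch "$STARTING_RUN" "$NUMBER_OF_RUNS"
```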
 +
 +===== Create a proxy with voms extensions =====
 +
 +You have to create a proxy with the VOMS extension. This step must be done **every time** you are going to submit a batch of jobs.
 +
 +If you don't yet have the certificate needed to generate a proxy, please get one by following the instructions provided in the **Personal Certificate and VO Enrollment** section of [[https://wiki.infn.it/progetti/icarus/data|this page]].
 +
 +Once you have the certificate, simply run the following command:
 +
 +  voms-proxy-init --voms icarus-exp.org --valid 72:00
 +
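 +You'll be asked to confirm your identity by entering the GRID pass phrase used during the proxy setup. After entering it, a proxy with a requested duration of three days will be created. Note that, as in the sample output below, the VOMS server may shorten the validity of the VOMS attributes (here to 86400 seconds, i.e. 24 hours):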
 +
 +  voms-proxy-init --voms icarus-exp.org --valid 72:00
 +  Enter GRID pass phrase for this identity:
 +  Contacting vomsigi-na.unina.it:15000 [/DC=org/DC=terena/DC=tcs/C=IT/ST=Napoli/O=Universita degli Studi di Napoli FEDERICO II/CN=vomsigi-na.unina.it] "icarus-exp.org"...
 +  Remote VOMS server contacted succesfully.
 +  
 +  vomsigi-na.unina.it:15000: The validity of this VOMS AC in your proxy is shortened to 86400 seconds!
 +  
 +  Created proxy in /tmp/x509up_u####.
 +  
 +  Your proxy is valid until Fri Oct 17 19:07:56 CEST 2025
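
To avoid submitting with a proxy that is about to expire, the remaining lifetime can be checked first. The sketch below assumes the installed VOMS clients provide ''voms-proxy-info'' with the ''--actimeleft'' option (remaining seconds of the VOMS attributes); the function name and the one-hour default margin are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the proxy should be renewed,
# i.e. when the remaining lifetime in seconds is below a minimum.
# Usage: proxy_needs_renewal <seconds-left> [min-seconds]
proxy_needs_renewal() {
    left="$1"
    min="${2:-3600}"   # arbitrary default margin: one hour
    case "$left" in
        ''|*[!0-9]*) return 0 ;;   # unknown or invalid value: renew to be safe
    esac
    [ "$left" -lt "$min" ]
}

# Possible use on a machine with the VOMS clients installed (assumption:
# the --actimeleft option is available in the installed version):
#   left="$(voms-proxy-info --actimeleft 2>/dev/null)"
#   if proxy_needs_renewal "$left"; then
#       voms-proxy-init --voms icarus-exp.org --valid 72:00
#   fi
```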
 +
 +===== Submit the batch of jobs =====
 +**Heads-up**: have you created a proxy with the voms extension? If not:
 +
 +  voms-proxy-init --voms icarus-exp.org --valid 72:00
 + 
 +Then, go to the ''prod-scripts'' folder inside the working area created in the initial setup section (i.e. ''cd /storage/gpfs_data/icarus/local/users/$USER/productions/<selected-tag>/prod-scripts''). Example:
 +
 +  cd /storage/gpfs_data/icarus/local/users/vpia/productions/run2-v09_72_00_06-122023-variables/prod-scripts
 +
 +After updating the file //variables.sh// with the new batch, run the command:
 +
 +  module switch htc
 +
 +and submit the production with:
 +
 +  ./submit_production.sh
 +
 +The script will automatically submit all the needed jobs. This could take a few minutes during which the shell will look unresponsive (it is not).
 +
 +Once the shell is responsive again, you can check whether the jobs were correctly submitted either by checking the same grafana page shown previously (it updates with a **4-5 minute delay**, so you won't see the new jobs immediately) or by running the **condor_q** command:
 +
 +  $ condor_q
 +  
 +  Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 10/27/25 16:17:10
 +  OWNER    BATCH_NAME           SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
 +  valerpia ID: 11228736       10/23 05:30    238      _      _    356 11228736.0-352
 +  valerpia ID: 11228746       10/23 05:31    157      _      _    264 11228746.4-253
 +  valerpia 11854_BNBMAJORITY  10/24 21:00      _      _      _     12 11391004.0-11
 +  valerpia 11854_BNBMINBIAS   10/24 21:00      _      _      _     12 11391005.0-11
 +  valerpia 11873_BNBMAJORITY  10/26 05:03      2      _      _    767 11441545.0-766
 +  valerpia 11873_BNBMINBIAS   10/26 05:05      1    127      _    559 11441546.0-558
 +  
 +  Total for query: 1572 jobs; 1445 completed, 0 removed, 0 idle, 127 running, 0 held, 0 suspended 
 +  Total for valerpia: 1572 jobs; 1445 completed, 0 removed, 0 idle, 127 running, 0 held, 0 suspended 
 +  Total for all users: 89517 jobs; 60905 completed, 0 removed, 18666 idle, 9666 running, 280 held, 0 suspended
 +
 +The new jobs should be at the bottom of the list, counted in either the RUN or the IDLE column.
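
The idle count can also be extracted from the ''condor_q'' summary automatically. The sketch below parses the ''Total for query'' line in the format shown in the sample output above; ''idle_from_summary'' is a hypothetical name. Keep in mind that this line counts only the jobs returned by your query, while the 300-job threshold refers to the //Idle// number on the grafana page.

```shell
#!/bin/sh
# Hypothetical helper: read condor_q output on stdin and print the idle count
# from the summary line, e.g.
#   "Total for query: 1572 jobs; 1445 completed, 0 removed, 0 idle, 127 running, ..."
idle_from_summary() {
    sed -n 's/^ *Total for query:.* \([0-9][0-9]*\) idle,.*/\1/p'
}

# Possible use on the submit node:
#   condor_q | idle_from_summary
```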
 +
 +If the run list was provided with a Google Sheet document, update the **Submitted column** to **yes**.
 +
 +====== Check the completion of a batch ======
 +Once a batch of runs has been processed, the shifter should check how many files were correctly completed.
 +
 +This is done by running:
 +  ./get_info.sh [RUN_NUMBERS]
 +
 +where [RUN_NUMBERS] is the list of runs to be checked. Example:
 +
 +  ./get_info.sh 11806 11812 11817 11818
 +  
 +The script creates a ''logs'' folder in the ''prod-scripts'' folder, with these files inside:
 +  ./logs/all_raw_files.log         # The list of all submitted files
 +  ./logs/duplicated_folders.log    # The list of folders with multiple output files
 +  ./logs/missing_files.log         # The list of folders without output files
 +  ./logs/missing_folders.log       # The list of missing folders
 +  ./logs/ok_files.log              # The list of folders of the correctly processed files
 +  ./logs/resubmit_list.log         # The list of runs with failed files
 +  ./logs/run_summary.log           # The list of runs and the number of completed/failed files
 +  
 +A printed message tells the shifter whether there are missing or duplicated files, or whether everything is as expected (same number of //raw// and //ok// files).
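
The same comparison can be repeated by hand on the log files. This is a sketch under one assumption that is not confirmed by this page: that ''all_raw_files.log'' and ''ok_files.log'' contain exactly one entry per line. The function name ''compare_logs'' is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare the number of entries in two log files.
# Assumption: each file holds one entry per line.
# Usage: compare_logs <raw-list> <ok-list>
compare_logs() {
    raw_file="$1"
    ok_file="$2"
    if [ ! -r "$raw_file" ] || [ ! -r "$ok_file" ]; then
        echo "missing log file" >&2
        return 2
    fi
    # Arithmetic expansion strips the padding some wc implementations add
    n_raw=$(($(wc -l < "$raw_file")))
    n_ok=$(($(wc -l < "$ok_file")))
    echo "raw: $n_raw, ok: $n_ok"
    # Succeed only when every submitted file has a correctly processed output
    [ "$n_raw" -eq "$n_ok" ]
}
```

Usage from the ''prod-scripts'' folder: ''compare_logs logs/all_raw_files.log logs/ok_files.log''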
 +
 +If the run list was provided with a Google Sheet document, check the **run_summary.log** file. For each run, copy the first number into the corresponding **Completed CAFs** column in the sheet, and the second number into the **Failed Files** column.
 +
 +{{ :progetti:icarus:completed_files.png?600 |}}
 +
 +In any case, if some files were not correctly processed, go to the resubmit step.
 +
 +====== Check the completion of the campaign ======
 +Once the submission of the production is complete, the shifter should check the completion of the campaign.
 +
 +For **standard campaigns**, this is done by running the same script used in the **Check the completion of a batch** step, without any argument:
 +
 +  ./get_info.sh
 +  
 +A printed message tells the shifter whether there are missing or duplicated files, or whether everything is as expected (same number of //raw// and //ok// files).
 +
 +In case some files were not correctly processed, go to the resubmit step.
 +
 +For **non-standard campaigns**, specific instructions will be provided and added to this page each time.
 +  
 +====== Resubmit Raw Data Production ======
 +When processing Raw Data, if some jobs didn't end successfully, it is possible to resubmit them with the **resubmit.sh** script. To do so, look at the **logs/resubmit_list.log** file and run the following command for each line in the file:
 +
 +  ./resubmit.sh <RUN-NUMBER> <STREAMS> <MEMORY>
 +
 +where
 +  * RUN-NUMBER is the number of the run to resubmit (the first argument on each line of the file)
 +
 +  * STREAMS is the list of streams to resubmit (the second argument on each line of the file, quotes included)
 +
 +  * MEMORY is an optional argument specifying a new memory requirement, in GB, for the resubmitted jobs (max 15).
 +
 +Examples:
 +
 +  ./resubmit.sh 9888 "BNBMAJORITY BNBMINBIAS"
 +  ./resubmit.sh 9435 "BNBMAJORITY" 15
 +  
 +The script checks the logs of the run/stream in the OUT_FOLDER/RUN_NUMBER/STREAM directory, removes the jobs related to the run/stream from the queue, and determines which files need to be reprocessed. It then resubmits the jobs with the same configuration used the first time, except for the memory requirement if a new one is specified.
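
Running the command for each line of the list can be prepared with a short loop. The sketch below is a dry run that only prints the commands, so they can be reviewed before being executed. It assumes each line of ''logs/resubmit_list.log'' looks like ''9888 "BNBMAJORITY BNBMINBIAS"'', as described above; the function name ''resubmit_commands'' is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one resubmit command per line of the list.
# Assumed line format:  <run-number> "<stream> [<stream> ...]"
# Usage: resubmit_commands <list-file> [memory-in-GB]
resubmit_commands() {
    list="$1"
    memory="${2:-}"
    case "$memory" in
        *[!0-9]*) echo "memory must be an integer number of GB" >&2; return 1 ;;
    esac
    if [ -n "$memory" ] && [ "$memory" -gt 15 ]; then
        echo "memory above the 15 GB maximum" >&2
        return 1
    fi
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue          # skip empty lines
        # The line already carries the quoted stream list, so it is reused as is
        echo "./resubmit.sh $line${memory:+ $memory}"
    done < "$list"
}
```

Usage: ''resubmit_commands logs/resubmit_list.log 15'' prints the commands; after checking them, they can be run one by one.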
 +
 +====== FAQ ======
 +TO DO
  
