====== Using The HPC Cluster ======

Usage instructions have moved here: https://

OLD STUFF:

===== Requesting Access =====

To access the cluster you should first obtain an account at CNAF, following the procedure you can find at [[https://]].
In the application form specify in the "
Please specify "

===== Log in to the User Interface =====
Once your CNAF account has been provided, you can log in to the bastion host. \\
This is not your user interface!\\ \\
To access the cluster, from the bastion log into:\\
<code>
ui-hpc.cr.cnaf.infn.it
</code>
using the same bastion credentials.

===== Getting Support =====

Information and support can be requested from:
<code>
hpc-support <
</code>
===== Home Directory and Extra Disk Space =====
Your home directory:
<code>
/
</code>
on the user interface is shared among all the cluster nodes.\\
\\
No quotas are currently enforced on the home directories, and only about 4TB are available in the /home partition.\\
\\
In case you need more disk space for data and checkpointing, every user can access the following directory:
<code>
/
</code>
which is on a shared GPFS storage area. \\

Please do not leave huge unused files in either the home directories or the GPFS storage areas. Quotas will be enforced in the near future.

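Since no quota stops you, cleanup is a manual task; a quick way to see what you are using is sketched below with standard POSIX tools (nothing here is cluster-specific, the paths are just your own home directory):

```shell
# Total size of your home directory, human-readable.
du -sh "$HOME"

# The five largest entries directly under it (sizes in MB),
# to help spot big unused files worth removing.
du -sm "$HOME"/* 2>/dev/null | sort -rn | head -5
```

The same commands work on the GPFS area: just replace ''$HOME'' with your directory there.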
===== The LSF Batch System =====
The cluster is managed and accessed via the LSF (version 9.1.2) batch system.\\
A detailed LSF user guide can be found [[http://]].\\
What follows is a minimal how-to describing the basic operations needed to properly access the CNAF HPC cluster for various job types.\\

==== Querying the Cluster Status with LSF ====
To obtain an overview of the nodes' status:\\
<code>
bhosts -w
</code>

To obtain the queues' status:\\
<code>
bqueues
</code>

Add the option **"-l"** to get more detailed information.

Currently four queues have been defined:

  * **hpc_inf**: max CPU time is 128 core-days (24 hours using 128 cores, or 48 hours using 64 cores) and max wall-clock time is 79 hours.
  * **hpc_short**: max CPU time is 32 core-days (24 hours using 32 cores, or 48 hours using 16 cores) and max wall-clock time is 79 hours.
  * **hpc_gpu**: reserved for GPU jobs (see the GPU section below).
  * **hpc_int**: reserved for short interactive jobs (see the interactive section below).
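The core-day figures above are simply cores multiplied by wall-clock hours, divided by 24; for instance, both allocations quoted for hpc_inf come to the same budget:

```shell
# core-days = cores * hours / 24
echo $(( 128 * 24 / 24 ))   # 24 hours on 128 cores
echo $(( 64 * 48 / 24 ))    # 48 hours on 64 cores
```

Both lines print 128, i.e. the hpc_inf CPU-time limit.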

To obtain the nodes' load information:
<code>
lsload
</code>
Much more detail with: \\
<code>
lsload -l
</code>

To restrict the lsload query to numerical fields (e.g. io and r15s) use the **"-I"** option:\\
<code>
lsload -I io:r15s
</code>

To restrict the query to string fields use the **"-s"** option:\\
<code>
lsload -s gpu_model0
</code>

==== Submitting Single Batch Jobs ====

Single batch jobs can be submitted via the **bsub** command. \\
Use the **"-o"** and **"-e"** options to redirect standard output and standard error to a file. \\
Option **"-m"** can be used to run the job on specific hosts. \\
<code>
bsub -o test.out -e test.err /
bsub -o test.out -e test.err -m '
</code>

=== Standard Output and Standard Error ===

As previously stated, standard output and standard error can be redirected to files with the "-o" and "-e" options of bsub.\\
The files generated in this way become available at the end of the job; they are owned by root, but can be read and removed by the user. They cannot be edited directly: should you need to edit them, first make a copy with "cp".\\

To have files that update in real time, redirect the standard output and error with ">" inside a single-quoted command string passed to bsub.\\
The single quotes are important: without them, the output of the bsub command itself will be redirected.\\
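The quoting rule can be seen without LSF at all; in this sketch, a hypothetical shell function ''fake_bsub'' stands in for the real bsub command:

```shell
# Stand-in for bsub: it just shows the command string it receives.
fake_bsub() { echo "received: $*"; }

# Quoted: the '>' travels inside the command string, so the job
# itself would perform the redirection on the execution host.
fake_bsub 'myexe > test.out'

# Unquoted: the local shell consumes '>' and redirects the output
# of fake_bsub itself, which is usually not what you want.
fake_bsub myexe > /tmp/redirect_demo.txt
cat /tmp/redirect_demo.txt
```

The first call prints ''received: myexe > test.out'' (the whole string reaches the function); the second writes ''received: myexe'' into the local file instead.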


=== Check the Job Status ===

Job status can be queried with the **bjobs** command. \\
Use the **"-W"** option to get more detailed output. \\
Use the **"-l"** option to get the full information on a job. \\
Use the job number to get information on a single job. \\
Use option **"-a"** to also display recently finished jobs, and **"-u"** to query the jobs of a specific user. \\
E.g.: \\
<code>
bjobs -W
bjobs -l <job_id>
bjobs -a -u <username>
</code>

=== Killing Submitted Jobs ===
To kill submitted jobs, use the **bkill** command. E.g.: \\
<code>
bkill <job_id>
</code>

==== Submitting MPI Jobs via mpirun.lsf (obsolete) ====

Currently only **OpenMPI** jobs have been tested on the HPC cluster. \\
To submit an OpenMPI job, follow these steps:\\

  * Set up the right environment in the **.bashrc** in your User Interface home directory, as shown below:
<code>
[cesinihpc@ui-hpc ~]$ cat .bashrc

# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
export PATH=/
export LD_LIBRARY_PATH=/
</code>

  * Make sure the **.bash_profile** in your home directory exists. If not, create it:\\
<code>
[cesinihpc@ui-hpc ~]$ cat .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
</code>

  * Place your executables in a location accessible by your user: your home directory or the GPFS shared area described above.
  * Create the following wrapper script for your executable:\\
<code>
[cesinihpc@ui-hpc ~]$ cat cpmpi_test.sh
#!/bin/sh

# initial environment setup can be done here if needed
#export <
echo "

/
</code>

**PLEASE NOTE:** **mpirun.lsf** has to be used instead of the standard mpirun! \\ \\
**PLEASE NOTE:** **PSM_SHAREDCONTEXTS_MAX=8** has to be set if you are not using whole nodes, i.e. if your number of MPI processes is not a multiple of 32 (there are 32 processors per node; see "ptile" in LSF). If you are using whole nodes you can skip this, and your job will use the maximum number of shared contexts available on a node (which is 16). If you are not using whole nodes and do not set the PSM_SHAREDCONTEXTS_MAX variable to a number lower than 16, the next job landing on the same node will probably fail.\\ \\
**PLEASE NOTE:** do not set the number of nodes in the mpirun.lsf command; it is specified in the bsub command and handled by LSF. \\ \\

  * **Launch the bsub command in this way:**\\
<code>
bsub -q <
</code>

The option **-R "span[ptile=...]"** controls how many processes are placed on each node (the "ptile" mentioned above). \\

If you want to select specific nodes you can use the option **"-m"** followed by the host names. \\

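To see how the total process count and ptile interact (the numbers below are hypothetical examples, not cluster defaults): LSF must reserve enough nodes to hold all requested slots, ptile per node.

```shell
# nodes needed = ceil(n / ptile); integer arithmetic rounds up
ptile=32
n=64; echo $(( (n + ptile - 1) / ptile ))   # 2 whole nodes
n=40; echo $(( (n + ptile - 1) / ptile ))   # still 2, the second only partly filled
```

The partly filled case is exactly the one where the PSM_SHAREDCONTEXTS_MAX note above applies.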

==== An MPI Submission Script ====
To hide the complexity of the submission command syntax you can use this {{:}} script. \\
Just customize the first lines according to your needs. \\
(Thanks to S. Sinigardi for sharing it)\\

==== Alternative MPI Multinode Submission ====
It is possible to avoid using mpirun.lsf and build the mpirun machine file dynamically, in the following way:

1) Automatically create the machine file to be used by mpirun:
<code>
echo $LSB_HOSTS | awk '
</code>

2) Use this command to launch mpirun:
<code>
mpirun --machinefile /
</code>

A possible bsub submission is:
<code>
bsub -q hpc_inf_SL7
</code>

where in the run_this_example.sh script you launch the previous commands:
<code>
----run_this_example.sh----
#!/bin/bash

echo $LSB_HOSTS | awk '

mpirun --machinefile /
</code>
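The idea behind the two commands can be sketched end-to-end without LSF. Everything below is an assumption for illustration: LSB_HOSTS is faked with a literal value (LSF normally exports it with one hostname per allocated slot), ''tr'' stands in for the awk one-liner whose body was lost in this copy, and the machine file path and mpirun flags are placeholders:

```shell
#!/bin/bash
# Fake the variable LSF would export (one hostname per slot).
LSB_HOSTS="node01 node01 node02 node02"

# A machine file lists one hostname per line, so split on spaces.
machinefile=/tmp/machinefile.$$
echo "$LSB_HOSTS" | tr ' ' '\n' > "$machinefile"
cat "$machinefile"

# mpirun would then be launched against it, e.g.:
# mpirun --machinefile "$machinefile" -np 4 ./my_mpi_exe
rm -f "$machinefile"
```

With the value above, the script prints node01 twice and node02 twice, i.e. four slots spread over two nodes.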
==== Submitting GPU Jobs ====

  * Prepare a job wrapper like the following, setting the needed environment:
<code>
[cesinihpc@ui-hpc ~]$ cat test_2gpu_lsf.sh
#!/bin/sh
export BASE=/
export PATH=$BASE/
export C_INCLUDE_PATH=$BASE/
export CPLUS_INCLUDE_PATH=$BASE/
export LD_LIBRARY_PATH=$BASE/

#env
#echo "
#now your GPU executable
/
# if it is a GPU and OpenMPI job:
# /
# remember to add option "-a openmpi"
#############
</code>
  * Submit the job wrapper, selecting a GPU-enabled node via the "-R" option of bsub:
<code>
bsub -q hpc_inf -R "
</code>
The **-R** option shown in the example selects a node with **two Tesla K20 GPUs**. Customise it according to your requirements. \\ \\

If your job does not use many CPU cores while the site is full of CPU-only jobs, you can use the **hpc_gpu queue** to reach the GPU nodes anyway. \\
The hpc_gpu queue can use **only 2 cores**, and only on the nodes where the GPUs are installed. \\
The hostgroups gpuk20 and gpuk40 have been defined to simplify the submission command. \\
E.g.: \\

<code>
bsub -q hpc_gpu
</code>


**PLEASE NOTE:** LSF will subtract the number of GPUs specified in the "-R" resource string from the number available on the node. \\

If your job is also an **OpenMPI job**, add the options **-a openmpi** and **-n <num_processes>** to the bsub command. \\
==== Submitting Interactive Jobs ====

To allow interactive access to the nodes for debugging and testing, an interactive queue has been set up. Access it with:\\
<code>
bsub -q hpc_int -Is /bin/bash
</code>
**PLEASE NOTE:** After about two hours you will be logged out! **Do not** use the interactive shell to run real, long-lived jobs. \\

strutture/cnaf/clusterhpc/using_the_cnaf_hpc_cluster.txt · Last modified: 2022/07/06 09:22 by dcesini@infn.it