====== How to restart elastic cluster ======

This guide explains how to restart an elastic cluster based on the HTCondor and elastiq services.
===== Check the repository virtual machine =====

The procedure to restart an elastic cluster after a shut off consists of the following steps.
The cloud administrator first has to check whether the repository VM **yum-repo-pubb** is up and running.
If not, log in on the VM
<code bash>
ssh root@90.147.77.142
</code>
and restart it, checking that the **/dev/vdb** disk is correctly mounted:
<code bash>
df
</code>
The root password is the usual one.
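
As a quick aid, the disk check can be scripted. The following is only a sketch: it assumes that **/dev/vdb** has an entry in /etc/fstab on **yum-repo-pubb**, which is not stated in this guide.
<code bash>
# Sketch: verify that the repository data disk is mounted and, if not,
# try to mount it using its /etc/fstab entry (assumed to exist).
if grep -q '^/dev/vdb' /proc/mounts; then
    echo "/dev/vdb is already mounted"
else
    echo "/dev/vdb is not mounted, mounting it now"
    mount /dev/vdb
fi
df -h
</code>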

===== Check the cluster status =====

In order to restart the elastic cluster correctly after a shut off, follow these steps:
  * if possible, delete all the slave nodes via the dashboard;
  * switch on the master node;
  * check whether both the condor and the elastiq services are already running (i.e. in the CentOS release they are enabled at boot):
<code bash>
service condor status
service elastiq status
</code>
In this case new slaves will be created and will join the cluster in a few minutes;
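
On a CentOS 7 master (like the test-centos7-elastiq machine shown later in this page) the same check can also be done with systemd. This is just the systemctl equivalent of the commands above, assuming the unit names match the service names condor and elastiq:
<code bash>
# systemd equivalent of the service status checks (CentOS 7)
systemctl status condor elastiq
# verify that both services start automatically at boot
systemctl is-enabled condor elastiq
</code>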
  * if only the condor service is running but elastiq isn't, restart elastiq with
<code bash>
service elastiq start
</code>
or
<code bash>
elastiqctl restart
</code>
and wait for the creation of the new slaves, which will join the cluster in a few minutes;
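
To follow the arrival of the new slaves you can, for example, poll the condor pool periodically (the 60-second interval below is an arbitrary choice):
<code bash>
# Refresh condor_status every 60 seconds until the new slaves appear
# (press Ctrl+C to exit)
watch -n 60 condor_status
</code>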

  * if condor isn't running and some elastiq processes are up and running, kill them with
<code bash>
ps -ef | grep elastiq
kill -9 <pid>
</code>
and start the condor service with
<code bash>
service condor start
</code>
condor_q should then return an empty queue
<code bash>
condor_q

-- Schedd:
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
</code>
and condor_status should be empty (no nodes running)
<code bash>
condor_status
</code>
Then start the elastiq service
<code bash>
service elastiq start
</code>
In a few minutes the minimum number of nodes should join the condor cluster and condor_status should show them, e.g.
<code bash>
condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@10-64-22-215 LINUX      X86_64 Unclaimed Idle      0.000 1896  0+00:24:46
slot2@10-64-22-215 LINUX      X86_64 Unclaimed Idle      0.000 1896  0+00:25:05
slot1@10-64-22-217 LINUX      X86_64 Unclaimed Idle      0.000 1896  0+00:24:44
slot2@10-64-22-217 LINUX      X86_64 Unclaimed Idle      0.000 1896  0+00:25:05
slot1@10-64-22-89. LINUX      X86_64 Unclaimed Benchmar
slot2@10-64-22-89. LINUX      X86_64 Unclaimed Idle      0.040 1896  0+00:00:05
</code>
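
The whole sequence for the master node can be summarised in a short script. This is only a sketch that follows the steps above and uses the same service names:
<code bash>
#!/bin/bash
# Sketch of the restart sequence on the condor master node.

# Stop any leftover elastiq processes
pkill -9 -f elastiq

# Start condor and check that the queue and the pool are empty
service condor start
condor_q
condor_status

# Finally start elastiq, which will bring up the minimum number of slaves
service elastiq start
</code>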

===== Check the log files =====

The log file of the elastiq service records the activity of the service and is the first place to look.
When you start the elastiq service, the first part of the log file reports the check of the cloud user's credentials and of the other parameters configured in the elastiq.conf file (e.g. the user data file for the slave nodes):

<code bash>
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.conf] Configuration:
INFO [__init__.main] Loaded batch plugin "htcondor"
DEBUG [htcondor.init] HTCondor plugin initialized
DEBUG [__init__.main] EC2 image "
</code>

If your credentials are wrong you get an error like

<code bash>
ERROR [__init__.ec2_running_instances] Can't get list of EC2 instances (maybe wrong credentials?)
</code>

whereas if you insert a wrong ami_id for the image of the slave nodes you get

<code bash>
ERROR [__init__.main] Cannot find EC2 image "
</code>
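
In that case it is worth checking that the image configured for the slave nodes still exists in the cloud project. For example, with the OpenStack command line client (assuming it is installed and the project credentials are sourced) you can list the available images and verify that the one referenced by the ami_id is among them:
<code bash>
# List the images visible to the project; the EC2 ami_id used by elastiq
# must correspond to one of these images.
openstack image list
</code>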

Elastiq periodically checks all the VMs. If a VM has been correctly added to the condor cluster, it logs

<code bash>
DEBUG [__init__.ec2_running_instances] Found IP 10.64.22.188 corresponding to instance
</code>

otherwise

<code bash>
WARNING [__init__.ec2_running_instances] Cannot find instance 10.64.22.216 in the list of known IPs
WARNING [__init__.ec2_running_instances] Cannot find instance 10.64.22.182 in the list of known IPs
WARNING [__init__.ec2_running_instances] Cannot find instance 10.64.22.236 in the list of known IPs
</code>

When elastiq instantiates a new VM, it logs

<code bash>
WARNING [__init__.ec2_scale_up] Quota enabled: requesting 1 (out of desired 1) VMs
INFO [__init__.ec2_scale_up] VM launched OK. Requested: 1/1 | Success: 1 | Failed: 0 | ID: i-f026f340
DEBUG [__init__.save_owned_instances] Saved list of owned instances: i-f026f340
</code>

and when it deletes an idle VM, it logs
<code bash>
INFO [__init__.check_vms] Host 10-64-22-190.INFN-PD is idle for more than 2400s: requesting shutdown
INFO [__init__.ec2_scale_down] Requesting shutdown of 1 VMs...
</code>
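
A quick way to spot problems is to filter the elastiq log for warnings and errors. The path used below is an assumption (a typical default location for the elastiq log file): adjust it to the actual path of your installation.
<code bash>
# Show the latest warnings and errors from the elastiq log.
# NOTE: the log path is an assumed default, not taken from this guide.
grep -E 'WARNING|ERROR' /var/log/elastiq/elastiq.log | tail -n 20
</code>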

On the condor master node the log files of the HTCondor daemons are located in **/var/log/condor**:

<code bash>
# ls -l /var/log/condor
total 76
-rw-r--r--. 1 condor condor 24371 Jan 18 08:42 CollectorLog
-rw-r--r--. 1 root
-rw-r--r--. 1 condor condor
-rw-r--r--. 1 condor condor
-rw-r--r--. 1 condor condor 19126 Jan 18 08:42 NegotiatorLog
-rw-r--r--. 1 root
-rw-r--r--. 1 condor condor
-rw-r--r--. 1 condor condor
</code>
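
For example, to watch the slave nodes registering with the pool you can follow the collector log (file names as in the listing above):
<code bash>
# Follow the collector log in real time to see the slaves joining the pool
tail -f /var/log/condor/CollectorLog
</code>
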
===== Check the running processes =====

Generally the running processes are:
  * for the condor service:
<code bash>
[root@test-centos7-elastiq centos]# ps -ef | grep condor
condor
root
condor
condor
condor
</code>
  * for the elastiq service:
<code bash>
[root@test-centos7-elastiq centos]# ps -ef | grep elastiq
elastiq
elastiq
elastiq
</code>
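
A more compact check of both services can be done with pgrep. The daemon name condor_master is the standard parent process of the HTCondor daemons; for elastiq the match is done on the full command line, since it runs as a Python daemon:
<code bash>
# Quick check that the expected daemons are alive
pgrep -l condor_master   # prints the PID and name of the condor master process
pgrep -fl elastiq        # prints the PIDs and command lines of the elastiq processes
</code>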
**NB:** The condor_status information isn't updated as frequently as the VM status check done by elastiq. For a few minutes condor_status may therefore still show nodes that elastiq has already removed from the cloud.