How to restart an elastic cluster
This guide explains how to restart an elastic cluster based on the HTCondor and elastiq services.
Check the repository virtual machine
To restart an elastic cluster after a shutdown, follow the steps below. First, the cloud administrator has to check that the VM yum-repo-pubb (in the multi-tenant services project), which serves as the condor and elastiq package repository, is up and running at http://90.147.77.142/repo/. If it is not, log in to the VM
ssh root@90.147.77.142
and restart it, checking that the /dev/vdb disk is mounted:
df /dev/vdb
Filesystem 1K-blocks   Used Available Use% Mounted on
/dev/vdb    15350728 276520  14271392   2% /var/www/html/repo
The root password is the usual one.
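If this check is needed often, it can be scripted. The following is a minimal sketch, assuming the repository URL and mount point given above; adapt it to your setup:

#!/bin/bash
# Check that the repository VM answers over HTTP (-s silent, -f fail on HTTP errors, -I headers only)
if ! curl -sfI http://90.147.77.142/repo/ > /dev/null; then
    echo "repository not reachable at http://90.147.77.142/repo/"
fi
# Check that /dev/vdb is mounted where the repository content lives
if ! df /var/www/html/repo | grep -q '^/dev/vdb'; then
    echo "/dev/vdb is not mounted on /var/www/html/repo"
    # mount /dev/vdb /var/www/html/repo   # uncomment to remount by hand
fi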
Check the cluster status
To correctly restart the elastic cluster after a shutdown, follow these steps (the decision logic is also sketched in a short script after the list):
- if possible, delete all the slave nodes via dashboard;
- switch on the master node;
- check whether both the condor and elastiq services are already running (i.e. in the CentOS release they are enabled):
service condor status
service elastiq status
If both are already running, new slaves will be created and will join the cluster in a few minutes;
- if only the condor service is running and elastiq isn't, restart elastiq with
service elastiq start
or
elastiqctl restart
and wait for the creation of new slaves, which will join the cluster in a few minutes;
- if condor isn't running and some elastiq processes are still up and running, kill them with
ps -ef | grep elastiq
kill -9 <n_proc>
and start the condor service with
service condor start
The condor_q command should then return
condor_q

-- Schedd: : <10.64.xx.yyy:zzzzz>
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
and condor_status should return an empty list (no nodes running)
condor_status
then start the elastiq service
service elastiq start
in a few minutes the minimum number of nodes should join the condor cluster, and condor_status should show them, e.g.
condor_status

Name               OpSys  Arch   State     Activity LoadAv Mem  ActvtyTime

slot1@10-64-22-215 LINUX  X86_64 Unclaimed Idle      0.000 1896  0+00:24:46
slot2@10-64-22-215 LINUX  X86_64 Unclaimed Idle      0.000 1896  0+00:25:05
slot1@10-64-22-217 LINUX  X86_64 Unclaimed Idle      0.000 1896  0+00:24:44
slot2@10-64-22-217 LINUX  X86_64 Unclaimed Idle      0.000 1896  0+00:25:05
slot1@10-64-22-89. LINUX  X86_64 Unclaimed Benchmar  1.000 1896  0+00:00:04
slot2@10-64-22-89. LINUX  X86_64 Unclaimed Idle      0.040 1896  0+00:00:05

                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        6     0       0         6       0          0
               Total        6     0       0         6       0          0
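The decision flow above can be condensed in a short script. This is only a sketch, assuming a SysV-init master node where the service init scripts return a non-zero exit code when the service is down; it prints the suggested action instead of performing it:

#!/bin/bash
# Sketch of the restart decision logic described in the steps above
if service condor status > /dev/null 2>&1; then
    if service elastiq status > /dev/null 2>&1; then
        echo "condor and elastiq are running: new slaves should join in a few minutes"
    else
        echo "only condor is running: run 'service elastiq start' (or 'elastiqctl restart')"
    fi
else
    # condor is down: any leftover elastiq process must be killed first
    if pgrep -f elastiq > /dev/null; then
        echo "kill the leftover elastiq processes (ps -ef | grep elastiq, then kill -9)"
    fi
    echo "then run 'service condor start', verify condor_q and condor_status, and 'service elastiq start'"
fi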
Check the log files
The elastiq log file, located in /var/log/elastiq/elastiq.log, is quite difficult to read if you don't know its structure. When you start the elastiq service, the first part of the log file reports the check of the cloud user's credentials and of the other parameters configured in the elastiq.conf file (e.g. the userdata file for the slave nodes):
INFO [__init__.conf] Configuration: ec2.image_id = ami-9f3da3fc (from file)
INFO [__init__.conf] Configuration: ec2.flavour = cldareapd.medium (from file)
INFO [__init__.conf] Configuration: ec2.api_url = https://cloud-areapd.pd.infn.it:8788/services/Cloud (from file)
INFO [__init__.conf] Configuration: ec2.aws_secret_access_key = <...> (from file)
INFO [__init__.conf] Configuration: ec2.key_name = my_key (from file)
INFO [__init__.conf] Configuration: ec2.user_data_b64 = <...> (from file)
INFO [__init__.conf] Configuration: ec2.aws_access_key_id = <...> (from file)
INFO [__init__.conf] Configuration: quota.max_vms = 3.0 (from file)
INFO [__init__.conf] Configuration: quota.min_vms = 1.0 (from file)
INFO [__init__.main] Loaded batch plugin "htcondor"
DEBUG [htcondor.init] HTCondor plugin initialized
DEBUG [__init__.main] EC2 image "ami-9f3da3fc" found
If your credentials are wrong, you get an error such as
ERROR [__init__.ec2_running_instances] Can't get list of EC2 instances (maybe wrong credentials?)
If instead you insert a wrong ami_id for the slave node image, you get
ERROR [__init__.main] Cannot find EC2 image "ami-00000000"
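To re-read the parameters that elastiq actually loaded at startup without scrolling the whole file, a simple grep is enough (the "Configuration:" tag comes from the log lines shown above):

grep 'Configuration:' /var/log/elastiq/elastiq.log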
Elastiq periodically checks all the VMs. If a VM is correctly added to the condor cluster, it logs
DEBUG [__init__.ec2_running_instances] Found IP 10.64.22.188 corresponding to instance
otherwise
WARNING [__init__.ec2_running_instances] Cannot find instance 10.64.22.216 in the list of known IPs
WARNING [__init__.ec2_running_instances] Cannot find instance 10.64.22.182 in the list of known IPs
WARNING [__init__.ec2_running_instances] Cannot find instance 10.64.22.236 in the list of known IPs
When elastiq instantiates a new VM it logs
WARNING [__init__.ec2_scale_up] Quota enabled: requesting 1 (out of desired 1) VMs
INFO [__init__.ec2_scale_up] VM launched OK. Requested: 1/1 | Success: 1 | Failed: 0 | ID: i-f026f340
DEBUG [__init__.save_owned_instances] Saved list of owned instances: i-f026f340
and when elastiq deletes an idle VM it logs
INFO [__init__.check_vms] Host 10-64-22-190.INFN-PD is idle for more than 2400s: requesting shutdown
INFO [__init__.ec2_scale_down] Requesting shutdown of 1 VMs...
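Since these messages are interleaved with many DEBUG lines, a couple of one-liners help to scan the log; the patterns below match the function names shown in the excerpts above:

# show the most recent errors and warnings
grep -E 'ERROR|WARNING' /var/log/elastiq/elastiq.log | tail -n 20
# follow scale-up/scale-down activity live
tail -f /var/log/elastiq/elastiq.log | grep --line-buffered -E 'ec2_scale_up|ec2_scale_down|check_vms'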
On the condor master node, logs are located in the /var/log/condor/ directory and are easy to read and understand:
# ls -l /var/log/condor/
total 76
-rw-r--r--. 1 condor condor 24371 Jan 18 08:42 CollectorLog
-rw-r--r--. 1 root   root     652 Jan 18 08:35 KernelTuning.log
-rw-r--r--. 1 condor condor  2262 Jan 18 08:35 MasterLog
-rw-r--r--. 1 condor condor     0 Jan 18 08:35 MatchLog
-rw-r--r--. 1 condor condor 19126 Jan 18 08:42 NegotiatorLog
-rw-r--r--. 1 root   root   13869 Jan 18 08:42 ProcLog
-rw-r--r--. 1 condor condor   474 Jan 18 08:35 ScheddRestartReport
-rw-r--r--. 1 condor condor  2975 Jan 18 08:40 SchedLog
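For example, to watch new slaves being advertised to the pool while the cluster comes back, you can follow the collector log (file names as listed above):

tail -f /var/log/condor/CollectorLog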
Check the running processes
Generally, the running processes are:
- for the condor service:
[root@test-centos7-elastiq centos]# ps -ef | grep condor
condor     764     1  0 14:09 ?        00:00:00 /usr/sbin/condor_master -f
root       960   764  0 14:09 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 996
condor     961   764  0 14:09 ?        00:00:00 condor_collector -f
condor     974   764  0 14:09 ?        00:00:00 condor_negotiator -f
condor     975   764  0 14:09 ?        00:00:00 condor_schedd -f
- for the elastiq service:
[root@test-centos7-elastiq centos]# ps -ef | grep elastiq
elastiq    899     1  0 14:09 ?        00:00:00 SCREEN -dmS __|elastiq|__ /bin/sh -c python /usr/bin/elastiq-real.py --logdir=/var/log/elastiq --config=/etc/elastiq.conf --statefile=/var/lib/elastiq/state 2> /var/log/elastiq/elastiq.err
elastiq    952   899  0 14:09 pts/0    00:00:00 /bin/sh -c python /usr/bin/elastiq-real.py --logdir=/var/log/elastiq --config=/etc/elastiq.conf --statefile=/var/lib/elastiq/state 2> /var/log/elastiq/elastiq.err
elastiq    953   952  0 14:09 pts/0    00:00:01 python /usr/bin/elastiq-real.py --logdir=/var/log/elastiq --config=/etc/elastiq.conf --statefile=/var/lib/elastiq/state
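A condensed check of both services, as a sketch based on the process names shown above:

# print a warning if the expected processes are missing
pgrep -x condor_master > /dev/null || echo "condor_master is not running"
pgrep -f elastiq-real.py > /dev/null || echo "elastiq is not running"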
NB: the condor_status information isn't updated as frequently as elastiq's check of the VM status. It can happen that for some minutes condor_status still shows nodes that have already been removed from the cloud by elastiq.