====== Monitoring ======

===== Ganglia =====

[[http://cld-ganglia.cloud.pd.infn.it/ganglia|Ganglia Monitoring Page]]

===== Nagios =====

[[http://cld-nagios.cloud.pd.infn.it/nagios|Nagios Monitoring Page]]

==== Active sensors ====

  * ''Current load'': checks the 1, 5 and 15 minute load averages on the monitored host. It reports a warning or error if one of these values exceeds a threshold (thresholds are defined in the Nagios .cfg files).
  * ''Dell Server XYZ'': OpenManage sensors installed on the Dell boxes. They check the various hardware components.
  * ''HTTP'': checks that the HTTP service is running.
  * ''HTTPS'': checks the status of the HTTP and HTTPS services.
  * ''HTTPS Certificate'': checks the certificate used by an HTTPS server. It reports a warning if the certificate expires in less than 30 days.
  * ''KVM'': used on the Compute Nodes. It checks the status of the KVM service and of the hosted VMs, and reports a warning if not all VMs are active. A warning is not necessarily a problem: for example, a VM may be in shutoff state because of a user action.
  * ''MySQL XYZ database'': used on the MySQL nodes. It checks the status of the given database.
  * ''Neutron Agents'': used on the Network Node. It runs the ''neutron agent-list'' command and checks that the status of every agent is '':-)'', i.e. ok.
  * ''Neutron Server'': used on the Network Node. It checks that a neutron-server process is running. If it reports a problem, first verify that the process is indeed not running; before restarting it, a check of the log files (in particular those in ''/var/log/neutron'') is recommended.
  * ''Nova Compute'': used on the Compute Nodes. It checks that a nova-compute process is running. If it reports a problem, first verify that the process is indeed not running. Before restarting it, a check of the log files (in particular those in ''/var/log/nova'') is recommended.
  * ''Openstack Check EC2 Instances'': checks the instances on the Cloud using the EC2 API. It reports a problem if the command fails or if the number of VMs in error exceeds a threshold, usually 5 (defined in the Nagios .cfg files).
  * ''Openstack Check Nova Instances'': checks the instances on the Cloud using the Nova API. It reports a problem if the command fails or if the number of VMs in error exceeds a threshold, usually 5 (defined in the Nagios .cfg files).
  * ''OpenStack Glance API'': checks, using the Glance API, the available images registered in Glance. In case of problems please check the log files, in particular those in ''/var/log/glance''.
  * ''OpenStack Glance Upload'': uploads (and then deletes) a small image in the Glance service, to test it. In case of problems please check the log files, in particular those in ''/var/log/glance''.
  * ''OpenStack Keystone API'': used on the Controller node. It checks the functionality of the Keystone service by trying to get a token from it. If it reports a problem, please check the log files, in particular those in ''/var/log/keystone''.
  * ''OpenStack Nova API'': tests the functionality of the Nova API by asking for the list of available flavors. In case of problems please check the log files, in particular those in ''/var/log/nova''.
  * ''OpenvSwitch'': used on the Compute Nodes. First it checks that the openvswitch service is running; it then parses the output of the ''ovs-ofctl dump-flows br-tun'' command, in which a row containing the strings ''table=0'' and ''actions=resubmit'' should be present. If it is not, a restart of the ''openvswitch'' and ''neutron-openvswitch-agent'' services could help.
  * ''PerconaStatus'': monitors the members of the Percona cluster. It checks that each member is synced and that the number of members of the database cluster is 3 (i.e. that all the members are working).
  * ''PING'': checks that the considered host is pingable.
  * ''SSH'': checks that the SSH service on the host is active.
  * ''XYZ Filesystem total size'': checks the total size of a filesystem and compares it with the expected value. This sensor is used for Gluster file systems, to be notified when one brick "is lost".
  * ''XYZ Gluster'': checks a Gluster volume on the servers providing it.
  * ''XYZ Partition'': checks that the considered file system is available and checks its size. It triggers a warning/error if the free space is less than a threshold defined in the Nagios .cfg files (usually 20 % for warning, 10 % for error).

===== Ceilometer =====

Installation instructions for Icehouse on:
  * http://docs.openstack.org/icehouse/install-guide/install/yum/content/ch_ceilometer.html
  * https://openstack.redhat.com/CeilometerQuickStart

Graphic interface available in Horizon:
  * https://cloud-areapd.pd.infn.it:8443/dashboard/admin/metering/

==== Installation on Controller ====

<code bash>
#!/bin/sh
export CEILOHOST=controller-01.pd.infn.it
export CEILOHOSTPV=controller-01.cloud.pd.infn.it
export DBHOST=cld-nagios
export KEYHOST=cloud-areapd.pd.infn.it
export RABBITHOSTS="192.168.60.100:5672,192.168.60.101:5672"
#
yum install -y openstack-ceilometer-api openstack-ceilometer-collector openstack-ceilometer-notification openstack-ceilometer-central openstack-ceilometer-alarm python-ceilometerclient

## the lines below to be executed on $DBHOST
#cat <<EOF >/etc/yum.repos.d/mongodb.repo
#[mongodb]
#name=MongoDB Repository
#baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
#gpgcheck=0
#enabled=1
#EOF
#
#yum install -y mongodb-org
#service mongod start
#chkconfig mongod on
#mongo --host $DBHOST --eval 'db = db.getSiblingDB("ceiloprod"); db.createUser({user: "ceiloprod",pwd: "CEILOMETER_DBPASS",roles: [ "readWrite", "dbAdmin" ]})'
## end of lines to be executed on $DBHOST
#
## if mongodb already in place at $DBHOST and ceilometer DB not yet created, you can do it
</code>
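The MongoDB repo file above, and the ''pipeline.yaml'' and cron fragments later in this page, are all written with shell here-documents (a construct that is easy to mistranscribe). A self-contained sketch of the pattern, using a temporary file and made-up content:

```shell
# Write a small file verbatim via a here-document, then inspect it.
TMPFILE=$(mktemp)
cat <<EOF >"$TMPFILE"
[mongodb]
name=MongoDB Repository
enabled=1
EOF
# A quoted delimiter (<<'EOF') would suppress $-expansion inside the body.
grep -c '=' "$TMPFILE"    # counts the two key=value lines
```

Everything between ''<<EOF'' and the line containing only ''EOF'' is fed to ''cat'' as standard input, so the file is reproduced exactly as typed.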
<code bash>
## from the controller as below:
yum install -y mongodb-org-shell
mongo --host $DBHOST --eval 'db = db.getSiblingDB("ceiloprod"); db.createUser({user: "ceiloprod",pwd: "CEILOMETER_DBPASS",roles: [ "readWrite", "dbAdmin" ]})'
## end of $DBHOST stuff
#
openstack-config --set /etc/ceilometer/ceilometer.conf database connection mongodb://ceiloprod:CEILOMETER_DBPASS@$DBHOST:27017/ceiloprod
#
CEILOMETER_TOKEN=$(openssl rand -hex 10)
echo $CEILOMETER_TOKEN
openstack-config --set /etc/ceilometer/ceilometer.conf publisher metering_secret $CEILOMETER_TOKEN
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend ceilometer.openstack.common.rpc.impl_kombu
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT rabbit_ha_queues true
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT rabbit_hosts $RABBITHOSTS
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT auth_strategy noauth
#
source keystone_admin.sh
keystone user-create --name=ceilometer --pass=CEILOMETER_PASS --email=ceilometer@example.com
keystone user-role-add --user=ceilometer --tenant=services --role=admin
keystone service-create --name=ceilometer --type=metering --description="Ceilometer Telemetry Service"
keystone endpoint-create --region=regionOne --service-id=$(keystone service-list | awk '/ metering / {print $2}') --publicurl=https://$CEILOHOST:8777 --internalurl=https://$CEILOHOSTPV:8777 --adminurl=https://$CEILOHOSTPV:8777
#
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken auth_host $KEYHOST
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken admin_user ceilometer
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken admin_tenant_name services
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken auth_protocol https
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken auth_uri https://$KEYHOST:35357
</code>
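The ''metering_secret'' generated above must be identical on every node running a ceilometer agent: the compute-node script below does not generate a new token but exports the very same ''CEILOMETER_TOKEN'' value. ''openssl rand -hex 10'' always emits 10 random bytes encoded as 20 lowercase hexadecimal characters:

```shell
# Generate a shared metering secret: 10 random bytes, hex-encoded.
CEILOMETER_TOKEN=$(openssl rand -hex 10)
echo "$CEILOMETER_TOKEN"
echo "${#CEILOMETER_TOKEN}"   # the length is always 20
```

If the secrets differ between nodes, the collector silently discards the samples whose signature does not verify, so this is worth double-checking before debugging anything else.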
<code bash>
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken admin_password CEILOMETER_PASS
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken cafile /etc/grid-security/certificates/INFN-CA-2006.pem
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_username ceilometer
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_tenant_name services
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_password CEILOMETER_PASS
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_cacert /etc/grid-security/certificates/INFN-CA-2006.pem
#
openstack-config --set /etc/cinder/cinder.conf DEFAULT control_exchange cinder
openstack-config --set /etc/cinder/cinder.conf DEFAULT notification_driver cinder.openstack.common.notifier.rpc_notifier
service openstack-cinder-api restart
service openstack-cinder-volume restart
#
openstack-config --set /etc/glance/glance-api.conf DEFAULT notification_driver messaging
openstack-config --set /etc/glance/glance-api.conf DEFAULT rpc_backend rabbit
service openstack-glance-api restart
service openstack-glance-registry restart
#
for i in `ls /etc/rc.d/init.d/openstack-ceilometer-*` ; do $i start; done
</code>

==== Installation on Compute Nodes ====

<code bash>
#!/bin/sh
export KEYHOST=cloud-areapd.pd.infn.it
export RABBITMGIP=192.168.60.100
export RABBITHOSTS="192.168.60.100:5672,192.168.60.101:5672"
export CEILOMETER_TOKEN=5bfc5aa7fb1162dad8f1
#
# 1) do not forget to add in nova.conf:
#[DEFAULT]
#...
</code>
<code bash>
#notification_driver = nova.openstack.common.notifier.rpc_notifier
#notification_driver = ceilometer.compute.nova_notifier
sed -i 's|#notification_driver=|notification_driver = nova.openstack.common.notifier.rpc_notifier\nnotification_driver = ceilometer.compute.nova_notifier|g' /etc/nova/nova.conf
#
# 2) use the same $CEILOMETER_TOKEN created on the Controller Node
#
yum install -y openstack-ceilometer-compute
openstack-config --set /etc/nova/nova.conf DEFAULT instance_usage_audit True
openstack-config --set /etc/nova/nova.conf DEFAULT instance_usage_audit_period hour
openstack-config --set /etc/nova/nova.conf DEFAULT notify_on_state_change vm_and_task_state
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend ceilometer.openstack.common.rpc.impl_kombu
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT auth_strategy noauth
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT rabbit_ha_queues true
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT rabbit_hosts $RABBITHOSTS
#openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT rabbit_host $RABBITMGIP
openstack-config --set /etc/ceilometer/ceilometer.conf publisher metering_secret $CEILOMETER_TOKEN
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken auth_host $KEYHOST
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken admin_user ceilometer
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken admin_tenant_name services
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken auth_protocol https
openstack-config --set /etc/ceilometer/ceilometer.conf keystone_authtoken admin_password CEILOMETER_PASS
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_username ceilometer
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_tenant_name services
</code>
<code bash>
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_password CEILOMETER_PASS
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_auth_url https://$KEYHOST:35357/v2.0
openstack-config --set /etc/ceilometer/ceilometer.conf service_credentials os_cacert /etc/grid-security/certificates/INFN-CA-2006.pem
sed -i 's|- cpu_sink|- cpu_sink\n - cpu_sink.hrs|g' /etc/ceilometer/pipeline.yaml
cat <<EOF >> /etc/ceilometer/pipeline.yaml
- name: cpu_sink.hrs
  transformers:
      - name: "unit_conversion"
        parameters:
            target:
                name: "cpu.hours"
                scale: "volume / (10**9 * 3600.0)"
                unit: "hours"
                type: "cumulative"
  publishers:
      - rpc://
EOF
service openstack-nova-compute restart
service openstack-ceilometer-compute start
chkconfig openstack-ceilometer-compute on
service openstack-ceilometer-compute status
ls -l /var/log/ceilometer/
</code>

==== Testing from Controller ====

<code bash>
source ceilometerrc.sh
# show meters and resources
ceilometer meter-list
ceilometer resource-list
# show the entire sample for a given meter for all resources
ceilometer sample-list -m cpu
# show the aggregate statistics (average, maxima, minima, sum, count) for a given meter for all resources
ceilometer statistics -m cpu
ceilometer statistics -m cpu.hours
ceilometer statistics -m cpu_util
# cpu for an instance is defined as the total cpu_time (in nanoseconds) output by the command "virsh cpu-stats instance-id" on the hypervisor
# cpu_util is (cpu_t2 - cpu_t1)*100/(Dt*n_vcpus*10**9), where Dt=t2-t1 is the sampling time (600 s by default) and n_vcpus is the number of VCPUs of the instance
# show the samples for a given resource after a given start time
START=2015-02-09T12:00:00
INSTANCE=67fccb06-b80e-42ab-9fca-f5e57adaba7
ceilometer sample-list -m cpu.hours -q "timestamp>$START;resource_id=$INSTANCE"
# show the aggregate statistics (average, maxima, minima, sum, count) for a given resource after a given start time for a given period (here 3600 seconds)
ceilometer statistics -m cpu.hours -q "timestamp>$START;resource_id=$INSTANCE" -p 3600
</code>

==== Set up a monthly backup of Ceilometer DB ====

  * Inspired by the instructions [[http://blog.zhaw.ch/icclab/managing-ceilometer-data-in-openstack/|here]]
  * From the controller node execute the script below:

<code bash>
[root@cld-ctrl-01 ceilometer]# cat ceilometer-backup.sh
#!/bin/sh
export DBHOST=cld-nagios
export DBHOSTIP=192.168.60.32
cat <<EOF >/etc/yum.repos.d/mongodb.repo
[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1
EOF
yum -y install mongodb-org-shell mongodb-org-tools
# set time_to_live to 32 days
openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT time_to_live 2764800
for i in `ls /etc/rc.d/init.d/openstack-ceilometer-*` ; do $i restart; done
# copying the mongo_backup.py script
git clone https://github.com/icclab/arcus-energy-monitoring-tool/
useradd ceilobackup
cp arcus-energy-monitoring-tool/Tools/mongo_backup.py /home/ceilobackup/
cd /home/ceilobackup/
# set properly the DB host_ip (the one of cld-nagios) and the ceilometer username and password
sed -i 's|username = "ceilometer"|username = "ceiloprod"|g' mongo_backup.py
sed -i 's|password = "password"|password = "CEILOMETER_DBPASS"|g' mongo_backup.py
sed -i 's|host_ip = "192.168.0.2"|host_ip = "192.168.60.32"|g' mongo_backup.py
sed -i 's|zipath = "/path/to/zip/"|zipath = "/home/ceilobackup/zip/"|g' mongo_backup.py
sed -i 's|path = "/path/to/tmp/"|path = "/home/ceilobackup/tmp/"|g' mongo_backup.py
sed -i 's|ceilometer|ceiloprod|g' mongo_backup.py
#
chown ceilobackup.ceilobackup /home/ceilobackup/mongo_backup.py
su - ceilobackup -c "python mongo_backup.py; ls -lh zip/"
# purge monthly the DB:
cat <<EOF >/etc/cron.d/ceilometer-backup
0 0 1 * * ceilobackup python mongo_backup.py; mongo --host cld-nagios --eval 'db.getSiblingDB("ceiloprod").dropDatabase()'; mongo --host cld-nagios --eval 'db = db.getSiblingDB("ceiloprod"); db.createUser({user: "ceiloprod",pwd: "CEILOMETER_DBPASS",roles: [ "readWrite", "dbAdmin" ]})'
EOF
</code>
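The ''cpu_util'' and ''cpu.hours'' formulas given in the Testing section can be checked numerically. A sketch with made-up sample values (two cumulative ''cpu'' samples in nanoseconds, taken 600 s apart on a 2-VCPU instance):

```shell
# Hypothetical cumulative CPU-time samples, in nanoseconds.
CPU_T1=1200000000000   # 1200 s of cumulative CPU time at t1
CPU_T2=1440000000000   # 1440 s at t2, one sampling period later
DT=600                 # sampling period Dt in seconds
NVCPU=2                # number of VCPUs of the instance
# cpu_util = (cpu_t2 - cpu_t1) * 100 / (Dt * n_vcpus * 10**9)
awk -v t1="$CPU_T1" -v t2="$CPU_T2" -v dt="$DT" -v n="$NVCPU" \
    'BEGIN { printf "cpu_util = %.1f %%\n", (t2 - t1) * 100 / (dt * n * 1e9) }'
# cpu.hours = volume / (10**9 * 3600.0), i.e. nanoseconds converted to hours
awk -v v="$CPU_T2" 'BEGIN { printf "cpu.hours = %.2f\n", v / (1e9 * 3600.0) }'
```

With these numbers the instance consumed 240 s of CPU time in a 600 s window spread over 2 VCPUs, i.e. 20 % utilisation, and 0.4 cumulative CPU hours in total.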