progetti:cloud-areapd:ced-c:nfs_cluster_monitoring
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| progetti:cloud-areapd:ced-c:nfs_cluster_monitoring [2015/07/20 15:29] – [Allow nrpe to run the check as root] mazzon@infn.it | progetti:cloud-areapd:ced-c:nfs_cluster_monitoring [2015/10/14 13:29] (current) – [Reload nagios] mazzon@infn.it | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== Monitoring the NFS cluster with Nagios ====== | ||
| + | The NFS service provided by the two-node cluster is an " | ||
| + | |||
| + | * one node is actually running the nfsd daemon | ||
| + | * the other node is in standby | ||
| + | * takeover of the service is handled by the cluster daemons | ||
| + | |||
| + | Therefore we monitor the service by: | ||
| + | |||
| + | * checking that the **cluster** daemons are running on each node | ||
| + | * checking the '' | ||
| + | * if the server is running on the local host, checking the detailed status of the **nfs** daemons | ||
| + | * if the local host is a standby node but the cluster is OK and nfs is running somewhere, returning OK | ||
| + | |||
| + | ===== Install needed packages ===== | ||
| + | |||
| + | On all the monitored nodes: | ||
| + | |||
| + | <code bash> | ||
| + | # yum -y install nrpe nagios-plugins-perl perl-Nagios-Plugin | ||
| + | </ | ||
| + | |||
| + | Obtain the latest version of the monitoring scripts from [[https:// | ||
| + | and [[https:// | ||
| + | |||
| + | < | ||
| + | # cp check_nfs4.0.2.pl / | ||
| + | # cp check_crm_v0_7 / | ||
| + | # chmod +rx / | ||
| + | # chmod +rx / | ||
| + | </ | ||
| + | |||
| + | Since all nodes in the cluster share the same domain and users, we do not use the idmapd daemon. Its absence is therefore not critical: | ||
| + | |||
| + | <code bash> | ||
| + | sed -i 's/^if (!$idmapd_d) { $daelist/# if (!$idmapd_d) { $daelist/' | ||
| + | </ | ||
| + | |||
| + | ===== Create a helper script ===== | ||
| + | |||
| + | To implement the nagios check as designed we use a helper script that checks whether the nfs daemon is running on the tested host or not. | ||
| + | In the former case the result of the check is handed over to the '' | ||
| + | |||
| + | <code bash check_my_nfs> | ||
| + | #!/bin/bash | ||
| + | |||
| + | monitor="/ | ||
| + | |||
| + | # check cluster is healthy | ||
| + | ${monitor} -s 1>/ | ||
| + | if [ " | ||
| + | then | ||
| + | echo " | ||
| + | exit 2 | ||
| + | else | ||
| + | # | ||
| + | # check if there is at least one nfs server active | ||
| + | # | ||
| + | | ||
| + | if [ " | ||
| + | then | ||
| + | echo "NFS server is not running anywhere!" | ||
| + | exit 2 | ||
| + | else | ||
| + | hname=$(hostname -s) | ||
| + | ${monitor} | grep $hname | grep nfsclusterserver 1>/ | ||
| + | if [ " | ||
| + | then | ||
| + | # | ||
| + | # I am the nfs server: check if I'm healthy | ||
| + | # | ||
| + | exec / | ||
| + | else | ||
| + | # | ||
| + | # I am not the nfs server but: | ||
| + | # - the cluster is ok | ||
| + | # - the service is running | ||
| + | # | ||
| + | echo "NFS is running somewhere..." | ||
| + | exit 0 | ||
| + | fi | ||
| + | fi | ||
| + | fi | ||
| + | </ | ||
| + | ===== Setup nrpe on monitored hosts ===== | ||
| + | |||
| + | ==== nrpe directives ==== | ||
| + | |||
| + | On every host in the cluster, create the file ''/ | ||
| + | |||
| + | < | ||
| + | # Allow requests from cld-nagios by adding the cld-nagios IP to the list of allowed hosts | ||
| + | allowed_hosts=127.0.0.1, | ||
| + | |||
| + | # Define the check_crm command: | ||
| + | command[check_crm]=/ | ||
| + | |||
| + | # Define the check_nfs4 command: | ||
| + | # On CentOS the file '/ | ||
| + | # by root so we run this check through ' | ||
| + | command[check_nfs4]=sudo / | ||
| + | |||
| + | |||
| + | ==== Allow nrpe to run the checks as root ==== | ||
| + | |||
| + | * Create the file ''/ | ||
| + | < | ||
| + | Defaults: | ||
| + | |||
| + | nrpe ALL = (root) NOPASSWD: / | ||
| + | nrpe ALL = (root) NOPASSWD: / | ||
| + | nrpe ALL = (root) NOPASSWD: / | ||
| + | </ | ||
| + | |||
| + | * Give the file the correct permissions | ||
| + | <code bash> | ||
| + | chmod 440 / | ||
| + | </code> | ||
| + | |||
| + | ==== Open firewall port 5666 ==== | ||
| + | |||
| + | <code bash> | ||
| + | firewall-cmd --add-port=5666/ | ||
| + | firewall-cmd --permanent --add-port=5666/ | ||
| + | </ | ||
| + | |||
| + | ==== Start and enable the nrpe daemon ==== | ||
| + | |||
| + | <code bash> | ||
| + | systemctl start nrpe | ||
| + | systemctl enable nrpe | ||
| + | </ | ||
| + | ===== Define needed commands on cld-nagios ===== | ||
| + | |||
| + | * Make sure nrpe is installed on the nagios server | ||
| + | < | ||
| + | nrpe-2.15-2.el6.x86_64 | ||
| + | nagios-plugins-nrpe-2.15-2.el6.x86_64</ | ||
| + | |||
| + | * Make sure a command that runs checks through nrpe is defined (check the '' | ||
| + | < | ||
| + | define command{ | ||
| + | command_name | ||
| + | command_line | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | * Create the new command that runs check_nfs4 on the monitored host | ||
| + | < | ||
| + | define command{ | ||
| + | command_name | ||
| + | contact_groups | ||
| + | command_line | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | * Add it to the list of the scheduled checks for every node in the cluster | ||
| + | < | ||
| + | define service{ | ||
| + | use | ||
| + | contact_groups | ||
| + | host_name | ||
| + | service_description | ||
| + | check_command | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | * Create the new command that runs check_crm on the monitored host | ||
| + | < | ||
| + | define command{ | ||
| + | command_name | ||
| + | contact_groups | ||
| + | command_line | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | * Add it to the list of the scheduled checks for every node in the cluster | ||
| + | < | ||
| + | define service{ | ||
| + | use | ||
| + | contact_groups | ||
| + | host_name | ||
| + | service_description | ||
| + | check_command | ||
| + | } | ||
| + | </ | ||
| + | ===== Reload nagios ===== | ||
| + | < | ||

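The decision logic of the helper script described in the "helper script" section can be sketched as a side-effect-free function, which makes it easy to try out before deploying it under nrpe. This is only an illustration under stated assumptions: the function name `decide_nfs_state`, its three arguments and the state labels are hypothetical, and the rule "a line containing both `nfsclusterserver` and `Started` means the service is active" is an assumption about the cluster monitor output, not something taken from the original script. Only the resource name `nfsclusterserver` and the exit codes come from the page.

```shell
#!/bin/bash
# Sketch of the helper script's decision logic (hypothetical function).
#   $1 = exit status of the cluster health check (0 = healthy)
#   $2 = text output of the cluster monitor
#   $3 = short hostname of the node being tested
# Prints a state label and returns the exit code the check would use
# (0 = OK, 2 = CRITICAL, as in the helper script).
decide_nfs_state() {
    local cluster_rc="$1" monitor_out="$2" host="$3"

    # 1. The cluster itself must be healthy.
    if [ "$cluster_rc" -ne 0 ]; then
        echo "CLUSTER_CRITICAL"
        return 2
    fi

    # 2. The nfs resource must be active somewhere.
    #    ASSUMPTION: an active resource shows up on a line with "Started".
    if ! printf '%s\n' "$monitor_out" | grep -F "nfsclusterserver" | grep -q "Started"; then
        echo "NFS_NOT_RUNNING"
        return 2
    fi

    # 3. Is the resource reported on this host? (-w avoids node1 matching node10)
    if printf '%s\n' "$monitor_out" | grep -Fw -- "$host" | grep -q "nfsclusterserver"; then
        echo "LOCAL_SERVER"   # the real script hands over to the detailed nfs check here
        return 0
    fi

    # 4. Standby node, cluster OK, service running elsewhere.
    echo "STANDBY_OK"
    return 0
}
```

For example, `decide_nfs_state 0 "$(cat monitor_output.txt)" "$(hostname -s)"` (with `monitor_output.txt` being a saved copy of the monitor output, a hypothetical file name) prints one of the four labels. Keeping the decision separate from the commands that gather the input lets each branch be exercised with canned text.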