====== Nagios plugins for swift monitoring ======

Two kinds of checks have been implemented:

  * **Process checks**, which verify that the required processes are up.
  * **Functional checks**, which verify that swift works correctly.

The process checks use the check_procs plugin, installed with nagios-plugins-procs. The functional checks are based on somewhat dated plugins distributed on [[http://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/check_swift/details|nagios-exchange]], which I modified where necessary to fit our needs. In all cases the remote commands are launched through the check_nrpe plugin. For completeness, the code is attached here and the changes are described.

==== check_swift ====

**Description:** "try to upload download and delete a file in a Swift container to check that it works correctly."

  * Original code {{:progetti:cloud-areapd:check_swift.txt| here}}.
  * Code modified to get it working {{:progetti:cloud-areapd:check_swift_sabe_1.txt| here}}. This is a minimal change, made just to see the plugin run: the tenant and the tenant-id were added as input parameters, and the '--insecure' option was hardcoded into the commands.
  * Code modified to read its parameters from a configuration file instead of the command line {{:progetti:cloud-areapd:check_swift_sabe_2.txt| here}}. Configuration file {{:progetti:cloud-areapd:swift-check.conf.txt|/etc/swift/swift-check.conf}}. I changed the check of the delete operation because the delete (also when run by hand) reported a failure with the message "object not found", although it had actually succeeded: a subsequent listing showed that the object was gone. This is probably a synchronization issue.
The check is therefore now the result of a delete followed by a list, and it fails only if the object is still present in the listing after the delete. Reading the parameters from a configuration file also avoids scattering the access credentials around; in some error cases they were even printed (as part of the command) in the Nagios web interface output.

==== check_swift_dispersion ====

**Description:** "uses swift-dispersion tools to report dispersion analysis and checks that all copies of objects are OK"

  * Original code {{:progetti:cloud-areapd:check_swift_dispersion.txt| here}}. Configuration file {{:progetti:cloud-areapd:dispersion.conf.txt|/etc/swift/dispersion.conf}}.
  * Modified code {{:progetti:cloud-areapd:check_swift_dispersion_sabe.txt| here}}. Configuration file {{:progetti:cloud-areapd:dispersion.conf.txt|/etc/swift/dispersion.conf}}.

The original plugin relies on swift-dispersion-report, whose output changed in icehouse (others have complained about it [[https://github.com/enovance/openstack-monitoring/issues/29|here]]). According to the [[http://docs.openstack.org/icehouse/config-reference/content/object-storage-dispersion.html|documentation]], the JSON output should look like:

  {"object":{"retries:": 0, "missing_two": 0, "copies_found": 7863, "missing_one": 0,"copies_expected": 7863, "pct_found": 100.0, "overlapping": 0, "missing_all": 0},
   "container":{"retries:": 0, "missing_two": 0, "copies_found": 12534, "missing_one": 0, "copies_expected":12534, "pct_found": 100.0, "overlapping": 15, "missing_all": 0}}

whereas in practice it is:

  {"object":{"retries": 0, "missing_0": 2621, "copies_expected": 7863, "pct_found": 100.0, "overlapping": 0, "copies_found": 7863},
   "container":{"retries": 0, "copies_expected": 7866, "pct_found": 100.0, "overlapping": 0, "copies_found": 7866}}

That is, the missing_one, missing_two and missing_all keys that the plugin expects are no longer there. I modified the plugin so that it checks only 'pct_found'.
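To illustrate a check based only on 'pct_found', here is a minimal Python sketch. It is not the code of the modified plugin: the function name ''check_dispersion'' and the default thresholds (WARNING below 100%, CRITICAL below 95%) are assumptions made for the example. It parses a report in the format actually observed and maps the lowest pct_found value to a Nagios exit code.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_dispersion(report_json, warn_pct=100.0, crit_pct=95.0):
    """Return (exit_code, message) based only on 'pct_found'.

    warn_pct and crit_pct are hypothetical thresholds chosen for this sketch.
    """
    try:
        report = json.loads(report_json)
        found = {kind: float(report[kind]["pct_found"])
                 for kind in ("object", "container")}
    except (ValueError, KeyError, TypeError) as exc:
        return UNKNOWN, "UNKNOWN: cannot parse dispersion report (%r)" % (exc,)

    # The worst of the two percentages decides the overall status.
    worst = min(found.values())
    details = ", ".join("%s=%.1f%%" % (k, v) for k, v in sorted(found.items()))
    if worst < crit_pct:
        return CRITICAL, "CRITICAL: " + details
    if worst < warn_pct:
        return WARNING, "WARNING: " + details
    return OK, "OK: " + details


# Report in the format observed on icehouse (no missing_one/_two/_all keys).
sample = ('{"object": {"retries": 0, "missing_0": 2621, "copies_expected": 7863, '
          '"pct_found": 100.0, "overlapping": 0, "copies_found": 7863}, '
          '"container": {"retries": 0, "copies_expected": 7866, '
          '"pct_found": 100.0, "overlapping": 0, "copies_found": 7866}}')
print(check_dispersion(sample))
```

Because the sketch never reads the missing_* keys, it works on both the documented and the observed output format.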
Admittedly, this loses the information on whether one, two or all copies are missing, but swift-dispersion-report no longer provides it.

==== check_swift_object_servers ====

**Description:** "check_swift_object_servers uses swift-recon to query all clusters servers and ensure they all have the same copy of the object ring."

  * Original code {{:progetti:cloud-areapd:check_swift_object_servers_1.txt| here}}.
  * Code modified to get it working {{:progetti:cloud-areapd:check_swift_object_servers.txt| here}}. I only changed one parameter in the call to swift-recon.
    * Old command: swift-recon --objmd5
    * New command: swift-recon --md5

====== Configuration of the Nagios server host ======

==== Command configuration ====

**commands.cfg** [SL: /etc/nagios/objects/commands.cfg]

  * Configuration of the check_nrpe plugin command. I defined many arguments because the parametric version of check_swift, for example, needs 9 of them, command included. I left some margin.

  define command{
      command_name    check_nrpe
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 480 -c $ARG1$ -a $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$ $ARG8$ $ARG9$ $ARG10$ $ARG11$ $ARG12$
  }

  * Commands to monitor the swift processes with check_procs, run on the monitored host via check_nrpe

  ### Check swift processes
  define command{
      command_name    check_swift-proxy-server
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-proxy-server
  }
  define command{
      command_name    check_swift-object-server
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-object-server
  }
  define command{
      command_name    check_swift-object-auditor
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-object-auditor
  }
  define command{
      command_name    check_swift-object-replicator
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-object-replicator
  }
  define command{
      command_name    check_swift-object-updater
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-object-updater
  }
  define command{
      command_name    check_swift-account-server
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-account-server
  }
  define command{
      command_name    check_swift-account-auditor
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-account-auditor
  }
  define command{
      command_name    check_swift-account-replicator
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-account-replicator
  }
  define command{
      command_name    check_swift-account-reaper
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-account-reaper
  }
  define command{
      command_name    check_swift-container-server
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-container-server
  }
  define command{
      command_name    check_swift-container-auditor
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-container-auditor
  }
  define command{
      command_name    check_swift-container-replicator
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-container-replicator
  }
  define command{
      command_name    check_swift-container-updater
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-container-updater
  }
  define command{
      command_name    check_swift-container-sync
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_swift-container-sync
  }

  * Commands to monitor swift functionality with the plugins described above, run on the monitored host via check_nrpe

  ### Check swift functionalities
  define command{
      command_name    check_swift
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 480 -c check_swift -A $ARG2$ -U $ARG3$ -T $ARG4$ -I $ARG5$ -K $ARG6$ -V $ARG7$ -c $ARG8$
  }
  define command{
      command_name    check_swift_1
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 480 -c check_swift_1
  }
  define command{
      command_name    check_swift_dispersion
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 480 -c check_swift_dispersion
  }
  define command{
      command_name    check_swift_object_servers
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 480 -c check_swift_object_servers
  }
  define command{
      command_name    check_rsync
      command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_rsync
  }

==== Configuration of the services that monitor swift ====

**swift_nodes.cfg** [SL: /etc/nagios/objects/swift_nodes.cfg]

  # ============ Define service base template =============
  define service{
      name                    swift-service
      use                     server-ssh-service
      hostgroup_name          swift-nodes ;,%Cluster2 ;append new cluster(host_group) here
      register                0
  }
  # ============ Define Swift Clusters ============
  # A list of nodes in a cluster
  define hostgroup {
      hostgroup_name          swift-nodes ;Fixed me
      alias                   SwiftStack pd nodes ;Fixed me
      members                 storage-node-01, storage-node-02 ;Fixed me
  }
  define hostgroup {
      hostgroup_name          swift-proxies ;Fixed me
      alias                   SwiftStack pd nodes ;Fixed me
      members                 proxy-node ;Fixed me
  }
  # ============ Define Swift Services ==============
  define service {
      service_description     try to upload download and delete a file in a Swift container to check that it works correctly. Read input parameters from configuration file
      check_command           check_nrpe!check_swift_1!$HOSTADDRESS$
      use                     swift-proxy-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     uses swift-dispersion tools to report dispersion analysis and checks that all copies of objects are OK
      check_command           check_nrpe!check_swift_dispersion!$HOSTADDRESS$
      use                     swift-proxy-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     uses swift-recon to query all clusters servers and ensure they all have the same copy of the object ring.
      check_command           check_nrpe!check_swift_object_servers!$HOSTADDRESS$
      use                     swift-proxy-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if process swift-proxy-server is alive
      check_command           check_nrpe!check_swift-proxy-server!$HOSTADDRESS$
      use                     swift-proxy-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process object-server is alive
      check_command           check_nrpe!check_swift-object-server!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process object-auditor is alive
      check_command           check_nrpe!check_swift-object-auditor!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process object-replicator is alive
      check_command           check_nrpe!check_swift-object-replicator!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process object-updater is alive
      check_command           check_nrpe!check_swift-object-updater!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process account-server is alive
      check_command           check_nrpe!check_swift-account-server!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process account-auditor is alive
      check_command           check_nrpe!check_swift-account-auditor!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process account-replicator is alive
      check_command           check_nrpe!check_swift-account-replicator!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process account-reaper is alive
      check_command           check_nrpe!check_swift-account-reaper!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process container-server is alive
      check_command           check_nrpe!check_swift-container-server!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process container-auditor is alive
      check_command           check_nrpe!check_swift-container-auditor!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process container-replicator is alive
      check_command           check_nrpe!check_swift-container-replicator!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process container-updater is alive
      check_command           check_nrpe!check_swift-container-updater!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if swift process container-sync is alive
      check_command           check_nrpe!check_swift-container-sync!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }
  define service {
      service_description     check if process rsync is alive
      check_command           check_nrpe!check_rsync!$HOSTADDRESS$
      use                     swift-service
      notification_interval   0 ; set > 0 if you want to be renotified
  }

==== Configuration of the Nagios server ====

**nagios.cfg** [SL: /etc/nagios/nagios.cfg]

  * If there are timeout problems, adjust:

  # TIMEOUT VALUES
  # These options control how much time Nagios will allow various
  # types of commands to execute before killing them off. Options
  # are available for controlling maximum time allotted for
  # service checks, host checks, event handlers, notifications, the
  # ocsp command, and performance data commands. All values are in
  # seconds.
  service_check_timeout=480
  host_check_timeout=300
  event_handler_timeout=300
  notification_timeout=300
  ocsp_timeout=150
  perfdata_timeout=150

  * For debugging, debug_level=2048 helps: the file /var/log/nagios/nagios.debug then shows how Nagios builds the commands

  # DEBUG LEVEL
  # This option determines how much (if any) debugging information will
  # be written to the debug file. OR values together to log multiple
  # types of information.
  # Values:
  # -1 = Everything
  # 0 = Nothing
  # 1 = Functions
  # 2 = Configuration
  # 4 = Process information
  # 8 = Scheduled events
  # 16 = Host/service checks
  # 32 = Notifications
  # 64 = Event broker
  # 128 = External commands
  # 256 = Commands
  # 512 = Scheduled downtime
  # 1024 = Comments
  # 2048 = Macros
  debug_level=2048

Finally, remember that the Nagios server must be restarted for the configuration changes to take effect; on SL:

  service nagios restart

====== Installation and configuration of the Nagios NRPE plugin on the monitored host ======

== Installation ==

  * On SL: yum install nagios-plugins-nrpe
  * On Ubuntu: sudo apt-get install nagios-nrpe-server nagios-plugins

== Configuration ==

**/etc/nagios/nrpe.cfg**

  * Command definitions

  #########
  # Swift #
  #########
  # Swift processes checks
  command[check_swift-proxy-server]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "swift-proxy-server"
  command[check_swift-object-server]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "object-server"
  command[check_swift-object-auditor]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "object-auditor"
  command[check_swift-object-replicator]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "object-replicator"
  command[check_swift-object-updater]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "object-updater"
  command[check_swift-account-server]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "account-server"
  command[check_swift-account-auditor]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "account-auditor"
  command[check_swift-account-replicator]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "account-replicator"
  command[check_swift-account-reaper]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "account-reaper"
  command[check_swift-container-server]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "container-server"
  command[check_swift-container-auditor]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "container-auditor"
  command[check_swift-container-replicator]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "container-replicator"
  command[check_swift-container-updater]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "container-updater"
  command[check_swift-container-sync]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "container-sync"
  command[check_rsync]=/usr/lib/nagios/plugins/check_procs -w 2:10 -c 1:100 -a "rsync"
  # Swift functionalities checks
  #command[check_swift]=/usr/lib/nagios/plugins/check_swift -w 5 -c 10
  command[check_swift]=/usr/lib/nagios/plugins/check_swift -A $ARG2$ -U $ARG3$ -T $ARG4$ -I $ARG5$ -K $ARG6$ -V $ARG7$ -c $ARG8$
  command[check_swift_1]=/usr/lib/nagios/plugins/check_swift_sabe
  command[check_swift_dispersion]=/usr/lib/nagios/plugins/check_swift_dispersion -w 5 -c 1
  command[check_swift_object_servers]=/usr/lib/nagios/plugins/check_swift_object_servers -w 5 -c 1

  * To let check_nrpe accept command arguments

  # COMMAND ARGUMENT PROCESSING
  # This option determines whether or not the NRPE daemon will allow clients
  # to specify arguments to commands that are executed. This option only works
  # if the daemon was configured with the --enable-command-args configure script
  # option.
  #
  # *** ENABLING THIS OPTION IS A SECURITY RISK! ***
  # Read the SECURITY file for information on some of the security implications
  # of enabling this variable.
  #
  # Values: 0=do not allow arguments, 1=allow command arguments
  dont_blame_nrpe=1

  * For debugging (output goes to syslog)

  # DEBUGGING OPTION
  # This option determines whether or not debugging messages are logged to the
  # syslog facility.
  # Values: 0=debugging off, 1=debugging on
  debug=1

  * If the plugin times out, adjust

  # COMMAND TIMEOUT
  # This specifies the maximum number of seconds that the NRPE daemon will
  # allow plugins to finish executing before killing them off.
  command_timeout=300

Finally, remember that the NRPE server must be restarted for the configuration changes to take effect:

  /etc/init.d/nagios-nrpe-server restart
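
The check_procs thresholds used in the process checks (-w 2:10 -c 1:100) are ranges: the plugin alerts when the number of matching processes falls outside them. The Python sketch below mimics that evaluation to make the resulting states explicit. ''procs_status'' and ''in_range'' are hypothetical helpers written for this page, not part of any Nagios package, and they handle only the simple min:max form.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def in_range(value, spec):
    """True if value lies inside a simple 'min:max' range (bounds inclusive)."""
    low, high = (int(part) for part in spec.split(":"))
    return low <= value <= high


def procs_status(count, warn="2:10", crit="1:100"):
    """Map a process count to a status; critical takes precedence over warning."""
    if not in_range(count, crit):
        return CRITICAL
    if not in_range(count, warn):
        return WARNING
    return OK


for n in (0, 1, 2, 10, 11, 101):
    print(n, procs_status(n))
```

With these thresholds a single matching process already yields WARNING, and no process at all yields CRITICAL, so the warning range may need adjusting for daemons that normally run as one process.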