Differences

This shows you the differences between two versions of the page.

tutorial:adm:server_monitoring [2019/03/25 10:18]
fiserp created
tutorial:adm:server_monitoring [2020/06/30 12:27] (current)
urbanl [Server preparation - Server monitoring]
Line 1: Line 1:
-====== Monitoring of server with CzechIdM ====== +====== Server preparation - Server monitoring ====== 
-Automatic monitoring of production system is crucial for buissines continuity. Monitoring is recommended also for the testing environment, but it is not mandatory. +Automatic monitoring of a production system is crucial for business continuity. Monitoring is also recommended for the testing environment, but it is not mandatory. 
-This page will show you how to set up basic monitoring of CzechIdM server with Nagios NRPE. It is very useful to store monitored values for trend overview (f.e. with Munin). Some monitoring systems (like Zabbix) can store trends and monitor services at once. It is also practical to install //iostat//, //vmstat// and //sar// utilities on the server.+This page shows how to set up basic monitoring of a server running CzechIdM using Nagios NRPE. It is also useful to store the monitored values for trend overview (e.g. with Munin). Some monitoring systems (like Zabbix) can store trends and monitor services at once. It is also practical to install the ''iostat'', ''vmstat'' and ''sar'' utilities on the server.
  
-**Typical CzechIdM server**+<note>This article is about real-time monitoring of the server and its services. It does not deal with monitoring of "the insides" of CzechIdM.</note>
  
-This is a typical configuration of a production server for a small company. These parameters may need to be adjusted to complexity of a particular implementation.+**Example server parameters for this guide** 
 + 
 +The example monitoring parameters in the table below were chosen for a server with these resources:
   * RHEL7-flavoured system.   * RHEL7-flavoured system.
-  * About 80GB HDD. +  * About 100GB HDD. 
-  * At least 6GB RAM.+  * At least 8GB RAM.
   * At least 2x2GHz CPU.   * At least 2x2GHz CPU.
 +When implementing server monitoring, adjust the monitored parameters to your particular deployment.
  
 ===== Monitored parameters ===== ===== Monitored parameters =====
Line 17: Line 20:
 ^Service/Parameter ^Probe binary ^Name in NRPE ^Warning threshold ^Critical threshold ^Check frequency ^Notification frequency ^ ^Service/Parameter ^Probe binary ^Name in NRPE ^Warning threshold ^Critical threshold ^Check frequency ^Notification frequency ^
 |HOST UP| N/A | this is not implemented on the target machine | N/A or ping RTT threshold | high ping RTT or host is not pingable at all | every 5 minutes | every 6 hours | |HOST UP| N/A | this is not implemented on the target machine | N/A or ping RTT threshold | high ping RTT or host is not pingable at all | every 5 minutes | every 6 hours |
-|swap used space | check_swap | check_swap | 50% swap free | 10% swap free | every 5 minutes | every 24 hours | +|swap used space | check\_swap | check\_swap | 50% swap free | 10% swap free | every 5 minutes | every 24 hours | 
-|disk free space | check_disk | check_disk | 90% used | 95% used | every 5 minutes | every 24 hours | +|disk free space | check\_disk | check\_disk | 90% used | 95% used | every 5 minutes | every 24 hours | 
-|system load | check_load | check_load | 4,3.5,3 | 6,5.5,5 | every 5 minutes | every 24 hours | +|system load | check\_load | check\_load | 4,3.5,3 | 6,5.5,5 | every 5 minutes | every 24 hours | 
-|used memory | check_mem | check_mem | 90% used | 95% used | every 5 minutes | every 24 hours | +|used memory | check\_mem | check\_mem | 90% used | 95% used | every 5 minutes | every 24 hours | 
-|process count | check_procs | check_procs | 300+ | 500+ | every 5 minutes | every 24 hours | +|process count | check\_procs | check\_procs | 300+ | 500+ | every 5 minutes | every 24 hours | 
-|zombie process count | check_procs | check_zombies | 1+ | 5+ | every 5 minutes | every 24 hours | +|zombie process count | check\_procs | check\_zombies | 1+ | 5+ | every 5 minutes | every 24 hours | 
-|system time | check_ntp_time | check_time | skew >1min | skew >5min | every hour | every 24 hours | +|system time | check\_ntp\_time | check\_time | skew >1min | skew >5min | every hour | every 24 hours | 
-|CzechIdM is running | check_http | check_idm | N/A | CzechIdM not running | every 5 minutes | every 24 hours | +|CzechIdM is running | check\_http | check\_idm | N/A | CzechIdM not running | every 5 minutes | every 24 hours | 
-|HTTPD is running | check_http | check_httpd | response time >1s | HTTPD is not running | every 5 minutes | every 24 hours | +|HTTPD is running | check\_http | check\_httpd | response time >1s | HTTPD is not running | every 5 minutes | every 24 hours | 
-|HTTPS certificate expiration | check_http | check_httpd_cert | less than 30 days | less than 7 days | once a day | every 24 hours | +|HTTPS certificate expiration | check\_http | check\_httpd\_cert | less than 30 days | less than 7 days | once a day | every 24 hours | 
-|PostgresSQL is running | check_pgsql | check_postgres | response time >0.5s | response time >1s or not running at all | every 5 minutes | every 24 hours |+|PostgreSQL is running | check\_pgsql | check\_postgres | response time >0.5s | response time >1s or not running at all | every 5 minutes | every 24 hours |
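
The warning and critical thresholds in the table correspond to the usual Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). A minimal sketch of that mapping for a "percent used" metric such as disk or memory follows. The function name ''classify_used_pct'' is hypothetical and is not part of any Nagios package.

```shell
#!/bin/bash
# Hypothetical helper: classify a "percent used" value against warning and
# critical thresholds, the way the disk and memory rows in the table do.
# Prints the state and returns the matching Nagios exit code.
classify_used_pct() {
    local used="$1" warn="$2" crit="$3"
    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL"; return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "WARNING"; return 1
    fi
    echo "OK"; return 0
}

# Usage:
#   classify_used_pct 92 90 95
```

The critical threshold is tested first so that a value above both limits is reported as CRITICAL rather than WARNING.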
  
 ===== Implementation ===== ===== Implementation =====
 We will use nrpe and probes from the standard system packages. We have epel repository enabled. We will use nrpe and probes from the standard system packages. We have epel repository enabled.
-  * NRPE daemon will listen on 5666\tcp (its default port). Open the port in your iptables by adding the rule: //-A INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 5666 -j ACCEPT//+  * The NRPE daemon will listen on 5666/tcp (its default port). Open the port in your iptables by adding a rule that accepts new connections: ''-A INPUT -m state --state NEW -p tcp --dport 5666 -j ACCEPT''
-  * All probes are located in their default installation location ///usr/lib64/nagios/plugins///+  * All probes are located in their default installation location ''/usr/lib64/nagios/plugins/''
-  * We use one external probe check_mem which can be downloaded here: [[https://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check_mem-2Esh/details]]. This probe, however, returns bad results on RHEL7 because of the different meaning of the //free// command output. The fixed version is: +  * We use one external probe, check\_mem, which can be downloaded here: [[https://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check_mem-2Esh/details]]. This probe, however, returns incorrect results on RHEL7 because of the different meaning of the ''free'' command output. You can download the fixed version from [[https://github.com/bcvsolutions/czechidm-monitoring/blob/master/monitoring/nagios-plugins/check_mem/check_mem|here]].
-<code bash> +
-#!/bin/bash +
- +
-# Modified for CentOS7 - Petr Fiser, BCV solutions s.r.o. +
-if [ "$1" = "-w" ] && [ "$2" -gt "0" ] && [ "$3" = "-c" ] && [ "$4" -gt "0" ]; then +
- +
-        memTotal_b=`free -b |grep Mem |awk '{print $2}'` +
-        memFree_b=`free -b |grep Mem |awk '{print $4}'` +
-        memBuffer_b=`free -b |grep Mem |awk '{print $6}'` +
- +
-        memTotal_m=`free -m |grep Mem |awk '{print $2}'` +
-        memFree_m=`free -m |grep Mem |awk '{print $4}'` +
-        memBuffer_m=`free -m |grep Mem |awk '{print $6}'` +
- +
-        memUsed_b=$(($memTotal_b-$memFree_b-$memBuffer_b)) +
-        memUsed_m=$(($memTotal_m-$memFree_m-$memBuffer_m)) +
- +
-        memUsedPrc=$((($memUsed_b*100)/$memTotal_b)) +
- +
- +
-        if [ "$memUsedPrc" -ge "$4" ]; then +
-                echo "Memory: CRITICAL Total: $memTotal_m MB - Used: $memUsed_m MB - $memUsedPrc% used!|TOTAL=$memTotal_b;;;; USED=$memUsed_b;;;; BUFFER=$memBuffer_b;;;;" +
-                $(exit 2) +
-        elif [ "$memUsedPrc" -ge "$2" ]; then +
-                echo "Memory: WARNING Total: $memTotal_m MB - Used: $memUsed_m MB - $memUsedPrc% used!|TOTAL=$memTotal_b;;;; USED=$memUsed_b;;;; BUFFER=$memBuffer_b;;;;" +
-                $(exit 1) +
-        else +
-                echo "Memory: OK Total: $memTotal_m MB - Used: $memUsed_m MB - $memUsedPrc% used|TOTAL=$memTotal_b;;;; USED=$memUsed_b;;;; BUFFER=$memBuffer_b;;;;" +
-                $(exit 0) +
-        fi +
- +
-else +
-        echo "check_mem v1.1" +
-        echo "" +
-        echo "Usage:" +
-        echo "check_mem.sh -w <warnlevel> -c <critlevel>" +
-        echo "" +
-        echo "warnlevel and critlevel is percentage value without %" +
-        echo "" +
-        echo "Copyright (C) 2012 Lukasz Gogolin (lukasz.gogolin@gmail.com)" +
-        exit +
-fi +
-</code>+
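
Newer versions of ''free'' print an ''available'' column as the seventh field of the ''Mem:'' line. A minimal sketch of a used-memory percentage based on that column follows. It assumes the column layout "total used free shared buff/cache available", and it is only an illustration, not the fixed plugin linked above. The function name ''mem_used_pct'' is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `free -b` output on stdin and print the
# percentage of memory in use, computed as (total - available) / total.
# Assumes the layout "total used free shared buff/cache available",
# so "available" is field 7 of the "Mem:" line.
mem_used_pct() {
    awk '/^Mem:/ { printf "%d\n", ($2 - $7) * 100 / $2 }'
}

# Usage on a live system:
#   free -b | mem_used_pct
```

Reading from standard input keeps the calculation separate from the call to ''free'', so it can be tried against saved output from another machine.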
  
 **Deployment** **Deployment**
  
-First, install the necessary packages:+First, install necessary packages:
 <code> <code>
 yum install nrpe nagios-plugins-nrpe nagios-plugins-swap nagios-plugins-disk nagios-plugins-load nagios-plugins-procs nagios-plugins-ntp nagios-plugins-http nagios-plugins-pgsql yum install nrpe nagios-plugins-nrpe nagios-plugins-swap nagios-plugins-disk nagios-plugins-load nagios-plugins-procs nagios-plugins-ntp nagios-plugins-http nagios-plugins-pgsql
 </code> </code>
-If you use SELinux, we need to permit the check_disk plugin access to the ///sys/kernel/...//:+If you use SELinux, you need to allow the check\_disk plugin to access ''/sys/kernel/...''. The easiest way (though not necessarily the most correct one) is to set permissive mode for the plugin's SELinux domain:
 <code> <code>
 yum install policycoreutils-python yum install policycoreutils-python
 semanage permissive -a nagios_checkdisk_plugin_t semanage permissive -a nagios_checkdisk_plugin_t
 </code> </code>
-Edit the ///etc/nagios/nrpe.cfg// file and add your monitoring server address to the allowed_hosts directive:+Edit the ''/etc/nagios/nrpe.cfg'' file and add your monitoring server address to the allowed\_hosts directive:
 <code> <code>
 allowed_hosts=127.0.0.1,IPofMonitoringServer allowed_hosts=127.0.0.1,IPofMonitoringServer
 </code> </code>
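
After editing the file, it can be useful to confirm that the monitoring server's address really is listed. A minimal sketch that checks a literal address against the comma-separated ''allowed_hosts'' value follows. The function name ''host_allowed'' is hypothetical, and the sketch deliberately ignores network ranges and hostnames.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the address $2 appears as a literal entry
# in the allowed_hosts directive of the NRPE config file $1.
# Only exact, comma-separated entries are compared.
host_allowed() {
    local cfg="$1" ip="$2"
    grep -E '^[[:space:]]*allowed_hosts[[:space:]]*=' "$cfg" \
        | tail -n 1 \
        | sed 's/^[^=]*=//' \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fxq -- "$ip"
}

# Usage:
#   host_allowed /etc/nagios/nrpe.cfg 192.0.2.10 && echo "listed"
```

The final ''grep -Fx'' compares whole entries, so ''192.0.2.1'' does not match an entry of ''192.0.2.10''.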
-Create a configuration of system checks in the file ///etc/nrpe.d/checks.cfg//. Fill in the //YOUR_NTP_SERVER// and //IDM_SERVICE_DOMAIN_NAME// accordingly. The //MONITORING_USER// and //MONITORING_USER_PASSWORD// are values filled with credentials of an user which is capable to log into the PostgreSQL database. Create separate user just for this purpose. +Create the configuration of system checks in the file ''/etc/nrpe.d/checks.cfg''. Fill in ''YOUR\_NTP\_SERVER'' and ''IDM\_SERVICE\_DOMAIN\_NAME'' accordingly. ''MONITORING\_USER'' and ''MONITORING\_USER\_PASSWORD'' are the credentials of a user who can log into the PostgreSQL database. **Create a separate user just for this purpose.**
-<code>+<file txt checks.cfg>
 command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 50% -c 10% command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 50% -c 10%
 command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 90 -c 95 command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 90 -c 95
Line 106: Line 66:
 command[check_httpd_cert]=/usr/lib64/nagios/plugins/check_http -H IDM_SERVICE_DOMAIN_NAME -S -p443 -C30,7 command[check_httpd_cert]=/usr/lib64/nagios/plugins/check_http -H IDM_SERVICE_DOMAIN_NAME -S -p443 -C30,7
 command[check_postgres]=/usr/lib64/nagios/plugins/check_pgsql -H 127.0.0.1 -P 5432 -d template1 -l MONITORING_USER -p MONITORING_USER_PASSWORD -w0.5 -c1 command[check_postgres]=/usr/lib64/nagios/plugins/check_pgsql -H 127.0.0.1 -P 5432 -d template1 -l MONITORING_USER -p MONITORING_USER_PASSWORD -w0.5 -c1
-</code>+</file>
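
Each ''command[...]'' line defines the name that the monitoring server later passes to NRPE, so the names in this file should match the "Name in NRPE" column of the table. A minimal sketch that lists the names defined in a checks file for comparison follows. The function name ''list_nrpe_commands'' is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the command names defined in an NRPE
# configuration file ($1), one per line, skipping everything else.
list_nrpe_commands() {
    sed -n 's/^[[:space:]]*command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Usage:
#   list_nrpe_commands /etc/nrpe.d/checks.cfg
```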
-Add the //check_mem// script to the ///usr/lib64/nagios/plugins/// directory, make it executable:+Add the ''check\_mem'' script to the ''/usr/lib64/nagios/plugins/'' directory and make it executable:
 <code> <code>
 cp check_mem /usr/lib64/nagios/plugins/ cp check_mem /usr/lib64/nagios/plugins/
 chmod 755 /usr/lib64/nagios/plugins/check_mem chmod 755 /usr/lib64/nagios/plugins/check_mem
 </code> </code>
-Create the MONITORING_USER in the PostgreSQL. Please generate some strong password - you can use //pwgen// for that.+Create the ''MONITORING\_USER'' in PostgreSQL. Generate a strong password; you can use ''pwgen'' for that.
 <code> <code>
 create user monitoring password 'somepassword'; create user monitoring password 'somepassword';
Line 121: Line 81:
 systemctl enable nrpe systemctl enable nrpe
 </code> </code>
-To test the probes, you can use //check_nrpe// plugin:+To test the probes, you can use the ''check\_nrpe'' plugin:
 <code> <code>
 /usr/lib64/nagios/plugins/check_nrpe -H 127.0.0.1 -b 127.0.0.1 -c check_swap /usr/lib64/nagios/plugins/check_nrpe -H 127.0.0.1 -b 127.0.0.1 -c check_swap
Line 130: Line 90:
 This is a sample configuration for the Nagios server. It is meant as inspiration; feel free to adapt it to your Nagios deployment. This is a sample configuration for the Nagios server. It is meant as inspiration; feel free to adapt it to your Nagios deployment.
  
-Configure the check_nrpe command (you probably already have this in your Nagios configuration):+Configure the ''check\_nrpe'' command (you probably already have this in your Nagios configuration):
 <code> <code>
 define command{ define command{