Differences

This shows you the differences between two versions of the page.

tutorial:adm:server_monitoring [2019/03/25 10:21]
fiserp [Monitoring of server with CzechIdM]
tutorial:adm:server_monitoring [2019/03/25 10:27]
fiserp [Implementation]
Line 2: Line 2:
 Automatic monitoring of the production system is crucial for business continuity. Monitoring is also recommended for the testing environment, but it is not mandatory.
 This page shows how to set up basic monitoring of a server running CzechIdM using Nagios NRPE. It is very useful to store monitored values for a trend overview (e.g. with Munin). Some monitoring systems (like Zabbix) can store trends and monitor services at once. It is also practical to install the ''iostat'', ''vmstat'' and ''sar'' utilities on the server.
 +
 +<note>This article is about real-time monitoring of the server and its services. It does not deal with monitoring of the "insides" of CzechIdM.</note>
  
 **Typical CzechIdM server**
Line 17: Line 19:
 ^Service/Parameter ^Probe binary ^Name in NRPE ^Warning threshold ^Critical threshold ^Check frequency ^Notification frequency ^
 |HOST UP| N/A | this is not implemented on the target machine | N/A or ping RTT threshold | high ping RTT or host is not pingable at all | every 5 minutes | every 6 hours |
-|swap used space | check_swap | check_swap | 50% swap free | 10% swap free | every 5 minutes | every 24 hours |
+|swap used space | check\_swap | check\_swap | 50% swap free | 10% swap free | every 5 minutes | every 24 hours |
-|disk free space | check_disk | check_disk | 90% used | 95% used | every 5 minutes | every 24 hours |
+|disk free space | check\_disk | check\_disk | 90% used | 95% used | every 5 minutes | every 24 hours |
-|system load | check_load | check_load | 4,3.5,3 | 6,5.5,5 | every 5 minutes | every 24 hours |
+|system load | check\_load | check\_load | 4,3.5,3 | 6,5.5,5 | every 5 minutes | every 24 hours |
-|used memory | check_mem | check_mem | 90% used | 95% used | every 5 minutes | every 24 hours |
+|used memory | check\_mem | check\_mem | 90% used | 95% used | every 5 minutes | every 24 hours |
-|process count | check_procs | check_procs | 300+ | 500+ | every 5 minutes | every 24 hours |
+|process count | check\_procs | check\_procs | 300+ | 500+ | every 5 minutes | every 24 hours |
-|zombie process count | check_procs | check_zombies | 1+ | 5+ | every 5 minutes | every 24 hours |
+|zombie process count | check\_procs | check\_zombies | 1+ | 5+ | every 5 minutes | every 24 hours |
-|system time | check_ntp_time | check_time | skew >1min | skew >5min | every hour | every 24 hours |
+|system time | check\_ntp\_time | check\_time | skew >1min | skew >5min | every hour | every 24 hours |
-|CzechIdM is running | check_http | check_idm | N/A | CzechIdM not running | every 5 minutes | every 24 hours |
+|CzechIdM is running | check\_http | check\_idm | N/A | CzechIdM not running | every 5 minutes | every 24 hours |
-|HTTPD is running | check_http | check_httpd | response time >1s | HTTPD is not running | every 5 minutes | every 24 hours |
+|HTTPD is running | check\_http | check\_httpd | response time >1s | HTTPD is not running | every 5 minutes | every 24 hours |
-|HTTPS certificate expiration | check_http | check_httpd_cert | less than 30 days | less than 7 days | once a day | every 24 hours |
+|HTTPS certificate expiration | check\_http | check\_httpd\_cert | less than 30 days | less than 7 days | once a day | every 24 hours |
-|PostgreSQL is running | check_pgsql | check_postgres | response time >0.5s | response time >1s or not running at all | every 5 minutes | every 24 hours |
+|PostgreSQL is running | check\_pgsql | check\_postgres | response time >0.5s | response time >1s or not running at all | every 5 minutes | every 24 hours |
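
The "Probe binary", "Name in NRPE" and threshold columns above map fairly directly onto NRPE command definitions. The following is only a sketch of that mapping for three rows, not the configuration actually used on the server: the helper ''nrpe_command'' is hypothetical, the ''-p /'' path for the disk check is an assumption, and the "90% used / 95% used" disk thresholds are translated to "10% free / 5% free" because ''check_disk'' expresses its thresholds as free space. The plugin directory is the default location named on this page.

```shell
#!/bin/bash
# Sketch only: prints NRPE command definitions for three rows of the table.
# The flag values are a translation of the table thresholds, not the
# author's real nrpe.cfg.
PLUGIN_DIR="/usr/lib64/nagios/plugins"

# nrpe_command NAME BINARY ARGS...
# Hypothetical helper: prints one line in nrpe.cfg syntax,
#   command[NAME]=<plugin dir>/BINARY ARGS
nrpe_command() {
    local name="$1" binary="$2"
    shift 2
    printf 'command[%s]=%s/%s %s\n' "$name" "$PLUGIN_DIR" "$binary" "$*"
}

# swap: warn below 50% free, critical below 10% free
nrpe_command check_swap check_swap -w 50% -c 10%
# disk: 90%/95% used expressed as 10%/5% free; "-p /" is an assumed path
nrpe_command check_disk check_disk -w 10% -c 5% -p /
# load: 1, 5 and 15 minute averages
nrpe_command check_load check_load -w 4,3.5,3 -c 6,5.5,5
```

The printed lines could be reviewed and then pasted into the NRPE configuration. Each name in the square brackets is what the monitoring server would request through NRPE.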
  
 ===== Implementation =====
 We will use NRPE and probes from the standard system packages. We have the EPEL repository enabled.
-  * NRPE daemon will listen on 5666/tcp (its default port). Open the port in your iptables by adding the rule: //-A INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 5666 -j ACCEPT//
+  * NRPE daemon will listen on 5666/tcp (its default port). Open the port in your iptables by adding the rule: ''-A INPUT -m state --state ESTABLISHED,RELATED -p tcp --dport 5666 -j ACCEPT''
-  * All probes are located in their default installation location ///usr/lib64/nagios/plugins///
+  * All probes are located in their default installation location ''/usr/lib64/nagios/plugins/''
-  * We use one external probe check_mem which can be downloaded here: [[https://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check_mem-2Esh/details]]. This probe, however, returns bad results on RHEL7 because of the different meaning of the //free// command output. The fixed version is:
+  * We use one external probe check\_mem which can be downloaded here: [[https://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check_mem-2Esh/details]]. This probe, however, returns bad results on RHEL7 because of the different meaning of the ''free'' command output. The fixed version is:
 <code bash>
 #!/bin/bash
  • by urbanl