Monitoring of server with CzechIdM

Automatic monitoring of a production system is crucial for business continuity. Monitoring is also recommended for the testing environment, but it is not mandatory. This page shows how to set up basic monitoring of a server running CzechIdM using Nagios NRPE. It is also useful to store the monitored values so that you can review trends (e.g. with Munin); some monitoring systems, such as Zabbix, both monitor services and store trends. Installing the iostat, vmstat and sar utilities on the server is practical as well.

This article is about real-time monitoring of the server and its services. It does not deal with monitoring of the "insides" of CzechIdM.

Typical CzechIdM server

This is a typical configuration of a production server for a small company. These parameters may need to be adjusted to the complexity of the particular deployment.

RHEL7-flavoured system.
About 80GB HDD.
At least 6GB RAM.
At least 2x2GHz CPU.

This is a list of the monitored parameters of the server and its services. Treat it as a bare minimum and extend it as needed according to your company's policy. The parameters and thresholds below are based on our best practices for monitoring a deployment.

Service/Parameter	Probe binary	Name in NRPE	Warning threshold	Critical threshold	Check frequency	Notification frequency
HOST UP	N/A	not implemented on the target machine	N/A or ping RTT threshold	high ping RTT or host is not pingable at all	every 5 minutes	every 6 hours
swap used space	check_swap	check_swap	50% swap free	10% swap free	every 5 minutes	every 24 hours
disk free space	check_disk	check_disk	90% used	95% used	every 5 minutes	every 24 hours
system load	check_load	check_load	4,3.5,3	6,5.5,5	every 5 minutes	every 24 hours
used memory	check_mem	check_mem	90% used	95% used	every 5 minutes	every 24 hours
process count	check_procs	check_procs	300+	500+	every 5 minutes	every 24 hours
zombie process count	check_procs	check_zombies	1+	5+	every 5 minutes	every 24 hours
system time	check_ntp_time	check_time	skew >1min	skew >5min	every hour	every 24 hours
CzechIdM is running	check_http	check_idm	N/A	CzechIdM not running	every 5 minutes	every 24 hours
HTTPD is running	check_http	check_httpd	response time >1s	HTTPD is not running	every 5 minutes	every 24 hours
HTTPS certificate expiration	check_http	check_httpd_cert	less than 30 days	less than 7 days	once a day	every 24 hours
PostgreSQL is running	check_pgsql	check_postgres	response time >0.5s	response time >1s or not running at all	every 5 minutes	every 24 hours
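
The warning and critical thresholds in the table correspond to the usual Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The following is only a minimal sketch of that mapping for a "percent used" style check such as disk or memory usage. The function name classify_used_percent is hypothetical and is not part of any Nagios package.

```shell
#!/bin/bash
# Hypothetical helper: classify a "percent used" reading against
# a warning and a critical threshold, Nagios style.
classify_used_percent() {
    local value=$1 warn=$2 crit=$3 arg
    # Anything that is not a plain non-negative integer is UNKNOWN.
    for arg in "$value" "$warn" "$crit"; do
        case "$arg" in
            ''|*[!0-9]*) echo "UNKNOWN"; return 3 ;;
        esac
    done
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING"; return 1
    fi
    echo "OK"; return 0
}

# Disk row from the table: warning at 90% used, critical at 95% used.
status=0
classify_used_percent 92 90 95 || status=$?
echo "exit code: $status"
```

With a reading of 92 the sketch prints WARNING and reports exit code 1, which is what the Nagios server would interpret as a warning state.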

We will use NRPE and the probes from the standard system packages. This assumes that the EPEL repository is enabled.

The NRPE daemon listens on 5666/tcp (its default port). Open the port in iptables by adding a rule that accepts new connections to it: -A INPUT -m state --state NEW -p tcp --dport 5666 -j ACCEPT.
All probes are located in their default installation location /usr/lib64/nagios/plugins/.
We use one external probe, check_mem, which can be downloaded here: https://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check_mem-2Esh/details. This probe, however, returns wrong results on RHEL7 because the columns of the free command output have a different meaning there. The fixed version is:

#!/bin/bash

# Original version https://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check_mem-2Esh/details
# Modified for CentOS7/RHEL7 - Petr Fiser, BCV solutions s.r.o.
if [ "$1" = "-w" ] && [ "$2" -gt "0" ] && [ "$3" = "-c" ] && [ "$4" -gt "0" ]; then

        # "Mem:" line columns on RHEL7: total used free shared buff/cache available
        memTotal_b=$(free -b | grep Mem | awk '{print $2}')
        memFree_b=$(free -b | grep Mem | awk '{print $4}')
        memBuffer_b=$(free -b | grep Mem | awk '{print $6}')

        memTotal_m=$(free -m | grep Mem | awk '{print $2}')
        memFree_m=$(free -m | grep Mem | awk '{print $4}')
        memBuffer_m=$(free -m | grep Mem | awk '{print $6}')

        # Used memory = total - free - buff/cache
        memUsed_b=$(( memTotal_b - memFree_b - memBuffer_b ))
        memUsed_m=$(( memTotal_m - memFree_m - memBuffer_m ))

        memUsedPrc=$(( (memUsed_b * 100) / memTotal_b ))

        # Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
        if [ "$memUsedPrc" -ge "$4" ]; then
                echo "Memory: CRITICAL Total: $memTotal_m MB - Used: $memUsed_m MB - $memUsedPrc% used!|TOTAL=$memTotal_b;;;; USED=$memUsed_b;;;; BUFFER=$memBuffer_b;;;;"
                exit 2
        elif [ "$memUsedPrc" -ge "$2" ]; then
                echo "Memory: WARNING Total: $memTotal_m MB - Used: $memUsed_m MB - $memUsedPrc% used!|TOTAL=$memTotal_b;;;; USED=$memUsed_b;;;; BUFFER=$memBuffer_b;;;;"
                exit 1
        else
                echo "Memory: OK Total: $memTotal_m MB - Used: $memUsed_m MB - $memUsedPrc% used|TOTAL=$memTotal_b;;;; USED=$memUsed_b;;;; BUFFER=$memBuffer_b;;;;"
                exit 0
        fi

else
        echo "check_mem v1.1"
        echo ""
        echo "Usage:"
        echo "check_mem.sh -w <warnlevel> -c <critlevel>"
        echo ""
        echo "warnlevel and critlevel is percentage value without %"
        echo ""
        echo "Copyright (C) 2012 Lukasz Gogolin (lukasz.gogolin@gmail.com)"
        # Wrong or missing arguments: report UNKNOWN to Nagios
        exit 3
fi
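
To illustrate the calculation the script performs, here is a small sketch that reads free-style output from standard input and prints the used-memory percentage. It assumes the RHEL7 column layout (total, used, free, shared, buff/cache, available). The function name mem_used_percent and the sample numbers are made up for the example.

```shell
#!/bin/bash
# Hypothetical helper: read `free -b` output on stdin and print the
# percentage of memory in use, counting buff/cache as available memory.
mem_used_percent() {
    # $2 = total, $4 = free, $6 = buff/cache on the "Mem:" line
    awk '/^Mem:/ { printf "%d\n", ($2 - $4 - $6) * 100 / $2 }'
}

# Made-up sample in the RHEL7 layout: 6 GB total, 1 GB free, 2 GB buff/cache.
sample='              total        used        free      shared  buff/cache   available
Mem:     6000000000  3000000000  1000000000   100000000  2000000000  2700000000
Swap:    2000000000           0  2000000000'

printf '%s\n' "$sample" | mem_used_percent
```

For the sample this prints 50, because (6 GB - 1 GB - 2 GB) / 6 GB is one half. On a live system the same function could be fed with free -b | mem_used_percent.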

Deployment

First, install the necessary packages:

yum install nrpe nagios-plugins-nrpe nagios-plugins-swap nagios-plugins-disk nagios-plugins-load nagios-plugins-procs nagios-plugins-ntp nagios-plugins-http nagios-plugins-pgsql

If you use SELinux, permit the check_disk plugin access to /sys/kernel/… by making its domain permissive:

yum install policycoreutils-python
semanage permissive -a nagios_checkdisk_plugin_t

Edit the /etc/nagios/nrpe.cfg file and add your monitoring server address to the allowed_hosts directive:

allowed_hosts=127.0.0.1,IPofMonitoringServer

Create the configuration of system checks in the file /etc/nrpe.d/checks.cfg. Fill in YOUR_NTP_SERVER and IDM_SERVICE_DOMAIN_NAME accordingly. MONITORING_USER and MONITORING_USER_PASSWORD are the credentials of a user who can log in to the PostgreSQL database; create a separate user just for this purpose.

command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 50% -c 10%
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 10% -c 5%
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 4,3.5,3 -c 6,5.5,5
command[check_mem]=/usr/lib64/nagios/plugins/check_mem -w 90 -c 95
command[check_procs]=/usr/lib64/nagios/plugins/check_procs -w 300 -c 500
command[check_zombies]=/usr/lib64/nagios/plugins/check_procs -w 1 -c 5 -s Z
command[check_time]=/usr/lib64/nagios/plugins/check_ntp_time -H YOUR_NTP_SERVER -w60 -c300
command[check_idm]=/usr/lib64/nagios/plugins/check_http -H 127.0.0.1 -p 8080 -u '/idm/api/v1/status'
command[check_httpd]=/usr/lib64/nagios/plugins/check_http -H IDM_SERVICE_DOMAIN_NAME -S -p443 -w1
command[check_httpd_cert]=/usr/lib64/nagios/plugins/check_http -H IDM_SERVICE_DOMAIN_NAME -S -p443 -C30,7
command[check_postgres]=/usr/lib64/nagios/plugins/check_pgsql -H 127.0.0.1 -P 5432 -d template1 -l MONITORING_USER -p MONITORING_USER_PASSWORD -w0.5 -c1
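
Before starting the daemon, it can be worth confirming that every command in the file points at a plugin that is actually installed. The following is a rough sketch only: the function name check_nrpe_commands is hypothetical, and it understands only lines of the form command[NAME]=/path/to/plugin ARGS.

```shell
#!/bin/bash
# Hypothetical helper: list the commands defined in an NRPE config file
# and report whether each referenced plugin binary exists and is executable.
# Returns 1 if at least one plugin is missing, 0 otherwise.
check_nrpe_commands() {
    local cfg=$1 missing=0 line name binary
    while IFS= read -r line; do
        case "$line" in
            command\[*\]=*) ;;      # a command definition, handled below
            *) continue ;;          # comments, blank lines, other directives
        esac
        name=${line#command\[}      # strip the leading "command["
        name=${name%%\]*}           # keep the text before the first "]"
        binary=${line#*=}           # everything after the first "="
        binary=${binary%% *}        # first word is the plugin path
        if [ -x "$binary" ]; then
            echo "ok      $name -> $binary"
        else
            echo "MISSING $name -> $binary"
            missing=1
        fi
    done < "$cfg"
    return "$missing"
}

# Only run when the file from this guide is present.
if [ -r /etc/nrpe.d/checks.cfg ]; then
    check_nrpe_commands /etc/nrpe.d/checks.cfg || echo "some plugins are missing"
fi
```

A MISSING line for check_mem at this point is expected if the script has not been copied into the plugin directory yet.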

Add the check_mem script to the /usr/lib64/nagios/plugins/ directory and make it executable:

cp check_mem /usr/lib64/nagios/plugins/
chmod 755 /usr/lib64/nagios/plugins/check_mem

Create the MONITORING_USER in PostgreSQL. Generate a strong password for it, for example with pwgen.

create user monitoring password 'somepassword';

Start and enable the NRPE daemon:

systemctl start nrpe
systemctl enable nrpe

To test the probes, you can use the check_nrpe plugin:

/usr/lib64/nagios/plugins/check_nrpe -H 127.0.0.1 -b 127.0.0.1 -c check_swap
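
To exercise all probes in one go, the same call can be wrapped in a loop. This is a sketch under the assumption that the command names match those in /etc/nrpe.d/checks.cfg. The function name run_all_checks and the CHECK_NRPE variable are hypothetical conveniences, not part of NRPE.

```shell
#!/bin/bash
# Path of check_nrpe as used above; can be overridden from the environment.
CHECK_NRPE=${CHECK_NRPE:-/usr/lib64/nagios/plugins/check_nrpe}

# Command names defined in /etc/nrpe.d/checks.cfg.
CHECKS="check_swap check_disk check_load check_mem check_procs check_zombies
check_time check_idm check_httpd check_httpd_cert check_postgres"

# Hypothetical helper: query every check on the given host and return
# the highest exit code seen (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
run_all_checks() {
    local host=$1 worst=0 rc check
    for check in $CHECKS; do
        rc=0
        printf '%-18s ' "$check"
        "$CHECK_NRPE" -H "$host" -c "$check" || rc=$?
        if [ "$rc" -gt "$worst" ]; then worst=$rc; fi
    done
    return "$worst"
}

# Run against the local NRPE daemon only when check_nrpe is installed.
if [ -x "$CHECK_NRPE" ]; then
    run_all_checks 127.0.0.1 || echo "at least one check is not OK"
fi
```

Each output line starts with the command name followed by whatever the probe printed, so a failing probe is easy to spot.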

Nagios server configuration

This is a sample configuration for the Nagios server. It is meant as inspiration; feel free to adapt it to your Nagios deployment.

Configure the check_nrpe command (you probably already have this in your Nagios configuration):

define command{
        command_name check_nrpe
        command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

Define CzechIdM server host:

define host {
        use                             linux-server
        host_name                       czechidm_server
        alias                           idmserver.example.com - CzechIdM server
        address                         1.2.3.4
        check_period                    24x7
        # we expect interval_length=60 as is the default, so 1440*60s = 1 day
        notification_interval           1440
        notification_period             24x7
}

Define checks:

define service {
        use                             generic-service
        host_name                       czechidm_server
        service_description             SWAP
        check_command                   check_nrpe!check_swap
        # we expect interval_length=60 as is the default, so 5*60s = 5 minutes
        check_interval                  5
        # we expect interval_length=60 as is the default, so 1440*60s = 1 day
        notification_interval           1440
        contacts                        user1,user2
        contact_groups                  admins1,admins2
}

... and similarly the other checks ...
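
Since the remaining service definitions differ only in a few fields, one option is to generate them with a short script. The sketch below assumes the host, contacts and template names from the sample above. The service descriptions (DISK, LOAD, ...) and the function name service_definition are hypothetical. The check intervals follow the table of monitored parameters (5 minutes by default, 60 for system time, 1440 for certificate expiration).

```shell
#!/bin/bash
# Hypothetical helper: print one Nagios service definition.
# Arguments: description, NRPE command name, check interval in minutes (default 5).
service_definition() {
    local description=$1 nrpe_command=$2 interval=${3:-5}
    cat <<EOF
define service {
        use                             generic-service
        host_name                       czechidm_server
        service_description             $description
        check_command                   check_nrpe!$nrpe_command
        check_interval                  $interval
        notification_interval           1440
        contacts                        user1,user2
        contact_groups                  admins1,admins2
}
EOF
}

# One line per check: description, NRPE command, optional interval.
while read -r description nrpe_command interval; do
    service_definition "$description" "$nrpe_command" "$interval"
    echo
done <<'LIST'
DISK check_disk
LOAD check_load
MEMORY check_mem
PROCS check_procs
ZOMBIES check_zombies
TIME check_time 60
IDM check_idm
HTTPD check_httpd
HTTPD_CERT check_httpd_cert 1440
POSTGRES check_postgres
LIST
```

The output can be redirected into a file in the Nagios object configuration directory and reviewed before reloading Nagios.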
