So I was having problems with Nagios 2.0 giving up on checking services and not re-scheduling the checks for times in the future (and looking at NagiosExchange, a few other people seem to be having the same problem). But now I have it whipped. I think. Well, it has stabilised after some minor tweaking on Wednesday afternoon, and again on Friday.
Previously, it would be nice and quick, with low latencies for about a day, then the latency would rocket to 100 seconds until I did a reload. Now it appears to be staying around 2-6 seconds, but I'm keeping an eye out for the value creeping slowly upwards (which I think might be happening, hard to tell at the moment).
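One rough way to tell whether the latency really is creeping upwards is to fit a straight line through a handful of latency readings and look at the slope. The sketch below assumes you have already collected (hours since reload, latency in seconds) pairs by whatever means you like; the function name and the sample numbers are made up for illustration and are not measurements from this machine.

```python
def latency_slope(samples):
    """Least-squares slope, in seconds of latency per hour.

    samples: list of (hours_since_reload, latency_seconds) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    num = sum((t - mean_t) * (l - mean_l) for t, l in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


# Illustrative readings only (hypothetical, not real data).
samples = [(0, 2.0), (6, 3.0), (12, 4.0), (18, 5.0)]
slope = latency_slope(samples)
print("latency trend: %.3f seconds per hour" % slope)
```

A slope that stays near zero over a few days suggests the latency is stable; a persistently positive slope suggests the slow creep is real.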
I've got 922 services at the moment (on a Dell 850, running Solaris 10, with Nagios 2.0 stable). I was seeing a reasonable number of orphaned checks building up, so I set the check_for_orphaned_services directive[0] to 1. I also reduced a couple of timeout values so that Nagios stopped wasting time on checks which were bound to fail:-
service_check_timeout=30 [1]
host_check_timeout=30 [2]
event_handler_timeout=30 [3]
notification_timeout=60 [4] [6]
On top of that, I made sure that Nagios wasn't wasting time with status information quite so much:-
aggregate_status_updates=1 [5]
Given that the load on the machine doesn't appear to go over 0.50, I've now allowed unlimited concurrent service checks (max_concurrent_checks=0, up from 400), but that appears to make no difference at all. And I left the reaper frequency (service_reaper_frequency) at 10 seconds.
Now the checks are being re-scheduled for times in the future, and the latencies have stopped running away quite so dramatically.
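If you want to check your own main config file against the values above, something like the following would do. It is only a sketch: the function names are invented here, the parser just handles plain key=value lines and # comments, and the expected values are simply the ones quoted in this post rather than recommendations.

```python
# Values taken from this post; adjust to taste.
EXPECTED = {
    "check_for_orphaned_services": "1",
    "service_check_timeout": "30",
    "host_check_timeout": "30",
    "event_handler_timeout": "30",
    "notification_timeout": "60",
    "aggregate_status_updates": "1",
}


def parse_main_config(text):
    """Return a dict of key=value settings, skipping blanks and # comments.

    Repeated keys (such as cfg_file) keep only the last value, which is
    fine for the single-valued directives checked here.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def find_mismatches(settings, expected=EXPECTED):
    """Map each differing directive to (found value or None, wanted value)."""
    return {
        key: (settings.get(key), want)
        for key, want in expected.items()
        if settings.get(key) != want
    }


# Hypothetical config text standing in for the contents of the real file.
sample = """
# timeouts
service_check_timeout=60
host_check_timeout=30
aggregate_status_updates=1
"""
for key, (found, want) in sorted(find_mismatches(parse_main_config(sample)).items()):
    print("%s: found %s, wanted %s" % (key, found, want))
```

To run it against a real installation, read the file contents into a string and pass that to parse_main_config instead of the sample text.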
[0] - check_for_orphaned_services
Format: check_for_orphaned_services=<0/1>
Example: check_for_orphaned_services=0
This option allows you to enable or disable checks for orphaned service checks. Orphaned service checks are checks which have been executed and have been removed from the event queue, but have not had any results reported in a long time. Since no results have come back in for the service, it is not rescheduled in the event queue. This can cause service checks to stop being executed. Normally it is very rare for this to happen - it might happen if an external user or process killed off the process that was being used to execute a service check. If this option is enabled and Nagios finds that results for a particular service check have not come back, it will log an error message and reschedule the service check. If you start seeing service checks that never seem to get rescheduled, enable this option and see if you notice any log messages about orphaned services.
* 0 = Don't check for orphaned service checks (default)
* 1 = Check for orphaned service checks
[1] - service_check_timeout
Format: service_check_timeout=<seconds>
Example: service_check_timeout=60 (default)
This is the maximum number of seconds that Nagios will allow service checks to run. If checks exceed this limit, they are killed and a CRITICAL state is returned. A timeout error will also be logged.
There is often widespread confusion as to what this option really does. It is meant to be used as a last ditch mechanism to kill off plugins which are misbehaving and not exiting in a timely manner. It should be set to something high (like 60 seconds or more), so that each service check normally finishes executing within this time limit. If a service check runs longer than this limit, Nagios will kill it off thinking it is a runaway process. (This applies to all of these timeout directives.)
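To make the "last ditch" idea concrete, here is a small sketch of a wrapper that runs a command, kills it if it overruns, and reports CRITICAL. It is not how Nagios itself does it; run_check is a made-up name, and the sleeping command just stands in for a misbehaving plugin.

```python
import subprocess
import sys

CRITICAL = 2  # conventional plugin exit code for a CRITICAL state


def run_check(argv, timeout_seconds):
    """Run a check command; return (exit_code, output).

    If the command overruns, subprocess.run kills it and raises
    TimeoutExpired, which is reported here as CRITICAL.
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "check timed out after %s seconds" % timeout_seconds
    return done.returncode, done.stdout.decode(errors="replace").strip()


# A stand-in "plugin" that hangs for 5 seconds, given a 1 second limit.
slow_plugin = [sys.executable, "-c", "import time; time.sleep(5)"]
code, output = run_check(slow_plugin, 1)
print(code, output)
```

The point of the limit is the same as in the quoted text: a well-behaved check should finish long before it, so the kill only ever fires on runaways.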
[2] - host_check_timeout
Format: host_check_timeout=<seconds>
Example: host_check_timeout=60 (default)
This is the maximum number of seconds that Nagios will allow host checks to run. If checks exceed this limit, they are killed, a CRITICAL state is returned, and the host is assumed to be DOWN. A timeout error will also be logged.
[3] - event_handler_timeout
Format: event_handler_timeout=<seconds>
Example: event_handler_timeout=60 (default)
This is the maximum number of seconds that Nagios will allow event handlers to be run. If an event handler exceeds this time limit it will be killed and a warning will be logged.
[4] - notification_timeout
Format: notification_timeout=<seconds>
Example: notification_timeout=60 (default)
This is the maximum number of seconds that Nagios will allow notification commands to be run. If a notification command exceeds this time limit it will be killed and a warning will be logged.
[5] - aggregate_status_updates
Format: aggregate_status_updates=<0/1>
Example: aggregate_status_updates=1
This option determines whether or not Nagios will aggregate updates of host, service, and program status data. If you do not enable this option, status data is updated every time a host or service check occurs. This can result in high CPU loads and file I/O if you are monitoring a lot of services. If you want Nagios to only update status data (in the status file) every few seconds (as determined by the status_update_interval option), enable this option. If you want immediate updates, disable it. I would highly recommend using aggregated updates (even at short intervals) unless you have good reason not to. Values are as follows:
* 0 = Disable aggregated updates
* 1 = Enable aggregated updates (default)
[6] - You really don't want notifications to time out, because you really do want to be notified of Bad Things happening, so I've left this at the default 60 seconds.
