statistic

Monitoring rsyslog’s impstats with Kibana and SPM

Original post: Monitoring rsyslog with Kibana and SPM by @Sematext

A while ago we published this post where we explained how you can get stats about rsyslog, such as the number of messages enqueued, the number of output errors and so on. The point was to send them to Elasticsearch (or Logsene, our logging SaaS, which exposes the Elasticsearch API) in order to analyze them.

This is part 2 of that story, where we share how we process these stats in production. We’ll cover:

  • an updated config, working with Elasticsearch 2.x
  • what Kibana dashboards we have in Logsene to get an overview of what rsyslog is doing
  • how we send some of these metrics to SPM as well, in order to set up alerts on their values: both threshold-based alerts and anomaly detection

Continue reading “Monitoring rsyslog’s impstats with Kibana and SPM”

rsyslog statistic counter plugin imuxsocks

Plugin – imuxsock

This plugin maintains a global statistics with the following properties:

  • submitted – total number of messages submitted for processing since startup
  • ratelimit.discarded – number of messages discarded due to rate limiting
  • ratelimit.numratelimiters – number of currently active rate limiters (small data structures used for the rate limiting logic)

Back to statistics counter overview

rsyslog statistic counter plugin imudp

Plugin – imudp

This plugin maintains statistics for each listener and for each worker thread.

The listener statistic is named starting with “imudp”, followed followed by the listener IP, a colon and port in parenthesis. For example, the counter for a listener on port 514 (on all IPs) with no set name is called “imudp(*:514)”.

If an “inputname” is defined for a listener, that inputname is used instead of “imudp” as statistic name. For example, if the inputname is set to “myudpinut”, that corresponding statistic name in above case would be “myudpinput(*:514)”. This has been introduced in 7.5.3.

The following properties are maintained for each listener:

  • submitted – total number of messages submitted for processing since startup

The worker thread (in short: worker) statistic is named “imudp(wX)” where “X” is the worker thread ID, which is an monotonically increasing integer starting at 0. This means the first worker will have the name “imudp(w0”), the second “imudp(w1)” and so on. Note that workers are all equal. It doesn’t really matter which worker processes which messages, so the actual worker ID is not of much concern. More interesting is to check how the load is spread between the worker. Also note that there is no fixed worker-to-listener relationship: all workers process messages from all listeners.

Note: worker thread statistics are available starting with rsyslog 7.5.5.

The following properties are maintained for each worker thread:

  • called.recvmmsg – number of recvmmsg() OS calls done
  • called.recvmsg – number of recvmsg() OS calls done
  • msgs.received – number of actual messages received

Note: usually either one of “called.recvmmsg” or “called.recvmsg” is non-zero. This is because these are alternative kernel interfaces and the one to be used is selected based on kernel capabilities. However, there are two counters, as the system call type used has important performance implications, and so we thought this information needs to be exposed.

On systems supporting recvmmsg, the quotient msgs.received / caled.recvmmsg tells the average number of messages that could be pulled from the kernel buffers with a single syscall.

Back to statistics counter overview

rsyslog statistic counter plugin imtcp

Plugin – imtcp

This plugin maintains statistics for each listener. The statistic is named after the given input name (or “imtcp” if none is configured), followed by the listener port in parenthesis. For example, the counter for a listener on port 514 with no set name is called “imtcp(514)”.

The following properties are maintained for each listener:

  • submitted – total number of messages submitted for processing since startup

Back to statistics counter overview

rsyslog statistic counter plugin imptcp

Plugin – imptcp

This plugin maintains statistics for each listener. The statistic is named “imtcp” , followed by the bound address, listener port and IP version in parenthesis. For example, the counter for a listener on port 514, bound to all interfaces and listening on IPv6 is called “imptcp(*/514/IPv6)”.

The following properties are maintained for each listener:

  • submitted – total number of messages submitted for processing since startup

Back to statistics counter overview

rsyslog statistic counter plugin impstats

Plugin – impstats (rsyslog 7.5.3+) The impstats plugin gathers some internal statistics. They have different names depending on the actual statistics. Obviously, they do not relate to the plugin itself but rather to a broader object – most notably the rsyslog process itself. The “resource-usage” counter maintains process statistics. They base on the getrusage() system call. The counters are named like getrusage returned data memebers. So for details, looking them up in “man getrusage” is highly recommended, especially as value may be different depending on the platform. A getrusage() call is done immediately before the counter is emitted. The follwowing individual counters are maintained:

  • utime – this is the user time in microseconds (thus the timeval structure combined)
  • stime – again, time given in microseconds
  • maxrss
  • minflt
  • majflt
  • inblock
  • oublock
  • nvcsw
  • nivcsw

Back to statistics counter overview

rsyslog statistic counter plugin omfile

Plugin – omfile (rsyslog 7.3.6+)

This plugin maintains statistics for each dynafile cache. Dynafile cache performance is critical for overall system performance, so reviewing these counters on a busy system (especially one experiencing performance problems) is advisable. The statistic is named “dynafile cache”, followed by the template name used for this dynafile action.

The following properties are maintained for each dynafile:

  • requests – total number of requests made to obtain a dynafile
  • level0 – requests for the current active file, so no real cache lookup needed to be done. These are extremely good.
  • missed – cache misses, where the required file did not reside in cache. Even with a perfect cache, there will be at least one miss per file. That happens when the file is being accessed for the first time and brought into cache. So “missed” will always be at least as large as the number of different files processed.
  • evicted – the number of times a file needed to be evicted from the cache as it ran out of space. These can simply happen when date-based files are used, and the previous date files are being removed from the cache as time progresses. It is better, though, to set an appropriate “closeTimeout” (counter described below), so that files are removed from the cache after they become no longer accessed. It is bad if active files need to be evicted from the cache. This is a very costly operation as an evict requires to close the file (thus a full flush, no matter of its buffer state) and a later access requires a re-open – and the eviction of another file, as the cache obviously has run out of free entries. If this happens frequently, it can severely affect performance. So a high eviction rate is a sign that the dynafile cache size should be increased. If it is already very high, it is recommended to re-think about the design of the file store, at least if the eviction process causes real performance problems.
  • maxused – the maximum number of cache entries ever used. This can be used to trim the cache down to a value that’s actually useful but does not waste resources. Note that when date-based files are used and rsyslog is run for an extended period of time, the cache gradually fills up to the max configured value as older files are migrated out of it. This will make “maxused” questionable after some time. Frequently enough purging the cache can prevent this (usually, once a day is sufficient).
  • closetimeouts –  available since 8.3.3 – tells how often a file was closed due to timeout settings (“closeTimeout” action parameter). These are cases where dynafiles or static files have been closed by rsyslog due to inactivity. Note that if no “closeTimeout” is specified for the action, this counter always is zero. A high or low number in itself doesn’t mean anything good or bad. It totally depends on the use case, so no general advise can be given.

Note that the dynafile caches are purged when a HUP is sent.

Back to statistics counter overview

rsyslog statistic counter plugin imrelp

Plugin – imrelp

This plugin maintains statistics for each listener. The statistic by default is named “imrelp” , followed by the listener port in parenthesis. For  example, the counter for a listener on port 514 is called “imprelp(514)”. If the input is given a name, that input name is used instead of “imrelp”. This counter is available starting rsyslog 7.5.1

The following properties are maintained for each listener:

  • submitted – total number of messages submitted for processing since startup

Back to statistics counter overview

rsyslog statistic counter Actions

Actions

  • processed – total number of messages processed by this action. This includes those messages that failed actual execution (so it is a total count of messages ever seen, but not necessarily successfully processed)
  • failed – total number of messages that failed during processing. These are actually lost if they have not been processed by some other action. Most importantly in a failover chain the messages are flagged as “failed” in the failing actions even though they are forwarded to the failover action (the failover action’s “processed” count should equal to failing actions “fail” count in this scenario)
  • suspended (7.5.8+) – total number of times this action suspended itself. Note that this counts the number of times the action transitioned from active to suspended state. The counter is no indication of how long the action was suspended or how often it was retried. This is intentional, as the counter as it currently is permits to tell how often the action ran into a failure condition.
  • suspended.duration (7.5.8+) – the total number of seconds this action was disabled. Note that the second count is incremented as soon as the action is suspended for another interval. As such, the time may be higher than it should be at the reporting point of time. If the pstats interval is shorter than the suspension timeout, the same suspended.duration value may be reported for successive pstats outputs. For a long-running system, this is considered a minimal difference. In general, note that this setting is not totally accurate, especially when running with multiple worker threads. In rsyslog v8, this is the total suspended time for all worker instances of this action.
  • resumed (7.5.8+) – total number of times this action resumed itself. A resumption occurs after the action has detected that a failure condition does no longer exist.

Back to statistics counter overview

rsyslog statistic counter plugin omelasticsearch

Plugin – omelasticsearch

This plugin maintains global statistics, which accumulate all action instances. The statistic is named “omelasticsearch”. Parameters are:

  • submitted – number of messages submitted for processing (with both success and error result)
  • fail.httprequests – the number of times a http request failed. Note that a single http request may be used to submit multiple messages, so this number may be (much) lower than fail.http.
  • fail.http – number of message failures due to connection like-problems (things like remote server down, broken link etc)
  • fail.es – number of failures due to elasticsearch error reply; Note that this counter does NOT count the number of failed messages but the number of times a failure occured (a potentially much smaller number). Counting messages would be quite performance-intense and is thus not done.

The fail.httprequests and fail.http counters reflect only failures that omelasticsearch detected. Once it detects problems, it (usually, depends on circumstances) tell the rsyslog core that it wants to be suspended until the situation clears (this is a requirement for rsyslog output modules). Once it is suspended, it does NOT receive any further messages. Depending on the user configuration, messages will be lost during this period. Those lost messages will NOT be counted by impstats (as it does not see them).

Note that some previous (pre 7.4.5) versions of this plugin had different counters. These were experimental and confusing. The only ones really used were “submits”, which were the number of successfully processed messages and “connfail” which were equivalent to “failed.http”.

 

Back to statistics counter overview

Scroll to top