imfile

Recipe: Apache Logs + rsyslog (parsing) + Elasticsearch

Original post: Recipe: Apache Logs + rsyslog (parsing) + Elasticsearch by @Sematext

This recipe is about tailing Apache HTTPD logs with rsyslog, parsing them into structured JSON documents, and forwarding them to Elasticsearch (or a log analytics SaaS, like Logsene, which exposes the Elasticsearch API). Having them indexed in a structured way will allow you to do better analytics with tools like Kibana:

Kibana_screenshot

We’ll also cover pushing logs coming from the syslog socket and kernel, and how to buffer all of them properly. So this is quite a complete recipe for your centralized logging needs.

Getting the ingredients

Even though most distros already have rsyslog installed, it’s highly recommended to get the latest stable from the rsyslog repositories. The packages you’ll need are:

With the ingredients in place, let’s start cooking a configuration. The configuration needs to do the following:

  • load the required modules
  • configure inputs: tailing Apache logs and system logs
  • configure the main queue to buffer your messages. This is also the place to define the number of worker threads and batch sizes (which will also be Elasticsearch bulk sizes)
  • parse common Apache logs into JSON
  • define a template where you’d specify how JSON messages would look like. You’d use this template to send logs to Logsene/Elasticsearch via the Elasticsearch output

Loading modules

Here, we’ll need imfile to tail files, mmnormalize to parse them, and omelasticsearch to send them. If you want to tail the system logs, you’d also need to include imuxsock and imklog (for kernel logs).

# system logs
module(load="imuxsock")
module(load="imklog")
# file
module(load="imfile")
# parser
module(load="mmnormalize")
# sender
module(load="omelasticsearch")

Configure inputs

For system logs, you typically don’t need any special configuration (unless you want to listen to a non-default Unix Socket). For Apache logs, you’d point to the file(s) you want to monitor. You can use wildcards for file names as well. You also need to specify a syslog tag for each input. You can use this tag later for filtering.

input(type="imfile"
      File="/var/log/apache*.log"
      Tag="apache:"
)

NOTE: By default, rsyslog will not poll for file changes every N seconds. Instead, it will rely on the kernel (via inotify) to poke it when files get changed. This makes the process quite realtime and scales well, especially if you have many files changing rarely. Inotify is also less prone to bugs when it comes to file rotation and other events that would otherwise happen between two “polls”. You can still use the legacy mode=”polling” by specifying it in imfile’s module parameters.

Queue and workers

By default, all incoming messages go into a main queue. You can also separate flows (e.g. files and system logs) by using different rulesets but let’s keep it simple for now.

For tailing files, this kind of queue would work well:

main_queue(
  queue.workerThreads="4"
  queue.dequeueBatchSize="1000"
  queue.size="10000"
)

This would be a small in-memory queue of 10K messages, which works well if Elasticsearch goes down, because the data is still in the file and rsyslog can stop tailing when the queue becomes full, and then resume tailing. 4 worker threads will pick batches of up to 1000 messages from the queue, parse them (see below) and send the resulting JSONs to Elasticsearch.

If you need a larger queue (e.g. if you have lots of system logs and want to make sure they’re not lost), I would recommend using a disk-assisted memory queue, that will spill to disk whenever it uses too much memory:

main_queue(
  queue.workerThreads="4"
  queue.dequeueBatchSize="1000"
  queue.highWatermark="500000"    # max no. of events to hold in memory
  queue.lowWatermark="200000"     # use memory queue again, when it's back to this level
  queue.spoolDirectory="/var/run/rsyslog/queues"  # where to write on disk
  queue.fileName="stats_ruleset"
  queue.maxDiskSpace="5g"        # it will stop at this much disk space
  queue.size="5000000"           # or this many messages
  queue.saveOnShutdown="on"      # save memory queue contents to disk when rsyslog is exiting
)

Parsing with mmnormalize

The message normalization module uses liblognorm to do the parsing. So in the configuration you’d simply point rsyslog to the liblognorm rulebase:

action(type="mmnormalize"
  rulebase="/opt/rsyslog/apache.rb"
)

where apache.rb will contain rules for parsing apache logs, that can look like this:

version=2

rule=:%clientip:word% %ident:word% %auth:word% [%timestamp:char-to:]%] "%verb:word% %request:word% HTTP/%httpversion:float%" %response:number% %bytes:number% "%referrer:char-to:"%" "%agent:char-to:"%"%blob:rest%

Where version=2 indicates that rsyslog should use liblognorm’s v2 engine (which is was introduced in rsyslog 8.13) and then you have the actual rule for parsing logs. You can find more details about configuring those rules in the liblognorm documentation.

Besides parsing Apache logs, creating new rules typically requires a lot of trial and error. To check your rules without messing with rsyslog, you can use the lognormalizer binary like:

head -1 /path/to/log.file | /usr/lib/lognorm/lognormalizer -r /path/to/rulebase.rb -e json

NOTE: If you’re used to Logstash’s grok, this kind of parsing rules will look very familiar. However, things are quite different under the hood. Grok is a nice abstraction over regular expressions, while liblognorm builds parse trees out of specialized parsers. This makes liblognorm much faster, especially as you add more rules. In fact, it scales so well, that for all practical purposes, performance depends on the length of the log lines and not on the number of rules. This post explains the theory behind this assuption, and this is actually proven by various tests. The downside is that you’ll lose some of the flexibility offered by regular expressions. You can still use regular expressions with liblognorm (you’d need to set allow_regex to on when loading mmnormalize) but then you’d lose a lot of the benefits that come with the parse tree approach.

Template for parsed logs

Since we want to push logs to Elasticsearch as JSON, we’d need to use templates to format them. For Apache logs, by the time parsing ended, you already have all the relevant fields in the $!all-json variable, that you’ll use as a template:

template(name="all-json" type="list"){
  property(name="$!all-json")
}

Template for time-based indices

For the logging use-case, you’d probably want to use time-based indices (e.g. if you keep your logs for 7 days, you can have one index per day). Such a design will give your cluster a lot more capacity due to the way Elasticsearch merges data in the background (you can learn the details in our presentations at GeeCON and Berlin Buzzwords).

To make rsyslog use daily or other time-based indices, you need to define a template that builds an index name off the timestamp of each log. This is one that names them logstash-YYYY.MM.DD, like Logstash does by default:

template(name="logstash-index"
  type="list") {
    constant(value="logstash-")
    property(name="timereported" dateFormat="rfc3339" position.from="1" position.to="4")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="6" position.to="7")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="9" position.to="10")
}

And then you’d use this template in the Elasticsearch output:

action(type="omelasticsearch"
  template="all-json"
  dynSearchIndex="on"
  searchIndex="logstash-index"
  searchType="apache"
  server="MY-ELASTICSEARCH-SERVER"
  bulkmode="on"
  action.resumeretrycount="-1"
)

Putting both Apache and system logs together

If you use the same rsyslog to parse system logs, mmnormalize won’t parse them (because they don’t match Apache’s common log format). In this case, you’ll need to pick the rsyslog properties you want and build an additional JSON template:

template(name="plain-syslog"
  type="list") {
    constant(value="{")
      constant(value="\"timestamp\":\"")     property(name="timereported" dateFormat="rfc3339")
      constant(value="\",\"host\":\"")        property(name="hostname")
      constant(value="\",\"severity\":\"")    property(name="syslogseverity-text")
      constant(value="\",\"facility\":\"")    property(name="syslogfacility-text")
      constant(value="\",\"tag\":\"")   property(name="syslogtag" format="json")
      constant(value="\",\"message\":\"")    property(name="msg" format="json")
    constant(value="\"}")
}

Then you can make rsyslog decide: if a log was parsed successfully, use the all-json template. If not, use the plain-syslog one:

if $parsesuccess == "OK" then {
 action(type="omelasticsearch"
  template="all-json"
  ...
 )
} else {
 action(type="omelasticsearch"
  template="plain-syslog"
  ...
 )
}

And that’s it! Now you can restart rsyslog and get both your system and Apache logs parsed, buffered and indexed into Elasticsearch. If you’re a Logsene user, the recipe is a bit simpler: you’d follow the same steps, except that you’ll skip the logstash-index template (Logsene does that for you) and your Elasticsearch actions will look like this:

action(type="omelasticsearch"
  template="all-json or plain-syslog"
  searchIndex="LOGSENE-APP-TOKEN-GOES-HERE"
  searchType="apache"
  server="logsene-receiver.sematext.com"
  serverport="80"
  bulkmode="on"
  action.resumeretrycount="-1"
)

Coupling with Logstash via Redis

Original post: Recipe: rsyslog + Redis + Logstash by @Sematext

OK, so you want to hook up rsyslog with Logstash. If you don’t remember why you want that, let me give you a few hints:

  • Logstash can do lots of things, it’s easy to set up but tends to be too heavy to put on every server
  • you have Redis already installed so you can use it as a centralized queue. If you don’t have it yet, it’s worth a try because it’s very light for this kind of workload.
  • you have rsyslog on pretty much all your Linux boxes. It’s light and surprisingly capable, so why not make it push to Redis in order to hook it up with Logstash?

In this post, you’ll see how to install and configure the needed components so you can send your local syslog (or tail files with rsyslog) to be buffered in Redis so you can use Logstash to ship them to Elasticsearch, a logging SaaS like Logsene (which exposes the Elasticsearch API for both indexing and searching) so you can search and analyze them with Kibana:

Kibana_search

Continue reading “Coupling with Logstash via Redis”

rsyslog 8.11.0 (v8-stable) released

We have released rsyslog 8.11.0.

This release now provides a new signature provider for Keyless Signature Infrastructure (KSI) as well as quite a few fixes for imfile, omkafka, the build system and others.
ChangeLog:

http://www.rsyslog.com/changelog-for-8-11-0-v8-stable/

Download:

http://www.rsyslog.com/downloads/download-v8-stable/

As always, feedback is appreciated.

Best regards,
Florian Riedl

Changelog for 8.11.0 (v8-stable)

Version 8.11.0 [v8-stable] 2015-06-30

  • new signature provider for Keyless Signature Infrastructure (KSI) added
  • build system: re-enable use of “make distcheck”
  • bugfix imfile: regex multiline mode ignored escapeLF option
    Thanks to Ciprian Hacman for reporting the problem
    closes https://github.com/rsyslog/rsyslog/issues/370
  • bugfix omkafka: fixed several concurrency issues, most of them related to dynamic topics.
    Thanks to Janmejay Singh for the patch.
  • bugfix: execonlywhenpreviousissuspended did not work correctly
    This especially caused problems when an action with this attribute was configured with an action queue.
  • bugfix core engine: ensured global variable atomicity
    This could lead to problems in RainerScript, as well as probably in other areas where global variables are used inside rsyslog. I wouldn’t outrule it could lead to segfaults.
    Thanks to Janmejay Singh for the patch.
  • bugfix imfile: segfault when using startmsg.regex because of empty log line
    closes https://github.com/rsyslog/rsyslog/issues/357
    Thanks to Ciprian Hacman for the patch.
  • bugfix: build problem on Solaris
    Thanks to Dagobert Michelsen for reporting this and getting us up to
    speed on the openCWS build farm.
  • bugfix: build system strndup was used even if not present now added compatibility function. This came up on Solaris builds.
    Thanks to Dagobert Michelsen for reporting the problem.
    closes https://github.com/rsyslog/rsyslog/issues/347

 

rsyslog 8.10.0 (v8-stable) released

We have released rsyslog 8.10.0.

This provides a number of new features and fixes in several modules, like imfile, zmq and others. It also adds a new contributed module omhttpfs for writing to HDFS via HTTP.
ChangeLog:

http://www.rsyslog.com/changelog-for-8-10-0-v8-stable/

Download:

http://www.rsyslog.com/downloads/download-v8-stable/

As always, feedback is appreciated.

Best regards,
Florian Riedl

Changelog for 8.10.0 (v8-stable)

Version 8.10.0 [v8-stable] 2015-05-19

  • imfile: add capability to process multi-line messages based on regex input parameter “endmsg.regex” was added for that purpose. The new mode provides much more power in processing different multiline-formats.
  • pmrfc3164: add new parameters
    • “detect.yearAfterTimestamp”
      This supports timestamps as generated e.g. by some Aruba Networks equipment.
    • “permit.squareBracesInHostname”
      Permits to use “hostnames” in the form of “[127.0.0.1]”; also seen in Aruba Networks equipment, but we strongly assume this can also happen in other cases, especially with IPv6.
  • supplementary groups are now set when dropping privileges
    closes https://github.com/rsyslog/rsyslog/issues/296
    Thanks to Zach Lisinski for the patch.
  • imfile: added brace glob expansion to wildcard
    Thanks to Zach Lisinski for the patch.
  • zmq: add the ability for zeromq input and outputs to advertise their presence on UDP via the zbeacon API.
    Thanks to Brian Knox for the contribution.
  • added omhttpfs: contributed module for writing to HDFS via HTTP
    Thanks to sskaje for the contribution.
  • Configure option “–disable-debug-symbols” added which is disabled per default. If you set the new option, configure won’t set the appropriate compiler flag to generate debug symbols anymore.
  • When building from git source we now require rst2man and yacc (or a replacement like bison).
    That isn’t any new requirement, we only added missing configure checks.
  • Configure option “–enable-generate-man-pages” is now disabled for non git source builds per default but enforced when building from git source.
  • mmpstrucdata: some code cleanup
    removed lots of early development debug outputs
  • bugfix imuxsock: fix a crash when setting a hostname
    Setting a hostname via the legacy directive would lead to a crash during shutdown caused by a double-free.
    Thanks to Tomas Heinrich for the patch.
  • bugfix: memory leak in mmpstrucdata
    Thanks to Grégoire Seux for reporting this issue.
    closes https://github.com/rsyslog/rsyslog/issues/310
  • bugfix (minor): default action name: assigned number was one off
    see also https://github.com/rsyslog/rsyslog/pull/340
    Thanks to Tomas Heinrich for the patch.
  • bugfix: memory leak in imfile
    A small leak happened each time a new file was monitored based on a wildcard. Depending on the rate of file creation, this could result in a serious memory leak.

rsyslog 8.7.0 (v8-stable) released

We have released rsyslog 8.7.0.

Version 8.7.0 contains various improvements and additions to a wide array of modules, like imfile, imptcp, improvements to RainerScript and mmnormalize (thanks to Singh Janmejay) and a couple of other improvements. But, the biggest addition is the new omkafka module that now allows direct writing to Apache Kafka.

This release also contains important bug fixes.

This is a recommended upgrade for all users.

 

ChangeLog:

http://www.rsyslog.com/changelog-for-8-7-0-v8-stable/

Download:

http://www.rsyslog.com/downloads/download-v8-stable/

As always, feedback is appreciated.

Best regards,
Florian Riedl

Changelog for 8.7.0 (v8-stable)

Version 8.7.0 [v8-stable] 2015-01-13

  • add message metadata “system” to msg object
    this permits to store metadata alongside the message
  • imfile: add support for “filename” metadata
    this is useful in cases where wildcards are used
  • imptcp: make stats counter names consistent with what imudp, imtcp uses
  • added new module “omkafka” to support writing to Apache Kafka
  • omfwd: add new “udp.senddelay” parameter
  • mmnormalize enhancements
    Thanks to Janmejay Singh for the patch.
  • RainerScript “foreach” iterator and array reading support
    Thanks to Janmejay Singh for the patch.
  • now requires liblognorm >= 1.0.2
  • add support for systemd >= 209 library names
  • BSD “ntp” facility (value 12) is now also supported in filter
    Thanks to Douglas K. Rand of Iteris, Inc. for the patch.
    Note: this patch was released under ASL 2.0 (see email-conversation).
  • bugfix: global(localHostName=”xxx”) was not respected in all modules
  • bugfix: emit correct error message on config-file-not-found
    closes https://github.com/rsyslog/rsyslog/issues/173
  • bugfix: impstats emitted invalid JSON format (if JSON was selected)
  • bugfix: (small) memory leak in omfile’s outchannel code
    Thanks to Koral Ilgun for reporting this issue.
  • bugfix: imuxsock did not deactivate some code not supported by platform
    Among potential other problemns, this caused build failure under Solaris.
    Note that this build problem just made a broader problem appear that so
    far always existed but was not visible.
    closes https://github.com/rsyslog/rsyslog/issues/185

rsyslog 8.5.0 (v8-devel) released

We have just released 8.5.0 of the v8-devel branch.

This begins the next v8 devel series. Most importantly, it contains a greatly refactored imfile, which now supports wildcards inside filenames. There are also some other improvements, as well as some bugfixes that are not yet included in the stable versions (this will happen soon with the next release).

For more details about using wildcards in imfile, please take a look at this presentation:

Using Wildcards with rsyslog’s File Monitor

ChangeLog:

http://www.rsyslog.com/changelog-for-8-5-0-v8-devel/

Download:

http://www.rsyslog.com/download-v8-devel/

As always, feedback is appreciated.

Best regards,
Florian Riedl

Changelog for 8.5.0 (v8-devel)

Version 8.5.0 [v8-stable] 2014-10-24

  • imfile greatly refactored and support for wildcards added
  • PRI-handling code refactored for more clarity and robustness
  • ommail: add support for RainerScript config system [action() object]
    This finally adds support for the new config style. Also, we now permit to set a constant subject text without the need to create a template for it.
  • refactored the auto-backgrounding method
    The code is now more robust and also offers possibilities for enhanced error reporting in the future. This is also assumed to fix some races where a system startup script hang due to “hanging” rsyslogd.
  • make gntls tcp syslog driver emit more error messages
    Messages previously emitted only to the debug log are now emitted as syslog error messages. It has shown that they contain information  helpful to the user for troubleshooting config issues. Note that this change is a bit experimental, as we are not sure if there are situations where large amounts of error messages may be emitted.
  • bugfix: imfile did not complain if configured file did not exist
    closes https://github.com/rsyslog/rsyslog/issues/137
  • bugfix: build failure on systems which don’t have json_tokener_errors
    Older versions of json-c need to use a different API (which don’t exists on newer versions, unfortunately…)
    Thanks to Thomas D. for reporting this problem.
  • imgssapi: log remote peer address in some error messages
    Thanks to Bodik for the patch.
Scroll to top