How large is “large”? 1,000,000,000 + logs per day. Tens of thousands per second. TB or more of logs per day. It is a long but interesting road from the 100 events per second when you first started. Put the work and effort into it and in the process you will learn just how powerful the ELK really is (while learning to keep your ELK stack running…). Check out my other Elasticsearch|Logstash|Kibana installation posts as these use configs lifted directly from this high logging rate environment talked about below.
Spend time to learn about those examples and samples found on the internet
There is a lot of good and useful information on ELK on the internet, but there is also a large amount of inaccurate information. Don’t just copy and paste a config or enable a setting without actually understanding how it works or what it does. There are a lot of cases that enabling a setting might make things work better now by covering up a different issue, but will only turn into a much larger bottle neck later on when you reach higher message ingestion rates. Because you added that setting to “solve” an earlier performance problem it will take even longer to discover that it is now the cause. Kibana and Elasticsearch work pretty well with the default settings, be certain you are changing the correct setting to address an issue.
The same thing applies to sample Logstash configs for a given application; use them only as a starting point. A lot of the configs that are publicly available are written in environments with lower logging rates and not as much attention is put towards efficiency. They work, a lot of times are well organized and commented nicely, but are sloooow. The 50% overhead introduced by an un-anchored regex in a grok statement is hardly noticeable when you are ingesting 100 logs per second, a little extra CPU usage on one of your Logstash CPU cores. When you hit rates of 10,000 logs per second that 50% overhead due to an inefficient regex can eat up an entire 16 core CPU and throwing more hardware at it is getting expensive.
Spend time reading through the Elastic.co blog, there is a lot of good info hiding in there.
The default Elasticsearch settings are pretty good
There are a variety of things that will end up needing changed to fine tune to your environment, but the key word here is “fine tune”. Especially in the more recent 5.x versions Elasticsearch is pretty reliable straight out of the box and will run pretty high indexing and search rates without changing anything. If Elasticsearch is timing out or crashing the problem almost always lies elsewhere and the setting is only a way to bypass it. Too many small tweaks and you have a house of cards that fails in spectacular ways and hit very immovable performance walls. Before enabling any setting in Elasticsearch first understand what it does and *why* it is needed. Chances are it is only covering up the real problem.
Shard and replica settings are important but probably for not the reasons you think
Firstly the replica settings. Most logging environments have very high indexing rates but relatively low searching rates. Having many replica’s is useful in heavy searching environments since any shard including the replica can service the search request. Having 1 replica is usually enough unless it is of above average importance (and in that case, consider backing it up). For older indexes you would normally delete to reclaim space, consider changing the replica to 0 and hanging onto them for a little bit longer. It doesn’t matter if there is a failure and you lose a shard because that index was going to be deleted anyway.
Shard sizing serves two different purposes. More shards up to the number of data nodes make indexing faster, and fewer shards as long as the shard can still fit in the JVM make searching faster. Since ELK for logging is indexing heavy and searching light, size it for faster indexing. Size it for 1 shard per data node (a replica shard is still a shard) and have 1 data node as a spare. The spare node is required if there is a failure so that the cluster nodes do not have an unbalanced workload after a rebuild.
(number of nodes – 1) / (replicas + 1) = number of shards rounded down
If you have 14 data nodes and have replica set to 1, then (14-1) / (1+1) = 6.5 = 6 shards
When Elasticsearch indexes it writes to both the primary and replica shards at the same time, in the above example there is no performance gain by having 13 or 14 shards instead of 6 and it will only increase your segment count unnecessarily. I have successfully hit 100,000 logs per second indexed with 1 replica (200,000 per second on the back end) with 16 data nodes, shard=7, replica=1. Don’t overthink it.
Protect against old logs creating new indexes in Elasticsearch
Lets say you are retaining 180 days of logs and using the Elastic Beats plugins. One of those Beats plugins is misconfigured and on the first time it starts it will send every available system log to Logstash regardless of its age. If you are lucky, the system it is running on has less then 180 days worth of logs on it. In all other circumstances it can send many previous years worth of logs to Logstash. Not a big deal? It is a very big problem.
The default setting used in the Logstash configs is to roll the Elasticsearch indexes based on a certain time period, either once a day or once a week, etc. What happens when Logstash receives logs from every day in the past 3 years and you have Logstash set to create daily indexes? It will create 2.5 years worth of new indexes. Over 900 new indexes are created and index creation is a resource intensive task for Elasticsearch to perform multiplied 900 times over. Best case Elasticsearch will slow to a crawl for many hours, all other cases usually involve a data node crashing which then will trigger a index/shard rebuild and a cascading data node crash usually will follow.
Logstash has an age plugin for a already made solution, and it is a very easy ruby filter plugin to write. Have Logstash drop events that are older then your Elasticsearch retention period.
Don’t be afraid of the Logstash Ruby filter plugin
Ruby is pretty easy to learn even for a non programmer and there are a lot of special use cases that writing a custom ruby script is significantly more resource efficient then using the already created Logstash plugins (or in many cases several Logstash plugins used together to give you the same thing that a quick ruby script will do by itself).
Test your Logstash config for performance
There are a couple different ways to do this, either by feeding a Logstash service a log file and timing how long it takes to complete or by feeding a Logstash service running on resource capped hardware a constant stream of live logs and watching the logs/second it will process. The Logstash API can be queried and will tell you on a per plugin basis what is taking the most amount of time in the pipeline. Then performance tune by optimizing the Logstash config. If you never test your config for performance you will both not know how to improve it and will not know how future changes affect the processing efficiency.
Be sure to test your input and output plugins for performance. Using the default settings you will start to run into performance bottlenecks with some of the input and output plugins at higher logging rates.
If Logstash is “slow” it is because your configuration needs to be improved
Processing the most horrible and CPU greedy of logs, long string based messages such as syslog without consistent formatting between each type of message, it is easy to get to 1000 logs per second per 2.0GHz physical CPU core. Beyond that depends on the message itself and fine tuning your config, but 1000 logs per second is a good baseline. JSON based logs will process at an even higher efficiency rate. If the CPU is maxing out at a lower message rate then it is due to inefficiencies in the Logstash configuration files or what those config files are doing (…that DNS plugin is pretty efficient, but the latency of waiting for a response from the DNS server is quite long) and a bad regex can bring performance to an absolute crawl.
Anchor your regex!
^ and & tags. You won’t regret it.
One Logstash input port per client service becomes time consuming to manage doesn’t scale very well
I’ve seen this in many blogs and it makes a lot of sense as it is a whole lot easier to get a functioning ELK system working. You have several different types of systems all sending syslog messages to different syslog ports on Logstash. Firewalls send to 5000, linux servers to 5001, app#1 to 5002, etc. Don’t get me wrong, this method does work. The problem is scaling, what happens when you have several dozen different types of source clients and other people are configuring these systems to send logs. You can remember that port 5000 is for firewalls but chances are you will be running a constant battle against your coworkers that keep sending logs to the wrong ports and screwing up the logging in ELK.
Send all syslog from all the different syslog sources to port 1514 (or whatever), and with Logstash ‘if’ statements and regex strings use the mutate filter to add tags or fields to identify the traffic. If your regex is efficient there is some CPU overhead but it isn’t much to outweigh the benefits of simplifying all of your coworkers lives.
Separating Logstash ingestion and indexing, then adding a buffer, is critical
Logstash ingestion is the receiving of the various logs, the beats and syslog and other input plugins. The only thing worse then no logs is capturing *almost* all of the logs. Which ones are you missing? Hopefully nothing important…? If all input plugins are using TCP then the chances of dropped messages are lower because the sending client will resend, but if using UDP (common for syslog and for performance reasons) if Logstash does not service the UDP input plugin fast enough then some messages will get missed and disappear.
Logstash indexing is the CPU heavy task of processing the logs and where almost all of the configuration (and configuration changes) take place. Faster is better of course, but there is to time requirement for indexing except for it to be faster then the ingestion rate.
The easiest solution? Separate out ingestion and indexing into separate Logstash services, on separate hardware, and put a buffer in-between (Redis works well).
Logstash ingestion is the input plugins and some simple filters to identify and classify the traffic into different log types then writes to Redis. This Logstash config hardly ever changes and is doing very little filtering so it has free CPU cycles to service the input plugins.
The Redis buffer gives you headroom to survive ‘issues’ on the indexing and Elasticsearch side. It sucks if Elasticsearch crashes, or a new Logstash indexing config change was done that killed performance, but as long as it can be fixed before the Redis buffer runs out of space then it doesn’t matter much. It also allows for other benefits such as being able to suck up sudden bursts of ingestion traffic that are at a higher rate then the indexers can index. Or shutting down the indexers, shutting down the Elasticsearch cluster and upgrading, then starting everything back up again. No lost logs.
The Logstash indexers have lots of CPU power, input from Redis and output to Elasticsearch. This is where the bulk of the Logstash configuration is located (most error prone and most performance impacting prone)
It is a simple architecture (look for a future blog post) that allows all the tiers to scale independently of each other depending on need, and is set up to prioritize making sure logs are not dropped or disappear anywhere.