How large is “large”? 1,000,000,000+ logs per day. Tens of thousands per second. A TB or more of logs per day. It is a long but interesting road from the 100 events per second you had when you first started. Put the work and effort into it and in the process you will learn just how powerful ELK really is (while learning to keep your ELK stack running…). Check out my other Elasticsearch|Logstash|Kibana installation posts, as these use configs lifted directly from the high logging rate environment discussed below.
Elasticsearch is a scale-out system by design, and this is reinforced by the roughly 31GB JVM heap limit per instance. It is built to be resilient, to recover from failure, and to distribute the workload. More nodes means more CPU/RAM/spindles, which means more performance. Find a node size that works and then focus on quantity. Performance in Elasticsearch depends on the ratio between resources: the number of CPU cores and drive spindles for indexing, and the ratio of RAM to SSD/HDD space for search performance. Finding the balance between the three resources (CPU, RAM, and storage) will determine the performance of the cluster as a whole.
Good performance in Elasticsearch is a combination of CPU, RAM, and IOPS to a storage device. SSD prices have dropped enough that there isn’t much reason not to use them. Servicing concurrent writes and reads from many indexing services and searches can thrash a normal spinning hard disk. Elasticsearch needs both CPU and RAM to function. 8 cores and 64GB of RAM has proven to be a good ratio for me: 30GB for the JVM and the rest for the disk cache and OS.
If you undersize the CPU or RAM of your ES cluster, performance will suffer. This is most likely to happen on big hardware running multiple instances each, because of the tendency to over-consolidate when using larger dual-socket hardware. Xeon-D servers work well; consider using them.
Use standalone disks. Don’t over-complicate things by setting up RAID 1 or RAID 0 arrays for the Elasticsearch data. On a cluster-wide scale there is little to no performance improvement in doing RAID 0 over just assigning independent disks. RAID 1 is a waste of space since Elasticsearch has good data redundancy built in, and Elasticsearch can make use of replicas for performance improvements whereas RAID 1 does not improve performance at all.
Logstash is almost purely about CPU so throw as many CPU cores at it as you need and can afford.
Both SSD and SATA drives are very useful: use storage tiering
In most ELK solutions for logging there is a fairly common usage pattern. The most recent 2 hours gets all of the document indexing, the massive searching to feed live updated dashboards and visualizations, and the human driven searches that come with troubleshooting. From 2 hours to a couple of days there are some human driven searches and the occasional visualization. After a few days usage is almost nonexistent.
current time to 2h: Heavy writes and extremely heavy searches
2h to a few days: Very spiky heavy search traffic with long idle periods
beyond a few days: Barely used
The workload concurrency that happens on the most recent data requires SSD for any reasonable performance. Try to stick with Enterprise grade SSD as their performance is more predictable under load, but even lower end Enterprise SSD will function fine (Samsung PM863 for example). If you put spinning disks under this heavy concurrent workload they will get thrashed to death and performance will suffer quite badly.
The 2h to a few days data does benefit from resting on SSD because it allows for more concurrent searches without performance degradation. When there is a problem that has several people troubleshooting, the last thing anybody wants to deal with is search results in Kibana timing out.
Even though SSD costs have plummeted in recent years, large 6+ TB SATA drives are much cheaper per GB of storage. Once the heavy troubleshooting period of the most recent few days ends, the workload comes mostly from a single user or just a few users and is sequential when it comes to disk access. For large non-concurrent sequential reads (a single user searching through stuff from a month ago) the SATA drives will perform the same or better. You can use SSD for this tier but it is cost prohibitive.
How to do this? Have different tiers of data nodes. Data nodes with SSD drives in them handle the log ingestion and indexing plus the super hot most recent data for dashboards and searching. Data nodes with large SATA drives in them handle long term storage. High performance plus bulk storage in the same cluster.
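One common way to build these tiers is shard allocation filtering: each data node is tagged with a custom attribute, and each index is told which attribute value it requires. The Python sketch below only builds the settings body you would send to an index's _settings endpoint. The attribute name box_type, the hot/warm values, the two-day cutoff, and the logstash-YYYY.MM.DD index naming are all assumptions for illustration, not values from my cluster.

```python
from datetime import date, timedelta

# Assumption: data nodes are started with a custom attribute, for example
# node.attr.box_type: hot (SSD nodes) or node.attr.box_type: warm (SATA nodes).
TIER_ATTRIBUTE = "box_type"
HOT_DAYS = 2  # assumed cutoff; tune it to your own usage pattern


def tier_for(index_day: date, today: date, hot_days: int = HOT_DAYS) -> str:
    """Return the tier a daily index should live on, based on its age in days."""
    age = (today - index_day).days
    return "hot" if age < hot_days else "warm"


def allocation_settings(tier: str) -> dict:
    """Build the body for PUT /<index>/_settings that pins an index to one tier."""
    return {f"index.routing.allocation.require.{TIER_ATTRIBUTE}": tier}


if __name__ == "__main__":
    today = date(2017, 8, 10)  # arbitrary example date
    for days_back in range(4):
        day = today - timedelta(days=days_back)
        name = f"logstash-{day:%Y.%m.%d}"
        print(name, allocation_settings(tier_for(day, today)))
```

Run something like this on a schedule and Elasticsearch will relocate the shards from the SSD nodes to the SATA nodes on its own once the setting changes.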
Elasticsearch field mapping customized to your data is required
Elasticsearch by default will try and figure out what data you are sending it and use default mapping. This leads to having an analyzed and a non-analyzed field created for every field (in 5.5.x versions, the “message” and “message.keyword” fields as an example). This will degrade indexing performance and use up more disk space. The _all field is also very powerful, but it is expensive to build.
Take for example the default syslog message. It is a semi-formatted string that is run through some parsing in Logstash to extract important data fields, and then those fields plus the original message are stored in Elasticsearch. Using default settings, each field will be stored as both analyzed (text type in 5.5.x, which allows for partial text searching) and non-analyzed (keyword type in 5.5.x, which is exact match), and every field must be analyzed, which takes a lot of processing power in Elasticsearch. All of the analyzed fields are then analyzed again into the _all field. The _all field is used for searching when a user does not specify a field in the search string; Elasticsearch will search the _all field instead.
For syslog messages this is horribly inefficient. The original message itself is one of the few fields that needs to be analyzed (text type) and most of the other fields only need to be non-analyzed. Additionally, since the original syslog message contains the full data set, the _all field is not needed and Elasticsearch can simply be redirected to use the original message field instead of analyzing and creating the _all field.
A 50+% reduction in load on Elasticsearch clusters is within the realm of possibility, which means your cluster can now index twice as many events per second as before.
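As a rough sketch of what such a mapping can look like, here is an index template written as a Python dict and dumped to JSON. It assumes Elasticsearch 5.x syntax, a logstash-* index pattern, and that the original log line is stored in a field called message. The template name and pattern are placeholders, so adapt them before using anything like this.

```python
import json

# Assumptions: Elasticsearch 5.x template syntax, a "logstash-*" index pattern,
# and a "message" field that holds the original log line.
syslog_template = {
    "template": "logstash-*",
    "settings": {
        # Searches that name no field go to "message" instead of _all.
        "index.query.default_field": "message",
    },
    "mappings": {
        "_default_": {
            # Skip building the _all field entirely.
            "_all": {"enabled": False},
            # Any string field not listed under "properties" becomes keyword only.
            "dynamic_templates": [
                {
                    "strings_as_keyword": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword"},
                    }
                }
            ],
            # The original message is the one field that stays analyzed.
            "properties": {
                "message": {"type": "text"},
            },
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(syslog_template, indent=2))
```

The printed JSON is the kind of body you would PUT to the _template endpoint under a name of your choosing.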
Don’t parse out and save every field
Syslog messages are expensive (in resource consumption) to parse, and saving fields that have no value is expensive on the storage side in Elasticsearch as well as increasing the CPU and RAM requirements for searching and indexing.
For a sample firewall log message, the bits of information most people care about are the src/dst IP and port and a few other fields perhaps. If all the additional info is not important then don’t parse it out. Chances are you are saving the original message in Elasticsearch anyway so the info isn’t lost.
It seems like an obvious tip, but for example if you use the default Cisco ASA Logstash grok patterns there are a lot of fields that get parsed out and stored that have very little value. Similarly, using the default Cisco ASA grok patterns but then doing a mutate remove_field for the unwanted fields is still a waste of CPU on the Logstash server.
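To illustrate the idea outside of Logstash, here is a small Python sketch that pulls only a handful of fields out of a firewall line and leaves the rest inside the original message. The log line and its format are made up for the example (it is not a real Cisco ASA message), and the field names are assumptions.

```python
import re

# Hypothetical firewall log line; the format is invented for illustration.
SAMPLE = (
    "Aug 10 12:00:01 fw01 ALLOW TCP src=10.0.0.8:51234 "
    "dst=203.0.113.5:443 bytes=1840 rule=42 zone_in=inside zone_out=outside"
)

# Capture only the fields people actually search on; everything else is ignored.
WANTED = re.compile(
    r"(?P<action>ALLOW|DENY) (?P<protocol>\w+) "
    r"src=(?P<src_ip>[\d.]+):(?P<src_port>\d+) "
    r"dst=(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)"
)


def parse(line: str) -> dict:
    """Keep the raw line and add only the few fields worth indexing."""
    event = {"message": line}
    match = WANTED.search(line)
    if match:
        event.update(match.groupdict())
    return event


if __name__ == "__main__":
    print(parse(SAMPLE))
```

The bytes, rule, and zone values are never extracted, so no CPU is spent on them, yet they are still there in the saved message if somebody needs them later.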
Keep the original unparsed message
Syslog is easy. JSON can be done by feeding it through a ruby filter plugin that outputs the raw JSON message into a single text encoded field. The benefits of saving the original message are manifold.
-Logstash parsing doesn’t always go the way you intended; having the original message allows a comparison to be done
-It allows you to go back to old messages and parse out additional fields, or reparse the whole message because of a field mapping change.
-Can take the place of _all in some cases to reduce the analyzing load on the Elasticsearch cluster
-Some people really want to see the raw message
-You can output application messages back out in original formatting for tech support (Cisco TAC)
Add fields that did not exist before
This is the power of ELK and where a lot of both ELK and Graylog installations fall short: a log is more than just something you search for. Add value to it.
What is more useful:
-Searching for “protocol: UDP” and “port: 53” –or– searching for “l7protocol: DNS”
-Searching for “hostname: *dc1-5* OR hostname: *31*-*” –or– searching for “datacenter: Dallas1”
-Searching for “hostname: *app3*z OR sourcefile: c:\bla\bla\app1\*” –or– searching for “application: appname”
Parse things out in Logstash and add *additional* classifying information into the message that did not previously exist, to be used later to help search and filter. Creating a Tag Cloud visualization in Kibana with the l7protocol field that shows DNS, HTTP, HTTPS, etc. is a whole lot more useful than a big list of port numbers and protocol types.
We are all our own worst enemies. We know UDP:53 is DNS, so why bother doing that extra work to create a field called DNS? Because spending time up front to simplify it by tagging things with additional information will save a lot of time later and will lower the level of technical skill required to use it.
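The tagging itself is simple. As a sketch of the logic (written in Python here rather than as a Logstash filter), a small lookup table turns a protocol and port pair into an l7protocol field. The table contents and the protocol/port field names are assumptions for the example.

```python
# Assumption: a small hand-maintained lookup of well-known (protocol, port) pairs.
L7_PROTOCOLS = {
    ("UDP", 53): "DNS",
    ("TCP", 53): "DNS",
    ("TCP", 80): "HTTP",
    ("TCP", 443): "HTTPS",
}


def add_l7protocol(event: dict) -> dict:
    """Return a copy of the event, adding l7protocol when the pair is known."""
    enriched = dict(event)
    try:
        port = int(event.get("port", -1))
    except (TypeError, ValueError):
        return enriched  # port was not a number; leave the event untouched
    key = (str(event.get("protocol", "")).upper(), port)
    if key in L7_PROTOCOLS:
        enriched["l7protocol"] = L7_PROTOCOLS[key]
    return enriched


if __name__ == "__main__":
    print(add_l7protocol({"protocol": "udp", "port": "53"}))
    print(add_l7protocol({"protocol": "TCP", "port": 8080}))
```

Unknown pairs pass through unchanged, so the original protocol and port fields are always still there to search on.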
Pretty graphs and visualizations are useless unless they show something that cannot be expressed another way
Take a pie chart of a list of hit counts from IP addresses with the list of IP addresses and hit counts below it. Simple list, simple pie chart. What value does the pie chart add and give you? It looks pretty but it is useless.
Do you have a bunch of visualizations and dashboards that others (including yourself) don’t use? It’s probably because they don’t add any value and are therefore of no use. Pretty != valuable.
Visualizations are amazing at being able to show patterns and anomalies that otherwise would be missed or take an inordinate amount of time to find. When you are spending hours hunting through logs, what exactly are you looking for? How can you display that very vague hunt in a visualization so that it shows you what you spent hours hunting for? That is the value of visualizations. If you parse the data out in a way that is usable, and learn how to visualize it, the power is there in ELK to do it.
Managers and bosses like pretty graphs
Everybody is nodding along at this. Why do they like pretty graphs? It is because a visualization is able to express important information in a quick and valuable way, and they do not have the time to go through all the details. Managers and bosses don’t really like pretty graphs because they are pretty. They like graphs that quickly convey all the information they need…and it just so happens that a lot of times they look pretty. Build the visualization for the right reasons, to convey the right information in the most efficient way possible.
Standardize on a system wide naming scheme for fields
Don’t call the firewall hostname “host” in one index and a server hostname as “host_name” in another index. This makes it all but impossible to correlate information between indexes or even in the same index.
Parent/child field names can be useful
Consider a common set of fields: a host’s short name, its domain, and its fully qualified domain name.
The fields could be called “hostname”, “domain”, and “fqdn” respectively. Alternatively, they could all be nested under the “host” field as “host.name”, “host.domain”, and “host.fqdn”. Why is this potentially useful other than the organizational gains from it? In Kibana you can search the parent field and the results will be from all child fields. A search string of “host: server1” will return true on fields “host.name:server1” and “host.fqdn:server1.domain.local”
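A sketch of how that standardization can be enforced at ingest time: the Python function below moves whatever a source happens to call its host fields under a single nested host object. The alias table is an assumption; fill it in with the names your own sources actually use.

```python
# Assumption: each log source names its host fields differently.
# Keys are the names seen in the wild, values are the child name under "host".
ALIASES = {
    "host": "name",
    "host_name": "name",
    "hostname": "name",
    "domain": "domain",
    "fqdn": "fqdn",
}


def normalize_host_fields(event: dict) -> dict:
    """Move any known host alias under a single nested "host" object."""
    normalized = {}
    host = {}
    for key, value in event.items():
        if key in ALIASES and not isinstance(value, dict):
            host[ALIASES[key]] = value
        else:
            normalized[key] = value
    if host:
        existing = normalized.get("host")
        if isinstance(existing, dict):
            # The event already had a nested host object; merge into it.
            host = {**existing, **host}
        normalized["host"] = host
    return normalized


if __name__ == "__main__":
    print(normalize_host_fields({"host": "fw01", "action": "ALLOW"}))
    print(normalize_host_fields({"host_name": "server1", "fqdn": "server1.domain.local"}))
```

Both events come out with the same host.name field, so a single search or visualization can cover the firewall and the server.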
Logs ARE metrics and so so so much more than “just logs”
Not to purposefully pick on Graylog, but this is common to see in deployments of that system. You have Graylog collecting a bunch of logs that users do text searches against. Then there are various other metrics and polling agents sending data to Graphite (usually) and graphing in Grafana (usually). A lot of ELK systems are set up the same way, acting only as a log repository with Graphite and Grafana sitting alongside.
The various logs that are collected are metrics, not in the sense that you graph how many of them there are, but in the sense that you graph the information they contain. They contain a lot of valuable metrics that only need to be pulled out, and they themselves are metrics. The time difference between an HTTP GET and an HTTP response is a valuable metric that tracks performance of the web application, and it is probably already contained in the logs you are collecting. Overlaying HTTP response codes against SQL response times against firewall logs against netflow data gives an application wide view of health in a single visualization.
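As a small sketch of pulling a metric out of logs, the Python below pairs each response event with its request and computes the response time in milliseconds. The events, the request_id correlation field, and the timestamp format are all made up for illustration; your logs will need their own way to match a response to its request.

```python
from datetime import datetime

# Hypothetical parsed events; field names and values are invented for the example.
EVENTS = [
    {"request_id": "a1", "type": "request", "@timestamp": "2017-08-10T12:00:00.120"},
    {"request_id": "b2", "type": "request", "@timestamp": "2017-08-10T12:00:00.300"},
    {"request_id": "a1", "type": "response", "@timestamp": "2017-08-10T12:00:00.370", "status": 200},
    {"request_id": "b2", "type": "response", "@timestamp": "2017-08-10T12:00:01.050", "status": 500},
]


def response_times_ms(events):
    """Pair each response with its request and return {request_id: milliseconds}."""
    started = {}
    durations = {}
    for event in events:
        stamp = datetime.strptime(event["@timestamp"], "%Y-%m-%dT%H:%M:%S.%f")
        if event["type"] == "request":
            started[event["request_id"]] = stamp
        elif event["request_id"] in started:
            delta = stamp - started.pop(event["request_id"])
            durations[event["request_id"]] = round(delta.total_seconds() * 1000)
    return durations


if __name__ == "__main__":
    print(response_times_ms(EVENTS))
```

Store the result as a numeric field on the response event and it can be averaged, graphed, and overlaid like any other metric.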
You’re logging config versions and changes of the various applications and changes/change control windows into ELK right? How about a graph of configuration changes, overlaid on a display of change control windows, overlaid on application response times gathered by Nagios and stored in ELK? One graph to link causation between application changes and application availability and the financial effect to the business of changes both inside and outside of change control windows. All of it is being logged already… Neat eh?
The power of ELK comes from the blurring of the line between logs and metrics data and combining it all back together as needs require. Also known as Business Intelligence (BI) in the business world.