Monitoring your Infrastructure with Elasticsearch and Elastalert
This tutorial shows how to:
- set up the Elastic stack (ELK stack) on a server,
- install the Metricbeat, Filebeat, and Auditbeat agents on an endpoint,
- install the ElastAlert plugin, which alerts you on events.
As a sysadmin I manage a medium-sized infrastructure (fewer than 200 servers, both physical and AWS instances), and I need a way to keep an eye on them in case of intrusions, system changes, permission changes, or any other out-of-band modifications.
The following shows how to set this up using Saltstack, but the concept applies to any configuration management system (e.g., Puppet, Ansible, Chef).
The examples target CentOS 7, so adjust to your systems accordingly.
Step 1 — install ELK stack on a server
The server hosting your ELK stack must be able to handle large volumes of data, so I recommend a machine with at least 32G of RAM, at least 8 CPUs, and a minimum of 200G of disk space.
If your infrastructure has more than 50 nodes/servers, use at least 500G of space, as search indexes grow large quickly. You can prune the indexes using Elastic's Index Lifecycle Management (see Step 5 below on how to do this).
My ELK host has the following resources:
12 CPUs, 62G of RAM, and 700G of disk space, and as you can see it stays very busy running the ELK Java processes.
To install the ELK stack:
git clone this repo and move the directory into your Salt state folder.
Once you have the repo cloned, open each YAML config file for Metricbeat, Auditbeat, Logstash, and Filebeat and adjust your settings; for example, provide the hostname of your ELK host:
files/metricbeat.yml
Also provide the username and password for the elastic user; this is used for GUI login authentication.
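For reference, the relevant part of files/metricbeat.yml looks roughly like this (the hostname and credentials below are placeholders, assuming the Beat ships straight to Elasticsearch; adjust to your environment):

```yaml
# Ship metrics straight to the Elasticsearch instance on the ELK host
# (hostname and credentials are placeholders)
output.elasticsearch:
  hosts: ["elk-host.example.com:9200"]
  # Credentials for the elastic user (the same user used for GUI login)
  username: "elastic"
  password: "changeme"
```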
Once your YAMLs are configured with your parameters, install the ELK stack on your host:
salt <elk-host> state.sls elk-stack
This installs all the components necessary for running the ELK stack (Elasticsearch, Kibana, Logstash).
It also installs the Yelp ElastAlert plugin, which monitors your index for events and alerts on specific rules.
Once the state finishes, check that port 5601 (Kibana) is up and listening, as well as port 9200 (Elasticsearch).
You should be able to see the GUI at <elk-host>:5601.
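A quick way to verify the ports from another machine is a small shell helper like the one below (a sketch; "elk-host" is a placeholder for your ELK server's hostname):

```shell
# Hypothetical helper: returns 0 if a TCP connection to host:port succeeds.
# Uses bash's /dev/tcp, so no extra tools are required.
check_port() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# "elk-host" is a placeholder; substitute your ELK server's hostname.
for port in 5601 9200; do
  if check_port elk-host "$port"; then
    echo "port ${port} is listening"
  else
    echo "port ${port} is NOT reachable"
  fi
done
```

If a port reports as not reachable, check the service status on the ELK host before suspecting the network.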
Step 2 — install Beats
On every host that you want to monitor, make sure the endpoint can connect to <elk-host>:5600 as well as <elk-host>:9200.
Firewall issues are a common problem, so troubleshoot any firewall blockage before installing any Beats agents.
To install all the Beats (Metricbeat, Filebeat, Auditbeat), run this against the endpoint:
salt nycweb1 state.sls elastic.beats
To install a specific Beat, e.g. Metricbeat, run:
salt nycweb1 state.sls elastic.metricbeat
This installs the Beats, starts the Beat services, and generates a dashboard for this endpoint.
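The repo's actual state files may differ, but a minimal Salt state for installing a Beat typically looks something like this sketch (state IDs, package name, and file paths here are illustrative assumptions; check the repository's .sls files for the real ones):

```yaml
# elastic/metricbeat.sls (illustrative sketch, not the repo's actual state)
metricbeat_pkg:
  pkg.installed:
    - name: metricbeat

metricbeat_config:
  file.managed:
    - name: /etc/metricbeat/metricbeat.yml
    - source: salt://elastic/files/metricbeat.yml
    - require:
      - pkg: metricbeat_pkg

metricbeat_service:
  service.running:
    - name: metricbeat
    - enable: True
    - watch:
      - file: metricbeat_config
```

The watch requisite restarts the service whenever the managed config file changes, which is what keeps the endpoints in sync with your YAML edits.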
Step 3 — Elastalert
ElastAlert is a separate plugin from Yelp that runs on top of the ELK stack (on the same host as the ELK stack) and monitors the Auditbeat index for alert conditions.
You can see some sample Alert configs in plugins/elastalert/files/rules
These rules can also notify you via email, Slack, and other methods (see the ElastAlert docs for all options).
Let's take a simple SSH-abuse alert. This rule parses the Auditbeat index and alerts you when someone tries to SSH to your server but gets rejected (plugins/elastalert/files/rules/ssh.yml):
# Rule name, must be unique
name: SSH abuse - ElastAlert 3.0.1
is_enabled: true

# Alert on x events in y seconds
type: frequency

# Alert when this many documents matching the query occur within a timeframe
num_events: 3

# num_events must occur within this amount of time to trigger an alert
timeframe:
  minutes: 30

# A list of Elasticsearch filters used to find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- query:
    query_string:
      query: "event.type:authentication_failure"

index: auditbeat-*

# When the attacker continues, send a new alert after x minutes
realert:
  minutes: 1

query_key:
- source.ip

include:
- host.hostname
- user.name
- source.ip
include_match_in_root: true

alert_subject: "SSH abuse on <{}>"
alert_subject_args:
- host.hostname
alert_text: |-
  An attack on {} is detected.
  The attacker looks like:
  User: {}
  IP: {}
alert_text_args:
- host.hostname
- user.name
- source.ip

# The alert to use when a match is found
alert:
- email
- slack

email:
- "infraalerts@company.com"
slack_webhook_url: "https://hooks.slack.com/services/TXYZ123/ABC234"

# Alert body only contains a title and text
alert_text_type: alert_text_only
As you can see, it matches key events in your index such as "authentication_failure" (these are generated by Auditbeat) and reports key fields like hostname, username, and source IP.
It then sends an email and also a Slack message to your channel.
The final notification looks like this in Slack:
You can also see these failures in the Elastic GUI, under the Auditbeat dashboard.
Other rules report things like changed files, changed permissions, and packages installed or removed; these are all configured in the rule YAMLs.
If you are writing your own rules, it can be tricky to get the YAML syntax right.
To test the validity of a rule, you can simulate an alert like this:
elastalert-test-rule --config elastalert.yaml rules/ssh.yaml
ElastAlert queries Elasticsearch every minute for events. To adjust this, open the elastalert.yaml file and configure this variable:

# How often ElastAlert will query Elasticsearch
# The unit can be anything from weeks to seconds
run_every:
  minutes: 1
Step 5 — Index Management
Because indexes can grow very large depending on how long you store your data, set up a lifecycle policy to shrink the index, i.e., remove events older than X days.
In the Elastic GUI, go to Management > Stack Management.
Go to Index Lifecycle Policies and create a policy based on disk space and age. This example shows a policy that shrinks the Auditbeat index once it grows past 50G of disk space or 30 days of age.
This will automatically keep your disks healthy, assuming you don't need to store data long term.
Once your data reaches this threshold, it enters a "cold phase" and is shrunk, meaning that it is still available to query, but it will need to be "warmed up" before it is usable.
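The GUI builds the policy JSON for you, but the same thing can be sketched directly in Kibana's Dev Tools console via the ILM API (the policy name and exact thresholds below are illustrative, not the repo's settings):

```
PUT _ilm/policy/auditbeat-cleanup
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 }
        }
      }
    }
  }
}
```

Rollover caps the active index at 50G or 30 days, and the cold phase then shrinks the rolled-over index to a single shard to save space while keeping it queryable.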
Final Thoughts
The ELK stack is a relatively complex system to set up, so hopefully these instructions shine a bit of light on how to orchestrate all these different pieces to work together. As a user of both Elastic and Splunk, I can tell you that I found Elastic easier to set up and maintain than Splunk, not to mention that Elastic is free, compared to the high cost of Splunk.
Once configured, though, Elastic runs very smoothly and does its job perfectly. It's a terrific piece of software by the folks at Elastic.
Saltstack makes the deployment of all these pieces much easier to manage and keeps your config in proper state.
Read the contents of the repository to understand the full deployment process and post any questions here if you run into issues.
Hope this helps you with Elastic deployment.