Monitoring your Infrastructure with Elasticsearch and Elastalert
This tutorial shows how to:
- set up the Elastic stack (ELK stack) on a server,
- install the Metricbeat, Filebeat, and Auditbeat agents on an endpoint,
- install the ElastAlert plugin, which alerts you on events.
As a sysadmin I manage a medium-sized infrastructure (fewer than 200 servers, both physical and AWS instances), and I need a way to keep an eye on them in case of intrusions, system changes, permission changes, or any other out-of-band modifications.
The following shows how to set this up using Saltstack, but the concept applies to any configuration management system (e.g., Puppet, Ansible, Chef).
The examples target CentOS 7, so adjust to your systems accordingly.
Step 1 — install ELK stack on a server
The server hosting your ELK stack must be able to handle large volumes of data, so I recommend a machine with at least 32G of RAM, at least 8 CPUs, and a minimum of 200G of disk space.
If your infrastructure has more than 50 nodes/servers, use at least 500G of space, as search indexes grow large quickly. You can prune the indexes using Elastic's Index Lifecycle Management (see Step 5 below on how to do this).
My ELK host has the following resources:
12 CPUs, 62G of RAM, and 700G of disk space, and as you can see it stays very busy running the ELK Java processes.
To install the ELK stack:
git clone this repo and move the directory into your Salt state folder.
Once you have the repo cloned, open each YAML config file for Metricbeat, Auditbeat, Logstash, and Filebeat and adjust your settings; for example, provide the hostname of your ELK host:
files/metricbeat.yml
Also provide the username and password for the elastic user; this is used for GUI login authentication.
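For reference, the relevant part of files/metricbeat.yml looks roughly like this (the hostname and credentials below are placeholders, assuming the Beat ships straight to Elasticsearch; adjust to your environment):

```yaml
# Ship metrics straight to the Elasticsearch instance on the ELK host
# (hostname and credentials are placeholders)
output.elasticsearch:
  hosts: ["elk-host.example.com:9200"]
  # Credentials for the elastic user (the same user used for GUI login)
  username: "elastic"
  password: "changeme"
```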
Once your YAMLs are configured with your parameters, install the ELK stack on your host:
salt <elk-host> state.sls elk-stack
This installs all the components necessary for running the ELK stack (Elasticsearch, Kibana, Logstash).
It also installs the Yelp ElastAlert plugin, which monitors your index for events and alerts on specific rules.
Once the state finishes, check that port 5601 (Kibana) is up and listening, as well as port 9200 (Elasticsearch).
You should be able to see the GUI at <elk-host>:5601.
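A quick way to verify the ports from another machine is a small shell helper like the one below (a sketch; "elk-host" is a placeholder for your ELK server's hostname):

```shell
# Hypothetical helper: returns 0 if a TCP connection to host:port succeeds.
# Uses bash's /dev/tcp, so no extra tools are required.
check_port() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# "elk-host" is a placeholder; substitute your ELK server's hostname.
for port in 5601 9200; do
  if check_port elk-host "$port"; then
    echo "port ${port} is listening"
  else
    echo "port ${port} is NOT reachable"
  fi
done
```

If a port reports as not reachable, check the service status on the ELK host before suspecting the network.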
Step 2 — install Beats
On every host that you want to monitor, make sure the endpoint can connect to <elk-host>:5600 as well as <elk-host>:9200.
Firewall issues are a common problem, so troubleshoot any firewall blockage before installing any Beats agents.
To install all the Beats (Metricbeat, Filebeat, Auditbeat), run this against the endpoint:
salt nycweb1 state.sls elastic.beats
To install a specific Beat, e.g. Metricbeat, run:
salt nycweb1 state.sls elastic.metricbeat
This installs the Beats, starts the Beat services, and generates a dashboard for this endpoint.
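The repo's actual state files may differ, but a minimal Salt state for installing a Beat typically looks something like this sketch (state IDs, package name, and file paths here are illustrative assumptions; check the repository's .sls files for the real ones):

```yaml
# elastic/metricbeat.sls (illustrative sketch, not the repo's actual state)
metricbeat_pkg:
  pkg.installed:
    - name: metricbeat

metricbeat_config:
  file.managed:
    - name: /etc/metricbeat/metricbeat.yml
    - source: salt://elastic/files/metricbeat.yml
    - require:
      - pkg: metricbeat_pkg

metricbeat_service:
  service.running:
    - name: metricbeat
    - enable: True
    - watch:
      - file: metricbeat_config
```

The watch requisite restarts the service whenever the managed config file changes, which is what keeps the endpoints in sync with your YAML edits.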
Step 3 — Elastalert
ElastAlert is a separate plugin from Yelp that runs on top of the ELK stack (on the same host as the ELK stack) and monitors the Auditbeat index for alert conditions.
You can see some sample Alert configs in plugins/elastalert/files/rules
These rules can also notify you via email, Slack, and other methods (see the ElastAlert docs for all options).
Let's take a simple SSH-abuse alert. This rule parses the Auditbeat index and alerts you when someone tries to SSH to your server but gets rejected (plugins/elastalert/files/rules/ssh.yml):
# Rule name, must be unique
name: SSH abuse - ElastAlert 3.0.1
is_enabled: true

# Alert on x events in y seconds
type: frequency

# Alert when this many documents matching the query occur within a timeframe
num_events: 3

# num_events must occur within this amount of time to trigger an alert
timeframe:
  minutes: 30

# A list of Elasticsearch filters used to find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- query:
    query_string:
      query: "event.type:authentication_failure"

index: auditbeat-*

# When the attacker continues, send a new alert after x minutes
realert:
  minutes: 1

query_key:
- source.ip

include:
- host.hostname
- user.name
- source.ip
include_match_in_root: true

alert_subject: "SSH abuse on <{}>"
alert_subject_args:
- host.hostname
alert_text: |-
  An attack on {} is detected.
  The attacker looks like:
  User: {}
  IP: {}
alert_text_args:
- host.hostname
- user.name
- source.ip

# The alert to use when a match is found
alert:
- email
- slack

email:
- "infraalerts@company.com"
slack_webhook_url: "https://hooks.slack.com/services/TXYZ123/ABC234"

# Alert body only contains a title and text
alert_text_type: alert_text_only
As you can see, it matches key events in your index such as "authentication_failure" (these are generated by Auditbeat) and reports key fields like hostname, username, and source IP.
It then sends an email and also a Slack message to your channel.
The final notification looks like this in Slack:
You can also see these failures in the Elastic GUI, under the Auditbeat dashboard.
Other rules report things like changed files, changed permissions, and packages installed or removed; these are all configured in the rule YAMLs.
If you are writing your own rules, it can be tricky to get the YAML syntax right.
To test the validity of a rule, you can simulate an alert like this:
elastalert-test-rule --config elastalert.yaml rules/ssh.yaml
ElastAlert queries Elasticsearch every minute for events. To adjust this, open the elastalert.yaml file and configure this variable:

# How often ElastAlert will query Elasticsearch
# The unit can be anything from weeks to seconds
run_every:
  minutes: 1
Step 5 — Index Management
Because indexes can grow very large depending on how long you store your data, set up a lifecycle policy to shrink the index, i.e., remove events older than X days.
In the Elastic GUI, go to Management > Stack Management.
Go to Index Lifecycle Policies and create a policy based on disk space and age. This example shows a policy that shrinks the Auditbeat index once it grows past 50G of disk space or 30 days of age.
This will automatically keep your disks healthy, assuming you don't need to store data long term.
Once your data reaches this threshold, it enters a "cold phase" and is shrunk, meaning that it is still available to query, but it will need to be "warmed up" before it is usable.
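The GUI builds the policy JSON for you, but the same thing can be sketched directly in Kibana's Dev Tools console via the ILM API (the policy name and exact thresholds below are illustrative, not the repo's settings):

```
PUT _ilm/policy/auditbeat-cleanup
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 }
        }
      }
    }
  }
}
```

Rollover caps the active index at 50G or 30 days, and the cold phase then shrinks the rolled-over index to a single shard to save space while keeping it queryable.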
Final Thoughts
The ELK stack is a relatively complex system to set up, so hopefully these instructions shine a bit of light on how to orchestrate all these different pieces to work together. As a user of both Elastic and Splunk, I can tell you that I found Elastic easier to set up and maintain than Splunk, not to mention that Elastic is free, compared to the high cost of Splunk.
Once configured, though, Elastic runs very smoothly and does its job perfectly. It's a terrific piece of software by the folks at Elastic.
Saltstack makes the deployment of all these pieces much easier to manage and keeps your config in proper state.
Read the contents of the repository to understand the full deployment process and post any questions here if you run into issues.
Hope this helps you with Elastic deployment.