written: Feb 1, 2021
This is a primer on how to set up Apache Airflow on a CentOS 7 server for two separate environments (Prod and Simulation).
My company needed two separate envs: one for Production DAGs and one for Simulation/UAT DAGs.
Airflow can be a pain in the ass to set up, and the complexity grows quickly if you are using Docker to do it. Reading different articles on how to set up separate environments using Docker, I got frustrated because I kept running into container issues. Docker, while a great tool, can obfuscate basic troubleshooting and the synchronization of all the different Airflow components.
Instead, I simply created two on-premise Airflow environments, sharing the same Postgres server but using separate databases. This setup does not use Docker at all and is relatively straightforward.
This article shows how to spin up the two Airflow envs using the Saltstack configuration management tool.
This will install Airflow version 2.0.0 (latest as of Feb 2021)
All Saltstack code for this article is hosted here:
https://github.com/perfecto25/airflow
Prerequisites
To install Airflow you will need the following:
- Postgres 10 or up (I’m running Postgres 10)
- Python 3.6 or higher
Step 1 — create folder structure
We will set up our two Airflow envs using the following folder structure:
/opt/airflow/prod
/opt/airflow/sim
Inside each folder you have separate .env files, DAGs, configs, etc.
Here's what the final layout of each env will look like.
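A sketch of the layout (the dags, logs, and backups paths match the configs used later in this article; the venv and service script appear in later steps):

```
/opt/airflow
├── venv/          # shared virtualenv, one Airflow install for both envs
├── sim/
│   ├── .env       # AIRFLOW_HOME, AIRFLOW_CONFIG and other env vars
│   ├── airflow.cfg
│   ├── dags/
│   ├── logs/
│   ├── backups/
│   └── service    # combined start/stop/restart/status script
└── prod/
    └── (same layout, with prod values)
```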
Both envs share the same Postgres server, but use two different databases.
Create the folder structure (this can be done via Saltstack, but showing individual steps for clarity)
root@server> mkdir -p /opt/airflow/{sim,prod}
Step 2 — install Airflow
Install Airflow using virtualenv (it is always good practice to isolate your Python libs).
We install a single instance of the Airflow application, shared by both environments; what we keep separate are the databases and configs.
root@server> cd /opt/airflow
root@server> python3.6 -m virtualenv venv
root@server> source venv/bin/activate
(venv) root@server> pip install 'apache-airflow[postgres]'
This will install all required libs into your venv located at /opt/airflow/venv
Now create a symlink so the OS knows where the "airflow" command is:
ln -s /opt/airflow/venv/bin/airflow /usr/bin/airflow
Step 3 — setup Database
create 2 databases, Sim and Prod (don't forget to add ";" to the end of each psql command)
# create DB user
root@server> sudo -u postgres createuser airflow
root@server> su postgres
postgres@server> psql

# give Airflow user a password
postgres=# alter user airflow with encrypted password 'airflow';

# create databases
postgres=# create database airflowsim;
postgres=# create database airflowprod;
postgres=# grant all privileges on database airflowsim to airflow;
postgres=# grant all privileges on database airflowprod to airflow;
exit PSQL by typing “\q”
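If you prefer a non-interactive setup, the same statements can be kept in a SQL file and fed to psql in one shot (e.g. `sudo -u postgres psql -f setup.sql`; the file name `setup.sql` is just an example):

```sql
-- setup.sql: one-shot Airflow database setup (same statements as above)
alter user airflow with encrypted password 'airflow';
create database airflowsim;
create database airflowprod;
grant all privileges on database airflowsim to airflow;
grant all privileges on database airflowprod to airflow;
```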
Step 4 — provide Config and .Env files
Add a .env file to your Airflow env:

vim /opt/airflow/sim/.env

# Airflow-sim
export AIRFLOW_CONFIG=/opt/airflow/sim/airflow.cfg
export AIRFLOW_HOME=/opt/airflow/sim
export AIRFLOW__WEBSERVER__NAVBAR_COLOR="#32a8a2"
This .env file sets the environment's HOME and config paths, as well as any additional parameters you want to customize, like the header color (I like to visually separate Prod and Sim by making Prod red and Sim blue).
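For reference, the Prod counterpart differs only in its paths and color; a sketch (the red hex value is illustrative):

```
# /opt/airflow/prod/.env
export AIRFLOW_CONFIG=/opt/airflow/prod/airflow.cfg
export AIRFLOW_HOME=/opt/airflow/prod
export AIRFLOW__WEBSERVER__NAVBAR_COLOR="#a83232"
```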
Now add the Airflow config file. It is a very large file and you can reference it via the Github repo above, but the most important variables to change are the Postgres connection string, the webserver port, and the path to your DAGs (you want to make sure you don't combine the two environments; keep them completely separate).
vim /opt/airflow/sim/airflow.cfg (see Github repo for full example)

dags_folder = /opt/airflow/sim/dags
base_log_folder = /opt/airflow/sim/logs
sql_alchemy_conn = postgresql+psycopg2://airflow@localhost:5432/airflowsim

# SIM interface will run on 8095, Prod will run on 8090
base_url = http://<your server IP or hostname>:8095
web_server_port = 8095
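The Prod config would follow the same pattern, pointing at the prod paths, the airflowprod database, and port 8090:

```
# /opt/airflow/prod/airflow.cfg (relevant lines)
dags_folder = /opt/airflow/prod/dags
base_log_folder = /opt/airflow/prod/logs
sql_alchemy_conn = postgresql+psycopg2://airflow@localhost:5432/airflowprod
base_url = http://<your server IP or hostname>:8090
web_server_port = 8090
```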
Also add two keys, fernet_key and secret_key, to airflow.cfg.

To generate a Fernet key, run:

pip install cryptography
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
add the output to fernet_key variable
To generate secret_key, run
openssl rand -hex 30
add the output to secret_key variable
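As a side note, a Fernet key is just 32 random bytes encoded as URL-safe base64, so if you'd rather not install the cryptography package on this box, openssl alone can generate both keys (a sketch):

```shell
# fernet_key: 32 random bytes, URL-safe base64 (44 characters)
openssl rand -base64 32 | tr '+/' '-_'

# secret_key: 60 hex characters, same as the openssl command above
openssl rand -hex 30
```

Either way, paste the generated values into the fernet_key and secret_key variables in airflow.cfg.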
Now let's initialize the SIM database:
# if you are still in a virtualenv, exit it
(venv) root@server> exit

# source the environment file that contains the Airflow variables
root@server> source /opt/airflow/sim/.env

# activate the venv and initialize the database
root@server> source /opt/airflow/venv/bin/activate
(venv) root@server> airflow db init
This will populate the SIM database with all required Airflow tables
Step 5 — add Service files and Airflow user
create an OS user called "airflow"
useradd airflow
add three service files: Webserver, Scheduler, and Worker
Webserver
vim /usr/lib/systemd/system/airflowsim-webserver.service

[Unit]
Description=Airflow-sim webserver daemon
After=network.target postgresql-10.service
Wants=postgresql-10.service
[Service]
EnvironmentFile=/opt/airflow/sim/.env
User=root
Group=root
Type=simple
# create /run/airflow for the webserver PID file
RuntimeDirectory=airflow
RuntimeDirectoryMode=0775
ExecStart=/bin/bash -c 'source /opt/airflow/sim/.env;source /opt/airflow/venv/bin/activate;airflow webserver --pid /run/airflow/webserver-sim.pid'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
Scheduler

vim /usr/lib/systemd/system/airflowsim-scheduler.service
[Unit]
Description=Airflow-sim scheduler daemon
After=network.target postgresql-10.service
Wants=postgresql-10.service
[Service]
EnvironmentFile=/opt/airflow/sim/.env
User=root
Group=root
Type=simple
ExecStart=/bin/bash -c 'source /opt/airflow/sim/.env;source /opt/airflow/venv/bin/activate;airflow scheduler'
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Worker (if not using Celery, you can skip this)

vim /usr/lib/systemd/system/airflowsim-worker.service
[Unit]
Description=Airflow-sim celery worker daemon
After=network.target postgresql-10.service
Wants=postgresql-10.service
[Service]
EnvironmentFile=/opt/airflow/sim/.env
User=root
Group=root
Type=simple
ExecStart=/bin/bash -c 'export C_FORCE_ROOT=True;source /opt/airflow/sim/.env;source /opt/airflow/venv/bin/activate;airflow celery worker'
Restart=on-failure
RestartSec=10s
[Install]
WantedBy=multi-user.target
Each of these service files sources the .env file and the virtualenv before running its service.
Reload systemctl to pick up changes
systemctl daemon-reload
Now add a combined service script; this wraps all three services into one easy-to-use command:
vi /opt/airflow/sim/service
#!/bin/bash
action=${1:-'status'}

function start(){
  echo "starting airflow SIM webserver"
  systemctl start airflowsim-webserver
  echo "starting airflow SIM scheduler"
  systemctl start airflowsim-scheduler
  echo "starting airflow SIM worker"
  systemctl start airflowsim-worker
}

function stop(){
  echo "stopping airflow SIM webserver"
  systemctl stop airflowsim-webserver
  echo "stopping airflow SIM scheduler"
  systemctl stop airflowsim-scheduler
  echo "stopping airflow SIM worker"
  systemctl stop airflowsim-worker
}

function status(){
  systemctl status airflowsim-webserver
  systemctl status airflowsim-scheduler
  systemctl status airflowsim-worker
}

if [ "$action" == "start" ]
then
  start
elif [ "$action" == "stop" ]
then
  stop
elif [ "$action" == "restart" ]
then
  stop
  start
elif [ "$action" == "status" ]
then
  status
else
  echo "invalid command (start|stop|restart|status)"
fi
chmod +x /opt/airflow/sim/service
To start all Airflow SIM services run
/opt/airflow/sim/service start
To check for errors or startup messages, tail the journal log:
journalctl -f
Airflow SIM should start up and you can access the service via your browser:
http://<your server>:8095 (or whatever port you configured for SIM)
Step 6 — create users
To create user accounts, run the airflow users create command; the example below creates an Admin user (to create users with other roles, see the Airflow documentation for role types):
(venv) root@server> airflow users create -f Joe -l Smith -p abracadabra -r Admin -u jsmith -e jsmith@company.com
Step 7 — Database backups
To back up your Postgres DB for SIM, just run a pg_dump command:

root@server> mkdir /opt/airflow/sim/backups
root@server> runuser -l postgres -c 'pg_dump -O -F c -f /opt/airflow/sim/backups/backup.dat -Z 3 --blobs -p 5432 -h localhost -d airflowsim'
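To make the backup recurring, the same pg_dump command can go into a cron entry; a sketch (the /etc/cron.d path, file name, and schedule are just examples):

```
# /etc/cron.d/airflow-sim-backup: nightly SIM dump at 02:30, run as the postgres user
30 2 * * * postgres pg_dump -O -F c -Z 3 --blobs -p 5432 -h localhost -d airflowsim -f /opt/airflow/sim/backups/backup.dat
```

Note that the postgres user must be able to write to the backups directory for this to work.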
You now have a working SIM instance!
To create a Prod environment, follow the same steps but replace "sim" with "prod" in all files, commands, and configs. Don't forget to change the DB connection string to point to "airflowprod", along with the webserver port and other variables.
Saltstack
If you use Salt, you can easily create both environments by cloning the above repo into your formula directory and running:
salt <target> state.sls formula.airflow.sim
salt <target> state.sls formula.airflow.prod
This formula will create everything mentioned in this article. Salt uses Jinja variables to separate environments and config variables.
Hope this helps your setup.