Deploying Airflow and MLflow in Kubernetes on AWS EKS
In part 2 of this series we tackle deploying Airflow and MLflow into our Kubernetes cluster in AWS EKS.
By Pilotcore
In the previous article, we described the deployment of your own Kubernetes cluster in AWS using the Elastic Kubernetes Service (EKS). After your cluster is up and running, it’s time to deploy the first resources to it, in our case Airflow and MLflow.
Airflow
At Pilotcore, we often use Airflow pipelines in our machine learning projects along with MLflow for model management. Airflow is an open-source tool that allows you to programmatically define and monitor your workflows. Since its initial release in 2015, it has gained enormous popularity, and today it's a go-to tool for many data engineers. In combination with EKS, Airflow on Kubernetes can be a reliable, highly scalable tool to handle all your data. Let's look at some of its options and how it can be used along with MLflow on Kubernetes.
Helm
Airflow provides an official Helm chart that can be used for deployments in Kubernetes.
In theory, all you need to do is run the following command:
helm install airflow --namespace airflow apache-airflow/airflow
Of course, practically, there is a lot of configuration needed. Most things will depend on your particular use case, but here we will take a look at some considerations.
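In practice, before the install command above can work, Helm needs to know about the chart repository. A minimal sequence might look like this (the release and namespace names are just examples):

```shell
# Register the official Apache Airflow chart repository.
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# Install the chart, creating the namespace if it does not exist yet.
helm install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace
```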
Git Sync
Airflow’s git syncing is a very handy tool to enable GitOps over your DAGs. Simply speaking, Airflow will periodically check the git repository, and if it detects changes, it will pull them, automatically updating your DAGs without any additional work.
If you are using the official Airflow Helm chart, enabling git sync is very easy; all you have to do is set the correct values in the values.yaml file.
As a first step, you need to enable it, then select the correct git repository and target branch. By default, the chart syncs the DAGs located in the tests/dags directory. Because our structure is a little more complex, we sync everything from the repository root (subPath: "") and set depth: 5 to keep the clone shallow (note that git-sync's depth controls the git history depth, not folder nesting).
dags:
gitSync:
enabled: true
repo: "ssh://git@github.com/.../.git"
branch: "main"
depth: 5
subPath: ""
It can be safely assumed that your source code is not publicly available. Because of that, you need to provide the private SSH key that Airflow will use to clone the repository. This can be done safely using a combination of Kubernetes secrets and AWS Secrets Manager.
In AWS Secrets Manager, create a new secret and paste your Git private SSH key as its content. Pay attention to the newline at the end of the content: git sync might not work without it, and that's a very tricky bug to catch.
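The secret can also be created from the CLI; reading the key from a file preserves the trailing newline exactly. The secret name and key filename below are placeholders:

```shell
# Create the secret from the key file; file:// keeps the exact bytes,
# including the trailing newline that git-sync needs.
aws secretsmanager create-secret \
  --name airflow-dags-git-ssh-key \
  --secret-string file://id_ed25519
```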
Retrieve the value of this secret using a Terraform data resource:
data "aws_secretsmanager_secret" "dags_git_ssh_key_secret" {
name = var.dags_git_ssh_key_name
}
data "aws_secretsmanager_secret_version" "dags_git_ssh_key_secret_version" {
secret_id = data.aws_secretsmanager_secret.dags_git_ssh_key_secret.id
}
And create a Kubernetes secret using this value:
resource "kubernetes_secret" "dags_git_ssh_key_secret_kube" {
metadata {
name = var.dags_git_ssh_key_name
namespace = var.namespace
}
data = {
gitSshKey = data.aws_secretsmanager_secret_version.dags_git_ssh_key_secret_version.secret_string
}
}
Finally, add it to the configuration:
dags:
gitSync:
sshKeySecret: "your_secret_name"
Logs
Logs are an essential part of any application. As we will discuss in the third post in this series, Scaling Airflow workers in EKS, our workers have no persistent storage and can be shut down when there are no tasks to run. Because of that, we cannot simply serve logs from the individual worker pods.
In Airflow, you have the option to upload logs to S3, a feature that can be enabled in the Airflow configuration:
[logging]
remote_logging = True
remote_base_log_folder = s3://my-bucket/path/to/logs
remote_log_conn_id = S3CONN
Where remote_base_log_folder
is the destination for your logs and S3CONN
is the ID of Airflow's connection holding credentials for the S3 bucket. You can set it either in the web server UI or via the environment variable AIRFLOW_CONN_S3CONN
in the following format:
s3://${aws_iam_access_key}:${aws_iam_secret_access_key}@S3
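One gotcha with this format: AWS secret access keys often contain characters such as / or + that break the URI, so the credentials should be URL-encoded first. A small sketch (the helper name is ours, not an Airflow API):

```python
from urllib.parse import quote


def s3_conn_uri(access_key: str, secret_key: str) -> str:
    # URL-encode both parts; AWS secret keys frequently contain '/' and '+'.
    return f"s3://{quote(access_key, safe='')}:{quote(secret_key, safe='')}@S3"


# Example with a made-up key pair:
print(s3_conn_uri("AKIAEXAMPLE", "abc/def+ghi"))
# s3://AKIAEXAMPLE:abc%2Fdef%2Bghi@S3
```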
Keep in mind that when tasks are running, the logs are stored locally on their workers. At that time, they can actually be seen in the web server UI because the web server will automatically retrieve them from the worker, but they are not yet available in S3.
After the task ends, logs get uploaded to S3 and the worker can be shut down. After this point, the web server will read the logs from S3.
XComs
XComs, short for “cross-communications,” are Airflow’s mechanism for exchanging data between tasks. However, the amount of data you can send and receive is limited, with the maximum size depending on the metadata database backend:
- SQLite: 2 GB
- Postgres: 1 GB
- MySQL: 64 KB
A common scenario is that you need to send more than just 64 KB. As a workaround, you can serialize and upload the data somewhere else (S3, SFTP, …) and then send only the link to the data file via the XCom. This is easy, but doing it each time sounds like unnecessary boilerplate. What if we could automate it and let Airflow do it for us in the background? That’s exactly what we will describe in a future blog post, Creating custom XCom backend in Airflow.
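In the meantime, the manual workaround can be sketched like this. The upload callable is injected so the example stays self-contained; in a real DAG it would be a boto3 put_object call, and all names here are hypothetical:

```python
import json
import uuid


def stash_large_result(data, upload, bucket="my-xcom-bucket"):
    """Serialize `data`, push it to external storage via `upload`,
    and return only a small S3-style reference to store in the XCom."""
    key = f"xcom/{uuid.uuid4()}.json"
    upload(bucket, key, json.dumps(data).encode())
    return f"s3://{bucket}/{key}"


# Stand-in for an S3 client so the sketch runs anywhere:
store = {}
ref = stash_large_result({"rows": list(range(1000))},
                         lambda b, k, body: store.setdefault(f"{b}/{k}", body))
print(ref)  # e.g. s3://my-xcom-bucket/xcom/<uuid>.json
```

The downstream task would then pull the reference from the XCom and download the payload itself.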
MLflow
MLflow is an open-source platform for the machine learning lifecycle. It’s library agnostic, language agnostic, and it can scale to large organizations with big data.
We use MLflow heavily for tracking our experiments and storing/deploying models.
Docker
Unfortunately, at the time of writing MLflow doesn’t provide official Docker images or Kubernetes deployment options, so let’s create our own.
First, we need to list the Python requirements for our image. Create a file requirements.txt
with the following content:
boto3==1.21.40
matplotlib==3.5.1
mlflow[extras]==1.25.1
psycopg2-binary==2.9.3
You may also consider using a more sophisticated Python package manager like Poetry or Pipenv.
The Dockerfile itself is quite simple:
FROM python:3.9
# Install any apt-dependencies you need here.
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
curl gnupg2 apt-transport-https apt-utils ca-certificates \
dumb-init freetds-bin gnupg gosu ldap-utils locales \
lsb-release netcat openssh-client postgresql-client sasl2-bin sudo \
unixodbc build-essential \
&& apt-get autoremove -yqq --purge \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Add any Python requirements.
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
And build it:
docker build . -t mlflow:1.25.1
At this point, you have a local Docker image; however, you need a way to pull it into your Kubernetes cluster. You can either push it to Docker Hub, or use Amazon’s solution called Elastic Container Registry (AWS ECR).
ECR can again be deployed with Terraform:
resource "aws_ecr_repository" "ecr" {
name = var.name
image_tag_mutability = var.image_tag_mutability
encryption_configuration {
encryption_type = "KMS"
kms_key = var.kms_arn
}
image_scanning_configuration {
scan_on_push = true
}
}
After deployment, you will have your own repository URI, which will look like account-id.dkr.ecr.region.amazonaws.com/mlflow.
You will probably need to log in before you can push your images; you can do so using the following command (replacing account-id and region with your relevant values):
aws ecr get-login-password --region region | docker login --username AWS --password-stdin account-id.dkr.ecr.region.amazonaws.com
And re-tag the previously built image to a new repository:
docker tag mlflow:1.25.1 account-id.dkr.ecr.region.amazonaws.com/mlflow:1.25.1
Afterwards, you can push it and use it in later deployments:
docker push account-id.dkr.ecr.region.amazonaws.com/mlflow:1.25.1
Helm
With the Docker image in hand, we can continue with the creation of the Helm chart.
First, let’s create a directory called chart and change into it:
mkdir chart
cd chart
Now we need a Chart.yaml file with the following content:
apiVersion: v2
name: mlflow
description: A Helm chart for Kubernetes
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
version: 0.1.0
# Should reflect the version the application is using.
appVersion: 1.25.1
Inside this directory, let’s create a templates subdirectory and change into it:
mkdir templates
cd templates
Here we will need to create several template files.
We will start with _helpers.yaml (conventionally named _helpers.tpl; Helm skips files prefixed with an underscore when rendering manifests), which will hold helper definitions used in the other files:
{{/* vim: set filetype=mustache: */}}
{{/*
Expand the name of the chart.
*/}}
{{- define "mlflow.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "mlflow.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "mlflow.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels
*/}}
{{- define "mlflow.labels" -}}
release: {{ .Release.Name }}
chart: {{ include "mlflow.chart" . }}
heritage: {{ .Release.Service }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/*
Selector labels
*/}}
{{- define "mlflow.selectorLabels" -}}
app.kubernetes.io/name: {{ include "mlflow.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
{{/*
Variables common across templates
*/}}
{{- define "mlflow.server-service-name" -}}
{{ include "mlflow.fullname" . }}-mlflow-server
{{- end }}
{{- define "mlflow.server-service-account-name" -}}
{{ default (printf "%s-mlflow-server" (include "mlflow.fullname" .)) .Values.serviceAccount.name }}
{{- end}}
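To make the naming logic concrete, here is the mlflow.fullname truncation rule modelled in plain Python (an illustration only, ignoring fullnameOverride, not part of the chart):

```python
def fullname(release: str, chart: str = "mlflow") -> str:
    # If the release name already contains the chart name, use it as-is;
    # otherwise join "<release>-<chart>". Truncate to 63 chars (the DNS
    # name limit) and strip a trailing '-', like the Helm helper does.
    name = release if chart in release else f"{release}-{chart}"
    return name[:63].rstrip("-")


print(fullname("prod"))         # prod-mlflow
print(fullname("mlflow-prod"))  # mlflow-prod
```

So for a release named prod, the Service defined below would be called prod-mlflow-mlflow-server.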
A Service is an abstract way to expose an application running on a set of Pods as a network service. We will define one Service (in a template file such as service.yaml) that routes traffic inside the cluster to one of the deployed pods:
apiVersion: v1
kind: Service
metadata:
name: {{ include "mlflow.server-service-name" . }}
namespace: {{ .Release.Namespace }}
labels:
{{- include "mlflow.labels" . | nindent 4 }}
spec:
type: {{ .Values.service.type }}
ports:
- port: {{ .Values.service.port }}
targetPort: http
protocol: TCP
name: http
selector:
{{- include "mlflow.selectorLabels" . | nindent 4 }}
An Ingress is an API object that manages external access to the services in a cluster. It will expose an HTTP route from outside the cluster to our MLflow Service (in a template file such as ingress.yaml):
{{- if .Values.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: {{ default (printf "%s-mlflow-server" (include "mlflow.fullname" .)) .Values.ingress.name }}
namespace: {{ .Release.Namespace }}
labels:
{{- include "mlflow.labels" . | nindent 4 }}
{{- with .Values.ingress.annotations }}
annotations:
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
{{- if and .Values.ingress.ingressClassName }}
ingressClassName: {{ .Values.ingress.ingressClassName }}
{{- end }}
rules:
- http:
paths:
- path: {{ .Values.ingress.path }}
pathType: ImplementationSpecific
backend:
service:
name: {{ include "mlflow.server-service-name" . }}
port:
number: {{ .Values.service.port }}
{{- end }}
Service accounts in Kubernetes give your pods an identity and a set of permissions. For example, the MLflow pod will need to read data from S3. Thanks to the integration between AWS IAM and EKS, we can create a service account in Kubernetes, bind it to an IAM role, and set permissions on that IAM role (in a template file such as serviceaccount.yaml):
{{- if .Values.serviceAccount.create }}
kind: ServiceAccount
apiVersion: v1
metadata:
name: {{ include "mlflow.server-service-account-name" . }}
labels:
{{- include "mlflow.labels" . | nindent 4 }}
{{- with .Values.serviceAccount.annotations }}
annotations:
{{ toYaml . | nindent 4 }}
{{- end }}
{{- end }}
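With EKS, that binding is typically done through an IAM Roles for Service Accounts (IRSA) annotation supplied in values.yaml; the role ARN below is a placeholder for a role you would create in Terraform:

```yaml
serviceAccount:
  create: true
  name: ""
  annotations:
    # Hypothetical IAM role granting the MLflow server S3 access.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/mlflow-server
```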
Then we need a Kubernetes Deployment, which will take care of the pods running our Docker image. We will use the Helm template language instead of plain Kubernetes manifests to keep it configurable (in a template file such as deployment.yaml):
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "mlflow.fullname" . }}-mlflow-server
namespace: {{ .Release.Namespace }}
labels:
{{- include "mlflow.labels" . | nindent 4 }}
spec:
replicas: 1
selector:
matchLabels:
{{- include "mlflow.selectorLabels" . | nindent 6 }}
template:
metadata:
{{- with .Values.podAnnotations }}
annotations:
{{- toYaml . | nindent 8 }}
{{- end }}
labels:
{{- include "mlflow.selectorLabels" . | nindent 8 }}
spec:
{{- if .Values.serviceAccount.create }}
serviceAccountName: {{ include "mlflow.server-service-account-name" . }}
{{- end }}
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
securityContext:
{{- toYaml .Values.securityContext | nindent 12 }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
command:
- mlflow
- server
- --host=0.0.0.0
- --port={{ .Values.service.port }}
- --workers={{ .Values.mlflow.workers }}
- --backend-store-uri=$(BACKEND_STORE_URI)
- --default-artifact-root={{ .Values.mlflow.defaultArtifactRoot }}
- --serve-artifacts
env:
- name: LC_ALL
value: C.UTF-8
- name: LANG
value: C.UTF-8
- name: BACKEND_STORE_URI
valueFrom:
secretKeyRef:
name: {{ .Values.mlflow.backendStoreUriSecretName }}
key: value
ports:
- name: http
containerPort: {{ .Values.service.port }}
protocol: TCP
livenessProbe:
httpGet:
path: /
port: http
readinessProbe:
httpGet:
path: /
port: http
resources:
{{- toYaml .Values.resources | nindent 12 }}
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.affinity }}
affinity:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}
Now you should have a working Helm chart for MLflow. All that’s left is to create a values.yaml file that sets the configurable options of your chart. In our case, we need to set the link to the Docker image and the database credentials:
image:
repository: ""
tag: ""
pullPolicy: IfNotPresent
mlflow:
defaultArtifactRoot: ""
backendStoreUriSecretName: ""
workers: 1
serviceAccount:
create: false
name: ""
annotations: {}
ingress:
enabled: false
name: ""
annotations:
alb.ingress.kubernetes.io/scheme: internal
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/healthcheck-path: /
ingressClassName: alb
path: /*
service:
type: ClusterIP
port: 2202
resources:
limits:
cpu: 2
memory: 4Gi
requests:
cpu: 2
memory: 4Gi
nameOverride: ""
fullnameOverride: ""
imagePullSecrets: []
podAnnotations: {}
podSecurityContext: {}
securityContext: {}
nodeSelector: {}
tolerations: []
affinity: {}
Finally, our chart is ready to be installed. For simplicity, and as a first test, you can install it using just the Helm CLI:
helm install mlflow -f values.yaml --namespace mlflow chart/
After your basic deployment works, you can switch to Terraform and deploy it using a helm_release resource. You will also probably need to create IAM roles and link them to service accounts to grant the proper permissions.
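A minimal sketch of such a release using the Terraform Helm provider (resource names and paths are illustrative):

```terraform
resource "helm_release" "mlflow" {
  name      = "mlflow"
  namespace = "mlflow"
  chart     = "${path.module}/chart"

  values = [
    file("${path.module}/values.yaml"),
  ]
}
```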
We hope this gives you a better idea about how Airflow on Kubernetes (EKS) can be deployed along with MLflow.