Pilotcore

Deploying Airflow and MLflow in Kubernetes on AWS EKS

In part 2 of this series we tackle deploying Airflow and MLflow into our Kubernetes cluster in AWS EKS.

Peter Jung
Cloud Engineer

In the previous article, we described the deployment of your own Kubernetes cluster in AWS using the Elastic Kubernetes Service (EKS). After your cluster is up and running it's time to deploy the first resources to it, in our case Airflow and MLflow.

Airflow

At Pilotcore, we often use Airflow pipelines in our machine learning projects along with MLflow for model management. Airflow is an open-source tool that allows you to programmatically define and monitor your workflows. Since its initial release in 2015, it has gained enormous popularity, and today it's a go-to tool for many data engineers. In combination with EKS, Airflow on Kubernetes can be a reliable, highly scalable tool to handle all your data. Let's look at some of its options and how it can be used along with MLflow on Kubernetes.

Helm

Airflow provides an official Helm chart that can be used for deployments in Kubernetes.

In theory, all you need to do is add the chart repository and run the install command from your command line:

helm repo add apache-airflow https://airflow.apache.org
helm install airflow --namespace airflow --create-namespace apache-airflow/airflow

In practice, of course, a lot of configuration is needed. Most things will depend on your particular use case, but here we will take a look at some considerations.

Git Sync

Airflow's git syncing is a very handy tool to enable GitOps over your DAGs. Simply speaking, Airflow will periodically check the git repository and if it detects changes, it will pull them, automatically updating your DAGs without any additional work.

If you are using the official Airflow Helm chart, enabling git sync is easy: all you have to do is set the correct values in the values.yaml file.

As a first step, enable it, then set the git repository and target branch. By default, the chart syncs the DAGs located in the tests/dags directory of the repository. Our structure is a little more complex, so we set subPath to an empty string to sync everything from the repository root. Note that depth is the shallow-clone depth, that is, how many commits git-sync fetches, not how many nested folders it descends into.

dags:
  gitSync:
    enabled: true
    repo: "ssh://git@github.com/.../.git"
    branch: "main"
    depth: 5
    subPath: ""

You most likely do not keep all your source code publicly available, so you need to provide a secret SSH key that Airflow will use to clone the repository. This can be done safely using a combination of Kubernetes secrets and AWS Secrets Manager.

In AWS Secrets Manager, create a new secret and paste your private Git SSH key as its content. Pay attention to the newline at the end of the content: the key might not work without it, and that is a very tricky bug to catch.
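One way to avoid the missing-newline problem is to normalize the key text before pasting it into the secret. The following is a minimal sketch, not part of the original setup; the function name and the placeholder key content are hypothetical.

```python
def normalize_private_key(key_text: str) -> str:
    """Return the key with Unix line endings and exactly one trailing newline."""
    # Convert Windows or old Mac line endings that can sneak in via copy-paste.
    unix_text = key_text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing whitespace and blank lines, then add back a single newline.
    return unix_text.rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical placeholder content, not a real key.
    pasted = (
        "-----BEGIN OPENSSH PRIVATE KEY-----\r\n"
        "AAAA\r\n"
        "-----END OPENSSH PRIVATE KEY-----"
    )
    fixed = normalize_private_key(pasted)
    print(fixed.endswith("\n"), "\r" in fixed)
```

The output of this helper is what you would store as the secret value.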

Retrieve the value of this secret using Terraform data sources:

data "aws_secretsmanager_secret" "dags_git_ssh_key_secret" {
  name = var.dags_git_ssh_key_name
}

data "aws_secretsmanager_secret_version" "dags_git_ssh_key_secret_version" {
  secret_id = data.aws_secretsmanager_secret.dags_git_ssh_key_secret.id
}

And create a Kubernetes secret using this value:

resource "kubernetes_secret" "dags_git_ssh_key_secret_kube" {
  metadata {
    name      = var.dags_git_ssh_key_name
    namespace = var.namespace
  }

  data = {
    gitSshKey = data.aws_secretsmanager_secret_version.dags_git_ssh_key_secret_version.secret_string
  }
}

Finally, add it to the configuration:

dags:
  gitSync:
    sshKeySecret: "your_secret_name"

Logs

Logs are an essential part of any application. As we will discuss in the third post in this series, Scaling Airflow workers in EKS, our workers have no persistent storage and can be shut down when there are no tasks to run. Because of that, we cannot simply serve logs from the individual worker pods.

In Airflow, you have the option to upload logs to S3, a feature that can be enabled in the Airflow configuration:

[logging]
remote_logging = True
remote_base_log_folder = s3://my-bucket/path/to/logs
remote_log_conn_id = S3CONN

Here, remote_base_log_folder is the destination for your logs and remote_log_conn_id (S3CONN in this example) is the ID of the Airflow connection that holds the credentials for the S3 bucket. You can set the connection either in the web server UI or via the environment variable AIRFLOW_CONN_S3CONN in the following format:

s3://${aws_iam_access_key}:${aws_iam_secret_access_key}@S3
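AWS secret access keys can contain characters such as "/" and "+", and Airflow expects the components of a connection URI to be URL-encoded. The sketch below builds a value in the format shown above using only the Python standard library; the function name and the credentials are made up for illustration.

```python
from urllib.parse import quote


def s3_connection_uri(access_key: str, secret_key: str) -> str:
    """Build a connection URI in the s3://<key>:<secret>@S3 format."""
    # safe="" makes quote() escape "/" too, which it leaves alone by default.
    return f"s3://{quote(access_key, safe='')}:{quote(secret_key, safe='')}@S3"


if __name__ == "__main__":
    # Hypothetical credentials for illustration only.
    print(s3_connection_uri("AKIAEXAMPLE", "abc/def+ghi"))
```

The resulting string is what you would place in AIRFLOW_CONN_S3CONN.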

Keep in mind that when tasks are running, the logs are stored locally on their workers. At that time, they can actually be seen in the web server UI, because the web server will automatically retrieve them from the worker, but they are not yet available in S3.

After the task ends, logs get uploaded to S3 and the worker can be shut down. After this point, the web server will read the logs from S3.

XComs

XComs, short for "cross-communications", are Airflow's mechanism for exchanging data between tasks. They are designed for small pieces of data, and the maximum size depends on the metadata database backend:

  • SQLite: 2 GB
  • Postgres: 1 GB
  • MySQL: 64 KB

A common scenario is that you need to send more than 64 KB. As a workaround, you can serialize the data, upload it somewhere else (S3, SFTP, ...) and send only the link to the data file via the XCom. This is easy, but doing it every time is unnecessary boilerplate. What if we could automate it and let Airflow do it for us in the background? That's exactly what we will describe in a future blog post, Creating custom XCom backend in Airflow.
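To make the manual workaround concrete, here is a rough sketch of the idea. It is not Airflow code: a local directory stands in for S3 or SFTP, JSON stands in for whatever serialization you use, and the function names are hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path

XCOM_LIMIT_BYTES = 64 * 1024  # the MySQL limit quoted above


def to_xcom_value(data, store_dir: Path, limit: int = XCOM_LIMIT_BYTES):
    """Return the data inline if it is small, else store it and return a link."""
    payload = json.dumps(data).encode("utf-8")
    if len(payload) <= limit:
        return {"inline": data}
    # A local directory stands in for S3/SFTP here; swap in a real upload.
    target = store_dir / f"{uuid.uuid4().hex}.json"
    target.write_bytes(payload)
    return {"ref": str(target)}


def from_xcom_value(value):
    """Resolve a value produced by to_xcom_value back into the original data."""
    if "inline" in value:
        return value["inline"]
    return json.loads(Path(value["ref"]).read_bytes())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        big = list(range(50_000))
        sent = to_xcom_value(big, Path(tmp))
        print("ref" in sent, from_xcom_value(sent) == big)
```

The producing task would push the returned dictionary as its XCom, and the consuming task would call from_xcom_value on whatever it pulls.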

MLflow

MLflow is an open-source platform for the machine learning lifecycle. It’s library agnostic, language agnostic and it can scale to large organizations with big data.

We use MLflow heavily for tracking our experiments and for storing and deploying models.

Docker

At the time of writing, MLflow doesn't provide official Docker images or Kubernetes deployment options, so let's create our own.

First, we need to list the Python requirements for our image. Create a file named requirements.txt with the following content:

boto3==1.21.40
matplotlib==3.5.1
mlflow[extras]==1.25.1
psycopg2-binary==2.9.3

You may also consider using a more sophisticated Python package manager like Poetry or Pipenv.

The Dockerfile itself is quite simple:

FROM python:3.9

# Install any apt-dependencies you need here.
# netcat-openbsd is named explicitly because newer Debian releases
# no longer install the virtual "netcat" package.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        curl gnupg2 apt-transport-https apt-utils ca-certificates \
        dumb-init freetds-bin gnupg gosu ldap-utils locales \
        lsb-release netcat-openbsd openssh-client postgresql-client sasl2-bin sudo \
        unixodbc build-essential \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Add any Python requirements.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

Then build it:

docker build . -t mlflow:1.25.1

At this point, you have a local Docker image, but Kubernetes needs a registry to pull it from. You can either create an account on Docker Hub or use Amazon's Elastic Container Registry (AWS ECR).

ECR can again be deployed with Terraform:

resource "aws_ecr_repository" "ecr" {
  name                 = var.name
  image_tag_mutability = var.image_tag_mutability

  encryption_configuration {
    encryption_type = "KMS"
    kms_key         = var.kms_arn
  }

  image_scanning_configuration {
    scan_on_push = true
  }
}

After deployment, you will have your own repository URI that looks like account-id.dkr.ecr.region.amazonaws.com/mlflow.

You will probably need to log in before you can push images. You can do so with the following command (replacing account-id and region with your own values):

aws ecr get-login-password --region region | docker login --username AWS --password-stdin account-id.dkr.ecr.region.amazonaws.com

Then re-tag the previously built image for the new repository:

docker tag mlflow:1.25.1 account-id.dkr.ecr.region.amazonaws.com/mlflow:1.25.1

Afterwards, you can push it and use it in later deployments:

docker push account-id.dkr.ecr.region.amazonaws.com/mlflow:1.25.1
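If you script these steps, it helps to compose the image URI in one place so the tag and push commands cannot drift apart. This is a small sketch under that assumption; the helper name, account ID and region are hypothetical.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Compose <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


if __name__ == "__main__":
    # Hypothetical account ID and region.
    uri = ecr_image_uri("123456789012", "us-east-1", "mlflow", "1.25.1")
    print(f"docker tag mlflow:1.25.1 {uri}")
    print(f"docker push {uri}")
```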

Helm

With the Docker image in hand, we can continue with the creation of the Helm chart.

First, let's create a directory called chart and change into it:

mkdir chart
cd chart

Now we need Chart.yaml with the following content:

apiVersion: v2
name: mlflow
description: A Helm chart for Kubernetes

type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
version: 0.1.0

# Should reflect the version the application is using.
appVersion: 1.25.1

Inside this directory, let's create another one called templates and change into it:

mkdir templates
cd templates

Here we will need to create several template files.

We will start with _helpers.yaml, which holds the helper templates used in the other files (Helm charts conventionally name this file _helpers.tpl):

{{/* vim: set filetype=mustache: */}}

{{/*
Expand the name of the chart.
*/}}
{{- define "mlflow.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "mlflow.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "mlflow.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "mlflow.labels" -}}
release: {{ .Release.Name }}
chart: {{ include "mlflow.chart" . }}
heritage: {{ .Release.Service }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "mlflow.selectorLabels" -}}
app.kubernetes.io/name: {{ include "mlflow.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Variables common across templates
*/}}
{{- define "mlflow.server-service-name" -}}
{{ include "mlflow.fullname" . }}-mlflow-server
{{- end }}

{{- define "mlflow.server-service-account-name" -}}
{{ default (printf "%s-mlflow-server" (include "mlflow.fullname" .)) .Values.serviceAccount.name }}
{{- end}}
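The naming logic in these helpers is easier to follow with concrete inputs. The sketch below mirrors mlflow.fullname and mlflow.server-service-name in plain Python so you can see which names a given release would produce. It is only an illustration of the template logic above, and the Python function names are our own.

```python
from typing import Optional


def trunc63(name: str) -> str:
    """Mirror `trunc 63 | trimSuffix "-"`: cut to 63 chars, drop one trailing dash."""
    cut = name[:63]
    return cut[:-1] if cut.endswith("-") else cut


def fullname(release: str, chart: str = "mlflow",
             name_override: Optional[str] = None,
             fullname_override: Optional[str] = None) -> str:
    """Mirror the mlflow.fullname helper."""
    if fullname_override:
        return trunc63(fullname_override)
    # Equivalent of `default .Chart.Name .Values.nameOverride`.
    name = name_override or chart
    if name in release:
        # The release name already contains the chart name, so use it alone.
        return trunc63(release)
    return trunc63(f"{release}-{name}")


def server_service_name(release: str) -> str:
    """Mirror mlflow.server-service-name with default values."""
    return f"{fullname(release)}-mlflow-server"


if __name__ == "__main__":
    for release in ("mlflow", "tracking"):
        print(release, fullname(release), server_service_name(release))
```

Installing the chart with the release name mlflow therefore yields a service called mlflow-mlflow-server.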

A Service is an abstract way to expose an application running on a set of pods as a network service. We will define one Service that routes traffic inside our cluster to one of the deployed pods.

apiVersion: v1
kind: Service
metadata:
  name: {{ include "mlflow.server-service-name" . }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "mlflow.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
  selector:
    {{- include "mlflow.selectorLabels" . | nindent 4 }}

An Ingress is an API object that manages external access to the services in a cluster. It will expose an HTTP route from outside the cluster to our MLflow service.

{{- if .Values.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ default (printf "%s-mlflow-server" (include "mlflow.fullname" .)) .Values.ingress.name }}
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "mlflow.labels" . | nindent 4 }}
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.ingress.ingressClassName }}
  ingressClassName: {{ .Values.ingress.ingressClassName }}
  {{- end }}
  rules:
  - http:
      paths:
      - path: {{ .Values.ingress.path }}
        pathType: ImplementationSpecific
        backend:
          service:
            name: {{ include "mlflow.server-service-name" . }}
            port:
              number: {{ .Values.service.port }}
{{- end }}

Service accounts in Kubernetes give your pods an identity and a set of permissions. For example, the MLflow pod will need to read data from S3. Thanks to the integration between AWS IAM and Kubernetes in EKS, we can create a service account in Kubernetes, bind it to an IAM role (typically through the eks.amazonaws.com/role-arn annotation on the service account) and set permissions on that IAM role.

{{- if .Values.serviceAccount.create }}
kind: ServiceAccount
apiVersion: v1
metadata:
  name: {{ include "mlflow.server-service-account-name" . }}
  labels:
    {{- include "mlflow.labels" . | nindent 4 }}
  {{- with .Values.serviceAccount.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}

Then we need a Kubernetes Deployment, which takes care of the pods running our Docker image. We use the Helm template language instead of plain Kubernetes manifests to make it configurable.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mlflow.fullname" . }}-mlflow-server
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "mlflow.labels" . | nindent 4 }}
spec:
  replicas: 1
  selector:
    matchLabels:
      {{- include "mlflow.selectorLabels" . | nindent 6 }}
  template:
    metadata:
    {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
    {{- end }}
      labels:
        {{- include "mlflow.selectorLabels" . | nindent 8 }}
    spec:
      {{- if .Values.serviceAccount.create }}
      serviceAccountName: {{ include "mlflow.server-service-account-name" . }}
      {{- end }}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          command:
            - mlflow
            - server
            - --host=0.0.0.0
            - --port={{ .Values.service.port }}
            - --workers={{ .Values.mlflow.workers }}
            - --backend-store-uri=$(BACKEND_STORE_URI)
            - --default-artifact-root={{ .Values.mlflow.defaultArtifactRoot }}
            - --serve-artifacts
          env:
            - name: LC_ALL
              value: C.UTF-8
            - name: LANG
              value: C.UTF-8
            - name: BACKEND_STORE_URI
              valueFrom:
                secretKeyRef:
                  name: {{ .Values.mlflow.backendStoreUriSecretName }}
                  key: value
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /
              port: http
          readinessProbe:
            httpGet:
              path: /
              port: http
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}

Now you should have a working Helm chart for MLflow. All that's left to do is create a values.yaml in the chart directory, next to Chart.yaml, that sets the configurable options of your chart.

In our case, the most important settings are the Docker image reference and the name of the secret that holds the database URI with its credentials:

image:
  repository: ""
  tag: ""
  pullPolicy: IfNotPresent

mlflow:
  defaultArtifactRoot: ""
  backendStoreUriSecretName: ""
  workers: 1

serviceAccount:
  create: false
  name: ""
  annotations: {}

ingress:
  enabled: false
  name: ""

  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /

  ingressClassName: alb

  path: /*

service:
  type: ClusterIP
  port: 2202

resources:
  limits:
    cpu: 2
    memory: 4Gi
  requests:
    cpu: 2
    memory: 4Gi

nameOverride: ""

fullnameOverride: ""

imagePullSecrets: []

podAnnotations: {}

podSecurityContext: {}

securityContext: {}

nodeSelector: {}

tolerations: []

affinity: {}
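The deployment reads the backend store URI from the key named value in the secret referenced by backendStoreUriSecretName. Assuming a PostgreSQL database (the image installs psycopg2-binary), a minimal sketch for composing that URI with URL-encoded credentials could look like this. The helper name, host and credentials are hypothetical.

```python
from urllib.parse import quote


def backend_store_uri(user: str, password: str, host: str, database: str,
                      port: int = 5432) -> str:
    """Build a SQLAlchemy-style PostgreSQL URI with URL-encoded credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # Hypothetical host and credentials for illustration only.
    print(backend_store_uri("mlflow", "p@ss/word", "db.example.internal", "mlflow"))
```

Encoding matters here because characters such as "@" or "/" in a password would otherwise be read as URI delimiters.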

Finally, our chart is ready to be installed. For simplicity, and as a first test, you can install it using just the helm tool:

helm install mlflow -f values.yaml --namespace mlflow chart/
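Once the release is installed, a quick way to confirm the server answers is to probe the same path the liveness probe uses. The sketch below assumes you have forwarded the service port to your machine, for example with kubectl port-forward; the service name in the comment follows from the templates above for a release named mlflow, and the function name is our own.

```python
import urllib.error
import urllib.request


def is_serving(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET on url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


if __name__ == "__main__":
    # Assumes a port-forward such as:
    #   kubectl port-forward svc/mlflow-mlflow-server 2202:2202 -n mlflow
    print(is_serving("http://localhost:2202/"))
```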

After your basic deployment works, you can switch to Terraform and deploy it using helm_release resources. You will probably also need to create IAM roles and link them to service accounts to get the proper permissions.

We hope this gives you a better idea about how Airflow on Kubernetes (EKS) can be deployed along with MLflow. If you struggle or have any questions, please reach out to us and we will be happy to help.