docker-airflow-spark

Building a Modern Data Lakehouse with Spark, Airflow, and MinIO via Docker

Overview

Our project has evolved into a comprehensive system that leverages the lakehouse architecture, combining the best elements of data lakes and data warehouses for enterprise-level data storage and computation.

Pulling Docker Images

Before starting the services, you need to pull the necessary Docker images. Run the following commands:

docker pull bde2020/spark-master:3.2.0-hadoop3.2
docker pull bde2020/spark-worker:3.2.0-hadoop3.2
docker pull minio/minio
docker pull apache/airflow:2.2.3-python3.7
docker pull postgres:9.5.3
docker pull dimeji/hadoop-namenode:latest
docker pull dimeji/hadoop-datanode:latest
docker pull dimeji/hive-metastore
docker pull jupyter/pyspark-notebook:spark-3.2.0

Starting Services

After running airflow-init and pulling the necessary images, start all services with:

docker compose -f docker-compose.lakehouse.yml -f docker-compose.yml up --build -d 

Stopping Services

Once you’re done and you want to stop the services, you can do so with the following command:

docker-compose -f docker-compose.yml -f docker-compose.lakehouse.yml down --remove-orphans -v

Additional Step for Gitpod Users

If you are using Gitpod, you may need to modify the permissions for any file or directory that has permission issues. For example, to modify the permissions for the notebooks and logs directories, use the following commands:

sudo chmod -R 777 /workspace/docker-airflow-spark/notebooks/*
sudo chmod -R 777 /workspace/docker-airflow/*
sudo chmod -R 777 /workspace/docker-airflow-spark/docker-airflow/logs/*

To confirm the services are running, open the URLs listed in the table below:

Table of Docker Images and Services

| Docker Image | Docker Hub Link | Port | Service | Description |
| --- | --- | --- | --- | --- |
| apache/airflow:2.2.3 | Link | 8088 | Airflow | Airflow is a platform created by the community to programmatically author, schedule and monitor workflows. |
| bde2020/spark-master:3.2.0-hadoop3.2 | Link | 8181 | Spark Master | Spark Master is the heart of the Spark application: the central coordinator that schedules tasks on worker nodes. |
| bde2020/spark-worker:3.2.0-hadoop3.2 | Link | 8081 | Spark Worker | Spark Worker is the node that runs the tasks assigned by the Spark Master. |
| jupyter/pyspark-notebook:spark-3.2.0 | Link | 8888 | Jupyter Notebook | Jupyter Notebook is an open-source web application for creating and sharing documents that contain live code, equations, visualizations and narrative text. |
| postgres:9.5.3 | Link | 5432 | Postgres | PostgreSQL is a powerful, open-source object-relational database system. |
| redis:latest | Link | 6379 | Redis | Redis is an open-source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker. |
| minio/minio | Link | 9091 | MinIO | MinIO is a high-performance object storage server released under the Apache License v2.0. It is API compatible with the Amazon S3 cloud storage service. |

Status of Containers

To check the status of the containers, run the following command:

docker ps

Connecting Postgres with Airflow via Airflow DAG

In the Airflow UI, go to Admin -> Connections, click on Create, and fill in the necessary details for the Postgres database.

Click on Save: this creates the connection Airflow will use to reach the Postgres DB. A minimal example DAG that uses this connection is sketched below.
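For reference, a minimal DAG sketch that exercises a Postgres connection from Airflow might look like the following. The connection id postgres_default and the query are placeholders for illustration; adjust them to match the connection you created above.

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# NOTE: the connection id and SQL below are placeholders for illustration.
with DAG(
    dag_id="postgres_connection_check",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Runs a trivial query through the connection created in the Airflow UI.
    check_connection = PostgresOperator(
        task_id="check_connection",
        postgres_conn_id="postgres_default",  # assumed connection id
        sql="SELECT 1;",
    )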

Validate DAG

To validate the DAG results, open a psql session in the Postgres container:

docker exec -it <postgres_container> psql -U airflow airflow

Spark Architecture

Connecting Spark with Airflow via Airflow DAG

  1. Go to the Airflow web UI at http://localhost:8085.

  2. Click on Admin -> Connections in the top bar.

  3. Click on Add a new record and input the following details:

  4. Save the connection settings.

  5. Run the SparkOperatorDemo DAG.

  6. After a couple of minutes, you should see the DAG run marked as successful.

  7. Check the DAG log for the result.

  8. Check the Spark UI at http://localhost:8181 for the result of the DAG and of the spark-submit run from the terminal. A minimal example DAG that uses the Spark connection is sketched after this list.
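For reference, a minimal sketch of a DAG that submits a Spark job through the connection above might look like this. It assumes the apache-airflow-providers-apache-spark package is installed in the Airflow image; the conn_id spark_default and the application path are placeholders, and the SparkOperatorDemo DAG shipped with the project may differ.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# NOTE: conn_id and application path are placeholders for illustration.
with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        conn_id="spark_default",  # assumed Spark connection id
        application="/opt/airflow/dags/scripts/example_job.py",  # hypothetical path
        verbose=True,
    )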

Access Jupyter Notebook

After starting the container, a URL with a token is generated. This URL is used to access the Jupyter notebook in your web browser.

Retrieve the Access URL

You can find the URL with the token in the container logs. Use the following command to retrieve it:

docker logs $(docker ps -q --filter "ancestor=jupyter/pyspark-notebook:spark-3.2.0") 2>&1 | grep 'http://127.0.0.1' | tail -1
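Inside the notebook, you can then create a SparkSession against the standalone cluster. A minimal sketch, assuming the Spark master container is reachable under the hostname spark-master on port 7077 (adjust to the service name in your docker-compose file):

from pyspark.sql import SparkSession

# Connect the notebook to the standalone Spark cluster.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # assumed service name from docker-compose
    .appName("notebook-smoke-test")
    .getOrCreate()
)

# Quick sanity check: create a small DataFrame and show it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()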

Querying the dvdrental Database

Once the Docker container is up and running, you can query the dvdrental database using the psql command-line interface. Here are the steps:

  1. Access the PostgreSQL interactive terminal:
docker exec -it oasis-postgresdb psql -U airflow

  2. List all databases:
\l

You should see dvdrental in the list of databases.

  3. Connect to the dvdrental database:
\c dvdrental

  4. List all tables in the dvdrental database:
\dt

You should see a list of tables such as actor, address, category, etc.

  5. Query a table. For example, to select the first 5 rows from the actor table:
SELECT * FROM actor LIMIT 5;

You should see a table with the columns actor_id, first_name, last_name, and last_update, and the first 5 rows of data.

dvdrental=# SELECT * FROM actor LIMIT 5;

 actor_id | first_name |  last_name   |      last_update
----------+------------+--------------+------------------------
        1 | Penelope   | Guiness      | 2013-05-26 14:47:57.62
        2 | Nick       | Wahlberg     | 2013-05-26 14:47:57.62
        3 | Ed         | Chase        | 2013-05-26 14:47:57.62
        4 | Jennifer   | Davis        | 2013-05-26 14:47:57.62
        5 | Johnny     | Lollobrigida | 2013-05-26 14:47:57.62
(5 rows)
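You can also run the same query from Python. A minimal sketch using psycopg2, assuming the database is exposed on localhost:5432 and that the airflow user's password is airflow (adjust the credentials to your setup):

import psycopg2

# Assumed connection parameters; adjust to your docker-compose configuration.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="dvdrental",
    user="airflow",
    password="airflow",  # placeholder password
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT actor_id, first_name, last_name FROM actor LIMIT 5;")
    for row in cur.fetchall():
        print(row)

conn.close()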

Extras and Useful Commands

Adding New Users: Airflow

airflow users create -u <USERNAME> -f <FIRST> -l <LAST> -r <ROLE> -e <EMAIL>

Memory and CPU utilization

When all the containers are running, you may experience lag if your machine cannot handle the load, so monitoring the CPU and memory utilization of the containers is crucial for keeping the system performant and reliable. To do this, use the docker stats command, which gives a live view of each running container's CPU, memory, network, and disk utilization.

docker stats

Docker Commands

List Images:
$ docker images <repository_name>

List Containers:
$ docker container ls

Check container logs:
$ docker logs -f <container_name>

To rebuild an image after changing something (run inside the directory containing the Dockerfile):
$ docker build --rm -t <tag_name> .

Access container bash:
$ docker exec -it <container_name> bash

Remove Images & Containers:
$ docker system prune -a

Remove Volumes:
$ docker volume prune

Useful docker-compose commands

Start Containers:
$ docker-compose -f <compose-file.yml> up -d

Stop Containers:
$ docker-compose -f <compose-file.yml> down --remove-orphans

Stop Containers & Remove Volumes:
$ docker-compose -f <compose-file.yml> down --remove-orphans -v

Postgres

Enter the Postgres container via the CLI:

docker exec -it postgres_container psql

.env

Before starting Airflow for the first time, we need to prepare our environment. We add the Airflow UID to our .env file so that the files Airflow creates in the mounted directories (dags, logs, plugins) are owned by the host user rather than by root. Create the directories and the .env entries with:

mkdir -p ./dags ./logs ./plugins
chmod -R 777 ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" >> .env
echo -e "AIRFLOW_GID=0" >> .env

Why do we need an ETL pipeline?

Assume we have a set of data that we want to use. However, as with most raw data, it is unclean, missing information, and inconsistent. One solution is to have a program clean and transform this data so that it ends up consistent, complete, and ready for analysis downstream.
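As an illustration of such a transform step, here is a minimal PySpark sketch that deduplicates rows, drops records with missing keys, and normalizes a couple of columns. The column names and sample data are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical messy input: duplicates, missing ids, inconsistent casing.
raw = spark.createDataFrame(
    [(1, "Alice ", "NY"), (1, "Alice ", "NY"), (None, "bob", "ny"), (2, "Carol", None)],
    ["customer_id", "name", "state"],
)

clean = (
    raw.dropDuplicates()                               # remove exact duplicates
       .dropna(subset=["customer_id"])                 # drop rows missing the key
       .withColumn("name", F.trim(F.initcap("name")))  # normalize names
       .withColumn("state", F.upper("state"))          # normalize state codes
)

clean.show()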

Modern Data Lake with Minio

Installation and Setup

Install s3cmd with:

sudo apt update
sudo apt install -y \
    s3cmd \
    openjdk-11-jre-headless  # Needed for trino-cli

Pull and run all services with:

docker-compose up

Configure s3cmd with (or use the minio.s3cfg configuration):

s3cmd --config minio.s3cfg --configure

Use the following configuration for the s3cmd configuration when prompted:

Access Key: minio_access_key
Secret Key: minio_secret_key
Default Region [US]:
S3 Endpoint [s3.amazonaws.com]: localhost:9000
DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: localhost:9000
Encryption password:
Path to GPG program [/usr/bin/gpg]:
Use HTTPS protocol [Yes]: no

To create a bucket and upload data to minio, type:

s3cmd --config minio.s3cfg mb s3://iris
s3cmd --config minio.s3cfg put data/iris.parq s3://iris

To list all objects in all buckets, type:

s3cmd --config minio.s3cfg la
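To verify the upload from Python, here is a minimal sketch using boto3 (which you may need to pip install), pointed at the MinIO endpoint and credentials configured above:

import boto3

# MinIO endpoint and credentials as configured for s3cmd above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minio_access_key",
    aws_secret_access_key="minio_secret_key",
)

# List the objects in the iris bucket uploaded earlier.
response = s3.list_objects_v2(Bucket="iris")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])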

User defined network

User-defined bridge networks provide automatic DNS resolution between containers: on a user-defined bridge (like oasiscorp in our case), containers on the same network can resolve and talk to each other by name or alias. This is very practical, as we don't have to look up and configure specific IP addresses manually.

Official Docker Image Docs

https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html