Automating Spark container builds
In this article I will outline how I leverage an online automated CI flow to build my Spark and Jupyter containers. I use these containers to create data pipelines at work and in my home lab.
Implementing an automated environment greatly simplifies the build, test and distribution of these images. We move from a manual, time-consuming, error-prone process to a modern automated experience.
Gitlab
This automated environment will use my public gitlab.com repository. If you are interested in development and building containers, I highly recommend setting up a free-tier account, which provides access to:
- Unlimited repositories
- 2,000 CI pipeline minutes/month
- Basic support
- Built-in CI/CD
There are also paid accounts if you need extra CI minutes, support and features, …and no, I don’t work for GitLab. ;)
You can view the repository I will be commenting on here: https://gitlab.com/jboothomas/docker_spark
Flow of ‘Spark container’ build
To facilitate updates and the addition of extra packages, I use the following workflow:
- The base image has all the initially required packages and the Spark files
- An extra image is used for additional packages and jar files
- The final application images, based on the extra image, add the scripts and parameters required to run a Spark master, Spark worker or Jupyter service in the container.
The underlying code repository reflects this layout:
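Based on the per-container directories referenced by the CI includes shown later, the layout is roughly as follows (file names other than the CI files are illustrative):
docker_spark/
├── .gitlab-ci.yml
├── base/
│   ├── .gitlab-ci-base.yml
│   ├── Dockerfile
│   └── requirements.*
├── extra/
│   ├── .gitlab-ci-extra.yml
│   ├── Dockerfile
│   └── requirements.*
├── master/
├── worker/
└── jupyter/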
All builds use external files to maintain package and Spark versions: the base and extra images have requirements.* file(s) in which we list our package==version pins. This allows for simple package addition, version updates and control, overall helping roll out less error-prone containers.
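For instance, a requirements.* file for the extra image could look like the lines below (these package names and versions are purely illustrative, not the actual pins from the repository):
# illustrative pins only
pandas==0.25.3
pyarrow==0.15.1
boto3==1.12.0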
CI pipeline
GitLab comes with a built-in CI pipeline; to get it up and running we just need to create the required .gitlab-ci.yml files, specify the tasks to perform, and define the rules for when they apply.
For my Spark containers I run the following GitLab CI pipeline:
The top-level .gitlab-ci.yml file contains the CI stages, some variables, and then includes a separate .gitlab-ci-*.yml file for each container in the flow:
stages:
  - build-base
  - build-extra
  - build
  - test
  - release

variables:
  DOCKER_TLS_CERTDIR: ""
  SPARK: "2.4.4"

include:
  - local: base/.gitlab-ci-base.yml
  - local: extra/.gitlab-ci-extra.yml
  - local: master/.gitlab-ci-master.yml
  - local: worker/.gitlab-ci-worker.yml
  - local: jupyter/.gitlab-ci-jupyter.yml
Build phase
As you can see, we first build the base and then the extra image as separate build stages. The next three builds can run in parallel, as they all use the extra image (so it needs to exist before they are built).
Example build phase for the extra image:
d-build-extra:
  image: docker:stable
  services:
    - docker:dind
  stage: build-extra
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_JOB_TOKEN $CI_REGISTRY
  script:
    - docker build --pull -t "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA" ./extra --build-arg SPARK_VERSION="$SPARK" --build-arg CI_IMAGE="$CI_REGISTRY_IMAGE/sparkbase:$SPARK-$CI_COMMIT_SHORT_SHA"
    - docker tag "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA" "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-latest"
    - docker push "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA"
    - docker push "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-latest"
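The CI_IMAGE and SPARK_VERSION build arguments let the extra image chain onto whichever base image the pipeline just built. A minimal sketch of how the extra Dockerfile might consume them (the actual Dockerfile in the repository may differ):
# Hypothetical extra/Dockerfile sketch, not the repository's actual file
ARG CI_IMAGE
FROM ${CI_IMAGE}
ARG SPARK_VERSION
# install the extra python packages pinned in the requirements file
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt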
The build phase pushes the images to our project’s GitLab container registry:
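For this commit, the registry will therefore contain tags along the lines of (paths assume the project lives under jboothomas/docker_spark; <sha> stands for the short commit SHA):
registry.gitlab.com/jboothomas/docker_spark/sparkbase:2.4.4-<sha>
registry.gitlab.com/jboothomas/docker_spark/sparkextra:2.4.4-<sha>
registry.gitlab.com/jboothomas/docker_spark/sparkextra:2.4.4-latest
…and similarly for the master, worker and jupyter images.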
Test phase
Tests can all run in parallel, so no specific order is required. The tests performed are simple validations of versions and file presence.
These could be improved to validate running services, ports, etc.; that is a future endeavour.
Example test phase for the extra image:
d-test-extra:
  image: docker:stable
  services:
    - docker:dind
  stage: test
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_JOB_TOKEN $CI_REGISTRY
  script:
    - docker run "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA" ls /spark/jars | grep "aws"
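Another simple check of this kind could validate the Spark version baked into the image. The sketch below assumes the Spark distribution lives under /spark, as the jars test suggests; it is not a job from the repository:
# Hypothetical additional test job (sketch)
d-test-extra-version:
  image: docker:stable
  services:
    - docker:dind
  stage: test
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_JOB_TOKEN $CI_REGISTRY
  script:
    # spark-submit prints its version banner to stderr, hence the redirect before grep
    - docker run "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA" /spark/bin/spark-submit --version 2>&1 | grep "$SPARK"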
Release phase
In the release phase the built containers are pulled from the GitLab container registry, tagged for Docker Hub and then pushed, making them publicly available.
For this to work, I must add in my GitLab project settings the variables used by the release phase: DOCK_USER, DOCK_PASS and DOCK_REG, each referencing the values required to access my Docker Hub account (username, password and Docker registry address).
Example release phase for the extra image:
d-release-extra:
  image: docker:stable
  services:
    - docker:dind
  stage: release
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_JOB_TOKEN $CI_REGISTRY
    - docker pull "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA"
  script:
    - docker login -u $DOCK_USER -p $DOCK_PASS $DOCK_REG
    - docker tag "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA" "$DOCK_USER/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA"
    - docker tag "$CI_REGISTRY_IMAGE/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA" "$DOCK_USER/sparkextra:$SPARK-latest"
    - docker push "$DOCK_USER/sparkextra:$SPARK-$CI_COMMIT_SHORT_SHA"
    - docker push "$DOCK_USER/sparkextra:$SPARK-latest"
On my Docker Hub account I can now see the images:
Building Spark 2.4.5
Until now my images have used a Spark 2.4.4 base, but Spark 2.4.5 has been released. To build this version, I simply edit the main .gitlab-ci.yml file and change the version variable: SPARK: "2.4.5"
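The relevant block of the top-level .gitlab-ci.yml then becomes:
variables:
  DOCKER_TLS_CERTDIR: ""
  SPARK: "2.4.5"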
The CI pipeline launches and, after a few minutes, our new Spark 2.4.5 images are available:
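Following the tagging scheme above, they can then be pulled straight from Docker Hub, for example:
docker pull jboothomas/sparkmaster:2.4.5-latest
docker pull jboothomas/sparkworker:2.4.5-latest
docker pull jboothomas/sparkjupyter:2.4.5-latest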
Wrapping up
All the code and images generated can be viewed and used:
Code: https://gitlab.com/jboothomas/docker_spark/
Images:
https://hub.docker.com/repository/docker/jboothomas/sparkmaster
https://hub.docker.com/repository/docker/jboothomas/sparkworker
https://hub.docker.com/repository/docker/jboothomas/sparkjupyter
To use the 2.4.4 release from a Kubernetes pod or Docker specification, simply specify one of the following as the image:
jboothomas/sparkmaster:2.4.4-latest
jboothomas/sparkworker:2.4.4-latest
jboothomas/sparkjupyter:2.4.4-latest
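As an illustration, a minimal Kubernetes pod manifest using the master image might look like the following (the pod and container names are placeholders, and a real deployment would add ports, resources and worker configuration):
# Minimal illustrative pod spec, not taken from the repository
apiVersion: v1
kind: Pod
metadata:
  name: spark-master
spec:
  containers:
    - name: spark-master
      image: jboothomas/sparkmaster:2.4.4-latest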
Happy building!