source: https://haveagreatdata.com/posts/step-by-step-docker-image-for-data-science-projects/
Have a great data!
Step-By-Step: Creating a Docker Image for Your Data Science Project
Docker is widely used for software development nowadays and, together with Kubernetes, has become the de facto standard for creating scalable systems in the cloud. Sooner or later, many data scientists need to use Docker to deploy their projects or at least become interested in trying it out. This article will show you, step by step, how to create a Docker image from your conda environment.
We assume you have already installed Docker; beyond that, the article should be largely self-contained. However, if you have never used Docker before, it may be a good idea to go through the official Docker quickstart first.
Project Skeleton
First, let’s set up a basic project skeleton for which we will develop the Dockerfile later on. The source code, containing the final Dockerfile, is also available on GitHub.
As the project's python code, we will use a simple hello world program located under src/hello.py:
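A minimal sketch of such a program, assuming a single click command. The --name option and its default are illustrative assumptions, chosen so that the default greeting matches the "hello world!" output mentioned later on:

```python
import click


@click.command()
@click.option("--name", default="world", help="Who to greet (hypothetical option).")
def hello(name):
    """Print a short greeting."""
    click.echo(f"hello {name}!")


if __name__ == "__main__":
    hello()
```

Run as python src/hello.py, this sketch prints the default greeting; passing --name changes who is greeted.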
As you can see, it uses click to create a simple CLI that prints out a greeting.
Conda allows us to specify our project's dependencies in a YAML file called environment.yml. The file names the environment my_project, and the only requirements are a modern python version (pinned to 3.7.6, for a reason given further below) and the click library. If you have never used a conda environment file before, check out our article on managing python dependencies in data science projects to find out why that’s a good idea.
Now, we are only missing the Dockerfile itself, which we will go through in more detail below. If you want to follow along with the steps, simply create an empty file for now. The project folder should now contain Dockerfile, environment.yml and src/hello.py.
Dockerfile
To get our project up and running inside a Docker container, at the very least we have to do the following steps:
- Choose a base image
- Set up a python environment with the required libraries
- Add the project source code
Choose a Base Image
The first step in creating your Dockerfile is to choose a base image. It will provide a root file system, pre-installed software and some basic configuration. This step already presents us with a multitude of choices: there are over 100,000 images available on Docker Hub! Even if we narrow it down to a python project, there are many options, like starting with a vanilla OS image, using the official python image or choosing a more specialized one.
Since many of us data scientists are working with Anaconda / Miniconda python distributions, we will choose the official continuumio/miniconda3 image here. At the time of writing, it is based on the debian:buster-slim OS image. For beginners, debian-based images are great since they are widely used and you will find a ton of online resources about them. Beyond that, they are generally a good choice for python projects.
The official miniconda3 image already has a miniconda python distribution installed. It also adds conda’s base environment executable directory to the PATH (as can be seen in the Dockerfile), so you can run python related commands directly without knowing where they are located. This will become important later on.
For now, let’s choose it as our base image by putting a FROM instruction for continuumio/miniconda3 into our Dockerfile.
Note that I chose a specific version tag for the base image (the most recent version at the time of writing). There are good reasons not to simply use the image with the latest tag: it is a moving target, so rebuilding the image later may silently pull in a different base.
Now, we can build the first version of our project Docker image and tag (-t) it as my_project:latest by executing docker build -t my_project:latest . in the project folder (the trailing dot tells Docker to use the current directory as the build context).
Let’s now run the container to enter the shell and run some commands to inspect what it looks like. In order to use a shell, make sure to set the interactive (-i) and TTY (-t) flags, for example with docker run -it my_project bash.
To find out which python version is currently installed, we run python --version inside the container. It reports Python 3.7.6.
It is also interesting to find out where the python executable lives. Running which python says /opt/conda/bin/python. If you’re interested in seeing a full list of the installed conda packages and their versions, simply run conda list.
Actually, we could have found out all of the above by simply running a container off the base image itself. But this way, we already tested building and running our own image.
It’s time to add our requirements to make it more useful.
Set-Up a Python Environment With the Required Libraries
On our local machine we can simply run conda env create -f environment.yml in our project folder to create a conda environment with the desired libraries. Then, we need to activate the environment with conda activate my_project before we can actually run our python programs. Doing the same in a Docker image, however, is not as straightforward.
It turns out we have two options here:
- Create a new conda environment and use conda run to run our program as described here.
- Update the base environment with our requirements.
I am using the latter option in my daily work for three main reasons:
- The Dockerfile is easier to read and understand than the one needed for Option 1.
- If you run a shell in the container, you can directly interact with the project's python environment without manually activating a conda environment or creating a custom entrypoint that does that for you.
- If your project runs under the python version of the base image, your image will stay smaller since there is no need to install a second python version. (This is the reason why I pinned the python version to 3.7.6 in the conda environment file above!)
To create the desired python environment, we add two more instructions to the Dockerfile: one that copies the environment file into the image and a RUN that installs the requirements.
The conda environment file is copied into the folder /opt/env/ and then the conda base environment is updated. The environment name given in the file is overridden by the -n flag (-n base), so the base environment is the one that gets updated. This is great, since you can still use the environment called “my_project” locally on your system without renaming it to “base”.
Also, conda clean is run to clean up index and package caches to keep our image small. It is important to run the update and clean commands in the same RUN instruction. Every RUN creates a new layer in the image, and previous layers are immutable. If we ran the conda clean command in a separate RUN, the deleted files would no longer be visible in the final file system, but they would still take up disk space in the earlier layer and could still be extracted from the image.
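The effect on image size can be illustrated with a toy model. This is only a sketch: it assumes that a layer stores exactly the files a RUN wrote and left in place, and the sizes and file labels are made-up numbers for illustration, not measurements.

```python
def layer_size(written, deleted):
    """Size stored by one RUN: files it wrote that still exist when it finishes."""
    return sum(size for path, size in written.items() if path not in deleted)


# Hypothetical sizes in MB, chosen only for illustration.
update_writes = {"installed packages": 300, "package cache": 200}
cache = {"package cache"}

# Update and clean in the same RUN: the cache never reaches a layer.
single_run = layer_size(update_writes, deleted=cache)

# Update and clean in separate RUNs: the first layer keeps the cache, and the
# second layer only records that the files are hidden.
separate_runs = layer_size(update_writes, deleted=set()) + layer_size({}, deleted=cache)

print(single_run, separate_runs)  # 300 500
```

In this model, the second variant carries the cache in its first layer for good, no matter what later layers delete.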
If you want, build the image again, run it and execute conda list inside the container to verify that the correct requirements are now installed.
Add the Project Source Code
Now that our project environment is ready, we can copy the project source code and add a CMD instruction so that our program starts when the container is run without additional arguments.
If we build the image again and run docker run my_project, we are finally greeted with “hello world!”. Note that we set the working directory (WORKDIR). That means we can reference our program hello.py directly instead of providing the full path. Also, we start off in the project folder if we run a shell in the container with docker run -it my_project bash.
Great! We were able to do all this by using only six instructions. But are we really done yet? We could certainly use the image above. However, there are some details that should be considered for deployment.
Advanced Considerations
Firstly, in the above image, the python program runs under the root user inside the container. Most of the time, particularly if you deploy a machine learning model, root privileges are not needed by your program. As a general rule, if no elevated rights are needed, the program should run under a non-root user. This is also stated clearly in the Docker best practices. We will add a non-root user in our final version of the Dockerfile below.
Secondly, the python program runs as PID 1. Normally, in a Linux operating system, init would be running as the first process. The PID 1 process has some special responsibilities, for instance forwarding signals, and the kernel does not apply the default signal actions to it. Since most of the python programs we write do not install their own signal handlers, a SIGTERM (which is sent by docker stop to gracefully exit the container) will be ignored by our container. That’s why docker stop will resort to forcefully terminating the container using SIGKILL after ten seconds.
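Both properties can be checked from inside the container. The following is a small sketch that assumes a Unix-like system; the function name process_info is hypothetical and not part of the example project:

```python
import os


def process_info():
    """Report the two properties discussed above for the current process."""
    return {
        "pid": os.getpid(),
        "is_pid_1": os.getpid() == 1,
        "is_root": os.geteuid() == 0,  # effective user id 0 is root
    }


if __name__ == "__main__":
    print(process_info())
```

Started as the container's main command in the image built so far, one would expect this to report both is_pid_1 and is_root as True.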
This can be easily verified by printing the greeting in an endless loop every second in the python program above and then trying to stop the container using docker stop (which is what our cluster manager might try to do). You will see that the container ignores the polite request to stop printing "hello world" and will be forcefully removed from the club by the SIGKILL bouncer ten seconds later.
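For comparison, this is roughly what handling the signal in the program itself would involve. It is a minimal sketch and not the approach taken in this article, which relies on an init process instead; the names state, request_stop and greet_until_stopped are hypothetical.

```python
import signal
import time

# Shared flag that the signal handler flips when a shutdown is requested.
state = {"stop_requested": False}


def request_stop(signum, frame):
    """Signal handler: remember that a shutdown was requested."""
    state["stop_requested"] = True


def greet_until_stopped(interval=1.0):
    """Print the greeting every `interval` seconds until SIGTERM arrives."""
    signal.signal(signal.SIGTERM, request_stop)
    while not state["stop_requested"]:
        print("hello world!", flush=True)
        time.sleep(interval)
    print("SIGTERM received, exiting", flush=True)
```

Calling greet_until_stopped() at the end of the script starts the loop. Without the signal.signal line, the loop is the endless variant described above, which a SIGTERM cannot stop when the program runs as PID 1.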
This behavior can lead to problems, which is why it is better to handle this case. Again, we have multiple options here. If you are using Docker 1.13 or later and you are able to specify run arguments, you can add the --init flag to your run command, for example docker run --init my_project.
It will use the built-in version of tini as a lightweight replacement for init to run as PID 1.
A better alternative is to add tini (or dumb-init) to the image in any case, since you may not be able to set the --init flag in your deployment environment.
Adding a non-root user, installing tini and setting it as the entrypoint, we arrive at the final version of our Dockerfile. This final version, including the example project code, is available on GitHub.
That’s it for now and I hope you found this step-by-step walk-through useful! I’d love to hear about your workflow and suggestions for improving mine. Just write a mail to mail@haveagreatdata.com.