The basics of Docker

This section is based mostly on the information available in the docker documentation. You can also find an example of utilizing docker for a data science project at datacamp.com.

According to an answer on stackoverflow, an IBM research paper (Felter et al. 2015) found that docker achieves close to native performance, while Kernel-based Virtual Machines are slower in comparison.

Furthermore, each VM requires a full copy of the OS and its applications, which makes running multiple VMs resource-intensive. In contrast, docker provides the ability to package and run an application in an isolated environment called a container. Containers are lightweight and contain everything needed to run the application without requiring a full copy of the OS, so you don’t need to rely on what’s installed on the host, other than the docker engine application itself.

What is docker?

Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure1.

There are two docker objects that we are interested in, namely:

  • images - an image is a read-only template with instructions for creating a docker container. For example, you may build an image which is based on the ubuntu image, but installs Python and R, as well as the configuration details needed to make these applications run. The image must contain everything needed to run an application - all dependencies, configurations, scripts, binaries, environment variables, startup commands, etc.
  • containers - a container is a runnable instance of an image and is defined by its image as well as any configuration options you provide to it when you create or start it (e.g. custom port mapping, volumes, etc. specific to that container). It is a sandboxed process running on a host machine that is (by default) isolated from all other processes running on that host machine. You can create, start, stop, move, or delete containers.

A Note on immutability in docker
  • Docker images are immutable, so you cannot change them once they are created (except deleting the image itself).
  • When a container is removed, any changes to its state that aren’t stored in persistent storage disappear (it is usually recommended to remove containers that are no longer in use - in fact, containers can be set up to be automatically removed once they are stopped).

For this reason it is recommended to use volumes in order to save persistent data inside a directory on the host machine. Note, however, that you also need to configure your docker image to save data to the mounted volume directory.
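As a small sketch of how volumes keep data around (the volume name `appdata` and the file paths here are illustrative, and the commands assume a running docker engine):

```shell
# Create a named volume and write a file into it from one container.
docker volume create appdata
docker run --rm -v appdata:/data ubuntu:22.04 bash -c 'echo hello > /data/greeting.txt'

# A second, fresh container mounting the same volume still sees the file,
# even though the first container has already been removed.
docker run --rm -v appdata:/data ubuntu:22.04 cat /data/greeting.txt

# Clean up the volume once it is no longer needed.
docker volume rm appdata
```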

Dockerfile instructions

Docker can build images automatically by reading the instructions from a .dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Some of the supported instructions are:

  • FROM - create a new build stage from a base image.
  • LABEL - add metadata to an image.
  • ADD - add local or remote files and directories.
  • ARG - use build-time variables.
  • ENV - set environment variables.
  • COPY - copy files and directories from your host machine (or from different build stages, in the case of multi-stage builds) to the docker image.
  • RUN - execute build commands.
  • EXPOSE - describe which ports your application is listening on.
  • CMD - specify default commands.
  • ENTRYPOINT - specify default executable.

See here for a full list of instructions.
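As a quick illustration of how the last two instructions interact (a minimal sketch; the base image is arbitrary):

```dockerfile
FROM ubuntu:22.04

# ENTRYPOINT fixes the executable; CMD supplies its default arguments.
ENTRYPOINT ["echo"]
CMD ["hello from the container"]

# docker run <image>          runs: echo "hello from the container"
# docker run <image> goodbye  runs: echo "goodbye"  (CMD is overridden)
```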

Layers

A docker build consists of a series of ordered build instructions. Each instruction in a .dockerfile roughly translates to an image layer. When you build an image, the builder attempts to reuse layers from earlier builds. If a layer of an image is unchanged, then the builder picks it up from the build cache. If a layer has changed since the last build, that layer, and all layers that follow, must be rebuilt. Therefore, the order of .dockerfile instructions matters. Usually, you would want to place the download or installation instructions that take a long time first, and the shorter, more frequently-changing instructions after them. That way, if you ever need to change one of the later instructions, the builder won’t need to re-download any files, as they were already downloaded in an earlier, cached layer.
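The ordering advice above can be sketched as follows (the package names and paths are illustrative, and the sketch assumes Python has already been installed under /opt/python as later in this tutorial):

```dockerfile
FROM ubuntu:22.04

# Slow, rarely-changing step first: its layer is cached across builds.
RUN apt-get update && apt-get install -yq --no-install-recommends curl ca-certificates

# The dependency list changes occasionally; this step re-runs only
# when requirements.txt itself changes.
COPY requirements.txt /tmp/requirements.txt
RUN /opt/python/3.11.7/bin/python3 -m pip install -r /tmp/requirements.txt

# Application code changes most often, so it goes last: editing it
# invalidates only this final layer.
COPY . /app
```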

Unfortunately, having many layers usually increases the size of the image. To combat this, multi-stage builds can be used. This is especially useful if you know for a fact that you only need this image as a final product and won’t have to reuse parts of it to build other, similar images. A multi-stage build can then reduce the number of layers in your image to only a small subset (for example, reducing a 60+ layer image to only a 2-layer one by using COPY to copy the result of the previous multi-layer build stage and then finalizing with a CMD instruction).
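A minimal multi-stage sketch (the python:3.11-slim base image and the /install prefix are assumptions for illustration, not part of the images built in this tutorial):

```dockerfile
# Stage 1: install dependencies into an isolated prefix.
FROM python:3.11-slim AS builder
COPY requirements.txt /tmp/requirements.txt
RUN pip install --prefix=/install -r /tmp/requirements.txt

# Stage 2: copy over only the installed result; the intermediate layers
# of the builder stage do not end up in the final image.
FROM python:3.11-slim
COPY --from=builder /install /usr/local
CMD ["python"]
```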

Note

On Windows you need to start the docker software by launching the Docker Desktop app. You need to keep this app open to build images and run containers.

Creating the base Ubuntu image

Firstly, create a file named basic_1.dockerfile with the following instructions:

basic_1.dockerfile
FROM ubuntu:22.04

# https://stackoverflow.com/a/65054865 
ENV DEBIAN_FRONTEND=noninteractive

Then build this docker image by running the following command on your machine’s terminal (make sure to change the directory in the terminal to the same location as your .dockerfile):

run in your machine's terminal
docker build -f ./basic_1.dockerfile -t basic-ubuntu --pull=false .

After the build finishes run the following command:

run in your machine's terminal
docker image ls

to verify that an image named basic-ubuntu is available.

Exploring the created container

Note

While connected to the container, you will be using bash to run a variety of commands.

Unlike on Windows, where we are accustomed to downloading software through a browser and installing it by running an installation wizard, here we can completely automate the download and installation process using bash commands.

In order to get familiar with the bash shell, see the section on the Linux command line, which uses the Ubuntu container to demonstrate some of the most common commands (some of which are very similar to commands in a Windows terminal).

Run the following code in your terminal in order to create a container based on the basic-ubuntu image and launch a bash terminal connected to it:

run in your machine's terminal
docker run --rm --name temp-container -it basic-ubuntu bash

The --rm argument means that we want to automatically destroy the container after we close it, while the -it flags attach an interactive terminal to the container.

Your terminal window should now show the bash terminal connected as root:

root@abcd12345678:/#

While connected to your docker container, run the following code to view available directories:

run in your machine's terminal while connected to the docker container
ls

Firstly, we’ll verify that changes made inside a container do not persist - we will remove a couple of folders with the following commands:

run in your machine's terminal while connected to the docker container
rm -rf tmp
rm -rf home
ls

With the ls command you should see that the tmp and home folders are gone. Next run the following command to exit the container:

run in your machine's terminal while connected to the docker container
exit

Next, launch the container again:

run in your machine's terminal
docker run --rm --name temp-container -it basic-ubuntu bash

Next, run ls inside the container and verify that the tmp and home folders are back.

In other words, any changes we make inside the container will disappear once we launch a new container. As mentioned before, a container is based on the created image and is used to launch some kind of application that we configure. If we want to save any files or settings, we should use volumes, which are persistent. We will see how to do this after adding Python to our image.

Adding Python

Firstly, launch the container and connect via bash:

run in your machine's terminal
docker run --rm --name temp-container -it basic-ubuntu bash

Then, based on Posit’s documentation on Python installation, we can run a number of commands to download and install Python inside the container, in order to make sure that we don’t get any errors when we later build this kind of image.

Firstly, we download a couple of prerequisite Linux packages in order to download and install Python (the -yq flags answer prompts automatically and reduce output, while --no-install-recommends skips optional packages):

run in your machine's terminal while connected to the docker container
apt-get update --fix-missing && apt-get upgrade -yq
apt-get install -yq --no-install-recommends curl ca-certificates gdebi-core

Next, we will create an environment variable for the Python version. For example:

run in your machine's terminal while connected to the docker container
export PYTHON_VERSION=3.11.7

We can see the value of this variable by running:

run in your machine's terminal while connected to the docker container
echo $PYTHON_VERSION

We then download the Python installer, using the environment variable as part of the URL:

run in your machine's terminal while connected to the docker container
curl -O https://cdn.rstudio.com/python/ubuntu-2204/pkgs/python-${PYTHON_VERSION}_1_amd64.deb

Using the ls command you should see the .deb file.

We then install Python from this file, delete the .deb file and add Python’s location to our PATH variable:

run in your machine's terminal while connected to the docker container
apt-get install -yq --no-install-recommends ./python-${PYTHON_VERSION}_1_amd64.deb
rm -rf python-${PYTHON_VERSION}_1_amd64.deb
/opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip
/opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade setuptools
echo "PATH=/opt/python/${PYTHON_VERSION}/bin:$PATH" >> ~/.bashrc
source ~/.bashrc

By running:

run in your machine's terminal while connected to the docker container
python --version

you can verify that we have successfully installed Python in our container. Sadly, once we exit the container, all of our changes will be reverted. Nevertheless, this is a good way to test any of the instructions that we wish to add to our .dockerfile.

On that note, exit the container. Then, create a new .dockerfile named basic_2.dockerfile with the following instructions:

basic_2.dockerfile
FROM ubuntu:22.04

# https://stackoverflow.com/a/65054865 
ENV DEBIAN_FRONTEND=noninteractive

# https://www.python.org/downloads/
ARG PYTHON_VERSION=3.11.7

# https://docs.posit.co/resources/install-python/
RUN apt-get update --fix-missing  \
    && apt-get upgrade -yq \
    && apt-get install -yq --no-install-recommends curl ca-certificates gdebi-core

RUN curl -O https://cdn.rstudio.com/python/ubuntu-2204/pkgs/python-${PYTHON_VERSION}_1_amd64.deb \
    && apt-get install -yq --no-install-recommends ./python-${PYTHON_VERSION}_1_amd64.deb \
    && rm -rf python-${PYTHON_VERSION}_1_amd64.deb \
    && /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip \
    && /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade setuptools

RUN echo "PATH=/opt/python/${PYTHON_VERSION}/bin:$PATH" >> ~/.bashrc

And build this image:

run in your machine's terminal
docker build -f ./basic_2.dockerfile -t basic-python --pull=false .

Note that this time the build process will take longer, since we are executing more instructions, which involve downloading and installing Python.
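As an aside, the PATH line appended to ~/.bashrc only takes effect in interactive bash shells. If you also want non-interactive processes (for example a CMD instruction, or commands run via docker exec) to find Python, one common alternative - a sketch, not part of the files in this tutorial - is to set PATH with an ENV instruction instead:

```dockerfile
# ARG values such as PYTHON_VERSION are substituted at build time, and
# ENV makes the resulting PATH visible to every process in the image,
# not just interactive bash sessions that source ~/.bashrc.
ENV PATH=/opt/python/${PYTHON_VERSION}/bin:$PATH
```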

Adding libraries

Connect to our container with Python already installed:

run in your machine's terminal
docker run --rm --name temp-container -v ${PWD}:/media/container_shared/ -it basic-python bash

Note that in the above code we have added a volume with the -v argument. We have bound the current working directory of our machine’s terminal to the /media/container_shared directory inside the container. We will use it to back up the versions of the Python libraries we want installed in our image.

Firstly, we will install a number of libraries and save their versions in a file named requirements.txt:

run in your machine's terminal while connected to the docker container
apt-get update --fix-missing && apt-get install -yq build-essential libcairo2-dev libpango1.0-dev ffmpeg
python -m pip install --upgrade pip
python -m pip install -U virtualenv numpy scipy numexpr matplotlib pandas 'modin[all]' statsmodels 'datatable>1.0.0' pymc 'cmdstanpy[all]' 'arviz[all]' bambi jupyterlab jupyterlab-lsp jupyter-cache scikit-learn tensorflow keras pyarrow polars duckdb tzdata datar patsy plotnine seaborn siuba streamlit great-tables skimpy beautifulsoup4 manim lckr-jupyterlab-variableinspector
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
python -m pip freeze > requirements.txt

After this, we can view the requirements.txt file:

run in your machine's terminal while connected to the docker container
cat requirements.txt

We then copy this file to our host machine from inside the container:

run in your machine's terminal while connected to the docker container
cp requirements.txt /media/container_shared/my_python_requirements.txt

Next, go to the directory on your host machine and verify that you have indeed saved a file named my_python_requirements.txt. Then, open this file and add the following line at the beginning: --find-links https://download.pytorch.org/whl/torch_stable.html (this is needed, since we have installed the pytorch libraries from a different repository).
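If you prefer to do this from the command line, a sed one-liner (GNU sed, as found on Linux; on macOS the -i flag needs an extra argument) can prepend the line. The sample file contents below are illustrative - in practice you would run only the sed command on the my_python_requirements.txt you exported:

```shell
# Create a small sample requirements file for demonstration purposes.
printf 'torch==2.1.2\nnumpy==1.26.4\n' > my_python_requirements.txt

# Insert the --find-links line before line 1 of the file.
sed -i '1i --find-links https://download.pytorch.org/whl/torch_stable.html' my_python_requirements.txt

# Show the resulting first line.
head -n 1 my_python_requirements.txt
```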

Now that we have the list of libraries and their versions, we can create the final image. On that note, exit the container, if you haven’t already.

Final image

Finally, create a new .dockerfile named basic_3.dockerfile with the following instructions:

basic_3.dockerfile
FROM ubuntu:22.04

# https://stackoverflow.com/a/65054865 
ENV DEBIAN_FRONTEND=noninteractive

# https://www.python.org/downloads/
ARG PYTHON_VERSION=3.11.7

# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# Add Python
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

# https://docs.posit.co/resources/install-python/
RUN apt-get update --fix-missing  \
    && apt-get upgrade -yq \
    && apt-get install -yq --no-install-recommends curl ca-certificates gdebi-core

RUN curl -O https://cdn.rstudio.com/python/ubuntu-2204/pkgs/python-${PYTHON_VERSION}_1_amd64.deb \
    && apt-get install -yq --no-install-recommends ./python-${PYTHON_VERSION}_1_amd64.deb \
    && rm -rf python-${PYTHON_VERSION}_1_amd64.deb \
    && /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip \
    && /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade setuptools

RUN echo "PATH=/opt/python/${PYTHON_VERSION}/bin:$PATH" >> ~/.bashrc

# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# Install Python libraries
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# Required libraries for some Python libraries (e.g. manim):
RUN apt-get update --fix-missing \
    && apt-get install -yq build-essential libcairo2-dev libpango1.0-dev ffmpeg

COPY /my_python_requirements.txt /tmp/requirements.txt

RUN /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip \
    && /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install -r /tmp/requirements.txt

Note that the COPY instruction copies the my_python_requirements.txt file from your host machine to the image (assuming that your machine’s terminal is opened in the same directory where my_python_requirements.txt and basic_3.dockerfile are).

Finally, build this image:

run in your machine's terminal
docker build -f ./basic_3.dockerfile -t additional-python --pull=false .

Since we now include the installation of Python libraries, the build time of this image increases as well.

Finally, launch a container based on this image:

run in your machine's terminal
docker run --rm --name temp-container -it additional-python bash

And run a couple of commands to see the sizes of the top-level directories, as well as the Python version and the version of one of the installed libraries; then exit the container:

run in your machine's terminal while connected to the docker container
du -h -d 1 / | sort -hr
du -h -d 1 / -t +100MB 2> >(grep -v '^du:') | sort -hr
python --version
python -c "import statsmodels; print(statsmodels.__version__)"
exit

Extra commands - pruning build cache, removing images, etc.

After building (or re-building) many images, you will notice that quite a bit of space is taken by the build cache. We can inspect the total size taken up by everything docker-related (e.g. other images that we have):

run in your machine's terminal
docker system df

Once you are certain that you won’t need to develop and re-build your images, you can clear the build cache with the following command:

run in your machine's terminal
docker builder prune --force

We can list all of our images:

run in your machine's terminal
docker image list

and remove some of the images no longer needed, for example basic-ubuntu and basic-python:

run in your machine's terminal
docker rmi $(docker images 'basic-ubuntu' -a -q)
docker rmi $(docker images 'basic-python' -a -q)
docker image list

Alternatively, we can stop all active containers and remove ALL images:

run in your machine's terminal
docker stop $(docker ps -a -q)
docker system prune -a --volumes --force
docker system df

  1. There are some limitations. By default, a docker image is built targeting the CPU architecture of the machine that is building it (on Windows and macOS, Docker Desktop builds Linux images inside a lightweight virtual machine). An image built on an x86-64 machine will not, by default, run on ARM-based machines (such as Apple Silicon Macs); to support both architectures you would need to configure a multi-platform build. Furthermore, if you want to target an ARM-based architecture, it is very likely that you will need to modify the commands inside your dockerfile to download ARM-based software (instead of software based on the x86-64 architecture).↩︎