This section is based mostly on the information available in the docker documentation. You can also find an example of using docker for a data science project at datacamp.com.
Furthermore, based on an answer on stackoverflow, an IBM research paper (Felter et al. 2015) found that docker is close to native performance, while Kernel-based Virtual Machines are slower in comparison.
Moreover, each VM requires a full copy of the OS and its applications, which makes running multiple VMs resource-intensive. In contrast, docker provides the ability to package and run an application in an isolated environment called a container. Containers are lightweight and contain everything needed to run the application without requiring a full copy of the OS, so you don’t need to rely on what’s installed on the host, other than having the docker engine application itself.
docker
Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure.
There are two docker objects that we are interested in, namely: the docker image and the docker container. For example, you may build an image which is based on the ubuntu image, but installs Python and R, as well as the configuration details needed to make these applications run. The image must contain everything needed to run an application - all dependencies, configurations, scripts, binaries, environment variables, startup commands, etc. A container, in turn, is a runnable instance of an image; any changes made inside a container are lost once it is removed. For this reason it is recommended to use volumes in order to save persistent data inside a directory on the host machine. Note, though, that you also need to configure your docker image to save data to the mounted volume directory.
docker can build images automatically by reading the instructions from a .dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Some of the supported instructions are:
FROM - create a new build stage from a base image.
LABEL - add metadata to an image.
ADD - add local or remote files and directories.
ARG - use build-time variables.
ENV - set environment variables.
COPY - copy files and directories from your host machine (or from different build stages, in the case of multi-stage builds) to the docker image.
RUN - execute build commands.
EXPOSE - describe which ports your application is listening on.
CMD - specify default commands.
ENTRYPOINT - specify default executable.
See here for a full list of instructions.
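To make these instructions concrete, here is a small hypothetical .dockerfile sketch that combines several of them (the script name, port and label are made up for illustration):

```dockerfile
# Base image for the build
FROM ubuntu:22.04
# Build-time variable (available only while building the image)
ARG APP_VERSION=1.0.0
# Environment variable (also available inside running containers)
ENV APP_HOME=/opt/app
# Metadata attached to the image
LABEL version="${APP_VERSION}"
# Copy a (hypothetical) script from the host into the image
COPY app.sh ${APP_HOME}/app.sh
# Execute a build command
RUN chmod +x ${APP_HOME}/app.sh
# Document the port the application listens on
EXPOSE 8080
# Default command when a container is started from this image
CMD ["/opt/app/app.sh"]
```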
A docker build consists of a series of ordered build instructions. Each instruction in a .dockerfile roughly translates to an image layer. When you build an image, the builder attempts to reuse layers from earlier builds. If a layer of an image is unchanged, then the builder picks it up from the build cache. If a layer has changed since the last build, that layer, and all layers that follow, must be rebuilt. Therefore, the order of .dockerfile instructions matters. Usually, you would want to place the download or installation instructions that take a long time first, and the shorter, more frequently changed instructions after them. That way, if you ever need to change one of the later instructions, the builder won’t need to re-download any files, as they were already downloaded in a previous layer.
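This caching behaviour is, for example, why Python projects typically copy their requirements file and install the dependencies before copying the rest of the source code. A sketch (the file names are illustrative, and we assume pip is already available in the base image):

```dockerfile
FROM ubuntu:22.04
# Slow step: the requirements change rarely, so this layer is usually cached
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
# Fast step: the source code changes often, but only this layer
# (and the ones after it) will be rebuilt
COPY . /opt/app
```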
Unfortunately, having many layers usually increases the size of the image. To combat this, multi-stage builds can be used. This is especially useful if you know for a fact that you only need to build this image completely and won’t have to use parts of it to build other, similar images. In that case, a multi-stage build can reduce the number of layers in your image to only a small subset (for example, reducing a 60+ layer image to only a 2-layer one by using COPY to copy the previous multi-layer build result and then finalizing with a CMD instruction).
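A minimal sketch of the multi-stage idea (the stage name, build commands and file paths are hypothetical): the first stage accumulates many layers while building, and the final stage copies in only the result:

```dockerfile
# Stage 1: many layers of installing and building
FROM ubuntu:22.04 AS builder
RUN apt-get update && apt-get install -yq --no-install-recommends build-essential
COPY . /src
# Hypothetical build step producing /src/app
RUN make -C /src

# Stage 2: the final image consists of just these two layers
FROM ubuntu:22.04
COPY --from=builder /src/app /usr/local/bin/app
CMD ["/usr/local/bin/app"]
```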
On Windows
you need to start the docker
software by launching the Docker Desktop app. You need to keep this app open to build images and run containers.
Ubuntu image
Firstly, create a file named basic_1.dockerfile with the following instructions:
basic_1.dockerfile
FROM ubuntu:22.04
# https://stackoverflow.com/a/65054865
ENV DEBIAN_FRONTEND noninteractive
Then build this docker image by running the following command on your machine’s terminal (make sure to change the directory in the terminal to the same location as your .dockerfile
):
run in your machine's terminal
docker build -f ./basic_1.dockerfile -t basic-ubuntu --pull=false .
After the build finishes run the following command:
run in your machine's terminal
docker image ls
to verify that an image named basic-ubuntu
is available.
While connected to the container, you will be using bash to run a variety of commands.
Unlike on Windows, where we are accustomed to downloading software through our browser and installing it by running an installation wizard, here we can completely automate the download and install process using bash commands.
In order to get familiar with the bash shell, see the section on the linux command line, which uses the Ubuntu container to demonstrate some of the most common commands (some of which are very similar to commands in a Windows terminal).
Run the following code in your terminal in order to create a container based on the basic-ubuntu
image and launch a bash
terminal connected to it:
run in your machine's terminal
docker run --rm --name temp-container -it basic-ubuntu bash
The --rm argument means that the container will be removed automatically after we exit it.
Your terminal window should now show the bash terminal connected as root:
root@abcd12345678:/#
While connected to your docker container, run the following code to view available directories:
run in your machine's terminal while connected to the docker container
ls
Firstly, we’ll verify that changes made inside this kind of container do not persist - we will remove a couple of folders with the following commands:
run in your machine's terminal while connected to the docker container
rm -rf tmp
rm -rf home
ls
With the ls
command you should see that the tmp
and home
folders are gone. Next run the following command to exit the container:
run in your machine's terminal while connected to the docker container
exit
Next, launch the container again:
run in your machine's terminal
docker run --rm --name temp-container -it basic-ubuntu bash
Next, run ls
inside the container and verify that tmp
and home
folders are back.
In other words, any changes we make inside the container will disappear once we launch a new container. As mentioned before, a container is based on the created image and is used to launch some application that we configure. If we want to save any files or settings, we should use volumes, which are persistent. We will see how to do this after adding Python to our image.
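As a quick illustration of the difference, the following hypothetical session writes a file into a mounted volume; the file survives on the host even after the --rm container is destroyed (the file name note.txt is made up):

```shell
# Mount the current host directory as a volume and write a file into it
docker run --rm -v ${PWD}:/media/container_shared basic-ubuntu \
  bash -c "echo 'this survives' > /media/container_shared/note.txt"
# The container is gone, but the file remains on the host
cat note.txt
```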
Python
Firstly, launch the container and connect via bash:
run in your machine's terminal
docker run --rm --name temp-container -it basic-ubuntu bash
Then, based on Posit’s documentation on Python installation, we can run a number of commands to download and install Python inside the container. Trying the commands out interactively first lets us make sure that we won’t get any errors when we later build this kind of image.
Firstly, we download a couple of prerequisite linux libraries needed to download and install Python:
run in your machine's terminal while connected to the docker container
apt-get update --fix-missing && apt-get upgrade -yq
apt-get install -yq --no-install-recommends curl ca-certificates gdebi-core
Next, we will create an environment variable for the Python version. For example:
run in your machine's terminal while connected to the docker container
export PYTHON_VERSION=3.11.7
We can see the value of this variable by running:
run in your machine's terminal while connected to the docker container
echo $PYTHON_VERSION
We then download the Python installer, using the environment variable as part of the URL:
run in your machine's terminal while connected to the docker container
curl -O https://cdn.rstudio.com/python/ubuntu-2204/pkgs/python-${PYTHON_VERSION}_1_amd64.deb
Using the ls
command you should see the .deb
file.
We then install Python
from this file, delete the .deb
file and add Python
’s location to our PATH
variable:
run in your machine's terminal while connected to the docker container
apt-get install -yq --no-install-recommends ./python-${PYTHON_VERSION}_1_amd64.deb
rm -rf python-${PYTHON_VERSION}_1_amd64.deb
/opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip
/opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade setuptools
echo "PATH=/opt/python/${PYTHON_VERSION}/bin:$PATH" >> ~/.bashrc
By running:
run in your machine's terminal while connected to the docker container
python --version
you can verify that we have successfully installed Python on our container. Sadly, once we exit the container, all of our changes will be reverted. Nevertheless, this is a good way to test any of the instructions that we wish to add to our .dockerfile.
On that note, exit
the container. Then, create a new .dockerfile
named basic_2.dockerfile
with the following instructions:
basic_2.dockerfile
FROM ubuntu:22.04
# https://stackoverflow.com/a/65054865
ENV DEBIAN_FRONTEND noninteractive
# https://www.python.org/downloads/
ARG PYTHON_VERSION=3.11.7
# https://docs.posit.co/resources/install-python/
RUN apt-get update --fix-missing \
&& apt-get upgrade -yq \
&& apt-get install -yq --no-install-recommends curl ca-certificates gdebi-core
RUN curl -O https://cdn.rstudio.com/python/ubuntu-2204/pkgs/python-${PYTHON_VERSION}_1_amd64.deb \
&& apt-get install -yq --no-install-recommends ./python-${PYTHON_VERSION}_1_amd64.deb \
&& rm -rf python-${PYTHON_VERSION}_1_amd64.deb \
&& /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip \
&& /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade setuptools
RUN echo "PATH=/opt/python/${PYTHON_VERSION}/bin:$PATH" >> ~/.bashrc
And build this image:
run in your machine's terminal
docker build -f ./basic_2.dockerfile -t basic-python --pull=false .
Note that this time the building process will take longer, since we are executing more instructions, which involve downloading and installing Python.
Connect to our container with Python
already installed:
run in your machine's terminal
docker run --rm --name temp-container -v ${PWD}:/media/container_shared/ -it basic-python bash
Note that in the above code we have added a volume with the -v argument. We have bound the current working directory of our machine’s terminal to the /media/container_shared directory inside the container. We will use it to back up the versions of the Python libraries we want installed in our image.
Firstly, we will install a number of libraries and save their versions in a file named requirements.txt
:
run in your machine's terminal while connected to the docker container
apt-get update --fix-missing && apt-get install -yq build-essential libcairo2-dev libpango1.0-dev ffmpeg
python -m pip install --upgrade pip
python -m pip install -U virtualenv numpy scipy numexpr matplotlib pandas 'modin[all]' statsmodels 'datatable>1.0.0' pymc 'cmdstanpy[all]' 'arviz[all]' bambi jupyterlab jupyterlab-lsp jupyter-cache scikit-learn tensorflow keras pyarrow polars duckdb tzdata datar patsy plotnine seaborn siuba streamlit great-tables skimpy beautifulsoup4 manim lckr-jupyterlab-variableinspector
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
python -m pip freeze > requirements.txt
After this, we can view the requirements.txt
file:
run in your machine's terminal while connected to the docker container
cat requirements.txt
We then copy this file to our host machine from inside the container:
run in your machine's terminal while connected to the docker container
cp requirements.txt /media/container_shared/my_python_requirements.txt
Next, go to the directory on your host machine and verify that you have indeed saved a file named my_python_requirements.txt. Then, open this file and add the following line at the beginning: --find-links https://download.pytorch.org/whl/torch_stable.html (this is needed, since we have added the pytorch library from a different repository).
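If you prefer to do this from the command line instead of a text editor, a single GNU sed command can prepend that line. In your real workflow you would run only the sed line in the directory holding my_python_requirements.txt; the sketch below works on a toy stand-in file in a temporary directory so nothing is overwritten:

```shell
cd "$(mktemp -d)"
# A toy stand-in for the real my_python_requirements.txt
printf 'torch==2.1.0\n' > my_python_requirements.txt
# Insert the --find-links line before the first line of the file (GNU sed)
sed -i '1i --find-links https://download.pytorch.org/whl/torch_stable.html' my_python_requirements.txt
# Show the new first line
head -n 1 my_python_requirements.txt
```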
Now that we have the list of libraries and their versions, we can create the final image. On that note, exit
the container, if you haven’t already.
Finally, create a new .dockerfile
named basic_3.dockerfile
with the following instructions:
basic_3.dockerfile
FROM ubuntu:22.04
# https://stackoverflow.com/a/65054865
ENV DEBIAN_FRONTEND noninteractive
# https://www.python.org/downloads/
ARG PYTHON_VERSION=3.11.7
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# Add Python
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# https://docs.posit.co/resources/install-python/
RUN apt-get update --fix-missing \
&& apt-get upgrade -yq \
&& apt-get install -yq --no-install-recommends curl ca-certificates gdebi-core
RUN curl -O https://cdn.rstudio.com/python/ubuntu-2204/pkgs/python-${PYTHON_VERSION}_1_amd64.deb \
&& apt-get install -yq --no-install-recommends ./python-${PYTHON_VERSION}_1_amd64.deb \
&& rm -rf python-${PYTHON_VERSION}_1_amd64.deb \
&& /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip \
&& /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade setuptools
RUN echo "PATH=/opt/python/${PYTHON_VERSION}/bin:$PATH" >> ~/.bashrc
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# Install Python libraries
# @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# Required libraries for some Python libraries (e.g. manim):
RUN apt-get update --fix-missing \
&& apt-get install -yq build-essential libcairo2-dev libpango1.0-dev ffmpeg
COPY /my_python_requirements.txt /tmp/requirements.txt
RUN /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install --upgrade pip \
&& /opt/python/${PYTHON_VERSION}/bin/python3 -m pip install -r /tmp/requirements.txt
Note that the COPY instruction copies the my_python_requirements.txt file from your host machine to the image (assuming that your machine’s terminal is opened in the same directory where my_python_requirements.txt and basic_3.dockerfile are).
Finally, build this image:
run in your machine's terminal
docker build -f ./basic_3.dockerfile -t additional-python --pull=false .
Since we now include the installation of Python libraries, the build time of this image increases as well.
Finally, launch a container based on this image:
run in your machine's terminal
docker run --rm --name temp-container -it additional-python bash
And run a couple of commands to see the directory structure, as well as the Python
version and the version of one of the installed libraries, then, exit the container:
run in your machine's terminal while connected to the docker container
du -h -d 1 / | sort -hr
du -h -d 1 / -t +100MB 2> >(grep -v '^du:') | sort -hr
python --version
python -c "import statsmodels; print(statsmodels.__version__)"
exit
After building (or re-building) many images, you will notice that quite a bit of space is taken by the build cache. We can inspect the total size taken up by everything docker-related (e.g. other images that we have):
run in your machine's terminal
docker system df
Once you are certain that you won’t need to develop and re-build your images, you can clear the build cache with the following command:
run in your machine's terminal
docker builder prune --force
We can list all of our images:
run in your machine's terminal
docker image list
and remove some of the images no longer needed, for example basic-ubuntu
and basic-python
:
run in your machine's terminal
docker rmi $(docker images 'basic-ubuntu' -a -q)
docker rmi $(docker images 'basic-python' -a -q)
docker image list
Alternatively, we can stop all active containers and remove ALL images:
run in your machine's terminal
docker stop $(docker ps -a -q)
docker system prune -a --volumes --force
docker system df
There are some limitations. By default a docker image is built targeting the CPU architecture of the machine that is building it. For example, an image built on an x86-64 machine will not run natively on an ARM-based machine (such as an Apple Silicon Mac). If you want to create an image that is compatible with both architectures, you would need to configure your build to target both platforms. Furthermore, if you want to target the ARM-based architecture, then it is very likely that you will need to modify the commands inside your dockerfile to download ARM-based software (instead of software based on the x86-64 architecture).
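Docker’s buildx plugin supports such multi-platform builds (it ships with recent versions of Docker Desktop). A sketch reusing the basic_3.dockerfile from above - note that, as mentioned, its amd64-specific download commands would still need to be adapted before the ARM variant could actually build:

```shell
# Build the image for both x86-64 and ARM 64-bit linux targets
docker buildx build --platform linux/amd64,linux/arm64 \
  -f ./basic_3.dockerfile -t additional-python .
```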