In this chapter we review two common ways of setting up your data science software:
- The classic way - by directly installing software on your operating system.
- The modern way - by setting up a dedicated container with your data science software.
Both methods have their own advantages and drawbacks:
| | Direct installation | Container |
|---|---|---|
| Advantages | - Usually requires only a straightforward installation.<br>- Easy to update and to install new packages and additional software. | - Ensures that your software remains the same each time you start a container, regardless of the OS on your PC.<br>- If you accidentally update or remove something in the container, you can easily spin up a new container with the initial software state.<br>- The same container image can be used by different people, ensuring that everyone has exactly the same software.<br>- Container performance is usually close to native performance (see Stack Overflow) if the container image is optimized correctly. In practice, you can expect some performance loss, though it is usually small.<br>- Most cloud providers, such as AWS, Azure and Google, offer container-based solutions for running machine learning software, which is convenient for data scientists working in the same team or company. |
| Drawbacks | - Difficult to maintain (update) multiple versions of the same software.<br>- Sometimes difficult to re-install the same software.<br>- Difficult to prevent accidental software updates.<br>- Some software may be OS-exclusive (macOS vs. Windows vs. Linux). | - Cannot make persistent changes to files/software in the container (at least not easily).<br>- If you want to create your own container from scratch, you need to be familiar with containerization software, such as Docker or Podman, as well as the Linux command line. This is generally a time-consuming process.<br>- Your software needs to be able to work remotely. Some software has both a desktop and a web-based alternative, for example VSCode (desktop) and Code-Server (browser). If no browser version exists, such software might not be easy to use in containers.<br>- Your PC needs to be able to run containers, e.g. see the Docker requirements for: Windows, macOS, Linux. |
To sum up the above table:
- Software containers are immutable, but setting one up for the first time without prior experience is very time-consuming.
- Directly installing software on your PC is relatively fast, but carries a high risk that an accidental update to some software or library breaks your setup.
- Containerized software is currently quite popular in most companies that employ data scientists, because it ensures reproducibility, provides a consistent environment, and offers a strategy for deploying models as small web apps.
Therefore, while these notes describe both methods of software setup, it is strongly encouraged to set up your software inside a container.
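
The risk of accidental updates mentioned above can be made visible with a small check. The following is a minimal sketch, not part of the setup described in these notes: it assumes a Python-based environment, and the helper names `snapshot_versions` and `diff_snapshots` are hypothetical. It records the installed package versions and reports what changed between two snapshots.

```python
from importlib.metadata import distributions


def snapshot_versions():
    """Return a {package_name: version} mapping for the current environment."""
    versions = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            versions[name.lower()] = dist.version
    return versions


def diff_snapshots(old, new):
    """Report packages added, removed, or changed between two snapshots."""
    added = {name: new[name] for name in new.keys() - old.keys()}
    removed = {name: old[name] for name in old.keys() - new.keys()}
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Comparing a snapshot with itself reports no differences.
    current = snapshot_versions()
    print(diff_snapshots(current, current))
```

In a directly installed environment, a snapshot could be saved (for example as JSON) and compared later to spot unintended changes. In a container started from the same image, the two snapshots would be expected to match on every start.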