A Step Towards Reproducible Data Science : Docker for Data Science Workflows

Last Updated : 05 Jun, 2020

Introduction

My first encounter with Docker was not to solve a Data Science problem, but to install MySQL. Yes, to install MySQL! Quite an anticlimactic start, right? At times, you stumble upon jewels while going through StackOverflow, and Docker was one of them. What started as a one-off use case ended up becoming a useful tool in my daily workflow.

I got a real taste of Docker when I tried to install TensorFlow on my system. To give you some context, TensorFlow is a deep learning library whose system setup requires a series of steps. Installing the Nvidia graphics drivers is especially complex. I literally had to reinstall my operating system countless times. That loop stopped only when I shifted to Docker, thankfully!

Docker provides you with an easy way to share your working environments including libraries and drivers. This enables us to create reproducible data science workflows.

This article aims to be a starting point that nudges you to use Docker for your Data Science workflows! I will cover two useful aspects of Docker: setting up your system without installing the tools, and creating your own data science environment.

Table of Contents

1. What is Docker?
2. Use Cases for Data Science
3. Docker Terminology
4. Docker: Hello-World
5. Data Science tools without installation
6. Your first Docker Image
7. Docker Eco-system

1. What is Docker?

Docker is a software technology providing containers, promoted by the company Docker, Inc. Docker provides an additional layer of abstraction and automation of operating-system-level virtualization on Windows and Linux.

The underlying concept that Docker promotes is the use of containers, which are essentially “boxes” of self-contained software. Containers existed before Docker and were quite successful, but 2015 saw the software community adopt containerization widely to solve day-to-day issues.

2. Use Cases for Data Science

When you walk into a cubicle of Data Science folks, they are either doing data processing or struggling to set up something on their workstations/laptops. Okay, that might be an exaggeration, but you get the sense of helplessness. To give a small example, there are more than 30 unique ways to set up a Caffe environment. And trust me, you’ll end up creating a new blog post just to show all the steps!

[Image: protocols. Source: xkcd comics]

You get the idea. The Anaconda distribution has made virtual environments, and replicating environments in a standardized way, a reality… yet things do get muddled, and sometimes we miss the bullet points in the README file that were carefully written to replicate those environments.

To solve the above problem, bash scripts and makefiles get added, which only adds to the confusion. It becomes as simple as untangling earphones, phew!

Docker’s learning curve might be a bit steep, but it helps to solve:

Distribution and setup of Data Science tools and software:

The Caffe example we discussed is one of the pain points that everyone experiences in their Data Science journey. Not only does Docker provide a consistent platform through which these tools can be shared, it also eliminates the time wasted searching for operating-system-specific installers/libraries.

Sharing reproducible analysis and code via Docker Images:

Along with sharing the tools (Docker images as installers), we can share Jupyter notebooks or scripts along with their results baked inside a Docker image. All the other person/colleague needs to do is run the Docker image to find out what’s there!

Sharing Data Science applications directly without a dedicated DevOps team:

In my last article, we looked at how wrapping an ML model in an API helps to make it available to your consumers. This is just one part of it. For small teams with no independent DevOps team to take care of deployments, Docker and the eco-system around it (docker-compose, docker-machine) help to ease the problems at a small scale.

Does the sales team need to present an RShiny application without running the code themselves? Docker can help you with that!

3. Docker Terminology

I’ve been going on about containers and containerization in the previous section. Let’s understand the Docker terminology first.

For starters, containers can be thought of as mini-VMs that are lightweight and disposable. Technically though, they are just isolated processes on the host that are created when you fire Docker commands in your terminal via the Command Line Interface (CLI).

Docker also provides images, which are essentially snapshots of containers; they are either saved from a container’s state using the Docker CLI or generated from a Dockerfile.

A Dockerfile can be considered an automated setup file. This small file helps to create/modify Docker images. All the talk makes no sense until there’s some proof. Let’s dive in and fire up your terminals.

Think of the process of creating a Docker image as baking a layered cake: the Dockerfile is your recipe, and the Docker image is built up from various layers.

In the next few sections, we’ll try to get a feel for Docker and work with its command line. We’ll also create our own Docker image.

4. Docker: Hello-World

To install Docker, below are the links for the major operating systems:

- Linux
- Mac OS
- Windows

After installation, to test whether Docker has been installed successfully, run:

Docker_1
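
The screenshot itself is not reproduced here. A minimal sketch of such a check, assuming the installer put the `docker` CLI on your PATH, could look like the following. The `DOCKER_FOUND` variable is only an illustrative name and not part of Docker.

```shell
# Check that the docker CLI is reachable before going any further.
if command -v docker >/dev/null 2>&1; then
    docker --version      # prints the installed client version
    DOCKER_FOUND=yes
else
    echo "docker not found on PATH; revisit the installation links above"
    DOCKER_FOUND=no
fi
```

If a version string is printed, the CLI is installed; the exact version you see will depend on your installation.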

The above output means that the Docker CLI is ready. The next step is to download an image. But how do we get a Docker image? Docker has a repository for that, similar to a GitHub repo, called Dockerhub. Visit Dockerhub to know more.

After you have logged in, you will see your dashboard (which will be empty at first). Do a quick search using the Search button and type in: hello-world. (Below is my dashboard.)

docker_2

Searching hello-world would give you the results below:

docker_3

Click on the first result, which also happens to be the official image (created by the good folks at Docker; always try to use the official images if there’s a choice, or create your own).

docker_4

The command `docker pull hello-world` is what you need to run in your terminal. That’s how you download images to your local system.

- To know which images are already present, run: `docker images`
- Download the `hello-world` image: `docker pull hello-world`
- Run the image using the command: `docker run hello-world`
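
Put together, the steps above can be run as one short terminal session. This is a sketch rather than the exact commands from the screenshots: the `IMAGE` variable and the `--rm` flag (which deletes the stopped container afterwards) are additions for illustration.

```shell
IMAGE="hello-world"

docker pull "$IMAGE"       # download the image from Dockerhub
docker images "$IMAGE"     # list the local copies of that image
docker run --rm "$IMAGE"   # start a container; --rm removes it once it exits

echo "Finished the $IMAGE walkthrough"
```

Without `--rm`, the exited container stays on disk and still shows up in `docker ps -a` until you remove it.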

This is all we need to execute a Docker image. hello-world is simple, as it has to be, but let’s move on to better things that will help more. The next section is all about that: Data Science tools without installation, our first use case.

5. Data Science tools without installation:

You have a clean laptop and you need to install TensorFlow in your system, but you are lazy (yes we all are sometimes). You want to procrastinate and not install things on your laptop, but you have Docker installed already as a standard company practice. Hmm, interesting times, you ponder!

You go to Dockerhub and search for the official Docker image for TensorFlow. All you need to run on your terminal is: docker pull tensorflow/tensorflow

As discussed above (in the Docker Terminology section), the TensorFlow Docker image is also built from layers. Once all the intermediate layers are downloaded, run: docker images to check whether our docker pull was successful.

To run the image, run the command: docker run -it -p 8888:8888 tensorflow/tensorflow

[NOTE: At the time of writing, port 8888 on my machine was already in use, so I published the container on host port 8889 instead, i.e. -p 8889:8888. You can pick any free host port *shrugs*]

Now, the above docker run command packs in a few more command line arguments. The ones you need to know better are as follows:

-i keeps STDIN open so that you can interact with the container.
-t allocates a pseudo-terminal (TTY) for the container; combined with -i, it gives you an interactive terminal session.
-p publishes a container port to the host, in host:container order. Here localhost:8888 maps to port 8888 of the container.
-d runs the container in detached mode, i.e. the container runs in the background, unlike with -it, where the container stops once you stop the foreground process (it is removed automatically only if you also pass --rm).

Now that a Docker container has been created, you can visit http://localhost:8889 (or whichever host port you published) and try out TensorFlow.

Wasn’t that easy? Now, as an exercise, replace -it in the docker run command with -d. See whether you can get to the TensorFlow Jupyter environment again or not.

You should get the following outputs as in the screenshot below:

Exercise: Create more containers with different ports using the docker run command and see how many get created.
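
For the exercises above, a detached run might be sketched as follows. The container name `tf-demo` and the variable names are hypothetical, chosen only for this example. Depending on the image version, the Jupyter server may only be bundled in a tag such as `tensorflow/tensorflow:latest-jupyter`, so check the image page on Dockerhub if no notebook comes up.

```shell
IMAGE="tensorflow/tensorflow"   # image used in this section
NAME="tf-demo"                  # hypothetical container name
HOST_PORT=8889                  # any free port on your machine
CONTAINER_PORT=8888             # port Jupyter listens on inside the container

# Start the container in the background and publish the port (host:container).
docker run -d --name "$NAME" -p "${HOST_PORT}:${CONTAINER_PORT}" "$IMAGE"

docker ps --filter "name=$NAME"   # confirm that the container is running
docker logs "$NAME"               # output of the process started in the container

# Run these two lines later, once you are done with the notebook.
docker stop "$NAME"
docker rm "$NAME"

echo "Container port ${CONTAINER_PORT} was published on http://localhost:${HOST_PORT}"
```

Repeating the `docker run` line with a different `NAME` and `HOST_PORT` creates additional containers from the same image; `docker ps` then lists each of them.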

6. Your first Docker Image

As Data Science folks, we are picky about the tools we use for our analysis: some like to work with R while others prefer Python. Personally, I’d whine about the above TensorFlow image. I don’t know what’s in it (unless I look at the source code, i.e. the Dockerfile, aka the recipe). TensorFlow isn’t enough on its own; suppose you want to use OpenCV too, and maybe scikit-learn and matplotlib.

Let’s see how to create your own custom TensorFlow image!

First thing you need is to create a requirements.txt file. For reference, below is the file that you might want to use: requirements.txt
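
The linked file is not reproduced here. As a stand-in, the sketch below writes a minimal requirements.txt covering the libraries mentioned above plus Jupyter. The package list is an assumption based on this section, not the author's original file, and the versions are deliberately left unpinned.

```shell
# Write a minimal requirements.txt into the current directory.
cat > requirements.txt <<'EOF'
tensorflow
opencv-python
scikit-learn
matplotlib
jupyter
EOF

cat requirements.txt   # show what was written
```

For a reproducible image you would normally pin each entry to an exact version (using the `package==version` form) once you know which versions work together.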

Our Dockerfile will comprise the following components:

- For the base image, we’ll use the official Docker image for Python, i.e. python:3.6.
- A command to update the source repositories (the image uses the Debian distribution).
- Copy the requirements.txt file and pip install the Python libraries listed in it.
- A command to expose the ports.
- A command to run the jupyter notebook command.

The final Dockerfile would look as below:

# Base image
FROM python:3.6

# Updating repository sources
RUN apt-get update

# Copying requirements.txt file
COPY requirements.txt requirements.txt

# pip install the Python libraries
RUN pip install --no-cache-dir -r requirements.txt

# Exposing ports
EXPOSE 8888

# Running jupyter notebook
# --NotebookApp.token=demo sets the access token (used as the password)
CMD ["jupyter", "notebook", "--no-browser", "--ip=0.0.0.0", "--allow-root", "--NotebookApp.token=demo"]

The next step is to build our image. Keep the Dockerfile and requirements.txt together in one directory and run the build from there.

To build the image, run: docker build -t tensorflow-av . (Note: -t tags the image with a name of your choice. You can version it as well, e.g. docker build -t tensorflow-av:v1 .)

The logs for the whole run are provided here. Once the entire process is completed, the image will be visible in your local list of Docker images. Run: docker images to check!
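
The build step can be sketched as below, assuming the Dockerfile and requirements.txt sit side by side in the current directory. The `IMAGE_TAG` variable is just an illustrative name. The trailing `.` matters: it is the build context, the directory whose files `COPY` can see. Feeding only the Dockerfile through stdin (`docker build - < Dockerfile`) sends no context, so the `COPY requirements.txt` step fails with a "no such file or directory" error.

```shell
IMAGE_TAG="tensorflow-av:v1"

# Both files must be present in the build context (the current directory).
ls Dockerfile requirements.txt

# The trailing "." is the build context that gets sent to the Docker daemon.
docker build -t "$IMAGE_TAG" .

docker images tensorflow-av   # the freshly built image should be listed here
echo "Tagged as $IMAGE_TAG"
```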

Now that you have created the image, we need to test it. Run the image using the same command you used to run the original tensorflow docker image. Run: docker run -p 8887:8888 -it tensorflow-av

Congratulations! You have made your first Docker image. There are a few ways to share it:
- Upload the image to Dockerhub. Follow the steps below to do it:
  - Log in to Dockerhub via the terminal: sudo docker login
  - Tag the image with your Dockerhub ID: sudo docker tag tensorflow-av <dockerhub-id>/tensorflow-av
  - Push the image to Dockerhub: sudo docker push <dockerhub-id>/tensorflow-av
- Export the image to a .tar file:
  - docker save <dockerhub-id>/tensorflow-av > <path>/tensorflow-av.tar
- Export a container to a .tar file. Note that this captures only the container’s filesystem as it currently stands, not its running processes or mounted volumes.
  - docker export <container-id> > <path>/tensorflow-av-run1.tar
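
On the receiving side, the two kinds of archive are restored with different commands. The sketch below assumes the archives sit in the current directory; the file names and the `tensorflow-av:run1` tag are hypothetical.

```shell
ARCHIVE="tensorflow-av.tar"   # hypothetical file name

# Image archive: save on one machine, load on the other (tags are preserved).
docker save -o "$ARCHIVE" tensorflow-av   # same effect as redirecting with ">"
docker load -i "$ARCHIVE"

# Container archive: a flat filesystem, imported as a new image under a name you choose.
docker import tensorflow-av-run1.tar tensorflow-av:run1

echo "Archive: $ARCHIVE"
```

An image created with `docker import` does not carry over the original `CMD`, so you would typically need to pass the command to run explicitly when starting a container from it.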

7. Docker Eco-system

Docker provides good support for growing from a prototype to production scale. Purely from a deployment perspective, docker-machine, docker-compose and docker-swarm are the components that help achieve that.

Source: Official Docker Blog

Want to take your ML API & deploy it to any cloud provider? docker-machine helps you do that.
Your deployed API is growing in usage, want to scale it up? docker-swarm is there to help you do it without many changes.
Want to use multiple Docker images in a single application? docker-compose makes it possible for you to do that!

End Notes

Starting a new habit is a difficult task. But once the learning curve smoothens out, things start to work out and new ideas open up with usage. It is the same with Docker, and I hope this primer makes you think about using it in your daily Data Science workflows. Comment down below on how you plan to use Docker, starting today!

About the Author

Prathamesh Sarang works as a Data Scientist at Lemoxo Technologies. Data Engineering is his latest love, turned towards the *nix faction recently. Strong advocate of “Markdown for everyone”.

Responses From Readers

siva

I've been waiting for new post from last one week. I think U are busy with DATA HACK SUMMIT 2017. well my question is, 1. is it mandatory to learn DOCKER for beginners 2. containers are like mini VM, so do they consume extra memory apart from memory for regular calculations in tensor flow. 3. i tried to install FASTTEXT in my anaconda but failed and realized that it needs linux OS. so can i install it with DOCKER without installing LINUX. thank you so much valuable post. keep it up!

Prathamesh Sarang

Hi Siva, 1. Not mandatory as such, you are free to use Vagrant or regular VMs (using virtualbox) on your systems. But Docker containers are light-weight and frankly no extra overhead to run those. 2. Technically they are just threads and run like an application on your system. You can restrict the memory usage when you spin up docker containers, but by default it uses whatever memory your host system (local or cloud VM) can spare. For more info, do try to read this: https://docs.docker.com/engine/admin/resource_constraints/ 3. Yes, you need to install docker and then do search for docker image for fasttext on dockerhub. I looked it up, there's no official image, but someone has created a fasttext docker. You can use that. If that doesn't serve your purpose then why not create your own docker image for fasttext and publish it :) Hope this helps. Thanks and Regards, Prathamesh

Nishant Kumar singh

Hi , Since I know vagrant , vagrantfile , It is easy to learn docker concepts. Very nice tutorial.It really helps me.Thanks

Prathamesh Sarang

Hi Nishant, Glad you liked it :) Thanks and Regards, Prathamesh

Sankar

Wow! A great and simple docker tutorial in the context of Data Science. Thanks for the examples and screenshots.

Prathamesh Sarang

Hi Sankar, Glad that it helped :) Thanks and Regards, Prathamesh

Samarth

Simply Awesome !

franz

Nicely done. Here is a container (https://github.com/ufoym/deepo) which contains all your learning libraries as one container. I agree with your approach to have a container per library. Also, you can use nvidia-docker to access the GPU but it is only UNIX and is upgrading to 2.0. Again nicely done on the write up.

Prathamesh Sarang

Hey Franz, glad you liked it. And thanks for sharing the repo :) Cheers!

Rajendran K

I attended Data Hack summit 2017 and this is the first post I read about AI,ML etc. This article has given me some ideas on different aspects of configuring the system for AI and helps to identify the beginning point from some where. I could get at least few concepts which will be of help to explore it further to understand it better. The article is simple and easy to understand and I appreciate it.

Prathamesh Sarang

Hi Rajendran, Hope that this helps you in integrating Docker in your daily work :) Thanks and Regards, Prathamesh

Carlos

Great, I will try it!

Mark

Hi Prathamesh Sarang, thanks for your tutorial. For me it seems that chapter 6 has some jumps. Could you explain that a bit more. Like what are the exact commands to run the Docker file. I have used `docker build -t tensorflow-co:v1 - < Dockerfile` within the directory where I saved the requirements.txt file. However, I get this error message: Step 3/6 : COPY requirements.txt requirements.txt COPY failed: stat /var/lib/docker/tmp/docker-builder354864138/requirements.txt: no such file or directory Cheers

Prathamesh Sarang

Hi Mark, Troubleshooting via comments would be a difficult thing, but I'll list down the flow on how to build the Docker image in this gist: https://gist.github.com/pratos/86a3bb07bde32abf8de1b06eb2a808a9 Do let me know if it solves the problem. Thanks and Regards, Prathamesh

Vasanth Gopal

Thanks Sarang. In fact I was fascinated about the docker during the #DHS2017 where I saw your presentation. Quite frankly it was totally new to me. I was just wondering whether we can use this technology to bypass the setup issues we face while trying to connect AWS instances to the deep learning server for fast.ai course which I am currently pursuing. Can you suggest if it is possible and if yes then how to go about it? Thanks Regards Vasanth

Deep

Brilliant explanation !! Thanks a lot.

Prathamesh Sarang

Hi Vasanth, I'm assuming the deep learning servers you are talking about are just regular GPU enabled VMs (AWS does have Deep Learning AMIs which have drivers and CUDA). If yes then you need to use nvidia-docker. nvidia-docker is a wrapper above the original docker engine that helps you get docker images (nvidia-docker pull) via the command line and run it seamlessly without any driver/CUDA installations. The official repo:https://github.com/NVIDIA/nvidia-docker In the comment above by Franz, he has shared the repo that helps you do it: https://github.com/ufoym/deepo Hope this helps :) Thanks and Regards, Prathamesh

R. Righart

Insightful tutorial, many thanks for sharing!

Meng Lee

This is a very useful article for beginners, keep up the nice work :)
