

A Step Towards Reproducible Data Science: Docker for Data Science Workflows

Introduction

My first encounter with Docker was not to solve a Data Science problem, but to install MySQL. Yes, to install MySQL! Quite an anti-climactic start, right? At times you stumble upon gems while going through StackOverflow, and Docker was one of them. What started as a one-off use case ended up becoming a useful tool in my daily workflow.

I got a real taste of Docker when I tried to install TensorFlow on my system. Just to give you the context, TensorFlow is a deep learning library that requires a series of system setup steps. Installing the Nvidia graphics drivers, in particular, is extremely complex. I literally had to reinstall my operating system countless times. That loop stopped only when I shifted to Docker, thankfully!

Docker provides you with an easy way to share your working environments including libraries and drivers. This enables us to create reproducible data science workflows.

This article aims to provide the perfect starting point to nudge you to use Docker for your Data Science workflows! I will cover two useful aspects of Docker: using Data Science tools without installing them, and creating your own data science environment.

 

Table of Contents

  1. What is Docker?
  2. Use Cases for Data Science
  3. Docker terminology
  4. Docker Hello-World
  5. Data Science tools without installation
  6. Your first Docker Image
  7. Docker eco-system

 

1. What is Docker?

Docker is a software technology providing containers, promoted by the company Docker, Inc. Docker provides an additional layer of abstraction and automation of operating-system-level virtualization on Windows and Linux.

The underlying concept that Docker promotes is the use of containers, which are essentially "boxes" of self-contained software. Containers existed before Docker and were quite successful, but 2015 saw huge adoption of containerization by the software community to solve day-to-day issues.

 

2. Use Cases for Data Science

When you walk into a cubicle of Data Science folks, they are either doing data processing or struggling to set something up on their workstations/laptops. Okay, that might be an exaggeration, but you get the sense of helplessness. To give a small example, there are more than 30 unique ways for someone to set up a Caffe environment. And trust me, you'll end up creating a new blog post just to show all the steps!

[Image: xkcd comic on standards/protocols. Source: xkcd comics]

You get the idea. The Anaconda distribution has made virtual environments, and replicating them in a standardized way, a reality... yet things do get muddled, and sometimes we miss a bullet point in the README file carefully created to replicate those environments.

To solve the above problem, bash scripts and makefiles get added, which only adds to the confusion. It becomes about as simple as untangling earphones, phew!

Docker's learning curve might be a bit steep, but it helps to solve:

  • Distribution and setup of Data Science tools and software:

    The Caffe example we discussed is one of the pain points that everyone experiences in their Data Science journey. Not only does Docker provide a consistent platform through which these tools can be shared, it also eliminates the time wasted searching for operating-system-specific installers/libraries.

  • Sharing reproducible analysis and code via Docker images:

    Along with sharing the tools (docker images as installers), we can share Jupyter notebooks or scripts along with their results baked inside a Docker image. All the other person/colleague needs to do is run the Docker image to find out what's there!

  • Sharing Data Science applications directly without a dedicated DevOps team:

    In my last article, we looked at how wrapping an ML model in an API helps make it available to your consumers. This is just one part of it. For small teams with no independent DevOps team to take care of deployments, Docker and the eco-system around it (docker-compose, docker-machine) help ease these problems at a small scale.

    The sales team needs to present an RShiny application but doesn't want to run the code? Docker can help you with that (see the sketch below)!
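    For instance, here is a hedged sketch of handing a Shiny app to the sales team as a container; the rocker/shiny image, the mount path, and the port are illustrative assumptions rather than details from this article:

    # Pull a community-maintained Shiny Server image from Dockerhub (assumed here)
    docker pull rocker/shiny
    # Run it in the background, publishing Shiny Server's default port 3838
    # and mounting a local folder that contains the app (path is a placeholder)
    docker run -d -p 3838:3838 -v /path/to/shiny-app:/srv/shiny-server/myapp rocker/shiny
    # The sales team can now open http://localhost:3838/myapp in a browser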

 

3. Docker Terminology

I've been going on about containers and containerization in the previous sections. Let's first understand the Docker terminology.

For starters, containers can be thought of as light-weight, disposable mini-VMs. Technically though, they are just isolated processes that are created when you fire Docker commands in your terminal via the Command Line Interface (CLI).

Docker also provides images, which are essentially snapshots of containers: a container's state can be saved as an image using the Docker CLI, or an image can be generated from a Dockerfile.

A Dockerfile can be considered an automated setup file. This small file helps create/modify Docker images. All this talk makes no sense until there's some proof, so let's dive in and fire up your terminals.

Think of the process of creating a Docker image as baking a layered cake: the Dockerfile is your recipe, and the Docker image is the cake built out of the various layers.
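To make these terms concrete, here is a minimal sketch of how they map to CLI commands; the image and container names are placeholders:

docker images                           # list the images saved on your system
docker run -it python:3.6 bash          # start a container (a process) from an image
docker ps                               # list the containers currently running
docker commit <container-id> my-image   # snapshot a container's current state into a new image
docker build -t my-image .              # build an image from a Dockerfile in the current folder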

In the next few sections, we'll get a feel for Docker and work with its command-line interface. We'll also create our own Docker image.

 

4. Docker: Hello-World

  • To install Docker, follow the official installation instructions for your operating system (Windows, macOS, or Linux).
  • After installation, to test if Docker has been successfully installed, run a quick check (for example, docker --version):

[Screenshot: Docker CLI output confirming the installation]

  • The above output means that the Docker CLI is ready. The next step is to download an image, but how do we get a Docker image? Docker has a repository for that, similar to a GitHub repo, called Dockerhub. Visit Dockerhub to know more.
  • After you have logged in, you will see your dashboard (which will be empty at first). Do a quick search using the Search button and type in: hello-world. (Below is my dashboard.)

[Screenshot: Dockerhub dashboard]

  • Searching hello-world gives you the results below:

[Screenshot: Dockerhub search results for hello-world]

  • Click on the first result, which also happens to be the official image (created by the good folks at Docker; always prefer official images when there's a choice, or create your own).

[Screenshot: hello-world image page on Dockerhub]

  • The command docker pull hello-world is what you need to run on your terminal. That's how you download images to your local system.
    • To know which images are already present, run: docker images
      [Screenshot: output of docker images]
  • Run the image using the command: `docker run hello-world`

This is all we need to execute a docker image. hello-world is simple, as it has to be, but let's move on to things that will help more. The next section is all about that: Data Science tools without installation, our first use case.
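To recap, the whole hello-world exercise boils down to three commands:

docker pull hello-world   # download the image from Dockerhub
docker images             # confirm the image is now present locally
docker run hello-world    # create a container from the image and run it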

 

5. Data Science tools without installation

You have a clean laptop and you need to install TensorFlow on your system, but you are lazy (yes, we all are sometimes). You want to procrastinate and not install things on your laptop, but you already have Docker installed as a standard company practice. Hmm, interesting times, you ponder!

You go to Dockerhub and search for the official Docker image for TensorFlow. All you need to run on your terminal is: docker pull tensorflow/tensorflow

As discussed above (in the Docker Terminology section), the tensorflow docker image is also built up from layers. Once all the intermediate layers are downloaded, run docker images to check whether our docker pull was successful.

To run the image, run the command: docker run -it -p 8888:8888 tensorflow/tensorflow

[NOTE: At the time of writing, port 8888 on my machine was already in use, so the screenshots run it on 8889. You can map it to any free host port though, *shrugs*]

Now, the above docker run command packs in a few command-line arguments. The ones you need to know are as follows:

  • -i runs the container interactively (keeps STDIN open).
  • -t allocates a terminal (pseudo-TTY), which is what lets you use bash inside the container created.
  • -p publishes/connects the container's ports to the host. Here, localhost:8888 maps to port 8888 of the container.
  • -d runs the container in detached mode, i.e. the container runs in the background, unlike -it above (where stopping the process stops the container).

Now that a docker container is created, you can visit http://localhost:8889 (matching the port used in the screenshots; use 8888 if you ran the command exactly as shown above) and try out tensorflow.

Wasn't that easy? Now, as an exercise, replace -it in the docker run command with -d. See whether or not you can get the tensorflow Jupyter environment again.

You should get outputs like the ones in the screenshot below:

[Screenshot: output of the docker run command]

Exercise: Create more containers with different ports using the docker run command and see how many get created.
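A minimal sketch of what this exercise could look like; the container names and host port choices below are arbitrary, for illustration:

# Run two detached containers, each mapped to a different host port
docker run -d -p 8888:8888 --name tf-1 tensorflow/tensorflow
docker run -d -p 8889:8888 --name tf-2 tensorflow/tensorflow

# List the running containers to see how many got created
docker ps

# Stop and remove them once you are done
docker stop tf-1 tf-2
docker rm tf-1 tf-2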

 

6. Your first Docker Image

We Data Science folks are picky about the tools we use for our analysis; some like to work with R while others prefer Python. Personally, I'd whine about the above TensorFlow image. I don't know what's in it (unless I look at the source code, i.e. the Dockerfile, aka the recipe). TensorFlow isn't enough on its own; suppose you want to use OpenCV too, and maybe scikit-learn & matplotlib.

Let's see how to create your own custom TensorFlow image!

  • The first thing you need is to create a requirements.txt file (a sample is sketched after the Dockerfile below).
  • Our Dockerfile would be composed of the components below:
    • For the base image, we'll use the official docker image for python, i.e. python:3.6.
    • Command to update the source repositories (the image uses a Debian distribution).
    • Copy the requirements.txt file and pip install the python libraries from it.
    • Command to expose the ports.
    • Command to run the jupyter notebook.
  • The final Dockerfile would look like this:
# Base image
FROM python:3.6

# Updating repository sources
RUN apt-get update

# Copying requirements.txt file
COPY requirements.txt requirements.txt

# pip install 
RUN pip install --no-cache -r requirements.txt

# Exposing ports
EXPOSE 8888

# Running jupyter notebook
# --NotebookApp.token='demo' sets the notebook password
CMD ["jupyter", "notebook", "--no-browser", "--ip=0.0.0.0", "--allow-root", "--NotebookApp.token='demo'"]
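
For reference, a sample requirements.txt covering the libraries mentioned above could look like the following; the exact package list (and the absence of version pins) is an illustrative assumption, not the author's original file:

# requirements.txt (illustrative)
tensorflow
opencv-python-headless   # headless build avoids extra system graphics libraries
scikit-learn
matplotlib
jupyter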
  • The next step is to build our image; below is the directory structure that can be followed:
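    A minimal layout that works with the build command below keeps both files in the same build-context folder (the folder name is a placeholder):

    tensorflow-av/
    ├── Dockerfile
    └── requirements.txt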

  • To build the image, run docker build -t tensorflow-av . from inside that folder. (Note: -t tags the image with a name of your choice. You can version it as well, e.g.: docker build -t tensorflow-av:v1 .)
  • The logs for the full run are provided here. Once the entire process is complete, the image will be visible in your local docker registry. Run docker images to check!

  • Now that you have created the image, we need to test it. Run the image using the same command you used to run the original tensorflow docker image (a quick verification sketch follows). Run: docker run -p 8887:8888 -it tensorflow-av
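    Once the container is up, Jupyter should be reachable at http://localhost:8887 with the token 'demo' set in the Dockerfile. As an optional, hedged check that the image built correctly, you can also run a one-off command inside it:

    # Override the default CMD and run a throwaway container (--rm removes it on exit)
    docker run --rm tensorflow-av python -c "import tensorflow as tf; print(tf.__version__)"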

  • Congratulations! You have made your first docker image. To share it, there are a couple of ways you could go about it:
    • Upload the image toย Dockerhub. Follow the steps below to do it:
      • Log in to Dockerhub via the terminal: sudo docker login
      • Tag the docker image with your Dockerhub ID: sudo docker tag tensorflow-av <dockerhub-id>/tensorflow-av
      • Push the image to Dockerhub: sudo docker push <dockerhub-id>/tensorflow-av
    • Save the image to a .tar file:
      • docker save <dockerhub-id>/tensorflow-av > <path>/tensorflow-av.tar
    • We can even export a container itself to a .tar file, capturing its filesystem and metadata (a restore sketch follows this list).
      • docker export <container-id> > <path>/tensorflow-av-run1.tar
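
On the receiving machine, these archives can be brought back with the matching commands; a minimal sketch (paths and tags are placeholders):

# Restore an image archive created with docker save
docker load < /path/to/tensorflow-av.tar

# Turn a filesystem archive created with docker export back into an image
docker import /path/to/tensorflow-av-run1.tar tensorflow-av:imported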

 

7. Docker Eco-system

Docker provides good support for building up from a prototype to production scale. Purely from a deployment perspective, docker-machine, docker-compose & docker-swarm are the components that help achieve that.


[Image: Docker eco-system overview. Source: Official Docker Blog]
  • Want to take your ML API & deploy it to any cloud provider? docker-machine helps you do that.
  • Your deployed API is growing in usage and you want to scale it up? docker-swarm is there to help you do it without many changes.
  • Want to use multiple Docker images in a single application? docker-compose makes it possible for you to do that!
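
As a rough, hedged sketch of what each of these tools looks like on the command line; the machine name, service name, driver, and ports below are placeholders, not details from this article:

# docker-machine: provision a Docker-ready VM, locally or on a cloud provider
docker-machine create --driver virtualbox demo-machine

# docker-compose: bring up a multi-container application described in a docker-compose.yml
docker-compose up -d

# docker swarm: turn this host into a swarm manager, then run and scale a service
docker swarm init
docker service create --name tf-api --replicas 3 -p 80:8888 tensorflow-av
docker service scale tf-api=5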

 

End Notes

Starting a new habit is a difficult task. But once the learning curve smoothens out, things start to work and new ideas open up with use. It is the same with Docker, and I hope this primer makes you think about using it in your daily Data Science workflows. Comment down below: how do you plan to use Docker, starting today?

About the Author

Prathamesh Sarang works as a Data Scientist at Lemoxo Technologies. Data Engineering is his latest love, turned towards the *nix faction recently. Strong advocate of "Markdown for everyone".


18 Comments

  • siva says:

    I’ve been waiting for new post from last one week. I think U are busy with DATA HACK SUMMIT 2017.
    well my question is,
    1. is it mandatory to learn DOCKER for beginners
    2. containers are like mini VM, so do they consume extra memory apart from memory for regular calculations in tensor flow.
    3. i tried to install FASTTEXT in my anaconda but failed and realized that it needs linux OS. so can i install it with DOCKER without installing LINUX.
    thank you so much valuable post. keep it up!

    • Prathamesh Sarang says:

      Hi Siva,

      1. Not mandatory as such, you are free to use Vagrant or regular VMs (using virtualbox) on your systems. But Docker containers are light-weight and frankly no extra overhead to run those.
      2. Technically they are just threads and run like an application on your system. You can restrict the memory usage when you spin up docker containers, but by default it uses whatever memory your host system (local or cloud VM) can spare. For more info, do try to read this: https://docs.docker.com/engine/admin/resource_constraints/
3. Yes, you need to install docker and then do search for docker image for fasttext on dockerhub. I looked it up, there’s no official image, but someone has created a fasttext docker. You can use that. If that doesn’t serve your purpose then why not create your own docker image for fasttext and publish it 🙂

      Hope this helps.

      Thanks and Regards,
      Prathamesh

  • Nishant Kumar singh says:

    Hi ,
    Since I know vagrant , vagrantfile , It is easy to learn docker concepts.
    Very nice tutorial.It really helps me.Thanks

  • Sankar says:

    Wow! A great and simple docker tutorial in the context of Data Science. Thanks for the examples and screenshots.

  • Samarth says:

    Simply Awesome !

  • franz says:

    Nicely done. Here is a container (https://github.com/ufoym/deepo) which contains all your learning libraries as one container. I agree with your approach to have a container per library. Also, you can use nvidia-docker to access the GPU but it is only UNIX and is upgrading to 2.0. Again nicely done on the write up.

  • Rajendran K says:

    I attended Data Hack summit 2017 and this is the first post I read about AI,ML etc.
    This article has given me some ideas on different aspects of configuring the system for AI and helps to identify the beginning point from some where. I could get at least few concepts which will be of help to explore it further to understand it better. The article is simple and easy to understand and I appreciate it.

    • Prathamesh Sarang says:

      Hi Rajendran,

Hope that this helps you in integrating Docker in your daily work 🙂

      Thanks and Regards,
      Prathamesh

  • Carlos says:

    Great, I will try it!

  • Mark says:

    Hi Prathamesh Sarang, thanks for your tutorial. For me it seems that chapter 6 has some jumps. Could you explain that a bit more. Like what are the exact commands to run the Docker file. I have used `docker build -t tensorflow-co:v1 – < Dockerfile` within the directory where I saved the requirements.txt file.
    However, I get this error message:
    Step 3/6 : COPY requirements.txt requirements.txt
    COPY failed: stat /var/lib/docker/tmp/docker-builder354864138/requirements.txt: no such file or directory
    Cheers

  • Vasanth Gopal says:

    Thanks Sarang. In fact I was fascinated about the docker during the #DHS2017 where I saw your presentation. Quite frankly it was totally new to me.
    I was just wondering whether we can use this technology to bypass the setup issues we face while trying to connect AWS instances to the deep learning server for fast.ai course which I am currently pursuing.
    Can you suggest if it is possible and if yes then how to go about it?

    Thanks

    Regards

    Vasanth

  • Deep says:

    Brilliant explanation !! Thanks a lot.

  • Prathamesh Sarang says:

    Hi Vasanth,

    I’m assuming the deep learning servers you are talking about are just regular GPU enabled VMs (AWS does have Deep Learning AMIs which have drivers and CUDA). If yes then you need to use nvidia-docker. nvidia-docker is a wrapper above the original docker engine that helps you get docker images (nvidia-docker pull) via the command line and run it seamlessly without any driver/CUDA installations.

The official repo: https://github.com/NVIDIA/nvidia-docker

    In the comment above by Franz, he has shared the repo that helps you do it: https://github.com/ufoym/deepo

Hope this helps 🙂

    Thanks and Regards,
    Prathamesh

  • Meng Lee says:

This is a very useful article for beginners, keep up the nice work 🙂