How to Put Jupyter Notebooks in a Dockerfile

As a data scientist, I find Jupyter notebooks an invaluable tool that aids my day-to-day work in many ways. The Jupyter Notebook is an open-source web application that lets a developer or data scientist create documents showing the output of code written in multiple languages (e.g., Julia, Python, R), annotated with prose and visualizations.

Jupyter notebooks have a wealth of different uses including as a testing ground for development work, a presentation platform, and more. Some of the applications I use most include:

  • Designing, developing, and testing solutions to problems I’m working on using notebooks’ REPL capabilities.
  • Presenting analyses I’ve completed, demonstrating both the code and the output for them in tidy, concise cells that can be easily turned into slides.
  • Providing hands-on walkthroughs of new library modules, visualization techniques, and strategies for attacking existing problems. They let someone follow along while leaving them room to try out new things right in line.

The four major drawbacks of Jupyter

As great as Jupyter is, however, it does have some drawbacks, especially when it comes to sharing your work with other people and collaborating with teammates. These drawbacks are a big reason why, although many data scientists describe Jupyter notebooks as excellent for collaboration and knowledge-sharing, in practice that collaboration can be tough. Here’s why:

  1. Ad hoc nature. The ad hoc nature of notebooks is excellent for trying things out, but it tends to cause problems when you need to reproduce your work for someone else. Cells end up scattered all over the place and have been run in a random order as you tried to get something working, so disentangling which thing should come first can feel like more effort than it’s worth.
  2. Time-consuming setup. When you use Jupyter notebooks to develop workflows, you might spend a lot of time on expensive setup, cleaning, or training operations that a new audience doesn’t need to repeat. It would be easier if they could start with the cleaned data and the trained model and get right to the analysis.
  3. Burdensome to share steps. Even if you do want someone to repeat all your steps, ensuring they have their system set up the same way you did when you made the initial analysis requires you both to do everything on your end correctly and to make sure anyone using your analysis can easily set up and get started. This step can be non-trivial. It might require you to save a requirements.txt file with the correct specific versions of your packages, make your module installable with a setup.py file, run a specific version of Python, and ensure you don’t have any conflicting dependencies with your other libraries (or set up a virtual environment for just this analysis, install the requirements, load the virtualenv as a conda environment that your Jupyter notebook can access, and remember to activate it as the kernel used when you review the analysis).
  4. Not built for collaboration. Jupyter notebooks are notoriously hard to collaborate on with version control systems like git. Their JSON format makes it extremely difficult to tell where something actually changed and where a cell has simply been executed again.

Containerizing your Jupyter notebook

Containerization can take some of these headaches away, or at least shift them onto the developer of the core code rather than the intended audience. Docker containers are an excellent way to package up an analysis. They can include the data you need along with any scripts and code, and they run the same way on everyone’s machine, with nothing to install beyond Docker itself.

Before diving into the five steps to containerization, imagine your work is organized like this:
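For concreteness, here is the kind of layout I mean (the project and notebook filenames are just placeholders; the pieces that matter are module.py, the data folder, and the notebooks folder):

new_project/
├── module.py
├── data/
│   └── raw_data.csv
└── notebooks/
    └── analysis.ipynb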

The module.py file does the heavy lifting; it’s what you spent all your time developing. The notebooks folder contains just a walkthrough of the analysis and visualization that you want to be runnable for an audience that wants to poke around. The raw data lives in the data folder. A quick aside: this isn’t the best way to organize a Python module, especially one under active development alongside a notebook, but it represents a pretty common pattern for showing off work I’ve done.

You’ll notice that there’s no cleaned data or saved model. Cleaning the data and training the model is the job of the module.py file, and in order to use it, we’ll want to run those functions in the Docker container. Running them in the container ensures that the process is truly repeatable and provides an important quality-control check. The one piece we still need is a requirements.txt file (or Pipfile, if you use Pipenv). If you don’t have one yet (as above), you can generate it with:

pip freeze > requirements.txt
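The resulting file is just one pinned package per line; for a project like this it might look something like the following (the package names and versions here are purely illustrative):

pandas==1.3.5
scikit-learn==1.0.2
joblib==1.1.0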

Five steps to containerize your Jupyter notebook in Docker

1. Start with a Dockerfile

Create a Dockerfile (just name the file Dockerfile) in the same folder as the module.py file.

2. Set up the operating system and source code Docker will run

I started from an Ubuntu base, but a slimmer environment (or an official Python image) will also work.

FROM ubuntu:latest
RUN apt-get update && apt-get -y upgrade
RUN apt-get install -y build-essential python3.6 python3-pip python3-dev
RUN pip3 -q install pip --upgrade

Next, create a src working directory and copy the entire project directory over to it: data, notebooks, and all. Once it is started, the container will have an exact copy of what you have locally.

RUN mkdir src
WORKDIR src/
COPY . .
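A quick optional aside: because COPY . . grabs everything in the project folder, you can drop a .dockerignore file next to the Dockerfile to keep things like version-control history and caches out of the image. A minimal example (the entries below are just common suggestions, not part of the original project):

.git
__pycache__/
*.pyc
.ipynb_checkpoints/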

One final piece of setup for Python projects is installing the libraries your project needs. Remember, even if you already have them installed on your local computer, you need to install them inside the container, so these lines should be in your Dockerfile.

RUN pip3 install -r requirements.txt
RUN pip3 install jupyter

3. Prep your data and run your code

Run the process that will clean your data and train a model. This will save the trained model into your container as the result of your module.py process. It can then be used by the notebook code in the notebooks folder.

RUN python3 module.py
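The module.py script itself isn’t shown here. As a rough sketch of the kind of script this step assumes (pandas, scikit-learn, and all of the column and file names below are placeholders, not part of the original project), it might look something like this:

# module.py (illustrative sketch): clean the raw data and save a trained model
import pandas as pd
import joblib
from sklearn.linear_model import LogisticRegression

def main():
    # Load and clean the raw data
    df = pd.read_csv("data/raw_data.csv")
    df = df.drop_duplicates().dropna()
    df.to_csv("data/clean_data.csv", index=False)

    # Train a simple model and save it next to the notebooks so they can load it
    X = df.drop(columns=["target"])
    y = df["target"]
    model = LogisticRegression().fit(X, y)
    joblib.dump(model, "notebooks/model.joblib")

if __name__ == "__main__":
    main()

The notebook can then load the cleaned dataset and the saved model instead of repeating the expensive steps.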

Since the module.py process saves a cleaned, deduplicated, processed dataset, go ahead and remove the raw data from the Docker image. This makes the image smaller (and thus easier to download) but is entirely optional. If you anticipate changes to the data cleaning and preparation process, you can leave it; in general, though, if you don’t need it, cut it out.

RUN rm /src/data/raw_data.csv

Setting the working directory to the notebooks folder ensures that whenever someone starts the Docker container, they start right at the notebooks you have saved.

WORKDIR /src/notebooks

The next snippet comes from the Jupyter Docker Stacks project, an open-source repository that builds ready-to-use data science notebook images for development and visualization projects. Those images are great for development, but loading your own data into them can be a little tricky, so here we just borrow the piece that keeps the container stable: Tini, a small init process that reaps zombie processes and forwards signals to Jupyter. It helps prevent kernel crashes and should be included.

# Add Tini. Tini operates as a process subreaper for jupyter. This prevents kernel crashes.
ENV TINI_VERSION v0.6.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /usr/bin/tini
RUN chmod +x /usr/bin/tini
ENTRYPOINT ["/usr/bin/tini", "--"]

From there, you just need the command that starts up the notebook at the end of the Dockerfile. The --ip=0.0.0.0 flag makes the server reachable from outside the container, and --allow-root is needed because everything in this container runs as root.

CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]

4. Build a Docker container

Once you have the Dockerfile saved, you just need to run it locally to make sure it works. To run it, you first have to build it. Use the -t flag to tag the image with a name in your local system. Once you’ve finished verifying that it works, you can also push it up to an account on Docker Hub. Type this into your terminal (and don’t forget the dot at the end):

docker build -t myaccount/new_project .

5. Start the Jupyter notebook and log in

After that finishes building, you can test out the notebook. The -p flag here is important: it publishes the port the notebook server listens on inside the container to a port on your local machine.

docker run -p 8888:8888 myaccount/new_project
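If something else is already using port 8888 on your machine, you can publish the container’s port to a different local one instead, for example:

docker run -p 9999:8888 myaccount/new_project

and then browse to http://localhost:9999.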

This will fire up your Jupyter notebook using the final command in the Dockerfile. Browse to http://localhost:8888 and you’ll see the familiar Jupyter login page.

You can log in using the token listed in your terminal, where you started up the image.

[I 17:58:09.296 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 17:58:09.544 NotebookApp] Serving notebooks from local directory: /src/notebooks
[I 17:58:09.544 NotebookApp] The Jupyter Notebook is running at:
[I 17:58:09.544 NotebookApp] http://416e64cc88f8:8888/?token=fe978e3ff88080bd7d7790750e955b0071cf5b8849462b74
[I 17:58:09.544 NotebookApp]  or http://127.0.0.1:8888/?token=fe978e3ff88080bd7d7790750e955b0071cf5b8849462b74
[I 17:58:09.545 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 17:58:09.551 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://416e64cc88f8:8888/?token=fe978e3ff88080bd7d7790750e955b0071cf5b8849462b74
     or http://127.0.0.1:8888/?token=fe978e3ff88080bd7d7790750e955b0071cf5b8849462b74

And with that, you should be logged in to your notebook, hosted ephemerally in the Docker container. The notebook is completely reproducible: stop the container and start it up again, and the document resets to exactly the state you saved.
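That reset is usually what you want for a demo. If you would rather have edits survive between runs, one option is to mount a local copy of the notebooks folder over the one in the image when you start the container, for example:

docker run -p 8888:8888 -v "$(pwd)/notebooks:/src/notebooks" myaccount/new_project

Changes made in the notebook are then written to your local folder rather than to the ephemeral container filesystem.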

One last thing to do is push the image up to Docker Hub (you’ll need to be logged in with docker login, and the image name has to start with your Docker Hub username, as it does here).

docker push myaccount/new_project

Once you do that, anyone can pull down the exact notebook environment you published and use it as if they had built it themselves:

docker pull myaccount/new_project
docker run -p 8888:8888 myaccount/new_project

Now that you’ve set up your Dockerfile to containerize your Jupyter notebook, your Jupyter projects will be more reproducible, shareable, and intuitive using Docker. This will allow you to demonstrate the process that leads to your conclusions—without having to stop and start because the environment you’re demoing in isn’t exactly what you expected. For more time-saving tips and thoughtful discussions on data science, data analytics, and more, be sure to sign up for our newsletter.
