How to build a Python Data Science container using Docker

Artificial Intelligence (AI) and Machine Learning (ML) are very popular these days. They power a wide spectrum of use cases ranging from self-driving cars to drug discovery. AI and ML have a bright and thriving future ahead of them.

On the other hand, Docker revolutionized the computing world through the introduction of ephemeral, lightweight containers. Containers package all the software required to run inside an image (a bunch of read-only layers), with a COW (Copy-On-Write) layer to persist the data.

Let’s get started with building a Python Data Science container.

Building the Data Science container

Python is fast becoming the go-to language for data scientists. For this reason, we are going to use Python to build our Data Science container.

The Base Alpine Linux image

Alpine Linux is a tiny Linux distribution designed for power users who appreciate security, simplicity, and resource efficiency.

As claimed by Alpine:

Small. Simple. Secure. Alpine Linux is a security-oriented, lightweight Linux distribution based on musl libc and busybox.

The Alpine image is surprisingly tiny with a size of no more than 8MB for containers. Minimal packages are installed to reduce the attack surface on the underlying container, making Alpine a good choice for our data science container.

Downloading and running an Alpine Linux container is as simple as:

$ docker container run --rm alpine:latest cat /etc/os-release

In our Dockerfile, we can simply use the Alpine base image as:

FROM alpine:latest

Talk is cheap, let’s build the Dockerfile

Now, let’s work our way through the Dockerfile:

FROM alpine:latest
LABEL MAINTAINER="Faizan Bashir <faizan.ibn.bashir@gmail.com>"
# Linking of locale.h as xlocale.h
# This is done to ensure successful install of python numpy package
# see https://forum.alpinelinux.org/comment/690#comment-690 for more information.
WORKDIR /var/www/
# SOFTWARE PACKAGES
#   * musl: standard C library
#   * libc6-compat: compatibility libraries for glibc
#   * linux-headers: commonly needed, and an unusual package name from Alpine.
#   * build-base: used so we include the basic development packages (gcc)
#   * bash: so we can access /bin/bash
#   * git: to ease up clones of repos
#   * ca-certificates: for SSL verification during Pip and easy_install
#   * freetype: library used to render text onto bitmaps, and provides support for font-related operations
#   * libgfortran: contains a Fortran shared library, needed to run Fortran
#   * libgcc: contains shared code that would be inefficient to duplicate every time as well as auxiliary helper routines and runtime support
#   * libstdc++: The GNU Standard C++ Library. This package contains an additional runtime library for C++ programs built with the GNU compiler
#   * openblas: open source implementation of the BLAS(Basic Linear Algebra Subprograms) API with many hand-crafted optimizations for specific processor types
#   * tcl: scripting language
#   * tk: GUI toolkit for the Tcl scripting language
#   * libssl1.0: SSL shared libraries
ENV PACKAGES="\
    dumb-init \
    musl \
    libc6-compat \
    linux-headers \
    build-base \
    bash \
    git \
    ca-certificates \
    freetype \
    libgfortran \
    libgcc \
    libstdc++ \
    openblas \
    tcl \
    tk \
    libssl1.0 \
"
# PYTHON DATA SCIENCE PACKAGES
#   * numpy: support for large, multi-dimensional arrays and matrices
#   * matplotlib: plotting library for Python and its numerical mathematics extension NumPy.
#   * scipy: library used for scientific computing and technical computing
#   * scikit-learn: machine learning library integrates with NumPy and SciPy
#   * pandas: library providing high-performance, easy-to-use data structures and data analysis tools
#   * nltk: suite of libraries and programs for symbolic and statistical natural language processing for English
ENV PYTHON_PACKAGES="\
    numpy \
    matplotlib \
    scipy \
    scikit-learn \
    pandas \
    nltk \
" 
RUN apk add --no-cache --virtual build-dependencies python --update py-pip \
    && apk add --virtual build-runtime \
    build-base python-dev openblas-dev freetype-dev pkgconfig gfortran \
    && ln -s /usr/include/locale.h /usr/include/xlocale.h \
    && pip install --upgrade pip \
    && pip install --no-cache-dir $PYTHON_PACKAGES \
    && apk del build-runtime \
    && apk add --no-cache --virtual build-dependencies $PACKAGES \
    && rm -rf /var/cache/apk/*
CMD ["python"]

Explanation

The FROM directive is used to set alpine:latest as the base image.
Using the WORKDIR directive, we set /var/www/ as the working directory for our container.
The ENV PACKAGES variable lists the software packages required for our container, such as git, openblas, and libgfortran.
The Python packages for our Data Science container are defined in the ENV PYTHON_PACKAGES variable.
We have combined all the commands under a single Dockerfile RUN directive to reduce the number of layers. This will, in turn, help with reducing the resultant image size.
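To make the role of the ENV lists concrete, here is a minimal Python sketch of what happens when the RUN directive expands the unquoted $PYTHON_PACKAGES: the shell splits the value on whitespace, so each package name becomes its own argument to pip. The string below only approximates the value Docker stores (the exact spacing is an assumption), and pip_install_args is a hypothetical helper written for illustration.

```python
# Approximation of the value Docker stores for PYTHON_PACKAGES once the
# backslash-continued lines are joined: one string with runs of spaces.
# (The exact amount of whitespace is an assumption; it does not matter
# for the result.)
PYTHON_PACKAGES = "    numpy     matplotlib     scipy     scikit-learn     pandas     nltk "


def pip_install_args(env_value):
    """Mimic the shell's word splitting of an unquoted $VARIABLE."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty strings, much like default shell word splitting.
    return ["pip", "install", "--no-cache-dir"] + env_value.split()


if __name__ == "__main__":
    print(pip_install_args(PYTHON_PACKAGES))
```

The same splitting applies to $PACKAGES in the final apk add step, which is why both lists can be written one package per line.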

Building and tagging the image

Now that we have our Dockerfile defined, navigate to the folder containing the Dockerfile in the terminal. Then, build the image using:

$ docker build -t faizanbashir/python-datascience:2.7 -f Dockerfile .

The -t flag is used to name and optionally tag the image in the ‘name:tag’ format. The -f flag is used to define the name of the Dockerfile (default is PATH/Dockerfile).
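If you would rather drive the build from a script, the command above can be assembled in Python. This is a minimal sketch: docker_build_command and build_image are hypothetical helper names, and it assumes the docker CLI is on your PATH. Only the -t and -f flags described above are used.

```python
import subprocess


def docker_build_command(image, tag, dockerfile="Dockerfile", context="."):
    """Assemble the argument list for `docker build -t image:tag -f dockerfile context`."""
    return ["docker", "build", "-t", "{}:{}".format(image, tag), "-f", dockerfile, context]


def build_image(image, tag, dockerfile="Dockerfile", context="."):
    """Run the build; raises CalledProcessError if docker exits with an error."""
    subprocess.check_call(docker_build_command(image, tag, dockerfile, context))


if __name__ == "__main__":
    # Only prints the command; call build_image(...) to actually run it.
    print(" ".join(docker_build_command("faizanbashir/python-datascience", "2.7")))
```

Passing the arguments as a list avoids shell quoting issues if the build context path contains spaces.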

Running the container

We have successfully built and tagged the Docker image. Now we can run the container using the following command:

$ docker container run --rm -it faizanbashir/python-datascience:2.7 python

Voila, we are greeted by the sight of a Python shell ready to perform all kinds of cool Data Science stuff.

Python 2.7.15 (default, Aug 16 2018, 14:17:09)
[GCC 6.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
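Once the shell is up, a quick sanity check is to confirm that every package from the Dockerfile actually imports. The snippet below is a minimal sketch rather than part of the original image: check_imports is a hypothetical helper, and the module list assumes the usual import names (scikit-learn is imported as sklearn). It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too

import importlib

# Import names for the packages installed by the Dockerfile.
# Note that scikit-learn is imported as "sklearn".
EXPECTED_MODULES = ["numpy", "matplotlib", "scipy", "sklearn", "pandas", "nltk"]


def check_imports(module_names):
    """Map each module name to its version string, "unknown" if the module
    has no __version__ attribute, or None if the import failed."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in sorted(check_imports(EXPECTED_MODULES).items()):
        print("{:<12} {}".format(name, version if version else "MISSING"))
```

You can paste it into the interactive shell, or save it to a file and mount that file into the container to run it as a script.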

Our container comes with Python 2.7, but don’t be sad if you want to work with Python 3.6, as there is a Dockerfile for Python 3.6 too:

{% gist 9443a7149cc53f81d84d0d356f871ec7 %}

Build and tag the image:

$ docker build -t faizanbashir/python-datascience:3.6 -f Dockerfile .

Run the container:

$ docker container run --rm -it faizanbashir/python-datascience:3.6 python

You now have a ready-to-use container for all kinds of cool Data Science stuff.

However, everything above assumes you have the time and resources to set all of this up. If you don’t, you can pull the existing images that I have already built and pushed to Docker’s registry, Docker Hub, using:

# For Python 2.7 pull
$ docker pull faizanbashir/python-datascience:2.7
# For Python 3.6 pull
$ docker pull faizanbashir/python-datascience:3.6

After pulling an image, you can run it directly, extend it in your own Dockerfile, or use it as the image in your docker-compose or stack file.

Python Data Science packages

Our Python Data Science container makes use of the following Python packages:

NumPy: NumPy, or Numerical Python, supports large, multi-dimensional arrays and matrices. It provides fast, precompiled functions for mathematical and numerical routines. In addition, NumPy optimizes Python programming with powerful data structures for efficient computation of multi-dimensional arrays and matrices.
SciPy: SciPy provides useful functions for regression, minimization, Fourier transformation, etc. SciPy extends NumPy's capabilities and, like NumPy, its main data structure is the multidimensional array. This package contains tools that help with solving linear algebra, probability theory, integral calculus, and much more.
Pandas: Pandas offers versatile and powerful tools for manipulating data structures and performing extensive data analysis. It works well with incomplete, unstructured, and unordered real-world data and provides tools for shaping, aggregating, analyzing, and visualizing datasets.
Scikit-learn: Scikit-learn is a Python module that integrates a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. It is one of the best-known machine learning libraries for Python. The Scikit-learn package focuses on bringing Machine Learning to non-specialists using a general-purpose, high-level language. The primary emphasis is on ease of use, performance, documentation, and API consistency. With minimal dependencies and easy distribution under the simplified BSD license, Scikit-learn is widely used in academic and commercial settings. It exposes a concise and consistent interface to common Machine Learning algorithms in order to simplify the process of bringing ML into production systems.
Matplotlib: Matplotlib is a Python 2D plotting library that is capable of producing publication-quality figures in a wide variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
NLTK: NLTK is the leading platform for building Python programs that can work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources (such as WordNet), along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Aftermath

I hope this article helped you learn how to build containers for your Data Science projects.

You can check out the code at faizanbashir/python-datascience.

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)