Distributed Deep Learning

Introduction

While MLlib provides scalable implementations of classic machine learning algorithms, it does not natively support deep learning libraries such as TensorFlow and PyTorch. There are libraries that parallelize the training of deep learning models on Spark, but they require the dataset to fit in memory on each worker node, so these approaches are best suited to distributed hyperparameter tuning on medium-sized datasets.

For the model application stage, where we already have a trained deep learning model and need to apply it to a large user base, we can use Pandas UDFs. With Pandas UDFs, we can:

  • partition and distribute our dataset.
  • run the resulting dataframes against a Keras model.
  • compile the results back into a single large Spark dataframe, as sketched below.
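
The snippet below is a minimal sketch of this pattern, not the lesson's exact code. The dataframe name games_df, the feature columns g1 and g2, and the saved model file model.h5 are illustrative assumptions; it also assumes TensorFlow is installed on the workers and the model file is readable from every worker node.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from tensorflow import keras

# Assumption: a trained Keras regression model saved to a path that is
# accessible from every worker (e.g., a shared filesystem).
MODEL_PATH = "model.h5"

@pandas_udf("double")
def predict_udf(g1: pd.Series, g2: pd.Series) -> pd.Series:
    # Each worker loads the model and scores its partition of the data.
    # In practice, you may broadcast the model weights instead to avoid
    # reloading the model for every batch.
    model = keras.models.load_model(MODEL_PATH)
    features = pd.concat([g1, g2], axis=1).values
    return pd.Series(model.predict(features).flatten())

# Score the full Spark dataframe in parallel; Spark partitions the data,
# runs the UDF on each partition, and reassembles the results.
scored_df = games_df.withColumn("prediction", predict_udf("g1", "g2"))
```

Because the UDF receives each partition as a Pandas series, the model only ever sees a batch at a time, so the full dataset never needs to fit in memory on a single node during scoring.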

This lesson shows how we can take the Keras model that we built in Keras Regression and scale it to larger datasets using PySpark and Pandas UDFs. However, the data used for training the model still needs to fit into memory on the driver node.

We’ll use the same datasets from the prior lesson, where we split the games dataset into training and test sets of users. This is a relatively small dataset, so we can use the toPandas operation to load the dataframe onto the driver node, as shown in the snippet below. The result is a dataframe and a list that we can provide as inputs to train a Keras deep learning model.
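
The following is a minimal sketch of this step under some assumptions: the training split from the prior lesson is available as a Spark dataframe named train_df, and the column names g1, g2, and label are illustrative placeholders rather than the lesson's actual schema.

```python
# Collect the Spark dataframe onto the driver as a Pandas dataframe.
# This only works while the training data fits in driver memory.
train_pd = train_df.toPandas()

# The feature dataframe and label list we pass to the Keras model.
X_train = train_pd[["g1", "g2"]]
y_train = list(train_pd["label"])
```

This is why the distribution story in this lesson applies to the scoring stage: toPandas collects the entire dataset onto the driver, so training itself remains a single-node operation.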
