Python for Scalable Compute

Learn why Python is the leading language in data science.

Python for data science

Python is quickly becoming the de facto language for data science. Beyond its huge library of packages that provide useful functionality, one of the reasons Python is so popular is that it can be used to build scalable data and predictive model pipelines.

Below is an example of building a predictive model in Python: it loads the Boston housing data set, trains a linear regression model, and evaluates it on a held-out test set. Click the button to execute the code in the embedded code widget.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# load the Boston housing data set from the original CMU source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT'])
bostonDF['label'] = target
# create train and test splits of the housing data set
x_train, x_test, y_train, y_test = train_test_split(bostonDF.drop(['label'], axis=1), bostonDF['label'], test_size=0.33)
# train a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)
# print results on the held-out test set
print("R^2: " + str(model.score(x_test, y_test)))
print("Mean Absolute Error: " + str(sum(abs(y_test - model.predict(x_test))) / y_test.count()))

You can use Python on your local machine to build predictive models with scikit-learn, or you can use environments such as Dataflow and PySpark to build distributed pipelines. While these environments rely on different libraries and programming paradigms, they all use the same language: Python.
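
For example, here is a minimal sketch of the same linear regression workflow expressed with PySpark's MLlib API. It assumes a local Spark session and reuses the bostonDF DataFrame from the example above, so the column names match; it is an illustration rather than a production pipeline.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# start a local Spark session (assumes PySpark is installed)
spark = SparkSession.builder.appName("boston-lr").getOrCreate()
# convert the pandas DataFrame from the example above into a Spark DataFrame
housing = spark.createDataFrame(bostonDF)
# assemble the feature columns into a single vector column
feature_cols = [c for c in housing.columns if c != "label"]
housing = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(housing)
# split, train, and evaluate, mirroring the scikit-learn workflow
train, test = housing.randomSplit([0.67, 0.33])
model = LinearRegression(featuresCol="features", labelCol="label").fit(train)
print("R^2: " + str(model.evaluate(test).r2))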

It’s no longer necessary to translate an R script into a production language such as Java; you can use the same language for both development and production of predictive models. It took me a while to adopt Python as my data science language of choice.

Java had been my preferred language, regardless of the task, since early in my undergraduate career. For data science tasks, I used tools like Weka to train predictive models. I still find Java to be useful when building data pipelines, and it’s great to know for directly collaborating with engineering teams on projects.

I later switched to R while working at Electronic Arts and found the transition to an interactive coding environment quite useful for data science. One of the features I really enjoyed in R was R Markdown, which you can use to write documents with inline code.

Reasons to learn Python

When I started working at Zynga in 2018, I adopted Python and haven’t looked back. It took a bit of time to get used to the new language, but there were a number of reasons that convinced me to learn Python.

Here are some of those reasons:

  • Momentum: Many teams already use Python for production services or for portions of their data pipelines, so it makes sense to also use Python for analysis tasks.

  • PySpark: R and Java don’t provide a good path for authoring Spark tasks interactively. You can use Java with Spark, but it isn’t a good fit for exploratory work, and moving from Python to PySpark seems to be the most approachable way to learn Spark.

  • Deep learning: I’m interested in deep learning, and while there are R bindings for libraries such as Keras, it’s better to code in the native language of those libraries. I previously used R to author custom loss functions, and I had trouble debugging the errors that came up (see the sketch after this list).

  • Libraries: In addition to the deep learning libraries offered for Python, there are a number of other useful tools, including Flask and Bokeh. There are also notebook environments that can scale, including Google’s Colaboratory and AWS SageMaker.
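
To make the deep learning point concrete, here is a minimal sketch of a custom loss function written directly in Python with Keras. The loss, the tiny model, and the synthetic data are all hypothetical; the point is that the loss is ordinary Python code that you can step through and debug in the library’s native language.

import numpy as np
import tensorflow as tf
from tensorflow import keras
# a hypothetical custom loss: mean squared error plus a small L1 penalty on the predictions
def custom_loss(y_true, y_pred):
    squared_error = tf.reduce_mean(tf.square(y_true - y_pred))
    l1_penalty = 0.01 * tf.reduce_mean(tf.abs(y_pred))
    return squared_error + l1_penalty
# a tiny regression model on synthetic data, just to exercise the loss
x = np.random.rand(100, 10)
y = np.random.rand(100, 1)
model = keras.Sequential([keras.Input(shape=(10,)),
                          keras.layers.Dense(8, activation="relu"),
                          keras.layers.Dense(1)])
model.compile(optimizer="adam", loss=custom_loss)
model.fit(x, y, epochs=2, verbose=0)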

From R to Python

To ease the transition from R to Python, I’d recommend the following steps:

  • Focus on outcomes, not semantics: Instead of learning all the fundamentals of the language first, I focused on doing in Python what I already knew how to do in other languages, such as training a logistic regression model (see the sketch after this list).

  • Learn the ecosystem, not the language: I didn’t limit myself to the base language when learning. Instead, I jumped right into using Pandas and scikit-learn.

  • Use cross-language libraries: I already had experience with Keras and Plotly in R and used knowledge of these libraries to bootstrap learning Python.

  • Work with real-world data: I used the data sets provided by Google’s BigQuery to test out my scripts on large-scale data.

  • Start locally, if possible: While one of my goals was to learn PySpark, I first focused on getting things up and running on my local machine before moving to cloud ecosystems.
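
As an example of the first step, here is a minimal sketch of training a logistic regression model with scikit-learn. It uses the bundled breast cancer data set purely as a stand-in for whatever data you already know how to model in another language.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# load a small bundled data set as a stand-in for your own data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# train a logistic regression classifier and evaluate it on the test split
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print("Accuracy: " + str(clf.score(X_test, y_test)))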

There are many situations where Python is not the best choice for a specific task, but it has broad applicability for prototyping models and building scalable model pipelines.

Because of Python’s rich ecosystem, we will be using it for all the examples in this course.