Python for data science

Python is quickly becoming the de facto language for data science. Beyond its large ecosystem of packages, one reason for its popularity is that it can be used to build scalable data and predictive model pipelines.

Python 3.5

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# load the Boston housing data set; each record spans two rows in the raw file
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# even rows hold the first 11 features; odd rows hold the last 2 features and the target
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
bostonDF['label'] = target
# create train and test splits of the housing data set 
x_train, x_test, y_train, y_test = train_test_split(bostonDF.drop(['label'], axis=1), bostonDF['label'], test_size=0.33)
# train a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)
# print R^2 and the mean absolute error on the held-out test split
print("R^2: " + str(model.score(x_test, y_test)))
print("Mean Absolute Error: " + str(sum(abs(y_test - model.predict(x_test)))/y_test.count()))
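
The two slicing lines in the listing above are easy to misread. The following is a minimal sketch on a made-up array, assuming the raw file stores each record on two rows (11 values, then 3 values padded with NaN). The array `raw` and its values are hypothetical stand-ins for `raw_df.values`, not the real data.

```python
import numpy as np

# Hypothetical toy layout: every record is split across two rows.
# Even rows carry 11 values; odd rows carry 3 values, the rest is NaN padding.
n_records = 3
raw = np.full((2 * n_records, 11), np.nan)
for i in range(n_records):
    base = 100 * i
    raw[2 * i, :] = np.arange(11) + base                       # features 0..10
    raw[2 * i + 1, :3] = [11 + base, 12 + base, 1000 + i]      # features 11, 12, then target

# Same slicing as in the listing above:
# even rows in full, plus the first two columns of the odd rows
data = np.hstack([raw[::2, :], raw[1::2, :2]])
# the third column of the odd rows is the target
target = raw[1::2, 2]

print(data.shape)   # one row per record, 13 feature columns
print(target)
```

Each record ends up as one row of 13 features with its target held separately, which is the shape the `DataFrame` constructor in the listing expects.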

Introduction to Building Scalable Model Pipelines

Models as Web Endpoints

Models as Serverless Functions

Containers for Reproducible Models

Workflow Tools for Model Pipelines

PySpark for Batch Pipelines

Cloud Dataflow for Batch Modeling

Streaming Model Workflows

Conclusion

Python for Scalable Compute
