Python for Scalable Compute

Learn why Python is the leading language in data science.

Python for data science

Python is quickly becoming the de facto language for data science. Beyond its large ecosystem of packages, one reason for its popularity is that it can be used to build scalable data and predictive model pipelines.

Below is an example of modeling in Python: a linear regression model fit to the Boston housing data set.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# load Boston housing data set; each record spans two lines in the raw file,
# so even and odd rows are stitched back together below
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
bostonDF['label'] = target
# create train and test splits of the housing data set (33% held out for testing)
x_train, x_test, y_train, y_test = train_test_split(bostonDF.drop(['label'], axis=1), bostonDF['label'], test_size=0.33)
# train a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)
# print results: R^2 on the test split and the mean absolute prediction error
print("R^2: " + str(model.score(x_test, y_test)))
print("Mean Absolute Error: " + str((y_test - model.predict(x_test)).abs().mean()))
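
To make the two printed numbers less opaque, here is a minimal sketch of how an R^2 score and a mean absolute prediction error can be computed by hand. It uses only the standard library and a handful of made-up values rather than the Boston data, and the helper names `r_squared` and `mean_absolute_error` are hypothetical, chosen for this illustration only.

```python
# Hand-rolled versions of the two metrics, standard library only.
# The values below are made up for illustration; they are not housing data.

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [20.0, 30.0, 40.0, 50.0]      # hypothetical true labels
predicted = [22.0, 28.0, 41.0, 49.0]   # hypothetical model predictions

print("R^2: " + str(r_squared(actual, predicted)))
print("Mean Absolute Error: " + str(mean_absolute_error(actual, predicted)))
```

An R^2 close to 1 means the predictions explain most of the variance in the labels, while the mean absolute error is expressed in the same units as the label itself.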

You can use Python on your local machine and build predictive models with scikit-learn, or you can use environments such as Dataflow and PySpark to build distributed systems. While these environments use different libraries and programming paradigms, they all share the same language: Python.

It’s no longer necessary to translate an R script into a ...