Python for Scalable Compute
Learn why Python is the leading language in data science.
Python for data science
Python is quickly becoming the de facto language for data science. Beyond its huge ecosystem of packages, one reason for its popularity is that it can be used to build scalable data and predictive model pipelines.
Below is an example of modeling in Python.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# load Boston housing data set
# (load_boston was removed in scikit-learn 1.2, so the raw file is read directly)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# each record spans two rows in the raw file; stitch them back together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                                       'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
bostonDF['label'] = target

# create train and test splits of the housing data set
x_train, x_test, y_train, y_test = train_test_split(
    bostonDF.drop(['label'], axis=1), bostonDF['label'], test_size=0.33)

# train a linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

# print results
print("R^2: " + str(model.score(x_test, y_test)))
print("Mean Error: " + str(sum(abs(y_test - model.predict(x_test))) / y_test.count()))
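The example above prints two figures: the R^2 score returned by model.score and a mean error computed as the average absolute difference between labels and predictions. As a rough sketch of what those numbers measure, the same two metrics can be computed by hand in plain Python. The function names and the sample values here are made up for illustration and are not part of the lesson's code or the housing data set.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute difference between labels and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels and predictions, chosen only to illustrate the arithmetic
actual = [20.0, 30.0, 25.0, 35.0]
predicted = [22.0, 28.0, 26.0, 33.0]

print("R^2: " + str(r_squared(actual, predicted)))
print("Mean Error: " + str(mean_absolute_error(actual, predicted)))
```

An R^2 close to 1 means the predictions explain most of the variance in the labels, while the mean error is expressed in the same units as the label itself.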
You can use Python on your local machine to build predictive models with scikit-learn, or you can use environments such as Dataflow and PySpark to build distributed systems. Although these environments use different libraries and programming paradigms, all of them are programmed in Python.
It’s no longer necessary to translate an R script into a ...