Model Pipeline
Explore how to build scalable model pipelines in cloud environments using PySpark. Learn data transformations, regression modeling with MLlib, hyperparameter tuning through cross-validation, and saving predictions to cloud storage for use in workflows.
To read in the natality dataset, we can use the read function with the Avro format to fetch the dataset. Because Spark evaluates lazily, the dataframe is not materialized when it is declared; no data is retrieved until an action such as the display command is used to sample the dataset, as shown in the snippet below:
Data transformation
Before we can use MLlib to build a regression model, we need to perform a few transformations on the dataset to select a subset of the features, cast data types, and split records into training and test groups. We'll also use the fillna function, as shown below, to replace any null values in the ...