Machine Learning
Train a machine learning model with scaled training data, predict with test data, and visualize predictions.
Let's create a machine learning model using scikit-learn's linear regression module to predict the house price from the selected features.
Get started
Let’s say we have cleaned our data, treated the missing values and categorical variables, removed outliers, and created any new features we need. Now, our data is ready to feed into the machine learning model. The very first thing to do is to separate our data into the following:
X: Will contain the selected features, also called independent variables.
y: Will contain the target values (in this case, the house price), also called the dependent variable.
Note: Uppercase X and lowercase y are just conventions, and it is recommended to use these variable names for features and target, respectively.
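As a minimal sketch of this separation, assume the cleaned data lives in a pandas DataFrame. The DataFrame `df`, its column names, and its values are hypothetical and only stand in for the real housing dataset.

```python
import pandas as pd

# Hypothetical cleaned housing data; column names and values are illustrative only.
df = pd.DataFrame({
    "area_sqft": [850, 1200, 1500, 2000, 2400],
    "bedrooms": [2, 3, 3, 4, 4],
    "age_years": [30, 15, 10, 5, 2],
    "price": [150000, 210000, 260000, 340000, 400000],
})

# Assumed feature selection: every column except the target.
feature_cols = ["area_sqft", "bedrooms", "age_years"]

X = df[feature_cols]  # features (independent variables), a 2-D DataFrame
y = df["price"]       # target (dependent variable), a 1-D Series

print(X.shape, y.shape)
```

Keeping X two-dimensional (one row per house, one column per feature) and y one-dimensional matches the input layout that scikit-learn estimators expect.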
Standardization: Feature scaling
Let's see what X (original unscaled features) looks like.
Remember, machine learning models that are trained with gradient descent, which can include linear regression, logistic regression, and neural networks, generally converge faster and more reliably when the features are on a similar scale. Let’s scale our features and check the difference.
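One way to standardize features is scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance using z = (x - mean) / std. The sketch below uses a small made-up array in place of the real feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up unscaled features: [area in square feet, number of bedrooms].
X = np.array([
    [850.0, 2.0],
    [1200.0, 3.0],
    [1500.0, 3.0],
    [2000.0, 4.0],
])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then applies them.
X_scaled = scaler.fit_transform(X)

print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column standard deviations after scaling:", X_scaled.std(axis=0))
```

After scaling, the two columns are directly comparable even though the original area values were hundreds of times larger than the bedroom counts.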
We have standardized all the features in the code above before splitting them into train and test datasets. It’s important to know that a model trained on standardized features needs unseen features to be standardized in the same way before it can make predictions. So, it’s considered good practice to serialize (save) the transformation fitted on the training dataset. We can then load it and transform the unseen data before making predictions.
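Saving and reloading a fitted scaler might look like the sketch below. It assumes `joblib` for serialization (a common choice for scikit-learn objects), and the file name `scaler.joblib`, the temporary directory, and the sample values are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training features: [area in square feet, number of bedrooms].
X_train = np.array([
    [850.0, 2.0],
    [1200.0, 3.0],
    [1500.0, 3.0],
    [2000.0, 4.0],
])

# Fit the scaler on the training features only.
scaler = StandardScaler().fit(X_train)

# Save the fitted scaler; a temporary directory keeps the example self-contained.
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
joblib.dump(scaler, scaler_path)

# Later (for example, in a prediction script): load it and transform unseen data.
loaded_scaler = joblib.load(scaler_path)
X_new = np.array([[1700.0, 3.0]])
X_new_scaled = loaded_scaler.transform(X_new)  # transform only, never refit

print(X_new_scaled)
```

The key point is that the unseen data goes through `transform` with the saved mean and standard deviation, rather than being fitted again with its own statistics.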
Linear regression model training
Let's train our very first machine learning model.
Train-test split
Now, we have features in X and target (price) in y. The next step is to split the data into:
A training set (X_train and y_train)
A testing set (X_test and y_test)
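Producing these four pieces could be sketched with scikit-learn's `train_test_split`. The placeholder arrays, the 80/20 ratio (`test_size=0.2`), and `random_state=42` are assumptions chosen for illustration, not values taken from this lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus one target per sample.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set:", X_train.shape, y_train.shape)
print("Testing set:", X_test.shape, y_test.shape)
```

The function shuffles the rows by default and keeps each feature row paired with its target value, so `X_train[i]` still corresponds to `y_train[i]`.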
This splitting is important and can be conveniently done using scikit-learn's train_test_split function. After splitting, we'll train our model on the training part of the dataset, which is in X_train and y_train. Then we'll use X_test from the test part of our dataset to ...