How to scale numerical features in Python with vaex

In data science and machine learning, scaling numerical features is an essential step toward building accurate and reliable models. This Answer covers how to scale numerical features in Python with vaex for our machine learning projects.

Why does numerical feature scaling matter?

Let’s assume we are trying to predict housing prices using features like square footage, number of bedrooms, and distance to the nearest school. These features likely have different scales and units—for example, square footage could range from hundreds to thousands, while the number of bedrooms might only range from 1 to 5. If we do not scale these features appropriately, our model may prioritize certain features over others simply because of their scale, leading to biased and inaccurate predictions.

Scaling numerical features addresses this problem by bringing all features to a comparable scale, so that no single feature dominates the learning process merely because of its units. This matters most for models that rely on distances or gradient-based optimization, and it generally results in more reliable predictions.
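To make the effect concrete, here is a minimal sketch using NumPy. The three houses and the assumed feature ranges are invented for illustration and do not come from a real dataset. It compares Euclidean distances before and after a simple min-max rescaling.

```python
import numpy as np

# Hypothetical houses described as [square footage, number of bedrooms]
a = np.array([1500.0, 2.0])
b = np.array([1520.0, 5.0])  # similar size, very different bedroom count
c = np.array([1900.0, 2.0])  # same bedroom count, noticeably larger

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# On the raw values, square footage dominates the distance
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Assumed feature ranges, used to rescale every feature to [0, 1]
lo = np.array([500.0, 1.0])
hi = np.array([4000.0, 5.0])

def rescale(v):
    """Min-max rescaling with the assumed ranges above."""
    return (v - lo) / (hi - lo)

scaled_ab = euclidean(rescale(a), rescale(b))
scaled_ac = euclidean(rescale(a), rescale(c))

print(raw_ab < raw_ac)        # on raw values, b looks closer to a than c does
print(scaled_ab > scaled_ac)  # after rescaling, the ordering flips
```

On the raw values, the three-bedroom difference barely registers next to a few hundred square feet. After rescaling, both features contribute on comparable terms.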

Numerical feature scaling techniques

Scaling numerical features is a crucial preprocessing step in machine learning, and vaex provides several scalers in its vaex.ml module for this purpose, including:

  • StandardScaler: It scales features by subtracting their mean and dividing by their standard deviation. This technique is particularly useful when our data is approximately Gaussian and we want each feature to have a mean of 0 and a variance of 1.

  • MinMaxScaler: It scales features to a range, typically between 0 and 1. It’s ideal for datasets where we want to preserve the relationships between values while ensuring that all features are comparable.

  • RobustScaler: It scales features by removing their median and scaling them according to a given percentile range. This scaler is robust to outliers, making it suitable for datasets with extreme values that could skew other scaling techniques.

  • MaxAbsScaler: It divides each feature by its maximum absolute value, so the scaled values lie between -1 and 1. Because the data is not shifted, zeros stay zero, which makes this scaler useful when we want to preserve the sparsity of our data.
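The four techniques above can be sketched by hand with NumPy. This is an illustration of the textbook formulas on a made-up array, not the vaex implementation, which may differ in details such as how percentiles are estimated.

```python
import numpy as np

# Made-up feature values, with one outlier at the end
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standard scaling: subtract the mean, divide by the standard deviation
standard = (x - x.mean()) / x.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1
minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range
q25, q50, q75 = np.percentile(x, [25, 50, 75])
robust = (x - q50) / (q75 - q25)

# Max-abs scaling: divide by the largest absolute value
maxabs = x / np.abs(x).max()

for name, values in [("standard", standard), ("minmax", minmax),
                     ("robust", robust), ("maxabs", maxabs)]:
    print(name, np.round(values, 3))
```

Note how the outlier squeezes the first four values together under min-max and max-abs scaling, while robust scaling keeps them evenly spread around the median.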

Code implementation

Let’s see how we can use the scalers in vaex.ml to scale numerical features in Python:

import vaex
import vaex.ml

# Load the dataset using vaex
df = vaex.datasets.iris()

# Initialize the scaler
scaler = vaex.ml.StandardScaler(features=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

# Fit and transform the data
df_trans = scaler.fit_transform(df)

# Display the transformed data
print(df_trans)

# Continue from here to use the scaled data in machine learning models
Using StandardScaler from vaex.ml

Note: The provided code implements the StandardScaler from vaex.ml. We can use the same process for other vaex.ml scalers, such as MinMaxScaler, RobustScaler, and MaxAbsScaler, based on the requirements and model specifications of our projects.

Explanation

In the code above:

  • Lines 1–2: We import the vaex and vaex.ml modules. The vaex module provides tools for working efficiently with large datasets, and vaex.ml provides machine learning utilities such as the scalers.

  • Line 5: We load the Iris dataset into a vaex DataFrame named df. The iris() function retrieves the Iris dataset, which contains information about Iris flowers, including the sepal and petal dimensions.

  • Line 8: We create an instance of the StandardScaler class from vaex.ml. We specify the features we want to scale (sepal_length, sepal_width, petal_length, and petal_width) by passing them as a list to the features parameter.

  • Line 11: We fit the scaler to the data and then transform the data using the fitted scaler. The fit_transform() method calculates the necessary scaling parameters based on the provided data and applies the transformation to the dataset.

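  • Line 14: We display the transformed data. The original columns are kept, and the scaled values of the selected features (sepal_length, sepal_width, petal_length, and petal_width) are added as new columns, which by default are named with a standard_scaled_ prefix.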

Hence, our data is scaled and can be used as input for training machine learning models.

Conclusion

Scaling numerical features is an important step in getting good performance from machine learning models. With the scalers in vaex.ml, we can apply this preprocessing in a few lines of code, which helps our models produce more reliable results.

Copyright ©2025 Educative, Inc. All rights reserved