What is the DataFrame.std() function in Polars?

Polars is a fast and efficient data manipulation library written in Rust. It’s designed to provide high-performance operations on large datasets and handles them more quickly than pandas. It’s particularly suitable when working with tabular data.

The dataframe.std() method helps understand and quantify data variability. It supports various advanced statistical and machine-learning tasks, which help data scientists make informed decisions.

The dataframe.std() method

The dataframe.std() function aggregates the columns of a DataFrame to their standard deviationDispersion or variability of a set of values from the mean. value.

Syntax

Here’s the syntax of the DataFrame.std() function:

DataFrame.std(ddof: int = 1)

Parameter:

The ddof parameter refers to the delta degrees of freedom. It denotes the divisor used in the calculation, represented by N - ddof, where N is the number of elements. The default value for ddof is 1.

Note: Setting ddof=0 would calculate the population standard deviation.

Return value:

It returns a new DataFrame containing the standard deviation of the numerical columns.

Calculating standard deviation

Standard deviation is a statistical measure that describes the amount of variation or dispersion in a set of values. The formula to calculate standard deviation is as follows:

We will now see how to calculate the standard deviation.

canvasAnimation-image
1 of 7

Code

Let’s illustrate the usage of the DataFrame.std() function with an example.

import polars as pl
df = pl.DataFrame(
{
"alpha":[1, 2, 3, 4, 5],
"beta": ["a", "b", "c", "d", "e"],
})
deviation = df.std(ddof = 0)
default_deviation = df.std()
print(deviation)
print(default_deviation)

Explanation

We will see the step-by-step explanation of the code above:

  • Lines 2–6: We create a DataFrame df with two columns, alpha and beta. The alpha contains numeric values [1, 2, 3, 4, 5], and beta contains string values ["a", "b", "c", "d", "e"].

  • Line 8: We use the std(ddof=0) function to compute the population standard deviation of the df columns. It provides an exact measure of dispersion without adjusting for sampling bias.

  • Line 9: We use the std() function to compute the sample standard deviation of the df columns. It provides an exact measure of dispersion without adjusting for sampling bias. It accounts for the fact that the sample mean may not perfectly represent the population mean.

  • Lines 11–12: We print both the sample and population standard deviations.

Note: The beta column contains string values; standard deviation is not computable for it using traditional statistical methods. Therefore, the result of the beta column might not be meaningful.

Conclusion

The DataFrame.std() is standard deviation function can be used in various fields to measure and understand data variability. For example:

  • In finance, analysts compute the standard deviation of stock returns to measure a stock’s volatility. A higher standard deviation indicates higher risk.

  • Environmental researchers might use the standard deviation of temperature readings to understand climate change in a specific region over time.

  • In education, teachers might analyze test scores to understand the distribution of student performance and identify any significant variability.

In conclusion, by calculating the standard deviation, one can assess data consistency, identify outliers, compare datasets, and understand the overall spread and risk associated with the data.

Free Resources

Copyright ©2024 Educative, Inc. All rights reserved