How to perform the ANOVA test in Python

When to use ANOVA

ANOVA is useful when we want to compare the means of multiple groups simultaneously. It is frequently utilized in experimental research to assess the impacts of various treatments or interventions on a dependent variable.

Assumptions of ANOVA

Before performing an ANOVA test, it’s essential to ensure that the following assumptions are met:

Independence: Observations (samples) within each group are independent of each other.
Normality: The distribution of data within each group should be normal.
Homogeneity of variance: The variance of the data should be approximately equal across all groups.
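
The normality and equal-variance assumptions above can be checked before running the test. The following is a minimal sketch, assuming the Shapiro-Wilk test (scipy.stats.shapiro) for normality and Levene's test (scipy.stats.levene) for homogeneity of variance. The three groups are hypothetical synthetic data, not part of the Iris example used later.

```python
import numpy as np
from scipy.stats import shapiro, levene

# Hypothetical example data: three groups drawn from normal distributions
rng = np.random.default_rng(seed=0)
groups = {
    "A": rng.normal(loc=5.0, scale=1.0, size=40),
    "B": rng.normal(loc=5.5, scale=1.0, size=40),
    "C": rng.normal(loc=6.0, scale=1.0, size=40),
}

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
normality_pvalues = {name: shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance: Levene's test across all groups
# (null hypothesis: the groups have equal variances)
levene_result = levene(*groups.values())

for name, p in normality_pvalues.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_result.pvalue:.3f}")
```

In both tests, a small p-value (for example, below 0.05) suggests that the corresponding assumption may not hold. Independence cannot be tested this way; it follows from how the data were collected.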

Note: Outliers can violate the assumptions of normality and homoscedasticity (equal variances), potentially leading to inaccurate results. It is recommended to identify outliers and, where justified, remove them from the dataset before conducting the ANOVA test.
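
One common way to flag candidate outliers is the interquartile range (IQR) rule. The sketch below is an illustration only: the helper name iqr_outlier_mask, the multiplier k=1.5, and the sample values are assumptions, not something prescribed by ANOVA itself.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper for illustration."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical measurements for a single group; 9.7 is far from the rest
sample = np.array([4.9, 5.1, 5.0, 5.2, 4.8, 5.3, 9.7])
mask = iqr_outlier_mask(sample)

print("Flagged values:", sample[mask])
print("Remaining values:", sample[~mask])
```

The rule would be applied to each group separately. Flagged values are worth inspecting before deciding whether to drop them.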

Types of ANOVA

There are different types of ANOVA depending on the study design and the number of factors being considered:

One-way ANOVA

A one-way ANOVA test is used to compare the means of three or more groups. It groups the data by a single factor (independent variable) and determines whether there are statistically significant differences among the group means.

Code example

SciPy is a powerful library that provides various tools for scientific computing in Python. Within SciPy, a module called scipy.stats focuses on statistical functions and distributions. Within this module, a function named f_oneway performs one-way ANOVA testing.

Now let’s see a code example of how to perform a one-way ANOVA test in Python.

# Importing necessary libraries
import pandas as pd
from mlxtend.data import iris_data
from scipy.stats import f_oneway
# Loading the Iris dataset
X, y = iris_data()
# Creating a DataFrame
df = pd.DataFrame(X, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
df['Species'] = y
# Performing one-way ANOVA
anova_results = f_oneway(
    df[df['Species'] == 0]['Sepal Length'],
    df[df['Species'] == 1]['Sepal Length'],
    df[df['Species'] == 2]['Sepal Length']
)
print("One-Way ANOVA Results:")
print("F-statistic:", anova_results.statistic)
print("P-value:", anova_results.pvalue)

Code explanation

Line 2: Importing the pandas library as pd, which is used for data manipulation and analysis.
Line 3: Importing the iris_data function from the mlxtend.data module, which provides access to the Iris dataset.
Line 4: Importing the f_oneway function from the scipy.stats module, which performs the one-way ANOVA test.
Line 6: Calling the iris_data() function to load the Iris dataset into the variables X (features) and y (target labels).
Line 8: Creating a DataFrame named df from the feature data X, with columns labeled Sepal Length, Sepal Width, Petal Length, and Petal Width.
Line 9: Adding a new column named Species to the DataFrame df and populating it with the target labels y.
Lines 11–15: Performing a one-way ANOVA test using the f_oneway function. It compares the Sepal Length data among the three species of Iris flowers (setosa, versicolor, and virginica), which are encoded as 0, 1, and 2 in the Species column.
Lines 16–18: Printing the results of the one-way ANOVA test, including the F-statistic and the corresponding p-value.

Terms used in code

We got two values from one-way ANOVA testing: F-statistic and p-value.

Now let’s understand what these values represent:

F-statistic

The F-statistic is also known as the F-ratio. It is the ratio of the variation between the group means to the variation within the groups.

In ANOVA, the F-statistic measures how much the group averages differ from each other compared to how much the observations vary within each group. A bigger F-statistic means the differences between the group averages are large relative to the spread inside the groups. We use the F-statistic to judge whether these differences are real or random: if it exceeds the critical value of the F-distribution for the chosen significance level and degrees of freedom, the differences are unlikely to be due to chance alone.
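
To make the ratio concrete, the sketch below computes the F-statistic by hand as the between-group mean square divided by the within-group mean square, and compares it with the value returned by f_oneway. The three small groups are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical data: three groups of three observations each
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: sums of squares divided by their degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
f_scipy = f_oneway(*groups).statistic

print("Manual F-statistic:", f_manual)
print("SciPy F-statistic:", f_scipy)
```

For this data, the between-group sum of squares is 38 (with 2 degrees of freedom) and the within-group sum of squares is 6 (with 6 degrees of freedom), so the F-statistic is 19 / 1 = 19.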

P-value

The p-value is short for probability value. It is associated with the F-statistic and represents the probability of observing the calculated F-statistic (or a more extreme value) if the null hypothesis is true.

In ANOVA, the null hypothesis states that all group means are equal, that is, there are no differences between the means of the groups.

The p-value tells us whether the observed differences between the group means are statistically significant. A small p-value (less than the chosen significance level, often 0.05) indicates strong evidence against the null hypothesis, suggesting that at least one group mean is significantly different from the others. On the other hand, a high p-value means the data provide too little evidence to reject the null hypothesis; it does not prove that the group means are equal.
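
The link between the F-statistic, the p-value, and the significance level can be sketched with the F-distribution in scipy.stats. The F-statistic and degrees of freedom below are hypothetical inputs (three groups, nine observations in total), and alpha = 0.05 is an assumed significance level.

```python
from scipy.stats import f

# Hypothetical inputs: an F-statistic from a one-way ANOVA with
# 3 groups (df_between = 3 - 1) and 9 observations (df_within = 9 - 3)
f_statistic = 19.0
df_between = 2
df_within = 6
alpha = 0.05  # assumed significance level

# p-value: probability of an F at least this large if the null hypothesis is true
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value: the F-statistic must exceed this to be significant at level alpha
critical_value = f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha

print(f"p-value: {p_value:.4f}")
print(f"Critical value at alpha={alpha}: {critical_value:.2f}")
print("Reject the null hypothesis:", reject_null)
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision, so they always agree.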

Two-way ANOVA

Two-way ANOVA is a statistical test employed to examine the impact of two categorical independent variables (factors) on a continuous dependent variable. A continuous dependent variable is a numerical outcome that can take on any value within a certain range. Two-way ANOVA extends the one-way ANOVA by examining the interaction between the two factors in addition to their main effects.

In a two-way ANOVA, there are two independent variables. Each variable has two or more levels or categories. The dependent variable represents the outcome under measurement or observation and is continuous.

Now let’s see a code example of how to perform a two-way ANOVA test in Python.

# Importing necessary libraries
import pandas as pd
from mlxtend.data import iris_data
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Loading the Iris dataset
X, y = iris_data()
iris_df = pd.DataFrame(X, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
iris_df['species'] = y
# Fitting the ANOVA model
model = ols('sepal_length ~ C(species) + C(petal_length)', data=iris_df).fit()
# Performing ANOVA
anova_results = anova_lm(model)
print(anova_results)

Code explanation

Line 2: Importing the pandas library as pd, which is used for data manipulation and analysis.
Line 3: Importing the iris_data function from the mlxtend.data module, which provides access to the Iris dataset.
Line 4: Importing the ols (ordinary least squares) function from the statsmodels.formula.api module, which is used to fit linear models.
Line 5: Importing the anova_lm function from the statsmodels.stats.anova module, which is used to compute ANOVA tables.
Line 7: Loading the Iris dataset into the variables X and y.
Lines 8–9: Creating a pandas DataFrame named iris_df from the features X, with the column names sepal_length, sepal_width, petal_length, and petal_width, then adding a new column named species and populating it with the target labels y.
Line 11: Defining a linear model with the ols function and fitting it to the data with the fit() method. The model explains sepal_length using species and petal_length, both wrapped in C() so that they are treated as categorical. Note that petal_length is a continuous measurement, so C(petal_length) treats every distinct measured value as its own category. The formula uses +, so it contains only the main effects and no interaction term.
Line 13: Computing the ANOVA table from the fitted model using the anova_lm function.
Line 14: Printing the computed results.
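
The Iris example uses a formula with main effects only, and the Iris dataset has a single naturally categorical column (species). As a complementary sketch, the code below runs a two-way ANOVA with an interaction term on hypothetical synthetic data with two genuinely categorical factors. The column names (fertilizer, watering, height), the effect sizes, and the choice of typ=2 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(seed=42)

# Hypothetical balanced design: 2 fertilizers x 3 watering levels, 10 plants per cell
fertilizer_effect = {"F1": 0.0, "F2": 2.0}
watering_effect = {"low": 0.0, "medium": 1.0, "high": 2.0}

rows = []
for fert, fert_shift in fertilizer_effect.items():
    for water, water_shift in watering_effect.items():
        cell_mean = 10.0 + fert_shift + water_shift
        for height in rng.normal(loc=cell_mean, scale=1.0, size=10):
            rows.append({"fertilizer": fert, "watering": water, "height": height})
data = pd.DataFrame(rows)

# '*' expands to both main effects plus their interaction:
# C(fertilizer) + C(watering) + C(fertilizer):C(watering)
model = ols("height ~ C(fertilizer) * C(watering)", data=data).fit()

# Type II sums of squares (an assumed choice; anova_lm uses typ=1 by default)
anova_table = anova_lm(model, typ=2)
print(anova_table)
```

The resulting table has one row per main effect, one row for the interaction C(fertilizer):C(watering), and a Residual row. Each effect row has its own F-statistic and p-value, which are interpreted in the same way as in the one-way case.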

Conclusion

The ANOVA test compares the means of multiple groups and determines whether there are significant differences between them. By analyzing the F-statistic and p-value obtained from the test, researchers can make informed decisions about whether the observed differences in group means are likely due to real effects or simply random variation. This supports sound statistical inference and provides valuable insights into the relationships between the variables under study.
