How to perform the ANOVA test in Python
ANOVA stands for Analysis of Variance. It is a statistical method used to compare the means of three or more groups to determine if there are any statistically significant differences between them. It tests the null hypothesis that all group means are equal against the alternative hypothesis that at least one group mean differs from the others.
When to use ANOVA
ANOVA is useful when we want to compare the means of multiple groups simultaneously. It is frequently utilized in experimental research to assess the impacts of various treatments or interventions on a dependent variable.
Assumptions of ANOVA
Before performing an ANOVA test, it’s essential to ensure that the following assumptions are met:
Independence: Observations (samples) within each group are independent of each other.
Normality: The distribution of data within each group should be normal.
Homogeneity of variance: The variance of the data should be approximately equal across all groups.
Note: It is recommended to remove outliers from the dataset before conducting the ANOVA test so that the data meets the test’s assumptions. Outliers can violate the assumptions of normality and homogeneity of variance (homoscedasticity), potentially leading to inaccurate results.
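The normality and equal-variance assumptions above can be checked before running the test. The following is a minimal sketch, assuming the Shapiro-Wilk test (scipy.stats.shapiro) for normality and Levene's test (scipy.stats.levene) for homogeneity of variance; these are common choices, not the only ones. The three groups are synthetic data made up for illustration.

```python
import numpy as np
from scipy.stats import shapiro, levene

# Hypothetical data: three groups of 30 observations each, drawn from
# normal distributions with equal spread (made up for illustration).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mean, scale=1.0, size=30) for mean in (5.0, 5.5, 6.0)]

# Normality: run the Shapiro-Wilk test on each group separately.
shapiro_pvalues = [shapiro(g).pvalue for g in groups]

# Homogeneity of variance: run Levene's test across all groups at once.
levene_stat, levene_pvalue = levene(*groups)

for i, p in enumerate(shapiro_pvalues):
    print(f"Group {i}: Shapiro-Wilk p-value = {p:.3f}")
print(f"Levene p-value = {levene_pvalue:.3f}")
```

A p-value above the chosen significance level means the test found no evidence against the assumption. It does not prove that the assumption holds.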
Types of ANOVA
There are different types of ANOVA depending on the study design and the number of factors being considered:
One-way ANOVA
A one-way ANOVA test compares the means of three or more groups that are defined by a single factor (independent variable). It determines whether there are statistically significant differences among the group means.
Code example
SciPy is a powerful library that provides various tools for scientific computing in Python. Within SciPy, a module called scipy.stats focuses on statistical functions and distributions. Within this module, a function named f_oneway performs one-way ANOVA testing.
Now let’s see a code example of how to perform a one-way ANOVA test in Python.
# Importing necessary libraries
import pandas as pd
from mlxtend.data import iris_data
from scipy.stats import f_oneway

# Loading the Iris dataset
X, y = iris_data()

# Creating a DataFrame
df = pd.DataFrame(X, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
df['Species'] = y

# Performing one-way ANOVA
anova_results = f_oneway(
    df[df['Species'] == 0]['Sepal Length'],
    df[df['Species'] == 1]['Sepal Length'],
    df[df['Species'] == 2]['Sepal Length']
)

print("One-Way ANOVA Results:")
print("F-statistic:", anova_results.statistic)
print("P-value:", anova_results.pvalue)
Code explanation
Line 2: Importing the pandas library as pd, which is used for data manipulation and analysis.
Line 3: Importing the iris_data function from the mlxtend.data module, which provides access to the Iris dataset.
Line 4: Importing the f_oneway function from the scipy.stats module, which performs the one-way ANOVA test.
Line 7: Calling the iris_data() function to load the Iris dataset into the variables X and y.
Line 10: Creating a DataFrame named df from the feature data X, with columns labeled Sepal Length, Sepal Width, Petal Length, and Petal Width.
Line 11: Adding a new column named Species to the DataFrame df and populating it with the target labels y.
Lines 14–18: Performing a one-way ANOVA test using the f_oneway function from scipy.stats. It compares the Sepal Length data among the three Species of Iris flowers (setosa, versicolor, and virginica) loaded from the Iris dataset.
Lines 20–22: Printing the results of the one-way ANOVA test, including the F-statistic and the corresponding p-value.
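The example above selects each species by hand. When the number of groups is not fixed in advance, the same call can be built by grouping the DataFrame and unpacking the groups into f_oneway. This is a minimal sketch of that pattern; the small DataFrame and its column names (group, value) are hypothetical and stand in for any labeled dataset.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (values made up for illustration).
df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "value": [5.1, 4.9, 5.0, 5.2, 5.9, 6.1, 6.0, 5.8, 6.5, 6.7, 6.4, 6.6],
})

# Build one array of values per group, however many groups there are.
samples = [g["value"].to_numpy() for _, g in df.groupby("group")]

# Unpack the list so each group becomes a separate argument.
result = f_oneway(*samples)
print("F-statistic:", result.statistic)
print("P-value:", result.pvalue)
```

The order in which groupby yields the groups does not matter, because the F-statistic does not depend on the order of the groups.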
Terms used in code
We got two values from one-way ANOVA testing: F-statistic and p-value.
Now let’s understand what these values represent:
F-statistic
The F-statistic is also known as the F-ratio. It is the ratio of the variation between the group means to the variation within the groups.
In other words, the F-statistic measures how much the group averages differ from each other compared to how much the observations vary within each group. A larger F-statistic means the differences between the group averages are large relative to the spread inside the groups. To judge whether these differences reflect a real effect or random variation, the F-statistic is compared with a critical value from the F-distribution for the chosen significance level and degrees of freedom. If it exceeds that value, the differences are unlikely to be due to chance alone.
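To make the ratio concrete, the F-statistic can be computed by hand as the between-group mean square divided by the within-group mean square, and then compared with the value f_oneway reports. The sketch below uses three small groups with made-up values; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical groups (values made up for illustration).
groups = [
    np.array([5.1, 4.9, 5.0, 5.2]),
    np.array([5.9, 6.1, 6.0, 5.8]),
    np.array([6.5, 6.7, 6.4, 6.6]),
]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation between groups: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Variation within groups: how far each observation is from its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: sums of squares divided by their degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
f_scipy = f_oneway(*groups).statistic

print("Manual F:", f_manual)
print("SciPy F: ", f_scipy)
```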
P-value
The p-value is short for probability value. It is associated with the F-statistic and represents the probability of observing an F-statistic at least as large as the calculated one, assuming the null hypothesis is true.
In ANOVA, the null hypothesis posits that all group means are equal.
The p-value tells us whether the observed differences between the group means are statistically significant. A small p-value (less than the chosen significance level, often 0.05) indicates strong evidence against the null hypothesis, suggesting that at least one group mean differs from the others. A large p-value, on the other hand, means the data do not provide enough evidence to reject the null hypothesis. It does not prove that the group means are equal.
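The link between the F-statistic and the p-value can be sketched as follows: the p-value is the upper-tail probability of the F-distribution at the observed F-statistic, with k - 1 and N - k degrees of freedom for k groups and N observations in total. The data and the significance level of 0.05 below are assumptions chosen for illustration.

```python
from scipy.stats import f, f_oneway

# Hypothetical groups (values made up for illustration).
groups = [
    [5.1, 4.9, 5.0, 5.2],
    [5.9, 6.1, 6.0, 5.8],
    [6.5, 6.7, 6.4, 6.6],
]

result = f_oneway(*groups)

# Degrees of freedom: between groups and within groups.
dfn = len(groups) - 1
dfd = sum(len(g) for g in groups) - len(groups)

# Upper-tail probability of the F-distribution at the observed F-statistic.
p_manual = f.sf(result.statistic, dfn, dfd)

# Decision at an assumed significance level of 0.05.
alpha = 0.05
reject_null = result.pvalue < alpha

print("P-value from f_oneway:", result.pvalue)
print("P-value from F-distribution:", p_manual)
print("Reject the null hypothesis:", reject_null)
```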
Two-way ANOVA
Two-way ANOVA is a statistical test employed to examine the impact of two categorical independent variables (factors) on a continuous dependent variable.
In a two-way ANOVA, there are two independent variables. Each variable has two or more levels or categories. The dependent variable represents the outcome under measurement or observation and is continuous.
Now let’s see a code example of how to perform a two-way ANOVA test in Python.
# Importing necessary libraries
import pandas as pd
from mlxtend.data import iris_data
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Loading the Iris dataset
X, y = iris_data()
iris_df = pd.DataFrame(X, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
iris_df['species'] = y

# Fitting the ANOVA model
model = ols('sepal_length ~ C(species) + C(petal_length)', data=iris_df).fit()

# Performing ANOVA
anova_results = anova_lm(model)

print(anova_results)
Code explanation
Line 2: Importing the pandas library as pd, which is used for data manipulation and analysis.
Line 3: Importing the iris_data function from the mlxtend.data module, which provides access to the Iris dataset.
Line 4: Importing the ols (ordinary least squares) function from the statsmodels.formula.api module, which is used to fit linear models.
Line 5: Importing the anova_lm function from the statsmodels.stats.anova module, which is used to compute ANOVA tables.
Line 8: Loading the Iris dataset into the variables X and y.
Lines 9–10: Creating a pandas DataFrame named iris_df from the features X, with column names specified as sepal_length, sepal_width, petal_length, and petal_width, then adding a new column named species and populating it with the target labels y.
Line 13: Fitting a linear regression model using the ols function. The model predicts sepal_length from species and petal_length, both wrapped in C() so that they are treated as categorical factors. Note that petal_length is a continuous measurement, so C() treats every distinct value as its own level. The formula uses + only, so it contains the two main effects and no interaction term. The fit() method fits the model to the data.
Line 16: Computing the ANOVA table based on the fitted model using the anova_lm function.
Line 18: Printing the computed results.
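The Iris example wraps petal_length, a continuous measurement, in C(). As a complementary sketch, the block below runs a two-way ANOVA on two factors that are categorical by design and includes their interaction through the * operator in the formula. The dataset is synthetic, and the column names (fertilizer, watering, growth) are hypothetical. Passing typ=2 to anova_lm requests Type II sums of squares, which is one common choice rather than the only valid one.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced design: 2 fertilizers x 3 watering levels,
# 5 replicates per cell (names and values made up for illustration).
rng = np.random.default_rng(0)
rows = []
for fert in ["A", "B"]:
    for water in ["low", "medium", "high"]:
        for _ in range(5):
            rows.append({
                "fertilizer": fert,
                "watering": water,
                "growth": rng.normal(loc=10.0, scale=1.0),
            })
data = pd.DataFrame(rows)

# C(a) * C(b) expands to both main effects plus their interaction.
model = ols("growth ~ C(fertilizer) * C(watering)", data=data).fit()

# Type II ANOVA table: one row per effect, plus the residual row.
table = anova_lm(model, typ=2)
print(table)
```

Each row of the table reports the sum of squares, degrees of freedom, F-statistic, and p-value for one effect, so the two factors and their interaction can be assessed separately.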
Conclusion
The ANOVA test compares the means of multiple groups and determines whether there are significant differences between them. By analyzing the F-statistic and p-value obtained from the test, researchers can decide whether the observed differences in group means are likely due to real effects or simply random variation, provided the assumptions of the test are met.