An **ANOVA test**, also known as an **Analysis of Variance**, is used to analyze the relationship between categorical and continuous variables. It investigates whether a quantitative dependent variable changes across the levels of one or more categorical independent variables.

ANOVA’s null hypothesis $H_0$ states that there is no difference in the means of the dependent variable across the groups defined by the independent variable, whereas the alternative hypothesis $H_a$ states that at least one group mean differs.

- **One-way ANOVA test**: This takes one categorical group into consideration.
- **Two-way ANOVA test**: This takes two categorical groups into consideration.

```
aov(Dependent_variable ~ factor(Independent_Variable))
```
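As a minimal sketch of both forms, using `mtcars` (the `cyl` attribute is chosen here purely as an illustrative second factor for the two-way case):

```r
# One-way ANOVA: a single grouping factor
one_way <- aov(mtcars$disp ~ factor(mtcars$gear))

# Two-way ANOVA: two grouping factors (cyl is an assumed, illustrative choice)
two_way <- aov(mtcars$disp ~ factor(mtcars$gear) + factor(mtcars$cyl))
```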

A one-way ANOVA test is performed using the `mtcars` dataset between the `disp` attribute, a continuous attribute, and the `gear` attribute, a categorical attribute.

Note: The `aov()` function used for a one-way ANOVA test is part of the `stats` package that ships with base R, so no extra installation is required.

The `mtcars` data comes from the 1974 *Motor Trend* magazine. The data includes fuel consumption data and aspects of car design for then-current car models.

```
library(dplyr)
boxplot(mtcars$disp ~ factor(mtcars$gear), xlab = "gear", ylab = "disp")
```

The box plot shows the distribution of displacement at each level of gear. Here, the categorical variable is `gear`, on which the `factor()` function is used, and the continuous variable is `disp`.

```
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)
```

The summary shows that the `gear` attribute is highly significant to displacement (the significance stars next to it denote this). In addition, the p-value is less than 0.05, which indicates that `gear` is significantly related to displacement. Therefore, we reject the null hypothesis.
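Rather than reading the significance stars by eye, the p-value can also be extracted from the summary table programmatically. This is a sketch assuming the `mtcars_aov` model fitted above:

```r
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))

# summary() returns a list of ANOVA tables; [[1]] is the table for this model
p_value <- summary(mtcars_aov)[[1]][["Pr(>F)"]][1]

# TRUE for mtcars: gear is significant, so we reject the null hypothesis
p_value < 0.05
```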

The rest of the values in the output table describe the independent variable and the residuals:

- The `Df` column displays the degrees of freedom for the independent variable (the number of levels in the variable minus one) and the degrees of freedom for the residuals (the total number of observations minus the number of levels in the independent variable).
- The `Sum Sq` column displays the sum of squares (also known as the total variation) between the group means and the overall mean.
- The `Mean Sq` column is the mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom for each parameter.
- The `F-value` column is the test statistic from the F test: the mean square of each independent variable divided by the mean square of the residuals. The larger the F-value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
- The `Pr(>F)` column is the p-value of the F-statistic. This shows how likely it is that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true.
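As a sketch of how these columns relate, the one-way table for the `mtcars` model can be reproduced by hand:

```r
y <- mtcars$disp
g <- factor(mtcars$gear)

# Sum Sq between groups: variation of the group means around the grand mean
grand_mean <- mean(y)
ss_between <- sum(tapply(y, g, function(v) length(v) * (mean(v) - grand_mean)^2))

# Sum Sq of the residuals: variation of observations around their group means
ss_resid <- sum(tapply(y, g, function(v) sum((v - mean(v))^2)))

# Df: number of levels minus one, and observations minus number of levels
df_between <- nlevels(g) - 1
df_resid <- length(y) - nlevels(g)

# Mean Sq = Sum Sq / Df; F = Mean Sq (factor) / Mean Sq (residuals)
f_value <- (ss_between / df_between) / (ss_resid / df_resid)

# f_value matches the F value reported by summary(aov(y ~ g))
```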

Copyright ©2024 Educative, Inc. All rights reserved