# Confidence Interval

Learn about a confidence interval for prediction.

## Confidence interval for predicting unknown population mean

Recall from the previous lessons that to get at the unknown population mean, we may employ two methods of statistical inference:

- Null hypothesis testing
- Confidence interval construction.

In this lesson, we focus on the second method.

A **confidence interval** is a range of values, computed from the sample, that is constructed to contain the unknown population mean at a confidence level chosen by the analyst. The convention is to compute the 95% confidence interval: if we drew many random samples and built an interval from each one, about 95 out of 100 of those intervals would contain the population mean. The 95% level follows from the conventional acceptable Type I error rate of 5%, that is, $1 - 0.05 = 0.95$.
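The repeated-sampling reading of "95% confident" can be checked with a small simulation. The sketch below is only an illustration: the true mean, standard deviation, sample size, and number of repetitions are all made-up values, and the data are simulated rather than taken from the lesson.

```r
# Hypothetical population and sampling setup (not from the lesson's data)
set.seed(42)
true_mean <- 2.3
n <- 30
reps <- 1000

# For each repetition: draw a sample, build a 95% interval,
# and record whether the interval contains the true mean
covered <- replicate(reps, {
  y <- rnorm(n, mean = true_mean, sd = 1.5)
  ci <- t.test(y)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Share of intervals that contain the true mean; should be close to 0.95
coverage <- mean(covered)
coverage
```

Any single interval either contains the population mean or does not; the 95% describes how often the procedure succeeds across many samples.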

If we know the population standard deviation, then the 95% confidence interval for the population mean μ is as follows:

$sample\ mean\ \pm \ margin\ of\ error = \ \bar{y}\ \pm \ z_{0.05/2}(\frac{\sigma}{\sqrt{n}})$

In other words, the lower bound and the upper bound of the 95% confidence interval can be explicitly specified as follows:

$\Bigg[\bar{y} - z_{0.025}\bigg(\frac{\sigma}{\sqrt{n}}\bigg), \bar{y} + z_{0.025}\bigg(\frac{\sigma}{\sqrt{n}}\bigg)\Bigg]$

where $z_{0.05/2} = z_{0.025}$ is the critical value from the standard normal distribution that leaves a probability of $0.025$ in each tail, so that the central area is $0.95$; its value is approximately 1.96. The $0.05/2$ splits the 5% Type I error rate equally between the two tails of the bell-shaped standard normal distribution.
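As a rough sketch, the Z interval can be computed by hand in R with `qnorm()`. The sample `y` below is simulated and the value of `sigma` is an assumption made purely for illustration, since the population standard deviation is treated as known in this formula.

```r
# Hypothetical sample: 50 simulated growth rates (not the lesson's data)
set.seed(1)
y <- rnorm(50, mean = 2.3, sd = 1.5)

sigma <- 1.5   # population standard deviation, assumed known for this sketch
n <- length(y)
alpha <- 0.05

# Critical value leaving alpha/2 in the upper tail (about 1.96)
z_crit <- qnorm(1 - alpha / 2)

# Margin of error and the two bounds of the interval
margin <- z_crit * sigma / sqrt(n)
z_interval <- c(lower = mean(y) - margin, upper = mean(y) + margin)
z_interval
```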

As noted earlier, we can't know the population standard deviation without knowing the population mean, so in practice the population standard deviation is almost always unknown. We therefore replace $\sigma$ in the Z interval above with the sample standard deviation $s$, and replace the Z critical value with a critical value from the t distribution with $n-1$ degrees of freedom. The result is a 95% t-based confidence interval:

$\Bigg[\bar{y} - t_{0.025, n-1}\bigg(\frac{s}{\sqrt{n}}\bigg), \bar{y} + t_{0.025, n-1}\bigg(\frac{s}{\sqrt{n}}\bigg)\Bigg]$
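The t-based formula can be sketched in R with `qt()` and `sd()`. The sample `y` here is simulated for illustration and is not the lesson's data.

```r
# Hypothetical sample: 50 simulated growth rates (not the lesson's data)
set.seed(1)
y <- rnorm(50, mean = 2.3, sd = 1.5)

n <- length(y)
alpha <- 0.05

# Critical value from the t distribution with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Estimated standard error of the mean, using the sample standard deviation
se <- sd(y) / sqrt(n)

# Lower and upper bounds of the 95% t-based interval
t_interval <- c(mean(y) - t_crit * se, mean(y) + t_crit * se)
t_interval
```

The t critical value is slightly larger than the Z critical value of about 1.96, so the t interval is a little wider than a Z interval with the same standard error. The gap shrinks as $n$ grows.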

As for estimation in R, it turns out that by default, the `t.test()` function reports the estimated 95% confidence interval, right below the null hypothesis testing result. The R output above shows that based on sample data `pwt7g`, we are about 95% confident that the average economic growth in the population of all countries is between 2.14% and 2.46%.
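The interval can also be read directly from the object that `t.test()` returns. The sketch below uses simulated values in place of `pwt7g`, so its numbers will not match the 2.14% to 2.46% interval reported above.

```r
# Hypothetical sample standing in for the growth variable in pwt7g
set.seed(1)
y <- rnorm(50, mean = 2.3, sd = 1.5)

# One-sample t-test; the result stores the confidence interval
res <- t.test(y, mu = 0)
ci95 <- res$conf.int
ci95

# The confidence level is attached to the interval as an attribute
attr(ci95, "conf.level")

# A different level can be requested with the conf.level argument
ci99 <- t.test(y, mu = 0, conf.level = 0.99)$conf.int
ci99
```

Raising the confidence level widens the interval: being more confident of capturing the population mean requires a broader range.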

A confidence interval is more informative than a null hypothesis test. A hypothesis test relies on a hypothesized value of the population mean, so rejecting the null hypothesis tells us only that the mean is unlikely to equal that value, not where the mean actually is. In contrast, a confidence interval provides a range of plausible values for the population mean at a pre-chosen confidence level.

## Plot mean and 95% confidence interval of growth

In applied research, it has become common to present and communicate statistical findings graphically. In this section, we present our findings on the mean and confidence interval of growth as a plot. We introduce and employ the widely used `ggplot2` package. The `ggplot2` package, created by Hadley Wickham (2009) for R, has a consistent and compact syntax to describe and define statistical graphics. The idea is to build any plot from a few common elements:

- A dataset, which must be a data frame.
- Aesthetic mappings (`aes`) define the roles of the variables in a graph, including xy-position, color, height, size, group, and so on. For example, `aes(x variable, y variable, color=z variable)`.
- Geometric objects (`geom`) are the type of graphics (abline, area, bar, boxplot, errorbar, histogram, line, point, polygon, and so on).

In a `ggplot`, data is summarized or transformed and then mapped onto a specific coordinate system. As an example, look at the histogram of growth in the figure below, based on the following R code:
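A minimal sketch of such a histogram, assuming the `ggplot2` package is installed. The data frame `growth_df` and its column `growth` are hypothetical names filled with simulated values; they stand in for the growth variable in `pwt7g`, and the bin count and labels are arbitrary choices.

```r
library(ggplot2)

# Hypothetical data frame with simulated growth rates (not the lesson's data)
set.seed(1)
growth_df <- data.frame(growth = rnorm(200, mean = 2.3, sd = 1.5))

# Dataset + aesthetic mapping + geometric object
p <- ggplot(growth_df, aes(x = growth)) +
  geom_histogram(bins = 30, fill = "grey70", color = "white") +
  labs(x = "Growth (%)", y = "Count")

print(p)
```

The three elements from the list above appear in order: the data frame is the first argument to `ggplot()`, `aes()` maps the variable to the x-axis, and `geom_histogram()` selects the type of graphic.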
