Case Study: Effective Data Storytelling

Learn about effective data storytelling through case studies.

As we’ve progressed throughout this course, we’ve seen how to work with data in a variety of ways. We’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. We’ve summarized data in spreadsheet form and calculated summary statistics for a variety of different variables. Furthermore, we’ve seen the value of statistical inference as a process to come to conclusions about a population by using sampling. Lastly, we’ve explored how to fit linear regression models and the importance of checking the conditions required, so that all confidence intervals and hypothesis tests have valid interpretations. All throughout, we’ve learned many computational techniques and focused on writing R code that’s reproducible.

We now present another set of case studies, this time on the effective data storytelling done by data journalists around the world. Great data stories don’t mislead the reader. Instead, they use storytelling to help the reader understand the role that data plays in our lives.

Bechdel test for Hollywood gender representation

We recommend reading and analyzing Walt Hickey’s FiveThirtyEight article, “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women.” In this article, Hickey presents a long study of how many movies pass the Bechdel test, an informal test of gender representation in a movie that was created by Alison Bechdel.

While reading the article, think carefully about how Walt Hickey uses data, graphics, and analyses to tell the reader a story. FiveThirtyEight has shared the data and R code used for this article, and the data behind many more of their articles is available on their GitHub page. The fivethirtyeight package makes these datasets easier to access in R.
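As a minimal sketch of how the package can be used, the code below loads fivethirtyeight, lists a few of its bundled datasets, and previews the data behind the Bechdel article. It assumes the package is already installed and that the Bechdel data frame is named bechdel, as it is in the package versions we are aware of.

```r
# Assumes the package is installed, e.g. via install.packages("fivethirtyeight")
library(fivethirtyeight)

# Names of the datasets bundled with the package
available <- data(package = "fivethirtyeight")$results[, "Item"]
head(available)

# Preview the first rows of the data behind the Bechdel article
# (assumption: the data frame is called bechdel)
head(bechdel)
```

Running ?bechdel in the console should open the help file, which describes each variable and links back to the original article.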

US births in 1999

The US_births_1994_2003 data frame included in the fivethirtyeight package provides information about the number of daily births in the United States between 1994 and 2003. For more information on this data frame, check out the help file by running ?US_births_1994_2003 in the console.

It’s always a good idea to preview our data, either by using RStudio’s spreadsheet viewer through the View() function or by using glimpse() from the dplyr package.
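The preview step might look like the following sketch, which assumes that both the dplyr and fivethirtyeight packages are installed. The View() call is commented out because it only works in an interactive session such as RStudio.

```r
library(dplyr)
library(fivethirtyeight)

# Compact preview: one line per variable, showing its type and first values
glimpse(US_births_1994_2003)

# Spreadsheet-style viewer (interactive sessions only)
# View(US_births_1994_2003)
```

glimpse() is handy here because it prints every column with its type, so we can quickly check that the date and count variables were read in as expected.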
