Filter Information from Large Datasets

Big data, or the use of large datasets, is common in many applications. However, data storytelling can become more challenging as the scale of the dataset increases.

Challenges of large datasets

Large datasets can present a few problems for the storytelling process. For example, you may find that:

  • Not all of the data can fit into a single plot, or even into a few plots.

  • Data operations need to be reduced and optimized to handle the size, or the data storyteller needs to avoid compute-intensive operations altogether.

Despite these challenges, getting an accurate picture of the insights in a large dataset remains important for data storytelling.

Strategies for filtering large data

Filtering is a key concept in data storytelling, and it is particularly useful for large datasets. Some strategies for filtering are:

  • Analyzing summary statistics.

  • Using callouts in your plot.

  • Using grouping/indexing operations.

  • Optionally, taking a sample of the dataset, with the caveat that the sample may not be fully representative of the full dataset.

Note: These operations can still be compute-intensive on a large dataset, depending on the size of the data and the packages being used.
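As a rough illustration of two of the strategies above, grouping and sampling, here is a minimal sketch. It assumes the pandas and NumPy packages, and it builds a small synthetic DataFrame as a stand-in for a large dataset; the column names, service labels, and values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; column names and values are assumptions.
rng = np.random.default_rng(seed=0)
n_rows = 10_000
df = pd.DataFrame({
    "Service": rng.choice(["A", "B", "C"], size=n_rows),
    "Year": rng.integers(2015, 2024, size=n_rows),
    "Revenue": rng.uniform(1_000, 50_000, size=n_rows).round(2),
})

# Grouping: collapse many rows into one summary row per service.
revenue_by_service = df.groupby("Service")["Revenue"].agg(["count", "mean", "sum"])

# Sampling: keep a 1% random subset, e.g. for a quick exploratory plot.
# A fixed random_state makes the sample reproducible, but it may still
# not be fully representative of the full dataset.
sample = df.sample(frac=0.01, random_state=0)

print(revenue_by_service)
print(len(sample))
```

The grouped table has one row per service regardless of how many rows the original data holds, which is what makes it easier to plot or call out in a story.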

Example: Analyze summary statistics

Consider an example dataset with 5,000,000 values and several variables that describe hypothetical revenue generated by a company's services.

Let's take a brief look at the dataset's variables:

Revenue (in US dollars): Revenue generated by the service, in US dollars

Service: The service category

Year: The year the service was launched

Cost: The cost of developing and maintaining the service

satisfaction_score: The satisfaction score of customers using the service, with range 1–10

We can use the head() and describe() functions to quickly identify characteristics of the data that could help narrow down the analysis.
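The lesson's dataset is not included here, so the following sketch assumes pandas and generates a small synthetic table with the variables listed above. The column name "Revenue", the service labels, the value ranges, and the reduced row count are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's dataset; all values are hypothetical.
rng = np.random.default_rng(seed=42)
n_rows = 1_000  # kept small here; the lesson's dataset is far larger
df = pd.DataFrame({
    "Revenue": rng.uniform(1_000, 100_000, size=n_rows).round(2),
    "Service": rng.choice(["Consulting", "Hosting", "Support"], size=n_rows),
    "Year": rng.integers(2010, 2024, size=n_rows),
    "Cost": rng.uniform(500, 50_000, size=n_rows).round(2),
    "satisfaction_score": rng.integers(1, 11, size=n_rows),  # 1 to 10 inclusive
})

# head() shows the first five rows, a quick check of columns and types.
print(df.head())

# describe() reports count, mean, std, min, quartiles, and max
# for each numeric column (Service is skipped by default).
summary = df.describe()
print(summary)
```

In a summary like this, the minimum, maximum, and quartiles of each column can suggest where to filter next, for example a particular range of years or services with unusually high cost.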
