Build an End-to-End Data Pipeline for Formula 1 Analysis

In this project, we will build a production-grade end-to-end data pipeline to analyze Formula 1 World Championship data. (Formula 1, also known as F1 or Formula One, is the highest class of single-seater auto racing, sanctioned by the Fédération Internationale de l'Automobile (FIA) and owned by the Formula One Group.) By the end of this project, we will be able to analyze the performance of each driver and constructor in past races.

Requirements

  • Ingest the source data into the BigQuery data warehouse.

  • Use Kimball data modeling techniques to design a data warehouse.

  • Transform the data using SQL and create a destination table that effectively answers at least the following analytical questions (we can also come up with questions of our own):

    • How many points did the Red Bull constructor score in 2019?

    • Which driver scored the most points from 2018 to 2020?

    • At which circuit has driver Lewis Hamilton won the most races?

  • Use Apache Airflow to schedule the data pipeline.

We’ll choose a design that makes the most sense for us; there is no single right or wrong answer.
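To make the Kimball modeling and SQL transform concrete, here is a minimal sketch of a star schema and one analytical query, run against an in-memory SQLite database with made-up sample rows. The table and column names (`fact_results`, `dim_constructors`, `dim_races`) and the sample data are assumptions for illustration, not the dataset's actual headers; in the real pipeline the same query shape would run in BigQuery.

```python
import sqlite3

# Build a tiny star schema: one fact table keyed to two dimension tables.
# All names and rows here are illustrative assumptions, not the real dataset.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE dim_constructors (constructor_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_races (race_id INTEGER PRIMARY KEY, year INTEGER, circuit TEXT);
CREATE TABLE fact_results (
    race_id INTEGER REFERENCES dim_races(race_id),
    constructor_id INTEGER REFERENCES dim_constructors(constructor_id),
    driver TEXT,
    points REAL
);
INSERT INTO dim_constructors VALUES (1, 'Red Bull'), (2, 'Mercedes');
INSERT INTO dim_races VALUES (100, 2019, 'Monza'), (101, 2019, 'Silverstone');
INSERT INTO fact_results VALUES
    (100, 1, 'Max Verstappen', 25), (101, 1, 'Max Verstappen', 18),
    (100, 2, 'Lewis Hamilton', 18), (101, 2, 'Lewis Hamilton', 25);
""")

# "How many points did Red Bull score in 2019?" expressed as a
# fact-to-dimension join, the canonical star-schema query shape.
cur.execute("""
SELECT SUM(f.points)
FROM fact_results f
JOIN dim_constructors c ON c.constructor_id = f.constructor_id
JOIN dim_races r ON r.race_id = f.race_id
WHERE c.name = 'Red Bull' AND r.year = 2019
""")
total = cur.fetchone()[0]
print(total)  # 43.0 on this sample data
```

The other two questions follow the same pattern: group by driver with a year range filter, or group by circuit filtered to a driver's wins.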

Source data

The source data comes from the Kaggle Formula 1 World Championship (1950–2023) dataset. For this project, we will use CSV files (they are already uploaded to the environment):

  • drivers.csv: Information about F1 drivers

  • constructors.csv: Information about F1 constructors

  • races.csv: Information about races in F1

  • circuitid.csv: Information about circuits where F1 races are held

  • results.csv: Information about results of F1 races
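One lightweight way to ingest these files is the `bq` command-line tool. The sketch below generates a `bq load` command per CSV; the dataset name `f1_raw` and the file-to-table mapping are assumptions (in particular, mapping circuitid.csv to a `circuits` table is a guess), and in practice we might declare explicit schemas instead of relying on `--autodetect`.

```python
# Sketch: build `bq load` commands for each source CSV.
# Dataset name "f1_raw" and the file-to-table mapping are assumptions.
FILES = {
    "drivers.csv": "drivers",
    "constructors.csv": "constructors",
    "races.csv": "races",
    "circuitid.csv": "circuits",
    "results.csv": "results",
}

def load_command(csv_file: str, table: str, dataset: str = "f1_raw") -> str:
    """Build a bq load command that autodetects the schema and skips the CSV header row."""
    return (
        "bq load --source_format=CSV --autodetect --skip_leading_rows=1 "
        f"{dataset}.{table} {csv_file}"
    )

for csv_file, table in FILES.items():
    print(load_command(csv_file, table))
```

These commands could be wrapped in Airflow tasks (for example with a Bash-based operator) so the ingestion step is scheduled alongside the SQL transforms.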