Build an End-to-End Data Pipeline for Formula 1 Analysis
In this project, we will build a production-grade end-to-end data pipeline to analyze Formula 1 race data. The pipeline will:
Ingest the source data into the BigQuery data warehouse.
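A minimal sketch of the ingestion step, using the google-cloud-bigquery Python client. The project and dataset names (`my-project.f1_raw`) are placeholders, and running this for real requires the `google-cloud-bigquery` package plus GCP credentials; the import sits inside the function so the sketch can be read and loaded without them.

```python
def load_csv_to_bq(csv_path, table_id):
    """Load one CSV file into a BigQuery table, auto-detecting the schema.

    `table_id` is a fully qualified name such as "my-project.f1_raw.drivers"
    (the project and dataset names here are assumptions, not part of the brief).
    Requires the google-cloud-bigquery package and GCP credentials.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        autodetect=True,       # infer column names and types from the file
    )
    with open(csv_path, "rb") as f:
        job = client.load_table_from_file(f, table_id, job_config=job_config)
    job.result()  # block until the load job finishes

# Example call (not executed here):
# load_csv_to_bq("drivers.csv", "my-project.f1_raw.drivers")
```

Schema auto-detection is convenient for a first load; for production tables we would likely pin explicit schemas instead.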
Use Kimball dimensional modeling techniques to design the data warehouse schema.
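One possible Kimball-style star schema for this data: a fact table of race results surrounded by driver, constructor, circuit, and race dimensions. The table and column names below are illustrative assumptions (BigQuery would be the real target; SQLite is used here only so the DDL can be exercised locally).

```python
import sqlite3

# Kimball-style star schema sketch: one fact table keyed to four dimensions.
# All names are illustrative, not prescribed by the project brief.
ddl = """
CREATE TABLE dim_driver      (driver_key      INTEGER PRIMARY KEY, forename TEXT, surname TEXT, nationality TEXT);
CREATE TABLE dim_constructor (constructor_key INTEGER PRIMARY KEY, name TEXT, nationality TEXT);
CREATE TABLE dim_circuit     (circuit_key     INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_race        (race_key        INTEGER PRIMARY KEY,
                              circuit_key     INTEGER REFERENCES dim_circuit,
                              year INTEGER, round INTEGER, name TEXT);
CREATE TABLE fact_results (
    result_key      INTEGER PRIMARY KEY,
    race_key        INTEGER REFERENCES dim_race,
    driver_key      INTEGER REFERENCES dim_driver,
    constructor_key INTEGER REFERENCES dim_constructor,
    grid            INTEGER,
    position        INTEGER,
    points          REAL
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
# → ['dim_circuit', 'dim_constructor', 'dim_driver', 'dim_race', 'fact_results']
```

Keeping `year` on `dim_race` makes the example questions below (points per year, wins per circuit) simple joins against the fact table.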
Transform the data using SQL and create a destination table that can answer analytical questions such as the following (we can also come up with our own):
How many points did the Red Bull constructor score in 2019?
Which driver scored the most points from 2018 to 2020?
At which circuit has Lewis Hamilton won the most races?
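As a sketch of the transformation step, here is SQL answering the first question above. The rows are tiny made-up samples and the query runs against SQLite for demonstration; in the real pipeline the same join would run in BigQuery against the full `results`, `races`, and `constructors` tables (column names follow the Kaggle CSVs).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE races        (raceId INTEGER, year INTEGER, name TEXT);
CREATE TABLE constructors (constructorId INTEGER, name TEXT);
CREATE TABLE results      (resultId INTEGER, raceId INTEGER,
                           constructorId INTEGER, points REAL);
-- Tiny invented sample: two 2019 races and one 2018 race.
INSERT INTO races VALUES (1, 2019, 'Australian Grand Prix'),
                         (2, 2019, 'Bahrain Grand Prix'),
                         (3, 2018, 'Abu Dhabi Grand Prix');
INSERT INTO constructors VALUES (9, 'Red Bull'), (131, 'Mercedes');
INSERT INTO results VALUES (1, 1, 9, 15), (2, 2, 9, 10),
                           (3, 3, 9, 25), (4, 1, 131, 25);
""")

# Total points scored by Red Bull in 2019.
query = """
SELECT c.name, SUM(r.points) AS total_points
FROM results r
JOIN races        ra ON ra.raceId        = r.raceId
JOIN constructors c  ON c.constructorId  = r.constructorId
WHERE c.name = 'Red Bull' AND ra.year = 2019
GROUP BY c.name
"""
row = conn.execute(query).fetchone()
print(row)  # → ('Red Bull', 25.0)
```

The other two questions follow the same pattern: swap the filter to a driver name and aggregate by year range or by circuit.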
Use Apache Airflow to schedule the data pipeline.
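The scheduling step could be expressed as a DAG file like the sketch below, assuming Airflow 2.4+ (earlier versions use `schedule_interval` instead of `schedule`). The DAG id, schedule, and task bodies are all placeholder assumptions; this is a pipeline-definition fragment rather than runnable logic.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_to_bigquery():
    """Placeholder: load the five CSV files into BigQuery staging tables."""

def transform_with_sql():
    """Placeholder: run the SQL that builds the destination table."""

with DAG(
    dag_id="f1_pipeline",          # assumed name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",             # assumed cadence; adjust as needed
    catchup=False,                 # don't backfill past runs
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_to_bigquery)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_sql)
    ingest >> transform            # transform runs only after ingestion succeeds
```

The `ingest >> transform` dependency mirrors the order of the steps above: the warehouse must be loaded before the SQL transformations run.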
We’ll choose the design that makes the most sense for us; there is no single right answer.
The source data comes from the Kaggle Formula 1 World Championship (1950–2023) dataset. For this project, we will use the following CSV files, which are already uploaded to the environment:
drivers.csv: Information about F1 drivers
constructors.csv: Information about F1 constructors
races.csv: Information about races in F1
circuitid.csv: Information about circuits where F1 races are held
results.csv: Information about results of F1 races
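Before ingesting, it is worth sanity-checking that each file's header carries the columns the transformations rely on. The expected column names below are assumptions based on the Kaggle dataset and should be verified against the actual files; an inline sample stands in for `results.csv` here so the check can run anywhere.

```python
import csv
import io

# Required columns per file (assumed from the Kaggle dataset; verify
# against the actual CSVs in the environment before relying on them).
expected = {
    "drivers.csv": {"driverId", "forename", "surname"},
    "constructors.csv": {"constructorId", "name"},
    "results.csv": {"resultId", "raceId", "driverId", "constructorId", "points"},
}

def check_header(csv_text, required):
    """Return True if the CSV header row contains all required columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return required.issubset(header)

# Inline sample standing in for the real results.csv:
sample = "resultId,raceId,driverId,constructorId,grid,position,points\n1,18,1,1,1,1,10\n"
print(check_header(sample, expected["results.csv"]))  # → True
```

In the real pipeline the same check would read the first line of each uploaded file and fail fast before any BigQuery load job is submitted.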