Batch Ingestion
Explore batch ingestion, a core data engineering concept that involves extracting and loading data in bulk at scheduled intervals. Understand time-based and size-based ingestion methods, the differences between full snapshots and incremental loads, and how to implement a batch ingestion pipeline in BigQuery using Python. This lesson equips you to design reliable ingestion processes that support downstream analytics while using resources efficiently and maintaining data quality.
Data ingestion is the first stage in most data architecture designs. The process has two steps: first, data is consumed from assorted sources; second, it is loaded into centralized storage where the organization can access and use it. Ingestion is a critical component of the data engineering lifecycle because downstream systems rely entirely on the ingestion layer's output.
The ingestion layer works with various data sources, over which data engineers typically don't have full control. A good practice is to build a layer of data quality checks and a self-healing system that reacts to unexpected situations such as data loss, corruption, and system failures. Let's explore a traditional but widely used design pattern, batch ingestion, with a real-life example using BigQuery.
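As a minimal sketch of such a quality gate, the check below rejects a batch file that is empty or missing expected columns before anything is loaded; the file path, column names, and thresholds are illustrative placeholders, not part of any particular pipeline.

```python
# Minimal sketch of a pre-load data quality gate (all names are illustrative).
import csv

def validate_batch(path, required_columns, min_rows=1):
    """Reject a batch file that is empty or missing expected columns."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = set(required_columns) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Missing columns: {sorted(missing)}")
        row_count = sum(1 for _ in reader)
    if row_count < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {row_count}")
    return row_count

# Fail fast before loading, so a bad batch never reaches the warehouse:
# validate_batch("daily_transactions.csv", ["order_id", "amount", "created_at"])
```

Checks like this sit between extraction and loading; when one fails, the pipeline can alert, retry, or quarantine the batch instead of silently propagating bad data downstream.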
Batch ingestion is a commonly used way to ingest data. It processes data in bulk: a subset of data from the source system is extracted and loaded into internal data storage based on either a time interval or the size of the accumulated data.
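Ahead of the detailed walkthrough, here is a minimal sketch of the "load" half of one batch using BigQuery's Python client. The project, dataset, table, and bucket names are placeholders, and it assumes the google-cloud-bigquery package is installed and default credentials are configured.

```python
# Minimal sketch: load one accumulated batch file into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Placeholder URI and table ID; in practice these come from the batch schedule.
load_job = client.load_table_from_uri(
    "gs://example-bucket/batches/transactions_2024-01-01.csv",
    "my_project.my_dataset.transactions",
    job_config=job_config,
)
load_job.result()  # block until the job finishes; raises on failure
print(f"Loaded {load_job.output_rows} rows")
```

Note the write disposition: WRITE_APPEND suits incremental loads that add new records to an existing table, while WRITE_TRUNCATE replaces the table contents and suits full snapshots, a distinction this lesson returns to below.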
Time-based vs. size-based batch ingestion
Time-based batch ingestion typically processes data at a fixed interval (e.g., once a day) to provide periodic reporting. It is often used in traditional business ETL or ELT for data warehousing, such as getting daily transactions ...