High-Level Design of Scalable Data Infrastructure for AI/ML
Understand the high-level architecture by exploring the five core layers: ingestion, storage, processing, feature store, and serving. Learn how to design APIs for data management and define storage schemas that separate offline training data from online inference features.
The previous lesson covered the fundamentals of a scalable data platform, including its requirements, resource estimations, and key building blocks. Now, let’s explore the high-level design of the system.
High-level design
The high-level design of the platform is structured around five core layers that work together to create a cohesive system. This design is built to meet our functional and non-functional requirements by establishing a clear data flow from ingestion to serving.
The workflow of the high-level design is as follows:
Data ingestion: Data originates from diverse sources such as logs, transactional databases (via CDC), IoT devices, and third-party SaaS platforms. These sources push data through API connectors for synchronous loads or publish events to message queues for real-time streams. The ingestion layer is split into specialized stream and batch ingestion components: stream ingestion handles real-time events continuously, while batch ingestion manages scheduled bulk loads. Change data capture (CDC) identifies and streams data changes (inserts, updates, and deletes) from transactional databases in real time, acting as a powerful data integration pattern that syncs operational data to destinations such as data warehouses or event streams (Kafka) without impacting source performance. (A minimal stream-ingestion sketch follows this workflow.)

Raw data storage layer: The ingested data lands in the data storage layer, specifically in the raw data lake. This zone typically uses scalable object (blob) storage, such as S3, to handle vast amounts of unstructured data. It employs a “schema-on-read” approach, preserving the original fidelity of the data for compliance and potential reprocessing. (A raw-zone write sketch follows this workflow.)
Data processing: The data processing layer pulls raw data for transformation. ETL pipelines, managed by workflow orchestration tools, execute the required transformation steps. An ETL (Extract, Transform, Load) pipeline is an automated data processing system that moves data from its source, cleans and reformats it, and loads it into a target destination such as a data warehouse for analysis. Workflow orchestration is the automated coordination of these interconnected tasks and systems, handling dependencies, data flow, triggers, and error recovery so that complex, multi-step processes run reliably from start to finish. (An orchestration sketch follows this workflow.)
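To make the stream-ingestion path concrete, here is a minimal sketch of a producer publishing a CDC-style change event to a Kafka topic. The broker address, topic name (orders.cdc), key format, and event fields are illustrative assumptions, and the snippet uses the kafka-python client rather than any particular CDC tool.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client

# Hypothetical broker address for illustration.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A CDC-style change event: the operation type plus the new row image.
change_event = {
    "op": "update",  # insert | update | delete
    "table": "orders",
    "ts": datetime.now(timezone.utc).isoformat(),
    "after": {"order_id": 1042, "status": "shipped", "amount": 59.90},
}

# Keying by primary key keeps all changes for a row in the same partition,
# preserving per-row ordering for downstream consumers.
producer.send("orders.cdc", key="orders:1042", value=change_event)
producer.flush()
```

In practice, a dedicated CDC tool (for example, Debezium) would read the database log and publish such events automatically; the sketch only shows the shape of the event and the keying choice.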
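The raw-zone write can be sketched as landing the ingested payload in object storage unchanged, with a date-partitioned key so later batch jobs can scan it efficiently. The bucket name and key layout below are assumptions for illustration; the payload is stored as-is to preserve fidelity under the schema-on-read approach.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

raw_event = {"event_type": "page_view", "user_id": 7, "url": "/pricing"}

# Date-partitioned key layout (dt=YYYY-MM-DD) is a common lake convention.
now = datetime.now(timezone.utc)
key = f"raw/clickstream/dt={now:%Y-%m-%d}/{uuid.uuid4()}.json"

s3.put_object(
    Bucket="ml-platform-raw-zone",  # hypothetical bucket name
    Key=key,
    Body=json.dumps(raw_event).encode("utf-8"),
)
```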
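Finally, a minimal sketch of the processing step: an ETL pipeline expressed as an orchestrated workflow. The example assumes Apache Airflow 2.x; the DAG id, schedule, and task bodies are placeholders, and the point is only the dependency chain that the orchestrator manages (scheduling, ordering, and retries on failure).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Read raw objects from the lake's raw zone (placeholder).
    print("extracting raw data")


def transform():
    # Clean, validate, and reshape the raw records (placeholder).
    print("transforming data")


def load():
    # Write curated output to the warehouse or feature tables (placeholder).
    print("loading curated data")


with DAG(
    dag_id="raw_to_curated_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The orchestrator enforces this ordering across the three tasks.
    t_extract >> t_transform >> t_load
```

The same pattern scales to real pipelines by swapping the placeholder callables for Spark jobs, SQL transformations, or feature-engineering steps while keeping the dependency graph in the orchestrator.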