High-Level Design of Scalable Data Infrastructure for AI/ML
Understand the high-level architecture by exploring the five core layers: ingestion, storage, processing, feature store, and serving. Learn how to design APIs for data management and define storage schemas that separate offline training data from online inference features.
Building on the requirements and estimations from the previous lesson, we now focus on the system’s high-level design.
High-level design
The platform consists of five core layers that establish a clear data flow from ingestion to serving. This structure ensures the system meets both functional and non-functional requirements.
The high-level workflow operates as follows:
Data ingestion: Data originates from diverse sources such as logs, transactional databases (via CDC), IoT devices, and third-party SaaS platforms. These sources push data through API connectors for synchronous loads or publish events to message queues for real-time streams. The ingestion layer is split into specialized stream and batch ingestion components: stream ingestion handles real-time events continuously, while batch ingestion manages scheduled bulk loads.

CDC stands for change data capture. It identifies and streams data changes (inserts, updates, and deletes) from transactional databases in real time, acting as a powerful data integration pattern that syncs operational data to destinations such as data warehouses or event streams (Kafka) without impacting source performance.

Raw data storage layer: Ingested data lands in the raw data lake, typically built on scalable object storage like S3. This layer employs a “schema-on-read” approach to preserve original fidelity for compliance and reprocessing.
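To make the ingestion-to-raw-storage flow concrete, here is a minimal sketch in Python. It is illustrative only: the event fields, the `land_raw_event` helper, and the local `raw/` directory (standing in for an object store bucket) are assumptions, not part of the reference design. The key idea it demonstrates is schema-on-read: each event is persisted verbatim to a date-partitioned path, with no validation or transformation on write.

```python
import json
import pathlib
from datetime import datetime, timezone

# Stands in for an object store location such as s3://raw (assumption for this sketch).
RAW_ROOT = pathlib.Path("raw")

def land_raw_event(event: dict) -> pathlib.Path:
    """Persist one ingested event verbatim (schema-on-read: no validation or transform)."""
    ts = datetime.now(timezone.utc)
    # Date-partitioned layout, mirroring common data-lake conventions (Hive-style partitions).
    partition = RAW_ROOT / event["source"] / f"dt={ts:%Y-%m-%d}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"{ts:%H%M%S%f}.json"
    path.write_text(json.dumps(event))
    return path

# A hypothetical CDC-style event: an update captured from a transactional database,
# carrying the row state before and after the change.
cdc_event = {
    "source": "orders_db",
    "op": "update",  # insert / update / delete
    "before": {"order_id": 42, "status": "pending"},
    "after": {"order_id": 42, "status": "shipped"},
}
landed = land_raw_event(cdc_event)
```

Because the raw layer stores events untouched, downstream consumers can reprocess them later under a new schema without re-ingesting from the source systems.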
Data ...