High-Level Design of Scalable Data Infrastructure for AI/ML
Understand the high-level architecture by exploring the five core layers: ingestion, storage, processing, feature store, and serving. Learn how to design APIs for data management and define storage schemas that separate offline training data from online inference features.
The previous lesson covered the fundamentals of a scalable data platform, including its requirements, resource estimations, and key building blocks. Now, let’s explore the high-level design of the system.
High-level design
The high-level design of the platform is structured around five core layers that work together to create a cohesive system. This design is built to meet our functional and non-functional requirements by establishing a clear data flow from ingestion to serving.
The workflow of the high-level design is as follows:
Data ingestion: Data originates from diverse sources such as logs, transactional databases (via CDC), IoT devices, and third-party SaaS platforms. These sources push data through API connectors for synchronous loads or publish events to message queues for real-time streams. The ingestion layer is split into specialized stream and batch ingestion components: stream ingestion handles real-time events continuously, while batch ingestion manages scheduled bulk loads. Change data capture (CDC) identifies and streams data changes (inserts, updates, and deletes) from transactional databases in real time, acting as a powerful data integration pattern that syncs operational data to destinations such as data warehouses or event streams (Kafka) without impacting source performance. (A minimal stream-ingestion sketch follows this workflow.)

Raw data storage layer: The ingested data lands in the data storage layer, specifically in the raw data lake. This zone typically uses scalable object (blob) storage, such as S3, to handle vast amounts of unstructured data. It employs a “schema-on-read” approach, preserving the original fidelity of the data for compliance and potential reprocessing. (A raw-zone write sketch follows this workflow.)
Data processing: The data processing layer pulls raw data for transformation. ETL pipelines, managed by workflow orchestration tools, execute the required transformation steps. An ETL (Extract, Transform, Load) pipeline is an automated data processing system that moves data from its source, cleans and reformats it, and loads it into a target destination such as a data warehouse for analysis. Workflow orchestration is the automated coordination of these interconnected tasks and systems, handling dependencies, data flow, triggers, and error recovery so that complex, multi-step processes run reliably from start to finish. (An orchestration sketch follows this workflow.)
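To make the stream-ingestion path concrete, here is a minimal sketch of a producer publishing a CDC-style change event to a Kafka topic. The broker address, topic name (orders.cdc), key format, and event fields are illustrative assumptions, and the snippet uses the kafka-python client rather than any particular CDC tool.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python client

# Hypothetical broker address for illustration.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A CDC-style change event: the operation type plus the new row image.
change_event = {
    "op": "update",  # insert | update | delete
    "table": "orders",
    "ts": datetime.now(timezone.utc).isoformat(),
    "after": {"order_id": 1042, "status": "shipped", "amount": 59.90},
}

# Keying by primary key keeps all changes for a row in the same partition,
# preserving per-row ordering for downstream consumers.
producer.send("orders.cdc", key="orders:1042", value=change_event)
producer.flush()
```

In practice, a dedicated CDC tool (for example, Debezium) would read the database log and publish such events automatically; the sketch only shows the shape of the event and the keying choice.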
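The raw-zone write can be sketched as landing the ingested payload in object storage unchanged, with a date-partitioned key so later batch jobs can scan it efficiently. The bucket name and key layout below are assumptions for illustration; the payload is stored as-is to preserve fidelity under the schema-on-read approach.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

raw_event = {"event_type": "page_view", "user_id": 7, "url": "/pricing"}

# Date-partitioned key layout (dt=YYYY-MM-DD) is a common lake convention.
now = datetime.now(timezone.utc)
key = f"raw/clickstream/dt={now:%Y-%m-%d}/{uuid.uuid4()}.json"

s3.put_object(
    Bucket="ml-platform-raw-zone",  # hypothetical bucket name
    Key=key,
    Body=json.dumps(raw_event).encode("utf-8"),
)
```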
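Finally, a minimal sketch of the processing step: an ETL pipeline expressed as an orchestrated workflow. The example assumes Apache Airflow 2.x; the DAG id, schedule, and task bodies are placeholders, and the point is only the dependency chain that the orchestrator manages (scheduling, ordering, and retries on failure).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Read raw objects from the lake's raw zone (placeholder).
    print("extracting raw data")


def transform():
    # Clean, validate, and reshape the raw records (placeholder).
    print("transforming data")


def load():
    # Write curated output to the warehouse or feature tables (placeholder).
    print("loading curated data")


with DAG(
    dag_id="raw_to_curated_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The orchestrator enforces this ordering across the three tasks.
    t_extract >> t_transform >> t_load
```

The same pattern scales to real pipelines by swapping the placeholder callables for Spark jobs, SQL transformations, or feature-engineering steps while keeping the dependency graph in the orchestrator.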