Design a Data Infrastructure System - 15 Minute Sprint
Explore how to design a scalable data infrastructure system that supports the full machine learning lifecycle. Understand key layers including ingestion, storage, processing, feature store, and model serving. Learn to address challenges like training-serving skew and ensure reliability, scalability, and low latency for real-time inference and batch processing.
The rapid adoption of ML and AI has made a scalable, reliable, and efficient data infrastructure essential. Rather than building a single application, this involves architecting a foundational system that supports the entire ML lifecycle, enabling large-scale data ingestion, transformation, storage, feature management, and low-latency serving, so diverse ML workloads can be trained, deployed, and served effectively.
At a high level, the data pipeline for AI and ML systems moves data through ingestion, storage, processing, feature management, and model-serving stages.
The challenges of traditional architectures
General-purpose data architectures were not designed with ML workloads in mind, which leads to several recurring problems:
Training-serving skew: Discrepancies between the data used for training and the data used for live predictions can degrade model performance (see the sketch after this list).
Lack of feature reusability: Without a centralized feature platform, teams often recreate the same features, wasting effort and creating inconsistencies.
Reproducibility issues: Mutable or overwritten data makes it hard to retrain models or reproduce experiments. Strict data and code versioning is essential.
Scalability bottlenecks: Traditional systems struggle with computationally heavy transformations and aggregations as data volumes grow.
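To make the skew problem concrete, here is a minimal Python sketch of one common mitigation: defining each feature transformation once and calling the same function from both the offline training pipeline and the online serving path. All names here (RawEvent, order_value_feature, handle_request) are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class RawEvent:
    price_cents: int
    quantity: int

def order_value_feature(event: RawEvent) -> float:
    """Single source of truth for the feature: used by BOTH the
    offline training pipeline and the online serving path, so the
    two code paths cannot drift apart."""
    return (event.price_cents / 100.0) * event.quantity

# Offline: build a training dataset with the shared transform.
training_rows = [order_value_feature(e) for e in [RawEvent(1999, 2), RawEvent(499, 5)]]

# Online: compute the identical feature at request time.
def handle_request(event: RawEvent) -> float:
    return order_value_feature(event)  # no reimplementation, no skew

print(training_rows, handle_request(RawEvent(1999, 2)))
```

Reimplementing the same logic twice (say, in SQL for training and in application code for serving) is exactly how skew creeps in; a shared library or feature store removes that duplication.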
Technical Quiz
What is training-serving skew?
The difference in time it takes to train a model vs. serving a prediction
Using different server hardware (e.g., CPUs vs. GPUs) for training and serving
Discrepancies between the data or features used for training and for live predictions
When a model is trained on significantly more data than it sees at serving time
Requirements
To design a scalable data infrastructure, let’s scope the problem to the following functional and non-functional requirements.
Functional requirements
Data collection: The system must be able to collect data from multiple sources, including databases, real-time event streams, third-party APIs, and logs.
Data processing and transformation: Raw data is rarely usable for machine learning. The platform needs to process and transform this data into model-ready formats. This includes cleaning (finding and fixing incorrect, incomplete, corrupted, or irrelevant data in a dataset to improve its quality), normalization (rescaling numerical features to a common range, preventing dominance by large values), and feature engineering (deriving new model inputs from raw data); a short sketch of these steps follows this list.
Batch and real-time handling: The platform must handle both batch data for model training and real-time data streams for live predictions.
Data storage: It needs to store raw, processed, and feature data efficiently and durably. Different storage solutions are required for various types of data.
Serving data to ML/AI models: The system must deliver high-quality, versioned features and datasets to ML models for both training and inference with low latency (a toy serving sketch also follows this list).
Data monitoring: The platform should monitor data quality, track data lineage, and record interactions between models and data.
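As a rough illustration of the processing and transformation requirement, the sketch below cleans a small dataset, min-max normalizes one column, and engineers a derived feature using pandas. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical raw events; in practice these would arrive from the
# ingestion layer (databases, streams, APIs, logs).
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, None],
    "session_seconds": [30, 1200, 1200, -5, 90],
    "pages_viewed": [2, 40, 40, 1, 3],
})

# Cleaning: drop rows with missing ids, duplicates, and impossible values.
clean = (
    raw.dropna(subset=["user_id"])
       .drop_duplicates()
       .query("session_seconds >= 0")
)

# Normalization: rescale session length to [0, 1] so large values
# don't dominate distance-based models.
lo, hi = clean["session_seconds"].min(), clean["session_seconds"].max()
clean["session_norm"] = (clean["session_seconds"] - lo) / (hi - lo)

# Feature engineering: derive a new model input from raw columns.
clean["pages_per_minute"] = clean["pages_viewed"] / (clean["session_seconds"] / 60)

print(clean)
```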
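For the serving requirement, here is a deliberately simplified in-memory "online feature store". The class and method names are invented for illustration; a production system would back this with a low-latency key-value store (for example, Redis) rather than a Python dict.

```python
import time

class OnlineFeatureStore:
    """Toy online store: feature values keyed by (entity id, feature name).
    Real systems back this with a low-latency key-value store."""

    def __init__(self) -> None:
        self._features: dict[tuple[str, str], float] = {}

    def put(self, entity_id: str, feature: str, value: float) -> None:
        self._features[(entity_id, feature)] = value

    def get_online_features(self, entity_id: str, names: list[str]) -> dict[str, float]:
        # A single in-memory lookup per feature keeps serving latency low.
        return {n: self._features[(entity_id, n)] for n in names}

store = OnlineFeatureStore()
store.put("user_42", "session_norm", 0.8)
store.put("user_42", "pages_per_minute", 4.0)

start = time.perf_counter()
features = store.get_online_features("user_42", ["session_norm", "pages_per_minute"])
elapsed_us = (time.perf_counter() - start) * 1e6
print(features, f"lookup took ~{elapsed_us:.0f} µs")
```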
Non-functional requirements
Reliability: The system should maintain continuous, accurate data processing even in the event of failures.
Security and privacy: The system should ensure that data is protected from unauthorized access, breaches, and misuse through encryption, authentication, and access controls.
Scalability: The platform should be scalable to handle increasing volumes of data, ...