Data Privacy, Compliance & ML
Explore essential data privacy and compliance principles in machine learning system design. Learn to navigate GDPR and other regulations, implement machine unlearning methods, apply differential privacy and federated learning, and handle PII securely. This lesson equips you to design ML systems that respect privacy constraints and meet regulatory demands effectively.
When a user in Berlin submits a deletion request to your recommendation system, that request doesn’t just wipe a row from a database. It ripples through feature stores, embedding tables, and trained model parameters. If your system wasn’t designed for this from the start, you’re facing a full model retrain that could take days of GPU time and still leave you non-compliant. In MAANG ML system design interviews, privacy has shifted from a footnote to an architectural constraint that shapes every component, from data ingestion to model serving.
This lesson covers five pillars that interviewers expect you to reason about fluently. First, the regulatory landscape that defines what your system must support. Second, machine unlearning techniques that let models “forget” specific users. Third, differential privacy mechanisms that bound individual influence during training. Fourth, federated learning architectures that keep raw data on-device. Fifth, PII handling patterns that protect sensitive information in features and embeddings. Each pillar introduces its own trade-offs: privacy versus utility, compute cost versus compliance speed, and architectural complexity versus model quality.
The regulatory landscape
Three regulations appear repeatedly in ML system design discussions, and each imposes distinct architectural constraints on how you collect, store, and train on user data.
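GDPR (General Data Protection Regulation, EU) is the most impactful for ML systems. Article 17 establishes the right to erasure (the “right to be forgotten”), which lets a user demand deletion of their personal data. Whether that obligation extends to the data’s influence on an already trained model is not spelled out in the text and is still debated, so the conservative design assumption is that you must be able to remove it. Article 5(1)(c) requires data minimization, limiting the features you collect to what is necessary for a stated purpose. Consent is one of several lawful bases for processing under Article 6; where you rely on it, it must be an affirmative opt-in, which constrains how training data is gathered.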
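The minimization and consent constraints above can be enforced at ingestion time. The following is a minimal sketch, assuming consent is the lawful basis you rely on; `ALLOWED_FEATURES`, `ConsentRecord`, and `minimize_event` are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist: only features with a documented purpose for this model.
ALLOWED_FEATURES = {"item_id", "event_type", "timestamp"}


@dataclass
class ConsentRecord:
    """Purposes a user has opted in to (illustrative structure)."""
    user_id: str
    purposes: set = field(default_factory=set)


def minimize_event(event: dict, consent: ConsentRecord, purpose: str):
    """Return a minimized copy of the event, or None if consent is missing."""
    if purpose not in consent.purposes:
        return None  # no opt-in for this purpose: the event never enters training data
    # Keep only allowlisted fields; everything else is dropped at the door.
    kept = {k: v for k, v in event.items() if k in ALLOWED_FEATURES}
    kept["user_id"] = event["user_id"]  # join key; a real pipeline would pseudonymize it
    return kept
```

Filtering before the event reaches the feature store means fields such as raw location never need to be deleted later, because they were never stored.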
CCPA (California Consumer Privacy Act) grants users the right to opt out of the sale or sharing of their personal information, so training pipelines that consume data covered by an opt-out need to honor it. It also includes a right to deletion similar to GDPR’s, though its exceptions and enforcement mechanisms differ.
DMA (Digital Markets Act, EU) targets gatekeeper platforms such as Apple, Google, and Meta. It restricts cross-service data combination, directly affecting how recommendation and ads models aggregate features across products. A gatekeeper cannot freely merge a user’s search history with their messaging behavior to build richer embeddings without that user’s explicit consent.
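The cross-service restriction can be expressed as a gate in front of feature joins. This is a sketch under assumptions: `combine_features`, `home_service`, and `consented_services` are hypothetical names, and a production system would read consent from a dedicated consent service rather than a function argument.

```python
def combine_features(home_service, features_by_service, consented_services):
    """Merge per-service features, skipping services the user has not agreed to combine.

    home_service: the service the model is serving (its own data is always usable here).
    features_by_service: e.g. {"search": {...}, "messaging": {...}}.
    consented_services: other services the user allowed to be combined with it.
    """
    combined = {}
    for service, feats in features_by_service.items():
        if service != home_service and service not in consented_services:
            continue  # no consent for cross-service combination: leave these features out
        for name, value in feats.items():
            combined[f"{service}.{name}"] = value  # namespace by source service
    return combined
```

Namespacing each feature by its source service keeps the provenance visible downstream, which helps when consent is later withdrawn and those columns need to be dropped.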
The critical insight for interviews is that a deletion request doesn’t end at the database layer. If a user’s behavioral data influenced gradient updates during training, the model itself retains a trace of that user. This motivates the need for machine unlearning, which we cover next.
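To make the fan-out concrete, here is a minimal sketch of a deletion handler. It assumes in-memory dictionaries stand in for the feature store and embedding table, and that you record which users contributed to each model version; `handle_deletion`, `DeletionReport`, and `training_lineage` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class DeletionReport:
    """Audit record of what a single deletion request touched."""
    user_id: str
    purged_stores: list = field(default_factory=list)
    models_needing_unlearning: list = field(default_factory=list)


def handle_deletion(user_id, stores, training_lineage):
    """Purge a user from every store and flag models trained on their data.

    stores: {store_name: dict keyed by user_id}
    training_lineage: {model_version: set of user_ids used in training}
    """
    report = DeletionReport(user_id)
    for name, store in stores.items():
        if user_id in store:
            del store[user_id]  # storage-layer deletion
            report.purged_stores.append(name)
    for version, users in training_lineage.items():
        if user_id in users:
            # The parameters may still carry this user's influence.
            report.models_needing_unlearning.append(version)
    return report
```

The report doubles as an audit trail: purging the stores handles the database layer, while the flagged model versions are the ones that still need unlearning or retraining.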
The following table summarizes the key provisions and their ML system impact.
Comparison of Data Privacy Regulations and Their Impact on ML Systems
Regulation | Jurisdiction | Right to Deletion | Data Minimization Requirement | Consent Model | Key ML System Impact |
GDPR | European Union (EU) | Yes, via Article 17 | Yes, via Article 5(1)(c) | Affirmative opt-in where consent is the lawful basis | Deletion must reach training data; plan for model unlearning and audit trails |
CCPA | California, USA | Yes, with exceptions | Limited | Opt-out of sale or sharing | Must honor deletion and opt-out in training pipelines |
DMA | EU gatekeeper platforms | Not in the DMA itself; GDPR applies alongside it | Implicit via cross-service restrictions | Consent required to combine data across services | Cannot combine cross-service user embeddings without consent |
Machine unlearning
Machine unlearning is the process of removing a specific data point's influence from a trained model without performing a full retrain from scratch. When a deletion request arrives, simply removing the user’s rows from your training database is insufficient because the model’s parameters still ...