Mobile A/B Testing Infrastructure
Explore the design of reliable mobile A/B testing infrastructure to ensure consistent user assignment, persistent event logging, and data integrity despite mobile interruptions. Understand deterministic bucketing, offline telemetry persistence, event deduplication, and statistical validation methods to maintain experiment accuracy and robustness.
Mobile experimentation failures are often invisible but impactful. A user may be assigned to a variant and interact with it, but if the app is terminated before events are recorded, that data is lost, silently biasing experiment results. Unlike web systems, mobile environments introduce interruptions like process kills and unreliable connectivity that make data collection inherently fragile.
At its core, mobile A/B testing is a reliability problem. Systems must ensure consistent user assignment, durable event capture, and data integrity despite device-level disruptions. Without this, experiments can lead to incorrect conclusions.
This lesson explores the architecture behind reliable mobile experimentation, including deterministic bucketing, resilient telemetry pipelines, and server-side safeguards for statistical validity.
Assignment service and metrics pipeline
The system splits into two tiers. A server-side assignment service resolves which experiments and variants apply to a given user or device. A client-side SDK caches those assignments and enforces them locally during feature evaluation.
The assignment service carries several responsibilities:
Experiment definitions: Each experiment record includes an experiment ID, variant weights, targeting rules, and start/end dates.
Variant resolution: When a client requests assignments, the service evaluates all active experiments against the user’s attributes and returns a resolved assignment payload.
Versioned API surface: The service exposes a lightweight REST or gRPC endpoint that delivers the full assignment manifest: a JSON or binary payload containing every active experiment and the user's resolved variant for each, cached locally on the device.
The client must cache this entire manifest on-device rather than making per-feature network calls. This eliminates latency during feature evaluation, guarantees offline availability, and ensures consistency within a single session.
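As an illustration, the cached manifest and the SDK's local lookup might look like the following sketch. The field names (`manifest_version`, `assignments`, `experiment_active`) are hypothetical, not a specification; the point is that every variant lookup is answered from the local copy with no network call:

```python
import json

# Hypothetical resolved-assignment manifest as the server might return it.
# Field names are illustrative only.
manifest_json = """
{
  "manifest_version": 42,
  "assignments": {
    "checkout_redesign": {"variant": "treatment", "experiment_active": true},
    "new_onboarding":    {"variant": "control",   "experiment_active": true}
  }
}
"""

class AssignmentCache:
    """Answers variant lookups from the cached manifest with no network calls."""

    def __init__(self, raw: str):
        self._assignments = json.loads(raw)["assignments"]

    def variant(self, experiment_id: str, default: str = "control") -> str:
        entry = self._assignments.get(experiment_id)
        # Unknown or inactive experiments fall back to a safe default.
        return entry["variant"] if entry and entry["experiment_active"] else default

cache = AssignmentCache(manifest_json)
print(cache.variant("checkout_redesign"))   # treatment
print(cache.variant("unknown_experiment"))  # control (safe fallback)
```

Note the fallback behavior: when a client asks about an experiment it has no entry for, the SDK returns the default variant rather than failing, which is what keeps offline evaluation safe.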
The metrics pipeline forms the return path. The client SDK collects experiment events locally, buffers them on disk, and uploads them in compressed batches to a server-side ingestion endpoint, which feeds an analytics warehouse. A critical nuance from industry practice is that the assignment service must maintain experiment data for all active client versions. Outdated clients requesting resolved experiments is a primary failure mode: dropping old experiment definitions causes undefined behavior on those devices.
The following diagram illustrates how these components connect and where failures occur.
With the high-level architecture established, the next question is how the client determines which variant a user belongs to without relying on the network at evaluation time.
Deterministic bucketing and consistency
Reliable experimentation requires that a user remain in the same variant throughout the lifecycle of an experiment, regardless of app restarts or reinstalls. Mobile systems achieve this through deterministic hashing rather than persisting random assignments in local storage, which would be lost if the user clears app data.
Stable hashing logic: The system computes a hash of a composite key, typically the user_id concatenated with the experiment_id, using non-cryptographic algorithms such as MurmurHash3 (a very fast non-cryptographic hash that turns arbitrary data into a short fixed-size value) or FNV-1a (Fowler–Noll–Vo version 1a, a simple, efficient non-cryptographic hash widely used in hash tables and checksums for its high performance and low collision rate). These are chosen for their uniform distribution and minimal CPU overhead.
Bucket normalization: The resulting hash is mapped (e.g., via a modulo operation) to a fixed range. This value represents the user's bucket, which remains constant for that specific user-experiment pair.
Cross-experiment independence: Using the experiment_id as a salt (a unique value added to the primary input, the user_id, before it is passed through the hash function) ensures that a user's assignment in one experiment is statistically independent of their assignment in another. This prevents user clustering, where the same group of users is inadvertently exposed to the same combinations of treatments across unrelated tests.
State-free persistence: Since the same inputs always produce the same hash, the assignment can be recalculated instantly on the client or server without requiring a central database lookup or network round-trip.
Attention: If variant weights change mid-experiment (say, from 50/50 to 70/30), some users will be reassigned to a different variant. The architectural decision is either to lock weights for the experiment’s lifetime or to accept reassignment for a subset of users and exclude them from analysis.
The following pseudocode demonstrates the bucketing function.
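A sketch of that bucketing function in Python. MD5 from the standard library stands in for MurmurHash3 or FNV-1a (which are not in the standard library); the structure and the determinism guarantee are the same. The bucket count and variant weights are illustrative:

```python
import hashlib

BUCKET_COUNT = 10_000  # granularity of the bucket space (illustrative)

def bucket(user_id: str, experiment_id: str) -> int:
    """Deterministically map (user_id, experiment_id) to [0, BUCKET_COUNT)."""
    # experiment_id acts as the salt: the same user lands in statistically
    # independent buckets for different experiments.
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()  # stand-in for MurmurHash3 / FNV-1a
    return int.from_bytes(digest[:8], "big") % BUCKET_COUNT

def assign_variant(user_id: str, experiment_id: str,
                   weights: dict[str, float]) -> str:
    """Map the user's bucket onto cumulative variant weight ranges.

    Note: changing the weights moves these range boundaries, which is
    exactly why mid-experiment weight changes reassign some users.
    """
    b = bucket(user_id, experiment_id)
    threshold = 0.0
    for variant, weight in weights.items():
        threshold += weight * BUCKET_COUNT
        if b < threshold:
            return variant
    return list(weights)[-1]  # guard against floating-point edge cases

# Same inputs always yield the same assignment: no storage required.
v1 = assign_variant("user-123", "checkout_redesign", {"control": 0.5, "treatment": 0.5})
v2 = assign_variant("user-123", "checkout_redesign", {"control": 0.5, "treatment": 0.5})
assert v1 == v2
```

Because the function is pure, both the client SDK and the server can compute it independently and always agree, which is what makes the cached manifest and the server-side analysis consistent with each other.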
With consistent bucketing in place, the next challenge is ensuring that the events generated by those bucketed users actually survive the journey from device to server.
Reliable telemetry and offline persistence
Mobile devices lose connectivity unpredictably, and the OS can terminate background processes at any time. Any event not persisted to disk before these interruptions is permanently lost.
Every experiment event, whether an exposure, interaction, or conversion, is first written to a durable on-disk store before any network transmission is attempted. This is the write-ahead persistence pattern that the rest of the pipeline depends on.
A background scheduler periodically reads unsent events from the disk store, packages them into compressed batches, and attempts to upload to the ingestion endpoint. On success, events are marked as sent or deleted from the queue. On failure, they remain queued for retry with exponential backoff.
Unbounded event accumulation during extended offline periods can exhaust device storage. The system enforces a maximum storage cap, typically around 5 MB, with an eviction policy that prioritizes retention of exposure events, since a missing exposure biases the experiment denominator more severely than a missing interaction.
Because writes happen synchronously to disk before returning control to the calling code, even an immediate OS kill after the write call preserves the event. The event exists on disk and will be picked up by the upload scheduler on the next app launch. This is the critical property that prevents the scenario described at the start of this lesson.
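The write-ahead flow described above can be sketched with SQLite, which commits synchronously to disk. This is an illustrative sketch, not a production SDK; real implementations vary (flat files, Room, Core Data), but the invariant is the same: the event is durable before `record` returns, and the "sent" flag only flips after server acknowledgment:

```python
import json
import sqlite3

class EventStore:
    """Write-ahead event store: events hit disk before any upload attempt."""

    def __init__(self, path: str = "events.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events "
            "(event_id TEXT PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)"
        )

    def record(self, event_id: str, payload: dict) -> None:
        # Synchronous commit: once this returns, an OS kill cannot lose the event.
        self.db.execute(
            "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps(payload)),
        )
        self.db.commit()

    def next_batch(self, limit: int = 50) -> list:
        # Unsent events for the background upload scheduler.
        return self.db.execute(
            "SELECT event_id, payload FROM events WHERE sent = 0 LIMIT ?", (limit,)
        ).fetchall()

    def mark_sent(self, event_ids: list) -> None:
        # Only flip the flag after the server acknowledges the batch; a crash
        # before this point causes a resend, which dedup handles server-side.
        self.db.executemany(
            "UPDATE events SET sent = 1 WHERE event_id = ?",
            [(e,) for e in event_ids],
        )
        self.db.commit()
```

The `INSERT OR IGNORE` on the primary key also means a double `record` call for the same event ID is harmless on the client side.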
Practical tip: Batch sizes of 20–50 events with gzip compression strike a good balance between network efficiency and upload latency. Smaller batches reduce the window of data at risk between flushes.
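As a rough illustration of why batching with compression pays off, a batch of JSON events with their highly repetitive field names compresses well under gzip (the event shape here is hypothetical):

```python
import gzip
import json
import uuid

# A hypothetical batch of 50 exposure events, as the SDK might upload them.
events = [
    {
        "event_id": str(uuid.uuid4()),
        "type": "exposure",
        "experiment_id": "checkout_redesign",
        "variant": "treatment",
    }
    for _ in range(50)
]

raw = json.dumps(events).encode("utf-8")
compressed = gzip.compress(raw)

# Repeated JSON keys and values compress well even though UUIDs do not.
assert len(compressed) < len(raw)
```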
Once events survive the device and reach the server, a new problem emerges: the same event arriving more than once.
Data ingestion and event deduplication
Duplicates arise from two common scenarios. A network timeout causes the client to retry a batch that the server already received and processed. Or the app restarts and resends events that were written to disk but whose “sent” flag was not yet committed.
Each event is stamped with a globally unique event_id, a UUID v4 generated at event creation time on the client. The ingestion service maintains an index of recently seen event IDs within a sliding time window; any incoming event whose event_id already appears in the window is silently discarded.
The window size represents a trade-off. A larger window catches more duplicates but consumes more memory. A 72-hour TTL covers the vast majority of retry scenarios, including weekend offline periods, without high cost.
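A minimal sketch of that sliding-window deduplication. In production this index would live in a shared store such as Redis with native TTL support; here a plain dictionary keeps the logic visible:

```python
import time

class DedupWindow:
    """Sliding-window deduplication keyed by event_id with a TTL."""

    def __init__(self, ttl_seconds: float = 72 * 3600):  # 72-hour window
        self.ttl = ttl_seconds
        self._seen = {}  # event_id -> first-seen timestamp

    def is_duplicate(self, event_id: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        # Evict entries older than the TTL before checking.
        expired = [eid for eid, ts in self._seen.items() if now - ts > self.ttl]
        for eid in expired:
            del self._seen[eid]
        if event_id in self._seen:
            return True  # retry or resend: drop it
        self._seen[event_id] = now
        return False  # first delivery: accept it

window = DedupWindow()
assert window.is_duplicate("evt-1", now=0.0) is False       # first delivery
assert window.is_duplicate("evt-1", now=60.0) is True       # network retry: dropped
assert window.is_duplicate("evt-1", now=80 * 3600.0) is False  # past TTL: treated as new
```

The last assertion shows the trade-off in action: a resend arriving after the window expires slips through, which is why the TTL is sized to cover realistic retry horizons.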
Achieving true exactly-once semantics on mobile is impractical. The realistic goal is at-least-once delivery with server-side deduplication to achieve effectively-once processing. This distinction matters because it shapes how you design both the client retry logic and the server ingestion layer.
Note: Use server-received timestamps for event ordering rather than client-generated timestamps, because device clocks drift and can be manually altered. Retain client timestamps as metadata for latency analysis, but never use them as the source of truth for sequencing.
The following table summarizes the failure modes discussed so far and their mitigations.
| Failure mode | Root cause | Mitigation strategy | Impact if unmitigated |
| --- | --- | --- | --- |
| Lost exposure events | Process kill before flush | Write-ahead disk persistence | Inflated conversion rates due to missing denominators |
| Duplicate conversion events | Network retry after timeout | Server-side UUID deduplication | Inflated treatment effect, false positive results |
| Inconsistent variant assignment | Random assignment without deterministic hash | Deterministic hashing with (user_id + experiment_id) | Flickering UI, polluted experiment data |
| Stale experiment manifest | Client not refreshing assignments | TTL-based cache invalidation with background refresh | Users stuck in expired experiments |
| Version fragmentation | Assignment service dropping old experiment data | Maintaining experiment data for all active client versions | Undefined behavior on outdated clients |
With the pipeline hardened against data loss and duplication, the final step is verifying that the data actually supports valid statistical conclusions.
Preserving statistical validity
Every infrastructure decision discussed so far serves one ultimate purpose: producing statistically valid experiment results.
The primary diagnostic tool is the Sample Ratio Mismatch (SRM) check: the observed user counts per variant are compared against the expected split, typically with a chi-squared test. A statistically significant deviation signals that users or events were lost somewhere in the pipeline, invalidating downstream metrics.
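A minimal SRM check for a two-variant experiment can be sketched as follows. The chi-squared statistic is compared against the 0.05 critical value for one degree of freedom (about 3.841); experiments with more variants would use the corresponding critical value:

```python
def srm_check(observed: dict, expected_ratios: dict,
              critical_value: float = 3.841) -> bool:
    """Return True if a Sample Ratio Mismatch is detected.

    critical_value defaults to the chi-squared 0.05 threshold for one
    degree of freedom (i.e., two variants).
    """
    total = sum(observed.values())
    chi_sq = sum(
        (observed[v] - total * expected_ratios[v]) ** 2
        / (total * expected_ratios[v])
        for v in observed
    )
    return chi_sq > critical_value

# A healthy 50/50 split: small deviation, no SRM.
assert srm_check({"control": 5_020, "treatment": 4_980},
                 {"control": 0.5, "treatment": 0.5}) is False

# Treatment is missing ~12% of expected users: SRM fires, results are suspect.
assert srm_check({"control": 5_000, "treatment": 4_400},
                 {"control": 0.5, "treatment": 0.5}) is True
```

The second case mirrors the quiz scenario later in this lesson: a large, one-sided shortfall in one variant is a pipeline red flag, not a treatment effect.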
Exposure logging deserves special attention. A user should only count in the experiment denominator if an exposure event was actually recorded, confirming that the client evaluated the experiment toggle and rendered the variant. This intent-to-treat vs. as-treated distinction prevents dilution bias, where users who never actually saw the variant are counted against it.
Practical tip: Monitor guardrail metrics like crash rates and latency per variant in real time. A variant that increases crashes will corrupt a long-running experiment and harm users before statistical significance is reached.
Test Your Knowledge!
A mobile A/B test shows a statistically significant 8% improvement in conversion rate for the treatment group. However, the Sample Ratio Mismatch check reveals that the treatment group has 12% fewer users than expected. What is the most likely conclusion?
The treatment genuinely improves conversion and should be shipped.
The result is likely invalid because missing exposure events in the treatment group inflated the observed conversion rate.
The bucketing algorithm has a non-uniform hash distribution that should be corrected, but the conversion result is still valid.
The server-side deduplication window is too large, causing valid events to be dropped.
Understanding SRM checks completes the picture of how infrastructure protects experimental conclusions. The next section ties all the layers together.
End-to-end resilience in practice
The full lifecycle works as follows. The assignment service resolves variants and serves a versioned manifest. The client SDK caches the manifest and evaluates toggles locally using deterministic hashing. Every exposure and conversion event is written to disk before network transmission. Batched uploads with retry and exponential backoff ensure at-least-once delivery. Server-side UUID-based deduplication turns that into effectively-once processing. And SRM checks validate pipeline integrity after the experiment concludes.
Each layer compensates for a specific mobile failure mode. Write-ahead persistence handles process termination. Batched retry handles connectivity loss. Versioned manifests handle client version skew. Server timestamps handle clock drift. No single component is trusted to be perfectly reliable. This is defense in depth applied to experimentation.
As experiment volume scales, additional concerns emerge, such as mutual exclusion between experiments, interaction effects, and progressive rollout integration. But the foundational infrastructure covered in this lesson remains the bedrock for all of them.