
Mobile A/B Testing Infrastructure

Explore the design of reliable mobile A/B testing infrastructure to ensure consistent user assignment, persistent event logging, and data integrity despite mobile interruptions. Understand deterministic bucketing, offline telemetry persistence, event deduplication, and statistical validation methods to maintain experiment accuracy and robustness.

Mobile experimentation failures are often invisible but impactful. A user may be assigned to a variant and interact with it, but if the app is terminated before events are recorded, that data is lost, silently biasing experiment results. Unlike web systems, mobile environments introduce interruptions like process kills and unreliable connectivity that make data collection inherently fragile.

At its core, mobile A/B testing is a reliability problem. Systems must ensure consistent user assignment, durable event capture, and data integrity despite device-level disruptions. Without this, experiments can lead to incorrect conclusions.

This lesson explores the architecture behind reliable mobile experimentation, including deterministic bucketing, resilient telemetry pipelines, and server-side safeguards for statistical validity.

Assignment service and metrics pipeline

The system splits into two tiers. A server-side assignment service resolves which experiments and variants apply to a given user or device. A client-side SDK caches those assignments and enforces them locally during feature evaluation.

The assignment service carries several responsibilities:

  • Experiment definitions: Each experiment record includes an experiment ID, variant weights, targeting rules, and start/end dates.

  • Variant resolution: When a client requests assignments, the service evaluates all active experiments against the user’s attributes and returns a resolved assignment payload.

  • Versioned API surface: The service exposes a lightweight REST or gRPC endpoint that delivers the full assignment manifest, a JSON or binary payload containing every active experiment and the user's resolved variant for each, cached locally on the device.

The client must cache this entire manifest on-device rather than making per-feature network calls. This eliminates latency during feature evaluation, guarantees offline availability, and ensures consistency within a single session.
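
As a concrete illustration, the cached manifest can be modeled as a simple Codable structure. The type and field names below are assumptions for this sketch, not a prescribed schema.

Swift
import Foundation

// Illustrative shape of a cached assignment manifest (field names are assumptions).
struct ExperimentAssignment: Codable {
    let experimentId: String
    let variant: String    // the user's resolved variant
    let expiresAt: Date?   // supports TTL-based cache invalidation
}

struct AssignmentManifest: Codable {
    let manifestVersion: Int
    let fetchedAt: Date
    let assignments: [ExperimentAssignment]
}

// Persist the manifest atomically so feature evaluation never needs the network.
func cacheManifest(_ manifest: AssignmentManifest, to url: URL) throws {
    let data = try JSONEncoder().encode(manifest)
    try data.write(to: url, options: .atomic)
}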

The metrics pipeline forms the return path. The client SDK collects experiment events locally, buffers them on disk, and uploads them in compressed batches to a server-side ingestion endpoint, which feeds an analytics warehouse. A critical nuance from industry practice is that the assignment service must maintain experiment data for all active client versions. Outdated clients requesting resolved experiments are a primary failure mode: dropping old experiment definitions causes undefined behavior on those devices, leading to feature fragmentation, a state where different app versions exhibit inconsistent feature behavior because experiment configurations are no longer available for older clients.

The following diagram illustrates how these components connect and where failures occur.

Diagram: Mobile A/B test lifecycle from variant resolution through event ingestion, with failure points marked.

With the high-level architecture established, the next question is how the client determines which variant a user belongs to without relying on the network at evaluation time.

Deterministic bucketing and consistency

Reliable experimentation requires that a user remain in the same variant throughout the lifecycle of an experiment, regardless of app restarts or reinstalls. Mobile systems achieve this through deterministic hashing rather than persisting random assignments in local storage, which would be lost if the user clears app data.

  • Stable hashing logic: The system computes a hash of a composite key, typically the user_id concatenated with the experiment_id, using a fast non-cryptographic algorithm such as MurmurHash3 or FNV-1a (Fowler–Noll–Vo version 1a). These algorithms are chosen for their uniform output distribution and minimal CPU overhead.

  • Bucket normalization: The resulting hash is mapped to a fixed range, for example by taking it modulo a bucket count or by normalizing it to [0, 1) as in the code below. This value represents the user's bucket, which remains constant for that specific user-experiment pair.

  • Cross-experiment independence: Using the experiment_id as a salt, a value combined with the user_id before hashing, ensures that a user's assignment in one experiment is statistically independent of their assignment in another. This prevents user clustering, where the same group of users is inadvertently exposed to the same combinations of treatments across unrelated tests.

  • State-free persistence: Since the same inputs always produce the same hash, the assignment can be recalculated instantly on the client or server without requiring a central database lookup or network round-trip.

Attention: If variant weights change mid-experiment (say, from 50/50 to 70/30), some users will be reassigned to a different variant. The architectural decision is either to lock weights for the experiment’s lifetime or to accept reassignment for a subset of users and exclude them from analysis.

The following pseudocode demonstrates the bucketing function.

Swift
import Foundation

// MurmurHash3 (32-bit) implementation for uniform, deterministic hashing.
// Chosen because it produces well-distributed output with no cryptographic overhead,
// ensuring consistent bucket assignment across platforms and runs.
func murmurHash3(_ key: String, seed: UInt32 = 0) -> UInt32 {
    let data = Array(key.utf8)
    let length = data.count
    var h1: UInt32 = seed
    let c1: UInt32 = 0xcc9e2d51
    let c2: UInt32 = 0x1b873593

    // Process 4-byte blocks
    let blockCount = length / 4
    for i in 0..<blockCount {
        var k1 = UInt32(data[i * 4]) |
            (UInt32(data[i * 4 + 1]) << 8) |
            (UInt32(data[i * 4 + 2]) << 16) |
            (UInt32(data[i * 4 + 3]) << 24)
        k1 &*= c1
        k1 = (k1 << 15) | (k1 >> 17) // rotl32
        k1 &*= c2
        h1 ^= k1
        h1 = (h1 << 13) | (h1 >> 19) // rotl32
        h1 = h1 &* 5 &+ 0xe6546b64
    }

    // Handle remaining bytes (tail)
    var tail: UInt32 = 0
    let tailStart = blockCount * 4
    switch length & 3 {
    case 3:
        tail ^= UInt32(data[tailStart + 2]) << 16
        fallthrough
    case 2:
        tail ^= UInt32(data[tailStart + 1]) << 8
        fallthrough
    case 1:
        tail ^= UInt32(data[tailStart])
        tail &*= c1
        tail = (tail << 15) | (tail >> 17)
        tail &*= c2
        h1 ^= tail
    default:
        break
    }

    // Finalization mix: forces avalanche of all bits
    h1 ^= UInt32(length)
    h1 ^= h1 >> 16
    h1 &*= 0x85ebca6b
    h1 ^= h1 >> 13
    h1 &*= 0xc2b2ae35
    h1 ^= h1 >> 16
    return h1
}

/// Deterministically assigns a user to a variant based on a stable hash of their identity.
/// - Parameters:
///   - userId: Unique identifier for the user.
///   - experimentId: Identifier for the experiment; scopes the hash to avoid cross-experiment correlation.
///   - variants: Ordered list of (name, weight) pairs; weights need not sum to 1 but must be positive.
/// - Returns: The name of the resolved variant.
func resolveVariant(
    userId: String,
    experimentId: String,
    variants: [(name: String, weight: Double)]
) -> String {
    precondition(!variants.isEmpty, "Variants array must not be empty")
    // Combine userId and experimentId so the same user gets different buckets per experiment
    let key = "\(userId):\(experimentId)"
    // Compute a 32-bit MurmurHash3 of the composite key
    let hashValue = murmurHash3(key)
    // Normalize to [0, 1) by dividing by UInt32.max + 1 (the full hash space size).
    // This maps the discrete hash space uniformly onto a continuous probability range,
    // preserving the distribution properties of MurmurHash3.
    let normalized = Double(hashValue) / (Double(UInt32.max) + 1.0)
    // Compute total weight to support unnormalized weight inputs
    let totalWeight = variants.reduce(0.0) { $0 + $1.weight }
    // Iterate in stable order: iteration order must be deterministic so that
    // the same normalized value always resolves to the same variant regardless of call site.
    var cumulative = 0.0
    for variant in variants {
        cumulative += variant.weight / totalWeight
        if normalized < cumulative {
            return variant.name
        }
    }
    // Fall back to the last variant to handle floating-point edge cases near 1.0
    return variants.last!.name
}

With consistent bucketing in place, the next challenge is ensuring that the events generated by those bucketed users actually survive the journey from device to server.

Reliable telemetry and offline persistence

Mobile devices lose connectivity unpredictably, and the OS can terminate background processes at any time. Any event not persisted to disk before these interruptions is permanently lost.

Every experiment event, whether an exposure, interaction, or conversion, is first written to a durable on-disk store before any network transmission is attempted. This is the write-ahead pattern, a technique borrowed from database systems where data is persisted to stable storage before being acknowledged, ensuring durability even if the process crashes immediately after the write. The on-disk store is typically a SQLite database or an append-only file.
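
A minimal sketch of this pattern using an append-only, newline-delimited JSON file follows. The EventStore type and its fields are illustrative assumptions; a production SDK would more likely use SQLite.

Swift
import Foundation

// Event record stamped with a client-generated UUID for later deduplication.
struct ExperimentEvent: Codable {
    let eventId: UUID
    let experimentId: String
    let kind: String           // "exposure", "interaction", or "conversion"
    let clientTimestamp: Date  // metadata only; the server clock orders events
}

// Illustrative write-ahead store: events hit stable storage before any upload.
final class EventStore {
    private let handle: FileHandle

    init(fileURL: URL) throws {
        if !FileManager.default.fileExists(atPath: fileURL.path) {
            FileManager.default.createFile(atPath: fileURL.path, contents: nil)
        }
        handle = try FileHandle(forWritingTo: fileURL)
        _ = try handle.seekToEnd()
    }

    // Returns only after the event is flushed to disk, so an OS kill
    // immediately afterward cannot lose it.
    func append(_ event: ExperimentEvent) throws {
        var line = try JSONEncoder().encode(event)
        line.append(0x0A) // newline-delimited JSON
        try handle.write(contentsOf: line)
        try handle.synchronize()
    }
}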

A background scheduler periodically reads unsent events from the disk store, packages them into compressed batches, and attempts to upload to the ingestion endpoint. On success, events are marked as sent or deleted from the queue. On failure, they remain queued for retry with exponential backoff.
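
The retry delay is typically computed with exponential backoff plus jitter, as sketched below; the base delay, cap, and jitter range are illustrative values, not prescribed ones.

Swift
import Foundation

// Illustrative exponential backoff with jitter for batch upload retries.
func retryDelay(attempt: Int, base: TimeInterval = 2.0, cap: TimeInterval = 300.0) -> TimeInterval {
    // Delay doubles each attempt (2 s, 4 s, 8 s, ...) up to the cap.
    let exponential = min(cap, base * pow(2.0, Double(attempt)))
    // Random jitter spreads retries so reconnecting devices don't stampede.
    return exponential * Double.random(in: 0.8...1.2)
}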

Unbounded event accumulation during extended offline periods can exhaust device storage. The system enforces a maximum storage cap, typically around 5 MB, with an eviction policy that prioritizes retention of exposure events: records indicating that a user's client evaluated an experiment toggle and rendered a specific variant. These records form the denominator for conversion rate calculations, so lower-priority interaction events are evicted first.

Because writes happen synchronously to disk before returning control to the calling code, even an immediate OS kill after the write call preserves the event. The event exists on disk and will be picked up by the upload scheduler on the next app launch. This is the critical property that prevents the scenario described at the start of this lesson.

Practical tip: Batch sizes of 20–50 events with gzip compression strike a good balance between network efficiency and upload latency. Smaller batches reduce the window of data at risk between flushes.

Once events survive the device and reach the server, a new problem emerges: the same event arriving more than once.

Data ingestion and event deduplication

Duplicates arise from two common scenarios: a network timeout causes the client to retry a batch that the server already received and processed, or the app restarts and resends events that were written to disk but whose “sent” flag was not yet committed.

Each event is stamped with a globally unique event_id, a UUID v4 generated at event creation time on the client. The ingestion service maintains an idempotency window, a time-bounded set of recently processed event IDs, commonly implemented as a Redis set with a 72-hour TTL. Any incoming event whose ID already exists in this window is silently dropped.

The window size represents a trade-off. A larger window catches more duplicates but consumes more memory. A 72-hour TTL covers the vast majority of retry scenarios, including weekend offline periods, without high cost.
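
The mechanics can be sketched as a time-bounded set. The in-process class below stands in for the shared Redis set described above and is illustrative only; the 72-hour TTL comes from the text.

Swift
import Foundation

// Illustrative in-process idempotency window; a real ingestion tier would use
// a shared store (e.g., a Redis set with a TTL) so all nodes see the same IDs.
final class IdempotencyWindow {
    private var seen: [UUID: Date] = [:]
    private let ttl: TimeInterval

    init(ttl: TimeInterval = 72 * 3600) { // 72-hour window
        self.ttl = ttl
    }

    /// Returns true if the event is new and should be processed;
    /// false means it is a duplicate inside the window and is dropped.
    func shouldProcess(eventId: UUID, now: Date = Date()) -> Bool {
        seen = seen.filter { now.timeIntervalSince($0.value) < ttl } // evict expired
        guard seen[eventId] == nil else { return false }
        seen[eventId] = now
        return true
    }
}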

Achieving true exactly-once semantics on mobile is impractical. The realistic goal is at-least-once delivery with server-side deduplication, which yields effectively-once processing. This distinction matters because it shapes how you design both the client retry logic and the server ingestion layer.

Note: Use server-received timestamps for event ordering rather than client-generated timestamps, because device clocks drift and can be manually altered. Retain client timestamps as metadata for latency analysis, but never use them as the source of truth for sequencing.

The following table summarizes the failure modes discussed so far and their mitigations.

| Failure mode | Root cause | Mitigation strategy | Impact if unmitigated |
| --- | --- | --- | --- |
| Lost exposure events | Process kill before flush | Write-ahead disk persistence | Inflated conversion rates due to missing denominators |
| Duplicate conversion events | Network retry after timeout | Server-side UUID deduplication | Inflated treatment effect, false positive results |
| Inconsistent variant assignment | Random assignment without deterministic hash | Deterministic hashing with (user_id + experiment_id) | Flickering UI, polluted experiment data |
| Stale experiment manifest | Client not refreshing assignments | TTL-based cache invalidation with background refresh | Users stuck in expired experiments |
| Version fragmentation | Assignment service dropping old experiment data | Maintaining experiment data for all active client versions | Undefined behavior on outdated clients |

With the pipeline hardened against data loss and duplication, the final step is verifying that the data actually supports valid statistical conclusions.

Preserving statistical validity

Every infrastructure decision discussed so far serves one ultimate purpose: producing statistically valid experiment results.

The primary diagnostic tool is the Sample Ratio Mismatch (SRM) check. After an experiment runs, you verify that the observed ratio of users in control vs. treatment matches the configured ratio within expected statistical variance. A significant deviation signals a data pipeline bug, whether from lost events, biased bucketing, or deduplication failures. Each architectural pattern maps directly to a specific SRM cause: deterministic bucketing prevents assignment bias, durable telemetry prevents denominator loss, and deduplication prevents numerator inflation.
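
In practice the SRM check is a chi-squared goodness-of-fit test on the observed group counts. The sketch below is illustrative; the alert threshold of 10.83 corresponds to p < 0.001 at one degree of freedom, a common convention for SRM alarms.

Swift
import Foundation

// Chi-squared goodness-of-fit statistic for observed vs. configured split.
func srmStatistic(observed: [Double], expectedRatios: [Double]) -> Double {
    let total = observed.reduce(0, +)
    var chiSquared = 0.0
    for (count, ratio) in zip(observed, expectedRatios) {
        let expected = total * ratio
        chiSquared += pow(count - expected, 2) / expected
    }
    return chiSquared
}

// Example: a 50/50 experiment that ended up 50,000 vs. 44,000 users.
let statistic = srmStatistic(observed: [50_000, 44_000], expectedRatios: [0.5, 0.5])
print(statistic > 10.83 ? "SRM detected: investigate the pipeline" : "Ratio looks healthy")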

Exposure logging deserves special attention. A user should only count in the experiment denominator if an exposure event was actually recorded, confirming that the client evaluated the experiment toggle and rendered the variant. This intent-to-treat vs. as-treated distinction prevents dilution bias, where users who never actually saw the variant are counted against it.
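
One way to enforce this, reusing resolveVariant and the illustrative EventStore from the earlier sketches, is to persist the exposure event at the moment the toggle is evaluated:

Swift
// The user enters the denominator only when the variant is actually resolved
// and about to be rendered; the exposure is persisted write-ahead first.
func evaluateExperiment(
    userId: String,
    experimentId: String,
    variants: [(name: String, weight: Double)],
    store: EventStore
) throws -> String {
    let variant = resolveVariant(userId: userId, experimentId: experimentId, variants: variants)
    try store.append(ExperimentEvent(
        eventId: UUID(),
        experimentId: experimentId,
        kind: "exposure",
        clientTimestamp: Date()
    ))
    return variant
}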

Practical tip: Monitor guardrail metrics like crash rates and latency per variant in real time. A variant that increases crashes will corrupt a long-running experiment and harm users before statistical significance is reached.

Test Your Knowledge!

1. A mobile A/B test shows a statistically significant 8% improvement in conversion rate for the treatment group. However, the Sample Ratio Mismatch check reveals that the treatment group has 12% fewer users than expected. What is the most likely conclusion?

A. The treatment genuinely improves conversion and should be shipped.

B. The result is likely invalid because missing exposure events in the treatment group inflated the observed conversion rate.

C. The bucketing algorithm has a non-uniform hash distribution that should be corrected, but the conversion result is still valid.

D. The server-side deduplication window is too large, causing valid events to be dropped.



Understanding SRM checks completes the picture of how infrastructure protects experimental conclusions. The next section ties all the layers together.

End-to-end resilience in practice

The full lifecycle works as follows. The assignment service resolves variants and serves a versioned manifest. The client SDK caches the manifest and evaluates toggles locally using deterministic hashing. Every exposure and conversion event is written to disk before network transmission. Batched uploads with retry and exponential backoff ensure at-least-once delivery. Server-side UUID-based deduplication achieves effectively-once processing. And SRM checks validate pipeline integrity after the experiment concludes.

Each layer compensates for a specific mobile failure mode. Write-ahead persistence handles process termination. Batched retry handles connectivity loss. Versioned manifests handle client version skew. Server timestamps handle clock drift. No single component is trusted to be perfectly reliable. This is defense in depth applied to experimentation.

As experiment volume scales, additional concerns emerge, such as mutual exclusion between experiments, interaction effects, and progressive rollout integration. But the foundational infrastructure covered in this lesson remains the bedrock for all of them.