Failure Recovery in Flink

Explore how Flink efficiently recovers from failures by periodically checkpointing operator states using the Asynchronous Barrier Snapshotting algorithm. Learn how Flink manages state storage, coordinates checkpoints with external systems like Kafka, and ensures exactly-once processing guarantees to maintain stream processing reliability.

We'll cover the following...

Asynchronous Barrier Snapshotting (ABS)
- Working
Subtle points in the checkpoint algorithm
Phases of ABS
Storing operator’s state
Integration of Flink with other systems
- Integration with Kafka
Guarantees provided by Flink

As mentioned previously, stream processing applications in Flink are expected to be long-lived, so there must be an efficient way to recover from failures without repeating a lot of work. For this purpose, Flink periodically checkpoints the operators’ state together with the position in the consumed stream at which that state was generated. In case of a failure, the application can be restarted from the latest checkpoint and continue processing from there.
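
To make the idea of "state plus stream position" concrete, here is a minimal sketch in plain Python. It is a toy model, not Flink's API: the function name `run_with_checkpoints` and the `fail_at` parameter are hypothetical, and the "stream" is just a list indexed by offset. It shows that restoring the latest checkpoint and replaying from its offset yields the same result as a failure-free run.

```python
import copy

def run_with_checkpoints(records, checkpoint_every, fail_at=None):
    """Count words, saving (offset, state) every `checkpoint_every` records.

    `fail_at` is a hypothetical knob: the offset at which one crash is simulated.
    """
    state = {}              # operator state: word -> count
    checkpoint = (0, {})    # (next offset to read, copy of the state at that point)
    offset = 0
    failed = False
    while offset < len(records):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Crash: drop the in-memory state and restore the latest checkpoint.
            offset, saved = checkpoint
            state = copy.deepcopy(saved)
            continue
        word = records[offset]
        state[word] = state.get(word, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            # Store the state together with the stream position it reflects.
            checkpoint = (offset, copy.deepcopy(state))
    return state

records = ["a", "b", "a", "c", "b", "a", "d"]
no_failure = run_with_checkpoints(records, checkpoint_every=3)
with_failure = run_with_checkpoints(records, checkpoint_every=3, fail_at=5)
```

In this run the crash at offset 5 rolls the operator back to the checkpoint taken at offset 3, so only two records are reprocessed rather than the whole stream.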

All this is achieved via Asynchronous Barrier Snapshotting (ABS), an algorithm similar to the Chandy-Lamport algorithm for distributed snapshots.

Asynchronous Barrier Snapshotting (ABS)

The ABS algorithm operates slightly differently for acyclic and cyclic graphs. We will examine the acyclic case here, which is a bit simpler.

Working

The algorithm works in the following way:

The Job Manager periodically injects control records, referred to as stage barriers, into the stream. These records divide the stream into stages. At the end of a stage, the set of operator states reflects the whole execution history up to the associated barrier, so it can be used for a snapshot.
When a ...
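
The barrier-injection step can be sketched with a toy model in plain Python. This is an illustration under simplifying assumptions, not Flink code: there is a single operator with a single input channel, and the names `Barrier`, `inject_barriers`, and `summing_operator` are hypothetical. It only demonstrates that a snapshot taken on receiving a barrier reflects every record that preceded that barrier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Barrier:
    stage: int  # identifies the stage that this barrier closes

def inject_barriers(records, every):
    """Yield the records, inserting a Barrier after each `every` records."""
    stage = 0
    for i, record in enumerate(records, start=1):
        yield record
        if i % every == 0:
            stage += 1
            yield Barrier(stage)

def summing_operator(stream):
    """Keep a running sum and snapshot it whenever a barrier arrives.

    Returns a dict mapping stage number to the snapshotted state.
    """
    total = 0
    snapshots = {}
    for item in stream:
        if isinstance(item, Barrier):
            # State here covers the whole history up to this barrier.
            snapshots[item.stage] = total
        else:
            total += item
    return snapshots

snapshots = summing_operator(inject_barriers([1, 2, 3, 4, 5, 6, 7], every=3))
```

Here the snapshot for stage 1 covers records 1 to 3 and the snapshot for stage 2 covers records 1 to 6. The final record belongs to a stage that has not been closed by a barrier yet, so it appears in no snapshot.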
