Feature #8: Distributed Process Coordinator
Explore the implementation of a distributed process coordinator in a search engine to efficiently manage tasks across worker nodes. Learn to create fault-tolerant systems using snapshot functionality that tracks node progress states, enabling continued operation despite node failures. Understand the use of setState, snap, and fetchState functions to monitor and retrieve the execution state at any given snapshot, preparing you to handle distributed computing challenges in coding interviews.
Description
For this search engine feature, we will implement a distributed process coordinator. In distributed computing, a coordinator organizes a task that is distributed among nodes. Our distributed process coordinator is responsible for breaking a task into multiple subtasks, assigning those subtasks to different worker nodes, and monitoring their progress. We want the system to be fault tolerant, so that if one or more worker nodes fail, our search engine can continue working without interruption. To achieve this, we will implement a snapshot functionality that saves the current progress of the worker nodes.
We have n worker nodes. Each node has a state, which is the number of subtasks that the node has successfully executed. Initially, the state of each node is 0. We can change a node's progress state with the setState(idx, state) function, which takes two parameters: idx is the index of the node whose progress we are setting, and state is the new state of that node. We should also be able to take a snapshot of the nodes at any time, that is, save the current state of every node at that moment. To implement this, we need a snap() function. It takes no parameters and returns the snapshotId, which is the total number of times snap() has been called minus 1, so the first snapshot has id 0.
We should also be able to read the state of any node as of any snapshot by using the fetchState(idx, snapshotId) function. This function returns the state that node idx had when the snapshot with id snapshotId was taken.
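One possible way to realize these three operations is sketched below in Python. This is an illustration under stated assumptions, not necessarily the lesson's own solution: the class name ProcessCoordinator, the snake_case method names set_state and fetch_state (mirroring setState and fetchState), and the choice to keep a per-node history searched with binary search are all ours. The sketch also assumes that callers only pass snapshot ids that snap() has already returned.

```python
from bisect import bisect_right


class ProcessCoordinator:
    """Tracks per-node progress and supports point-in-time snapshots."""

    def __init__(self, n: int) -> None:
        # One history per node, stored as two parallel lists:
        # the snapshot id a write belongs to, and the state written.
        # Every node starts with state 0, which belongs to snapshot id 0.
        self._snap_ids = [[0] for _ in range(n)]
        self._states = [[0] for _ in range(n)]
        self._next_snapshot_id = 0

    def set_state(self, idx: int, state: int) -> None:
        ids, states = self._snap_ids[idx], self._states[idx]
        if ids[-1] == self._next_snapshot_id:
            # The node was already written since the last snapshot,
            # so only the latest value matters: overwrite it.
            states[-1] = state
        else:
            ids.append(self._next_snapshot_id)
            states.append(state)

    def snap(self) -> int:
        # The returned id is the number of snap() calls so far minus 1.
        self._next_snapshot_id += 1
        return self._next_snapshot_id - 1

    def fetch_state(self, idx: int, snapshot_id: int) -> int:
        # Binary search for the last write at or before snapshot_id.
        pos = bisect_right(self._snap_ids[idx], snapshot_id) - 1
        return self._states[idx][pos]
```

With this layout, snap() does constant work and a node that never changes costs no extra memory per snapshot. The trade-off is that fetch_state takes time logarithmic in the number of recorded writes for that node.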
Suppose that we have three nodes, as shown in the illustration below. Initially, the state of every node is 0. After calling setState(1, 4), the state of node1 changes to 4. If we take a snapshot at this point, the current state of all nodes is saved against snapshot id 0. If we then call setState(1, 7), the current state of node1 changes to 7. Now, if we call fetchState(1, 0) ...
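The sequence of calls in this walkthrough can be replayed with a deliberately naive sketch that copies the state of every node on each snapshot. The class name NaiveCoordinator and the snake_case method names are hypothetical, and the printed values are derived from the definitions of setState, snap, and fetchState given earlier rather than from the lesson's illustration.

```python
class NaiveCoordinator:
    """Naive variant: every snapshot stores a full copy of all node states."""

    def __init__(self, n):
        self.current = [0] * n   # live state of each node
        self.snapshots = []      # snapshots[i] is the copy saved as id i

    def set_state(self, idx, state):
        self.current[idx] = state

    def snap(self):
        # Copy the live states so later writes do not alter the snapshot.
        self.snapshots.append(list(self.current))
        return len(self.snapshots) - 1

    def fetch_state(self, idx, snapshot_id):
        return self.snapshots[snapshot_id][idx]


coordinator = NaiveCoordinator(3)
coordinator.set_state(1, 4)           # node1 now has state 4
snapshot_id = coordinator.snap()      # first snapshot, id 0
coordinator.set_state(1, 7)           # live state of node1 becomes 7
saved = coordinator.fetch_state(1, snapshot_id)

print(snapshot_id, saved, coordinator.current)  # prints: 0 4 [0, 7, 0]
```

Copying all n states on every snap() is easy to reason about, but it costs O(n) time and memory per snapshot even when only one node has changed.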