Sharing data in a cluster

Sharing data in a distributed environment, regardless of the use case, can be confusing.

Understanding the scope (where the variables “live”) and the lifecycle (how their values change) of shared variables while executing code on a cluster is challenging.

Within the Spark ecosystem, variables defined in the driver program can be passed down to functions that run in a distributed fashion on the executors. However, each task receives its own copy of the variable, and each copy’s state diverges as execution proceeds.

Furthermore, this is one-way communication: the copies are never sent back to the driver program, so whatever updated values they hold on the workers are lost.
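A minimal sketch can make this concrete. Assuming Spark’s Scala API and a local `SparkSession` (the object name `ClosureCopyDemo` is illustrative), a driver-side variable mutated inside a closure is not updated on the driver:

```scala
import org.apache.spark.sql.SparkSession

object ClosureCopyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("closure-copy-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    var counter = 0

    // Each task receives its own deserialized copy of `counter`;
    // the increments happen on the executors, not on the driver.
    sc.parallelize(1 to 100).foreach(n => counter += n)

    // On a cluster this prints 0: the driver's copy was never updated.
    // (In local mode the exact behavior is not guaranteed either.)
    println(s"counter = $counter")

    spark.stop()
  }
}
```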

To address these limitations, the Spark API provides two kinds of shared variables: accumulators and broadcast variables.

Accumulators

Accumulators are variables that expose only an addition operation: we can add values to them, but we cannot delete or modify existing ones. In other words, accumulators provide a simple way of aggregating values from the worker nodes back to the driver program.
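As a quick sketch before the project, here is how a built-in long accumulator behaves, assuming Spark’s Scala API and a local `SparkSession` (the name `evenCount` is illustrative). Each task adds to its own partial copy, and Spark merges those partials back into the driver’s value:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accumulator-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A LongAccumulator supports only `add`; Spark merges the
    // per-task partial sums back into the driver's copy.
    val evenCount = sc.longAccumulator("evenCount")

    sc.parallelize(1 to 100)
      .foreach(n => if (n % 2 == 0) evenCount.add(1))

    // Reading `.value` is reliable only on the driver.
    println(s"even numbers seen: ${evenCount.value}") // 50

    spark.stop()
  }
}
```

Note that reading an accumulator’s value inside a task is not meaningful; only the driver sees the merged total.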

Let’s now work through a brief project that demonstrates the usage of both accumulators and broadcast variables.
