Shared Variables in Spark

Learn how Spark makes data sharing and information gathering efficient.

Alongside RDDs, Spark's second abstraction is distributed shared variables. We might want to send static data to all the workers (driver-to-worker information flow) or collect some state from all the workers (workers-to-driver information flow). Spark's shared variable abstraction helps with both of these scenarios.

Shared variables

Some operations require setup work for each partition, such as creating a random number generator with a specific distribution. Without shared variables, the user has to create this setup data and ship it to the workers holding the relevant partitions every time a task runs on them. Shared variables help avoid this overhead: they can aggregate data from all tasks, or store a large value on every worker node and reuse it across many Spark jobs without resending it to the whole cluster. Spark offers two types of shared variables: broadcast variables and accumulators.
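To make the workers-to-driver flow concrete, here is a minimal accumulator sketch in PySpark. The application name, the toy dataset, and the idea of counting unparsable records are assumptions for illustration; the point is that each task adds to the shared counter and only the driver reads the final value after an action runs.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a local SparkSession and a toy dataset (both assumed).
spark = SparkSession.builder.master("local[*]").appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)   # workers-to-driver information flow

def parse(line):
    try:
        return [int(line)]
    except ValueError:
        bad_records.add(1)        # each task adds to the shared counter
        return []

rdd = sc.parallelize(["1", "2", "oops", "4"])
parsed = rdd.flatMap(parse)
parsed.count()                    # an action triggers the tasks

print(bad_records.value)          # only the driver reads the final value, here 1
spark.stop()
```

Note that accumulator updates made inside transformations are only guaranteed to be applied once when the update happens inside an action, since Spark may re-run failed or speculative tasks.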

Broadcast variables

Normally, a variable used in the tasks a driver node launches is simply referenced in a closure (a function that can refer to variables in the scope where it was created). This process can be very inefficient in the following cases:

  • If the variable is large, like a machine learning model or a lookup table, because it has to be shipped and deserialized on a worker node every time it is sent with a task

  • If a variable is used in multiple jobs

Instead of being sent only once, the variable has to be delivered with each job. This creates the need for broadcast variables.
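Here is a minimal broadcast-variable sketch in PySpark, under the same assumptions as above (local SparkSession, toy lookup table). The broadcast value is shipped to each executor once and reused by every task, instead of being serialized into every task's closure.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a local SparkSession and a small lookup table (both assumed).
spark = SparkSession.builder.master("local[*]").appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

country_lookup = {"US": "United States", "IN": "India", "DE": "Germany"}
lookup_bc = sc.broadcast(country_lookup)   # driver-to-worker information flow

codes = sc.parallelize(["US", "DE", "IN", "US"])

# Tasks read the broadcast value instead of receiving a fresh copy in every closure.
full_names = codes.map(lambda code: lookup_bc.value.get(code, "Unknown"))
print(full_names.collect())   # ['United States', 'Germany', 'India', 'United States']

lookup_bc.unpersist()         # optionally release the executor-side copies
spark.stop()
```

In a real workload the broadcast value would typically be something much larger, such as a trained model or a multi-megabyte lookup table, which is where avoiding per-task shipping pays off.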
