Spark Clusters

Distributing workloads in Spark clusters.

Spark environment

A Spark environment is a cluster of machines with a single driver node and zero or more worker nodes. The driver machine is the master node in the cluster and is responsible for coordinating the workloads performed across the cluster.
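As a rough illustration, the sketch below starts a Spark session from Python. The application name and the `local[*]` master URL (which runs the driver and workers on a single machine for testing) are assumptions for this example, not values from the lesson; in a real cluster the master URL would point at your cluster manager.

```python
# Minimal sketch: starting a Spark session that the driver node coordinates.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")   # hypothetical application name
    .master("local[*]")        # assumption: local test cluster; replace with your cluster's master URL
    .getOrCreate()
)

# The driver coordinates work against whichever cluster this points to.
print(spark.sparkContext.master)
```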

Driver and worker nodes

In general, workloads are distributed across the worker nodes when performing operations on Spark dataframes. However, plain Python objects, such as lists or dictionaries, are instantiated only on the driver node and are not distributed.
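The following sketch contrasts a driver-local Python object with a distributed Spark dataframe. The column name and row values are made up for illustration, assuming a Spark session is available as shown above.

```python
# Sketch: driver-local Python objects vs. a distributed Spark dataframe.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A plain Python list lives only in the driver node's memory.
numbers = list(range(1000))

# Rows passed to createDataFrame are partitioned across the workers.
df = spark.createDataFrame([(n,) for n in numbers], ["n"])

# Transformations on the dataframe run on the workers, partition by partition.
doubled = df.selectExpr("n * 2 AS doubled")
print(doubled.rdd.getNumPartitions())  # number of partitions the workers operate on
```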

Ideally, you want all of your workloads to operate on worker nodes so that execution is distributed across the cluster rather than bottlenecked by the driver node. However, some types of operations in PySpark require the driver to perform all of the work, such as collecting an entire dataframe back to a single machine.
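As a sketch of the difference, the example below compares a distributed aggregation with operations that pull every row back to the driver. The column names and row counts are illustrative assumptions; `collect()` and `toPandas()` are standard PySpark calls that materialise results on the driver and can exhaust its memory on large dataframes.

```python
# Sketch: distributed aggregation vs. driver-bound operations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))

# Distributed: the aggregation runs on the workers; only one summary row
# is returned to the driver.
summary = df.agg(F.sum("squared").alias("total")).collect()

# Driver-bound: every row is shipped to and held on the driver node.
local_rows = df.collect()
local_pdf = df.toPandas()  # builds the full pandas dataframe on the driver
```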
