Spark Clusters
Explore the structure and function of Spark clusters, including the roles of driver and worker nodes. Understand lazy execution in PySpark and best practices for managing data with Pandas dataframes. Learn how to use persistent storage for scalable, fault-tolerant batch pipelines in cloud environments.
Spark environment
A Spark environment is a cluster of machines with a single driver node and zero or more worker nodes. The driver machine is the master node in the cluster and is responsible for coordinating the workloads that the cluster performs.
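As a concrete starting point, the sketch below shows how a PySpark program attaches to a cluster through a SparkSession, the driver-side entry point for submitting work. The application name and the local[*] master URL are illustrative assumptions; on a real cluster you would point the master setting at your cluster manager instead.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession. The driver process coordinates all
# work submitted through this entry point.
spark = (
    SparkSession.builder
    .appName("cluster-intro")   # hypothetical application name
    .master("local[*]")         # assumption: local mode for experimentation;
                                # replace with your cluster manager's URL in production
    .getOrCreate()
)

# Confirm which cluster the driver is attached to.
print(spark.sparkContext.master)
```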
Driver and worker nodes
In general, workloads will be distributed across the worker nodes when you perform operations on Spark dataframes. However, when you work with native Python objects, such as lists or dictionaries, those objects are instantiated on the driver node, as the sketch below illustrates.
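To make the distinction concrete, here is a minimal, hypothetical PySpark sketch: a plain Python list lives only in the driver process, while converting it to a Spark dataframe partitions the data so that transformations run on the worker nodes. The column names and values are assumptions for illustration only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-workers").getOrCreate()

# A plain Python list exists only in the driver process.
local_rows = [("a", 1), ("b", 2), ("c", 3)]

# Converting it to a Spark dataframe partitions the data so that
# subsequent transformations execute on the worker nodes.
df = spark.createDataFrame(local_rows, ["key", "value"])

# Transformations are lazy: nothing runs until an action is called.
doubled = df.withColumn("value", df["value"] * 2)

# collect() is an action. It triggers execution on the workers and pulls
# every row back to the driver, so use it only on small results.
print(doubled.collect())
```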
Ideally, you want all of your workloads to be distributed across the worker nodes, so keep data in Spark dataframes rather than in driver-local Python objects whenever possible.