Deep Dive: Transformations and Data Storage

Get introduced to the mechanics behind a Spark transformation and, in particular, the memory scheme used to store the data that a transformation works on.

Tasks, Partitions, and Transformations

Several steps are triggered when reading from a source, such as a CSV file, or when applying a transformation to the records that have been read.

Behind the scenes, Spark takes a divide-and-conquer approach to significant volumes of data: it splits the data into partitions and processes those partitions in parallel across the cluster.
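To see this partitioning, here is a minimal, self-contained sketch. The session runs locally to stand in for a real cluster, and the file name `records.csv` is a hypothetical input used only for illustration:

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    // A local session stands in for a real cluster; four worker threads
    // simulate four executor cores.
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[4]")
      .getOrCreate()

    // "records.csv" is a hypothetical input file.
    val df = spark.read
      .option("header", "true")
      .csv("records.csv")

    // Each partition is the unit of work that Spark hands to a task.
    println(s"Number of partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```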

When we applied a transformation in our previous lesson, we learned that the operation is carried out over the whole dataset. We also noted that this happens in a distributed fashion. So, how does Spark manage this? Let’s take a quick look.

When the first operation (reading from a CSV file) is triggered, the driver program kicks off parallel processing by fanning out the workload to the cluster’s worker nodes.
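Continuing the sketch above, one way to observe this fan-out is to tag each record with the ID of the partition (and therefore the task) that processed it. The `mapPartitions` call below is illustrative, not the lesson’s own example:

```scala
import org.apache.spark.TaskContext

// Reusing `df` from the previous sketch: each partition becomes one task,
// and tasks run in parallel across the worker nodes.
val tagged = df.rdd.mapPartitions { rows =>
  val partitionId = TaskContext.getPartitionId() // which partition this task owns
  rows.map(row => s"partition $partitionId processed: $row")
}

tagged.take(5).foreach(println)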

Let’s imagine a timeline along which a succession of operations takes place: each is first expressed as code in the driver program (static in nature) and then observed on the cluster nodes as running processes (dynamic in nature).
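To make this timeline concrete, here is a minimal sketch that continues the session from the first example. The file name and the `age` column are hypothetical stand-ins, and the comments mark which side of the timeline each line belongs to:

```scala
val df = spark.read
  .option("header", "true")
  .csv("records.csv")                 // first operation: the driver fans the
                                      // read out across the worker nodes

val adults = df.filter("age >= 18")   // transformation: applied to every
                                      // partition in parallel on the workers

adults.show(5)                        // a sample of the results returns
                                      // to the driver
```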
