Transformations and Actions
Explore the fundamental concepts of transformations and actions in Apache Spark. Understand how transformations modify DataFrames immutably and are classified as narrow or wide. Learn about the lazy evaluation of transformations using Directed Acyclic Graphs, and how actions trigger computation to produce results. This lesson equips you with the knowledge to efficiently manipulate and process big data using Spark's Java API.
Two types of operations
After having worked on the previous example projects, we’re better positioned to understand two crucial concepts, and their related operations, involved in any Spark application: transformations and actions.
These two concepts are exposed programmatically through methods of the Spark API's abstractions, such as DataFrame, RDD, and JavaRDD.
Transformations
Transformations are operations that can change both the structure of a DataFrame and its contents. We've applied both kinds of change in previous examples:
- Renaming, dropping, and creating columns of a DataFrame or a Dataset (the withColumn() and drop() methods, etc.).
- Doing calculations on each row of the DataFrame, whether to add a new column or to create a Dataset of POJOs (when we introduced the map() method and the related MapFunction interface).
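The operations above can be sketched in a single chain. This is a minimal, self-contained example; the class name TransformationsSketch and the Person POJO are illustrative, and it assumes the spark-sql dependency is on the classpath and that Spark runs in local mode:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class TransformationsSketch {

    // Simple POJO (Java bean) used to build the example DataFrame.
    public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Builds a small DataFrame and chains three transformations.
    public static Dataset<Row> buildTrimmed() {
        SparkSession spark = SparkSession.builder()
                .appName("transformations-sketch")
                .master("local[*]")   // local mode, assumed for this example
                .getOrCreate();

        Dataset<Row> people = spark.createDataFrame(
                Arrays.asList(new Person("Alice", 30), new Person("Bob", 25)),
                Person.class);

        // Each call returns a NEW DataFrame; 'people' itself is never modified.
        return people
                .withColumn("ageNextYear", col("age").plus(1)) // create a column
                .withColumnRenamed("name", "fullName")         // rename a column
                .drop("age");                                  // drop a column
    }

    public static void main(String[] args) {
        buildTrimmed().show();
    }
}
```

Note that each transformation is called on the result of the previous one: chaining works precisely because every call hands back a new DataFrame rather than mutating the one it was called on.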
Every time we applied a transformation, we also got a new DataFrame as a result. This is due to a fundamental property of the abstraction:
- DataFrames are immutable structures. In practical terms, they are objects that can be read or created but not updated. To obtain a modified version of a DataFrame, a new one is created from the existing DataFrame's information after a