Transformations (I): Map and Filter
Explore how to apply map and filter transformations on Spark DataFrames using the Java API. Understand the use of MapFunction and FilterFunction interfaces to modify and filter data while respecting Spark's immutability. Learn practical examples of transforming column values and filtering rows based on conditions, enhancing your big data manipulation skills.
This lesson follows the project embedded in the widget below.
It is recommended that we follow the explanations in tandem with the code, and run the project to see the results. It’s also helpful to change parts of the code and detour a bit from the code base to experiment and see the results live.
mvn install exec:exec
Let’s run the project for the first time with mvn install exec:exec to print the first five rows of the DataFrame, which holds each food’s name, scientific name, group, and subgroup:
+--------------+--------------------+----------------+--------------------+
|     FOOD NAME|     SCIENTIFIC NAME|           GROUP|           SUB GROUP|
+--------------+--------------------+----------------+--------------------+
|      Angelica|    Angelica keiskei|Herbs and Spices|               Herbs|
| Savoy cabbage|Brassica oleracea...|      Vegetables|            Cabbages|
| Silver linden|      Tilia argentea|Herbs and Spices|               Herbs|
|          Kiwi| Actinidia chinensis|          Fruits|     Tropical fruits|
|Allium (Onion)|              Allium|      Vegetables|Onion-family vege...|
+--------------+--------------------+----------------+--------------------+
only showing top 5 rows
Map
In plain terms, the map transformation applies a function to every element of a DataFrame (or of another Spark abstraction) and returns the results as a new Dataset, leaving the original untouched.
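To make the idea concrete before looking at Spark's own API, here is a minimal plain-Java sketch of the same semantics. It is an analogy only and does not use Spark: the class name MapSketch and the helper mapAll are hypothetical, and the food names are borrowed from the output above. It shows the two properties that matter here: the function is applied to every element, and the result is a new collection while the input stays unchanged.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class: a plain-Java analogy of "map", not Spark code.
class MapSketch {

    // Applies fn to every element and returns a NEW list;
    // the input list is never modified.
    static <T, U> List<U> mapAll(List<T> input, Function<T, U> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> foods = List.of("Angelica", "Savoy cabbage", "Kiwi");

        // Transform every name to upper case.
        List<String> upper = mapAll(foods, String::toUpperCase);

        System.out.println(foods); // [Angelica, Savoy cabbage, Kiwi]
        System.out.println(upper); // [ANGELICA, SAVOY CABBAGE, KIWI]
    }
}
```

Spark's map follows the same pattern, except that the function runs over the distributed rows of the DataFrame and the result comes back as a new Dataset.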
As developers, it is always a good exercise to read a method signature because it shows the method's intent and the contract we have to abide by. The map method is defined as:
map(MapFunction<T,U> func, Encoder<U> encoder)
- func: The first of the two arguments is a MapFunction called func. We’ve used it in a previous lesson, but to reiterate, it is an interface containing a single method (call) with an input of type T and a return type of U. In Java ...