The groupBy and groupByKey methods

Let’s learn about two major transformations that group data and may also cause data shuffling.

Grouping data

When a DataFrame column’s values represent an ID or key of some sort, many business scenarios call for grouping and processing information based on that key.

As we saw in the previous lesson, we have no control over how Spark initially allocates rows among the partitions and nodes where they reside. Still, we can use a grouping transformation to bring related rows together based on the key column.

To achieve this, the Spark API provides the groupBy and groupByKey operations.
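As a quick preview, here is a minimal sketch of where each operation lives. The Employee case class, the sample data, and the department column are hypothetical, introduced only for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("grouping-preview")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical case class and sample data for illustration.
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
  Employee("Alice", "Sales", 3000.0),
  Employee("Bob", "Engineering", 4500.0),
  Employee("Carol", "Sales", 3200.0)
).toDS()

// groupBy: untyped, column-based grouping on a DataFrame/Dataset.
val byDeptUntyped = employees.groupBy("department")

// groupByKey: typed grouping on a Dataset, keyed by a function of each row.
val byDeptTyped = employees.groupByKey(_.department)

Note that groupBy works with column names, while groupByKey takes a function applied to each element, which is why it belongs to the typed Dataset API.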

The groupBy method

The groupBy(...) method provided by the DataFrame API groups rows based on one or more columns acting as the grouping criteria.

However, just like when working with databases, grouped information is of little use without an operation performed on it. The kind of operation that Spark ties to a groupBy call is an aggregate function, such as the following (a short code sketch follows this list):

  • count
  • avg
  • sum
  • … others
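To make this concrete, here is a minimal sketch that applies these aggregate functions to grouped data. It reuses the hypothetical employees Dataset and department column from the preview above:

import org.apache.spark.sql.functions.{count, avg, sum}

// Group rows by the hypothetical department column, then aggregate:
// each aggregate function collapses every group into a single value.
val departmentStats = employees
  .groupBy("department")
  .agg(
    count("*").as("employees"),      // how many rows per group
    avg("salary").as("avg_salary"),  // average salary per group
    sum("salary").as("total_salary") // total salary per group
  )

departmentStats.show()

Without the call to agg(...), the result of groupBy is only an intermediate grouped representation; it is the aggregate functions that turn it back into a DataFrame of results.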

The “Actions (II): Reduce and Aggregated Functions: Max, Min and Mean” lesson introduced aggregate functions. Still, the groupBy transformation paints the clearest picture of them, so let’s begin by depicting it.

We’ll follow our tradition of illustrating transformations, but for this operation, we’ll change the layout to give a more comprehensive view of the grouping operation flow:
