What is the GroupByKey transform in Apache Beam?

GroupByKey is one of the core transforms in the Apache Beam model. It works similarly to the shuffle phase of a map-shuffle-reduce algorithm: it processes a collection of key/value pairs and gathers all the values that share a key. This makes GroupByKey a helpful way to aggregate data that has something in common.

For example, for a collection that stores records of client orders, you might want to group the orders that come from the same city. Here, the city is the key of the key/value pair, and the rest of the record is the value.

Input to `GroupByKey`

The input to GroupByKey is a collection of key/value pairs that represents a multimap, i.e., it contains multiple pairs with the same key but different values. In such cases, you use GroupByKey to collect all the values associated with each unique key.

Let’s take an example where we have words (keys) from a text file and the line numbers (values) on which they appear. Our goal is to combine all the line numbers for a particular word.

I, 3
We, 6
You, 7
They, 9
Edpresso, 1
Educative, 2
You, 5
They, 4
Edpresso, 3
Educative, 8
...

Output of `GroupByKey`

As discussed above, our goal is to get all the line numbers for a particular word.

Applying GroupByKey to this input gives the following output:

I, [3]
We, [6]
You, [7, 5]
They, [9, 4]
Edpresso, [1, 3]
Educative, [2, 8]
...
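
The grouping shown above can be sketched in plain Python. This is only an emulation of the semantics, not the Beam API: the helper name `group_by_key` is hypothetical, and the snippet uses the standard library alone. In the Beam Python SDK, the corresponding step would be applying `beam.GroupByKey()` to a collection of `(key, value)` tuples.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value that shares a key (hypothetical helper, not Beam)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# The (word, line number) pairs from the example input.
word_lines = [
    ("I", 3), ("We", 6), ("You", 7), ("They", 9), ("Edpresso", 1),
    ("Educative", 2), ("You", 5), ("They", 4), ("Edpresso", 3), ("Educative", 8),
]

grouped = group_by_key(word_lines)
for word, lines in grouped.items():
    print(word, lines)
```

Note that this sketch keeps values in input order, whereas Beam makes no guarantee about the order of the grouped values.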

The GroupByKey transform is for bounded data. Bounded means a fixed amount of data, for example, files in Google Cloud Storage. To apply GroupByKey to unbounded data, that is, an infinite amount of data such as data read from a streaming source, you need to use windowing; otherwise, Beam throws an IllegalStateException while building the pipeline.
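
To see why windowing makes grouping possible on a never-ending stream, here is a minimal plain-Python sketch. It is an emulation under stated assumptions, not Beam code: the helper `group_by_key_per_window`, the 60-second window size, and the timestamped sample events are all hypothetical. Each event is assigned to a fixed window by its timestamp, and values are grouped per window and key, so every group is finite. In the Beam Python SDK, the analogous step would be applying `beam.WindowInto(beam.window.FixedWindows(60))` before `beam.GroupByKey()`.

```python
from collections import defaultdict


def group_by_key_per_window(events, window_size):
    """Group values by (window start, key).

    `events` holds (key, value, timestamp) tuples; hypothetical helper, not Beam.
    """
    grouped = defaultdict(list)
    for key, value, timestamp in events:
        # Fixed windows: each timestamp falls into exactly one window.
        window_start = (timestamp // window_size) * window_size
        grouped[(window_start, key)].append(value)
    return dict(grouped)


# Hypothetical stream sample: (word, line number, event time in seconds).
events = [
    ("You", 7, 5), ("They", 9, 12), ("You", 5, 48),
    ("They", 4, 61), ("You", 2, 75),
]

windowed = group_by_key_per_window(events, window_size=60)
for (window_start, word), lines in windowed.items():
    print(window_start, word, lines)
```

Without the window assignment, the group for a key such as "You" could never be declared complete, because more values might always arrive later.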

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)