GroupByKey is one of the core transforms in the Apache Beam model. It processes collections of key/value pairs and works similarly to the shuffle phase of a map-shuffle-reduce algorithm. GroupByKey is a helpful way to aggregate data that has something in common.
For example, given a collection that stores records of client orders, you might want to group the orders that come from the same city. Here, the city is the key of the key/value pair, and the rest of the record is the value.
The input to GroupByKey is a collection of key/value pairs that represents a multimap, i.e., it can contain multiple pairs with the same key but different values. In such cases, you use GroupByKey to collect all the values associated with each unique key.
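To make the multimap idea concrete, here is a minimal plain-Python sketch of the grouping semantics. It is not Beam code: the function name `group_by_key` and the sample city/order data are hypothetical and only illustrate what "collect all the values for each unique key" means.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value that shares a key into one list (input order kept)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical (city, order) pairs: the city is the key, the order is the value.
orders = [("Berlin", "order-1"), ("Paris", "order-2"), ("Berlin", "order-3")]

print(group_by_key(orders))
# {'Berlin': ['order-1', 'order-3'], 'Paris': ['order-2']}
```

Unlike this in-memory sketch, Beam performs the grouping in parallel across workers, so it does not promise any particular order for the grouped values.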
Let’s take an example where we have words (keys) from a text file and the line numbers (values) on which they appear. Our goal is to combine all the line numbers for a particular word.
I, 3
We, 6
You, 7
They, 9
Edpresso, 1
Educative, 2
You, 5
They, 4
Edpresso, 3
Educative, 8
...
As discussed above, our goal is to get all the line numbers for a particular word.
Applying GroupByKey to this input, we get the following output:
I, [3]
We, [6]
You, [7, 5]
They, [9, 4]
Edpresso, [1, 3]
Educative, [2, 8]
...
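GroupByKey needs extra care with unbounded data: applying GroupByKey to an unbounded collection without non-global windowing or a trigger leads to an IllegalStateException while building the pipeline.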