Trusted answers to developer questions

Related Tags

communitycreator
apachebeam

What is GroupByKey transform in Apache Beam?

Kedar Kodgire


GroupByKey is one of the core transforms in the Apache Beam model. It processes collections of key/value pairs and works similarly to the shuffle phase of a map-shuffle-reduce algorithm: it gathers all the values that share a key. This makes GroupByKey a helpful way to aggregate data that has something in common.

For example, for a collection that stores records of client orders, you might want to group the orders that come from the same city. Here, the city is the key of the key/value pair, and the rest of the record is the value.
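As a rough illustration of that keying step, the plain-Python sketch below turns each order record into a (city, rest-of-record) pair. It is not Beam code, and the order records and field names are invented for the example.

```python
# Hypothetical order records; the field names are assumptions for this sketch.
orders = [
    {"id": 1, "city": "Pune", "amount": 250},
    {"id": 2, "city": "Austin", "amount": 90},
    {"id": 3, "city": "Pune", "amount": 40},
]


def key_by_city(order):
    """Return a (key, value) pair: the city is the key, the remaining fields are the value."""
    value = {field: v for field, v in order.items() if field != "city"}
    return order["city"], value


# The same key ("Pune") appears more than once, so the pairs form a multimap.
keyed_orders = [key_by_city(order) for order in orders]

for city, rest in keyed_orders:
    print(city, rest)
```

A collection shaped like `keyed_orders` is the kind of input that a grouping step can then collect per city.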

Input to GroupByKey

The input to GroupByKey is a collection of key/value pairs that represents a multimap, i.e., the same key can appear in several pairs with different values. In such cases, you use GroupByKey to collect all the values associated with each unique key.

Let’s take an example where we have words (keys) from a text file and the line numbers (values) on which they appear. Our goal is to combine all the line numbers for a particular word.

I, 3
We, 6
You, 7
They, 9
Edpresso, 1
Educative, 2
You, 5
They, 4
Edpresso, 3
Educative, 8
...

Output of GroupByKey

Applying GroupByKey to this input collects all the line numbers for each word:

I, [3]
We, [6]
You, [7, 5]
They, [9, 4]
Edpresso, [1, 3]
Educative, [2, 8]
...
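The grouping itself can be sketched in a few lines of plain Python. This is only an emulation of the semantics on an in-memory list, not the Beam implementation; in the Beam Python SDK the corresponding transform is `beam.GroupByKey()`. The helper name `group_by_key` is made up for this sketch.

```python
from collections import defaultdict

# (word, line_number) pairs taken from the example input.
pairs = [
    ("I", 3), ("We", 6), ("You", 7), ("They", 9), ("Edpresso", 1),
    ("Educative", 2), ("You", 5), ("They", 4), ("Edpresso", 3), ("Educative", 8),
]


def group_by_key(pairs):
    """Collect every value that shares a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


grouped = group_by_key(pairs)

for word, line_numbers in grouped.items():
    print(f"{word}, {line_numbers}")
```

This in-memory version keeps values in input order. A distributed runner generally makes no such promise about the order of values within a group, so downstream code should not rely on it.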

By default, the GroupByKey transform is meant for bounded data, i.e., a fixed amount of data such as files in Google Cloud Storage. Unbounded data is a potentially infinite amount of data, for instance, data read from a streaming source. To apply GroupByKey to unbounded data, you need to use non-global windowing or a trigger; otherwise, Beam throws an IllegalStateException while building the pipeline.
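To see why windowing makes grouping possible on a stream, the plain-Python sketch below assigns each element to a fixed 60-second window based on its timestamp and then groups per window and key. It only emulates the idea; the events, timestamps, and helper names are assumptions for illustration. In the Beam Python SDK, fixed windows are applied with `beam.WindowInto(beam.window.FixedWindows(60))` before `beam.GroupByKey()`.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; an arbitrary choice for this sketch

# Hypothetical stream elements: (word, line_number, event_time_in_seconds).
events = [
    ("You", 7, 5),
    ("They", 9, 42),
    ("You", 5, 61),
    ("They", 4, 65),
    ("You", 2, 70),
]


def window_start(event_time, size=WINDOW_SIZE):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def group_by_window_and_key(events):
    """Group values per (window, key), so each finite window can be emitted on its own."""
    grouped = defaultdict(list)
    for key, value, event_time in events:
        grouped[(window_start(event_time), key)].append(value)
    return dict(grouped)


windowed = group_by_window_and_key(events)

for (start, key), values in sorted(windowed.items()):
    print(f"window [{start}, {start + WINDOW_SIZE}) {key}: {values}")
```

Because each window covers a finite time range, the grouping for a window can finish even though the stream as a whole never ends.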
