Note: The examples below are in Python, but you can achieve similar functionality using Java and Go.
Overall, there are five core transforms in the Apache Beam model.
ParDo is one of the most commonly used transforms. It works similarly to the map phase of the MapReduce algorithm: ParDo is Beam's transform for parallel processing. It applies a processing function to every element in the input PCollection and emits zero or more elements to the output PCollection.

ParDo requires a user-defined DoFn, a Beam SDK class that describes a distributed processing function. Your DoFn subclass holds the transformation you plan to apply. To use ParDo, you apply it to the input PCollection (in the Python SDK, with the | operator), passing your DoFn object to ParDo as an argument.
Let's look at an example where we compute the length of every word in a PCollection of strings:

# Apply a ParDo to the "words" PCollection to get the length of every word.
word_lengths = words | beam.ParDo(WordLengthFn())
WordLengthFn is a subclass of DoFn. As discussed previously, a DoFn holds the processing logic that gets applied to every element in the input PCollection. Therefore, inside the DoFn subclass, you need a process method that contains that logic.
You don't need to extract individual elements from the PCollection manually; the Beam SDK handles that, so you only need to accept element as an argument of the process method.
class WordLengthFn(beam.DoFn):
    def process(self, element):
        return [len(element)]
There are two critical points you need to consider while building the DoFn:

1. You should not alter the element argument given to the process method in any way.
2. Once you output a value using return (or yield), you should not modify that value afterwards.
When the functionality is straightforward, you don't need to create a new DoFn subclass. Instead, you can use a lightweight built-in transform such as FlatMap with a lambda function:
word_lengths = words | beam.FlatMap(lambda word: [len(word)])