How to do serverless data processing with Dataflow
Serverless data processing is a powerful way to analyze, transform, and process data without the burden of managing infrastructure. Google Cloud Dataflow, powered by Apache Beam, is a managed service that simplifies data processing tasks, making it an ideal choice for beginners. In this Answer, we will learn how to set up and execute a basic data processing pipeline using Google Cloud Dataflow and the Apache Beam framework in a local development environment, including creating a virtual environment.
Dataflow pipeline prerequisites
Python must be installed on the local machine
Pip must be installed to manage Python packages
Basic knowledge of data processing concepts
Implementation of the Dataflow pipeline
The following steps show a basic implementation of a Dataflow pipeline using Apache Beam.
Step 1: Setting up the local development and virtual environments
To begin, we need to set up a local development environment to work with Apache Beam and Google Cloud Dataflow. We must also create a virtual environment to isolate our project dependencies, as shown below:
We open a terminal and navigate to the directory where we want to create our project.
We create a virtual environment named dataflow for our project as follows:
python -m venv dataflow
We activate the virtual environment as follows:
source dataflow/bin/activate
Step 2: Installing Apache Beam in the virtual environment
Within our virtual environment, we install the Apache Beam Python SDK and the necessary dependencies using pip, as shown below:
pip install apache-beam
Step 3: Preparing our data
We use the following .txt file for this Answer:
This is a sample text file to test dataflow processing using Apache Beam.
It contains two lines of text for word counting.
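The file can be created manually in any editor, or with a short Python snippet like the one below. The data.txt filename matches the one the pipeline reads in the next step:

```python
# Write the two sample lines to data.txt in the current directory.
sample_text = (
    "This is a sample text file to test dataflow processing using Apache Beam.\n"
    "It contains two lines of text for word counting.\n"
)

with open("data.txt", "w") as f:
    f.write(sample_text)
```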
Step 4: Developing our Dataflow pipeline
Now, it’s time to write the code for our data processing pipeline using Apache Beam by performing the following steps:
We create a simple Python script to define our Dataflow pipeline as follows:
import apache_beam as beam

# Define a Dataflow pipeline
def run():
    with beam.Pipeline() as pipeline:
        data = (
            pipeline
            | 'ReadData' >> beam.io.ReadFromText('data.txt')
            | 'TransformData' >> beam.Map(lambda line: line.upper())
            | 'WriteData' >> beam.io.WriteToText('output.txt')
        )

if __name__ == '__main__':
    run()
Note: In the code, replace 'data.txt' with the path to your local data file and 'output.txt' with the path where you want to store the processed data.
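The TransformData step simply applies a Python callable to each element of the input collection. Stripped of Beam, the per-element logic of beam.Map(lambda line: line.upper()) looks like this on a plain Python list:

```python
# The same transformation beam.Map applies to each line,
# shown here on an ordinary Python list instead of a PCollection.
lines = ["This is a sample text file.", "It contains two lines."]

transformed = [line.upper() for line in lines]
print(transformed)
```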
Step 5: Testing locally
We open a terminal and navigate to the directory containing our Python script.
We run the pipeline locally using the following command:
python local-dataflow.py
We execute the following widget to see it in action:
import apache_beam as beam

# Define a Dataflow pipeline
def run():
    with beam.Pipeline() as pipeline:
        # Read a text file
        lines = pipeline | "ReadFromText" >> beam.io.ReadFromText("data.txt")

        # Count the occurrences of each distinct word
        word_counts = (
            lines
            | "SplitWords" >> beam.FlatMap(lambda line: line.split())
            | "CountWords" >> beam.combiners.Count.PerElement()
        )

        # Print the results
        word_counts | "PrintResults" >> beam.Map(print)

if __name__ == "__main__":
    run()
The output, listing each word alongside its count, confirms a successful execution.
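Conceptually, the widget's pipeline performs the same computation as the plain-Python sketch below, using the two sample lines from data.txt. It can help in verifying the counts the pipeline prints:

```python
from collections import Counter

lines = [
    "This is a sample text file to test dataflow processing using Apache Beam.",
    "It contains two lines of text for word counting.",
]

# Equivalent of the SplitWords step: flatten every line into individual words
words = [word for line in lines for word in line.split()]

# Equivalent of the CountWords step: count occurrences of each distinct word
word_counts = Counter(words)

for word, count in word_counts.items():
    print((word, count))
```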