A PCollection is a multi-element dataset. Every data processing pipeline requires at least one PCollection to hold its input and output data.
There are two ways to make a PCollection: reading from an external source or creating one from in-memory data. External sources are typically used in the production environment, whereas in-memory sources are used for debugging and testing purposes.
To read data from external sources, you need Beam I/O Adapters. The adapters differ in usage. However, all of them read from some external data source and return a
Let’s look at an example using the
lines = pipeline | 'ReadFile' >> beam.io.ReadFromText('gs://some/input.txt')
Here, we pass the file location as an argument. In this case, gs://some/input.txt is a Google Cloud Storage location. Each adapter has a Read transform; to read, you apply that transform to the pipeline object itself.
To create a PCollection from in-memory data, use the Create transform. This transform is applied directly to the pipeline object, and you pass the data in your code.
Let’s look at an example.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    lines = (
        pipeline
        | beam.Create([
            'Welcome to the SHOT: ',
            'This is in-memory data ',
            'it is used in testing ',
            'This is never used in prod ',
        ]))
Before we begin, it is essential to know that PCollections cannot be shared between different pipelines.
You cannot access a random element in a PCollection. Instead, your code reads the elements one by one.
PCollection data can be of any type (integer, string, etc.), but all of the elements must be of the same type.
Beam encodes every element in a
PCollection as a byte string to support distributed processing.
The element type in a PCollection often has a structure, e.g., JSON, Protocol Buffers, Avro, or database records. Using a schema allows us to perform complex operations on such data.
PCollections are immutable. Therefore, once created, you cannot modify (add, delete, etc.) them in any way.
While applying transformations to a PCollection, the elements are read one by one, transformed, and stored in a new PCollection; the original is never modified.
There is no upper limit on the number of elements a PCollection can store. The data either fits in memory on a single machine, or Beam distributes it across multiple machines.
A PCollection can be either bounded or unbounded. Bounded represents a fixed amount of data, for example, files in Google Cloud Storage. Unbounded means a potentially infinite amount of data; for instance, when reading from a streaming source (Kafka or Pub/Sub), we may not know how much data we will receive.
A timestamp is assigned to every element in a PCollection when it is created from its source.
Timestamps can be helpful when you want to filter or transform the data with a particular timestamp.