A PCollection is a multi-element dataset. Every data processing pipeline requires at least one PCollection to hold its input and output data.
There are two ways to make a PCollection: reading from an external source or creating one from in-memory data. External sources are typically used in the production environment, whereas in-memory sources are used for debugging and testing purposes.
To read data from external sources, you need Beam I/O Adapters. The adapters differ in usage. However, all of them read from some external data source and return a
Let’s look at an example using the
lines = pipeline | 'ReadFile' >> beam.io.ReadFromText('gs://some/input.txt')
Here, we pass the file location as an argument. In this case, gs://some/input.txt is a Google Cloud Storage location. Each adapter has a Read transform; to read, you apply that transform to the pipeline object itself.
To create a PCollection from in-memory data, use the Create transform. This transform is applied directly to the pipeline object, and you pass the data in your code.
Let’s look at an example.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    lines = (
        pipeline
        | beam.Create([
            'Welcome to the SHOT: ',
            'This is in-memory data ',
            'it is used in testing ',
            'This is never used in prod ',
        ]))
Before we begin, it is essential to know that PCollections cannot be shared between different pipelines.
You cannot access a random element in a PCollection. Instead, your code reads the elements one by one.
PCollection data can be of any type (integer, string, etc.), but all of the elements must be of the same type.
Beam encodes every element in a
PCollection as a byte string to support distributed processing.
The element type in a PCollection often has a structure, e.g., JSON, Protocol Buffers, Avro, or database records. Using a schema allows us to perform complex operations on such data.
PCollections are immutable. Therefore, once created, you cannot modify (add, delete, etc.) them in any way.
While applying transformations to a PCollection, the elements are read one by one, transformed, and stored in a new PCollection; the original is never modified.
There is no upper limit on the number of elements a PCollection can store. The data either fits in memory on a single machine, or Beam distributes it across multiple machines.
A PCollection can be either bounded or unbounded. Bounded represents a fixed amount of data, for example, files in Google Cloud Storage. Unbounded means a potentially infinite amount of data; for instance, when reading from a streaming source (Kafka or Pub/Sub), we may not know how much data we will receive.
A timestamp is assigned to every element in a PCollection when it is created from its source.
Timestamps can be helpful when you want to filter or transform the data with a particular timestamp.