Relating and then recombining data produces meaningful information, and Spark provides an operation for exactly this purpose: the Join transformation.

The Join transformation

To join information held in different datasets, Spark provides the Join transformation, similar to the join operation in relational databases.

And as in those databases, different types of joins can be performed between separate yet related datasets.

Let’s introduce some Venn diagrams along with our code examples.

Types of Joins

In this course, we’re going to focus on the following Joins: Inner, Self, Left Outer, Right Outer, and Full Outer.

Join is a wide transformation, so the likelihood of data shuffling across the cluster has to be taken into account when using it.

The general structure of Join

The following code snippet, which belongs to the joins project located in the widget at the end of this lesson, presents the general structure of a join transformation's syntax at the code level.

public Dataset<Row> join(final Dataset<?> right, final Column joinExprs, final String joinType)

The method’s signature (part of the Spark DataFrame API) is almost self-explanatory and expects the following arguments:

  • A right dataset (or generically typed DataFrame). This is the second DataFrame to join with.

  • The join expression. This is a column-based expression that specifies the join criteria.

  • The type of Join to be performed.

There are overloaded versions of this method, but it is useful to keep this one as the foundation to help understand the rest of them.

In this lesson’s project, we’re working with a trivial example of relationships between datasets. One dataset holds employee information, whereas the second (related) dataset holds departmental information.
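The full example lives in the project widget. As a stand-in, here is a minimal JDK-only sketch (no Spark required) that mimics the shape of `join(right, joinExprs, joinType)` over in-memory employee and department rows, to make the argument list and the difference between an inner and a left-outer join concrete. The row data, field names, and the `join` helper are invented for illustration; Spark's real implementation operates on distributed DataFrames and is far more sophisticated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

public class JoinSketch {
    // Mirrors the shape of Spark's join(right, joinExprs, joinType),
    // but over plain in-memory lists -- no Spark, no shuffling.
    static List<Map<String, Object>> join(
            List<Map<String, Object>> left,
            List<Map<String, Object>> right,
            BiPredicate<Map<String, Object>, Map<String, Object>> joinExpr,
            String joinType) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> l : left) {
            boolean matched = false;
            for (Map<String, Object> r : right) {
                if (joinExpr.test(l, r)) {
                    Map<String, Object> row = new LinkedHashMap<>(l);
                    row.putAll(r); // combine columns from both sides
                    result.add(row);
                    matched = true;
                }
            }
            if (!matched && "left_outer".equals(joinType)) {
                result.add(new LinkedHashMap<>(l)); // keep unmatched left rows
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> employees = List.of(
                Map.of("name", "Ana", "dept_id", 1),
                Map.of("name", "Bo", "dept_id", 99)); // no matching department
        List<Map<String, Object>> departments = List.of(
                Map.of("dept_id", 1, "dept_name", "Engineering"));

        BiPredicate<Map<String, Object>, Map<String, Object>> onDeptId =
                (l, r) -> l.get("dept_id").equals(r.get("dept_id"));

        // Inner join keeps only matching rows (1); left outer also keeps Bo (2).
        System.out.println(join(employees, departments, onDeptId, "inner").size());
        System.out.println(join(employees, departments, onDeptId, "left_outer").size());
    }
}
```

In actual Spark code, the second argument would be a `Column` expression such as `employees.col("dept_id").equalTo(departments.col("dept_id"))`, and the third a join-type string like `"inner"` or `"left_outer"`.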
