
Joins, Unions, and Window Functions

Explore how to execute joins and unions between DataFrames or tables in Spark SQL. Understand different join types and practice manipulating string data for effective joins. Discover window functions to perform ranking and analytic operations within grouped data. This lesson helps you master advanced Spark SQL techniques for querying and analyzing structured big data.


Spark offers a rich set of built-in functions covering aggregation, collection, datetime, math, string, sorting, and windowing functionality. In this lesson, we'll see examples of joins, unions, and windowing.

Join

Joins can be performed between DataFrames or tables in Spark. By default, Spark executes an inner join, but it also supports cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti joins. Let's work through an example of executing an inner join on our Bollywood movies data set. We'll read the following two files and then join their respective DataFrames.

  1. BollywoodMovieDetail.csv (the file we have already been reading in the previous lessons). Its columns are:

     imdbId, title, releaseYear, releaseDate, genre, writers, actors, directors, sequel, hitFlop

  2. BollywoodActorRanking.csv. Its columns are:

     actorId, actorName, movieCount, ratingSum, normalizedMovieRank, googleHits, normalizedGoogleRank, normalizedRating

We'll want to join the two tables on the actor name columns. However, recall that the actors column in BollywoodMovieDetail.csv is a pipe-delimited string of names, so we'll first need to massage the data in this column to make it amenable to a join. We can then join it with the actorName column of BollywoodActorRanking.csv.
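To see what this massaging involves, here's a plain-Scala sketch (no Spark required) of the split, trim, and upper-case steps applied to one pipe-delimited actors value; the sample string is made up for illustration:

```scala
object SplitActorsDemo {
  def main(args: Array[String]): Unit = {
    // A made-up value in the same pipe-delimited shape as the actors column
    val actors = " Aamir Khan | Gracy Singh |Rachel Shelley"

    // Mirror the SQL split(actors, "[|]") followed by trim and upper on each token
    val normalized = actors.split("\\|").map(_.trim.toUpperCase)

    normalized.foreach(println)
    // AAMIR KHAN
    // GRACY SINGH
    // RACHEL SHELLEY
  }
}
```

The same three operations appear below as the Spark SQL functions split, trim, and upper.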

Additionally, we’ll switch between the DataFrame API and Spark SQL to demonstrate the flexibility of Spark. Read the listing below to see how the join is performed.


## Read the first file as a DataFrame
scala> val movies = spark.read.format("csv")
                         .option("header", "true")
                         .option("samplingRatio", 0.001)
                         .option("inferSchema", "true")
                         .load("/data/BollywoodMovieDetail.csv")


## Read the second file as a DataFrame
scala> val actors = spark.read.format("csv")
                         .option("header", "true")
                         .option("samplingRatio", 0.001)
                         .option("inferSchema", "true")
                         .load("/data/BollywoodActorRanking.csv")

## Register the movies DataFrame as a temporary view so we can reference it from Spark SQL
scala> movies.createOrReplaceTempView("movies")

## We create a table tempTbl1 that converts the pipe-delimited string of actor names into an array of string tokens
scala> spark.sql("""CREATE TABLE tempTbl1 AS SELECT title, releaseYear, hitFlop, split(actors,"[|]") AS actors FROM movies""")

## Next we create another table tempTbl2 that explodes the actors array in tempTbl1 so that we get a column with a single actor name per row. We'll later join this column with the actorName column in the other DataFrame. Also, notice that we trim the actor name of any leading or trailing spaces and convert it to upper case.
scala> spark.sql("""CREATE TABLE tempTbl2 AS SELECT title, releaseYear, hitFlop, upper(trim(actor)) AS actor FROM (SELECT title, releaseYear, hitFlop, explode(actors) AS actor FROM tempTbl1)""")

## Now we create a new DataFrame by running the trim and upper functions on the actorName column of the second DataFrame.
scala> val actorsDF = actors.withColumn("actorName",trim(upper($"actorName")))

## Create a DataFrame from tempTbl2
scala> val moviesDF = spark.sql("""SELECT * FROM tempTbl2""")

## Finally, we join on the columns containing the actor names and display 5 rows.
scala> moviesDF.join(actorsDF, $"actor" === $"actorName").show(5,false)

The result of executing the above queries is captured in the widget below:

Shell
scala> moviesDF.join(actorsDF, $"actor" === $"actorName").show(5,false)
+---------------------------------+-----------+-------+----------------------+-------+----------------------+----------+---------+-------------------+----------+--------------------+----------------+
|title |releaseYear|hitFlop|actor |actorId|actorName |movieCount|ratingSum|normalizedMovieRank|googleHits|normalizedGoogleRank|normalizedRating|
+---------------------------------+-----------+-------+----------------------+-------+----------------------+----------+---------+-------------------+----------+--------------------+----------------+
|Albela |2001 |2 |GOVINDA |413 |GOVINDA |17 |680 |2.9565200805664062 |1430000 |4.685649871826172 |4.78557014465332|
|Albela |2001 |2 |AISHWARYA RAI BACHCHAN|394 |AISHWARYA RAI BACHCHAN|16 |780 |3.8125 |1930000 |5.9755401611328125 |6.00856018066406|
|Albela |2001 |2 |JACKIE SHROFF |510 |JACKIE SHROFF |25 |740 |1.9391299486160278 |349000 |1.8969099521636963 |2.61633992195129|
|Albela |2001 |2 |NAMRATA SHIRODKAR |583 |NAMRATA SHIRODKAR |6 |140 |1.3260899782180786 |95400 |1.2426799535751343 |1.89409005641937|
|Lagaan: Once Upon a Time in India|2001 |6 |AAMIR KHAN |373 |AAMIR KHAN |11 |1170 |9.448619842529297 |2460000 |7.342830181121826 |10.0 |
+---------------------------------+-----------+-------+----------------------+-------+----------------------+----------+---------+-------------------+----------+--------------------+----------------+
only showing top 5 rows
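Conceptually, the inner join keeps only the row combinations whose normalized actor names match on both sides. A plain-Scala analogue over toy case classes makes the semantics visible (the data here is invented, not read from the CSVs):

```scala
object InnerJoinDemo {
  case class MovieRow(title: String, actor: String)
  case class ActorRow(actorName: String, movieCount: Int)

  def main(args: Array[String]): Unit = {
    val movies = Seq(MovieRow("Albela", "GOVINDA"), MovieRow("Albela", "SOME UNRANKED ACTOR"))
    val actors = Seq(ActorRow("GOVINDA", 17))

    // Inner join on actor name, like moviesDF.join(actorsDF, $"actor" === $"actorName"):
    // rows without a match on the other side are dropped
    val joined = for {
      m <- movies
      a <- actors
      if m.actor == a.actorName
    } yield (m.title, m.actor, a.movieCount)

    joined.foreach(println) // (Albela,GOVINDA,17)
  }
}
```

The unmatched "SOME UNRANKED ACTOR" row disappears from the output, which is exactly why inner-join results can have fewer rows than either input.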

Union

The union operation allows us to combine two DataFrames with the same schema. As a contrived example, consider finding all the movies except those released in the year 2010. We can create one DataFrame of movies released after 2010 and a second DataFrame of movies released before 2010, and then union the two to get our answer. (In the listing below we limit each DataFrame to two rows to keep the output short.) The listing appears below:

## Create a DataFrame of all the movies that were released after 2010
scala> val df1 = movies.select("title").where( $"releaseYear" > 2010).limit(2)
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [title: string]


## Create a DataFrame of all the movies that were released before 2010
scala> val df2 = movies.select("title").where( $"releaseYear" < 2010).limit(2)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [title: string]

## Union the two DataFrames
scala> df1.union(df2).show(false)
+---------------------------------+
|title                            |
+---------------------------------+
|Action Jackson                   |
|Bhopal: A Prayer for Rain        |
|Albela                           |
|Lagaan: Once Upon a Time in India|
+---------------------------------+
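One behavior worth noting: DataFrame.union does not remove duplicate rows; it behaves like SQL UNION ALL, and you would chain .distinct() to deduplicate. A plain-Scala sketch of the same semantics using sequence concatenation (toy titles):

```scala
object UnionDemo {
  def main(args: Array[String]): Unit = {
    val df1 = Seq("Action Jackson", "Bhopal: A Prayer for Rain")
    val df2 = Seq("Albela", "Action Jackson") // deliberately overlaps with df1

    // Concatenation keeps duplicates, just as DataFrame.union does
    val unioned = df1 ++ df2
    println(unioned.size)          // 4: the duplicate title is kept

    // The analogue of df1.union(df2).distinct()
    println(unioned.distinct.size) // 3: after deduplication
  }
}
```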

Windowing

Windowing functions allow us to answer questions that may span multiple individual queries or are hard to express without the windowing functions. There are three kinds of windowing functions, namely:

  1. Ranking Functions
  2. Analytic Functions
  3. Aggregate Functions

As an example, we'll work with the dense_rank() function, which is a ranking function. Suppose we want to write a query that outputs all the movies with the best or second-best rating for the year they were released in. Recall that the hitFlop column represents each movie's rating. In this case, we want to rank within a window of movies released in a given year, i.e., we aren't ranking across the entire data set but within a subset of it. Windowing functions operate on a group of rows, referred to as a window, and calculate a return value for each row based on that group. Consider the following query and its output:

scala> spark.sql("SELECT title, hitFlop, releaseYear, dense_rank() OVER (PARTITION BY releaseYear ORDER BY hitFlop DESC) AS rank FROM movies").show(3)
+-------+-------+-----------+----+
|  title|hitFlop|releaseYear|rank|
+-------+-------+-----------+----+
|Baghban|      6|       2003|   1|
|  Bhoot|      6|       2003|   1|
| Andaaz|      6|       2003|   1|
+-------+-------+-----------+----+
only showing top 3 rows

The three movies are all ranked number 1 within the bucket of movies released in 2003 because they tie with a hitFlop rating of 6. In contrast, if we execute the same query for the year 2013, the result is as follows:

scala> spark.sql("SELECT title, hitFlop, releaseYear, dense_rank() OVER (PARTITION BY releaseYear ORDER BY hitFlop DESC) AS rank FROM movies WHERE releaseYear=2013").show(3)
+---------------+-------+-----------+----+
|          title|hitFlop|releaseYear|rank|
+---------------+-------+-----------+----+
|       Dhoom: 3|      9|       2013|   1|
|       Krrish 3|      8|       2013|   2|
|Chennai Express|      8|       2013|   2|
+---------------+-------+-----------+----+
only showing top 3 rows

Here the movie with a hitFlop rating of 8 is ranked number 2, which makes sense since the ranking is restricted to the movies released in 2013. In the above two queries, we mark dense_rank() as a window function with the OVER clause. The clause following OVER is called the window specification; it describes to Spark what constitutes each window. Within the window specification, PARTITION BY is known as the partition specification and controls which rows land in the same partition as a given row. The ORDER BY specification controls how rows within a partition are ordered, determining the position of a given row in its partition. The third element of a window specification, not used in our queries, is the frame specification; it is used with other window functions.
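To make this behavior concrete outside Spark, here's a plain-Scala sketch that partitions toy (title, year, rating) rows by year and dense-ranks them by rating; the titles and ratings are illustrative:

```scala
object DenseRankDemo {
  def main(args: Array[String]): Unit = {
    // Toy rows: (title, releaseYear, hitFlop)
    val rows = Seq(
      ("Baghban", 2003, 6), ("Bhoot", 2003, 6), ("Andaaz", 2003, 6),
      ("Some Other Film", 2003, 5)
    )

    // PARTITION BY releaseYear: group the rows per year
    val ranked = rows.groupBy { case (_, year, _) => year }.values.flatMap { partition =>
      // ORDER BY hitFlop DESC with dense_rank: distinct ratings get consecutive
      // ranks starting at 1, and tied rows share the same rank
      val rankOf = partition.map(_._3).distinct.sorted(Ordering.Int.reverse).zipWithIndex.toMap
      partition.map { case (title, year, rating) => (title, year, rating, rankOf(rating) + 1) }
    }

    ranked.foreach(println) // the three tied rows all get rank 1; the rating-5 row gets rank 2
  }
}
```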

Going back to our original question, we can now find all the movies that ranked best or second best for their year of release as follows:

scala> spark.sql("SELECT * FROM (SELECT title, hitFlop, releaseYear, dense_rank() OVER (PARTITION BY releaseYear ORDER BY hitFlop DESC) AS rank FROM movies) tmp WHERE rank <= 2 ORDER BY releaseYear").show(5)
+--------------------+-------+-----------+----+
|               title|hitFlop|releaseYear|rank|
+--------------------+-------+-----------+----+
|Gadar: Ek Prem Katha|      9|       2001|   1|
|Kabhi Khushi Kabh...|      8|       2001|   2|
|Mere Yaar Ki Shaa...|      5|       2002|   2|
|              Devdas|      6|       2002|   1|
|Ek Chhotisi Love ...|      6|       2002|   1|
+--------------------+-------+-----------+----+
only showing top 5 rows
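A closely related ranking function is rank(), which, unlike dense_rank(), leaves gaps after ties. A plain-Scala comparison over a list of ratings already sorted in descending order (the values are illustrative):

```scala
object RankVsDenseRank {
  def main(args: Array[String]): Unit = {
    // Ratings already in ORDER BY hitFlop DESC order; the two 8s are a tie
    val ratings = Seq(9, 8, 8, 7)

    // rank(): tied rows share a rank, and the next distinct value skips ahead
    val rank = ratings.map(r => ratings.indexOf(r) + 1)

    // dense_rank(): tied rows share a rank with no gap afterward
    val denseRankOf = ratings.distinct.zipWithIndex.toMap
    val denseRank = ratings.map(r => denseRankOf(r) + 1)

    println(rank)      // List(1, 2, 2, 4)
    println(denseRank) // List(1, 2, 2, 3)
  }
}
```

Had we used rank() in the query above, a year with a three-way tie at the top would report no rank-2 movies at all, so dense_rank() is the right choice for "best or second best."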

All the queries and commands used in this lesson are reproduced in the widget below for easy copy/pasting into the terminal.

Shell
# Join example
val movies = (spark.read.format("csv")
.option("header", "true")
.option("samplingRatio", 0.001)
.option("inferSchema", "true")
.load("/data/BollywoodMovieDetail.csv"))
val actors = (spark.read.format("csv")
.option("header", "true")
.option("samplingRatio", 0.001)
.option("inferSchema", "true")
.load("/data/BollywoodActorRanking.csv"))
spark.sql("""CREATE OR REPLACE TEMPORARY VIEW movies USING csv OPTIONS (path "/data/BollywoodMovieDetail.csv", header "true", inferSchema "true", mode "FAILFAST")""")
spark.sql("""CREATE TABLE tempTbl1 AS SELECT title, releaseYear, hitFlop, split(actors,"[|]") AS actors FROM movies""")
spark.sql("""CREATE TABLE tempTbl2 AS SELECT title, releaseYear, hitFlop, upper(trim(actor)) AS actor FROM (SELECT title, releaseYear, hitFlop, explode(actors) AS actor FROM tempTbl1)""")
val actorsDF = actors.withColumn("actorName",trim(upper($"actorName")))
val moviesDF = spark.sql("""SELECT * FROM tempTbl2""")
moviesDF.join(actorsDF, $"actor" === $"actorName").show(5,false)
# Union example
val df1 = movies.select("title").where( $"releaseYear" > 2010).limit(2)
val df2 = movies.select("title").where( $"releaseYear" < 2010).limit(2)
df1.union(df2).show(false)
# Window function example
spark.sql("SELECT title, hitFlop, releaseYear, dense_rank() OVER (PARTITION BY releaseYear ORDER BY hitFlop DESC) AS rank FROM movies").show(3)
spark.sql("SELECT title, hitFlop, releaseYear, dense_rank() OVER (PARTITION BY releaseYear ORDER BY hitFlop DESC) AS rank FROM movies WHERE releaseYear=2013").show(3)
spark.sql("SELECT * FROM (SELECT title, hitFlop, releaseYear, dense_rank() OVER (PARTITION BY releaseYear ORDER BY hitFlop DESC) AS rank FROM movies) tmp WHERE rank <= 2 ORDER BY releaseYear").show(5)

We can now begin to appreciate the versatility and breadth of Spark SQL, given the span of built-in and higher-order functions it offers.