Joins, Unions, and Window Functions
Explore how to execute joins and unions between DataFrames or tables in Spark SQL. Understand different join types and practice manipulating string data for effective joins. Discover window functions to perform ranking and analytic operations within grouped data. This lesson helps you master advanced Spark SQL techniques for querying and analyzing structured big data.
Spark offers more involved and complex functions that support aggregation, collection, datetime, math, string, sorting, windowing, etc., functionality. We’ll see examples of joins, unions and windowing in this lesson.
Join
Joins can be performed between DataFrames or tables in Spark. By default, Spark executes an inner join between tables but has support for cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti joins. Let’s work an example of executing an inner join on our Bollywood movies data set. We’ll read the following two files and then join their respective DataFrames.
- BollywoodMovieDetail.csv (the file we have already been reading in the previous lessons). The column names appear below:
| imdbId | title | releaseYear | releaseDate | genre | writers | actors | directors | sequel | hitFlop |
|---|---|---|---|---|---|---|---|---|---|
| X |
- BollywoodActorRanking.csv. The column names appear below:
| actorId | actorName | movieCount | ratingSum | normalizedMovieRank | googleHits | normalizedGoogleRank | normalizedRating |
|---|---|---|---|---|---|---|---|