Spark SQL Goodness

An introduction to Spark SQL, the Spark functionality that lets developers query DataFrames the way they would query a relational database.

SQL in SparkSQL

Structured Query Language (abbreviated SQL) has been the gold standard for manipulating data for many years. It is a powerful and widely used tool, so much so that state-of-the-art cloud services (AWS S3, among others) provide SQL-like functionality to inspect and retrieve data. It also offers a human-readable syntax, and its learning curve is not too steep.

These strengths made SQL a reasonable choice to embed in the SparkSQL module, all the more so because the data Spark works with is often structured or semi-structured.

This lesson focuses on teaching how to use SparkSQL to make our lives easier as Big Data apprentices.

If your SQL knowledge is a bit rusty, check out some of the great courses here on educative.io.

A practical introduction to SparkSQL

A SQL relational database can expose a view of a table to any application that needs to interact with the data source. SparkSQL follows a similar approach, with one difference: it requires a view as the main entry point for querying a DataFrame.

The diagram below shows how a view acts as an abstraction on top of the DataFrame when we interact with it through SparkSQL:
