...

Spark SQL Views and Tables

Get an introduction to Spark SQL views and tables.

We'll cover the following...

Managed vs unmanaged tables
Views
Catalog

In the previous lesson, we created a temporary view in Spark. We can also create a table using Spark SQL. Spark uses Apache Hive to persist metadata like the schema, description, table name, database name, column names, partitions, or physical location for tables created by users. In case Hive isn’t configured, Spark uses Hive’s embedded deployment mode, which employs Apache Derby as the underlying database. When we start the spark-shell without Hive configuration, the spark-shell creates metastore_db and warehouse directories in the current directory. We’ll see these directories when we work the terminal at the end of this lesson.

There are two configuration settings related to Hive. The configuration property spark.sql.warehouse.dir specifies the location of the Hive metastore warehouse, also known as the spark-warehouse directory. This is the location where Spark SQL persists tables. The second is the location of the Hive metastore, also known as the metastore_db, which is a relational database to manage the metadata of the persistent relational entities, such as databases, tables, columns, and partitions.

Managed vs unmanaged tables

In Spark, we can create two types of tables:

Managed: With managed tables, Spark is responsible for managing both the data and the metadata related to the table. If the user deletes a managed table, then Spark deletes both the data and the metadata for the table.
Unmanaged: With unmanaged tables, Spark is only responsible for managing the metadata of the table while the user has the onus of managing the table’s data in an external data source. If the user deletes the table, only the metadata for the table is deleted and not the actual data for the ...

Spark Overview

DataFrames

Datasets

Spark SQL

Summary

Spark SQL Views and Tables

Managed vs unmanaged tables