Spark SQL Views and Tables
Get an introduction to Spark SQL views and tables.
In the previous lesson, we created a temporary view in Spark. We can also create a table using Spark SQL. Spark uses Apache Hive to persist metadata such as the schema, description, table name, database name, column names, partitions, and physical location for tables created by users. If Hive isn't configured, Spark falls back to Hive's embedded deployment mode, which uses Apache Derby as the underlying database. When we start the spark-shell without a Hive configuration, the spark-shell creates the metastore_db and warehouse directories in the current directory. We'll see these directories when we work in the terminal at the end of this lesson.
There are two configuration settings related to Hive. The first, the configuration property spark.sql.warehouse.dir, specifies the location of the Hive metastore warehouse, also known as the spark-warehouse directory; this is where Spark SQL persists tables. The second is the location of the Hive metastore, also known as the metastore_db: a relational database that manages the metadata of persistent relational entities such as databases, tables, columns, and partitions.
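In a standalone application, the warehouse property must be set before the SparkSession is created (the spark-shell fixes it at startup). A minimal sketch, where the app name and the path /tmp/my-warehouse are hypothetical values chosen for illustration:

import org.apache.spark.sql.SparkSession

// Configure a custom warehouse directory before the session is created.
// "/tmp/my-warehouse" is a hypothetical path used for illustration.
val session = SparkSession.builder()
  .appName("warehouse-demo")
  .config("spark.sql.warehouse.dir", "/tmp/my-warehouse")
  .getOrCreate()

When launching the spark-shell, the same setting can be passed on the command line with --conf spark.sql.warehouse.dir=/tmp/my-warehouse.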
Managed vs unmanaged tables
In Spark, we can create two types of tables:
- Managed: With managed tables, Spark is responsible for managing both the data and the metadata of the table. If the user deletes a managed table, Spark deletes both the data and the metadata.
- Unmanaged: With unmanaged tables, Spark is only responsible for managing the metadata, while the user has the onus of managing the table's data in an external data source. If the user deletes the table, only the metadata is deleted, not the actual data. The sketch after this list illustrates the difference at drop time.
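A minimal sketch of the two drop behaviors (the table names are hypothetical):

scala> // Dropping a managed table removes both the metadata and the data files.
scala> spark.sql("DROP TABLE someManagedTable")
scala> // Dropping an unmanaged (external) table removes only the metadata;
scala> // the files at the external path remain untouched.
scala> spark.sql("DROP TABLE someUnmanagedTable")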
Let's see how we can create both. If we don't specify a database, Spark uses the database default. We'll start by creating the database spark_course.
scala> spark.sql("CREATE DATABASE spark_course")
scala> spark.sql("USE spark_course")
scala> spark.sql("SHOW TABLES").show(5, false)
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
We have no tables in the database spark_course yet. We'll create a table movieShortDetail as follows:
scala> val movies = spark.read.format("csv").option("header", "true").option("samplingRatio", 0.001).option("inferSchema", "true").load("/data/BollywoodMovieDetail.csv")
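Before creating the table, we can sanity-check the schema Spark inferred from the CSV file:

scala> movies.printSchema()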
scala> spark.sql("CREATE TABLE movieShortDetail(imdbID String, title String)")
scala> spark.sql("SHOW TABLES").show(5, false)
+------------+----------------+-----------+
|database |tableName |isTemporary|
+------------+----------------+-----------+
|spark_course|movieshortdetail|false |
+------------+----------------+-----------+
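We can inspect the columns and types of the new table with a DESCRIBE query:

scala> spark.sql("DESCRIBE TABLE movieShortDetail").show(false)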
We can also create a table using the DataFrame API as follows:
scala> movies.write.saveAsTable("movieShortDetailUsingDataFrame")
scala> spark.sql("SHOW TABLES").show(5, false)
+------------+------------------------------+-----------+
|database |tableName |isTemporary|
+------------+------------------------------+-----------+
|spark_course|movieshortdetail |false |
|spark_course|movieshortdetailusingdataframe|false |
+------------+------------------------------+-----------+
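By default, saveAsTable persists the data in Parquet format. We can make the format and the save mode explicit; a sketch with a hypothetical table name:

scala> movies.write.mode("overwrite").format("parquet").saveAsTable("movieShortDetailParquet")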
We can create unmanaged tables in Spark by reading data from our own sources such as Parquet, CSV, or JSON files. For instance:
scala> spark.sql("CREATE TABLE movieShortDetailUnmanaged (imdbID STRING, title STRING) USING csv OPTIONS (PATH '/data/BollywoodMovieDetail.csv')")
Using the DataFrame API, we can create an equivalent unmanaged table by supplying a path option:
scala> movies.write.option("path","/data/shortMovieDetail.csv").saveAsTable("movieShortDetailUsingDataFrameUnmanaged")
We can list all the tables using spark.catalog.listTables():
scala> spark.catalog.listTables().show(5, false)
+---------------------------------------+------------+-----------+---------+-----------+
|name                                   |database    |description|tableType|isTemporary|
+---------------------------------------+------------+-----------+---------+-----------+
|movieshortdetail                       |spark_course|null       |MANAGED  |false      |
|movieshortdetailunmanaged              |spark_course|null       |EXTERNAL |false      |
|movieshortdetailusingdataframe         |spark_course|null       |MANAGED  |false      |
|movieshortdetailusingdataframeunmanaged|spark_course|null       |EXTERNAL |false      |
+---------------------------------------+------------+-----------+---------+-----------+
Similarly, we can list the columns of the table as follows:
scala> spark.catalog.listColumns("movieshortdetail").show(10, false)
+------+-----------+--------+--------+-----------+--------+
|name |description|dataType|nullable|isPartition|isBucket|
+------+-----------+--------+--------+-----------+--------+
|imdbID|null |string |true |false |false |
|title |null |string |true |false |false |
+------+-----------+--------+--------+-----------+--------+
We can list the databases as follows:
scala> spark.catalog.listDatabases().show(10, false)
+------------+---------------------+-------------------------------------+
|name |description |locationUri |
+------------+---------------------+-------------------------------------+
|default |Default Hive database|file:/spark-warehouse |
|spark_course| |file:/spark-warehouse/spark_course.db|
+------------+---------------------+-------------------------------------+
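The catalog also tracks the session's current database, which we can read or switch programmatically:

scala> spark.catalog.currentDatabase
scala> spark.catalog.setCurrentDatabase("spark_course")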
Views
Views can be created on top of existing tables. Views are of two types:
- Global views: visible across all SparkSessions on a given cluster. An application may need to access and combine data from multiple SparkSessions with different Hive metastore configurations.
- Session-scoped views: visible only to a single SparkSession.
Views don't hold the actual data, and temporary views disappear once the Spark application terminates.
scala> movies.write.saveAsTable("movies")
scala> spark.sql("CREATE OR REPLACE TEMP VIEW high_rated_movies AS SELECT title FROM movies WHERE hitFlop > 7")
scala> spark.catalog.listTables()
res11: org.apache.spark.sql.Dataset[org.apache.spark.sql.catalog.Table] = [name: string, database: string ... 3 more fields]
scala> spark.catalog.listTables().show(5,false)
+-----------------+------------+-----------+---------+-----------+
|name |database |description|tableType|isTemporary|
+-----------------+------------+-----------+---------+-----------+
|movies |spark_course|null |MANAGED |false |
|high_rated_movies|null |null |TEMPORARY|true |
+-----------------+------------+-----------+---------+-----------+
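The DataFrame API offers an equivalent of the SQL statement above; the view name high_rated_movies_df is hypothetical:

scala> movies.where("hitFlop > 7").select("title").createOrReplaceTempView("high_rated_movies_df")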
We can create a global view as follows:
scala> spark.sql("CREATE OR REPLACE GLOBAL TEMP VIEW high_rated_movies_global AS SELECT title FROM movies WHERE hitFlop > 7")
scala> spark.sql("SELECT * FROM global_temp.high_rated_movies_global").show(3, false)
+---------------------------+
|title |
+---------------------------+
|Kabhi Khushi Kabhie Gham...|
|Gadar: Ek Prem Katha |
|Krrish |
+---------------------------+
only showing top 3 rows
Global views are saved in the global_temp database, and we need to prefix global temporary views with global_temp.<view_name> in SQL queries. Similar to tables, views can be dropped too.
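For example, we can drop the two views created above through the Catalog API:

scala> spark.catalog.dropTempView("high_rated_movies")
scala> spark.catalog.dropGlobalTempView("high_rated_movies_global")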
SQL tables and views can be cached and uncached like DataFrames.
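For instance, using SQL:

scala> spark.sql("CACHE TABLE movies")
scala> spark.sql("UNCACHE TABLE movies")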
Catalog
Spark manages the metadata for both managed and unmanaged tables. We can access the metadata of not only tables but also databases and views using a high-level abstraction known as the Catalog.
Some examples include:
spark.catalog.listDatabases()
spark.catalog.listTables()
spark.catalog.listColumns("movies")
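The Catalog can also answer point queries about the metadata; for example, we can check whether a table exists:

scala> spark.catalog.tableExists("movies")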