Writing and Reading Delta Tables
Explore how to write and read Delta tables in Databricks, focusing on managed and unmanaged tables. Learn to create tables, query them with both PySpark and SQL, and understand common pitfalls to avoid. This lesson builds essential skills for working with reliable, versioned data storage using Delta Lake.
Delta Lake makes it easy to store data reliably. You can write, read, and update tables without worrying about corrupted data or inconsistent formats.
Writing Delta tables: Managed vs. unmanaged
When saving data in Delta format, you have two main options:
Managed table: Databricks manages both the table metadata and the underlying data. Dropping the table deletes all data files. Metadata is data about your data, such as table name, column types, and storage location, stored separately from the actual rows.
Unmanaged table: You provide a location for the table files. Dropping the table only deletes the metadata; the files remain in DBFS or cloud storage.
Understanding this distinction matters because it affects how data is backed up, deleted, and shared across teams. Keep these two types in mind as you progress; in real projects, you'll encounter both. For now, we'll focus on managed tables since they work reliably in the free Community Edition of Databricks. We'll introduce unmanaged tables conceptually so that you recognize them when you see them.
Managed tables are beginner-friendly since Databricks handles storage automatically. Unmanaged tables are useful when you want more control or to share data between multiple workspaces.
Writing a managed Delta table
Now we will save the df DataFrame from the previous lesson as a managed Delta table. This is the simplest way to start using Delta tables in production.
df.write.format("delta").mode("overwrite").saveAsTable("sales")
This code runs in a Databricks notebook. The screenshot below shows that the table is successfully created in the catalog.
Code explanation:
df.write starts the write operation.
.format("delta") saves the data in Delta format.
Under the hood, Delta tables are stored as Parquet files plus a transaction log folder called _delta_log. This log is what gives Delta its reliability features; without it, you just have a regular Parquet file.
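To build intuition for how the transaction log works, here is a minimal pure-Python sketch. This is not Delta's real implementation; the JSON contents are simplified assumptions. It only illustrates the idea that each write becomes the next numbered commit file in `_delta_log`, the way real Delta tables accumulate `00000000000000000000.json`, `00000000000000000001.json`, and so on:

```python
import json
import os
import tempfile

def commit(log_dir, action):
    """Record one write as the next numbered JSON commit file."""
    version = len(os.listdir(log_dir))   # next commit number
    name = f"{version:020d}.json"        # Delta pads versions to 20 digits
    with open(os.path.join(log_dir, name), "w") as f:
        json.dump(action, f)
    return name

log_dir = tempfile.mkdtemp()             # stands in for _delta_log
commit(log_dir, {"add": "part-0000.parquet"})   # first write
commit(log_dir, {"add": "part-0001.parquet"})   # a later write

print(sorted(os.listdir(log_dir)))
# ['00000000000000000000.json', '00000000000000000001.json']
```

Readers replay these commits in order to reconstruct the current table state, which is how Delta delivers consistent reads even while writes are in progress.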
.mode("overwrite") replaces the table if it already exists; if the table doesn't exist yet, it creates a new one.
.saveAsTable("sales") saves the table as a managed table in the Databricks catalog and names it sales. The Databricks catalog is the central registry that tracks all your tables, like a library index for your data.
Managed tables appear in the "Tables" section of the Databricks workspace. You can query them immediately without specifying a file path.
Writing an unmanaged Delta table
In full Databricks or cloud versions, you can also save tables in custom paths (unmanaged tables).
df.write.format("delta").mode("overwrite").save("/mnt/delta/sales_unmanaged")
Important note: In the Databricks Free Edition, access to legacy DBFS paths such as /FileStore may be disabled. Attempting to read or write data there can produce a “Public DBFS root is disabled” error. Modern Databricks environments encourage using managed tables or Unity Catalog volumes instead. To keep the examples simple and compatible with the free environment, this course will primarily use managed tables.
Code explanation:
.save("/mnt/delta/...") writes the table files to the specified location. Metadata is stored separately from the files, so deleting the table won't remove the underlying data.
Use unmanaged tables if you want to share the files across workspaces or control where the data is stored.
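The managed/unmanaged difference comes down to what dropping a table removes. Here is a tiny pure-Python sketch of that idea, with a dict standing in for the catalog and ordinary files standing in for table data; the function names are hypothetical, not the Databricks API:

```python
import os
import tempfile

catalog = {}  # toy catalog: table name -> {"path": ..., "managed": bool}

def create_table(name, path, managed):
    catalog[name] = {"path": path, "managed": managed}

def drop_table(name):
    entry = catalog.pop(name)              # the metadata always goes away
    if entry["managed"]:                   # managed: the data files go too
        for f in os.listdir(entry["path"]):
            os.remove(os.path.join(entry["path"], f))
        os.rmdir(entry["path"])

# Unmanaged table: we own the path; the catalog only tracks metadata
ext_path = tempfile.mkdtemp()
open(os.path.join(ext_path, "part-0000.parquet"), "w").close()
create_table("sales_unmanaged", ext_path, managed=False)

drop_table("sales_unmanaged")
print("sales_unmanaged" in catalog)   # False: metadata is gone
print(os.listdir(ext_path))           # ['part-0000.parquet']: data survives
```

With `managed=True`, the same `drop_table` call would also delete the files, which is exactly why dropping a managed table in Databricks is destructive while dropping an unmanaged one is not.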
Reading Delta tables using PySpark
After writing a Delta table, the next step is to read it back into a DataFrame for analysis or transformation. In Databricks, the simplest way to read a managed table is by using its table name from the catalog.
sales_df = spark.read.table("sales")
sales_df.show()
This code loads the sales Delta table as a Spark DataFrame so you can perform further transformations, filtering, or analysis using PySpark.
This code runs in a Databricks notebook. The screenshot attached shows the data in the table that we created.
Code explanation:
spark.read starts a read operation.
.table("sales") loads the Delta table named sales from the catalog.
.show() displays the contents of the DataFrame.
Because the table is stored in Delta format, Spark automatically reads both the data files and the Delta transaction log, ensuring the table is read consistently.
Reading by table name is common when working with managed tables. You do not need to know the underlying file path.
Querying Delta tables using SQL
Delta tables can be queried with SQL directly in a notebook. This is useful for analysts who don’t want to use Python.
%sql
-- Query the managed Delta table
SELECT * FROM sales;
This SQL runs in a Databricks notebook SQL cell. The attached screenshot shows the query output.
Explanation:
%sql indicates a SQL cell.
SELECT * FROM sales retrieves all rows from the Delta table.
You can mix Python and SQL queries in the same notebook to leverage the strengths of both languages.
Appending data to Delta tables
Sometimes you need to add new records without overwriting existing data. In full Databricks environments, you can usually add new rows to an existing Delta table using .mode("append").
# Append new data to the Delta table
new_data = [("Charlie", 300)]
new_df = spark.createDataFrame(new_data, ["name", "sales"])
new_df.write.format("delta").mode("append").saveAsTable("sales")

# Read back to confirm
sales_df = spark.read.table("sales")
sales_df.show()
Important note: In the Community Edition, managed Delta tables cannot be appended to. Trying to append data (even with the same schema) will result in errors like [DELTA_FAILED_TO_MERGE_FIELDS]. For this course, we will focus only on creating, reading, and querying managed Delta tables, which work reliably in the free edition of Databricks.
The above code does the following:
.mode("append") tells Spark to add the new rows after the existing ones instead of replacing them; the original data stays untouched.
spark.read.table("sales") reads the full sales table, including both the original rows and the newly appended one, back into a DataFrame, and sales_df.show() displays it.
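If the difference between the two write modes is ever unclear, this pure-Python sketch (a toy model, not Spark) captures the semantics: overwrite replaces a table's rows, append adds to them:

```python
tables = {}  # toy catalog: table name -> list of rows

def write(name, rows, mode):
    if mode == "overwrite" or name not in tables:
        tables[name] = list(rows)        # replace (or create) the table
    elif mode == "append":
        tables[name].extend(rows)        # keep old rows, add the new ones
    else:
        raise ValueError(f"unknown mode: {mode}")

write("sales", [("Alice", 100), ("Bob", 200)], mode="overwrite")
write("sales", [("Charlie", 300)], mode="append")
print(tables["sales"])
# [('Alice', 100), ('Bob', 200), ('Charlie', 300)]

write("sales", [("Dana", 400)], mode="overwrite")
print(tables["sales"])
# [('Dana', 400)] -- the earlier rows are gone
```

The last call shows why accidentally using overwrite when you meant append is one of the costliest beginner mistakes: the previous contents are simply replaced.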
Appending data is safe even when multiple analysts write to the same table at the same time because Delta Lake uses ACID transactions.
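The safety of concurrent appends rests on Delta's optimistic concurrency control: each writer tries to create the next log version, only one writer can win a given version, and the loser retries on top of the winner's commit. This pure-Python sketch is a simplified model of that idea (the helper names are hypothetical, not Delta's actual protocol):

```python
log = {}  # version number -> committed (writer, rows)

def try_commit(version, writer, rows):
    """A commit succeeds only if this version number is still free."""
    if version in log:
        return False          # someone else committed this version first
    log[version] = (writer, rows)
    return True

# Both writers read the table "at the same time" and see version 0
seen_a = len(log)
seen_b = len(log)

ok_a = try_commit(seen_a, "analyst_a", [("Charlie", 300)])  # wins version 0
ok_b = try_commit(seen_b, "analyst_b", [("Dana", 400)])     # conflict: loses

# The loser re-reads the log and retries on the new version
ok_b_retry = try_commit(len(log), "analyst_b", [("Dana", 400)])  # wins version 1

print(ok_a, ok_b, ok_b_retry)   # True False True
print(sorted(log))              # [0, 1] -- both appends landed, nothing lost
```

Because the retry replays the append on top of the other writer's commit, both sets of rows end up in the table and neither writer silently overwrites the other.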
Common beginner mistakes
As you start writing Delta tables, watch out for these pitfalls that catch most beginners early on:
Forgetting .format("delta"): If you do not specify .format("delta") when writing a DataFrame, Spark saves the data as regular Parquet files instead of a Delta table, so you lose Delta features like ACID transactions, time travel, and schema enforcement.
Using overwrite instead of append: If you use .mode("overwrite") when your goal is to add new data, all the existing data in the table is replaced and lost. Always double-check your write mode before running the operation.
Reading a table before it exists: Trying to read a Delta table that has not been created yet raises an AnalysisException because Spark cannot find the table in the catalog. Make sure the table exists before attempting to read it.
Confusing managed and unmanaged paths: Managed tables are fully controlled by Databricks, while unmanaged tables require you to provide a storage path. Mixing them up can lead to confusion, unexpected behavior, or errors, especially in environments like Databricks Community Edition, where unmanaged paths are restricted.