
Writing and Reading Delta Tables

Explore how to write and read Delta tables in Databricks, focusing on managed and unmanaged tables. Learn to create tables, query them with both PySpark and SQL, and understand common pitfalls to avoid. This lesson builds essential skills for working with reliable, versioned data storage using Delta Lake.

Delta Lake makes it easy to store data reliably. You can write, read, and update tables without worrying about corrupted data or inconsistent formats.

Writing Delta tables: Managed vs. unmanaged

When saving data in Delta format, you have two main options:

  1. Managed table: Databricks manages both the table metadata (data about your data, such as the table name, column types, and storage location, stored separately from the actual rows) and the underlying data. Dropping the table deletes all data files.

  2. Unmanaged table: You provide a location for the table files. Dropping the table only deletes the metadata; the files remain in DBFS or cloud storage.

Understanding this distinction matters because it affects how data is backed up, deleted, and shared across teams. You'll encounter both types in real projects. For now, we'll focus on managed tables, since they work reliably in the free Community Edition of Databricks, and introduce unmanaged tables conceptually so you recognize them when you see them.

Managed tables are beginner-friendly since Databricks handles storage automatically. Unmanaged tables are useful when you want more control or to share data between multiple workspaces.

Writing a managed Delta table

Now we will save the df DataFrame from the previous lesson as a managed Delta table. This is the simplest way to start using Delta tables in production.

df.write.format("delta").mode("overwrite").saveAsTable("sales")
Save DataFrame as a managed Delta table

This code runs in a Databricks notebook. The screenshot below shows that the table is successfully created in the catalog.

DBFS view showing the default folder where the Delta table is stored

Code explanation:

  • df.write starts the write operation.

  • .format("delta") saves the data in Delta format.

Under the hood, a Delta table is stored as Parquet files plus a transaction log folder called _delta_log. This log is what gives Delta its reliability features; without it, you just have ordinary Parquet files.

  • .mode("overwrite") replaces the table if it already exists. If the table doesn't exist yet, it creates a new one.

  • .saveAsTable("sales") registers the table as a managed table named sales in the Databricks catalog. The catalog is the central registry that tracks all your tables, like a library index for your data.

Managed tables appear in the "Tables" section of the Databricks workspace. You can query them immediately without specifying a file path.

Writing an unmanaged Delta table

In full Databricks or cloud versions, you can also save tables in custom paths (unmanaged tables).

df.write.format("delta").mode("overwrite").save("/mnt/delta/sales_unmanaged")
Save DataFrame as an unmanaged Delta table

Important note: In the Databricks Free Edition, access to legacy DBFS paths such as /FileStore may be disabled. Attempting to read or write data there can produce a “Public DBFS root is disabled” error. Modern Databricks environments encourage using managed tables or Unity Catalog volumes instead. To keep the examples simple and compatible with the free environment, this course will primarily use managed tables.

Code explanation:

  • .save("/mnt/delta/...") writes the table files to a specified location.

  • Metadata is separate from the files. Deleting the table won’t remove the underlying data.

Use unmanaged tables if you want to share the files across workspaces or control where the data is stored.
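If Delta files already exist at a known path, you can also register them in the catalog as an unmanaged (external) table using SQL. The path below is illustrative and assumes an environment where that storage location is accessible:

```sql
%sql
-- Register existing Delta files as an unmanaged (external) table
CREATE TABLE sales_unmanaged
USING DELTA
LOCATION '/mnt/delta/sales_unmanaged';
```

Dropping this table later removes only the catalog entry; the files at the location stay in place.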

Reading Delta tables using PySpark

After writing a Delta table, the next step is to read it back into a DataFrame for analysis or transformation. In Databricks, the simplest way to read a managed table is by using its table name from the catalog.

sales_df = spark.read.table("sales")
sales_df.show()
Reading Delta tables using PySpark

This code loads the sales Delta table as a Spark DataFrame so you can perform further transformations, filtering, or analysis using PySpark.

This code runs in a Databricks notebook. The screenshot below shows the data in the table that we created.

Notebook showing the data of the Delta table we have created

Code explanation:

  • spark.read starts a read operation.

  • .table("sales") loads the Delta table named sales from the catalog.

  • .show() displays the contents of the DataFrame.

Because the table is stored in Delta format, Spark automatically reads both the data files and the Delta transaction log, ensuring the table is read consistently.

Reading by table name is common when working with managed tables. You do not need to know the underlying file path.
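If a Delta table is not registered in the catalog, Spark can still read it straight from its storage path, either with spark.read.format("delta").load("/mnt/delta/sales_unmanaged") in Python or with SQL's delta.`path` syntax. The path below is illustrative:

```sql
%sql
-- Query Delta files directly by path, without a catalog entry
SELECT * FROM delta.`/mnt/delta/sales_unmanaged`;
```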

Querying Delta tables using SQL

Delta tables can be queried with SQL directly in a notebook. This is useful for analysts who don’t want to use Python.

%sql
-- Query the managed Delta table
SELECT * FROM sales;
Querying the managed Delta table using SQL

This SQL runs in a Databricks notebook SQL cell. The screenshot below shows the query output.

Querying the Delta tables using SQL

Explanation:

  • %sql indicates a SQL cell.

  • SELECT * FROM sales retrieves all rows from the Delta table.

You can mix Python and SQL queries in the same notebook to leverage the strengths of both languages.

Appending data to Delta tables

Sometimes you need to add new records without overwriting existing data. In full Databricks environments, you can usually add new rows to an existing Delta table using .mode("append").

# Append new data to the Delta table
new_data = [("Charlie", 300)]
new_df = spark.createDataFrame(new_data, ["name", "sales"])
new_df.write.format("delta").mode("append").saveAsTable("sales")
# Read back to confirm
sales_df = spark.read.table("sales")
sales_df.show()
Appending data to Delta tables

Important note: In the Community Edition, managed Delta tables cannot be appended to. Trying to append data (even with the same schema) will result in errors like [DELTA_FAILED_TO_MERGE_FIELDS]. For this course, we will focus only on creating, reading, and querying managed Delta tables, which work reliably in the free edition of Databricks.

The above code does the following:

  • .mode("append") tells Spark to add the new rows to the existing table instead of replacing it. The original data stays untouched.

  • spark.read.table() reads the full sales table, including both the original rows and the newly appended one, back into a DataFrame, which sales_df.show() then displays.

Appending data is safe even when multiple analysts write to the same table at the same time because Delta Lake uses ACID transactions.
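Because every write and append is recorded as a transaction in the _delta_log folder, you can ask a Delta table for its history in a SQL cell. Each committed write appears as a numbered version:

```sql
%sql
-- Show the table's transaction history (one row per committed write)
DESCRIBE HISTORY sales;
```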

Common beginner mistakes

As you start writing Delta tables, watch out for these pitfalls that often catch beginners:

  1. Forgetting .format("delta"): If you do not specify .format("delta") when writing a DataFrame, Spark will save the data as a regular Parquet file instead of a Delta table, which means you won’t be able to use Delta features like ACID transactions, time travel, or schema enforcement.

  2. Using overwrite instead of append: If you use .mode("overwrite") when your goal is to add new data, all the existing data in the table will be replaced and lost. Always double-check your write mode before running the operation.

  3. Reading a table before it exists: Trying to read a Delta table that has not been created yet will result in an AnalysisException because Spark cannot find the table in the catalog. Make sure the table exists before attempting to read it.

  4. Confusing managed and unmanaged paths: Managed tables are fully controlled by Databricks, while unmanaged tables require you to provide a storage path. Mixing them up can lead to confusion, unexpected behavior, or errors, especially in environments like Databricks Community Edition, where unmanaged paths are restricted.