Writing and Reading Delta Tables
Explore how to write and read Delta tables in Databricks, focusing on managed and unmanaged tables. Learn to create tables, query them with both PySpark and SQL, and understand common pitfalls to avoid. This lesson builds essential skills for working with reliable, versioned data storage using Delta Lake.
Delta Lake makes it easy to store data reliably. You can write, read, and update tables without worrying about corrupted data or inconsistent formats.
Writing Delta tables: Managed vs. unmanaged
When saving data in Delta format, you have two main options:
Managed table: Databricks manages both the table metadata and the underlying data. Dropping the table deletes all data files. Metadata is data about your data, such as table name, column types, and storage location, stored separately from the actual rows.
Unmanaged table: You provide a location for the table files. Dropping the table only deletes the metadata; the files remain in DBFS or cloud storage.
Understanding this distinction matters because it affects how data is backed up, deleted, and shared across teams. Keep these two types in mind as you progress; in real projects, you'll encounter both. For now, we'll focus on managed tables since they work reliably in the free Community Edition of Databricks. We'll introduce unmanaged tables conceptually so that you recognize them when you see them.
Managed tables are beginner-friendly since Databricks handles storage automatically. Unmanaged tables are useful when you want more control or to share data between multiple workspaces.
Writing a managed Delta table
Now we will save the df DataFrame from the previous lesson as a managed Delta table. This is the simplest way to start using Delta tables in production.
df.write.format("delta").mode("overwrite").saveAsTable("sales")
This code runs in a Databricks notebook. The screenshot below shows that the table is successfully created in the catalog.
Code explanation:
df.write starts the write operation.
.format("delta") saves the data in Delta format.
Under the hood, Delta tables are stored as Parquet files plus a transaction log folder called _delta_log. This log is what gives Delta its reliability features; without it, you just have a regular Parquet file.
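To build intuition for how the transaction log works, here is a minimal pure-Python sketch. This is not Delta's real implementation; the JSON contents are simplified assumptions. It only illustrates the idea that each write becomes the next numbered commit file in `_delta_log`, the way real Delta tables accumulate `00000000000000000000.json`, `00000000000000000001.json`, and so on:

```python
import json
import os
import tempfile

def commit(log_dir, action):
    """Record one write as the next numbered JSON commit file."""
    version = len(os.listdir(log_dir))   # next commit number
    name = f"{version:020d}.json"        # Delta pads versions to 20 digits
    with open(os.path.join(log_dir, name), "w") as f:
        json.dump(action, f)
    return name

log_dir = tempfile.mkdtemp()             # stands in for _delta_log
commit(log_dir, {"add": "part-0000.parquet"})   # first write
commit(log_dir, {"add": "part-0001.parquet"})   # a later write

print(sorted(os.listdir(log_dir)))
# ['00000000000000000000.json', '00000000000000000001.json']
```

Readers replay these commits in order to reconstruct the current table state, which is how Delta delivers consistent reads even while writes are in progress.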
.mode("overwrite") replaces the table if it already exists; if the table doesn't exist yet, it creates a new one.
.saveAsTable("sales") saves the table as a managed table in the Databricks catalog and names it sales. The Databricks catalog is the central registry that tracks all your tables, like a library index for your data.
Managed tables appear in the "Tables" section of the Databricks workspace. You can query them immediately without specifying a file path.
Writing an unmanaged Delta table
In full Databricks or cloud versions, you can also save tables in custom paths (unmanaged tables).
df.write.format("delta").mode("overwrite").save("/mnt/delta/sales_unmanaged")
Important note: In the Databricks Free Edition, access to legacy DBFS paths such as /FileStore may be disabled. Attempting to read or write data there can produce a “Public DBFS root is disabled” error. Modern Databricks environments encourage using managed tables or Unity Catalog volumes instead. To keep the examples simple and compatible with the free environment, this course will primarily use managed tables.
Code explanation:
.save("/mnt/delta/...") writes the table files to the specified location. Metadata is stored separately from the files, so deleting the table won't remove the underlying data.
Use unmanaged tables if you want to share the files across workspaces or control where the data is stored.
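The managed/unmanaged difference comes down to what dropping a table removes. Here is a tiny pure-Python sketch of that idea, with a dict standing in for the catalog and ordinary files standing in for table data; the function names are hypothetical, not the Databricks API:

```python
import os
import tempfile

catalog = {}  # toy catalog: table name -> {"path": ..., "managed": bool}

def create_table(name, path, managed):
    catalog[name] = {"path": path, "managed": managed}

def drop_table(name):
    entry = catalog.pop(name)              # the metadata always goes away
    if entry["managed"]:                   # managed: the data files go too
        for f in os.listdir(entry["path"]):
            os.remove(os.path.join(entry["path"], f))
        os.rmdir(entry["path"])

# Unmanaged table: we own the path; the catalog only tracks metadata
ext_path = tempfile.mkdtemp()
open(os.path.join(ext_path, "part-0000.parquet"), "w").close()
create_table("sales_unmanaged", ext_path, managed=False)

drop_table("sales_unmanaged")
print("sales_unmanaged" in catalog)   # False: metadata is gone
print(os.listdir(ext_path))           # ['part-0000.parquet']: data survives
```

With `managed=True`, the same `drop_table` call would also delete the files, which is exactly why dropping a managed table in Databricks is destructive while dropping an unmanaged one is not.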
Reading Delta tables using PySpark
After writing a Delta table, the next step is to read it back into a DataFrame for analysis or transformation. In Databricks, the simplest way to read a managed table is by using its table name from the catalog.
sales_df = spark.read.table("sales")
sales_df.show()
This code loads the sales Delta table as a Spark DataFrame so you can perform further transformations, filtering, or analysis using PySpark.
This code runs in a Databricks notebook. The screenshot attached shows the data in the table that we created.
Code explanation:
spark.read starts a read operation.
.table("sales") loads the Delta table named sales from the catalog.
.show() displays the contents of the DataFrame.
Because the table is stored in Delta format, Spark automatically reads both the data files and the Delta transaction log, ensuring the table is read consistently.
Reading by table name is common when working with managed tables. You do not need to know the underlying file path.
Querying Delta tables using SQL
Delta tables can be queried with SQL directly in a notebook. This is useful for analysts who don’t want to use Python.
%sql
-- Query the managed Delta table
SELECT * FROM sales;
This SQL runs in a Databricks notebook SQL cell. The attached screenshot shows the query output.
Explanation:
%sql indicates a SQL cell.
SELECT * FROM sales retrieves all rows from the Delta table.
You can mix Python and SQL queries in the same notebook to leverage the strengths of both languages.
Appending data to Delta tables
Sometimes you need to add new records without overwriting existing data. In full Databricks environments, you can usually add new rows to an existing Delta table using .mode("append").
# Append new data to the Delta table
new_data = [("Charlie", 300)]
new_df = spark.createDataFrame(new_data, ["name", "sales"])
new_df.write.format("delta").mode("append").saveAsTable("sales")

# Read back to confirm
sales_df = spark.read.table("sales")
sales_df.show()
Important note: In the Community Edition, managed Delta tables cannot be appended to. Trying to append data (even with the same schema) will result in errors like [DELTA_FAILED_TO_MERGE_FIELDS]. For this course, we will focus only on creating, reading, and querying managed Delta tables, which work reliably in the free edition of Databricks.
The above code does the following:
.mode("append") tells Spark to add the new rows after the existing ones instead of replacing them; the original data stays untouched.
spark.read.table("sales") reads the full sales table, including both the original rows and the newly appended one, back into a DataFrame, and sales_df.show() displays it.
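If the difference between the two write modes is ever unclear, this pure-Python sketch (a toy model, not Spark) captures the semantics: overwrite replaces a table's rows, append adds to them:

```python
tables = {}  # toy catalog: table name -> list of rows

def write(name, rows, mode):
    if mode == "overwrite" or name not in tables:
        tables[name] = list(rows)        # replace (or create) the table
    elif mode == "append":
        tables[name].extend(rows)        # keep old rows, add the new ones
    else:
        raise ValueError(f"unknown mode: {mode}")

write("sales", [("Alice", 100), ("Bob", 200)], mode="overwrite")
write("sales", [("Charlie", 300)], mode="append")
print(tables["sales"])
# [('Alice', 100), ('Bob', 200), ('Charlie', 300)]

write("sales", [("Dana", 400)], mode="overwrite")
print(tables["sales"])
# [('Dana', 400)] -- the earlier rows are gone
```

The last call shows why accidentally using overwrite when you meant append is one of the costliest beginner mistakes: the previous contents are simply replaced.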
Appending data is safe even when multiple analysts write to the same table at the same time because Delta Lake uses ACID transactions.
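The safety of concurrent appends rests on Delta's optimistic concurrency control: each writer tries to create the next log version, only one writer can win a given version, and the loser retries on top of the winner's commit. This pure-Python sketch is a simplified model of that idea (the helper names are hypothetical, not Delta's actual protocol):

```python
log = {}  # version number -> committed (writer, rows)

def try_commit(version, writer, rows):
    """A commit succeeds only if this version number is still free."""
    if version in log:
        return False          # someone else committed this version first
    log[version] = (writer, rows)
    return True

# Both writers read the table "at the same time" and see version 0
seen_a = len(log)
seen_b = len(log)

ok_a = try_commit(seen_a, "analyst_a", [("Charlie", 300)])  # wins version 0
ok_b = try_commit(seen_b, "analyst_b", [("Dana", 400)])     # conflict: loses

# The loser re-reads the log and retries on the new version
ok_b_retry = try_commit(len(log), "analyst_b", [("Dana", 400)])  # wins version 1

print(ok_a, ok_b, ok_b_retry)   # True False True
print(sorted(log))              # [0, 1] -- both appends landed, nothing lost
```

Because the retry replays the append on top of the other writer's commit, both sets of rows end up in the table and neither writer silently overwrites the other.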
Common beginner mistakes
As you start writing Delta tables, watch out for these pitfalls that catch most beginners early on:
Forgetting .format("delta"): If you do not specify .format("delta") when writing a DataFrame, Spark saves the data as regular Parquet files instead of a Delta table, so you lose Delta features like ACID transactions, time travel, and schema enforcement.
Using overwrite instead of append: If you use .mode("overwrite") when your goal is to add new data, all the existing data in the table is replaced and lost. Always double-check your write mode before running the operation.
Reading a table before it exists: Trying to read a Delta table that has not been created yet raises an AnalysisException because Spark cannot find the table in the catalog. Make sure the table exists before attempting to read it.
Confusing managed and unmanaged paths: Managed tables are fully controlled by Databricks, while unmanaged tables require you to provide a storage path. Mixing them up can lead to confusion, unexpected behavior, or errors, especially in environments like Databricks Community Edition, where unmanaged paths are restricted.