Reading Data from CSV
Explore how to upload CSV files to the Databricks File System and read them into PySpark DataFrames. Understand how to inspect schemas, preview data rows, and avoid common errors when working with CSV data in Databricks.
In previous lessons, we created DataFrames manually inside the notebook. But in real-world data engineering, data rarely comes from hardcoded lists; it comes from files and external storage systems.
In this lesson, you'll work with your first real external file: a CSV. By the end, you will be able to upload a CSV file to Databricks, read it into a Spark DataFrame, inspect its structure, and preview its contents with confidence. Databricks can read many file formats: CSV, JSON, Parquet, Delta tables, and more. This lesson focuses on CSV files, but the workflow for the other formats is very similar once you understand the fundamentals here.
Understanding DBFS
Before reading a CSV file, you need to understand where files are stored. Databricks uses a file system called DBFS (Databricks File System).
DBFS is a storage layer built into your Databricks workspace. It gives you a single, consistent way to store and access files using simple paths, regardless of where those files are physically stored in the cloud. You do not need to know anything about the underlying cloud storage to use DBFS in this lesson; just think of it as a managed folder system for your workspace.
In notebooks, DBFS paths typically ...