
Why Databricks?

Explore the reasons behind Databricks' creation and its role in unifying data engineering, analytics, and machine learning. Learn how Databricks and the Lakehouse architecture address common problems with traditional data systems by providing a single platform that simplifies workflows, scales compute automatically, and integrates cloud storage. Understand the components of Databricks architecture and the advantages it offers for beginners and professionals working with large-scale data.

Before learning how to click buttons or write code, it is important to understand why Databricks was created. Many beginners jump straight into tools without knowing the problems those tools are trying to solve. This lesson builds that foundation in very simple language.

The problem with traditional data systems

For a long time, companies stored data in separate systems. One system was used for reports, another for large files, and another for machine learning work. This caused confusion and delays.

Data teams often faced these issues:

  • Data engineers had to move the same data between multiple tools, which increased errors and wasted time.

  • Analysts could not access raw data easily and had to wait for prepared tables.

  • Machine learning teams worked on copied data that was often outdated.

  • Managing infrastructure required a lot of manual effort and deep technical knowledge.

These problems slowed down decision-making and made data projects expensive and hard to maintain.

When data lives in many systems, teams spend more time moving data than learning from it. Databricks was designed to reduce this movement.

What Databricks does differently

Databricks brings data engineering, analytics, and machine learning into one platform. Instead of switching tools, teams work in the same environment using shared data.

At a high level, Databricks provides:

  • A single workspace where teams can write code, run queries, and view results.

  • Support for large data without needing to manage servers manually.

  • Built-in tools for working with structured and unstructured data. (Structured data fits neatly into rows and columns, like a spreadsheet or a database table; unstructured data, like images, audio files, or free-form text, does not.)

  • Tight integration with cloud storage, so data is easy to access and scale.

This approach makes Databricks especially useful for beginners who want to focus on learning data skills instead of managing systems.

Databricks architecture

To understand how Databricks works, it helps to see the big picture. The diagram below shows the general architecture for classic Databricks workspaces:

Databricks architecture in Community Edition

Let's walk through each part of this diagram in turn.

1. Users and applications

These are the people or apps that interact with Databricks. They write code, run queries, or build dashboards. They do not need to manage servers or storage directly.

2. Your Databricks account

This is the core of the Databricks environment. It contains two planes that work together to process your requests, plus the storage they read from and write to.

a) Control plane

The control plane is the management layer of Databricks; it handles scheduling, access control, and coordination. You interact with it through your browser. It contains:

  • Web application: The browser interface you use to write and run code.

  • Compute management: Schedules and manages compute tasks.

  • Catalog: Central place for managing who can access which data (governance).

  • Workspace: Where your notebooks, SQL queries, and scripts live.

b) Serverless compute plane

This is the engine that runs your code. It auto-scales computing resources, so you do not have to choose or manage servers. When you click “Run,” the compute plane picks up the task, processes the data, and returns the result.

c) Storage

Databricks works with two kinds of storage:

  • Default storage: A storage area managed inside Databricks itself, useful for quick experiments or small projects.

  • External cloud storage: Your own storage buckets on AWS (S3), Azure (ADLS), or Google Cloud (GCS). This is where your real data typically lives, separate from Databricks, so you retain full ownership and control.

3. How it all connects

Here is the sequence of events every time you run code in Databricks:

  1. Users and apps interact with the Databricks control plane.

  2. The control plane sends tasks to the serverless compute plane.

  3. The compute plane reads and writes data from either default storage or your external cloud storage.

  4. Results are returned to you in the workspace.
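The four steps above can be sketched in plain Python. This is only an illustrative model of the flow, not real Databricks code; every name here (`control_plane`, `compute_plane`, `storage_read`, the sample dataset) is invented for the analogy.

```python
# Illustrative model of the Databricks request flow (not real API calls).
# All names here are invented for the analogy.

def storage_read(storage, key):
    # Step 3: the compute plane reads data from storage.
    return storage[key]

def compute_plane(task, storage):
    # Steps 2-3: the compute plane runs the task against the data.
    data = storage_read(storage, task["dataset"])
    return [row for row in data if row["revenue"] > task["min_revenue"]]

def control_plane(request, storage):
    # Step 1: the control plane receives the user's request,
    # schedules it, and forwards it to the compute plane.
    result = compute_plane(request, storage)
    # Step 4: results are returned to the user's workspace.
    return result

# A tiny stand-in for cloud storage holding one dataset.
cloud_storage = {
    "orders": [
        {"product": "Laptop", "revenue": 1200},
        {"product": "Headphones", "revenue": 150},
    ]
}

# The user submits a query and gets results back, never touching servers.
result = control_plane({"dataset": "orders", "min_revenue": 500}, cloud_storage)
print(result)  # [{'product': 'Laptop', 'revenue': 1200}]
```

The key design point this sketch mirrors is that the user only ever talks to the control plane; compute and storage stay behind it.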

Why this design matters
This separation of concerns allows Databricks to do three important things automatically:

  • Manage orchestration (scheduling and running tasks) without needing to configure anything.

  • Scale compute up or down depending on how much data you are processing.

  • Keep your data in your own cloud storage so you are never locked in.

Analogy: a car factory
If the architecture feels abstract, think of it like a car factory:

  • Control plane → Factory managers planning and directing operations.

  • Compute plane → Assembly robots doing the heavy lifting.

  • Storage → The warehouse where raw materials (data) are kept, separate from the factory floor.

  • Users → Customers designing the car and placing orders.

Databricks and Apache Spark

Databricks is built on top of Apache Spark, but you do not need to understand Spark deeply to use Databricks.

Think of it like this:

  • Spark is the engine that processes data.

  • Databricks is the car, which gives you controls, safety features, and an easy way to drive that engine.

In this course, you'll use very basic PySpark (the Python interface for Apache Spark, which lets you write Python code that Spark executes on large data) only when needed, and always inside Databricks. The goal is not to master Spark, but to understand how Databricks uses it.

Databricks was created by the original creators of Apache Spark to make Spark easier and more practical for real companies.

A very first look at the code in Databricks

To make this lesson concrete, let’s look at a very small example of what running code in Databricks feels like. Do not worry about understanding every word yet.

In the following example, we create a tiny dataset and display it.

Python
# `spark` is the SparkSession that Databricks notebooks create for you automatically.
from pyspark.sql import Row

data = [
    Row(OrderID=1, Product="Laptop", Category="Electronics", Revenue=1200),
    Row(OrderID=2, Product="Headphones", Category="Electronics", Revenue=150),
]

df = spark.createDataFrame(data)
df.show()
df.printSchema()

Note on warnings: When running Spark code, you may see warning messages from Spark's internal logging system. They don't usually affect how your code runs or its performance. There are ways to reduce them, but some may still appear depending on your setup or Spark version. For learning purposes, you can safely ignore these warnings and focus on how Spark processes data.

Here is what this code is doing, explained step by step:

  • The first part creates two rows of sales data in memory, where each row represents an order with an order ID, product name, product category, and revenue amount.

  • Each row uses named fields, which means the column names, such as OrderID, Product, Category, and Revenue, are defined directly while creating the data.

  • The spark.createDataFrame() function converts this in-memory sales data into a DataFrame, which is the main data structure Databricks uses to process and analyze data.

  • The show() function displays the DataFrame as a table inside the notebook, allowing you to quickly check the values in each column.

  • The printSchema() function prints the structure of the DataFrame, showing column names and data types so you can understand how Databricks interprets your data.

This kind of simple DataFrame creation, inspection, and validation is a pattern you'll use repeatedly throughout the course as you explore Databricks and the Lakehouse platform.
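To make the idea of a schema concrete, here is the same tiny dataset sketched in plain Python with no Spark involved. This is only an analogy: a DataFrame is conceptually a list of rows plus a schema, and printSchema() reports the column names and types that Spark inferred, much like the rough stand-in below.

```python
# The same two orders, written as plain Python dictionaries.
rows = [
    {"OrderID": 1, "Product": "Laptop", "Category": "Electronics", "Revenue": 1200},
    {"OrderID": 2, "Product": "Headphones", "Category": "Electronics", "Revenue": 150},
]

# A rough stand-in for printSchema(): report each column's name and Python type.
schema = {column: type(value).__name__ for column, value in rows[0].items()}
print(schema)
# {'OrderID': 'int', 'Product': 'str', 'Category': 'str', 'Revenue': 'int'}
```

Spark does the same kind of inference at scale, mapping Python values to Spark SQL types such as long and string.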

In this course, you will first run Databricks-style code directly inside the course environment. This is done on purpose so you can focus on learning concepts, not setting up accounts or infrastructure.

This environment behaves like Databricks:

  • It uses Apache Spark.

  • It supports DataFrames.

  • It produces the same type of output you would see in Databricks.

This means the code you write here is fully compatible with Databricks.

Running the same code in the Databricks platform

Once you are comfortable, copy this exact same code and run it inside a Databricks notebook. We will learn how to do this in a later lesson.

Below is a screenshot from the Databricks platform showing the same code executed in a real Databricks workspace, producing the same output.

Creating a tiny dataset and displaying it in the Databricks platform

You’ll notice:

  • The table output looks the same.

  • The schema output is identical.

  • The workflow feels very similar.

That’s because Databricks is simply running Spark for you behind the scenes.

Where Databricks is used in the real world

Databricks is used by companies that work with large amounts of data and need reliable results. Typical use cases include:

  • Building data pipelines that clean and prepare raw data.

  • Running analytics queries for dashboards and reports.

  • Training machine learning models using shared data.

  • Supporting both technical and non-technical teams on one platform.
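To give a feel for the first use case, here is a minimal data-cleaning step sketched in plain Python. A real Databricks pipeline would express the same logic with DataFrame operations; the field names and sample records below are invented for illustration.

```python
# A minimal data-cleaning step: normalize text, convert types, drop bad rows.
# Field names and sample data are invented for illustration.
raw_orders = [
    {"product": " Laptop ", "revenue": "1200"},
    {"product": "Headphones", "revenue": None},  # missing revenue: drop this row
    {"product": "Monitor", "revenue": "300"},
]

def clean(order):
    # Trim stray whitespace and convert revenue from text to a number.
    return {
        "product": order["product"].strip(),
        "revenue": int(order["revenue"]),
    }

cleaned = [clean(order) for order in raw_orders if order["revenue"] is not None]
print(cleaned)
# [{'product': 'Laptop', 'revenue': 1200}, {'product': 'Monitor', 'revenue': 300}]
```

Pipelines like this run on a schedule, so clean tables are always ready for analysts and machine learning teams downstream.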

Understanding why Databricks is used will help you see where your new skills fit professionally.

More than 20,000 organizations worldwide use Databricks to build and scale data and AI applications, analytics, and agents, including over 60% of the Fortune 500, such as AT&T, Mastercard, and Unilever.

Technical Quiz

1. What is the main reason Databricks was created?

A. To replace Python as a programming language
B. To store data only for reporting purposes
C. To remove the need for cloud storage
D. To combine data engineering, analytics, and machine learning in one platform