Search⌘ K
AI Features

Mix and Match Data

Explore how to connect and combine scattered data tables using SQL joins, primary keys, and foreign keys. Understand different join types and common mistakes to avoid, enabling you to retrieve and analyze data from multiple sources accurately.

In the real world, data rarely lives in one neat table. Users might be stored in one place, orders in another, and products somewhere else entirely. Individually, these tables tell partial stories. But when combined correctly, they reveal how users behave, what’s selling, and where systems might be failing.

As a data scientist, your job is to connect these pieces safely and correctly. That means understanding how tables relate, and using SQL joins to turn scattered datasets into reliable, production-ready data.

Before we jump into joins, we need to understand how tables are structured and connected, starting with primary keys and foreign keys.

Creating tables in SQL

In SQL, a table is like a dataset: rows represent records, and columns represent fields (attributes). As data scientists, we may not always create these tables ourselves, but understanding how they’re built helps us query them effectively.

Here’s a simple syntax to define a table:

CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
);
Define a table

For example, a table of customers:

CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
Name VARCHAR(100),
Email VARCHAR(100)
);
Example table

In this example, CustomerID is marked as a PRIMARY KEY, meaning it will uniquely identify each customer—this is essential when linking this table to others.

Tables are often created by data pipelines, not by hand. But it’s important to know how they work so you can read or debug them when needed.

Understanding primary keys

A primary key uniquely identifies each row in a table. For example, no two customers should share the same CustomerID. The following are key characteristics:

  • Unique: No duplicates allowed.

  • Not null: Every row must have a value.

  • Stable: Should not change frequently.

In data science workflows, primary keys help ensure data integrity during joins, filters, and feature engineering.

Fun fact: Think of a primary key like a student roll number in school. Even if two kids have the same name, their roll number keeps things sorted out!

Understanding foreign keys

Now, let’s talk about the foreign key. If the primary key is like a unique ID for each book in a library, then a foreign key is like a reference to a specific book in the account of a user who has borrowed it. It establishes a connection between two tables by referring to the primary key in another table.

A foreign key is a column (or a set of columns) in one table that refers to the primary key in another table. It creates a relationship between the two tables and ensures referential integrity by enforcing that the value in the foreign key column matches an existing primary key in the related table.

The following are key properties of the foreign keys:

  • Referential integrity: The value of a foreign key in one table must match a primary key in another table or be NULL.

  • Establishes relationships: Foreign keys are used to define relationships between tables, like one-to-many or many-to-many relationships.

CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
OrderDate DATE,
CustomerID INT,
FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
Foreign key example

In this case, CustomerID in the Orders table is a foreign key that references the CustomerID in the Customers table. This relationship ensures that every order is linked to an existing customer.

Fun fact: A foreign key is like a “friend tag” on social media—it points to someone else’s profile (or in this case, another table).

Now that we understand how tables can be related through primary and foreign keys, let's dive into how we can retrieve and combine data from these tables using SQL joins.

Using SQL joins

Joins allow us to combine data from multiple tables into one result set. Think of them as a way to “merge” datasets based on shared keys.

We’ll use the following example tables:

Customer Table

Orders Table

CustomerID

Name

OrderID

CustomerID

OrderDate

101

Alice

1

101

2025-04-01

102

Bob

2

102

2025-04-05

103

Carol

3

101

2025-04-10

Let’s break down the most commonly used types of SQL joins.

1. INNER JOIN

The INNER JOIN is one of the most popular join types. It returns only the rows where there is a match between the two tables. If there is no match, the row is excluded from the results.

Venn diagram of inner join
Venn diagram of inner join

Example

Let’s say we want to retrieve a list of customers along with their orders. We will use an INNER JOIN between the Customers and Orders tables, matching them on the CustomerID.

Before running the query, can you guess which customers will appear in the result? Look at the tables: which customers have a matching CustomerID in the Orders table?

MySQL
SELECT Customers.Name, Orders.OrderID
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

Here, the Customers table is used as the main source and joined with the Orders table, keeping only the rows where the CustomerID matches in both tables.

2. LEFT JOIN

The LEFT JOIN (or LEFT OUTER JOIN) returns all the rows from the left table, along with the matching rows from the right table. If there is no match, the result will contain NULL values for columns from the right table.

Venn diagram of left join
Venn diagram of left join

Example:

Let’s say we want a list of all customers and their orders, but this time, we also want to include customers who have not placed any orders. We use the LEFT JOIN:

Before running the query, can you guess which customer will be included this time (who was missing from the INNER JOIN), and what will the OrderID be for their row?

MySQL
SELECT Customers.Name, Orders.OrderID
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

Here, all customers will appear in the result, even if they haven’t placed any orders. For those customers without orders, the OrderID will be NULL.

3. RIGHT JOIN

The RIGHT JOIN (or RIGHT OUTER JOIN) is the mirror image of the LEFT JOIN. It returns all the rows from the right table and the matching rows from the left table. If there is no match, the result will contain NULL values for columns from the left table.

Venn diagram of right join
Venn diagram of right join

Example:

In some cases, we may want to list all orders, including orders without associated customers. The RIGHT JOIN would be useful here:

Thinking about our specific tables, all three orders do have matching customers. Before running the query, can you guess if the RIGHT JOIN result will look any different from the INNER JOIN result?

MySQL
SELECT Customers.Name, Orders.OrderID
FROM Customers
RIGHT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;

This query returns all records from the right table, along with any matching records from the left table. If there’s no match, the left table’s columns will contain NULL values.

4. CROSS JOIN

The CROSS JOIN produces a cartesian product—it returns every possible combination of rows from the two tables. If the first table has N rows and the second table has M rows, the result will contain N * M rows.

Venn diagram of cross join
Venn diagram of cross join

Example:

In this example, a cross join pairs every customer with every order, producing all possible customer–order combinations.

Our Customers table has 3 rows and our Orders table has 3 rows. Before running the query, can you guess exactly how many total rows this CROSS JOIN will return?

MySQL
SELECT
c.CustomerID,
c.Name,
o.OrderID,
o.OrderDate
FROM
Customers AS c
CROSS JOIN Orders AS o

This query selects four columns: the customer’s ID and name from the Customers table, and the order ID and date from the Orders table. It uses a CROSS JOIN, which pairs every row from Customers with every row from Orders, creating all possible combinations of customers and orders regardless of whether they’re related.

5. SELF JOIN

A SELF JOIN is a join where a table is joined to itself. This is useful when we need to compare rows within the same table.

Venn diagram of self join
Venn diagram of self join

Example:

In this example, we use a self-join on the Orders table to list every pair of orders placed by the same customer, ensuring the first order ID is always less than the second.

Look at the Orders table. Which CustomerID is the only one with more than one order? Before running the query, can you guess which customer’s orders will be paired in the result?

MySQL
SELECT
o1.CustomerID,
c.Name,
o1.OrderID AS FirstOrder,
o2.OrderID AS SecondOrder
FROM
Orders AS o1
JOIN Orders AS o2
ON o1.CustomerID = o2.CustomerID
AND o1.OrderID < o2.OrderID
JOIN Customers AS c
ON c.CustomerID = o1.CustomerID

The self-join will pair up each customer’s orders where the first order ID is less than the second. In our data, only customer 101 (Alice) has more than one order (OrderIDs 1 and 3), so we get a single row. Joining with Customers lets us replace the raw CustomerID with actual customer details (e.g., Name) and ensures only pair orders for valid customers.

Here is a quick reference table summarizing the SQL joins we covered:

Comparing SQL Join types

Join type

What it does

Common use case

INNER JOIN

Returns only rows where the join condition is met in both tables.

Finding matching data (e.g., customers with orders).

LEFT JOIN

Returns all rows from the left table, and matched rows from the right.

Finding all customers, including those who haven’t placed any orders.

RIGHT JOIN

Returns all rows from the right table, and matched rows from the left.

Finding all orders, even if the customer information is missing or unknown.

CROSS JOIN

Returns every possible combination of rows from both tables (Cartesian product).

Generating all possible pairings (e.g., matching all shirt sizes with all colors).

SELF JOIN

Joins a table to itself using an alias.

Comparing rows within the same table (e.g., finding employees who have the same manager).

Now that you’ve seen how different joins work, let’s look at some common mistakes developers make when using them.

Common mistakes:

1. Forgetting the ON condition
Without it, your JOIN becomes a CROSS JOIN and can explode your result set. Always double-check your join condition.

2. Using INNER JOIN instead of LEFT JOIN
An INNER JOIN will exclude unmatched rows—fine for clean data, risky for pipelines expecting full data coverage.

3. Mixing up aliases
Using short aliases like c and o makes code cleaner—but mix them up, and you’ll get confusing errors or wrong results.

Common mistakes

SQL joins are powerful, but small oversights can lead to major data errors. Here are some common mistakes to avoid.

  1. Forgetting the ON clause: If we forget the ON condition, SQL might execute a CROSS JOIN by default. This can return millions of rows unexpectedly, slowing down our query and database.

  2. Using INNER JOIN vs. LEFT JOIN: A common error is using an INNER JOIN when we really need a LEFT JOIN. If we want all customers (even those without orders), INNER JOIN will incorrectly filter them out. We should always ask ourselves: “Do we only want matching rows, or do we want all rows from one table, plus any matches?”

  3. Mismatched data types in join columns: If the columns we’re joining on have different data types (e.g., INTEGER vs. VARCHAR), the join may fail or produce unexpected results. Always ensure the data types are compatible before joining.

Conclusion

SQL Joins are powerful tools that help us combine data from multiple tables, making it possible to retrieve and analyze related data efficiently. Understanding how primary keys and foreign keys form relationships between tables is essential for writing effective queries. With this knowledge, we can build complex queries that draw data from different tables to solve real-world problems.

Let’s wrap things up with a quick quiz to check your understanding of the key concepts we’ve covered.

Technical Quiz
1.

In an SQL query, which SQL keyword is used to specify the column names used in a join?

A.

WHERE

B.

HAVING

C.

GROUP BY

D.

ON


1 / 5