Data Science and Machine Learning Interview Handbook/

...

Structured vs. Unstructured Data

Explore common interview questions related to data formats, transformation pipelines, and database decisions, using real-world examples from domains like health care.

We'll cover the following...

Transforming unstructured data into a structured format
- Sample answer
Relational vs. non-relational databases
- Sample answer
Store unstructured and structured data
- Sample answer

Structured and unstructured data are everywhere—from neat SQL tables to free-form handwritten notes. Knowing how to work with both types (and choose the right tools) is an important skill for any data scientist. Let’s explore three questions interviewees can expect in data analytics and data science interviews.

Transforming unstructured data into a structured format

A hospital has a significant amount of unstructured data in doctors’ handwritten notes containing prescribed medications for patients. The hospital staff finds it challenging to track prescriptions efficiently because these notes vary in format, contain medical information, and are often difficult to decipher.

To improve their prescription tracking and record-keeping, the hospital has implemented a digital solution to convert these handwritten notes into structured data. This structured format will organize prescription information in a way that can be easily searched, accessed, and analyzed across all patients.

How can the hospital convert these handwritten notes into structured, usable data?

This question is frequently asked in data science interviews for roles in health care technology, NLP-focused startups, or enterprise data engineering.

Sample answer

To answer this, we can reframe the problem statement: converting doctors’ handwritten notes containing prescribed medications for patients (unstructured data) into structured data to keep track of prescriptions across patients.

We need to design a solution to convert these handwritten notes into structured data to capture key prescription details in a searchable, standardized format.

Here are some steps that can be implemented (see the figure below for a visual example):

Scanning: Capture doctors’ handwritten notes using mobile or desktop scanning tools.
OCR: Perform Optical Character Recognition (OCR) via an algorithm to extract the text from the image.
Text cleaning and verification: Post-process the extracted text to remove any grammatical issues or inconsistencies caused by the OCR algorithm. Verify the accuracy of the OCR output (e.g., by comparing it against the original image and making any necessary corrections).
NER: Perform Named Entity Recognition (NER) to extract names or categories of prescribed medications for use in the structured data. Verify the accuracy of the NER output (e.g., by reviewing the extracted entities and ensuring they match the context of the original text).
Structuring and storage: Organize the extracted data into a structured format (like rows in a table) and store it in a relational or document-based database for querying and analysis.

Press + to interact

This answer can be adapted for various scenarios, but its essence is identifying what structured data can be created from unstructured data—names, categories, numbers, and more. In an interview, you can strengthen your response by:

Mentioning specific tools (e.g., Tesseract OCR, spaCy for NER).
Describing how you would handle edge cases (e.g., illegible handwriting, ambiguous drug names).
Returning to the business value—improved efficiency, better patient outcomes, etc.

This approach can also be extended to emails, PDFs, scanned forms, and audio transcripts.

Relational vs. non-relational databases

Suppose you’re designing a system dealing with structured (e.g., patient records) and unstructured data (e.g., scanned prescriptions or social media feedback). How do you decide whether to use a relational or non-relational database?

Sample answer

To address this, focus on three main factors: functionality, data storage, and scalability.

Functionality

Relational databases: They are suitable for structured data that requires strong ACID (Atomicity, Consistency, Isolation, Durability) compliance and complex queries involving joins. They are perfect for data with clearly defined relationships, such as customer orders in e-commerce or banking transactions.
Non-relational databases (NoSQL): They are suitable for unstructured or semi-structured data. They often prioritize eventual consistency over strong consistency (using BASE—available, soft state, eventually consistent). They are designed for flexible schema and scaling, with many types (e.g., document stores, key-value stores, graph databases) offering tailored solutions for specific use cases.

Storage

Relational databases:
1. Storage mechanism: Data is stored in tables with predefined schemas. Each table consists of rows and columns, where each row represents a record and each column represents an attribute.
2. Examples: They include MySQL, PostgreSQL, and Oracle Database.
Non-relational databases (NoSQL):
1. Storage mechanism: Data is stored more flexibly, depending on the database type. Document stores use JSON-like documents, key-value stores use pairs of keys and values, and graph databases use nodes and edges.
2. Examples: They include MongoDB (document store), Redis (key-value store), and Neo4j (graph database).

Scalability

Relational databases:
1. Scalability approach: They typically scale vertically by upgrading the server’s hardware (e.g., adding more CPU, RAM). Horizontal scaling (sharding) is more complex due to the rigid schema and inter-table dependencies.
2. Strengths and limitations: They provide strong transactional consistency but can face limitations in efficiently handling massive, distributed datasets.
Non-relational databases (NoSQL):
1. Scalability approach: They are designed for horizontal scaling, allowing easy data distribution across multiple servers or clusters. This ensures high availability and fault tolerance.
2. Strengths and limitations: They are excellent for handling large volumes of data and high-throughput requirements. However, eventual consistency models may not guarantee immediate consistency of all data replicas.

There’s no universal rule or answer. Instead, explain how you would:

Use a relational database for structured, relational data that requires strict consistency.
Use a NoSQL database for flexibility, scalability, and high-throughput use cases.
Justify trade-offs in consistency: (ACID vs. BASE), schema design, and operational needs.

Store unstructured and structured data

You’re building a health care system aggregating patient data from various sources. Some of this data is structured (e.g., names, visit reasons), while others are unstructured or semi-unstructured (e.g., doctor’s notes or free-text reviews).

Your system must store both data types while preserving relationships and supporting patient-level record retrieval.

Requirements:

Design a database schema that can handle structured and unstructured data.
Implement functions to store both types of data.
Ensure data consistency across the two sources.
Implement error handling for data insertion.
Add a function to retrieve complete patient records.
Your solution must be in Python. You may use any database framework or engine for your answer.

Consider the following sample input records and the expected example usage for your solution.

Note: The second sample input record isn’t fully representative of unstructured data—it is unstructured text data already somewhat incorporated in a structured data format. An example of true unstructured data would be free-form text, which is beyond the scope of this lesson. Refer to the first question in this lesson for an example of converting this data into a structured format.

Press + to interact

# Sample records
structured_data = [
    {"id": 1, "name": "Alice", "reason_visit": "General_Checkup"},
    {"id": 2, "name": "Bob", "reason_visit": "Diabetes"}
]
semi_unstructured_data = [
    {"id": 1, "notes": "Vitals show no signs of infection. Rash due to allergy. "},
    {"id": 2, "review": "Raise dose of medication to ... amount."}
]

#Expected Usage
#manager = HealthcareDataManager()
#manager.store_structured_data(structured_data)
#manager.store_unstructured_data(unstructured_data)
#patient_record = manager.get_patient_record(1)
#manager.close()

Press + to interact

# Sample Solution using SQLite
import sqlite3
from typing import List, Dict
import json

class HealthcareDataManager:
    def __init__(self, db_name: str = "healthcare.db"):
        """Initialize database connection and create tables."""
        self.conn = sqlite3.connect(db_name)
        self.cursor = self.conn.cursor()
        self._create_tables()
    
    def _create_tables(self):
        """Create necessary tables if they don't exist.
        Some parts of the query are already written for you."""
        self.cursor.executescript("""
            CREATE TABLE IF NOT EXISTS patients (
                id INTEGER PRIMARY KEY,
                ...
            );
            
            CREATE TABLE IF NOT EXISTS patient_notes (
                id INTEGER PRIMARY KEY,
                patient_id INTEGER,
                notes TEXT,
                review TEXT,
                FOREIGN KEY (patient_id) REFERENCES patients (id)
            );
        """)
        self.conn.commit()
    
    def store_structured_data(self, data: List[Dict]):
        """Store structured patient data."""
        try:
            for record in data:
                self.cursor.execute(
                    #TODO - your query here
                )
            self.conn.commit()
            return True
        except sqlite3.Error as e:
            print(f"Error storing structured data: {e}")
            self.conn.rollback()
            return False
    
    def store_unstructured_data(self, data: List[Dict]):
        """Store unstructured patient notes and reviews."""
        try:
            for record in data:
                self.cursor.execute(
                    #TODO - your query here
                    (record['id'], record.get('notes'), record.get('review'))
                )
            self.conn.commit()
            return True
        except sqlite3.Error as e:
            print(f"Error storing unstructured data: {e}")
            self.conn.rollback()
            return False
    
    def get_patient_record(self, patient_id: int) -> Dict:
        """Retrieve complete patient record including both structured and unstructured data."""
        try:
           #TODO - your query execution here
            result = self.cursor.fetchone()
            if result:
                return {
                    "id": result[0],
                    "name": result[1],
                    "reason_visit": result[2],
                    "notes": result[3],
                    "review": result[4]
                }
            return None
        except sqlite3.Error as e:
            print(f"Error retrieving patient record: {e}")
            return None
    
    def close(self):
        """Close database connection."""
        self.conn.close()

Sample answer

For a sample solution, let’s consider sqlite3. You can click the “Show Solution” button above to access this solution code. Let’s break it down step by step:

The _create_tables() function:

The underscore prefix indicates this is a private helper function.
The function creates two tables if they don’t already exist, based on the sample input records.
- patients: It stores structured data (ID, name, reason for visit).
- patient_notes: It stores unstructured data (notes and reviews).
It uses a foreign key to link patient_notes to patients.
- A foreign key is a column shared between tables that establishes a link between related data.
It commits the changes to the database.

The store_structured_data() function:

It takes a list of dictionaries containing structured patient data.
Each record:
- Uses INSERT OR REPLACE to either insert new records or update existing ones.
- Uses parameterized queries to prevent SQL injection.
It wraps the operation in a try-except block for error handling.
If successful, it commits changes and returns True.
If an error occurs, rolls back changes and returns False.

The store_unstructured_data function:

It is similar to store_structured_data but for unstructured data.
It uses .get() method for notes and review because they may not both exist in every record.
It also uses INSERT OR REPLACE for similar functionality as store_structured_data().
It follows the same error handling pattern with commit/rollback.

The get_patient_record function:

It takes a patient ID and returns all data for that patient.
It uses a LEFT JOIN to combine data from both tables.
It returns None if no patient is found or if there’s an error.
It converts the SQL result tuple into a dictionary for easier use.
It uses parameterized query for security.

Altogether, these functions help us implement our solution to store different types of health care data in our database and retrieve patient records efficiently.

Getting Started

Handling Diverse Real-World Data

Preparing and Transforming Data for Machine Learning Pipelines

Understanding Supervised Learning Algorithms

Understanding Unsupervised Learning Algorithms

Advanced Machine Learning Concepts

ML Applications and Deployment in the Real World

Responsible Machine Learning: Ethics, Fairness, and Privacy

ML Interview Preparation and Case Studies

Structured vs. Unstructured Data

Transforming unstructured data into a structured format

Sample answer

Relational vs. non-relational databases

Sample answer

Functionality

Storage

Scalability

Store unstructured and structured data

Sample answer