Structured Data Extraction in LlamaIndex
Learn to extract structured data from unstructured text using LlamaIndex, Pydantic schemas, and LLM-powered parsing.
We encounter unstructured text every day—in emails, reports, news articles, and resumes. While we, as humans, can easily understand and interpret these texts, machines need structure to make sense of them. A resume, for instance, may contain a person’s name, work history, skills, and education—all in a free-flowing paragraph or scattered across different sections.
Now imagine you’re building an AI assistant to help a hiring manager screen hundreds of resumes. Reading each document manually would take hours. But what if we could teach our system to extract structured information—like name, email, and experience—from each resume?
This is where structured data extraction comes in. And with LlamaIndex, we can do this using large language models (LLMs), combined with tools for guiding the output format using schemas.
In this lesson, we’ll start with a basic schema to extract simple fields, then improve it using field descriptions, and finally, expand to a more realistic, nested structure for job and education history.
Step 1: Getting started with a basic schema
Let’s begin with the simplest version. We’ll extract just a few top-level fields from a resume—like name, email, phone number, and skills.
Define the schema
We’ll use Pydantic, a Python library that lets us define data models using regular Python classes. These models tell the LLM exactly what kind of structured data we want in return.
```python
from pydantic import BaseModel
from typing import List

class ResumeData(BaseModel):
    name: str
    email: str
    phone: str
    skills: List[str]
```
`BaseModel` is a class from Pydantic used to define structured data models. It lets us specify the fields we want the LLM to extract, along with their expected types. In our `ResumeData` schema, we define four fields to extract from a resume: `name`, `email`, `phone`, and a list of `skills`. For now, we're keeping the structure simple to focus on how extraction works.
Step 2: Load a resume document
For this lesson, we’ll assume we have a resume file (in PDF format) from which we want to extract information. LlamaIndex provides built-in tools to load and parse such files.
```python
from llama_index.readers.file import PDFReader

pdf_reader = PDFReader()
documents = pdf_reader.load_data("/path/to/resume.pdf")
text = documents[0].text
```
Here, we’re using `PDFReader`, a document loader from LlamaIndex, to read a resume and extract its full text.
Step 3: Extract structured data using an LLM
We’ll now connect to an LLM using LlamaIndex. You can use any supported backend—like OpenAI or Groq.
```python
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-70b-8192", api_key="YOUR_GROQ_API_KEY")
sllm = llm.as_structured_llm(ResumeData)
response = sllm.complete(text)
print(response)
```
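If you don't have an API key at hand, you can still simulate the final step. Under the hood, the structured LLM produces JSON matching the schema, which Pydantic parses back into a `ResumeData` object. The sketch below does that parsing directly, using hypothetical JSON shaped like typical structured output:

```python
from pydantic import BaseModel
from typing import List

class ResumeData(BaseModel):
    name: str
    email: str
    phone: str
    skills: List[str]

# Hypothetical JSON, shaped like what the structured LLM returns.
llm_output = (
    '{"name": "Jane Doe", "email": "jane@example.com", '
    '"phone": "555-0100", "skills": ["Python", "SQL"]}'
)

# Parse the JSON string into a validated ResumeData instance.
resume = ResumeData.model_validate_json(llm_output)
print(resume.name)    # Jane Doe
print(resume.skills)  # ['Python', 'SQL']
```

With a real backend, the `response` returned by `sllm.complete(text)` carries the parsed object as well (in recent LlamaIndex versions, via `response.raw`), so fields can be read directly rather than re-parsed from text.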