...


Structured Data Extraction in LlamaIndex

Learn to extract structured data from unstructured text using LlamaIndex, Pydantic schemas, and LLM-powered parsing.

We encounter unstructured text every day—in emails, reports, news articles, and resumes. While we, as humans, can easily understand and interpret these texts, machines need structure to make sense of them. A resume, for instance, may contain a person’s name, work history, skills, and education—all in a free-flowing paragraph or scattered across different sections.

Now imagine you’re building an AI assistant to help a hiring manager screen hundreds of resumes. Reading each document manually would take hours. But what if we could teach our system to extract structured information—like name, email, and experience—from each resume?

This is where structured data extraction comes in. With LlamaIndex, we can do this using large language models (LLMs) combined with schemas that guide the output format.

In this lesson, we’ll start with a basic schema to extract simple fields, then improve it using field descriptions, and finally, expand to a more realistic, nested structure for job and education history.

Step 1: Getting started with a basic schema

Let’s begin with the simplest version. We’ll extract just a few top-level fields from a resume—like name, email, phone number, and skills.

Define the schema

We’ll use Pydantic, a Python library that lets us define data models using regular Python classes. These models tell the LLM exactly what kind of structured data we want in return.

from pydantic import BaseModel
from typing import List

class ResumeData(BaseModel):
    name: str
    email: str
    phone: str
    skills: List[str]
Define a simple schema using Pydantic to represent top-level resume fields
  • BaseModel is a class from Pydantic used to define structured data models. It lets us specify the fields we want the LLM to extract—along with their expected types.

  • In our ResumeData schema, we define four fields to extract from a resume: name, email, phone, and a list of skills. For now, we’re keeping the structure simple to focus on how extraction works.
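Before wiring the schema up to an LLM, it helps to see what Pydantic itself does for us. A minimal sketch (the sample values below are made up for illustration; the class is repeated so the snippet runs standalone):

```python
from pydantic import BaseModel
from typing import List

class ResumeData(BaseModel):
    name: str
    email: str
    phone: str
    skills: List[str]

# Pydantic validates types at construction time, so malformed
# output fails loudly instead of silently producing bad data.
candidate = ResumeData(
    name="Jane Doe",
    email="jane@example.com",
    phone="555-0100",
    skills=["Python", "SQL"],
)
print(candidate.skills)
```

If the LLM's output were missing a field or had the wrong type, constructing `ResumeData` from it would raise a `ValidationError` rather than returning a half-filled record.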

Step 2: Load a resume document

For this lesson, we’ll assume we have a resume file (in PDF format) from which we want to extract information. LlamaIndex provides built-in tools to load and parse such files.

from llama_index.readers.file import PDFReader
pdf_reader = PDFReader()
documents = pdf_reader.load_data("/path/to/resume.pdf")
text = documents[0].text
Load resume text from a PDF using LlamaIndex's built-in PDFReader

Here, we’re using PDFReader, a document loader from LlamaIndex, to read a resume and extract its text. Note that PDFReader typically returns one Document per page, so documents[0].text covers only the first page.
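For a multi-page resume, you would usually join the page texts before extraction. A minimal sketch, using plain strings to stand in for the Document objects that PDFReader returns:

```python
# Stand-ins for documents[0].text, documents[1].text, ...
pages = [
    "Jane Doe\njane@example.com\n555-0100",
    "Skills: Python, SQL",
]

# Join all page texts into one string before passing it to the LLM.
full_text = "\n".join(pages)
print(full_text)
```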

Step 3: Extract structured data using an LLM

We’ll now connect to an LLM using LlamaIndex. You can use any supported backend—like OpenAI or Groq.

from llama_index.llms.groq import Groq
llm = Groq(model="llama3-70b-8192", api_key="YOUR_GROQ_API_KEY")
sllm = llm.as_structured_llm(ResumeData)
response = sllm.complete(text)
print(response)
Connect to a Groq-hosted LLM and use the schema to extract structured output
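The structured LLM constrains the completion to JSON matching our schema, so the response can be parsed programmatically. A sketch of that last step, assuming the model returned a JSON string shaped like ResumeData (the payload below is illustrative, not real model output):

```python
import json

# Illustrative JSON payload in the shape of the ResumeData schema;
# in practice this would come from the structured LLM's response.
raw = '{"name": "Jane Doe", "email": "jane@example.com", "phone": "555-0100", "skills": ["Python", "SQL"]}'

data = json.loads(raw)
print(data["name"], data["skills"])
```

From here, the parsed dictionary (or a ResumeData instance built from it) can feed directly into a screening pipeline, a database, or a spreadsheet.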
...