Structured Data Extraction in LlamaIndex
Learn to extract structured data from unstructured text using LlamaIndex, Pydantic schemas, and LLM-powered parsing.
We encounter unstructured text every day—in emails, reports, news articles, and resumes. While we, as humans, can easily understand and interpret these texts, machines need structure to make sense of them. A resume, for instance, may contain a person’s name, work history, skills, and education—all in a free-flowing paragraph or scattered across different sections.
Now imagine you’re building an AI assistant to help a hiring manager screen hundreds of resumes. Reading each document manually would take hours. But what if we could teach our system to extract structured information—like name, email, and experience—from each resume?
This is where structured data extraction comes in. And with LlamaIndex, we can do this using large language models (LLMs), combined with tools for guiding the output format using schemas.
In this lesson, we’ll start with a basic schema to extract simple fields, then improve it using field descriptions, and finally, expand to a more realistic, nested structure for job and education history.
Step 1: Getting started with a basic schema
Let’s begin with the simplest version. We’ll extract just a few top-level fields from a resume—like name, email, phone number, and skills.
Define the schema
We’ll use Pydantic, a Python library that lets us define data models using regular Python classes. These models tell the LLM exactly what kind of structured data we want in return.
```python
from pydantic import BaseModel
from typing import List

class ResumeData(BaseModel):
    name: str
    email: str
    phone: str
    skills: List[str]
```
`BaseModel` is a class from Pydantic used to define structured data models. It lets us specify the fields we want the LLM to extract, along with their expected types. In our `ResumeData` schema, we define four fields to extract from a resume: `name`, `email`, `phone`, and a list of `skills`. For now, we're keeping the structure simple to focus on how extraction works.
Step 2: Load a resume document
For this lesson, we’ll assume we have a resume file (in PDF format) from which we want to extract information. LlamaIndex provides built-in tools to load and parse such files.
```python
from llama_index.readers.file import PDFReader

pdf_reader = PDFReader()
documents = pdf_reader.load_data("/path/to/resume.pdf")
text = documents[0].text
```
Here, we’re using `PDFReader`, a document loader from LlamaIndex, to read a resume and extract its full text.
Step 3: Extract structured data using an LLM
We’ll now connect to an LLM using LlamaIndex. You can use any supported backend—like OpenAI or Groq.
```python
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-70b-8192", api_key="YOUR_GROQ_API_KEY")
sllm = llm.as_structured_llm(ResumeData)
response = sllm.complete(text)
print(response)
```
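If you don't have an API key at hand, you can still simulate the final step. Under the hood, the structured LLM produces JSON matching the schema, which Pydantic parses back into a `ResumeData` object. The sketch below does that parsing directly, using hypothetical JSON shaped like typical structured output:

```python
from pydantic import BaseModel
from typing import List

class ResumeData(BaseModel):
    name: str
    email: str
    phone: str
    skills: List[str]

# Hypothetical JSON, shaped like what the structured LLM returns.
llm_output = (
    '{"name": "Jane Doe", "email": "jane@example.com", '
    '"phone": "555-0100", "skills": ["Python", "SQL"]}'
)

# Parse the JSON string into a validated ResumeData instance.
resume = ResumeData.model_validate_json(llm_output)
print(resume.name)    # Jane Doe
print(resume.skills)  # ['Python', 'SQL']
```

With a real backend, the `response` returned by `sllm.complete(text)` carries the parsed object as well (in recent LlamaIndex versions, via `response.raw`), so fields can be read directly rather than re-parsed from text.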