
Document Loaders and Text Splitters

Explore how to use document loaders to standardize diverse data sources into uniform documents and apply text splitters to chunk these documents appropriately. Understand strategies for chunk size, overlap, and splitting algorithms to optimize retrieval accuracy and maintain semantic coherence in LLM applications.

With prompt templates and output parsers producing reliable structured objects, the next challenge is feeding real-world data into the pipeline. LLMs operate within finite context windows, yet production applications must reason over large corpora spanning PDFs, scraped web pages, database records, and API responses. A support chatbot answering questions from a 200-page product manual cannot pass the entire document to the model in a single prompt. The solution is a two-stage data preparation workflow that sits at the front of every Retrieval-Augmented Generation (RAG) pipeline.

LangChain addresses this with two abstractions. Document loaders standardize ingestion from heterogeneous sources into a uniform Document object: a LangChain data structure containing a page_content string (the actual text) and a metadata dictionary (source information such as page number, URL, or timestamp). Text splitters then partition those documents into chunks small enough for embedding and retrieval. Together, these two stages transform raw, unwieldy data into retrieval-ready pieces that downstream components (embedders, vector stores, and retrievers) can consume without modification.
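To make the Document structure concrete, here is a minimal illustrative sketch. The dataclass below is a stand-in for the real class, which lives at langchain_core.documents.Document and has the same two fields; the manual text and metadata values are invented for the example.

```python
from dataclasses import dataclass, field

# Illustrative stand-in for langchain_core.documents.Document:
# a text payload plus a free-form metadata dictionary.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Every loader emits objects of this shape, whatever the source was.
doc = Document(
    page_content="Hold the power button for ten seconds to reset the device.",
    metadata={"source": "product-manual.pdf", "page": 42},
)

print(doc.metadata["source"])
```

Because metadata is an open dictionary, each loader can record whatever provenance it has (page numbers for PDFs, URLs for web pages, timestamps for API responses) without changing the downstream interface.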

This lesson covers the ingestion and chunking stages. The stages that follow, vector storage and retriever optimization, are covered in the next lesson.

The following diagram illustrates where document loaders and text splitters fit within the broader RAG data preparation pipeline.

RAG pipeline showing data flow from sources through document loaders and text splitters to embedding and vector storage

Loading documents from diverse sources

Every LangChain document loader exposes a .load() method that returns a list of Document objects. Because every loader produces the same structure (page_content as a string and metadata as a dictionary), all downstream code works identically regardless of the original source. Think of it like a universal ...
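The loader contract can be sketched in a few lines. TextFileLoader below is a hypothetical class written for illustration (the closest real equivalent is langchain_community.document_loaders.TextLoader); what matters is the shape of .load(): it always returns a list of Document objects, so the rest of the pipeline never needs to know where the text came from.

```python
import tempfile
from pathlib import Path


class Document:
    """Illustrative stand-in for langchain_core.documents.Document."""
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


class TextFileLoader:
    """Hypothetical loader following the .load() contract."""
    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        # Read the source and wrap it in the uniform Document shape,
        # recording provenance in metadata.
        text = self.path.read_text(encoding="utf-8")
        return [Document(text, {"source": str(self.path)})]


# Demo: write a small file, then load it back as Documents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hold the power button for ten seconds to reset the device.")
    path = f.name

docs = TextFileLoader(path).load()
print(docs[0].metadata["source"])
```

A PDF loader or web loader would differ only in how it extracts text and what it records in metadata; the return type, a list of Document objects, stays the same, which is exactly what makes the downstream splitter and embedding stages source-agnostic.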