Data Storage with S3 and OpenSearch
Explore how to design effective data storage solutions using Amazon S3 and OpenSearch for Amazon Bedrock applications. Understand S3 bucket organization, OpenSearch vector index configuration, hybrid search, and metadata filtering to build scalable and reliable retrieval-augmented generation systems. This lesson also covers IAM role setup and the trade-offs between Amazon OpenSearch Serverless and provisioned OpenSearch Service domains.
A RAG application built with Amazon Bedrock often depends on two storage layers working together. Amazon S3 stores the raw and processed documents that feed the knowledge base ingestion pipeline, while Amazon OpenSearch Service indexes vector embeddings generated from those documents so the application can retrieve relevant chunks at query time.
In a retrieval-augmented generation (RAG) architecture, this separation between document storage and vector indexing is deliberate: S3 provides durable, cost-effective storage for the documents, while OpenSearch Service makes the embeddings generated from document chunks searchable. When the two layers are poorly connected, Knowledge Bases for Amazon Bedrock can suffer from slow synchronization, lower retrieval relevance (from poor chunking, metadata, or indexing choices), and access failures caused by misconfigured IAM roles.
This lesson walks through S3 bucket design, OpenSearch vector configuration, hybrid search, and metadata filtering, giving you the architectural foundation to build reliable Bedrock-powered retrieval systems.
S3 as the data foundation for Bedrock
Amazon S3 is the primary document storage layer for Bedrock applications. Every document that enters your RAG pipeline, whether a PDF, an HTML page, or a plain-text file, begins its life cycle in an S3 bucket. How you organize that bucket directly affects sync performance, pipeline automation, and access control.
A well-structured S3 bucket uses prefixes (which behave like folders) to separate documents by their role in the pipeline. The recommended convention includes four key prefixes.
Raw documents (s3://bucket/raw/): This prefix stores original, unprocessed files as they arrive from upstream sources such as content management systems or manual uploads.
Processed documents (s3://bucket/processed/): Files that have been cleaned, chunked, or transformed by a preprocessing pipeline land here before being indexed.
Knowledge Base sync (s3://bucket/kb-sync/): This prefix contains the documents that Bedrock Knowledge Bases actively sync from. Keeping this separate from raw storage ensures that only validated, ready-to-index content enters the retrieval pipeline.
Fine-tuning datasets (s3://bucket/fine-tuning/): Training data ...
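The prefix convention above can be sketched as a few small helpers that build object keys and map a document from one stage to the next. This is only an illustration under stated assumptions: the function names (stage_key, promote_key, s3_uri) are hypothetical, the bucket name "bucket" is the placeholder used in the list, and the code only computes key strings rather than calling S3.

```python
# Stage names mirror the four prefixes in the list above (assumed fixed for this sketch).
STAGES = ("raw", "processed", "kb-sync", "fine-tuning")


def stage_key(stage: str, relative_path: str) -> str:
    """Return the object key for a file stored under the given stage prefix."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{stage}/{relative_path.lstrip('/')}"


def promote_key(key: str, to_stage: str) -> str:
    """Map a key to the same relative path under another stage prefix."""
    current, _, rest = key.partition("/")
    if current not in STAGES or not rest:
        raise ValueError(f"key is not under a known stage prefix: {key!r}")
    return stage_key(to_stage, rest)


def s3_uri(bucket: str, key: str) -> str:
    """Format a bucket name and object key as an s3:// URI."""
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    raw = stage_key("raw", "hr/handbook.pdf")
    processed = promote_key(raw, "processed")
    ready = promote_key(processed, "kb-sync")
    for key in (raw, processed, ready):
        print(s3_uri("bucket", key))
```

Keeping the relative path identical across stages makes it easy to trace a synced document back to its raw original. In a real pipeline, the computed keys would be handed to whatever S3 client performs the copy.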