By Adrien Payong and Shaoni Mukherjee
Retrieval-Augmented Generation is an AI architecture that combines classic information retrieval with generative large language models (LLMs) in one pipeline. Rather than relying entirely on the static knowledge pre-trained into an LLM (which may be incomplete, out-of-date, or both), RAG dynamically augments the generation process by injecting retrieved information from external sources. The result is a more precise and contextually relevant generation that leverages the best of retrieval and generation technology.
The idea behind a RAG pipeline is conceptually simple, but making a production-ready RAG pipeline can be complex. There are many tooling decisions and design trade-offs to consider: choosing a framework (Haystack vs. LangChain), selecting and tuning the vector database, settling on a document chunking strategy, meeting evaluation and engineering requirements, and planning deployment.
The goal of this guide is to walk through, from a technical perspective, how to build production-ready RAG pipelines.
A RAG pipeline usually has various components that interact with one another. The most essential building blocks are the following:
- A document ingestion and chunking stage that prepares raw data for indexing
- An embedding model that converts chunks and queries into dense vectors
- A vector database (document store) that indexes those vectors for similarity search
- A retriever that fetches the most relevant chunks for a given query
- A generator (LLM) that produces the final answer from the query plus the retrieved context
Components flow linearly in the pipeline, and production systems may also introduce branching, looping, or other additional steps (e.g., re-ranking of retrieved results before passing to LLM). The goal is for all components to work together and turn raw data into an accurate, contextually relevant, and up-to-date answer.
For those building RAG into production applications, high-level frameworks are often preferred. The primary feature and selling point of these frameworks is that they provide pre-built components and abstractions for building efficient pipelines. Haystack (by Deepset) and LangChain are by far the most popular and well-known tools in this category. Both aim for the same general use case – ease of building applications around LLMs. However, they have different underlying design principles and relative strengths.
Haystack (by Deepset) is an open-source framework focused on RAG and other use cases for LLMs. The central feature of Haystack, further emphasized in the most recent release, is its pipeline-centric, modular architecture. Haystack adopts a graph-like approach, where each component (reader, retriever, generator, etc.) represents a node in a directed acyclic graph (DAG) pipeline. Haystack pipelines are composable and customizable – nodes can be added, removed, and replaced with minimal side effects on other components. Below is a table summarizing some of the main benefits of Haystack:
Key Advantage | Description |
---|---|
Modular Pipeline Design | The modularity of the pipeline design in Haystack, with clearly defined component connections, allows for straightforward swapping of components (e.g., swapping BM25 with a Dense Vector retriever) or the integration of additional modules (such as re-rankers) without rewriting the entire application, resulting in a cleaner and more modular architecture that is simpler to debug. |
Production-Ready Features | Haystack offers built-in support for evaluation, monitoring, and scalability. It also integrates with dedicated evaluation frameworks (such as RAGAS and DeepEval) for retrieval and generation benchmarking (integrated into Haystack 2.0). |
Superior Documentation | Haystack provides comprehensive and easy-to-follow documentation and examples, often considered clearer than LangChain’s. This improves development speed and reduces troubleshooting time, especially for teams entering production. |
RAG-Optimized | Tailored for retrieval-based question answering, Haystack comes with out-of-the-box support and utilities for popular tricks and new techniques such as HyDE and query expansion. Multi-hop pipelines, agents, and other complex retrieval patterns are supported in Haystack 2.0. |
Using pseudocode (inspired by Haystack 2.0 syntax), a simple RAG pipeline might be constructed as follows:
# Step 1: Install required packages
#pip install haystack-ai openai python-dotenv
# Step 2: Load environment variables
import os
from dotenv import load_dotenv
load_dotenv() # Loads variables from a .env file if present
# OPTIONAL: Set the API key directly if not using .env
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# Step 3: Haystack imports
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
# Step 4: Create the DocumentStore
doc_store = InMemoryDocumentStore()
# Step 5: Prepare raw documents
documents = [
    Document(content="The Eiffel Tower is located in Paris, France."),
    Document(content="Haystack is a framework for building NLP applications."),
    Document(content="LangChain is often used to build LLM-based agents."),
    Document(content="RAG stands for Retrieval-Augmented Generation.")
]
# Step 6: Embed the documents and write them (with embeddings) to the store
doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-ada-002")
docs_with_embeddings = doc_embedder.run(documents=documents)["documents"]
doc_store.write_documents(docs_with_embeddings)
# Step 7: Build the pipeline (a prompt builder joins the retrieved documents and the question)
prompt_template = """Answer the question using the context below.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ query }}
Answer:"""
pipeline = Pipeline()
# Add components (the OpenAI components read OPENAI_API_KEY from the environment)
pipeline.add_component("embedder", OpenAITextEmbedder(model="text-embedding-ada-002"))
pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=doc_store))
pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
pipeline.add_component("generator", OpenAIGenerator(model="gpt-3.5-turbo"))
# Connect the components
pipeline.connect("embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")
# Step 8: Run the pipeline with a query
query = "What is RAG in NLP?"
result = pipeline.run({"embedder": {"text": query}, "prompt_builder": {"query": query}})
# Step 9: Print the answer
print("\n Answer:", result["generator"]["replies"][0])
In this code, Haystack connects an embedding component to a dense retriever, then to a prompt builder, and finally to a generator. You must ensure that OPENAI_API_KEY is set, either via .env or by direct assignment in the code.
LangChain is another open-source framework, one that has gained wide popularity for quickly prototyping LLM applications. It has a slightly different philosophy: instead of declaring a static pipeline graph, you often chain components together in code (or with higher-level abstractions like “Chains” or “Agents”).
LangChain is known for its extensive integrations and flexibility: it supports many LLM providers and vector stores, and tools (such as a web browser or calculator) can be integrated through an agent mechanism. The following table outlines some of the key benefits of using LangChain:
Key Advantage | Description |
---|---|
Extensive Integrations | LangChain is compatible with a wide range of models (OpenAI, Anthropic, HuggingFace), vector databases (Pinecone, Weaviate, FAISS, Chroma), and APIs. Developers can mix and match components with ease using plug-and-play modules. |
Agent Framework | Using LangChain, we can build LLM-powered agents that can dynamically choose which tools to use (web search, calculator, APIs, and more). Perfect for workflows that require complex, multi-step logic and tool orchestration. |
Quick Prototyping | User-friendly development experience with high-level abstractions that handle common use cases (retrieval QA, memory chains, etc.). Build proofs-of-concept and demos with only a few lines of code. |
Community and Ecosystem | Large, active community of users with many tutorials, plugins, and 3rd-party integrations. Even for tools that aren’t officially supported, there’s probably a community-contributed solution. |
LangChain often represents RAG as a RetrievalQA chain. For instance:
# Step 1: Install required packages (run this once)
# pip install -qU langchain langchain_community faiss-cpu openai tiktoken python-dotenv
# Step 2: Imports
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
import os
from dotenv import load_dotenv
# Step 3: Load environment variables (e.g. OpenAI key)
load_dotenv()
# Optionally, you can set manually
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# Step 4: Prepare your documents
raw_documents = [
"The Eiffel Tower is located in Paris, France.",
"Haystack is a framework for building NLP applications.",
"LangChain is often used to build LLM-based agents.",
"RAG stands for Retrieval-Augmented Generation."
]
# Step 5: Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=100
)
chunks = text_splitter.create_documents(raw_documents)
# Step 6: Initialize embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
# Step 7: Create vector store (FAISS is in-memory and fast)
vectorstore = FAISS.from_documents(chunks, embedding=embeddings)
# Step 8: Initialize RetrievalQA pipeline
qa_chain = RetrievalQA.from_chain_type(
llm=ChatOpenAI(model_name="gpt-3.5-turbo"), # chat model; default temperature is 0.7
retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
chain_type="stuff", # stuff = concatenate retrieved docs into prompt
return_source_documents=True # useful for debugging and transparency
)
# Step 9: Run a query
query = "What is RAG in NLP?"
result = qa_chain.invoke({"query": query})
# Step 10: Print result
print("\n Answer:\n", result["result"])
# Optional: Show retrieved docs
print("\n Retrieved Documents:")
for doc in result["source_documents"]:
print("-", doc.page_content)
This code snippet creates an in-memory FAISS vector index over the texts of all documents. It then calls LangChain’s RetrievalQA chain, which internally embeds the query, searches the vector store, and constructs a prompt for the LLM with the results. chain_type="stuff" concatenates (“stuffs”) the retrieved documents into the LLM prompt along with the question. Other chain types are available in LangChain, such as "map_reduce" (summarizing each document, then aggregating) or "refine" (iteratively building an answer), to handle more complex and larger contexts.
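For instance, switching strategies is a one-argument change. The sketch below reuses the vectorstore from the example above (the k value is illustrative) and swaps in map_reduce, which answers over each retrieved document separately before combining the partial results:
# A minimal sketch: same retriever, different chain type.
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI
map_reduce_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 8}),  # wider net for long contexts
    chain_type="map_reduce",  # summarize each document, then aggregate the partial answers
)
print(map_reduce_chain.invoke({"query": "What is RAG in NLP?"})["result"])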
We recommend Haystack if your target is a maintainable, evaluable, production-ready system and you’re largely following a retrieval QA or similar pattern. If you’re trying to prototype something quickly or need complex agent behaviors/integrations, LangChain may get you there faster. These options aren’t mutually exclusive either: some teams use Haystack for the core pipeline and add LangChain agents where necessary.
The vector database is a foundational component of any RAG pipeline. It stores your document embeddings and performs a similarity search. The vector store you choose will impact performance (query latency, speed), but also features like filtering, scaling, and ease of integration. Let’s compare a few popular vector stores and their use cases:
Vector DB | Strengths | Trade-offs / Limitations | Typical Use Cases |
---|---|---|---|
FAISS (Facebook AI Similarity Search) | C++ library with Python bindings optimized for similarity search. Supports advanced indexes: IVF, HNSW, Product Quantization, and GPU acceleration. Extremely fast in-memory queries | In-memory only—requires manual disk persistence and sharding. No built-in metadata support; external DB needed. Scaling beyond one machine requires a custom setup. | Ultra-low-latency retrieval workloads (e.g., real-time search, trading systems). High-performance prototypes or systems fitting in RAM |
ChromaDB | Open-source with persistent storage and Pythonic APIs. Excellent integration with LangChain. Metadata support and solid tooling for local use. | Limited to local/in-process use; lacks native clustering. Manual scaling is required for production workloads. Moderate query speed. | Quick prototyping and small-to-medium applications. Projects that need persistence and lightweight metadata filtering. |
Pinecone | Fully managed, serverless cloud service with auto-scaling. Rich feature set: metadata filters, multi-tenancy, monitoring. | Higher latency due to network overhead (hundreds of ms). Service costs and some vendor lock-in via proprietary API. Max dimensions/payload limits per API specs. | Enterprise-grade, large-scale applications with minimal ops burden. Use cases requiring metadata filters and multi-tenant support. |
Example: Local Development with ChromaDB
During development, you might start with Chroma for convenience. Let’s consider the following pseudo-code using LangChain:
# Install required packages
# pip install langchain langchain-community chromadb pypdf tiktoken python-dotenv
# Additional for document loading
# pip install beautifulsoup4 html2text lxml
# ===== CHROMA/LANGCHAIN RAG PIPELINE PSEUDO-CODE =====
# 1. ENVIRONMENT SETUP
import os
from chromadb.config import Settings
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
# 2. CONFIGURATION
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") # Required environment variable
PERSIST_DIR = "./chroma_db" # Storage directory
COLLECTION_NAME = "corporate_docs" # Vector collection name
EMBEDDING_MODEL = "text-embedding-3-large" # OpenAI model
# 3. DOCUMENT PROCESSING
# - Load documents from multiple sources
documents = PyPDFLoader("sample_financial_report.pdf").load() # sample source
# - Semantic-aware chunking
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Character limit per chunk
chunk_overlap=200, # Context preservation overlap
separators=["\n\n", "\n", ". ", " "] # Natural language boundaries
)
chunks = text_splitter.split_documents(documents)
# 4. EMBEDDING INITIALIZATION
embeddings = OpenAIEmbeddings(
model=EMBEDDING_MODEL,
openai_api_key=OPENAI_API_KEY # Secure credential handling
)
# 5. VECTOR STORE CREATION
vectorstore = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
    persist_directory=PERSIST_DIR,
    # Optional production parameters (a Settings object, not a plain dict)
    client_settings=Settings(
        anonymized_telemetry=False  # Compliance setting
    )
)
# 6. DOCUMENT INGESTION
vectorstore.add_documents(
    documents=chunks,  # each chunk already carries its metadata (e.g., source, page)
    ids=[f"doc_{idx}" for idx in range(len(chunks))]  # Unique identifiers
)
# 7. PERSISTENCE OPERATION
vectorstore.persist()  # Critical for disk storage
# 8. RETRIEVAL VERIFICATION
query = "Q3 financial projections"
results = vectorstore.similarity_search(
    query=query,
    k=5,  # Top 5 results
    filter={"department": "finance"}  # Metadata filter (assumes chunks carry a "department" field)
)
The preceding script creates the retrieval layer of a production-oriented RAG pipeline built with LangChain and Chroma. It loads PDF documents, splits them with structure-aware chunking, embeds them using OpenAI’s embedding model, stores them in a persistent Chroma vector database, and uses filtered similarity search to retrieve relevant context for queries. Wiring in a generator and applying the evaluation, monitoring, and deployment practices covered later in this guide completes the pipeline.
No matter which vector DB you choose, the same general best practices apply: keep document metadata stored alongside the embeddings, plan for re-indexing whenever the embedding model changes, and monitor index size and query latency as your corpus grows.
The vector database is your information store. Choose the one that scales to your needs and meets your deployment constraints, then use its features to balance speed, accuracy, and development effort.
Choosing the right embeddings is a foundational decision in the architecture of any RAG pipeline. Embeddings influence both retrieval quality and computational efficiency. Optimal embeddings ensure accurate matches between user queries and relevant documents. This directly drives the effectiveness of semantic search applications and context-aware LLM applications.
One of the most popular commercial models used in production has been OpenAI’s text-embedding-ada-002.
OpenAI’s latest generation (particularly text-embedding-3-small and text-embedding-3-large, released in early 2024) improves quality and lowers cost, but ada-002 remains widely available and a “safe default” for many production scenarios.
For many organizations, open-source or self-hosted options are preferable for reasons of privacy, fine-tuning flexibility, or cost control.
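For teams going the self-hosted route, the sentence-transformers library is a common starting point. Below is a minimal sketch; the model name is an illustrative default, not a recommendation from this guide:
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
# Load a small, self-hosted embedding model (illustrative choice, 384-dim vectors).
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "The Eiffel Tower is located in Paris, France.",
    "RAG stands for Retrieval-Augmented Generation.",
]
# encode() returns one dense vector per input string.
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode("Where is the Eiffel Tower?", normalize_embeddings=True)
# With normalized vectors, the dot product equals cosine similarity.
scores = doc_embeddings @ query_embedding
print(scores)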
Recent trends favor dual-embedding strategies: you can use a lower-cost, faster embedder (like OpenAI’s text-embedding-3-small) to retrieve a candidate set for recall, then rerank the results with a more accurate or specialized model. This can provide a good trade-off between speed, cost, and retrieval performance.
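One way to implement this two-stage pattern is to retrieve a generous candidate set with the fast embedder and then rescore it with a cross-encoder. The sketch below assumes the FAISS vectorstore built earlier in this guide and uses an illustrative reranker model from sentence-transformers:
# pip install sentence-transformers
from sentence_transformers import CrossEncoder
# Stage 1: cheap, fast retrieval for recall (reuses the FAISS vectorstore from earlier).
query = "What is RAG in NLP?"
candidates = vectorstore.similarity_search(query, k=20)
# Stage 2: a more accurate cross-encoder scores each (query, passage) pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model
scores = reranker.predict([(query, doc.page_content) for doc in candidates])
# Keep only the top-scoring passages for the LLM prompt.
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]
top_docs = reranked[:5]
Only the top reranked passages go into the prompt, which keeps the context short while preserving the recall of the wider first-stage search.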
Choosing the appropriate chunking strategy is essential for optimizing retrieval accuracy and overall LLM performance within RAG pipelines. Each method presents trade-offs between implementation simplicity, retrieval quality, and resource consumption. The table below outlines the most common chunking strategies and describes their implementation within LangChain and key considerations:
Strategy | How it Works | Advantages | Limitations |
---|---|---|---|
Fixed-Size Chunking | Split text into equal-sized chunks (e.g., 1000 tokens), often with overlap (e.g., 200 tokens) | Simple to implement. Predictable chunk/embedding counts. Works for homogeneous or unstructured text. | Can break sentences/concepts mid-chunk. Overlap causes duplication. Not aware of structure or semantics. |
Semantic Chunking | Splitting text based on semantic boundaries, such as topics, paragraphs, or sections aligned with meaning. | Chunks are semantically coherent. Improves retrieval precision and recall. Handles structured documents well. | Requires NLP tools for splitting. Variable chunk size. More complex and computationally expensive. |
Sentence-Level (Fine-Grained) | Each chunk is a sentence or a small group of sentences; it can use a sliding window for minimal context. | Maximum retrieval precision. Answers are easy to extract. Useful for direct-answer tasks. | Loss of broader context. Index size increases (more chunks/embeddings). LLM prompt construction can become complex. |
There isn’t a single optimal approach to chunking. Your choice should depend on the structure of your data, your retrieval objectives, and the level of complexity you can manage. Fixed-size chunking offers speed and simplicity, whereas semantic and sentence-level strategies can provide more accurate, contextually relevant results but may involve more overhead in terms of setup and computation. Experimenting with these strategies is important if you want to develop a high-performance, production-ready RAG system.
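To make the trade-off concrete, here is a small sketch contrasting fixed-size chunking with a naive sentence-level split; the regex splitter is a rough stand-in for a proper sentence tokenizer such as spaCy or NLTK:
import re
from langchain_text_splitters import RecursiveCharacterTextSplitter
text = (
    "Retrieval-Augmented Generation combines retrieval with generation. "
    "The retriever finds relevant chunks. The LLM writes the final answer "
    "using those chunks as context."
)
# Fixed-size chunking: predictable chunk counts, but may cut across sentences.
fixed_splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=20)
fixed_chunks = fixed_splitter.split_text(text)
# Sentence-level chunking: maximum precision, but more (and smaller) chunks.
sentence_chunks = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
print(f"Fixed-size chunks: {len(fixed_chunks)}")
print(f"Sentence chunks:   {len(sentence_chunks)}")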
Beyond the basic setup of “embed -> retrieve -> generate”, additional techniques can be applied to optimize a RAG pipeline. Here are a few techniques RAG practitioners can use to improve relevance and answer quality:
Query Rewriting and Expansion: Query expansion addresses the problem of query-document vocabulary mismatch by creating multiple variations of the user’s query to improve recall by casting a wider net for retrieval.
Typical Workflow (a minimal sketch follows):
- Use an LLM to generate several paraphrases or expansions of the user’s query.
- Run retrieval for each variation.
- Merge and deduplicate the retrieved documents before building the final prompt.
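A minimal sketch of this workflow, assuming the FAISS vectorstore from the LangChain example above; the prompt wording and model choice are illustrative:
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
def expand_query(query: str, n: int = 3) -> list[str]:
    """Ask the LLM for paraphrased variants of the query (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Rewrite the search query below in {n} different ways, one per line.\n\nQuery: {query}",
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    variants = [line.strip("-• ").strip() for line in lines if line.strip()]
    return [query] + variants[:n]
# Retrieve for every variant, then deduplicate by content before prompting the LLM.
query = "What is RAG in NLP?"
seen, merged = set(), []
for q in expand_query(query):
    for doc in vectorstore.similarity_search(q, k=3):
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)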
HyDE (Hypothetical Document Embeddings): Instead of embedding and searching with the user query directly, first create a hypothetical answer/document for the query using an LLM, then embed that and search.
Process Flow (a minimal sketch follows):
- Prompt an LLM to write a short hypothetical answer (or passage) for the query.
- Embed that hypothetical text instead of the raw query.
- Retrieve the real documents whose embeddings are closest to it and pass them to the generator.
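A minimal HyDE sketch, reusing the embeddings and vectorstore objects from the LangChain example above; the prompt wording is illustrative:
from openai import OpenAI
client = OpenAI()
query = "What is RAG in NLP?"
# 1. Ask the LLM to write a short hypothetical answer to the query.
hypothetical = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Write a short passage that answers: {query}"}],
).choices[0].message.content
# 2. Embed the hypothetical passage instead of the raw query.
hyde_vector = embeddings.embed_query(hypothetical)
# 3. Retrieve real documents closest to the hypothetical answer's embedding.
docs = vectorstore.similarity_search_by_vector(hyde_vector, k=3)
for doc in docs:
    print("-", doc.page_content)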
Traditional QA and search evaluation metrics must be adapted to the RAG context: evaluation covers both the performance of the retrieval step and the overall quality of the generated answers. To start, here are the elements we may want to evaluate in a RAG pipeline:
Retrieval Metrics: recall@k and precision@k (does the relevant context appear in the top-k results?), mean reciprocal rank (MRR), and the relevance of the retrieved context to the query.
Generation Metrics: faithfulness/groundedness (is the answer supported by the retrieved context?), answer relevance to the question, and fluency/coherence of the generated text.
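The retrieval-side metrics are easy to compute yourself once you have labeled which documents are relevant for a set of test queries. A minimal sketch of recall@k and MRR:
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)
def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
# Example with a single labeled query:
print(recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3))  # 0.5
print(mrr(["d3", "d7", "d1"], {"d1", "d9"}))               # ~0.33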
Frameworks like DeepEval and RAGAS provide tooling to compute these metrics. For example, DeepEval can use an LLM to score faithfulness by comparing the answer and context.
Beyond offline evaluation, you want to monitor the pipeline in production for anomalies: latency spikes, empty or low-relevance retrievals, sudden shifts in answer length or quality, and unusual query patterns.
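A minimal sketch of such monitoring hooks, wrapping whatever retrieval call your pipeline uses (the lambda in the usage comment assumes the FAISS vectorstore from earlier):
import logging
import time
logger = logging.getLogger("rag.monitoring")
def monitored_retrieve(retrieve_fn, query: str, k: int = 5):
    """Log retrieval latency and flag empty result sets, a common silent failure."""
    start = time.perf_counter()
    docs = retrieve_fn(query, k=k)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("retrieval query=%r k=%d latency_ms=%.1f hits=%d", query, k, latency_ms, len(docs))
    if not docs:
        logger.warning("empty retrieval for query=%r; possible index or embedding issue", query)
    return docs
# Usage:
# docs = monitored_retrieve(lambda q, k: vectorstore.similarity_search(q, k=k), "Q3 revenue")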
Deploying a RAG pipeline in production involves considerations of scalability, security, and operational reliability. The table below provides a quick summary of essential strategies and best practices:
Area | Strategy/Technique | Description & Examples | Key Considerations |
---|---|---|---|
Scalability & Performance | Horizontal Scaling | Run multiple RAG service replicas behind a load balancer (e.g., via containers/Kubernetes). Use a distributed or replicated vector DB (e.g., Pinecone, Weaviate) for shared access. | Ensure index/data consistency; vector DB should support scaling and concurrent access. |
Scalability & Performance | Caching | Cache query embeddings, retrieved documents, and (where appropriate) LLM answers to speed up repeated queries and reduce compute costs (see the sketch after this table). | Cache invalidation and privacy; not all outputs are cacheable if queries are highly personalized. |
Scalability & Performance | Asynchronous Processing | Use message queues and worker pools to handle high query volumes, decoupling request handling from retrieval and generation. | Design for at-least-once or exactly-once processing; manage result delivery to users. |
Scalability & Performance | Kubernetes Orchestration | Deploy RAG as a set of pods with resource limits, autoscaling, and robust health checks. Use YAML manifests for configuration (e.g., set replicas, resource requests/limits, and secrets for API keys). | Monitor pod health and resource usage, and implement autoscaling based on demand. |
Latency Optimization | Colocate Services | Host your vector DB and RAG application servers in the same region to minimize network latency. | Avoid cross-region/cross-cloud hops to keep query times low. |
Latency Optimization | LLM Warmup & Streaming | Warm up LLM models at startup; stream LLM answers to users as they are generated for improved perceived latency. | Mitigates cold-start penalties and enhances user experience. |
Failure & Recovery | Retries & Fallbacks | Implement retry logic with exponential backoff for external APIs; design fallback retrieval (e.g., keyword search) if vector search fails; run regular health checks and sample query monitoring. | Ensure reliability and graceful degradation under failure scenarios. |
Security & Privacy | Data Encryption | Encrypt all data at rest (storage, vector DB) and in transit (HTTPS/TLS between services). E.g., enable KMS or disk encryption on cloud and managed DBs. | Comply with company policies and regulations (e.g., GDPR, HIPAA). |
Security & Privacy | Access Control | Tag documents/vectors with user roles or permissions; filter retrieval based on user identity (e.g., Pinecone’s metadata filter). | Enforce strict access and audit logs; ensure no cross-user or cross-tenant data leakage. |
Security & Privacy | Isolation & Multi-Tenancy | Use namespaces/collections per client or tenant to isolate data in multi-tenant deployments. | Prevents data leaks across organizational boundaries; simplifies compliance. |
Security & Privacy | Content Filtering | Apply output filtering or moderation (e.g., OpenAI’s moderation API) to prevent the LLM from exposing sensitive or inappropriate content. | Reduces risk of data leaks and helps enforce acceptable use. |
Security & Privacy | Differential Privacy | Add noise to embeddings or outputs (e.g., via the Laplace mechanism) to prevent extraction of sensitive info; or mask/redact sensitive tokens in preprocessing. | Balance privacy with retrieval performance; may impact relevance if overused. |
Security & Privacy | Regulatory Compliance | Implement mechanisms for data deletion (“right to be forgotten”), obtain user consent for data use, and avoid sending data to 3rd parties without consent. | Stay compliant with GDPR, CCPA, and industry standards. |
Security & Privacy | Monitoring & Auditing | Log source documents for every answer, monitor for unusual access/query patterns, and maintain system and security logs. | Facilitates trust, troubleshooting, and regulatory investigations. |
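As an example of the caching row above, here is a minimal sketch of an in-process cache for query embeddings; a real deployment would more likely use a shared cache such as Redis, and the embedding model shown is simply the one used earlier in this guide:
from functools import lru_cache
from langchain_community.embeddings import OpenAIEmbeddings
embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
@lru_cache(maxsize=10_000)
def _embed_cached(normalized_query: str) -> tuple[float, ...]:
    # Tuples are immutable, which keeps cached values safe to reuse.
    return tuple(embedder.embed_query(normalized_query))
def embed_query(query: str) -> list[float]:
    """Light normalization so repeated or trivially re-phrased queries hit the cache."""
    return list(_embed_cached(query.strip().lower()))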
By following these strategies, organizations can achieve scalable, secure, and reliable RAG deployments. Containerization, access controls, and monitoring, when built in from the start, will help your system stay resilient as it scales and matures in production.
What is Retrieval-Augmented Generation (RAG), and why should I care? RAG is an AI architecture that pairs classic information retrieval with large language models to generate answers grounded in up-to-date external knowledge. The resulting answers are more factually accurate and contextually relevant than those from LLMs alone.
How do I decide between Haystack and LangChain to build a RAG pipeline? Choose Haystack if you need a production-ready, modular, and evaluable pipeline with comprehensive documentation. Choose LangChain for quick prototyping, agent-based workflows, or more flexible integrations with various models and vector stores.
What is the role of the vector database in a RAG pipeline? The vector database stores your documents’ high-dimensional embeddings and enables efficient similarity search to retrieve the most relevant context for informing the LLM’s answer.
What are the best vector databases for RAG pipelines, and how do I choose one? There is no single best option: FAISS excels at ultra-fast in-memory search when you can manage persistence yourself, ChromaDB is convenient for local development and small-to-medium workloads, and managed services like Pinecone suit large-scale deployments that need filtering, multi-tenancy, and minimal operational burden. Choose based on scale, latency requirements, metadata-filtering needs, and how much infrastructure you want to run yourself.
Engineering a production-ready RAG pipeline requires careful consideration at multiple layers; it extends far beyond basic retrieval and generation. Success relies on making the right choice at each step: the framework (Haystack or LangChain), embedding models, vector database, chunking strategy, evaluation approach, deployment practices, and security measures. For deeper, hands-on guidance, the DigitalOcean Community offers tutorials on building and operating advanced RAG pipelines.
By engineering each layer of the pipeline meticulously and following best practices in scalability, monitoring, and security, you can deliver accurate, reliable, and secure RAG solutions. As the landscape of RAG technologies is rapidly evolving, staying flexible and continuously evaluating your pipeline’s components is essential for maintaining an effective and future-proof solution.