RAG Best Practices
RAG training best practices for Builder - structure knowledge base documents, tune chunking and retrieval settings, and evaluate retrieval quality to improve AI agent accuracy.
What Is RAG and Why It Matters
RAG (Retrieval-Augmented Generation) allows your Builder AI agents to answer questions using your specific business documents, policies, FAQs, and knowledge base content rather than relying solely on the language model’s pre-trained knowledge. When a user asks a question, the RAG system retrieves the most relevant chunks from your uploaded documents and injects them as context into the prompt before the model generates a response.
This approach grounds the model’s output in your actual content, reduces hallucination, and makes it possible to update the knowledge base without retraining the underlying language model.
RAG is configured inside the LLM node (or the legacy OpenAI GPT / Azure OpenAI nodes) on the agent canvas. The RAG settings are divided into two groups: Training Settings (how your documents are processed and indexed) and Inference Settings (how the system retrieves context at query time).
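Conceptually, the retrieve-then-inject flow can be sketched in a few lines of Python. This is an illustration only, not Builder’s implementation: `embed()` here is a toy character-frequency stand-in for a real embedding model, and the `min_confidence` and `top_n` arguments mirror the inference settings described later on this page.

```python
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": character-frequency vector over a-z (illustration only;
    # a real system calls an embedding model here).
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], top_n: int, min_confidence: float) -> list[str]:
    # Rank chunks by similarity to the query, drop low-confidence ones,
    # and cap the count unless top_n is 0 (unlimited).
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    kept = [c for c in ranked if cosine(q, embed(c)) >= min_confidence]
    return kept if top_n == 0 else kept[:top_n]

def build_prompt(query: str, context: list[str]) -> str:
    # Inject the retrieved chunks ahead of the user's question.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
```

The model then answers from the injected context rather than from its pre-trained knowledge alone.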
Key RAG Parameters
Training Settings
| Parameter | Description |
|---|---|
| Training Style | How your documents are processed. Questions & Answers is optimized for FAQ-style content - pairs of questions and direct answers. Text Documents is optimized for longer narrative content such as policy manuals, product guides, or technical documentation. |
| Embedding Model | The model used to convert document text into vector representations. Smaller models (for example text-embedding-ada-002, bge-small-en-v1.5) process faster and cost less. Larger models (for example text-embedding-3-large) produce higher-quality embeddings at higher cost. |
| Training Mode | Controls how much of the training pipeline is re-run: Full Training reprocesses all documents and rebuilds all indexes from scratch; Rebuild Embeddings reprocesses document content while preserving index structure; Rebuild Index Only reconstructs search indexes without reprocessing document text; Fetch Data Only retrieves existing data without reprocessing. |
Inference Settings
| Parameter | Description |
|---|---|
| Distance Function | The similarity metric used to compare query embeddings against document embeddings. Cosine (default) measures the angle between vectors and works well for most text. Euclidean, Manhattan, and Chebyshev are alternatives that may suit specific content types. |
| Minimum Confidence Threshold (minConfidence) | A value between 0.0 and 1.0. Only document chunks with a similarity score at or above this threshold are included in the context. A value of 0.0 includes all results; higher values filter to only the most relevant chunks. |
| Top N Contexts (topN) | The maximum number of document chunks to retrieve and inject into the prompt. A value of 0 returns all chunks above the confidence threshold. Setting a specific number limits context to the most relevant results. |
| Selected Training Set (selectedTS) | Specifies which training data set to use for retrieval when multiple training sets are configured on a single node. |
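The four distance functions can be expressed directly; the snippets below are illustrative implementations, not Builder’s internals. For the three distance metrics a smaller value means more similar, while cosine is shown as a similarity to match the confidence scores used at inference time.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Angle-based similarity: 1.0 for parallel vectors, 0.0 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean(a: list[float], b: list[float]) -> float:
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a: list[float], b: list[float]) -> float:
    # Sum of absolute per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a: list[float], b: list[float]) -> float:
    # Largest single per-dimension difference.
    return max(abs(x - y) for x, y in zip(a, b))
```

Note that cosine ignores vector magnitude, which is one reason it is the default for text embeddings.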
Advanced Index Settings
| Parameter | Description |
|---|---|
| Approximate Similarity Index | When enabled, uses approximate nearest-neighbor search instead of exact search. Recommended for large document sets (100,000+ pages) where exact search becomes slow. |
| Index Trees | Number of trees used in the approximate index. Higher values improve accuracy at the cost of longer index build time. |
| Index Search Nodes | Number of nodes examined during an approximate search query. Set to -1 for automatic optimization. |
Embedding Model Selection
The embedding model determines how well the system understands the semantic meaning of both your documents and the user’s query. The model used during training must match the model used during inference - changing the embedding model requires a full retrain.
Smaller models are appropriate when:
- Query volume is high and latency matters
- The content is straightforward (short answers, structured FAQs)
- Cost is a significant constraint
Larger models are appropriate when:
- The content is complex, technical, or highly specialized
- Accuracy is the primary concern and query volume is manageable
- The domain involves nuanced language (legal, medical, scientific)
Start with a smaller model during development and testing. If retrieval quality is consistently poor after tuning the confidence threshold and Top N, upgrade to a larger model and retrain.
Training Style: Questions & Answers vs Text Documents
Questions & Answers training style:
- Best for: FAQ documents, help desk knowledge bases, customer service scripts, structured Q&A pairs
- How it works: The system indexes the content expecting distinct question-answer pairs. Queries are matched against the question side of each pair.
- Document preparation: Structure source documents as explicit Q&A pairs. Keep individual answers focused - one topic per answer.
- Recommended chunk size: 200 to 400 tokens per chunk
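Matching against the question side of each pair can be illustrated with a hypothetical sketch. Token-overlap (Jaccard) stands in for embedding similarity here purely for readability; actual retrieval compares vector embeddings.

```python
import re

def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, punctuation stripped.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    # Crude stand-in for embedding similarity: token-set overlap.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_answer(query: str, qa_pairs: list[tuple[str, str]]) -> str:
    # Score the query against the QUESTION side of each pair,
    # then return the paired answer.
    question, answer = max(qa_pairs, key=lambda qa: jaccard(query, qa[0]))
    return answer
```

Because the answer rides along with its question, a focused one-topic answer is returned whole, which is why explicit Q&A pairs retrieve so cleanly.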
Text Documents training style:
- Best for: Policy manuals, product documentation, research content, legal documents, narrative guides
- How it works: The system indexes passages from longer documents. Queries are matched against the most semantically relevant passages.
- Document preparation: Organize documents with clear headings and logical sections. Each section should cover a single coherent topic.
- Recommended chunk size: 500 to 800 tokens per chunk to preserve context
If your knowledge base contains both types of content, consider creating separate training sets or separate agents optimized for each type.
Training Modes
Use the right training mode to balance thoroughness against time:
- Full Training - use when setting up a new knowledge base, when significantly changing the document set, or when changing the embedding model. Reprocesses everything.
- Rebuild Embeddings - use when adding new documents or updating existing content with the same embedding model. Faster than full training.
- Rebuild Index Only - use when you have changed index settings (distance function, approximate index parameters) but have not changed the document content or embedding model.
- Fetch Data Only - use to retrieve existing indexed data for inspection without reprocessing.
After any training run, use the Test feature on the agent and check the Debugger’s RAG selection items output to verify that the retrieved chunks match what you expect for representative queries.
Document Preparation Tips
The quality of retrieval is directly proportional to the quality of the source documents. Poor-quality documents produce poor-quality context regardless of the parameter settings.
Before uploading documents:
- Remove boilerplate content (headers, footers, page numbers, legal disclaimers that repeat on every page) that adds noise without adding information value
- Split very long documents into logically coherent sections when possible - a 200-page PDF is harder to chunk well than ten focused 20-page documents
- Ensure consistent terminology throughout the knowledge base - if the same concept is called by multiple names, consider adding a glossary or synonym document
- For Q&A style, write explicit answers rather than relying on the model to infer answers from surrounding context
- For text documents, use descriptive headings and subheadings - these help the chunking process produce more topically coherent chunks
- Avoid documents where critical information is only in tables, images, or charts - extract that information into plain text
- Keep content current - outdated information is retrieved with the same confidence as current information, so stale content directly degrades answer quality
Chunking and Relevance Scoring
Builder’s RAG system splits documents into chunks before embedding them. Each chunk is a contiguous segment of text. The chunking strategy interacts with the training style setting:
- Q&A training treats each question-answer pair as a unit
- Text document training uses sliding window or paragraph-based chunking
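A sliding-window chunker for the text-document style might look like the following sketch, using whitespace-separated words as a rough stand-in for tokens:

```python
def chunk_sliding(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a word sequence into overlapping windows.

    Each window holds up to chunk_size words and shares `overlap` words
    with its predecessor, so no passage is cut off at a hard boundary.
    """
    assert 0 <= overlap < chunk_size, "overlap must be smaller than the window"
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the final window already reaches the end of the text
    return chunks
```

The overlap is the reason sliding-window chunking preserves context across chunk boundaries: a sentence near the end of one window reappears at the start of the next.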
During retrieval, the system computes a similarity score between the embedded user query and each stored chunk embedding. The Minimum Confidence Threshold filters out chunks below the score cutoff, and Top N Contexts caps the number of chunks that are injected into the prompt.
The practical effect of these two parameters:
- A low threshold and high Top N: retrieves many chunks, including less relevant ones. Useful during initial testing to see what the system finds, but risks injecting irrelevant context that confuses the model.
- A high threshold and low Top N: retrieves fewer but more precise chunks. Better for production when you have confirmed that relevant content reliably scores above the threshold.
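The interaction of the two parameters can be demonstrated with a small sketch (the scores below are invented for illustration; a `top_n` of 0 means unlimited, matching the topN setting):

```python
def select_chunks(scored: list[tuple[str, float]],
                  min_confidence: float, top_n: int) -> list[tuple[str, float]]:
    # Keep chunks at or above the confidence cutoff, best first,
    # then cap the count unless top_n is 0 (unlimited).
    ranked = sorted(scored, key=lambda cs: cs[1], reverse=True)
    kept = [cs for cs in ranked if cs[1] >= min_confidence]
    return kept if top_n == 0 else kept[:top_n]

# Invented similarity scores for three chunks:
results = [("refund policy", 0.91), ("shipping times", 0.58), ("office hours", 0.22)]
```

Open settings (`select_chunks(results, 0.0, 0)`) return all three chunks, including the marginal one; strict settings (`select_chunks(results, 0.5, 1)`) return only the top match.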
Initial Configuration and Tuning Process
Section titled “Initial Configuration and Tuning Process”Step 1: Start with open settings for testing
When first configuring RAG on a node:
- Set Minimum Confidence Threshold to 0 to capture all results
- Set Top N Contexts to 0 to retrieve everything above the threshold
- Select a small, fast embedding model for quick iteration
- Run Full Training
This open configuration lets you see the full range of what the system retrieves for your test queries before you start narrowing it down.
Step 2: Run test queries and inspect retrieval
- Click the Train Model button in the LLM node configuration panel
- Wait for training to complete (time varies with document volume - small sets take minutes, large sets may take hours)
- Run a test execution from the agent toolbar
- Open the Debugger and expand the RAG selection items message for the LLM node
- Review the chunks retrieved for your query - are they the right content? Are they complete enough to answer the question?
Step 3: Tune confidence threshold and Top N
After confirming that the right content exists in the knowledge base and is being retrieved in the initial open configuration:
- Gradually increase Minimum Confidence Threshold from 0 toward 0.5 - test after each adjustment
- Note the threshold where relevant chunks start being excluded - that is your lower bound
- Reduce Top N Contexts from unlimited toward a practical number (7 to 12 is a common range for most use cases) - test after each reduction
- Stop when the agent produces accurate, complete answers with acceptable response time
Use the OpenAI tokenizer (or equivalent tool) to verify that the total tokens consumed by the retrieved context, the system message, and the user query stay within the model’s context window limit. Aim for 75% or less of the available limit to leave room for the model’s response.
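For a quick budget check without running an exact tokenizer, a rough heuristic of about four characters per English token is often close enough; the sketch below applies the 75% headroom rule from above. The heuristic and function names are illustrative, not a Builder API, and for exact counts you should still use the tokenizer for your model.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return max(1, len(text) // 4)

def within_budget(system_msg: str, user_query: str, chunks: list[str],
                  context_window: int, headroom: float = 0.75) -> tuple[bool, int]:
    # Total the estimated tokens of everything injected into the prompt and
    # compare against a fraction of the context window, leaving the
    # remainder for the model's response.
    total = (approx_tokens(system_msg)
             + approx_tokens(user_query)
             + sum(approx_tokens(c) for c in chunks))
    return total <= int(context_window * headroom), total
```

If the check fails, reduce Top N, raise the confidence threshold, or shorten the system message before considering a larger-context model.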
Step 4: Verify with a cross-platform check
If you are unsure whether poor answers are caused by retrieval quality or by the model itself, copy the retrieved context chunks from the Debugger and test them directly in another AI platform (for example ChatGPT or Claude). If those platforms also cannot produce a good answer from that context, the issue is in the retrieved content, not the model. If they can produce a good answer, the issue is in the model configuration (system message, prompt structure, or model selection).
Troubleshooting Common RAG Issues
“The agent says it does not know the answer”
Likely causes and fixes:
- Confidence threshold is too high - lower it to allow more chunks through
- The information is not in the knowledge base - search your source documents manually to confirm the answer exists, then add it if missing and retrain
- The embedding model is not capturing the query semantics well - try a larger model and retrain
“Answers are not specific enough or are too general”
Likely causes and fixes:
- Top N is too low - the model has too little context; increase Top N to retrieve more chunks
- Chunks are too small - if using Q&A style on narrative content, switch to Text Documents style
- The relevant content is buried in long passages - split source documents into shorter, more focused sections
“Responses are slow”
Likely causes and fixes:
- Top N is too high - retrieving and processing many chunks increases latency; reduce to 7 to 12
- Embedding model is too large - switch to a smaller model if accuracy allows
- Large document set without approximate indexing - enable Approximate Similarity Index for large sets
“Answers include incorrect or irrelevant information”
Likely causes and fixes:
- Confidence threshold is too low - low-relevance chunks are being retrieved and injected; raise the threshold
- Source documents contain outdated or contradictory information - review and update the knowledge base, then retrain
- Multiple topics in one chunk - restructure source documents so each section covers a single topic
Token limit exceeded
When the combined size of the retrieved context, the system message, and the user query exceeds the model’s token limit, the request will fail or the response will be truncated.
Fixes:
- Reduce Top N Contexts to retrieve fewer chunks
- Increase Minimum Confidence Threshold to retrieve only the most relevant chunks
- Shorten the system message
- Use a model with a larger context window
Performance Monitoring
After going live, track these indicators to know when to retrain or re-tune:
- Retrieval relevance - periodically sample live queries in the Debugger and manually verify that retrieved chunks are relevant
- Answer accuracy - review a sample of responses against expected answers; create a standard test query set that covers your key use cases
- Response time - if latency increases over time as the document set grows, consider enabling approximate indexing
- Token usage - monitor whether retrieved context is consuming an increasing share of the context window as the knowledge base grows
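The standard test query set mentioned above can be scored automatically. The sketch below assumes a retrieval function that returns ranked chunk strings and an expected substring per query; both names are illustrative, not part of Builder.

```python
def retrieval_hit_rate(test_cases: list[tuple[str, str]],
                       retrieve_fn, k: int = 5) -> float:
    """Fraction of test queries whose expected text appears in the top-k chunks.

    test_cases: list of (query, expected_substring) pairs.
    retrieve_fn: callable returning ranked chunk strings for a query.
    """
    hits = 0
    for query, expected in test_cases:
        top_chunks = retrieve_fn(query)[:k]
        if any(expected.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(test_cases) if test_cases else 0.0
```

Running this after each retrain or content update gives a single number to track over time; a drop signals that tuning or document cleanup is needed.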
Plan quarterly reviews of the knowledge base content to remove outdated documents and add new ones, followed by a retraining run.
Implementation Checklist
Initial setup:
- Organize and clean source documents before uploading
- Choose training style based on content type (Q&A or Text Documents)
- Select an initial embedding model (start smaller for testing)
- Set confidence threshold to 0 and Top N to 0 for initial testing
- Run Full Training and wait for completion
- Run test queries and inspect RAG selection items in the Debugger
Tuning:
- Gradually increase confidence threshold while testing retrieval quality
- Reduce Top N while verifying answer completeness
- Verify token usage stays within the model’s context window limit
- Create a standard set of test queries covering key use cases
Before going live:
- Run the full test query set and confirm all answers are accurate
- Verify response times are acceptable
- Document the final configuration settings for future reference
- Plan a schedule for knowledge base updates and retraining
Related Documentation
- Builder Debugger - inspect RAG selection items and trace retrieval behavior
- User Manual - LLM node configuration reference
- OpenAI GPT node - legacy node with RAG settings documentation