RAG Best Practices
RAG training best practices for Builder - structure knowledge base documents, tune chunking and retrieval settings, and evaluate retrieval quality to improve AI agent accuracy.
What Is RAG and Why It Matters
RAG (Retrieval-Augmented Generation) allows your Builder AI agents to answer questions using your specific business documents, policies, FAQs, and knowledge base content rather than relying solely on the language model’s pre-trained knowledge. When a user asks a question, the RAG system retrieves the most relevant chunks from your uploaded documents and injects them as context into the prompt before the model generates a response.
This approach grounds the model’s output in your actual content, reduces hallucination, and makes it possible to update the knowledge base without retraining the underlying language model.
RAG is configured inside the LLM node (or the legacy OpenAI GPT / Azure OpenAI nodes) on the agent canvas. The RAG settings are divided into two groups: Training Settings (how your documents are processed and indexed) and Inference Settings (how the system retrieves context at query time).
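Conceptually, the retrieve-then-inject flow can be sketched in a few lines of Python. This is an illustration only, not Builder’s implementation: `embed()` here is a toy character-frequency stand-in for a real embedding model, and the `min_confidence` and `top_n` arguments mirror the inference settings described later on this page.

```python
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": character-frequency vector over a-z (illustration only;
    # a real system calls an embedding model here).
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], top_n: int, min_confidence: float) -> list[str]:
    # Rank chunks by similarity to the query, drop low-confidence ones,
    # and cap the count unless top_n is 0 (unlimited).
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    kept = [c for c in ranked if cosine(q, embed(c)) >= min_confidence]
    return kept if top_n == 0 else kept[:top_n]

def build_prompt(query: str, context: list[str]) -> str:
    # Inject the retrieved chunks ahead of the user's question.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
```

The model then answers from the injected context rather than from its pre-trained knowledge alone.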
Key RAG Parameters
Training Settings
| Parameter | Description |
|---|---|
| Training Style | How your documents are processed. Questions & Answers is optimized for FAQ-style content - pairs of questions and direct answers. Text Documents is optimized for longer narrative content such as policy manuals, product guides, or technical documentation. |
| Embedding Model | The model used to convert document text into vector representations. Smaller models (for example text-embedding-ada-002, bge-small-en-v1.5) process faster and cost less. Larger models (for example text-embedding-3-large) produce higher-quality embeddings at higher cost. |
| Training Mode | Controls how much of the training pipeline is re-run: Full Training reprocesses all documents and rebuilds all indexes from scratch; Rebuild Embeddings reprocesses document content while preserving index structure; Rebuild Index Only reconstructs search indexes without reprocessing document text; Fetch Data Only retrieves existing data without reprocessing. |
Inference Settings
| Parameter | Description |
|---|---|
| Distance Function | The similarity metric used to compare query embeddings against document embeddings. Cosine (default) measures the angle between vectors and works well for most text. Euclidean, Manhattan, and Chebyshev are alternatives that may suit specific content types. |
| Minimum Confidence Threshold (minConfidence) | A value between 0.0 and 1.0. Only document chunks with a similarity score at or above this threshold are included in the context. A value of 0.0 includes all results; higher values filter to only the most relevant chunks. |
| Top N Contexts (topN) | The maximum number of document chunks to retrieve and inject into the prompt. A value of 0 returns all chunks above the confidence threshold. Setting a specific number limits context to the most relevant results. |
| Selected Training Set (selectedTS) | Specifies which training data set to use for retrieval when multiple training sets are configured on a single node. |
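The four distance functions can be expressed directly; the snippets below are illustrative implementations, not Builder’s internals. For the three distance metrics a smaller value means more similar, while cosine is shown as a similarity to match the confidence scores used at inference time.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Angle-based similarity: 1.0 for parallel vectors, 0.0 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean(a: list[float], b: list[float]) -> float:
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a: list[float], b: list[float]) -> float:
    # Sum of absolute per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a: list[float], b: list[float]) -> float:
    # Largest single per-dimension difference.
    return max(abs(x - y) for x, y in zip(a, b))
```

Note that cosine ignores vector magnitude, which is one reason it is the default for text embeddings.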
Advanced Index Settings
| Parameter | Description |
|---|---|
| Approximate Similarity Index | When enabled, uses approximate nearest-neighbor search instead of exact search. Recommended for large document sets (100,000+ pages) where exact search becomes slow. |
| Index Trees | Number of trees used in the approximate index. Higher values improve accuracy at the cost of longer index build time. |
| Index Search Nodes | Number of nodes examined during an approximate search query. Set to -1 for automatic optimization. |
Embedding Model Selection
The embedding model determines how well the system understands the semantic meaning of both your documents and the user’s query. The model used during training must match the model used during inference - changing the embedding model requires a full retrain.
Smaller models are appropriate when:
- Query volume is high and latency matters
- The content is straightforward (short answers, structured FAQs)
- Cost is a significant constraint
Larger models are appropriate when:
- The content is complex, technical, or highly specialized
- Accuracy is the primary concern and query volume is manageable
- The domain involves nuanced language (legal, medical, scientific)
Start with a smaller model during development and testing. If retrieval quality is consistently poor after tuning the confidence threshold and Top N, upgrade to a larger model and retrain.
Training Style: Questions & Answers vs Text Documents
Questions & Answers training style:
- Best for: FAQ documents, help desk knowledge bases, customer service scripts, structured Q&A pairs
- How it works: The system indexes the content expecting distinct question-answer pairs. Queries are matched against the question side of each pair.
- Document preparation: Structure source documents as explicit Q&A pairs. Keep individual answers focused - one topic per answer.
- Recommended chunk size: 200 to 400 tokens per chunk
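Matching against the question side of each pair can be illustrated with a hypothetical sketch. Token-overlap (Jaccard) stands in for embedding similarity here purely for readability; actual retrieval compares vector embeddings.

```python
import re

def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, punctuation stripped.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    # Crude stand-in for embedding similarity: token-set overlap.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_answer(query: str, qa_pairs: list[tuple[str, str]]) -> str:
    # Score the query against the QUESTION side of each pair,
    # then return the paired answer.
    question, answer = max(qa_pairs, key=lambda qa: jaccard(query, qa[0]))
    return answer
```

Because the answer rides along with its question, a focused one-topic answer is returned whole, which is why explicit Q&A pairs retrieve so cleanly.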
Text Documents training style:
- Best for: Policy manuals, product documentation, research content, legal documents, narrative guides
- How it works: The system indexes passages from longer documents. Queries are matched against the most semantically relevant passages.
- Document preparation: Organize documents with clear headings and logical sections. Each section should cover a single coherent topic.
- Recommended chunk size: 500 to 800 tokens per chunk to preserve context
If your knowledge base contains both types of content, consider creating separate training sets or separate agents optimized for each type.
Training Modes
Use the right training mode to balance thoroughness against time:
- Full Training - use when setting up a new knowledge base, when significantly changing the document set, or when changing the embedding model. Reprocesses everything.
- Rebuild Embeddings - use when adding new documents or updating existing content with the same embedding model. Faster than full training.
- Rebuild Index Only - use when you have changed index settings (distance function, approximate index parameters) but have not changed the document content or embedding model.
- Fetch Data Only - use to retrieve existing indexed data for inspection without reprocessing.
After any training run, use the Test feature on the agent and check the Debugger’s RAG selection items output to verify that the retrieved chunks match what you expect for representative queries.
Document Preparation Tips
The quality of retrieval is directly proportional to the quality of the source documents. Poor-quality documents produce poor-quality context regardless of the parameter settings.
Before uploading documents:
- Remove boilerplate content (headers, footers, page numbers, legal disclaimers that repeat on every page) that adds noise without adding information value
- Split very long documents into logically coherent sections when possible - a 200-page PDF is harder to chunk well than ten focused 20-page documents
- Ensure consistent terminology throughout the knowledge base - if the same concept is called by multiple names, consider adding a glossary or synonym document
- For Q&A style, write explicit answers rather than relying on the model to infer answers from surrounding context
- For text documents, use descriptive headings and subheadings - these help the chunking process produce more topically coherent chunks
- Avoid documents where critical information is only in tables, images, or charts - extract that information into plain text
- Keep content current - outdated information is retrieved with the same confidence as current information, so stale content directly degrades answer quality
Chunking and Relevance Scoring
Builder’s RAG system splits documents into chunks before embedding them. Each chunk is a contiguous segment of text. The chunking strategy interacts with the training style setting:
- Q&A training treats each question-answer pair as a unit
- Text document training uses sliding window or paragraph-based chunking
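A sliding-window chunker for the text-document style might look like the following sketch, using whitespace-separated words as a rough stand-in for tokens:

```python
def chunk_sliding(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a word sequence into overlapping windows.

    Each window holds up to chunk_size words and shares `overlap` words
    with its predecessor, so no passage is cut off at a hard boundary.
    """
    assert 0 <= overlap < chunk_size, "overlap must be smaller than the window"
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the final window already reaches the end of the text
    return chunks
```

The overlap is the reason sliding-window chunking preserves context across chunk boundaries: a sentence near the end of one window reappears at the start of the next.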
During retrieval, the system computes a similarity score between the embedded user query and each stored chunk embedding. The Minimum Confidence Threshold filters out chunks below the score cutoff, and Top N Contexts caps the number of chunks that are injected into the prompt.
The practical effect of these two parameters:
- A low threshold and high Top N: retrieves many chunks, including less relevant ones. Useful during initial testing to see what the system finds, but risks injecting irrelevant context that confuses the model.
- A high threshold and low Top N: retrieves fewer but more precise chunks. Better for production when you have confirmed that relevant content reliably scores above the threshold.
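The interaction of the two parameters can be demonstrated with a small sketch (the scores below are invented for illustration; a `top_n` of 0 means unlimited, matching the topN setting):

```python
def select_chunks(scored: list[tuple[str, float]],
                  min_confidence: float, top_n: int) -> list[tuple[str, float]]:
    # Keep chunks at or above the confidence cutoff, best first,
    # then cap the count unless top_n is 0 (unlimited).
    ranked = sorted(scored, key=lambda cs: cs[1], reverse=True)
    kept = [cs for cs in ranked if cs[1] >= min_confidence]
    return kept if top_n == 0 else kept[:top_n]

# Invented similarity scores for three chunks:
results = [("refund policy", 0.91), ("shipping times", 0.58), ("office hours", 0.22)]
```

Open settings (`select_chunks(results, 0.0, 0)`) return all three chunks, including the marginal one; strict settings (`select_chunks(results, 0.5, 1)`) return only the top match.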
Initial Configuration and Tuning Process
Section titled “Initial Configuration and Tuning Process”Step 1: Start with open settings for testing
When first configuring RAG on a node:
- Set Minimum Confidence Threshold to 0 to capture all results
- Set Top N Contexts to 0 to retrieve everything above the threshold
- Select a small, fast embedding model for quick iteration
- Run Full Training
This open configuration lets you see the full range of what the system retrieves for your test queries before you start narrowing it down.
Step 2: Run test queries and inspect retrieval
- Click the Train Model button in the LLM node configuration panel
- Wait for training to complete (time varies with document volume - small sets take minutes, large sets may take hours)
- Run a test execution from the agent toolbar
- Open the Debugger and expand the RAG selection items message for the LLM node
- Review the chunks retrieved for your query - are they the right content? Are they complete enough to answer the question?
Step 3: Tune confidence threshold and Top N
After confirming that the right content exists in the knowledge base and is being retrieved in the initial open configuration:
- Gradually increase Minimum Confidence Threshold from 0 toward 0.5 - test after each adjustment
- Note the threshold where relevant chunks start being excluded - that is your lower bound
- Reduce Top N Contexts from unlimited toward a practical number (7 to 12 is a common range for most use cases) - test after each reduction
- Stop when the agent produces accurate, complete answers with acceptable response time
Use the OpenAI tokenizer (or equivalent tool) to verify that the total tokens consumed by the retrieved context, the system message, and the user query stay within the model’s context window limit. Aim for 75% or less of the available limit to leave room for the model’s response.
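For a quick budget check without running an exact tokenizer, a rough heuristic of about four characters per English token is often close enough; the sketch below applies the 75% headroom rule from above. The heuristic and function names are illustrative, not a Builder API, and for exact counts you should still use the tokenizer for your model.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return max(1, len(text) // 4)

def within_budget(system_msg: str, user_query: str, chunks: list[str],
                  context_window: int, headroom: float = 0.75) -> tuple[bool, int]:
    # Total the estimated tokens of everything injected into the prompt and
    # compare against a fraction of the context window, leaving the
    # remainder for the model's response.
    total = (approx_tokens(system_msg)
             + approx_tokens(user_query)
             + sum(approx_tokens(c) for c in chunks))
    return total <= int(context_window * headroom), total
```

If the check fails, reduce Top N, raise the confidence threshold, or shorten the system message before considering a larger-context model.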
Step 4: Verify with a cross-platform check
If you are unsure whether poor answers are caused by retrieval quality or by the model itself, copy the retrieved context chunks from the Debugger and test them directly in another AI platform (for example ChatGPT or Claude). If those platforms also cannot produce a good answer from that context, the issue is in the retrieved content, not the model. If they can produce a good answer, the issue is in the model configuration (system message, prompt structure, or model selection).
Troubleshooting Common RAG Issues
“The agent says it does not know the answer”
Likely causes and fixes:
- Confidence threshold is too high - lower it to allow more chunks through
- The information is not in the knowledge base - search your source documents manually to confirm the answer exists, then add it if missing and retrain
- The embedding model is not capturing the query semantics well - try a larger model and retrain
“Answers are not specific enough or are too general”
Likely causes and fixes:
- Top N is too low - the model has too little context; increase Top N to retrieve more chunks
- Chunks are too small - if using Q&A style on narrative content, switch to Text Documents style
- The relevant content is buried in long passages - split source documents into shorter, more focused sections
“Responses are slow”
Likely causes and fixes:
- Top N is too high - retrieving and processing many chunks increases latency; reduce to 7 to 12
- Embedding model is too large - switch to a smaller model if accuracy allows
- Large document set without approximate indexing - enable Approximate Similarity Index for large sets
“Answers include incorrect or irrelevant information”
Likely causes and fixes:
- Confidence threshold is too low - low-relevance chunks are being retrieved and injected; raise the threshold
- Source documents contain outdated or contradictory information - review and update the knowledge base, then retrain
- Multiple topics in one chunk - restructure source documents so each section covers a single topic
Token limit exceeded
When the combined size of the retrieved context, the system message, and the user query exceeds the model’s token limit, the request will fail or the response will be truncated.
Fixes:
- Reduce Top N Contexts to retrieve fewer chunks
- Increase Minimum Confidence Threshold to retrieve only the most relevant chunks
- Shorten the system message
- Use a model with a larger context window
Performance Monitoring
After going live, track these indicators to know when to retrain or re-tune:
- Retrieval relevance - periodically sample live queries in the Debugger and manually verify that retrieved chunks are relevant
- Answer accuracy - review a sample of responses against expected answers; create a standard test query set that covers your key use cases
- Response time - if latency increases over time as the document set grows, consider enabling approximate indexing
- Token usage - monitor whether retrieved context is consuming an increasing share of the context window as the knowledge base grows
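The standard test query set mentioned above can be scored automatically. The sketch below assumes a retrieval function that returns ranked chunk strings and an expected substring per query; both names are illustrative, not part of Builder.

```python
def retrieval_hit_rate(test_cases: list[tuple[str, str]],
                       retrieve_fn, k: int = 5) -> float:
    """Fraction of test queries whose expected text appears in the top-k chunks.

    test_cases: list of (query, expected_substring) pairs.
    retrieve_fn: callable returning ranked chunk strings for a query.
    """
    hits = 0
    for query, expected in test_cases:
        top_chunks = retrieve_fn(query)[:k]
        if any(expected.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(test_cases) if test_cases else 0.0
```

Running this after each retrain or content update gives a single number to track over time; a drop signals that tuning or document cleanup is needed.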
Plan quarterly reviews of the knowledge base content to remove outdated documents and add new ones, followed by a retraining run.
Implementation Checklist
Initial setup:
- Organize and clean source documents before uploading
- Choose training style based on content type (Q&A or Text Documents)
- Select an initial embedding model (start smaller for testing)
- Set confidence threshold to 0 and Top N to 0 for initial testing
- Run Full Training and wait for completion
- Run test queries and inspect RAG selection items in the Debugger
Tuning:
- Gradually increase confidence threshold while testing retrieval quality
- Reduce Top N while verifying answer completeness
- Verify token usage stays within the model’s context window limit
- Create a standard set of test queries covering key use cases
Before going live:
- Run the full test query set and confirm all answers are accurate
- Verify response times are acceptable
- Document the final configuration settings for future reference
- Plan a schedule for knowledge base updates and retraining
Related Documentation
- Builder Debugger - inspect RAG selection items and trace retrieval behavior
- User Manual - LLM node configuration reference
- OpenAI GPT node - legacy node with RAG settings documentation