Similarity Node Documentation

Overview

The Similarity node analyzes text to find similar patterns, phrases, or meanings within your data. It helps businesses automatically identify related content, group similar customer inquiries, and find matching documents without manual review.

What It Does: Compares incoming text against your trained data to find the most similar matches based on meaning and context, not just exact word matches.
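
To make the idea of a similarity score concrete, here is a minimal sketch that ranks trained examples against an incoming message. It is an illustration only: the node's internal model is not described in this document, and this sketch uses plain word counts with cosine similarity, which is simpler than meaning-based matching. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def similarity_score(a: str, b: str) -> float:
    """Cosine similarity of two word-count vectors, scaled to 0-100."""
    va, vb = tokenize(a), tokenize(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return 100.0 * dot / norm if norm else 0.0


# Hypothetical trained examples and incoming text.
trained_examples = [
    "I was charged twice on my invoice",
    "The app crashes when I open settings",
    "Can I get a quote for ten licenses",
]
incoming = "My invoice shows I was charged two times"

# Rank the trained examples from most to least similar.
ranked = sorted(
    trained_examples,
    key=lambda example: similarity_score(incoming, example),
    reverse=True,
)
best_match = ranked[0]
print(best_match, round(similarity_score(incoming, best_match), 1))
```

The 0-100 scale is chosen here to line up with the Confidence Threshold range described below.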

Business Value: Automates content categorization, improves customer service routing, and helps organize large volumes of text data efficiently.

Configuration Parameters

Confidence Threshold

Field Name: confidenceThreshold
Type: Number input with spin buttons
Default Value: 20
Valid Range: 0 to 100
Simple Description: Sets the minimum similarity score required for a match to be considered valid
When to Change This:
- Lower values (10-30): When you want to catch more potential matches, even if they're less certain
- Higher values (50-80): When you need very confident matches only
- Very high values (above 80): For exact or near-exact matches only
Business Impact: Lower thresholds return more results but may include false positives; higher thresholds are more precise but may miss relevant matches
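
The effect of the threshold can be sketched as a simple filter over scored matches. This is a hypothetical illustration, not the node's implementation: the match structure, the function name, and the choice to treat a score equal to the threshold as valid are all assumptions.

```python
def filter_by_confidence(matches, confidence_threshold=20):
    """Keep matches whose score meets the threshold (scores on a 0-100 scale)."""
    if not 0 <= confidence_threshold <= 100:
        raise ValueError("confidence_threshold must be between 0 and 100")
    # Assumption: a score equal to the threshold counts as a valid match.
    return [m for m in matches if m["score"] >= confidence_threshold]


# Hypothetical scored matches for one incoming text.
candidates = [
    {"label": "billing", "score": 72.0},
    {"label": "technical", "score": 31.5},
    {"label": "sales", "score": 12.0},
]

loose = filter_by_confidence(candidates, 20)   # default threshold
strict = filter_by_confidence(candidates, 60)  # high-confidence only
print([m["label"] for m in loose], [m["label"] for m in strict])
```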

Max Output Count

Field Name: maxOutputCount
Type: Number input
Default Value: 1
Valid Range: 0 (unlimited) or any positive whole number
Simple Description: Limits how many similar matches the node will return
When to Change This:
- Set to 1: When you only need the best match
- Set to 3-5: For multiple good options to review
- Set to 0: When you want all matches above the confidence threshold
Business Impact: Controls the volume of results and processing time; more results provide more options but take longer to review
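
As a rough sketch of how this limit might behave, the function below sorts matches by score and truncates the list, treating 0 as "no limit" as described above. The function name and match structure are hypothetical, and the assumption that results are returned best-first is not confirmed by this document.

```python
def limit_matches(matches, max_output_count=1):
    """Return the highest-scoring matches; a count of 0 means no limit."""
    if max_output_count < 0:
        raise ValueError("max_output_count must be 0 or a positive number")
    # Assumption: results are ordered from best to worst score.
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    return ranked if max_output_count == 0 else ranked[:max_output_count]


# Hypothetical matches that already passed the confidence threshold.
valid_matches = [
    {"label": "technical", "score": 31.5},
    {"label": "billing", "score": 72.0},
    {"label": "accounts", "score": 44.0},
]

best_only = limit_matches(valid_matches, 1)
everything = limit_matches(valid_matches, 0)
print(best_only, len(everything))
```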

Use Custom Vector Files

Field Name: customVectorFiles
Type: Toggle switch (On/Off)
Default Value: Off
Simple Description: Enables the use of your own pre-trained similarity models instead of the default system models
When to Change This:
- Off: Use TheoBuilder's built-in similarity models (recommended for most users)
- On: Use your own specialized models for industry-specific terminology or custom training data
Business Impact: Custom models can provide more accurate results for specialized industries but require technical setup

Advanced File Configuration (Only When Custom Vector Files is Enabled)

Vectorizer Filename

Field Name: filePathVectorizer
Type: Text field
Expected Format: Filename with extension (e.g., "my_vectorizer.pkl")
Simple Description: The file that converts text into numerical data for comparison
When to Change This: When you have a custom-trained vectorizer specific to your industry or use case

Training Vectors Filename

Field Name: filePathTrainingVectors
Type: Text field
Expected Format: Filename with extension (e.g., "training_data.npy")
Simple Description: The file containing your pre-processed training data
When to Change This: When you want to use your own training dataset instead of the default

Frame Filename

Field Name: filePathFrame
Type: Text field
Expected Format: Filename with extension (e.g., "data_frame.csv")
Simple Description: The file containing the structure and labels for your training data
When to Change This: When you have custom data organization or labeling requirements

Real-World Use Cases

Customer Support Ticket Routing

Business Situation: A software company receives 200+ support tickets daily and wants to automatically route them to the right specialist teams.

What You'll Configure:

- Set confidence threshold to 40 for balanced accuracy
- Set max output count to 3 to give options for complex tickets
- Keep custom vector files disabled to use built-in models
- Train the model with examples of tickets for each team (billing, technical, sales)
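
With max output count set to 3, a downstream step has to turn up to three matches into one routing suggestion. One possible approach, sketched below, is to add up the scores per team and pick the highest total. This is an assumption about how you might post-process the node's output, not something the node does itself; the `team` and `score` field names are hypothetical.

```python
from collections import defaultdict


def suggest_team(matches):
    """Pick the team whose matched tickets have the highest combined score."""
    totals = defaultdict(float)
    for match in matches:
        totals[match["team"]] += match["score"]
    if not totals:
        return None  # no match cleared the confidence threshold
    return max(totals, key=totals.get)


# Hypothetical top-3 matches for one new ticket (threshold 40 already applied).
top_matches = [
    {"team": "technical", "score": 58.0},
    {"team": "billing", "score": 55.0},
    {"team": "billing", "score": 47.0},
]

suggested = suggest_team(top_matches)
print(suggested)
```

Summing scores favors a team that appears several times over a single slightly stronger match; taking only the top match is the simpler alternative when max output count is 1.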

What Happens: When new tickets arrive, the node analyzes the content and suggests which team should handle it based on similarity to previous tickets.

Business Value: Reduces ticket routing time by 75% and improves first-response accuracy by 60%.

Product Recommendation Engine

Business Situation: An e-commerce retailer wants to suggest similar products to customers based on their browsing history and product descriptions.

What You'll Configure:

- Set confidence threshold to 25 for broader recommendations
- Set max output count to 5 to show multiple options
- Use custom vector files if you have product-specific training data
- Train with product descriptions and customer behavior data

What Happens: When customers view a product, the system automatically suggests similar items they might be interested in.

Business Value: Increases average order value by 23% and improves customer satisfaction through relevant suggestions.

Document Classification

Business Situation: A legal firm needs to automatically categorize incoming contracts and legal documents into practice areas.

What You'll Configure:

- Set confidence threshold to 60 for high accuracy in legal classification
- Set max output count to 2 to provide primary and secondary categories
- Consider custom vector files for legal terminology
- Train with examples from each practice area

What Happens: New documents are automatically tagged with practice areas, making them searchable and properly routed to the right attorneys.

Business Value: Saves 15 hours per week on manual document sorting and improves case preparation efficiency.

Step-by-Step Configuration

Setting Up Basic Similarity Matching

Adding the Node:
- Drag the Similarity node from the AI Tools section onto your workflow canvas
- Connect it to your data source node using the arrow connector
Configuring Similarity Settings:
- Click on the Similarity node to open the configuration panel
- In the "Confidence Threshold" field, enter your desired minimum match score (start with 20 for testing)
- In the "Max Output Count" field, enter how many results you want (1 for single best match, 3-5 for multiple options)
Training Your Model:
- Click the "Train Model" button to process your training data
- Wait for the training to complete (this may take several minutes for large datasets)
- Click the "Log" button to review training results and any errors
Testing Your Configuration:
- Use the workflow test feature to send sample data through the node
- Review the similarity scores and matches returned
- Adjust the confidence threshold if needed based on results
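
When adjusting the threshold from test results, it can help to compare a few candidate values side by side. The sketch below assumes you have recorded, for each test sample, the returned score and whether the match was actually correct; that record format and the function name are hypothetical.

```python
def sweep_thresholds(results, thresholds):
    """For each threshold, count the matches kept and the share that were correct."""
    report = []
    for threshold in thresholds:
        kept = [r for r in results if r["score"] >= threshold]
        correct = sum(1 for r in kept if r["correct"])
        precision = correct / len(kept) if kept else None
        report.append(
            {"threshold": threshold, "kept": len(kept), "precision": precision}
        )
    return report


# Hypothetical test run: one best match per sample, reviewed by hand.
test_results = [
    {"score": 82.0, "correct": True},
    {"score": 64.0, "correct": True},
    {"score": 41.0, "correct": False},
    {"score": 35.0, "correct": True},
    {"score": 22.0, "correct": False},
]

report = sweep_thresholds(test_results, [20, 40, 60])
for row in report:
    print(row)
```

A higher threshold keeps fewer matches but a larger share of them is correct, which is the trade-off described under Confidence Threshold.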

Setting Up Custom Vector Files (Advanced)

Enabling Custom Files:
- Toggle "Use Custom Vector Files" to the On position
- Three additional text fields will appear
Uploading Your Files:
- Upload all three files to your TheoBuilder file storage
- Enter your vectorizer filename in the "Vectorizer Filename" field
- Enter your training data filename in the "Training Vectors Filename" field
- Enter your data structure filename in the "Frame Filename" field
Validating Custom Setup:
- Click "Train Model" to test your custom configuration
- Check the log for any file loading errors
- Verify that similarity results match your expectations

Industry Applications

Healthcare Organizations

Common Challenge: Medical practices need to categorize patient inquiries and symptoms to route them to appropriate specialists.

How This Node Helps: Analyzes patient messages and symptoms to suggest the most appropriate medical department or specialist.

Configuration Recommendations:

- Confidence threshold: 45 (medical accuracy is important)
- Max output count: 2 (primary and secondary specialist options)
- Custom vector files: Consider enabling for medical terminology
- Train with anonymized patient inquiry examples

Results: Reduces patient wait times by 40% and improves specialist referral accuracy by 55%.

Financial Services

Common Challenge: Banks and credit unions need to automatically categorize and route customer inquiries about different financial products and services.

How This Node Helps: Matches customer questions to the most relevant financial service category and routes to appropriate specialists.

Configuration Recommendations:

- Confidence threshold: 50 (financial accuracy is critical)
- Max output count: 1 (clear routing decisions needed)
- Custom vector files: Recommended for financial terminology
- Train with examples of inquiries for loans, investments, accounts, etc.

Results: Improves customer service efficiency by 45% and reduces inquiry resolution time by 30%.

E-learning Platforms

Common Challenge: Educational platforms need to recommend relevant courses and learning materials based on student interests and progress.

How This Node Helps: Analyzes course content and student preferences to suggest the most relevant learning materials.

Configuration Recommendations:

- Confidence threshold: 30 (broader recommendations encourage exploration)
- Max output count: 5 (multiple learning options)
- Custom vector files: Optional for specialized subjects
- Train with course descriptions and student engagement data

Results: Increases course completion rates by 35% and improves student satisfaction scores by 28%.

Training and Optimization

Model Training Process

The Similarity node requires training to understand your specific data patterns:

- Prepare Training Data: Gather examples of the content you want to match, with clear categories or labels
- Initial Training: Click "Train Model" to process your data and create similarity patterns
- Review Results: Use the "Log" button to check training success and identify any issues
- Test and Refine: Run test data through the node and adjust confidence thresholds based on results
- Ongoing Updates: Retrain periodically as you add new data or categories

Performance Optimization Tips

- Start Conservative: Begin with higher confidence thresholds and lower them gradually
- Monitor Results: Regularly review similarity matches to ensure accuracy
- Update Training Data: Add new examples monthly to improve accuracy
- Balance Speed vs. Accuracy: More training data improves results but increases processing time

Troubleshooting Common Issues

Low Match Accuracy

Symptom: Getting irrelevant or poor-quality matches
Solution: Increase the confidence threshold or add more diverse training examples

Too Few Results

Symptom: Node returns very few or no matches
Solution: Lower the confidence threshold or increase the max output count

Training Failures

Symptom: Model training doesn't complete successfully
Solution: Check the log for errors, verify training data format, ensure sufficient examples per category
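
A quick way to check for thin categories before training is to count the examples under each label. The sketch below is a hypothetical pre-training check; the minimum of 10 examples per category is an assumed value for illustration, since this document does not state a required number.

```python
from collections import Counter


def underrepresented_categories(labels, min_examples=10):
    """Return the categories that have fewer than min_examples training examples."""
    counts = Counter(labels)
    return sorted(cat for cat, n in counts.items() if n < min_examples)


# Hypothetical label column from a training set.
labels = ["billing"] * 12 + ["technical"] * 15 + ["sales"] * 3

sparse = underrepresented_categories(labels, min_examples=10)
print(sparse)
```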

Custom File Errors

Symptom: Custom vector files won't load
Solution: Verify file formats, check file paths, ensure files are properly uploaded to TheoBuilder storage
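
The checks above can be sketched as a small pre-flight script run before uploading. It assumes the three files sit in a local folder and that they use the extensions shown in the examples earlier (.pkl, .npy, .csv); those extensions are only examples in this document, so adjust the mapping if your files differ. The function name and the folder argument are hypothetical.

```python
from pathlib import Path

# Assumption: extensions taken from the example filenames in this document.
EXPECTED_SUFFIXES = {
    "filePathVectorizer": ".pkl",
    "filePathTrainingVectors": ".npy",
    "filePathFrame": ".csv",
}


def check_custom_files(config, storage_dir):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field, suffix in EXPECTED_SUFFIXES.items():
        name = config.get(field, "").strip()
        if not name:
            problems.append(f"{field}: no filename set")
            continue
        path = Path(storage_dir) / name
        if path.suffix.lower() != suffix:
            problems.append(f"{field}: expected a {suffix} file, got '{name}'")
        if not path.is_file():
            problems.append(f"{field}: '{name}' not found in {storage_dir}")
    return problems


# Hypothetical configuration using the example filenames from this document.
example_config = {
    "filePathVectorizer": "my_vectorizer.pkl",
    "filePathTrainingVectors": "training_data.npy",
    "filePathFrame": "data_frame.csv",
}
for problem in check_custom_files(example_config, "."):
    print(problem)
```

This only checks names and presence; it does not confirm that the file contents are compatible with the node.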

The Similarity node transforms how businesses handle text analysis and content matching, providing intelligent automation that learns from your specific data patterns and business needs.
