Similarity Node Documentation
Overview
The Similarity node compares text to find content with similar wording or meaning in your data. It helps businesses automatically identify related content, group similar customer inquiries, and find matching documents without manual review.
What It Does: Compares incoming text against your training data to find the most similar matches based on meaning and context, not just exact word matches.
Business Value: Automates content categorization, improves customer service routing, and helps organize large volumes of text data efficiently.
Configuration Parameters
Confidence Threshold
- Field Name: confidenceThreshold
- Type: Number input with spin buttons
- Default Value: 20
- Valid Range: 0 to 100
- Simple Description: Sets the minimum similarity score required for a match to be considered valid
- When to Change This:
- Lower values (10-30): When you want to catch more potential matches, even if they're less certain
- Higher values (50-80): When you need very confident matches only
- Very high values (above 80): For exact or near-exact matches only
- Business Impact: Lower thresholds return more results but may include false positives; higher thresholds are more precise but may miss relevant matches
Max Output Count
- Field Name: maxOutputCount
- Type: Number input
- Default Value: 1
- Valid Range: 0 (unlimited) or any positive whole number
- Simple Description: Limits how many similar matches the node will return
- When to Change This:
- Set to 1: When you only need the best match
- Set to 3-5: For multiple good options to review
- Set to 0: When you want all matches above the confidence threshold
- Business Impact: Controls the volume of results and processing time. More results provide options but require more review time
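To see how Confidence Threshold and Max Output Count work together, here is a minimal Python sketch. It is an illustration only, not TheoBuilder's internal code: the function name select_matches is hypothetical, and it assumes that scores use the same 0 to 100 scale as the threshold and that a score equal to the threshold counts as a match.

```python
from typing import List, Tuple


def select_matches(
    scored: List[Tuple[str, float]],
    confidence_threshold: float = 20,
    max_output_count: int = 1,
) -> List[Tuple[str, float]]:
    """Keep matches at or above the threshold, best first, up to the limit."""
    # Assumption: scores and threshold share the same 0-100 scale.
    passing = [m for m in scored if m[1] >= confidence_threshold]
    passing.sort(key=lambda m: m[1], reverse=True)
    if max_output_count == 0:  # 0 means "no limit" per the parameter description
        return passing
    return passing[:max_output_count]


# Made-up candidate matches: (label, similarity score)
candidates = [("billing", 62.0), ("technical", 35.5), ("sales", 12.0)]

print(select_matches(candidates))  # [('billing', 62.0)]
print(select_matches(candidates, confidence_threshold=20, max_output_count=0))
```

With the defaults (20 and 1) only the best match comes back. Setting the count to 0 returns every candidate that clears the threshold, so "sales" at 12.0 is still dropped.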
Use Custom Vector Files
- Field Name: customVectorFiles
- Type: Toggle switch (On/Off)
- Default Value: Off
- Simple Description: Enables the use of your own pre-trained similarity models instead of the default system models
- When to Change This:
- Off: Use TheoBuilder's built-in similarity models (recommended for most users)
- On: Use your own specialized models for industry-specific terminology or custom training data
- Business Impact: Custom models can provide more accurate results for specialized industries but require technical setup
Advanced File Configuration (Only When Custom Vector Files is Enabled)
Vectorizer Filename
- Field Name: filePathVectorizer
- Type: Text field
- Expected Format: Filename with extension (e.g., "my_vectorizer.pkl")
- Simple Description: The file that converts text into numerical data for comparison
- When to Change This: When you have a custom-trained vectorizer specific to your industry or use case
Training Vectors Filename
- Field Name: filePathTrainingVectors
- Type: Text field
- Expected Format: Filename with extension (e.g., "training_data.npy")
- Simple Description: The file containing your pre-processed training data
- When to Change This: When you want to use your own training dataset instead of the default
Frame Filename
- Field Name: filePathFrame
- Type: Text field
- Expected Format: Filename with extension (e.g., "data_frame.csv")
- Simple Description: The file containing the structure and labels for your training data
- When to Change This: When you have custom data organization or labeling requirements
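If you keep a record of your node settings outside the configuration panel, a small check like the one below can catch out-of-range values before you train. This is a sketch under assumptions: the settings are normally entered in the TheoBuilder interface, so the dictionary layout and the function validate_similarity_config are hypothetical. Only the field names, defaults, and ranges come from the parameter descriptions above.

```python
from typing import Any, Dict, List

# Field names as listed in the parameter descriptions.
FILE_FIELDS = ("filePathVectorizer", "filePathTrainingVectors", "filePathFrame")


def validate_similarity_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []

    threshold = config.get("confidenceThreshold", 20)  # documented default
    if not 0 <= threshold <= 100:
        problems.append("confidenceThreshold must be between 0 and 100")

    max_count = config.get("maxOutputCount", 1)  # documented default
    if not isinstance(max_count, int) or max_count < 0:
        problems.append("maxOutputCount must be 0 (unlimited) or a positive whole number")

    # The three filename fields only matter when custom files are switched on.
    if config.get("customVectorFiles", False):
        for field in FILE_FIELDS:
            if not str(config.get(field, "")).strip():
                problems.append(f"{field} is required when customVectorFiles is on")

    return problems


example = {"confidenceThreshold": 40, "maxOutputCount": 3, "customVectorFiles": False}
print(validate_similarity_config(example))  # []
```

Returning a list of messages instead of stopping at the first error lets you fix every setting in one pass.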
Real-World Use Cases
Customer Support Ticket Routing
Business Situation: A software company receives 200+ support tickets daily and wants to automatically route them to the right specialist teams.
What You'll Configure:
- Set confidence threshold to 40 for balanced accuracy
- Set max output count to 3 to give options for complex tickets
- Keep custom vector files disabled to use built-in models
- Train the model with examples of tickets for each team (billing, technical, sales)
What Happens: When new tickets arrive, the node analyzes the content and suggests which team should handle it based on similarity to previous tickets.
Business Value: Reduces ticket routing time by 75% and improves first-response accuracy by 60%.
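The routing step that follows the node can be sketched in a few lines of Python. Everything here is illustrative: the queue names, the TEAM_QUEUES mapping, and the route_ticket function are hypothetical, and the sketch assumes the node's matches arrive as (label, score) pairs.

```python
from typing import List, Tuple

# Hypothetical mapping from training labels to team queues.
TEAM_QUEUES = {
    "billing": "billing-team",
    "technical": "tech-support",
    "sales": "sales-team",
}


def route_ticket(matches: List[Tuple[str, float]], fallback: str = "manual-triage") -> str:
    """Pick a queue from the best match, or fall back when nothing matched."""
    if not matches:  # nothing cleared the confidence threshold
        return fallback
    best_label, _score = max(matches, key=lambda m: m[1])
    # Unknown labels also go to the fallback queue rather than being dropped.
    return TEAM_QUEUES.get(best_label, fallback)


print(route_ticket([("technical", 71.0), ("billing", 44.0)]))  # tech-support
print(route_ticket([]))  # manual-triage
```

A fallback queue matters because a threshold of 40 will sometimes leave a ticket with no match at all, and those tickets still need a person to look at them.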
Product Recommendation Engine
Business Situation: An e-commerce retailer wants to suggest similar products to customers based on their browsing history and product descriptions.
What You'll Configure:
- Set confidence threshold to 25 for broader recommendations
- Set max output count to 5 to show multiple options
- Use custom vector files if you have product-specific training data
- Train with product descriptions and customer behavior data
What Happens: When customers view a product, the system automatically suggests similar items they might be interested in.
Business Value: Increases average order value by 23% and improves customer satisfaction through relevant suggestions.
Document Classification
Business Situation: A legal firm needs to automatically categorize incoming contracts and legal documents into practice areas.
What You'll Configure:
- Set confidence threshold to 60 for high accuracy in legal classification
- Set max output count to 2 to provide primary and secondary categories
- Consider custom vector files for legal terminology
- Train with examples from each practice area
What Happens: New documents are automatically tagged with practice areas, making them searchable and properly routed to the right attorneys.
Business Value: Saves 15 hours per week on manual document sorting and improves case preparation efficiency.
Step-by-Step Configuration
Setting Up Basic Similarity Matching
- Adding the Node:
- Drag the Similarity node from the AI Tools section onto your workflow canvas
- Connect it to your data source node using the arrow connector
- Configuring Similarity Settings:
- Click on the Similarity node to open the configuration panel
- In the "Confidence Threshold" field, enter your desired minimum match score (start with 20 for testing)
- In the "Max Output Count" field, enter how many results you want (1 for single best match, 3-5 for multiple options)
- Training Your Model:
- Click the "Train Model" button to process your training data
- Wait for the training to complete (this may take several minutes for large datasets)
- Click the "Log" button to review training results and any errors
- Testing Your Configuration:
- Use the workflow test feature to send sample data through the node
- Review the similarity scores and matches returned
- Adjust the confidence threshold if needed based on results
Setting Up Custom Vector Files (Advanced)
- Enabling Custom Files:
- Toggle "Use Custom Vector Files" to the On position
- Three additional text fields will appear
- Uploading Your Files:
- Upload all three files to your TheoBuilder file storage
- Enter your vectorizer filename in the "Vectorizer Filename" field
- Enter your training data filename in the "Training Vectors Filename" field
- Enter your data structure filename in the "Frame Filename" field
- Validating Custom Setup:
- Click "Train Model" to test your custom configuration
- Check the log for any file loading errors
- Verify that similarity results match your expectations
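To make the roles of the three custom files more concrete, the toy Python sketch below shows one way a vectorizer, a set of training vectors, and a frame of labels could fit together. It is not TheoBuilder's implementation and does not read real .pkl, .npy, or .csv files. As a deliberate simplification it uses plain word counts with cosine similarity, which matches on shared words only, whereas the node is described as matching on meaning. All names in the sketch are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def vectorize(text: str) -> Dict[str, int]:
    """Toy stand-in for the vectorizer: turns text into lowercase word counts."""
    return Counter(text.lower().split())


def cosine_percent(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two count vectors, scaled to 0-100."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 100.0 * dot / norm if norm else 0.0


# Stand-in for the frame: one row of label and text per training example.
frame = [
    {"label": "billing", "text": "refund for duplicate invoice charge"},
    {"label": "technical", "text": "app crashes when I upload a file"},
]

# Stand-in for the training vectors: entry i is the vector for frame row i.
training_vectors = [vectorize(row["text"]) for row in frame]


def most_similar(text: str) -> Tuple[str, float]:
    """Vectorize the query, score it against every row, return the best label."""
    query = vectorize(text)
    scores = [cosine_percent(query, v) for v in training_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return frame[best]["label"], scores[best]


print(most_similar("duplicate charge on my invoice"))
```

The point to take away is the alignment: the training vectors and the frame must describe the same examples in the same order, and the vectorizer used at match time must be the one that produced the training vectors.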
Industry Applications
Healthcare Organizations
Common Challenge: Medical practices need to categorize patient inquiries and symptoms to route them to appropriate specialists.
How This Node Helps: Analyzes patient messages and symptoms to suggest the most appropriate medical department or specialist.
Configuration Recommendations:
- Confidence threshold: 45 (medical accuracy is important)
- Max output count: 2 (primary and secondary specialist options)
- Custom vector files: Consider enabled for medical terminology
- Train with anonymized patient inquiry examples
Results: Reduces patient wait times by 40% and improves specialist referral accuracy by 55%.
Financial Services
Common Challenge: Banks and credit unions need to automatically categorize and route customer inquiries about different financial products and services.
How This Node Helps: Matches customer questions to the most relevant financial service category and routes to appropriate specialists.
Configuration Recommendations:
- Confidence threshold: 50 (financial accuracy is critical)
- Max output count: 1 (clear routing decisions needed)
- Custom vector files: Recommended for financial terminology
- Train with examples of inquiries for loans, investments, accounts, etc.
Results: Improves customer service efficiency by 45% and reduces inquiry resolution time by 30%.
E-learning Platforms
Common Challenge: Educational platforms need to recommend relevant courses and learning materials based on student interests and progress.
How This Node Helps: Analyzes course content and student preferences to suggest the most relevant learning materials.
Configuration Recommendations:
- Confidence threshold: 30 (broader recommendations encourage exploration)
- Max output count: 5 (multiple learning options)
- Custom vector files: Optional for specialized subjects
- Train with course descriptions and student engagement data
Results: Increases course completion rates by 35% and improves student satisfaction scores by 28%.
Training and Optimization
Model Training Process
The Similarity node requires training to understand your specific data patterns:
- Prepare Training Data: Gather examples of the content you want to match, with clear categories or labels
- Initial Training: Click "Train Model" to process your data and create similarity patterns
- Review Results: Use the "Log" button to check training success and identify any issues
- Test and Refine: Run test data through the node and adjust confidence thresholds based on results
- Ongoing Updates: Retrain periodically as you add new data or categories
Performance Optimization Tips
- Start Conservative: Begin with higher confidence thresholds and lower them gradually
- Monitor Results: Regularly review similarity matches to ensure accuracy
- Update Training Data: Add new examples monthly to improve accuracy
- Balance Speed vs. Accuracy: More training data improves results but increases processing time
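One way to apply the "Start Conservative" tip is to review a hand-checked test run at several thresholds before settling on one. The Python sketch below assumes you have recorded, for each test item, the best-match score and whether a person judged that match correct. The function sweep_thresholds and the sample numbers are made up for illustration.

```python
from typing import List, Tuple


def sweep_thresholds(
    results: List[Tuple[float, bool]],
    thresholds: Tuple[int, ...] = (80, 60, 40, 20),
) -> List[Tuple[int, int, float]]:
    """For each threshold, return (threshold, matches kept, share correct)."""
    report = []
    for t in thresholds:
        # Keep only the correctness flags of matches that clear this threshold.
        kept = [ok for score, ok in results if score >= t]
        share = sum(kept) / len(kept) if kept else 0.0
        report.append((t, len(kept), share))
    return report


# Made-up test run: (best-match score, was the match correct?)
test_results = [(91, True), (72, True), (55, True), (48, False), (33, True), (22, False)]

for threshold, kept, share in sweep_thresholds(test_results):
    print(f"threshold={threshold}: kept {kept}, {share:.0%} correct")
```

In this sample, dropping from 60 to 40 doubles the number of matches kept but lets the first incorrect one through, which is the trade-off described under Confidence Threshold.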
Troubleshooting Common Issues
Low Match Accuracy
- Symptom: Getting irrelevant or poor-quality matches
- Solution: Increase confidence threshold or add more diverse training examples
Too Few Results
- Symptom: Node returns very few or no matches
- Solution: Lower confidence threshold, or increase max output count (0 returns all matches above the threshold)
Training Failures
- Symptom: Model training doesn't complete successfully
- Solution: Check the log for errors, verify training data format, ensure sufficient examples per category
Custom File Errors
- Symptom: Custom vector files won't load
- Solution: Verify file formats, check file paths, ensure files are properly uploaded to TheoBuilder storage
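Before uploading, you can run a quick check on a local copy of your files. The Python sketch below is a hypothetical helper, not a TheoBuilder feature: it assumes the files sit in a local folder, and it takes the expected extensions from the filename examples in this guide, so other formats may also be valid on the platform.

```python
from pathlib import Path
from typing import Dict, List

# Extensions taken from the examples in this guide; other formats may be valid.
EXPECTED_SUFFIX = {
    "filePathVectorizer": ".pkl",
    "filePathTrainingVectors": ".npy",
    "filePathFrame": ".csv",
}


def check_custom_files(storage_dir: str, filenames: Dict[str, str]) -> List[str]:
    """Return a list of problems found with the three custom file entries."""
    problems = []
    for field, suffix in EXPECTED_SUFFIX.items():
        name = filenames.get(field, "").strip()
        if not name:
            problems.append(f"{field}: no filename entered")
            continue
        path = Path(storage_dir) / name
        if path.suffix.lower() != suffix:
            problems.append(f"{field}: expected a {suffix} file, got '{name}'")
        if not path.is_file():
            problems.append(f"{field}: '{name}' not found in {storage_dir}")
    return problems


names = {
    "filePathVectorizer": "my_vectorizer.pkl",
    "filePathTrainingVectors": "training_data.npy",
    "filePathFrame": "data_frame.csv",
}
for problem in check_custom_files(".", names):
    print(problem)
```

An empty result only confirms that the names and extensions line up. It does not confirm that the contents of the files are compatible with each other, so still check the training log after clicking "Train Model".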
The Similarity node transforms how businesses handle text analysis and content matching, providing intelligent automation that learns from your specific data patterns and business needs.