
Feature Extraction Node - Synthreo Builder

Feature Extraction node for Builder - identify and extract named entities such as people, organizations, dates, locations, and monetary amounts from raw text, producing structured data for downstream analysis and automation.

The Feature Extraction node automatically identifies and extracts specific types of information (features) from text data, such as names, dates, locations, organizations, and monetary amounts. This AI-driven node helps businesses structure unstructured text data for analysis, reporting, and automation.

The Feature Extraction node analyzes text content and identifies meaningful entities like person names, company names, dates, locations, and other important information. It uses advanced natural language processing to understand context and extract relevant data points that would otherwise require manual review.

Business Value: Automatically processes large volumes of text documents, emails, or customer feedback to extract key information, saving hours of manual data entry and ensuring consistent, accurate results.

Column Name

  • Field Name: sourcePropName
  • Type: Smart text field with data suggestions
  • Default Value: Empty
  • Simple Description: The name of the column containing the text you want to analyze for feature extraction
  • When to Change This: Select the specific column from your data that contains the text content you want to process
  • Business Impact: Choosing the correct source column ensures the node analyzes the right text content and produces accurate results

Language

  • Field Name: languageModel
  • Type: Dropdown menu with options:
    • English - Optimized for English text analysis with highest accuracy
    • German - Specialized for German language text processing
    • French - Configured for French language content analysis
    • Spanish - Tailored for Spanish text feature extraction
    • Undetermined - Attempts to auto-detect language (use when language varies)
  • Default Value: English
  • Simple Description: The primary language of your text content for optimal feature extraction accuracy
  • When to Change This: Match this to the language of your source text data
  • Business Impact: Correct language selection improves extraction accuracy by up to 40% and reduces false positives

Select Features

  • Field Name: selectedFeatures
  • Type: Multi-select tag box with options:
    • Cardinal - Numbers and quantities (e.g., “five”, “100”, “dozen”)
    • Date - Dates and time references (e.g., “January 15”, “next week”, “2024”)
    • Event - Named events and occasions (e.g., “Christmas”, “Super Bowl”, “conference”)
    • Fac - Facilities and buildings (e.g., “airport”, “stadium”, “hospital”)
    • Gpe - Countries, cities, states (e.g., “United States”, “California”, “London”)
    • Gpe_from - Origin locations in travel or shipping contexts
    • Gpe_to - Destination locations in travel or shipping contexts
    • Language - Language names (e.g., “English”, “Spanish”, “Mandarin”)
    • Law - Legal documents, laws, acts (e.g., “GDPR”, “Constitution”, “Patent Act”)
    • Loc - Geographic locations and landmarks (e.g., “Pacific Ocean”, “Mount Everest”)
    • Money - Monetary amounts and currencies (e.g., “$100”, “fifty dollars”, “€25”)
    • Norp - Nationalities, religious groups, political groups (e.g., “American”, “Buddhist”, “Republican”)
    • Ordinal - Ordinal numbers (e.g., “first”, “second”, “21st”)
    • Org - Organizations and companies (e.g., “Microsoft”, “UN”, “Harvard University”)
    • Percent - Percentage values (e.g., “25%”, “fifty percent”)
    • Person - People’s names (e.g., “John Smith”, “Dr. Johnson”)
    • Product - Products, brands, and services (e.g., “iPhone”, “Coca-Cola”)
    • Quantity - Measurements and quantities (e.g., “5 miles”, “two hours”, “10 kg”)
    • Time - Time expressions (e.g., “3 PM”, “morning”, “midnight”)
    • WORK_OF_ART - Creative works (e.g., “Mona Lisa”, “Star Wars”, “Beethoven’s 9th”)
  • Default Value: None selected
  • Simple Description: Choose which types of information you want to extract from your text
  • When to Change This: Select only the features relevant to your business needs to avoid information overload
  • Business Impact: Focused feature selection improves processing speed and reduces noise in your extracted data

Option

  • Field Name: outTransformId
  • Type: Dropdown menu with options:
    • Original with appended result column - Keeps all original data and adds extracted features in a new column
    • Return result column only - Returns only the extracted features, removing original text data
  • Default Value: Empty (must be selected)
  • Simple Description: How you want the extracted features to be formatted in your output data
  • When to Change This: Choose “Original with appended” to keep source data for reference, or “Result only” for clean feature-focused output
  • Business Impact: Proper output formatting ensures your downstream processes receive data in the expected structure

Column Name

  • Field Name: outColumnName
  • Type: Text field
  • Default Value: Empty
  • Simple Description: The name for the new column that will contain your extracted features
  • When to Change This: Use descriptive names like “extracted_entities” or “customer_mentions” for easy identification
  • Business Impact: Clear column naming improves data organization and makes results easier to understand for your team

When features are extracted, the result column contains a structured object where each selected feature type becomes a key, and the value is an array of all matched text strings found in the source. For example, extracting Person, Org, and Date from a sentence might produce output like the following.

{
  "Person": ["John Smith", "Dr. Johnson"],
  "Org": ["Acme Corp", "Harvard University"],
  "Date": ["January 15", "next week"]
}

When no entities of a given type are found in the source text, the corresponding key will contain an empty array. Downstream nodes can reference these values using standard property path expressions such as extracted_entities.Person[0] to access the first person name found.
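As a concrete illustration, the structure above can be navigated in plain code (for example, inside a Custom Script node). This is a sketch assuming the result column deserializes to a plain object as shown; the variable names are illustrative.

```javascript
// Hypothetical sample of the result-column object described above.
const extracted_entities = {
  Person: ["John Smith", "Dr. Johnson"],
  Org: ["Acme Corp", "Harvard University"],
  Date: []  // no Date entities were found in this record
};

// Equivalent of the property path extracted_entities.Person[0]
const firstPerson = extracted_entities.Person[0]; // "John Smith"

// Guard against empty arrays before reading a value.
const firstDate = extracted_entities.Date.length > 0
  ? extracted_entities.Date[0]
  : null; // null here, since no dates were found
```

Because absent entity types still appear as empty arrays, the length check above is the safe pattern before indexing into any entity list.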

| Feature Label | What It Extracts | Example Values |
| --- | --- | --- |
| Cardinal | Raw numbers and counted quantities | “five”, “100”, “a dozen” |
| Date | Calendar dates and relative date expressions | “January 15”, “last Tuesday”, “2025” |
| Event | Named events and scheduled occasions | “World Cup”, “annual summit”, “Black Friday” |
| Fac | Named facilities and built structures | “JFK Airport”, “Madison Square Garden” |
| Gpe | Geo-political entities (countries, cities, states) | “France”, “Chicago”, “Ontario” |
| Gpe_from | Origin locations in directional context | “from London”, “departing Tokyo” |
| Gpe_to | Destination locations in directional context | “to Berlin”, “arriving in Sydney” |
| Language | Human language names | “Mandarin”, “Portuguese”, “Arabic” |
| Law | Named laws, acts, regulations | “GDPR”, “Sarbanes-Oxley”, “HIPAA” |
| Loc | Non-GPE geographic features and landmarks | “the Amazon River”, “Mount Fuji” |
| Money | Monetary amounts with or without currency | “$250”, “fifty euros”, “two million dollars” |
| Norp | Nationalities, ethnic groups, political affiliations | “Canadian”, “Buddhist”, “Democrat” |
| Ordinal | Position or rank expressed as ordinal numbers | “third”, “21st”, “last” |
| Org | Companies, agencies, and institutions | “IBM”, “the United Nations”, “MIT” |
| Percent | Percentage values and rates | “30%”, “half”, “three-quarters” |
| Person | Names of real or fictional people | “Marie Curie”, “CEO Jane Doe” |
| Product | Brand names, product lines, and services | “Tesla Model 3”, “Windows 11” |
| Quantity | Measurements with units | “10 kilometers”, “500 mg”, “two hours” |
| Time | Time-of-day expressions | “noon”, “3:45 PM”, “early morning” |
| WORK_OF_ART | Titles of creative works | “Pride and Prejudice”, “The Beatles” |

Business Situation: A retail company receives thousands of customer reviews and wants to automatically identify mentioned products, competitors, and sentiment-related entities.

What You’ll Configure:

  • Set “Column Name” to “review_text” (your review content column)
  • Choose “English” from the Language dropdown
  • Select features: Person, Org, Product, Money, Percent
  • Choose “Original with appended result column” for output option
  • Name the output column “extracted_entities”

What Happens: The node processes each review and identifies customer names, competitor mentions, product references, prices, and percentage ratings, creating a structured dataset for analysis.
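A downstream step (for example, a Custom Script node) could then tally those mentions across the batch. This is a hypothetical sketch: the review rows, company names, and the shape of the “extracted_entities” column are illustrative assumptions based on the configuration above.

```javascript
// Hypothetical batch of reviews already processed by the node,
// with the "extracted_entities" column configured above.
const processedReviews = [
  { review_text: "Loving my new Contoso blender",
    extracted_entities: { Org: ["Contoso"], Product: ["blender"] } },
  { review_text: "Contoso beats Fabrikam on price",
    extracted_entities: { Org: ["Contoso", "Fabrikam"], Product: [] } }
];

// Count how often each organization appears across all reviews.
const orgCounts = {};
for (const row of processedReviews) {
  for (const org of row.extracted_entities.Org) {
    orgCounts[org] = (orgCounts[org] || 0) + 1;
  }
}
// orgCounts -> { Contoso: 2, Fabrikam: 1 }
```

The same loop generalizes to Product or Money arrays, which is how the structured output turns free-text reviews into rankable competitor and product metrics.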

Business Value: Reduces manual review analysis time by 85% and provides consistent entity identification across all customer feedback.

Business Situation: A law firm needs to extract key information from contracts including parties, dates, monetary amounts, and legal references.

What You’ll Configure:

  • Set “Column Name” to “contract_text”
  • Select “English” as the language
  • Choose features: Person, Org, Date, Money, Law, Loc
  • Select “Original with appended result column” to maintain document integrity
  • Name output column “contract_entities”

What Happens: Each contract is analyzed to identify all parties involved, important dates, financial terms, legal citations, and jurisdictions mentioned.

Business Value: Accelerates contract review process by 60% and ensures no critical information is overlooked during legal analysis.

Business Situation: A PR agency wants to monitor news articles for client mentions, competitor references, and industry events.

What You’ll Configure:

  • Set “Column Name” to “article_content”
  • Choose “English” for language
  • Select features: Person, Org, Event, Date, Loc, Money
  • Use “Return result column only” for focused monitoring data
  • Name output column “media_mentions”

What Happens: News articles are processed to extract all company names, executive mentions, industry events, dates, locations, and financial figures.

Business Value: Provides comprehensive media monitoring with 95% accuracy, enabling faster response to industry developments and client coverage.

Add the node:

  1. Drag the Feature Extraction node from the AI Processing section in the left panel
  2. Drop it onto your workflow canvas
  3. Connect it to your data source node using the arrow connector

Select the source column:

  1. Click the Feature Extraction node to open the configuration panel
  2. In the “Source Property” section, click the “Column Name” field
  3. Select or type the name of the column containing your text data
  4. The smart text box will suggest available columns from your connected data

Set the language and features:

  1. Expand the “Options” section in the configuration panel
  2. Click the “Language” dropdown and select the primary language of your text
  3. In the “Select Features” field, click to open the multi-select box
  4. Check the boxes for each type of information you want to extract
  5. Click “OK” to confirm your feature selections

Configure the output:

  1. Expand the “Output” section
  2. Click the “Option” dropdown and choose your preferred output format:
    • Select “Original with appended result column” to keep source data
    • Select “Return result column only” for extracted features only
  3. In the “Column Name” field, enter a descriptive name for your results column
  4. Click “Save Configuration” to apply your settings

Test the configuration:

  1. Click the “Test Configuration” button in the node panel
  2. Enter sample text in the test input field
  3. Review the extracted features in the preview panel
  4. Adjust your feature selections if needed
  5. Save your final configuration
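For reference, the settings walked through above map onto the documented field names roughly as follows. This is a hypothetical sketch using the customer review example; the exact serialized form Builder stores is an assumption and may differ.

```javascript
// Hypothetical serialized node configuration, using the field names
// documented for this node. Values follow the retail review example.
const featureExtractionConfig = {
  sourcePropName: "review_text",                          // Column Name (source)
  languageModel: "English",                               // Language
  selectedFeatures: ["Person", "Org", "Product", "Money", "Percent"], // Select Features
  outTransformId: "Original with appended result column", // Option (output format)
  outColumnName: "extracted_entities"                     // Column Name (output)
};
```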

Common Challenge: Medical records contain unstructured notes that need analysis for patient care coordination and billing accuracy.

How This Node Helps: Automatically extracts patient names, medical conditions, medications, dates, and healthcare facilities from clinical notes and discharge summaries.

Configuration Recommendations:

  • Use “English” language setting for most medical records
  • Select features: Person, Date, Org, Quantity, Product (for medications)
  • Choose “Original with appended result column” to maintain medical record integrity
  • Name output column “clinical_entities”

Results: Healthcare providers reduce documentation review time by 70% and improve billing accuracy through consistent entity extraction.

Common Challenge: Processing loan applications, insurance claims, and financial reports requires extracting specific financial and personal information from documents.

How This Node Helps: Identifies applicant names, financial amounts, dates, organizations, and locations from financial documents for automated processing.

Configuration Recommendations:

  • Select “English” for most financial documents
  • Choose features: Person, Money, Date, Org, Percent, Loc
  • Use “Original with appended result column” for audit trail requirements
  • Name output column “financial_entities”

Results: Financial institutions process applications 50% faster while maintaining compliance and reducing manual data entry errors.

Common Challenge: Product reviews, customer service tickets, and marketplace listings contain valuable information that needs structured analysis.

How This Node Helps: Extracts product names, brand mentions, prices, customer names, and quality indicators from unstructured e-commerce text data.

Configuration Recommendations:

  • Use “English” or “Undetermined” for international marketplaces
  • Select features: Product, Person, Money, Org, Percent, Ordinal
  • Choose “Return result column only” for clean analytics data
  • Name output column “commerce_entities”

Results: E-commerce businesses gain 40% better insights into customer sentiment and product performance through automated text analysis.

  • Symptom: The output column is empty or contains only empty arrays
  • Cause: The source column may not contain text matching the selected feature types, or the wrong language model is selected
  • Solution: Verify that the source column contains the expected text, confirm the language setting matches your data, and check that the feature types you selected actually appear in sample text

  • Symptom: The node extracts information that does not match the expected feature type
  • Cause: Natural language is ambiguous: a word like “Mars” could be a person name, a product, or a location depending on context
  • Solution: Narrow your feature selection to only the types you need, and post-process the output with a filtering node if specific false positives recur

  • Symptom: The workflow runs significantly longer when processing more than a few hundred records
  • Cause: Each record requires a full NLP analysis pass, which is compute-intensive
  • Solution: Select fewer feature types to reduce the analysis scope, and run large batch jobs during off-peak hours

  • Symptom: Entities of the expected types are missed or the text is not parsed properly
  • Cause: The language model does not match the actual language of the source text
  • Solution: Set the Language field to match your data; if your dataset contains multiple languages, use the “Undetermined” setting so the model can detect the language per record
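The false-positive filtering suggested above could be sketched in a Custom Script node along these lines. The blocklist values, function name, and entity object are illustrative assumptions, not part of the node itself.

```javascript
// Hypothetical post-processing step: drop entity values known to be
// false positives before they reach downstream nodes.
const knownFalsePositives = new Set(["Mars", "Amazon"]); // example blocklist

function filterEntities(entities, blocklist) {
  const cleaned = {};
  // Keep every feature type, but remove blocklisted values from each array.
  for (const [featureType, values] of Object.entries(entities)) {
    cleaned[featureType] = values.filter(v => !blocklist.has(v));
  }
  return cleaned;
}

const raw = { Person: ["Mars", "Jane Doe"], Org: ["Amazon", "Contoso"] };
const cleaned = filterEntities(raw, knownFalsePositives);
// cleaned -> { Person: ["Jane Doe"], Org: ["Contoso"] }
```

Keeping the blocklist in one place makes consistent false positives easy to suppress without retraining or reconfiguring the extraction itself.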
Choosing features:

  • Start Small - Begin with 3 to 5 essential features and expand based on results
  • Business Relevance - Only select features that directly support your business objectives
  • Data Quality - More features do not always mean better results; focus on accuracy over quantity

Language settings:

  • Consistency - Ensure your language setting matches your data’s primary language
  • Mixed Content - Use “Undetermined” only when your text contains multiple languages
  • Regional Variations - The English setting works well for US, UK, and other English variants

Output configuration:

  • Downstream Compatibility - Choose the output format based on how you will use the extracted data
  • Column Naming - Use consistent, descriptive names that your team will understand
  • Data Retention - Keep original data when you need to verify extraction accuracy

Performance:

  • Batch Processing - Process large datasets in smaller batches for optimal performance
  • Feature Limits - Selecting fewer features improves processing speed
  • Text Length - Very long documents may require preprocessing to focus on relevant sections

Combining with other nodes:

  • Sentiment Analysis - Combine with Feature Extraction to get both entity data and emotional tone from the same text
  • Similarity - Use extracted entities as input to find similar records across your dataset
  • Custom Script - Post-process extraction results with JavaScript for custom filtering or reshaping
  • Convert From JSON - Parse the extracted entity JSON object to access individual entity arrays in downstream nodes
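Once the result column has been parsed from a JSON string (the role the Convert From JSON node plays), the entity arrays are directly accessible and can be reshaped for analytics. The sample string and variable names below are illustrative.

```javascript
// If a downstream node receives the result column as a JSON string
// rather than an object, parse it first.
const resultColumn = '{"Person": ["John Smith"], "Org": ["Acme Corp"]}';
const entities = JSON.parse(resultColumn);

const people = entities.Person; // ["John Smith"]

// Flatten the object into one row per extracted value, a convenient
// shape for tabular analytics or reporting tools.
const rows = Object.entries(entities).flatMap(([type, values]) =>
  values.map(value => ({ type, value }))
);
// rows -> [ { type: "Person", value: "John Smith" },
//           { type: "Org",    value: "Acme Corp" } ]
```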

The Feature Extraction node transforms unstructured text into valuable, actionable data that drives better business decisions and automates manual processes across industries.