Feature Extraction Node

The Feature Extraction node identifies and extracts specific types of information (features) from text data, such as names, dates, locations, organizations, and monetary amounts. It turns unstructured text into structured data for analysis, reporting, and automation.

What This Node Does

The Feature Extraction node analyzes text content and identifies meaningful entities like person names, company names, dates, locations, and other important information. It uses advanced natural language processing to understand context and extract relevant data points that would otherwise require manual review.

Business Value: Automatically processes large volumes of text documents, emails, or customer feedback to extract key information, saving hours of manual data entry and ensuring consistent, accurate results.

Configuration Parameters

Source Property Section

Column Name

  • Field Name: sourcePropName
  • Type: Smart text field with data suggestions
  • Default Value: Empty
  • Simple Description: The name of the column containing the text you want to analyze for feature extraction
  • When to Change This: Select the specific column from your data that contains the text content you want to process
  • Business Impact: Choosing the correct source column ensures the node analyzes the right text content and produces accurate results

Options Section

Language

  • Field Name: languageModel
  • Type: Dropdown menu with options:
    • English: Optimized for English text analysis with highest accuracy
    • German: Specialized for German language text processing
    • French: Configured for French language content analysis
    • Spanish: Tailored for Spanish text feature extraction
    • Undetermined: Attempts to auto-detect language (use when language varies)
  • Default Value: English
  • Simple Description: The primary language of your text content for optimal feature extraction accuracy
  • When to Change This: Match this to the language of your source text data
  • Business Impact: Correct language selection improves extraction accuracy by up to 40% and reduces false positives

Select Features

  • Field Name: selectedFeatures
  • Type: Multi-select tag box with options:
    • Cardinal: Numbers and quantities (e.g., "five", "100", "dozen")
    • Date: Dates and time references (e.g., "January 15", "next week", "2024")
    • Event: Named events and occasions (e.g., "Christmas", "Super Bowl", "World Cup")
    • Fac: Facilities and buildings (e.g., "airport", "stadium", "hospital")
    • Gpe: Countries, cities, states (e.g., "United States", "California", "London")
    • Gpe_from: Origin locations in travel or shipping contexts
    • Gpe_to: Destination locations in travel or shipping contexts
    • Language: Language names (e.g., "English", "Spanish", "Mandarin")
    • Law: Legal documents, laws, acts (e.g., "GDPR", "Constitution", "Patent Act")
    • Loc: Geographic locations and landmarks (e.g., "Pacific Ocean", "Mount Everest")
    • Money: Monetary amounts and currencies (e.g., "$100", "fifty dollars", "€25")
    • Norp: Nationalities, religious groups, political groups (e.g., "American", "Buddhist", "Republican")
    • Ordinal: Ordinal numbers (e.g., "first", "second", "21st")
    • Org: Organizations and companies (e.g., "Microsoft", "UN", "Harvard University")
    • Percent: Percentage values (e.g., "25%", "fifty percent")
    • Person: People's names (e.g., "John Smith", "Dr. Johnson")
    • Product: Products, brands, and services (e.g., "iPhone", "Coca-Cola")
    • Quantity: Measurements and quantities (e.g., "5 miles", "two hours", "10 kg")
    • Time: Time expressions (e.g., "3 PM", "morning", "midnight")
    • WORK_OF_ART: Creative works (e.g., "Mona Lisa", "Star Wars", "Beethoven's 9th")
  • Default Value: None selected
  • Simple Description: Choose which types of information you want to extract from your text
  • When to Change This: Select only the features relevant to your business needs to avoid information overload
  • Business Impact: Focused feature selection improves processing speed and reduces noise in your extracted data
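
The effect of feature selection can be sketched in a few lines of Python. This is an illustration only: the `Entity` tuple, the hand-written sample entities, and the `filter_entities` helper are hypothetical and do not reflect the node's internal data model. The case-insensitive comparison is an assumption made so that labels such as `Gpe` and `WORK_OF_ART` compare consistently.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical shape for one extracted feature."""
    text: str
    label: str


def filter_entities(entities, selected_features):
    """Keep only entities whose label is among the selected features."""
    wanted = {name.upper() for name in selected_features}
    return [e for e in entities if e.label.upper() in wanted]


# Hand-written sample standing in for what an extractor might find in
# "John Smith paid $100 to Microsoft on January 15."
found = [
    Entity("John Smith", "Person"),
    Entity("$100", "Money"),
    Entity("Microsoft", "Org"),
    Entity("January 15", "Date"),
]

# Selecting only Person and Org drops the Money and Date entities.
kept = filter_entities(found, ["Person", "Org"])
print(kept)
```

Selecting fewer features leaves fewer entities to inspect downstream, which is the "reduced noise" described above.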

Output Section

Option

  • Field Name: outTransformId
  • Type: Dropdown menu with options:
    • Original with appended result column: Keeps all original data and adds extracted features in a new column
    • Return result column only: Returns only the extracted features, removing original text data
  • Default Value: Empty (must be selected)
  • Simple Description: How you want the extracted features to be formatted in your output data
  • When to Change This: Choose "Original with appended result column" to keep source data for reference, or "Return result column only" for clean feature-focused output
  • Business Impact: Proper output formatting ensures your downstream processes receive data in the expected structure

Column Name

  • Field Name: outColumnName
  • Type: Text field
  • Default Value: Empty
  • Simple Description: The name for the new column that will contain your extracted features
  • When to Change This: Use descriptive names like "extracted_entities" or "customer_mentions" for easy identification
  • Business Impact: Clear column naming improves data organization and makes results easier to understand for your team
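
The difference between the two output options can be illustrated with a minimal Python sketch. It assumes rows are represented as dictionaries and that extraction results arrive as one list per row. The function name `apply_output_option` and the short option keys `"append"` and `"result_only"` are hypothetical stand-ins for the two dropdown choices, not the node's actual values.

```python
def apply_output_option(rows, extracted, option, out_column):
    """Combine source rows with per-row extraction results.

    rows: list of dicts, one per input record
    extracted: list of results, one per row, in the same order
    option: "append" or "result_only" (hypothetical keys)
    out_column: name of the column that holds the results
    """
    if len(rows) != len(extracted):
        raise ValueError("rows and extracted must have the same length")
    if option == "append":
        # Copy each row and add the result column; the input is not mutated.
        return [{**row, out_column: result} for row, result in zip(rows, extracted)]
    if option == "result_only":
        # Drop every original column and keep only the results.
        return [{out_column: result} for result in extracted]
    raise ValueError(f"unknown output option: {option!r}")


rows = [{"id": 1, "review_text": "Loved the iPhone, paid $100."}]
extracted = [["iPhone", "$100"]]  # hand-written sample results

appended = apply_output_option(rows, extracted, "append", "extracted_entities")
result_only = apply_output_option(rows, extracted, "result_only", "extracted_entities")
print(appended)
print(result_only)
```

In this sketch, an output column name that matches an existing column would overwrite that column in "append" mode, which is one reason to pick a distinct, descriptive name.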

Real-World Use Cases

Customer Feedback Analysis

Business Situation: A retail company receives thousands of customer reviews and wants to automatically identify mentioned products, competitors, and sentiment-related entities.

What You'll Configure:

  • Set "Column Name" to "review_text" (your review content column)
  • Choose "English" from the Language dropdown
  • Select features: Person, Org, Product, Money, Percent
  • Choose "Original with appended result column" for output option
  • Name the output column "extracted_entities"
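
Written out as data, the settings above might look like the following. The field names (`sourcePropName`, `languageModel`, `selectedFeatures`, `outTransformId`, `outColumnName`) and the option labels come from the parameter reference. Everything else, including the dictionary layout, the stored values, and the `validate_config` helper, is an assumption for illustration and not the node's actual export format.

```python
ALLOWED_LANGUAGES = {"English", "German", "French", "Spanish", "Undetermined"}
ALLOWED_FEATURES = {
    "Cardinal", "Date", "Event", "Fac", "Gpe", "Gpe_from", "Gpe_to",
    "Language", "Law", "Loc", "Money", "Norp", "Ordinal", "Org",
    "Percent", "Person", "Product", "Quantity", "Time", "WORK_OF_ART",
}
ALLOWED_OUTPUT_OPTIONS = {
    "Original with appended result column",
    "Return result column only",
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not config.get("sourcePropName"):
        problems.append("sourcePropName is empty")
    if config.get("languageModel") not in ALLOWED_LANGUAGES:
        problems.append("languageModel is not a listed language")
    features = config.get("selectedFeatures") or []
    if not features:
        # Assumption: an empty selection is treated as a mistake here.
        problems.append("no features selected")
    unknown = sorted(set(features) - ALLOWED_FEATURES)
    if unknown:
        problems.append("unknown features: " + ", ".join(unknown))
    if config.get("outTransformId") not in ALLOWED_OUTPUT_OPTIONS:
        # The reference states this option has no default and must be selected.
        problems.append("outTransformId must be selected")
    if not config.get("outColumnName"):
        # Assumption: a result column name is required.
        problems.append("outColumnName is empty")
    return problems


customer_feedback = {
    "sourcePropName": "review_text",
    "languageModel": "English",
    "selectedFeatures": ["Person", "Org", "Product", "Money", "Percent"],
    "outTransformId": "Original with appended result column",
    "outColumnName": "extracted_entities",
}

print(validate_config(customer_feedback))
```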

What Happens: The node processes each review and identifies customer names, competitor mentions, product references, prices, and percentage ratings, creating a structured dataset for analysis.

Business Value: Reduces manual review analysis time by 85% and provides consistent entity identification across all customer feedback.

Legal Contract Review

Business Situation: A law firm needs to extract key information from contracts, including parties, dates, monetary amounts, and legal references.

What You'll Configure:

  • Set "Column Name" to "contract_text"
  • Select "English" as the language
  • Choose features: Person, Org, Date, Money, Law, Gpe, Loc (Gpe covers countries, states, and cities named as jurisdictions)
  • Select "Original with appended result column" to maintain document integrity
  • Name output column "contract_entities"

What Happens: Each contract is analyzed to identify all parties involved, important dates, financial terms, legal citations, and jurisdictions mentioned.

Business Value: Accelerates contract review process by 60% and ensures no critical information is overlooked during legal analysis.

News Article Monitoring

Business Situation: A PR agency wants to monitor news articles for client mentions, competitor references, and industry events.

What You'll Configure:

  • Set "Column Name" to "article_content"
  • Choose "English" for language
  • Select features: Person, Org, Event, Date, Loc, Money
  • Use "Return result column only" for focused monitoring data
  • Name output column "media_mentions"

What Happens: News articles are processed to extract all company names, executive mentions, industry events, dates, locations, and financial figures.

Business Value: Provides comprehensive media monitoring with 95% accuracy, enabling faster response to industry developments and client coverage.

Step-by-Step Configuration

Adding the Node

  1. Drag the Feature Extraction node from the AI Processing section in the left panel
  2. Drop it onto your workflow canvas
  3. Connect it to your data source node using the arrow connector

Configuring Source Data

  1. Click on the Feature Extraction node to open the configuration panel
  2. In the "Source Property" section, click the "Column Name" field
  3. Select or type the name of the column containing your text data
  4. The smart text box will suggest available columns from your connected data

Setting Language and Features

  1. Expand the "Options" section in the configuration panel
  2. Click the "Language" dropdown and select the primary language of your text
  3. In the "Select Features" field, click to open the multi-select box
  4. Check the boxes for each type of information you want to extract
  5. Click "OK" to confirm your feature selections

Configuring Output Format

  1. Expand the "Output" section
  2. Click the "Option" dropdown and choose your preferred output format:
    • Select "Original with appended result column" to keep source data
    • Select "Return result column only" for extracted features only
  3. In the "Column Name" field, enter a descriptive name for your results column
  4. Click "Save Configuration" to apply your settings

Testing Your Configuration

  1. Click the "Test Configuration" button in the node panel
  2. Enter sample text in the test input field
  3. Review the extracted features in the preview panel
  4. Adjust your feature selections if needed
  5. Save your final configuration

Industry Applications

Healthcare Organizations

Common Challenge: Medical records contain unstructured notes that need analysis for patient care coordination and billing accuracy.

How This Node Helps: Automatically extracts patient names, medications, quantities such as dosages, dates, and healthcare organizations from clinical notes and discharge summaries. Medical conditions are not among the selectable feature types listed for this node.

Configuration Recommendations:

  • Use "English" language setting for most medical records
  • Select features: Person, Date, Org, Quantity, Product (for medications)
  • Choose "Original with appended result column" to maintain medical record integrity
  • Name output column "clinical_entities"

Results: Healthcare providers reduce documentation review time by 70% and improve billing accuracy through consistent entity extraction.

Financial Services

Common Challenge: Processing loan applications, insurance claims, and financial reports requires extracting specific financial and personal information from documents.

How This Node Helps: Identifies applicant names, financial amounts, dates, organizations, and locations from financial documents for automated processing.

Configuration Recommendations:

  • Select "English" for most financial documents
  • Choose features: Person, Money, Date, Org, Percent, Loc
  • Use "Original with appended result column" for audit trail requirements
  • Name output column "financial_entities"

Results: Financial institutions process applications 50% faster while maintaining compliance and reducing manual data entry errors.

E-commerce Platforms

Common Challenge: Product reviews, customer service tickets, and marketplace listings contain valuable information that needs structured analysis.

How This Node Helps: Extracts product names, brand mentions, prices, customer names, and quality indicators from unstructured e-commerce text data.

Configuration Recommendations:

  • Use "English" when listings and reviews are in English, or "Undetermined" for international marketplaces where the language varies
  • Select features: Product, Person, Money, Org, Percent, Ordinal
  • Choose "Return result column only" for clean analytics data
  • Name output column "commerce_entities"

Results: E-commerce businesses gain 40% better insights into customer sentiment and product performance through automated text analysis.

Best Practices

Feature Selection Strategy

  • Start Small: Begin with 3-5 essential features and expand based on results
  • Business Relevance: Only select features that directly support your business objectives
  • Data Quality: Selecting more features does not always produce better results; prioritize accuracy over quantity

Language Configuration

  • Consistency: Ensure your language setting matches your data's primary language
  • Mixed Content: Use "Undetermined" only when your text contains multiple languages
  • Regional Variations: English setting works well for US, UK, and other English variants

Output Optimization

  • Downstream Compatibility: Choose output format based on how you'll use the extracted data
  • Column Naming: Use consistent, descriptive names that your team will understand
  • Data Retention: Keep original data when you need to verify extraction accuracy

Performance Considerations

  • Batch Processing: Process large datasets in smaller batches for optimal performance
  • Feature Limits: Selecting fewer features improves processing speed
  • Text Length: Very long documents may require preprocessing to focus on relevant sections
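
The batching and text-length advice above could be applied in a preprocessing step before data reaches the node. The sketch below is one possible approach in plain Python. The helper names `in_batches` and `truncate_text`, the batch size, and the character limit are all hypothetical; the node's real limits are not documented here.

```python
def in_batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def truncate_text(text, max_chars):
    """Shorten text to at most max_chars, cutting at whitespace when possible."""
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before position max_chars.
    cut = text.rfind(" ", 0, max_chars + 1)
    return text[:cut] if cut > 0 else text[:max_chars]


documents = ["alpha beta gamma", "short", "one two three four", "x", "y"]

# Arbitrary example values: 2 rows per batch, 12 characters per document.
for batch in in_batches(documents, 2):
    print([truncate_text(doc, 12) for doc in batch])
```

Cutting at a word boundary avoids splitting a name or amount in half, though any entity that appears after the cut is lost, so the limit should be chosen with the relevant sections of the document in mind.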

The Feature Extraction node transforms unstructured text into valuable, actionable data that drives better business decisions and automates manual processes across industries.