FileToText Node Documentation

Overview

The FileToText node converts files in various formats (PDFs, images, documents) into text that your workflow can process. It combines direct text extraction with OCR (Optical Character Recognition) to pull text from files, which makes it suited to document processing, data extraction, and content analysis workflows.

What This Node Does

The FileToText node takes files from your specified location and converts them into text format. It can handle multiple file types including PDFs, images with text, Word documents, and more. The extracted text becomes available as data that subsequent nodes in your workflow can use for analysis, storage, or further processing.

Configuration Parameters

Data Source Section

File Path

Field Name: getterTemplate
Type: Smart text field with dynamic data support
Default Value: Empty
Simple Description: Specifies the exact path or pattern to locate the files you want to convert to text
When to Change This: Enter a specific file path when processing individual files, or use dynamic placeholders when file paths come from previous workflow steps
Business Impact: Accurate file paths ensure your workflow processes the correct documents every time
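
As an illustration of a dynamic file path, the sketch below fills placeholders in a path template from the data of a previous step. The {{name}} placeholder syntax, the resolve_path helper, and the attachment_name field are assumptions made for this example; check which placeholder syntax your platform actually uses.

```python
import re

# Hypothetical placeholder syntax: {{field_name}}. The real syntax may differ.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_path(template: str, row: dict) -> str:
    """Replace each {{field}} in the template with the matching value from row."""

    def substitute(match: re.Match) -> str:
        field = match.group(1)
        if field not in row:
            # Fail early rather than look for a file at a half-filled path.
            raise KeyError(f"No value for placeholder '{field}'")
        return str(row[field])

    return PLACEHOLDER.sub(substitute, template)


row = {"attachment_name": "invoice_0042.pdf"}
resolved = resolve_path("/inbox/{{attachment_name}}", row)
print(resolved)  # /inbox/invoice_0042.pdf
```

A template without placeholders is returned unchanged, which matches the single-file case.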

Folder Path

Field Name: filesFolderPath
Type: Smart text field with location picker
Default Value: Empty
Simple Description: Sets the base folder where your files are stored
When to Change This: Specify when you want to process multiple files from a specific directory or when files are organized in folders
Business Impact: Proper folder configuration enables batch processing of multiple documents, saving significant manual effort
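
To make batch processing from a folder concrete, here is a minimal sketch of selecting files under a base folder. It assumes a local file system and glob-style patterns; the list_files helper is hypothetical and does not describe how the node itself enumerates files.

```python
from pathlib import Path
import tempfile


def list_files(folder: str, pattern: str = "*.pdf") -> list[str]:
    """Return the sorted names of files in folder that match pattern."""
    return sorted(p.name for p in Path(folder).glob(pattern) if p.is_file())


# A temporary folder stands in for the value of filesFolderPath.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        (Path(folder) / name).write_text("sample")
    pdf_names = list_files(folder)

print(pdf_names)  # ['a.pdf', 'b.pdf']
```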

Processing Section

Remove File After Processing

Field Name: removeFileAfterProcessing
Type: Toggle switch (On/Off)
Default Value: Off
Simple Description: Automatically deletes the original file after successfully extracting its text
When to Change This:
- On: Use when processing temporary files or when you want to clean up storage space automatically
- Off: Keep when you need to preserve original files for compliance, backup, or future reference
Business Impact: Helps manage storage costs and keeps your file system organized, but use carefully to avoid losing important documents
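
The ordering that matters here is "extract first, delete second". The sketch below shows that ordering under stated assumptions: process_file and read_plain_text are hypothetical stand-ins, and the node's actual behavior on failure is not documented in this page.

```python
from pathlib import Path
import tempfile


def process_file(path: Path, extract, remove_after: bool = False) -> str:
    """Extract text, then delete the source only if extraction succeeded."""
    text = extract(path)  # an exception here leaves the file untouched
    if remove_after:
        path.unlink()
    return text


def read_plain_text(path: Path) -> str:
    # Stand-in extractor for the example; real files would need PDF or OCR handling.
    return path.read_text()


with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "memo.txt"
    source.write_text("Quarterly memo")
    text = process_file(source, read_plain_text, remove_after=True)
    still_exists = source.exists()

print(text, still_exists)  # Quarterly memo False
```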

Limit OCR

Field Name: limitOcr
Type: Toggle switch (On/Off)
Default Value: Off
Simple Description: Restricts the use of OCR technology for text extraction from images and scanned documents
When to Change This:
- On: Enable when processing costs are a concern or when you only want to extract text from digital documents
- Off: Keep disabled when you need to extract text from images, scanned PDFs, or handwritten documents
Business Impact: OCR processing can be resource-intensive; limiting it reduces costs but may miss text in image-based documents
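
One way to picture the effect of this toggle is as a per-file decision between reading text directly, running OCR, or skipping OCR. The sketch below is only a mental model: the extraction_plan function, the has_text_layer flag, and the list of image suffixes are assumptions, and the node's internal rules may differ.

```python
from pathlib import PurePath

# Assumed set of image types; the formats the node supports are not listed here.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def extraction_plan(file_name: str, has_text_layer: bool, limit_ocr: bool) -> str:
    """Return 'direct', 'ocr', or 'skipped' for one file."""
    is_image = PurePath(file_name).suffix.lower() in IMAGE_SUFFIXES
    if has_text_layer and not is_image:
        return "direct"  # digital document: text can be read without OCR
    return "skipped" if limit_ocr else "ocr"


print(extraction_plan("report.pdf", has_text_layer=True, limit_ocr=True))   # direct
print(extraction_plan("scan.PNG", has_text_layer=False, limit_ocr=True))    # skipped
print(extraction_plan("scan.PNG", has_text_layer=False, limit_ocr=False))   # ocr
```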

Produce Chunks from PDF

Field Name: produceChunksFromPdf
Type: Toggle switch (On/Off)
Default Value: Off
Simple Description: Breaks large PDF documents into smaller, manageable text segments instead of one large text block
When to Change This:
- On: Enable when processing large documents that need to be analyzed in sections or when working with AI models that have text length limits
- Off: Keep disabled when you need the complete document text as one continuous piece
Business Impact: Chunking makes large documents easier to process and analyze, improving workflow performance and enabling more detailed content analysis
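
The node's chunking strategy is not specified here, so the following is a generic sketch of one common approach: grouping paragraphs until a size limit is reached. The chunk_text function and the max_chars limit are assumptions for illustration.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Group paragraphs into chunks of up to max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "Clause 1. Term.\n\nClause 2. Fees.\n\nClause 3. Termination."
chunks = chunk_text(sample, max_chars=35)
print(len(chunks))  # 2
```

Splitting on paragraph boundaries keeps sentences intact, which tends to matter more for downstream AI analysis than hitting an exact chunk size.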

Output Section

Output Format

Field Name: outTransformId
Type: Dropdown menu with options:
- Original with appended result column: Keeps all original data and adds the extracted text as a new column
- Return result column only: Returns only the extracted text, discarding other data
Default Value: Original with appended result column
Simple Description: Determines how the extracted text is formatted in your workflow output
When to Change This: Choose "Return result column only" when you only need the text content, or keep the default when the extracted text must stay linked to the other data in each record
Business Impact: The right output format ensures your data flows correctly to subsequent workflow steps

Result Property Name

Field Name: outColumnName
Type: Text field
Default Value: "text_result"
Simple Description: Names the column or property that will contain your extracted text
When to Change This: Customize the name to match your data structure or make it more descriptive (e.g., "contract_text", "invoice_content", "document_summary")
Business Impact: Clear, descriptive names make your workflow data easier to understand and use in reports or subsequent processing steps
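
The difference between the two output formats, and the role of the result property name, can be shown with a single record. This is a sketch: apply_output_format is a hypothetical helper and the sample row is invented; only the default name "text_result" comes from the parameter description.

```python
def apply_output_format(row: dict, text: str, column: str = "text_result",
                        result_only: bool = False) -> dict:
    """Return the row as it would look under each output format."""
    if result_only:
        return {column: text}      # "Return result column only"
    return {**row, column: text}   # "Original with appended result column"


row = {"file": "contract.pdf", "client": "Acme"}
appended = apply_output_format(row, "Sample contract text")
only = apply_output_format(row, "Sample contract text",
                           column="contract_text", result_only=True)

print(appended)  # {'file': 'contract.pdf', 'client': 'Acme', 'text_result': 'Sample contract text'}
print(only)      # {'contract_text': 'Sample contract text'}
```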

Real-World Use Cases

Legal Document Processing

Business Situation: A law firm receives hundreds of PDF contracts daily that need to be reviewed for specific clauses and terms.

What You'll Configure:

- Set the folder path to your contract storage directory
- Turn on "Produce Chunks from PDF" to break large contracts into sections
- Name the result property "contract_text" for clarity
- Keep "Remove File After Processing" off to preserve originals

What Happens: Each contract PDF is converted to searchable text, broken into manageable sections, and made available for AI analysis or keyword searching.

Business Value: Reduces contract review time by 75% and ensures no important clauses are missed during legal analysis.
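
As a sketch of the keyword-searching step mentioned in this use case, the following scans extracted chunks for clause keywords. It would run in a later workflow step, not inside the FileToText node; find_clauses and the sample chunks are invented for illustration.

```python
def find_clauses(chunks: list[str], keywords: list[str]) -> dict[str, list[int]]:
    """Map each keyword to the indexes of the chunks that mention it (case-insensitive)."""
    hits = {keyword: [] for keyword in keywords}
    for index, chunk in enumerate(chunks):
        lowered = chunk.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                hits[keyword].append(index)
    return hits


contract_chunks = [
    "1. Term. This agreement runs for two years.",
    "2. Termination. Either party may terminate with 30 days notice.",
    "3. Indemnification. Each party shall indemnify the other.",
]
hits = find_clauses(contract_chunks, ["terminat", "indemnif"])
print(hits)  # {'terminat': [1], 'indemnif': [2]}
```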

Invoice Data Extraction

Business Situation: An accounting department needs to extract text from scanned invoices received via email to populate their accounting system.

What You'll Configure:

- Use dynamic file paths from email attachment data
- Keep "Limit OCR" off so that scanned invoices are processed with OCR
- Set the output format to "Return result column only", since you only need the text
- Name the result property "invoice_text"

What Happens: Scanned invoice images are converted to text, making vendor names, amounts, and dates available for automated data entry.

Business Value: Eliminates 90% of manual data entry work and reduces invoice processing errors by 85%.
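
Once the invoice text is available, a later step can pull out fields such as dates and amounts. The sketch below uses two simple regular expressions on an invented sample; real invoices vary widely in layout, so treat these patterns as a starting point only.

```python
import re

# Assumed formats for this example: US-style dollar amounts and ISO dates.
AMOUNT = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")
ISO_DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

invoice_text = "Invoice 2024-03-15 from Acme Supplies. Total due: $1,250.00"
amounts = AMOUNT.findall(invoice_text)
dates = ISO_DATE.findall(invoice_text)

print(amounts)  # ['1,250.00']
print(dates)    # ['2024-03-15']
```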

Research Document Analysis

Business Situation: A market research company needs to analyze hundreds of PDF reports to extract key insights and trends.

What You'll Configure:

- Point the folder path to your research document folder
- Turn on "Produce Chunks from PDF" for better analysis
- Keep "Remove File After Processing" off so the original files remain available for reference
- Use a descriptive result property name such as "research_content"

What Happens: Research PDFs are converted to text chunks that can be analyzed by AI for themes, sentiment, and key findings.

Business Value: Accelerates research analysis by 60% and enables processing of 10x more documents with the same team.

Step-by-Step Configuration

Adding the Node

- Drag the FileToText node from the left panel onto your workflow canvas
- Connect it to the previous node using the arrow connector
- Click on the FileToText node to open the configuration panel

Setting Up Data Source

- In the "Data Source" section, enter your file path in the "File Path" field
  - For single files: Enter the complete path (e.g., "/documents/contract.pdf")
  - For dynamic paths: Use placeholders from previous nodes
- If processing multiple files, enter the folder path in the "Folder Path" field
- Test your path configuration using the preview feature

Configuring Processing Options

- In the "Processing" section, decide whether to remove files after processing
  - Turn "Remove File After Processing" on only if you're certain you won't need the original files
- Set "Limit OCR" based on your document types
  - Keep it off if you have scanned documents or images with text
- Turn on "Produce Chunks from PDF" if you're processing large documents
  - Turn it on for documents over 10 pages or when using AI analysis

Setting Output Format

- In the "Output" section, choose your output format from the dropdown
  - Select "Original with appended result column" to keep all data
  - Choose "Return result column only" if you only need the text
- Enter a descriptive name for your result property
  - Use clear names like "document_text", "email_content", or "report_data"
- Save your configuration and test with sample files
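
The settings walked through in these steps can be summarized as one configuration. The field names and defaults below come from the parameter descriptions in this document; representing them as a Python dictionary, the build_config helper, and the "/documents/contracts" folder are assumptions for illustration, and the platform may store outTransformId as an internal ID and not as the label shown.

```python
# Field names and defaults as documented; the dict form is only an illustration.
DEFAULTS = {
    "getterTemplate": "",
    "filesFolderPath": "",
    "removeFileAfterProcessing": False,
    "limitOcr": False,
    "produceChunksFromPdf": False,
    "outTransformId": "Original with appended result column",  # label, not necessarily the stored ID
    "outColumnName": "text_result",
}


def build_config(**overrides) -> dict:
    """Return the defaults with overrides applied, rejecting unknown field names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Settings from the legal document processing use case.
legal_config = build_config(
    filesFolderPath="/documents/contracts",  # hypothetical folder
    produceChunksFromPdf=True,
    outColumnName="contract_text",
)
print(legal_config["removeFileAfterProcessing"])  # False
```

Rejecting unknown field names catches typos such as "limitOCR" before a batch run starts.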

Industry Applications

Healthcare Organizations

Common Challenge: Medical practices receive patient forms, insurance documents, and medical records in various formats that need to be digitized and searchable.

How This Node Helps: Converts scanned forms, handwritten notes, and PDF reports into searchable text for electronic health record systems.

Configuration Recommendations:

- Keep "Limit OCR" off for handwritten and scanned documents
- Use chunking for long medical reports
- Keep original files for compliance requirements
- Use descriptive property names like "patient_form_text"

Results: Reduces document processing time by 80% and improves patient data accessibility while maintaining HIPAA compliance.

Real Estate Agencies

Common Challenge: Property listings, contracts, and inspection reports come in various formats and need to be searchable and analyzable.

How This Node Helps: Extracts text from property documents, making them searchable and enabling automated analysis of property features and terms.

Configuration Recommendations:

- Process entire folders of property documents
- Enable PDF chunking for detailed inspection reports
- Preserve original documents for legal requirements
- Use property-specific naming like "listing_text" or "inspection_content"

Results: Improves property search capabilities by 90% and enables automated matching of properties to client requirements.

Financial Services

Common Challenge: Banks and financial institutions process thousands of loan applications, financial statements, and regulatory documents daily.

How This Node Helps: Converts financial documents into analyzable text for risk assessment, compliance checking, and automated decision-making.

Configuration Recommendations:

- Use secure folder paths for sensitive documents
- Keep "Limit OCR" off for scanned financial statements
- Enable PDF chunking for long financial reports
- Use compliance-friendly naming conventions

Results: Accelerates loan processing by 65% and improves regulatory compliance through automated document analysis.

Best Practices

File Organization

- Organize source files in clearly named folders
- Use consistent file naming conventions
- Separate different document types into different folders
- Regularly clean up processed files if you are not using automatic deletion

Performance Optimization

- Turn "Limit OCR" on when your files do not need OCR, to reduce processing time
- Use chunking for large documents to improve downstream processing
- Process files in batches during off-peak hours
- Monitor storage usage when preserving original files

Quality Assurance

- Test with sample files before processing large batches
- Verify text extraction quality with different file types
- Set up error handling for corrupted or unreadable files
- Regularly review extraction results for accuracy
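
Error handling for unreadable files usually means letting one bad file fail without stopping the batch. The sketch below shows that pattern in general terms; extract_batch and fake_extract are hypothetical, and how your platform surfaces node errors may differ.

```python
def extract_batch(paths, extract):
    """Run extract on every path, collecting failures without aborting the batch."""
    results, errors = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # broad on purpose: any failure is recorded per file
            errors[path] = str(exc)
    return results, errors


def fake_extract(path: str) -> str:
    # Stand-in extractor that simulates a corrupted file.
    if path.endswith(".corrupt"):
        raise ValueError("unreadable file")
    return f"text of {path}"


results, errors = extract_batch(["a.pdf", "b.corrupt"], fake_extract)
print(results)  # {'a.pdf': 'text of a.pdf'}
print(errors)   # {'b.corrupt': 'unreadable file'}
```

Keeping the error list alongside the results makes it easy to review or retry only the files that failed.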

Security Considerations

- Use secure folder paths for sensitive documents
- Implement proper access controls on file storage locations
- Consider encryption for highly sensitive document processing
- Maintain audit trails of document processing activities
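
One concrete check related to folder paths is making sure a dynamically built path cannot escape the intended base folder. This is a generic sketch of that check, not a feature of the node; is_inside and the sample paths are assumptions.

```python
from pathlib import Path


def is_inside(base_folder: str, candidate: str) -> bool:
    """Return True if candidate, resolved against base_folder, stays within it."""
    base = Path(base_folder).resolve()
    target = (base / candidate).resolve()
    return target == base or base in target.parents


print(is_inside("/data/docs", "contracts/a.pdf"))     # True
print(is_inside("/data/docs", "../secrets/key.pem"))  # False
```

A check like this is most useful when file paths are filled in from upstream data such as email attachment names.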

The FileToText node makes the content of supported files searchable, analyzable, and actionable. Whether you're processing contracts, invoices, research papers, or other document types, it provides a foundation for document automation.
