FileToText Node Documentation
Overview
The FileToText node converts files such as PDFs, images, and documents into text that your workflow can process. It combines direct text extraction with OCR (Optical Character Recognition) to pull text from files, which makes it well suited to document processing, data extraction, and content analysis workflows.
What This Node Does
The FileToText node takes files from your specified location and converts them into text format. It can handle multiple file types including PDFs, images with text, Word documents, and more. The extracted text becomes available as data that subsequent nodes in your workflow can use for analysis, storage, or further processing.
Configuration Parameters
Data Source Section
File Path
- Field Name: getterTemplate
- Type: Smart text field with dynamic data support
- Default Value: Empty
- Simple Description: Specifies the exact path or pattern to locate the files you want to convert to text
- When to Change This: Enter a specific file path when processing individual files, or use dynamic placeholders when file paths come from previous workflow steps
- Business Impact: Accurate file paths ensure your workflow processes the correct documents every time
Folder Path
- Field Name: filesFolderPath
- Type: Smart text field with location picker
- Default Value: Empty
- Simple Description: Sets the base folder where your files are stored
- When to Change This: Specify when you want to process multiple files from a specific directory or when files are organized in folders
- Business Impact: Proper folder configuration enables batch processing of multiple documents, saving significant manual effort
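To make the idea of dynamic placeholders more concrete, here is a minimal sketch of how a file path template could be filled in from the data of a previous step. The `{{name}}` placeholder syntax, the `resolve_path` helper, and the rule that a relative path is joined onto the folder path are all assumptions made for illustration; the node's actual placeholder syntax and path handling may differ.

```python
import re

# Hypothetical placeholder syntax: {{field_name}}. The real node may use another one.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_path(getter_template, row, files_folder_path=""):
    """Fill placeholders from an incoming row, then prefix the folder path if relative."""

    def substitute(match):
        key = match.group(1)
        if key not in row:
            raise KeyError(f"placeholder '{key}' not found in incoming row")
        return str(row[key])

    path = PLACEHOLDER.sub(substitute, getter_template)
    # Assumption: a relative file path is resolved against the folder path.
    if files_folder_path and not path.startswith("/"):
        path = files_folder_path.rstrip("/") + "/" + path
    return path


row = {"attachment_name": "inv-001.pdf"}
print(resolve_path("{{attachment_name}}", row, "/inbox/invoices"))
```

Failing loudly on a missing placeholder is a deliberate choice in this sketch: a silently empty path is harder to diagnose than an error that names the missing field.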
Processing Section
Remove File After Processing
- Field Name: removeFileAfterProcessing
- Type: Toggle switch (On/Off)
- Default Value: Off
- Simple Description: Automatically deletes the original file after successfully extracting its text
- When to Change This:
  - On: Use when processing temporary files or when you want to clean up storage space automatically
  - Off: Keep when you need to preserve original files for compliance, backup, or future reference
- Business Impact: Helps manage storage costs and keeps your file system organized, but use carefully to avoid losing important documents
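The key point of this option is that deletion follows a successful extraction. The sketch below illustrates that ordering in plain Python. The `extract_then_cleanup` function and its `extract` callback are hypothetical stand-ins, not the node's internals.

```python
import os


def extract_then_cleanup(path, extract, remove_file_after_processing=False):
    """Extract text first; delete the source file only if extraction succeeded."""
    text = extract(path)  # an exception here skips the delete below
    if remove_file_after_processing:
        os.remove(path)
    return text
```

Because the delete runs only after `extract` returns, a corrupted or unreadable file stays on disk and can be retried or inspected later.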
Limit OCR
- Field Name: limitOcr
- Type: Toggle switch (On/Off)
- Default Value: Off
- Simple Description: Restricts the use of OCR technology for text extraction from images and scanned documents
- When to Change This:
  - On: Enable when processing costs are a concern or when you only want to extract text from digital documents
  - Off: Keep disabled when you need to extract text from images, scanned PDFs, or handwritten documents
- Business Impact: OCR processing can be resource-intensive; limiting it reduces costs but may miss text in image-based documents
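The trade-off can be pictured with a small sketch. It assumes, purely for illustration, that extraction first looks for text already embedded in the file and falls back to OCR only when none is found; the `extract_text` function and the `run_ocr` callback are hypothetical, and the node's real decision logic is not documented here.

```python
def extract_text(embedded_text, run_ocr, limit_ocr=False):
    """Prefer text already embedded in the file; fall back to OCR unless it is limited."""
    if embedded_text.strip():
        return embedded_text  # digital document: no OCR needed
    if limit_ocr:
        return ""  # OCR skipped: image-only content yields no text
    return run_ocr()  # scanned page or image: pay the OCR cost
```

Under this assumption, turning the limit on never changes the result for digital documents, but a scanned page comes back empty.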
Produce Chunks from PDF
- Field Name: produceChunksFromPdf
- Type: Toggle switch (On/Off)
- Default Value: Off
- Simple Description: Breaks large PDF documents into smaller, manageable text segments instead of one large text block
- When to Change This:
  - On: Enable when processing large documents that need to be analyzed in sections or when working with AI models that have text length limits
  - Off: Keep disabled when you need the complete document text as one continuous piece
- Business Impact: Chunking makes large documents easier to process and analyze, improving workflow performance and enabling more detailed content analysis
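As an illustration of what chunking means in practice, the sketch below splits extracted text into segments of at most `max_chars` characters, breaking on blank lines where possible. The `chunk_text` function, the character-based limit, and the paragraph-boundary rule are assumptions for this example; the node may chunk by page, by token count, or by some other rule.

```python
def chunk_text(text, max_chars=1000):
    """Split text into chunks of at most max_chars, preferring paragraph boundaries."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any paragraph that is longer than the limit on its own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk and start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs together where they fit tends to give downstream analysis more coherent segments than cutting at a fixed offset.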
Output Section
Output Format
- Field Name: outTransformId
- Type: Dropdown menu with options:
  - Original with appended result column: Keeps all original data and adds the extracted text as a new column
  - Return result column only: Returns only the extracted text, discarding other data
- Default Value: Original with appended result column
- Simple Description: Determines how the extracted text is formatted in your workflow output
- When to Change This: Choose "Return result column only" when you only need the text content, or keep the default when you need to maintain relationships with other data
- Business Impact: The right output format ensures your data flows correctly to subsequent workflow steps
Result Property Name
- Field Name: outColumnName
- Type: Text field
- Default Value: "text_result"
- Simple Description: Names the column or property that will contain your extracted text
- When to Change This: Customize the name to match your data structure or make it more descriptive (e.g., "contract_text", "invoice_content", "document_summary")
- Business Impact: Clear, descriptive names make your workflow data easier to understand and use in reports or subsequent processing steps
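The difference between the two output formats, and the role of the result property name, can be sketched on plain dictionaries. The `apply_output` function and the row shape below are hypothetical; they only mirror the behavior described above, not the node's implementation.

```python
def apply_output(rows, texts, out_column_name="text_result", result_only=False):
    """Attach extracted text to each row, or return the text column alone."""
    output = []
    for row, text in zip(rows, texts):
        if result_only:
            # "Return result column only": other fields are discarded.
            output.append({out_column_name: text})
        else:
            # "Original with appended result column": original fields are kept.
            output.append({**row, out_column_name: text})
    return output


rows = [{"file": "contract.pdf", "client": "Acme"}]
texts = ["This agreement is made between..."]
print(apply_output(rows, texts, out_column_name="contract_text"))
print(apply_output(rows, texts, out_column_name="contract_text", result_only=True))
```

Note that in the appended form, a result property name that matches an existing field would overwrite that field in this sketch, which is one more reason to pick a distinctive name.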
Real-World Use Cases
Legal Document Processing
Business Situation: A law firm receives hundreds of PDF contracts daily that need to be reviewed for specific clauses and terms.
What You'll Configure:
- Set the folder path to your contract storage directory
- Enable "Produce chunks from PDF" to break large contracts into sections
- Name the result property "contract_text" for clarity
- Keep "Remove file after processing" disabled to preserve originals
What Happens: Each contract PDF is converted to searchable text, broken into manageable sections, and made available for AI analysis or keyword searching.
Business Value: Reduces contract review time by 75% and lowers the risk that important clauses are missed during legal analysis.
Invoice Data Extraction
Business Situation: An accounting department needs to extract text from scanned invoices received via email to populate their accounting system.
What You'll Configure:
- Use dynamic file paths from email attachment data
- Keep "Limit OCR" off so that scanned invoices are processed with OCR
- Set Output Format to "Return result column only" since you only need the text
- Name the result property "invoice_text"
What Happens: Scanned invoice images are converted to text, making vendor names, amounts, and dates available for automated data entry.
Business Value: Eliminates 90% of manual data entry work and reduces invoice processing errors by 85%.
Research Document Analysis
Business Situation: A market research company needs to analyze hundreds of PDF reports to extract key insights and trends.
What You'll Configure:
- Point to your research document folder
- Enable "Produce chunks from PDF" for better analysis
- Keep original files for reference
- Use descriptive result property names like "research_content"
What Happens: Research PDFs are converted to text chunks that can be analyzed by AI for themes, sentiment, and key findings.
Business Value: Accelerates research analysis by 60% and enables processing of 10x more documents with the same team.
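The three scenarios above can be written down as parameter sets using the field names from the Configuration Parameters section. This is a sketch only: the `build_config` helper, the folder paths, the `{{attachment_path}}` placeholder, and the use of the dropdown labels as `outTransformId` values are assumptions, since the stored values are not documented here.

```python
# Defaults as listed in the Configuration Parameters section.
DEFAULTS = {
    "getterTemplate": "",
    "filesFolderPath": "",
    "removeFileAfterProcessing": False,
    "limitOcr": False,
    "produceChunksFromPdf": False,
    "outTransformId": "Original with appended result column",  # label used as a stand-in value
    "outColumnName": "text_result",
}


def build_config(**overrides):
    """Merge overrides onto the defaults, rejecting misspelled parameter names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Legal Document Processing: chunked contracts, originals preserved.
legal = build_config(
    filesFolderPath="/contracts",  # hypothetical folder
    produceChunksFromPdf=True,
    outColumnName="contract_text",
)

# Invoice Data Extraction: dynamic path, OCR left unrestricted, text only.
invoices = build_config(
    getterTemplate="{{attachment_path}}",  # hypothetical placeholder
    outTransformId="Return result column only",
    outColumnName="invoice_text",
)

# Research Document Analysis: chunked reports, originals preserved.
research = build_config(
    filesFolderPath="/research",  # hypothetical folder
    produceChunksFromPdf=True,
    outColumnName="research_content",
)
```

Starting from the documented defaults means each scenario only states what it changes, which keeps "Remove File After Processing" and "Limit OCR" off unless someone turns them on explicitly.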
Step-by-Step Configuration
Adding the Node
- Drag the FileToText node from the left panel onto your workflow canvas
- Connect it to the previous node using the arrow connector
- Click on the FileToText node to open the configuration panel
Setting Up Data Source
- In the "Data Source" section, enter your file path in the "File Path" field
  - For single files: Enter the complete path (e.g., "/documents/contract.pdf")
  - For dynamic paths: Use placeholders from previous nodes
- If processing multiple files, enter the folder path in the "Folder Path" field
- Test your path configuration using the preview feature
Configuring Processing Options
- In the "Processing" section, decide whether to remove files after processing
  - Toggle "Remove File After Processing" to "On" only if you're certain you won't need the original files
- Set "Limit OCR" based on your document types
  - Keep it "Off" if you have scanned documents or images with text
- Enable "Produce Chunks from PDF" if you're processing large documents
  - Toggle it "On" for documents over 10 pages or when using AI analysis
Setting Output Format
- In the "Output" section, choose your output format from the dropdown
  - Select "Original with appended result column" to keep all data
  - Choose "Return result column only" if you only need the text
- Enter a descriptive name for your result property
  - Use clear names like "document_text", "email_content", or "report_data"
- Save your configuration and test with sample files
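Before saving, it can help to run through the settings once more as a checklist. The sketch below expresses such a checklist in code. The `validate_config` function is hypothetical, and its rules (at least one of File Path or Folder Path set, a non-empty result property name, a warning when deletion is on) are assumptions drawn from the steps above rather than checks the node is known to perform.

```python
def validate_config(config):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    # Assumption: the node needs at least one of the two path fields.
    if not config.get("getterTemplate") and not config.get("filesFolderPath"):
        problems.append("set File Path or Folder Path")
    if not str(config.get("outColumnName", "")).strip():
        problems.append("Result Property Name must not be empty")
    if config.get("removeFileAfterProcessing"):
        problems.append("warning: original files will be deleted after extraction")
    return problems


config = {
    "getterTemplate": "/documents/contract.pdf",
    "removeFileAfterProcessing": True,
    "outColumnName": "document_text",
}
for problem in validate_config(config):
    print(problem)
```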
Industry Applications
Healthcare Organizations
Common Challenge: Medical practices receive patient forms, insurance documents, and medical records in various formats that need to be digitized and searchable.
How This Node Helps: Converts scanned forms, handwritten notes, and PDF reports into searchable text for electronic health record systems.
Configuration Recommendations:
- Keep "Limit OCR" off so that handwritten and scanned documents are processed with OCR
- Use chunking for long medical reports
- Keep original files for compliance requirements
- Use descriptive property names like "patient_form_text"
Results: Reduces document processing time by 80% and improves patient data accessibility while maintaining HIPAA compliance.
Real Estate Agencies
Common Challenge: Property listings, contracts, and inspection reports come in various formats and need to be searchable and analyzable.
How This Node Helps: Extracts text from property documents, making them searchable and enabling automated analysis of property features and terms.
Configuration Recommendations:
- Process entire folders of property documents
- Enable PDF chunking for detailed inspection reports
- Preserve original documents for legal requirements
- Use property-specific naming like "listing_text" or "inspection_content"
Results: Improves property search capabilities by 90% and enables automated matching of properties to client requirements.
Financial Services
Common Challenge: Banks and financial institutions process thousands of loan applications, financial statements, and regulatory documents daily.
How This Node Helps: Converts financial documents into analyzable text for risk assessment, compliance checking, and automated decision-making.
Configuration Recommendations:
- Use secure folder paths for sensitive documents
- Keep "Limit OCR" off so that scanned financial statements are processed with OCR
- Implement chunking for comprehensive financial reports
- Use compliance-friendly naming conventions
Results: Accelerates loan processing by 65% and improves regulatory compliance through automated document analysis.
Best Practices
File Organization
- Organize source files in clearly named folders
- Use consistent file naming conventions
- Separate different document types into different folders
- Regularly clean up processed files if you are not using auto-deletion
Performance Optimization
- Turn on "Limit OCR" when your files already contain digital text, to reduce processing time
- Use chunking for large documents to improve downstream processing
- Process files in batches during off-peak hours
- Monitor storage usage when preserving original files
Quality Assurance
- Test with sample files before processing large batches
- Verify text extraction quality with different file types
- Set up error handling for corrupted or unreadable files
- Regularly review extraction results for accuracy
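One way to approach error handling for corrupted or unreadable files is to collect failures per file instead of stopping the whole batch. The sketch below shows that pattern in plain Python. The `process_batch` function and its `extract` callback are hypothetical and stand in for whatever performs the extraction in your workflow.

```python
def process_batch(paths, extract, out_column_name="text_result"):
    """Extract text from each file, recording failures instead of aborting the batch."""
    results, failures = [], []
    for path in paths:
        try:
            results.append({"path": path, out_column_name: extract(path)})
        except Exception as exc:  # broad on purpose: any unreadable file is logged and skipped
            failures.append({"path": path, "error": str(exc)})
    return results, failures
```

Reviewing the `failures` list after a sample run gives a quick view of which file types or sources need attention before a large batch is processed.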
Security Considerations
- Use secure folder paths for sensitive documents
- Implement proper access controls on file storage locations
- Consider encryption for highly sensitive document processing
- Maintain audit trails of document processing activities
The FileToText node transforms your document processing workflows by making any file content searchable, analyzable, and actionable. Whether you're processing contracts, invoices, research papers, or any other document type, this node provides the foundation for intelligent document automation.