File To Text - Synthreo Builder

File To Text node for Builder - extract readable text from PDFs, Word documents, spreadsheets, and images for LLM ingestion, search indexing, and text analysis workflows.

Purpose

The FileToText node converts documents (PDFs, images, office files, and other supported formats) into machine-readable text using OCR (Optical Character Recognition) and document parsing techniques. This allows workflows to analyze, search, and process unstructured documents at scale without manual text extraction.

This node is commonly placed early in document processing pipelines to make file content available to LLMs, search indexes, data entry automation, and classification workflows.

Inputs

File Path / Folder Path (String, Optional) - Path to the file or folder to be processed. Can be a static value or dynamically supplied from a previous node using an expression.

Outputs

Extracted Text - The text content extracted from the file or files.
Structured Output - Depending on the selected Option, the output may include the original input record fields alongside the extracted text, or return only the extracted text.

Parameters

Name	Type	Required	Default	Description
File Path	String	No	(empty)	Path to a specific file to process. Supports dynamic input from upstream nodes. When populated, the node processes this single file.
Folder Path	String	No	(empty)	Base folder path for batch file processing. When populated, the node processes all supported files found in the folder.
Remove File After Processing	Boolean (Toggle)	No	Off	When enabled, deletes the original file from storage after successful text extraction. Use with caution in compliance-sensitive industries where file retention is required.
Limit OCR	Boolean (Toggle)	No	Off	When enabled, reduces the depth and detail of OCR processing to improve speed. Best suited for high-volume workflows where documents are simple and text quality is consistent.
Option	Dropdown	No	Original with appended result column	Controls the output format. "Original with appended result column" keeps all incoming data fields and adds the extracted text as a new column. "Return result column only" outputs just the extracted text and discards the original input fields.
Result Property Name	String	No	`text_result`	The name of the output property that stores the extracted text content.
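
The effect of Option and Result Property Name on a single output record can be sketched in Python. This is only an illustration of the documented behavior, not Builder's implementation: the function `shape_output` and the sample field names `file_name` and `customer_id` are hypothetical, while the two option labels and the `text_result` default come from the table above.

```python
from typing import Any, Dict

# Option labels copied from the Parameters table. Everything else here is a
# hypothetical illustration, not Builder's actual code.
APPEND = "Original with appended result column"
RESULT_ONLY = "Return result column only"


def shape_output(record: Dict[str, Any], text: str,
                 option: str = APPEND,
                 result_property_name: str = "text_result") -> Dict[str, Any]:
    """Build the output record for one processed file."""
    if option == RESULT_ONLY:
        # Input fields are discarded; only the extracted text remains.
        return {result_property_name: text}
    if option == APPEND:
        # Input fields are kept and the extracted text is added as a new field.
        return {**record, result_property_name: text}
    raise ValueError(f"Unknown option: {option!r}")


incoming = {"file_name": "invoice.jpg", "customer_id": 42}
print(shape_output(incoming, "Invoice #1001"))
print(shape_output(incoming, "Invoice #1001", RESULT_ONLY, "invoice_text"))
```

With the default option, downstream nodes still see the incoming fields next to the extracted text. With the result-only option, they see a single property.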

Supported File Types

The node supports a wide range of document and image formats:

Documents: PDF, DOCX, TXT, RTF
Spreadsheets: XLSX, CSV (basic parsing into text)
Images: PNG, JPG/JPEG, TIFF, BMP
Scanned Files: Multi-page PDFs and image-based PDFs (via OCR)

Note: OCR quality varies depending on scan resolution, document quality, image contrast, and language. High-resolution scans (300 DPI or above) generally produce more accurate text extraction than low-resolution or heavily compressed images.

How It Works

When the node executes, it resolves the file or folder path from the configured parameters or from upstream node output. For each file, it determines the appropriate extraction method: native text parsing for text-based documents (PDF with embedded text, DOCX, TXT), or OCR processing for image-based files and scanned PDFs. The extracted text is placed into the output property named by Result Property Name, and the result is passed downstream according to the selected Option.

When Folder Path is used, the node processes all supported files in the folder and produces one output record per file. The output records can then be iterated or aggregated by downstream nodes.
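
As a rough mental model, the choice of extraction method described above might look like the following Python sketch. The names `choose_method` and `plan_folder` are hypothetical, and the sketch rests on three assumptions: RTF, XLSX, and CSV are treated as natively parsed, extensions are matched case-insensitively, and whether a PDF has embedded text is supplied by the caller as a flag. The node's real logic is not documented here.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Extensions taken from the Supported File Types list. Grouping RTF, XLSX and
# CSV under native parsing is an assumption made for this sketch.
NATIVE_TEXT = {".docx", ".txt", ".rtf", ".xlsx", ".csv"}
IMAGES = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def choose_method(path: str, pdf_has_embedded_text: bool = False) -> Optional[str]:
    """Return "native", "ocr", or None for an unsupported file type."""
    ext = Path(path).suffix.lower()
    if ext in NATIVE_TEXT:
        return "native"
    if ext in IMAGES:
        return "ocr"
    if ext == ".pdf":
        # Text-based PDFs are parsed directly; scanned PDFs need OCR.
        return "native" if pdf_has_embedded_text else "ocr"
    return None


def plan_folder(file_names: List[str]) -> List[Dict[str, str]]:
    """One planned entry per supported file, mirroring one output record per file.

    PDFs default to OCR here because this sketch cannot inspect file contents.
    """
    plan = []
    for name in file_names:
        method = choose_method(name)
        if method is not None:
            plan.append({"file": name, "method": method})
    return plan


print(plan_folder(["scan.png", "contract.pdf", "memo.docx", "notes.md"]))
```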

Example Usage

Invoice Automation

A workflow processes scanned invoice images uploaded to a shared folder and extracts text for automated data entry into an accounting system.

Setup:

Folder Path: /uploads/invoices/
Remove File After Processing: Off
Limit OCR: On
Result Property Name: invoice_text

Result: Invoice images are processed quickly with limited OCR, and the extracted text is stored in invoice_text for downstream parsing and data entry steps.

Customer Support Document Triage

A workflow receives email attachments from customers and extracts text for AI-based categorization and routing.

Setup:

File Path: Dynamically set from email attachment path
Option: Return result column only
Result Property Name: attachment_text

Result: Attachment content is extracted into plain text and passed directly to an LLM node for categorization and priority assignment.

Legal Contract Review

A workflow processes PDF contracts from a folder and extracts full text for clause analysis and compliance checking.

Setup:

Folder Path: /documents/contracts/
Remove File After Processing: Off
Limit OCR: Off
Option: Original with appended result column
Result Property Name: contract_text

Result: Contract text is extracted in full detail and appended to the original file metadata record, which is then passed to a downstream analysis node.

Best Practices

Organize files clearly: Use structured folder paths for batch processing to make it easy to target specific document types. Processing mixed file types from a single folder can complicate downstream handling.
Enable Limit OCR for speed: For high-volume workflows processing simple documents with consistent formatting (such as printed invoices or typed forms), enabling Limit OCR reduces processing time. For complex, handwritten, or degraded documents, disable it for more accurate results.
Test with sample files: Before scaling up a batch processing workflow, test the node with a representative sample of files to verify extraction quality and adjust OCR settings as needed.
Be cautious with Remove File After Processing: In industries with compliance or audit requirements (healthcare, finance, legal), leaving this toggle off (the default) is the safer choice. Only enable it when the extracted text has been verified and preserved in a downstream system.
Use Result Property Name descriptively: Name the output property to reflect the document type, such as contract_text, invoice_text, or support_attachment_text. This makes downstream node configuration clearer.
Handle encoding variation: Some documents may contain text in non-standard encodings or mixed languages. Test with actual document samples from your environment to confirm extraction accuracy for your specific use case.

Troubleshooting

Extracted text is empty or missing

A common cause is that the file is an image-based PDF or a scanned document with no embedded text, and Limit OCR is enabled, which can prevent full OCR processing. Try disabling Limit OCR and re-running the workflow. Also confirm that the file type appears in the supported formats list.
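
To find affected files after a batch run, the output records can be scanned for empty results. The following Python sketch assumes the records are available as a list of dictionaries keyed by the configured Result Property Name. The helper name `find_empty_results` and the sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def find_empty_results(records: Iterable[Dict[str, Any]],
                       result_property_name: str = "text_result") -> List[Dict[str, Any]]:
    """Return records whose extracted text is missing, not a string, or only whitespace."""
    empty = []
    for record in records:
        text = record.get(result_property_name)
        if not isinstance(text, str) or not text.strip():
            empty.append(record)
    return empty


# Hypothetical sample output from a folder run.
records = [
    {"file": "a.pdf", "text_result": "Total due: 120.00"},
    {"file": "b.pdf", "text_result": "   "},
    {"file": "c.pdf"},
]
for record in find_empty_results(records):
    print("Re-run with Limit OCR off:", record["file"])
```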

OCR quality is poor

Scan resolution and document quality directly affect OCR accuracy. For best results, ensure source documents are scanned at 300 DPI or higher. Heavily compressed images, low-contrast text, or handwriting may produce inaccurate results regardless of settings.

File not found error

Verify that the path provided to File Path or Folder Path is correct and that the workflow runtime has read access to the target location. If the path is dynamically sourced from an upstream node, inspect that node’s output to confirm the field is populated correctly.
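
A quick check of a dynamically supplied path can be sketched in Python as follows. It assumes the upstream record is a dictionary and that the script runs on a machine that sees the same file system as the workflow runtime, which may not hold in your deployment. The function `check_path_field` and the field name `attachment_path` are hypothetical.

```python
import os
from typing import Any, Dict, Optional


def check_path_field(record: Dict[str, Any], field: str) -> Optional[str]:
    """Return a description of the problem, or None if the path looks usable."""
    value = record.get(field)
    if not isinstance(value, str) or not value.strip():
        return f"field {field!r} is missing or empty"
    if not os.path.exists(value):
        return f"path does not exist: {value}"
    if not os.access(value, os.R_OK):
        return f"path is not readable: {value}"
    return None


# Hypothetical upstream record with an unpopulated path field.
print(check_path_field({"attachment_path": ""}, "attachment_path"))
```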

Only some files in a folder are processed

The node only processes file types in its supported formats list. Files with unsupported extensions are skipped. Review the folder contents and confirm that all files requiring processing use supported formats.
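
One way to review a folder before a run is to list the files whose extensions are not in the supported formats list. The Python sketch below uses the extensions from the Supported File Types section. It assumes case-insensitive matching, and it does not know whether the node accepts variants such as `.tif`, so treat its output as a hint rather than a definitive answer. The function name `unsupported_files` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions from the Supported File Types section of this page.
SUPPORTED = {".pdf", ".docx", ".txt", ".rtf", ".xlsx", ".csv",
             ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def unsupported_files(folder: str) -> List[str]:
    """Return sorted names of files in `folder` whose extension is not supported."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() not in SUPPORTED
    )


# Demonstration against a temporary folder holding empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.heic", "c.DOCX", "notes.md"):
        (Path(tmp) / name).touch()
    print(unsupported_files(tmp))
```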

Remove File After Processing deleted a file unexpectedly

Deletion performed by this toggle cannot be undone by the node. If files were deleted unintentionally, disable the toggle and restore the files from backup. Always verify workflow behavior in a test environment before enabling it in production.

Test Cases

Given: invoice.jpg with Limit OCR = On - Expected: Extracted text produced with faster but potentially less detailed OCR results.
Given: Folder containing 3 PDFs with Option = Return result column only - Expected: Output array of three records, each containing extracted plain text for one file.
Given: DOCX file with embedded text and Limit OCR = Off - Expected: Full text content of the document extracted accurately.
Given: File path pointing to a nonexistent file - Expected: Error indicating the file was not found.

Related Nodes

ConvertBase64ToFile - Produces binary files from Base64 strings that can then be processed by this node.
String Operation - Can clean and format the extracted text output from this node.
Set Transformation - Can filter or reshape the output records produced when processing a folder of files.
Custom Script - Can apply custom parsing logic to the raw text extracted by this node.
