
File To Text - Synthreo Builder

File To Text node for Builder - extract readable text from PDFs, Word documents, spreadsheets, and images for LLM ingestion, search indexing, and text analysis workflows.


The FileToText node converts documents (PDFs, images, office files, and other supported formats) into machine-readable text using OCR (Optical Character Recognition) and document parsing techniques. This allows workflows to analyze, search, and process unstructured documents at scale without manual text extraction.

This node is commonly placed early in document processing pipelines to make file content available to LLMs, search indexes, data entry automation, and classification workflows.


  • File Path / Folder Path (String, Optional) - Path to the file or folder to be processed. Can be a static value or dynamically supplied from a previous node using an expression.

  • Extracted Text - The text content extracted from the file or files.
  • Structured Output - Depending on the selected Option, the output may include the original input record fields alongside the extracted text, or return only the extracted text.

Name | Type | Required | Default | Description
File Path | String | No | (empty) | Path to a specific file to process. Supports dynamic input from upstream nodes. When populated, the node processes this single file.
Folder Path | String | No | (empty) | Base folder path for batch file processing. When populated, the node processes all supported files found in the folder.
Remove File After Processing | Boolean (Toggle) | No | Off | When enabled, deletes the original file from storage after successful text extraction. Use with caution in compliance-sensitive industries where file retention is required.
Limit OCR | Boolean (Toggle) | No | Off | When enabled, reduces the depth and detail of OCR processing to improve speed. Best suited for high-volume workflows where documents are simple and text quality is consistent.
Option | Dropdown | No | Original with appended result column | Controls output format. "Original with appended result column" keeps all incoming data fields and adds the extracted text as a new column. "Return result column only" outputs only the extracted text, discarding original input fields.
Result Property Name | String | No | text_result | The name of the output property that stores the extracted text content.
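
The effect of Option and Result Property Name on an output record can be modeled in a few lines of Python. This is only an illustrative sketch of the documented behavior, not Builder's implementation: `shape_output` and the sample record fields are hypothetical names introduced here.

```python
from typing import Any

# The two Option values described in the parameter table.
APPEND = "Original with appended result column"
RESULT_ONLY = "Return result column only"


def shape_output(record: dict[str, Any], text: str,
                 option: str = APPEND,
                 result_property_name: str = "text_result") -> dict[str, Any]:
    """Hypothetical model of how Option shapes one output record."""
    if option == RESULT_ONLY:
        # Original input fields are discarded; only the extracted text remains.
        return {result_property_name: text}
    # Default: keep every incoming field and add the extracted text as a new one.
    return {**record, result_property_name: text}


# Made-up incoming record, for illustration only.
incoming = {"file_name": "invoice_001.png", "customer_id": 42}
appended = shape_output(incoming, "Invoice #001 ...",
                        result_property_name="invoice_text")
only_text = shape_output(incoming, "Invoice #001 ...", RESULT_ONLY, "invoice_text")

print(appended)
print(only_text)
```

With the default Option the downstream node still sees `file_name` and `customer_id`; with "Return result column only" it sees just `invoice_text`.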

The node supports a wide range of document and image formats:

  • Documents: PDF, DOCX, TXT, RTF
  • Spreadsheets: XLSX, CSV (basic parsing into text)
  • Images: PNG, JPG/JPEG, TIFF, BMP
  • Scanned Files: Multi-page PDFs and image-based PDFs (via OCR)

Note: OCR quality varies depending on scan resolution, document quality, image contrast, and language. High-resolution scans (300 DPI or above) generally produce more accurate text extraction than low-resolution or heavily compressed images.


When the node executes, it resolves the file or folder path from the configured parameters or from upstream node output. For each file, it determines the appropriate extraction method: native text parsing for text-based documents (PDF with embedded text, DOCX, TXT), or OCR processing for image-based files and scanned PDFs. The extracted text is placed into the output property named by Result Property Name, and the result is passed downstream according to the selected Option.
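
The choice between native parsing and OCR described above might look roughly like the following sketch. It rests on assumptions: the node's real detection logic is not documented here, `choose_extraction_method` is a hypothetical helper, whether a PDF has embedded text is passed in as a flag rather than detected, and treating RTF, XLSX, and CSV as natively parsed is a guess based on the supported formats list.

```python
from pathlib import Path

# Image formats from the supported formats list; these always need OCR.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
# Text-based formats; RTF, XLSX and CSV are assumed to be parsed natively.
NATIVE_EXTENSIONS = {".docx", ".txt", ".rtf", ".xlsx", ".csv"}


def choose_extraction_method(file_path: str,
                             pdf_has_embedded_text: bool = True) -> str:
    """Return 'native', 'ocr', or 'unsupported' for a file (illustrative only)."""
    suffix = Path(file_path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "ocr"
    if suffix == ".pdf":
        # Scanned or image-based PDFs carry no text layer, so they fall back to OCR.
        return "native" if pdf_has_embedded_text else "ocr"
    if suffix in NATIVE_EXTENSIONS:
        return "native"
    return "unsupported"


for name in ("scan.PNG", "contract.pdf", "notes.docx", "archive.zip"):
    print(name, "->", choose_extraction_method(name))
```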

When Folder Path is used, the node processes all supported files in the folder and produces one output record per file. The output records can then be iterated or aggregated by downstream nodes.
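
A minimal sketch of the one-record-per-file behavior, useful for predicting how many records a folder should yield before running the workflow. Several details are assumptions: the `file_path` field name, the stub extractor, and the choice not to recurse into subfolders are all hypothetical, since the source does not specify them.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Extensions taken from the supported formats list; the real node may accept others.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".rtf", ".xlsx", ".csv",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp",
}


def list_supported_files(folder_path: str) -> list[Path]:
    """Return supported files directly inside folder_path (no recursion, by assumption)."""
    folder = Path(folder_path)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def build_records(folder_path: str, extract: Callable[[Path], str],
                  result_property_name: str = "text_result") -> list[dict[str, str]]:
    """Produce one output record per supported file; unsupported files are skipped."""
    return [
        {"file_path": str(p), result_property_name: extract(p)}
        for p in list_supported_files(folder_path)
    ]


# Demo with a throwaway folder and a stub extractor standing in for real parsing/OCR.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.pdf", "b.PNG", "notes.zip"):
        Path(folder, name).write_bytes(b"")
    records = build_records(folder, lambda p: f"text of {p.name}", "contract_text")

for record in records:
    print(record)
```

In the demo, the two supported files each yield a record and `notes.zip` is skipped, which mirrors how unsupported extensions are described as being ignored.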


A workflow processes scanned invoice images uploaded to a shared folder and extracts text for automated data entry into an accounting system.

Setup:

  • Folder Path: /uploads/invoices/
  • Remove File After Processing: Off
  • Limit OCR: On
  • Result Property Name: invoice_text

Result: Invoice images are processed quickly with limited OCR, and the extracted text is stored in invoice_text for downstream parsing and data entry steps.

A workflow receives email attachments from customers and extracts text for AI-based categorization and routing.

Setup:

  • File Path: Dynamically set from email attachment path
  • Option: Return result column only
  • Result Property Name: attachment_text

Result: Attachment content is extracted into plain text and passed directly to an LLM node for categorization and priority assignment.

A workflow processes PDF contracts from a folder and extracts full text for clause analysis and compliance checking.

Setup:

  • Folder Path: /documents/contracts/
  • Remove File After Processing: Off
  • Limit OCR: Off
  • Option: Original with appended result column
  • Result Property Name: contract_text

Result: Contract text is extracted in full detail and appended to the original file metadata record, which is then passed to a downstream analysis node.


  • Organize files clearly: Use structured folder paths for batch processing to make it easy to target specific document types. Processing mixed file types from a single folder can complicate downstream handling.
  • Enable Limit OCR for speed: For high-volume workflows processing simple documents with consistent formatting (such as printed invoices or typed forms), enabling Limit OCR reduces processing time. For complex, handwritten, or degraded documents, disable it for more accurate results.
  • Test with sample files: Before scaling up a batch processing workflow, test the node with a representative sample of files to verify extraction quality and adjust OCR settings as needed.
  • Be cautious with Remove File After Processing: In industries with compliance or audit requirements (healthcare, finance, legal), disabling this toggle is the safer default. Only enable it when the extracted text has been verified and preserved in a downstream system.
  • Use Result Property Name descriptively: Name the output property to reflect the document type, such as contract_text, invoice_text, or support_attachment_text. This makes downstream node configuration clearer.
  • Handle encoding variation: Some documents may contain text in non-standard encodings or mixed languages. Test with actual document samples from your environment to confirm extraction accuracy for your specific use case.

Extracted text is empty or incomplete

The most common cause is that the file is an image-based PDF or a scanned document with no embedded text, and Limit OCR is preventing full OCR processing. Disable Limit OCR and re-run the workflow. Also confirm that the file type appears in the supported formats list.

Extracted text is inaccurate

Scan resolution and document quality directly affect OCR accuracy. For best results, ensure source documents are scanned at 300 DPI or higher. Heavily compressed images, low-contrast text, or handwriting may produce inaccurate results regardless of settings.

File or folder cannot be found or read

Verify that the path provided to File Path or Folder Path is correct and that the workflow runtime has read access to the target location. If the path is dynamically sourced from an upstream node, inspect that node’s output to confirm the field is populated correctly.
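
These checks can be approximated with a short pre-flight script. The sketch below assumes it runs with the same filesystem permissions as the workflow runtime, which may not hold in your deployment; `check_path` and its return values are hypothetical and not part of Builder.

```python
import os


def check_path(path: str) -> str:
    """Classify a File Path / Folder Path value before handing it to the node."""
    if not path or not path.strip():
        # Typical when an upstream node did not populate the field.
        return "empty"
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        # The path exists, but the current process cannot read it.
        return "unreadable"
    return "ok"


print(check_path(""))
print(check_path("/definitely/not/a/real/path/xyz.pdf"))
print(check_path(os.getcwd()))
```

An "empty" result usually points at the upstream node's output rather than at the file system.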

Some files in a folder are not processed

The node only processes file types in its supported formats list. Files with unsupported extensions are skipped. Review the folder contents and confirm that all files requiring processing use supported formats.

Remove File After Processing deleted a file unexpectedly


File deletion is irreversible once the node has run. If files were deleted unintentionally, disable the toggle and restore the files from backup. Always verify workflow behavior in a test environment before enabling this toggle in production.


  • Given: invoice.jpg with Limit OCR = On - Expected: Extracted text produced with faster but potentially less detailed OCR results.
  • Given: Folder containing 3 PDFs with Option = Return result column only - Expected: Output array of three records, each containing extracted plain text for one file.
  • Given: DOCX file with embedded text and Limit OCR = Off - Expected: Full text content of the document extracted accurately.
  • Given: File path pointing to a nonexistent file - Expected: Error indicating the file was not found.

  • ConvertBase64ToFile - Produces binary files from Base64 strings that can then be processed by this node.
  • String Operation - Can clean and format the extracted text output from this node.
  • Set Transformation - Can filter or reshape the output records produced when processing a folder of files.
  • Custom Script - Can apply custom parsing logic to the raw text extracted by this node.