FileToText


🎯 Purpose

The FileToText node converts documents (PDFs, images, office files) into machine-readable text using OCR (Optical Character Recognition) and parsing techniques. This allows workflows to analyze, search, and process unstructured documents at scale.

📥 Inputs

  • File Path / Folder Path (String, Optional): Path to the file(s) to be processed. Can be static or dynamically supplied from previous nodes.

📤 Outputs

  • Extracted Text: The text content extracted from the file(s).
  • Structured Output: Depending on configuration, may include metadata and original file information.

⚙️ Parameters

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| File Path | String | No | (empty) | Specific file path to process. Supports dynamic input. |
| Folder Path | String | No | (empty) | Base folder path for batch file processing. |
| Remove File After Processing | Boolean (Toggle) | No | Off | Deletes the original file after extraction. |
| Limit OCR | Boolean (Toggle) | No | Off | Limits OCR depth for faster but less detailed processing. |
| Option | Dropdown | No | Original with appended result column | Defines output format: keep original data + extracted text, or return text only. |
| Result Property Name | String | No | text_result | Name of the property that stores extracted text. |
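
The effect of Option and Result Property Name on the output can be sketched in Python. This is an illustration under stated assumptions, not the node's implementation: the `FileToTextSettings` class, the `shape_output` function, and the sample row are hypothetical. The defaults and the two option labels are taken from this page, but the exact strings the node accepts may differ.

```python
from dataclasses import dataclass

# Option labels as they appear on this page; the exact values the node
# expects internally are an assumption.
APPEND = "Original with appended result column"
RESULT_ONLY = "Return result column only"


@dataclass
class FileToTextSettings:
    """Hypothetical container mirroring the parameter table and its defaults."""
    file_path: str = ""
    folder_path: str = ""
    remove_file_after_processing: bool = False
    limit_ocr: bool = False
    option: str = APPEND
    result_property_name: str = "text_result"


def shape_output(row: dict, extracted_text: str, settings: FileToTextSettings) -> dict:
    """Show how the two Option values could shape a single output item."""
    if settings.option == RESULT_ONLY:
        # Text only: the incoming fields are dropped.
        return {settings.result_property_name: extracted_text}
    # Default: keep the incoming fields and append the extracted text.
    return {**row, settings.result_property_name: extracted_text}


row = {"file_name": "invoice.pdf"}  # made-up incoming item
print(shape_output(row, "Total: 120.00", FileToTextSettings()))
print(shape_output(row, "Total: 120.00", FileToTextSettings(option=RESULT_ONLY)))
```

With the defaults, the extracted text is added next to the original fields under `text_result`. Switching Option leaves only the result property.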

📂 Supported File Types

The node supports a wide range of document and image formats:

  • Documents: PDF, DOCX, TXT, RTF
  • Spreadsheets: XLSX, CSV (basic parsing into text)
  • Images: PNG, JPG/JPEG, TIFF, BMP
  • Scanned Files: Multi-page PDFs and image-based PDFs (via OCR)

⚠️ Note: OCR quality may vary depending on scan resolution, file quality, and language.
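
When a folder holds mixed content, it can help to check extensions before handing files to the node. The sketch below is a hypothetical pre-filter, not part of FileToText: the group names and the `classify` and `split_supported` helpers are assumptions, and the extension sets are copied from the list above. Variants not listed there, such as `.tif`, are treated as unsupported.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Extensions copied from the supported-types list; group names are illustrative.
SUPPORTED = {
    "document": {".pdf", ".docx", ".txt", ".rtf"},
    "spreadsheet": {".xlsx", ".csv"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp"},
}


def classify(path: str) -> Optional[str]:
    """Return the group for a file, or None if its extension is not listed."""
    ext = Path(path).suffix.lower()
    for group, extensions in SUPPORTED.items():
        if ext in extensions:
            return group
    return None


def split_supported(paths: List[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (listed, not listed) while preserving input order."""
    listed = [p for p in paths if classify(p) is not None]
    unlisted = [p for p in paths if classify(p) is None]
    return listed, unlisted


files = ["scan.JPG", "report.pdf", "notes.md", "data.csv"]
print(split_supported(files))
```

Matching is case-insensitive, so `scan.JPG` is treated the same as `scan.jpg`.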

💡 Example Usage

Invoice Automation

  • Setup:
    • Folder Path = /uploads/invoices/
    • Remove File After Processing = Off
    • Limit OCR = On
    • Result Property Name = invoice_text
  • Result: Text extracted from each invoice is stored in invoice_text for automated data entry.

Customer Support Documents

  • Setup:
    • Dynamic File Path from email attachments
    • produceChunksFromPdf = Off
    • Option = Return result column only
  • Result: Attachments are extracted into plain text for categorization and AI triage.

📘 Best Practices

  • Organize files in clear folder structures for easier batch processing.
  • Enable Limit OCR for high-volume, simple documents.
  • Use chunking (produceChunksFromPdf) for large or complex PDFs that need contextual sectioning.
  • Always test with sample files before scaling up.
  • Be cautious with Remove File After Processing in compliance-heavy industries.
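
One way to reduce the risk of enabling Remove File After Processing is to copy the originals elsewhere before the node runs. The following is a minimal sketch of such a pre-step, assuming the files sit directly inside one folder. The `archive_folder` helper and any paths passed to it are hypothetical and not part of FileToText.

```python
import shutil
from pathlib import Path
from typing import List


def archive_folder(source: str, archive: str) -> List[Path]:
    """Copy every file directly inside `source` into `archive`.

    Subfolders are skipped. Returns the paths of the copies in name order.
    """
    src, dst = Path(source), Path(archive)
    dst.mkdir(parents=True, exist_ok=True)
    copies = []
    for item in sorted(src.iterdir()):
        if item.is_file():
            # copy2 also tries to preserve timestamps and other metadata.
            copies.append(Path(shutil.copy2(item, dst / item.name)))
    return copies
```

Calling `archive_folder("/uploads/invoices/", "/archive/invoices/")` before the workflow would leave a retained copy of each invoice even if the node deletes the originals. The archive path here is only an example.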

🧪 Test Cases

  • Given: invoice.jpg with Limit OCR = On →
    Expected: Text is extracted faster, with less detail than a full OCR pass.
  • Given: Folder with 3 PDFs, Option = Return result column only →
    Expected: Output array of plain text results, one per file.