FileToText
Purpose
The FileToText node converts documents (PDFs, images, office files) into machine-readable text using OCR (Optical Character Recognition) and parsing techniques. This allows workflows to analyze, search, and process unstructured documents at scale.
Inputs
- File Path / Folder Path (String, Optional): Path to the file(s) to be processed. Can be static or dynamically supplied from previous nodes.
Outputs
- Extracted Text: The text content extracted from the file(s).
- Structured Output: Depending on configuration, may include metadata and original file information.
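As a rough illustration of the two outputs, the sketch below shows how one record might look in each output mode. The function `shape_output`, the sample record, and its field names are assumptions made for illustration and are not the node's actual API. Only the option labels and the default property name `text_result` come from this page.

```python
from typing import Any, Dict

# Option labels as they appear on this page; the function itself is hypothetical.
KEEP_ORIGINAL = "Original with appended result column"
TEXT_ONLY = "Return result column only"


def shape_output(
    original: Dict[str, Any],
    text: str,
    option: str = KEEP_ORIGINAL,
    result_property: str = "text_result",
) -> Dict[str, Any]:
    """Illustrate the two output shapes; not the node's real implementation."""
    if option == TEXT_ONLY:
        # Only the extracted text is returned.
        return {result_property: text}
    # Original fields are kept and the extracted text is appended.
    return {**original, result_property: text}


row = {"file": "invoice.pdf"}  # hypothetical incoming record
print(shape_output(row, "sample extracted text"))
print(shape_output(row, "sample extracted text", option=TEXT_ONLY))
```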
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| File Path | String | No | (empty) | Specific file path to process. Supports dynamic input. |
| Folder Path | String | No | (empty) | Base folder path for batch file processing. |
| Remove File After Processing | Boolean (Toggle) | No | Off | Deletes the original file after extraction. |
| Limit OCR | Boolean (Toggle) | No | Off | Limits OCR depth for faster but less detailed processing. |
| Option | Dropdown | No | Original with appended result column | Defines the output format. "Original with appended result column" keeps the original data and adds the extracted text; "Return result column only" returns just the text. |
| Result Property Name | String | No | text_result | Name of the property that stores extracted text. |
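The table can be mirrored in code as a small settings object, which makes the defaults easy to see at a glance. This is a minimal sketch: the class name `FileToTextParams` and its snake_case field names are hypothetical and are not part of the node. The default values are taken from the table above.

```python
from dataclasses import dataclass


@dataclass
class FileToTextParams:
    """Hypothetical container mirroring the parameter table and its defaults."""

    file_path: str = ""                        # File Path
    folder_path: str = ""                      # Folder Path
    remove_file_after_processing: bool = False  # toggle, Off by default
    limit_ocr: bool = False                    # toggle, Off by default
    option: str = "Original with appended result column"
    result_property_name: str = "text_result"


# Batch run over a folder with faster OCR; everything else stays at its default.
params = FileToTextParams(folder_path="/uploads/invoices/", limit_ocr=True)
print(params)
```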
Supported File Types
The node supports a wide range of document and image formats:
- Documents: PDF, DOCX, TXT, RTF
- Spreadsheets: XLSX, CSV (basic parsing into text)
- Images: PNG, JPG/JPEG, TIFF, BMP
- Scanned Files: Multi-page PDFs and image-based PDFs (via OCR)
Note: OCR quality may vary depending on scan resolution, file quality, and language.
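The formats listed above can be checked before files are sent to the node. The sketch below groups them by extension; the category keys and the `classify` helper are illustrative assumptions, and the node itself may detect file types differently.

```python
from pathlib import Path
from typing import Optional

# Extension groups copied from the list above; the grouping keys are illustrative.
SUPPORTED = {
    "document": {".pdf", ".docx", ".txt", ".rtf"},
    "spreadsheet": {".xlsx", ".csv"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp"},
}


def classify(path: str) -> Optional[str]:
    """Return the category for a path, or None if the extension is not listed."""
    suffix = Path(path).suffix.lower()
    for category, extensions in SUPPORTED.items():
        if suffix in extensions:
            return category
    return None
```

Filtering a folder with a check like this before a batch run keeps unsupported files out of the workflow.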
Example Usage
Invoice Automation
- Setup:
  - Folder Path = /uploads/invoices/
  - Keep Remove File After Processing = Off
  - Enable Limit OCR
- Result: Invoice text is extracted into invoice_text for automated data entry. This assumes Result Property Name is set to invoice_text; the default is text_result.
Customer Support Documents
- Setup:
  - Dynamic File Path from email attachments
  - produceChunksFromPdf = Off
  - Option = Return result column only
- Result: Attachments are extracted into plain text for categorization and AI triage.
Best Practices
- Organize files in clear folder structures for easier batch processing.
- Enable Limit OCR for high-volume, simple documents.
- Use chunking (see produceChunksFromPdf in the Customer Support Documents example) for large or complex PDFs that need contextual sectioning.
- Always test with sample files before scaling up.
- Be cautious with Remove File After Processing in compliance-heavy industries.
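One way to follow the advice about testing with sample files is to pull a small, reproducible sample from the batch folder first. This is a minimal sketch using only the Python standard library; the helper name `pick_samples` and the fixed seed are assumptions, and nothing here calls the node itself.

```python
import random
from pathlib import Path
from typing import List


def pick_samples(folder: str, count: int, seed: int = 0) -> List[Path]:
    """Return up to `count` files from `folder`, chosen reproducibly."""
    # Sort first so the same seed always yields the same selection.
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    rng = random.Random(seed)
    return rng.sample(files, min(count, len(files)))
```

The selected files can be copied to a separate test folder that Folder Path points to, with Remove File After Processing left off so the originals stay untouched.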
Test Cases
- Given: invoice.jpg with Limit OCR = On
  Expected: Extracted text with faster but less detailed results.
- Given: Folder with 3 PDFs, Option = Return result column only
  Expected: Output array of plain text results, one per file.
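The second test case can be turned into an automated check. The exact output format is not specified here, so the sketch below assumes each item is a dictionary holding only the result property (`text_result` by default). Both that shape and the helper name `is_text_only_output` are assumptions for illustration.

```python
from typing import Any, List


def is_text_only_output(
    results: List[Any],
    expected_files: int,
    result_property: str = "text_result",
) -> bool:
    """Check for one text-only item per file (assumed output shape)."""
    if len(results) != expected_files:
        return False
    return all(
        isinstance(item, dict)
        and set(item) == {result_property}
        and isinstance(item[result_property], str)
        for item in results
    )


# Hypothetical output for a folder with 3 PDFs.
sample_output = [{"text_result": "a"}, {"text_result": "b"}, {"text_result": "c"}]
print(is_text_only_output(sample_output, expected_files=3))
```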