OCR Node - Synthreo Builder
OCR node for Builder - extract machine-readable text from images, scanned PDFs, and photos using optical character recognition for downstream parsing, search, and AI analysis.
Purpose
The OCR node converts files (PDFs, images, Word documents, and similar formats) into readable text using text extraction and Optical Character Recognition (OCR).
This enables workflows to analyze, store, and process content from documents without manual transcription. Once text is extracted, it can be passed to AI analysis nodes, stored in a database, used as context for an LLM prompt, or searched for specific values.
Inputs
- File Path - A static or dynamic path to a single file from a previous workflow step. For example, the path of an attachment returned by the Email Receiver node or of a file downloaded by the HTTP Client node.
- Folder Path - A path to a folder containing multiple files for batch document processing.
Outputs
- Extracted text from the input file or files.
- Optional chunked text output for large PDFs when chunking is enabled.
Output Format (single file example):
{ "text_result": "Extracted document content here..." }
Output Format (chunked PDF example):
{ "text_result": "First portion of the document content...", "chunk_index": 1, "total_chunks": 4 }
When chunking is enabled, multiple output records are produced - one per chunk.
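Because chunked output arrives as separate records, a downstream script sometimes needs to put a document back together. The following is a minimal sketch of that step in Python, not part of the OCR node itself. It assumes each record also carries a `file_path` field from upstream (a hypothetical name, available only if Output Mode keeps the original data) and that `chunk_index` orders the chunks as in the example above.

```python
from collections import defaultdict


def reassemble_chunks(records, text_key="text_result", source_key="file_path"):
    """Group chunk records by source file and join their text in chunk order.

    `source_key` is a hypothetical upstream field name, not one the OCR node
    is documented to emit.
    """
    by_source = defaultdict(list)
    for record in records:
        by_source[record.get(source_key, "")].append(record)

    documents = {}
    for source, chunks in by_source.items():
        # Records may arrive out of order, so sort on chunk_index first.
        ordered = sorted(chunks, key=lambda r: r.get("chunk_index", 1))
        documents[source] = "\n".join(r[text_key] for r in ordered)
    return documents


records = [
    {"file_path": "report.pdf", "text_result": "Part two.", "chunk_index": 2, "total_chunks": 2},
    {"file_path": "report.pdf", "text_result": "Part one.", "chunk_index": 1, "total_chunks": 2},
]
full_text = reassemble_chunks(records)
```

Here `full_text["report.pdf"]` holds the two parts joined in index order, separated by a newline.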
Parameters
| Name | Field Name | Type | Required | Default | Description |
|---|---|---|---|---|---|
| File Path | getterTemplate | Smart text | Optional | Empty | Path or pattern to locate a single file. Supports dynamic values from upstream nodes. |
| Folder Path | filesFolderPath | Smart text | Optional | Empty | Base folder path for processing multiple files in batch. |
| Remove File After Processing | removeFileAfterProcessing | Toggle | Optional | Off | When On, deletes the original file after text extraction is complete. Use with caution in production. |
| Limit OCR | limitOcr | Toggle | Optional | Off | When On, restricts OCR to only run when necessary, reducing processing time and cost. When Off, OCR runs for all documents including scanned and handwritten content. |
| Produce Chunks From PDF | produceChunksFromPdf | Toggle | Optional | Off | When On, splits large PDFs into smaller text chunks, producing one output record per chunk. |
| Output Mode | outTransformId | Dropdown | Yes | Original + appended result | Choose between appending extracted text to the input row or returning only the extracted text. |
| Result Property Name | outColumnName | Text | Yes | text_result | The name of the property that holds the extracted text in the output data. |
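As a way to reason about how these parameters interact, the sketch below models them as a plain Python dictionary keyed by the field names in the table and checks a few combinations before a run. It is illustrative only: how Builder actually stores node settings is not described here, the stored values for `outTransformId` are assumed to match the display labels, and the rule that exactly one of File Path or Folder Path should be set is an assumption drawn from the batch instructions on this page.

```python
# Defaults copied from the parameter table; the dict layout is hypothetical.
DEFAULTS = {
    "getterTemplate": "",
    "filesFolderPath": "",
    "removeFileAfterProcessing": False,
    "limitOcr": False,
    "produceChunksFromPdf": False,
    "outTransformId": "Original + appended result",  # assumed to equal the label
    "outColumnName": "text_result",
}
OUTPUT_MODES = {"Original + appended result", "Result only"}


def validate_settings(overrides):
    """Merge overrides onto the defaults and return (settings, problems)."""
    problems = [f"unknown field: {name}" for name in sorted(set(overrides) - set(DEFAULTS))]
    settings = {**DEFAULTS, **overrides}

    has_file = bool(settings["getterTemplate"].strip())
    has_folder = bool(settings["filesFolderPath"].strip())
    if has_file == has_folder:
        # Assumption: a run needs one source, either a file or a folder.
        problems.append("set exactly one of getterTemplate or filesFolderPath")
    if settings["outTransformId"] not in OUTPUT_MODES:
        problems.append("outTransformId is not a known output mode")
    if not settings["outColumnName"].strip():
        problems.append("outColumnName is required")
    return settings, problems


settings, problems = validate_settings({"getterTemplate": "{{filePath}}", "limitOcr": True})
```

With the overrides shown, `problems` is empty; calling `validate_settings({})` reports that no file or folder source was set.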
Output Mode Options
| Option | Description |
|---|---|
| Original + appended result | Keeps all upstream data and adds the extracted text as a new property. Useful when downstream nodes need the original file metadata alongside the text. |
| Result only | Returns only the extracted text, discarding upstream fields. Useful when only the text content is needed for further analysis. |
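The difference between the two modes can be illustrated with a small Python simulation of what happens to one upstream row. This is a sketch of the described behavior, not the node's implementation; the function name and the sample row fields are hypothetical.

```python
def apply_output_mode(upstream_row, extracted_text,
                      mode="Original + appended result",
                      result_property="text_result"):
    """Return the output row that the chosen mode would be expected to produce."""
    if mode == "Original + appended result":
        # Keep every upstream field and add the extracted text alongside it.
        return {**upstream_row, result_property: extracted_text}
    if mode == "Result only":
        # Drop upstream fields and keep only the extracted text.
        return {result_property: extracted_text}
    raise ValueError(f"Unknown output mode: {mode}")


row = {"file_name": "contract.pdf", "sender": "legal@example.com"}
appended = apply_output_mode(row, "Agreement text...")
result_only = apply_output_mode(row, "Agreement text...", mode="Result only")
```

`appended` contains `file_name`, `sender`, and `text_result`, while `result_only` contains `text_result` alone.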
Supported File Types
The OCR node can process the following file formats:
- PDF (both digital and scanned/image-based)
- Images: JPEG, PNG, TIFF, BMP
- Microsoft Word documents (.docx)
- Plain text files (.txt)
For digital PDFs (where text is already embedded), the node extracts text directly without invoking OCR. For scanned PDFs and image files, OCR is applied to recognize and extract text from the visual content.
When to Use OCR vs. Direct Text Extraction
| Document Type | Recommended Setting |
|---|---|
| Digital PDF (searchable text) | Enable Limit OCR to use fast direct extraction |
| Scanned PDF (image-only) | Leave Limit OCR Off to allow OCR processing |
| Photos of documents | Leave Limit OCR Off |
| Word documents (.docx) | Enable Limit OCR - direct extraction is sufficient |
| Mixed folder (some scanned, some digital) | Leave Limit OCR Off to handle all types |
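The table reduces to a simple rule: enable Limit OCR only when every document in the run is digital. A minimal Python sketch of that rule follows. The category names are labels invented for this example, and classifying a file into a category is left to the caller.

```python
# Whether direct extraction alone is sufficient for each category in the table.
# The category keys are hypothetical labels, not values used by the node.
DIRECT_EXTRACTION_OK = {
    "digital_pdf": True,
    "docx": True,
    "scanned_pdf": False,
    "photo": False,
}


def recommend_limit_ocr(categories):
    """Return True (enable Limit OCR) only if every category is digital."""
    categories = list(categories)
    if not categories:
        raise ValueError("at least one document category is required")
    return all(DIRECT_EXTRACTION_OK[c] for c in categories)


digital_only = recommend_limit_ocr(["digital_pdf", "docx"])
mixed_folder = recommend_limit_ocr(["digital_pdf", "scanned_pdf"])
```

For the digital-only batch the sketch recommends turning Limit OCR On; for the mixed folder it recommends leaving it Off, matching the last row of the table.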
Step-by-Step Configuration
Single File Extraction
- Drag the OCR node onto your workflow canvas and connect it to the node that provides the file path.
- Click the node to open settings.
- Enter the file path in the File Path field. Use {{filePath}} to reference a path from an upstream node.
- If the document is a scanned PDF or image, ensure Limit OCR is Off so OCR is applied.
- Set Output Mode and Result Property Name.
- Save the configuration and run a test with a sample file.
Batch Folder Processing
- Set the Folder Path field to the directory containing your documents.
- Leave File Path empty when processing a folder.
- Configure OCR and chunking settings as needed.
- Set Output Mode to Original + appended result to retain file name metadata alongside extracted text.
Processing Large PDFs with Chunking
- Set the File Path to the target PDF.
- Enable Produce Chunks From PDF.
- Set Output Mode to Original + appended result to preserve source file context on each chunk.
- Connect the output to a LangChain node or vector storage node for downstream AI processing.
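If the chunks produced by the node do not match the size a downstream model expects, the text can be split again after extraction. The sketch below uses fixed-size character windows with overlap in plain Python. It is a stand-in for illustration, not the node's internal chunking and not the LangChain splitter; the `chunk_size` and `overlap` values are arbitrary assumptions.

```python
def split_text(text, chunk_size=1000, overlap=100):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def rechunk_record(record, text_key="text_result", **split_options):
    """Expand one output record into several, renumbering chunk_index from 1."""
    pieces = split_text(record[text_key], **split_options)
    return [
        {**record, text_key: piece, "chunk_index": i, "total_chunks": len(pieces)}
        for i, piece in enumerate(pieces, start=1)
    ]


pieces = split_text("abcdefghij", chunk_size=4, overlap=1)
# pieces == ["abcd", "defg", "ghij"]
```

Character windows can cut words and sentences in half, so a splitter that respects sentence or paragraph boundaries is usually a better fit for retrieval workloads.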
Real-World Use Cases
Legal Document Analysis
Extract contract text for clause identification and analysis by an LLM.
Configuration:
- File Path: {{contract_file_path}}
- Limit OCR: Off (contracts may be scanned)
- Output Mode: Original + appended result
- Result Property Name: contract_text
Downstream step: Pass contract_text to an OpenAI GPT node with a prompt that identifies specific clauses.
Invoice Processing Automation
Pull text from scanned invoices to extract line items, totals, and vendor information for accounting automation.
Configuration:
- Folder Path: /invoices/incoming/
- Limit OCR: Off (invoices are often scanned)
- Output Mode: Original + appended result
- Result Property Name: invoice_text
Downstream step: Pass extracted text to a Custom Script node that parses invoice fields.
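The parsing step could look roughly like the Python sketch below, which pulls an invoice number and a total out of `invoice_text` with regular expressions. The patterns, the output field names, and the sample text are assumptions for illustration. Real invoices vary widely, OCR output often contains misread characters, and the language supported by the Custom Script node is not specified here.

```python
import re
from decimal import Decimal

# Hypothetical patterns; adjust them to the layouts your vendors actually use.
INVOICE_NUMBER_RE = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)
# \b keeps "Subtotal" from matching as "Total".
TOTAL_RE = re.compile(
    r"\bTotal\s*(?:Due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
)


def parse_invoice(row, text_key="invoice_text"):
    """Return the row with invoice_number and total added (None when not found)."""
    text = row.get(text_key, "")
    number = INVOICE_NUMBER_RE.search(text)
    total = TOTAL_RE.search(text)
    return {
        **row,
        "invoice_number": number.group(1) if number else None,
        # Decimal avoids float rounding on currency amounts.
        "total": Decimal(total.group(1).replace(",", "")) if total else None,
    }


sample = {"invoice_text": "ACME Corp\nInvoice No: INV-1042\nSubtotal: $1,100.00\nTotal: $1,234.50"}
parsed = parse_invoice(sample)
```

Returning `None` for missing fields, rather than raising, lets a later step route unparsed invoices to manual review.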
Research Report Processing
Break down lengthy research reports into chunks for AI analysis and vector database storage.
Configuration:
- File Path: {{report_file_path}}
- Limit OCR: On (reports are typically digital PDFs)
- Produce Chunks From PDF: On
- Output Mode: Original + appended result
- Result Property Name: report_chunk
Downstream step: Each chunk is stored in a vector database for retrieval-augmented generation.
Troubleshooting
| Issue | Likely Cause | Resolution |
|---|---|---|
| Output text is empty | File path is incorrect or the file does not exist | Verify the file path value from the upstream node. Check that the file exists at that location. |
| OCR produces garbled text | Image quality is too low or resolution is insufficient | Use higher-resolution scans (at least 300 DPI) for better OCR accuracy. |
| Processing is very slow | Large file or OCR running on digital PDFs unnecessarily | Enable Limit OCR for digital PDFs to skip the OCR step and use direct text extraction. |
| Too many output chunks | Chunk size is very small | The chunk size is determined by internal defaults. If you need finer control over chunk size, pass the extracted text to a LangChain node configured with your target chunk size. |
| File is deleted unexpectedly | Remove File After Processing is enabled | Disable this toggle if you need to retain files after extraction. Only enable it when the workflow is the sole consumer of the file. |
| Folder processing skips some files | File types in the folder are not supported | Verify the file formats in the folder against the supported types list and convert or pre-filter files as needed. |
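To find out which files in a folder are likely to be skipped, the folder can be checked before the workflow runs. The Python sketch below partitions file names by extension. The extension set is an assumption derived from the supported formats listed on this page (for example, `.jpg` and `.jpeg` for JPEG); the node's actual matching rules may differ.

```python
from pathlib import Path

# Assumed extensions for the documented formats: PDF, JPEG, PNG, TIFF, BMP, DOCX, TXT.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".docx", ".txt",
}


def partition_by_support(folder):
    """Return (supported, unsupported) file names found directly in `folder`."""
    supported, unsupported = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # subfolders are ignored in this sketch
        # Compare case-insensitively so "SCAN.PDF" counts as a PDF.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

Files in the second list are candidates for conversion or removal before the batch is started.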
Best Practices
- File Organization: Use consistent naming conventions and folder structures. Descriptive file names make it easier to trace extracted text back to the source document.
- OCR Usage: Enable Limit OCR when processing large volumes of digital PDFs to reduce processing time. Only leave it Off when you know scanned documents are present.
- Chunking for AI Workflows: Enable chunking when passing text from large documents to LLM nodes. This prevents token limit errors and enables more precise AI responses focused on specific sections.
- Remove File After Processing: Use this toggle only in workflows where files are temporary and you have confirmed no other process needs them afterward. Accidental deletion cannot be undone.
- Testing: Always test the node with a representative sample file from each category (digital PDF, scanned PDF, image) before deploying a batch processing workflow.
Related Nodes
- Email Receiver - retrieves email attachments that can then be processed by the OCR node.
- HTTP Client - downloads files from external URLs that can be passed to the OCR node for extraction.
- LangChain - accepts OCR-extracted text and further splits or loads it for AI workflows.
- OpenAI GPT - receives extracted text as context for analysis, summarization, or question answering.