OCR Node - Synthreo Builder

OCR node for Builder - extract machine-readable text from images, scanned PDFs, and photos using optical character recognition for downstream parsing, search, and AI analysis.

The OCR node converts files (PDFs, images, Word documents, and plain text files) into readable text. It extracts text directly where the file already contains it and applies Optical Character Recognition (OCR) where the content is only available as an image.

This enables workflows to analyze, store, and process content from documents without manual transcription. Once text is extracted, it can be passed to AI analysis nodes, stored in a database, used as context for an LLM prompt, or searched for specific values.


Inputs:

  • File Path - A static or dynamic path to a single file from a previous workflow step. For example, a path returned by the Email Receiver node or the path of a file downloaded by the HTTP Client node.
  • Folder Path - A path to a folder containing multiple files for batch document processing.

Outputs:

  • Extracted text from the input file or files.
  • Optional chunked text output for large PDFs when chunking is enabled.

Output Format (single file example):

{
  "text_result": "Extracted document content here..."
}

Output Format (chunked PDF example):

{
  "text_result": "First portion of the document content...",
  "chunk_index": 1,
  "total_chunks": 4
}

When chunking is enabled, multiple output records are produced - one per chunk.
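
If a later step needs the whole document again, the chunk records can be put back together by sorting on chunk_index. The Python sketch below is only an illustration: it assumes the records arrive as a list of dictionaries shaped like the chunked example above, that chunk_index counts from 1 up to total_chunks, and that joining chunks with a blank line is acceptable. The helper name reassemble_chunks is hypothetical and not part of Synthreo Builder.

```python
from typing import Dict, List


def reassemble_chunks(records: List[Dict], text_key: str = "text_result") -> str:
    """Join chunk records back into one string, ordered by chunk_index."""
    if not records:
        return ""
    ordered = sorted(records, key=lambda r: r["chunk_index"])
    expected = ordered[0]["total_chunks"]
    # Assumption: chunk_index runs from 1 to total_chunks with no gaps.
    indexes = [r["chunk_index"] for r in ordered]
    if indexes != list(range(1, expected + 1)):
        raise ValueError(
            f"missing or duplicate chunks: got {indexes}, expected 1..{expected}"
        )
    return "\n\n".join(r[text_key] for r in ordered)


# Records may arrive out of order; sorting restores the document order.
records = [
    {"text_result": "Part two.", "chunk_index": 2, "total_chunks": 2},
    {"text_result": "Part one.", "chunk_index": 1, "total_chunks": 2},
]
print(reassemble_chunks(records))
```

If Result Property Name was changed from its default, pass the new name as text_key.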


| Name | Field Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- | --- |
| File Path | getterTemplate | Smart text | Optional | Empty | Path or pattern to locate a single file. Supports dynamic values from upstream nodes. |
| Folder Path | filesFolderPath | Smart text | Optional | Empty | Base folder path for processing multiple files in batch. |
| Remove File After Processing | removeFileAfterProcessing | Toggle | Optional | Off | When On, deletes the original file after text extraction is complete. Use with caution in production. |
| Limit OCR | limitOcr | Toggle | Optional | Off | When On, restricts OCR to only run when necessary, reducing processing time and cost. When Off, OCR runs for all documents including scanned and handwritten content. |
| Produce Chunks From PDF | produceChunksFromPdf | Toggle | Optional | Off | When On, splits large PDFs into smaller text chunks, producing one output record per chunk. |
| Output Mode | outTransformId | Dropdown | Yes | Original + appended result | Choose between appending extracted text to the input row or returning only the extracted text. |
| Result Property Name | outColumnName | Text | Yes | text_result | The name of the property that holds the extracted text in the output data. |

| Option | Description |
| --- | --- |
| Original + appended result | Keeps all upstream data and adds the extracted text as a new property. Useful when downstream nodes need the original file metadata alongside the text. |
| Result only | Returns only the extracted text, discarding upstream fields. Useful when only the text content is needed for further analysis. |
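
The difference between the two output modes is easiest to see as a data shape. The sketch below models it with plain Python dictionaries. It is not the node's implementation: the function apply_output_mode and the upstream fields filePath and sender are hypothetical, and it assumes that a same-named upstream property would be overwritten by the extracted text.

```python
from typing import Any, Dict


def apply_output_mode(
    upstream_row: Dict[str, Any],
    extracted_text: str,
    mode: str = "Original + appended result",
    result_property: str = "text_result",
) -> Dict[str, Any]:
    """Return the output record shape for the given Output Mode."""
    if mode == "Original + appended result":
        # Keep every upstream field and add the text as one more property.
        return {**upstream_row, result_property: extracted_text}
    if mode == "Result only":
        # Drop upstream fields; only the extracted text remains.
        return {result_property: extracted_text}
    raise ValueError(f"unknown output mode: {mode!r}")


# Hypothetical upstream row, e.g. produced by an email or download step.
row = {"filePath": "/invoices/incoming/inv-001.pdf", "sender": "ap@example.com"}
appended = apply_output_mode(row, "Invoice 001 ...")
only = apply_output_mode(row, "Invoice 001 ...", mode="Result only")
print(appended)
print(only)
```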

The OCR node can process the following file formats:

  • PDF (both digital and scanned/image-based)
  • Images: JPEG, PNG, TIFF, BMP
  • Microsoft Word documents (.docx)
  • Plain text files (.txt)

For digital PDFs (where text is already embedded), the node extracts text directly without invoking OCR. For scanned PDFs and image files, OCR is applied to recognize and extract text from the visual content.
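
Before pointing the node at a folder, it can help to check which files match the supported formats listed above. The following Python sketch separates supported from unsupported files by extension. The extension set is an assumption derived from that list (for example, treating both .jpg and .jpeg as JPEG), and split_supported is a hypothetical helper rather than a Builder feature.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Assumption: these extensions correspond to the supported formats listed above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".docx", ".txt",
}


def split_supported(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (supported, unsupported) by file extension."""
    supported: List[str] = []
    unsupported: List[str] = []
    for p in paths:
        # Compare case-insensitively so "scan.PDF" counts as a PDF.
        if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(p)
        else:
            unsupported.append(p)
    return supported, unsupported


ok, skipped = split_supported(
    ["scan.PDF", "photo.jpeg", "notes.txt", "sheet.xlsx", "README"]
)
print("supported:", ok)
print("unsupported:", skipped)
```

An extension check says nothing about whether a PDF is digital or scanned; that distinction still depends on the file's content.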


When to Use OCR vs. Direct Text Extraction

| Document Type | Recommended Setting |
| --- | --- |
| Digital PDF (searchable text) | Enable Limit OCR to use fast direct extraction |
| Scanned PDF (image-only) | Leave Limit OCR Off to allow OCR processing |
| Photos of documents | Leave Limit OCR Off |
| Word documents (.docx) | Enable Limit OCR - direct extraction is sufficient |
| Mixed folder (some scanned, some digital) | Leave Limit OCR Off to handle all types |

To process a single file:

  1. Drag the OCR node onto your workflow canvas and connect it to the node that provides the file path.
  2. Click the node to open its settings.
  3. Enter the file path in the File Path field. Use a template such as {{filePath}} to reference a path from an upstream node.
  4. If the document is a scanned PDF or image, ensure Limit OCR is Off so OCR is applied.
  5. Set Output Mode and Result Property Name.
  6. Save the configuration and run a test with a sample file.

To process a folder of files in batch:

  1. Set the Folder Path field to the directory containing your documents.
  2. Leave File Path empty when processing a folder.
  3. Configure OCR and chunking settings as needed.
  4. Set Output Mode to Original + appended result to retain file name metadata alongside extracted text.

To chunk a large PDF for AI workflows:

  1. Set the File Path to the target PDF.
  2. Enable Produce Chunks From PDF.
  3. Set Output Mode to Original + appended result to preserve source file context on each chunk.
  4. Connect the output to a LangChain node or vector storage node for downstream AI processing.
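
The three setups above can be summarized as settings objects. The sketch below uses the names from the Field Name column of the parameter table, but how Builder actually stores or serializes node settings is not documented here, so treat the class as a hypothetical model. It also assumes that exactly one of File Path and Folder Path should be set, and that outTransformId holds the option's display label.

```python
from dataclasses import asdict, dataclass


@dataclass
class OcrNodeSettings:
    # Attribute names mirror the "Field Name" column of the parameter table.
    getterTemplate: str = ""                  # File Path
    filesFolderPath: str = ""                 # Folder Path
    removeFileAfterProcessing: bool = False   # Remove File After Processing
    limitOcr: bool = False                    # Limit OCR
    produceChunksFromPdf: bool = False        # Produce Chunks From PDF
    outTransformId: str = "Original + appended result"  # Output Mode
    outColumnName: str = "text_result"        # Result Property Name

    def validate(self) -> None:
        """Raise ValueError if the settings contradict the steps above."""
        has_file = bool(self.getterTemplate.strip())
        has_folder = bool(self.filesFolderPath.strip())
        # Assumption: a node processes either one file or one folder.
        if has_file == has_folder:
            raise ValueError("set exactly one of File Path or Folder Path")
        if not self.outColumnName.strip():
            raise ValueError("Result Property Name is required")


single = OcrNodeSettings(getterTemplate="{{filePath}}")
batch = OcrNodeSettings(filesFolderPath="/invoices/incoming/",
                        outColumnName="invoice_text")
chunked = OcrNodeSettings(getterTemplate="{{report_file_path}}", limitOcr=True,
                          produceChunksFromPdf=True, outColumnName="report_chunk")

for settings in (single, batch, chunked):
    settings.validate()
    print(asdict(settings))
```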

Extract contract text for clause identification and analysis by an LLM.

Configuration:

  • File Path: {{contract_file_path}}
  • Limit OCR: Off (contracts may be scanned)
  • Output Mode: Original + appended result
  • Result Property Name: contract_text

Downstream step: Pass contract_text to an OpenAI GPT node with a prompt that identifies specific clauses.

Pull text from scanned invoices to extract line items, totals, and vendor information for accounting automation.

Configuration:

  • Folder Path: /invoices/incoming/
  • Limit OCR: Off (invoices are often scanned)
  • Output Mode: Original + appended result
  • Result Property Name: invoice_text

Downstream step: Pass extracted text to a Custom Script node that parses invoice fields.
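
As an illustration of what such a parsing step might do, the Python sketch below pulls an invoice number and a total out of invoice_text with regular expressions. Everything about the invoice layout is an assumption (lines such as "Invoice Number: INV-1042" and "Total: $1,234.50"), the function parse_invoice is hypothetical, and the scripting language and API of the Custom Script node are not described here, so the code is shown as a standalone function.

```python
import re
from decimal import Decimal
from typing import Dict, Optional

# Assumption: invoices contain lines such as "Invoice Number: INV-1042" and
# "Total: $1,234.50". Real layouts vary and OCR noise may break these patterns.
INVOICE_NO = re.compile(r"invoice\s+(?:number|no\.?)\s*:\s*(\S+)", re.IGNORECASE)
TOTAL = re.compile(
    r"^\s*total(?:\s+due)?\s*:\s*\$?\s*([\d,]+\.\d{2})\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_invoice(invoice_text: str) -> Dict[str, Optional[object]]:
    """Extract a few fields from OCR text; missing fields come back as None."""
    number = INVOICE_NO.search(invoice_text)
    total = TOTAL.search(invoice_text)
    return {
        "invoice_number": number.group(1) if number else None,
        # Strip thousands separators before converting to an exact decimal.
        "total": Decimal(total.group(1).replace(",", "")) if total else None,
    }


sample = (
    "ACME Supplies\n"
    "Invoice Number: INV-1042\n"
    "Subtotal: $1,200.00\n"
    "Tax: $34.50\n"
    "Total: $1,234.50\n"
)
parsed = parse_invoice(sample)
print(parsed)
```

The total pattern is anchored to the start of a line so that "Subtotal" is not mistaken for "Total". Returning None for unmatched fields lets a later step route incomplete invoices to manual review instead of failing.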

Break down lengthy research reports into chunks for AI analysis and vector database storage.

Configuration:

  • File Path: {{report_file_path}}
  • Limit OCR: On (reports are typically digital PDFs)
  • Produce Chunks From PDF: On
  • Output Mode: Original + appended result
  • Result Property Name: report_chunk

Downstream step: Each chunk is stored in a vector database for retrieval-augmented generation.


| Issue | Likely Cause | Resolution |
| --- | --- | --- |
| Output text is empty | File path is incorrect or the file does not exist | Verify the file path value from the upstream node. Check that the file exists at that location. |
| OCR produces garbled text | Image quality is too low or resolution is insufficient | Use higher-resolution scans (at least 300 DPI) for better OCR accuracy. |
| Processing is very slow | Large file or OCR running on digital PDFs unnecessarily | Enable Limit OCR for digital PDFs to skip the OCR step and use direct text extraction. |
| Too many output chunks | Chunk size is very small | The chunk size is determined by internal defaults. If you need finer control over chunk size, pass the extracted text to a LangChain node configured with your target chunk size. |
| File is deleted unexpectedly | Remove File After Processing is enabled | Disable this toggle if you need to retain files after extraction. Only enable it when the workflow is the sole consumer of the file. |
| Folder processing skips some files | File types in the folder are not supported | Verify the file formats in the folder against the supported types list and convert or pre-filter files as needed. |
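
To make the idea of a target chunk size concrete, here is a minimal fixed-size splitter with overlap in plain Python. It is a stand-in for illustration, not the LangChain node and not the OCR node's internal chunking. The function split_text and its chunk_size and overlap parameters are hypothetical, and sizes are counted in characters rather than tokens.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears whole in at least one chunk's neighborhood.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, to avoid a trailing chunk
        # that only repeats the overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
```

Larger chunk_size values produce fewer records; the overlap trades some duplication for continuity across chunk boundaries.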

  • File Organization: Use consistent naming conventions and folder structures. Descriptive file names make it easier to trace extracted text back to the source document.
  • OCR Usage: Enable Limit OCR when processing large volumes of digital PDFs to reduce processing time. Leave it Off when scanned documents are, or may be, present, such as in a mixed folder.
  • Chunking for AI Workflows: Enable chunking when passing text from large documents to LLM nodes. This prevents token limit errors and enables more precise AI responses focused on specific sections.
  • Remove File After Processing: Use this toggle only in workflows where files are temporary and you have confirmed no other process needs them afterward. Accidental deletion cannot be undone.
  • Testing: Always test the node with a representative sample file from each category (digital PDF, scanned PDF, image) before deploying a batch processing workflow.

  • Email Receiver - retrieves email attachments that can then be processed by the OCR node.
  • HTTP Client - downloads files from external URLs that can be passed to the OCR node for extraction.
  • LangChain - accepts OCR-extracted text and further splits or loads it for AI workflows.
  • OpenAI GPT - receives extracted text as context for analysis, summarization, or question answering.