OCR Node - Synthreo Builder

OCR node for Builder - extract machine-readable text from images, scanned PDFs, and photos using optical character recognition for downstream parsing, search, and AI analysis.

Purpose

The OCR node converts files (PDFs, images, Word documents, and similar formats) into readable text using text extraction and Optical Character Recognition (OCR).

This enables workflows to analyze, store, and process content from documents without manual transcription. Once text is extracted, it can be passed to AI analysis nodes, stored in a database, used as context for an LLM prompt, or searched for specific values.

Inputs

File Path - A static or dynamic path to a single file from a previous workflow step, for example the path of an attachment returned by the Email Receiver node or of a file downloaded by the HTTP Client node.
Folder Path - A path to a folder containing multiple files for batch document processing.

Outputs

Extracted text from the input file or files.
Optional chunked text output for large PDFs when chunking is enabled.

Output Format (single file example):

{
  "text_result": "Extracted document content here..."
}

Output Format (chunked PDF example):

{
  "text_result": "First portion of the document content...",
  "chunk_index": 1,
  "total_chunks": 4
}

When chunking is enabled, multiple output records are produced - one per chunk.
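
Because each chunk arrives as its own record, a downstream script may need to put the pieces back in order. The following is a minimal sketch, assuming the record shape shown above and that joining chunks with a newline is acceptable. `reassemble_chunks` is a hypothetical helper, not part of Builder.

```python
from typing import Iterable


def reassemble_chunks(records: Iterable[dict], text_key: str = "text_result") -> str:
    """Join chunk records into one string, ordered by chunk_index."""
    ordered = sorted(records, key=lambda r: r["chunk_index"])
    if not ordered:
        return ""
    # total_chunks is assumed to be the same on every record of one document.
    expected = ordered[0]["total_chunks"]
    if len(ordered) != expected:
        raise ValueError(f"expected {expected} chunks, got {len(ordered)}")
    # The newline separator is an assumption; adjust it to your content.
    return "\n".join(r[text_key] for r in ordered)


records = [
    {"text_result": "second part", "chunk_index": 2, "total_chunks": 2},
    {"text_result": "first part", "chunk_index": 1, "total_chunks": 2},
]
print(reassemble_chunks(records))
```

The count check is there so that a missing chunk fails loudly instead of silently producing a shortened document.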

Parameters

Name	Field Name	Type	Required	Default	Description
File Path	`getterTemplate`	Smart text	Optional	Empty	Path or pattern to locate a single file. Supports dynamic values from upstream nodes.
Folder Path	`filesFolderPath`	Smart text	Optional	Empty	Base folder path for processing multiple files in batch.
Remove File After Processing	`removeFileAfterProcessing`	Toggle	Optional	Off	When On, deletes the original file after text extraction is complete. Use with caution in production.
Limit OCR	`limitOcr`	Toggle	Optional	Off	When On, OCR runs only when necessary, which reduces processing time and cost. When Off, OCR runs for all documents, including scanned and handwritten content.
Produce Chunks From PDF	`produceChunksFromPdf`	Toggle	Optional	Off	When On, splits large PDFs into smaller text chunks, producing one output record per chunk.
Output Mode	`outTransformId`	Dropdown	Required	Original + appended result	Choose between appending extracted text to the input row or returning only the extracted text.
Result Property Name	`outColumnName`	Text	Required	`text_result`	The name of the property that holds the extracted text in the output data.

Output Mode Options

Option	Description
Original + appended result	Keeps all upstream data and adds the extracted text as a new property. Useful when downstream nodes need the original file metadata alongside the text.
Result only	Returns only the extracted text, discarding upstream fields. Useful when only the text content is needed for further analysis.
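
To make the difference between the two modes concrete, here is a small sketch of the record shapes a downstream node might see. It only illustrates the descriptions in the table above; it is not the node's actual implementation, and the upstream fields `file_name` and `sender` are hypothetical.

```python
APPEND = "Original + appended result"
RESULT_ONLY = "Result only"


def shape_output(row: dict, extracted_text: str, mode: str = APPEND,
                 result_property: str = "text_result") -> dict:
    """Return an illustrative output record for the given Output Mode."""
    if mode == APPEND:
        # Keep every upstream field and add the extracted text.
        return {**row, result_property: extracted_text}
    if mode == RESULT_ONLY:
        # Drop upstream fields; keep only the extracted text.
        return {result_property: extracted_text}
    raise ValueError(f"unknown output mode: {mode!r}")


upstream = {"file_name": "contract.pdf", "sender": "legal@example.com"}
print(shape_output(upstream, "Extracted text...", APPEND))
print(shape_output(upstream, "Extracted text...", RESULT_ONLY))
```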

Supported File Types

The OCR node can process the following file formats:

PDF (both digital and scanned/image-based)
Images: JPEG, PNG, TIFF, BMP
Microsoft Word documents (.docx)
Plain text files (.txt)

For digital PDFs (where text is already embedded), the node extracts text directly without invoking OCR. For scanned PDFs and image files, OCR is applied to recognize and extract text from the visual content.
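
When a folder may contain other formats, it can help to pre-filter file names before they reach the OCR node. The sketch below assumes the list above maps to the usual file extensions (for example `.jpg`/`.jpeg` for JPEG and `.tif`/`.tiff` for TIFF); that mapping and the helper name `split_by_support` are assumptions, not something the node documents.

```python
from pathlib import Path

# Assumed extension mapping for the supported formats listed above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".docx", ".txt",
}


def split_by_support(paths):
    """Split paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for p in paths:
        # Compare case-insensitively so "SCAN.PDF" is treated like "scan.pdf".
        if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(p)
        else:
            unsupported.append(p)
    return supported, unsupported


ok, skipped = split_by_support(["scan.PDF", "sheet.xlsx", "photo.jpeg", "README"])
print(ok)
print(skipped)
```

Checking the extension only is a cheap first pass; it does not verify that a file's contents actually match its name.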

When to Use OCR vs. Direct Text Extraction

Document Type	Recommended Setting
Digital PDF (searchable text)	Enable Limit OCR to use fast direct extraction
Scanned PDF (image-only)	Leave Limit OCR Off to allow OCR processing
Photos of documents	Leave Limit OCR Off
Word documents (.docx)	Enable Limit OCR - direct extraction is sufficient
Mixed folder (some scanned, some digital)	Leave Limit OCR Off to handle all types

Step-by-Step Configuration

Single File Extraction

Drag the OCR node onto your workflow canvas and connect it to the node that provides the file path.
Click the node to open settings.
Enter the file path in the File Path field. Use {{filePath}} to reference a path from an upstream node.
If the document is a scanned PDF or image, ensure Limit OCR is Off so OCR is applied.
Set Output Mode and Result Property Name.
Save the configuration and run a test with a sample file.

Batch Folder Processing

Set the Folder Path field to the directory containing your documents.
Leave File Path empty when processing a folder.
Configure OCR and chunking settings as needed.
Set Output Mode to Original + appended result to retain file name metadata alongside extracted text.

Processing Large PDFs with Chunking

Set the File Path to the target PDF.
Enable Produce Chunks From PDF.
Set Output Mode to Original + appended result to preserve source file context on each chunk.
Connect the output to a LangChain node or vector storage node for downstream AI processing.

Real-World Use Cases

Legal Document Analysis

Extract contract text for clause identification and analysis by an LLM.

Configuration:

File Path: {{contract_file_path}}
Limit OCR: Off (contracts may be scanned)
Output Mode: Original + appended result
Result Property Name: contract_text

Downstream step: Pass contract_text to an OpenAI GPT node with a prompt that identifies specific clauses.

Invoice Processing Automation

Pull text from scanned invoices to extract line items, totals, and vendor information for accounting automation.

Configuration:

Folder Path: /invoices/incoming/
Limit OCR: Off (invoices are often scanned)
Output Mode: Original + appended result
Result Property Name: invoice_text

Downstream step: Pass extracted text to a Custom Script node that parses invoice fields.
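
As a rough idea of what such a parsing script could look like, here is a minimal Python sketch. It assumes invoices contain lines such as "Invoice No: ..." and "Total Due: $...", which will not hold for every vendor layout; the function name, the regular expressions, and the sample text are all illustrative, and the Custom Script node's own language and API are not shown here.

```python
import re
from decimal import Decimal

# Hypothetical patterns; real invoices vary widely and OCR output can be noisy.
INVOICE_NO_RE = re.compile(
    r"\binvoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)",
    re.IGNORECASE,
)
# \b keeps "Subtotal" from matching as "total".
TOTAL_RE = re.compile(
    r"\btotal(?:\s+due)?\s*[:\-]?\s*\$?\s*([0-9][0-9,]*\.[0-9]{2})",
    re.IGNORECASE,
)


def parse_invoice(invoice_text: str) -> dict:
    """Extract an invoice number and total, or None when not found."""
    number = INVOICE_NO_RE.search(invoice_text)
    total = TOTAL_RE.search(invoice_text)
    return {
        "invoice_number": number.group(1) if number else None,
        # Strip thousands separators before converting to Decimal.
        "total": Decimal(total.group(1).replace(",", "")) if total else None,
    }


sample = (
    "ACME Supplies\n"
    "Invoice No: INV-2041\n"
    "Subtotal: $1,100.00\n"
    "Tax: $134.50\n"
    "Total Due: $1,234.50\n"
)
print(parse_invoice(sample))
```

`Decimal` is used instead of `float` so that currency amounts are not subject to binary rounding. For invoices with inconsistent layouts, passing `invoice_text` to an LLM node may be more robust than regular expressions.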

Research Report Processing

Break down lengthy research reports into chunks for AI analysis and vector database storage.

Configuration:

File Path: {{report_file_path}}
Limit OCR: On (reports are typically digital PDFs)
Produce Chunks From PDF: On
Output Mode: Original + appended result
Result Property Name: report_chunk

Downstream step: Each chunk is stored in a vector database for retrieval-augmented generation.
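
If an intermediate script prepares the chunks for storage, it might look like the sketch below. It assumes each row carries `chunk_index` and `total_chunks` as in the chunked output example, plus a hypothetical upstream `file_name` field preserved by the Original + appended result mode. The `id`/`text`/`metadata` layout is a generic illustration, not the schema of any particular vector database.

```python
def to_vector_records(rows, text_key="report_chunk", source_key="file_name"):
    """Turn chunk rows into generic records with a stable id and metadata."""
    records = []
    for row in rows:
        records.append({
            # A deterministic id makes re-runs overwrite rather than duplicate.
            "id": f"{row[source_key]}#chunk-{row['chunk_index']}",
            "text": row[text_key],
            "metadata": {
                "source": row[source_key],
                "chunk_index": row["chunk_index"],
                "total_chunks": row["total_chunks"],
            },
        })
    return records


rows = [
    {"file_name": "report.pdf", "report_chunk": "Introduction...",
     "chunk_index": 1, "total_chunks": 2},
    {"file_name": "report.pdf", "report_chunk": "Methods...",
     "chunk_index": 2, "total_chunks": 2},
]
for record in to_vector_records(rows):
    print(record["id"])
```

Keeping the source file and chunk position in the metadata makes it possible to cite or re-open the original section when a chunk is retrieved later.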

Troubleshooting

Issue	Likely Cause	Resolution
Output text is empty	File path is incorrect or the file does not exist	Verify the file path value from the upstream node. Check that the file exists at that location.
OCR produces garbled text	Image quality is too low or resolution is insufficient	Use higher-resolution scans (at least 300 DPI) for better OCR accuracy.
Processing is very slow	Large file or OCR running on digital PDFs unnecessarily	Enable Limit OCR for digital PDFs to skip the OCR step and use direct text extraction.
Too many output chunks	Chunk size is very small	The chunk size is determined by internal defaults. If you need finer control over chunk size, pass the extracted text to a LangChain node configured with your target chunk size.
File is deleted unexpectedly	Remove File After Processing is enabled	Disable this toggle if you need to retain files after extraction. Only enable it when the workflow is the sole consumer of the file.
Folder processing skips some files	File types in the folder are not supported	Verify the file formats in the folder against the supported types list and convert or pre-filter files as needed.
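
For the "Too many output chunks" case, one alternative is to disable chunking in the OCR node and split the full text yourself. The sketch below is a plain fixed-size character splitter with overlap, written for illustration; it is not the LangChain node's splitter, and `split_text`, `chunk_size`, and `overlap` are hypothetical names.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
```

Character counts are only a proxy for token counts, so leave some headroom when sizing chunks for an LLM context window.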

Best Practices

File Organization: Use consistent naming conventions and folder structures. Descriptive file names make it easier to trace extracted text back to the source document.
OCR Usage: Enable Limit OCR when processing large volumes of digital PDFs to reduce processing time. Leave it Off whenever scanned documents or photos may be present, as in a mixed folder.
Chunking for AI Workflows: Enable chunking when passing text from large documents to LLM nodes. This helps avoid token limit errors and lets the model focus its responses on specific sections.
Remove File After Processing: Use this toggle only in workflows where files are temporary and you have confirmed no other process needs them afterward. Accidental deletion cannot be undone.
Testing: Always test the node with a representative sample file from each category (digital PDF, scanned PDF, image) before deploying a batch processing workflow.

Related Nodes

Email Receiver - retrieves email attachments that can then be processed by the OCR node.
HTTP Client - downloads files from external URLs that can be passed to the OCR node for extraction.
LangChain - accepts OCR-extracted text and further splits or loads it for AI workflows.
OpenAI GPT - receives extracted text as context for analysis, summarization, or question answering.
