OCR
Purpose
The OCR node converts files (PDFs, images, Word docs, etc.) into readable text using text extraction and OCR (Optical Character Recognition).
This enables workflows to analyze, store, and process content from documents without manual transcription.
📥 Inputs
- File Path (static or dynamic) from previous workflow steps.
- Folder Path for batch document processing.
📤 Outputs
- Extracted text from the input file(s).
- Optional chunked text for large PDFs.
Output Format (example):
{
"text_result": "Extracted document content here..."
}
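A downstream step might read the extracted text along these lines. This is a minimal sketch: the helper name `read_extracted_text` is hypothetical, and it assumes the node output arrives as a JSON object shaped like the example above.

```python
import json


def read_extracted_text(raw_output: str, property_name: str = "text_result") -> str:
    """Return the extracted text from a JSON-encoded node output.

    `property_name` mirrors the Result Property Name parameter; the helper
    itself is hypothetical and not part of the OCR node.
    """
    record = json.loads(raw_output)
    if property_name not in record:
        raise KeyError(f"output has no '{property_name}' property")
    return record[property_name]


sample = '{"text_result": "Extracted document content here..."}'
print(read_extracted_text(sample))
```

If Result Property Name is changed from its default, pass the same name as `property_name`.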
⚙️ Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| File Path (getterTemplate) | Smart text | Optional | Empty | Path or pattern to locate a single file. Supports dynamic values. |
| Folder Path (filesFolderPath) | Smart text | Optional | Empty | Base folder path for multiple files. |
| Remove File After Processing (removeFileAfterProcessing) | Toggle | Optional | Off | Deletes original file after extraction. Use with caution. |
| Limit OCR (limitOcr) | Toggle | Optional | Off | Restricts OCR usage to reduce costs. When off, OCR runs for scanned/handwritten documents. |
| Produce Chunks From PDF (produceChunksFromPdf) | Toggle | Optional | Off | Splits large PDFs into smaller text chunks. |
| Output Mode (outTransformId) | Dropdown | - | Original + appended result | Choose between appending extracted text or returning only the text. |
| Result Property Name (outColumnName) | Text | - | text_result | Name of the property holding extracted text. |
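The two Output Mode choices can be pictured with a small sketch. Everything here is an assumption for illustration: the function `apply_output_mode` is hypothetical, the `append` flag stands in for the dropdown, and the exact shape returned by the text-only mode is not confirmed by this page.

```python
def apply_output_mode(
    original: dict,
    extracted_text: str,
    append: bool = True,
    property_name: str = "text_result",
) -> dict:
    """Illustrate the two Output Mode options (hypothetical helper).

    append=True  -> original record plus the extracted text (the default mode).
    append=False -> only the extracted text, under `property_name`.
    """
    if append:
        # Copy the original so the caller's record is not mutated.
        return {**original, property_name: extracted_text}
    return {property_name: extracted_text}


record = {"file": "contract.pdf"}
print(apply_output_mode(record, "Contract terms and conditions..."))
print(apply_output_mode(record, "Contract terms and conditions...", append=False))
```

Note that in the appended mode, an existing property with the same name as `property_name` would be overwritten in this sketch; the node's actual behavior in that case is not documented here.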
💡 Example Usage
- Legal Documents: Extract contract text for clause analysis.
- Invoices: Pull text from scanned invoices for accounting automation.
- Research Reports: Break down lengthy reports into chunks for AI analysis.
🛠 Best Practices
- File Organization: Use consistent naming and folder structures.
- OCR Usage: Run OCR only for scanned or image-based documents; otherwise turn on Limit OCR to reduce costs.
- Chunking: Use for large PDFs or AI workflows with token limits.
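To make the chunking advice concrete, here is a minimal sketch of splitting extracted text into size-bounded pieces before sending it to a model. It is not the node's own chunking algorithm, which this page does not describe; the function `chunk_text` and the character-based limit `max_chars` are assumptions chosen for illustration.

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most `max_chars` characters on word boundaries.

    A single word longer than `max_chars` is kept whole as its own chunk.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for word in text.split():
        # Account for the joining space when the chunk already holds words.
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra

    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("aa bb cc", max_chars=5))
```

A character limit is only a rough proxy for a token limit; a workflow with a strict token budget would likely need a tokenizer-based count instead.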
🧪 Test Cases
- Given: File = contract.pdf, OCR enabled → Expected: { "text_result": "Contract terms and conditions..." }
- Given: Folder = /invoices/, Limit OCR on → Expected: Extracted text only from digital PDFs, no OCR processing.
- Given: PDF > 50 pages, Produce Chunks enabled → Expected: Multiple text chunks { "chunk_1": "...", "chunk_2": "..." }.
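The expected shapes in these test cases could be checked with something like the sketch below. The helpers `is_text_result` and `is_chunked_result` are hypothetical, and the `chunk_1`, `chunk_2`, ... key naming is taken from the example above rather than from a confirmed output schema.

```python
import re

# Key pattern assumed from the example output: chunk_1, chunk_2, ...
CHUNK_KEY = re.compile(r"^chunk_(\d+)$")


def is_text_result(output: dict, property_name: str = "text_result") -> bool:
    """True if the output holds non-empty text under `property_name`."""
    value = output.get(property_name)
    return isinstance(value, str) and bool(value.strip())


def is_chunked_result(output: dict) -> bool:
    """True if the output holds two or more string chunks numbered 1..n without gaps."""
    indices = []
    for key, value in output.items():
        match = CHUNK_KEY.match(key)
        if match is None or not isinstance(value, str):
            return False
        indices.append(int(match.group(1)))
    indices.sort()
    return len(indices) >= 2 and indices == list(range(1, len(indices) + 1))


print(is_text_result({"text_result": "Contract terms and conditions..."}))
print(is_chunked_result({"chunk_1": "...", "chunk_2": "..."}))
```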