
OCR


Purpose

The OCR node converts files (PDFs, images, Word docs, etc.) into readable text using text extraction and OCR (Optical Character Recognition).

This enables workflows to analyze, store, and process content from documents without manual transcription.

📥 Inputs

  • File Path (static or dynamic) from previous workflow steps.
  • Folder Path for batch document processing.

📤 Outputs

  • Extracted text from the input file(s).
  • Optional chunked text for large PDFs.

Output Format (example):

    {
      "text_result": "Extracted document content here..."
    }
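A downstream step can read the extracted text from this output by its property name. The following is a minimal sketch, assuming the node output arrives as a JSON string; the helper name `extract_text` is hypothetical and not part of the product.

```python
import json


def extract_text(node_output: str, property_name: str = "text_result") -> str:
    """Return the extracted text from an OCR node output serialized as JSON.

    `property_name` mirrors the Result Property Name parameter, which
    defaults to "text_result". The JSON-string input is an assumption.
    """
    record = json.loads(node_output)
    if property_name not in record:
        raise KeyError(f"expected property {property_name!r} in OCR output")
    text = record[property_name]
    if not isinstance(text, str):
        raise TypeError(f"{property_name!r} should hold a string")
    return text


sample = '{"text_result": "Extracted document content here..."}'
print(extract_text(sample))
```

If Result Property Name is changed from its default, pass the same name as `property_name`.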

⚙️ Parameters

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| File Path (getterTemplate) | Smart text | Optional | Empty | Path or pattern to locate a single file. Supports dynamic values. |
| Folder Path (filesFolderPath) | Smart text | Optional | Empty | Base folder path for multiple files. |
| Remove File After Processing (removeFileAfterProcessing) | Toggle | Optional | Off | Deletes original file after extraction. Use with caution. |
| Limit OCR (limitOcr) | Toggle | Optional | Off | Restricts OCR usage to reduce costs. When off, OCR runs for scanned/handwritten documents. |
| Produce Chunks From PDF (produceChunksFromPdf) | Toggle | Optional | Off | Splits large PDFs into smaller text chunks. |
| Output Mode (outTransformId) | Dropdown | ✅ | Original + appended result | Choose between appending extracted text or returning only the text. |
| Result Property Name (outColumnName) | Text | ✅ | text_result | Name of the property holding extracted text. |
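To make the defaults concrete, the parameters can be pictured as a key-value configuration. This is only an illustrative sketch: the dictionary representation and the `build_config` helper are hypothetical, the keys are the identifiers shown in parentheses in the table, and the stored value behind the Output Mode dropdown is not documented here, so the display label stands in for it.

```python
# Defaults as listed in the parameter table. The dict layout is an assumption,
# not the node's actual storage format.
DEFAULTS = {
    "getterTemplate": "",             # File Path
    "filesFolderPath": "",            # Folder Path
    "removeFileAfterProcessing": False,
    "limitOcr": False,
    "produceChunksFromPdf": False,
    "outTransformId": "Original + appended result",  # display label, not a real ID
    "outColumnName": "text_result",
}


def build_config(**overrides):
    """Merge overrides into the defaults and reject unknown or empty required fields."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown OCR parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    # Output Mode and Result Property Name are the two required parameters.
    for required in ("outTransformId", "outColumnName"):
        if not config[required]:
            raise ValueError(f"{required} is required")
    return config


config = build_config(filesFolderPath="/invoices/", limitOcr=True)
print(config["limitOcr"], config["outColumnName"])
```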

💡 Example Usage

  • Legal Documents: Extract contract text for clause analysis.
  • Invoices: Pull text from scanned invoices for accounting automation.
  • Research Reports: Break down lengthy reports into chunks for AI analysis.

📘 Best Practices

  • File Organization: Use consistent naming and folder structures.
  • OCR Usage: Leave Limit OCR off only when handling scanned or image-based documents; turn it on otherwise to reduce costs.
  • Chunking: Use for large PDFs or AI workflows with token limits.
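The idea behind chunking can be sketched as follows. This is not the node's own algorithm, which is not documented here; it is a minimal illustration that packs paragraphs into chunks under a character budget, with the function name `chunk_text` and the `max_chars` limit chosen for the example. Characters are used as a rough stand-in for tokens.

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single paragraph longer than the budget is cut into fixed-size pieces.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The paragraph does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


parts = chunk_text("aaaa\n\nbbbb\n\ncccc", max_chars=10)
print(parts)
```

Breaking at paragraph boundaries keeps related sentences together, which tends to matter more for AI analysis than hitting an exact chunk size.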

🧪 Test Cases

  • Given: File = contract.pdf, OCR enabled → Expected: { "text_result": "Contract terms and conditions..." }
  • Given: Folder = /invoices/, Limit OCR on → Expected: Extracted text only from digital PDFs, no OCR processing.
  • Given: PDF > 50 pages, Produce Chunks enabled → Expected: Multiple text chunks { "chunk_1": "...", "chunk_2": "..." }.
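When checking the chunked case, note that keys such as `chunk_10` sort before `chunk_2` alphabetically. A small sketch of reading the chunks back in numeric order, assuming the `chunk_N` key pattern shown in the expected output above; the helper `ordered_chunks` is hypothetical.

```python
import re


def ordered_chunks(record: dict) -> list[str]:
    """Return the values of chunk_N keys sorted by N, ignoring other keys."""
    pattern = re.compile(r"chunk_(\d+)")
    numbered = []
    for key, value in record.items():
        match = pattern.fullmatch(key)
        if match:
            numbered.append((int(match.group(1)), value))
    # Sort on the numeric suffix so chunk_2 comes before chunk_10.
    return [value for _, value in sorted(numbered, key=lambda pair: pair[0])]


result = {"chunk_10": "last", "chunk_2": "second", "chunk_1": "first", "source": "report.pdf"}
print(ordered_chunks(result))
```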