
URL Scraper - Synthreo Builder


The URL Scraper node automatically extracts content from web pages and documents by visiting URLs present in your workflow data. This node can scrape text from web pages, process online PDFs, handle images with OCR, and execute custom JavaScript on dynamic sites.

Use this node when your workflow needs to gather content from specific web pages or document URLs - for example, pulling product descriptions from an e-commerce site, extracting article text from news URLs, or reading PDFs hosted on a client portal.


  • Multiple Content Types: Extract text from web pages, PDFs, and images.
  • Batch Processing: Process multiple URLs automatically with configurable error handling.
  • Smart Extraction: Choose between clean text or HTML-preserved content.
  • Browser Automation: Use a Chrome browser engine for JavaScript-heavy sites that do not render with a standard HTTP request.
  • OCR Capabilities: Extract text from images found on web pages.
  • Custom JavaScript: Execute advanced scraping logic when needed for complex page interactions.

The node reads URLs from a property on the incoming workflow data row. The property that contains the URL is specified in the Property Path field.
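Conceptually, Property Path resolution is a path lookup against the incoming row. The sketch below is illustrative only (the function name and dotted-path syntax are assumptions, not the node's actual implementation):

```python
def resolve_property_path(row: dict, path: str):
    """Walk a dotted property path (e.g. "product.website_url") through a
    workflow data row and return the value, or None if any segment is
    missing. Hypothetical sketch, not the product's internal code."""
    value = row
    for segment in path.split("."):
        if not isinstance(value, dict) or segment not in value:
            return None
        value = value[segment]
    return value


row = {"product": {"website_url": "https://example.com/item/42"}}
print(resolve_property_path(row, "product.website_url"))
```

If the path does not match any property on the row, the node has no URL to visit, which is the most common cause of empty output (see Troubleshooting below).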


Extracted page content is added to the workflow data row as a new property, or returned as a standalone result depending on the configured output format.

Output Format (example):

{
"url_scrape_result": "Full extracted text content from the web page..."
}

| Parameter | Field Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Property Path | urlSourceProp | Smart text | Empty | Specifies which property in the workflow data row contains the URL to scrape. Use the path suggestion dropdown to select from available properties. |
| Process PDF Files | processPdfs | Toggle | Off | When On, extracts text content from PDF documents when the URL points to a PDF file. |
| Parameter | Field Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Batch Processing Mode | batchOption | Dropdown | None | Controls how the node handles multiple URLs and errors. |

| Batch Mode | Description |
| --- | --- |
| None | Process a single URL, or stop the workflow if a URL fails. |
| Iterate to next (if response error) | Skip failed URLs and continue processing the remaining ones. Recommended for batch jobs where some URLs may be broken or unavailable. |
| Iterate to next (always) | Process all URLs in sequence regardless of individual results. Useful for systematic collection where you need a result row for every URL, including failures. |
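The three batch modes can be sketched as a loop over URLs with different error handling. Everything here (the function, the mode names, the error shape) is a hypothetical illustration of the behavior described above, not the node's internals:

```python
def scrape_batch(urls, scrape, batch_mode="none"):
    """Sketch of the three batch modes. `scrape` is any callable that
    returns extracted text or raises on failure."""
    results = []
    for url in urls:
        try:
            results.append({"url": url, "result": scrape(url)})
        except Exception as err:
            if batch_mode == "none":
                raise  # stop the whole workflow on the first failure
            if batch_mode == "iterate_on_error":
                continue  # skip the failed URL, keep going
            # "iterate_always": keep a result row for every URL, even failures
            results.append({"url": url, "result": None, "error": str(err)})
    return results
```

Note the practical difference: "iterate (if response error)" drops failed rows entirely, while "iterate (always)" preserves a row per input URL so downstream nodes can account for failures.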
| Parameter | Field Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Scraping Engine | engineId | Dropdown | HTTP Client | The method used to access and extract content from the web page. |

These settings appear when HTTP Client is selected as the scraping engine.

| Parameter | Field Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Content Extraction Method | scrapeOp | Dropdown | Extract Text (Remove HTML) | Determines whether HTML formatting is preserved in the extracted content. |
| Apply OCR on Images | applyOcrOnImages | Toggle | Off | When On, extracts text from images found on the page using OCR. |
| Remove JavaScript and CSS | stripJsCss | Toggle | On | When On, strips script tags and style elements from the extracted content to reduce noise. |

| Extraction Method | Description |
| --- | --- |
| Extract Text (Remove HTML) | Returns clean text with all HTML tags removed. Best for AI analysis, database storage, or any downstream processing that expects plain text. |
| Extract Text (Keep HTML) | Preserves HTML tags and structure. Use when you need links, headings, or table structure in the output. |
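As a rough illustration of what "Extract Text (Remove HTML)" combined with "Remove JavaScript and CSS" produces, here is a minimal sketch using Python's standard-library HTML parser. This is not the node's actual implementation, only a demonstration of the idea:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Drops all tags and skips the contents of <script> and <style>
    elements entirely, keeping only visible text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


html = "<html><head><style>p{color:red}</style></head><body><h1>Title</h1><script>var x=1;</script><p>Body text.</p></body></html>"
print(extract_text(html))  # Title Body text.
```

With "Extract Text (Keep HTML)" the raw markup would be returned instead, preserving links and structure for downstream parsing.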

These settings appear when Chrome Browser (Selenium) is selected as the scraping engine.

| Parameter | Field Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Wait Condition | webDriverWaitId | Dropdown | None | Tells the browser what to wait for before starting content extraction. |
| Maximum Wait Time | webDriverWaitSeconds | Number | 10 | How long (in seconds) to wait for the wait condition before timing out. Valid range: 1 to 300. |
| Wait For Element | webDriverWaitSelector | Smart text | Empty | CSS selector identifying the page element to wait for (used with element-based wait conditions). |
| JavaScript Execution Mode | webDriverJavaScriptAsync | Toggle | Off | When On, executes custom JavaScript asynchronously. Needed for scripts that require time to complete. |

| Wait Condition | Description |
| --- | --- |
| None | Begin scraping immediately after the initial page load. |
| Presence of element located | Wait until the specified CSS element exists in the DOM, even if it is not visible yet. |
| Visibility of element located | Wait until the specified element is visible on the page. |
| Element to be clickable | Wait until the specified element is fully interactive. |
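All element-based wait conditions boil down to the same pattern: poll the page until a predicate holds or the maximum wait time elapses. A generic sketch of that pattern, independent of any browser-automation library:

```python
import time


def wait_until(condition, timeout_seconds=10, poll_interval=0.5):
    """Poll `condition` (any callable returning truthy when satisfied,
    e.g. "element is present in the DOM") until it succeeds or the
    timeout elapses. Illustrative sketch of how wait conditions work."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not met within {timeout_seconds}s")
```

The difference between "presence", "visibility", and "clickable" is only in how strict the predicate is; the timeout and polling behavior are the same.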
| Parameter | Field Name | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Output Format | outTransformId | Dropdown | Original with appended result column | Determines what data the workflow receives after scraping. |
| Result Property Name | outColumnName | Text | url_scrape_result | Name of the new property that will contain the scraped content. |

| Output Option | Description |
| --- | --- |
| Original with appended result column | Keeps all original workflow data and adds the scraped content as a new property. Use when downstream nodes need both the source URL and the extracted text. |
| Return result column only | Returns only the scraped content, removing all other properties. Use when only the content is needed downstream. |
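The two output options amount to a simple row transform. The function and parameter names below are illustrative, not the product's API:

```python
def apply_output_format(row: dict, scraped: str,
                        mode="append", column="url_scrape_result"):
    """Sketch of the two Output Format options:
    "append"      -> original row plus a new result property
    "result_only" -> a row containing only the result property"""
    if mode == "append":
        return {**row, column: scraped}  # original data is preserved
    return {column: scraped}             # everything else is dropped
```

For example, `apply_output_format({"url": "https://example.com"}, "page text")` keeps the source URL alongside the extracted text, while `mode="result_only"` would discard it.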

Choosing Between HTTP Client and Chrome Browser

| Scenario | Recommended Engine |
| --- | --- |
| Standard news articles, blog posts, and static pages | HTTP Client |
| Product pages on simple e-commerce sites | HTTP Client |
| Single-page applications (React, Vue, Angular) | Chrome Browser |
| Pages that load content via JavaScript after initial load | Chrome Browser |
| Social media profiles and feeds | Chrome Browser |
| Online PDF documents | HTTP Client with Process PDF Files enabled |
| Sites behind login (with session cookies) | Chrome Browser |

Chrome Browser uses more resources and is slower than HTTP Client. Use it only when HTTP Client fails to return the content you need.
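That guidance can be expressed as a try-cheap-first fallback. The sketch below assumes two caller-supplied scraping callables and a crude minimum-length heuristic for "HTTP Client failed to return the content"; both are illustrative assumptions:

```python
def scrape_with_fallback(url, http_scrape, browser_scrape, min_length=200):
    """Try the fast HTTP engine first; fall back to the browser engine
    only when the HTTP result errors out or looks too thin to be the
    real page content. Returns (text, engine_used)."""
    try:
        text = http_scrape(url)
        if text and len(text) >= min_length:
            return text, "http"
    except Exception:
        pass  # treat any HTTP-engine failure as a reason to fall back
    return browser_scrape(url), "browser"
```

In Builder this decision is made per node configuration rather than per request, so the practical equivalent is: test with HTTP Client first, and only switch the node to Chrome Browser if the results are empty or incomplete.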


Setting Up HTTP Client Scraping

  1. Drag the URL Scraper node onto your workflow canvas and connect it to the node that provides URLs.
  2. Click the node to open settings.
  3. In the URL Source section, set the Property Path to the field that contains your URLs (for example, website_url or product_link).
  4. If any URLs point to PDF files, enable Process PDF Files.
  5. In the Batch Settings section, select Iterate to next (if response error) for robust batch processing.
  6. In the Scrape Operation section, select HTTP Client for standard pages.
  7. Choose Extract Text (Remove HTML) for AI-ready clean text output.
  8. Set a descriptive Result Property Name (for example, page_content or article_text).
  9. Save and run a test with a sample URL.
Setting Up Chrome Browser Scraping

  1. Follow steps 1 to 4 above.
  2. In the Scrape Operation section, select Chrome Browser (Selenium).
  3. If the page loads content dynamically, select an appropriate Wait Condition.
  4. Enter a CSS selector in Wait For Element (for example, .article-content or #main-product-details).
  5. Set Maximum Wait Time based on the typical load time of the target site (10 to 30 seconds for most sites).
  6. Save and test with a representative URL to verify content is fully loaded before extraction.

Competitor Price Monitoring

An online retailer monitors competitor pricing across hundreds of product pages daily.

Configuration:

  • Property Path: competitor_url
  • Engine: HTTP Client
  • Extraction Method: Extract Text (Remove HTML)
  • Batch Mode: Iterate to next (if response error)
  • Result Property Name: competitor_data

Outcome: Daily pricing data is collected automatically and passed to a downstream node for comparison and reporting.

News Article Collection

A marketing agency gathers article content from industry news sites.

Configuration:

  • Property Path: article_links
  • Engine: Chrome Browser
  • Wait Condition: Presence of element located
  • Wait For Element: .article-content
  • Result Property Name: article_text

Outcome: Full article text is extracted from each URL, preserving the source link for citation.

PDF Document Extraction from a Client Portal


A legal firm accesses contract PDFs hosted on a secure client portal.

Configuration:

  • Property Path: document_url
  • Process PDF Files: On
  • Engine: HTTP Client
  • Output Format: Return result column only
  • Result Property Name: contract_text

Outcome: Contract text is extracted and passed directly to an LLM node for clause analysis.


| Issue | Likely Cause | Resolution |
| --- | --- | --- |
| Output is empty | Property Path does not match the URL field name | Check the exact property name in the upstream node output and update the Property Path accordingly. |
| Partial content returned | Page loads content after the initial HTTP response | Switch to the Chrome Browser engine and add an appropriate Wait Condition. |
| Rate limiting or IP blocks | Too many requests sent to the target site in a short time | Add a delay node upstream to space out requests, and process URLs in smaller batches. |
| PDF returns no text | URL does not point directly to a PDF, or the PDF is scanned | Confirm the URL ends with .pdf and points to the file directly. For scanned PDFs, download the file and use the OCR node instead. |
| Workflow stops on a failed URL | Batch Mode is set to None | Change Batch Mode to Iterate to next (if response error) for batch jobs. |
| JavaScript-rendered content missing with Chrome engine | Wait condition is too short | Increase Maximum Wait Time and use a more specific CSS selector in Wait For Element. |
| Extracted text contains JavaScript or CSS code | Remove JavaScript and CSS toggle is Off | Enable Remove JavaScript and CSS in the HTTP Client options. |
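The rate-limiting fix (spacing out requests) can be sketched as a minimal throttle that guarantees a minimum interval between consecutive requests. This is an illustration of the pattern, not a built-in Builder component:

```python
import time


class RequestThrottle:
    """Enforce at least `min_interval` seconds between calls to wait().
    Call wait() immediately before each outgoing request."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

In a workflow, the same effect is achieved by placing a delay node upstream of the URL Scraper, as the table suggests.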

  • Use HTTP Client engine whenever possible. It is significantly faster than Chrome Browser.
  • Set appropriate wait times for Chrome Browser - longer waits are not always better, and they add processing time unnecessarily.
  • Process URLs in batches of 100 to 500 at a time for better reliability and easier error diagnosis.
  • Use specific CSS selectors in Wait For Element to minimize wait time.
  • Always enable Iterate to next (if response error) for batch processing jobs to prevent a single failed URL from stopping the entire workflow.
  • Test with a sample of 5 to 10 representative URLs before processing large datasets.
  • Monitor workflow execution logs for recurring failure patterns that may indicate a site has changed its structure.
  • Use Extract Text (Remove HTML) for any downstream AI processing or database storage to avoid noise from HTML tags.
  • Enable OCR on images only when necessary, as it increases processing time.
  • Use descriptive Result Property Names to make downstream configuration and debugging easier.
  • Respect the website’s robots.txt file and terms of service before scraping.
  • Implement reasonable delays between requests to avoid placing excessive load on target servers.
  • Only extract publicly accessible information.
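A robots.txt pre-check can be done with Python's standard library. The sketch below parses rules from a string so it runs offline; in practice you would fetch the site's /robots.txt first:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url_path: str,
                      user_agent: str = "*") -> bool:
    """Return True if the given user agent may fetch url_path under
    the supplied robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url_path)


rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "/public/page.html"))    # True
print(allowed_by_robots(rules, "/private/report.pdf"))  # False
```

Running this check before queuing URLs into the scraper is a lightweight way to honor the first guideline above.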

  • Web Search - for finding URLs to scrape via search engine queries before passing them to this node.
  • HTTP Client - for making structured API calls to sites that offer an API, which is preferable to scraping when available.
  • OCR - for extracting text from downloaded document files when the URL Scraper alone cannot handle scanned content.
  • LangChain - for chunking the extracted text before passing it to an LLM or vector database.