URL Scraper - Synthreo Builder
URL Scraper node for Builder - extract structured text from web pages, online PDFs, and images using built-in OCR, enabling AI agents to reason over live web content in their workflows.
Overview
The URL Scraper node automatically extracts content from web pages and documents by visiting URLs present in your workflow data. This node can scrape text from web pages, process online PDFs, handle images with OCR, and execute custom JavaScript on dynamic sites.
Use this node when your workflow needs to gather content from specific web pages or document URLs - for example, pulling product descriptions from an e-commerce site, extracting article text from news URLs, or reading PDFs hosted on a client portal.
Key Features
- Multiple Content Types: Extract text from web pages, PDFs, and images.
- Batch Processing: Process multiple URLs automatically with configurable error handling.
- Smart Extraction: Choose between clean text or HTML-preserved content.
- Browser Automation: Use a Chrome browser engine for JavaScript-heavy sites that do not render with a standard HTTP request.
- OCR Capabilities: Extract text from images found on web pages.
- Custom JavaScript: Execute advanced scraping logic when needed for complex page interactions.
Inputs
The node reads URLs from a property on the incoming workflow data row. The property that contains the URL is specified in the Property Path field.
Outputs
Extracted page content is added to the workflow data row as a new property, or returned as a standalone result depending on the configured output format.
Output Format (example):
{ "url_scrape_result": "Full extracted text content from the web page..." }
Parameters
URL Source Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Property Path | urlSourceProp | Smart text | Empty | Specifies which property in the workflow data row contains the URL to scrape. Use the path suggestion dropdown to select from available properties. |
| Process PDF Files | processPdfs | Toggle | Off | When On, extracts text content from PDF documents when the URL points to a PDF file. |
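Conceptually, the Property Path lookup is a traversal into the incoming data row. The sketch below illustrates the idea in Python; the function name and the dotted-path syntax are assumptions for illustration, not Builder's actual implementation.

```python
def resolve_property_path(row: dict, path: str):
    """Walk a dotted property path (e.g. 'order.website_url') into a data row.

    Illustrative only: the exact path syntax offered by Builder's
    path-suggestion dropdown may differ from this plain dotted form.
    """
    value = row
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # property missing: nothing for the node to scrape
        value = value[key]
    return value

row = {"order": {"website_url": "https://example.com/product/42"}}
print(resolve_property_path(row, "order.website_url"))  # https://example.com/product/42
```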
Batch Settings Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Batch Processing Mode | batchOption | Dropdown | None | Controls how the node handles multiple URLs and errors. |
| Batch Mode | Description |
|---|---|
| None | Process a single URL or stop the workflow if a URL fails. |
| Iterate to next (if response error) | Skip failed URLs and continue processing remaining ones. Recommended for batch jobs where some URLs may be broken or unavailable. |
| Iterate to next (always) | Process all URLs in sequence regardless of individual results. Useful for systematic collection where you need a result row for every URL, including failures. |
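The three batch modes differ only in how failures are handled. A rough sketch of that control flow (the mode strings are illustrative shorthands, not the node's internal identifiers, and `fetch` stands in for the node's actual scraping call):

```python
def scrape_batch(urls, fetch, mode="none"):
    """Illustrate the three batch modes with a caller-supplied fetch function.

    mode: 'none'             -- stop on the first error (single-URL behaviour)
          'iterate_on_error' -- skip failed URLs, keep the successes
          'iterate_always'   -- emit a result row for every URL, failures included
    """
    results = []
    for url in urls:
        try:
            results.append({"url": url, "result": fetch(url)})
        except Exception as exc:
            if mode == "none":
                raise  # stop the whole workflow on the first failure
            if mode == "iterate_always":
                results.append({"url": url, "result": None, "error": str(exc)})
            # 'iterate_on_error': drop the failure and move to the next URL
    return results
```

Note that `iterate_always` is the only mode that guarantees one output row per input URL, which is why it suits systematic collection.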
Scrape Operation Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Scraping Engine | engineId | Dropdown | HTTP Client | The method used to access and extract content from the web page. |
HTTP Client Options
These settings appear when HTTP Client is selected as the scraping engine.
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Content Extraction Method | scrapeOp | Dropdown | Extract Text (Remove HTML) | Determines whether HTML formatting is preserved in the extracted content. |
| Apply OCR on Images | applyOcrOnImages | Toggle | Off | When On, extracts text from images found on the page using OCR. |
| Remove JavaScript and CSS | stripJsCss | Toggle | On | When On, strips script tags and style elements from the extracted content to reduce noise. |
| Extraction Method | Description |
|---|---|
| Extract Text (Remove HTML) | Returns clean text with all HTML tags removed. Best for AI analysis, database storage, or any downstream processing that expects plain text. |
| Extract Text (Keep HTML) | Preserves HTML tags and structure. Use when you need links, headings, or table structure in the output. |
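As a rough approximation of what Extract Text (Remove HTML) does when combined with Remove JavaScript and CSS, the sketch below keeps text nodes, drops tags, and skips the bodies of script and style elements entirely, using Python's standard-library parser. It is a simplified illustration, not the node's implementation.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Keeping HTML instead (the other extraction method) would simply return the raw markup, preserving links, headings, and table structure for downstream use.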
Chrome Browser (Selenium) Options
These settings appear when Chrome Browser (Selenium) is selected as the scraping engine.
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Wait Condition | webDriverWaitId | Dropdown | None | Tells the browser what to wait for before starting content extraction. |
| Maximum Wait Time | webDriverWaitSeconds | Number | 10 | How long (in seconds) to wait for the wait condition before timing out. Valid range: 1 to 300. |
| Wait For Element | webDriverWaitSelector | Smart text | Empty | CSS selector identifying the page element to wait for (used with element-based wait conditions). |
| JavaScript Execution Mode | webDriverJavaScriptAsync | Toggle | Off | When On, executes custom JavaScript asynchronously. Needed for scripts that require time to complete. |
| Wait Condition | Description |
|---|---|
| None | Begin scraping immediately after the initial page load. |
| Presence of element located | Wait until the specified CSS element exists in the DOM, even if it is not visible yet. |
| Visibility of element located | Wait until the specified element is visible on the page. |
| Element to be clickable | Wait until the specified element is fully interactive. |
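All of the element-based conditions reduce to the same pattern: poll until a predicate holds, or fail once Maximum Wait Time elapses. A pure-Python sketch of that timeout behaviour (this is not Selenium itself, just an illustration of the semantics):

```python
import time

def wait_for(condition, timeout_seconds=10, poll_interval=0.25):
    """Poll until `condition()` is truthy or the timeout elapses.

    Mirrors how a browser wait treats Maximum Wait Time: the condition is
    re-checked on an interval, and timing out raises an error rather than
    silently returning a partially loaded page.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not met within {timeout_seconds}s")
```

In the real node, the predicate corresponds to the chosen Wait Condition evaluated against the Wait For Element selector.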
Output Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Output Format | outTransformId | Dropdown | Original with appended result column | Determines what data the workflow receives after scraping. |
| Result Property Name | outColumnName | Text | url_scrape_result | Name of the new property that will contain the scraped content. |
| Output Option | Description |
|---|---|
| Original with appended result column | Keeps all original workflow data and adds the scraped content as a new property. Use when downstream nodes need both the source URL and the extracted text. |
| Return result column only | Returns only the scraped content, removing all other properties. Use when only the content is needed downstream. |
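The two output options amount to a simple transform of the data row. A minimal sketch (the function and mode names here are illustrative, not Builder identifiers):

```python
def apply_output_format(row: dict, scraped: str,
                        out_format: str = "append",
                        result_name: str = "url_scrape_result") -> dict:
    """'append' keeps the original row and adds the result property;
    'result_only' returns just the scraped content."""
    if out_format == "result_only":
        return {result_name: scraped}
    return {**row, result_name: scraped}
```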
Choosing Between HTTP Client and Chrome Browser
| Scenario | Recommended Engine |
|---|---|
| Standard news articles, blog posts, and static pages | HTTP Client |
| Product pages on simple e-commerce sites | HTTP Client |
| Single-page applications (React, Vue, Angular) | Chrome Browser |
| Pages that load content via JavaScript after initial load | Chrome Browser |
| Social media profiles and feeds | Chrome Browser |
| Online PDF documents | HTTP Client with Process PDF Files enabled |
| Sites behind login (with session cookies) | Chrome Browser |
Chrome Browser uses more resources and is slower than HTTP Client. Use it only when HTTP Client fails to return the content you need.
Step-by-Step Configuration
Basic Web Scraping Setup
1. Drag the URL Scraper node onto your workflow canvas and connect it to the node that provides URLs.
2. Click the node to open settings.
3. In the URL Source section, set the Property Path to the field that contains your URLs (for example, `website_url` or `product_link`).
4. If any URLs point to PDF files, enable Process PDF Files.
5. In the Batch Settings section, select Iterate to next (if response error) for robust batch processing.
6. In the Scrape Operation section, select HTTP Client for standard pages.
7. Choose Extract Text (Remove HTML) for AI-ready clean text output.
8. Set a descriptive Result Property Name (for example, `page_content` or `article_text`).
9. Save and run a test with a sample URL.
Chrome Browser Setup for Dynamic Sites
1. Follow steps 1 to 4 above.
2. In the Scrape Operation section, select Chrome Browser (Selenium).
3. If the page loads content dynamically, select an appropriate Wait Condition.
4. Enter a CSS selector in Wait For Element (for example, `.article-content` or `#main-product-details`).
5. Set Maximum Wait Time based on the typical load time of the target site (10 to 30 seconds for most sites).
6. Save and test with a representative URL to verify content is fully loaded before extraction.
Real-World Use Cases
Competitive Pricing Analysis
An online retailer monitors competitor pricing across hundreds of product pages daily.
Configuration:
- Property Path: `competitor_url`
- Engine: HTTP Client
- Extraction Method: Extract Text (Remove HTML)
- Batch Mode: Iterate to next (if response error)
- Result Property Name: `competitor_data`
Outcome: Daily pricing data is collected automatically and passed to a downstream node for comparison and reporting.
Content Research for Reports
A marketing agency gathers article content from industry news sites.
Configuration:
- Property Path: `article_links`
- Engine: Chrome Browser
- Wait Condition: Presence of element located
- Wait For Element: `.article-content`
- Result Property Name: `article_text`
Outcome: Full article text is extracted from each URL, preserving the source link for citation.
PDF Document Extraction from a Client Portal
A legal firm accesses contract PDFs hosted on a secure client portal.
Configuration:
- Property Path: `document_url`
- Process PDF Files: On
- Engine: HTTP Client
- Output Format: Return result column only
- Result Property Name: `contract_text`
Outcome: Contract text is extracted and passed directly to an LLM node for clause analysis.
Troubleshooting
| Issue | Likely Cause | Resolution |
|---|---|---|
| Output is empty | Property Path does not match the URL field name | Check the exact property name in the upstream node output and update the Property Path accordingly. |
| Partial content returned | Page loads content after the initial HTTP response | Switch to Chrome Browser engine and add an appropriate Wait Condition. |
| Rate limiting or IP blocks | Too many requests sent to the target site in a short time | Add a delay node upstream to space out requests, and process URLs in smaller batches. |
| PDF returns no text | URL does not point directly to a PDF or the PDF is scanned | Confirm the URL ends with .pdf and points to the file directly. For scanned PDFs, download the file and use the OCR node instead. |
| Workflow stops on a failed URL | Batch Mode is set to None | Change Batch Mode to Iterate to next (if response error) for batch jobs. |
| JavaScript-rendered content missing with Chrome engine | Wait condition is too short | Increase Maximum Wait Time and use a more specific CSS selector in Wait For Element. |
| Extracted text contains JavaScript or CSS code | Remove JavaScript and CSS toggle is Off | Enable Remove JavaScript and CSS in HTTP Client options. |
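For the "PDF returns no text" row, one quick pre-check you could run upstream is whether the URL path itself ends in `.pdf`. A sketch (the helper name is illustrative, and a passing check still does not guarantee the PDF has a selectable text layer):

```python
from urllib.parse import urlparse

def looks_like_direct_pdf(url: str) -> bool:
    """True when the URL path points directly at a .pdf file.
    Query strings and fragments are ignored on purpose."""
    return urlparse(url).path.lower().endswith(".pdf")
```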
Best Practices
Performance
- Use HTTP Client engine whenever possible. It is significantly faster than Chrome Browser.
- Set appropriate wait times for Chrome Browser - an overly long timeout does not improve results and only adds processing time.
- Process URLs in batches of 100 to 500 at a time for better reliability and easier error diagnosis.
- Use specific CSS selectors in Wait For Element to minimize wait time.
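The batching recommendation above can be applied with a small chunking helper upstream of the node (a generic sketch, not a Builder feature):

```python
def chunked(urls, size=250):
    """Split a URL list into batches (e.g. 100-500 per batch) so failures
    are easier to localise and individual batches can be re-run."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```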
Error Handling
- Always enable Iterate to next (if response error) for batch processing jobs to prevent a single failed URL from stopping the entire workflow.
- Test with a sample of 5 to 10 representative URLs before processing large datasets.
- Monitor workflow execution logs for recurring failure patterns that may indicate a site has changed its structure.
Data Quality
- Use Extract Text (Remove HTML) for any downstream AI processing or database storage to avoid noise from HTML tags.
- Enable OCR on images only when necessary, as it increases processing time.
- Use descriptive Result Property Names to make downstream configuration and debugging easier.
Compliance and Ethics
- Respect the website’s `robots.txt` file and terms of service before scraping.
- Implement reasonable delays between requests to avoid placing excessive load on target servers.
- Only extract publicly accessible information.
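Python's standard library can check robots.txt rules before you queue URLs for scraping. This offline sketch parses a rules body directly; in practice you would fetch the site's live robots.txt first (and also honour any Crawl-delay directive):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/products/42"))  # True (allowed)
print(parser.can_fetch("*", "https://example.com/private/x"))    # False (disallowed)
```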
Related Nodes
- Web Search - for finding URLs to scrape via search engine queries before passing them to this node.
- HTTP Client - for making structured API calls to sites that offer an API, which is preferable to scraping when available.
- OCR - for extracting text from downloaded document files when the URL Scraper alone cannot handle scanned content.
- LangChain - for chunking the extracted text before passing it to an LLM or vector database.