URL Scraper - Synthreo Builder
URL Scraper node for Builder - extract structured text from web pages, online PDFs, and images using built-in OCR, enabling AI agents to reason over live web content in their workflows.
Overview
The URL Scraper node automatically extracts content from web pages and documents by visiting URLs present in your workflow data. This node can scrape text from web pages, process online PDFs, handle images with OCR, and execute custom JavaScript on dynamic sites.
Use this node when your workflow needs to gather content from specific web pages or document URLs - for example, pulling product descriptions from an e-commerce site, extracting article text from news URLs, or reading PDFs hosted on a client portal.
Key Features
- Multiple Content Types: Extract text from web pages, PDFs, and images.
- Batch Processing: Process multiple URLs automatically with configurable error handling.
- Smart Extraction: Choose between clean text or HTML-preserved content.
- Browser Automation: Use a Chrome browser engine for JavaScript-heavy sites that do not render with a standard HTTP request.
- OCR Capabilities: Extract text from images found on web pages.
- Custom JavaScript: Execute advanced scraping logic when needed for complex page interactions.
Inputs
The node reads URLs from a property on the incoming workflow data row. The property that contains the URL is specified in the Property Path field.
Outputs
Extracted page content is added to the workflow data row as a new property, or returned as a standalone result depending on the configured output format.
Output Format (example):
{ "url_scrape_result": "Full extracted text content from the web page..." }
Parameters
URL Source Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Property Path | urlSourceProp | Smart text | Empty | Specifies which property in the workflow data row contains the URL to scrape. Use the path suggestion dropdown to select from available properties. |
| Process PDF Files | processPdfs | Toggle | Off | When On, extracts text content from PDF documents when the URL points to a PDF file. |
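Conceptually, the Property Path lookup is a traversal into the incoming data row. The sketch below illustrates the idea in Python; the function name and the dotted-path syntax are assumptions for illustration, not Builder's actual implementation.

```python
def resolve_property_path(row: dict, path: str):
    """Walk a dotted property path (e.g. 'order.website_url') into a data row.

    Illustrative only: the exact path syntax offered by Builder's
    path-suggestion dropdown may differ from this plain dotted form.
    """
    value = row
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # property missing: nothing for the node to scrape
        value = value[key]
    return value

row = {"order": {"website_url": "https://example.com/product/42"}}
print(resolve_property_path(row, "order.website_url"))  # https://example.com/product/42
```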
Batch Settings Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Batch Processing Mode | batchOption | Dropdown | None | Controls how the node handles multiple URLs and errors. |
| Batch Mode | Description |
|---|---|
| None | Process a single URL or stop the workflow if a URL fails. |
| Iterate to next (if response error) | Skip failed URLs and continue processing remaining ones. Recommended for batch jobs where some URLs may be broken or unavailable. |
| Iterate to next (always) | Process all URLs in sequence regardless of individual results. Useful for systematic collection where you need a result row for every URL, including failures. |
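The three batch modes differ only in how failures are handled. A rough sketch of that control flow (the mode strings are illustrative shorthands, not the node's internal identifiers, and `fetch` stands in for the node's actual scraping call):

```python
def scrape_batch(urls, fetch, mode="none"):
    """Illustrate the three batch modes with a caller-supplied fetch function.

    mode: 'none'             -- stop on the first error (single-URL behaviour)
          'iterate_on_error' -- skip failed URLs, keep the successes
          'iterate_always'   -- emit a result row for every URL, failures included
    """
    results = []
    for url in urls:
        try:
            results.append({"url": url, "result": fetch(url)})
        except Exception as exc:
            if mode == "none":
                raise  # stop the whole workflow on the first failure
            if mode == "iterate_always":
                results.append({"url": url, "result": None, "error": str(exc)})
            # 'iterate_on_error': drop the failure and move to the next URL
    return results
```

Note that `iterate_always` is the only mode that guarantees one output row per input URL, which is why it suits systematic collection.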
Scrape Operation Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Scraping Engine | engineId | Dropdown | HTTP Client | The method used to access and extract content from the web page. |
HTTP Client Options
These settings appear when HTTP Client is selected as the scraping engine.
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Content Extraction Method | scrapeOp | Dropdown | Extract Text (Remove HTML) | Determines whether HTML formatting is preserved in the extracted content. |
| Apply OCR on Images | applyOcrOnImages | Toggle | Off | When On, extracts text from images found on the page using OCR. |
| Remove JavaScript and CSS | stripJsCss | Toggle | On | When On, strips script tags and style elements from the extracted content to reduce noise. |
| Extraction Method | Description |
|---|---|
| Extract Text (Remove HTML) | Returns clean text with all HTML tags removed. Best for AI analysis, database storage, or any downstream processing that expects plain text. |
| Extract Text (Keep HTML) | Preserves HTML tags and structure. Use when you need links, headings, or table structure in the output. |
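As a rough approximation of what Extract Text (Remove HTML) does when combined with Remove JavaScript and CSS, the sketch below keeps text nodes, drops tags, and skips the bodies of script and style elements entirely, using Python's standard-library parser. It is a simplified illustration, not the node's implementation.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Keeping HTML instead (the other extraction method) would simply return the raw markup, preserving links, headings, and table structure for downstream use.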
Chrome Browser (Selenium) Options
These settings appear when Chrome Browser (Selenium) is selected as the scraping engine.
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Wait Condition | webDriverWaitId | Dropdown | None | Tells the browser what to wait for before starting content extraction. |
| Maximum Wait Time | webDriverWaitSeconds | Number | 10 | How long (in seconds) to wait for the wait condition before timing out. Valid range: 1 to 300. |
| Wait For Element | webDriverWaitSelector | Smart text | Empty | CSS selector identifying the page element to wait for (used with element-based wait conditions). |
| JavaScript Execution Mode | webDriverJavaScriptAsync | Toggle | Off | When On, executes custom JavaScript asynchronously. Needed for scripts that require time to complete. |
| Wait Condition | Description |
|---|---|
| None | Begin scraping immediately after the initial page load. |
| Presence of element located | Wait until the specified CSS element exists in the DOM, even if it is not visible yet. |
| Visibility of element located | Wait until the specified element is visible on the page. |
| Element to be clickable | Wait until the specified element is fully interactive. |
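All of the element-based conditions reduce to the same pattern: poll until a predicate holds, or fail once Maximum Wait Time elapses. A pure-Python sketch of that timeout behaviour (this is not Selenium itself, just an illustration of the semantics):

```python
import time

def wait_for(condition, timeout_seconds=10, poll_interval=0.25):
    """Poll until `condition()` is truthy or the timeout elapses.

    Mirrors how a browser wait treats Maximum Wait Time: the condition is
    re-checked on an interval, and timing out raises an error rather than
    silently returning a partially loaded page.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not met within {timeout_seconds}s")
```

In the real node, the predicate corresponds to the chosen Wait Condition evaluated against the Wait For Element selector.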
Output Section
| Parameter | Field Name | Type | Default | Description |
|---|---|---|---|---|
| Output Format | outTransformId | Dropdown | Original with appended result column | Determines what data the workflow receives after scraping. |
| Result Property Name | outColumnName | Text | url_scrape_result | Name of the new property that will contain the scraped content. |
| Output Option | Description |
|---|---|
| Original with appended result column | Keeps all original workflow data and adds the scraped content as a new property. Use when downstream nodes need both the source URL and the extracted text. |
| Return result column only | Returns only the scraped content, removing all other properties. Use when only the content is needed downstream. |
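The two output options amount to a simple transform of the data row. A minimal sketch (the function and mode names here are illustrative, not Builder identifiers):

```python
def apply_output_format(row: dict, scraped: str,
                        out_format: str = "append",
                        result_name: str = "url_scrape_result") -> dict:
    """'append' keeps the original row and adds the result property;
    'result_only' returns just the scraped content."""
    if out_format == "result_only":
        return {result_name: scraped}
    return {**row, result_name: scraped}
```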
Choosing Between HTTP Client and Chrome Browser
| Scenario | Recommended Engine |
|---|---|
| Standard news articles, blog posts, and static pages | HTTP Client |
| Product pages on simple e-commerce sites | HTTP Client |
| Single-page applications (React, Vue, Angular) | Chrome Browser |
| Pages that load content via JavaScript after initial load | Chrome Browser |
| Social media profiles and feeds | Chrome Browser |
| Online PDF documents | HTTP Client with Process PDF Files enabled |
| Sites behind login (with session cookies) | Chrome Browser |
Chrome Browser uses more resources and is slower than HTTP Client. Use it only when HTTP Client fails to return the content you need.
Step-by-Step Configuration
Basic Web Scraping Setup
1. Drag the URL Scraper node onto your workflow canvas and connect it to the node that provides URLs.
2. Click the node to open settings.
3. In the URL Source section, set the Property Path to the field that contains your URLs (for example, `website_url` or `product_link`).
4. If any URLs point to PDF files, enable Process PDF Files.
5. In the Batch Settings section, select Iterate to next (if response error) for robust batch processing.
6. In the Scrape Operation section, select HTTP Client for standard pages.
7. Choose Extract Text (Remove HTML) for AI-ready clean text output.
8. Set a descriptive Result Property Name (for example, `page_content` or `article_text`).
9. Save and run a test with a sample URL.
Chrome Browser Setup for Dynamic Sites
1. Follow steps 1 to 4 above.
2. In the Scrape Operation section, select Chrome Browser (Selenium).
3. If the page loads content dynamically, select an appropriate Wait Condition.
4. Enter a CSS selector in Wait For Element (for example, `.article-content` or `#main-product-details`).
5. Set Maximum Wait Time based on the typical load time of the target site (10 to 30 seconds for most sites).
6. Save and test with a representative URL to verify content is fully loaded before extraction.
Real-World Use Cases
Competitive Pricing Analysis
An online retailer monitors competitor pricing across hundreds of product pages daily.
Configuration:
- Property Path: `competitor_url`
- Engine: HTTP Client
- Extraction Method: Extract Text (Remove HTML)
- Batch Mode: Iterate to next (if response error)
- Result Property Name: `competitor_data`
Outcome: Daily pricing data is collected automatically and passed to a downstream node for comparison and reporting.
Content Research for Reports
A marketing agency gathers article content from industry news sites.
Configuration:
- Property Path: `article_links`
- Engine: Chrome Browser
- Wait Condition: Presence of element located
- Wait For Element: `.article-content`
- Result Property Name: `article_text`
Outcome: Full article text is extracted from each URL, preserving the source link for citation.
PDF Document Extraction from a Client Portal
A legal firm accesses contract PDFs hosted on a secure client portal.
Configuration:
- Property Path: `document_url`
- Process PDF Files: On
- Engine: HTTP Client
- Output Format: Return result column only
- Result Property Name: `contract_text`
Outcome: Contract text is extracted and passed directly to an LLM node for clause analysis.
Troubleshooting
| Issue | Likely Cause | Resolution |
|---|---|---|
| Output is empty | Property Path does not match the URL field name | Check the exact property name in the upstream node output and update the Property Path accordingly. |
| Partial content returned | Page loads content after the initial HTTP response | Switch to Chrome Browser engine and add an appropriate Wait Condition. |
| Rate limiting or IP blocks | Too many requests sent to the target site in a short time | Add a delay node upstream to space out requests, and process URLs in smaller batches. |
| PDF returns no text | URL does not point directly to a PDF or the PDF is scanned | Confirm the URL ends with .pdf and points to the file directly. For scanned PDFs, download the file and use the OCR node instead. |
| Workflow stops on a failed URL | Batch Mode is set to None | Change Batch Mode to Iterate to next (if response error) for batch jobs. |
| JavaScript-rendered content missing with Chrome engine | Wait condition is too short | Increase Maximum Wait Time and use a more specific CSS selector in Wait For Element. |
| Extracted text contains JavaScript or CSS code | Remove JavaScript and CSS toggle is Off | Enable Remove JavaScript and CSS in HTTP Client options. |
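For the "PDF returns no text" row, one quick pre-check you could run upstream is whether the URL path itself ends in `.pdf`. A sketch (the helper name is illustrative, and a passing check still does not guarantee the PDF has a selectable text layer):

```python
from urllib.parse import urlparse

def looks_like_direct_pdf(url: str) -> bool:
    """True when the URL path points directly at a .pdf file.
    Query strings and fragments are ignored on purpose."""
    return urlparse(url).path.lower().endswith(".pdf")
```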
Best Practices
Performance
- Use HTTP Client engine whenever possible. It is significantly faster than Chrome Browser.
- Set appropriate wait times for Chrome Browser - an overly long timeout does not improve results and only adds processing time.
- Process URLs in batches of 100 to 500 at a time for better reliability and easier error diagnosis.
- Use specific CSS selectors in Wait For Element to minimize wait time.
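The batching recommendation above can be applied with a small chunking helper upstream of the node (a generic sketch, not a Builder feature):

```python
def chunked(urls, size=250):
    """Split a URL list into batches (e.g. 100-500 per batch) so failures
    are easier to localise and individual batches can be re-run."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```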
Error Handling
- Always enable Iterate to next (if response error) for batch processing jobs to prevent a single failed URL from stopping the entire workflow.
- Test with a sample of 5 to 10 representative URLs before processing large datasets.
- Monitor workflow execution logs for recurring failure patterns that may indicate a site has changed its structure.
Data Quality
- Use Extract Text (Remove HTML) for any downstream AI processing or database storage to avoid noise from HTML tags.
- Enable OCR on images only when necessary, as it increases processing time.
- Use descriptive Result Property Names to make downstream configuration and debugging easier.
Compliance and Ethics
- Respect the website’s `robots.txt` file and terms of service before scraping.
- Implement reasonable delays between requests to avoid placing excessive load on target servers.
- Only extract publicly accessible information.
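Python's standard library can check robots.txt rules before you queue URLs for scraping. This offline sketch parses a rules body directly; in practice you would fetch the site's live robots.txt first (and also honour any Crawl-delay directive):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/products/42"))  # True (allowed)
print(parser.can_fetch("*", "https://example.com/private/x"))    # False (disallowed)
```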
Related Nodes
- Web Search - for finding URLs to scrape via search engine queries before passing them to this node.
- HTTP Client - for making structured API calls to sites that offer an API, which is preferable to scraping when available.
- OCR - for extracting text from downloaded document files when the URL Scraper alone cannot handle scanned content.
- LangChain - for chunking the extracted text before passing it to an LLM or vector database.