The Page Scraper node fetches web pages and extracts specific content using CSS selectors, making it easy to gather data from websites in a structured way.
How It Works
Page Scraper takes a URL and a CSS selector, then:
- Fetches the HTML content of the page
- Applies your CSS selector to find matching elements
- Extracts the text content from matched elements
- Returns the extracted content as structured data
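The node's internal implementation isn't shown here, but the sequence above is roughly equivalent to the following Python sketch (using requests and BeautifulSoup purely for illustration; the function name, timeout, and join behaviour are assumptions, not the node's actual code):

```python
# Illustrative sketch only -- not the node's actual implementation.
import requests
from bs4 import BeautifulSoup

def scrape(url: str, selector: str) -> dict:
    """Fetch a page, apply a CSS selector, and return extracted text."""
    html = requests.get(url, timeout=30).text           # fetch the HTML
    soup = BeautifulSoup(html, "html.parser")            # parse it
    matches = soup.select(selector)                       # apply the CSS selector
    content = "\n".join(el.get_text(strip=True) for el in matches)  # extract text
    return {
        "out": {"content": content, "selector": selector, "url": url},
        "_input": {"url": url, "selector": selector},
    }

result = scrape("https://example.com/article", "article")
print(result["out"]["content"])
```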
Configuration
Required Inputs
url: The webpage URL to scrape
selector: CSS selector to target content
Examples:
url: https://example.com/article
selector: article
selector: .content
selector: #main-text
selector: div.post-body
Output Structure
{
  "out": {
    "content": "Extracted text content here...",
    "selector": "article",
    "url": "https://example.com/article"
  },
  "_input": {
    "url": "https://example.com/article",
    "selector": "article"
  }
}
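Downstream nodes reference these fields by path, such as out.content. As a purely illustrative sketch of what those paths resolve to (the values are taken from the example above):

```python
# Hypothetical illustration of reading the node's output by field path.
result = {
    "out": {
        "content": "Extracted text content here...",
        "selector": "article",
        "url": "https://example.com/article",
    },
    "_input": {"url": "https://example.com/article", "selector": "article"},
}

text = result["out"]["content"]      # what "out.content" refers to in mappings
source_url = result["out"]["url"]    # useful for provenance when saving results
```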
CSS Selector Examples
Common Selectors
/* Element selectors */
article /* <article> elements */
p /* <p> elements */
div /* <div> elements */
/* Class selectors */
.content /* class="content" */
.post-body /* class="post-body" */
/* ID selectors */
#main /* id="main" */
#article-content /* id="article-content" */
/* Nested selectors */
article p /* <p> inside <article> */
div.content p /* <p> inside <div class="content"> */
/* Multiple elements */
h1, h2, h3 /* All h1, h2, and h3 elements */
Advanced Selectors
/* Attribute selectors */
[data-content] /* Elements with data-content attribute */
a[href*="blog"] /* Links containing "blog" in href */
/* Pseudo-selectors */
p:first-child /* First <p> element */
div:not(.sidebar) /* <div> elements without sidebar class */
/* Combinators */
article > p /* Direct <p> children of <article> */
h1 + p /* <p> immediately after <h1> */
Tip
Test your CSS selectors in your browser's DevTools (right-click → Inspect) to ensure they match the content you want to extract.
Common Usage Patterns
Article Extraction
Page Scraper:
url: https://blog.example.com/post-123
selector: article
Result: Full article content extracted
Multiple Pages
Website Mapper (discover pages)
→ Array Splitter (split URLs)
→ Page Scraper (extract from each)
→ Prompt (analyze content)
→ Dataset Sink (save results)
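The Splitter-plus-Scraper portion of that flow behaves like a loop over the discovered URLs. A rough Python sketch, with made-up URLs and the same illustrative requests/BeautifulSoup approach as above:

```python
# Illustration of scraping each discovered URL and collecting the results.
import requests
from bs4 import BeautifulSoup

urls = [
    "https://blog.example.com/post-1",   # hypothetical URLs from a Website Mapper
    "https://blog.example.com/post-2",
]

records = []
for url in urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    content = "\n".join(el.get_text(strip=True) for el in soup.select("article"))
    records.append({"url": url, "content": content})

# `records` is roughly what a Dataset Sink would then persist.
```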
Structured Data Extraction
For more complex extraction, scrape multiple sections:
Page Scraper 1: selector: h1 (get title)
Page Scraper 2: selector: .author (get author)
Page Scraper 3: selector: article (get content)
→ Combine in Prompt node
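In code terms this is just several selectors applied to the same fetched page. A hedged sketch, assuming the page actually exposes an h1, an .author element, and an article body:

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://blog.example.com/post-123", timeout=30).text,
    "html.parser",
)

# One selector per field, mirroring the three Page Scraper nodes above.
title = soup.select_one("h1")
author = soup.select_one(".author")
body = soup.select_one("article")

record = {
    "title": title.get_text(strip=True) if title else "",
    "author": author.get_text(strip=True) if author else "",
    "content": body.get_text(strip=True) if body else "",
}
```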
Best Practices
Be Specific with Selectors
- Too broad: div (might match many unwanted elements)
- Too specific: div.content.main.primary[data-id="123"] (brittle)
- Just right: article or .post-content
Handle Multiple Matches
If your selector matches multiple elements, Page Scraper concatenates their content. To process items separately, use HTML Text Extractor.
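To illustrate the difference, the sketch below contrasts the single concatenated string (what Page Scraper returns) with a per-item list, which is what you would want when matches need to be processed separately (illustrative only):

```python
from bs4 import BeautifulSoup

html = "<article><p>First paragraph.</p><p>Second paragraph.</p></article>"
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")

# Page Scraper style: all matches concatenated into one string.
concatenated = "\n".join(p.get_text(strip=True) for p in paragraphs)

# Per-item processing keeps each match separate.
items = [p.get_text(strip=True) for p in paragraphs]

print(concatenated)  # one string: "First paragraph.\nSecond paragraph."
print(items)         # two items: ['First paragraph.', 'Second paragraph.']
```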
Test Before Batch Processing
- Run on 1-2 URLs first
- Verify the extracted content is what you expect
- Adjust your selector if needed
- Then process your full list
Warning
Websites can change their structure. Monitor your scraping flows and be prepared to update selectors if sites redesign.
Error Handling
Page Scraper handles common issues:
- Invalid URL: Returns error with details
- Connection timeout: Retries with exponential backoff
- No matching elements: Returns empty content
- 403/404 errors: Logs error and continues flow
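These behaviours are built into the node; purely for illustration, the retry-with-backoff pattern described above looks roughly like this (attempt count and delays are arbitrary example values, not the node's actual settings):

```python
import time
import requests

def fetch_with_backoff(url: str, attempts: int = 3) -> str | None:
    """Retry a request with exponentially increasing delays."""
    delay = 1.0
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()          # surfaces 403/404 as exceptions
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < attempts - 1:
                time.sleep(delay)                # wait before retrying
                delay *= 2                       # exponential backoff
    return None                                  # caller decides how to continue
```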
Performance Considerations
Rate Limiting
When scraping multiple pages:
- Disable parallel iterations to avoid overwhelming servers
- Add delays between requests when possible
- Respect robots.txt and rate limits
- Consider caching results for repeated scrapes
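A minimal sketch of the first three points, assuming sequential processing and an arbitrary example delay and user agent:

```python
import time
from urllib import robotparser
import requests

USER_AGENT = "my-scraper-flow"     # arbitrary example identifier
DELAY_SECONDS = 2.0                # arbitrary politeness delay

# Respect robots.txt before fetching.
robots = robotparser.RobotFileParser()
robots.set_url("https://blog.example.com/robots.txt")
robots.read()

urls = ["https://blog.example.com/post-1", "https://blog.example.com/post-2"]

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue                                   # skip disallowed pages
    requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    time.sleep(DELAY_SECONDS)                      # space out requests
```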
Content Size
Large pages can slow processing:
- Use specific selectors to extract only needed content
- Avoid scraping entire sites if you only need specific sections
- Consider using HTML Text Extractor for cleaner text
Integration with Other Nodes
With Prompts
Page Scraper → Prompt
Map: out.content → Prompt.textToAnalyze
With HTML Text Extractor
Page Scraper (get full article)
→ HTML Text Extractor (clean and structure)
→ Prompt (analyze clean text)
With Datasets
Dataset Source (URLs to scrape)
→ Page Scraper
→ Dataset Sink (save scraped content)