The Page Scraper node fetches web pages and extracts specific content using CSS selectors, making it easy to gather data from websites in a structured way.

How It Works

Page Scraper takes a URL and a CSS selector, then:

  • Fetches the HTML content of the page
  • Applies your CSS selector to find matching elements
  • Extracts the text content from matched elements
  • Returns the extracted content as structured data
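
The steps above can be sketched in plain Python. This is an illustrative approximation of the node's behavior, not its actual implementation: it uses only the standard library, so the "selector" here is just a tag name rather than a full CSS selector, and the fetch step is assumed to have already produced the HTML string.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collects the text inside every occurrence of a given tag."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while inside a matching element
        self.matches = []   # text of each matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.matches.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.matches[-1] += data

def scrape(html, tag):
    """Mimic the node: apply the 'selector', extract text, return structured data."""
    parser = TagTextExtractor(tag)
    parser.feed(html)
    return {"content": "\n".join(m.strip() for m in parser.matches),
            "selector": tag}

html = "<html><body><article>Hello <b>world</b></article></body></html>"
print(scrape(html, "article")["content"])  # Hello world
```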

Configuration

Required Inputs

url: The webpage URL to scrape
selector: CSS selector to target content

Examples:
  url: https://example.com/article
  selector: article
  selector: .content
  selector: #main-text
  selector: div.post-body

Output Structure

{
  "out": {
    "content": "Extracted text content here...",
    "selector": "article",
    "url": "https://example.com/article"
  },
  "_input": {
    "url": "https://example.com/article",
    "selector": "article"
  }
}

CSS Selector Examples

Common Selectors

/* Element selectors */
article                 /* <article> elements */
p                       /* <p> elements */
div                     /* <div> elements */

/* Class selectors */
.content                /* class="content" */
.post-body              /* class="post-body" */

/* ID selectors */
#main                   /* id="main" */
#article-content        /* id="article-content" */

/* Nested selectors */
article p               /* <p> inside <article> */
div.content p           /* <p> inside <div class="content"> */

/* Multiple elements */
h1, h2, h3              /* All h1, h2, and h3 elements */

Advanced Selectors

/* Attribute selectors */
[data-content]          /* Elements with data-content attribute */
a[href*="blog"]         /* Links containing "blog" in href */

/* Pseudo-selectors */
p:first-child           /* First <p> element */
div:not(.sidebar)       /* <div> elements without sidebar class */

/* Combinators */
article > p             /* Direct <p> children of <article> */
h1 + p                  /* <p> immediately after <h1> */

Tip
Test your CSS selectors in your browser's DevTools (right-click → Inspect) to ensure they match the content you want to extract.

Common Usage Patterns

Article Extraction

Page Scraper:
  url: https://blog.example.com/post-123
  selector: article

Result: Full article content extracted

Multiple Pages

Website Mapper (discover pages)
  → Array Splitter (split URLs)
  → Page Scraper (extract from each)
  → Prompt (analyze content)
  → Dataset Sink (save results)

Structured Data Extraction

For more complex extraction, scrape multiple sections:

Page Scraper 1: selector: h1 (get title)
Page Scraper 2: selector: .author (get author)
Page Scraper 3: selector: article (get content)
  → Combine in Prompt node
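
Downstream of those three scrapes, the combining step amounts to merging the `content` fields into one record. A minimal sketch, using hypothetical values standing in for the three nodes' outputs:

```python
# Hypothetical outputs of three Page Scraper runs against the same URL.
title = {"content": "My Post", "selector": "h1"}
author = {"content": "J. Doe", "selector": ".author"}
body = {"content": "Full article text...", "selector": "article"}

# Combine the sections into one structured record before prompting.
record = {
    "title": title["content"],
    "author": author["content"],
    "body": body["content"],
}
print(record["title"])  # My Post
```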

Best Practices

Be Specific with Selectors

  • Too broad: div (might match many unwanted elements)
  • Too specific: div.content.main.primary[data-id="123"] (brittle)
  • Just right: article or .post-content

Handle Multiple Matches

If your selector matches multiple elements, Page Scraper concatenates their content. To process items separately, use HTML Text Extractor.
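
The concatenation behavior can be pictured like this (the newline separator here is an assumption; check your actual output):

```python
# Suppose the selector "p" matched three elements on the page.
matched_texts = ["First paragraph.", "Second paragraph.", "Third paragraph."]

# Page Scraper returns a single concatenated string, not a list,
# so per-item processing needs a different node downstream.
content = "\n".join(matched_texts)
print(content)
```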

Test Before Batch Processing

  • Run on 1-2 URLs first
  • Verify the extracted content is what you expect
  • Adjust your selector if needed
  • Then process your full list

Warning
Websites can change their structure. Monitor your scraping flows and be prepared to update selectors if sites redesign.

Error Handling

Page Scraper handles common issues:

  • Invalid URL: Returns error with details
  • Connection timeout: Retries with exponential backoff
  • No matching elements: Returns empty content
  • 403/404 errors: Logs error and continues flow
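
The retry-with-exponential-backoff behavior looks roughly like the sketch below. The retry count, delays, and exception type are assumptions for illustration, not the node's actual values.

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=1.0):
    """Retry a fetch callable with exponential backoff
    (sketch of timeout handling; delays are assumptions)."""
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo with a flaky fetch that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return "<html>ok</html>"

result = fetch_with_retry(flaky, base_delay=0.01)
print(result)  # <html>ok</html>
```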

Performance Considerations

Rate Limiting

When scraping multiple pages:

  • Disable parallel iterations to avoid overwhelming servers
  • Add delays between requests when possible
  • Respect robots.txt and rate limits
  • Consider caching results for repeated scrapes
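
When the flow itself cannot add delays, the polite-crawling pattern above can be sketched as a sequential loop with a fixed pause between requests (the one-second delay is an assumption; tune it per site):

```python
import time

def scrape_all(urls, scrape_one, delay=1.0):
    """Scrape URLs one at a time with a pause between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request after the first
        results.append(scrape_one(url))
    return results

urls = ["https://example.com/a", "https://example.com/b"]
out = scrape_all(urls, lambda u: {"url": u, "content": "..."}, delay=0.01)
print(len(out))  # 2
```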

Content Size

Large pages can slow processing:

  • Use specific selectors to extract only needed content
  • Avoid scraping entire sites if you only need specific sections
  • Consider using HTML Text Extractor for cleaner text

Integration with Other Nodes

With Prompts

Page Scraper → Prompt
Map: out.content → Prompt.textToAnalyze

With HTML Text Extractor

Page Scraper (get full article)
  → HTML Text Extractor (clean and structure)
  → Prompt (analyze clean text)

With Datasets

Dataset Source (URLs to scrape)
  → Page Scraper
  → Dataset Sink (save scraped content)

Related Documentation

Website Mapper
Discover pages to scrape
HTML Text Extractor
Clean and structure HTML
Array Splitter
Process multiple pages