The Page Scraper node fetches web pages and extracts specific content using CSS selectors, making it easy to gather data from websites in a structured way.
How It Works
Page Scraper takes a URL and a CSS selector, then:
- Fetches the HTML content of the page
- Applies your CSS selector to find matching elements
- Extracts the text content from matched elements
- Returns the extracted content as structured data
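The node's internal implementation isn't shown here, but the sequence above is roughly equivalent to the following Python sketch (using requests and BeautifulSoup purely for illustration; the function name, timeout, and join behaviour are assumptions, not the node's actual code):

```python
# Illustrative sketch only -- not the node's actual implementation.
import requests
from bs4 import BeautifulSoup

def scrape(url: str, selector: str) -> dict:
    """Fetch a page, apply a CSS selector, and return extracted text."""
    html = requests.get(url, timeout=30).text           # fetch the HTML
    soup = BeautifulSoup(html, "html.parser")            # parse it
    matches = soup.select(selector)                       # apply the CSS selector
    content = "\n".join(el.get_text(strip=True) for el in matches)  # extract text
    return {
        "out": {"content": content, "selector": selector, "url": url},
        "_input": {"url": url, "selector": selector},
    }

result = scrape("https://example.com/article", "article")
print(result["out"]["content"])
```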
Configuration
Required Inputs
url: The webpage URL to scrape
selector: CSS selector to target content
Examples:
url: https://example.com/article
selector: article
selector: .content
selector: #main-text
selector: div.post-body
Output Structure
{
  "out": {
    "content": "Extracted text content here...",
    "selector": "article",
    "url": "https://example.com/article"
  },
  "_input": {
    "url": "https://example.com/article",
    "selector": "article"
  }
}
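Downstream nodes reference these fields by path, such as out.content. As a purely illustrative sketch of what those paths resolve to (the values are taken from the example above):

```python
# Hypothetical illustration of reading the node's output by field path.
result = {
    "out": {
        "content": "Extracted text content here...",
        "selector": "article",
        "url": "https://example.com/article",
    },
    "_input": {"url": "https://example.com/article", "selector": "article"},
}

text = result["out"]["content"]      # what "out.content" refers to in mappings
source_url = result["out"]["url"]    # useful for provenance when saving results
```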
CSS Selector Examples
Common Selectors
/* Element selectors */
article /* <article> elements */
p /* <p> elements */
div /* <div> elements */
/* Class selectors */
.content /* class="content" */
.post-body /* class="post-body" */
/* ID selectors */
#main /* id="main" */
#article-content /* id="article-content" */
/* Nested selectors */
article p /* <p> inside <article> */
div.content p /* <p> inside <div class="content"> */
/* Multiple elements */
h1, h2, h3 /* All h1, h2, and h3 elements */
Advanced Selectors
/* Attribute selectors */
[data-content] /* Elements with data-content attribute */
a[href*="blog"] /* Links containing "blog" in href */
/* Pseudo-selectors */
p:first-child /* First <p> element */
div:not(.sidebar) /* <div> elements without sidebar class */
/* Combinators */
article > p /* Direct <p> children of <article> */
h1 + p /* <p> immediately after <h1> */
Tip
Test your CSS selectors in your browser's DevTools (right-click → Inspect) to ensure they match the content you want to extract.
Common Usage Patterns
Article Extraction
Page Scraper:
url: https://blog.example.com/post-123
selector: article
Result: Full article content extracted
Multiple Pages
Website Mapper (discover pages)
→ Array Splitter (split URLs)
→ Page Scraper (extract from each)
→ Prompt (analyze content)
→ Dataset Sink (save results)
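The Splitter-plus-Scraper portion of that flow behaves like a loop over the discovered URLs. A rough Python sketch, with made-up URLs and the same illustrative requests/BeautifulSoup approach as above:

```python
# Illustration of scraping each discovered URL and collecting the results.
import requests
from bs4 import BeautifulSoup

urls = [
    "https://blog.example.com/post-1",   # hypothetical URLs from a Website Mapper
    "https://blog.example.com/post-2",
]

records = []
for url in urls:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    content = "\n".join(el.get_text(strip=True) for el in soup.select("article"))
    records.append({"url": url, "content": content})

# `records` is roughly what a Dataset Sink would then persist.
```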
Structured Data Extraction
For more complex extraction, scrape multiple sections:
Page Scraper 1: selector: h1 (get title)
Page Scraper 2: selector: .author (get author)
Page Scraper 3: selector: article (get content)
→ Combine in Prompt node
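In code terms this is just several selectors applied to the same fetched page. A hedged sketch, assuming the page actually exposes an h1, an .author element, and an article body:

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://blog.example.com/post-123", timeout=30).text,
    "html.parser",
)

# One selector per field, mirroring the three Page Scraper nodes above.
title = soup.select_one("h1")
author = soup.select_one(".author")
body = soup.select_one("article")

record = {
    "title": title.get_text(strip=True) if title else "",
    "author": author.get_text(strip=True) if author else "",
    "content": body.get_text(strip=True) if body else "",
}
```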
Best Practices
Be Specific with Selectors
- Too broad: div (might match many unwanted elements)
- Too specific: div.content.main.primary[data-id="123"] (brittle)
- Just right: article or .post-content
Handle Multiple Matches
If your selector matches multiple elements, Page Scraper concatenates their content. To process items separately, use HTML Text Extractor.
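To illustrate the difference, the sketch below contrasts the single concatenated string (what Page Scraper returns) with a per-item list, which is what you would want when matches need to be processed separately (illustrative only):

```python
from bs4 import BeautifulSoup

html = "<article><p>First paragraph.</p><p>Second paragraph.</p></article>"
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")

# Page Scraper style: all matches concatenated into one string.
concatenated = "\n".join(p.get_text(strip=True) for p in paragraphs)

# Per-item processing keeps each match separate.
items = [p.get_text(strip=True) for p in paragraphs]

print(concatenated)  # one string: "First paragraph.\nSecond paragraph."
print(items)         # two items: ['First paragraph.', 'Second paragraph.']
```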
Test Before Batch Processing
- Run on 1-2 URLs first
- Verify the extracted content is what you expect
- Adjust your selector if needed
- Then process your full list
Warning
Websites can change their structure. Monitor your scraping flows and be prepared to update selectors if sites redesign.
Error Handling
Page Scraper handles common issues:
- Invalid URL: Returns error with details
- Connection timeout: Retries with exponential backoff
- No matching elements: Returns empty content
- 403/404 errors: Logs error and continues flow
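These behaviours are built into the node; purely for illustration, the retry-with-backoff pattern described above looks roughly like this (attempt count and delays are arbitrary example values, not the node's actual settings):

```python
import time
import requests

def fetch_with_backoff(url: str, attempts: int = 3) -> str | None:
    """Retry a request with exponentially increasing delays."""
    delay = 1.0
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()          # surfaces 403/404 as exceptions
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < attempts - 1:
                time.sleep(delay)                # wait before retrying
                delay *= 2                       # exponential backoff
    return None                                  # caller decides how to continue
```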
Performance Considerations
Rate Limiting
When scraping multiple pages:
- Disable parallel iterations to avoid overwhelming servers
- Add delays between requests when possible
- Respect robots.txt and rate limits
- Consider caching results for repeated scrapes
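A minimal sketch of the first three points, assuming sequential processing and an arbitrary example delay and user agent:

```python
import time
from urllib import robotparser
import requests

USER_AGENT = "my-scraper-flow"     # arbitrary example identifier
DELAY_SECONDS = 2.0                # arbitrary politeness delay

# Respect robots.txt before fetching.
robots = robotparser.RobotFileParser()
robots.set_url("https://blog.example.com/robots.txt")
robots.read()

urls = ["https://blog.example.com/post-1", "https://blog.example.com/post-2"]

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue                                   # skip disallowed pages
    requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    time.sleep(DELAY_SECONDS)                      # space out requests
```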
Content Size
Large pages can slow processing:
- Use specific selectors to extract only needed content
- Avoid scraping entire sites if you only need specific sections
- Consider using HTML Text Extractor for cleaner text
Integration with Other Nodes
With Prompts
Page Scraper → Prompt
Map: out.content → Prompt.textToAnalyze
With HTML Text Extractor
Page Scraper (get full article)
→ HTML Text Extractor (clean and structure)
→ Prompt (analyze clean text)
With Datasets
Dataset Source (URLs to scrape)
→ Page Scraper
→ Dataset Sink (save scraped content)