The Website Mapper node crawls a website and discovers all accessible pages, creating a sitemap that can be used to systematically extract content.

How It Works

Website Mapper takes a starting URL and:

  • Fetches the page content
  • Extracts all internal links
  • Follows links to discover additional pages
  • Returns a structured list of all discovered URLs
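
A rough picture of one crawl step can be sketched in Python with the requests library (LinkExtractor and discover_links are illustrative names; this is a sketch of the idea, not the node's actual implementation):

import requests
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def discover_links(page_url):
    """Fetch one page and return the absolute URLs of its internal links."""
    response = requests.get(page_url, timeout=10)
    parser = LinkExtractor()
    parser.feed(response.text)
    base_domain = urlparse(page_url).netloc
    internal = set()
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)        # resolve relative links
        if urlparse(absolute).netloc == base_domain:
            internal.add(absolute.split("#")[0])  # drop fragments
    return sorted(internal)

The same-domain check near the end is what keeps the crawl on the starting site (see Domain Restriction below).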

Configuration

Input Variable

The Website Mapper requires a single input:

urlVar: The starting URL to begin mapping from
Example: https://example.com

Output Structure

The node outputs an array of discovered URLs along with crawl metadata (total page count and crawl depth):

{
  "out": {
    "urls": [
      "https://example.com",
      "https://example.com/about",
      "https://example.com/products",
      "https://example.com/contact"
    ],
    "totalPages": 4,
    "crawlDepth": 2
  },
  "_input": {
    "urlVar": "https://example.com"
  }
}

Common Usage Patterns

Map and Scrape Workflow

Dataset Source (websites)
  → Website Mapper (discover all pages)
  → Array Splitter (split urls array)
  → Page Scraper (extract content from each page)
  → Prompt (analyze content)
  → Dataset Sink (save results)

Selective Crawling

After mapping, you can filter URLs before scraping:

Website Mapper
  → (manual filtering or conditional logic)
  → Only scrape pages matching certain patterns
  → Example: only /blog/ or /docs/ pages
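
The filtering itself happens outside the Website Mapper; if your workflow has a code or conditional step, the idea is a simple path-prefix filter, sketched here in Python (keep_relevant and the prefix list are illustrative):

from urllib.parse import urlparse

def keep_relevant(urls, allowed_prefixes=("/blog/", "/docs/")):
    """Keep only URLs whose path starts with one of the allowed prefixes."""
    return [u for u in urls if urlparse(u).path.startswith(allowed_prefixes)]

urls = [
    "https://example.com/blog/launch",
    "https://example.com/docs/setup",
    "https://example.com/careers",
]
print(keep_relevant(urls))   # keeps only the /blog/ and /docs/ pages
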
Tip
Website Mapper respects robots.txt by default. It will not crawl pages disallowed for bots.

Rate Limiting

The Website Mapper automatically implements polite crawling:

  • Respects robots.txt directives
  • Adds delays between requests (default: 1 second)
  • Limits concurrent requests
  • Sets appropriate User-Agent headers
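
The exact crawler settings are internal to the node, but the behaviour corresponds roughly to this Python sketch built on urllib.robotparser and a fixed delay (the User-Agent string and the 1-second delay are illustrative values, not the node's actual configuration):

import time
import requests
from urllib import robotparser

USER_AGENT = "MyWorkflowMapper/1.0"          # illustrative User-Agent

robots = robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

def polite_fetch(url, delay_seconds=1.0):
    """Fetch a URL only if robots.txt allows it, then pause before returning."""
    if not robots.can_fetch(USER_AGENT, url):
        return None                          # disallowed for bots: skip it
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(delay_seconds)                # roughly one request per second
    return response
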
Warning
Be respectful when crawling websites. Excessive requests can strain servers. Consider using the depth limit to avoid crawling entire large sites.

Configuration Options

Max Depth

Control how many levels deep to crawl:

  • Depth 1: Only the starting page
  • Depth 2: Starting page + directly linked pages
  • Depth 3+: Continue following links from pages found at the previous depth
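
Continuing the sketch from How It Works, a depth limit amounts to a bounded breadth-first loop. This is an illustration of the idea rather than the node's implementation; map_site reuses the discover_links helper from that earlier sketch:

def map_site(start_url, max_depth=2):
    """Breadth-first crawl: depth 1 is the start page, depth 2 adds its direct links."""
    discovered = {start_url}
    frontier = [start_url]
    for _ in range(max_depth - 1):             # the start page itself counts as depth 1
        next_frontier = []
        for url in frontier:
            for link in discover_links(url):   # helper from the How It Works sketch
                if link not in discovered:
                    discovered.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return sorted(discovered)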

Domain Restriction

By default, Website Mapper follows only links within the same domain as the starting URL; it does not follow external links.

Best Practices

Start Small

Test with a small site or limited depth before mapping large websites:

  • Use depth limit of 2-3 for testing
  • Verify the results before processing all pages
  • Check the totalPages count in the output before processing everything, to avoid unexpectedly long execution times

Combine with Array Processing

Use Array Splitter to process discovered URLs individually:

Website Mapper output: out.urls (array)
  → Array Splitter: split on 'urls'
  → Each URL processed individually by downstream nodes
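
Conceptually, the split turns the single urls array into one item per downstream execution, roughly like this Python loop (purely illustrative; the Array Splitter itself is configured in the workflow, not written as code):

mapper_output = {
    "urls": [
        "https://example.com",
        "https://example.com/about",
    ],
    "totalPages": 2,
    "crawlDepth": 1,
}

for url in mapper_output["urls"]:
    # Each iteration stands in for one downstream run of Page Scraper.
    print("would scrape:", url)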

Filter Before Scraping

Not every discovered page will be relevant. Consider adding filtering logic so that only specific page types are processed (see Selective Crawling above).

Limitations

  • JavaScript-heavy sites: May not discover pages whose links are rendered dynamically on the client
  • Authentication: Cannot access pages behind login walls
  • Large sites: May take significant time to map thousands of pages
  • Rate limits: Some sites may block or throttle crawlers

Error Handling

Website Mapper handles common errors gracefully:

  • Invalid URLs are skipped with a warning
  • On connection timeouts, the crawl continues with the pages already discovered
  • HTTP errors (404, 500) are logged but don't stop the crawl
  • Inaccessible pages are reported in the output
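
You can mirror the same pattern in your own pre- or post-processing steps; the Python sketch below does so with the requests library (safe_fetch is an illustrative helper, not the node's code):

import logging
import requests

logging.basicConfig(level=logging.WARNING)

def safe_fetch(url):
    """Fetch one page; log problems and return None so the crawl keeps going."""
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.Timeout:
        logging.warning("timeout fetching %s; continuing with pages found so far", url)
        return None
    except requests.exceptions.RequestException as exc:
        logging.warning("skipping %s (%s)", url, exc)   # invalid URLs, DNS failures, ...
        return None
    if response.status_code >= 400:
        logging.warning("HTTP %s for %s; logged, crawl continues", response.status_code, url)
        return None
    return response.text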

Related Documentation

  • Page Scraper: Extract content from discovered pages
  • Array Splitter: Process URLs individually
  • Variable Mapping: Map discovered URLs to other nodes