The Website Mapper node crawls a website and discovers all accessible pages, creating a sitemap that can be used to systematically extract content.
How It Works
Website Mapper takes a starting URL and:
- Fetches the page content
- Extracts all internal links
- Follows links to discover additional pages
- Returns a structured list of all discovered URLs
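As a rough illustration, the loop below sketches this behaviour in Python using only the standard library. It is a simplified stand-in, not the node's actual implementation; the function name map_site and the breadth-first approach are assumptions made for the example.

```python
# Simplified breadth-first crawl sketch (illustrative only, not the node's code).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_site(start_url: str, max_depth: int = 2) -> list[str]:
    """Discover same-domain pages reachable within max_depth levels."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 1)])          # (url, depth of that page)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue                          # don't expand beyond the depth limit
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                          # skip pages that fail to load
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]   # resolve relative links, drop fragments
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)
```

Calling the sketch with max_depth=2, for instance, would return the starting page plus directly linked pages, mirroring the depth semantics described under Configuration Options below.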
Configuration
Input Variable
The Website Mapper requires a single input:
urlVar: The starting URL to begin mapping from
Example: https://example.com
Output Structure
The node outputs the discovered URLs as an array, along with basic crawl metadata:
{
  "out": {
    "urls": [
      "https://example.com",
      "https://example.com/about",
      "https://example.com/products",
      "https://example.com/contact"
    ],
    "totalPages": 4,
    "crawlDepth": 2
  },
  "_input": {
    "urlVar": "https://example.com"
  }
}
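As a quick illustration (a sketch of the data structure only, not how downstream nodes actually consume it), the output above can be read like any parsed JSON object; the field names below are taken directly from the example:

```python
# Illustrative only: reading the example output structure shown above.
import json

raw_output = """
{
  "out": {
    "urls": ["https://example.com", "https://example.com/about"],
    "totalPages": 2,
    "crawlDepth": 2
  },
  "_input": {"urlVar": "https://example.com"}
}
"""

result = json.loads(raw_output)
for url in result["out"]["urls"]:      # the array most workflows iterate over
    print(url)
print(result["out"]["totalPages"], "pages, crawl depth", result["out"]["crawlDepth"])
```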
Common Usage Patterns
Map and Scrape Workflow
Dataset Source (websites)
→ Website Mapper (discover all pages)
→ Array Splitter (split urls array)
→ Page Scraper (extract content from each page)
→ Prompt (analyze content)
→ Dataset Sink (save results)
Selective Crawling
After mapping, you can filter URLs before scraping:
Website Mapper
→ (manual filtering or conditional logic)
→ Only scrape pages matching certain patterns
→ Example: only /blog/ or /docs/ pages
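One way to express such a filter, sketched here in Python with the standard library (the path prefixes are just examples):

```python
# Illustrative URL filter: keep only /blog/ and /docs/ pages from the mapper output.
from urllib.parse import urlparse

urls = [
    "https://example.com/",
    "https://example.com/blog/launch-post",
    "https://example.com/docs/getting-started",
    "https://example.com/careers",
]

wanted_prefixes = ("/blog/", "/docs/")
filtered = [u for u in urls if urlparse(u).path.startswith(wanted_prefixes)]
print(filtered)  # only the blog and docs pages remain
```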
Rate Limiting
The Website Mapper automatically implements polite crawling:
- Respects robots.txt directives
- Adds delays between requests (default: 1 second)
- Limits concurrent requests
- Sets appropriate User-Agent headers
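The sketch below shows what this kind of polite fetching can look like using Python's standard library; the User-Agent string and timeout are assumptions for the example, and concurrency limiting is omitted for brevity.

```python
# Illustrative polite fetching: robots.txt check, fixed delay, explicit User-Agent.
import time
import urllib.robotparser
from urllib.request import Request, urlopen

USER_AGENT = "ExampleMapperBot/1.0"   # assumed value, not the node's actual header
DELAY_SECONDS = 1.0                   # default delay between requests

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

for url in ["https://example.com/", "https://example.com/about"]:
    if not robots.can_fetch(USER_AGENT, url):
        continue                                      # respect robots.txt directives
    request = Request(url, headers={"User-Agent": USER_AGENT})
    page = urlopen(request, timeout=10).read()
    time.sleep(DELAY_SECONDS)                         # wait before the next request
```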
Configuration Options
Max Depth
Control how many levels deep to crawl:
- Depth 1: Only the starting page
- Depth 2: Starting page + directly linked pages
- Depth 3+: Continue following links
Domain Restriction
By default, Website Mapper only follows links within the same domain. It will not follow external links.
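A same-domain check of this kind can be expressed as a simple hostname comparison, for example:

```python
# Illustrative same-domain check: follow a link only if its host matches the start URL.
from urllib.parse import urlparse

def is_internal(link: str, start_url: str) -> bool:
    return urlparse(link).netloc == urlparse(start_url).netloc

print(is_internal("https://example.com/about", "https://example.com"))      # True
print(is_internal("https://another-site.com/page", "https://example.com"))  # False
```

Note that a strict hostname comparison like this treats subdomains (for example blog.example.com) as external; whether the node does the same is not specified here.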
Best Practices
Start Small
Test with a small site or limited depth before mapping large websites:
- Use depth limit of 2-3 for testing
- Verify the results before processing all pages
- Consider the number of pages to avoid long execution times
Combine with Array Processing
Use Array Splitter to process discovered URLs individually:
Website Mapper output: out.urls (array)
→ Array Splitter: split on 'urls'
→ Each URL processed individually by downstream nodes
Filter Before Scraping
Not all discovered pages may be relevant. Consider adding filtering logic to process only specific page types.
Limitations
- JavaScript-heavy sites: May not discover pages loaded dynamically
- Authentication: Cannot access pages behind login walls
- Large sites: May take significant time to map thousands of pages
- Rate limits: Some sites may block or throttle crawlers
Error Handling
Website Mapper handles common errors gracefully:
- Invalid URLs are skipped with warnings
- Connection timeouts don't abort the crawl; it continues with the pages already discovered
- HTTP errors (404, 500) are logged but don't stop the crawl
- Inaccessible pages are reported in the output
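A fetch loop with this kind of behaviour might look like the following sketch; it is illustrative only, and the exact log messages and output fields of the real node may differ.

```python
# Illustrative error handling during a crawl: invalid URLs and failed requests
# are logged and skipped so the rest of the crawl can continue.
import logging
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import urlopen

logging.basicConfig(level=logging.WARNING)
discovered, inaccessible = [], []

for url in ["https://example.com/", "not-a-valid-url", "https://example.com/missing"]:
    if urlparse(url).scheme not in ("http", "https"):
        logging.warning("Skipping invalid URL: %s", url)
        continue
    try:
        urlopen(url, timeout=10).read()
        discovered.append(url)
    except HTTPError as err:                  # e.g. 404 or 500 responses
        logging.warning("HTTP %s for %s", err.code, url)
        inaccessible.append(url)
    except (URLError, TimeoutError) as err:   # connection failures and timeouts
        logging.warning("Could not reach %s: %s", url, err)
        inaccessible.append(url)

print({"urls": discovered, "inaccessible": inaccessible})  # hypothetical report shape
```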