The HTML Text Extractor node converts raw HTML into clean, structured text by removing tags, scripts, styles, and presentational markup, so the content is easier for an AI model to process.
How It Works
HTML Text Extractor takes HTML content and:
- Strips HTML tags and attributes
- Removes scripts, styles, and comments
- Preserves semantic structure (headings, lists, paragraphs)
- Normalizes whitespace and line breaks
- Outputs clean, readable text
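The steps above can be sketched with Python's standard-library `html.parser`. This is a minimal illustration under stated assumptions, not the node's actual implementation: the names `TextExtractor` and `extract_text` are hypothetical, and the set of tags treated as block-level is an assumption.

```python
import re
from html.parser import HTMLParser

# Elements whose contents are dropped entirely.
SKIPPED = {"script", "style"}
# Assumed set of block-level elements that produce a line break in the output.
BLOCKS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "ul", "ol", "br", "tr"}


class TextExtractor(HTMLParser):
    """Collects visible text. Comments are dropped because handle_comment is not overridden."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in BLOCKS:
            self.parts.append("\n")
        if tag == "li" and not self.skip_depth:
            self.parts.append("- ")  # keep a visible list marker

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCKS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of spaces and tabs, trim each line, and drop empty lines.
    lines = [re.sub(r"[ \t\r\f\v]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Tags and attributes never reach the output because only `handle_data` contributes text; block-level tags contribute line breaks, which is what keeps headings, paragraphs, and list items on separate lines.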
Why Use HTML Text Extractor?
Cleaner AI Inputs
Raw HTML can confuse LLMs and waste tokens:
<!-- Raw HTML (wastes tokens) -->
<div class="article-content" data-id="123">
<h1 style="color: blue;">Title</h1>
<p class="intro"><span>Text here</span></p>
</div>
<!-- After HTML Text Extractor -->
Title
Text here
Token Efficiency
- Removes unnecessary markup
- Reduces input size by 50-80%
- Lowers API costs
- Speeds up processing
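To make the size and cost claims concrete, the arithmetic can be written as a small helper. This is only a sketch: `estimate_savings` is a hypothetical name, and the token counts and the $0.01 per 1K price are the illustrative figures from the Token Savings Example later on this page, not measurements.

```python
def estimate_savings(raw_tokens: int, cleaned_tokens: int, price_per_1k: float) -> dict:
    """Compare token counts and per-item cost before and after extraction."""
    saved_tokens = raw_tokens - cleaned_tokens
    raw_cost = raw_tokens / 1000 * price_per_1k
    cleaned_cost = cleaned_tokens / 1000 * price_per_1k
    return {
        "reduction_pct": round(100 * saved_tokens / raw_tokens),
        "raw_cost": round(raw_cost, 6),
        "cleaned_cost": round(cleaned_cost, 6),
        "saved_per_item": round(raw_cost - cleaned_cost, 6),
    }


# Illustrative figures only: 4,500 raw tokens, 1,200 after extraction.
result = estimate_savings(4500, 1200, price_per_1k=0.01)
# reduction_pct is 73 and saved_per_item is 0.033 (dollars).
```

Multiplying `saved_per_item` by the number of items in a run gives the total saving for that run.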
Better AI Understanding
- Focuses on actual content
- Preserves document structure
- Removes navigation, ads, footers
- Improves prompt effectiveness
Configuration
Input
html: Raw HTML content to process
selector (optional): Extract specific elements first
Output Structure
{
  "out": {
    "text": "Cleaned text content...",
    "title": "Page title (if found)",
    "headings": ["H1", "H2", "H3"],
    "metadata": {
      "wordCount": 1234,
      "originalSize": "45KB",
      "cleanedSize": "12KB"
    }
  },
  "_input": {
    "html": "Original HTML..."
  }
}
Processing Options
Standard Mode (Default)
- Removes all HTML tags
- Preserves paragraph breaks
- Maintains heading structure
- Keeps list formatting
Markdown Mode
Converts HTML to Markdown format:
# Heading 1
## Heading 2
Paragraph text with **bold** and *italic*.
- List item 1
- List item 2
[Link text](url)
Structured Mode
Outputs structured JSON with semantic sections:
{
  "title": "Page Title",
  "sections": [
    {
      "heading": "Introduction",
      "content": "Intro text..."
    },
    {
      "heading": "Main Content",
      "content": "Body text..."
    }
  ]
}
Common Usage Patterns
Web Content Processing
Page Scraper (get HTML)
→ HTML Text Extractor (clean text)
→ Prompt (analyze clean content)
→ Dataset Sink
Multi-Page Analysis
Website Mapper (discover URLs)
→ Array Splitter
→ Page Scraper (get each page)
→ HTML Text Extractor (clean each)
→ Prompt (analyze each clean page)
→ Array Flatten
→ Prompt (summarize all pages)
→ Dataset Sink
Content Extraction Pipeline
Dataset Source (URLs)
→ Page Scraper (selector: "article")
→ HTML Text Extractor (markdown mode)
→ Prompt (extract key points)
→ Dataset Sink (structured data)
Cleaning Options
Remove Navigation
Automatically filters common navigation elements:
- Header and footer menus
- Sidebar navigation
- Breadcrumbs
- Social media buttons
Remove Boilerplate
Strips common non-content elements:
- Cookie notices
- Newsletter signups
- Related articles widgets
- Advertisement placeholders
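Both filters above, navigation and boilerplate, can be approximated by skipping whole subtrees whose tag, `role`, `class`, or `id` suggests non-content. The sketch below is an assumption about how such a filter might work, not the node's actual rules: the tag set, the hint words, and the names `is_boilerplate` and `strip_boilerplate` are all hypothetical. It also assumes every non-void element is explicitly closed.

```python
import re
from html.parser import HTMLParser

# Assumed signals for non-content regions; a real filter would need tuning.
NAV_TAGS = {"nav", "header", "footer", "aside"}
HINTS = {"cookie", "cookies", "newsletter", "related", "ad", "ads",
         "advert", "advertisement", "breadcrumb", "breadcrumbs", "social", "share"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_boilerplate(tag, attrs):
    attr = dict(attrs)
    if tag in NAV_TAGS or attr.get("role") == "navigation":
        return True
    # Split class and id values into words: "cookie-banner" -> {"cookie", "banner"}.
    names = f'{attr.get("class") or ""} {attr.get("id") or ""}'.lower()
    return bool(HINTS & set(re.split(r"[\s_-]+", names)))


class BoilerplateFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # nesting depth inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or is_boilerplate(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_boilerplate(html: str) -> str:
    parser = BoilerplateFilter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Matching whole words rather than substrings avoids false positives such as `ad` matching `loading`. Treating every `header` element as navigation is deliberately coarse and would also drop an article's own header block.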
Smart Content Detection
Focuses on main content:
- Identifies primary content area
- Ignores sidebars and ancillary content
- Preserves article structure
- Maintains semantic hierarchy
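One simple way to approximate primary-content detection is to attribute each paragraph, heading, or list item to its nearest enclosing container and keep the container that owns the most text. This text-volume heuristic is a stand-in chosen for illustration, not the node's documented algorithm; `ContainerScorer` and `main_content` are hypothetical names, and the sketch assumes explicitly closed tags.

```python
from html.parser import HTMLParser

CONTAINERS = {"article", "main", "section", "div", "aside", "nav", "header", "footer"}
TEXT_BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class ContainerScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_containers = []  # indices into self.blocks, innermost last
        self.blocks = []           # blocks[i]: text blocks owned directly by container i
        self.block_depth = 0       # > 0 while inside a text block
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.blocks.append([])
            self.open_containers.append(len(self.blocks) - 1)
        elif tag in TEXT_BLOCKS:
            self.block_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTAINERS and self.open_containers:
            self.open_containers.pop()
        elif tag in TEXT_BLOCKS and self.block_depth:
            self.block_depth -= 1
            # Normalize whitespace and hand the text to the innermost open container.
            text = " ".join("".join(self.buffer).split())
            self.buffer = []
            if text and self.open_containers:
                self.blocks[self.open_containers[-1]].append(text)

    def handle_data(self, data):
        if self.block_depth:
            self.buffer.append(data)


def main_content(html: str) -> str:
    scorer = ContainerScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return ""
    # The container with the most directly owned text wins.
    best = max(scorer.blocks, key=lambda blocks: sum(len(b) for b in blocks))
    return "\n".join(best)
```

Because sidebars and menus are separate containers with little text of their own, they lose to the article body, and the winning container's headings and paragraphs come out in document order.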
Best Practices
Use After Page Scraper
✅ Good:
Page Scraper → HTML Text Extractor → Prompt
❌ Bad:
Page Scraper → Prompt (raw HTML wastes tokens)
Pre-filter with Selectors
Use Page Scraper selectors to get relevant sections first:
Page Scraper (selector: "article")
→ HTML Text Extractor
→ Cleaner, more focused text
Choose the Right Mode
- Standard: General text processing
- Markdown: When structure matters (documentation, articles)
- Structured: When analyzing sections separately
Token Savings Example
Raw HTML: 4,500 tokens
After extraction: 1,200 tokens
Savings: 73%
Cost impact at $0.01/1K tokens:
Raw: $0.045 per item
Cleaned: $0.012 per item
Savings: $0.033 per item × 1000 items = $33 saved
Advanced Features
Language Detection
Automatically detects content language:
{
  "text": "Extracted content...",
  "language": "en",
  "confidence": 0.98
}
Metadata Extraction
Captures useful metadata:
{
  "metadata": {
    "title": "Page Title",
    "author": "John Doe",
    "publishDate": "2024-01-15",
    "description": "Meta description...",
    "keywords": ["ai", "automation"]
  }
}
Table Handling
Converts HTML tables to readable format:
Standard mode:
Column1 Column2 Column3
Value1 Value2 Value3
Markdown mode:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1 | Value2 | Value3 |
Handling Edge Cases
JavaScript-Heavy Sites
Content that is loaded dynamically by JavaScript may be missing from the HTML the node receives:
- Use Page Scraper with rendering if needed
- HTML Text Extractor works on whatever HTML is provided
- Consider alternative scraping methods for SPAs
Malformed HTML
Handles broken HTML gracefully:
- Attempts to parse and clean
- Logs warnings for major issues
- Returns best-effort text extraction
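Best-effort handling of broken markup can be illustrated with a tolerant parser that keeps extracting text while recording what it had to repair. This is a sketch only: `LenientExtractor` and `extract_best_effort` are hypothetical names, the warning wording is invented for the example, and warnings are returned as a list rather than written to the node's logs.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LenientExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags still waiting for a closing tag
        self.warnings = []
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.open_tags:
            self.warnings.append(f"stray closing tag </{tag}>")
            return
        # Implicitly close everything opened after the matching tag.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.warnings.append(f"<{top}> was never closed")

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract_best_effort(html: str):
    parser = LenientExtractor()
    parser.feed(html)
    parser.close()
    # Anything still open at the end of the input was never closed either.
    for tag in reversed(parser.open_tags):
        parser.warnings.append(f"<{tag}> was never closed")
    return " ".join(parser.parts), parser.warnings
```

The text is returned even when warnings are present, so a caller can decide whether a noisy page is still usable.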
Empty Results
If output is empty:
- Check if Page Scraper selector was too specific
- Verify HTML contains actual text content
- Review logs for parsing errors
- Try standard mode if using specialized modes
Performance Tips
Batch Processing
HTML Text Extractor is fast and efficient:
- Processes pages in milliseconds
- Safe to use with parallel execution
- No API calls (runs locally)
- No cost per extraction
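Because extraction is a local, stateless transformation, a batch of pages can be mapped over a worker pool. The sketch below uses Python's `concurrent.futures` with a deliberately tiny stand-in extractor; `extract` and `extract_all` are hypothetical names and do not reflect how the node schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Stand-in extractor: keeps text nodes and drops tags (no script filtering)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract(html: str) -> str:
    # A fresh parser per call means no state is shared between workers.
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_all(pages, workers: int = 4):
    # pool.map returns results in the same order as the input pages.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, pages))
```

In standard CPython, threads mainly help when extraction is interleaved with I/O such as fetching pages; for purely CPU-bound batches, `ProcessPoolExecutor` from the same module is the usual substitute.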
When to Skip
You might not need HTML Text Extractor if:
- Page Scraper already returns clean text
- You need to preserve specific HTML structure
- You are processing non-HTML content