The HTML Text Extractor node converts raw HTML into clean, structured text by removing tags, scripts, styles, and other formatting markup, so the content is better suited to AI processing.

How It Works

HTML Text Extractor takes HTML content and:

  • Strips HTML tags and attributes
  • Removes scripts, styles, and comments
  • Preserves semantic structure (headings, lists, paragraphs)
  • Normalizes whitespace and line breaks
  • Outputs clean, readable text
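
The steps above can be sketched with Python's standard-library html.parser. This is an illustration under stated assumptions, not the node's actual implementation: the names TextExtractor, extract_text, SKIPPED_TAGS, and BLOCK_TAGS are hypothetical, and the real node may handle more elements than shown here.

```python
from html.parser import HTMLParser
import re

# Hypothetical names for illustration only; not the node's real internals.
SKIPPED_TAGS = {"script", "style"}
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "ul", "ol", "br"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped automatically.
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalize whitespace inside each line, then drop empty lines.
    lines = [re.sub(r"\s+", " ", line).strip() for line in "".join(parser.parts).split("\n")]
    return "\n\n".join(line for line in lines if line)
```

Tags and attributes are discarded because only text nodes are collected, while block-level elements are turned into paragraph breaks so the structure stays readable.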

Why Use HTML Text Extractor?

Cleaner AI Inputs

Raw HTML can confuse LLMs and waste tokens:

<!-- Raw HTML (wastes tokens) -->
<div class="article-content" data-id="123">
  <h1 style="color: blue;">Title</h1>
  <p class="intro"><span>Text here</span></p>
</div>

<!-- After HTML Text Extractor -->
Title

Text here

Token Efficiency

  • Removes unnecessary markup
  • Typically reduces input size by 50-80%, depending on how markup-heavy the page is
  • Lowers API costs
  • Speeds up processing

Better AI Understanding

  • Focuses on actual content
  • Preserves document structure
  • Removes navigation, ads, footers
  • Improves prompt effectiveness

Tip: Use HTML Text Extractor after Page Scraper and before Prompt nodes for optimal results and cost savings.

Configuration

Input

html: Raw HTML content to process
selector (optional): Extract specific elements first

Output Structure

{
  "out": {
    "text": "Cleaned text content...",
    "title": "Page title (if found)",
    "headings": ["H1", "H2", "H3"],
    "metadata": {
      "wordCount": 1234,
      "originalSize": "45KB",
      "cleanedSize": "12KB"
    }
  },
  "_input": {
    "html": "Original HTML..."
  }
}
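
As a rough sketch of how a downstream step might read this output, the snippet below loads the example structure and derives the size reduction from the metadata fields. The helper names size_in_kb and size_reduction are hypothetical, and the parser assumes sizes always use the "KB" suffix shown in the example.

```python
import json
import re

# The example output from above, inlined so the sketch is self-contained.
node_output = json.loads("""
{
  "out": {
    "text": "Cleaned text content...",
    "title": "Page title (if found)",
    "headings": ["H1", "H2", "H3"],
    "metadata": {"wordCount": 1234, "originalSize": "45KB", "cleanedSize": "12KB"}
  },
  "_input": {"html": "Original HTML..."}
}
""")


def size_in_kb(size: str) -> float:
    """Parse a size string such as "45KB" (assumed to be the only unit used)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)KB", size.strip())
    if match is None:
        raise ValueError(f"Unexpected size format: {size!r}")
    return float(match.group(1))


def size_reduction(output: dict) -> float:
    """Return the fraction of the original size removed by extraction."""
    meta = output["out"]["metadata"]
    return 1 - size_in_kb(meta["cleanedSize"]) / size_in_kb(meta["originalSize"])


print(f"Reduced by {size_reduction(node_output):.0%}")
```

For the example values (45KB down to 12KB), the reduction is roughly 73%.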

Processing Options

Standard Mode (Default)

  • Removes all HTML tags
  • Preserves paragraph breaks
  • Maintains heading structure
  • Keeps list formatting

Markdown Mode

Converts HTML to Markdown format:

# Heading 1
## Heading 2

Paragraph text with **bold** and *italic*.

- List item 1
- List item 2

[Link text](url)
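
To make the conversion concrete, here is a minimal sketch of an HTML-to-Markdown pass built on the standard-library html.parser. It covers only headings, paragraphs, bold, italic, unordered-style list items, and links. MarkdownConverter and to_markdown are hypothetical names, and the node's real Markdown mode may format spacing differently.

```python
from html.parser import HTMLParser
import re


class MarkdownConverter(HTMLParser):
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.href = None  # target of the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in {"p", "ul", "ol"}:
            self.out.append("\n\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in {"ul", "ol"}:
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(re.sub(r"\s+", " ", data))


def to_markdown(html: str) -> str:
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    lines = [line.strip() for line in "".join(converter.out).split("\n")]
    # Collapse runs of blank lines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

Ordered lists are rendered with "-" markers here for brevity; a fuller converter would number them.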

Structured Mode

Outputs structured JSON with semantic sections:

{
  "title": "Page Title",
  "sections": [
    {
      "heading": "Introduction",
      "content": "Intro text..."
    },
    {
      "heading": "Main Content",
      "content": "Body text..."
    }
  ]
}
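
One plausible way to produce this shape is to start a new section at every heading and collect the text that follows it. The sketch below does that with the standard-library html.parser. SectionParser and to_sections are hypothetical names, and the assumption that the title comes from the <title> element is mine rather than something the node documents.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = {"p", "div", "li"}


class SectionParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.sections = []
        self.current_tag = None  # "title", a heading tag, or None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.current_tag = "title"
        elif tag in HEADING_TAGS:
            self.current_tag = tag
            self.sections.append({"heading": "", "content": ""})

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None
        elif tag in BLOCK_TAGS and self.sections:
            self.sections[-1]["content"] += " "  # keep adjacent blocks separated

    def handle_data(self, data):
        if self.current_tag == "title":
            self.title += data
        elif self.current_tag in HEADING_TAGS:
            self.sections[-1]["heading"] += data
        elif self.sections:
            # Text before the first heading is ignored in this sketch.
            self.sections[-1]["content"] += data


def to_sections(html: str) -> dict:
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "sections": [
            {"heading": " ".join(s["heading"].split()), "content": " ".join(s["content"].split())}
            for s in parser.sections
        ],
    }
```

This flat grouping ignores heading levels; a nested structure would need to track the level of each heading as well.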

Common Usage Patterns

Web Content Processing

Page Scraper (get HTML)
  → HTML Text Extractor (clean text)
  → Prompt (analyze clean content)
  → Dataset Sink

Multi-Page Analysis

Website Mapper (discover URLs)
  → Array Splitter
  → Page Scraper (get each page)
  → HTML Text Extractor (clean each)
  → Prompt (analyze each clean page)
  → Array Flatten
  → Prompt (summarize all pages)
  → Dataset Sink
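
The data flow of this pattern is a per-page map followed by a single reduce. The sketch below expresses that flow in plain Python, with each node passed in as a callable. The function analyze_site and its parameters are hypothetical stand-ins for the nodes, not a platform API.

```python
from typing import Callable, Iterable, List


def analyze_site(
    urls: Iterable[str],
    scrape: Callable[[str], str],
    extract: Callable[[str], str],
    analyze: Callable[[str], str],
    summarize: Callable[[List[str]], str],
) -> str:
    per_page: List[str] = []
    for url in urls:                    # Array Splitter: one item per URL
        html = scrape(url)              # Page Scraper
        text = extract(html)            # HTML Text Extractor
        per_page.append(analyze(text))  # Prompt (per page)
    return summarize(per_page)          # Array Flatten, then Prompt (summary)
```

Because extraction happens before each per-page prompt, every LLM call in the loop receives cleaned text rather than raw HTML.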

Content Extraction Pipeline

Dataset Source (URLs)
  → Page Scraper (selector: "article")
  → HTML Text Extractor (markdown mode)
  → Prompt (extract key points)
  → Dataset Sink (structured data)

Cleaning Options

Remove Navigation

Automatically filters common navigation elements:

  • Header and footer menus
  • Sidebar navigation
  • Breadcrumbs
  • Social media buttons

Remove Boilerplate

Strips common non-content elements:

  • Cookie notices
  • Newsletter signups
  • Related articles widgets
  • Advertisement placeholders

Smart Content Detection

Focuses on main content:

  • Identifies primary content area
  • Ignores sidebars and ancillary content
  • Preserves article structure
  • Maintains semantic hierarchy

Warning: HTML Text Extractor works best on well-structured HTML. Poorly formatted pages may produce suboptimal results.
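
The node's detection heuristics are not documented here, but a simple approximation of these cleaning options is to skip HTML5 semantic elements that usually hold navigation or boilerplate. The sketch below assumes pages mark such regions with nav, header, footer, and aside; MainContentExtractor, extract_main_text, and EXCLUDED_TAGS are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives inside these elements. Pages that use
# generic <div> wrappers would need class- or id-based rules instead.
EXCLUDED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.excluded_depth = 0  # > 0 while inside any excluded element

    def handle_starttag(self, tag, attrs):
        if tag in EXCLUDED_TAGS:
            self.excluded_depth += 1

    def handle_endtag(self, tag):
        if tag in EXCLUDED_TAGS and self.excluded_depth:
            self.excluded_depth -= 1

    def handle_data(self, data):
        if not self.excluded_depth and data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_main_text(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

This also shows why the warning above matters: the depth counter relies on excluded elements being closed properly, so badly nested markup can hide or leak content.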

Best Practices

Use After Page Scraper

✅ Good:
Page Scraper → HTML Text Extractor → Prompt

❌ Bad:
Page Scraper → Prompt (raw HTML wastes tokens)

Pre-filter with Selectors

Use Page Scraper selectors to get relevant sections first:

Page Scraper (selector: "article")
  → HTML Text Extractor
  → Cleaner, more focused text

Choose the Right Mode

  • Standard: General text processing
  • Markdown: When structure matters (documentation, articles)
  • Structured: When analyzing sections separately

Token Savings Example

Raw HTML: 4,500 tokens
After extraction: 1,200 tokens
Savings: 73%

Cost impact at $0.01/1K tokens:
  Raw: $0.045 per item
  Cleaned: $0.012 per item
  Savings: $0.033 per item × 1000 items = $33 saved
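
The same arithmetic can be reproduced for other token counts and prices with a small helper. token_savings is a hypothetical function written for this example; the inputs below are the figures from the example above.

```python
def token_savings(raw_tokens: int, cleaned_tokens: int, price_per_1k: float, items: int) -> dict:
    """Compute the saved fraction and costs for a batch of items."""
    raw_cost = raw_tokens / 1000 * price_per_1k
    cleaned_cost = cleaned_tokens / 1000 * price_per_1k
    return {
        "fraction_saved": 1 - cleaned_tokens / raw_tokens,
        "raw_cost": raw_cost,
        "cleaned_cost": cleaned_cost,
        "saved_per_item": raw_cost - cleaned_cost,
        "saved_total": (raw_cost - cleaned_cost) * items,
    }


result = token_savings(raw_tokens=4500, cleaned_tokens=1200, price_per_1k=0.01, items=1000)
print(f"Savings: {result['fraction_saved']:.0%}, total saved: ${result['saved_total']:.2f}")
```

The saved fraction depends only on the token counts, while the dollar figure scales linearly with both the price and the number of items.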

Advanced Features

Language Detection

Automatically detects content language:

{
  "text": "Extracted content...",
  "language": "en",
  "confidence": 0.98
}

Metadata Extraction

Captures useful metadata:

{
  "metadata": {
    "title": "Page Title",
    "author": "John Doe",
    "publishDate": "2024-01-15",
    "description": "Meta description...",
    "keywords": ["ai", "automation"]
  }
}
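
Metadata of this kind usually comes from the page's <title> and <meta> elements. The sketch below shows one way to collect it with the standard-library html.parser. The mapping in META_FIELDS is an assumption, in particular that publishDate is read from the article:published_time property, and MetadataParser and extract_metadata are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed mapping from <meta> name/property values to output field names.
META_FIELDS = {
    "author": "author",
    "description": "description",
    "keywords": "keywords",
    "article:published_time": "publishDate",
}


class MetadataParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.metadata = {}
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            key = attr.get("name") or attr.get("property")
            content = attr.get("content")
            if key in META_FIELDS and content:
                self.metadata[META_FIELDS[key]] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    meta = parser.metadata
    if "title" in meta:
        meta["title"] = meta["title"].strip()
    if "keywords" in meta:
        # Split the comma-separated keywords string into a list.
        meta["keywords"] = [k.strip() for k in meta["keywords"].split(",") if k.strip()]
    return {"metadata": meta}
```

Fields that are absent from the page are simply left out of the result rather than filled with placeholders.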

Table Handling

Converts HTML tables to readable format:

Standard mode:
Column1  Column2  Column3
Value1   Value2   Value3

Markdown mode:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1  | Value2  | Value3  |
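
Both layouts can be derived from the same list of rows. The sketch below parses a simple table with the standard-library html.parser and renders it either way. TableParser, parse_table, to_plain_table, and to_markdown_table are hypothetical names, and the code assumes a rectangular table with the header in the first row and no colspan or rowspan.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in {"td", "th"} and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in {"td", "th"}:
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def parse_table(html: str) -> list:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


def column_widths(rows: list) -> list:
    return [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]


def to_plain_table(rows: list) -> str:
    widths = column_widths(rows)
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip() for row in rows
    )


def to_markdown_table(rows: list) -> str:
    widths = column_widths(rows)

    def fmt(row):
        return "| " + " | ".join(cell.ljust(w) for cell, w in zip(row, widths)) + " |"

    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(rows[0]), separator, *(fmt(row) for row in rows[1:])])
```

Padding each cell to the widest entry in its column is what keeps the plain-text version aligned without any table markup.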

Handling Edge Cases

JavaScript-Heavy Sites

The extractor only sees the HTML it is given, so content loaded dynamically by JavaScript may be missing:

  • Use Page Scraper with rendering if needed
  • HTML Text Extractor works on whatever HTML is provided
  • Consider alternative scraping methods for single-page applications (SPAs)

Malformed HTML

Handles broken HTML gracefully:

  • Attempts to parse and clean
  • Logs warnings for major issues
  • Returns best-effort text extraction
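
A best-effort strategy like the one described can be sketched with the standard-library html.parser, which tolerates unclosed and mismatched tags instead of raising. The example tracks open tags to report problems while still returning the text. LenientExtractor and extract_best_effort are hypothetical names, and the warning messages are invented for illustration rather than taken from the node's logs.

```python
import logging
from html.parser import HTMLParser

logger = logging.getLogger("html_text_extractor_sketch")

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LenientExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []
        self.chunks = []
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Also close anything opened after the matching tag.
            while self.open_tags.pop() != tag:
                pass
        elif tag not in VOID_TAGS:
            self.warnings.append(f"unmatched closing tag </{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_best_effort(html: str):
    parser = LenientExtractor()
    parser.feed(html)
    parser.close()
    if parser.open_tags:
        parser.warnings.append("unclosed tags: " + ", ".join(parser.open_tags))
    for message in parser.warnings:
        logger.warning(message)
    return " ".join(parser.chunks), parser.warnings
```

The text is returned even when warnings are present, so a caller can decide whether a noisy page is still usable.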

Empty Results

If output is empty:

  • Check if Page Scraper selector was too specific
  • Verify HTML contains actual text content
  • Review logs for parsing errors
  • Try standard mode if using specialized modes

Performance Tips

Batch Processing

HTML Text Extractor is fast and efficient:

  • Processes pages in milliseconds
  • Safe to use with parallel execution
  • No API calls (runs locally)
  • No cost per extraction

When to Skip

You might not need HTML Text Extractor if:

  • Page Scraper already returns clean text
  • You need to preserve specific HTML structure
  • You are processing non-HTML content

Tip: HTML Text Extractor has no API costs and runs in milliseconds. Use it liberally to clean content before sending it to expensive LLM calls.

Related Documentation

  • Page Scraper: Extract HTML from web pages
  • Website Mapper: Discover pages to process
  • Prompt Node: Process cleaned text with AI