HTML Text Extractor - Evaligo AI Workflow Automation Docs

The HTML Text Extractor node converts raw HTML into clean, structured text by removing tags, scripts, styles, and presentational markup, so the content is easier for AI models to process.

How It Works

HTML Text Extractor takes HTML content and:

Strips HTML tags and attributes
Removes scripts, styles, and comments
Preserves semantic structure (headings, lists, paragraphs)
Normalizes whitespace and line breaks
Outputs clean, readable text
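
The steps above can be approximated in a few lines of Python. This is a minimal sketch using only the standard library, not Evaligo's implementation (which this page does not describe); the names TextExtractor and extract_text, and the SKIP_TAGS and BLOCK_TAGS sets, are illustrative assumptions.

```python
from html.parser import HTMLParser
import re

# Elements whose contents are dropped entirely (illustrative choice).
SKIP_TAGS = {"script", "style", "noscript", "template"}
# Block-level elements that should produce a break in the output.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "ul", "ol", "tr", "section", "article"}


class TextExtractor(HTMLParser):
    """Collects visible text and marks block boundaries with newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    # Comments are dropped because handle_comment is not overridden.


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t\r\f\v]+", " ", text)  # collapse runs of spaces
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around breaks
    text = re.sub(r"\n{2,}", "\n\n", text)     # at most one blank line
    return text.strip()
```

Applied to the sample markup under Cleaner AI Inputs, extract_text returns the title and the paragraph separated by one blank line. Attributes such as class and style disappear because only text nodes are collected.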

Why Use HTML Text Extractor?

Cleaner AI Inputs

Raw HTML can confuse LLMs and waste tokens:

<!-- Raw HTML (wastes tokens) -->
<div class="article-content" data-id="123">
  <h1 style="color: blue;">Title</h1>
  <p class="intro"><span>Text here</span></p>
</div>

<!-- After HTML Text Extractor -->
Title

Text here

Token Efficiency

Removes unnecessary markup
Typically reduces input size by 50-80%, depending on how markup-heavy the page is
Lowers API costs
Faster processing

Better AI Understanding

Focuses on actual content
Preserves document structure
Removes navigation, ads, footers
Improves prompt effectiveness

Tip

Use HTML Text Extractor after Page Scraper and before Prompt nodes for optimal results and cost savings.

Configuration

Input

html: Raw HTML content to process
selector (optional): Extract specific elements first

Output Structure

{
  "out": {
    "text": "Cleaned text content...",
    "title": "Page title (if found)",
    "headings": ["H1", "H2", "H3"],
    "metadata": {
      "wordCount": 1234,
      "originalSize": "45KB",
      "cleanedSize": "12KB"
    }
  },
  "_input": {
    "html": "Original HTML..."
  }
}
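
As a rough illustration of how downstream code might read this structure, the sketch below parses the sample payload and derives the size reduction from the metadata. It assumes the size fields are always strings with a "KB" suffix, as in the sample; size_in_kb and size_reduction are hypothetical helper names, not part of the product.

```python
import json

# Sample payload copied from the output structure shown above.
raw = """
{
  "out": {
    "text": "Cleaned text content...",
    "title": "Page title (if found)",
    "headings": ["H1", "H2", "H3"],
    "metadata": {"wordCount": 1234, "originalSize": "45KB", "cleanedSize": "12KB"}
  },
  "_input": {"html": "Original HTML..."}
}
"""


def size_in_kb(label: str) -> float:
    """Parse a size label such as "45KB" (assumes a KB suffix)."""
    if not label.upper().endswith("KB"):
        raise ValueError(f"unexpected size label: {label!r}")
    return float(label[:-2])


def size_reduction(node_output: dict) -> float:
    """Fraction of the original size removed by the cleaning step."""
    meta = node_output["out"]["metadata"]
    original = size_in_kb(meta["originalSize"])
    cleaned = size_in_kb(meta["cleanedSize"])
    return 1 - cleaned / original


result = json.loads(raw)
clean_text = result["out"]["text"]   # the field a downstream prompt would consume
reduction = size_reduction(result)
print(f"{reduction:.0%} smaller, {result['out']['metadata']['wordCount']} words")
# prints: 73% smaller, 1234 words
```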

Processing Options

Standard Mode (Default)

Removes all HTML tags
Preserves paragraph breaks
Maintains heading structure
Keeps list formatting

Markdown Mode

Converts HTML to Markdown format:

# Heading 1
## Heading 2

Paragraph text with **bold** and *italic*.

- List item 1
- List item 2

[Link text](url)
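
To make the conversion concrete, here is a minimal sketch of an HTML-to-Markdown pass built on Python's standard html.parser. It is an assumption about how such a mode could work, not the node's actual code; it covers only headings, paragraphs, bold, italic, unordered-style list items, and links, and it assumes scripts and styles have already been removed. MarkdownConverter and html_to_markdown are illustrative names.

```python
from html.parser import HTMLParser
import re

INLINE_MARKS = {"strong": "**", "b": "**", "em": "*", "i": "*"}
HEADING_LEVELS = {f"h{n}": n for n in range(1, 7)}


class MarkdownConverter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.href_stack = []  # hrefs of the currently open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self.out.append("\n\n" + "#" * HEADING_LEVELS[tag] + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")  # ordered lists are not numbered here
        elif tag in INLINE_MARKS:
            self.out.append(INLINE_MARKS[tag])
        elif tag == "a":
            self.href_stack.append(dict(attrs).get("href") or "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in INLINE_MARKS:
            self.out.append(INLINE_MARKS[tag])
        elif tag == "a" and self.href_stack:
            self.out.append(f"]({self.href_stack.pop()})")
        elif tag in HEADING_LEVELS or tag in ("p", "ul", "ol"):
            self.out.append("\n\n")

    def handle_data(self, data):
        # Collapse source whitespace; line breaks come from the tags.
        self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    md = "".join(converter.out)
    md = re.sub(r" *\n *", "\n", md)    # drop spaces around line breaks
    md = re.sub(r"\n{3,}", "\n\n", md)  # keep at most one blank line
    return md.strip()
```

Unlike the sample above, this sketch always separates block elements with a blank line, so consecutive headings are not placed on adjacent lines.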

Structured Mode

Outputs structured JSON with semantic sections:

{
  "title": "Page Title",
  "sections": [
    {
      "heading": "Introduction",
      "content": "Intro text..."
    },
    {
      "heading": "Main Content",
      "content": "Body text..."
    }
  ]
}
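
One plausible way to produce this shape is to treat the first h1 as the title and start a new section at every h2. The sketch below does exactly that with the standard library; the heading-level rule is an assumption made for illustration, and SectionBuilder and to_structured are hypothetical names.

```python
from html.parser import HTMLParser


class SectionBuilder(HTMLParser):
    """Groups text under the most recent heading."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = None
        self.sections = []
        self.current_heading = None  # "h1" or "h2" while inside a heading
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.current_heading = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.current_heading:
            heading = " ".join("".join(self.buffer).split())
            if tag == "h1" and self.title is None:
                self.title = heading  # first h1 becomes the page title
            else:
                self.sections.append({"heading": heading, "content": ""})
            self.current_heading = None

    def handle_data(self, data):
        if self.current_heading:
            self.buffer.append(data)
        elif self.sections:
            # Text that appears before the first section is ignored here.
            self.sections[-1]["content"] += data


def to_structured(html: str) -> dict:
    builder = SectionBuilder()
    builder.feed(html)
    builder.close()
    for section in builder.sections:
        section["content"] = " ".join(section["content"].split())
    return {"title": builder.title, "sections": builder.sections}
```

For example, to_structured("<h1>Page Title</h1><h2>Introduction</h2><p>Intro text...</p>") yields a dictionary with the title "Page Title" and one section headed "Introduction".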

Common Usage Patterns

Web Content Processing

Page Scraper (get HTML)
  → HTML Text Extractor (clean text)
  → Prompt (analyze clean content)
  → Dataset Sink

Multi-Page Analysis

Website Mapper (discover URLs)
  → Array Splitter
  → Page Scraper (get each page)
  → HTML Text Extractor (clean each)
  → Prompt (analyze each clean page)
  → Array Flatten
  → Prompt (summarize all pages)
  → Dataset Sink

Content Extraction Pipeline

Dataset Source (URLs)
  → Page Scraper (selector: "article")
  → HTML Text Extractor (markdown mode)
  → Prompt (extract key points)
  → Dataset Sink (structured data)

Cleaning Options

Remove Navigation

Automatically filters common navigation elements:

Header and footer menus
Sidebar navigation
Breadcrumbs
Social media buttons

Remove Boilerplate

Strips common non-content elements:

Cookie notices
Newsletter signups
Related articles widgets
Advertisement placeholders
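
The navigation and boilerplate filters described above could be approximated by skipping whole subtrees, chosen either by tag name or by keywords in the class and id attributes. The following is a sketch under those assumptions; the tag set and keyword list are guesses for illustration, not the rules the node applies, and a keyword match can produce false positives.

```python
from html.parser import HTMLParser

# Tag names treated as navigation. Note that <header> inside an article can
# hold the real title, so a production filter would need to be more careful.
NAV_TAGS = {"nav", "header", "footer", "aside"}
# Substrings of class/id values treated as boilerplate (hypothetical list).
BOILERPLATE_HINTS = ("cookie", "newsletter", "related", "advert",
                     "breadcrumb", "social")
# Void elements never have an end tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BoilerplateFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.kept = []     # text fragments that survive filtering
        self.skipped = []  # open tags inside the subtree being skipped

    def _is_boilerplate(self, tag, attrs):
        if tag in NAV_TAGS:
            return True
        attr_map = dict(attrs)
        label = f"{attr_map.get('class') or ''} {attr_map.get('id') or ''}".lower()
        return any(hint in label for hint in BOILERPLATE_HINTS)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skipped or self._is_boilerplate(tag, attrs):
            self.skipped.append(tag)

    def handle_endtag(self, tag):
        if tag in self.skipped:
            # Pop back to the matching open tag; this also discards
            # unclosed children such as <li> or <p>.
            while self.skipped.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.skipped and data.strip():
            self.kept.append(" ".join(data.split()))


def main_text(html: str) -> list:
    parser = BoilerplateFilter()
    parser.feed(html)
    parser.close()
    return parser.kept
```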

Smart Content Detection

Focuses on main content:

Identifies primary content area
Ignores sidebars and ancillary content
Preserves article structure
Maintains semantic hierarchy

Warning

HTML Text Extractor works best on well-structured HTML. Poorly formatted pages may produce suboptimal results.

Best Practices

Use After Page Scraper

✅ Good:
Page Scraper → HTML Text Extractor → Prompt

❌ Bad:
Page Scraper → Prompt (raw HTML wastes tokens)

Pre-filter with Selectors

Use Page Scraper selectors to get relevant sections first:

Page Scraper (selector: "article")
  → HTML Text Extractor
  → Cleaner, more focused text

Choose the Right Mode

Standard: General text processing
Markdown: When structure matters (documentation, articles)
Structured: When analyzing sections separately

Token Savings Example

Raw HTML: 4,500 tokens
After extraction: 1,200 tokens
Savings: 73%

Cost impact at $0.01/1K tokens:
  Raw: $0.045 per item
  Cleaned: $0.012 per item
  Savings: $0.033 per item × 1000 items = $33 saved
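
The arithmetic above can be reproduced with a small helper, which is convenient for plugging in your own token counts and prices. token_savings is an illustrative name; the inputs below are the figures from the example.

```python
def token_savings(raw_tokens, cleaned_tokens, price_per_1k, items):
    """Return (percent saved, cost saved per item, cost saved in total)."""
    saved_tokens = raw_tokens - cleaned_tokens
    percent = saved_tokens / raw_tokens * 100
    per_item = saved_tokens / 1000 * price_per_1k
    return percent, per_item, per_item * items


percent, per_item, total = token_savings(4500, 1200, 0.01, 1000)
print(f"{percent:.0f}% saved, ${per_item:.3f} per item, ${total:.2f} total")
# prints: 73% saved, $0.033 per item, $33.00 total
```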

Advanced Features

Language Detection

Automatically detects content language:

{
  "text": "Extracted content...",
  "language": "en",
  "confidence": 0.98
}

Metadata Extraction

Captures useful metadata:

{
  "metadata": {
    "title": "Page Title",
    "author": "John Doe",
    "publishDate": "2024-01-15",
    "description": "Meta description...",
    "keywords": ["ai", "automation"]
  }
}
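
Fields like these usually come from the title element and meta tags in the page head. The sketch below shows one way to collect them with the standard library. Which tags the node actually reads is not documented here, so the mapping in META_FIELDS (in particular the article:published_time key for the publish date) is an assumption.

```python
from html.parser import HTMLParser

# Which <meta> keys feed which output fields (assumed mapping).
META_FIELDS = {
    "author": "author",
    "description": "description",
    "article:published_time": "publishDate",
}


class MetadataParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.metadata = {}
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            return
        if tag != "meta":
            return
        attr_map = dict(attrs)
        key = (attr_map.get("name") or attr_map.get("property") or "").lower()
        content = (attr_map.get("content") or "").strip()
        if not content:
            return
        if key in META_FIELDS:
            self.metadata[META_FIELDS[key]] = content
        elif key == "keywords":
            # Keywords are conventionally a comma-separated list.
            self.metadata["keywords"] = [
                k.strip() for k in content.split(",") if k.strip()
            ]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    if "title" in parser.metadata:
        parser.metadata["title"] = parser.metadata["title"].strip()
    return parser.metadata
```

Fields that are absent from the page are simply omitted from the returned dictionary rather than filled with placeholders.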

Table Handling

Converts HTML tables to readable format:

Standard mode:
Column1  Column2  Column3
Value1   Value2   Value3

Markdown mode:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Value1  | Value2  | Value3  |
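
Both renderings can be derived from the same list of rows. The sketch below parses a simple table and pads each column to its widest cell; it assumes a rectangular table (no colspan or rowspan, no nested tables) with the header in the first row. TableParser, render_plain, and render_markdown are illustrative names.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects cell text into a list of rows."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self.cell = None  # text fragments while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.rows[-1].append(" ".join("".join(self.cell).split()))
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def parse_table(html: str) -> list:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def column_widths(rows):
    return [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]


def render_plain(rows) -> str:
    widths = column_widths(rows)
    return "\n".join(
        "  ".join(c.ljust(w) for c, w in zip(row, widths)).rstrip()
        for row in rows
    )


def render_markdown(rows) -> str:
    widths = column_widths(rows)

    def line(cells):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([line(rows[0]), separator] + [line(r) for r in rows[1:]])
```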

Handling Edge Cases

JavaScript-Heavy Sites

Content that is loaded dynamically by JavaScript may not be captured:

Use Page Scraper with rendering if needed
HTML Text Extractor works on whatever HTML is provided
Consider alternative scraping methods for SPAs

Malformed HTML

Handles broken HTML gracefully:

Attempts to parse and clean
Logs warnings for major issues
Returns best-effort text extraction

Empty Results

If the output is empty:

Check whether the Page Scraper selector was too specific
Verify that the HTML contains actual text content
Review the logs for parsing errors
Try standard mode if you are using a specialized mode

Performance Tips

Batch Processing

HTML Text Extractor is fast and efficient:

Processes pages in milliseconds
Safe to use with parallel execution
No API calls (runs locally)
No cost per extraction
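
Extraction is safe to parallelize because each page is cleaned independently, with no shared state. The sketch below illustrates that property outside the workflow engine; clean is a deliberately crude stand-in for the extraction step (an assumption, not the node's logic), and clean_many is a hypothetical helper. In CPython, threads mainly help when the surrounding work is I/O-bound, so treat this as a demonstration of order-preserving fan-out rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
import re


def clean(html: str) -> str:
    """Stand-in for the extraction step: crude tag stripping only."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def clean_many(pages, max_workers=4):
    # Executor.map returns results in input order, so each cleaned
    # text lines up with the page it came from.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean, pages))
```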

When to Skip

You might not need HTML Text Extractor if:

Page Scraper already returns clean text
You need to preserve specific HTML structure
You are processing non-HTML content

Tip

HTML Text Extractor has no API costs and runs in milliseconds. Use it liberally to clean content before sending it to expensive LLM calls.