Batch processing allows you to efficiently process hundreds or thousands of items through your flow using parallel execution, smart batching strategies, and progress tracking.

What is Batch Processing?

Batch processing is the technique of processing multiple data items together as a group rather than one at a time. This is essential for:

  • Large dataset processing (100s to 1000s of items)
  • Periodic data synchronization
  • Bulk content generation
  • Mass data enrichment

Processing Strategies

Sequential Processing

Process items one after another:

Item 1 → Complete → Item 2 → Complete → Item 3 → ...

Advantages:
  - Lower cost (single API call at a time)
  - Easier debugging
  - Predictable resource usage
  
Disadvantages:
  - Slowest option
  - Time = N × per-item-time
  
Best for:
  - Small batches (< 50 items)
  - Rate-limited APIs
  - Resource-constrained environments

Parallel Processing

Process multiple items simultaneously:

Item 1 ──┐
Item 2 ──┼─→ Process simultaneously
Item 3 ──┘

Advantages:
  - Much faster (speedup roughly tracks concurrency for I/O-bound work, often 10x+)
  - Better resource utilization
  - Ideal for I/O-bound operations
  
Disadvantages:
  - Higher concurrent API usage
  - More complex error handling
  - May hit rate limits
  
Best for:
  - Large batches (100+ items)
  - Independent items
  - Production workflows
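Conceptually, parallel processing is a bounded worker pool. A minimal Python sketch, where process_item is an illustrative stand-in for your flow's per-item work:

```python
from concurrent.futures import ThreadPoolExecutor

def process_item(item):
    # Stand-in for the real per-item work (LLM call, scrape, enrichment)
    return item * 2

def process_parallel(items, max_concurrency=10):
    # At most max_concurrency items are in flight at any moment;
    # pool.map preserves the input order of results.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(process_item, items))
```

Because the pool is bounded, raising max_concurrency trades higher throughput for more concurrent API usage, which is exactly the rate-limit trade-off described above.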

Hybrid (Chunked) Processing

Process in parallel batches:

Chunk 1 (10 items) → Process in parallel → Complete
Chunk 2 (10 items) → Process in parallel → Complete
Chunk 3 (10 items) → Process in parallel → Complete

Advantages:
  - Balance speed and control
  - Manage rate limits
  - Progressive results
  
Best for:
  - Very large datasets (1000+ items)
  - APIs with rate limits
  - When you need progress updates

Tip
Start with parallel processing of 10 items at a time. Adjust based on performance and rate limits.
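The chunked strategy boils down to slicing the input into fixed-size groups; a minimal sketch (in a real flow each chunk would itself run in parallel):

```python
def chunked(items, size):
    # Yield successive fixed-size slices; the last one may be shorter
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

With 25 items and a chunk size of 10 this yields groups of 10, 10, and 5, giving you a natural point between chunks to report progress or save a checkpoint.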

Configuring Batch Processing

Array Splitter Settings

Execution Mode:
  - Sequential: One at a time
  - Parallel: All items simultaneously
  - Chunked: Process N items at a time
  
Parallel Settings:
  Max Concurrency: 10 (default)
  Chunk Size: 10 items per batch
  
Error Handling:
  On Error: Skip and continue
  Max Error Rate: 10%
  Fail After: 50 consecutive errors
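A sketch of how a skip-and-continue policy with those thresholds could behave. The exact semantics in the product may differ; min_sample is an illustrative guard so one early failure does not trip the rate check:

```python
def process_with_error_policy(items, worker, max_error_rate=0.10,
                              fail_after=50, min_sample=20):
    # Skip failing items, but abort the whole batch when either
    # threshold from the settings above is crossed.
    results, failures = [], []
    consecutive_errors = 0
    for item in items:
        try:
            results.append(worker(item))
            consecutive_errors = 0
        except Exception as err:
            failures.append((item, err))
            consecutive_errors += 1
            if consecutive_errors >= fail_after:
                raise RuntimeError(f"aborted: {fail_after} consecutive errors")
            processed = len(results) + len(failures)
            if processed >= min_sample and len(failures) / processed > max_error_rate:
                raise RuntimeError("aborted: max error rate exceeded")
    return results, failures
```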

Performance Tuning

Small batch (< 50 items):
  Mode: Parallel
  Concurrency: 10
  
Medium batch (50-500 items):
  Mode: Chunked
  Chunk Size: 20
  Concurrency: 20
  
Large batch (500+ items):
  Mode: Chunked
  Chunk Size: 50
  Concurrency: 25
  + Enable checkpointing

Handling Rate Limits

Automatic Rate Limiting

Evaligo automatically manages API rate limits:

  • Detects 429 (Too Many Requests) errors
  • Implements exponential backoff
  • Queues requests when limit reached
  • Resumes automatically when available
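Evaligo handles this for you, but the underlying pattern is worth knowing. A sketch of exponential backoff, where RateLimitError is an illustrative stand-in for an HTTP 429 response:

```python
import time

class RateLimitError(Exception):
    """Illustrative stand-in for an HTTP 429 (Too Many Requests) response."""

def call_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    # Wait 1s, 2s, 4s, ... between attempts; re-raise after max_attempts
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Passing sleep as a parameter keeps the waiting strategy injectable, which makes the behavior easy to test without real delays.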

Manual Rate Control

Rate Limit Settings:
  Requests per minute: 60
  Delay between requests: 1000ms
  Burst allowance: 10
  
Example:
  Process 100 items with 60 req/min limit
  → Takes ~2 minutes
  → Automatically paced to stay under limit
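The arithmetic behind that example, as a sketch:

```python
def pacing_delay_ms(requests_per_minute):
    # Minimum gap between requests that stays under the budget
    return 60_000 / requests_per_minute

def estimated_minutes(total_items, requests_per_minute):
    # One request per item, evenly paced
    return total_items / requests_per_minute
```

100 items at 60 req/min gives a 1000 ms gap and roughly 1.7 minutes of wall time, which rounds up to the ~2 minutes quoted above once per-item latency is included.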

Provider-Specific Limits

OpenAI GPT-4:
  Tier 1: 500 RPM
  Tier 2: 3,500 RPM
  Tier 3: 10,000 RPM
  
Claude:
  Free: 50 RPM
  Pro: 1,000 RPM
  
Strategy:
  - Know your tier limits
  - Set concurrency accordingly
  - Monitor usage in dashboard

Warning
Exceeding rate limits will slow down your batch processing. Monitor the execution logs for rate limit warnings.

Progress Tracking

Real-Time Progress

Monitor batch execution:

{
  "totalItems": 500,
  "processed": 237,
  "successful": 231,
  "failed": 6,
  "remaining": 263,
  "progress": 47.4,
  "elapsedTime": "3m 42s",
  "estimatedRemaining": "4m 15s",
  "currentRate": "1.2 items/sec"
}
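Most of those fields derive from three raw counters plus elapsed time; a sketch:

```python
def progress_report(total, processed, failed, elapsed_sec):
    # Derive the report fields above from raw counters
    successful = processed - failed
    remaining = total - processed
    rate = processed / elapsed_sec if elapsed_sec else 0.0  # items/sec
    return {
        "totalItems": total,
        "processed": processed,
        "successful": successful,
        "failed": failed,
        "remaining": remaining,
        "progress": round(100 * processed / total, 1),
        "currentRate": round(rate, 2),
        "estimatedRemainingSec": round(remaining / rate) if rate else None,
    }
```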

Checkpointing

Save progress for large batches:

Every 100 items processed:
  → Save checkpoint
  → Mark completed item IDs
  
If flow fails:
  → Resume from last checkpoint
  → Skip already processed items
  → Continue with remaining
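A sketch of checkpoint-and-resume using a JSON file of completed item IDs (the storage format and function names are illustrative, not the platform's internals):

```python
import json
import os

def run_with_checkpoints(items, worker, checkpoint_path, every=100):
    # items is a list of (item_id, payload) pairs
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))  # resume: IDs finished in an earlier run
    results = {}
    for count, (item_id, payload) in enumerate(items, start=1):
        if item_id in done:
            continue  # skip already-processed items
        results[item_id] = worker(payload)
        done.add(item_id)
        if count % every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump(sorted(done), f)  # periodic checkpoint
    with open(checkpoint_path, "w") as f:
        json.dump(sorted(done), f)  # final checkpoint
    return results
```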

Best Practices

1. Test with Small Samples

Step 1: Test with 5 items (validate logic)
Step 2: Test with 50 items (check performance)
Step 3: Test with 500 items (verify scale)
Step 4: Run full batch (production)

2. Set Appropriate Timeouts

Fast operations (text processing): 10s per item
API calls (OpenAI): 30s per item
Web scraping: 60s per item
Complex chains: 120s per item

3. Handle Partial Failures

Strategy: Skip and Continue
  → 95/100 items succeed
  → 5 items fail (logged)
  → Flow completes
  → Review failed items
  → Reprocess if needed

4. Monitor Costs

Before running:
  Cost per item: $0.05
  Total items: 1,000
  Estimated cost: $50
  
During execution:
  Current spend: $23.50
  Items processed: 470
  Projected total: $50.00 ✓

Tip
Always estimate costs before running large batches. Use the cost calculator in the flow settings.
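The projection shown above is simple linear extrapolation; a sketch:

```python
def projected_total_cost(spend_so_far, items_processed, total_items):
    # Extrapolate from the average cost per processed item
    return spend_so_far / items_processed * total_items
```

$23.50 after 470 items projects to about $50 for the full 1,000, matching the dashboard figures above.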

Common Patterns

Dataset Processing

Dataset Source (1000 items)
  → Array Splitter (parallel: 20)
  → Process each item
  → Array Flatten
  → Dataset Sink (save results)

Incremental Processing

Dataset Source (filter: unprocessed)
  → Array Splitter (chunked: 50)
  → Process each
  → Mark as processed
  → Dataset Sink
  
Run daily to process new items only

Multi-Stage Batching

Stage 1: Fetch data (parallel: 50)
  → Array Flatten
Stage 2: Process data (parallel: 20)
  → Array Flatten
Stage 3: Save results (sequential)
  → Dataset Sink

Optimization Techniques

Caching

Reduce redundant API calls:

  • Cache website scraping results
  • Reuse identical prompt outputs
  • Store intermediate results
  • Can reduce costs by 30-70%
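A sketch of an in-memory cache wrapper that skips repeat work for identical inputs; this is an illustrative stand-in for the platform's caching, not its actual implementation:

```python
def cached(worker):
    # Wrap a worker so identical inputs are computed only once
    memo = {}
    hits = {"count": 0}  # illustrative hit counter

    def wrapper(key):
        if key in memo:
            hits["count"] += 1
            return memo[key]
        memo[key] = worker(key)
        return memo[key]

    wrapper.hits = hits
    return wrapper
```

The hit rate tells you how much redundancy your batch contains, and therefore how much of the quoted 30-70% savings you can realistically expect.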

Deduplication

Remove duplicate items before processing:

Dataset Source (1000 items)
  → Deduplicate (750 unique items)
  → Array Splitter
  → Process 750 instead of 1000
  → 25% cost savings
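Order-preserving deduplication is a few lines; a sketch, where the key parameter lets you deduplicate on a field such as a URL or email:

```python
def deduplicate(items, key=lambda item: item):
    # Keep the first occurrence of each key, preserving input order
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique
```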

Smart Ordering

Process items in optimal order:

  • Prioritize high-value items
  • Group similar items together
  • Process fast items first for quick wins

Error Recovery

Automatic Retry

Item fails due to timeout:
  Attempt 1: Failed (timeout)
  Wait 2s
  Attempt 2: Failed (timeout)
  Wait 5s
  Attempt 3: Success ✓
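The retry timeline above follows a fixed delay schedule; a sketch with sleep injectable for testing:

```python
import time

def retry_with_schedule(call, delays=(2, 5), sleep=time.sleep):
    # One initial attempt, then one retry after each delay in the schedule
    last_error = None
    for delay in (0,) + tuple(delays):
        if delay:
            sleep(delay)
        try:
            return call()
        except Exception as err:  # in practice, narrow this to timeout errors
            last_error = err
    raise last_error
```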

Manual Reprocessing

After batch completes:
  1. Export failed items list
  2. Fix underlying issues
  3. Create new dataset with failed items
  4. Reprocess just those items

Partial Results

Flow processes 800/1000 items then crashes
  → 800 results saved to dataset
  → Resume from item 801
  → Process remaining 200
  → Merge results

Monitoring and Alerts

Set Up Alerts

  • Error rate exceeds 5%
  • Execution time exceeds estimate by 50%
  • Cost exceeds budget
  • Flow fails or times out

Review Metrics

After each batch:

  • Success rate
  • Average time per item
  • Cost per item
  • Error patterns
  • Bottleneck nodes

Related Documentation

Array Splitter Node
Configure parallel processing
Error Handling
Handle failures gracefully
Dataset Nodes
Source and sink data