Measure HTML Token Bloat

Detect when boilerplate overwhelms useful content for LLM crawlers.

The Problem

LLM crawlers and RAG pipelines tokenize your HTML before extracting content. When navigation, footers, tracking scripts, and repeated UI components dominate the page, the signal-to-noise ratio collapses. Systems built on Claude, GPT, and Perplexity may truncate the page or hallucinate when the useful content is buried under 10× its weight in boilerplate tokens.

The Hard Way

To calculate token bloat manually, fetch the page, strip the boilerplate, and compare the token count of the useful content with the token count of the full HTML. You’d need a tokenizer (tiktoken or similar), a content extractor, and a baseline for what counts as “good”. For programmatic SEO (pSEO) at scale, you’d repeat this for every template.
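The manual steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not SEODiff's actual method. It assumes that token bloat means full-HTML tokens divided by useful-content tokens, that a fixed set of tags counts as boilerplate, and that a regex word/punctuation count is an acceptable stand-in for a real tokenizer such as tiktoken. The names ContentExtractor, count_tokens, and token_bloat are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags are treated as boilerplate in this sketch.
# A real content extractor would use far richer heuristics.
BOILERPLATE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of boilerplate tags currently open
        self.chunks = []  # useful text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth == 0 and text:
            self.chunks.append(text)


def count_tokens(text):
    """Rough token count: words plus individual punctuation marks.

    Assumption: this only approximates a model tokenizer. Swap in
    tiktoken or similar for counts that match a specific model.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def token_bloat(html):
    """Return full-HTML tokens divided by useful-content tokens."""
    extractor = ContentExtractor()
    extractor.feed(html)
    extractor.close()
    useful = count_tokens(" ".join(extractor.chunks))
    total = count_tokens(html)
    # A page with no extractable content is treated as infinitely bloated.
    return float("inf") if useful == 0 else total / useful
```

Calling token_bloat(html) on a fetched page and comparing the result with a threshold such as 8.0 gives a local approximation of the check. Absolute numbers will differ from any model-specific tokenizer, so the value is mainly useful for comparing templates against each other.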

The SEODiff Way

One API call. Results in under 2 seconds.

POST https://seodiff.io/api/v1/agent/evaluate

{"urls": ["https://example.com/blog/post-1"], "assertions": [{"rule": "max_token_bloat", "value": 8.0}]}
Parameter    Type     Example
value        float    8.0
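As a rough illustration, the request above could be assembled in Python with the standard library alone. The endpoint, rule name, and payload shape come from the example above. The Authorization header format and the JSON response handling are assumptions, so check the full API docs for the real authentication scheme and response schema. The function names build_evaluate_request and evaluate are hypothetical.

```python
import json
import urllib.request

ENDPOINT = "https://seodiff.io/api/v1/agent/evaluate"


def build_evaluate_request(urls, max_bloat, api_key):
    """Build (but do not send) a POST request with a max_token_bloat assertion."""
    payload = {
        "urls": list(urls),
        "assertions": [{"rule": "max_token_bloat", "value": float(max_bloat)}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth. The real header may differ.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def evaluate(request, timeout=10):
    """Send the request and parse the body.

    Assumption: the API responds with JSON. The response fields are
    not documented here, so the parsed value is returned untouched.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test without touching the network. A typical call would be evaluate(build_evaluate_request(["https://example.com/blog/post-1"], 8.0, api_key)).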

Code Examples

Copy-paste examples in your preferred language:

cURL

See the full evaluation example in cURL →

Python

See the full evaluation example in Python →

Node.js

See the full evaluation example in Node.js →

Go

See the full evaluation example in Go →

PHP

See the full evaluation example in PHP →

Related Assertions

min_word_count

Prevent thin content by requiring a minimum number of words per page.

max_js_ghost_ratio

Flag pages where content is rendered client-side and invisible to crawlers.

Use in CI/CD

Add this assertion to your deployment pipeline. Works with any CI platform:

🐙 GitHub Actions

Block bad deployments with automated SEO checks in your GitHub Actions CI/CD pipeline.

🦊 GitLab CI

Add automated SEO quality gates to your GitLab CI/CD pipelines.

▲ Vercel

Automatically validate SEO on every Vercel preview deployment before promoting to production.

Start testing in 30 seconds

Get an API key and run your first evaluation with a single cURL command.

Get an API key or read the full API docs.