What is Extractability?

Extractability measures whether AI systems can reliably pull the primary content from your pages.

Definition

Extractability is a derived score (0–100) that combines four signals (structure, schema, rendering, and bloat efficiency) to predict whether an AI system will successfully extract your primary content or instead get confused by noise, empty shells, or unstructured text.

How SEODiff computes it

Extractability is a weighted composite of four components:

Extractability = 0.30 × Structure + 0.25 × Schema + 0.25 × Rendering + 0.20 × BloatEfficiency

Where BloatEfficiency is derived from the Token Bloat Ratio:

BloatEfficiency = clamp((100 / TokenBloatRatio) × 5, 0, 100)

This means a page with 20× token bloat gets a BloatEfficiency of 25 (100 / 20 × 5 = 25), while a page with 5× bloat or less gets 100, because any value above 100 is clamped to 100.
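The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than SEODiff's actual implementation: the function and parameter names are hypothetical, the three component scores are assumed to already be on a 0–100 scale, and rejecting a non-positive Token Bloat Ratio is an assumption about how to handle a case the formula leaves undefined.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def bloat_efficiency(token_bloat_ratio: float) -> float:
    """BloatEfficiency = clamp((100 / TokenBloatRatio) * 5, 0, 100)."""
    # Assumption: the ratio must be positive; the source formula does not
    # say what happens at zero, so this sketch rejects it outright.
    if token_bloat_ratio <= 0:
        raise ValueError("token_bloat_ratio must be positive")
    return clamp(100 / token_bloat_ratio * 5, 0, 100)


def extractability(
    structure: float,
    schema: float,
    rendering: float,
    token_bloat_ratio: float,
) -> float:
    """Weighted composite of the four components (weights sum to 1.0)."""
    return (
        0.30 * structure
        + 0.25 * schema
        + 0.25 * rendering
        + 0.20 * bloat_efficiency(token_bloat_ratio)
    )


if __name__ == "__main__":
    # Hypothetical page: decent structure, weak schema, fully rendered,
    # 20x token bloat.
    print(extractability(structure=80, schema=60, rendering=100, token_bloat_ratio=20))
```

For the hypothetical page in the example, the weighted terms are 24 + 15 + 25 + 5, which gives a score of 69. The bloat term contributes only 5 of its possible 20 points, so reducing token bloat is the single largest available gain for that page.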

Why it matters

A page can be accessible (bots aren't blocked) but still have terrible extractability. Common scenarios:

Empty JavaScript shell — The page loads, returns 200, but the HTML body is just a <div id="root"></div>. Ghost ratio is high, extractability is near zero.
Buried content — The real content exists but is surrounded by 50KB of navigation, footer, inline JSON state, and ad markup. Token bloat is high, extractability suffers.
Flat text wall — The content is there but has no headings, no lists, no semantic structure. AI systems can't parse the information hierarchy.
No schema — Without JSON-LD, AI crawlers must guess what entities, products, or topics your page represents.

Score interpretation

80–100: AI systems can reliably extract your primary content.
60–79: Content is partially extractable but some signals are weak.
40–59: Significant extraction issues — AI may misrepresent your content.
Below 40: AI crawlers will likely fail to extract meaningful content.
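The bands above can be expressed as a small lookup, which may be handy when reporting scores in bulk. This is only a sketch: the function name and the short band labels are hypothetical, and treating fractional scores such as 79.5 as belonging to the lower band is an assumption, since the ranges above are stated for whole numbers only.

```python
def interpret_extractability(score: float) -> str:
    """Map an Extractability score (0-100) to a short band label."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Assumption: each band starts at its lower bound, so a fractional
    # score like 79.5 falls in the 60-79 band.
    if score >= 80:
        return "reliable"            # 80-100: content is reliably extractable
    if score >= 60:
        return "partial"             # 60-79: some signals are weak
    if score >= 40:
        return "significant issues"  # 40-59: AI may misrepresent content
    return "likely failure"          # below 40: extraction will likely fail


if __name__ == "__main__":
    for example in (92, 69, 45, 12):
        print(example, interpret_extractability(example))
```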
