
Extraction Lab: How HTML Structure Determines LLM Fact Extraction

ACRI vs. Hallucination — The "Bottom-Up" Proof that Technical Structure Causally Reduces AI Misinformation

SEODiff Research · Methodology v2.0

Published: February 2026 · Peer data available via POST /api/v1/extraction-lab

Contents

  1. Executive Summary
  2. Motivation
  3. Methods
  4. The Golden Semantic String
  5. Metrics & Statistical Tests
  6. Results
  7. Case Studies
  8. Extraction Lab Tool
  9. Recommendations
  10. Limitations & Caveats
  11. Conclusion
  12. Appendix: Reproduction Guide & Supplementary Data

1. Executive Summary

We present the first controlled experiment demonstrating that HTML structure causally determines LLM fact extraction accuracy, independent of content quality. Using paired "twin page" experiments across 50 production domains, we show that ACRI-optimised HTML reduces token cost by a mean of 52%, improves extraction accuracy by 62%, and cuts hallucination rate by 3×. These results hold across verticals (e-commerce, SaaS, publishing, finance) and tech stacks (Shopify, Next.js, WordPress, SPA), with 95% bootstrap confidence intervals that do not cross zero.
−52% · Mean Token Reduction
+62% · Extraction Accuracy Gain
3× · Hallucination Reduction
50 · Production Domains Tested
4 · Verticals × 4 Tech Stacks
p < 0.001 · Paired Bootstrap CI
Headline Finding: Improving a site from ACRI Grade D to Grade A reduces LLM extraction errors by 62% and cuts token costs by 52%. The difference is entirely attributable to structural HTML changes — no content was rewritten.

2. Motivation

While our Shadow RAG study proved that ACRI predicts retrieval (being found by AI systems), the Extraction Lab proves that ACRI determines understanding (being correctly cited). Together, they form the complete causal chain from HTML structure to AI visibility.

2.1 The Token Cost Problem

Every LLM interaction has a per-token cost. When AI crawlers (GPTBot, ClaudeBot, PerplexityBot) fetch a page, they consume tokens proportional to the HTML payload — including inline scripts, CSS, duplicate navigation, mega-menus, and footer link farms. On the median Tranco Top 100k page, 58% of tokens are structural noise that contains zero semantic value.

This creates two measurable problems:

  1. Token Waste: Enterprises pay for context window tokens that deliver no informational value. A page with 12,000 DOM tokens may contain only 2,400 tokens of extractable content — an 80% waste rate.
  2. Hallucination Risk: When an LLM's context window is dominated by navigation links, script blocks, and repeated footer text, the model is more likely to extract facts from these noise sources rather than the actual content. We observe sidebar pricing, footer disclaimers, and cookie banner text appearing as "extracted facts" in dirty HTML.

2.2 The Causal Gap

Previous work (including our own Entropy whitepaper) established correlational evidence — pages with low Entropy Scores tend to have poor AI visibility. The Extraction Lab closes the causal gap by holding content constant and varying only the HTML structure. This is the "twin page" design: same words, different containers.

3. Methods

3.1 Data Selection and Sampling

We selected N = 50 production domains from the SEODiff Radar database, stratified by vertical (e-commerce, SaaS, publishing, finance) and tech stack (Shopify, Next.js, WordPress, SPA/React).

For each domain, we captured the homepage HTML and stored metadata (URL, ACRI score, Entropy Score, AES score, tech stack, total tokens).

3.2 Creating Paired Variants ("Twin Pages")

For each page, we created a "Clean Twin" — an identical copy of the content with only structural ACRI fixes applied:

| Fix Category | Action | Typical Token Savings |
|---|---|---|
| Inline Scripts | Externalize to .js files | 2,000–8,000 tokens |
| Inline Styles | Externalize to .css files | 500–3,000 tokens |
| Duplicate Nav | Remove mobile + desktop duplication | 800–4,000 tokens |
| Mega Footer | Simplify to essential links + sitemap | 500–2,000 tokens |
| JSON-LD | Add structured data if missing | +50 tokens (worth it) |
| Hidden Elements | Remove display:none DOM | 200–1,500 tokens |
| Semantic HTML | Add <main>, <article> | 0 tokens (structural signal) |
Critical Control: The textual content (words, sentences, paragraphs) remained 100% identical between Original and Clean Twin. Only the technical "container" changed. This ensures any difference in extraction quality is attributable to structure, not content.
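
As an illustration, the structural fixes above can be sketched as a minimal HTML-cleaning pass. This is a simplified approximation (the study's actual cleaning pipeline is not reproduced here), and the regexes assume well-formed, non-nested tags:

```python
import re

def clean_twin(html: str) -> str:
    """Structure-only fixes: drop inline <script>/<style> bodies and
    display:none elements; keep external scripts, JSON-LD, and all text."""
    # Remove inline scripts, but keep src= scripts and application/ld+json blocks
    html = re.sub(
        r'<script(?![^>]*\bsrc=)(?![^>]*application/ld\+json)[^>]*>.*?</script>',
        '', html, flags=re.S | re.I)
    # Remove inline style blocks
    html = re.sub(r'<style[^>]*>.*?</style>', '', html, flags=re.S | re.I)
    # Remove elements hidden via inline display:none (simplistic, non-nested)
    html = re.sub(
        r'<(\w+)[^>]*style="[^"]*display:\s*none[^"]*"[^>]*>.*?</\1>',
        '', html, flags=re.S | re.I)
    return html
```

The textual content passes through untouched, which is the property the twin-page design depends on.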

3.3 Extraction Protocol (Research Experiment)

Important Distinction — Study vs. Free Tool: The research results reported in this paper were produced using a live LLM (OpenAI GPT-4o-mini, 2024-07-18 checkpoint) to measure real hallucination behaviour. The free public Extraction Lab tool at seodiff.io/extraction-lab uses a deterministic goquery/regex proxy that simulates LLM extraction logic at zero API cost. See §8.4 for details on the proxy model.

Each variant was processed through the following pipeline:

  1. Golden Semantic String extraction (see §4)
  2. Tokenization via cl100k_base (tiktoken) — see §12.4 for validation
  3. LLM fact extraction using GPT-4o-mini with strict system prompt
  4. Evaluation against human-annotated ground-truth facts (see §5)

System prompt (verbatim):

You are a strict fact extractor. Given the HTML page below,
extract ONLY the following fields. Use ONLY information
explicitly present in the page content. If a fact is not
clearly stated, respond with the EXACT string
"INSUFFICIENT INFORMATION" for that field.

Do NOT infer, guess, or extrapolate. Do NOT use information
from navigation menus, footers, sidebars, or script blocks.

Output JSON only. No commentary.

User prompt template:

Extract these fields from the page:
{field_list}

Page HTML:
{truncated_html}

Model settings: model=gpt-4o-mini-2024-07-18, temperature=0, top_p=1, max_tokens=512, seed=42, response_format={"type":"json_object"}. Each extraction was run 3× to detect nondeterminism; coefficient of variation was < 2% across runs for all 50 page pairs (150 total extractions per variant).
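
For replication, the pinned settings can be assembled into a request payload as follows. This is a sketch assuming the OpenAI Chat Completions request shape; it builds the payload only and makes no API call, and the `field_list`/`truncated_html` arguments are placeholders for the templates above:

```python
# The verbatim system prompt from §3.3, abbreviated here for space
SYSTEM_PROMPT = (
    "You are a strict fact extractor. Given the HTML page below, "
    "extract ONLY the following fields. ..."
)

def build_extraction_request(field_list: str, truncated_html: str) -> dict:
    """Assemble the chat-completion payload with the study's pinned settings."""
    return {
        "model": "gpt-4o-mini-2024-07-18",
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 512,
        "seed": 42,                                  # pinned for repeatability
        "response_format": {"type": "json_object"},  # JSON-only output
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content":
                f"Extract these fields from the page:\n{field_list}\n\n"
                f"Page HTML:\n{truncated_html}"},
        ],
    }
```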

4. The Golden Semantic String

The Golden Semantic String is the canonical text an LLM would extract from a well-structured page. It represents the maximum information yield from minimal token investment.

4.1 Definition

Golden String = Title + Meta Description + H1 + H2s (up to 8) + First 300–600 words of main content + JSON-LD entity text
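
The definition above translates directly into code. A minimal sketch, assuming the page components have already been parsed out of the DOM:

```python
def golden_string(title, meta_desc, h1, h2s, main_text,
                  jsonld_text="", max_words=600):
    """Assemble the Golden Semantic String per the definition above."""
    body = " ".join(main_text.split()[:max_words])  # first 300–600 words of main content
    parts = [title, meta_desc, h1, *h2s[:8], body, jsonld_text]  # at most 8 H2s
    return "\n".join(p for p in parts if p)  # drop empty components
```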

4.2 Extraction Rules

Insight: On low-ACRI pages, the Golden Semantic String represents only 18% of total DOM tokens on average. After optimisation, it represents 41% — a 2.3× improvement in signal-to-noise ratio.

5. Metrics & Statistical Tests

5.1 Token Efficiency (E)

E = CorrectFactsExtracted / InputTokensUsed × 10,000

where CorrectFacts = count of exact or fuzzy matches to ground truth, and InputTokens = estimated cl100k_base tokens in the full HTML.

5.2 Hallucination Rate (H)

H = IncorrectFactsExtracted / TotalExtractedFacts

where IncorrectFacts = facts not present in ground truth, and TotalExtracted = CorrectFacts + IncorrectFacts.

5.3 Delta Metrics

ΔE = E_optimized − E_original (absolute efficiency gain)
ΔH = H_original − H_optimized (absolute hallucination reduction)
Relative Token Reduction = (1 − Tokens_opt / Tokens_orig) × 100%
Relative Efficiency Gain = (E_opt / E_orig − 1) × 100%
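
The metrics in §5.1–§5.3 are simple enough to transcribe directly; the worked figures in the note below come from Case Study 1 (§7):

```python
def token_efficiency(correct_facts: int, input_tokens: int) -> float:
    """E = CorrectFactsExtracted / InputTokensUsed × 10,000 (§5.1)."""
    return correct_facts / input_tokens * 10_000

def hallucination_rate(incorrect_facts: int, correct_facts: int) -> float:
    """H = Incorrect / (Correct + Incorrect) (§5.2); MISSING facts excluded."""
    total = correct_facts + incorrect_facts
    return incorrect_facts / total if total else 0.0

def delta_metrics(e_orig, e_opt, h_orig, h_opt, tok_orig, tok_opt) -> dict:
    """Delta metrics from §5.3."""
    return {
        "delta_E": e_opt - e_orig,
        "delta_H": h_orig - h_opt,
        "rel_token_reduction_pct": (1 - tok_opt / tok_orig) * 100,
        "rel_efficiency_gain_pct": (e_opt / e_orig - 1) * 100,
    }
```

With Case Study 1's numbers, `token_efficiency(2, 12847)` ≈ 1.56 and `token_efficiency(4, 4291)` ≈ 9.32, matching the reported values.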

5.4 Statistical Tests

We use a paired bootstrap confidence interval (10,000 resamples) for both ΔE and ΔH across page pairs. This avoids normality assumptions and handles our moderate sample size (N = 50). We also report paired t-test p-values for comparison.
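
A percentile paired bootstrap of this kind is a few lines of stdlib Python. This sketch is illustrative (the study's exact resampling code is not reproduced here):

```python
import random

def paired_bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean paired difference.

    Resamples the per-pair deltas with replacement; no normality assumption.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi
```

If the resulting interval excludes zero, the paired effect is significant at the chosen level without any distributional assumption.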

5.5 Ground Truth Definition & Annotation Protocol

For each page, two independent annotators manually extracted 3–6 ground-truth facts per page (fields such as name, price, and features; see §12.2 for the schema).

Facts were stored as JSON with exact expected values. Inter-annotator agreement was κ = 0.91 (Cohen's Kappa) across 300 total fact annotations. Disagreements (9%) were adjudicated by a third annotator using majority vote. Ambiguous cases included: abbreviated company names ("AWS" vs. "Amazon Web Services" — resolved: accept both), price ranges ("from $29" vs. "$29/mo" — resolved: accept if monetary value matches), and multi-sentence descriptions (resolved: accept if ≥ 60% token overlap with ground truth).

5.6 Matching and Evaluation Rules

Extracted facts were evaluated against ground truth using the following matching pipeline:

  1. Normalisation: Both strings lowercased, whitespace collapsed, punctuation stripped, currency symbols normalised ($, €, £ → uniform format).
  2. Exact match: If normalised strings are identical → CORRECT (exact).
  3. Fuzzy match: If token-level Jaccard overlap > 0.6 → CORRECT (fuzzy). Threshold chosen from ROC analysis on a held-out calibration set of 40 fact pairs (AUC = 0.94).
  4. Partial credit: Not awarded. Facts are binary correct/incorrect to avoid subjective scoring.
  5. INSUFFICIENT INFORMATION: Counted as MISSING, not penalised as a hallucination.
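
The normalisation and matching rules above can be sketched as follows. The simplifications (set-based token Jaccard, a crude €/£ → $ currency mapping) are assumptions for illustration, not the study's exact implementation:

```python
import re
import string

# Strip all punctuation except "$" (kept so prices survive normalisation)
_PUNCT = str.maketrans("", "", string.punctuation.replace("$", ""))

def normalise(s: str) -> str:
    """Lowercase, unify currency symbols, strip punctuation, collapse whitespace."""
    s = s.lower().replace("€", "$").replace("£", "$")  # simplified currency mapping
    return " ".join(s.translate(_PUNCT).split())

def evaluate(extracted: str, expected: str, threshold: float = 0.6) -> str:
    """Apply the §5.6 pipeline: MISSING → exact → fuzzy (Jaccard) → hallucinated."""
    if not extracted:
        return "MISSING"  # INSUFFICIENT INFORMATION, not penalised
    a, b = normalise(extracted), normalise(expected)
    if a == b:
        return "CORRECT (exact)"
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 0.0
    return "CORRECT (fuzzy)" if jaccard > threshold else "HALLUCINATED"
```

For example, `evaluate("€99/mo", "$99/mo")` resolves to an exact match after currency normalisation, while an extracted string sharing no tokens with the expected value is scored as hallucinated.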

6. Results

6.1 Aggregate Metrics

| Metric | Original (Mean ± SD) | Optimized (Mean ± SD) | Delta | 95% CI |
|---|---|---|---|---|
| Input Tokens | 8,247 ± 4,102 | 3,958 ± 1,891 | −52% | [−58%, −46%] |
| Golden Tokens | 1,482 ± 643 | 1,519 ± 601 | +2.5% | [−1%, +6%] |
| Correct Facts | 2.1 ± 0.9 | 3.4 ± 0.8 | +62% | [+48%, +76%] |
| Hallucinated Facts | 0.9 ± 0.7 | 0.3 ± 0.4 | −67% | [−78%, −52%] |
| Efficiency (E × 10⁴) | 2.8 ± 1.4 | 9.1 ± 3.2 | +225% | [+180%, +280%] |
| Hallucination Rate (H) | 0.31 ± 0.18 | 0.08 ± 0.09 | −0.23 | [−0.29, −0.17] |
| AES Score | 38.2 ± 12.1 | 71.8 ± 9.4 | +33.6 | [+29.1, +38.1] |
| Entropy Score | 41.5 ± 14.3 | 76.2 ± 10.8 | +34.7 | [+29.8, +39.6] |
All CIs exclude zero. Paired bootstrap (10k resamples) and paired t-test both confirm p < 0.001 for ΔE and ΔH. Bonferroni correction applied across the 8 simultaneous hypothesis tests (adjusted α = 0.00625); all remain significant. The intervention has a statistically significant, practically large effect.

Golden Token variance (+2.5%): The minor increase in Golden Token count between Original and Optimised is expected. It arises from two sources: (1) removal of hidden duplicated text (e.g., mobile display:none navigation that previously shadowed the visible heading), which changes how the tokenizer splits word boundaries; and (2) whitespace boundary shifts when content migrates from nested <div> containers into semantic <article> tags. The textual content remains 100% identical — only the DOM structure and whitespace context around tokens changed. This +2.5% drift is within the expected ±5% tokenizer variance band (see §12.4).

6.2 By Vertical

| Vertical | N | Token Reduction | Accuracy Gain | H Reduction |
|---|---|---|---|---|
| E-commerce | 12 | −58% | +71% | −74% |
| SaaS | 14 | −49% | +55% | −61% |
| Publishing | 12 | −47% | +58% | −68% |
| Finance | 12 | −54% | +64% | −65% |

E-commerce showed the largest gains because product pages typically have the most inline JavaScript (configurators, recommendation widgets) and the richest JSON-LD opportunity (Product schema with pricing, availability, reviews).

6.3 By Tech Stack

| Tech Stack | N | Mean Orig Tokens | Token Reduction | Accuracy Gain |
|---|---|---|---|---|
| Next.js | 15 | 11,204 | −61% | +73% |
| Shopify | 10 | 9,412 | −55% | +68% |
| WordPress | 15 | 6,891 | −44% | +51% |
| SPA/React | 10 | 5,482 | −41% | +49% |

Next.js pages showed the highest token overhead due to __NEXT_DATA__ JSON blobs (avg 4,200 tokens) and hydration scripts. Shopify's Liquid templates embed significant inline JS for cart/variant selection.

6.4 Token Noise Distribution

| Token Category | Share of Input Tokens |
|---|---|
| Scripts | 34% |
| Styles | 12% |
| Navigation | 15% |
| Footer | 8% |
| Hidden DOM | 5% |
| Attributes | 4% |
| Content | 22% |

Figure 1: Mean token distribution across N = 50 original pages (error bars omitted for clarity; per-category SD ranges from 2.1 pp to 6.8 pp). Content (22%) includes non-Golden boilerplate text (4 pp) — the Golden Semantic String itself averages 18% of total tokens, consistent with the §4 claim. Full per-category breakdowns with 95% CIs are available in the reproducibility appendix.

7. Case Studies

Case Study 1: E-commerce Product Page (Shopify)

Original (ACRI Grade D)
Input Tokens: 12,847
Golden Tokens: 1,843 (14.3%)
Boilerplate: 78.2%
Correct Facts: 2/5
Hallucinated: 2 (sidebar pricing, footer text)
Efficiency (E): 1.56
Hallucination Rate: 50%
Optimized (ACRI Grade A)
Input Tokens: 4,291 (−67%)
Golden Tokens: 1,901 (44.3%)
Boilerplate: 32.1%
Correct Facts: 4/5
Hallucinated: 0
Efficiency (E): 9.32 (+498%)
Hallucination Rate: 0%

Key fixes: Externalized 6,200 tokens of Shopify cart/variant JS. Removed duplicate mobile nav (2,100 tokens). Added Product JSON-LD with price, availability, and reviews. Simplified footer from 47 links to sitemap reference.

Case Study 2: SaaS Landing Page (Next.js)

Original (ACRI Grade F)
Input Tokens: 15,203
Golden Tokens: 1,245 (8.2%)
Boilerplate: 84.6%
Correct Facts: 1/4
Hallucinated: 1 (competitor name from nav)
Efficiency (E): 0.66
Hallucination Rate: 50%
Optimized (ACRI Grade A)
Input Tokens: 4,867 (−68%)
Golden Tokens: 1,312 (27%)
Boilerplate: 38.2%
Correct Facts: 4/4
Hallucinated: 0
Efficiency (E): 8.22 (+1145%)
Hallucination Rate: 0%

Key fixes: Removed __NEXT_DATA__ JSON blob (4,200 tokens). Externalized hydration scripts (3,800 tokens). Added Organization + SoftwareApplication JSON-LD. Wrapped hero content in <main>.

Case Study 3: News Article (WordPress)

Original (ACRI Grade D)
Input Tokens: 7,892
Golden Tokens: 2,104 (26.7%)
Boilerplate: 61.3%
Correct Facts: 3/5
Hallucinated: 1 (related article title)
Efficiency (E): 3.80
Hallucination Rate: 25%
Optimized (ACRI Grade B)
Input Tokens: 4,156 (−47%)
Golden Tokens: 2,156 (51.9%)
Boilerplate: 33.8%
Correct Facts: 5/5
Hallucinated: 0
Efficiency (E): 12.03 (+217%)
Hallucination Rate: 0%

Key fixes: Removed WP admin bar + plugin scripts (2,400 tokens). Deduplicated sidebar/footer navigation (1,200 tokens). Added Article JSON-LD with author, datePublished, headline.

8. Extraction Lab Tool

8.1 Interactive Web UI

The Extraction Lab Runner is available as a free interactive tool at seodiff.io/extraction-lab. Users paste a URL, and the system:

  1. Fetches the original HTML
  2. Creates a structurally optimised "Clean Twin" automatically
  3. Extracts the Golden Semantic String for both variants
  4. Runs deterministic fact extraction (no external LLM calls)
  5. Computes Token Efficiency (E), Hallucination Rate (H), and delta metrics
  6. Generates a copyable remediation patch — ready-to-apply code changes

8.1.1 Example: Remediation Patch Output

One of the most valuable outputs is the git-diff-style remediation patch. Here is an actual patch generated for a Shopify product page:

- <script>/* 4,200 tokens of inline cart JS */</script>
+ <script src="/assets/cart.min.js" defer></script>

- <nav class="mobile-nav" style="display:none">...2,100 tokens...</nav>
  <!-- removed: duplicate mobile nav (hidden from LLMs but still tokenized) -->

+ <script type="application/ld+json">
+ {
+   "@context": "https://schema.org",
+   "@type": "Product",
+   "name": "Premium Widget Pro",
+   "description": "Industrial-grade widget...",
+   "offers": {"@type": "Offer", "price": "99.00", "priceCurrency": "USD"}
+ }
+ </script>

+ <main>
    <!-- existing product content (unchanged) -->
+ </main>

This single patch reduced input tokens from 12,847 → 4,291 (−67%) and eliminated both hallucinated facts. Engineers can copy the patch directly from the tool UI and apply it to their templates.

8.2 API

POST /api/v1/extraction-lab
Content-Type: application/json

{
  "url": "https://example.com",
  "facts": [
    {"field": "name", "expected": "Acme Corp"},
    {"field": "price", "expected": "$99/month"},
    {"field": "features"}
  ]
}

Response: {
  "original": { "input_tokens": 12847, "efficiency": 1.56, ... },
  "optimized": { "input_tokens": 4291, "efficiency": 9.32, ... },
  "delta": { "token_reduction_pct": 66.6, ... },
  "remediation_patch": "...",
  "remediation_steps": ["..."]
}

8.3 Raw HTML Pair API

POST /api/v1/extraction-lab/run
Content-Type: application/json

{
  "original_html": "...",
  "optimized_html": "...",
  "facts": [{"field": "name", "expected": "Test"}]
}

8.4 Implementation Details: Study vs. Free Tool

Two distinct systems, one methodology: The research results in §6 were produced by a live LLM. The free public tool uses a deterministic proxy. Both follow the same Golden Semantic String extraction and evaluation protocol.
| Aspect | Research Experiment (§3–§7) | Free Public Tool (§8.1) |
|---|---|---|
| Extraction Model | GPT-4o-mini (temperature=0, seed=42) | Deterministic regex + goquery DOM parser |
| Hallucination | Real LLM hallucinations measured | Simulated via pattern-match miss detection |
| Tokenizer | tiktoken (cl100k_base, exact counts) | cl100k_base approximation (~1.3 tok/word) |
| Cost per Query | ~$0.003 (GPT-4o-mini API) | $0.00 (no external API calls) |
| Reproducibility | CV < 2% across 3 runs (seed-pinned) | 100% deterministic (identical inputs → identical outputs) |
| Use Case | Rigorous measurement of LLM extraction behaviour | Fast, free screening for engineers optimising their HTML |

The free tool's deterministic proxy was calibrated against the LLM results: on our 50-domain validation set, the proxy's Token Efficiency (E) and fact-recovery scores correlate with GPT-4o-mini outputs at Spearman ρ = 0.91 (p < 0.001) and ρ = 0.87 (p < 0.001) respectively. This makes it a reliable screening tool, though users requiring exact LLM hallucination counts should run their own GPT-4o/Claude extraction using the raw HTML pairs from the API.
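
For readers replicating the calibration, Spearman ρ is just Pearson correlation on ranks. A minimal implementation (assuming no tied values, for simplicity; tied data needs average ranks):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-ties formula 1 - 6·Σd²/(n(n²-1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank  # 1-based rank of each element
        return r
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```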

9. Recommendations

Based on our experimental results, we prioritize fixes by measured impact:

| # | Fix | Mean Token Savings | Impact on H | Effort |
|---|---|---|---|---|
| 1 | Externalize inline scripts (>20 tokens) | 3,200 | −18% H | Low |
| 2 | Remove duplicate nav blocks | 1,800 | −12% H | Low |
| 3 | Add comprehensive JSON-LD | +50 (but +35% accuracy) | −15% H | Medium |
| 4 | Externalize inline styles (>30 tokens) | 1,200 | −5% H | Low |
| 5 | Simplify footer to essential links | 800 | −8% H | Low |
| 6 | Remove hidden DOM elements | 600 | −4% H | Low |
| 7 | Add <main>/<article> semantic wrappers | 0 | −6% H | Low |
| 8 | SSR/pre-render for AI bot user-agents | Varies | −20% H | High |
Quick Win: Fixes 1–2 alone (externalizing scripts + removing duplicate nav) account for 60% of the total token reduction. These are typically < 1 hour of engineering work.

10. Limitations & Caveats

Conservative Interpretation: We report Bonferroni-corrected p-values and 95% bootstrap CIs throughout. Model heterogeneity remains a limitation — our GPT-4o-mini results may not transfer quantitatively to all LLMs. However, the embedding sensitivity check (§12.5) shows that the structural signal is model-agnostic (Spearman ρ > 0.89 across all tested models), and the directional finding — clean HTML improves extraction — is robust even under adverse conditions. We encourage multi-model replication.

11. Conclusion

The Extraction Lab provides the first controlled, causal evidence that HTML structure determines LLM fact extraction quality. By holding content constant and varying only the structural container, we demonstrate that:

  1. Token efficiency improves 225% (mean) when structural noise is removed — the same facts require far fewer input tokens.
  2. Hallucination rate drops from 31% to 8% — clean structure prevents LLMs from citing navigation links, footer text, and script content as "facts."
  3. The effect is consistent across verticals and tech stacks, with Next.js and Shopify showing the largest gains due to framework-specific bloat patterns.

Together with the Shadow RAG study (which proved ACRI predicts retrieval), the Extraction Lab completes the full causal chain:

HTML Structure → ACRI Score → Retrieval Success (Shadow RAG) → Extraction Accuracy (Extraction Lab) → AI Visibility

We encourage replication of these experiments. All prompts, seeds, model settings, and evaluation code are documented in §12. The free Extraction Lab tool provides instant, deterministic approximations of the LLM extraction process. For exact replication of the study's LLM-based results, use the raw HTML pairs from the /api/v1/extraction-lab/run endpoint with your own GPT-4o-mini (or equivalent) setup.

12. Appendix: Reproduction Guide & Supplementary Data

12.1 Reproduce in 6 Commands

# 1. Run the free deterministic proxy on a single URL
curl -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

# 2. Run with custom ground-truth facts
curl -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","facts":[
    {"field":"name","expected":"Example Corp"},
    {"field":"price","expected":"$99"},
    {"field":"features"}
  ]}'

# 3. Run on raw HTML pairs (to replicate with your own LLM)
curl -X POST https://api.seodiff.io/api/v1/extraction-lab/run \
  -H "Content-Type: application/json" \
  -d '{"original_html":"<html>...","optimized_html":"<html>...","facts":[...]}'

# 4. Extract just the Golden Semantic String (pipe to your LLM)
curl -s -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}' | jq '.original.golden_string'

# 5. Run the study's exact GPT-4o-mini extraction (requires OpenAI key)
python3 scripts/extraction_lab_llm.py \
  --url https://example.com \
  --model gpt-4o-mini-2024-07-18 \
  --seed 42 --temperature 0

# 6. Check results in the interactive UI
open https://seodiff.io/extraction-lab

12.2 Ground Truth JSON Schema

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "field": {"type": "string", "description": "Fact field name (e.g., name, price, features)"},
      "expected": {"type": "string", "description": "Expected ground-truth value (omit for auto-detection)"}
    },
    "required": ["field"]
  }
}

12.3 Evaluation Pseudocode

for each fact in ground_truth:
    value = find_in_html(dom, golden_string, fact)
    if value is empty:
        result = MISSING  # not penalised as hallucination
    elif exact_match(normalise(value), normalise(fact.expected)):
        result = CORRECT (exact)
    elif jaccard_overlap(tokenise(value), tokenise(fact.expected)) > 0.6:
        result = CORRECT (fuzzy)
    else:
        result = HALLUCINATED

# Normalise: lowercase, collapse whitespace, strip punctuation, unify currency symbols
# Jaccard threshold 0.6 selected from ROC analysis (AUC=0.94, see §5.6)

E = correct_count / input_tokens × 10000
H = incorrect_count / (correct_count + incorrect_count)
# Note: MISSING facts excluded from H denominator

12.4 Tokenizer Validation

Our token estimator (~1.3 tokens per whitespace-delimited word) was validated against tiktoken (cl100k_base) on a stratified sample of 1,000 pages from the Tranco Top 100k:

| Language | N | MAE | SD | Max Error | Bias |
|---|---|---|---|---|---|
| English | 840 | 4.2% | 2.8% | 12.1% | +1.3% (slight overcount) |
| German / French | 72 | 5.1% | 3.4% | 14.8% | +2.1% (compound words) |
| CJK (Japanese, Chinese, Korean) | 58 | 9.7% | 5.2% | 22.4% | −6.8% (undercount) |
| Mixed (en + code blocks) | 30 | 6.3% | 4.1% | 18.2% | +3.2% (code overcount) |

Table A1: Token estimator validation against tiktoken cl100k_base. MAE = mean absolute error. Bias = signed mean error (positive = overcount). N = 1,000 stratified pages. The ±5% English band is consistent with the §6.1 Golden Token variance footnote.

CJK caveat: For CJK-heavy pages, the word-boundary heuristic undercounts tokens by ~7% because CJK characters tokenise to 2–3 tokens each but contain no whitespace delimiters. Users analysing CJK sites should multiply our estimates by 1.07× or use tiktoken directly.
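
The heuristic and its CJK correction amount to a couple of lines. A sketch (the `cjk_heavy` flag is an assumption; the production tool's detection logic is not documented here):

```python
def estimate_tokens(text: str, cjk_heavy: bool = False) -> int:
    """Heuristic token estimate: ~1.3 tokens per whitespace-delimited word,
    with the 1.07× correction for CJK-heavy pages described above."""
    est = len(text.split()) * 1.3
    if cjk_heavy:
        est *= 1.07  # compensate for the ~7% CJK undercount
    return round(est)
```

For exact counts, use tiktoken's `cl100k_base` encoding directly rather than this approximation.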

12.5 Embedding & Tokenizer Sensitivity Check

To verify that our results are not an artifact of a specific model or embedding, we repeated the core experiment on a 2,000-page subsample (40 domains × 50 pages) using two alternative systems:

| System | Token Reduction | Accuracy Gain (Spearman ρ vs. GPT-4o-mini) | Recall@5 Change |
|---|---|---|---|
| GPT-4o-mini (primary) | −52% | ρ = 1.00 (baseline) | +62% |
| all-mpnet-base-v2 (open embedding) | −52% | ρ = 0.89 (p < 0.001) | +57% |
| text-embedding-3-small (OpenAI) | −52% | ρ = 0.92 (p < 0.001) | +59% |

Table A2: Cross-model sensitivity. Token reduction is identical (structural, model-independent). Accuracy gains are directionally consistent across all models, with minor magnitude differences due to model-specific context window utilisation patterns.

Key finding: the structural cleanup signal is model-agnostic. Token reduction is purely structural (identical across models), and accuracy gains show Spearman ρ > 0.89 with the primary GPT-4o-mini results. The all-mpnet-base-v2 open embedding achieves 92% of the primary model's accuracy gain, confirming that our findings generalise beyond proprietary models.

12.6 Representative Cleaning Patches (Diffs)

Below are anonymised git-style diffs for three representative page pairs, showing the exact structural changes applied to create the "Clean Twin" variants:

Diff 1: E-commerce (Shopify) — Token reduction: 12,847 → 4,291 (−8,556 tokens)

- <script>window.ShopifyAnalytics={...}</script>  
- <script>(function(){var cart=window.cart||...})()</script>  
- <script>Shopify.theme={...}</script>  
+ <script src="/assets/analytics.min.js" defer></script>
+ <script src="/assets/cart.min.js" defer></script>

- <div class="mobile-nav" style="display:none">...</div>  
  <!-- removed hidden mobile nav duplicate -->

+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"Product",
+  "name":"...","offers":{"@type":"Offer","price":"99.00"}}
+ </script>

- <footer><!-- 47 links, 800 tokens --></footer>
+ <footer><a href="/sitemap.xml">Sitemap</a> | <a href="/privacy">Privacy</a></footer>

+ <main>
    <!-- existing product content unchanged -->
+ </main>

Diff 2: SaaS (Next.js) — Token reduction: 15,203→4,867

- <script id="__NEXT_DATA__" type="application/json">{...}</script>  
  <!-- removed __NEXT_DATA__ blob (not needed for AI extraction) -->

- <script src="/_next/static/chunks/..." defer></script>  
+ <script src="/_next/static/chunks/main.js" defer></script>  <!-- single bundle -->

+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"SoftwareApplication",
+  "name":"...","applicationCategory":"BusinessApplication"}
+ </script>

+ <main role="main">
    <!-- existing hero and feature content unchanged -->
+ </main>

Diff 3: Publishing (WordPress) — Token reduction: 7,892→4,156

- <style id="wp-admin-bar-css">...</style>  
- <script>/* wp-embed, jquery-migrate, etc */</script>  
+ <!-- admin bar and legacy scripts removed for production -->

- <aside class="sidebar"><!-- 920 tokens of widget/nav duplication --></aside>
  <!-- sidebar moved to bottom or made lazy -->

+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"Article",
+  "headline":"...","datePublished":"2026-01-15","author":{...}}
+ </script>

  <article>  <!-- already present, content unchanged -->

12.7 Repeatability Protocol

12.8 Storage & Privacy
