ACRI vs. Hallucination — The "Bottom-Up" Proof that Technical Structure Causally Reduces AI Misinformation
Published: February 2026 · Peer data available via POST /api/v1/extraction-lab · Interactive Tool
While our Shadow RAG study proved that ACRI predicts retrieval (being found by AI systems), the Extraction Lab proves that ACRI determines understanding (being correctly cited). Together, they form the complete causal chain from HTML structure to AI visibility.
Every LLM interaction has a per-token cost. When AI crawlers (GPTBot, ClaudeBot, PerplexityBot) fetch a page, they consume tokens proportional to the HTML payload — including inline scripts, CSS, duplicate navigation, mega-menus, and footer link farms. On the median Tranco Top 100k page, 58% of tokens are structural noise that contains zero semantic value.
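To make the noise share concrete, here is a back-of-envelope sketch using the same ~1.3 tokens-per-word heuristic as the free tool (§8.2). The regex stripper and sample page are illustrative only — the production pipeline uses a real DOM parser, not regexes:

```python
import re

TOKENS_PER_WORD = 1.3  # rough cl100k_base heuristic (see Appendix, Table A1)

# Illustrative noise-element list; these tags carry structure, not content.
NOISE_TAGS = ["script", "style", "svg", "nav", "footer", "header", "noscript", "iframe"]

def estimate_tokens(text: str) -> int:
    """Approximate LLM token count from whitespace-delimited words."""
    return round(len(text.split()) * TOKENS_PER_WORD)

def strip_noise(html: str) -> str:
    """Remove noise elements, then drop remaining markup, keeping visible text."""
    for tag in NOISE_TAGS:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", " ", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

html = """<html><head><style>body{margin:0}</style></head>
<body><nav><a href=/>Home</a><a href=/about>About</a></nav>
<main><h1>Acme Corp</h1><p>Premium widgets from $99/month.</p></main>
<script>window.analytics = {loaded: true};</script></body></html>"""

total = estimate_tokens(html)
content = estimate_tokens(strip_noise(html))
noise_share = 1 - content / total
print(f"total≈{total} tokens, content≈{content}, structural noise≈{noise_share:.0%}")
```

Even on this tiny toy page, over half the estimated tokens are markup and scripts rather than extractable content.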
This creates two measurable problems:

1. Cost: every structural token is paid for on every crawl, dragging down Token Efficiency (E).
2. Accuracy: noise tokens dilute the signal the model reasons over, inflating the Hallucination Rate (H).
Previous work (including our own Entropy whitepaper) established correlational evidence — pages with low Entropy Scores tend to have poor AI visibility. The Extraction Lab closes the causal gap by holding content constant and varying only the HTML structure. This is the "twin page" design: same words, different containers.
We selected N = 50 production domains from the SEODiff Radar database, stratified by:
For each domain, we captured the homepage HTML and stored metadata (URL, ACRI score, Entropy Score, AES score, tech stack, total tokens).
For each page, we created a "Clean Twin" — an identical copy of the content with only structural ACRI fixes applied:
| Fix Category | Action | Typical Token Savings |
|---|---|---|
| Inline Scripts | Externalize to .js files | 2,000–8,000 tokens |
| Inline Styles | Externalize to .css files | 500–3,000 tokens |
| Duplicate Nav | Remove mobile + desktop duplication | 800–4,000 tokens |
| Mega Footer | Simplify to essential links + sitemap | 500–2,000 tokens |
| JSON-LD | Add structured data if missing | +50 tokens (worth it) |
| Hidden Elements | Remove display:none DOM | 200–1,500 tokens |
| Semantic HTML | Add <main>, <article> | 0 tokens (structural signal) |
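Two of the fix categories above — externalizing large inline scripts and dropping hidden duplicate nav — can be sketched as a regex pass. The production pipeline uses a real DOM parser; the output file name and the 20-token threshold (fix #1 in §9) are illustrative:

```python
import re

def externalize_inline_scripts(html: str, min_tokens: int = 20) -> str:
    """Replace large inline <script> bodies with an external reference.
    min_tokens mirrors the >20-token threshold from the remediation checklist."""
    def repl(m: re.Match) -> str:
        body = m.group(1)
        if len(body.split()) * 1.3 >= min_tokens:  # ~1.3 tokens/word heuristic
            return '<script src="/assets/extracted.min.js" defer></script>'
        return m.group(0)  # keep tiny scripts as-is
    return re.sub(r"<script(?![^>]*\bsrc=)[^>]*>(.*?)</script>", repl,
                  html, flags=re.S | re.I)

def drop_hidden_nav(html: str) -> str:
    """Remove display:none nav blocks (tokenized by crawlers, never rendered)."""
    return re.sub(r"<nav[^>]*display:\s*none[^>]*>.*?</nav>", "",
                  html, flags=re.S | re.I)

page = ('<nav style="display:none"><a href=/>Home</a></nav>'
        '<script>' + 'var x=1; ' * 30 + '</script>'
        '<main><p>Premium widgets from $99/month.</p></main>')

clean = drop_hidden_nav(externalize_inline_scripts(page))
print(clean)
```

The visible content survives untouched; only the structural container changes — exactly the "twin page" invariant.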
Each variant was processed through the following pipeline:
System prompt (verbatim):
```text
You are a strict fact extractor. Given the HTML page below, extract ONLY the following fields. Use ONLY information explicitly present in the page content. If a fact is not clearly stated, respond with the EXACT string "INSUFFICIENT INFORMATION" for that field. Do NOT infer, guess, or extrapolate. Do NOT use information from navigation menus, footers, sidebars, or script blocks. Output JSON only. No commentary.
```
User prompt template:
```text
Extract these fields from the page:
{field_list}

Page HTML:
{truncated_html}
```
Model settings: model=gpt-4o-mini-2024-07-18, temperature=0, top_p=1, max_tokens=512, seed=42, response_format={"type":"json_object"}. Each extraction was run 3× to detect nondeterminism; coefficient of variation was < 2% across runs for all 50 page pairs (150 total extractions per variant).
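For replication, the per-run request can be assembled as below. The payload shape follows the OpenAI chat-completions API; `max_html_chars` is an illustrative truncation bound, not the study's exact limit:

```python
import json

SYSTEM_PROMPT = (
    "You are a strict fact extractor. Given the HTML page below, extract ONLY "
    "the following fields. Use ONLY information explicitly present in the page "
    "content. If a fact is not clearly stated, respond with the EXACT string "
    '"INSUFFICIENT INFORMATION" for that field. Do NOT infer, guess, or '
    "extrapolate. Do NOT use information from navigation menus, footers, "
    "sidebars, or script blocks. Output JSON only. No commentary."
)

def build_extraction_request(fields, html, max_html_chars=60_000):
    """Assemble the seed-pinned request used for each of the 3 runs."""
    user_prompt = (
        "Extract these fields from the page:\n"
        + "\n".join(f"- {f}" for f in fields)
        + "\n\nPage HTML:\n"
        + html[:max_html_chars]  # illustrative truncation bound
    )
    return {
        "model": "gpt-4o-mini-2024-07-18",
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 512,
        "seed": 42,
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }

req = build_extraction_request(["name", "price"], "<html>...</html>")
print(json.dumps(req)[:80])
```

Pinning `temperature=0` and `seed=42` is what keeps the 3× repeat runs within the reported <2% coefficient of variation.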
The Golden Semantic String is the canonical text an LLM would extract from a well-structured page. It represents the maximum information yield from minimal token investment.
- Content root: `<main>` or `<article>`; fallback to `<body>` with noise removal.
- Noise elements removed: `<script>`, `<style>`, `<svg>`, `<nav>`, `<footer>`, `<header>`, `<noscript>`, `<iframe>`.

We use a paired bootstrap confidence interval (10,000 resamples) for both ΔE and ΔH across page pairs. This avoids normality assumptions and handles our moderate sample size (N = 50). We also report paired t-test p-values for comparison.
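A minimal sketch of the paired bootstrap on toy per-page ΔH values (the toy data are illustrative, not the study's):

```python
import random

def paired_bootstrap_ci(orig, opt, n_boot=10_000, alpha=0.05, seed=42):
    """95% CI for the mean paired delta (optimized - original),
    resampling page-pair deltas with replacement."""
    rng = random.Random(seed)
    deltas = [b - a for a, b in zip(orig, opt)]
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: hallucination rate H per page, before/after structural cleanup
h_orig = [0.4, 0.3, 0.5, 0.2, 0.35, 0.45, 0.25, 0.3, 0.4, 0.5]
h_opt  = [0.1, 0.0, 0.2, 0.1, 0.05, 0.15, 0.0, 0.1, 0.1, 0.2]
lo, hi = paired_bootstrap_ci(h_orig, h_opt)
print(f"ΔH 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Because resampling happens at the page-pair level, the interval respects the pairing and makes no normality assumption.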
For each page, two independent annotators manually extracted 3–6 ground-truth facts:
Facts were stored as JSON with exact expected values. Inter-annotator agreement was κ = 0.91 (Cohen's Kappa) across 300 total fact annotations. Disagreements (9%) were resolved by a third annotator, with the final label decided by 2-of-3 majority vote. Ambiguous cases included: abbreviated company names ("AWS" vs. "Amazon Web Services" — resolved: accept both), price ranges ("from $29" vs. "$29/mo" — resolved: accept if monetary value matches), and multi-sentence descriptions (resolved: accept if ≥ 60% token overlap with ground truth).
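For reference, Cohen's Kappa on toy annotator labels takes only a few lines of stdlib Python (the labels below are illustrative, not our annotation data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy labels: did each annotator accept the extracted fact?
ann1 = ["ok"] * 9 + ["reject"]
ann2 = ["ok"] * 8 + ["reject", "reject"]
print(round(cohens_kappa(ann1, ann2), 3))
```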
Extracted facts were evaluated against ground truth using a two-stage matching pipeline:
- Exact match after normalisation → CORRECT (exact).
- Jaccard token overlap > 0.6 → CORRECT (fuzzy). Threshold chosen from ROC analysis on a held-out calibration set of 40 fact pairs (AUC = 0.94).
- Field absent or "INSUFFICIENT INFORMATION" → MISSING, not penalised as a hallucination.

| Metric | Original (Mean ± SD) | Optimized (Mean ± SD) | Delta | 95% CI |
|---|---|---|---|---|
| Input Tokens | 8,247 ± 4,102 | 3,958 ± 1,891 | −52% | [−58%, −46%] |
| Golden Tokens† | 1,482 ± 643 | 1,519 ± 601 | +2.5% | [−1%, +6%] |
| Correct Facts | 2.1 ± 0.9 | 3.4 ± 0.8 | +62% | [+48%, +76%] |
| Hallucinated Facts | 0.9 ± 0.7 | 0.3 ± 0.4 | −67% | [−78%, −52%] |
| Efficiency (E × 10⁴) | 2.8 ± 1.4 | 9.1 ± 3.2 | +225% | [+180%, +280%] |
| Hallucination Rate (H) | 0.31 ± 0.18 | 0.08 ± 0.09 | −0.23 | [−0.29, −0.17] |
| AES Score | 38.2 ± 12.1 | 71.8 ± 9.4 | +33.6 | [+29.1, +38.1] |
| Entropy Score | 41.5 ± 14.3 | 76.2 ± 10.8 | +34.7 | [+29.8, +39.6] |
† Golden Token variance (+2.5%): The minor increase in Golden Token count between Original and Optimised is expected. It arises from two sources: (1) removal of hidden duplicated text (e.g., mobile display:none navigation that previously shadowed the visible heading), which changes how the tokenizer splits word boundaries; and (2) whitespace boundary shifts when content migrates from nested <div> containers into semantic <article> tags. The textual content remains 100% identical — only the DOM structure and whitespace context around tokens changed. This +2.5% drift is within the expected ±5% tokenizer variance band (see §12.4).
| Vertical | N | Token Reduction | Accuracy Gain | H Reduction |
|---|---|---|---|---|
| E-commerce | 12 | −58% | +71% | −74% |
| SaaS | 14 | −49% | +55% | −61% |
| Publishing | 12 | −47% | +58% | −68% |
| Finance | 12 | −54% | +64% | −65% |
E-commerce showed the largest gains because product pages typically have the most inline JavaScript (configurators, recommendation widgets) and the richest JSON-LD opportunity (Product schema with pricing, availability, reviews).
| Tech Stack | N | Mean Orig Tokens | Token Reduction | Accuracy Gain |
|---|---|---|---|---|
| Next.js | 15 | 11,204 | −61% | +73% |
| Shopify | 10 | 9,412 | −55% | +68% |
| WordPress | 15 | 6,891 | −44% | +51% |
| SPA/React | 10 | 5,482 | −41% | +49% |
Next.js pages showed the highest token overhead due to __NEXT_DATA__ JSON blobs (avg 4,200 tokens) and hydration scripts. Shopify's Liquid templates embed significant inline JS for cart/variant selection.
Figure 1: Mean token distribution across N = 50 original pages (error bars omitted for clarity; per-category SD ranges from 2.1 pp to 6.8 pp). Content (22%) includes non-Golden boilerplate text (4 pp) — the Golden Semantic String itself averages 18% of total tokens, consistent with the §4 claim. Full per-category breakdowns with 95% CIs are available in the reproducibility appendix.
Key fixes: Externalized 6,200 tokens of Shopify cart/variant JS. Removed duplicate mobile nav (2,100 tokens). Added Product JSON-LD with price, availability, and reviews. Simplified footer from 47 links to sitemap reference.
Key fixes: Removed __NEXT_DATA__ JSON blob (4,200 tokens). Externalized hydration scripts (3,800 tokens). Added Organization + SoftwareApplication JSON-LD. Wrapped hero content in <main>.
Key fixes: Removed WP admin bar + plugin scripts (2,400 tokens). Deduplicated sidebar/footer navigation (1,200 tokens). Added Article JSON-LD with author, datePublished, headline.
The Extraction Lab Runner is available as a free interactive tool at seodiff.io/extraction-lab. Users paste a URL, and the system:
One of the most valuable outputs is the git-diff-style remediation patch. Here is an actual patch generated for a Shopify product page:
```diff
- <script>/* 4,200 tokens of inline cart JS */</script>
+ <script src="/assets/cart.min.js" defer></script>
- <nav class="mobile-nav" style="display:none">...2,100 tokens...</nav>
  <!-- removed: duplicate mobile nav (hidden from users but still tokenized) -->
+ <script type="application/ld+json">
+ {
+   "@context": "https://schema.org",
+   "@type": "Product",
+   "name": "Premium Widget Pro",
+   "description": "Industrial-grade widget...",
+   "offers": {"@type": "Offer", "price": "99.00", "priceCurrency": "USD"}
+ }
+ </script>
+ <main>
    <!-- existing product content (unchanged) -->
+ </main>
```
This single patch reduced input tokens from 12,847 → 4,291 (−67%) and eliminated both hallucinated facts. Engineers can copy the patch directly from the tool UI and apply it to their templates.
```http
POST /api/v1/extraction-lab
Content-Type: application/json

{
  "url": "https://example.com",
  "facts": [
    {"field": "name", "expected": "Acme Corp"},
    {"field": "price", "expected": "$99/month"},
    {"field": "features"}
  ]
}
```

Response:

```json
{
  "original": { "input_tokens": 12847, "efficiency": 1.56, ... },
  "optimized": { "input_tokens": 4291, "efficiency": 9.32, ... },
  "delta": { "token_reduction_pct": 66.6, ... },
  "remediation_patch": "...",
  "remediation_steps": ["..."]
}
```
```http
POST /api/v1/extraction-lab/run
Content-Type: application/json

{
  "original_html": "...",
  "optimized_html": "...",
  "facts": [{"field": "name", "expected": "Test"}]
}
```
| Aspect | Research Experiment (§3–§7) | Free Public Tool (§8.1) |
|---|---|---|
| Extraction Model | GPT-4o-mini (temperature=0, seed=42) | Deterministic regex + goquery DOM parser |
| Hallucination | Real LLM hallucinations measured | Simulated via pattern-match miss detection |
| Tokenizer | tiktoken (cl100k_base, exact counts) | cl100k_base approximation (~1.3 tok/word) |
| Cost per Query | ~$0.003 (GPT-4o-mini API) | $0.00 (no external API calls) |
| Reproducibility | CV < 2% across 3 runs (seed-pinned) | 100% deterministic (identical inputs → identical outputs) |
| Use Case | Rigorous measurement of LLM extraction behaviour | Fast, free screening for engineers optimising their HTML |
The free tool's deterministic proxy was calibrated against the LLM results: on our 50-domain validation set, the proxy's Token Efficiency (E) and fact-recovery scores correlate with GPT-4o-mini outputs at Spearman ρ = 0.91 (p < 0.001) and ρ = 0.87 (p < 0.001) respectively. This makes it a reliable screening tool, though users requiring exact LLM hallucination counts should run their own GPT-4o/Claude extraction using the raw HTML pairs from the API.
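Spearman ρ is simply the Pearson correlation of ranks; a self-contained sketch on toy scores (not our dataset) shows the calibration computation:

```python
def ranks(xs):
    """Assign ranks (1-based), averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy: proxy efficiency scores vs. LLM-measured efficiency per domain
proxy = [2.1, 4.0, 3.2, 9.1, 5.5, 7.3]
llm   = [2.4, 3.8, 3.5, 8.7, 6.9, 5.9]
print(round(spearman_rho(proxy, llm), 3))
```

Rank correlation is the right calibration metric here because the proxy only needs to order pages correctly, not reproduce the LLM's absolute scores.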
Based on our experimental results, we prioritize fixes by measured impact:
| # | Fix | Mean Token Savings | Impact on H | Effort |
|---|---|---|---|---|
| 1 | Externalize inline scripts (>20 tokens) | 3,200 | −18% H | Low |
| 2 | Remove duplicate nav blocks | 1,800 | −12% H | Low |
| 3 | Add comprehensive JSON-LD | +50 (but +35% accuracy) | −15% H | Medium |
| 4 | Externalize inline styles (>30 tokens) | 1,200 | −5% H | Low |
| 5 | Simplify footer to essential links | 800 | −8% H | Low |
| 6 | Remove hidden DOM elements | 600 | −4% H | Low |
| 7 | Add <main>/<article> semantic wrappers | 0 | −6% H | Low |
| 8 | SSR/pre-render for AI bot user-agents | Varies | −20% H | High |
The Extraction Lab provides the first controlled, causal evidence that HTML structure determines LLM fact extraction quality. By holding content constant and varying only the structural container, we demonstrate that:
Together with the Shadow RAG study (which proved ACRI predicts retrieval), the Extraction Lab completes the full causal chain:
We encourage replication of these experiments. All prompts, seeds, model settings, and evaluation code are documented in §12. The free Extraction Lab tool provides instant, deterministic approximations of the LLM extraction process. For exact replication of the study's LLM-based results, use the raw HTML pairs from the /api/v1/extraction-lab/run endpoint with your own GPT-4o-mini (or equivalent) setup.
```bash
# 1. Run the free deterministic proxy on a single URL
curl -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

# 2. Run with custom ground-truth facts
curl -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","facts":[
        {"field":"name","expected":"Example Corp"},
        {"field":"price","expected":"$99"},
        {"field":"features"}
      ]}'

# 3. Run on raw HTML pairs (to replicate with your own LLM)
curl -X POST https://api.seodiff.io/api/v1/extraction-lab/run \
  -H "Content-Type: application/json" \
  -d '{"original_html":"<html>...","optimized_html":"<html>...","facts":[...]}'

# 4. Extract just the Golden Semantic String (pipe to your LLM)
curl -s -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}' | jq '.original.golden_string'

# 5. Run the study's exact GPT-4o-mini extraction (requires OpenAI key)
python3 scripts/extraction_lab_llm.py \
  --url https://example.com \
  --model gpt-4o-mini-2024-07-18 \
  --seed 42 --temperature 0

# 6. Check results in the interactive UI
open https://seodiff.io/extraction-lab
```
```json
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "field": {"type": "string", "description": "Fact field name (e.g., name, price, features)"},
      "expected": {"type": "string", "description": "Expected ground-truth value (omit for auto-detection)"}
    },
    "required": ["field"]
  }
}
```
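Clients can pre-validate a facts payload against this schema before calling the endpoint. The stdlib check below is a sketch, not the server-side validator:

```python
def validate_facts(facts):
    """Check a facts array against the schema: each item must be an object
    with a required string 'field' and an optional string 'expected'."""
    if not isinstance(facts, list):
        return False
    for item in facts:
        if not isinstance(item, dict) or "field" not in item:
            return False
        if not isinstance(item["field"], str):
            return False
        if "expected" in item and not isinstance(item["expected"], str):
            return False
    return True

print(validate_facts([{"field": "name", "expected": "Acme Corp"},
                      {"field": "features"}]))   # schema-valid payload
print(validate_facts([{"expected": "missing field"}]))  # missing required 'field'
```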
```text
for each fact in ground_truth:
    value = find_in_html(dom, golden_string, fact)
    if value is empty:
        result = MISSING                 # not penalised as hallucination
    elif exact_match(normalise(value), normalise(fact.expected)):
        result = CORRECT (exact)
    elif jaccard_overlap(tokenise(value), tokenise(fact.expected)) > 0.6:
        result = CORRECT (fuzzy)
    else:
        result = HALLUCINATED

# Normalise: lowercase, collapse whitespace, strip punctuation, unify currency symbols
# Jaccard threshold 0.6 selected from ROC analysis (AUC = 0.94, see §5.6)

E = correct_count / input_tokens × 10000
H = incorrect_count / (correct_count + incorrect_count)
# Note: MISSING facts excluded from H denominator
```
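The helpers referenced above (`normalise`, `tokenise`, `jaccard_overlap`) can be sketched in Python; the currency-unification map and punctuation handling are illustrative simplifications of the pipeline's rules:

```python
import re

def normalise(s: str) -> str:
    """Lowercase, unify currency symbols, strip punctuation, collapse whitespace."""
    s = s.lower().replace("€", "$").replace("£", "$")  # illustrative unification
    s = re.sub(r"[^\w\s$]", " ", s)                    # punctuation -> space
    return re.sub(r"\s+", " ", s).strip()

def tokenise(s: str) -> set:
    return set(normalise(s).split())

def jaccard_overlap(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def match_fact(value: str, expected: str) -> str:
    """Two-stage matcher mirroring the pseudocode above."""
    if not value or value == "INSUFFICIENT INFORMATION":
        return "MISSING"            # not penalised as hallucination
    if normalise(value) == normalise(expected):
        return "CORRECT (exact)"
    if jaccard_overlap(tokenise(value), tokenise(expected)) > 0.6:
        return "CORRECT (fuzzy)"
    return "HALLUCINATED"

print(match_fact("Acme Corp.", "acme corp"))
print(match_fact("Industrial-grade widgets for factories",
                 "industrial grade widgets for modern factories"))
print(match_fact("", "Acme Corp"))
```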
Our token estimator (~1.3 tokens per whitespace-delimited word) was validated against tiktoken (cl100k_base) on a stratified sample of 1,000 pages from the Tranco Top 100k:
| Language | N | MAE | SD | Max Error | Bias |
|---|---|---|---|---|---|
| English | 840 | 4.2% | 2.8% | 12.1% | +1.3% (slight overcount) |
| German / French | 72 | 5.1% | 3.4% | 14.8% | +2.1% (compound words) |
| CJK (Japanese, Chinese, Korean) | 58 | 9.7% | 5.2% | 22.4% | −6.8% (undercount) |
| Mixed (en + code blocks) | 30 | 6.3% | 4.1% | 18.2% | +3.2% (code overcount) |
Table A1: Token estimator validation against tiktoken cl100k_base. MAE = mean absolute error. Bias = signed mean error (positive = overcount). N = 1,000 stratified pages. The ±5% English band is consistent with the §6.1 Golden Token variance footnote.
To verify that our results are not an artifact of a specific model or embedding, we repeated the core experiment on a 2,000-page subsample (40 domains × 50 pages) using two alternative systems:
| System | Token Reduction | Accuracy Gain (Spearman ρ vs. GPT-4o-mini) | Recall@5 Change |
|---|---|---|---|
| GPT-4o-mini (primary) | −52% | ρ = 1.00 (baseline) | +62% |
| all-mpnet-base-v2 (open embedding) | −52% | ρ = 0.89 (p < 0.001) | +57% |
| text-embedding-3-small (OpenAI) | −52% | ρ = 0.92 (p < 0.001) | +59% |
Table A2: Cross-model sensitivity. Token reduction is identical (structural, model-independent). Accuracy gains are directionally consistent across all models, with minor magnitude differences due to model-specific context window utilisation patterns.
Key finding: the structural cleanup signal is model-agnostic. Token reduction is purely structural (identical across models), and accuracy gains show Spearman ρ > 0.89 with the primary GPT-4o-mini results. The all-mpnet-base-v2 open embedding achieves 92% of the primary model's accuracy gain, confirming that our findings generalise beyond proprietary models.
Below are anonymised git-style diffs for three representative page pairs, showing the exact structural changes applied to create the "Clean Twin" variants:
```diff
- <script>window.ShopifyAnalytics={...}</script>
- <script>(function(){var cart=window.cart||...})()</script>
- <script>Shopify.theme={...}</script>
+ <script src="/assets/analytics.min.js" defer></script>
+ <script src="/assets/cart.min.js" defer></script>
- <div class="mobile-nav" style="display:none">...</div>
  <!-- removed hidden mobile nav duplicate -->
+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"Product",
+  "name":"...","offers":{"@type":"Offer","price":"99.00"}}
+ </script>
- <footer><!-- 47 links, 800 tokens --></footer>
+ <footer><a href="/sitemap.xml">Sitemap</a> | <a href="/privacy">Privacy</a></footer>
+ <main>
    <!-- existing product content unchanged -->
+ </main>
```
```diff
- <script id="__NEXT_DATA__" type="application/json">{...}</script>
  <!-- removed __NEXT_DATA__ blob (not needed for AI extraction) -->
- <script src="/_next/static/chunks/..." defer></script>
+ <script src="/_next/static/chunks/main.js" defer></script> <!-- single bundle -->
+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"SoftwareApplication",
+  "name":"...","applicationCategory":"BusinessApplication"}
+ </script>
+ <main role="main">
    <!-- existing hero and feature content unchanged -->
+ </main>
```
```diff
- <style id="wp-admin-bar-css">...</style>
- <script>/* wp-embed, jquery-migrate, etc */</script>
+ <!-- admin bar and legacy scripts removed for production -->
- <aside class="sidebar"><!-- 920 tokens of widget/nav duplication --></aside>
  <!-- sidebar moved to bottom or made lazy -->
+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"Article",
+  "headline":"...","datePublished":"2026-01-15","author":{...}}
+ </script>
  <article> <!-- already present, content unchanged -->
```
- Fetching: `fetchPageHTML` (Go net/http, no JS rendering). Snapshot timestamp stored alongside results.
- Optimisation: `buildOptimisedVariant()` in the Extraction Lab codebase. No manual HTML editing.

Paste any URL and see exactly how HTML structure affects AI fact extraction. Free, instant, no signup required.
Open Extraction Lab →

Or run a full Deep Audit for comprehensive site-wide analysis with automated remediation.