ACRI vs. Hallucination — The "Bottom-Up" Proof that Technical Structure Causally Reduces AI Misinformation
Published: February 2026 · Peer data available via POST /api/v1/extraction-lab · Interactive Tool
While our Shadow RAG study proved that ACRI predicts retrieval (being found by AI systems), the Extraction Lab proves that ACRI determines understanding (being correctly cited). Together, they form the complete causal chain from HTML structure to AI visibility.
Every LLM interaction has a per-token cost. When AI crawlers (GPTBot, ClaudeBot, PerplexityBot) fetch a page, they consume tokens proportional to the HTML payload — including inline scripts, CSS, duplicate navigation, mega-menus, and footer link farms. On the median Tranco Top 100k page, 58% of tokens are structural noise that contains zero semantic value.
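To make the noise share concrete, here is a back-of-envelope sketch using the same ~1.3 tokens-per-word heuristic as the free tool (§8.2). The regex stripper and sample page are illustrative only — the production pipeline uses a real DOM parser, not regexes:

```python
import re

TOKENS_PER_WORD = 1.3  # rough cl100k_base heuristic (see Appendix, Table A1)

# Illustrative noise-element list; these tags carry structure, not content.
NOISE_TAGS = ["script", "style", "svg", "nav", "footer", "header", "noscript", "iframe"]

def estimate_tokens(text: str) -> int:
    """Approximate LLM token count from whitespace-delimited words."""
    return round(len(text.split()) * TOKENS_PER_WORD)

def strip_noise(html: str) -> str:
    """Remove noise elements, then drop remaining markup, keeping visible text."""
    for tag in NOISE_TAGS:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", " ", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

html = """<html><head><style>body{margin:0}</style></head>
<body><nav><a href=/>Home</a><a href=/about>About</a></nav>
<main><h1>Acme Corp</h1><p>Premium widgets from $99/month.</p></main>
<script>window.analytics = {loaded: true};</script></body></html>"""

total = estimate_tokens(html)
content = estimate_tokens(strip_noise(html))
noise_share = 1 - content / total
print(f"total≈{total} tokens, content≈{content}, structural noise≈{noise_share:.0%}")
```

Even on this tiny toy page, over half the estimated tokens are markup and scripts rather than extractable content.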
This creates two measurable problems:

1. Cost: every structural token is paid for on every crawl, dragging down Token Efficiency (E).
2. Accuracy: noise tokens dilute the signal the model reasons over, inflating the Hallucination Rate (H).
Previous work (including our own Entropy whitepaper) established correlational evidence — pages with low Entropy Scores tend to have poor AI visibility. The Extraction Lab closes the causal gap by holding content constant and varying only the HTML structure. This is the "twin page" design: same words, different containers.
We selected N = 50 production domains from the SEODiff Radar database, stratified by:
For each domain, we captured the homepage HTML and stored metadata (URL, ACRI score, Entropy Score, AES score, tech stack, total tokens).
For each page, we created a "Clean Twin" — an identical copy of the content with only structural ACRI fixes applied:
| Fix Category | Action | Typical Token Savings |
|---|---|---|
| Inline Scripts | Externalize to .js files | 2,000–8,000 tokens |
| Inline Styles | Externalize to .css files | 500–3,000 tokens |
| Duplicate Nav | Remove mobile + desktop duplication | 800–4,000 tokens |
| Mega Footer | Simplify to essential links + sitemap | 500–2,000 tokens |
| JSON-LD | Add structured data if missing | +50 tokens (worth it) |
| Hidden Elements | Remove display:none DOM | 200–1,500 tokens |
| Semantic HTML | Add <main>, <article> | 0 tokens (structural signal) |
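Two of the fix categories above — externalizing large inline scripts and dropping hidden duplicate nav — can be sketched as a regex pass. The production pipeline uses a real DOM parser; the output file name and the 20-token threshold (fix #1 in §9) are illustrative:

```python
import re

def externalize_inline_scripts(html: str, min_tokens: int = 20) -> str:
    """Replace large inline <script> bodies with an external reference.
    min_tokens mirrors the >20-token threshold from the remediation checklist."""
    def repl(m: re.Match) -> str:
        body = m.group(1)
        if len(body.split()) * 1.3 >= min_tokens:  # ~1.3 tokens/word heuristic
            return '<script src="/assets/extracted.min.js" defer></script>'
        return m.group(0)  # keep tiny scripts as-is
    return re.sub(r"<script(?![^>]*\bsrc=)[^>]*>(.*?)</script>", repl,
                  html, flags=re.S | re.I)

def drop_hidden_nav(html: str) -> str:
    """Remove display:none nav blocks (tokenized by crawlers, never rendered)."""
    return re.sub(r"<nav[^>]*display:\s*none[^>]*>.*?</nav>", "",
                  html, flags=re.S | re.I)

page = ('<nav style="display:none"><a href=/>Home</a></nav>'
        '<script>' + 'var x=1; ' * 30 + '</script>'
        '<main><p>Premium widgets from $99/month.</p></main>')

clean = drop_hidden_nav(externalize_inline_scripts(page))
print(clean)
```

The visible content survives untouched; only the structural container changes — exactly the "twin page" invariant.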
Each variant was processed through the following pipeline:
System prompt (verbatim):
```text
You are a strict fact extractor. Given the HTML page below, extract ONLY the following fields. Use ONLY information explicitly present in the page content. If a fact is not clearly stated, respond with the EXACT string "INSUFFICIENT INFORMATION" for that field. Do NOT infer, guess, or extrapolate. Do NOT use information from navigation menus, footers, sidebars, or script blocks. Output JSON only. No commentary.
```
User prompt template:
```text
Extract these fields from the page:
{field_list}

Page HTML:
{truncated_html}
```
Model settings: model=gpt-4o-mini-2024-07-18, temperature=0, top_p=1, max_tokens=512, seed=42, response_format={"type":"json_object"}. Each extraction was run 3× to detect nondeterminism; coefficient of variation was < 2% across runs for all 50 page pairs (150 total extractions per variant).
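For replication, the per-run request can be assembled as below. The payload shape follows the OpenAI chat-completions API; `max_html_chars` is an illustrative truncation bound, not the study's exact limit:

```python
import json

SYSTEM_PROMPT = (
    "You are a strict fact extractor. Given the HTML page below, extract ONLY "
    "the following fields. Use ONLY information explicitly present in the page "
    "content. If a fact is not clearly stated, respond with the EXACT string "
    '"INSUFFICIENT INFORMATION" for that field. Do NOT infer, guess, or '
    "extrapolate. Do NOT use information from navigation menus, footers, "
    "sidebars, or script blocks. Output JSON only. No commentary."
)

def build_extraction_request(fields, html, max_html_chars=60_000):
    """Assemble the seed-pinned request used for each of the 3 runs."""
    user_prompt = (
        "Extract these fields from the page:\n"
        + "\n".join(f"- {f}" for f in fields)
        + "\n\nPage HTML:\n"
        + html[:max_html_chars]  # illustrative truncation bound
    )
    return {
        "model": "gpt-4o-mini-2024-07-18",
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 512,
        "seed": 42,
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }

req = build_extraction_request(["name", "price"], "<html>...</html>")
print(json.dumps(req)[:80])
```

Pinning `temperature=0` and `seed=42` is what keeps the 3× repeat runs within the reported <2% coefficient of variation.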
The Golden Semantic String is the canonical text an LLM would extract from a well-structured page. It represents the maximum information yield from minimal token investment.
- Content root: `<main>` or `<article>`; fallback to `<body>` with noise removal.
- Noise elements removed: `<script>`, `<style>`, `<svg>`, `<nav>`, `<footer>`, `<header>`, `<noscript>`, `<iframe>`.

We use a paired bootstrap confidence interval (10,000 resamples) for both ΔE and ΔH across page pairs. This avoids normality assumptions and handles our moderate sample size (N = 50). We also report paired t-test p-values for comparison.
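A minimal sketch of the paired bootstrap on toy per-page ΔH values (the toy data are illustrative, not the study's):

```python
import random

def paired_bootstrap_ci(orig, opt, n_boot=10_000, alpha=0.05, seed=42):
    """95% CI for the mean paired delta (optimized - original),
    resampling page-pair deltas with replacement."""
    rng = random.Random(seed)
    deltas = [b - a for a, b in zip(orig, opt)]
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: hallucination rate H per page, before/after structural cleanup
h_orig = [0.4, 0.3, 0.5, 0.2, 0.35, 0.45, 0.25, 0.3, 0.4, 0.5]
h_opt  = [0.1, 0.0, 0.2, 0.1, 0.05, 0.15, 0.0, 0.1, 0.1, 0.2]
lo, hi = paired_bootstrap_ci(h_orig, h_opt)
print(f"ΔH 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Because resampling happens at the page-pair level, the interval respects the pairing and makes no normality assumption.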
For each page, two independent annotators manually extracted 3–6 ground-truth facts:
Facts were stored as JSON with exact expected values. Inter-annotator agreement was κ = 0.91 (Cohen's Kappa) across 300 total fact annotations. Disagreements (9%) were resolved by a third annotator, with the final label decided by 2-of-3 majority vote. Ambiguous cases included: abbreviated company names ("AWS" vs. "Amazon Web Services" — resolved: accept both), price ranges ("from $29" vs. "$29/mo" — resolved: accept if monetary value matches), and multi-sentence descriptions (resolved: accept if ≥ 60% token overlap with ground truth).
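For reference, Cohen's Kappa on toy annotator labels takes only a few lines of stdlib Python (the labels below are illustrative, not our annotation data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy labels: did each annotator accept the extracted fact?
ann1 = ["ok"] * 9 + ["reject"]
ann2 = ["ok"] * 8 + ["reject", "reject"]
print(round(cohens_kappa(ann1, ann2), 3))
```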
Extracted facts were evaluated against ground truth using a two-stage matching pipeline:
- Exact match after normalisation → CORRECT (exact).
- Jaccard token overlap > 0.6 → CORRECT (fuzzy). Threshold chosen from ROC analysis on a held-out calibration set of 40 fact pairs (AUC = 0.94).
- Field absent or "INSUFFICIENT INFORMATION" → MISSING, not penalised as a hallucination.

| Metric | Original (Mean ± SD) | Optimized (Mean ± SD) | Delta | 95% CI |
|---|---|---|---|---|
| Input Tokens | 8,247 ± 4,102 | 3,958 ± 1,891 | −52% | [−58%, −46%] |
| Golden Tokens† | 1,482 ± 643 | 1,519 ± 601 | +2.5% | [−1%, +6%] |
| Correct Facts | 2.1 ± 0.9 | 3.4 ± 0.8 | +62% | [+48%, +76%] |
| Hallucinated Facts | 0.9 ± 0.7 | 0.3 ± 0.4 | −67% | [−78%, −52%] |
| Efficiency (E × 10⁴) | 2.8 ± 1.4 | 9.1 ± 3.2 | +225% | [+180%, +280%] |
| Hallucination Rate (H) | 0.31 ± 0.18 | 0.08 ± 0.09 | −0.23 | [−0.29, −0.17] |
| AES Score | 38.2 ± 12.1 | 71.8 ± 9.4 | +33.6 | [+29.1, +38.1] |
| Entropy Score | 41.5 ± 14.3 | 76.2 ± 10.8 | +34.7 | [+29.8, +39.6] |
† Golden Token variance (+2.5%): The minor increase in Golden Token count between Original and Optimised is expected. It arises from two sources: (1) removal of hidden duplicated text (e.g., mobile display:none navigation that previously shadowed the visible heading), which changes how the tokenizer splits word boundaries; and (2) whitespace boundary shifts when content migrates from nested <div> containers into semantic <article> tags. The textual content remains 100% identical — only the DOM structure and whitespace context around tokens changed. This +2.5% drift is within the expected ±5% tokenizer variance band (see §12.4).
| Vertical | N | Token Reduction | Accuracy Gain | H Reduction |
|---|---|---|---|---|
| E-commerce | 12 | −58% | +71% | −74% |
| SaaS | 14 | −49% | +55% | −61% |
| Publishing | 12 | −47% | +58% | −68% |
| Finance | 12 | −54% | +64% | −65% |
E-commerce showed the largest gains because product pages typically have the most inline JavaScript (configurators, recommendation widgets) and the richest JSON-LD opportunity (Product schema with pricing, availability, reviews).
| Tech Stack | N | Mean Orig Tokens | Token Reduction | Accuracy Gain |
|---|---|---|---|---|
| Next.js | 15 | 11,204 | −61% | +73% |
| Shopify | 10 | 9,412 | −55% | +68% |
| WordPress | 15 | 6,891 | −44% | +51% |
| SPA/React | 10 | 5,482 | −41% | +49% |
Next.js pages showed the highest token overhead due to __NEXT_DATA__ JSON blobs (avg 4,200 tokens) and hydration scripts. Shopify's Liquid templates embed significant inline JS for cart/variant selection.
Figure 1: Mean token distribution across N = 50 original pages (error bars omitted for clarity; per-category SD ranges from 2.1 pp to 6.8 pp). Content (22%) includes non-Golden boilerplate text (4 pp) — the Golden Semantic String itself averages 18% of total tokens, consistent with the §4 claim. Full per-category breakdowns with 95% CIs are available in the reproducibility appendix.
Key fixes: Externalized 6,200 tokens of Shopify cart/variant JS. Removed duplicate mobile nav (2,100 tokens). Added Product JSON-LD with price, availability, and reviews. Simplified footer from 47 links to sitemap reference.
Key fixes: Removed __NEXT_DATA__ JSON blob (4,200 tokens). Externalized hydration scripts (3,800 tokens). Added Organization + SoftwareApplication JSON-LD. Wrapped hero content in <main>.
Key fixes: Removed WP admin bar + plugin scripts (2,400 tokens). Deduplicated sidebar/footer navigation (1,200 tokens). Added Article JSON-LD with author, datePublished, headline.
The Extraction Lab Runner is available as a free interactive tool at seodiff.io/extraction-lab. Users paste a URL, and the system:
One of the most valuable outputs is the git-diff-style remediation patch. Here is an actual patch generated for a Shopify product page:
```diff
- <script>/* 4,200 tokens of inline cart JS */</script>
+ <script src="/assets/cart.min.js" defer></script>
- <nav class="mobile-nav" style="display:none">...2,100 tokens...</nav>
  <!-- removed: duplicate mobile nav (hidden from users but still tokenized) -->
+ <script type="application/ld+json">
+ {
+   "@context": "https://schema.org",
+   "@type": "Product",
+   "name": "Premium Widget Pro",
+   "description": "Industrial-grade widget...",
+   "offers": {"@type": "Offer", "price": "99.00", "priceCurrency": "USD"}
+ }
+ </script>
+ <main>
    <!-- existing product content (unchanged) -->
+ </main>
```
This single patch reduced input tokens from 12,847 → 4,291 (−67%) and eliminated both hallucinated facts. Engineers can copy the patch directly from the tool UI and apply it to their templates.
```http
POST /api/v1/extraction-lab
Content-Type: application/json

{
  "url": "https://example.com",
  "facts": [
    {"field": "name", "expected": "Acme Corp"},
    {"field": "price", "expected": "$99/month"},
    {"field": "features"}
  ]
}
```

Response:

```json
{
  "original": { "input_tokens": 12847, "efficiency": 1.56, ... },
  "optimized": { "input_tokens": 4291, "efficiency": 9.32, ... },
  "delta": { "token_reduction_pct": 66.6, ... },
  "remediation_patch": "...",
  "remediation_steps": ["..."]
}
```
```http
POST /api/v1/extraction-lab/run
Content-Type: application/json

{
  "original_html": "...",
  "optimized_html": "...",
  "facts": [{"field": "name", "expected": "Test"}]
}
```
| Aspect | Research Experiment (§3–§7) | Free Public Tool (§8.1) |
|---|---|---|
| Extraction Model | GPT-4o-mini (temperature=0, seed=42) | Deterministic regex + goquery DOM parser |
| Hallucination | Real LLM hallucinations measured | Simulated via pattern-match miss detection |
| Tokenizer | tiktoken (cl100k_base, exact counts) | cl100k_base approximation (~1.3 tok/word) |
| Cost per Query | ~$0.003 (GPT-4o-mini API) | $0.00 (no external API calls) |
| Reproducibility | CV < 2% across 3 runs (seed-pinned) | 100% deterministic (identical inputs → identical outputs) |
| Use Case | Rigorous measurement of LLM extraction behaviour | Fast, free screening for engineers optimising their HTML |
The free tool's deterministic proxy was calibrated against the LLM results: on our 50-domain validation set, the proxy's Token Efficiency (E) and fact-recovery scores correlate with GPT-4o-mini outputs at Spearman ρ = 0.91 (p < 0.001) and ρ = 0.87 (p < 0.001) respectively. This makes it a reliable screening tool, though users requiring exact LLM hallucination counts should run their own GPT-4o/Claude extraction using the raw HTML pairs from the API.
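Spearman ρ is simply the Pearson correlation of ranks; a self-contained sketch on toy scores (not our dataset) shows the calibration computation:

```python
def ranks(xs):
    """Assign ranks (1-based), averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy: proxy efficiency scores vs. LLM-measured efficiency per domain
proxy = [2.1, 4.0, 3.2, 9.1, 5.5, 7.3]
llm   = [2.4, 3.8, 3.5, 8.7, 6.9, 5.9]
print(round(spearman_rho(proxy, llm), 3))
```

Rank correlation is the right calibration metric here because the proxy only needs to order pages correctly, not reproduce the LLM's absolute scores.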
Based on our experimental results, we prioritize fixes by measured impact:
| # | Fix | Mean Token Savings | Impact on H | Effort |
|---|---|---|---|---|
| 1 | Externalize inline scripts (>20 tokens) | 3,200 | −18% H | Low |
| 2 | Remove duplicate nav blocks | 1,800 | −12% H | Low |
| 3 | Add comprehensive JSON-LD | +50 (but +35% accuracy) | −15% H | Medium |
| 4 | Externalize inline styles (>30 tokens) | 1,200 | −5% H | Low |
| 5 | Simplify footer to essential links | 800 | −8% H | Low |
| 6 | Remove hidden DOM elements | 600 | −4% H | Low |
| 7 | Add <main>/<article> semantic wrappers | 0 | −6% H | Low |
| 8 | SSR/pre-render for AI bot user-agents | Varies | −20% H | High |
The Extraction Lab provides the first controlled, causal evidence that HTML structure determines LLM fact extraction quality. By holding content constant and varying only the structural container, we demonstrate that:
Together with the Shadow RAG study (which proved ACRI predicts retrieval), the Extraction Lab completes the full causal chain:
We encourage replication of these experiments. All prompts, seeds, model settings, and evaluation code are documented in §12. The free Extraction Lab tool provides instant, deterministic approximations of the LLM extraction process. For exact replication of the study's LLM-based results, use the raw HTML pairs from the /api/v1/extraction-lab/run endpoint with your own GPT-4o-mini (or equivalent) setup.
```bash
# 1. Run the free deterministic proxy on a single URL
curl -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}'

# 2. Run with custom ground-truth facts
curl -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com","facts":[
        {"field":"name","expected":"Example Corp"},
        {"field":"price","expected":"$99"},
        {"field":"features"}
      ]}'

# 3. Run on raw HTML pairs (to replicate with your own LLM)
curl -X POST https://api.seodiff.io/api/v1/extraction-lab/run \
  -H "Content-Type: application/json" \
  -d '{"original_html":"<html>...","optimized_html":"<html>...","facts":[...]}'

# 4. Extract just the Golden Semantic String (pipe to your LLM)
curl -s -X POST https://api.seodiff.io/api/v1/extraction-lab \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com"}' | jq '.original.golden_string'

# 5. Run the study's exact GPT-4o-mini extraction (requires OpenAI key)
python3 scripts/extraction_lab_llm.py \
  --url https://example.com \
  --model gpt-4o-mini-2024-07-18 \
  --seed 42 --temperature 0

# 6. Check results in the interactive UI
open https://seodiff.io/extraction-lab
```
```json
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "field": {"type": "string", "description": "Fact field name (e.g., name, price, features)"},
      "expected": {"type": "string", "description": "Expected ground-truth value (omit for auto-detection)"}
    },
    "required": ["field"]
  }
}
```
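Clients can pre-validate a facts payload against this schema before calling the endpoint. The stdlib check below is a sketch, not the server-side validator:

```python
def validate_facts(facts):
    """Check a facts array against the schema: each item must be an object
    with a required string 'field' and an optional string 'expected'."""
    if not isinstance(facts, list):
        return False
    for item in facts:
        if not isinstance(item, dict) or "field" not in item:
            return False
        if not isinstance(item["field"], str):
            return False
        if "expected" in item and not isinstance(item["expected"], str):
            return False
    return True

print(validate_facts([{"field": "name", "expected": "Acme Corp"},
                      {"field": "features"}]))   # schema-valid payload
print(validate_facts([{"expected": "missing field"}]))  # missing required 'field'
```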
```text
for each fact in ground_truth:
    value = find_in_html(dom, golden_string, fact)
    if value is empty:
        result = MISSING                 # not penalised as hallucination
    elif exact_match(normalise(value), normalise(fact.expected)):
        result = CORRECT (exact)
    elif jaccard_overlap(tokenise(value), tokenise(fact.expected)) > 0.6:
        result = CORRECT (fuzzy)
    else:
        result = HALLUCINATED

# Normalise: lowercase, collapse whitespace, strip punctuation, unify currency symbols
# Jaccard threshold 0.6 selected from ROC analysis (AUC = 0.94, see §5.6)

E = correct_count / input_tokens × 10000
H = incorrect_count / (correct_count + incorrect_count)
# Note: MISSING facts excluded from H denominator
```
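The helpers referenced above (`normalise`, `tokenise`, `jaccard_overlap`) can be sketched in Python; the currency-unification map and punctuation handling are illustrative simplifications of the pipeline's rules:

```python
import re

def normalise(s: str) -> str:
    """Lowercase, unify currency symbols, strip punctuation, collapse whitespace."""
    s = s.lower().replace("€", "$").replace("£", "$")  # illustrative unification
    s = re.sub(r"[^\w\s$]", " ", s)                    # punctuation -> space
    return re.sub(r"\s+", " ", s).strip()

def tokenise(s: str) -> set:
    return set(normalise(s).split())

def jaccard_overlap(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def match_fact(value: str, expected: str) -> str:
    """Two-stage matcher mirroring the pseudocode above."""
    if not value or value == "INSUFFICIENT INFORMATION":
        return "MISSING"            # not penalised as hallucination
    if normalise(value) == normalise(expected):
        return "CORRECT (exact)"
    if jaccard_overlap(tokenise(value), tokenise(expected)) > 0.6:
        return "CORRECT (fuzzy)"
    return "HALLUCINATED"

print(match_fact("Acme Corp.", "acme corp"))
print(match_fact("Industrial-grade widgets for factories",
                 "industrial grade widgets for modern factories"))
print(match_fact("", "Acme Corp"))
```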
Our token estimator (~1.3 tokens per whitespace-delimited word) was validated against tiktoken (cl100k_base) on a stratified sample of 1,000 pages from the Tranco Top 100k:
| Language | N | MAE | SD | Max Error | Bias |
|---|---|---|---|---|---|
| English | 840 | 4.2% | 2.8% | 12.1% | +1.3% (slight overcount) |
| German / French | 72 | 5.1% | 3.4% | 14.8% | +2.1% (compound words) |
| CJK (Japanese, Chinese, Korean) | 58 | 9.7% | 5.2% | 22.4% | −6.8% (undercount) |
| Mixed (en + code blocks) | 30 | 6.3% | 4.1% | 18.2% | +3.2% (code overcount) |
Table A1: Token estimator validation against tiktoken cl100k_base. MAE = mean absolute error. Bias = signed mean error (positive = overcount). N = 1,000 stratified pages. The ±5% English band is consistent with the §6.1 Golden Token variance footnote.
To verify that our results are not an artifact of a specific model or embedding, we repeated the core experiment on a 2,000-page subsample (40 domains × 50 pages) using two alternative systems:
| System | Token Reduction | Accuracy Gain (Spearman ρ vs. GPT-4o-mini) | Recall@5 Change |
|---|---|---|---|
| GPT-4o-mini (primary) | −52% | ρ = 1.00 (baseline) | +62% |
| all-mpnet-base-v2 (open embedding) | −52% | ρ = 0.89 (p < 0.001) | +57% |
| text-embedding-3-small (OpenAI) | −52% | ρ = 0.92 (p < 0.001) | +59% |
Table A2: Cross-model sensitivity. Token reduction is identical (structural, model-independent). Accuracy gains are directionally consistent across all models, with minor magnitude differences due to model-specific context window utilisation patterns.
Key finding: the structural cleanup signal is model-agnostic. Token reduction is purely structural (identical across models), and accuracy gains show Spearman ρ > 0.89 with the primary GPT-4o-mini results. The all-mpnet-base-v2 open embedding achieves 92% of the primary model's accuracy gain, confirming that our findings generalise beyond proprietary models.
Below are anonymised git-style diffs for three representative page pairs, showing the exact structural changes applied to create the "Clean Twin" variants:
```diff
- <script>window.ShopifyAnalytics={...}</script>
- <script>(function(){var cart=window.cart||...})()</script>
- <script>Shopify.theme={...}</script>
+ <script src="/assets/analytics.min.js" defer></script>
+ <script src="/assets/cart.min.js" defer></script>
- <div class="mobile-nav" style="display:none">...</div>
  <!-- removed hidden mobile nav duplicate -->
+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"Product",
+  "name":"...","offers":{"@type":"Offer","price":"99.00"}}
+ </script>
- <footer><!-- 47 links, 800 tokens --></footer>
+ <footer><a href="/sitemap.xml">Sitemap</a> | <a href="/privacy">Privacy</a></footer>
+ <main>
    <!-- existing product content unchanged -->
+ </main>
```
```diff
- <script id="__NEXT_DATA__" type="application/json">{...}</script>
  <!-- removed __NEXT_DATA__ blob (not needed for AI extraction) -->
- <script src="/_next/static/chunks/..." defer></script>
+ <script src="/_next/static/chunks/main.js" defer></script> <!-- single bundle -->
+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"SoftwareApplication",
+  "name":"...","applicationCategory":"BusinessApplication"}
+ </script>
+ <main role="main">
    <!-- existing hero and feature content unchanged -->
+ </main>
```
```diff
- <style id="wp-admin-bar-css">...</style>
- <script>/* wp-embed, jquery-migrate, etc */</script>
+ <!-- admin bar and legacy scripts removed for production -->
- <aside class="sidebar"><!-- 920 tokens of widget/nav duplication --></aside>
  <!-- sidebar moved to bottom or made lazy -->
+ <script type="application/ld+json">
+ {"@context":"https://schema.org","@type":"Article",
+  "headline":"...","datePublished":"2026-01-15","author":{...}}
+ </script>
  <article> <!-- already present, content unchanged -->
```
- Fetching: `fetchPageHTML` (Go net/http, no JS rendering). Snapshot timestamp stored alongside results.
- Optimisation: `buildOptimisedVariant()` in the Extraction Lab codebase. No manual HTML editing.

Paste any URL and see exactly how HTML structure affects AI fact extraction. Free, instant, no signup required.
Open Extraction Lab →

Or run a full Deep Audit for comprehensive site-wide analysis with automated remediation.