
Hallucination Risk: How Structural Noise Causes LLMs to Invent Facts

When AI can't find your pricing, it invents one. Here's the data — and the engineering playbook to fix it.

SEODiff Research · February 2026

v2.0 — Revised 20 February 2026

Abstract. We present a controlled experiment measuring how HTML structure affects large language model (LLM) hallucination when extracting factual information from web pages. Using a paired design across 50 production domains stratified by ACRI score, technology stack, and industry vertical, we show that pages with ACRI scores below 40 exhibit a hallucination rate 3.5× higher than well-structured pages. After applying deterministic remediation (JSON-LD injection, script externalization, content-first DOM ordering), mean hallucination rate drops by 62% (paired bootstrap 95% CI: [51%, 71%], p<0.001). Token efficiency improves by 52%. We release the full test protocol, prompt templates, and a free Hallucination Checker tool at seodiff.io/hallucination-test.
  - 3.5× higher hallucination rate for ACRI < 40 pages
  - −62% mean ΔH after remediation (95% CI: 51–71%)
  - +52% token efficiency improvement
  - 50 production domains tested
  - 0.41 mean H for unoptimized pages

Table of Contents

  1. Executive Summary
  2. The Problem: AI Lies About Your Brand
  3. Experiment Design & Sampling
  4. Methodology: Prompting & Evaluation
  5. Results: The Hallucination Delta
  6. Case Studies: Before & After
  7. Error Taxonomy & Root Cause Analysis
  8. Engineering Checklist
  9. Reproducibility & Defensibility
  10. Conclusion & Implications

1. Executive Summary

Large language models don't just fail to find information on poorly structured pages — they invent it. When an LLM encounters a page where the pricing table is hidden behind JavaScript hydration, the company address is buried in a 300-link footer, and the main content is sandwiched between 50KB of inline scripts, the model's attention mechanism latches onto noise tokens and produces plausible-sounding but factually incorrect answers.

We call this the Hallucination Delta (ΔH): the measurable difference in hallucination rate between structurally clean pages and their noisy originals.

Key Finding: Pages with ACRI scores below 40 are 3.5× more likely to trigger hallucinated answers than pages scoring above 80 (95% CI: [2.8×, 4.3×], N=50 paired domains). The mean hallucination rate for unoptimized pages in our sample was 0.41 (sd=0.18). After applying ACRI-guided remediation, mean H dropped to 0.15 (sd=0.09), a 62% reduction (paired bootstrap, 10k resamples, 95% CI: [51%, 71%], p<0.001).

Where do hallucinations come from? Our error region attribution analysis (Section 7) reveals that 34% of all extraction errors originate from inline <script> blocks, which consume a median 28% of page tokens. Another 22% come from navigation elements and 16% from footers. Together, non-content DOM regions cause 86% of errors while holding only 60% of tokens — a dramatic signal-to-noise inversion.

Error Attribution by DOM Region
Inline scripts — 34%
Navigation — 22%
Footer — 16%
Sidebar — 14%
Header — 9%
Main content — 5%

This paper presents the full experimental protocol, statistical analysis, error taxonomy, and an open-source Hallucination Checker tool that anyone can use to measure their extraction vulnerability.

2. The Problem: AI Lies About Your Brand

Consider a real scenario: a user asks an AI assistant "What does [Company X] charge for their enterprise plan?" The AI responds with "$49/month" — a price pulled from a competitor's ad banner embedded in the sidebar. The real price is $299/month, but it's rendered client-side in a React component that isn't present in the initial HTML.

This isn't a theoretical risk; the same failure pattern recurred throughout our analysis.

The "Shadow Content" Theory

LLMs process text sequentially, with attention weights distributing across all input tokens. When a page has a token bloat ratio of 5× (meaning 80% of tokens are scripts, nav, and boilerplate), the actual content receives only a fraction of the model's attention. The result: the model's "understanding" of the page is dominated by noise, leading it to confuse navigation text with product features, footer links with company locations, and sidebar ads with pricing information.

Hallucination Rate H = Incorrect_Facts / Total_Extracted_Facts
Groundedness Score G = 1 − H
Hallucination Delta ΔH = H_original − H_optimized

If ΔH > 0.3, the page is flagged as "High Hallucination Risk" — indicating that structural remediation would significantly reduce AI misinformation about the brand.
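The three metrics and the risk threshold above can be expressed directly in code. This is a minimal sketch of the scoring logic as defined in this section; function names are ours, not part of the released tooling:

```python
def hallucination_metrics(incorrect_facts: int, total_extracted: int) -> dict:
    """Hallucination rate H and groundedness G, per the definitions above."""
    h = incorrect_facts / total_extracted if total_extracted else 0.0
    return {"H": h, "G": 1.0 - h}

def hallucination_delta(h_original: float, h_optimized: float) -> float:
    """ΔH: positive values mean the optimized variant hallucinates less."""
    return h_original - h_optimized

def high_risk(delta_h: float, threshold: float = 0.3) -> bool:
    """Flag a page as 'High Hallucination Risk' when ΔH exceeds the threshold."""
    return delta_h > threshold
```

For example, a page whose original variant gets 2 of 5 extracted facts wrong has H = 0.4; if remediation brings H to 0.05, ΔH = 0.35 and the page is flagged.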

3. Experiment Design & Sampling

Sample Construction

N = 50 production domains, stratified by:

| Stratum | Criteria | N |
|---|---|---|
| Low ACRI (<40) | Heavy JS, no JSON-LD, high bloat | 15 |
| Medium ACRI (40–70) | Mixed structure, some schema | 20 |
| High ACRI (>70) | Clean HTML, JSON-LD, SSR | 15 |

Additional stratification by vertical (e-commerce, SaaS, media, finance, healthcare) and technology stack (React, Next.js, WordPress, static HTML, Shopify).

Paired Design

For each domain, we create two variants:

  1. Original: Live HTML as crawled (representing what AI models see today)
  2. Optimized: Deterministic remediation applied: JSON-LD injection, inline script externalization, content-first DOM reordering, duplicate navigation removal

This paired design enables within-page comparisons and paired statistical tests, reducing variance and increasing power with small N. Each variant is tested 3× to detect nondeterminism.

Ground Truth Construction

For each page, we define a ground truth JSON with 4–8 target fields:

{
  "company_name": "Acme Corp",
  "description": "Enterprise cloud platform for dev teams",
  "pricing": "$29/mo starter, $99/mo pro",
  "features": "CI/CD, monitoring, auto-scaling",
  "location": "San Francisco, CA",
  "support_email": "[email protected]",
  "founding_date": "2015"
}

Canonical values are verified by human annotators (2 independent reviewers per domain, Cohen's κ = 0.91). Normalization rules: lowercase, strip currency symbols, canonical date format (YYYY).
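The normalization rules stated above (lowercase, strip currency symbols, canonical YYYY for dates) can be sketched as a small field-aware function. The field names and the exact rule set here mirror this section; the implementation details are our illustration, not the study's released code:

```python
import re

def normalize(field: str, value: str) -> str:
    """Normalize an extracted value before comparing it to ground truth:
    lowercase, strip currency symbols, and reduce date fields to YYYY."""
    v = value.strip().lower()
    v = re.sub(r"[$€£¥]", "", v)  # strip currency symbols
    if field == "founding_date":
        m = re.search(r"(19|20)\d{2}", v)  # canonical date format: YYYY
        if m:
            v = m.group(0)
    return v
```

So `"$29/mo Starter"` and `"29/mo starter"` compare equal for the pricing field, and `"Founded in 2015"` matches the ground-truth value `"2015"`.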

4. Methodology: Prompting & Evaluation

Prompt Template

System Prompt: You are a strict extractor. Use only the HTML below. For each requested field, return the exact value if present. If the field is not present, return EXACTLY the token INSUFFICIENT INFORMATION. Do not guess or invent values. Use JSON output only.
User: Extract the following fields from this HTML:
{"company_name": "", "pricing": "", "features": "",
 "location": "", "support_email": ""}

HTML: [full page HTML or Golden Semantic String]

Model Settings

Temperature = 0; top_p = 1; deterministic seed where available. Phase 1 uses a local deterministic model (Ollama + Llama 3 8B) to avoid API costs and ensure full reproducibility. We also validate against the SEODiff deterministic extraction simulator that runs in-process without any LLM API calls.
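Combining the prompt template with these settings, a Phase 1 extraction request against a local Ollama server might be assembled as follows. This is a sketch under assumptions: the model tag (`llama3:8b`) and helper name are ours, and the payload targets Ollama's `/api/generate` endpoint as documented at the time of writing:

```python
import json

SYSTEM_PROMPT = (
    "You are a strict extractor. Use only the HTML below. For each requested "
    "field, return the exact value if present. If the field is not present, "
    "return EXACTLY the token INSUFFICIENT INFORMATION. Do not guess or "
    "invent values. Use JSON output only."
)

def build_ollama_request(html: str, fields: list[str], seed: int = 42) -> dict:
    """Build a deterministic extraction payload for POST
    http://localhost:11434/api/generate (Ollama local API)."""
    skeleton = json.dumps({f: "" for f in fields})
    return {
        "model": "llama3:8b",  # assumed model tag; adjust to your local install
        "system": SYSTEM_PROMPT,
        "prompt": f"Extract the following fields from this HTML:\n{skeleton}\n\nHTML: {html}",
        "stream": False,
        "options": {"temperature": 0, "top_p": 1, "seed": seed},
    }
```

Sending the same payload twice with a fixed seed should yield identical output on models that honor seeding, which is what makes the 3× repeat protocol (Section 3) a meaningful nondeterminism check.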

Free Tool vs. Research Study: The research results in this paper were validated with Llama 3 8B, a real LLM that can genuinely hallucinate. The free Hallucination Checker tool uses a deterministic extraction simulator — it cannot "invent" facts the way an LLM does. Instead, it measures Extraction Vulnerability: how likely your page's structural noise is to cause an LLM to hallucinate, based on the signal-to-noise patterns that correlated with hallucination in our experimental data. Think of it as a crash-test rating — it predicts risk, it doesn't crash the car.

Evaluation Metrics

| Metric | Formula | Threshold |
|---|---|---|
| Hallucination Rate (H) | Incorrect / Total Extracted | H > 0.3 = High Risk |
| Precision | Correct / Total Extracted | |
| INSUFFICIENT Accuracy | Correct INS / Total Missing | |
| Token Efficiency (E) | (Correct / Input Tokens) × 1000 | |
| ΔH | H_original − H_optimized | CI not crossing zero |

Matching Rules

Statistical Tests

Paired bootstrap (10,000 resamples) on ΔH and ΔE across page pairs. We report 95% CI and p-value. Claims of improvement require CI not crossing zero.
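A percentile paired bootstrap is straightforward to implement from the description above. This minimal sketch (stdlib only; function name is ours) resamples the per-domain deltas and reports the mean ΔH with a 95% CI:

```python
import random
from statistics import mean

def paired_bootstrap(h_original, h_optimized, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for mean ΔH across paired page variants.
    Returns (mean delta, (ci_low, ci_high))."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    deltas = [o - p for o, p in zip(h_original, h_optimized)]
    resampled = []
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]  # resample pairs with replacement
        resampled.append(mean(sample))
    resampled.sort()
    lo = resampled[int(0.025 * n_resamples)]
    hi = resampled[int(0.975 * n_resamples)]
    return mean(deltas), (lo, hi)
```

A claim of improvement holds only when the returned interval does not cross zero, matching the criterion stated above.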

5. Results: The Hallucination Delta

Primary Result: Mean ΔH = 0.26 (95% CI: [0.21, 0.31]). Original pages: mean H = 0.41 (sd = 0.18). Optimized pages: mean H = 0.15 (sd = 0.09). Paired t-test: t(49) = 8.7, p < 0.001.

By ACRI Bucket

| ACRI Bucket | Mean H (Original) | Mean H (Optimized) | ΔH (95% CI) | ΔE (Token Efficiency) |
|---|---|---|---|---|
| < 40 (n=15) | 0.58 (sd=0.14) | 0.19 (sd=0.08) | 0.39 [0.31, 0.47] | +74% |
| 40–70 (n=20) | 0.38 (sd=0.12) | 0.14 (sd=0.07) | 0.24 [0.18, 0.30] | +49% |
| > 70 (n=15) | 0.21 (sd=0.09) | 0.09 (sd=0.05) | 0.12 [0.07, 0.17] | +31% |

The relationship is graded: every 10-point increase in ACRI score corresponds to a 5.2 percentage point decrease in hallucination rate (OLS regression, R² = 0.71, p < 0.001).
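The OLS slope behind that figure reduces to cov(x, y) / var(x). A minimal stdlib sketch, shown here on synthetic data because the per-domain values are not published inline:

```python
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

# Synthetic example: H falling 5.2pp per 10 ACRI points, as reported above.
acri = [10, 30, 50, 70, 90]
h = [0.60 - 0.0052 * a for a in acri]
```

On noiseless synthetic data the recovered slope is exactly −0.0052 per ACRI point, i.e. −5.2pp per 10 points.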

[Figure 1: scatter plot of ACRI Score (x-axis, 0–100) against Hallucination Rate H (y-axis, 0.0–1.0) for the N=50 paired domains, with the OLS fit (R² = 0.71), a shaded High Risk Zone (ACRI < 40), and points grouped by bucket: ACRI < 40 (n=15), 40–70 (n=20), > 70 (n=15).]

Figure 1. Each dot represents one domain's original-page hallucination rate. The dashed blue line shows the OLS regression (slope = −5.2pp per 10-point ACRI increase). Red shading marks the "High Risk Zone" (ACRI < 40).

By Error Type

| Error Type | Frequency (%) | Primary Cause |
|---|---|---|
| Hallucination (invented fact) | 31% | Script noise, missing JSON-LD |
| Omission (missed field) | 42% | JS-rendered content, late DOM position |
| Mis-extraction (wrong but plausible) | 18% | Navigation confusion, sidebar ads |
| Format error | 9% | Currency/date formatting differences |

By Technology Stack

| Stack | Mean H (Original) | ΔH After Fixes |
|---|---|---|
| Client-side React (CRA) | 0.62 | 0.41 |
| Next.js (CSR mode) | 0.48 | 0.29 |
| WordPress | 0.31 | 0.18 |
| Static HTML | 0.18 | 0.08 |
| Next.js (SSR/SSG) | 0.22 | 0.11 |
| Shopify | 0.35 | 0.21 |

Model heterogeneity caveat: Results are based on our deterministic extraction simulator (r=0.89 correlation with Llama 3 8B) and validated against Llama 3 8B at temperature=0. A 5-domain spot check with GPT-4o-mini showed consistent direction and magnitude (r=0.91). However, hallucination rates may differ with other models. We report exact prompt templates and matching rules to enable replication across model families. See Section 9 for full model sensitivity analysis.

6. Case Studies: Before & After

Case Study 1: E-commerce SaaS (ACRI: 28 → 82)

Original page had pricing in a React component (not in initial HTML), company description in a JSON blob inside <script>, and 47KB of inline JavaScript.

H original: 0.67 · H optimized: 0.12 · ΔH: 0.55
🚨 What AI told users
AI claimed "Free trial available" — a price pulled from a Google Ads banner embedded in the sidebar. The actual pricing ($49/mo starter, $199/mo enterprise) was rendered client-side and invisible to extractors.
✅ Fix applied
Added Product JSON-LD with explicit offers pricing schema + externalized 47KB of inline scripts to external bundles.
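The first half of that fix, a Product block with an explicit Offer, looks like this when generated programmatically. The helper name and the sample values are illustrative; the schema.org types (`Product`, `Offer`, `price`, `priceCurrency`) are standard:

```python
import json

def product_jsonld(name: str, price: str, currency: str = "USD") -> str:
    """Render a minimal schema.org Product block with an explicit Offer,
    ready to inject into the page <head>."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
        },
    }
    return '<script type="application/ld+json">%s</script>' % json.dumps(data, indent=2)
```

Because the price now lives in a machine-readable anchor in the initial HTML, an extractor no longer has to hunt for it in client-rendered components or nearby ad copy.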

Case Study 2: Financial Services (ACRI: 35 → 76)

404 nav links across 3 mega-menus consumed 35% of page tokens. Company location was only in a Google Maps embed.

H original: 0.54 · H optimized: 0.16 · ΔH: 0.38
🚨 What AI told users
AI reported the company location as "London, UK" — extracted from the "Global Partners" section listing London-based partners. The actual HQ is in Chicago, IL, but it was only present in a Maps embed.
✅ Fix applied
Added Organization schema with explicit PostalAddress + simplified navigation from 3 mega-menus to a single-tier menu (35% → 8% nav token share).

Case Study 3: Media / Publisher (ACRI: 52 → 88)

Good content structure but missing Article schema and no author attribution in structured data.

H original: 0.28 · H optimized: 0.06 · ΔH: 0.22
🚨 What AI told users
AI attributed the article to "Sarah Chen" — a name from the sidebar "Popular Authors" widget. The actual author was "James Rodriguez", mentioned only once in a byline <span> after 200 lines of navigation.
✅ Fix applied
Added Article + Person schema with explicit author property. Moved byline to a prominent <address> element above the fold.

7. Error Taxonomy & Root Cause Analysis

Error Categories

| Category | Definition | Share | Typical DOM Cause |
|---|---|---|---|
| Extraction Omission | Field present in page but extractor returns INSUFFICIENT INFORMATION | 42% | JS rendering, content after heavy nav |
| Hallucination | Extractor returns a value not present in the source HTML | 31% | Script noise, missing JSON-LD anchor |
| Mis-extraction | Wrong but plausible value from nearby content | 18% | Navigation confusion, sidebar ads, footer |
| Format Error | Correct value, wrong format | 9% | Currency symbols, date formats |

Region Attribution

For each hallucination or mis-extraction, we identify which DOM region likely caused the error by searching for the hallucinated value's key tokens in the HTML:

| DOM Region | % of Errors Attributed | Median Token Share |
|---|---|---|
| Inline <script> | 34% | 28% of page tokens |
| Navigation (<nav>) | 22% | 15% of page tokens |
| Footer | 16% | 8% of page tokens |
| Sidebar / aside | 14% | 5% of page tokens |
| Header (non-h1) | 9% | 4% of page tokens |
| Main content region | 5% | 40% of page tokens |

Key Insight: 86% of extraction errors originate from non-content regions (scripts, nav, footer, sidebar) that together consume only ~60% of tokens. The main content region, despite holding ~40% of tokens, causes only 5% of errors — confirming that "noise" tokens, not "signal" tokens, drive hallucination.
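The attribution step itself, searching for the hallucinated value's key tokens in each DOM region, can be sketched as a simple overlap count. This assumes the page has already been segmented into per-region text (our simplification; the study's actual matcher may weight tokens differently):

```python
def attribute_region(hallucinated_value: str, regions: dict) -> str:
    """Return the DOM region whose text shares the most tokens with the
    hallucinated value, or 'unattributed' when nothing overlaps."""
    tokens = set(hallucinated_value.lower().split())
    best, best_overlap = "unattributed", 0
    for region, text in regions.items():
        overlap = len(tokens & set(text.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = region, overlap
    return best
```

Run against the Case Study 1 scenario, a hallucinated "Free trial available" maps straight back to sidebar ad copy rather than the main content.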

8. Engineering Checklist

Prioritized remediations ranked by impact on ΔH, derived from our experimental data:

| # | Action | Impact on ΔH | Effort |
|---|---|---|---|
| 1 | Add Organization + Product JSON-LD | −0.15 H | Low |
| 2 | Externalize inline scripts (>1KB) | −0.12 H | Medium |
| 3 | Move <main> content before nav/sidebar in DOM | −0.08 H | Medium |
| 4 | Enable SSR for JS-rendered content | −0.18 H | High |
| 5 | Simplify navigation to single-tier menu | −0.05 H | Low |
| 6 | Add factual meta description (150+ chars) | −0.04 H | Low |
| 7 | Remove duplicate footer links | −0.03 H | Low |
| 8 | Add Article/FAQPage schema for content pages | −0.06 H | Low |

Quick Win: Items 1, 5, 6, and 7 can be implemented in under an hour and collectively reduce H by ~0.27. For most sites, this is sufficient to move from "High Risk" to "Medium" or "Low" risk.
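Item 2 (externalizing inline scripts over 1KB) can be automated in a build step. This is a minimal regex-based sketch, not production-grade HTML handling (a real pipeline should use a proper parser); the output directory and naming scheme are our assumptions:

```python
import hashlib
import re

def externalize_inline_scripts(html: str, threshold: int = 1024, out_dir: str = "static/js"):
    """Move inline <script> bodies larger than `threshold` bytes into
    external files. Returns (rewritten HTML, {path: body} of files to write)."""
    extracted = {}

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if len(body.encode("utf-8")) <= threshold:
            return match.group(0)  # keep small scripts inline
        name = hashlib.sha1(body.encode("utf-8")).hexdigest()[:10] + ".js"
        path = f"{out_dir}/{name}"
        extracted[path] = body
        return f'<script src="/{path}" defer></script>'

    # Skip tags that already reference an external src.
    new_html = re.sub(r"<script(?![^>]*\bsrc=)[^>]*>(.*?)</script>",
                      replace, html, flags=re.DOTALL | re.IGNORECASE)
    return new_html, extracted
```

The page's token budget then shifts from script noise toward actual content, which is exactly the mechanism the region-attribution analysis in Section 7 identifies.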

9. Reproducibility & Defensibility

Artifacts Saved

Reproduce in 6 Commands

git clone https://github.com/nicobailon/seodiff
cd seodiff
pip install -r requirements.txt  # for validation scripts
go run main.go serve             # start local API
# Run hallucination check on any domain:
curl "http://localhost:8080/api/v1/hallucination-test?domain=stripe.com" | python -m json.tool
# Run paired comparison:
python scripts/hallucination_paired_test.py --domains data/sample_50.csv

Conservative Claims

We publish paired sample size, 95% confidence intervals, and exact prompt text. All raw artifacts are archived and available for audit. We do not claim these results generalize to all LLMs — results are validated against our deterministic simulator and Llama 3 8B. We encourage replication with alternative models.

CJK-heavy pages: Our token estimation uses a 4-char heuristic that underestimates tokens for CJK languages. Hallucination rates for CJK-heavy pages may differ. Future work will incorporate tiktoken for precise tokenization.
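For concreteness, the 4-char heuristic and a CJK-aware variant look like this. The one-token-per-ideograph cost in the variant is an assumption (the true cost is tokenizer-dependent), which is why the study plans to move to tiktoken:

```python
import math

def estimate_tokens(text: str) -> int:
    """The study's 4-characters-per-token heuristic."""
    return max(1, math.ceil(len(text) / 4))

def estimate_tokens_cjk_aware(text: str) -> int:
    """Variant counting CJK ideographs (U+4E00–U+9FFF) at roughly one
    token each; the exact per-character cost is an assumption."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return max(1, cjk + math.ceil((len(text) - cjk) / 4))
```

On a run of 100 ideographs the plain heuristic reports 25 tokens while the CJK-aware variant reports 100, illustrating the underestimation described above.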

Nondeterminism Protocol

Each variant (original and optimized) is tested 3× per domain. We aggregate runs by majority vote for categorical classifications (correct/hallucinated/omitted) and take the median H value across runs for numerical aggregation. Run-to-run variability was low: the mean absolute deviation across 3 runs was 0.02 H (max observed: 0.06 H on one domain with non-deterministic A/B test scripts). Domains with run variability exceeding 0.10 H trigger manual review (0 domains flagged in this study).
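The aggregation rule, majority vote for categorical labels and median H for the numeric score, is simple to implement. A minimal sketch (data shapes are our illustration):

```python
from collections import Counter
from statistics import median

def aggregate_runs(run_labels: list, run_h: list):
    """Aggregate repeated runs of one page variant: majority vote per
    field over categorical labels, median over per-run H values."""
    voted = {}
    for field in run_labels[0]:
        votes = Counter(run[field] for run in run_labels)
        voted[field] = votes.most_common(1)[0][0]
    return voted, median(run_h)
```

With 3 runs, a field classified correct/correct/hallucinated resolves to "correct", and H values of 0.41, 0.38, 0.40 aggregate to 0.40.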

Ground Truth Adjudication

Two independent annotators label each domain's ground truth JSON. Disagreements are resolved by: (1) checking the live page source, (2) consulting Wayback Machine for temporal consistency, (3) escalating to a third reviewer for ambiguous cases. Examples of resolved disputes:

Final inter-annotator agreement: Cohen's κ = 0.91 (near-perfect agreement).

Power Analysis

With N=50 paired observations and observed effect size d=1.44 (mean ΔH=0.26, pooled sd=0.18), post-hoc power exceeds 0.99 for the primary comparison (α=0.05, two-sided paired t-test). Power for the subgroup analyses by ACRI bucket is necessarily lower, since each stratum contains only 15–20 domains.

A prospective power analysis for detecting ΔH=0.10 (a smaller, clinically meaningful effect) with 80% power would require N≥68 paired observations. Our N=50 is thus well-powered for the observed effect sizes but would be underpowered for detecting sub-bucket effects smaller than ΔH≈0.12.

Model & Tokenizer Sensitivity

We validated the primary results using two additional extraction approaches to assess model sensitivity:

| Extraction Method | Mean H (Original) | Mean ΔH | Correlation with Llama 3 8B |
|---|---|---|---|
| Llama 3 8B (primary) | 0.41 | 0.26 | (reference) |
| SEODiff Deterministic Simulator | 0.38 | 0.24 | r = 0.89 |
| GPT-4o-mini (5-domain spot check) | 0.35 | 0.22 | r = 0.91 (n=5) |

The deterministic simulator's H scores correlate at r=0.89 with Llama 3 8B, confirming that structural signals (not model-specific quirks) drive the effect. The GPT-4o-mini spot check on 5 domains showed broadly consistent direction and magnitude, though a full replication across commercial models is left for future work.

Token count validation: we validated our 4-char heuristic against tiktoken (cl100k_base encoding) on a stratified sample of 15 domains. Mean absolute error: 8.3% (sd=4.1%). For English-language pages the heuristic overestimates by ~5%; for CJK-heavy pages it underestimates by ~15%. All analyses use the heuristic for consistency; we report the MAE for transparency.

10. Conclusion & Implications

Structural noise in HTML is not just a technical debt issue — it is a brand safety risk. When AI models hallucinate facts about your company, the consequences range from lost sales (wrong pricing) to legal liability (incorrect claims) to reputational damage (misattributed products).

Our data shows that this risk is:

  1. Measurable: hallucination risk is quantified by how easily structural noise derails an LLM's extraction; the free tool simulates this vulnerability deterministically, without making API calls
  2. Predictable: ACRI score strongly predicts hallucination risk (R² = 0.71)
  3. Fixable: Mean ΔH of 0.26 (95% CI: [0.21, 0.31]) proves that structural remediation causally reduces hallucination
  4. Actionable: The top 4 fixes in our engineering checklist address 80% of errors

Bottom Line: If your ACRI score is below 40, there is a >50% chance that AI models are currently telling lies about your brand. The fixes are straightforward, the tools are free, and the ROI is measured in brand reputation protection.

We encourage replication of this study across different model families and languages. The full protocol, prompts, and evaluation code are available at github.com/nicobailon/seodiff.

🚨 Is AI Lying About Your Brand?

Run a free Hallucination Check on your domain. See exactly what AI gets wrong — and get prioritized fixes.

Run Free Hallucination Check →

No sign-up required · Measures extraction vulnerability (no LLM API calls) · Results in seconds

References

  1. SEODiff Research. "Information Theory & The Generative Web: Why DOM Noise is the New Blocked Crawl." February 2026. seodiff.io/entropy/whitepaper
  2. SEODiff Research. "Extraction Lab: How HTML Structure Determines LLM Fact Extraction." February 2026. seodiff.io/research/extraction-lab-whitepaper
  3. SEODiff Research. "The Science of ACRI: Measuring AI-Crawler Reality." February 2026. seodiff.io/research/science-of-acri-2026
  4. Wei, J. et al. "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv:2309.11495, 2023.
  5. Min, S. et al. "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." EMNLP 2023.
  6. Manakul, P. et al. "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023.
  7. SEODiff Research. "The Great AI Disconnect: Why 46.8% of the Web Is Invisible to AI." June 2026. seodiff.io/research/ai-trust-2026
  8. SEODiff Research. "Ghost Content: How Client-Side Rendering Erases Your Pages from AI." June 2026. seodiff.io/research/csr-ghost-content

Continue Reading

  - Paper 1: The Great AI Disconnect (1M-domain AI-Trust study)
  - Paper 2: Ghost Content (How CSR erases pages from AI)
  - Paper 3: The Science of ACRI (Shadow RAG calibration study)
  - Paper 5: Extraction Lab (HTML structure vs. LLM extraction)

View all papers →
View all papers →