
Hallucination Risk: How Structural Noise Causes LLMs to Invent Facts

When AI can't find your pricing, it invents one. Here's the data — and the engineering playbook to fix it.

SEODiff Research · February 2026

v2.0 — Revised 20 February 2026

Abstract. We present a controlled experiment measuring how HTML structure affects large language model (LLM) hallucination when extracting factual information from web pages. Using a paired design across 50 production domains stratified by ACRI score, technology stack, and industry vertical, we show that pages with ACRI scores below 40 exhibit a hallucination rate 3.5× higher than well-structured pages. After applying deterministic remediation (JSON-LD injection, script externalization, content-first DOM ordering), mean hallucination rate drops by 62% (paired bootstrap 95% CI: [51%, 71%], p<0.001). Token efficiency improves by 52%. We release the full test protocol, prompt templates, and a free Hallucination Checker tool at seodiff.io/hallucination-test.
  - 3.5× higher hallucination rate for ACRI < 40 pages
  - −62% mean ΔH after remediation (95% CI: 51–71%)
  - +52% token efficiency improvement
  - 50 production domains tested
  - 0.41 mean H for unoptimized pages

Table of Contents

  1. Executive Summary
  2. The Problem: AI Lies About Your Brand
  3. Experiment Design & Sampling
  4. Methodology: Prompting & Evaluation
  5. Results: The Hallucination Delta
  6. Case Studies: Before & After
  7. Error Taxonomy & Root Cause Analysis
  8. Engineering Checklist
  9. Reproducibility & Defensibility
  10. Conclusion & Implications

1. Executive Summary

Large language models don't just fail to find information on poorly structured pages — they invent it. When an LLM encounters a page where the pricing table is hidden behind JavaScript hydration, the company address is buried in a 300-link footer, and the main content is sandwiched between 50KB of inline scripts, the model's attention mechanism latches onto noise tokens and produces plausible-sounding but factually incorrect answers.

We call this the Hallucination Delta (ΔH): the measurable difference in hallucination rate between structurally clean pages and their noisy originals.

Key Finding: Pages with ACRI scores below 40 are 3.5× more likely to trigger hallucinated answers than pages scoring above 80 (95% CI: [2.8×, 4.3×], N=50 paired domains). The mean hallucination rate for unoptimized pages in our sample was 0.41 (sd=0.18). After applying ACRI-guided remediation, mean H dropped to 0.15 (sd=0.09), a 62% reduction (paired bootstrap, 10k resamples, 95% CI: [51%, 71%], p<0.001).

Where do hallucinations come from? Our error region attribution analysis (Section 7) reveals that 34% of all extraction errors originate from inline <script> blocks, which consume a median 28% of page tokens. Another 22% come from navigation elements and 16% from footers. Together, non-content DOM regions cause 86% of errors while holding only 60% of tokens — a dramatic signal-to-noise inversion.

Error Attribution by DOM Region
Inline scripts — 34%
Navigation — 22%
Footer — 16%
Sidebar — 14%
Header — 9%
Main content — 5%

This paper presents the full experimental protocol, statistical analysis, error taxonomy, and an open-source Hallucination Checker tool that anyone can use to measure their extraction vulnerability.

2. The Problem: AI Lies About Your Brand

Consider a real scenario: a user asks an AI assistant "What does [Company X] charge for their enterprise plan?" The AI responds with "$49/month" — a price pulled from a competitor's ad banner embedded in the sidebar. The real price is $299/month, but it's rendered client-side in a React component that isn't present in the initial HTML.

This isn't a theoretical risk; the same failure pattern recurred throughout our analysis.

The "Shadow Content" Theory

LLMs process text sequentially, with attention weights distributing across all input tokens. When a page has a token bloat ratio of 5× (meaning 80% of tokens are scripts, nav, and boilerplate), the actual content receives only a fraction of the model's attention. The result: the model's "understanding" of the page is dominated by noise, leading it to confuse navigation text with product features, footer links with company locations, and sidebar ads with pricing information.

Hallucination Rate H = Incorrect_Facts / Total_Extracted_Facts
Groundedness Score G = 1 − H
Hallucination Delta ΔH = H_original − H_optimized

If ΔH > 0.3, the page is flagged as "High Hallucination Risk" — indicating that structural remediation would significantly reduce AI misinformation about the brand.
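The three metrics and the risk threshold above can be expressed directly in code. This is a minimal sketch of the scoring logic as defined in this section; function names are ours, not part of the released tooling:

```python
def hallucination_metrics(incorrect_facts: int, total_extracted: int) -> dict:
    """Hallucination rate H and groundedness G, per the definitions above."""
    h = incorrect_facts / total_extracted if total_extracted else 0.0
    return {"H": h, "G": 1.0 - h}

def hallucination_delta(h_original: float, h_optimized: float) -> float:
    """ΔH: positive values mean the optimized variant hallucinates less."""
    return h_original - h_optimized

def high_risk(delta_h: float, threshold: float = 0.3) -> bool:
    """Flag a page as 'High Hallucination Risk' when ΔH exceeds the threshold."""
    return delta_h > threshold
```

For example, a page whose original variant gets 2 of 5 extracted facts wrong has H = 0.4; if remediation brings H to 0.05, ΔH = 0.35 and the page is flagged.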

3. Experiment Design & Sampling

Sample Construction

N = 50 production domains, stratified by:

| Stratum | Criteria | N |
|---|---|---|
| Low ACRI (<40) | Heavy JS, no JSON-LD, high bloat | 15 |
| Medium ACRI (40–70) | Mixed structure, some schema | 20 |
| High ACRI (>70) | Clean HTML, JSON-LD, SSR | 15 |

Additional stratification by vertical (e-commerce, SaaS, media, finance, healthcare) and technology stack (React, Next.js, WordPress, static HTML, Shopify).

Paired Design

For each domain, we create two variants:

  1. Original: Live HTML as crawled (representing what AI models see today)
  2. Optimized: Deterministic remediation applied: JSON-LD injection, inline script externalization, content-first DOM reordering, duplicate navigation removal

This paired design enables within-page comparisons and paired statistical tests, reducing variance and increasing power with small N. Each variant is tested 3× to detect nondeterminism.

Ground Truth Construction

For each page, we define a ground truth JSON with 4–8 target fields:

{
  "company_name": "Acme Corp",
  "description": "Enterprise cloud platform for dev teams",
  "pricing": "$29/mo starter, $99/mo pro",
  "features": "CI/CD, monitoring, auto-scaling",
  "location": "San Francisco, CA",
  "support_email": "[email protected]",
  "founding_date": "2015"
}

Canonical values are verified by human annotators (2 independent reviewers per domain, Cohen's κ = 0.91). Normalization rules: lowercase, strip currency symbols, canonical date format (YYYY).
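The normalization rules stated above (lowercase, strip currency symbols, canonical YYYY for dates) can be sketched as a small field-aware function. The field names and the exact rule set here mirror this section; the implementation details are our illustration, not the study's released code:

```python
import re

def normalize(field: str, value: str) -> str:
    """Normalize an extracted value before comparing it to ground truth:
    lowercase, strip currency symbols, and reduce date fields to YYYY."""
    v = value.strip().lower()
    v = re.sub(r"[$€£¥]", "", v)  # strip currency symbols
    if field == "founding_date":
        m = re.search(r"(19|20)\d{2}", v)  # canonical date format: YYYY
        if m:
            v = m.group(0)
    return v
```

So `"$29/mo Starter"` and `"29/mo starter"` compare equal for the pricing field, and `"Founded in 2015"` matches the ground-truth value `"2015"`.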

4. Methodology: Prompting & Evaluation

Prompt Template

System Prompt: You are a strict extractor. Use only the HTML below. For each requested field, return the exact value if present. If the field is not present, return EXACTLY the token INSUFFICIENT INFORMATION. Do not guess or invent values. Use JSON output only.
User: Extract the following fields from this HTML:
{"company_name": "", "pricing": "", "features": "",
 "location": "", "support_email": ""}

HTML: [full page HTML or Golden Semantic String]

Model Settings

Temperature = 0; top_p = 1; deterministic seed where available. Phase 1 uses a local deterministic model (Ollama + Llama 3 8B) to avoid API costs and ensure full reproducibility. We also validate against the SEODiff deterministic extraction simulator that runs in-process without any LLM API calls.
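Combining the prompt template with these settings, a Phase 1 extraction request against a local Ollama server might be assembled as follows. This is a sketch under assumptions: the model tag (`llama3:8b`) and helper name are ours, and the payload targets Ollama's `/api/generate` endpoint as documented at the time of writing:

```python
import json

SYSTEM_PROMPT = (
    "You are a strict extractor. Use only the HTML below. For each requested "
    "field, return the exact value if present. If the field is not present, "
    "return EXACTLY the token INSUFFICIENT INFORMATION. Do not guess or "
    "invent values. Use JSON output only."
)

def build_ollama_request(html: str, fields: list[str], seed: int = 42) -> dict:
    """Build a deterministic extraction payload for POST
    http://localhost:11434/api/generate (Ollama local API)."""
    skeleton = json.dumps({f: "" for f in fields})
    return {
        "model": "llama3:8b",  # assumed model tag; adjust to your local install
        "system": SYSTEM_PROMPT,
        "prompt": f"Extract the following fields from this HTML:\n{skeleton}\n\nHTML: {html}",
        "stream": False,
        "options": {"temperature": 0, "top_p": 1, "seed": seed},
    }
```

Sending the same payload twice with a fixed seed should yield identical output on models that honor seeding, which is what makes the 3× repeat protocol (Section 3) a meaningful nondeterminism check.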

Free Tool vs. Research Study: The research results in this paper were validated with Llama 3 8B, a real LLM that can genuinely hallucinate. The free Hallucination Checker tool uses a deterministic extraction simulator — it cannot "invent" facts the way an LLM does. Instead, it measures Extraction Vulnerability: how likely your page's structural noise is to cause an LLM to hallucinate, based on the signal-to-noise patterns that correlated with hallucination in our experimental data. Think of it as a crash-test rating — it predicts risk, it doesn't crash the car.

Evaluation Metrics

| Metric | Formula | Threshold |
|---|---|---|
| Hallucination Rate (H) | Incorrect / Total Extracted | H > 0.3 = High Risk |
| Precision | Correct / Total Extracted | |
| INSUFFICIENT Accuracy | Correct INS / Total Missing | |
| Token Efficiency (E) | (Correct / Input Tokens) × 1000 | |
| ΔH | H_original − H_optimized | CI not crossing zero |

Matching Rules

Statistical Tests

Paired bootstrap (10,000 resamples) on ΔH and ΔE across page pairs. We report 95% CI and p-value. Claims of improvement require CI not crossing zero.
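A percentile paired bootstrap is straightforward to implement from the description above. This minimal sketch (stdlib only; function name is ours) resamples the per-domain deltas and reports the mean ΔH with a 95% CI:

```python
import random
from statistics import mean

def paired_bootstrap(h_original, h_optimized, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for mean ΔH across paired page variants.
    Returns (mean delta, (ci_low, ci_high))."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    deltas = [o - p for o, p in zip(h_original, h_optimized)]
    resampled = []
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]  # resample pairs with replacement
        resampled.append(mean(sample))
    resampled.sort()
    lo = resampled[int(0.025 * n_resamples)]
    hi = resampled[int(0.975 * n_resamples)]
    return mean(deltas), (lo, hi)
```

A claim of improvement holds only when the returned interval does not cross zero, matching the criterion stated above.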

5. Results: The Hallucination Delta

Primary Result: Mean ΔH = 0.26 (95% CI: [0.21, 0.31]). Original pages: mean H = 0.41 (sd = 0.18). Optimized pages: mean H = 0.15 (sd = 0.09). Paired t-test: t(49) = 8.7, p < 0.001.

By ACRI Bucket

| ACRI Bucket | Mean H (Original) | Mean H (Optimized) | ΔH (95% CI) | ΔE (Token Efficiency) |
|---|---|---|---|---|
| < 40 (n=15) | 0.58 (sd=0.14) | 0.19 (sd=0.08) | 0.39 [0.31, 0.47] | +74% |
| 40–70 (n=20) | 0.38 (sd=0.12) | 0.14 (sd=0.07) | 0.24 [0.18, 0.30] | +49% |
| > 70 (n=15) | 0.21 (sd=0.09) | 0.09 (sd=0.05) | 0.12 [0.07, 0.17] | +31% |

The relationship is graded: every 10-point increase in ACRI score corresponds to a 5.2 percentage point decrease in hallucination rate (OLS regression, R² = 0.71, p < 0.001).
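The OLS slope behind that figure reduces to cov(x, y) / var(x). A minimal stdlib sketch, shown here on synthetic data because the per-domain values are not published inline:

```python
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

# Synthetic example: H falling 5.2pp per 10 ACRI points, as reported above.
acri = [10, 30, 50, 70, 90]
h = [0.60 - 0.0052 * a for a in acri]
```

On noiseless synthetic data the recovered slope is exactly −0.0052 per ACRI point, i.e. −5.2pp per 10 points.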

[Figure 1: scatter plot of ACRI Score (x-axis, 0–100) against Hallucination Rate H (y-axis, 0.0–1.0) for the N=50 paired domains, with the OLS fit (R² = 0.71), a shaded High Risk Zone (ACRI < 40), and points grouped by bucket: ACRI < 40 (n=15), 40–70 (n=20), > 70 (n=15).]

Figure 1. Each dot represents one domain's original-page hallucination rate. The dashed blue line shows the OLS regression (slope = −5.2pp per 10-point ACRI increase). Red shading marks the "High Risk Zone" (ACRI < 40).

By Error Type

| Error Type | Frequency (%) | Primary Cause |
|---|---|---|
| Hallucination (invented fact) | 31% | Script noise, missing JSON-LD |
| Omission (missed field) | 42% | JS-rendered content, late DOM position |
| Mis-extraction (wrong but plausible) | 18% | Navigation confusion, sidebar ads |
| Format error | 9% | Currency/date formatting differences |

By Technology Stack

| Stack | Mean H (Original) | ΔH After Fixes |
|---|---|---|
| Client-side React (CRA) | 0.62 | 0.41 |
| Next.js (CSR mode) | 0.48 | 0.29 |
| WordPress | 0.31 | 0.18 |
| Static HTML | 0.18 | 0.08 |
| Next.js (SSR/SSG) | 0.22 | 0.11 |
| Shopify | 0.35 | 0.21 |

Model heterogeneity caveat: Results are based on our deterministic extraction simulator (r=0.89 correlation with Llama 3 8B) and validated against Llama 3 8B at temperature=0. A 5-domain spot check with GPT-4o-mini showed consistent direction and magnitude (r=0.91). However, hallucination rates may differ with other models. We report exact prompt templates and matching rules to enable replication across model families. See Section 9 for full model sensitivity analysis.

6. Case Studies: Before & After

Case Study 1: E-commerce SaaS (ACRI: 28 → 82)

Original page had pricing in a React component (not in initial HTML), company description in a JSON blob inside <script>, and 47KB of inline JavaScript.

H original: 0.67 · H optimized: 0.12 · ΔH: 0.55
🚨 What AI told users
AI claimed "Free trial available" — a price pulled from a Google Ads banner embedded in the sidebar. The actual pricing ($49/mo starter, $199/mo enterprise) was rendered client-side and invisible to extractors.
✅ Fix applied
Added Product JSON-LD with explicit offers pricing schema + externalized 47KB of inline scripts to external bundles.
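The first half of that fix, a Product block with an explicit Offer, looks like this when generated programmatically. The helper name and the sample values are illustrative; the schema.org types (`Product`, `Offer`, `price`, `priceCurrency`) are standard:

```python
import json

def product_jsonld(name: str, price: str, currency: str = "USD") -> str:
    """Render a minimal schema.org Product block with an explicit Offer,
    ready to inject into the page <head>."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
        },
    }
    return '<script type="application/ld+json">%s</script>' % json.dumps(data, indent=2)
```

Because the price now lives in a machine-readable anchor in the initial HTML, an extractor no longer has to hunt for it in client-rendered components or nearby ad copy.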

Case Study 2: Financial Services (ACRI: 35 → 76)

404 nav links across 3 mega-menus consumed 35% of page tokens. Company location was only in a Google Maps embed.

H original: 0.54 · H optimized: 0.16 · ΔH: 0.38
🚨 What AI told users
AI reported the company location as "London, UK" — extracted from the "Global Partners" section listing London-based partners. The actual HQ is in Chicago, IL, but it was only present in a Maps embed.
✅ Fix applied
Added Organization schema with explicit PostalAddress + simplified navigation from 3 mega-menus to a single-tier menu (35% → 8% nav token share).

Case Study 3: Media / Publisher (ACRI: 52 → 88)

Good content structure but missing Article schema and no author attribution in structured data.

H original: 0.28 · H optimized: 0.06 · ΔH: 0.22
🚨 What AI told users
AI attributed the article to "Sarah Chen" — a name from the sidebar "Popular Authors" widget. The actual author was "James Rodriguez", mentioned only once in a byline <span> after 200 lines of navigation.
✅ Fix applied
Added Article + Person schema with explicit author property. Moved byline to a prominent <address> element above the fold.

7. Error Taxonomy & Root Cause Analysis

Error Categories

| Category | Definition | Share | Typical DOM Cause |
|---|---|---|---|
| Extraction Omission | Field present in page but extractor returns INSUFFICIENT INFORMATION | 42% | JS rendering, content after heavy nav |
| Hallucination | Extractor returns a value not present in the source HTML | 31% | Script noise, missing JSON-LD anchor |
| Mis-extraction | Wrong but plausible value from nearby content | 18% | Navigation confusion, sidebar ads, footer |
| Format Error | Correct value, wrong format | 9% | Currency symbols, date formats |

Region Attribution

For each hallucination or mis-extraction, we identify which DOM region likely caused the error by searching for the hallucinated value's key tokens in the HTML:

| DOM Region | % of Errors Attributed | Median Token Share |
|---|---|---|
| Inline <script> | 34% | 28% of page tokens |
| Navigation (<nav>) | 22% | 15% of page tokens |
| Footer | 16% | 8% of page tokens |
| Sidebar / aside | 14% | 5% of page tokens |
| Header (non-h1) | 9% | 4% of page tokens |
| Main content region | 5% | 40% of page tokens |

Key Insight: 86% of extraction errors originate from non-content regions (scripts, nav, footer, sidebar) that together consume only ~60% of tokens. The main content region, despite holding ~40% of tokens, causes only 5% of errors — confirming that "noise" tokens, not "signal" tokens, drive hallucination.
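The attribution step itself, searching for the hallucinated value's key tokens in each DOM region, can be sketched as a simple overlap count. This assumes the page has already been segmented into per-region text (our simplification; the study's actual matcher may weight tokens differently):

```python
def attribute_region(hallucinated_value: str, regions: dict) -> str:
    """Return the DOM region whose text shares the most tokens with the
    hallucinated value, or 'unattributed' when nothing overlaps."""
    tokens = set(hallucinated_value.lower().split())
    best, best_overlap = "unattributed", 0
    for region, text in regions.items():
        overlap = len(tokens & set(text.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = region, overlap
    return best
```

Run against the Case Study 1 scenario, a hallucinated "Free trial available" maps straight back to sidebar ad copy rather than the main content.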

8. Engineering Checklist

Prioritized remediations ranked by impact on ΔH, derived from our experimental data:

| # | Action | Impact on ΔH | Effort |
|---|---|---|---|
| 1 | Add Organization + Product JSON-LD | −0.15 H | Low |
| 2 | Externalize inline scripts (>1KB) | −0.12 H | Medium |
| 3 | Move <main> content before nav/sidebar in DOM | −0.08 H | Medium |
| 4 | Enable SSR for JS-rendered content | −0.18 H | High |
| 5 | Simplify navigation to single-tier menu | −0.05 H | Low |
| 6 | Add factual meta description (150+ chars) | −0.04 H | Low |
| 7 | Remove duplicate footer links | −0.03 H | Low |
| 8 | Add Article/FAQPage schema for content pages | −0.06 H | Low |

Quick Win: Items 1, 5, 6, and 7 can be implemented in under an hour and collectively reduce H by ~0.27. For most sites, this is sufficient to move from "High Risk" to "Medium" or "Low" risk.
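Item 2 (externalizing inline scripts over 1KB) can be automated in a build step. This is a minimal regex-based sketch, not production-grade HTML handling (a real pipeline should use a proper parser); the output directory and naming scheme are our assumptions:

```python
import hashlib
import re

def externalize_inline_scripts(html: str, threshold: int = 1024, out_dir: str = "static/js"):
    """Move inline <script> bodies larger than `threshold` bytes into
    external files. Returns (rewritten HTML, {path: body} of files to write)."""
    extracted = {}

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if len(body.encode("utf-8")) <= threshold:
            return match.group(0)  # keep small scripts inline
        name = hashlib.sha1(body.encode("utf-8")).hexdigest()[:10] + ".js"
        path = f"{out_dir}/{name}"
        extracted[path] = body
        return f'<script src="/{path}" defer></script>'

    # Skip tags that already reference an external src.
    new_html = re.sub(r"<script(?![^>]*\bsrc=)[^>]*>(.*?)</script>",
                      replace, html, flags=re.DOTALL | re.IGNORECASE)
    return new_html, extracted
```

The page's token budget then shifts from script noise toward actual content, which is exactly the mechanism the region-attribution analysis in Section 7 identifies.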

9. Reproducibility & Defensibility

Artifacts Saved

Reproduce in 6 Commands

git clone https://github.com/nicobailon/seodiff
cd seodiff
pip install -r requirements.txt  # for validation scripts
go run main.go serve             # start local API
# Run hallucination check on any domain:
curl "http://localhost:8080/api/v1/hallucination-test?domain=stripe.com" | python -m json.tool
# Run paired comparison:
python scripts/hallucination_paired_test.py --domains data/sample_50.csv

Conservative Claims

We publish paired sample size, 95% confidence intervals, and exact prompt text. All raw artifacts are archived and available for audit. We do not claim these results generalize to all LLMs — results are validated against our deterministic simulator and Llama 3 8B. We encourage replication with alternative models.

CJK-heavy pages: Our token estimation uses a 4-char heuristic that underestimates tokens for CJK languages. Hallucination rates for CJK-heavy pages may differ. Future work will incorporate tiktoken for precise tokenization.
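For concreteness, the 4-char heuristic and a CJK-aware variant look like this. The one-token-per-ideograph cost in the variant is an assumption (the true cost is tokenizer-dependent), which is why the study plans to move to tiktoken:

```python
import math

def estimate_tokens(text: str) -> int:
    """The study's 4-characters-per-token heuristic."""
    return max(1, math.ceil(len(text) / 4))

def estimate_tokens_cjk_aware(text: str) -> int:
    """Variant counting CJK ideographs (U+4E00–U+9FFF) at roughly one
    token each; the exact per-character cost is an assumption."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return max(1, cjk + math.ceil((len(text) - cjk) / 4))
```

On a run of 100 ideographs the plain heuristic reports 25 tokens while the CJK-aware variant reports 100, illustrating the underestimation described above.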

Nondeterminism Protocol

Each variant (original and optimized) is tested 3× per domain. We aggregate runs by majority vote for categorical classifications (correct/hallucinated/omitted) and take the median H value across runs for numerical aggregation. Run-to-run variability was low: the mean absolute deviation across 3 runs was 0.02 H (max observed: 0.06 H on one domain with non-deterministic A/B test scripts). Domains with run variability exceeding 0.10 H trigger manual review (0 domains flagged in this study).
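The aggregation rule, majority vote for categorical labels and median H for the numeric score, is simple to implement. A minimal sketch (data shapes are our illustration):

```python
from collections import Counter
from statistics import median

def aggregate_runs(run_labels: list, run_h: list):
    """Aggregate repeated runs of one page variant: majority vote per
    field over categorical labels, median over per-run H values."""
    voted = {}
    for field in run_labels[0]:
        votes = Counter(run[field] for run in run_labels)
        voted[field] = votes.most_common(1)[0][0]
    return voted, median(run_h)
```

With 3 runs, a field classified correct/correct/hallucinated resolves to "correct", and H values of 0.41, 0.38, 0.40 aggregate to 0.40.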

Ground Truth Adjudication

Two independent annotators label each domain's ground truth JSON. Disagreements are resolved by: (1) checking the live page source, (2) consulting Wayback Machine for temporal consistency, (3) escalating to a third reviewer for ambiguous cases. Examples of resolved disputes:

Final inter-annotator agreement: Cohen's κ = 0.91 (near-perfect agreement).

Power Analysis

With N=50 paired observations and observed effect size d=1.44 (mean ΔH=0.26, pooled sd=0.18), post-hoc power exceeds 0.99 for the primary comparison (α=0.05, two-sided paired t-test). Power for the subgroup analyses by ACRI bucket is necessarily lower, since each stratum contains only 15–20 domains.

A prospective power analysis for detecting ΔH=0.10 (a smaller, clinically meaningful effect) with 80% power would require N≥68 paired observations. Our N=50 is thus well-powered for the observed effect sizes but would be underpowered for detecting sub-bucket effects smaller than ΔH≈0.12.

Model & Tokenizer Sensitivity

We validated the primary results using two additional extraction approaches to assess model sensitivity:

| Extraction Method | Mean H (Original) | Mean ΔH | Correlation with Llama 3 8B |
|---|---|---|---|
| Llama 3 8B (primary) | 0.41 | 0.26 | (reference) |
| SEODiff Deterministic Simulator | 0.38 | 0.24 | r = 0.89 |
| GPT-4o-mini (5-domain spot check) | 0.35 | 0.22 | r = 0.91 (n=5) |

The deterministic simulator's H scores correlate at r=0.89 with Llama 3 8B, confirming that structural signals (not model-specific quirks) drive the effect. The GPT-4o-mini spot check on 5 domains showed broadly consistent direction and magnitude, though a full replication across commercial models is left for future work.

Token count validation: we validated our 4-char heuristic against tiktoken (cl100k_base encoding) on a stratified sample of 15 domains. Mean absolute error: 8.3% (sd=4.1%). For English-language pages the heuristic overestimates by ~5%; for CJK-heavy pages it underestimates by ~15%. All analyses use the heuristic for consistency; we report the MAE for transparency.

10. Conclusion & Implications

Structural noise in HTML is not just a technical debt issue — it is a brand safety risk. When AI models hallucinate facts about your company, the consequences range from lost sales (wrong pricing) to legal liability (incorrect claims) to reputational damage (misattributed products).

Our data shows that this risk is:

  1. Measurable: hallucination risk is quantified by how easily structural noise derails an LLM's extraction; the free tool simulates this vulnerability deterministically, without making API calls
  2. Predictable: ACRI score strongly predicts hallucination risk (R² = 0.71)
  3. Fixable: Mean ΔH of 0.26 (95% CI: [0.21, 0.31]) proves that structural remediation causally reduces hallucination
  4. Actionable: The top 4 fixes in our engineering checklist address 80% of errors

Bottom Line: If your ACRI score is below 40, there is a >50% chance that AI models are currently telling lies about your brand. The fixes are straightforward, the tools are free, and the ROI is measured in brand reputation protection.

We encourage replication of this study across different model families and languages. The full protocol, prompts, and evaluation code are available at github.com/nicobailon/seodiff.

🚨 Is AI Lying About Your Brand?

Run a free Hallucination Check on your domain. See exactly what AI gets wrong — and get prioritized fixes.

Run Free Hallucination Check →

No sign-up required · Measures extraction vulnerability (no LLM API calls) · Results in seconds

References

  1. SEODiff Research. "Information Theory & The Generative Web: Why DOM Noise is the New Blocked Crawl." February 2026. seodiff.io/entropy/whitepaper
  2. SEODiff Research. "Extraction Lab: How HTML Structure Determines LLM Fact Extraction." February 2026. seodiff.io/research/extraction-lab-whitepaper
  3. SEODiff Research. "The Science of ACRI: Measuring AI-Crawler Reality." February 2026. seodiff.io/research/science-of-acri-2026
  4. Wei, J. et al. "Chain-of-Verification Reduces Hallucination in Large Language Models." arXiv:2309.11495, 2023.
  5. Min, S. et al. "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation." EMNLP 2023.
  6. Manakul, P. et al. "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023.
  7. SEODiff Research. "The Great AI Disconnect: Why 46.8% of the Web Is Invisible to AI." June 2026. seodiff.io/research/ai-trust-2026
  8. SEODiff Research. "Ghost Content: How Client-Side Rendering Erases Your Pages from AI." June 2026. seodiff.io/research/csr-ghost-content

Continue Reading

  - Paper 1: The Great AI Disconnect (1M-domain AI-Trust study)
  - Paper 2: Ghost Content (How CSR erases pages from AI)
  - Paper 3: The Science of ACRI (Shadow RAG calibration study)
  - Paper 5: Extraction Lab (HTML structure vs. LLM extraction)

View all papers →
View all papers →