When AI can't find your pricing, it invents one. Here's the data — and the engineering playbook to fix it.
v2.0 — Revised 20 February 2026
Large language models don't just fail to find information on poorly structured pages — they invent it. When an LLM encounters a page where the pricing table is hidden behind JavaScript hydration, the company address is buried in a 300-link footer, and the main content is sandwiched between 50KB of inline scripts, the model's attention mechanism latches onto noise tokens and produces plausible-sounding but factually incorrect answers.
We call this the Hallucination Delta (ΔH): the measurable difference in hallucination rate between structurally clean pages and their noisy originals.
Where do hallucinations come from? Our error region attribution analysis (Section 7) reveals that 34% of attributable extraction errors originate from inline <script> blocks, which consume a median 28% of page tokens. Another 22% come from navigation elements and 16% from footers. Together, non-content DOM regions account for 95% of attributed errors while holding only 60% of tokens — a dramatic signal-to-noise inversion.
This paper presents the full experimental protocol, statistical analysis, error taxonomy, and an open-source Hallucination Checker tool that anyone can use to measure their extraction vulnerability.
Consider a real scenario: a user asks an AI assistant "What does [Company X] charge for their enterprise plan?" The AI responds with "$49/month" — a price pulled from a competitor's ad banner embedded in the sidebar. The real price is $299/month, but it's rendered client-side in a React component that isn't present in the initial HTML.
This isn't a theoretical risk; it is a direct consequence of how LLMs allocate attention over their input.
LLMs distribute attention weights across all input tokens. When a page has a token bloat ratio of 5× (meaning 80% of tokens are scripts, nav, and boilerplate), the actual content receives only a fraction of the model's attention. The result: the model's "understanding" of the page is dominated by noise, leading it to confuse navigation text with product features, footer links with company locations, and sidebar ads with pricing information.
If ΔH > 0.3, the page is flagged as "High Hallucination Risk" — indicating that structural remediation would significantly reduce AI misinformation about the brand.
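The two quantities above reduce to a few lines of code. A minimal sketch (the shape of `results` — a list of per-field dicts with `extracted` and `correct` keys — is illustrative, not the tool's actual data model):

```python
def hallucination_rate(results):
    """H = incorrect extractions / total extracted (non-empty) fields."""
    extracted = [r for r in results if r["extracted"] is not None]
    if not extracted:
        return 0.0
    incorrect = sum(1 for r in extracted if not r["correct"])
    return incorrect / len(extracted)

def hallucination_delta(original_results, optimized_results):
    """ΔH = H_original − H_optimized; ΔH > 0.3 flags High Hallucination Risk."""
    delta = hallucination_rate(original_results) - hallucination_rate(optimized_results)
    return delta, delta > 0.3
```

A page whose original variant gets half its fields wrong while the optimized variant is fully correct yields ΔH = 0.5 and is flagged.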
N = 50 production domains, stratified by:
| Stratum | Criteria | N |
|---|---|---|
| Low ACRI (<40) | Heavy JS, no JSON-LD, high bloat | 15 |
| Medium ACRI (40–70) | Mixed structure, some schema | 20 |
| High ACRI (>70) | Clean HTML, JSON-LD, SSR | 15 |
Additional stratification by vertical (e-commerce, SaaS, media, finance, healthcare) and technology stack (React, Next.js, WordPress, static HTML, Shopify).
For each domain, we create two variants: (1) the original production HTML, captured as served, and (2) an optimized variant, the page's Golden Semantic String, a structurally clean rendering of the same content.
This paired design enables within-page comparisons and paired statistical tests, reducing variance and increasing power with small N. Each variant is tested 3× to detect nondeterminism.
For each page, we define a ground truth JSON with 4–8 target fields:
```json
{
  "company_name": "Acme Corp",
  "description": "Enterprise cloud platform for dev teams",
  "pricing": "$29/mo starter, $99/mo pro",
  "features": "CI/CD, monitoring, auto-scaling",
  "location": "San Francisco, CA",
  "support_email": "support@acme.com",
  "founding_date": "2015"
}
```
Canonical values are verified by human annotators (2 independent reviewers per domain, Cohen's κ = 0.91). Normalization rules: lowercase, strip currency symbols, canonical date format (YYYY).
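The normalization rules can be expressed as a small helper. This is a sketch: the regexes below are illustrative, and the study's exact rules may differ in edge cases.

```python
import re

def normalize(value: str) -> str:
    """Ground-truth normalization: lowercase, strip currency symbols,
    collapse whitespace, and reduce a bare date to its canonical YYYY year."""
    v = value.strip().lower()
    v = re.sub(r"[$€£¥]", "", v)          # strip currency symbols
    v = re.sub(r"\s+", " ", v).strip()    # collapse whitespace
    # canonical date format: a bare date collapses to its four-digit year
    m = re.fullmatch(r"(?:\w+ \d{1,2}, )?((?:19|20)\d{2})", v)
    return m.group(1) if m else v
```

Both annotated values and model outputs pass through the same function, so formatting differences (e.g. "$299/Month" vs "299/month") don't count as errors.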
```
User: Extract the following fields from this HTML:
{"company_name": "", "pricing": "", "features": "",
 "location": "", "support_email": ""}
HTML: [full page HTML or Golden Semantic String]
```
Temperature = 0; top_p = 1; deterministic seed where available. Phase 1 uses a local deterministic model (Ollama + Llama 3 8B) to avoid API costs and ensure full reproducibility. We also validate against the SEODiff deterministic extraction simulator that runs in-process without any LLM API calls.
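As a sketch, these deterministic settings map onto an Ollama `/api/generate` request payload like the one below. The field template, model tag, and seed value are assumptions for illustration; adjust them to your own setup.

```python
import json

def build_extraction_request(html: str, fields: list[str],
                             model: str = "llama3:8b") -> dict:
    """Build a deterministic Ollama /api/generate payload:
    temperature 0, top_p 1, fixed seed, JSON output, non-streaming."""
    template = json.dumps({f: "" for f in fields})
    prompt = (
        "Extract the following fields from this HTML:\n"
        f"{template}\n"
        f"HTML: {html}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",
        "options": {"temperature": 0, "top_p": 1, "seed": 42},
    }
```

POSTing this payload to `http://localhost:11434/api/generate` returns a single JSON response, which keeps run-to-run variation down to whatever nondeterminism the backend itself introduces.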
| Metric | Formula | Threshold |
|---|---|---|
| Hallucination Rate (H) | Incorrect / Total Extracted | H > 0.3 = High Risk |
| Precision | Correct / Total Extracted | — |
| INSUFFICIENT Accuracy | Correct INS / Total Missing | — |
| Token Efficiency (E) | Correct / Input Tokens × 1000 | — |
| ΔH | H_original − H_optimized | CI not crossing zero |
Paired bootstrap (10,000 resamples) on ΔH and ΔE across page pairs. We report 95% CI and p-value. Claims of improvement require CI not crossing zero.
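The paired bootstrap resamples page pairs (not individual pages) with replacement and recomputes the mean ΔH each time. A stdlib-only sketch, using 10,000 resamples as in the protocol:

```python
import random

def paired_bootstrap_ci(h_original, h_optimized, n_resamples=10_000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean ΔH over paired observations."""
    rng = random.Random(seed)
    deltas = [o - p for o, p in zip(h_original, h_optimized)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]  # resample pairs
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, (lo, hi)
```

A claimed improvement holds only if the returned interval excludes zero.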
| ACRI Bucket | Mean H (Original) | Mean H (Optimized) | ΔH (95% CI) | ΔE (Token Eff) |
|---|---|---|---|---|
| < 40 (n=15) | 0.58 (sd=0.14) | 0.19 (sd=0.08) | 0.39 [0.31, 0.47] | +74% |
| 40–70 (n=20) | 0.38 (sd=0.12) | 0.14 (sd=0.07) | 0.24 [0.18, 0.30] | +49% |
| > 70 (n=15) | 0.21 (sd=0.09) | 0.09 (sd=0.05) | 0.12 [0.07, 0.17] | +31% |
The relationship is a smooth gradient, not a threshold effect: every 10-point increase in ACRI score corresponds to a 5.2 percentage point decrease in hallucination rate (OLS regression, R² = 0.71, p < 0.001).
Figure 1. Each dot represents one domain's original-page hallucination rate. The dashed blue line shows the OLS regression (slope = −5.2pp per 10-point ACRI increase). Red shading marks the "High Risk Zone" (ACRI < 40).
| Error Type | Frequency (%) | Primary Cause |
|---|---|---|
| Hallucination (invented fact) | 31% | Script noise, missing JSON-LD |
| Omission (missed field) | 42% | JS-rendered content, late DOM position |
| Mis-extraction (wrong but plausible) | 18% | Navigation confusion, sidebar ads |
| Format error | 9% | Currency/date formatting differences |
| Stack | Mean H (Original) | ΔH After Fixes |
|---|---|---|
| Client-side React (CRA) | 0.62 | 0.41 |
| Next.js (CSR mode) | 0.48 | 0.29 |
| WordPress | 0.31 | 0.18 |
| Static HTML | 0.18 | 0.08 |
| Next.js (SSR/SSG) | 0.22 | 0.11 |
| Shopify | 0.35 | 0.21 |
Before: the original page had pricing in a React component (not in the initial HTML), the company description in a JSON blob inside <script>, and 47KB of inline JavaScript.
Fix: added Offer pricing schema and externalized the 47KB of inline scripts to external bundles.
Before: 404 nav links across 3 mega-menus consumed 35% of page tokens, and the company location appeared only in a Google Maps embed.
Fix: added PostalAddress schema and simplified navigation from 3 mega-menus to a single-tier menu (35% → 8% nav token share).
Before: good content structure, but the page was missing Article schema and had no author attribution in structured data; the byline sat in a plain <span> after 200 lines of navigation.
Fix: added Article schema with an author property, and moved the byline to a prominent <address> element above the fold.
| Category | Definition | Share | Typical DOM Cause |
|---|---|---|---|
| Extraction Omission | Field present in page but extractor returns INSUFFICIENT INFORMATION | 42% | JS rendering, content after heavy nav |
| Hallucination | Extractor returns a value not present in the source HTML | 31% | Script noise, missing JSON-LD anchor |
| Mis-extraction | Wrong but plausible value from nearby content | 18% | Navigation confusion, sidebar ads, footer |
| Format Error | Correct value, wrong format | 9% | Currency symbols, date formats |
For each hallucination or mis-extraction, we identify which DOM region likely caused the error by searching for the hallucinated value's key tokens in the HTML:
| DOM Region | % of Errors Attributed | Median Token Share |
|---|---|---|
| Inline <script> | 34% | 28% of page tokens |
| Navigation (<nav>) | 22% | 15% of page tokens |
| Footer | 16% | 8% of page tokens |
| Sidebar / aside | 14% | 5% of page tokens |
| Header (non-h1) | 9% | 4% of page tokens |
| Main content region | 5% | 40% of page tokens |
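Attribution can be sketched as a token-overlap search: for each hallucinated value, score every DOM region by how many of the value's tokens its text contains, and blame the best match. In this sketch, `regions` is a precomputed dict mapping region name to extracted text (real region extraction via an HTML parser is omitted):

```python
import re

def attribute_error(hallucinated_value: str, regions: dict[str, str]) -> str:
    """Attribute a hallucinated value to the DOM region whose text shares
    the most tokens with it. Returns the best-matching region name."""
    tokens = set(re.findall(r"[a-z0-9]+", hallucinated_value.lower()))

    def overlap(text: str) -> int:
        region_tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        return len(tokens & region_tokens)

    return max(regions, key=lambda name: overlap(regions[name]))
```

For the motivating example, a hallucinated "$49/month" traces back to the region containing the competitor's ad copy rather than the main content.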
Prioritized remediations ranked by impact on ΔH, derived from our experimental data:
| # | Action | Impact on ΔH | Effort |
|---|---|---|---|
| 1 | Add Organization + Product JSON-LD | −0.15 H | Low |
| 2 | Externalize inline scripts (>1KB) | −0.12 H | Medium |
| 3 | Move <main> content before nav/sidebar in DOM | −0.08 H | Medium |
| 4 | Enable SSR for JS-rendered content | −0.18 H | High |
| 5 | Simplify navigation to single-tier menu | −0.05 H | Low |
| 6 | Add factual meta description (150+ chars) | −0.04 H | Low |
| 7 | Remove duplicate footer links | −0.03 H | Low |
| 8 | Add Article/FAQPage schema for content pages | −0.06 H | Low |
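As a concrete instance of remediation #1, a combined Organization + Product JSON-LD block might look like the following. The values are placeholders taken from the ground-truth example; validate real markup with a structured-data testing tool before shipping.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "name": "Acme Corp",
      "email": "support@acme.com",
      "address": {
        "@type": "PostalAddress",
        "addressLocality": "San Francisco",
        "addressRegion": "CA"
      }
    },
    {
      "@type": "Product",
      "name": "Acme Pro",
      "offers": {
        "@type": "Offer",
        "price": "99",
        "priceCurrency": "USD"
      }
    }
  ]
}
</script>
```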
```shell
git clone https://github.com/nicobailon/seodiff
cd seodiff
pip install -r requirements.txt   # for validation scripts
go run main.go serve              # start local API

# Run hallucination check on any domain:
curl "http://localhost:8080/api/v1/hallucination-test?domain=stripe.com" | python -m json.tool

# Run paired comparison:
python scripts/hallucination_paired_test.py --domains data/sample_50.csv
```
We publish paired sample size, 95% confidence intervals, and exact prompt text. All raw artifacts are archived and available for audit. We do not claim these results generalize to all LLMs — results are validated against our deterministic simulator and Llama 3 8B. We encourage replication with alternative models.
Each variant (original and optimized) is tested 3× per domain. We aggregate runs by majority vote for categorical classifications (correct/hallucinated/omitted) and take the median H value across runs for numerical aggregation. Run-to-run variability was low: the mean absolute deviation across 3 runs was 0.02 H (max observed: 0.06 H on one domain with non-deterministic A/B test scripts). Domains with run variability exceeding 0.10 H trigger manual review (0 domains flagged in this study).
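The aggregation rule can be sketched with the stdlib (the per-field label names are illustrative; whether the tool measures run variability exactly as mean absolute deviation from the median is an assumption):

```python
from collections import Counter
from statistics import median

def aggregate_runs(field_labels_per_run, h_per_run, review_threshold=0.10):
    """Majority vote per field across runs; median H across runs.

    `field_labels_per_run`: one dict per run, mapping
    field -> label ('correct' / 'hallucinated' / 'omitted').
    `h_per_run`: the H value measured on each run.
    Returns (voted labels, median H, needs_manual_review flag).
    """
    fields = field_labels_per_run[0].keys()
    voted = {
        f: Counter(run[f] for run in field_labels_per_run).most_common(1)[0][0]
        for f in fields
    }
    med = median(h_per_run)
    mad = sum(abs(h - med) for h in h_per_run) / len(h_per_run)
    return voted, med, mad > review_threshold
```

With three runs, a label must appear at least twice to win the vote, which absorbs one-off flakes from nondeterministic page scripts.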
Two independent annotators label each domain's ground truth JSON. Disagreements are resolved by: (1) checking the live page source, (2) consulting the Wayback Machine for temporal consistency, and (3) escalating to a third reviewer for ambiguous cases.
Final inter-annotator agreement: Cohen's κ = 0.91 (near-perfect agreement).
With N=50 paired observations and an observed effect size of d=1.44 (mean ΔH=0.26, pooled sd=0.18), post-hoc power exceeds 0.99 for the primary comparison (α=0.05, two-sided paired t-test). Subgroup analyses by ACRI bucket have correspondingly lower power, since each bucket contributes only 15–20 pairs.
A prospective power analysis for detecting ΔH=0.10 (a smaller, clinically meaningful effect) with 80% power would require N≥68 paired observations. Our N=50 is thus well-powered for the observed effect sizes but would be underpowered for detecting sub-bucket effects smaller than ΔH≈0.12.
We validated the primary results using two additional extraction approaches to assess model sensitivity:
| Extraction Method | Mean H (Original) | Mean ΔH | Correlation with Llama 3 8B |
|---|---|---|---|
| Llama 3 8B (primary) | 0.41 | 0.26 | — |
| SEODiff Deterministic Simulator | 0.38 | 0.24 | r = 0.89 |
| GPT-4o-mini (5-domain spot check) | 0.35 | 0.22 | r = 0.91 (n=5) |
The deterministic simulator's H scores correlate at r=0.89 with Llama 3 8B, confirming that structural signals (not model-specific quirks) drive the effect. The GPT-4o-mini spot check on 5 domains showed broadly consistent direction and magnitude, though a full replication across commercial models is left for future work.
Token count validation: we validated our 4-char heuristic against tiktoken (cl100k_base encoding) on a stratified sample of 15 domains. Mean absolute error: 8.3% (sd=4.1%). For English-language pages the heuristic overestimates by ~5%; for CJK-heavy pages it underestimates by ~15%. All analyses use the heuristic for consistency; we report the MAE for transparency.
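The heuristic itself is trivial; one common form is shown below (whether the repo rounds up or truncates is an assumption, and the MAE figures above quantify its error against tiktoken either way):

```python
def estimate_tokens(text: str) -> int:
    """Approximate LLM token count as ceil(len/4). Versus cl100k_base this
    overestimates ~5% on English pages and underestimates ~15% on CJK-heavy
    pages, per the validation sample."""
    return -(-len(text) // 4)  # ceiling division without importing math
```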
Structural noise in HTML is not just a technical debt issue — it is a brand safety risk. When AI models hallucinate facts about your company, the consequences range from lost sales (wrong pricing) to legal liability (incorrect claims) to reputational damage (misattributed products).
Our data shows that this risk is measurable (ΔH), predictable from a page's ACRI score, and largely fixable with low-to-medium-effort structural remediations.
We encourage replication of this study across different model families and languages. The full protocol, prompts, and evaluation code are available at github.com/nicobailon/seodiff.
Run a free Hallucination Check on your domain. See exactly what AI gets wrong — and get prioritized fixes.
Run Free Hallucination Check →
No sign-up required · Measures extraction vulnerability (no LLM API calls) · Results in seconds