How Technical Structure Predicts AI Retrieval — A Shadow RAG Calibration Study on 926 Domains
Key Finding: In a controlled Shadow RAG experiment indexing 926 domains and running 4630 queries, we found a Spearman ρ = 0.629 correlation between ACRI and retrieval success (Recall@10). The 95% bootstrap confidence interval is [0.597, 0.658]. Pages in the peak ACRI tier (70–89, B/A) were retrieved 5.6× more often than those in the lowest tier (0–29).
Conservative interpretation: these results show a strong association between ACRI and retrieval in our Shadow RAG pilot; further work is required to confirm transfer to commercial embedding stacks and multi-page indices.
The modern web was built for browsers, not for AI. JavaScript-heavy frameworks, CSS-in-JS, client-side rendering, and bloated HTML create a signal-to-noise disaster for LLM context windows. When an AI system like ChatGPT, Perplexity, or a custom RAG pipeline ingests a web page, it doesn't see what a human sees — it sees raw HTML, scripts, and noise.
This matters because Retrieval-Augmented Generation (RAG) systems now power a growing share of information retrieval. These systems crawl your pages, embed them as vectors, rank them by similarity to user queries, and cite only what they successfully retrieve.
If your content is bloated, poorly structured, or inaccessible to AI crawlers, it will produce low-quality embeddings, rank poorly in retrieval, and be excluded from AI-generated answers. Google ranking ≠ AI visibility.
The AI-Crawler Reality Index (ACRI) was designed to measure how "AI-visible" a website truly is. Unlike traditional SEO scores that focus on human-facing signals, ACRI quantifies the four pillars that determine whether AI systems can extract, understand, and cite your content.
But until now, ACRI has been a theoretical framework. This study provides the empirical proof: does ACRI actually predict retrieval success?
ACRI is a composite score (0–100) computed as a weighted geometric mean of four pillar sub-scores:
Weights derived from calibration studies on 100k+ domains (see Section 6).
Outcome: AI Visibility = all four pillars working together.
| Pillar | Weight | Description | Key Signals |
|---|---|---|---|
| E — Extractability | 35% | Can AI extract clean, structured content? | Token Bloat, Ghost Ratio (JS dependency), Meta Completeness, Bot Access, Schema |
| S — Semantic Structure | 25% | Does content map into LLM embeddings? | Structure Density, Semantic Orphan Rate, Link Graph Health |
| C — Content Integrity | 20% | Unique, non-thin, information-rich? | Thin Content Rate, Duplicate Rate, Content Uniqueness |
| R — Retrieval Robustness | 20% | Chunks into LLM-friendly units? | Chunk Quality, Cluster Density, Hub Count |
The geometric mean penalizes weak areas: a site with excellent schema but terrible token bloat will score lower than one with balanced, moderate scores across all pillars. This reflects the reality that AI systems need all signals to work together.
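As a toy illustration of that penalty, here is a minimal sketch of the weighted geometric mean using the pillar weights from the table above (the production formula may normalize or clamp differently; names here are illustrative):

```python
import math

# Pillar weights from the table above (sum to 1.0); illustrative only.
WEIGHTS = {"E": 0.35, "S": 0.25, "C": 0.20, "R": 0.20}

def acri_score(pillars: dict) -> float:
    """Weighted geometric mean of pillar sub-scores (each on a 0-100 scale)."""
    # Product of score^weight: one weak pillar drags the whole score down,
    # unlike an arithmetic mean, which a single strong pillar can mask.
    return math.prod(pillars[p] ** w for p, w in WEIGHTS.items())

balanced = acri_score({"E": 70, "S": 70, "C": 70, "R": 70})   # exactly 70.0
lopsided = acri_score({"E": 95, "S": 90, "C": 85, "R": 20})   # ~67, despite better pillars elsewhere
```

The lopsided site averages 76.75 under a weighted arithmetic mean but only about 67 geometrically: the weak R pillar dominates the outcome.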
When an AI system ingests a page, it attempts to extract the core meaning. We call this the Golden Semantic String. It is the canonical text an AI system would extract (title, meta description, H1, H2s, first 300–600 words, JSON-LD).
| Score | Grade | Interpretation |
|---|---|---|
| 90–100 | A+ | AI-optimized — high retrieval probability |
| 80–89 | A | Strong AI presence — minor improvements possible |
| 70–79 | B | Adequate — some signals need attention |
| 55–69 | C | At risk — significant gaps in AI visibility |
| 40–54 | D | Poor — likely invisible to AI systems |
| 0–39 | F | Critical — fundamental issues preventing AI access |
Reproducibility note: all code, sample data, and pipeline scripts are published at github.com/seodiff/research. This pilot uses all-MiniLM-L6-v2; robustness checks with stronger open-source and commercial embeddings are tracked in the V3 roadmap.
A Shadow RAG is a controlled, private replica of the retrieval pipelines used by AI systems like Perplexity, ChatGPT (Browse), and SearchGPT. By building our own retrieval system and testing it against known ground-truth, we can measure exactly how well ACRI predicts real-world retrieval success — without relying on opaque third-party APIs.
Corpus: 926 domains sampled from the SEODiff Radar database (100k+ crawled domains).
Stratified sampling: Equal representation across 5 ACRI buckets (0–29, 30–49, 50–69, 70–89, 90–100) to prevent authority bias and ensure balanced coverage.
Page unit: The homepage or primary product page per domain, represented by its "Golden Semantic String" — the canonical text an AI system would extract (title, meta description, H1, H2s, first 300–600 words, JSON-LD). Note: This study evaluates the homepage as the primary entry point. Future work will expand to multi-page indexing (e.g., docs, product pages).
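A minimal sketch of assembling such a string from pre-extracted fields follows. Field names are illustrative; the actual extractor parses HTML and JSON-LD, which is elided here.

```python
def golden_semantic_string(page: dict, max_words: int = 600) -> str:
    """Concatenate the canonical AI-visible fields of a page (hypothetical keys)."""
    parts = [
        page.get("title", ""),
        page.get("meta_description", ""),
        page.get("h1", ""),
        " ".join(page.get("h2s", [])),
        " ".join(page.get("body_text", "").split()[:max_words]),  # first 300-600 words
        page.get("json_ld", ""),
    ]
    # Drop empty fields so missing metadata doesn't leave stray whitespace
    return " ".join(p for p in parts if p)

gss = golden_semantic_string({
    "title": "Acme Analytics",
    "meta_description": "Self-serve product analytics.",
    "h1": "Analytics for startups",
    "h2s": ["Pricing", "Features"],
    "body_text": "Acme helps teams understand user behavior.",
})
```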
Embedding model: all-MiniLM-L6-v2 (sentence-transformers). It is free, open-source, and widely used in RAG research, and produces 384-dimensional
embeddings that serve as a reasonable proxy for commercial embedding APIs.

We generated 4630 queries (5 per domain, across 3 query types) that reflect real-world LLM usage patterns:
| Query Type | Example | Purpose |
|---|---|---|
| Entity (1 per domain) | "What is [domain]?" | Brand recognition — tests direct name retrieval |
| Topical (3 per domain) | "best [category] platform for [term]" | Content quality test — no domain name in query |
| Comparison (1 per domain) | "which [category] tool is best for [term]" | Competitive topical discovery — no domain name |
Each query has an explicit ground-truth mapping to the correct domain, enabling precise measurement of retrieval accuracy.
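The battery above can be sketched as a small template function. The wording is illustrative; the real generator and its category/term inputs live in query_generator.py.

```python
def build_queries(domain: str, category: str, terms: list[str]):
    """5 queries per domain: 1 entity, 3 topical, 1 comparison."""
    queries = [("entity", f"What is {domain}?")]
    # Topical and comparison templates deliberately omit the domain name
    queries += [("topical", f"best {category} platform for {t}") for t in terms[:3]]
    queries.append(("comparison", f"which {category} tool is best for {terms[0]}"))
    # Attach the ground-truth domain to every query
    return [(qtype, text, domain) for qtype, text in queries]

battery = build_queries("example.com", "analytics", ["startups", "ecommerce", "SaaS"])
```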
We compute Recall@1, Recall@5, and Recall@10, plus Mean Reciprocal Rank (MRR). Statistical significance is assessed via Spearman ρ with 1,000-iteration bootstrap confidence intervals.
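For concreteness, here is a hedged sketch of the two metric families on toy rankings (the pipeline computes these in evaluation.py; the helper names are ours):

```python
def recall_at_k(ranked: list, truth, k: int) -> float:
    """1.0 if the ground-truth domain appears in the top k results, else 0.0."""
    return float(truth in ranked[:k])

def reciprocal_rank(ranked: list, truth) -> float:
    """1/rank of the ground-truth domain; 0.0 if it was not retrieved."""
    return 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0

# Two toy queries: truth retrieved at rank 2, then truth not retrieved at all
results = [(["b.com", "a.com", "c.com"], "a.com"),
           (["x.com", "y.com", "z.com"], "a.com")]

mrr = sum(reciprocal_rank(r, t) for r, t in results) / len(results)      # (0.5 + 0) / 2 = 0.25
recall1 = sum(recall_at_k(r, t, 1) for r, t in results) / len(results)   # 0.0
recall5 = sum(recall_at_k(r, t, 5) for r, t in results) / len(results)   # 0.5
```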
| Metric | Value |
|---|---|
| Total Queries | 4630 |
| MRR (Mean Reciprocal Rank) | 0.1910 |
| Recall@1 | 0.0585 |
| Recall@5 | 0.1464 |
| Recall@10 | 0.2104 |
Figure 1: Scatter plot of ACRI score vs. Recall@10 with regression line. Each dot represents one domain. Higher ACRI scores correlate with higher retrieval success.
| Statistic | Value | Interpretation |
|---|---|---|
| Spearman ρ | 0.6294 | Strong positive rank correlation |
| Spearman p-value | 2.64e-103 | Highly significant (p < 0.001) |
| 95% Bootstrap CI | [0.5971, 0.6585] | Bootstrap confidence interval (n=1000) |
| Pearson r | 0.5570 | Linear correlation for comparison |
| Partial ρ (ACRI, controlling for Tranco rank) | 0.6294 | ACRI effect survives after removing domain-authority influence: technical structure beats brand authority |
Interpretation: A Spearman ρ of 0.629 means that ACRI score is a strong predictor of RAG retrieval success. Domains with higher ACRI scores are systematically retrieved more often and at higher rank positions when queried in a controlled RAG environment.
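The bootstrap procedure behind the confidence interval can be sketched on synthetic per-domain data. The real inputs are the 926 paired ACRI scores and Recall@10 values; scipy's `spearmanr` is the same call the pipeline uses, but the arrays below are fabricated stand-ins.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
# Synthetic stand-ins for the per-domain arrays (NOT the study's data)
acri = rng.uniform(0, 100, size=926)
recall10 = np.clip(0.004 * acri + rng.normal(0, 0.08, size=926), 0, 1)

boot = []
for _ in range(1000):
    idx = rng.integers(0, len(acri), size=len(acri))  # resample domains with replacement
    boot.append(spearmanr(acri[idx], recall10[idx])[0])

lo, hi = np.percentile(boot, [2.5, 97.5])  # 95% percentile interval
```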
Figure 2: Traditional authority proxy (log10 Tranco rank) vs Recall@10. Circled points are high-authority domains with low AI retrieval, supporting the claim that Google ranking ≠ AI visibility.
| Query Type | Count | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|
| comparison | 926 | 0.0714 | 0.0022 | 0.0464 | 0.1015 |
| entity | 926 | 0.4202 | 0.2592 | 0.5486 | 0.6717 |
| topical | 2778 | 0.0868 | 0.0104 | 0.0457 | 0.0929 |
Figure 3: Mean Reciprocal Rank broken down by query type. Entity queries have the highest retrieval success, while comparison queries are more challenging.
To understand which ACRI sub-signals matter most for retrieval, we performed an ablation study. For each signal, we residualized ACRI (removed the linear contribution of that signal via linear regression) and recomputed the Spearman ρ with Recall@10. The difference (Δρ) indicates how much that signal contributes to the overall ACRI–retrieval correlation.
| Signal | ρ Without | Δρ | Importance |
|---|---|---|---|
| Bot Access (GPTBot) | 0.4003 | +0.2291 | 0.2291 |
| Schema Coverage | 0.4887 | +0.1407 | 0.1407 |
| JS Content Risk (Ghost Ratio) | 0.5082 | +0.1211 | 0.1211 |
| Content Depth (Word Count) | 0.5216 | +0.1078 | 0.1078 |
| Token Bloat | 0.5641 | +0.0653 | 0.0653 |
Base Spearman ρ: 0.6294. Higher Δρ = more important for retrieval prediction.
Figure 4: Feature importance from the ablation study. Bot Access and Schema Coverage have the largest impact on ACRI's ability to predict retrieval success.
Actionable insight: if you can only fix one thing, ensure AI crawlers (GPTBot, ClaudeBot) are not blocked; this binary signal had the largest Δρ in our ablation. Next, add structured schema markup (the second-highest Δρ above), then reduce JS content risk by server-side rendering critical content to lower the ghost ratio. Token bloat, while still worth reducing, ranked lowest in our ablation, so tackle the high-impact signals first.
We split domains into five ACRI tiers and compared their average retrieval success.
Figure 5: Average Recall@10 by ACRI tier. Domains in the peak tier (70–89, B/A) are retrieved 5.6× more often than those in the bottom tier (0–29). The slight dip in the 90–100 tier reflects its smaller sample size (n=126 vs. n=200).
| ACRI Tier | Domains | Avg Recall@10 | Avg MRR | Lift vs. Lowest |
|---|---|---|---|---|
| 0-29 (F/D) | 200 | 0.064 | 0.143 | 1.0× |
| 30-49 (D) | 200 | 0.103 | 0.157 | 1.6× |
| 50-69 (C) | 200 | 0.249 | 0.182 | 3.9× |
| 70-89 (B/A) | 200 | 0.361 | 0.217 | 5.6× |
| 90-100 (A+) | 126 | 0.313 | 0.227 | 4.9× |
The headline number: domains in the peak ACRI tier (70–89, B/A) are retrieved 5.6× more often than domains with ACRI < 30. This is strong empirical evidence that technical AI-readiness directly affects whether your content appears in AI-generated answers.
Note: The 90–100 tier shows slightly lower Recall@10 (0.313) than the 70–89 tier (0.361). With a smaller sample in the top tier (126 vs 200 domains), this inversion is within statistical noise. The monotonic trend across the first four tiers is clear and robust.
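The lift column can be reproduced directly from the table (the values below are copied from it):

```python
# Avg Recall@10 per ACRI tier, copied from the table above
tier_recall = {"0-29": 0.064, "30-49": 0.103, "50-69": 0.249,
               "70-89": 0.361, "90-100": 0.313}

base = tier_recall["0-29"]                                   # lowest tier as baseline
lift = {tier: round(v / base, 1) for tier, v in tier_recall.items()}
# lift == {"0-29": 1.0, "30-49": 1.6, "50-69": 3.9, "70-89": 5.6, "90-100": 4.9}
```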
| Done | Action | Impact | Effort |
|---|---|---|---|
| ☐ | Unblock GPTBot in robots.txt | +5–20 ACRI points | 2 min |
| ☐ | Add Organization + WebSite schema | +5–15 ACRI points | 10 min |
| ☐ | Remove render-blocking scripts | +3–10 ACRI points | 15 min |
| ☐ | Add meta description + unique H1 | +2–8 ACRI points | 5 min |
| ☐ | Compress HTML / remove inline CSS | +2–8 ACRI points | 15 min |
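To verify the first checklist item, the standard-library robots parser can test whether a given robots.txt shuts out AI crawlers. The agent tokens below are the commonly documented ones; check each vendor's documentation for the current list.

```python
from urllib import robotparser

# Example robots.txt that blocks GPTBot but allows everyone else
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

gptbot_blocked = not rp.can_fetch("GPTBot", "/")   # True: invisible to ChatGPT's crawler
claudebot_ok = rp.can_fetch("ClaudeBot", "/")      # True: falls through to the * rule
```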
Results obtained with all-MiniLM-L6-v2 (a 384-dimensional model) may not perfectly transfer to OpenAI's text-embedding-3-large or Google's retrieval stack. However, the fundamental principles of token bloat and semantic structure apply universally. Future work will test OpenAI embeddings to confirm cross-model robustness.

Experiment configuration:

- Corpus Size: 926 domains
- Queries: 4630
- Embedding Model: all-MiniLM-L6-v2
- FAISS Index: flat_ip (exact cosine similarity)
- Recall@K: K = 1, 5, 10
- Bootstrap: 1,000 iterations
- Random Seed: 42
- Timestamp: 2026-02-19T21:18:21Z
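The flat_ip index performs exact inner-product search; over L2-normalized embeddings, inner product equals cosine similarity. A NumPy stand-in makes the retrieval math explicit (random vectors replace the real MiniLM embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(v):
    """L2-normalize along the last axis so dot product == cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

doc_emb = normalize(rng.normal(size=(926, 384)))   # one 384-dim vector per domain
query = normalize(rng.normal(size=(1, 384)))       # one embedded query

scores = (query @ doc_emb.T)[0]      # cosine similarity against every domain
top10 = np.argsort(-scores)[:10]     # the ids an exact inner-product search returns
```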
```shell
# 1. Install dependencies
cd shadow_rag/
pip install -r requirements.txt

# 2. Export domain data (needs API access or local JSONL)
python data_export.py

# 3. Run the full pipeline
python run_pipeline.py

# 4. Or run individual steps
python embedding.py        # Embed + index
python query_generator.py  # Generate queries
python evaluation.py       # Run retrieval + stats
python charts.py           # Generate charts
python whitepaper.py       # Render this document

# 5. One-command reproducibility run
sh run_pipeline.sh

# 6. Open Figure 1 source data + chart
python -c "import pandas as pd; print(pd.read_csv('output/domain_retrieval_scores.csv').head())"
```
Public reproducibility repo: github.com/seodiff/research
- `output/sampled_corpus.csv` — Stratified domain sample with ACRI scores
- `output/shadow_rag.faiss` — FAISS vector index
- `output/query_battery.csv` — Full query set with ground truth
- `output/domain_retrieval_scores.csv` — Per-domain Retrieval Score
- `output/experiment_results.json` — Complete results with all metrics
- `output/logs/retrieval_results.csv` — Raw retrieval logs for every query
- `output/charts/*.png` — Publication-quality charts

Environment: sentence-transformers 5.2.3, faiss-cpu 1.13.2, numpy 2.4.1, scipy 1.17.0, pandas 2.2.3, scikit-learn 1.8.0
Ground-truth query-to-domain mappings are frozen into query_battery.csv at generation time. Sample JSON lines and mapping scripts are included in the reproducibility repository.
To compute Δρ for each signal, we regress ACRI on that signal, take the residuals, and recompute Spearman ρ between the residualized ACRI and the Retrieval Score.
```python
# Toy ablation, made runnable: acri, signal, retrieval are per-domain arrays
import numpy as np
from scipy.stats import spearmanr

def residualize(y, x):
    # Least-squares removal of x's linear contribution from y
    X = np.column_stack([np.ones_like(x), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

rho_base = spearmanr(acri, retrieval)[0]
rho_without_signal = spearmanr(residualize(acri, signal), retrieval)[0]
delta_rho = rho_base - rho_without_signal   # the signal's contribution
```
SEODiff is the industry standard for AI Visibility analytics. We help engineering and marketing teams measure, monitor, and improve how their content is retrieved by LLMs (ChatGPT, Perplexity, Claude, SearchGPT).
This research was conducted by the SEODiff Data Science team to validate the ACRI framework, which powers our core auditing platform. Unlike black-box SEO tools, we believe in radical transparency: our methodology, datasets, and calibration studies are open for peer review.
Every day your site is blocked or bloated, you are invisible to the next generation of search. Run a free, instant Shadow RAG simulation on your domain now.
Check Your ACRI Score

Free analysis • No credit card required • Instant results