The Science of ACRI

How Technical Structure Predicts AI Retrieval — A Shadow RAG Calibration Study on 926 Domains

SEODiff Research 2026-02-19 v1.0


926 domains indexed · 4630 queries tested · Spearman ρ = 0.629 · 5.6× peak tier lift

Executive Summary

Key Finding: In a controlled Shadow RAG experiment indexing 926 domains and running 4630 queries, we found a Spearman ρ = 0.629 correlation between ACRI and retrieval success (Recall@10), with a 95% bootstrap confidence interval of [0.597, 0.658]. Pages in the peak ACRI tier (70–89, grade B/A) were retrieved 5.6× more often than those in the lowest tier (0–29).

Conservative interpretation: these results show a strong association between ACRI and retrieval in our Shadow RAG pilot; further work is required to confirm transfer to commercial embedding stacks and multi-page indices.

1. Introduction & Motivation

The Crisis of Token Noise

The modern web was built for browsers, not for AI. JavaScript-heavy frameworks, CSS-in-JS, client-side rendering, and bloated HTML create a signal-to-noise disaster for LLM context windows. When an AI system like ChatGPT, Perplexity, or a custom RAG pipeline ingests a web page, it doesn't see what a human sees — it sees raw HTML, scripts, and noise.

This matters because Retrieval-Augmented Generation (RAG) systems now power a growing share of information retrieval. These systems:

  1. Embed web content into vector representations
  2. Store these vectors in a database (FAISS, Pinecone, etc.)
  3. Retrieve the most relevant vectors for a user query
  4. Generate an answer from the retrieved content
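The four steps above can be sketched end to end as a toy pipeline. The bag-of-words "embedding" below is a deliberately crude stand-in for a real model (this study uses all-MiniLM-L6-v2, and the vocabulary and documents are purely illustrative); only the retrieval mechanics match the steps listed.

```python
import numpy as np

VOCAB = ["widgets", "store", "produce", "delivery", "buy", "organic"]

def toy_embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding over a fixed vocabulary (stand-in for a real model)."""
    tokens = text.lower().split()
    vec = np.array([float(tokens.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1-2. Embed web content and store the vectors in an "index"
docs = ["acme widgets online store", "fresh organic produce delivery"]
index = np.stack([toy_embed(d) for d in docs])

# 3. Retrieve the most relevant vector for a user query (cosine similarity)
query = toy_embed("where to buy widgets")
scores = index @ query
best = int(np.argmax(scores))   # 4. generation would consume docs[best] as context
```

A real pipeline swaps `toy_embed` for a sentence-transformer model and the matrix multiply for a FAISS index, but the shape of the computation is the same.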

If your content is bloated, poorly structured, or inaccessible to AI crawlers, it will produce low-quality embeddings, rank poorly in retrieval, and be excluded from AI-generated answers. Google ranking ≠ AI visibility.

Why ACRI?

The AI-Crawler Reality Index (ACRI) was designed to measure how "AI-visible" a website truly is. Unlike traditional SEO scores that focus on human-facing signals, ACRI quantifies the four pillars that determine whether AI systems can extract, understand, and cite your content.

But until now, ACRI has been a theoretical framework. This study puts it to an empirical test: does ACRI actually predict retrieval success?

2. The ACRI Framework

ACRI is a composite score (0–100) computed as a weighted geometric mean of four pillar sub-scores:

ACRI = E^0.35 · S^0.25 · C^0.20 · R^0.20

Weights derived from calibration studies on 100k+ domains (see Section 6).
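As a quick sanity check, the weighted geometric mean can be computed directly. The pillar scores below are illustrative, not from the study:

```python
import math

WEIGHTS = {"E": 0.35, "S": 0.25, "C": 0.20, "R": 0.20}

def acri(pillars: dict) -> float:
    """Weighted geometric mean of the four pillar sub-scores (each 0-100)."""
    return math.prod(pillars[p] ** w for p, w in WEIGHTS.items())

balanced = acri({"E": 70, "S": 70, "C": 70, "R": 70})   # weights sum to 1 -> 70.0
lopsided = acri({"E": 95, "S": 95, "C": 95, "R": 10})   # one weak pillar drags it down
```

Because the weights sum to 1, a uniformly balanced site scores exactly its pillar value (70.0 here), while three excellent pillars cannot compensate for one failing pillar: the lopsided example lands near 60.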

ACRI as a Four-Pillar System

E — Extractability
Can bots read core content?
S — Semantic Structure
Can embeddings map intent?
C — Content Integrity
Is information unique and useful?
R — Retrieval Robustness
Will chunks rank for real queries?

Outcome: AI Visibility = all four pillars working together.

Pillar | Weight | Description | Key Signals
E — Extractability | 35% | Can AI extract clean, structured content? | Token Bloat, Ghost Ratio (JS dependency), Meta Completeness, Bot Access, Schema
S — Semantic Structure | 25% | Does content map into LLM embeddings? | Structure Density, Semantic Orphan Rate, Link Graph Health
C — Content Integrity | 20% | Unique, non-thin, information-rich? | Thin Content Rate, Duplicate Rate, Content Uniqueness
R — Retrieval Robustness | 20% | Chunks into LLM-friendly units? | Chunk Quality, Cluster Density, Hub Count

The geometric mean penalizes weak areas: a site with excellent schema but terrible token bloat will score lower than one with balanced, moderate scores across all pillars. This reflects the reality that AI systems need all signals to work together.

Glossary of Proprietary Metrics

  • Ghost Ratio: The percentage of crucial semantic HTML (headings, links, text) that is missing before JavaScript rendering. High ghost ratio means AI crawlers that don't execute JS will see a blank page.
  • Semantic Orphan Rate: The percentage of pages that lack clear semantic relationships (internal links, breadcrumbs, schema) to the rest of the site, making them hard for AI to contextualize.
  • Token Bloat: The ratio of raw HTML/CSS/JS tokens to actual content tokens. High bloat wastes LLM context windows and dilutes the semantic signal.
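The Token Bloat definition can be made concrete with a trivial sketch. Whitespace splitting stands in for a real tokenizer here, and the HTML snippet is purely illustrative:

```python
def token_bloat(raw_html: str, extracted_text: str) -> float:
    """Token bloat = raw-payload tokens / content tokens.
    A whitespace split is a crude stand-in for a real tokenizer."""
    raw = len(raw_html.split())
    content = len(extracted_text.split())
    return raw / max(content, 1)

raw = "<div class='a b c'> <script>var x = 1;</script> <h1>Our Product</h1> <p>Buy now.</p> </div>"
text = "Our Product Buy now."
bloat = token_bloat(raw, text)   # raw payload is several times the content
```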

The "Golden Semantic String"

When an AI system ingests a page, it attempts to extract the core meaning. We call this the Golden Semantic String. It is the canonical text an AI system would extract (title, meta description, H1, H2s, first 300–600 words, JSON-LD).
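A minimal extractor in this spirit can be sketched with Python's standard-library html.parser. This is a simplified stand-in for the actual SEODiff pipeline, which also pulls the meta description, body text, and JSON-LD:

```python
from html.parser import HTMLParser

class GoldenStringParser(HTMLParser):
    """Collect title/h1/h2 text, ignoring scripts, styles, and layout markup."""
    KEEP = {"title", "h1", "h2"}

    def __init__(self):
        super().__init__()
        self._open = None          # tag currently being captured
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.parts.append(data.strip())

page = ("<html><head><title>Acme: Our Product</title></head><body>"
        "<div class='flex-col'><script>window.__S__={}</script>"
        "<h1 class='text-2xl'>Our Product</h1><p>Buy now.</p></div></body></html>")
parser = GoldenStringParser()
parser.feed(page)
golden = " | ".join(parser.parts)   # "Acme: Our Product | Our Product"
```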

Before (Signal-to-Noise Disaster)

<div class="flex-col w-full">
  <script>window.__INITIAL_STATE__={...}</script>
  <style>.btn-primary{color:blue;}</style>
  <!-- 400 lines of SVG icons -->
  <h1 class="text-2xl font-bold">Our Product</h1>
  <p>Buy now.</p>
</div>
    

After (Golden Semantic String)

# Our Product
Buy now.

[Schema: Product, Brand: SEODiff]
    

Grade Scale

Score | Grade | Interpretation
90–100 | A+ | AI-optimized — high retrieval probability
80–89 | A | Strong AI presence — minor improvements possible
70–79 | B | Adequate — some signals need attention
55–69 | C | At risk — significant gaps in AI visibility
40–54 | D | Poor — likely invisible to AI systems
0–39 | F | Critical — fundamental issues preventing AI access

3. Methodology: Shadow RAG Calibration

Reproducibility note: all code, sample data, and pipeline scripts are published at github.com/seodiff/research. This pilot uses all-MiniLM-L6-v2; robustness checks with stronger open-source and commercial embedding models are tracked in the V3 roadmap.

3.1 What is a Shadow RAG?

A Shadow RAG is a controlled, private replica of the retrieval pipelines used by AI systems like Perplexity, ChatGPT (Browse), and SearchGPT. By building our own retrieval system and testing it against known ground-truth, we can measure exactly how well ACRI predicts real-world retrieval success — without relying on opaque third-party APIs.

3.2 Dataset & Sampling

Corpus: 926 domains sampled from the SEODiff Radar database (100k+ crawled domains).

Stratified sampling: Equal representation across 5 ACRI buckets (0–29, 30–49, 50–69, 70–89, 90–100) to prevent authority bias and ensure balanced coverage.

Page unit: The homepage or primary product page per domain, represented by its "Golden Semantic String" — the canonical text an AI system would extract (title, meta description, H1, H2s, first 300–600 words, JSON-LD). Note: This study evaluates the homepage as the primary entry point. Future work will expand to multi-page indexing (e.g., docs, product pages).

3.3 Embedding Pipeline

  1. Text extraction: For each domain, the Golden Semantic String is extracted (title | meta description | H1 | H2 headings | content body).
  2. Embedding model: all-MiniLM-L6-v2 (sentence-transformers). Free, open-source, and widely used in RAG research. Produces 384-dimensional embeddings that approximate the behavior of commercial embedding APIs.
  3. Normalization: All embeddings are L2-normalized to enable cosine similarity via inner product search.
  4. Vector index: FAISS (Facebook AI Similarity Search) with flat inner-product index for exact nearest-neighbor search.
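Step 3's rationale is that, once vectors are L2-normalized, the inner product a flat IP index computes is exactly cosine similarity. A quick numpy check, with random 384-d vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=384)           # two raw 384-dimensional "embeddings"
b = rng.normal(size=384)

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
inner = l2_normalize(a) @ l2_normalize(b)   # what a flat inner-product index scores
assert np.isclose(cosine, inner)
```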

3.4 Query Battery

We generated 4630 queries (5 per domain, across 3 query types) that reflect real-world LLM usage patterns:

Query Type | Example | Purpose
Entity (1 per domain) | "What is [domain]?" | Brand recognition — tests direct name retrieval
Topical (3 per domain) | "best [category] platform for [term]" | Content quality test — no domain name in query
Comparison (1 per domain) | "which [category] tool is best for [term]" | Competitive topical discovery — no domain name

Each query has an explicit ground-truth mapping to the correct domain, enabling precise measurement of retrieval accuracy.
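A deterministic generator of this shape might look as follows. The templates and field names are illustrative, not the exact ones used in query_battery.csv:

```python
TEMPLATES = {
    "entity":     ["What is {domain}?"],
    "topical":    ["best {category} platform for {term}",
                   "how to choose a {category} tool for {term}",
                   "{category} solutions for {term}"],
    "comparison": ["which {category} tool is best for {term}"],
}

def build_battery(domain: str, category: str, term: str) -> list:
    """5 queries per domain (1 entity, 3 topical, 1 comparison), each with ground truth."""
    return [
        {"type": qtype,
         "query": tpl.format(domain=domain, category=category, term=term),
         "target": domain}                      # explicit ground-truth mapping
        for qtype, templates in TEMPLATES.items()
        for tpl in templates
    ]

queries = build_battery("example.com", "analytics", "small teams")
```

Because the templates are fixed, the query-to-target mapping is fully reproducible without any randomness.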

3.5 Evaluation Metrics

Recall@K = (queries where ground-truth is in top K) / (total queries)
MRR = (1/N) · Σ_i (1 / rank_i)
Retrieval Score per domain = average Recall@10 across queries mapped to that domain

We compute Recall@1, Recall@5, and Recall@10, plus Mean Reciprocal Rank (MRR). Statistical significance is assessed via Spearman ρ with 1,000-iteration bootstrap confidence intervals.
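These metrics reduce to a few lines once each query's ground-truth rank is known (None meaning the target never appeared in the results). The ranks below are hypothetical:

```python
def recall_at_k(ranks: list, k: int) -> float:
    """Fraction of queries whose ground-truth page appears in the top k results."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr(ranks: list) -> float:
    """Mean reciprocal rank; unretrieved targets contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# 1-based rank of the ground-truth domain for five hypothetical queries
ranks = [1, 3, None, 12, 2]
r_at_10 = recall_at_k(ranks, 10)   # 3 of 5 queries hit the top-10
score = mrr(ranks)                 # (1 + 1/3 + 0 + 1/12 + 1/2) / 5
```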

4. Experimental Results

4.1 Global Metrics

Metric | Value
Total Queries | 4630
MRR (Mean Reciprocal Rank) | 0.1910
Recall@1 | 0.0585
Recall@5 | 0.1464
Recall@10 | 0.2104

4.2 Correlation: ACRI → Retrieval


Figure 1: Scatter plot of ACRI score vs. Recall@10 with regression line. Each dot represents one domain. Higher ACRI scores correlate with higher retrieval success.

Statistic | Value | Interpretation
Spearman ρ | 0.6294 | Strong positive rank correlation
Spearman p-value | 2.64e-103 | Highly significant (p < 0.001)
95% Bootstrap CI | [0.5971, 0.6585] | Bootstrap confidence interval (n = 1000)
Pearson r | 0.5570 | Linear correlation for comparison
Partial ρ (controlling for Tranco rank) | 0.6294 | ACRI effect after removing domain-authority influence — technical structure beats brand authority

Interpretation: A Spearman ρ of 0.629 means that ACRI score is a strong predictor of RAG retrieval success. Domains with higher ACRI scores are systematically retrieved more often and at higher rank positions when queried in a controlled RAG environment.

4.3 Traditional Authority vs AI Visibility


Figure 2: Traditional authority proxy (log10 Tranco rank) vs Recall@10. Circled points are high-authority domains with low AI retrieval, supporting the claim that Google ranking ≠ AI visibility.

4.4 Metrics by Query Type

Query Type | Count | MRR | Recall@1 | Recall@5 | Recall@10
comparison | 926 | 0.0714 | 0.0022 | 0.0464 | 0.1015
entity | 926 | 0.4202 | 0.2592 | 0.5486 | 0.6717
topical | 2778 | 0.0868 | 0.0104 | 0.0457 | 0.0929

Figure 3: Mean Reciprocal Rank broken down by query type. Entity queries have the highest retrieval success, while comparison queries are more challenging.

5. Ablation Study: Feature Importance

To understand which ACRI sub-signals matter most for retrieval, we performed an ablation study. For each signal, we residualized ACRI (removed the linear contribution of that signal via linear regression) and recomputed the Spearman ρ with Recall@10. The difference (Δρ) indicates how much that signal contributes to the overall ACRI–retrieval correlation.

Signal | ρ Without | Δρ | Importance
Bot Access (GPTBot) | 0.4003 | +0.2291 | 0.2291
Schema Coverage | 0.4887 | +0.1407 | 0.1407
JS Content Risk (Ghost Ratio) | 0.5082 | +0.1211 | 0.1211
Content Depth (Word Count) | 0.5216 | +0.1078 | 0.1078
Token Bloat | 0.5641 | +0.0653 | 0.0653

Base Spearman ρ: 0.6294. Higher Δρ = more important for retrieval prediction.


Figure 4: Feature importance from ablation study. Bot Access (GPTBot) and Schema Coverage have the largest impact on ACRI's ability to predict retrieval success.

Actionable insight: If you can only fix one thing, ensure AI crawlers (GPTBot, ClaudeBot) are not blocked — this binary signal had the largest Δρ in our ablation. Beyond that, add structured schema markup (the second-highest Δρ), then reduce JS content risk (server-side render critical content to lower the ghost ratio). Token bloat, while still worth reducing, ranked lowest in our ablation — tackle the high-impact signals first.

6. ACRI Tier Analysis

We split domains into five ACRI tiers and compared their average retrieval success.


Figure 5: Average Recall@10 by ACRI tier. Domains in the peak tier (70–89, B/A) are retrieved 5.6× more often than those in the bottom tier (0–29). The slight dip in the 90–100 tier is likely attributable to its smaller sample size (n = 126 vs n = 200).

ACRI Tier | Domains | Avg Recall@10 | Avg MRR | Lift vs. Lowest
0–29 (F/D) | 200 | 0.064 | 0.143 | 1.0×
30–49 (D) | 200 | 0.103 | 0.157 | 1.6×
50–69 (C) | 200 | 0.249 | 0.182 | 3.9×
70–89 (B/A) | 200 | 0.361 | 0.217 | 5.6×
90–100 (A+) | 126 | 0.313 | 0.227 | 4.9×

The headline number: Domains in the peak ACRI tier (70–89, B/A) are retrieved 5.6× more often than domains with ACRI < 30. This is empirical evidence that technical AI-readiness directly affects whether your content appears in AI-generated answers.

Note: The 90–100 tier shows slightly lower Recall@10 (0.313) than the 70–89 tier (0.361). With a smaller sample in the top tier (126 vs 200 domains), this inversion is within statistical noise. The monotonic trend across the first four tiers is clear and robust.
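The lift column reduces to a one-liner given the per-tier averages from the table above:

```python
# Avg Recall@10 per ACRI tier, as reported in Section 6
avg_recall = {"0-29": 0.064, "30-49": 0.103, "50-69": 0.249,
              "70-89": 0.361, "90-100": 0.313}

baseline = avg_recall["0-29"]                                   # lowest tier
lift = {tier: round(r / baseline, 1) for tier, r in avg_recall.items()}
# lift["70-89"] -> 5.6, the headline number
```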

7. Practical Implications

For Engineers & SEOs

  1. Allow AI Crawlers: Don't block GPTBot, ClaudeBot, or Google-Extended in robots.txt unless you have specific legal reasons. Our ablation showed this is the single most impactful signal — blocking crawlers means zero retrieval.
  2. Add Structured Schema: Implement JSON-LD for Organization, Product, BreadcrumbList, FAQ, and Article. Schema coverage was the #2 signal in our ablation, helping embedding models map your content into the right semantic neighborhood.
  3. Server-Side Render: Ensure your content is available in the initial HTML response, not hidden behind JavaScript rendering. Ghost ratio should be < 0.1.
  4. Write Clean H1/Meta: The title, meta description, and H1 form the "Golden Semantic String" — the first thing embeddings encode. Make them specific and unique.
  5. Reduce Token Bloat: Minimize HTML payload. Remove inline scripts, unused CSS, and redundant markup. Target a token bloat ratio < 5×.
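Point 1 can be verified programmatically with the standard library. The robots.txt content below is a hypothetical example of a site that blocks GPTBot while allowing everyone else:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot, allows all other user agents
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

gptbot_ok = rp.can_fetch("GPTBot", "https://example.com/")        # blocked
browser_ok = rp.can_fetch("Mozilla/5.0", "https://example.com/")  # allowed
```

In practice you would point `RobotFileParser.set_url` at your live robots.txt and check each AI crawler's user agent the same way.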

For Product Teams Using ACRI

  • Retrieval Score: We now ship a per-domain Retrieval Score (0–100) derived from Shadow RAG calibration. This appears on domain report pages.
  • ACRI Weight Tuning: The ablation results inform ongoing adjustments to ACRI sub-metric weights, ensuring the score stays calibrated to real retrieval behavior.
  • Promote List: Domains with high ACRI but low current visibility are automatically recommended for pre-rendering and internal link promotion.
  • AI Preview Accuracy: The AI Answer Preview feature is now powered by the same retrieval logic validated in this study, making hallucination warnings more accurate.

Checklist: Improve Your ACRI in 30 Minutes

Action | Impact | Effort
Unblock GPTBot in robots.txt | +5–20 ACRI points | 2 min
Add Organization + WebSite schema | +5–15 ACRI points | 10 min
Remove render-blocking scripts | +3–10 ACRI points | 15 min
Add meta description + unique H1 | +2–8 ACRI points | 5 min
Compress HTML / remove inline CSS | +2–8 ACRI points | 15 min

8. Limitations & Future Work

  • Model heterogeneity: Real AI systems use different, often proprietary embedding models. Our results with all-MiniLM-L6-v2 (a 384-dimensional model) may not perfectly transfer to OpenAI's text-embedding-3-large or Google's retrieval stack. However, the fundamental principles of token bloat and semantic structure apply universally. Future work will test OpenAI embeddings to confirm cross-model robustness.
  • Single-page per domain: We indexed one page per domain (the homepage). A multi-page index per domain would provide richer signals.
  • Query generation: Our queries are synthetically generated from templates. Real user queries have more variation and ambiguity.
  • Temporal drift: Web content changes daily. Our snapshot represents a single point in time. Longitudinal studies would reveal drift patterns.
  • Confounding variables: While we control for domain authority (Tranco rank) via partial correlation, other confounders (content length, domain age) may exist. Note: the partial correlation (controlling for log Tranco rank) closely matches the raw Spearman ρ, which indicates ACRI and domain authority are largely independent in this sample — a property that should be validated on production crawl data where popular domains may systematically differ in technical readiness.
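The partial-correlation control described above can be reproduced with a rank-based residualization. The synthetic data below is an assumption for illustration: retrieval is driven by ACRI, and the authority covariate is independent by construction, so the partial ρ stays close to the raw ρ, mirroring what we observed in the study.

```python
import numpy as np

def ranks(v: np.ndarray) -> np.ndarray:
    """Rank transform (ties not handled; fine for continuous data)."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v), dtype=float)
    return r

def partial_spearman(x, y, z) -> float:
    """Spearman correlation of x and y after removing the rank-linear effect of z."""
    rx, ry, rz = ranks(x), ranks(y), ranks(z)
    Z = np.column_stack([np.ones_like(rz), rz])
    res_x = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
    res_y = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

rng = np.random.default_rng(0)
acri_score = rng.uniform(0, 100, 500)
authority = rng.uniform(0, 1, 500)              # independent of ACRI by construction
recall10 = acri_score + rng.normal(0, 20, 500)  # retrieval driven by ACRI + noise
rho_partial = partial_spearman(acri_score, recall10, authority)
```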

9. Reproducibility & Data

Experiment Configuration

Corpus Size:     926 domains
Queries:         4630
Embedding Model: all-MiniLM-L6-v2
FAISS Index:     flat_ip (exact cosine similarity)
Recall@K:        K = 1, 5, 10
Bootstrap:       1,000 iterations
Random Seed:     42
Timestamp:       2026-02-19T21:18:21Z
  

How to Reproduce

# 1. Install dependencies
cd shadow_rag/
pip install -r requirements.txt

# 2. Export domain data (needs API access or local JSONL)
python data_export.py

# 3. Run the full pipeline
python run_pipeline.py

# 4. Or run individual steps
python embedding.py          # Embed + index
python query_generator.py    # Generate queries
python evaluation.py         # Run retrieval + stats
python charts.py             # Generate charts
python whitepaper.py         # Render this document

# 5. One-command reproducibility run
sh run_pipeline.sh

# 6. Open Figure 1 source data + chart
python -c "import pandas as pd; print(pd.read_csv('output/domain_retrieval_scores.csv').head())"

Public reproducibility repo: github.com/seodiff/research

Library Versions

sentence-transformers  5.2.3
faiss-cpu              1.13.2
numpy                  2.4.1
scipy                  1.17.0
pandas                 2.2.3
scikit-learn           1.8.0

10. Appendix: Ground Truth & Residualization

10.1 Ground-Truth Mapping Rules

  1. Entity queries: exact domain match is the target page.
  2. Topical queries: target page is the sampled canonical page for that domain (homepage/product page used in this pilot).
  3. Comparison queries: target is the domain assigned in query_battery.csv at generation time.
  4. Ambiguous cases: if multiple pages are plausible, we resolve to the page with the highest lexical overlap with the query intent terms.
  5. Determinism: query-to-target mappings are generated from deterministic templates and fixed random seed.

Sample JSON lines and mapping scripts are included in the reproducibility repository.

10.2 Ablation Δρ Residualization

To compute Δρ for each signal, we regress ACRI on that signal, take the residuals, and recompute the Spearman correlation between the residualized ACRI and the Retrieval Score.

ACRI = β_0 + β_1 · signal + ε,   ACRI_resid = ε
Δρ = ρ(ACRI, RetrievalScore) − ρ(ACRI_resid, RetrievalScore)
# runnable sketch (acri, retrieval, signal are numpy arrays of per-domain values)
import numpy as np
from scipy.stats import spearmanr

def residualize(y, x):  # residuals of y after a linear regression on x
    return y - np.polyval(np.polyfit(x, y, 1), x)

rho_base = spearmanr(acri, retrieval).correlation
acri_resid = residualize(acri, signal)
rho_without_signal = spearmanr(acri_resid, retrieval).correlation
delta_rho = rho_base - rho_without_signal

About the Authors

SEODiff is the industry standard for AI Visibility analytics. We help engineering and marketing teams measure, monitor, and improve how their content is retrieved by LLMs (ChatGPT, Perplexity, Claude, SearchGPT).

This research was conducted by the SEODiff Data Science team to validate the ACRI framework, which powers our core auditing platform. Unlike black-box SEO tools, we believe in radical transparency: our methodology, datasets, and calibration studies are open for peer review.

Where do you sit on the curve?

Every day your site is blocked or bloated, you are invisible to the next generation of search. Run a free, instant Shadow RAG simulation on your domain now.

