How Technical Structure Predicts AI Retrieval — A Shadow RAG Calibration Study on 926 Domains
Key Finding: In a controlled Shadow RAG experiment indexing 926 domains and running 4630 queries, we found a Spearman ρ = 0.629 correlation between ACRI and retrieval success (Recall@10). The 95% bootstrap confidence interval is [0.597, 0.658]. Pages in the peak ACRI tier (70–89, B/A) were retrieved 5.6× more often than those in the lowest tier (0–29).
Conservative interpretation: these results show a strong association between ACRI and retrieval in our Shadow RAG pilot; further work is required to confirm transfer to commercial embedding stacks and multi-page indices.
The modern web was built for browsers, not for AI. JavaScript-heavy frameworks, CSS-in-JS, client-side rendering, and bloated HTML create a signal-to-noise disaster for LLM context windows. When an AI system like ChatGPT, Perplexity, or a custom RAG pipeline ingests a web page, it doesn't see what a human sees — it sees raw HTML, scripts, and noise.
This matters because Retrieval-Augmented Generation (RAG) systems now power a growing share of information retrieval. These systems crawl your pages, embed them as vectors, rank them by similarity to user queries, and cite only what they successfully retrieve.
If your content is bloated, poorly structured, or inaccessible to AI crawlers, it will produce low-quality embeddings, rank poorly in retrieval, and be excluded from AI-generated answers. Google ranking ≠ AI visibility.
The AI-Crawler Reality Index (ACRI) was designed to measure how "AI-visible" a website truly is. Unlike traditional SEO scores that focus on human-facing signals, ACRI quantifies the four pillars that determine whether AI systems can extract, understand, and cite your content.
But until now, ACRI has been a theoretical framework. This study provides the empirical proof: does ACRI actually predict retrieval success?
ACRI is a composite score (0–100) computed as a weighted geometric mean of four pillar sub-scores:
Weights derived from calibration studies on 100k+ domains (see Section 6).
Outcome: AI Visibility = all four pillars working together.
| Pillar | Weight | Description | Key Signals |
|---|---|---|---|
| E — Extractability | 35% | Can AI extract clean, structured content? | Token Bloat, Ghost Ratio (JS dependency), Meta Completeness, Bot Access, Schema |
| S — Semantic Structure | 25% | Does content map into LLM embeddings? | Structure Density, Semantic Orphan Rate, Link Graph Health |
| C — Content Integrity | 20% | Unique, non-thin, information-rich? | Thin Content Rate, Duplicate Rate, Content Uniqueness |
| R — Retrieval Robustness | 20% | Chunks into LLM-friendly units? | Chunk Quality, Cluster Density, Hub Count |
The geometric mean penalizes weak areas: a site with excellent schema but terrible token bloat will score lower than one with balanced, moderate scores across all pillars. This reflects the reality that AI systems need all signals to work together.
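As a toy illustration of that penalty, here is a minimal sketch of the weighted geometric mean using the pillar weights from the table above (the production formula may normalize or clamp differently; names here are illustrative):

```python
import math

# Pillar weights from the table above (sum to 1.0); illustrative only.
WEIGHTS = {"E": 0.35, "S": 0.25, "C": 0.20, "R": 0.20}

def acri_score(pillars: dict) -> float:
    """Weighted geometric mean of pillar sub-scores (each on a 0-100 scale)."""
    # Product of score^weight: one weak pillar drags the whole score down,
    # unlike an arithmetic mean, which a single strong pillar can mask.
    return math.prod(pillars[p] ** w for p, w in WEIGHTS.items())

balanced = acri_score({"E": 70, "S": 70, "C": 70, "R": 70})   # exactly 70.0
lopsided = acri_score({"E": 95, "S": 90, "C": 85, "R": 20})   # ~67, despite better pillars elsewhere
```

The lopsided site averages 76.75 under a weighted arithmetic mean but only about 67 geometrically: the weak R pillar dominates the outcome.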
When an AI system ingests a page, it attempts to extract the core meaning. We call this the Golden Semantic String. It is the canonical text an AI system would extract (title, meta description, H1, H2s, first 300–600 words, JSON-LD).
| Score | Grade | Interpretation |
|---|---|---|
| 90–100 | A+ | AI-optimized — high retrieval probability |
| 80–89 | A | Strong AI presence — minor improvements possible |
| 70–79 | B | Adequate — some signals need attention |
| 55–69 | C | At risk — significant gaps in AI visibility |
| 40–54 | D | Poor — likely invisible to AI systems |
| 0–39 | F | Critical — fundamental issues preventing AI access |
Reproducibility note: all code, sample data, and pipeline scripts are published at github.com/seodiff/research. This pilot uses all-MiniLM-L6-v2; robustness checks with stronger open-source and commercial embeddings are tracked in the V3 roadmap.
A Shadow RAG is a controlled, private replica of the retrieval pipelines used by AI systems like Perplexity, ChatGPT (Browse), and SearchGPT. By building our own retrieval system and testing it against known ground-truth, we can measure exactly how well ACRI predicts real-world retrieval success — without relying on opaque third-party APIs.
Corpus: 926 domains sampled from the SEODiff Radar database (100k+ crawled domains).
Stratified sampling: Equal representation across 5 ACRI buckets (0–29, 30–49, 50–69, 70–89, 90–100) to prevent authority bias and ensure balanced coverage.
Page unit: The homepage or primary product page per domain, represented by its "Golden Semantic String" — the canonical text an AI system would extract (title, meta description, H1, H2s, first 300–600 words, JSON-LD). Note: This study evaluates the homepage as the primary entry point. Future work will expand to multi-page indexing (e.g., docs, product pages).
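A minimal sketch of assembling such a string from pre-extracted fields follows. Field names are illustrative; the actual extractor parses HTML and JSON-LD, which is elided here.

```python
def golden_semantic_string(page: dict, max_words: int = 600) -> str:
    """Concatenate the canonical AI-visible fields of a page (hypothetical keys)."""
    parts = [
        page.get("title", ""),
        page.get("meta_description", ""),
        page.get("h1", ""),
        " ".join(page.get("h2s", [])),
        " ".join(page.get("body_text", "").split()[:max_words]),  # first 300-600 words
        page.get("json_ld", ""),
    ]
    # Drop empty fields so missing metadata doesn't leave stray whitespace
    return " ".join(p for p in parts if p)

gss = golden_semantic_string({
    "title": "Acme Analytics",
    "meta_description": "Self-serve product analytics.",
    "h1": "Analytics for startups",
    "h2s": ["Pricing", "Features"],
    "body_text": "Acme helps teams understand user behavior.",
})
```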
Embedding model: all-MiniLM-L6-v2 (sentence-transformers). It is free, open-source, and widely used in RAG research, and produces 384-dimensional
embeddings that serve as a reasonable proxy for commercial embedding APIs.

We generated 4630 queries (5 per domain, across 3 query types) that reflect real-world LLM usage patterns:
| Query Type | Example | Purpose |
|---|---|---|
| Entity (1 per domain) | "What is [domain]?" | Brand recognition — tests direct name retrieval |
| Topical (3 per domain) | "best [category] platform for [term]" | Content quality test — no domain name in query |
| Comparison (1 per domain) | "which [category] tool is best for [term]" | Competitive topical discovery — no domain name |
Each query has an explicit ground-truth mapping to the correct domain, enabling precise measurement of retrieval accuracy.
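The battery above can be sketched as a small template function. The wording is illustrative; the real generator and its category/term inputs live in query_generator.py.

```python
def build_queries(domain: str, category: str, terms: list[str]):
    """5 queries per domain: 1 entity, 3 topical, 1 comparison."""
    queries = [("entity", f"What is {domain}?")]
    # Topical and comparison templates deliberately omit the domain name
    queries += [("topical", f"best {category} platform for {t}") for t in terms[:3]]
    queries.append(("comparison", f"which {category} tool is best for {terms[0]}"))
    # Attach the ground-truth domain to every query
    return [(qtype, text, domain) for qtype, text in queries]

battery = build_queries("example.com", "analytics", ["startups", "ecommerce", "SaaS"])
```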
We compute Recall@1, Recall@5, and Recall@10, plus Mean Reciprocal Rank (MRR). Statistical significance is assessed via Spearman ρ with 1,000-iteration bootstrap confidence intervals.
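For concreteness, here is a hedged sketch of the two metric families on toy rankings (the pipeline computes these in evaluation.py; the helper names are ours):

```python
def recall_at_k(ranked: list, truth, k: int) -> float:
    """1.0 if the ground-truth domain appears in the top k results, else 0.0."""
    return float(truth in ranked[:k])

def reciprocal_rank(ranked: list, truth) -> float:
    """1/rank of the ground-truth domain; 0.0 if it was not retrieved."""
    return 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0

# Two toy queries: truth retrieved at rank 2, then truth not retrieved at all
results = [(["b.com", "a.com", "c.com"], "a.com"),
           (["x.com", "y.com", "z.com"], "a.com")]

mrr = sum(reciprocal_rank(r, t) for r, t in results) / len(results)      # (0.5 + 0) / 2 = 0.25
recall1 = sum(recall_at_k(r, t, 1) for r, t in results) / len(results)   # 0.0
recall5 = sum(recall_at_k(r, t, 5) for r, t in results) / len(results)   # 0.5
```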
| Metric | Value |
|---|---|
| Total Queries | 4630 |
| MRR (Mean Reciprocal Rank) | 0.1910 |
| Recall@1 | 0.0585 |
| Recall@5 | 0.1464 |
| Recall@10 | 0.2104 |
Figure 1: Scatter plot of ACRI score vs. Recall@10 with regression line. Each dot represents one domain. Higher ACRI scores correlate with higher retrieval success.
| Statistic | Value | Interpretation |
|---|---|---|
| Spearman ρ | 0.6294 | Strong positive rank correlation |
| Spearman p-value | 2.64e-103 | Highly significant (p < 0.001) |
| 95% Bootstrap CI | [0.5971, 0.6585] | Bootstrap confidence interval (n=1000) |
| Pearson r | 0.5570 | Linear correlation for comparison |
| Partial ρ (ACRI, controlling for Tranco rank) | 0.6294 | ACRI effect survives after removing domain-authority influence: technical structure beats brand authority |
Interpretation: A Spearman ρ of 0.629 means that ACRI score is a strong predictor of RAG retrieval success. Domains with higher ACRI scores are systematically retrieved more often and at higher rank positions when queried in a controlled RAG environment.
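The bootstrap procedure behind the confidence interval can be sketched on synthetic per-domain data. The real inputs are the 926 paired ACRI scores and Recall@10 values; scipy's `spearmanr` is the same call the pipeline uses, but the arrays below are fabricated stand-ins.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
# Synthetic stand-ins for the per-domain arrays (NOT the study's data)
acri = rng.uniform(0, 100, size=926)
recall10 = np.clip(0.004 * acri + rng.normal(0, 0.08, size=926), 0, 1)

boot = []
for _ in range(1000):
    idx = rng.integers(0, len(acri), size=len(acri))  # resample domains with replacement
    boot.append(spearmanr(acri[idx], recall10[idx])[0])

lo, hi = np.percentile(boot, [2.5, 97.5])  # 95% percentile interval
```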
Figure 2: Traditional authority proxy (log10 Tranco rank) vs Recall@10. Circled points are high-authority domains with low AI retrieval, supporting the claim that Google ranking ≠ AI visibility.
| Query Type | Count | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|
| comparison | 926 | 0.0714 | 0.0022 | 0.0464 | 0.1015 |
| entity | 926 | 0.4202 | 0.2592 | 0.5486 | 0.6717 |
| topical | 2778 | 0.0868 | 0.0104 | 0.0457 | 0.0929 |
Figure 3: Mean Reciprocal Rank broken down by query type. Entity queries have the highest retrieval success, while comparison queries are more challenging.
To understand which ACRI sub-signals matter most for retrieval, we performed an ablation study. For each signal, we residualized ACRI (removed the linear contribution of that signal via linear regression) and recomputed the Spearman ρ with Recall@10. The difference (Δρ) indicates how much that signal contributes to the overall ACRI–retrieval correlation.
| Signal | ρ Without | Δρ | Importance |
|---|---|---|---|
| Bot Access (GPTBot) | 0.4003 | +0.2291 | 0.2291 |
| Schema Coverage | 0.4887 | +0.1407 | 0.1407 |
| JS Content Risk (Ghost Ratio) | 0.5082 | +0.1211 | 0.1211 |
| Content Depth (Word Count) | 0.5216 | +0.1078 | 0.1078 |
| Token Bloat | 0.5641 | +0.0653 | 0.0653 |
Base Spearman ρ: 0.6294. Higher Δρ = more important for retrieval prediction.
Figure 4: Feature importance from the ablation study. Bot Access and Schema Coverage have the largest impact on ACRI's ability to predict retrieval success.
Actionable insight: if you can only fix one thing, ensure AI crawlers (GPTBot, ClaudeBot) are not blocked; this binary signal had the largest Δρ in our ablation. Next, add structured schema markup (the second-highest Δρ above), then reduce JS content risk by server-side rendering critical content to lower the ghost ratio. Token bloat, while still worth reducing, ranked lowest in our ablation, so tackle the high-impact signals first.
We split domains into five ACRI tiers and compared their average retrieval success.
Figure 5: Average Recall@10 by ACRI tier. Domains in the peak tier (70–89, B/A) are retrieved 5.6× more often than those in the bottom tier (0–29). The slight dip in the 90–100 tier reflects its smaller sample size (n=126 vs. n=200).
| ACRI Tier | Domains | Avg Recall@10 | Avg MRR | Lift vs. Lowest |
|---|---|---|---|---|
| 0-29 (F/D) | 200 | 0.064 | 0.143 | 1.0× |
| 30-49 (D) | 200 | 0.103 | 0.157 | 1.6× |
| 50-69 (C) | 200 | 0.249 | 0.182 | 3.9× |
| 70-89 (B/A) | 200 | 0.361 | 0.217 | 5.6× |
| 90-100 (A+) | 126 | 0.313 | 0.227 | 4.9× |
The headline number: domains in the peak ACRI tier (70–89, B/A) are retrieved 5.6× more often than domains with ACRI < 30. This is strong empirical evidence that technical AI-readiness directly affects whether your content appears in AI-generated answers.
Note: The 90–100 tier shows slightly lower Recall@10 (0.313) than the 70–89 tier (0.361). With a smaller sample in the top tier (126 vs 200 domains), this inversion is within statistical noise. The monotonic trend across the first four tiers is clear and robust.
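The lift column can be reproduced directly from the table (the values below are copied from it):

```python
# Avg Recall@10 per ACRI tier, copied from the table above
tier_recall = {"0-29": 0.064, "30-49": 0.103, "50-69": 0.249,
               "70-89": 0.361, "90-100": 0.313}

base = tier_recall["0-29"]                                   # lowest tier as baseline
lift = {tier: round(v / base, 1) for tier, v in tier_recall.items()}
# lift == {"0-29": 1.0, "30-49": 1.6, "50-69": 3.9, "70-89": 5.6, "90-100": 4.9}
```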
| Done | Action | Impact | Effort |
|---|---|---|---|
| ☐ | Unblock GPTBot in robots.txt | +5–20 ACRI points | 2 min |
| ☐ | Add Organization + WebSite schema | +5–15 ACRI points | 10 min |
| ☐ | Remove render-blocking scripts | +3–10 ACRI points | 15 min |
| ☐ | Add meta description + unique H1 | +2–8 ACRI points | 5 min |
| ☐ | Compress HTML / remove inline CSS | +2–8 ACRI points | 15 min |
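To verify the first checklist item, the standard-library robots parser can test whether a given robots.txt shuts out AI crawlers. The agent tokens below are the commonly documented ones; check each vendor's documentation for the current list.

```python
from urllib import robotparser

# Example robots.txt that blocks GPTBot but allows everyone else
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

gptbot_blocked = not rp.can_fetch("GPTBot", "/")   # True: invisible to ChatGPT's crawler
claudebot_ok = rp.can_fetch("ClaudeBot", "/")      # True: falls through to the * rule
```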
Results obtained with all-MiniLM-L6-v2 (a 384-dimensional model) may not perfectly transfer to OpenAI's text-embedding-3-large or Google's retrieval stack. However, the fundamental principles of token bloat and semantic structure apply universally. Future work will test OpenAI embeddings to confirm cross-model robustness.

Experiment configuration:

- Corpus Size: 926 domains
- Queries: 4630
- Embedding Model: all-MiniLM-L6-v2
- FAISS Index: flat_ip (exact cosine similarity)
- Recall@K: K = 1, 5, 10
- Bootstrap: 1,000 iterations
- Random Seed: 42
- Timestamp: 2026-02-19T21:18:21Z
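The flat_ip index performs exact inner-product search; over L2-normalized embeddings, inner product equals cosine similarity. A NumPy stand-in makes the retrieval math explicit (random vectors replace the real MiniLM embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(v):
    """L2-normalize along the last axis so dot product == cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

doc_emb = normalize(rng.normal(size=(926, 384)))   # one 384-dim vector per domain
query = normalize(rng.normal(size=(1, 384)))       # one embedded query

scores = (query @ doc_emb.T)[0]      # cosine similarity against every domain
top10 = np.argsort(-scores)[:10]     # the ids an exact inner-product search returns
```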
```shell
# 1. Install dependencies
cd shadow_rag/
pip install -r requirements.txt

# 2. Export domain data (needs API access or local JSONL)
python data_export.py

# 3. Run the full pipeline
python run_pipeline.py

# 4. Or run individual steps
python embedding.py        # Embed + index
python query_generator.py  # Generate queries
python evaluation.py       # Run retrieval + stats
python charts.py           # Generate charts
python whitepaper.py       # Render this document

# 5. One-command reproducibility run
sh run_pipeline.sh

# 6. Open Figure 1 source data + chart
python -c "import pandas as pd; print(pd.read_csv('output/domain_retrieval_scores.csv').head())"
```
Public reproducibility repo: github.com/seodiff/research
- `output/sampled_corpus.csv` — Stratified domain sample with ACRI scores
- `output/shadow_rag.faiss` — FAISS vector index
- `output/query_battery.csv` — Full query set with ground truth
- `output/domain_retrieval_scores.csv` — Per-domain Retrieval Score
- `output/experiment_results.json` — Complete results with all metrics
- `output/logs/retrieval_results.csv` — Raw retrieval logs for every query
- `output/charts/*.png` — Publication-quality charts

Environment: sentence-transformers 5.2.3, faiss-cpu 1.13.2, numpy 2.4.1, scipy 1.17.0, pandas 2.2.3, scikit-learn 1.8.0
Ground-truth query-to-domain mappings are frozen into query_battery.csv at generation time. Sample JSON lines and mapping scripts are included in the reproducibility repository.
To compute Δρ for each signal, we regress ACRI on that signal, take the residuals, and recompute Spearman ρ between the residualized ACRI and the Retrieval Score.
```python
# Toy ablation, made runnable: acri, signal, retrieval are per-domain arrays
import numpy as np
from scipy.stats import spearmanr

def residualize(y, x):
    # Least-squares removal of x's linear contribution from y
    X = np.column_stack([np.ones_like(x), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

rho_base = spearmanr(acri, retrieval)[0]
rho_without_signal = spearmanr(residualize(acri, signal), retrieval)[0]
delta_rho = rho_base - rho_without_signal   # the signal's contribution
```
SEODiff is the industry standard for AI Visibility analytics. We help engineering and marketing teams measure, monitor, and improve how their content is retrieved by LLMs (ChatGPT, Perplexity, Claude, SearchGPT).
This research was conducted by the SEODiff Data Science team to validate the ACRI framework, which powers our core auditing platform. Unlike black-box SEO tools, we believe in radical transparency: our methodology, datasets, and calibration studies are open for peer review.
Every day your site is blocked or bloated, you are invisible to the next generation of search. Run a free, instant Shadow RAG simulation on your domain now.
Check Your ACRI Score

Free analysis • No credit card required • Instant results