The 47% Graveyard: 2026 State of AI-Trust and RAG-Readiness
Live Data Crawl v0.4.341 · 1,000,111 domains · Last refreshed: February 24, 2026
The rules of search changed — and almost nobody noticed.
For two decades, SEO professionals built empires on a single metric: Domain Authority. The logic was simple — the more websites that link to you, the higher you rank. It worked because Google's algorithms counted links as votes of confidence.
But in 2024–2026, search underwent its most radical transformation since Google itself launched. ChatGPT, Perplexity, Claude, and Google's AI Overviews don't rank websites — they synthesize answers from them. The paradigm shifted from "which page deserves position #1" to "which pages can I actually read and extract facts from."
This distinction is lethal. A site with a million backlinks and perfect Domain Authority becomes worthless if:
- its robots.txt blocks AI crawlers like GPTBot, ClaudeBot, and PerplexityBot;
- its content only appears after client-side JavaScript executes;
- its markup buries the content in token bloat and unstructured HTML.

We built a system to measure exactly how bad the problem is. We call it Scholar Mode: a pipeline that crawls 1 million domains, classifies them into AI-readable (Tier 1) or AI-invisible (Tier 2), and computes an AI-Trust Score, the first PageRank designed for the AI era.
Traditional SEO and Generative Engine Optimization (GEO) operate on fundamentally different assumptions:
| Dimension | Traditional SEO | GEO (Generative Engine Optimization) |
|---|---|---|
| Core mechanism | Link counting — more links = higher rank | Semantic extraction — can the LLM read and extract facts? |
| Ranking model | PageRank → position #1 through #10 | RAG retrieval → inclusion in synthesized answer |
| Content readability | Googlebot renders JS; content visibility is high | LLM crawlers use lightweight HTTP clients; JS-rendered content is often invisible |
| Link value | All links pass authority equally | Links from AI-invisible sources pass zero authority |
| Failure mode | Low rankings → less traffic | Invisible → zero citations → zero "Answer Share of Voice" |
| Key metric | Domain Authority (DA/DR) | AI-Trust Score |
The critical insight is this: LLMs do not rank, they synthesize. When a user asks ChatGPT "What is the best CRM for small businesses?", the model doesn't sort ten blue links — it reads every document in its retrieved context window and weaves a coherent answer. If your site is in the context window, you might be cited. If it's not, you don't exist.
And whether your site ends up in that context window depends almost entirely on whether AI crawlers can physically extract meaningful content from your HTML. This is what ACRI measures — and it's the gateway to Tier 1.
Every domain first receives an ACRI score (AI-Crawler Reality Index) — a 0–100 composite measuring how well AI crawlers can access, render, and extract content from the site. ACRI is computed from five pillars:
The pillars correspond to the signals examined throughout this report: AI-crawler access in robots.txt, server-side rendering of content, token efficiency, machine-readable structured data, and semantic structure (`<article>`, `<main>`). For the full methodology, see The Science of ACRI.
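Conceptually, ACRI behaves like a weighted average over pillar scores. A toy sketch follows; the pillar names and the equal weights are assumptions for illustration only, since the real pillars and weights are documented in the companion paper:

```python
# Illustrative five-pillar composite. Pillar names and equal weights are
# ASSUMED for this sketch; see "The Science of ACRI" for the real methodology.

PILLARS = ["bot_access", "rendering", "token_efficiency",
           "structured_data", "semantic_structure"]

def acri_score(pillar_scores, weights=None):
    """pillar_scores: dict pillar -> 0-100. Returns a 0-100 composite."""
    # Default to equal weights when none are supplied.
    weights = weights or {p: 1 / len(PILLARS) for p in PILLARS}
    return sum(pillar_scores[p] * weights[p] for p in PILLARS)

example = {"bot_access": 100, "rendering": 90, "token_efficiency": 60,
           "structured_data": 40, "semantic_structure": 80}
score = acri_score(example)  # (100 + 90 + 60 + 40 + 80) / 5 = 74.0
```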
Domains are classified into two tiers based on their ACRI score: Tier 1 (Scholar) domains are AI-readable and participate in the trust graph; Tier 2 (Graveyard) domains are AI-invisible and are excluded from it entirely.
The AI-Trust Score is a modified PageRank that incorporates ACRI as a link-weight multiplier:

T(v) = (1 − d) / N + d · Σ_{u ∈ B(v)} A(u) · T(u) / L(u)

Where:
- T(v): the AI-Trust Score of domain v
- d: the damping factor (0.85)
- N: the number of Tier 1 domains in the graph
- B(v): the set of Tier 1 domains linking to v
- L(u): the number of outbound links from domain u
- A(u): the ACRI multiplier of the linking domain, ACRI(u) / 100
The key difference from standard PageRank: A(u). A link from a site with ACRI 95 passes full equity. A link from ACRI 5 passes almost nothing. A link from a Tier 2 site (Graveyard) passes exactly zero — it's excluded from the computation entirely.
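The iteration is compact enough to sketch in plain Python. This is an illustrative reimplementation under stated assumptions (a toy edge list, dict-based adjacency), not the production pipeline:

```python
# Sketch of ACRI-weighted PageRank: each link's equity is scaled by the
# linking domain's ACRI (A(u) = ACRI(u) / 100). Graph layout and names are
# illustrative assumptions, not the report's actual implementation.

def trust_scores(edges, acri, d=0.85, tol=1e-6, max_iter=100):
    """edges: (source, target) domain pairs, Tier 1 only.
    acri: dict domain -> ACRI score (0-100)."""
    nodes = sorted({n for e in edges for n in e})
    N = len(nodes)
    out_links = {n: [] for n in nodes}
    for u, v in edges:
        out_links[u].append(v)
    T = {n: 1.0 / N for n in nodes}              # standard 1/N initialization
    for _ in range(max_iter):
        new_T = {n: (1 - d) / N for n in nodes}
        for u, targets in out_links.items():
            if not targets:
                continue
            # A(u) scales the equity this domain passes to each target.
            share = d * (acri[u] / 100.0) * T[u] / len(targets)
            for v in targets:
                new_T[v] += share
        delta = sum(abs(new_T[n] - T[n]) for n in nodes)
        T = new_T
        if delta < tol:                          # convergence threshold
            break
    return T

# Toy graph: a high-ACRI domain passes far more equity than a low-ACRI one.
edges = [("a.com", "c.com"), ("a.com", "b.com"), ("b.com", "c.com")]
acri = {"a.com": 95, "b.com": 5, "c.com": 60}
scores = trust_scores(edges, acri)
```

On this toy graph, c.com outranks b.com even though both receive links, because b.com's low ACRI makes its outbound link pass almost nothing.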
| Parameter | Value |
|---|---|
| Total domains crawled | 1,000,111 |
| Tier 1 (Scholar) domains | 532,245 (53.2%) |
| Tier 2 (Graveyard) domains | 467,866 (46.8%) |
| Citation edges analyzed | 865,117 |
| PageRank iterations to convergence | 35 |
| Computation time | 73.9 seconds |
| Trust scores computed | 374,587 |
| Source ranking | Tranco Top 1M (research-grade popularity list) |
The most striking finding from our analysis is what we call the Invisible Giants Paradox: the websites with the highest traditional Domain Authority are among the least visible to AI search engines.
Facebook has one of the highest traditional Domain Authorities in existence. Billions of pages link to it. Yet because of its aggressive bot-blocking and JavaScript-heavy architecture, it scored ACRI 4/100 (Grade F). When users ask ChatGPT or Perplexity for B2B advice, community discussions, or brand research, Facebook is entirely absent from the citations.
Meanwhile, LinkedIn — with its more moderate bot policy and server-rendered content — scored AI-Trust 91.3, making it the #5 most-cited domain in the AI knowledge graph. The gap isn't subtle — it's total.
| Domain | Tranco Rank | ACRI | Grade | AI-Trust | Trust Rank | Tier |
|---|---|---|---|---|---|---|
| twitter.com | #16 | 12 | F | 93.0 | #4 | Scholar |
| linkedin.com | #19 | 55 | C | 91.3 | #5 | Scholar |
| x.com | #58 | 13 | F | 77.7 | #7 | Scholar |
| facebook.com | #7 | 4 | F | — | — | Graveyard |
| instagram.com | #4 | 6 | F | — | — | Graveyard |
| youtube.com | #2 | 8 | F | — | — | Graveyard |
| tiktok.com | #6 | 3 | F | — | — | Graveyard |
Twitter.com and X.com rank highly despite low ACRI scores because they receive enormous volumes of inbound links from Tier 1 domains. The trust metric rewards being cited by readable sources, not being readable yourself — though readability is the gateway to Tier 1 participation.
Facebook, Instagram, YouTube, and TikTok, by contrast, all landed in the Graveyard for the same two reasons: an AI-blocking robots.txt rule and a JavaScript-heavy SPA architecture.
The following table ranks the top 30 domains by AI-Trust Score — the first backlink authority metric built for the generative era. These are the domains that LLMs cite most frequently and most heavily in synthesized answers.
| # | Domain | AI-Trust | ACRI | Grade | Inbound Links | Tranco | Tech Stack |
|---|---|---|---|---|---|---|---|
| 4 | twitter.com | 93.0 | 12 | F | 20,349 | #16 | Express |
| 5 | linkedin.com | 91.3 | 55 | C | 18,545 | #19 | Proprietary |
| 7 | x.com | 77.7 | 13 | F | 7,103 | #58 | Express |
| 12 | github.com | 63.4 | 53 | D | 2,520 | #31 | Rails |
| 14 | google.com | 62.5 | 35 | F | 2,202 | #1 | Blogger |
| 15 | wa.me | 60.9 | 37 | F | 1,545 | #80 | Proprietary |
| 17 | bsky.app | 55.6 | 23 | F | 1,693 | #511 | Proprietary |
| 20 | wordpress.org | 53.5 | 54 | D | 722 | #41 | WordPress |
| 21 | discord.gg | 53.5 | 52 | D | 1,191 | #114 | Webflow |
| 22 | whatsapp.com | 48.8 | 34 | F | 911 | #52 | Wix |
| 25 | generatepress.com | 47.1 | 55 | C | 176 | #2,480 | WordPress |
| 27 | trustpilot.com | 46.1 | 40 | D | 632 | #410 | Next.js |
| 28 | cookiedatabase.org | 44.7 | 44 | D | 433 | #414 | WordPress |
| 32 | bit.ly | 42.2 | 60 | C | 546 | #93 | Gatsby |
| 34 | threads.net | 41.8 | 20 | F | 742 | #1,836 | Express |
| 38 | line.me | 40.5 | 47 | D | 345 | #444 | Gatsby |
| 39 | dzen.ru | 40.3 | 37 | F | 506 | #11 | Next.js |
| 43 | mediawiki.org | 40.1 | 59 | C | 151 | #3,373 | MediaWiki |
| 44 | vimeo.com | 39.2 | 43 | D | 490 | #66 | Next.js |
| 45 | discord.com | 39.0 | 52 | D | 468 | #159 | Webflow |
Full top-200 leaderboard available at seodiff.io/radar/trust-leaderboard. Individual domain reports linked from each row.
GitHub (#12, Trust 63.4): Developer documentation sites excel because they naturally use clean semantic HTML, minimal JavaScript wrappers, and structured Markdown content. This makes them ideal "food" for LLM RAG chunking. GitHub also allows all major AI crawlers.
WordPress.org (#20, Trust 53.5): The WordPress ecosystem defaults to semantic HTML, clean heading hierarchies, and server-side rendering. Sites built on WordPress — when not overloaded with plugins — tend to be highly extractable.
MediaWiki.org (#43, Trust 40.1): Wiki platforms are structurally perfect for AI: clean heading hierarchy, minimal DOM noise, server-rendered content, and explicit semantic relationships between pages. MediaWiki sites score disproportionately well relative to their traffic.
467,866 domains landed in Tier 2. Why? Our analysis categorizes them into four primary failure archetypes:
These sites blocked GPTBot, ClaudeBot, and PerplexityBot in robots.txt out of fear of intellectual property theft — accidentally erasing themselves from the future of search.
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```
The irony is devastating: by blocking AI crawlers to "protect" their content, these sites ensured that AI assistants cannot answer questions about them. When a user asks ChatGPT about their products, ChatGPT doesn't produce a wrong answer — it produces no answer, or worse, cites a competitor instead.
These sites have a DOM so heavy that actual content comprises less than 10% of the total token count. When an LLM's context window fills up with navigation HTML, inline scripts, and tracking pixels, there is no room left for the actual content.
Our data shows a strong correlation between token bloat and AI invisibility:
| Token Bloat Ratio | Mean ACRI | % in Graveyard |
|---|---|---|
| 1–3× (clean) | 68 | 12% |
| 3–6× (moderate) | 42 | 38% |
| 6–10× (heavy) | 24 | 71% |
| 10×+ (extreme) | 11 | 94% |
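To make the ratio concrete, here is a minimal sketch of how a token-bloat ratio could be estimated. The whitespace tokenizer and standard-library HTML parser are crude stand-ins for a real LLM tokenizer and content extractor:

```python
# Sketch: token bloat ratio = tokens in the raw page / tokens in the readable
# text. Whitespace tokenization is an ASSUMED simplification; production
# systems would use an actual LLM tokenizer.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts, self.skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def token_bloat_ratio(html: str) -> float:
    p = TextExtractor()
    p.feed(html)
    content_tokens = sum(len(t.split()) for t in p.parts)
    total_tokens = len(html.split())   # crude proxy for raw-page token count
    return total_tokens / max(content_tokens, 1)

bloated = ("<div class='x'>" * 50
           + "<p>Just ten words of actual content live here on this page</p>"
           + "</div>" * 50)
ratio = token_bloat_ratio(bloated)     # markup dwarfs the actual content
```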
For a deep dive into how token bloat causes LLMs to hallucinate facts, see our Hallucination Risk whitepaper.
Client-Side Rendered (CSR) applications — built with React, Angular, or Vue without server-side rendering — serve an empty <body> tag to standard crawlers. JavaScript must execute before any content appears.
Google eventually renders these pages (after queuing them for their rendering service). AI crawlers generally do not. GPTBot, ClaudeBot, and Perplexity's crawler use lightweight HTTP clients that read the initial HTML response — if it's blank, the page is blank.
Our Ghost Content study found that pure CSR sites have a median ghost ratio of 97% — meaning 97% of their content is invisible to AI crawlers.
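A ghost ratio can be approximated by diffing the text visible in the initial HTML response against the fully rendered text. A minimal sketch, assuming you already have both strings (the rendered text would normally come from a headless browser); the function names are illustrative:

```python
# Sketch: ghost ratio = fraction of rendered words that never appear in the
# initial HTML an AI crawler receives. Regex tag-stripping is an ASSUMED
# simplification of real content extraction.
import re

def visible_text(html: str) -> str:
    """What a lightweight HTTP crawler can read: tags and scripts removed."""
    html = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

def ghost_ratio(initial_html: str, rendered_text: str) -> float:
    crawler_words = set(visible_text(initial_html).split())
    rendered_words = rendered_text.split()
    if not rendered_words:
        return 0.0
    ghost = sum(1 for w in rendered_words if w not in crawler_words)
    return ghost / len(rendered_words)

# Pure CSR page: empty shell, all content injected by JavaScript at runtime.
csr_html = "<html><body><div id='root'></div><script>/*app*/</script></body></html>"
rendered = "Welcome to our product page with detailed pricing information"
ratio = ghost_ratio(csr_html, rendered)  # 1.0: every word is invisible
```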
These sites render content in the initial HTML but fail to provide machine-readable signals: no <article> or <main> elements, no JSON-LD structured data, no clean heading hierarchy. The content is technically present but semantically opaque.
For RAG pipelines that chunk documents by heading structure, a flat DOM with no heading hierarchy produces one giant, unstructured chunk — dramatically reducing retrieval precision. See our Extraction Lab paper for empirical evidence.
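The effect is easy to demonstrate. A minimal heading-based chunker (a simplified stand-in for real RAG splitters) yields multiple retrievable chunks from structured HTML and exactly one undifferentiated chunk from a flat DOM:

```python
# Sketch: split extracted text into chunks at <h1>-<h3> boundaries. The "@@"
# marker scheme is an illustrative simplification of real RAG splitters.
import re

def chunk_by_headings(html: str):
    # Mark heading boundaries, then strip all remaining tags.
    marked = re.sub(r"<h[1-3][^>]*>(.*?)</h[1-3]>", r"\n@@\1\n",
                    html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", marked)
    chunks, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@@"):          # new heading -> start a new chunk
            if current:
                chunks.append(" ".join(current))
            current = [line[2:]]
        elif line:
            current.append(line)
    if current:
        chunks.append(" ".join(current))
    return chunks

structured = ("<h1>Pricing</h1><p>Plans start at $10.</p>"
              "<h2>Enterprise</h2><p>Contact sales.</p>")
flat = ("<div>Pricing</div><div>Plans start at $10.</div>"
        "<div>Enterprise</div><div>Contact sales.</div>")
```

The structured page splits into one chunk per topic; the flat page collapses into a single chunk, so a retriever cannot target just the pricing or just the enterprise content.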
If your domain is in Tier 2, here are the three highest-impact fixes you can implement today:
Open your robots.txt: switch from a "Block All AI" strategy to a "Selective Allow" strategy for search-based LLMs:
```
# Before: panic blocking
User-agent: GPTBot
Disallow: /

# After: selective allow
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /products/
Disallow: /admin/
Disallow: /api/internal/
```
This allows AI search engines to read your public-facing content while protecting internal pages. The alternative — total blocking — means you simply don't exist in ChatGPT and Perplexity.
Wrap your main content in proper semantic elements so RAG pipelines can chunk your documents correctly:
```html
<main role="main">
  <article>
    <h1>Your Primary Heading</h1>
    <p>Your content...</p>
    <h2>Subsection</h2>
    <p>More content...</p>
  </article>
</main>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your Primary Heading",
  "datePublished": "2026-02-24",
  "author": {"@type": "Organization", "name": "Your Brand"}
}
</script>
```
Strict heading hierarchies (H1 → H2 → H3) allow RAG systems to create meaningful document chunks. JSON-LD gives LLMs machine-readable facts without parsing ambiguous HTML.
If your site uses a JavaScript framework (React, Angular, Vue), implement Server-Side Rendering (SSR) or Static Site Generation (SSG) so that the initial HTML contains the actual content:
- Next.js: use `getServerSideProps` or `getStaticProps`
- Nuxt: use `useAsyncData` with SSR mode

Verify your fix: run `curl -s your-site.com | grep -c "your-main-content-text"`. If it returns 0, your content is ghost content.
Don't guess. See exactly how AI models view your site right now.
Run Free AI-Trust Scan → No sign-up required · Checks ACRI, ghost ratio, bot access, and semantic structure · Results in seconds
For enterprise CMOs and VPs of Organic Growth, this data demands a strategic pivot:
Google Analytics doesn't tell you how often ChatGPT cites your brand. Perplexity doesn't send referral headers that GA can parse. Most enterprises have zero visibility into their AI Search traffic — the fastest-growing channel in marketing.
AI-Trust Score is the first metric that measures your share of voice in AI-generated answers. If your competitor's Trust Score is 60 and yours is 12, they are being cited 5× more frequently in every AI-generated answer about your industry.
If 46.8% of all domains are in the AI Graveyard, then roughly half of all the backlinks your SEO team builds point to and from AI-invisible sites. In the generative era, these links pass zero AI-Trust authority.
This means your link-building strategy must now evaluate not just the quantity of backlinks but the ACRI score of the linking domain. A single link from a high-ACRI tech blog is worth more than 100 links from AI-invisible directories.
Only 53.2% of the web is Tier 1. Enterprises that optimize their AI visibility now — while 47% of competitors are still invisible — will establish compounding advantages as AI search market share grows from its current ~15% to a projected 40%+ by 2028.
The domain set is sourced from the Tranco list — a research-grade domain popularity ranking that aggregates data from multiple sources (Alexa, Umbrella, Majestic, Quantcast) to resist manipulation. We seeded the top 1,000,111 domains.
Each domain was crawled using SEODiff's radar scanner with the following probe signals:
ACRI scores are computed as a weighted composite of the five pillars described in Section 3.1. The full scoring methodology, pillar weights, calibration data, and statistical validation (Spearman ρ = 0.629 against Shadow RAG retrieval success) are documented in our companion paper: The Science of ACRI.
Citation edges (domain A links to domain B) are collected from the raw HTML of every crawled page. We extract all <a href="..."> elements, resolve relative URLs, extract unique target domains, and deduplicate at the domain level. Self-links are excluded. The final edge set contains 865,117 unique domain-to-domain edges.
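The steps above can be sketched with only the standard library; the helper names here are illustrative, not the pipeline's actual code:

```python
# Sketch of citation-edge extraction: pull <a href> targets, resolve relative
# URLs, deduplicate at the domain level, and drop self-links.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects every href value from <a> elements."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def citation_edges(source_url: str, html: str):
    """Return the set of (source_domain, target_domain) edges in a page."""
    source_domain = urlparse(source_url).netloc.lower()
    parser = LinkCollector()
    parser.feed(html)
    edges = set()
    for href in parser.hrefs:
        # Resolve relative URLs against the page URL, then keep the domain.
        target = urlparse(urljoin(source_url, href)).netloc.lower()
        if target and target != source_domain:   # exclude self-links
            edges.add((source_domain, target))
    return edges

page = ('<a href="/about">About</a>'
        '<a href="https://github.com/x">Repo</a>'
        '<a href="https://github.com/y">Repo2</a>')
edges = citation_edges("https://example.com/post", page)
# -> {("example.com", "github.com")}: self-link dropped, targets deduplicated
```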
The ACRI-weighted PageRank is computed iteratively with a damping factor of 0.85, standard initialization (1/N), and a convergence threshold of 10⁻⁶. The algorithm converged in 35 iterations (73.9 seconds on a single-node server). Only Tier 1 domains participate in the graph; Tier 2 domains and their edges are excluded.
We encourage independent replication. The API endpoints at api.seodiff.io provide public access to leaderboard data, and every domain's individual report page includes the full scoring breakdown.
Traditional Domain Authority is dead. It is not dying — it is already dead for the 46.8% of the web that is invisible to AI search engines.
The transition from link-based ranking to RAG-based synthesis is not a future prediction — it is the present reality. ChatGPT has 200 million weekly users. Perplexity processes millions of queries daily. Google AI Overviews appear on more than 50% of search results. The shift is here.
And yet, nearly half of the world's most trafficked websites have not adapted. They block the crawlers, they serve empty HTML shells, they drown their content in 10× token bloat. They are invisible — and they don't even know it.
The data is clear. The fix is straightforward. The question is not whether to adapt, but how fast.
Enter any domain and see its AI-Trust rank, ACRI score, and detailed visibility audit — free, ungated, instant.
Scan Your Domain → Already scanned 1,000,111 domains · View the full leaderboard