Tool: Training Data Auditor

Checks if your domain appears in the three pillars of AI training data: Common Crawl, Wikipedia, and Reddit.

What it does

AI models such as GPT-4, Claude, and Gemini are widely understood to draw heavily on three kinds of public data: Common Crawl (web pages), Wikipedia (encyclopedic knowledge), and Reddit (community discussion). This tool checks your domain's presence in all three.

The three pillars

1. Common Crawl

One of the largest public web archives. SEODiff queries the Common Crawl index (CDX) API to check whether your domain's pages have been crawled and archived. A domain present in Common Crawl is more likely to appear in AI training data.

The tool returns the number of indexed pages, most recent crawl date, and a sample of archived URLs.
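
To make the lookup above concrete, here is a minimal sketch of querying the public Common Crawl index server and reducing the response to the fields listed. It is not SEODiff's implementation: the crawl ID, the function names, and the shape of the summary are assumptions chosen for illustration, and page_count here only counts the records returned, which is capped by the limit parameter.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumption: one crawl ID picked for illustration. The current list of
# crawls is published at https://index.commoncrawl.org/collinfo.json.
CRAWL_ID = "CC-MAIN-2024-10"


def build_cdx_url(domain, crawl_id=CRAWL_ID, limit=50):
    """Build an index query matching every captured URL under the domain."""
    params = urllib.parse.urlencode(
        {"url": f"{domain}/*", "output": "json", "limit": limit}
    )
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def summarize_cdx(lines, sample_size=3):
    """Summarize index records, which arrive as one JSON object per line."""
    records = [json.loads(line) for line in lines if line.strip()]
    timestamps = [r["timestamp"] for r in records]
    return {
        "found": bool(records),
        "page_count": len(records),  # capped by the query's limit
        # Timestamps are YYYYMMDDhhmmss strings, so max() picks the latest.
        "last_crawl_date": max(timestamps) if timestamps else None,
        "sample_urls": [r["url"] for r in records[:sample_size]],
    }


def fetch_cdx(domain):
    """Fetch and summarize index records (performs a network request)."""
    try:
        with urllib.request.urlopen(build_cdx_url(domain), timeout=30) as resp:
            return summarize_cdx(resp.read().decode("utf-8").splitlines())
    except urllib.error.HTTPError as err:
        # The index server answers 404 when a crawl has no captures.
        if err.code == 404:
            return summarize_cdx([])
        raise
```

Calling fetch_cdx("example.com") would return a dictionary with found, page_count, last_crawl_date, and sample_urls. Checking several crawl IDs rather than one gives a fuller picture, since a domain can be absent from a single crawl.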

2. Wikipedia

The most trusted knowledge source for AI training. SEODiff queries the Wikipedia MediaWiki API to search for articles that mention your domain, brand name, or products. A Wikipedia mention is a strong signal that AI systems "know about" your entity.

Returns: matching article titles, snippets mentioning your domain, and whether your domain is linked as an external reference.
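
The Wikipedia check can be sketched with two MediaWiki API queries: list=search for articles mentioning a term, and list=exturlusage for pages that link out to a domain. The code below is an illustrative sketch, not SEODiff's implementation; the function names, the choice of the English-language wiki, and the output shape are assumptions.

```python
import html
import re
import urllib.parse

# Assumption: English Wikipedia; other language editions use other hosts.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(term, limit=10):
    """Full-text search for articles mentioning the term."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def build_external_link_url(domain, limit=10):
    """Pages whose external links point at the domain."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "eulimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def parse_search(payload):
    """Reduce a list=search response to title, snippet, and url."""
    articles = []
    for hit in payload.get("query", {}).get("search", []):
        # Snippets contain <span class="searchmatch"> highlight markup.
        snippet = html.unescape(re.sub(r"<[^>]+>", "", hit.get("snippet", "")))
        title = hit["title"]
        page = urllib.parse.quote(title.replace(" ", "_"))
        articles.append({
            "title": title,
            "snippet": snippet,
            "url": f"https://en.wikipedia.org/wiki/{page}",
        })
    return {"found": bool(articles), "articles": articles}


def has_external_link(payload):
    """True if a list=exturlusage response lists at least one page."""
    return bool(payload.get("query", {}).get("exturlusage", []))
```

Searching for a quoted domain string and for the brand name separately tends to be more informative than a single query, since articles often mention a brand without spelling out its URL.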

3. Reddit

Major source of conversational training data. SEODiff queries Reddit's public API to find discussions mentioning your domain. Reddit mentions influence how AI systems perceive your brand in conversational contexts.

Returns: subreddits where your domain is discussed, top posts/comments mentioning you, and overall mention count.
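
As a rough sketch of how the Reddit fields above could be derived, the code below builds a search URL against Reddit's JSON search listing and aggregates a listing response into subreddits, top posts, and a mention count. The function names and output shape are assumptions, not SEODiff's implementation. Reddit's access rules may require OAuth credentials and a descriptive User-Agent, which this sketch leaves out, and the count reflects only the posts in one listing page.

```python
from collections import Counter
import urllib.parse


def build_reddit_search_url(domain, limit=25):
    """Search posts for the quoted domain string, highest scoring first."""
    params = urllib.parse.urlencode(
        {"q": f'"{domain}"', "sort": "top", "limit": limit}
    )
    return f"https://www.reddit.com/search.json?{params}"


def summarize_listing(listing, top_n=3):
    """Aggregate a Reddit listing (data.children[].data) into a summary."""
    posts = [c["data"] for c in listing.get("data", {}).get("children", [])]
    subreddit_counts = Counter(p["subreddit"] for p in posts)
    top = sorted(posts, key=lambda p: p.get("score", 0), reverse=True)[:top_n]
    return {
        "found": bool(posts),
        "mention_count": len(posts),  # posts in this listing page only
        # Subreddits ordered by how many matching posts they contain.
        "subreddits": [name for name, _ in subreddit_counts.most_common()],
        "top_posts": [
            {
                "title": p["title"],
                "score": p.get("score", 0),
                "url": "https://www.reddit.com" + p["permalink"],
            }
            for p in top
        ],
    }
```

Following the listing's pagination cursor would be needed for a fuller mention count; the single-page version is kept here for brevity.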

Why this matters

If your domain doesn't appear in any of these sources, AI systems likely have limited knowledge about your brand. This means:

AI chatbots are less likely to recommend your products
AI search engines are less likely to cite your content
Competitors with a stronger training data presence are better placed to capture AI-driven traffic

API endpoint

GET /api/training-data?domain=example.com
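
A call to this endpoint might look like the sketch below. Only the path and the domain query parameter come from the endpoint above; the host name is a placeholder, the function names are hypothetical, and any authentication the service may require is not shown.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder host. Substitute the real base URL of the service.
BASE_URL = "https://seodiff.example"


def training_data_url(domain, base_url=BASE_URL):
    """Build the GET URL for the training-data endpoint."""
    query = urllib.parse.urlencode({"domain": domain})
    return f"{base_url}/api/training-data?{query}"


def fetch_training_data(domain, base_url=BASE_URL):
    """Request the report and decode the JSON body (network request)."""
    url = training_data_url(domain, base_url)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

Building the query string with urlencode rather than string concatenation keeps unusual input, such as internationalized domains, safely escaped.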

JSON output

common_crawl — found, page_count, last_crawl_date, sample_urls
wikipedia — found, articles array with title, snippet, url
reddit — found, mention_count, subreddits, top_posts
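
For illustration, a response with these fields could be consumed as in the sketch below. The field names follow the list above, but every value, the date format, and the nesting of each object are invented placeholders, not output from the real service. The helper name pillars_found is likewise hypothetical.

```python
import json

PILLARS = ("common_crawl", "wikipedia", "reddit")

# Assumption: made-up sample values that only mirror the documented fields.
SAMPLE = json.loads("""
{
  "common_crawl": {
    "found": true,
    "page_count": 2,
    "last_crawl_date": "2024-03-01",
    "sample_urls": ["https://example.com/", "https://example.com/about"]
  },
  "wikipedia": {"found": false, "articles": []},
  "reddit": {
    "found": true,
    "mention_count": 1,
    "subreddits": ["SEO"],
    "top_posts": []
  }
}
""")


def pillars_found(report):
    """Return the names of the pillars whose 'found' flag is true."""
    return [name for name in PILLARS if report.get(name, {}).get("found")]


def coverage_summary(report):
    """One-line summary such as '2/3 pillars: common_crawl, reddit'."""
    found = pillars_found(report)
    return f"{len(found)}/{len(PILLARS)} pillars: {', '.join(found) or 'none'}"
```

Treating a missing pillar key the same as found being false keeps the helper tolerant of partial responses, which is a design choice of this sketch rather than documented behavior.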