AI models like GPT-4, Claude, and Gemini are trained on large text corpora, and their exact training sets are not fully disclosed. Three sources commonly cited as major inputs are Common Crawl (web pages), Wikipedia (encyclopedic knowledge), and Reddit (community discussion). This tool checks your domain's presence in all three.
Common Crawl is one of the largest public web archives. SEODiff queries the Common Crawl CDX API to check whether your domain's pages have been crawled and archived. A domain present in Common Crawl is more likely to appear in AI training data.
The tool returns the number of indexed pages, most recent crawl date, and a sample of archived URLs.
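As a rough illustration of this kind of lookup, the sketch below builds a query URL for the public Common Crawl index server and summarizes its newline-delimited JSON output. It is not SEODiff's implementation: the function names, the collection ID `CC-MAIN-2024-10`, and the summary keys (chosen to mirror the fields listed above) are assumptions made for the example.

```python
import json
from urllib.parse import urlencode


def build_cdx_url(domain, collection, limit=50):
    """Build an index query URL for one crawl collection (e.g. CC-MAIN-2024-10)."""
    query = urlencode({
        "url": domain,
        "matchType": "domain",  # include subdomains of the given domain
        "output": "json",       # one JSON record per line
        "limit": limit,
    })
    return f"https://index.commoncrawl.org/{collection}-index?{query}"


def summarize_cdx(ndjson_text, sample_size=3):
    """Reduce newline-delimited capture records to a small summary dict."""
    records = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Skip anything that is not a capture record (e.g. an error message).
        if "url" in record and "timestamp" in record:
            records.append(record)
    timestamps = [r["timestamp"] for r in records]
    return {
        "found": bool(records),
        "page_count": len({r["url"] for r in records}),
        # Timestamps are YYYYMMDDhhmmss strings, so the string max is the latest.
        "last_crawl_date": max(timestamps) if timestamps else None,
        "sample_urls": [r["url"] for r in records[:sample_size]],
    }
```

The count here only reflects the records returned for a single collection and `limit`, so a real check would likely need to page through results or query several crawls.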
Wikipedia is among the most trusted knowledge sources used in AI training. SEODiff queries the Wikipedia MediaWiki API to search for articles that mention your domain, brand name, or products. A Wikipedia mention is a strong signal that AI systems "know about" your entity.
Returns: matching article titles, snippets mentioning your domain, and whether your domain is linked as an external reference.
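A search like this can be approximated with the MediaWiki `action=query&list=search` endpoint. The sketch below is an assumption about how such a lookup might look, not SEODiff's actual code. The helper names and output keys are hypothetical, and the external-reference check is left out (it would need a separate query such as `list=exturlusage`).

```python
import re
from urllib.parse import quote, urlencode

WIKI_API = "https://en.wikipedia.org/w/api.php"


def build_search_url(term, limit=10):
    """Build a full-text search URL for English Wikipedia."""
    return WIKI_API + "?" + urlencode({
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    })


def parse_search(payload):
    """Turn a decoded search response into title/snippet/url entries."""
    hits = payload.get("query", {}).get("search", [])
    articles = []
    for hit in hits:
        title = hit["title"]
        articles.append({
            "title": title,
            # Snippets come back with HTML highlight markup; strip the tags.
            "snippet": re.sub(r"<[^>]+>", "", hit.get("snippet", "")),
            "url": "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_")),
        })
    return {"found": bool(articles), "articles": articles}
```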
Reddit is a major source of conversational training data. SEODiff queries Reddit's public API to find discussions mentioning your domain. Reddit mentions influence how AI systems perceive your brand in conversational contexts.
Returns: subreddits where your domain is discussed, top posts/comments mentioning you, and overall mention count.
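For illustration, the sketch below builds a URL for Reddit's public `search.json` listing and aggregates a decoded response into the fields described above. This is a guess at the general approach, not SEODiff's implementation. The function names are hypothetical, the mention count covers only the posts in one page of results, and real requests typically need a descriptive User-Agent header and are rate limited.

```python
from collections import Counter
from urllib.parse import urlencode


def build_reddit_search_url(domain, limit=25):
    """Build a public search URL for posts mentioning the domain."""
    return "https://www.reddit.com/search.json?" + urlencode({
        "q": domain,
        "limit": limit,
        "sort": "top",
    })


def summarize_listing(listing, top_n=3):
    """Aggregate a decoded Reddit listing into subreddits and top posts."""
    posts = [child["data"] for child in listing.get("data", {}).get("children", [])]
    subreddit_counts = Counter(p["subreddit"] for p in posts)
    ranked = sorted(posts, key=lambda p: p.get("score", 0), reverse=True)
    return {
        "found": bool(posts),
        "mention_count": len(posts),
        # Subreddits ordered by how many matching posts they contain.
        "subreddits": [name for name, _ in subreddit_counts.most_common()],
        "top_posts": [
            {"title": p["title"], "url": "https://www.reddit.com" + p["permalink"]}
            for p in ranked[:top_n]
        ],
    }
```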
If your domain doesn't appear in any of these sources, AI systems likely have limited knowledge about your brand. This means:
GET /api/training-data?domain=example.com
common_crawl: found, page_count, last_crawl_date, sample_urls
wikipedia: found, articles (array of objects with title, snippet, url)
reddit: found, mention_count, subreddits, top_posts
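A minimal client sketch for this endpoint follows. The host `https://seodiff.example` is a placeholder, since the base URL is not given here, and the helper names are hypothetical. The response is assumed to be a JSON object keyed by `common_crawl`, `wikipedia`, and `reddit`, each with a `found` field as listed above. Authentication, if any is required, is not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://seodiff.example"  # placeholder host, replace with the real one


def training_data_url(domain, base_url=BASE_URL):
    """Build the GET /api/training-data URL for a domain."""
    return f"{base_url}/api/training-data?" + urlencode({"domain": domain})


def fetch_training_data(domain, base_url=BASE_URL):
    """Call the endpoint and decode the JSON body (assumed response format)."""
    with urlopen(training_data_url(domain, base_url), timeout=10) as resp:
        return json.load(resp)


def sources_found(report):
    """List which of the three sources reported found = true."""
    return [
        name
        for name in ("common_crawl", "wikipedia", "reddit")
        if report.get(name, {}).get("found")
    ]
```

With a decoded report in hand, `sources_found(report)` gives a quick view of where the domain appears, and an empty list corresponds to the case where the domain shows up in none of the three sources.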