Issue: Similarity Risk Medium

Pages with high text overlap and low information gain — near-duplicates that add little unique value.

What this means

A page has high text similarity with another page on your site, and the unique information it provides (Information Gain Score) is too low to justify its existence as a separate page. The JSON field is similarity_risk.

Detection condition

Triggered when both conditions are met:

SimilarityMaxPercent > 0 — the page was compared and found similar to another page.
Either SimilarityKind == "exact" OR InfoGainScore < 45

The similarity threshold defaults to 92% and is configurable. For programmatic pages, near-duplicate similarity is treated as a signal rather than a verdict: SEODiff counts it as an issue only when the page's InfoGainScore is below 45. Exact duplicates always count, regardless of InfoGainScore.

Suppressed when fewer than 10 pages are sampled.
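
The detection condition above can be sketched as a small predicate. This is only an illustration of the documented rule, not SEODiff's implementation: the PageSimilarity record, the has_similarity_risk function, and the "near" kind value are hypothetical names chosen for the example, while the 45 and 10 cut-offs come from the text.

```python
from dataclasses import dataclass

INFO_GAIN_MIN = 45       # from the text: near-duplicates count when InfoGainScore < 45
MIN_SAMPLED_PAGES = 10   # from the text: suppressed when fewer pages are sampled


@dataclass
class PageSimilarity:
    """Hypothetical record; fields mirror the names used in this page."""
    similarity_max_percent: float  # SimilarityMaxPercent
    similarity_kind: str           # SimilarityKind, e.g. "exact" ("near" is assumed)
    info_gain_score: float         # InfoGainScore


def has_similarity_risk(page: PageSimilarity, sampled_pages: int) -> bool:
    """Return True when the documented similarity-risk condition holds."""
    # Too few sampled pages: the issue is suppressed entirely.
    if sampled_pages < MIN_SAMPLED_PAGES:
        return False
    # Condition 1: the page was compared and found similar to another page.
    if page.similarity_max_percent <= 0:
        return False
    # Condition 2: exact duplicates always count; others need low info gain.
    return page.similarity_kind == "exact" or page.info_gain_score < INFO_GAIN_MIN
```

For example, has_similarity_risk(PageSimilarity(95.0, "near", 30), sampled_pages=50) returns True, while the same page with an InfoGainScore of 60 returns False.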

Impact on scores

Severity weight: 7. Deductions: −18 on Indexability (dampened because the check is heuristic), −14 on Content. Pages flagged for similarity risk are candidates for consolidation or differentiation.

Common causes

pSEO boilerplate: Programmatic pages where the template text vastly outweighs the unique data.
Thin variations: Location pages, product variants, or tag pages that differ by only a few words.
Auto-generated content: AI-generated pages that repeat the same structure and phrasing from page to page.

How to fix

Add unique value: Enrich pages with unique data, reviews, local information, or contextual content.
Consolidate: Merge near-duplicate pages into a single comprehensive page.
Differentiate templates: Ensure your template injects enough unique data per page.
Raise InfoGain: Target an InfoGainScore of at least 45 for each page, the threshold below which near-duplicates are counted as an issue.
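
Before re-crawling, it can help to check locally whether two pages still overlap heavily after a template change. The sketch below uses word-shingle Jaccard similarity, a common near-duplicate technique. It is an assumption for illustration only: this page does not document how SEODiff computes SimilarityMaxPercent or InfoGainScore, so the numbers will not match the report, and the function names are hypothetical.

```python
import re


def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Split text into lowercase words and return the set of k-word windows."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_percent(a: str, b: str, k: int = 5) -> float:
    """Shared shingles as a percentage of all shingles in either text."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def unique_share_percent(page: str, other: str, k: int = 5) -> float:
    """Percentage of the page's shingles that do not appear in the other text.

    A rough local proxy for how much of the page is not shared boilerplate;
    it is NOT SEODiff's InfoGainScore.
    """
    sp, so = shingles(page, k), shingles(other, k)
    if not sp:
        return 0.0
    return 100.0 * len(sp - so) / len(sp)
```

Run it on the extracted body text of two pages built from the same template: a high jaccard_percent together with a low unique_share_percent suggests the template still outweighs the unique data.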