Issue: Similarity Risk (Medium)

Pages with high text overlap and low information gain — near-duplicates that add little unique value.

What this means

A page has high text similarity with another page on your site, and the unique information it provides (Information Gain Score) is too low to justify its existence as a separate page. The JSON field is similarity_risk.

Detection condition

Triggered when both conditions are met:

  1. The page's text similarity to another page on the site crosses the similarity threshold, which defaults to 92% (configurable).
  2. InfoGainScore(page) < 45.

For programmatic pages, near-duplicate similarity is only a directional signal, so SEODiff counts it as an issue only when the InfoGain condition also holds. Exact duplicates always count, regardless of InfoGain.

Suppressed when fewer than 10 pages are sampled.
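The rule above can be sketched in a few lines of Python. This is an illustrative reading of the documented condition, not SEODiff's implementation: the PageSample class, its field names, the 0.0-1.0 similarity scale, and the choice of >= for the threshold comparison are all assumptions made for the example. It also assumes the 10-page suppression applies to exact duplicates too, which the text does not state.

```python
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.92  # documented default (92%), configurable
INFO_GAIN_FLOOR = 45         # near-duplicates count only below this score
MIN_SAMPLED_PAGES = 10       # the issue is suppressed below this sample size


@dataclass
class PageSample:
    """Hypothetical per-page record; field names are assumptions."""
    url: str
    max_similarity: float    # highest similarity to any other page, 0.0-1.0
    info_gain_score: float   # InfoGainScore for the page
    exact_duplicate: bool = False


def has_similarity_risk(page: PageSample, sampled_pages: int) -> bool:
    # Too few sampled pages: the issue is suppressed entirely (assumed to
    # cover exact duplicates as well).
    if sampled_pages < MIN_SAMPLED_PAGES:
        return False
    # Exact duplicates always count, regardless of InfoGain.
    if page.exact_duplicate:
        return True
    # Near-duplicates count only when InfoGain is also low.
    # Using >= for the threshold is an assumption; the docs do not say.
    return (page.max_similarity >= SIMILARITY_THRESHOLD
            and page.info_gain_score < INFO_GAIN_FLOOR)
```

Under these assumptions, a page with 95% similarity and an InfoGainScore of 60 is not flagged, while the same page with a score of 30 is.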

Impact on scores

Severity weight: 7. Deductions: −18 on Indexability (dampened because the signal is heuristic) and −14 on Content. Pages flagged for similarity risk are candidates for consolidation or differentiation.

Common causes

How to fix

  1. Add unique value: Enrich pages with unique data, reviews, local information, or contextual content.
  2. Consolidate: Merge near-duplicate pages into a single comprehensive page.
  3. Differentiate templates: Ensure your template injects enough unique data per page.
  4. Raise InfoGain: Target an InfoGainScore ≥ 45 for each page.
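
To find candidates for consolidation or differentiation before a rescan, a rough local check along the following lines may help. It compares extracted page text pairwise using Python's standard-library difflib. This is not SEODiff's similarity metric, so the ratios will not match the tool's numbers exactly; the function name and the pages mapping (URL to extracted body text) are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(pages: dict, threshold: float = 0.92) -> list:
    """Return (url_a, url_b, ratio) for page pairs at or above the threshold.

    `pages` maps URL -> extracted body text. The comparison is quadratic in
    the number of pages, so this is only suitable for small samples.
    """
    pairs = []
    for url_a, url_b in combinations(sorted(pages), 2):
        # autojunk=False disables difflib's popularity heuristic, which can
        # distort ratios on texts of 200 characters or more.
        matcher = SequenceMatcher(None, pages[url_a], pages[url_b], autojunk=False)
        ratio = matcher.ratio()
        if ratio >= threshold:
            pairs.append((url_a, url_b, ratio))
    return pairs


if __name__ == "__main__":
    template = ("Plumbers in {city}. Call today for a free quote on repairs, "
                "installs, and emergency service across the metro area.")
    sample = {
        "/austin": template.format(city="Austin"),
        "/dallas": template.format(city="Dallas"),
        "/about": "Our company history began in 1984 with a single van.",
    }
    for url_a, url_b, ratio in near_duplicate_pairs(sample):
        print(f"{url_a} ~ {url_b}: {ratio:.2f}")
```

In the sample data, the two city pages differ only by the city name, which is the kind of template output that fix 3 targets. Pairs reported here are starting points for review, not a prediction of what SEODiff will flag.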