Pre-Clustering Data Hygiene
Data Cleaning and Deduplication Before Clustering
Turn raw exports into a clean, analysis-ready list. This guide shows how to normalize text, handle plurals and variants, strip noisy modifiers, detect near duplicates, and apply regex recipes in Google Sheets or code. We exclude cluster creation and focus only on cleaning.
Scope and principles
Good clusters start with clean inputs. The aim is to reduce noise without deleting intent. We standardize formatting, handle simple morphology like plurals, remove disposable modifiers, and collapse obvious duplicates or near duplicates. We do not merge distinct intents or build clusters here.
- Google on helpful content and crawlable links for later QA
- MDN on regular expressions and Unicode normalization
- web.dev on Core Web Vitals for performance checks downstream
Cleaning pipeline
Run transformations in a predictable order so results are reproducible. Keep the raw file intact, then create columns for each step. A light version looks like this.
| Order | Step | Purpose | Example |
|---|---|---|---|
| 1 | Trim, lowercase, normalize Unicode | Remove spacing and case noise | ” CRM Pricing “ → “crm pricing” |
| 2 | Strip punctuation and emojis | Canonical text only | “crm pricing!!!” → “crm pricing” |
| 3 | Standardize separators | Spaces for “+”, “/”, “&” when safe | “crm & sales” → “crm sales” |
| 4 | Remove disposable modifiers | Drop “free”, “2025”, “best” when not core | “best crm 2025” → “crm” |
| 5 | Handle plurals and variants | Light stemming for obvious pairs | “dashboards” → “dashboard” |
| 6 | Sort tokens for signature | Ignore stop words and order | “crm for sales” and “sales crm” → same signature |
| 7 | Exact dedupe by signature | Drop identical signatures | Rows with same signature collapse |
| 8 | Near duplicate pass | Flag small edits like “setup” vs “set up” | Manual or fuzzy distance check |
Text normalization rules
Case and spacing
- Use lowercase for comparisons
- Collapse multiple spaces into one
- Trim leading and trailing spaces
Unicode and punctuation
- Normalize to NFC or NFKD for accents
- Replace fancy quotes and dashes with simple ASCII
- Remove emoji and symbols outside text
See MDN on String.prototype.normalize.
Separators
- Convert “&”, “/”, “+” to space when they join words
- Keep “c++” or “c#” style tokens as exceptions list
- Map “vs”, “versus” to a shared token like “vs”
Plurals, stems, and intent
Do not over-stem. Light normalization is usually enough for cleaning. The goal is to merge obvious morphological variants without losing intent.
Safe pairs
- dashboards → dashboard
- analyses → analysis
- policies → policy
Avoid collapsing
- pricing vs price (different intent)
- analytics vs analysis (different meaning)
- report vs reporting (noun vs activity)
When in doubt, keep both and let the clustering step decide later.
Noise modifiers to strip or standardize
Modifiers add clutter when you want a canonical form. Keep a dictionary for safe removals and standardizations. Do not remove modifiers that change intent like “pricing” or “template”.
| Type | Examples | Action | Notes |
|---|---|---|---|
| Year and recency | 2023, 2024, latest, new | Remove | Keep if the query truly requires a year like tax 2024 |
| Quality fluff | best, top, ultimate | Remove or map to “best” | Keep one token if you need the flavor for later |
| Freebie noise | free, cheap | Remove | Keep “free” only if product is free forever |
| Punctuation | !, ?, :, ; | Remove | Safe in almost all contexts |
| Stop words | for, to, in, of | Ignore in signatures | Keep in display text |
Near duplicate detection
Exact dedupe is not enough. You will see spelling variants, swapped word order, and tiny edits. Use signatures and a lightweight fuzzy check to flag pairs for review.
Signature idea
- Lowercase
- Strip punctuation and disposable modifiers
- Tokenize, drop stop words, sort tokens
- Join with a single space
Example: “crm for sales teams” and “sales crm team” share signature “crm sales team”.
Fuzzy pass
- Use a distance metric on signatures
- Flag pairs with distance ≤ 2 for manual review
- Whitelist product names to avoid wrong merges
What to keep separate
- “pricing” vs “cost calculator”
- “setup” vs “migration”
- “template” vs “example” when docs differ
Regex cookbook
These patterns work in most regex engines. Test carefully. See the MDN cheatsheet if you are new to regex.
Remove years and fluff
// years and recency words
/(19|20)\d{2}\b|latest|newest|updated|update/gi
// quality fluff
/\b(best|top|ultimate|complete|definitive)\b/giNormalize separators and dashes
// fancy quotes and dashes to simple
/[“”‘’–—]/g → replacement: '-'
// joiners to space
/[&+/]/g → replacement: ' 'Collapse whitespace
/\s{2,}/g → replacement: ' 'Light plural handling
// endings like -s, -es, -ies
/\b([a-z]{3,})s\b/gi → "$1"
/\b([a-z]{3,})es\b/gi → "$1"
/\b([a-z]{3,})ies\b/gi → "$1y"Use only on a copy of the text and keep an exceptions list.
Google Sheets recipes
Keep raw keywords in column A. Create helper columns so every step is auditable.
Normalize and strip noise
=LOWER(TRIM(REGEXREPLACE(
REGEXREPLACE(A2,"(19|20)\d{2}|\b(latest|newest|updated|best|top)\b",""),
"[^a-z0-9\s\+&/]", "")))Keeps alphanumerics, spaces, plus, ampersand, slash for the next step.
Unify separators
=REGEXREPLACE(B2,"[&/+]"," ")Token sort signature
=TEXTJOIN(" ",TRUE,SORT(UNIQUE(
FILTER(SPLIT(REGEXREPLACE(C2,"\s{2,}"," "), " "),
LEN(SPLIT(REGEXREPLACE(C2,"\s{2,}"," "), " "))>0,
NOT(REGEXMATCH(SPLIT(REGEXREPLACE(C2,"\s{2,}"," "), " "),
"(^| )(for|to|in|of|the|a|an)$"))
))))Drops common stop words from the signature only. Display text stays untouched.
Exact dedupe with signature
=IF(COUNTIF($D$2:D2,D2)=1,"keep","drop")Flag near duplicates
Sheets has no native Levenshtein, but you can flag suspicious pairs with same first token and similar length. Export to code for a final fuzzy pass if needed.
Tiny code snippets
JavaScript signature function
const STOP = new Set(["for","to","in","of","the","a","an"]);
const normalize = s => s.toLowerCase()
.normalize("NFKD")
.replace(/[“”‘’–—]/g,"-")
.replace(/[^\w\s\+&/]/g,"")
.replace(/[&/+]/g," ")
.replace(/\s{2,}/g," ")
.trim();
function signature(q){
const toks = normalize(q).split(" ").filter(t => t && !STOP.has(t));
return [...new Set(toks)].sort().join(" ");
}Python near-duplicate flag
from difflib import SequenceMatcher
def near_dup(a, b, threshold=0.92):
r = SequenceMatcher(None, a, b).ratio()
return r >= threshold
# use on signatures, not raw strings
# near_dup("crm sales team", "sales crm team") → TrueQA checklist and sign-off
Coverage
- No blank or whitespace only rows
- No duplicate signatures kept
- All brand and year noise handled
Safety
- Pricing and template terms not removed
- Product names whitelisted from stemming
- Geo terms preserved if relevant to scope
Reproducibility
- Each step lives in its own column
- Dictionary files for modifiers and exceptions
- Change log with date and owner
FAQ
Should I remove brand names
Keep your own brand and major competitor names. They influence intent and will matter later. You can ignore them in the signature if the goal is to dedupe formatting only.
Do I keep misspellings
Keep high volume misspellings as separate rows if the SERP looks different. Otherwise correct them and merge into the canonical spelling.
What about long queries with many modifiers
Create a clean signature and store the full display text. You can later decide if the extra modifiers justify a separate page or belong as on-page copy.
Can I automate everything in Sheets
Most cleaning can be done with formulas. For near duplicates at scale, export signatures to a small script and review flagged pairs manually before deletion.
