Data Cleaning and Deduplication Before Clustering: Normalize, Plurals, Near Duplicates, Modifiers, Regex Tips

Turn raw exports into a clean, analysis-ready list. This guide shows how to normalize text, handle plurals and variants, strip noisy modifiers, detect near duplicates, and apply regex recipes in Google Sheets or code. We exclude cluster creation and focus only on cleaning.

Scope and principles

Good clusters start with clean inputs. The aim is to reduce noise without deleting intent. We standardize formatting, handle simple morphology like plurals, remove disposable modifiers, and collapse obvious duplicates or near duplicates. We do not merge distinct intents or build clusters here.

Cleaning pipeline

Run transformations in a predictable order so results are reproducible. Keep the raw file intact, then create columns for each step. A light version looks like this.

| Order | Step | Purpose | Example |
| --- | --- | --- | --- |
| 1 | Trim, lowercase, normalize Unicode | Remove spacing and case noise | " CRM Pricing " → "crm pricing" |
| 2 | Strip punctuation and emojis | Canonical text only | "crm pricing!!!" → "crm pricing" |
| 3 | Standardize separators | Spaces for "+", "/", "&" when safe | "crm & sales" → "crm sales" |
| 4 | Remove disposable modifiers | Drop "free", "2025", "best" when not core | "best crm 2025" → "crm" |
| 5 | Handle plurals and variants | Light stemming for obvious pairs | "dashboards" → "dashboard" |
| 6 | Sort tokens for signature | Ignore stop words and order | "crm for sales" and "sales crm" → same signature |
| 7 | Exact dedupe by signature | Drop identical signatures | Rows with the same signature collapse |
| 8 | Near duplicate pass | Flag small edits like "setup" vs "set up" | Manual or fuzzy distance check |

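The eight steps above can be sketched end to end in Python. This is a minimal sketch, not a full pipeline: the `DISPOSABLE` and `STOP` sets are illustrative examples you would replace with your own dictionaries.

```python
import re
import unicodedata

# Illustrative dictionaries -- extend these for your own data.
DISPOSABLE = {"free", "best", "top", "latest", "new"}
STOP = {"for", "to", "in", "of", "the", "a", "an"}
YEAR = re.compile(r"\b(19|20)\d{2}\b")

def clean(q: str) -> str:
    q = unicodedata.normalize("NFKD", q).lower().strip()  # step 1
    q = re.sub(r"[&/+]", " ", q)                          # step 3: joiners to space
    q = re.sub(r"[^a-z0-9\s]", " ", q)                    # step 2: punctuation out
    q = YEAR.sub(" ", q)                                  # step 4: years out
    toks = [t for t in q.split() if t not in DISPOSABLE]  # step 4: fluff out
    return " ".join(toks)

def signature(q: str) -> str:
    # steps 6-7: drop stop words, sort unique tokens
    toks = [t for t in clean(q).split() if t not in STOP]
    return " ".join(sorted(set(toks)))

# signature("Best CRM for Sales 2025") → "crm sales"
# signature("sales crm")               → "crm sales"
```

Rows that produce the same signature are exact duplicates after cleaning; the near-duplicate pass (step 8) then runs on the signatures, not the raw text.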
Text normalization rules

Case and spacing

  • Use lowercase for comparisons
  • Collapse multiple spaces into one
  • Trim leading and trailing spaces

Unicode and punctuation

  • Normalize to NFC or NFKD for accents
  • Replace fancy quotes and dashes with simple ASCII
  • Remove emoji and symbols outside text

See MDN on String.prototype.normalize.
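The same normalization is available in Python via the standard `unicodedata` module. A small accent-folding sketch: NFKD splits each accented character into a base letter plus a combining mark, so dropping the combining marks leaves plain ASCII where one exists.

```python
import unicodedata

def fold_accents(s: str) -> str:
    # NFKD: "é" becomes "e" + U+0301; combining marks are then discarded
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# fold_accents("café menü") → "cafe menu"
```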

Separators

  • Convert “&”, “/”, “+” to space when they join words
  • Keep “c++” or “c#” style tokens as exceptions list
  • Map “vs”, “versus” to a shared token like “vs”
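A token-level pass makes the exceptions list easy to honor: check each token against the whitelist before touching its symbols. The `EXCEPTIONS` set below is a hypothetical starting point.

```python
import re

# Hypothetical exceptions -- tokens whose symbols carry meaning.
EXCEPTIONS = {"c++", "c#"}

def unify_separators(q: str) -> str:
    out = []
    for tok in q.lower().split():
        if tok in EXCEPTIONS:
            out.append(tok)                       # keep "c++", "c#" intact
        else:
            out.append(re.sub(r"[&/+#]", " ", tok))
    return re.sub(r"\s{2,}", " ", " ".join(out)).strip()

# unify_separators("crm & sales")    → "crm sales"
# unify_separators("c++ tutorials")  → "c++ tutorials"
```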

Plurals, stems, and intent

Do not over-stem. Light normalization is usually enough for cleaning. The goal is to merge obvious morphological variants without losing intent.

Safe pairs

  • dashboards → dashboard
  • analyses → analysis
  • policies → policy

Avoid collapsing

  • pricing vs price (different intent)
  • analytics vs analysis (different meaning)
  • report vs reporting (noun vs activity)

When in doubt, keep both and let the clustering step decide later.
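One way to encode the safe/avoid split above is a singularizer with an explicit protect list and an irregulars map. Both dictionaries here are illustrative; grow them as you review your data.

```python
# Illustrative dictionaries -- extend for your own keyword set.
IRREGULAR = {"analyses": "analysis", "policies": "policy"}
PROTECT = {"pricing", "analytics", "analysis", "reporting", "sales"}

def singularize(tok: str) -> str:
    if tok in PROTECT:
        return tok                      # never stem intent-bearing tokens
    if tok in IRREGULAR:
        return IRREGULAR[tok]
    if tok.endswith("ies") and len(tok) > 4:
        return tok[:-3] + "y"           # "policies" → "policy"
    if tok.endswith("s") and not tok.endswith("ss") and len(tok) > 3:
        return tok[:-1]                 # "dashboards" → "dashboard"
    return tok

# singularize("dashboards") → "dashboard"
# singularize("pricing")    → "pricing" (protected)
```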

Noise modifiers to strip or standardize

Modifiers add clutter when you want a canonical form. Keep a dictionary for safe removals and standardizations. Do not remove modifiers that change intent like “pricing” or “template”.

| Type | Examples | Action | Notes |
| --- | --- | --- | --- |
| Year and recency | 2023, 2024, latest, new | Remove | Keep if the query truly requires a year, like "tax 2024" |
| Quality fluff | best, top, ultimate | Remove or map to "best" | Keep one token if you need the flavor for later |
| Freebie noise | free, cheap | Remove | Keep "free" only if the product is free forever |
| Punctuation | !, ?, :, ; | Remove | Safe in almost all contexts |
| Stop words | for, to, in, of | Ignore in signatures | Keep in display text |

Near duplicate detection

Exact dedupe is not enough. You will see spelling variants, swapped word order, and tiny edits. Use signatures and a lightweight fuzzy check to flag pairs for review.

Signature idea

  • Lowercase
  • Strip punctuation and disposable modifiers
  • Tokenize, drop stop words, sort tokens
  • Join with a single space

Example: “crm for sales teams” and “sales crm team” share signature “crm sales team”.

Fuzzy pass

  • Use a distance metric on signatures
  • Flag pairs with distance ≤ 2 for manual review
  • Whitelist product names to avoid wrong merges
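For the "distance ≤ 2" rule, Levenshtein edit distance is the usual metric: the number of single-character insertions, deletions, and substitutions separating two strings. A minimal dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic two-row dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# levenshtein("crm setup", "crm set up") → 1, so the pair gets flagged
```

Run it on signatures, compare against the threshold of 2, and send flagged pairs to manual review rather than merging automatically.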

What to keep separate

  • “pricing” vs “cost calculator”
  • “setup” vs “migration”
  • “template” vs “example” when docs differ

Regex cookbook

These patterns work in most regex engines. Test carefully. See the MDN cheatsheet if you are new to regex.

Remove years and fluff

// years and recency words
/(19|20)\d{2}\b|latest|newest|updated|update/gi

// quality fluff
/\b(best|top|ultimate|complete|definitive)\b/gi

Normalize separators and dashes

// fancy quotes to plain ASCII
/[“”‘’]/g → replacement: '"'

// dashes to simple hyphen
/[–—]/g   → replacement: '-'

// joiners to space
/[&+/]/g  → replacement: ' '

Collapse whitespace

/\s{2,}/g → replacement: ' '

Light plural handling

// endings like -ies, -es, -s — apply most specific first,
// or the -s rule will mangle "policies" before -ies can run
/\b([a-z]{3,})ies\b/gi → "$1y"
/\b([a-z]{3,})es\b/gi  → "$1"
/\b([a-z]{3,})s\b/gi   → "$1"

Use only on a copy of the text and keep an exceptions list.
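Applied in Python, the ordered rules plus an exceptions list might look like this. The `EXCEPTIONS` entries are illustrative: words that end in "s" but are not plurals of anything you want.

```python
import re

# Hypothetical exceptions -- look plural, are not.
EXCEPTIONS = {"sales", "analysis", "news", "series"}

RULES = [  # most specific first
    (re.compile(r"\b([a-z]{3,})ies\b"), r"\1y"),
    (re.compile(r"\b([a-z]{3,})es\b"), r"\1"),
    (re.compile(r"\b([a-z]{3,})s\b"), r"\1"),
]

def light_stem(text: str) -> str:
    out = []
    for tok in text.split():
        if tok in EXCEPTIONS:
            out.append(tok)
            continue
        for pattern, repl in RULES:
            new = pattern.sub(repl, tok)
            if new != tok:
                tok = new
                break                 # first matching rule wins
        out.append(tok)
    return " ".join(out)

# light_stem("crm dashboards") → "crm dashboard"
# light_stem("sales policies") → "sales policy"
```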

Google Sheets recipes

Keep raw keywords in column A. Create helper columns so every step is auditable.

Normalize and strip noise

=TRIM(REGEXREPLACE(
  REGEXREPLACE(LOWER(A2),"(19|20)\d{2}|\b(latest|newest|updated|best|top)\b",""),
  "[^a-z0-9\s\+&/]", ""))

Note that LOWER must run first: if the character-class strip runs on the raw text, it deletes every uppercase letter.

Keeps alphanumerics, spaces, plus, ampersand, slash for the next step.

Unify separators

=REGEXREPLACE(B2,"[&/+]"," ")

Token sort signature

=TEXTJOIN(" ",TRUE,SORT(UNIQUE(FILTER(
  TRANSPOSE(SPLIT(REGEXREPLACE(C2,"\s{2,}"," ")," ")),
  NOT(REGEXMATCH(TRANSPOSE(SPLIT(REGEXREPLACE(C2,"\s{2,}"," ")," ")),
  "^(for|to|in|of|the|a|an)$"))
))))

TRANSPOSE turns SPLIT's horizontal output into a column so SORT and UNIQUE behave, and SPLIT already discards empty tokens.

Drops common stop words from the signature only. Display text stays untouched.

Exact dedupe with signature

=IF(COUNTIF($D$2:D2,D2)=1,"keep","drop")

Flag near duplicates

Sheets has no native Levenshtein, but you can flag suspicious pairs with same first token and similar length. Export to code for a final fuzzy pass if needed.
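After exporting the signature column, a short script can do the pairwise comparison. This sketch reuses `difflib.SequenceMatcher` from the standard library; the 0.92 threshold is a starting point to tune, not a rule.

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_pairs(signatures, threshold=0.92):
    # Compare every pair of distinct signatures; return suspects for review
    flagged = []
    for a, b in combinations(sorted(set(signatures)), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            flagged.append((a, b))
    return flagged

# flag_pairs(["crm sale team", "crm sales team", "pricing"])
# → [("crm sale team", "crm sales team")]
```

Pairwise comparison is O(n²), which is fine for a few thousand signatures; beyond that, block by first token before comparing.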

Tiny code snippets

JavaScript signature function

const STOP = new Set(["for","to","in","of","the","a","an"]);
const normalize = s => s.toLowerCase()
  .normalize("NFKD")
  .replace(/[“”‘’]/g, "")      // drop fancy quotes
  .replace(/[–—-]/g, " ")      // dashes separate words, not glue them
  .replace(/[^\w\s+&/]/g, "")  // strip remaining punctuation
  .replace(/[&/+]/g, " ")      // joiners to space
  .replace(/\s{2,}/g, " ")
  .trim();

function signature(q){
  const toks = normalize(q).split(" ").filter(t => t && !STOP.has(t));
  return [...new Set(toks)].sort().join(" ");
}

Python near-duplicate flag

from difflib import SequenceMatcher

def near_dup(a, b, threshold=0.92):
    r = SequenceMatcher(None, a, b).ratio()
    return r >= threshold

# compare signatures, not raw strings: sorted signatures already make
# pure word-order swaps identical, so the ratio only has to catch small edits
# near_dup("crm sale team", "crm sales team") → True

QA checklist and sign-off

Coverage

  • No blank or whitespace only rows
  • No duplicate signatures kept
  • All brand and year noise handled

Safety

  • Pricing and template terms not removed
  • Product names whitelisted from stemming
  • Geo terms preserved if relevant to scope

Reproducibility

  • Each step lives in its own column
  • Dictionary files for modifiers and exceptions
  • Change log with date and owner

FAQ

Should I remove brand names?

Keep your own brand and major competitor names. They influence intent and will matter later. You can ignore them in the signature if the goal is to dedupe formatting only.

Do I keep misspellings?

Keep high volume misspellings as separate rows if the SERP looks different. Otherwise correct them and merge into the canonical spelling.

What about long queries with many modifiers?

Create a clean signature and store the full display text. You can later decide if the extra modifiers justify a separate page or belong as on-page copy.

Can I automate everything in Sheets?

Most cleaning can be done with formulas. For near duplicates at scale, export signatures to a small script and review flagged pairs manually before deletion.