Data Cleaning and Deduplication Before Clustering: Normalize, Plurals, Near Duplicates, Modifiers, Regex Tips

Turn raw exports into a clean, analysis-ready list. This guide shows how to normalize text, handle plurals and variants, strip noisy modifiers, detect near duplicates, and apply regex recipes in Google Sheets or code. We exclude cluster creation and focus only on cleaning.

Scope and principles

Good clusters start with clean inputs. The aim is to reduce noise without deleting intent. We standardize formatting, handle simple morphology like plurals, remove disposable modifiers, and collapse obvious duplicates or near duplicates. We do not merge distinct intents or build clusters here.

Cleaning pipeline

Run transformations in a predictable order so results are reproducible. Keep the raw file intact, then create columns for each step. A light version looks like this.

| Order | Step | Purpose | Example |
| --- | --- | --- | --- |
| 1 | Trim, lowercase, normalize Unicode | Remove spacing and case noise | " CRM Pricing " → "crm pricing" |
| 2 | Strip punctuation and emojis | Canonical text only | "crm pricing!!!" → "crm pricing" |
| 3 | Standardize separators | Spaces for "+", "/", "&" when safe | "crm & sales" → "crm sales" |
| 4 | Remove disposable modifiers | Drop "free", "2025", "best" when not core | "best crm 2025" → "crm" |
| 5 | Handle plurals and variants | Light stemming for obvious pairs | "dashboards" → "dashboard" |
| 6 | Sort tokens for signature | Ignore stop words and order | "crm for sales" and "sales crm" → same signature |
| 7 | Exact dedupe by signature | Drop identical signatures | Rows with the same signature collapse |
| 8 | Near duplicate pass | Flag small edits like "setup" vs "set up" | Manual or fuzzy distance check |

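The eight steps above can be sketched end to end in Python. This is a minimal sketch, not a full pipeline: the `DISPOSABLE` and `STOP` sets are illustrative examples you would replace with your own dictionaries.

```python
import re
import unicodedata

# Illustrative dictionaries -- extend these for your own data.
DISPOSABLE = {"free", "best", "top", "latest", "new"}
STOP = {"for", "to", "in", "of", "the", "a", "an"}
YEAR = re.compile(r"\b(19|20)\d{2}\b")

def clean(q: str) -> str:
    q = unicodedata.normalize("NFKD", q).lower().strip()  # step 1
    q = re.sub(r"[&/+]", " ", q)                          # step 3: joiners to space
    q = re.sub(r"[^a-z0-9\s]", " ", q)                    # step 2: punctuation out
    q = YEAR.sub(" ", q)                                  # step 4: years out
    toks = [t for t in q.split() if t not in DISPOSABLE]  # step 4: fluff out
    return " ".join(toks)

def signature(q: str) -> str:
    # steps 6-7: drop stop words, sort unique tokens
    toks = [t for t in clean(q).split() if t not in STOP]
    return " ".join(sorted(set(toks)))

# signature("Best CRM for Sales 2025") → "crm sales"
# signature("sales crm")               → "crm sales"
```

Rows that produce the same signature are exact duplicates after cleaning; the near-duplicate pass (step 8) then runs on the signatures, not the raw text.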
Text normalization rules

Case and spacing

  • Use lowercase for comparisons
  • Collapse multiple spaces into one
  • Trim leading and trailing spaces

Unicode and punctuation

  • Normalize to NFC or NFKD for accents
  • Replace fancy quotes and dashes with simple ASCII
  • Remove emoji and symbols outside text

See MDN on String.prototype.normalize.
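The same normalization is available in Python via the standard `unicodedata` module. A small accent-folding sketch: NFKD splits each accented character into a base letter plus a combining mark, so dropping the combining marks leaves plain ASCII where one exists.

```python
import unicodedata

def fold_accents(s: str) -> str:
    # NFKD: "é" becomes "e" + U+0301; combining marks are then discarded
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# fold_accents("café menü") → "cafe menu"
```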

Separators

  • Convert “&”, “/”, “+” to space when they join words
  • Keep “c++” or “c#” style tokens as exceptions list
  • Map “vs”, “versus” to a shared token like “vs”
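A token-level pass makes the exceptions list easy to honor: check each token against the whitelist before touching its symbols. The `EXCEPTIONS` set below is a hypothetical starting point.

```python
import re

# Hypothetical exceptions -- tokens whose symbols carry meaning.
EXCEPTIONS = {"c++", "c#"}

def unify_separators(q: str) -> str:
    out = []
    for tok in q.lower().split():
        if tok in EXCEPTIONS:
            out.append(tok)                       # keep "c++", "c#" intact
        else:
            out.append(re.sub(r"[&/+#]", " ", tok))
    return re.sub(r"\s{2,}", " ", " ".join(out)).strip()

# unify_separators("crm & sales")    → "crm sales"
# unify_separators("c++ tutorials")  → "c++ tutorials"
```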

Plurals, stems, and intent

Do not over-stem. Light normalization is usually enough for cleaning. The goal is to merge obvious morphological variants without losing intent.

Safe pairs

  • dashboards → dashboard
  • analyses → analysis
  • policies → policy

Avoid collapsing

  • pricing vs price (different intent)
  • analytics vs analysis (different meaning)
  • report vs reporting (noun vs activity)

When in doubt, keep both and let the clustering step decide later.
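One way to encode the safe/avoid split above is a singularizer with an explicit protect list and an irregulars map. Both dictionaries here are illustrative; grow them as you review your data.

```python
# Illustrative dictionaries -- extend for your own keyword set.
IRREGULAR = {"analyses": "analysis", "policies": "policy"}
PROTECT = {"pricing", "analytics", "analysis", "reporting", "sales"}

def singularize(tok: str) -> str:
    if tok in PROTECT:
        return tok                      # never stem intent-bearing tokens
    if tok in IRREGULAR:
        return IRREGULAR[tok]
    if tok.endswith("ies") and len(tok) > 4:
        return tok[:-3] + "y"           # "policies" → "policy"
    if tok.endswith("s") and not tok.endswith("ss") and len(tok) > 3:
        return tok[:-1]                 # "dashboards" → "dashboard"
    return tok

# singularize("dashboards") → "dashboard"
# singularize("pricing")    → "pricing" (protected)
```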

Noise modifiers to strip or standardize

Modifiers add clutter when you want a canonical form. Keep a dictionary for safe removals and standardizations. Do not remove modifiers that change intent like “pricing” or “template”.

| Type | Examples | Action | Notes |
| --- | --- | --- | --- |
| Year and recency | 2023, 2024, latest, new | Remove | Keep if the query truly requires a year, like "tax 2024" |
| Quality fluff | best, top, ultimate | Remove or map to "best" | Keep one token if you need the flavor for later |
| Freebie noise | free, cheap | Remove | Keep "free" only if the product is free forever |
| Punctuation | !, ?, :, ; | Remove | Safe in almost all contexts |
| Stop words | for, to, in, of | Ignore in signatures | Keep in display text |

Near duplicate detection

Exact dedupe is not enough. You will see spelling variants, swapped word order, and tiny edits. Use signatures and a lightweight fuzzy check to flag pairs for review.

Signature idea

  • Lowercase
  • Strip punctuation and disposable modifiers
  • Tokenize, drop stop words, sort tokens
  • Join with a single space

Example: “crm for sales teams” and “sales crm team” share signature “crm sales team”.

Fuzzy pass

  • Use a distance metric on signatures
  • Flag pairs with distance ≤ 2 for manual review
  • Whitelist product names to avoid wrong merges
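For the "distance ≤ 2" rule, Levenshtein edit distance is the usual metric: the number of single-character insertions, deletions, and substitutions separating two strings. A minimal dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic two-row dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# levenshtein("crm setup", "crm set up") → 1, so the pair gets flagged
```

Run it on signatures, compare against the threshold of 2, and send flagged pairs to manual review rather than merging automatically.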

What to keep separate

  • “pricing” vs “cost calculator”
  • “setup” vs “migration”
  • “template” vs “example” when docs differ

Regex cookbook

These patterns work in most regex engines. Test carefully. See the MDN cheatsheet if you are new to regex.

Remove years and fluff

// years and recency words
/(19|20)\d{2}\b|latest|newest|updated|update/gi

// quality fluff
/\b(best|top|ultimate|complete|definitive)\b/gi

Normalize separators and dashes

// fancy quotes to plain ASCII
/[“”‘’]/g → replacement: '"'

// dashes to simple hyphen
/[–—]/g   → replacement: '-'

// joiners to space
/[&+/]/g  → replacement: ' '

Collapse whitespace

/\s{2,}/g → replacement: ' '

Light plural handling

// endings like -ies, -es, -s — apply most specific first,
// or the -s rule will mangle "policies" before -ies can run
/\b([a-z]{3,})ies\b/gi → "$1y"
/\b([a-z]{3,})es\b/gi  → "$1"
/\b([a-z]{3,})s\b/gi   → "$1"

Use only on a copy of the text and keep an exceptions list.
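Applied in Python, the ordered rules plus an exceptions list might look like this. The `EXCEPTIONS` entries are illustrative: words that end in "s" but are not plurals of anything you want.

```python
import re

# Hypothetical exceptions -- look plural, are not.
EXCEPTIONS = {"sales", "analysis", "news", "series"}

RULES = [  # most specific first
    (re.compile(r"\b([a-z]{3,})ies\b"), r"\1y"),
    (re.compile(r"\b([a-z]{3,})es\b"), r"\1"),
    (re.compile(r"\b([a-z]{3,})s\b"), r"\1"),
]

def light_stem(text: str) -> str:
    out = []
    for tok in text.split():
        if tok in EXCEPTIONS:
            out.append(tok)
            continue
        for pattern, repl in RULES:
            new = pattern.sub(repl, tok)
            if new != tok:
                tok = new
                break                 # first matching rule wins
        out.append(tok)
    return " ".join(out)

# light_stem("crm dashboards") → "crm dashboard"
# light_stem("sales policies") → "sales policy"
```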

Google Sheets recipes

Keep raw keywords in column A. Create helper columns so every step is auditable.

Normalize and strip noise

=TRIM(REGEXREPLACE(
  REGEXREPLACE(LOWER(A2),"(19|20)\d{2}|\b(latest|newest|updated|best|top)\b",""),
  "[^a-z0-9\s\+&/]", ""))

Note that LOWER must run first: if the character-class strip runs on the raw text, it deletes every uppercase letter.

Keeps alphanumerics, spaces, plus, ampersand, slash for the next step.

Unify separators

=REGEXREPLACE(B2,"[&/+]"," ")

Token sort signature

=TEXTJOIN(" ",TRUE,SORT(UNIQUE(FILTER(
  TRANSPOSE(SPLIT(REGEXREPLACE(C2,"\s{2,}"," ")," ")),
  NOT(REGEXMATCH(TRANSPOSE(SPLIT(REGEXREPLACE(C2,"\s{2,}"," ")," ")),
  "^(for|to|in|of|the|a|an)$"))
))))

TRANSPOSE turns SPLIT's horizontal output into a column so SORT and UNIQUE behave, and SPLIT already discards empty tokens.

Drops common stop words from the signature only. Display text stays untouched.

Exact dedupe with signature

=IF(COUNTIF($D$2:D2,D2)=1,"keep","drop")

Flag near duplicates

Sheets has no native Levenshtein, but you can flag suspicious pairs with same first token and similar length. Export to code for a final fuzzy pass if needed.
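After exporting the signature column, a short script can do the pairwise comparison. This sketch reuses `difflib.SequenceMatcher` from the standard library; the 0.92 threshold is a starting point to tune, not a rule.

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_pairs(signatures, threshold=0.92):
    # Compare every pair of distinct signatures; return suspects for review
    flagged = []
    for a, b in combinations(sorted(set(signatures)), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            flagged.append((a, b))
    return flagged

# flag_pairs(["crm sale team", "crm sales team", "pricing"])
# → [("crm sale team", "crm sales team")]
```

Pairwise comparison is O(n²), which is fine for a few thousand signatures; beyond that, block by first token before comparing.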

Tiny code snippets

JavaScript signature function

const STOP = new Set(["for","to","in","of","the","a","an"]);
const normalize = s => s.toLowerCase()
  .normalize("NFKD")
  .replace(/[“”‘’]/g, "")      // drop fancy quotes
  .replace(/[–—-]/g, " ")      // dashes separate words, not glue them
  .replace(/[^\w\s+&/]/g, "")  // strip remaining punctuation
  .replace(/[&/+]/g, " ")      // joiners to space
  .replace(/\s{2,}/g, " ")
  .trim();

function signature(q){
  const toks = normalize(q).split(" ").filter(t => t && !STOP.has(t));
  return [...new Set(toks)].sort().join(" ");
}

Python near-duplicate flag

from difflib import SequenceMatcher

def near_dup(a, b, threshold=0.92):
    r = SequenceMatcher(None, a, b).ratio()
    return r >= threshold

# compare signatures, not raw strings: sorted signatures already make
# pure word-order swaps identical, so the ratio only has to catch small edits
# near_dup("crm sale team", "crm sales team") → True

QA checklist and sign-off

Coverage

  • No blank or whitespace only rows
  • No duplicate signatures kept
  • All brand and year noise handled

Safety

  • Pricing and template terms not removed
  • Product names whitelisted from stemming
  • Geo terms preserved if relevant to scope

Reproducibility

  • Each step lives in its own column
  • Dictionary files for modifiers and exceptions
  • Change log with date and owner

FAQ

Should I remove brand names?

Keep your own brand and major competitor names. They influence intent and will matter later. You can ignore them in the signature if the goal is to dedupe formatting only.

Do I keep misspellings?

Keep high volume misspellings as separate rows if the SERP looks different. Otherwise correct them and merge into the canonical spelling.

What about long queries with many modifiers?

Create a clean signature and store the full display text. You can later decide if the extra modifiers justify a separate page or belong as on-page copy.

Can I automate everything in Sheets?

Most cleaning can be done with formulas. For near duplicates at scale, export signatures to a small script and review flagged pairs manually before deletion.