Introduction
You approved a $200,000 evaluation budget. Six months later, your evaluator delivered a report showing your program reduced turnover by 23 percent. You presented it to the board. One director asked how you handled duplicate records in the exit dataset. You did not know what that meant.
The board tabled the decision. Not because your program failed. Because they could not trust the number.
This is the data cleaning problem. Most executives approve evaluation budgets without understanding that raw data is not analysis-ready data. Before any calculation runs, the data must be cleaned. If it is not, every number in your final report is compromised.
Data cleaning is not a nice-to-have step evaluators do when they have extra time. It is the foundation that determines whether your findings are defensible or garbage. When executives skip the question “how did you clean this data,” they are signing off on conclusions built on a dataset they have never audited.
This article explains what data cleaning actually involves, why it matters for executive decision making, and how to audit whether the data backing your evaluation is trustworthy.
Why Raw Data Is Not Analysis-Ready Data
You receive a CSV export from your HRIS. It contains 2,400 rows representing employee exits over three years. You send it to your evaluator and assume the analysis starts immediately.
It does not. That dataset is not clean. Here is what is hiding in those 2,400 rows that will corrupt any analysis run without cleaning.
- Duplicate records: The same exit appears multiple times because the employee was logged in two systems, or their record was updated after the initial entry, creating a second row. If you count duplicates as separate exits, your attrition rate is inflated. Your baseline is wrong. Your ROI calculation is wrong.
- Inconsistent categorization: Exit reason is coded differently by different managers. One manager logs “better opportunity” while another logs “career growth” for the same type of voluntary exit. If your analysis segments by exit reason, these inconsistencies create artificial categories that fragment your sample and weaken your findings.
- Missing values: Some records lack an exit date, a manager ID, or a department code. If you run an analysis without handling missing data, your statistical software will drop those rows silently. Your sample size shrinks without you knowing it. Your findings represent a subset of exits, not the full population, but your report claims to describe all exits.
- Formatting errors: Dates are stored as text in some rows and as date objects in others. Salary is recorded with commas in some rows and without in others, causing some values to be read as text instead of numbers. When your evaluator tries to calculate average ramp cost, half the salary data does not compute. The output is blank or nonsensical.
- Outliers that are errors, not insights: One record shows an employee with 147 years of tenure. Another shows a ramp time of negative six months. These are data entry errors, not interesting edge cases. If your evaluator does not flag and remove them, they will distort every summary statistic. Your mean cost per exit is wrong because one impossible value dragged the average up.
Raw data is contaminated data. It reflects the reality of how information gets logged across systems, by different people, using inconsistent protocols. Analysis run on raw data produces numbers that look precise but are structurally unsound.
Data cleaning is the process of identifying and correcting these issues before any analysis begins. It is not optional. It is the minimum standard for credible evaluation.
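To see how much of this is lurking in a given export, a short profiling pass is enough. Here is a minimal sketch in Python with pandas, assuming a hypothetical file named exits.csv with columns such as employee_id, exit_date, exit_reason, and salary; your system will use different names, but the checks are the same.

```python
import pandas as pd

# Load the raw HRIS export as-is; read everything as text so nothing is silently coerced.
# File name and column names (employee_id, exit_date, exit_reason, salary) are hypothetical.
raw = pd.read_csv("exits.csv", dtype=str)
print(f"Rows in raw export: {len(raw)}")

# Duplicate exposure: rows sharing an employee ID and exit date.
dupes = raw.duplicated(subset=["employee_id", "exit_date"], keep=False)
print(f"Rows involved in potential duplicates: {dupes.sum()}")

# Missing values: share of blanks per column.
print("Share of missing values per column:")
print(raw.isna().mean().round(3))

# Categorical fragmentation: how many distinct exit reasons are being logged?
print(f"Distinct exit_reason values: {raw['exit_reason'].nunique()}")

# Formatting exposure: salary values that are present but do not parse as numbers.
salary_numeric = pd.to_numeric(raw["salary"].str.replace(",", "", regex=False), errors="coerce")
print(f"Salary values that fail to parse: {(raw['salary'].notna() & salary_numeric.isna()).sum()}")
```

Ten minutes with output like this tells you whether the export is analysis-ready. It almost never is.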
What Data Cleaning Actually Involves: The Protocol
Most executives have never seen a data cleaning protocol because evaluators treat it as invisible background work. That invisibility is a mistake. If you do not understand what cleaning involves, you cannot assess whether it was done correctly. Here is the step-by-step process credible evaluators follow.
Step One: Audit for Duplicates
The evaluator runs a duplicate detection algorithm to identify rows that represent the same entity. For employee exit data, this typically means matching on employee ID and exit date. If the same employee ID appears with the same exit date in multiple rows, those are duplicates.
The protocol is not to delete duplicates blindly. The evaluator checks why duplicates exist. Sometimes a duplicate is an error. Sometimes it is a legitimate edge case, like an employee who exited and was rehired in the same year. The evaluator documents the decision rule: we kept rows where X condition was true and removed rows where Y condition was true.
Without this step, every count in your report is inflated by an unknown margin. You cannot defend your attrition rate if you do not know whether you are counting the same exit twice.
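Here is a minimal sketch of that protocol in pandas. The column names (employee_id, exit_date) and the tiebreaker (a last_updated timestamp) are assumptions for illustration, and the decision rule shown, keeping the most recently updated row in each group, is one defensible choice, not the only one.

```python
import pandas as pd

# Hypothetical columns: employee_id, exit_date, and a last_updated timestamp used as the tiebreaker.
raw = pd.read_csv("exits.csv", parse_dates=["exit_date", "last_updated"])

# Flag every row that shares an employee ID and exit date with another row.
dupe_mask = raw.duplicated(subset=["employee_id", "exit_date"], keep=False)
print(f"{dupe_mask.sum()} rows fall into duplicate groups")

# Document before resolving: save the flagged rows so the decision can be reviewed later.
raw[dupe_mask].to_csv("duplicate_rows_for_review.csv", index=False)

# Illustrative decision rule: within each group, keep the most recently updated row.
deduped = (
    raw.sort_values("last_updated")
       .drop_duplicates(subset=["employee_id", "exit_date"], keep="last")
)
print(f"Removed {len(raw) - len(deduped)} duplicate rows; {len(deduped)} rows remain")
```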
Step Two: Standardize Categorical Variables
The evaluator identifies all categorical variables (exit reason, department, role level, manager ID) and checks how many unique values exist for each. If “exit reason” has 47 unique values in a dataset of 2,400 exits, that is a red flag. It means managers are logging reasons inconsistently.
The evaluator builds a crosswalk, a mapping table that consolidates similar categories into standardized buckets. “Career growth,” “better opportunity,” and “advancement elsewhere” all map to “voluntary – career progression.” The evaluator applies the crosswalk to the dataset, documents the mapping decisions, and produces a standardized version of the variable.
This step makes segmentation analysis possible. Without it, you cannot compare exit rates by reason or department because the categories are too fragmented to produce meaningful patterns.
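In code, the crosswalk can be a simple dictionary or a two-column lookup table maintained in a spreadsheet. This sketch assumes the hypothetical exit_reason column and a handful of illustrative mappings; the real crosswalk covers every raw value observed in the data and is reviewed with the program team.

```python
import pandas as pd

deduped = pd.read_csv("exits_deduped.csv")

# Normalize casing and whitespace first so "Career Growth " and "career growth" collapse together.
reason_raw = deduped["exit_reason"].str.strip().str.lower()

# Illustrative crosswalk: raw values -> standardized buckets.
crosswalk = {
    "career growth": "voluntary - career progression",
    "better opportunity": "voluntary - career progression",
    "advancement elsewhere": "voluntary - career progression",
    "relocation": "voluntary - personal",
    "performance": "involuntary - performance",
}

deduped["exit_reason_std"] = reason_raw.map(crosswalk)

# Any raw value not covered by the crosswalk is surfaced rather than silently dropped.
unmapped = reason_raw[deduped["exit_reason_std"].isna()].value_counts()
print("Raw values still needing a mapping decision:")
print(unmapped)
```

The dictionary itself doubles as documentation: it is the record of every mapping decision and goes straight into the cleaning log.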
Step Three: Handle Missing Data
The evaluator runs a missing data audit. For each variable, what percentage of rows have missing values? If 30 percent of records lack a manager ID, that variable cannot be used reliably in analysis. If 2 percent of records lack an exit date, that is manageable.
The protocol depends on how much data is missing and why it is missing. Three strategies exist:
- Deletion. If a row is missing values for critical variables (employee ID, exit date) and cannot be recovered, the row is removed from the analysis. This is documented. The final report notes that X rows were excluded due to incomplete data.
- Imputation. If a value is missing but can be reasonably estimated, the evaluator imputes it using a rule. For example, if department is missing but manager ID exists, the evaluator looks up the manager’s department and assigns that value. The imputation rule is documented.
- Flagging. If the variable is important but missing values cannot be imputed, the evaluator creates a “missing” category and flags it in analysis. When reporting exit reasons, one category will be “reason not recorded.” This preserves sample size while acknowledging the data limitation.
Without a documented missing data strategy, your evaluator is making invisible decisions about which rows to include and which to drop. You have no way to assess whether those decisions were defensible.
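The three strategies translate directly into a few lines of pandas. The column names and the manager-to-department lookup file below are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("exits_standardized.csv")

# Missing data audit: share of missing values per variable, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))

# Strategy 1 - deletion: drop rows missing critical identifiers, and record how many.
before = len(df)
df = df.dropna(subset=["employee_id", "exit_date"])
print(f"{before - len(df)} rows excluded for missing employee_id or exit_date")

# Strategy 2 - imputation: fill missing department from a manager -> department lookup
# (hypothetical table built from the HR org chart).
manager_dept = pd.read_csv("manager_departments.csv").set_index("manager_id")["department"]
df["department"] = df["department"].fillna(df["manager_id"].map(manager_dept))

# Strategy 3 - flagging: keep rows with an unknown exit reason, but label them explicitly.
df["exit_reason_std"] = df["exit_reason_std"].fillna("reason not recorded")
```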
Step Four: Validate Data Types and Formatting
The evaluator checks that each variable is stored in the correct data type. Dates should be date objects, not text. Salaries should be numeric, not text with commas. IDs should be stored consistently (all numeric or all alphanumeric, not mixed).
If formatting is inconsistent, the evaluator writes code to convert all values to a standardized format. Dates are converted to a single format (YYYY-MM-DD). Currency values are stripped of symbols and commas and converted to numeric. Text fields are converted to lowercase to avoid case-sensitive mismatches.
This step is invisible to the executive but critical to computational accuracy. If salary is stored as text in 20 percent of rows, every calculation involving salary will fail for those rows, and the evaluator will produce a summary statistic based on an incomplete sample without realizing it.
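Here is a sketch of those conversions, using errors="coerce" so that values that cannot be parsed become explicit missing values instead of failing silently. Column names are again hypothetical.

```python
import pandas as pd

df = pd.read_csv("exits_imputed.csv", dtype=str)

# Dates: convert mixed text formats to real datetime values; unparseable entries become NaT.
df["exit_date"] = pd.to_datetime(df["exit_date"], errors="coerce")

# Currency: strip symbols and thousands separators, then convert to numeric.
df["salary"] = pd.to_numeric(
    df["salary"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# IDs and categories: trim whitespace and lowercase to avoid case-sensitive mismatches.
for col in ["employee_id", "manager_id", "department"]:
    df[col] = df[col].str.strip().str.lower()

# Anything that failed conversion is now visible as missing and goes back through the
# missing-data step rather than silently corrupting downstream calculations.
print(f"Dates that failed to parse: {df['exit_date'].isna().sum()}")
print(f"Salaries that failed to parse: {df['salary'].isna().sum()}")
```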
Step Five: Identify and Address Outliers
The evaluator runs descriptive statistics on all numeric variables (salary, tenure, ramp time, age) to identify outliers. An outlier is a value that falls far outside the expected range. For tenure, an outlier might be 427 years. For ramp time, an outlier might be negative six months.
Some outliers are errors. These are corrected or removed. The evaluator documents which values were flagged as errors and what decision was made (corrected to a plausible value, removed entirely, or flagged for follow-up with the data source).
Some outliers are real. If one employee legitimately took 18 months to ramp while the average is six months, that is a real data point. The evaluator does not remove it, but flags it in analysis as an edge case that might skew averages. When reporting mean ramp cost, the evaluator also reports median to show what is typical when outliers are excluded.
Without outlier detection, one bad data point can distort every summary statistic in your report. Your cost-per-exit estimate will be wrong. Your ROI calculation will be wrong. The board will make decisions based on numbers that do not reflect reality.
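Here is a sketch of the outlier scan. The plausibility bounds (tenure above 60 years, negative ramp time) are illustrative judgment calls that belong in the cleaning log, not universal rules.

```python
import pandas as pd

df = pd.read_csv("exits_typed.csv")

# Descriptive statistics for key numeric variables: min and max expose impossible values fast.
numeric_cols = ["salary", "tenure_years", "ramp_months"]
print(df[numeric_cols].agg(["min", "max", "mean", "median"]).round(2))

# Illustrative plausibility bounds; the thresholds are documented decisions, not defaults.
impossible = (df["tenure_years"] > 60) | (df["ramp_months"] < 0)
print(f"{impossible.sum()} rows flagged as impossible values")

# Impossible values are logged for follow-up with the data source, then excluded from analysis.
df[impossible].to_csv("flagged_outliers.csv", index=False)
clean = df[~impossible].copy()

# Real-but-extreme values stay in, so report the median alongside the mean.
print(f"Mean ramp months: {clean['ramp_months'].mean():.1f}")
print(f"Median ramp months: {clean['ramp_months'].median():.1f}")
```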
Step Six: Document Every Decision
This is the step most evaluators skip and the step that determines whether your evaluation is defensible. The evaluator maintains a data cleaning log that records every decision made during the cleaning process.
The log includes:
- How many duplicates were found and how they were resolved.
- Which categorical variables were standardized and what mapping rules were used.
- How much data was missing for each variable and what strategy was used to handle it.
- Which outliers were flagged and whether they were corrected, removed, or retained.
- What transformations were applied to formatting or data types.
This log is an appendix to your evaluation report. It allows a skeptical auditor to reproduce the cleaning process and verify that decisions were reasonable. Without it, your evaluation is a black box. The board has no way to assess whether the data was cleaned correctly or whether the evaluator made arbitrary choices that biased the findings.
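The log does not require special tooling. A plain table with one row per decision, exported alongside the report, is enough. This sketch shows one hypothetical way to structure it; the example entries are illustrative, not real figures.

```python
import pandas as pd
from datetime import date

# Each cleaning decision gets one row: step, what was found, what was done, rows affected.
log_entries = []

def log_step(step, finding, decision, rows_affected):
    """Append one documented decision to the cleaning log."""
    log_entries.append({
        "date": date.today().isoformat(),
        "step": step,
        "finding": finding,
        "decision": decision,
        "rows_affected": rows_affected,
    })

# Example entries (hypothetical numbers):
log_step("duplicates", "150 rows shared employee_id and exit_date",
         "kept most recently updated row per group", 150)
log_step("missing data", "80 rows missing exit_date",
         "excluded from analysis; noted in report", 80)
log_step("outliers", "3 rows with negative ramp time",
         "flagged as data entry errors and removed", 3)

# The log ships as an appendix to the evaluation report.
pd.DataFrame(log_entries).to_csv("data_cleaning_log.csv", index=False)
```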
The Cost of Skipping Data Cleaning
Most executives do not budget for data cleaning because they do not realize it is a discrete task. They assume it happens automatically. It does not. When evaluators skip cleaning due to time pressure or lack of expertise, the consequences compound.
- Your findings are wrong. Duplicate records inflate your attrition count. Inconsistent categorization fragments your segments. Missing data shrinks your sample without documentation. Outliers distort your averages. Every number in your report is compromised, but the report does not acknowledge it. You present findings to the board with confidence. The findings are structurally unsound.
- Your evaluation cannot be audited. A board member or external reviewer asks to see the raw data and the cleaning protocol. You do not have a cleaning log. You cannot explain how duplicates were handled or how missing data was addressed. The reviewer cannot verify that your findings are reproducible. Your evaluation loses credibility even if the program was effective.
- You cannot defend your ROI claim. The CFO asks how you calculated cost per exit. You cite a number. They ask how many exits were included in the calculation. You say 2,400. They pull the HRIS export and count 2,150 unique employee IDs. Your number is inflated by duplicates. The CFO stops trusting your analysis. Your budget request gets tabled.
- You waste time and money on garbage analysis. You paid $50,000 for an evaluation. The evaluator ran sophisticated statistical models on dirty data. The outputs look impressive. They are meaningless. You make program decisions based on those outputs. The decisions fail because they were based on findings that did not reflect reality. You spent $50,000 to produce noise disguised as insight.
The cost of skipping data cleaning is not just analytical. It is reputational. Once a board or funder loses trust in your data quality, they question every number you present. You spend the next two years rebuilding credibility that was destroyed by one uncleaned dataset.
How to Audit Whether Your Data Is Clean
You hired an evaluator. They delivered a report. How do you know if the data was cleaned correctly? Most executives do not ask this question. They should. Here is the audit checklist.
- Ask for the data cleaning log. If the evaluator does not have one, the data was not cleaned systematically. Request a written summary of what cleaning steps were performed. If they cannot produce it, the cleaning was ad hoc or nonexistent.
- Check the sample size at each stage. The report should state how many rows were in the raw dataset, how many were removed as duplicates, how many were excluded due to missing data, and how many were included in final analysis. If those numbers are not documented, you do not know what subset of data the findings represent.
- Look for inconsistency flags. The report should acknowledge data limitations. If categorical variables had inconsistent coding, the report should mention it and explain how standardization was handled. If the report presents clean findings with no acknowledgment of data messiness, either the data was unrealistically clean (rare) or the evaluator skipped cleaning (common).
- Request a duplicate check. Ask the evaluator to report how many duplicate records were found and how they were resolved. If they say “we didn’t find any duplicates,” verify that by asking them to describe the duplicate detection method. Duplicates exist in almost every organizational dataset. If none were found, the evaluator likely did not check.
- Check for outlier documentation. Ask whether any outliers were identified in key numeric variables. If the evaluator says no outliers were found, that is suspicious. Administrative data always contains errors. Ask them to share descriptive statistics (min, max, mean, median) for key variables. If the maximum value for tenure is 500 years, you have an outlier that was not addressed. The sketch at the end of this section shows how to pull these statistics yourself.
- Verify reproducibility. If the analysis was done in code (R, Python, SQL), ask to see the cleaning script. If the analysis was done in Excel or manually, ask for step-by-step documentation. A credible evaluator can walk you through the exact sequence of operations performed on the data. If they cannot, the process was not rigorous.
This audit does not require technical expertise. It requires asking the right questions. If your evaluator cannot answer these questions clearly, your data quality is suspect.
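For the sample-size and descriptive-statistics checks above, the spot-check takes minutes if you, or an analyst on your team, have both the raw export and the final analysis file. File and column names below are hypothetical.

```python
import pandas as pd

raw = pd.read_csv("exits.csv")
final = pd.read_csv("exits_clean.csv")

# Sample size reconciliation: raw rows, unique entities, final analysis rows.
print(f"Raw rows: {len(raw)}")
print(f"Unique employee IDs in raw export: {raw['employee_id'].nunique()}")
print(f"Rows in final analysis dataset: {len(final)}")

# Descriptive statistics for key numeric variables: the min/max line is where
# unaddressed outliers show up (for example, a maximum tenure of 500 years).
print(final[["salary", "tenure_years", "ramp_months"]]
      .agg(["min", "max", "mean", "median"]).round(2))
```

If the evaluator's numbers do not reconcile with this output, you have found the question to ask at the next meeting.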
How Data Cleaning Fits Into the OLPADR Framework
Data cleaning is not a standalone task. It is embedded in the Diagnose phase of the OLPADR evaluation framework. Most organizations treat diagnosis as “let’s look at the numbers.” That is not diagnosis. That is reporting. Diagnosis is the process of ensuring the numbers are trustworthy before you interpret them.
Here is how data cleaning integrates into OLPADR.
- Outcome and Constraints. You defined what success looks like and what data you need to measure it. Data cleaning ensures the data you collected actually measures what you think it measures. If your outcome is “voluntary exits reduced by 15 percent” but your exit dataset is full of duplicates and missing values, you cannot measure the outcome accurately.
- Logic-Mapping. You built a theory of change that explains how your program reduces exits. Data cleaning ensures you can test that theory. If your logic model predicts that improved manager effectiveness reduces exits in specific departments, you need clean department codes to run that analysis. If department is inconsistently coded, you cannot test your theory.
- Plan. You designed a data collection system before the program launched. Data cleaning is where you discover whether that system worked. If 40 percent of records are missing the variable you planned to analyze, your data collection plan failed. Cleaning exposes that failure early, before you waste months analyzing incomplete data.
- Act. During program execution, data gets logged. Data cleaning is not something you do once at the end. It is a continuous audit. If you check data quality quarterly, you catch logging errors early and fix them while the program is still running. If you wait until the end, those errors are permanent.
- Diagnose and Calibrate. This is where data cleaning lives. Before you analyze results, you clean the data. Before you draw conclusions, you document what you cleaned and why. The cleaning log is part of your evidence ledger. It shows stakeholders that your findings are based on audited data, not raw exports.
- Result and Use. When you present findings to the board, you do not hide the cleaning process. You acknowledge it. You say: We started with 2,400 records. We removed 150 duplicates. We excluded 80 records with missing critical data. Final analysis includes 2,170 clean records. This transparency builds trust. Executives know you did not cherry-pick data. You followed a protocol.
Data cleaning is not a technical nuisance. It is the audit step that makes your evaluation defensible. Without it, you are building conclusions on a foundation you have never inspected.
Common Mistakes Executives Make
- Assuming raw data is clean. If data comes from an HRIS or CRM, executives assume it is accurate. It is not. Administrative systems are designed for operations, not analysis. Data quality is inconsistent because the priority is logging transactions, not maintaining analytical rigor.
- Not budgeting for cleaning time. Executives allocate 10 hours for data analysis but zero hours for data cleaning. Cleaning typically takes 20 to 40 percent of total analysis time. If you do not budget for it, the evaluator skips it or rushes it. Either way, your findings are compromised.
- Treating cleaning as low-skill work. Some executives assume cleaning is data entry work that can be delegated to a junior analyst. It is not. Cleaning requires judgment. The evaluator must decide how to handle edge cases, what imputation rules are defensible, and which outliers are errors versus legitimate extremes. These are analytical decisions, not clerical tasks.
- Not asking for a cleaning log. The single biggest mistake executives make is not requesting documentation of the cleaning process. If you do not ask, most evaluators will not provide it. You accept a report with no way to verify that the underlying data was handled correctly.
When to Bring in External Support
You need outside help when your internal team lacks the technical skills to write cleaning scripts in R, Python, or SQL. When you suspect your data quality is poor but do not know how to quantify it. When a board member or funder demands an audit of your data cleaning process and you have no documentation. When you need a third party to certify that your dataset meets quality standards before you submit it to a journal, grant agency, or regulatory body.
An external evaluator brings technical expertise in data cleaning, experience identifying common data quality issues across sectors, and the credibility of independence. They can audit your existing dataset, document the issues, clean it using a reproducible protocol, and provide a certification statement that your final dataset meets quality standards.
The goal of external support is not to hide messy data. It is to ensure messy data is cleaned transparently and documented thoroughly so your findings are defensible.
Moving Forward
You cannot make good decisions based on bad data. You cannot fix bad data without cleaning it. You cannot trust that data is clean unless you audit the cleaning process.
Next time you approve an evaluation budget, add a line item for data cleaning. Require the evaluator to produce a data cleaning log as a deliverable. Ask them to report sample size at each stage of the process. Request a duplicate count, a missing data summary, and an outlier report.
If they say “the data is fine, we didn’t need to clean it,” do not accept that answer. Every dataset needs cleaning. If they claim otherwise, they either did not check or they do not know what to check for. Either way, your findings are sitting on an uninspected foundation.
Data cleaning is not glamorous. It is not the part of evaluation that gets presented to the board. But it is the part that determines whether anything you present to the board is true.
What percentage of your evaluation budget is allocated to data cleaning, and when was the last time you asked to see a data cleaning log for a report you approved?
Ready to implement a data quality audit process for your organization’s evaluation work? I am opening 3 new Strategy Intensives this month. Click the link below to schedule a session and build a reproducible data cleaning protocol your team can apply to every analysis.
Link: https://www.claritytoimpact.com/professional-consultation-strategy-booking-form/