
Image by Author
# Introduction
We have all spent hours debugging a model, only to discover that the problem wasn't the algorithm but a stray null value in row 47,832 quietly skewing the results. Kaggle competitions give the impression that data arrives as clean, well-labeled CSVs with no class imbalance issues, but in reality, that is rarely the case.
In this article, we'll use a real data project to walk through four practices for dealing with messy, real-world datasets.
# NoBroker Data Project: A Hands-On Test of Real-World Chaos
NoBroker is an Indian property technology (prop-tech) company that connects property owners and tenants directly in a broker-free marketplace.

This data project is used during the recruitment process for data science positions at NoBroker.
In it, NoBroker asks you to build a predictive model that estimates how many interactions a property will receive within a given time frame. We won't complete the entire project here, but it gives us a realistic setting for practicing on messy, real-world data.
It has three datasets:
- property_data_set.csv: Property details such as type, location, amenities, size, rent, and other housing features.
- property_photos.tsv: Photo URLs for each property, stored as JSON-like strings.
- property_interactions.csv: Timestamps of the interactions on each property.
# Comparing Clean Interview Data Versus Real Production Data: The Reality Check
Interview datasets are polished, balanced, and boring. Real production data? It’s a dumpster fire with missing values, duplicate rows, inconsistent formats, and silent errors that wait until Friday at 5 PM to break your pipeline.
Take the NoBroker property dataset, a real-world mess with 28,888 properties across three tables. At first glance, it looks fine. But dig deeper, and you’ll find 11,022 missing photo uniform resource locators (URLs), corrupted JSON strings with rogue backslashes, and more.
This is the line between clean and chaotic. Clean data teaches you to build models; production data teaches you to survive.
The four practices below will help you train for it.

# Practice #1: Handling Missing Data
Missing data isn’t just annoying; it’s a decision point. Delete the row? Fill it with the mean? Flag it as unknown? The answer depends on why the data is missing and how much you can afford to lose.
The NoBroker dataset had three types of missing data. The photo_urls column alone was missing 11,022 values out of 28,888 rows, or 38% of the dataset.
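A quick pandas null count surfaces the gap. Here is a minimal sketch of that check, assuming the property_photos.tsv file described above:

```python
import pandas as pd

# Load the photo metadata (tab-separated file)
pics = pd.read_csv('property_photos.tsv', sep='\t')

# Count entries with no photo URLs (true NaNs plus literal 'NaN' strings)
missing = (pics['photo_urls'].isnull() | (pics['photo_urls'] == 'NaN')).sum()
print(f"Missing photo_urls: {missing} of {len(pics)} rows "
      f"({missing / len(pics) * 100:.1f}%)")
```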
Here is the output.

Deleting these rows would wipe out valuable property records. Instead, the solution was to treat a missing photo list as zero photos and move on.
```python
import json
import numpy as np

def correction(x):
    # Missing or 'NaN' entries mean the listing has no photos
    if x is np.nan or x == 'NaN':
        return 0
    # Repair the malformed JSON string, then count the photo entries
    return len(json.loads(x.replace('\\', '').replace('{title', '{"title')))

pics['photo_count'] = pics['photo_urls'].apply(correction)
```
For numerical columns like total_floor (23 missing) and categorical columns like building_type (38 missing), the strategy was imputation. Fill numerical gaps with the mean, and categorical gaps with the mode.
```python
# Fill numerical gaps with the column mean
for col in x_remain_withNull.columns:
    x_remain[col] = x_remain_withNull[col].fillna(x_remain_withNull[col].mean())

# Fill categorical gaps with the most frequent value (mode)
for col in x_cat_withNull.columns:
    x_cat[col] = x_cat_withNull[col].fillna(x_cat_withNull[col].mode()[0])
```
The first rule: do not delete without asking why.
Understand the pattern first. The missing photo URLs were not random; they were properties that simply had no photos uploaded, which is why treating them as zero made sense.
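One way to check a pattern like that is to compare properties with and without photos on other fields. The sketch below is illustrative only; the property_id join key and the rent column are assumed names, not taken from the project's code.

```python
# Load the property table (assumed to share a property_id key with pics)
data = pd.read_csv('property_data_set.csv')

# Flag rows whose photo list is missing
pics['photos_missing'] = pics['photo_urls'].isnull()

# Join the flag onto the property table and compare the two groups
merged = data.merge(pics[['property_id', 'photos_missing']], on='property_id', how='left')

# If listings with and without photos differ sharply on rent, the gaps are not random noise
print(merged.groupby('photos_missing')['rent'].describe())
```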
# Practice #2: Detecting Outliers
An outlier is not always an error, but it is always suspicious.
Can you imagine a property with 21 bathrooms, an age of 800 years, or 40,000 square feet of space? You either found your dream place or someone made a data entry error.
The NoBroker dataset was full of these red flags. Box plots revealed extreme values across multiple columns: property ages over 100, sizes beyond 10,000 square feet (sq ft), and deposits exceeding 3.5 million. Some were legitimate luxury properties. Most were data entry mistakes.
```python
import matplotlib.pyplot as plt

# One box plot per numerical column makes the extreme values jump out
df_num.plot(kind='box', subplots=True, figsize=(22, 10))
plt.show()
```
Here is the output.

The solution was interquartile range (IQR)-based outlier removal: flag anything more than 2 times the IQR below the first quartile or above the third, and drop it.
To handle this, we first write a function that removes those outliers.
```python
def remove_outlier(df_in, col_name):
    # Compute the interquartile range for the column
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    # Fences at 2 * IQR, slightly wider than the usual 1.5
    fence_low = q1 - 2 * iqr
    fence_high = q3 + 2 * iqr
    # Keep only the rows that fall inside the fences
    return df_in.loc[(df_in[col_name] >= fence_low) & (df_in[col_name] <= fence_high)]
```
And we run this code on numerical columns.
```python
df = dataset.copy()

for col in df_num.columns:
    # Skip binary flags and the target columns
    if col in ['gym', 'lift', 'swimming_pool', 'request_day_within_3d', 'request_day_within_7d']:
        continue
    df = remove_outlier(df, col)

print(f"Before: {dataset.shape[0]} rows")
print(f"After: {df.shape[0]} rows")
removed = dataset.shape[0] - df.shape[0]
print(f"Removed: {removed} rows ({removed / dataset.shape[0] * 100:.1f}% reduction)")
```
Here is the output.

After removing outliers, the dataset shrank from 17,386 rows to 15,170, losing 12.7% of the data while keeping the model sane. The trade-off was worth it.
For target variables like request_day_within_3d, capping was used instead of deletion. Values above 10 were capped at 10 to prevent extreme outliers from skewing predictions. In the following code, we also compare the results before and after.
```python
def capping_for_3days(x):
    # Cap the 3-day interaction count at 10
    return min(x, 10)

df['request_day_within_3d_capping'] = df['request_day_within_3d'].apply(capping_for_3days)

# Compare how many rows exceed the cap before and after
before_count = (df['request_day_within_3d'] > 10).sum()
after_count = (df['request_day_within_3d_capping'] > 10).sum()
change_count = before_count - after_count
percent_change = change_count / len(df) * 100

print(f"Before capping (>10): {before_count}")
print(f"After capping (>10): {after_count}")
print(f"Reduced by: {change_count} ({percent_change:.2f}% of total rows affected)")
```
The result?

A cleaner distribution, better model performance, and fewer debugging sessions.
# Practice #3: Dealing with Duplicates and Inconsistencies
Duplicates are easy. Inconsistencies are hard. A duplicate row is just df.drop_duplicates(). An inconsistent format, like a JSON string that’s been mangled by three different systems, requires detective work.
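For the easy half, a couple of pandas calls are usually all you need. The snippet below is a generic sketch; property_id is an assumed key name, not something the project prescribes.

```python
# Count rows that are exact copies of another row
print(df.duplicated().sum())

# Drop exact duplicates, then de-duplicate on a business key (assumed column name)
df = df.drop_duplicates()
df = df.drop_duplicates(subset=['property_id'], keep='first')
```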
The NoBroker dataset had one of the worst JSON inconsistencies I’ve seen. The photo_urls column was supposed to contain valid JSON arrays, but instead, it was filled with malformed strings, missing quotes, escaped backslashes, and random trailing characters.
```python
text_before = pics['photo_urls'][0]
print('Before Correction: \n\n', text_before)
```
Here is the string before correction.

The fix required multiple string replacements to correct the formatting before parsing. Here is the code.
```python
import json

# Strip stray backslashes and repair the broken quoting before parsing
text_after = text_before.replace('\\', '').replace('{title', '{"title').replace(']"', ']').replace('],"', ']","')
parsed_json = json.loads(text_after)
```
Here is the output.

The JSON was indeed valid and parseable after the fix. It is not the cleanest way to do this kind of string manipulation, but it works.
You see inconsistent formats everywhere: dates saved as strings, typos in categorical values, and numerical IDs saved as floats.
The solution is standardization, as we did with the JSON formatting.
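The same idea applies beyond JSON. Here is a small, self-contained sketch of that kind of standardization; the column names and values are made up for illustration, not taken from the NoBroker schema.

```python
import pandas as pd

# Toy frame with the three classic problems (column names are hypothetical)
df_toy = pd.DataFrame({
    'listed_on': ['01/03/2024', '15/03/2024', 'not a date'],
    'city': [' Bangalore', 'bengaluru ', 'BANGALORE'],
    'owner_id': [101.0, 102.0, None],
})

# Dates stored as strings -> proper datetimes (unparseable values become NaT)
df_toy['listed_on'] = pd.to_datetime(df_toy['listed_on'], dayfirst=True, errors='coerce')

# Typos and casing noise in categorical values -> one canonical spelling
df_toy['city'] = df_toy['city'].str.strip().str.lower().replace({'bengaluru': 'bangalore'})

# Numerical IDs stored as floats -> nullable integers
df_toy['owner_id'] = df_toy['owner_id'].astype('Int64')

print(df_toy.dtypes)
```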
# Practice #4: Data Type Validation and Schema Checks
It all begins when you load your data. Finding out later that your dates are strings or your numbers are stored as generic objects is time you don't get back.
In the NoBroker project, the types were validated during the CSV read itself, enforcing the right data types upfront with pandas parameters. Here is the code.
```python
# Default read: activation_date comes in as a plain object (string) column
data = pd.read_csv('property_data_set.csv')
print(data['activation_date'].dtype)

# Re-read with the date parsed up front
data = pd.read_csv('property_data_set.csv',
                   parse_dates=['activation_date'],
                   infer_datetime_format=True,
                   dayfirst=True)
print(data['activation_date'].dtype)
```
Here is the output.

The same validation was applied to the interaction dataset.
```python
interaction = pd.read_csv('property_interactions.csv',
                          parse_dates=['request_date'],
                          infer_datetime_format=True,
                          dayfirst=True)
```
Not only was this good practice, but it was essential for anything downstream. The project required calculations of date and time differences between the activation and request dates.
So the following calculation would throw an error if the dates were still strings.
```python
# Days between property activation and the interaction request
num_req['request_day'] = (num_req['request_date'] - num_req['activation_date']) / np.timedelta64(1, 'D')
```
Schema checks ensure the structure does not change, but in reality the data itself also drifts: its distribution shifts over time. You can mimic that drift by varying the input proportions slightly and checking whether your model, or its validation step, detects and responds to the change.
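As a minimal sketch of that idea, you could oversample one category in a simulated "production" batch and compare category shares against the training data. The building_type column comes from the dataset, while the sample size and the 0.15 alert threshold are arbitrary choices for illustration.

```python
import numpy as np

# Simulate a production batch in which the most common building type is over-represented
weights = np.where(data['building_type'] == data['building_type'].mode()[0], 3.0, 1.0)
prod_batch = data.sample(n=5000, replace=True, weights=weights, random_state=42)

# Compare category shares between the training data and the simulated batch
train_share = data['building_type'].value_counts(normalize=True)
prod_share = prod_batch['building_type'].value_counts(normalize=True)
shift = (train_share - prod_share).abs().max()

# Flag drift if any category's share moved by more than the chosen threshold
print(f"Max share shift: {shift:.3f}, drift flagged: {shift > 0.15}")
```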
# Documenting Your Cleaning Steps
In three months, you won't remember why you capped request_day_within_3d at 10. Six months from now, your teammate will break the pipeline by removing your outlier filter. In a year, the model will hit production, and no one will understand why it simply fails.
Documentation isn’t optional. That is the difference between a reproducible pipeline and a voodoo script that works until it doesn’t.
The NoBroker project documented every transformation in code comments and structured notebook sections with explanations and a table of contents.
Here is roughly what that structure looked like as a notebook outline:

```text
Assignment
Read and Explore All Datasets
Data Engineering
    Handling Pics Data
    Number of Interactions Within 3 Days
    Number of Interactions Within 7 Days
    Merge Data
Exploratory Data Analysis and Processing
Feature Engineering
    Remove Outliers
    One-Hot Encoding
    MinMaxScaler
Classical Machine Learning
    Predicting Interactions Within 3 Days
Deep Learning
```

And the in-code comments read like a running log of what was tried:

```python
# Try to correct the first Json
# Try to replace corrupted values then convert to json
# Function to correct corrupted json and get count of photos
```
Version control matters too. Track changes to your cleaning logic. Save intermediate datasets. Keep a changelog of what you tried and what worked.
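A lightweight way to keep that kind of record, as a sketch (the file names here are arbitrary, not from the project):

```python
from datetime import date

# Save the intermediate dataset with an explicit version tag
df.to_csv('property_data_cleaned_v2.csv', index=False)

# Append a one-line changelog entry describing what changed and why
with open('CLEANING_CHANGELOG.md', 'a') as f:
    f.write(f"- {date.today()}: capped request_day_within_3d at 10; "
            f"removed 2x IQR outliers ({len(df)} rows remain)\n")
```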
The goal isn’t perfection. The goal is clarity. If you can’t explain why you made a decision, you can’t defend it when the model fails.
# Final Thoughts
Clean data is a myth. The best data scientists are not the ones who run away from messy datasets; they are the ones who know how to tame them. They find the missing values before training.
They catch the outliers before they skew predictions. They check schemas before joining tables. And they write everything down so that the next person doesn't have to start from scratch.
Real impact doesn't come from perfect data. It comes from the ability to work with flawed data and still build something that works.
So when you open a dataset and see null values, broken strings, and outliers, don't panic. What you are looking at is not a problem but an opportunity to prove your skills on a real-world dataset.
Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.