5 Data Pipeline Mistakes That Cost Me Weeks of Debugging

After three years building data pipelines in production, I’ve made plenty of mistakes. Some were quick fixes. Others cost me days of debugging and awkward conversations with management.

Here are five mistakes that taught me the most — not because they were dramatic or interesting, but because they were subtle enough to slip through testing and painful enough that I’ll never make them again.

If you’re building data pipelines, hopefully my mistakes save you some time.

Mistake #1: Silently Dropping 10% of Data

I added validation logic to filter out “invalid” records from our pipeline. It seemed smart: catch bad data before it reaches the warehouse. I tested it on a sample dataset, everything looked fine, and I deployed it on a Friday afternoon.

Monday morning, a business analyst asked why revenue looked 10% lower than expected. My first thought was “probably just a slow weekend.” Then finance called. Then my manager called.

Turns out, a source system had added a new status code without telling anyone. My validation logic saw this unknown code, flagged those records as invalid, and silently dropped them. 10% of our transactions, gone. The pipeline ran successfully — all green checkmarks, no errors — because technically it worked exactly as I’d coded it.

The worst part? It took almost a full day to figure out what was wrong. I was looking for pipeline failures, schema mismatches, network issues. Everything checked out. The data was being dropped intentionally by my own code, so there were no error logs.

How did I fix it? First, I stopped the pipeline immediately. Then I recovered the data from our raw zone and reprocessed it. Took about four hours to backfill everything.

But the real fix was changing how I think about validation. Now I don’t drop data silently — I send it to an error table with alerts. If something unexpected appears, I know about it. And I never, ever deploy significant changes on Friday afternoons anymore.
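Here is a minimal sketch of the "error table plus alert" idea in Python. It is not the actual pipeline code: the status codes, the `reject_reason` field, and the 1% alert tolerance are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical set of status codes the pipeline knows how to handle.
KNOWN_STATUSES = {"completed", "refunded", "pending"}


@dataclass
class ValidationResult:
    valid: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def split_records(records, known_statuses=KNOWN_STATUSES):
    """Route each record to `valid` or `quarantined`; nothing is discarded."""
    result = ValidationResult()
    for record in records:
        if record.get("status") in known_statuses:
            result.valid.append(record)
        else:
            # Keep the record plus a reason so it can be reviewed and replayed.
            result.quarantined.append({**record, "reject_reason": "unknown status"})
    return result


def should_alert(result, max_quarantine_ratio=0.01):
    """True when the quarantined share exceeds the (assumed) 1% tolerance."""
    total = len(result.valid) + len(result.quarantined)
    return total > 0 and len(result.quarantined) / total > max_quarantine_ratio
```

In a real pipeline the `quarantined` list would be written to the error table and `should_alert` would feed whatever paging or chat alert you already use. The point is that an unknown status code shows up as a growing error table, not as a quiet drop in revenue.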

The lesson? Pipelines that succeed aren’t always correct. Green checkmarks just mean your code ran, not that it did the right thing.

Mistake #2: The Weekend Bug That Haunted Me

Our pipeline ran perfectly Monday through Friday. Every single weekend it failed. Every Saturday and Sunday morning, I got paged.

For the first few weeks, I thought it was a coincidence. Maybe the source system had issues on weekends. Maybe network problems. I spent hours checking infrastructure, reviewing logs, testing connections. Everything looked fine during the week.

Then I realized — the pipeline wasn’t actually failing. It was being killed by our monitoring system because it thought something was wrong.

The problem? I had a row count check with a hard-coded threshold. The pipeline expected at least 100,000 records per day. Monday through Friday, we got 120,000–150,000 transactions. Easy pass.

Weekends? Only 20,000–30,000 transactions. Our customers didn’t work weekends. Lower volume was completely normal. But my check didn’t know that. It saw “only 20,000 rows” and decided the pipeline had failed.

The fix was embarrassingly simple — change from a fixed threshold to a percentage-based check comparing against the same day of the week historically. Weekends are compared to previous weekends, not to weekdays.
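A rough sketch of a same-weekday check, assuming the history of daily row counts is available as a dict keyed by date. The function name, the use of the median as the baseline, and the 50% tolerance are illustrative choices, not the original implementation.

```python
from datetime import date
from statistics import median


def row_count_ok(today, today_count, history, min_ratio=0.5):
    """Compare today's count with the median count for the same weekday.

    history: dict mapping datetime.date -> row count for earlier runs.
    min_ratio: assumed tolerance; 0.5 means "at least half the usual volume".
    """
    same_weekday = [
        count
        for day, count in history.items()
        if day.weekday() == today.weekday() and day < today
    ]
    if not same_weekday:
        # No baseline yet: pass rather than page someone on a guess.
        return True
    return today_count >= min_ratio * median(same_weekday)


history = {
    date(2024, 1, 6): 22_000,   # Saturday
    date(2024, 1, 8): 130_000,  # Monday
    date(2024, 1, 13): 25_000,  # Saturday
    date(2024, 1, 15): 140_000, # Monday
}
saturday_ok = row_count_ok(date(2024, 1, 20), 24_000, history)
monday_ok = row_count_ok(date(2024, 1, 22), 24_000, history)
```

With this history, 24,000 rows passes on a Saturday but fails on a Monday, which is the behavior the fixed threshold could not express.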

Took me three weekends of being woken up at 6 AM to figure this out.

The lesson? Context matters in data validation. What’s normal on Tuesday isn’t normal on Sunday. Your checks need to understand the patterns in your data, not just absolute numbers.

Mistake #3: When $100 Became 10,000

Transaction amounts suddenly exploded in our reports overnight. Every single number was exactly 100x what it should have been.

This one took me almost a full day to debug because the numbers weren’t obviously wrong. A $100 transaction became 10,000. In isolation, that’s a valid transaction amount. Nothing technically broken — no nulls, no errors, schema matched perfectly.

The breakthrough came when I compared distributions. Average transaction amount had been around $150 for months. Suddenly it was $15,000. That’s when I knew something systemic had changed.

I traced it back to the source system. They’d changed from sending amounts in dollars ($100.00) to cents (10000). Their reasoning? “Cents are more precise and avoid floating-point issues.” Fair enough. But they didn’t tell anyone.

My pipeline happily processed the new format. Why wouldn’t it? Numbers are numbers. The schema was still “decimal field for amount.” Technically valid.

The fix was adding a validation check: if the average transaction amount changes by more than 50% day-over-day, alert someone. I also started tracking how each day’s amounts compare with historical patterns.

But more importantly, I learned to monitor distributions, not just point values. A value can be individually valid but collectively wrong. If every transaction suddenly costs 100x more, something changed in how the data is formatted, even if the schema stayed the same.
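The day-over-day check can be sketched in a few lines of Python. This assumes you can pull yesterday's and today's amounts as plain lists. The function name is hypothetical, and the 50% limit is the one mentioned above.

```python
from statistics import mean


def mean_shift_alert(previous_amounts, current_amounts, max_change=0.5):
    """Return True when the average amount moved by more than max_change."""
    if not previous_amounts or not current_amounts:
        return False  # nothing to compare against
    prev_avg = mean(previous_amounts)
    curr_avg = mean(current_amounts)
    if prev_avg == 0:
        # Avoid dividing by zero; any move away from zero counts as a shift.
        return curr_avg != 0
    return abs(curr_avg - prev_avg) / abs(prev_avg) > max_change


yesterday = [100.0, 150.0, 200.0]      # dollars
today_in_cents = [10000, 15000, 20000]  # same transactions, new unit
unit_change_detected = mean_shift_alert(yesterday, today_in_cents)
```

Every value in `today_in_cents` would pass a per-row range check, yet the average jumps by a factor of 100, so the check fires. A mean is sensitive to a few huge transactions. Comparing medians or percentiles the same way is a reasonable variation if your data has outliers.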

The lesson? Data can be technically correct but business incorrect. Schema validation catches structure problems. Distribution monitoring catches semantic problems.

Mistake #4: The Schema Change Nobody Told Me About

A source system added new columns to their schema without telling anyone. I didn’t update my transformation logic to include these columns when checking for duplicates.

The result? Records that should have been deduplicated weren’t. We started seeing the same transactions appear multiple times in our reports. Not every record — just enough to make the numbers look slightly off.

The confusing part was that my deduplication logic was working correctly for the old schema. I was using transaction_id and timestamp to identify duplicates. But the source system had added a version column that changed for retries. Same transaction_id, same timestamp, different version. My code saw them as the same record. The database saw them as different.

It took me two days to figure out because the duplicates weren’t obvious. Revenue reports were 3–5% higher than expected. Not enough to scream “something’s broken” but enough that finance noticed during reconciliation.

The fix was simple once I found it — include all relevant columns in the deduplication logic. The lesson? Always check what changed in the source schema, even if nobody tells you it changed.

Now I log schema changes automatically. If a new column appears, I get an alert. Saves me from assuming the schema is the same as last week.
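A schema-change alert does not need much machinery. Here is a minimal sketch, assuming you store the column list you last saw and can read the column list of the incoming batch. `diff_schema` and the column names are illustrative.

```python
def diff_schema(expected_columns, observed_columns):
    """Return (added, removed) column names, each as a sorted list."""
    expected, observed = set(expected_columns), set(observed_columns)
    return sorted(observed - expected), sorted(expected - observed)


last_known = ["transaction_id", "timestamp", "amount"]
incoming = ["transaction_id", "timestamp", "amount", "version"]

added, removed = diff_schema(last_known, incoming)
if added or removed:
    # Replace print with your alerting hook of choice.
    print(f"Schema changed: added={added}, removed={removed}")
```

Run before the transformation step, this would have flagged the new `version` column on the first batch that carried it.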

Mistake #5: The Missing Columns in COALESCE

I was merging data from multiple sources using COALESCE to pick the first non-null value across columns. Simple enough — if Source A has the data, use it. If not, fall back to Source B, then Source C.
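For readers who have not used it, SQL's COALESCE returns its first non-null argument. The same fallback can be sketched in Python, with `None` standing in for NULL. The `merge_sources` helper and the sample records are hypothetical.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def merge_sources(sources, fields):
    """Merge records field by field; `sources` is in priority order (A, B, C)."""
    return {f: coalesce(*(s.get(f) for s in sources)) for f in fields}


source_a = {"customer_id": None, "amount": 100}
source_b = {"customer_id": "C1", "amount": 120, "source_system_id": "B"}

merged = merge_sources(
    [source_a, source_b],
    fields=["customer_id", "amount", "source_system_id"],
)
```

Note that only the fields you list are merged. Any column left out of `fields` never gets a fallback rule at all.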

Except I didn’t include all the columns in my logic. I focused on the main fields — customer ID, transaction amount, date. But I missed some metadata columns like source_system_id and updated_timestamp.

This created duplicates because records that should have been identified as the same transaction weren’t. They had the same main fields but different metadata, so my join logic treated them as separate records.

Debugging this was frustrating because the duplicates followed no obvious pattern. Some customers had them, others didn’t. Some days had duplicates, other days were clean. It looked random.

The breakthrough came when I added granularity to my debugging — instead of just checking if duplicates existed, I checked exactly which columns were causing them. I wrote a query that compared all fields between duplicate records and showed me which ones differed.
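The original query is not shown here, but the same field-by-field comparison can be sketched in Python: group records by the business key you believe is unique, then count which of the remaining fields disagree inside each group. The function and field names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def differing_fields(records, key_fields):
    """Count, per non-key field, how many duplicate groups disagree on it."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record.get(f) for f in key_fields)].append(record)

    counts = Counter()
    for group in groups.values():
        if len(group) < 2:
            continue  # not a duplicate group
        other_fields = set().union(*(r.keys() for r in group)) - set(key_fields)
        for f in other_fields:
            # repr() lets unhashable values (lists, dicts) be compared too.
            if len({repr(r.get(f)) for r in group}) > 1:
                counts[f] += 1
    return counts


records = [
    {"customer_id": "C1", "amount": 100, "date": "2024-01-06", "source_system_id": "A"},
    {"customer_id": "C1", "amount": 100, "date": "2024-01-06", "source_system_id": "B"},
    {"customer_id": "C2", "amount": 50, "date": "2024-01-06", "source_system_id": "A"},
]
report = differing_fields(records, ["customer_id", "amount", "date"])
```

The fields with the highest counts are the ones splitting records you expected to be identical, which turns a "random" pattern into a short list of suspects.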

That’s when I saw it — the metadata columns I’d ignored. Once I added them to my COALESCE logic with proper priority ordering, the duplicates disappeared.

The lesson? When handling multiple data sources, think about ALL columns that define uniqueness, not just the obvious business keys. And when debugging duplicates, check field-by-field to see exactly what’s different.

What I Learned

Looking back at these five mistakes, there’s a pattern:

Test the right things. Schema validation is easy. Testing business logic is hard. Most of my bugs came from assumptions about the data, not the code.

Monitor what matters. Green checkmarks mean your pipeline ran. They don’t mean your data is correct. Track distributions, row counts, and patterns — not just success/failure.

Context is everything. A valid value on Tuesday might be invalid on Sunday. A normal schema last week might have changed this week. Your validation logic needs to understand context.

Never drop data silently. If something looks wrong, flag it loudly. Send it to an error table. Alert someone. Don’t just filter it out and hope it was actually bad data.

Keep raw data. Every single one of these mistakes was fixable because we kept the original data. When your transformation logic is wrong, you can reprocess. When you’ve dropped the raw data, you’re done.

The best part about making mistakes? You only make each one once — if you learn from it. These five cost me weeks of debugging time. But now I have checks in place to catch them before they reach production.

What’s your worst pipeline debugging story? I’d love to hear what others have learned the hard way.

