Data Migration Discovery: What You Find Before You Build

There's a version of data migration that looks clean on paper. You profile the source, define the mappings, write the transformation scripts, load the data, validate. Done.

The thing is, that version doesn't exist in the wild.

The best data migrations we've run have one thing in common: by the time we reached delivery, we'd already changed our minds several times.

That's not a confession. It's the point. Discovery is supposed to surface the things your initial design didn't account for, the lookup codes that don't map, the validation logic that needs to move, the data that has to be inferred rather than read. If none of that happens, either the data is unusually clean, or the discovery wasn't deep enough.

Here's what it looks like when it works properly.

A national water data platform: when lookup codes don't exist

We were migrating groundwater data from four regional councils into a single national system. Each council had operated independently for years - think different legacy platforms, different schemas, different conventions for classifying the same things.

The assumption going in was reasonable: councils would be using broadly consistent lookup values for things like well types, aquifer classifications, and lithology types. We'd map them across, handle the edge cases & move right along.

What profiling revealed was different. Several councils were using lookup values that had no equivalent in the national standard, classifications that made sense within their regional context but had never needed to be reconciled with anyone else's. They weren't wrong. They just … didn't map.

This was invisible before we extracted and profiled the actual data. There was nothing in the documentation to flag it, because within each council's own system, everything was consistent.

The draft transformation scripts we'd begun writing had to stop. New lookup categories had to be designed - not by us, but collaboratively with domain specialists who understood what those classifications actually meant in practice. Groundwater scientists, not engineers, had to make those calls. Only then could we write transformation rules that reflected the reality of the data rather than our assumptions about it.

The delivery sequence changed too. Councils with clean, mappable data moved earlier. Councils with complex lookup situations moved later, once the extended category framework was established. This wasn't about deprioritising complexity - it was about not letting unresolved complexity block real progress. The system went live sooner because of that sequencing, with fewer councils initially, but with correct data rather than rushed data.

A government data agency: when validation logic belongs in the database

The second example came from a different kind of migration. We were building a data loading pipeline for a central government statistics agency, moving key summary data through cloud blob storage into a production database, with validation at each stage.

The initial design had validation happening at the application layer. That felt like a reasonable default. Application-level checks are easy to iterate on, easy to inspect, and don't require database schema changes when rules evolve.

Discovery told us it wasn't going to work here.

The structure of the data, once we were working with it rather than describing it, required validation logic that was deeply entangled with relational constraints. What looked like simple checks turned out to rely on referential integrity - cross-table relationships that made application-level enforcement fragile, with edge cases that only surfaced when multiple subject areas were loaded together.

So, the architecture changed. Validation moved into stored procedures at the database layer, which was more robust, more auditable, and better suited to the actual data structure. It was a more significant change than swapping a configuration value. It affected how the pipeline was sequenced, what the testing approach looked like, and how the agency's team would eventually maintain it.

That decision was made in discovery, on the basis of actual data, not theoretical structure. Made in delivery, it would have been expensive to unpick.

A major NZ enterprise with international operations: when bad data needs inference, not rejection

The third example is different again, not government datasets, but a decades-old proprietary business database running on a legacy 32-bit system with no modern API.

The organisation needed a defined subset of their legacy data available in a modern cloud environment, while the original system continued operating as the system of record. Twenty-three data domains: station records, lot tracking, genetics testing results, purchase accounts, creditor history. Around 2.5 million records in total.

The assumption going in was that field-level mapping would be the hard part. Define the source fields, define the target fields, write the transformation rules. Systematic, if tedious.

What discovery revealed was that a significant portion of the data was incorrect or incomplete - not corrupted, just the accumulated reality of a system that had been in use for decades without a clean-up. Values that were wrong. Fields that were blank when they shouldn't have been.

The transformation rules we'd written for clean data didn't hold. But the bigger problem was that we couldn't simply skip or reject bad records. These were core business entities with downstream dependencies, discarding them wasn't an option.

What discovery surfaced, and what changed our approach, was that in most cases the correct values could be inferred from other fields. The data was wrong, but it wasn't unrecoverable. What looked like a data quality problem was actually a data derivation problem, and those require different transformation logic entirely.

The scripts were rewritten around inference rules rather than direct mappings. What made this workable was how we handled records that couldn't load cleanly: rather than failing the entire pipeline, bad records were flagged as dead-letters, quarantined, visible, and actionable. A dead-lettered record could be re-triggered once the source was corrected, or explicitly marked to ignore if the data wasn't recoverable. The pipeline was never simply green or red. It had gradations, and that made the iterative cycles fast, a fix at the source or a tweak to the inference logic, then re-trigger the affected records rather than rerunning everything from scratch.

The extraction layer itself also changed. With no native API on a Windows-only legacy system, we built a wrapper that sanitised and normalised raw output before it entered the pipeline - providing a consistent interface regardless of what the underlying system returned.

That wasn't in the original design. Discovery made it unavoidable.

The pattern: discovery should change the plan

None of these surprises was a failure. They were just discovery doing its job.

The point of profiling real data before committing to transformation rules isn't to confirm your assumptions. It's to find out which ones are wrong while it's still cheap enough to respond. In both cases above, the change of course happened before significant delivery work had been done on the affected area. The cost was a few days of redesign rather than weeks of rework.

There's a version of discovery that treats this phase as documentation - capturing what's already known, formalising decisions already made, producing artefacts that justify proceeding. That's not what we mean by it.

Discovery is where you expect to be surprised. Where you build on the assumption that the data will tell you something your initial design didn't account for, and where the process is structured to absorb that without derailing what comes next.

Scoping transformation scripts by family or subject area rather than building a monolithic pipeline is part of this. A surprise in one domain affects the sequencing of that domain. It doesn't stop work proceeding elsewhere. And when rules change, which they always do, the Git history is the audit trail. The reasoning lives in the code, not in a separate document that drifts from reality

What this means for your migration team

The uncomfortable version of this is: if your discovery phase produces a migration plan that looks exactly like your initial assumptions, something has probably gone wrong.

Either the data really is that clean and well-documented (possible, but rare, we promise!) or the profiling wasn't deep enough to surface the issues that are waiting in delivery.

The useful version is: a discovery phase that changes its own outputs is a discovery phase that's working. The surprises it surfaces are surprises you can respond to deliberately, on your terms, before momentum makes that impossible.

We've run this pattern across numerous contexts, from national water data, census statistics, FoxPro legacy systems, to geospatial road network data. The contexts are hugely different… but the pattern holds.

The data always has something to say… the real question is whether you're listening before you build, or after.

FAQs about data migration discovery

Why does the migration plan always seem to change during discovery?

Because the data is the documentation. Source systems accumulate years of real-world complexity, lookup values that made sense in isolation, validation logic that evolved without being written down, records that are wrong but not deletable. A discovery phase that changes its outputs is working correctly. The surprises it surfaces are ones you can respond to deliberately, before delivery makes that expensive.

What does "profiling real data" actually mean in practice?

It means extracting a representative sample of the actual source data and examining what's in it - not what the schema says should be in it. Field-level distributions, null rates, value inconsistencies, referential integrity violations, lookup codes with no equivalent in the target system. You're looking for the gap between the documented structure and the lived reality of the data.

How do you handle data quality problems that only appear mid-migration?

It depends on what kind of problem it is. If data is wrong but recoverable, you write inference rules that derive the correct value from surrounding fields rather than rejecting the record. If it can't be recovered, you quarantine it as a dead-letter (flagged, visible, and re-triggerable once the source is corrected). The pipeline should never be simply green or red. Partial success states make iterative cycles fast and keep the overall programme moving.

What's the risk of skipping deep discovery on a data migration?

The transformation rules you write on the basis of assumptions rather than data tend to break in delivery, at the point when they're most expensive to unpick. A few days of redesign in discovery is routine. The same rethink after significant delivery work has been done on the affected area is a very different conversation.

How do you keep transformation logic auditable over time?

Scoping transformation scripts by subject area or domain rather than building a monolithic pipeline means that when rules change - and they always do - you can trace exactly what changed and why. The reasoning lives in the code and its commit history, not in a separate document that drifts from reality.