A foremost objective of implementing a Master Data hub is to identify and resolve duplicate customer records. This is a crucial step towards achieving a single version of the truth about customer information, which in turn helps lower operational costs and maximize analytical capabilities.
Duplicate processing starts as soon as customer records are consolidated in the MDM hub from different source systems or, at worst, when duplicate records are found in the same system in the absence of business rules to prevent them.
The process has the following two steps and can take significant effort depending on how distorted the data is.
Step 1: Identifying duplicates (Data Matching)
Data matching is the process of identifying duplicate records. There are two matching techniques that today’s MDM applications use to detect duplicates.
- Deterministic – Exact comparison of data elements to assign scores.
- Probabilistic – Matching based on phonetics (sounds-like), likelihood of occurrence, and statistical theory to pinpoint variations and nuances. This technique assigns a percentage indicating the probability of a match.
Although both matching techniques have their advantages, probabilistic matching has the upper hand due to its higher accuracy in matching records.
In a recent article, Scott Schumacher compares probabilistic and deterministic data matching and explains the key differences between the two techniques. Although matching is a very interesting topic (and one of my favorites), let's save it for another column and concentrate on step 2, the main objective of this post.
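To make the distinction concrete, here is a minimal sketch (not a production matcher) contrasting the two techniques. The field names, weights, and use of Python's `difflib` similarity as a stand-in for a real probabilistic engine are all illustrative assumptions:

```python
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict, fields=("ssn", "email")) -> bool:
    """Exact comparison of selected identifiers: all-or-nothing."""
    return all(a.get(f) and a.get(f) == b.get(f) for f in fields)

def probabilistic_score(a: dict, b: dict, weights=None) -> float:
    """Weighted fuzzy similarity across fields, as a probability-like 0..1 score."""
    weights = weights or {"name": 0.5, "address": 0.3, "phone": 0.2}
    score = 0.0
    for field, w in weights.items():
        x, y = (a.get(field) or "").lower(), (b.get(field) or "").lower()
        if x and y:
            score += w * SequenceMatcher(None, x, y).ratio()
    return score

rec1 = {"name": "Robert Smith", "address": "1607 Chestnut Dr",
        "phone": "555-0142", "ssn": "123-00-6789"}
rec2 = {"name": "Bob Smith", "address": "1607 Chestnut Drive",
        "phone": "555-0142", "ssn": None}

# Exact compare fails as soon as an identifier is missing or formatted differently,
# while the fuzzy score still flags the pair as a likely match.
print(deterministic_match(rec1, rec2))
print(round(probabilistic_score(rec1, rec2), 2))
```

A real probabilistic matcher would also use phonetic encodings and frequency statistics (how rare a name is), but the shape of the decision (a tunable score versus a yes/no) is the same.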
Step 2: Merging Duplicate Records (Surviving the golden record)
Once the duplicate records are identified, they need to be merged to create what we call the “single version of the truth” or “golden record”. This record, by definition, is the best possible data representing a current, complete, and accurate view of customer information.
The main challenge we often face during the merging process is determining which data elements and values from the duplicated customer records should be considered when creating the golden record. The foundation stone is building “intelligent” data-survivorship rules for automated merging, when the system has to decide which data to pick from the duplicated records. The controls built into MDM should be comprehensive so that complete and accurate master information is realized.
Below are some of the rules that can be followed in an automated customer de-duplication process.
- Select the data element coming from the most trusted source. In other words, the rules should know which source's data elements have higher priority. We usually determine source priority by profiling the source data, which gives us metrics on data quality. More on profiling can be found here.
- Pick the data that was most recently changed. You would want to select a recently updated address from source A over an older address from a duplicate record coming from source B.
- Choose the data elements that are populated with more detail. For example,
- For names, choose Caroline over Carry, Robert over Bob, Michael Harris over M. Harris, etc.
- For addresses, choose 1607 Chestnut Drive over 1607 Chestnut Dr.
- Choose postal codes that are complete. For example, 43219-1933 over 43219 (ZIP+4 is better than an incomplete ZIP).
- Ignore null and empty values from suspect records.
- Ignore ambiguous values, e.g., 123-45-6789 as a Social Security number. (Of course, such values would not be here if you had been diligent about data governance and data quality control beforehand.)
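As a rough illustration, the rules above can be sketched as a simple per-field survivorship function. The source priorities, field names, and rule ordering (trust first, then recency, then level of detail) are assumptions for the example; a real hub would make each of these configurable:

```python
from datetime import date

# Hypothetical source trust ranking, derived from data profiling (higher = more trusted).
SOURCE_PRIORITY = {"CRM": 3, "BILLING": 2, "WEB": 1}
# Known-bad placeholder values to ignore outright.
AMBIGUOUS = {"123-45-6789", "999-99-9999", "N/A", "UNKNOWN"}

def survive(candidates):
    """Pick the surviving value for one field from duplicate records.

    candidates: list of (value, source, last_updated) tuples.
    Rules applied: drop null/ambiguous values, then prefer trusted source,
    then recency, then the more detailed (longer) value.
    """
    valid = [c for c in candidates
             if c[0] and str(c[0]).strip() and str(c[0]).upper() not in AMBIGUOUS]
    if not valid:
        return None
    best = max(valid, key=lambda c: (SOURCE_PRIORITY.get(c[1], 0), c[2], len(str(c[0]))))
    return best[0]

dups = {
    "name":    [("M. Harris", "WEB", date(2021, 5, 1)),
                ("Michael Harris", "WEB", date(2021, 5, 1))],    # more detail wins
    "address": [("1607 Chestnut Dr", "BILLING", date(2020, 1, 10)),
                ("1607 Chestnut Drive", "BILLING", date(2021, 3, 2))],  # recency wins
    "ssn":     [("123-45-6789", "WEB", date(2021, 5, 1)),        # ambiguous, ignored
                (None, "CRM", date(2021, 5, 1))],                # null, ignored
}
golden = {field: survive(cands) for field, cands in dups.items()}
print(golden)
```

Note that the ordering of the rules is itself a business decision: here trust outranks recency, but some implementations flip the two for fast-changing attributes like phone numbers.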
The above list covers some of the rules commonly used during data stewardship. There may be variations depending on the quality of the data and the customer's situation; sometimes you may need to build specialized rules based on how new customer records are created in the system versus how existing records are updated. Still, the above guidelines are a good starting point.
Do share your thoughts.
FX Nicolas said:
Nice introduction to the domain, Prashanta.
One missing element is the binning that usually takes place prior to every matching process.
I have covered the overall process for building MDM on the following blog post:
http://www.semarchy.com/blog/mdm-deep-dive-the-convergence-hub-pattern/
Prashanta C said:
Hi,
Thanks for reading and your comments.
I couldn’t agree more on the importance of binning and all the other cleansing and profiling processes we have to do before consolidating data. Only after that can we do effective matching and de-duplication of the data.
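To illustrate the point for other readers: binning (also called blocking) can be as simple as grouping candidate records by a cheap key, so the expensive matcher only compares records within a bin instead of all pairs. The key below (ZIP prefix plus last-name initial) is just an assumed example:

```python
from collections import defaultdict
from itertools import combinations

def bin_key(record):
    # Hypothetical blocking key: first three ZIP digits + last-name initial.
    zip3 = (record.get("zip") or "")[:3]
    initial = (record.get("last_name") or " ")[0].upper()
    return (zip3, initial)

def candidate_pairs(records):
    """Group records into bins and yield only within-bin pairs for matching."""
    bins = defaultdict(list)
    for r in records:
        bins[bin_key(r)].append(r)
    for members in bins.values():
        yield from combinations(members, 2)

records = [
    {"id": 1, "last_name": "Harris", "zip": "43219-1933"},
    {"id": 2, "last_name": "Harris", "zip": "43219"},
    {"id": 3, "last_name": "Smith",  "zip": "90210"},
]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)  # only records sharing a bin are ever compared
```

With n records, blocking turns an O(n²) comparison problem into something much smaller, at the (accepted) risk of missing matches that land in different bins.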
Just checked out your post and found it very useful in explaining all the steps in building MDM.
In this post, my main focus was data-survivorship rules. I have written other posts here that explain matching, data quality, and other topics. Please feel free to read them and share your comments too.
Thanks again
DataQualityChronicle (@dqchronicle) said:
Prashanta,
I enjoyed the post. It might be worth mentioning that duplicate consolidation is particular to the repository MDM model. Registry models do not consolidate duplicates but rather just identify them.
I appreciate the first line of the post as I am currently at a client that wants a registry but purchased a repository. It makes for a rough implementation.
On survivorship, have you encountered rules based on more complex logic? If so, I’d love to hear about it!
Prashanta C said:
William,
Glad you liked this post.
You are correct about the registry model, where we only identify the duplicates by applying suitable matching logic and let stewards know about the duplicate records. Most of the time I see the hub style being the first choice, except in healthcare scenarios, where the registry model is most often considered (due to the dispersed nature of patient data across networks of hospitals).
I have seen variations on survivorship rules in every implementation. Although the list I put up in this blog serves most purposes, additional rules are often built in to make sure we do not end up merging records that should be kept separate. For example, we unmarked a duplicate record that differed by a Person/Organization indicator (the case where a customer has a person account and also acts as a point of contact for the company he works for).
Similarly, there have been instances where I had to change the rules to disable merging when the duplicated records' lines of business differed; in that case the customer wanted to remove duplicates only within the same line of business. There are many such scenarios I can think of.
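As a rough sketch, such guard rules can be expressed as a pre-merge check that vetoes an automated merge even when the matcher says the records are duplicates. The field names below are illustrative, not from any specific MDM product:

```python
def merge_allowed(a: dict, b: dict) -> bool:
    """Veto automated merging of matched records when business guards differ.

    Guards mirror the cases above: a person account vs. an organization
    contact should stay separate, as should duplicates across different
    lines of business.
    """
    if a.get("party_type") != b.get("party_type"):          # Person vs Organization
        return False
    if a.get("line_of_business") != b.get("line_of_business"):
        return False
    return True

dup_a = {"name": "Robert Smith", "party_type": "PERSON", "line_of_business": "RETAIL"}
dup_b = {"name": "Robert Smith", "party_type": "ORG",    "line_of_business": "RETAIL"}

# Same name, but one is a person account and one is an organization contact:
# flag for steward review instead of auto-merging.
print(merge_allowed(dup_a, dup_b))
```

Keeping these guards separate from the survivorship rules makes it easier to tune them per customer without touching the merge logic itself.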
By the way, I am always surprised when customers choose the wrong implementation style. That can cost dearly.