
A foremost objective of implementing a Master Data hub is to identify and resolve duplicate customer records. This is a crucial step toward achieving a single version of truth about customer information, which helps lower operational costs and maximize analytical capabilities.

Duplicate processing starts as soon as customer records are consolidated in the MDM hub from the different source systems, or, at worst, when duplicate records are found within the same system in the absence of business rules to prevent them.

The process has the following two steps and can take significant effort depending on how distorted the data is.

Step 1: Identifying duplicates (Data Matching)

Data matching is the process of identifying duplicate records. Today's MDM applications use two matching techniques to detect duplicates.

  1. Deterministic – exact comparison of data elements to assign scores.
  2. Probabilistic – matching based on phonetics ("sounds like"), likelihood of occurrence, and statistical theory to pinpoint variations and nuances. This technique assigns a percentage indicating the probability of a match.

Although both matching techniques have their advantages, probabilistic matching has the upper hand due to its higher accuracy in matching records.
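To make the contrast concrete, here is a minimal sketch of the two techniques using only Python's standard library. The function names (`match_deterministic`, `match_probabilistic`) and the field list are illustrative assumptions, not the API of any particular MDM product; real probabilistic matchers also weight fields by statistical frequency, which this sketch omits.

```python
from difflib import SequenceMatcher

def match_deterministic(a: dict, b: dict, fields=("first", "last", "zip")) -> int:
    """Exact compare of data elements; each exact match adds to the score."""
    return sum(1 for f in fields if a.get(f) and a.get(f) == b.get(f))

def match_probabilistic(a: dict, b: dict, fields=("first", "last", "zip")) -> float:
    """Fuzzy similarity averaged across fields, returned as a percentage."""
    ratios = [
        SequenceMatcher(None, str(a.get(f, "")).lower(),
                        str(b.get(f, "")).lower()).ratio()
        for f in fields
    ]
    return round(100 * sum(ratios) / len(ratios), 1)

# A typo in the first name breaks the exact comparison but barely
# dents the probabilistic score.
r1 = {"first": "Robert", "last": "Harris", "zip": "43219"}
r2 = {"first": "Robrt",  "last": "Harris", "zip": "43219"}

print(match_deterministic(r1, r2))   # only 2 of 3 fields match exactly
print(match_probabilistic(r1, r2))   # still a high probability of a match
```

This is why probabilistic matching tends to catch the real-world variations (typos, nicknames, abbreviations) that an exact compare misses.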

In a recent article, Scott Schumacher compares probabilistic and deterministic data matching and explains the key differences between the two techniques. Although matching is a very interesting topic (and one of my favorites), let's save it for another column and concentrate on step 2, the main objective of this post.

Step 2: Merging Duplicate Records (Surviving golden record)

Once duplicate records are identified, they need to be merged to create what we call the “single version of truth” or “golden record”. By definition, this record is the best possible data representing a current, complete, and accurate view of customer information.

The main challenge we often face during the merging process is determining which data elements and values from the duplicate customer records should be used to create the golden record. The foundation is building “intelligent” data survivorship rules for automated merging, where the system has to decide what data to pick from the duplicate records. The controls built into MDM should be comprehensive so that complete and accurate master information is realized.

Below are some of the rules that can be followed in an automated customer de-duplication process.

  1. Select the data element coming from the most trusted source. In other words, the rules should know which source's data element has higher priority. We usually determine which source gets higher priority through data profiling on the source data, which gives us metrics on data quality. More on profiling can be found here.
  2. Pick the data that was most recently changed. You would want to select a recently updated address from source A over an older address from a duplicate record coming from source B.
  3. Choose the data elements that are populated with more detail. For example,
    1. For names, choose Caroline over Carry, Robert over Bob, Michael Harris over M. Harris, etc.
    2. For addresses, choose 1607 Chestnut Drive over 1607 Chestnut Dr.
    3. Choose postal codes that are complete. For example, 43219-1933 over 03219 (ZIP+4 is better than an incomplete ZIP).
  4. Ignore null and empty values in suspect records.
  5. Ignore ambiguous values, e.g., 123-45-6789 for a social security number. (Of course, such values should not be here at all if you had good data governance and data quality controls in place beforehand.)
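The rules above can be sketched as a single survivorship function. This is a minimal illustration, assuming a per-source trust ranking and a last-updated date on each record; the source names, the `SOURCE_PRIORITY` table, and the `survive` function are all hypothetical, and a production MDM hub would make these rules configurable per data element.

```python
from datetime import date

SOURCE_PRIORITY = {"CRM": 1, "Billing": 2, "Web": 3}  # 1 = most trusted (assumed ranking)
AMBIGUOUS = {"123-45-6789", "000-00-0000", "N/A", "UNKNOWN"}

def survive(candidates: list[dict]):
    """Pick the surviving value for one data element from duplicate records.

    candidates: [{"value": ..., "source": ..., "updated": date}, ...]
    Rules applied in order: drop nulls/empties and ambiguous values
    (rules 4-5), then prefer the trusted source (rule 1), then recency
    (rule 2), then the more detailed value (rule 3).
    """
    valid = [c for c in candidates
             if c.get("value") and str(c["value"]).strip()
             and str(c["value"]).upper() not in AMBIGUOUS]
    if not valid:
        return None
    valid.sort(key=lambda c: (
        SOURCE_PRIORITY.get(c["source"], 99),   # rule 1: trusted source first
        -c["updated"].toordinal(),              # rule 2: most recent first
        -len(str(c["value"])),                  # rule 3: more detail first
    ))
    return valid[0]["value"]

name = survive([
    {"value": "M. Harris",      "source": "Web", "updated": date(2011, 5, 1)},
    {"value": "Michael Harris", "source": "Web", "updated": date(2011, 5, 1)},
    {"value": "",               "source": "CRM", "updated": date(2011, 6, 1)},
])
print(name)  # the empty CRM value is dropped and the fuller name wins
```

Running this for every data element across a set of duplicate records yields the golden record.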

The above list provides some of the rules commonly used during data stewardship. However, there may be variations depending on the quality of the data and the customer situation. Sometimes you may need to build specialized rules based on how new customer records are created in the system versus how existing records are updated. Still, the above guidelines are a good starting point.

Do share your thoughts.