Managing Duplicate Data in Integrated Systems
This article discusses general best practices for identifying and controlling duplicate data in integrated systems. While these guidelines can improve efficiency and integrity in any database, they are especially important for any integration that synchronizes two or more systems. They become critical in bi-directional synchronization, where systems continually trade data and any complication or contamination can be greatly amplified.
Understand your data
It is relatively simple to review the fields present in one system for a given type of record and to understand their data types, intended usage, and limitations. When doing the same for multiple systems, however, you also need to understand how those systems will translate, or “map”, to each other. One system may call a record a “Lead” if it represents a person the company may be able to reach and sell products to. Another system may call the same type of record a “Prospect” and may use different fields in the record’s data. Both may share a First Name and a Last Name field, but only one may use a Middle Name or Birthday. In one, the first and last names may be held in separate fields, while the other may have a single field containing the entire name. In one, a First Name field may hold at most 20 characters; in the other, 50. One may have a Salutation field that is a predefined drop-down selection, while the other’s is an open text field. In planning for handling duplicates and integrating data, you must have a deep understanding of your own systems, a clear vision of the single pathway data should follow, and an awareness of how it may need to change along the way.
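To make the mapping concerns above concrete, here is a minimal sketch of translating a record between two such systems. All field names, the 20-character limit, and the salutation options are illustrative assumptions, not any vendor’s real schema:

```python
# Hypothetical sketch: mapping a "Lead" record from system A into the
# "Prospect" shape expected by system B. Field names and limits are
# assumptions made for illustration only.

ALLOWED_SALUTATIONS = {"Mr.", "Ms.", "Dr."}  # system B's drop-down options

def map_lead_to_prospect(lead: dict) -> dict:
    # System A stores one full-name field; system B wants first/last.
    parts = lead.get("full_name", "").split(" ", 1)
    first = parts[0]
    last = parts[1] if len(parts) > 1 else ""

    # System B caps first names at 20 characters, so truncate.
    first = first[:20]

    # System B's salutation is a predefined list; drop values it rejects.
    salutation = lead.get("salutation", "")
    if salutation not in ALLOWED_SALUTATIONS:
        salutation = ""

    return {"first_name": first, "last_name": last, "salutation": salutation}
```

Every such transformation is a place where data can silently change or be lost (truncated names, dropped salutations), which is why the mapping rules should be decided deliberately rather than left to integration defaults.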
What will properly and consistently identify a record as belonging to one real-life object (i.e. a person or a company)? For instance, if you have a database with a large number of Contacts, chances are you will encounter more than one person named “John Smith”. To make sure you have only one record for each real person, you might add identifying information such as an ID number or address. You want your users to know that the John Smith from Tucson, AZ is not the same person as the John Smith from Flagstaff, AZ. Questions like this multiply in complexity when you introduce synchronization with other databases, because each system may use different criteria. So the first order of business is for a company to review their intended integration process and identify the minimum set of fields, present in all synchronized systems, that will guarantee uniqueness under any circumstance.
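One way to express that “minimum set of fields” is as a composite uniqueness key that can be computed in every synchronized system. The sketch below assumes name plus city is sufficient purely for the example; your own minimum set will differ:

```python
# Illustrative sketch: deriving a composite "uniqueness key" from the
# minimum set of shared fields. The chosen fields (name plus city) are
# an assumption for this example, not a recommendation.

def uniqueness_key(contact: dict) -> tuple:
    # Normalize so "John Smith" and "john smith " compare equal.
    return (
        contact.get("first_name", "").strip().lower(),
        contact.get("last_name", "").strip().lower(),
        contact.get("city", "").strip().lower(),
    )

def find_collisions(contacts: list) -> dict:
    seen = {}
    for c in contacts:
        seen.setdefault(uniqueness_key(c), []).append(c)
    # Any key mapping to more than one record is a potential duplicate.
    return {k: v for k, v in seen.items() if len(v) > 1}
```

Running such a collision check against each system’s data before the first synchronization is a cheap way to discover whether the chosen fields actually guarantee uniqueness.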
As mentioned, issues with duplicates can multiply with synchronized systems. For example, in one of the most common problem scenarios, system “A” may have a record for John Smith, and may rely on the full name together with an ID number to identify its own unique Contacts. Another system “B” may contain very similar fields for its own Contact records, but may rely only on the Email Address of a Contact as a unique identifier. It may have multiple records with the full name John Smith, but different email addresses. Let’s also assume that system B does not use the same ID number that system A uses (which is almost universally true). So, let’s say that system A has two Contact records named John Smith, with ID numbers and no email addresses. And let’s say system B has two of its own Contact records named John Smith, with unique email addresses and no ID numbers. What will happen when they integrate?
Now, when system A wants to update system B, it may pick the first John Smith record and look for a match in system B. Since the two systems do not use the same criteria for uniqueness, your only option may be to match on full name. What happens next can vary widely depending on how the integration is configured:
System A’s first record may match against only the first system B record. This means only one system B record is updated. The second system A record would likely also update that same record, so the second system B record may become “stale”.
The first system A record may match against both of the system B records at once. This will result in one system A record applying its data to two different records in system B, which is very likely to cause confusion. Further, if system B later synced its data up to system A, it would then move only one record’s data into both of system A’s records, and all four records across both systems would quickly become indistinguishable. Even one or two iterations of synchronizations can destroy a significant amount of data in this way if great care is not taken to prevent the proliferation of contamination.
The first system A record may fail to match against any system B record due to its ID number not being found (if this was included in the match criteria). In this case, the integration might be instructed to create a brand new record in system B, which could cause further confusion and undesired sync behavior.
Other unexpected results may occur. For example, if a Null value is involved (that is, the total absence of data), matching may behave surprisingly, because a Null value will not match a “blank” value. This is likely counter-intuitive to the average user, but makes perfect sense from a development perspective.
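The outcomes above can be sketched in a few lines. The records here are invented for illustration, but they show how matching on full name alone can yield many candidates, zero candidates, and how a Null (None) differs from a blank string:

```python
# Sketch of the ambiguity described above: matching system A records
# into system B on full name alone. All data is invented.

system_b = [
    {"full_name": "John Smith", "email": "jsmith@example.com"},
    {"full_name": "John Smith", "email": "john.s@example.com"},
    {"full_name": "Jane Doe", "email": None},
]

def find_matches(a_record: dict) -> list:
    return [b for b in system_b if b["full_name"] == a_record["full_name"]]

# One system A record matches BOTH John Smith records in system B:
assert len(find_matches({"full_name": "John Smith", "id": 101})) == 2

# No match at all, so the integration might wrongly create a new record:
assert find_matches({"full_name": "Alan Turing", "id": 102}) == []

# Null is not the same as blank: None does not equal the empty string.
assert None != ""
```

An integration that receives two matches, or zero, has to be told explicitly what to do; leaving that decision to a tool’s default behavior is how the contamination described above begins.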
In order to manage the flow of data properly, you must establish matching criteria that uniquely identify records and that every synchronized system can use. If possible, consider customizing one or all systems to store the same uniquely identifying information, which reduces the complexity of matching.
Close the loop
After you have accurately identified unique data and taken steps to strengthen matching criteria, you should explore other methods of preventing and resolving duplicate data. Taken as a whole process, your first goal with an integration is to avoid spreading corrupted data. Smaller errors with data types or missing fields can be resolved far more easily than situations where weak specifications allowed data to be inappropriately overwritten or created. For instance, one company allowed a single Lead record with a blank name and phone number to exist because they had use for a “dummy” record. In another system, however, they had many Lead records that had phone numbers but no names, as the names were not yet known. You may guess what happened when they synchronized the first system’s data to the second, matching on each record’s name: while other records synchronized without issue, the one “dummy” record overwrote every Lead record with a blank name, destroying all of the phone number data in those records. The last thing anyone wants is to discuss restoring backups in the middle of an integration, or heaven forbid, in a live system!
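A simple guard can prevent that class of failure: before matching, refuse to use a blank or missing key at all. This is a minimal sketch under the assumption that the match key is a name field; the field name and the quarantine behavior are illustrative choices:

```python
# Hedged sketch of a guard for the "dummy record" failure above: never
# match on a blank or Null key, so one nameless record cannot overwrite
# every other nameless record downstream.

def safe_match_key(record: dict, key_field: str = "name") -> str:
    value = record.get(key_field)
    if value is None or not value.strip():
        # Blank or Null key: quarantine the record instead of matching.
        raise ValueError(f"record {record!r} has no usable {key_field}")
    return value.strip().lower()
```

Records that fail this check can be routed to a review queue rather than synchronized, which converts a silent mass-overwrite into a visible, fixable exception.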
You must adopt a strong data policy to be followed in all systems. This includes both using the programmatic tools available to you and training users to use them effectively and to recognize when they are not functioning. Most systems have at least one mechanism to validate data and prevent a user from entering a duplicate. For instance, HubSpot uniquely identifies its Contact records by email address and prevents entry of another Contact record with the same value as an existing record. Dynamics CRM offers Duplicate Detection Rules, which (once activated, as they are not on by default!) validate data as it is entered and store the criteria on which Duplicate Detection Jobs can be run to identify and merge existing duplicate data.
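The idea behind such mechanisms can be illustrated with a small entry-time check. This is a simplified sketch in the spirit of email-based uniqueness, not the actual implementation of HubSpot or Dynamics CRM:

```python
# Simplified illustration of entry-time duplicate prevention: records
# keyed by a normalized email address, with duplicates rejected at entry.

class ContactStore:
    def __init__(self):
        self._by_email = {}

    def add(self, contact: dict) -> bool:
        email = (contact.get("email") or "").strip().lower()
        if not email:
            return False  # no identifier at all: reject outright
        if email in self._by_email:
            return False  # duplicate of an existing record: block entry
        self._by_email[email] = contact
        return True
```

Rejecting the record at entry, rather than cleaning it up later, is the point: once a duplicate exists, every downstream synchronization is a chance for it to spread.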
Users are always on the lookout for more efficient ways to do their jobs. While admirable in intent, this can lead some to take “shortcuts” to achieve what they believe to be the correct end result. Unfortunately, this can have serious consequences for any system that maintains internal processes not visible to the user. And again, such a problem may be greatly exacerbated when more than one system is synchronizing data.
Consider the case of the “dummy” Lead record. Originally, a user who had not been properly trained wanted to add general notes in the system; since the system was configured to drive users down a workflow path focused primarily on Leads, the user reasoned that they could simply create a dummy Lead record with blank fields and deposit their notes there. This issue was not caught by management or other users because a typical search for a record by name would never surface the dummy record with its blank name. Even in testing the integration, the issue did not present itself, because the blank record was never brought to the test system. All of the resulting fallout, including damage to a production system’s data, could have been prevented if either a) the user had fully understood the result of their actions and the alternatives available, or b) the system had been configured to prevent creating a Lead without a name.
Even if an integration were not involved, this user’s shortcut would have caused other future challenges with data integrity. Other users might have been encouraged to do the same thing. Upgrading to a later version of the system might cause errors on those records if validation processes or form rendering changed. Migrating data to new hardware or to a cloud server might fail to bring the records over. Or the user might forget the record exists after not using it for some time, and go looking for a new “shortcut”.
You can think of a database as a giant aquarium. One tiny leak may not sound like much, but not only is the water then outside of the tank (creating issues for the inhabitants), it causes unpredictable water damage to the surrounding room, and may not even be noticed until the damage is done. When you keep business-critical data contained within a database, that data is only as stable as the architecture of the system. Every “loose” rule or loophole can, and often will, encourage further poor handling of the system. Maintaining a database is essentially a fight against chaos and entropy; to win it, you need to make proper handling enticing to your users. That requires both a rigidly ordered system and users who not only follow the policies, but understand the reasons behind them.