What’s entity resolution anyway?
Have you ever saved the same person to your phone’s contacts list multiple times, each under a different entry? Imagine you meet Melissa at a professional meeting and really hit it off, so you add her work number to your phone. Months later, you run into Melissa again. You save her email address this time, but under a new contact because you forgot about the original one. You continue to do this with different people, and soon you’ve created a messy, fragmented contacts list filled with likely duplicates.
Eventually you decide to organize your contacts list and find two contacts labeled “Melissa.” The first contains a phone number, the second only an email address. Do these two contacts refer to the same Melissa? Data scientists call this problem, matching records across different data sources when no shared identifier links them, entity resolution.
Why combined data matters
In an increasingly data-driven world, U.Group’s clients and business partners gain more insight into their data when they’re able to successfully bring together different data sources. Our data science team uses Natural Language Processing (NLP) and Artificial Intelligence (AI) to combine information from multiple data sources to produce a rich set of distinct records. Because there is more knowledge in the combined record than in the individual unmatched records, entity resolution makes valuable data even more valuable.
Consider a client who wishes to better understand their industry’s supply chain. To meet this client’s needs, U.Group scans many data sources to compile records of various contractors and suppliers. The different data sources we use often contain information on the same entities—e.g., companies, contracts, or projects. Entity resolution helps us resolve the records across different data sources, combining their information into a single record.
When we combine records, we merge their fields and features to create a single record that provides more information than the original records. As we fuse records across more data sources, a detailed picture of the entities in the data develops. For instance, if we merge financial data from one data source and information about awarded contracts from another data source, we can provide a richer view into a company and its relationships with other contractors and suppliers.
We can then apply advanced AI techniques, such as Probabilistic Soft Logic, to this detailed view to learn more about entities in the records and even establish relationships between them. Combining records and identifying relationships between entities, like contractors and suppliers, is key to helping our clients understand their industry’s supply chain.
How entity resolution works
Records are frequently stored in different forms and formats, often across multiple data storage systems. Each data source may hold information referencing the same entity but represented in different forms. For example, consider the company Lockheed Martin. This company may be represented as “Lockheed Martin” in one data source, “Lockheed Martin Co.” in another data source, and “Lockheed” in a third data source.
In order to match records related to the same entity, we first canonicalize the data sets. Canonicalization, sometimes called data normalization, is the process of manipulating data and records into a standard or consistent form. This preprocessing step improves the probability of entity resolution success and uses NLP to account for different spellings of first and last names, different representations of a company’s name, etc.
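As a minimal sketch of canonicalization, the snippet below applies a few hypothetical normalization rules (these are illustrative, not U.Group’s actual pipeline): lowercasing, stripping punctuation, and dropping common corporate suffixes so that variants of a company name collapse to the same form.

```python
import re

# Hypothetical set of corporate suffixes to drop during canonicalization.
SUFFIXES = {"co", "corp", "inc", "llc", "ltd", "company", "corporation"}

def canonicalize(name: str) -> str:
    """Reduce a company name to a standard form: lowercase,
    no punctuation, common corporate suffixes removed."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)  # replace punctuation with spaces
    tokens = [t for t in name.split() if t not in SUFFIXES]
    return " ".join(tokens)

print(canonicalize("Lockheed Martin Co."))  # lockheed martin
print(canonicalize("LOCKHEED MARTIN"))      # lockheed martin
```

After this step, “Lockheed Martin” and “Lockheed Martin Co.” map to the same canonical string, though “Lockheed” alone still differs and would need a fuzzier comparison like the edit-distance matching described below.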
Once the data is in canonical form, we can apply entity resolution to match and resolve records, producing a richer and more concise record system. While there are various approaches to entity resolution, data scientists have coalesced around a few standard techniques.
Entity resolution application techniques
First, we must establish how records in various data systems relate to one another. For instance, the “Customer” field in data source A may map to the “Name” field in data source B. After mapping the fields of different data sources, we can compare the values of these fields across two records to determine whether they refer to the same thing, in a process called pairwise matching.
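A field mapping like the one just described can be expressed as a simple lookup table. The field names and record values below are hypothetical, chosen only to illustrate aligning values for pairwise comparison:

```python
# Hypothetical mapping from data source A's field names to data source B's.
FIELD_MAP = {"Customer": "Name", "Zip": "PostalCode"}

record_a = {"Customer": "Lockheed Martin", "Zip": "20817"}
record_b = {"Name": "Lockheed Martin Co.", "PostalCode": "20817"}

# Align corresponding values so they can be compared pairwise.
pairs = [(record_a[f_a], record_b[f_b]) for f_a, f_b in FIELD_MAP.items()]
print(pairs)  # [('Lockheed Martin', 'Lockheed Martin Co.'), ('20817', '20817')]
```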
One type of pairwise comparison is edit distance, which measures the dissimilarity of individual elements of two records by counting the minimum number of operations (or changes) required to transform one word or phrase into the other. The smaller the value of the edit distance, the more similar one element is to the other.
For instance, let’s consider two records, each with a first name field. One record contains a first name value of “John,” the other “Jon.” The edit distance between these two names is equal to one, because it would only take one operation to change “Jon” to “John” (add an “h”) or “John” to “Jon” (remove the “h”). Now consider the names “Sam” and “Samantha.” The edit distance between these two names is equal to five, which lowers our confidence that these two names refer to the same person.
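The edit distance described here is the classic Levenshtein distance, which can be computed with a short dynamic-programming routine. This is a standard textbook implementation, not any particular production library:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[len(b)]

print(edit_distance("Jon", "John"))      # 1
print(edit_distance("Sam", "Samantha"))  # 5
```

The outputs match the examples above: one operation separates “Jon” and “John,” while five insertions separate “Sam” and “Samantha.”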
One approach to entity resolution is to consider all of the fields that are shared by records across multiple data sources, compare the values of those fields in a pairwise fashion, compute and sum the edit distances, and combine records with the smallest aggregate edit distance to create a new merged dataset.
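The aggregate-distance idea can be sketched as follows. The shared field names and sample records here are hypothetical, and the Levenshtein routine is a standard implementation included so the example runs on its own:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[len(b)]

# Hypothetical fields shared by both data sources after mapping.
SHARED_FIELDS = ["name", "city"]

def aggregate_distance(rec_a: dict, rec_b: dict) -> int:
    """Sum the per-field edit distances over all shared fields."""
    return sum(edit_distance(rec_a[f], rec_b[f]) for f in SHARED_FIELDS)

def best_match(record: dict, candidates: list) -> dict:
    """Pick the candidate with the smallest aggregate edit distance."""
    return min(candidates, key=lambda c: aggregate_distance(record, c))

record = {"name": "lockheed martin", "city": "bethesda"}
candidates = [
    {"name": "lockheed", "city": "bethesda"},
    {"name": "boeing",   "city": "chicago"},
]
print(best_match(record, candidates)["name"])  # lockheed
```

In practice a threshold on the aggregate distance would also be needed, so that records with no good counterpart are left unmatched rather than merged with their least-bad candidate.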
Because pairwise comparisons grow quadratically as the number of records for comparison increases, data scientists often use additional machine learning techniques to narrow the range of possible matches for comparison. A popular algorithm to assist with scaling is called clustering, where similar records are grouped based on the characteristics of their elements. One common way of clustering records is through blocking, in which only records with equal or near-equal values for a set of elements are compared.
Suppose we need to resolve individuals in records across data sources. We can narrow the number of comparisons by limiting them to records having the same birth year. The set of comparisons permitted within this limit forms a “block.” If our records span several decades, blocking can significantly reduce the total comparisons required.
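The birth-year blocking described above can be sketched in a few lines. The records here are made up for illustration:

```python
from collections import defaultdict

def block_by(records: list, key: str) -> dict:
    """Group records into blocks keyed by the value of one field."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

source_a = [{"name": "John Smith", "birth_year": 1970},
            {"name": "Jane Doe",   "birth_year": 1985}]
source_b = [{"name": "Jon Smith",  "birth_year": 1970},
            {"name": "J. Doe",     "birth_year": 1985}]

blocks_a = block_by(source_a, "birth_year")
blocks_b = block_by(source_b, "birth_year")

# Compare only within matching blocks: 2 candidate pairs
# instead of the 4 in the full cross product.
pairs = [(a, b)
         for year in blocks_a
         for a in blocks_a[year]
         for b in blocks_b.get(year, [])]
print(len(pairs))  # 2
```

With two records per source and two birth years, blocking halves the comparisons; across millions of records spanning decades, the savings compound dramatically.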
The U.Group advantage
At U.Group, we find new sources of information and deliver insights and advantages to our clients by effectively combining these sources to create a fuller picture. NLP and AI methodologies make this feasible at scale. New approaches to entity resolution continue to be developed by U.Group and the data science community at large.
U.Group data scientists are taking entity resolution beyond just performing pairwise matching and are incorporating Probabilistic Soft Logic to infer additional relationships between records in response to the increasing volume of available data. Examples of new efforts by the data science community include cutting-edge techniques that leverage unsupervised machine learning for linking records, and tree-based representations of records that scale to massive amounts of data.