Fuzzy Matching 101 - Part II

By Lewis Fogden, Thu 13 July 2017, in category ETL

Python

Round 2, Fight!

Following on from my previous article on fuzzy matching rather disparate data sources, this post will detail further methods of comparing and matching entities, as well as optimising the performance of the process.

Last time, we concluded with a fuzzy_match function that would return the best match for a given entity in a list of choices, given a scoring criterion and a cutoff point.

import pandas as pd
from fuzzywuzzy import process, fuzz

def fuzzy_match(entity, choices, scorer, cutoff):
    return process.extractOne(
        entity, choices=choices, scorer=scorer, score_cutoff=cutoff
    )

We then apply()'d our function to a column of entities in our source dataset, using a column in our reference dataset as our list of choices. This resulted in a series, matching_results, where each row holds the best choice from reference_data for the corresponding row in source_data.

matching_results = source_data.loc[:, source_column].apply(
    fuzzy_match,
    args=(
        reference_data.loc[:, reference_column], # choices
        fuzz.ratio,                              # scorer
        80                                       # cutoff
    )
)

While this is a good start, and will certainly return some sort of mapping between our two sources, it is not exactly production-ready. There is currently no gauging of accuracy with regards to matching the sources, and when comparing large datasets it is going to burn some impressive CPU hours. For instance, a project of ours had half a dozen different sources with up to several hundred thousand records per source, each of which needed to be compared to all the others. Our initial runtime to compare any two sources with a single matching algorithm stretched overnight, so it was apparent that our approach, not our server, needed some upgrading.

Extracting Exact Matches

While the nature of our data sources will heavily affect our approach, particularly with regards to optimisation, it's always a good idea to try and make some quick wins. In the case that some exact matches between our datasets do occur, we'll want to extract them first, prior to attempting to fuzzy match. For each exact match that we find, we'll be saving ourselves X fuzzy comparisons, where X is the length of our reference set / choices.

First, let's determine which records in the source_data have exact matches in the reference_data and extract them.

exact_matches = source_data.loc[:, source_column].isin(
    reference_data.loc[:, reference_column].tolist()
)
exact_matches = source_data[exact_matches]

Next let's define a results DataFrame, which we will later add matches to as we find them, and create a column to indicate whether a given match was exact or fuzzy.

results = exact_matches.copy() # Copy, so we don't modify a slice of source_data
results[reference_column] = results[source_column]
results['match_type'] = 'exact'

Finally, let's drop our exact matches from our source, so we don't try to use them in our fuzzy comparison later.

source_data = source_data.drop(exact_matches.index)

Rare Substrings

For some use cases, such as comparing customers between two CRM systems or looking for duplicates inside a single system, the nature of the data might be such that the records contain relatively non-unique substrings (common first names, cities, job titles, etc.), and errors such as spelling mistakes or abbreviations are what limit the capabilities of exact matching. In this instance, cleaning steps like those mentioned in the previous post (removing stopwords and punctuation, substituting for abbreviations using regex, etc.) and using a combination of fields common to both datasets as the matching entity can improve matching accuracy.

source_data['combined_matching_entity'] = source_data[column_1].str.cat(
    source_data[column_2], sep=' - '
)
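
For reference, a cleaning step along the lines of the previous post might look something like the sketch below; clean_entity and the abbreviations map are hypothetical examples to be tailored to your own data:

import re

# Hypothetical abbreviation map - tailor this to your own data
abbreviations = {'mchn': 'machine', 'intl': 'international'}

def clean_entity(entity):
    entity = entity.lower()
    entity = re.sub(r'[^\w\s]', ' ', entity)  # Strip punctuation
    terms = [abbreviations.get(term, term) for term in entity.split()]
    return ' '.join(terms)

source_data[source_column] = source_data[source_column].apply(clean_entity)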

In the use case I mentioned above, the datasets consisted of records of generally long strings containing rare, fairly unique terms. Of the two example strings below, our records were of a kind with the latter.

string_with_common_substrings = 'John Smith, London, Project Manager'
string_with_rare_substrings = 'Eloquent Ellies Amazing Popcorn Machine'

These rare words were usually correctly spelt, with the differences between an entity appearing in both source and reference down to differing word order, additional terms, and abbreviations, as demonstrated below.

entity_in_source_data = 'Eloquent Ellies Amazing Popcorn Machine'
entity_in_reference_data = 'Ellies Eloquent Popcorn Mchn'

This meant that the cleaning methods described in the previous post, and the concatenation of common fields, didn't yield great results alone (though they were still necessary as a base). For any given record in our source dataset, the comparisons to the reference set generally scored poorly, as both the length and the rarity of the words meant that large numbers of edits were required to converge the two strings, and so the Levenshtein distance was large.
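
If you want to see this for yourself, a quick spot check on the example strings above (using the pure edit-distance based fuzz.ratio) is enough; in our data, pairs like this routinely scored too poorly to clear a sensible cutoff.

from fuzzywuzzy import fuzz

# Spot check: a pure edit-distance scorer on the example pair above
print(fuzz.ratio(entity_in_source_data, entity_in_reference_data))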

A Bag of Words Approach - The Diamond in the Rough

In order to deal with such data, our approach was to identify a short-list of highly likely matching entities in the reference dataset for each record in the source dataset using bag of words overlaps, and then fuzzy match against those first. While creating the short-list for each record did come at a certain computational cost, this was found to be insignificant compared to the savings gained by only running the expensive fuzzy matches against the reduced dataset.

To try and picture this, imagine the table below, where each source entity has its own reduced reference dataset to match against.

source_dataset          reference_dataset
-----------------------------------------
entity_01               possible_match_01_01
                        possible_match_01_02
                        possible_match_01_03

entity_02               possible_match_02_01
                        possible_match_02_02

Counting Terms

To determine which terms in our entities are rare, we'll need to count them. First, let's define a term_count_dict recording how many times a given term appears in our reference dataset. Note that below I only count a term the first time it appears in an entity (using set() to remove duplicates), but this might not be appropriate in your particular use case.

term_count_dict = {}

for index, entity in reference_data[column].iteritems():
    entity = list(set(entity.split()))

    for term in entity:
        term_count_dict[term] = term_count_dict.setdefault(term, 0) + 1

Note that the Counter or defaultdict objects from collections could also be used for this purpose, but I thought it best to minimise the use of additional modules in this tutorial. CountVectorizer from scikit-learn is another option.
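
For comparison, the same counts could be built with Counter in a couple of lines; this snippet should be equivalent to the loop above:

from collections import Counter

# Count each term once per reference entity
term_count_dict = Counter(
    term
    for entity in reference_data[column]
    for term in set(entity.split())
)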

Mapping Rare Terms to Entities

Next let's create a dictionary that maps each term present in our reference dataset to the entities that contain that term. Note that we only want to do this for rare terms: terms that have a count less than rarity. This value will need to be tailored to your data source; here, let's consider a term to be rare if it appears fewer than 500 times (in ~100,000 entities).

rarity = 500
term_to_entities_dict = {}

for index, entity in reference_data[column].iteritems():
    for term in list(set(entity.split())):
        # Only include rare words (Less than rarity occurrences)
        if term_count_dict[term] < rarity:
            term_to_entities_dict.setdefault(term, []).append(entity)
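
As a quick sanity check, it can be worth looking up a term you expect to be rare and confirming it maps to a sensibly short list of entities; the term below is purely illustrative, borrowed from the earlier example:

# 'Eloquent' is used purely as an illustrative term here
print(term_count_dict.get('Eloquent', 0))
print(term_to_entities_dict.get('Eloquent', [])[:5])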

Determining Potential Matches

Now let's write a function that can determine the potential matches in our reference dataset for a given entity in our source. The first argument of this function must be the entity itself, so that it can later be run against the entire source column (via df.apply() or a simple loop, as we do below).

def get_potential_matches(entity, term_count_dict, term_to_entities_dict):
    term_counts = []
    for term in list(set(entity.split())):
        if term in term_to_entities_dict:
            count = term_count_dict[term]
            term_counts.append((term, count))
    term_counts = sorted(term_counts, key=lambda x: x[1])

    potential_matches = []
    for i in range(len(term_counts)):
        term = term_counts[i][0]
        potential_matches.extend(term_to_entities_dict[term])

        # Limit the # of potential matches per entity
        if len(potential_matches) > 30:
            break

        # Limit to only the 3 least common terms per entity
        if i >= 2:
            break

    potential_matches = list(set(potential_matches)) # Remove duplicates
    potential_matches.sort() # For easier output / validation

    return term_counts, potential_matches

First, the term counts from the reference set are determined for each of the unique words in the source entity, provided that word was sufficiently rare to appear in term_to_entities_dict. The list is then ordered by count in ascending fashion, so that the rarest terms come first. Next, for each source term, the reference entities that contain that term are added to the potential_matches list.

A cutoff of 30 potential matches is enforced, such that if a given term extends the list to over 30 entities, no further terms are considered. A second cutoff is also enforced, such that only the 3 least common terms in a source entity are considered. Finally, duplicates are removed, and the list is sorted for easier validation. Duplicates may occur when an entity in the source and reference share more than one rare term (which is generally also a good sign in terms of matching).

Calculating Bag of Words Overlaps

Now that we have our sets of potential matches for our source entities, we need to define a function to determine the bag of words overlaps between an entity and its potential matches.

def get_bag_of_words_overlap(entity, potential_matches):
    source_bag = set(entity.split())
    overlaps = []

    for potential in potential_matches:
        potential_bag = set(potential.split())
        intersection = source_bag.intersection(potential_bag)
        overlap = round(len(intersection) / len(source_bag), 3)
        overlaps.append((potential, overlap))
    overlaps = sorted(overlaps, key=lambda x: x[1], reverse=True)
    return overlaps

The output is a list of tuples, each of the format (reference_entity, overlap_score). These are reverse sorted, such that the best match is the first element in the list.
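
Before wiring these functions into the full run, a quick spot check on a single entity (for instance, the source string from earlier) helps confirm the shortlists look sensible:

# Spot check the shortlisting and overlap scoring on one entity
term_counts, potential_matches = get_potential_matches(
    entity_in_source_data, term_count_dict, term_to_entities_dict
)
overlaps = get_bag_of_words_overlap(entity_in_source_data, potential_matches)
print(overlaps[:5])  # Best overlapping reference entities first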

Building Our Shortlist Dictionary

The final step to create our bag of words based shortlists is to combine the above and apply it over our source dataset.

bow_matches = {}
counter = 0
no_of_rows = source_data[column].shape[0]

for index, entity in source_data[column].iteritems():
    term_counts, potential_matches = get_potential_matches(
        entity, term_count_dict, term_to_entities_dict)
    overlap_list = get_bag_of_words_overlap(entity, potential_matches)
    bow_matches[entity] = overlap_list

    # Printout for sense checking
    if counter < 5:
        print("\nFinding potential matches for '{0}'".format(entity))
        print('Entity term counts: \n{0}'.format(term_counts))
        print('Top 5 potential matches: ')
        for i in overlap_list[:5]:
            print(i)

    # Printout for progress
    if counter % 500 == 0:
        print('Processing entity {0}/{1}'.format(counter, no_of_rows))

    counter += 1

# Drop scores, just keep matches
for entity, potential_matches in bow_matches.items():
    bow_matches[entity] = [x[0] for x in potential_matches]

The term_counts and potential_matches are first calculated, then the overlap_list is generated and stored in the bow_matches dictionary. For sense checking purposes, these are printed to the terminal (or a log file) for the first 5 entities. To give a sense of when the matching will be complete, some progress statistics are also output. Finally, bow_matches is reduced to just the lists of reference entities.

Note that neither the printed output nor the bag of words scores (once the overlap list has been sorted) is actually needed, and both can be removed if desired.

Fuzzy Matching Against Our Shortlists

We can now extend our fuzzy_match function to use bow_matches. As this function will be apply()'d to our source DataFrame, we must feed in the entire bag of words dictionary as the choices argument, and then select the relevant reference list for each entity by indexing with the entity value as the key.

def fuzzy_match(entity, choices, scorer, cutoff):
    choices = choices[entity] if isinstance(choices, dict) else choices.tolist()

    return process.extractOne(
        entity, choices=choices, scorer=scorer, score_cutoff=cutoff
    )

To achieve this, we can swap our reference_data for bow_matches and test whether choices is a dict using isinstance. When that is the case, we index in with the entity; otherwise, we convert the Series to a list. This is because extractOne returns either a 2-tuple or a 3-tuple depending on the type of the choices argument (see the fuzzywuzzy documentation), and we want to ensure consistency between our bag of words and full reference set matches.
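
To illustrate the difference in return shapes, here is a toy example (the strings are made up; verify the behaviour against your installed fuzzywuzzy version):

from fuzzywuzzy import process, fuzz

# With a list of choices, extractOne returns a 2-tuple: (match, score)
process.extractOne(
    'popcorn', ['popcorn machine', 'ice cream van'], scorer=fuzz.ratio
)

# With a dict-like object (a dict, or in practice a pandas Series),
# it returns a 3-tuple: (match, score, key)
process.extractOne(
    'popcorn', {'a': 'popcorn machine', 'b': 'ice cream van'}, scorer=fuzz.ratio
)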

The found matches should then be concatenated onto our results DataFrame and the entities dropped from our source data set.

source_data.loc[:, reference_column] = source_data.loc[:, source_column].apply(
    fuzzy_match, args=(bow_matches, fuzz.ratio, 80)
)
bow_matched = source_data[source_data[reference_column].notnull()] # Shortlist hits
results = pd.concat([results, bow_matched])
source_data = source_data.drop(bow_matched.index)

Fuzzy Matching Using the Reduced Source Set

Finally, we can run our remaining source records against the full reference data set and concatenate these to our results. We can then update our match_type column for all of the non-exact matches to fuzzy.

source_data.loc[:, reference_column] = source_data.loc[:, source_column].apply(
    fuzzy_match, args=(reference_data.loc[:, reference_column], fuzz.ratio, 60)
)
results = pd.concat([results, source_data])

results_with_refs = results[reference_column].notnull()
results.loc[results_with_refs, 'match_type'] = results.loc[
    results_with_refs, 'match_type'].fillna('fuzzy')

Note that currently the reference_column contains tuples of (match_string, match_score). To split these into separate columns, you can unpack them using the zip function.

fuzzy = results['match_type'] == 'fuzzy'
results.loc[fuzzy, reference_column], results.loc[fuzzy, 'match_score'] = zip(
    *results.loc[fuzzy, reference_column])
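
At this point, a quick value_counts() on match_type gives a rough feel for how many records were matched exactly, how many fuzzily, and how many remain unmatched:

print(results['match_type'].value_counts(dropna=False))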

To Be Continued

We found that this rare-word, bag of words shortlisting approach gave a significant boost in performance (from overnight to under an hour for most source-reference pairs), and it resulted in more accurate matches than just comparing against the full reference set. However, in order to compare half a dozen data sources using multiple matching scorers, more juice was still needed.

In the following parts of this series, we'll look at using the multiprocessing module to split the workload across the multiple cores of your server, at using multiple scorers, and at an approach to determine matching accuracy and which match value to take for each entity (multiple scorers mean multiple matches, after all).

Please note that in writing the snippets in this post, readability was favoured to some degree over performance, where its effect was deemed insignificant to the overall result. Where possible, make use of optimised Python data structures, and apply functions in a vectorised way instead of looping. The code above has been written as a simplified example for this post, so please comment on any mistakes and/or suggested improvements (while still remaining tutorial friendly) below, and together we can help the data community.