Fuzzy Matching 101 - Part I

By Lewis Fogden, Mon 30 January 2017, in category ETL

Python

  

The Mismatch

Often when dealing with internal data - whether it be from CRM or ERP systems, relational databases full of financial transactions or product information, or anything else of a similar vein - linking entities is easily achieved with a few joins based on some form of unique identification number or hash. And while this might not always be the case, you'd hope that these ID's could be mapped between systems without too much effort, generating a simple lookup table being all that's required.
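As a quick illustration of that happy case (the dataframe and column names here are made up), linking the two systems is a single join:

import pandas as pd

# Two internal systems sharing a unique product_id: linking them is one merge
crm = pd.DataFrame({'product_id': [1, 2], 'product_name': ['widget', 'gizmo']})
erp = pd.DataFrame({'product_id': [1, 2], 'product_price': [9.99, 19.99]})

linked = crm.merge(erp, on='product_id', how='left')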

However, when your data is being scraped from the web, streamed from external APIs, purchased from data vendors, or comes from some other far-off place, no such link generally exists. Perhaps you're mining tweets to analyse the sentiment around your product or brand, or scraping reviews of how your business is perceived. Then the following mismatch is quite likely to occur:

# Your internal systems
|product_id|         product_name        |   product_desc  |product_price|
|----------|-----------------------------|-----------------|-------------|
|  000001  |literally_the_best_thing_ever|what_doesnt_it_do|    999999   |

# The external data
{
    "review_id": "18492734",
    "review_text": "I lve my new literally da bst thing evr, can't wait 2 show all my friends"
}

In the first part of this fuzzy matching series, we'll be looking at cleaning your data prior to matching, and the advantages of fuzzy matching vs. an exact match approach.

Connecting Entities, Round 1 (Exact Match)

Now, moving past the review_text that I painstakingly authored above, how do we go about connecting entities between our data sources? Generally, this will be based upon identifying and matching a substring common to both.

For instance, if we have a list of products, and a bunch of reviews (that we have gathered based on some search criteria such that we know that each is related to one of our products), we could iterate through the products, trying to find a match in each review.

product_reviews = []
for p in products:
    for r in reviews:
        # Exact substring match: the product name must appear verbatim in the review
        if p in r:
            print("It's a match")
            product_reviews.append((p, r))

However, there are a few issues here. Firstly, this only works if your product name is an exact match to a substring in the review. This means that any reviews with spelling mistakes, abbreviations, or other deviations won't get picked up. As you'd expect, an exact match becomes increasingly unlikely with long or technical product names, as people simply won't bother to write them out in full. It also relies on your internal product name being perfectly correct and identical to what external people might call your product.

Secondly, for loops aren't known for their speed. And while vectorising - and possibly multi-processing - your process could greatly reduce its running time, if you're matching a large dataset against several others, you could be looking at some serious CPU hours if you run each item against a full matching set.

The Joy of Cleaning Text

So what can we do to improve? Firstly, let's get the data cleansing - everyone's favourite part of a project - out of the way. Data transformation steps are generally simple to implement and quick to run in pandas, the data analysis library, so our first step will be to load the data from both of our source systems into pandas dataframes.

For RDBMSs, the pandas.read_sql() or pandas.read_sql_table() functions make this simple. Both require a SQLAlchemy engine to connect to the database, so loading a table looks something like the following:

import pandas as pd
from sqlalchemy import create_engine

# The connection string, table name, and index column below are placeholders
engine = create_engine('dialect+driver://username:password@host:port/database')
data_1 = pd.read_sql(sql='products', con=engine, index_col='_id')

For JSON-style data like the external example above, where each entity is represented as an object (similar to a Python dict), we can load the data into a dataframe as follows:

data_2 = pd.read_json(file)

Note that the orient parameter may be needed, depending on how the data is structured in the JSON file.
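For instance, if the file held a list of review objects like the one above, something along these lines would do (the filename is just for illustration):

import pandas as pd

# orient='records' expects a list of {column: value} objects, one per row
data_2 = pd.read_json('reviews.json', orient='records')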

Once we have our two data sources in dataframes, we can begin cleansing them. Common steps for cleaning text data include:

Stripping whitespace

s = s.strip()

Lower casing

s = s.lower()

Stripping (non-relevant) punctuation

s = s.translate(str.maketrans("", "", ",.-'\"():;+/?$°@"))

Stripping accents

import unicodedata
s = ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

Removing stopwords

s = ' '.join(part for part in s.split() if part not in stopwords)

Additionally, you can remove null rows, drop duplicates, replace known acronyms and terms using a mapping table, and so on.
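A rough sketch of those extra steps might look like this (the column names and acronym map are purely illustrative):

# Drop rows with no review text, then drop exact duplicates
df = df.dropna(subset=['review_text']).drop_duplicates(subset=['review_text'])

# Expand known abbreviations via a simple mapping table
acronym_map = {'lve': 'love', 'bst': 'best', 'evr': 'ever'}
df['review_text'] = df['review_text'].map(
    lambda s: ' '.join(acronym_map.get(word, word) for word in s.split())
)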

I recommend you write each required step into a function, and apply them as a vectorised action to a pandas dataframe of your entities, where each entity represents a single row. Something along the lines of:

def my_great_cleaning_function(s):
    # Apply your chosen cleaning steps to a single string
    return s

df['clean_product_name'] = df['product_name'].map(my_great_cleaning_function)
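Pulling the snippets above together, such a function might look like the following (the stopword list here is just a placeholder - use whatever is relevant to your data):

import unicodedata

stopwords = {'the', 'a', 'an', 'of', 'and'}  # placeholder list
punctuation = ",.-'\"():;+/?$°@"

def clean_text(s):
    # Apply the cleaning steps described above to a single string
    s = s.strip().lower()
    s = s.translate(str.maketrans('', '', punctuation))
    s = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')
    return ' '.join(part for part in s.split() if part not in stopwords)

df['clean_product_name'] = df['product_name'].map(clean_text)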

It's worth remembering that (for datasets that still fit in memory) the order of computational speed for applying a function to a column goes something along the lines of: built-in vectorised pandas/NumPy operations, then Series.map() / apply(), then itertuples(), then iterrows(), and finally a plain Python loop over the rows.

There's a great StackOverflow post that goes into a bit more detail on this. This should give your code a bit of a boost to begin with, though most of the calculation time will be spent on the matching itself.
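As a rough sketch of what that looks like in practice (using lower casing as a stand-in for any cleaning step):

# Roughly fastest to slowest for a string column
df['clean'] = df['product_name'].str.lower()                          # vectorised string method
df['clean'] = df['product_name'].map(str.lower)                       # map / apply
df['clean'] = [row.product_name.lower() for row in df.itertuples()]   # explicit row loop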

Connecting Entities, Round 2 (Fuzzy Wuzzy)

Now let's try this again, but with a less harsh matching criterion. This time, we'll look to the FuzzyWuzzy package for help. FuzzyWuzzy provides four types of fuzzy-logic-based matching, using Levenshtein distance to determine the similarity between two strings. This metric measures similarity as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to make two strings equal; for example, 'kitten' and 'sitting' are three edits apart.

To quickly summarise the matching methods offered, there are:

fuzz.ratio - a straightforward Levenshtein similarity over the whole strings
fuzz.partial_ratio - matches against the best-matching substring, useful when one string is contained within a longer one
fuzz.token_sort_ratio - sorts the words in each string before comparing, so word order doesn't matter
fuzz.token_set_ratio - compares the common and differing sets of words, tolerating extra words in either string

For a more in-depth view, have a look at this post by the guys at SeatGeek.
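To get a feel for how the scorers differ, you can run them side by side on a product name and a review-style string (the strings below are made up):

from fuzzywuzzy import fuzz

product = 'literally the best thing ever'
review = 'i love my new literally the best thing ever cant wait to show my friends'

print('ratio:           ', fuzz.ratio(product, review))
print('partial_ratio:   ', fuzz.partial_ratio(product, review))
print('token_sort_ratio:', fuzz.token_sort_ratio(product, review))
print('token_set_ratio: ', fuzz.token_set_ratio(product, review))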

Using the process module's extractOne function, we can find the best match for a product / record in a list of choices (our second data source), for a given matching method. We can also set a score cutoff, such that if no match scores at or above it, None is returned instead.

from fuzzywuzzy import process, fuzz

# 'column' holds the name of the cleaned text column in each dataframe
match = process.extractOne(
    data_1.loc[0, column],
    choices=data_2.loc[:, column],
    scorer=fuzz.ratio,
    score_cutoff=80
)

To apply this to the entirety of our first dataset, we can utilise the pandas apply() method.

def fuzzy_match(x, choices, scorer, cutoff):
    return process.extractOne(
        x, choices=choices, scorer=scorer, score_cutoff=cutoff
    )

matching_results = data_1.loc[:, column].apply(
    fuzzy_match,
    args=(
        data_2.loc[:, column], 
        fuzz.ratio,
        80
    )
)

And so, by applying the fuzzy_match function to one of our datasets, referencing the other as the choices parameter, and varying the remaining parameters (matching method and cutoff), we can determine sets of potential matches between our sources.
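When choices is a dict-like object such as a pandas Series, extractOne returns a (match, score, key) tuple - or None where nothing clears the cutoff - so one sketch of pulling the results back alongside the original data might be (column names are illustrative):

data_1['best_match'] = matching_results.map(lambda m: m[0] if m else None)
data_1['match_score'] = matching_results.map(lambda m: m[1] if m else None)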

To Be Continued

In the next part of this series, we'll look at how we can create a tailored subset of likely matches for each item in one of our source systems. This will allow us to run fuzzy matching against a much smaller number of records, vastly reducing the computation time while also improving matching accuracy.

Additionally, we'll look at using the multiprocessing module to further speed things up, distributing the work over all the cores on your machine.