Changing Similarity Threshold in Fuzzy Matching

You are currently viewing Changing Similarity Threshold in Fuzzy Matching

Fuzzy Matching is the technique of finding strings that match approximately (rather than exactly) and allows to create links between data sets that may be less than 100% similar. It provides you with the option to tune matching algorithm. In this blog post, we will see how this feature in Power BI allows you to merge queries. For this we have two tables, one containing the clean data and other contains the unclean data. This blog post looks closely into using the option of similarity threshold for fuzzy matching.

Fuzzy merge comes in with other options which can be tuned. Let us look at the options that Power BI provides you to tune your matching algorithm.

  • Similarity Threshold
  • Ignore Case
  • Match by combining Text Parts
  • Maximum number of Matches
  • Transformation Table

Fuzzy matching has various applications such as spelling correction, data similarity checking and data cleaning. For this blog post we have two tables, one containing the clean data and the other containing the unclean data.

This table contains a single column with two values: Sydney and Melbourne. These are the target values.

Now let’s see the un-clean data.   

This table contains a City column containing many incorrect variations for the two cities. Each variation is assigned an ID to track its matching result in the ID column.

Changing the Similarity Threshold

Power BI query editor allows you to change the Matching Threshold from 0 to 1. With 0 being the value for lowest similarity and 1 being the closest match. If no threshold is set, Power BI uses 0.8 as the default value of threshold.

To track the changes using different similarity thresholds:

1. In the Queries Pane, click on Merge1.

2. Now click on Gear icon next to Source in the Applied Steps Pane.

A pop-up menu appears

3. Type “1” in the blank box under Similarity Threshold (Optional). Click OK.

4. Click on the double headed arrow next to Cities_UncleanData. Click OK.

5. In the Pop-up, Click on Insert.

Using a threshold value of 1, we only see 3 rows in the result. This means that a threshold value of 1 does not takes into account any uncertainty except the word casing.

6. Now click on Gear icon next to Source in the Applied Steps Pane.

A pop-up menu appears

7. Type 0 in the blank box under Similarity Threshold (Optional). Click OK.

8. Click on The double headed arrow next to Cities_UncleanData. Click OK.

9. In the Pop-up, Click on Insert.

Using a threshold value of 0, we get all the rows in the result. This means that a threshold value of 0 does not takes into account strict matching rules.

Notice the values in the Cities_UncleanData.ID field. They describe the results of fuzzy matching. The value at the third position in the unclean data matched the most and so it was returned as first match.   

The results however do not show any match for Perth or Adelaide.

 

Conclusion:

Fuzzy matching is an algorithm used to merge data sets that are slightly similar. Power BI has an extensive support for this algorithm and provides its users with a freedom to match the data sets according to their requirements. Fuzzy Matching parameters such as Similarity Threshold, Ignore Case, Match by combining Text Parts, Maximum number of Matches and Transformation Table can be tweaked and optimum results can be obtained.

Are you a data analyst and want to learn more about Power BI?

Why not sign up for Power BI training in Australia.  We provide our services in the following regions:

Leave a Reply