Using Fuzzy Matching to Merge Queries and Find Data Similarity in Power BI

You are currently viewing Using Fuzzy Matching to Merge Queries and Find Data Similarity in Power BI

Fuzzy Matching is the technique of finding strings that match a pattern approximately (rather than exactly) and allows to create links between data sets that may be less than 100% perfect. Power BI also provides you with the option to tune the matching algorithm.  

In this blog post, we will see how this feature in Power BI allows you to merge queries. For this we have two tables, one containing the clean data and other contains the unclean data. This blog post describes the default fuzzy matching. Fuzzy merge comes in with other (mentioned at the end) which can be tuned.

Fuzzy matching has various applications such as spelling correction, data similarity checking and data cleaning.

For this blog post we have two tables, one containing the clean data and the other containing the unclean data. Let’s have a look at this first. 

1.Click on the Cities_CleanData in the queries pane.

This table contains a single column with two values: Sydney and Melbourne. These are the target values.

Now let’s see the un-clean data.   

2.Click on the Cities_UncleanData in the queries pane.

This table contains a City column containing many incorrect variations for the two cities. Each variation is assigned an ID to track its matching result in the ID column.

To get started:

3.Click on Cities_CleanData. Click on Home Tab in the Ribbon Menu.

4.From the Combine section, click on Merge Queries.

A drop-down menu appears.

5.Click on Merge Queries as New.

A pop-up menu appears.

6. Select Cities-CleanData in the first drop-down menu and click on City Column.

7. Select Cities-UncleanData in the second drop-down menu and click on City Column.

8. Select the check box to Use fuzzy matching to perform merge.

9. Click OK.

10. Click on the double headed arrow next to Cities_UncleanData. Click OK.

The ‘City’ column contains the clean data is matched with ‘Cities_UncleanData.City’. The values in Cities_UncleanData.ID determine the results of the matching algorithm. The similarity of match decreases as we move down the column.

Notice that only 10 rows have been returned by the algorithm using the default parameters.

Exploring Fuzzy Merge options in Power BI

Power BI allows you to work with different fuzzy merge options. These options allow you to set the criteria for the matching that the algorithm performs.  

Let us look at the options Power BI provides you to tune your matching algorithm.

Similarity Threshold

This option allows you to choose the Range of unclean data that can be considered as true match.

Ignore Case

This option is used for String Matching where case sensitivity is not required. Using this option sydney and SYDNEY will be considered as similar.

Match by combining Text Parts

Using this option, both the parts of the word are considered while algorithm looks for a match. Using this option, Sydney and Si dney will be considered similar.

Maximum number of Matches

This option allows you to limit the Number of Values that your algorithm returns. The value entered in this option returns results with the first result denoting the highest match and so on. If no value is entered the algorithm returns all rows in the same order.

Transformation Table

Transformation tables act as Mapping Tables for the algorithm when the dataset under consideration is highly dissimilar.

Conclusion:

Fuzzy merge is a technique to join data sets that are slightly similar. Microsoft Power BI allows you to join data sets without any single bit of coding. The powerful GUI of Power BI’s power pivot allows users to tune their matching criteria against different parameters such as similarity threshold, ignore case, match by combining parts, maximum number of matches and transformation table. In this blog post, the process of fuzzy matching is described in detail using ID columns. For further details on the tuning of parameters, stay tuned to our blog posts.

Leave a Reply