Maximum number of returns in fuzzy matching

You are currently viewing Maximum number of returns in fuzzy matching

One of the major factors for the success of Power BI as an analytics tool is providing its users with a high degree of freedom to tune their BI solution. This means that having attended a basic to advanced level Power BI training course you can tune your data analytics solution easily.  Merging data from different sources is the heart of data analytics. This becomes extremely difficult if the data is unclean and/or contains spelling mistakes. Fuzzy Matching is the technique of finding strings that matches a pattern approximately (rather than exactly) and allows to create links between data sets that may be less than 100% perfect. Power BI provides you with the option to tune matching algorithm.  

How is fuzzy Merge different from simple merge?

Simple Merge allows the merging of data sets that have a common key column. This means that the data value in primary key columns should exactly match with a data value in foreign key column.  Fuzzy Merge allows to merge data sets that are Slightly Similar. It uses fuzzy matching algorithm to find the similarity between the selected columns.

Microsoft Power BI provides you with different options to tune your algorithm. Fuzzy Matching can be performed in Power BI by either the Merge Button in the GUI or the power query functions. 

Let us look at the options Power BI provides you to tune your matching algorithm.

In this blog post, we will walk through the option of Maximum number of returns in fuzzy matching.

What does of Maximum number of returns means to fuzzy matching?

The value specified in of Maximum number of returns decides the number of valid entries in your result starting that have the most valid match. So if you specify a number 3, Power BI returns three entries that are most relevant to each row your target data set.

For this we have two tables, one containing the clean data and other contains the unclean data.  

This table contains a single column with two values: Sydney and Melbourne. These are the target values.

Now let’s see the un-clean data.   

This table contains a City column containing many incorrect variations for the two cities (Sydney and Melbourne) and correct entries for Perth and Adelaide. Each variation is assigned an ID to track its matching result in the ID column.

This blog post assumes that a simple fuzzy merge has been performed and now the now the other

Learn how to perform a fuzzy match : Using Fuzzy Matching To Merge Queries And Find Data Similarity In Power BI

Now fuzzy merge options need to be tweaked.

  1. Open the query settings by clicking on the gear icon in the Applied Steps Secion.
  2. Type “3” in the blank box under Maximum number of Matches (Optional). Click OK.

Make sure that the Ignore Case and Match by combining parts id checked.

3. Click on the double headed arrow next to Cities_UncleanData. Click OK.

4. In the pop-up, Click on

Notice that 3 rows for each match have been returned.

Where did Perth and Adelaide go ?

Did you notice that Perth and Adelaide were not returned as a part of results? This might seem tricky at first but did we have these in the unclean data? The answer is no. This shows how Power BI cleans the data based on the target values.

Exploring Fuzzy Merge options in Power BI

Power BI allows you to work with different fuzzy merge options. These options allow you to set the criteria for the matching that the algorithm performs.  

Let us look at the options Power BI provides you to tune your matching algorithm.

Similarity Threshold

This option allows you to choose the Range of unclean data that can be considered as true match.

Ignore Case

This option is used for String Matching where case sensitivity is not required. Using this option sydney and SYDNEY will be considered as similar.

Match by combining Text Parts

Using this option, both the parts of the word are considered while algorithm looks for a match. Using this option, Sydney and Si dney will be considered similar.

Maximum number of Matches

This option allows you to limit the Number of Values that your algorithm returns. The value entered in this option returns results with the first result denoting the highest match and so on. If no value is entered the algorithm returns all rows in the same order.

Transformation Table

Transformation tables act as Mapping Tables for the algorithm when the dataset under consideration is highly dissimilar.

Conclusion

Fuzzy matching can be used for data cleaning, spelling correction, data mapping and other applications. It is easy to use and provides accurate results only when the algorithm is provided with the correct parameters. Maximum number of matches dictates the number of rows returned for each value of the target data set. This option helps in limiting the number of matched results and helps to avoid the flooding of your data set with duplicates.

Are you a data analyst and want to learn more about Power BI?

Why not sign up for Power BI training in Australia We provide our services in the following regions:

Power BI Training in Sydney

Power BI Training in Melbourne

Power BI Training in Brisbane

Leave a Reply