This is the clean data that contains city column with correctly spelled city names. Now let’s have a look at the unclean data.
This table contains a City column containing many incorrect variations for the two cities. Each variation is assigned an ID to track its matching result in the ID column.
Fuzzy matching is performed in the query editor.
To perform fuzzy matching, you need to click on the Home Tab in the Ribbon Menu. The next step is to click on Merge Queries from the Combine section. As a result, a drop-down menu appears Now you will have to Click on Merge Queries as New and a pop-up menu appears.
A step-by-step guide to perform fuzzy matching can be found here:
The pop up menu that appears is the selection zone for tuning the fuzzy matching algorithm. Select Cities-CleanData in the first drop-down menu and click on the City Column. Then select Cities-UncleanData in the second drop-down menu and click on City Column. The next and most important step is to select the check box to Use fuzzy matching to perform merge.
Now you should expand the unclean data to include the tracking ID column and visually observe what happened behind the scene.
Notice that only 10 rows have been returned by the algorithm using the default parameters.
Exploring Fuzzy Merge options in Power BI
Power BI allows you to work with different fuzzy merge options. These options tune the criteria for the matching that the algorithm performs.
Let us look at the options that Power BI provides you to tune your matching algorithm.
This option allows you to choose the Range of unclean data that can be considered as true match. You can change the Matching Threshold from 0 to 1. With 0 being the value for lowest similarity and 1 being the closest match. If no threshold is set, Power BI uses 0.8 as the default value of threshold.
For a step-by-step guide click here: https://powerbitraining.com.au/changing-similarity-threshold-in-fuzzy-matching/
Using a threshold value of 0, we get all the rows in the result. This means that a threshold value of 0 does not takes into account strict matching rules.
Notice the values in the Cities_UncleanData.ID field. They describe the results of fuzzy matching. The value at the third position in the unclean data matched the most and so it was returned as first match.
The results however do not show any match for Perth or Adelaide.
This option is used for String Matching where case sensitivity is not required. Using this option sydney and SYDNEY will be considered as similar. Here’s a step-by-step guide to help you through
Notice that only 11 rows have been returned and the case sensitivity has been ignored when the fuzzy matching is performed.
Using this option, both the parts of the word are considered while the algorithm looks for a match. Using this option, Sydney and Si dney will be considered similar.
Click on this for a complete guide: https://powerbitraining.com.au/match-by-combining-parts-in-fuzzy-matching/
Using this option, a bi-gram contains two parts for a single word.
As an example let’s look at the second row with ID:2, Sydney was entered as “Sy dney” which has a space which divides the word into two parts.
Where did Perth and Adelaide go?
Did you notice that Perth and Adelaide were not returned as a part of results? This might seem tricky at first but did we have these in the unclean data? The answer is no. This shows how Power BI cleans the data based on the target values.
Transformation tables act as Mapping Tables for the algorithm when the dataset under consideration is highly dissimilar.
Transformation tables consist of a ‘To’ and a ‘From’ field which define how the matching is performed between highly dissimilar data sets. Transformation tables are used to define mapping rules. Transformation tables are designed to bridge the gap between datasets which cannot be matched using simple fuzzy matching. They are broadly used as abbreviation tables.
These tables need to be designed. They act as a template for fuzzy matching algorithm. The algorithm performs self-learning on the table and produces similar results.
Transformation tables are suited for datasets that have been built without any naming rules or their predicted inputs have not been defined.
For a step by step guide: https://powerbitraining.com.au/using-transformation-table-in-fuzzy-matching-in-power-bi/
For simplicity, we created a single row transformation table in this blog:
The results show that this table is used as template:
Notice that Syd is added to the results.
Fuzzy matching is an algorithm used to join data sets that are slightly similar. It comes with different fuzzy matching options which can be tuned to fix the matching criteria. The good news is; we have developed a complete archive on the Fuzzy matching which contains detailed step by step blogs to help you learn this topic properly. Here’s the link:
Are you a data analyst and want to learn more about Power BI?
Why not sign up for Power BI training in Australia. We provide our services in the following regions:
- Power BI Training in Sydney
- Power BI Training in Melbourne
- Power BI Training in Brisbane
- Power BI Training in Canberra