All the facts you need to know about fuzzy matching!

You are currently viewing All the facts you need to know about fuzzy matching!

Power BI users come from different backgrounds and have their own specific business requirements which they address using their business data.  Microsoft Power BI has a large user base in Australia. Power BI Training (s) in Sydney and other parts of the Australia focus on using real data for trainings. This makes it easier for Power BI beginners to understand Power BI and relate its features to their work. Millions of rows is not a problem for Power BI but for billions of rows of data we use additional tools.

This blog is in continuation with our series of blogs related to the fuzzy matching algorithm. One of the major factors for the success of Power BI as an analytics tool is providing its users with a high degree of freedom to tune their BI solution(s). Merging data from different sources is at the heart of data analytics. This becomes extremely difficult if the data is unclean and/or contains spelling mistakes. Fuzzy Matching is the technique for finding strings that match a pattern approximately (rather than exactly). This algorithm is also used to create links between data sets that may be less than 100% perfect. Power BI provides you with the option to tune the matching algorithm.

To completely get the essence of fuzzy matching one should know about the overall process of BI in Power BI. It involves Data Transformation via Query Editor using Query Formula Language (M), Data Modeling in relationship view, Data Enrichment by creating calculated columns, calculated measures using DAX, Data Visualizations using report view, Collaboration and Sharing via Power BI Services. Using Row level security (RLS) for sharing reports can provide businesses with functionality of securing the data at row level based on user roles. To read more about row level security read our blogs https://powerbitraining.com.au/category/row-level-security-in-power-bi/.

The BI Process is the journey of data from its raw state to a presentable form. Steps involved in BI:

  • Getting Raw Data
  • Data Transformation
  • Data Modelling
  • Data Enrichment
  • Data Visualization
  • Sharing and collaboration

When transforming data from legacy systems there are a lot of mistakes that go by unnoticed, so it is important to identify those mistakes and create a mechanism for autocorrection in the data. For example, in the snapshot of data given below there are several mistakes. White spaces, line breaks, spelling mistakes, wrong data types, inconsistent formats, leading and lagging spaces are some of them.  Fuzzy matching can be a solution for all the above listed problems. Fuzzy matching is basically an AI based tool that can be used to identify those mismatched values.fuzzy-matching

Now let’s get started to make use of fuzzy matching in various data scenarios:

Fun Fact

Why do we see two tables when we open the Excel file in Power BI?

One is simple sheet, and the other is Excel table. Always open the excel table not the sheet.Excel table

Some data cleaning steps:

Basically, we will build an AI based cleaning engine to automatically identify the unknow non-observant patterns in data. And these steps can be followed to clean any similar datasets in future. Download the Excel file from this link https://powerbitraining.com.au/facts-you-need-to-know-about-fuzzy-matching-downloads/.

Open it in Power BI Desktop to perform the below steps:

  1. Load the xlsx file into Power BI Desktop.
  2. Select both the Countries and Sales table.
  3. Click on Transform data button at the bottom of Navigator window.Navigator

Or you can open the Query Editor by clicking on Transform data in the Queries section on the Home tab.

  1. Go to View tab in the Data preview section click on show whitespace, column quality, column distribution and column prolife check boxes to see detailed information about the selected column on different section of the screen.View tab  

Show whitespaces shows the whitespace characters or line feeds in the columns data.

Column quality gives snapshot of every column including empty values and errors.Column quality

Column profile shows details of first thousand rows only because Power Query puts limit on the data shown.Column profile

Column distribution shows unique and distinct values in each column as histogram visual.

  1. Go to View tab in the Layout section click on Formula bar.Formula bar

Formula bar shows all the things that you have done to the columns with M Language also known as the query language.query language

Review the data profile for the columns in Sales table.Sales table

  1. Click on the top left corner of Trans Date column to change the data type of the “Trans Date” column to Date as we don’t need time for this column.Trans Date
  2. Rename the “urgent?” column to “Urgent Order”, so that it looks more descriptive.Urgent Order
  3. Remove the nulls from the Quantity Sold column by clicking on the drop-down option. Uncheck the null check box and click OK.
  1. Now, change the data type for the Price column to fixed decimal number so that it takes less space in the memory.fixed decimal number
  1. Hover over the Retailer type The data distribution shows that there are 12 distinct and 5 unique values in this column. The unique values mean that they will appear only once in the data column. Whereas the distinct values mean that if you remove the duplicates how many values you will be left with in the data column. The 5 unique values might indicate that their spellings are incorrect.Retailer type
  1. To correct the spelling mistakes, click on the Enter Data in the New query section from the Home tab.Enter Data
  1. Copy the following data and Paste it into the table.
  • Outdoor Shop
  • Sports Store
  • Store
  • Warehouse Store
  • Direct Marketing
  • Equipment Rental Store
  1. Click on the Undo Headers.Undo Headers
  1. Write the column name as “Retailer type” and the table name as “Correct Retailer Type” and click on OK.Correct Retailer Type
  1. Click on Sales Table in the Queries pane.
  2. Now, click on Merge Queries from the Home tab in the Combine section. Merge Queries
  3. Now the merge dialog box select the Retailer type column from the Sales table and select the Retailer type column from the Correct Retailer Type table.Correct Retailer Type
  4. There are 7513 right matches out of the 7520 total rows with the merge operation. Now click on the check box “Use fuzzy matching to perform the merge”. Now there are 7518 matches out of 7520. Click on OK. The simple merge option exactly matches the values whereas the fuzzy match will look for almost matching values.Use fuzzy matching to perform the merge
  5. Click on the expand column icon on the top right corner. And uncheck the Use original column name as prefix” And click on OK.Use original column name as prefix
  6. Reorder the column, drag the new column, and drop it side by side with the old column to see the differences between the two.Reorder the column
  7. Now remove the old Retailer type column by right clicking on the column and selecting the Remove column option.Remove
  8. Click on the column drop-down option and select the Remove Empty You can always go back to the previous steps performed in query editor by clicking on any step in the applied steps box.Remove Empty optionRemove Empty option-properties
  9. Rename the new column to Retailer type. Now, the column is 100% valid i.e., there no errors or mistakes in it.Retailer type
  10. Now, click on Merge Queries from the Home tab in the Combine section.
  11. Select the RetCity from the Sales Select the City column from the Countries table. Also check the “Use fuzzy matching to perform the merge” option.
  12. Click OK.
  13. Click on the expand column icon on the top right corner. And uncheck the “Country” Column.
  14. Click on OK.
  15. Reorder the column, drag the new column, and drop it side by side with the old column to see the differences between the two.Reorder the column
  16. Look how fuzzy merge has created two rows for the Zurich city. Because it confuses it with Munich and creates two rows eventually. Now, to change the options for fuzzy merge go back to that step by clicking on the gear icon on the right of Merged Quires1.Merged Quires1
  17. In the Merge dialog box go to join kind section and write 1 in “Maximum number of matches” field and click OK.
  18. Now click on the last step applied i.e., Expanded Countries. Now only one row is created in new column corresponding to rows in old column.Expanded Countries

To download the complete Power BI Report, click on this link: https://powerbitraining.com.au/facts-you-need-to-know-about-fuzzy-matching-downloads/

Exploring Fuzzy Merge options in Power BI

Power BI allows you to work with different fuzzy merge options. These options allow you to set the criteria for the matching that the algorithm performs.   Let us look at the options Power BI provides you to tune your matching algorithm.

Similarity Threshold

This option allows you to choose the Range of unclean data that can be considered as true match. You can change the Matching Threshold from 0 to 1. With 0 being the value for lowest similarity and 1 being the closest match. If no threshold is set, Power BI uses 0.8 as the default value of threshold.

For a step-by-step guide click here:

https://powerbitraining.com.au/changing-similarity-threshold-in-fuzzy-matching/

Using a threshold value of 0, we get all the rows in the result. This means that a threshold value of 0 does not considers strict matching rules.

Ignore Case

This option is used for String Matching where case sensitivity is not required. Using this option sydney and SYDNEY will be considered as similar. Here’s a step-by-step guide to help you through

https://powerbitraining.com.au/ignore-case-option-in-fuzzy-matching/

Match by combining Text Parts

Using this option, both the parts of the word are considered while algorithm looks for a match. Using this option, Sydney and Si dney will be considered similar. Click on this for a complete guide: https://powerbitraining.com.au/match-by-combining-parts-in-fuzzy-matching/

Maximum number of Matches


This option allows you to limit the Number of Values that your algorithm returns. The value entered in this option returns results with the first result denoting the highest match and so on. If no value is entered the algorithm returns all rows in the same order. Click on this link to learn more: https://powerbitraining.com.au/maximum-number-of-returns-in-fuzzy-matching/

Transformation Table


Transformation tables act as Mapping Tables for the algorithm when the dataset under consideration is highly dissimilar. Transformation tables consist of a ‘To’ and a ‘From’ field which define how the matching is performed between highly dissimilar data sets. Transformation tables are used to define mapping rules. Transformation tables are designed to bridge the gap between datasets which cannot be matched using simple fuzzy matching. These tables need to be designed. They act as a template for fuzzy matching algorithm. The algorithm performs self-learning on the table and produces similar results. They are broadly used as abbreviation tables. Below is an example of transformation table containing abbreviation for two cities.Transformation Table

Transformation tables are suited for datasets that have been built without any naming rules or their predicted inputs have not been defined. To learn more: https://powerbitraining.com.au/using-transformation-table-in-fuzzy-matching-in-power-bi/

Conclusion

Fuzzy matching is an algorithm used to join data sets that are slightly similar. It comes with different fuzzy matching options which can be tuned to fix the matching criteria. The good news is we have developed a complete archive on the Fuzzy matching which contains detailed step by step blogs to help you learn this topic properly. Here’s the link: https://powerbitraining.com.au/category/fuzzymatchinginpowerbitraining/

Leave a Reply