Microsoft Fabric Updates Blog

Fabric Changing the game: Validate dependencies with Semantic Link – Data Quality

As we embark on our quest to unravel the mysteries of the data within Fabric, a world of possibilities awaits. Let’s continue our journey in understanding and exploring the data.

In this installment, our attention is directed to how to explore the functional dependencies -relationships between columns in a table. With a specific focus on a file, we aim to decipher the nuances and uncover the insights locked within.

Please note that you also can get more information about Data exploration here: Fabric Change the Game: Exploring the data | Microsoft Fabric Blog | Microsoft Fabric.

What is the goal of Semantic Link?

From the following reference: What is semantic link? – Microsoft Fabric | Microsoft Learn. “The primary goals of semantic links are to facilitate data connectivity, enable the propagation of semantic information, and seamlessly integrate with established tools used by data scientists, such as notebooks. semantic link helps you preserve domain knowledge about data semantics in a standardized way that can speed up data analysis and reduce errors.

Semantic link is a feature that allows you to establish a connection between Power BI datasets and Synapse Data Science in Microsoft Fabric.”

Guy in a Cube covered in this very interesting video the semantic link and power BI integration: Finish the Puzzle: Querying Power BI Semantic Models with Semantic Link! – YouTube

Also, you can find more information about Power BI and Semantic link integration in the following post: Semantic link in Microsoft Fabric: Bridging BI and Data Science | Microsoft Fabric Blog | Microsoft Fabric

But did you know that besides the Power BI datasets, you can also use Semantic Links to cover and explore functional dependencies in files? Please note, that this is on the documentation: Validate functional dependencies in data with semantic link – Microsoft Fabric | Microsoft Learn and my plan is to explore that option as follows.

Scenario: You are working on a project. You know that from the files that are in CSV format, you need to design and understand functional dependencies between columns. Also, you need to understand if there is any violation of those dependencies which could compromise the quality of the data.

The data will be extracted from https://data.london.gov.uk/ which is an open data-sharing platform about the diverse and captivating city of London.

Step by Step

  1. The first thing is to create a Lakehouse inside of Fabric and install from a notebook the library – Fig 1 pipinstall. Note: If you need more information about the Lakehouse: Lakehouse tutorial – Create your first lakehouse – Microsoft Fabric | Microsoft Learn
Fig 1 pipinstall

2. From Onelake Explorer you can easily upload the files extracted. If you have doubts about Onelake Explorer: Access Fabric data locally with OneLake file explorer – Microsoft Fabric | Microsoft Learn. As you can see from Fig 2 – Onelake. I uploaded the data inside of the Cultural_Infra folder.

Fig 2 Onelake

3. Using Pandas I am listing the data from the folder library. Fig 3 Listing.

Fig 3 Listing

4. The next step is to understand which kind of dependencies may exist between the columns of this dataset, using the code below in Pyspark:

import pandas as pd
from sempy.fabric import FabricDataFrame
from sempy.dependencies import plot_dependency_metadata
from sempy.samples import download_synthea




df = FabricDataFrame(pd.read_csv("/lakehouse/default/Files/Cultural_Infra/Cultural_Infra/Library/CIM 2023 Libraries.csv"))



deps = df.find_dependencies()


plot_dependency_metadata(deps)

The results of this execution are displayed in Fig 4 Plot:

Fig 4 Plot

What does this mean?

  • “The find_dependencies function in SemPy detects functional dependencies between the columns of a FabricDataFrame.
  • Columns that have a 1:1 mapping will be represented as a list. 
  • The function uses a threshold on conditional entropy to discover approximate functional dependencies, where low conditional entropy indicates strong dependence between columns.
  • Columns that belong to a single group are put into a single cell.”

Ref: Validate functional dependencies in data with semantic link – Microsoft Fabric | Microsoft Learn

Cross Checking

The Library Name has a one-to-one dependency on latitude, longitude, easting, and northing columns so it is represented as a list. it makes sense since the library is in one specific location on the map. The library also has a ward code and this is dependent on the ward_name which also could be understood as a neighborhood or a region inside of the neighborhood which is a borough. For example, Highgate is a ward in the London Borough of Camden, in the United Kingdom. 

Next, the library file was saved in a table inside of the Lakehouse (Lakehouse tutorial – Prepare and transform data in the lakehouse – Microsoft Fabric | Microsoft Learn – Step 12). From that table, let’s try to execute a series of queries to confirm our understanding. The first query will check the one-to-one relationship between the plot listed columns by grouping the columns and counting if there is any duplication – Fig 5 Group by:

Fig 5 Group By

The result confirms one -to-one relationship.

Based on the results that were plotted before by the Semantic Link the listed columns (name, latitude, longitude, easting, and northing) are also related to the ward code, name, and borough. It is logical that we could have multiple ward codes, and ward names in a borough, though let’s confirm it with a query that will group those columns and check if there is duplication. Fig 6 Multiple Wards:

Fig 6 Multiple Wards

According to the results, we could have more than one ward code, ward name, and borough.

Next, let’s filter the data and confirm our findings with a sample. First, by checking all the libraries that exist in a certain borough. Then, confirming that every library has a one-to-one relationship between name, latitude, longitude, easting, northing, borough, ward code, and ward name. Hence, each borough should have multiple ward names (regions in a neighborhood) that are related to a ward code. Also, looking at the results a ward code could belong to more than one ward name. Results of the query filtered by one borough can be checked in Fig 7 Filtering:

Fig 7 Filtering

Violations

Therefore, using Semantic Link we were able to list the functional dependencies between columns. Can we also check for dependency violations? So, for example, can a ward have more than one ward code?

Using the following example, we are assuming there exists a functional dependency between the ward_2022_name(determinant) and ward_code(dependent) columns, we can use the semantic link to list any possible violation in Pyspark. Results in Fig 8 Violations:

violations = df.list_dependency_violations(determinant_col="ward_2022_name", dependent_col="ward_2022_code")
violations
Fig 8 Violations

Results show there is a ward name called Highgate which can be referenced by two different ward_2022_codes.

Interesting fact those results as I first thought the ward code would be like a postcode, but it is not. It is a unique identification of a ward within a Health Care Provider and the data provided confirm two ward codes for the same ward. WARD CODE (datadictionary.nhs.uk)

Using the following code in Pyspark, you can even plot that possible violation. The results can be checked in Fig 8 Violation plot:

from sempy.fabric import FabricDataFrame
from sempy.dependencies import plot_dependency_metadata
from sempy.samples import download_synthea
import pandas as pd


download_synthea(which='small')


df = FabricDataFrame(pd.read_csv("/lakehouse/default/Files/Cultural_Infra/Cultural_Infra/Library/CIM 2023 Libraries.csv"))




df.plot_dependency_violations(determinant_col="ward_2022_name", dependent_col="ward_2022_code")


Fig 8 Violation plot

Summary: Using Semantic Link in the Lakehouse we were able to find functional dependencies between columns in a file as well violations that could potentially affect the quality of the data. Semantic Link can be used with Power BI to explore the Gold Layer of the data, but it is not limited to that. It can also be used to explore data quality and functional dependencies.

Gerelateerde blogberichten

Fabric Changing the game: Validate dependencies with Semantic Link – Data Quality

oktober 9, 2024 door Misha Desai

At Fabric, we’re passionate about contributing to the open-source community, particularly in areas that advance the usability and scalability of machine learning tools. One of our recent endeavors has been making substantial contributions back to the FLAML (Fast and Lightweight AutoML) project, a robust library designed to automate the tedious and complex process of machine … Continue reading “Enhancing Open Source: Fabric’s Contributions to FLAML for Scalable AutoML”

september 25, 2024 door Santhosh Kumar Ravindran

We’re excited to introduce high concurrency mode for notebooks in pipelines, bringing session sharing to one of the most popular orchestration mechanisms for enterprise data ingestion and transformation. Notebooks will now automatically be packed into an active high concurrency session without compromising performance or security, while paying for a single session. Key Benefits: Why Use … Continue reading “Introducing High Concurrency Mode for Notebooks in Pipelines for Fabric Spark”