Microsoft Fabric Updates Blog

Fabric Change the Game: Unleashing the Power of Microsoft Fabric and OpenAI for Dataset Search

Microsoft Fabric and Open AI

Ever wondered how cutting-edge technology can transform the way we explore and understand vast datasets? Join me on a journey through the code as we harness the power of OpenAI to unlock insights from a rich dataset of the city of London.

Our exploration begins with the London Workplace Zone Classification dataset information available on London Datastore. To make our exploration seamless, I’ve converted the original XLS file to a CSV format, paving the way for easier processing and analysis.

This post is based on research that I was doing and it uses the following as a reference:

Azure OpenAI Service embeddings tutorial – Azure OpenAI | Microsoft Learn

Azure OpenAI Service embeddings – Azure OpenAI – embeddings and cosine similarity | Microsoft Learn

Azure-OpenAI-Docs-Samples-embedding_billsum(github.com)

Step by Step:

  1. OpenAI – You will need to have access to OpenAI.

Follow the docs to proceed with the request:

What is Azure OpenAI Service? – Azure AI services | Microsoft Learn

Request Access to Azure OpenAI Service (microsoft.com)

  1. Once the access is granted, proceed to deploy the Open AI service:

Azure OpenAI Service embeddings tutorial – Azure OpenAI | Microsoft Learn

  1. Keep track of the following values that we will need to connect to the Open AI service: Api_key and the Service Endpoint.
  2. You can find this information in the Azure Portal using the Open AI service that you created in Step 2. Fig 1 – OpenAI from docs, shows that information.

Docs for more information: Connect using API keys – Azure Cognitive Search | Microsoft Learn

Fig 1 – OpenAI from docs

4. Still inside of the Azure Open AI services in the Azure Portal click on Model Deployments -> Manage Deployment. As Fig 2 – Manage Deployment, shows:

Fig 2 – Manage Deployment

5. Create a new deployment and choose the mode text-embedding-ada-002 version 2, this model refers to a specific language model provided by OpenAI for text embeddings. Fig 3 – Ada, shows:

Fig 3 – Ada

6. Fabric – Create a workspace, if you do not have one. Create a Lakehouse and create a new notebook. If you need more details on how to do it, follow the docs: Data warehouse tutorial – create a workspace – Microsoft Fabric | Microsoft Learn, Create a lakehouse with OneLake – Microsoft Fabric | Microsoft Learn, How to use notebooks – Microsoft Fabric | Microsoft Learn. Once you have done that, proceed to upload the dataset using the CSV format through the Onelake as Fig 4 – upload shows:

Fig 4 Upload

7 – Inside Fabric Notebook, lets code – You will need to install the following libraries:

Code: python

============================================

%pip install openai --upgrade num2words matplotlib plotly scipy scikit-learn pandas

============================================

8 – Use the following configuration for the variables. Please note you will need the information that you collected at Step 3.

Code: python

=============================================

import openai


openai.api_key = "replace the API Key"
endpoint = "https://<your_service_name>.openai.azure.com/"
openai.api_type = "azure"
embedding_deployed_model = "text-embedding-ada-002"

===============================================

Cleaning the Dataset:

  1. Create the following Functions to help prepare the data and create the embeddings. Note: After version 0.28.1 of Open AI there is a break change in the way the embeddings are created. The utils.embeddings_utils was removed, so the embeddings will be created with a customized function. This is taken from the docs as most of the code used for this example: Embeddings

Code: python

===========================================================================

import re

def normalize_text(s, sep_token = " \n "):
s = re.sub(r'\s+', ' ', s).strip()
s = re.sub(r". ,","",s)
# remove all instances of multiple spaces
s = s.replace("..",".")
s = s.replace(". .",".")
s = s.replace("\n", "")
s = s.strip()
return s

===========================================================================

Generate Embeddings and Search functions:

Code: python

===========================================================================

import pandas as pd
import openai
from openai import AzureOpenAI
import numpy as np
from ast import literal_eval

# Embedding

def get_embedding(text, model="text-embedding-ada-002"):
return client.embeddings.create(input=[text], model=model).data[0].embedding

# Search – Cosine

def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Search

def search_docs(df, user_query, top_n=4, to_print=True):
embedding = get_embedding(user_query, model="text-embedding-ada-002")
df['ada_embedding'] = df['ada_embedding'].apply(literal_eval)
df = df[df['ada_embedding'].apply(lambda x: isinstance(x, np.ndarray))]
# Calculate cosine similarity for each row in the DataFrame
df["similarities"] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
# Sort and get the top N results
res = df.sort_values("similarities", ascending=False).head(top_n)
if to_print:
display(res)
return res

===========================================================================

  1. The next step is running the code, basically, the plan is: to get the file with the embeddings created with the function get_embedding and again use the same function to take an input and return the embedding vector from that. Follow this, use the cosine_similarity function to calculate the cosine between two vectors a and b.

In other words: Cross the information using the embedding created in the file, with the text we are trying to filter in a vector format and use the help of cosine calculation to find similarities in between.

Note:

  1. A folder called output was used to save the file with the embeddings created.
  2. The file only creates the embeddings for the first 20 rows instead of the whole file.
  3. All text columns were combined in one column called “combined” and the embeddings will use that column.
  4. File paths need to be replaced in the code, use the API path as Fig 5 – API path, shows:
Fig 5 – API Path

Code: python

===========================================================================

client = AzureOpenAI( api_key = openai.api_key, api_version = "2023-05-15", azure_endpoint = endpoint )
input_datapath = "/lakehouse/default/Files/Files_CSV/LWZC Classification.csv" #File path, do not forget to replace by your own path

df = pd.read_csv(input_datapath, index_col=0, nrows=20) # Read only the first 20 rows

text_columns = ['LA name', 'Group description', 'Subgroup description']
df['combined'] = df[text_columns].astype(str).apply(lambda row: ' '.join(row), axis=1)
df['combined'] = df["combined"].apply(lambda x : normalize_text(x))

# The new column ada_embedding will contain the embeddings created

df['ada_embedding'] = df["combined"].apply(lambda x: get_embedding(x, model='text-embedding-ada-002‘))

df.to_csv('/lakehouse/default/Files/Files_CSV/output/LWZC Classification_embeddindgs_20rows.csv', index=False) #File path, do not forget to replace by your own path

===========================================================================

  1. Next, inspect the file that was just loaded, as Fig 6 – Inspect shows: Note: there is a new column with the embeddings created.

Code: python

===========================================================================

# Load data into pandas DataFrame from Onelake

df = pd.read_csv("/lakehouse/default/" + "Files/Files_CSV/output/LWZC Classification_embeddindgs_20rows.csv")

# This will make sure the embeddings are in an array format.

df['ada_embedding'] = df.ada_embedding.apply(eval).apply(np.array)
df.head(20)

===========================================================================

Fig 6 – Inspect

12. Search -“Professional, retail & leisure in dynamic” and I want the dataset to be filtered accordingly. Hence, the function will convert my sentence into a vector format and the cosine similarity will find the similarity between the vector that was created in the dataset.

Code: python

===========================================================================

# Assuming df is your DataFrame and it is using ‘ada_embedding’ as my example

embedding = get_embedding("Professional, retail & leisure in dynamic", model="text-embedding-ada-002")
df["similarities"] = df['ada_embedding'].apply(lambda x: cosine_similarity(x, embedding))
res = df.sort_values("similarities", ascending=False).head(4)

# Display the results

res.head()

Follow the results – Fig 7 Result:

Fig 7 Result

===========================================================================

In summary, I utilized Microsoft Fabric to establish a connection with the OpenAI service. Through this connection, I generated embeddings, which serve as numerical representations of words, phrases, or documents within a continuous vector space. These embeddings are specifically crafted to capture semantic relationships between words, enhancing the capability of algorithms to comprehend and process textual data more efficiently. Furthermore, I conducted a search within a dataset using this concept, leveraging the embeddings to enhance the accuracy and relevance of the search results.

Related blog posts

Fabric Change the Game: Unleashing the Power of Microsoft Fabric and OpenAI for Dataset Search

October 9, 2024 by Misha Desai

At Fabric, we’re passionate about contributing to the open-source community, particularly in areas that advance the usability and scalability of machine learning tools. One of our recent endeavors has been making substantial contributions back to the FLAML (Fast and Lightweight AutoML) project, a robust library designed to automate the tedious and complex process of machine … Continue reading “Enhancing Open Source: Fabric’s Contributions to FLAML for Scalable AutoML”

September 25, 2024 by Santhosh Kumar Ravindran

We’re excited to introduce high concurrency mode for notebooks in pipelines, bringing session sharing to one of the most popular orchestration mechanisms for enterprise data ingestion and transformation. Notebooks will now automatically be packed into an active high concurrency session without compromising performance or security, while paying for a single session. Key Benefits: Why Use … Continue reading “Introducing High Concurrency Mode for Notebooks in Pipelines for Fabric Spark”