Fabric Change the Game: Unleashing the Power of Microsoft Fabric and OpenAI for Dataset Search
Microsoft Fabric and Open AI
Ever wondered how cutting-edge technology can transform the way we explore and understand vast datasets? Join me on a journey through the code as we harness the power of OpenAI to unlock insights from a rich dataset of the city of London.
Our exploration begins with the London Workplace Zone Classification dataset information available on London Datastore. To make our exploration seamless, I’ve converted the original XLS file to a CSV format, paving the way for easier processing and analysis.
This post is based on research that I was doing and it uses the following as a reference:
Azure OpenAI Service embeddings tutorial – Azure OpenAI | Microsoft Learn
Azure OpenAI Service embeddings – Azure OpenAI – embeddings and cosine similarity | Microsoft Learn
Azure-OpenAI-Docs-Samples-embedding_billsum(github.com)
Step by Step:
- OpenAI – You will need to have access to OpenAI.
Follow the docs to proceed with the request:
What is Azure OpenAI Service? – Azure AI services | Microsoft Learn
Request Access to Azure OpenAI Service (microsoft.com)
- Once the access is granted, proceed to deploy the Open AI service:
Azure OpenAI Service embeddings tutorial – Azure OpenAI | Microsoft Learn
- Keep track of the following values that we will need to connect to the Open AI service:
Api_key
andthe Service Endpoint.
- You can find this information in the Azure Portal using the Open AI service that you created in Step 2. Fig 1 – OpenAI from docs, shows that information.
Docs for more information: Connect using API keys – Azure Cognitive Search | Microsoft Learn
4. Still inside of the Azure Open AI services in the Azure Portal click on Model Deployments -> Manage Deployment. As Fig 2 – Manage Deployment, shows:
5. Create a new deployment and choose the mode text-embedding-ada-002 version 2, this model refers to a specific language model provided by OpenAI for text embeddings. Fig 3 – Ada, shows:
6. Fabric – Create a workspace, if you do not have one. Create a Lakehouse and create a new notebook. If you need more details on how to do it, follow the docs: Data warehouse tutorial – create a workspace – Microsoft Fabric | Microsoft Learn, Create a lakehouse with OneLake – Microsoft Fabric | Microsoft Learn, How to use notebooks – Microsoft Fabric | Microsoft Learn. Once you have done that, proceed to upload the dataset using the CSV format through the Onelake as Fig 4 – upload shows:
7 – Inside Fabric Notebook, lets code – You will need to install the following libraries:
Code: python
============================================
%pip install openai --upgrade num2words matplotlib plotly scipy scikit-learn pandas
============================================
8 – Use the following configuration for the variables. Please note you will need the information that you collected at Step 3.
Code: python
=============================================
import openai
openai.api_key = "replace the API Key"
endpoint = "https://<your_service_name>.openai.azure.com/"
openai.api_type = "azure"
embedding_deployed_model = "text-embedding-ada-002"
===============================================
Cleaning the Dataset:
- Create the following Functions to help prepare the data and create the embeddings. Note: After version 0.28.1 of Open AI there is a break change in the way the embeddings are created. The
utils.embeddings_utils
was removed, so the embeddings will be created with a customized function. This is taken from the docs as most of the code used for this example: Embeddings
Code: python
===========================================================================
import re
def normalize_text(s, sep_token = " \n "):
s = re.sub(r'\s+', ' ', s).strip()
s = re.sub(r". ,","",s)
# remove all instances of multiple spaces
s = s.replace("..",".")
s = s.replace(". .",".")
s = s.replace("\n", "")
s = s.strip()
return s
===========================================================================
Generate Embeddings and Search functions:
Code: python
===========================================================================
import pandas as pd
import openai
from openai import AzureOpenAI
import numpy as np
from ast import literal_eval
# Embedding
def get_embedding(text, model="text-embedding-ada-002"):
return client.embeddings.create(input=[text], model=model).data[0].embedding
# Search – Cosine
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Search
def search_docs(df, user_query, top_n=4, to_print=True):
embedding = get_embedding(user_query, model="text-embedding-ada-002")
df['ada_embedding'] = df['ada_embedding'].apply(literal_eval)
df = df[df['ada_embedding'].apply(lambda x: isinstance(x, np.ndarray))]
# Calculate cosine similarity for each row in the DataFrame
df["similarities"] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
# Sort and get the top N results
res = df.sort_values("similarities", ascending=False).head(top_n)
if to_print:
display(res)
return res
===========================================================================
- The next step is running the code, basically, the plan is: to get the file with the embeddings created with the function
get_embedding
and again use the same function to take an input and return the embedding vector from that. Follow this, use thecosine_similarity
function to calculate the cosine between two vectorsa
andb
.
In other words: Cross the information using the embedding created in the file, with the text we are trying to filter in a vector format and use the help of cosine calculation to find similarities in between.
Note:
- A folder called output was used to save the file with the embeddings created.
- The file only creates the embeddings for the first 20 rows instead of the whole file.
- All text columns were combined in one column called “combined” and the embeddings will use that column.
- File paths need to be replaced in the code, use the API path as Fig 5 – API path, shows:
Code: python
===========================================================================
client = AzureOpenAI( api_key = openai.api_key, api_version = "2023-05-15", azure_endpoint = endpoint )
input_datapath = "/lakehouse/default/Files/Files_CSV/LWZC Classification.csv" #File path, do not forget to replace by your own path
df = pd.read_csv(input_datapath, index_col=0, nrows=20) # Read only the first 20 rows
text_columns = ['LA name', 'Group description', 'Subgroup description']
df['combined'] = df[text_columns].astype(str).apply(lambda row: ' '.join(row), axis=1)
df['combined'] = df["combined"].apply(lambda x : normalize_text(x))
# The new column ada_embedding will contain the embeddings created
df['ada_embedding'] = df["combined"].apply(lambda x: get_embedding(x, model='text-embedding-ada-002
‘))
df.to_csv('/lakehouse/default/Files/Files_CSV/output/LWZC Classification_embeddindgs_20rows.csv', index=False)
#File path, do not forget to replace by your own path
===========================================================================
- Next, inspect the file that was just loaded, as Fig 6 – Inspect shows: Note: there is a new column with the embeddings created.
Code: python
===========================================================================
# Load data into pandas DataFrame from Onelake
df = pd.read_csv("/lakehouse/default/" + "Files/Files_CSV/output/LWZC Classification_embeddindgs_20rows.csv")
# This will make sure the embeddings are in an array format.
df['ada_embedding'] = df.ada_embedding.apply(eval).apply(np.array)
df.head(20)
===========================================================================
12. Search -“Professional, retail & leisure in dynamic” and I want the dataset to be filtered accordingly. Hence, the function will convert my sentence into a vector format and the cosine similarity will find the similarity between the vector that was created in the dataset.
Code: python
===========================================================================
# Assuming df is your DataFrame and it is using ‘ada_embedding’ as my example
embedding = get_embedding("Professional, retail & leisure in dynamic", model="text-embedding-ada-002")
df["similarities"] = df['ada_embedding'].apply(lambda x: cosine_similarity(x, embedding))
res = df.sort_values("similarities", ascending=False).head(4)
# Display the results
res.head()
Follow the results – Fig 7 Result:
===========================================================================
In summary, I utilized Microsoft Fabric to establish a connection with the OpenAI service. Through this connection, I generated embeddings, which serve as numerical representations of words, phrases, or documents within a continuous vector space. These embeddings are specifically crafted to capture semantic relationships between words, enhancing the capability of algorithms to comprehend and process textual data more efficiently. Furthermore, I conducted a search within a dataset using this concept, leveraging the embeddings to enhance the accuracy and relevance of the search results.