Microsoft Fabric Updates Blog

Fabric Change the Game: Embracing Azure Cosmos DB for NoSQL

In this new post of our ongoing series, we’ll set up Azure Cosmos DB for NoSQL and use the Vector Search capabilities of Azure AI Search from a Microsoft Fabric Lakehouse. We’ll also explore Cosmos DB mirroring and how it integrates with Microsoft Fabric. Note that this approach relies on the search service’s capabilities, with the Python code run through the Lakehouse. This is just one of many possibilities within Fabric, and it is particularly useful if your data resides in Cosmos DB and you want to use Fabric’s integration capabilities for search or data mirroring.

Vector Search

For Azure Cosmos DB for NoSQL specifically, the configuration for Vector Search involves the Azure OpenAI and Azure AI Search (formerly Cognitive Search) services.

You will need:

  1. An Azure Cosmos DB for NoSQL account that has already been deployed. You can use the Serverless option to manage cost. If you are just starting with Azure Cosmos DB for NoSQL, see these references:
    MS Docs:
    Get started with Azure Cosmos DB for NoSQL – Training | Microsoft Learn
    Quickstart – Create Azure Cosmos DB resources from the Azure portal | Microsoft Learn
    End-to-end example from Cyrille of the MS FTA team:
    Getting started with Azure Cosmos DB – end to end example – Azure Cosmos DB Blog (microsoft.com)
  2. The plan is to use Vector Search through the Lakehouse. You also need a Microsoft Fabric workspace with a Lakehouse:
    Getting Started | Microsoft Fabric
    Get started with Microsoft Fabric – Training | Microsoft Learn
    Create a lakehouse – Microsoft Fabric | Microsoft Learn
  3. Create the Search Service: Introduction to Azure AI Search – Azure AI Search | Microsoft Learn.
    Ensure you keep track of the Search key value by navigating to the Search Service, then accessing the keys section and copying the value provided. Additionally, copy the URL values available on the overview page of the Search Service for future reference.
  4. The Azure OpenAI service, set up as described in this post: Fabric Change the Game: Unleashing the Power of Microsoft Fabric and OpenAI for Dataset Search | Microsoft Fabric Blog | Microsoft Fabric.
  5. You’ll need to upload the files containing the embeddings into Azure Cosmos DB for NoSQL. These files can be found in the “Data” folder within the “Code_Samples” repository:

    Azure-Samples/azure-vector-database-samples: A collection of samples to demonstrate vector search capabilities using different Azure tools like Azure AI Search, PostgreSQL, Redis etc. (github.com). This repo was created by the Microsoft ISE team with many contributions (check the contribution list), mainly from Jose Perales and Raihan Alam, and it has solid examples of Vector Search implementations.

    Please note that the repo also has a pretty cool example with Fabric and Kusto, made by Siliang Jiao and Gary Wang, which I encourage you to try out: azure-vector-database-samples/code_samples/fabric_kusto at main · Azure-Samples/azure-vector-database-samples (github.com)

    For more examples of Vector Search using different Cosmos DB versions, this repo has some pretty cool samples, and I also used it as a reference: AzureDataRetrievalAugmentedGenerationSamples/README.md at main · microsoft/AzureDataRetrievalAugmentedGenerationSamples (github.com)

Step By Step:

  1. Assuming the Cosmos DB service is already deployed (as mentioned above), you will need to create a database. My example uses the name Vector_DB:
    Quickstart – Create Azure Cosmos DB resources from the Azure portal | Microsoft Learn
  2. Inside the service, look for the URI, then copy and paste it into a notepad for later use, as Fig 1 – URI shows. Mine, for example, is https://lilem.documents.azure.com:443/.
Fig 1 – URI

3. Also, look for Keys inside your Cosmos DB account and copy the Primary Key into the notepad, as Fig 2 – Keys shows:

Fig 2 – Keys

4. With the information gathered above, let’s create the container from a notebook in the Fabric Lakehouse. Alternatively, you can create the container through the Cosmos DB UI.

%pip install azure-cosmos

from azure.cosmos import CosmosClient, PartitionKey, exceptions

cosmos_db_api_endpoint = "COPY THE URI HERE"
cosmos_db_api_key = "COPY THE KEY HERE"
database_name = "Vector_DB"  # this is your database name
text_table_name = 'text_sample'  # this is your container name

# Initialize the Cosmos DB client and create the database if it does not exist
client = CosmosClient(cosmos_db_api_endpoint, credential=cosmos_db_api_key)
database = client.create_database_if_not_exists(id=database_name)

try:
    # Create the container, partitioned on the document id
    container = database.create_container_if_not_exists(
        id=text_table_name,
        partition_key=PartitionKey(path="/id"),
    )
    print(f"Container {text_table_name} created successfully")

except Exception as e:
    print(f"Error: {e}")

5. Upload the data into Cosmos DB.
Data: azure-vector-database-samples/code_samples/data/text/product_docs_embeddings.json at main · Azure-Samples/azure-vector-database-samples (github.com)

When it comes to inserting or uploading the data, you are free to choose your preferred method. The repositories I mentioned earlier provide Python examples, and the Microsoft documentation offers some Bash examples as well. To keep things simple, I’ll insert the embedding file directly from OneLake / the Fabric Lakehouse.

import pandas as pd

cosmosdb_container_name = text_table_name
container = database.get_container_client(cosmosdb_container_name)

# Read data from the JSON file
text_df = pd.read_json('/API PATH/product_docs_embeddings.json')
records = text_df.to_dict(orient='records')

# Iterate through the data and insert the items with the embeddings into the container
try:
    for item in records:
        item['@search.action'] = 'upload'
        # Convert the 'id' attribute to a string
        item['id'] = str(item['id'])
        # Insert the item into the container
        container.create_item(body=item)
    print(f"Data items inserted into the Cosmos DB {cosmosdb_container_name}")

except exceptions.CosmosResourceExistsError as e:
    print(f"Document with ID {item['id']} already exists in {cosmosdb_container_name}...")
    print(f"Error: {e}")

except Exception as e:
    # Handle other exceptions
    print(f"Error: {e}")
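Because the container is partitioned on /id, inserting a second item with an id that already exists raises CosmosResourceExistsError. As a minimal sketch, the records could be normalized before the insert loop so that ids are strings and repeated ids are skipped. The helper name prepare_items and the sample records are hypothetical and not part of the original sample; the sketch does not call Cosmos DB.

```python
from typing import Any, Dict, List, Set


def prepare_items(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the records with string ids, skipping repeated ids."""
    seen_ids: Set[str] = set()
    prepared: List[Dict[str, Any]] = []
    for record in records:
        item = dict(record)           # shallow copy so the input is not mutated
        item["id"] = str(item["id"])  # Cosmos DB requires the id to be a string
        if item["id"] in seen_ids:
            continue                  # a second insert with this id would fail
        seen_ids.add(item["id"])
        prepared.append(item)
    return prepared


# Hypothetical records; the second one repeats the id of the first.
sample = [{"id": 1, "title": "a"}, {"id": "1", "title": "dup"}, {"id": 2, "title": "b"}]
print([item["id"] for item in prepare_items(sample)])
```

The prepared list could then be passed to the same create_item loop in place of records.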

6. Let’s use Azure AI Search for the search. For background, see: Vector database – Azure Cosmos DB | Microsoft Learn

Create the DataSource
First, let’s create the DataSource for Azure Cosmos DB using the Search Service in the Azure Portal, as Fig 3 – Datasource shows:

Fig 3 – Datasource



Connection string for my example database, Vector_DB: “AccountEndpoint=URI;AccountKey=YOURKEY==;Database=Vector_DB;”
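To avoid typos when assembling that string by hand, a small helper could build it from the three values. This is only a sketch that mirrors the format shown above; the function name build_cosmos_connection_string and the placeholder account name are assumptions.

```python
def build_cosmos_connection_string(account_endpoint: str, account_key: str, database: str) -> str:
    """Assemble the data source connection string in the format shown above."""
    return f"AccountEndpoint={account_endpoint};AccountKey={account_key};Database={database};"


# Placeholder values; substitute the URI and Primary Key copied earlier.
connection_string = build_cosmos_connection_string(
    "https://YOURACCOUNT.documents.azure.com:443/", "YOURKEY==", "Vector_DB"
)
print(connection_string)
```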

Create the index

As shown in Figures 4 and 5 (Index and Fields, respectively), let’s continue within the Search Service interface. Utilizing the UI, we’ll configure the index. Since this process is performed via the UI, you’ll need to add each field individually.

Fig 4-Index
Fig 5-Fields

Please note that title_vector and content_vector require one extra step: creating the vector profile, as Fig 6 – Profile shows.
Note:
We are using HNSW: “Hierarchical Navigable Small World (HNSW): HNSW is a leading ANN algorithm optimized for high-recall, low-latency applications where data distribution is unknown or can change frequently.” Ref: VectorSearch
About the Vector Size: “For each vector field, Azure AI Search constructs an internal vector index using the algorithm parameters specified on the field. Each vector is usually an array of single-precision floating-point numbers, in a field of type Collection(Edm.Single).” Ref: Vector Size

Fig 6- Profile
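Since each vector is an array of single-precision (4-byte) floats, the raw size of the vector data can be estimated with simple arithmetic. The sketch below is a rough estimate only: it ignores the extra space used by the HNSW graph, the document count is a made-up example, and the function name is hypothetical. The 1536 dimensions correspond to the output size of text-embedding-ada-002.

```python
BYTES_PER_SINGLE = 4  # one single-precision float (Edm.Single)


def raw_vector_bytes(num_documents: int, dimensions: int, vector_fields: int = 1) -> int:
    """Raw size of the stored vectors, excluding any index overhead."""
    return num_documents * vector_fields * dimensions * BYTES_PER_SINGLE


# Hypothetical: 1,000 documents, two vector fields (title_vector and
# content_vector), 1536 dimensions each.
size = raw_vector_bytes(num_documents=1000, dimensions=1536, vector_fields=2)
print(f"{size} bytes (~{size / 1024**2:.1f} MiB)")
```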

Once the fields are created as in Fig 5 – Fields (above), just define a name for the index and click the Create button.

Create the indexer

Still inside the Search Service, create the indexer using the DataSource and the index created previously, then save and run it, as Fig 7 – indexer shows:

Fig 7-indexer

For more information on how the Vector Search Service works: VectorSearch Works

“On the indexing side, Azure AI Search takes vector embeddings and uses a nearest neighbors algorithm to place similar vectors close together in an index. Internally, it creates vector indexes for each vector field.”
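To make the idea of "nearest neighbors" concrete, here is a minimal sketch of an exhaustive (brute-force) search over a few toy vectors. This is not how Azure AI Search is implemented: the service uses an approximate algorithm such as HNSW to avoid scoring every vector. The function names and the 3-dimensional vectors are illustrative assumptions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def nearest_neighbors(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[str]:
    """Exhaustive search: score every stored vector, return the k most similar ids."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real ones have far more dimensions.
toy_index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
print(nearest_neighbors([0.8, 0.0, 0.1], toy_index, k=2))
```

Scoring every vector costs time proportional to the number of documents, which is the cost an approximate index is designed to avoid.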

So, all the configuration is done. Now let’s search!
Libraries:

%pip install --upgrade azure-cosmos openai azure-search-documents==11.4.0

import json
import datetime
import time
from ast import literal_eval
from typing import List

import numpy as np
import pandas as pd
import openai
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import AzureError
from azure.cosmos import exceptions, CosmosClient, PartitionKey
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType )

Functions for the vector search:

def get_embedding(text, model="text-embedding-ada-002"):
    # 'client' is the AzureOpenAI client created in the search script further down
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def cosine_similarity(a, b):
    # Convert the input arrays to numpy arrays
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    # Return 0.0 for empty or all-zero arrays to avoid dividing by a zero norm
    if np.all(a == 0) or np.all(b == 0):
        return 0.0

    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity

Initialize the Connection:

database_name = "Vector_DB"
text_table_name = 'YOUR CONTAINER NAME'###mine is text_sample3
cosmos_db_api_endpoint="URI"
cosmos_db_api_key = "YOUR KEY"

# Configure Azure Cognitive Search
cog_search_endpoint  = "https://YOURSERVICENAME.search.windows.net"
cog_search_key  = "KEY of your service"

index_name = "YOUR Index Name" ##my example is index_textsample3
credential = AzureKeyCredential(str(cog_search_key))
openai.api_type = "azure"  
openai.api_key = "YOUR open AI Key"
openai.api_base = "https://YOUROpenAIService.openai.azure.com/"

cosmos_client = CosmosClient(cosmos_db_api_endpoint, cosmos_db_api_key)
database = cosmos_client.get_database_client(database_name)
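The keys above are pasted directly into the notebook for simplicity. As one possible alternative, the values could be read from environment variables so they do not end up stored in the notebook itself. This is a sketch only: the helper load_setting and the variable names (COSMOS_DB_ENDPOINT and so on) are hypothetical, and the defaults are the same placeholders used above.

```python
import os
from typing import Optional


def load_setting(name: str, default: Optional[str] = None) -> str:
    """Read a setting from the environment, failing loudly if it is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Hypothetical variable names; the defaults keep this sketch runnable as-is.
cosmos_db_api_endpoint = load_setting("COSMOS_DB_ENDPOINT", "URI")
cosmos_db_api_key = load_setting("COSMOS_DB_KEY", "YOUR KEY")
cog_search_key = load_setting("SEARCH_SERVICE_KEY", "KEY of your service")
```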

Script for the Search:

from openai import AzureOpenAI

container_name = text_table_name

# Assumption: the Azure OpenAI embedding deployment is named after the model
model = "text-embedding-ada-002"

client = AzureOpenAI(
    api_key=openai.api_key,
    api_version="2023-05-15",
    azure_endpoint=openai.api_base)

container = database.get_container_client(container_name)
search_client = SearchClient(cog_search_endpoint, index_name, credential)

query = 'tools for software development'  # example
query_vector = get_embedding(query, model=model)

# Perform Azure Cognitive Search query
search_results = search_client.search(search_text=query, select=["title", "content", "category", "title_vector", "content_vector"])

for result in search_results:
    result_vector = result.get("content_vector", None)
    if result_vector is not None and len(result_vector) > 0:
        similarity_score = cosine_similarity(query_vector, result_vector)
        print(f"Title: {result['title']}")
        print(f"Score: {result['@search.score']}")
        print(f"Content: {result['content']}")
        print(f"Category: {result['category']}")
        print(f"Cosine Similarity: {similarity_score}\n")
    else:
        print(f"Skipping result with empty or missing vector.\n")

Results – Fig 8- Search:

Fig 8-Search


If you are interested in Vector Search, I encourage you to check the repositories I mentioned at the beginning of this post, where you will see many options and implementations. The Python code can be reused inside the Lakehouse with a few changes.

Mirror Cosmos DB for NoSQL

Our earlier example showed how to build Vector Search using the AI Search service with Cosmos DB for NoSQL, running the Python code from the Lakehouse. Now, let’s explore another option: mirroring the database into Fabric. Once mirroring is set up, shortcuts pointing to the mirror can be created across Microsoft Fabric workspaces. Furthermore, the SQL Endpoint can be used to query the data: you can run T-SQL commands that query data objects but do not manipulate the data, because the SQL Endpoint is a read-only copy.
Note: Mirrors can be stopped at any given time.

Review the doc to understand the solution: Microsoft Fabric mirrored databases from Azure Cosmos DB (Preview) – Microsoft Fabric | Microsoft Learn

Step by Step:

1 – Inside Fabric – Choose Mirror Azure Cosmos DB.

As Fig 9 – Cosmos DB option illustrates:

Fig 9 – Cosmos DB option

2 – Name the Mirror that will be created, as Fig 10 – Name mirror, shows:

Fig 10 – Name mirror

3 – Choose Cosmos DB for NoSQL, currently in preview, as Fig 11 – Cosmos option shows:

Fig 11 – Cosmos option

4 – In the Azure Portal, open your Cosmos DB for NoSQL account, then copy the URI and paste it into a notepad, as Fig 12 – URI shows. Mine, for example, is https://lilem.documents.azure.com:443/. This is the same step I did for Vector Search.

Fig 12 – URI

5 – Look for Keys inside your Cosmos DB account and copy the Primary Key into the notepad, as Fig 13 – Keys shows:

Fig 13 – Keys

6 – Use the information you copied earlier in step 4 and step 5 and input it into their respective fields as shown in Figure 14 – Mirror Fields.

Fig 14 – Mirror Fields

7 – Next connect -> Select the database and start to mirror:

Fig 15-Mirror

There are some preliminary steps missing in the mirror configuration. The error message indicates: “The database cannot be mirrored to Fabric due to the following error: Continuous backup must be enabled before you mirror an Azure Cosmos DB database to Fabric. Please enable 7-day or 30-day continuous backup on your Azure Cosmos DB account from the Azure portal.”
Therefore, before proceeding with the mirror setup, ensure that continuous backup is enabled on your Azure Cosmos DB account for either 7-day or 30-day retention period via the Azure portal.

According to the docs (Microsoft Fabric mirrored databases from Azure Cosmos DB (Preview) – Microsoft Fabric | Microsoft Learn): “When you enable mirroring on your Azure Cosmos DB database, inserts, update, and delete operations on your online transaction processing (OLTP) data continuously replicates into Fabric OneLake for analytics consumption. The continuous backup feature is a prerequisite for mirroring.”

So, let’s fix it!

Reopen the Azure Portal for Cosmos DB, locate the database you intend to mirror, and navigate to the Backup and Restore section. Select the continuous backup option, as indicated in the message. Refer to Figure 16 – Continuous, which illustrates this configuration.

Fig 16 – Continuous

After making this change, please wait for a moment. You’ll notice that the Point in Time Restore option mentioned in the documentation (Migrate an Azure Cosmos DB account from periodic to continuous backup mode | Microsoft Learn) will become available. If you select this option, you’ll see a message stating, “The Backup Policy is migrating,” as shown in Figure 17 – Policy. Hence, while the migration is in progress, please wait until it’s completed before attempting to restart the mirror in Fabric.

Fig 17 – Policy

Once the backup policy has finished migrating, you can go back to Fabric and click the Mirror button, as Fig 18 – Mirror shows:

Fig 18 – Mirror

8 – Now you can query Cosmos DB for NoSQL from the SQL Endpoint, as Fig 19 – CosmosSQL shows:

Fig 19 – CosmosSQL

9 – You can even open the mirrored Cosmos DB data from SSMS by connecting to the SQL Endpoint. Copy the SQL connection string, as Fig 20 – Endpoint shows, and open SSMS, as Fig 21 – SSMS shows:

Fig 20-Endpoint
Fig 21 -SSMS

And, as mentioned before, shortcuts can be created from a Lakehouse in a different workspace (given the right permissions) to access the mirrored Cosmos DB data, as Fig 22 – Shortcuts shows:

Fig 22 – Shortcuts

Summary:


This post explored options for integrating Cosmos DB for NoSQL with Microsoft Fabric. We configured Azure Cosmos DB for NoSQL with the Vector Search capabilities of the AI Search service, running the Python code through Microsoft Fabric’s Lakehouse, and we also explored mirroring Cosmos DB into Microsoft Fabric. This is just one of many possibilities within Fabric, and it is particularly useful if your data resides in Cosmos DB and you want to use Fabric’s integration capabilities for search or data mirroring.
