Microsoft Fabric Updates Blog

Empowering Real-Time Searches: Vector Similarity Search with Eventhouse

In the world of AI & data analytics, vector databases are emerging as a powerful tool for managing complex and high-dimensional data.  

In this article, we will explore the concept of vector databases, the need for vector databases in data analytics, and how Eventhouse in Microsoft Fabric can power your Real-Time semantic searches.

What is a Vector Database? 

Vector databases store and manage data in the form of vectors, which are numerical arrays of data points. Vector databases allow manipulating and analyzing sets of vectors at scale using vector algebra and other advanced mathematical techniques. 

The use of vectors allows for more complex queries and analyses, as vectors can be compared and analyzed using advanced techniques such as vector similarity search, quantization and clustering. 

Need for Vector Databases 

Traditional databases are not well-suited for handling high-dimensional data, which is becoming increasingly common in data analytics. In contrast, vector databases are designed to handle high-dimensional data, such as text, images, and audio, by representing them as vectors. 

This makes vector databases particularly useful for tasks such as machine learning, natural language processing, and image recognition, where the goal is to identify patterns or similarities in large datasets. 

Vector Similarity Search 

Vector similarity is a measure of how different (or similar) two or more vectors are. Vector similarity search is a technique used to find similar vectors in a dataset.  

In vector similarity search, vectors are compared using a distance metric, such as Euclidean distance or cosine similarity. The closer two vectors are, the more similar they are. 
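To make the two distance metrics concrete, here is a minimal Python sketch (toy three-dimensional vectors, not real embeddings) showing Euclidean distance and cosine similarity computed by hand:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two vectors: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [1.0, 2.1, 2.9]   # close to v1
v3 = [-3.0, 0.5, 0.1]  # points in a very different direction

print(euclidean_distance(v1, v2))  # small value: v1 and v2 are close
print(cosine_similarity(v1, v2))   # near 1.0
print(cosine_similarity(v1, v3))   # much lower
```

Real embedding vectors have hundreds or thousands of dimensions, but the same formulas apply unchanged.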

Vector embeddings 

Embeddings are a common way of representing data in a vector format for use in vector databases. An embedding is a mathematical representation of a piece of data, such as a word, text document or an image, that is designed to capture its semantic meaning. 

Embeddings are created using algorithms that analyze the data and generate a set of numerical values that represent its key features. For example, an embedding for a word might represent its meaning, its context, and its relationship to other words. 

Let’s look at an example.  

The phrases below are represented as vectors after being embedded with a model.

(Image credits – OpenAI)

Embeddings that are numerically similar are also semantically similar. For example, as seen in the following chart, the embedding vector of “canine companions say” will be more similar to the embedding vector of “woof” than that of “meow.” 

The process of creating embeddings is straightforward: they can be created using standard Python packages (e.g., spaCy, sent2vec, Gensim), but Large Language Models (LLMs) generate the highest-quality embeddings for semantic text search. Thanks to OpenAI and other LLM providers, we can now use them easily: you simply send your text to an embedding model in Azure OpenAI, and it generates a vector representation. 

Eventhouse as a Vector Database 

At the core of Vector Similarity Search is the ability to store, index, and query vector data. 

Eventhouse provides a solution for handling and analyzing large volumes of data, particularly in scenarios requiring real-time analytics and exploration, making it an excellent choice for storing and searching vectors. 

Eventhouse supports a special data type called dynamic, which can store unstructured data such as arrays and property bags. The dynamic data type is a natural fit for storing vector values. You can further augment a vector value by storing metadata related to the original object as separate columns in your table.  

We have introduced a new encoding type, Vector16, designed for storing vectors of floating-point numbers in 16-bit precision (using Bfloat16 instead of the default 64 bits). It is highly recommended for storing ML vector embeddings, as it reduces storage requirements by a factor of four and accelerates vector processing functions, such as series_dot_product() and series_cosine_similarity(), by orders of magnitude.
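As a sketch of how this might look in KQL (the table name and columns here are hypothetical, chosen to match the demo later in this post), you would create a table with a dynamic column for the vector and then set its encoding policy to Vector16:

```kusto
// Hypothetical schema: page metadata as regular columns, embedding as dynamic
.create table WikipediaEmbeddings (doc_title:string, doc_url:string, embedding_title:dynamic)

// Store the embedding column in 16-bit (Bfloat16) precision
.alter column WikipediaEmbeddings.embedding_title policy encoding type='Vector16'
```

Setting the encoding policy before ingestion ensures the vectors are stored compactly from the start.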

Furthermore, we have added a new built-in function, series_cosine_similarity(), to perform vector similarity searches over the vectors stored in Eventhouse. 

Demo scenario

Semantic searches on top of Wikipedia pages.  
 

We will generate vectors for tens of thousands of Wikipedia pages by embedding them with an OpenAI model and storing the vectors in Eventhouse along with some metadata related to each page. 

Now, we want to search the wiki pages with natural-language queries to find the most relevant ones. We can achieve that with the following steps:  

  1. Create an embedding for the natural-language query using an OpenAI model (ensure you use the same model that was used to embed the original wiki pages; we will use text-embedding-ada-002). 
  2. OpenAI returns the embedding vector for the search term. 
  3. Use the series_cosine_similarity KQL function to calculate the similarities between the query embedding vector and those of the wiki pages. 
  4. Select the top "n" rows with the highest similarity to get the wiki pages that are most relevant to your search query. 
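The client-side logic of steps 3 and 4 can be sketched in plain Python (with toy three-dimensional vectors standing in for real 1536-dimensional text-embedding-ada-002 vectors, and the query embedding from steps 1 and 2 assumed to be already in hand; in the demo itself, the similarity scoring runs inside Eventhouse):

```python
import math

def cosine_similarity(a, b):
    # Step 3: compare the query vector with a stored page vector.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n_similar(query_embedding, pages, n=10):
    # pages: list of (title, url, embedding) tuples.
    scored = [(title, url, cosine_similarity(query_embedding, emb))
              for title, url, emb in pages]
    # Step 4: highest-similarity pages first, keep the top n.
    return sorted(scored, key=lambda row: row[2], reverse=True)[:n]

# Toy data standing in for real embedding vectors.
pages = [
    ("Worship", "https://simple.wikipedia.org/wiki/Worship", [0.9, 0.1, 0.0]),
    ("Cat", "https://simple.wikipedia.org/wiki/Cat", [0.0, 0.2, 0.9]),
]
query_embedding = [0.8, 0.2, 0.1]  # pretend output of steps 1 and 2
print(top_n_similar(query_embedding, pages, n=1))
```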

Let’s run some queries:  

// searched_text_embedding holds the embedding vector of the search query
WikipediaEmbeddings
| extend similarity = series_cosine_similarity(searched_text_embedding, embedding_title)
| top 10 by similarity desc
| project doc_title, doc_url, similarity

This query calculates similarity scores for thousands of vectors in the table within seconds and returns the top 10 results.

Search query 1:  places where we worship

Result

| doc_title | doc_url | similarity |
| --- | --- | --- |
| Worship | https://simple.wikipedia.org/wiki/Worship | 0.8863 |
| Service of worship | https://simple.wikipedia.org/wiki/Service_of_worship | 0.8808 |
| Christian worship | https://simple.wikipedia.org/wiki/Christian_worship | 0.8714 |
| Shrine | https://simple.wikipedia.org/wiki/Shrine | 0.8612 |
| Church (building) | https://simple.wikipedia.org/wiki/Church_(building) | 0.8561 |
| Congregation | https://simple.wikipedia.org/wiki/Congregation | 0.8461 |
| Church music | https://simple.wikipedia.org/wiki/Church_music | 0.8439 |
| Chapel | https://simple.wikipedia.org/wiki/Chapel | 0.8418 |
| Cathedral | https://simple.wikipedia.org/wiki/Cathedral | 0.8372 |
| Altar | https://simple.wikipedia.org/wiki/Altar | 0.8354 |

Getting started:

If you’d like to try this demo, head to the azure_kusto_vector GitHub repository and follow the instructions.

The repo includes a Python notebook that will allow you to:   

  1. Download precomputed embeddings created by OpenAI API. 
  2. Store the embeddings in Eventhouse, making sure to use the Vector16 encoding for the column.
  3. Convert raw text query to an embedding with OpenAI API. 
  4. Use Eventhouse to perform cosine similarity search on the stored embeddings. 

We look forward to your feedback and all the exciting things you will build with vectors & Fabric.

Post Author(s):

Anshul Sharma – Principal Product Manager, Real-Time Intelligence, Microsoft
Adi Eldar – Principal Data Scientist, Real-Time Intelligence, Microsoft
