In version 7.0, RavenDB introduced vector search, enabling semantic search on text and image embeddings. For example, searching for "Italian food" could return results like Mozzarella & Pasta. We have since focused our efforts on enhancing the usability and capability of this feature.
Vector search uses embeddings (AI models' representations of data) to search for meaning. Embeddings and vectors are powerful but complex. The Embeddings Generation feature simplifies their use.
RavenDB makes it trivial to add semantic search and AI capabilities to your system by natively integrating with AI models to generate embeddings from your data. RavenDB Studio's AI Hub allows you to connect to various models by simply specifying the model and the API key.
You can read more about this feature in this article or in the RavenDB docs. This post is about the story & reasoning behind this feature.
Cloudflare has a really good post explaining how embeddings work. TL;DR: they are a way for you to search for meaning. That is why ravioli shows up for Italian food: the model understands the association and places the two near each other in vector space. In this post, I’m assuming that you have at least some understanding of vectors.
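To make “near each other in vector space” a bit more concrete: proximity between embeddings is typically measured with something like cosine similarity. Here is a minimal, self-contained sketch with toy three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the values here are made up purely for illustration):

```csharp
using System;

// Cosine similarity: how "near" two vectors are in direction.
// ~1.0 means very similar meaning, ~0 means unrelated.
double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Toy "embeddings" - purely illustrative values.
var italianFood = new float[] { 0.9f, 0.1f, 0.2f };
var ravioli     = new float[] { 0.8f, 0.2f, 0.3f };
var lawnMower   = new float[] { 0.1f, 0.9f, 0.1f };

Console.WriteLine(CosineSimilarity(italianFood, ravioli));   // ~0.98 - close, so it matches
Console.WriteLine(CosineSimilarity(italianFood, lawnMower)); // ~0.24 - far, so it doesn't
```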
The Embeddings Generation feature in RavenDB goes beyond simply generating embeddings for your data. It addresses the complexities of updating embeddings when documents change, managing communication with external models, and handling rate limits.
The elevator pitch for this feature is:
RavenDB natively integrates with AI models to generate embeddings from your data, simplifying the integration of semantic search and AI capabilities into your system. The goal is to make using the AI model transparent for the application, allowing you to easily and quickly build advanced AI-integrated features without any hassle.
While this may sound like marketing jargon, the value of this feature becomes apparent when you experience the challenges of working without it.
To illustrate this, RavenDB Studio now includes an AI Hub.
You can create a connection to any of the following models:
Basically, the only thing you need to tell RavenDB is what model you want and the API key to use. Then, it is able to connect to the model.
The initial release of RavenDB 7.0 included bge-micro-v2 as an embedded model. After using that and trying to work with external models, it became clear that the difference in ease of use meant that we had to provide a good story around using embeddings.
There are some things I’m not willing to tolerate, and the current state of working with embeddings in most other databases is a travesty of complexity.
Next, we need to define an Embeddings Generation task, which looks like this:
Note that I’m not doing a walkthrough of how this works (see this article or the RavenDB docs for more details about that); I want to explain what we are doing here.
The screenshot shows how to create a task that generates embeddings from the Title field in the Articles collection. For a large text field, chunking options (including HTML stripping and markdown) allow splitting the text according to your configuration and generating multiple embeddings. RavenDB supports plain text, HTML, and markdown, covering the vast majority of text formats. You can simply point RavenDB at a field and it will generate embeddings, or you can use a script to specify the data for embeddings generation.
Quantization
Embeddings, which are multi-dimensional vectors, can have varying numbers of dimensions depending on the model. For example, RavenDB's embedded model (bge-micro-v2) has 384 dimensions, while OpenAI's text-embedding-3-large has 3,072 dimensions. Other common values for dimensions are 768 and 1,536.
Each dimension in the vector is represented by a 32-bit float, which indicates the position in that dimension. Consequently, a vector with 1,536 dimensions occupies 6KB of memory. Storing 10 million such vectors would require over 57GB of memory.
Although storing raw embeddings can be beneficial, quantization can significantly reduce memory usage at the cost of some accuracy. RavenDB supports both binary quantization (reducing a 6KB embedding to 192 bytes) and int8 quantization (reducing 6KB to 1.5KB). By using quantization, 57GB of data can be reduced to 1.7GB, with a generally acceptable loss of accuracy. Different quantization methods can be used to balance space savings and accuracy.
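To make the arithmetic concrete, here is a quick back-of-the-envelope sketch (plain C#, not a RavenDB API) using the numbers above:

```csharp
using System;

// Storage for 10 million embeddings of 1,536 dimensions each.
// Index overhead is ignored - this is just the raw vector data.
const long vectorCount = 10_000_000;
const long dimensions = 1_536;

long rawBytes    = vectorCount * dimensions * 4;  // 32-bit float per dimension -> 6 KB per vector
long int8Bytes   = vectorCount * dimensions * 1;  // int8 quantization -> 1.5 KB per vector
long binaryBytes = vectorCount * dimensions / 8;  // binary quantization, 1 bit per dimension -> 192 bytes per vector

Console.WriteLine($"float32: {rawBytes / Math.Pow(1024, 3):F1} GB");    // ~57.2 GB
Console.WriteLine($"int8:    {int8Bytes / Math.Pow(1024, 3):F1} GB");   // ~14.3 GB
Console.WriteLine($"binary:  {binaryBytes / Math.Pow(1024, 3):F1} GB"); // ~1.8 GB
```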
Caching
Generating embeddings is expensive. For example, using text-embedding-3-small from OpenAI costs $0.02 per 1 million tokens. While that sounds inexpensive, this blog post has over a thousand tokens so far and will likely reach 2,000 by the end. One of my recent blog posts had about 4,000 tokens. This means it costs roughly 2 cents per 500 blog posts, which adds up quickly once you have a significant amount of data.
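As a sketch, the same back-of-the-envelope estimate in code (the price and token counts are the ones quoted above):

```csharp
using System;

// text-embedding-3-small: $0.02 per 1 million tokens.
const double pricePerMillionTokens = 0.02;
const double tokensPerPost = 2_000;   // roughly what a post of this length adds up to
const double posts = 500;

double cost = posts * tokensPerPost / 1_000_000 * pricePerMillionTokens;
Console.WriteLine($"Embedding {posts} posts costs about ${cost:F2}"); // ~$0.02
```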
Another factor to consider is handling updates. If I update a blog post's text, a new embedding needs to be generated. However, if I only add a tag, a new embedding isn't needed. We need to be able to handle both scenarios easily and transparently.
Additionally, we need to consider how to handle user queries. As shown in the first image, sending the user's input directly to the model for embedding creates an excellent search experience. However, running embeddings for user queries incurs additional costs.
RavenDB's Embedding Generation feature addresses all these issues. When a document is updated, we intelligently cache the text and its associated embedding instead of blindly sending the text to the model to generate a new embedding each time. This means embeddings are readily available without worrying about updates, costs, or the complexity of interacting with the model.
Queries are also cached, so repeated queries never have to hit the model. This saves costs and allows RavenDB to answer queries faster.
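The underlying idea can be illustrated with a simple content-hash cache. This is only a conceptual sketch to show the principle, not RavenDB's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Conceptual sketch: only call the model when the source text actually changed.
var cache = new Dictionary<string, float[]>(); // hash of source text -> embedding

float[] GetEmbedding(string text, Func<string, float[]> callModel)
{
    var key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

    if (cache.TryGetValue(key, out var cached))
        return cached;               // unchanged or repeated text (including queries) -> no model call

    var embedding = callModel(text); // new or modified text -> exactly one model call
    cache[key] = embedding;
    return embedding;
}

// Example usage with a fake model call:
var fakeModel = (string t) => new float[] { t.Length, 0.1f, 0.2f };
var e1 = GetEmbedding("Hello world", fakeModel);
var e2 = GetEmbedding("Hello world", fakeModel); // served from the cache, no second model call
```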
Single vector store
The number of repeated values in a dataset also affects caching. Most datasets contain many repeated values. For example, a help desk system with canned responses doesn't need a separate embedding for each response. Even with caching, storing duplicate information wastes time and space. RavenDB addresses this by storing the embedding only once, no matter how many documents reference it, which saves significant space in most datasets.
What does this mean?
I mentioned earlier that this is a feature you can only appreciate when you contrast it with the way you'd work with other solutions, so let’s talk about a concrete example. We have a product catalog, and we want to use semantic search on it.
We define the following AI task:
It uses the open-ai connection string to generate embeddings from the Products’ Name field.
Here are some of the documents in my catalog:
The screenshots show all sorts of phones, and the question is how we can search through them in interesting ways using vector search.
For example, I want to search for Android phones. Note that there is no mention of Android in the catalog; we are going just by the names. Here is what I do:
$query = 'android'
from "Products"
where vector.search(
embedding.text(Name, ai.task('products-on-openai')),
$query
)
I’m asking RavenDB to use the existing products-on-openai task on the Name field and the provided user input. And the results are:
I can also invoke this from code, searching for a “mac”:
var products = session.Query<Products>()
    .VectorSearch(
        // embed the Name field using the existing AI task
        x => x.WithText("Name").UsingTask("products-on-openai"),
        // embed the user's search term with the same model
        factory => factory.ByText("Mac")
    ).ToList();
This query will result in the following output:
That matched my expectations, and it is easy, and it totally and utterly blows my mind. We aren’t searching for values or tags or even doing full-text search. We are searching for the semantic meaning of the data.
You can even search across languages. For example, take a look at this query:
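The query itself appears as a screenshot in the original post; as a rough, illustrative sketch using the same C# API as the previous example (the search term below is purely an illustration, not the one from the post):

```csharp
// Same task and field as before - only the user's input changes.
// "téléphone portable" is French for "mobile phone"; a multilingual embedding model
// can place it near the English product names in vector space.
var products = session.Query<Products>()
    .VectorSearch(
        x => x.WithText("Name").UsingTask("products-on-openai"),
        factory => factory.ByText("téléphone portable"))
    .ToList();
```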
This just works!
Here is a list of the things that I didn’t have to do:
- Generate the embeddings for the catalog
  - And ensure that they are up to date as I add, remove & update products
- Handle long texts and appropriate chunking
- Perform quantization to reduce storage costs
- Handle issues such as rate limits, model downtime (The GPUs at OpenAI are melting as I write this), and other “fun” states
- Create a vector search index
- Generate an embedding vector from the user’s input
  - See above for all the details we skip here
- Query the vector search index using the generated embedding
This allows you to focus directly on delivering solutions to your customers instead of dealing with the intricacies of AI models, embeddings, and vector search.
I asked Grok to show me what it would take to do the same thing in Python. Here is what it gave me. Compared to this script, the RavenDB solution handles all of the following for you:
- Managing data updates efficiently, including skipping model calls for unchanged data and regenerating embeddings when necessary.
- Batching requests to boost throughput.
- Generating embeddings concurrently to minimize latency.
- Caching results to prevent redundant model calls.
- Using a single store for embeddings to eliminate duplication.
- Caching and batching queries.
In short, Embeddings Generation is the sort of feature that allows you to integrate AI models into your application with ease.
Use it to spark joy in your users easily, quickly, and without any hassle.