Using analyzers in RavenDB indexing

Mar 08 2024

Using analyzers in RavenDB indexing

time to read 3 min | 566 words

We got an interesting question in the RavenDB Discussion:

We have Polo (shirts) products. Some customers search for Polo and others search for Polos. The term Polos exists in only a few of the descriptions and marketing info so the results are different.

Is there a way to automatically generate singular and plural forms of a term or would I have to explicitly add those?

What is actually requested here is to perform a process known as stemming. Turning a word into its root. That is a core concept in full-text search, and RavenDB allows you to make use of that.

The idea is that during indexing and queries, RavenDB will transform the search terms into a common stem and search on that. Let’s look at how this works, shall we?

The first step is to make an index named Products/Search with the following definition:

from p in docs.Products
select new { p.Name }

That is about as simple an index as you can get, but we still need to configure the indexing of the Name field on the index, like so:

You can see that I customized the Name field and marked it for full-text search using the SnowballAnalyzer, which is responsible for properly stemming the terms.

However, if you try to create this index, you’ll get an error. By default, RavenDB doesn’t include the SnowballAnalyzer, but that isn’t going to stop us. This is because RavenDB allows users to define custom analyzers.

In the database “Settings”, go to “Custom Analyzers”:

And there you can add a new analyzer. You can find the code for the analyzer in question in this Gist link.

You can also register analyzers by compiling them and placing the resulting DLLs in the RavenDB binaries directory. I find that having it as a single source file that we push to RavenDB in this manner is far cleaner.

Registering the analyzer via source means that you don’t need to worry about versioning, deploying to all the nodes in the cluster, or any such issues. It’s the responsibility of RavenDB to take care of this.

I produced the analyzer file by simply concatenating the relevant classes into a single file, basically creating a consolidated version containing everything required. That is usually done for C or C++ projects, but it is very useful in this case as well. Note that the analyzer in question must have a parameterless constructor. In this case, I just selected an English stemmer as the default one.

With the analyzer properly registered, we can create the index and start querying on it.

As you can see, we are able to find both plural and singular forms of the term we are searching for.

To make things even more interesting, this functionality is available with both Lucene and Corax indexes, as Corax is capable of consuming Lucene Analyzers.

The idea behind full-text search in RavenDB is that you have a full-blown indexing engine at your fingertips, but none of the complexity involved. At the same time, you can utilize advanced features without needing to move to another solution, everything is in a single box.

Tweet Share Share 0 comments

Tags:

Oren Eini

Oren Eini

CEO of RavenDB