RavenDB 5.1 Features: Searching in Office documents

For a long time, whenever I tried to explain how RavenDB is a document database, people immediately assumed that I’m talking about Office documents (Word, Excel, etc.) and that I’m building a SharePoint clone. Explaining that documents are a different way to model data has been a repeated chore, and we still get prospects asking about RavenDB’s Office integration.

As an aside, I’ll be doing a Webinar on Tuesday talking about Data Modeling with RavenDB.

RavenDB 5.1 has a new feature, NuGet integration, which allows you to use NuGet packages inside RavenDB’s indexes. Turns out, it takes very little code to allow RavenDB to search inside Office documents. Let’s consider a legal case system, where we track the progression of legal cases, the work done on them, billing, etc. As you can imagine, the number of Word and Excel documents involved is… massive. Making sense of all of that information can be pretty hard. Here is how you can help your users with the use of RavenDB.

Here is the Filing/Search index definition:

As you can see, we are using two new features in RavenDB 5.1:

  • The LoadAttachment() / GetContentAsStream() methods, which expose the attachments to the indexing engine.
  • The Office.GetWordText() / Office.GetExcelText() methods, which extract the text from the relevant documents to be indexed by RavenDB.

Aside from that, this is a fairly standard index. We mark the Documents field as full text search (in red in the image below). There are also yellow markers in the image. What are they for?

image

No, RavenDB didn’t integrate directly with Office. Instead, we make use of the new Additional Assemblies (and the existing Additional Sources) to bring you this functionality. Let’s see how that works, shall we?

image

We tell RavenDB that for this index, we want to pull the NuGet package DocumentFormat.OpenXml. And it will just happen, which means that you have the full power of this package in your indexes. In fact, this is exactly what we do. Here is the content of the Additional Sources:

What this code does is use the DocumentFormat.OpenXml package to read the data inside the provided attachments. We extract the text from them and then provide it to the RavenDB indexing engine, which enables us to do full text search on the content of attachments.
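To give a feel for what this kind of text extraction involves, here is a rough Python sketch. It is not the Additional Sources code from the index (that code is C# and uses the DocumentFormat.OpenXml package). It leans on the fact that .docx and .xlsx files are zip packages containing XML, and it uses only the standard library. The function names get_word_text and get_excel_text are hypothetical analogues of Office.GetWordText() and Office.GetExcelText(). The sketch assumes the conventional part names word/document.xml and xl/sharedStrings.xml, and it ignores numeric cells, inline strings, headers and footers.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the Office Open XML formats.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def get_word_text(stream):
    """Return the text of a .docx stream, one line per paragraph.

    Assumes the main part lives at the conventional word/document.xml path.
    """
    with zipfile.ZipFile(stream) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def get_excel_text(stream):
    """Return the shared strings of an .xlsx stream, one per line.

    Only the shared string table is read; numbers and inline strings
    are skipped in this sketch.
    """
    with zipfile.ZipFile(stream) as package:
        # A workbook without string cells may have no shared string table.
        if "xl/sharedStrings.xml" not in package.namelist():
            return ""
        root = ET.fromstring(package.read("xl/sharedStrings.xml"))
    strings = []
    for item in root.iter(f"{{{S_NS}}}si"):
        # Rich text items split their content across several <t> elements.
        text = "".join(t.text or "" for t in item.iter(f"{{{S_NS}}}t"))
        if text:
            strings.append(text)
    return "\n".join(strings)
```

Both functions take any seekable binary stream, so the same code works for a file opened with open(path, "rb") or for attachment bytes wrapped in io.BytesIO.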

In effect, within the space of a single blog post, you can turn your RavenDB instance into a document indexing system.
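If you want a mental model of what full text search over the extracted text amounts to, the toy Python sketch below may help. It is not how RavenDB or its indexing engine is implemented. It only shows the general idea of breaking text into terms and mapping each term back to the documents that contain it. The document ids (filings/1 and so on), the sample text, and the choice to match documents containing any of the query terms are all assumptions made for the illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Split text into lowercase terms (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain any term of the query."""
    matches = set()
    for term in analyze(query):
        matches |= index.get(term, set())
    return matches


# Hypothetical extracted attachment text, keyed by hypothetical document id.
documents = {
    "filings/1": "Motion to dismiss",
    "filings/2": "Billable hours for the motion",
    "filings/3": "Invoice",
}
index = build_index(documents)
print(sorted(search(index, "MOTION")))  # prints ['filings/1', 'filings/2']
```

The lookup is case-insensitive because both the indexed text and the query go through the same analyze() function, which is the property that makes a full text field behave differently from an exact-match field.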

Here is how we can query the data:

image

And the result is here:

image

And here is the relevant term inside the Office documents:

image

As you can imagine, this is a very exciting capability to add to RavenDB. There is much more that you can do with the ability to integrate such capabilities directly into your database.