RavenDB 5.1 Features: Searching in Office documents
For a long time, whenever I tried to explain how RavenDB is a document database, people immediately assumed that I’m talking about Office documents (Word, Excel, etc) and that I’m building a SharePoint clone. Explaining that documents are a different way to model data has been a repeated chore, and we still get prospects asking about RavenDB’s office integration.
As an aside, I’ll be doing a Webinar on Tuesday talking about Data Modeling with RavenDB.
RavenDB 5.1 has a new feature, Nuget integration, which allows you to integrate Nuget packages into RavenDB’s indexes. Turns out, it takes very little code to allow RavenDB to search inside Office documents. Let’s consider a legal case system, where we track the progression of legal cases, the work done on them, billing, etc. As you can imagine, the amount of Word and Excel documents that are involved is… massive. Making sense of all of that information can be pretty hard. Here is how you can help your users with the use of RavenDB.
Here is the Filing/Search index definition:
As you can see, we are using two new features in RavenDB 5.1:
- The LoadAttachment() / GetContentAsStream() methods, which expose the attachments to the indexing engine.
- The Office.GetWordText() / Office.GetExcelText() methods, which extract the text from the relevant documents to be indexed by RavenDB.
Aside from that, this is a fairly standard index, we mark the Documents field as full text search (in red in the image below). There is also the yellow markers in the image, what are they for?
No, RavenDB didn’t integrate directly with Office, instead, we make use of the new Additional Assemblies (and the existing Additional Sources) to bring you this functionality. Let’s see how that works, shall we?
We tell RavenDB that for this index, we want to pull the NuGet package DocumentsFormat.OpenXml. And it will just happen, which means that we have the full power of this package in your indexes. In fact, this is exactly what we do. Here is the content of the Additional Sources:
What this code does is use the DocumetnsFormat.OpenXml package to read the data inside the provided attachments. We extract the text from them and then provide it to the RavenDB indexing engine, which enable us to do full text search on the content of attachments.
In effect, within the space of a single blog post, you can turn your RavenDB instance to a document indexing system.
Here is how we can query the data:
And the result is here:
And here is the relevant term inside the Office documents:
As you can imagine, this is a very exciting capability to add to RavenDB. There is much more that you can do with the ability to integrate such capabilities directly into your database.
Comments
Oren - is this feature available @ https://cloud.ravendb.net/ ?
This will definitely make it easier for users of RavenDB cloud as well. Looking forward to this!
This would be super interesting hooked into Office 365 SharePoint. Especially for law firms. Fewer and fewer people are using on prem office and thus direct access to the documents isn't necessarily available. Being able to hook up to the change stream from SharePoint online and have the indexes update in ravendb on file change would be huge and allow all kinds of great integrated scenarios.
PK,
Yes, this will be available on the cloud once the 5.1 is rolled out (very soon)
Steve,
Yes, the cloud integration was part of what we were thinking. It simplify deployment of complex things enormously
James,
That sounds interesting. Is this something that can be hacked around to show viable in a short amount of time?
I have code that does similar. It's way too complex because MS doesn't give a really great feed, but basically it uses Microsoft Flow to capture the changes and brings the notifications down to a reasonable number from the one per key press you'd normally get. That sends it to Azure Event Grid that fires off a webhook to our api. From there you can just use the standard Azure openId authentication against your Tennant to pull the document contents as binary and load it into the OpenXML stuff for processing as above. It could be short circuited with the graph API notifications (which we found flaky but maybe you'd have better luck) or using MS Flow to send directly to a webhook.
It would be a pretty killer library to have. I'm sure the same could be done for google docs, box, dropbox etc. and if the root library was shared, they could just be providers for the notifications and subscriptions.
Does this run in the Raven.Server process? If so what kind of isolation / sandboxing would be in place for this kind of feature to protect against failures/memory leaks/security issues in the imported package?
If isolation was a concern or requirement, it would be interesting to see if the subscriptions API could somehow be adapted for "external indexing" (where a subscription process running externally, even on a different machine, could somehow contribute to index generation).
Just a thought.
Trev,
This runs inside the same process. There is no isolation from the rest of the system. The package, just like any index, can do whatever it wants. If memory is leaked, that will have an impact. Note that this is a feature that only DB Admin can define such indexes.
And you are correct, for isolation, you would use subscriptions to process the data and then provide the results back to RavenDB. A good example of that might be for OCR processing or anything that may take significant amount of time / computation.
although i'm pretty hyped about the office search feature, i'm even more hyped about the load-dlls-into-server-with-nuget!
i'm using ravendb for 8 years, a major PITA is still the dll deployment. hope to see that feature also for RavenMethods so it will be much easier for example to deploy extensions like RavenDB-NodaTime. https://datalust.co/seq does this in the same way
Tobias,
You can absolutely manage this in this manner. You can add Nuget for that package and specify the additional using statements, and then just use them.
This looks great!
That said, will there be more documentation around how RavenDB exposes IEnumerable<T> or specific documents (especially around nullability, etc).
Namely, if we want to pass a document (or sequence of documents) into something in NuGet, how would we go about doing that?
Nicholas,
The documents that you access inside the indexing function are actually a
dynamic
, as your types don't exists on the server side. You can convert that to a particular type as needed, of course.Right, but it has to be backed by _something_, right? In the cases where we want to try to be as optimal as possible, having access to whatever is backing it (with the understanding it's some sort of property bag) could aid in that (instead of making calls to something
dynamic
multiple times).Nicholas ,
This is basically backed by our internal implementation of the document. You can get to it, but in most cases, I would recommend against that. In some cases, actually writing the index in JS was a performance improvement, FWIW.
James Hancock: "This would be super interesting hooked into Office 365 SharePoint. Especially for law firms."
Speaking of "law firms" or even information governance companies/IT: Implementing document content extraction directly in a RavenDB index depends on how many documents and at what level of content extraction you want to do. Then there is a consideration of a malformed document bringing down your RavenDB server (maybe RavenDB is resilient to stack overflow exceptions while extracting/indexing text from a malformed document? I don't know the answer). Currently, you can use Microsoft IFIlter implementations in SQL database or even in RavenDB but with IFILTER it is hard to get at metadata and also process attachments in the documents via the IFILTER interface. And then even before calling the IFILTER interface one needs to reliably identify the document file format. Also, IFILTER crashes often. If you want a scalable system, you do not want your database/document store doing text extraction/processing and also doing document bulk inserts and indexing.
However, RavenDB is a very powerful document database for law firms, eDiscovery (ECA), information governance industries. We are in the process of writing a GitHub case study (still under development) using RavenDB for ECA: https://github.com/dotfurther/OpenDiscoverPlatformCaseStudy
Just in case you weren't aware... I suspect your GetWordText method will either not extract text out of tables and shapes or extract it incorrectly. eg join text together without spaces.
Phil,
I'm not actually trying to write the proper way to extract all text from the document. My intent was to show how this is possible to do so.
From there, you are free to implement the actual text extraction as you see fit.
I tried your exact example, but it does seem to have issues loading some assemblies...
2020-12-07T07:13:09.5837410Z, 10, Operations, Server, Raven.Server.Documents.Indexes.Static.IndexCompiler, Could not load additional assembly from 'C:\dev\ravendb\Server\mscordbi.dll'., EXCEPTION: System.BadImageFormatException: Could not load file or assembly 'C:\dev\ravendb\Server\mscordbi.dll'. The module was expected to contain an assembly manifest. File name: 'C:\dev\ravendb\Server\mscordbi.dll' at System.Reflection.AssemblyName.nGetFileInformation(String s) at System.Reflection.AssemblyName.GetAssemblyName(String assemblyFile) at System.Runtime.Loader.AssemblyLoadContext.GetAssemblyName(String assemblyPath) at Raven.Server.Documents.Indexes.Static.IndexCompiler.DiscoverAdditionalAssemblies() in C:\Builds\RavenDB-Stable-5.1\51007\src\Raven.Server\Documents\Indexes\Static\IndexCompiler.cs:line 73 2020-12-07T07:13:09.5844996Z, 10, Operations, Server, Raven.Server.Documents.Indexes.Static.IndexCompiler, Could not load additional assembly from 'C:\dev\ravendb\Server\mscorrc.dll'., EXCEPTION: System.BadImageFormatException: Could not load file or assembly 'C:\dev\ravendb\Server\mscorrc.dll'. The module was expected to contain an assembly manifest. File name: 'C:\dev\ravendb\Server\mscorrc.dll' at System.Reflection.AssemblyName.nGetFileInformation(String s) at System.Reflection.AssemblyName.GetAssemblyName(String assemblyFile) at System.Runtime.Loader.AssemblyLoadContext.GetAssemblyName(String assemblyPath) at Raven.Server.Documents.Indexes.Static.IndexCompiler.DiscoverAdditionalAssemblies() in C:\Builds\RavenDB-Stable-5.1\51007\src\Raven.Server\Documents\Indexes\Static\IndexCompiler.cs:line 73 2020-12-07T07:13:09.8090475Z, 10, Operations, Server, Raven.Server.Documents.Indexes.Static.IndexCompiler, Could not load additional assembly from 'C:\dev\ravendb\Server\sni.dll'., EXCEPTION: System.BadImageFormatException: Could not load file or assembly 'C:\dev\ravendb\Server\sni.dll'. The module was expected to contain an assembly manifest. File name: 'C:\dev\ravendb\Server\sni.dll' at System.Reflection.AssemblyName.nGetFileInformation(String s) at System.Reflection.AssemblyName.GetAssemblyName(String assemblyFile) at System.Runtime.Loader.AssemblyLoadContext.GetAssemblyName(String assemblyPath) at Raven.Server.Documents.Indexes.Static.IndexCompiler.DiscoverAdditionalAssemblies() in C:\Builds\RavenDB-Stable-5.1\51007\src\Raven.Server\Documents\Indexes\Static\IndexCompiler.cs:line 73 2020-12-07T07:13:10.2574424Z, 10, Operations, Server, Raven.Server.Documents.Indexes.Static.IndexCompiler, Could not load additional assembly from 'C:\dev\ravendb\Server\System.Private.CoreLib.dll'., EXCEPTION: System.IO.FileLoadException: Could not load file or assembly 'System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e'. at System.Runtime.Loader.AssemblyLoadContext.LoadFromPath(IntPtr ptrNativeAssemblyLoadContext, String ilPath, String niPath, ObjectHandleOnStack retAssembly) at System.Runtime.Loader.AssemblyLoadContext.LoadFromAssemblyPath(String assemblyPath) at System.Reflection.Assembly.LoadFile(String path) at Raven.Server.Documents.Indexes.Static.IndexCompiler.DiscoverAdditionalAssemblies() in C:\Builds\RavenDB-Stable-5.1\51007\src\Raven.Server\Documents\Indexes\Static\IndexCompiler.cs:line 78 2020-12-07T07:13:10.7162412Z, 10, Operations, Server, Raven.Server.Documents.Indexes.Static.IndexCompiler, Could not load additional assembly from 'C:\dev\ravendb\Server\ucrtbase.dll'., EXCEPTION: System.BadImageFormatException: Could not load file or assembly 'C:\dev\ravendb\Server\ucrtbase.dll'. The module was expected to contain an assembly manifest. File name: 'C:\dev\ravendb\Server\ucrtbase.dll' at System.Reflection.AssemblyName.nGetFileInformation(String s) at Raven.Server.Documents.Indexes.Static.IndexCompiler.DiscoverAdditionalAssemblies() in C:\Builds\RavenDB-Stable-5.1\51007\src\Raven.Server\Documents\Indexes\Static\IndexCompiler.cs:line 73
Stephan,
It looks like you are running on 32 bits machine? Is that possible? It might be that the underlying package doesn't have a binary for your architecture?
Comment preview