In this post, I'm going to walk you through the process of setting up a machine learning pipeline within RavenDB. The first thing to ask, of course, is what am I talking about?
RavenDB is a database (it is right there in the name), so what does this have to do with machine learning? And no, I'm not talking about pushing exported data from RavenDB into your model. I'm talking about actual integration.
Consider the following scenario. We have users with emails. We want to add additional information about them, so we assign their Gravatar image as their default profile picture. Here is mine:
On the other hand, we have this one:
In addition to simply using the Gravatar to personalize the profile, we can actually analyze the picture to derive some information about the user. For example, in a non-professional context, I like to use my dog's picture as my profile picture.
Let's see what use we can make of this with RavenDB, shall we?
Here you can see a simple document, with the profile picture stored in an attachment. So far, this is fairly standard fare for RavenDB. Where do we get to use machine learning? The answer is very quickly. I'm going to define the Employees/Tags index, like this:
This requires us to use a nightly build of RavenDB 5.1, where we have support for indexing attachments. The idea here is that we are going to make use of that to apply machine learning to classify the profile photo.
You'll note that we pass the photo's stream to ImageClassifier.Classify(), but what is that? The answer is that RavenDB itself has no idea about image classification and other things of this nature. What it does have is an easy way for you to extend RavenDB. We are going to use Additional Sources to make this happen:
The actual code is as simple as I could make it and is mostly concerned with setting up the prediction engine and outputting the results:
In order to make it work, we have to copy the following files to RavenDB's directory. This allows the ImageClassifier to compile against the ML.Net code. The usual recommendations about making sure that the ML.Net version you deploy best matches the CPU you are running on apply, of course.
If you look closely at the code in ImageClassifier, you'll note that we are actually loading the model from a file via:
mlContext.Model.Load("model.zip", out _);
This model is meant to be trained offline by whatever system works for you. The idea is that in the end, you just deploy the trained model as part of your assets and you can start applying machine learning as part of your indexing.
That brings us to the final piece of the puzzle: the output of this index. We output the data as indexed fields and give the classification for them. The Tag field in the index contains all the matches that score above 75%, and we use dynamic fields to record the scores of all the other viable matches.
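To make that split concrete, here is a minimal sketch in plain C#, with no RavenDB or ML.Net dependency. It is not the code used in the index; `TagSplitter`, `Split`, and the label-to-score dictionary shape are hypothetical names chosen for illustration, and it follows one reading of the rule above: labels above the 75% cut-off become tags, and the remaining labels keep their scores so they could be emitted as dynamic fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagSplitter
{
    // Assumption: mirrors the "above 75%" rule described in the text.
    public const double TagThreshold = 0.75;

    // Splits a label -> score map into confident tags and the remaining scores.
    public static (List<string> Tags, Dictionary<string, double> Scores) Split(
        IReadOnlyDictionary<string, double> classification)
    {
        // Labels strictly above the threshold, strongest match first.
        var tags = classification
            .Where(kvp => kvp.Value > TagThreshold)
            .OrderByDescending(kvp => kvp.Value)
            .Select(kvp => kvp.Key)
            .ToList();

        // Everything else keeps its score, one entry per label.
        var scores = classification
            .Where(kvp => kvp.Value <= TagThreshold)
            .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

        return (tags, scores);
    }
}
```

In an index, the first list would feed the Tag field and each entry of the second would become its own dynamic field. If you want range queries to work on a label regardless of whether it also made it into Tag, you would keep every score in the second collection instead.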
That means that we can run queries such as:
Insert your own dystopian queries here. You can also do a better drill down using something like:
The idea in this case is that we are able to filter the images by multiple tags and search for pictures of happy people with dogs. The capabilities that you get from this are enormous.
The really interesting thing here is that there isn't much to it. We run the machine learning process once, at indexing time. Then we have the full query abilities at our disposal, including some pretty sophisticated ones. Want to find dog owners in a particular region? Combine this with a spatial query. And whenever a user modifies their profile picture, we'll automatically re-index the data and recompute the relevant tags.
I'm using what are probably the simplest possible options here, given that I consider myself very much a neophyte in this area. That also means that I've focused mostly on the pieces of integrating RavenDB and ML.Net. It is possible (likely, even) that the ML.Net code isn't optimal or the idiomatic way to do things. The beautiful part about that is that it doesn't matter. This is something that you can easily change by modifying the ImageClassifier's implementation, which is an extension, not part of RavenDB itself.
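As a rough illustration of what such a multi-tag filter means, here is an in-memory stand-in written with plain LINQ. It does not touch RavenDB at all; `TagQuery`, `Entry`, `WithAllLabels`, and the minimum score passed in are hypothetical, and the sketch assumes each index entry exposes one score per label.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagQuery
{
    // Hypothetical stand-in for an index entry: a document id plus label -> score.
    public sealed record Entry(string Id, IReadOnlyDictionary<string, double> Scores);

    // Returns the ids of entries whose score for every requested label is at least minScore.
    public static List<string> WithAllLabels(
        IEnumerable<Entry> entries, double minScore, params string[] labels)
    {
        return entries
            .Where(e => labels.All(l => e.Scores.TryGetValue(l, out var s) && s >= minScore))
            .Select(e => e.Id)
            .ToList();
    }
}
```

Calling `TagQuery.WithAllLabels(entries, 0.5, "happy", "dog")` keeps only the entries that carry both labels at or above the chosen score. The real query does the same kind of filtering against the index fields, with the database doing the work instead of LINQ.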
I would be very happy to hear from you about any additional scenarios you have in mind. Given the growing use of machine learning in the world right now, we are considering ways to allow you to utilize machine learning on your data with RavenDB.
This post required no code changes to RavenDB, which is really gratifying. I'm looking to see what features this would enable and what kind of support we should be bringing to the mainline product. Your feedback would be very welcome.