Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 1 min | 159 words

We are looking to expand the number of top tier drivers and build a RavenDB client for PHP.

We currently have 1st tier clients for .NET, JVM, Python, Go, C++ and Node.JS. There are also 2nd tier clients for Ruby, PHP, R and a bunch of other environments.

We want to build a fully fledged client for RavenDB for PHP customers and I have had great success in the past in reaching awesome talent through this blog.

Chris Kowalczyk built our Go client and detailed the process in a great blog post.

The project will involve building the RavenDB client for PHP, documenting it as well as building a small sample app or two.

If you are interested, or know someone who would be, I would be very happy if you sent the details to jobs@ravendb.net.

time to read 1 min | 200 words

RavenDB has the concept of metadata, which is widely used for many reasons. One of the ways we use the metadata is to provide additional context about a document. This is useful for both the user and RavenDB. For example, when you query, RavenDB will store the index score (how well a particular document matched the query) in the metadata. You can access the document metadata using:

This works great as long as we are dealing with documents. However, when you query a Map/Reduce index, you aren’t going to get a document back. You are going to get a projection over the aggregated information. It turns out that in this case, there is no way to get the metadata of the instance. To be more exact, the metadata isn’t managed by RavenDB here, so RavenDB doesn’t keep it around for the GetMetadataFor() call.

However, you can just ask the metadata to be serialized with the rest of the projection’s data, like so:

In other words, we embed the metadata directly into the projection. Now, when we query, we can get the data directly:


time to read 5 min | 894 words

In this post, I’m going to walk you through the process of setting up a machine learning pipeline within RavenDB. The first thing to ask, of course, is: what am I talking about?

RavenDB is a database (it is right there in the name), so what does this have to do with machine learning? And no, I’m not talking about pushing exported data from RavenDB into your model. I’m talking about actual integration.

Consider the following scenario. We have users with emails. We want to add additional information about them, so we assign their Gravatar image as their default profile picture. Here is mine:

On the other hand, we have this one:

In addition to simply using the Gravatar to personalize the profile, we can actually analyze the picture to derive some information about the user. For example, in a non-professional context, I like to use my dog’s picture as my profile picture.
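As an aside, fetching a Gravatar is just a matter of hashing the email address. Here is a minimal Python sketch of Gravatar’s documented avatar URL scheme (the email address is made up):

```python
import hashlib

def gravatar_url(email: str, size: int = 128) -> str:
    """Gravatar's avatar URL: the MD5 hex digest of the trimmed, lowercased email."""
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    return f"https://www.gravatar.com/avatar/{digest}?s={size}"

print(gravatar_url("someone@example.com"))
```

Note that the normalization (trim, then lowercase) happens before hashing, so differently cased spellings of the same address resolve to the same image.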

Let’s see what use we can make of this with RavenDB, shall we?


Here you can see a simple document, with the profile picture stored in an attachment. So far, this is fairly standard fare for RavenDB. Where do we get to use machine learning? The answer is very quickly. I’m going to define the Employees/Tags index, like this:

This requires a nightly build of RavenDB 5.1, where we have support for indexing attachments. The idea is that we are going to make use of that to apply machine learning to classify the profile photo.

You’ll note that we pass the photo’s stream to ImageClassifier.Classify(), but what is that? The answer is that RavenDB itself has no idea about image classification and other things of this nature. What it does have is an easy way for you to extend RavenDB. We are going to use Additional Sources to make this happen:


The actual code is as simple as I could make it and is mostly concerned with setting up the prediction engine and outputting the results:

In order to make it work, we have to copy the following files to RavenDB’s directory. This allows the ImageClassifier to compile against the ML.Net code. The usual recommendation, that the ML.Net version you deploy should match the CPU you are running on, applies, of course.

If you’ll look closely at the code in ImageClassifier, you’ll note that we are actually loading the model from a file via:

mlContext.Model.Load("model.zip", out _);

This model is meant to be trained offline by whatever system would work for you, the idea is that in the end, you just deploy the trained model as part of your assets and you can start applying machine learning as part of your indexing.

That brings us to the final piece of the puzzle: the output of this index. We output the data as indexed fields and give the classification for each. The Tag field in the index is going to contain all the matches that are above 75%, and we use dynamic fields to record all the other viable matches.
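The shape of that output can be sketched in plain Python. The function name and the sample scores below are mine, standing in for the model’s actual predictions:

```python
def to_index_fields(scores, threshold=0.75):
    """Turn raw label scores into a Tag list plus per-label dynamic fields.

    Labels above the threshold go into Tag; every label also becomes its
    own field, so it can be queried directly (e.g. dog > 0.75).
    """
    fields = {label: score for label, score in scores.items()}
    fields["Tag"] = [label for label, score in scores.items() if score > threshold]
    return fields

fields = to_index_fields({"happy": 0.91, "dog": 0.82, "cat": 0.03})
print(fields["Tag"])  # labels above the 75% cutoff
```

The two representations serve different queries: the Tag field answers “which images are tagged happy”, while the per-label fields let you filter on the raw confidence values.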

That means that we can run queries such as:

from index 'Employees/Tags' where Tag in ('happy')

Insert your own dystopian queries here. You can also do a better drill down using something like:

from index 'Employees/Tags' where happy > 0.5 and dog > 0.75

The idea in this case is that we are able to filter the image by multiple tags and search for pictures of happy people with dogs. The capabilities that you get from this are enormous.

The really interesting thing here is that there isn’t much to it. We run the machine learning process once, at indexing time. Then we have the full query abilities at our hands, including some pretty sophisticated ones. Want to find dog owners in a particular region? Combine this with a spatial query. And whenever a user modifies their profile picture, we’ll automatically re-index the data and recompute the relevant tags.

I’m using what is probably the simplest possible option here, given that I consider myself very much a neophyte in this area. That also means that I’ve focused mostly on the pieces integrating RavenDB and ML.Net; it is possible (likely, even) that the ML.Net code isn’t optimal or idiomatic. The beautiful part is that it doesn’t matter: this is something you can easily change by modifying the ImageClassifier’s implementation, which is an extension, not part of RavenDB itself.

I would be very happy to hear from you about any additional scenarios you have in mind. Given the growing use of machine learning in the world right now, we are considering ways to allow you to utilize machine learning on your data with RavenDB.

This post required no code changes to RavenDB, which is really gratifying. I’m looking to see what features this would enable and what kind of support we should be bringing to the mainline product. Your feedback would be very welcome.

time to read 1 min | 109 words

I usually talk about RavenDB in the context of .NET, but we actually have quite a few additional clients. For today, I want to talk about the JVM client for RavenDB.

I decided to show some sample code using Kotlin, since the RavenDB client is applicable to all JVM languages. Here is what some basic code looks like:

As you can see, it is trivial to consume RavenDB using the client API. The new client fully supports the 5.0 release, and you can see in the same code that we are working with the new time series feature.

As usual, I would love any feedback you have to offer.

time to read 1 min | 115 words

YABT - Start of the Series

Alex Klaus has decided to take up the task of showing how to build a non-trivial application using RavenDB. The domain of choice is Yet Another Bug Tracker, mostly to be able to discuss the details of the implementation without having to explain the model and the business constraints.

The first two articles have already been published, with more to follow:

As usual, all feedback is welcome.

time to read 1 min | 76 words

On October 26, I’ll be giving a two-day workshop on RavenDB 5.0 as part of the NDC Conference.

In the workshop, I’m going to cover using RavenDB from scratch. We are going to explore how to utilize RavenDB, modeling and design decisions, and how to use RavenDB’s features to your best advantage. Topics include distribution and data management, application and system architecture, and much more.

I’m really looking forward to it, see you there.

time to read 6 min | 1004 words

Chicago publishes its taxi trips in an easy-to-consume format, so I decided to see what kind of information I can dig out of the data using RavenDB. Here is what the data looks like:


There are actually a lot more fields in the data, but I wanted to generate a more focused dataset to show off certain features. For that reason, I’m going to record the trips for each taxi, where for each trip I look at the start time, the duration, and the pickup and drop off locations. The data size is significant, with about 194 million trips recorded.

I converted the data into RavenDB’s time series, with a Location time series for each taxi’s location at a given point in time. You can see that the location is tagged with the type of event associated with it. The raw data has both pickup and drop off for each row, but I split it into two separate events.


The reason I did it this way is that we get a lot of questions about how to use RavenDB for doing… stuff with vehicle and location data. The Chicago taxi data is a good source for a non-trivial amount of real-world data, which is very nice to use.

Once we have all the data loaded in, we can see that there are 9,179 distinct taxis in the data set, with a varying number of events for each taxi. Here is one such scenario:


The taxi in question has six years of data and 6,545 pickup and dropoff events.

The question now is, what can we do with this data? What sort of questions can we answer?

Asking where a taxi is at a given point in time is easy enough:

from Taxis where id() == 'taxis/174' select timeseries(from Location between '2015-01-01' and '2015-01-10')

And gives us the results:


But asking a question about a single taxi isn’t that interesting, can we do things across all taxis?

Let’s think about what kind of questions we can ask:

  • Generate a heat map of pickup and drop off locations over time?
  • Find out which taxis were at a given location at a given time?
  • Find taxis that were near a particular taxi on a given day?

To answer all of these questions, we have to aggregate data from multiple time series. We can do that using a Map/Reduce index on the time series data. Here is what this looks like:

We scan through all the location events for the taxis and group them on an hourly basis. We also generate a GeoHash code for the location of the taxi at that time. This is a GeoHash with a length of 9, which represents an accuracy of roughly 2.4 meters.

We then aggregate all the taxis that were in the same GeoHash at the same hour into a single entry. To make it easier for ourselves, we’ll also use a spatial field (computed from the GeoHash) to allow for spatial queries.

The idea is that we want to aggregate the taxis’ location information over both space and time. It is easy to go from a more accurate timestamp to a lower granularity one (zeroing the minutes and seconds of a time). For the spatial location, we can use a GeoHash of a particular precision to do pretty much the same thing. Instead of having to deal with the individual points, we aggregate the taxis by decreasing the resolution we use to track the location.
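To make the aggregation key concrete, here is a plain Python sketch: a standard GeoHash encoder plus truncating the timestamp to the hour. The function names are mine, not the index’s actual code:

```python
from datetime import datetime

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, length: int = 9) -> str:
    """Encode a latitude/longitude pair as a GeoHash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, even = 0, 0, True  # even-numbered bits encode longitude
    result = []
    while len(result) < length:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:
            bits |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            result.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(result)

def bucket_key(timestamp: datetime, lat: float, lon: float):
    """Aggregation key: the event's hour plus its GeoHash cell."""
    return timestamp.replace(minute=0, second=0, microsecond=0), geohash(lat, lon)

print(bucket_key(datetime(2019, 12, 1, 14, 35, 2), 41.88, -87.63))
```

Every location event for every taxi maps to one such key, so two taxis that were in the same cell during the same hour end up in the same Map/Reduce group.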

The GeoHash code isn’t part of RavenDB. It is provided as an additional source to the index, and can be seen in full at the following link. With this index in place, we are ready to start answering all sorts of interesting questions. Since the data is from Chicago, I decided to look at the map and see if I could find anything interesting there.

I created the following shape on a map:


This is the textual representation of the shape using Well Known Text: POLYGON((-87.74606191713963 41.91097449402647,-87.66915762026463 41.910463501644806,-87.65748464663181 41.89359845829678,-87.64924490053806 41.89002045220879,-87.645811672999 41.878262735374236,-87.74194204409275 41.874683870355824,-87.74606191713963 41.91097449402647)).

And now I can query over the data to find the taxis that were in that particular area on Dec 1st, 2019:


And here are the results of this query:


You can see that we have a very nice way to see which taxis were at each location at a given time. We can also use the same results to paint a heat map over time, counting the number of taxis in a particular location.

To put this into (sadly) modern terms, we can use this to track people who were near a particular person, to figure out whether they might be at risk of getting sick due to being near someone who was ill.

In order to answer this question, we need to take two steps. First, we query for the locations of a particular taxi over a time period. We already saw how we can do that. Then we find all the taxis that were in those locations at the right times. That gives us the intersection of taxis that were in the same place as the initial taxi, and from there we can send the plague police.
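The two-step lookup amounts to a set intersection over the (hour, GeoHash) buckets. Here is a plain Python sketch, with made-up bucket data standing in for the index results:

```python
from collections import defaultdict

# (hour, geohash) bucket -> taxis seen in that cell during that hour (made-up data)
buckets = defaultdict(set)
events = [
    ("taxis/174", ("2019-12-01T14", "dp3wjztvq")),
    ("taxis/201", ("2019-12-01T14", "dp3wjztvq")),
    ("taxis/305", ("2019-12-01T15", "dp3wnb6uk")),
]
for taxi, key in events:
    buckets[key].add(taxi)

def nearby(taxi_id: str) -> set:
    """All taxis that shared an (hour, GeoHash) bucket with taxi_id."""
    shared = set()
    for members in buckets.values():
        if taxi_id in members:
            shared |= members
    return shared - {taxi_id}

print(nearby("taxis/174"))  # whoever shared a cell and an hour with taxis/174
```

In RavenDB, both steps are queries against the Map/Reduce index; the sketch just shows the set logic involved.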

time to read 2 min | 300 words

I recently got an interesting question about how RavenDB is processing certain types of queries. The customer in question had a document with a custom analyzer and couldn’t figure out why certain queries didn’t work.

For the purpose of the discussion, let’s consider the following analyzer:

In other words, when using this analyzer, we’ll have the following transformations:

  • “Singing avocadoes” – will be: “sing”, “avocadoes”
  • “Sterling silver” – will be: “ster”, “silver”
  • “Singularity Trailer” – will be: “singularity”, “trailer”

As a reminder, the output is used in an inverted index, which gives us the ability to look up a term and find all the documents containing that term.

An analyzer is applied on the text that is being indexed, but also on the queries. In other words, because during indexing I changed “singing” to “sing”, I also need to do the same for the query. Otherwise a query for “singing voice” will have no results, even if the “singing” term was in the original data.
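That symmetry can be sketched in plain Python. The trailing-“ing” rule below is my stand-in for the custom analyzer, not its actual logic; what matters is that the same function runs at indexing time and at query time:

```python
def analyze(text: str) -> list:
    """Toy analyzer: lowercase, split, and strip a trailing 'ing'.

    A stand-in for a real analyzer; the point is that the SAME
    transformation is applied when indexing and when querying.
    """
    terms = []
    for word in text.lower().split():
        if word.endswith("ing") and len(word) > 5:
            word = word[:-3]
        terms.append(word)
    return terms

# Indexing: build the inverted index from analyzed terms.
index = {}
for doc_id, text in [("docs/1", "Singing avocadoes")]:
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

# Querying: the query text goes through the same analyzer, so
# "singing voice" finds the document even though it stored "sing".
matches = set()
for term in analyze("singing voice"):
    matches |= index.get(term, set())
print(matches)
```

If the query side skipped the analyzer, the lookup for the raw term “singing” would miss the stored term “sing” and return nothing.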

The rules change when we do a prefix search, though. Consider the following query:

What should we be searching on here? Remember, this is using an analyzer, but we are also doing a prefix search. Let’s consider our options. If we pass this through the analyzer, the query will change its meaning. Instead of searching for terms starting with “sing”, we’ll search for terms starting with “s”.

That will give us results for “Sterling Silver”, which is probably not expected. In this case, by the way, I’m actually looking for the term “singularity”, and processing the term further would prevent that.

For that reason, RavenDB assumes that queries using wildcard searches are not subject to an analyzer and will not process them with one. The reasoning is simple: by using a wildcard, you are quite explicitly stating that this is not a real term.
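Continuing the plain Python sketch (with the same stand-in analyzer logic, not RavenDB’s actual code), a prefix query matches the raw prefix against the stored terms as-is:

```python
def analyze_term(word: str) -> str:
    """Toy analyzer for single terms: lowercase and strip a trailing 'ing'."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
    return word

# Terms as they were stored in the inverted index (post-analysis, made up).
stored_terms = {"sing", "ster", "silver", "singularity", "trailer"}

def prefix_search(query: str) -> set:
    """A wildcard query skips the analyzer: match the raw prefix as-is."""
    prefix = query.rstrip("*").lower()
    return {t for t in stored_terms if t.startswith(prefix)}

# Skipping the analyzer keeps the intent: 'sing*' matches 'sing' and
# 'singularity', without dragging in unrelated terms like 'ster' or 'silver'.
print(prefix_search("sing*"))
```

Running the wildcard term through the analyzer first would shorten the prefix and widen the match far beyond what the user asked for.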
