Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

Posts: 7,059 | Comments: 49,783

time to read 1 min | 76 words

On October 26, I’ll be giving a two-day workshop on RavenDB 5.0 as part of the NDC Conference.

In the workshop, I’m going to talk about using RavenDB from scratch. We are going to explore how to utilize RavenDB, modeling and design decisions, and how to use RavenDB’s features to your best advantage. Topics include distribution and data management, application and system architecture, and much more.

I’m really looking forward to it, see you there.

time to read 6 min | 1004 words

Chicago publishes its taxi trips in an easy-to-consume format, so I decided to see what kind of information I can dig out of the data using RavenDB. Here is what the data looks like:

image

There are actually a lot more fields in the data, but I wanted to generate a more focused dataset to show off certain features. For that reason, I’m going to record the trips for each taxi; for each trip, I’m going to look at the start time, the duration, and the pickup and drop off locations. The data’s size is significant, with about 194 million trips recorded.

I converted the data into RavenDB’s time series, with a Location time series for each taxi’s location at a given point in time. You can see that the location is tagged with the type of event associated with it. The raw data has both pickup and drop off for each row, but I split it into two separate events.

image

The reason I did it this way is that we get a lot of queries on how to use RavenDB for doing… stuff with vehicles and locations data. Chicago’s taxi data is a good source of a non-trivial amount of real-world data, which is very nice to use.

Once we have all the data loaded in, we can see that there are 9,179 distinct taxis in the data set, with a varying number of events for each taxi. Here is one such scenario:

image

The taxi in question has six years of data and 6,545 pickup and dropoff events.

The question now is, what can we do with this data? What sort of questions can we answer?

Asking where a taxi is at a given point in time is easy enough:

from Taxis where id() == 'taxis/174' select timeseries(from Location between '2015-01-01' and '2015-01-10')

And gives us the results:

image

But asking a question about a single taxi isn’t that interesting. Can we do things across all taxis?

Let’s think about what kind of questions we can ask:

  • Generate a heat map of pickup and drop off locations over time?
  • Find out which taxis were at a given location at a given time?
  • Find taxis that were near a particular taxi on a given day?

To answer all of these questions, we have to aggregate data from multiple time series. We can do that using a Map/Reduce index on the time series data. Here is what this looks like:

We scan through all the location events for the taxis and group them on an hourly basis. We also generate a GeoHash code for the location of the taxi at that time. This uses a GeoHash with a length of 9, which represents an accuracy of about 2.5 square meters.

We then aggregate all the taxis that were in the same GeoHash at the same hour into a single entry. To make it easier for ourselves, we’ll also use a spatial field (computed from the geo hash) to allow for spatial queries.

The idea is that we want to aggregate the taxi’s location information on both space and time. It is easy to go from a more accurate timestamp to a lower granularity one (zeroing the minutes and seconds of a time). For the spatial location, we can use a GeoHash of a particular precision to do pretty much the same thing. Instead of having to deal with the various points, we’ll aggregate the taxis by decreasing the resolution we use to track the location.
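To make the bucketing concrete, here is a small self-contained sketch (this is not RavenDB’s code: a standard GeoHash encoder plus the hourly truncation the index performs):

```python
from datetime import datetime

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, length=9):
    """Standard GeoHash encoding: interleave longitude/latitude range bisections."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True  # even-numbered bits refine longitude, odd bits latitude
    while len(bits) < length * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    # pack each run of 5 bits into one base32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, length * 5, 5)
    )

def hour_bucket(ts):
    """Reduce a timestamp to hourly granularity, as the index does."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Bucket key for a pickup in downtown Chicago at 14:37 - the (geohash, hour)
# pair is what the Map/Reduce index groups on:
print(geohash(41.8781, -87.6298), hour_bucket(datetime(2019, 12, 1, 14, 37)))
```

The two functions deliberately mirror each other: each throws away precision (spatial or temporal) so that nearby events collapse into the same group key.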

The GeoHash code isn’t part of RavenDB. This is provided as an additional source to the index, and can be seen fully in the following link. With this index in place, we are ready to start answering all sort of interesting questions. Since the data is from Chicago, I decided to look in the map and see if I can find anything interesting there.

I created the following shape on a map:

image

This is the textual representation of the shape using Well Known Text: POLYGON((-87.74606191713963 41.91097449402647,-87.66915762026463 41.910463501644806,-87.65748464663181 41.89359845829678,-87.64924490053806 41.89002045220879,-87.645811672999 41.878262735374236,-87.74194204409275 41.874683870355824,-87.74606191713963 41.91097449402647)).

And now I can query over the data to find the taxis that were in that particular area on Dec 1st, 2019:

image

And here are the results of this query:

image

You can see that we have a very nice way to see which taxis were at each location at a time. We can also use the same results to paint a heat map over time, counting the number of taxis in a particular location.

To put this into (sadly) modern terms, we can use this to track people who were near a particular person, to figure out if they might be at risk of getting sick due to being near a sick person.

In order to answer this question, we need to take two steps. First, we ask to get the location of a particular taxi for a time period. We already saw how we can query on that. Then we ask to find all the taxis that were in the specified locations in the right times. That gives us the intersection of taxis that were in the same place as the initial taxi, and from there we can send the plague police.
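A sketch of that intersection in plain Python, over made-up (geohash, hour) entries shaped like the index output (the names here are illustrative, not RavenDB’s API):

```python
# Each aggregated index entry: (geohash, hour) -> set of taxi ids seen there.
# The data below is invented for illustration.
entries = {
    ("dp3wjzt", "2019-12-01T14:00"): {"taxis/174", "taxis/201"},
    ("dp3wjzt", "2019-12-01T15:00"): {"taxis/174"},
    ("dp3wmq0", "2019-12-01T14:00"): {"taxis/555"},
}

def contacts_of(taxi_id, entries):
    # Step 1: find every (location, hour) bucket the taxi appeared in.
    buckets = [key for key, taxis in entries.items() if taxi_id in taxis]
    # Step 2: collect every other taxi that shared one of those buckets.
    nearby = set()
    for key in buckets:
        nearby |= entries[key]
    nearby.discard(taxi_id)
    return nearby

print(contacts_of("taxis/174", entries))  # {'taxis/201'}
```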

time to read 2 min | 300 words

I recently got an interesting question about how RavenDB is processing certain types of queries. The customer in question had a document with a custom analyzer and couldn’t figure out why certain queries didn’t work.

For the purpose of the discussion, let’s consider the following analyzer:

In other words, when using this analyzer, we’ll have the following transformations:

  • “Singing avocadoes” – will be: “sing”, “avocadoes”
  • “Sterling silver” – will be: “ster”, “silver”
  • “Singularity Trailer” – will be “singularity”, “trailer”

As a reminder, this is used in an inverted index, which gives us the ability to look up a term and find all the documents containing that term.

An analyzer is applied on the text that is being indexed, but also on the queries. In other words, because during indexing I changed “singing” to “sing”, I also need to do the same for the query. Otherwise a query for “singing voice” will have no results, even if the “singing” term was in the original data.

The rules change when we do a prefix search, though. Consider the following query:

What should we be searching on here? Remember, this is using an analyzer, but we are also doing a prefix search. Let’s consider our options. If we pass this through the analyzer, the query will change its meaning. Instead of searching for terms starting with “sing”, we’ll search for terms starting with “s”.

That will give us results for “Sterling Silver”, which is probably not expected. In this case, by the way, I’m actually looking for the term “singularity”, and processing the term further would prevent that.

For that reason, RavenDB assumes that queries using wildcard searches are not subject to an analyzer and will not process them using one. The reasoning is simple, by using a wildcard you are quite explicitly stating that this is not a real term.
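To see both behaviors side by side, here is a toy sketch. The analyzer below is a crude stand-in for the custom analyzer above (just enough suffix stripping to reproduce the transformations listed earlier), and the index is a plain in-memory inverted index, not RavenDB’s:

```python
def analyze(text):
    """Toy analyzer: crude suffix stripping, mimicking 'singing' -> 'sing'."""
    terms = []
    for word in text.lower().split():
        for suffix in ("ling", "ing"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 4:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return terms

docs = {1: "Singing avocadoes", 2: "Sterling silver", 3: "Singularity Trailer"}

index = {}  # term -> set of doc ids: a tiny inverted index
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def term_query(word):
    # Exact-term search: the query text goes through the analyzer too.
    (term,) = analyze(word)
    return index.get(term, set())

def prefix_query(prefix):
    # Wildcard search: the raw prefix is used as-is, with no analysis.
    return {d for term, ids in index.items()
            if term.startswith(prefix.lower()) for d in ids}

print(term_query("singing"))  # {1}: 'singing' was analyzed down to 'sing'
print(prefix_query("sing"))   # {1, 3}: matches 'sing' and 'singularity'
```

Had the prefix also been analyzed, the “singularity” match would be lost, which is exactly the problem described above.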

time to read 4 min | 771 words

This question was raised on Twitter, and I thought it was quite interesting. In SQL, you can use the rank() function to generate this value, but if you are working on a large data set, and especially if you are sorting, you will probably want to avoid this.

Microsoft has a reference architecture for the leader board problem where they recommend running a separate process to recompute the ranking every few minutes and cite about 20 seconds to run the query on a highly optimized scenario (with 1.6 billion entries in a column store).

RavenDB doesn’t have a rank() function, but that doesn’t mean you cannot implement a leader board. Let’s see how we can build one, shall we? We’ll start by looking at the document representing a single game:

image

You’ll probably have a lot more data in your use case, but that should be sufficient to create the leader board. The first step we need to do is to create an index to aggregate those values into total score for the gamers. Here is what the index looks like:

This is a fairly trivial index, which will allow us to compute the total score of a gamer across all games. You might want to also add score per year / month / week / day, etc. I’m not going to touch that since this is basically the same thing.

RavenDB’s map/reduce indexes will process the data and aggregate it across all games. A new game coming in will not require us to recompute the whole dataset, only the gamer that was changed will be updated, and even so, RavenDB can optimize it even further in many cases and touch only some of the data for that gamer to compute the new total.
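The heart of that map/reduce index is just grouping games by gamer and summing the scores, which can be sketched like so (illustrative Python with made-up documents, not the actual index definition):

```python
from collections import defaultdict

# Simplified game documents; the real ones would carry much more data.
games = [
    {"Gamer": "gamers/oren", "Score": 100},
    {"Gamer": "gamers/oren", "Score": 250},
    {"Gamer": "gamers/rachel", "Score": 400},
]

# Map: emit (gamer, score). Reduce: sum the scores per gamer.
totals = defaultdict(int)
for game in games:
    totals[game["Gamer"]] += game["Score"]

print(dict(totals))  # {'gamers/oren': 350, 'gamers/rachel': 400}
```

The difference in RavenDB is that this aggregation is incremental: a new game only touches the affected gamer’s total, not the whole dataset.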

However, there is a problem here. How do we generate a leader board? Finding the top 20 gamers is easy:

image

That is easy enough, for sure. But a leader board has more features that we want to have. For example, if I’m not in the top 20, I might want to see other gamers around my level. How can we do that?

We’ll need to issue a few queries for that. First, we want to find what the gamer’s actual score is:

image

And then we will need to get the gamers that are a bit better than the current gamer:

image

What this does is ask for the 4 players that are better than the current gamer. And we can do the same for those that are a bit worse:

image

Note that we are switching the direction of the filter and the order by direction in both queries. That way, we’ll have a list of ten players that are ranked higher or lower than the current gamer, with the current one strictly in the middle.

I’m ignoring the possibility of multiple gamers with the same score, but you can change the > to >= to take them into account. Whether this is important or not depends on how you want to structure your leader board.
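The logic of those two queries, with the order-by direction flipped between them, can be sketched over an in-memory score table (illustrative only; the data is made up):

```python
def window_around(totals, gamer, size=4):
    """Mimic the two queries: `size` players just above and just below the gamer."""
    my_score = totals[gamer]
    # 'Better' query: scores above mine, ordered ascending, take the closest ones.
    better = sorted((s, g) for g, s in totals.items() if s > my_score)[:size]
    # 'Worse' query: scores below mine, ordered descending, take the closest ones.
    worse = sorted(((s, g) for g, s in totals.items() if s < my_score),
                   reverse=True)[:size]
    # Reassemble in leader-board order: best first, current gamer in the middle.
    return [g for s, g in reversed(better)] + [gamer] + [g for s, g in worse]

totals = {"a": 500, "b": 450, "c": 400, "me": 350, "d": 300, "e": 250}
print(window_around(totals, "me", size=2))  # ['b', 'c', 'me', 'd', 'e']
```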

The final, and most complex, part of the leader board is finding the actual rank of any arbitrary gamer. How can we do that? We don’t want to go through the entire result set to get it. The answer to this question is to use facets, like so:

image

What this will do is ask RavenDB to provide a count of all the gamers whose scores are higher or lower than the current gamer’s. The output looks like this:

image

And you can now compute the rank of the gamer.
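Turning the facet counts into a rank is simple arithmetic: the number of gamers strictly above you, plus one. A sketch, with an in-memory table standing in for the facet query:

```python
totals = {"a": 500, "b": 450, "me": 350, "d": 300}

def facet_counts(totals, gamer):
    """What the facet query reports: counts above and below the gamer's score."""
    my_score = totals[gamer]
    higher = sum(1 for s in totals.values() if s > my_score)
    lower = sum(1 for s in totals.values() if s < my_score)
    return higher, lower

higher, lower = facet_counts(totals, "me")
rank = higher + 1  # players strictly above me, plus myself
print(rank)  # 3
```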

Facets are optimized for these kinds of queries and operate over the results of the index, so they are already working with aggregated data. Internally, the data is actually stored in a columnar format, so you can expect very quick replies.

There you have it, all the components required to create a leader board in RavenDB.

time to read 2 min | 235 words

A user reported an issue with RavenDB. They got unexpected results in their production database, but when they imported the data locally and tested things, everything worked.

Here is the simplified version of their index:

This is a multi map index that covers multiple collections and aggregates data across them. In this case, the issue was that in production, for some of the results, the CompanyName field was null.

The actual index was more complex, but once we trimmed it down to something more manageable, it became obvious what the problem was. Let’s look at the problematic line:

CompanyName = g.First().CompanyName,

The problem is with the First() call. There is no promise of ordering in the grouping results, and you are getting the first item there. If the item happened to be the one from the Company map, the index will appear to work and you’ll get the right company name. However, if the result from the User map will show up first, we’ll have null in the CompanyName.

We don’t make any guarantees about the order of elements in the grouping, but in practice it often (don’t rely on it!) depends on the order of updates to the documents. So you can update the user after the company and see the changes in the index.

The right way to index this data is to do so explicitly, like so:

CompanyName = g.First(x => x.CompanyName != null).CompanyName,
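The bug is easy to reproduce outside RavenDB; here is a toy version of the reduce group (the field names follow the index above, the rest is made up):

```python
# Two map outputs land in the same reduce group; their order is not guaranteed.
group_prod = [
    {"Source": "Users", "CompanyName": None},       # the user map entry came first
    {"Source": "Companies", "CompanyName": "Acme"},
]
group_local = list(reversed(group_prod))            # locally, the order differed

def first(group):
    # Equivalent of g.First().CompanyName - silently order dependent.
    return group[0]["CompanyName"]

def first_not_null(group):
    # Equivalent of g.First(x => x.CompanyName != null).CompanyName.
    return next(x["CompanyName"] for x in group if x["CompanyName"] is not None)

print(first(group_prod), first(group_local))                    # None Acme
print(first_not_null(group_prod), first_not_null(group_local))  # Acme Acme
```

The first version “works” on one machine and fails on another purely because the group order happened to differ, which is exactly what the customer saw.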

time to read 1 min | 65 words

Next week I’m going to be talking with Ryan Rounkles about his use of RavenDB in the Tended App. The Tended App is a medical services app for parents of a child with medical needs. It can be something as simple as the sniffles or a 24-hour virus, up to special-needs kids who require constant attention.

Join us on Aug 25, 2020 10:30 AM EST.
