Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 6 min | 1004 words

Chicago publishes its taxi trips in an easy-to-consume format, so I decided to see what kind of information I can dig out of the data using RavenDB. Here is what the data looks like:

[image: a sample of the raw Chicago taxi trip data]

There are actually a lot more fields in the data, but I wanted to generate a more focused dataset to show off certain features. For that reason, I’m going to record the trips for each taxi, where for each trip, I’m going to look at the start time, duration and pick up and drop off locations. The data’s size is significant, with about 194 million trips recorded.

I converted the data into RavenDB’s time series, with a Location time series for each taxi’s location at a given point in time. You can see that the location is tagged with the type of event associated with it. The raw data has both pickup and drop off for each row, but I split it into two separate events.

[image: the Location time series for a taxi, with each entry tagged as a pickup or drop off event]

The reason I did it this way is that we get a lot of questions about how to use RavenDB for doing… stuff with vehicle and location data. Chicago's taxi data is a good source of a non-trivial amount of real-world data, which is very nice to have.

Once we have all the data loaded in, we can see that there are 9,179 distinct taxis in the data set, with a varying number of events for each taxi. Here is one such example:

[image: the time series summary for a single taxi]

The taxi in question has six years of data and 6,545 pickup and dropoff events.

The question now is, what can we do with this data? What sort of questions can we answer?

Asking where a taxi is at a given point in time is easy enough:

from Taxis where id() == 'taxis/174' select timeseries(from Location between '2015-01-01' and '2015-01-10')

And gives us the results:

[image: the results of the query, showing the taxi's locations in that period]

But asking a question about a single taxi isn't that interesting. Can we do things across all taxis?

Let’s think about what kind of questions we can ask:

  • Generate a heat map of pickup and drop off locations over time.
  • Find out which taxis were at a given location at a given time.
  • Find taxis that were near a particular taxi on a given day.

To answer all of these questions, we have to aggregate data from multiple time series. We can do that using a Map/Reduce index on the time series data. Here is what this looks like:

We scan through all the location events for the taxis and group them on an hourly basis. We also generate a GeoHash code for the taxi's location at that time. This is using a GeoHash with a length of 9, so it represents an accuracy of about 2.5 square meters.

We then aggregate all the taxis that were in the same GeoHash at the same hour into a single entry. To make it easier for ourselves, we’ll also use a spatial field (computed from the geo hash) to allow for spatial queries.

The idea is that we want to aggregate the taxi’s location information on both space and time. It is easy to go from a more accurate timestamp to a lower granularity one (zeroing the minutes and seconds of a time). For spatial location, we can use a GeoHash of a particular precision to do pretty much the same thing. Instead of having to deal with the various points, we’ll aggregate the taxis by decreasing the resolution we use to track the location.
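To make the bucketing concrete, here is a minimal sketch of that idea in C#. This isn't the index code from the post; GeoHash.Encode is a hypothetical helper that stands in for the GeoHash implementation supplied to the index as an additional source (see below).

```csharp
using System;

public static class LocationBucketing
{
    // Reduce the resolution of both time and space so that nearby events
    // land in the same bucket: truncate the timestamp to the hour and
    // encode the coordinates as a GeoHash of a fixed length.
    public static (DateTime Hour, string Cell) GetBucket(DateTime timestamp, double lat, double lng)
    {
        // Zero out minutes, seconds, etc. to get hourly granularity.
        var hour = new DateTime(timestamp.Year, timestamp.Month, timestamp.Day,
                                timestamp.Hour, 0, 0, timestamp.Kind);

        // GeoHash.Encode is a hypothetical helper (not part of RavenDB or the BCL).
        // Length 9 gives a cell of a few meters; a shorter hash covers a larger area.
        string cell = GeoHash.Encode(lat, lng, length: 9);

        return (hour, cell);
    }
}
```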

The GeoHash code isn’t part of RavenDB. It is provided as an additional source to the index, and can be seen in full in the following link. With this index in place, we are ready to start answering all sorts of interesting questions. Since the data is from Chicago, I decided to look at the map and see if I could find anything interesting there.

I created the following shape on a map:

[image: the polygon drawn on a map of Chicago]

This is the textual representation of the shape using Well Known Text: POLYGON((-87.74606191713963 41.91097449402647,-87.66915762026463 41.910463501644806,-87.65748464663181 41.89359845829678,-87.64924490053806 41.89002045220879,-87.645811672999 41.878262735374236,-87.74194204409275 41.874683870355824,-87.74606191713963 41.91097449402647)).

And now I can query over the data to find the taxis that were in that particular area on Dec 1st, 2019:

[image: the spatial query over the index, intersecting with the polygon for Dec 1st, 2019]

And here are the results of this query:

[image: the results of the spatial query]

You can see that we have a very nice way to see which taxis were at each location at a given time. We can also use the same results to paint a heat map over time, counting the number of taxis in a particular location.

To put this into (sadly) modern terms, we can use this to track people who were near a particular person, to figure out if they might be at risk of getting sick due to being near a sick person.

In order to answer this question, we need to take two steps. First, we get the locations of a particular taxi for a time period. We already saw how we can query on that. Then we find all the taxis that were in those locations at the right times. That gives us the intersection of taxis that were in the same place as the initial taxi, and from there we can send the plague police.

time to read 2 min | 300 words

I recently got an interesting question about how RavenDB is processing certain types of queries. The customer in question had a document with a custom analyzer and couldn’t figure out why certain queries didn’t work.

For the purpose of the discussion, let’s consider the following analyzer:

In other words, when using this analyzer, we’ll have the following transformations:

  • “Singing avocadoes” – will be: “sing”, “avocadoes”
  • “Sterling silver” – will be: “ster”, “silver”
  • “Singularity Trailer” – will be “singularity”, “trailer”

As a reminder, this is used in an inverted index, which gives us the ability to look up a term and find all the documents containing that term.

An analyzer is applied on the text that is being indexed, but also on the queries. In other words, because during indexing I changed “singing” to “sing”, I also need to do the same for the query. Otherwise a query for “singing voice” will have no results, even if the “singing” term was in the original data.

The rules change when we do a prefix search, though. Consider the following query:
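The query from the question isn't reproduced here. As an illustration, assume something along these lines with the C# client, where the user does a prefix search for "sing" (the index name and the Title field are made up for the example):

```csharp
using System.Linq;
using Raven.Client.Documents;

public class Movie
{
    public string Title { get; set; }
}

public static class PrefixSearchExample
{
    public static void Run(IDocumentStore store)
    {
        using var session = store.OpenSession();

        // A prefix (wildcard) query against the analyzed Title field,
        // effectively asking for: Title: sing*
        var results = session.Advanced
            .DocumentQuery<Movie>("Movies/Search")
            .WhereStartsWith(x => x.Title, "sing")
            .ToList();
    }
}
```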

What should we be searching on here? Remember, this is using an analyzer, but we are also doing a prefix search. Let's consider our options. If we pass this through an analyzer, the query will change its meaning. Instead of searching for terms starting with “sing”, we’ll search for terms starting with “s”.

That will give us results for “Sterling Silver”, which is probably not expected. In this case, by the way, I’m actually looking for the term “singularity”, and processing the term further would prevent that.

For that reason, RavenDB assumes that queries using wildcard searches are not subject to an analyzer and will not process them using one. The reasoning is simple: by using a wildcard, you are quite explicitly stating that this is not a real term.

time to read 4 min | 771 words

This question was raised on Twitter, and I thought it was quite interesting. In SQL, you can use the rank() function to generate this value, but if you are working on a large data set, and especially if you are sorting, you will probably want to avoid this.

Microsoft has a reference architecture for the leader board problem where they recommend running a separate process to recompute the ranking every few minutes and cite about 20 seconds to run the query on a highly optimized scenario (with 1.6 billion entries in a column store).

RavenDB doesn’t have a rank() function, but that doesn’t mean you cannot implement a leader board. Let’s see how we can build one, shall we? We’ll start with looking at the document representing a single game:

[image: a sample document representing a single game]

You’ll probably have a lot more data in your use case, but that should be sufficient to create the leader board. The first step is to create an index to aggregate those values into a total score for each gamer. Here is what the index looks like:
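The index from the post isn't shown here; a minimal sketch of what such a map/reduce index might look like with the C# client (class and property names are illustrative):

```csharp
using System.Linq;
using Raven.Client.Documents.Indexes;

// Illustrative document shape for a single game.
public class Game
{
    public string Gamer { get; set; }
    public long Score { get; set; }
}

// A map/reduce index that sums up each gamer's score across all of their games.
public class Games_TotalScore : AbstractIndexCreationTask<Game, Games_TotalScore.Result>
{
    public class Result
    {
        public string Gamer { get; set; }
        public long TotalScore { get; set; }
    }

    public Games_TotalScore()
    {
        Map = games => from game in games
                       select new Result
                       {
                           Gamer = game.Gamer,
                           TotalScore = game.Score
                       };

        Reduce = results => from result in results
                            group result by result.Gamer into g
                            select new Result
                            {
                                Gamer = g.Key,
                                TotalScore = g.Sum(x => x.TotalScore)
                            };
    }
}
```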

This is a fairly trivial index, which will allow us to compute the total score of a gamer across all games. You might want to also add score per year / month / week / day, etc. I’m not going to touch that since this is basically the same thing.

RavenDB’s map/reduce indexes will process the data and aggregate it across all games. A new game coming in will not require us to recompute the whole dataset, only the gamer that was changed will be updated, and even so, RavenDB can optimize it even further in many cases and touch only some of the data for that gamer to compute the new total.

However, there is a problem here. How do we generate a leader board? Finding the top 20 gamers is easy:

[image: query returning the top 20 gamers by total score]
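With the C# client and the index sketched above, that query might look roughly like this (store is assumed to be your IDocumentStore):

```csharp
using var session = store.OpenSession();

// The 20 gamers with the highest aggregated score.
var top20 = session.Query<Games_TotalScore.Result, Games_TotalScore>()
    .OrderByDescending(x => x.TotalScore)
    .Take(20)
    .ToList();
```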

That is easy enough, for sure. But a leader board has more features that we want to have. For example, if I’m not in the top 20, I might want to see other gamers around my level. How can we do that?

We’ll need to issue a few queries for that. First, we want to find what the gamer’s actual score is:

[image: query fetching the total score of the current gamer]

And then we will need to get the gamers that are a bit better than the current gamer:

[image: query for the players ranked just above the current gamer]

What this does is ask for the 4 players that are better than the current gamer. And we can do the same for those that are a bit worse:

[image: query for the players ranked just below the current gamer]

Note that we are switching the direction of the filter and the order by direction in both queries. That way, we’ll have a list of ten players that are ranked higher or lower than the current gamer, with the current one strictly in the middle.
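In client code, the pair of queries might look roughly like this (session is an open session as before, and myScore is the total score we just fetched for the current gamer):

```csharp
// Players ranked just above me: the lowest scores among those higher than mine.
var justAbove = session.Query<Games_TotalScore.Result, Games_TotalScore>()
    .Where(x => x.TotalScore > myScore)
    .OrderBy(x => x.TotalScore)
    .Take(4)
    .ToList();

// Players ranked just below me: the highest scores among those lower than mine.
var justBelow = session.Query<Games_TotalScore.Result, Games_TotalScore>()
    .Where(x => x.TotalScore < myScore)
    .OrderByDescending(x => x.TotalScore)
    .Take(4)
    .ToList();
```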

I’m ignoring the possibility of multiple gamers with the same score, but you can change the > to >= to take them into account. Whether this is important or not depends on how you want to structure your leader board.

The final, and most complex, part of the leader board is finding the actual rank of any arbitrary gamer. How can we do that? We don’t want to go through the entire result set to get it. The answer to this question is to use facets, like so:

[image: the facets query used to compute the rank]
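The facets query itself is shown only as an image above; a rough sketch of the same idea with the C# client's aggregation API might look like this (treat the exact API shape as an approximation):

```csharp
// Count how many gamers score below me and how many score at or above me.
var counts = session.Query<Games_TotalScore.Result, Games_TotalScore>()
    .AggregateBy(factory => factory.ByRanges(
        x => x.TotalScore < myScore,
        x => x.TotalScore >= myScore))
    .Execute();
```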

What this will do is ask RavenDB to provide a count of all the gamers whose scores are higher or lower than the current gamer’s. The output looks like this:

[image: the facet results, with the counts of gamers above and below the current score]

And you can now compute the rank of the gamer.

Facets are optimized for these kinds of queries and are going to operate over the results of the index, so they are already working over aggregated data. Internally, the data is actually stored in a columnar format, so you can expect very quick replies.

There you have it, all the components required to create a leader board in RavenDB.

time to read 3 min | 475 words

This post talks about Python, but it generalizes well to other programming languages and environments. I’m using Python and my own experience here to make a point; this isn’t really a post about Python itself.

I know how to read and write code in Python. By that I mean that I understand the syntax and how to do things in Python. I wouldn’t say that I’m an expert in all the myriad ways that you can make Python jump on command, but I’m comfortable reading non-trivial code bases and I like to use Python for small scripting jobs. I’m also maintaining (personally) the RavenDB Python Client, which is just over 11K lines of code and decidedly non-trivial.

But I don’t know Python.

Those two statements may seem to contradict one another, but I don’t really think so.

To this day, I find packaging code in Python to be unfamiliar territory. I don’t have a good feel for it, and even when I follow the guides exactly, something doesn’t work properly more often than not. I also have only the vaguest idea about the Python virtual machine and the internals of the GC.

My few attempts to build Python interfaces on top of ctypes have been… painful. And a task such as creating an application or a package that would embed a native component in Python is likely beyond me without investing a significant amount of time and effort.

This is interesting, because my threshold for understanding a language or a platform is that I should have the ability to do non-trivial things with the environment, not just with the code.

Packaging is an obvious problem, once you go beyond the simplest of scripts. But detailed knowledge of debugging, troubleshooting and analyzing the system is also what I would expect to have before I could claim that I’m familiar with a particular environment.

There is often a lot of derision for job requirements such as “Minimum 5 years experience with Xyz”, leaving aside cases where Xyz is younger than five years. In many cases this is a requirement that came from the HR department and not the technical team.

But when I read a requirement like that, I translate it to the difference between knowing how the code works and grokking how the whole environment operates.

Note that there is nothing really insurmountable about Python, per se. If I dedicated enough time (probably on the order of weeks, not months) to studying it properly, I would have most of the knowledge that I need. But whenever I run into a stumbling block when using Python, it is always easier to simply forgo using Python and go use something that I am more familiar with. Hence, there has never been enough of a reason to make the jump to really understand the platform.

time to read 2 min | 213 words

I was asked what the meaning of default(object) is in the following piece of code:

The code is something that you’ll see a lot in RavenDB indexes, but I understand why it is a strange construct. The default(object) is a way to write null. It asks the C# compiler to use the default value of the object type, which is null.

So why not simply say null there?

Look at the code: we aren’t setting a field here, we are creating an anonymous object. When we set a field to null, the compiler can tell what the type of the field is from the class definition and check that the value is appropriate. You can’t set a Boolean property to null, for example.

With anonymous objects, the compiler needs to know what the type of the field is, and null doesn’t provide this information. You can use the (object)null construct, which has the same meaning as default(object), but I find the latter to be syntactically more pleasant to read.

It may make more sense if you look at the following code snippet:
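The original snippet isn't shown here; the following illustration (names are made up) captures the same point:

```csharp
public class User
{
    public string Name { get; set; }
}

public static class Example
{
    public static void Run()
    {
        // With a class, the compiler knows the property's type, so null is fine:
        var user = new User { Name = null };

        // With an anonymous object there is no declaration to infer the type from,
        // so this does not compile ("Cannot assign <null> to anonymous type property"):
        // var doc = new { LastIndexed = null };

        // Either of these works, because the expression itself carries a type (object):
        var doc1 = new { LastIndexed = (object)null };
        var doc2 = new { LastIndexed = default(object) };
    }
}
```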

This technique is probably only useful if you deal with anonymous objects a lot. That is something that you do frequently with RavenDB indexes, which is how I run into this syntax.

time to read 2 min | 380 words

A sadly commonplace “attack” on applications is called “Web Parameter Tampering”. This is the case where you have a URL such as this:

https://secret.medical.info/?userid=823

And your users “hack” you using:

https://secret.medical.info/?userid=999

And get access to another user’s records.

As an aside, that might actually be considered hacking, legally speaking. Which makes me want to smash my head on the keyboard a few times.

Obviously, you need to run your security validation on parameters, but there are other reasons to avoid exposing raw identifiers to the user. If you are using an incrementing counter of some kind, creating two values might cause you to leak the rate at which your data changes. For example, a competitor might create an order once a week and track the order numbers. That will give them a good indication of how many orders there have been in that time frame.

Finally, there are other data leakage issues that you might want to take into account. For example, “users/321” means that you are likely to be using RavenDB, while “users/4383-B” means that you are using RavenDB 4.0 or higher, and “607d1f85bcf86cd799439011” means that you are using MongoDB.

A common reaction to this is to switch your ids to use guids. I hate that option; it means that you are entering very unfriendly territory for the application. Guids convey no information to the developers working with the system and they are hard to work with, from a human point of view. They are also less nice for the database system to work with.

A better alternative is to simply mask the information when it leaves your system. Here is the code to do so:
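The original code isn't reproduced here. The sketch below captures the same idea under a few assumptions: it uses AES from the BCL, prepends the IV to the ciphertext, and encodes the result with the Base58 alphabet used by Bitcoin; key management and the unmask/verify side are left out.

```csharp
using System.Numerics;
using System.Security.Cryptography;
using System.Text;

public static class IdMasker
{
    private const string Base58Alphabet =
        "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

    // Mask an id such as "users/1123". The key must be a valid AES key
    // (16, 24 or 32 bytes) and should come from your configuration.
    public static string Mask(string id, byte[] key)
    {
        using var aes = Aes.Create();
        aes.Key = key;
        aes.GenerateIV();

        using var encryptor = aes.CreateEncryptor();
        byte[] plain = Encoding.UTF8.GetBytes(id);
        byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);

        // Prepend the IV so that the value can be unmasked later.
        byte[] payload = new byte[aes.IV.Length + cipher.Length];
        aes.IV.CopyTo(payload, 0);
        cipher.CopyTo(payload, aes.IV.Length);

        return ToBase58(payload);
    }

    private static string ToBase58(byte[] data)
    {
        // Treat the bytes as one big unsigned number and repeatedly divide by 58.
        var value = new BigInteger(data, isUnsigned: true, isBigEndian: true);
        var sb = new StringBuilder();
        while (value > 0)
        {
            value = BigInteger.DivRem(value, 58, out BigInteger remainder);
            sb.Insert(0, Base58Alphabet[(int)remainder]);
        }
        // Preserve leading zero bytes, as the Bitcoin encoding does.
        foreach (byte b in data)
        {
            if (b != 0) break;
            sb.Insert(0, Base58Alphabet[0]);
        }
        return sb.ToString();
    }
}
```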

You can see that I’m actually using AES encryption to hide the data, and then encoding it in the Bitcoin address format (Base58).

That means that an identifier such as "users/1123" will result in output such as this:

bPSPEZii22y5JwUibkQgUuXR3VHBDCbUhC343HBTnd1XMDFZMuok

The length of the identifier is larger, but not overly so, and the id is even URL safe. In addition to hiding the identifier itself, we also ensure that the users cannot muck about in the value. Any change to the value will result in an error when trying to unmask it.

time to read 3 min | 490 words

In my previous post, I showed how you can search for a range of cleared bits in a bitmap, as part of an allocation strategy. I want to do better than just finding a range; I want to find the best range. Best is defined as:

  • Near a particular location (to increase locality).
  • Create the least amount of fragmentation.

Here is the bitmap in question: [0x80520A27102E21, 0xE0000E00020, 0x1A40027F025802C8, 0xEE6ACAE56C6C3DCC].

And a visualization of it:

[image: visualization of the bitmap]

What we want is to find the following ranges:

  • single cell near the red marker (9th cell)
  • 2 cells range near the red marker (9th cell)
  • 4 cells range near the yellow marker (85th cell)
  • 7 cells range near the green marker (198th cell)

We have an existing primitive already, the find_next() function. Let’s see how we can use it to find the right buffer for us.

Here is how it goes: we scan forward from the beginning of the word that contains the desired position, trying to find a range that matches the requirement. We’ll take the smallest range we can find, but as we go further from the starting point, we’ll loosen our requirements. If we can’t find anything higher, we’ll search lower than the given position. Regardless, if we find an exact match, one where we don’t need to fragment a larger range, we’ll take it if it is close enough.

Here are the results from the code (the code itself will follow shortly):

  • single cell near the 9th – 12 – as expected
  • 2 cells near the 9th – 27 – because there is a two cell hole there and we won’t fragment
  • 4 cells near 85 – 64 – I mentioned that we are scanning only forward, right? But we are scanning on a word boundary. So the 85th cell’s bit is in the 2nd word and the first 4 bits there are free.
  • 7 cells near 198 – 47 – we couldn’t find anything suitable higher, so we started to look from the bottom.

Here is the full code for this. I want to highlight a couple of functions:

Here you can see how I approach the issue. First, get the data from the start of the word containing the desired location, then see if I can get the best match. Or, if I can’t, just try to get anything.

The fact that I can play around with the state of the system and resume operations with ease makes it possible to write this code.

Note that there is a tension in allocation code between:

  • Finding the best possible match
  • The time that it takes to find the match
  • The resources consumed by the find

In particular, if I need to scan a lot of memory, I may run into issues. It is usually better to get something as soon as possible than to wait for the perfect match.

If you are interested in a nice mental workout, implement the find_prev function. That is somewhat of a challenge, because you need to manage the state in reverse. Hint, use: __builtin_ctzll().

time to read 2 min | 314 words

In my previous post I showed how you can find a range of cleared bits in a bitmap. I even got fancy and added features such as finding a range from a given location, etc. That was great to experiment with, but when I tried to take the next step, being able to ask questions such as “find me the next range from a location” or “get the smallest range available”, I ran into problems.

The issue was simple: I had too much state on my hands and I got lost trying to manage it all. Instead, I sat down and refactored things. I introduced an explicit notion of state:

This structure allows me to keep track of what is going on when I’m calling functions. Which means that I can set it up once, and then call find_next immediately, without needing to remember the state at the call site.
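The struct itself isn't shown here; based on the description, a rough sketch of what such a state might hold looks like this (field names are illustrative, and the original is C rather than C#):

```csharp
// A rough sketch of an explicit, resumable search state for scanning an
// allocation bitmap one 64-bit word at a time.
public struct RangeSearchState
{
    public ulong[] Bitmap;   // the full allocation bitmap
    public int Index;        // which 64-bit word we are currently scanning
    public ulong Current;    // a working copy of Bitmap[Index] that we can mask as we go
    public int RangeStart;   // start bit of the range found so far (it may span words)
    public int RangeLength;  // length of the range found so far
}
```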

This behavior also helped a lot in the actual implementation. I’m processing the bitmap one uint64 at a time, which means that I have to account for ranges that cross a word boundary. The core of the routine is the find_range_once call:

This function operates on a single word, but the state is continuous. That means that if I have a range that spans multiple words, I can manage that without additional complications. Here is how I’m using this:

We get all the ranges from the current word, then decide if we should move to the next one, etc.

Notice that I’m using the current member in the struct to hold a copy of the current word; that is so I can mask the copy and move on.

Note that if the first cleared bit is at the start, I’m overflowing from ULONG_MAX back to zero. And with that, I’m able to do much more targeted operations on the bitmap, but I’ll discuss that in the next post.

time to read 3 min | 483 words

Consider the following scenario: I have a buffer and I need to allocate a certain size from it. Here is what this looks like:

The buffer is divided into cells and I want to be able to find a free cell. A common way to handle this is to use a bitmap (sometimes called a bit array or bit vector). In the case of the buffer above, we have 64 cells to deal with.

If I use a single bit to signal whether a cell is free or in use, I can handle all of that with a single uint64 value. Let’s assume that I have a big buffer and thus an array of uint64 values that represents the allocation bitmap. In the case of the buffer above, an X marks a busy cell and the bitmap for this is:

  • 0x80520A27102E21
  • 0b1000010001110100000010001110010001010000010010100000000100000000

I’m going to base this code on this post by Lemire. In the post, he is finding the set bits, and I need a range of the inverse. It isn’t hard to convert Lemire’s code to a range search. The complexity is that I wanted to enable searching from an arbitrary location. Here is the code; I’ll discuss it below:
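The original C code isn't reproduced here; the following is a simplified sketch of the core idea in C# (using BitOperations.TrailingZeroCount where the original uses __builtin_ctzll), without support for starting the search from an arbitrary location or for ranges that cross word boundaries:

```csharp
using System.Collections.Generic;
using System.Numerics;

public static class Bitmaps
{
    // Find runs of cleared (0) bits in one 64-bit word by walking its set bits.
    public static IEnumerable<(int Start, int Length)> FindClearedRanges(ulong word)
    {
        int pos = 0;            // first bit we haven't examined yet
        ulong bits = word;
        while (bits != 0)
        {
            int nextSet = BitOperations.TrailingZeroCount(bits); // next busy cell
            if (nextSet > pos)
                yield return (pos, nextSet - pos);               // cleared run before it
            pos = nextSet + 1;
            bits &= bits - 1;   // clear the lowest set bit, move to the next one
        }
        if (pos < 64)
            yield return (pos, 64 - pos);                        // trailing cleared run
    }
}
```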

The idea is that we use Lemire’s code to find the first set bit and see if the distance from the last set bit is enough to satisfy the range request we have. Using this function, we can tell that the 0x80520A27102E21 value has the following ranges:

  • 8 consecutive cells at: [47]
  • 7 consecutive cells at: [47]
  • 6 consecutive cells at: [14,47]
  • 5 consecutive cells at: [14,36,47]
  • 4 consecutive cells at: [1,14,36,47,51]
  • 3 consecutive cells at: [1,6,14,17,21,30,36,47,50]
  • 2 consecutive cells at: [1,3,6,14,16,18,21,27,30,36,38,42,47,49,51,53]
  • 1 consecutive cell at: [1,2,3,4,6,7,8,12,14,15,16,17,18,19,21,22,23,27,28,30,31,32,34,36,37,38,39,40,42,43,45,47,48,49,50,51,52,53,54]

Note that the function gives us quite a bit of information about the bitmap structure. We can tell how big a range we got back (so we can make a decision and get the optimal one). Here is an example of using this function to do best-fit, locality-aware allocations. We have the following buffer, composed of the following bitmap: [0x80520A27102E21, 0xE0000E00020, 0x1A40027F025802C8, 0xEE6ACAE56C6C3DCC].

What we want is to find the following ranges:

  • single cell near the red marker (9th cell)
  • 2 cells range near the red marker (9th cell)
  • 4 cells range near the yellow marker (85th cell)
  • 7 cells range near the green marker (198th cell)

[image: the bitmap with the red, yellow and green markers]

We want to get the best cell range for each request, where best is defined as:

  • Near the requested range
  • Will create the least amount of fragmentation.

I’ll try to give an answer to this in my next post; see how you do here.

A word of caution: it took a while to refactor the code above into its final form. I’ll have another post that discusses the changes as well. Consider this code a template; it hasn’t been well tested and I have rewritten it several times since this post.

time to read 1 min | 172 words

I’m really happy to announce that RavenDB Cloud has now deployed shared instances support.  These are full production systems, with three separate nodes deployed in separate availability zones for maximum availability.

Shared instances are meant for users that have a light load but still want to ensure high availability at a low cost.


We currently offer PS10 (about $30 / month) and PS20 (about $50 / month) plans in both AWS and Azure.

Shared plans are just that: you are getting shared compute resources. You are still fully isolated from other customers and from other instances. However, because the hardware resources you are using are shared with multiple customers, you can expect lower capacity. In our tests, such instances were able to handle light loads and typical hobbyist applications with no issues. You can also freely scale up from shared plans to basic or higher at any time.

Give it a try.
