Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

time to read 6 min | 1004 words

Chicago publishes its taxi trips in an easy to consume format, so I decided to see what kind of information I could dig out of the data using RavenDB. Here is what the data looks like:

[image: sample rows from the Chicago taxi trips dataset]

There are actually a lot more fields in the data, but I wanted to generate a more focused dataset to show off certain features. For that reason, I’m going to record the trips for each taxi, and for each trip, look at the start time, the duration and the pickup and drop off locations. The data’s size is significant, with about 194 million trips recorded.

I converted the data into RavenDB’s time series, with a Location time series for each taxi’s location at a given point in time. You can see that the location is tagged with the type of event associated with it. The raw data has both pickup and drop off for each row, but I split it into two separate events.

[image: Location time series entries, tagged with pickup / drop off events]

The reason I did it this way is that we get a lot of queries on how to use RavenDB for doing… stuff with vehicle and location data. The Chicago taxi data is a good source for a non-trivial amount of real world data, which is very nice to use.

Once we have all the data loaded in, we can see that there are 9,179 distinct taxis in the data set, with a varying number of events for each taxi. Here is one such scenario:

[image: one taxi’s Location time series, spanning six years]

The taxi in question has six years of data and 6,545 pickup and dropoff events.

The question now is, what can we do with this data? What sort of questions can we answer?

Asking where a taxi is at a given point in time is easy enough:

from Taxis where id() == 'taxis/174' select timeseries(from Location between '2015-01-01' and '2015-01-10')

And that gives us the results:

[image: the results of the time series query]

But asking a question about a single taxi isn’t that interesting. Can we do things across all taxis?

Let’s think about what kind of questions we can ask:

  • Generate a heat map of pickup and drop off locations over time?
  • Find out which taxis were at a given location at a given time?
  • Find out which taxis were near a particular taxi on a given day?

To answer all of these questions, we have to aggregate data from multiple time series. We can do that using a Map/Reduce index on the time series data. Here is what this looks like:
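Roughly, such a time series map/reduce index might look like the following in the C# client (a sketch, not the exact index from the post; GeoHashing is a stand-in for the additional-source geo hash code mentioned below, and the type and field names are illustrative):

using System;
using System.Linq;
using Raven.Client.Documents.Indexes.TimeSeries;

public class Taxis_Locations : AbstractTimeSeriesIndexCreationTask<Taxi, Taxis_Locations.Result>
{
    public class Result
    {
        public DateTime Hour;
        public string GeoHash;
        public string[] Taxis;
        public object Coordinates;
    }

    public Taxis_Locations()
    {
        AddMap("Location", segments =>
            from segment in segments
            from entry in segment.Entries
            select new
            {
                // truncate the timestamp to the hour
                Hour = new DateTime(entry.Timestamp.Year, entry.Timestamp.Month,
                    entry.Timestamp.Day, entry.Timestamp.Hour, 0, 0),
                // assume Values[0] / Values[1] hold latitude / longitude;
                // GeoHashing.Encode comes from the additional source (length 9)
                GeoHash = GeoHashing.Encode(entry.Values[0], entry.Values[1], 9),
                Taxis = new[] { segment.DocumentId },
                Coordinates = CreateSpatialField(entry.Values[0], entry.Values[1])
            });

        Reduce = results =>
            from result in results
            group result by new { result.Hour, result.GeoHash } into g
            let center = GeoHashing.DecodeCenter(g.Key.GeoHash) // [lat, lng], hypothetical helper
            select new
            {
                g.Key.Hour,
                g.Key.GeoHash,
                // all the taxis seen in this geo hash during this hour
                Taxis = g.SelectMany(x => x.Taxis).Distinct().ToArray(),
                // spatial field computed from the geo hash, to allow spatial queries
                Coordinates = CreateSpatialField(center[0], center[1])
            };
    }
}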

We are scanning through all the location events for the taxis and grouping them on an hourly basis. We also generate a GeoHash code for the location of the taxi at that time. This is using a GeoHash with a length of 9, so it represents an accuracy of about 2.5 square meters.

We then aggregate all the taxis that were in the same GeoHash at the same hour into a single entry. To make it easier for ourselves, we’ll also use a spatial field (computed from the geo hash) to allow for spatial queries.

The idea is that we want to aggregate the taxis’ location information on both space and time. It is easy to go from a more accurate timestamp to a lower granularity one (zeroing the minutes and seconds of a time). For the spatial location, we can use a GeoHash of a particular precision to do pretty much the same thing. Instead of having to deal with the individual points, we aggregate the taxis by decreasing the resolution we use to track the location.

The GeoHash code isn’t part of RavenDB. It is provided as an additional source to the index, and can be seen in full in the following link. With this index in place, we are ready to start answering all sorts of interesting questions. Since the data is from Chicago, I decided to look at the map and see if I could find anything interesting there.

I created the following shape on a map:

[image: the polygon drawn over a map of Chicago]

This is the textual representation of the shape using Well Known Text: POLYGON((-87.74606191713963 41.91097449402647,-87.66915762026463 41.910463501644806,-87.65748464663181 41.89359845829678,-87.64924490053806 41.89002045220879,-87.645811672999 41.878262735374236,-87.74194204409275 41.874683870355824,-87.74606191713963 41.91097449402647)).

And now I can query over the data to find the taxis that were in that particular area on Dec 1st, 2019:

[image: the spatial query over the aggregated locations index]
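In text form, the query might look roughly like this through the C# client (a sketch; the index and field names follow the sketch above):

// store is an initialized DocumentStore
using var session = store.OpenSession();

// the polygon from the Well Known Text above
var wkt = "POLYGON((-87.74606191713963 41.91097449402647, ... ))";

var buckets = session.Advanced
    .RawQuery<dynamic>(@"
        from index 'Taxis/Locations'
        where spatial.within(Coordinates, spatial.wkt($wkt))
          and Hour between $start and $end")
    .AddParameter("wkt", wkt)
    .AddParameter("start", new DateTime(2019, 12, 1))
    .AddParameter("end", new DateTime(2019, 12, 2))
    .ToList();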

And here are the results of this query:

[image: the results of the spatial query]

You can see that we have a very nice way to see which taxis were at each location at a given time. We can also use the same results to paint a heat map over time, counting the number of taxis in a particular location.

To put this into (sadly) modern terms, we can use this to track people that were near a particular person, to figure out if they might be at risk for being sick due to being near a sick person.

In order to answer this question, we need to take two steps. First, we get the locations of a particular taxi for a time period. We already saw how we can query on that. Then we find all the taxis that were in those locations at the right times. That gives us the intersection of taxis that were in the same place as the initial taxi, and from there we can send the plague police.
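A rough sketch of those two steps through the C# client (the index, field and helper names are the same assumptions as above; LocationBucket just mirrors the index output):

public class LocationBucket
{
    public DateTime Hour;
    public string GeoHash;
    public string[] Taxis;
}

// Step 1: get the locations of the taxi in question for the period we care about.
var entries = session.TimeSeriesFor("taxis/174", "Location")
    .Get(new DateTime(2019, 12, 1), new DateTime(2019, 12, 2));

var nearby = new HashSet<string>();
foreach (var entry in entries)
{
    var hour = new DateTime(entry.Timestamp.Year, entry.Timestamp.Month,
        entry.Timestamp.Day, entry.Timestamp.Hour, 0, 0);
    var geoHash = GeoHashing.Encode(entry.Values[0], entry.Values[1], 9);

    // Step 2: find everyone the index placed in the same (hour, geo hash) bucket.
    var bucket = session.Advanced
        .RawQuery<LocationBucket>(@"
            from index 'Taxis/Locations'
            where Hour = $hour and GeoHash = $geoHash")
        .AddParameter("hour", hour)
        .AddParameter("geoHash", geoHash)
        .FirstOrDefault();

    if (bucket != null)
        nearby.UnionWith(bucket.Taxis);
}

nearby.Remove("taxis/174"); // don't report the taxi we started from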

time to read 2 min | 235 words

A user reported an issue with RavenDB. They got unexpected results in their production database, but when they imported the data locally and tested things, everything worked.

Here is the simplified version of their index:
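The general shape of such an index might look something like this (a sketch; apart from the CompanyName field and the Company / User maps mentioned below, the names are illustrative):

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Users_WithCompanyName : AbstractMultiMapIndexCreationTask<Users_WithCompanyName.Result>
{
    public class Result
    {
        public string CompanyId;
        public string CompanyName;
        public int UserCount;
    }

    public Users_WithCompanyName()
    {
        // the Company map knows the company name
        AddMap<Company>(companies =>
            from company in companies
            select new Result
            {
                CompanyId = company.Id,
                CompanyName = company.Name,
                UserCount = 0
            });

        // the User map has no company name, so it contributes null
        AddMap<User>(users =>
            from user in users
            select new Result
            {
                CompanyId = user.CompanyId,
                CompanyName = null,
                UserCount = 1
            });

        Reduce = results =>
            from result in results
            group result by result.CompanyId into g
            select new Result
            {
                CompanyId = g.Key,
                CompanyName = g.First().CompanyName, // the problematic line
                UserCount = g.Sum(x => x.UserCount)
            };
    }
}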

This is a multi map index that covers multiple collections and aggregates data across them. In this case, the issue was that in production, for some of the results, the CompanyName field was null.

The actual index was more complex, but once we trimmed it down to something more manageable, it became obvious what the problem was. Let’s look at the problematic line:

CompanyName = g.First().CompanyName,

The problem is with the First() call. There is no promise of ordering in the grouping results, and you are getting the first item there. If that item happens to be the one from the Company map, the index will appear to work and you’ll get the right company name. However, if the result from the User map shows up first, we’ll have null in the CompanyName.

We don’t make any guarantees about the order of elements in the grouping, but in practice it often (don’t rely on it) depends on the order of updates to the documents. So you can update the user after the company and see the change in the index.

The right way to index this data is to do so explicitly, like so:

CompanyName = g.First(x => x.CompanyName != null).CompanyName,

time to read 4 min | 762 words

The underlying assumption for any distributed system is that the network is hostile. That assumption is pervasive. If you open a socket, you have to be aware of malicious people on the other side (and in the middle). If you accept input from external sources, you have to be careful, because it may be carefully crafted to do Bad Things. In short, the network is hostile, and you need to protect yourself from harm at all levels.

In many cases, you already have multiple layers of protection. For example, when building web applications, the HTTP server already validates that the incoming stream follows the HTTP protocol. Even ignoring maliciousness, you will get services that connect to you using the wrong protocols and create havoc. For reference, look at 1347703880, 1213486160 and 1195725856 and the issues they cause. As it turns out, these are relatively benign issues, because they are caught almost immediately. In the real world, the network isn’t only hostile, it is also smart.

The problem was originally posed by Lamport in the Byzantine Generals paper. You have a group of generals that need to agree on a particular time to attack a city. They can only communicate by (unreliable) messenger, and one or more of them are traitors. The paper itself is interesting to read, and the problem is pervasive enough that we now divide distributed systems into Byzantine and non-Byzantine systems. We now have pervasive cryptography deployed, to the point where you are reading this post over an encrypted channel, verified using public key infrastructure to validate that it indeed came from me. You can solve the Byzantine generals problem easily now.

Today, the terminology has changed. We now refer to Byzantine networks as systems where some of the nodes are malicious, and non-Byzantine as systems where we trust that the other nodes will do their task. For example, Raft and Paxos are both distributed consensus algorithms that assume a non-Byzantine system. Oh, the network communication goes through a hostile environment, but that is what we have TLS for. Authentication and encryption over the wire are mostly a solved problem at this point. It isn’t a simple problem, but it is a solved one.

So where would you run into Byzantine systems today? The obvious examples are cryptocurrencies and BitTorrent. In both cases, you have a distributed environment with incentives for the other side to cheat. In the case of cryptocurrencies, this is handled by proof of work / proof of stake, as well as the cost of getting to a 51% majority on the network. In the case of BitTorrent, it is an attempt to get peers to both download and upload. These examples are the first ones that pop to mind, but in reality, the most common tool to use with a Byzantine network is the browser.

The browser has to assume that every site is malicious, and the web server has to assume that each client is malicious. For that matter, every server has to assume that every client is malicious as well. You only have to read through the OWASP listing to understand that.

And how does this relate to databases? Distributed databases are composed of independent nodes, which cooperate to store and process your data. Most generally available database systems assume a non-Byzantine model. In other words, they authenticate the other nodes, but once past the authentication, the other node is trusted to operate as expected.

For the most part, that is a reasonable assumption to make. You run your database on machines that you tend to trust, after all. And assuming a non-Byzantine system allows for a drastically simpler system design and much higher performance.

However, we are starting to see more and more systems deployed on the edge. And that raises an interesting question: who controls the edge? Let’s assume that we have a traffic monitoring system, based on software that is running on your phone. While it may all be part of a single system, you have to take into account that you are now running on a system that is controlled by someone else, who may modify / change it at will.

That leads to interesting issues with regards to the design of such a system. On the one hand, you want to get data from the nodes in the field, but on the other hand, you need to be careful about trusting those nodes.

How would you approach such a system? Keep in mind that you want to reduce, as much as possible, the complexity of the system while not breaching its security.

time to read 3 min | 588 words

I ran into this blog post talking about using a real programming language for defining your configuration. I couldn’t agree more; I wrote about it 15 years ago. In fact, I agree so much that I wrote a whole book about the topic.

Configuration is fairly simple, on its face. You need to pass some values to a program to execute. In the simplest form, you simply have a map of strings. If you need hierarchy, you can use dots (.) or slashes (/) for readability. A good example is:
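Something along these lines (an illustrative sample, not any particular product’s configuration):

# lines starting with '#' are comments
http.port = 8080
http.host = 0.0.0.0
log.level = Information
log.path = /var/log/myapp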

As the original blog post notes, you also need to have comments in the format, if the file is meant to be human readable / editable. From that format, you can transform it to the desired result.  Other formats, such as JSON / YAML / XML are effectively all variations on the same thing.

Note that configuration usually takes a non-trivial amount of work to properly read, in particular if you have to run validations. For example, the port above must be greater than 1024 and less than 16,384. The log’s level can be either a numeric value or one of a small set of terms, etc.
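Even for the tiny sample above, the reading code ends up with something like this (a sketch in C#; config is just the flat map of strings, and the list of allowed log levels is illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

static void ValidateConfiguration(IDictionary<string, string> config)
{
    var port = int.Parse(config["http.port"]);
    if (port <= 1024 || port >= 16_384)
        throw new ArgumentOutOfRangeException("http.port",
            "must be greater than 1024 and less than 16,384");

    var knownLevels = new[] { "Debug", "Information", "Warning", "Error" };
    var level = config["log.level"];
    if (int.TryParse(level, out _) == false && knownLevels.Contains(level) == false)
        throw new ArgumentException(
            "log.level must be a numeric value or one of: " + string.Join(", ", knownLevels));
}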

The original post talked a lot about reusing configuration, which is interesting. Here is a blog post from 2007 showing exactly that purpose. I’m using a loop to configure an IoC container dynamically:

However, after doing similar things for a long while, I think that the most important aspect of this kind of capability has been missed. It isn’t about being able to loop in your configuration. That is certainly nice, but it isn’t the killer feature. The killer feature is that you don’t need to have a complex configuration subsystem.

In the case above, you can see that we are doing dynamic type discovery. I can do that in the INI example by specifying something like:
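For example (an illustrative key, not a real convention):

tasks.scan.assemblies = MyApp.Tasks.dll, MyApp.Extensions.dll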

But then I would need to go ahead and write all the discovery code in the app. And the kinds of things that I can do there are fixed. I can’t manage them on the fly or change them per configuration.

Here is another good example: passwords. In the most basic form, you can store passwords in plain text inside your configuration files. That is… not generally a good thing. So you might put them in a separate file. Or maybe use DPAPI on Windows to secure them. Something like this:
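For instance (illustrative keys only):

# plain text - simple, but not a great idea
database.password = p@ssw0rd

# read the value from a separate, locked down file
database.password.file = /secrets/db-password

# value protected with DPAPI on Windows
database.password.protected = <base64 DPAPI blob>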

I have to write separate code for each one of those options. Now I get a requirement from one customer that I need to use Azure Vault. And another uses a Hardware Security Module that we have to integrate with, etc.

Instead of having to do it all in the software, I can push that kind of behavior to the configuration. The script we’ll run can run arbitrary operations to gather its data, including custom stuff defined on site for the specific use case.

That gives you a lot of power, especially when you get a list of integration options that you have to work with. Not having to do all of that inside the software is huge. That is how RavenDB works, allowing you to shell out to a script for specific values. It means that we have a lot less work to do inside of RavenDB.

With RavenDB, we have gone with a hybrid approach. For most things, you define the configuration using a simple JSON file, and we allow you to shell out to scripts for the more complex / dynamic features. That ends up being quite nice to use and work with.

time to read 6 min | 1077 words

I ran across this article, which talks about unit testing. There isn’t anything there that would be ground breaking, but I ran across this quote, and I felt that I had to write a post to answer it.

The goal of unit testing is to segregate each part of the program and test that the individual parts are working correctly. It isolates the smallest piece of testable software from the remainder of the code and determines whether it behaves exactly as you expect.

This is a fairly common talking point when people discuss unit testing. Note that this isn’t the goal. The goal is what you want to achieve; this is a method of applying unit testing. Some of the benefits of unit tests are:

Makes the Process Agile and Facilitates Changes and Simplifies Integration

There are other items in the list in that article, but you can just read it there. I want to focus right now on the items above, because they are directly contradicted by separating each part of the program and testing it individually, as is usually applied in software projects.

Here are a few examples from posts I wrote over the years. The common pattern is that you’ll have interfaces, and repositories and services and abstractions galore. That will allow you to test just a small piece of your code, separate from everything else that you have.

This is great for unit testing. But unit testing isn’t a goal in itself. The point is to enable change down the line, to ensure that we aren’t breaking things that used to work, etc.

An interesting thing happens when you have this kind of architecture (and especially if you have this specifically so you can unit test it): it becomes very hard to make changes to the system. That is because the number of times you repeated yourself has grown. You have something once in the code and a second time in the tests.

Let’s consider something rather trivial. We have the following operation in our system, sending money:

[image: the money transfer operation]

A business rule says that we can’t send money if we don’t have enough in our account. Let’s see how we may implement it:
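Something along these lines (a sketch of the general shape; apart from IMoneyTransferValidationRules, which is named below, the types here are illustrative):

using System;
using System.Collections.Generic;

public record MoneyTransfer(string SourceAccount, string TargetAccount, decimal Amount);

public interface IMoneyTransferValidationRules
{
    // throws if the rule is violated
    void Validate(MoneyTransfer transfer);
}

public class HasSufficientFundsRule : IMoneyTransferValidationRules
{
    private readonly IAccountsRepository _accounts;

    public HasSufficientFundsRule(IAccountsRepository accounts) => _accounts = accounts;

    public void Validate(MoneyTransfer transfer)
    {
        // each rule loads its own data - a separate database query per rule
        var balance = _accounts.GetBalance(transfer.SourceAccount);
        if (balance < transfer.Amount)
            throw new InvalidOperationException("Insufficient funds");
    }
}

public class MoneyTransferService
{
    private readonly IEnumerable<IMoneyTransferValidationRules> _rules;

    public MoneyTransferService(IEnumerable<IMoneyTransferValidationRules> rules) => _rules = rules;

    public void Send(MoneyTransfer transfer)
    {
        foreach (var rule in _rules)
            rule.Validate(transfer);

        // ... actually move the money
    }
}

public interface IAccountsRepository
{
    decimal GetBalance(string accountId);
}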

This seems reasonable at first glance. We have a lot of rules around money transfers, and we expect to have more of them in the future, so we created the IMoneyTransferValidationRules abstraction to model that, and we can easily add new rules as time goes by. Nothing objectionable about that, right? And this is important, so we’ll have unit tests for each one of those rules.

During the last stages of building the system, we realize that each one of those rules generates a bunch of queries to the database, and that when we have load on the system, the transfer operation will create too much pain as it currently stands. There are a few options available to us at this point:

  • Instead of running individual operations that will each load their own data, we’ll load the data once for all of them. Here is what this will look like:
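A sketch of that first option, using the RavenDB session’s lazy operations so that all the rules’ data comes back in a single round trip (the rule methods and document types are illustrative):

using var session = store.OpenSession();

var account = session.Advanced.Lazily.Load<Account>(transfer.SourceAccount);
var limits = session.Advanced.Lazily.Load<TransferLimits>("limits/" + transfer.SourceAccount);
var recent = session.Query<MoneyTransfer>()
    .Where(t => t.SourceAccount == transfer.SourceAccount)
    .Lazily();

// a single remote call executes all of the pending lazy operations
session.Advanced.Eagerly.ExecuteAllPendingLazyOperations();

ValidateSufficientFunds(account.Value, transfer);
ValidateDailyLimit(limits.Value, recent.Value, transfer);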

As you can see, we now have a way to use Lazy queries to reduce the number of remote calls this will generate.

  • Instead of taking the data from the database and checking it, we’ll send the check script to the database and do the validation there.

And here we moved pretty much the same overall architecture directly into the database itself, so we won’t have to pay the cost of remote calls when we need to access more information.

The common thing for both approaches is that they are perfectly in line with the old way of doing things. We aren’t talking about a major conceptual change. We just changed things so that they are easier to work with properly.

What about the tests?

If we tested each one of the rules independently, we now have a problem. All of those tests will now require non-trivial modification. That means that instead of allowing change, the tests now serve as a barrier to change. They have set our architecture and code in concrete and made it harder to make changes. If those changes were bugs, that would be great. But in this case, we don’t want to modify the system’s behavior, only how it achieves its end result.

The key issue with unit testing the system as a set of individually separated components is the concept that there is value in each component independently. There isn’t. The whole is greater than the sum of its parts very much applies here.

If we had tests that looked at the system as a whole, those wouldn’t break. They would continue to serve us properly and validate that this big change we made didn’t break anything. Furthermore, at the edges of the system, changing the way things happen usually is a problem. We might have external clients or additional APIs that rely on us, after all. So keeping the exterior stable is something that I want to enforce with tests.

That said, when you build your testing strategy, you may have to make allowances. It is very important for the tests to run as quickly as possible. Slow feedback cycles can be incredibly annoying and will kill productivity. If there are specific components in your system that are slow, it makes sense to insert seams to replace them. For example, if you have a certificate generation bit in your system (which can take a long time), in the tests you might want to return a certificate that was prepared ahead of time. Or if you are working with a remote database, you may want to use an in memory version of it. An external API you’ll want to mock, etc.

The key here isn’t that you are trying to look at things in isolation, the key is that you are trying to isolate things that are preventing you from getting quick feedback on the state of the system.

In short, unless there is uncertainty about a particular component (implementing a new algorithm or data structure, exploring an unfamiliar library, using 3rd party code, etc.), I wouldn’t worry about testing it in isolation. Test it from the outside, as a user would (note that this may take some work to enable as an option), and you’ll end up with a far more robust testing infrastructure.

time to read 1 min | 135 words

I posted about our RavenDB C++ client a while ago, but I was really bad about making sure that we have regular updates. We have actually finished it already; there are even articles about it available now. The article was written by Michael Yarichuk and covers getting started and some of the basic steps to get running in C++ with RavenDB.

We had a very simple challenge in building our C++ client. We want to give you the same level of comfort and feature set in C++ as you would get in a managed language, while keeping the level of performance you would expect from a C++ application. I think we have done so quite successfully. You can read the article for the full details. No gore included.

time to read 3 min | 532 words

Once you put a document inside RavenDB, this is pretty much it, as far as RavenDB is concerned. It will keep your data safe, allow you to query it, etc. But it doesn’t generally act upon it. There are a few exceptions, however.

RavenDB supports the @expires metadata attribute. This attribute allows you to specify a specific time at which RavenDB will automatically delete the document. This is very useful for expiring documents. The classic example is a password reset token, which should be valid for a period of time and then removed.

Here is what this looks like:

[image: a document with the @expires metadata attribute set]
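From the client, setting it might look something like this (a sketch; the password reset token document is just an example):

// store is an initialized DocumentStore
using var session = store.OpenSession();

var token = new PasswordResetToken { UserId = "users/1-A" };
session.Store(token);

// @expires holds the UTC time at which RavenDB will delete the document
var metadata = session.Advanced.GetMetadataFor(token);
metadata["@expires"] = DateTime.UtcNow.AddHours(2).ToString("o");

session.SaveChanges();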

And you can configure the frequency at which we’ll check for expired documents in the studio.

[image: the document expiration configuration in the studio]

Expiring documents, however, isn’t all that RavenDB can do. RavenDB also has an additional feature, refreshing documents. You can mark a document to be refreshed by specifying the @refresh metadata attribute, like so:

[image: a document with the @refresh metadata attribute set]

It is easy to understand what @expires does. At a given time, it will delete the document, because it has expired. But what does refresh do? Well, at the specified time, a document with the @refresh metadata attribute will be updated by RavenDB to remove the @refresh metadata attribute from the document.

Yep, that is all. In other words, the document above would turn into:

[image: the same document, after the @refresh attribute has been removed]

That is all. Surely this is the most useless feature ever. You set a property that will be removed at a future time, but the only thing that the property can say is when to remove itself. What kind of feature is this?

Well, this is a case where, by itself, this would be a pretty useless feature. But the point is that it will cause the document to be updated. At that point, it is a normal update, which means that:

  • The document will be re-indexed.
  • The document will be sent over ETL.
  • The document will be sent to the relevant subscriptions.

The last point is the most important one. Here is an example of a typical subscription:
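Something along these lines (a sketch; the Commands collection is illustrative, and the exact metadata syntax in the filter should be treated as an assumption):

// a subscription that processes commands, but skips any document that still
// carries the @refresh metadata attribute
var subscriptionName = store.Subscriptions.Create(new SubscriptionCreationOptions
{
    Query = @"from Commands as c
              where c.'@metadata'.'@refresh' = null"
});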

As you can see, this is a pretty trivial subscription, but it filters out commands that are set to refresh. What does this mean? It means that if the @refresh attribute is set, we’ll ignore the document. But since RavenDB will automatically clear the attribute when the refresh timer is hit, we gain a powerful ability.

We now have the ability to process delayed commands. In other words, you can save a document with a refresh and have it processed by a subscription at a given time.

Expanding on this, you can do the same using ETL, so you can have a document that will be sent over to the ETL destination at a given time. You can do the same for indexing as well.

And now this seemingly trivial / useless feature becomes a pivot for a whole new set of capabilities that you get with RavenDB.

time to read 4 min | 698 words

A map/reduce index in RavenDB can be configured to output its value to a collection. This seems like a strange thing to want to do at first. We already got the results of the index, in the index. Why do we want to duplicate that by writing them to collections?

As it turns out, this is a pretty cool feature, because it enables us to do quite a lot. It means that we can apply anything that works on documents to the results of a map/reduce index. This list includes:

  • Map/Reduce – so you can create recursive / chained map/reduce operations.
  • ETL – so you can push aggregated data to another location, allowing distributed aggregation at scale easily.
  • Subscription / Changes – so you can get notified when an aggregated value has been changed.

The key thing about the list above is that none of them require you to know the ids of the generated documents upfront. Indeed, RavenDB uses document ids like the following for such documents:

[image: an automatically generated id for a map/reduce output document]

Technically speaking, you can compute the id. RavenDB uses a predictable algorithm to generate such an id, but practically speaking, it can be hard to figure out exactly what the inputs are for the id generation. That means that certain document related features are not available. In particular, you can’t easily:

  • Include such a document
  • Load it directly (you have to query)

So we need a better option to deal with it. The way RavenDB solves this issue is by allowing you to specify a pattern for the output collection, like so:

[image: the index definition, showing the output collection and the reference pattern]
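In the C# client, such an index might be defined roughly like this (a sketch over the sample Orders data; the pattern itself is illustrative, and PatternForOutputReduceToCollectionReferences is assumed to be the name of the pattern property):

using System;
using System.Linq;
using Raven.Client.Documents.Indexes;

public class Sales_ByCompanyAndYear : AbstractIndexCreationTask<Order, Sales_ByCompanyAndYear.Result>
{
    public class Result
    {
        public string Company;
        public int Year;
        public decimal Total;
    }

    public Sales_ByCompanyAndYear()
    {
        Map = orders =>
            from order in orders
            select new Result
            {
                Company = order.Company,
                Year = order.OrderedAt.Year,
                Total = order.Lines.Sum(l => l.PricePerUnit * l.Quantity)
            };

        Reduce = results =>
            from result in results
            group result by new { result.Company, result.Year } into g
            select new Result
            {
                Company = g.Key.Company,
                Year = g.Key.Year,
                Total = g.Sum(x => x.Total)
            };

        // write the aggregated results out as documents in this collection
        OutputReduceToCollection = "YearlySummary";

        // the reference pattern: how the easy-to-find reference documents are named
        PatternForOutputReduceToCollectionReferences = result =>
            $"sales/{result.Company}/{result.Year}";
    }
}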

As you can see, we have a map/reduce index that groups by the company and year (marked in blue). We output the results to the YearlySummary collection, as shown in the previous image.

The pattern (marked in red) specifies how we should name the output documents. Here is the result of this index:

[image: the documents generated by the index]

And here is what this document looks like:

[image: the contents of one of the generated reference documents]

Huh?

This is strange, you probably think. This is the document that should show the summary for companies/9-A in 1998, but there is no such data here. Instead, you’ll notice that the document’s collection is a references collection (marked in red) and that it points to (marked in blue) the actual document with the data. Why do we do things this way?

A map/reduce index is free to output multiple results for the same reduce key, so we need to handle multiple documents here. We also have to deal with multiple reduce outputs that end up with the same pattern. For example, if we do the map/reduce by day, but our pattern only specifies the month, we’ll have multiple reduce keys that end up with the same pattern.

In practice, because RavenDB has great support for following documents by id, it doesn’t matter. Here is how I can use this index in a query:

This single query allows us to ask a question about companies (those that reside in London, in this case), as well as get the sales total data for a particular year. Note that this doesn’t do any joins or anything expensive. We have the information at hand and can just use it.

You’ll notice that the pattern we specified uses both items that we reduce by. But that isn’t mandatory. We can also use this:

[image: a reference pattern that uses only the company]

Here we only specify the company in the pattern. What would be the result?

[image: the resulting reference document]

Now we get the sales total for the company, on a per year basis.

We can now run the following query:

And this will give us the following output:

[image: the output of the query]

As you can imagine, this opens up quite a few possibilities for advanced features. In particular, it makes it even easier to show and process aggregated information and to work with complex object models.
