Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 4 min | 762 words

The underlying assumption for any distributed system is that the network is hostile. That assumption is pervasive. If you open a socket, you have to be aware of malicious people on the other side (and in the middle). If you accept input from external sources, you have to be careful, because it may be carefully crafted to do Bad Things. In short, the network is hostile, and you need to protect yourself from harm at all levels.

In many cases, you already have multiple layers of protection. For example, when building web applications, the HTTP server already validates that the incoming stream follows the HTTP protocol. Even ignoring maliciousness, you will get services that connect to you using the wrong protocol and create havoc. For reference, look at 1347703880, 1213486160 and 1195725856 and the issues they cause. As it turns out, these are relatively benign issues, because they are caught almost immediately. In the real world, the network isn’t only hostile, it is also smart.
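To see where those three numbers come from, here is a small sketch that decodes each one back into the four bytes it represents. The likely scenario (an assumption on my part, not something spelled out above) is a binary protocol that reads the first four bytes of an incoming HTTP message as a 32-bit integer, such as a length prefix. The helper name is made up for illustration.

```python
def first_four_bytes(value, byteorder="big"):
    """Return the four bytes that encode `value` as a 32-bit integer."""
    return value.to_bytes(4, byteorder)


# Show both byte orders, since the reader's endianness decides which text it saw.
for number in (1347703880, 1213486160, 1195725856):
    print(number, first_four_bytes(number, "big"), first_four_bytes(number, "little"))
```

Read as big-endian, 1213486160 is the text HTTP and 1195725856 is the text GET followed by a space. 1347703880 is HTTP again, this time read as little-endian.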

The problem was originally posed by Lamport in the Byzantine Generals paper. You have a group of generals that needs to agree on a particular time to attack a city. They can only communicate by (unreliable) messenger, and one or more of them are traitors. The paper itself is interesting to read, and the problem is pervasive enough that we now divide distributed systems into Byzantine and non-Byzantine systems. We now have pervasive cryptography deployed, to the point where you read this post over an encrypted channel, verified using public key infrastructure to validate that it indeed came from me. You can solve the Byzantine generals problem easily now.

Today, the terminology has changed. We now refer to Byzantine networks as systems where some of the nodes are malicious and non-Byzantine as systems where we trust that other nodes will do their task. For example, Raft and Paxos are both distributed consensus algorithms that assume a non-Byzantine system. Oh, the network communication goes through a hostile environment, but that is what we have TLS for. Authentication and encryption over the wire are mostly a solved problem at this point. It isn’t a simple problem, but it is a solved one.

So where would you run into Byzantine systems today? The obvious examples are cryptocurrencies and BitTorrent. In both cases, you have a distributed environment with incentives for the other side to cheat. In the case of cryptocurrencies, this is handled by proof of work / proof of stake as well as the cost of getting to a 51% majority on the network. In the case of BitTorrent, it is an attempt to get peers to both download and upload. These examples are the first ones that come to mind, but in reality, the most common tool to use with a Byzantine network is the browser.

The browser has to assume that every site is malicious, and the web server has to assume that each client is malicious. For that matter, every server has to assume that every client is malicious as well. You only have to read through the OWASP listing to understand that.

And how does this relate to databases? Distributed databases are composed of independent nodes, which cooperate to store and process your data. Most of the generally available database systems assume a non-Byzantine model. In other words, they authenticate the other nodes, but once past the authentication, the other node is trusted to operate as expected.

For the most part, that is a reasonable assumption to make. You run your database on machines that you tend to trust, after all. And assuming a non-Byzantine system allows for a drastically simpler system design and much higher performance.

However, we are starting to see more and more systems deployed on the edge. And that raises an interesting question: who controls the edge? Let’s assume that we have a traffic monitoring system, based on software that is running on your phone. While it may be all part of a single system, you have to take into account that you are now running on a device that is controlled by someone else, who may modify / change it at will.

That leads to interesting issues with regard to the design of such a system. On the one hand, you want to get data from the nodes in the field, but on the other hand, you need to be careful about trusting those nodes.

How would you approach such a system? Keep in mind that you want to reduce, as much as possible, the complexity of the system while not breaching its security.

time to read 3 min | 588 words

I ran into this blog post talking about using a real programming language for defining your configuration. I couldn’t agree more, I wrote about it 15 years ago. In fact, I agree so much I wrote a whole book about the topic.

Configuration is fairly simple, on its face. You need to pass some values to a program to execute. In the simplest form, you simply have a map of strings. If you need hierarchy, you can use dots (.) or slashes (/) for readability. A good example is:

As the original blog post notes, you also need to have comments in the format, if the file is meant to be human readable / editable. From that format, you can transform it to the desired result.  Other formats, such as JSON / YAML / XML are effectively all variations on the same thing.

Note that configuration usually takes a non trivial amount of work to read properly, in particular if you have to run validations. For example, the port above must be greater than 1024 and less than 16,384. The log’s level can be either a numeric value or one of a small set of terms, etc.
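To make the validation work concrete, here is a minimal sketch of reading those two values from a flat map of strings. The key names (`server.port`, `log.level`) and the table of level names are assumptions made up for illustration. Only the port range and the "numeric or one of a small set of terms" rule come from the text above.

```python
# Hypothetical mapping of level names to numeric values.
LOG_LEVELS = {"debug": 10, "info": 20, "warn": 30, "error": 40}


def parse_port(raw):
    """The port must be greater than 1024 and less than 16,384."""
    port = int(raw)
    if not 1024 < port < 16384:
        raise ValueError(f"port {port} is outside the allowed range")
    return port


def parse_log_level(raw):
    """Accept either a numeric value or one of a small set of names."""
    text = raw.strip().lower()
    if text.isdigit():
        return int(text)
    if text in LOG_LEVELS:
        return LOG_LEVELS[text]
    raise ValueError(f"unknown log level: {raw!r}")


def load(settings):
    """Turn the raw map of strings into validated, typed values."""
    return {
        "port": parse_port(settings["server.port"]),
        "log_level": parse_log_level(settings["log.level"]),
    }


print(load({"server.port": "8080", "log.level": "Info"}))
```

Even for two settings, most of the code is validation rather than reading, which is the point being made here.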

The original post talked a lot about reusing configuration, which is interesting. Here is a blog post from 2007 showing exactly that. I’m using a loop to configure an IoC Container dynamically:

However, after doing similar things for a long while, I think that the most important aspect of this kind of capability has been missed. It isn’t about being able to loop in your configuration. That is certainly nice, but it isn’t the killer feature. The killer feature is that you don’t need to have a complex configuration subsystem.

In the case above, you can see that we are doing dynamic type discovery. I can do that in the INI example by specifying something like:

I would need to go ahead and write all the discovery code in the app. And the kind of things that I can do here are fixed. I can’t manage them on the fly and change them per configuration.

Here is another good example, passwords. In the most basic form, you can store passwords in plain text inside your configuration files. That is… not generally a good thing. So you might put them in a separate file. Or maybe use DPAPI on Windows to secure them. Something like this:

I have to write separate code for each one of those options. Now, I get a requirement that I need to use Azure Vault in one customer. And in another, they use a Hardware Security Module that we have to integrate with, etc.

Instead of having to do it all in the software, I can push that kind of behavior to the configuration. The script we’ll run can run arbitrary operations to gather its data, including custom stuff defined on site for the specific use case.

That gives you a lot of power, especially when you get a list of integration options that you have to work with. Not having to build all of those yourself is huge. That is how RavenDB works, allowing you to shell out to a script for specific values. It means that we have a lot less work to do inside of RavenDB.

With RavenDB, we have gone with a hybrid approach. For most things, you define the configuration using a simple JSON file, and we allow you to shell out to scripts for the more complex / dynamic features. That ends up being quite nice to use and work with.
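Here is a rough sketch of what such a hybrid can look like: a JSON file where most values are plain, and a value can instead name a command whose output becomes the value. This is not RavenDB's actual configuration format. The `exec` key, the setting names and the inline script are all made up for illustration.

```python
import json
import subprocess
import sys


def resolve(value):
    """Plain values pass through; {"exec": [...]} runs a command and uses its stdout."""
    if isinstance(value, dict) and "exec" in value:
        result = subprocess.run(value["exec"], capture_output=True, text=True, check=True)
        return result.stdout.strip()
    return value


def load_config(text):
    """Parse the JSON document and resolve every top-level setting."""
    return {key: resolve(value) for key, value in json.loads(text).items()}


# The "script" here is just an inline Python one-liner standing in for
# whatever site-specific tool fetches the secret (vault client, HSM, etc.).
raw = json.dumps({
    "server.port": 8080,
    "db.password": {"exec": [sys.executable, "-c", "print('s3cret')"]},
})
config = load_config(raw)
print(config)
```

The application only knows how to run a command and read its output. How the secret is actually stored is decided by whoever writes the script on site.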

time to read 6 min | 1077 words

I ran across this article, which talks about unit testing. There isn’t anything there that would be ground breaking, but I ran across this quote, and I felt that I had to write a post to answer it.

The goal of unit testing is to segregate each part of the program and test that the individual parts are working correctly. It isolates the smallest piece of testable software from the remainder of the code and determines whether it behaves exactly as you expect.

This is a fairly common talking point when people discuss unit testing. Note that this isn’t the goal. The goal is what you want to achieve; this is a method of applying unit testing. Some of the benefits of unit testing are:

  • Makes the Process Agile
  • Facilitates Changes and Simplifies Integration

There are other items in the list on the article, but you can just read it there. I want to focus right now on the items above, because they are directly contradicted by separating each part of the program and testing it individually, as is usually applied in software projects.

Here are a few examples from posts I wrote over the years. The common pattern is that you’ll have interfaces, and repositories and services and abstractions galore. That will allow you to test just a small piece of your code, separate from everything else that you have.

This is great for unit testing. But unit testing isn’t a goal in itself. The point is to enable change down the line, to ensure that we aren’t breaking things that used to work, etc.

An interesting thing happens when you have this kind of architecture (and especially if you have this specifically so you can unit test it): it becomes very hard to make changes to the system. That is because the number of times you repeated yourself has grown. You have something once in the code and a second time in the tests.

Let’s consider something rather trivial. We have the following operation in our system, sending money:

image

A business rule says that we can’t send money if we don’t have enough in our account. Let’s see how we may implement it:

This seems reasonable at first glance. We have a lot of rules around money transfer, and we expect to have more of these in the future, so we created the IMoneyTransferValidationRules abstraction to model that and we can easily add new rules as time goes by. Nothing objectionable about that, right? And this is important, so we’ll have unit tests for each one of those rules.

During the last stages of the system, we realize that each one of those rules generates a bunch of queries to the database and that when we have load on the system, the transfer operation will create too much pain as it currently stands. There are a few options that we have available at this point:

  • Instead of running individual operations that will each load their data, we’ll load the data once for all of them. Here is what this will look like:

As you can see, we now have a way to use Lazy queries to reduce the number of remote calls this will generate.

  • Instead of taking the data from the database and checking it, we’ll send the check script to the database and do the validation there.

And here we moved pretty much the same overall architecture directly into the database itself. So we’ll not have to pay the cost of remote calls when we need to access more information.

The common thing for both approaches is that they are perfectly in line with the old way of doing things. We aren’t talking about a major conceptual change. We just changed things so that it is easier to work with properly.

What about the tests?

If we tested each one of the rules independently, we now have a problem. All of those tests will now require non trivial modification. That means that instead of allowing change, the tests now serve as a barrier for change. They have set our architecture and code in concrete and make it harder to make changes. If those changes were bugs, that would be great. But in this case, we don’t want to modify the system behavior, only how it achieves its end result.

The key issue with unit testing the system as a set of individually separated components is the concept that there is value in each component independently. There isn’t. “The whole is greater than the sum of its parts” is very much in play here.

If we had tests that looked at the system as a whole, those wouldn’t break. They would continue to serve us properly and validate that this big change we made didn’t break anything. Furthermore, at the edges of the system, changing the way things are happening usually is a problem. We might have external clients or additional APIs that rely on us, after all. So changing the exterior is something that I want to enforce with tests.
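As a rough illustration of a test that looks at the system as a whole, here is a sketch of the "not enough money" rule exercised only through the public operation. The `Bank` class, its method names and the exception are all hypothetical stand-ins, not the code from the original example.

```python
class InsufficientFunds(Exception):
    pass


class Bank:
    def __init__(self, balances):
        self._balances = dict(balances)

    def balance(self, account):
        return self._balances[account]

    def send_money(self, source, target, amount):
        # How the rules are evaluated (rule objects, one batched query,
        # a script running inside the database) is an internal detail.
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self._balances[source] < amount:
            raise InsufficientFunds(source)
        self._balances[source] -= amount
        self._balances[target] = self._balances.get(target, 0) + amount


def test_cannot_send_more_than_we_have():
    bank = Bank({"alice": 50, "bob": 0})
    try:
        bank.send_money("alice", "bob", 80)
    except InsufficientFunds:
        pass
    else:
        raise AssertionError("expected the transfer to be rejected")
    # Nothing moved, no matter how the validation was implemented.
    assert bank.balance("alice") == 50
    assert bank.balance("bob") == 0


def test_successful_transfer_moves_the_money():
    bank = Bank({"alice": 50, "bob": 0})
    bank.send_money("alice", "bob", 30)
    assert (bank.balance("alice"), bank.balance("bob")) == (20, 30)


test_cannot_send_more_than_we_have()
test_successful_transfer_moves_the_money()
```

Neither test mentions the validation rules, so reorganizing how they load their data would not require touching the tests.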

That said, when you build your testing strategy, you may have to make allowances. It is very important for the tests to run as quickly as possible. Slow feedback cycles can be incredibly annoying and will kill productivity. If there are specific components in your system that are slow, it makes sense to insert seams to replace them. For example, if you have a certificate generation bit in your system (which can take a long time), you might want the tests to return a certificate that was prepared ahead of time. Or if you are working with a remote database, you may want to use an in memory version of that. An external API you’ll want to mock, etc.

The key here isn’t that you are trying to look at things in isolation, the key is that you are trying to isolate things that are preventing you from getting quick feedback on the state of the system.

In short, unless there is uncertainty about a particular component (implementing new algorithm or data structure, exploring unfamiliar library, using 3rd party code, etc), I wouldn’t worry about testing that in isolation. Test it from outside, as a user would (note that this may take some work to enable that as an option) and you’ll end up with a far more robust testing infrastructure.

time to read 1 min | 135 words

I posted about our RavenDB C++ client a while ago, but I was really bad about making sure that we have regular updates. We have actually finished it already, there are even articles about it available now. The article was written by Michael Yarichuk and covers getting started and some of the basic steps to get running in C++ with RavenDB.

We had a very simple challenge in building our C++ client. We want to give you the same level of comfort and feature set in C++ as you would get in a managed language, while keeping the same level of performance you’d expect from a C++ application. I think we have done so quite successfully. You can read the article for the full details. No gore included :-).

time to read 3 min | 532 words

Once you put a document inside RavenDB, this is pretty much it, as far as RavenDB is concerned. It will keep your data safe, allow you to query it, etc. But it doesn’t generally act upon it. There are a few exceptions, however.

RavenDB supports the @expires metadata attribute. This attribute allows you to specify a time at which RavenDB will automatically delete the document. This is very useful for expiring documents. The classic example is a password reset token, which should be valid for a period of time and then removed.

Here is what this looks like:

image

And you can configure the frequency at which we’ll check for expired documents in the studio.

image

Expiring documents, however, isn’t all that RavenDB can do. RavenDB also has an additional feature, refreshing documents. You can mark a document to be refreshed by specifying the @refresh metadata attribute, like so:

image

It is easy to understand what @expires does. At a given time, it will delete the document, because it expired. But what does refresh do? Well, at the specified time, a document with the @refresh metadata attribute will be updated by RavenDB to remove the @refresh metadata attribute from the document.

Yep, that is all. In other words, the document above would turn into:

image

That is all. Surely this is the most useless feature ever. You set a property that will be removed at a future time, but the only thing that the property can say is when to remove itself. What kind of feature is this?

Well, this is a case where by itself, this would be a pretty useless feature. But the point of this feature is that this will cause the document to be updated. At that point, it is a normal update, which means that:

  • The document will be re-indexed.
  • The document will be sent over ETL.
  • The document will be sent to the relevant subscriptions.

The last point is the most important one. Here is an example of a typical subscription:

As you can see, this is a pretty trivial subscription, but it filters out commands that are set to refresh. What does this mean? It means that if the @refresh attribute is set, we’ll ignore the document. But since RavenDB will automatically clear the attribute when the refresh timer is hit, we gain a powerful ability.

We now have the ability to process delayed commands. In other words, you can save a document with a refresh and have it processed by a subscription at a given time.
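The mechanics can be modeled in a few lines. This is a toy model of the behavior described above, not RavenDB's API: documents are plain dictionaries, and the two function names and the sample command document are made up for illustration.

```python
from datetime import datetime, timezone


def apply_refresh(doc, now):
    """Drop @refresh once its time has passed. Returns True if the document was updated."""
    metadata = doc.get("@metadata", {})
    due = metadata.get("@refresh")
    if due is None or datetime.fromisoformat(due) > now:
        return False
    del metadata["@refresh"]
    return True


def subscription_accepts(doc):
    """Model of a subscription filter that skips documents still waiting for a refresh."""
    return "@refresh" not in doc.get("@metadata", {})


command = {
    "Type": "send-reminder",
    "@metadata": {"@refresh": "2020-07-01T12:00:00+00:00"},
}

before = datetime(2020, 7, 1, 11, 0, tzinfo=timezone.utc)
after = datetime(2020, 7, 1, 13, 0, tzinfo=timezone.utc)

print(subscription_accepts(command))   # still delayed
print(apply_refresh(command, before))  # not due yet, nothing changes
print(apply_refresh(command, after))   # due: attribute removed, document updated
print(subscription_accepts(command))   # now visible to the subscription
```

The filter itself never looks at the clock. The delay comes entirely from the update that removes the attribute.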

Expanding on this, you can do the same using ETL. So you have a document that will be sent over to the ETL destination at a given time. You can also do the same for indexing as well.

And now this seemingly trivial / useless feature becomes a pivot for a whole new set of capabilities that you get with RavenDB.

time to read 4 min | 698 words

A map/reduce index in RavenDB can be configured to output its value to a collection. This seems like a strange thing to want to do at first. We already got the results of the index, in the index. Why do we want to duplicate that by writing them to collections?

As it turns out, this is a pretty cool feature, because it enables us to do quite a lot. It means that we can apply anything that works on documents to the results of a map/reduce index. This list includes:

  • Map/Reduce – so you can create recursive / chained map/reduce operations.
  • ETL – so you can push aggregated data to another location, allowing distributed aggregation at scale easily.
  • Subscription / Changes – so you can get notified when an aggregated value has been changed.

The key about the list above is that none of these requires you to know upfront the id of the generated documents. Indeed, RavenDB uses document ids like the following for such documents:

image

Technically speaking, you can compute the id. RavenDB uses a predictable algorithm to generate such an id, but practically speaking, it can be hard to figure out exactly what the inputs are for the id generation. That means that certain document related features are not available. In particular, you can’t easily:

  • Include such a document
  • Load it directly (you have to query)

So we need a better option to deal with it. The way RavenDB solves this issue is by allowing you to specify a pattern for the output collection, like so:

image

As you can see, we have a map/reduce index that groups by the company and year (marked in blue). We output the collection to YearlySummary, as shown in the previous image.

The pattern (marked in red) specifies how we should name the output documents. Here is the result of this index:

image

And here is what this document looks like:

image

Huh?

This is strange, you probably think. This is the document we need to show the summary for companies/9-A in 1998, but there is no such data here. Instead, you’ll notice that the document collection is references (marked in red) and that it points to (marked in blue) the actual document with the data. Why do we do things this way?

A map/reduce index is free to output multiple results for the same reduce key, so we need to handle multiple documents here. We also have to deal with multiple reduce outputs that end up with the same pattern. For example, if we use map/reduce by day, but our pattern only specifies the month, we’ll have multiple reduce keys that end up with the same pattern.
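A toy model may help show why the reference document holds a list rather than the data itself. This is only a sketch of the idea, not how RavenDB implements it: the output document ids, the `sales/...` patterns and the use of Python format strings are all assumptions for illustration.

```python
from collections import defaultdict


def build_references(outputs, pattern):
    """Group output document ids under the reference id produced by the pattern."""
    references = defaultdict(list)
    for doc_id, reduce_key in outputs.items():
        references[pattern.format(**reduce_key)].append(doc_id)
    return dict(references)


# Hypothetical reduce outputs, keyed by made-up document ids.
outputs = {
    "YearlySummary/1": {"Company": "companies/9-A", "Year": 1997},
    "YearlySummary/2": {"Company": "companies/9-A", "Year": 1998},
    "YearlySummary/3": {"Company": "companies/7-A", "Year": 1998},
}

# A pattern that uses the whole reduce key: one output per reference.
by_company_and_year = build_references(outputs, "sales/{Company}/{Year}")

# A pattern that uses only part of the key: several outputs share a reference.
by_company = build_references(outputs, "sales/{Company}")

print(by_company_and_year)
print(by_company)
```

With the narrower pattern, the reference for companies/9-A points at both yearly outputs, which is the case the list of ids exists to handle.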

In practice, because RavenDB has great support for following documents by id, it doesn’t matter. Here is how I can use this index in a query:

This single query allows us to ask a question about companies (those that reside in London, in this case), as well as sales total data for a particular year. Note that this doesn’t do any joins or anything expensive. We have the information at hand, and can just use it.

You’ll notice that the pattern we specified is using both items that we reduce by. But that isn’t mandatory. We can also use this:

image

Here we only specify the company in the pattern. What would be the result?

image

Now we get the sales total for the company, on a per year basis.

We can now run the following query:

And this will give us the following output:

image

As you can imagine, this opens up quite a few possibilities for advanced features. In particular, it means that you can make it even easier for you to show and process aggregate information and work through complex object models.

time to read 2 min | 271 words

After a long journey, I have an actual data structure implemented. I only lightly tested it, and didn’t really do too much with it. In fact, as it currently stands, I didn’t even implement a way to delete the table. I relied on closing the process to release the memory.

It sounds like a silly omission, right? Something that is easily fixed. But I ran into a tricky problem with implementing this. Let’s write the simplest free method we can:

Simple enough, no? But let’s look at one setup of the table, shall we?

As you can see, I have a list of buckets, each of which points to a page. However, multiple buckets may point to the same page. The code above is going to double free address 0x00748000!

I need some way to handle this properly, but I can’t actually keep track of whether I already deleted a bucket. That would require a hash table, and I’m trying to delete one :-). I also can’t track it in the memory that I’m going to free, because I can’t access it after free() was called. So what to do?

I thought about this for a while, and I came up with the following solution.

What is going on here? Because we may have duplicates, we first sort the buckets. We want to sort them by the value of the pointer. Then we simply scan through the list and ignore the duplicates, freeing each bucket only once.

There is a certain elegance to it, even if the qsort() usage is really bad, in terms of ergonomics (and performance).
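The sort-then-skip-duplicates idea is independent of C, so here is a small model of it. Addresses are plain integers and `release` stands in for free(). The function name and the sample addresses (other than 0x00748000, which is mentioned above) are made up for illustration.

```python
def free_pages(buckets, release):
    """Release every distinct page address exactly once. Returns the number released."""
    released = 0
    previous = None
    # Sorting puts buckets that share a page next to each other,
    # so remembering only the last address is enough to spot duplicates.
    for address in sorted(buckets):
        if address == previous:
            continue
        release(address)
        previous = address
        released += 1
    return released


already_released = set()


def release(address):
    # Stand-in for free(): complain loudly on a double free.
    if address in already_released:
        raise RuntimeError(f"double free of {address:#010x}")
    already_released.add(address)


buckets = [0x00748000, 0x0074C000, 0x00748000, 0x00750000, 0x0074C000]
print(free_pages(buckets, release))
```

No extra bookkeeping structure is needed beyond the sorted order and one remembered value.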

time to read 3 min | 582 words

The naïve overflow handling I wrote previously kept me up at night. I really don’t like it. I finally figured out what I could do to handle this in an elegant fashion.

The idea is to:

  • Find the furthest non overflow piece from the current one.
  • Read its keys and try to assign them to its natural location.
  • If we successfully moved all the non native keys, mark the previous piece as non overflowing.
  • Go back to the previous piece and do it all over again.

Maybe it will be better to look at it in code?

There is quite a lot that is going on here, to be frank. We call this method after we deleted a value and caused a piece to become completely empty. At this point, we scan the next pieces to see how far we have to go to find the end of the overflow chain. We then proceed from the end of the chain backward. We try to move all the keys in the piece that aren’t native to the piece to their proper place. If we are successful, we mark the previous piece as non overflowing, and then go back one piece and continue working.

I intentionally scan more pieces than the usual 16 limit we use for put, because I want to reduce overflows as much as possible (to improve lookup times). To reduce the search costs, we only search within the current chain, and I know that the worst case scenario for that is 29 in truly random cases.

This should amortize the cost of fixing the overflows on deletes to a high degree, I hope.

Next, we need to figure out what to do about compaction. Given that we are already doing some deletion book keeping when we clear a piece, I’m going to also do compaction only when a piece is emptied. For that matter, I think it makes sense to only do a page level compaction attempt when the piece we just cleared is still empty after an overflow merge attempt. Here is the logic:

Page compaction is done by finding a page’s sibling and seeing if we can merge them together. A sibling page is the page that shares the same key prefix with the current page except for a single bit. We need to check that we can actually do the compaction, which means that there are enough leaf pages, that the sizes of the two pages are small enough, etc. There are a lot of scenarios we are handling in this code. We verify that even if we have enough space theoretically, the keys distribution may cause us to avoid doing this merge.
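The "same prefix except a single bit" relationship can be written as one XOR. This is a sketch under an assumption the text doesn't state: that the bit that differs is the highest bit of the prefix at the page's depth. The function names and the fill limit are hypothetical, and the real code checks more conditions than this.

```python
def sibling_prefix(prefix, depth):
    """Flip the highest of the `depth` prefix bits to get the sibling's prefix."""
    if depth < 1:
        raise ValueError("a page at depth 0 has no sibling")
    return prefix ^ (1 << (depth - 1))


def can_merge(used_a, used_b, page_capacity, fill_limit=0.75):
    """Only merge if the combined contents leave some slack in the page."""
    return used_a + used_b <= page_capacity * fill_limit


print(bin(sibling_prefix(0b101, 3)))  # prefix 101 at depth 3 pairs with 001
print(can_merge(20, 25, 64))
print(can_merge(30, 30, 64))
```

Applying `sibling_prefix` twice returns the original prefix, so the relationship is symmetric: each page is its sibling's sibling.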

Finally, we need to handle the most complex parts. We re-assign the buckets in the hash, then we see if we can reduce the number of buckets and eventually the amount of memory that the directory takes. The code isn’t trivial, but it isn’t really complex, just doing a lot of things:

With this, I think that I tackled the most complex pieces of this data structure. I wrote the code in C because it is fun to get out and do things in another environment. I’m pretty sure that there are bugs galore in the implementation, but that is a good enough proof of concept to do everything that I wanted it to do.

However, writing this in C, there is one thing that I didn’t handle: actually destroying the hash table. As it turns out, this is actually tricky, so I’ll handle that in my next post.
