Ayende @ Rahien


Repeatable random tests

time to read 3 min | 460 words

Testing our software is something that we take very seriously. And in some cases, we want to go beyond testing stuff that we know. We want to test random stuff. For example, if we add 10,000 documents, then remove every 17th, what happens? Are there any differences in behavior, performance, etc.?

It is easy to do random stuff, of course. But that leads to an interesting case. As long as the tests are passing, you can pat yourself on the back: “We have done good, and everything works as it should”.

But when something fails… well, the only thing that you know is that something did fail. You don’t have a way to reproduce this. Because the test is… random.

In order to handle that, we wrote the following code:

 using System;
 using System.Collections.Generic;
 using System.Reflection;
 using Xunit.Extensions; // DataAttribute lives here in xUnit 1.x

 [AttributeUsage(AttributeTargets.Method, AllowMultiple = true)]
 public class InlineDataWithRandomSeed : DataAttribute
 {
     public InlineDataWithRandomSeed(params object[] dataValues)
     {
         this.DataValues = dataValues ?? new object[] { null };
     }

     public object[] DataValues { get; set; }

     public override IEnumerable<object[]> GetData(MethodInfo methodUnderTest, Type[] parameterTypes)
     {
         // Pass the explicit values through, and append a random seed as the
         // last parameter, so a failing run can be reproduced with that seed.
         var objects = new object[DataValues.Length + 1];
         Array.Copy(DataValues, 0, objects, 0, DataValues.Length);
         objects[DataValues.Length] = Environment.TickCount;
         yield return objects;
     }
 }

This is using XUnit, which gives us the ability to add a seed to the test. Let us see how the test looks:

image
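To give a concrete idea, a minimal sketch of what such a test might look like (the test name, parameters and body here are made up for illustration; only the attribute comes from the code above):

 using System;
 using System.Collections.Generic;
 using Xunit;
 using Xunit.Extensions; // [Theory] lives here in xUnit 1.x

 public class RandomizedTests
 {
     [Theory]
     [InlineDataWithRandomSeed(10000)]
     public void AddingAndRemovingDocuments(int count, int seed)
     {
         // The seed is supplied by the attribute; a failing run can be
         // replayed by plugging the reported seed into an [InlineData(10000, seed)].
         var random = new Random(seed);
         var docs = new List<int>();
         for (var i = 0; i < count; i++)
             docs.Add(random.Next());

         // remove every 17th document, as in the scenario described above
         for (var i = docs.Count - 17; i >= 0; i -= 17)
             docs.RemoveAt(i);

         Assert.Equal(count - count / 17, docs.Count);
     }
 }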

And this is what this looks like when it runs:

image

When we have a failure, we know what the seed is, and we can run the test with that seed, and see what exactly happened there.

Presenting, Highly Available & Scalable Solutions at GOTO Copenhagen

time to read 1 min | 168 words

I’ll be presenting at the GOTO Copenhagen conference on Oct 7 – 8 this year. The full session summary is:

Presentation: Highly Available & Scalable Solutions with RavenDB

Track: Solutions Track 1 / Time: Monday 13:20 - 14:10 / Location: Rosenborg

RavenDB is a 2nd generation document database, with built-in load distribution, seamless replication, disaster recovery and data-driven sharding.

In this session, we are going to explore how RavenDB deals with scaling under load and remains highly available even under failure conditions.

We'll see how RavenDB's data-driven sharding allows us to increase the amount of data in our cluster without giving up the benefits of data locality.

We are going to execute complex distributed map-reduce queries on a sharded cluster, giving you lightning-fast responses over very large data volumes.

Hibernating Rhinos will also be presenting at a booth, and we’ll have a few members of the core team there to talk about RavenDB and the cool things that you can do with it.

Code reading: Wukong full-text search engine

time to read 11 min | 2033 words

I like reading code, and recently I was mostly busy with moving our offices, worrying about insurance, lease contracts and all sorts of other stuff that is required, but not much fun. So I decided to spend a few hours just going through random code bases and see what I’ll find.

I decided to go with Go projects, because that is a language that I can easily read, and I headed out to this page. Then I basically scanned the listing for any projects that took my fancy. The current one is Wukong, which is a full text search engine. I’m always interested in those, since Lucene is a core part of RavenDB, and knowing how others are implementing this gives you more options.

This isn’t going to be a full review, merely a record of my exploration into the code base. There is one problem with the codebase, as you can see here:

image

All the text is in Chinese, which I know absolutely nothing about, so comments aren’t going to help here. But so far, the code looks solid. I’ll start from the high level overview:

image

We can see that we have a searcher (of type Engine), and we add documents to it, then we flush the index, and then we can search it.

Inside the engine.Init, the first interesting thing is this:

image 

This is interesting because we see sharding from the get go. By default, there are two shards. I’m not sure what the indexers are, or what they are for, yet.

Note that there is an Indexer and a Ranker for each shard. Then we have a bunch of initialization routines that look like so:

image

So we create two arrays of channels (the core Go synchronization primitive), and initialize them with a channel per shard. That is a pretty standard Go pattern, and usually there is a go routine spinning on each of those channels. That means that we have (by default) two “threads” for adding documents, and two for doing lookups. No idea what either of those is for yet.
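For .NET readers, a rough analogue of this “channel per shard, worker per channel” setup might look like the following sketch, with BlockingCollection standing in for the Go channel. All types and names here are made up for illustration; this is not the Wukong code:

 using System;
 using System.Collections.Concurrent;
 using System.Threading.Tasks;

 public class ShardedIndexer
 {
     private readonly BlockingCollection<string>[] addDocumentChannels;

     public ShardedIndexer(int shardCount)
     {
         addDocumentChannels = new BlockingCollection<string>[shardCount];
         for (var shard = 0; shard < shardCount; shard++)
         {
             var channel = new BlockingCollection<string>();
             addDocumentChannels[shard] = channel;
             // one long running worker per shard, spinning on its own channel
             Task.Factory.StartNew(() =>
             {
                 foreach (var document in channel.GetConsumingEnumerable())
                 {
                     // index the document into this shard's in-memory tables ...
                 }
             }, TaskCreationOptions.LongRunning);
         }
     }

     public void AddDocument(string docId, string content)
     {
         // route by hashing the doc id, which is roughly what the engine does
         var shard = (docId.GetHashCode() & 0x7fffffff) % addDocumentChannels.Length;
         addDocumentChannels[shard].Add(content);
     }
 }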

There is a lot more like this, for ranking, removing ranks, search ranks, etc., but that is just more of the same, and I’ll ignore it for now. Finally, we see some action when we start producing threads to do actual work:

image

As you can see, we spin off go routines (which will communicate with the channels we already saw) to do the actual work. Note that per shard, we’ll have as many index lookup and rank workers as we have CPUs.

There is an option to persist to disk, using the kv package, which looks like this:

image

On the one hand, I really like the “let us just import a package from GitHub” option that Go has. On the other hand, version control has got to be a major issue here for big projects.

Anyway, let us get back to the actual hot path I’m interested in, indexing documents. This is done by this function:

image

Note that we have to supply the docId externally, instead of creating it ourselves. Then we just hash the docId and the actual content and send the data to the segmenter to do its work. Currently I think that the work of the segmenter is similar to the work of the analyzer in Lucene: breaking the content apart into discrete terms. Let us check…

image

This certainly seems to be the case here, yep. This code runs on as many cores as the machine has, each one handling a single segmentation request. Once that work is done, it is sent to another thread to actually do the indexing:

image

Note that the indexerRequest is basically an array of all the distinct tokens, their frequencies and the start positions in the original document. There is also handling for ranking, but for now I don’t care about this, so I’ll ignore this.

Sending to the indexing channel will end up invoking this code:

image

And the actual indexing work is done inside AddDocument. A simplified version of it is shown here:

image

An indexer is protected by a read/write mutex (which explains why we want sharding: it gives us better concurrency without having to go to concurrent data structures, and it drastically simplifies the code).

So, what is going on in here? For each of the keywords we found in the document, we create a new entry in the table’s dictionary, with a value that contains the matching document ids. If there are already documents for this keyword, we’ll search for the appropriate position in the index (using a simple binary search), then place the document in the appropriate location. Basically, the KeywordIndices is (simplified) a Dictionary<Term : string, SortedList<DocId : long>>.
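In C# terms, a heavily simplified sketch of that structure and of the insert logic described above might look like this (an illustration of the idea, not a translation of the actual Go code; a plain List kept sorted stands in for the SortedList in the description):

 using System.Collections.Generic;

 public class KeywordIndices
 {
     // term -> sorted list of the ids of the documents containing it
     private readonly Dictionary<string, List<long>> table = new Dictionary<string, List<long>>();

     public void Add(string keyword, long docId)
     {
         List<long> docIds;
         if (table.TryGetValue(keyword, out docIds) == false)
         {
             table[keyword] = new List<long> { docId };
             return;
         }
         // binary search for the insertion point, keeping the doc id list sorted
         var index = docIds.BinarySearch(docId);
         if (index < 0)
             docIds.Insert(~index, docId);
     }
 }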

So that pretty much explains how this works. Let us look at how searches are working now…

The first interesting thing that we do when we get a search request is tokenize it (segmentize it, in Wukong terminology):

image

Then we call this code:

image

This is fairly typical Go code. We create a bounded channel (with a capacity equal to the number of shards), and we send it to all the shards. We’ll get a reply from each of the shards, then do something with the combined results.

I like this type of interaction because it is easy to model concurrent interactions with it, and it is pervasive in Go. Seems simpler than the comparable strategies in .NET, for example.
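For comparison, a task-based .NET sketch of the same fan-out / fan-in idea might look like this (the shard lookup is a stand-in delegate, and the names here are made up):

 using System;
 using System.Collections.Generic;
 using System.Linq;
 using System.Threading.Tasks;

 public static class ShardedSearch
 {
     public static async Task<List<long>> SearchAllShardsAsync(
         Func<int, Task<List<long>>> searchShard, int shardCount)
     {
         // fan out: ask every shard; fan in: wait for all the replies
         var perShard = await Task.WhenAll(
             Enumerable.Range(0, shardCount).Select(shard => searchShard(shard)));

         // like the Go version, no ordering guarantees, just merge the results
         return perShard.SelectMany(docs => docs).ToList();
     }
 }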

Here is a simple example of how this is used:

image

This is the variant without the timeout. We are just getting the results from all the shards. Note that we don’t have any ordering here; we just add all the documents into one big array. I’m not sure how (or if) Wukong supports sorting. There was a lot of stuff about ranking earlier in the codebase, but it isn’t apparent in what I have seen so far, so I’ll look at it later. For now, let us see how a single shard handles a search lookup request.

What I find most interesting is that there are rank options there, and document ids, and I’m not sure what they are used for. We’ll look at that later. For now, the first part of looking up a search term is here:

image

We take a read lock, then look at the table. We need to find entries for all the keywords that we have in the query to get a result back.

This indicates that a query to Wukong has an implicit AND between all the terms in the query. The result of this is an array with all the indices for each keyword. It then continues to perform a set intersection between all the matching keywords, to find all the documents that appear in all of them. Following that, it will compute BM25 (a TF-IDF function that is used to compute ranking). After looking at the code, I found where it actually computes the ranking: it does that after getting the results from all the shards, and then it sorts them according to their overall scores.
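The set intersection itself is straightforward when the doc id lists are kept sorted. A quick sketch of how that can be done (illustrative only, not the actual Wukong code):

 using System.Collections.Generic;

 public static class PostingLists
 {
     // intersect two sorted doc id lists, the implicit AND between query terms
     public static List<long> Intersect(List<long> left, List<long> right)
     {
         var result = new List<long>();
         int i = 0, j = 0;
         while (i < left.Count && j < right.Count)
         {
             if (left[i] == right[j])
             {
                 result.Add(left[i]);
                 i++;
                 j++;
             }
             else if (left[i] < right[j])
                 i++;
             else
                 j++;
         }
         return result;
     }
 }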

So that was clear enough, and it makes a lot of sense. Now, the last thing that I want to figure out before we are done is: how does Wukong handle deletions?

It turns out that I actually missed a part of the search process. The indexer will just find the matching documents and their BM25 scores. It is the ranker (which gets them from the indexer and then replies to the engine) that will actually sort them. This gives the user the chance to add their own scoring mechanism. Deletion is actually handled as a case where you have nothing to score with, and it gets filtered along the way as an uninteresting value. That means that the memory cost of having indexed a document cannot be alleviated by deleting it. The actual document data is still there and is kept.

It also means that there is no real facility to update a document. For example, say we have a document whose content used to say Ayende and we want to change it to Oren. We have no way of going to the Ayende keyword and removing the document from there. We need to delete the document and create a new one, with a new document id.

Another thing to note is that this has very little actual functionality. There is no possibility of making complex queries, or using multiple fields. Another thing that is very different from how Lucene works is that it runs entirely in memory. While it has a persistence option, that option is actually just a log of documents being added and removed. On startup, it will need to go through the log and index all of them again. That means that for large systems, the startup cost is going to be prohibitively expensive.

All in all, that is a nice codebase, and it is certainly simple enough to grasp without too much complexity. But one needs to be aware of the tradeoffs associated with actually using it. I expect it to be fast, although the numbers mentioned in the benchmark page (if I understand the translated Chinese correctly) are drastically below what I would expect to see. Just to give you some idea, 1,400 requests a second is a very small number for an in-memory index. I would expect something like 50,000 or so, assuming that you can get all cores to participate. But maybe they are measuring this through the network?

Turning bugs into features

time to read 1 min | 186 words

I have been using R# for over a decade now, and it has gotten to the point where I’m actually able to utilize R# bugs to get things working better for me.

In this case, the scenario is using Find Usages as a refactoring aid. I have a tricky refactoring to do, which requires me to touch several pieces of code. In order to handle this properly, I start by running Find Usages on the relevant item, in this case, ByView.

Then I go to each of those code locations and make the required modifications.

image

So far, so good, but in complex scenarios, it is hard to remember which portions I have done and which portions are still left undone. In order to handle this, I Ctrl+X, Ctrl+Z the line I find. R# detects this as a change to the code that invalidates the found usage, and suddenly, I have a nice todo list with a strikethrough for completed tasks.

Production postmortem: The case of the man in the middle

time to read 3 min | 553 words

One of the most frustrating things when you are dealing with production issues is when the problem is not in our product, but elsewhere. In particular, this post is dedicated to the hard work done by many anti virus products to make our life harder.

Let us take a look at the following quote, taken from the ESET NOD32 Anti Virus knowledge base (emphasis mine):

By default, your ESET product automatically detects programs that are used as web browsers and email clients, and adds them to the list of programs that the internal proxy scans. This can cause loss of internet connectivity or other undesired results with applications that use network features but are not web browsers/email clients.

Yes, it can. In fact, it very often does.

Previously, we looked at a similar issue with Anti Virus slowing down I/O enough to cause us to slowly die. But in this case, the issue is a lot more subtle.

Because it is doing content filtering, it tends to put a much higher overhead on the system resources, which means that as far as the user is concerned, RavenDB is slow. We actually developed features specifically to handle this scenario. The traffic watch mode will tell you how much time you spend on the server side, and we have added a feature that will make RavenDB account for the internal work each query is doing, so we can tell where the actual cost is.

You can enable that by issuing:

GET databases/Northwind/debug/enable-query-timing
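If you want to do that from code rather than the browser, a minimal sketch might look like this (the server URL is an assumption; adjust it to your own setup):

 using System;
 using System.Net.Http;

 class EnableQueryTiming
 {
     static void Main()
     {
         // assumes a RavenDB server listening on http://localhost:8080
         using (var client = new HttpClient { BaseAddress = new Uri("http://localhost:8080/") })
         {
             var response = client.GetAsync("databases/Northwind/debug/enable-query-timing").Result;
             Console.WriteLine(response.StatusCode);
         }
     }
 }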

And once that is set up, you can get a good idea about what is costly in the query, as far as RavenDB is concerned. Here is an example of a very slow query:

image

You can see that the issue is that we are issuing a very wide range query, so most of the time is spent inside Lucene. Other examples might be ridiculously complex queries, which result in high parsing time (we have seen queries in the hundreds of KB range). Or loading a lot of big documents, or… you get the drift. If we see that the server thinks that a query is fast, but the overall time is slow, we know to blame the network.

But an even more insidious issue is that this would drop requests, consistently and randomly (and yes, I know that those are contradictions: it was consistently dropping requests in a random pattern that seemed explicitly designed to thwart figuring out what is going on), leading to things breaking and escalated support calls. “RavenDB is broken” leads to a lot of headache, and a burning desire to hit something when you figure out that not only isn’t it your fault, but the underlying reason is actively trying to prevent you from figuring it out (I assume it is to deal with viruses that try to shut it off), which leads to really complex fact finding sessions.

That is all the more annoying because it seems that the issue was a bug in the AV product in question, in how it respected keep alive sessions for authenticated requests under some scenarios! Absolutely not fun!

Intersection of the complexities

time to read 3 min | 516 words

As you can imagine, I’m really interested in what is going on with storage engines. So I read the RocksDB Tuning Guide with interest. Actually, mounting horror was more like it, to be frank.

It all culminated pretty nicely with the final quote:

Unfortunately, configuring RocksDB optimally is not trivial. Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments…

Um… okaaaay.

That is a pretty scary statement. Now, to be fair, RocksDB is supposed to be a low level embedded storage engine; it isn’t meant to be something that is directly exposed to the user or operations people.

And yet…

I’m literally writing databases for a living, and I had a hard time following all the options they had there. And it appears, from the final quote, that the developers of RocksDB are also at a loss here. It is a very problematic state to be in, because optimization and knowing how to get the whole thing working are a pretty important part of using a database engine. And if your optimization process relies on “change a bunch of settings and see what happens”, that is not kind at all.

Remember, RocksDB is actually based on LevelDB, which is a database that was designed to be the backend of IndexedDB, which runs in the browser and pretty much has a single threaded client and a single user. LevelDB can do a lot more than that, but that was the primary design goal, at least initially.

The problems with LevelDB are pretty well known: it suffers from write amplification, it is known to hang if there is a compaction going on, and… well, you can read about that elsewhere.

RocksDB was supposed to take LevelDB and improve on all the issues that LevelDB had. But it appears that most of what was done was to actually create more configuration options (yes, I’m well aware that this is an unfair statement, but it sure seems so from the tuning guide). What is bad here is that the number of options is very high, and the burden it puts on the implementers is pretty high. Basically, it is a “change & pray” mindset.

Another thing that bugs me is the number of “mutex bottlenecks” that are mentioned in the tuning guide, with the suggestions to shard things to resolve them.

Now, to be fair, an embedded database requires a bit of expertise, and cooperation from the host process, but that seems to be pushing it. It is possible that this is only required for the very high end usage, but that is still not fun.

Compared to that, Voron has about 15 configuration options, most of them about controlling the default sizes we use. And pretty much all of them are set to be auto tuned on the fly. Voron on its own actually has very few moving parts, which makes reasoning about it much simpler and more efficient.

What is new in RavenDB 3.5: Monitoring support

time to read 2 min | 333 words

The final monitoring feature in RavenDB 3.5 is SNMP support. For those of you who aren’t aware, SNMP stands for Simple Network Management Protocol. It is used primarily for monitoring network services. And with RavenDB 3.5, we have full support for it. We even registered our own root OID for all RavenDB work (1.3.6.1.4.1.45751, if anyone cares at this stage). We have also set up a test server where you can look at the results of SNMP support in RavenDB 3.5 (log in as guest to see details).

But what is this about?

Basically, a lot of monitoring features that we looked at boiled down to re-implementing enterprise monitoring tools that are already out there. Using SNMP gives all those tools direct access to the internal details of RavenDB, and allows you to plot and manage them using your favorite monitoring tools, from Zabbix to OpenView to MS MOM.

We expose a long list of metrics, from the loaded databases to the number of indexed items per second to the ingest rate to the number of queries to how much storage space each database takes to…

Well, you can just go ahead and read the whole list and go over it.

We are still going to put effort into making it easy to figure out what is going on with RavenDB directly from the studio, but as customers start running large numbers of RavenDB instances, it becomes impractical to deal with each of them individually. That is why using a monitoring system that can watch many servers is preferable. You can also set it up to send alerts when a certain threshold is reached, and… those are no longer RavenDB features, those are your monitoring system features.

Being able to just offload all of those features is great, because we can just expose the values to the monitoring tools and go on to focus on other stuff, rather than having to do the full monitoring work: UI, configuration, alerts, etc.

What is new in RavenDB 3.5: Monitoring active I/O operations

time to read 2 min | 265 words

RavenDB 3.5 has just a few major monitoring features (although wait for the next one, it is a biggie), but this one is a pretty important one.

This feature allows RavenDB to track, at a very detailed level, all the I/O work that is done by the server, and give you accurate information about what exactly is going on with the system.

Take a look at this report:

image

As you can see, this shows one minute of usage, with writes going on and some indexing work along the way.

The idea here is that you can narrow down any bottlenecks that you have in the system. Not only by looking at the raw I/O stats that the OS provides, but by actually being able to narrow it down to a particular database and a particular action inside that database. For users with multi tenant databases, this can be a very useful tool in figuring out what is actually going on in their system.

The mechanics behind this report are actually interesting. We are using ETW to capture the I/O rates, but capturing kernel events requires admin privileges. Typically, RavenDB isn’t run with those privileges. To work around that, the admin runs the Raven.Monitor.exe process in an elevated context. That gives us access to the kernel events, and we then process the information and show it to the user in the studio.

What is new in RavenDB 3.5: Filters & transformers with RavenDB Replication

time to read 1 min | 166 words

In the previous post, I introduced RavenDB Collection Specific Replication. This allows you to filter which collections get replicated.

The next step is to apply filters and transformers along the way. For example, like so:

image

As you can see, the transformation script allows us to modify the outgoing data, in this case, to hide the email address.

This feature is primarily intended for data replication back to a staging / development environment, where you need to have the data, but can’t expose some of it outside.

It can also be used to modify details going to slave databases so we’ll have per database values (for example, stripping details that are not relevant for a particular tenant).

Like Collection Specific Replication, this replication destination will not be considered to be a failover target.

What is new in RavenDB 3.5: Collection Specific Replication

time to read 2 min | 349 words

With RavenDB 3.5, we added a really cool feature to RavenDB Replication. Actually, I’m not sure how much of a “feature” this is, because it actually takes away capabilities. :-)

As the name suggests, this allows you to select specific collections and only replicate those to a specific destination. In this example, we can see that we are only replicating the Categories, Companies and Employees collections, instead of replicating the entire database.

image

Why is this important?

Because it opens up new ways of managing your data. It allows you to use RavenDB replication (high throughput, reliable and error resilient) to manage data distribution.

Let us imagine for a moment that we have a web ordering system, with multiple tenants. And we have some common information that needs to be shared among all the tenants. For example, the baseline pricing information.

We can setup replication like so:

image

The Shared database contains a lot of information, but only the pricing information is replicated. This means that you can change it once, and it will propagate to all the end destinations on its own.

Common scenarios for such shared data include:

  • Users /logins
  • Base data for local modifications (product catalog that each tenant can override)
  • Rules

Note that because we are using collection specific replication, this does not make the destination database into a duplicate of the source. As such, it will not take part in failover configuration for the source database.

You can mix and match: a single database can replicate to a failover destination (full replication) and to a partial one (only specific collections). And the clients will know how to fail over to the right node if something bad happens.
