Ayende @ Rahien


Production postmortem: The case of 99.99% percentile

time to read 2 min | 326 words

A customer called us with a problem: for the most part, RavenDB was very well behaved, but they were worried about the 99.99th percentile of request duration. While the 99.9th percentile was excellent (around a few milliseconds), the 99.99th percentile was measured in seconds.

Usually this means a particularly slow request, and we can handle that by figuring out what is slow in that specific request and fixing it. A common cause of such an issue is a request that returns a lot of unnecessary data, such as large full documents, when it needs just a few fields.

In this case, however, our metrics told us that the problem was pretty widespread. There wasn’t one particular slow request; rather, at what appeared to be random times, certain requests would be slow. We also realized that it wasn’t a particular request that was slow, but all requests within a given time period.

That hinted quite strongly at the problem: it was very likely a GC issue that caused a long pause. We still had to do some work to rule out other factors, such as I/O and noisy neighbors, but we narrowed it down to GC as the root cause.
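As an aside, a useful low tech way to confirm this kind of suspicion is a pause detector: a background thread that sleeps for a fixed interval and reports whenever it wakes up much later than expected. Blocking GCs suspend managed threads, so unusually long gaps line up well with GC pauses. This is a generic sketch of the idea, not the actual diagnostic we used:

    using System;
    using System.Diagnostics;
    using System.Threading;

    public static class PauseDetector
    {
        // Usage: PauseDetector.Start(TimeSpan.FromMilliseconds(50), TimeSpan.FromMilliseconds(200));
        public static void Start(TimeSpan samplingInterval, TimeSpan reportThreshold)
        {
            var thread = new Thread(() =>
            {
                var sw = Stopwatch.StartNew();
                while (true)
                {
                    Thread.Sleep(samplingInterval);
                    var elapsed = sw.Elapsed;
                    sw.Restart();
                    var gap = elapsed - samplingInterval;
                    if (gap > reportThreshold)
                        Console.WriteLine($"Possible pause of {gap.TotalMilliseconds:N0} ms " +
                            $"(gen2 collections so far: {GC.CollectionCount(2)})");
                }
            })
            {
                IsBackground = true,
                Name = "pause-detector"
            };
            thread.Start();
        }
    }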

The problem in this case was that the customer was running multiple databases on the same RavenDB process, and each of them was fairly large. The total amount of managed memory that the RavenDB process was using was around 60GB. At that size, anything that triggered a full GC could cause a significant pause and impact operations.

The solution in this case was to break that into multiple separate processes, one for each database. In this manner, we didn’t have a single managed heap that the GC had to traverse, each heap was much smaller, and the GC pause times were both greatly reduced and spaced much further apart.

Building a query parser over a weekend: Part II

time to read 3 min | 576 words

In the previous post I talked about what I wanted to get, and how I decided to use the GOLD parser to formally define the language grammar. However, I had no intention of actually generating the parser; I wanted to write it myself. This is because I have had a lot of poor experiences with the output of parser generators. They generate horrible code that manages to be both unreadable and to force you into it frequently enough to be maddening.

In particular, I haven’t been able to find anything that is as good as a hand rolled parser in terms of performance, readability of the code and the quality of the errors it generates. The latter might sound strange, but a key part of the usefulness of any language is the kind of errors you get when you give it invalid input.

An important consideration is that I’m intending to accept input over HTTP, so I can assume that my input is already wrapped in a pretty System.String package, which saves a lot of the complications that you usually have to deal with if your input is a streaming medium. Initially I tried to put the scanner and the parser in the same place, but that ended up being a Bad Idea; it was too complex to handle. Instead, I split the responsibility into separate scanner and parser classes.

The scanner is responsible for scanning the string and finding the next token. But I decided to make it reactive: you tell it what you expect, and it will see if the next bit matches. Scanners typically scan the input and produce a stream of tokens. That works, but it means that you need a lot more state in the scanner, and it can be more complex. I decided to simplify it as much as I possibly could.

Here is how we find identifiers in the scanner:
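The actual code is shown as an image in the original post, so here is a minimal sketch of the general shape, with names (_q, _pos, TokenStart, TokenLength) that are my own. The important property is that each method answers a yes/no question about the next token and records positions into the query string instead of allocating substrings:

    using System;

    public class QueryScanner
    {
        private readonly string _q; // the full query text
        private int _pos;

        public int TokenStart { get; private set; }
        public int TokenLength { get; private set; }

        public QueryScanner(string q)
        {
            _q = q;
        }

        // "Tell it what you expect": is the next token an identifier?
        public bool TryIdentifier()
        {
            SkipWhitespace();
            if (_pos >= _q.Length || (char.IsLetter(_q[_pos]) == false && _q[_pos] != '_'))
                return false;

            var start = _pos;
            while (_pos < _q.Length && (char.IsLetterOrDigit(_q[_pos]) || _q[_pos] == '_'))
                _pos++;

            TokenStart = start;          // index into the original query string
            TokenLength = _pos - start;  // no substring is ever allocated
            return true;
        }

        // Case insensitive keyword match, again without allocating anything.
        public bool TryKeyword(string keyword)
        {
            SkipWhitespace();
            if (_pos + keyword.Length > _q.Length)
                return false;
            if (string.Compare(_q, _pos, keyword, 0, keyword.Length,
                    StringComparison.OrdinalIgnoreCase) != 0)
                return false;
            // make sure we matched a whole word, not the prefix of an identifier
            if (_pos + keyword.Length < _q.Length && char.IsLetterOrDigit(_q[_pos + keyword.Length]))
                return false;

            TokenStart = _pos;
            TokenLength = keyword.Length;
            _pos += keyword.Length;
            return true;
        }

        private void SkipWhitespace()
        {
            while (_pos < _q.Length && char.IsWhiteSpace(_q[_pos]))
                _pos++;
        }
    }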

You might notice that the code is pretty simple: it runs over the input string (_q) and checks whether the next token is an identifier based on our rules. That makes the parser code easier to handle. Let us consider how we can parse a FROM clause. The formal definition is:

<From> ::= 'FROM INDEX' <From Source> | 'FROM' <From Source>

<From Source> ::= <Simple Field> 'AS' <Simple Field> | <Simple Field>

<Simple Field> ::= Identifier | StringLiteral

If you are familiar with BNF, this is quite readable. Now, how do we parse this? Here is the actual parser code:
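The real parser code is also an image in the original post, so the following is only a sketch of how such a method can follow the grammar above, building on the QueryScanner sketch from before (string literals are omitted to keep it short, and the TokenRange type is my own):

    public struct TokenRange
    {
        public int Start;
        public int Length;
    }

    public class QueryParser
    {
        private readonly QueryScanner _scanner;

        public QueryParser(QueryScanner scanner)
        {
            _scanner = scanner;
        }

        // Parses: 'FROM' ['INDEX'] <source> ['AS' <alias>]
        public (TokenRange Source, TokenRange? Alias, bool IsIndex) ParseFrom()
        {
            if (_scanner.TryKeyword("FROM") == false)
                throw new System.InvalidOperationException("Expected FROM clause");

            var isIndex = _scanner.TryKeyword("INDEX");

            if (_scanner.TryIdentifier() == false)
                throw new System.InvalidOperationException("Expected index or collection name after FROM");
            var source = new TokenRange { Start = _scanner.TokenStart, Length = _scanner.TokenLength };

            TokenRange? alias = null;
            if (_scanner.TryKeyword("AS"))
            {
                if (_scanner.TryIdentifier() == false)
                    throw new System.InvalidOperationException("Expected alias after AS");
                alias = new TokenRange { Start = _scanner.TokenStart, Length = _scanner.TokenLength };
            }

            return (source, alias, isIndex);
        }
    }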

As you can see, we start by asking for a FROM token (the scanner is case insensitive, so we don’t need to worry about casing), then check if it is followed by the INDEX term, then we get the actual identifier or string literal to use, and a possible alias.

This code can parse the following:

  • from Users
  • from Users as user
  • from index “Users/ByActiveMarker” AS u

You might note that I’m taking full advantage of the possibility of asking the input several questions, because it makes my code simpler overall. I’m also not doing any substring operations. Instead, I’m passing indexes into the overall query string, which allows me to get the information without paying the price of allocating all those strings.

In this manner, we also get something that is pretty easy to work with, and we can compare it to the formal definition to guide us in the parsing. At the same time, we get code that is readable and has quite good performance.

Building a query parser over a weekend: Part I

time to read 4 min | 746 words

Some tasks are fun: they are self contained, easy to conceptualize (though not always easy to build) and challenging. A few weeks ago I spent my weekend writing a parser, and it was a lot of fun.

I’ve been writing parsers for a very long time; the book came out in 2010 and I had been playing with Boo since 2005. The ANTLR book was an interesting read and it taught me a lot about how to approach text parsing.

However, you might have noticed that I shifted my thinking about a lot of design problems. In particular, performance, number of allocations, exceptions thrown during parsing, the readability of errors when getting invalid input, etc.

In particular, the Lucene query parser is a really good example of a really crappy one. It fails on pretty much all the points above. I worked with a bunch of parser generators, and I never found one whose output was something that I could really like. They typically generate unreadable code, and customizing their behavior is non-obvious, to say the least.

Martin Fowler (only slightly out of context):

…it's hard to write a parser.

My most recent parser is the RavenDB JSON parser, you can see the progression of the ideas around that in this series of posts. That isn’t something that you’ll really want to read without a cup of coffee and some writing instruments to write notes.

Most non-trivial parsers are composed of at least two pieces, a tokenizer and a builder. The tokenizer goes over the input and breaks it into tokens that the builder uses to build the final format. The JSON scanner in RavenDB is called UnmanagedJsonParser and the builder is BlittableJsonDocumentBuilder. Traditionally they would be called scanner and parser, for the roles they play, but we’ll leave the names as is because it doesn’t really matter. This code is not fun to go through; it has been through multiple performance reviews, each time making it uglier than before, but much faster.

JSON is also one of the simplest possible textual formats. The formal definition of JSON fits a post-it note. The JSON scanner I have for RavenDB is close to 900 lines of code and is only part of the parsing process.

But the parser I built over the weekend wasn’t for JSON. Instead, I wanted to play with a query language, so I naturally wanted something SQL like. And that is anything but trivial to do.

The first thing I needed was to actually sit down and figure out what the language is going to look like. In order to do that, you almost always want to use a BNF notation of some kind. This allows you to specify what your language should look like, not just as a few snippets in a notepad window but in a more structured manner.

More to the point, there are a lot of tools out there to use. I decided to use GOLD Parser; it was last updated in 2012 and it shows, but it had the lowest friction of all the parser IDEs that I tried and it has great support for debugging and working with grammars. Why not use ANTLR, which is pretty much the default choice? Put simply, it is usually too much of a hassle to set up ANTLR properly, and I didn’t want to get bogged down with the actual details of generating the parsers; I wanted to focus on the grammar.

I actually don’t know how to parse text using the GOLD Parser. It looks like it generates a binary file that you feed to some library that will do it for you, but I’m not sure and it doesn’t matter. What I care about is that I can develop a formal definition and debug it easily. I’m not actually going to use it to generate the parser.

Huh?! Why do all this work for no reason?

A formal definition of the language is incredibly helpful when you consider a new syntax, because you can verify that you aren’t creating holes and ambiguities in your language. It also gives you a pretty clear guideline on how to implement the language.

I’m going to go into more details about the language itself and building a parser for it in my next post.

Reviewing Resin: Part V

time to read 4 min | 735 words

In the previous part, I looked at how indexing and queries are handled in Resin. This post is mostly about the pieces I haven’t talked about so far. We’ll start with the query parser and move to the trie.

Queries in Resin look like this:

[image: an example Resin query]

This looks sort of like the Lucene syntax, but it seems to keep the same context until a new field comes along.

Range queries look better, sort of:

[image: an example Resin range query]

I had a hard time figuring this one out, until I realized that this is not an XML tag in the middle.

The problem is that the Lucene query syntax kinda sucks. Actually, it sucks a lot. It is complex and ambiguous to parse and it is full of all those little things like the ~ over there that is not very obvious but is very important to the query. I would actually suggest something more like SQL. Sure, that wouldn’t be what you’ll put in the search box, but programmers will appreciate you for that.

Looking at the parser code, there aren’t any surprises there. It is using a hand rolled system based on regex and split, which can be vastly improved. One thing to note is that because of the simplicity of the parser, it isn’t really able to process a token with a colon in it, so it can’t process this query:

url:http://ayende.com

Anyway, the query parser isn’t really the most important thing here. The core of Resin, and what I haven’t looked at so far at all is the trie…

LcrsTrie stands for Left child Right sibling; there is a good discussion on the reasons why you’ll want to use it here. At this point, I’m not really sure why Lcrs was chosen. In general, it is used to reduce space and simplify the representation, but I don’t think that this is a good idea for a persistent structure. I’ll look at that later. Right now I’m reading the code, and it is mostly pretty obvious code. But then you get to this:

[image: LcrsTrie code that returns a single value as an IEnumerable]

This pattern of using IEnumerable to return a single value is something that I’ve seen in other places in the codebase, and I don’t really get it.
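Since the post doesn’t show the structure itself, here is a minimal sketch of what a left child / right sibling trie node looks like in general (the textbook shape, not Resin’s LcrsTrie): each node keeps one reference to its first child and one to its next sibling, so any branching factor is represented with exactly two references per node.

    public class LcrsNode
    {
        public char Value;
        public bool EndOfWord;        // the path from the root to here spells a complete term
        public LcrsNode LeftChild;    // first child, i.e. the next character level
        public LcrsNode RightSibling; // next alternative at the same level

        public LcrsNode FindChild(char c)
        {
            // walk the sibling chain of the first child, looking for c
            for (var node = LeftChild; node != null; node = node.RightSibling)
            {
                if (node.Value == c)
                    return node;
            }
            return null;
        }
    }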

I like the use of the Levenshtein distance in fuzzy search, mostly because we don’t need to store a lot of data to get fuzzy search working. In particular, it looks like suggestion style queries are pretty easy, and would be much cheaper than they would be in Lucene.

Probably the core operation you always perform on a trie is the search, and the core of that in this case is the TryFindPath method:
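Resin’s actual TryFindPath is shown as an image in the original post; conceptually, the search just follows one child link per character, as in this sketch built on the LcrsNode above:

    public static class TrieSearch
    {
        public static bool TryFindPath(LcrsNode root, string word, out LcrsNode node)
        {
            node = root;
            foreach (var c in word)
            {
                node = node.FindChild(c);
                if (node == null)
                    return false; // fell off the trie, the word isn't there
            }
            return node.EndOfWord;
        }
    }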

There is nothing surprising in this code, but it is a pure in memory implementation, which is a very different environment than a persistent data structure.

The persistent data structure is actually the MappedTrieReader, so let us examine it. Looking at it, there is some reference to the notions of segments within the file, but I’m not seeing where it is used. This is what the “*.six” file is used for, it seems. I think that this might be related to merging, but I don’t really know.

And here is the reason for the IsWord design:

[image: the IsWord handling in the segmented reader]

When using a single LcrsTrie, it isn’t needed. But when using a possibly segmented reader, we might have multiple results for the same word.

What is worrying here is that the same access pattern for the trie that is used in memory is also used in persistent mode. That means that each time we need to traverse the trie, we’ll need to seek. Actually, it looks like that might only be needed when we aren’t on the right path, but that is actually pretty common, so there are going to be a lot of seeks.

That is enough for now; my next post will be a more detailed analysis of the Resin I/O structure and what I would probably do instead.

Reviewing Resin: Part IV

time to read 5 min | 964 words

In the previous part, I looked at UpsertTransaction in Resin and speculated about how the queries work. In this one, I’m going to try to figure out how queries work. Our starting point is this:

[image: the starting point of query execution]

We start from the index header, and we’ll traverse down from there. One of the first things that happens in the Collector is the creation of the DocHashReader, whose sole purpose is to… read a document hash. It is doing it like this:

The problem is that there is really no need to write all this code. It would be simpler to use:

[image: a simpler way to read the document hash]

It does the exact same thing, but with a lot less work all around.

The core of the Collect method is:

[image: the core of the Collect method]

For our purposes, we are running just a single query, so no need to worry about sub queries at this time. Looking at the Scan method, the first thing it does is to open the .tri file. It looks like I missed a bunch of stuff there.

The field name hash is the one used in the key, not the name itself. That means that you aren’t limited to just stuff that is safe to use on the file system. There is also a “.six” file that I’ve not seen before, it is related to tries, and I’m skipping it for now because I want to have a separate post about them.

It is used like this:

[image: how the “.six” file is used]

The problem I have is that this means that the GetTreeReader will open a bunch of files, then immediately close them. That is going to generate a lot of system calls, which can probably be avoided with some effort.

The really interesting bit is here:

This is where the magic happens. This is the core for actually searching over the tries and figuring out what values we actually have there.

The result of this is a List<(string Field, Word Word)>. And Word contains:

[image: the Word class]

As a reminder, the Postings is actually the list of all the documents that contain this value in this field, and the number of times that this value appears in each document.

The next method is GetPostings, which starts by reading them:

[image: reading the postings in GetPostings]

The problem I have here is that this method looks like it has been refactored halfway. It can only return a single list, and again, there is the overuse of LINQ operations and the allocations they bring.

As an aside on code formatting, in many places in the code so far, I have chosen to minify the code without changing its meaning, because there is such a high overhead to the differences. I’m doing this fairly automatically, because it helps me read and understand the code. Here is a before and after example, which was drastic enough to make me realize I’m doing this.

Before

After

[images: the before and after versions of the code]

Functionally, those two are doing the same thing, and I find the after version much more readable.

The Sum method here is pretty horrible, in the sense that it has high complexity. Luckily, it is never called with more than one list, so that cost is hidden.

A fun exercise would be to compute the actual complexity with real inputs. I just looked at it and went “this gotta be expensive”, then figured out that the code only ever calls it with a single list, so I skipped it.

After getting the postings, we need to score the query. This is where we see the usage of the document hashes. They are used to go from the document id to a check of whether the document has been deleted. The actual scoring is tf-idf, so pretty standard and not interesting here.
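For reference, the common form of that weight (one standard variant; Resin’s exact formula may differ) is the term frequency in the document multiplied by the log of how rare the term is across the corpus:

    using System;

    public static class Scoring
    {
        public static double TfIdf(int termFrequencyInDocument, int totalDocuments, int documentsWithTerm)
        {
            return termFrequencyInDocument * Math.Log((double)totalDocuments / documentsWithTerm);
        }
    }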

It does bug me to see things like this:

[image: sorting the results during scoring]

Sorting can be very expensive, and I’m pretty sure that it is not actually needed here, and it would improve performance quite impressively to remove it.

Okay, now we are almost done with the query, all that remains is investigating this line:

[image: the line that materializes the full result set]

The unbounded result set is annoying, but I gave up that fight, I’m afraid. Let us see what Reduce does. In complex queries, I expect that it would merge all the subqueries and do filtering / intersection/etc.

[image: the Reduce implementation]

And it does just that, which is great. I do wonder if scoring the results could be pushed to after the query reducing, because that would reduce the amount of work that needs to be done, but that is a small optimization, probably.

Okay, that is enough for this post. I now have a pretty good understanding of how queries actually work. Next, I’m going to look at some other pieces of the code that I haven’t looked at yet, then focus on the trie.

Reviewing Resin: Part III

time to read 6 min | 1062 words

In the previous part, I started looking at UpsertTransaction, but got sidetracked into the utility functions. Let us focus back on it. The key parts of UpsertTransaction are:

[image: the key fields of UpsertTransaction]

Let us see what they are. The DocumentStream is the source of the documents that will be written in this transaction. Its job is to get the documents to be indexed, give them a unique id if they don’t already have one, and hash them.

I’m not sure yet what the point is there, but we have this:

[image: the document hashing code, with a comment about collision likelihood]

Which sounds bad. The likelihood is small, but it isn’t a crypto hash, so likely very easily broken. For example, look at what happened to MurmurHash.

I think that this is later used to handle some partitioning in the trie, but I’m not sure yet. We’ll look at the _storeWriter later. Let us see what the UpsertTransaction does. It builds a trie, then pushes each of the documents from the stream through the trie. The code is doing a lot of allocations, but I’m going to stop harping on that from now on.

The trie is called for each term for each document with the following information:

[image: the per-term information passed to the trie]

The code isn’t actually using a tuple; I just collapsed a few classes to make it clear what the input is.

This is what will eventually allow the trie to do lookups on a term and find the matching document, I’m assuming.

That method is going to start a new task for that particular field name, if it is new, and push the new list of words for that field into the work queue for that task. The bad thing here is that we are talking about a blocking task, so if you have a lot of fields, you are going to spawn off a lot of threads, one per field name.

What I know now is that we are going to have a trie per field, and it is likely, based on the design decisions made so far, that a trie isn’t a small thing.

Next, the UpsertTransaction needs to write the document. This is done by taking the document we are processing and turning it into a dictionary of short to string. I’m not sure how it is supposed to handle multiple values for the same field, but I’ll ignore that for now. That dictionary is then saved into a file, and its length and position are returned.

I know that I said that I won’t talk about performance, but I looked at the serialization code and I saw that it is using compression, like this. This is done on a field by field basis, while you could probably benefit from compressing them all together.

[image: the per-field compression code]

Those are a lot of allocations, and then we go a bit deeper:

[image: the serialization code, allocating a MemoryStream and calling ToArray]

First, we have the allocation of the memory stream, then the ToArray call, and that happens, per field, per document. Actually, if we go up, we’ll see:

[image: the calling code, allocating per field, per document]

So it is allocations all the way down.

Okay, let us focus on what is going on in terms of files:

  • "write.lock" – this one is pretty obvious
  • *.da – stands for document address. Holds a series of (long Position, int Size) of document addresses. I assume that this is using the same sort as something else, not sure yet. The fact that this is fixed size means that we can easily skip into it.
  • *.rdoc – documents are stored here. Contains the actual serialized data for the documents (the Dictionary<short, Field>), this is the target for the addresses that are held by the “*.da” files.
  • *.pk – holds document hashes. Holds a list of document pk hash and a flag saying if it is deleted, I’m assuming. From context, it looks like the hash is a way to update documents across transactions.
  • *.kix – key index. Text file holding the names of all the fields across the entire transaction.
  • *.pos – posting file. This one holds the tries that were built during the transaction. This is basically just List<(int DocumentId, int Count)>, but I’m not sure how they are used just yet. It looks like this is how Resin is able to get the total term frequency per document. It looks like this is also sorted.
  • *.tri – the trie files that actually contain the specific values for a particular field. The name pattern is “{indexVersion}-{fieldName}.tri”. That means that your field names are limited to valid file names, by the way.

The last part of the UpsertTransaction is the commit, which essentially boils down to this:

[image: the commit code]

I think that this was a very insightful read; I have a much better understanding of how Resin actually works. I’m going to speculate wildly, and then use my next post to check further into that.

Let us say that we want to search for all users who live in New York City. We can do that by opening the “636348272149533175-City.tri” file. The 636348272149533175 is the index version, by the way.

Using the trie, we search for the value of New York City. The trie value actually gives us a (long Position, int Size) into the 636348272149533175.pos file, which holds the postings. Basically, we now have an array of (int DocumentId, int Count) of the documents that matched that particular value.

If we want to retrieve those documents, we can use the 636348272149533175.da file, which holds the addresses of the documents. Again, this is effectively an array of (long Position, int Size) that we can index into using the DocumentId. This points to the location on the 636348272149533175.rdoc file, which holds the actual document data.
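To make that flow concrete, here is a sketch of the lookup as I understand it from the description above (my reconstruction, not Resin’s code), assuming the “*.da” file is a packed array of 12 byte (long Position, int Size) records and “*.rdoc” holds the serialized documents at those positions:

    using System.IO;

    public static class DocumentAddressReader
    {
        private const int RecordSize = sizeof(long) + sizeof(int); // 12 bytes per document address

        public static byte[] ReadDocument(string daPath, string rdocPath, int documentId)
        {
            long position;
            int size;

            using (var da = File.OpenRead(daPath))
            using (var reader = new BinaryReader(da))
            {
                da.Seek((long)documentId * RecordSize, SeekOrigin.Begin);
                position = reader.ReadInt64();
                size = reader.ReadInt32();
            }

            using (var rdoc = File.OpenRead(rdocPath))
            {
                rdoc.Seek(position, SeekOrigin.Begin);
                var buffer = new byte[size];
                var read = 0;
                while (read < size) // Stream.Read may return less than requested
                {
                    var n = rdoc.Read(buffer, read, size - read);
                    if (n == 0)
                        throw new EndOfStreamException("Unexpected end of .rdoc file");
                    read += n;
                }
                return buffer; // the serialized document data
            }
        }
    }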

I’m not sure yet what the point of *.pa and *.kix is, but I’m sure the next post we’ll figure it out.

Reviewing Resin: Part II

time to read 3 min | 564 words

In the first part of this series, I looked into how Resin is tokenizing and analyzing text. I’m still reading the code from the tests (this is because the Tests folder sorted higher than the Resin folder, basically) and I have now moved to the second file, CollectorTests.

That one has a really interesting start:

There are a lot of really interesting things here: UpsertTransaction, document structure, issuing queries, etc. UpsertTransaction is a good place to start looking around, so let us poke in. When looking at it, we can see a lot of usage of the Utils class, so I’ll look at that first.

This is probably a bad idea. While using the current time ticks seems like it would generate ever increasing values, that is actually not the case, certainly not with local time (clock shift, daylight saving, etc). Using that for the purpose of generating a file id is probably a mistake. It is better to use our own counter, and just keep track of the last one we used on the file system itself.
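A sketch of that alternative (my suggestion, not code from Resin): a counter persisted on disk, so ids keep increasing across restarts no matter what the wall clock does.

    using System.IO;
    using System.Threading;

    public class FileIdGenerator
    {
        private readonly string _counterPath;
        private long _last;

        public FileIdGenerator(string directory)
        {
            _counterPath = Path.Combine(directory, "last-file-id");
            _last = File.Exists(_counterPath) ? long.Parse(File.ReadAllText(_counterPath)) : 0;
        }

        public long Next()
        {
            var id = Interlocked.Increment(ref _last);
            File.WriteAllText(_counterPath, id.ToString()); // persist the high water mark
            return id;
        }
    }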

Then we have this:

It took me a while to figure out what was going on there, and then more time to frantically search for where this is used. Basically, this is used in fuzzy searches, and it will allocate a new instance of the string on each call. Given that fuzzy search is popular in full text search usage, and that this is called a lot during any such search, this is going to allocate like crazy. It would be better to move the entire thing to using mutable buffers, instead of passing strings around.

Then we go to the locking, and I had to run it a few times to realize what is going on.

And this isn’t the way to do this at all. Basically, this relies on the file system to fail when you are trying to copy a file into an already existing file. However, that is a really bad way to go about it. The OS and the file system already have locking primitives that you can use, and they are going to be much better than this option. For example, consider what happens after a crash: is the directory locked or not? There is no real way to answer that, since the process might have crashed, leaving the file in place, or it might be doing things, expecting that the directory is locked.
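A sketch of what using the file system’s own primitives could look like (again my suggestion, not Resin’s code): holding an open handle with FileShare.None and DeleteOnClose means the lock only exists while the process is alive, so a crash releases it automatically.

    using System.IO;

    public static class DirectoryLock
    {
        public static FileStream Acquire(string directory)
        {
            var lockPath = Path.Combine(directory, "write.lock");
            // Throws IOException if another process already holds the handle.
            return new FileStream(lockPath, FileMode.OpenOrCreate, FileAccess.ReadWrite,
                FileShare.None, bufferSize: 1, FileOptions.DeleteOnClose);
        }
    }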

Moving on, we have this simple looking method:

I know I’m harping on that, but this method is doing a lot of allocations by using lambdas, and depending on the number of files, the delegate indirection can be quite costly.  For that matter, there is also the issue of error handling. If there is a lock file in this directory when this is called, this will throw.

Our final code for this post is:

I really don’t like this code. It is something that looks like it is cheap, but it will:

  • Sort all the index files in the folder
  • Open all of them
  • Read some data
  • Sum over that data

Leaving aside that the deserialization code has the typical issue of not checking that the entire buffer was read, this can cause a lot of I/O on the system, but luckily this function is never called.
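For completeness, the usual way to deal with the “entire buffer” issue mentioned above is to loop until the requested count has been read, since Stream.Read is allowed to return fewer bytes than asked for:

    using System.IO;

    public static class StreamExtensions
    {
        public static void ReadFully(this Stream stream, byte[] buffer, int offset, int count)
        {
            while (count > 0)
            {
                var read = stream.Read(buffer, offset, count);
                if (read == 0)
                    throw new EndOfStreamException();
                offset += read;
                count -= read;
            }
        }
    }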

Okay, so we didn’t actually get to figure out what UpsertTransaction is, we’ll look at that in the next post.

Reviewing Resin: Part I

time to read 5 min | 827 words

Resin is a “Cross-platform document database and search engine with query language, API and CLI”. It is written in C#, and while I admit that reading C# code isn’t as challenging as diving into a new language, a project that has a completely new approach to a subject that is near and dear to my heart is always welcome. It is also small, coming in at about 6,500 lines of code, so that makes for quick reading.

I’m reviewing commit ddbffff88995226fa52236f6dd6af4a48c833f7a.

As usual, I’m going to start reading the code in alphabetical file order, and then jump around as makes sense. The very first file I run into is Tests/AnalyzerTests, where we find the following:

This is really interesting, primarily because of what it tells me. Analyzers are pretty much only used for full text search, such as Lucene or Noise. Jumping into the analyzer, we see:

[image: the analyzer code]

This tells me quite a few things. To start with, this is a young project. The first commit is less than 18 months old, and I’m judging it with the same eye I use when looking at our own code. This code needs to be improved, for several reasons.

First, we have a virtual method call here, probably intended to be an extension point down the line. Currently, it isn’t used, and we pay the virtual call cost for no reason. Next we have the return value. IEnumerable is great, but this method is using yield, which means that we’ll have a new instance created per document. For the same reason, the tokenDic is also problematic. This one is created per field’s value, which is going to cost.

One of the first thing you want to have when you start worrying about performance is controlling your allocations. Reducing allocations in this case, by reusing the dictionary instance, or avoiding the yield would help. Lucene did a lot of stuff right in that regard, and it ensures that you can reuse instances wherever possible (almost always), since that can dramatically improve performance.

Other than this, we can also see that we have Analyze and Index features, for now I’m going to assume that they are identical to Lucene until proven otherwise. This was the analyzer, but what is going on with the tokenizer? Usually that is a lot more low level.

The core of the tokenizer is this method (I prettified it a bit to make it fit better on screen):

[image: the tokenizer method]

As far as I can tell so far, most of the effort in the codebase has gone into the data structures used, not to police allocations or get absolute performance. However, even so this would be one of the first places I would look at whenever performance work would start. (To be fair, speaking with the author of this code, I know there hasn’t been any profiling / benchmarking on the code).

This code is going to be at the heart of any indexing, and for each value, it is going to:

  • Allocate another string with the lowered case value.
  • Allocate a character buffer of the same size as the string.
    • Process that character buffer.
  • Allocate another string from that buffer.
  • Split that string.
  • Use a lambda on each of the parts and evaluate that against the stopwords.

That is going to have a huge amount of allocations / computation that can be saved. Without changing anything external to this function, we can write the following:

This will do the same, but at a greatly reduced allocation cost. A better alternative here would be to change the design. Instead of having to allocate a new list, send in a buffer, and don’t deal with strings directly; instead, deal with spans. Until .NET Core 2.0 is out, I’m going to skip spans and just use direct tokens, like so:
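The code itself is an image in the original post, so here is a sketch in the same spirit (the Token type and the exact rules are my own): no intermediate strings, just (start, length) tokens pointing into the original value, with the caller supplying the list so it can be reused between calls.

    using System.Collections.Generic;

    public struct Token
    {
        public int Start;
        public int Length;
    }

    public class Tokenizer
    {
        private readonly HashSet<string> _stopwords; // assumed to be stored lowercased

        public Tokenizer(HashSet<string> stopwords)
        {
            _stopwords = stopwords;
        }

        public void Tokenize(string value, List<Token> tokens)
        {
            var start = -1;
            for (var i = 0; i <= value.Length; i++)
            {
                var isLetterOrDigit = i < value.Length && char.IsLetterOrDigit(value[i]);
                if (isLetterOrDigit && start == -1)
                {
                    start = i; // a token begins
                }
                else if (isLetterOrDigit == false && start != -1)
                {
                    if (IsStopword(value, start, i - start) == false)
                        tokens.Add(new Token { Start = start, Length = i - start });
                    start = -1; // the token ends
                }
            }
        }

        private bool IsStopword(string value, int start, int length)
        {
            // Compare character by character, lowering as we go, so we never
            // allocate a lowercased substring just to do the check.
            foreach (var stopword in _stopwords)
            {
                if (stopword.Length != length)
                    continue;
                var match = true;
                for (var i = 0; i < length; i++)
                {
                    if (char.ToLowerInvariant(value[start + i]) != stopword[i])
                    {
                        match = false;
                        break;
                    }
                }
                if (match)
                    return true;
            }
            return false;
        }
    }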

There are a few important things here. First, the code now doesn’t do any string allocations; instead, it operates on the string characters directly. We have the IsStopword method that is now more complex, because it needs to do the check without allocating a string while still being efficient about it. How is left as an exercise for the reader, but it shouldn’t be too hard.

One thing that might not be obvious is the tokens list that we get as an argument. The idea here is that the caller code can reuse this list (and memory) between calls, but that would require a major change in the code.

In general, not operating on strings at all would be a great thing indeed. We can work with direct character buffers, which will allow us to be mutable. Again, spans would probably fit right into this and be of great help.

That is enough for now, I know we just started, but I’ll continue this in the next post.

RavenDB 4.0: Securing the keys to the kingdom

time to read 7 min | 1238 words

A major design goal for RavenDB is that it should be easy and convenient to use. A major constraint is that it must be secure. As you can imagine, those two quite often work against one another. Security is often anything but easy to use, and it is rarely convenient.

Previously, we have used Windows Authentication and OAuth to secure access to RavenDB. That works and has been deployed in the wild for quite some time. It is also a major pain whenever there is an issue. If the connection to the domain controller drops, we might have authentication delays of many seconds, and trying to debug Active Directory issues in production deployments can be… a bit of a pain, in the same way that an audit by the IRS that starts with a SWAT team bashing down your door is mildly annoying. OAuth, on the other hand, is much better, since it is under our control, and we can figure out exactly what is going on with it if need be.

Since RavenDB 4.0 is running on Windows, Linux & Mac, we decided to drop the Windows Authentication support and just use OAuth. The problem is that if we choose to support HTTP, we have to rely on extremely complex protocols that attempt to secure authentication over plain text, but don’t usually deliver good results and are typically a pain to debug and support. Or, we can use HTTPS and just let SSL/TLS handle it all for us. A good example of the difference can be seen in OAuth 1.0 vs OAuth 2.0.

When we built RavenDB 1.0, roughly around 2009, the operating environment was quite different. In 2017, not using HTTPS is pretty much a sin unto itself. As we started security modeling for RavenDB 4.0, it became obvious that we couldn’t really support any security on top of HTTP without effectively having to implement most of the properties of HTTPS ourselves. I’m many things, but I’m not a security expert, not by a long shot. Given the chance to implement my own security protocol, I would gladly do that for a toy project or a weekend hackfest. But there is no way I would trust my own security in production against serious attacks. That pretty much led us to the realization that we have to require HTTPS for anything that requires security.

That includes running inside the organization, exposed to the public internet, running inside the cloud or in a shared datacenter, etc. Pretty much, unless you have HTTPS, there is no real point in talking about security. Given that, it meant that we could shift our baseline approach to security. If we are always going to require HTTPS for security, it means that we are operating in an environment that is much nicer for us to apply security.

Now, you can choose to run HTTP only, and avoid the need for certificate management, etc. However, at that point, you aren’t running a secure system, or you are already running it in a trusted and secured environment. In that case, we want to be clear that there isn’t any point to try to apply security policy (such as who can access what). Any network sniffer can figure out the access tokens and pretend to be whomever they want, if you are using HTTP.

With HTTPS required, we now move to the realm of having the admin take care of the certificates: securing them, renewal, etc. That is the part where it isn’t as easy or convenient as we could wish for. However, once we had that as a baseline, it opened an interesting path for security. Instead of relying on our own solution, we can use the built-in one and use x509 certificates from the client for authentication. This has the advantage that it is widely supported, standardized and secure. It is a bit less convenient than just a password, but the advantage is that any security system already in place knows how to deal with, store, authorize and manage access to certificates.

The idea is that you can go to RavenDB and either register or generate an x509 certificate. To that certificate an administrator can assign permissions (such as which databases it is allowed to access). From that point on, a client (RavenDB, browser, curl, etc.) can connect to RavenDB and just issue REST requests. There is no need to do anything else for the system to work. Contrast that with how you would typically have to deal with authentication using OAuth: sending the token, keeping it fresh manually, etc.

Using x509 also has the distinct advantage that it is widely trusted. We intend to provide this level of security to all editions of RavenDB (so the Community Edition will also be able to use it).

A nice accidental feature of this decision is that we are going to be able to apply authentication at the connection level, and connection pooling means that we are likely going to have connections live for a long time. That means that we only need to pay the authentication cost once, instead of per request, with OAuth.

To simplify matters, we’ll likely just use the client certificates for authenticating the client, so we’ll not care if they are from a trusted root, etc. We’ll just require that the admin register the valid certificates with the cluster so they will be recognized. If you need to stop using a certificate, you can delete its registration or generate a new certificate to take its place. On the client side, it means that the DocumentStore will expose an X509Certificate property that you can set (or the equivalent in other clients). That means that you can use your own policies on the client to determine how to store the certificate.
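From the client side, the usage could end up looking something like this sketch (the API was still being designed when this was written, so the property and the values here are assumptions rather than the final shape):

    using System.Security.Cryptography.X509Certificates;
    using Raven.Client.Documents;

    public static class SecuredStoreSetup
    {
        public static IDocumentStore Create()
        {
            var store = new DocumentStore
            {
                Urls = new[] { "https://ravendb.example.com" }, // placeholder URL
                Database = "Orders",                            // placeholder database name
                Certificate = new X509Certificate2("client.pfx", "certificate-password")
            };
            return store.Initialize();
        }
    }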

On the server side, by the way, we’ll expose an extension point that will allow you to retrieve the certificate using your own policies. For example, if you are using Azure Key Vault or Hashicorp Vault or even your own HSM. This is done by invoking a process you specify, so you can write your own scripts / mini programs and apply whatever logic you need. This creates a clean separation between RavenDB and the secret store in use.

Authentication between servers is also done using SSL and certificates. We expect that we’ll commonly have all the servers running the same wildcard certificate, in which case they will obviously trust each other. Alternatively, you can also specify additional certificates that will be treated as servers. This is useful when you are running with a separate certificate for each server, but it is also a critical part of certificate rotation. When your certificate is about to expire, the admin will register the new certificate as trusted, and then start replacing the certificates of each of the nodes in turn. This allows us to run with both old and new certificates concurrently during this process.

We considered relying on some properties of the certificate itself, but it seemed like an error prone process. It is better to have the admin explicitly state, both for client and server certificates, which ones we should actually trust, and at what level.

I would really appreciate any commentary you have about this feature, both in terms of ease of use, acceptability and obviously its security.

Bug stories: The memory ownership in the timeout

time to read 3 min | 442 words

We are running a lot of tests right now on RavenDB, in all sorts of interesting configurations. Some of the more interesting results came from testing wildly heterogeneous systems. Put a node on a fast Windows machine, connect it to a couple of Raspberry Pis, a cheap Windows tablet over WiFi and a slow Linux machine, and see how that kind of cluster handles high load.

This has turned up a number of bugs; the issue with the TCP read buffer corruption is one such example, but another is the reason for this post. In one of our test runs, the RavenDB process crashed with an invalid memory access. That was interesting to see. Tracking down the issue led us to the piece of code that handles incoming replication. In particular, the issue was possible if the following happened:

  • Node A is significantly slower than node B, primarily with regards to disk I/O.
  • Node B is being hit with enough load that it sends large requests to node A.
  • There is a Node C that is also trying to replicate the same information (because it noticed that node A isn’t catching up fast enough and is trying to help).

The root cause was that we had a bit of code that looked like this:

Basically, we read the data from the network into a buffer, and then hand it off to the transaction merger to run. However, if there is a lot of load on the server, it is possible that the transaction merger will not have a chance to return in time. We try to abort the connection here, since something is obviously wrong, and we do just that. The problem is that we sent a buffer to the transaction merger, and while it might not have gotten to processing our request yet (or might not have completed it, at least), there is no way for us to actually pull the request out (it might have already started executing, after all).

The code didn’t consider that, and what happened when we did get a timeout is that the buffer was returned to the pool, and if it was freed in time, we would get an access violation exception if we were lucky, or just garbage in the buffer (that we previously validated, so we didn’t check again) that would likely also cause a crash.
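Reduced to its essentials, the race looks something like this sketch (my illustration, not the actual RavenDB code): the caller times out, returns the pooled buffer, and the still running task keeps reading memory that may already belong to someone else.

    using System;
    using System.Buffers;
    using System.Threading.Tasks;

    public static class TimeoutOwnershipBug
    {
        public static async Task ProcessAsync(Func<byte[], Task> merger, TimeSpan timeout)
        {
            var buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
            var work = merger(buffer); // the merger task keeps a reference to 'buffer'

            if (await Task.WhenAny(work, Task.Delay(timeout)) != work)
            {
                // BUG: 'work' may still be running and reading from 'buffer'.
                ArrayPool<byte>.Shared.Return(buffer);
                throw new TimeoutException("Transaction merger did not complete in time");
            }

            ArrayPool<byte>.Shared.Return(buffer); // safe: the task has completed
        }
    }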

The solution was to wait for the task to complete, but ping the other host to let it know that we are still alive, and that the connection shouldn’t be aborted.
