Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,399 | Comments: 47,407

filter by tags archive

Inventory management in MongoDB: A design philosophy I find baffling

time to read 5 min | 961 words

Related imageI’m reading MongoDB in Action right now. It is an interesting book and I wanted to learn more about the approach to using MongoDB, rather then just be familiar with the feature set and what it can do. But this post isn’t about the book, it is about something that I read, and as I was reading it I couldn’t help but put down the book and actually think it through.

More specifically, I’m talking about this little guy. This is a small Ruby class that was presented in the book as part of an inventory management system. In particular, this piece of code is supposed to allow you to sell limited inventory items and ensure that you won’t sell stuff that you don’t have. The example is that if you have 10 rakes in the stores, you can only sell 10 rakes. The approach that is taken is quite nice, by simulating the notion of having a document per each of the rakes in the store and allowing users to place them in their cart. In this manner, you prevent the possibility of a selling more than you actually have.

What I take strong issue with is the way this is implemented. MongoDB doesn’t have multi document transactions, but the solution presented requires it. Therefor, the approach outlined in the book is to try to build transactional semantics from the client side. I write databases for a living, and I find that concept utterly baffling. Clients shouldn’t try to do stuff like that, not only would they most likely get it wrong, but they’ll do that extremely inefficiently.

Let us consider the following tidbit of code:

The idea here is that the fetcher is supposed to be able to atomically add the products to the order. If there aren’t enough available products to be added, the entire thing is supposed to be rolled back. As a business operation, this make a lot of sense. The actual implementation, however, made me wince.

What it does, if it was SQL, is the following:

I intentionally used SQL here, both to simplify the issue for people who aren’t familiar with MongoDB and to explain the major dissonance that I have with this approach. That little add_to_cart call that we had earlier resulted in no less than eight network roundtrips. That is in the happy case.  There is also the failure mode to consider, which involved resetting all the work done so far.

The thing that really bothers me is that I can’t believe that this is something that you’ll actually want to do except as an intellectual exercise. I mean, sure, how we can pretend to get transactions from non transactional store is interesting, but given the costs of doing this or the possibility of failure or the fact that this is a non atomic state transition or… you get my point, right?

In the case of this code, the whole process is non atomic. That means that outside observers can see the changes as they are happening. It also opens you up for a lot of Bad Stuff in terms of abusing the system. If the user is malicious, they can use the fact that this “transaction” is going to be running back and forth to the database (and thus taking a lot of time) and just open another tab to initiate an action while this is going on, resulting in operations on invalid state. In the example that the book give, we can use that to force purchases of invalid items.

If you think that this is unrealistic, consider this page, which talks about doing things like making money appear from thin air using just this sort of approaches.

Another thing that really bugged me about this code is that it has “error handling” I use that in quotes because it is like a security blanket for a 2 years old. Having it there might calm things up, but it doesn’t actually change anything. In particular, this kind of error handling looks right, but it is horribly broken if you consider what kind of actual errors can happen here. If the process running this code failed for any reason, the “transaction” is going to stay in an invalid state. It is possible that one of your rake will just disappear into thin air, as a result. It is supposed to be in someone’s cart, but it isn’t.  The same can be the case if the server had an issue midway or just a regular network hiccup.

I’m aware that this is code that was written explicitly to be readable and easy to explain, rather then be able to withstand the vagaries of production, but still, this is a very dangerous thing to do.

As an aside, not quite related to the topic of this post, but one thing that really bugged me in the book so far is the number of remote requests that are commonly required to do things. Is there an assumption that the database in question is nearby or very cheap to access, because the entire design philosophy I use is to assume that going over the network is expensive, so let us give the users a lot of ways to reduce that cost. In contrast, at least in the book, there is a lot of stuff that is just making remote calls like there is a fire sale that will close in 5 minutes.

To be fair to the book, it notes that there is a possibility of failure here and explain how to handle one part of it (it missed the error conditions in the error handling) and call this out explicitly as something that should be done with consideration.

PR Reviewavoid too many parameters

time to read 2 min | 228 words

During code review I run into these two sections, which raised a flag. Can you tell why?

image

image

The problem with this type of code is two fold. First, we add optional parameters, to reduce the number of breaking changes we have. The problem with that is that we already have parameters on the call, and eventually you’ll get to something like this:

image

Which is the queen of optional parameters method, and you can probably guess how it looks internally.

In the first case, we can add the new optional parameter to the… options variable that we are already sending this method. This way, we don’t have to worry about breaking changes, and we already have a way to setup options, determine defaults, etc.

In the second case, we are passing two bools to the method, and there isn’t a preexisting parameters object. Instead of creating one, we can use a Flags enum, whose bits we can set to determine what exactly the behavior of this method should be. That is generally much easier to maintain in the long run.

Inside RavenDB 4.0 book–Chapter 4 & 5 are done

time to read 1 min | 189 words

imageThe RavenDB 4.0 book is going really well, this week I have managed to write about 20,000 words and the current page count is at 166. At this rate, it is going to turn into a monster in terms of how big it is going to be.

The book so far covers the client API, how to model data in a document database and how to use batch processing in RavenDB with subscriptions. The full drafts are available here, and I would really appreciate any feedback you have.

Next topic is to start talking about clustering and this is going to be real fun.

I’m also looking for a technical editor for the book. Someone who can both tell me that I missed a semi column in a code listing and tell me why my phrasing / spelling / grammar is poor (and likely all three at once and some other stuff that I don’t even know). If you know someone, or better yet, interested and capable, send me a line.

PR Reviewthe errors should be nurtured

time to read 2 min | 202 words

I run into the following in a PR I recently reviewed. Take a look and try to figure out why I’m pointing this bits out:

image_thumb[2]

image_thumb[3]

Those are bad errors. In the sense that they are hiding information. Sure, in 90% of the cases the user just put it the backup location or the restore location, so they know what the application is referring to. But the other 10% is looking at the exception from the logs, having “I got an error” support call or all sort of weird stuff.

In particular, one of the most annoying things is when the user and the application disagree on a relative path. If the user believes that the relative path is based on one location and the application another, hilarity ensues. But not in a fun way.

A better way to prevent all of those would be to just include the actual locations (fully resolved, mind) in the error. It doesn’t prevent support calls, but it does make them a lot easier to solve.

Reviewing Noise Search EngineSummary

time to read 5 min | 835 words

imageAfter the last three parts, I think that I have a good high level understanding of what is going on in the Noise codebase, as a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.

I love reading code, especially stuff that I’m not familiar with, such as a new platform or doing something novel. The Noise codebase is quite interesting, and it is a very early stage in the project. That said, when looking at what it is doing and comparing to what I’m doing in my day job, a few things popped up.

I couldn’t help but think that this is a very similar model to the default Azure DocumentDB. All the data is indexed, and the values are shredded as they go in. The model you use was very similar. The nice thing about this kind of model is that it gives you a lot of freedom and flexibility, but the downside is that if you need more, there aren’t any alternatives. Storing large JSON documents inside Noise is likely to cause problems, especially if you are typically working with the full document. If you are just throwing things into it and then pulling back pieces, especially if the documents you write are relatively small, this is likely to be much more suited.

Noise is making some clever usage of the LSM nature of RocksDB to do cleanups, and I really like the elegance of the indexing engine. The query model is also interesting, in that it gives you a lot of freedom to operate. Select the right fields, define the appropriate aggregation, etc. But queries are limited (no computation) and aggregation queries are going to run with cost of O(N) with respect to their results. Given that this is intended to be embedded, and apparently into low end devices, that might very well be the right decision.

A side affect of the shredding is that you need to reconstruct documents all the time, since they were broken into component parts. Things like scanning results by one field while returning another looks like they can generate O(N * log N) number of queries in RocksDB. I’m not sure how much that can scale with regards to queries. On the one hand, there are very clear layers between the pieces of the code, which is great, but that also have some built in issues with regards to how much performance you can squeeze out of the system. I don’t have a feeling for Rust code performance, nor do I think that the project is ready at this early stage for performance analysis, but it is a concern. Then again, I did just spent the past two years head down in a major performance work, so that might very well be the reason. And it is entirely possible that the code will perform significantly better when the project is done.

One last thing to note, when I tried building a replacement to Lucene one of the annoying thing to pop up was that it is really hard to replace Lucene efficiently and still be better in term of time / space / complexity. Noise in particular is storing all the data at least twice (once as terms to be searched, once as the actual document data). This make is very easy to do certain operations (such as update a document), but it also means that the cost of updating a document is now doubled. We first need to index the new document, then we need to re-index old document, compare between the two and then write the changes. I guess that it is possible that reducing the amount work to be written would help matters, but the cost of indexing is likely to dominate.

I would expect a Noise index to be significantly larger than the Lucene equivalent, because Lucene does a lot to reduce file size (in quite evil & nasty ways, to be fair).

Another thing that I don’t think that is good is the reliance on a query language in an embedded solution. A much better idea in my option would be to expose the raw documents out and let the client handle it. If it is running in the same memory space, there is no extra cost. This is because the moment that this open up a huge number of features / issues. I’m talking specifically about the notion of “return” here. The idea is really cool, you can project exactly what you want from the document, and that allows the engine to only pull the relevant data.

At any rate, this is really getting late now, so I’ll close with the note that Noise is a really cool codebase to read. Even if you don’t know Rust, going ahead and reading something non trivial is a really good exercise in learning new things. And I’m really interested in what Noise is going to do next.

Reviewing Noise Search EnginePart III

time to read 5 min | 837 words

After the previous two parts, I think that I have a good high level understanding of what is going on in the Noise codebase, as a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.

Now I’m reading through the JsonValue file. The first interesting thing here is:

image

Given all the code I have read so far, I think that this is a really elegant way to handle JSON, both in terms of how you can define such a recursive structure in 7 lines of code and in how this allow you to express relatively complex scenario quite nicely.

One thing that bugs me is that there are a lot of things like this:

image

This is done to have enough space for escaping. The problem is that this waste quite a bit of space. The last few blog posts I had were about 2KB of text each. Imagine doing a batch of a few thousands of them, just this types of calls can take up a lot of extra memory. Now, I’m used to managed memory, where this kind of thing hangs around and need to be collected. This code is in Rust, and it will clean itself up as soon as it is out of scope. But given that we hold on to it for a non trivial amount of time, I think that it is a concern. It might actually be cheaper to go over the string multiple times to figure out exactly how many escape chars we need.

One of the first optimizations we did when we wrote our own JSON parser was get rid of the costly escape char handling. It took literally more time than anything else, because we kept having to scan the string and break it over and over again.

Another issue that popped up is here:

image

A silent assumption here is that the order of properties in the two objects is the same. Two identical objects that have different property order will sort unequal. That is likely to be a surprise to the user.

The next file is parser.rs, which is over 1,300 lines of code (almost 20% of the entire codebase of Noise!), this is pretty typical parsing code, low level, nasty and quite tedious. I skimmed it briefly, but I don’t think that there is anything really important going on there. Its main purpose is to take a query string and produce a filter, next.

The next file is query.rs and I’m going to assume that this will lead me to also jump to filter.rs which I previous skipped.

The code here start using terms that I’m familiar with from Lucene. We’ll start where the money is, in Query::get_matches, which starts by building the query itself:

image

Basically, create a snapshot of the index, parse the query. We need to pass the snapshot to the parser so it can pass it on to the filters that it builds internally. The order of operations in this code also determines the required order of appearances of the clauses in the Noise query syntax.

The rest of the method goes over the query and build it, setting up ordering, aggregation and such. There is close cooperation between query.rs, filter.rs and snapshot.rs. The query is a high level concept of everything that must happen, filter is a much lower level, just finding the right stuff. On top of that the query layer boosting, scoring, ordering and aggregation. The filter uses the snapshot to access the stored information.

For example, the AllDocsFilter is using the snapshot.new_all_docs_iterator, which is defined as:

image

Basically, scan all the values with the prefix of S, which I think stands for sequence number.

More complex stuff can be seen in the Scorer, as a good example. We have terms prefixed with C, which I think stands for Count, which gives the number of times that a particular term appear in a particular position. And K is the number of times a particular property has shown up in general. This seems to be the basis of tf-idf calculation. Those are merged together using RocksDB merge command and the custom merge policy that we looked at in the index.rs.

At this point I looked at the clock and realized that this is now after midnight and that I got to the most interesting parts that I wanted to see, the rest are pretty familiar details, since a lot of this follows the Lucene construction.

I’m going to have another blog post about my thoughts about this project, which will show up tomorrow.

Reviewing Noise Search EnginePart II

time to read 6 min | 1169 words

In my previous post, I started going over the Noise project code. We got to the indexer, but there is still a lot to do, as a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.

The indexer code handed off a lot of work to the shredder, so it is a good thing that this is my next target for review. At a glance, it looks like Noise is using rustc_serialize to do the actual JSON parsing. I’m guessing it make sense at this point, but given that RavenDB 4.0 had rolled its own JSON parser, and given that Noise is accepting the data as strings, I’m going to assume that JSON parsing costs are going to be high on any profiler output.

The first thing to see in the file is the shredder struct:

image

The existing_key_value_to_delete is very interesting, since it implies that this is how deletes & updates are handled.

It looks like this is working really closely with the key builder. The first method on the file is:

image

There are a whole bunch of stuff like that, with roughly the same pattern (add_bool_null_entries, add_stemmed_entries, etc). Because of the close cooperation of those two, I’m going to be reading them in parallel. The KeyBuilder struct looks like:

image

At this point, I’m guessing that what is going on is that the shredder goes over the JSON  document, and add the property path to the KeyBuilder. But given add_number_entries, I don’t know see how that would work. We are adding the document sequence number to the end of the key path, and store the value of the property in the value. That is… strange. The KeyBuilder data contains the property paths and the current position in the array (possibly nested), so that much is clear, its usage, not so much at this stage.

Because this is confusing, I’ll start from the shred function and go from there, to see what it is exactly that this is going on.

image

This takes the string, pass it to the JSON parser and start iterating over it. Okay, going over the code, that seems easier to follow now. This parse the JSON one token at at time. When it encounters a property, it add that to the KeyBuilder. So the KeyBuilder is basically the path into the current value in the document that we are parsing.

This bugs me, though:

image

I’m following what is going on, and it makes perfect sense. But all those special key prefixes that are scattered are not documented anywhere that I could see, and as I mentioned previously, single letter prefixes are hard on readability and I don’t think they add much, given that RocksDB is using prefix compression anyway.

There is already a PR to fix this, which is likely to be applied by the time you are reading this blog post, by the way. This change the prefixes to be constants, improving code readabily.

A lot of the functionality seems to be inside maybe_add_value, see the call sites:

image

It looks like all the numbers are recorded as floats (given that JSON use double precision IEEE floating point, I guess that make sense). You can see the different prefixes that are given for the values.

The tests at the bottom of the file completes the picture, and if I understand this properly, given the following JSON:

{ "name":"John", "age":31, "city":"New York", “hobbies”: [“reading”, “dogs”] }

The shredder is going to output the following entries, given a document sequence of 555:

  • W.age!f31#555,
  • W.hoobies$!dogs#555,1
  • W.hobbies$!reading#555,0
  • W.name!john#555,

I think that the code calls this key path, and the general format is (white space added:

W <property path> ! <value> # <document sequence> , <array path>

A critical observation is that an array is denoted as a $ in the property path, and there are as many indexes in the array path as there are $ signs in the property path.  I’m currently ignoring the full text search aspect of things, since it is not important for our purposes. With the understanding of what is being generated, let’s move to seeing what is actually going on.

Let us go back to merge_existing_doc, and see how that works. This is called after we shredded a document, so we have done all of the interesting work with regards to parsing the document and building the tokens, splitting the data, etc. At this point, we check if we already have this document, and if so, we load all the existing values and compare them to the current version. The nice thing is that we get to optimize. If a certain value didn’t change, we don’t need to do anything at all. On the other hand, if a value was there and is no longer in the current version, we need to delete it. That is effectively how updates are handled.

With deletes, we do the same, but in reverse, we reconstruct the document, find all the indexed values, and then we remove it. Finally, we have add_all_to_batch, which take all the in memory work that was done and apply that to the RocksDB batch, pretty much finishing all the work. I don’t like that some methods are holding into the state (shredded_key_values / existing_key_value_to_delete)  then flush it and some are manipulating the RocksDB batch directly, it seems like there should be either or kind of approach.

Coming back to the issue that I had with the compaction routine, we have this code, that add a number to the entries:

image

Now, if I’m reading things properly, if the value is zero (not uncommon by any means), that will result in 8 bytes of zeros being written as the value. Combine this with this:

image

And now I think that this will cause that value to be removed during compaction, resulting in data corruption. No idea if this is actually the case, since I’m just going over the code, and not running / testing it thoroughly.

It turns out that I’m not reading this properly (pun intended). I missed the negation on the first line of compaction_filter. The C & K values are the only things that will be removed during compaction, not the other way around, so no data corruption, just me remember that there is a reason I use “== false” instead of “!”, because I keep missing them.

Reviewing Noise Search EnginePart I

time to read 11 min | 2097 words

I run across this post by Damien Katz, and was immediately intrigued. That has several reasons.  The last time I did a deep review of Damien’s code was about a decade ago, when I went over the CouchDB source code, in an attempt to learn Erlang. What end up happening was that I wrote RavenDB.

Note: This is pretty much my thoughts as I’m going through the code, it isn’t meant to be a guide or anything except a record of what went through my head as I read the code.

Another reason is that this is obviously a subject matter that is near & dear to my heart. Searching is something that we do all  the time, and I would dearly love to read some code that isn’t Lucene to do so. Finally, the code is written in Rust, which I have some passing interest at, so I consider this a win all around. You can find the project site here and the code is here. I’m reviewing 48d808375e541eca383826ea928272976bce064d.

Some things that are obvious from the get go, just by skimming through the docs. This is a small project, less than 10,000 lines of code. That is great, since it means that I can review it in one sitting. I also don’t really care that much for the parser and format that they do, I’m mostly interested in how they are storing and searching the data. The storage is done via RocksDB, which is interesting, because it impose certain limitations on the solution (in particular, single key space) so it is interesting how the data is stored.  Most importantly, I really want to see how this is handling updates, which is what I find to be the hardest problem for search impl.

As usual, I’m going to start reading code in file name order, without any relation to the current structure of the code, and jump around from there.

In aggregates.rs, we find this piece of code, which allows to run an aggregation. The init / action / extract are the set  of operations that happen in the aggregation process.

image

Right now I don’t know what JsonValue is, but it is interesting to read just from the signatures. The action function can get a reference to a mutable instance, an immutable instance and a reference to an immutable instance. I’m assuming that the mutable instance is the one that is being modified to hold the aggregated value.

The aggregation itself is implemented by defining an enum that has a get_fun_impls method that initialize the AggregateFunImpls based on the specific aggregation value required.

Sum is defined as two functions, sum_init and sum, without any final step. The actual implementation is:

image

That is fun to read, because it pretty simple, although the syntax in the if let line is really confusing to me (first time in the code, non Rust dev). Basically, it says that we’ll take the JsonValue instance, convert it from a JsonValue to a number instance, and then set it to the same variable name as the original one.

If I was writing it in C it would probably look like:

JsonValue* existing = …;
int* existing = extract_number_(existing);
existing += new;

That is leaving aside the fact that the “new” that is used here and the “new’ in  the function arguments are not the same (well, they are conceptually, but not the same type, etc). I find this confusing, but I know that this is fairly common in Rust code.

The rest of the file is pretty similar constructs for other aggregation operations, and the next is error.rs, which is just boilerplate code that Rust requires from anything. The next file is filters.rs, which seems really interesting, but it doesn’t make sense at the current stage of the code, because it refers to a lot of other things that I haven’t checked yet, so moving to the next one, index.rs. Hopefully that will start the juicy bits.

Indeed, the interesting bits start when we call open, where we open up a rocksdb instance and start playing around. There is some interesting error handling here:

 

image

In particular, note that if the error failed because the database didn’t exists, it is created, and a header version is added. I wholeheartedly approve, because such early planning will pay dividends down the road. It is important to note that assert statements in Rust are going to be there for release. Only debug_assert will be removed in release builds.

The index has the ability to create a snapshot, based on the RocksDB Snapshots. I wrote a blog post on how that works on LevelDB (which RocksDB is based on).

image

I do wonder about the unwrap call here. If I understand correctly, this would cause a panic if I’m access an instance that isn’t open. That might be bad if you have a query hitting the server during shutdown, I’m guessing. I assume that this is oversight because it make the code easier to consume.

The add method there is where stuff really start to happen.

image

This is quite interesting, because you pass it a json string, and a batch (a Noise concept that basically wraps a RocksDB batch).

Actually, thinking about this, there is something that might be trippy here. RocksDB allow concurrent batches, but while it has support for optimistic concurrency, I’m guessing (at this point it is a guess) that this isn’t used by Noise, and that can lead to severe problems if you are trying to add to the index from concurrent threads. This is important because parallel indexing is pretty much the first optimization step that anyone will try, and Lucene calls it out explicitly as a best practice.

imageThe json is parsed by a class called shredder, I’m assuming that this is the JSON parser used by Noise.

The actual behavior is quite interesting.

image

We create a new shredder, then set it on the JSON string. The “let … if let” syntax gave me a pause for a bit. It is elegant, in a crazy sort of way.

But the really interesting bit here is that this code says quite a lot about the way Noise works. First, we now know that Noise has stable document ids, since we can see it being used when we check if the document is already in the index. That means that there is update support, which is pretty cool. Although I do wonder why the values are merged back in. For deletion, or are they just creating a large piece?

If there is no id, Noise uses a guid. I would usually call out that this might be a problem (because down the line, given enough work, it will create a lot of compaction work around the files), but I’m suspecting that not specifying an id is going to be a pretty rare occurrence, so that isn’t important.

Finally, we add it to the batch:

image

That means tat the shredder is probably a very important piece in the system. I’ll continue to read the index code for now and see what I can figure out, before jumping to that code.

Next, we have delete, and the key part of that method is:

image

So we have delete support, and comparing that to the update process is very interesting. They are different, which means that updates aren’t just delete/insert. That is interesting.

Next we have gather_doc_fields, which gets a string document id. What happens from there is interesting. Let us assume that our document id is “users/1”.

Noise searches the RocksDB for a value named “Iusers/1” <- Note that I prefix, I assume to stand for id. That happens in fetch_seq. If there is such a key, its value would be a string representation of int64, which is parsed and returned. That seq is the internal sequence number, I’m assuming. And I’m guessing it is used for most operations. Let us assume that the seq number returned in 1234.

gather_doc_fields then do another RocksDB search for “V1234#”. This is done via a KeyBuilder, which Noise is using to generate keys based on the depth of the search, I’m guessing. Haven’t really looked into that yet.

The code to collect the keys for a document sent me into a bit of a wild goose chase, in the sense that I couldn’t figure out what was going on here:

image

RocksDB is a key/value database, it doesn’t support multiple values per key. Took me a while to figure out that this is a (I’m guessing inefficient) way to to clone the value from the database owned memory to memory controlled by the Noise.

The next method, flush, just update the internal data (version and last sequence number) and submit the write batch. That concludes it as far as I’m concerned, unless something really strange is going on, indexing in Noise must be done in a single threaded manner. Note that there isn’t anything wrong with that (this is what we ended up doing in RavenDB 4.0), but it is something that needs to be stated out load, because it is contrary to the expected practice. Then again, I haven’t seen anything on indexing speed or any sign of benchmarking yet, maybe the team didn’t get there at this point.

The last thing on this file that is of interest are the RocksDB management functions. The compaction_filter – Will keep any key that starts with K or C, but if the value doesn’t start with them, then the value is interpreted as int32, and if it is 0, it is discarded. This is interesting, because HDB (the header for the db) key is not one of the protected values, and so the reason it isn’t thrown out is that Noise is using little endian value here, and accidently the int64 value (1 in this case) is interpreted as non zero int32. I’m assuming that the same holds true for the rest of values, but it seems more like accidental behavior. There is also the fact that we already know that there are some values (sequence numbers) whose value is a stringified int64. That also will pass the “not zero int32”, but even if it works, it looks wrong.

image

So basically, we have a few more key prefixes, which have a certain meaning, yet unknown. But if they are those values, they should be compared using the KeyBuilder comparison, otherwise, just compare them as is. I don’t like the single letter prefixes. RocksDB uses prefix compression, so I don’t think that using more readable prefix would cost anything.

Finally, we have the sum_merge, which is called by RocksDB when you make a Merge call. This is an interesting approach to handling updates, and what this basically does in the case of Noise is that it allow to merge multiple calls to the same value and return the sum total of them. I’m assuming that the value used here is the ref count or some such. Given the interplay between sum_merge and compaction_filter.

That is it for the index.rs file. There are actually tests there for the index, which is interesting. I don’t like having tests in the same file, but I think it is common in Rust.

The only thing of note there is that we find some interesting information about the format of the keys:

The document:

image

Is translated into the following:

image

That is the job of the KeyBuilder, I think, but I’ll leave digging into that to another post.

The case of the hanging request in the cache

time to read 2 min | 351 words

imageA user just reported that after a short time of using RavenDB, all requests started to hang and eventually timed out. We are talking about making several dozen requests in the space of a minute or so, as a user operating against a web application.

That was a cause of concern, obviously, and something that was very strange. Luckily, the user was able to send us a stripped down version of his application and we were able to reproduce this state locally. On that particular application only. We couldn’t reproduce this anywhere else.

In fact, we fixed the problem and still don’t know how to reproduce it. The underlying issue is that we are making a GET request using HttpClient and specifying HttpCompletionOption.ResponseHeadersRead in the SendAsync method. Now, the server might return us a 304 response, in which case we’ll use the cached value we have. If we make a few requests like that, all future requests will hang. In order to fix this problem we need to dispose the response, or read the (empty) content of the response.

Now, the crazy thing is that I can probably speculate pretty accurately on why this is going on, we have fixed it, but I can’t reproduce this problem except within the application the user sent us. So no independent reproduction.

The likely issue is that the HttpClient considers the request open until we read all its content (even if there is non), and that as we make more requests that have the same 304 and therefor no access to the content of the request, we have more and more requests that are considered to be open, up to the maximum number of requests that are allowed against the server concurrently. At this point, we are waiting for timeout, and it look like we are hanging.

All very reasonable and perfectly understandable (if annoying because you only need to dispose it sometimes is a trap waiting to happen), but I don’t know why we couldn’t reproduce it.

RavenDB 4.0The etag simplification

time to read 2 min | 357 words

A seemingly small change in RavenDB 4.0 is the way we implement the etag. In RavenDB 3.x and previous we used a 128 bits number, that was divided into 8 bits of type, 56 bits of restarts counter and 64 bits of changes within the current restart period. Visually, this looks like a GUID:  01000000-0000-0018-0000-000000000002.

The advantage of this format is that it is always increasing, very cheap to handle and requires very little persistent data. The disadvantage is that it is very big, not very human readable and the fact that the number of changes reset on every restart means that you can’t make meaningful deduction about relative sizes between any two etags.

In RavenDB 4.0 we shifted to use a single 64 bits number for all etag calculations. That means that we can just expose a long (no need for the Etag class) which is more natural for most usages. This decision also means that we need to store a lot less information, and etags are one of those things that we go over a lot.  A really nice side affect which was totally planned is that we can now take two etags and subtract them and get a pretty good idea bout the range that needs to be traversed.

Another important decision is that everything uses the same etag range. So documents, revisions, attachments and everything share the same etag, which make it very simple to scan through and find the relevant item just based on a single number. This make it very easy to implement replication, for example, because the wire protocol and persistence format remain the same.

I haven’t thought to write about this, seemed like too small a topic for post, but there was some interest about it in the mailing list, and enumerating all the reasons, it suddenly seems like it isn’t such a small topic.

Update: I forgot to mention, a really important factor of this decision is that we can do do this:

image

So we can give detailed information and expected timeframes easily.

FUTURE POSTS

  1. Zombies vs. Ghosts: The great debate - 12 hours from now
  2. Bug stories: The data corruption in the cluster - about one day from now
  3. Bug stories: How do I call myself? - 3 days from now
  4. Bug stories: The memory ownership in the timeout - 4 days from now
  5. We won’t be fixing this race condition - 5 days from now

And 2 more posts are pending...

There are posts all the way to Jul 04, 2017

RECENT SERIES

  1. RavenDB 4.0 (8):
    13 Jun 2017 - The etag simplification
  2. PR Review (2):
    23 Jun 2017 - avoid too many parameters
  3. Reviewing Noise Search Engine (4):
    20 Jun 2017 - Summary
  4. De-virtualization in CoreCLR (2):
    01 May 2017 - Part II
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats