Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 5 min | 837 words

My previous post got an interesting comment:

I smell a Lucene.NET replacement for Raven 5 in the future :-)

I wanted to deal with this topic, but before we get to it, please note: I’m not committing to any features / plans for RavenDB here, merely outlining my thinking.

I really want to drop Lucene. There are many reasons for this.

  • The current release version of Lucene.NET was released over 7 years ago(!) and it matches a Lucene (JVM) version that was out in 2010.
  • The active development of Lucene.NET is currently focused on the 4.8 release (which corresponds to a 2014 Lucene version) and has been stalled for several years.
  • Lucene’s approach to resource utilization greatly differs from how I would like to approach things in modern .NET Core applications.

There is a problem with replacing Lucene, however. Lucene is, quite simply, an amazing library. It is literally the benchmark that you would compare yourself against in this field. This isn’t the first time that I have looked into this area, mind you. I believe I have reviewed most of the open source full text search libraries, and I have also written quite a few spikes around the problem.

Lucene is big, so replacing it isn’t something that you can do in an afternoon or a weekend. You can get to the 80% functionality in a couple of weeks, for sure, but it is the remaining 20% that would take 300% of the time :-). For example, inverted indexes are simple, but then you need to be able to handle phrase queries, and that is a whole other ball game.
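
To give a sense of the gap: a plain inverted index only records which documents contain each term, but a phrase query such as “new york” also needs to know where each term appears inside the document, so you can check that the positions are adjacent. A minimal sketch of the difference (the types and names are my own illustration, not RavenDB’s code):

    using System.Collections.Generic;
    using System.Linq;

    public class PositionalIndex
    {
        // Enough for "which documents contain this term?" queries.
        private readonly Dictionary<string, HashSet<int>> _docsByTerm =
            new Dictionary<string, HashSet<int>>();

        // Required for phrase queries: term -> document id -> positions of the term in that document.
        private readonly Dictionary<string, Dictionary<int, List<int>>> _positionsByTerm =
            new Dictionary<string, Dictionary<int, List<int>>>();

        // Documents where 'first' is immediately followed by 'second'.
        public IEnumerable<int> PhraseMatches(string first, string second)
        {
            if (_positionsByTerm.TryGetValue(first, out var firstPostings) == false ||
                _positionsByTerm.TryGetValue(second, out var secondPostings) == false)
                yield break;

            foreach (var entry in firstPostings)
            {
                if (secondPostings.TryGetValue(entry.Key, out var nextPositions) &&
                    entry.Value.Any(pos => nextPositions.Contains(pos + 1)))
                    yield return entry.Key; // both terms appear, in adjacent positions
            }
        }
    }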

We have looked into supporting the Lucene.NET porting effort directly, as well as developing our own replacement, and so far neither option has proven viable enough to act on.

2015 – 2018 were very busy years for us. We have rebuilt RavenDB from scratch, paying attention to many details that hurt us in the past. Some of the primary goals were to improve performance (10x) and to reduce support overhead. As a consequence, RavenDB now uses a lot less managed memory. Most of our memory utilization is handled by RavenDB directly, without relying on the CLR runtime or the GC.

Currently, our Lucene usage is responsible for most of the allocations in RavenDB. I can’t think of the last time that we had high managed memory usage that wasn’t directly related to Lucene. And yes, that means that when you take a memory dump of a RavenDB process, the top spot isn’t taken by System.String, which is quite exceptional, in my experience.

Make no mistake, we are still pretty happy with what Lucene can do, but we have a lot of experience with that particular version and are very familiar with how it works. Replacing it means a substantial amount of work and introduces risks that we’ll need to mitigate. Given these facts, we have the following alternatives:

  1. Write our own system to replace Lucene.
  2. Port Lucene 8 from JVM to .NET Core, adapting it to our architecture.
  3. Upgrade to Lucene.NET 4.8 when it is out.

The first option gives us the opportunity to build something that will fit our needs exactly. This matters because we have to deal with both feature fit and architectural fit. For example, we care a lot about memory usage and allocations, which means we can build the engine from the ground up to reduce these costs. The risk is that we would be developing from scratch, and it is hard to scope something this big.

The second option means that we would benefit from modern Lucene, but it means porting a non-trivial amount of code and adapting it both from the JVM to the CLR and to our own architectural needs. The benefit here is that we can likely skip porting the pieces we don’t need, but there is a lot of risk involved in such a big undertaking. We would also effectively be either forking the Lucene.NET project or maintaining our own version, which means that the maintenance burden would be on us.

Upgrading to Lucene.NET 4.8 is the third option, but we don’t know when that would be out (it has been stalled for a long time), and it does move us quite far along Lucene’s versions. It is also likely going to require a major change in how we behave. Not so much code wise (I don’t expect that to be too hard), but in terms of operational properties that differ from what we are used to.

The latter two options are going to keep Lucene as the primary source of allocations in our system, while the first option means that we can probably reduce that significantly for both indexing and querying, but we’ll have to forge our own path.

We have made no decisions yet, and we probably won’t for a while. We are playing around with all the options, as you can see on this blog, but that is also mostly because it is fun to explore and learn.

time to read 4 min | 645 words

Full text search is a really interesting topic, which I have been dipping my toes into again and again over the years. It is a rich area of research, and there have been quite a few papers, books and articles about it. I have read through a bunch of projects for doing full text search, and I have been using Lucene for a while.

I thought that I would write some code to play with full text search and see where that takes me. This is a side project, and I hope it will be an interesting one. The first thing that I need to do is to define the scope of work:

  • Be able to (eventually) do full text search queries
  • Compare and contrast different persistence strategies for this
  • Be able to work with multiple fields

What I don’t care about: the analysis process, actually implementing complex queries (though I do want to have the foundation for them), etc.

Given that I want to work with real data, I went and got the Enron dataset. That is over 517,000 emails from Enron, totaling more than 2.2 GB. This is one of the more commonly used test datasets for full text search, so that is helpful. The first thing that we need to do is to get the data into a shape that we can do something with.

Enron is basically a set of MIME encoded files, so I’ve used MimeKit to speed up the parsing process. Here are the relevant bits of the algorithm I’m using for getting the data into the system:
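
As a rough sketch of the kind of approach involved, parallel parsing with MimeKit and keeping just a few key fields (the exact fields and names here are illustrative assumptions, not the actual code from the post):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;
    using MimeKit;

    public static class EnronParser
    {
        public static ConcurrentQueue<Dictionary<string, string>> Parse(string maildir)
        {
            var results = new ConcurrentQueue<Dictionary<string, string>>();

            // Parse all of the email files in parallel and keep just a few key fields,
            // lower-cased so the later indexing step doesn't have to care about casing.
            Parallel.ForEach(
                Directory.EnumerateFiles(maildir, "*", SearchOption.AllDirectories),
                file =>
                {
                    var message = MimeMessage.Load(file);
                    results.Enqueue(new Dictionary<string, string>
                    {
                        ["From"] = message.From.ToString().ToLowerInvariant(),
                        ["To"] = message.To.ToString().ToLowerInvariant(),
                        ["Body"] = (message.TextBody ?? string.Empty).ToLowerInvariant()
                    });
                });

            return results;
        }
    }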

As you can see, this is hardly a sophisticated approach. We are spawning a bunch of threads, processing all half million emails in parallel, selecting a few key fields and doing some very basic text processing. The idea is that we want to get to the point where we have enough information to do full text search, but without going through the real pipeline that this would normally take.

Here is an example of the output of one of those dictionaries:

As you can see, this is bare bones (I forgot to index the Subject, for example), but on my laptop (8 cores Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz) with 16 GB of RAM, we can index this amount of data in under a minute and a half.

So far, so good, but this doesn’t actually get us anywhere. We need to construct an inverted index, so we can ask questions about the data and find things out. We are already about half way there, which is encouraging. Let’s see how far we can stretch the “simplest thing that could possibly work”… shall we?

Here are the key data structures:

Basically, we have an array of fields, each of which holds a dictionary mapping each term to the list of documents that contain that term.
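
A minimal version of that structure (the naming here is mine, not from the post) could look like this:

    using System.Collections.Generic;

    public class SimpleIndex
    {
        // One entry per indexed field (e.g. From, To, Body), each mapping a term
        // to the list of ids of the documents that contain it.
        private readonly Dictionary<string, List<int>>[] _fields;

        public SimpleIndex(int numberOfFields)
        {
            _fields = new Dictionary<string, List<int>>[numberOfFields];
            for (var i = 0; i < numberOfFields; i++)
                _fields[i] = new Dictionary<string, List<int>>();
        }

        public void Index(int field, string term, int docId)
        {
            if (_fields[field].TryGetValue(term, out var docs) == false)
                _fields[field][term] = docs = new List<int>();
            docs.Add(docId);
        }

        public List<int> Query(int field, string term)
        {
            return _fields[field].TryGetValue(term, out var docs) ? docs : new List<int>();
        }
    }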

For the full code for this stage, look at the following link; it’s less than 150 lines of code.

Indexing the full Enron data set now takes 1 minute, 17 seconds, and takes 2.5 GB in managed memory.

The key is that with this in place, if I want to search for documents that contain the term “XML”, for example, I can do so quite cheaply. Here is how I can “search” over half a million documents to get all those that have the term HTML in them:

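(The query code appears as a screenshot in the original post.) With a structure like the SimpleIndex sketch above, the query boils down to a single dictionary access, along these lines (assuming, for illustration, that the Body field was indexed as field 2):

    // Assuming 'index' is the SimpleIndex from the sketch above, with Body indexed as field 2.
    // A single dictionary lookup gives us every document id for the term,
    // which is why this stays fast even over half a million documents.
    var matches = index.Query(field: 2, term: "html");
    Console.WriteLine($"{matches.Count} documents contain the term");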

As you can imagine, this is actually quite fast.

That is enough for now; next, I want to start actually exploring persistence options.

The final code bits are here; I ended up implementing stop words as well, so this is a really cool way to show off how you can do full text search in under 200 lines of code.
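
For reference, the stop word handling during tokenization can be as simple as the following (a sketch of the general idea, not the code from the post):

    using System;
    using System.Collections.Generic;

    public static class Tokenizer
    {
        // A tiny sample list; a real stop word list is considerably longer.
        private static readonly HashSet<string> StopWords = new HashSet<string>(
            new[] { "the", "a", "an", "and", "or", "of", "to", "in", "is" },
            StringComparer.OrdinalIgnoreCase);

        public static IEnumerable<string> Tokenize(string text)
        {
            var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
            foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                if (StopWords.Contains(token))
                    continue; // terms like "the" appear everywhere and carry no search value
                yield return token.ToLowerInvariant();
            }
        }
    }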

time to read 4 min | 796 words

An important aspect of RavenDB is that It Just Works. In other words, we aim to make RavenDB a really boring database, one that requires very little of your attention. As it turns out, this can get really complex. The topic for today is role assignment in a distributed cluster after recovering from a failure.

Let’s consider the case of a cluster of three nodes (A, B and C) hosting 10 databases. The databases are replicated on all three nodes, with A being the primary for 4 of them, and B and C being the primary for 3 each.

And the following sequence of events:

  • Node A goes down.
  • The RavenDB cluster will react to that by re-shuffling responsibilities in the cluster for the databases where A was the primary node of contact.
  • Clients will observe no change in behavior and continue to operate normally. All the databases are now served by nodes B and C.
  • Node B goes down.
  • At this point, there is no longer a cluster (which requires a majority of the nodes to be up, in this case two). Node C and the remaining clients follow the disaster plan that was created (and distributed to the clients) ahead of time by the cluster.
  • All the clients now talk to node C for all databases.

So far, this is business as usual and how RavenDB works to ensure that even in the presence of failure, we can continue working normally. Node C is the only remaining node in the cluster, and it shoulders the load admirably.

At this point, the admin takes pity on our poor node C and restarts node A and B. The following operations then take place:

  • The cluster can communicate between the nodes and start monitoring the system.
  • Nodes A and B are updated by node C with all the changes that they missed.
  • The cluster will update the topology of the databases to reflect that nodes A and B are now up.

At this point, however, node C is the primary node for all the databases, and its CPU heat sink is starting to show some warning signs.

What we need to do now is to re-assign the primary role for each database across the cluster. And that turns out to be a surprisingly tricky thing to do.

We sat down and wrote a list of the potential items we need to consider in this case:

  • The number of databases already assigned to the cluster node.
  • The number of requests / sec that the database gets vs. the already assigned databases on the node.
  • The working set that the database requires to operate vs. the existing working set on the node.
  • The number of IOPS that are required vs…

There is also the need to take into account the hardware resources for each node. They are usually identical, but they don’t have to be.

As we started scoping the amount of work that we would need to make this work properly, we were looking at a month or so. In addition to actually making the proper decision, we would first need to gather the relevant information (and across multiple operating systems, containers, bare metal, etc). In short, that is a lot of work to be done.

We decided to take a couple of steps back before we had to figure out where we would put this huge task in our queue. What was the actual scenario?

If there is a single database on the cluster (which isn’t uncommon), then this feature doesn’t matter; a single node will always be the primary anyway. This means that distributing primary responsibility in the cluster isn’t something you do for a single database, it only makes sense when you apply it to multiple databases. And that, as it turns out, gives us a lot more freedom.

We were able to implement the feature in the same day, and the cost of that was quite simple. Whenever the cluster detected that a database’s node had recovered, it would run the following command:

primary = topology[rand(topology.Length)];

This is the only thing that we need in order to get fair distribution of the data across the nodes.

The way it works is simple. As each database becomes up to date, the cluster will pick it up and assign it a primary at random. Because we have multiple such databases, sheer chance is going to ensure that they will go to different nodes (and thus rebalance the load across the cluster).
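
To make the “sheer chance” argument concrete, here is a small illustration (not RavenDB’s actual code) of ten recovering databases independently picking a random primary from a three node topology:

    using System;
    using System.Linq;

    public class RandomRebalance
    {
        public static void Main()
        {
            var topology = new[] { "A", "B", "C" };
            var random = new Random();

            // Each of the 10 recovering databases picks its primary node at random.
            var assignments = Enumerable.Range(1, 10)
                .Select(db => (Database: $"db-{db}", Primary: topology[random.Next(topology.Length)]))
                .ToList();

            // With enough databases, the load spreads roughly evenly across the nodes.
            foreach (var group in assignments.GroupBy(a => a.Primary).OrderBy(g => g.Key))
                Console.WriteLine($"Node {group.Key}: {group.Count()} databases");
        }
    }

The split won’t always be perfectly even, but with multiple databases recovering it averages out well enough that no single node ends up carrying the whole load.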

It is simple, it works quite nicely and it is a lot less scary than having to implement a smart system. And in our testing, it worked like magic :-).

time to read 3 min | 526 words

In my company, I have a simple rule. If you want a tool, ask for it, you’ll get it. If you want training, ask for it, you’ll get it. If you want technical books, let me know, you’ll get them. I don’t ask questions, and I don’t try to enforce any rules around that. I have gotten requests for things like a Pluralsight subscription (very relevant) and technical books on topics that we will probably never touch (which I happily purchased). But by and large, I don’t get many requests for stuff. Things have gotten so bad that we ran an internal marketing effort to get people to ask for stuff.

I’ll repeat that: I had to actively make it attractive to have people send me an email “can you get us XYZ”.

There is no tedious process involving multiple pitches and getting buy in. It is literally just a matter of sending an email, and you’ll get it. And people don’t take advantage of this option.

Recently, I tried outsmarting my folks and put that as an item in the current sprint. Something like: “Suggest a course / conference / training that you should go to this quarter”. I got push back from the team leaders, saying that no one could find something that they wanted to go to.

I’m still in the process of trying to find a solution to this problem, to be frank.

I thought about giving individual people a budget and just letting them handle that directly. That actually fails for a bunch of reasons:

How do you pay for this? The simplest option would be to have the developers pay and then reimburse them. I don’t like this option, because there is no need for the dev to float money for the company, especially since some of these expenses can be fairly high. The cost of a training course, for example, can be thousands of dollars. At that point, it is likely that we are going to have a discussion about it anyway, so I might as well pay directly. The same applies to tooling / books, etc (although they usually cost less).

Of more interest to me is that if there is a tool / training that one dev wants, it is likely that others will want it as well. That matters, because you can usually get volume discounts instead of paying for multiple individual purchases.

Finally, there are tools, and then there are tools. What sort of text editor you use doesn’t really matter to me. Nor do I care what sort of Git client you use. But a tool that is used to generate code, or part of the build / test process, is something that I do care about and want to look at.

What we end up with is a situation where we can’t decentralize the process, but we also can’t seem to get the people involved to just ask.

I would like to hope that this is because they have everything they need. I have tried to make the process as smooth and painless as possible, with no takers. At this point, I’m just going to go and meditate over this bit of wisdom.

time to read 1 min | 90 words

I’ll be doing a webinar today dealing with data modeling in RavenDB. In particular, I’m going to focus on both general modeling advice and how RavenDB was explicitly designed to make it easy / simple to do the common tasks that are usually so hard to deal with.

I’m going to be talking about the shape of your documents, the shape of your entities, how to design your system for best results and the kind of hidden features behind the scenes that you might want to take advantage of.
