Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969


time to read 8 min | 1407 words

I care about usability and the user experience for our users. We spend a lot of time on making sure that things are running smoothly. When we created RavenDB Cloud, I knew that it was important to create a good experience for our cloud offering.

One of the most important things that I did was to go and look at other people’s offerings and see where they failed to meet customer expectations. I recently ran into this article about AWS Elastic. Similar issues have been raised about it for a while, and avoiding them was one of my explicit design goals for our cloud offering.

Summarizing the discussion, it seems like the following major issues come up across the board.

  • Backups – Being able to have your own backup schedule and destinations, retention policies based on your needs, etc. AWS Elastic uses hourly backups and you only get 14 days of that. Cosmos DB, to look at an Azure offering, takes a backup every 4 hours and keeps a maximum of two of them.

We have users that need to be able to go back weeks / months / years and look at the state of their database at a given point in time. The 8 hour backup window for Cosmos DB is really short, but even 14 days on AWS Elastic is short enough that you’ll probably need to roll your own solution on top of it.

With RavenDB Cloud, you have automatic backups (hourly) with a retention period that defaults to 14 days. The key here, by the way, is defaults. You are absolutely free to define your own backup policies (per database or per cluster), which means that you can set your own destinations (want cross cloud backups? no problem) and your own retention policies.
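For illustration, here is roughly what defining such a policy looks like through the client API (a sketch assuming the RavenDB 4.x client, with placeholder schedule, bucket and retention values; exact property names may vary between versions):

using System;
using Raven.Client.Documents.Operations.Backups;

// Sketch: hourly incremental backups to S3, kept for 90 days.
// Assumes 'store' is an already initialized IDocumentStore.
var config = new PeriodicBackupConfiguration
{
    Name = "hourly-to-s3",
    BackupType = BackupType.Backup,
    FullBackupFrequency = "0 0 * * *",        // one full backup a day (cron format)
    IncrementalBackupFrequency = "0 * * * *", // incremental backup every hour
    S3Settings = new S3Settings
    {
        BucketName = "my-backups",            // hypothetical bucket
        AwsRegionName = "us-east-1"
    },
    RetentionPolicy = new RetentionPolicy
    {
        MinimumBackupAgeToKeep = TimeSpan.FromDays(90)
    }
};

store.Maintenance.Send(new UpdatePeriodicBackupOperation(config));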

  • Visibility – This is a very common complaint; it showed up in pretty much every resource that I checked and was a common cause of complaints among the peers I sampled when we did the background research for RavenDB Cloud. In particular, no logs or access to the debug endpoints is a killer for productivity. If you can’t tell what the problem is, you can’t fix it, so you’ll need to call support and have someone look things over. Here are a few choice quotes that I think go to the heart of things.

Here is Liz Bennett talking about things in 2017:

I feel equipped to deal with most Elasticsearch problems, given access to administrative Elasticsearch APIs, metrics and logging. AWS’s Elasticsearch offers access to none of that. Not even APIs that are read-only, such as the /_cluster/pending_tasks API.

Without access to logs, without access to admin APIs, without node-level metrics (all you get is cluster-level aggregate metrics) or even the goddamn query logs, it’s basically impossible to troubleshoot your own Elasticsearch cluster.

AWS’s Elasticsearch doesn’t provide access to any of those things, leaving you no other option but to contact AWS’s support team. But AWS’s support team doesn’t have the time, skills or context to diagnose non-trivial issues

And here is Nick Price, just a few days ago:

So your cluster resize job broke (on a service you probably chose so you wouldn’t have to deal with this stuff in the first place), so you open a top severity ticket with AWS support. Invariably, they’ll complain about your shard count or sizing and will helpfully add a link to the same shard sizing guidelines you’ve read 500 times by now. And then you wait for them to fix it. And wait. And wait. The last time I tried to resize a cluster and it locked up, causing a major production outage, it took SEVEN DAYS for them to get everything back online.

They couldn’t even tell if they’d fixed the problem and had to have me verify whether they had restored connectivity between their own systems.

I think that the reason this is the case is that AWS Elastic is a multi tenant service, so each instance is shared among multiple clients. That means that they have to limit the endpoints you can access, so as not to leak data from other clients.

With RavenDB Cloud, you get all the diagnostic features that you would normally get. We don’t want you to have to escalate things to support. The ideal scenario is that if you run into any trouble, you’ll be able to figure things out completely on your own. Hell, we do active monitoring and will contact our customers if we see outliers in the system behavior, to give them a heads up.

You can go to your RavenDB Cloud instance, pull the logs, watch ongoing traffic and even get a stack trace of the running production system. Everything is there, in the box.

  • Flexibility – The most commonly cited issue with AWS Elastic seems to be not being able to make changes to the environment. Or, rather, you can do that (increasing node count or changing the instance types), but when you do, you are going to have a full blown migration step. That means that you’ll double the number of nodes you run and incur a really expensive operation to copy all the data. Given that Elastic already has this feature (adding a node to an existing cluster), the decision not to support it is likely related to constraints in the AWS Elastic multi tenancy layer.

I’m not really sure what to say here; RavenDB Cloud has no such issue. To be rather more exact, our multi tenant architecture was specifically designed so that the outward facing differences between how you’ll operate RavenDB on your own systems and how you’ll operate RavenDB on the Cloud are minimal.

Adding and removing nodes is certainly possible. And in fact, in my intro video to RavenDB Cloud I showed how I can upgrade a cluster along multiple axes, while it is being actively written to. All with exactly zero interruptions in service. Client code that was busy reading and writing from the cluster didn’t even notice it. That was certainly not an easy feature to implement, but I considered it to be the baseline of what we had to offer.

  • Support – Both Nick and Liz had a poor experience with AWS Elastic support. You can read the quotes above, or read the full posts for the whole picture.

I don’t like support. We explicitly modeled the company so that support is a cost center, not a source of revenue. That means that we want to close each support incident as soon as possible. What it doesn’t mean is that we auto close all issues on creation. That would give us fantastic closure rates, I imagine, but at a cost.

Instead, we have a multi layered system to deal with things. Consider the scenario Nick ran into, a full disk. That is certainly something that you might run into, no?

  1. Automatic recovery - With RavenDB, a single full disk will simply cause that node to reject writes, nothing else. The cluster as a whole will continue to operate normally. With RavenDB Cloud, we have a lot more control, and our monitoring systems will alert on a full disk and increase its size automatically behind the scenes, with no impact on users.
  2. Diagnostics - Active monitoring means that you’ll be notified in the RavenDB Studio about suspicious issues (for example, running out of IOPS) and be able to investigate them with full visibility. RavenDB does a lot of work to ensure that if something is broken, you’ll know about it.
  3. Front line support – If you need to call our support, the person answering the call is going to be able to help you. They would typically be an engineer who was involved either in the actual building / managing of RavenDB Cloud or (2nd tier) in the development of RavenDB itself.

My goal with a support call is to get you back up to speed as soon as possible, and the usual metrics for that are measured in minutes, not days.

We are now several months post the launch of RavenDB Cloud and the pickup of customers has been great. What is more important from my end, however, is that we are seeing how this kind of investment in our architecture and setup is paying off.

time to read 5 min | 843 words

In the previous posts in this series, I explored a bit how to generate a full text index on top of the Enron data set. In particular, we looked at (rudimentary) analysis of text in the first post and looked into posting lists (list of matching documents for specific terms) in the second one. It occurred to me that we need to actually have a much better understanding of the kind of requirements that we have from posting lists in general, so let’s look at them, shall we?

  • Add to the list (increasing numbers only).
  • Iterate the list (all, or from starting point).
  • Reduce disk space and memory utilization as much as possible.

The fact that I want to be able to add to the list is interesting. The typical use case in full text search is to generate the full blown posting list from scratch every time. The typical model is to use LSM (Log Structured Merge) and take advantage of the fact that we are dealing with sorted lists to merge them cheaply.

Iterating the list is something you’ll frequently do, to find all the matches or to merge two separate lists. Here is the kind of API that I initially had in mind:
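The snippet itself is not shown here; a minimal sketch of such an API, based on the requirements above, might look like this (a reconstruction, not the original code):

using System.Collections.Generic;

// Sketch: the bare minimum a posting list needs to support.
public interface IPostingList
{
    void Add(long documentId);                 // ids are always added in increasing order
    IEnumerable<long> Iterate(long from = 0);  // all matches, or starting from a given id
}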

As you can see, there isn’t much there, which is intentional. I initially thought about using this as the baseline for a couple of test implementations using StreamVByte, FastPFor as well as Gorilla compression. The problem is that there is a need to balance compression ratio with the cost of actually going over the list. Given that my test cases showed a big benefit for using Roaring Bitmaps, I decided to look at them first and see what I can get out of them.

Roaring Bitmaps are a way to store a set of bits efficiently, and they are very widely used in the industry. The default implementation, however, isn’t suitable for my purposes, mostly because it makes use of managed memory, while a hard requirement that I have placed on this series is that I want to be able to use persistent memory. In other words, I want to be able to write the data out, then be able to do everything on top of memory mapped data, without having to parse it.

Roaring Bitmaps work in the following manner: each 64K range of integers gets its own segment of (at most) 8KB. Given that I’m using Voron as a persistence library, these numbers don’t work for my needs. Voron uses an 8KB page size, so we’ll drop these numbers by half. Each range will be 32K integers and take a maximum of 4KB of disk space. This allows me to store it much more efficiently inside of Voron. Each segment, in turn, has a type. The types can be either:

  • Array – if the number of set bits in the segment is less than 2048, the data will use a simple sorted array implementation, with each value taking 2 bytes.
  • Bitmap – if the number of set bits in the segment is between 2048 and 30,720, the segment will use a total of 4096 bytes and be a standard bitmap.
  • Reversed array – if the number of set bits in the segment is higher than 30,720, we’ll store in the segment the unset bits as a sorted array.
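In code, picking a segment type from the number of set bits in a 32K range might look roughly like this (a sketch using the thresholds above):

// Sketch: choose a segment representation for a 32K range based on its set bit count.
public enum SegmentType { Array, Bitmap, ReversedArray }

public static SegmentType ChooseSegmentType(int numberOfSetBits)
{
    if (numberOfSetBits < 2048)
        return SegmentType.Array;         // sorted array of offsets, 2 bytes each (under 4KB)
    if (numberOfSetBits <= 30720)
        return SegmentType.Bitmap;        // plain bitmap, 4,096 bytes for 32K bits
    return SegmentType.ReversedArray;     // store the unset bits as a sorted array instead
}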

This gives us quite a few advantages:

  • It is straightforward to build this incrementally (remember that we only ever add items at the end).
  • It is quite efficient in terms of space saving in the case of sparse / busy usage.
  • It is cheap (computationally) to work with and process.
  • It is very simple to use from memory mapped file without having to parse / create managed objects.

The one thing that we still need to take into account is how to deal with the segment metadata. How do we know which segment belongs to which range? In order to handle that, we’ll define the following:

The idea is that we need to store two important pieces of information: the start location (which is always going to be a multiple of 32K) and the number of set bits (which has a maximum of 32K). Therefore, we can pack both of them into a single int64. The struct is merely there for convenience.
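One possible packing, just as an illustration (the original post defined its own struct): since the start is always a multiple of 32K, we can keep the range index in the high bits and the set bit count in the low bits.

// Sketch: pack a segment's start position and its number of set bits into a single long.
public readonly struct SegmentMetadata
{
    private readonly long _value;

    public SegmentMetadata(long start, int numberOfSetBits)
    {
        // start is always a multiple of 32,768, so the range index is enough
        _value = ((start / 32_768) << 32) | (uint)numberOfSetBits;
    }

    public long Start => (_value >> 32) * 32_768;
    public int NumberOfSetBits => (int)(_value & 0xFFFF_FFFF);
}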

In other words, in addition to the segments with the actual set bits, we are also going to have an array of all the segment’s metadata. In practice, we’ll also need another value here, the actual location of the segment’s data, but that is merely another int64, so that is still very reasonable.

As this is currently a mere exercise, I’m going to skip actually building the implementation, but it seems like it should be a fairly straightforward approach. I might do another post about how to actually implement this on top of Voron, because it is interesting. But I think that this is already long enough.

We still have another aspect to consider. So far, we talked only about the posting lists, but we also need to discuss the terms. But that is a topic for the next post in the series.

time to read 3 min | 551 words

In full text search terminology, a posting list is just a list of document ids. These are used to store and find matches for particular terms in the index.

I took the code from the previous post and asked it to give me the top 50 most frequent terms in the dataset and their posting lists. The biggest list had over 200,000 documents, and I intentionally used multiple threads to build things, so the actual list is going to be random from run to run (which adds a little more real-worldedness to the system*).

*Yes, I invented that term. It makes sense, so I’m sticking with it.

I took those posting lists and just dumped them to a file, in the simplest possible format. Here are the resulting files:

[image: listing of the resulting posting list files and their sizes]

There are a few things to note here. As you can see, the file name is the actual term in the index, and the contents of each file are a sorted list of int64 document ids (as 8 byte little endian values).

I’m using int64 here because Lucene uses int32 and thus has the ~2.1 billion document limit, which I want to avoid. It also makes it more fun to work with the data, because of the extra challenge. The file sizes seem small, but the “from” file contains over 250,000 entries.
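For reference, producing a file in this format is just a matter of writing the sorted ids out as little endian int64 values, roughly like so (a sketch with hypothetical names):

using System.Collections.Generic;
using System.IO;

// Sketch: dump a term's sorted posting list as raw little endian int64 values.
static void DumpPostingList(string term, IReadOnlyList<long> sortedDocIds)
{
    using var file = File.Create(term);          // the file name is the term itself
    using var writer = new BinaryWriter(file);   // BinaryWriter writes little endian
    foreach (var id in sortedDocIds)
        writer.Write(id);                        // 8 bytes per document id
}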

When dealing with posting lists, size matters, a lot. So let’s see what it would take to reduce the size here, shall we?

[image: the same files after zipping]

Simply zipping the file gives us a massive space reduction, so there is a lot left on the table, which is great.

Actually, I might have skipped a few steps:

  • Posting lists are sorted, because it helps do things like union / intersect queries.
  • Posting lists are typically only added to.
  • Removals are handled separately, with a merge step to clean things up eventually.

Because the values are sorted, the first thing I tried was to use a diff model with variable sized ints. Here is the core code:
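The original snippet is not included here; the idea is to delta encode the sorted values and write each gap as a variable sized integer, roughly like so (a sketch, not the original code):

using System.Collections.Generic;
using System.IO;

// Sketch: delta encode a sorted posting list and write each gap as a variable sized int.
static void WriteDeltaVarInt(Stream output, IReadOnlyList<long> sortedDocIds)
{
    long previous = 0;
    foreach (var id in sortedDocIds)
    {
        var delta = (ulong)(id - previous);   // gaps between sorted ids are small
        previous = id;
        while (delta >= 0x80)                 // 7 bits of data per byte, high bit means more
        {
            output.WriteByte((byte)(delta | 0x80));
            delta >>= 7;
        }
        output.WriteByte((byte)delta);
    }
}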

Nothing really that interesting, I have to admit, but it did cut the size of the file to 242KB, which is nice (and better than ZIP). Variable sized integers are used heavily by Lucene, so I’m very familiar with them. But there are other alternatives.

  • StreamVByte is a new one, with some impressive perf numbers, but only gets us to 282 KB (but it is possible / likely that my implementation of the code is bad).
  • FastPFor compresses the (diffed) data down to 108KB.
  • RoaringBitmap gives us a total of 64KB.

There are other methods, but they tend toward the esoteric and aren’t something that I can quickly test directly.

It is important to note that there are several separate constraints here:

  • Final size on disk
  • Computational cost to generate that final format
  • Computational cost to go from the final format to the original values
  • How much (managed) memory is required during this process

That is enough for now, I believe. My next post will delve into the actual semantics that we need to implement to get good behavior from the system. This is likely going to be quite interesting.

time to read 5 min | 837 words

My previous post got an interesting comment:

I smell a Lucene.NET replacement for Raven 5 in the future :-)

I wanted to deal with this topic, but before we get to it, please note. I’m not committing to any features / plans for RavenDB in here, merely outlining my thinking.

I really want to drop Lucene. There are many reasons for this.

  • The current release version of Lucene.NET was released over 7 years ago(!) and it matches a Lucene (JVM) version that was out in 2010.
  • The active development of Lucene.NET is currently focused on release 4.8 (dated 2014) and has been stalled for several years.
  • Lucene’s approach to resource utilization greatly differs from how I would like to approach things in modern .NET Core applications.

There is a problem with replacing Lucene, however. Lucene is, quite simply, an amazing library. It is literally the benchmark that you would compare yourself against in its field. This isn’t the first time that I have looked into this field, mind. Leaving aside the fact that I believe that I reviewed most of the open source full text search libraries, I also wrote quite a few spikes around that.

Lucene is big, so replacing it isn’t something that you can do in an afternoon or a weekend. You can get to 80% of the functionality in a couple of weeks, for sure, but it is the remaining 20% that would take 300% of the time. For example, inverted indexes are simple, but then you need to be able to handle phrase queries, and that is a whole other ball game.

We looked into supporting the porting of Lucene.NET directly as well as developing our own option, and so far neither option has gotten enough viability to take action.

2015 – 2018 were very busy years for us. We rebuilt RavenDB from scratch, paying attention to many details that hurt us in the past. Some of the primary goals were to improve performance (10x) and to reduce support overhead. As a consequence, RavenDB now uses a lot less managed memory. Most of our memory utilization is handled by RavenDB directly, without relying on the CLR runtime or the GC.

Currently, our Lucene usage is responsible for most of the allocations in RavenDB. I can’t think of the last time that we had high managed memory usage that wasn’t directly related to Lucene. And yes, that means that when you do a memory dump of a RavenDB process, the top spot isn’t taken by System.String, which is quite exceptional, in my experience.

Make no mistake, we are still pretty happy with what Lucene can do, but we have a lot of experience with that particular version and are very familiar with how it works. Replacing it means a substantial amount of work and introduces risks that we’ll need to mitigate. Given these facts, we have the following alternatives:

  1. Write our own system to replace Lucene.
  2. Port Lucene 8 from JVM to .NET Core, adapting it to our architecture.
  3. Upgrade to Lucene.NET 4.8 when it is out.

The first option gives us the opportunity to build something that will fit our needs exactly. This is most important because we also have to deal with both feature fit and architectural fit. For example, we care a lot about memory usage and allocations. That means that we can build it from the ground up to reduce these costs. The risk is that we would be developing from scratch and it is hard to scope something this big.

The second option means that we would benefit from modern Lucene. But it means porting a non trivial amount of code and adapting it both from the JVM to the CLR and to our own architectural needs. The benefit here is that we can likely skip porting stuff that we don’t need, but there is a lot of risk involved in such a big undertaking. We will also effectively be either forking the .NET project or maintaining just our own version. That means that the maintenance burden would be on us.

Upgrading to Lucene.NET 4.8 is the third option, but we don’t know when that would be out (it has been stalled for a long time) and it does move us very far along in Lucene’s versions. It is also likely going to require a major change in how we behave. Not so much code wise (I don’t expect that to be too hard), but in terms of operational properties that differ from what we are used to.

The latter two options are going to keep Lucene as the primary source of allocations in our system, while the first option means that we can probably reduce that significantly for both indexing and querying, but we’ll have to forge our own path.

We have made no decisions yet, and we probably won’t for a while. We are playing around with all the options, as you can see on this blog, but that is also mostly because it is fun to explore and learn.

time to read 4 min | 796 words

An important aspect of RavenDB is that It Just Works. In other words, we aim to make RavenDB a really boring database, one that requires very little of your attention. As it turns out, this can get really complex. The topic for today is role assignment in a distributed cluster after recovering from a failure.

Let’s consider the case of a cluster with 10 databases, spread among three nodes (A, B and C). The databases are replicated on all three nodes, with A being the primary for 4 of them, and B and C being the primary for 3 each.

And the following sequence of events:

  • Node A goes down.
  • The RavenDB cluster will react to that by re-shuffling responsibilities in the cluster for the databases where A was the primary node of contact.
  • Clients will observe no change in behavior and continue to operate normally. All the databases are now served by nodes B and C.
  • Node B goes down.
  • At this point, there is no longer a cluster (which requires a majority of the nodes to be up, in this case, two). Node C and the remaining clients follow the disaster plan that was created (and distributed to the clients) ahead of time by the cluster.
  • All the clients now talk to node C for all databases.

So far, this is business as usual and how RavenDB works to ensure that even in the presence of failure, we can continue working normally. Node C is the only remaining node in the cluster, and it shoulders the load admirably.

At this point, the admin takes pity on our poor node C and restarts node A and B. The following operations then take place:

  • The cluster can communicate between the nodes and start monitoring the system.
  • Nodes A and B are updated from node C with all the changes that they missed.
  • The cluster will update the topology of the databases to reflect that node A and B are now up.

At this point, however, node C is the primary node for all the databases, and the CPU heat sink is starting to give off some warning signs.

What we need to do now is to re-assign the primary role for each database across the cluster. And that turns out to be a surprisingly tricky thing to do.

We sat down and wrote a list of the potential items we need to consider in this case:

  • The number of databases already assigned to the cluster node.
  • The number of requests / sec that the database gets vs. the already assigned databases on the node.
  • The working set that the database requires to operate vs. the existing working set on the node.
  • The number of IOPS that are required vs…

There is also the need to take into account the hardware resources for each node. They are usually identical, but they don’t have to be.

As we started scoping the amount of work that we would need to make this work properly, we were looking at a month or so. In addition to actually making the proper decision, we first need to gather the relevant information (and across multiple operating systems, containers, bare metal, etc). In short, that is a lot of work to be done.

We decided to take a couple of steps back before figuring out where we’d put this huge task in our queue. What was the actual scenario?

If there is a single database on the cluster (which isn’t uncommon), then this feature doesn’t matter. A single node will always be the primary anyway. This means that this distribution of primary responsibility in the cluster is not actually a single step operation; it only makes sense when you apply it to multiple databases. And that, as it turns out, gives us a lot more freedom.

We were able to implement the feature in the same day, and the cost of that was quite simple. Whenever the cluster detected that a database’s node has recovered, it would run the following command:

primary = topology[rand(topology.Length)];

This is the only thing that we need in order to get fair distribution of the data across the nodes.

The way it works is simple. As each database becomes up to date, the cluster will pick it up and assign a primary at random. Because we have multiple such databases, sheer chance ensures that they will go to different nodes (and thus rebalance the load across the cluster).

It’s simple, it works quite nicely and it is a lot less scary than having to implement a smart system. And in our testing, it worked like magic.

time to read 1 min | 145 words

I have been talking about memory and RavenDB a lot, and I thought that I would share the following image from one of our test runs:

[image: RavenDB’s memory usage inside the container]

This is RavenDB running in a container with 16MB of available memory.  This is when we are under (moderate) load:

[image: memory usage under load]

Note that the actual working set used by RavenDB is 2.28MB, and while the total allocations are higher than that, it is still quite reasonable in size.

In 1995, I got a new computer with a 133MHz CPU and 16 MB of RAM. It ran a full OS and apps (Win95, Netscape, Office, etc) and was quite impressive.

It is really interesting that we can run RavenDB on that constrained environment.

time to read 3 min | 536 words

After my podcast about RavenDB’s dev ops story, I was asked an interesting question by Remi:

…do you think it can work with non technical product (let's say banking app) where your user and your engineer are not in the same industry.

This is quite an interesting scenario. A line of business application is going to be composed of two separate planes. You have the technical plane, which is fairly standard and where you can get quite a lot of mileage from standard dev ops monitoring tools. For example, you probably don’t need the same level of diagnostics in a web app or a service backend as you need for a database engine. However, the business plane is just as interesting an area and can often benefit quite a bit from building business level diagnostics into the application.

If we take the example of a banking app, you might want to track things such as payment flow across various accounts. You may want to be able to get a view of a single user’s activities over time or simply have good visibility into various financial instruments.

I have run into several cases where I had to break down how loans work (interest, compounding, collateral, etc) for college educated people who were really quite smart, but didn’t pay attention to that part of life. Given that I consider loans to be one of the simplest financial instruments, building visibility into them can be of great help.

Still in the banking field, just the notion of taxation is freakishly complex. I have had a case where a customer in India was supposed to pay us 1,000 USD. They sent 857 USD (a bit of that was eaten by bank fees) and the rest we had to claim as a refund from our tax authorities, because the rest of the money was paid as taxes in India and the two countries are doing reconciliation. Given the inherent complexity involved, just being able to visualize, inspect and explain things is of enormous value.

Things like Know Your Customer and Anti Money Laundering are also quite complex and can put the system into a tailspin. I had a customer send us a payment, but the payment was stopped because the same customer had also paid (in a completely different transaction and to a different destination entirely) with funds that came from cryptocurrencies. Leaving aside the aggravation of such scenarios, I am actually impressed/scared that they are able to track such things so well.

I can’t really be upset with the bank, even. Laws and regulations are in place that have strict limits on how they can behave, including personal criminal liability and Should Have Known clauses. I can understand why they are cautious.

But at the same time, trying to untangle such a system is a lot like trying to debug a software system. And having the tools in place for the business expert to easily obtain and display the data is an absolute competitive advantage.

I have recently closed a bank account specifically because the level of service provided didn’t meet my expectations. Having better systems in place means that you can give better service, and that is worth quite a lot.
