Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

Posts: 7,038 | Comments: 49,743

filter by tags archive
time to read 1 min | 172 words

I’m really happy to announce that RavenDB Cloud has now deployed shared instances support.  These are full production systems, with three separate nodes deployed in separate availability zones for maximum availability.

Shared instances are meant to answer users that have light load but still want to ensure high availability at a low cost.

image

We currently offer PS10 (about 30$ / month) and PS20 (about 50$ month) plans in both AWS and Azure.

Shared plans are just that, you are getting shared compute resources. You are still fully isolated from other customers and have full isolation from other instances. However, because the hardware resources you are using are utilized by multiple customers, you can expect lower capacity.  In our tests, such instances were able to handle nicely light load and typical hobbyists applications with no issue. You can also freely scale up from shared plans to basic or higher at any time.

Give it a try.

time to read 4 min | 628 words

I posted about the @refresh feature in RavenDB, explaining why it is useful and how it can work. Now, I want to discuss a possible extension to this feature. It might be easier to show than to explain, so let’s take a look at the following document:

The idea is that in addition to the data inside the document, we also specify behaviors that will run at specified times. In this case, if the user is three days late in paying the rent, they’ll have a late fee tacked on. If enough time have passed, we’ll mark this payment as past due.

The basic idea is that in addition to just having a @refresh timer, you can also apply actions. And you may want to apply a set of actions, at different times. I think that the lease payment processing is a great example of the kind of use cases we envision for this feature. Note that when a payment is made, the code will need to clear the @refresh array, to avoid it being run on a completed payment.

The idea is that you can apply operations to the documents at a future time, automatically. This is a way to enhance your documents with behaviors and policies with ease. The idea is that you don’t need to setup your own code to execute this, you can simply let RavenDB handle it for you.

Some technical details:

  • RavenDB will take the time from the first item in the @refresh array. At the specified time, it will execute the script, passing it the document to be modified. The @refresh item we are executing will be removed from the array. And if there are additional items, the next one will be schedule for execution.
  • Only the first element in the @refresh array only. So if the items aren’t sorted by date, the first one will be executed and the persisted again. The next one (which was earlier than the first one) is already ready for execution, so will be run on the next tick.
  • Once all the items in the @refresh array has been processed, RavenDB will remove the @refresh metadata property.
  • Modifications to the document because of the execution of @refresh scripts are going to be handled as normal writes. It is just that they are executed by RavenDB directly. In other words, features such as optimistic concurrency, revisions and conflicts are all going to apply normally.
  • If any of the scripts cause an error to be raised, the following will happen:
    • RavenDB will not process any future scripts for this document.
    • The full error information will be saved into the document with the @error property on the failing script.
    • An alert will be raised for the operations team to investigate.
  • The scripts can do anything that a patch script can do. In other words, you can put(), load(), del() documents in here.
  • We’ll also provide a debugger experience for this in the Studio, naturally.
  • Amusingly enough, the script is able to modify the document, which obviously include the @refresh metadata property. I’m sure you can imagine some interesting possibilities for this.

We also considered another option (look at the Script property):

The idea is that instead of specifying the script to run inline, we can reference a property on a document. The advantage being is that we can apply changes globally much easily. We can fix a bug in the script once. The disadvantage here is that you may be modifying a script for new values, but not accounting for the old documents that may be referencing it. I’m still in two minds about whatever we should allow a script reference like this.

This is still an idea, but I would like to solicit your feedback on it, because I think that this can add quite a bit of power to RavenDB.

time to read 3 min | 415 words

imageI run into this article that talks about building a cache service in Go to handle millions of entries. Go ahead and read the article, there is also an associated project on GitHub.

I don’t get it. Rather, I don’t get the need here.

The authors seem to want to have a way to store a lot of data (for a given value of lots) that is accessible over REST.  The need to be able to run 5,000 – 10,000 requests per second over this. And also be able to expire things.

I decided to take a look into what it would take to run this in RavenDB. It is pretty late here, so I was lazy. I run the following command against our live-test instance:

image

This say to create 1,024 connections and get the same document. On the right you can see the live-test machine stats while this was running. It peaked at about 80% CPU. I should note that the live-test instance is pretty much the cheapest one that we could get away with, and it is far from me.

Ping time from my laptop to the live-test is around 230 – 250 ms. Right around the numbers that wrk is reporting. I’m using 1,024 connections here to compensate for the distance. What happens when I’m running this locally, without the huge distance?

image

So I can do more than 22,000 requests per second (on a 2016 era laptop, mind) with max latency of 5.5 ms (which the original article called for average time). Granted, I’m simplifying things here, because I’m checking a single document and not including writes. But 5,000 – 10,000 requests per second are small numbers for RavenDB. Very easily achievable.

RavenDB even has the @expires feature, which allows you to specify a time a document will automatically be removed.

The nice thing about using RavenDB for this sort of feature is that millions of objects and gigabytes of data are not something that are of particular concern for us. Raise that by an orders of magnitude, and that is our standard benchmark. You’ll need to raise it by a few more orders of magnitudes before we start taking things seriously.

time to read 5 min | 822 words

This post asked an interesting question, why are hash table so prevalent for in memory usage and (relatively) rare in the case of databases. There is some good points in the post, as well as in the Hacker News thread.

Given that I just did a spike of persistent hash table and have been working on database engines for the past decade, I thought that I might throw my own two cents into the ring.

B+Tree is a profoundly simple concept. You can explain it in 30 minutes, and it make sense. There are some tricky bits to a proper implementation, for sure, but they are more related to performance than correctness.

Hash tables sounds simple, but the moment you have to handle collisions gracefully, you are going to run into real challenges. It is easy to get into nasty bugs with hash tables, the kind that silently corrupt your state without you realizing it.

For example, consider the following code:

This is a hash table using linear addressing. Collisions are handled by adding them to the next available node. And in this case, we have a problem. We want to put “ghi” in position zero, but we can’t, because it is already full. We move it to the first available location. That is well understood and easy. But when we delete “def”, we remove the entry from the array, but we forgot to do fixups for the relocated “ghi”, that value is now gone from the table, effectively. This is the kind of bug you need the moon to be in a certain position while a cat sneeze to figure out.

A B+Tree also maps very nicely to persistent model, but it is entirely non obvious how you can go from the notion of a hash table in memory to one on disk. Extendible hashing exists, and has for a very long time. Literally for more time than I’m alive, but it is not very well known / generically used. It is a beautiful algorithm, mind you. But just mapping the concept to a persistence model isn’t enough, typically, you also had a bunch of additional requirements from disk data structure. In particular, concurrency in database systems is frequently tied closely to the structure of the tree (page level locks).

There is also the cost issue. When talking about disk based data access, we are rarely interested in the actual O(N) complexity, we are far more interested in the number of disk seeks that are involved. Using extendible hashing, you’ll typically get 1 – 2 disk seeks. If the directory is in memory, you have only one, which is great. But with a B+Tree, you can easily make sure that the top levels of the tree will also be memory resident (similar to the extendible hash directory), that leads to typical 1 disk access to read the data, so in many cases, they are roughly the same performance for either option.

Related to the cost issue, you have to also consider security risks. There have been a number of attacks against hash tables that relied on generating hash collisions. The typical in memory fix is to randomize the hash to avoid this, but if you are persistent, you have to use the same hash function forever. That means that an attacker can very easily kill your database server, by generating bad keys.

But these are all relatively minor concerns. The key issue is that B+Tree is just so much more useful. A B+Tree can allow me to:

  • Store / retrieve my data by key
  • Perform range queries
  • Index using a single column
  • Index using multiple columns (and then search based on full / partial key)
  • Iterate over the data in specified order

Hashes allow me to:

  • Store / retrieve my data by key

And that is pretty much it. So B+Tree can do everything that Hashes can, but also so much more. They are typically as fast where it matters (disk reads) and more than sufficiently fast regardless.

Hashes are only good for that one particular scenario of doing lookup by exact key. That is actually a lot more limited than what you’ll consider.

Finally, and quite important, you have to consider the fact that B+Tree has certain access patterns that they excel at. For example, inserting sorted data into a B+Tree is going to be a joy. Scanning the B+Tree in order is also trivial and highly performant.

With hashes? There isn’t an optimal access pattern for inserting data into a hash. And while you can scan a hash at roughly the same cost as you would a B+Tree, you are going to get the data out of order. That means that it is a lot less useful than it would appear to upfront.

All of that said, hashes are still widely used in databases. But they tend to be used as specialty tools. Deployed carefully and for very specific tasks. This isn’t the first thing that you’ll reach to, you need to justify its use.

time to read 8 min | 1528 words

A common question I field on many customer inquiries is comparing RavenDB to one relational database or another. Recently we got a whole spate of questions on RavenDB vs. PostgreSQL and I though that it might be easier to just post about it here rather than answer the question each time. Some of the differences are general, for all or most relational databases, but I’m also going to touch on specific features of PostgreSQL that matter for this comparison.

The aim of this post is to provide highlights to the differences between RavenDB and PostgreSQL, not to be an in depth comparison of the two.

PostgreSQL is a relational database, storing the data using tables, columns and rows. The tabular model is quite entrenched here, although PostgreSQL has the notion of JSON columns. 

RavenDB is a document database, which store JSON documents natively. These documents are schema-free and allow arbitrarily complex structure.

The first key point that distinguish these databases is with the overall mindset. RavenDB is meant to be a database for OLTP systems (business applications) and has been designed explicitly for this. PostgreSQL is trying to achieve both OLTP and OLAP scenarios and tends to place a lot more emphasis on the role of the administrator and operations teams. For example, PostgreSQL requires VACUUM, statistics refresh, manual index creation, etc. RavenDB, on the other hand, it design to run in a fully automated fashion. There isn’t any action that an administrator needs to take (or schedule) to ensure that RavenDB will run properly.

RavenDB is also capable of configuring itself dynamically, adjusting to the real world load it has based on feedback from the operational environment. For example, the more queries a particular index has, the more resources it will be granted by RavenDB. Another example is how RavenDB processes queries in general. Its query analyzer will run through the incoming queries and figure out what is the best set of indexes that you need to answer them. RavenDB will then go ahead and create these indexes on the fly. Such an action tends to be scary for users coming from relational databases, but RavenDB was designed upfront for specifically these scenarios. It is able to build the new indexes without adding too much load to the server and without taking any locks. Other tasks that are typically handled by the DBA, such as configuring the system, are handled dynamically by RavenDB based on actual operational behavior. RavenDB will also cleanup superfluous indexes and reduce the resources available for indexes that aren’t in common use. All of that without a DBA to perform acts of arcane magic.

Another major difference between the databases is the notion of schema. PostgreSQL requires you to define your schema upfront and adhere to that. The fact that you can use JSON at times to store data provides an important escape hatch, but while PostgreSQL allows most operations on JSON data (including indexing them), it is unable to collect statistics information on such data, leading to slower queries. RavenDB uses a schema-free model, documents are grouped into collections (similar to tables, but without the constraint of having the same schema), but have no fixed schema. Two documents at the same collection can have distinct structure.  Typical projects using JSON columns in PostgreSQL will tend to pull specific columns from the JSON to the table itself, to allow for better integration with the query engine. Nevertheless, PostgreSQL’s ability to handle both relational and document data gives it a lot of brownie points and enable a lot of sophisticated scenarios for users.

RavenDB, on the other hand, is a pure JSON database, which natively understand JSON. It means that the querying language is much nicer for querying that involve JSON and comparable for queries that don’t have a dynamic structure. In addition to being able to query the JSON data, RavenDB also allows you to run aggregation using Map/Reduce indexes. These are similar to materialized views in PostgreSQL, but unlike those, RavenDB is going to update the indexes automatically and incrementally. That means that you can query on large scale aggregation in microseconds, regardless of data sizes.

For complex queries, that touch on relationships between pieces of data, we have very different behaviors. If the relations inside PostgreSQL are stored as columns and using foreign keys, it is going to be efficient to deal with them. However, if the data is dynamic or complex, you’ll want to put it in a JSON column. At this point, the cost of joining relations skyrockets for most data sets. RavenDB, on the other hand, allow you to follow relationships between documents naturally, at indexing or querying time. For more complex relationships work, RavenDB also has graph querying which allow you to run complex queries on the shape of your data.

I mentioned before that RavenDB was designed explicitly for business applications, that means that it has a much better feature set around their use case. Consider the humble Customers page, which needs to show the Customers details, Recent Orders (and their total), Recent Support Calls, etc.

When querying PostgreSQL, you’ll need to make multiple queries to fetch this information. That means that you’ll have to deal with multiple network roundtrips, which in many cases can be the most expensive piece of actually querying the database. RavenDB, on the other hand, has the Lazy feature, which allow you to combine multiple separate queries into a single network roundtrip. This seemingly simple feature can have a massive impact on your overall performance.

A similar feature is related to the includes feature. It is very common when you load one piece of data that you want to get related information. With RavenDB, you can indicate that to the database engine, which will send you all the results in one shot. With a relational database, you can use a join (with the impact on the shape of the results, Cartesian products issue and possible performance impact) or issue multiple queries. Simple change, but significant improvement over the alternative.

RavenDB is a distributed database by default while PostgreSQL is a single node by default. There exists features and options (log shipping, logical replication, etc), which allow PostgreSQL to run as a cluster, but they tend to be non trivial to setup, configure and maintain. With RavenDB, even if you are running a single node, you are actually running a cluster. And when you have multiple nodes, it is trivial to join them into a cluster, from which point on, you can just manage everything as usual. Features such as multi-master, the ability for disconnected work and widely distributed clusters are native parts of RavenDB and integrate seamlessly, while they tend to be of the “some assembly required” in PostgreSQL.

The two databases are very different from one another and tend to be used for separate purposes. RavenDB is meant to be the application database, it excels in being the backend of OTLP systems and focus on that to the exclusion of all else. PostgreSQL tend to be more general, suitable for dynamic queries, reports and exploration as well as OLTP scenarios. It may not be a fair comparison, but I have literally built RavenDB specifically to be better than a relational database for the common needs of business applications, and ten years in, I think it still shows significant advantages in that area.

Finally, let’s talk about performance. RavenDB was designed based on failures in the relational model. I spent years as a database performance consultant, going from customer to customer fixing the same underlying issues. When RavenDB was designed, we took that to account. The paper OLTP – Through the Looking Glass, and What We Found There has some really interesting information. Including the issue of about 30% of a relational database performance is spent on locking.

RavenDB is using MVCC, just like PostgreSQL. Unlike PostgreSQL, RavenDB doesn’t need to deal with transaction id wraparound, VACUUM costs, etc. Instead, we maintain MVCC not on the row level, but on the page level. There is a lot less locks to manage and deal with and far less complexity internally. This means that read transactions are completely lock free and don’t have to do any coordination with write transactions. That has an impact on performance and RavenDB can routinely manage to achieve benchmark numbers on commodity hardware that are usually reserved for expensive benchmark machines.

One of our developers got a new machine recently and did some light benchmarking. Running in WSL (Ubuntu on Windows), RavenDB was able to exceed 115,000 writes / sec and 275,000 reads / sec. Hare the specs:

image

And let’s be honest, we weren’t really trying hard here, but we still got nice numbers. A lot of that is by designing how we interact internally to have a much simpler architecture and shape, and it shows. And the nice thing is that these advantages are cumulative. RavenDB is fast, but you also gain the benefits of the protocol allowing you to issue multiple queries in a single roundtrip, the ability to include additional results and dynamically adjusting to the operation environment.

It Just Works.

time to read 6 min | 1023 words

Following my posts about search, I wanted to narrow my focus a bit and look into the details of implementing a persistent data structure inside Voron.

Voron is RavenDB’s storage engine and forms the lowest layers of RavenDB. It is responsible for speed, safety, transactions and much more. It is also a very low level piece of code, which has a lot of impact on the design and implementation.

Some of the things that we worry about when worrying Voron code are:

  • Performance – reduce computation / allocations (ideally to zero) for writes.
  • Zero copies – no cost for reads.
  • Safety – concurrent transactions can operate without interfering with one another.
  • Applicability – we tend to implement low level features that enable us to do a lot more on the higher tiers of the code.
  • Scale – handling data that may be very large, millions and billions of results.

In this case, I want to look into what it would take to implement a persistent set. If I was working in memory, I would be using Set<Int64>, but when using a persistent data structure, things are more interesting. The set we use will simply record Int64 values. This is important for a bunch of reasons.

First, Int64 is big, such values are used as file pointers, artificial ids, etc. Even though it seems limiting, we can get a lot more functionality than expected.

Second, if we are using a set of Int64, we can implement that using a bitmap. A set value indicate that the value is in the set, which allows us to do set union, intersection and exclusion cheaply. The only problem here is that a bitmap with Int64 values is… a problem. Imagine that I have the following code:

set.Add(82_100_447_308);

We would need to use 76GB(!) of memory to hold a bitmap for this set. That is obviously not going to be a workable solution for us. Luckily, there are other alternatives. Roaring Bitmaps are efficient in both time and space, so that is great. We just need to have an implementation that can work with a persistent model.

In order to understand how I’m going to go about implementing this feature, you need to understand how Voron is built. Voron is composed of several layers, the paging layer, which managed transactions and ACID and the data structure layer, which managed B+Trees, tables, etc.

In this case, we are implementing something at the data structure layer. And the first hurdle to jump through is decide how the data should look like. On the fact of it, this is a fairly simple decision, most of the decisions has already been made and outline in the previous post. We are going to have a sorted array of segment metadata, which will host individual segments with the set bits. This works if we have a single set, but in our case, we expect lots.

If we are going to use this for storing the posting lists, we have to deal with the following scenarios (talking about the specific documents matching the terms in the index):

  1. Many such lists that have a single item (unique id, date, etc)
  2. Lots of lists that have just a few values (Customer’s field in an order, for example)
  3. Few lists that have many values ( OrderCompleted: true, for example, can be safely expected to be about 99% of the total results)
  4. Many lists that have moderate amount of values (Each of the Tags options , for example)

That means that we have to think very carefully about each scenario. The third and forth options are relatively similar and can probably be best served by the roaring bitmap that we discussed. But what about the first two?

To answer that, we need to compute the metadata required to maintain the roaring set. At a minimum, we are going to have one SegmentMetadata involved, but we’ll also need an offset for that segment’s data, so that means that the minimum size involved has got to be 16 bytes (SegmentMetadata is 8 bytes, and a file offset is the same). There is also some overhead to store these values, which is 4 bytes each. So to store a single value using roaring set we’ll need:

  • 16 bytes for the segment metadata and actual segment’s offset
  • 4 bytes storage metadata for the previous line’s data
  • 2 bytes (single short value) to mark the single flipped bit
  • 4 bytes storage metadata for the segment itself

In short, we are getting to 26 bytes overhead if we just stored everything as a roaring set. Instead of doing that, we are going to try to do better and optimize as much as possible the first two options (unique id and very few matches). We’ll set a limit of 28 bytes (which, together with the 4 bytes storage metadata will round up to nice 32 bytes). Up to that limit, we’ll simple store the document ids we have as delta encoded varint.

Let’s say that we need to store the following document id lists:

List

Encoding

[12394]

[234, 96]

[319333, 340981,342812]

[229, 190, 19, 144, 169, 1, 167, 14]

You can see that the first list, which is 8 bytes in size, we encoded using merely 2 bytes. The second list, composed of three 8 bytes values (24 bytes) was encoded to merely 8 bytes. Without delta encoding, that value would be decoded to: [229, 190, 19, 245, 231, 20, 156, 246, 20], an additional byte. This is because we substract from each number the previous one, hopefully allowing to pack the value in a much more compact manner.

With a size limit of 28 bytes, we can pack quite a few ids in the list. In my experiments, I could pack up to 20 document ids (so 160 bytes, without encoding) into that space with realistic scenario. Of course, we may get a bad pattern, but that would simply mean that we have to build the roaring set itself.

I’m going to go ahead and do just that, and then write a post about the interesting tidbits of the code that I’ll encounter along the way.

time to read 5 min | 843 words

In the previous posts in this series, I explored a bit how to generate a full text index on top of the Enron data set. In particular, we looked at (rudimentary) analysis of text in the first post and looked into posting lists (list of matching documents for specific terms) in the second one. It occurred to me that we need to actually have a much better understanding of the kind of requirements that we have from posting lists in general, so let’s look at them, shall we?

  • Add to the list (increasing numbers only).
  • Iterate the list (all, or from starting point).
  • Reduce disk space and memory utilization as much as possible.

The fact that I want to be able to add to the list is interesting. The typical use case in full text search is to generate the full blown posting list from scratch every time. The typical model is to use LSM (Log Structure Merge) and take advantage on the fact that we are dealing with sorted list to merge them cheaply.

Iterating the list is something you’ll frequently do, to find all the matches or to merge two separate lists. Here is the kind of API that I initially had in mind:

As you can see, there isn’t much there, which is intentional. I initially thought about using this an the baseline of a couple of test implementations using StreamVByte, FastPFor as well as Gorrilla compression. The problem is that there is the need to balance compression ratio with the cost of actually going over the list. Given that my test cases showed a big benefit for using Roaring Bitmaps, I decided to look at them first and see what I can get out of it.

RoaringBitamps is a way to store (efficiently) a set of bits, they are very widely used in the industry. The default implementation is also entirely suitable for my purposes. Mostly because they make use of managed memory, and a hard requirement that I have placed on this series is that I want to be able to use persistent memory. In other words, I want to be able to write the data out, then be able to do everything on top of memory mapped data, without having to parse it.

Roaring Bitmaps works in the following manner. Each 64K range of integers is divided into each own 8KB segments. Given that I’m using Voron as a persistence library, these numbers don’t work for my needs. Voron uses an 8KB page size, so we’ll drop these numbers by half. Each range will be 32K of integers and take a maximum of 4KB of disk space. This allows me to store it much more efficiently inside of Voron. Each segment, in turn, has a type. The types can be either:

  • Array – if the number of set bits in the segment is less than 2048, the data will use a simple sorted array implementation, with each value taking 2 bytes.
  • Bitmap – if the number of set bits in the segment is between 2048 and 30,720, the segment will use a total of 4096 bytes and be a standard bitmap.
  • Reversed array – if the number of set bits in the segment is higher than 30,720, we’ll store in the segment the unset bits as a sorted array.

This gives us quite a few advantages:

  • It is straightforward to build this incrementally (remember that we only ever add items in the end).
  • It is quite efficient in terms of space saving in the case of sparse / busy usage.
  • It is cheap (computationally) to work with and process.
  • It is very simple to use from memory mapped file without having to parse / create managed objects.

The one thing that we still need to take into account is how to deal with the segment metadata. How do we know what segment belong to what range. In order to handle that, we’ll define the following:

The idea is that we need to store two important pieces of information. The start location (is always going to be a multiple of 32K) and the number of set bits (which has a maximum of 32K). Therefor, we can pack all of them into a single int64. The struct is merely there for convenience.

In other words, in addition to the segments with the actual set bits, we are also going to have an array of all the segment’s metadata. In practice, we’ll also need another value here, the actual location of the segment’s data, but that is merely another int64, so that is still very reasonable.

As this is currently a mere exercise, I’m going to skip actually building the implementation, but it seems like it should be a fairly straightforward approach. I might do another post about how to actually implement this feature on Voron, because it is interesting. But I think that this is already long enough.

We still have another aspect to consider. So far, we talked only about the posting lists, but we also need to discuss the terms. But that is a topic for the next post in the series.

time to read 3 min | 551 words

In full text search terminology, a posting list is just a list of document ids. These are used to store and find matches for particular terms in the index.

I took the code from the previous post and asked it to give me the top 50 most frequent terms in the dataset and their posting lists. The biggest list had over 200,000 documents, and I intentionally use multiple threads to build things, so the actual list is going to be random from run to run (which adds a little more real-worldedness to the system*).

*Yes, I invented that term. It make sense, so I’m sticking with it.

I took those posting lists and just dumped them to a file, in the simplest possible format. Here are the resulting files:

image

There are a few things to note here. As you can see, the file name is the actual term in the index, the contents of the file is a sorted list of int64 of the document ids (as 8 bytes little endian values).

I’m using int64 here because Lucene uses int32 and thus has the ~2.1 billion document limit, which I want to avoid. It also make it more fun to work with the data, because of the extra challenge.  The file sizes seems small, but the from file contains over 250,000 entries.

When dealing with posting lists, size matter, a lot. So let’s see what it would take to reduce the size here, shall we?

image

Simply zipping the file gives us a massive space reduction, so there is a lot left on the table, which is great.

Actually, I might have skipped a few steps:

  • Posting lists are sorted, because it helps do things like union / intersect queries.
  • Posting lists are typically only added to.
  • Removal are handled separately, with a merge step to clean this up eventually.

Because the value is sorted, the first thing I tried was to use a diff model with variable sized int. Here is the core code:

Nothing really that interesting, I have to admit, but it did cut the size of the file to 242KB, which is nice (and better than ZIP). Variable sized integers are used heavily by Lucene, so I’m very familiar with them. But there are other alternatives.

  • StreamVByte is a new one, with some impressive perf numbers, but only gets us to 282 KB (but it is possible / likely that my implementation of the code is bad).
  • FastPFor compresses the (diffed) data down to 108KB.
  • RoaringBitmap gives us a total of 64KB.

There are other methods, but they tend to go to the esoteric and not something that I can very quickly test directly.

It is important to note that there are several separate constraints here:

  • Final size on disk
  • Computational cost to generate that final format
  • Computation cost to go from the final format to the original values
  • How much (managed) memory is required during this process

That is enough for now, I believe. My next post will deal delve into the actual semantics that we need to implement to get a good behavior from the system. This is likely going to be quite interesting.

time to read 4 min | 645 words

Full text search is a really interesting topic, which I have been dipping my toes into again and again over the years. It is a rich area of research, and there has been quite a few papers, books and articles about the topic. I read a bunch of projects for doing full text search, and I have been using Lucene for a while.

I thought that I would write some code to play with full text search and see where that takes me. This is a side project, and I hope it will be an interesting one. The first thing that I need to do is to define the scope of work:

  • Be able to (eventually) do full text search queries
  • Compare and contrast different persistence strategies for this
  • Be able to work with multiple fields

What I don’t care about: Analysis process, actually implementing complex queries (I do want to have the foundation for them), etc.

Given that I want to work with real data, I went and got the Enron dataset. That is over 517,000 emails from Enron totaling more than 2.2 GB. This is one of the more commonly used test datasets for full text search, so that is helpful. The first thing that we need to do is to get the data into a shape that we can do something about it.

Enron is basically a set of MIME encoded files, so I’ve used MimeKit to speed the parsing process. Here is the code of the algorithm I’m using for getting the relevant data for the system. Here is the relevant bits:

As you can see, this is hardly a sophisticated approach. We are spawning a bunch of threads, processing all half million emails in parallel, select a few key fields and do some very basic text processing. The idea is that we want to get to the point where we have enough information to do full text search, but without going through the real pipeline that this would take.

Here is an example of the output of one of those dictionaries:

As you can see, this is bare bones (I forgot to index the Subject, for example), but on my laptop (8 cores Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz) with 16 GB of RAM, we can index this amount of data in under a minute and a half.

So far, so good, but this doesn’t actually gets us anywhere, we need to construct an inverted index, so we can ask questions about the data and be able to find stuff out. We are already about half way there, which is encouraging. Let’s see how far we can stretch the “simplest thing that could possibly work”… shall we?

Here is the key data structures:

Basically, we have an array of fields, each of which holds a dictionary from each of the terms and a list of documents for the terms.

For the full code for this stage, look at the following link, it’s less than 150 lines of code.

Indexing the full Enron data set now takes 1 minute, 17 seconds, and takes 2.5 GB in managed memory.

The key is that with this in place, if I want to search for documents that contains the term: “XML”, for example, I can do this quite cheaply. Here is how I can “search” over half a million documents to get all those that have the term HTML in them:

image

As you can imagine, this is actually quite fast.

That is enough for now, I want to start actually exploring persistence options now.

The final code bits are here, I ended up implementing stop words as well, so this is a really cool way to show off how you can do full text search in under 200 lines of code..

time to read 8 min | 1502 words

One of the silent features of people moving to the cloud is that it make it the relation between performance and $$$ costs evident. In your on data center, it is easy to lose track between the yearly DC budget and the application performance. But on the cloud, you can track it quite easily.

The immediately lead people to try to optimize their costs and use the minimal amount of resources that they can get away with. This is great for people who believe in efficient software.

One of the results of the focus on price and performance has been the introduction of burstable cloud instances. These instances allow you to host machines that do not need the full machine resources to run effectively. There are many systems whose normal runtime cost is minimal, with only occasional spikes. In AWS, you have the T series and Azure has the B series. Given how often RavenDB is deployed on the cloud, it shouldn’t surprise you that we are frequently being run on such instances. The cost savings can be quite attractive, around 15% in most cases. And in the case of small instances, that can be even more impressive.

RavenDB can run quite nicely on a Raspberry PI, or a resource starved container, and for many workloads, that make sense. Looking at AWS in this case, consider the t3a.medium instance with 2 cores and 4 GB at 27.4$ / month vs. a1.large (the smallest non burstable instance) with the same spec (but ARM machine) at 37.2$ per month. For that matter, a t3a.small with 2 cores and 2 GB of memory is 13.7$. As you can see, the cost savings adds up, and it make a lot of sense to want to use the most cost effective solution.

Enter the problem with burstable instances. They are bursty. That means that you have two options when you need more resources. Either you end up with throttling (reducing the amount of CPU that you can use) or you are being charged for the additional extra CPU power you used.

Our own production systems, running all of our sites and backend systems are running on a cluster of 3 t3.large instances. As I mentioned, the cost savings are significant. But what happens when you go above the limit? In our production systems, for the past 6 months, we have never had an instance where RavenDB used more than the provided burstable performance. It helps that the burstable machines allows us to accrue CPU time when we aren’t using it, but overall it means that we are doing a good job of  handling requests efficiently. Here are some metrics from one of the nodes in the cluster during a somewhat slow period (the range is usually 20 – 200 requests / sec).

image

So we are pretty good in terms of latency and resource utilizations, that’s great.

However, the immediately response to seeing that we aren’t hitting the system limits is… to reduce the machine size again, to pay even less. There is a lot of material on cost optimization in the cloud that you can read, but that isn’t the point of this post. One of the more interesting choices you can make with burstable instances is to ask to not go over the limit. If your system is using too much CPU, just take it away until it is done. Some workloads are just fine for this, because there is no urgency. Some workloads, such as servicing requests, are less kind to this sort of setup. Regardless, this is a common enough deployment model that we have to take it into account.

Let’s consider the worst case scenario from our perspective. We have a user that runs the cheapest possible instance a t3a.nano with 2 cores and 512 MB of RAM costing just 3.4$ a month. The caveat with this instance is that you have just 6 CPU credits / hour to your name. A CPU credit is basically 100% CPU for 1 minute. Another way of looking at this is that t t3a.nano instance has a budget of 360 CPU seconds per hour. If it uses more than that, it is charged a hefty fee (about ten times per hour than the machine cost). So we have users that disable the ability to go over the budget.

Now, let’s consider what is going to happen to such an instance when it hits an unexpected load. In the context of RavenDB, it can be that you created a new index on a big collection. But something seemingly simple such as generating a client certificate can also have a devastating impact on such an instance.RavenDB generates client certificates with 4096 bits keys. On my machine (nice powerful dev machine), that can take anything from 300 – 900 ms of CPU time and cause a noticeable spike in one of the cores:

image

On the nano machine, we have measured key creation time of over six minutes.

The answer isn’t that my machine is 800 times more powerful than the cloud machine. The problem is that this takes enough CPU time to consume all the available credits, which cause throttling. At this point, we are in a sad state. Any additional CPU credits (and time) that we earn goes directly to the already existing computation. That means that any other CPU time is pushed back. Way back. This happens at a level below the operating system (at the hypervisor), so there it isn’t even aware of it. What is happening from the point of view of the OS is that the CPU is suddenly much slower.

All of this make sense and expected given the burstable nature of the system. But what is the observed behavior?

Well, the instance appears to be frozen. All the cloud metrics will tell you that everything is green, but you can’t access the system (no CPU to accept new connections) you can’t SSH into it (same) and if you have an existing SSH connection, it appears to have frozen. Measuring performance from inside the machine shows that everything is cool. One CPU is busy (expected, generating the certificate, doing indexing, etc). Another is idle. But the system behaves as if it has no CPU available, which is exactly what is going on, except in a way that we can’t tell.

RavenDB goes to a lot of trouble to be a nice citizen and a good neighbor. This includes scheduling operations so the underlying OS will always have the required resources to handle additional load. For example, background tasks are run with lowered priority, etc. But when running on a burstable machine that is being throttled… well, from the outside it looked like certain trivial operations would just take the entire machine down and it wouldn’t be recoverable short of hard reboot.

Even when you know what is going on, it doesn’t really help you. Because from inside the machine, there is no way to access the cloud metrics in a good enough precision to take action.

We have a pretty strong desire to not get into these kind of situation, so we implemented what I can only refer to as counter measures. When you are running on a burstable instance, you can let RavenDB know what is your CPU credits situation, at which point RavenDB will actively monitor the machine and compute its own tally of the outstanding CPU credits situation. When we notice that the CPU credits are running short, we’ll start pro-actively halting internal processes to give the machine more space to recover from the CPU credits shortage.

We’ll move ETL processes and ongoing backups to other nodes in the cluster and even pause indexing in order to recover the CPU time we need. Here is an example of what you’ll see in such a case:

image

One of the nice things about the kind of promises RavenDB make about its indexes is that it is able to make these kind of decisions without impacting the guarantees that we provide.

If the situation is still bad, we’ll start proactively rejecting requests. The idea is that instead of getting of getting to the point where we are being throttled (and effectively down to the world), we’ll process a request by simply sending 503 Service Unavailable response back. This is going to be very cheap to do, so won’t put additional strain on our limited budget. At the same time, the RavenDB’s client treat this kind of error as a trigger for a failover and will use another node to service this request.

This was a pretty long post to explain that RavenDB is going to give you a much nicer experience on burstable machines even if you don’t have bursting capabilities.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Webinar recording (7):
    02 Jul 2020 - Practical indexing with RavenDB
  2. RavenDB Webinar (3):
    01 Jun 2020 - Polymorphism at Scale
  3. Podcast (2):
    28 May 2020 - Adventures in .NET High performance databases with RavenDB with Oren Eini
  4. Talk (5):
    23 Apr 2020 - Advanced indexing with RavenDB
  5. Challenge (57):
    21 Apr 2020 - Generate matching shard id–answer
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats