Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 3 min | 537 words

At the beginning of the year, we ran into a problematic query. The issue was the use of an IN clause vs. a series of ORs. You can see the previous investigation results here. We were able to pinpoint the issue pretty well, very deep in the guts of Lucene, our query engine.

Fast query (OR): 1 – 2 ms
Slow query (IN): 60 – 90 ms

The key issue for this query was simple. There are over 600,000 orders with the relevant statuses, but there are no orders for CustomerId “customers/100”. In the OR case, we would evaluate the query lazily. First checking the CustomerId, and given that there have been no results, short circuiting the process and doing no real work for the rest of the query. The IN query, on the other hand, would do things eagerly. That would mean that it would build a data structure that would hold all 600K+ documents that match the query, and then would throw that all away because no one actually needed that.
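The exact queries aren’t reproduced above, but a minimal sketch of the two shapes, written against the RavenDB LINQ client, looks something like this (the Order class and the status values here are illustrative, not the real model):

using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;

public class Order
{
    public string CustomerId { get; set; }
    public string Status { get; set; }
}

public static class OrVsInExample
{
    public static void Run(IDocumentStore store)
    {
        using var session = store.OpenSession();

        // OR form: each clause is evaluated lazily, so the empty CustomerId
        // posting list short-circuits the whole query.
        List<Order> orResults = session.Query<Order>()
            .Where(o => o.CustomerId == "customers/100" &&
                        (o.Status == "Pending" || o.Status == "Shipped" || o.Status == "Cancelled"))
            .ToList();

        // IN form: previously this materialized every document matching the
        // statuses up front, even though the CustomerId clause yields nothing.
        List<Order> inResults = session.Query<Order>()
            .Where(o => o.CustomerId == "customers/100" &&
                        o.Status.In("Pending", "Shipped", "Cancelled"))
            .ToList();
    }
}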

In order to resolve that, I have to explain a bit about the internals of Lucene. At its core, you can think of Lucene in terms of sorted lists inside dictionaries. I wrote a series of posts on the topic, but the gist of it is:
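The illustration from that series isn’t reproduced here, but conceptually it boils down to something like this simplified sketch, with made-up terms and ids:

using System;
using System.Collections.Generic;

// Each term maps to a sorted list of the ids of the documents containing it.
var postings = new Dictionary<string, int[]>
{
    ["Status:Pending"]           = new[] { 2, 7, 19, 433 /* ... 600K+ more */ },
    ["CustomerId:customers/100"] = Array.Empty<int>(),    // no matching orders
};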


Note that the ids of documents containing a particular term are sorted. That is important for a lot of optimizations in Lucene, and it is also at the heart of the problem with the IN query. The problem is that each component in the query pipeline needs to maintain this invariant. But when we use an IN query, we need to go over potentially many terms, and then we need to get the results in the proper order to the calling code. I implemented a tiered approach. If we are using an IN clause with a small number of terms (under 128), we will use a heap to manage all the terms and effectively do a merge sort on the results.
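To make the idea concrete, here is a sketch of merging the sorted posting lists of each term with a min-heap. This is an illustration of the approach, not RavenDB’s actual code, and PriorityQueue<TElement, TPriority> requires .NET 6 or later:

using System.Collections.Generic;

public static class PostingListMerge
{
    // Yields the union of the given sorted posting lists, in ascending order,
    // skipping duplicates across terms.
    public static IEnumerable<int> MergeSorted(IReadOnlyList<int[]> postingLists)
    {
        var heap = new PriorityQueue<(int listIdx, int pos), int>();

        for (int i = 0; i < postingLists.Count; i++)
            if (postingLists[i].Length > 0)
                heap.Enqueue((i, 0), postingLists[i][0]);

        int last = -1;
        while (heap.TryDequeue(out var cursor, out int docId))
        {
            if (docId != last)
            {
                yield return docId;
                last = docId;
            }

            int next = cursor.pos + 1;
            if (next < postingLists[cursor.listIdx].Length)
                heap.Enqueue((cursor.listIdx, next), postingLists[cursor.listIdx][next]);
        }
    }
}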

When we have more than 128 terms, that stops being very useful, however. Instead, we’ll create a bitmap for the possible results and scan through all the terms, filling the bitmap. That can be expensive, of course, so I made sure that this is done lazily by RavenDB.
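A sketch of that second strategy (again, an illustration rather than the real implementation) is just a matter of setting a bit for every document id we run into, and scanning the bitmap afterwards:

using System.Collections.Generic;

public static class BitmapUnion
{
    // Marks every document id from every posting list in a single bitmap.
    public static ulong[] Build(IEnumerable<int[]> postingLists, int maxDocId)
    {
        var bitmap = new ulong[(maxDocId + 63) / 64];

        foreach (int[] docs in postingLists)
            foreach (int docId in docs)
                bitmap[docId >> 6] |= 1UL << (docId & 63);

        return bitmap;
    }
}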

The results are in:

                     OR Query          IN Query
Invalid CustomerId   1.39 – 1.5 ms     1.33 – 1.44 ms
Valid CustomerId     17.5 ms           12.3 ms

For the first case, this is now pretty much a wash. The numbers are slightly in favor of the IN query, but it is within the measurement fluctuations.

For the second case, however, there is a huge performance improvement for the IN query. For that matter, the difference is going to be more noticeable the more terms you have in the IN query.

I’m really happy about this optimization; it ended up being quite elegant.

time to read 2 min | 333 words

I had a task for which I needed to track a union of document ids and then iterate over them in order. It is actually easier to explain in code than in words. Here is the rough API:
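The original code block isn’t reproduced here, so this is a hypothetical sketch of the shape of that API; the class and member names are my own, not the real ones:

using System.Collections.Generic;

public class IdUnion
{
    // maxId bounds the valid range [0 .. maxId); each stream may contain any
    // number of ids in that range, possibly overlapping the other streams.
    public IdUnion(int maxId, IEnumerable<IEnumerable<int>> idStreams)
    {
        // ...
    }

    // Was this id present in any of the streams?
    public bool Contains(int id)
    {
        // ...
        return false;
    }

    // Iterate over all stored ids, in ascending order.
    public IEnumerable<int> GetSortedIds()
    {
        // ...
        yield break;
    }
}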

As you can see, we initialize the value with a list of streams of ints. Each of the streams can contain any number of values in the range [0 … maxId). Different streams can contain the same or different ids.

After initialization, we have to be able to query the result, to test whether a particular id was stored, which is easy enough. If this was all I needed, we could make do with a simple HashSet<int> and mostly call it a day. However, we also need to support iteration; more interestingly, we have to support sorted iteration.

A quick solution would be to use something like SortedList<int,int>, but that is going to be massively expensive (O(N*logN) to insert). It is also going to waste a lot of memory, which is important. A better solution would be to use a bitmap, which will allow us to use a single bit per value. Given that we know the size of the data in advance, that is much cheaper, and the cost of insert is O(N) in the number of ids we want to store. Iteration, on the other hand, is a bit harder on a bitmap.

Luckily, we have Lemire to provide a great solution. I have taken his C code and translated that to C#. Here is the result:
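The translated class isn’t included above, but the core loop, in the spirit of Lemire’s approach, looks roughly like this:

using System.Collections.Generic;
using System.Numerics;

public static class BitmapIteration
{
    // Yields the indexes of all set bits, in ascending order, one 64-bit word
    // at a time. A sketch in the spirit of Lemire's code, not necessarily the
    // exact translation used here.
    public static IEnumerable<int> SetBits(ulong[] bitmap)
    {
        for (int k = 0; k < bitmap.Length; k++)
        {
            ulong bits = bitmap[k];
            while (bits != 0)
            {
                // Index of the lowest set bit in the current word.
                int bit = BitOperations.TrailingZeroCount(bits);
                yield return k * 64 + bit;

                bits &= bits - 1; // clear the lowest set bit
            }
        }
    }
}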

I’m using BitOperations.TrailingZeroCount, which will use the compiler intrinsics to compile this to code very similar to what Lemire wrote. This allows us to iterate over the bitmap in large chunks, so even for a large bitmap, if it is sparsely populated, we are going to get good results.

Depending on the usage, a better option might be a Roaring Bitmap, but even there, dense sections will likely use something similar for optimal results.

time to read 2 min | 204 words

RavenDB Cloud is now offering HIPAA Compliant Accounts.

HIPAA stands for the Health Insurance Portability and Accountability Act, a set of rules and regulations that health care providers and their business associates need to comply with.

That refers to strictly limiting access to Personal Health Information (PHI) and Personally Identifiable Information (PII), as well as audit and security requirements. In short, if you deal with medical information in the United States, this is something that you need to deal with. In the rest of the world, there are similar standards and requirements.

With HIPAA compliant accounts, RavenDB Cloud takes on a lot of the details around ensuring that your data is stored in a safe environment and in a manner that matches the HIPAA requirements. For example, the audit logs are maintained for a minimum of six years. In addition, there are further protections on accessing your cluster, and we enforce a set of rules to ensure that you don’t accidentally expose private data.

This feature ensures that you can easily run HIPAA compliant systems on top of RavenDB Cloud with a minimum of hassle.

time to read 1 min | 159 words

We are looking to expand the number of top tier drivers and build a RavenDB client for PHP.

We currently have 1st tier clients for .NET, JVM, Python, Go, C++ and Node.JS. There are also 2nd tier clients for Ruby, PHP, R and a bunch of other environments.

We want to build a fully fledged client for RavenDB for PHP customers and I have had great success in the past in reaching awesome talent through this blog.

Chris Kowalczyk built our Go client and detailed the process in a great blog post.

The project will involve building the RavenDB client for PHP, documenting it as well as building a small sample app or two.

If you are interested, or know someone who would be, I would be very happy if you send the details to jobs@ravendb.net.

time to read 1 min | 200 words

RavenDB has the concept of metadata, which is widely used for many reasons. One of the ways we use the metadata is to provide additional context about a document. This is useful for both the user and RavenDB. For example, when you query, RavenDB will store the index score (how well a particular document matched the query) in the metadata. You can access the document metadata using:
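Here is a minimal sketch of that call; the Order type, the query, and the document shape are illustrative:

using System.Linq;
using Raven.Client.Documents;

public class Order
{
    public string Company { get; set; }
}

public static class MetadataExample
{
    public static void Run(IDocumentStore store)
    {
        using var session = store.OpenSession();

        var order = session.Query<Order>()
            .Where(o => o.Company == "companies/1-A")
            .First();

        // The metadata of a tracked document is available on the session.
        var metadata = session.Advanced.GetMetadataFor(order);

        // After a query, "@index-score" records how well this document matched.
        metadata.TryGetValue("@index-score", out var indexScore);
    }
}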

This works great as long as we are dealing with documents. However, when you query a Map/Reduce index, you aren’t going to get a document back. You are going to get a projection over the aggregated information. It turns out that in this case, there is no way to get the metadata of the instance. To be more exact, the metadata isn’t managed by RavenDB, so it isn’t keeping it around for the GetMetadataFor() call.

However, you can just ask the metadata to be serialized with the rest of the projection’s data, like so:

In other words, we embed the metadata directly into the projection. Now, when we query, we get the metadata back as part of the projection results.


time to read 2 min | 297 words

I mentioned in my previous post that I managed to lock myself out of the car by inputting the wrong pin code. I had to wait until the system reset before I could enter the right pin code. I got a comment on the post saying that it would be better to use a thumbprint scanner for the task, to avoid this issue.

I couldn’t disagree more.

Let’s leave aside the issue of biometrics, their security, and their use for identity. I don’t want to talk about that subject. I’ll assume that biometrics cannot fail and can identify a person 100% of the time, with no mistakes and no false positives or negatives.

What is the problem with a thumbprint vs. a pin code as the locking mechanism on a car?

Well, what about when I need someone else to drive my car? The simplest examples are valet parking, leaving the car at the shop, or just loaning it to someone. I can give them the pin code over the phone; I’m hardly going to mail someone my thumb. There are many scenarios where I actually want to grant someone the ability to drive my car, and making it harder to do so is a bad idea.

There is also the issue of what happens if my thumb is unusable. It might be raining and my hands are wet, or I may have just changed a tire and need half an hour at the sink to get cleaned up again.

You can think up solutions to those issues, sure, but they are cases where the advanced solution makes anything out of the ordinary a whole lot more complex. You don’t want to go there.

time to read 3 min | 582 words

Complex systems are usually considered a problem, but I’m using the term deliberately here. For reference, see this wonderful treatise on the subject. Another way to phrase this is: how to create robust systems with multiple layers of defense against failure.

When something is important, you should prepare for it in advance. And you should prepare for failure. One of the typical (and effective) methods for handling failures is the good old retry. I managed to lock myself out of my car recently and had to wait 15 minutes before it was okay for me to start it up. But a retry isn’t going to help you if the car has run out of fuel, or there is a flat tire. In some cases, a retry is just going to give you the same failing response.

Robust systems do not have a single way to handle a process; they have multiple, often overlapping, ways to do their task. There is absolutely duplication between these methods, which tends to raise the hackles of many developers. Code duplication is bad, right? Not when it serves a (very important) purpose.

Let’s take a simple example: order processing. Consider an order made on a website, which needs to be processed by a backend system.


The way it would work, the application would send the order to a payment provider (such as PayPal), which would process the actual payment and then submit the order information via a web hook to the backend system for processing.

In most such systems, if the payment provider is unable to contact the backend system, there will be some form of retry, usually with an exponential back-off strategy. That is sufficient to handle > 95% of the cases without issue. Things get more interesting when we have a scenario where this breaks. In the above case, assume that there is a network issue that prevents the payment provider from accessing the backend system. For example, a misconfigured DNS entry means that external access to the backend system is broken. Retrying in this case won’t help.

For scenarios like this, we have another component in the system that handles order processing. Every few minutes, the backend system queries the payment provider and checks for recent orders. It then processes the orders as usual. Note that this means that you have to handle scenarios such as the pull process picking up an order concurrently with the web hook execution. But you need to handle that anyway (retrying a slow web hook can cause the same situation).
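The important property is that both entry points, the web hook and the pull loop, funnel into one idempotent processing step, so handling the same order twice (or concurrently) is harmless. Here is a sketch of that idea; the names are illustrative, and a real system would persist the bookkeeping instead of keeping it in memory:

using System.Collections.Concurrent;

public class PaymentNotification
{
    public string OrderId { get; set; }
}

public class OrderProcessor
{
    private readonly ConcurrentDictionary<string, bool> _processed = new();

    // Called when the payment provider pushes a web hook.
    public void OnWebHook(PaymentNotification notification) => ProcessOnce(notification.OrderId);

    // Called every few minutes by the backend pull loop.
    public void OnPulledOrder(string orderId) => ProcessOnce(orderId);

    private void ProcessOnce(string orderId)
    {
        // TryAdd is atomic, so a concurrent web hook + pull for the same order
        // results in exactly one fulfillment.
        if (_processed.TryAdd(orderId, true) == false)
            return; // already handled

        Fulfill(orderId);
    }

    private void Fulfill(string orderId)
    {
        // ship the order, send the confirmation email, etc.
    }
}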

There is additional complexity and duplication in the system in this situation, because we don’t have a single way to do something.

On the other hand, this system is also robust in the other direction. Let’s assume that the backend’s credentials for the payment provider have expired. We aren’t stopped from processing orders; we still have the web hook to cover for us. In fact, both pieces of the system are individually redundant. In practice, the web hook is used to speed up common order processing time, with the pull process serving as the backup and recovery mechanism.

In other words, it isn’t code duplication, it is building redundancy into the system.

Again, I would strongly recommend reading: Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety.

time to read 5 min | 894 words

In this post, I’m going to walk you through the process of setting up a machine learning pipeline within RavenDB. The first thing to ask, of course, is: what am I talking about?

RavenDB is a database, it is right there in the name, what does this have to do with machine learning? And no, I’m not talking about pushing exported data from RavenDB into your model. I’m talking about actual integration.

Consider the following scenario. We have users with emails. We want to add additional information about them, so we assign their Gravatar image as their default profile picture.

In addition to simply using the Gravatar to personalize the profile, we can actually analyze the picture to derive some information about the user. For example, in a non-professional context, I like to use my dog’s picture as my profile picture.

Let’s see what use we can make of this with RavenDB, shall we?


Here is a simple document, with the profile picture stored in an attachment. So far, this is fairly standard fare for RavenDB. Where do we get to use machine learning? The answer is: very quickly. I’m going to define the Employees/Tags index, like this:
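The index definition itself isn’t reproduced here; a sketch of what it might look like, assuming the attachment-indexing helpers (LoadAttachment / GetContentAsStream), an illustrative attachment name, and the ImageClassifier extension described below, is:

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Employee
{
    public string Name { get; set; }
}

public class Employees_Tags : AbstractIndexCreationTask<Employee>
{
    public Employees_Tags()
    {
        Map = employees =>
            from e in employees
            let photo = LoadAttachment(e, "photo.png")
            let tags = ImageClassifier.Classify(photo.GetContentAsStream())
            select new
            {
                // Tag holds every classification above the 75% threshold.
                Tag = tags.Where(t => t.Value > 0.75f).Select(t => t.Key),
                // Dynamic fields record the score of each viable match.
                _ = tags.Select(t => CreateField(t.Key, t.Value))
            };
    }
}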

This requires us to use a nightly build of RavenDB 5.1, where we have support for indexing attachments. The idea here is that we are going to be making use of that to apply machine learning to classify the profile photo.

You’ll note that we pass the photo’s stream to ImageClassifier.Classify(), but what is that? The answer is that RavenDB itself has no idea about image classification and other things of this nature. What it does have is an easy way for you to extend RavenDB. We are going to use Additional Sources to make this happen.


The actual code is as simple as I could make it and is mostly concerned with setting up the prediction engine and outputting the results:
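The implementation isn’t shown above; a sketch of an ImageClassifier along those lines might look like the following. The ModelInput / ModelOutput shapes and the label list depend entirely on how model.zip was trained, so treat them as assumptions rather than the code from the post:

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML;

public static class ImageClassifier
{
    // Assumed label order, matching whatever labels the model was trained on.
    private static readonly string[] Labels = { "happy", "dog" };

    private static readonly Lazy<PredictionEngine<ModelInput, ModelOutput>> Engine =
        new Lazy<PredictionEngine<ModelInput, ModelOutput>>(() =>
        {
            var mlContext = new MLContext();
            ITransformer model = mlContext.Model.Load("model.zip", out _);

            // Note: PredictionEngine is not thread safe; a pooled engine would be
            // more appropriate under real indexing workloads.
            return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);
        });

    public static IDictionary<string, float> Classify(Stream image)
    {
        using var ms = new MemoryStream();
        image.CopyTo(ms);

        ModelOutput output = Engine.Value.Predict(new ModelInput { Image = ms.ToArray() });

        // Pair every label with its score so the index can filter on them.
        var result = new Dictionary<string, float>();
        for (int i = 0; i < output.Score.Length && i < Labels.Length; i++)
            result[Labels[i]] = output.Score[i];

        return result;
    }

    public class ModelInput
    {
        public byte[] Image { get; set; }
    }

    public class ModelOutput
    {
        public float[] Score { get; set; }
    }
}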

In order to make it work, we have to copy the relevant ML.Net assemblies to RavenDB’s directory. This allows the ImageClassifier to compile against the ML.Net code. The usual recommendations about making sure that the ML.Net version you deploy best matches the CPU you are running on apply, of course.

If you’ll look closely at the code in ImageClassifier, you’ll note that we are actually loading the model from a file via:

mlContext.Model.Load("model.zip", out _);

This model is meant to be trained offline by whatever system works for you; the idea is that, in the end, you just deploy the trained model as part of your assets and you can start applying machine learning as part of your indexing.

That brings us to the final piece of the puzzle: the output of this index. We output the data as indexed fields and give the classification for them. The Tag field in the index is going to contain all the matches that are above 75%, and we are using dynamic fields to record the scores of all the other viable matches.

That means that we can run queries such as:

from index 'Employees/Tags' where Tag in ('happy')

Insert your own dystopian queries here. You can also do a better drill down using something like:

from index 'Employees/Tags' where happy > 0.5 and dog > 0.75

The idea in this case is that we are able to filter the images by multiple tags and search for pictures of happy people with dogs. The capabilities that you get from this are enormous.

The really interesting thing here is that there isn’t much to it. We run the machine learning process once, at indexing time. Then we have the full query abilities at our hands, including some pretty sophisticated ones. Want to find dog owners in a particular region? Combine this with a spatial query. And whenever a user modifies their profile picture, we’ll automatically re-index the data and recompute the relevant tags.

I’m using what is probably the simplest possible option here, given that I consider myself very much a neophyte in this area. That also means that I’ve focused mostly on the pieces of integrating RavenDB and ML.Net; it is possible (likely, even) that the ML.Net code isn’t optimal or idiomatic. The beautiful part is that it doesn’t matter. This is something that you can easily change by modifying the ImageClassifier’s implementation, which is an extension, not part of RavenDB itself.

I would be very happy to hear from you about any additional scenarios you have in mind. Given the growing use of machine learning in the world right now, we are considering ways to allow you to utilize machine learning on your data with RavenDB.

This post required no code changes to RavenDB, which is really gratifying. I’m looking to see what features this would enable and what kind of support we should be bringing to the mainline product. Your feedback would be very welcome.
