Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Posts: 6,708 | Comments: 48,617


Graphs in RavenDB: The query language

time to read 4 min | 781 words

Pretty much all our early discussions about graphs in RavenDB focused on how to build the actual graph implementation. How to allow fast traversal, etc. When we started looking at the actual implementation, we realized that we seriously neglected a very important piece of the puzzle, the query interface for the graphs.

This is important for several reasons. First, ergonomics matter; if we end up with a query language that is awkward, it won’t see much use and will complicate the users’ lives (and our own). Second, the query language effectively dictates how the user thinks about the model, so making low level decisions that would have an impact on how the user actually uses this feature is probably not a good idea yet. We need to start from the top, with what we give to the user, and then see how we can make that a reality.

The most common use case for graph queries is the friends of friends query. Let’s see how this query is handled in various existing implementations, shall we?

Neo4J, using Cypher:

[image: the Cypher friends of friends query]

OrientDB doesn’t seem to have an easy way to do this. The following shows how you can find the 2nd degree friends, but it doesn’t exclude friends of friends who are already your friends. StackOverflow questions on that show a scary amount of code, so I’m going to skip them.

[image: the OrientDB second degree friends query]

Gremlin, which is used in a wide variety of databases:

[image: the Gremlin friends of friends query]

We looked at other options, but it seems that graph query languages fall into the following broad categories:

  • ASCII art to express the relationship between the nodes.
  • SQL extensions that express the relationships as nested queries.
  • Method calls to express the traversal.

Of the three options, we found the first, using ASCII art / Cypher, to be the easiest one to work with. This is true both in terms of writing the query and actually executing it.

Let’s look at how a friends of friends query looks in RavenDB:

[image: the friends of friends query in RavenDB]

Graph queries are composed of two portions:

  • With clauses, which determine the starting points for the graph traversal.
  • A match clause (singular), which contains the graph pattern that we need to match on.

In the case above, we are starting the graph traversal from start, which is defined in a with clause. A query can have multiple with clauses, each defining an alias that can be used in the match clause. The match clause, on the other hand, uses these aliases to decide how to process the query.
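The actual query was shown in an image above; to give a feel for its shape, here is a rough sketch of what such a query might look like, based purely on the description in this post (the collection, the edge name and the final syntax are illustrative and may differ from what actually ships):

    with { from Dogs where id() = 'dogs/arava' } as start
    match (start)-[Likes]->(f1)-[Likes]->(f2)
      and not (start)-[Likes]->(f2)
    select f2

The with clause defines the start alias, and the match clause has two patterns: the friends of friends traversal, and the direct friends pattern that is negated with and not.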

You can see that we have two clauses in the above query, and the actual processing is done by pattern matching (to me, it makes sense to compare it to regular expressions or Prolog). It would probably be easier to show this with an example. Here is the relationship graph among a few people:

[image: the relationship graph between Arava, Oscar, Sunny and Phoebe]

We’ll set the starting point of the graph as Arava and see how this will be processed in the query.

For the first clause, we’ll have:

  • start (Arava) –> f1 (Oscar) –> f2 (Phoebe)
  • start (Arava) –> f1 (Oscar) –> f2 (Sunny)
  • start (Arava) –> f1 (Sunny) –> f2 (Phoebe)
  • start (Arava) –> f1 (Sunny) –> f2 (Oscar)

For the second clause, on the other hand, we have:

  • start (Arava) –> f2 (Oscar)
  • start (Arava) –> f2 (Sunny)

These clauses are joined using the and not operator. What this means is that we need to exclude from the first clause anything that matches on the second clause. A match, in this case, means the same value for any alias that exists in both clauses.

Here is what we end up with:

  • start (Arava) –> f1 (Oscar) –> f2 (Phoebe)
  • start (Arava) –> f1 (Oscar) –> f2 (Sunny) – removed, matches the second clause
  • start (Arava) –> f1 (Sunny) –> f2 (Phoebe)
  • start (Arava) –> f1 (Sunny) –> f2 (Oscar) – removed, matches the second clause

We removed two entries (marked above), because they matched the entries from the second clause. The end result is just the friends of my friends who aren’t already my friends.
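To make the pattern matching semantics concrete, here is a rough equivalent of this query in plain C# / LINQ over an in-memory adjacency map. This is just an illustration of the set-difference behavior, not how RavenDB actually evaluates the query, and the data is made up to match the example graph:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class FriendsOfFriends
    {
        static void Main()
        {
            // Adjacency map matching the example graph (illustrative data).
            var friends = new Dictionary<string, string[]>
            {
                ["Arava"]  = new[] { "Oscar", "Sunny" },
                ["Oscar"]  = new[] { "Phoebe", "Sunny" },
                ["Sunny"]  = new[] { "Phoebe", "Oscar" },
                ["Phoebe"] = new string[0],
            };

            const string start = "Arava";

            // First clause:  start -> f1 -> f2
            // Second clause: start -> f2, joined with 'and not'
            var results =
                from f1 in friends[start]
                from f2 in friends[f1]
                where !friends[start].Contains(f2) // exclude anything matched by the second clause
                where f2 != start                  // not needed for this particular graph, but sensible in general
                select f2;

            foreach (var name in results)
                Console.WriteLine(name); // Phoebe (via Oscar) and Phoebe (via Sunny)
        }
    }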

The idea behind the query language is that we want it to be high level and allow you to express what you want, and we’ll be in charge of actually making it work properly.

In the next post, I’ll talk a bit more about the query language, what scenarios it enables and how we are going to go about processing queries.

Graphs in RavenDB: The overall design

time to read 5 min | 863 words

Note: This series of posts is about a planned feature, exploring how we go about building it. It is meant to solicit feedback and get more eyes on the idea; things aren’t set in stone and we don’t have a firm release date for this.

We have been wanting to add graph queries to RavenDB for several years now, but more important things always got in the way. That didn’t prevent us from discussing this internally and sketching out a few options. We are now looking at this more seriously, and I thought that sharing the details of our deliberations would be interesting and likely to garner us some valuable feedback. I’m going to assume that the reader is at least somewhat familiar with the notion of graph data and graph queries.

Probably the most well known graph database is Neo4J, which provides the notion of nodes and edges, both of which have a type and a set of (flat) properties. This allows you to define a model of any arbitrary complexity. That works if your model is purely graph based, but it doesn’t work for RavenDB, whose users are used to the document model. On the surface, this looks like a minor detail. RavenDB has documents, which can have any shape, including embedded values and collections inside them. Neo4J, on the other hand, models things differently. The simplest example that I can think of is Orders and Order Lines, where you’ll have the following models:

Neo4J: [image: the order and its lines modeled as separate, linked nodes]

RavenDB: [image: a single order document with the order lines embedded in it]

Both models hold the same information, but each element in the Neo4J graph is an independent node that is linked to the others. With RavenDB, on the other hand, we have a single document that embeds a lot of the information directly. Note that what we haven’t shown in the image is that RavenDB also has other documents; the products, for example, are separate documents.
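To make that concrete, here is roughly what such an order document might look like in RavenDB. This is an illustrative sketch in the spirit of the usual Northwind-style example; the exact field names and ids are made up, and the document in the image may differ:

    {
        "Company": "companies/1-A",
        "OrderedAt": "2018-09-20T10:00:00.0000000",
        "Lines": [
            { "Product": "products/1-A", "Quantity": 3, "PricePerUnit": 18.0 },
            { "Product": "products/2-A", "Quantity": 1, "PricePerUnit": 19.0 }
        ]
    }

The Lines array is embedded directly in the document, while the Product values are references to other documents. Those references are exactly the kind of pre-existing links that graph support can later traverse.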

Graph databases are often used as the basis of recommendation engines, fraud detection, etc. But they are usually used to augment the capabilities of a system, rather than as the primary data store of an application. RavenDB, on the other hand, is most frequently deployed as the primary data store. We want to give our users the ability to perform graph operations, but we don’t want to lose anything that makes RavenDB useful and easy to use.

We initially thought about having the following definition:

  • Each document is (implicitly) a node in the graph.
  • You can call Link(src,dest,type, attributes) to create an edge between any two documents.
  • Provide the usual graph queries on top of that.

We started exploring this implementation, but it quickly led to mounting complexity. From the point of view of the user, it meant additional work: you’d have to maintain your document model and the edges at the same time. This allows you to do some interesting things, but it is also likely to cause complications down the line, and very likely to cause issues when the document model and the graph model disagree with one another. Other issues relate to how you handle graphs in a distributed manner. How do you deal with the creation of an edge between two documents on one node when one of them was deleted on another?

We pushed in that direction for a while, because that was the obvious thing to do, but it really turned out to be a bad idea that didn’t play well with the rest of RavenDB. The worst part was the fact that you might modify the document properties but not define the edge, which leads to inconsistency. This was very easy to do.

The next thing we played with was to remove the Link() call and allow the user to define a background operation that would go and create the links between documents automatically whenever they were updated. This would allow us to avoid having any inconsistencies between the data in the documents and the links between them. After thinking about this for a while, we went ahead with this approach, but removed the requirement for a background operation.

RavenDB will be able to use your existing document model as the graph model as well. In other words, in the model above, you have the orders/2 document, which has two links, one for each of the products. This gives us both a well defined document model, with its well known Domain Driven architecture, and the ability to follow all the pre-existing links that we already have in the model.

I’ll discuss the querying model and how it all plays together in a future post. For now, I want to show you how this looks when we want to do a typical graph operation, friends of friends:

[image: a friends of friends query over the document model]

More details will come in the next post…

Debug considerations for high level system architecture

time to read 4 min | 602 words

I ran into this tweet:

This resonated very strongly with me, because when we architected RavenDB 4.0, one of the key considerations was the issue of debuggability. RavenDB instances often run for months on end, usually restarted only to apply updates to the OS or the database. They are often running in production environments where it is not possible to do any meaningful debugging. We rely heavily on resolving issues through minidumps, core dumps, etc. Part of the work we did in architecting RavenDB 4.0 was to sit down and think about supporting the system in production.

For many of the core components, async was right out. Part of that was because of issues relating to the unpredictability of async execution: we want certain things to always happen first, and we want to avoid thread pool starvation, growth policies, etc. But primarily, we were sick and tired of getting a dump (or even just pausing a running instance when we debug a complex situation) and having to manually reconstruct the state of the system. Parallel Stacks alone is an amazing feature for figuring out what is going on in a complex system.

The design of RavenDB called for any long lived task to run on a dedicated thread. These threads are named, so if you stop in the debugger, you can very quickly see what is actually going on there. This is also useful for things like accounting for memory, CPU time, etc. We had a problem in a particular component that was leaking memory at a rate of 144 bytes per second, just under 12 MB per day. This is something that is very easy to lose in the noise. But because we do memory accounting on a per thread basis, it was easy to go to a system that had been running for a few weeks and see that this particular thing had 500MB of memory in use, when we expected maybe 15MB.

We still use async for handling short term operations, such as processing a single request, because these are fast, and if there are problems with them, we’ll usually see them already executing.

I’m really happy with this decision, since it paid us many dividends down the line. We planned this for production, to be honest, but it ended up being really helpful in normal debugging as well.

This also allows us to take advantage of the fact that a thread that is not runnable is effectively free (aside from some memory, of course), so we can dedicate a full thread to these long running tasks and greatly simplify everything. An index in RavenDB always has its own dedicated thread, which is woken up if there is anything that the index needs to process. This means that indexing code is simple, isolated, and we can start applying policies at the index level easily. For example, if I have an index with a low priority, I can just adjust the thread’s priority and let the OS do the hard work of scheduling it accordingly.
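As a rough illustration of the pattern (a sketch, not RavenDB’s actual code), a dedicated, named, long lived worker thread looks something like this:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class IndexWorker
    {
        private readonly BlockingCollection<string> _work = new BlockingCollection<string>();
        private readonly Thread _thread;

        public IndexWorker(string indexName, ThreadPriority priority)
        {
            _thread = new Thread(Run)
            {
                Name = "Indexing of " + indexName, // shows up in the debugger, dumps and per thread accounting
                IsBackground = true,
                Priority = priority                // low priority index? just lower the thread priority
            };
            _thread.Start();
        }

        public void Enqueue(string documentId) => _work.Add(documentId);

        private void Run()
        {
            // The thread sleeps inside the consuming enumerable until there is something
            // to process, so an idle index costs little more than its stack.
            foreach (var documentId in _work.GetConsumingEnumerable())
            {
                // index the document...
            }
        }
    }

Because the thread only wakes up when there is work in its queue, the cost of an idle index is negligible, and the thread name is immediately visible in Parallel Stacks and in dumps.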

Async simplifies the programming model significantly, but it also comes at a cost in system complexity and maintenance overhead. Figuring out that you have a request stuck on a task that will never return, for example, is never pleasant. The same thing with blocking operations is immediately obvious. That is a benefit that should absolutely not be discounted.

Answering the web developer task

time to read 1 min | 101 words

In my previous post, I talked about a task we give candidates who interview for the web developer position. They need to implement the following:

Given that I don’t like handing out tasks that I haven’t done myself, I took a few minutes to answer my own question. Here is how this can be implemented:

I believe that I mentioned that my JavaScript skills are from the last decade, if that, so I’m probably committing quite a few sins against JavaScript (if that is even possible), but this code ran the first time I tried it and gave the proper result.

Interview question for Web Developer

time to read 2 min | 281 words

One of the roles we are looking to fill right now is web developer. We are looking for someone who can do great things in the browser and write good, maintainable code. I’m not a web developer, and haven’t been one in a while, but it has been really interesting to see the interview process. In particular, I have great fondness for the following line of questions.

What does this code do?

I love this because it is simple, short and can reveal a lot about the mental model the candidate has of how JavaScript works. If they hesitate too much in answering, we typically just run the snippet in the browser and ask them to explain the results. This is important because it shows whether they understand the execution model, how code is interleaved, etc.

The next stage is to ask them what the following snippet will do:

I actually have zero expectation that they will be able to answer this correctly. We let them think about this for a few seconds and then run it in the browser. Asking them to explain why it gave the output it gave is a lot more interesting.

The final piece of this line of questions is to have them implement the following:

And here we get to also investigate how they are thinking about code. This isn’t a trivial thing to implement, because you need to understand lambdas, how to coordinate several actions into a cohesive whole, etc.

We usually ask people to describe to us how they would handle something like this, rather than actually write the code. And hearing the thought process that the candidates go through as they solve it can be illuminating.

Reading OSS code to figure out what is actually going on

time to read 3 min | 440 words

I use Open Live Writer to post to this blog. The problem is that whenever I post a new post, it opens up the metadata API endpoint in the browser (services/metaweblogapi.ashx). I actually want to see the blog post that I just posted. I decided that this was annoying enough that I’m going to figure out how this is done and see if there is a way for my blog to give Open Live Writer the address of the newly created post.

I want this to be a focused operation; I don’t want to read through it all. So I’m going to see if I can figure out how this works with a minimum of effort. I know that OLW opens the browser after the post is published, and this is usually done with Process.Start, so I ran the following query:

[image: searching the OLW codebase for Process.Start]

The very first result is promising, showing ExecuteFile. This sounds interesting, let’s see how it is used. No one seems to be calling this method, but reading through the ShellHelper file, I ran into LaunchUrl(), which seems promising. Searching for this method got me to some interesting locations, including the ViewPage method, which seems to be exactly what I want.

[image: the ViewPage method]

This seems to indicate that the blog should support pages. I’m not sure what this is about, but I found this piece of code by searching for IsPage:

[image: the code that checks IsPage]

Not sure what pages are, but looking at the configuration for my blog, I see:

[image: my blog’s configuration in OLW]

Continuing to blaze through the code, I can see we have:

[image: the relevant OLW code]

My blog doesn’t implement this method, but OLW doesn’t probe for it. It seems that because I’m using the generic interface, it has already pre-loaded the available options there. What this means is that this exploration ended up at a dead end. I figured out roughly what is going on, but actually getting all the details is probably too much of a hassle for me to debug through the OLW code and update my blog engine. I’m already used to just closing the newly opened tab and going to the new post directly. I’ll keep this in my todo list for when I actually get around to doing it.

RavenDB 4.1, Inside RavenDB and RavenHQ

time to read 3 min | 451 words

Last Friday, RavenDB 4.1 hit the RTM milestone and then went out to see what other interesting things it can do. The highlights include:

  1. Cluster wide transactions
  2. JavaScript indexes
  3. SQL Migration Wizard
  4. Distributed Counters
  5. RavenDB Embedded
  6. MongoDB & CosmosDB migration

The Inside RavenDB book is also complete and is now available at Amazon. This ended up being 562 pages of deep dive into everything that RavenDB does, how it does it and most importantly, why it is doing so. Not willing to just give you the raw details, I tried to tell both the story of RavenDB and place it in context, so you’ll understand how to actually make the best use of the database. This has been almost a year and a half in the making. I’m really glad that I wrote the book, because the mere fact that I had to present a coherent story for the book was very helpful for the product itself. Of course, that meant that I had to go and re-write a bunch of stuff, but the end result is both a better product and a better book.

The book is available both in paperback format and on Kindle (and Kindle Unlimited). I would really appreciate any feedback you have. The first actual copies are currently in the mail (currently residing in Weybridge, Surrey, GB) and I can’t wait to actually hold them in my hands.

In other news, we are nearing public availability of 4.1 in RavenHQ as well. RavenHQ is currently in the middle of their private beta for RavenDB 4. Along with supporting v4.1, RavenHQ is revamping its hosting model to offer features such as hosting multiple databases per subscription and the ability to create a cluster in the region of your choice. RavenHQ will be opening up its v4.1 clusters to the public within the next few weeks, but the private beta is currently offering clusters for free! If you’d like to participate in the private beta, fill out their survey to get an invitation.

The hosting model revamp sounds small, but it has huge implications for all sorts of details that you have to get right when you operate at the scale that RavenHQ does. In particular, whereas the 3.5 offering for RavenHQ is a single (replicated) database, the RavenDB 4.1 offering is actually a full fledged cluster. This gives you the ability to manage your own databases, designate production / test / dev environments, automatically create databases, etc. RavenHQ is still going to operate and manage everything for you, but you will now have far more freedom.

RavenDB 4.1 Tidbits: The production environment

time to read 1 min | 171 words

A few things in RavenDB 4.1 are relatively minor features, but they fill a niche and can really help out. One of them is the server environment reminder. When working with multiple clusters for different environments, it is easy to forget which environment you are in. That can lead to thinking that you are on a test environment when you are actually on production, with much hilarity down the road for the people reading your post mortem.

The server environment (which you can configure from the Studio configuration) gives you a visible reminder of where you are. Here is what this looks like:

[image: the environment banner in the Studio]

In some cases, you’ll have a single cluster, but have different databases for different tasks. The environment can be defined at the cluster and database levels, like so:

[image: the environment set at the cluster and database levels]

Reviewing FASTER: Summary

time to read 4 min | 695 words

FASTER is an interesting project, with some unique approaches to solving its tasks that I haven’t encountered before. When I initially read the paper about a year or so ago, I was impressed with what they were doing, even if I didn’t quite grasp exactly what was going on. After reading the code, this is now much clearer. I don’t remember where I read it, but I remember reading a Googler talking about the difference between Microsoft and Google with regards to publishing technical papers. Google would only publish something after it has been in production for a while (and is probably ready to sunset), while Microsoft would publish papers about software that hasn’t been deployed yet.

The reason I mention this is that FASTER isn’t suitable for production. Not by a long shot. I’m talking about issues such as swallowing errors, writing to the console as an error handling approach, calling sleep(), and a lack of logging / tracing / visibility into what is going on in the system. In short, FASTER looks like it was produced to support the paper. It is proof of concept / research code, not something that you can take and use.

You can see it clearly in the way that the system is designed to be operated. You have dedicated threads that process requests as fast as they possibly can. However, there is no concept of working in any kind of operational environment; you can’t start using FASTER from an ASP.NET MVC app, for example. The models are just too different. I can think of a few ways to build a server using the FASTER model, but they are all pretty awkward and very specialized. This leads to the next issue with the project: it is a highly specialized solution.

It isn’t meant for general consumption. In fact, as far as I can figure out, it is perfect if you have a relatively small working set that you do a lot of operations on. The examples I have seen given are related to tracking ads, which is a great example. If you want to store impressions on an ad, the set of active ads is going to be pretty small, but you are going to have a lot of impressions on them. For other stuff, I don’t see a lot of usage scenarios.

FASTER is limited in the following ways:

  • Get / Set / Update only – no way to query
  • No support for atomic operations on more than a single key
  • Can support fixed length values only
  • Crash means data loss (of the most recent 14.4 GB, usually)
  • The API is awkward to use, and you need to write a bit of (non trivial) code for each key/val you want to store.
  • No support for compaction of data beyond dropping the oldest entries

Some of these issues can be mitigated. For example, compaction can be implemented, and you can force data to be written to disk faster if you want to, but those aren’t in the box and require careful handling.

Now that I have gone over the code, I’m not quite sure what the point there was, to be honest. In terms of performance, you can get about 25% of the achieved performance by just using ConcurrentDictionary in .NET, and I’m pretty sure that you can do better by literally just using a concurrent hash map in C++. This isn’t something that you can use as a primary data store, after all, so I wonder why not just keep all the data in memory and be done with it.
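For reference, the “just use ConcurrentDictionary” alternative for the ad impressions scenario is roughly this much code (a sketch for comparison, obviously not an apples-to-apples benchmark):

    using System.Collections.Concurrent;

    class ImpressionCounter
    {
        private readonly ConcurrentDictionary<string, long> _impressions =
            new ConcurrentDictionary<string, long>();

        // Atomically add or increment the counter for this ad.
        public void RecordImpression(string adId) =>
            _impressions.AddOrUpdate(adId, 1, (_, current) => current + 1);

        public long GetImpressions(string adId) =>
            _impressions.TryGetValue(adId, out var count) ? count : 0;
    }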

I liked the mutable / read only portions of the log; that is certainly a really nice way to do it, and I’m sure that the epoch idea simplified things during the implementation, since it removes the need to worry about concurrent accesses. However, the code is complex and I’m pretty sure that it is not going to be fun to debug / work with in real world scenarios.

To sum it up: an interesting codebase and interesting approaches, but I would caution against using it for real. The perf numbers are something to salivate over, but the manner in which the benchmark was written means that it is not really applicable to any real world scenario.

Reviewing FASTER: When the data hits the disk

time to read 5 min | 817 words

So far, I have ignored everything in FASTER about how the data actually hits the disk. Based on my reading of the code and the paper, here is what I think is going on. FASTER works in segments and in conjunction with its allocator. When you create a new instance, you have to define what the log size will be. From looking at the code, they seem to favor 16GB as the default size of the log. This is passed to PersistentMemoryMalloc, which uses pages of 32MB each to manage the memory. Out of these 16GB, 14.4GB are considered mutable and 1.6GB is considered read only.
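As a quick sanity check on those numbers (my own arithmetic, expressed as C# constants, not FASTER’s actual configuration code), this is a 90% / 10% split of the log:

    const long PageSize  = 32L * 1024 * 1024;          // 32 MB per page
    const long LogSize   = 16L * 1024 * 1024 * 1024;   // 16 GB log
    const long PageCount = LogSize / PageSize;         // 512 pages in memory
    const long MutableSize  = LogSize / 10 * 9;        // 14.4 GB mutable region
    const long ReadOnlySize = LogSize - MutableSize;   // 1.6 GB read only region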

On startup, this class allocates two pages (64MB). Whenever FASTER needs more memory, such as to store a new record, it will eventually call this method:

[image: the allocation method]

Again we have num_slots that actually means size in bytes, but I’ll leave my pet peeves aside.

You can see that this allocates from the tail of the page using Reserve, which does an atomic operation. If we run out of space in the 32MB page, the caller needs to call NewPage() to handle the new allocation. This plays together with the buffer management and the epoch. In particular, here is how a new page is allocated in the normal case. Assuming we just started and we have consumed 64MB of memory, the new entry will allocate the 3rd page. This will also move the read only section of memory when a new page is allocated and the amount of active memory goes beyond 14.4GB.

[image: how a new page is allocated]
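The tail reservation described above is essentially an atomic bump of an offset. Here is a minimal sketch of the idea in C# (an illustration of the technique, not FASTER’s actual code, which also deals with wasted space at the end of a page):

    using System.Threading;

    class PageAllocator
    {
        private readonly byte[] _page = new byte[32 * 1024 * 1024]; // one 32 MB page
        private long _tail; // next free offset within the page

        // Try to reserve 'size' bytes from the tail of the page.
        // Returns the start offset, or -1 if the page is full and a new page is needed.
        public long Reserve(int size)
        {
            var end = Interlocked.Add(ref _tail, size);
            if (end > _page.Length)
                return -1; // caller must allocate a new page (NewPage() in FASTER)
            return end - size;
        }
    }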

What this means, in practice, is that FASTER typically has a 14.4GB range in which all operations are working on purely mutable memory. That means that two impressions on the same ad will end up being very close to simply an Interlocked.Increment on the relevant value. This is the key to the performance that FASTER exhibits.

What happens once we start going beyond the 14.4GB? FASTER will begin to move data to the read only section. In this case, it means that any new modification to the data will create a new copy of it in the mutable section.

At the same time, PageAlignedShiftReadOnlyAddress() will also start an async process of writing these read only pages to disk. Here is how it works:

[image: PageAlignedShiftReadOnlyAddress]

If the read_only_address was updated, FASTER calls BumpCurrentEpoch() and will execute OnPagesMarkedReadOnly() when the epoch moves beyond this range (this works because it is then guaranteed that no one may mutate this memory). That method looks like this:

[image: OnPagesMarkedReadOnly]

The notion of read_only_address and safe_read_only_address is discussed in the paper quite nicely, by the way. AsyncFlushPages() writes the data to disk, as you would expect, and updates various in memory structures.
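The epoch mechanism is, at its core, a way to defer an action until every thread is guaranteed to have seen the new state. Here is a deliberately simplified sketch of that general idea in C# (my illustration of the technique, not FASTER’s implementation, which is far more careful about performance):

    using System;
    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading;

    class EpochProtection
    {
        private long _currentEpoch = 1;

        // The epoch each thread was in when it last entered a protected region.
        private readonly ConcurrentDictionary<int, long> _threadEpochs =
            new ConcurrentDictionary<int, long>();

        // Actions that may only run once every thread has moved past a given epoch.
        private readonly ConcurrentQueue<(long epoch, Action action)> _deferred =
            new ConcurrentQueue<(long epoch, Action action)>();

        public void Enter() =>
            _threadEpochs[Environment.CurrentManagedThreadId] = Interlocked.Read(ref _currentEpoch);

        public void Exit() =>
            _threadEpochs.TryRemove(Environment.CurrentManagedThreadId, out _);

        // Advance the epoch and register an action (such as flushing pages that were
        // just marked read only) that is safe once no thread can still touch the old state.
        public void BumpCurrentEpoch(Action onSafe)
        {
            var newEpoch = Interlocked.Increment(ref _currentEpoch);
            _deferred.Enqueue((newEpoch - 1, onSafe));
        }

        // Called periodically: run any deferred action whose epoch is older than
        // every thread currently inside a protected region.
        public void TryRunDeferred()
        {
            var oldestActive = _threadEpochs.IsEmpty ? long.MaxValue : _threadEpochs.Values.Min();
            while (_deferred.TryPeek(out var entry) && entry.epoch < oldestActive)
            {
                if (_deferred.TryDequeue(out entry))
                    entry.action();
            }
        }
    }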

Note that just having the data written to disk doesn’t remove the in memory copy. It is still available in memory until it is pushed out of the log by new data. Now that we understand how the data goes to the log and then to the disk, I want to figure out how the data is written to the disk itself. Actual writes are handled here:

[image: the disk write code]

What you can see is that the destination offset is used to divide the data on disk into sections. Each section is 1GB in size. In other words, the way FASTER works is to write the data in 1GB segments that are sequential over time.

This also plays into the expiration policy that FASTER employs. Because it uses a log based system, it accumulates data over time and will need to handle that. The current way it deals with the problem is to just delete old files; this gets rid of the data in roughly time based order, which is suitable for the use case that the paper talks about. Another alternative is to read the old files and move the still valid entries to the start. That doesn’t seem to be implemented, and I think it would be pretty hard to do and likely consume a lot of resources.
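The segment math itself is simple. Here is roughly how a log offset maps to a 1GB segment file, and how the time based expiration falls out of just deleting the oldest files (my illustration of the scheme, not FASTER’s code; the file naming is made up):

    using System.IO;

    static class SegmentedLog
    {
        const long SegmentSize = 1L * 1024 * 1024 * 1024; // 1 GB per on-disk segment

        // The destination offset in the log decides which 1 GB file the data lands in.
        public static string SegmentFileFor(long logOffset, string dir) =>
            Path.Combine(dir, $"log.{logOffset / SegmentSize}");

        // Expiration: dropping the oldest segment files removes the oldest data,
        // since offsets (and therefore segment numbers) only ever grow.
        public static void TrimOldest(string dir, long oldestOffsetToKeep)
        {
            var oldestSegment = oldestOffsetToKeep / SegmentSize;
            foreach (var file in Directory.GetFiles(dir, "log.*"))
            {
                var number = long.Parse(Path.GetFileName(file).Substring("log.".Length));
                if (number < oldestSegment)
                    File.Delete(file);
            }
        }
    }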

I’ll probably have another summary post about FASTER, but that is pretty much it. I’ve ignored other parts (recovery, several state machines used to handle internal state, etc.), but they aren’t important to grokking what it is actually doing. It is an interesting codebase, but it feels… incomplete. I’ll reserve such thoughts for the summary post.
