Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 1 min | 161 words

Before I start, I wanted to explain that NHibernate fully supports the identity generator, and you can work with it easily and without pain.

There are, however, implications to using the identity generator in your system. Tuna does a great job of detailing them. The most common issue that you’ll run into is that identity breaks the notion of unit of work. When we use an identity, we have to insert the value into the database as soon as we get it, instead of deferring it to a later time. It also renders batching useless.

And, just to put some additional icing on the cake: on SQL Server 2005 and SQL Server 2008, identity is broken.

I know that “select ain’t broken” most of the time, but this time, it appears it is :-)

We strongly recommend using some other generator strategy, such as guid.comb (similar to SQL Server's newsequentialid()) or HiLo (which also generates human readable values).
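For illustration, here is a minimal sketch of switching the generator in a mapping. It uses NHibernate's mapping-by-code API (available in later NHibernate versions); the entity name and id type here are made up:

using NHibernate.Mapping.ByCode;
using NHibernate.Mapping.ByCode.Conformist;

public class Order
{
    public virtual int Id { get; set; }
}

public class OrderMap : ClassMapping<Order>
{
    public OrderMap()
    {
        // HiLo keeps id generation on the client side, so inserts can stay
        // deferred and batched inside the unit of work.
        // Generators.GuidComb would give the guid.comb strategy mentioned above
        // (with a Guid id property instead of an int).
        Id(x => x.Id, m => m.Generator(Generators.HighLow));
    }
}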

time to read 1 min | 173 words

It is surprising how many bombs are waiting for you at a low enough level. Here is another piece of problematic code from the same codebase:

// Read the name.
var _buffer = new byte[_len];
int _numRead;
try {
    _numRead = stream.Read(_buffer, 0, _buffer.Length);
}
catch (Exception _ex) {
    logger.Log(string.Format("Conn: read(): {0}", _ex));
    throw new SpreadException(string.Format("read(): {0}", _ex));
}

// Check for not enough data.
if (_numRead != _len) {
    logger.Log("Conn: Connection closed during connect attempt");
    throw new SpreadException("Connection closed during connect attempt");
}

Can you spot the problem?
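In case you want to check your answer: Stream.Read is allowed to return fewer bytes than requested even when the connection is perfectly healthy, so the check above treats a partial read as a closed connection. A minimal sketch of a read loop that handles this, reusing the names from the snippet above:

var _buffer = new byte[_len];
int _totalRead = 0;
while (_totalRead < _len) {
    // keep reading until we have the full message, or the stream actually ends
    int _read = stream.Read(_buffer, _totalRead, _len - _totalRead);
    if (_read == 0) {
        logger.Log("Conn: Connection closed during connect attempt");
        throw new SpreadException("Connection closed during connect attempt");
    }
    _totalRead += _read;
}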

time to read 1 min | 173 words

I am going over some library code at the moment, and I am following my usual routine of starting at the basic infrastructure to find the fault lines. Take a look at this:

public void Log(string pSource, string pMessage, EventLogEntryType pEntryType) {
    try {
        if (!EventLog.SourceExists(pSource)) {
            EventLog.CreateEventSource(pSource, "Application");
        }

        EventLog.WriteEntry(pSource, pMessage, pEntryType);
    }
    catch (Exception _ex) {
        Log("", _ex.ToString(), EventLogEntryType.Error);
    }
}
This is… impressive.
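For contrast, a sketch of the same method with the recursive failure path removed: if writing to the event log fails, it falls back to Trace instead of calling Log() again with an empty source, which would just fail again and recurse until the stack blows up.

public void Log(string pSource, string pMessage, EventLogEntryType pEntryType) {
    try {
        if (!EventLog.SourceExists(pSource)) {
            EventLog.CreateEventSource(pSource, "Application");
        }

        EventLog.WriteEntry(pSource, pMessage, pEntryType);
    }
    catch (Exception _ex) {
        // last-ditch fallback that cannot recurse back into this method
        Trace.WriteLine(string.Format("Failed to write to event log: {0}", _ex));
    }
}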
time to read 3 min | 596 words

So far I posted quite a few posts about building the document database. To be frank, the reason I did this is that the idea has been bouncing around in my head a lot recently, and sitting down and actually thinking about it has been great, especially since now I have the design dancing in my head, shiny & beautiful. Here is the full list, in case you missed anything:

  1. Schema-less databases
  2. Designing a document database
  3. Designing a document database: Storage
  4. Designing a document database: Scale
  5. Designing a document database: Authorization
  6. Designing a document database: Concurrency
  7. Designing a document database: Attachments
  8. Designing a document database: Replication
  9. Designing a document database: Views
  10. Designing a document database: Aggregation
  11. Challenge: C# Rewriting
  12. Designing a document database: Aggregation Recalculating
  13. Designing a document database: View syntax
  14. Designing a document database: Remote API & Public API

A few days ago I asked on twitter what people think: do I have this written already or not? Opinions seem to be divided on this score. Let me try to set the record straight. I have a lot of scattered code around this, yes. But it is not a project; it is a lot of tiny experiments to prove that one approach or another would work. This series of posts has required a lot of research. But I don’t have anything that is even remotely close to a working system.

I am estimating that it would take a month or two to take this from the drawing board to something that I would be willing to use in production*. That is full time work, by the way. It is likely that I can get something usable faster than that, depending on your definition of usable :-). Most of the challenge is going to be in implementing the views, as I see it now. Everything else seems to be pretty straightforward.

That is somewhat of a problem. I don’t really want to spend several months (and the associated support costs afterward) building an open source project. The main issue is that while it is fun, there is simply no money in it, and I heard that eating is mandatory. On the other hand, I don’t really see something like that selling as a commercial package. This is infrastructure, and infrastructure has been commoditized. The ideal solution from my point of view is what we tried to do with Linq to NHibernate: getting a company, or several companies, to sponsor its development as an OSS project.

The motivation would be the same as usual: this is something that the aforementioned companies need, and are willing to pay for. It didn’t end up the way I expected with Linq to NHibernate, but it ended up very well after all, so I am happy about that.

Oh, and as an aside, if you want more posts in this series, do suggest a few topics that you want to hear about.

* Just to give you an idea about the complexity involved, I estimated Linq to NHibernate to be about 3 months.

time to read 3 min | 474 words

One of the greatest challenges that we face when we try to design an API is balancing power, flexibility, complexity, extensibility and version tolerance. There is a set of tradeoffs that we have to make in order to design a good API, and selecting the right tradeoffs impacts the way that users will work with the software.

One of the decisions that I made about the docs db is that it would actually have two published APIs. The first would be the remote API, accessible to clients, and the second would be a public API, accessible to people who want to extend the db. I consider extending the db to be a very common operation, for doing things like adding more functions for the views, supplying the sharding function, etc. I am assuming that most people who would want to do this extension would be .Net devs, so I am just going to use my usual style here and expose some extension points, so it will be possible to mix & match things.

For the remote API, we need a way to add / remove documents, get a document and query a view. I would like that part of the API to be accessible from any language, even if I don’t foresee it getting used much in other languages. The idea of being able to access it from JavaScript, and making the entire application hosted through the DB, is actually a very interesting one; see CouchApp for more details. Making it accessible to everyone is a pretty good idea, and I think that adopting a REST API similar to the one that CouchDB is using would be a good choice. Something that I would be extremely interested in, as a matter of fact, would be to make use of Astoria’s REST API, instead of having to write my own implementation. I think it bears some investigation; on the minus side, Astoria expects some sort of schema from the backend implementation, but it would be an interesting avenue to investigate.

Currently I am thinking about an API similar to:

  • POST or PUT to /docs[/id] – to create/update a document
  • POST to /bulk_docs – to create/update a batch of documents
  • POST or PUT to /views – to create / update a view definition
  • GET from /docs/id – to get a document
  • GET from /views/name – to get all view docs of a particular view
  • GET from /views/name?key=foo – to get all view docs with matching key
  • GET from /views/name?start=foo&end=bar – to get all view docs within range

This seems to be a pretty simple proposition, and it seems pretty complete.
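To make the proposal a bit more concrete, here is a hedged sketch of what calling these endpoints might look like from C#; the host, port, document id and view name are all made up for illustration:

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class DocDbClientSketch
{
    static async Task Main()
    {
        var client = new HttpClient { BaseAddress = new Uri("http://localhost:8080/") };

        // PUT /docs/{id} - create or update a single document
        var doc = new StringContent("{ \"type\": \"post\", \"title\": \"Hello\" }",
                                    Encoding.UTF8, "application/json");
        await client.PutAsync("docs/posts-1", doc);

        // GET /docs/{id} - fetch the document back
        Console.WriteLine(await client.GetStringAsync("docs/posts-1"));

        // GET /views/{name}?key=foo - all view rows with a matching key
        Console.WriteLine(await client.GetStringAsync("views/comments_by_author?key=oren"));

        // GET /views/{name}?start=foo&end=bar - all view rows within a range
        Console.WriteLine(await client.GetStringAsync("views/comments_by_author?start=a&end=m"));
    }
}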

Thoughts?

time to read 2 min | 334 words

I was asked how we can handle more complex view scenarios. So let us take a look at how we can deal with them.

Joins

In a many to one scenario (post & comments), how can we get both of them in one call to the database? I am afraid that I am not doing anything new here, since the technique is actually described here. The answer is quite simple: you don’t. What you can do, however, is generate a view that will allow you to get both of them at the same time. For example:

[image: view definition that emits posts together with their comments]

The key here is that views are always calculated in sort order, so what is actually happening here is that we sort by post id, then by IsPost. Since false is higher than true, the actual post is always the first item, with the comments directly following it. This means that we can query for all of them in one DB call.
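The view definition image did not survive in this copy, but a rough reconstruction of the idea (all the field names are mine) would be a view that emits a row for the post and a row for each comment, keyed so that they sort together:

// not compilable on its own, just like the views discussed in this series
from doc in docs
where doc.type == "post" || doc.type == "comment"
select new
{
    PostId = doc.type == "post" ? doc.id : doc.post_id, // posts group with their comments
    IsPost = doc.type == "post",                        // secondary sort key: post first
    Item   = doc
};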

Returning more than a single result per row

To be fair, I hadn’t considered this, but it seems pretty obvious that it is needed. Here is the original request:

Question: can a view contain more rows than the underlying document database? For example: assume an invoice database (each document is an invoice with buyer's and seller's Tax ID). I want to create index: Tax ID -> #of invoices, where tax id can belong either to buyer or seller. In worst case scenario, unique tax IDs in every invoice, we'll have index with 2N entries. How view syntax would look like?

If I understand the problem correctly, this can be resolved using the following view definition:

[image: view definition that emits a row per buyer and seller tax id]
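That view definition is also an image that did not make it here; a rough sketch of what it might look like (field names mine) emits two rows per invoice, one for each tax id, and then reduces them down to counts:

// map: two rows per invoice document
from doc in docs
where doc.type == "invoice"
from taxId in new[] { doc.buyer_tax_id, doc.seller_tax_id }
select new { TaxId = taxId, Count = 1 };

// reduce: sum the counts per tax id
from result in results
group result by result.TaxId into g
select new { TaxId = g.Key, Count = g.Sum(x => x.Count) };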

Thoughts?

time to read 5 min | 973 words

The choice of using Linq queries as the default syntax was not an accident. If you look at how Couch DB is doing things, you can see that the choice of Javascript as the query language can cause some really irritating imperative style coding. For example, look at this piece of code:

function(doc) {
  if (doc.type == "comment") {
    map(doc.author, {post: doc.post, content: doc.content});
  }
}

This works, and it allows for some really complicated solutions, but it comes with its own set of problems. Unlike Couch DB, I actually want to enforce a schema for the views, and I need to be able to determine that schema at view creation time. This is partly because of the storage engine choice, and partly because the imperative style means that it is very easy to violate some of the behaviors that map/reduce requires, such as repeatability of the results (by querying a separate data source, for example).

Linq queries are not imperative; they are a good way of expressing set based logic in a really nice way, while still allowing an almost embarrassingly complex set of problems to be expressed with them. More than that, Linq queries are strongly typed, provide me with a whole bunch of information and allow me to do some really interesting things along the way, some of which we will talk about later. There is also the ease with which we could utilize such things as PLinq, the fact that the extensibility story for the DB becomes much easier with this scenario, and that, at least from a theoretical perspective, the performance that we are talking about here should be much better than a Javascript based solution.

Another property of Linq that I considered, much as I am loath to admit it in such a public forum, is the marketing aspect of it. A linq-driven database is sure to get a lot of attention; you only have to look at the number of comments on the previous posts in this topic, and compare those with linq queries to those without them. The difference is quite astounding.

All in all, that is an impressive number of reasons to go with Linq.

The problem, of course, is that Linq implies C#, and I don’t really think that C# is the best language for doing language oriented programming. This time, however, we have the major advantage that the domain concepts that we want are already built into the language, so we don’t really need a lot of tweaking here to get things working.

I posted about the syntax before, but I don’t think that a lot of people actually got what I meant. Here is the entire view definition:

[image: the view definition]

It is not a snippet, and it is not part of something larger that I am not showing. This is the view. And yes, it is not compilable on its own. Nor do I imagine that we will see people writing this code in Visual Studio. Or, at least, I imagine that it will be written there, but it will not stay there.
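The image itself is not preserved in this copy; my best guess at the view, based on the CouchDB function shown earlier, is simply the Linq equivalent of mapping comments by author:

from doc in docs
where doc.type == "comment"
select new { doc.author, doc.post, doc.content };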

Much like in Couch DB today, you are going to have to create the view on the server, and you do that by creating a specially named document, which will contain this syntax as its content.

Internally, we are going to do some interesting things to it, but I think that I will stop for now by just showing you the first stage: what happens to the view code after preprocessing it:

[image: the view code after preprocessing]

Readers of my book should recognize the pattern; I am using the notion of an Implicit Base Class here to get an executable class, which we can now compile and execute at will. Note that the query itself was modified to make it compilable. We can now proceed to do additional analysis of the actual query, generate the fixed schema out of it, and start doing the really interesting things that we want to do.
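Since the preprocessed version is also a lost image, here is a sketch of the Implicit Base Class idea with hypothetical names: the raw view text is pasted into a generated class whose base supplies the documents, which gives us something we can actually compile and run.

using System.Collections.Generic;
using System.Linq;

// hypothetical base class; the real one would also expose the fixed schema, etc.
public abstract class AbstractViewBase
{
    protected IEnumerable<dynamic> Docs { get; set; }
    public abstract IEnumerable<object> Execute();
}

// generated wrapper around the view text shown above
public class CommentsByAuthorView : AbstractViewBase
{
    public override IEnumerable<object> Execute()
    {
        return from doc in Docs
               where doc.type == "comment"
               select (object)new { doc.author, doc.post, doc.content };
    }
}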

But I had better leave those for another post…

time to read 4 min | 775 words

One of the more interesting problems with document databases is the views, and in particular, how we are going to implement views that contain aggregation. In my previous post, I discussed the way we will probably expose this to the users. But it turns out that there are significant challenges in actually implementing the feature itself, not just in the user visible parts.

For projection views, the actual problem is very simple: when a document is updated or removed, all we have to do is delete the old view item and create a new one, if applicable.

For aggregation views, the problem is much harder, mostly because it is not clear what the result of adding, updating or removing a document may be. As a reminder, here is how we plan on exposing aggregation views to the user:

[image: the aggregation view definition]

Let us inspect this from the point of view of the document database. Let us say that we have 100,000 documents already, and we introduce this view. A background process is going to kick off, to transform the documents using the view definition.

The process goes like this:

[image: serial map/reduce over all existing documents]

Note that the process depicted above is a serial process. This isn’t really useful in the real world. Let us see why. I want to add a new document to the system; how am I going to update the view? Well… an easy option would be this:

[image: re-running the full map/reduce on every document add]

I think you can agree with me that this is not a really good thing to do from a performance perspective. Luckily for us, there are other alternatives. A more accurate representation of the process would be:

[image: map/reduce running in parallel over separate batches]

We run the map/reduce process in parallel, producing a lot of separate reduced data points. Now we can do the following:

[image: re-reducing the partial results into the final result]

We take the independent reduced results and run a re-reduce process on them again. That is why we have the limitation that map & reduce must return objects in the same shape, so we can use reduce for data that came from map or from reduce, without caring where it came from.
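Here is a small, runnable toy of my own (counting comments per post) that shows why the same-shape rule makes re-reduce work: the reduce function can be applied to raw map output, to partial results, or to any mix of the two.

using System;
using System.Collections.Generic;
using System.Linq;

class ReReduceSketch
{
    record CommentCount(string PostId, int Count);

    // reduce: group by key and sum; input and output have the same shape
    static IEnumerable<CommentCount> Reduce(IEnumerable<CommentCount> items) =>
        from item in items
        group item by item.PostId into g
        select new CommentCount(g.Key, g.Sum(x => x.Count));

    static void Main()
    {
        // map output for two separate batches: one entry per comment, Count = 1
        var batch1 = new[] { new CommentCount("posts/1", 1), new CommentCount("posts/1", 1) };
        var batch2 = new[] { new CommentCount("posts/1", 1), new CommentCount("posts/2", 1) };

        // reduce each batch independently (these are the partial results we keep)...
        var partial1 = Reduce(batch1);
        var partial2 = Reduce(batch2);

        // ...and because the shape did not change, we can reduce the reduced results again
        var final = Reduce(partial1.Concat(partial2));

        foreach (var item in final)
            Console.WriteLine($"{item.PostId}: {item.Count}"); // posts/1: 3, posts/2: 1
    }
}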

This also means that adding a document is a much easier task; all we need to do is:

[image: mapping and reducing only the new document]

We get the single reduced result from the whole process, and now we can generate the final result very easily:

[image: reducing the existing final result with the new result]

All we have to do is run the reduce on the final result and the new result. The answer from that would be identical to the answer from running the full process on all the documents. Things get more interesting, however, when we talk about document update or document removal. Since an update is just a special case of atomic document removal and addition, I am going to talk about document removal only.

Removing a document invalidates the final aggregation results, but it doesn’t necessarily necessitate recalculating the whole thing from scratch. Do you remember the partial reduce results that we mentioned earlier? Those are not only useful for parallelizing the work, they are also very useful in this scenario. Instead of discarding them when we are done with them, we are going to save them as well. They wouldn’t be exposed to the user in any way, but they are persisted. They are going to be useful when we need to recalculate. The fun thing about them is that we don’t really need to recalculate everything. All we have to do is recalculate the batch that the removed document resided in, without that document. When we have the new batch, we can reduce the whole thing to a final result again.

I am guessing that this is going to be… a challenging task to build, but from a design perspective, it looks pretty straightforward.

time to read 2 min | 285 words

I said that I would speak a bit about aggregations. On the face of it, aggregation looks simple, really simple. Continuing the same thread of design from before, we can have:

[image: aggregation written as a single Linq query]

The problem is that while this is really nice, it doesn’t really work.

The problem is that using this approach, we are going to have to recalculate the view for the entire document set that we have, a potentially very expensive operation. Now, technically I can solve the problem by rewriting the Linq statement, but that wouldn’t really work either. While it is possible to do so, the following code assumes that it knows all the state, and there is no way to regenerate that state in an incremental fashion.

Let us try a better approach:

[image: the view split into map and reduce]

Thanks to Alex Yakunin for helping me simplify this.

What do we have now? We split the problem into two sections, the Map and the Reduce. Note that to simplify things, map and reduce must return objects in the same shape. That means that we don’t need an explicit re-reduce phase.

That is much easier to reason about, and it allows us to perform aggregation in a very easy manner, in a way that is simple to partition. I am probably going to have another post regarding the actual details of handling aggregations.
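Since the image with the revised view did not make it into this copy, here is a rough sketch (the names and the specific aggregation are mine) of what splitting the view into a map and a reduce might look like:

// map: one row per comment, already in the final shape
from doc in docs
where doc.type == "comment"
select new { PostId = doc.post_id, Count = 1 };

// reduce: same shape in, same shape out, so it can also run over its own output
from result in results
group result by result.PostId into g
select new { PostId = g.Key, Count = g.Sum(x => x.Count) };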
