Ayende @ Rahien

It's a girl

Implementing a document database: simple queries

And now this passes:

public class PerformingQueries
{
    const string query = @"
var pagesByTitle = 
from doc in docs
where doc.type == ""page""
select new { Key = doc.title, Value = doc.content, Size = (int)doc.size };
";

    [Fact]
    public void Can_query_json()
    {
        var serializer = new JsonSerializer();
        var docs = (JArray)serializer.Deserialize(
                new JsonTextReader(
                    new StringReader(
                        @"[
{'type':'page', title: 'hello', content: 'foobar', size: 2},
{'type':'page', title: 'there', content: 'foobar 2', size: 3},
{'type':'revision', size: 4}
]")));
        var compiled = new LinqTransformer(query, "docs", typeof(JsonDynamicObject)).Compile();
        var compiledQuery = (AbstractViewGenerator<JsonDynamicObject>)Activator.CreateInstance(compiled);
        var actual = compiledQuery.Execute(docs.Select(x => new JsonDynamicObject(x)))
            .Cast<object>().ToArray();
        var expected = new[]
        {
            "{ Key = hello, Value = foobar, Size = 2 }",
            "{ Key = there, Value = foobar 2, Size = 3 }"
        };

        Assert.Equal(expected.Length, actual.Length);
        for (var i = 0; i < expected.Length; i++)
        {
            Assert.Equal(expected[i], actual[i].ToString());
        }
    }
}

You wouldn’t believe how much effort it took, and all in all, implementing this is about 500 lines of code or so.

It depends on what you are optimizing for…

Today I was writing this code:

public class FakeRandomValueGenerator : IRandomValueGenerator
{
    private readonly int valueToReturn;

    public FakeRandomValueGenerator(int valueToReturn)
    {
        this.valueToReturn = valueToReturn;
    }

    public int Next(int min, int max)
    {
        return valueToReturn;
    }
}

This caused some concern to my pair, who asked me why I was hand rolling a stub instead of using a mocking framework. My answer was simple.

It will take me less time to write the class than it would take me to bring up the Add Reference dialog.

Challenge: probability based selection

Here is an interesting problem that I just ran into. I need to select a value from a (small) set based on percentage. That seems like it should be simple, but for some reason I can’t figure out an elegant way of doing this.

Here is my current solution:

var chances = new Page[100];
int index = 0;
foreach (var page in pages)
{
    for (int i = index; i < index + page.PercentageToShow; i++)
    {
        chances[i] = page;
    }
    index += page.PercentageToShow;
}
return chances[new Random().Next(0, 100)];

This satisfies the requirement, but it is… not as elegant as I would wish it to be.

I may have N values, for small N. There isn’t any limitation on the percentage allocation, so we may have (50%, 10%, 12%, 28%). We are assured that the numbers will always sum to 100.
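
For comparison, here is a sketch of the cumulative-total approach (same Page / PercentageToShow shape as above), which drops the 100-slot table:

// Pick a page by walking the cumulative percentages, no lookup table needed
var random = new Random();
int roll = random.Next(0, 100); // 0..99
int cumulative = 0;
foreach (var page in pages)
{
    cumulative += page.PercentageToShow;
    if (roll < cumulative)
        return page;
}
throw new InvalidOperationException("Percentages did not sum to 100");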

Microsoft Connect - FAIL (yet again)

I really think that Microsoft should close Connect. Because it is pretty obvious that they aren't managing that properly.

Let us take a look at yet another interesting bug report. This is related to a bug in System.Data that makes using System.Transactions even trickier than you would initially believe.

It was acknowledged as a bug by Miguel Gasca (from Microsoft), and a Connect issue was reported.

That was in 2007(!). It was resolved, a month later, by "Microsoft", as an "External" issue.

That bug is still here today, two years later, and still impacting customers. That is after a full release and SP1. The situation is that now I have to work around this bug because Microsoft cannot manage its own bugs database in a way that would allow it to... oh, I don't know, fix bugs!

FAIL

And you know what, I wouldn't be annoyed with this if this wasn't an ongoing pattern with Connect.

FAIL

M is to DSL as Drag & Drop is to programming

image

I have been sitting on this post for a while now, because that was my first impression after seeing Oslo & M and all the hype around it from the PDC. To be frank, I had a hard time believing my own gut feeling. I kept having the feeling that I am missing something, which is why I avoided talking about this so far.

But, as time passed, and as we started to see more and more about Oslo and M, it validated my initial thinking. Now, just to be clear, I don’t intend to even touch on the whole of Oslo in this post. I don’t have a problem stating that I still don’t see the whole point there, but that is beside the point (no pun intended). What I would like to talk about is the M language, its usage, and the DSL that Microsoft shows as samples.

I see M as a whole lot of effort put into optimizing something that is really not that interesting, complex or hard. I look at the M language, the way that you work with it, the tooling and the API, and I would fully agree that it is a nice parser generator.

What it is not, I have to say, is a DSL toolkit. It is just one small part of building a DSL. And, to be perfectly honest, M is the drag & drop of DSLs. It looks good at first glance, but then you dig just a little deeper, see what is actually going on, and realize that you are probably not where you wanted to be. I see it as trying very hard to optimize opening the car’s door. While I assume that somewhere in the world there is someone for whom optimizing the opening of a car door is crucial, I don’t really see it as an important feature. More to the point, it has negligible effect on the time taken for the primary task for which we use a car, the actual driving!

Why am I saying that?

Well, M is used for defining the syntax of the language, which is what most people look at. It does a good job there, but it also stops there. And there is a lot of stuff other than the syntax that you really care about.

Here is a snippet from MisBehave, which was an attempt to build a BDD framework on top of M:

image

Pretty impressive syntax, right?

The problem is that there isn’t really a good way to take this and translate it into something that is executable. Not without doing a lot of work. And that is why I am saying that M isn’t really an important piece of the stack. The actual syntax definition isn’t really that important. It is all the other things that you do with the DSL that matter.

Let us take a look at MUrl:

image

I am looking at this, and after looking at the source code, I still can’t figure out the point.

Yes, this is a demo DSL. But it is a good example that shows how you can completely miss the point with regards to a DSL. What problem does this DSL solve? What benefits do I get from integrating that into my system?

How does this help me solve a real problem?

The answer is that it doesn’t. The only remotely useful case that I can think of is if I really want to be able to issue REST calls from the command line, and even then, there are better ways of doing that on the command line. MUrl is an exercise in abstraction for the sake of abstraction. More than that, it gives the impression that it is something that it is not.

If you want to show me a DSL, show me one that has logic, not one that is a glorified serialization format. That is the sweet spot for a DSL: to extract policy decisions from your systems, so you can work with them at a higher level and have an easier time making changes.

M is not a language for creating DSL. It is a language to define a serialization format, that is all.

I don’t do errors, and my name isn’t Google

image

For some reason, a lot of people come to me with a phrase that is like fingernails on a blackboard to me: “I have an error".

I try hard to be polite, so I usually don’t say what comes to mind immediately. Usually a variation on any of the following:

  • Good for you
  • That is nice
  • So do I, let’s get them together and see if we can breed them
  • And what do you want me to do about it?
  • ARGH!

The main problem is that this is usually all that is said. So it is a declaration of totally useless information.

Something that I find almost as offensive is asking me for information that is easily findable in Google.

Here is an example that demonstrates both of those issues:

image

I am not going to actually comment further. Suffice to say that this post is my way of taking out the frustration of having to do this yet again. And don’t get too gung ho about the actual example, that isn’t actually that important.

I wonder if I can get a title of Chief Google Searcher or something like that.

MEF & Open Generic Types

I read Glenn's post about MEF not supporting open generic types with something resembling shock. The idea that it isn't supporting this never even crossed my mind; it was a given that this is a mandatory feature for any container in the .NET land.

Just to give you an idea, what this means is that you can't register Repository&lt;T&gt; and then resolve Repository&lt;Order&gt;. In 2006, I wrote an article for MSDN detailing what has since become a very common use of this pattern. Generic specialization is not something that I would consider optional, it is one of the most common usage patterns of containers in the .NET land. IRepository&lt;T&gt; is probably the most common example that people quote, but there are others as well.
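
To make the scenario concrete, here is roughly what registering an open generic and resolving a closed one looks like, in Castle Windsor style registration (an illustrative sketch; IRepository&lt;T&gt;, Repository&lt;T&gt; and Order are stand-in types):

using Castle.MicroKernel.Registration;
using Castle.Windsor;

var container = new WindsorContainer();

// Register the open generic type once...
container.Register(
    Component.For(typeof(IRepository<>)).ImplementedBy(typeof(Repository<>)));

// ...and resolve any closed version of it.
IRepository<Order> orders = container.Resolve<IRepository<Order>>();

public interface IRepository<T> { void Save(T item); }
public class Repository<T> : IRepository<T> { public void Save(T item) { } }
public class Order { }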

This is not a simple feature, let me make it clear. Not simple at all. I should know, I implemented that feature for both Object Builder and Windsor. But it is not what I would consider an optional one.

I am even more disturbed by the actual reasoning behind not supporting this. It is a technical limitation of MEF, because internally all components are resolved by string matching, rather than by CLR types. This decision severely limits the things that MEF can do. Not supporting what is (in my opinion) a pretty crucial feature is one example of that, but there are other implications. It means that you can't really do things like polymorphic resolution, and that your choices in extending the container are very limited, because the container isn't going to carry the information that is required to make those decisions.

I would advise the MEF team to rethink the decision to base the component resolution on strings. At this point in time, it is still possible to change things (and yes, I know it isn't as easy as I make it seem), because not supporting open generic types is bad, but not having the ability to do so, and the reason for that (not keeping CLR type information), are even worse. I get that MEF needs to work with DLR objects as well, but that means that MEF makes the decision to have lousier support for CLR idioms for the benefit of the DLR.

Considering the usage numbers for both of them, I can't see this being a good decision. It is certainly possible to support them both, but if there are any tradeoffs that have to be made, I would suggest that it should be the DLR, and not the CLR, that gets the second class role.

NH-1711 – Inappropriate error handling with NH 2.1 Alpha 1 when distributed transaction fails can cause application crashes

The actual problem has been fixed and it will be part of NH 2.1 Alpha 2. That is why we call them alphas, after all :-)

The actual bug is a pretty convoluted mess, to be frank. And it is no wonder that it slipped by me. Yes, I am the one responsible for that, so I guess I am making excuses. Let me tell you about the actual scenario. When you are using NHibernate 2.1 Alpha 1 (it does not affect NHibernate 2.0 or 2.0.1) with a System.Transactions.Transaction, there is a slightly different code path that we have to go through, because the actual work that has to be done is no longer controlled by NHibernate, but by the DTC infrastructure.

So far, so good. The problem is that most of the time, this is done on the same thread as the application that we are running on. There are cases, specifically when using a DTC with multiple durable enlistments, where the actual work is done on a worker thread. The problem is that if there is an error during that phase, for example, if we are trying to execute an invalid command, or run into a transaction deadlock, NHibernate wouldn’t properly handle this error, and it would bubble up. The result of an unhandled exception on a worker thread is, of course, an application crash.

That is considered to be a bad thing, I understand, so after being able to isolate the problem, I went ahead and fixed this. You can get the trunk now and get the fix, or wait until Alpha 2 is released.

Who is impacted by this?

You have to use multiple different durable enlistments inside a distributed transaction for the error condition to even be applicable. The problem is that there is one very common scenario that will run into this every single time. The .NET service buses all wrap their processing in a TransactionScope, and then tend to have multiple durable enlistments (the DB and MSMQ). This means that if you are using NServiceBus, Rhino ServiceBus or MassTransit alongside NHibernate 2.1 Alpha 1, you are probably impacted by this issue.
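
To illustrate the shape of the impacted scenario, here is a rough sketch of a message handler that ends up with two durable resources (a database and MSMQ) enlisted in one distributed transaction; the sessionFactory, the Order type and the queue path are made up for the example:

// requires System.Transactions, System.Messaging and NHibernate references
using (var scope = new TransactionScope())
{
    using (var session = sessionFactory.OpenSession())
    using (var tx = session.BeginTransaction())
    {
        session.Save(new Order { /* ... */ });
        tx.Commit();
    }

    // A transactional MSMQ send adds a second durable enlistment,
    // which promotes the transaction to the DTC.
    using (var queue = new MessageQueue(@".\private$\orders"))
    {
        queue.Send("order accepted", MessageQueueTransactionType.Automatic);
    }

    scope.Complete();
}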

As I mentioned, a fix has already been committed (r4149) and it will be part of NHibernate 2.1 Alpha 2.

3rd party integration assumption - other system was written by a drunken monkey typing with his feet

image If you have heard me speak, you are probably aware that I tend to use this analogy a lot. Any 3rd party system that I have to integrate with was written by a drunken monkey typing with his feet.

So far, I am sad to say, that assumption has been quite accurate over a large range of projects.

You are probably already familiar with concepts of System Boundary and Anti Corruption Layer and how to apply them in order to keep the crazy monkey shtick away from your system, so I am not going to talk about those.

What I do want to talk about now is something slightly different: it is not related to the actual 3rd party system itself, but to its management.

One of the more annoying things about 3rd party stuff is that it is usually broken in… interesting ways. So you really want to be able to test that thing during the integration phase, so you would know what to expect. That seems like a very simple concept, right? All you have to do is hit the QA env. and have a ball. That brings us back to system management, and the really big screw ups that happen there.

Before we continue, I want to state that when I am integrating with a 3rd party, I usually do so out of some business reason, so it tends to be the case that I want to give them money, customers, pay per operation, or something else that is to the benefit of that 3rd party! Making integration with your software hard has a direct effect on the number of people who are trying to give you money!

That is why I am so surprised by the amount of trouble that you have to go through in some organizations (I would say, the majority of organizations). Here is a partial list of things that I ran into recently:

  • Having no QA env. – We were told to basically just work off of the spec and push it to production. You can imagine how big a success that was.
  • Having a QA env. that was a different version than the one in production, and not telling the people that are integrating with you anything about that.
  • A variation of the above: having a QA env. that is significantly different from production.
  • Having a QA env. that has real world effects. For example, if I am testing my bulk mail integration, I do not expect all those emails to be actually sent! Or, in another memorable case, if I am integrating with a merchant provider and testing authorization, I do not want to see those in my real credit report!
  • Having a QA env. that requires frequent human intervention. For example:
    • In order to validate that your integration has been successful, you have to call someone and wait until they verify that yes, the appropriate values are there in the appropriate system. Each and every time.
    • The integration is a one way operation (imagine something like CreateUser, which would fail if it already exists), and you cannot use any dummy values (imagine that you need to pass a valid credit card to that function), you have to have real ones. So every time you test the integration, you have to call someone and have them reset that information.
  • Having a QA env. that is down for two weeks just as we were supposed to test the integration.

That is why I am saying that the system management is such a crucial thing. And why I am so surprised and disappointed to see so many organizations get it wrong in ways that are so not funny.

If you are building a system that people will integrate with, consider, as soon as possible, the implications of not having a good testing environment for your system. As a matter of fact, I suggest building QA hooks from day one, so you can pass a flag to the system that would tell it “this is only for tests”, which would mean that any external action on the system is skipped, but all the logic and behavior are retained.
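
A minimal sketch of the kind of QA hook I am talking about; the IEmailSender / testMode names are made up for illustration:

public interface IEmailSender
{
    void Send(string to, string subject, string body);
}

public class EmailSender : IEmailSender
{
    private readonly bool testMode;

    public EmailSender(bool testMode)
    {
        this.testMode = testMode;
    }

    public void Send(string to, string subject, string body)
    {
        // All the usual validation, templating and auditing still runs here.

        if (testMode)
            return; // skip only the irreversible external action

        // actually hand the message off to the mail server...
    }
}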

Looking for .NET developers in Fort Lee, NJ area

You probably know the drill by now: my client needs a few good .NET developers in the Fort Lee, NJ area. This is for a full time employee position, working on some interesting projects.

If you are interested, please shoot me an email.

NHibernate: Avoid identity generator when possible

Before I start, I wanted to explain that NHibernate fully supports the identity generator, and you can work with it easily and without pain.

There are, however, implications of using the identity generator in your system. Tuna does a great job in detailing them. The most common issue that you’ll run into is that identity breaks the notion of unit of work. When we use identity, we have to issue the insert to the database as soon as we save the entity, in order to get the generated value, instead of deferring it to a later time. It also renders batching useless.

And, just to put some additional icing on the cake: on SQL 2005 and SQL 2008, identity is broken.

I know that “select ain’t broken” most of the time, but this time, it appears it is :-)

We strongly recommend using some other generator strategy, such as guid.comb (similar to new sequential id) or HiLo (which also generates human readable values).
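
To make the unit of work point concrete, here is a rough sketch of the difference, assuming a typical NHibernate session (the Order entity is just an example):

using (var session = sessionFactory.OpenSession())
using (var tx = session.BeginTransaction())
{
    // With the identity generator, this Save() has to hit the database
    // immediately, because the id only exists once the row does.
    // With hilo or guid.comb, the id is generated in memory and the
    // INSERT is deferred (and can be batched) until the session flushes.
    session.Save(new Order { Customer = "ayende" });

    tx.Commit(); // the deferred strategies write everything out here
}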

How NOT to write network code

It is surprising how many bombs you have waiting for you at a low enough level. Here is another piece of problematic code from the same codebase:

// Read the name.
var _buffer = new byte[_len];
int _numRead;
try {
    _numRead = stream.Read(_buffer, 0, _buffer.Length);
}
catch (Exception _ex) {
    logger.Log(string.Format("Conn: read(): {0}", _ex));
    throw new SpreadException(string.Format("read(): {0}", _ex));
}

// Check for not enough data.
if (_numRead != _len) {
    logger.Log("Conn: Connection closed during connect attempt");
    throw new SpreadException("Connection closed during connect attempt");
}

Can you spot the problem?
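
The catch is that Stream.Read may return fewer bytes than requested, even when more data is still on its way, so a single Read call is not enough. A sketch of the usual fix, looping until the buffer is full or the stream is closed:

var _buffer = new byte[_len];
int _totalRead = 0;
while (_totalRead < _len)
{
    int _numRead = stream.Read(_buffer, _totalRead, _len - _totalRead);
    if (_numRead == 0)
        throw new SpreadException("Connection closed during connect attempt");
    _totalRead += _numRead;
}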

How NOT to write a logger

I am going over some library code at the moment, and I am following my usual routine of starting at the basic infrastructure to find the fault lines. Take a look at this:

public void Log(string pSource, string pMessage, EventLogEntryType pEntryType) {
    try {
        if (!EventLog.SourceExists(pSource)) {
            EventLog.CreateEventSource(pSource, "Application");
        }

        EventLog.WriteEntry(pSource, pMessage, pEntryType);
    }
    catch (Exception _ex) {
        Log("", _ex.ToString(), EventLogEntryType.Error);
    }
}

This is… impressive.

Designing a document database: What next?

So far I have posted quite a few posts about building the document database. To be frank, the reason that I did this is because the idea has been bouncing around in my head a lot recently, and sitting down and actually thinking about it has been great, especially since now I have the design dancing in my head, shiny & beautiful. Here is the full list, in case you missed anything:

  1. Schema-less databases
  2. Designing a document database
  3. Designing a document database: Storage
  4. Designing a document database: Scale
  5. Designing a document database: Authorization
  6. Designing a document database: Concurrency
  7. Designing a document database: Attachments
  8. Designing a document database: Replication
  9. Designing a document database: Views
  10. Designing a document database: Aggregation
  11. Challenge: C# Rewriting
  12. Designing a document database: Aggregation Recalculating
  13. Designing a document database: View syntax
  14. Designing a document database: Remote API & Public API

A few days ago I asked on twitter what people think: do I already have this written, or not? Opinions seem to be divided on this score. Let me try to set the record straight. I have a lot of scattered code around this, yes. But it is not a project, it is a lot of tiny experiments to prove that one approach or the other would work. This series of posts has required a lot of research. But I don’t have anything that is even remotely close to a working system.

I am estimating that it would take a month or two to take this from the drawing board to something that I would be willing to use in production*. That is full time work, by the way. It is likely that I can get something usable faster than that, depending on your definition of usable :-). Most of the challenge is going to be in implementing the views, as I see it now. Everything else seems to be pretty straightforward.

That is somewhat of a problem. I don’t really want to spend several months (and the associated support costs afterward) to build an open source project. The main issue is that while it is fun, there is simply no money in it, and I heard that eating is mandatory. On the other hand, I don’t really see something like that selling as a commercial package. This is infrastructure, and infrastructure has been commoditized. The ideal solution from my point of view is what we tried to do with Linq to NHibernate. Getting a company, or several companies, to sponsor its development as an OSS project.

The motivation would be the same as usual, this is something that the aforementioned companies need, and are willing to pay for. It didn’t end up the way I expected it with Linq for NHibernate, but it ended up very well after all, so I am happy about that.

Oh, and as an aside, if you want more posts in this series, do suggest a few topics that you want to hear about.

* Just to give you an idea about the complexity involved, I estimated Linq to NHibernate to be about 3 months.

Designing a document database: Remote API & Public API

One of the greatest challenges that we face when we try to design an API is balancing power, flexibility, complexity, extensibility and version tolerance. There is a set of tradeoffs that we have to make, and selecting the right ones impacts the way that users will work with the software.

One of the decisions that I made about the docs db is that it would actually have two published APIs. The first would be the remote API, accessible to clients, and the second would be a public API, accessible to people who want to extend the db. I consider extending the db to be a very common operation, for doing things like adding more functions for the views, supplying the sharding function, etc. I am assuming that most people who would want to do this extension would be .Net devs, so I am just going to use my usual style here and expose some extension points so it will be possible to mix & match things.

For the remote API, we need a way to add / remove documents, get a document and query a view. I would like that part of the API to be accessible from any language, even if I don’t foresee it getting used much in other languages. The idea of being able to access it from JavaScript, and making the entire application hosted through the DB, is actually a very interesting idea; see CouchApp for more details. Making it accessible for everyone is a pretty good idea, and I think that adopting a REST API similar to the one that CouchDB is using would be a good choice. Something that I would be extremely interested in, as a matter of fact, would be to make use of Astoria’s REST API, instead of having to write my own implementation. I think it bears some investigating, since on the minus side there is the problem that Astoria expects some sort of schema from the backend implementation, but it would be an interesting avenue to investigate.

Currently I am thinking about an API similar to:

  • POST or PUT to /docs[/id] – to create/update a document
  • POST to /bulk_docs – to create/update a batch of documents
  • POST or PUT to /views – to create / update a view definition
  • GET from /docs/id – to get a document
  • GET from /views/name – to get all view docs of a particular view
  • GET from /views/name?key=foo – to get all view docs with matching key
  • GET from /views/name?start=foo&end=bar – to get all view docs within range

This seems to be a pretty simple proposition, and it looks pretty complete.
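
Just to show how a client might talk to such an API, here is a sketch using WebClient against the URL layout above (the host, port, document id and view name are made up):

using System.Net;
using System.Text;

var client = new WebClient { Encoding = Encoding.UTF8 };
var baseUrl = "http://localhost:8080";

// Create (or update) a document.
client.Headers[HttpRequestHeader.ContentType] = "application/json";
client.UploadString(baseUrl + "/docs/hello-page", "PUT",
    "{'type':'page', 'title':'hello', 'content':'foobar', 'size':2}");

// Get the document back.
var doc = client.DownloadString(baseUrl + "/docs/hello-page");

// Query a view, by key and by range.
var byKey = client.DownloadString(baseUrl + "/views/pagesByTitle?key=hello");
var byRange = client.DownloadString(baseUrl + "/views/pagesByTitle?start=a&end=m");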

Thoughts?

Designing a document database: Looking at views

I was asked how we can handle more complex views scenarios. So let us take a look at how we can deal with them.

Joins

In a many to one scenario (post & comments), how can we get both of them in one call to the database? I am afraid that I am not doing anything new here, since the technique is actually described here. The answer to that is quite simple: you don’t. What you can do, however, is generate a view that will allow you to get both of them at the same time. For example:

image

The key here is that views are always calculated in sort order, so what is actually happening here is that we sort by post id, then by IsPost. Since false is higher than true, the actual post is always the first item, with the comments directly following that. This means that we can query for all of them in one DB call.
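
The view definition itself is in the image above; purely as a hypothetical reconstruction, using the Linq view style from the earlier posts and made-up field names (id, post_id, content), it would be something along these lines:

var commentsByPost =
    from doc in docs
    where doc.type == "post" || doc.type == "comment"
    // sorting by PostId and then IsPost groups each post with its comments
    select new
    {
        PostId = doc.type == "post" ? doc.id : doc.post_id,
        IsPost = doc.type == "post",
        Content = doc.content
    };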

Returning more than a single result per row

To be fair, I haven’t considered this, but it seems pretty obvious that this is needed. Here is the original request:

Question: can a view contain more rows than the underlying document database? For example: assume an invoice database (each document is an invoice with buyer's and seller's Tax ID). I want to create index: Tax ID -> #of invoices, where tax id can belong either to buyer or seller. In worst case scenario, unique tax IDs in every invoice, we'll have index with 2N entries. How view syntax would look like?

If I understand the problem correctly, this can be resolved using the following view definition:

image
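
Again, the actual definition is in the image above; a hypothetical sketch of the idea, in the same not-quite-compilable view syntax used throughout this series (docs and results are the implicit inputs, field names are guessed), is to emit one row per tax id and then reduce by counting:

map = from doc in docs
      where doc.type == "invoice"
      from taxId in new[] { doc.buyer_tax_id, doc.seller_tax_id }
      select new { TaxId = taxId, Count = 1 };

reduce = from result in results
         group result by result.TaxId into g
         select new { TaxId = g.Key, Count = g.Sum(x => x.Count) };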

Thoughts?

Designing a document database: View syntax

The choice of using Linq queries as the default syntax was not an accident. If you look at how Couch DB is doing things, you can see that the choice of Javascript as the query language can cause some really irritating imperative style coding. For example, look at this piece of code:

function(doc) {
  if (doc.type == "comment") {
    map(doc.author, {post: doc.post, content: doc.content});
  }
}

This works, and it allows for some really complicated solutions, but it comes with its own set of problems. Unlike Couch DB, I actually want to enforce a schema for the views, and I need to be able to know that schema at view creation time. This is partly because of the storage engine choice, and partly because the imperative style means that it is very easy to violate some of the behaviors that map reduce requires, such as repeatability of the results (by querying a separate data source, for example).

Linq queries are not imperative; they are a really nice way of expressing set based logic, while still allowing an almost embarrassingly complex set of problems to be expressed with them. More than that, Linq queries are strongly typed, provide me with a whole bunch of information and allow me to do some really interesting things along the way, some of which we will talk about later. There is also the issue of how easy it would be to utilize such things as PLinq, or that the extensibility story for the DB becomes much easier with this scenario, or that, at least from a theoretical perspective, the performance that we are talking about here should be much better than a Javascript based solution.

Another property of Linq that I considered, much as I am loath to admit it in such a public forum, is the marketing aspect of it. A linq-driven database is sure to get a lot of attention; you only have to look at the number of comments on the previous posts on this topic, and compare those with linq queries to those without. The difference is quite astounding.

All in all, it sounds like an impressive number of reasons to go with Linq.

The problem, of course, is that Linq implies C#, and I don’t really think that C# is the best language for doing language oriented programming. This time, however, we have the major advantage that the domain concepts that we want are already built into the language, so we don’t really need a lot of tweaking here to get things exciting.

I posted about the syntax before, but I don’t think that a lot of people actually got what I meant. Here is the entire view definition:

image 

It is not a snippet, and it is not a part of something larger that I am not showing. This is the view. And yes, it is not compilable on its own. Nor do I imagine that we will see people writing this code in Visual Studio. Or, at least, I imagine that it will be written there, but it will not stay there.

Much like in Couch DB today, you are going to have to create the view on the server, and you do that by creating a specially named document, which will contain this syntax as its content.

Internally, we are going to do some interesting things to it, but I think that I can stop now by just showing you the first stage, what happens to the view code after preprocessing it:

image

Readers of my book should recognize the pattern: I am using the notion of Implicit Base Class here to get us an executable class, which we can now compile and execute at will. Note that the query itself was modified, to make it compilable. We can now proceed to do additional analysis of the actual query, generate the fixed schema out of it, and start doing the really interesting things that we want to do.
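
The preprocessed output is in the image above; just to illustrate the Implicit Base Class idea, and borrowing the AbstractViewGenerator&lt;JsonDynamicObject&gt; type from the first post in this series, the generated class would look roughly like this (the class name and the property it assigns are my own guesses):

public class PagesByTitleView : AbstractViewGenerator<JsonDynamicObject>
{
    public PagesByTitleView()
    {
        // The user's view query, rewritten so that it compiles: the free-standing
        // 'docs' identifier becomes a parameter supplied by the base class.
        MapDefinition = docs =>
            from doc in docs
            where doc.type == "page"
            select new { Key = doc.title, Value = doc.content, Size = (int)doc.size };
    }
}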

But I had better leave those for another post…

Designing a document database: Aggregation Recalculating

One of the more interesting problems with document databases is the views, and in particular, how we are going to implement views that contain aggregation. In my previous post, I discussed the way we will probably expose this to the users. But it turns out that there are significant challenges in actually implementing the feature itself, not just in the user visible parts.

For projection views, the actual problem is very simple: when a document is updated or removed, all we have to do is delete the old view item, and create a new one, if applicable.

For aggregation views, the problem is much harder, mostly because it is not clear what the result of adding, updating or removing a document may be. As a reminder, here is how we plan on exposing aggregation views to the user:

image

Let us inspect this from the point of view of the document database. Let us say that we have 100,000 documents already, and we introduce this view. A background process is going to kick off, to transform the documents using the view definition.

The process goes like this:

image

Note that the process depicted above is a serial process. This isn’t really useful in the real world. Let us see why. I want to add a new document to the system, how am I going to update the view? Well… an easy option would be this:

image

I think you can agree with me that this is not a really good thing to do from a performance perspective. Luckily for us, there are other alternatives. A more accurate representation of the process would be:

image

We run the map/reduce process in parallel, producing a lot of separate reduced data points. Now we can do the following:

image

We take the independent reduced results and run a re-reduce process on them again. That is why we have the limitation that map & reduce must return objects in the same shape, so we can use reduce for data that came from map or from reduce, without caring where it came from.
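
To make the same-shape requirement concrete, here is a small standalone example (counting comments per post, with made-up data) showing that re-reducing the partial results gives the same answer as reducing everything in one go:

using System;
using System.Collections.Generic;
using System.Linq;

// The shape shared by map output and reduce output.
record CommentCount(int PostId, int Count);

class ReReduceDemo
{
    // reduce: group by key and sum; note that the output shape equals the input shape
    static IEnumerable<CommentCount> Reduce(IEnumerable<CommentCount> items) =>
        items.GroupBy(x => x.PostId)
             .Select(g => new CommentCount(g.Key, g.Sum(x => x.Count)));

    static void Main()
    {
        var mapped = new[]
        {
            new CommentCount(1, 1), new CommentCount(1, 1),
            new CommentCount(2, 1), new CommentCount(1, 1),
        };

        // Reduce two batches independently, then re-reduce the partial results.
        var batch1 = Reduce(mapped.Take(2));
        var batch2 = Reduce(mapped.Skip(2));
        var reReduced = Reduce(batch1.Concat(batch2));

        // Identical to reducing all the documents at once.
        var allAtOnce = Reduce(mapped);
        Console.WriteLine(reReduced.SequenceEqual(allAtOnce)); // True
    }
}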

This also means that adding a document is a much easier task, all we need to do is:

image

We get the single reduced result from the whole process, and now we can generate the final result very easily:

image

All we have to do is run the reduce on the final result and the new result. The answer from that would be identical to the answer from running the full process on all the documents. Things get more interesting, however, when we talk about document update or document removal. Since update is just a special case of atomic document removal and addition, I am going to talk about document removal only, in this case.

Removing a document invalidates the final aggregation results, but it doesn’t necessarily require recalculating the whole thing from scratch. Do you remember the partial reduce results that we mentioned earlier? Those are not only useful for parallelizing the work, they are also very useful in this scenario. Instead of discarding them when we are done with them, we are going to save them as well. They wouldn’t be exposed to the user in any way, but they are persisted. They are going to be useful when we need to recalculate. The fun thing about them is that we don’t really need to recalculate everything. All we have to do is recalculate the batch that the removed document resided in, without that document. When we have the new batch, we can now reduce the whole thing to a final result again.

I am guessing that this is going to be… a challenging task to build, but from a design perspective, it looks pretty straightforward.

Designing a document database: Aggregation

I said that I would speak a bit about aggregations. On the face of it, aggregation looks simple, really simple. Continuing the same thread of design from before, we can have:

image

The problem is that while this is really nice, it doesn’t really work.

The problem is that using this approach, we are going to have to recalculate the view for the entire document set that we have, a potentially very expensive operation. Now, technically I could solve the problem by rewriting the Linq statement, but that wouldn’t really work either, because code like that assumes that it knows all the state, and there is no way to regenerate that state in an incremental fashion.

Let us try a better approach:

image

Thanks to Alex Yakunin for helping me simplify this.

What do we have now? We split the problem into two sections, the Map and the Reduce. Note that to simplify things, map and reduce must return objects in the same shape. That means that we don’t need an explicit re-reduce phase.
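
The actual definition is in the image above; a hypothetical sketch of such a split, in the same not-quite-compilable view syntax used throughout this series (counting comments per author, with guessed field names), looks like:

map = from doc in docs
      where doc.type == "comment"
      select new { Author = doc.author, Count = 1 };

reduce = from result in results
         group result by result.Author into g
         select new { Author = g.Key, Count = g.Sum(x => x.Count) };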

That is much easier to reason about, and it allows us to perform aggregation in a manner that is simple to partition. I am probably going to have another post regarding the actual details of handling aggregations.

Challenge: C# Rewriting

Here is an interesting one. Can you write code that would take the first piece of text and would turn it into the second piece of text?

First (not compiling):

image

Second (compiling):

image

Hint, you can use NRefactory to do the C# parsing.

I don’t speak Russian, but I do speak code

Here is a snippet from a blog post describing a lecture I gave yesterday at the Ural State University:

Imagine him [Ayende] giving a public lecture at the Ural State University and demonstrating one of the numerous code snippets he prepared. Suddenly a guy (I think his name is Alex) interrupts him and tries to point out an error in the code. Unfortunately, Alex fails to express himself in English, and instead just mumbles incomprehensibly. After two more attempts, he gives up and explains the bug to the audience — in Russian. But before anyone has a chance to translate it, Oren smiles and says: “Oh right! You are absolutely correct, I have to insert a break statement here!”. Now, granted, Oren is a great talker and has no problem understanding people; but that was unbelievable even for him, because I can swear that: 1) he doesn't know a single word of Russian, 2) the guy who spotted the problem didn't use a single word of English (like break or foreach). Truth to be told, the whole situation was even scary a bit.

There was a roar of laughter in the audience when I did that, and it took me a while to understand why.

Designing a document database: Views

One of the more interesting problems with document databases is how you handle views. But a lot of people already had some issues with understanding what I mean by a document database (hint: I am not talking about a Word docs repository), so I had better explain what I mean by this.

A document database stores documents. Those aren’t what most people would consider a document, however. They are not Excel or Word files. Rather, we are talking about storing data in a well known format, but with no schema. Consider the case of storing an XML document or a Json document. In both cases, we have a well known format, but there is no required schema for those. That is, after all, one of the advantages of a document db’s schema less nature.

However, trying to query on top of schema less data can be… problematic. Unless you are talking about Lucene, which I would consider to be a document indexer rather than a document DB, although it can be used as such. Even with Lucene, you have to specify the things that you are actually interested in to be able to search on them.

So, what are views? Views are a way to transform a document to some well known and well defined format. For example, let us say that I want to use my DB to store wiki information. I can do this easily enough by storing the document as a whole, but how do I look up a page by its title? Trying to do this on the fly is a recipe for disastrous performance. In most document databases, the answer is to create a view. For RDBMS people, a document DB view is roughly what an RDBMS would call a materialized view.

I thought about creating it like this:

image

Please note that this is only to demonstrate the concept; actually implementing the above syntax requires either on the fly rewrites or C# 4.0.

The code above can scan through the relevant documents, and in a very clean fashion (I think), generate the values that we actually care about. Basically, we have now created a view called “pagesByTitleAndVersion”, indexed by title (ascending) and version (descending). We can now query this view for a particular value, and get it in a very quick manner.
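
The actual definition is in the image above; going by that description, and reusing the Linq view style from the “simple queries” post, it would look something like this (purely illustrative, field names guessed):

var pagesByTitleAndVersion =
    from doc in docs
    where doc.type == "page"
    orderby doc.title ascending, doc.version descending
    select new { Title = doc.title, Version = doc.version, Content = doc.content };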

Note that this means that updating views happens as part of a background process, so there is going to be some delay between updating the document and updating the view. That is BASE for you :-)

Another important thing is that this syntax is for projections only. Those are actually very simple to build. Well, simple is relative, there is going to be some very funky Linq stuff going on in there, but from my perspective, it is fairly straightforward. The part that is going to be much harder to deal with is aggregation. I am going to deal with that separately, however.

Designing a document database: Replication

In a previous post, I asked about designing a document DB, and brought up the issue of replication, along with a set of questions that affect the design of the system:

  • How often should we replicate?
    • As part of the transaction?
    • Backend process?
    • Every X amount of time?
    • Manual?

I think that we can assume that the faster we replicate, the better it is. However, there are costs associated with this. I think that a good way of doing replication would be to post a message on a queue for the remote replication machine, and have the queuing system handle the actual process. This makes it very simple to scale, and creates a distinction between the “start replication” part and the actual replication process. It also allows us to handle spikes in a very nice manner.

  • Should we replicate only the documents?
    • What about attachments?
    • What about the generated view data?

We don’t replicate attachments, since those are out of scope.

Generated view data is a more complex issue, mostly because we have a trade off here of network payload vs. CPU time. Since views are by their very nature stateless (they can only use the document data), running the view on the source machine or the replicated machine would result in exactly the same output. I think that we can safely ignore the view data, treating it as something that we can regenerate. CPU time tends to be far less costly than network bandwidth, after all.

Note that this assumes that view generation is the same across all machines. We discuss this topic more extensively in the views part.

  • Should we replicate to all machines?
    • To specified set of machines for all documents?
    • Should we use some sharding algorithm?

I think that a sharding algorithm would be the best option: given a document, it will give a list of machines to replicate to. We can provide a default implementation that replicates to all machines, or to secondaries and tertiaries.
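
A minimal sketch of the kind of extension point I have in mind (the interface name and the use of JsonDynamicObject as the document type are assumptions):

public interface IShardingStrategy
{
    // Given a document, return the machines that its replicas should go to.
    IEnumerable<string> GetReplicationTargets(JsonDynamicObject document);
}

// Default implementation: replicate everything to every known machine.
public class ReplicateToAll : IShardingStrategy
{
    private readonly string[] machines;

    public ReplicateToAll(params string[] machines)
    {
        this.machines = machines;
    }

    public IEnumerable<string> GetReplicationTargets(JsonDynamicObject document)
    {
        return machines;
    }
}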

Designing a document database: Attachments

In a previous post, I asked about designing a document DB, and brought up the issue of attachments, along with a set of questions that need to be handled:

  • Do we allow them at all?

We pretty much have to, otherwise we will have the users sticking them into the document directly, resulting in very inefficient use of space (binaries in Json format suck).

  • How are they stored?
    • In the DB?
    • Outside the DB?

Storing them in the DB will lead to very high database sizes. And there is the simple question of whether a Document DB is the appropriate storage for BLOBs. I think that there are better alternatives for that than the Document DB: things like Rhino DHT, S3, the file system, a CDN, etc.

  • Are they replicated?

Out of scope for the document db, I am afraid. That depends on the external storage that you choose.

  • Should we even care about them at all? Can we apply SoC and say that this is the task of some other part of the system?

Yes we can and we should.

However, we still want to be able to add attachments to documents. I think we can resolve this pretty easily by adding the notion of document attributes. That would allow us to add external information to a document, such as the attachment URLs. Those should be used for things that are related to the actual document, but are conceptually separate from it.

An attribute would be a typed key/value pair, where both key and value contain strings. The type is an additional piece of information, containing the type of the attribute. This will allow us to do things like adding relations, specifying attachment types, etc.
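
A minimal sketch of what such an attribute could look like, and how an attachment reference might use it (the class name and URL are illustrative only):

public class DocumentAttribute
{
    public string Type { get; set; }   // e.g. "attachment", "relation"
    public string Key { get; set; }    // e.g. "invoice.pdf"
    public string Value { get; set; }  // e.g. a URL in the external storage
}

// Usage: reference an external file without storing the bytes in the document db
var attachment = new DocumentAttribute
{
    Type = "attachment",
    Key = "invoice.pdf",
    Value = "http://files.example.com/invoice.pdf"
};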