Ayende @ Rahien

Refunds available at head office

That No SQL Thing - Key/Value stores

The simplest No SQL databases are the Key/Value stores. They are simplest only in terms of their API, because the actual implementation may be quite complex. But let us focus on the API that is exposed to us for a second. Most of the Key/Value stores expose some variation on the following API:

void Put(string key, byte[] data);
byte[] Get(string key);
void Remove(string key);

There are many variations, but that is the basis for everything else. A key/value store allows you to store values by key, as simple as that. The value itself is just a blob as far as the data store is concerned; it just stores it, it doesn’t actually care about the content. In other words, we don’t have a data store defined schema, but client defined semantics for understanding what the values are. The benefits of this approach are that it is very simple to build a key/value store, and that it is very easy to scale it. It also tends to have great performance, because the access pattern in a key/value store can be heavily optimized.
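
To make the client defined semantics point concrete, here is a minimal sketch (the IKeyValueStore interface is just the API above; the use of Json.NET and the "users/" key convention are my own assumptions, the store itself couldn’t care less):

using System.Text;
using Newtonsoft.Json;

public interface IKeyValueStore
{
    void Put(string key, byte[] data);
    byte[] Get(string key);
    void Remove(string key);
}

public class UserProfile
{
    public string Name { get; set; }
    public string Email { get; set; }
}

public class UserProfileStorage
{
    private readonly IKeyValueStore store;

    public UserProfileStorage(IKeyValueStore store)
    {
        this.store = store;
    }

    public void Save(string userId, UserProfile profile)
    {
        // The schema lives entirely in the client; the store only ever sees bytes.
        var json = JsonConvert.SerializeObject(profile);
        store.Put("users/" + userId, Encoding.UTF8.GetBytes(json));
    }

    public UserProfile Load(string userId)
    {
        var data = store.Get("users/" + userId);
        if (data == null)
            return null;
        return JsonConvert.DeserializeObject<UserProfile>(Encoding.UTF8.GetString(data));
    }
}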

Concurrency – In a key/value store, concurrency is only applicable on a single key, and it is usually offered as either optimistic writes or eventual consistency. In highly scalable systems, optimistic writes are often not possible, because of the cost of verifying that the value hasn’t changed (assuming the value may have replicated to other machines). Therefore, we usually see either a key master (one machine owns a key) or the eventual consistency model.
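
As a sketch of what the optimistic write option looks like from the client side (the interface and the numeric version token here are hypothetical; each store exposes its own variant, often as an etag or a vector clock):

using System;

public interface IVersionedKeyValueStore
{
    byte[] Get(string key, out long version);

    // Returns false if the key changed since expectedVersion was read;
    // the caller is then expected to re-read and retry.
    bool TryPut(string key, byte[] data, long expectedVersion);
}

public static class OptimisticUpdate
{
    public static void Update(IVersionedKeyValueStore store, string key,
                              Func<byte[], byte[]> change)
    {
        while (true)
        {
            long version;
            var current = store.Get(key, out version);
            if (store.TryPut(key, change(current), version))
                return; // no concurrent writer beat us, we are done
            // somebody else updated the key first; read again and retry
        }
    }
}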

Queries – there really isn’t any way to perform a query in a key value store, except by the key. Even range queries on the key are usually not possible.

Transactions – while it is possible to offer transaction guarantees in a key/value store, those are usually only offered in the context of a single key put. It is possible to offer them on multiple keys, but that really doesn’t work when you start thinking about a distributed key/value store, where different keys may reside on different machines. Some data stores offer no transaction guarantees at all.

Schema – key/value stores have the following schema: the key is a string, the value is a blob :-) Beyond that, the client is the one that determines how to parse the data.

Scaling Up – In key/value stores, there are two major options for scaling. The simplest one would be to shard the entire key space: keys starting with A go to one server, while keys starting with B go to another server. In this system, a key is only stored on a single server. That drastically simplifies things like transaction guarantees, but it exposes the system to data loss if a single server goes down. At that point, we introduce replication.
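
A minimal sketch of the sharding idea, reusing the IKeyValueStore interface from the earlier sketch (real stores typically hash the key rather than look at its first letter, precisely so that the load spreads evenly):

public class ShardedKeyValueStore
{
    private readonly IKeyValueStore[] shards;

    public ShardedKeyValueStore(IKeyValueStore[] shards)
    {
        this.shards = shards;
    }

    // Each key is owned by exactly one server, so single key operations
    // never need to coordinate between machines.
    private IKeyValueStore ShardFor(string key)
    {
        // Note: a real system would use a stable hash; string.GetHashCode
        // can differ between processes and runtime versions.
        var hash = key.GetHashCode() & 0x7fffffff;
        return shards[hash % shards.Length];
    }

    public void Put(string key, byte[] data)
    {
        ShardFor(key).Put(key, data);
    }

    public byte[] Get(string key)
    {
        return ShardFor(key).Get(key);
    }
}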

Replication – In key/value stores, the replication can be done by the store itself or by the client (writing to multiple servers). Replication also introduces the problem of divergent versions; in other words, two servers in the same cluster think that the value of key ‘ABC’ is two different things. Resolving that is a complex issue. The common approaches are to decide that it can’t happen (Scalaris) and reject updates where we can’t ensure there is no conflict, or to accept all updates and ask the client to resolve them for us at a later date (Amazon Dynamo, Rhino DHT).

Usages – Key Value stores shine when you need to access the data by key :-)

More seriously, key based access is actually quite common. Things like user profiles, user sessions, shopping carts, etc. In all those cases, note that we are storing the entire thing as a single value in the data store. That makes it cheap to handle (one request to read, one request to write) and easy to handle when you run into a concurrency conflict (you only need to resolve a single key).

Because key based queries are practically free, we can get a significant performance benefit by structuring our data access and our applications along keys. It turns out that there is quite a lot that you can do with just a key/value store. Amazon’s shopping cart runs on a key value store (Amazon Dynamo), so I think you can surmise that this is a highly scalable technique.

Data stores to look at:

  • The Amazon Dynamo paper is one of the best resources on the topic that one can ask for.
  • Rhino DHT is a scalable, redundant, zero config, key value store on the .NET platform.

That No SQL Thing – How to evaluate?

In my posts about the No SQL options, I am going to talk about each of them from two major angles: first, from a logical perspective, what kind of API and interface they expose, and second, what kind of scaling capabilities they have.

Almost all data stores need to handle things like:

  • Concurrency
  • Queries
  • Transactions
  • Schema
  • Replication
  • Scaling Up

And I am going to touch on each of those for each option.

One thing that should be made clear upfront is the major difference between performance and scalability; the two are often at odds, and usually increasing one will decrease the other.

For performance, we ask: How can we execute the same set of requests, over the same set of data with:

  • less time?
  • less memory?

For scaling, we ask: How can we meet our SLA when:

  • we get a lot more data?
  • we get a lot more requests?

With relational databases, the answer is usually, you don’t scale. The No SQL alternatives are generally quite simple to scale, however.

That No SQL Thing

Probably the worst thing about relational databases is that they are so good at what they do. Good enough to conquer the entire market on data storage and hold it for decades.

Wait! That is a bad thing? How?

It is a bad thing because relational databases are appropriate for a wide range of tasks, but not for every task. Yet it is exactly that success that has caused them to be used in contexts where they are not appropriate. In the last month alone, my strong recommendation for two different clients was that they switch to a non relational data store, because it would greatly simplify the work that they need to do.

That met with some (justified) resistance, predictably. People think that an RDBMS is the way to store data. I decided to write a series of blog posts about the topic, trying to explain why you might want to work with a No SQL database.

Relational Databases have the following properties:

  • ACID
  • Relational
  • Table / Row based
  • Rich querying capabilities
  • Foreign keys
  • Schema

Just about every one of the No SQL approaches gives up on some of those properties; usually, it gives up on all of them. But think about how useful an RDBMS is, how flexible it can be. Why give it up?

Indeed, the most common reason to want to move away from an RDBMS is running into its limitations. In short, an RDBMS doesn’t scale. Actually, let me phrase that a little more strongly: RDBMS systems cannot be made to scale.

The problem is inherent in the basic requirements of the relational database system: it must be consistent, to handle things like foreign keys, maintain relations over the entire dataset, etc. The trouble starts when you try to scale a relational database over a set of machines. At that point, you run head on into the CAP theorem, which states that if consistency is your absolute requirement, you need to give up on either availability or partition tolerance.

In most high scale environments, it is not feasible to give up on either option, so relational databases are out. That leaves you with the No SQL options. I am going to dedicate a post to each of the following; let me know if I missed something:

  • Key/Value store
    • Key/Value store – sharding
    • Key/Value store - replication
    • Key/Value store – eventually consistent
  • Document Databases
  • Graph Databases
  • Column (Family) Databases

Other databases, namely XML databases and object databases, exist. Object databases suffer from the same problem regarding CAP as relational databases, and XML databases are basically a special case of document databases.

How to setup dynamic groups in MSBuild without Visual Studio ruining them

One of the really annoying things about VS & MSBuild is that while MSBuild is perfectly capable of doing things like this:

<EmbeddedResource Include="WebUI\**\*.*"/>

Visual Studio would go ahead, resolve the expression and then hard code the current values.

That sort of defeats the way I want to make use of it, which is to make it frictionless to add more elements. If you do it in a slightly more complex way, VS can’t resolve the expression easily, so it does the right thing and computes it at build time, rather than replacing it with hard coded values.

Here is how you do it:

    <Target Name="BeforeBuild">
        <CreateItem Include="WebUI\**\*.*">
            <Output ItemName="EmbeddedResource" TaskParameter="Include" />
        </CreateItem>
    </Target>

Unveiling Raven DB

I’ll be giving a talk on the 28th of April, at Skills Matter’s office, about Raven DB.

That is also when we will make the 1.0 release.

I love ConcurrentDictionary!

Not just because it is concurrent, but because of this wonderful method:

public class Storage : IStorage
{
    private readonly ConcurrentDictionary<Guid, ConcurrentDictionary<int, List<object>>> values =
        new ConcurrentDictionary<Guid, ConcurrentDictionary<int, List<object>>>();

    public void Store(Guid taskId, int level, object value)
    {
        // GetOrAdd returns the existing entry if one is already there, or adds
        // the one produced by the factory delegate (only one value ever wins
        // for a given key).
        values.GetOrAdd(taskId, guid => new ConcurrentDictionary<int, List<object>>())
            .GetOrAdd(level, i => new List<object>())
            .Add(value);
    }
}

Doing this with Dictionary is always a pain, but this is an extremely natural way of doing things.

On languages and platforms

I got the following question in email:

Can you blog about why you chose c# and .net over other languages for all your projects. What was it that made you stick around the windows platform? Was it coincidence or a firm decision based on something? Was it because c# was the first language that you learned in your profession and then decided to become a pro in that?

I would expect to see you working on OSS in the lamp stack. You could have displayed your capabilities very well in Java. Just a bit curious…

It started as a coincidence, truth be told. I was working with what I had available at the time, and that was Windows. I actually made the jump from hobbyist to professional in a C++ course. I wanted to learn C++ because that was where most of the action was at the time. It happened to be a Visual C++ 6.0 course, not something that I was aware of at the time. If it had been a GNU/C++ course, I would probably be in a very different position today.

That said, it stopped being a coincidence several years ago, when I made a conscious decision to stick to the .NET platform. That decision was based on several factors. Size of the market, acceptance for deployment, familiarity, future prospects, etc.

In essence, it boiled down to the fact that I really didn’t have any burning issues with the .NET platform: I am familiar with it, with what you can do with it and what you can’t. Moving to another platform would introduce a very high cost during the switch, and so far there hasn’t been anything that was enough of a killer feature to convince me to move.

Note, this post was written while taking an Erlang course.

Rhino ETL Video

Paul Barriere has a video up of a presentation about Rhino ETL:

ETL stands for Extract, Transform, Load. For example, you receive files or other data from vendors or other third parties which you need to manipulate in some way and then insert into your own database. Rhino ETL is an open source C# package that I have used for dozens of production processes quite successfully. By using C# for your ETL tasks you can create testable, reusable components more easily than with tools like SSIS and DTS.

It is good to see more information available on Rhino ETL.

Software pricing - Don’t Just Roll the Dice

I found this ebook very informative. I was very happy to learn that I had at least thought about most of the things that the book recommends, but I wish I had read it two years ago.

If you are an ISV, I highly recommend it, because it is not just some dry academic reading; it contains examples that you can easily relate to, probably examples that you are already familiar with.

What is map/reduce for, anyway?

Yesterday I gave a visual explanation of map/reduce, and the question came up about how to handle computing navigation instructions using map/reduce. That made it clear that while (I hope) what map/reduce is might be clearer, what it is for is not.

Map/reduce is a technique aimed at solving a very simple problem: you have a lot of data and you want to go through it in parallel, probably on multiple machines. The whole idea is that you can crunch through massive data sets in a realistic time frame. In order for map/reduce to be useful, you need several things:

  • The calculation that you need to run is one that can be composed. That is, you can run the calculation on a subset of the data, and merge it with the result of another subset (see the sketch right after this list).
    • Most aggregation / statistical functions allow this, in one form or another.
  • The final result is smaller than the initial data set.
  • The calculation has no dependencies on external input except the dataset being processed.
  • Your dataset size is big enough that splitting it up for independent computations will not hurt overall performance.
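
As a tiny illustration of the composability requirement (plain C#, nothing map/reduce specific here): a sum can be computed over each chunk independently and the partial results merged afterwards, which is exactly the property the reduce step relies on.

using System.Collections.Generic;
using System.Linq;

public static class ComposableSum
{
    // Summing each chunk and then summing the partial results gives the
    // same answer as summing everything at once; the per-chunk work could
    // therefore run on different machines.
    public static long Total(IEnumerable<IEnumerable<int>> chunks)
    {
        var partialSums = chunks.Select(chunk => chunk.Sum(x => (long)x));
        return partialSums.Sum();
    }
}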

Now, given those assumptions, you can create a map/reduce job, and submit it to a cluster of machines that would execute it. I am ignoring data locality and failover to make the explanation simple, although they do make for interesting wrinkles in the implementation.

Map/reduce is not applicable, however, in scenarios where the dataset alone is not sufficient to perform the operation. In the case of the navigation computation example, you can’t really handle this via map/reduce, because you lack key data points (the starting and ending points). Trying to compute paths from all points to all other points is probably a losing proposition, unless you have a very small graph. The same applies if you have a 1:1 mapping between input and output. Oh, map/reduce will still work, but the resulting output is probably going to be too big to be really useful. It also means that you have a simple parallel problem, not a map/reduce sort of problem.

If you need fresh results, map/reduce isn’t applicable either; it is inherently a batch operation, not an online one. Trying to invoke a map/reduce operation for a user request is going to be very expensive, and not something that you really want to do.

Another set of problems that you can’t really apply map/reduce to is recursive problems, Fibonacci being the most well known among them. You cannot apply map/reduce to Fibonacci for the simple reason that you need the previous values before you can compute the current one. That means that you can’t break it apart into sub computations that can be run independently.

If your data size is small enough to fit on a single machine, it is probably going to be faster to process it as a single reduce(map(data)) operation than to go through the entire map/reduce process (which requires synchronization). In the end, map/reduce is just a simple paradigm for gaining concurrency; as such it is subject to the benefits and limitations of all parallel programs. The most basic one of them is Amdahl's law.

Map/reduce is a very hot topic, but you need to realize what it is for. It isn’t some magic formula from Google to make things run faster, it is just Select and GroupBy, run over a distributed network.

Map / Reduce – A visual explanation

Map/Reduce is a term commonly thrown about these days; in essence, it is just a way to take a big task and divide it into discrete tasks that can be done in parallel. A common use case for Map/Reduce is in document databases, which is why I found myself thinking deeply about this.

Let us say that we have a set of documents with the following form:

{
  "type": "post",
  "name": "Raven's Map/Reduce functionality",
  "blog_id": 1342,
  "post_id": 29293921,
  "tags": ["raven", "nosql"],
  "post_content": "<p>...</p>",
  "comments": [
    {
      "source_ip": "124.2.21.2",
      "author": "martin",
      "text": "..."
    }
  ]
}

And we want to answer a question over more than a single document. That sort of operation requires us to use aggregation, and over a large amount of data, that is best done using Map/Reduce, to split the work.

Map / Reduce is just a pair of functions, operating over a list of data. In C#, LINQ actually gives us a great chance to do things in a way that makes it very easy to understand and work with. Let us say that we want to be able to get a count of comments per blog. We can do that using the following Map / Reduce queries:

from post in docs.posts
select new {
  post.blog_id,
  comments_length = post.comments.Length
  };

from agg in results
group agg by agg.blog_id into g
select new {
  blog_id = g.Key,
  comments_length = g.Sum(x => x.comments_length)
  };

There are a couple of things to note here:

  • The first query is the map query, it maps the input document into the final format.
  • The second query is the reduce query; it operates over a set of results and produces an answer.
  • Note that the reduce query must return its result in the same format in which it received its input; why will be explained shortly.
  • The first value in the result is the key, which is what we are aggregating on (think of the group by clause in SQL).

Let us see how this works, we start by applying the map query to the set of documents that we have, producing this output:

[image: output of the map step]

The next step is to start reducing the results. In real Map/Reduce algorithms, we partition the original input and work toward the final result. In this case, imagine that the output of the first step was divided into groups of 3 (so 4 groups overall), and then the reduce query was applied to each group, giving us:

[image: output of the first reduce step]

You can see why it is called reduce: for every batch, we apply a sum by blog_id to get a new Total Comments value. We started with 11 rows, and we ended up with just 10. That is where it gets interesting, because we are still not done; we can still reduce the data further.

This is what we do in the third step, reducing the data further still. That is why the input & output format of the reduce query must match: we will feed the output of several reduce queries as the input of a new one. You can also see that we have now moved from having 10 rows to having just 7.

[image: output of the second reduce step]

And the final step is:

[image: output of the final reduce step]

And now we are done, we can't reduce the data any further because all the keys are unique.
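
Here is a small in-memory LINQ sketch of that property (hypothetical data shapes, plain C# rather than the queries above): because the reduce output has the same shape as its input, reducing an already reduced set does not change the totals, which is what lets us keep feeding reduce results back into reduce.

using System.Collections.Generic;
using System.Linq;

public class BlogComments
{
    public int blog_id { get; set; }
    public int comments_length { get; set; }
}

public static class ReduceDemo
{
    // Same shape in and out, so Reduce(Reduce(x)) gives the same rows as Reduce(x).
    public static List<BlogComments> Reduce(IEnumerable<BlogComments> results)
    {
        return (from agg in results
                group agg by agg.blog_id into g
                select new BlogComments
                {
                    blog_id = g.Key,
                    comments_length = g.Sum(x => x.comments_length)
                }).ToList();
    }
}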

There is another interesting property of Map / Reduce, let us say that I just added a comment to a post, that would obviously invalidate the results of the query, right?

Well, yes, but not all of them. Assuming that I added a comment to the post whose id is 10, what would I need to do to recalculate the right result?

  • Map Doc #10 again
  • Reduce Step 2, Batch #3 again
  • Reduce Step 3, Batch #1 again
  • Reduce Step 4

What is important is that I did not have to touch most of the data, making the recalculation effort far cheaper than it would be otherwise.

And that is (more or less) the notion of Map / Reduce.

NHibernate symbol & source server support

A symbol server allows you to debug into code that you don’t have on your machine by querying a symbol server and a source server for the details.

SymbolSource.org has added support for NHibernate, and you can download an unofficial build of the current trunk which can be used with the symbol server here.

Moving on, all of NHibernate’s releases will be tracked on Symbol Source, so even if you don’t have the source available, you can just hit F11 and debug into NHibernate’s source.

You can’t learn the basics from the pros

There was a question recently in the NHibernate mailing list from a guy wanting to learn how to write database agnostic code from the NHibernate source code.

While I suppose that this is possible, I can’t really think of a worse way to learn how to write database agnostic code than reading an OR/M’s code. The reason for that is quite simple: an OR/M isn’t just about one thing, it is about doing a lot of things and bringing them together into a single whole. Yes, most OR/Ms are database agnostic, but trying to figure out the principles of that from the code base is going to be very hard.

That leads to an even more interesting problem: it is very hard for a beginner to actually learn something useful from a professional. That is true in any field, of course. In software, the problem is that most pros will simply skip whole steps that beginners are taught are crucial. It isn’t from neglect; they are going through those steps, just not on a conscious level.

Very often, I’ll come up with a design, and it is only when I need to justify it to someone else that I actually realize the finer points of what has actually gone through my own head (another reason that having a blog is useful).

I think that there is a reason that we have names like code smells; you can immediately sense a problem in a smelly codebase, although it may take you a while to actually articulate it.

Using Lucene – External Indexes

Lucene is a document indexing engine, that is its sole goal, and it does so beautifully. The interesting bit about using Lucene is that it probably wouldn’t be your main data store, but it is likely to be an important piece of your architecture.

The major shift in thinking with Lucene is that while indexing is relatively expensive, querying is free (well, not really, but you get my drift). Compare that to a relational database, where it is usually the inserts that are cheap, but queries are usually what cause us issues. RDBMS are also very good at giving us different views on top of our existing data, but the more you want from them, the more they have to do. We hit that query performance limit again. And we haven’t started talking about locking, transactions or concurrency yet.

Lucene doesn’t know how to do things you didn’t set it up to do. But what it does do, it does very fast.

Add to that the fact that in most applications, reads happen far more often than writes, and you get a different constraint system. Because queries are expensive on the RDBMS, we try to make few of them, and we try to make a single query do most of the work. That isn’t necessarily the best strategy, but it is a very common one.

With Lucene, it is cheap to query, so it makes a lot more sense to perform several individual queries and process their results together to get the final result that you need. It may require somewhat more work (although there are things like Solr that will do it for you), but it results in far faster system performance overall.

In addition to that, since the Lucene index is important but can always be re-created from the source data (it may take some time, though), it doesn’t require all the ceremony associated with DB servers. Instead of buying an expensive server, get a few cheap ones; Lucene scales easily, after all. And since you only use Lucene for indexing, your actual DB querying pattern shifts. Instead of making complex queries in the database, you make them in Lucene, and you only hit the DB with queries by primary key, which are the fastest possible way to get the data.
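
A minimal sketch of that pattern, assuming the Lucene.Net 3.x API, an index that stores a "PostId" field, and a hypothetical repository that loads rows by primary key (none of this is prescribed by Lucene itself):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public class Post
{
    public int Id { get; set; }
    public string Title { get; set; }
}

public interface IPostRepository
{
    // Typically a single SELECT ... WHERE Id IN (...) by primary key.
    IList<Post> GetByIds(IEnumerable<int> ids);
}

public class PostSearcher
{
    public IList<Post> PostsTaggedWith(string indexPath, string tag, IPostRepository repository)
    {
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        using (var searcher = new IndexSearcher(directory, readOnly: true))
        {
            // The expensive filtering work happens inside the index...
            var query = new TermQuery(new Term("Tag", tag));
            var hits = searcher.Search(query, 25);

            // ...and the database is only ever hit by primary key.
            var ids = hits.ScoreDocs
                .Select(hit => int.Parse(searcher.Doc(hit.Doc).Get("PostId")))
                .ToList();
            return repository.GetByIds(ids);
        }
    }
}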

In effect, you have outsourced your queries from the expensive machines to the cheap ones, and for a change, you actually get better performance overall.

Cut the abstractions by putting test hooks

I have been hearing about testable code for a long time now. It looks somewhat like this, although I had to cut down on the number of interfaces along the way.

We go through a lot of contortions to be able to do something fairly simple: avoid hitting a data store in our tests.

This is actually inaccurate; we are putting a lot of effort into being able to do that without changing production code. There are even a lot of explanations of how testable code is decoupled, easy to change, etc.

In my experience, one common problem is that we put too much abstraction into our code. Sometimes it actually serves a purpose, but in many cases, it is just something that we do to enable testing. But we still pay the hit in design complexity anyway.

We can throw all of that away, and keep only what we need to run production code, but that would mean that we would have a harder time with the tests. We can resolve the issue very easily by making the infrastructure aware of testing, as in this example:

[image: an example of infrastructure code with a built-in testing hook]
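
In the spirit of that example, here is a minimal sketch of such a hook (the class and member names here are hypothetical, not taken from the image): the infrastructure itself exposes a seam that tests can use to skip the real data store.

using System;

public class DocumentStorage
{
    // Production code never sets this; tests can assign it to intercept
    // writes instead of standing up a real data store.
    public static Action<string, byte[]> OnSaveForTesting;

    public void Save(string key, byte[] document)
    {
        if (OnSaveForTesting != null)
        {
            OnSaveForTesting(key, document);
            return;
        }

        // ... the actual write to the data store goes here ...
    }
}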

But now your production code was changed by tests?!

Yes, it was. So? I never really got the problem people have with this; in this day and age, putting in the hooks to make testing easier just makes sense. Yes, you can go with the “let us add abstractions until we can do it” approach, but it is much cheaper and faster to go with this one.

Moreover, notice that this is part of the infrastructure code, which I don’t consider production code (you don’t touch it very often, although of course it has to be production ready), so I don’t have any issue with this.

Nitpicker corner: Let us skip the whole TypeMock discussion, shall we?

TDD fanatic corner: I don’t really care about testable code, I care about tested code. If I have a regression suite, that is just what I need.

Lessons learned from building the NHibernate Profiler

Last week I gave a talk about the things I learned from building NH Prof.

Skills Matter had recorded the session and made it available.

Looking forward to your comments, but I should note as a disclaimer that this was after a full day of teaching and on 50 minutes of sleep in the last 48 hours.

Slaying relational hydras (or dating them)

Sometimes client confidentiality can be really annoying, because the problem sets & suggested solutions that come up are really interesting. That said, since I am interested in having future clients, it is pretty much a must have. As such, the current post represents a real world customer problem, but probably in a totally different context. In fact, I would be surprised if the customer was able to recognize the problem as his.

That said, the problem is actually quite simple. Consider a dating site, where you can input your details and what you seek, and the site will match you with the appropriate person. I am going to ignore a lot of things here, so if you have actually built a dating site, try not to cringe.

At the most basic level, we have two screens. The My Details screen, where users can specify their stats and their preferences:

[image: the My Details screen]

And the results screen, which shows the user the candidates matching their preferences.

There is just one interesting tidbit: the list of qualities is pretty big (hundreds or thousands of potential qualities).

Can you design a relational model that would be a good fit for this? And allow efficient searching?

I gave it some thought, and I can’t think of one, but maybe you can.

I’ll follow up on this post in a day or two, showing how to implement the problem using Raven.

Rhino Divan DB – Performance

The usual caveat applies: I am running this in process, using a small data size and a small number of documents.

These aren’t so much real benchmarks, but they are good enough to give me a rough indication of where I am heading, and whether or not I am going in completely the wrong direction.

Those two should be obvious:

[two images: benchmark charts]

This one is more interesting. RDB doesn’t do immediate indexing; I chose to accept higher CUD throughput and make indexing a background process. That means that the index may be inconsistent for a short while, but it greatly reduces the amount of work required to insert/update a document.

But what is that short while in which the document and the index may be inconsistent? The average time seems to be around 25ms in my tests, with some spikes toward 100ms in some of the starting results. In general, it looks like things are improving the longer the database runs. Trying it out over a 5,000 document set gives me an average update duration of 27ms, but note that I am testing the absolute worst usage pattern: lots of small documents inserted one at a time, with index requests coming in as well.

[image: chart of index update times]

Be that as it may, having an inconsistency window measured in a few milliseconds seems acceptable to me. Especially since RDB is nice enough to actually tell me if there are any inconsistencies in the results, so I can choose whether to accept them or retry the request.

Getting code ready for production

I am currently doing the production-ready pass through the Rhino DivanDB code base, and I thought that this change was interesting enough to post about:

public void Execute()
{
    while(context.DoWork)
    {
        bool foundWork = false;
        transactionalStorage.Batch(actions =>
        {
           var task = actions.GetFirstTask();
           if(task == null)
           {
               actions.Commit(); 
               return;
           }
           foundWork = true;

           task.Execute(context);

           actions.CompleteCurrentTask();

           actions.Commit();
        });
        if(foundWork == false)
            context.WaitForWork();
    }
}

This is the “just get things working” phase. When getting a piece of code ready for production, I am looking for several things:

  • If this is running in production, and I get the log file, will I be able to understand what is going on?
  • Should this code handle any exceptions?
  • What happens if I send values from a previous version? From a future version?
  • Am I doing unbounded operations?
  • For error handling, can I avoid any memory allocations?

The result for this piece of code was:

public void Execute()
{
    while(context.DoWork)
    {
        bool foundWork = false;
        transactionalStorage.Batch(actions =>
        {
            var taskAsJson = actions.GetFirstTask();
            if (taskAsJson == null)
            {
                actions.Commit();
                return;
            }
            log.DebugFormat("Executing {0}", taskAsJson);
            foundWork = true;

            Task task;
            try
            {
                task = Task.ToTask(taskAsJson);
                try
                {
                    task.Execute(context);
                }
                catch (Exception e)
                {
                    if (log.IsWarnEnabled)
                    {
                        log.Warn(string.Format("Task {0} has failed and was deleted without completing any work", taskAsJson), e);
                    }
                }
            }
            catch (Exception e)
            {
                log.Error("Could not create instance of a task from " + taskAsJson, e);
            }

            actions.CompleteCurrentTask();
            actions.Commit();
        });
        if(foundWork == false)
            context.WaitForWork();
    }
}

The code size blows up really quickly.