Ayende @ Rahien

It's a girl

NHibernate Shards: Progress Report

Since my last post about it, there has been a lot of changes to NHibernate Shards.

Update: I can’t believe I forgot, I was so caught up in how cool this was that I did give the proper credits. Thanks to Dario Quintana and all the other contributors to NHibernate Shards.

The demo actually works :-) You can look at the latest code here: http://nhcontrib.svn.sourceforge.net/svnroot/nhcontrib/trunk/src/NHibernate.Shards/

You can read the documentation for the Java version (most of which is applicable for the .NET version) here: http://docs.jboss.org/hibernate/stable/shards/reference/en/html/

Let us go through how it works, okay?

We have the following class, which we want to shard.

image

The class mapping is almost standard:

image

As you can see, the only new thing is the primary key generator. Because entities are sharded based  on their primary key, we have to encode the appropriate shard in the shard. The easiest way of doing that is using the SharedUUIDGenerator. This generator generates keys that looks like this:

  • 00010000602647468c2ef2f10ded039a
  • 000200006ba74626a564d147dc89f9ad
  • 00030000eb934532b828601979036e3c

The first four characters are reserved for the shard id.

Next, we need to specify the configurations for each shard, we can do this normally, but we have to specify the shard id in the configuration.

cfg.SetProperty(ShardedEnvironment.ShardIdProperty, 1);

The shard id is an integer that is used to select the appropriate shard. It is also used to allow you to add new shards without breaking the ids of already existing shards.

Next, you need to implement a shard strategy factory:

image

This allows you to configure the shard strategy based on your needs. This is often where you would add custom behaviors. A shard strategy is composed of several components:

image

The Shard Selection Strategy is used solely to select the appropriate shard for new entities. If you shard your entities based on the user name, this is where you’ll implement that, by providing a shard selection strategy that is aware of this. On of the nice things about NH Shards is that it is aware of the graph as a whole, and if you have an association to a sharded entity, it knows that it needs to place you in the appropriate shard, without giving the burden to you.

For new objects, assuming that you haven’t provided your own shard selection strategy, NHibernate Shards will try to spread them evenly between the shards. The most common implementation is the Round Robin Load Balancer, which will give you a new shard for each new item that you save.

The Shard Resolution Strategy is quite simple, given an entity and the entity id, in which shard should we look for them?

image

image

If you are using a sharded id, such as the one that WeatherReport is using, NH Shards will know which shard to talk to automatically. But if you are using a non sharded id, you have to tell NHibernate how to figure out which shards to look at. By default, if you have non sharded id, it will look at all shards until it finds it.

The shard access strategy specifies how NHibernate Shards talks to the shards when it needs to talk to more than a single shard. NHibernate Shards can do it either sequentially or in parallel. Using parallel access strategy means that NHibernate will hit all your databases at the same time, potentially saving quite a bit of time for you.

The access strategy is also responsible for handling post processing the queries result, merging them and ordering them as needed.

Let us look at the code, okay? As you can see, this is a pretty standard usage of NHibernate.

using(ISession session = sessionFactory.OpenSession())
using(session.BeginTransaction())
{
session.Save(new WeatherReport
{
Continent = "North America",
Latitude = 25,
Longitude = 30,
ReportTime = DateTime.Now,
Temperature = 44
});

session.Save(new WeatherReport
{
Continent = "Africa",
Latitude = 44,
Longitude = 99,
ReportTime = DateTime.Now,
Temperature = 31
});

session.Save(new WeatherReport
{
Continent = "Asia",
Latitude = 13,
Longitude = 12,
ReportTime = DateTime.Now,
Temperature = 104
});
session.Transaction.Commit();
}
Since we are using the defaults, each of those entities is going to go to a different shard. Here is the result:

image
Our data was saved into three different databases. And obviously we could have saved them to three different servers as well.

But saving the data is only part of things, what about querying? Well, let us look at the following query:

session.CreateCriteria(typeof(WeatherReport), "weather").List()

This query will give us:

image

Note that we have three different sessions here, each for its own database, each executing a single query. What is really interesting is that NHibernate will take all of those results and merge them together. It can even handle proper ordering across different databases.

Let us see the code:

var reports =
session.CreateCriteria(typeof(WeatherReport), "weather")
.Add(Restrictions.Gt("Temperature", 33))
.AddOrder(Order.Asc("Continent"))
.List();
foreach (WeatherReport report in reports)
{
Console.WriteLine(report.Continent);
}

Which results in:

image

And in the following printed to the console:

Asia
North America

We got the proper ordering, as we specified in the query, but note that we aren’t handling ordering in the database. Because we are hitting multiple sources, it is actually cheaper to do the ordering in memory, rather than get partially ordered data and they trying to sort it.

Well, that is about it from the point of view of the capabilities.

One of the things that is holding NH Shards back right now is that only core code paths has been implemented. A lot of the connivance methods are currently not implemented. 

image

They are relatively low hanging fruits, and can be implemented without any deep knowledge of NHibernate or NHibernate Shards. Beyond that, the sharded HQL implementation is still not handling order properly, so if you care about ordering you can only query using ICriteria (at the moment).

It isn’t there yet, but it is much closer. You can get a working demo, and probably start working with this and implement things on the fly as you run into things. I strongly urge you to contribute the missing parts, at least the convenience methods, which should be pretty easy.

Please submit patches to our JIRA and discuss the topic at: http://groups.google.com/group/nhcdevs

Comments

Brian Hartsock
10/18/2009 02:54 PM by
Brian Hartsock

First off, awesome.

Second, I was wondering about the ShardResolutionStrategy. How is this going to work with associations and aggregate roots? I would think only the aggregate root needs the ShardUUIDGenerator, but it seems as though that assumption is wrong. Would aggregates have their own generator and be spread out between multiple databases for the same root?

Ayende Rahien
10/18/2009 02:57 PM by
Ayende Rahien

Brian,

Since with DDD you only ever access stuff from the root, only aggregates are required to have shared id.

BTW, ShardUUIDGenerator isn't the only one that you can use, you can use other sharding aware id generators.

An aggregates all reside on a single shard.

configurator
10/18/2009 03:43 PM by
configurator

How come it's cheaper to do the sorting in memory than to allow the database to do it for each shard? Combining sorted list is an O(n) operation.

Fabio Maulo
10/18/2009 03:48 PM by
Fabio Maulo

Thanks to Dario Quintana to put a lot of effort developing NHibernate.Shards and thanks to the others developers have sent patch in the last few weeks.

Ayende Rahien
10/18/2009 03:49 PM by
Ayende Rahien

configurator,

Because you are going to have to sort them anyway. At that point, it is cheaper to let the DB just stream it to us and we will sort them in memory.

Ayende Rahien
10/18/2009 03:51 PM by
Ayende Rahien

Fabio,

You are absolutely right!

I updated the post

configurator
10/18/2009 03:53 PM by
configurator

But database sorts are cheaper, because they are indexed... Am I missing something here?

Suppose we have a large amount of data - say 120 records, sharded into 3 shards and we want it sorted by an indexed field. Now we get a bunch of records and sort them ourselves. But we could use the (fast) database indexed sort, and then combine the lists ourself quite quickly.

Ayende Rahien
10/18/2009 04:01 PM by
Ayende Rahien

Configurator,

stackoverflow.com/.../which-sort-algorithm-work...

There are some algorithms that performs horrible on nearly sorted data.

Quicksort, in particular, may get on O(n^2) on sorted data.

Since we aren't dealing with large amounts of data, it is quick to sort them in memory, it is a step we are going to take anyway, so why let the DB do it?

120 is no data, practically

configurator
10/18/2009 04:24 PM by
configurator

This data is not mostly sorted. It's several sorted list - and for that you have a simple algorithm that scales well over any amount of data as long as there aren't too many shards - it's approximately an O(n*m) algorithm where n is the total data and m is the number of shards.

    List

<t MergeSortedLists <t(params IEnumerable <t[] enumerables) {

        // How this works:

        // Stores a list of the enumerators

        // For each item, finds the minimum 'Current' of all enumerators, then preforms MoveNext() on the enumerator whose item has been selected

        // Once an enumerator is finished, it is removed from the enumerator list

        // Once the list is empty, we're done.


        // First, create a list of enumerators, with one MoveNext() to get to the first item

        List

<ienumerator enumerators = new List <ienumerator();

        foreach (IEnumerable

<t enumerable in enumerables) {

            IEnumerator

<t e = enumerable.GetEnumerator();

            if (e.MoveNext())

                enumerators.Add(e);

            else

                // If there is no first item, we don't bother adding the enumerator to the list

                e.Dispose();

        }



        List

<t result = new List <t();

        while (enumerators.Count > 0) {

            // compare current item in each enumerator and choose the first one

            IEnumerator

<t next = enumerators[0];

            for (int i = 1; i < enumerators.Count; i++) {

                if (SmallerThan(enumerators[i].Current, next.Current))

                    next = enumerators[i];

            }


            // current item on selected enumerator is added to the result

            result.Add(next.Current);


            // move next on that enumerator and get rid of it if we're done

            if (!next.MoveNext()) {

                enumerators.Remove(next);

                next.Dispose();

            }

        }


        return result;

    }
Ayende Rahien
10/18/2009 04:28 PM by
Ayende Rahien

Configurator,

I am willing to lay odds that you wouldn't be able to get this to perform faster than.

Array.Sort or List.Sort that are already in the BCL.

configurator
10/18/2009 05:38 PM by
configurator

With a slight change (chaging SmallerThan to comparer.Compare and adding a Comparer <t comparer = Comparer <t.Default line) the merge's times were better than List <t.Sort. My tests don't include the database sorting time, and I haven't really bothered with optimising this code, but my times were:

Merge time: 3562

Sort time: 6236

This was done with four lists of 10,000,000 integers each.

Ayende Rahien
10/18/2009 08:02 PM by
Ayende Rahien

You are never going to pull millions of items.

configurator
10/18/2009 08:19 PM by
configurator

Of course you won't pull millions of items. It's just impossible to see the difference with less data - and if we're talking about a server that servers millions of requests, each with 100 items (which, like you said, is practically no data), it would matter.

Nadav
10/19/2009 12:45 PM by
Nadav

Merging sorted is O(m*n) when m is the number of shards.

Sorting in memory is O(n*log2(n)).

So if m <log2(n)> So if we have 8 shards and 10000 records then

m=8, log2(n)=13(more or less).

The other good thing about the merge algorithm is that

you don't need to read all the data into memory to start returning results.

If you implement the MergeSortedList to return an IEnumerable then you just need to get the first item from each shard to return the first item.

BTW, I don't think you need to sort the shards, each time you just need the shard with the smallest first Item, so you can use a priority queue,

which , if I remember correctly, is O(log(m)) for updating one item,

So if you put all the enumerators in a priority queue (ordered on the value of the Current Item) and then just remove the first item from the priority queue, return the Current item, Execue MoveNext() and return the enumerator to the queue if it is not empty, You should have an algorithm that works with O(log(m)n) which should be alot better than O(log(n)n).

Ayende Rahien
10/19/2009 01:28 PM by
Ayende Rahien

Nadav,

Which ends up being a lot of brain power for just getting the data, sorting it using the builtin methods, and moving on to actually producing value :-)

Jeremy Gray
10/19/2009 02:21 PM by
Jeremy Gray

So, how about this: The merge fans acknowledge that Oren's version is simpler, which has value, Oren acknowledges that the merge version is faster, which has value, and the merge fans submit a patch when all is said and done. :)

Nadav
10/20/2009 06:36 AM by
Nadav

Actually, I think that oren's version will probably be faster too for most cases (and the priority queue solution might be nice in theory, but the added complexity makes it not worth it :) ).

I think the real issue with the in memory version is with scalabilty.

I mean, what if the user wants to export the last 5 years worth of history to a text file? That can be millions of records. I don't want to keep all that in memory so I can sort it.

Nadav

Evgeny Shapiro
10/25/2009 11:04 AM by
Evgeny Shapiro

Nadav,

If you want to export much data NHibernate is probably the wrong tool to go with. ETL is the right answer.

The constraints of the applicability of NHibernate make inmemory sorting as good as merged sort (that is beside extreme cases).

Agile Jedi
11/03/2009 11:30 PM by
Agile Jedi

If I want the first 10 records from a sorted query on a table with 10,000,000 records it would be much more performant to sort on the database.

This is a common operation when paging results.

I'd say that DB sorting is quicker....now if you do the sort and pull only primary keys...then grab the top 10 records using primary key lookup...you may have a better result with in memory sort.

Stan
12/16/2009 08:31 PM by
Stan

how do i help to port Hyberrnate Shards to with NH Shards?

who i i contact?

Comments have been closed on this topic.