NHibernate Shards: Progress Report
Since my last post about it, there has been a lot of changes to NHibernate Shards.
Update: I can’t believe I forgot, I was so caught up in how cool this was that I did give the proper credits. Thanks to Dario Quintana and all the other contributors to NHibernate Shards.
The demo actually works :-) You can look at the latest code here: http://nhcontrib.svn.sourceforge.net/svnroot/nhcontrib/trunk/src/NHibernate.Shards/
You can read the documentation for the Java version (most of which is applicable for the .NET version) here: http://docs.jboss.org/hibernate/stable/shards/reference/en/html/
Let us go through how it works, okay?
We have the following class, which we want to shard.
The class mapping is almost standard:
As you can see, the only new thing is the primary key generator. Because entities are sharded based on their primary key, we have to encode the appropriate shard in the shard. The easiest way of doing that is using the SharedUUIDGenerator. This generator generates keys that looks like this:
- 00010000602647468c2ef2f10ded039a
- 000200006ba74626a564d147dc89f9ad
- 00030000eb934532b828601979036e3c
The first four characters are reserved for the shard id.
Next, we need to specify the configurations for each shard, we can do this normally, but we have to specify the shard id in the configuration.
cfg.SetProperty(ShardedEnvironment.ShardIdProperty, 1);
The shard id is an integer that is used to select the appropriate shard. It is also used to allow you to add new shards without breaking the ids of already existing shards.
Next, you need to implement a shard strategy factory:
This allows you to configure the shard strategy based on your needs. This is often where you would add custom behaviors. A shard strategy is composed of several components:
The Shard Selection Strategy is used solely to select the appropriate shard for new entities. If you shard your entities based on the user name, this is where you’ll implement that, by providing a shard selection strategy that is aware of this. On of the nice things about NH Shards is that it is aware of the graph as a whole, and if you have an association to a sharded entity, it knows that it needs to place you in the appropriate shard, without giving the burden to you.
For new objects, assuming that you haven’t provided your own shard selection strategy, NHibernate Shards will try to spread them evenly between the shards. The most common implementation is the Round Robin Load Balancer, which will give you a new shard for each new item that you save.
The Shard Resolution Strategy is quite simple, given an entity and the entity id, in which shard should we look for them?
If you are using a sharded id, such as the one that WeatherReport is using, NH Shards will know which shard to talk to automatically. But if you are using a non sharded id, you have to tell NHibernate how to figure out which shards to look at. By default, if you have non sharded id, it will look at all shards until it finds it.
The shard access strategy specifies how NHibernate Shards talks to the shards when it needs to talk to more than a single shard. NHibernate Shards can do it either sequentially or in parallel. Using parallel access strategy means that NHibernate will hit all your databases at the same time, potentially saving quite a bit of time for you.
The access strategy is also responsible for handling post processing the queries result, merging them and ordering them as needed.
Let us look at the code, okay? As you can see, this is a pretty standard usage of NHibernate.
using(ISession session = sessionFactory.OpenSession())
using(session.BeginTransaction())
{
session.Save(new WeatherReport
{
Continent = "North America",
Latitude = 25,
Longitude = 30,
ReportTime = DateTime.Now,
Temperature = 44
});
session.Save(new WeatherReport
{
Continent = "Africa",
Latitude = 44,
Longitude = 99,
ReportTime = DateTime.Now,
Temperature = 31
});
session.Save(new WeatherReport
{
Continent = "Asia",
Latitude = 13,
Longitude = 12,
ReportTime = DateTime.Now,
Temperature = 104
});
session.Transaction.Commit();
}
But saving the data is only part of things, what about querying? Well, let us look at the following query:
session.CreateCriteria(typeof(WeatherReport), "weather").List()
This query will give us:
Note that we have three different sessions here, each for its own database, each executing a single query. What is really interesting is that NHibernate will take all of those results and merge them together. It can even handle proper ordering across different databases.
Let us see the code:
var reports =
session.CreateCriteria(typeof(WeatherReport), "weather")
.Add(Restrictions.Gt("Temperature", 33))
.AddOrder(Order.Asc("Continent"))
.List();
foreach (WeatherReport report in reports)
{
Console.WriteLine(report.Continent);
}
Which results in:
And in the following printed to the console:
Asia
North America
We got the proper ordering, as we specified in the query, but note that we aren’t handling ordering in the database. Because we are hitting multiple sources, it is actually cheaper to do the ordering in memory, rather than get partially ordered data and they trying to sort it.
Well, that is about it from the point of view of the capabilities.
One of the things that is holding NH Shards back right now is that only core code paths has been implemented. A lot of the connivance methods are currently not implemented.
They are relatively low hanging fruits, and can be implemented without any deep knowledge of NHibernate or NHibernate Shards. Beyond that, the sharded HQL implementation is still not handling order properly, so if you care about ordering you can only query using ICriteria (at the moment).
It isn’t there yet, but it is much closer. You can get a working demo, and probably start working with this and implement things on the fly as you run into things. I strongly urge you to contribute the missing parts, at least the convenience methods, which should be pretty easy.
Please submit patches to our JIRA and discuss the topic at: http://groups.google.com/group/nhcdevs
Comments
First off, awesome.
Second, I was wondering about the ShardResolutionStrategy. How is this going to work with associations and aggregate roots? I would think only the aggregate root needs the ShardUUIDGenerator, but it seems as though that assumption is wrong. Would aggregates have their own generator and be spread out between multiple databases for the same root?
Brian,
Since with DDD you only ever access stuff from the root, only aggregates are required to have shared id.
BTW, ShardUUIDGenerator isn't the only one that you can use, you can use other sharding aware id generators.
An aggregates all reside on a single shard.
How come it's cheaper to do the sorting in memory than to allow the database to do it for each shard? Combining sorted list is an O(n) operation.
Thanks to Dario Quintana to put a lot of effort developing NHibernate.Shards and thanks to the others developers have sent patch in the last few weeks.
configurator,
Because you are going to have to sort them anyway. At that point, it is cheaper to let the DB just stream it to us and we will sort them in memory.
Fabio,
You are absolutely right!
I updated the post
But database sorts are cheaper, because they are indexed... Am I missing something here?
Suppose we have a large amount of data - say 120 records, sharded into 3 shards and we want it sorted by an indexed field. Now we get a bunch of records and sort them ourselves. But we could use the (fast) database indexed sort, and then combine the lists ourself quite quickly.
Configurator,
stackoverflow.com/.../which-sort-algorithm-work...
There are some algorithms that performs horrible on nearly sorted data.
Quicksort, in particular, may get on O(n^2) on sorted data.
Since we aren't dealing with large amounts of data, it is quick to sort them in memory, it is a step we are going to take anyway, so why let the DB do it?
120 is no data, practically
This data is not mostly sorted. It's several sorted list - and for that you have a simple algorithm that scales well over any amount of data as long as there aren't too many shards - it's approximately an O(n*m) algorithm where n is the total data and m is the number of shards.
<t MergeSortedLists <t(params IEnumerable <t[] enumerables) {
<ienumerator<t> enumerators = new List <ienumerator<t>();
<t enumerable in enumerables) {
<t e = enumerable.GetEnumerator();
<t result = new List <t();
<t next = enumerators[0];
Configurator,
I am willing to lay odds that you wouldn't be able to get this to perform faster than.
Array.Sort or List.Sort that are already in the BCL.
With a slight change (chaging SmallerThan to comparer.Compare and adding a Comparer <t comparer = Comparer <t.Default line) the merge's times were better than List <t.Sort. My tests don't include the database sorting time, and I haven't really bothered with optimising this code, but my times were:
Merge time: 3562
Sort time: 6236
This was done with four lists of 10,000,000 integers each.
You are never going to pull millions of items.
Of course you won't pull millions of items. It's just impossible to see the difference with less data - and if we're talking about a server that servers millions of requests, each with 100 items (which, like you said, is practically no data), it would matter.
Merging sorted is O(m*n) when m is the number of shards.
Sorting in memory is O(n*log2(n)).
So if m <log2(n)> So if we have 8 shards and 10000 records then
m=8, log2(n)=13(more or less).
The other good thing about the merge algorithm is that
you don't need to read all the data into memory to start returning results.
If you implement the MergeSortedList to return an IEnumerable then you just need to get the first item from each shard to return the first item.
BTW, I don't think you need to sort the shards, each time you just need the shard with the smallest first Item, so you can use a priority queue,
which , if I remember correctly, is O(log(m)) for updating one item,
So if you put all the enumerators in a priority queue (ordered on the value of the Current Item) and then just remove the first item from the priority queue, return the Current item, Execue MoveNext() and return the enumerator to the queue if it is not empty, You should have an algorithm that works with O(log(m)n) which should be alot better than O(log(n)n).
Nadav,
Which ends up being a lot of brain power for just getting the data, sorting it using the builtin methods, and moving on to actually producing value :-)
So, how about this: The merge fans acknowledge that Oren's version is simpler, which has value, Oren acknowledges that the merge version is faster, which has value, and the merge fans submit a patch when all is said and done. :)
Actually, I think that oren's version will probably be faster too for most cases (and the priority queue solution might be nice in theory, but the added complexity makes it not worth it :) ).
I think the real issue with the in memory version is with scalabilty.
I mean, what if the user wants to export the last 5 years worth of history to a text file? That can be millions of records. I don't want to keep all that in memory so I can sort it.
Nadav
Nadav,
If you want to export much data NHibernate is probably the wrong tool to go with. ETL is the right answer.
The constraints of the applicability of NHibernate make inmemory sorting as good as merged sort (that is beside extreme cases).
If I want the first 10 records from a sorted query on a table with 10,000,000 records it would be much more performant to sort on the database.
This is a common operation when paging results.
I'd say that DB sorting is quicker....now if you do the sort and pull only primary keys...then grab the top 10 records using primary key lookup...you may have a better result with in memory sort.
how do i help to port Hyberrnate Shards to with NH Shards?
who i i contact?
Comment preview