
Design patterns in the test of time: Singleton

In software engineering, the singleton pattern is a design pattern that restricts the instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system.

More about this pattern.

I won’t show code or diagrams here. If you don’t know the Singleton pattern, you probably don’t have any business reading this series. Go hit the books and then come back for the rest of my review.

Off the top of my head, I can’t think of a single pattern that has been as denigrated as the Singleton pattern. It has been the bane of testers everywhere, and just about any Singleton implementation has had to become thread safe, given the demands that we usually have from our apps.

That basically means that any time you use a Singleton, you have to be damn sure that your code is thread safe. When it isn’t, this becomes really painful. That alone would be a huge mark against it, since multi thread proofing code is hard. But Singletons also got a bad rep because they create hidden dependencies that are hard to break. Probably the most famous of them are HttpContext.Current and DateTime.Now.

Singleton may have a tattered reputation and wounded dignity, but it is still a crucially important pattern. Most of the issues that people have with the Singleton aren’t with the notion of the single instance, but with the notion of a global static gateway, which means that it becomes very hard to modify for things like tests, and it is easy to create code that is very brittle in its dependencies on its environment.

The common workaround is to break apart the notion of accessing the value from the single nature of the value. So you typically inject the value in, and something else, usually the container, is in charge of managing the lifetime of the object.
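Here is a minimal sketch of that injected approach, assuming a hypothetical Cache class and whatever container you happen to use (the registration call at the end is illustrative, not a real API):

using System.Collections.Concurrent;

// The cache itself knows nothing about being a Singleton.
public class Cache
{
    private readonly ConcurrentDictionary<string, object> items =
        new ConcurrentDictionary<string, object>();

    public object Get(string key)
    {
        object value;
        items.TryGetValue(key, out value);
        return value;
    }

    public void Set(string key, object value)
    {
        items[key] = value;
    }
}

// Consumers get the single instance injected; they never reach out to a static gateway,
// so a test can simply new up a fresh Cache.
public class ProductService
{
    private readonly Cache cache;

    public ProductService(Cache cache)
    {
        this.cache = cache;
    }
}

// Somewhere in the composition root (illustrative, not tied to a specific container):
// container.RegisterSingleton<Cache>();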

Common use cases for Singletons include caches (which would be pretty bad if they didn’t stick around) and NHibernate’s Session Factory (which is very expensive to create).

Recommendation: The notion of having just a single instance of an object is still very important, especially when you use that single instance to coordinate things. It does mean that you have multi threading issues, but those can be solved. It is a very useful pattern, but you have to watch for the pitfalls (a global static accessor that is used all over the place is one of the major ones).

Optimization story: GetNextIdentityValueWithoutOverwritingOnExistingDocuments

A customer had a problem. They were mostly using the RavenDB HiLo algorithm for saving documents to the database, which is very fast & cheap. That client, however, chose to use the identity method, which means that RavenDB will assign the value.

This is usually used if you need to have sequential values. The identity is actually being managed internally by RavenDB, and that works perfectly fine.

Except… what happens when you add replication to the mix? The documents with the identity values are replicated to the secondary server, and there we don’t have the identity value, we just have the docs being written with their full ids (users/1, users/2, users/3, etc).

So far, so good. But what happens when you have a failover and you need to write to the secondary, and you use the identity? Well, RavenDB ain’t stupid, and it won’t overwrite the users/1 document. Instead, it will search for the next available opening from the smallest identity value generated and use that. The code looks like this:

private long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key,
    IStorageActionsAccessor actions,
    TransactionInformation transactionInformation)
{
    long nextIdentityValue;
    do
    {
        nextIdentityValue = actions.General.GetNextIdentityValue(key);
    } while (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) != null);
    return nextIdentityValue;
}

This works great. Except when you have a large number of documents that have already been written. Instead of the brute force search, we now use the following approach:

public long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key,
    IStorageActionsAccessor actions,
    TransactionInformation transactionInformation,
    out int tries)
{
    long nextIdentityValue = actions.General.GetNextIdentityValue(key);

    if (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) == null)
    {
        tries = 1;
        return nextIdentityValue;
    }
    tries = 1;
    // there is already a document with this id, this means that we probably need to search
    // for an opening in potentially large data set.
    var lastKnownBusy = nextIdentityValue;
    var maybeFree = nextIdentityValue*2;
    var lastKnownFree = long.MaxValue;
    while (true)
    {
        tries++;
        if(actions.Documents.DocumentMetadataByKey(key + maybeFree, transactionInformation) == null)
        {
            if (lastKnownBusy + 1 == maybeFree)
            {
                actions.General.SetIdentityValue(key, maybeFree);
                return maybeFree;
            }
            lastKnownFree = maybeFree;
            maybeFree = Math.Max(maybeFree - (maybeFree - lastKnownBusy) / 2, lastKnownBusy + 1);
        }
        else
        {
            lastKnownBusy = maybeFree;
            maybeFree = Math.Min(lastKnownFree, maybeFree*2);
        }
    }
}

This can figure out the first free item in a range of a billion documents in under 100 tries, which I am pretty sure is good enough. (The search doubles the candidate value until it finds a free slot, then bisects back toward the last known busy value, so the number of probes grows with the logarithm of the range rather than with the number of existing documents.)


Design patterns in the test of time: Prototype

Create objects based on a template of an existing object through cloning.

More about this pattern.

This is what it looks like:

[Image: Prototype pattern example]

Surprisingly enough, there are very few useful concrete examples of this, even in the literature. A lot of the time you see reference to ConcreteImplA and ConcreteImplB.

The original impetus for the Prototype pattern was actually:

  • avoid subclasses of an object creator in the client application, like the abstract factory pattern does.
  • avoid the inherent cost of creating a new object in the standard way (e.g., using the 'new' keyword) when it is prohibitively expensive for a given application.

That is actually quite interesting. As I mentioned in the Factory Method analysis post, I like the notion of using a Factory Delegate (and thus avoiding subclassing) quite a lot. This is usually useful for behavioral objects that contain little state (it would be more accurate to say that their state is behavior, such as a class that mostly contains delegate members for different things). But for those sorts of things, you usually don’t really need to modify them after the fact, so there isn’t much of a prototype here.

The second reason is not relevant for most things today. The cost of new is so near zero as to be effectively meaningless.

But something that isn’t mentioned about this pattern is that it is very useful for multi threading. The notion of handing out a cloned object that can be modified independently of its original is key in things like caches. We make heavy use of that internally inside RavenDB, for example, although we chose a slightly more complex (and more performant) route.
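To make that concrete, here is a minimal sketch of the idea; CachedConfig and ConfigCache are made up for the example. The cache keeps a prototype instance and hands out clones, so each caller can mutate its copy without any locking:

using System.Collections.Concurrent;

public class CachedConfig
{
    public string ConnectionString { get; set; }
    public int TimeoutInSeconds { get; set; }

    // The prototype operation: produce an independent copy.
    public CachedConfig Clone()
    {
        return new CachedConfig
        {
            ConnectionString = ConnectionString,
            TimeoutInSeconds = TimeoutInSeconds
        };
    }
}

public class ConfigCache
{
    private readonly ConcurrentDictionary<string, CachedConfig> prototypes =
        new ConcurrentDictionary<string, CachedConfig>();

    public void Set(string key, CachedConfig config)
    {
        prototypes[key] = config;
    }

    // Callers get a clone, so whatever they do to it cannot affect
    // the shared prototype or any other thread.
    public CachedConfig Get(string key)
    {
        CachedConfig prototype;
        return prototypes.TryGetValue(key, out prototype) ? prototype.Clone() : null;
    }
}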

A key observation is that Prototype assumes long lived objects. Otherwise, there wouldn’t be a prototype instance to clone from. In a wide variety of applications today, that is simply not the case. Most of our objects live only for a single request. And anything whose lifetime is longer than a single request is usually persisted to stable storage, rendering the basis for the Prototype pattern’s existence moot.

Recommendation: This is still a useful pattern for a limited number of scenarios. In particular, the ability to hand out a copy of the instance from a cache means that we don’t have to worry about multi threading. That said, beyond this scenario, I haven’t found many other uses for this.

Design patterns in the test of time: Factory Method

Define an interface for creating an object, but let the classes that implement the interface decide which class to instantiate. The Factory method lets a class defer instantiation to subclasses.

More on this pattern.

Here is some sample code:

public class MazeGame {
  public MazeGame() {
     Room room1 = MakeRoom();
     Room room2 = MakeRoom();
     room1.Connect(room2);
     AddRoom(room1);
     AddRoom(room2);
  }

  protected virtual Room MakeRoom() {
     return new OrdinaryRoom();
  }
}

This pattern is quite useful, and is in fairly moderate use. For example, you can take a look at WebClient.GetWebRequest, which is an exact implementation of this pattern. I like this pattern because it allows me to keep to the Open Closed Principle: I don’t need to modify the class, I can just inherit and override it to change things.

Still, this is the classic approach. I like to mix things up a bit and not use a virtual method; instead, I do things like this:

public class MazeGame {
   public Func<Room> MakeRoom = () => new OrdinaryRoom();
}

This allows me to change how we are creating the room without even having to create a new subclass. In fact, it allows me to change this per instance.
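For example, a test or a specific game mode can swap the delegate on a single instance; a quick sketch (MagicRoom and BuildMaze are made up for the example):

var game = new MazeGame();
game.MakeRoom = () => new MagicRoom(); // only this instance produces a different kind of room
// game.BuildMaze(); // any code that calls MakeRoom picks up the new delegate, no subclass needed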

I make quite a heavy use of this in RavenDB, for example. The DocumentConventions class is basically built of nothing else.

Recommendation: Go for the lightweight Factory Delegate approach. As with all patterns, use with caution and watch for overuse & abuse. In particular, if you need to manage state between multiple delegates, fall back to the overriding approach, because you can keep the state in the subclass.

Design patterns in the test of time: Builder

The intent of the Builder design pattern is to separate the construction of a complex object from its representation. By doing so, the same construction process can create different representations.

More about this pattern.

The sample code that usually comes with this pattern is something like this:

PizzaBuilder hawaiianPizzaBuilder = new HawaiianPizzaBuilder();
Cook cook = new Cook();
cook.SetPizzaBuilder(hawaiianPizzaBuilder);
cook.ConstructPizza();
// create the product
Pizza hawaiian = cook.GetPizza();

I find this sort of code to be extremely verbose and hard to read, especially when we have a lot of options and things to do. Fluent Interfaces, however, are just an instance of the Builder pattern, and they are basically adding a modern API look & feel to the way we are actually constructing objects. Another thing to remember is that we are dealing with C#, and we have things like object initializers to do a lot of the heavy lifting for building objects. You should use those for most cases.
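For the simple cases, object initializers already cover most of what the sample above labors through; a rough sketch, assuming a plain Pizza class with settable properties:

// Just setting values? An object initializer is all you need.
var hawaiian = new Pizza
{
    Dough = "cross",
    Sauce = "mild",
    Topping = "ham and pineapple"
};

Reserve the fluent / builder style for cases where construction involves real logic or validation, like the NHibernate Configuration example in the next paragraph.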

NHibernate, for example, has the notion of a Builder, using the NHibernate.Cfg.Configuration object. It allows us to put all of the construction / validation code in one spot, and the actual runtime code in a different place (which can then assume a lot about its invariants). It also allows us to do a lot of interesting things, like serializing the builder object (to save building time), which is something that is usually hard or impossible to do with the real objects.

That said, you should be careful of code like the one listed above. What you have there is an overly abstract system, requiring multiple steps just to get somewhere. If you find yourself feeding builders into builders, please stop and think about what you are doing. If you got there, you have not simplified the construction process.

Recommendation: This is still a very useful pattern. It should absolutely not be used if all you need to do is just set some values. Reserve the Builder pattern for cases where you actually have logic and behavior associated with the building process.

A week in London: RavenDB & NHibernate training

I’ll be spending the last week of November in London, at the Skills Matter offices.

On the 26 Nov, I’ll be giving a 3 day NHibernate course.

And on the 29 Nov, I’ll be giving 2 full days of RavenDB awesomeness. This course is scheduled to run around the same time as the RavenDB 1.2 release, which leads me to the In The Brains session I’ll be giving along the way.

What is new in RavenDB 1.2

RavenDB 1.0 was exciting and fun. RavenDB 1.2 builds on top of that and adds a whole host of really nice features.
Come to hear about the new Changes API, or how you can use evil patching to make the database bow to your wishes. Learn how you can add encryption and compression to your database in a few minutes, and watch how operational tasks have become even simpler. In short, come and see all of the new stuff for RavenDB!

Design patterns in the test of time: A modern alternative to Abstract Factory–filtered dependencies

In my Abstract Factory post, I mentioned that I really don’t like the pattern, and in particular, code like this:

static IGUIFactory CreateOsSpecificFactory()
{
   string sysType = ConfigurationSettings.AppSettings["OS_TYPE"];
   if (sysType == "Win")
   {
       return new WindowsFactory();
   }
   else
   {
       return new MacFactory();
   }
}

One of the comments mentioned that this might not be ideal, but it is still better than:

if(RunningOnWindows)
{
    // code
}
else if(RunningOnMac)
{
   // code
}
else if(RunningOnLinux)
{
   // code
}

And I agree. But I think that, as the comment mentioned, a far better alternative would be using the container. You can do this using:

[OperationSystem("Windows")]
public class WindowsFactory : IGUIFactory
{
}

[OperationSystem("Linux")]
public class LinuxFactory : IGUIFactory
{
}

[OperationSystem("Mac")]
public class MacFactory : IGUIFactory
{
}
Then you just need to wire things through the container. Among other things, this means that we respect the open / closed principle. If we need to support a new system, we can just add a new class, we don’t need to modify code.
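The wiring itself does not need much; here is a minimal sketch of a reflection based resolver, assuming the OperationSystemAttribute from the snippet above exposes the OS name via a Name property (most containers can do this kind of convention based registration for you):

using System;
using System.Linq;

public static class GuiFactoryResolver
{
    public static IGUIFactory CreateOsSpecificFactory(string currentOs)
    {
        // find the one implementation whose attribute matches the current OS
        var factoryType = typeof(IGUIFactory).Assembly
            .GetTypes()
            .Where(t => typeof(IGUIFactory).IsAssignableFrom(t) && t.IsClass && t.IsAbstract == false)
            .FirstOrDefault(t => t.GetCustomAttributes(typeof(OperationSystemAttribute), false)
                .Cast<OperationSystemAttribute>()
                .Any(attr => attr.Name == currentOs)); // Name is assumed, see above

        if (factoryType == null)
            throw new InvalidOperationException("No IGUIFactory is registered for " + currentOs);

        return (IGUIFactory)Activator.CreateInstance(factoryType);
    }
}

A real container would let you do the same scan once at startup and pick the right implementation at resolve time, which is the same idea with less code.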

Remember, the Go4 book was written in the age of C++. Reflection didn’t exist, and that means that a lot of patterns do by hand things that can happen automatically.

Design patterns in the test of time: Abstract Factory

The essence of the Abstract Factory pattern is to "Provide an interface for creating families of related or dependent objects without specifying their concrete classes".

More about this pattern.

Here is some sample code:

static IGUIFactory CreateOsSpecificFactory()
{
    string sysType = ConfigurationSettings.AppSettings["OS_TYPE"];
    if (sysType == "Win")
    {
        return new WindowsFactory();
    }
    else
    {
        return new MacFactory();
    }
}

I am in two minds about this pattern. On the one hand, we have pretty damning evidence that this has been really bad for the industry at large. For details, you can see the Why I Hate Frameworks post. When I first saw that, just shortly after reading the Go4 book for the first time, I was in tears from laughing. But the situation it describes is true, accurate and still painful today.

Case in point, WCF suffers from a serious overuse of abstract factories. For example, IInstanceProvider (and I just love that in order to wire that in you usually have to implement IServiceBehavior).

As the I Hate Frameworks post mentioned:

Each hammer factory factory is built for you by the top experts in the hammer factory factory business, so you don't need to worry about all the details that go into building a factory.

Awesome, or not, as the case may be.

Then again, it is a useful pattern. The problem is that in the general case, creating objects that create objects (that create even more objects) is a pretty good indication that your architecture is already pretty hosed. You should strive for an architecture that has a minimal number of levels, and an abstract factory is a whole new level all on its own.

Recommendation: Avoid if you can. If you run into a place where you think that needs this, consider if you can simplify your architecture to the point where this is not required.

Design patterns in the test of time

Design Patterns: Elements of Reusable Object-Oriented Software

Amazon tells me that I purchased this book in Sep 2004, and I have since then misplaced it, for some reason. I remember how important this book was in shaping how I thought about software. For the first time, I actually had the words to discuss what I was doing, and proven pathways to success. Of course, we all know that… it didn’t end up being quite so good.

In particular, it led to Cargo Cult Programming. From my perspective, it looks like a lot of people made the assumption that their application is good because it has design patterns, not because design patterns will result in simpler code.

Now, this book came out in 1994. And quite a bit has changed in the world of software since that time. In this series, I am going to take a look at all those design patterns and see how they hold up to the test of time. Remember, the design patterns were written at a time when most software was single user client applications (think Win Forms, then reduce by nine orders of magnitude), no web or internet, no multi threading, very little networking, very slow upgrade cycles and far slower machines. None of those assumptions are relevant to how we build software today, but they form the core of the environment that was relevant when the book was written. I think that it is going to be interesting to see how those things hold up.

And because I know the nitpickers, let me set up the context. We are talking about design patterns within the context of .NET applications, aimed at either system software (like RavenDB) or enterprise applications. This is the context I am talking about. Bringing arguments / options from additional contexts is not relevant to the discussion.

I am also not going to discuss the patterns themselves in any depth; if you want that, go and read the book. It is still a very good one, even though it came out almost 20 years ago.

Uber Prof V2.0 is now in Public Beta

Well, we worked quite a bit on that, but Uber Prof (NHibernate Profiler, Entity Framework Profiler, Linq to SQL Profiler, etc) version 2.0 is now out for public beta.

We made a lot of improvements. Including performance, stability and responsiveness, but probably the most important thing from the user perspective is that we now support running the profiler in production, and even on the cloud.

We will have the full listing of all the new goodies up on the company site soon, including detailed instructions on how to enable production profiling and on cloud profiling, but I just couldn’t wait to break the news to you.

In fact, along with V2.0 of the profilers, we have a brand new site for our company, which you can check here: http://hibernatingrhinos.com/.

To celebrate the fact that we are going into beta, we are also offering a 20% discount for the duration of the beta.

Nitpicker corner: please remember that this is a beta, there are bound to be problems, and we will fix them as soon as we can.

Multi threaded design guidelines for libraries: Part III

In this post, I want to talk about libraries that want or need to not only support being run in multiple threads, but actually want to use multiple threads themselves. Remember, you are a library, not a framework. You are a guest in someone else’s home, and you shouldn’t litter.

The first thing to remember is error handling. That actually comes in two parts. First, unhandled exceptions from a thread will kill the application. There are very few things that people will find more annoying with your library than your errors killing their application. Second, and almost as important, you should have a way to report those errors.

Even more annoying than killing my application, failing to do something silently and in a way that is really hard to debug is going to cause major hair loss all around.

There are several scenarios that we need to consider:

  • Long running threads – I need to do something in a background thread that would usually live as long as the application itself.
  • Short term threads – I need to do something that requires a lot of threads, just for a short time.
  • Timeouts / delays / expirations – I need to do something every X amount of time.

In the first case, of long running threads, there isn’t much that can be done. You want to handle errors, obviously, and you want to make it crystal clear when you spin up your threads, and when / how you tear them down again. Another important aspect is that you should name your threads. This is important because it means that when debugging things, we can figure out what this or that thread is doing more easily.
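Here is a minimal sketch of what that looks like for a long running, library owned thread; the class and its work are made up, the point is the explicit lifetime, the thread name, and the error event instead of an unhandled exception:

using System;
using System.Threading;

public class BackgroundCleaner : IDisposable
{
    private readonly Thread thread;
    private volatile bool stopped;

    // let the host application decide what to do with failures
    public event Action<Exception> OnError = delegate { };

    public BackgroundCleaner()
    {
        thread = new Thread(Run)
        {
            Name = "MyLibrary.BackgroundCleaner", // shows up in the debugger
            IsBackground = true
        };
        thread.Start();
    }

    private void Run()
    {
        while (stopped == false)
        {
            try
            {
                // the actual background work goes here
                Thread.Sleep(1000);
            }
            catch (Exception e)
            {
                OnError(e); // report it, never let it escape and kill the host app
            }
        }
    }

    public void Dispose()
    {
        stopped = true;
        thread.Join();
    }
}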

The next approach is much more common: you just need some way to execute some code in parallel. The easiest thing to do is to go to new Thread(), ThreadPool.QueueUserWorkItem or Task.Factory.StartNew(). Sure, this is easy to do, and it is also perfectly wrong.

Why is that, you say?

Quite simply, it ain’t your app. You don’t get to make such decisions for the application that is hosting your library. Maybe the app needs to conserve threads to serve requests? Maybe it is trying to use fewer threads to reduce CPU load and save power on a laptop running on batteries? Maybe they are trying to debug something and all those threads popping around are driving them crazy?

The polite thing to do when you recognize that you have a threading requirement in your application is to:

  • Give the user a way to control that.
  • Provide a default implementation that works.

A good example of that can be seen in RavenDB’s sharding implementation.

public interface IShardAccessStrategy
{
    event ShardingErrorHandle<IDatabaseCommands> OnError;

    T[] Apply<T>(IList<IDatabaseCommands> commands, ShardRequestData request, Func<IDatabaseCommands, int, T> operation);
}

As you can see, we abstracted the notion of making multiple requests. We provide you out of the box with sequential and parallel implementations for this.

The last item, timeouts / expirations / delays, is also something that you want to give the user of your library control over. Ideally, using something like the strategy above. By all means, provide a default implementation and wire it in so that nothing is needed from the user.
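The same strategy idea works for timers and delays; a rough sketch of what such an abstraction could look like (this is not an actual RavenDB interface, just an illustration of handing the decision to the user while shipping a sane default):

using System;
using System.Threading;

public interface IScheduledWorkStrategy
{
    // run the given work every interval, reporting failures instead of crashing the host
    IDisposable Schedule(TimeSpan interval, Action work, Action<Exception> onError);
}

// Default implementation the library wires in if the user doesn't care.
public class TimerScheduledWorkStrategy : IScheduledWorkStrategy
{
    public IDisposable Schedule(TimeSpan interval, Action work, Action<Exception> onError)
    {
        return new Timer(_ =>
        {
            try { work(); }
            catch (Exception e) { onError(e); }
        }, null, interval, interval);
    }
}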

But it is important to have control over those things. The expert users for your library will want and need it.

Multi threaded design guidelines for libraries: Part II

Next on the agenda for writing correct multi threaded libraries, how do you handle shared state?

The easiest way to handle that is to use the same approach that NHibernate and the RavenDB Client API use. You have a factory / builder / fizzy object that you use to construct all of your state; this is done on a single thread, and then you call a method that effectively “freezes” this state from then on.

All future accesses to this state are read only. This is really good for doing things like reflection lookups, loading configuration, etc.
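A bare bones sketch of that freeze approach (the class and property are invented for the example; NHibernate and RavenDB do the same thing with a lot more going on):

using System;

public class ConnectionConventions
{
    private bool frozen;
    private string connectionString = "Data Source=local";

    public string ConnectionString
    {
        get { return connectionString; }
        set
        {
            AssertNotFrozen();
            connectionString = value;
        }
    }

    // called once, on a single thread, before the instance is handed out to everyone
    public void Freeze()
    {
        frozen = true;
    }

    private void AssertNotFrozen()
    {
        if (frozen)
            throw new InvalidOperationException("The configuration is frozen and can no longer be modified.");
    }
}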

But what happens when you actually need shared mutable state? A common example is a cache, or global statistics. This is where you actually need to pull out your copy of Concurrent Programming on Windows and very carefully write true multi threaded code.

It is over a thousand pages, you say? Sure, and you need to know all of this crap to get multi threading working properly. Multi threading is scary, hard and should not be used.

In general, even if you actually need shared mutable state, you really want to make sure that there are clear boundaries between the things that can be shared among multiple threads and the things that cannot. And you want to do most of the work in the parts where you don’t have to worry about multi threading.

It also means that your users have a much easier time figuring out what the expected behavior of the system is. This is very important with the advent of C# 5.0, since async APIs are going to be a lot more common. Sure, you use the underlying async primitives, but did you consider what may happen when you issue multiple concurrent async requests? Is that allowed?

With C# 5.0, you can usually treat async code as if it was single threaded, but that breaks down if you are allowing multiple concurrent async operations.

In RavenDB and NHibernate, we use the notion of Document Store / Session Factory – which are created once, safe for multi threading and are usually singletons. And then we have the notion of sessions, which are single threaded, easy & cheap to create and follow the notion of one per thread (actually, one per work unit, but that is beside the point).

In my next post, I’ll discuss what happens when your library actually wants to go beyond just being safe for multi threading, when the library wants to use threading directly.

Multi threaded design guidelines for libraries: Part I

The major difference between libraries and frameworks is that a framework is something that runs your code, and is in general in control of its own environment, while a library is something that you use in your own code, where you control the environment.

Examples for frameworks: ASP.Net, NServiceBus, WPF, etc.

Examples for libraries: NHibernate, RavenDB Client API, JSON.Net, SharpPDF, etc.

Why am I talking about the distinction between frameworks and libraries in a post about multi threaded design?

Simple: there are vastly different rules for multi threaded design with frameworks and libraries. In general, frameworks manage their own threads, and will let your code use one of their threads. On the other hand, libraries will use your own threads.

The simple rule for multi threaded design for libraries? Just don’t do it.

Multi threading is hard, and you are going to cause issues for people if you don’t know exactly what you are doing. Therefore, just write for a single threaded application and make sure to hold no shared state.

For example, JSON.Net pretty much does this. The sole place where it does do multi threading is where it is handling caching, and it must be doing this really well because I never paid it any mind and we got no error reports about it.

But the easiest thing to do is to just not support multi threading for your objects. If the user wants to use the code from multiple threads, he is welcome to instantiate multiple instances and use one per thread.
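That per thread approach is trivial for the caller to arrange; a sketch using a hypothetical MyParser class that is not thread safe:

using System.Threading;

public static class PerThreadParser
{
    // each thread lazily gets its own parser, so there is no shared mutable state to protect
    private static readonly ThreadLocal<MyParser> instance =
        new ThreadLocal<MyParser>(() => new MyParser());

    public static MyParser Current
    {
        get { return instance.Value; }
    }
}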

In my next post, I’ll talk about what happens when you actually do need to hold some shared state.

You release too often, you don’t release enough: RavenDB Plans

A common complaint from users with RavenDB 1.0 has been that we release too often, which meant that they had to upgrade RavenDB too often.

We made a clean break with 1.2 (which allowed us to handle several very large issues cleanly) and we have been releasing unstable versions at a steady rate of 5 – 8 builds a week. Now, of course, we get complaints about not releasing enough.

At any rate, the following is a rough plan for where we will be in the near future. RavenDB 1.2 is currently running in production (running our internal infrastructure) and we have been doing a lot of compatibility, stability and performance work. There are still a lot of little tweaks that we want to put into the product, but major features are likely to be deferred to post 1.2.

We have just completed triage of the things that we intend to do for 1.2, and we currently have ~60 - 70 items to go through in terms of features to implement, bug fixes to do, etc. Toward the end of this month, we intend to stop all new feature development and focus on bug fixes, stability and performance. 

By mid Nov, we want to have an RC version out that you can take out to town, and the release is scheduled for late Nov or early Dec. Post 1.2 release, we will go back to a stable release every 4 - 6 weeks.


This code ain’t production ready

Greg Young has a comment on my Rhino Events post that deserves to be read in full. Go ahead, read it, I’ll wait.

Since you didn’t, I’ll summarize. Greg points out numerous faults and issues that aren’t handled or could be handled better in the code.

That is excellent, from my point of view, if only because it gives me more stuff to think about for the next time.

But the most important thing to note here is that Greg is absolutely correct about something:

I have always said an event store is a fun project because you can go anywhere from an afternoon to years on an implementation.

Rhino Events is a fun project, and I’ve learned some stuff there that I’ll likely use again later on. But above everything else, this is not production worthy code. It is just some fun code that I liked. You may take it and do whatever you like with it, but mostly I was concerned with finding the right ways to actually get things done, not with considering all of the issues that might arise in a real production environment.

Introducing Rhino.Events

After talking so often about how much I consider OSS work to be indicative of passion, I got bummed when I realized that I hadn’t actually done any OSS work for a while, if you exclude RavenDB.

I was recently at lunch at a client, when something he said triggered a bunch of ideas in my head. I am afraid that I made for poor lunch conversation, because all I could see in my head was code and IO blocks moving around in interesting ways.

At any rate, I sat down at the end of the day and wrote a few spikes, then I decided to write the actual thing in a way that would actually be useful.

What is Rhino Events?

It is a small .NET library that gives you an embeddable event store. Also, it is freakishly fast.

How fast is that?

[Image: the benchmark code]

Well, this code writes a fairly non trivial event 10,000,000 times (that is ten million times) to disk.

It does this at a rate of about 60,000 events per second. And that includes the full life cycle (serializing the data, flushing to disk, etc).

Rhino.Events has the following external API:

[Image: the Rhino.Events external API]

As you can see, we have several ways of writing events to disk, always associating to a stream, or just writing the latest snapshot.

Note that the write methods actually return a Task. You can ignore that Task, if you wish, but this is part of how Rhino Events gets to be so fast.

When you call EnqueueEventAsync, we register the value in a queue and have a background process write all of the pending events to disk. This means that there is only one thread actually doing writes, which means that we can batch all of those writes together and get really nice performance out of it.

We can also reduce the number of times that we have to actually flush to disk (fsync), so we only do that when we run out of things to write or at predefined times (usually after a full 200 ms of non stop writes). Only after the information was fully flushed to disk will we set the task status to completed.

This is actually a really interesting approach from my point of view, and it makes the entire thing transactional, in the sense that you can wait to be sure that the event has been persisted to disk (and yes, Rhino Events is fully ACID) or you can fire & forget it, and move on with your life.

A few words before I let you go off and play with the bits.

This is a Rhino project, which means that it is a fully OSS one. You can take the code and do pretty much whatever you want with it. But I, or Hibernating Rhinos, will not be providing support for it.

You can get the bits here: https://github.com/ayende/Rhino.Events


Handling entities validations in RavenDB

This post came out of a stack overflow question. The user had the following code:

public void StoreUser(User user)
{
    //Some validation logic
    if(string.IsNullOrWhiteSpace(user.Name))
        throw new Exception("User name can not be empty");

    Session.Store(user);
}

But he noted that this will not work for other approaches, such as this:

var u1 = Session.Load<User>(1);
u1.Name = null; //change is tracked and will persist on the next save changes
Session.SaveChanges();

This is because RavenDB tracks the entity and will persist it if there has been any changes when SaveChanges is called.

The question was:

Is there someway to get RavenDB to store only a snapshot of the item that was stored and not track further changes?

The answer is, as is often the case if you run into hardship with RavenDB, you are doing something wrong. In this particular case, that wrongness is the fact that you are trying to do validation manually. This means that you always have to remember to call it, and that you can’t use a lot of the good stuff that RavenDB gives you, like change tracking. Instead, RavenDB contains the hooks to do it once, and do it well.

public class ValidationListener : IDocumentStoreListener
{
    readonly Dictionary<Type, List<Action<object>>> validations = new Dictionary<Type, List<Action<object>>>();

    public void Register<T>(Action<T> validate)
    {
        List<Action<object>> list;
        if(validations.TryGetValue(typeof(T), out list) == false)
            validations[typeof (T)] = list = new List<Action<object>>();

        list.Add(o => validate((T) o));
    }

    public bool BeforeStore(string key, object entityInstance, RavenJObject metadata, RavenJObject original)
    {
        List<Action<object>> list;
        if (validations.TryGetValue(entityInstance.GetType(), out list))
        {
            foreach (var validation in list)
            {
                validation(entityInstance);
            }
        }
        return false;
    }

    public void AfterStore(string key, object entityInstance, RavenJObject metadata)
    {
    }
}

This will be called by RavenDB whenever we save to the database. We can now write the validation / registration code like this:

var validationListener = new ValidationListener();
validationListener.Register<User>(user =>
    {
        if (string.IsNullOrWhiteSpace(user.Name))
            throw new Exception("User name can not be empty");
    });
store.RegisterListener(validationListener);

And that is all that she wrote.


ListOfParams and other horrible things that you shouldn’t bring to RavenDB

In the mailing list, we got asked about an issue with code that looked like this:

public abstract class Parameter
{
    public String Name { get; set; }
}

public class IntArrayParameter : Parameter
{
    public Int32[,] Value { get; set; }
}

I fixed the bug, but that was a strange thing to do, I thought. Happily, the person asking that question was actually taking part in a RavenDB course, and I could sit with him and understand the whole question.

It appears that in their system, they have a lot of things like that:

  • IntParameter
  • StringParameter
  • BoolParameter
  • LongParameter

And along with that, they also have a coordinating class:

public class ListOfParams
{
   public List<Parameter> Values { get; set; }
}

The question was, could they keep using the same approach using RavenDB? They were quite anxious about this, since they had a need for the capabilities of this in their software.

This is why I hate Hello World questions. I could answer just the question that was asked, and that was it. But the problem is quite different.

You might have recognized it by now: what they have here is an Entity Attribute Value system. A well known anti-pattern in the relational database world, and one of the few ways to actually get a dynamic schema in that world.

In RavenDB, you don’t need all of those things. You can just get things done. Here is the code that we wrote to replace the above monstrosity:

public class Item : DynamicObject
{
    private Dictionary<string, object> vals = new Dictionary<string, object>();

    public string StaticlyDefinedProp { get; set; }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        return vals.TryGetValue(binder.Name, out result);
    }

    public override bool TrySetMember(SetMemberBinder binder, object value)
    {
        if(binder.Name == "Id")
            return false;
        vals[binder.Name] = value;
        return true;
    }

    public override bool TrySetIndex(SetIndexBinder binder, object[] indexes, object value)
    {
        var key = (string) indexes[0];
        if(key == "Id")
            return false;
        vals[key] = value;
        return true;
    }

    public override bool TryGetIndex(GetIndexBinder binder, object[] indexes, out object result)
    {
        return vals.TryGetValue((string) indexes[0], out result);
    }

    public override IEnumerable<string> GetDynamicMemberNames()
    {
        return GetType().GetProperties().Select(x => x.Name).Concat(vals.Keys);
    }
}

Not only will this class handle the dynamics quite well, it also serializes to idiomatic JSON, which means that querying it is about as easy as you could ask for.
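Using it is just plain dynamic C#; a quick sketch, assuming an existing document store (the property names are made up):

dynamic item = new Item();
item.StaticlyDefinedProp = "some value";
item.Color = "red";   // stored in the internal dictionary
item["Size"] = 42;    // the indexer goes to the same place

using (var session = store.OpenSession())
{
    session.Store(item);
    session.SaveChanges();
}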

The EAV schema was created because RDBMSs aren’t suitable for dynamic work, and like many other things from the RDBMS world, this problem just doesn’t exist for us in RavenDB.


Get thou out of my head, damn idea

Sometimes I get ideas, and they just won’t leave my head no matter what I do.

In this case, I decided that I wanted to see what it would take to implement an event store in terms of writing a fully managed version.

I am not really interested in the actual event store, I care a lot more about the actual implementation idea that I had (I/O queues in append only mode, if you care to know).

After giving it some thought, I managed to create a version that allows me to write the following code:

var diskData = new OnDiskData(new FileStreamSource(), "Data");

var data = JObject.Parse("{'Type': 'ItemCreated', 'ItemId': '1324'}");
var sp = Stopwatch.StartNew();
Parallel.For(0, 1000*10, i =>
    {
        var tasks = new Task[1000];
        for (int j = 0; j < 1000; j++)
        {
            tasks[j] = diskData.Enqueue("users/" + i, data);
        }
        Task.WaitAll(tasks);
    });

Console.WriteLine(sp.ElapsedMilliseconds);

Admittedly, it isn’t really interesting client code, but it is plenty good enough for what I need, and it allowed me to check something really interesting: just how far would I have to go to actually get really good performance. As it turned out, not that far.

This code writes 10 million events, and it does so in under 1 minute (on my laptop, SSD drive). Just to give you some idea, that is > 600 MB of events, and about 230 events per millisecond, or about 230 thousand events per second. Yes, that is 230,000 events / sec.

The limiting factor seems to be the disk, and I have some ideas on how to improve that. I am still getting roughly 12MB/s, so there is certainly room for improvement.

How does this work? Here is the implementation of the Enqueue method:

public Task Enqueue(string id, JObject data)
{
    var item = new WriteState
        {
            Data = data,
            Id = id
        };

    writer.Enqueue(item);
    hasItems.Set();
    return item.TaskCompletionSource.Task;
}

In other words, this is a classic producer/consumer problem.

The other side reads the events from the queue and writes them to disk. There is just one thread doing that, and it is always appending to the end of the file. Moreover, because of the way it works, we gain the ability to batch a lot of them together into a stream of really nice IO calls that optimize the actual disk access. Only when we have finished with a batch of items and flushed them to disk do we complete the tasks, so the fun part is that for all intents and purposes, we are doing this while preserving the transactionality of the system. Once the task returned from Enqueue has completed, we can be sure that the data is fully saved on disk.
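The consumer side isn’t shown here, so the following is just my rough approximation of what that single writer loop looks like, not the actual code; writer and hasItems are the fields used by the Enqueue method above, AppendToFile and fileStream are hypothetical:

// runs on the single background thread
private void ConsumeWrites()
{
    while (true)
    {
        hasItems.WaitOne();
        hasItems.Reset();

        var batch = new List<WriteState>();
        WriteState item;
        while (writer.TryDequeue(out item))
        {
            AppendToFile(item.Id, item.Data); // always appending to the end of the file
            batch.Add(item);
        }

        if (batch.Count == 0)
            continue;

        fileStream.Flush(true); // a single fsync covers the whole batch

        // only now do we tell the callers that their event is durable
        foreach (var completed in batch)
            completed.TaskCompletionSource.SetResult(null);
    }
}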

That was an interesting spike, and I wonder where else I would be able to make use of something like this in the future.

Yes, those are pretty small events, and yes, that is a fake test, but the approach seems to be very solid.

And just for fun, with absolutely no optimizations what so ever, no caching, no nothing, I am able to load 1,000 events per stream in less than 10 ms.

Awesome indexing with RavenDB

I am currently teaching a course in RavenDB, and as usual during a course, we keep doing a lot of work that pushes what we do with RavenDB. Usually because we try to come up with new scenarios on the fly and adapt to the questions from the students.

In this case, we were going over the map/reduce stack, and we kept coming up with more and more complex examples and how to handle them, and then we got to this scenario.

Given the following class structure:

public class Animal
{
    public string Name { get; set; }
    public string Species { get; set; }
    public string Breed { get; set; }
}

Give me the count of all the species and all the breeds.  That is pretty easy to do, right?  In SQL, you would write it like this:

SELECT Species, Breed, Count(*) FROM Animals
GROUP BY Species, Breed

And that is nice, but it still means that you have to do some work on the client side to merge things up to get the final result, since we want something like this:

  • Dogs: 6
    • German Shepherd: 3
    • Labrador: 1
    • Mixed: 2
  • Cats: 3
    • Street: 2
    • Long Haired: 1

In RavenDB, we can express the whole thing in a simple succinct index:

public class Animals_Stats : AbstractIndexCreationTask<Animal, Animals_Stats.ReduceResult>
{
    public class ReduceResult
    {
        public string Species { get; set; }
        public int Count { get; set; }
        public BreedStats[] Breeds { get; set; }

        public class BreedStats
        {
            public string Breed { get; set; }
            public int Count { get; set; }
        }
    }

    public Animals_Stats()
    {
        Map = animals =>
                from animal in animals
                select new
                    {
                        animal.Species,
                        Count = 1,
                        Breeds = new [] {new {animal.Breed, Count = 1}}
                    };
        Reduce = animals =>
                    from r in animals
                    group r by r.Species
                    into g
                    select new
                        {
                            Species = g.Key,
                            Count = g.Sum(x => x.Count),
                            Breeds = from breed in g.SelectMany(x => x.Breeds)
                                    group breed by breed.Breed
                                    into gb
                                    select new {Breed = gb.Key, Count = gb.Sum(x => x.Count)}
                        };
    }
}

And the result of this beauty?

[Image: the index query results]

And that is quite pretty, even if I say so myself.


RavenDB Bootcamp: Milano–special offer

I am on my third consecutive RavenDB course right now, and I just realized that I’ll be doing a RavenDB bootcamp in three weeks as well.

This means 10 straight hours of intensive RavenDB, taking you from newbie status all the way to being a master of your (document) domain.

RavenDB training has really picked up, which is good in the sense that a lot of people want to use it. It is bad in the sense that I would like to spend a little less time traveling.

Therefore, I decided to make things simpler for me by cutting the cost of the RavenDB bootcamp by 35% (!). You can use the following link to register for it.

Am I crazy? I hope that I am crazy like the guy on the left, and not the guy on the right.

 

The basic idea goes like this.

RavenDB training is something that I want as many people as possible to take. And I want to do it in as big a batch as possible, as well. I have already had to make appointments to go back to places I was just at to give the next round of training, and my hope is that this special offer will net all the people who care about RavenDB around Milano, so that I won’t have to come back for a while to complete things.

Then again, there is also the fact that I was traveling on my wife’s birthday, which might have something to do with that. I would really like to avoid missing out on other important events. I am already in trouble for not paying attention to our anniversary, and that didn’t help.


Thou shall not do threading unless you know what you are doing

I had a really bad couple of days. I am pissed, annoyed and angry, for totally not technical reasons.

And then I ran into this issue, and I just want to throw something really hard at someone, repeatedly.

The issue started from this bug report:

NetTopologySuite.Geometries.TopologyException was unhandled
  HResult=-2146232832
  Message= ... trimmed ...
  Source=NetTopologySuite
  StackTrace:
       at NetTopologySuite.Operation.Overlay.Snap.SnapIfNeededOverlayOp.GetResultGeometry(SpatialFunction opCode)
       at NetTopologySuite.Operation.Union.CascadedPolygonUnion.UnionActual(IGeometry g0, IGeometry g1)
       at NetTopologySuite.Operation.Union.CascadedPolygonUnion.Worker.Execute()
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()

At first, I didn’t really realize why it was my problem. I mean, it is an NTS problem, isn’t it?

Except that this particular issue actually crashed RavenDB (don’t worry, it is unstable builds only). The reason it crashed RavenDB? An unhandled thread exception.

What I can’t figure out is what on earth is going on. So I took a look at the code, have a look:

[Image: the CascadedPolygonUnion code]

I checked, and this isn’t code that has been ported from the Java code. You can see the commented code there? That is from the Java version.

And let us look at what the Execute method does:

[Image: the Worker.Execute method]

So let me see if I understand. We have a list of stuff to do, so we spin up threads, recursively, and then we wait on them. I think that the point was to optimize things somehow by parallelizing the work between the two halves.

Do you know what the real killer is? If we assume that we have a geometry with just 20 items on it, this will generate twenty two threads.

Leaving aside the issue of not handling errors properly (and killing the entire process because of this), the sheer cost of creating the threads is going to kill this program.

Libraries should be made to be thread safe (I already had to fix a thread safety bug there),  but they should not be creating their own threads unless it is quite clear that they need to do so.

I believe that this is a case of a local optimization for a specific scenario, and it carries all of the issues associated with local optimizations. It solves one problem and opens up seven others.

Lucene is beautiful

So, after I finished telling you how much I don’t like the lucene.net codebase, what is this post about?

Well, I don’t like the code, but then again, I generally don’t like to read low level code. The ideas behind Lucene are actually quite amazingly powerful in their simplicity.

At its core, Lucene is just a set of sorted dictionaries on disk (greatly simplified, I know). Everything else is built on top of that, and if you grok what is going on there, you would be quite amazed at the number of things that this has made possible.

Indexing in Lucene is done by a pipeline of documents and fields and analyzers, which all participate together to generate those dictionaries. Searching in Lucene is done by traversing those dictionaries in various ways, and combining the results in interesting ways.
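To make the “sorted dictionaries” point concrete, here is a toy illustration of the core structure; this is not Lucene code, just the essential shape: a sorted map from term to the documents containing it, built by a trivially simple analyzer:

using System.Collections.Generic;
using System.Linq;

public class ToyInvertedIndex
{
    // term -> sorted set of document ids that contain the term
    private readonly SortedDictionary<string, SortedSet<int>> terms =
        new SortedDictionary<string, SortedSet<int>>();

    public void Index(int docId, string text)
    {
        // a trivially simple "analyzer": lower case, split on spaces
        foreach (var term in text.ToLowerInvariant().Split(' '))
        {
            SortedSet<int> docs;
            if (terms.TryGetValue(term, out docs) == false)
                terms[term] = docs = new SortedSet<int>();
            docs.Add(docId);
        }
    }

    public IEnumerable<int> Search(string term)
    {
        SortedSet<int> docs;
        return terms.TryGetValue(term.ToLowerInvariant(), out docs) ? docs : Enumerable.Empty<int>();
    }
}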

I am not going to go into details about how it works, you can read all about that here. The important thing is that once you have grasped the essential structure inside lucene, the rest are just details.

The concept and the way the implementation fell out are quite beautiful.
