Ayende @ Rahien

Refunds available at head office

Querying your way to madness: the Facebook timeline

Facebook certainly changed the way we are doing things. Sometimes, that isn’t for the best, as can be seen from the way too much time that humanity as a whole spends watching cat videos.

One of the ways that Facebook impacted our professional lives is that a lot of people look at it as a blueprint for how they want their application to be. I am not going to touch on whether that is a good thing or not; suffice to say that this is a well known model that is very easy for a lot of users to grasp.

It is also a pretty hard model to actually design and build. I recently had a call from a friend who was tasked with building a Facebook-like timeline. Like most such tasks, we have the notion of a social network, with people following other people. I assume that this isn’t actually YASNP (yet another social network project), but I didn’t check too deeply.

The question was how to actually build the timeline. The first thing that most people would try is something like this:

    var user = session.Load<User>(userId);
    // Naive approach: query the latest items from everyone the user follows
    var timelineItems =
       session.Query<Item>()
          .Where(x => x.Source.In(user.Following))
          .OrderByDescending(x => x.PublishedAt)
          .Take(30)
          .ToList();

Now, this looks good, and it would work, as long as you have a small number of users and no one follows a lot of people. And as long as you don’t have a lot of items. And as long as you don’t have to do any additional work. When any of those assumptions is broken… well, welcome to unpleasantville, population: you.

It can’t work. And I don’t care what technology you are using for storage. You can’t create a good solution using queries for something like the timeline.

Nitpicker corner:

  • If you have users that are following a LOT of people (and you will have those), you are likely to get into problems with the query.
  • The more items you have, the slower this query becomes, since you need to sort them all before you can return results. And you are likely to have a LOT of them.
  • You can’t really shard this query nicely or easily.
  • You can’t apply additional filtering in any meaningful way.

Let us consider the following scenario. Let us assume that I care for this Rohit person. But I really don’t care for Farmville.

hide farmville ribbon

And then:

hide farmville

Now, try to imagine doing this sort of thing in a query. For fun, assume that I do care for Farmville updates from some people, but not from all. That is what I meant when I said that you want to do meaningful filtering.

You can’t do this using queries. Not in any real way.

Instead, you have to turn it around. You would do something like this:

    var user = session.Load<User>(userId);
    // Read the pre-built timeline documents for this user instead of querying all items
    var timelineItems = session.Query<TimeLineItems>()
          .Where(x => x.ForUser == userId)
          .OrderBy(x => x.Date)
          .ToList();

Note how we structure this. There is a set of TimeLineItems objects, which store a bit of information about a set of items. Usually we would have one per user per day. Something like:

  • users/123/timeline/2013-03-12
  • users/123/timeline/2013-03-13
  • users/123/timeline/2013-03-14

That means that we get well scoped values, we only need to search on a single set of items (easily sharded, with a well known id, which means that we can also just load them by id, instead of querying for them).
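To make the id scheme concrete, here is a minimal sketch of how the per-day document ids could be computed up front. The TimelineIds helper and its method names are made up for illustration; they are not part of RavenDB or of the code above. The only thing taken from the post is the users/123/timeline/2013-03-12 format.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: builds well known ids for per-user, per-day timeline documents.
public static class TimelineIds
{
    // "users/123" + 2013-03-12 -> "users/123/timeline/2013-03-12"
    public static string For(string userId, DateTime day)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "{0}/timeline/{1:yyyy-MM-dd}", userId, day);
    }

    // Ids for the last `days` days, oldest first, ending with `today`.
    public static List<string> LastDays(string userId, DateTime today, int days)
    {
        var ids = new List<string>(days);
        for (var i = days - 1; i >= 0; i--)
            ids.Add(For(userId, today.AddDays(-i)));
        return ids;
    }
}
```

Because the ids are computed rather than searched for, they can be handed to a load-by-id call on the session, and no index has to be involved at all.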

Of course, that means that you have to have something that builds those timeline documents. That is usually an async process that runs whenever a user updates something. It goes something like this:

    public void UpdateFollowers(string itemId)
    {
        // Load the item and, in the same request, the user that published it
        var item = session.Include<Item>(x => x.UserId)
            .Load(itemId);

        var user = session.Load<User>(item.UserId);

        // user.Followers is a list of documents with batches of followers
        // we assume that we might have a LOT, so we use this technique
        // to avoid loading them all into memory at once
        // http://ayende.com/blog/96257/document-based-modeling-auctions-bids
        foreach (var followersDocId in user.Followers)
        {
            NotifyFollowers(followersDocId, item);
        }
    }

    public void NotifyFollowers(string followersDocId, Item item)
    {
        // Load one batch of followers, including the follower user documents
        var followers = session.Include<FollowersCollection>(x => x.Followers)
            .Load(followersDocId);

        foreach (var follower in followers.Followers)
        {
            var user = session.Load<User>(follower);
            // Each follower's own settings decide whether the item is shown
            if (user.IsMatch(item) == false)
                continue;
            AddToTimeLine(follower, item);
        }
    }

As you can see, we are batching the operation: we load the followers in batches and, based on their settings, decide whether to let that item be added to their timeline or not.

Note that this has a lot of implications. Different people will see this show up in their timeline at different times (but usually very close to one another). Your memory usage is kept low, because you are only processing some of the followers at any given point in time. For users with a LOT of followers, and there will be some like those, you might want to build special code paths, but this should be fine even at its current stage.
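The per-follower decision is the part a query cannot express, so here is a small in-memory sketch of it. Everything in it (FeedItem, Subscriber, FanOut, the "app|author" exception key) is a hypothetical stand-in, not the RavenDB model from the post. It only illustrates the idea of filtering while writing: hide Farmville in general, but keep it from the few people you do care about.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public class FeedItem
{
    public string Id;
    public string AuthorId;
    public string App;            // e.g. "Farmville"; null for a plain status update
    public DateTime PublishedAt;
}

public class Subscriber
{
    public string Id;
    // Apps this follower has hidden.
    public HashSet<string> HiddenApps = new HashSet<string>();
    // Exceptions to the above, stored as "app|authorId".
    public HashSet<string> AllowedAnyway = new HashSet<string>();

    public bool IsMatch(FeedItem item)
    {
        if (item.App == null || !HiddenApps.Contains(item.App))
            return true;
        return AllowedAnyway.Contains(item.App + "|" + item.AuthorId);
    }
}

public static class FanOut
{
    // timelines maps a timeline document id to the item ids it holds.
    public static void Deliver(FeedItem item, IEnumerable<Subscriber> followers,
        IDictionary<string, List<string>> timelines)
    {
        foreach (var follower in followers)
        {
            if (!follower.IsMatch(item))
                continue; // filtered out at write time, so it never reaches the timeline

            var docId = string.Format(CultureInfo.InvariantCulture,
                "{0}/timeline/{1:yyyy-MM-dd}", follower.Id, item.PublishedAt);

            List<string> entries;
            if (!timelines.TryGetValue(docId, out entries))
                timelines[docId] = entries = new List<string>();
            entries.Add(item.Id);
        }
    }
}
```

Reading a timeline then needs no filtering at all, because anything the user did not want was never written to it.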

What about post factum operations? Let us say that I want to start following a new user. This requires special treatment: you would have to read the latest timeline items from the new user to follow and start merging them with the existing timeline. Likewise when you need to delete someone, or when you add a new filter.
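As a rough illustration of the merge step, and nothing more than that, the following sketch combines an existing timeline with the latest entries of a newly followed user. TimelineEntry and TimelineMerge are hypothetical names, and a real implementation would work against the per-day timeline documents rather than in-memory lists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class TimelineEntry
{
    public string ItemId;
    public DateTime PublishedAt;
}

public static class TimelineMerge
{
    // Merges the newly followed user's latest entries into an existing timeline,
    // newest first, keeping at most maxEntries.
    public static List<TimelineEntry> AfterFollow(
        IEnumerable<TimelineEntry> existing,
        IEnumerable<TimelineEntry> latestFromNewUser,
        int maxEntries)
    {
        return existing
            .Concat(latestFromNewUser)
            .GroupBy(x => x.ItemId)               // an item may already be on the timeline
            .Select(g => g.First())
            .OrderByDescending(x => x.PublishedAt)
            .Take(maxEntries)
            .ToList();
    }
}
```

Removing someone or adding a filter would be the mirror image: walk the affected timeline documents and drop the entries that no longer match.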

It is a lot more work than just changing the query, sure. But you can get things done this way. And you cannot get anywhere with the query only approach.

What is making us slow (for the first time, after an idle period)?

We recently covered this question in several iterations in the ravendb mailing list.

The actual content of the discussion wasn’t as interesting as the number of ways idle time can make your life… interesting. In order to avoid having issues with idle time, you need to:

  • Disable IIS unloading for inactive websites.
  • Disable RavenDB unloading for inactive databases.
  • Make sure that the HD doesn’t spin down during inactivity.
  • Make sure that the system doesn’t go idle / into hibernation.
  • Check that the server hasn’t been paged.
  • Check that the CPU hasn’t moved to low power mode.
  • Check authentication timeouts.

In the end, it was actually the last one that caused the problem. By default, Windows Auth tokens expire after 15 minutes, so you have to re-authenticate, and that may make the first query after a while a little slower.

Just for fun, by default, all of the above happen. And that is just when running on a physical machine. When running on VMs (or in the cloud), you need to do all of those checks for the VM and the host machines.

My Passover Project: Introducing Rattlesnake.CLR

Okay, after spending quite a lot of time digging through the leveldb codebase, and with several years of working with RavenDB, I can say with confidence that the CLR makes it extremely hard to build high performance server side systems.

Mostly, the issues are related to GC and memory. In particular, not having any way to control memory allocation and/or the GC means that we can’t optimize those scenarios in any meaningful way. At the same time, I do not want to go back to the unmanaged world. As mentioned, I just came back from a very deep dive into a non trivial C++ codebase, and while I consider that codebase a really good one, that ain’t to say it is a pleasure to always be thinking about all the stuff that the CLR just takes away.

Therefore, I decided that I’m going to do something about it. And Rattlesnake.CLR was born:

image

The major features of Rattlesnake.CLR include explicit memory management when required. Let us say that we know that we are going to need some amount of memory for a while, and then all of that can be thrown away. This is extremely common in scenarios such as a web request: pretty much all the memory that you allocate while processing a web request can be safely freed immediately afterward. In RavenDB’s case, the memory we consume during indexing can be freed immediately when we stop indexing. Right now this is a painful process of making sure that we allocate within the same gen0 and hoping that it won’t be too expensive, or that we won’t get a complete halt of the entire server while it is releasing memory. It also makes it really hard to do things like limit the amount of memory your code uses.

Another requirement that I have is that Rattlesnake.CLR should be able to execute existing .NET assemblies without any additional steps. Since I don’t fancy doing ports of stuff that already exists.

In order to handle this scenario with the given constraints, we have:

   1: var heap = Heap.Create(HeapOptions.None,
   2:     1024 * 1024,
   3:     512 * 1024 * 1024);
   4:  
   5: using(MemoryAllocations.AllocateFrom(heap))
   6: {
   7:    var sb = new StringBuilder();
   8:    for(var i = 0; i < 100; i++)
   9:          sb.AppendLine(i.ToString());
  10:    Console.WriteLine(sb.ToString());
  11: }
  12:  
  13: heap.Destroy();

All the code within the using statement is allocated in our own heap. In line 13, we are destroying all of that memory in one fell swoop.

There are a few notes about this that we probably should address:

  • By default, memory allocated by this form is not subject to any form of GC. The idea is that this whole heap is getting released immediately.
  • Note the last two parameters of Heap.Create. The first is the initial size of the heap, and the second is the max size (1 MB and 512 MB in the example above). We now have a real way to actually limit the amount of memory a piece of code will use. This is really important on server applications where avoiding paging is critical.
  • For that matter, we can now figure out how much memory a particular piece of code uses, and allocate our resources accordingly.
  • You can use multiple heaps at the same time, although only one can be installed as the default allocation at a given point in time.

There is the explicit heap.GarbageCollect() method that will do GC only on that heap, and which you can schedule at your own convenience. You can have two heaps, and allocate from one while you are GCing the other. And yes, that means that GCs using this method will not stop the process!

Memory allocated on the heap is obviously only valid as long as the heap is valid. That means that once the heap is destroyed, you can’t access any of the objects that were created there. This has implications for things like caches. We provide the MemoryAllocations.AllocateOnGlobalHeap<T>(args) method to force an allocation onto the global heap instead, if you want this memory to be always available and subject to GC.

This is early days yet, but we already see some really interesting performance improvements!

How does this work?

While an early experiment with Rattlesnake.CLR was based on the Mono runtime, I quickly decided that I wanted to keep using the MS CLR. Now, in order to handle this I had to do some unnatural things (to say the least), but I think that I even managed to make this a supported option. Essentially, we are using the CLR Hosting API for this. In particular:

  • ICLRGCManager
  • IHostMalloc
  • IHostMemoryManager

You can use Rattlesnake.CLR like this:

.\Rattlesnake.exe Raven.Server.exe

Just for fun, we also allow you to place limits on the default heap, so you can be sure that you aren’t allocating too much there.

.\Rattlesnake.exe Raven.Server.exe --max-default-heap-size=256MB

We are still running some tests, but this is looking really good.

Hibernating Rhinos Practices: A Sample Project

I have previously stated that one of the things that I am looking for in a candidate is the actual candidate code. Now, I won’t accept “this is a project that I did for a client / employer”, and while it is nice to be pointed at a URL from the last project the candidate took part in, it is not a really good way to evaluate someone’s abilities.

Ideally, I would like to have someone that has an OSS portfolio that we can look at, but that isn’t always relevant. Instead, I decided to send potential candidates the following:

Hi,

I would like to give you a small project, and see how you handle that.

The task at hand is to build a website for webinar questions. We run bi-weekly webinars for our users, and we want to do the following:

  • Show the users a list of our webinars (The data is here: http://www.youtube.com/user/hibernatingrhinos)
  • Show a list of the next few scheduled webinars (in the user’s own time zone)
  • Allow the users to submit questions, comment on questions and vote on questions for the next webinar.
  • Allow the admin to mark specific questions as answered in a specific webinar (after it was uploaded to YouTube).
  • Manage Spam for questions & comments.

The project should be written in C#, beyond that, feel free to use whatever technologies that you are most comfortable with.

Things that we will be looking at:

  • Code quality
  • Architecture
  • Ease of modification
  • Efficiency of implementation
  • Ease of setup & deployment

Please send us the link to a Git repository containing the project, as well as any instructions that might be necessary.

Thanks in advance,

     Oren Eini

This post will go live about two weeks after I started sending this to candidates, so I am not sure yet what the response would be.

Software architecture with nail guns

As you probably know, I get called quite a lot to customers to “assist” in failing or problematic software projects. Maybe the performance isn’t nearly what it should be, maybe it is so very hard to make changes, maybe it is… one of the thousand and one things that can go wrong, and usually does.

Internally, I divide those projects into two broad categories: The stupid and the nail guns.

I rarely get called to projects that fall under the stupid category. When it happens, it is usually because someone new came in, looked at the codebase and called for help. I love working with stupid code bases. They are easy to understand, if hard to work with, and it is pretty obvious what is wrong. And the team is usually very receptive about getting advice on how to fix it.

But I usually am called for nail gun projects, and those are so much more complex…

But before I can talk about them, I need to explain first what I mean when I say “nail gun projects”. Consider an interesting fact. Absolutely no one will publish an article saying “we did nothing special, we had nothing out of the ordinary, and we shipped roughly on time, roughly on budget and with about the expected feature set. The client was reasonably happy.” And even if someone would post that, no one would read it.

Think about your life, as an example. You wake up, walk the dogs, take kids to school, go to work, come back from work, fall asleep reading this sentence, watch some TV, eat along the way, sleep. Rinse, repeat.

Now, let us go and look at the paper. At the time of this writing, those were the top stories at CNN:

Hopefully, there is a big disconnect between your life and that sort of news.

Now, let us think about the sort of posts, articles and books that you have been reading. You won’t find any book called “Delivering OK projects”.

And most of the literature about software projects is on one of two ends: we did something incredibly hard and we did it well, or we did something (obvious, usually) and we failed really badly. People who read those books (either kind) tend to almost blindly adopt the suggested practices, usually without looking at that section called “When it is appropriate to do what we do”.

Probably the best example is the waterfall methodology, which originated in the 1970 paper "Managing the Development of Large Software Systems" by Winston W. Royce.

From the paper:

…the implementation described above is risky and invites failure

As you can imagine, no one actually listened, and the rest is history.

How about those nail guns again?

Well, imagine that you are a contractor, and here are your tools of the trade:

They are good tools, and they served you well for a while. But now you are reading about “Nail gun usage for better, faster and more effective framing or roofing”. In the study, you read how there was a need to nail 3,000 shingles, and how by using a nail gun the team was able to complete the task with higher efficiency than with the standard hammer.

Being a conscientious professional, you heed the advice and immediately buy the best nail gun you can find:

(This is just a random nail gun picture, I don’t know what brand, nor really care.)

And indeed, a nail gun is a great tool when you need to nail a lot of things very fast. But it is a highly effective tool that is extremely limited in what it can do.

But you know that a nail gun is 333% more efficient than the hammer, so you throw the hammer away. And then you get a request: Can you hang this picture on the wall, please?

It would be easy with a hammer, but with a nail gun:

It isn’t the stupid / lazy / ignorant people that go for the nail gun solutions.

It is the really hard working people, the guys who really try to make things better. Of course, what usually happens is this:

 

And here we get back to the projects that I usually get called for. Those are projects that were created by really smart people, with the best of intentions, and with the clear understanding that they want to get quality stuff done.

The problem is that they are using Nail Guns for the architecture. For example, let us just look at this post. And the end is already written.

Hibernating Rhinos Practices: Design

One of the things that I routinely get asked is how we design things. And the answer is that we usually do not. Most things do not require complex design. The requirements we set pretty much dictate how things are going to work. Sometimes, users make suggestions that turn into a light bulb moment, and things shift very rapidly.

But sometimes, usually with the big things, we actually do need to do some design upfront. This is usually true for the complex / user facing parts of our projects. The Map/Reduce system, for example, was mostly re-written in RavenDB 2.0, and that only happened after multiple design sessions internally, a full stand alone spike implementation and a lot of coffee, curses and sweat.

In many cases, when we can, we will post a suggested design on the mailing list and ask for feedback. Here is an example of such a scenario:

In this case, we didn’t get to this feature in time for the 2.0 release, but we kept thinking and refining the approach for that.

The interesting thing is that in those cases, we usually “design” things by doing the high level user visible API and then just let it percolate. There are a lot of additional things that we would need to change to make this work (backward compatibility being a major one), so there is a lot of additional work to be done, but that can be done during the work. Right now we can let it sit, get users’ feedback on the proposed design and get the current minor release out of the door.

Single Responsibility Principle, Object Orientation & Active Code

Jason Folkens had a comment on my previous post:

When people combine methods and data into a class in a way such that you are recommending, I wonder if they truly value the single responsibility principle. In my mind, storing both schema and behavior in the same class qualifies as a violation of the SRP. Do you disagree with me that this is a 'violation', or do you just not think the SRP is important?

I can’t disagree enough. From Wikipedia:

An object contains encapsulated data and procedures grouped together to represent an entity.

The whole point of OOP is to encapsulate both data & behavior. To assume otherwise leads us to stateless functions and isolated DTOs.

Or, in other words, procedures and structures. And I think I’ll leave that to C.

Active vs. Passive code bases

I was reviewing code at a customer site, and he had a lot of classes that looked something like this:

    public class ValidationData
    {
        public string Type { get; set; }
        public string Value { get; set; }
    }

In the database, he would have the data like this:

image

This is obviously a very simple example, but it gets the job done, I think.

In his code base, the customer had several instances of this pattern: for validation of certain parts of the system, for handling business rules, for checking how to handle various events, and I think you get the picture.

I seriously dislike such codebases. You take an innocent piece of code and make it so passive it… well, you can see:

image

Here is why this is bad. The code is passive, it is just a data holder. And that means that in order to process it you are going to have some other code that handles that for you. That likely means a switch statement or the equivalent. And it also means that making any sort of change now has to happen in multiple locations. Puke.
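To show what that "other code" tends to look like, here is a sketch of a processor for such a passive class. The type names "MaxLen" and "InvalidChars" and the PassiveValidation class are assumptions made up for the example; the customer's actual rules are not shown in this post.

```csharp
using System;
using System.Globalization;

public class ValidationData
{
    public string Type { get; set; }
    public string Value { get; set; }
}

public static class PassiveValidation
{
    // Every new validation type means another case here, and in every
    // other place that has to understand ValidationData.
    public static bool IsValid(ValidationData rule, string input)
    {
        switch (rule.Type)
        {
            case "MaxLen":
                return input.Length <= int.Parse(rule.Value, CultureInfo.InvariantCulture);
            case "InvalidChars":
                return input.IndexOfAny(rule.Value.ToCharArray()) < 0;
            default:
                throw new NotSupportedException("Unknown validation type: " + rule.Type);
        }
    }
}
```

The data class itself knows nothing, so all of the knowledge leaks into switches like this one.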

For fun, using this anti pattern all over your codebase results in you having to do this over and over again, for any new interesting thing that you are doing. It is a lot of work, and a lot of places that you have to change.

But you can be a hero and set the code free:

You do that by making a very simple change. Instead of having passive data containers that other pieces of the code need to react to, make them active.

    public class AvoidCurseWordsValidator : IValidator
    {
        public string[] CurseWords { get; set; }
        public void Validate(...) { }
    }

    public class MaxLenValidator : IValidator
    {
        public int MaxLen { get; set; }
        public void Validate(...) { }
    }

    public class InvalidCharsValidator : IValidator
    {
        public char[] InvalidChars { get; set; }
        public void Validate(...) { }
    }

Now, if we want to modify / add something, we can do this in only one spot. Hurray for Single Responsibility and Open Closed principles.
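Here is one way the active version could be fleshed out so that it compiles. The Validate signature (take a string, return the problems found) is my assumption for the sake of the example, since the post leaves the parameters out, and the Validation.Run helper is likewise hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed contract: returns the problems found, empty when the value is fine.
public interface IValidator
{
    IEnumerable<string> Validate(string value);
}

public class MaxLenValidator : IValidator
{
    public int MaxLen { get; set; }

    public IEnumerable<string> Validate(string value)
    {
        if (value.Length > MaxLen)
            yield return "Value is longer than " + MaxLen + " characters";
    }
}

public class InvalidCharsValidator : IValidator
{
    public char[] InvalidChars { get; set; }

    public IEnumerable<string> Validate(string value)
    {
        foreach (var c in InvalidChars)
        {
            if (value.IndexOf(c) >= 0)
                yield return "Value contains the invalid character '" + c + "'";
        }
    }
}

public static class Validation
{
    // The caller never inspects a type string; it just asks each validator.
    public static List<string> Run(IEnumerable<IValidator> validators, string value)
    {
        return validators.SelectMany(v => v.Validate(value)).ToList();
    }
}
```

Adding a new rule is now a new class and nothing else; Validation.Run does not change.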

So… don’t let your codebase be dominated by switch statements, parallel hierarchies and other nasties. Make it go active, and you’ll like the results.

Get thou out of my head, damn idea

Sometimes I get ideas, and they just won’t leave my head no matter what I do.

In this case, I decided that I wanted to see what it would take to implement an event store in terms of writing a fully managed version.

I am not really interested in the actual event store, I care a lot more about the actual implementation idea that I had (I/O queues in append only mode, if you care to know).

After giving it some thought, I managed to create a version that allows me to write the following code:

    var diskData = new OnDiskData(new FileStreamSource(), "Data");

    var data = JObject.Parse("{'Type': 'ItemCreated', 'ItemId': '1324'}");
    var sp = Stopwatch.StartNew();
    // 10,000 streams x 1,000 events each = 10 million events
    Parallel.For(0, 1000 * 10, i =>
    {
        var tasks = new Task[1000];
        for (int j = 0; j < 1000; j++)
        {
            tasks[j] = diskData.Enqueue("users/" + i, data);
        }
        // Each task completes only once its event is on disk
        Task.WaitAll(tasks);
    });

    Console.WriteLine(sp.ElapsedMilliseconds);

Admittedly, this isn’t really interesting client code, but it is plenty good enough for what I need, and it allowed me to check something really interesting: just how far I would have to go to actually get really good performance. As it turned out, not that far.

This code writes 10 million events, and it does so in under 1 minute (on my laptop, SSD drive). Just to give you some idea, that is > 600 MB of events, and about 230 events per millisecond, or about 230 thousand events per second. Yes, that is 230,000 events / sec.

The limiting factor seems to be the disk, and I have some ideas on how to improve that. I still got roughly 12MB/s, so there is certainly room for improvement.

How does this work? Here is the implementation of the Enqueue method:

    public Task Enqueue(string id, JObject data)
    {
        var item = new WriteState
            {
                Data = data,
                Id = id
            };

        // Hand the item to the writer thread and wake it up
        writer.Enqueue(item);
        hasItems.Set();
        // Completed by the writer only after the batch is flushed to disk
        return item.TaskCompletionSource.Task;
    }

In other words, this is a classic producer/consumer problem.

The other side is reading the events from the queue and writing them to disk. There is just one thread doing that, and it is always appending to the end of the file. Moreover, because of the way it works, we gain the ability to batch a lot of events together into a stream of really nice IO calls that optimize the actual disk access. Only when we have finished with a batch of items and flushed them to disk do we complete their tasks, so the fun part is that, for all intents and purposes, we are doing this while preserving the transactional behavior of the system. Once the task returned by Enqueue has completed, we can be sure that the data is fully saved on disk.
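The consumer side is not shown in this post, so here is a minimal sketch of how such a single writer thread could look. It is not the spike's actual code: AppendOnlyWriter, its fields and the line-per-event format are all assumptions, it writes to any Stream rather than to the real file format, and it leaves out error handling (a failed write would leave the pending tasks hanging).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public sealed class AppendOnlyWriter : IDisposable
{
    private sealed class WriteState
    {
        public byte[] Payload;
        public readonly TaskCompletionSource<object> Done =
            new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
    }

    private readonly ConcurrentQueue<WriteState> queue = new ConcurrentQueue<WriteState>();
    private readonly ManualResetEventSlim hasItems = new ManualResetEventSlim(false);
    private readonly Stream output;
    private readonly Thread writerThread;
    private volatile bool stopping;

    public AppendOnlyWriter(Stream output)
    {
        this.output = output;
        writerThread = new Thread(WriteLoop) { IsBackground = true };
        writerThread.Start();
    }

    // Producer side: queue the event and return a task that completes after the flush.
    public Task Enqueue(string line)
    {
        var item = new WriteState { Payload = Encoding.UTF8.GetBytes(line + "\n") };
        queue.Enqueue(item);
        hasItems.Set();
        return item.Done.Task;
    }

    // Consumer side: one thread drains the queue, appends, and flushes once per batch.
    private void WriteLoop()
    {
        var batch = new List<WriteState>();
        while (true)
        {
            hasItems.Wait();
            hasItems.Reset();

            WriteState item;
            while (queue.TryDequeue(out item))
            {
                output.Write(item.Payload, 0, item.Payload.Length);
                batch.Add(item);
            }

            if (batch.Count > 0)
            {
                // For a FileStream, Flush(true) would be needed to force the data to disk.
                output.Flush();
                foreach (var written in batch)
                    written.Done.SetResult(null);
                batch.Clear();
            }

            if (stopping && queue.IsEmpty)
                return;
        }
    }

    public void Dispose()
    {
        stopping = true;
        hasItems.Set();
        writerThread.Join();
        hasItems.Dispose();
    }
}
```

The batching falls out naturally: while the writer is busy flushing one batch, producers keep queueing, so the next pass picks up everything that accumulated and pays for a single flush.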

That was an interesting spike, and I wonder where else I would be able to make use of something like this in the future.

Yes, those are pretty small events, and yes, that is a fake test, but the approach seems to be very solid.

And just for fun, with absolutely no optimizations whatsoever, no caching, no nothing, I am able to load 1,000 events per stream in less than 10 ms.

On Professional Code

Trystan made a very interesting comment on my post about unprofessional code:

I think it's interesting that your definition of professional is not about SOLID code, infrastructure, or any other technical issues. Professional means that you, or the support staff, can easily see what the system is doing in production and why.

It is a pretty accurate statement, yes. More to the point, a professional system is one that can be supported in production easily. About the most unprofessional thing that you can say is: “I have no idea what is going on.”

Expanding on this, we have been paying a LOT of attention recently to production readiness. We can’t afford not to. Just building the software is often just not enough for us. In many cases, if there is a problem, we can’t just debug through the process. Either because reproducing the problem is too hard or because it happens at a client side with their own private data. Even more important than that, if we can give the ops team the tools to actually see what is going on within the system, we drastically reduce the number of support calls we have to take.

Not to mention that software that actively supports and helps the ops team gets into the actual data center a lot faster and easier than software that doesn’t. Sure, clean code is important, but production ready code is often not clean code. I read this a long time ago, and it stuck:

Back to that two page function. Yes, I know, it's just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I'll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn't have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.

Each of these bugs took weeks of real-world usage before they were found. The programmer might have spent a couple of days reproducing the bug in the lab and fixing it. If it's like a lot of bugs, the fix might be one line of code, or it might even be a couple of characters, but a lot of work and time went into those two characters.

Some parts of the RavenDB code are ugly. The HttpServer class, for example, goes on for over a thousand lines of mostly error detection and recovery modes. But it works, and it allows us to inspect it on a running production server.

That is important, and that makes the separation between good code and production worthy code.

It isn’t a feature that is killing you, it is the intersection of features

Over time, projects get more features. And that is fine, as long as they are orthogonal features. It is when those features overlap that they are really putting the hurt on us.

For example, with the recent Changes API addition to RavenDB, one of the things that was really annoying is that in order to actually implement this feature, I had to implement it for:

  • Embedded
  • Client Server
  • Sharded
  • Silverlight

And that one is easy. We have bugs right now that are happening because people are using two or three bundles at the same time, and each of them works fine, but in conjunction, you get strange results.

What should happen when the Unique Constraints bundle creates an internal document when you have the Versioning bundle enabled? How about when we add replication to the mix?

I am not sure if I have any good ideas about the matter. Most of the things that we do are orthogonal to one another, but when used in combination, they actually have to know about their impact on other things.

My main worry is that as time goes by, we have more & more of those intersections. And that adds to the cost of maintaining and supporting the product.

System vs. User task security: Who pays the sports writer?

Let us assume for a moment that we are building a system for a sports site. We have multiple authors, submitting articles, and we pay each author for those articles.

The data model might look like this:

image

In this post, I want to talk about the security implications of such a system. Typically, this gets translated to requirements such as:

  • Authors can edit their articles.
  • Authors cannot modify / view any payments.

Which very often gets boiled down to something like this:

GRANT SELECT,INSERT,UPDATE,DELETE ON Articles TO Authors;
DENY SELECT,INSERT,UPDATE,DELETE ON Payments TO Authors;

What do you think of such a system? To my mind, this is a horrible mess altogether. Think what it means for something like this:

public ActionResult SubmitArticle(Article article)
{
    if(IsValid(article)==false)
        return View();

    Session.Store(article);

    var payment = GetOrCreatePaymentFor(article.Author);

    payment.AddArticle(article);

    return RedirectToAction("index");
}

This code would actually have to run under several different security credentials in order to work successfully.

That is before we take into account how using multiple users for different operations would result in total chaos for small things like connection pooling.

In real world systems, the security can’t really operate based on the physical structure of the data in the data store. It is far too complex to manage. Instead, we implement security by separating the notion of the System performing system tasks (such as adding a payment for an article) from the System performing tasks on behalf of the user.

The security rules are implemented in the system, and the application users have no physical manifestation (such as being DB users) in the system at all.
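A small in-memory sketch of that separation follows. ArticleService, the "Accounting" role and the id formats are all hypothetical; a real system would keep its own notion of users and roles and go through the session rather than dictionaries. The point is only where the checks live.

```csharp
using System;
using System.Collections.Generic;

public class Article
{
    public string Id;
    public string AuthorId;
}

public class Payment
{
    public string AuthorId;
    public List<string> ArticleIds = new List<string>();
}

public class ArticleService
{
    private readonly Dictionary<string, Article> articles = new Dictionary<string, Article>();
    private readonly Dictionary<string, Payment> payments = new Dictionary<string, Payment>();

    // User task: performed on behalf of currentUserId, so the rule is checked here.
    public void SubmitArticle(string currentUserId, Article article)
    {
        if (article.AuthorId != currentUserId)
            throw new UnauthorizedAccessException("Authors can only submit their own articles.");

        articles[article.Id] = article;

        // System task: the system records the payment, regardless of what
        // the author is allowed to see or modify.
        AddToPayment(article);
    }

    // User task: authors cannot view payments; only the assumed Accounting role can.
    public Payment GetPayment(string currentUserRole, string authorId)
    {
        if (currentUserRole != "Accounting")
            throw new UnauthorizedAccessException("Only accounting can view payments.");

        Payment payment;
        return payments.TryGetValue(authorId, out payment) ? payment : null;
    }

    private void AddToPayment(Article article)
    {
        Payment payment;
        if (!payments.TryGetValue(article.AuthorId, out payment))
        {
            payment = new Payment { AuthorId = article.AuthorId };
            payments[article.AuthorId] = payment;
        }
        payment.ArticleIds.Add(article.Id);
    }
}
```

Everything talks to the data store with a single set of credentials, so connection pooling keeps working, and the author still has no way to read or change a payment.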

And to the commentators: I know that some of you are going to claim that physical security at the database level is super critical, but while you are doing that, please also address the problems of connection pooling and the complexities of the multiple security contexts required for most real world business operations.

Geo Location & Spatial Searches with RavenDB–Part VII–RavenDB Client vs. Separate REST Service

In my previous post, I discussed how we put the GeoIP dataset in a separate database, and how we access it through a separate session. I also asked, why use RavenDB Client at all? I mean, we might as well just use the REST API and expose a service.

Here is what such a service would look like, by the way:

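using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

public class GeoIPClient : IDisposable
{
    private readonly HttpClient httpClient;

    public GeoIPClient(string url, ICredentials credentials)
    {
        httpClient = new HttpClient(new HttpClientHandler{Credentials = credentials})
        {
            BaseAddress = new Uri(url),
        };
    }

    public Task<Location> GetLocationByIp(IPAddress ip)
    {
        // Only IPv4 is handled. Return a completed task rather than a null Task,
        // which would throw as soon as the caller waits on it.
        if (ip.AddressFamily != AddressFamily.InterNetwork)
            return Task.FromResult<Location>(null);

        // GetAddressBytes() is in network (big-endian) order; reverse it so that
        // BitConverter reads the numeric value correctly on little-endian machines.
        var reverseIp = (long)BitConverter.ToUInt32(ip.GetAddressBytes().Reverse().ToArray(), 0);

        var query = string.Format("Start_Range:[* TO 0x{0:X16}] AND End_Range:[0x{0:X16} TO NULL]", reverseIp);

        // The Lucene query is passed in the "query" parameter and must be URL-escaped.
        return httpClient.GetAsync("indexes/Locations/ByRange?pageSize=1&query=" + Uri.EscapeDataString(query))
            .ContinueWith(task => task.Result.Content
                .ReadAsAsync<QueryResult>()
                .ContinueWith(task1 => task1.Result.Results.FirstOrDefault())).Unwrap();
    }

    public void Dispose()
    {
        httpClient.Dispose();
    }
}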

I think that you can agree that this is fairly simple and easy to understand. It makes it explicit that we are just going to query the database, and it is even fairly easy to read.

Why not go with that route?

Put simply, because it is doing only about 10% of the things that we do in the RavenDB Client. The first thing that pops to mind is that this service doesn’t support caching, HTTP ETag responses, etc. That means that we would have to implement that ourselves. This is decidedly non trivial.

The RavenDB Client will automatically cache all data for you if it can. You don’t have to think about it, worry about it or even pay it any mind. It is just there, working hard to make sure that your application is more performant.

Next, this will only support Windows Authentication. RavenDB also supports OAuth, so if you wanted to run this on RavenHQ, for example, which requires OAuth, you would have to write some additional stuff as well.

Finally, using the RavenDB Client leaves us open to doing additional things in the future very easily, while using a dedicated service means that we are on the hook for implementing from scratch basically anything else that we want.

Sure, we could implement this service using RavenDB Client, but that is just adding layers, and I really don’t like that. There is no real point.

Geo Location & Spatial Searches with RavenDB–Part VI–Database Modeling

If you had sharp eyes, you might have noticed that in this code, I am actually using two different sessions:

We have the GeoSession, and we have the RavenSession.

The GeoSession is actually pointed at a different database, and it is read only. In fact, here is how we use this:

image

As you can see, we create this on an as-needed basis, and we only ever dispose it; we never actually call SaveChanges().

So, those are the technical details, but what is the reasoning behind this?

Well, it is actually pretty simple. The GeoIP dataset is about 600 MB in size, and mostly it is about… well, geo location stuff. It is a very nice feature, but it is a self contained one, and not something that I really care for putting inside my app database. Instead, I have decided to go another way, and use a separate database.

That means that we have separation, at the data layer, between the different databases. That makes sense: about the only thing that we need from the GeoIP dataset is the ability to handle queries, and that is expressed only via GetLocationByIp, nothing more.

I don’t see a reason to make the app database bigger and more complex, or to have to support updates to the GeoIP dataset inside the app. This is a totally separate service. Having this in a separate database makes it much easier to reuse the next time that I want geo location. And it simplifies my life right now with regard to maintaining and working with my current app.

In fact, we could have taken it even further, and not used the RavenDB Client to access this at all. We could use REST calls to get the data out directly. We have chosen to still use the RavenDB Client, and I’ll discuss exactly why we chose not to go with REST calls.

Entities Associations: Point in Time vs. Current Associations

Having just finished giving three courses (2 on RavenDB and 1 on NHibernate), you might want to say that I have a lot of data stuff on my mind. Teaching is always a pleasure to me, and one of the major reasons for that is that I get to actually learn a lot whenever I teach.

In this case, in all three courses, we ran into an issue with modeling associations. For the sake of the example, let us talk about employees and paychecks. You can see the model below:

image

Do you note the blue lines? Those represent Employee references, but while they both reference the same employee, they are actually quite different associations.

The Manager association is a Current Association. It is just a pointer to the managing employee. What does this mean?

Let us say that the manager of a certain employee changed her name. In that scenario, when we look at the current employee record, we should see the updated employee manager name. In this case, we are always interested in the current status.

On the other hand, when looking at the paycheck’s PaidTo reference to an employee, we have something altogether different. We have a reference not to the current employee record, but to the employee record as it was at a certain point in time. If the employee in question changes his name, that paycheck was issued to Mr. Version One, not to Mr. Version Two, even though the name has been changed.

When dealing with associations, it is important to distinguish between the two options, as each requires a different way of working with the association.
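The difference can be sketched in code. This is an illustration under assumptions, not the model from the image: `EmployeeSnapshot` and `Paycheck.IssueTo` are hypothetical names. The current association stores only an id that is resolved when needed, while the point in time association copies the relevant details when the paycheck is issued.

```csharp
using System;

// Hypothetical model, for illustration only.
public class Employee
{
    public string Id;
    public string Name;

    // Current association: just a pointer, always resolved to the latest record.
    public string ManagerId;
}

public class EmployeeSnapshot
{
    public string EmployeeId;
    public string Name;
    public DateTime AsOf;
}

public class Paycheck
{
    public decimal Amount;

    // Point in time association: a copy of the employee details
    // as they were when the paycheck was issued.
    public EmployeeSnapshot PaidTo;

    public static Paycheck IssueTo(Employee employee, decimal amount, DateTime issuedAt)
    {
        return new Paycheck
        {
            Amount = amount,
            PaidTo = new EmployeeSnapshot
            {
                EmployeeId = employee.Id,
                Name = employee.Name,
                AsOf = issuedAt
            }
        };
    }
}
```

Renaming the employee after the paycheck was issued leaves `PaidTo.Name` untouched, while anything that follows `ManagerId` sees the new name.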

Assuming that the laws of physics no longer apply, we can build this

This is a reply to a post by Frans Bouma, in which he asks for:

…cloud vendors should offer simply one VM to me. On that VM I run the websites, store my DB and my files. As it's a virtual machine, how this machine is actually ran on physical hardware (e.g. partitioned), I don't care, as that's the problem for the cloud vendor to solve. If I need more resources, e.g. I have more traffic to my server, way more visitors per day, the VM stretches, like I bought a bigger box. This frees me from the problem which comes with multiple VMs: I don't have any refactoring to do at all: I can simply build my website as if it runs on my local hardware server, upload it to the VM offered by the cloud vendor, install it on the VM and I'm done.

Um… no.

Go ahead and read the whole post, it is interesting. But the underlying premise that it relies on is flawed. It is like starting out by assuming that since TCP/IP contains no built in prohibition on sending data faster than light, the cloud providers can and should create networks that can send data faster than light. After all, I can show a clear business case for the reduced ping time, and that is certainly something that can be abstracted from my application.

Why aren’t those bozos doing that?

Well, the answer to that is that it just ain’t possible. There are several minor problems along the way. The CAP theorem, to start with, but even if we ignore that aspect of the problem, there are also the fallacies of distributed computing.

According to Frans’ premise, we can have a single VM that can scale up to as many machines as is needed, without any change required to the system. Let us start with Frans’ answer to the actual scope of the problem:

But what about memory replication and other problems?

This environment isn't simple, at least not for the cloud vendor. But it is simple for the customer who wants to run his sites in that cloud: no work needed. No refactoring needed of existing code. Upload it, run it.

Um.. no.

Let us take a look at a few pieces of code, and see what is going to happen to them in Frans’ cloud environment. For example, let us take a look at this:

decimal tax = 0; // decimal, assuming CalculateTax() returns a monetary decimal
foreach(var item in order.Items)
{
  tax += item.CalculateTax();   
}
order.Tax = tax;

The problem: because of the elasticity of the VM, things are actually spread around, so each of the items in the order collection may be located on another physical machine. This is, of course, completely transparent to the code. But that means that each loop iteration is actually doing a network call behind the scenes.

OR/M users are familiar with this as the SELECT N+1 problem, but in this case, you have a potential problem on every memory access. Network attached memory isn’t new, you can read about it in OS books and it is a nice theoretical idea, but it just isn’t going to work, because you actually care about the speed of accessing the data.

In fact, we have many algorithms that were changed specifically to take advantage of cache lines, L1 & L2 cache, etc., because that gives a major increase in system performance, and that is only on a single machine. Trying to imagine a transparent network memory is futile; you actually care about memory access speed, a lot.

But let us talk about another aspect. I want to have an always incrementing order id number. So I do:

Interlocked.Increment(ref lastOrderId);

All well and good when running on a single machine, but how should the VM make it work when running on multiple machines?

And remember, this call actually translates to a purpose built assembly instruction (XADD or one of its friends). In this case, you need to do this across the network, and touch as many machines as your system currently runs on.

But the whole point here is to allow us to rapidly generate a new number. This has now turned into a total mess in terms of performance.

What about parallel computing, for that matter?

var results = new Result[items.Length];
// Parallel.For takes an index range, so each iteration looks up its own item.
Parallel.For(0, items.Length, i =>
{
    results[i] = items[i].Calculate();
});

I have many items, and I want to be able to compute the results in parallel, so I run this fairly standard code. But we are actually going to execute this on multiple threads, so this gets scheduled on several different machines. Now you have to copy the results buffer to all of those machines, as well as any related state that they have, then copy it back out when it is done, then somehow merge the different changes made by different machines into a coherent whole.

Good luck with that.

I could go on, but I think that you get the point by now.

And we haven’t talked about error conditions yet. What happens if my system is running on 3 machines, and one of them goes down (power outage, hardware failure, etc.)? A third of my memory, ongoing work and a lot of other stuff just got lost. For that matter, I might have (actually, probably have) dangling references to memory that used to be on the failed machine, so the other two machines are likely to hit this inaccessible memory and fail themselves.

So.. no, this idea is a pipe dream, it isn’t going to work, not because of some evil plot by dastardly folks conspiring to make your life harder, but for the simple reason that it is easier to fly by flapping your arms.

Your ATM doesn’t use transactions

I just got a series of SMSes from my bank, saying that someone had just made several withdrawals from my account. As I am currently sitting and watching TV, I was a bit concerned.

It appears that my wife withdrew some money, but there was an issue with one ATM machine, so she used another one.

The problem: I got 3 SMS messages, saying that the following activities happened on my account:

  • ATM withdrawal for 2,000 NIS
  • ATM withdrawal for 2,000 NIS
  • ATM withdrawal for 2,900 NIS

Checking with my wife, she had actually withdrawn only 2,900.

I was a bit concerned, so I logged into the bank and got this:

image

In English, this is:

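| Date   | Description    | Auth Code | Debit | Credit |
|--------|----------------|-----------|-------|--------|
| 10 Apr | ATM withdrawal | 00003581  | 2,000 |        |
| 10 Apr | ATM withdrawal | 00003581  |       | 2,000  |
| 10 Apr | ATM withdrawal | 00003581  | 2,900 |        |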

This is actually interesting, because the way my wife described it, she went to the ATM, punched in the right codes, and went through the motions of everything. Then, just before it was about to give her the money, it failed.

What is really interesting, from my point of view, is that I can actually see this in my bank account. We didn’t have a transaction rollback because of the failure to dispense the money. That isn’t how ATMs work. We actually had a compensating action (which occurred as a separate transaction) to show that the ATM refunded the money it wasn’t able to give.
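The statement above can be reproduced with a small sketch. It is an illustration only: `Account`, `LedgerEntry` and the `dispenseCash` callback are hypothetical and say nothing about how a real bank implements this. The point is that the debit is recorded first, and a failure to dispense is recorded as a second, compensating entry instead of erasing the first one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types, for illustration only.
public class LedgerEntry
{
    public string AuthCode;
    public decimal Debit;
    public decimal Credit;
}

public class Account
{
    private readonly List<LedgerEntry> entries = new List<LedgerEntry>();

    public IReadOnlyList<LedgerEntry> Entries
    {
        get { return entries; }
    }

    public decimal Balance
    {
        get { return entries.Sum(e => e.Credit - e.Debit); }
    }

    // dispenseCash stands in for the ATM hardware; it returns false on failure.
    public bool Withdraw(string authCode, decimal amount, Func<decimal, bool> dispenseCash)
    {
        // The debit is recorded before any money leaves the machine.
        entries.Add(new LedgerEntry { AuthCode = authCode, Debit = amount });

        if (dispenseCash(amount))
            return true;

        // No rollback: the failure shows up as a separate, compensating credit.
        entries.Add(new LedgerEntry { AuthCode = authCode, Credit = amount });
        return false;
    }
}
```

A failed withdrawal of 2,000 followed by a successful one of 2,900 leaves three entries, which is the same shape as the statement shown earlier.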

So next time someone tries to quote you “banks use transactions”, you can tell them that the bank definition of what a transaction is would make any decent DTC cry with shame.

Security decisions: Separate Operations & Queries

The question came up several times in the mailing list with regard to how the RavenDB Authorization Bundle operates, and I think it deserves a broader discussion.

Let us imagine a system where we have contracts, which may be in several states:

  • Mine – Contracts that an employee signed.
  • Done – Standard users can view, Lawyers assigned to the company can sign.
  • Draft – Lawyers can view / edit, Partners can approve.
  • Proposed – Lawyers can create / edit, but only the lawyer that created it can view it, Partners can accept.

So far, fairly simple, right? Except for the pure hell that you are going to get into when you try to show the users all of the contracts that they can see, sorted by edit date and in the NDA category.

Why am I being so negative here? Well, let us look at what we are going to have to do in the most trivial of cases:

image

In this sort of system, we are going to have to show the user all of the contracts that they are allowed to see, and show them some indication of what operations they can do on each.

The problem is that generating this sort of view is expensive, especially when you have a large amount of data to work through. More interestingly, from a UX perspective, it also doesn’t really work that well. Most users would want a better separation of the things that they can do, probably something like this:

image

This allows us to do a first level filtering on the data itself, rather than try to apply security rules to it.

In the first case, we need to get all the contracts that we are allowed to see. The security rules above are really simple, mind. But trying to translate them into an efficient query is going to be pretty hard, both in terms of the code required and the cost to actually perform the query on the server. There are other things that are involved as well, such as paging and sorting in such an environment. I have created several such systems in the past, Rhino Security is probably the most well known of them, and it gets really hard to optimize things and make sure that everything works when you start getting more complex security rules (especially when you have a user editable security system, which is a common request).

The second case is cheaper because we can limit the choices that we see in the query itself. We may still need to apply security concerns, but those go through the query directly, rather than through a security sub system. This kind of change usually forces people to be more explicit about what they want, and it results in a system that tends to be simpler. The security rules aren’t just something arbitrary that can be defined, they are actually visible on the screen (My Contracts, Drafts, etc). Changing them isn’t something that is done on an administrator’s whim.

Yes, this is a way to manage the client and their expectations, but that is important. But what about the complex security that they want?

That might still be there, certainly, but that would be active mostly for operations (stuff that happens on a single entity), not on things that happen over all entities. It is drastically easier to make a single-entity security decision work efficiently than to make it work over the whole set inside the database.
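A rough sketch of that split, under assumptions: the types below are hypothetical, the query runs over an in-memory list rather than the RavenDB Authorization Bundle, and `CanEdit` encodes only my reading of the Draft and Proposed rules listed earlier. The query is selected by the screen the user picked, and the security decision is made per entity, per operation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types, for illustration only.
public enum ContractState { Proposed, Draft, Done }

public class Contract
{
    public string Id;
    public ContractState State;
    public string CreatedBy;
    public DateTime EditedAt;
}

public class ContractUser
{
    public string Id;
    public bool IsLawyer;
}

public static class Contracts
{
    // Query: the screen ("Drafts") picks the filter, so it is part of the query itself.
    public static List<Contract> Drafts(IEnumerable<Contract> all)
    {
        return all.Where(c => c.State == ContractState.Draft)
                  .OrderByDescending(c => c.EditedAt)
                  .ToList();
    }

    // Operation: a security decision about one user and one entity.
    public static bool CanEdit(ContractUser user, Contract contract)
    {
        switch (contract.State)
        {
            case ContractState.Draft:
                return user.IsLawyer;
            case ContractState.Proposed:
                // Only the lawyer who created a proposal can see it, so only they can edit it.
                return user.IsLawyer && contract.CreatedBy == user.Id;
            default:
                return false;
        }
    }
}
```

The per-entity check can be as elaborate as the client wants, because it runs against a single contract that is already loaded, never against the whole set.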

Monika: A lesson in component based design

I was giving a lecture on architecture recently, and the notion of components came up. The most important bit about that lecture was probably at the very end, when I discussed what it is that I consider to be a component. During that discussion, I introduced Monika, the payment processing component.

Monika has the following Service Level Agreement:

  • Payment initiation is done via messages.
  • Notification about payment completion is handled via a callback REST call.
  • The SLA calls for 90% of all successful payments to be processed in 2 business days.

So far, it doesn’t sound really complicated, right? And there isn’t even a hint of how Monika works in the SLA or the contracts.

This is Monika:

Well, not really, but it makes the point, doesn’t it?

Monika is a component in the system that responds to (SMTP) messages, does some work, and responds by clicking on a link in the email (REST call).

Monika has a really sucky SLA, since she has only 22% uptime over the course of the year, and then there are those two weeks when she has her yearly maintenance period (vacation), etc.

The most important thing about this is that we are able to abstract all of that away and treat this scenario as just another component in the system.

All too often, people hear components and they start thinking about things like this:

A component in a system is usually something much larger than a single class or a set of classes. It is an independent agent in the system that has its own behavior, resources, dedicated team and deployment schedule separate from all other components.

Beware the common infrastructure

One of the common problems that I run into when consulting with clients, or just whenever I am talking to developers in general is the notion of common infrastructure. “We are going to spend some time building a common infrastructure which we can then use on all of our applications.”

I made that mistake myself with Rhino Commons, and again very recently with RaccoonBlog (look at the code and you’ll see the Loci stuff; that is stuff that is used from another project).

Why is that a problem? Well, for the simplest reason of all. Different projects have different needs. A common infrastructure that tries to accommodate them all is going to be much more complex. Not only that, it is going to be much more brittle. If I am modifying it in the context of project A, can I really say that I didn’t break something for project B?

Let us take a simple example, executing tasks. In RaccoonBlog, we need tasks merely to handle comments and email (long running background tasks). In another application, we need to do retries, and we need to get notifications if, after N retries, the task has failed. In a third project, we need a way to specify dependencies between tasks.

Sure, you can build something that satisfies all three projects, but it would be drastically more complex than having to modify the original task executer for each project’s needs. And yes, I do mean copying the code and modifying it.

And no, it is not a horrible sin against the Little Endianness. Even duplicated N times, the code is going to be simpler to read, faster, and easier to maintain and modify over time.

Nitpicker note: I am not talking about 3rd party libraries here. If you can find something that fits your needs that already exists, wonderful. I am talking about infrastructure that you build, inside your organization.

Searching ain’t simple: solution

In my last post, I described the following problem:

image

And stated that the following trivial solution is the wrong approach to the problem:

select d.* from Designs d 
 join ArchitectsDesigns da on d.Id = da.DesignId
 join Architects a on da.ArchitectId = a.Id
where a.Name = @name

The most obvious reason is actually that we are thinking too linearly. I intentionally showed the problem statement in terms of UI, not in terms of a document specifying what should be done.

The reason for that is that in many cases, a spec document is making assumptions that the developer should not. When working on a system, I like to have drafts of the screens with rough ideas about what is supposed to happen, and not much more.

In this case, let us consider the problem from the point of view of the user. Searching by the architect name makes sense to the user, that is usually how they think about it.

But does it make sense from the point of view of the system? We want to provide a good user experience, which means that we aren’t just going to provide the user with a text box to plug in some values. For one thing, they would have to put in the architect’s full name as it is stored in our system. That is going to be a tough call in many cases. Ask any architect what the first name of Gaudi is, and see what sort of response you’ll get.

Another problem is how to deal with misspellings, partial names, and other information. What if we actually have the architect id, and are used to typing that? I would much rather type 1831 than Mies Van Der Rohe, and most users that work with the application day in and day out would agree.

From the system perspective, we want to divide the problem into two separate issues: finding the architect and finding the appropriate designs. From a user experience perspective, that means that the text box is going to be an ajax suggest box, and the results would be loaded based on a valid id.

Using RavenDB and ASP.Net MVC, we would have the following solution. First, we need to define the search index:

image

This gives us the ability to search across both name and id easily, and it allows us to do full text searches as well. The next step is the actual querying for architect by name:

image

Looks complex, doesn’t it? Well, there is certainly a lot of code there, at least.

First, we look for a matching result in the index. If we find anything, we send just the name and the id of the matching documents to the user. That part is perfectly simple.

The interesting bits happen when we can’t find anything at all. In that case, we ask RavenDB to find us results that might be the things that the user is looking for. It does that by running a string distance algorithm over the data in the database already and providing us with a list of suggestions about what the user might have meant.

We take it one step further. If there is just one suggestion, we assume that this is what the user meant, and just return the results for that value. If there is more than one, we send an empty result set to the client, along with a list of alternatives that it can suggest to the user.
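The decision flow described above can be sketched without RavenDB at all. Everything here is an assumption made for illustration: the architects live in an in-memory dictionary instead of an index, matching is a plain substring or id comparison, and suggestions come from a hand written Levenshtein distance with an arbitrary threshold of 2. It is not the RavenDB suggestion API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types, for illustration only.
public class ArchitectSearchResult
{
    public List<KeyValuePair<int, string>> Matches = new List<KeyValuePair<int, string>>();
    public List<string> Suggestions = new List<string>();
}

public static class ArchitectSearch
{
    // architects maps id to name; it stands in for the search index.
    public static ArchitectSearchResult Search(IDictionary<int, string> architects, string term)
    {
        var result = new ArchitectSearchResult();
        result.Matches = Find(architects, term);
        if (result.Matches.Count > 0)
            return result;

        // Nothing found: look for name parts within a small edit distance of the term.
        var suggestions = architects.Values
            .SelectMany(name => name.Split(' '))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .Where(word => Distance(word.ToLowerInvariant(), term.ToLowerInvariant()) <= 2)
            .ToList();

        if (suggestions.Count == 1)
            result.Matches = Find(architects, suggestions[0]); // assume this is what the user meant
        else
            result.Suggestions = suggestions;                  // let the user choose
        return result;
    }

    private static List<KeyValuePair<int, string>> Find(IDictionary<int, string> architects, string term)
    {
        return architects
            .Where(a => a.Key.ToString() == term ||
                        a.Value.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }

    // Classic Levenshtein edit distance (dynamic programming table).
    private static int Distance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (var i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (var j = 0; j <= b.Length; j++) d[0, j] = j;
        for (var i = 1; i <= a.Length; i++)
        {
            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }
}
```

Searching for an id or a partial name returns matches directly, a misspelling with a single close candidate is resolved on the user's behalf, and an ambiguous misspelling returns only the list of alternatives.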

From here, the actual task of getting the designs for this architect becomes as simple as:

image

And it turns out that when you think about it right, searching is simple.

Searching ain’t simple

The problem statement is best described using:

image

This seems like a nice and easy problem, right? We join the architects table to the designs table and we are done.

select d.* from Designs d 
 join ArchitectsDesigns da on d.Id = da.DesignId
 join Architects a on da.ArchitectId = a.Id
where a.Name = @name

This is a trivial solution, and shouldn’t take a lot of time to build…

It is also entirely the wrong approach to the problem. Can you tell me why?

Composite entities

In my previous post, I discussed some of the problems that you run into when you try to have a single source of truth with regards to an entity definition. The question here is, how do we manage something like a Customer across multiple applications / modules.

For the purpose of discussion, I am going to assume that all of the data is either:

  • All sitting in the same physical database (common if we are talking about different modules in the same application).
  • Spread across multiple databases with some data being replicated to all databases (common if we are talking about different applications).

We will focus on the customer entity as an example, and we will deal with billing and help desk modules / applications. There are some things that everyone can agree on with regard to the customer. Most often, a customer has an id, which is shared across the entire system, as well as some descriptive details, such as a name.

But even things that you would expect to be easily agreed upon aren’t really that easy. For example, what about contact information? The person handling billing at a customer is usually different from the person that we contact for help desk inquiries. And that is the stuff that we are supposed to agree on. We have much bigger problems when we have to deal with things like a customer’s payment status vs. outstanding helpdesk calls this month.

The way to resolve this is to forget about trying to shove everything into a single entity. Or, to be rather more exact, we need to forget about trying to think about the Customer entity as a single physical thing. Instead, we are going to have the following:

image

There are several things to note here:

  • There is no inheritance relationship between the different aspects of a customer.
  • We don’t give in and try to put what appears to be shared properties (ContactDetails) in the root Customer. Those details have different meaning for each entity.
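In code, the two points above might look like the following sketch. The class and property names are assumptions made for illustration (the original diagram is not reproduced here); what matters is that the billing and help desk types share nothing with the root `Customer` except its id, and that each one carries its own `ContactDetails`.

```csharp
// Hypothetical shapes, for illustration only.
public class ContactDetails
{
    public string Name;
    public string Email;
}

// Root: the little that everyone agrees on.
public class Customer
{
    public string Id;
    public string Name;
}

// Owned by billing. Not derived from Customer, linked by id only.
public class BillingCustomer
{
    public string CustomerId;
    public ContactDetails ContactDetails; // the person who handles billing
    public string PaymentStatus;
}

// Owned by help desk. Not derived from Customer either.
public class HelpDeskCustomer
{
    public string CustomerId;
    public ContactDetails ContactDetails; // the person to contact about support
    public int OutstandingCallsThisMonth;
}
```

Because each module owns its own type, billing can add or rename a property without anyone on the help desk side having to recompile, redeploy or even know about it.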

There are several ways to handle actually storing this information. If we are using a single database, then we will usually have something like:

image

The advantage of that is that it makes it very easy to actually look at the entire customer entity for debugging purposes. I say for debugging specifically because for production usage, there really isn’t anything that needs to look at the entire thing; every part of the system only cares about its own details.

You can easily load the root customer document and your own customer document whenever you need to.

More to the point, because they are different physical things, that solves a lot of the problems that we had with the shared model.

Versioning is not an issue. If billing needs to make a change, they can just go ahead and change things. They don’t need to talk to anyone, because no one else is touching their data.

Concurrency is not an issue. If you make concurrent modifications to billing and help desk, that is not a problem, because they are stored in two different locations. That is actually what you want, since it is perfectly all right to have those concurrent changes.

It frees us from needing everyone’s acceptance for any change, except for changes to the root document. But as you can probably guess, the amount of information that we put on the root is minimal, precisely to avoid that sort of situation.

This is how we handle things with a shared database, but what is going on when we have multiple applications, with multiple databases?

As you can expect, we are going to have one database which contains all of the definitions of the root Customer (or other entities), and from there we replicate that information to all of the other databases. Why not have them access two databases? Simple, it makes things so much harder. It is easier to have a single database to access and have replication take care of the rest.

What about updates in that scenario? Well, updates to the local part are easy, you just do that, but updates to the root customer details have to be handled differently.

The first thing to ask is whether there really is any need for any of the modules to actually update the root customer details. I can’t see any reason why you would want to do that (billing shouldn’t update the customer name, for example). But even if you have this, the way to handle that is to have a part of the system that is responsible for the root entities database, and have it do the update, from where it will replicate to all of the other databases.

There ain’t no such thing, the definitive entity definition

I was at a customer site, and we were talking about a problem they had with modeling their domain. Actually, we were discussing a proposed solution, a central and definitive definition for all of their entities, so all of the applications could use that.

I had a minor seizure upon hearing that, but after I recovered, I was able to articulate my objections to this approach.

To start with, it breaks the Single Responsibility Principle, the Open Closed Principle and the Interface Segregation Principle. It also makes versioning hard, and introduces a central place that everyone must coordinate with. Think about the number of people that have to be involved whenever you make a change.

Let us take the customer as the representative entity for this discussion. We can all agree that a customer has to have a name, an email and an id. But billing also needs to know his credit card information, help desk needs to track what support contracts he has and sales needs to know what sort of products we sold the guy, so we can sell him upgrades.

Now, would you care to be the guy who has to mediate between all of those different concerns?

And what about changes and updates? Whenever you need to make a change, you have to wait for all of those teams and applications to catch up, update and deploy their apps.

And what about actual usage? You actually don’t want the help desk system to be able to access the billing information, and you most certainly don’t want them to change anything there.

And does it matter if we have concurrent modifications to the entity by both help desk and billing?

All of those things argue very strongly against having a single source of truth about what an entity is. In my next post, I’ll discuss a solution for this problem, Composite Entities.
