Ayende @ Rahien

It's a girl

Why I HATE editing?

I have just finished the 2nd editing pass over all the chapters and appendices for my DSL book. This has been, I think, the 1,390th time that I have gone through the book since I finished writing it. Luckily for me, this appears to be the end; only the final printer proofs stage is left, and then I’ll truly be free.

One of the things that I wasn’t told about when I agreed to write the book is just how many times I would have to read it. I think that I am a good writer (God knows that I get enough practice), but for crying out loud, there are only so many times that I can read the same piece (especially if I wrote it) before it gets pretty boring.

Now, to be absolutely clear, the publisher is doing a great job, and I think that each pass has been extremely valuable. We found problems in grammar, spelling, fuzzy speak, missing concepts, bugs in code, etc. This is important, valuable and necessary.

One of the things that some people have a hard time understanding about the way that I think is that I can accept that some things are necessary even if they aren’t fun. That means that I’ll do them, but I’ll complain about it. Just like taxes. I’ll pay them, but I reserve the right to bitch about it.

It took me a while to understand why I hate editing so much. I know how to handle boring work, and I know how to handle creative work. What I don’t know how to handle is boring creative work. When I am doing boring stuff, I split my attention, usually having something interesting going on in the background. Creative stuff I usually just enjoy.

But boring creative stuff? I can’t split my attention then, and it is boring. It is annoying, and I do think about it like a tax.

Consider this statement, that was flagged by my editor:

The MacroMacro passes the MacroStatement that the compiler hands to the macro using the name of the macro

Can you figure out what I meant here? It took me a while to do that :-)

Editing the book has been a huge weight on my mind, having finished that is such a huge relief…

You see that database? OFF WITH HIS HEAD!

A while ago I was chatting with a friend who complained about a migration script timing out on his local machine. When I looked at the script, it was fairly obvious what was wrong:

DECLARE @VideoID UNIQUEIDENTIFIER
DECLARE @NewID UNIQUEIDENTIFIER

DECLARE VideoCursor CURSOR READ_ONLY
FOR
SELECT ID FROM Video

OPEN VideoCursor

FETCH NEXT FROM VideoCursor
INTO @VideoID

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @NewID = NEWID()
   
    INSERT INTO Content (ID, Body, FormatBody, Plain)
        SELECT @NewID, ContentItem.Body, Video.FormatBody, Video.Plain
        FROM ContentItem
        INNER JOIN Video
        ON Video.Id=ContentItem.ID
        WHERE Video.Id=@VideoID
   
    UPDATE Video SET ContentId=@NewID WHERE Video.Id=@VideoID
       
    UPDATE ThumbImage SET ContentId=@NewID WHERE Video_id=@VideoID
   
    FETCH NEXT FROM VideoCursor
    INTO @VideoID
END

CLOSE VideoCursor
DEALLOCATE VideoCursor

The script was using a cursor!

Every time you want to use a cursor, you must fast for three days while reading the memoirs of Edgar F. Codd.

Cursors are evil!

Let us see how we can make this work using set based logic, shall we?

SELECT newid() as NewId, Video.Id as OldId, ContentItem.Body, Video.FormatBody, Video.Plain
INTO #TempContent
FROM ContentItem
INNER JOIN Video
ON Video.Id=ContentItem.ID

INSERT INTO Content(ID, Body, FormatBody, Plain)
SELECT NewId, Body, FormatBody, Plain
FROM #TempContent

UPDATE Video
SET ContentId=NewId
FROM #TempContent
WHERE Video.Id=OldId

UPDATE ThumbImage
SET ContentId=NewId
FROM #TempContent
WHERE ThumbImage.Video_id=OldId

DROP TABLE #TempContent

I can assure you that this will work faster, read better, get parallelized by the database, and in general be better behaved than the previous version.

Transactional queuing system perf test

After running into severe performance issues when using MSMQ and transactions, I decided to run a more thorough test case.

Writing 10,000 messages to an MSMQ transactional queue, using separate transactions:

private static void AddData(MessageQueue q1, byte[] bytes)
{
    Console.WriteLine("{0:#,#}", bytes.Length);
    var sp = Stopwatch.StartNew();
    for (int i = 0; i < 10000; i++)
    {
        using (var msmqTx = new MessageQueueTransaction())
        {
        
            msmqTx.Begin();
            
            q1.Send(new Message
            {
                BodyStream = new MemoryStream(bytes),
                Label = i.ToString()
            }, msmqTx);

            msmqTx.Commit();
        }

        if (i%10 == 0) Console.WriteLine(i);
    }
    q1.Dispose();

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This took 6.08 seconds, or about 1,600 messages per second.

Writing 10,000 messages to an MSMQ transactional queue, in a single transaction:

private static void AddData(MessageQueue q1, byte[] bytes)
{
    Console.WriteLine("{0:#,#}", bytes.Length);
    var sp = Stopwatch.StartNew();
    using (var msmqTx = new MessageQueueTransaction())
    {

        msmqTx.Begin();

        for (int i = 0; i < 10000; i++)
        {


            q1.Send(new Message
            {
                BodyStream = new MemoryStream(bytes),
                Label = i.ToString()
            }, msmqTx);


            if (i % 10 == 0) Console.WriteLine(i);
        }
        msmqTx.Commit();
    }
    q1.Dispose();

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This took 0.825 seconds for 10,000 messages, so it should be able to process about 12,000 messages per second. But this is not a realistic scenario.

Now, to my test scenario, which is touching two queues for 10,000 messages…

Wait! I accidentally ran the wrong code, and that led me down the following path:

private static void JustTx()
{
    var sp = Stopwatch.StartNew();
    int msgs = 0;
    for (int i = 0; i < 1000; i++)
    {
        using (var msmqTx = new MessageQueueTransaction())
        {
            msmqTx.Begin();
            msgs = msgs + 1;
            if (msgs % 10 == 0)
                Console.WriteLine(msgs);
            
            msmqTx.Commit();
        }
    }
    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This code is just opening & committing MSMQ transactions; no actual operation is done on the queue, and there shouldn’t be any work for the system. We open & close 1,000 transactions. That takes 17 seconds.

Remember, this is just opening and closing an MSMQ local transaction, with no work done, and that gives me about 60 transactions per second. I think that I have found my culprit. I have other independent verification of this, and I find it extremely sad.

I am still waiting to hear from MSMQ experts about what is going on.

In the meantime, I tested this with my own queuing project, Rhino Queues. The first thing that I tried was the simplest, unrealistic scenario of sending 10,000 messages in a single transaction:

private static void AddData(IQueueManager manager, byte[] bytes)
{
    var sp = Stopwatch.StartNew();
    using (var tx = new TransactionScope())
    {
        for (int i = 0; i < 10000; i++)
        {
            manager.Send(new Uri("rhino.queues://localhost:2222/test1"), new MessagePayload
                {
                    Data = bytes
                });


            if (i % 10 == 0)
                Console.WriteLine(i);
        }
        tx.Complete();
    }

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This code takes 12.2 seconds to run, giving about 800 messages per second. Not bad, but not really good either. In this scenario MSMQ finished everything in less than a second.

Let us see a more realistic scenario, of sending 10,000 messages in 10,000 separate transactions:

private static void AddData(IQueueManager manager, byte[] bytes)
{
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10000; i++)
    {
        using (var tx = new TransactionScope())
        {
            manager.Send(new Uri("rhino.queues://localhost:2222/test1"), new MessagePayload
            {
                Data = bytes
            });


            if (i % 10 == 0)
                Console.WriteLine(i);
            tx.Complete();
        }
    }

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This completes in 3.7 minutes, for about 45 messages per second. Slightly worse than what MSMQ can do, which isn’t good.

However, the good part about using Rhino Queues is that I know what is going on there, and I intentionally left some points of optimization out (get it working, get it working right, get it working fast). After exploring some of those optimizations, the same code base ran for 2.7 minutes, so we saved 60 seconds of runtime, bringing us to about 60 messages per second.

Rhino Queues is now comparable to MSMQ performance in this scenario. I find this spooky, to tell you the truth. Profiling Rhino Queues tells me that most of the time with Rhino Queues (over 40%!) is spent not in Rhino Queues, but inside System.Transactions.Transaction.Dispose().

I wonder how I can reduce that load.

The next thing I tried was implementing ISinglePhaseNotification. This means that if there are no other durable enlistments for the DTC, Rhino Queues will be able to take advantage of the lightweight transactions support in System.Transactions.
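
Roughly, the shape of such an enlistment looks like the sketch below; the resource manager id and the commit/rollback bodies are placeholders, not the actual Rhino Queues code:

// using System; using System.Transactions;
public class QueueEnlistment : ISinglePhaseNotification
{
    private static readonly Guid ResourceManagerId = new Guid("11111111-2222-3333-4444-555555555555");

    public void Enlist(Transaction tx)
    {
        // a single durable enlistment that supports single phase commit
        // can stay with the lightweight transaction manager, no DTC escalation
        tx.EnlistDurable(ResourceManagerId, this, EnlistmentOptions.None);
    }

    public void SinglePhaseCommit(SinglePhaseEnlistment enlistment)
    {
        // commit the local queue work here
        enlistment.Committed();
    }

    public void Prepare(PreparingEnlistment enlistment)
    {
        // only used when another durable resource forces a full two phase commit
        enlistment.Prepared();
    }

    public void Commit(Enlistment enlistment)
    {
        enlistment.Done();
    }

    public void Rollback(Enlistment enlistment)
    {
        // undo the local queue work here
        enlistment.Done();
    }

    public void InDoubt(Enlistment enlistment)
    {
        enlistment.Done();
    }
}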

That change had a dramatic effect. I can’t really grasp the perf difference!

The code (same code!) now executes in 16.7 seconds! That means 600 messages per second.

Of course, things aren’t as good when you compare the CopyData routine:

private static void CopyData(IQueueManager manager)
{

    Stopwatch sp = Stopwatch.StartNew();
    for (int i = 0; i < 10000; i++)
    {
        using (var tx = new TransactionScope())
        {
            var message = manager.Receive("test1");
            manager.Send(new Uri("rhino.queues://localhost/test2"), new MessagePayload
            {
                Data = message.Data,
                Headers = message.Headers
            });

            if (i % 10 == 0)
                Console.WriteLine(i);
            tx.Complete();
        }
    }

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This guy takes 5.7 minutes to complete, at a rate of about 30 messages per second. That is pretty lame. Similar code with MSMQ did 60 messages per second on my machine, and I considered that utterly unacceptable.

Firing up dotTrace showed me that I was being STUPID (as in, not just stupid, or Stupid with a capital S, but full fledged stupid in all caps) and didn’t clean read messages out of the queue. After I fixed that, the same code ran in 1.5 minutes, or about 110 messages per second.

Nice.

And I still consider being almost twice as fast as MSMQ to be very spooky.

An epic bug story

This is a fictitious example of such a bug, but it will probably sound familiar.

Jane was soundly sleeping around 3 AM when the phone started ringing off the hook. Dazed and not quite in full possession of her faculties, she picked up the phone and heard Tom from Operations screaming that the Northwind System was gone. It wasn’t just one application that died, it was all of them, and Operations had no idea what was going on. Every CxO in the company was pissed, and they were bleeding money and reputation by the second.

While driving to work, Jane took part in a conference call trying to figure out what happened. The following things became clear:

  • There had been no changes to the production environment in the last week.
  • The last change was in the reporting system, which was humming along for the last two weeks or so.
  • Database servers were pegged by ridiculous amounts of queries from all applications.
  • Cache servers were up and functioning.
  • The production log verbosity was adjusted, and it seemed that every single query was a cache miss.

It became clear that something caused every single application to skip the cache and go directly to the database, but no one knew what.

Upon arriving at the office, they ran into more puzzlement, frustration and anger. The problem was definitely related to the cache, but there was absolutely no explanation for it. The cache code had been written several years ago; it was a pretty complex piece of code, and it had been tested in production for almost as long.

The first suspicion fell on some sort of communication breakage between the application servers and the cache servers, but that was quickly ruled out. Looking at the logs, it was clear that the problem was 100% cache misses causing every single query to hit the database, which responded by throwing up its hands and sulking, refusing to talk to anyone.

The problem was absolutely non reproducible in the test or development environment.

The team decided to appropriate a single production server and run their tests on it. In production, it was absolutely reproducible, but no one could figure out why. Network traces from that server showed that it never even tried to connect to the cache servers for reading the queries, however, it would happily connect to the cache servers for writing the queries.

The root problem had been found, then: somehow, there was a problem reading data from the cache servers. Three developers and at least one IT guy were willing to swear on the religion of your choice that this was absolutely not possible, but the evidence suggested otherwise.

Installing a kernel level network trace was suggested. After installing the trace and rebooting the machine, the team hovered over the admin’s head while the application was started up again, watching the network trace from the kernel. This time, however, they could see it reading from the cache servers.

Not only that, but it looked like the application was functioning normally! Installing the kernel trace fixed the problem! Since this was an issue of production being down, they tried the same method on a second server, and that came back up normally as well. Every single IT guy in the building was suddenly busy installing a kernel network trace, and lo and behold, the entire system came back up.

By that time, it was close to 4 PM of the following day, and just about everyone was exhausted. The CTO sent everyone home and scheduled a meeting at 9 AM to discuss what happened.

At 9 AM the following day, no one could really suggest a working hypothesis for what actually happened. The Administration Module Team Lead suggested that the system had run into some threading bug in the network stack, and that installing the kernel level trace had changed the timing just enough to make the problem go away. The Reporting Module Team Lead snapped that it is not possible to have a network bug that only affects reading from a particular socket. The IT group insisted that there had been no modifications to the production environment; all firewall rule changes in the last month were reviewed, and none of them could have caused this.

The CTO closed the meeting by quoting the damage that this downtime had done to the company, and directed everyone to investigate it further. No actual smoking gun could be found, and eventually the matter was dropped. One of the results of this incident was an added emphasis on the production worthiness of all the parts of the system. Traces and logs were enhanced, circuit breakers were introduced, and in general the team thought that they were pretty safe. The kernel trace had fixed the problem, of course, but next time they would have a lot more information about what was going on.

Four months later…

The production issue that brought down the site was almost forgotten by this time; the system had been running along happily for months, the changes to make the system more robust in production were already in place, and everyone was happy.

Until the 3 AM call on a Thursday night. While driving back to work on empty roads, Jane had a strong sense of deja vu. Again, the problem was that the system was hammering the database; again, the log showed 100% cache misses. Again, there had been no significant change to the production environment.

This time, the team already knew who the culprit was, and the caching system was given a thorough going over. No one could really find anything wrong with it, but even more logging was added, and the change was pushed to production, where it didn’t seem to do any good at all.

Trying to ignore the CxO screaming match in the nearby conference room, the Infrastructure team tried to figure out what exactly was wrong. Jane thought that the problem was with the cache invalidation logic. The caching system used a fairly complex system to decide whether a value in the cache was fresh enough or not.

Disabling that piece of the system risked stale data scenarios, but stale data was better than no data. They commented that part out and pushed it to production. Unfortunately, it didn’t change one iota; the system kept producing cache misses.

That was patently impossible, of course, but there it was. The IT group, in the meantime, came to the rescue. Trevor, the new guy on the team, decided to try deploying the application to a brand new server that was waiting to be provisioned. On the new server, the system worked, and that server did not have the kernel network trace installed.

Acting on instinct, they tried uninstalling the kernel network trace on one of the existing production servers. Like magic, that server came back to life and started working normally seconds after the reboot completed. A rolling uninstallation of the magical kernel trace fixed the issue.

But this was the second failure this year, and both could be traced to the same component. The development team was at a loss to explain what was going on there. For the next month or so, no developer dared walk by the executives’ part of the building. External consultants were brought in to do an in depth study of the code base, trying to find the problem. The consultants couldn’t really find anything to blame, but they didn’t trust the existing home grown cache servers. The company bought a commercial cache package, to the tune of $2,000 per CPU, replacing the caching backend.

For a while, it seemed like that would do it…

Jane woke up a few seconds before 3 AM, two months after the previous incident. It was Thursday again, and the phone was ringing. After briefly considering running away to the Sahara, she picked up the phone. It was the exact same scenario.

After arriving at the office, Jane didn’t bother to join the already violent row in the conference room. She sat down at her desk, brought up the Caching Module, and added a logging statement after each line in the codebase. Then she called Operations and had them push the new caching module to production.

That change didn’t fix the problem, but it gave Jane an exception. And that exception just wasn’t possible.

System.IndexOutOfRangeException: Index was outside the bounds of the array.
at System.Array.InternalGetReference(Void* elemRef, Int32 rank, Int32* pIndices)
at System.Array.GetValue(Int32 index)
at Northwind.CachingModule.CacheSelector.GetServer()
at Northwind.CachingModule.Cache.Get(string key)

That didn’t make any sort of sense, but at least she had a single method to check for.

private CacheServer GetServer()
{
    if (aliveAndWellServes.Count == 0)
        return null;
    var server = aliveAndWellServes[Environment.TickCount % aliveAndWellServes.Length];
    if (server.IsAlive == false)
    {
        aliveAndWellServes.Remove(server);
        server.Dispose();
    }
    return server;
}

The idea was to spread the load between the cache servers, so no single cache server would be hit repeatedly. Somehow, that method was failing, and when it failed, it caused the caching system’s generic error handler to treat this as a downed cache server and produce a cache miss.

Jane pushed another update to production, this time adding pervasive logging to the GetServer method. The output said it all.

Alive and well servers: 4
Tick count: -137538094
Server index: -2

Looking at the documentation of Environment.TickCount, it all became clear. That property returns a negative number once the system has been running for about 25 days. When that happened, the cache implementation would fail, overloading the database and shutting the entire system down.

The reason that the error was intermittent was that the servers were occasionally rebooted, so you had to wait until enough time had passed since the last reboot. Installing and uninstalling the kernel trace required a reboot, which temporarily fixed the problem.
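
For what it is worth, here is a minimal sketch of how the selection could sidestep the negative modulo (reusing the field name from the snippet above, and leaving the dead server handling aside):

private CacheServer GetServer()
{
    if (aliveAndWellServes.Count == 0)
        return null;

    // Environment.TickCount wraps into negative territory after ~25 days of uptime,
    // and a negative value modulo a positive count stays negative in C#;
    // masking the sign bit keeps the index in range
    var index = (Environment.TickCount & int.MaxValue) % aliveAndWellServes.Count;
    return aliveAndWellServes[index];
}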

An interesting final tidbit about this error. No one mentioned to the executives that all source control systems come with a blame feature.


YAGNI, I told them, millennium hand and shrimp!

I am running a long experiment with myself, to test whether you really aren’t gonna need it (YAGNI).

So I went way back to this:

image

And this:

image

This is an application that I am explicitly growing organically, only adding things as I really need them. It is working so far.

Dense code

The amount of information that you can push into just a few lines of code keeps amazing me.

image

This is doing so much stuff that it is just creepy. And yes, it is quite elegant at the same time, since it is handling the required scenario in such a simple manner.

Impleo – a CMS I can tolerate

If you head out to http://hibernatingrhinos.com/, you will see that I finally had the time to set up the corporate site. It is still very early, and I have a lot of content to add there, but it is a start.

Impleo, the CMS running the site, doesn’t have any web based interface; instead, it is built explicitly to take advantage of Windows Live Writer and similar tools. The “interface” for editing the site is the MetaWeblog API. This means that in order to edit the site, there isn’t any wiki syntax to learn, or XML files to edit, or anything of this sort.

You have a powerful editor at your fingertips, one that properly handles things like adding images and other content. This turns the whole experience around. I usually find writing documentation boring, but I am used to writing in WLW; it is fairly natural to do, and it removes all the pain from the equation.

One of the things that I am trying to do with it is to setup a proper documentation repository for all my open source projects. This isn’t something new, and it is something that most projects have a hard time doing. I strongly believe in making things simple, in reducing friction. What I hope to do is to be able to accept documentation contributions from the community for the OSS projects.

I think that having a full fledged rich text editor in your hands is a game changer, compared to the usual way OSS projects handle documentation. Take a look at what is needed to make this work; it should take three minutes to get started, no learning curve, no “how do they do this”.

So here is the deal, if you would like to contribute documentation (which can be anything that would help users with the projects), I just made things much easier for you. Please contact me directly and I’ll send you the credentials to be able to edit the site.

Thanks in advance for your support.

NHibernate webcasts series

Rob Conery and I sat down and started recording a series of webcasts on NHibernate, which will show up on TekPub. TekPub is a paid service, and there are two NHibernate screencasts there already. We plan to have lots more.

I got a lot of requests for making my NHibernate courses available to a wider audience. And this is a way for you to get a ton of NHibernate content. So far the reactions that we got were very positive.

Oh, and as a side effect, you get to see NH Prof used by the person who wrote it, you might learn a thing or two about extracting information from the profiler.

How to mis-configure your production to dead crawl

One of the things that I avidly subscribe to is that when you are building server applications, visibility into what the application is doing is key. If you don’t have visibility into the application, you are left in the dark, stumbling and alone.

NH Prof is a good example of how you make things much more visible to developers, but I am doing this in many other places, Rhino Service Bus is another example of how important it is to have visibility into what your servers are doing. Being able to see what is actually going on in your apps is priceless!

However, that does come at a cost. And a pretty steep one at that.

Logging is expensive. It is expensive to gather the data to log, it is expensive to actually log the data, and in order to produce coherent logs in a multi threaded environment, you need to perform synchronization, etc.

One of the most common performance problems that people run into is pushing their debug logging configuration to production. If we look at NHibernate as a good example: if you set the log level to debug, you are going to get tens of thousands of messages just for startup! And you are going to get hundreds or thousands of log messages for routine operations.

That is going to slow things down, considerably.

Production configuration management is a big topic, and it deserves its own post, but the impact of setting the log level too low in production cannot be stressed strongly enough.

Pay attention to this, or it will make you pay attention to it.

In one notable case, several years ago, we re-implemented a whole module in the application (significantly optimizing it) and got exactly zero benefit (well, negative benefit, actually) from the exercise, since the actual problem was a slow logging database.

In another, we improved performance by over 60% by simply raising the log level from DEBUG to ERROR.
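
For reference, here is roughly what that production setting can look like, assuming log4net is the logging framework (the appender name is just a placeholder for whatever you already have configured):

<log4net>
  <root>
    <!-- production: ERROR only; DEBUG belongs in dev/test environments -->
    <level value="ERROR" />
    <appender-ref ref="RollingFile" />
  </root>
  <!-- NHibernate's loggers are the usual source of debug level noise -->
  <logger name="NHibernate">
    <level value="ERROR" />
  </logger>
  <logger name="NHibernate.SQL">
    <level value="ERROR" />
  </logger>
</log4net>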

MSMQ: Insufficient resources to perform operation.

I ran into an annoyingly persistent error on a production server:

System.Messaging.MessageQueueException: Insufficient resources to perform operation.
at System.Messaging.MessageQueue.SendInternal(Object obj, MessageQueueTransaction internalTransaction, MessageQueueTransactionType transactionType)
at System.Messaging.MessageQueue.Send(Object obj)
at ConsoleApplication1.Program.Main(String[] args)

This was generated from the following code:

class Program
{
    static void Main(string[] args)
    {
        try
        {
            var queue = new MessageQueue(@".\private$\MyQueue");
            queue.Send("test msg");
            Console.WriteLine("success");
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
        }
    }
}

The error was happening on all queues, and persisted even after rebooting the machine. I should mention that we had runaway message generation, and we currently have queues with lots of messages. The queue that I was trying to send the message to was empty, however, and there was no reading from or writing to the queues.

A nasty issue to figure out. I was pointed to this post, and indeed, suggestion #7 was it: we ran over the machine quota for MSMQ, so it started rejecting messages.

No one really thought about it, because we never set the quota, but it seems that by default there is a 1GB quota for the entire machine. This is usually more than enough to handle things, I would assume, but when you run into a convoy situation, bad things will happen.
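
If you want to check what you are actually running with, the per-queue quota is exposed on MessageQueue, and the machine-wide quota lives (as far as I can tell, this is an assumption about a default MSMQ install) in the registry; a quick sketch:

// using System; using Microsoft.Win32; using System.Messaging;

// machine-wide quota, in KB; a missing value means the default (1 GB) is in effect
using (var key = Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Microsoft\MSMQ\Parameters"))
{
    var machineQuotaKb = key != null ? key.GetValue("MachineQuota") : null;
    Console.WriteLine("Machine quota (KB): {0}", machineQuotaKb ?? "not set (1 GB default)");
}

// per-queue quota, also in KB
var queue = new MessageQueue(@".\private$\MyQueue");
Console.WriteLine("Queue quota (KB): {0}", queue.MaximumQueueSize);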

Database independence with NHibernate

Karl has posted about his experience of porting an NHibernate application from SQL Server to PostgreSQL; long story short, he did it in 1 hour.

He does bring up a few points where the database that he was using bled into the code, hurting the database independence goal. I wanted to look at those and point out the built-in solutions that NHibernate provides for them.

Handling unique constraint violations: Karl has code like this:

try
{
    Session.SaveOrUpdate(user);
    transaction.Commit();
    return true;
}
catch (GenericADOException ex)
{
    transaction.Rollback();
    var sql = ex.InnerException as SqlException;
    if (sql != null && sql.Number == 2601)
    {
        return false;
    }
    throw;
}
catch (Exception)
{
    transaction.Rollback();
    throw;
}

Obviously, this code relies heavily on internal knowledge of SQL Server error codes, and wouldn’t translate to PostgreSQL.

With NHibernate, you deal with those issues by writing an ISqlExceptionConverter. You can read Fabio’s post about this, but the gist of it is that you can provide your own exceptions in a database independent way.

The count function may return different types on different databases. NHibernate tries to stay true to whatever the database is giving it, so it wouldn’t truncate information. You can force the issue by overriding the count function definition in a derived dialect. You can see an example of that in this post.

Bits & Booleans: while SQL Server accepts 1 as a Boolean value, PostgreSQL requires that you use ‘1’ instead. This is where NHibernate’s query.substitutions support comes in. I usually define substitutions for true and false and then use true and false in the queries. Based on the database that I am running on, I can select what will be substituted for those values.
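
The setup for that is a single configuration property; a sketch along these lines (the User entity and the exact substitution values are just an example, they depend on the dialect you are targeting):

// map the tokens "true" and "false" to whatever the current database expects
cfg.SetProperty("query.substitutions", "true 1, false 0");

// queries can now use the tokens and stay database independent
var activeUsers = session
    .CreateQuery("from User u where u.IsActive = true")
    .List<User>();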

Achieving database independence with NHibernate is very easy, as Karl’s 1 hour porting story proves.

Emergency fixes & the state of your code base

One of the most horrible things that can happen to a code base is a production bug that needs to be fixed right now. At that stage, developers usually throw aside all attempts at creating well written code and just try to fix the problem using brute force.

Add to that the fact that very often we don’t have the exact cause of the production bug, and you get an interesting experiment in the scientific method. The devs form a hypothesis, try to build a fix, and “test” it in production. Repeat until something makes the problem go away, which isn’t necessarily one of the changes that the developers intended to make.

Consider that this is a highly stressful period of time, with very little sleep (if any), and you get into a cascading problem situation. A true mess.

Here is an example from me trying to figure out an error in NH Prof’s connectivity:

image

This is not how I usually write code, but it was very helpful in narrowing down the problem.

But this post isn’t about the code example, it is about the implications for the code base. It is fine to cut corners in order to resolve a production bug. One thing that you don’t want to do is commit those corners to source control, or keep them in production. In one memorable case, during the course of troubleshooting a production issue, I enabled full logging on the production server, then forgot to turn it off when we finally resolved the issue. Fast forward three months, and we had a production crash because of a full hard disk.

Another important thing is that you should only try one thing at a time when you troubleshoot a production error. I often see people try something, and when it doesn’t work, try something else on top of the modified code base.

That change that you just made may be important, so by all means save it in a branch and come back to it later, but for production emergency fixes, you want to always work from the production code snapshot, not on top of an increasingly more frantically changed code base.

Challenge: Can you spot the bug?

I have this piece of code:

public static string FirstCharacters(this string self, int numOfChars)
{
    if (self == null)
        return "";
    if (self.Length < numOfChars)
        return self;
    return self
        .Replace(Environment.NewLine, " ")
        .Substring(0, numOfChars - 3) + "...";
}

And the following exception was sent to me:

System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.
Parameter name: length
  at System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length, Boolean fAlwaysCopy)
  at StringExtensions.FirstCharacters(String self, Int32 numOfChars)

It actually took me a while to figure out what was going on. Partly that is because we have an “impossible” stack trace (the JIT seems to have inlined the Substring call). When I did figure out what the bug was, it was a head slapping moment.

JAOO: More on Evolving the Key/Value Programming Model to a Higher Level from Billy Newport

As I already mentioned, this presentation had me thinking. Billy presented a system called Redis, which is a Key/Value store intended for attribute based storage.

That means that storing something like User { Id = 123, Name = “billy”, Email = “billy@example.org”} is actually stored as:

{ "uid:123:name": "billy" } 
{ "uid:123:email": "billy@example.org" }
{ "uname:billy": "123" } 

Each of those lines represents a different key/value pair in the Redis store. According to Billy, this has a lot of implications. On the plus side, you get no schema and very easy support for just adding stuff as you go along. On the other hand, Redis does not support transactions, and it is easy to “corrupt” the database during development (usually as a result of a programming bug).

What actually bothered me the most was the implications for the number of remote calls being made. The problem shows itself very well in this code sample (a twitter clone), which shows inserting a new twit:

long postid = R.str_long.incr("nextPostId");
long userId = PageUtils.getUserID(request);
long time = System.currentTimeMillis();
String post = Long.toString(userId)+"|" + Long.toString(time)+"|"+status;
R.c_str_str.set("p:"+Long.toString(postid), post);
List<Long> followersList = R.str_long.smembers(Long.toString(userId)+":followers");
if(followersList == null)
   followersList = new ArrayList<Long>();
HashSet<Long> followerSet = new HashSet<Long>(followersList);
followerSet.add(userId);
long replyId = PageUtils.isReply(status);
if(replyId != -1)
   followerSet.add(replyId);
for(long i : followerSet)
    R.str_long.lpush(Long.toString(i)+":posts", postid);
// -1 uid is global timeline
String globalKey = Long.toString(-1)+":posts";
R.str_long.lpush(globalKey, postid);
R.str_long.ltrim(globalKey, 200);

I really don’t like the API, mostly because it reminds me of C, but the conventions are pretty easy to figure out.

  • R is the static gateway into the Redis API
  • str_long – store long
  • c_str_str – store string and keep it in a nearby cache

The problem with this type of code is the number of remote calls and the locality of those calls. With a typical sharded set of servers, you are going to have lots of calls going all over the place. And when you get into people that have thousands and millions of followers, the application is simply going to die.

A better solution is required. Billy suggested using async programming or sending code to the data store to execute there.

I have a different variant on the solution.

We will start from the assumption that we really want to reduce remote calls, and that the system’s performance in the face of large amounts of writes (without impacting reads) is important. The benefit of using something like Redis is that it is very easy to get started, very easy to change things around, and great for rapid development mode. We want to keep that for now, so I am going to focus on a solution based on the same premise.

The first thing to go is the notion that a key can sit anywhere that it wants. In a key/value store, it is important to be able to control locality of reference. We change the key format so it is now: [server key]@[local key]. What does this mean? It means that for the previously mentioned user, this is the format it will be stored as:

{ "uid:123@name": "billy" } 
{ "uid:123@email": "billy@example.org" }
{ "uname@billy": "123" } 

We use the first part of the key (before the @) to find the appropriate server. This means that everything with a prefix of “uid:123” is known to reside on the same server. That allows you to do things like transactions over a single operation that sets multiple keys.

Once we have that, we can start adding to the API. Instead of getting a single key at a time, we can get a set of values in one remote call. That has the potential of significantly reducing the number of remote calls we will make.
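
To make that concrete, here is the kind of client API I have in mind; this is hypothetical, not something the current Redis client exposes:

// all of these keys share the "uid:123" prefix, so this is guaranteed to be
// a single remote call to a single server
var values = Redis.GetValues("uid:123@name", "uid:123@email");

// and because they live on the same server, they can also be set atomically
Redis.SetValues(new Dictionary<string, string>
{
    { "uid:123@name", "billy" },
    { "uid:123@email", "billy@example.org" }
});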

Next, we need to consider repeated operations. By that I mean anything where we have a loop in which we call to the store. That is a killer when you are talking about any data of significant size. We need to find a good solution for this.

Billy suggested sending a JRuby script (or similar) to the server and executing it there, saving the network roundtrips. While this is certainly possible, I think it would be a mistake. I have a much simpler solution: teach the data store about repeated operations. Let us take as a good example the copying that we are doing of a new twit to all your followers. Instead of reading the entire list of followers into memory, and then writing the status to every single one of them, let us do something different:

Redis.PushToAllListFoundIn("uid:"+user_id+"@followers", status, "{0}@posts");

I am using the .NET conventions here because otherwise I would go mad. As you can see, we instruct Redis to go to a particular list and copy the status that we pass it to all the keys found in that list (after formatting each key with the pattern). This gives the data store enough information about the operation to be able to optimize it considerably.

With just these few changes, I think that you gain enormously, and you retain the very simple model of using a key/value store.

Rant: Compilers != Parsers

This is just something that really annoys me. There is a common misconception that compilers are mostly about parsing the code. This couldn’t be further from the truth. About the easiest step in building a compiler is parsing the text into some form of machine readable format. There are great tools to help you there, and a lot of information.

It is the next stage that is complex: taking the machine readable format (the AST) and turning it into an executable. There is a lot less information about that, and it tends to be a pretty gnarly sort of problem.

As the title says, this post is here just to make sure that people distinguish between the two.

LinqToSql Profiler & NHibernate Profiler – What is happening?

About two weeks ago I posted my spike results of porting NHProf to Linq to SQL; a few moments ago I posted screenshots of the code that is going to private beta.

I spent most of the last week working on the never-ending book, but that is a subject for a different post. Most of the time that I actually spent on the profiler wasn’t spent on integrating with Linq to SQL. To be frank, it took me two hours to do basic integration with Linq to SQL. That is quite impressive, considering that it is from the point of view of someone who never did anything more serious with it than the hello world demo, and that was years ago. The Linq to SQL codebase is really nice.

Most of the time went into refactoring the profiler. I know that I am going to want to add additional OR/Ms (and other stuff) there, so I spent the time to make the OR/M a concept in the profiler. That was a big change, but it was mostly mechanical. All the backend changes are done, and plugging in a new OR/M should take about two hours, I guess.

For Linq to SQL, however, the work is far from over. While we are now capable of intercepting, displaying and analyzing Linq to SQL output, the current Linq to SQL profiler, as you can see in the screen shots, is using the NHibernate Profiler skin; we need to adapt the skin to match Linq to SQL (different names, different options, etc).

That is the next major piece of work that we have left before we can call the idea of OR/Ms as a concept done, and release the Linq to SQL profiler as a public beta.


LinqToSql Profiler private Beta – the screen shot gallery

This is just to give you some ideas about what the new LinqToSql Profiler can do.

It can track statements and associate them with their data context, and gives you properly formatted and highlighted SQL, including the parameters, in a way that you can just copy and execute in SQL Server:

image

The profiler can tell you when you are hitting the database too often.

image

All the usual reports: which methods caused which queries to be executed?

image

Reports about unique queries across the entire profiling session:

image

And overall usage reports:

image


Challenge: Why is this wrong?

Can you figure out why this is a horribly wrong thing to do?

[ThreadStatic]
private static readonly MySingleton mySingleton = CreateSingleInstance();

A free license of Rhino Mocks will be raffled among the people who answer it correctly.

NHibernate Shards: Progress Report

Since my last post about it, there have been a lot of changes to NHibernate Shards.

Update: I can’t believe I forgot; I was so caught up in how cool this was that I didn’t give the proper credits. Thanks to Dario Quintana and all the other contributors to NHibernate Shards.

The demo actually works :-) You can look at the latest code here: http://nhcontrib.svn.sourceforge.net/svnroot/nhcontrib/trunk/src/NHibernate.Shards/

You can read the documentation for the Java version (most of which is applicable for the .NET version) here: http://docs.jboss.org/hibernate/stable/shards/reference/en/html/

Let us go through how it works, okay?

We have the following class, which we want to shard.

image

The class mapping is almost standard:

image

As you can see, the only new thing is the primary key generator. Because entities are sharded based on their primary key, we have to encode the appropriate shard id in the key. The easiest way of doing that is using the ShardedUUIDGenerator. This generator generates keys that look like this:

  • 00010000602647468c2ef2f10ded039a
  • 000200006ba74626a564d147dc89f9ad
  • 00030000eb934532b828601979036e3c

The first four characters are reserved for the shard id.
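
In other words, given one of those ids, the owning shard can be determined without touching the database at all. A tiny illustrative sketch (the helper is mine, and I am assuming the prefix is hex encoded):

// using System.Globalization;
static int GetShardId(string shardedId)
{
    // "00010000602647468c2ef2f10ded039a" => shard id 1
    return int.Parse(shardedId.Substring(0, 4), NumberStyles.HexNumber);
}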

Next, we need to specify the configuration for each shard. We can do this normally, but we have to specify the shard id in the configuration.

cfg.SetProperty(ShardedEnvironment.ShardIdProperty, "1");

The shard id is an integer that is used to select the appropriate shard. It is also used to allow you to add new shards without breaking the ids of already existing shards.

Next, you need to implement a shard strategy factory:

image

This allows you to configure the shard strategy based on your needs. This is often where you would add custom behaviors. A shard strategy is composed of several components:

image

The Shard Selection Strategy is used solely to select the appropriate shard for new entities. If you shard your entities based on the user name, this is where you’ll implement that, by providing a shard selection strategy that is aware of it. One of the nice things about NH Shards is that it is aware of the graph as a whole, so if you have an association to a sharded entity, it knows that it needs to place you in the appropriate shard, without putting the burden on you.

For new objects, assuming that you haven’t provided your own shard selection strategy, NHibernate Shards will try to spread them evenly between the shards. The most common implementation is the Round Robin Load Balancer, which will give you a new shard for each new item that you save.
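
For example, if you wanted to shard by user name instead of round robin, the selection strategy is where that logic would live. A rough sketch only; I am assuming the .NET port keeps the Java names (IShardSelectionStrategy with a single SelectShardIdForNewObject method, and a ShardId type), and the User entity here is made up:

public class ShardByUserNameStrategy : IShardSelectionStrategy
{
    private readonly IList<ShardId> shardIds;

    public ShardByUserNameStrategy(IList<ShardId> shardIds)
    {
        this.shardIds = shardIds;
    }

    public ShardId SelectShardIdForNewObject(object obj)
    {
        var user = obj as User;
        if (user == null)
            return shardIds[0]; // everything else goes to the first shard

        // the same user name always lands on the same shard
        var index = (user.UserName.GetHashCode() & int.MaxValue) % shardIds.Count;
        return shardIds[index];
    }
}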

The Shard Resolution Strategy is quite simple, given an entity and the entity id, in which shard should we look for them?

image

image

If you are using a sharded id, such as the one that WeatherReport is using, NH Shards will know which shard to talk to automatically. But if you are using a non sharded id, you have to tell NHibernate how to figure out which shards to look at. By default, if you have a non sharded id, it will look at all the shards until it finds the entity.

The shard access strategy specifies how NHibernate Shards talks to the shards when it needs to talk to more than a single shard. NHibernate Shards can do it either sequentially or in parallel. Using parallel access strategy means that NHibernate will hit all your databases at the same time, potentially saving quite a bit of time for you.

The access strategy is also responsible for handling post processing the queries result, merging them and ordering them as needed.

Let us look at the code, okay? As you can see, this is a pretty standard usage of NHibernate.

using (ISession session = sessionFactory.OpenSession())
using (session.BeginTransaction())
{
    session.Save(new WeatherReport
    {
        Continent = "North America",
        Latitude = 25,
        Longitude = 30,
        ReportTime = DateTime.Now,
        Temperature = 44
    });

    session.Save(new WeatherReport
    {
        Continent = "Africa",
        Latitude = 44,
        Longitude = 99,
        ReportTime = DateTime.Now,
        Temperature = 31
    });

    session.Save(new WeatherReport
    {
        Continent = "Asia",
        Latitude = 13,
        Longitude = 12,
        ReportTime = DateTime.Now,
        Temperature = 104
    });
    session.Transaction.Commit();
}

Since we are using the defaults, each of those entities is going to go to a different shard. Here is the result:

image
Our data was saved into three different databases. And obviously we could have saved them to three different servers as well.

But saving the data is only part of things, what about querying? Well, let us look at the following query:

session.CreateCriteria(typeof(WeatherReport), "weather").List()

This query will give us:

image

Note that we have three different sessions here, each for its own database, each executing a single query. What is really interesting is that NHibernate will take all of those results and merge them together. It can even handle proper ordering across different databases.

Let us see the code:

var reports = session.CreateCriteria(typeof(WeatherReport), "weather")
    .Add(Restrictions.Gt("Temperature", 33))
    .AddOrder(Order.Asc("Continent"))
    .List();

foreach (WeatherReport report in reports)
{
    Console.WriteLine(report.Continent);
}

Which results in:

image

And in the following printed to the console:

Asia
North America

We got the proper ordering, as we specified in the query, but note that we aren’t handling the ordering in the database. Because we are hitting multiple sources, it is actually cheaper to do the ordering in memory, rather than get partially ordered data and then try to sort it.

Well, that is about it from the point of view of the capabilities.

One of the things that is holding NH Shards back right now is that only the core code paths have been implemented. A lot of the convenience methods are currently not implemented.

image

They are relatively low hanging fruit, and can be implemented without any deep knowledge of NHibernate or NHibernate Shards. Beyond that, the sharded HQL implementation is still not handling ordering properly, so if you care about ordering you can only query using ICriteria (at the moment).

It isn’t there yet, but it is much closer. You can get a working demo, and probably start working with this and implement things on the fly as you run into them. I strongly urge you to contribute the missing parts, at least the convenience methods, which should be pretty easy.

Please submit patches to our JIRA and discuss the topic at: http://groups.google.com/group/nhcdevs

What am I missing? MSMQ Perf Issue

I am getting some strange perf numbers from MSMQ, and I can’t quite figure out what is going on here.

The scenario is simple, I have a process reading from queue 1 and writing to queue 2. But performance isn’t anywhere near where I think it should be.

In my test scenario, I have queue 1 filled with 10,000 messages, each about 1.5 Kb in size. My test code does a no op move between the queues. Both queues are transactional.

Here is the code:

private static void CopyData()
{
    var q1 = new MessageQueue(@".\private$\test_queue1");
    var q2 = new MessageQueue(@".\private$\test_queue2");
    q2.Purge();
    var sp = Stopwatch.StartNew();
    while (true)
    {
        using (var msmqTx = new MessageQueueTransaction())
        {
            msmqTx.Begin();

            Message message;
            try
            {
                message = q1.Receive(TimeSpan.FromMilliseconds(0), msmqTx);
            }
            catch (MessageQueueException e)
            {
                Console.WriteLine(e);
                break;
            }

            q2.Send(message, msmqTx);

            msmqTx.Commit();
        }
    }
    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

Using this code, it takes 236.8 seconds to move 10,000 messages. If I use System.Transactions, instead of MSMQ’s internal transactions, I get comparable speeds.

Just to give you an idea, this is about 40 messages a second; this number is ridiculously low.
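
For completeness, the System.Transactions variant that gives me those comparable (equally slow) numbers looks roughly like this:

using (var tx = new TransactionScope())
{
    // let MSMQ enlist in the ambient System.Transactions transaction
    var message = q1.Receive(TimeSpan.FromMilliseconds(0), MessageQueueTransactionType.Automatic);
    q2.Send(message, MessageQueueTransactionType.Automatic);
    tx.Complete();
}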

Changing the code so each operation is a separate transaction, like this:

private static void CopyData()
{
    var q1 = new MessageQueue(@".\private$\test_queue1");
    var q2 = new MessageQueue(@".\private$\test_queue2");
    q2.Purge();
    var sp = Stopwatch.StartNew();
    while (true)
    {
        Message message;
        try
        {
            message = q1.Receive(TimeSpan.FromMilliseconds(0), MessageQueueTransactionType.Single);
        }
        catch (MessageQueueException e)
        {
            Console.WriteLine(e);
            break;
        }

        q2.Send(message, MessageQueueTransactionType.Single);
    }
    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This means that it takes 16.3 seconds, or about 600 messages per second, which is far closer to what I would expect.

This is on a quad core machine with 8 GB RAM (4 GB free), so I don’t think that it is the machine that is causing the problem. I can see similar results on other machines as well.

Am I missing something? Is there something in my code that is wrong?

Challenge: Write the check in comment

In my code base, I have the following commit, which doesn’t have a checkin comment. This is the before:

private Entry GetEntryForSlug(string slug)
{
    var entry = session.CreateCriteria(typeof(Entry))
        .Add(Restrictions.Like("Slug", slug))
        .UniqueResult<Entry>();

    if (entry != null)
        return entry;

    //try load default document
    return session.CreateCriteria(typeof(Entry))
        .Add(Restrictions.Like("Slug", slug + "/default"))
        .UniqueResult<Entry>();
}

And this is the after:

private Entry GetEntryForSlug(string slug)
{
    var entries = session.CreateCriteria(typeof(Entry))
        .Add(
            Restrictions.Like("Slug", slug) ||
            Restrictions.Like("Slug", slug + "/default")
        )
        .SetMaxResults(2)
        .List<Entry>();

    return entries
        .OrderBy(x => x.Slug.Length)
        .FirstOrDefault();
}

What was the silent committer thinking about when he made the change? Can you write the check in comment?

Life altering decisions

I recently had the chance to reminisce, and I ran into this post; this is me in 2004, talking about my logo:

ayende's logo

People frequently ask me, “Why Rhinos?” Here is the answer, from way back then:

I thought about using some sort of a symbol to create the logo, I toyed for a while with the Windings font, but found that lacking for my purposes, then I thought about using an animal's shadow as the watermark. I thought about using some sort of dragon or a wolf, but those are really banal. I spent some time thinking about it until I finally decided to use a Rhino, I just thought about it and I think that it's cool. Beside, I don't know anyone else that uses Rhinos.

The logo and the general orange theme of the blog were decisions that I just made; there was no deep thinking involved. My living room has a lot of orange in it right now, and I have a Rhino tattoo.

It is funny how such small a decision can have such a significant impact.

Can you imagine Wolf.Mocks?