Ayende @ Rahien

Unnatural acts on source code

A bad test

image

This is a bad test, because what it does is ensuring that something does not works. I just finished implementing the session.Advaned.Defer support, and this test got my attention by failing the build.

Bad test, you should be telling me when I broke something, not when I added new functionality.

Tags:

Published at

Originally posted at

Comments (16)

Non overlapping time periods–because I like the pain of 2 AM wakeup calls

This post is partly in response for this post, discussing the Azure problem with leap year. But it is actually a bit more general than that.

In my code, here is how I define “one year from now”:

DateTime.Today.AddYears(1).AddDays(3);

As it turned out, this tend to have a lot of implications on your business, most of them are actually pretty good ones.

For a start, you will never get hit with a leap year bug, but more importantly, you are never going to have to deal with an immediate cutoff. This is important because it gives you time. Mostly, it gives you time to screw up, but having the time to do so without having an egg all of your face is a really nice thing.

For example, all of our subscriptions are using a similar method of calculation, and this is why we can take the ordering system down for a few hours or even a day or two and no one will actually notice. We have a big grace period in which we can work things out.

Sure, a user gets 3 “extra” days for free out of this, but frankly, I don’t give a damn. It is more important that I get the buffer, and most users like it much better when you don’t slam the doors in their faces on the first chance.

One of the things that is important is style, and giving a grace period for those sort of things is crucial.

Tags:

Published at

Originally posted at

Comments (5)

Rotten Scheduling: Don’t roll your own

“We need to run a specific task every 72 hours, I thought about this approach…”

public class TimedTask
{
  public static Timer Timer;
  
  public static void Init()
  {
    Timer = new Timer(ExecuteEvery72Hours, null, TimeSpan.FromHours(72), TimeSpan.FromHours(72));
  }
  
  public static void ExecuteEvery72Hours()
  {
    // do something important
  }
}

This is a bloody rotten idea, let us see why…

  • What happens if your application is recycled every 29 hours?
  • What happens if your application is always on, but during that 72 hour call, it was offline?
  • What happens if your task actually takes more than 72 hours to run?
  • What happens if the task fails?
  • How do you report errors, warnings, etc?

Scheduling is a hard problem. There are a lot of things that you actually need to consider. And the code above is really considering none of them. I would be very surprised if something like that ever run. in production. It most certainly can’t be made to run reliably.

Things that run every X time, where X is a long time (hours / days) tend to be pretty important. In some of the systems that we wrote, that include doing things like updating VAT and interest rates, pulling from external source, generating the weekly report, etc.

You do not want this to be messed up.

If you need to do anything like that, make use of the builtin scheduling features of the OS you are running on (Windows Task Scheduler is an amazingly full featured, and cron isn’t bad if you are running on Linux). If you still insist on doing this in code, at the very least do something like this:

public class TimedTask
{
  public static Timer Timer;
  
  public static void Init()
  {
    Timer = new Timer(()=>
    {
      if( (DateTime.UtcNow - GetLastExecutedTime()) > 72)
        ExecuteEvery72Hours();
    }, null, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));
  }
  
  public static void ExecuteEvery72Hours()
  {
    // do something important
  }
}

This still has a lot of problems, but at least it it solving some crucial problems for you (note that GetLastExecutedTime has to be a persisted value).

Of course, if you need something like this in your code, you have better use something like Quartz, instead. Don’t roll your own. Sure, this is ten lines of code to do so, but as I said, this is the very basics, and it gets complex really fast.

Tags:

Published at

Originally posted at

Comments (19)

Rotten Scheduling

“We need to run a specific task every 72 hours, I thought about this approach…”

public class TimedTask
{
  public static Timer Timer;
  
  public static void Init()
  {
    Timer = new Timer(ExecuteEvery72Hours, null, TimeSpan.FromHours(72), TimeSpan.FromHours(72));
  }
  
  public static void ExecuteEvery72Hours()
  {
    // do something important
  }
}

Let us assume that this code is being called properly. Why is this a bloody rotten idea?

Tags:

Published at

Originally posted at

Comments (28)

The best multi threading debugging tool is Microsoft Excel

In any system that gets to a certain size, especially one that is multi threaded, there are a certain class of bugs that are really a bitch to figure out.

They are usually boil down to some sort of a race condition. I have been doing that recently, trying to fix a bunch of race conditions in RavenDB. The problem with race conditions is that they are incredibly hard to reproduce, and even when you can reproduce them, you can’t really debug them.

I came up with the following test harness to try to narrow things down. Basically, create a test case that sometimes fails, and run it a thousand times.

Odds are that you’ll get the failure, but that isn’t the important bit. The important bit is that you setup logging properly. In my case, I set things up so the logging output to CSV and each test run had a different file.

This gives me output that looks like this:

23:54:34.5952,Raven.Database.Server.HttpServer,Debug,Request #  10: GET     -    12 ms - <default>  - 200 - /indexes/dynamic/Companies?query=&start=0&pageSize=1&aggregation=None,,11
23:54:34.5952,Raven.Client.Document.SessionOperations.QueryOperation,Debug,"Stale query results on non stale query '' on index 'dynamic/Companies' in 'http://reduction:8079/', query will be retried, index etag is: bcf0fed1-d975-43b4-bfb7-65221ef06b99",,1
23:54:34.5952,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 10, for: TasksExecuter, will wait for additional work",,6
23:54:34.5952,Raven.Database.Indexing.WorkContext,Debug,Incremented work counter to 11 because: WORK BY IndexingExecuter,,12
23:54:34.5952,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 11, for: TasksExecuter, will wait for additional work",,6
23:54:34.5952,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 11, for: ReducingExecuter, will wait for additional work",,21
23:54:34.5952,Raven.Database.Indexing.WorkContext,Debug,"No work was found, workerWorkCounter: 11, for: IndexingExecuter, will wait for additional work",,12
23:54:34.6952,Raven.Client.Document.SessionOperations.QueryOperation,Debug,Executing query '' on index 'dynamic/Companies' in 'http://reduction:8079/',,1
23:54:34.8172,Raven.Database.Indexing.Index.Querying,Debug,Issuing query on index Temp/Companies for all documents,,11
23:54:34.8172,Raven.Storage.Esent.StorageActions.DocumentStorageActions,Debug,Document with key 'companies/1' was found,,11

Not really helpful in figure out what is going on, right? Except for one tiny thing, we load them to Excel, and we use conditional formatting to get things to look like this:

image

The reason this is so helpful? You can actually see the threads interleaving. This usually help me get to roughly the right place, and then I can add additional logging so I can figure out better what is actually going on.

I was able to detect and fix several race conditions using this approach.

And yes, I know that this is basically printf() debugging. But at least it is printf() debugging with pretty colors.

Tags:

Published at

Originally posted at

Comments (11)

Thou shall not delete

Ouch!

public ActionResult DeleteComment(int id)
{
  var userComment = RavenSession.Load<UserComment>(id);

  if (userComment == null)
    return new HttpStatusCodeResult(204);

  var user = RavenSession.GetUser(User.Identity.Name);
  if(user == null || (user.Role != UserRole.Moderator && user.Role != UserRole.Admin))
    return new HttpStatusCodeResult(403, "You must be logged in as moderator or admin to be able to delete comments");

  RavenSession.Delete(user);

  return new HttpStatusCodeResult(204);
}
Tags:

Published at

Originally posted at

Comments (34)

What is wrong here? Solution

In my previous post, I showed the database schema and the UI and asked what was wrong with that. Before we move on, here is what I showed.

image

image

If you looked carefully, you might have noticed that there are duplicate PO# and Tracking# in the UI. More than that, we somehow double charged the customer for shipping.

What is going on?

It is actually fairly obvious, when you think about it. Look at the schema, there isn’t actually any association between the Tracking # and the PO #. In most orders, we have only 1 PO #, so it was easy to just add this information by just pulling it from the DB and adding a few columns. But when we got an order that has multiple POs… that is when all hell breaks lose.

This is a classic Cartesian Product problem.

The solution?  Actually model the UI to avoid suggesting that there is a relationship between the tracking and purchase orders, like this:

image

Tags:

Published at

Originally posted at

Comments (11)

What is wrong here?

This case, it isn’t code that I am going to show, rather, I am going to show the final UI and the database structure, and let you figure:

  • What is wrong.
  • How to fix this.

Here is the database schema:

image

And here is the problem:

image

Hints, this used to work for a long time, and suddenly it doesn’t, and the customer is pissed, annoyed and threatening to sue.

Tags:

Published at

Originally posted at

Comments (38)

Ask Ayende: What about the QA env?

Matthew Bonig asks, with regards to a bug in RavenDB MVC Integration (RavenDB Profiler) that caused major slow down on this blog.:

I'd be very curious to know how this code got published to a production environment without getting caught. I would have thought this problem would have occurred in any testing environment as well as it did here. Ayende, can you comment on where the process broke down and how such an obvious bug was able to slip through?

Well, the answer for that comes in two parts. The first part  is that no process broke down. We use our own assets for final testing of all our software, that means that whenever there is a stable RavenDB release pending (and sometimes just when we feel like it) we move our infrastructure to the latest and greatest.

Why?

Because as hard as you try testing, you will never be able to catch everything. Production is the final test ground, and we have obvious incentives of trying to make sure that everything works.  It is dogfooding, basically. Except that if we get a lemon, that is a very public one.

It means that whenever we make a stable release, we can do that with high degree of confidence that everything is going to work, not just because all the tests are passing, but because our production systems had days to actually see if things are right.

The second part of this answer is that this is neither an obvious bug nor one that is easy to catch. Put simply, things worked. There wasn’t even an infinite loop that would make it obvious that something is wrong, it is just that there was a lot of network traffic that you would notice only if you either had a tracer running, or were trying to figure out why the browser was suddenly so busy.

Here is a challenge, try to devise some form of an automated test that would catch something like this error, but do so without actually testing for this specific issue. After all, it is unlikely that someone would have written a test for this unless they run into the error in the first place. So I would be really interested in seeing what sort of automated approaches would have caught that.

Tags:

Published at

Originally posted at

Comments (19)

Bug Hunt: What made this blog slow?

A while ago the blog start taking 100% CPU on the client machines. Obviously we were doing something very wrong there, but what exactly was it?

We tracked down the problem to the following code:

image_thumb

image_thumb[1]

As you can probably guess, the problem is that we have what is effective an infinite loop. On any Ajax request, we will generate a new Ajax request. And that applies to our own requests as well.

The fix was pretty obvious when we figured out what was going on, but until then…

image

Tags:

Published at

Originally posted at

Comments (14)

Bug Hunt: What made this blog slow?

A while ago the blog start taking 100% CPU on the client machines. Obviously we were doing something very wrong there, but what exactly was it?

We track down the problem to the following code, can you figure out what the problem?

image

image

Tags:

Published at

Originally posted at

Comments (11)

What is up with all those 404?

Recently I had a spate of posts showing up, and then reporting 404. The actual reason behind that was that the server and the database disagreed with one another with regards to the timezone of the post (one was using UTC, the other local). Sorry for the trouble, and this shouldn’t happen any longer.

Tags:

Published at

Originally posted at

Comments (8)

Mixing Integrated Authentication and Anonymous Authentication with PreAuthenticated = true doesn’t work

This StackOverflow question indicate that it is half a bug and half a feature, but that it sure as hell looks like a bug to me.

Let us assume that we have a couple of endpoints in our application, called http://localhost:8080/secure and http://localhost:8080/public. As you can imagine, the secure endpoint is… well, secure, and requires authentication. The public endpoint does not.

We want to optimize the number of request we make, so we specify PreAuthenticated = true; And that is where all hell break lose.

The problem is that it appears that when using request with entity body (in other words, PUT / POST) with PreAuthenticate = true, the .NET framework will issue a PUT / POST request with empty body to the server. Presumably to get the 401 authentication information. At that point, if the endpoint that it happened to have reached is public, it will be accepted as a standard request, and processing will be tried. The problem here is that it has an empty body, so that has a very strong likelihood of failing.

This error cost me a day and a half or so. Here is the full repro:

static void Main()
{
    new Thread(Server)
    {
        IsBackground = true
    }.Start();

    Thread.Sleep(500); // let the server start

    bool secure = false;
    while (true)
    {
        secure = !secure;
        Console.Write("Sending: ");
        var str = new string('a', 621);
        var req = WebRequest.Create(secure ? "http://localhost:8080/secure" : "http://localhost:8080/public");
        req.Method = "PUT";

        var byteCount = Encoding.UTF8.GetByteCount(str);
        req.UseDefaultCredentials = true;
        req.Credentials = CredentialCache.DefaultCredentials;
        req.PreAuthenticate = true;
        req.ContentLength = byteCount;

        using(var stream = req.GetRequestStream())
        {
            var bytes = Encoding.UTF8.GetBytes(str);
            stream.Write(bytes, 0, bytes.Length);
            stream.Flush();
        }

        req.GetResponse().Close();

    }

}

And the server code:

public static void Server()
{
    var listener = new HttpListener();
    listener.Prefixes.Add("http://+:8080/");
    listener.AuthenticationSchemes = AuthenticationSchemes.IntegratedWindowsAuthentication | AuthenticationSchemes.Anonymous;
    listener.AuthenticationSchemeSelectorDelegate = request =>
    {

        return request.RawUrl.Contains("public") ? AuthenticationSchemes.Anonymous : AuthenticationSchemes.IntegratedWindowsAuthentication;
    };

    listener.Start();

    while (true)
    {
        var context = listener.GetContext();
        Console.WriteLine(context.User != null ? context.User.Identity.Name : "Anonymous");
        using(var reader = new StreamReader(context.Request.InputStream))
        {
            var readToEnd = reader.ReadToEnd();
            if(string.IsNullOrEmpty(readToEnd))
            {
                Console.WriteLine("WTF?!");
                Environment.Exit(1);
            }
        }

        context.Response.StatusCode = 200;
        context.Response.Close();
    }
}

If we remove pre authenticate is set to false, everything works, but then we have twice as many requests. The annoying thing is that if it would be trying to authenticate to a public endpoint, nothing would happen, if it were sending the bloody entity body along as well.

This is quite annoying.

Tags:

Published at

Originally posted at

Comments (13)

Dynamic resolution rules joy

In the following code, what do you believe the output should be?

class Program
{
    static void Main(string[] args)
    {
        dynamic stupid = new Stupid{Age = 3};

        Console.WriteLine(stupid.Age);
    }
}

public class Stupid : DynamicObject
{
    public int Age { get; set; }

    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        result = 1;
        return true;
    }
}

Argh!

The default is to use the actual type members, and then fall back to the dynamic behavior. Whereas I would expect it to first check the dynamic behavior and then fall back to the actual members if it can’t find dynamic stuff.

Quite annoying.

Tags:

Published at

Originally posted at

Comments (24)

What is wrong with this code?

Calling the Compare method will (sometimes) crash your entire application in a very rude fashion, why?

static bool Compare(byte[] b1, byte[] b2)
  {
    IntPtr retval = memcmp(b1, b2, new IntPtr(b1.Length));
    return retval == IntPtr.Zero;
  }
 
  [DllImport("msvcrt.dll")]
  static extern IntPtr memcmp(byte[] b1, byte[] b2, IntPtr count);

In fact, there are actually two different problems that I can see here. The easy one would consistently crash, the hard one would usually pass, but sometime crash you in an apparently random ways.

Tags:

Published at

Originally posted at

Comments (14)

Twitter doesn’t like my identity

Recently I noticed that a very important feature in twitter website is broken for me. I can’t follow the entire conversation any longer, which drive me crazy.

image

It didn’t take me long to figure out what the actual issue is:

image

It appears that even though I am logged in, some parts of twitter doesn’t like this.

I then thought that this has something to do with chrome, so I used the incognito mode to try it:

image

It works!

I logged out and in again, nada. I cleared all browser history and restarted chrome, nada.

I uninstalled chrome (selecting delete all browsing history) and re-installed, and now it is working.

No idea why, and I don’t like the heavy handed approach, but at least now I got my twitter tracking back.

Tags:

Published at

Originally posted at

Comments (4)

Debugging Security Exceptions

One of the horrible things about SecurityException is just how little information you are given.

For example, let us take a look at:

image

Yes, that is informative. The real problem is that most security analysis is done at the method level, which means that _something_ in this method caused this problem. Which meant that in order to debug this, I have to change my code to be like this:

image

Which gives me this:

image

Just to point out, the first, second, etc are purely artificial method names, meant just to give me some idea about the problem areas for this purpose only.

Then we need to go into the code in the First method and figure out what the problem is there, and so on, and so forth.

Annoying, and I wish that I knew of a better way.

Tags:

Published at

Originally posted at

Comments (12)

High memory usage with WCF Discovery

Yes, I know that this is basically saying that select is broken, but I am seeing some very strange stuff here. The code in question is this:

for (int i = 0; i < 15; i++)
{
    var discoveryClient = new DiscoveryClient(new UdpDiscoveryEndpoint());
    var findCriteria = new FindCriteria(typeof(IDiscoverableService))
    {
        Duration = TimeSpan.FromSeconds(1)
    };
    discoveryClient.Find(findCriteria);
    discoveryClient.Close();
}

The full repro can be found here.

The problem is, put simply, that before this code, the working set is:  Working Set: 57,640 kb, after this have executed, it is 90,288 kb.

I went over the code with a fine tooth comb, but I don’t really see where all of this memory have gone.

The actual memory reported by GC.GetTotalMemory does go down after the cleanup, so I guess it could be that .NET isn’t releasing the memory back to the OS, but it still worries me somewhat.

I tried it with higher numbers for the loop, and it seems like eventually it settles down on some number and doesn’t grow from there. My main problem is what happens when you start doing async stuff. Let us take a look here:

int count = 150;
var countdown = new CountdownEvent(count);

for (int i = 0; i < count; i++)
{
    var discoveryClient = new DiscoveryClient(new UdpDiscoveryEndpoint());
    var findCriteria = new FindCriteria(typeof(IDiscoverableService))
    {
    };
    discoveryClient.FindProgressChanged += (sender, eventArgs) => {  }; // do nothing
    discoveryClient.FindCompleted += (sender, eventArgs) =>
    {
        countdown.AddCount();
        discoveryClient.Close();
        PrintMemory("Complete: ");
    };
    discoveryClient.FindAsync(findCriteria);
}

countdown.Wait();

At peek, I am seeing 600 MB (!) of memory used. Note that we are now using the default duration of 20 seconds, and it seems that DiscoveryClient is very heavy on memory.

If you know how, can you take a look at the code and tell me that I am crazy?

Tags:

Published at

Originally posted at

Comments (25)

Where are your debug hooks?

One of the most important things that I learned about working with software is that you have to design, from the get go, the debug ability of the system. This is important, because by default, here is how the your system reports its state:

To make things worse, in many cases, we are talking about things that are really hard to figure out. With Rhino Service Bus, I had started, from day one, to add a lot of debug logs, we had the idea of the logging endpoint, error messages went to the error queue along with the error that they generated, etc.

With NHibernate, it took me three days after starting to build NH Prof to figure out that NH Prof was a great tool to use when working on NHibernate itself.  With NH Prof, the actual file format that we used is an event stream, which is logged locally. Which means that the process of opening a saved file and listening to a running application is the same. That means that you can send me you File > Save output, and I can simulate locally the exact conditions that led to whatever problems you had.

As software becomes more and more complex, threading is involved, complex scenarios are introduced, multiple actors involved and more and more data is streaming in. The old patterns of debugging through a problem become less and less relevant. To put it simply, it is less and less possible to easily reconstruct a problem in most scenarios, at least, it is very hard to do from the information given.

Last week I started to add support for profiling RavenDB. The intent was mostly to allow us to build better integration with web applications, but during the process of investigating a very nasty bug, I figured out that I could use this profiling information to figure out what is going on. I did, and seeing the information, it was immediately obvious what was wrong.

I decided that I might was well go the whole hog there and introduced this:

image

No, it isn’t RavenDB Profiler (not yet), but it is a debug visualizer that you can use to look at the raw requests. It is a bit more than just a Fiddler clone, too, since it has awareness of things like RavenDB sessions, caching, aggressive caching, etc.

In the short time that this API is available, it has already help resolve at least two separate (and very hard to figure out) bugs.

I learned two things from the experience:

  • Having the debug hooks is critically important, because it means that you can access that information at need.
  • Just having the debug hooks is not enough, you have to provide a good way for them to be used.

In this case, it is a debugger visualizer, but we will have additional ways to expose this information to the users.

Tags:

Published at

Originally posted at

Comments (10)

More auth issues: 0xc000006d on Windows 2008 R2

Next on the schedule of problems we had an issue with again, with everything working fine remotely and not working when running locally. This time, it didn’t take as long to figure out what was going on, the root of the problem was “Account failed to log on (0xc000006d)”, and the explanation for that is here:

http://code-journey.com/2009/account-failed-to-log-on-0xc000006d-unable-to-load-website-from-local-server/

The basic idea is that you can’t authenticate against the local server using anything except the local server name.

And it still didn’t work.

The server is Windows 2008 R2, and the documentation said that I had to set a registry key at: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0

That did not work, but what did work was setting the DisableLoopbackCheck at: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa

No idea why, and if there is a problem at the documentation for Windows 2008 R2, but this is how I got it working on my end. I would probably like it more if I was able to use the recommended solution, and not the solution explicitly not recommended.

Any ideas?

Published at

Originally posted at

Comments (7)

The evil tricks of “It Works On My Machine”, in reverse

The following is quite annoying. I am trying to make an authenticated request to a web server. In this instance, we are talking about communication from one IIS application to another IIS application (the second application is RavenDB hosted inside IIS).

The part that drives me CRAZY is that I am getting this error when I am trying to make an authenticated request, but using the exact same code and credentials, from another machine, I can make this work just fine.

image

It took me a while to figure out that most important part, when running locally, I was running under my own account, when running remotely, the application run under IIS account. The really annoying part was that even when running on IIS locally, it still worked locally and failed remotely.

It took me even longer to figure out that the local IIS was configure to run under my own account as well Sad smile

Once that discovery was made, I was able to figure out what is wrong and implement a work around (run the remote IIS site under a custom user name). I know that the actual problem is something relating to permissions on certificate store, but I have no idea how, or, for that matter, which certificate?

This is plain old HTTP auth, no SSL, client certs, etc. I am assuming that this is raised because we are using Windows Auth, but I am not sure what is going on here with regards to which certificate should I grant to which user, or even how to do this.

Any ideas?

I would like to busy too, but…

Originally posted at 4/13/2011

Yes, I know, you popped this out in the middle of a debug session, after VS didn’t respond to any user input for several minutes.

image

Urgh!

image

AYENDEPC is the local machine! You can’t lose a network connection to yourself.

Oh well, we already know that Visual Studio is schizophrenic.

The best bug reports are pull requests

I just got the following bug report, I’ll just let you read it, and I have additional commentary below:

image

The great fun with getting a pull request with a failing test is that the whole process of working with this is pretty seamless.

For the one above, GitHub told me that I need to run the following command:

git pull https://github.com/benjamingram/ravendb.git DynamicFieldsBug

I did, and got the failing test, from there it was just a matter of fixing the actual bug, which was rather simple, and nothing that even smelled like ceremony.