Ayende @ Rahien

Hi!
My name is Ayende Rahien
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


Application analysis: Northwind.NET


For an article I am writing, I wanted to compare a RavenDB model to a relational model, and I stumbled upon the following Northwind.NET project.

I plugged in the Entity Framework Profiler and set out to watch what was going on. To be truthful, I expected it to be bad, but I honestly did not expect what I got. Here is a question: how many queries does it take to render the following screen?

image

The answer, believe it or not, is 17:

image

You might have noticed that most of the queries look quite similar, and indeed, they are. We are talking about 16(!) identical queries:

SELECT [Extent1].[ID]           AS [ID],
       [Extent1].[Name]         AS [Name],
       [Extent1].[Description]  AS [Description],
       [Extent1].[Picture]      AS [Picture],
       [Extent1].[RowTimeStamp] AS [RowTimeStamp]
FROM   [dbo].[Category] AS [Extent1]

Looking at the stack trace for one of those queries led me to:

image

And to this piece of code:

image

You might note that dynamic is used there; for what reason, I cannot even guess. Just to check, I added a ToArray() to the result of GetEntitySet, and the number of queries dropped from 17 to 2, which is far more reasonable. The problem was that we passed an IQueryable to the data binding engine, which ended up evaluating the query multiple times.
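The failure mode itself is not specific to EF. Here is a minimal Python sketch of it (the DeferredQuery class and bind_to_ui function are invented for illustration; they are not EF's or the data binding engine's actual API):

```python
# Toy model of the bug: a deferred "query" (like an IQueryable) re-executes
# every time something enumerates it.

class DeferredQuery:
    """Executes the underlying fetch each time it is iterated."""
    def __init__(self, fetch):
        self.fetch = fetch
        self.executions = 0

    def __iter__(self):
        self.executions += 1           # one round-trip to the database
        return iter(self.fetch())

def bind_to_ui(rows):
    # A binding engine may enumerate its source many times
    # (once per column, once for the count, and so on).
    for _ in range(16):
        list(rows)

categories = DeferredQuery(lambda: ["Beverages", "Condiments"])
bind_to_ui(categories)
print(categories.executions)           # 16 identical queries

materialized = list(categories)        # the ToArray() equivalent: run once
before = categories.executions
bind_to_ui(materialized)
print(categories.executions - before)  # 0 extra queries
```

Materializing once and binding the in-memory result is exactly what the added ToArray() call does.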

And EF Prof actually warns about that, too:

image

At any rate, I am afraid that this project suffers from similar issues all around; it is actually too bad to serve as the bad example that I intended it to be.

Expanding your horizons: Actions


In theory, there is no difference between theory and real life.

In my previous blog post, I discussed my belief that the best value you get from learning comes from learning the very basics of how our machines operate, from memory management in operating systems to the details of how network protocols like TCP/IP work.

Some of that has got to be theoretical study, actually reading about how those things work, but theory isn't enough. I don't care if you know the TCP spec by heart; if you haven't actually built a real system with it, and experienced the pain points, it isn't really meaningful. The best way to learn, at least in my experience, is to actually do something.

Because that teaches you several very interesting things:

  • What are the differences between the spec and what is actually implemented?
  • How do you resolve common (and not so common) problems?

The latter is probably the most important thing. I think that I learned most of what I know about HTTP in the process of building an RSS feed reader. I learned a lot about TCP from implementing a proxy system, and I did a lot of learning from a series of failed projects regarding distributed programming in general.

I learned a lot about file systems and how to work with file based storage from Practical File System Design and from building Rhino Queues and Rhino DHT. In retrospect, I did a lot of very different projects in various areas and technologies.

The best way that I know to get better is to do, to fail, and to learn from what didn’t work. I don’t know of any shortcuts, although I am familiar with plenty of ways of making the road much longer (and usually not very pleasant).

In short, if you want to get better, pick something that you don’t know how to do, and then do it. You might fail, you likely will, but you’ll learn a lot from failing.

I keep drawing a blank when people ask me to suggest options for things to try building, so I thought that I would ask the readers of this blog. What sort of things do you think would be useful to build? Things that would push most people out of their comfort zone and make them learn the fundamentals of how things work.

Expanding your horizons


One of the questions that I routinely get asked is “how do you learn?” And the answer that I keep giving is that I accidentally started learning things from the basic building blocks. I still count a C/C++ course that I took over a decade ago as one of the chief reasons why I have a good grounding in how computers actually operate. During that course, we had to do everything from building parts of the C standard library on our own to constructing much of the foundation of C++'s features in plain C.

That gave me enough understanding of how things are actually implemented to be able to grasp how things behave elsewhere. Digging deep into the implementation is almost never a wasted effort. And if you can't peel away the layers of abstraction, you can't really say that you know what you are doing.

For example, I count myself ignorant in all matters WCF, but I have full confidence that I could build a system using it. Not because I understand WCF itself, but because I understand the arena in which it plays. I don't need to really understand how a certain technology works if I already know the rules it has to play by.

Picking on WCF again: if you don't know firewalls and routers, you can't really build a WCF system, regardless of how good your memory is for the myriad ways of configuring WCF to do your will. If you can't use Wireshark to figure out why the system is slow to respond to requests, it doesn't matter if you can compose a WCF envelope message literally on the back of a real-world envelope. If you don't grok the Fallacies of Distributed Computing, you shouldn't be trying to build a real system where WCF is used, regardless of whatever certificate you have from Microsoft.

The interesting bit is that for most of what we do, the rules are fairly consistent. We all have to play in Turing’s sand box, after all.

What this means is that learning the details of IP and TCP will be worth it over and over again. Understanding things like memory fetch latencies will be relevant in five years and in ten. Knowing what actually goes on in the system, even at a somewhat abstracted level, is important. That is what makes you the master of the system, instead of its slave.

Some of the things that I especially value (and this is off the top of my head, not a closed list) are:

  • TCP / UDP – how do they actually work.
  • HTTP – and implications (for example, state management).
  • The Fallacies of Distributed Computing.
  • Disk based storage – how to work with it efficiently, and how file systems work.
  • Memory management in OS and your environment.

Obviously, this is a very short list, and again, it isn't comprehensive. It is just meant to give you some indication of things that I have found to be useful over and over and over again.

That kind of knowledge isn't something that is replaced often, and it will help you understand how anyone else has to interact with the same constraints. In fact, it often allows you to accurately guess how they solved a certain problem, because you are aware of the same alternatives that the other side had to choose from.

In short, if you seek to be a better developer, dig deep and learn the real basic building blocks for our profession.

In my next post, I’ll discuss strategies for doing that.

Transitive Replication in RavenDB


TLDR;

Replication topologies make my head hurt.

One of our customers had an interesting requirement, several months ago:

image

Basically, he wanted to write a document at node #1, and have it replicate, through node #2, to node #3. That was an easy enough change, and we did that. But then we got another issue from a different customer, who had the following topology:

image

And that client's problem was that when making a write to node #1, it would be replicated to nodes 2 – 4, each of which would then try to update the other two with the new replication information (skipping node #1, because it is the source). That would cause… issues, because they already had that document in place.

In order to resolve that, I added a configuration option, which controls whether the node that we replicate to should receive only documents that were modified on the current node, or whether we should include documents that were replicated to us from other nodes as well.

It is a relatively small change, code wise. Of course, documenting this, and all of the options that follow, is going to be a much bigger task, because now you have to make a distinction between replicating nodes, gateway nodes, etc.
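To see why the fully connected topology misbehaves, here is a toy Python model of the two modes (the class and option names are invented for this sketch; this is not RavenDB's actual replication code):

```python
# Toy model: a node either forwards only documents modified locally,
# or also forwards documents it received via replication (transitive mode).

class Node:
    def __init__(self, name, forward_replicated=False):
        self.name = name
        self.forward_replicated = forward_replicated
        self.peers = []
        self.docs = set()
        self.conflicting_updates = 0    # doc arrived when already present

    def write(self, doc):               # a local modification
        self.docs.add(doc)
        self._send(doc, skip=self)

    def receive(self, doc, sender):
        if doc in self.docs:
            self.conflicting_updates += 1
            return                      # already have it; don't re-forward
        self.docs.add(doc)
        if self.forward_replicated:     # transitive mode: pass it onward too
            self._send(doc, skip=sender)

    def _send(self, doc, skip):
        for peer in self.peers:
            if peer is not skip:
                peer.receive(doc, self)

def mesh(forward_replicated):
    """Four fully connected nodes; node #1 writes a document."""
    nodes = [Node(i, forward_replicated) for i in range(4)]
    for n in nodes:
        n.peers = [p for p in nodes if p is not n]
    nodes[0].write("users/1")
    return sum(n.conflicting_updates for n in nodes)

print(mesh(forward_replicated=False))   # 0: each node hears about the doc once
print(mesh(forward_replicated=True))    # redundant deliveries all around
```

In the fully connected mesh, source-only replication is enough: everyone hears about the write directly, and forwarding replicated documents only produces redundant updates. In the chain topology from the first customer, forwarding is exactly what you need, which is why it has to be a per-destination option.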

Mixing Integrated Authentication and Anonymous Authentication with PreAuthenticate = true doesn’t work


This StackOverflow question indicates that it is half a bug and half a feature, but it sure as hell looks like a bug to me.

Let us assume that we have a couple of endpoints in our application, called http://localhost:8080/secure and http://localhost:8080/public. As you can imagine, the secure endpoint is… well, secure, and requires authentication. The public endpoint does not.

We want to optimize the number of requests we make, so we specify PreAuthenticate = true; and that is where all hell breaks loose.

The problem is that it appears that when issuing a request with an entity body (in other words, a PUT / POST) with PreAuthenticate = true, the .NET framework will first issue that PUT / POST with an empty body to the server, presumably to get the 401 authentication challenge. At that point, if the endpoint it happened to reach is public, the empty request is accepted as a standard request and processing is attempted. The problem is that it has an empty body, so it has a very strong likelihood of failing.
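The sequence is easy to simulate outside of .NET. Here is a toy Python sketch of the handshake, assuming the simplified server behavior described above (none of this is the real HttpWebRequest machinery; the function names are made up):

```python
# With preauthentication on, the client first sends the PUT with an empty
# body, expecting a 401 challenge. A public endpoint never challenges,
# so it processes the empty request instead, and fails.

def server(path, body, authorization=None):
    if path == "/secure" and authorization is None:
        return 401, "challenge"            # the probe did its job
    if not body:
        return 500, "empty entity body"    # public endpoint chokes on the probe
    return 200, "ok"

def client_put(path, body, preauthenticate):
    if preauthenticate:
        status, _ = server(path, body=b"")  # empty-body probe
        if status != 401:
            return status                   # server already "handled" it
    return server(path, body, authorization="Negotiate ...")[0]

print(client_put("/secure", b"data", preauthenticate=True))   # 200
print(client_put("/public", b"data", preauthenticate=True))   # 500: the bug
print(client_put("/public", b"data", preauthenticate=False))  # 200
```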

This error cost me a day and a half or so. Here is the full repro:

static void Main()
{
    new Thread(Server)
    {
        IsBackground = true
    }.Start();

    Thread.Sleep(500); // let the server start

    bool secure = false;
    while (true)
    {
        secure = !secure;
        Console.Write("Sending: ");
        var str = new string('a', 621);
        var req = WebRequest.Create(secure ? "http://localhost:8080/secure" : "http://localhost:8080/public");
        req.Method = "PUT";

        var byteCount = Encoding.UTF8.GetByteCount(str);
        req.UseDefaultCredentials = true;
        req.Credentials = CredentialCache.DefaultCredentials;
        req.PreAuthenticate = true;
        req.ContentLength = byteCount;

        using(var stream = req.GetRequestStream())
        {
            var bytes = Encoding.UTF8.GetBytes(str);
            stream.Write(bytes, 0, bytes.Length);
            stream.Flush();
        }

        req.GetResponse().Close();

    }

}

And the server code:

public static void Server()
{
    var listener = new HttpListener();
    listener.Prefixes.Add("http://+:8080/");
    listener.AuthenticationSchemes = AuthenticationSchemes.IntegratedWindowsAuthentication | AuthenticationSchemes.Anonymous;
    listener.AuthenticationSchemeSelectorDelegate = request =>
    {

        return request.RawUrl.Contains("public") ? AuthenticationSchemes.Anonymous : AuthenticationSchemes.IntegratedWindowsAuthentication;
    };

    listener.Start();

    while (true)
    {
        var context = listener.GetContext();
        Console.WriteLine(context.User != null ? context.User.Identity.Name : "Anonymous");
        using(var reader = new StreamReader(context.Request.InputStream))
        {
            var readToEnd = reader.ReadToEnd();
            if(string.IsNullOrEmpty(readToEnd))
            {
                Console.WriteLine("WTF?!");
                Environment.Exit(1);
            }
        }

        context.Response.StatusCode = 200;
        context.Response.Close();
    }
}

If PreAuthenticate is set to false, everything works, but then we make twice as many requests. The annoying thing is that nothing bad would happen when trying to authenticate against a public endpoint, if only it sent the bloody entity body along as well.

This is quite annoying.

Stupid smart code: Solution


The reason that I said that this is very stupid code?

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

Because all of this lovely code can be replaced with a simple:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    req.ContentLength = Encoding.UTF8.GetByteCount(data);

    using (var dataStream = req.GetRequestStream())
    using(var writer = new StreamWriter(dataStream, Encoding.UTF8))
    {
        writer.Write(data);
        writer.Flush();
    }
}

And that is so much better.
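The same "let the runtime do the buffering" idea can be sketched in Python for contrast: a text wrapper over a byte stream does the incremental encoding for you, and produces exactly the same bytes as encoding the whole string at once:

```python
# A text wrapper encodes incrementally into the underlying byte stream,
# so there is no need for a hand-rolled char-buffer/encoder loop.
import io

data = "héllo " * 10_000                  # plenty of multi-byte characters

sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="utf-8", write_through=True)
writer.write(data)
writer.flush()

# Identical to encoding the whole string in one shot.
assert sink.getvalue() == data.encode("utf-8")
print(len(sink.getvalue()))               # 70000 bytes ("é" takes two)
```

The chunking, buffering, and encoder state are all handled by the writer, which is the whole point of the simpler version above.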

Stupid smart code


We had the following code:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteArray = Encoding.UTF8.GetBytes(data);

    req.ContentLength = byteArray.Length;

    using (var dataStream = req.GetRequestStream())
    {
        dataStream.Write(byteArray, 0, byteArray.Length);
        dataStream.Flush();
    }
}

And that is a problem, because it allocates the memory twice: once for the string, and once for the byte buffer. I changed that to this:

public static void WriteDataToRequest(HttpWebRequest req, string data)
{
    var byteCount = Encoding.UTF8.GetByteCount(data);
    req.ContentLength = byteCount;
    using (var dataStream = req.GetRequestStream())
    {
        if(byteCount <= 0x1000) // small size, just let the system allocate it
        {
            var bytes = Encoding.UTF8.GetBytes(data);
            dataStream.Write(bytes, 0, bytes.Length);
            dataStream.Flush();
            return;
        }

        var buffer = new byte[0x1000];
        var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1);
        var charBuffer = new char[maxCharsThatCanFitInBuffer];
        int start = 0;
        var encoder = Encoding.UTF8.GetEncoder();
        while (start < data.Length)
        {
            var charCount = Math.Min(charBuffer.Length, data.Length - start);

            data.CopyTo(start, charBuffer, 0, charCount);
            var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false);
            dataStream.Write(buffer, 0, bytes);
            start += charCount;
        }
        dataStream.Flush();
    }
}

And I was quite proud of myself.

Then I realized that I was stupid. Why?
