Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


Tri state waiting with async tcp streams

time to read 4 min | 714 words

We recently had the need to develop a feature that requires a client to hold a connection to the server and listen to a certain event. Imagine that we are talking about a new document arriving at the database.

This led to a very simple design:

  • Open a TCP connection and let the server know about which IDs you care about.
  • Wait for any of those IDs to change.
  • Let the client know about it.

Effectively, it was:

Unfortunately, this simple design didn’t quite work. As it turns out, having a dedicated TCP connection per id is very expensive, so we would like to be able to use a single TCP connection in order to watch multiple documents. And we don’t know about all of them upfront, so we need to find a way to talk to the server throughout the process. Another issue that we have is the problem of steady state. If none of the documents we care about actually change for a long time, there is nothing going on over the network. This is going to lead the TCP connection to fail with a timeout.

Actually, a TCP connection that passes no packets is something that is perfectly fine in theory, but the problem is that it requires resources that need to be maintained. As long as you have systems that are not busy, it will probably be fine, but the moment that it reaches the real world, the proxies / firewalls / network appliances along the way use a very brutal policy, “if I’m not seeing packets, I’m not maintaining the connection”, and it will be dropped, usually without even a RST packet. That makes debugging this sort of thing interesting.

So our new requirements are:

  • Continue to accept IDs to watch throughout the lifetime of the connection.
  • Let the client know of any changes.
  • Make sure that we send the occasional heartbeat to keep the TCP connection alive.

This is much more fun to write, to be honest, and the code we ended up with was pretty. Here it is:

There are a couple of things to note here. We are starting an async read operation from the TCP stream without waiting for it to complete, and then we go into a loop and wait for one of three options.

  1. A document that we watched has been changed (we are notified about that by the documentChanged task completion), in which case we notify the user. Note that we first replace the documentChanged task and then we drain the queue of all pending document changes for this collection, after which we’ll go back to waiting. On the doc changed event, we first enqueue the document that was changed, and then complete the task. This ensures that we won’t miss anything.
  2. New data is available from the client. In this case we read it and add it to the IDs we are watching, while starting another async read operation (for which we’ll wait on the next loop iteration). I’m creating a new instance of the IDs collection here to avoid threading issues, and also because the number of items is likely to be very small and rarely change. If there were a lot of changes, I would probably go with a concurrent data structure, but I don’t think it is warranted at this time.
  3. Simple timeout.

Then, based on which task has been completed, we select the appropriate behavior (send message to client, accept new doc ID to watch, send heartbeat, etc).

The nice thing about this code is that errors are also handled quite nicely. If the client disconnects, we will get an error from the read, and know that it happened and exit gracefully (well, we might be getting that just when we are writing data to the client, but that is pretty much the same thing in terms of our error handling).

The really nice thing about this code is that for the common cases, where there isn’t actually anything for us to do except maintain the TCP connection, this code is almost never in a runnable state, and we can support a very large number of clients with very few resources.
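
The actual snippet isn't reproduced here, so below is a minimal sketch of the shape of that loop, written for illustration only. The helper names (ReadNextIdsAsync, NotifyClientAsync, SendHeartbeatAsync) and the five second heartbeat interval are assumptions, not the real RavenDB code.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public class ChangesConnection
{
    private TaskCompletionSource<object> _documentChanged = new TaskCompletionSource<object>();
    private readonly ConcurrentQueue<string> _changedDocs = new ConcurrentQueue<string>();

    // Notification side: enqueue first, then complete the task, which is the
    // ordering described above so that no change is missed.
    public void OnDocumentChanged(string docId)
    {
        _changedDocs.Enqueue(docId);
        _documentChanged.TrySetResult(null);
    }

    public async Task WatchAsync(Stream stream, CancellationToken token)
    {
        var idsToWatch = new HashSet<string>();
        var readTask = ReadNextIdsAsync(stream); // async read started, not awaited yet

        while (token.IsCancellationRequested == false)
        {
            var heartbeatTimeout = Task.Delay(TimeSpan.FromSeconds(5));
            var completed = await Task.WhenAny(_documentChanged.Task, readTask, heartbeatTimeout);

            if (completed == _documentChanged.Task)
            {
                // replace the task first, then drain the queue, so nothing is missed
                _documentChanged = new TaskCompletionSource<object>();
                string docId;
                while (_changedDocs.TryDequeue(out docId))
                    await NotifyClientAsync(stream, docId);
            }
            else if (completed == readTask)
            {
                // new ids from the client; build a new set instead of mutating a shared one
                idsToWatch = new HashSet<string>(idsToWatch.Concat(readTask.Result));
                readTask = ReadNextIdsAsync(stream); // awaited on a later loop iteration
            }
            else
            {
                await SendHeartbeatAsync(stream); // steady state: keep the TCP connection alive
            }
        }
    }

    // Hypothetical helpers, assumed for the sketch; the real code reads and
    // writes a framed protocol over the TCP stream.
    private Task<List<string>> ReadNextIdsAsync(Stream stream) => Task.FromResult(new List<string>());
    private Task NotifyClientAsync(Stream stream, string docId) => Task.CompletedTask;
    private Task SendHeartbeatAsync(Stream stream) => Task.CompletedTask;
}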

Code review: The bounded queue

time to read 1 min | 64 words

The following code has just been written (never run, never tested).

Its purpose is to serve as a high speed, no-locking transport between two threads, one of them producing information, the other consuming it, in a bounded, non-blocking manner.

Note that this is done because the default usage of BlockingCollection<T> here generated roughly 80% of the load, which is not ideal.
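
The queue code itself isn't included above, but as a rough sketch of the kind of structure being described, a bounded, lock-free, single-producer / single-consumer ring buffer could look something like this. This is my illustration of the idea, not the code under review.

// Bounded, non-blocking, single-producer / single-consumer queue sketch.
// One slot is kept empty so that "full" and "empty" can be told apart.
public class BoundedSpScQueue<T>
{
    private readonly T[] _buffer;
    private volatile int _head; // advanced only by the consumer
    private volatile int _tail; // advanced only by the producer

    public BoundedSpScQueue(int capacity)
    {
        _buffer = new T[capacity + 1];
    }

    public bool TryEnqueue(T item)
    {
        int next = (_tail + 1) % _buffer.Length;
        if (next == _head)
            return false; // full; the producer decides what to do, we never block
        _buffer[_tail] = item;
        _tail = next; // volatile write publishes the item to the consumer
        return true;
    }

    public bool TryDequeue(out T item)
    {
        if (_head == _tail)
        {
            item = default(T);
            return false; // empty
        }
        item = _buffer[_head];
        _head = (_head + 1) % _buffer.Length;
        return true;
    }
}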

re: Why you can't be a good .NET developer

time to read 3 min | 526 words

This post is in reply to Rob’s post, go ahead and read it, I’ll wait.

My answer to Rob’s post can be summarized in a single word:

In particular, this statement:

it is impossible to be a good .NET developer. To work in a development shop with a team is to continually cater for the lowest common denominator of that team and the vast majority of software shops using .NET have a whole lot of lowest common denominator to choose their bad development decisions for.

Um, nope. That only applies to places that are going for the lowest common denominator. To go from there to all .NET shops is quite misleading. I’ll give our own example, of building a high performance database in managed code, which has very little lowest common denominator anything anywhere, but that would actually be too easy.

Looking at the landscape, I can see quite a lot of people doing quite a lot of interesting things at the bleeding edge. Now, it may be that this blog is a self selecting crowd, but when you issue statements such as “you can’t be a good .NET developer”, that is a pretty big statement to stand behind.

Personally, I think that I’m a pretty good developer, and while I dislike the term “XYZ developer”, I do 99% of my coding in C#.

Now, some shops have different metrics; they care about predictability of results, so they will be extremely conservative in their tooling and language usage, the old “they can’t handle that, so we can’t use it” approach. This has nothing to do with the platform you are using, and everything to do with the culture of the company you are at.

I can certainly find good reasons for that behavior, by the way. When your typical product lifespan is measured in 5 – 10 years, you have a very different mindset than if you are aiming at most a year or two out. Making decisions on brand new stuff is dangerous; we lost a lot when we decided to use Silverlight, for example. And the decision to go with CoreCLR for RavenDB was made with an explicit backoff strategy in case that sank as well.

Looking at the kind of directions that people leave .NET for, it traditionally has been to the green, green hills of Rails, then it was Node.JS, now I think it is Elixir, although I’m not really paying attention. That means that in the time a .NET developer (assuming that they are investing in themselves and continuously learning) has invested in their platform and learned a lot about how to make it work properly, the person who left for greener pastures has had to learn multiple new frameworks and platforms. If you think that this doesn’t have an impact on productivity, you are kidding yourself.

The reason you see backlash against certain changes (project.json coming, going and then doing disappearing acts worthy of Houdini) is that there is value in all of that experience.

Sure, sometimes change is worth it, but it needs to be measured against its costs. And sometimes those are non-trivial.

Dazed and confused: The state of the Core

time to read 5 min | 886 words

It feels like I took my eyes off the ball for a second, and suddenly I'm in an entirely new game.

We have been working with CoreCLR (the RC1 release from last year), and have been grateful to be able to run our software oh so easily on both Windows and Linux (heaven and earth difference from trying to run on Mono).

However, stuff changed. I know about the delay in shipping CoreCLR (and I wholeheartedly approve of it, we need something stable), and I have some rough idea of where the general direction is. But what is the current state? No idea.

Googling this finds me fragments of information, none of it at all useful.

The decision to go with .NET Standard instead of the bazillion individual packages is something that I applaud. It should make our life easier. Except that when I tried it, I ran into a whole bunch of issues that caused me to stop trying for a time.

Basically, what I need to know is whether there is currently a way to get CoreCLR (dotnet cli, whatever the current term is) on the latest framework / libraries that would actually make sense.

In this context, make sense means that I get to do things like use Visual Studio normally (more or less) to build / debug my application. I'm fine with quite a lot of manual / scripts work, but I really like being able to debug. The nice thing about the RC1 bits is that it mostly works, you can debug it, profilers work, the whole thing. It is different than running on the standard framework, because some things don't work, but that is something that we can work with. (In particular, no easy way to get pdbs for stuff like profiling is really annoying).

When I read Hanselman's post about the dotnet cli, I was quite excited. It made sense. But now I feel that instead of having all the usual power that I'm used to, plus all the convenience of just opening a shell and starting to pound out code, it is the other way around. It is somewhat like being back working in perl. If stuff works, great, but if it doesn't, you might as well just start from scratch.

I'm talking about the build system, the dependencies, "how do I run this?", "how do I debug this?" questions.

What makes this worse is that there doesn't appear to be a roadmap that we can peek at (not that I could find). I can see some discussions in chats and in issues, but there is no real "this is where we are going" sense.

To explain some of my frustration, let me take my simplest project, which currently contains the following details:

  "frameworks": {
    "dnxcore50": {
      "dependencies": {
        "Microsoft.CSharp": "4.0.1-beta-23516",
        "System.Collections.Concurrent": "4.0.11-beta-23516",
        "System.Diagnostics.Debug": "4.0.11-beta-23516",
        "System.Diagnostics.Contracts": "4.0.1-beta-23516",
        "System.Runtime.InteropServices.RuntimeInformation": "4.0.0-beta-23516",
        "Microsoft.AspNet.Hosting": "1.0.0-rc1-final",
        "Microsoft.AspNet.Server.Kestrel": "1.0.0-rc1-final",
        "Microsoft.Extensions.Configuration.FileProviderExtensions": "1.0.0-rc1-final",
        "Microsoft.Extensions.Configuration.Json": "1.0.0-rc1-final",
        "Microsoft.AspNet.WebSockets.Server": "1.0.0-rc1-final",
        "NLog": "4.4.0-*"
      }
    }
  },
  "dependencies": {
  }

Now, I want to move this code from DNX RC1 to the latest release. So this probably will look like this:

"frameworks": {
  "dnxcore50": {
    "dependencies": {
      "Microsoft.Extensions.Configuration.FileProviderExtensions": "1.0.0-rc1-final",
      "Microsoft.Extensions.Configuration.Json": "1.0.0-rc1-final",
      "Microsoft.AspNet.WebSockets.Server": "1.0.0-rc1-final",
      "NLog": "4.4.0-*"
    }
  }
},
"dependencies": {
  "NETStandard.Library": "1.0.0-rc3-23909"
},

I think, at least, that this is what it is supposed to be. Everything that I expect to be in the standard library is in the standard library, and the rest I still need to reference. Now we get to the fun part. Using "dotnet restore / dotnet build" it successfully compiles. But using "dnu restore / dnu build" it does not.

I think that this relates to the magic incantation of the specific nuget feeds that it checks, but I'm not sure. Not being able to also compile that on DNX means that I can't use Visual Studio to actually work on this project.

And a more complex project just doesn't work.

So I'm lost, and feeling quite stupid at this point. And more than a bit frustrated.

The role of logs

time to read 2 min | 344 words

You are probably familiar with logging, and log levels. In particular, the following scheme is very popular.

I have been using this scheme for pretty much every software project I have had over more than a decade. And it works. Which is good.

But I have found that they are actually not really useful for our scenarios. In particular, we found that there are many cases where a component that logged something as a warning or error would generate a support ticket from an eagle-eyed admin, while it was just a valid part of the software’s behavior. The other side of that is that when we need to use the logs to find a problem, we never use fine-grained levels, we just need it all.

This led me to believe that we only actually need two and a half levels.

  • Informative – we are reporting so one of our support engineers or a team member can look at that and understand what the software is doing.
  • Operational – something deserves attention from the customer’s operations team.

There isn’t really anything in between those. Either this is something that we fully expect an operations team at the customer to look at, or this is something that the development and/or support engineers need to see.

But I mentioned two and a half, what is that half? Well, above operational, there is another half a level, which I like to call Pay Attention Now. This isn’t actually a log level, it is a way to attract someone’s attention to the log, or to some sort of operational alert. This is an SMS sent, or an email, etc.

But beyond the “something needs attention now”, there is nothing else that is needed. The production logs are either for routine operations monitoring (in which case they should be relatively succinct and written with an eye to a reader who isn’t familiar with the internal workings of RavenDB) or “what is going on here” mode, which is typically done by a team member who needs the most information possible.
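
To make the split concrete, here is a minimal sketch of what such a scheme could look like in code. The names are purely illustrative; this is not RavenDB's actual logging API.

public enum LogLevel
{
    Informative, // for support engineers / team members tracing what the software is doing
    Operational  // something that deserves attention from the customer's operations team
}

public interface IOperationalLog
{
    void Log(LogLevel level, string message);

    // The extra "half" level: not a log entry at all, but a way to grab someone's
    // attention right now (an SMS, an email, an operational alert, etc.).
    void PayAttentionNow(string message);
}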

Code through the looking glass: And a linear search to rule them

time to read 1 min | 187 words

The final piece for this candidate is the code to actually search through the sorted array. As you'll recall, the candidate certainly made our computer pay for that sorting and merging. But we are fine now, we have managed to recover from all of that abuse, and we are ready to rock.

Here is how the search (over sorted array) was implemented.

public List<int> EmailSearcher(string email)
{
    List<int> answer = new List<int>();
    for (int i = 0; i < emailsArray.Length; i++)
    {
        if (emailsArray[i].STR.ToString().Equals(email, StringComparison.OrdinalIgnoreCase))
        {
            answer = emailsArray[i].ID;
            return answer;
        }
    }

    return answer;
}

And with that, I have nothing left to say.
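
Except, perhaps, to sketch what the sorted array was presumably for in the first place. A binary search over it (my code, not the candidate's, and assuming the array was sorted with the same case-insensitive ordinal comparison used for the lookup) would look roughly like this:

public List<int> EmailSearcherBinary(string email)
{
    int lo = 0, hi = emailsArray.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = StringComparer.OrdinalIgnoreCase.Compare(emailsArray[mid].STR, email);
        if (cmp == 0)
            return emailsArray[mid].ID; // O(log n) lookups, which is the point of sorting
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return new List<int>();
}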

Code through the looking glass: Sorting large data sets

time to read 3 min | 590 words

The task in question was to read a CSV file (which is large, but can fit in memory) and allow quick searches on the data by email to find the relevant users. Most candidates handle that with a dictionary, but since our candidate decided to flip things around, he also included an implementation of doing this with sorted arrays. That is actually quite nice, in theory. In practice, it ended up something like this:

(Image: a bizarre Photoshop hybrid animal.)

In order to truly understand the code, I have to walk you through a  bunch of it.

It starts with the following fields:

List<Structure> emails = new List<Structure>();

Structure[] namesArray = new Structure[0];

Structure, by the way, is a helper class that just has a key and a list of ids. The duplicate is strange, but whatever, let’s move on.

The class constructor is reading one line at a time and adding a Structure instance with the email and the line id to the emails list.

public void ToArray()
{
    emailsArray = this.emails.ToArray();
}

This code just copies emails to the array; we are not sure why yet, but then we have:

public void Sort()
{
    Array.Sort(emailsArray, delegate(Structure x, Structure y) { return x.STR.CompareTo(y.STR); });
}

So far, this is right in line with the "use a sorted array" method that the candidate talked about. There is a small problem here, because emails are allowed to be duplicated, but no fear, our candidate can solve that…

public void UnifyIdLists()
{
    for (int i = 0; i < emailsArray.Length; i++)
    {
        if (i + 1 == emailsArray.Length)
            break;
        if (emailsArray[i].STR.Equals(emailsArray[i + 1].STR))
        {
            emailsArray[i].ID.AddRange(emailsArray[i + 1].ID);
            emailsArray[i + 1] = null;
            List<Structure> temp = emailsArray.ToList<Structure>();
            temp.RemoveAll(item => item == null);
            emailsArray = temp.ToArray();
        }
    }
}

The intent of this code is to merge all identical email values into a single entry in the list.

Now, to put things in perspective, we are talking about a file that is going to be around 500MB in size, and there are going to be about 3.5 million lines in it.

That means that the emailsArray alone is going to take about 25MB.

Another aspect to consider is that we are using dummy data in our test file. Do you want to guess how many duplicates there are going to be there? Each one generates two 25MB allocations and multiple passes over an array 3.5 million items in size.

Oh, and for the hell of it, the code above doesn't even work. Consider the case when we have three duplicates…
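
For reference, a single pass over the sorted array is enough to merge the duplicates without any of those repeated allocations, and it handles three (or more) duplicates correctly. A minimal sketch, mine rather than the candidate's, assuming the same Structure class:

public void UnifyIdLists()
{
    var merged = new List<Structure>(emailsArray.Length);
    foreach (var entry in emailsArray)
    {
        if (merged.Count > 0 && merged[merged.Count - 1].STR.Equals(entry.STR))
            merged[merged.Count - 1].ID.AddRange(entry.ID); // same email, fold the ids in
        else
            merged.Add(entry);
    }
    emailsArray = merged.ToArray(); // one allocation for the whole merge
}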

Code through the looking glass: I'm the God of all OOP

time to read 3 min | 472 words

We typically don't look too deeply into the overall design on a candidate's code. Most coding tasks just aren't big enough to warrant it.

When we do, it is typically because the candidate has done something… peculiar. For example, we have had a candidate that sent a submission with no fewer than 17 classes to manage reading a CSV file and populating a dictionary. At some point we knew that this candidate wasn't going to move forward, but it became a challenge, to try to outline how the data actually traveled through that particular maze. But as much as we love Mr. Goldberg, I'm afraid that dear Rubbie isn't going to be remembered for much longer.

Here is how this candidate code started:

static void Main(string[] args)
{
    //Call for parsing method
    HelperClass cr = new HelperClass();
    cr.CsvParser();
    cr.ToArray();
    cr.Sort();
    cr.UnifyIdLists();
    //ask the user to enter input
    string input = cr.InputReader();

I actually started explaining the issues in this code, but I ran out of… pretty much everything.

I have a challenge for you, can you think of a single OOP principle that isn't being broken here?

For… fun, I guess, you can also look at the following code, which comes just after the one above:

Start:
    //check input type
    switch (cr.InputTester(input))
    {
        case 0:
            //valid email structure, but the input is not found in the file
            List<int> emails = new List<int>();
            emails = cr.EmailSearcher(input);
            if (emails.Count == 0)
            {
                Console.WriteLine("The input you have entered has not been found in the file. \nPlease try again.");
                input = cr.InputReader();
                goto Start;
            }

I guess someone heard that loops are a performance bottleneck in most applications and decided to do away with them.
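
For what it's worth, here is a minimal sketch of the same "ask again until something matches" flow written with an ordinary loop, reusing the submission's method names purely for illustration:

string input = cr.InputReader();
List<int> matches = cr.EmailSearcher(input);
// keep asking while the input is a valid email that has no matches in the file
while (cr.InputTester(input) == 0 && matches.Count == 0)
{
    Console.WriteLine("The input you have entered has not been found in the file. \nPlease try again.");
    input = cr.InputReader();
    matches = cr.EmailSearcher(input);
}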

Code through the looking glass: All you need is a dictionary (and O(N))

time to read 2 min | 362 words

The first question that we ask in the coding task goes something like this: "Given a CSV file containing users' data, which is large, but can fully fit into memory, we want to be able to search by email address very quickly and get all the matching user ids. Optimize for fast queries over fast startup."

The intent is basically that you'll read the file and load that into a dictionary, then use that to perform queries.

This candidate has done just that, although things started being strange pretty much from the get go…

Dictionary<int, string> emails = new Dictionary<int, string>();

That seemed like a pretty strange way to set things up, but we have seen crazier. At this point I thought that they were probably storing the hash code of the email in the dictionary, and the string value was the concatenated list of all the matching ids.

The truth was… far more interesting. Here is the code for querying:

public Dictionary<int, int> EmailSearcher(string email)
{
    Dictionary<int, int> answer = new Dictionary<int, int>();
    int count = 0;
    foreach (var entry in emails)
    {
        if (entry.Value.ToString().Equals(email, StringComparison.OrdinalIgnoreCase))
        {
            answer.Add(count, entry.Key);
            count++;
        }

    }
    return answer;
}

This code actually has multiple levels of strangeness. The obvious one is that this is doing a linear search on the dictionary, but look at the return type as well…

The candidate was well aware that this code is not able to handle large amounts of information, so the candidate sent us an alternative implementation. But I'll talk about that in my next post.
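
For reference, the shape the intended solution usually takes is roughly the following: index the file once into a dictionary keyed by email, so that each query is a single lookup. This is my sketch of that intent, not the candidate's alternative implementation.

Dictionary<string, List<int>> byEmail =
    new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

public void Index(string email, int userId)
{
    List<int> ids;
    if (byEmail.TryGetValue(email, out ids) == false)
    {
        ids = new List<int>();
        byEmail[email] = ids;
    }
    ids.Add(userId);
}

public List<int> EmailSearcher(string email)
{
    List<int> ids;
    return byEmail.TryGetValue(email, out ids) ? ids : new List<int>();
}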

Code through the looking glass: Finding today's log file

time to read 4 min | 615 words

You might have noticed that we are looking for more people again. Mostly because I gripe about some of the bad ones here every now and then.

One of my absolute requirements is that I want to read a candidate's code before hiring them. Some of them have made significant contributions to Open Source code, but a large number don't have any significant body of code that they can show. That is why we ask candidates to send us a coding test.

We had people flat out refuse to do the coding test, and some who thought that the questions were incredibly hard and unrealistic. We had people who sent in code that was so bad it caused migraines, we had people who sent in code that wouldn't compile, we had people do stuff like read a 15TB file (16,492,674,416,640)! times. And just to clarify, that is factorial(16492674416640). I don't know what the actual value of this number is, but it is big.

The nice thing is that you can usually tell right away when the code is bad. We also play a game called "guess the candidate's background". We have about an 89% success rate there*.

Spotting good code is much harder, because a really successful submission is going to be boring. It does what needs to be done, and that is it. Our most recent hire's code submission was so boring, we had to analyze it using our standard metrics to find issues (our standard metrics are for production code running in a high performance DB for months on end, not the same metrics we evaluate code submissions).

And then… then there is this candidate, whose code is so unique that I decided to dedicate a full week to explore it. The code looks good and is documented; there are explanations that show what is going on, and they are going in the right direction, on the surface of it.

And then there is the devil in the details. I have quite a lot to write about this particular submission, but I'll just start with the following:

//Find today's log file in the directory
public string LogFileFinder()
{
    string[] files = Directory.GetFiles(LoggingAggregation.Program.GlobalVar.path, "*.log");
    for (int i = 0; i < files.Length; i++)
    {
        int slash = files[i].LastIndexOf(@"\");
        int dot = files[i].LastIndexOf(".");
        string fileName = files[i].Substring(slash + 1, dot - slash - 1);
        string fileDate = DateTime.Now.ToString("yyyy-MM-dd");
        if (fileName.Equals(fileDate))
            return fileName + ".log";
    }

    return null;
}

Now, another way to write this code would be:

Path.Combine(LoggingAggregation.Program.GlobalVar.path, DateTime.Now.ToString("yyyy-MM-dd") +".log") 

I literally stared at this piece of code for several minutes, trying to figure out what is actually going on there. I wasn't able to.

As an aside, we sent the candidate a rejection notice, along with a few pointers to some of the more egregious things that were wrong in the code. They know how to code, it is just that it goes sideways in the middle.

* The game is to guess the candidate's default language settings. That is, someone who writes C# code, but has been brought up on C, and so has a C style to the code.
