Ayende @ Rahien

Hi!
My name is Ayende Rahien
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969




Boldly & confidently fail, it is better than the alternative


Recently I had the chance to sit with a couple of the devs in the RavenDB Core Team to discuss “keep & discard” habits*.

The major problem we now have with RavenDB is that it is big. And there are a lot of things going on there that you need to understand. I ran the numbers, and it turns out that the current RavenDB contains:

  • 835,000 Lines of C#
  •   67,500 Lines of TypeScript
  •   87,500 Lines of HTML

That is divided into many areas of functionality, but it is still a big chunk of stuff to go through. And that is ignoring things that require us to understand additional components (like Esent, Lucene, etc.). What is more, there is a lot of expertise involved in understanding the full picture: “we limit this value here because too much of it would result in high memory consumption under this set of circumstances”, for example.

The problem is that it takes time, and sometimes a lot of it, to get a good understanding of how things all come together. In order to handle that, we typically assign new devs issues from all around the code base. The idea isn’t so much to give them a chance to become experts in a particular field, but to make sure that they get the general idea of how the code is structured and how the project comes together.

Over time, people tend to gravitate toward a particular area (M** is usually the one handling the SQL Replication stuff, for example), but that isn’t fixed (T fixed the most recent issue there), and the areas of responsibility shift (M is doing a big task, we don’t want to disturb him, let H do that).

Anyway, back to the discussion that we had. What I realized is that we have a problem. Most of our work is either new features or fixing issues. That means that nearly all the time, we don’t really have any fixed template to give developers “here is how you do this”. A recent example was an issue where invoking smuggler with a particular set of filters would result in very high cost. The task was to figure out why, and then fix this. But the next task for this developer is to do sharded bulk insert implementation.

I’m mentioning this to explain a part of the problem. We don’t see a lot of “exactly the same as before”, and a new dev on the team leans on the other members quite heavily initially. That is expected, of course, and encouraged. But we identified a key problem in the process. Because the other team members also don’t have a ready-made answer, they need to dig into the problem before they can offer assistance, which sometimes (all too often, to be honest) leads to a “can you slide the keyboard my way?” and taking over the hunt. The result is that the new dev does learn, but a key part of the process is missing: finding out what is going on.

We are going to ask both sides of this interaction to keep track of that, and stop it as soon as they realize that this is what is going on.

The other issue that was raised was the issue of fear. RavenDB is a big system, and it can be quite complex. It is a quite reasonable apprehension: what if I break something by mistake?

Here it comes back to the price of failure. Trying something out means that at worst you wasted a work day, nothing else. We are pretty confident in our QA process and system, so we can allow people to experiment. Analysis paralysis is a much bigger problem. And I wasn’t quite right just now: trying the wrong thing isn’t wasting a day, either. You learned what doesn’t work, and hopefully also why.

“I have not failed. I've just found 10,000 ways that won't work.”
Thomas A. Edison

* Keep & discard is a literal translation of a term that is very common in the IDF. After most activities, there is an investigation performed, and one of the first questions asked is what we want to keep (good things that happened that we need to preserve for the next time we do this) and what we need to discard (bad things that we need to watch out for).

** The actual people are not relevant for this post, so I’m using letters only.

Project Tamar


I’m happy to announce that despite the extreme inefficiencies involved in the process, the performance issues, and what are sure to be multiple stop-ship bugs in the way the release process is handled, we have successfully completed Project Tamar.

The result showed up as a 2.852 kg bundle, and is currently sleeping peacefully. I understand that this is a temporary condition, and lack of sleep shall ensue shortly. I’ll probably regret saying that, but I can’t wait.

 

This little girl is named Tamar, and to celebrate, I’m offering a 28.52% discount on all our products.

Just use coupon code: Tamar

This will be valid for the next few days.

Lambda methods and implicit context


The C# compiler is lazy, which is usually a very good thing, but that can also give you some issues. We recently tracked down a memory usage issue to code that looked roughly like this.

var stats = new PerformatStats
{
    Size = largeData.Length
};
stats.OnCompletion += () => this.RecordCompletion(stats);

Write(stats, o =>
{
    var sp = new Stopwatch();
    foreach (var item in largeData)
    {
        sp.Restart();
        // do something with this
        stats.RecordOperationDuration(sp);
    }
});

On the surface, this looks good. We are only using largeData for a short while, right?

But behind the scenes, something evil lurks. Here is what this is actually translated to by the compiler:

__DisplayClass3 cDisplayClass3_1 = new __DisplayClass3
{
    __this = this,
    largeData = largeData
};
cDisplayClass3_1.stats = new PerformatStats { Size = cDisplayClass3_1.largeData.Length };

cDisplayClass3_1.stats.OnCompletion += new Action(cDisplayClass3_1, __LambdaMethod1__);

Write(cDisplayClass3_1.stats, new Action(cDisplayClass3_1, __LambdaMethod2__));

You need to pay special attention to what is going on. We need to maintain the local state of the variables, so the compiler lifts the local parameters into an object (called __DisplayClass3).

Creating spurious objects is something that we want to avoid, so the C# compiler says: “Oh, I’ve two lambdas in this call that need to get access to the local variables. Instead of creating two objects, I can create just a single one, and share it among both calls, thereby saving some space”.

Unfortunately for us, there is a slight issue here. The lifetime of the stats object is pretty long (we use it to report stats). But we also hold a reference to the completion delegate (we use that to report on stuff later on). Because the completion delegate holds the same lifted parameters object, and because that holds the large data object, we ended up holding a lot of stuff in memory far beyond the time it was useful.

The annoying thing is that it was pretty hard to figure out, because we were looking at the source code, and the lambda that we know is likely to be long running doesn’t look like it is going to hold a reference to the largeData object.

Ouch.
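One way we could have avoided this (a sketch only, with stub types standing in for the real ones from the example above): move the event subscription into its own method. Display classes are generated per method scope, so the completion delegate’s closure then captures only stats, and largeData becomes collectible once Write returns.

```csharp
using System;
using System.Diagnostics;

// Stubs standing in for the types from the example above.
public class PerformatStats
{
    public long Size { get; set; }
    public event Action OnCompletion;
    public void RecordOperationDuration(Stopwatch sp) { }
    public void Complete() { var h = OnCompletion; if (h != null) h(); }
}

public class Processor
{
    void RecordCompletion(PerformatStats stats) { }
    void Write(PerformatStats stats, Action<object> action) { action(null); }

    // Subscribing in a separate method forces the compiler to generate a
    // closure here that captures only `stats`, so the long-lived completion
    // delegate no longer keeps `largeData` alive.
    void AttachCompletion(PerformatStats stats)
    {
        stats.OnCompletion += () => RecordCompletion(stats);
    }

    public void Process(byte[] largeData)
    {
        var stats = new PerformatStats { Size = largeData.Length };
        AttachCompletion(stats); // this closure holds just stats

        Write(stats, o =>
        {
            var sp = new Stopwatch();
            foreach (var item in largeData)
            {
                sp.Restart();
                // do something with this item
                stats.RecordOperationDuration(sp);
            }
        });
    }
}
```

The second lambda still captures largeData, of course, but that closure dies with the Write call; only the small, stats-only closure survives.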

Buffer Managers, production code and alternative implementations


We are porting RavenDB to Linux, and as such, we run into a lot of… interesting issues. Today we ran into a really annoying one.

We make use of the BufferManager class inside RavenDB to reduce memory allocations. On the .NET side of things, everything works just fine, and we have never really had any issues with it.

On the Mono side of things, we started getting all sorts of weird errors, from ArgumentOutOfRangeException to NullReferenceException to just plain weird stuff. That was the time to dig in and look into what is going on.

On the .NET side of things, the BufferManager implementation is based on a selection criterion between large (more than 85KB) and small buffers. For large buffers, there is a single large pool that is shared among all the users of the pool. For small buffers, the BufferManager uses a pool per active thread as well as a global pool, etc. In fact, looking at the code, we see that it is really nice, and a lot of effort has been made to harden it and make it work nicely for many scenarios.
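For reference, the contract we rely on looks roughly like this (the pool sizes here are illustrative, not the ones RavenDB actually uses):

```csharp
using System;
using System.ServiceModel.Channels;

class BufferManagerDemo
{
    static void Main()
    {
        // Cap the whole pool at 1 MB, with individual buffers up to 128 KB.
        BufferManager manager = BufferManager.CreateBufferManager(
            maxBufferPoolSize: 1024 * 1024,
            maxBufferSize: 128 * 1024);

        // TakeBuffer may hand back a larger array than requested.
        byte[] buffer = manager.TakeBuffer(4096);
        Console.WriteLine(buffer.Length >= 4096);

        // Returning the buffer lets the pool reuse it
        // instead of allocating a fresh array.
        manager.ReturnBuffer(buffer);
        manager.Clear();
    }
}
```

The maxBufferPoolSize cap is exactly the part of the contract that matters below.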

The Mono implementation, on the other hand, decides to blithely discard the API contract by ignoring the maximum buffer pool size, seemingly because “no user code is designed to cope with this”. Considering the fact that RavenDB is certainly dealing with that, I’m somewhat insulted, but it seems par for the course for Linux, where “memory is infinite until we kill you”* is the way to go.

But what is far worse is that this class is absolutely not thread safe. That was a lot of fun to discover. Considering that this piece of code is pretty central to the entire WCF stack, I’m not really sure how that ever worked. We ended up writing our own BufferManager implementation for Mono, to avoid those issues.

* Yes, somewhat bitter here, I’ll admit. The next post will discuss this in detail.

Long running async and memory fragmentation


We have been working on performance a lot lately, but performance isn’t just an issue of how fast you can do something, it is also an issue of how many resources you use while doing it. One of the things we noticed was that we were using more memory than we would like, even after we freed the memory we were using. Digging into the memory usage, we found that the problem was that we were suffering from fragmentation inside the managed heap.

More to the point, this isn’t a large object heap fragmentation, but actually fragmentation in the standard heap. The underlying reason is that we are issuing a lot of outstanding async I/O requests, especially to serve things like the Changes() API, wait for incoming HTTP requests, etc.

Here is what this looks like inside dotProfiler.

image

As you can see, we are actually using almost no memory, but heap fragmentation is killing us in terms of memory usage.

Looking deeper, we see:

image

We suspect that the issue is that we have pinned instances that we sent to async I/O, and that matches what we have found elsewhere about this issue, but we aren’t really sure how to deal with it.

Ideas are more than welcome.
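For context, the mitigation usually suggested for this class of problem (a sketch only; PinnedBufferPool is a hypothetical name, and we haven’t committed to anything) is to allocate the I/O buffers once, up front, and reuse them. Long-lived arrays migrate to gen2, where pinning them no longer fragments the ephemeral segments:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch: preallocate the buffers handed to async I/O so the
// arrays that get pinned are long-lived (and end up in gen2), instead of
// pinning fresh gen0 allocations on every request.
public class PinnedBufferPool
{
    private readonly ConcurrentBag<byte[]> _buffers = new ConcurrentBag<byte[]>();

    public PinnedBufferPool(int count, int size)
    {
        for (int i = 0; i < count; i++)
            _buffers.Add(new byte[size]);
    }

    public byte[] Take()
    {
        byte[] buffer;
        // Caller must decide what to do on exhaustion (block, grow, fail).
        return _buffers.TryTake(out buffer) ? buffer : null;
    }

    public void Return(byte[] buffer)
    {
        _buffers.Add(buffer);
    }
}
```
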

.NET Packaging mess


In the past few years, we had:

  • .NET Full
  • .NET Micro
  • .NET Client Profile
  • .NET Silverlight
  • .NET Portable Class Library
  • .NET WinRT
  • Core CLR
  • Core CLR (Cloud Optimized)*
  • MessingWithYa CLR

* Can’t care enough to figure out if this is the same as the previous one or not.

In each of those cases, they offered similar, but not identical, API and options. That is completely ignoring the versioning side of things, where we have .NET 2.0 (1.0 finally died a while ago), .NET 3.5, .NET 4.0 and .NET 4.5. I don’t think that something can be done about versioning, but the packaging issue is painful.

Here is a small example why:

image

In each case, we need to subtly tweak the system to accommodate the new packaging option. This is pure additional cost to the system, with zero net benefit. Each time that we have to do that, we add a whole new dimension to the testing and support matrix, leaving aside the fact that the complexity of the solution is increasing.

I wouldn’t mind it so much, if it weren’t for the fact that it feels like a lot of those are effectively drive-bys. Silverlight took a lot of effort, and it is dead. WinRT took a lot of effort, and it is effectively dead.

This adds a real cost in time and effort, and it is hurting the platform as a whole.

Now users are running into issues with the Core CLR not supporting stuff that we use. So we need to rip out MEF from some of our code, and implement it ourselves just to get things in the same place as before.

Exercise, learning and sharpening skills


The major portion of the RavenDB core team is co-located in our main office in Israel. That means that we get to do things like work together, throw ideas off one another, etc.

But one of the things that I found stifling in other workplaces is a primary focus on the current work. I mean, the work is important, but one of the best ways to stagnate is to keep doing the same thing over and over again. Sure, you are doing great and can spin that yarn very well, but there is always that power loom that is going to show up any minute. So we try to do some skill sharpening.

There are a bunch of ways we do that. We try to have a lecture every week (given by one of our devs). The topics so far range from graph theory to distributed gossip algorithms to new frameworks and features that we deal with to the structure of the CPU and memory architecture.

We also have mini debuggathons. I’m not sure if the name fits, but basically, we present a particular problem, split into pairs, and try to come up with a root cause and a solution. This post is written while everyone else in the office is busy looking at WinDBG and hunting down a memory leak issue; in the meantime, I’m posting to the blog and fixing bugs.

Over the wire protocol design


I’ve had a couple of good days in which I was able to really sit down and code, and got a lot of cool stuff done. One of the minor requirements we had was to send some data over the wire. That always leads to a lot of interesting decisions.

How to actually send it? One way? RPC? TCP? HTTP?

I like using HTTP for those sorts of things. There are a lot of reasons for that: it is easy to work with and debug, it is easy to secure, and it doesn’t require special permissions or run into firewall issues.

Mostly, though, it is because of one overriding priority: it is easy to manage in production. There are good logs for that. Because this is my requirement, it shapes the type of HTTP operations that I’m making.

Here is one such example that I’m going to need to send:

public class RequestVoteRequest : BaseMessage
{
    public long Term { get; set; }
    public string CandidateId { get; set; }
    public long LastLogIndex { get; set; }
    public long LastLogTerm { get; set; }

    public bool TrialOnly { get; set; }
    public bool ForcedElection { get; set; }
}

A very simple way to handle that would be to simply POST the data as JSON, and you are done. That would work, but it would have a pretty big issue. It would mean that all requests look the same over the wire. If we are looking at the logs, we’ll only be able to see that stuff happened, but we won’t be able to tell what happened.

So instead of that, I’m going with a better model in which we will make request like this one:

http://localhost:8080/raft/RequestVoteRequest?Term=19&CandidateId=oren&LastLogIndex=2&LastLogTerm=8&TrialOnly=False&ForcedElection=False&From=eini

That means that if I’m looking at the HTTP logs, it is obvious what I’m doing. We will also reply using HTTP status codes, so we should be able to figure out the interaction between servers just by looking at the logs, or better yet, have Fiddler capture the traffic.
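The mapping from message to URL can be sketched with a bit of reflection (MessageUrlBuilder is a hypothetical helper for illustration, not the actual RavenDB code; the type and property names follow the RequestVoteRequest example above):

```csharp
using System;
using System.Linq;

public static class MessageUrlBuilder
{
    // Turn a message object into a self-describing URL: the type name
    // becomes part of the path, so HTTP logs show *what* operation was
    // performed, not just that "something" happened.
    public static string ToUrl(string baseUrl, object message)
    {
        var type = message.GetType();
        var query = string.Join("&",
            type.GetProperties()
                .Select(p => p.Name + "=" + Uri.EscapeDataString(
                    Convert.ToString(p.GetValue(message, null)))));
        return baseUrl + "/raft/" + type.Name + "?" + query;
    }
}
```
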

Career planning: Mine


I got some really good questions about my career. Which caused me to reflect back and snort at my own attempts to make sense of what happened.

Here is a rough timeline:

  • 1999 – Start of actually considering myself to be a professional developer. This is actually the start of a great one year course I took to learn C++, right out of high school.
  • 2001 – Joined the army, was sent to the Military Police, and spent 4 years in prison. Roles ranged from prison guard to XO of a big prison to teacher in an officer training course, and concluded with about a year as a small prison commander.
  • 2004 – Opened my blog and started writing about the kind of stuff that I was doing; first version of Rhino Mocks.
  • 2005 – Joined the Castle committers’ team, left the army, joined We!, worked as a consultant.
  • 2006 – My first international conference – DevTeach.
  • 2008 – Left We!, started working as an independent consultant.
  • 2009 – NHibernate Profiler beta release.
  • 2010 – DSLs in Boo book is published, Entity Framework Profiler, Linq to SQL Profiler, RavenDB.
  • 2011 – Hiring of first full employee.
  • 2014 – Writing this blog post.

A lot of my history doesn’t make sense without a deeper understanding. In the army, there was a long time in which I wasn’t actually able to do anything with computers. That meant that on vacations, I would do stuff voraciously. At that time, I had already read a lot of university-level books (dinosaurs book, Tanenbaum’s book, TCP/IP, DDD book and a lot of other stuff like that). At some point, I got an actual office and had some free time, so I could play with things. I wrote code, a lot. Nothing that was actually very interesting. It was anything from mouse tracking software (that I never actually used) to writing a custom application to track medical data for inmates. Mostly I played around and learned how to do stuff.

When I got to be a prison commander, I also got myself a laptop, and started doing more serious stuff. I wasn’t able to afford any professional software, so it was mostly Open Source work. I started working on NHibernate Query Analyzer, just to see what I could do with it. That taught me a lot about reflection and NHibernate itself. I then got frustrated with the state of mocking tools in the .NET framework, and I decided to write my own. Around that time, I also started to blog.

What eventually became Rhino Mocks was a pretty big thing. Still one of the best pieces of software that I have written, it required that I have a deep understanding of a lot of things, from IL generation to how classes are composed by the runtime to AppDomains to pretty much everything.

Looking back, Rhino Mocks was pretty huge in terms of pushing my career. It was very widely used, and it got me a lot of attention. After that, I was using NHibernate and talking about that a lot, so I got a lot of reputation points in that arena as well. But the first thing that I did after starting as an independent consultant was to actually work on SvnBridge. A component that would allow an SVN client to talk to a Team Foundation Server. That was something that I had very little experience with, but I think that I did a pretty good job there.

Following that, I mostly did consulting and training on NHibernate. I was pretty busy. So busy that at some point I actually had a six-week business trip that took me to five countries and three continents. I came back home and didn’t leave my bed for about a week. For two weeks following that, I would feel physically ill if I sat in front of the computer for more than a few minutes.

That was a pretty big wakeup call for me. I knew that I had to do something about it. That is when I actually sat down and thought about what I wanted to do. I knew that I wanted to stay in development, and that I couldn’t continue being a road warrior without burning out again. I decided that my route would be to continue to do consulting, but on a much reduced frequency, and to start focusing on creating products. Stuff that I could work on while at home, and hopefully get paid for. That is how the NHibernate Profiler was born.

From there, it was a matter of working more on that and continuing to other areas, such as Entity Framework, Linq to SQL, etc. RavenDB came because I got tired of fixing the same old issues over and over again, even with the profilers to help me. And that one actually had a business plan: we were going to invest so much money and time to get it out, and it far exceeded our expectations.

Looking back, there were several points that were of great influence: writing my blog, writing Rhino Mocks, joining open source projects such as Boo or Castle, working and blogging with NHibernate, going to conferences and speaking there. All of those gave me a lot of experience and got me out there, building reputation and getting to know people.

That was very helpful down the road, when I was looking for consultancy jobs, or doing training. Or when it came the time to actually market our software.

In terms of the future, Hibernating Rhinos is growing at a modest rate. Which is the goal. I don’t want to double every six months, that is very disturbing to one’s peace of mind. I would much rather have a slow & steady approach. We are working on cool software, we are going home early and for the most part, there isn’t a “sky is falling” culture. The idea is that this is going to be a place that you can spend a decade or four in. I’ll get back to you on that when I retire.
