Why scalability matters?
The RavenDB indexing process: Optimization–Tuning? Why, we have auto tuning
The final aspect of RavenDB’s x7 jump in indexing performance is the fact that we made it freakishly smart.
During standard operation, most indexes only update when new information comes in, we are usually talking about a small number of documents for every indexing run. The problem is what happens when you have a sudden outpour of documents into RavenDB? For example, during nightly ETL batch, or just if you suddenly have a flood of users doing write operations.
The problem here is that we actually have to balance a lot of variable at the same time:
- The number of documents that we have to index*.
- The current memory utilization**.
- How any cores I have available to do the index work with?
- How much time do I have to do this?
Basically, the idea goes like this, if I have a small batch size, I am able to index more quickly, ensuring that we have fresher results. If I have big batch size, I am able to index more documents, and my overall indexing times goes down.
There is a non trivial cost associated with every indexing run, so reducing the number of indexing run is good, but the more documents I shove into a single run, the more memory will I use, and the more time it will take before the results are visible to the users.
* It is non trivial because there is no easy way for us to even know how many documents we have left to index (to find out is costly).
** Memory utilization is hard to figure out in a managed world. I don’t actually have a way to know how much memory I am using for indexing and how much for other stuff, and there is no real way to say “free the memory from the last indexing run”, or even estimate how much memory that took.
What we have decided on doing is to start from a very small (low hundreds) indexing batch size, and see what is actually going on live. If we see that we have more documents to index than the current batch size, we will slowly double the size of the batch. Slowly, because bigger batches requires more memory, and we also have to take into account current utilization, memory usage, and a bunch of other factors as well. We also go the other way around, able to reduce the indexing batch size on demand based on how much work we have to do right now.
We also provide an upper limit, because at some point it make sense to just do a big batch and make the indexing results visible than to try to do everything all at once.
The fun part in all of that is that once we have found the appropriate algorithm for this, it means that RavenDB will automatically adjust itself based on real production load. If you have an low update rate, it will favor small indexing batches and immediately execute indexing on the new documents. However, if you suddenly have a spike in traffic and the update rate goes up, RavenDB will adjust the indexing batch size so it will be able to keep up with your rate.
We have done some (read, a huge amount) testing with regards to this new optimization, and it turns out that under slow update frequency, we are seeing an average of 15 – 25 ms between a document update and it showing up in the indexes. That is pretty good, but what is going on when we have data just pouring in?
We tested this with a 3 million documents and 3 indexes. And it turn out that under this scenario, where we are trying to shove data into RavenDB as fast as it can accept it, we do see an increase in index latency. Under those condition, latency rose all the way to 1.5 seconds.
This is actually something that I am very happy about, because we were able to automatically adjust to the changing conditions, and were still able to index things at a reasonable rate (note that under this scenario, the batch size was usually 8 – 16 thousands documents, vs. the 128 – 256 that it is normally).
Because we were able to adjust the batch size on the fly, we could handle sustained writes at this rate with no interruption in service and no real need to think about this from the users perspective.. Exactly what the RavenDB philosophy calls for.
The RavenDB indexing process: Optimization–Getting documents from disk
As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.
In Pseudo Code, it looks like this:
while database_is_running:
stale = find_stale_indexes()
lastIndexedEtag = find_last_indexed_etag(stale)
docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
As it turned out, we had a major optimization option here, because of the way the data is actually structured on disk. In simple terms, we have an on disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.
Instead of loading the documents in the orders in which they were modified, we decided to try something different. We first query the information we need to find the document on disk from the index, then we sort them based on the optimal access pattern, to reduce disk movement and ensure that we have as sequential reads as possible. Then we take those results in memory and sort them based on their last update time again.
This seems to be a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually something that is very easy not to notice. The end result is quite promising, and it contributed to the 7+ times improvements in perf that we had for indexing costs.
But surprisingly, it wasn’t the major factor, I’ll discuss a huge perf boost in this area tomorrow.
The RavenDB indexing process: Optimization–De-parallelizing work
One of the major dangers in doing perf work is that you have a scenario, and you optimize the hell out of that scenario. It is actually pretty easy to do without even noticing it. The problem is that when you do things like that, you are likely to be optimizing a single scenario to perform really well, but you are hurting the overall system performance.
In this example, we have moved heaven and earth to make sure that we are indexing things as fast as possible, and we tested with 3 indexes, on an 4 cores machine. As it turned out, we actually had improved things, for that particular scenario.
Using the same test case on a single core machine was suddenly far more heavy weight, because we were pushing a lot of work at the same time. More than the machine could process. The end result was that it actually got there, but much more slowly than if we would have run things sequentially.
Of course, I give you the outliers, but those are good indicators for what we found out. Initially, we thought that we could resolve that by using the TPL’s MaxDegreeOfParallelism, but it turned out to be more complex than that. We have IO bound and we have CPU bound tasks that we need to execute, and trying to execute IO heavy tasks with this would actually cause issues in this scenario.
We had to manually throttle things ourselves, both to ensure limited number of parallel work, and because we have a lot more information about the actual tasks than the TPL have. We can schedule them in a way that is far more efficient because we can tell what is actually going on.
The end result is that we are actually using less parallelism, overall, but in a more efficient manner.
In my next post, I’ll discuss the auto batch tuning support, which allows us to do some really amazing things from the point of view of system performance.
The RavenDB indexing process: Optimization–Parallelizing work
One of the things that we are doing during the index process for RavenDB is applying triggers and deciding what, if and how a document will be indexed. The actual process is a bit more involved, because we have to do additional things (like figure out which indexes have already indexed those particular documents).
At any rate, the interesting thing is that this is a process which is pretty basic:
for doc in docs: matchingIndexes = FindIndexesFor(doc) if matchingIndexes.Count > 0: doc = ExecuteTriggers(doc) if doc != null: yield doc
The interesting thing about this is that this is a set of operations that only works on a single document at a time, and the result is the modified documents.
We were able to gain significant perf boost by simply moving to a Parallel.ForEach call. This seems simple enough, right? Parallelize the work, get better benefits.
Except that there are issues with this as well, which I’ll touch on my next post.
The RavenDB indexing process: Optimization
The actual process done by RavenDB to index documents is a fairly complex one. In order to understand what exactly happened, I decided to break it apart to pseudo code.
It looks something like this:
while database_is_running: stale = find_stale_indexes() lastIndexedEtag = find_last_indexed_etag(stale) docs_to_index = get_documents_since(lastIndexedEtag, batch_size) filtered_docs = execute_read_filters(docs_to_index) indexing_work = [] for index in stale: index_docs = select_matching_docs(index, filtered_docs) if index_docs.empty: set_indexed(index, lastIndexedEtag) else indexing_work.add(index, index_docs) for work in indexing_work: work.index(work.index_docs)
And now let me show you the areas in which we did some perf work:
while database_is_running: stale = find_stale_indexes() lastIndexedEtag = find_last_indexed_etag(stale) docs_to_index = get_documents_since(lastIndexedEtag, batch_size) filtered_docs = execute_read_filters(docs_to_index) indexing_work = [] for index in stale: index_docs = select_matching_docs(index, filtered_docs) if index_docs.empty: set_indexed(index, lastIndexedEtag) else indexing_work.add(index, index_docs) for work in indexing_work: work.index(work.index_docs)
All of which gives us a major boost in the system performance. I’ll discuss each part of that work in detail, don’t worry ![]()
Performance implications of method signatures
In my previous post, I asked: What are the performance implications of the two options?
Versus:
And the answer is quite simple. The chance to optimize how it works.
In the first example, we have to return an unknown amount of information. In the second example, we know how much data we need to return. That means that we can optimize ourselves based on that.
What do I mean by that?
Look at the method signatures, those requires us to scan a secondary index, and get the results back. From there, we need to get back to the actual data. If we knew what the size of the data that we need to return is, we could fetch just the locations from the index, then optimize our disk access pattern to take advantage of sequential reads.
In the first example, we have to assume that every read is the last read. Callers may request one item, or 25 or 713, so we don’t really have a way to optimize things. The moment that we have the amount that the caller wants, things change.
We can scan the index to get just actual position of the document on disk, and then load the documents from the disk based on the optimal access pattern in terms of disk access. It is a very small change, but it allowed us to make a big optimization.
Compare and contrast: Performance implications of method signatures
If you throttle me any me I am going to throttle you back!

It is interesting to note that for a long while, what we were trying to do with RavenDB was make it use less and less resources. One of the reasons for that is that less resources is obviously better, because we aren’t wasting anything.
The other reason is that we have users running us on a 512MB/650 MHz Celeron 32 bit machines. So we really need to be able to fit into a small box (and also allow enough processing power for the user to actually do something with the machine).
We have gotten really good in doing that, actually.
The problem is that we also have users running RavenDB on standard server hardware (32 GB / 16 cores, RAID and what not) in which case they (rightly) complain that RavenDB isn’t actually using all of their hardware.
Now, being conservative about resource usage is generally good, and we do have the configuration in place which can tell RavenDB to use more memory. It is just that this isn’t polite behavior.
RavenDB in most cases shouldn’t require anything special for you to run, we want it to be truly a zero admin database. The solution? Take into account the system state and increase the amount of work that we do to get things done. And yes, I am aware of the pitfalls.
As long as there is enough free RAM available, we will increase the amount of documents that we are going to index in a single batch. That is subject to some limits (for example, if we just created a new index on a big database, we need to make sure we aren’t trying to load it entirely to memory), and it knows how to reserve some room for other things, and how to throttle down and as well as up.
This post is written before I had the chance to actually test this on production level size dataset, but I am looking forward to seeing how it works.
Update: Okay, that is encouraging, it looks like what we did just made things over 7 times faster. And this isn’t a micro benchmark, this is when you throw this on a multi GB database with full text search indexing.
Next, we need to investigate what we are going to do about multiple running indexes and how this optimization affects them. Fun
.
Watch your 6, or is it your I/O? It is the I/O, yes
As I said in my previous post, tasked with having to load 3.1 million files into RavenDB, most of them in the 1 – 2 KB range.
Well, the first thing I did had absolutely nothing to do with RavenDB, it had to do with avoiding dealing with this:
As you can see, that is a lot.
But when the freedb dataset is distributed, what we have is actually:
This is a tar.bz2, which we can read using the SharpZipLib library.
The really interesting thing is that reading the archive (even after adding the cost of decompressing it) is far faster than reading directly from the file system. Most file systems do badly on large amount of small files, and at any rate, it is very hard to optimize the access pattern to a lot of small files.
However, when we are talking about something like reading a single large file? That is really easy to optimize and significantly reduces the cost on the input I/O.
Just this step has reduced the cost of importing by a significant factor, we are talking about twice as much as before, and with a lot less disk activity.
Watch your 6, or is it your I/O?
One of the interesting things about the freedb dataset is that it is distributed as a 3.1 million separate files, most of them in the 1 – 2 KB range.
Loading that to RavenDB took a while, so I set out to fix that. Care to guess what is the absolutely the first thing that I did?
When you pit RavenDB & SQL Server against one another…
Here is how it works. I hate benchmarks, because they are very easily manipulated. Whenever I am testing performance stuff, I am posting numbers, but they are usually in reference to themselves (showing improvements).
That said…
Mark Rodseth .Net Technical Architect at Fortune Cookie in London, UK and he did a really interesting comparison between RavenDB & SQL Server. I feel good about posting this because Mark is a totally foreign agent (hm…. well, maybe not that
) but he has no association with RavenDB or Hibernating Rhinos.
Also, this post really made my day.
Update: Mark posted more details on his test case.
Mark setup a load test for two identical applications, one using RavenDB, the other one using SQL Server. The results:
SQL Load Test
Transactions: 111,014 (Transaction = Single Get Request)
Failures: 110,286 (Any 500 or timeout)
And for RavenDB ?
RavenDB Load Test
Transactions: 145,554 (Transaction = Single Get Request)
Failures: 0 (Any 500 or timeout)
And now that is pretty cool.
Stupid smart code: Solution
The reason that I said that this is very stupid code?
public static void WriteDataToRequest(HttpWebRequest req, string data) { var byteCount = Encoding.UTF8.GetByteCount(data); req.ContentLength = byteCount; using (var dataStream = req.GetRequestStream()) { if(byteCount <= 0x1000) // small size, just let the system allocate it { var bytes = Encoding.UTF8.GetBytes(data); dataStream.Write(bytes, 0, bytes.Length); dataStream.Flush(); return; } var buffer = new byte[0x1000]; var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1); var charBuffer = new char[maxCharsThatCanFitInBuffer]; int start = 0; var encoder = Encoding.UTF8.GetEncoder(); while (start < data.Length) { var charCount = Math.Min(charBuffer.Length, data.Length - start); data.CopyTo(start, charBuffer, 0, charCount); var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false); dataStream.Write(buffer, 0, bytes); start += charCount; } dataStream.Flush(); } }
Because all of this lovely code can be replaced with a simple:
public static void WriteDataToRequest(HttpWebRequest req, string data) { req.ContentLength = Encoding.UTF8.GetByteCount(data); using (var dataStream = req.GetRequestStream()) using(var writer = new StreamWriter(dataStream, Encoding.UTF8)) { writer.Write(data); writer.Flush(); } }
And that is so much better.
Stupid smart code
We had the following code:
public static void WriteDataToRequest(HttpWebRequest req, string data) { var byteArray = Encoding.UTF8.GetBytes(data); req.ContentLength = byteArray.Length; using (var dataStream = req.GetRequestStream()) { dataStream.Write(byteArray, 0, byteArray.Length); dataStream.Flush(); } }
And that is a problem, because it allocates the memory twice, once for the string, once for the buffer. I changed that to this:
public static void WriteDataToRequest(HttpWebRequest req, string data) { var byteCount = Encoding.UTF8.GetByteCount(data); req.ContentLength = byteCount; using (var dataStream = req.GetRequestStream()) { if(byteCount <= 0x1000) // small size, just let the system allocate it { var bytes = Encoding.UTF8.GetBytes(data); dataStream.Write(bytes, 0, bytes.Length); dataStream.Flush(); return; } var buffer = new byte[0x1000]; var maxCharsThatCanFitInBuffer = buffer.Length / Encoding.UTF8.GetMaxByteCount(1); var charBuffer = new char[maxCharsThatCanFitInBuffer]; int start = 0; var encoder = Encoding.UTF8.GetEncoder(); while (start < data.Length) { var charCount = Math.Min(charBuffer.Length, data.Length - start); data.CopyTo(start, charBuffer, 0, charCount); var bytes = encoder.GetBytes(charBuffer, 0, charCount, buffer, 0, false); dataStream.Write(buffer, 0, bytes); start += charCount; } dataStream.Flush(); } }
And I was quite proud of myself.
Then I realized that I was stupid. Why?
You can’t cache DateTime.Now
One of the things that were itching me was the fact that it seems that not all the queries in RaccoonBlog were hitting the cache. Oh, it is more than fast enough, but I couldn’t really figure out what is going on there. Then again, it was never important enough for me to dig in.
I was busy doing the profiling stuff for RavenDB and I used RaccoonBlog as my testing ground, when I realized what the problem was:
Everything was working just fine, the problem was here:
Do you get the light bulb moment that I had? I was using Now in a query, and since Now by definition changes, we keep generating new queries, which can’t be cached, etc.
I changed all of the queries that contained Now to:
Which means that it would only use a different value every minute. Once I fixed that, I still saw that there was a caching problem, which led me to discover that there was an error in how we calculated etags for dynamic indexes after they have been promoted. Even a very basic profiling tool helped us fix two separate bugs (in Raccoon Blog and in RavenDB).
RavenDB Aggressive Caching Mode
RavenDB has the notion of HTTP caching out of the box, what this means is that by default, without you having to take any action, RavenDB will cache as much as possible for you. It can get away with doing this because it is utilizing the notion of the 304 Not Modified response. The second time that we load an entity or execute a query, we can simply ask the server whatever the data has been modified since the last time we saw it, and if it wasn’t, we can just skip executing the code and get the data directly from the cache.
This saves a lot in terms of bandwidth and processing power, it also means that we don’t have to worry about RavenDB’s caching returning any stale results, because we checked for that. It does mean, however, that we have to send a request to the server. There are situation where we want to squeeze even better performance from the system, and we can move to RavenDB’s aggressive caching mode.
Aggressive caching means that RavenDB won’t even ask the server whatever anything has changed, it will simply return the reply directly from the local cache if it is there. This means that you might get stale data, but it also means that you’ll get it fast.
You can activate this mode using:
using (session.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5))) { session.Load<User>("users/1"); }
Now, if there is a value in the cache for users/1 that is at most 5 minutes old, we can directly use that.
It also works on queries too:
using (session.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
session.Query<User>().ToList();
}
This is an explicit step beyond the normal caching, and it does mean that you might get out of date information, but if you really want to reduce the number of remote calls, it is a really nice feature.
Performance numbers in the pub
Originally posted at 3/31/2011
I am currently sitting with 3 guys in the pub, and the discussion naturally turned to performance. I asked all three the following question: “How many CLR objects can you create in one second?”
I got the following replies:
- 2,000 objects per second
- 50,000 objects per second
- 100,000 objects per second
Then I sat down to check:
class Program { static void Main(string[] args) { var sp = Stopwatch.StartNew(); int i = 0; while(sp.ElapsedMilliseconds < 1000) { new MyClass(); i++; } sp.Stop(); Console.WriteLine("Created {0} in {1}", i, sp.Elapsed); } } public class MyClass { public string A; public int B; public DateTime C; }
The result?
Created 7,715,305 in 00:00:01
You are only as fast as your slowest bottleneck
Chris points out something very important:
“A much better solution would have been to simply put the database on a compressed directory, which would slow down some IO ..."
I don't agree.
Compression needs CPU. We got a lot of more IO by switching on compression (it's just less to write and read). Previous our CPU was about 40%, now averaging at 70%. Compression rate saves us about 30% per file. After switching on compression our IO bound application was about 20% faster.
We are currently planning switching on compression on all our production servers over Christmas, because using cpu-cores for compression is even cheaper than adding hard disks and raid for performance.
In general, most operations today are mostly IO bound, with the CPU mostly sitting there twiddling the same byte until that byte threatens to sue for harassment. It make sense to trade off IO for CPU time, because our systems are being starved for IO.
In fact, you can just turn on compression at the File System level in most OSes, and it is likely to result in a significant saving for the application performance, assuming that the data does not already fits in memory.
How to become a speaker?
I get asked that quite frequently. More to the point, how to become an international speaker?
I was recently at a gathering where no less than three different people asked me this question, so I thought that it might be a good post.
Note: this post isn’t meant for someone who isn’t already speaking. And if you are speaking but are bad at it, this isn’t for you. The underlying assumption here is that you can speak and are reasonably good at it.
Note II: For this post, speaking is used to refer to presenting some technical content in front of an audience.
Why would you want to be a speaker anyway?
I heard that it is actually possible to make a living as a speaker. I haven’t found it to be the case, but then again, while I speak frequently, I don’t speak that frequently.
There are several reasons to want to be a speaker:
- reputation (and in the end, good reputation means you get to raise your rates, get more work, etc).
- contacts (speaking put you in front of dozens or hundreds of people, and afterward you get to talk with the people who are most interested in what you talked about)
- advertising for your product (all those “lap around Visual Studio 2010” are actually an hour long ad that you paid to see :-) ).
I’ll focus on the first two, reputation & contacts gives you a much wider pool of potential work that you can choose from, increase the money you can make, etc.
So how do I do that, damn it?
Honestly, I have no idea. The first time that I presented at a technical conference, it was due to a mixup in communication. Apparently when in the US “it would have been delightful” means “we regret to inform”, but in Israel we read that as “great, let us do it”, and put the guy on the spot, so he had to scramble and do something.
Okay, I lied, I do have some idea about how to do this.
Again, I am assuming you are a reasonably good speaker (for myself, I know that my accent is a big problem when speaking English), but there are a lot of reasonably good speakers out there.
So, what is the answer? Make yourself different.
Pick a topic that is near & dear to your heart (or to your purse, which also works) and prepare a few talks on it. Write about it in a blog, comment on other people blogs about the topic. Your goal should be that when people think about topic X, your name would be on that list. Forums like Stack Overflow can help, writing articles (whatever it is for pay or in places like CodeProject). Join a mailing list and be active there (and helpful). Don’t focus on regionally associated forums / mailing list, though. The goal is international acknowledgement.
This will take at least a year, probably, for people to start recognizing your name (it took over 2 years for me). If it is possible, produce a set of tools that relate to your topic. Publish them for free, and write it off as an investment in your future.
For myself, NHibernate Query Analyzer would a huge boost in terms of getting recognized. And Rhino Mocks was probably what clinched the deal. I honestly have no idea how much time & effort I put into Rhino Mocks, but Ohloh estimate that project at $ 12,502,089(!). While I disagree about that number, I did put a lot of effort into it, but it paid itself off several times over.
If you don’t have a blog, get one. Don’t get one at a community site, either. Community sites like blogs.microsoft.co.il are good to get your stuff read, but they have a big weakness in terms of branding yourself. You don’t want to get lost in a crowd, you want people to notice who you are. And most people are going to read your posts in a feed reader, and they are going to notice that the community feed is interesting, not that you are interesting.
Post regularly. I try to have a daily post, but that would probably not be possible for you, try to post at least once a week, and try to time it so it is always on the same date & time. Monday’s midnight usually works.
Okay, I did all of that, what now?
Another note, this is something that you may want to do in parallel to the other efforts.
Unless you become very well known, you won’t be approached, you’ll have to submit session suggestions. Keep an eye on the conferences that interest you, and wait until they have a call for sessions. Submit your stuff. Don’t get offended if they reject you.
If you live in a place that host international conferences (which usually rule Israel out), a good bet is to try to get accepted as a speaker there. You would be considerably cheaper than bringing someone from out of town/country. And that also play a role. Usually, if you managed to get into a conference once, they’ll be much more likely to have you again. They have your speaker eval, and unless you truly sucked (like going on stage and starting to speak in Hebrew at Denmark), and that gives them more confidence in bringing you a second time.
And that is about it for now.
Paxos enlightment
Paxos is an algorithm used to reach consensus among a group of machines, which is resilient to failures. For a long time, I had a really hard time understand Paxos. Or, to be rather more exact, I didn’t have an issue with Paxos per-se, I understood the protocol. What I had a trouble with is its application.
My problem always was that I couldn’t figure out what you do with it. That had to do with a basic problem on my part, I failed to understand how to go from a shared consensus on a numeric value to something that is actually useful. After a while, I had enough of feeling stupid, and I started reading all the material that I could on that, including going through the source codes of available Paxos implementations (such as libpaxos). The main problem was that I had a huge misconception in my head.
I kept thinking that Paxos is a consensus algorithm to arrive at a value in a distributed system. It isn’t, and because I kept thinking that it is, I had a really hard time understand what it does and how to apply it.
Leslie Lamport [pdf], the original author of the algorithm, describe the goal of Paxos as follows:
Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value.
This is from the Paxos Made Simple paper, I don’t know about you, but I have a hard time going from proposed value to something useful. What finally made everything click was the Paxos Made Code, which describe the implementation of libpaxos. While reading the paper, I had a light bulb moment.
Paxos is an algorithm used to ensure consist ordering semantics over a set of values in a cluster of machines. Why is this important? Because if you have consistent ordering over a set of values, and the values are events (or commands, or states), you can be sure that all machines in the cluster have either the same state or a previous version of the state.
Let us see how this can be useful. We have the following scenario:
- Jane has no money in her back account.
- Joe sends 100$ to Jane
- When Jane is notified that it got the 100$, she send a 50$ check to the IRS
- The IRS cash the check and spend it on something stupid.
Without a consistent ordering, each machine in the cluster may view the events in any order, which means that the following three timelines are allowed:
As you can imagine, Jane isn’t very happy about that overdraft fee she was just charged with. It is important to note that Paxos sole use here is to ensure that all machines will have a consistent view of events across all machines. That view might not be the same as the order those events showed up. It is possible to give that guarantee as well, on top of Paxos, but that isn’t the topic of this post.
Now that we all (hopefully) understand what Paxos is and what it is used for, let us talk about the algorithm itself.
Paxos has three actor types (Usually, the different actors are all part of the same system, either all together or two of the roles together), we will assume that we have 3 of each in our cluster and that we are interested in events ordered by sequential integer event ids with no gaps:
When you want to make a change in the system, you go to the proposer and tell it that you want to add event SendCheck { To= “IRS”, Amount = 50 } to the system.
The proposer then check what is the latest event id that it knows about, increment it, and then ask all the acceptors in the cluster to reserve that event id for its use. (Please note that I intentionally skip the details of how this is done, I am trying to get a high level description here, you can read the actual algorithm description for all the details).
There are several options here:
- Another proposer is currently trying to reserve that event id.
- Another proposer successfully claimed this event id.
- Another proposer tried to claim this id and then crashed midway.
- Etc… :-)
What Paxos ensures is that in the end, even in the presence of failures of network and machines, the proposer will be able to write the SendCheck { To= “IRS”, Amount = 50 } to an event id in such a way that no other machine will see another value in that location and that eventually all machines in the cluster will see that value in that location.
It is important to understand that even with Paxos, the following timelines are possible:
That is, due to some error, we may not be fully up to date on some machines (as seen on machine #2) or that we have missing events (as seen on machine #3).
What Paxos provides is that there wouldn’t be a scenario in which we have missing events and are not explicitly aware of that. At that point, we can defer processing of cashing the check until we know what event we are missing, explicitly decide to ignore the discontinuity in the time line or something else that fit the business needs.
In order to understand how this all works, I had to write my own implementation of Paxos in C#. There doesn’t seem to be anything like that available to the public, and I (at least) find the code I wrote much easier to understand than libpaxos’ C or Erlang implementations. You can find the implementation here: http://github.com/ayende/Paxos.Demo
Optimizing “expensive” calls
Take a look at the following code:
It is obvious where we need to optimize, right?
Except… each call here takes about 0.8 millisecond. Yes, we could probably optimize this further, but the question is, would it be worth it?
Given a sub millisecond performance, and given that trying to implement a different serialization format would be expensive operation, I think that there just isn’t enough justification to do so.
Sometimes you really need a profiler handy
Performance optimizations, managed code and leaky abstractions
I run into this post from Jeff Atwood, talking about the performance difference between managed and unmanaged code:

There were a lot of optimizations for this along the way, but the C++ version has soundly beaten the C# version. As expected, right?
Well, yes, but with extenuating circumstances.
So am I ashamed by my crushing defeat? Hardly. The managed code achieved a very good result for hardly any effort. To defeat the managed version, Raymond had to:
- Write his own file/io stuff
- Write his own string class
- Write his own allocator
- Write his own international mapping
Of course he used available lower level libraries to do this, but that's still a lot of work. Can you call what's left an STL program? I don't think so, I think he kept the std::vector class which ultimately was never a problem and he kept the find function. Pretty much everything else is gone.
So, yup, you can definitely beat the CLR. I think Raymond can make his program go even faster.
I find this interesting, because it isn’t really specific for C++, in my recent performance sprint for the profiler, I had to:
- Write my own paging system
- Write my own string parsing routines
- Write my own allocator
For the most part, performance optimizations fall into four categories:
- Inefficient algorithms – O(N) notation, etc.
- Inefficient execution – not applying caching, doing too much work upfront, doing unneeded work.
- I/O Bound – the execution waits for a file, database, socket, etc.
- CPU Bound – it just takes a lot of calculations to get the result.
I can think of very few problems that are really CPU Bounded, they tend to be very specific and small. And those are just about the only ones that’ll gain any real benefit from a faster code. Of course, in pure math scenarios, which is pretty much where most of the CPU Bound code reside, there isn’t much of a difference between the language that you choose (assuming it is not interpreted, at least, and that you can run directly on the CPU using native instructions). But as I said, those are pretty rare.
In nearly all cases, you’ll find that the #1 cause for perf issues is IO. Good IO strategies (buffering, pre-loading, lazy loading, etc) are usually applicable for specific scenarios, but they are the ones that will make a world of difference between poorly performing code and highly performing code. Caching can also make a huge difference, as well as differing work to when it is actually needed.
I intentionally kept the “optimize the algorithm” for last, because while it can have drastic performance difference, it is also the easiest to do, since there is so much information about it, assuming that you didn’t accidently got yourself into an O(N^2) or worse.
Why all the performance posts? The shocking truth!
I was quite amazed by the number of conspiracy theories that were brought up by this post. Some of them in the comments, some of them in private communications.
The reason, the real & only one, that I had so many posts lately about performance is quite simple. I did a lot of that recently, and one aspect of perf testing that I didn’t talk about is that most perf test run takes a long time, that means that I had a lot of free time. Free time for me usually translate into posting time :-)
Patterns for reducing memory usage
Memory problems happen when you application use more memory that you would like. It isn’t necessarily paging or causing OutOfMemory, but it is using enough memory to generate complaints. The most common cases for memory issues are:
- Memory leaks
- Garbage spewers
- In memory nuts
- Framework bugs
Let me take each of them in turn.
Memory leaks in a managed language are almost always related to dangling references, such as in a cache with no expiration or events where you never unsubscribe. Those are usually nasty to figure out, because tracking down what is holding the memory can be unpleasant. But, by the same token, it is also fairly straightforward to do so.
Garbage spewers are pieces of code that allocate a lot of memory that will have to be freed soon afterward. A common case of that is:
public string Concat(string[] items) { string result = ""; foreach(var item in items) results += item; return result; }
This is going to allocate a lot of memory, which will have to be freed soon after. This will get cleaned up eventually, but it will put a lot of pressure on the GC first, will cause the application to consume more memory and in general won’t play nice with others. While the code above is the simplest way to explain this, it is fairly common in ways that are harder to detect, a common case would be to load a DTO from the database, convert that to an entity and convert that to a view model. Along the way, you are going to consume a lot of memory for doing pretty much the same thing.
Now the caveat here is that most objects are actually small, so you don’t really notice that, but if you are working with large objects, or a lot of them, this is something that is going to hit you.
In memory nuts refer to a common problem, you simply put your entire dataset in memory, and commonly refer to it by direct model traversal. When your dataset becomes too big, however… well, that is the point where the pain is really going to hit you. Usually, fixing this is a costly process, because your code assumes that the entire thing is in memory. Even if you can easily save it to persistent storage, fixing all the places where the code assumes that everything is just a pointer reference away is a big problem.
Framework bugs are my least favorite, it is when you run into cases where the framework just won’t release memory. Most often, this is because you are doing something wrong, but occasionally you will hit the real framework bug, and tracking that down is a pure PITA.
In all cases, you need to set up some goals, what is acceptable memory usage, in what scenarios, over what time frame, etc. Then build test scenarios that are repeatable and try each of your improvements out. Do not try to implement too much upfront, that way lies the road to madness.

