Work stealing in the presence of startup / shutdown costs


I mentioned that we have created our own thread pool implementation in RavenDB to handle our specific needs. A common scenario that ended up quite costly for us was the notion of parallelizing similar work.

For example, suppose I have 15,000 documents to index. That means that we need to go over each document and apply the indexing function, which is an embarrassingly parallel task. One easy way to do that would be something like this:

foreach(var doc in docsToIndex)
	ThreadPool.QueueUserWorkItem(()=> IndexFunc(new[]{doc}));

Of course, that generates 15,000 entries for the thread pool, but that is fine.

Except that there is an issue here: we need to do something with the results of the indexing, namely write them to the index. That means that even though we can parallelize the work, we still incur a non-trivial amount of startup &amp; shutdown costs per work item. Just running the code like this would actually be much slower than running it single threaded.
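To put rough numbers on it: if every invocation of the indexing function pays a fixed startup/shutdown cost (for example, opening and committing a write to the index), then scheduling one document per work item pays that fixed cost 15,000 times. A toy cost model, with entirely made-up numbers, shows the scale of the problem:

```csharp
using System;

class CostModel
{
	// Assumed, illustrative numbers, in microseconds; not measurements.
	const long StartupShutdownUs = 5_000; // fixed cost per indexing call
	const long PerDocumentUs = 100;       // cost of actually indexing one document

	static long TotalCostUs(int docs, int batches) =>
		(long)batches * StartupShutdownUs + (long)docs * PerDocumentUs;

	static void Main()
	{
		// One work item per document vs. one partition per core (8 cores).
		Console.WriteLine(TotalCostUs(15_000, 15_000)); // prints 76500000
		Console.WriteLine(TotalCostUs(15_000, 8));      // prints 1540000
	}
}
```

With these invented numbers, the per-document version spends 50 times more on overhead than on actual indexing, while the partitioned version spends almost all of its time doing real work.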

So, let us try a slightly better method:

foreach(var partition in docsToIndex.Partition(docsToIndex.Length / Environment.ProcessorCount))
	ThreadPool.QueueUserWorkItem(()=> IndexFunc(partition));

If my machine has 8 cores, then this will queue 8 tasks to the thread pool, each indexing just under 2,000 documents. Which is pretty much what we have been doing until now.

Except that this means that we have to incur the startup/shutdown costs a minimum of 8 times.
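As an aside, Partition is not something that exists in the BCL; it is a helper of ours. A minimal sketch of such an extension method (the name and exact shape here are assumptions, not RavenDB's actual code) could look like this:

```csharp
using System;
using System.Collections.Generic;

static class PartitionExtensions
{
	// Hypothetical helper: slices an array into ArraySegment chunks of the
	// given size; the last chunk may be shorter.
	public static IEnumerable<ArraySegment<T>> Partition<T>(this T[] source, int size)
	{
		if (size < 1)
			size = 1; // guard: fewer documents than cores would otherwise give size 0

		for (var offset = 0; offset < source.Length; offset += size)
			yield return new ArraySegment<T>(source, offset,
				Math.Min(size, source.Length - offset));
	}
}
```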

A better way looks like this:

var partitions = new ConcurrentQueue<ArraySegment<JsonDocument>>(
	docsToIndex.Partition(docsToIndex.Length / Environment.ProcessorCount));
for(var i = 0; i < Environment.ProcessorCount; i++) 
{
	ThreadPool.QueueUserWorkItem(()=> {
		ArraySegment<JsonDocument> first;
		if(partitions.TryDequeue(out first) == false)
			return;

		IndexFunc(Pull(first, partitions));
	});
}

IEnumerable<JsonDocument> Pull(ArraySegment<JsonDocument> first, ConcurrentQueue<ArraySegment<JsonDocument>> partitions)
{
	while(true)
	{
		for(var i = 0; i < first.Count; i++)
			yield return first.Array[i + first.Offset];

		if(partitions.TryDequeue(out first) == false)
			break;
	}
}

Now something interesting happens: we are scheduling 8 tasks, as before, but instead of assigning each task a static partition, we say that when a task starts running, it grabs a partition of the data and processes it. When it is done with that partition, it tries to pull a new one, in the same context, so it doesn't incur new startup/shutdown costs.

Even more interesting, it is quite possible (and common) for all the work to be done by the time some of those tasks get a chance to execute. (All the indexing is already finished, but we still have a queued task for it that didn't get a chance to run.) In that case the task exits early and incurs no costs.
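Stripped of the RavenDB-specific types, the pattern above, including the early exit, fits in a small self-contained program. Ints stand in for documents, plain threads for the custom thread pool, and all the names here are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class PullDemo
{
	// Each worker drains partitions from the shared queue until it is empty;
	// a worker that finds the queue already drained exits immediately,
	// paying no per-batch cost at all.
	static IEnumerable<int> Pull(ArraySegment<int> first, ConcurrentQueue<ArraySegment<int>> partitions)
	{
		while (true)
		{
			for (var i = 0; i < first.Count; i++)
				yield return first.Array[i + first.Offset];

			if (partitions.TryDequeue(out first) == false)
				break;
		}
	}

	static void Main()
	{
		var docs = Enumerable.Range(0, 100).ToArray();
		var partitions = new ConcurrentQueue<ArraySegment<int>>(
			Enumerable.Range(0, 10).Select(p => new ArraySegment<int>(docs, p * 10, 10)));

		int total = 0, batchesRun = 0;
		var workers = Enumerable.Range(0, 4).Select(_ => new Thread(() =>
		{
			ArraySegment<int> first;
			if (partitions.TryDequeue(out first) == false)
				return; // everything already taken: exit early, incur no cost

			Interlocked.Increment(ref batchesRun);
			var sum = Pull(first, partitions).Sum(); // one "IndexFunc" call per worker
			Interlocked.Add(ref total, sum);
		})).ToArray();

		foreach (var t in workers) t.Start();
		foreach (var t in workers) t.Join();

		// total is always 4950 (the sum of 0..99); how many of the 4 workers
		// actually ran a batch varies with timing.
		Console.WriteLine($"sum = {total}, batches run = {batchesRun}");
	}
}
```

Every document is indexed exactly once no matter how the workers interleave, and a worker that arrives after the queue is drained returns without doing any batch setup, which is the early-exit case described above.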

The fun thing about this method is what happens under load, when you have multiple indexes running. In that case, we'll be running this for each of the indexes, and it is quite likely that each core will be running a single index. Some indexes are going to be faster than others and complete first, consuming all the documents they were given. That means that the tasks belonging to those indexes will exit early, freeing those cores to run the code for the slower indexes, which haven't completed yet.

This gives us dynamic resource allocation. The more costly indexes get to run on more cores, while we don’t have to pay the startup / shutdown costs for the fast indexes.