Ayende @ Rahien

It's a girl

Multi threading challenge: can you spot a bug?

One of the problems of multi threading is that there are a lot of intricacies that you have to deal with. Recently I run into issues that dictated that I would have to write an AsyncBulkInserterAppender  for log4net.

One of the reasons that I want to do that is to avoid locking the application if the database is down or the logging table is locked.I just had a recent issue where this casued a problem.

When I implemented that, I started to worry about what would happen if the database is locked for a long duration. There is a chance that this async logging would block for a long time, and then another async batch would start, also blocking, etc. Eventually, it will fill the thread pool and halt the entire system.

This is the approach I ended up with, it should ensure that there is at most, only two threads that are writing to the database at a time. Since I wrote it, I already found at least two bugs in there. It looks fine now, but I can't think of any good way to really test that.

I am afraid that multi threading can't really be tested successfully. This is something where code review is required.

Here is the code:

protected override void SendBuffer(LoggingEvent[] events)
{
	// we accept some additional complexity here
	// in favor of better concurrency. We don't want to
	// block all threads in the pool (train) if we have an issue with
	// the database. Therefor, we perform thread sync to ensure
	// that only a single thread will write to the DB at any given point
	ThreadPool.QueueUserWorkItem(delegate
	{
		lock (syncLock)
		{
			eventsList.AddLast(events);
			if (anotherThreadAlreadyHandlesLogging)
				return;
anotherThreadAlreadyHandlesLogging = true; } while (true) { LoggingEvent[] current; lock (syncLock) { if(eventsList.Count == 0) { anotherThreadAlreadyHandlesLogging = false; return; } current = eventsList.First.Value; eventsList.RemoveFirst(); } PerformWriteToDatabase(current); } }); }

Comments

Avish
12/26/2007 11:15 AM by
Avish

I'm no so good at multithreaded code, but it seems to me like any number of threads could go through the first lock block, before any of them reaches the "I got it, back off" line. Then all these threads would assume they can handle the writing, and will begin to do so.

I think you need to set anotherThreadAlreadyHandlesLogging to true inside the first lock block. So basically, you need:

acquire lock

  • add events to shared events buffer

  • check if some other thread is already writing

  • if so, return; if not, set yourself as the writing thread

release lock

if I'm the writing thread:

  • write everything in the events buffer

  • unset yourself as the writing thread

Where am I wrong?

Ayende Rahien
12/26/2007 11:27 AM by
Ayende Rahien

No, you are not wrong.

I missed that on the third bug.

Fixed, thanks

Pieter Breed
12/26/2007 12:42 PM by
Pieter Breed

Something else which might also cause headaches is errors. If PerformWriteToDataBase() fails with an exception, you won't have any guaruntee that the bool flag is set correctly. So, for you to be sure about it, your need some kind of try {} catch(exception) {} in PerformWriteToDataBase(), which is not good.

I would rather use a lock, instead of a flag, which could be unset even in the presence of exceptions. so make the code like this:

lock (eventsQueue) { eventsQueue.addEventsToQueue(newEvents) }

if (!Monitor.TryEnter(singleHandlerLock)) return;

try

{

while (true)

{

  lock (eventsQueue)

  {

     if (eventsQueue.Count == 0) return;

     WriteToDB(eventsQueue.Pop());

  }

}

}

finally

{ Monitor.Exit(singleHandlerLock); }

Ayende Rahien
12/26/2007 01:04 PM by
Ayende Rahien

Pieter,

PerformWriteToDataBase never throws, it has internal error handling there.

Also, take into account that we are trying to reduce contention times.

Your approach would have everything wait for the write to the database to end

Pieter Breed
12/26/2007 01:08 PM by
Pieter Breed

You are correct of course. no use fixing something by breaking it, eh? ;)

using a temp variable and moving the Write operation out of the lock will fix that problem, but that is neither here nor there. More important to me was wrapping up the anotherThreadAlreadyHandlesLogging flag into another lock on which you can call TryEnter().

Anyway, thanks and keep it up.

John Rayner
12/26/2007 01:38 PM by
John Rayner

The variable anotherThreadAlreadyHandlesLogging should be marked volatile so that it still works on multi-proc machines, or replaced with a ManualResetEvent. I'd replace it with a lock, as per Pieter's suggestion. The code which is adding to the event queue can then test for the existence of a writer thread by trying to obtain this lock with a zero length timeout.

Also, I wouldn't move the write into the eventsQueue down into the threadpool. I think that this should be done deterministically to try and maintain the correct ordering of events.

You might want to look into using the Parallel FX library from MS (http://www.microsoft.com/downloads/details.aspx?FamilyID=e848dc1d-5be3-4941-8705-024bc7f180ba&displaylang=en) since you actually have a pretty standard producer-consumer model here. Your producers are the calls which are logging events and your consumers are the threads in the thread pool which are writing these events to the database.

Ayende Rahien
12/26/2007 01:59 PM by
Ayende Rahien

John,

No, it should not.

We only modify it on a lock, that should be safe.

Bill Barry
12/26/2007 02:24 PM by
Bill Barry

I am not a fan of the while(true) but here it seems to be alright.

Some things to ensure:

eventsList is never null

PerformWriteToDatabase cannot throw an exception

PerformWriteToDatabase cannot take exceedingly long (possible out of memory exception on eventsList.AddLast)

If I were to attempt to test this, that would be what I test. If I had malicious intent, I would attempt to add to eventsList faster than the execution time for PerformWriteToDatabase. If this is something to worry about; I would limit the execution time of that function to the point where ~95% get through and do something else with the 5% overflow (something that would execute at in at least 1/10th of the time of your average PerformWriteToDatabase call).

The pattern looks sound though.

Ayende Rahien
12/26/2007 02:52 PM by
Ayende Rahien

eventsList is declared as

readonly LinkedList eventsList = new LinkedList();

PerformWriteToDatabase never throws.

PerformWriteToDatabase can take as much as it likes, that it why we have the thread sync to ensure that we pass to the currently executing thread. This way, we have only a single thread executing.

If AddLast throws because of OOM, we have lost the current logs batch, but we haven't changed any other state, so we are still safe.

Shadi Mari
12/26/2007 05:12 PM by
Shadi Mari

Why not to create a dedicated thread for such purpose instead of relying on the .NET thread pool and have a shared syncronized queue between the caller and the background thread where the caller push to queue and thread read from queue and sleep if empty or use IOCP queue instead.

Mike Rettig
12/26/2007 05:28 PM by
Mike Rettig

I'd recommend using Retlang for this type of problem.

Just create a process bus that uses the .NET thread pool for execution.

IProcessFactory factory = ... create factory

IProcessBus bus = factory.CreatePooledAndStart()

Then setup a batch subscription. This will automatically batch the logging events in cases when the writing thread is blocked due to IO or DB locking.

bus.SubscribeToBatch<LoggingEvent[]>("my.log.topic", OnLogBatch);

Write the handler that will process the logEvents:

void OnLogBatch(List<LoggingEvent[]> logEvents)

{

// no locking needed since Retlang will only call this from

// a single thread from the pool but not necessarily the

// same thread every time.

foreach(LoggingEvent[] event in logEvents)

PerformWriteToDatabase(event);

}

There is no locking. Retlang takes care of sequential delivery of the events, so there are no races. Retlang will also guarantee that only a single thread from the pool is used for processing the batch at any given time. There is no threading code to test with the assumption that Retlang works as designed. ;)

Mike

Retlang Developer

Ayende Rahien
12/26/2007 08:45 PM by
Ayende Rahien

Shadi,

The reason for that is fairly simple, I don't want to waste a thread for that. The most common scenario is that it go off, does it thing and completes. The complexity comes from the need to handle the uncommon case without issues

Ayende Rahien
12/26/2007 08:46 PM by
Ayende Rahien

Mike,

You are correct, probably. I didn't think of retlang at all, which is a shame.

Arne Claassen
12/27/2007 03:02 PM by
Arne Claassen

I know this is a bit of a tangent, but I'd just recommend against using your database as a live logging target in the first place. Logging isn't a realtime concern, compared to business needs. I generally write to a file and have a separate daemon harvest the logs and then process them for datawarehousing and debugging databases.

Most recently, I wanted closer to real-time availability and used the UdpAppender and had the receiving daemon insert into the DB. Either way, I never want my application be at the mercy of having to log to database.

Erich Eichinger
12/30/2007 11:52 AM by
Erich Eichinger

Shadi, Ayende,

I'd go with Shadi's suggestion. Imho using a dedicated thread and a properly synchronized queue drastically reduces potential threading problems. Given the number of threads and the number of logging events usually occurring in an application, I don't think that such a thread is much a waste compared to number of potential issues you might run into otherwise. The thread doesn't consume cpu time anyway while sleeping.

There might be another log4net related issue with asynchronous logging: You must ensure, that all event properties are resolved ("fixed" in log4net terminology I think) before the event is queued to the dbwriter. Otherwise your events might contain wrong meta data (e.g. threadid)

Tip: By implementing a ForwardingAppender you can make any arbitrary appender asyncronous.

-Erich

P.S.: I'm talking about log4net 1.2.9, not familiar with 1.2.10 yet, but I don't think this has changed much.

Comments have been closed on this topic.