Thou shall not do threading unless you know what you are doing
I had a really bad couple of days. I am pissed, annoyed and angry, for totally not technical reasons.
And then I run into this issue, and I just want to throw something really hard at someone, repeatedly.
The issue started from this bug report:
1: NetTopologySuite.Geometries.TopologyException was unhandled2: HResult=-21462328323: Message= ... trimmed ...4: Source=NetTopologySuite5: StackTrace:6: at NetTopologySuite.Operation.Overlay.Snap.SnapIfNeededOverlayOp.GetResultGeometry(SpatialFunction opCode)7: at NetTopologySuite.Operation.Union.CascadedPolygonUnion.UnionActual(IGeometry g0, IGeometry g1)8: at NetTopologySuite.Operation.Union.CascadedPolygonUnion.Worker.Execute()9: at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)10: at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)11: at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)12: at System.Threading.ThreadHelper.ThreadStart()
At first, I didn’t really realized why it was my problem. I mean, it is NTS problem, isn’t it?
Except that this particular issue actually crashed ravendb (don’t worry, it is unstable builds only). The reason it crashed RavenDB? An unhandled thread exception.
What I can’t figure out is what on earth is going on. So I took a look at the code, have a look:
I checked, and this isn’t code that has been ported from the Java code. You can see the commented code there? That is from the Java version.
And let us look at what the execute method does?
So let me see if I understand. We have a list of stuff to do, so we spin out threads, reclusively, then we wait on them. I think that the point was to optimize things in some manner by parallelizing the work between the two halves.
Do you know what the real killer is? If we assume that we have a geometry with just 20 items on it, this will generate twenty two threads.
Leaving aside the issue of not handling errors properly (and killing the entire process because of this), the sheer cost of creating the threads is going to kill this program.
Libraries should be made to be thread safe (I already had to fix a thread safety bug there), but they should not be creating their own threads unless it is quite clear that they need to do so.
I believe that this is a case of a local optimization for a specific scenario, it also carry all of the issues associated with local optimizations. It solves one problem and opens up seven other ones.
Please report issues to http://code.google.com/p/nettopologysuite/issues/entry. We appreciate you (trying to) using NTS for RavenDB, but I'm afraid no-one has used NTS the way you are doing now, so, as you dig, you'll probably run into more threading issues.
How cool would it be if Resharper could highlight your code as above along with the WTF stamp.
lol@Andrew Rimmer. Now that would be a killer extension :D
Always a favourite - an innocent looking sort method that contains a nasty threading surprise!
@oren why not simply report the issue to the people that develop the project? I'm not sure if the persons that follows your blog are capable to find the problem like you did... but I'm sure that, by what I see, they know make fun of a quite serious work that certainly they would not have done better
Looks like they need to rethink their logic for how they limit the depth of recursion.
Sems to be the wrong tool to use in your case.
Considering they have no idea what threading is all about and from comments above - its better to skip it all together.
Jonny U are an idiot - my commet makes as much sense as yours...
I once worked at a place that prematurely optimized through threading in static classes (experiencing odd behavior and race conditions)... then to find that the "configuration" methods referenced an in-memory table, doing SQL-style queries, as opposed to a key/value store (hash table) which is indexed... the 168 calls that one page used to various setting values caused about 3 seconds in delay time total. Lots of these kinds of things in the wild.