The importance of a data formatPart VI – When two orders of magnitude aren't enough

time to read 2 min | 396 words

So our current version can index 18,000 documents in 22 ms vs. the JSON approach which takes 1.8 seconds. That is an amazing improvement, but I got to tell you, I got a good feeling about it.

Without running a profiler, I'm guessing that a major part of the 22 ms cost is going to be comparisons during a binary search process. We store the property names as a sorted array, and when you need to get a property by name, we need to do a binary search in the properties name to match it.

Our test case get 3 properties from a total of 18,801 documents. We do this 30 times, to reduce the chance of jitter getting in the way. Here are the profiler results:

image

As you can see, the cost is indeed where I expected it to be. Basically, the ReadNumber and GetComparerFor are the expressions of the number of binary searches.

But here we can take advantage on the fact that we know that most of those documents are going to have the same structure. In other words, once we have found the position of  a particular property, there is a strong chance that the next time we'll look a property with the same name, we'll find it in the same position. Caching that and using this as the initial position for doing the binary search means that most of the time, we hit the right property right on the nose, and in the worst case, we have a single extra comparison to make.

This single change gives us a really nice impact:

image

That is about 40% improvement in speed, from this single change.

We have been following similar patterns throughout the development of this feature. Write the feature, make sure it passes all its tests, then profiler it. Usually that indicate a problem that we need to solve. Either by optimizing code, or by changing the way we do things in the code. Sometimes radically.  But that is a topic for the next blog post.

More posts in "The importance of a data format" series:

  1. (25 Jan 2016) Part VII–Final benchmarks
  2. (15 Jan 2016) Part VI – When two orders of magnitude aren't enough
  3. (13 Jan 2016) Part V – The end result
  4. (12 Jan 2016) Part IV – Benchmarking the solution
  5. (11 Jan 2016) Part III – The solution
  6. (08 Jan 2016) Part II–The environment matters
  7. (07 Jan 2016) Part I – Current state problems