The difference between benchmarks & performance tests
Also known as: Please check the subtitle of this blog.
This post is in response to this one. Kelly took offence at this post about Voron performance. In particular, it appears that the major issues are:
This benchmark doesn’t actually provide much useful information. It is too short and compares fully featured DBMSs to storage engines. I always stress that people should never make decisions based on benchmarks like this.
These results paint the fully featured DBMSs in a negative light, and the comparison isn’t fair. They are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.
This isn’t a benchmark. This is just an interim step in developing Voron. It is a way for us to see where we stand and where we need to go. A benchmark includes full details about what you did (machine specs, running environment, full source code, etc.). This is just us putting stress on our machine and comparing where we are at. And yes, we could have done it in isolation, but that wouldn’t really give us any major advantage. We need to see how we compare to other databases.
And yes, we compare apples to oranges here when we compare a low level storage engine like Voron to SQL Server. I am well aware of that. But that isn’t the point. It is for the same reason that we are currently doing a lot of micro benchmarks rather than the 48 hour runs we have in the pipeline.
I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.
With regard to FoundationDB, I am sure they are as happy about it as I am about them making silly claims about RavenDB transaction support. And the source code is available if you really want it; in fact, FoundationDB got in there because we had an explicit customer request, and because they contributed the code for that.
Next, let us move to something else equally important. This is my personal blog. I publish here things that I do on a daily basis. And if I am currently in a performance boost stage, you’re going to be getting a lot of details on that. Those are the results of performance runs; they aren’t benchmarks. They don’t go anywhere beyond this blog. When we put the results on ravendb.net, or somewhere like that, then it will be a proper benchmark.
And while I fully agree that making decisions based on micro benchmarks is a silly way to go about it, the reality is that many people do just that. So one of the things that I’m focusing on is exactly those scenarios. It helps that we currently see a lot of places to improve in those micro benchmarks. We already have a plan (and code) to see how we do on a 24-48 hour benchmark, which would also allow us to see all sorts of interesting things (mixed reads & writes, what happens when you go beyond physical memory size, longevity issues, etc.).
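To give a concrete picture of what such a micro benchmark looks like, here is a minimal sketch of the write-heavy loop being measured. The engine_begin_tx / engine_put / engine_commit_tx calls are hypothetical stubs standing in for whichever engine is under test, not Voron’s actual API or any other real one:

    #include <chrono>
    #include <cstdio>
    #include <string>

    // Hypothetical stubs for the engine under test; wire these up to a
    // real storage engine to get meaningful numbers.
    static void engine_begin_tx() {}
    static void engine_put(const std::string&, const std::string&) {}
    static void engine_commit_tx() {}

    int main() {
        const int total_items = 1000000; // sequential writes
        const int items_per_tx = 100;    // writes batched per transaction

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < total_items; i += items_per_tx) {
            engine_begin_tx();
            for (int j = i; j < i + items_per_tx; ++j)
                engine_put("key-" + std::to_string(j), std::string(128, 'x'));
            engine_commit_tx();
        }
        double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();

        std::printf("%.0f writes/sec\n", total_items / elapsed);
    }

The batch size and value size dominate the outcome, which is exactly why a short run like this is a performance test rather than a benchmark.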
Comments
I read her post yesterday and felt like she really took what you were doing out of context. You are experimenting and publishing the results of those experiments as you go along. There is nothing wrong with that, and quite frankly I wish more people would approach problems this way.
It's exceedingly obvious this is what you are doing here, so I don't understand why this rubbed her the wrong way.
I think there was some confusion when you said "Mostly, because we're having users that use this micro benchmark as a way to base decisions". Some people read it to mean people would use the specific measurements from your blog, but what I think you meant is that people do their own experiments with that scenario (lots of writes) to make decisions.
Though the lots-of-writes scenario hasn't matched production scenarios for me, I do at least pass through it when I check what RavenDB can do with real-world data.
I am by no means a FoundationDB expert, but I wrote the FDB test portion of the Voron tests. I avoided the highly optimized multi-read per transaction APIs so that it would match the other tests in spirit, as it seemed to me that some of the other compared DBs could do something similar. My understanding is that they are comparing multi-consumer single-document reads, not single-consumer multi-document reads.
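Roughly, the difference looks like this (hypothetical engine_* placeholders for illustration, not FoundationDB's actual API). A read per transaction pays the full transaction overhead on every key, while batching reads into one transaction amortizes it:

    #include <string>
    #include <vector>

    // Hypothetical placeholder calls, not FoundationDB's real API.
    static void engine_begin_tx() {}
    static std::string engine_get(const std::string&) { return {}; }
    static void engine_commit_tx() {}

    int main() {
        std::vector<std::string> keys = {"a", "b", "c"};

        // Single-document reads: one transaction per key, the pattern
        // the FDB test portion uses to stay comparable in spirit.
        for (const auto& k : keys) {
            engine_begin_tx();
            engine_get(k);
            engine_commit_tx();
        }

        // Multi-document reads: one transaction batches all keys, the
        // highly optimized path the test deliberately avoided.
        engine_begin_tx();
        for (const auto& k : keys)
            engine_get(k);
        engine_commit_tx();
    }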
A lot of words in the linked blog post, but no substance. Which comparisons are invalid, and why?
As a customer, if I am seeking the best solution for a particular usage scenario (say, very fast, transactional, concurrent writes), then I am going to start assessing solutions based on those criteria. If I need something more (replication, sharding, indexing), those features all feed into the criteria, and I compare them in isolation amongst all products that satisfy the criteria. I should be able to do this in spite of one of those products having a whole lot more features, or a different target audience, or someone on the internet saying they are "apples and oranges".
I am not interested in your website claiming you can do x million per second on some finely crafted spectacular setup nobody else shares. I am interested in same machine, same scenario tests, and I expect them to be open and reproducible in my similarly spec'd environment.
Hm, just had a look at your LMDB code. It had a couple of outright mistakes (use of mdb_open, etc.), and a missed optimization.
https://github.com/ayende/raven.voron/pull/9
In the meantime, I agree with Kelly: comparing a KV storage engine to a full-blown DBMS is certainly not apples to apples.
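For anyone following along, mdb_open is the deprecated alias of mdb_dbi_open. A minimal sketch of the usual LMDB open/write sequence looks like this (error handling omitted; the actual fixes are in the pull request above):

    #include <lmdb.h>

    int main() {
        MDB_env* env;
        MDB_txn* txn;
        MDB_dbi dbi;

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 1UL << 30);    // 1 GiB map
        mdb_env_open(env, "./testdb", 0, 0664); // directory must exist

        mdb_txn_begin(env, nullptr, 0, &txn);
        mdb_dbi_open(txn, nullptr, 0, &dbi);    // not the deprecated mdb_open

        MDB_val key{3, (void*)"foo"};
        MDB_val val{3, (void*)"bar"};
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);

        mdb_env_close(env);
    }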
Hi Howard, Thanks for making those fixes. For reference, we haven't actually even run this code yet :-) The perf testing we did with LMDB so far has been done only through the .NET wrapper.
Ah, I didn't see a test harness for the .NET version in your repo. I might check again later; it would be interesting to see how that compares to the C++ run.
Howard, It is located here: https://github.com/ayende/raven.voron/blob/voron/Performance.Comparison/Performance.Comparison/LMDB/LmdbTest.cs
It has much the same issues as the C++ code did. I tried fixing them on my copy but the results running on Linux with Mono are quite slow, much slower than the C++ and much slower than the results you've posted. Seems to me that Mono isn't really suitable for high performance work on Linux.
Howard, Can you send me the fixes as well? Mono can be a PITA to work with, sometimes, yes.
OK, I've updated github and sent you a new pull request. But see my comments, there's a bit of other fixing needed.