The difference between benchmarks & performance tests
Also known as: Please check the subtitle of this blog.
This post is in response to this one. Kelly took offence at this post about Voron performance. In particular, it appears that the major issues are:
This benchmark doesn’t actually provide much useful information. It is too short and compares fully featured DBMSs to storage engines. I always stress that people should never make decisions based on benchmarks like this.
These results paint the fully featured DBMSs in a negative light, and the comparison isn’t fair. They are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.
This isn’t a benchmark. This is just an interim step in developing Voron. It is a way for us to see where we stand and where we need to go. A benchmark includes full details about what you did (machine specs, running environment, full source code, etc.). This is just us putting stress on our machine and comparing where we are at. And yes, we could have done it in isolation, but that wouldn’t really give us any major advantage. We need to see how we compare to other databases.
And yes, we compare apples to oranges here when we compare a low level storage engine like Voron to SQL Server. I am well aware of that. But that isn’t the point. It is for the same reason that we are currently doing a lot of micro benchmarks rather than the 48 hour runs we have in the pipeline.
I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.
With regard to FoundationDB, I am sure they are as happy about it as I am about them making silly claims about RavenDB transaction support. And the source code is available if you really want it; in fact, FoundationDB got in there because we had an explicit customer request, and because they contributed the code for that.
Next, let us move to something else equally important. This is my personal blog. I publish here things that I do on a daily basis. And if I am currently in a performance boost stage, you’re going to be getting a lot of details on that. Those are the results of performance runs; they aren’t benchmarks. They don’t go anywhere beyond this blog. When we put the results on ravendb.net, or somewhere like that, then it will be a proper benchmark.
And while I fully agree that making decisions based on micro benchmarks is a silly way to go about it, the reality is that many people do just that. So one of the things that I’m focusing on is exactly those scenarios. It helps that we currently see a lot of places to improve in those micro benchmarks. We already have a plan (and code) to see how we do on a 24-48 hour benchmark, which would also allow us to see all sorts of interesting things (mixed reads & writes, what happens when you go beyond physical memory size, longevity issues, etc.).
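To give a concrete picture of what such a micro benchmark looks like, here is a minimal sketch of the write-heavy loop being measured. The engine_begin_tx / engine_put / engine_commit_tx calls are hypothetical stubs standing in for whichever engine is under test, not Voron’s actual API or any other real one:

    #include <chrono>
    #include <cstdio>
    #include <string>

    // Hypothetical stubs for the engine under test; wire these up to a
    // real storage engine to get meaningful numbers.
    static void engine_begin_tx() {}
    static void engine_put(const std::string&, const std::string&) {}
    static void engine_commit_tx() {}

    int main() {
        const int total_items = 1000000; // sequential writes
        const int items_per_tx = 100;    // writes batched per transaction

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < total_items; i += items_per_tx) {
            engine_begin_tx();
            for (int j = i; j < i + items_per_tx; ++j)
                engine_put("key-" + std::to_string(j), std::string(128, 'x'));
            engine_commit_tx();
        }
        double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();

        std::printf("%.0f writes/sec\n", total_items / elapsed);
    }

The batch size and value size dominate the outcome, which is exactly why a short run like this is a performance test rather than a benchmark.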
Comments
I read her post yesterday and felt like she really took what you were doing out of context. You are experimenting and publishing the results of those experiments as you go along. There is nothing wrong with that, and quite frankly I wish more people would approach problems this way.
It's exceedingly obvious this is what you are doing here, so I don't understand why this rubbed her the wrong way.
I think there was some confusion when you said "Mostly, because we're having users that use this micro benchmark as a way to base decisions". Some people read it to mean people would use the specific measurements from your blog, but what I think you meant is that people do their own experiments with that scenario (lots of writes) to make decisions.
Though the lots-of-writes scenario hasn't matched production scenarios for me, I do at least pass through it when I check what RavenDB can do with real-world data.
I am by no means a FoundationDB expert, but I wrote the FDB test portion of the Voron tests. I avoided the highly optimized multi-read per transaction APIs so that it would match the other tests in spirit, as it seemed to me that some of the other compared DBs could do something similar. My understanding is that they are comparing multi-consumer single-document reads, not single-consumer multi-document reads.
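Roughly, the difference looks like this (hypothetical engine_* placeholders for illustration, not FoundationDB's actual API). A read per transaction pays the full transaction overhead on every key, while batching reads into one transaction amortizes it:

    #include <string>
    #include <vector>

    // Hypothetical placeholder calls, not FoundationDB's real API.
    static void engine_begin_tx() {}
    static std::string engine_get(const std::string&) { return {}; }
    static void engine_commit_tx() {}

    int main() {
        std::vector<std::string> keys = {"a", "b", "c"};

        // Single-document reads: one transaction per key, the pattern
        // the FDB test portion uses to stay comparable in spirit.
        for (const auto& k : keys) {
            engine_begin_tx();
            engine_get(k);
            engine_commit_tx();
        }

        // Multi-document reads: one transaction batches all keys, the
        // highly optimized path the test deliberately avoided.
        engine_begin_tx();
        for (const auto& k : keys)
            engine_get(k);
        engine_commit_tx();
    }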
A lot of words in the linked blog post, but no substance. Which comparisons are invalid, and why?
As a customer, if I am seeking the best solution for a particular usage scenario (say, very fast, transactional, concurrent writes), then I am going to start assessing solutions based on those criteria. If I need something more (replication, sharding, indexing), those features all feed into the criteria, and I compare them in isolation amongst all products that satisfy the criteria. I should be able to do this in spite of one of those products having a whole lot more features, or a different target audience, or someone on the internet saying they are "apples and oranges".
I am not interested in your website claiming you can do x million per second on some finely crafted spectacular setup nobody else shares. I am interested in same machine, same scenario tests, and I expect them to be open and reproducible in my similarly spec'd environment.
Hm, just had a look at your LMDB code. It had a couple of outright mistakes (use of mdb_open, etc.), and a missed optimization.
https://github.com/ayende/raven.voron/pull/9
In the meantime, I agree with Kelly: comparing a KV storage engine to a full-blown DBMS is certainly not apples to apples.
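For anyone following along, mdb_open is the deprecated alias of mdb_dbi_open. A minimal sketch of the usual LMDB open/write sequence looks like this (error handling omitted; the actual fixes are in the pull request above):

    #include <lmdb.h>

    int main() {
        MDB_env* env;
        MDB_txn* txn;
        MDB_dbi dbi;

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 1UL << 30);    // 1 GiB map
        mdb_env_open(env, "./testdb", 0, 0664); // directory must exist

        mdb_txn_begin(env, nullptr, 0, &txn);
        mdb_dbi_open(txn, nullptr, 0, &dbi);    // not the deprecated mdb_open

        MDB_val key{3, (void*)"foo"};
        MDB_val val{3, (void*)"bar"};
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);

        mdb_env_close(env);
    }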
Hi Howard, Thanks for making those fixes. For reference, we haven't actually even run this code yet :-) The perf testing we did with LMDB so far has been done only through the .NET wrapper.
Ah, I didn't see a test harness for the .NET version in your repo. I might check again later; it would be interesting to see how that compares to the C++ run.
Howard, It is located here: https://github.com/ayende/raven.voron/blob/voron/Performance.Comparison/Performance.Comparison/LMDB/LmdbTest.cs
It has much the same issues as the C++ code did. I tried fixing them on my copy but the results running on Linux with Mono are quite slow, much slower than the C++ and much slower than the results you've posted. Seems to me that Mono isn't really suitable for high performance work on Linux.
Howard, Can you send me the fixes as well? Mono can be a PITA to work with, sometimes, yes.
OK, I've updated github and sent you a new pull request. But see my comments, there's a bit of other fixing needed.