The performance regression in the optimization, Part I
PageTable is a pretty critical piece of Voron. It is the component responsible for remapping modified pages in transactions, and it is the reason why we support MVCC and can avoid taking locks for the most part. It has been an incredibly stable part of our software, rarely changing and pretty much the same as it was when it was initially written in 2013. It has been the subject of multiple performance reviews in that time, but what counted as an acceptable level of performance from our code in 2013 is no longer acceptable today. PageTable came up recently in one of our performance reviews as a problematic component. It was responsible for too much CPU and far too many allocations.
Here is a drastically simplified implementation, which retains the salient points:
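Something along these lines; this is a sketch rather than the actual Voron source, so PagePosition and the exact signatures are illustrative, but the allocation pattern matches what is described below:

```csharp
// A reconstruction for illustration, not the actual Voron code: the real class
// has more moving parts, but the allocation behavior is the same.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public struct PagePosition
{
    public long ScratchPos;
    public long TransactionId;
}

public class PageTable
{
    private readonly ConcurrentDictionary<long, PagePosition[]> _values =
        new ConcurrentDictionary<long, PagePosition[]>();

    public void SetItems(long txId, Dictionary<long, PagePosition> modifiedPages)
    {
        // Called under the write lock. AddOrUpdate takes two delegates, and
        // because both capture 'pos', the compiler allocates fresh closure
        // objects for every modified page in the transaction.
        foreach (var item in modifiedPages)
        {
            var pos = item.Value;
            _values.AddOrUpdate(item.Key,
                key => new[] { pos },
                (key, existing) =>
                {
                    // copy-on-append: yet more allocations, per page, per tx
                    var appended = new PagePosition[existing.Length + 1];
                    Array.Copy(existing, appended, existing.Length);
                    appended[existing.Length] = pos;
                    return appended;
                });
        }
    }

    public bool TryGetValue(long txId, long page, out PagePosition value)
    {
        // Called concurrently with everything else: scan the array of structs
        // backwards for the newest version this transaction is allowed to see.
        PagePosition[] positions;
        if (_values.TryGetValue(page, out positions))
        {
            for (int i = positions.Length - 1; i >= 0; i--)
            {
                if (positions[i].TransactionId <= txId)
                {
                    value = positions[i];
                    return true;
                }
            }
        }
        value = default(PagePosition);
        return false;
    }

    public void RemoveBefore(long txId)
    {
        // Called under the write lock: drop versions no active transaction can
        // see anymore, and remove pages whose version list becomes empty.
        foreach (var kvp in _values)
        {
            var remaining = Array.FindAll(kvp.Value, p => p.TransactionId >= txId);
            if (remaining.Length == 0)
            {
                PagePosition[] ignored;
                _values.TryRemove(kvp.Key, out ignored);
            }
            else if (remaining.Length != kvp.Value.Length)
            {
                _values[kvp.Key] = remaining;
            }
        }
    }
}
```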
Here is the sample workout for this class, which just simulates ten thousand transactions (a sketch follows below). This little scenario takes 15.3 seconds and allocates a total of 1.1GB of memory! That is a lot of allocations, and a tremendous amount of time must be going to the GC. The most problematic issue here is the SetItems method, which will allocate two different delegates for each modified page in the transaction. Then we have the total abandon with which we allocate additional memory in there. As you can imagine, we weren't very happy about this, so we set out to fix this issue.
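The benchmark itself isn't reproduced here; a rough equivalent of the workload described, driving the sketch above, would be the following. The batch of 100 modified pages per transaction and the cleanup cadence are assumptions, not the original numbers:

```csharp
// A made-up approximation of the benchmark: ten thousand simulated
// transactions, each registering a batch of modified pages, with reads mixed
// in and periodic cleanup of versions no transaction can still see.
using System;
using System.Collections.Generic;

public static class Workout
{
    public static void Main()
    {
        var table = new PageTable();
        var rnd = new Random(1337); // fixed seed, so runs are comparable

        for (long txId = 1; txId <= 10000; txId++)
        {
            var modified = new Dictionary<long, PagePosition>();
            for (int i = 0; i < 100; i++) // assumed batch size
            {
                long page = rnd.Next(0, 1000000);
                modified[page] = new PagePosition { ScratchPos = page, TransactionId = txId };
            }
            table.SetItems(txId, modified);

            // simulate readers working against an older snapshot
            for (int i = 0; i < 100; i++)
            {
                PagePosition pos;
                table.TryGetValue(txId - 1, rnd.Next(0, 1000000), out pos);
            }

            // periodically release versions no transaction can still see
            if (txId % 500 == 0)
                table.RemoveBefore(txId - 100);
        }
    }
}
```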
We can take advantage of the fact that SetItems and RemoveBefore are only called under a lock, while TryGetValue is called concurrently with everything else.
So I wrote the following code:
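The version below is reconstructed in simplified form from the description here and the discussion in the comments; BufferHolder and TransactionPage are named in that discussion, while their exact fields and the growth policy are my assumptions:

```csharp
// A reconstruction, not the shipped code. The value is now a mutable class
// that is only written under the write lock; readers may observe a slightly
// stale Begin/End, which is acceptable here.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class TransactionPage
{
    public long TransactionId;
    public long ScratchPos;
}

public class BufferHolder
{
    public TransactionPage[] Buffer = new TransactionPage[32];
    public int Begin; // index of the first live entry
    public int End;   // one past the last live entry
}

public class PageTableNew
{
    private readonly ConcurrentDictionary<long, BufferHolder> _values =
        new ConcurrentDictionary<long, BufferHolder>();

    public void SetItems(long txId, Dictionary<long, TransactionPage> modifiedPages)
    {
        // Called under the write lock: reuse the existing BufferHolder instead
        // of allocating a new immutable list (and two delegates) per page.
        foreach (var item in modifiedPages)
        {
            BufferHolder holder;
            if (_values.TryGetValue(item.Key, out holder) == false)
            {
                holder = new BufferHolder();
                _values.TryAdd(item.Key, holder);
            }
            if (holder.End == holder.Buffer.Length)
            {
                // grow only on overflow; a real version would also compact
                var bigger = new TransactionPage[holder.Buffer.Length * 2];
                Array.Copy(holder.Buffer, bigger, holder.Buffer.Length);
                holder.Buffer = bigger;
            }
            holder.Buffer[holder.End++] = item.Value;
        }
    }

    public bool TryGetValue(long txId, long page, out TransactionPage value)
    {
        BufferHolder holder;
        if (_values.TryGetValue(page, out holder))
        {
            // a stale Begin/End just means we may miss a version this reader
            // could not have made use of anyway
            for (int i = holder.End - 1; i >= holder.Begin; i--)
            {
                var candidate = holder.Buffer[i];
                if (candidate != null && candidate.TransactionId <= txId)
                {
                    value = candidate;
                    return true;
                }
            }
        }
        value = null;
        return false;
    }

    public void RemoveBefore(long txId)
    {
        // Called under the write lock: advance Begin past expired versions and
        // clear the slots; note that the BufferHolder itself is never removed
        // from the dictionary.
        foreach (var holder in _values.Values)
        {
            while (holder.Begin < holder.End &&
                   holder.Buffer[holder.Begin].TransactionId < txId)
            {
                holder.Buffer[holder.Begin] = null;
                holder.Begin++;
            }
        }
    }
}
```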
This relies on allowing stale reads from concurrent readers, which we don't care about here, since a reader wouldn't be able to make use of that data anyway. This was able to reduce the allocations to just 320 MB, but the runtime actually went up to 32 seconds.
That is quite annoying, as you can imagine, and much cursing ensued as a result. I then pulled out my trusty profiler and asked it kindly to figure out what piece of code needs to be hit with a rolling pin and given a stern talking-to about what is expected from code after it has been laboriously and carefully optimized. It is expected to sit nicely and be fast, or by Git I'll revert you.
What the hell?! Here are the original implementation costs, and you can clearly see how much time we are spending on garbage collection.
And here is the optimized version, which is actually slower, and even used more memory?!
There are a bunch of interesting things going on here. We can see that we are indeed spending a little less time in GC, and that both the RemoveBefore and SetItems methods are much cheaper, but the cost of TryGetValue is so much higher. In fact, if we compare the two, we have:
So TryGetValue is 3.4 times more expensive, and somehow, the cost of calling the concurrent dictionary's TryGetValue has risen by 88%.
But the implementation is pretty much the same, and there isn’t anything else that looks like it can cause that much of a performance gap.
I'll leave this riddle here for now, because it drove me crazy for two whole days, and I'll give you the details on what is going on in the next post.
Comments
What could explain the slowdown in the PageTableNew.TryGetValue(...) method: before, you were doing a for(...) on an array of struct, which would be optimized by the JIT (ABCREM, pointers to the current item hoisted in a register, maybe unrolling?). But in the new version, both BufferHolder and TransactionPage are classes, so all access to value.End, value.Begin, value.Buffer[...], as well as property access to the fields of TransactionPage, is probably slower because the JIT cannot optimize the code as well (memory indirection, no ABCREM, need to re-read all the properties on each loop iteration, cannot hoist them into registers as much, probably no unrolling possible, ...). Maybe caching all the "constant" fields (Begin, End, Buffer) into local variables could help reduce the overhead?
As for why ConcurrentDictionary<..>.TryGetValue(...) is slower, I have no idea :) Perhaps there is now more contention in the dictionary because timings have changed and multiple threads are now bashing the dictionary at the same time?

Christophe, note that this is a single threaded piece of code :-)
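To make Christophe's caching suggestion concrete, here is what it could look like, applied to the TryGetValue reconstruction above (purely illustrative):

```csharp
// Buffer, Begin and End are read into locals once, instead of being re-read
// through the BufferHolder reference on every loop iteration.
public bool TryGetValue(long txId, long page, out TransactionPage value)
{
    BufferHolder holder;
    if (_values.TryGetValue(page, out holder))
    {
        var buffer = holder.Buffer; // locals are eligible to live in registers
        int begin = holder.Begin;
        int end = holder.End;
        for (int i = end - 1; i >= begin; i--)
        {
            var candidate = buffer[i];
            if (candidate != null && candidate.TransactionId <= txId)
            {
                value = candidate;
                return true;
            }
        }
    }
    value = null;
    return false;
}
```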
Ok, here is an idea: in the old behavior, you add but also remove entries from the mapping dictionary, so TryGetValue does not have a long list of elements to check when looking for the key in a bucket (cf. the TryGetInternal(..) implementation). But in the new behavior, once a page is added to the dictionary, it will stay there, so the dictionary is a lot more crowded, and TryGetValue has more items to scan, so the lookup takes more time.

So:
- in the old behavior, the size of the mapping dictionary was proportional to the number of pages that are "in use" at the current time.
- in the new behavior, the size of the mapping dictionary is proportional to the number of pages that were used at least once in the lifetime of the process.

Maybe you should remove empty BufferHolder instances, and put them in a BufferPool to be reused later?
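A sketch of that suggestion, against the PageTableNew reconstruction above; the pool field and the removal of empty holders are hypothetical additions, and SetItems would then dequeue from _bufferPool instead of always allocating:

```csharp
// Once a holder is emptied, take it out of the dictionary and park it in a
// pool for reuse, so the dictionary stays proportional to the live pages.
private readonly ConcurrentQueue<BufferHolder> _bufferPool =
    new ConcurrentQueue<BufferHolder>();

public void RemoveBefore(long txId)
{
    foreach (var kvp in _values)
    {
        var holder = kvp.Value;
        while (holder.Begin < holder.End &&
               holder.Buffer[holder.Begin].TransactionId < txId)
        {
            holder.Buffer[holder.Begin] = null;
            holder.Begin++;
        }
        if (holder.Begin == holder.End) // nothing live left for this page
        {
            BufferHolder removed;
            if (_values.TryRemove(kvp.Key, out removed))
            {
                removed.Begin = removed.End = 0;
                _bufferPool.Enqueue(removed); // recycle for a later SetItems
            }
        }
    }
}
```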
@Oren I guess the "BufferHolder" represents the "ImmutableAppendOnlyList" which, as far as I know, you have been using since the early Voron days. So in that respect it is somewhat surprising seeing it featured as part of the "improved" page table design solution in this post.
I agree with Christophe that the performance issues seem to be primarily related to not removing the empty lists from the dictionary.
In fact, in the "improved" design, no entry is ever removed from the dictionary, meaning it could grow in the given benchmark example to hold around 1 million entries. This has many adverse effects and in a sense it is surprising that the other two calls became faster:
- RemoveBefore now has to scan a vastly bigger dictionary with many empty lists, and it has two bugs, causing it to iterate through the full array for each "expired" entry. It is a good thing we can rely on 0 initialization in C#, or the consequences would be more severe.
- TryGetValue will now also get dictionary hits for any page ever used before, retrieving a likely empty list and scanning through either the entire list (if still empty), or stopping at the first element if it has been reused. If you had used a ulong for the begin and end array indices, it would have crashed with an index out of range exception.

Like Christophe is saying, having your BufferHolder as a class can explain the increased memory usage, since ConcurrentDictionary uses arrays internally, and while structs have 0 overhead in an array, classes will incur ~24 bytes per entry. Furthermore, this can probably lead to cache misses, given the extra indirection required to access the buffer. Lastly, I wouldn't be surprised if changing the way you access the buffer caused the JIT to suddenly start emitting bounds checking instructions in the loop.
I'm not familiar enough with the precise nature of the GC to nail down exactly what's going on, but my guess would be in the fact that the original ConcurrentDictionary was passing around a reference to an array of struct, that is, a pointer to a block of memory. On the other hand, the improved ConcurrentDictionary is passing around a reference to a class, which then holds a reference to an array, and each element of the array is itself a class. Memory access has now moved from 1 level of indirection (index into the array and there's the data) to 3 levels of indirection (get the BufferHolder class, index the array, and dereference the TransactionPage class). This alone would increase cache usage and significantly decrease performance. I would venture further that the GC might possibly be exacerbated by the pointers being updated constantly, which may indicate that the GC should review open pointers and see if there's anything to collect. Even if memory references are effectively the same and there is nothing to clear from memory, the GC doesn't necessarily know this and is reiterating frequently.