Challenge: The code review bug that gives me nightmares – the issue
In my previous post, I discussed a bug that was brought up in code review, one that sent me into near panic mode. Here is the issue:
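The code under review isn't reproduced in this extract. As a minimal sketch of the pattern being discussed, assuming a Microsoft.Extensions.Caching.Memory cache and the shared ArrayPool&lt;byte&gt; (every name except ComputeHash and ComputeHashAndPutInCache is a hypothetical reconstruction), it might look like this:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Security.Cryptography;
using Microsoft.Extensions.Caching.Memory;

public class HashCache
{
    private readonly MemoryCache _cache = new(new MemoryCacheOptions());

    public byte[] ComputeHash(string file)
    {
        // Thread A gets the cached buffer here...
        if (_cache.TryGetValue(file, out byte[]? buffer))
            return buffer!; // ...but eviction may return it to the pool before A is done reading
        return ComputeHashAndPutInCache(file);
    }

    private byte[] ComputeHashAndPutInCache(string file)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(32); // rented, not allocated
        using (FileStream stream = File.OpenRead(file))
            SHA256.HashData(stream, buffer); // .NET 7+ overload that fills an existing buffer

        var options = new MemoryCacheEntryOptions();
        options.RegisterPostEvictionCallback(
            // On eviction, the buffer goes straight back to the pool - even if a
            // reader obtained it from the cache a moment ago and is still using it.
            static (key, value, reason, state) => ArrayPool<byte>.Shared.Return((byte[])value!));
        _cache.Set(file, buffer, options);
        return buffer;
    }
}
```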
To understand this bug, you have to consider multiple threads running concurrently. Look at the ComputeHashAndPutInCache() method, where we register an eviction callback for the item in the cache. When the item is evicted, we return the buffer to the buffer pool.
We want to avoid allocating memory, so that is certainly desirable, no? However, consider what happens when one thread is in ComputeHash(), getting a value from the cache. Before it has a chance to look at the value, the cache decides to evict it, at which point the eviction callback runs.
The buffer is returned to the buffer pool, where it may be rented again by something else, while the caller of ComputeHash() is still using that very buffer. This is, basically, a somewhat convoluted use-after-free issue.
And I find this a super scary bug because of its effects:
- Randomly, and rarely, some buffer will contain the wrong data, leading to wrong results (which are hard to track down).
- Trying to find such a bug after the fact, however, is nearly impossible.
- Most of the debugging techniques (repeating the operation for a particular value) will make it go away (the cache will keep the value and not evict it).
In short, this is a recipe for a really nasty debugging session and an impossible-to-resolve bug. All from code that looks very much like an innocent bystander.
Now, I can obviously fix it by not using the array pool, but that may cause me to allocate more memory than I should. How would you approach fixing this issue?
More posts in "Challenge" series:
- (01 Jul 2024) Efficient snapshotable state
- (13 Oct 2023) Fastest node selection metastable error state–answer
- (12 Oct 2023) Fastest node selection metastable error state
- (19 Sep 2023) Spot the bug
- (04 Jan 2023) what does this code print?
- (14 Dec 2022) What does this code print?
- (01 Jul 2022) Find the stack smash bug… – answer
- (30 Jun 2022) Find the stack smash bug…
- (03 Jun 2022) Spot the data corruption
- (06 May 2022) Spot the optimization–solution
- (05 May 2022) Spot the optimization
- (06 Apr 2022) Why is this code broken?
- (16 Dec 2021) Find the slow down–answer
- (15 Dec 2021) Find the slow down
- (03 Nov 2021) The code review bug that gives me nightmares–The fix
- (02 Nov 2021) The code review bug that gives me nightmares–the issue
- (01 Nov 2021) The code review bug that gives me nightmares
- (16 Jun 2021) Detecting livelihood in a distributed cluster
- (21 Apr 2020) Generate matching shard id–answer
- (20 Apr 2020) Generate matching shard id
- (02 Jan 2020) Spot the bug in the stream
- (28 Sep 2018) The loop that leaks–Answer
- (27 Sep 2018) The loop that leaks
- (03 Apr 2018) The invisible concurrency bug–Answer
- (02 Apr 2018) The invisible concurrency bug
- (31 Jan 2018) Find the bug in the fix–answer
- (30 Jan 2018) Find the bug in the fix
- (19 Jan 2017) What does this code do?
- (26 Jul 2016) The race condition in the TCP stack, answer
- (25 Jul 2016) The race condition in the TCP stack
- (28 Apr 2015) What is the meaning of this change?
- (26 Sep 2013) Spot the bug
- (27 May 2013) The problem of locking down tasks…
- (17 Oct 2011) Minimum number of round trips
- (23 Aug 2011) Recent Comments with Future Posts
- (02 Aug 2011) Modifying execution approaches
- (29 Apr 2011) Stop the leaks
- (23 Dec 2010) This code should never hit production
- (17 Dec 2010) Your own ThreadLocal
- (03 Dec 2010) Querying relative information with RavenDB
- (29 Jun 2010) Find the bug
- (23 Jun 2010) Dynamically dynamic
- (28 Apr 2010) What killed the application?
- (19 Mar 2010) What does this code do?
- (04 Mar 2010) Robust enumeration over external code
- (16 Feb 2010) Premature optimization, and all of that…
- (12 Feb 2010) Efficient querying
- (10 Feb 2010) Find the resource leak
- (21 Oct 2009) Can you spot the bug?
- (18 Oct 2009) Why is this wrong?
- (17 Oct 2009) Write the check in comment
- (15 Sep 2009) NH Prof Exporting Reports
- (02 Sep 2009) The lazy loaded inheritance many to one association OR/M conundrum
- (01 Sep 2009) Why isn’t select broken?
- (06 Aug 2009) Find the bug fixes
- (26 May 2009) Find the bug
- (14 May 2009) multi threaded test failure
- (11 May 2009) The regex that doesn’t match
- (24 Mar 2009) probability based selection
- (13 Mar 2009) C# Rewriting
- (18 Feb 2009) write a self extracting program
- (04 Sep 2008) Don't stop with the first DSL abstraction
- (02 Aug 2008) What is the problem?
- (28 Jul 2008) What does this code do?
- (26 Jul 2008) Find the bug fix
- (05 Jul 2008) Find the deadlock
- (03 Jul 2008) Find the bug
- (02 Jul 2008) What is wrong with this code
- (05 Jun 2008) why did the tests fail?
- (27 May 2008) Striving for better syntax
- (13 Apr 2008) calling generics without the generic type
- (12 Apr 2008) The directory tree
- (24 Mar 2008) Find the version
- (21 Jan 2008) Strongly typing weakly typed code
- (28 Jun 2007) Windsor Null Object Dependency Facility
Comments
Change the public interface to accept a span, and the caller can then manage memory as it wishes. However, that still wouldn't solve the issue, because you can still have a race between the moment the cache evicts the entry (and the array is returned to the pool) and the moment the array is copied to the span (see the sketch below).
In this case the option might be to not solve it at all and simply allocate an array yourself. ArrayPools are primarily meant to solve the problem of short-term buffer allocation, for instance when processing streams of information. Here, you're holding onto the allocated buffers, possibly exhausting the pool (depending on the underlying implementation).
You can of course still use the ArrayPool while you're computing the hash.
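For illustration, here is a hypothetical sketch of that span-based shape (reusing the _cache field from the sketch earlier in the post); note that the copy still races with eviction, exactly as described above:

```csharp
// Hypothetical span-based variant of the suggestion above.
// 'destination' must be at least 32 bytes long.
public bool TryComputeHash(string file, Span<byte> destination)
{
    if (_cache.TryGetValue(file, out byte[]? buffer))
    {
        // Race: the entry can be evicted (and the array returned to the pool,
        // re-rented, and overwritten) between TryGetValue and this copy.
        buffer!.AsSpan(0, 32).CopyTo(destination);
        return true;
    }
    return false; // caller computes the hash itself into 'destination'
}
```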
One option would be to wrap it in a class with a destructor and cache that class instead. But you have to be careful not to expose that array directly.
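A sketch of that idea, with hypothetical names; the cache stores the wrapper, and the buffer only returns to the pool once the wrapper itself is collected:

```csharp
using System;
using System.Buffers;

// Hypothetical wrapper: cache this object instead of the raw byte[].
public sealed class PooledHash
{
    private readonly byte[] _buffer;

    public PooledHash(byte[] buffer) => _buffer = buffer;

    // The finalizer runs only after no one references the wrapper anymore,
    // so a reader holding the wrapper keeps the buffer out of the pool.
    ~PooledHash() => ArrayPool<byte>.Shared.Return(_buffer);

    // Expose a copy rather than the array itself, per the caveat above.
    public void CopyTo(Span<byte> destination) => _buffer.AsSpan(0, 32).CopyTo(destination);
}
```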
You can wrap it in a class/struct and add a usage counter, but then you have to decrement the counter once you're done with the hash value. The memory is released not on eviction, but when the counter reaches 0, which can happen from either the eviction or the manual release. There's a risk of a memory leak if you forget to release it, but that's much better than the "ghost" bug you described.
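A sketch of the usage-counter idea (hypothetical names): the count starts at 1 for the cache's own reference, and both the eviction callback and each finished reader call Release:

```csharp
using System;
using System.Buffers;
using System.Threading;

public sealed class RefCountedHash
{
    private readonly byte[] _buffer;
    private int _count = 1; // the cache's own reference

    public RefCountedHash(byte[] buffer) => _buffer = buffer;

    public void AddRef() => Interlocked.Increment(ref _count);

    public void Release()
    {
        // The last reference out (eviction or reader) returns the buffer.
        if (Interlocked.Decrement(ref _count) == 0)
            ArrayPool<byte>.Shared.Return(_buffer);
    }

    public ReadOnlySpan<byte> Hash => _buffer.AsSpan(0, 32);
}
```

Note the hazard raised further down the thread: a reader has to call AddRef between fetching the wrapper from the cache and the final Release, and those two steps can race.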
An array of 32 bytes has its own overhead (object header + length). Depending on the evictions and the load, a first approach could be to turn it into an object with 4 longs, which is sufficient to carry the payload but doesn't have the length overhead. Then, once they are no longer used, just discard them.
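One hypothetical reading of that shape, as a 32-byte value type that carries a SHA-256-sized payload with no array header, length field, or pool bookkeeping:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical 4-longs value type: 32 bytes of payload, no length overhead,
// and nothing to return to a pool - just let the GC discard it.
public readonly struct Hash256
{
    private readonly long _a, _b, _c, _d;

    public Hash256(ReadOnlySpan<byte> bytes) // expects at least 32 bytes
    {
        ReadOnlySpan<long> longs = MemoryMarshal.Cast<byte, long>(bytes);
        _a = longs[0]; _b = longs[1]; _c = longs[2]; _d = longs[3];
    }
}
```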
Add a counter to each hash value, similar to the slots in ConcurrentQueue, that represents both the state and the epoch. Then, with volatile access, assert before and after the access that the value that was read was OK. The read must be a copy, to preserve the order and the value. This could be done with the array itself (alignment permitting), but also with the helper class described above; with the latter, the cache would only be storing values (one possible reading is sketched below).
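One hypothetical reading of that, as a seqlock-style validated read (made-up names): readers copy the value first, then re-check the version to detect a concurrent recycle:

```csharp
using System;
using System.Threading;

public sealed class VersionedHash
{
    private int _sequence;                    // even: stable, odd: being recycled
    private readonly byte[] _buffer = new byte[32];

    // Writer (not shown): increment to odd, recycle the buffer, increment to even.
    public bool TryRead(Span<byte> destination)
    {
        int before = Volatile.Read(ref _sequence);
        if ((before & 1) != 0)
            return false;                     // a writer is recycling the slot
        _buffer.AsSpan().CopyTo(destination); // copy first...
        Thread.MemoryBarrier();               // ...make sure the copy happens before the re-check...
        return Volatile.Read(ref _sequence) == before; // ...then validate the epoch
    }
}
```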
Unless this is for building a db with hundreds of files, or an upload service, I probably wouldn't cache it at all, though.
@dejan Who bumps up the counter, and when? How would you protect it from the ABA problem? It would be interesting if you could elaborate.
I see several potential fixes for this issue depending on the required optimization:
1) Make a copy of the byte[] array before putting it into cache or returning the cached version to the caller.
Pros: Minor changes to existing code
Cons: Extra allocations on every call
2) Change the signature of ComputeHash to accept a callback that takes a ReadOnlySpan (and returns void). The caller cannot modify the array and cannot keep a reference to it. This also requires pausing evictions while inside this method (see the sketch after this list).
Pros: Prevents extra allocations
Cons: More difficult to use. Pausing evictions might not be easy to implement.
3) Do not use the ArrayPool; just store byte[] values in the cache. ComputeHash would return the same byte[] array, wrapped in a ReadOnlyMemory&lt;byte&gt; struct.
Pros: Only one allocation per file.
Cons: You still have one allocation per file, instead of a single array pool.
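A hypothetical sketch of option 2's shape, as a method added to the HashCache sketch earlier in the post (the eviction-pausing part, which is the hard bit, is only marked as a comment):

```csharp
// Hypothetical callback-based signature: the hash is only reachable inside
// the callback, so no reference to the pooled array can escape.
// ReadOnlySpanAction<T, TArg> lives in System.Buffers.
public void ComputeHash(string file, ReadOnlySpanAction<byte, string> callback)
{
    if (!_cache.TryGetValue(file, out byte[]? buffer))
        buffer = ComputeHashAndPutInCache(file);
    // Eviction would have to be held off from here...
    callback(buffer!.AsSpan(0, 32), file);
    // ...to here.
}
```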
@Scooletz
For a start, I'm not sure the whole wrapper-class idea is a good one in this context, because of the extra allocations. If we go that way, IMO ComputeHash should be renamed to suggest that we're borrowing something which has to be released.
We can then bump the counter in ComputeHash. We'll also need some ReleaseHash, and both it and the evict method will call some ReturnHashToPoolIfNotUsed, which decreases the counter. But we'll have to put a lock on _cache to prevent getting the value while we're releasing it. So, extra allocations + locking, not that great.
We can avoid the locking and the counters if we know for sure that borrows are short-lived, i.e. that whoever needs these hash values will complete its job with them in, let's say, under 1s. If that can be guaranteed, we can remove values from _cache on eviction, thus preventing new borrowers from getting them, but not return the bytes to the pool immediately. We'll do the actual return a few seconds later. We'll just need a background thread for the cleanup and some way to keep the release timestamps together with the byte arrays.
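A sketch of that delayed-return idea, with made-up names; a background thread or timer sweeps a quarantine queue and only then hands buffers back to the pool:

```csharp
using System;
using System.Buffers;
using System.Collections.Concurrent;

public sealed class DelayedPoolReturn
{
    private readonly ConcurrentQueue<(DateTime EvictedAt, byte[] Buffer)> _quarantine = new();

    // Called from the eviction callback instead of ArrayPool.Return.
    public void OnEvicted(byte[] buffer) =>
        _quarantine.Enqueue((DateTime.UtcNow, buffer));

    // Called periodically from a background thread or timer.
    public void Sweep(TimeSpan grace)
    {
        while (_quarantine.TryPeek(out var entry) &&
               DateTime.UtcNow - entry.EvictedAt >= grace &&
               _quarantine.TryDequeue(out entry))
        {
            ArrayPool<byte>.Shared.Return(entry.Buffer);
        }
    }
}
```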
But I'd ask these questions first:
Sebastiaan,
The actual issue here is that I'm holding on to the arrays for a long time, so they are likely to be in Gen2. Not pooling them would result in me continuously putting stuff into Gen2, which is rarely cleaned up. This is especially true when you are operating at or near the cache limits.
At that point, certain behaviors (an access pattern that is larger than the cache) will mean that you keep filling and dropping entries from the cache. Each entry lives just long enough to be promoted to Gen2, which then makes the overall memory usage far more problematic.
Kaotis,
Yes, that is an option, for sure. However, that leads to a far more subtle issue.
Note that in this case, we are exposed to the JIT deciding that we aren't using the `ra` after the call to `DoSomethingWithBuffer` and cleaning it up. That ties into your "not expose the buffer" point, but that is hard to do. The issue is that even if we make `DoSomethingWithBuffer` an instance method of `ReleaseArray`, the JIT may decide to inline it, resulting in the same behavior. Non trivial issue.
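For illustration, the usual mitigation for that lifetime hazard is GC.KeepAlive; a sketch using the hypothetical names from above:

```csharp
// 'ra' is a wrapper (e.g. the PooledHash/ReleaseArray sketches above) whose
// finalizer returns the buffer to the pool. GetFromCache is a hypothetical lookup.
var ra = GetFromCache(file);
DoSomethingWithBuffer(ra.Buffer);
// Without this, the JIT may consider 'ra' dead during the call above, making it
// eligible for finalization while the buffer is still in use.
GC.KeepAlive(ra);
```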
Dejan,
That leads to the ABA problem, however. Because the act of getting the instance to increment the counter may be racy with decrementing the counter.
Scooletz,
The scenario I actually have is caching buffers that may be in the MB range, and I don't actually follow your approach. 4 longs vs. an array is the exact same scenario. You are now not using a buffer pool, and will have a lot of allocations. That is also something we want to avoid.
Dalibor,
1 & 2 - those are really complex solutions. In both cases, you need a way to avoid evictions while either copying the values or calling the callback. That means you are always paying that cost, which is something we want to avoid.
3 - That would expose us to a situation where the working set is greater than the cache size, which means the cache will just make things a LOT harder for us, instead of better.
I did say locking would be needed, and more specifically, locking on the _cache object. Although, my sixth sense always says "danger" when locking is involved.
```csharp
lock (_cache) { ... _cache.Get(file) ... counter++; }

ReturnHashToPoolIfNotUsed(OurHashClass hash) { lock (_cache) { counter--; ... } }
```
Dejan,
Yep, and now you are paying the locking cost on every access, which isn't nice to have.