Reducing complexity with a shift in thinking
I love B+Trees, but they can be gnarly beasts, with the number of edge cases that you can run into. Today’s story is about a known difficult place, page splitting in the tree. Consider the following B+Tree, showing a three-level tree with 3 elements on each page.
Consider what will happen when we want to insert a new value to the tree, the value: 27. Given the current state of the tree, that should go on the page marked in red:
But there is no place for the new value on this page, so we have to split it. The tree will then look like so, we split the page and now we need to add the new page to the parent, but that one also doesn’t have room for it:
So we are now in a multi-level split process. Let’s see what this looks like when we go up the tree. This is the final state of the tree when we are done doing all the splits:
The reason for all of this is that we need to add 27 to the tree, and we haven’t done that yet. At this stage, we got the tree back in order and we can safely add the new value to the tree, since we made sure we have enough space.
However, note that the exact same process would apply if we were adding 27 or 29. The page that we’ll add them to, however, is different.
This can be quite complex to keep track of, because of the recursive nature of the process. In code, this looks something like this:
I am skipping on some details, but that is the gist of it. So we do the split (recursively if needed) and then after we wired the parent page properly, we find the right location for the new value.
An important aspect here is the cursor. We use that to mark our current location in the tree, so the cursor will always contain all the parent pages that we are currently searching upon. A lot of the work that we are doing in the tree is related to the cursor.
Now, look at the code and consider the behavior of this code when we insert the value 29. It will correctly generate this page:
However.. what happens if we’ll insert 27?
Well, when we split the page, we went up the tree. And then we had another split, and then we went down another branch. So as written, the result would be adding the 27 to the same page as we would the 29. This would look like this:
Look at the red markers. We put entry 27 on the wrong page.
Fixing this issue is actually pretty hard, because we need to keep track of the values as we go up and down the tree. For fun, imagine what happens in this exact scenario, but when you have 6 levels in the tree and you end up in a completely different location in the tree.
I spent a lot of time struggling with this issue, including getting help from some pretty talented people, and the final conclusion we got was “it’s complicated”.
I don’t want complications here, I need it to be as simple as possible, otherwise, we can’t make any sort of sense here. I kept spinning more and more complex systems to resolve this, when I realized that I just looked at the problem in the wrong manner all along.
The issue was that I was trying to add the new value to the tree after I sorted out the structure of the tree, but there was actually nothing that forced me to do that. Given that I already split the page at this stage, I know that I have sufficient space to add the key without doing anything else. I can first add the key to the right page, then write the split page back to the tree. In this case, I don’t need to do any sort of backtracking or state management .
Here is what this looks like:
And with this change, the entire class of problems related to the tree structure just went away.
I’m very happy with this result, even if it is a bit ironic. Like the problem at hand, a lot of the complexity was there because I had to backtrack the implementation decisions and go on a new path to solve this.
Also, I just checked, the portion that controls page splits inside Voron has had roughly 1 change a year for the past 5 years. Given our scope and usage, that means that it has been incredibly stable in the face of everything that we could throw at it.
Comments
Interesting post!
While reading I think I have seen a few typos:
The reason for all of this is that we need to add 27 to the tree, and we haven’t done that yet. (Three should be 28)
Also on the images I think there should be 10 not 0, but I am not 100% sure.
Comment preview