We recently tackled performance improvements for vector search in RavenDB. The core challenge was identifying performance bottlenecks. Details of specific changes are covered in a separate post.
TODO: link to the blog post on that
In this post, I don’t want to talk about the actual changes we made, but rather about the approach we took to figure out where the cost was.
Take a look at the following flame graph, showing where our code is spending the most time.
As you can see, almost the entire time is spent computing cosine similarity. That would be the best target for optimization, right?
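For context, cosine similarity is conceptually just a dot product normalized by the two vectors’ magnitudes. A minimal sketch (not our actual implementation, which is far more heavily optimized) might look like this:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # dot(a, b) / (|a| * |b|) - the hot loop the profiler keeps pointing at
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))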
I spent a bunch of time writing increasingly complicated ways to optimize the cosine similarity function. And it worked: I was able to reduce the cost by about 1.5%!
That is something that we would generally celebrate, but it was far from where we wanted to go. The problem was elsewhere, but we couldn’t see it in the profiler output because the cost was spread around too much.
Our first task was to restructure the code so we could actually see where the costs were. For instance, loading the vectors from disk was embedded within the algorithm. By extracting and isolating this process, we could accurately profile and measure its performance impact.
This restructuring also eliminated the "death by a thousand cuts" issue, making hotspots evident in the profiling results. With clear targets identified, we could focus our optimization efforts effectively.
That major refactoring had two primary goals. The first was to actually extract the costs into highly visible locations; the second had to do with how you can address them. Here is a small example that scans a social graph for friends, assuming the data is in a file.
from collections import deque
from typing import List, Set

def read_user_friends(file, user_id: int) -> List[int]:
    """Read friends for a single user ID starting at indexed offset."""
    pass # redacted

def social_graph(user_id: int, max_depth: int) -> Set[int]:
    if max_depth < 1:
        return set()

    all_friends = set()
    visited = {user_id}
    work_list = deque([(user_id, max_depth)])
    with open("relations.dat", "rb") as file:
        while work_list:
            curr_id, depth = work_list.popleft()
            if depth <= 0:
                continue
            for friend_id in read_user_friends(file, curr_id):
                if friend_id not in visited:
                    all_friends.add(friend_id)
                    visited.add(friend_id)
                    work_list.append((friend_id, depth - 1))
    return all_friends
If you consider this code, you can likely see that there is an expensive part of it: reading from the file. But the way the code is structured, there isn’t really much that you can do about it. Let’s refactor the code a bit to expose the actual costs.
def social_graph(user_id: int, max_depth: int) -> Set[int]:
    if max_depth < 1:
        return set()

    all_friends = set()
    visited = {user_id}
    with open("relations.dat", "rb") as file:
        work_list = read_users_friends(file, [user_id])
        while work_list and max_depth > 0:
            to_scan = set()
            for friend_id in work_list: # gather all the items to read
                if friend_id in visited:
                    continue
                all_friends.add(friend_id)
                visited.add(friend_id)
                to_scan.add(friend_id)
            # read them all in one shot
            work_list = read_users_friends(file, to_scan)
            # reduce for next call
            max_depth = max_depth - 1
    return all_friends
Now, instead of scattering the reads whenever we process an item, we gather them all and then send a list of items to read all at once. The costs are far clearer in this model, and more importantly, we have a chance to actually do something about it.
Optimizing a lot of calls to read_user_friends(file, user_id) is really hard, but optimizing read_users_friends(file, users) is a far simpler task. Note that the actual costs didn’t change because of this refactoring, but it is now far easier to actually make the change.
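To make that concrete, here is one hypothetical sketch of a batched read_users_friends. The lookup_offset and parse_friend_list helpers and the file layout are made up for illustration; the point is that a batched API can sort the requested IDs by file offset and read them in a single forward pass instead of seeking back and forth for each user:

def read_users_friends(file, user_ids) -> List[int]:
    """Hypothetical batched read: resolve offsets up front, then read in file order."""
    offsets = sorted(lookup_offset(uid) for uid in user_ids)  # assumed helper
    friends: List[int] = []
    for offset in offsets:
        file.seek(offset)
        friends.extend(parse_friend_list(file))  # assumed helper
    return friends

The same shape also lets you issue the reads concurrently or hint the OS with readahead, which the one-user-at-a-time version never gives you a chance to do.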
Going back to the flame graph above, the actual cost profile differed dramatically as the size of the data grew, even though the profiler output remained the same. Refactoring the code allowed us to see where the costs were and address them effectively.
Here is the end result as a flame graph. You can clearly see the preload section that takes a significant portion of the time. The key here is that the change allowed us to address this cost directly and in an optimal manner.
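To give a feel for the shape of that change (purely illustrative, not RavenDB’s actual code), the pattern mirrors the social graph example: gather the candidate IDs first, preload all of their vectors in one batched call, and only then score them with the cosine similarity sketch from above:

def score_candidates(query, candidate_ids, load_vectors):
    # load_vectors is an assumed batched loader returning {id: vector};
    # calling it once for all candidates is the 'preload' step
    vectors = load_vectors(candidate_ids)
    # scoring is now pure CPU work over data that is already in memory
    return sorted(
        ((cosine_similarity(query, vectors[c]), c) for c in candidate_ids),
        reverse=True,
    )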
The end result for our benchmark was:
- Before: 3 minutes, 6 seconds
- After: 2 minutes, 4 seconds
So almost exactly ⅓ of the cost was removed because of the different approach we took, which is quite nice.
This technique, refactoring the code to make the costs obvious, is a really powerful one, mostly because it is likely the first step you would take anyway for many performance optimizations (batching, concurrency, etc.).