Caching hostility–usage patterns that breaks your system

time to read 3 min | 587 words

When you use a cache, you need to take into account several factors about the cache. There are several workload patterns that can cause the cache to turn into a liability, instead of an asset. One of the most common scenarios where you can pay heavily for the cache, but not benefit much, is when you have a sequential access pattern that exceed the size of the cache.

Consider the following scenario:

In this case, the size is set to 100, but the keys are sequential in the range of 0 .. 127. We are basically guaranteed to never have a cache hit. What is the impact of such a cache, however?

Well, it will keep the reference alive for longer, so they will end up in the Gen2. On eviction, they will take longer to be discarded. In other words, adding a cache here will increase the amount of memory that is being used, have higher CPU utilization (the GC has to do more work) and won’t add any performance benefit at all. Removing the cache, on the other hand, will reduce both memory utilization and CPU costs.

This can be completely unintuitive at first glance, but it is a real scenario, and sadly something that we had experienced many times in RavenDB 3.x editions. In fact, a lot of the design of RavenDB 4.x was about fixing those kinds of issues.

Whenever you design a cache, you should consider what sort of adversity you have to face. Considering your users and adversaries, intentionally trying to break your software, is a good mindset to have. You get to avoid many pitfalls this way.

There are many other caching anti patterns. For example, if you are using a distributed cache, the pattern of accesses to the cache may be more expensive than reading from the source. You have many (fast) queries to answer a value, instead of one (somewhat slower) remote call. The network cost is typically huge, but discounted (see: Fallacies of Distributed Computing).

But for in memory cache, it is easy to forget that a cache that is overloaded is just a memory hog, not providing very good details at all. In the previous posts, I discussed how I should use a buffer pool in conjunction with the cache. That is done because of this particular scenario, if the cache is overloaded, and we discard values, we want to at least avoid doing additional allocations.

In many ways, a cache is a really complex piece of software. There has been a lot of research into it. Here are another non initiative result. Instead of using the least recently used (or least frequently used), select a value at random and evict it. Your performance is going to be faster.

Why is that? Look at the code above, let’s assume that I’m evicting a random value in the 25% least frequently used items. The fact that I’m doing that randomly means that there is higher likelihood that some values will remain in the cache, even after they “should” have expired. And by the time I come back to them, they would be useful in the cache, instead of predictably evicted.

In many databases, the cache management takes a huge part of the complexity. You usually have multiple levels of caches, and policies that move them between one another. I really liked this post, discussing the Postgres algorithm in great details. It also cover some aspects of nearly hostile behavior that the cache has to guard against, to avoid pathological performance drops.