Elegancy challenge: Cacheable Batches
Let us say that we have the following server code:
public JsonDocument[] GetDocument(string[] ids)
{
    var results = new List<JsonDocument>();
    foreach (var id in ids)
    {
        results.Add(GetDocument(id));
    }
    return results.ToArray();
}
This method is a classic example of batching requests to the server. For the purpose of our discussion, we have a client proxy that looks like this:
public class ClientProxy : IDisposable
{
    MemoryCache cache = new MemoryCache("ClientProxy");

    public JsonDocument[] GetDocument(params string[] ids)
    {
        // make the request to the server, for example
        var request = WebRequest.Create("http://server/get?id=" + string.Join("&id=", ids));
        using (var stream = request.GetResponse().GetResponseStream())
        {
            return GetResults(stream);
        }
    }

    public void Dispose()
    {
        cache.Dispose();
    }
}
Now, as you can probably guess from the title and from the code above, the question relates to caching. We need to make the following pass:
using (var proxy = new ClientProxy())
{
    proxy.GetPopularity("ayende", "oren");    // nothing in the cache, make request to server
    proxy.GetPopularity("rhino", "hippo");    // nothing in the cache, make request to server
    proxy.GetPopularity("rhino", "aligator"); // only request 'aligator', 'rhino' is in the cache
    proxy.GetPopularity("rhino", "hippo");    // don't make any request, serve from cache
    proxy.GetPopularity("rhino", "oren");     // don't make any request, serve from cache
}
The tricky part, of course, is to make this elegant. You can modify both server and client code.
I simplified the problem drastically, but one of the major benefits in the real issue was reducing the size of data we have to fetch over the network even for partial cached queries.
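One way the challenge could be approached is sketched below. This is not the author's solution; the resolver delegate stands in for the real server round-trip, `ConcurrentDictionary` stands in for `MemoryCache`, and all names are assumptions. The idea is the one from the post: answer what you can from the cache, and send the server only the ids the cache could not answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public class CachingBatchProxy
{
    // Stands in for the actual HTTP request; must return values
    // in the same order as the requested ids.
    readonly Func<string[], string[]> resolver;
    readonly ConcurrentDictionary<string, string> cache =
        new ConcurrentDictionary<string, string>();

    public CachingBatchProxy(Func<string[], string[]> resolver)
    {
        this.resolver = resolver;
    }

    public string[] GetPopularity(params string[] ids)
    {
        var results = new Dictionary<string, string>();
        var missing = new List<string>();

        // A single TryGetValue per id; no separate Contains/Get pair.
        foreach (var id in ids)
        {
            string value;
            if (cache.TryGetValue(id, out value))
                results[id] = value;
            else
                missing.Add(id);
        }

        // Only hit the server for the ids the cache could not answer;
        // if everything was cached, no request is made at all.
        if (missing.Count > 0)
        {
            var fetched = resolver(missing.ToArray());
            for (var i = 0; i < missing.Count; i++)
            {
                cache[missing[i]] = fetched[i];
                results[missing[i]] = fetched[i];
            }
        }

        return ids.Select(id => results[id]).ToArray();
    }
}
```

With this shape, the sequence from the post makes three server calls, and the third one asks only for 'aligator'.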
Comments
public JsonDocument[] GetDocument(params string[] ids)
{
    // only ask server for uncached documents
    var uncachedIds = (from id in ids
                       where !cache.Contains(id)
                       select id).ToArray();
}
Jesper, This code suffers from a multi-threading bug. We first check whether an item exists and only extract it later; by that time, it might have been removed from the cache. The same goes for setting the item in the cache and then trying to get it from there.
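The check-then-get race can be sketched as follows (using ConcurrentDictionary for illustration; the same reasoning applies to MemoryCache's Contains/Get pair):

```csharp
using System.Collections.Concurrent;

public static class CacheLookup
{
    // Racy: another thread can evict the key between ContainsKey and the
    // indexer, and the indexer will then throw KeyNotFoundException.
    public static string RacyGet(ConcurrentDictionary<string, string> cache, string id)
    {
        if (cache.ContainsKey(id))
            return cache[id]; // the entry may be gone by now
        return null;
    }

    // Safe: a single TryGetValue decides "present" and "value" atomically.
    public static string SafeGet(ConcurrentDictionary<string, string> cache, string id)
    {
        string value;
        cache.TryGetValue(id, out value);
        return value; // null means "not cached"
    }
}
```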
Mark, You don't have a cache; you have a memory leak, since nothing ever expires from it.
Revised: keys.Join line was abominable.
Mark, Aside from everything else: keys.Except(_cache.Keys) is probably much more expensive than doing a dictionary lookup for each key, since you are forcing an iteration of the entire thing, which is very expensive for a large number of items.
Same issue regarding the memory leak as before. Also, it is not safe for multi-threading.
Non-leaking version:
Non-leaking, more performant, thread-safe...
Mark, a) The GetValues will give you all the keys back (with the value set to null), so the code would fail. b) You are resolving those items one at a time, not all at once, which means that you just broke batching.
Mark, In the second version, there is the same problem: GetValues returns all the keys, even those not in the cache. You are still making N calls to the DB. And your call to AddOrGetExisting will return an existing (potentially older) value that has already been set in the cache; you should prefer the value from the server.
Ayende,
Your use case for the proxy showed it being used and disposed in a very short space of time, so memory leakage, thread-safety and performance around the overall size of the cache were not an obvious part of your requirements. You simplified the problem drastically, I simplified the solution likewise.
While we're in code review mode, your server method is inefficient in terms of collection management. You should create the List with the known size to prevent array copies as the list resizes its internals, or, even better, just create the array and then use a for loop to populate it:
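A sketch of that suggestion (here `string` stands in for JsonDocument, and the single-document GetDocument is a stub; both are assumptions for illustration):

```csharp
public static class DocumentServer
{
    // Stub standing in for the real single-document lookup (assumption).
    static string GetDocument(string id) { return "doc:" + id; }

    public static string[] GetDocuments(string[] ids)
    {
        // Sizing the array up front avoids the List<T> growth copies
        // and the final ToArray copy from the original version.
        var results = new string[ids.Length];
        for (var i = 0; i < ids.Length; i++)
            results[i] = GetDocument(ids[i]);
        return results;
    }
}
```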
Mark, I apologize, my (unvoiced) assumption was that the cache lifetime is longer than the lifetime of the proxy
Good catch with regards to the server code
While I'm learning about MemoryCache, which is great, more and more requirements are rearing their heads here.
I'm not making multiple DB calls, the _uncachedResolver function takes a list of strings and returns a list of values. The function would look like:
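The code from this comment did not survive, but presumably the resolver is a delegate wrapping the single batched server request. A hypothetical sketch (the name, signature, and stubbed body are all assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ResolverSketch
{
    // Takes only the uncached keys and answers them all in one batch,
    // standing in for a single HTTP round-trip to the server.
    public static readonly Func<IEnumerable<string>, IDictionary<string, string>> UncachedResolver =
        keys => keys.ToDictionary(k => k, k => "value-for-" + k);
}
```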
Also, I'm probably doing something hideous with my creation of the MemoryCache class with a GUID as the name. I suspect a better way would be to use a named region in the Default cache.
Mark, Oh! Good point, I didn't see that, I am sorry. The last thing to fix is what to do when everything is cached; you don't need to call the server then.
Mark, That is pretty bad, yes. Mostly because unless managed properly, caches are actually memory leaks :-) No worries for this scenario, but you might want to check how much trouble that got us with the RavenDB caching
Yes, I forgot to fix that check line to account for Nulls in the dictionary that comes back from the cache. So that line should be
and done.
Interesting to note that if you post enough comments, the Captcha stops appearing. Nice UX.
I've posted that class along with how it would be used by ClientProxy as a gist: https://gist.github.com/1167813
I rather like Mark's solution.
Mark,
1 small bug, the line foreach (var value in _uncachedResolver(keys.Where(k => values[k] == null)))
should be foreach (var value in _uncachedResolver(keys.Where(k => !values.ContainsKey(k))))
Otherwise it throws an exception when the key isn't in a cache rather than fetching it.
Matt,
MemoryCache returns all the keys in the Dictionary, with nulls where there was no matching value.
I think that only applies if you're using the GetValues(..) method, but not if you're using the Indexer or Item property (see http://msdn.microsoft.com/en-us/library/system.runtime.caching.memorycache.item.aspx).
Either way it's a small detail, and I like your solution.
Matt: I am using GetValues.
@Ayende, can we see your solution about this challenge?
Olcay, Take a look at the RaccoonBlog code on github :-)
@Mark, regarding the list/array optimization, I think LINQ would be the most efficient solution. E.g. ids.Select(GetDocument).ToArray();
@Ayende, regarding thread-safety, why does that matter? Why is it important that 2 concurrent requests with the same id MUST make exactly 1 request? I think the performance overhead to ensure this requirement (i.e. locking) would outweigh its gain.
Hendry, You don't need locking to do that. And the issue isn't one of concurrent requests; the issue is a concurrent _system_. 2 concurrent requests would most probably make 2 separate requests, and there is nothing wrong there. But you have to account for concurrency in the design, because that is the environment where it will run.