Buffer allocation strategies: Explaining the solution
In my previous post, I threw a bunch of code at you, with no explanation, and asked you to discuss it.
Here is the code, with full discussion below.
[ThreadStatic]
private static Stack<byte[]>[] _buffersBySize;

private static byte[] GetBuffer(int requestedSize)
{
    if (_buffersBySize == null)
        _buffersBySize = new Stack<byte[]>[32];

    var actualSize = PowerOfTwo(requestedSize);
    var pos = MostSignificantBit(actualSize);
    if (_buffersBySize[pos] == null)
        _buffersBySize[pos] = new Stack<byte[]>();

    if (_buffersBySize[pos].Count == 0)
        return new byte[actualSize];

    return _buffersBySize[pos].Pop();
}

private static void ReturnBuffer(byte[] buffer)
{
    var actualSize = PowerOfTwo(buffer.Length);
    if (actualSize != buffer.Length)
        return; // can't put a buffer of strange size here (probably an error)

    if (_buffersBySize == null)
        _buffersBySize = new Stack<byte[]>[32];

    var pos = MostSignificantBit(actualSize);
    if (_buffersBySize[pos] == null)
        _buffersBySize[pos] = new Stack<byte[]>();

    _buffersBySize[pos].Push(buffer);
}
There are a couple of interesting things going on here. First, we round allocations up to a power of two, which reduces the number of different sizes we have to deal with. We store the buffers in a small array of stacks, using the most significant bit of the rounded size to index into the array.
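The code relies on two helpers, PowerOfTwo and MostSignificantBit, that aren't shown. Here is a minimal sketch of what they might look like (my reconstruction, not the post's actual implementation):

```csharp
// Rounds up to the next power of two, e.g. 5000 -> 8192 (4096 stays 4096).
private static int PowerOfTwo(int size)
{
    if (size <= 0)
        return 1;
    size--;
    size |= size >> 1;
    size |= size >> 2;
    size |= size >> 4;
    size |= size >> 8;
    size |= size >> 16;
    return size + 1;
}

// Index of the highest set bit, e.g. 8192 -> 13; used to index the stacks array.
private static int MostSignificantBit(int value)
{
    var pos = -1;
    while (value != 0)
    {
        value >>= 1;
        pos++;
    }
    return pos;
}
```

Because sizes are powers of two, MostSignificantBit gives each size class its own slot, and 32 slots cover every possible int-sized buffer.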
In practice, we'll use a very small number of sizes, typically 4KB – 32KB. The basic idea is that you pull an array from the pool; if a suitable one is available, we save an allocation. If not, we allocate a new one and hand it to the user.
Once we have given the user a buffer, we don't keep track of it. If they return it to us, great; if not, the GC will clean it up. This is important: if we did track handed-out buffers, forgetting to call ReturnBuffer would create what is effectively a memory leak.
Another thing to notice is that we don't require the same thread to get and return a buffer. It is fine to get it on one thread and return it on another, which means async code that hops between threads will still work with this buffer pool. We also use a stack so that the most recently returned buffer is handed out first, keeping it warm in the CPU cache.
Note that this is notepad code, so there are probably issues with it.
In fact, there is a big issue here that will only show up in particular usage patterns. Can you see it? I’ll talk about it in my next post.
More posts in "Buffer allocation strategies" series:
- (09 Sep 2015) Bad usage patterns
- (08 Sep 2015) Explaining the solution
- (07 Sep 2015) A possible solution
Comments
"This means that async code will work well with thread hopping and this buffer pool."
I assume by "work well" you mean "be quite random in nature and cost more memory than you'd expect". IO completion ports have their own dedicated threads in the thread pool, which means that continuations from IO completion routines (socket / file IO) might release buffers into thread statics that are never acquirable by a normal worker thread. Besides that, the solution is very non-deterministic in its effectiveness in async code and is bound to leave us asking "why does my memory footprint seem to grow to infinity?"
See: https://msdn.microsoft.com/en-us/library/system.threading.threadpool.setmaxthreads(v=vs.110).aspx
Yes, if the set of allocating threads and the set of releasing threads aren't the same then the {releasing, non-allocating} threads are now a memory leak. That could be somewhat dealt with by setting a cap on the number of buffers that you keep in the pool.
But if you implement that generally, you lose the locality that you liked - unless you also switch to deques instead of stacks so that, on hitting the limit, you can discard the LRU buffer, rather than the current one (in the release method).
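The cap idea from the comment above could be sketched as a self-contained variant of the post's pool (the limit of 16 buffers per size class is an arbitrary number I picked, and the helper implementations are my own reconstruction):

```csharp
using System;
using System.Collections.Generic;

static class CappedBufferPool
{
    private const int MaxBuffersPerSize = 16; // arbitrary cap; tune for your workload

    [ThreadStatic] private static Stack<byte[]>[] _buffersBySize;

    private static int PowerOfTwo(int size) // round up to the next power of two
    {
        var result = 1;
        while (result < size)
            result <<= 1;
        return result;
    }

    private static int MostSignificantBit(int value) // index of highest set bit
    {
        var pos = -1;
        while (value != 0) { value >>= 1; pos++; }
        return pos;
    }

    public static byte[] GetBuffer(int requestedSize)
    {
        if (_buffersBySize == null)
            _buffersBySize = new Stack<byte[]>[32];
        var actualSize = PowerOfTwo(requestedSize);
        var pos = MostSignificantBit(actualSize);
        if (_buffersBySize[pos] == null)
            _buffersBySize[pos] = new Stack<byte[]>();
        if (_buffersBySize[pos].Count == 0)
            return new byte[actualSize];
        return _buffersBySize[pos].Pop();
    }

    public static void ReturnBuffer(byte[] buffer)
    {
        var actualSize = PowerOfTwo(buffer.Length);
        if (actualSize != buffer.Length)
            return;
        if (_buffersBySize == null)
            _buffersBySize = new Stack<byte[]>[32];
        var pos = MostSignificantBit(actualSize);
        if (_buffersBySize[pos] == null)
            _buffersBySize[pos] = new Stack<byte[]>();
        // Once the cap is hit, drop the buffer and let the GC reclaim it,
        // bounding how much memory a releasing-only thread can pin.
        if (_buffersBySize[pos].Count >= MaxBuffersPerSize)
            return;
        _buffersBySize[pos].Push(buffer);
    }

    public static int PooledCount(int size) =>
        _buffersBySize == null || _buffersBySize[MostSignificantBit(PowerOfTwo(size))] == null
            ? 0
            : _buffersBySize[MostSignificantBit(PowerOfTwo(size))].Count;
}
```

Note this drops the newest buffer on overflow; the LRU/deque variant from the comment would instead discard the oldest, preserving cache locality for the hot buffers.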
Some things worth looking into for improving Async support for this over ThreadStatic, depending on what version of the framework you're targeting and what platforms you need to run on:
https://msdn.microsoft.com/en-us/library/system.runtime.remoting.messaging.callcontext.logicalgetdata(v=vs.110).aspx https://msdn.microsoft.com/en-us/library/dn906268(v=vs.110).aspx
To get around the issue Damien is talking about (one thread allocates and another releases), I would think about replacing
[ThreadStatic] private static Stack<byte[]>[]
with
System.Collections.Concurrent.ConcurrentBag<byte[]>[]
I have no idea about your specific use case or your special performance requirements, but I've read about the concurrent collections in the past; they are known to be mostly lock-free, and they use thread-local AND global storage under the covers, so (at least in my head) they will generally perform better than your thread-local stacks.
Yes, large issues are:
1) If all allocations are done from a set of threads different from the ones releasing, you have a hard effective memory leak that will grow without bounds
2) If memory is usually used in bursts - for example, threads use very large buffers at startup, then release them and only a few small buffers are needed after that - your memory usage will be much higher than required, because the memory used by your pool during all of runtime will be equal to the maximum ever needed at once.
3) as noted above, I/O threads get their own pool, so if all allocations are done from the worker pool and then released in the I/O pool, you have an increasing amount of memory used and never reused.
4) In the odd case where you're not using the thread pool (or any type of thread pooling), this code will be useless. Hopefully this is a non-issue in most cases.
5) Finally, you'll have to audit all your code (and check any framework code you're using) to ensure you're not relying on buffer.Length. Because you're not clearing the contents of the buffer, it could be a random bug factory at best, and a security vulnerability at worst.
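To illustrate point 5: because the pool rounds requests up, a buffer is usually larger than what was asked for, so callers must track the number of valid bytes themselves instead of trusting buffer.Length. A small standalone demo of the pattern (MemoryStream stands in for real I/O, and the 8192-byte array stands in for a pooled buffer):

```csharp
using System;
using System.IO;

class LengthPitfallDemo
{
    static void Main()
    {
        // A pooled buffer for a 5000-byte request is actually 8192 bytes, so
        // anything past the data read is stale content from a previous user.
        var buffer = new byte[8192]; // stands in for GetBuffer(5000)
        var stream = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
        var read = stream.Read(buffer, 0, buffer.Length); // read == 5, not 8192
        // Pass the valid slice explicitly instead of the whole array.
        var view = new ArraySegment<byte>(buffer, 0, read);
        Console.WriteLine(view.Count); // prints 5
    }
}
```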
Oh yeah, you get 250 IO threads by default, so that means, if you allocate, say, 256MB in an IO callback every so often, once your application has been up long enough, you'll have allocated 64GB of memory. Not ideal.
What if somebody returns MAX_INT number of tiny sized buffers because they happen to read a very large number of tiny files and then they never exhibit that usage pattern again? Oompf. Whatever wraps this data structure should periodically expire buffers when they idle too long and it should use a minimum core pool size defined by the user (or a sane default) to keep some buffers idle and at the ready, but let the GC take care of the rest.
Eli, The whole point of buffer pool is that you take and return buffers from it. To take 2 billion buffers and not return them is something very strange.
Truly. I was just using shorthand for "really big number" here. The point was that pools usually have bounds to prevent callers from doing outrageous things.
Eli, If you are allocating a lot of small buffers, you want us to pool them. Otherwise you are going to force the GC to do a LOT of work.