I just finished a second de-optimization of NMemcached. As with the first one, the change turns the code from using the async read pattern to using a simple serial read from a stream. Together, the two changes cut the time to complete my simple (and trivial) benchmark from 5768.5 ms to 2768.6 ms.
That is less than 50% of the original time! In both cases, I started out with heavy use of BeginXyz, in order to get as much parallelism as possible, but that turned out to be a bad decision: in the many cases where the data was already there, I was paying the price of an async call instead of just grabbing the data from the kernel buffer.
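To make "a simple serial read from a stream" concrete, here is a minimal sketch of what that shape looks like. It is not the NMemcached code (which is .NET); it is a Python analogy, and the names `SetCommand`, `read_exact`, and `read_set_command` are made up for illustration. It assumes the memcached text protocol's `set <key> <flags> <exptime> <bytes>` command line followed by the data block and a CRLF, and ignores the optional `noreply` token.

```python
import io
from typing import BinaryIO, NamedTuple


class SetCommand(NamedTuple):
    """Hypothetical container for a parsed memcached 'set' request."""
    key: str
    flags: int
    exptime: int
    data: bytes


def read_exact(stream: BinaryIO, count: int) -> bytes:
    """Block until exactly `count` bytes have been read; raise on early EOF."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended before the expected bytes arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_set_command(stream: BinaryIO) -> SetCommand:
    """Read one 'set' request serially: command line first, then the payload.

    There are no callbacks and no state machine. If the bytes are already
    buffered, each read returns right away; if not, the call simply blocks.
    """
    line = stream.readline()
    if not line.endswith(b"\r\n"):
        raise EOFError("stream ended before the command line was complete")
    parts = line[:-2].decode("ascii").split(" ")
    if len(parts) != 5 or parts[0] != "set":
        raise ValueError("expected: set <key> <flags> <exptime> <bytes>")
    key = parts[1]
    flags, exptime, length = int(parts[2]), int(parts[3]), int(parts[4])
    data = read_exact(stream, length)
    # The data block is followed by a terminating CRLF.
    if read_exact(stream, 2) != b"\r\n":
        raise ValueError("data block was not terminated by CRLF")
    return SetCommand(key, flags, exptime, data)
```

With a real connection you would pass something like `sock.makefile("rb")` as the stream; for a quick check, `read_set_command(io.BytesIO(b"set foo 0 0 5\r\nhello\r\n"))` is enough. The point of the sketch is only the control flow: each step is a plain read that picks up whatever is already waiting, rather than a BeginXyz-style call that schedules a callback even when the data has already arrived.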