Immutable Collections performance
After finding out that a lot of our costs in Voron are actually related to immutable collections, I decided to run some isolated perf tests.
private static void Dictionary()
{
    var dic = new Dictionary<object, object>();
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        var obj = new object();
        dic[obj] = obj;
    }

    Console.WriteLine(sp.Elapsed);
}
private static void ImmutableDictionary()
{
    var dic = ImmutableDictionary<object, object>.Empty;
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        var obj = new object();
        dic = dic.SetItem(obj, obj);
    }

    Console.WriteLine(sp.Elapsed);
}
Dictionary: 3.639 seconds
ImmutableDictionary: 1 minute and 58 seconds
Yes, that is correct: the immutable version is over thirty times slower. And given that we are only using 10 million items, that is a ridiculously high cost.
I decided to do some tests to see if it is just the number of calls that we are making here:
private static void ImmutableDictionary()
{
    var dic = ImmutableDictionary<object, object>.Empty;
    var sp = Stopwatch.StartNew();

    dic = dic.SetItems(Enumerable.Range(0, 10 * 1000 * 1000).Select(i =>
    {
        var obj = new object();
        return new KeyValuePair<object, object>(obj, obj);
    }));

    Console.WriteLine(sp.Elapsed);
}
1 minute
So it appears that it isn't the number of calls, but something intrinsic to the way it works. And how about lists?
private static void List()
{
    var list = new List<object>();
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        var obj = new object();
        list.Add(obj);
    }

    Console.WriteLine(sp.Elapsed);
}
private static void ImmutableList()
{
    var list = ImmutableList<object>.Empty;
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        var obj = new object();
        list = list.Add(obj);
    }

    Console.WriteLine(sp.Elapsed);
}
List: 0.9 seconds
ImmutableList: 16 seconds
Ouch.
I think we are going to need to do something about this. Even though I love the idea of immutable collections, being 16 – 32 times slower is just going to be utterly unacceptable.
But this is for writes, how about reads?
I wrote the following for both of them:
var sp = Stopwatch.StartNew();
for (int i = 0; i < 10 * 1000 * 1000; i++)
{
    object value;
    dic1.TryGetValue(i, out value);
}

Console.WriteLine(sp.Elapsed);
var sp1 = Stopwatch.StartNew();

for (int i = 0; i < 10 * 1000 * 1000; i++)
{
    object value;
    dic2.TryGetValue(i, out value);
}
Console.WriteLine(sp1.Elapsed);
Dictionary: 0.45 seconds
ImmutableDictionary: 3.11 seconds
So the summary is, the performance for our workloads is probably going to be simply unacceptable. We'll have to find another way to handle this. The way to handle this nicely is to basically copy the values. So I wanted to know how long it would take to fully clone a 10 million item dictionary. The answer: 0.6 seconds. Doing the same with an immutable dictionary? 16 seconds.
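(For reference, a minimal sketch of what "fully clone" means here, assuming the long -> long entries mentioned later in the comments; the 0.6 / 16 second numbers are from the timings above, not this exact code:)

// Cloning a mutable dictionary is a single bulk copy via the copy constructor.
var snapshot = new Dictionary<long, long>(current);

// The immutable "clone" has to rebuild its internal tree pair by pair.
var immutableSnapshot = ImmutableDictionary<long, long>.Empty.SetItems(current);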
So we are going to need to do something else.
Comments
In zero contention scenarios, Dictionary is much much faster than ConcurrentDictionary, but in high contention it is the reverse.
It is not much surprise that a single threaded for-loop is faster for Dictionary than ImmutableDictionary.
What is Voron's dictionary usage pattern?
I would also compare the time it takes to construct the objects, as distinct from just adding the items.
Jahmai, We have no contention issue (access is already protected), but using immutable collections means that we don't have to fear mutation while reading. Now we have to figure out how to replace the immutable collections with something else.
Yeah seems like instead of an immutable dictionary (which copies itself every time you alter its state) you want something that guarantees the immutability of the items it contains?
Have you tried the static methods on ImmutableList and/or the extension methods? For example changing your ImmutableList test to the following drops the time down significantly:
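(The code sample didn't survive the comment formatting; presumably it was a bulk-creation call along these lines, a sketch:)

// Build the whole list in one bulk operation instead of 10 million Add() calls.
var sp = Stopwatch.StartNew();

var list = ImmutableList.CreateRange(
    Enumerable.Range(0, 10 * 1000 * 1000).Select(i => new object()));

Console.WriteLine(sp.Elapsed);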
Erewhon, Nope, I actually store long -> long there. So no issue with mutability of items. I need to have both readers and a single writer.
Chris, Not really significant, because I need to actually do a buildup of this over time, not just once.
Have you looked into the LMAX Disruptor ring buffer implementation for handling this?
Mikkel, I don't need anything like that.
Roger that:)
Why don't you just expose it as an IReadOnlyDictionary? Protected, but you can still modify it.
The performance isn't bad. Blogged about it a few mins back http://invokeit.wordpress.com/2013/11/29/immutable-collections-performance-dotnet/
The Immutable Collections API suggests using its builder classes for manipulating the collections:
var builder = ImmutableDictionary.CreateBuilder<object, object>();
var sp = Stopwatch.StartNew();

for (int i = 0; i < 10 * 1000 * 1000; i++)
{
    var obj = new object();
    builder.Add(obj, obj);
}

Console.WriteLine(sp.Elapsed);
When I run this test, it reduces the run time from 1:46.42 (using ImmutableDictionary) to 0:55.79. Now, compared to 3.67 seconds for the plain Dictionary version...
Hermit, as far as I can see, while ayende mutates the immutable list n times (meaning, creating a new one from the old one), you only generate it once per run - that is a very different test and not comparable. You are essentially copying the list into a new structure snapshot-style. You don't need Immutable* for that.
jgauffin, IReadOnlyDictionary isn't useful for me. I want to have this read by many threads, and written by one thread. The dictionary is NOT safe to be read while it is being mutated.
Stephan, That is the time for _writing_, but the read times are just as bad, and they impact us a LOT more.
Hermit, What you are doing is to create a snapshot of the data. That is nice, but it is utterly useless. Try to imagine that you now need to do this in another operation (for example, you need to track something). That means that you cannot build the entire data set in one go.
Snapshot != immutable.
For fun, try iterating over both of the lists; for AND foreach are horrible.
What implementation of ImmutableDictionary are you using?
Rafal, The BCL Immutable collections: http://www.nuget.org/packages/Microsoft.Bcl.Immutable/
Hello, I am not sure you can do much faster using any existing form of dictionary. The immutable collection works by creating a new version each time you alter it; this way each existing instance is read only and can be safely read by as many threads as you want.
The benefit you get from those is that the new version is not a full copy, which saves memory and copy time. To achieve that, they have to use complex structures which are less iteration friendly than their basic counterparts.
Look at it this way: 2 minutes to create 10 M different dictionaries with a size from 1 to 10 M is not that bad!
I understand this is not what you want to hear.
If that is possible in your implementation, you should favor batch modifications: SetItems or AddRange
ah, .Net 4.5 only. ok.
dupdob, I have tried with SetItems, it isn't really helpful for our scenario. In fact, performance is really bad around that.
And yes, I fully understand the implementation decision associated with this.
For those unclear about these collections, the implementations for the immutable collections are substantially more complicated than their mutable counterparts. They are optimised to keep to a low memory footprint so that each change causes a new collection to be created with a reference to the previous state and a delta of the changes. It avoids repeatedly copying all the elements and the large memory allocation that would cost.
This immutability is perfect for representing a "snapshot" of state which you can pass around to other operations knowing that it will not mutate.
Compare this to a concurrent collection, which would provide support for concurrent reads and writes, but makes no promise that the state won't be shifting between reads.
I'm sure the BCL team have gone to great lengths to make sure these classes perform as fast as possible for their goals of low memory utilization, so it might be that the pattern isn't the most performant for this use case.
@Ayende What is the ratio between reads and writes? How "stale" can the data be for the readers? Is it important that the reads or writes aren't blocked?
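(To make the structural sharing described above concrete, a tiny hypothetical sketch; the sizes are arbitrary:)

// Two versions of an immutable list share almost all of their tree nodes;
// Add() allocates only the few nodes along one path from root to leaf.
var v1 = ImmutableList.CreateRange(Enumerable.Range(0, 1000));
var v2 = v1.Add(1000);

Console.WriteLine(v1.Count); // 1000 -- the old snapshot is untouched
Console.WriteLine(v2.Count); // 1001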
Tried something similar with F# immutable map
Writes: let elements = { 1 .. 10 * 1000 * 1000 } |> Seq.map (fun i -> i, i) |> Map.ofSeq takes about 20 seconds
Reads: elements |> Array.ofSeq |> ignore takes 0.9 seconds
(times on my machine, times for your C# dictionary samples are similar to yours)
I guess this means there's room for improvement in the C# ImmutableDictionary implementation
Paul, The immutability contract is perfect, we need to have an immutable snapshot of the data while we are modifying it in another thread. There is one writer, and multiple readers, but the data cannot change from the moment the reader got the collection.
Patrick, The problem is that the F# map doesn't have a way to batch changes. Our actual scenario is more complex, and we want to be able to batch those changes to avoid a lot of intermediary steps.
If you are just iterating, it might be worth mentioning that the ConcurrentDictionary does provide a snapshot.
Why not wrap ConcurrentDictionary in a IReadOnlyDictionary? ConcurrentDictionary is fairly good at reading.
Paul, I am not iterating, I am calling TryGetValue a lot.
OmariO, Because I don't get a snapshot of the state it was in.
In the first test, you have created 10 * 1000 * 1000 snapshots of the immutable dictionary. I'm pretty sure you don't need that many.
If you try this, with only 1,000 snapshots:
It runs in under a second on my machine.
Oops, ignore the previous post, it has a bug (it does not add items).
Proper version:
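(This code was also eaten by the comment formatting; presumably the 1,000-snapshot version looks something like this, a sketch using the builder and materializing a snapshot every 10,000 inserts:)

var dic = ImmutableDictionary<object, object>.Empty;
var sp = Stopwatch.StartNew();

const int total = 10 * 1000 * 1000;
const int perSnapshot = total / 1000; // 1,000 snapshots in total

var builder = dic.ToBuilder();
for (int i = 0; i < total; i++)
{
    var obj = new object();
    builder.Add(obj, obj);
    if ((i + 1) % perSnapshot == 0)
        dic = builder.ToImmutable(); // publish an immutable snapshot
}

Console.WriteLine(sp.Elapsed);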
Timings:
10,000,000 snapshots: 00:03:21.4288838
1,000 snapshots: 00:00:25.7278598
Read timings: 00:00:05.2422042 and 00:00:04.6549159
Try instantiating the dictionary with an initial capacity. This works wonders for List<>, especially when you know how many items it needs to hold. The underlying array that backs a List<> gets pre-sized, so you don't have to do any memory copies to resize the array. Linear performance that way.
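(Concretely, both types expose a capacity constructor:)

// Pre-sizing avoids the repeated grow-and-copy cycles on the internal arrays.
var dic = new Dictionary<object, object>(10 * 1000 * 1000);
var list = new List<object>(10 * 1000 * 1000);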
I watched a video by Bjarne Stroustrup on LinkedLists vs Vectors in speed performance in reads and writes. Might be worth a look to explain why the immutable collection is so slow http://www.youtube.com/watch?v=YQs6IC-vgmo#t=6
Well this difference in performance should not be that surprising really. You are comparing an O(1) implementation versus an O(log N) implementation. With 10 million inserts that is going to be noticeable. If I need a persistent data structure as a generic dictionary, I generally use an implementation of a hashed array mapped trie. On these tests it is almost twice as fast as the BCL immutable dictionary, but still O(log N).
Seems like your needs are different and you do not want a persistent data structure, but something specific that delivers a "minimal change" immutable snapshot after each mutating operation on an underlying mutable data structure, and that is preferably array based vs. link / pointer based.
For these types of specific use cases it often makes sense to just bite the bullet and build your own customized data structure that is optimized for your use case.
I tried a similar test with ConcurrentDictionary and performance lags big time when the key is an object. When it is an int, I get 2.0 seconds on Dictionary and 2.5 seconds on ConcurrentDictionary. In the past, I've tried to get better performance from Dictionary using my own locks and always given in to the superior performance of ConcurrentDictionary under heavy read/write contention (50 million entries). Straight sequential performance is probably not the best indicator of how it will hold up under load.
Alex, Yes, that is the direction we are going with. We are currently trying out different purpose built implementations.
Tyler, ConcurrentDictionary isn't good for us; it exposes the mutated state, and we need a frozen view.
Okay, it exposes the mutated state. But how does a concurrency strategy around Dictionary prevent exposure of the mutated state? How do you "freeze" the view on the Dictionary?
Could you wrap the dictionary access instead of passing the dictionary itself around? The wrapper would hold an instance of the current dictionary and a couple of functions to control access. Something like:
CollectionWrapper.WithCurrent(x => /* do stuff, x is the current dictionary as an IReadOnlyDictionary */);
CollectionWrapper.ModifyCurrent(x => { /* x is a clone of the current dictionary, this will return a new dictionary */ return x; });
Some locking around ModifyCurrent and you could have multi threaded writes as well.
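(A rough sketch of what such a wrapper could look like; WithCurrent/ModifyCurrent are the commenter's names, the copy-on-write body with an atomic reference swap is a hypothetical guess:)

using System;
using System.Collections.Generic;

public static class CollectionWrapper
{
    private static volatile Dictionary<string, object> _current =
        new Dictionary<string, object>();
    private static readonly object _writeLock = new object();

    // Readers act on whatever snapshot is current; it is never mutated again.
    public static void WithCurrent(Action<IReadOnlyDictionary<string, object>> action)
    {
        action(_current);
    }

    // The writer gets a clone, mutates it, and the result is swapped in atomically.
    public static void ModifyCurrent(
        Func<Dictionary<string, object>, Dictionary<string, object>> modify)
    {
        lock (_writeLock)
        {
            var clone = new Dictionary<string, object>(_current);
            _current = modify(clone);
        }
    }
}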
Tyler, That is what immutable collections gives you.
flukus, I cannot do that, I am going to keep the instance I get for a while, and need to make multiple separate calls to it.
You could have wrapper.GetCurrent(). As long as you can trust the calling code to not make multiple calls, i.e.: if (wrapper.GetCurrent().HasKey("key")) var x = wrapper["key"].
That's not exactly ideal though. I assume the data is too big to clone every time?
Okay, I understand that. But the cost (making a new copy of the data on every write) seems prohibitively high, as you point out. I wonder if you could achieve something similar but faster by having two dictionaries (A & B) for data and a transaction/swap dictionary (C). Write to A, read from B. When you write to A, you "log" the change to C. When you swap A for B, C is used to update B, and reads occur on A.
Thanks for your patience as I'm poking around in the dark outside of the box, not knowing the internals of the problem you are solving.
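(A hypothetical sketch of the two-dictionaries-plus-log idea; note it assumes the swap happens only once no reader still holds the old side, which is exactly the constraint discussed below:)

using System.Collections.Generic;

class DoubleBuffered
{
    private Dictionary<long, long> _write = new Dictionary<long, long>(); // "A"
    private Dictionary<long, long> _read = new Dictionary<long, long>();  // "B"
    private readonly Dictionary<long, long> _log = new Dictionary<long, long>(); // "C"

    // Single writer: write to A and log the change in C.
    public void Set(long key, long value)
    {
        _write[key] = value;
        _log[key] = value;
    }

    public IReadOnlyDictionary<long, long> Current { get { return _read; } }

    // At a quiet point, and only when all readers have let go of the old side:
    public void Swap()
    {
        var stale = _read;
        _read = _write;                                        // readers flip to A
        foreach (var kvp in _log) stale[kvp.Key] = kvp.Value;  // replay C onto B
        _log.Clear();
        _write = stale;                                        // keep writing on B
    }
}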
Flukus, Sure, you got the current, but how do you protect it from other changes?
Tyler, The problem here is that we cannot assume that we have just reader/writer. In practice we have multiple readers and one writer.
The readers only get access to an IReadOnlyDictionary and the writer gets a cloned copy to modify. After modification, wrapper.current will be replaced with the cloned dictionary. Every call to GetCurrent() will potentially get a new dictionary.
Do readers have to have the latest version or is it okay to have a stale version while the writes are taking place?
flukus, That still means that you have to clone the dictionary on every write.
The readers get the version once (it has to be the latest), and then use it for a while.
I think I understand that. If you have multiple readers, and they are all reading from the "read-only" dictionary (perhaps even a ConcurrentDictionary that is not taking writes at any time readers are reading from it). And you have a single writer that is writing to the "writable" dictionary. And at some point when writes are at a lull or some other strategy determined point, readers are switched to read from the previously "writeable" collection. At that point, the queued changes (in "C") are written to the previously "read only" collection and writes continue there, queuing changes until the next "swap".
If I'm missing the boat, can you explain how you expected immutables to behave differently except that every write would create a new copy that would be available to the next reader? Do you need that level of concurrency, that every write invalidates the current "read only" collection? If so, could you get away with segmenting or sharding the immutables into smaller collections, with reads and writes orchestrated across the "shards" to limit the amount of data being "copied" on each write/read cycle?
Tyler, The writer makes its own changes, and they are not visible until it is done. Once it is done, the changes are public and available for any reader. Readers' lifetimes are independent, and you can have readers that last multiple write cycles, but they need the same data in their dictionary.
Cloning for writes is the only lock free way of doing it I know of. Otherwise you end up re-implementing ImmutableDictionary.
wrapper.GetCurrent() would be the current snapshot, so suppose we had a situation like: 1. a reader takes a snapshot via GetCurrent(); 2. the writer clones the current dictionary and adds "abc" to the clone; 3. the writer publishes the clone as the new current; 4. the reader looks up "abc" in its snapshot.
So at step 4 "abc" won't be in the dictionary, because it is a snapshot, which is semantically the same as ImmutableDictionary but with hopefully better performance characteristics (depends entirely on how many clones you're doing).
Either that, or as tyler said, additions/modifications go into a separate dictionary and at some point you combine the two, this requires either locking or cloning as well.
The big question is what are the write characteristics? Frequent small additions or less frequent large additions?
Ayende, it sounds like you don't want mutability but versioning. Can a reader be given a version and all reads of the mutable concurrent dictionary read the value for that version or prior to that version but never a newer version of the value? And all writes simply adding a new value with a current "writer version" rather than updating an existing value? And then a trailing process to remove old versions once there are no more readers with that version. In this case, the "value" in the concurrent dictionary would have to support multiple "versions" with a version key. And deletes would be more complex. But it just might give you reader immutability (or the illusion of same) and writer freedom while allowing readers to have a lifetime across multiple reads with a guarantee that they read only their "version" of the data. Hmmm.....
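(A minimal sketch of this versioning idea, with hypothetical names throughout; it assumes a single writer, and readers that pin CurrentVersion when their read transaction starts:)

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class VersionedDictionary<TKey, TValue>
{
    // Each key maps to a version-ordered list of values, newest last.
    private readonly ConcurrentDictionary<TKey, List<KeyValuePair<long, TValue>>> _entries =
        new ConcurrentDictionary<TKey, List<KeyValuePair<long, TValue>>>();
    private long _version;

    // Readers pin this at the start of their read transaction.
    public long CurrentVersion { get { return Interlocked.Read(ref _version); } }

    // Single writer only: adds a new versioned value instead of overwriting.
    public void Set(TKey key, TValue value)
    {
        var list = _entries.GetOrAdd(key, k => new List<KeyValuePair<long, TValue>>());
        var v = Interlocked.Increment(ref _version);
        lock (list) list.Add(new KeyValuePair<long, TValue>(v, value));
    }

    // Readers never see values newer than the version they pinned.
    public bool TryGetValue(TKey key, long pinnedVersion, out TValue value)
    {
        value = default(TValue);
        List<KeyValuePair<long, TValue>> list;
        if (!_entries.TryGetValue(key, out list)) return false;
        lock (list)
        {
            for (int i = list.Count - 1; i >= 0; i--)
            {
                if (list[i].Key <= pinnedVersion) { value = list[i].Value; return true; }
            }
        }
        return false;
    }
}

A trailing cleanup pass, as the comment suggests, would prune entries whose version no live reader can still see.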
flukus, We frequently do small additions. It can be changing 10 values, it could be changing 100 - 250, but those are the usual ranges.
Tyler, That is actually a very good suggestion, and it just might actually work great! That would turn our process to a O(2), with very little cost. I am going to try that soon.
Cool. I look forward to hearing how it goes.
I am not sure if your use case is similar to one of mine, where basically I am only appending to the end of a list (e.g. pages for a transaction, essentially leading to a transaction ordered list), removing from the front of the list (e.g. transaction pages that are no longer needed) and searching for the first entry matching some criterion and iterating sequentially from that first entry (e.g. those that still need to be data synced).
I created a custom immutable list for this purpose that performs O(1) amortized for appends, removals, indexing and enumeration, and thus can be searched in O(log N) for a given item, if the items are inserted in sorted order (which is the case in my use case).
For inserts it takes about twice the time of a BCL generic list; for random/sequential indexing and enumeration it performs about the same as the BCL list. For removals it is much faster. Performance in practice obviously depends on the actual mix of operations, but so far I am pretty happy with it.
I added the essentials to a gist if you are interested: https://gist.github.com/anonymous/7819554
One note of caution: this implementation is only truly immutable if there is only a single thread at a time that appends new items.
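(The gist has the real implementation; here is just a hypothetical, heavily simplified sketch of the core trick: snapshots share one growing backing list and differ only in the window they expose, which stays safe as long as a single thread appends:)

using System.Collections.Generic;

class AppendOnlySnapshotList<T>
{
    private readonly List<T> _backing; // shared by all snapshots
    private readonly int _head;        // first visible item
    private readonly int _count;       // number of visible items

    private AppendOnlySnapshotList(List<T> backing, int head, int count)
    {
        _backing = backing; _head = head; _count = count;
    }

    public static AppendOnlySnapshotList<T> Empty
    {
        get { return new AppendOnlySnapshotList<T>(new List<T>(), 0, 0); }
    }

    public int Count { get { return _count; } }
    public T this[int index] { get { return _backing[_head + index]; } }

    // Single appending thread only: items past _count are invisible to
    // existing snapshots, so growing the shared backing list in place is fine.
    public AppendOnlySnapshotList<T> Append(T item)
    {
        if (_head + _count == _backing.Count)
        {
            _backing.Add(item);
            return new AppendOnlySnapshotList<T>(_backing, _head, _count + 1);
        }
        // Another snapshot already grew past us: copy our visible window.
        var copy = _backing.GetRange(_head, _count);
        copy.Add(item);
        return new AppendOnlySnapshotList<T>(copy, 0, _count + 1);
    }

    // Drop items from the front without copying: just move the window
    // (bounds checks omitted).
    public AppendOnlySnapshotList<T> RemoveFront(int nToRemove)
    {
        return new AppendOnlySnapshotList<T>(_backing, _head + nToRemove, _count - nToRemove);
    }
}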
Alex, that is a very interesting implementation. I'll try it out, but it looks like it should be quite appropriate for our needs. Can I assume that you don't have an issue with me using this?
And yes, we only have a single writer thread.
I have no issue with you using it. In fact that has already resulted in finding and fixing 2 bugs ;)
An update with 2 fixes: https://gist.github.com/anonymous/7820394
Some other idea that might be useful (though Tyler's is pretty good too):
assume:
    class TransUndo { KVP old; TransUndo Next }
    static TransUndo last;
    static ConcurrentDictionary dict

Write:
    last = new TransUndo { old = valuesReplacedByChanges, Next = last }
    dict.Write

Read:
    snapshot = last
    ...
    // skipping snapshot is intentional
    for (current = snapshot.Next; current != null; current = current.Next)
    {
        if (current.old.Key == readKey) return current.old.Value
    }
    return dict.Read
The advantage of this solution over Tyler's is that reading is cheaper when no write happened during the read transaction.
Also I'm not sure how Tyler would deal with very old values. What about a read transaction version 200000 when you need value version 1?
dict[key][200000]
is not going to work, and iterating would be expensive. The advantage over Alex's solution is that cleaning up the undo nodes happens via the GC, but it comes at the cost of slower reads when a lot of writes happened during the read transaction. Also, Alex's solution allows you to reuse the transaction object, which could relieve a bit of GC pressure.
ugh bitten by a newline is not a new line again :(
I'm coming from a Java context, so I have not checked it with C#. But if the list is the same as in other functional languages, you are likely benchmarking the wrong way.
For immutable structures, adding to the end is usually an O(N) operation, while adding to the beginning is usually O(1). So the correct benchmark would be adding elements to the beginning (using Insert(0, x)) and then reversing the list at the end. Otherwise you will get quadratic cost on adding elements.
You realize of course, that you're looking for an MVCC dictionary to interface to Voron, which is an MVCC dictionary. ???
Alex, You have another bug here: https://gist.github.com/anonymous/7820394#file-immutableappendonlylist-cs-L205
The conditions should be reversed there.
tx, We aren't using a list here; we are testing a dictionary, which doesn't really have the notion of adding at the beginning / end.
Howard, Yes, I am. What I need to store is the page table for Voron. And that would NOT do well as a B+Tree with the same implementation semantics. To start with, the cost of 4KB minimum per change is very prohibitive for the kind of things we do. And we don't need infinite scale; we know we can do it more efficiently if we concentrate on a memory resident collection.
Patrick, I am afraid that I am not following your solution.
Alex, Another issue here : https://gist.github.com/anonymous/7820394#file-immutableappendonlylist-cs-L194 You are decrementing count by 1, but nToRemove can be more than that.
Alex, another issue here: https://gist.github.com/anonymous/7820394#file-immutableappendonlylist-cs-L82
You need to do i < _head + _count.
Alex, Here is the fully fixed code: https://gist.github.com/ayende/7848030
well that should teach me not to do copy - paste - strip - edit gists anymore.
The implementation I am using does a few additional things to keep track of both data sync & reader limits. I figured creating a gist of a modified version stripped down to the essentials would be a good idea. I should have checked the result a bit better I guess ...
Thanks for the corrections.
There appears to be another copy-paste bug in there at https://gist.github.com/ayende/7848030#file-immutableappendonlylist-cs-L187
_count - 1 should be _count - nToRemove.
Alex, Thanks, fixed.
Ayende, Basically, the implementation is:
* a concurrent dictionary with the current state
* a singly linked list / queue-without-head with change undos
When you start a read transaction, the latest node is part of its state. When you want to read a value, you first walk the nodes. (skip the node you remembered, because those changes happened before you started the transaction)
When you write, you append a node with the old values of the updates. Set this new node as the latest node. Let the GC clean up nodes that no transaction can reach anymore.
So when you want to write "newValue" to "key":
1. append a node with ("key" : "oldValue")
2. set the current node to this new node
3. set "key" to "newValue" in the live dict.
This assumes the following is rare:
T0: start read transaction + read a value
T1: finish write transaction
T2: read another value in the read transaction
Because only in such a situation does walking the nodes incur overhead, which gets worse linearly with the number of items written / write transactions finished.
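(A hypothetical sketch of this undo-log scheme, assuming long -> long entries and a single writer; the walk keeps the oldest matching undo record, i.e. the value as of the moment the reader pinned its node:)

using System.Collections.Concurrent;

class UndoNode
{
    public long Key;
    public long OldValue;
    public bool HadValue;   // false if the key did not exist before the write
    public UndoNode Next;   // the next-older node
}

class UndoLogDictionary
{
    private readonly ConcurrentDictionary<long, long> _live =
        new ConcurrentDictionary<long, long>();
    private volatile UndoNode _last = new UndoNode(); // sentinel

    // A reader pins the current node when its transaction starts.
    public UndoNode StartRead() { return _last; }

    // Single writer: record the value being replaced, then update the live dict.
    public void Set(long key, long value)
    {
        long old;
        var had = _live.TryGetValue(key, out old);
        _last = new UndoNode { Key = key, OldValue = old, HadValue = had, Next = _last };
        _live[key] = value;
    }

    // Reader: walk undo nodes newer than the pinned one; the oldest match
    // holds the value as of pin time. Fall back to the live dictionary.
    public bool TryGetValue(UndoNode pinned, long key, out long value)
    {
        value = 0;
        bool found = false, had = false;
        for (var n = _last; n != pinned; n = n.Next)
        {
            if (n.Key == key) { found = true; value = n.OldValue; had = n.HadValue; }
        }
        if (found) return had;
        return _live.TryGetValue(key, out value);
    }
}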
Patrick, Yes, we ended up with something very similar.