What is up with RavenDB 2.0? Performance…
Well, one thing that we put a lot of focus on was performance. In order to test that, I had a dataset of 4.66 million documents (IMDB data set, if you care) as well as two indexes defined.
The results for RavenDB 2.0 (drum roll):
Loading 4.66 million records in 44 minutes. That averages out to a bit over half a millisecond per document.
But wait, what about the indexes? Well, RavenDB indexes documents as they come in, so as we were inserting the documents, they were indexed along the way. That meant that 11 seconds after we were done putting 4.66 million documents into RavenDB, we were done indexing (across all indexes).
Pretty nice perf, even if I say so myself.
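As a rough sanity check on those figures, here is a minimal sketch that turns the rounded numbers quoted above (4.66 million documents, 44 minutes) into a throughput and a per-document cost. The function and variable names are made up for illustration, and the real measured values may differ slightly from what the rounded inputs give.

```python
# Rounded figures quoted in the post; not independently measured.
DOCUMENTS = 4_660_000
LOAD_MINUTES = 44


def throughput(documents, minutes):
    """Return (documents per second, milliseconds per document)."""
    seconds = minutes * 60
    docs_per_second = documents / seconds
    ms_per_document = seconds * 1000 / documents
    return docs_per_second, ms_per_document


docs_per_second, ms_per_document = throughput(DOCUMENTS, LOAD_MINUTES)
print(f"{docs_per_second:.0f} documents/second")
print(f"{ms_per_document:.2f} ms per document")
```

With these inputs it comes out to roughly 1,765 documents per second, or about 0.57 ms per document, and that includes indexing as the documents arrive.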
Comments
"Loading 4.66 million records in 44 minutes"
This means about 1,765 documents / second.
Is the I/O channel saturated from this? Disk write speed maxed out?
I don't know about the complexity of those documents; however, an ETL process can reach over 50k "rows" per second on my modest machine using bulk load.
Therefore I think it would be interesting to see some benchmarks for small documents (1 property), medium (10-100 properties) and large (1000+ properties), and the I/O characteristics of RavenDB during such operations.
Pop, This is meant to show indexing performance more than anything else. Bulk load is doing something quite different.
What options do you have for doing an actual bulk load? Say if we wanted to load 250m moderately complex documents - is there some kind of bulk load option which can do batch indexing after?
May I ask where you downloaded the IMDB dataset from?
@Jeús http://www.imdb.com/interfaces
Would be great if you could direct us to your ETL process. I noticed the old ETL project in the raven source is no longer there?
Is it possible to get a comparison with Raven 1.x's performance using the same dataset and hardware?
Jamie, We will have bulk load work done after the release. It is a bit involved, as you might imagine.
The "ETL code" is just the smuggler.
I really don't understand why people care so much about 'bulk load' performance. I mean really, what's the difference between writing 1,000 or 5,000 documents per second WITHOUT indexing?
The whole point about Raven is that it has indexes for you to do calculations or queries. If you don't need that, you have a key/value store for which you don't need Raven in the first place.
Perf metrics without indexing are useless.
Daniel: Of course it matters, Jamie clearly stated why. If you need to store large amounts of data quickly, and only need indexes later, bulking makes sense.
AndersM, Not really. Just loading the data and then waiting for indexing, versus loading the data with indexing running, would take about the same time.
OK, I did not know how Raven would handle this, but answered based on Daniel's numbers :)
Maybe in the RavenDB world... As Will Hughes suggested, it would be more interesting to see the difference with the previous release. Right now it's just some random numbers.
AndersM: My point is - the only metric I care about is the time it takes to do both, writing and indexing. No, I don't mean bulk import of data sets, because this is something you don't do frequently, and when you do it, it's generally not time sensitive (like migrating from another database).
@Daniel Bulk loading should include indexing. In my earlier example, indexing during bulk load is enabled.
Very nice! What do the indexes look like?
So, how did the older version do on this? What improvement (if any) does 2.0 bring?
Do you have the full source code for this perf test?