Raven performance testing
Well, Raven is pretty much done now from a feature perspective, so now is the time to look at the performance numbers, see where I goofed up, etc. I decided to use the StackOverflow Data Dump (the March 2010 one) as my sample source, because it is a large, accessible, real-world data set that I can utilize.
I quickly wrote a simple ETL process to read the StackOverflow dump files and load them into Raven. I’ll speak about the ETL process in more detail in a future post, but for now, I want to talk about the numbers.
The ETL approach I used isn’t the most efficient one, I’ll admit. It involves doing multiple passes over the data. Basically it goes like this (a rough sketch in code follows the list):
- foreach user in users.xml
  - insert user document
- foreach badge in badges.xml
  - update appropriate user document with new badge
- foreach post in posts.xml
  - insert post document
- foreach vote in votes.xml
  - update post with vote
- foreach comment in comments.xml
  - update post with comment
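Here is a rough sketch of that process in Python, talking to Raven's HTTP interface. The put_doc/get_doc helpers, the server address, and the exact document shapes are illustrative only, not Raven's client API:

```python
import xml.etree.ElementTree as ET
import requests

RAVEN = "http://localhost:8080"  # assumed server address

def put_doc(doc_id, doc):
    # Create or overwrite a document at a known key.
    requests.put(f"{RAVEN}/docs/{doc_id}", json=doc)

def get_doc(doc_id):
    return requests.get(f"{RAVEN}/docs/{doc_id}").json()

# Pass 1: insert a document per user.
for _, elem in ET.iterparse("users.xml"):
    if elem.tag == "row":
        put_doc(f"users/{elem.get('Id')}", dict(elem.attrib))
        elem.clear()  # keep memory flat while streaming the XML

# Pass 2: read-modify-write each user document to add badges.
# This is where the duplicated work comes from: one GET + one PUT per badge.
for _, elem in ET.iterparse("badges.xml"):
    if elem.tag == "row":
        user = get_doc(f"users/{elem.get('UserId')}")
        user.setdefault("badges", []).append(
            {"Name": elem.get("Name"), "Date": elem.get("Date")})
        put_doc(f"users/{elem.get('UserId')}", user)
        elem.clear()

# Passes 3-5 (posts, votes, comments) follow exactly the same pattern.
```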
As you can imagine, this means that we are asking the server to do a lot of duplicated work. It would be better to pre-prepare the values and insert each document only once, instead of inserting & then repeatedly updating it. Unfortunately, the data sizes are large enough that trying to do this in memory is too expensive. I can think of several approaches to optimize this, but in the end, I don’t really see a reason to. This ETL process is probably how people will write it in the real world, so there is no point in trying too hard.
In the end, using the March 2010 dump from Stack Overflow, I ended up with a Raven DB instance with 2,329,607 documents. From random sampling, most documents are 1 KB – 5 KB in size, with several that are significantly larger than that.
Here are two typical documents:
// posts/2321816
{
  "LastEditorUserId": 200145,
  "LastEditorDisplayName": "",
  "PostTypeId": 1,
  "Id": 2321816,
  "Title": "Storing commercial files on the server",
  "Score": 0,
  "CreationDate": "\/Date(1266952829257+0200)\/",
  "CommentCount": 2,
  "AcceptedAnswerId": 2321854,
  "LastActivityDate": "\/Date(1266953391687+0200)\/",
  "Tags": "",
  "LastEditDate": "\/Date(1266953391687+0200)\/",
  "Body": "Where would you store files that are meant for sale on an e-commerce website? \n",
  "OwnerUserId": 200145,
  "AnswerCount": 3,
  "ViewCount": 45,
  "comments": [
    {
      "Score": null,
      "CreationDate": "\/Date(1266952919510+0200)\/",
      "Text": "are they \"sensitive\" information?",
      "UserId": "users/203907"
    },
    {
      "Score": null,
      "CreationDate": "\/Date(1266953092057+0200)\/",
      "Text": "I wouldn't say they are sensitive information. They are just commercial files and a free access to them shouldn't be allowed.",
      "UserId": "users/200145"
    }
  ]
}

// users/33
{
  "Age": 28,
  "CreationDate": "\/Date(1217583130963+0300)\/",
  "Id": 33,
  "UpVotes": 354,
  "LastAccessDate": "\/Date(1267455753720+0200)\/",
  "DisplayName": "John",
  "Location": "Southampton, England",
  "AboutMe": "C# and VB.net Developer working primarily in windows service and winforms applications.\r\n\r\n ",
  "EmailHash": "d0b76ae7bf261316683cad31ba0bad91",
  "Reputation": 3209,
  "DownVotes": 3,
  "Views": 334,
  "badges": [
    { "Name": "Teacher", "Dates": [ "\/Date(1221458104020+0300)\/" ] },
    { "Name": "Student", "Dates": [ "\/Date(1221458104190+0300)\/" ] },
    { "Name": "Editor", "Dates": [ "\/Date(1221458104377+0300)\/" ] },
    { "Name": "Cleanup", "Dates": [ "\/Date(1221458104470+0300)\/" ] },
    { "Name": "Organizer", "Dates": [ "\/Date(1221458104737+0300)\/" ] },
    { "Name": "Supporter", "Dates": [ "\/Date(1221458104893+0300)\/" ] },
    { "Name": "Critic", "Dates": [ "\/Date(1221458104987+0300)\/" ] },
    { "Name": "Citizen Patrol", "Dates": [ "\/Date(1221458105173+0300)\/" ] },
    { "Name": "Scholar", "Dates": [ "\/Date(1221458105483+0300)\/" ] },
    { "Name": "Enlightened", "Dates": [ "\/Date(1221458112677+0300)\/" ] },
    { "Name": "Taxonomist", "Dates": [ "\/Date(1221458113427+0300)\/" ] },
    { "Name": "Nice Answer", "Dates": [ "\/Date(1221458638367+0300)\/", "\/Date(1236274052530+0200)\/", "\/Date(1244026052343+0300)\/", "\/Date(1244726552923+0300)\/", "\/Date(1257249754030+0200)\/" ] },
    { "Name": "Nice Question", "Dates": [ "\/Date(1225182453990+0200)\/", "\/Date(1231624653367+0200)\/" ] },
    { "Name": "Commentator", "Dates": [ "\/Date(1227767555493+0200)\/" ] },
    { "Name": "Autobiographer", "Dates": [ "\/Date(1233569254650+0200)\/" ] },
    { "Name": "Necromancer", "Dates": [ "\/Date(1234393653060+0200)\/", "\/Date(1257860556480+0200)\/" ] },
    { "Name": "Popular Question", "Dates": [ "\/Date(1236054752283+0200)\/", "\/Date(1248302252213+0300)\/", "\/Date(1248607054807+0300)\/", "\/Date(1250013763393+0300)\/", "\/Date(1251215254023+0300)\/", "\/Date(1258400556113+0200)\/" ] },
    { "Name": "Yearling", "Dates": [ "\/Date(1249237664163+0300)\/" ] },
    { "Name": "Notable Question", "Dates": [ "\/Date(1249583857093+0300)\/" ] },
    { "Name": "Beta", "Dates": [ "\/Date(1221512400000+0300)\/" ] },
    { "Name": "Self-Learner", "Dates": [ "\/Date(1251201753523+0300)\/" ] },
    { "Name": "Civic Duty", "Dates": [ "\/Date(1260347854457+0200)\/" ] }
  ]
}
The database size is 6.29 GB (out of which about 158 MB is for the default index).
Total number of operations: 4,667,100
The first major issue was that I couldn’t tell how many documents I had in the database: getting the document count turned out to be an O(N) operation(!). Thankfully, that was easy to fix.
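The fix is the obvious one: maintain the count as documents are inserted and deleted, instead of scanning. A toy illustration of the idea (not Raven's actual storage code):

```python
class DocStore:
    """Toy store illustrating an O(1) document count via a maintained counter."""

    def __init__(self):
        self.docs = {}
        self.count = 0  # in a real engine this is persisted alongside the data

    def put(self, key, doc):
        if key not in self.docs:
            self.count += 1  # only brand-new keys bump the counter
        self.docs[key] = doc

    def delete(self, key):
        if self.docs.pop(key, None) is not None:
            self.count -= 1

    def stats(self):
        return {"CountOfDocuments": self.count}  # O(1), no scan needed
```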
The second major issue was that Raven didn’t handle indexing a lot of documents very well; it would index each document as a standalone operation. The problem is that there are additional fixed costs for doing this (opening & closing the index for writing, mostly), which slow things down enormously. I fixed that by implementing index merging, so documents that were inserted at the same time would be indexed together (up to some limit that I am still playing with).
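Conceptually, the merging looks something like the sketch below; it amortizes the expensive open/write/close cycle over a whole batch. The open_writer/add/close calls and the 1024 cap are stand-ins, not the actual Raven or Lucene API:

```python
MAX_BATCH = 1024  # invented cap; the real limit is still being tuned

def index_documents(work_queue, index):
    while True:
        batch = [work_queue.get()]  # block until at least one document arrives
        # Drain whatever else is already waiting, up to the cap, so documents
        # inserted at about the same time share one open/write/close cycle.
        while len(batch) < MAX_BATCH and not work_queue.empty():
            batch.append(work_queue.get_nowait())

        writer = index.open_writer()   # the expensive, once-per-batch part
        try:
            for doc in batch:
                writer.add(doc)
        finally:
            writer.close()             # flush & commit once for the whole batch
```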
Once those were fixed, I could try doing some measurements…
- With indexing disabled, inserting 2.3 million documents takes about 1 hour & 15 minutes. Considering that we made 4.6 million operations (including inserts & updates), we are talking about over 1,000 operations per second (using single threaded mode); the arithmetic is sketched after this list.
I am very happy with those results.
- With indexing enabled, and a single index defined, the process takes much longer: about 3 hours & 15 minutes, giving us about 400 operations per second (again, single threaded), or about 2.5 milliseconds per operation.
- Waiting for the background indexing task to complete took a lot longer, another 2 hours & 45 minutes. This gives me just over 200 documents indexed per second.
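For reference, the arithmetic behind those figures (the last line assumes the docs/sec number is measured over the 2h45m tail alone):

```python
ops = 4_667_100   # total insert + update operations
docs = 2_329_607  # documents in the database

print(ops / (1 * 3600 + 15 * 60))         # ~1037 ops/sec, indexing disabled
print(ops / (3 * 3600 + 15 * 60))         # ~399 ops/sec, indexing enabled
print((3 * 3600 + 15 * 60) * 1000 / ops)  # ~2.5 ms per operation
print(docs / (2 * 3600 + 45 * 60))        # ~235 docs/sec indexed over the tail
```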
I am pleased, but not really happy about those results. I think that we can do better, and I certainly intend to optimize things.
But that is for later, right now I want to see how Raven behaves when it has that many documents.
What we test | URL | Timing |
--- | --- | --- |
Looking up database stats | GET /stats | 00:00:00.0007919 |
Browse documents (start) | GET /docs/?start=0&pageSize=25 | 00:00:00.0429041 |
Browse documents (toward the end) | GET /docs/?start=2300000&pageSize=25 | 00:00:00.0163617 |
Get document by id (toward the start) | GET /docs/users/32 | 00:00:00.0017779 |
Get document by id (toward the end) | GET /docs/posts/2321034 | 00:00:00.0022796 |
Query index (start) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25 | 00:00:00.0388772 |
Query index (toward the end) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=100000 | 00:00:00.5988617 |
Query index (midway point) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=20000 | 00:00:00.1644477 |
Query index #2 (start) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25 | 00:00:00.4957742 |
Query index #2 (toward the end) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=2000000 | 00:00:07.3789415 |
Query index #2 (midway point) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=200000 | 00:00:02.2621174 |
The results are very interesting. It is gratifying to see that browsing and retrieving documents are blazing fast, and once I fixed the O(N) issue on the stats, that is fast as hell as well.
Querying the indexes is interesting. It is clear that Lucene doesn’t like deep paging. On the other hand, I think it is safe to say that we don’t really have to worry about this, since deep paging (going to page #4,000) is very unlikely, and we can take the half-second hit when it does happen.
Querying index #2 is troublesome, though. I don’t think that it should take that long (without deep paging, that is; if you want to page to page 80,000, please wait), even if that query returns ~2.1 million results. I suspect that this is because of some of the Lucene options that we use, so I’ll have to look further into that.
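The deep-paging cost makes sense once you consider how a search engine typically serves start=N: it cannot seek directly to result N, it has to collect and rank the best N + pageSize hits and only then discard the first N. A toy illustration of that behavior (not Lucene's actual collector code):

```python
import heapq

def page(scored_hits, start, page_size):
    """scored_hits: iterable of (score, doc_id) for every matching document."""
    top_n = start + page_size  # the work grows with how deep you page
    # Keep only the best `start + page_size` hits seen across all matches.
    best = heapq.nlargest(top_n, scored_hits)
    return best[start:start + page_size]  # throw away everything before `start`
```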
Nitpicker corner: those are not the final numbers; we intend to improve upon them.
Comments
Incredible results!!! Been reading this series from the start and things look like they are really coming together. Congrats!
Very nice, performance looks impressive. Raven seems to be maturing with features at a good pace.
I take it there is no de/serialization involved and you're just dumping the json text value from lucene to the response stream?
Do these RESTish requests come as part of the Raven solution, or is this interface something that the client is expected to provide?
i.e. do I get it for free, and am I able to page any query I've defined without writing a custom HTTP handler?
Either way, this seems like a good CouchDB-style solution, and the built-in Lucene is a killer feature - it should be a prime candidate for building pure Ajax apps with.
Maybe I should renew efforts in my ajax framework ( http://www.ajaxstack.com/AjaxStack.Demo/ ) and get it to talk directly to a Raven backend; the built-in Lucene index would come in handy, as I currently maintain my own search index. What's the ETA on a public release? Is it going to be open source?
Just one q: what are you using to serialize to JSON? I'm having a hard time finding any good ones - are you just using the BCL's JsonDataContractSerializer?
For GETs, no, there is no serialization, we simply dump it to the client.
For queries, there is, because you have to merge multiple documents together.
The Lucene index doesn't actually hold all the data, only the indexed parts.
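In other words, the two code paths differ roughly like this (a sketch of the idea only; read_raw_json and index.search are stand-in names, not Raven's actual internals):

```python
def handle_get(storage, doc_id, response):
    # Document GET: the stored value already *is* the JSON text,
    # so it can be streamed out with no deserialization step.
    response.write(storage.read_raw_json(doc_id))

def handle_query(storage, index, query, response):
    # Query: Lucene only stores the indexed fields, so it returns
    # matching document ids; the documents themselves are loaded
    # from storage and merged into a single JSON array.
    doc_ids = index.search(query)
    docs = [storage.read_raw_json(doc_id) for doc_id in doc_ids]
    response.write("[" + ",".join(docs) + "]")
```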
All of those requests are handled as part of Raven itself, you don't have to write anything to make this happen.
ETA of release is weeks, and yes, it is going to be open source.
I am using Newtonsoft.Json for that, I love it.
Would it be possible to have some sort of comparison with an SQL database here?
Not sure if this would make for a logical comparison, but I think most devs are familiar with SQL databases, so a performance comparison with something open source like PostgreSQL / MySQL / SQLite would be nice to see.
Raven seems to be doing a whole lot more than just store data, so that part would have to be reconstructed, but still it might be good to see.