Raven performance testing
Well, Raven is pretty much done now from a feature perspective, so now is the time to look at the performance numbers, see where I goofed up, etc. I decided to use the StackOverflow Data Dump (the March 2010 one) as my sample source, because it is a large, accessible, real-world data set that I can utilize.
I quickly wrote a simple ETL process to read the StackOverflow dump files and load them into Raven. I’ll speak about the ETL process in more detail in a future post, but for now, I want to talk about the numbers.
The ETL approach I used isn’t the most efficient one, I’ll admit. It involves doing multiple passes over the data. Basically it goes like this (a rough sketch in code follows the list):
- foreach user in users.xml
  - insert user document
- foreach badge in badges.xml
  - update appropriate user document with new badge
- foreach post in posts.xml
  - insert post document
- foreach vote in votes.xml
  - update post with vote
- foreach comment in comments.xml
  - update post with comment
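Here is a rough sketch of that process in Python, talking to Raven's HTTP interface. The put_doc/get_doc helpers, the server address, and the exact document shapes are illustrative only, not Raven's client API:

```python
import xml.etree.ElementTree as ET
import requests

RAVEN = "http://localhost:8080"  # assumed server address

def put_doc(doc_id, doc):
    # Create or overwrite a document at a known key.
    requests.put(f"{RAVEN}/docs/{doc_id}", json=doc)

def get_doc(doc_id):
    return requests.get(f"{RAVEN}/docs/{doc_id}").json()

# Pass 1: insert a document per user.
for _, elem in ET.iterparse("users.xml"):
    if elem.tag == "row":
        put_doc(f"users/{elem.get('Id')}", dict(elem.attrib))
        elem.clear()  # keep memory flat while streaming the XML

# Pass 2: read-modify-write each user document to add badges.
# This is where the duplicated work comes from: one GET + one PUT per badge.
for _, elem in ET.iterparse("badges.xml"):
    if elem.tag == "row":
        user = get_doc(f"users/{elem.get('UserId')}")
        user.setdefault("badges", []).append(
            {"Name": elem.get("Name"), "Date": elem.get("Date")})
        put_doc(f"users/{elem.get('UserId')}", user)
        elem.clear()

# Passes 3-5 (posts, votes, comments) follow exactly the same pattern.
```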
As you can imagine, this means that we are asking the server to do a lot of duplicated work. It would be better to pre-prepare the values and insert each document only once, instead of inserting & then repeatedly updating it. Unfortunately, the data sizes are large enough that trying to do this in memory is too expensive. I can think of several approaches to optimize this, but in the end, I don’t really see a reason to. This ETL process is probably how people will write it in the real world, so there is no point in trying too hard.
In the end, using the March 2010 dump from Stack Overflow, I ended up with a Raven DB instance with 2,329,607 documents. From random sampling, most documents are 1 KB – 5 KB in size, with several that are significantly larger than that.
Here are two typical documents:
// posts/2321816
{
  "LastEditorUserId": 200145,
  "LastEditorDisplayName": "",
  "PostTypeId": 1,
  "Id": 2321816,
  "Title": "Storing commercial files on the server",
  "Score": 0,
  "CreationDate": "\/Date(1266952829257+0200)\/",
  "CommentCount": 2,
  "AcceptedAnswerId": 2321854,
  "LastActivityDate": "\/Date(1266953391687+0200)\/",
  "Tags": "",
  "LastEditDate": "\/Date(1266953391687+0200)\/",
  "Body": "Where would you store files that are meant for sale on an e-commerce website? \n",
  "OwnerUserId": 200145,
  "AnswerCount": 3,
  "ViewCount": 45,
  "comments": [
    {
      "Score": null,
      "CreationDate": "\/Date(1266952919510+0200)\/",
      "Text": "are they \"sensitive\" information?",
      "UserId": "users/203907"
    },
    {
      "Score": null,
      "CreationDate": "\/Date(1266953092057+0200)\/",
      "Text": "I wouldn't say they are sensitive information. They are just commercial files and a free access to them shouldn't be allowed.",
      "UserId": "users/200145"
    }
  ]
}

// users/33
{
  "Age": 28,
  "CreationDate": "\/Date(1217583130963+0300)\/",
  "Id": 33,
  "UpVotes": 354,
  "LastAccessDate": "\/Date(1267455753720+0200)\/",
  "DisplayName": "John",
  "Location": "Southampton, England",
  "AboutMe": "C# and VB.net Developer working primarily in windows service and winforms applications.\r\n\r\n ",
  "EmailHash": "d0b76ae7bf261316683cad31ba0bad91",
  "Reputation": 3209,
  "DownVotes": 3,
  "Views": 334,
  "badges": [
    { "Name": "Teacher", "Dates": [ "\/Date(1221458104020+0300)\/" ] },
    { "Name": "Student", "Dates": [ "\/Date(1221458104190+0300)\/" ] },
    { "Name": "Editor", "Dates": [ "\/Date(1221458104377+0300)\/" ] },
    { "Name": "Cleanup", "Dates": [ "\/Date(1221458104470+0300)\/" ] },
    { "Name": "Organizer", "Dates": [ "\/Date(1221458104737+0300)\/" ] },
    { "Name": "Supporter", "Dates": [ "\/Date(1221458104893+0300)\/" ] },
    { "Name": "Critic", "Dates": [ "\/Date(1221458104987+0300)\/" ] },
    { "Name": "Citizen Patrol", "Dates": [ "\/Date(1221458105173+0300)\/" ] },
    { "Name": "Scholar", "Dates": [ "\/Date(1221458105483+0300)\/" ] },
    { "Name": "Enlightened", "Dates": [ "\/Date(1221458112677+0300)\/" ] },
    { "Name": "Taxonomist", "Dates": [ "\/Date(1221458113427+0300)\/" ] },
    { "Name": "Nice Answer", "Dates": [ "\/Date(1221458638367+0300)\/", "\/Date(1236274052530+0200)\/", "\/Date(1244026052343+0300)\/", "\/Date(1244726552923+0300)\/", "\/Date(1257249754030+0200)\/" ] },
    { "Name": "Nice Question", "Dates": [ "\/Date(1225182453990+0200)\/", "\/Date(1231624653367+0200)\/" ] },
    { "Name": "Commentator", "Dates": [ "\/Date(1227767555493+0200)\/" ] },
    { "Name": "Autobiographer", "Dates": [ "\/Date(1233569254650+0200)\/" ] },
    { "Name": "Necromancer", "Dates": [ "\/Date(1234393653060+0200)\/", "\/Date(1257860556480+0200)\/" ] },
    { "Name": "Popular Question", "Dates": [ "\/Date(1236054752283+0200)\/", "\/Date(1248302252213+0300)\/", "\/Date(1248607054807+0300)\/", "\/Date(1250013763393+0300)\/", "\/Date(1251215254023+0300)\/", "\/Date(1258400556113+0200)\/" ] },
    { "Name": "Yearling", "Dates": [ "\/Date(1249237664163+0300)\/" ] },
    { "Name": "Notable Question", "Dates": [ "\/Date(1249583857093+0300)\/" ] },
    { "Name": "Beta", "Dates": [ "\/Date(1221512400000+0300)\/" ] },
    { "Name": "Self-Learner", "Dates": [ "\/Date(1251201753523+0300)\/" ] },
    { "Name": "Civic Duty", "Dates": [ "\/Date(1260347854457+0200)\/" ] }
  ]
}
The database size is 6.29 GB (out of which about 158 MB is for the default index).
Total number of operations: 4,667,100
The first major issue was that I couldn’t tell how many documents I had in the database: getting the document count turned out to be an O(N) operation(!). Thankfully, that was easy to fix.
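The fix is the obvious one: maintain the count as documents are inserted and deleted, instead of scanning. A toy illustration of the idea (not Raven's actual storage code):

```python
class DocStore:
    """Toy store illustrating an O(1) document count via a maintained counter."""

    def __init__(self):
        self.docs = {}
        self.count = 0  # in a real engine this is persisted alongside the data

    def put(self, key, doc):
        if key not in self.docs:
            self.count += 1  # only brand-new keys bump the counter
        self.docs[key] = doc

    def delete(self, key):
        if self.docs.pop(key, None) is not None:
            self.count -= 1

    def stats(self):
        return {"CountOfDocuments": self.count}  # O(1), no scan needed
```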
The second major issue was that Raven didn’t handle indexing a lot of documents very well; it would index each document as a standalone operation. The problem is that there are additional fixed costs for doing this (opening & closing the index for writing, mostly), which slow things down enormously. I fixed that by implementing index merging, so documents that were inserted at the same time would be indexed together (up to some limit that I am still playing with).
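Conceptually, the merging looks something like the sketch below; it amortizes the expensive open/write/close cycle over a whole batch. The open_writer/add/close calls and the 1024 cap are stand-ins, not the actual Raven or Lucene API:

```python
MAX_BATCH = 1024  # invented cap; the real limit is still being tuned

def index_documents(work_queue, index):
    while True:
        batch = [work_queue.get()]  # block until at least one document arrives
        # Drain whatever else is already waiting, up to the cap, so documents
        # inserted at about the same time share one open/write/close cycle.
        while len(batch) < MAX_BATCH and not work_queue.empty():
            batch.append(work_queue.get_nowait())

        writer = index.open_writer()   # the expensive, once-per-batch part
        try:
            for doc in batch:
                writer.add(doc)
        finally:
            writer.close()             # flush & commit once for the whole batch
```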
Once those were fixed, I could try doing some measurements…
- With indexing disabled, inserting 2.3 million documents takes about 1 hour & 15 minutes. Considering that we made 4.6 million operations (including inserts & updates), we are talking about over 1,000 operations per second (using single threaded mode); the arithmetic is sketched after this list.
I am very happy with those results.
- With indexing enabled, and a single index defined, the process takes much longer: about 3 hours & 15 minutes, giving us about 400 operations per second (again, single threaded), or about 2.5 milliseconds per operation.
- Waiting for the background indexing task to complete took a lot longer, another 2 hours & 45 minutes. This gives me just over 200 documents indexed per second.
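For reference, the arithmetic behind those figures (the last line assumes the docs/sec number is measured over the 2h45m tail alone):

```python
ops = 4_667_100   # total insert + update operations
docs = 2_329_607  # documents in the database

print(ops / (1 * 3600 + 15 * 60))         # ~1037 ops/sec, indexing disabled
print(ops / (3 * 3600 + 15 * 60))         # ~399 ops/sec, indexing enabled
print((3 * 3600 + 15 * 60) * 1000 / ops)  # ~2.5 ms per operation
print(docs / (2 * 3600 + 45 * 60))        # ~235 docs/sec indexed over the tail
```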
I am pleased, but not really happy about those results. I think that we can do better, and I certainly intend to optimize things.
But that is for later, right now I want to see how Raven behaves when it has that many documents.
What we test | URL | Timing |
--- | --- | --- |
Looking up database stats | GET /stats | 00:00:00.0007919 |
Browse documents (start) | GET /docs/?start=0&pageSize=25 | 00:00:00.0429041 |
Browse documents (toward the end) | GET /docs/?start=2300000&pageSize=25 | 00:00:00.0163617 |
Get document by id (toward the start) | GET /docs/users/32 | 00:00:00.0017779 |
Get document by id (toward the end) | GET /docs/posts/2321034 | 00:00:00.0022796 |
Query index (start) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25 | 00:00:00.0388772 |
Query index (toward the end) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=100000 | 00:00:00.5988617 |
Query index (midway point) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=20000 | 00:00:00.1644477 |
Query index #2 (start) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25 | 00:00:00.4957742 |
Query index #2 (toward the end) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=2000000 | 00:00:07.3789415 |
Query index #2 (midway point) | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=200000 | 00:00:02.2621174 |
The results are very interesting. It is gratifying to see that browsing and retrieving documents are blazing fast, and once I fixed the O(N) issue on the stats, that is fast as hell as well.
Querying the indexes is interesting. It is clear that Lucene doesn’t like deep paging. On the other hand, I think it is safe to say that we don’t really have to worry about this, since deep paging (going to page #4,000) is very unlikely, and we can take the half-second hit when it does happen.
Querying index #2 is troublesome, though. I don’t think that it should take that long (without deep paging, that is; if you want to page to page 80,000, please wait), even if that query returns ~2.1 million results. I suspect that this is because of some of the Lucene options that we use, so I’ll have to look further into that.
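The deep-paging cost makes sense once you consider how a search engine typically serves start=N: it cannot seek directly to result N, it has to collect and rank the best N + pageSize hits and only then discard the first N. A toy illustration of that behavior (not Lucene's actual collector code):

```python
import heapq

def page(scored_hits, start, page_size):
    """scored_hits: iterable of (score, doc_id) for every matching document."""
    top_n = start + page_size  # the work grows with how deep you page
    # Keep only the best `start + page_size` hits seen across all matches.
    best = heapq.nlargest(top_n, scored_hits)
    return best[start:start + page_size]  # throw away everything before `start`
```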
Nitpicker corner: those are not the final numbers; we intend to improve upon them.
Comments
Incredible results!!! Been reading this series from the start and things look like they are really coming together. Congrats!
Very nice, performance looks impressive. Raven seems to be maturing with features at a good pace.
I take it there is no de/serialization involved and you're just dumping the json text value from lucene to the response stream?
Do these RESTish requests come as part of the Raven solution, or is this interface something that the client is expected to provide?
i.e. do I get it for free, and am I able to page any query I've defined without writing a custom HTTP handler?
Either way, this seems like a good CouchDB-style solution, and the built-in Lucene is a killer feature - it should be a prime candidate for building pure Ajax apps with.
Maybe I should renew efforts in my ajax framework ( http://www.ajaxstack.com/AjaxStack.Demo/ ) and get it to talk directly to a Raven backend; the built-in Lucene index would come in handy, as I currently maintain my own search index. What's the ETA on a public release? Is it going to be open source?
Just one q: what are you using to serialize to JSON? I'm having a hard time finding any good ones - are you just using the BCL's JsonDataContractSerializer?
For GETs, no, there is no serialization, we simply dump it to the client.
For queries, there is, because you have to merge multiple documents together.
The Lucene index doesn't actually hold all the data, only the indexed parts.
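In other words, the two code paths differ roughly like this (a sketch of the idea only; read_raw_json and index.search are stand-in names, not Raven's actual internals):

```python
def handle_get(storage, doc_id, response):
    # Document GET: the stored value already *is* the JSON text,
    # so it can be streamed out with no deserialization step.
    response.write(storage.read_raw_json(doc_id))

def handle_query(storage, index, query, response):
    # Query: Lucene only stores the indexed fields, so it returns
    # matching document ids; the documents themselves are loaded
    # from storage and merged into a single JSON array.
    doc_ids = index.search(query)
    docs = [storage.read_raw_json(doc_id) for doc_id in doc_ids]
    response.write("[" + ",".join(docs) + "]")
```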
All of those requests are handled as part of Raven itself, you don't have to write anything to make this happen.
ETA of release is weeks, and yes, it is going to be open source.
I am using Newtonsoft.Json for that, I love it.
Would it be possible to have some sort of comparison with an SQL database here?
Not sure if this would make for a logical comparison, but I think most devs are familiar with SQL databases, so a performance comparison with something open source like PostgreSQL / MySQL / SQLite would be nice to see.
Raven seems to be doing a whole lot more than just store data, so that part would have to be reconstructed, but still it might be good to see.