Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 331 words

Can you tell me why I changed Raven’s server log format from this:

Request #1: GET / - 00:00:00.0038469
Request #2: GET /raven/index.html - 00:00:00.0066264
Request #3: GET /raven/css/rdb.css - 00:00:00.0137499
Request #5: GET /raven/js/jquery-1.4.2.min.js - 00:00:00.0136234
Request #4: GET /raven/css/smoothness/jquery-ui-1.8rc2.custom.css - 00:00:00.0135178
Request #6: GET /raven/js/jquery-ui.js - 00:00:00.0163169
Request #9: GET /raven/js/raven-ui.js - 00:00:00.0017103
Request #7: GET /raven/js/jquery-jtemplates.js - 00:00:00.0026158
Request #8: GET /raven/js/jquery.RavenDB.js - 00:00:00.0026884
Request #10: GET /raven/JSONTemplates/quickStats.html - 00:00:00.0003281
Request #12: GET /raven/images/header_bg.png - 00:00:00.0002728
Request #15: GET /raven/images/ajax-loader.gif - 00:00:00.0004883
Request #14: GET /raven/images/footer_bg.png - 00:00:00.0008783
Request #13: GET /raven/images/logo.png - 00:00:00.0011801
Request #11: GET /stats - 00:00:00.0183869

To this:

Request #   1: GET -     4 ms - 302 - /
Request #   2: GET -     7 ms - 200 - /raven/index.html
Request #   3: GET -     3 ms - 200 - /raven/css/rdb.css
Request #   4: GET -     4 ms - 200 - /raven/css/smoothness/jquery-ui-1.8rc2.custom.css
Request #   5: GET -     8 ms - 200 - /raven/js/jquery-1.4.2.min.js
Request #   8: GET -     5 ms - 200 - /raven/js/jquery.RavenDB.js
Request #   7: GET -     7 ms - 200 - /raven/js/jquery-jtemplates.js
Request #   9: GET -     2 ms - 200 - /raven/js/raven-ui.js
Request #   6: GET -    11 ms - 200 - /raven/js/jquery-ui.js
Request #  10: GET -     0 ms - 200 - /raven/JSONTemplates/quickStats.html
Request #  12: GET -     0 ms - 200 - /raven/images/header_bg.png
Request #  14: GET -     0 ms - 200 - /raven/images/ajax-loader.gif
Request #  15: GET -     0 ms - 200 - /raven/images/footer_bg.png
Request #  13: GET -     1 ms - 200 - /raven/images/logo.png
Request #  11: GET -    31 ms - 200 - /stats
time to read 1 min | 95 words

Take a look at the following code:

image

It is obvious where we need to optimize, right?

Except… each call here takes about 0.8 millisecond. Yes, we could probably optimize this further, but the question is, would it be worth it?

Given a sub millisecond performance, and given that trying to implement a different serialization format would be expensive operation, I think that there just isn’t enough justification to do so.

time to read 2 min | 378 words

Yesterday I talked about a problem that kept appearing, my application getting slower and slower and eventually dying. The code that I was testing it with was:

Parallel.ForEach(Directory.GetFiles("Docs","*.json"), file =>
{
    PostTo("http://localhost:9090/bulk_docs", file);
});

The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request when running in a single threaded mode was 100 – 200 ms.

The problem was, at its root, an unbounded concurrency issue. Here is what happened.

Handling a single request is pretty fast, but it is still going to block for about a 100 ms. The Parallel ForEach implementation is getting notified about the blocked thread, and in order to increase parallelism, it is going to spin another thread and give it the next task. That thread is going to block as well, and after a while, another thread will be created, etc.

This code is big enough, and the response times are built just so to encourage the thread pool to create more and more threads. The server, at the same time, is having to deal with more and more work, causing request times to increase, which causes the client to create more threads, which causes the server to become more and more distressed, until eventually it fails and requests starts to error.

The fun part is that if you reduce the load on the server, it would resume operating normally. But the errors that it gave in the meanwhile was… cryptic. There is a reason why most servers limit their concurrency level, there are always enough clients to saturate a server, and it is better to place a client on a request queue or send him “busy now, come back later” error than to crash because of resource allocation issues somewhere in the middle of processing a request.

time to read 2 min | 267 words

I have been doing a lot of heavy performance testing on Raven, and I run into a lot of very strange scenarios. I found a lot of interesting stuff (runaway cache causing OutOfMemoryException, unnecessary re-parsing, etc). But one thing that I wasn’t able to resolve was the concurrency issue.

In particular, Raven would slow down and crash under load. I scoured the code, trying to figure out what was going on, but I couldn’t figure it out. It seemed that after several minutes of executing, request times would grow longer and longer, until finally the server would start raising errors on most requests.

I am ashamed to say that it took me a while to figure out what was actually going on. Can you figure it out?

Here is the client code:

Parallel.ForEach(Directory.GetFiles("Docs","*.json"), file =>
{
    PostTo("http://localhost:9090/bulk_docs", file);
});

The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request when running in a single threaded mode was 100 – 200 ms.

That should be enough information to figure out what is going on.

Why did the application crash?

time to read 23 min | 4516 words

In order to test Raven behavior with a large amount of data, I needed to load a whole bunch of data to it. I choose StackOverflow Data Dump as my sample source.

That left me with the problem of how to go from data that looks like this:

<!--  users.xml -->
<row Id="1"
         Reputation="18566"
         CreationDate="2008-07-31T14:22:31.287"
         DisplayName="Jeff Atwood"
         EmailHash="51d623f33f8b83095db84ff35e15dbe8"
         LastAccessDate="2010-03-02T07:33:46.390"
         WebsiteUrl="http://www.codinghorror.com/blog/"
         Location="El Cerrito, CA"
         Age="40"
         AboutMe="..."
         Views="23436"
         UpVotes="1972"
         DownVotes="431" />

<!--  badges.xml -->
<row Id="82946"
         UserId="3718"
         Name="Teacher"
         Date="2008-09-15T08:55:03.923" />

<!-- posts.xml -->
<row Id="7"
         PostTypeId="2"
         ParentId="4"
         CreationDate="2008-07-31T22:17:57.883"
         Score="44"
         ViewCount="0"
         Body="..."
         OwnerUserId="9"
         LastEditorUserId="1"
         LastEditorDisplayName=""
         LastEditDate="2010-02-02T19:38:14.493"
         LastActivityDate="2010-02-02T19:38:14.493"
         CommentCount="0" />

<!-- votes.xml -->
<row Id="2"
         PostId="3"
         VoteTypeId="2"
         CreationDate="2008-07-31" />


<!-- comments.xml -->
<row Id="12"
         PostId="47428"
         Score="2"
         Text="..."
         CreationDate="2008-09-06T13:51:47.843"
         UserId="4642" />

To JSON documents. Actually, it is a bit worse, the data dump is relational in nature, which isn’t really suitable for a doc db format. What I actually want is a User document, which contains badges in it and a Post document, which contains votes and comments in it.

My first approach was this:

  • foreach user in users.xml
    • Insert user document
  • foreach badge in badges.xml
    • Get user document from database
    • Update appropriate user document with new badge
  • foreach post in posts.xml
    • Insert post document
  • foreach vote in votes.xml
    • Get Post document from database
    • Update post with vote
  • foreach comment in comments.xml
    • Get Post document from database
    • Update post with comment

Running with an embedded option, with no indexing enabled, that took about 1:15 hours. (That is actually not fair, this process involved reading the XML as well, and resulted in several performance improvements.)

My next attempt was to do something like:

  • foreach user in users.xml & foreach badge in badges.xml
    • Merge badges to appropriate user
    • Insert user document
  • foreach post in posts.xml & foreach vote in votes.xml & foreach comment in comments.xml
    • Marge votes to appropriate post
    • Merge comments to appropriate post
    • Insert post document

The theory was that I can reduce the number of calls to the database, thus reduce the overall cost. The problem is that trying to do this sort of merges in memory take a lot of time and a huge amount of memory. It is perhaps telling that I gave up that approach on a 64 bits machine with 8 GB of RAM.

Finally, I decided to go in a two stage approach, because what I actually wanted to see was the usage of Raven, not the performance of reading the XML. The current ETL process is composed of two steps:

Part 1:

  • foreach user in users.xml
    • Partition in 128 users per batch
    • Generate JSON document representing a batch insert in the database
  • similar to badges, posts, votes and comments

Part 2:

Submit the JSON documents for Raven over HTTP

Here is how it looks like for users:

[
  {
    "Key": "users/-1",
    "Etag": null,
    "Method": "PUT",
    "Document": {
      "UpVotes": 413,
      "EmailHash": "a007be5a61f6aa8f3e85ae2fc18dd66e",
      "AboutMe": "…",
      "Id": -1,
      "LastAccessDate": "\/Date(1219699013810+0300)\/",
      "Views": 649,
      "Location": "on the server farm",
      "DownVotes": 22414,
      "Reputation": 1,
      "CreationDate": "\/Date(1217451600000+0300)\/",
      "Age": 2,
      "DisplayName": "Community",
      "WebsiteUrl": "http://stackoverflow.com"
    },
    "Metadata": {
      "Raven-Entity-Name": "Users"
    }
  },
  {
    "Key": "users/1",
    "Etag": null,
    "Method": "PUT",
    "Document": {
      "UpVotes": 1972,
      "EmailHash": "51d623f33f8b83095db84ff35e15dbe8",
      "AboutMe": "…",
      "Id": 1,
      "LastAccessDate": "\/Date(1267508026390+0200)\/",
      "Views": 23436,
      "Location": "El Cerrito, CA",
      "DownVotes": 431,
      "Reputation": 18566,
      "CreationDate": "\/Date(1217503351287+0300)\/",
      "Age": 40,
      "DisplayName": "Jeff Atwood",
      "WebsiteUrl": "http://www.codinghorror.com/blog/"
    },
    "Metadata": {
      "Raven-Entity-Name": "Users"
    }
  }
]

This is pretty simple, right?

But, what about something like badges? How can we go from something like this:

<!-- badges.xml --> 
<row Id="82946" 
          UserId="3718" 
          Name="Teacher" 
          Date="2008-09-15T08:55:03.923" /> 

To a document update? After all, we don’t have the rest of the document. That would seem to necessitate a fetch, adding the value at the client side and send to the server. However, that is going to be slow. Luckily, Raven support the PATCH command. There isn’t really a diff format for JSON, so we had to invent our own. In this case, the document for adding badges is pretty simple and should be self explanatory:

[
  {
    "Key": "users/3718",
    "Method": "PATCH",
    "Etag": null,
    "Patches": [
      {
        "Type": "Add",
        "Value": {
          "Name": "Teacher",
          "Dates": [
            "\/Date(1221458103923+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Supporter",
          "Dates": [
            "\/Date(1221458104830+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Autobiographer",
          "Dates": [
            "\/Date(1221458105220+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Editor",
          "Dates": [
            "\/Date(1221753452033+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Organizer",
          "Dates": [
            "\/Date(1221753452780+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Scholar",
          "Dates": [
            "\/Date(1222711653513+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Student",
          "Dates": [
            "\/Date(1222753351563+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Critic",
          "Dates": [
            "\/Date(1249633658857+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Beta",
          "Dates": [
            "\/Date(1221512400000+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Yearling",
          "Dates": [
            "\/Date(1251666165617+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Popular Question",
          "Dates": [
            "\/Date(1254097352953+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Commentator",
          "Dates": [
            "\/Date(1254833254790+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      }
    ]
  },
  {
    "Key": "users/994",
    "Method": "PATCH",
    "Etag": null,
    "Patches": [
      {
        "Type": "Add",
        "Value": {
          "Name": "Teacher",
          "Dates": [
            "\/Date(1221458103957+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Supporter",
          "Dates": [
            "\/Date(1221458104830+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Autobiographer",
          "Dates": [
            "\/Date(1221458105220+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Scholar",
          "Dates": [
            "\/Date(1226316765340+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Student",
          "Dates": [
            "\/Date(1226358152000+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Popular Question",
          "Dates": [
            "\/Date(1250109153087+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Beta",
          "Dates": [
            "\/Date(1221512400000+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Critic",
          "Dates": [
            "\/Date(1257496652637+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Editor",
          "Dates": [
            "\/Date(1257496952257+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      }
    ]
  }
]

The fun part about the whole thing is now I am truly testing just Raven’s performance, because the client side is now just reading files and submitting requests to Raven. Another important factor is that instead of issuing a single request per document, we are able to batch them together in groups of 128 requests and submit them to /bulk_docs.

Using this process, we see a pretty consistent average of about 1,000 documents per second for batched inserts and somewhat less than that for batched PATCH requests.

Just to give you some ideas, here are are the file sizes per type:

  • Users – 65 KB
  • Badges - ~500 Kb
  • Posts – 150 Kb
  • Comments – 70 Kb
  • Votes – 40 Kb

Note that each file contains 128 requests, and that those aren’t final document size. User document is made up of the user data and badges, for example.

But so far, I am pretty happy about it.

time to read 1 min | 122 words

After trying it out on NH Prof, profiler subscriptions are now opened for all the profilers.

A profiler subscription allows you to pay a small monthly free (~16$) and get the full profiler capabilities along with the assurance of no upgrade cost when the next major version comes out.

In addition to the monthly subscription, I got requests for a yearly subscription. I am not sure that I quite follow the logic, but I am not going to make it harder for people to give me money, so that is available as well for all profilers.

time to read 2 min | 298 words

Continuing my efforts to make RavenDB faster, I finally settled on loading all the users from Stack Overflow into Raven as a good small benchmark for Raven. This is important, because I want something big enough to give me meaningful stats, but small enough so I would have to wait too long for results.

The users data from Stack Overflow is ~180,000 documents, but it loads in just a few minutes, so that is very useful.

With that, I was able to identify a major bottleneck:

image

The AddTask takes a lot of time, mostly because it needs to update a Json field, so it needs to load it, update, and then save it back. In many cases, it is the same value that we keep parsing over & over again. I introduced a Json Cache, which could save the parsing cost. When introducing this, I was also able to make it useful in a few other places, which should hopefully improve performance in a few more places.

Current speed, when inserting ~183,000 documents (over HTTP on the local host) in batches of 128 documents is 128 milliseconds. I consider this pretty good. Note that this is with indexing enabled, but not counting how long it takes for the background indexes to finish processing.

When measuring that as well, after inserting hundreds of thousands of documents, it took another half a second (but note that my measuring granularity was 0.5 second) to finish updating the indexes on the background. I think it is time to try to throw the entire StackOverflow data dump at Raven and see how fast it can handle it.

time to read 2 min | 231 words

I was very excited when I tried this code, it produced amazing performance:

private static void LoadDataFor(string searchPattern)
{
    foreach (var file in Directory.GetFiles("Docs", searchPattern).OrderBy(x=>x))
    {
        var sp = Stopwatch.StartNew();
        var httpWebRequest = (HttpWebRequest)WebRequest.Create("http://localhost:8080/bulk_docs");
        httpWebRequest.Method = "POST";
        using(var requestStream = httpWebRequest.GetRequestStream())
        {
            var readAllBytes = File.ReadAllBytes(file);
            requestStream.Write(readAllBytes, 0, readAllBytes.Length);
        }
        Console.WriteLine("{0} - {1}", Path.GetFileName(file), sp.Elapsed);
    }
}

Can you figure out what the fatal flaw in this code is?

time to read 16 min | 3090 words

Well, Raven is pretty much done now, from feature perspective, so now is the time to look at the performance numbers, see where I goofed up, etc. I decided to use StackOverflow Data Dump as my sample source (using the March 2010), because that is an accessible real world data, large, data set that I can utilize.

I quickly wrote a simple ETL process to read the StackOverflow dump files and load them into Raven. I’ll speak about the ETL process in more detail in a future post, but for now, I want to talk about the numbers.

The ETL approach I used isn’t the most efficient one, I’ll admit. It involves doing multiple passes on the data. Basically it goes:

    • foreach user in users.xml
      • insert user document
    • foreach badge in badges.xml
      • update appropriate user document with new badge
    • foreach post in posts.xml
      • insert post document
    • foreach vote in votes.xml
      • update post with vote
    • foreach comment in comments.xml
      • update post with comment

As you can imagine, this means that we are asking the server to do a lot of duplicated work. It would be better if we would pre-prepare the values and insert them only once, instead of insert & then update them. Unfortunately, the data sizes are large enough that doing trying to do this in memory is too expensive. I can think of several approaches to try to optimize this, but at the end, I don’t really see a reason. This ETL process is probably how people will write it in the real world, so there is no point in trying too hard.

In the end, using the March 2010 dump from stack overflow, I ended up with a Raven DB instance with: 2,329,607 documents. From random sampling, most documents are 1 KB – 5 KB in since, with several that are significantly larger than that.

Here are two typical documents:

// posts/2321816
{
   "LastEditorUserId":200145,
   "LastEditorDisplayName":"",
   "PostTypeId":1,
   "Id":2321816,
   "Title":"Storing commercial files on the server",
   "Score":0,
   "CreationDate":"\/Date(1266952829257+0200)\/",
   "CommentCount":2,
   "AcceptedAnswerId":2321854,
   "LastActivityDate":"\/Date(1266953391687+0200)\/",
   "Tags":"",
   "LastEditDate":"\/Date(1266953391687+0200)\/",
   "Body":"Where would you store files that are meant for sale on an e-commerce website? \n",
   "OwnerUserId":200145,
   "AnswerCount":3,
   "ViewCount":45,
   "comments":[
      {
         "Score":null,
         "CreationDate":"\/Date(1266952919510+0200)\/",
         "Text":"are they \"sensitive\" information?",
         "UserId":"users/203907"
      },
      {
         "Score":null,
         "CreationDate":"\/Date(1266953092057+0200)\/",
         "Text":"I wouldn't say they are sensitive information. They are just commercial files and a free access to them shouldn't be allowed.",
         "UserId":"users/200145"
      }
   ]
}

// users/33
{
   "Age":28,
   "CreationDate":"\/Date(1217583130963+0300)\/",
   "Id":33,
   "UpVotes":354,
   "LastAccessDate":"\/Date(1267455753720+0200)\/",
   "DisplayName":"John",
   "Location":"Southampton, England",
   "AboutMe":"C# and VB.net Developer working primarily in windows service and winforms applications.\r\n\r\n ",
   "EmailHash":"d0b76ae7bf261316683cad31ba0bad91",
   "Reputation":3209,
   "DownVotes":3,
   "Views":334,
   "badges":[
      {
         "Name":"Teacher",
         "Dates":[
            "\/Date(1221458104020+0300)\/"
         ]
      },
      {
         "Name":"Student",
         "Dates":[
            "\/Date(1221458104190+0300)\/"
         ]
      },
      {
         "Name":"Editor",
         "Dates":[
            "\/Date(1221458104377+0300)\/"
         ]
      },
      {
         "Name":"Cleanup",
         "Dates":[
            "\/Date(1221458104470+0300)\/"
         ]
      },
      {
         "Name":"Organizer",
         "Dates":[
            "\/Date(1221458104737+0300)\/"
         ]
      },
      {
         "Name":"Supporter",
         "Dates":[
            "\/Date(1221458104893+0300)\/"
         ]
      },
      {
         "Name":"Critic",
         "Dates":[
            "\/Date(1221458104987+0300)\/"
         ]
      },
      {
         "Name":"Citizen Patrol",
         "Dates":[
            "\/Date(1221458105173+0300)\/"
         ]
      },
      {
         "Name":"Scholar",
         "Dates":[
            "\/Date(1221458105483+0300)\/"
         ]
      },
      {
         "Name":"Enlightened",
         "Dates":[
            "\/Date(1221458112677+0300)\/"
         ]
      },
      {
         "Name":"Taxonomist",
         "Dates":[
            "\/Date(1221458113427+0300)\/"
         ]
      },
      {
         "Name":"Nice Answer",
         "Dates":[
            "\/Date(1221458638367+0300)\/",
            "\/Date(1236274052530+0200)\/",
            "\/Date(1244026052343+0300)\/",
            "\/Date(1244726552923+0300)\/",
            "\/Date(1257249754030+0200)\/"
         ]
      },
      {
         "Name":"Nice Question",
         "Dates":[
            "\/Date(1225182453990+0200)\/",
            "\/Date(1231624653367+0200)\/"
         ]
      },
      {
         "Name":"Commentator",
         "Dates":[
            "\/Date(1227767555493+0200)\/"
         ]
      },
      {
         "Name":"Autobiographer",
         "Dates":[
            "\/Date(1233569254650+0200)\/"
         ]
      },
      {
         "Name":"Necromancer",
         "Dates":[
            "\/Date(1234393653060+0200)\/",
            "\/Date(1257860556480+0200)\/"
         ]
      },
      {
         "Name":"Popular Question",
         "Dates":[
            "\/Date(1236054752283+0200)\/",
            "\/Date(1248302252213+0300)\/",
            "\/Date(1248607054807+0300)\/",
            "\/Date(1250013763393+0300)\/",
            "\/Date(1251215254023+0300)\/",
            "\/Date(1258400556113+0200)\/"
         ]
      },
      {
         "Name":"Yearling",
         "Dates":[
            "\/Date(1249237664163+0300)\/"
         ]
      },
      {
         "Name":"Notable Question",
         "Dates":[
            "\/Date(1249583857093+0300)\/"
         ]
      },
      {
         "Name":"Beta",
         "Dates":[
            "\/Date(1221512400000+0300)\/"
         ]
      },
      {
         "Name":"Self-Learner",
         "Dates":[
            "\/Date(1251201753523+0300)\/"
         ]
      },
      {
         "Name":"Civic Duty",
         "Dates":[
            "\/Date(1260347854457+0200)\/"
         ]
      }
   ]
}

The database size is: 6.29 GB (out of which about 158 MB is for the default index).

Total number of operations: 4,667,100

The first major issue was that I couldn’t tell how many documents I had in the database, get document count turned out to be an O(N) operation(!), that was easy to fix, thankfully.

The second major issue was that Raven didn’t really handle indexing of a lot of documents very well, it would index each document as a standalone operation. The problem is that there are additional costs for doing this (opening & closing the index for writing, mostly), which slow things down enormously.  I fixed that by implementing index merging, so documents that were inserted at the same time would be index together (up to some limit that I am still playing at).

Once those were fixed, I could try doing some measurements…

  • With indexing disabled, inserting 2.3 million documents takes about 1 hour & 15 minutes. Considering that we made 4.6 million operations (including insert & updates), we are talking about over a 1000 operations per second (using single threaded mode).

I am very happy with those results.

  • With indexing enabled, and a single index defined, the process takes much longer. About 3 hours & 15 minutes, giving us about 400 operations per second (again, single threaded), or about 2.5 milliseconds per operation.
    • Waiting for the background indexing task to complete took a lot longer, another 2 hours & 45 minutes. This gives me just over 200 documents indexed per second.

I am pleased, but not really happy about those results. I think that we can do better, and I certainly intend to optimize things.

But that is for later, right now I want to see how Raven behaves when it has that many documents.

What we test URL Timing
Looking up database stats GET /stats 00:00:00.0007919
Browse documents (start) GET /docs/?start=0&pageSize=25 00:00:00.0429041
Browse documents (toward the end) GET /docs/?start=2300000&pageSize=25 00:00:00.0163617
Get document by id (toward the start) GET /docs/users/32 00:00:00.0017779
Get document by id (toward the end) GET /docs/posts/2321034 00:00:00.0022796
Query index (start) GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25 00:00:00.0388772
Query index (toward the end) GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=100000 00:00:00.5988617
Query index (midway point) GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=20000 00:00:00.1644477
Query index #2 (start) GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25 00:00:00.4957742
Query index #2 (toward the end) GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=2000000 00:00:07.3789415
Query index #2 (midway point) GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=200000 00:00:02.2621174

The results are very interesting. It is gratifying to see that browsing documents and retrieving documents is blazing fast, and once I fixed the O(N) issue on the stats, that is fast as hell as well.

Querying the indexes is interesting. It is clear that Lucene doesn’t like to page deeply. On the other hand, I think that it is safe to say that we don’t really have to worry about this, since deep paging (going to page #4,000) is very unlikely, and we can take the half a second hit when it is happening.

Querying index #2 is troublesome, though. It don’t think that it should take that long (without paging, if you want to page to page 80,000, please wait), even if that query returns ~2.1 million results. I think that this is because some of the Lucene options that we use, so I’ll have to look further into that.

Nitpicker corner: Those are not the final numbers, we intend to improve upon them.

time to read 2 min | 361 words

My first day at work, and I am but a young whippersnapper. I arrived, did the usual meet & greet with everybody (and promptly forgot their names), then I was shown to a desk, given a laptop and one of my new coworkers rummaged in a desk drawer for a while before throwing me a document. Well, I say a document, what I mean is a book.

It was couple hundred pages, and it was titled: Specs for the Ergaster module in the Heidelbergensis System. When asked what I was supposed to do with it, I was given a very simple answer: Implement.

I later found on some very interesting aspects on the spec:

  • The document represented over a year and a half of work.
  • The module was being developed for a client of our client.
  • The module was meant to be both specific for the sub client needs and at the same time generic enough to used by other clients of our clients.
  • The system that we were supposed to be a module in was currently under development and wasn’t in a position to be used by us.

Can you say image

Well, I spent close to two years working on this project. I learned a lot, from how to proof myself from upstream teams to how not to design a system. I poured a whole lot of work and effort into this.

Along the way, I also learned something very interesting about the spec. That was the first time that I actually had to reverse engineer a specification to understand what the customer actually wanted.

Two years after we handed off the module to the client, I had the chance to go to lunch with the team leader from that client, so I asked him about the module. He informed me that they still haven’t deployed to production.

I was depressed for a week afterward.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}