Ayende @ Rahien

Those small touches

Can you tell me why I changed Raven’s server log format from this:

Request #1: GET / - 00:00:00.0038469
Request #2: GET /raven/index.html - 00:00:00.0066264
Request #3: GET /raven/css/rdb.css - 00:00:00.0137499
Request #5: GET /raven/js/jquery-1.4.2.min.js - 00:00:00.0136234
Request #4: GET /raven/css/smoothness/jquery-ui-1.8rc2.custom.css - 00:00:00.0135178
Request #6: GET /raven/js/jquery-ui.js - 00:00:00.0163169
Request #9: GET /raven/js/raven-ui.js - 00:00:00.0017103
Request #7: GET /raven/js/jquery-jtemplates.js - 00:00:00.0026158
Request #8: GET /raven/js/jquery.RavenDB.js - 00:00:00.0026884
Request #10: GET /raven/JSONTemplates/quickStats.html - 00:00:00.0003281
Request #12: GET /raven/images/header_bg.png - 00:00:00.0002728
Request #15: GET /raven/images/ajax-loader.gif - 00:00:00.0004883
Request #14: GET /raven/images/footer_bg.png - 00:00:00.0008783
Request #13: GET /raven/images/logo.png - 00:00:00.0011801
Request #11: GET /stats - 00:00:00.0183869

To this:

Request #   1: GET -     4 ms - 302 - /
Request #   2: GET -     7 ms - 200 - /raven/index.html
Request #   3: GET -     3 ms - 200 - /raven/css/rdb.css
Request #   4: GET -     4 ms - 200 - /raven/css/smoothness/jquery-ui-1.8rc2.custom.css
Request #   5: GET -     8 ms - 200 - /raven/js/jquery-1.4.2.min.js
Request #   8: GET -     5 ms - 200 - /raven/js/jquery.RavenDB.js
Request #   7: GET -     7 ms - 200 - /raven/js/jquery-jtemplates.js
Request #   9: GET -     2 ms - 200 - /raven/js/raven-ui.js
Request #   6: GET -    11 ms - 200 - /raven/js/jquery-ui.js
Request #  10: GET -     0 ms - 200 - /raven/JSONTemplates/quickStats.html
Request #  12: GET -     0 ms - 200 - /raven/images/header_bg.png
Request #  14: GET -     0 ms - 200 - /raven/images/ajax-loader.gif
Request #  15: GET -     0 ms - 200 - /raven/images/footer_bg.png
Request #  13: GET -     1 ms - 200 - /raven/images/logo.png
Request #  11: GET -    31 ms - 200 - /stats
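For reference, a format string along these lines produces the second layout; this is just a sketch, not Raven’s actual logging code:

static string FormatRequestLine(int requestNumber, string method, long elapsedMs, int statusCode, string url)
{
    // fixed-width request number and elapsed time, the status code before the URL,
    // and the URL last so the columns line up
    return string.Format("Request # {0,4}: {1} - {2,5} ms - {3} - {4}",
        requestNumber, method, elapsedMs, statusCode, url);
}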

Optimizing “expensive” calls

Take a look at the following code:

[image: profiler screenshot of the code in question, with per-call timings]

It is obvious where we need to optimize, right?

Except… each call here takes about 0.8 millisecond. Yes, we could probably optimize this further, but the question is, would it be worth it?

Given sub-millisecond performance, and given that implementing a different serialization format would be an expensive operation, I think there just isn’t enough justification to do so.
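By the way, a figure like “0.8 milliseconds per call” is easiest to verify by timing many iterations and dividing, since a single call is below the useful resolution of a timer. A minimal sketch (SerializeDocument below stands in for whatever call the profiler flagged, it isn’t a real method here):

static double AverageMilliseconds(Action callUnderTest, int iterations)
{
    callUnderTest(); // warm up (JIT, caches)
    var sw = Stopwatch.StartNew();
    for (var i = 0; i < iterations; i++)
        callUnderTest();
    sw.Stop();
    return sw.Elapsed.TotalMilliseconds / iterations;
}

// usage: AverageMilliseconds(() => SerializeDocument(doc), 10000)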

Unbounded concurrency

Yesterday I talked about a problem that kept appearing: my application getting slower and slower and eventually dying. The code that I was testing with was:

Parallel.ForEach(Directory.GetFiles("Docs","*.json"), file =>
{
    PostTo("http://localhost:9090/bulk_docs", file);
});

The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request, when running in single-threaded mode, was 100 – 200 ms.

The problem was, at its root, an unbounded concurrency issue. Here is what happened.

Handling a single request is pretty fast, but it is still going to block for about 100 ms. The Parallel.ForEach implementation gets notified about the blocked thread, and in order to increase parallelism, it is going to spin up another thread and give it the next task. That thread is going to block as well, and after a while, another thread will be created, etc.

The workload is big enough, and the response times are just long enough, to encourage the thread pool to create more and more threads. The server, at the same time, has to deal with more and more work, causing request times to increase, which causes the client to create more threads, which causes the server to become more and more distressed, until eventually it fails and requests start to error.

The fun part is that if you reduce the load on the server, it resumes operating normally. But the errors that it gave in the meanwhile were… cryptic. There is a reason why most servers limit their concurrency level: there are always enough clients to saturate a server, and it is better to place a client on a request queue, or send it a “busy now, come back later” error, than to crash because of resource allocation issues somewhere in the middle of processing a request.
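On the client side, bounding the concurrency is a small change. A minimal sketch, using .NET 4’s ParallelOptions and the ServicePointManager connection limit (not necessarily the exact fix used here):

ServicePointManager.DefaultConnectionLimit = 16; // cap outgoing HTTP connections per host

var options = new ParallelOptions { MaxDegreeOfParallelism = 16 };
Parallel.ForEach(Directory.GetFiles("Docs", "*.json"), options, file =>
{
    // same PostTo helper as above, but now at most 16 requests are in flight
    PostTo("http://localhost:9090/bulk_docs", file);
});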


Challenge: What killed the application?

I have been doing a lot of heavy performance testing on Raven, and I ran into a lot of very strange scenarios. I found a lot of interesting stuff (a runaway cache causing OutOfMemoryException, unnecessary re-parsing, etc). But one thing that I wasn’t able to resolve was the concurrency issue.

In particular, Raven would slow down and crash under load. I scoured the code, trying to figure out what was going on, but I couldn’t figure it out. It seemed that after several minutes of executing, request times would grow longer and longer, until finally the server would start raising errors on most requests.

I am ashamed to say that it took me a while to figure out what was actually going on. Can you figure it out?

Here is the client code:

Parallel.ForEach(Directory.GetFiles("Docs","*.json"), file =>
{
    PostTo("http://localhost:9090/bulk_docs", file);
});

The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request, when running in single-threaded mode, was 100 – 200 ms.

That should be enough information to figure out what is going on.

Why did the application crash?


ETL Process using Raven

In order to test Raven's behavior with a large amount of data, I needed to load a whole bunch of data into it. I chose the StackOverflow Data Dump as my sample source.

That left me with the problem of how to go from data that looks like this:

<!--  users.xml -->
<row Id="1"
         Reputation="18566"
         CreationDate="2008-07-31T14:22:31.287"
         DisplayName="Jeff Atwood"
         EmailHash="51d623f33f8b83095db84ff35e15dbe8"
         LastAccessDate="2010-03-02T07:33:46.390"
         WebsiteUrl="http://www.codinghorror.com/blog/"
         Location="El Cerrito, CA"
         Age="40"
         AboutMe="..."
         Views="23436"
         UpVotes="1972"
         DownVotes="431" />

<!--  badges.xml -->
<row Id="82946"
         UserId="3718"
         Name="Teacher"
         Date="2008-09-15T08:55:03.923" />

<!-- posts.xml -->
<row Id="7"
         PostTypeId="2"
         ParentId="4"
         CreationDate="2008-07-31T22:17:57.883"
         Score="44"
         ViewCount="0"
         Body="..."
         OwnerUserId="9"
         LastEditorUserId="1"
         LastEditorDisplayName=""
         LastEditDate="2010-02-02T19:38:14.493"
         LastActivityDate="2010-02-02T19:38:14.493"
         CommentCount="0" />

<!-- votes.xml -->
<row Id="2"
         PostId="3"
         VoteTypeId="2"
         CreationDate="2008-07-31" />


<!-- comments.xml -->
<row Id="12"
         PostId="47428"
         Score="2"
         Text="..."
         CreationDate="2008-09-06T13:51:47.843"
         UserId="4642" />

To JSON documents. Actually, it is a bit worse than that: the data dump is relational in nature, which isn’t really suitable for a doc db format. What I actually want is a User document, which contains its badges, and a Post document, which contains its votes and comments.

My first approach was this:

  • foreach user in users.xml
    • Insert user document
  • foreach badge in badges.xml
    • Get user document from database
    • Update appropriate user document with new badge
  • foreach post in posts.xml
    • Insert post document
  • foreach vote in votes.xml
    • Get Post document from database
    • Update post with vote
  • foreach comment in comments.xml
    • Get Post document from database
    • Update post with comment

Running with the embedded option, with no indexing enabled, that took about 1:15 hours. (That is actually not entirely fair; this process involved reading the XML as well, and it resulted in several performance improvements.)

My next attempt was to do something like:

  • foreach user in users.xml & foreach badge in badges.xml
    • Merge badges to appropriate user
    • Insert user document
  • foreach post in posts.xml & foreach vote in votes.xml & foreach comment in comments.xml
    • Merge votes to appropriate post
    • Merge comments to appropriate post
    • Insert post document

The theory was that I could reduce the number of calls to the database, and thus reduce the overall cost. The problem is that trying to do this sort of merging in memory takes a lot of time and a huge amount of memory. It is perhaps telling that I gave up on that approach on a 64 bit machine with 8 GB of RAM.

Finally, I decided to go with a two-stage approach, because what I actually wanted to see was the usage of Raven, not the performance of reading the XML. The current ETL process is composed of two steps:

Part 1:

  • foreach user in users.xml
    • Partition into batches of 128 users
    • Generate a JSON document representing a batch insert in the database
  • similarly for badges, posts, votes and comments

Part 2:

Submit the JSON documents to Raven over HTTP
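As a rough illustration of part 1, here is a minimal sketch of the partitioning step (not the actual ETL code); the commands are the already-shaped PUT entries, like the ones shown below:

static void WriteBatches(IEnumerable<object> commands, string outputDir, string prefix)
{
    // accumulate 128 commands per file, then flush
    var batch = new List<object>();
    var batchNumber = 0;
    foreach (var command in commands)
    {
        batch.Add(command);
        if (batch.Count < 128)
            continue;
        WriteBatch(outputDir, prefix, batchNumber++, batch);
        batch.Clear();
    }
    if (batch.Count > 0)
        WriteBatch(outputDir, prefix, batchNumber, batch); // leftover partial batch
}

static void WriteBatch(string outputDir, string prefix, int number, List<object> batch)
{
    var path = Path.Combine(outputDir, string.Format("{0} #{1:00000}.json", prefix, number));
    File.WriteAllText(path, JsonConvert.SerializeObject(batch, Formatting.Indented)); // Newtonsoft.Json
}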

Here is how it looks for users:

[
  {
    "Key": "users/-1",
    "Etag": null,
    "Method": "PUT",
    "Document": {
      "UpVotes": 413,
      "EmailHash": "a007be5a61f6aa8f3e85ae2fc18dd66e",
      "AboutMe": "…",
      "Id": -1,
      "LastAccessDate": "\/Date(1219699013810+0300)\/",
      "Views": 649,
      "Location": "on the server farm",
      "DownVotes": 22414,
      "Reputation": 1,
      "CreationDate": "\/Date(1217451600000+0300)\/",
      "Age": 2,
      "DisplayName": "Community",
      "WebsiteUrl": "http://stackoverflow.com"
    },
    "Metadata": {
      "Raven-Entity-Name": "Users"
    }
  },
  {
    "Key": "users/1",
    "Etag": null,
    "Method": "PUT",
    "Document": {
      "UpVotes": 1972,
      "EmailHash": "51d623f33f8b83095db84ff35e15dbe8",
      "AboutMe": "…",
      "Id": 1,
      "LastAccessDate": "\/Date(1267508026390+0200)\/",
      "Views": 23436,
      "Location": "El Cerrito, CA",
      "DownVotes": 431,
      "Reputation": 18566,
      "CreationDate": "\/Date(1217503351287+0300)\/",
      "Age": 40,
      "DisplayName": "Jeff Atwood",
      "WebsiteUrl": "http://www.codinghorror.com/blog/"
    },
    "Metadata": {
      "Raven-Entity-Name": "Users"
    }
  }
]

This is pretty simple, right?

But, what about something like badges? How can we go from something like this:

<!-- badges.xml --> 
<row Id="82946" 
          UserId="3718" 
          Name="Teacher" 
          Date="2008-09-15T08:55:03.923" /> 

To a document update? After all, we don’t have the rest of the document. That would seem to necessitate a fetch, adding the value at the client side and send to the server. However, that is going to be slow. Luckily, Raven support the PATCH command. There isn’t really a diff format for JSON, so we had to invent our own. In this case, the document for adding badges is pretty simple and should be self explanatory:

[
  {
    "Key": "users/3718",
    "Method": "PATCH",
    "Etag": null,
    "Patches": [
      {
        "Type": "Add",
        "Value": {
          "Name": "Teacher",
          "Dates": [
            "\/Date(1221458103923+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Supporter",
          "Dates": [
            "\/Date(1221458104830+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Autobiographer",
          "Dates": [
            "\/Date(1221458105220+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Editor",
          "Dates": [
            "\/Date(1221753452033+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Organizer",
          "Dates": [
            "\/Date(1221753452780+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Scholar",
          "Dates": [
            "\/Date(1222711653513+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Student",
          "Dates": [
            "\/Date(1222753351563+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Critic",
          "Dates": [
            "\/Date(1249633658857+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Beta",
          "Dates": [
            "\/Date(1221512400000+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Yearling",
          "Dates": [
            "\/Date(1251666165617+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Popular Question",
          "Dates": [
            "\/Date(1254097352953+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Commentator",
          "Dates": [
            "\/Date(1254833254790+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      }
    ]
  },
  {
    "Key": "users/994",
    "Method": "PATCH",
    "Etag": null,
    "Patches": [
      {
        "Type": "Add",
        "Value": {
          "Name": "Teacher",
          "Dates": [
            "\/Date(1221458103957+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Supporter",
          "Dates": [
            "\/Date(1221458104830+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Autobiographer",
          "Dates": [
            "\/Date(1221458105220+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Scholar",
          "Dates": [
            "\/Date(1226316765340+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Student",
          "Dates": [
            "\/Date(1226358152000+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Popular Question",
          "Dates": [
            "\/Date(1250109153087+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Beta",
          "Dates": [
            "\/Date(1221512400000+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Critic",
          "Dates": [
            "\/Date(1257496652637+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Editor",
          "Dates": [
            "\/Date(1257496952257+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      }
    ]
  }
]

The fun part about the whole thing is that now I am truly testing just Raven’s performance, because the client side is just reading files and submitting requests to Raven. Another important factor is that instead of issuing a single request per document, we are able to batch them together in groups of 128 and submit them to /bulk_docs.

Using this process, we see a pretty consistent average of about 1,000 documents per second for batched inserts and somewhat less than that for batched PATCH requests.

Just to give you some idea, here are the file sizes per type:

  • Users – 65 KB
  • Badges – ~500 KB
  • Posts – 150 KB
  • Comments – 70 KB
  • Votes – 40 KB

Note that each file contains 128 requests, and that those aren’t the final document sizes. A user document is made up of the user data and badges, for example.

But so far, I am pretty happy about it.

Profiler subscriptions are now open for all profilers

After trying it out on NH Prof, profiler subscriptions are now open for all the profilers.

A profiler subscription allows you to pay a small monthly fee (~$16) and get the full profiler capabilities, along with the assurance of no upgrade cost when the next major version comes out.

In addition to the monthly subscription, I got requests for a yearly subscription. I am not sure that I quite follow the logic, but I am not going to make it harder for people to give me money, so that is available as well for all profilers.

RavenDB Performance Tweaks

Continuing my efforts to make RavenDB faster, I finally settled on loading all the users from Stack Overflow into Raven as a good small benchmark for Raven. This is important, because I want something big enough to give me meaningful stats, but small enough that I wouldn't have to wait too long for results.

The users data from Stack Overflow is ~180,000 documents, but it loads in just a few minutes, so that is very useful.

With that, I was able to identify a major bottleneck:

[image: profiler screenshot showing AddTask dominating the trace]

The AddTask call takes a lot of time, mostly because it needs to update a JSON field, so it needs to load it, update it, and then save it back. In many cases, it is the same value that we keep parsing over & over again. I introduced a JSON cache, which saves the parsing cost. While introducing this, I was also able to make it useful in a few other places, which should hopefully improve performance there as well.
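To make the idea concrete, here is a minimal sketch of such a cache; this is not Raven’s actual implementation, just the shape of it (ConcurrentDictionary is from System.Collections.Concurrent, JToken from Newtonsoft.Json.Linq):

public static class JsonCache
{
    // key the cache on the raw JSON string, so identical values are only parsed once
    private static readonly ConcurrentDictionary<string, JToken> cache =
        new ConcurrentDictionary<string, JToken>();

    public static JToken Parse(string json)
    {
        var cached = cache.GetOrAdd(json, JToken.Parse);
        return cached.DeepClone(); // never hand out the shared instance
    }
}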

Current speed, when inserting ~183,000 documents (over HTTP on the local host) in batches of 128 documents, is about 128 milliseconds per batch (roughly 1,000 documents per second). I consider this pretty good. Note that this is with indexing enabled, but not counting how long it takes for the background indexes to finish processing.

When measuring that as well, after inserting hundreds of thousands of documents, it took another half a second (but note that my measurement granularity was 0.5 seconds) to finish updating the indexes in the background. I think it is time to try to throw the entire StackOverflow data dump at Raven and see how fast it can handle it.

How to get REALLY fast benchmarks

I was very excited when I tried this code, it produced amazing performance:

private static void LoadDataFor(string searchPattern)
{
    foreach (var file in Directory.GetFiles("Docs", searchPattern).OrderBy(x=>x))
    {
        var sp = Stopwatch.StartNew();
        var httpWebRequest = (HttpWebRequest)WebRequest.Create("http://localhost:8080/bulk_docs");
        httpWebRequest.Method = "POST";
        using(var requestStream = httpWebRequest.GetRequestStream())
        {
            var readAllBytes = File.ReadAllBytes(file);
            requestStream.Write(readAllBytes, 0, readAllBytes.Length);
        }
        Console.WriteLine("{0} - {1}", Path.GetFileName(file), sp.Elapsed);
    }
}

Can you figure out what the fatal flaw in this code is?

Raven performance testing

Well, Raven is pretty much done now, from a feature perspective, so now is the time to look at the performance numbers, see where I goofed up, etc. I decided to use the StackOverflow Data Dump as my sample source (using the March 2010 dump), because it is an accessible, large, real world data set that I can utilize.

I quickly wrote a simple ETL process to read the StackOverflow dump files and load them into Raven. I’ll speak about the ETL process in more detail in a future post, but for now, I want to talk about the numbers.

The ETL approach I used isn’t the most efficient one, I’ll admit. It involves doing multiple passes on the data. Basically it goes:

    • foreach user in users.xml
      • insert user document
    • foreach badge in badges.xml
      • update appropriate user document with new badge
    • foreach post in posts.xml
      • insert post document
    • foreach vote in votes.xml
      • update post with vote
    • foreach comment in comments.xml
      • update post with comment

As you can imagine, this means that we are asking the server to do a lot of duplicated work. It would be better if we pre-prepared the values and inserted them only once, instead of inserting & then updating them. Unfortunately, the data sizes are large enough that trying to do this in memory is too expensive. I can think of several approaches to try to optimize this, but in the end, I don’t really see a reason. This ETL process is probably how people will write it in the real world, so there is no point in trying too hard.

In the end, using the March 2010 dump from Stack Overflow, I ended up with a Raven DB instance with 2,329,607 documents. From random sampling, most documents are 1 KB – 5 KB in size, with several that are significantly larger than that.

Here are two typical documents:

// posts/2321816
{
   "LastEditorUserId":200145,
   "LastEditorDisplayName":"",
   "PostTypeId":1,
   "Id":2321816,
   "Title":"Storing commercial files on the server",
   "Score":0,
   "CreationDate":"\/Date(1266952829257+0200)\/",
   "CommentCount":2,
   "AcceptedAnswerId":2321854,
   "LastActivityDate":"\/Date(1266953391687+0200)\/",
   "Tags":"",
   "LastEditDate":"\/Date(1266953391687+0200)\/",
   "Body":"Where would you store files that are meant for sale on an e-commerce website? \n",
   "OwnerUserId":200145,
   "AnswerCount":3,
   "ViewCount":45,
   "comments":[
      {
         "Score":null,
         "CreationDate":"\/Date(1266952919510+0200)\/",
         "Text":"are they \"sensitive\" information?",
         "UserId":"users/203907"
      },
      {
         "Score":null,
         "CreationDate":"\/Date(1266953092057+0200)\/",
         "Text":"I wouldn't say they are sensitive information. They are just commercial files and a free access to them shouldn't be allowed.",
         "UserId":"users/200145"
      }
   ]
}

// users/33
{
   "Age":28,
   "CreationDate":"\/Date(1217583130963+0300)\/",
   "Id":33,
   "UpVotes":354,
   "LastAccessDate":"\/Date(1267455753720+0200)\/",
   "DisplayName":"John",
   "Location":"Southampton, England",
   "AboutMe":"C# and VB.net Developer working primarily in windows service and winforms applications.\r\n\r\n ",
   "EmailHash":"d0b76ae7bf261316683cad31ba0bad91",
   "Reputation":3209,
   "DownVotes":3,
   "Views":334,
   "badges":[
      {
         "Name":"Teacher",
         "Dates":[
            "\/Date(1221458104020+0300)\/"
         ]
      },
      {
         "Name":"Student",
         "Dates":[
            "\/Date(1221458104190+0300)\/"
         ]
      },
      {
         "Name":"Editor",
         "Dates":[
            "\/Date(1221458104377+0300)\/"
         ]
      },
      {
         "Name":"Cleanup",
         "Dates":[
            "\/Date(1221458104470+0300)\/"
         ]
      },
      {
         "Name":"Organizer",
         "Dates":[
            "\/Date(1221458104737+0300)\/"
         ]
      },
      {
         "Name":"Supporter",
         "Dates":[
            "\/Date(1221458104893+0300)\/"
         ]
      },
      {
         "Name":"Critic",
         "Dates":[
            "\/Date(1221458104987+0300)\/"
         ]
      },
      {
         "Name":"Citizen Patrol",
         "Dates":[
            "\/Date(1221458105173+0300)\/"
         ]
      },
      {
         "Name":"Scholar",
         "Dates":[
            "\/Date(1221458105483+0300)\/"
         ]
      },
      {
         "Name":"Enlightened",
         "Dates":[
            "\/Date(1221458112677+0300)\/"
         ]
      },
      {
         "Name":"Taxonomist",
         "Dates":[
            "\/Date(1221458113427+0300)\/"
         ]
      },
      {
         "Name":"Nice Answer",
         "Dates":[
            "\/Date(1221458638367+0300)\/",
            "\/Date(1236274052530+0200)\/",
            "\/Date(1244026052343+0300)\/",
            "\/Date(1244726552923+0300)\/",
            "\/Date(1257249754030+0200)\/"
         ]
      },
      {
         "Name":"Nice Question",
         "Dates":[
            "\/Date(1225182453990+0200)\/",
            "\/Date(1231624653367+0200)\/"
         ]
      },
      {
         "Name":"Commentator",
         "Dates":[
            "\/Date(1227767555493+0200)\/"
         ]
      },
      {
         "Name":"Autobiographer",
         "Dates":[
            "\/Date(1233569254650+0200)\/"
         ]
      },
      {
         "Name":"Necromancer",
         "Dates":[
            "\/Date(1234393653060+0200)\/",
            "\/Date(1257860556480+0200)\/"
         ]
      },
      {
         "Name":"Popular Question",
         "Dates":[
            "\/Date(1236054752283+0200)\/",
            "\/Date(1248302252213+0300)\/",
            "\/Date(1248607054807+0300)\/",
            "\/Date(1250013763393+0300)\/",
            "\/Date(1251215254023+0300)\/",
            "\/Date(1258400556113+0200)\/"
         ]
      },
      {
         "Name":"Yearling",
         "Dates":[
            "\/Date(1249237664163+0300)\/"
         ]
      },
      {
         "Name":"Notable Question",
         "Dates":[
            "\/Date(1249583857093+0300)\/"
         ]
      },
      {
         "Name":"Beta",
         "Dates":[
            "\/Date(1221512400000+0300)\/"
         ]
      },
      {
         "Name":"Self-Learner",
         "Dates":[
            "\/Date(1251201753523+0300)\/"
         ]
      },
      {
         "Name":"Civic Duty",
         "Dates":[
            "\/Date(1260347854457+0200)\/"
         ]
      }
   ]
}

The database size is: 6.29 GB (out of which about 158 MB is for the default index).

Total number of operations: 4,667,100

The first major issue was that I couldn’t tell how many documents I had in the database; getting the document count turned out to be an O(N) operation(!). That was easy to fix, thankfully.

The second major issue was that Raven didn’t really handle indexing of a lot of documents very well; it would index each document as a standalone operation. The problem is that there are additional costs for doing this (opening & closing the index for writing, mostly), which slow things down enormously. I fixed that by implementing index merging, so documents that were inserted at the same time would be indexed together (up to some limit that I am still playing with).
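The batching idea itself is simple. Here is a sketch of the shape of it, with IIndexWriter as a stand-in interface rather than Raven's actual type:

interface IIndexWriter : IDisposable
{
    void Index(JObject document);
}

static void IndexBatch(IEnumerable<JObject> documentsInsertedTogether, Func<IIndexWriter> openIndex)
{
    using (var writer = openIndex())              // pay the open/close cost once per batch
    {
        foreach (var doc in documentsInsertedTogether)
            writer.Index(doc);                    // cheap per-document work
    }
}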

Once those were fixed, I could try doing some measurements…

  • With indexing disabled, inserting 2.3 million documents takes about 1 hour & 15 minutes. Considering that we made 4.6 million operations (including inserts & updates), we are talking about over 1,000 operations per second (in single-threaded mode).

I am very happy with those results.

  • With indexing enabled, and a single index defined, the process takes much longer. About 3 hours & 15 minutes, giving us about 400 operations per second (again, single threaded), or about 2.5 milliseconds per operation.
    • Waiting for the background indexing task to complete took a lot longer, another 2 hours & 45 minutes. This gives me just over 200 documents indexed per second.

I am pleased, but not really happy about those results. I think that we can do better, and I certainly intend to optimize things.

But that is for later, right now I want to see how Raven behaves when it has that many documents.

What we test                          | URL                                                                                | Timing
Looking up database stats             | GET /stats                                                                         | 00:00:00.0007919
Browse documents (start)              | GET /docs/?start=0&pageSize=25                                                     | 00:00:00.0429041
Browse documents (toward the end)     | GET /docs/?start=2300000&pageSize=25                                               | 00:00:00.0163617
Get document by id (toward the start) | GET /docs/users/32                                                                 | 00:00:00.0017779
Get document by id (toward the end)   | GET /docs/posts/2321034                                                            | 00:00:00.0022796
Query index (start)                   | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25               | 00:00:00.0388772
Query index (toward the end)          | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=100000  | 00:00:00.5988617
Query index (midway point)            | GET /indexes/Raven/DocumentsByEntityName?query=Tag:Users&pageSize=25&start=20000   | 00:00:00.1644477
Query index #2 (start)                | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25               | 00:00:00.4957742
Query index #2 (toward the end)       | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=2000000 | 00:00:07.3789415
Query index #2 (midway point)         | GET /indexes/Raven/DocumentsByEntityName?query=Tag:posts&pageSize=25&start=200000  | 00:00:02.2621174

The results are very interesting. It is gratifying to see that browsing documents and retrieving documents is blazing fast, and once I fixed the O(N) issue on the stats, that is fast as hell as well.

Querying the indexes is interesting. It is clear that Lucene doesn’t like to page deeply. On the other hand, I think it is safe to say that we don’t really have to worry about this, since deep paging (going to page #4,000) is very unlikely, and we can take the half-second hit when it happens.

Querying index #2 is troublesome, though. I don’t think that it should take that long (without deep paging, that is; if you want to page to page 80,000, please wait), even if that query returns ~2.1 million results. I think that this is because of some of the Lucene options that we use, so I’ll have to look further into that.

Nitpicker corner: Those are not the final numbers, we intend to improve upon them.

Horror stories from the trenches: My very first project

My first day at work, and I am but a young whippersnapper. I arrived, did the usual meet & greet with everybody (and promptly forgot their names), then I was shown to a desk, given a laptop and one of my new coworkers rummaged in a desk drawer for a while before throwing me a document. Well, I say a document, what I mean is a book.

It was a couple hundred pages, and it was titled: Specs for the Ergaster module in the Heidelbergensis System. When I asked what I was supposed to do with it, I was given a very simple answer: Implement.

I later found out some very interesting aspects of the spec:

  • The document represented over a year and a half of work.
  • The module was being developed for a client of our client.
  • The module was meant to be both specific to the sub client’s needs and at the same time generic enough to be used by other clients of our clients.
  • The system that we were supposed to be a module in was currently under development and wasn’t in a position to be used by us.

Can you say… [image]

Well, I spent close to two years working on this project. I learned a lot, from how to protect myself from upstream teams to how not to design a system. I poured a whole lot of work and effort into this.

Along the way, I also learned something very interesting about the spec. That was the first time that I actually had to reverse engineer a specification to understand what the customer actually wanted.

Two years after we handed off the module to the client, I had the chance to go to lunch with the team leader from that client, so I asked him about the module. He informed me that they still hadn’t deployed it to production.

I was depressed for a week afterward.

That No SQL Thing: Document Database Migrations

One of the things that I get asked frequently about No SQL is how to handle migrations. First, I want to clarify something very important:

Migrations suck.

It doesn’t matter what you are doing, the moment that you need to be forward / backward compatible, or just update existing data to a new format, you are going to run into a lot of unpleasant situations. Chief of which is that what you thought was true about your data isn’t, and probably never was.

But leaving that aside, in the RDBMS world we have three major steps when we migrate from one database version to another: a non-destructive update to the schema (add columns / tables), updating the data (or moving it to its new location), and a destructive schema update (delete columns / tables, introduce indexes, etc). With a document database, the situation is much the same; while we don’t have a mandatory schema, we do have the data in a certain shape.

I am going to use three examples, which hopefully demonstrate enough about migrations to give you the basic idea.

  • Renaming a column – Posts.Text should be renamed to Posts.Content
  • Update column value – Update tax rate in all orders after 2010-01-01
  • Change many-to-one association to many-to-many association – Posts.Category to Posts.Categories

Renaming a column

Using SQL, we can do this using:

-- non destructive update
alter table Posts add Content nvarchar(max) null

-- update the data
update Posts
set Content = Text

-- destructive schema update
alter table Posts drop column Text
alter table Posts alter column Content nvarchar(max) not null

Using a Document Database, we run the following Linq query on the server (this is effectively an update):

from doc in docs
where doc.type == "post"
select new 
{
   type = doc.type,
   content = doc.text, // rename a column
   posted_at = doc.posted_at
}

Update column value

In SQL, it is pretty easy:

-- update the data
update Orders
set Tax = 0.16
where OrderDate > '2010-01-01'

And using an update Linq query:

from doc in docs
where doc.type == "order" && doc.order_date > new DateTime(2010,1,1)
select new
{
   tax = 0.16,
   order_date = doc.order_date,
   order_lines = doc.order_lines
}

Change many-to-one association to many-to-many association

Using SQL, it is pretty complex (especially when you add the FKs):

create table CategoriesPosts (PostId int, CategoryId int)

insert into CategoriesPosts (PostId, CategoryId)
select Id, CategoryId from Posts

alter table Posts drop column CategoryId

And using a doc db:

from doc in docs
where doc.type == "post"
select new 
{
   type = doc.type,
   content = doc.content,
   posted_at = doc.posted_at,
   categories = new [] { doc.category }
}

It is pretty much the same pattern over and over again :-)

That No SQL Thing: Modeling Documents in a Document Database

So, after telling you the wrong way to go about it, I intend to show the right way to design the data model using a document database. So let’s be about it.

As a reminder, those are the scenarios that I want to deal with:

  • Main page: show list of blogs
  • Main page: show list of recent posts
  • Main page: show list of recent comments
  • Main page: show tag cloud for posts
  • Main page: show categories
  • Post page: show post and all comments
  • Post page: add comment to post
  • Tag page: show all posts for tag
  • Categories page: show all posts for category

And here is the sample data that we are going to work with (again, using C#, because any other representation pretty much dictates the storage format).

var user = new User("ayende");
var blog = new Blog("Ayende @ Rahien", user) { Tags = {".NET", "Architecture", "Databases" } };
var categoryRaven = new Category("Raven");
var categoryNoSQL = new Category("NoSQL");
var post = new Post(blog, "RavenDB", "... content ...")  
{
    Categories  = { categoryRaven, categoryNoSQL },
    Tags = {"RavenDB", "Announcements" }
};
var comment = new Comment(post, "Great news");

PersistAll(user, blog, categoryRaven, categoryNoSQL, post, comment);

When approaching a document database model design, I like to think in aggregates. What entities in the model above are aggregates? User and Blog, obviously, and Post as well. Each of those has a right to exist in an independent manner. But nothing else has a right to exist on its own. Tags are obviously Value Objects, and the only usage I can see for categories is to use them for indexing the data. Comments, as well, aren’t really meaningful outside their post. Given that decision, I decided on the following format:

// users/ayende
{
   "type": "user",
   "name": "ayende"
}

// blogs/1
{
   "type": "blog",
    "users": ["users/ayende"],
    "name": "Ayende @ Rahien",
    "tags": [".NET", "Architecture", "Databases"]
}
// posts/1
{
    "blog": "blogs/1",
    "title": "RavenDB",
    "content": "... content ...",
    "categories": ["Raven", "NoSQL"]
    "tags" : ["RavenDB", "Announcements"],
    "comments": [
        { "content": "Great News" }
    ]
}

That gives me a much smaller model, and it makes things like pulling a post and its comments (a very common operation) extremely cheap. Let us see how we can handle each scenario using this model…

Main page: show list of blogs

This is simple, using a commonly used index:

var blogs = docDb.Query<Blog>("DocumentsByType", "type:blog");

Main page: show list of recent posts

var posts = docDb.Query<Post>("PostsByTime", orderBy:"-posted_at");

This is using an index for posts by time:

from doc in docs
where doc.type == "post"
select new {doc.posted_at}

Main page: show list of recent comments

This is more interesting, because we don’t have the concept of comments as a separate thing. We handle this using an index that extracts the values from the post document:

from doc in docs
where doc.type == "post"
from comment in doc.comments
select new {comment.posted_at, comment.content }

And the query itself:

var recentComments = docDb.Query<Comment>("CommentsByTime", orderBy:"-posted_at");

Main page: show tag cloud for posts

And now we have another interesting challenge: we need to do aggregation on top of the tags. This is done using a Map/Reduce operation. It sounds frightening, but it is pretty easy, just two Linq queries:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new { tag, count = 1}

from result in results
group result by result.tag into g
select new {tag = g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var tagCloud = docDb.Query<TagAndCount>("TagCloud");

Main page: show categories

This is pretty much the same thing as tag cloud, because we want to extract the categories from the posts, and get the number of posts in each category.

from doc in docs
where doc.type == "post"
from category in doc.categories
select new { category, count = 1}

from result in results
group result by result.category into g
select new {category= g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var categories = docDb.Query<CategoryAndCount>("CategoriesCount");

Post page: show post and all comments

Here we see how easy it is to manage object graphs, all we need to do is pull the post document, and we get everything we need to render the post page:

var post = docDb.Get<Post>("posts/1");

Post page: add comment to post

Remember, a comment cannot live outside a post, so adding a comment involves getting the post, adding the comment and saving:

var post = docDb.Get<Post>("posts/1");
post.Comments.Add(new Comment(...));
docDb.SaveChanges();

Note that this uses the Unit of Work pattern; the doc db keeps track of the post entity and will persist it to the database when SaveChanges() is called.
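For illustration only, here is a toy sketch of what that Unit of Work looks like behind the scenes; this is not the actual client code, just the idea of an identity map plus SaveChanges (JsonConvert is Newtonsoft.Json):

public class DocumentSession
{
    private readonly IDictionary<string, string> store;   // stand-in for the server: key -> JSON
    private readonly Dictionary<string, object> tracked = new Dictionary<string, object>();

    public DocumentSession(IDictionary<string, string> store)
    {
        this.store = store;
    }

    public T Get<T>(string key)
    {
        var entity = JsonConvert.DeserializeObject<T>(store[key]);
        tracked[key] = entity;          // remember everything we hand out
        return entity;
    }

    public void SaveChanges()
    {
        foreach (var pair in tracked)   // in the real client, a PUT per changed document
            store[pair.Key] = JsonConvert.SerializeObject(pair.Value);
    }
}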

Tag page: show all posts for tag

Again, we have a two-step approach: define the index, and then query on it. It is pretty simple:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new { tag }

And the query itself is:

var posts = docDb.Query<Post>("PostsByTag", "tag:Raven");

Categories page: show all posts for category

This is the same as all posts for tag, so I’ll skip it.

There are a few things to notice about this approach.

  • A good modeling technique is to think about Aggregates in the DDD sense, an Aggregate and all its associations are stored as a single document.
  • If we need to query on an entity encapsulated in an Aggregate, we can extract that information using an index. That extraction, however, doesn’t change the way that we access the data.
  • We can create “virtual entities”, such as Tags or Categories, which don’t really exist, but are generated by the indexes we define. Those are usually things that are only useful for users to search / organize the aggregates on.
  • Schema doesn’t really matter, because we either search by key, or on well known indexes, which massage the data into the format that we expect.

Thoughts?

That No SQL Thing: The relational modeling anti pattern in document databases

I am going to demonstrate the design of the data model in a document database for a typical blog application.

The following is my default sample data model, showing a very simple blog:

[image: the relational data model for the blog]

The absolutely wrong approach with a document database is to try to take the relational model and apply it on a document level. This is especially wrong because for a while, it might actually work. Let us say that we want to store the following:

var user = new User("ayende");
var blog = new Blog("Ayende @ Rahien", user) { Tags = {".NET", "Architecture", "Databases" } };
var categoryRaven = new Category("Raven");
var categoryNoSQL = new Category("NoSQL");
var post = new Post(blog, "RavenDB", "... content ...")  
{
    Categories  = { categoryRaven, categoryNoSQL },
    Tags = {"RavenDB", "Announcements" }
};
var comment = new Comment(post, "Great news");

PersistAll(user, blog, categoryRaven, categoryNoSQL, post, comment);

Interestingly enough, I need to use code to represent the data without tying it to a particular storage format.

The wrong approach to store the data would be to store each object as its own document, similar to the way we would store each object as its own row in a relational database. That wrong approach would look like this:

// users/ayende
{
   "type": "user",
   "name": "ayende"
}

// tags/1
{
   "name": ".NET"
}

// tags/2
{
   "name": "Architecture"
}

// tags/3
{
   "name": "Databases"
}
// tags/4
{
   "name": "RavenDB"
}
// tags/5
{
   "name": "Announcements"
}
// categories/1
{
    "name": "Raven"
}
// categories/2
{
    "name" : "NoSQL"
}
// blogs/1
{
   "type": "blog",
    "users": ["users/ayende"],
    "name": "Ayende @ Rahien",
    "tags": ["tags/1", "tags/2", "tags/3"]
}

// posts/1
{
    "blog": "blogs/1",
    "title": "RavenDB",
    "content": "... content ...",
    "categories": ["categories/1", "categories/2"]
    "tags" : ["tags/4", "tags/5"]
}

// comments/1
{
    "post": "posts/1",
    "content": "Great News"
}

I know that I am repeating myself here, but I have seen people miss the point before. Do NOT try to model a document database in this way.

The main reason that this is wrong is that a document database has no real support for doing joins, unions or any of the things that make such a model work effectively in a relational model.

Let us try to analyze the scenarios where we need this data, okay?

  • Main page: show list of blogs
  • Main page: show list of recent posts
  • Main page: show list of recent comments
  • Main page: show tag cloud for posts
  • Main page: show categories
  • Post page: show post and all comments
  • Post page: add comment to post
  • Tag page: show all posts for tag
  • Categories page: show all posts for category

I am going to analyze each of those scenarios using SQL (and the above model) and the current (and bad, smelly, nasty) document database model. I’ll have another post showing how to correctly model this in a document database; this post is about how not to do it.

Main page: show list of blogs

Using SQL, this is pretty easy:

select * from blogs

Using DocDB, this is easy, we are using a built-in index to query for documents by their type:

docDb.Query<Blog>("DocumentsByType", query:"type:blog");

Main page: show list of recent posts

Using SQL, this is pretty easy:

select * from posts order by PostedAt desc

Using DocDB, we need to define our own index function, to allow us to sort on it. That is painless, even if I say so myself:

from doc in docs
where doc.type == "post"
select new {doc.posted_at}

And now we can query this using:

docDb.Query<Post>("Posts", orderBy:"-posted_at");

Main page: show list of recent comments

This is exactly the same as recent posts, so I’ll skip it.

Main page: show tag cloud for posts

Here the SQL grows interesting:

select Name, COUNT(*) as TagCount from tags
where ItemType = 'Posts'
group by Name

And with the document database we need to write a map/reduce index (“Did you just tell me to go @%$# myself?”)

from doc in docs
where doc.type == "posts"
from tag in doc.tags
select new { tag, count = 1 }

from result in results
group result by result.tag into g
select new { tag = g.Key, count = g.Sum(x=>x.count) }

And now that we have the index, we can get the values from it using:

var tagCloud = new TagCloud();
var tagIds = docDb.Query<TagAndCount>("TagsCloud", orderBy:"+count");
foreach(var tagId in tagIds)
{
    var tag = docDb.Get<Tag>(tagId.Tag);
    tagCloud.Add(tag.Name, tagId.Count);
}

Now this is ugly on many levels. First, we have the fairly complex index. Second, we have to merge the data ourselves at the client side. Third, we have to perform a SELECT N+1.

Yuck doesn’t begin to cover it. There are actually ways to handle this more nicely, by making a multi get request, but I’ll not bother.
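For completeness, such a multi get could look something like this; GetMany is a hypothetical client call, shown only to illustrate how it collapses the N+1 into two requests:

var tagCloud = new TagCloud();
var tagIds = docDb.Query<TagAndCount>("TagsCloud", orderBy:"+count");
var tags = docDb.GetMany<Tag>(tagIds.Select(x => x.Tag));   // one request for all the tag documents
foreach (var pair in tagIds.Zip(tags, (id, tag) => new { id, tag }))
    tagCloud.Add(pair.tag.Name, pair.id.Count);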

Main page: show categories

Exactly the same as show blogs, so I’ll skip it.

Post page: show post and all comments

Using stupid SQL:

select * from Posts where Id = 1

select * from Comments where PostId = 1

A more efficient method would be to use a join:

select * from Posts 
  join Comments 
    on Posts.Id = Comments.PostId
where Posts.Id = 1

With the doc db, we can do:

var post = docDb.Get<Post>(1);
var comments = docDb.Query<Comment>("CommentsByPost", query:"post_id:1", orderBy:"+posted_at");

Which, of course, require us to define the comments by post index:

from doc in docs
where doc.type == "comment"
select new{doc.post_id, doc.posted_at}

Note that we have to make two calls here, because a document database has no notion of joins.

Post page: add comment to post

In SQL, it is a straightforward insert:

insert into comments (PostId, ... )
values(1, ...)

And with a document database, you can use:

docDb.Store(new Comment{ PostId = 1, ... });
docDb.SaveChanges();

Nothing much to look at here, using this flawed model.

Tag page: show all posts for tag

Using SQL, that is slightly complex, because tags may be associated with blogs or with posts, so we need to do:

select * from Posts 
where Id in (
    select ItemId from tags
    where ItemType = 'Posts' and TagId = 1
)

Using a document database:

var posts = docDb.Query<Post>("PostsByTag", query:"tag:tags/1");

With the following index:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new {tag}

Categories page: show all posts for category

This is exactly like tags, so I’ll skip it.

As you’ve seen, by copying the relational model, we have created just the same sort of environment that we already had with an RDBMS, but now we are faced with the problem that a document database can’t do things that a relational database can. In my eyes, what we have done is a net loss. Oh, we may gain some small benefit by being schemaless, but that isn’t really that beneficial compared to the amount of effort that we have to go to by trying to be relational on a non-relational database.

That No SQL Thing: Document Databases – usages

I described what a document database is, but I haven’t touched on why you would want to use it.

The major benefit, of course, is that you are dealing with documents. There is little or no impedance mismatch between DTOs and documents. That means that storing data in the document database is usually significantly easier than when using an RDBMS for most non trivial scenarios.

It is usually quite painful to design a good physical data model for an RDBMS, because the way the data is laid out in the database and the way that we think about it in our application are drastically different. Moreover, an RDBMS has this little thing called a schema. And modifying a schema can be a painful thing indeed.

Sidebar: One of the most common problems that I find when reviewing a project is that the first step (or one of them) was to build the Entity Relationship Diagram, thereby sinking a large time/effort commitment into it before the project really starts and real world usage tells us what we actually need.

The schemaless nature of a document database means that we don’t have to worry about the shape of the data we are using, we can just serialize things into and out of the database. It helps that the commonly used format (JSON) is both human readable and easily managed by tools.

A document database doesn’t support relations, which means that each document is independent. That makes it much easier to shard the database than it would be in a relational database, because we don’t need to either store all relations on the same shard or support distributed joins.

Finally, I like to think about document databases as a natural candidate for DDD and DDDish (DDD-like?) applications. When using a relational database, we are instructed to think in terms of Aggregates and always go through an aggregate. The problem with that is that it tends to produce very bad performance in many instances, as we need to traverse the aggregate associations, or have specialized knowledge in each context. With a document database, aggregates are quite natural, and highly performant; they are just the same document, after all.

I’ll post more about this issue tomorrow.

Oops: Dangerous typos

When releasing EF Prof, I had a typo in the pricing information. A single additional zero led to this embarrassment:

[image: screenshot of the mistyped price]

I think I caught it before anyone actually saw the price…

Profiler subscriptions are now available

One of the repeated requests for the profiler was to provide some alternative to the full price. After due consideration, I chose to implement an additional licensing model for the profiler, a subscription based one.

The current licensing model is still available and will not change!

What does this mean? Using the subscription model, you pay a monthly fee, and you can keep using the profiler as long as you pay.

The monthly fee is 12 Euro / $16 US.

Why 12 Euro / $16? It isn’t a “normal” price point.

When looking at the pricing model for subscriptions, I wanted to choose something low enough that most people wouldn’t mind paying. I chose this particular price point because this is how much a business lunch costs at a nearby restaurant. I believe that it should be low enough for everyone in the target audience.

The terms are simple:

  • Payment is made month to month (that is, you get charged each month for the profiler).
  • You can cancel the subscription at any point in time, and your copy of the profiler will expire when its time is due.
  • Upgrading between versions will happen automatically. That is, if you are subscribed to version 1.x, you’ll get 2.x automatically.

Subscriptions are currently offered for NH Prof only, but we will expand that to the other profilers if this proves successful.

You can buy a subscription here.

That No SQL Thing – Document Databases

A document database is, at its core, a key/value store with one major exception. Instead of just storing any blob in it, a document db requires that the data will be stored in a format that the database can understand. The format can be XML, JSON, Binary JSON (MongoDB), or just about anything, as long as the database can understand it.

Why is this such a big thing? Because when the database can understand the format of the data that you send it, it can now do server side operations on that data. In most doc dbs, that means that we can now allow queries on the document data. The known format also means that it is much easier to write tooling for the database, since it is possible to display and edit the data.

I am going to use Raven as the example for this post. Raven is a JSON based, REST enabled document database. Documents in Raven use the JSON format, and each document contains both the actual data and additional metadata that you can use. Here is an example of a document:

{
    "name": "ayende",
    "email": "ayende@ayende.com",
    "projects": [
        "rhino mocks",
        "nhibernate",
        "rhino service bus",
        "raven db",
        "rhino persistent hash table",
        "rhino distributed hash table",
        "rhino etl",
        "rhino security",
        "rampaging rhinos"
    ]
}

We can PUT this document in the database, under the key ‘ayende’. We can also GET the document back by using the key ‘ayende’.
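Over the REST API, that maps to plain HTTP verbs against the document key. Here is a sketch, assuming the /docs/{key} convention used elsewhere in this series (e.g. GET /docs/users/32 in the timing table earlier):

var put = (HttpWebRequest)WebRequest.Create("http://localhost:8080/docs/ayende");
put.Method = "PUT";
using (var stream = put.GetRequestStream())
{
    var body = Encoding.UTF8.GetBytes(File.ReadAllText("ayende.json")); // the document above
    stream.Write(body, 0, body.Length);
}
put.GetResponse().Close();                        // actually send the request

var get = (HttpWebRequest)WebRequest.Create("http://localhost:8080/docs/ayende");
using (var response = get.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    Console.WriteLine(reader.ReadToEnd());        // the JSON document comes back
}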

A document database is schema free, that is, you don’t have to define your schema ahead of time and adhere to it. It also allows us to store arbitrarily complex data. If I want to store trees, or collections, or dictionaries, that is quite easy. In fact, it is so natural that you don’t really think about it.

It does not, however, support relations. Each document is standalone. It can refer to other documents by storing their keys, but there is nothing to enforce relational integrity.

The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren’t limited to just querying by key. By storing information in a form that the database can understand, we can ask the server to do things for us. Such as defining the following index for us:

from doc in docs
from prj in doc.projects
select new { Project = prj, Name = doc.Name }

With that in place, we can now make queries on the stored documents:

GET /indexes/ProjectAndName?query=Project:raven

This will give us all the documents that include a project with the name raven.

We can take it a bit further, though, and ask the database to store aggregations for us:

from doc in docs
from prj in doc.projects
select new { Count = 1, Project = prj}

from result in results
group result by result.Project into g
select new { Project = g.Key, Count = g.Sum(x=>x.Count) };

And now we can query for the number of users involved in the raven project:

GET /indexes/UsersCountByProject?query=Project:raven

In general, querying a document database falls into one of two camps: prepared ahead of time (Raven, CouchDB) or computed at query time (MongoDB).

In the first case, you define an indexing function (in Raven’s case, a Linq query; in CouchDB’s case, a JavaScript function) and the server will run it to prepare the results; once the results are prepared, they can be served to the client with minimal computation. CouchDB and Raven differ in the method they use to update those indexes. Raven will update the index immediately on document change, and queries to indexes will never wait. The query may return a stale result (and is explicitly marked as such), but it will return immediately. With CouchDB, a view is updated at view query time, which may lead to a long wait the first time a view is accessed if there were a lot of changes in the meanwhile. The CouchDB approach avoids doing extra work that may not be needed, but Raven’s approach means that you don’t have to handle potentially large delays at query time. CouchDB allows passing a parameter at view query time to let it return stale results, which seems to be the best of all worlds.

Note that in both CouchDB’s and Raven’s cases, indexes do not affect write speed, since in both cases this is done in a background task.

MongoDB, on the other hand, allows ad-hoc querying, and relies on indexes defined on the document values to help it achieve reasonable performance when the data size grows large enough. MongoDB’s indexes behave in much the same way RDBMS indexes behave, that is, they are updated as part of the insert process, so a large number of indexes is going to affect write performance.

Other document databases:

  • ThruDB is the closest one that I could find, but it is basically a key/value store with manual Lucene indexing.
  • Terrastore seems to be a full-fledged document database, but I can’t find much information about it. It requires Terracotta to work.
  • Riak is another Erlang-based doc DB. It seems to be close to MongoDB in terms of operation, but that is about as much as I know about it. Querying Riak is done by writing a map/reduce query in JavaScript or Erlang; I think it is the originator of that approach.

re: NoSQL, meh

I was pointed to this blog post, written by the guy who wrote ZODB (a Python object database), who isn’t excited about NoSQL:

But for me there was a moment of pure joy one morning when an absolutely awesome colleague I worked with at the time said to me something like: "There's a problem with this invoice, I traced it down to this table in the database which has errors in these columns. I've written a SQL statement to correct, or should it be done at the model level?". Not only had he found and analyzed the problem, he was offering to fix it.

Praise the gods. To do similar in Plone, he would have had to learn Python, read up on all the classes. Write a script to get those objects from the ZODB and examine them. Not a small undertaking by any means.

What was going on

The tools for the ZODB just weren't there and it wasn't transparent enough. If there was a good object browser for the ZODB (and yes a few simple projects showed up that tried to do that) that did all the work and exposed things that would really help. But setting up and configuring such a tool is hard and I never saw one that allowed you to do large scale changes.

I also got the following comment, on an unrelated post:

Not directly related, but I'm curious why you don't use Rhino Queues? The tooling with msmq?

Tooling is incredibly important! In fact, it is the tooling that can make or break a system. Rhino Queues is a really nice queuing system; it offers several benefits over MSMQ, but it has no tooling support, and as such, it is at a huge disadvantage in comparison to MSMQ (or other queuing technologies).

With databases, this is even more important. I can (usually) live without having direct access to queued messages, but for databases and data stores, having the ability to access the data in an easy manner is mandatory. Some data is unimportant (I couldn’t care less what the values in user #12094’s session are), but for the most part, you really want the ability to read that data and look at it in a human readable fashion.

The problems that Andy ran into with ZODB are related more to the fact that ZODB didn’t have any good tooling and that the ZODB storage format was tied directly to Python’s pickle.

With Raven, I flat out refuse to release it publicly until we have a good tooling story. Raven comes with its own internal UI (accessible via any web browser), which allows you to define indexes, view/create/edit the documents, browse the database, etc.

I consider such abilities crucial, because without those, a database is basically a black hole, which requires special tooling to work with. By providing such tooling out of the box, you reduce the barrier to entry by a huge margin.

This image was sent to me by a lot of people:

This is funny, and true. There is another layer here: you don’t query a key/value store; that is like asking how to feed a car because you are used to riding horses. If you need to perform queries on a key/value store, you are using the wrong tool, or perhaps you are trying to solve the problem in a non-idiomatic way.

Debugging a production problem

The backend for NH Prof stopped working all of a sudden. Obviously, that is not a good thing, so I set out to fix it as soon as I could. Unfortunately, I really couldn’t see what the problem was.

I am running on my own server, so at least I didn’t have to play the usual games with trying to figure out what is wrong by inferring things.

The first thing to look at was the queues, and I saw something very strange there:

image

There were a lot of messages in the queue, but they weren’t being processed. The next step was to look at the error queue, but there was nothing really interesting there, and restarting the service did nothing.

I set up the system to enable logging, but after the flurry of startup logging messages, the only things in the log were messages telling me about sent messages that are part of scheduled tasks in the system, such as:

DEBUG Rhino.ServiceBus.Msmq.OpenedQueue - Sending message NHProfilerWeb.Backend.InternalMessages.CheckOrders

My initial suspicion was that we had a hung thread handling a message, and that caused the entire thing to stop, but that didn’t seem to fit, since I never got a log message about starting to process the message. Time to pull out the big guns, then: WinDbg.

I restarted the process, and attached to it, then issued the following command:

!eestack -ee

This gives me the managed stack on all the threads. Using that, I managed to isolate this thread:

Thread   9
Current frame:
ChildEBP RetAddr  Caller,Callee
0640ee88 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eea8 067290dd (MethodDesc 0x5f23470 +0x35 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640eed4 05f0c9d3 (MethodDesc 0x5f23454 +0x53 System.Messaging.MessageQueue.StaleSafeReceiveMessage(...))
0640ef00 05f0e46b (MethodDesc 0x5f23340 +0x1d3 System.Messaging.MessageQueue.ReceiveCurrent(...))
0640ef64 06178529 (MethodDesc 0x5f2325c +0x211 System.Messaging.MessageQueue.ReceiveBy(...))
0640efc8 061782bb (MethodDesc 0x5f23280 +0x4b System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType))
0640eff8 06177a19 (MethodDesc 0x5059458 +0x61 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String))
0640f040 06030eeb (MethodDesc 0x50589a8 +0x103 Rhino.ServiceBus.Msmq.MsmqTransport.ReceiveMessageInTransaction(Rhino.ServiceBus.Msmq.OpenedQueue, System.String, System.Func`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,Boolean>, System.Action`2<Rhino.ServiceBus.Impl.CurrentMessageInformation,System.Exception>))
0640f0c0 060303b1 (MethodDesc 0x50589f0 +0x209 Rhino.ServiceBus.Msmq.MsmqTransport.HandlePeekedMessage(Rhino.ServiceBus.Msmq.OpenedQueue, System.Messaging.Message))
0640f14c 05f0dea2 (MethodDesc 0x5057110 +0x30a Rhino.ServiceBus.Msmq.AbstractMsmqListener.PeekMessageOnBackgroundThread(System.Object))
0640f208 6fa816fe (MethodDesc 0x6f2139cc System.Threading.ThreadHelper.ThreadStart_Context(System.Object))
0640f214 6f3601ef (MethodDesc 0x6f18275c +0x6f System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
0640f228 6f85b5aa (MethodDesc 0x6f2139d8 +0x4a System.Threading.ThreadHelper.ThreadStart(System.Object))

As you can see, it appears to be hung on receiving a message by id.

The next action was to switch to that thread using:

~9s

And ask to see the stack of this thread along with its parameters:

!clrstack -p

The interesting tidbit is here:

0640efec 061782bb System.Messaging.MessageQueue.ReceiveById(System.String, System.Messaging.MessageQueueTransactionType)
    PARAMETERS:
        this = <no data>
        id = <no data>
        transactionType = <no data>

0640f004 06177a19 Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String)
    PARAMETERS:
        this = 0x01ef7910
        messageId = 0x01efa070

So now we need to look at what this message id is:

!dumpobj 0x01efa070

And the result is:

Name: System.String
MethodTable: 6f3b0a00
EEClass: 6f16d64c
Size: 162(0xa2) bytes
(C:\Windows\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: 0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\1551847
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
6f3b2c4c  4000096        4         System.Int32  1 instance       73 m_arrayLength
6f3b2c4c  4000097        8         System.Int32  1 instance       44 m_stringLength
6f3b16e0  4000098        c          System.Char  1 instance       30 m_firstChar
6f3b0a00  4000099       10        System.String  0   shared   static Empty
    >> Domain:Value  002eae88:01cc1198 0031b358:01cc1198 <<
6f3b1630  400009a       14        System.Char[]  0   shared   static WhitespaceChars
    >> Domain:Value  002eae88:01cc1714 0031b358:01cc5aa0 <<

Knowing that, I switched to the queue and looked at the first message:

image

That is good; now we know what the problem is: we are hung on ReceiveById. That is not something that I expected.

CheckOrders is a periodic message, so I can afford to consume a few in order to see what is going on. I opened PowerShell and executed:

[System.Reflection.Assembly]::Load("System.Messaging, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a")
$q = new-object System.Messaging.MessageQueue("NHProf.BackEnd")
$q.Peek()
Exception calling "Peek" with "0" argument(s): "Length cannot be less than zero.
Parameter name: length"
At line:1 char:8
+ $q.Peek <<<< ()
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : DotNetMethodException

That was the point when I gave up on it, downloaded Boo and started the Boo shell. I didn’t get very far:

>>>import System.Messaging
>>>q = MessageQueue("NHProf.BackEnd")
System.Messaging.MessageQueue
>>>msg = q.Peek()
System.ArgumentOutOfRangeException: Length cannot be less than zero.
Parameter name: length
   at System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length,
Boolean fAlwaysCopy)
   at System.Messaging.MessageQueue.ResolveFormatNameFromQueuePath(String queueP
ath, Boolean throwException)
   at System.Messaging.MessageQueue.get_FormatName()
   at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 actio
n, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction in
ternalTransaction, MessageQueueTransactionType transactionType)
   at System.Messaging.MessageQueue.Peek()
   at Input3Module.Main(String[] argv) in C:\Users\Administrator\Documents\boo-0
.9.3.3457-bin[1]\bin\input3:line 1

Unlike the PowerShell example, I had the full stack trace, so I knew what was going on. In the PowerShell case, I was certain that there was some issue with calling overloaded methods. Once I saw the stack trace, it was perfectly obvious what was wrong: I hadn’t passed a valid queue name. That was easy to fix:

>>>q.Close()
>>>q = MessageQueue(""".\private$\NHProf.BackEnd""")
System.Messaging.MessageQueue
>>>q.Peek()
System.Messaging.Message

And now I could start playing around with things:

>>>q.Peek().Id
'0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551847'

So this is the right message, now let us try to consume it:

>>>q.ReceiveById('0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551847')
System.Messaging.Message
>>>q.Peek().Id
'0730ab9b-eea5-4acd-bc93-7ee594cd2e8f\\1551849'
>>>q.Peek().Label
'NHProfilerWeb.Backend.InternalMessages.CheckOrders'

So that works well enough.

What is the difference, though? Well, it looks like the major difference is that we aren’t using DTC here. I restarted the DTC service and the MSMQ service and everything started to work :-)

Now, I am not quite sure why this is the case, to tell you the truth, and I would dearly love to know why, but at least it works now...

That No SQL Thing – Key / Value stores – Usages

Now that we have a fairly good understanding of what Key / Value stores are, what are they good for? Well, they are really good if you need to look up things by key :-)

Storing the user’s session is a common example, as is the user’s shopping cart. In fact, per-user data is probably the most common usage for a key/value store. The reasons for that are fairly obvious, I would imagine: we already have the user’s id, and we need a well known piece of data. That makes it very cheap to access, modify, and store at the end of a request.
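As a rough sketch of how that usually looks, here is per-user session handling against a hypothetical key/value client; the IKeyValueStore interface below is my own stand-in, not any particular product’s API. The key is derived from the user id, and the whole session travels as a single blob:

using System;

// Hypothetical minimal client interface - real key/value stores expose something very similar
public interface IKeyValueStore
{
    byte[] Get(string key);
    void Put(string key, byte[] value);
    void Delete(string key);
}

public class SessionStorage
{
    private readonly IKeyValueStore store;

    public SessionStorage(IKeyValueStore store)
    {
        this.store = store;
    }

    public byte[] LoadSession(int userId)
    {
        // Well known key per user - a single GET at the start of the request
        return store.Get("sessions/" + userId);
    }

    public void SaveSession(int userId, byte[] serializedSession)
    {
        // A single PUT at the end of the request, overwriting the whole blob
        store.Put("sessions/" + userId, serializedSession);
    }
}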

Other usages include storing already processed data for quick retrieval; consider things like: Favorite Products, Best Of …, etc. The idea here is to have the application access the key / value store using well known keys for those data items, while background processes update them periodically. It isn’t quite a cache, because there isn’t really a scenario where the value doesn’t exist, but it is a near thing.

One interesting use case that I ran into was storing the product catalog in the key/value store, which allowed pulling a particular product’s details in a single call. The key/value store was also used to track how many units were physically in stock, so it was able to tell if the item was available for purchase or not.

Just remember, if you need to do things more complex than just access a bucket of bits using a key, you probably need to look at something else, and the logical next step in the chain is the Document Database.

Entity Framework Profiler goes out of beta

Well, it has been there for long enough :-)

In the tradition of finding meaningful dates for pushing things out of beta, I thought about making it the 1st of April, but I thought better of it when I considered how it might be taken.

So on the 12th of April, EF Prof goes out of beta.

That means that the 30% beta discount will be over, which means that if you want it, you might want to hurry up :-)


That No SQL Thing – Key / Value stores – Operations

I mentioned before that the traditional API for Key / Value stores is pretty simplistic: Get/Put/Delete over blob values. That doesn’t really give us many options.

I have a long series of posts talking about how to do a whole host of operations using this minimal API.