ETL Process using Raven

time to read 23 min | 4516 words

In order to test Raven behavior with a large amount of data, I needed to load a whole bunch of data to it. I choose StackOverflow Data Dump as my sample source.

That left me with the problem of how to go from data that looks like this:

<!--  users.xml -->
<row Id="1"
         Reputation="18566"
         CreationDate="2008-07-31T14:22:31.287"
         DisplayName="Jeff Atwood"
         EmailHash="51d623f33f8b83095db84ff35e15dbe8"
         LastAccessDate="2010-03-02T07:33:46.390"
         WebsiteUrl="http://www.codinghorror.com/blog/"
         Location="El Cerrito, CA"
         Age="40"
         AboutMe="..."
         Views="23436"
         UpVotes="1972"
         DownVotes="431" />

<!--  badges.xml -->
<row Id="82946"
         UserId="3718"
         Name="Teacher"
         Date="2008-09-15T08:55:03.923" />

<!-- posts.xml -->
<row Id="7"
         PostTypeId="2"
         ParentId="4"
         CreationDate="2008-07-31T22:17:57.883"
         Score="44"
         ViewCount="0"
         Body="..."
         OwnerUserId="9"
         LastEditorUserId="1"
         LastEditorDisplayName=""
         LastEditDate="2010-02-02T19:38:14.493"
         LastActivityDate="2010-02-02T19:38:14.493"
         CommentCount="0" />

<!-- votes.xml -->
<row Id="2"
         PostId="3"
         VoteTypeId="2"
         CreationDate="2008-07-31" />


<!-- comments.xml -->
<row Id="12"
         PostId="47428"
         Score="2"
         Text="..."
         CreationDate="2008-09-06T13:51:47.843"
         UserId="4642" />

To JSON documents. Actually, it is a bit worse, the data dump is relational in nature, which isn’t really suitable for a doc db format. What I actually want is a User document, which contains badges in it and a Post document, which contains votes and comments in it.

My first approach was this:

  • foreach user in users.xml
    • Insert user document
  • foreach badge in badges.xml
    • Get user document from database
    • Update appropriate user document with new badge
  • foreach post in posts.xml
    • Insert post document
  • foreach vote in votes.xml
    • Get Post document from database
    • Update post with vote
  • foreach comment in comments.xml
    • Get Post document from database
    • Update post with comment

Running with an embedded option, with no indexing enabled, that took about 1:15 hours. (That is actually not fair, this process involved reading the XML as well, and resulted in several performance improvements.)

My next attempt was to do something like:

  • foreach user in users.xml & foreach badge in badges.xml
    • Merge badges to appropriate user
    • Insert user document
  • foreach post in posts.xml & foreach vote in votes.xml & foreach comment in comments.xml
    • Marge votes to appropriate post
    • Merge comments to appropriate post
    • Insert post document

The theory was that I can reduce the number of calls to the database, thus reduce the overall cost. The problem is that trying to do this sort of merges in memory take a lot of time and a huge amount of memory. It is perhaps telling that I gave up that approach on a 64 bits machine with 8 GB of RAM.

Finally, I decided to go in a two stage approach, because what I actually wanted to see was the usage of Raven, not the performance of reading the XML. The current ETL process is composed of two steps:

Part 1:

  • foreach user in users.xml
    • Partition in 128 users per batch
    • Generate JSON document representing a batch insert in the database
  • similar to badges, posts, votes and comments

Part 2:

Submit the JSON documents for Raven over HTTP

Here is how it looks like for users:

[
  {
    "Key": "users/-1",
    "Etag": null,
    "Method": "PUT",
    "Document": {
      "UpVotes": 413,
      "EmailHash": "a007be5a61f6aa8f3e85ae2fc18dd66e",
      "AboutMe": "…",
      "Id": -1,
      "LastAccessDate": "\/Date(1219699013810+0300)\/",
      "Views": 649,
      "Location": "on the server farm",
      "DownVotes": 22414,
      "Reputation": 1,
      "CreationDate": "\/Date(1217451600000+0300)\/",
      "Age": 2,
      "DisplayName": "Community",
      "WebsiteUrl": "http://stackoverflow.com"
    },
    "Metadata": {
      "Raven-Entity-Name": "Users"
    }
  },
  {
    "Key": "users/1",
    "Etag": null,
    "Method": "PUT",
    "Document": {
      "UpVotes": 1972,
      "EmailHash": "51d623f33f8b83095db84ff35e15dbe8",
      "AboutMe": "…",
      "Id": 1,
      "LastAccessDate": "\/Date(1267508026390+0200)\/",
      "Views": 23436,
      "Location": "El Cerrito, CA",
      "DownVotes": 431,
      "Reputation": 18566,
      "CreationDate": "\/Date(1217503351287+0300)\/",
      "Age": 40,
      "DisplayName": "Jeff Atwood",
      "WebsiteUrl": "http://www.codinghorror.com/blog/"
    },
    "Metadata": {
      "Raven-Entity-Name": "Users"
    }
  }
]

This is pretty simple, right?

But, what about something like badges? How can we go from something like this:

<!-- badges.xml --> 
<row Id="82946" 
          UserId="3718" 
          Name="Teacher" 
          Date="2008-09-15T08:55:03.923" /> 

To a document update? After all, we don’t have the rest of the document. That would seem to necessitate a fetch, adding the value at the client side and send to the server. However, that is going to be slow. Luckily, Raven support the PATCH command. There isn’t really a diff format for JSON, so we had to invent our own. In this case, the document for adding badges is pretty simple and should be self explanatory:

[
  {
    "Key": "users/3718",
    "Method": "PATCH",
    "Etag": null,
    "Patches": [
      {
        "Type": "Add",
        "Value": {
          "Name": "Teacher",
          "Dates": [
            "\/Date(1221458103923+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Supporter",
          "Dates": [
            "\/Date(1221458104830+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Autobiographer",
          "Dates": [
            "\/Date(1221458105220+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Editor",
          "Dates": [
            "\/Date(1221753452033+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Organizer",
          "Dates": [
            "\/Date(1221753452780+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Scholar",
          "Dates": [
            "\/Date(1222711653513+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Student",
          "Dates": [
            "\/Date(1222753351563+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Critic",
          "Dates": [
            "\/Date(1249633658857+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Beta",
          "Dates": [
            "\/Date(1221512400000+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Yearling",
          "Dates": [
            "\/Date(1251666165617+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Popular Question",
          "Dates": [
            "\/Date(1254097352953+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Commentator",
          "Dates": [
            "\/Date(1254833254790+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      }
    ]
  },
  {
    "Key": "users/994",
    "Method": "PATCH",
    "Etag": null,
    "Patches": [
      {
        "Type": "Add",
        "Value": {
          "Name": "Teacher",
          "Dates": [
            "\/Date(1221458103957+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Supporter",
          "Dates": [
            "\/Date(1221458104830+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Autobiographer",
          "Dates": [
            "\/Date(1221458105220+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Scholar",
          "Dates": [
            "\/Date(1226316765340+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Student",
          "Dates": [
            "\/Date(1226358152000+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Popular Question",
          "Dates": [
            "\/Date(1250109153087+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Beta",
          "Dates": [
            "\/Date(1221512400000+0300)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Critic",
          "Dates": [
            "\/Date(1257496652637+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      },
      {
        "Type": "Add",
        "Value": {
          "Name": "Editor",
          "Dates": [
            "\/Date(1257496952257+0200)\/"
          ]
        },
        "Name": "Badges",
        "Position": null,
        "Nested": null
      }
    ]
  }
]

The fun part about the whole thing is now I am truly testing just Raven’s performance, because the client side is now just reading files and submitting requests to Raven. Another important factor is that instead of issuing a single request per document, we are able to batch them together in groups of 128 requests and submit them to /bulk_docs.

Using this process, we see a pretty consistent average of about 1,000 documents per second for batched inserts and somewhat less than that for batched PATCH requests.

Just to give you some ideas, here are are the file sizes per type:

  • Users – 65 KB
  • Badges - ~500 Kb
  • Posts – 150 Kb
  • Comments – 70 Kb
  • Votes – 40 Kb

Note that each file contains 128 requests, and that those aren’t final document size. User document is made up of the user data and badges, for example.

But so far, I am pretty happy about it.