
FLOSS Moling with RavenDB


There is the FLOSS Mole data set, which provides a lot of interesting information about open source projects. As I am always interested in testing RavenDB with different data sets, I decided that this would be a great opportunity to do that, and to get some additional information about how things are working as well.

The data is provided in a number of formats, but most of them aren't really easy to access: SQL statements and raw text files that I assume are tab separated, but that I couldn't really figure out quickly.

I decided that this would be a great example of actually migrating content from a SQL system to a RavenDB system. The first thing to do was to install MySQL, as that seemed to be the easiest way to get the data out. (As a note, MySQL Workbench is really not what I would call nice.)

The data looks like this (this is the Google Code projects data), and you can also see that a lot of the data is driven by the notion of a project.

[image: sample rows from the gc_projects table, with columns such as proj_name, code_license, code_url, content_license, content_url, project_description, and project_summary]

I explored the data a bit, and I came to the conclusion that this is pretty simple stuff, overall. There are a few many-to-one associations, but all of them were capped (the max was 20 items or so).

That meant, in turn, that the import process would be really simple work. I started by creating the actual model which we will use to save to RavenDB:

[image: class diagram of the Project model and its nested Blog, Group, Link, and Person classes]
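
In code, the model is roughly the following (a sketch reconstructed from the import code below; the exact property types are assumptions):

using System.Collections.Generic;

public class Project
{
    public string Name { get; set; }
    public string CodeLicense { get; set; }
    public string CodeUrl { get; set; }
    public string ContentLicense { get; set; }
    public string ContentUrl { get; set; }
    public string Description { get; set; }
    public string Summary { get; set; }
    public List<string> Labels { get; set; }
    public List<Blog> Blogs { get; set; }
    public List<Group> Groups { get; set; }
    public List<Link> Links { get; set; }
    public List<Person> People { get; set; }
}

public class Blog
{
    public string Link { get; set; }
    public string Title { get; set; }
}

public class Group
{
    public string Name { get; set; }
    public string Url { get; set; }
}

public class Link
{
    public string Url { get; set; }
    public string Title { get; set; }
}

public class Person
{
    public string Name { get; set; }
    public string Role { get; set; }
    public string UserId { get; set; }
}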

The rest was just a matter of reading from MySQL and writing to RavenDB. I chose to use PetaPoco for the SQL access, because it is the easiest. The following code sucks. It is written with the assumption that I know what the data sizes are, that the cost of making so many queries (roughly 1,500,000 queries) is acceptable, etc.

using (var docStore = new DocumentStore
    {
        ConnectionStringName = "RavenDB"
    }.Initialize())
using (var db = new PetaPoco.Database("MySQL"))
using (var bulk = docStore.BulkInsert())
{
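    // Note: this makes one query for the project list, and then five more
    // queries per project (labels, blogs, groups, links, people) -
    // roughly 1.5 million queries in total.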
    foreach (var prj in db.Query<dynamic>("select * from gc_projects").ToList())
    {
        string name = prj.proj_name;
        bulk.Store(new Project
            {
                Name = name,
                CodeLicense = prj.code_license,
                CodeUrl = prj.code_url,
                ContentLicense = prj.content_license,
                ContentUrl = prj.content_url,
                Description = prj.project_description,
                Summary = prj.project_summary,
                Labels = db.Query<string>("select label from gc_project_labels where proj_name = @0", name)
                                .ToList(),
                Blogs = db.Query<dynamic>("select * from gc_project_blogs where proj_name = @0", name)
                            .Select(x => new Blog { Link = x.blog_link, Title = x.blog_title })
                            .ToList(),
                Groups = db.Query<dynamic>("select * from gc_project_groups where proj_name = @0", name)
                            .Select(x => new Group { Name = x.group_name, Url = x.group_url })
                            .ToList(),
                Links = db.Query<dynamic>("select * from gc_project_links where proj_name = @0", name)
                            .Select(x => new Link { Url = x.link, Title = x.link_title })
                            .ToList(),
                People = db.Query<dynamic>("select * from gc_project_people where proj_name = @0", name)
                    .Select(x => new Person
                        {
                            Name = x.person_name,
                            Role = x.role,
                            UserId = x.user_id
                        })
                    .ToList(),
            });
    }
}

But it does the work, and it was simple to write. Using this code, I was able to insert 299,949 projects in just under 13 minutes. Most of that time went to making those 1.5 million queries to the database, by the way.
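
For the curious, most of those per-project queries could be avoided by loading each child table once and grouping it in memory. A minimal sketch, assuming the child tables fit comfortably in RAM, and sitting inside the same using blocks as above (only labels and blogs are shown; the other collections follow the same pattern):

// Read each child table once and group by proj_name,
// so the per-project loop does no queries at all.
var labels = db.Query<dynamic>("select * from gc_project_labels").ToList()
               .ToLookup(x => (string)x.proj_name, x => (string)x.label);
var blogs = db.Query<dynamic>("select * from gc_project_blogs").ToList()
              .ToLookup(x => (string)x.proj_name);

foreach (var prj in db.Query<dynamic>("select * from gc_projects").ToList())
{
    string name = prj.proj_name;
    bulk.Store(new Project
        {
            Name = name,
            Labels = labels[name].ToList(),
            Blogs = blogs[name]
                        .Select(x => new Blog { Link = x.blog_link, Title = x.blog_title })
                        .ToList(),
            // groups, links and people would be handled the same way
        });
}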

Everything is cool, and it is quite nice. In the next post, I'll talk about why I wanted a new dataset. Don't worry, it is going to be cool.


Comments

peter

Agreed, the workbench app is bad. I only use the "legacy" admin and query apps.

flukus

You mentioned the many-to-one relationships were capped at 20. Out of curiosity, does Raven support fixed-length arrays on models?

Ayende Rahien

Flukus, Yes, it does. Although the capping wasn't physical, it was just that the biggest collection they had was 20 items in size.

Asaf

It would probably run faster if all the queries were under the same transaction: using (var scope = db.GetTransaction())
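
For reference, PetaPoco's transaction scope looks roughly like this (a sketch; disposing the scope without calling Complete rolls the transaction back):

using (var scope = db.GetTransaction())
{
    // run all the reads (and any writes) inside a single database transaction
    scope.Complete(); // commit; without this, Dispose rolls back
}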

