Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 1 min | 178 words

Update: The issue has been resolved, I’ll update on Sunday with full details about this.

Some RavenHQ customers may have noticed that they are currently unable to access their RavenHQ databases.

The underlying reason is an outage in Amazon US-EAST-1 region, which is where the ravenhq.com server and some of the data base servers are located.

Customers with replicated plans should see no disturbance of service, since this should trigger an automatic failover to the secondary node.

If you have any questions / require support, please contact us via our support forum: http://support.ravenhq.com/

You can see the status report from Amazon below. We managed to restore some service, but then lost it again (because of EBS timeouts, I suspect).

We are currently hard at work at bringing up new servers in additional availability zones, and we hope to restore full functionality as soon as possible.

image

time to read 2 min | 345 words

If you had sharp eyes, you might have noticed that in this code, I am actually using two different sessions:

We have the GeoSession, and we have the RavenSession.

The GeoSession is actually pointed at a different database, and it is a read only. In fact, here is how we use this:

image

As you can see, we create this on as needed basis, and we only dispose it, we never actually call SaveChanges().

So, those are the technical details, but what is the reasoning behind this?

Well, it is actually pretty simple. The GeoIP dataset is about 600 MB in size, and mostly it is about… well, geo location stuff. It is a very nice feature, but it is a self contained one, and not something that I really care for putting inside my app database. Instead, I have decided to go another way, and use a separate database.

That means that we have separation, at the data layer, between the different databases. It makes sense, about the only thing that we need from the GeoIP dataset is the ability to handle queries, and that is expressed only via GetLocationByIp, nothing more.

I don’t see a reason to make the app database bigger and more complex, or to have to support updates to the GeoIP dataset inside the app. This is a totally separate service. And having this in a separate database make it much easier to use this the next time that I want to use geo location. And it simplify my life right now with regards to maintaining and working with my current app.

In fact, we could have taken it even further, and not use RavenDB access to this at all. We can use REST calls to get the data out directly. We have chosen to still use the RavenDB Client, I’ll discuss exactly why we chose not to do that.

time to read 14 min | 2638 words

I got word about a  port of Nerd Dinner to RavenDB, Dinner Party (source, live demo), and I just had to check the code.

I reviewed Nerd Dinner itself in the past: Part I, Part II, so it is extra fun to see what happens when you move this to RavenDB. Note that at this point, I haven’t even looked at the code yet.

Here is the actual project (Just one project, I wholeheartedly approve):

image

Hm… where is the controllers folder?

Oh, wait, this isn’t an ASP.NET MVC application, it is a NancyFX application. I never actually look into that, so this should be interesting. Let us see if I can find something interesting, and I think we should look at the bootstrapper first.

There are several interesting things happening here. First, the application uses something called TinyIoC, which I am again, not familiar with. But it seems reasonable, and here is how it is initialized:

 protected override void ApplicationStartup(TinyIoC.TinyIoCContainer container, Nancy.Bootstrapper.IPipelines pipelines)
 {
     base.ApplicationStartup(container, pipelines);

     DataAnnotationsValidator.RegisterAdapter(typeof(MatchAttribute), (v, d) => new CustomDataAdapter((MatchAttribute)v));

     Func<TinyIoCContainer, NamedParameterOverloads, IDocumentSession> factory = (ioccontainer, namedparams) => { return new RavenSessionProvider().GetSession(); };
     container.Register<IDocumentSession>(factory);



     CleanUpDB(container.Resolve<IDocumentSession>());

     Raven.Client.Indexes.IndexCreation.CreateIndexes(typeof(IndexEventDate).Assembly, RavenSessionProvider.DocumentStore);
     Raven.Client.Indexes.IndexCreation.CreateIndexes(typeof(IndexUserLogin).Assembly, RavenSessionProvider.DocumentStore);
     Raven.Client.Indexes.IndexCreation.CreateIndexes(typeof(IndexMostPopularDinners).Assembly, RavenSessionProvider.DocumentStore);
     Raven.Client.Indexes.IndexCreation.CreateIndexes(typeof(IndexMyDinners).Assembly, RavenSessionProvider.DocumentStore);

     pipelines.OnError += (context, exception) =>
     {
         Elmah.ErrorSignal.FromCurrentContext().Raise(exception);
         return null;
     };
 }

All of which looks fine to me, except that I seriously don’t like the injection of the session. Why?

Because it means that if you have two components in the same request that needs a session, each will get his own session, instead of having a session per request. It also means that you can’t implement the “call SaveChanges() when the request is done without error” pattern, but that is more of a pet peeve than anything else.

Another thing to note is the multiple calls to IndexCreation.CreateIndexes. Remember, we have just one assembly here, and CreateIndexes operate on the assembly level, not on the individual index level. All of those can be removed but one (and it doesn’t matter which).

Lastly, we have the CleanupDB part. Dinner Party runs on Azure, and make use of RavenHQ. In order to stay within the limit of the RavenHQ free database, Dinner Party will do cleanups and delete old events if the db size goes over some threshold.

Okay, let us see where the real stuff is happening, and it seems to be happening in the Modules directory. I checked the HomeModule first, and I got:

public class HomeModule : BaseModule
{
    public HomeModule()
    {
        Get["/"] = parameters =>
        {
            base.Page.Title = "Home";

            return View["Index", base.Model];
        };

        Get["/about"] = parameters =>
        {
           
            base.Page.Title = "About";

            return View["About", base.Model];
        };
    }
}

I was worried at first about Page.Title (reminded me of ASPX pages), but it is just a default model that is defined in BaseModule. It is actually quite neat, if you think about it, check it out:

Before += ctx =>
{
    Page = new PageModel()
    {
        IsAuthenticated = ctx.CurrentUser != null,
        PreFixTitle = "Dinner Party - ",
        CurrentUser = ctx.CurrentUser != null ? ctx.CurrentUser.UserName : "",
        Errors = new List<ErrorModel>()
    };

    Model.Page = Page;

    return null;
};

I assume that Model is shared between Module and View, but I will check it shortly. I like how you can expose it to the view dynamically and have a strongly typed version in your code.

And yes, confirmed, the views are just Razor code, and they look like this:

image

Okay, enough with playing around, I’ll need to investigate NancyFX more deeply later on (especially since it can do self hosting), but right now, let us see how this is using RavenDB.

Let us start with the DinnerModule, a small snippet of it can be found here (this is from the ctor):

const string basePath = "/dinners";

Get[basePath + Route.AnyIntOptional("page")] = parameters =>
{

    base.Page.Title = "Upcoming Nerd Dinners";
    IQueryable<Dinner> dinners = null;

    //Searching?
    if (this.Request.Query.q.HasValue)
    {
        string query = this.Request.Query.q;

        dinners = DocumentSession.Query<Dinner>().Where(d => d.Title.Contains(query)
                || d.Description.Contains(query)
                || d.HostedBy.Contains(query)).OrderBy(d => d.EventDate);
    }
    else
    {
        dinners = DocumentSession.Query<Dinner, IndexEventDate>().Where(d => d.EventDate > DateTime.Now.Date)
            .OrderBy(x => x.EventDate);
    }

    int pageIndex = parameters.page.HasValue && !String.IsNullOrWhiteSpace(parameters.page) ? parameters.page : 1;



    base.Model.Dinners = dinners.ToPagedList(pageIndex, PageSize);

    return View["Dinners/Index", base.Model];

};

I am not sure that I really like this when you effectively have methods within methods, and many non trivial ones.

The code itself seems to be pretty nice, and I like the fact that it makes use of dynamic in many cases to make things easier (for Query or to get the page parameter).

But where does DocumentSession comes from? Well, it comes from PersistentModule, the base class for DinnerModule, let us take a look at that:

public class PersistModule : BaseModule
{
    public IDocumentSession DocumentSession
    {
        get { return Context.Items["RavenSession"] as IDocumentSession; }
    }

    public PersistModule()
    {
    }

    public PersistModule(string modulepath)
        : base(modulepath)
    {
    }
}

And now I am confused, so we do have session per request here? It appears that we do, there is a RavenAwareModuleBuilder, which has the following code:

if (module is DinnerParty.Modules.PersistModule)
{
    context.Items.Add("RavenSession", _ravenSessionProvider.GetSession());
    //module.After.AddItemToStartOfPipeline(ctx =>
    //{
    //    var session =
    //        ctx.Items["RavenSession"] as IDocumentSession;
    //    session.SaveChanges();
    //    session.Dispose();
    //});
}

I withdraw my earlier objection. Although note that the code had at one point automatic session SaveChanges(), and now it no longer does.

Another common pet issue in the code base, there is a lot of code that is commented.

Okay, so now I have a pretty good idea how this works, let us see how they handle writes, in the Dinner case, we have another class, called DinnerModuleAuth, which is used to handle all writes.

Here is how it looks like (I chose the simplest, mind):

Post["/delete/" + Route.AnyIntAtLeastOnce("id")] = parameters =>
    {
        Dinner dinner = DocumentSession.Load<Dinner>((int)parameters.id);

        if (dinner == null)
        {
            base.Page.Title = "Nerd Dinner Not Found";
            return View["NotFound", base.Model];
        }

        if (!dinner.IsHostedBy(this.Context.CurrentUser.UserName))
        {
            base.Page.Title = "You Don't Own This Dinner";
            return View["InvalidOwner", base.Model];
        }

        DocumentSession.Delete(dinner);
        DocumentSession.SaveChanges();

        base.Page.Title = "Deleted";
        return View["Deleted", base.Model];
    };

My only critique is that I don’t understand why we would need to explicitly call SaveChanges instead.

Finally, a bit of a critique on the RavenDB usage, the application currently uses several static indexes: IndexEventDate, IndexMostPopularDinners, IndexMyDinners and IndexUserLogin.

The first three can be merged without any ill effects, I would create this, instead:

public class Dinners_Index : AbstractIndexCreationTask<Dinner>
{
    public Dinners_Index()
    {
        this.Map = dinners =>
                   from dinner in dinners
                   select new
                   {
                       RSVPs_AttendeeName = dinner.RSVPs.Select(x => x.AttendeeName),
                       RSVPs_AttendeeNameId = dinner.RSVPs.Select(x => x.AttendeeNameId),
                       HostedById = dinner.HostedById,
                       HostedBy = dinner.HostedBy,
                       DinnerID = int.Parse(dinner.Id.Substring(dinner.Id.LastIndexOf("/") + 1)),
                       Title = dinner.Title,
                       Latitude = dinner.Latitude,
                       Longitude = dinner.Longitude,
                       Description = dinner.Description,
                       EventDate = dinner.EventDate,
                       RSVPCount = dinner.RSVPs.Count,
                   };
    }
}

This serve the same exact function, but it only has one index. In general, we prefer to have bigger and fewer indexes than smaller and more numerous indexes.

time to read 2 min | 244 words

In my previous post, I mentioned that I don’t like this code, that it lacked an important *ities, and I asked what was the problem with it.

image_thumb

The answer is that it isn’t easily debuggable. Consider the case where we have a support call from a user “I am in France but I see courses that are running in Australia”.

How do you debug something like that, and how can you work with the user on that.

In this case, adding something like this is enough to make the system much nicer to work with under those scenarios:

image

This is an internal detail, never expose externally (until this blog post, obviously), but it allows us to override what the system is doing. We can actually pretend to be another user and see what is going on. Without the need of actual debugging any code. That means that you can try things out, in production. It means that you don’t have to work hard to replicate a production environment, get all of the  production data and so on.

Thinking like that is important, it drastically reduces your support burden over time.

time to read 2 min | 330 words

So, all of this have gone quite far. We have seen that we can quite easily go from having the user’s IP address to figuring out its location using our Geo Location database.

The next step is to actually do something about it. Usually, when doing geo location, you care to know what the human readable name for the user’s location is, but in our case, what we most care about is the user’s physical location, so it is lucky that the geo location database that we found also include longitude and latitude information.

With that, we can define the following index, on our events. The Longitude & Latitude information is actually calculated by the browser using the Google Geocoder API, so we just plug in the address, and the site does the rest by figuring out where on the globe we are.

This index allows us to search by spatial query as well as by time:

image

Using that, we can do:

image

First, we try to find the location of the user based on the user IP, then we are making a spatial query on top of the events, based on the user location.

What we are doing essentially is asking, “show me the next 2 future events within 200 miles of the user’s location”. Just a tiny little bit of code, but it produces this:

image

And hopefully, this will narrow things down for you to the obvious: “Of course I am going to the next RavenDB Course”!

time to read 2 min | 216 words

I can’t sleep, due to a literal doggie pile on my bed. Hence, I lay awake and ponder things like pricing strategies and other interesting topics.

image

In fact, as you can see, I put both dogs to sleep when taking about this.

The big one is a female German Shepherd, named Arava, and the smaller one is a charming fellow named Oscar.

And this is relevant to the discussion on a technical blog exactly how?

Well, as I said, Arava is a German Shepherd, and I have an upcoming RavenDB Course in Munich next month. If you go and register to the course, you can use Arava as the coupon code to get a 25% discount.

And before I forget Oscar (and hence get the biggest puppy eyes stare in the world), if you go and register to the Chicago RavenDB Bootcamp in late August, and use Oscar as the coupon code, you’ll get a 20% discount (what can I do, he is smaller).

The offer will remain open until I wake up, you have some time, I still need to find some way to get them off the bed : –).

time to read 3 min | 451 words

Now we have all of the data loaded in, we need to be able to search on it. In order to do that, we define the following index:

image

It is a very simple one, mapping the start and end of each range for each location.

The next step is actually doing the search, and this is where we run into some issues. The problem was with the data:

image

Let us take the first range and translate that to IP addresses in the format that you are probably more used to:

Start: 0.177.195.68 End: 255.177.195.68

Yep, it is little endian vs. big endian here to bite us once more.

It took me a while to figure it out, I’ll admit. In other words, we have to reverse the IP address before we can search on it properly. Thankfully, that is easily done, and we have the following masterpiece:

image

The data source that we have only support IPv4, so that is what we allow. We reverse the IP, then do a range search based on this.

Now we can use it like this:

var location = session.GetLocationByIp(IPAddress.Parse("209.85.217.172"));

Which tells us that this is a Mountain View, CA, USA address.

More importantly for our purposes, it tells us that this is located at: 37.4192, -122.0574 We will use that shortly to do spatial searches for RavenDB events near you, which I’ll discuss in my next post.

Oh, and just for fun. You might remember that in previous posts I mentioned that MaxMind (the source for this geo location information) had to create its own binary format because relational databases took several second to process each query?

The query above completed in 53 milliseconds on my machine, without any tuning on our part. And that is before we introduce caching into the mix.

time to read 7 min | 1264 words

The following are sample from the data sources that MaxMind provides for us:

image

image

The question is, how do we load them into RavenDB?

Just to give you some numbers, there are 1.87 million blocks and over 350,000 locations.

Those are big numbers, but still small enough that we can work with the entire thing in memory. I wrote a quick & ugly parsing routines for them:

public static IEnumerable<Tuple<int, IpRange>> ReadBlocks(string dir)
{
    using (var file = File.OpenRead(Path.Combine(dir, "GeoLiteCity-Blocks.csv")))
    using (var reader = new StreamReader(file))
    {
        reader.ReadLine(); // copy right
        reader.ReadLine(); // header

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var entries = line.Split(',').Select(x => x.Trim('"')).ToArray();
            yield return Tuple.Create(
                int.Parse(entries[2]),
                new IpRange
                {
                    Start = long.Parse(entries[0]),
                    End = long.Parse(entries[1]),
                });
        }
    }
}

public static IEnumerable<Tuple<int, Location>> ReadLocations(string dir)
{
    using (var file = File.OpenRead(Path.Combine(dir, "GeoLiteCity-Location.csv")))
    using (var reader = new StreamReader(file))
    {
        reader.ReadLine(); // copy right
        reader.ReadLine(); // header

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var entries = line.Split(',').Select(x => x.Trim('"')).ToArray();
            yield return Tuple.Create(
                int.Parse(entries[0]),
                new Location
                {
                    Country = NullIfEmpty(entries[1]),
                    Region = NullIfEmpty(entries[2]),
                    City = NullIfEmpty(entries[3]),
                    PostalCode = NullIfEmpty(entries[4]),
                    Latitude = double.Parse(entries[5]),
                    Longitude = double.Parse(entries[6]),
                    MetroCode = NullIfEmpty(entries[7]),
                    AreaCode = NullIfEmpty(entries[8])
                });
        }
    }
}

private static string NullIfEmpty(string s)
{
    return string.IsNullOrWhiteSpace(s) ? null : s;
}

And then it was a matter of bringing it all together:

var blocks = from blockTuple in ReadBlocks(dir)
             group blockTuple by blockTuple.Item1
             into g
             select new
             {
                 LocId = g.Key,
                 Ranges = g.Select(x => x.Item2).ToArray()
             };

var results =
    from locTuple in ReadLocations(dir)
    join block in blocks on locTuple.Item1 equals block.LocId into joined
    from joinedBlock in joined.DefaultIfEmpty()
    let _ = locTuple.Item2.Ranges = (joinedBlock == null ? new IpRange[0] : joinedBlock.Ranges)
    select locTuple.Item2;

 

The advantage of doing things this way is that we only have to write to RavenDB once, because we merged the results in memory. That is why I said that those are big results, but still small enough for us to be able to process them easily in memory.

Finally, we wrote them to RavenDB in batches of 1024 items.

The entire process took about 3 minutes and wrote 353,224 documents to RavenDB, which include all of the 1.87 million ip blocks in a format that is easy to search through.

In our next post, we will discuss actually doing searches on this information.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}