Ayende @ Rahien

Refunds available at head office

What have we been up to? And some future plans

We have been heads down for a while, doing some really cool things with RavenDB (sharding, read striping, query intersection, indexing reliability and more). But that meant that for a while, things that are not about writing code for RavenDB have been more or less on auto-pilot.

So here are some things that we are planning. We will increase the pace of RavenDB courses and conference presentations. You can track it all in the RavenDB events page.

Conferences

RavenDB Courses

NHibernate Courses

Not finalized yet

  • August 2012.
    • User groups talks in Philadelphia & Washington DC by Itamar Syn-Hershko.
    • One day boot camp for moving from being familiar with RavenDB to being a master in Chicago.
  • September 2012.
    • RavenDB Course in Austin, Texas.

Consulting Opportunities

We are also available for on site consulting in the following locations and times. Please contact us directly if you would like to arrange for one of the RavenDB core team to show up at your doorstep, or if you want me to do architecture or NHibernate consulting.

  • Oren Eini – Malmo, June 26 – 27.
  • Oren Eini – Berlin, July 2 – 4.
  • Itamar Syn-Hershko – New York, Aug 23.
  • Itamar Syn-Hershko – Chicago, Aug 30 or Sep 3.
  • Itamar Syn-Hershko – Austin, Sep 3.
  • Itamar Syn-Hershko – Toronto, Sep 9 – 10.
  • Itamar Syn-Hershko – London, Sep 11.

If you throttle me any more I am going to throttle you back!

It is interesting to note that for a long while, what we were trying to do with RavenDB was make it use fewer and fewer resources. One of the reasons for that is that using fewer resources is obviously better, because we aren’t wasting anything.

The other reason is that we have users running us on 512MB / 650MHz 32-bit Celeron machines. So we really need to be able to fit into a small box (and also leave enough processing power for the user to actually do something with the machine).

We have gotten really good at doing that, actually.

The problem is that we also have users running RavenDB on standard server hardware (32 GB / 16 cores, RAID and what not) in which case they (rightly) complain that RavenDB isn’t actually using all of their hardware.

Now, being conservative about resource usage is generally good, and we do have the configuration in place which can tell RavenDB to use more memory. It is just that this isn’t polite behavior.

RavenDB in most cases shouldn’t require anything special for you to run, we want it to be truly a zero admin database. The solution?  Take into account the system state and increase the amount of work that we do to get things done. And yes, I am aware of the pitfalls.

As long as there is enough free RAM available, we will increase the number of documents that we are going to index in a single batch. That is subject to some limits (for example, if we just created a new index on a big database, we need to make sure we aren’t trying to load it entirely into memory), and it knows how to reserve some room for other things, and how to throttle down as well as up.
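To make the idea concrete, here is a rough sketch of the shape of that logic. The names and thresholds here are illustrative only, not RavenDB’s actual code:

private int numberOfItemsToIndexInSingleBatch = 512;

private void AdjustBatchSize(long availableMemoryInMb)
{
    if (availableMemoryInMb > 1024 && numberOfItemsToIndexInSingleBatch < 128 * 1024)
    {
        // plenty of free RAM: index more documents in the next batch
        numberOfItemsToIndexInSingleBatch *= 2;
    }
    else if (availableMemoryInMb < 256)
    {
        // memory pressure: throttle back down, but keep a sane floor
        numberOfItemsToIndexInSingleBatch =
            Math.Max(512, numberOfItemsToIndexInSingleBatch / 2);
    }
}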

This post is written before I had the chance to actually test this on a production-size dataset, but I am looking forward to seeing how it works.

Update: Okay, that is encouraging, it looks like what we did just made things over 7 times faster. And this isn’t a micro benchmark, this is what happens when you throw this at a multi-GB database with full text search indexing.

Next, we need to investigate what we are going to do about multiple running indexes and how this optimization affects them. Fun!

Watch your 6, or is it your I/O? It is the I/O, yes

As I said in my previous post, I was tasked with loading 3.1 million files into RavenDB, most of them in the 1 – 2 KB range.

Well, the first thing I did had absolutely nothing to do with RavenDB, it had to do with avoiding dealing with this:

[image: a directory listing with a huge number of small files]

As you can see, that is a lot.

But when the freedb dataset is distributed, what we have is actually:

[image: a single tar.bz2 archive]

This is a tar.bz2, which we can read using the SharpZipLib library.

The really interesting thing is that reading the archive (even after adding the cost of decompressing it) is far faster than reading directly from the file system. Most file systems do badly with large numbers of small files, and at any rate, it is very hard to optimize the access pattern to a lot of small files.

However, when we are talking about something like reading a single large file? That is really easy to optimize, and it significantly reduces the cost of the input I/O.
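Here is a minimal sketch of that approach, using SharpZipLib’s BZip2 and Tar streams (the file name and the record encoding are assumptions):

// One big sequential read, decompressed on the fly, instead of millions
// of small random reads from the file system.
using (var file = File.OpenRead("freedb-complete.tar.bz2"))
using (var bzip2 = new BZip2InputStream(file))
using (var tar = new TarInputStream(bzip2))
{
    TarEntry entry;
    while ((entry = tar.GetNextEntry()) != null)
    {
        if (entry.IsDirectory)
            continue;

        var buffer = new MemoryStream();
        tar.CopyEntryContents(buffer); // reads only the current entry

        // the exact text encoding of the freedb records is assumed here
        var record = Encoding.UTF8.GetString(buffer.ToArray());
        // ... parse the record and hand it off to the import code ...
    }
}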

Just this step has reduced the cost of importing by a significant factor, we are talking about twice the speed we had before, and with a lot less disk activity.


Watch your 6, or is it your I/O?

One of the interesting things about the freedb dataset is that it is distributed as 3.1 million separate files, most of them in the 1 – 2 KB range.

Loading that into RavenDB took a while, so I set out to fix that. Care to guess what the absolute first thing that I did was?


RavenDB: Self optimizing Ids

One of the things that is really important for us in RavenDB is the notion of Safe by Default and Zero Admin. What this means is that we want to make sure that you don’t really have to think about what you are doing for the common cases; RavenDB will understand what you mean and figure out the best way to do things.

One of the cases where RavenDB does that is when we need to generate new ids. There are several ways to generate new ids in RavenDB, but the most common one, and the default, is to use the hilo algorithm. It basically (ignoring concurrency handling) works like this:

var currentMax = GetMaxIdValueFor("Disks");
var limit = currentMax + 32;
SetMaxIdValueFor("Disks", limit);

And now we can generate ids in the range of currentMax to currentMax+32, and we know that no one else can generate those ids. Perfect!

The good thing about it is that now we have a reserved range, we can create ids without going to the server. The bad thing about it is that we have now reserved a range of 32. If we create just one or two documents and then restart, we would need to request a new range, and the rest of that range would be lost. That is why the default range value is 32. It is small enough that gaps aren’t that important*, but since in most applications you create entities on an infrequent basis, and usually just one at a time, it is big enough to still provide a meaningful optimization with regards to the number of times you have to go to the server.

* What does it mean, “gaps aren’t important”? The gaps are never important to RavenDB, but people tend to be bothered when they see disks/1 and disks/2132 with nothing in the middle. Gaps are only important for humans.

So this is perfect for most scenarios. Except one very common scenario, bulk import.

When you need to load a lot of data into RavenDB, you will very quickly note that most of the time is actually spent just getting new ranges. More time than actually saving the new documents takes, in fact.

Now, this value is configurable, so you can set it to a higher value if you care to, but still, that was annoying.

Hence, what we have now. Take a look at the log below:

[image: request log showing repeated id range requests]

It details the request pattern in a typical bulk import scenario. We request an id range for disks, and then we request it again, and again, and again.

But notice what happens as time goes by (and not that much time, either): RavenDB recognizes that you need bigger ranges, and it gives them to you. In fact, very quickly we can see that we only request a single range per batch, because RavenDB has optimized itself based on our own usage pattern.
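A hypothetical sketch of what such self-tuning can look like (the real client logic differs in its details):

private long capacity = 32;
private DateTime lastRequestedUtc = DateTime.MinValue;

private long GetNextRangeSize()
{
    var now = DateTime.UtcNow;

    // If the previous range was exhausted almost immediately, we are in a
    // bulk insert scenario: double the range size, up to a sane cap.
    if (now - lastRequestedUtc < TimeSpan.FromSeconds(1))
        capacity = Math.Min(capacity * 2, 1024 * 1024);

    lastRequestedUtc = now;
    return capacity;
}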

Kinda neat, even if I say so myself.


Searching ain’t simple: solution

In my last post, I described the following problem:

[image: the problem statement, shown as a UI mockup]

And stated that the following trivial solution is the wrong approach to the problem:

select d.* from Designs d 
 join ArchitectsDesigns da on d.Id = da.DesignId
 join Architects a on da.ArchitectId = a.Id
where a.Name = @name

The most obvious reason is actually that we are thinking too linearly. I intentionally showed the problem statement in terms of UI, not in terms of a document specifying what should be done.

The reason for that is that in many cases, a spec document is making assumptions that the developer should not. When working on a system, I like to have drafts of the screens with rough ideas about what is supposed to happen, and not much more.

In this case, let us consider the problem from the point of view of the user. Searching by the architect name makes sense to the user, that is usually how they think about it.

But does it make sense from the point of view of the system? We want to provide good user experience, which means that we aren’t just going to provide the user with a text box to plug in some values. For one thing, they would have to put in the architect’s full name as it is stored in our system. That is going to be a tough call in many cases. Ask any architect what the first name of Gaudi is, and see what sort of response you’ll get.

Another problem is how to deal with misspellings, partial names, and other information. What if we actually have the architect id, and are used to typing that? I would much rather type 1831 than Mies Van Der Rohe, and most users that work with the application day in and day out would agree.

From the system perspective, we want to divide the problem into two separate issues, finding the architect and finding the appropriate designs. From a user experience perspective, that means that the text box is going to be an ajax suggest box, and the results would be loaded based on valid id.

Using RavenDB and ASP.Net MVC, we would have the following solution. First, we need to define the search index:

[image: the search index definition]
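The code in the screenshot is lost, so here is a minimal sketch of what such an index might look like; the Architect class and all names here are assumptions, not the post’s actual code:

public class Architects_Search : AbstractIndexCreationTask<Architect, Architects_Search.Result>
{
    public class Result
    {
        public string Query { get; set; }
    }

    public Architects_Search()
    {
        // index both the name and the id into a single searchable field
        Map = architects =>
              from architect in architects
              select new
              {
                  Query = new object[] { architect.Name, architect.Id }
              };

        // analyze the field so partial matches and full text search work
        Indexes.Add(x => x.Query, FieldIndexing.Analyzed);
    }
}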

This gives us the ability to search across both name and id easily, and it allows us to do full text searches as well. The next step is the actual querying for architect by name:

[image: the code for querying architects by name]

Looks complex, doesn’t it? Well, there is certainly a lot of code there, at least.

First, we look for a matching result in the index. If we find anything, we send just the name and the id of the matching documents to the user. That part is perfectly simple.

The interesting bits happen when we can’t find anything at all. In that case, we ask RavenDB to find us results that might be the things that the user is looking for. It does that by running a string distance algorithm over the data in the database already and providing us with a list of suggestions about what the user might have meant.

We take it one step further. If there is just one suggestion, we assume that this is what the user meant, and just return the results for that value. If there is more than one, we send an empty result set to the client along with a list of alternatives that they can suggest to the user.
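Since that code is also only available as a screenshot, here is a hedged sketch of the flow just described, written against my recollection of the 1.x client API (Suggest() lives in Raven.Client.Linq; all names are assumptions):

public object SearchArchitects(string name)
{
    var matches = session.Query<Architects_Search.Result, Architects_Search>()
        .Where(x => x.Query == name)
        .As<Architect>()
        .ToList();

    if (matches.Count > 0) // found something: return just names and ids
        return new
        {
            Results = matches.Select(a => new { a.Id, a.Name }),
            Suggestions = new string[0]
        };

    // nothing matched: ask RavenDB for similar terms (string distance)
    var suggestions = session.Query<Architects_Search.Result, Architects_Search>()
        .Where(x => x.Query == name)
        .Suggest();

    if (suggestions.Suggestions.Length == 1) // a single suggestion: use it
        return SearchArchitects(suggestions.Suggestions[0]);

    // several alternatives: empty results, plus the list for the user
    return new { Results = new object[0], Suggestions = suggestions.Suggestions };
}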

From here, the actual task of getting the designs for this architect becomes as simple as:

[image: the query for designs by architect id]

And it turns out that when you think about it right, searching is simple.


Searching ain’t simple

The problem statement is best described using:

[image: the problem statement, shown as a UI mockup]

This seems like a nice and easy problem, right? We join the architects table to the designs table and we are done.

select d.* from Designs d 
 join ArchitectsDesigns da on d.Id = da.DesignId
 join Architects a on da.ArchitectId = a.Id
where a.Name = @name

This is a trivial solution, and shouldn’t take a lot of time to build…

It is also the entirely wrong approach for the problem, can you tell me why?


RavenDB Sharding–Map/Reduce in a cluster

In my previous post, I introduced RavenDB Sharding and discussed how we can use sharding in RavenDB. We discussed both blind sharding and data driven sharding. Today I want to introduce another aspect of RavenDB Sharding: the use of Map/Reduce to gather information from multiple shards.

We start by defining a map/reduce index. In this case, we want to look at the invoice totals per date. We define the index like this:

public class InvoicesAmountByDate : AbstractIndexCreationTask<Invoice, InvoicesAmountByDate.ReduceResult>
{
    public class ReduceResult
    {
        public decimal Amount { get; set; }
        public DateTime IssuedAt { get; set; }
    }

    public InvoicesAmountByDate()
    {
        Map = invoices =>
              from invoice in invoices
              select new
              {
                  invoice.Amount,
                  invoice.IssuedAt
              };

        Reduce = results =>
                 from result in results
                 group result by result.IssuedAt
                 into g
                 select new
                 {
                     Amount = g.Sum(x => x.Amount),
                     IssuedAt = g.Key
                 };
    }
}

And then we execute the following code:

using (var session = documentStore.OpenSession())
{
    var asian = new Company { Name = "Company 1", Region = "Asia" };
    session.Store(asian);
    var middleEastern = new Company { Name = "Company 2", Region = "Middle-East" };
    session.Store(middleEastern);
    var american = new Company { Name = "Company 3", Region = "America" };
    session.Store(american);

    session.Store(new Invoice { CompanyId = american.Id, Amount = 3, IssuedAt = DateTime.Today.AddDays(-1)});
    session.Store(new Invoice { CompanyId = asian.Id, Amount = 5, IssuedAt = DateTime.Today.AddDays(-1) });
    session.Store(new Invoice { CompanyId = middleEastern.Id, Amount = 12, IssuedAt = DateTime.Today });
    session.SaveChanges();
}

We use three way sharding, based on the region of the company, so we actually have the following documents in three different servers:

First server, Asia:

[image: the documents on the Asia server]

Second server, Middle East:

[image: the documents on the Middle East server]

Third server, America:

[image: the documents on the America server]

Now, let us see what happens when we use the map/reduce query:

using (var session = documentStore.OpenSession())
{
    var reduceResults = session.Query<InvoicesAmountByDate.ReduceResult, InvoicesAmountByDate>()
        .ToList();

    foreach (var reduceResult in reduceResults)
    {
        string dateStr = reduceResult.IssuedAt.ToString("MMM dd, yyyy", CultureInfo.InvariantCulture);
        Console.WriteLine("{0}: {1}", dateStr, reduceResult.Amount);
    }
    Console.WriteLine();
}

As you can see, again, we make no distinction in our code about using sharding, we just query it normally. The results, however, are quite interesting:

[image: the combined map/reduce results]

As you can see, we got the correct results, cluster wide.

RavenDB was able to query all the servers in the cluster for their results, reduce them again, and get us the total across all three servers.

And that, my friends, is truly awesome.


RavenDB Sharding–Data Driven Sharding

In my previous post, I introduced RavenDB Sharding and discussed how we can use Blind Sharding to a good effect. I also mentioned that this approach is somewhat lacking, because we don’t have enough information at hand to be able to really understand what is going on. Let me show you how we can define a proper sharding function that shards your documents based on their actual data.

We are still going to run the exact same code as we have done before:

string asianId, middleEasternId, americanId;

using (var session = documentStore.OpenSession())
{
    var asian = new Company { Name = "Company 1", Region = "Asia" };
    session.Store(asian);
    var middleEastern = new Company { Name = "Company 2", Region = "Middle-East" };
    session.Store(middleEastern);
    var american = new Company { Name = "Company 3", Region = "America" };
    session.Store(american);

    asianId = asian.Id;
    americanId = american.Id;
    middleEasternId = middleEastern.Id;

    session.Store(new Invoice { CompanyId = american.Id, Amount = 3 });
    session.Store(new Invoice { CompanyId = asian.Id, Amount = 5 });
    session.Store(new Invoice { CompanyId = middleEastern.Id, Amount = 12 });
    session.SaveChanges();

}

using (var session = documentStore.OpenSession())
{
    session.Query<Company>()
        .Where(x => x.Region == "America")
        .ToList();

    session.Load<Company>(middleEasternId);

    session.Query<Invoice>()
        .Where(x => x.CompanyId == asianId)
        .ToList();
}

What is different now is how we initialize the document store:

[image: the document store initialization code]
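A reconstruction of that initialization, based on the sharding API of the time (the URLs are examples):

var shards = new Dictionary<string, IDocumentStore>
{
    {"Asia", new DocumentStore {Url = "http://localhost:8080"}},
    {"Middle-East", new DocumentStore {Url = "http://localhost:8081"}},
    {"America", new DocumentStore {Url = "http://localhost:8082"}},
};

var shardStrategy = new ShardStrategy(shards)
    .ShardingOn<Company>(company => company.Region)      // companies by region
    .ShardingOn<Invoice>(invoice => invoice.CompanyId);  // invoices follow their company

var documentStore = new ShardedDocumentStore(shardStrategy).Initialize();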

What we have done is given RavenDB the information about how our entities are structured and how we should shard them. We should shard the companies based on their regions, and the invoices based on their company id.

Let us see how the code behaves now, shall we? As before, we will analyze the output of the HTTP logs from executing this code. Here is the first server output:

[image: HTTP log for the first (Asia) shard]

As before, we can see that the first four requests are there to handle the hilo generation, and they are only there for the first server.

The 5th request is saving two documents. Note that this is the Asia server, and unlike in the previous example, we don’t get companies/1 and invoices/1 in the first shard.

Instead, we have companies/1 and invoices/2. Why is that? Well, RavenDB detected that invoices/2 belongs to a company that is associated with this shard, so it placed it in the same shard. This ensures that we have good locality and that we can utilize features such as Includes or Live Projections even when using sharding.

Another interesting aspect is that we don’t see a request for companies in the America region. Because this is what we shard on, RavenDB was able to figure out that there is no way that we will have a company in the America region in the Asia shard, so we can skip this call.

Conversely, when we need to find an invoice for an Asian company, we can see that this request gets routed to the proper shard.

Exciting, isn’t it?

Let us see what we have in the other two shards.

[image: HTTP log for the second (Middle East) shard]

In the second shard, we can see that we have just two requests, one to save two documents (again, a company and its associated invoice) and the second to load a particular company by id.

We were able to optimize all the other queries away, because we actually understand the data that you save.

And here is the final shard results:

[image: HTTP log for the third (America) shard]

Again, we got a save for the two documents, and then we can see that we routed the appropriate query to this shard, because this is the only place that can answer this question.

Data Driven Sharding For The Win!

So far we have seen how RavenDB can optimize the queries made to the shards when it has enough information to do so. But what happens when it can’t?

For example, let us say that I want to get the 2 highest value invoices. Since I didn’t specify a region, what would RavenDB do? Let us look at the code:

var topInvoices = session.Query<Invoice>()
    .OrderByDescending(x => x.Amount)
    .Take(2)
    .ToList();

foreach (var invoice in topInvoices)
{
    Console.WriteLine("{0}\t{1}", invoice.Amount, invoice.CompanyId);
}

This code outputs:

[image: the two highest value invoices]

So we were actually able to get just the two highest invoices. But what actually happened?

Shard 1 (Asia):

[image: HTTP log for shard 1 (Asia)]

Shard 2 (Middle-East):

[image: HTTP log for shard 2 (Middle-East)]

Shard 3 (America):

[image: HTTP log for shard 3 (America)]

As you can see, we actually made 3 queries, asking the same question of each of the shards. Each shard returned its own results. On the client side, we merged those results and gave you back exactly the information that you requested, across the entire cluster.


RavenDB Sharding – Blind sharding

From the get go, RavenDB was designed with sharding in mind. We had a sharding client inside RavenDB when we shipped, and it made for some really cool demos.

It also wasn’t really popular, and we didn’t implement some things for sharding. We always intended to, but we had other things to do and no one was asking for it much.

That was strange. I decided that we needed to do two major things.

  • First, to make sure that the experience of writing in a sharded environment was as close as we could get to the one you get in a non-sharded environment.
  • Second, we had to make it simple to use sharding.

Before our changes, in order to use sharding you had to do the following:

  • Set up multiple RavenDB servers.
  • Create a list of those servers' URLs.
  • Implement IShardStrategy, which exposes:
    • IShardAccessStrategy – determines how we make the calls to the servers.
    • IShardSelectionStrategy – determines which server a new instance will go to, and which server an existing instance belongs on.
    • IShardResolutionStrategy – determines which servers we should query when we are querying for data (this allows us to optimize which servers we actually hit for particular queries).

All in all, you would need to write a minimum of 3 classes, and have to write some sharding code that can be… somewhat tricky.

Oh, it works, and it is a great design. It is also complex, and it makes it harder to use sharding.

Instead, we now have the following scenario:

[image: three RavenDB servers, each running on a different port]

As you can see, here we have three different servers, each running in a different port. Let us see what we need to do to get us working with this from the client code:

[image: the client setup code]
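A hedged reconstruction of that setup (the shard names and URLs are examples):

var shards = new Dictionary<string, IDocumentStore>
{
    {"one", new DocumentStore {Url = "http://localhost:8080"}},
    {"two", new DocumentStore {Url = "http://localhost:8081"}},
    {"three", new DocumentStore {Url = "http://localhost:8082"}},
};

var documentStore = new ShardedDocumentStore(new ShardStrategy(shards))
    .Initialize();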

First, we need to define the servers (and their names), then we create a shard strategy and use that to create a sharded document store. Once that is done, we are home free, and can do pretty much whatever we want:

string asianId, middleEasternId, americanId;

using (var session = documentStore.OpenSession())
{
    var asian = new Company { Name = "Company 1", Region = "Asia" };
    session.Store(asian);
    var middleEastern = new Company { Name = "Company 2", Region = "Middle-East" };
    session.Store(middleEastern);
    var american = new Company { Name = "Company 3", Region = "America" };
    session.Store(american);

    asianId = asian.Id;
    americanId = american.Id;
    middleEasternId = middleEastern.Id;

    session.Store(new Invoice { CompanyId = american.Id, Amount = 3 });
    session.Store(new Invoice { CompanyId = asian.Id, Amount = 5 });
    session.Store(new Invoice { CompanyId = middleEastern.Id, Amount = 12 });
    session.SaveChanges();

}

using (var session = documentStore.OpenSession())
{
    session.Query<Company>()
        .Where(x => x.Region == "America")
        .ToList();

    session.Load<Company>(middleEasternId);

    session.Query<Invoice>()
        .Where(x => x.CompanyId == asianId)
        .ToList();
}

What you see here is the code that saves both companies and invoices, and does this over multiple servers. Let us see the log output for this code:

[image: HTTP log for the first shard]

You can see a few interesting things here:

  • The first four requests are to manage the document ids (hilos). By default, we use the first server as the one that will store all the hilo information.
  • Next (request #5) we are saving two documents, note that the shard id is now part of the document id.
  • Requests 6 and 7 here are actually queries; we returned 1 result for the first query, and none for the second.

Let us look at another shard now:

[image: HTTP log for the second shard]

This is much shorter, since we don’t have the hilo requests. The first request is there to store two documents, and then we see two queries, both of which return no results.

And the last shard:

[image: HTTP log for the third shard]

Here, again, we don’t see the hilo requests (since they are all on the first server). We do see the two documents being put, and request #2 is a query that returns no results.

Request #3 is interesting, because we did not see it anywhere else. Since we did a load by id, and since by default we store the shard id in the document id, we were able to optimize this operation and go directly to the relevant shard, bypassing the need to query any other server.

The last request is a query, for which we have a result.

So what did we have so far?

We were able to easily configure RavenDB to use 3-way sharding in a few lines of code. It automatically distributed writes and reads for us, and when it could, it optimized the data access so it would only access the relevant shards. Writes are distributed on a round robin basis, so it is pretty fair. And reads are optimized whenever we can figure out a minimal number of shards to query. For example, when we do a load by id, we can figure out what the shard id is and query that server directly, rather than all of them.

Pretty cool, if you ask me.

Now, you might have noticed that I called this post Blind Sharding. The reason for that name is that this is pretty much the lowest rung on the sharding ladder. It is good, it splits your data and it tries to optimize things, but it isn’t the best solution. I’ll discuss a better solution in my next post.


Composite entities

In my previous post, I discussed some of the problems that you run into when you try to have a single source of truth with regards to an entity definition. The question here is: how do we manage something like a Customer across multiple applications / modules?

For the purpose of discussion, I am going to assume that all of the data is either:

  • All sitting in the same physical database (common if we are talking about different modules in the same application).
  • Spread across multiple databases, with some data being replicated to all databases (common if we are talking about different applications).

We will focus on the customer entity as an example, and we will deal with billing and help desk modules / applications. There are some things that everyone can agree on with regards to the customer. Most often, a customer has an id, which is shared across the entire system, as well as some descriptive details, such as a name.

But even things that you would expect to be easily agreed upon aren’t really that easy. For example, what about contact information? The person handling billing at a customer is usually different from the person that we contact for help desk inquiries. And that is the stuff that we are supposed to agree on. We have much bigger problems when we have to deal with things like the customer’s payment status vs. outstanding help desk calls this month.

The way to resolve this is to forget about trying to shove everything into a single entity. Or, to be rather more exact, we need to forget about trying to think about the Customer entity as a single physical thing. Instead, we are going to have the following:

[image: the composite customer model]

There are several things to note here:

  • There is no inheritance relationship between the different aspect of a customer.
  • We don’t give in and try to put what appear to be shared properties (ContactDetails) in the root Customer. Those details have a different meaning for each entity (see the sketch below).
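Here is a hedged sketch of what that model can look like in code; the property names are assumptions based on the discussion above:

public class ContactDetails
{
    public string Name { get; set; }
    public string Email { get; set; }
}

public class Customer // the shared root: just the id and descriptive details
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class BillingCustomer // owned exclusively by the billing module
{
    public string Id { get; set; }              // e.g. "customers/1/billing"
    public string CustomerId { get; set; }
    public ContactDetails Contact { get; set; } // billing's own contact person
    public string PaymentStatus { get; set; }
}

public class HelpDeskCustomer // owned exclusively by the help desk module
{
    public string Id { get; set; }              // e.g. "customers/1/helpdesk"
    public string CustomerId { get; set; }
    public ContactDetails Contact { get; set; } // help desk's own contact person
    public int OutstandingCallsThisMonth { get; set; }
}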

There are several ways to handle actually storing this information. If we are using a single database, then we will usually have something like:

[image: the customer documents in a single database]

The advantage of that is that it makes it very easy to actually look at the entire customer entity for debugging purposes. I say for debugging specifically because for production usage, there really isn’t anything that needs to look at the entire thing; every part of the system only cares about its own details.

You can easily load the root customer document and your own customer document whenever you need to.

More to the point, because they are different physical things, that solves a lot of the problems that we had with the shared model.

Versioning is not an issue, if billing needs to make a change, they can just go ahead and change things. They don’t need to talk to anyone, because no one else is touching their data.

Concurrency is not an issue. If you make concurrent modifications to billing and help desk, that is not a problem, they are stored in two different locations. That is actually what you want, since it is perfectly all right to have those concurrent changes.

It frees us from having to get everyone’s acceptance on any change, for everything except the root document. But as you can probably guess, the amount of information that we put on the root is minimal, precisely to avoid those sorts of situations.

This is how we handle things with a shared database, but what happens when we have multiple applications, with multiple databases?

As you can expect, we are going to have one database which contains all of the definitions of the root Customer (or other entities), and from there we replicate that information to all of the other databases. Why not have them access two databases? Simple: it makes things so much harder. It is easier to have a single database to access and have replication take care of the rest.

What about updates in that scenario? Well, updates to the local part are easy, you just do that, but updates to the root customer details have to be handled differently.

The first thing to ask is whether there really is any need for any of the modules to actually update the root customer details. I can’t see any reason why you would want to do that (billing shouldn’t update the customer name, for example). But even if you have this, the way to handle it is to have a part of the system that is responsible for the root entities database, and have it do the update, from where it will replicate to all of the other databases.


There ain’t no such thing, the definitive entity definition

I was at a customer site, and we were talking about a problem they had with modeling their domain. Actually, we were discussing a proposed solution: a central and definitive definition for all of their entities, so all of the applications could use that.

I had a minor seizure upon hearing that, but after I recovered, I was able to articulate my objections to this approach.

To start with, it breaks the Single Responsibility Principle, the Open Closed Principle and the Interface Segregation Principle. It also makes versioning hard, and introduces a central place where everyone must coordinate. Think about the number of people that have to be involved whenever you make a change.

Let us take the customer as the representative entity for this discussion. We can all agree that a customer has to have a name, an email and an id. But billing also needs to know his credit card information, help desk needs to track what support contracts he has, and sales needs to know what sort of products we sold the guy, so we can sell him upgrades.

Now, would you care to be the guy who has to mediate between all of those different concerns?

And what about changes and updates? Whenever you need to make a change, you have to wait for all of those teams and applications to catch up, update and deploy their apps.

And what about actual usage? You actually don’t want the help desk system to be able to access the billing information, and you most certainly don’t want them to change anything there.

And does it matter if we have concurrent modifications to the entity by both help desk and billing?

All of those things argue very strongly against having a single source of truth about what an entity is. In my next post, I’ll discuss a solution for this problem, Composite Entities.


RavenDB session management in ASP.Net Web API

This was brought up in the mailing list, and I thought it was an interesting solution; hence, this post.

[image: the Web API controller code]

A couple of things to note here. I would actually rather use the Initialize() / Dispose() methods for this, but the problem is that at Dispose() time we don’t really have a way to know if the action threw an exception. Hence the need to capture the ExecuteAsync() operation.
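The screenshot is missing here, so as a stand-in, here is a hedged sketch of that approach against the Web API beta: open a session per request, and only call SaveChanges() if the action completed without throwing. The class and property names are assumptions:

public abstract class RavenApiController : ApiController
{
    public static IDocumentStore DocumentStore { get; set; }

    public IDocumentSession Session { get; private set; }

    public override Task<HttpResponseMessage> ExecuteAsync(
        HttpControllerContext controllerContext,
        CancellationToken cancellationToken)
    {
        Session = DocumentStore.OpenSession();

        return base.ExecuteAsync(controllerContext, cancellationToken)
            .ContinueWith(task =>
            {
                using (Session)
                {
                    // a faulted task means the action threw; skip saving
                    if (task.Status == TaskStatus.RanToCompletion)
                        Session.SaveChanges();
                }
                return task;
            }, cancellationToken)
            .Unwrap();
    }
}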

For fun, you can also use the async session as well, which will integrate very nicely into the async nature of most of the Web API.


Sharding with RavenDB Webinar

Sharding allows you to horizontally partition your database. RavenDB includes built-in support for sharding, and in this webinar we will discuss in detail how to utilize it, how it works, and how you can best use it in your own projects.

Date: Wednesday, March 28, 2012

Time: 9:00 AM - 10:00 AM PDT

After registering you will receive a confirmation email containing information about joining the Webinar.

Space is limited.
Reserve your Webinar seat now at:
https://www2.gotomeeting.com/register/729216138


The NuGet Problem

NuGet is a wonderful system, and I am very happy to be able to use and participate in it.

Unfortunately, it has a problem that I don’t know how to solve. In a word, it is a matter of granularity. With RavenDB, we currently have the following packages:

  • RavenDB
  • RavenDB.Embedded
  • RavenDB.Client

The problem is that we have some features that use F#, some features that use MVC, a debug visualizer for VS, we have… Well, I think you get the point. The problem is that if we split things too granularly, we end up with something like:

  1. RavenDB.Client.FSharp
  2. RavenDB.MvcIntegration
  3. RavenDB.DebugSupport
  4. RavenDB
  5. RavenDB.Core
  6. RavenDB.Embedded
  7. RavenDB.Client
  8. RavenDB.Sharding
  9. RavenDB.NServiceBus
  10. RavenDB.WebApiIntegration
  11. RavenDB.Etl
  12. RavenDB.Replication
  13. RavenDB.IndexReplication
  14. RavenDB.Expiration
  15. RavenDB.MoreLikeThis
  16. RavenDB.Analyzers
  17. RavenDB.Versioning
  18. RavenDB.Authorization
  19. RavenDB.OAuth
  20. RavenDB.CascadeDelete

And there are probably more.

It gets more complex because we don’t really have a good way to make decisions about which assemblies we should add to which projects.

As I said, I don’t have an answer, but I would sure appreciate suggestions.


Thoughts after using ASP.Net Web API (beta) in anger for a week

Nitpicker corner: Yes, I know about Open Rasta, NancyFX, FubuMVC and all of the other cool frameworks out there. I am not here today to talk about them. I am not interested in talking about them and why they are so much better in this post.

You might be aware that I am doing a lot of stuff at the HTTP level, owing to the fact that both RavenDB and RavenFS are REST based servers.

As such, I had to become intimately familiar with the HTTP spec, how things work at the low ASP.Net level, etc. I even had to write my own abstractions to be able to run both inside IIS and as a service. Suffice to say, I feel that I have a lot of experience in building HTTP based systems. That said, I am also approaching things from a relatively different angle than most people. I am not aiming to build a business application, I am actually building infrastructure servers.

There was a lot of buzz about the ASP.Net Web API, and I took a brief look at the demo, marked it as “nice, need to check it out some day” in my head and moved on. Then I ran into a strange problem in RavenFS. RavenFS is a sibling to RavenDB. Whereas RavenDB is a document database, RavenFS is a distributed & replicated file server for (potentially) very large files. (It is currently in beta testing, and when we are done giving it all the bells and whistles of a real product, we will show it to the world. It isn’t really important to this post.)

What is important is that I ran into a problem with RavenFS, and I felt that there was a strong likelihood that I was doing something wrong in the HTTP layer that was causing it. Despite its outward simplicity, HTTP is pretty complex when you get down to business. So I decided to see what would happen if I replaced the HTTP layer of RavenFS with ASP.Net Web API.

That means that I have been using it in anger for the last week, and here is what I think about it so far.

First, it is a beta. That is something that is important to remember, because it means that it isn’t done yet.

Second, I am talking strictly about the server API. I haven’t even touched the client API as of now.

Third, and most important. I am impressed. It is a really clean API, nice interface, well thought out and quite nice to work with.

More than that, I had to do a bunch of stuff that really isn’t trivial. And there are very few docs for it as of now. I was able to do pretty much everything I wanted by just walking the API and figuring things out on my own.

Things that I particularly liked:

  • The API guides you to do the right thing. For example, different headers have different meanings, and you can see that when you look at the different header collections. You have headers that go in the response, headers that go in the request, headers that go with the content, and so on. It really guides you to using this as you should.
  • A lot of the stuff that is usually hard is now pretty easy to do. Multi part responses, for example. Ranged requests, or proper routing.
  • I was able to plug in DI for what I was doing in a couple of minutes without really knowing anything about how things work. And I could do that by providing a single delegate, rather than implement a complex interface.
  • It provides support for self hosting, which is crucial for doing things like unit testing the server.
  • It is Async to the core.
  • I really like the ability to return a value, or a task, or a task of a value, or an HttpResponseMessage which I can customize to my heart’s content.

Overall, it just makes sense. I get how it works, and it doesn’t seem like I have to fight anything to get things done.

Please note that this is porting a major project to a completely new platform, and doing some really non trivial things in there while doing this.

Things that I didn’t like with it:

Put simply, errors. To be fair, this isn’t a complaint about the standard error handling, this works just fine. The issue is with infrastructure errors.

For example, if you try to push a 5MB request to the server, by default the request will just die. No error message, and the status code is 503 (Service Unavailable). This can be pretty frustrating to figure out, because there is nothing to tell you what the problem is, and I didn’t look at the request size at first. It just seemed that some requests worked, and some didn’t. Even after I found that it was the size that mattered, it was hard to figure out where we needed to fix that (and the answer to that is different depending on where you are running!).

Another example is using PUT or DELETE in your requests. As long as you are running in SelfHost, everything will work just fine. If you switch to IIS, you will get an error (405, Method Not Allowed), again with no idea how to fix it or why it is happening. This is something that you can fix in the config (sometimes), but it is another error with horrible usability.

Those are going to be pretty common errors, I am guessing, and any error like that is actually a road block for the users. Having an error code and nothing else thrown at you is really frustrating, and this is something that could really use a good error report, including details about how to fix the problem.

There are a bunch of other issues that I ran into (an NRE when misconfiguring the routing that was really confusing, and other stuff like that), but this is beta software, and those are things that will be fixed.

The one thing that I miss as a feature is good support for nested resources (/accounts/1/people/2/notes), which can be a great way to provide additional context for the application. What I actually want to use this for is to be able to do things like: /folders <— FoldersController.Get, Post, Put, Delete, and then have: /folders/search <— FoldersController.GetSearch, PostSearch, etc.

So I can get the routing by HTTP method even when I am doing controller and action calls.

Final thoughts: RavenFS is now completely working using this model, and I like it. It is a really nice API, it works, but most importantly, it makes sense.


Reviewing Postman

I enjoy reading code, and I decided that for a change, I want to read code that isn’t in .NET. The following is a review of Postman:

Postman is a little JavaScript library (well it's actually Coffeescript but the Cakefile handles a build for me) which is similar to a traditional pub/ sub library, just a whole lot smarter.

This is actually the first time that I have looked at CoffeeScript code (beyond a cursory glance at the tutorial a time or two). I have to say, it looks pretty neat. Take a look at the definition of a linked list:

[image: the CoffeeScript definition of a linked list]

Pretty, readable and to the point. I like that.

Then I got to a head scratching piece:

[image: code referring to postie]

There are various things that refer to postie, but it wasn’t until I got to the bottom of the code that I saw:

[image: the Postman class definition]

So I guess that postie line is actually defining a null argument, so it can be captured by the Postman class methods.

I’ll be the first to admit that I am not a JS / CoffeeScript guy, so sometimes I am a little slow to figure things out. This method gave me pause:

[image: the deliver method]

It took a while to figure out what is going on there.

The first few lines basically say, skip the first argument and capture the rest, then call all the subscriptions with the new msg.

Note that this is preserving history. So we can do something with this.

There is also an async version of this, confusingly called deliverSync.

Getting the notification is done via:

[image: the subscription method]

This is quite elegant, because it means that you don’t lose out on messages that have already been published.

I guess that you might need to worry about memory usage, but there seems to be some mechanism to sort that out too, so you can explicitly clean things out. Which works well enough, I guess, but I would probably add some sort of builtin limit on how many msgs it can hold at any one time, just to be on the safe side. I don’t actually know how you would debug a memory leak in such a system, but I am guessing it can’t be fun.

[image: a method accepting a date or a function]

This code makes my head hurt a bit, because of the ability to pass a date or a function. I would rather have an options argument here than overload the parameter. It might be that I am a bad JS / CoffeeScript coder trying to impose standards of behavior from C#, though.

All in all, this seems to be a fairly nice system, there is a test suite that is quite readable, and it is a fun codebase to read.


Taking a look at S#arp Lite–final thoughts

This is a review of the S#arp Lite project, the version from Nov 4, 2011.

This project is significantly better than the S#arp Arch project that I reviewed a while ago, but that doesn’t mean that it is good. There is a lot to like, but frankly, the insistence on again abstracting the data access behind complex base classes and repositories makes things much harder in the long run.

If you are writing an application and you find yourself writing abstractions on top of CUD operations, stop, you are doing it wrong.

I quite like S#arp approach for querying, though. You expose things directly, and if it is ugly, you just wrap it in a dedicated query object. That is how you should be handling things.

Finally, whenever possible, push things to the infrastructure. It is usually pretty good, and that is the right level for handling things like persistence, validation, etc. And no, you don’t have to write that, it is already there.

A lot of the code in the sample project was simply to manage persistence and validation (in fact, there was an entire project for that) that could be safely deleted in favor of:

public class ValidationListener : NHibernate.Event.IPreUpdateEventListener, NHibernate.Event.IPreInsertEventListener
{
    public bool OnPreUpdate(PreUpdateEvent @event)
    {
        if (!DataAnnotationsValidator.TryValidate(@event.Entity))
            throw new InvalidOperationException("Updated entity is in an invalid state");

        return false; // false = do not veto the operation
    }

    public bool OnPreInsert(PreInsertEvent @event)
    {
        if (!DataAnnotationsValidator.TryValidate(@event.Entity))
            throw new InvalidOperationException("Inserted entity is in an invalid state");

        return false; // false = do not veto the operation
    }
}

Register that with NHibernate, and it will do that validation work for you, for example. Don’t try too hard, it should be simple; if it ain’t, you are either doing something very strange or you are doing it wrong, and I am willing to bet on the latter.

To be clear, the problems that I had with the codebase were mostly with regards to the data access portions. I didn’t have any issues with the rest of the architecture.

RavenDB US Tour

Some of the dev team for RavenDB are going to be in the States in late August and the beginning of September.

Partly this is to give the 2nd New York RavenDB Course, but we are also available for onsite consulting at that time.

If you are interested, please ping me for additional details.


Taking a look at S#arp Lite– The MVC parts

This is a review of the S#arp Lite project, the version from Nov 4, 2011.

Okay, after going over all of the rest of the application, let us take a look at the parts that actually do something.

The following are from the CustomerController:

[image: CustomerController actions]

It is fairly straightforward, all in all. Of course, the problem is that it isn’t doing much. The moment that it does, we are going to run into problems. Let us move to a different controller, ProductController, and the Index action:

[image: the ProductController Index action]

Seems fine, right? Except that in the view…

[image: the view code]

As you can see, we got a Select N+1 here. I’ll admit, I actually had to spend a moment or two to look for it (hint: look for @foreach in the view, that is usually an indication of a place that requires attention).

The problem is that we really don’t have anything we can do about it. If we want to resolve this, we would have to create our own query object to completely encapsulate the query. But all we need is to just add a FetchMany and we are done, except that there is that nasty OR/M abstraction that doesn’t do much except make our life harder.
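For comparison, with the NHibernate session exposed directly, the fix is a one liner. A sketch (FetchMany comes from NHibernate.Linq; the Product / Categories names are assumptions about the sample model):

var products = session.Query<Product>()
    .FetchMany(p => p.Categories) // eagerly load the collection in the same query
    .ToList();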

Taking a look at S#arp Lite, Part III–tasks

This is a review of the S#arp Lite project, the version from Nov 4, 2011.

So far, we have gone over the general structure and the domain. Next on the list is going over the tasks project. I have no idea what this is. I would expect it to be some sort of long running, background tasks, but I haven’t checked it yet.

Unfortunately, this isn’t the case. Tasks seems to be just another name for DAL. And to make matters worse, this is a DAL on top of a Repository on top of an OR/M.

And as if that wasn’t enough to put my teeth on edge, we got some really strange things going on there. Let us see how it goes.

[image: the CudTask class]

Basically, a CudTask seems to be all about translating from the data model (I intentionally don’t call it a domain model) to the view model. I spoke about the issues with repositories many times before, so I’ll suffice with saying that this is still wasteful and serves no real purpose, and be done with it.

This TransferFormValuesTo() is a strange beast, and it took me a while to figure out what is going on here. Let us look in the parent class to figure out what is going on there.

[image: the CudTask parent class]

Let me count the things that are wrong here.

First, we have this IsTransient() method. Why do we have that? All we need to do is just call SaveOrUpdate and it will do it for us. Then the rest of the method sank in.

The way this system works, you are going to have two instances of every entity that you load (not really true, by the way, because you have leakage for references, which must cause some really interesting bugs). One instance is the one that is managed by NHibernate, dealing with lazy loading, change management, etc. The second is the value that isn’t managed by NHibernate. I assume that this is the instance that you get when you bind the entity via the action parameters.

NHibernate contains explicit support for handling that (session.Merge), and that support is there for bad applications. You shouldn’t be doing things this way. Extend the model binder so it loads the entity from NHibernate and binds to that instance directly. You wouldn’t have to worry about all of this code; it would just work.
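A hedged sketch of that alternative for ASP.Net MVC; how you get hold of the current NHibernate session is infrastructure specific, so the accessor here is a placeholder:

public class NHibernateEntityModelBinder : DefaultModelBinder
{
    protected override object CreateModel(ControllerContext controllerContext,
        ModelBindingContext bindingContext, Type modelType)
    {
        var id = bindingContext.ValueProvider.GetValue("Id");
        if (id != null && string.IsNullOrEmpty(id.AttemptedValue) == false)
        {
            // bind the form values onto the instance NHibernate is tracking
            return GetCurrentSession().Get(modelType, int.Parse(id.AttemptedValue));
        }
        return base.CreateModel(controllerContext, bindingContext, modelType);
    }

    private static ISession GetCurrentSession()
    {
        // e.g. sessionFactory.GetCurrentSession(); left out of this sketch
        throw new NotImplementedException();
    }
}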

For that matter, the same goes for validation as well, you can push that into NHibernate as a listener. So all of this code just goes away, poof!

And then there is the Delete method:

[image: the Delete method]

I mean, is there a rule that says “developers should discard error information as soon as possible, because it is useless”? I mean, the next step is to see C# code littered with things like:

catch(Exception e)
{
   delete e; // early release for the memory held by the exception
}

The one good thing that I can say about the CudTasks is that at least they are explicit about not handling reads, and reads seem to be handled properly so far (but I haven’t looked at the actual code yet).

REST and Urls

Rob Conery has been talking about REST lately, and I think he perpetuates a common misconception. In particular, in the post I referenced, he is asking about ideas for URLs for doing things like logging in, working with productions and episodes, etc.

The problem with that is that this has very little to do with REST. Now, I’ll be the first that will tell you that discussions about architectural purity bore me, and I really like the concept of nice URLs. But nice URLs are totally different from REST.

These slides do a really good job of describing what REST is and how to work with it.

It wasn’t until I was actually called to do a code review on an application written by Rob Eisenberg that I really got it. That application had a pretty simple UI (well, the UI logic was simple; the UI itself was pretty complex, but that was mostly because of the visualizations). The interesting thing is that most of the UI was completely driven by the response from the server.

What I mean by that is that when you loaded an entity, it would load the appropriate view, and use information like this:

<link method="DELETE" title="Cancel" rel="rels/cancelOrder" href="/orders/1234"/>
<link method="GET" title="Shipping Details" rel="rels/viewShipping" href="/orders/1234/shipping"/>

To generate much of the actual behavior on the client side.

The client was fairly stable, but modifying the server meant that you could get a lot more from the system.

Human readable and hackable urls are nice, sure. But they have very little to do with REST.


MSNBC.COM & RavenDB Webinar

RavenDB is in heavy use inside MSNBC.COM, running some of their most interesting assets.

On Wed, Mar 7, 2012, 9:00 AM - 10:30 AM PST, we will host the development team in our RavenDB Webinar, to learn what it is like to run RavenDB on the most popular U.S. news site.

In their own words: "we’re using RavenDB, loving it, and plan to use it much more".

In this webinar, we will talk with members of the development team responsible for that, and learn how they:

  • deal with RavenDB in production
  • manage geo distribution with RavenDB
  • rapidly make changes in both development and production.

Please register for the webinar; we have a limited number of participants, so make sure to register in advance.

And bring your own questions, we will open this up for the audience to ask their own questions.


Taking a look at S#arp Lite, Part II–the domain

This is a review of the S#arp Lite project, the version from Nov 4, 2011.

In my previous post, I looked at the general structure, but not much more. In this one, we are going to focus on the Domain project.

We start with the actual domain:

[image: the domain model]

I have only a few comments about this sort of model:

  • This is a pure CRUD model, which is good, since it is simple and easy to understand, but one does wonder where the actual business logic of the system is. It might be that there isn’t any (we are talking about a sample app, after all).
  • The few methods that are there are also about data (in this case, aggregation), and Order.GetTotal() will trigger a lazy loaded query when called, which might be a surprise to the caller.
  • Probably the worst point of this object model is that it is highly connected, which encourages people to try to walk the object graphs where they should issue a separate query instead.

Next, let us look at the queries. We have seen one example where the NHibernate low level API was hidden behind an interface, but that was explicitly called out as rare. So how does this get handled on a regular basis?

[image: the query classes]

Hm… I have some issues here with regards to the naming. I don’t like the “Find” vs. “Query” naming; I would use WhereXyz to add a filter and SelectXyz to add a transformation. It would read better when writing Linq queries. But that is about it for the domain.

One thing that I haven’t touched so far is the entities base class:

[image: the entity base class]

And its parent:

[image: ComparableObject]

I strongly support the notion of ComparableObject; this is recommended when you use NHibernate. But what is it about GetTypeSpecificSignatureProperties? What it actually does is select all the properties that have the [DomainSignature] attribute. But why would you want something like that?

Looking at the code, the Customer.FirstName and Customer.LastName have this attribute, and I really can’t understand what went on here. This seems to be selected specifically to create hard to understand and debug bugs.

Why do I say that? The ComparableObject uses properties marked with [DomainSignature] for the GetHashCode() calculation. What this means is that if you change the customer name, you change its hash code value. This hash code value is used for, among other things, finding the entity in the unit of work, so changing the customer name can cause NHibernate to lose track of it and behave in some really strange ways.
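To illustrate the failure mode (with hypothetical names): mutate a [DomainSignature] property after the entity has been placed in a hash based collection, and the collection loses track of it.

var customer = new Customer { FirstName = "Antoni", LastName = "Gaudi" };
var identityMap = new HashSet<Customer> { customer };

customer.LastName = "Gaudí"; // GetHashCode() now returns a different value

Console.WriteLine(identityMap.Contains(customer)); // False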

This is also violating one of the core principles of entities:

A thing with distinct and independent existence.

In other words, an entity doesn’t exist because of the particular values that are there for the first and last names. If those change, the customer doesn’t change. It is the same as saying that by changing the shirt I wear, I become a completely different person.

Domain Signature is something that I am completely opposed to, not only because of the implementation problems, but because it has no meaning when you start to consider what an entity is.

Next, we are going to explore tasks…

Taking a look at S#arp Lite, Part I

This is a review of the S#arp Lite project, the version from Nov 4, 2011.

I was asked to review this project a long time ago, but I never got around to it. I had some time, and I decided that I might take a look and see how it goes. I don’t like the S#arp Arch project, because it seems too complex and heavyweight for the purpose.

The project comes with a sample application, which is good, because it is easy to see how the framework is intended to be used. Unfortunately, it is yet another online store example, I am getting heartily sick of that. On the other hand, it is a fairly simple model and easy to understand, so I grok why this keeps getting chosen.

Review rule: I look at the code. If I wanted to deal with documentation, I would write some for our products. I am doing this because I find it fun to look at other people’s code. So skip any comments about “if you read the docs…”.

We start from the project structure:

[image: the project structure]

I am not sure if I like it, I don’t know if I agree that all of those splits are needed, but this is well within reasonable limits, so I am willing to let it slide on the grounds that this is personal taste more than anything else. Looking at the dependencies, we see:

[image: the project dependencies]

The Init project contains two files, which are responsible for… well, starting up, it seems. Again, I don’t see any reason why this would be a separate project, but that is about it so far.

Next in line is the NHibernateProvider project, in this case, we have the following:

[image: the NHibernateProvider project contents]

So far, I am cautiously optimistic. All of the files / folders marked with red are actually all about setting NHibernate up, not about hiding it. But then we get to the readme file, which reads in part:

This folder contains any concrete, NHibernate-specific query classes.
There should only be classes in here for any respective query *interfaces* found in
/MyStore.Domain/Queries/

This folder will usually be empty except for very exceptive cases.

This is… interesting. Can’t say whether I agree or not yet. Looking at the QueryForProductOrderSummaries, we see:

[image: QueryForProductOrderSummaries]

Note the comment: there are better ways to do it, but we demonstrate an ugly way, and how to nicely encapsulate it.

That is enough for now, I think. In the next post, I’ll touch on the actual model…