Ayende @ Rahien

Refunds available at head office

ListOfParams and other horrible things that you shouldn’t bring to RavenDB

In the mailing list, we got asked about an issue with code that looked like this:

   1: public abstract class Parameter
   2: {
   3:     public String Name { get; set; }
   4: }
   5:  
   6: public class IntArrayParameter : Parameter
   7: {
   8:     public Int32[,] Value { get; set; }
   9: }

I fixed the bug, but that was a strange thing to do, I thought. Happily, the person asking that question was actually taking part of a RavenDB course and I could sit with him and understand the whole question.

It appears that in their system, they have a lot of things like that:

  • IntParameter
  • StringParameter
  • BoolParameter
  • LongParameter

And along with that, they also have a coordinating class:

   1: public class ListOfParams
   2: {
   3:    public List<Param> Values {get;set;}
   4: }

The question was, could they keep using the same approach using RavenDB? They were quite anxious about this, since they had a need for the capabilities of this in their software.

This is why I hate Hello World questions. I could answer just the question that was asked, and that was it. But the problem is quite different.

You might have recognized it by now, what they have here is Entity Attribute Value system. A well known anti pattern for the relational database world and one of the few ways to actually get a dynamic schema in that world.

In RavenDB, you don’t need all of those things. You can just get things done. Here is the code that we wrote to replace the above monstrosity:

   1: public class Item : DynamicObject
   2: {
   3:     private Dictionary<string, object>  vals = new Dictionary<string, object>();
   4:  
   5:     public string StaticlyDefinedProp { get; set; }
   6:  
   7:     public override bool TryGetMember(GetMemberBinder binder, out object result)
   8:     {
   9:         return vals.TryGetValue(binder.Name, out result);
  10:     }
  11:  
  12:     public override bool TrySetMember(SetMemberBinder binder, object value)
  13:     {
  14:         if(binder.Name == "Id")
  15:             return false;
  16:         vals[binder.Name] = value;
  17:         return true;
  18:     }
  19:  
  20:     public override bool TrySetIndex(SetIndexBinder binder, object[] indexes, object value)
  21:     {
  22:         var key = (string) indexes[0];
  23:         if(key == "Id")
  24:             return false;
  25:         vals[key] = value;
  26:         return true;
  27:     }
  28:  
  29:     public override bool TryGetIndex(GetIndexBinder binder, object[] indexes, out object result)
  30:     {
  31:         return vals.TryGetValue((string) indexes[0], out result);
  32:     }
  33:  
  34:     public override IEnumerable<string> GetDynamicMemberNames()
  35:     {
  36:         return GetType().GetProperties().Select(x => x.Name).Concat(vals.Keys);
  37:     }
  38: }

Not only will this class handle the dynamics quite well, it also serializes  to idiomatic JSON, which means that querying that is about as easy as you can ask.

The EAV schema was created because RDBMS aren’t suitable for dynamic work, and like many other things from the RDMBS world, this problem just doesn’t exists for us in RavenDB.

Published at

Originally posted at

Comments (4)

Northwind Starter Kit Review: It is all about the services

This is a review of the Northwind Starter Kit project, this review revision 94815 from Dec 18 2011.

Okay, enough about the data access parts. Let us see take a look at a few of the other things that are going on in the application. In particular, this is supposed to be an application with…

Domain logic is implemented by means of a Domain Model, onto a layer of services adds application logic. The model is persisted by a DAL designed around the principles of the "Repository" patterns, which has been implemented in a LINQ-friendly way.

Let us try to figure this out one at a time, okay?

The only method in the domain model that have even a hint of domain logic is the CalculateTotalIncome method. Yes, you got it right, that is a method, as in singular. And that method should be replaced with a query, it has no business being on the domain model.

So let us move to the services, okay? Here are the service definitions in the entire project:

image

Look at the methods carefully. Do you see any action at all? You don’t, the entire thing is just about queries.

And queries should be simple, not abstracted away and made very hard to figure out.

The rule of the thumb is that you try hard to not abstract queries, it is operations that you try to abstract. Operations is where you usually actually find the business logic.

Northwind Starter Kit Review: From start to finishing–tracing a request

This is a review of the Northwind Starter Kit project, this review revision 94815 from Dec 18 2011.

One of the things that I repeatedly call out is the forwarding type of architecture, a simple operation that is hidden away by a large number of abstractions that serves no real purpose.

Instead of a controller, let us look at a web service, just to make things slightly different. We have the following:

image

Okay, let us dig deeper:

image

I really like the fact that this repository actually have have FindById method, which this service promptly ignores in favor of using the IQueryable<Customer> implementation. If you want to know how that is implemented, just look (using the EF Code First repository implementations, the others are fairly similar):

image

 

All in all, the entire thing only serves to make things harder to understand and maintain.

Does anyone really think that this abstraction adds anything? What is the point?!

Northwind Starter Kit Review: Refactoring to an actual read model

This is a review of the Northwind Starter Kit project, this review revision 94815 from Dec 18 2011.

In my previous post, I talked about how the CatalogController.Product(id) action was completely ridiculous in the level of abstraction that it used to do its work and promised to show how to do the same work on an actual read model in a much simpler fashion. Here is the code.

image

There are several things that you might note about this code:

  • It is all inline, so it is possible to analyze, optimize and figure out what the hell is going on easily.
  • It doesn’t got to the database to load data that it already has.
  • The code actually does something meaningful.
  • It only do one thing, and it does this elegantly.

This is actually using a read model. By, you know, reading from it, instead of abstracting it.

But there is a problem here, I hear you shout (already reaching for the pitchfork, at least let me finish this post).

Previously, we have hidden the logic of discontinued products and available products behind the following abstraction:

image

Now we are embedding this inside the query itself, what happens if this logic changes? We would now need to go everywhere we used this logic and update it.

Well, yes. Except that there are two mitigating factors.

  • This nice abstraction is never used elsewhere.
  • It is easy to create our own abstraction.

Here is an example on how to do this without adding additional layers of abstractions.

image

Which means that our action method now looks like this:

image

Simple, easy, performant, maintainable.

And it doesn’t make my head hurt or cause me to stay up until 4 AM feeling like an XKCD item.

Northwind Starter Kit Review: Data Access review thoughts

This is a review of the Northwind Starter Kit project, this review revision 94815 from Dec 18 2011.

In my last few posts, I have gone over the data access strategy used in NSK. I haven’t been impressed. In fact, I am somewhere between horrified, shocked and amused in the same way you feel when you see a clown slipping on a banana peel.  Why do I say that? Let us trace a single call from the front end all the way to the back.

The case in point, CatalogController.Product(id) action. This is something that should just display the product on the screen, so it should be fairly simple, right? Here is how it works when drawn as UML:

image

To simplify things, I decided to skip any method calls on the same objects (there are more than a few).

Let me show you how this looks like in actual code:

image

Digging deeper, we get:

image

We will deal with the first method call to CatalogServices now, which looks like:

image

I’ll skip going deeper, because this is just a layer of needless abstraction on top of Entity Framework and that is quite enough already.

Now let us deal with the second call to CatalogServices, which is actually more interesting:

image

Note the marked line? This is generating a query. This is interesting, because we have already loaded the product. There is no way of optimizing that, of course, because the architecture doesn’t let you.

Now, you need all of this just to show a single product on the screen. I mean, seriously.

You might have noticed some references to things like Read Model in the code. Which I find highly ironic. Read Models are about making the read side of things simple, not drowning the code in abstraction on top of abstraction on top of abstraction.

In my next post, I’ll show a better way to handle this scenario. A way that is actually simpler and make use an of actual read model and not infinite levels of indirection.

Northwind Starter Kit Review: Data Access and the essence of needless work, Part I

This is a review of the Northwind Starter Kit project, this review revision 94815 from Dec 18 2011.

Update: Andrea, the project’s leader has posted a reply to this series of posts.

I like to start reviewing applications from their database interactions. That it usually low level enough to tell me what is actually going on, and it is critical to the app, so a lot of thought usually goes there.

In good applications, I have hard time finding the data access code, because it isn’t there. It is in the OR/M or the server client API (in the case of RavenDB). In some applications, if they work against legacy databases or without the benefit of OR/M or against a strange data source (such as a remote web service target) may need an explicit data layer, but most don’t.

NSK actually have 5 projects dedicated solely to data access. I find this.. scary.

image

Okay, let me start outlying things in simple terms. You don’t want to do things with regards to data access the way NSK does them.

Let us explore all the ways it is broken. First, in terms of actual ROI. There is absolutely no reason to have multiple implementations with different OR/Ms. There is really not a shred of reason to do so. The OR/M is already doing the work of handling the abstraction from the database layer, and the only thing that you get is an inconsistent API, inability to actually important features and a whole lot more work that doesn’t' get you anything.

Second, there are the abstractions used:

image

I don’t like repositories, because they abstract important details about the way you work with the database. But let us give this the benefit of the doubt and look at the implementation. There is only one implementation of IRepository, which uses NHibernate.

image

As you can see, this is pretty basic stuff. You can also see that there are several methods that aren’t implemented. That is because they make no sense to a data. The reason they are there is because IRepository<T> inherits from ICollection<T>. And the reason for that is likely because of this:

Mediates between the domain and data mapping layers using a collection-like interface for accessing domain objects.

The fact that this is totally the wrong abstraction to use doesn’t enter to the design, it seems.

Note that we also violate the contract of ICollection<T>.Remove:

true if item was successfully removed from the ICollection<T>; otherwise, false. This method also returns false if item is not found in the original ICollection<T>.

There are other reasons to dislike this sort of thing, but I’ll touch on that on my next post.

Is OR/M an anti pattern?

This article thinks so, and I was asked to comment on that. I have to say that I agree with a lot in this article. It starts by laying out what an anti pattern is:

  1. It initially appears to be beneficial, but in the long term has more bad consequences than good ones
  2. An alternative solution exists that is proven and repeatable

And then goes on to list some of the problems with OR/M:

  • Inadequate abstraction - The most obvious problem with ORM as an abstraction is that it does not adequately abstract away the implementation details. The documentation of all the major ORM libraries is rife with references to SQL concepts.
  • Incorrect abstraction – …if your data is not relational, then you are adding a huge and unnecessary overhead by using SQL in the first place and then compounding the problem by adding a further abstraction layer on top of that.
    On the the other hand, if your data is relational, then your object mapping will eventually break down. SQL is about relational algebra: the output of SQL is not an object but an answer to a question.
  • Death by a thousand queries – …when you are fetching a thousand records at a time, fetching 30 columns when you only need 3 becomes a pernicious source of inefficiency. Many ORM layers are also notably bad at deducing joins, and will fall back to dozens of individual queries for related objects.

If the article was about pointing out the problems in OR/M I would have no issues in endorsing it unreservedly. Many of the problems it points out are real. They can be mitigated quite nicely by someone who knows what they are doing, but that is beside the point.

I think that I am in a pretty unique position to answer this question. I have over 7 years of being heavily involved in the NHibernate project, and I have been living & breathing OR/M for all of that time. I have also created RavenDB, a NoSQL database, that gives me a good perspective about what it means to work with a non relational store.

And like most criticisms of OR/M that I have heard over the years, this article does only half the job. It tells you what is good & bad (most bad) in OR/M, but it fails to point out something quite important.

To misquote Churchill, Object Relational Mapping is the worst form of accessing a relational database, except all of the other options when used for OLTP.

When I see people railing against the problems in OR/M, they usually point out quite correctly problems that are truly painful. But they never seem to remember all of the other problems that OR/M usually shields you from.

One alternative is to move away from Relational Databases. RavenDB and the RavenDB Client API has been specifically designed by us to overcome a lot of the limitations and pitfalls inherit to OR/M. We have been able to take advantage of all of our experience in the area and create what I consider to be a truly awesome experience.

But if you can’t move away from Relational Databases, what are the alternative? Ad hoc SQL or Stored Procedures? You want to call that better?

A better alternative might be something like Massive, which is a very thin layer over SQL. But that suffers from a whole host of other issues (no unit of work means aliasing issues, no support for eager load means better chance for SELECT N+1, no easy way to handle migrations, etc). There is a reason why OR/M have reached where they have. There are a lot of design decisions that simply cannot be made any other way without unacceptable tradeoffs.

From my perspective, that means that if you are using Relational Databases for OLTP, you are most likely best served with an OR/M. Now, if you want to move away from Relational Databases for OLTP, I would be quite happy to agree with you that this is the right move to make.

The building blocks of a database: Transactional & persistent hash table

I am working on managed storage solution for RavenDB in an off again on again fashion for a while now. The main problem Is that doing something like this is not particularly hard, but it is complex. You either have to go with a transaction log or an append only model.

There are more than enough material on the matter, so I won’t touch that. The problem is that building that is going to take time, and probably a lot of it I decided that it is better off to have something than nothing, and scaled back the requirements.

The storage has to have:

  • ACID
  • Multi threaded
  • Fully managed
  • Support both memory and files
  • Fast
  • Easy to work with

The last one is important. Go and take a look at the codebase of any of the available databases. They can be… pretty scary.

But something has to be give, so I decided that to make things easier, I am not going to implement indexing on the file system. Instead, I’ll store the data on the disk, and keep the actual index completely in memory. There is an overhead of roughly 16 bytes per key plus the key itself, let us round it to 64 bytes per held per key. Holding 10 million keys would cost ~600 MB. That sounds like a lot, because it is. But it actually not too bad. It isn’t that much memory for modern hardware. And assuming that our documents are 5 KB in size, we are talking about 50 GB for the database size, anyway.

Just to be clear, the actual data is going to be on the disk, it is the index that we keep in memory. And once we have that decision, the rest sort of follow on its own.

It is intentionally low level interface, and mostly it gives you is a Add/Read/Remove interface, but it gives you multiple tables, the ability to do key and range scans and full ACID compliance (including crash recovery).

And the fun part, it does so in 400 lines!

http://github.com/ayende/Degenerated.Storage

Tags:

Published at

Originally posted at

Comments (15)

Normalization is from the devil

The title of this post is a translation of an Arabic saying that my father quoted me throughout my childhood.

I have been teaching my NHibernate course these past few days, and I had come to realize that my approach for designing RDBMS based applications has gone a drastic change recently. I think that the difference in my view was brought home when I started getting angry about this model:

image

I mean, it is pretty much a classic, isn’t it? But what really annoyed me was that all I had to do was look at this and know just how badly this is going to end up as when someone is going to try to show an order with its details. We are going to have, at least initially, 3 + N(order lines) queries. And even though this is a classic model, loading it efficiently is actually not that trivial. I actually used this model to show several different ways of eager loading. And remember, this model is actually a highly simplified representation of what you’ll need in real projects.

I then came up with a model that I felt was much more palatable to me:

image

And looking at it, I had an interesting thought. My problem with the model started because I got annoyed by how many tables were involved in dealing with “Show Order”, but the end result also reminded me of something, Root Aggregates in DDDs. Now, since my newfound sensitivity about this has been based on my experiences with RavenDB, I found it amusing that I explicitly modeled documents in RavenDB after Root Aggregates in DDD, then went the other way (reducing queries –> Root Aggregates) with modeling in RDBMS).

The interesting part is that once you start thinking like this, you end up with a lot of additional reasons why you actually want that. (If the product price changed, it doesn’t affect the order, for example).

If you think about it, normalization in RDBMS had such a major role because storage was expensive. It made sense to try to optimize this with normalization. In essence, normalization is compressing the data, by taking the repeated patterns and substituting them with a marker. There is also another issue, when normalization came out, the applications being being were far different than the type of applications we build today. In terms of number of users, time that you had to process a single request, concurrent requests, amount of data that you had to deal with, etc.

Under those circumstances, it actually made sense to trade off read speed for storage. In today’s world? I don’t think that it hold as much.

The other major benefit of normalization, which took extra emphasis when the reduction in storage became less important as HD sizes grew, is that when you state a fact only once, you can modify it only once.

Except… there is a large set of scenarios where you don’t want to do that. Take invoices as a good example. In the case of the order model above, if you changed the product name from “Thingamajig” to “Foomiester”, that is going to be mighty confusing for me when I look at that order and have no idea what that thing was. What about the name of the customer? Think about the scenarios in which someone changes their name (marriage is most common one, probably). If a woman orders a book under her maiden name, then changes her name after she married, what is supposed to show on the order when it is displayed? If it is the new name, that person didn’t exist at the time of the order.

Obviously, there are counter examples, which I am sure the comments will be quick to point out.

But it does bear thinking about, and my default instinct to apply 3rd normal form has been muted once I realized this. I now have a whole set of additional questions that i ask about every piece of information that I deal with.

Entity != Table

I recently had a chance to work on an interesting project, doing a POC of moving from a relational model to RavenDB. And one of the most interesting hurdles along the way wasn’t technical at all, it was trying to decide what an entity is. We are so used to make the assumption that Entity == Table that we started to associate the two together. With a document database, an entity is a document, and that map much more closely to a root aggregate than to a RDMBS entity.

That gets very interesting when we start looking at tables and having to decide if they represent data that is stand alone (and therefore deserve to live is separate documents) or whatever they should be embedded in the parent document. That led to a very interesting discussion on each table. What I found remarkable is that it was partly a discussion that seem to come directly from the DDD book, about root aggregates, responsibilities and the abstract definition of an entity and partly a discussion that focused on meeting the different modeling requirement for a document database.

I think that we did a good job, but I most valued the discussion and the insight. What was most interesting to me was how right was RavenDB for the problem set, because a whole range of issues just went away when we started to move the model over.

Database assisted denormalization – Oracle edition

I decided to take a chance (installing Oracle is a big leap :-) ) and see how things match in Oracle.

I decided to run the following query:

SELECT deptno, 
       dname, 
       loc, 
       (SELECT COUNT(*) 
        FROM   emp 
        WHERE  emp.deptno = dept.deptno) AS empcount 
FROM   dept 
WHERE  deptno = 20 

Please note that I run in on a database that had (total) maybe a 100 records, so the results may be skewed.

image

Like in the SQL Server case, we need to create an index on the FK column. I did so, after which I got:

image

Then I dropped that index and create a simple view:

CREATE VIEW depswithempcount 
AS 
  SELECT deptno, 
         dname, 
         loc, 
         (SELECT COUNT(*) 
          FROM   emp 
          WHERE  emp.deptno = dept.deptno) AS empcount 
  FROM   dept 

Querying on top of that gives me the same query plan as before. Trying to create a materialized view out of this fails, because of the subquery expression, I’ll have to express the view in terms of joins, instead. Like this:

SELECT dept.deptno, 
       dname, 
       loc, 
       COUNT(*) empcount 
FROM   dept 
       LEFT JOIN emp 
         ON dept.deptno = emp.deptno 
WHERE  dept.deptno = 20 
GROUP  BY dept.deptno, 
          dname, 
          loc 

Interestingly enough, this is a different query plan than the subquery, with SQL Server, those two query exhibit identical query plans.

image

Now, to turn that into an materialized view.

CREATE materialized VIEW deptwithempcount 
AS SELECT dept.deptno, 
          dname, 
          loc, 
          COUNT(*) empcount 
   FROM   dept 
          left join emp 
            ON dept.deptno = emp.deptno 
   GROUP  BY dept.deptno, 
             dname, 
             loc 

And querying on this gives us very interesting results:

select * from deptwithempcount 
where deptno = 20

image

Unlike SQL Server, we can see that Oracle is reading everything from the view. But let us try one more thing, before we conclude this with a victory.

update emp 
set deptno = 10
where deptno = 20;

select * from deptwithempcount 
where deptno = 20

But now, when we re-run the materialized view query, we see the results as they were at the creation of the view.

There appears to be a set of options to control that, but the one that I want (RERESH FAST), which update the view as soon as data changes will not work with this query, since it consider it too complex. I didn’t investigate too deeply, but it seems that this is another dead end.

Tags:

Published at

Originally posted at

Comments (11)

Database assisted denormalization

Let us say that I have the homepage of the application, where we display Blogs with their Post count, using the following query:

select 
    dbo.Blogs.Id, 
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle,
    (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) as PostCount
 from dbo.Blogs 

Given what I think thoughts of denormalization, and read vs. write costs, it seems a little wasteful to run the aggregate all the time.

I can always add a PostCount property to the Blogs table, but that would require me to manage that myself, and I thought that I might see whatever the database can do it for me.

This isn’t a conclusive post, it details what I tried, and what I think is happening, but it isn’t the end all be all. Moreover, I run my tests on SQL Server 2008 R2 only, not on anything else. I would like to hear what you think of this.

My first thought was to create this as a persisted computed column:

ALTER TABLE Blogs
ADD PostCount AS (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) PERSISTED

But you can’t create computed columns that uses subqueries. I would understand easier why not if it was only for persisted computed columns, because that would give the database a hell of time figuring out when that computed column needs to be updated, but I am actually surprised that normal computed columns aren’t supporting subqueries.

Given that my first attempt failed, I decided to try to create a materialized view for the data that I needed. Materialized views in SQL Server are called indexed views, There are several things to note here. You can’t use subqueries here either (likely because the DB couldn’t figure which row in the index to update if you were using subqueries), but have to use joins.

I created a data set of 1,048,576 rows in the blogs table and 20,971,520 posts, which I think should be enough to give me real data.

Then, I issued the following query:

select 
        dbo.Blogs.Id, 
        dbo.Blogs.Title,
        dbo.Blogs.Subtitle,
        count_big(*) as PostCount
from dbo.Blogs left join dbo.Posts
        on dbo.Blogs.Id = dbo.Posts.BlogId
where dbo.Blogs.Id = 365819
group by dbo.Blogs.Id,
        dbo.Blogs.Title,
        dbo.Blogs.Subtitle

This is before I created anything, just to give me some idea about what kind of performance (and query plan) I can expect.

Query duration: 13 seconds.

And the execution plan:

image

The suggest indexes feature is one of the best reasons to move to SSMS 2008, in my opinion.

Following the suggestion, I created:

CREATE NONCLUSTERED INDEX [IDX_Posts_ByBlogID]
ON [dbo].[Posts] ([BlogId])

And then I reissued the query. It completed in 0 seconds with the following execution plan:

image

After building Raven, I have a much better understanding of how databases operate internally, and I can completely follow how that introduction of this index can completely change the game for this query.

Just to point out, the results of this query is:

Id          Title                 Subtitle               PostCount
----------- --------------------- ---------------------- --------------------
365819      The lazy blog         hibernating in summer  1310720

I decided to see what using a view (and then indexed view) will give me. I dropped the IDX_Posts_ByBlogID index and created the following view:

CREATE VIEW BlogsWithPostCount 
WITH SCHEMABINDING
AS 
select 
    dbo.Blogs.Id, 
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle,
    count_big(*) as PostCount
 from dbo.Blogs join dbo.Posts
    on dbo.Blogs.Id = dbo.Posts.BlogId
 group by dbo.Blogs.Id,
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle

After which I issued the following query:

select 
        Id, 
        Title,
        Subtitle,
        PostCount
from BlogsWithPostCount
where Id = 365819

This had the exact same behavior as the first query (13 seconds and the suggestion for adding the index).

I then added the following index to the view:

CREATE UNIQUE CLUSTERED INDEX IDX_BlogsWithPostCount
ON BlogsWithPostCount (Id)

And then reissued the same query on the view. It had absolutely no affect on the query (13 seconds and the suggestion for adding the index). This make sense, if you understand how the database is actually treating this.

The database just created an index on the results of the view, but it only indexed the columns that we told it about, which means that is still need to compute the PostCount. To make things more interesting, you can’t add the PostCount to the index (thus saving the need to recalculate it).

Some points that are worth talking about:

  • Adding IDX_Posts_ByBlogID index resulted in a significant speed increase
  • There doesn’t seem to be a good way to perform materialization of the query in the database (this applies to SQL Server only, mind you, maybe Oracle does better here, I am not sure).

In other words, the best solution that I have for this is to either accept the cost per read on the RDBMS and mitigate that with proper indexes or create a PostCount column in the Blogs table and manage that yourself. I would like your critique on my attempt, and additional information about whatever what I am trying to do is possible in other RDMBS.

Playing with Entity Framework Code Only

After making EF Prof work with EF Code Only, I decided that I might take a look at how Code Only actually work from the perspective of the application developer. I am working on my own solution based on the following posts:

But since I don’t like to just read things, and I hate walkthroughs, I decided to take that into a slightly different path. In order to do that, I decided to set myself the following goal:

image

  • Create a ToDo application with the following entities:
    • User
    • Actions (inheritance)
      • ToDo
      • Reminder
  • Query for:
    • All actions for user
    • All reminders for today across all users

That isn’t really a complex system, but my intention is to get to grips with how things work. And see how much friction I encounter along the way.

We start by referencing “Microsoft.Data.Entity.Ctp” & “System.Data.Entity”

There appears to be a wide range of options to define how entities should be mapped. This include building them using a fluent interface, creating map classes or auto mapping. All in all, the code shows a remarkable similarity to Fluent NHibernate, in spirit if not in actual API.

I don’t like some of the API:

  • HasRequired and HasKey for example, seems to be awkwardly named to me, especially when they are used as part of a fluent sentence. I have long advocated avoiding the attempt to create real sentences in a fluent API (StructureMap was probably the worst in this regard). Dropping the Has prefix would be just as understandable, and look better, IMO.
  • Why do we have both IsRequired and HasRequired? The previous comment apply, with the addition that having two similarly named methods that appears to be doing the same thing is probably not a good idea.

But aside from that, it appears very nice.

ObjectContext vs. DbContext

I am not sure why there are two of them, but I have a very big dislike of ObjectContext, the amount of code that you have to write to make it work is just ridiculous, when you compare that to the amount of code you have to write for DbContext.

I also strongly dislike the need to pass a DbConnection to the ObjectContext. The actual management of the connection is not within the scope of the application developer. That is within the scope of the infrastructure. Messing with DbConnection in application code should be left to very special circumstances and require swearing an oath of nonmaleficence. The DbContext doesn’t require that, so that is another thing that is in favor of it.

Using the DbContext is nice:

public class ToDoContext : DbContext
{
    private static readonly DbModel model;

    static ToDoContext()
    {
        var modelBuilder = new ModelBuilder();
        modelBuilder.DiscoverEntitiesFromContext(typeof(ToDoContext));
        modelBuilder.Entity<User>().HasKey(x => x.Username);
        model = modelBuilder.CreateModel();
    }

    public ToDoContext():base(model)
    {
        
    }

    public DbSet<Action> Actions { get; set; }

    public DbSet<User> Users { get; set; }
}

Note that we can mix & match the configuration styles, some are auto mapped, some are explicitly stated. It appears that if you fully follow the builtin conventions, you don’t even need ModelBuilder, as that will be build for you automatically.

Let us try to run things:

using(var ctx = new ToDoContext())
{
    ctx.Users.ToList();
}

The connection string is specified in the app.config, by defining a connection string with the name of the context.

Then I just run it, without creating a database. I expected it to fail, but it didn’t. Instead, it created the following schema:

image

That is a problem, DDL should never run as an implicit step. I couldn’t figure out how to disable that, though (but I didn’t look too hard). To be fair, this looks like it will only run if the database doesn’t exists (not only if the tables aren’t there). But I would still make this an explicit step.

The result of running the code is:

image

Now the time came to try executing my queries:

var actionsForUser = 
    (
        from action in ctx.Actions
        where action.User.Username == "Ayende"
        select action
    )
    .ToList();

var remindersForToday =
    (
        from reminder in ctx.Actions.OfType<Reminder>()
        where reminder.Date == DateTime.Today
        select reminder
    )
    .ToList();

Which resulted in:

image

That has been a pretty brief overview of Entity Framework Code Only, but I am impressed, the whole process has been remarkably friction free, and the time to go from nothing to a working model has been extremely short.

Data access is contextual, a generic approach will fail

In my previous post, I discussed the following scenario, a smart client application for managing Machines, Parts & Maintenance History:

image_thumb

From the client perspective, Machine,Parts & Maintenance History where all part of the same entity, and they wanted to be able to be able return that to the user in a single call. They run into several problems in doing that, mostly, from what I could see, because they tried to do something that I strongly discourage, sending entities on the wire so they could be displayed on the smart client application. The client uses CSLA, so the actual business logic is sitting in Factory objects (I discussed my objection to that practice here), but the entities are still doing triple duty here. They need to be saved to the database, serve as DTOs on the wire and also handle being presented in the UI.

When I objected to that practice, I was asked: “Are we supposed to have three duplicates of everything in the application? One for the database, one for the wire and one for the UI? That is insane!”

The answer to this question is:

This is the symbol for Mu, which is the answer you give when the question is meaningless.

The problem with the question as posed is that it contains an invalid assumption. The assumption is that the three separate models in the application are identical to one another. They aren’t, not by a long shot.

Let me see if I can explain it better with an example. Given the Machine –>> Part –>> Maintenance History model, and an application to manage those as our backdrop. We will take two sample scenarios to discuss, the first would be to record replacing a part in a machine, and the second would be looking at a list of machines that need fixing.

For the first scenario, we need to show the user the Maintenance screen for the machine, which looks like this:

image

And on the second one, machines requiring fixes, we have:

image

Now, both of those screens are driven by data in the Machine –>> Part –>> Maintenance History model. But they have very different data requirements.

In the first case, we need the Machine.Name, Machine.NextMaintenanceDate, and a list of Part.SKU, Part.Name, Part.Broken, Part.Notes.

In the second case, we need Machine.Owner, Machine.Car, Machine.Age, Machine.Preferred.

The data set that they require is completely different, not only that, but sending the full object graph to each of those is a waste. If you noticed, no one actually even needed Maintenance History for any of those screens. It is likely that Maintenance History will be only needed rarely, but by saying that there is only one model that needs to serve all those different scenarios, we are tying ourselves into knots, because we need to serve Maintenance History every time that we are asked for a machine, even though it is not needed.

The way I would design the system, I would have:

Entities: Machine, Part, Maintenance History

This is the full data model that we use. Note that I am not discussing issues like CQRS here, because the application doesn’t warrant it. A single model for everything would work well here.

DTO: MachineHeaderDTO, MachineAndPartsDTO, PartAndMaintenanceHistoryDTO, etc.

We have specific DTOs for each scenario, only giving us the data that we need, so we can spare a lot of effort in the server side in deciding how to fill the data.

View Models

The view models would be built on top of the DTOs, usually providing some additional services like INotifyPropertyChanged, calculated properties, client side validation, etc.

I am in favor of each screen having its own View Models, because that lead to a self contained application, but I wouldn’t have a problem with common view models shared among the entire application.

Sample application

And because I know you’ll want to see some code, the Alexandria sample application demonstrate those concepts quite well: http://github.com/ayende/alexandria

Microsoft.Data, because the 90s were so good, we want to do them again

I just saw this post, and I had to double check the date of the post twice to be sure that I am reading about something that is going to come out soon, rather than something that have come out in 2002.

This is the code that is being proudly presented as an achievement:

using (var db = Database.Open("Northwind")) {
    foreach (var product in db.Query("select * from products where UnitsInStock < 20")) {
        Response.Write(product.ProductName + " " + product.UnitsInStock);
    }
}

Allow me to give you the exact statements used:

The user doesn’t have to learn about connection strings or how to create a command with a connection and then use a reader to get the results. Also, the above code is tied to Sql Server since we’re using specific implementations of the connection, command, and reader(SqlConnection, SqlCommand, SqlDataReader).

Compare this with code below it. We’ve reduced the amount of lines required to connect to the database, and the syntax for accessing columns is also a lot nicer, that’s because we’re taking advantage of C#’s new dynamic feature.

I really don’t know where to start. Yes, compared to raw ADO.Net I guess that this is improvement. But being beaten only thrice a week instead of daily is also an improvement.

I mean, seriously, are you freaking kidding me? Are you telling me that you are aiming to make the life of people writing code like this easier?

We have direct use of Response.Write – because it is so hard to create maintainable code, we decided to just give up from the get go. Beyond that, putting SQL directly in the code like that, including the parameters, is an open invitation for SQL injection.

Next, let us talk seriously, okay. Anyone who had to write ADO.Net code already wrapped it up. Anyone who didn’t isn’t going to.

But given all the times that we have heard “no resources to fix XYZ” from Microsoft, are you telling me that creating a framework that is intended for really bad programmers to slide down the maintainability scale more easily? Is that a valuable use of resources?

Writing code like that is actively harmful, period. There is really no need to deal with those low level stuff in this day and age.

Tags:

Published at

Originally posted at

Comments (84)

The false myth of encapsulating data access in the DAL

This is a question that I get routinely, both from random strangers and when I am at clients.

I would like to design a system/application using NHibernate. But I also want to so flexible that in future ,if I unplug the NHibernate and use ADO.NET Entity framework or other framework then my application should not
crash.

In short, I am completely opposed for even trying doing something like that.

It is based on flawed assumptions

A lot of the drive behind this is based on the historical drive built in the time where data access layers directly accessed a database using its own dialect, resulting in the need to create just such an encapsulation in order to support multiple databases.

The issue with this drive is that it is no longer a factor, all modern OR/Ms can handle multiple databases effectively. Moreover, modern OR/M are no longer just ways to execute some SQL and get a result back, which is how old style DAL were written. An OR/M takes on a lot more responsibilities, from things like change tracking to cache management, from ensuring optimistic concurrency  to managing optimal communication with the database.

And those features matter, a lot. Not only that, but they are different between each OR/M.

It doesn’t work, and you’ll find that out too late

The main problem is that no matter how hard you try, there are going to be subtle and not so subtle differences between different OR/Ms, those changes can drastically affect how you build your application.

Here are a few examples, using NHibernate and EF.

Feature NHibernate Entity Framework
Futures Yes No
Batching Yes No
Transaction handling Requires explicit code Implicitly handled
Caching 1st & 2nd level caching 1st level caching only

This isn’t intended to be a NH vs. EF, and it doesn’t even pretend to be unbiased, I am simply pointing out a few examples of features that you can take advantage of which can greatly benefit you in one situation, which do not exists in another.

It has a high cost

In order to facilitate this data access encapsulation, you have to do one of two things:

  • Use the lowest common denominator, preventing you from using the real benefits of the OR/M in question.
  • Bleed those features through the DAL, allowing you to make use of those features, but preventing you from switching at a later time.

Either of those add complexity, reduce flexibility and creates confusion down the road. And in general, it still doesn’t work.

There are other barriers than the API

Here is an example from a real client, which insists on creating this encapsulation and hiding NHibernate inside their DAL. They run into the previously mentioned problems, where there are NHibernate features specifically designed to solve some of their problems, but which they have hard time to implement through their DAL.

Worse, from the migration perspective, most of the barrier for moving from NHibernate isn’t in the API. Their entity model make a heavy use on NHibernate’s <any/> feature, which large percentage other OR/Ms do not support. And that is merely the example that spring most forcibly to mind, there are others.

The real world doesn’t support it, even for the simplest scenarios

A while ago I tried porting the NerdDinner application to NHibernate. Just to point it out, that application have a single entity, and was designed with a nice encapsulation between the data access and the rest of the code. In order to make the port, I had to modify significant parts of the codebase. And that is about the simplest example that can be.

The role of encapsulation

Now, I know that some people would read this as an attack of encapsulation of data access, but that isn’t the case. By all mean, encapsulate to your heart’s content. But the purpose of this encapsulation is important. Trying to encapsulate to make things easier to work with, great. Trying to encapsulate so that you can switch OR/Ms? Won’t work, will be costly and painful.

So how do you move between OR/Ms?

There are reasons why some people want to move from one data access technology to the other. I was involved in several such efforts, and the approach that we used in each of those cases was porting, rather than trying to drop a new IDaataAccess implementation.

Table scans, index scans and index seeks, on my!

In general, when you break it down to the fundamentals, a data base is just a fancy linked list + some btrees. Yes, I am ignoring a lot, but if you push aside a lot of the abstraction, that is what is actually going on.

If you ever dealt with database optimizations you are familiar with query plans, like the following (from NHProf):

You can see that we have some interesting stuff going on here:

  • Index Seek
  • Index Scan

And if you are unlucky, you are probably familiar with the dreaded “your !@&!@ query is causing a table scan!” scream from the DBA. But most of the time, people just know that table scan is slow, index scan is fast and index seek is fastest. I am ignoring things like clustered vs. unclustered indexes, since they aren’t really important for what I want to do.

For simplicity sake, I’ll use the following in memory data structure:

public class DocumentDatabase
{
    public List<JObject> Documents = new ...;
    public IDictionary<string, IDictionary<JToken, JObject>> Indexes = new ...; 
}

To keep things simple, we will only bother with the case of exact matching. For example, I might store the following document:

{ "User": "ayende", "Name": "Ayende Rahien", "Email": "Ayende@ayende.com" }

And define an index on Name & Email. What would happen if I wanted to make a query by the user name?

Well, we don’t really have any other option, we have to do what amounts to a full table scan:

foreach (var doc in Documents)
{
    if(doc.User == "ayende")
          yield return doc;
}

As you can imagine, this is an O(N) operation, which can get pretty expensive if we are querying a large table.

What happen if I want to find data by name & email? We have an index that is perfectly suited for that, so we might as well use it:

Indexes["NameAndEmail"][new{Name="Ayende Rahien", Email = “Ayende@ayende.com”}];

Note that what we are doing here is accessing the NameAndEmail index, and then making a query on that. This is essentially an index seek.

What happens if I want to query just by email? There isn’t an index just for emails, but we do have an index that contains emails. We have two options, use a table scan, or and index scan. We already saw what a table scan is, so let us look at what is an index scan:

var nameAndEmailIndex = Indexes["NameAndEmail"];
foreach (var indexed in nameAndEmailIndex)
{
   if(indexed.Email == "ayende@ayende.com")
             yield return indexed;
}

All in all, it is very similar to the table scan, and when using in memory data structures, it is probably not worth doing index scans (at least, not if the index is over the entire data set).

Where index scans prove to be very useful is when we are talking about persistent data sets, where reading the data from the index may be more efficient than reading it from the table. That is usually because the index is much smaller than the actual table. In certain databases, the format of the data on the disk may make it as efficient to do a table scan in some situations as it to do an index scan.

Another thing to note is that while I am using hashes to demonstrate the principal, in practice, most persistent data stores are going to use some variant of trees.

Building data store – indexing data structure

I run into an interestingly annoying problem recently. Basically, I am trying to write the following code:

tree.Add(new { One = 1, Two = 2 }, 13);
tree.Add(new { One = 2, Two = 3 }, 14); tree.Add(new { One = 3, Two = 1 }, 15); var docPos = tree.Find(new { One = 1, Two = 2 }); Assert.Equal(13, docPos); docPos = tree.Find(new { Two = 2 }); Assert.Equal(13, docPos); docPos = tree.Find(new { Two = 1 }); Assert.Equal(14, docPos);

As you can imagine, this is part of an indexing approach, the details of which aren’t really important. What is important is that I am trying to figure out how to support partial searches. In the example, we index by One & Two, and we can search on both of them. The problem begins when we want to make a search on just Two.

While the tree can compare between partial results just fine, the problem is how to avoid traversing the entire tree for a partial result. The BTree is structured like this:

image

The problem when doing a partial search is that at the root, I have no idea if I should turn right or left.

What I am thinking now is that since I can’t do a binary search, I’ll have to use a BTree+ instead. Since BTree+ also have the property that the leaf nodes are a linked list, it means that I can scan it effectively. I am hoping for a better option, though.

Any ideas?

Building data stores – Append Only

One of the interesting aspects in building a data store is that you run head on into things that you would generally leave to the infrastructure. By far, most developers deal with concurrency by relegating that responsibility to a database.

When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:

  • Maximizing Concurrency – does readers wait for writers? does writers wait for readers? does writers wait for writers?
  • Ensuring Consistency – can I read uncommitted data? can I read partially written data?

As I mentioned in my previous post, there are two major options when building a data store, Transaction Log & Append Only. There are probably a better name for each, but that is how I know them.

This post is going to focus on append only. An append only store is very simple idea in both concept and implementation. It requires that you will always append to the file. It makes things a bit finicky with the type of data structures that you have to use, since typical persistent data structures rely on being able to modify data on the disk. But once you get over that issue, it is actually very simple.

An append only store works in the following manner:

  • On startup, the data is read in reverse, trying to find the last committed transaction.
  • That committed transaction contains pointers to locations in the file where the actual data is stored.
  • A crash in the middle of a write just means garbage at the end of the file that you have to skip when finding the last committed transaction.
  • In memory, the only thing that you have to keep is just the last committed transaction.
  • A reader with a copy of the last committed transaction can execute independently of any other reader / writer. It will not see any changes made by writers made after it started, but it also doesn’t have to wait for any writers.
  • Concurrency control is simple:
    • Readers don’t wait for readers
    • Readers don’t wait for writers
    • Writers don’t wait for readers
    • There can be only one concurrent writer

The last one is a natural fallout from the fact that we use the append only model. Only one thread can write to the end of the file at a given point in time. That is actually a performance boost, and not something that would slow the database down, as you might expect.

The reason for that is pretty obvious, once you start thinking about it. Writing to disk is a physical action, and the head can be only in a single place at any given point in time. By ensuring that all writes go to the end of the file, we gain a big perf advantage since we don’t do any seeks.

Building data stores – Transaction log

One of the interesting aspects in building a data store is that you run head on into things that you would generally leave to the infrastructure. By far, most developers deal with concurrency by relegating that responsibility to a database.

When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:

  • Maximizing Concurrency – does readers wait for writers? does writers wait for readers? does writers wait for writers?
  • Ensuring Consistency – can I read uncommitted data? can I read partially written data?

As I mentioned in my previous post, there are two major options when building a data store, Transaction Log & Append Only. There are probably a better name for each, but that is how I know them.

This post is going to focus on transaction log. Transaction log is actually pretty simple idea, conceptually. It simply requires that you would state what you intend to do before you do it, in such a way that you can reverse it.

For example, let us say that I want to store “users/ayende” –> "ayende@ayende.com”. All I need to do is to write the following to the transaction log.

{
   "Key": "users/ayende",
   "OldVal": "AYENDE@AYENDE.COM",
   "NewVal": "ayende@ayende.com",
   "TxId": 19474774
}

If the data store crashes before the transaction is completed, we can run a recovery process that would resolve any issues in the actual data from the transaction log. Once a transaction is committed, we can safely delete it from the transaction log.

As I said, conceptually it is a very simple idea, but it leads to some interesting implementation challenges:

  • You can optimize things by not writing to disk (except for writing to the transaction log) immediately.
  • You need to keep track of concurrent transactions touching the same records.
  • You need to handle background flushing to disk.
  • The crash recovery process can be.. finicky to write.

Concurrency control is something that you essentially have to roll on your own, and you can make it as granular as you feel like. There is some complexity involved in ensuring that you read the appropriate data from the transaction log / data file (based on whatever you are in a transaction reading data you modified or outside it, reading the old data), but where it gets really complex is the number of moving parts that you have to deal with.

Tags:

Published at

That No SQL Thing: Column (Family) Databases

Column family databases are probably most known because of Google’s BigTable implementation. The are very similar on the surface to relational database, but they are actually quite different beast. Some of the difference is storing data by rows (relational) vs. storing data by columns (column family databases). But a lot of the difference is conceptual in nature. You can’t apply the same sort of solutions that you used in a relational form to a column database.

That is because column databases are not relational, for that matter, they don’t even have what a RDBMS advocate would recognize as tables.

Nitpicker corner: this post is about the concept, I am going to ignore actual implementation details where they don’t illustrate the actual concepts.

Note: If you want more information, I highly recommend this post, explaining about data modeling in a column database.

The following concepts are critical to understand how column databases work:

  • Column family
  • Super columns
  • Column

Columns and super columns in a column database are spare, meaning that they take exactly 0 bytes if they don’t have a value in them. Column families are the nearest thing that we have for a table, since they are about the only thing that you need to define upfront. Unlike a table, however, the only thing that you define in a column family is the name and the key sort options (there is no schema).

Personally, I think that column family databases are probably the best proof of leaky abstractions. Just about everything in CFDB (as I’ll call them from now on) is based around the idea of exposing the actual physical model to the users so they can make efficient use of that.

  • Column families – A column family is how the data is stored on the disk. All the data in a single column family will sit in the same file (actually, set of files, but that is close enough). A column family can contain super columns or columns.
  • A super column is a dictionary, it is a column that contains other columns (but not other super columns).
  • A column is a tuple of name, value and timestamp (I’ll ignore the timestamp and treat it as a key/value pair from now on).

It is important to understand that when schema design in a CFDB is of outmost importance, if you don’t build your schema right, you literally can’t get the data out. CFDB usually offer one of two forms of queries, by key or by key range. This make sense, since a CFDB is meant to be distributed, and the key determine where the actual physical data would be located. This is because the data is stored based on the sort order of the column family, and you have no real way of changing the sorting (except choosing between ascending or descending).

The sort order, unlike in a relational database, isn’t affected by the columns values, but by the column names.

Let assume that in the Users column family, in the row “@ayende”, we have the column “name” set to “Ayende Rahine” and the column “location” set to “Israel”. The CFDB will physically sort them like this in the Users column family file:

@ayende/location = “Israel”
@ayende/name = “Ayende Rahien”

This is because the sort “location” is lower than “name”. If we had a super column involved, for example, in the Friends column family, and the user “@ayende” had two friends, they would be physically stored like this in the Friends column family file:

@ayende/friends/arava= 945
@ayende/friends/rose = 14

Remember that, this property is quite important to understanding how things work in a CFDB. Let us imagine the twitter model, as our example. We need to store: users and tweets. We define three column families:

  • Users – sorted by UTF8
  • Tweets – sorted by Sequential Guid
  • UsersTweets – super column family, sorted by Sequential Guid

  Let us create the user (a note about the notation: I am using named parameters to denote column’s name & value here. The key parameter is the row key, and the column family is Users):

cfdb.Users.Insert(key: “@ayende”, name: “Ayende Rahine”, location: “Israel”, profession: “Wizard”);

You can see a visualization of how below. Note that this doesn’t look at all like how we would typically visualize a row in a relational database.

image

Now let us create a tweet:

var firstTweetKey = “Tweets/” + SequentialGuid.Create();
cfdb.Tweets.Insert(key: firstTweetKey, application: “TweekDeck”, text: “Err, is this on?”, private: true);

var secondTweetKey = “Tweets/” + SequentialGuid.Create();
cfdb.Tweets.Insert(key: secondTweetKey, app: “Twhirl”, version: “1.2”, text: “Well, I guess this is my mandatory hello world”, public: true);

And here is how it actually looks:

image

There are several things to notice here:

  • In this case, the key doesn’t matter, but it does matter that it is sequential, because that will allow us to sort of it later.
  • Both rows have different data columns on them.
  • We don’t actually have any way to associate a user to a tweet.

That last bears some talking about. In a relational database, we would define a column called UserId, and that would give us the ability to link back to the user. Moreover, a relational will allow us to query the tweets by the user id, letting us get the user’s tweets. A CFDB doesn’t give us this option, there is no way to query by column value. For that matter, there is no way to query by column (which is a familiar trick if you are using something like Lucene).

Instead, the only thing that a CFDB gives us is a query by key. In order to answer that question, we need the UsersTweets column family:

cfdb.UsersTweets.Insert(key: “@ayende”,  timeline: { SequentialGuid.Create(): firstTweetKey } );
cfdb.UsersTweets.Insert(key: “@ayende”,  timeline: { SequentialGuid.Create(): secondTweetKey } ); 

On the CFDB, it looks like this:

image

And now we need more explanation about the notation. Here we insert into the UsersTweets column family, to the row with the key: “@ayende”, to the super column timeline two columns, the name of each column is a sequential guid, which means that we can sort by it. What this actually does is create a single row with a single super column, holding two columns, where each column name is a guid, and the value of each column is the key of a row in the Tweets table.

Question: Couldn’t we create a super column in the Users’ column family to store the relationship? Well, yes, we could, but a column family can contain either columns or super columns, it cannot contain both.

Now, in order to get tweets for a user, we need to execute:

var tweetIds = 
      cfdb.UsersTweets.Get(“@ayende”)
.Fetch(“timeline”) .Take(25)
.OrderByDescending() .Select(x=>x.Value); var tweets = cfdb.Tweets.Get(tweetIds);

In essence, we execute two queries, one on the UsersTweets column family, requesting the columns & values in the “timeline” super column in the row keyed “@ayende”, then execute another query against the Tweets column family to get the actual tweets.

Because the data is sorted by the column name, and because we choose to sort in descending order, we get the last 25 tweets for this user.

What would happen if I wanted to show the last 25 tweets overall (for the public timeline)? Well, that is actually very easy, all I need to do is to query the Tweets column family for tweets, ordering them by descending key order.

Nitpicker corner: No, there is not such API for a CFDB for .NET that I know of, I made it up so it would be easier to discuss the topic.

Why is a CFDB so limiting?

You might have noticed how many times I noted differences between RDBMS and a CFDB. I think that it is the CFDB that is the hardest to understand, since it is so close, on the surface to the relational model. But it seems to suffer from so many limitations. No joins, no real querying capability (except by primary key), nothing like the richness that we get from a relational database. Hell, Sqlite or Access gives me more than that. Why is it so limited?

The answer is quite simple. A CFDB is designed to run on a large number of machines, and store huge amount of information. You literally cannot store that amount of data in a relational database, and even multi-machine relational databases, such as Oracle RAC will fall over and die very rapidly on the size of data and queries that a typical CFDB is handling easily.

Do you remember that I noted that CFDB is really all about removing abstractions? CFDB is what happens when you take a database, strip everything away that make it hard to run in on a cluster and see what happens.

The reason that CFDB don’t provide joins is that joins require you to be able to scan the entire data set. That requires either someplace that has a view of the whole database (resulting in a bottleneck and a single point of failure) or actually executing a query over all machines in the cluster. Since that number can be pretty high, we want to avoid that.

CFDB don’t provide a way to query by column or value because that would necessitate either an index of the entire data set (or just in a single column family) which in again, not practical, or running the query on all machines, which is not possible. By limiting queries to just by key, CFDB ensure that they know exactly what node a query can run on. It means that each query is running on a small set of data, making them much cheaper.

It requires a drastically different mode of thinking, and while I don’t have practical experience with CFDB, I would imagine that migrations using them are… unpleasant affairs, but they are one of the ways to get really high scalability out of your data storage.

Waiting expectantly to the commenters who would say that relational databases are the BOMB and that I have no idea what I am talking about and that I should read Codd and that no one really need to use this sort of stuff except maybe Google and even then only because Google has no idea how RDBMS work (except maybe the team that worked on AdWords).

Graph DB Sharding Strategies: Gravity

This is pure “scratch an itch” design post, since I started thinking about the topic, I might as well put my thoughts in a structured format. I am going to use twitter for this example, and assume that my end goal is to be able to do graph traversal, not just finding neighbors. Applying this to other social network scenarios should be pretty easy.

I generated this graph using this tool, this works using mentions, rather than following / followers graph, though. It will server to give you an idea what I am talking about.

image

One of the fundamentals of computer science is: localizing data leads to greater read performance. This is true whatever we are talking about keeping the data in the CPU L1 cache or in a distributed networked system. Typically, part of a sharding strategy is to keep all the data related to a root in a single location. The problem is, of course, that graphs don’t really have roots. And in most social network graphs, there is no such thing as a closed graph. There are, on average, ~7 million users within three hops from any twitter user. Now, it would be very easy to put 7 millions users to a single machine, except, as they say, they aren’t the same 7 million users.

Given that, I think that I can come up with an approach to allow more efficient queries and higher localization in the graph. The model assume an open and dynamic model (indeed, it relies on that).

We starts with geographical distribution. When we create a new user, we will place it in a shard dedicate to the geographical location the user is located on. This is a piece of data that we can get cheaply, and it has the advantage that users that interact with their circle of physical friends would tend to be clustered together anyway.

Next, we start assigning weights to associations. We only take into account outgoing associations (which solve the problem with outliers for incoming associations such as @aplsuk), but with a small twist, the weight of each outgoing association is taken as a portion of the total number of outgoing associations. In other words, the value of an outgoing association when you are following 10 friends is 0.1, but when you are following 500 friends, the value of each association is 0.002.

Next, we place some value on each mention that the user twits. A mention indicate that the association is active. For that matter, we probably need to create silent associations if a user keep mentioning someone that they are not following. For now, we will say that this value is 1% of the value of an association. That means that if I am following 100 users and I mentioned another user a hundred time, the value of the association is 0.02.

How does gravity comes into play here? Well, each outgoing association exact a pull on a node. But moving a node between shards is expensive, so we give shards an escape velocity. When we check if we need to re-shard a node, we aggregate the pulls of all the associations per node. Only if one shard pull is higher than the current shard pull + escape velocity will the node be shifted to the new shard.

Over time, this will tend to migrate most users close to the users that they are actively associated with. With that in mind, we can now move of to queries.

As I mentioned, I am interested more in this for graph traversal, and the good thing about this approach is that for the most part, most of the relevant information is located on the same shard. When the times comes to perform a query, we can assert that queries, too, need to have an escape velocity to cross a shard boundary. Unless there are enough outgoing connections to a given shard to overcome the escape velocity, outgoing connections to that shard are ignored. We can limit the cost of remote calls further if we increase the escape velocity for the query as the query search deeper. In other words, the escape velocity at depth = 0 would be 10, at depth = 1 it would be 100, at depth = 2 it would be 1000, etc.

While this represent a loss in accuracy of the results, it also mean that for the most case, results will tend to be more relevant.

Is this important if I don’t care for graph traversal?

There is another issue to consider, quite aside from graph traversal. The gravity approach outlined will tend to higher localization, and most operations are local. Consider writing a status update to all the people interested in that, when you have good locality, the cost of such on operation grows down drastically, in fact, I would say, it is highly likely that many such operations could be completed completely locally.

That No SQL Thing: Why do I need that again?

During the course of this series, I got a lot of people declaring: “But you can do that with RDMBS if you use XYZ”. The problem inherit in this statement is that it ignore the fact that if you really want to, you can use a hammer to put screws, it isn’t nice or efficient, but you can do it.

I run across this Dare Obasanjo’s post that explains the actual problem in beautiful terms:

What tends to happen once you’ve built a partitioned/sharded SQL database architecture is that you tend to notice that you’ve given up most of the features of an ACID relational database. You give up the advantages of the relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. And the end all you have left is that your data is Durable (i.e. it is persistently stored) which isn’t much better than you get from a dumb file system. Well, actually you also get to use SQL as your programming model which is nicer than performing direct file I/O operations.

It is unsurprising that after being at this point for years, some people in our industry have wondered whether it doesn’t make more sense to use data stores that are optimized for the usage patterns of large scale websites instead of gloriously misusing relational databases.  A good example of the tradeoffs is the blog post from the Digg team on why they switched to Cassandra. The database was already sharded which made performing joins to calculate results of queries such as “which of my friends Dugg this item?” to be infeasible. So instead they had to perform two reads from SQL (all Diggs on an item and all of the user’s friends) then perform the intersection operation on the PHP front end code.

I can’t use most of the relational database traditional advantages anyway. Not the moment that I step out of a single machine boundary. At that point, I have to re-evaluate what I am doing to see if it make sense to have to deal with the traditional relation database disadvantages. In many cases, the answer is no, and a solution that fit the problem better can be found.

This is where the NoSQL databases started. I think that once they gotten mature enough, they have advantages for smaller scale solutions, but that is a problem for another day.

That No SQL Thing: Scaling Graph Databases

Yesterday I talked about graph databases, outlining what they are and how they work. One of the interesting things about this series is that in many cases, I am posing a question (to myself), trying to answer it, then go and find out what other people do.

When thinking about scaling scenarios for a graph database, I had the following scenario in mind, a graph of nodes that is spread across multiple servers, where each member in the graph may reside on any machine in the system. The following diagram demonstrate what I am thinking about, each rectangle represent a different machine in the system:

image

Why is this important?

A single machine solution is obviously a barrier to scaling (and safety, but that is another concern. In a graph database, having relations between the node is the point, that makes sharding a bit more complicated, because unless you store the entire graph on a single machine, you are forced to query across machine boundaries. And you can’t store a graph in a single machine, for the simple reason that it is unlikely that you can limit a graph to be that small. Think about the implications of Six Degrees of Separation for graph databases and it will be clear what the problem is. In real world graphs, everyone is connected to everyone.

The problem with breaking the entire graph across machines is that now it is much more expensive to do graph traversals. The following query, which previous run on a single machine:

new GraphDatabaseQuery
{
   SourceNode = ayende,
   MaxDepth = 3,
   RelationsToFollow = new[]{"As Known As", "Family", "Friend", "Romantic", "Ex"},
   Where = node => node.Location == ayende.Location,
   SearchOrder = SearchOrder.BreadthFirst
}.Execute();

Now need to touch 3 different machines. Worse, it isn’t the number of machines that impacts that, but the spread of graph nodes across machines in the system.

After spending some time thinking about it, I came to the conclusion that I can’t envision any general way to solve the problem. Oh, I can think of several ways of reduce the problem:

  • Batching cross machine queries so we only perform them at the close of each breadth first step.
  • Storing multiple levels of associations (So “users/ayende” would store its relations but also “users/ayende”’s relation and “users/arik”’s relations).

The solution most likely to be successful is limiting the depth of cross machine node searches. In many cases, that is acceptable, I think. If we put the depth limit on 3, we can still give pretty good answers in a reasonable time frame. But the only way this can be made to work is with good batching support.

The algorithm may look like:

public IEnumerable<Node> DistributedTraverse(Node sourceNode, int depth, string relationToFollow, Func<Node, filter> predicate)
{
    if(depth == 0) // feeling defensive
        yield break;
        
    var related = GetRelatedNodes(sourceNode.ShardName, relationToFollow, predicate);
    
    foreach(var result in related)
            yield return result;
        
    if(depth == 1) // don't even bother asking down the line
    {
        yield break;
    }
    
    foreach(var relationsByShard in related.GroupBy(x=>x.ShardName))
    {
        var shard = GetShardProxy(relationsByShard.Key);
        var results = shard.BatchedQuery(sourceNodes: relationsByShard.ToArray(), depth - 1,relationToFollow, predicate);
        foreach(var result in results)
            yield return result;
    }
}

This give us a maximum amount of (depth * number_of_machines_in_cluster) – depth remote calls: With a depth of 3 and 3 machines in the cluster, we would have a max of 6 calls.

With that theory out of our heads, let us examine how real world Graph DBs tried to resolve this issue…

Neo4j (which seems to be pretty much the default for Graph DBs) doesn’t handle this currently, there are some hints that they intend to offer cluster wide replication, but nothing about design or implementation details. Neo4j does offer write-master/read-slaves approach for scaling out, which is really nice, but even that approach is limited at one point, and in this post, I am focusing on what happen when you go beyond that point.

FlockDB (which is what is used by twitter) does include, as part of its design goals: “horizontal scaling including replication”. However, FlockDB isn’t trying to solve the problem outlined above, indeed, graph traversal is a non goal for it.  FlockDB is more about finding one level of relations very fast than anything else.

In summary, I believe that while you can shard a graph database, it place a very lot limit on the type of graph walking queries you can make. Now, just to give you an idea, Neo4j, for example, appears to be able to routinely handle billions on nodes on a single machines, so you might no need to scale higher than that..

That No SQL Thing: Graph databases

After a short break, let us continue the discussion. Think about a graph database as a document database, with a special type of documents, relations. A simple example is a social network:

image

There are four documents and three relations in this example. Relations in a graph database are more than just a pointer. A relation can be unidirectional or bidirectional, but more importantly, a relation is typed, I may be associated to you in several ways, you may be a client, family or my alter ego. And the relation itself can carry information. In the case of the relation document in the example above, we simply record the type of the association.

And that is about it, mostly. Once you figured out that graph database can be seen as document databases with a special document type, you are pretty much done.

Except that graph database has one additional quality that make them very useful. They allow you to perform graph operations. The most basic graph operation is traversal. For example, let us say that I want to know who of my friends is in town so I can go and have a drink. That is pretty easy to do, right? But what about indirect friends? Using a graph database, I can define the following query:

new GraphDatabaseQuery
{
   SourceNode = ayende,
   MaxDepth = 3,
   RelationsToFollow = new[]{"As Known As", "Family", "Friend", "Romantic", "Ex"},
   Where = node => node.Location == ayende.Location,
   SearchOrder = SearchOrder.BreadthFirst
}.Execute();

I can execute more complex queries, filtering on the relation properties, considering weights, etc.

Graph databases are commonly used to solve network problems. In fact, most social networking sites use some form of a graph database to do things like “You might know…”.

Because graph databases are intentionally design to make sure that graph traversal is cheap, they also provide other operations that tend to be very expensive without it. For example, Shortest Path between two nodes. That turn out to be frequently useful when you want to do things like: “Who can recommend me to this company’s CTO so they would hire me”.