Friday, July 30, 2010
EF Prof and Code Only
I just finished touching up a new feature for EF Prof: support for Entity Framework’s Code Only feature. What you see below is EF Prof tracking the Code Only Nerd Dinner example:
I tried to tackle the same thing in CTP3, but I was unable to resolve it. Using CTP4, it was about as easy as I could wish it.
Just for fun, the following screen shot looks like it contains a bug, but it doesn’t (at least, not to my knowledge). If you can spot what the bug is, I am going to hand you a 25% discount coupon for EF Prof. If you can tell me why it is not a bug, I would double that.
As an aside, am I the only one who is bothered by the use of @@IDENTITY by EF? I thought that we weren’t supposed to make use of that. Moreover, why write this complex statement when you can write SELECT @@IDENTITY?
Thursday, July 29, 2010
A question of an untenable situation
One of the most common issues that I run into in my work is getting all sorts of questions that sound really strange. For example, I recently got a question that went something like this:
What is the impact of reflection on NHibernate’s performance?
I started to answer that there isn’t one that you would notice. Even setting aside the major optimizations that NH has in that area, you are accessing a remote database, which is much more expensive. But then they told me that they had profiled the application and found some stuff there about that.
I asked them what their scenario was, and the exchange went like this:
Well, when we load a million rows…
And that is your problem…
To be fair, they actually had a reasonable reason to want to do that. I disagree with the solution that they had, but it was a reasonable approach to the problem at hand.
Wednesday, July 28, 2010
Reviewing CommunityCourses: A RavenDB application
The code for CommunityCourses can be found here: http://github.com/adam7/CommunityCourses
This is a RavenDB application whose existence I learned about roughly an hour ago. The following is a typical code review about the application, not limited to just RavenDB.
Tests
When I first opened the solution, I was very happy to see that there are tests for the application, but I was disappointed when I actually opened it.

The only tests that exist seem to be the default ones that come with ASP.NET MVC.
I would rather have no test project than a test project like that.
Account Controller, yet again
Annoyed with the test project, I headed for my favorite target for frustration, the MVC AccountController and its horrid attempt at creating DI or abstractions.
Imagine my surprise when I found this lovely creature instead:
[HandleError]
public partial class AccountController : Controller
{
    public virtual ActionResult LogOn()
    {
        return View();
    }
    [AcceptVerbs(HttpVerbs.Post)]
    public virtual ActionResult LogOn(string userName, string password)
    {
        if (FormsAuthentication.Authenticate(userName, password))
        {
            FormsAuthentication.SetAuthCookie(userName, false);
            return RedirectToAction("Index", "Home");
        }
        else
        {
            ModelState.AddModelError("logon", "Invalid username or password");
            return View();
        }
    }
    public virtual ActionResult LogOff()
    {
        FormsAuthentication.SignOut();
        return RedirectToAction("Index", "Home");
    }
}
Yes, simple, working, uncomplicated! You don’t have to think when reading this code, I like it. That is how the Account Controller should have been, by default.
The model
I then turned to the model. Community Courses is a RavenDB application, so I am very interested in seeing how this is handled. The first thing that I noticed was this:
That was interesting, I am not used to seeing static classes in the model. But then I looked into those classes, and it all became clear:
This is essentially a lookup class.
Then I switched to the rest of the model, the following image shows a partial view of the model, annotated a bit:
The classes with the Id property (highlighted) are presumed to be Root Entities (in other words, they would each reside in their own document). I am not absolutely sure that this is the case yet, but I am sure enough to point out a potential problem.
Did you notice that we have references to things like Address, Person and Centre in the model? They are marked with red and green points.
A green point is when we reference a class that doesn’t have an id, and is therefore considered to be a value type which is embedded in the parent document. A red point, however, indicates what I believe will be a common problem for people coming to RavenDB from an OR/M background.
RavenDB doesn’t support references (this is by design), and the result of referencing a Root Entity in another Root Entity is that the referenced entity is embedded inside the referencing document. This is precisely what you want for value types like Address, and precisely what you don’t want for references. You can see in TasterSession that there are actually two references to Tutor, one for the Tutor data and one for the TutorId. I think that this is an indication of having hit that problem.
For myself, I would prefer to not denormalize the entire referenced entity, but only key properties that are needed for processing the referencing entity. That makes it easier to understand the distinction between the Person instance that is mapped to people/4955 and the Person instance that is held in Centre.Contact.
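To make the distinction concrete, here is a minimal sketch of what such a denormalized reference could look like. The PersonReference class, the Email property and the sample values are hypothetical and are not taken from the CommunityCourses code base; the point is only that Centre.Contact carries the key properties (Id and Name) rather than a full copy of the Person document.

```csharp
using System;

// Hypothetical types, for illustration only.
public class Person
{
    public string Id { get; set; }      // e.g. "people/4955", a root entity stored in its own document
    public string Name { get; set; }
    public string Email { get; set; }   // assumed extra property that we do NOT want to copy around
}

// Holds only the key properties that the referencing document needs.
public class PersonReference
{
    public string Id { get; set; }
    public string Name { get; set; }

    public static PersonReference From(Person person)
    {
        return new PersonReference { Id = person.Id, Name = person.Name };
    }
}

public class Centre
{
    public string Id { get; set; }
    public string Name { get; set; }
    public PersonReference Contact { get; set; }  // denormalized reference, not a full Person
}

public static class Program
{
    public static void Main()
    {
        var person = new Person { Id = "people/4955", Name = "Jane Doe", Email = "jane@example.org" };
        var centre = new Centre
        {
            Id = "centres/912",
            Name = "Main Centre",
            Contact = PersonReference.From(person)
        };

        // The centre document now embeds only the id and the name of its contact.
        Console.WriteLine(centre.Contact.Id + " " + centre.Contact.Name);
    }
}
```

With a shape like this, it is obvious from the type alone that Centre.Contact is a snapshot pointing at people/4955, not the Person entity itself.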
Session management
Next on the list, because it is so easy to get it wrong (I have seen so many flawed NHibernate session management implementations):
This is great, we have a session per request, which is what I would expect in a web application.
I might quibble with the call to SaveChanges, though. I like to make that one an explicit one, rather than implicit, but that is just that, a quibble.
Initializing RavenDB
This is worth speaking about (it happens in Application_Start):
RavenDB is heavily convention based, and Community Courses overrides one of those conventions to make it easier to work with MVC. By default, RavenDB will use ids like: “people/391”, “centres/912”. By changing the identity parts separator, it will generate ids like: “people-952”, “centres-1923”. The latter are easier to work with in MVC because they don’t contain a routing character.
Centre Controller
This is a simple CRUD controller, but it is worth examining nonetheless:
Those are all pretty simple. The only thing of real interest is the Index action, which uses a query on an index to get the results.
Currently the application doesn’t support paging, but it probably should, which wouldn’t complicate life all that much (adding Skip & Take, that is about it).
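As a rough illustration of the Skip & Take idea, here is a sketch that pages over an in-memory sequence. The GetPage helper and the sample ids are assumptions made for this example; in the actual controller, the same two operators would presumably be applied to the query rather than to a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
    // page is zero-based; pageSize is the number of items per page.
    public static List<T> GetPage<T>(IEnumerable<T> source, int page, int pageSize)
    {
        return source
            .Skip(page * pageSize)   // skip everything that belongs to earlier pages
            .Take(pageSize)          // then take a single page worth of items
            .ToList();
    }

    public static void Main()
    {
        // In-memory stand-in for the results of the index query.
        var centres = Enumerable.Range(1, 25).Select(i => "centres-" + i);

        var secondPage = GetPage(centres, page: 1, pageSize: 10);

        // Prints centres-11 through centres-20.
        Console.WriteLine(string.Join(", ", secondPage));
    }
}
```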
Next, the Create/Edit actions:
They are short & to the point, nothing much to do here. I like this style of coding very much. You could fall asleep while writing it. The only comment beyond that is that those methods are so similar that I would consider merging them into a single action.
Course Controller
Things are starting to get much more interesting here, when we see this method:
Overall, it is pretty good, but I am very sensitive to making remote calls, so I would change the code to make only a single remote call:
CourseViewModel ConvertToCourseViewModel(Course course)
{
    var ids = new List<string> { course.CentreId, course.TutorId, course.UnitId };
    if(course.VerifierId != null)
        ids.Add(course.VerifierId);
    ids.AddRange(course.StudentIds);
    var results = MvcApplication.CurrentSession.Load<object>(ids.ToArray());
    var courseViewModel = new CourseViewModel
    {
        Centre = (Centre)results[0],
        CentreId = course.CentreId,
        EndDate = course.EndDate,
        Id = course.Id,
        Name = course.Name,
        StartDate = course.StartDate,
        StudentIds = course.StudentIds,
        Tutor = (Person)results[1],
        TutorId = course.TutorId,
        Unit = (Unit)results[2],
        UnitId = course.UnitId,
        VerifierId = course.VerifierId
    };
    int toSkip = 3;
    if (course.VerifierId != null)
    {
        toSkip += 1;
        courseViewModel.Verifier = (Person)results[3];
    }
    courseViewModel.Students = results.Skip(toSkip).Cast<Person>().ToList();
    return courseViewModel;
}
This is slightly more complex, but I think that the benefits outweigh the additional complexity.
5N+1 requests ain’t healthy
The following code is going to cause a problem:
It is going to cause a problem because it makes a single remote call (the Query) and then, for every result from this query, performs 5 remote calls inside ConvertToCourseViewModel.
In other words, if we have twenty courses to display, this code will execute a hundred and one remote calls (one for the query, plus five for each of the twenty courses). That is going to be a problem. Let us look at what the document courses-1 looks like:
{
    "CentreId": "centres-1",
    "Status": "Upcoming",
    "Name": "NHibernate",
    "StartDate": "/Date(1280361600000)/",
    "EndDate": "/Date(1280534400000)/",
    "UnitId": "units-1",
    "TutorId": "people-1",
    "VerifierId": null,
    "StudentIds": [
        "people-1"
    ]
}
And here is what the UI looks like:
I think you can figure out what I am going to suggest, right? Instead of pulling all of this data at read time (very expensive), we are going to denormalize the data at write time, leading to a document that looks like this:
{
    "CentreId": { "Id": "centres-1", "Name": "SkillsMatter London" },
    "Status": "Upcoming",
    "Name": "NHibernate",
    "StartDate": "/Date(1280361600000)/",
    "EndDate": "/Date(1280534400000)/",
    "UnitId": { "Id": "units-1", "Name": "1 - Introduction to Face Painting" },
    "TutorId": { "Id": "people-1", "Name": "Mr Oren Eini" },
    "VerifierId": null,
    "StudentIds": [
        { "Id": "people-1", "Name": "Ayende Rahien" }
    ]
}
Using this approach, we can handle the Index action shown above with a single remote call. And that is much better.
I am going to ignore actions whose structure we already covered (Edit, Details, etc), and focus on the interesting ones, the next of which is:
This is an excellent example of how things should be. (Well, almost, I would remove the unnecessary Store call and move the StudentIds.Add just before the foreach, so all the data access happens first, it makes it easier to scan). Using an OR/M, this code would generate 8 remote calls, but because Raven’s documents are loaded as a single unit, we have only 3 here (and if we really wanted, we can drop it to one).
Next, we update a particular session / module in a student.
We can drop the unnecessary calls to Store, but besides that, it is pretty code. I don’t like that moduleId / sessionId are compared to the Name property, that seems confusing to me.
Charting with RavenDB
I am showing only the parts that are using RavenDB here:

There is one problem with this code, it doesn’t work. Well, to be a bit more accurate, it doesn’t work if you have enough data. This code ignores what happens if you have enough people to start paging, and it does a lot of work all the time. It can be significantly improved by introducing a map/reduce index to do all the hard work for us:
This will perform the calculation once, updating it whenever a document changes. It will also result in much less traffic going on the network, since the data that we will get back will look like this:
Person Controller doesn’t contain anything that we haven’t seen before. So we will skip that and move directly to…
Taster Session Controller
I sincerely hope that those comments were generated by a tool.
The following three methods all shared the same problem:
They all copy Centre & Person to the TasterSession entity. The problem with that is that it generates the following JSON:
References in RavenDB are always embedded in the referencing entity. This should be a denormalized reference instead (just Id & Name, most probably).
Other aspects
I focused exclusively on the controller / Raven code, and paid absolutely no attention to any UI / JS / View code. I can tell you that the UI looks & behaves really nice, but that is about it.
Summary
All in all, this was a codebase that was a pleasure to read. There are some problems, but they are going to be easy to fix.
Tuesday, July 27, 2010
RavenDB Authorization Bundle Design
I used to be able to just sit down and write some code, and eventually things would work. Just In Time Design. That is how I wrote things like Rhino Mocks, for example.
Several years ago (2007, to be exact) I started doing more detailed upfront design. Those designs aren’t carved in stone, but they are helpful in setting everything in motion properly. Of course, in some cases those designs need a lot of time to percolate. At any rate, this is the design for the Authorization Bundle for RavenDB. I would welcome any comments about it. I gave some background on some of the guiding thoughts about the subject in this post.
Note: This design is written before the code, it reflects the general principles of how I intend to approach the problem, but it is not a binding design, things will change.
Rhino Security design has affected the design of this system heavily. In essence, this is a port (of a sort) of Rhino Security to RavenDB, with the necessary changes to make the move to a NoSQL database. I am pretty happy with the design and I actually think that we might do back porting to Rhino Security at some point.
Important Assumptions
The most important assumption that we make for the first version is that we can trust the client not to lie about which user it is executing a certain operation on behalf of. That assumes the following deployment scenario:
In other words, only the application server can talk to the RavenDB server and the application server is running trusted code.
To be clear, this design doesn’t apply if users can connect directly to the database and lie about who they are. However, that scenario is expected to crop up, even though it is out of scope for the current version. Our design needs to be future proofed in that regard.
Context & User
Since we can trust the client calling us, we can rely on the client to tell us which user a particular action is executed on behalf of, and what is the context of the operation.
From the client API perspective, we are talking about:
using(var session = documentStore.OpenSession())
{
    session.SecureFor("raven/authorization/users/8458", "/Operations/Debt/Finalize");
    // in a C# query expression, orderby has to come before the select clause
    var debtsQuery = from debt in session.Query<Debt>("Debts/ByDepartment")
                     where debt.Department == department
                     orderby debt.Amount
                     select debt;
    var debts = debtsQuery.Take(25).ToList();
    // do something with the debts
}
I am not really happy with this API, but I think it would do for now. There are a couple of things to note with regards to this API:
- The user specified is using the reserved namespace “raven/”. This allows the authorization bundle to have a well known format for the users documents.
- The operation specified is using the Rhino Security conventions for operations. By using this format, we can easily construct hierarchical permissions.
Defining Users
The format of the authorization user document is as follows:
// doc id /raven/authorization/users/2929
{
    "Name": "Ayende Rahien",
    "Roles": [ "/Administrators", "/DebtAgents/Managers" ],
    "Permissions": [
        { "Operation": "/Operations/Debts/Finalize", "Tag": "/Tags/Debts/High", "Allow": true, "Priority": 1 }
    ]
}
There are several things to note here:
- The format isn’t what an application needs for a User document. This entry is meant for the authorization bundle’s use, not for an application’s use. You can use the same format for both, of course, by extending the authorization user document, but I’ll ignore this for now.
- Note that the Roles that we have are hierarchical as well. This is important, since we would use that when defining permissions. Beyond that, Roles are used in a similar manner to groups in something like Active Directory, and the hierarchical format allows us to manage that sort of hierarchical grouping inside Raven easily.
- Note that we can also define permissions on the user for documents that are tagged with a particular tag. This is important if we want to grant a specific user permission for a group of documents.
Roles
The main function of roles is to define permissions for a set of tagged documents. A role document will look like this:
// doc id /raven/authorization/roles/DebtAgents/Managers
{
    "Permissions": [
        { "Operation": "/Operations/Debts/Finalize", "Tag": "/Tags/Debts/High", "Allow": true, "Priority": 1 }
    ]
}
Defining permissions
Permissions are defined on individual documents, using RavenDB’s metadata feature. Here is an example of one such document, with the authorization metadata:
// doc id /debts/2931
{
    "@metadata": {
        "Authorization": {
            "Tags": [
                "/Tags/Debts/High"
            ],
            "Permissions": [
                {
                    "User": "raven/authorization/users/2929",
                    "Operation": "/Operations/Debts",
                    "Allow": true,
                    "Priority": 3
                },
                {
                    "User": "raven/authorization/roles/DebtAgents/Managers",
                    "Operation": "/Operations/Debts",
                    "Allow": false,
                    "Priority": 1
                }
            ]
        }
    },
    "Amount": 301581.92,
    "Debtor": {
        "Name": "Samuel Byrom",
        "Id": "debtors/82985"
    }
    //more document data
}
Tags, operations and roles are hierarchical. But the way they work is quite different.
- For Tags and Operations, having permission for “/Debts” gives you permission to “/Debts/Finalize”.
- For roles, it is the other way around: if you are a member of “/DebtAgents/Managers”, you are also a member of “/DebtAgents”.
The Authorization Bundle uses all of those rules to apply permissions.
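The two hierarchy rules above can be sketched as plain string operations. This is only an illustration of the rules as described here, not the bundle’s implementation; the Hierarchy class and its Covers and SelfAndAncestors methods are names invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class Hierarchy
{
    // Tags and operations: a grant on a path also covers everything below that path.
    public static bool Covers(string granted, string requested)
    {
        return requested == granted ||
               requested.StartsWith(granted.TrimEnd('/') + "/", StringComparison.Ordinal);
    }

    // Roles: being a member of a role implies membership in all of its ancestors.
    public static IEnumerable<string> SelfAndAncestors(string role)
    {
        var current = role;
        while (!string.IsNullOrEmpty(current))
        {
            yield return current;
            var lastSlash = current.LastIndexOf('/');
            current = lastSlash <= 0 ? null : current.Substring(0, lastSlash);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // A grant on the parent operation covers the child operation...
        Console.WriteLine(Hierarchy.Covers("/Operations/Debts", "/Operations/Debts/Finalize")); // True
        // ...but not the other way around.
        Console.WriteLine(Hierarchy.Covers("/Operations/Debts/Finalize", "/Operations/Debts")); // False

        // Prints: /DebtAgents/Managers, /DebtAgents
        Console.WriteLine(string.Join(", ", Hierarchy.SelfAndAncestors("/DebtAgents/Managers")));
    }
}
```

Appending the separator before the prefix check matters: without it, a grant on “/Operations/Debt” would wrongly cover “/Operations/Debts”.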
Applying permissions
I think that it should be pretty obvious by now how the Authorization Bundle makes a decision about whether a particular operation is allowed or denied, but the responses for denying an operation are worth some note.
- When performing a query over a set of documents, some of which we don’t have the permission for under the specified operation, those documents are filtered out from the query.
- When loading a document by id, when we don’t have the permission to do so under the specified operation, an error is raised.
- When trying to write to a document (either PUT or DELETE), when we don’t have the permission to do so under the specified operation, an error is raised.
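A small in-memory sketch of the difference between the first two behaviours, filtering for queries versus failing for loads. Everything here is a stand-in: the Document class, its Allowed flag (which replaces the real permission evaluation) and the choice of UnauthorizedAccessException are assumptions for the example, since the design does not say which error type is raised.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical document; Allowed stands in for the outcome of the real permission check.
public class Document
{
    public string Id { get; set; }
    public bool Allowed { get; set; }
}

public static class SecuredStore
{
    // Queries silently drop the documents that the user may not see.
    public static List<Document> Query(IEnumerable<Document> docs)
    {
        return docs.Where(doc => doc.Allowed).ToList();
    }

    // Loading by id is explicit, so a denied document is an error rather than a missing result.
    public static Document Load(IEnumerable<Document> docs, string id)
    {
        var found = docs.FirstOrDefault(doc => doc.Id == id);
        if (found != null && !found.Allowed)
            throw new UnauthorizedAccessException("Not allowed to read " + id);
        return found;
    }
}

public static class Program
{
    public static void Main()
    {
        var docs = new[]
        {
            new Document { Id = "debts/1", Allowed = true },
            new Document { Id = "debts/2", Allowed = false }
        };

        Console.WriteLine(SecuredStore.Query(docs).Count); // 1, debts/2 was filtered out

        try
        {
            SecuredStore.Load(docs, "debts/2");
        }
        catch (UnauthorizedAccessException e)
        {
            Console.WriteLine(e.Message); // Not allowed to read debts/2
        }
    }
}
```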
That is pretty much as detailed as I want things to be at this stage. Thoughts?
Monday, July 26, 2010
RavenDB Index Management
When I wrote RavenDB, I started from the server, and built the client last. That had some interesting effects on RavenDB. For example, you can see detailed docs about the HTTP API, because that is what I had when I wrote most of the docs.
In the context of indexes, that meant that I thought a lot more about defining and working with indexes from the WebUI perspective, rather than the client perspective. Now that Raven has users that actually put it through its paces, I found that most people want to be able to define their indexes completely in code, and want to be able to re-create those indexes from code.
And that calls for an integral solution from Raven for this issue. Here is how you do this.
- You define your index creation as a class, such as this one:
public class Movies_ByActor : AbstractIndexCreationTask
{
    public override IndexDefinition CreateIndexDefinition()
    {
        return new IndexDefinition<Movie>
        {
            Map = movies => from movie in movies
                            select new {movie.Name}
        }
        .ToIndexDefinition(DocumentStore.Conventions);
    }
}
- Somewhere in your startup routine, you include the following line of code:
IndexCreation.CreateIndexes(typeof(Movies_ByActor).Assembly, store);
And that is it, Raven will scan the provided assembly (you can also provide a MEF catalog, for more complex scenarios) and create all those indexes for you, skipping the creation if the new index definition matches the index definition in the database.
This also provides a small bit of convention: as you can see, the class name is Movies_ByActor, but the index name will be Movies/ByActor. You can override that by overriding the IndexName property.
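The naming convention amounts to replacing the underscore in the class name with a slash. A tiny sketch of that mapping follows; the IndexNameFor helper is made up for this example, and the Movies_ByActor class below is an empty stub rather than the real AbstractIndexCreationTask subclass.

```csharp
using System;

// Empty stub standing in for the real index creation class.
public class Movies_ByActor
{
}

public static class Program
{
    // Hypothetical helper that mirrors the convention described above.
    public static string IndexNameFor(Type indexType)
    {
        return indexType.Name.Replace('_', '/');
    }

    public static void Main()
    {
        Console.WriteLine(IndexNameFor(typeof(Movies_ByActor))); // Movies/ByActor
    }
}
```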
Sunday, July 25, 2010
Find the bug: Accidental code reviews
I was working with a client on a problem they had integrating EF Prof into their application, when my eye caught the following code (anonymized, obviously):
public static class ContextHelper
{
    private static Acme.Entities.EntitiesObjectContext _context;
    public static Acme.Entities.EntitiesObjectContext CurrentContext
    {
        get { return _context ?? (_context = new Acme.Entities.EntitiesObjectContext()); }
    }
}
That caused me to stop everything and focus the client’s attention on the problems that this code can cause.
What were those problems?
A series of posts about NHibernate tooling
I intend to write a series of posts about NHibernate tooling, and I thought that before I start, I should ask people to point me to tools that I might not be familiar with.
Tools that are currently on the list to post about:
- LLBLGen 3.0
- Pleasant Modeler
- Active Writer
- NHibernate Query Analyzer
Any others that you’d like me to check out?
Friday, July 23, 2010
Real world authorization implementation considerations
Nitpicker corner: this post discusses authorization, which assumes that you already know who the user is. Discussion of authentication methods, how we decide who the user is, would be outside the scope of this post.
I have had a lot of experience with building security systems. After all, sooner or later, whatever your project is, you are going to need one. At some point, I got tired enough of doing that that I wrote Rhino Security, which codifies a lot of the lessons that I learned from all of those times. And I learned a lot from using Rhino Security in real world projects as well.
When coming to design the authorization bundle for RavenDB, I had decided to make a conscious effort to detail the underlying premise that I have when I am approaching the design of a security system.
You can’t manage authorization at the infrastructure level
That seems to be an instinctual response by most developers when faced with the problem, “we will push it to the infrastructure and handle this automatically”. The usual argument is that we want to avoid the possibility of the developer forgetting to include the security checks and that it makes it easier to develop.
The problem is that when you put security decisions in the infrastructure, you are losing the context in which a certain operation is performed. And context matters. It matters even more when we consider the fact that there are actually two separate levels of security that we need to consider:
- Infrastructure related – can I read / write to this document?
- Business related – can I perform [business operation] on this entity?
Very often, we try to use the first to apply the second. This is often the case when we have a business rule that specifies that a user shouldn’t be able to access certain documents, which we try to apply at the infrastructure level.
For a change, we will use the example of a debt collection agency.
As a debt collector, I can negotiate a settlement plan with a debtor, so the agency can resolve the debt.
- Debt collectors can only negotiate settlement plans for debts under 50,000$
- Only managers can negotiate settlement plans for debts over 50,000$
Seems simple, right? We will assume that we have a solution in store and say that the role of DebtCollectors can’t read/write to documents about settlement plans of over 50K$. I am not sure how you would actually implement this, but let us say that we did just that. We solved the problem at the infrastructure level and everyone is happy.
Then we run into a problem: a Debt Collector may not be allowed to do the actual negotiation with a heavy debtor, but there is a whole lot of additional work that goes on that the Debt Collector should do (check for collateral, future prospects, background check, etc).
The way that the agency works, the Debt Collector does a lot of the preliminary work, then the manager does the actual negotiation. That means that for the same entity, under different contexts, we have very different security rules. And this sort of requirement is the one that is going to give you fits when you try to apply it at the infrastructure level.
You can argue that those sorts of rules are business logic, not security rules, but the way the business thinks of them, that is exactly what they are.
The logged on user isn’t the actual user
There is another aspect to this. Usually when we need to implement a security system like this, people throw into the ring the notion of Row Level Security and allowing access to specific rows by specific logins. That is a non starter from the get go, for several reasons. The previous point about infrastructure level security applies here as well, but the major problem is that it just doesn’t work when you have more than a pittance of users.
All Row Level Security solutions that I am aware of (I am thinking specifically of some solutions provided by database vendors) require you to log in to the database using a specific user, from which your credentials can be checked against specific row permissions.
Consider the case where you have a large number of users, and you have to log in to the database for each user using their credentials. What is going to be the effect on the system?
Well, there are going to be two major problems. The first is that you can wave goodbye to small & unimportant things like connection pooling. Since each user has their own login, they can’t share connections, which is going to substantially increase the cost of talking to the database.
The second is a bit more complex to explain. When the system performs an operation as a result of a user action, there are distinct differences between work that the system performs on behalf of the user and work that the system performs on behalf of the system.
Let us go back to our Debt Collection Agency and look at an example:
As a Debt Collector, I can finalize a settlement plan with a debtor, so the agency can make a lot of money.
- A Debt Collector may only view settlement plans for the vendors that they handle debt collection for.
- A settlement plan cannot be finalized if (along with other plans that may exist) the settlement plan would result in over 70% of the debtor’s salary going into paying debts.
This is a pretty simple scenario. If I am collecting debts for ACME, I can’t take a peek and see how debts handled by EMCA, ACME’s competitor, are handled. And naturally, if the debtor’s income isn’t sufficient to pay the debt, it is pretty obvious that the settlement plan isn’t valid, and we need to consider something else.
Now, let us look at how we would actually implement this. The first rule specifies that we can’t see other settlement plans, but for us to enforce the second rule, we must see them, even if they belong to other creditors. In other words, we have a rule where the system needs to execute in the context of the system and not in the context of the user.
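A minimal sketch of that tension, using in-memory data. The SettlementPlan class, its properties and the numbers are all invented for the example; the only things taken from the rules above are the per-vendor visibility and the 70% ceiling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model, for illustration only.
public class SettlementPlan
{
    public string Vendor { get; set; }
    public string DebtorId { get; set; }
    public decimal MonthlyPayment { get; set; }
}

public static class Rules
{
    // User context: a collector only sees the plans of the vendor that they work for.
    public static IEnumerable<SettlementPlan> VisibleTo(string vendor, IEnumerable<SettlementPlan> plans)
    {
        return plans.Where(plan => plan.Vendor == vendor);
    }

    // The 70% rule: existing plans plus the candidate must not exceed 70% of the salary.
    public static bool CanFinalize(SettlementPlan candidate, IEnumerable<SettlementPlan> existingPlans, decimal monthlySalary)
    {
        var total = existingPlans
            .Where(plan => plan.DebtorId == candidate.DebtorId)
            .Sum(plan => plan.MonthlyPayment) + candidate.MonthlyPayment;
        return total <= monthlySalary * 0.70m;
    }
}

public static class Program
{
    public static void Main()
    {
        var existing = new List<SettlementPlan>
        {
            new SettlementPlan { Vendor = "EMCA", DebtorId = "debtors/1", MonthlyPayment = 500m }
        };
        var candidate = new SettlementPlan { Vendor = "ACME", DebtorId = "debtors/1", MonthlyPayment = 300m };

        // System context: all plans are considered, 800 > 700, so the plan is rejected.
        Console.WriteLine(Rules.CanFinalize(candidate, existing, 1000m)); // False

        // User context: the EMCA plan is invisible to the ACME collector, so the check wrongly passes.
        Console.WriteLine(Rules.CanFinalize(candidate, Rules.VisibleTo("ACME", existing), 1000m)); // True
    }
}
```

The second call is the failure mode: if the validation runs under the user’s filtered view, the rule is evaluated against incomplete data and gives the wrong answer.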
You will be surprised how often such scenarios come up when building complex systems. When your security system is relying on the logged on user for handling security filtering, you are going to run into a pretty hard problem when the time comes to handle those scenarios.
Considerations
So, where does this leave us? It leaves us with the following considerations when the time comes to build an authorization implementation:
- You can’t handle authorization in the infrastructure, there isn’t enough context to make decisions there.
- Relying on the logged on user for row/document level security is a good way to have a wall hit your head at a considerable speed.
- Authorization must be optional, because we need to execute some operations to ensure valid state outside the security context of a single user.
- Authorization isn’t limited to the small set of operations that you can perform from the infrastructure perspective (Read / Write) but has business meaning that you need to consider.
Thursday, July 22, 2010
An interesting RavenDB bug
I got a very strange bug report recently,
The following index:
from movie in docs.Movies
from actor in movie.Actors
select new { Actor = actor }
Will produce multiple results from a single document, which poses a pretty big problem when you try to page through that. Imagine that each movie has 10 actors, and you are trying to page through this index for the first two documents of movies by Charlie Chaplin. The first movie that matches Charlie Chaplin will have ten results returned from the index, and simple paging at the index level will give us the wrong results.
Here is my solution for that, which works, but makes me just a tad uneasy:
public IEnumerable<IndexQueryResult> Query(IndexQuery indexQuery)
{
    IndexSearcher indexSearcher;
    using (searcher.Use(out indexSearcher))
    {
        var previousDocuments = new HashSet<string>();
        var luceneQuery = GetLuceneQuery(indexQuery);
        var start = indexQuery.Start;
        var pageSize = indexQuery.PageSize;
        var skippedDocs = 0;
        var returnedResults = 0;
        do
        {
            if(skippedDocs > 0)
            {
                start = start + pageSize;
                // trying to guesstimate how many results we will need to read from the index
                // to get enough unique documents to match the page size
                pageSize = skippedDocs * indexQuery.PageSize;
                skippedDocs = 0;
            }
            var search = ExecuteQuery(indexSearcher, luceneQuery, start, pageSize, indexQuery.SortedFields);
            indexQuery.TotalSize.Value = search.totalHits;
            for (var i = start; i < search.totalHits && (i - start) < pageSize; i++)
            {
                var document = indexSearcher.Doc(search.scoreDocs[i].doc);
                if (IsDuplicateDocument(document, indexQuery.FieldsToFetch, previousDocuments))
                {
                    skippedDocs++;
                    continue;
                }
                returnedResults++;
                yield return RetrieveDocument(document, indexQuery.FieldsToFetch);
            }
        } while (skippedDocs > 0 && returnedResults < indexQuery.PageSize);
    }
}
Tuesday, July 20, 2010
iTunes full screen movies incompatible with Large Font sizes?
I have a PC hooked to my TV, but the problem is that there seems to be a bug in iTunes. When I set the font size to be large enough to actually be readable, like so:
I lose the ability to view full screen movies in iTunes. When I switch the movie to full screen, it continues playing (I can hear it) but the display switches back to the iTunes library, rather than the movie.
I verified that resetting the font size fixes this problem, and this is in iTunes 9.2.1.
Anyone run into this? Any solutions?
Monday, July 19, 2010
Tikal .NET forum: Introduction to NHibernate
I’ll be presenting on July 25th 10:00-11:30 in the Tikal .NET forum:
Tikal .NET forum is delighted to present an introduction to NHibernate, the leading and most advanced open source ORM (Object Relational Mapping) in the .NET domain with integrated support for concurrency, distribution, fault tolerance and incremental code loading. ORM takes care of the burden of mapping between your .NET entities and the underlying relational database.
Following a concise introduction into the motivation for the rising interest in ORM, we will provide a general idea of the key features of the NHibernate framework compared to other frameworks, and talk about the impact they have on the production of highly scalable and fault-tolerant systems.
See you all on July 25th at 10:00 in Krypton, Hakfar-Hayarok Ramat-Hasharon.
Friday, July 16, 2010
Buy vs. Build & YAGNI
I was recently at the Israeli ALT.Net tools night, and I had a very interesting discussion on installers. Installers are usually a very painful part of the release procedure. The installer format for Windows is MSI, which is… strange. It takes time to understand how MSI works, and even after you get that, it is still painful to work with. Wix is a great improvement when it comes to building MSI installations, but that doesn’t make it good. Other installer builders, such as InstallShield and NSIS, are just as awkward.
The discussion that I had related to the complexity of building an installer on those technologies.
My argument was that it simply made no sense to try to overcome the hurdles of the installer technologies. Instead, we can write our own installer more easily than fussing about with the existing ones. The installer already assumes the presence of the .NET framework, so that makes things even easier.
This is an application of a principle that I strongly believe in: Single purpose, specially built tools & components can be advantageous over more generic ones, for your specific scenarios.
Case in point, the installer. Installers are complicated beasts because they must support a lot of complex scenarios (upgrading from 5.3.2 to 6.2.1, for example), be transactional, support installation, etc. But for the installer in question, upgrade is always an uninstall of the previous version & install of the new one, and the only tasks it requires are copying files and modifying registry entries.
Given that set of requirements, we can design the following installer framework:
public interface IInstallerTask
{
void Install();
void Uninstall();
}
public class FileCopyTask : IInstallerTask
{
public string Source { get;set; }
public string Destination { get;set; }
public void Install()
{
File.Copy(Source, Destination,overwrite:true);
}
public void Uninstall()
{
File.Delete(Destination);
}
}
And building a particular installer would be:
ExecuteInstaller(
Directory.GetFiles(extractedTempLocation)
.Select( file =>
new FileCopyTask
{
Source = file,
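// GetFiles returns full paths, so combine the destination with the file name only
Destination = Path.Combine(destinationPath, Path.GetFileName(file))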
}
),
new RegistryKeyTask
{
Key = "HKLM/Windows/CurrentVersion...",
Value = 9
}
);
This gives the ExecuteInstaller method a list of tasks to be executed, which can then be used to install or uninstall everything.
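The post does not show ExecuteInstaller itself, so here is a minimal sketch of one way it could work. The rollback behavior, the ExecuteUninstaller method and the DelegateTask helper are assumptions added for illustration, not part of the original design.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IInstallerTask
{
    void Install();
    void Uninstall();
}

// Hypothetical helper for ad hoc steps, handy for trying the runner out.
public class DelegateTask : IInstallerTask
{
    private readonly Action install;
    private readonly Action uninstall;

    public DelegateTask(Action install, Action uninstall)
    {
        this.install = install;
        this.uninstall = uninstall;
    }

    public void Install() { install(); }
    public void Uninstall() { uninstall(); }
}

public static class Installer
{
    // Runs every task in order. If one fails, the tasks that already
    // completed are uninstalled (newest first) and the error is rethrown.
    public static void ExecuteInstaller(IEnumerable<IInstallerTask> tasks)
    {
        var completed = new Stack<IInstallerTask>();
        try
        {
            foreach (var task in tasks)
            {
                task.Install();
                completed.Push(task);
            }
        }
        catch
        {
            while (completed.Count > 0)
                completed.Pop().Uninstall();
            throw;
        }
    }

    // Uninstalling is the same list of tasks, walked in reverse.
    public static void ExecuteUninstaller(IEnumerable<IInstallerTask> tasks)
    {
        foreach (var task in tasks.Reverse())
            task.Uninstall();
    }
}
```

Undoing the completed tasks on failure is about as much "transactional" behavior as this kind of installer needs, given that an upgrade is always uninstall plus install.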
Yes, it is extremely simple, and yes, it wouldn’t fit many scenarios. But it is quick to do, matches the current and projected requirements, doesn’t introduce any new technology to the mix, and it works.
Contrast that with having someone on the team who is the Installer expert (bad) or having to educate the entire team about installers (expensive).
NHProf new feature: Expensive queries report
It has been a while since we had a new major feature for the profiler, but here it is:
The expensive queries report will look at all your queries and surface the most expensive ones across all the sessions. This can give you a good indication of where you need to optimize things.
Naturally, this feature is available across all the profiler profiles (NHibernate Profiler, Entity Framework Profiler, Linq to SQL Profiler and Hibernate Profiler).
Sunday, July 11, 2010
#
Find the issue
There is a design issue that is revealed in the following tests. Can you figure out why I changed the behavior and removed the tests?

Thursday, July 01, 2010
#
NoSQL and Data Warehousing
I recently got this question on email, and I thought it would be a good subject for a post.
I wanted to get your thoughts about using NoSQL for data warehouse solutions. I have read mixed thoughts about this and curious where you stand.
Before we can talk about this, we need to understand what data warehousing is. Using the Wise Geek definition, that is:
Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database. Common accessing systems of data warehousing include queries, analysis and reporting. Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume, of course. The final result, however, is homogeneous data, which can be more easily manipulated.
And if you follow that definition, it makes absolute sense to ask about data warehousing in a NoSQL situation. But remember, one of the things that tends to lead people to NoSQL land is the desire to scale in some manner (more data, more users, higher concurrency, cheaper TCO) beyond what is possible using a SQL solution. In order to achieve that goal, you have to be willing to accept the tradeoff associated with it, which is reduced flexibility. You can query a relational database every which way, but most NoSQL solutions have very strict rules about how you can query them, for example.
By the way, I am probably abusing the term SQL here. I meant the whole set of technologies generally associated with relational databases, so in this case, I am talking about OLAP data stores, which are the typical solution for data warehousing scenarios. OLAP is usually queried with MDX, which looks like this:
SELECT
{ [Measures].[Sales Amount],
[Measures].[Tax Amount] } ON COLUMNS,
{ [Date].[Fiscal].[Fiscal Year].&[2002],
[Date].[Fiscal].[Fiscal Year].&[2003] } ON ROWS
FROM [Adventure Works]
WHERE ( [Sales Territory].[Southwest] )
OLAP & MDX, like the relational database & SQL, give us a lot of flexibility and power. But, as with relational databases, those come at a cost. At some point, if you have enough data, it gets impractical to store it all in a single server, and the usual arguments for NoSQL solutions come to the fore.
At that point, we have to decide what it is that we want to get from the data warehouse. In other words, we need to design our solution to match the kind of reports that we want to get out. Of the NoSQL solutions out there (Key/Value stores, Document Databases, Graph Databases, Column Family Databases) I would probably choose a Column Family database for such a task, since my primary concern is probably being able to handle large amounts of data.
The type of reports that I would need would dictate how I would store the data itself, but once I built the schema, everything else should just work.
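To make "the reports dictate the storage" a bit more concrete, here is a small in-memory sketch. SalesReportStore and its territory/year row key are assumptions made up for the example and don't correspond to any particular NoSQL product. The point is only that the row key is chosen to match the one report we need, so answering it is a single key lookup.

```csharp
using System.Collections.Generic;

public class SalesReportStore
{
    // Row key: "territory/fiscalYear". Column: measure name. Value: running total.
    private readonly Dictionary<string, Dictionary<string, decimal>> rows =
        new Dictionary<string, Dictionary<string, decimal>>();

    // Writes update the pre-aggregated totals for the row the sale belongs to.
    public void RecordSale(string territory, int fiscalYear, decimal salesAmount, decimal taxAmount)
    {
        var key = territory + "/" + fiscalYear;
        if (!rows.TryGetValue(key, out var columns))
            rows[key] = columns = new Dictionary<string, decimal>();
        Add(columns, "Sales Amount", salesAmount);
        Add(columns, "Tax Amount", taxAmount);
    }

    // The one query this layout supports cheaply: a measure for a territory and year.
    public decimal Get(string territory, int fiscalYear, string measure)
    {
        return rows.TryGetValue(territory + "/" + fiscalYear, out var columns) &&
               columns.TryGetValue(measure, out var total) ? total : 0m;
    }

    private static void Add(Dictionary<string, decimal> columns, string measure, decimal amount)
    {
        columns.TryGetValue(measure, out var current);
        columns[measure] = current + amount;
    }
}
```

Asking a different question, say sales per product, would require a different row key and a second copy of the data, which is exactly the reduced flexibility mentioned earlier.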
In short, for data warehousing, I think that the relational / OLAP world has significant advantages, mostly because in many BI scenarios, you want to allow the users to explore the data, which is easy with the SQL toolset, and harder with NoSQL solutions. But when you get too large (and large in OLAP scenarios is really large), you might want to consider limiting the users’ options and going with a NoSQL solution tailored to what they need.
Friday, July 02, 2010
#
Building Distributed Apps with NHibernate and Rhino Service Bus
The first part of a two-part article about NHibernate and Rhino Service Bus is now on MSDN.
Wednesday, June 30, 2010
#
The pitfalls of transparent security
A long time ago, I needed to implement a security subsystem for an application. I figured that it was best to make the entire security subsystem transparent to the developer, under the assumption that if the infrastructure would take care of security, it would make a lot more sense than relying on the developer to remember to add the security calls.
It took me a long while to realize how wrong that decision was. Security is certainly important, but security doesn’t apply to the system itself. In other words, while a specific user may not be allowed to read/write to the audit log, actions that the user made should be written to that log. That is probably the simplest case, but there are a lot of similar ones.
Ever since then, I favored using an explicit approach:
var books = session.CreateQuery("from Books")
.ThatUserIsAllowedToRead(CurrentUser)
.List<Book>();
This also helps you implement more interesting features, such as “on behalf of”. And yes, it does put the onus of security on the developer, but considering the alternative, that is a plus.
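To show what the explicit style looks like outside of NHibernate, here is a minimal sketch over plain in-memory collections. The User and Book classes and the category-based rule are assumptions made up for the example. The ThatUserIsAllowedToRead call above works on an NHibernate query, and its implementation is not shown in this post.

```csharp
using System.Collections.Generic;
using System.Linq;

public class User
{
    public string Name { get; set; }
    // Hypothetical permission model: the categories this user may read.
    public HashSet<string> ReadableCategories { get; } = new HashSet<string>();
}

public class Book
{
    public string Name { get; set; }
    public string Category { get; set; }
}

public static class SecurityExtensions
{
    // The filter is applied only where the caller asks for it, so system
    // code (audit logging, for example) can still see everything.
    public static IEnumerable<Book> ThatUserIsAllowedToRead(this IEnumerable<Book> books, User user)
    {
        return books.Where(book => user.ReadableCategories.Contains(book.Category));
    }
}
```

With this shape, "on behalf of" is just a matter of passing a different user to the same call.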
Tuesday, June 29, 2010
#
Challenge: Find the bug
The following code contains a bug that would only occur in rare situations. Can you figure out what the bug is?

Sunday, June 27, 2010
#
NHibernate: Streaming large result sets
Note: I have not been feeling very well for the past week or so, which is why I am posting so rarely.
NHibernate is meant to be used in an OLTP system. As such, it is usually used in cases where we want to load a relatively small amount of data from the database, work with it and save it back. For reporting scenarios, there are usually better alternatives (and before you ask, any reporting package will do. Right tool for the job, etc).
But there are cases where you want to use NHibernate in reporting scenarios nonetheless. Maybe because the reporting requirements aren’t enough to justify going to a separate tool, or because you want to use what you already know. It is in those cases where you tend to run into problems, because you violate the assumptions that were made while building NHibernate.
Let us imagine the following use case: we want to print a list of book names to the user:
using (ISession s = OpenSession())
{
var books = s.CreateQuery("from Book")
.List<Book>();
foreach (var book in books)
{
Console.WriteLine(book.Name);
}
}
There are several problems here:
- We query on a large result set without a limit clause.
- We read a lot of data into memory.
- We only start processing the data after it was completely read from the database.
What I would like to see is something like this:
while(dataReader.Read())
{
Console.WriteLine(dataReader.GetString(dataReader.GetOrdinal("Name")));
}
This still suffers from the problem of reading a large result set, but we will consider this a part of our requirements, so we’ll just have to live with it. The data reader code has two major advantages: it uses very little memory, and we can start processing the data as soon as the first row loads from the database.
How can we replicate that with NHibernate?
Well, as usual with NHibernate, it is only a matter of finding the right extension point. In this case, the List method on the query also has an overload that accepts an IList parameter.
That makes it as simple as writing our own IList implementation:
public class ActionableList<T> : IList
{
private Action<T> action;
public ActionableList(Action<T> action)
{
this.action = action;
}
public int Add(object value)
{
action((T)value);
return -1;
}
public bool Contains(object value)
{
throw new NotImplementedException();
}
// ...
}
And now we can call it:
using (ISession s = OpenSession())
{
var books = new ActionableList<Book>(book => Console.WriteLine(book.Name));
s.CreateQuery("from Book")
.List(books);
}
This will have the exact same effect as the previous NHibernate code, but it will start printing the results as soon as the first result loads from the database. We still have the problem of memory consumption, though. The session will keep track of all the loaded objects, and if we load a lot of data, it will eventually blow up with an out of memory exception.
Luckily, NHibernate has a ready-made solution for this, the stateless session. The code now looks like this:
using (IStatelessSession s = sessionFactory.OpenStatelessSession())
{
var books = new ActionableList<Book>(book => Console.WriteLine(book.Name));
s.CreateQuery("from Book")
.List(books);
}
The stateless session, unlike the normal NHibernate session, doesn’t keep track of loaded objects, so the code here and the data reader code are essentially the same thing.
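For completeness, the members elided by the `// ...` in ActionableList could be filled in along the following lines. This is a sketch under the assumption that only Add is needed for streaming. Tracking a Count and throwing NotSupportedException from the remaining members are choices made here, not something the original code shows.

```csharp
using System;
using System.Collections;

public class ActionableList<T> : IList
{
    private readonly Action<T> action;
    private int count;

    public ActionableList(Action<T> action)
    {
        this.action = action;
    }

    // The only member the streaming scenario relies on:
    // hand each item to the callback instead of storing it.
    public int Add(object value)
    {
        action((T)value);
        count++;
        return -1; // nothing is stored, so there is no meaningful index
    }

    public int Count { get { return count; } } // number of items seen so far
    public bool IsFixedSize { get { return false; } }
    public bool IsReadOnly { get { return false; } }
    public bool IsSynchronized { get { return false; } }
    public object SyncRoot { get { return this; } }

    public void Clear() { count = 0; }

    // Nothing is retained, so enumeration yields nothing.
    public IEnumerator GetEnumerator() { yield break; }

    // Everything below would require storing the items, which defeats the purpose.
    public bool Contains(object value) { throw new NotSupportedException(); }
    public int IndexOf(object value) { throw new NotSupportedException(); }
    public void Insert(int index, object value) { throw new NotSupportedException(); }
    public void Remove(object value) { throw new NotSupportedException(); }
    public void RemoveAt(int index) { throw new NotSupportedException(); }
    public void CopyTo(Array array, int index) { throw new NotSupportedException(); }

    public object this[int index]
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
}
```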
Wednesday, June 23, 2010
#
Challenge: Dynamically dynamic
Can you figure out a way to write the following code without using a try/catch?
class Program
{
static void Main(string[] args)
{
dynamic e = new ExpandoObject();
e.Name = "Ayende";
Console.WriteLine(HasProperty("Name", e));
Console.WriteLine(HasProperty("Id", e));
}
private static bool HasProperty(string name, IDynamicMetaObjectProvider dyn)
{
try
{
var callSite =
CallSite<Func<CallSite, object, object>>.Create(
Binder.GetMember(CSharpBinderFlags.None, name, typeof (Program),
new[]
{
CSharpArgumentInfo.Create(
CSharpArgumentInfoFlags.None, null)
}));
callSite.Target(callSite, dyn);
return true;
}
catch (RuntimeBinderException)
{
return false;
}
}
}
The HasProperty method should accept any IDynamicMetaObjectProvider implementation, not just ExpandoObject.
Modeling hierarchical structures in RavenDB
The question pops up frequently enough and is interesting enough for a post. How do you store a data structure like this in Raven?
The problem here is that we don’t have enough information about the problem to actually give an answer. That is because when we think of how we should model the data, we also need to consider how it is going to be accessed. In more precise terms, we need to define what is the aggregate root of the data in question.
Let us take the following two examples:
As you can imagine, a Person is an aggregate root. It can stand on its own. I would typically store a Person in Raven using one of two approaches:
Bare references:
{
  "Name": "Ayende",
  "Email": "Ayende@ayende.com",
  "Parent": "people/18",
  "Children": [
    "people/59",
    "people/29"
  ]
}
Denormalized references:
{
  "Name": "Ayende",
  "Email": "Ayende@ayende.com",
  "Parent": { "Name": "Oren", "Id": "people/18"},
  "Children": [
    { "Name": "Raven", "Id": "people/59"},
    { "Name": "Rhino", "Id": "people/29"}
  ]
}
The first option is bare references, just holding the id of the associated document. This is useful if I only need to reference the data very rarely. If, however (as is common), I also need to show some data from the associated documents, it is generally better to use denormalized references, which keep the data that we need from the associated document embedded inside the aggregate.
But the same approach wouldn’t work for Questions. In the Question model, we have utilized the same data structure to hold both the question and the answer. This sort of double utilization is pretty common, unfortunately. For example, you can see it being used in StackOverflow, where both Questions & Answers are stored as posts.
The problem from a design perspective is that in this case a Question is not an aggregate root in the same sense that a Person is. A Question is an aggregate root if it is an actual question, not if it is a Question instance that holds the answer to another question. I would model this using:
{
"Content": "How to model relations in RavenDB?",
"User": "users/1738",
"Answers" : [
{"Content": "You can use.. ", "User": "users/92" },
{"Content": "Or you might...", "User": "users/94" }
]
}
In this case, we are embedding the children directly inside the root document.
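On the C# side, the two shapes above could be represented by classes along these lines. The class names (PersonReference, Answer and so on) are assumptions for illustration, and how the Raven client maps them to documents is not shown here.

```csharp
using System.Collections.Generic;

// Denormalized reference: the id plus the few fields we want to show
// without loading the referenced document.
public class PersonReference
{
    public string Id { get; set; }
    public string Name { get; set; }
}

// Aggregate root: stands on its own and points at other Person documents.
public class Person
{
    public string Name { get; set; }
    public string Email { get; set; }
    public PersonReference Parent { get; set; }
    public List<PersonReference> Children { get; set; } = new List<PersonReference>();
}

// Not an aggregate root: an answer only exists inside its question.
public class Answer
{
    public string Content { get; set; }
    public string User { get; set; } // bare reference, e.g. "users/92"
}

// Aggregate root: the answers are embedded directly in the question document.
public class Question
{
    public string Content { get; set; }
    public string User { get; set; }
    public List<Answer> Answers { get; set; } = new List<Answer>();
}
```

Note that Answer has no id of its own, which is the whole point: you always get to it through its Question.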
So I am afraid that the answer to that question is: it depends.
The cost of money
This is just some rambling about the way the economy works, it has nothing to do with tech or programming. I just had to sit down recently and do the math, and I am pretty annoyed by it.
The best description of how the economy works that I ever heard was in a Terry Pratchett book. It is called Captain Vimes’ Boots Theory of Money. Stated simply, it goes like this.
A good pair of boots costs 50$, and they last for 10 years and keep your feet warm. A bad pair of boots costs 10$ and last only a year or two. After 10 years, the poor boots cost twice as much as the good boots, and your feet are still cold!
The sad part about that is that this theory is quite true. Let me outline two real world examples (from Israel, numbers are in Shekels).
Buying a car is expensive, so a lot of people opt for a leasing option. Here are the numbers associated with this (real world numbers):
| | Buying car outright | Leasing |
| Upfront payment | 120,000 | 42,094.31 |
| Monthly payment (36 payments) | 0 | 1,435.32 |
| Buying the car (after 3 yrs) [optional] | 0 | 52,039.67 |
The nice part of going with a leasing contract is that you need so much less upfront money (only about a third), and the payments are pretty low. The problem starts when you try to compare costs on more than just how much money you are paying out of pocket.
Let us see what is going to happen in three years’ time, when we want to switch to a new car.
| | Buying car outright | Leasing |
| Upfront payment | 120,000.00 | 42,094.31 |
| Total payments | 0.00 | 51,671.52 |
| Selling the car | -80,000.00 | 0.00 |
| Total cost | 40,000.00 | 93,765.83 |
With the upfront payment, we can actually sell the car to recoup some of our initial investment. With the leasing option, at the end of the three years, you are out 93,765.83 and have nothing to show for it.
Total cost of ownership for the leasing option is over twice as much as the upfront payment option.
Buying an apartment is usually one of the biggest expenses that most people make in their life. The cost of an apartment/house in Israel is typically over a decade of a person’s salary. Israel’s real estate is in a funky state at the moment, being one of the only places in the world where the prices keep going up. Here are some real numbers:
- Avg. salary in Israel: 8,611
- Avg. price of an apartment (in central Israel): 1,071,900
It isn’t surprising that most people require a mortgage to buy a place to live.
Let us say that we are talking about a 1,000,000 price, just to make the math simpler, and that we have 400,000 available for the down payment. Let us further say that we got a good interest rate of 2% on the 600,000 mortgage (if you borrow more than 60% of the money, you are penalized with a higher interest rate in Israel).
Assuming fixed interest rate and no inflation, you will need to pay 3,035 for 20 years. But a 2% interest rate looks pretty good, right? It sounds pretty low.
Except over 20 years, you’ll actually pay 728,400 back on your 600,000 loan, which means that the bank gets 128,400 more than it gave you.
The bank gets back 21.4% more money. With a more realistic 3% interest rate, you’ll pay back 33% more over the lifetime of the loan. And that is ignoring inflation. Assume a (pretty low) 2% inflation per year, and you would pay 49% more to the bank at a 2% interest rate and 65% more at a 3% interest rate.
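If you want to check the numbers yourself, the monthly payment comes from the standard fixed-rate annuity formula. The following is a small sketch of that calculation. The Mortgage class is made up for this example, and it assumes monthly compounding and ignores inflation, so it only covers the non-inflation figures above.

```csharp
using System;

public static class Mortgage
{
    // payment = P * r / (1 - (1 + r)^-n), where r is the monthly
    // interest rate and n is the number of monthly payments.
    public static double MonthlyPayment(double principal, double annualRate, int years)
    {
        double r = annualRate / 12;
        int n = years * 12;
        return principal * r / (1 - Math.Pow(1 + r, -n));
    }

    // Everything paid back to the bank over the lifetime of the loan.
    public static double TotalPaid(double principal, double annualRate, int years)
    {
        return MonthlyPayment(principal, annualRate, years) * years * 12;
    }
}
```

Calling Mortgage.MonthlyPayment(600000, 0.02, 20) gives roughly 3,035 per month, and Mortgage.TotalPaid at 3% comes to about a third more than the amount borrowed.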
Just for the fun factor, let us say that you rent, instead. And assume further that you rent for the same price of the monthly mortgage payment. We get:
| | Mortgage | Rent |
| Upfront payment | 400,000.00 | 0.00 |
| Monthly payment | 3,000.00 | 3,000.00 |
| Total payments (20 years) | 720,000.00 | 720,000.00 |
| Total money out | 1,120,000.00 | 720,000.00 |
| House value | 1,000,000.00 | 0.00 |
| Total cost | 120,000.00 | 720,000.00 |
After 20 years, renting costs 720,000. Buying a house costs 120,000. And yes, I am ignoring a lot of factors here, that is intentional. This isn’t a buy vs. rent column, it is a cost of money post.
But after spending all this time doing the numbers, it all comes back to Vimes’ Boots theory of money.
Saturday, June 19, 2010
#
Table scans, index scans and index seeks, oh my!
In general, when you break it down to the fundamentals, a database is just a fancy linked list + some btrees. Yes, I am ignoring a lot, but if you push aside a lot of the abstraction, that is what is actually going on.
If you ever dealt with database optimizations you are familiar with query plans, like the following (from NHProf):

You can see that we have some interesting stuff going on here:
And if you are unlucky, you are probably familiar with the dreaded “your !@&!@ query is causing a table scan!” scream from the DBA. But most of the time, people just know that a table scan is slow, an index scan is fast and an index seek is fastest. I am ignoring things like clustered vs. nonclustered indexes, since they aren’t really important for what I want to do.
For simplicity’s sake, I’ll use the following in-memory data structure:
public class DocumentDatabase
{
public List<JObject> Documents = new ...;
public IDictionary<string, IDictionary<JToken, JObject>> Indexes = new ...;
}
To keep things simple, we will only bother with the case of exact matching. For example, I might store the following document:
{ "User": "ayende", "Name": "Ayende Rahien", "Email": "Ayende@ayende.com" }
And define an index on Name & Email. What would happen if I wanted to make a query by the user name?
Well, we don’t really have any other option, we have to do what amounts to a full table scan:
foreach (var doc in Documents)
{
if(doc.Value<string>("User") == "ayende")
yield return doc;
}
As you can imagine, this is an O(N) operation, which can get pretty expensive if we are querying a large table.
What happen if I want to find data by name & email? We have an index that is perfectly suited for that, so we might as well use it:
Indexes["NameAndEmail"][new{Name="Ayende Rahien", Email = "Ayende@ayende.com"}];
Note that what we are doing here is accessing the NameAndEmail index, and then making a query on that. This is essentially an index seek.
What happens if I want to query just by email? There isn’t an index just for emails, but we do have an index that contains emails. We have two options, use a table scan, or an index scan. We already saw what a table scan is, so let us look at what an index scan is:
var nameAndEmailIndex = Indexes["NameAndEmail"];
foreach (var indexed in nameAndEmailIndex)
{
if(indexed.Email == "Ayende@ayende.com")
yield return indexed;
}
All in all, it is very similar to the table scan, and when using in memory data structures, it is probably not worth doing index scans (at least, not if the index is over the entire data set).
Where index scans prove to be very useful is when we are talking about persistent data sets, where reading the data from the index may be more efficient than reading it from the table. That is usually because the index is much smaller than the actual table. In certain databases, the format of the data on the disk may make it as efficient to do a table scan in some situations as it is to do an index scan.
Another thing to note is that while I am using hashes to demonstrate the principle, in practice, most persistent data stores are going to use some variant of trees.
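The snippets above are closer to pseudocode than to something you can compile, so here is a self-contained sketch of the same three access paths. It swaps JObject for a plain Doc class to avoid the JSON dependency, and TinyDatabase and its method names are made up for the example. Like the dictionary in the original structure, the index assumes that each (Name, Email) pair is unique.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Doc
{
    public string User { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
}

public class TinyDatabase
{
    private readonly List<Doc> documents = new List<Doc>();

    // Index on (Name, Email): the key is the pair, the value is the document.
    private readonly Dictionary<Tuple<string, string>, Doc> nameAndEmailIndex =
        new Dictionary<Tuple<string, string>, Doc>();

    public void Add(Doc doc)
    {
        documents.Add(doc);
        nameAndEmailIndex[Tuple.Create(doc.Name, doc.Email)] = doc;
    }

    // Table scan: no usable index, so look at every document. O(N).
    public IEnumerable<Doc> FindByUser(string user)
    {
        return documents.Where(doc => doc.User == user);
    }

    // Index seek: the query supplies the whole key, so this is a single lookup.
    public Doc FindByNameAndEmail(string name, string email)
    {
        Doc doc;
        return nameAndEmailIndex.TryGetValue(Tuple.Create(name, email), out doc) ? doc : null;
    }

    // Index scan: the index contains Email but is not keyed by it alone,
    // so walk every index entry and compare. Still O(N), but over the index.
    public IEnumerable<Doc> FindByEmail(string email)
    {
        return nameAndEmailIndex
            .Where(entry => entry.Key.Item2 == email)
            .Select(entry => entry.Value);
    }
}
```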
Building data store – indexing data structure
I ran into an interestingly annoying problem recently. Basically, I am trying to write the following code:
tree.Add(new { One = 1, Two = 2 }, 13);
tree.Add(new { One = 2, Two = 3 }, 14);
tree.Add(new { One = 3, Two = 1 }, 15);
var docPos = tree.Find(new { One = 1, Two = 2 });
Assert.Equal(13, docPos);
docPos = tree.Find(new { Two = 2 });
Assert.Equal(13, docPos);
docPos = tree.Find(new { Two = 1 });
Assert.Equal(15, docPos);
As you can imagine, this is part of an indexing approach, the details of which aren’t really important. What is important is that I am trying to figure out how to support partial searches. In the example, we index by One & Two, and we can search on both of them. The problem begins when we want to make a search on just Two.
While the tree can compare between partial results just fine, the problem is how to avoid traversing the entire tree for a partial result. The BTree is structured like this:
The problem when doing a partial search is that at the root, I have no idea if I should turn right or left.
What I am thinking now is that since I can’t do a binary search, I’ll have to use a B+Tree instead. Since a B+Tree also has the property that the leaf nodes are a linked list, it means that I can scan it effectively. I am hoping for a better option, though.
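To illustrate the difference between the two kinds of lookup, here is a rough sketch that uses a SortedList as a stand-in for the linked leaf level of a B+Tree. CompositeIndex and its methods are made up for the example and are not the tree from the code above. The full key can use the sort order, while a search on Two alone has to walk the leaves.

```csharp
using System;
using System.Collections.Generic;

public class CompositeIndex
{
    // Entries are kept sorted by (One, Two), like the leaves of a B+Tree.
    // Keys are assumed to be unique; SortedList.Add throws on duplicates.
    private readonly SortedList<Tuple<int, int>, long> leaves =
        new SortedList<Tuple<int, int>, long>();

    public void Add(int one, int two, long docPos)
    {
        leaves.Add(Tuple.Create(one, two), docPos);
    }

    // Full key: the sort order applies, so this is a binary search, O(log N).
    public long? Find(int one, int two)
    {
        long pos;
        return leaves.TryGetValue(Tuple.Create(one, two), out pos) ? pos : (long?)null;
    }

    // Partial key on Two only: the sort order doesn't help, so walk the
    // leaves in order and return the first match. O(N).
    public long? FindByTwo(int two)
    {
        foreach (var entry in leaves)
        {
            if (entry.Key.Item2 == two)
                return entry.Value;
        }
        return null;
    }
}
```

A partial search on One alone would not need the full scan, since One is the leading part of the sort order. It is only the trailing part of the key that forces us to look at every leaf.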
Any ideas?
Friday, June 18, 2010
#
Building data stores – Append Only
One of the interesting aspects in building a data store is that you run head on into things that you would generally leave to the infrastructure. By far, most developers deal with concurrency by relegating that responsibility to a database.
When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:
- Maximizing Concurrency – do readers wait for writers? do writers wait for readers? do writers wait for writers?
- Ensuring Consistency – can I read uncommitted data? can I read partially written data?
As I mentioned in my previous post, there are two major options when building a data store, Transaction Log & Append Only. There is probably a better name for each, but that is how I know them.
This post is going to focus on append only. An append only store is a very simple idea in both concept and implementation. It requires that you always append to the file. It makes things a bit finicky with the type of data structures that you have to use, since typical persistent data structures rely on being able to modify data on the disk. But once you get over that issue, it is actually very simple.
An append only store works in the following manner:
- On startup, the data is read in reverse, trying to find the last committed transaction.
- That committed transaction contains pointers to locations in the file where the actual data is stored.
- A crash in the middle of a write just means garbage at the end of the file that you have to skip when finding the last committed transaction.
- In memory, the only thing that you have to keep is just the last committed transaction.
- A reader with a copy of the last committed transaction can execute independently of any other reader / writer. It will not see any changes made by writers after it started, but it also doesn’t have to wait for any writers.
- Concurrency control is simple:
  - Readers don’t wait for readers
  - Readers don’t wait for writers
  - Writers don’t wait for readers
  - There can be only one concurrent writer
The last one is a natural fallout from the fact that we use the append only model. Only one thread can write to the end of the file at a given point in time. That is actually a performance boost, and not something that would slow the database down, as you might expect.
The reason for that is pretty obvious, once you start thinking about it. Writing to disk is a physical action, and the head can be only in a single place at any given point in time. By ensuring that all writes go to the end of the file, we gain a big perf advantage since we don’t do any seeks.
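To make the commit and recovery steps more tangible, here is a minimal sketch of an append only store over a Stream. The file layout (the data followed by a footer of offset, length and marker) and all the names are assumptions made up for the example. A real store would also want a checksum in the footer, and this sketch does not model concurrent readers or enforce the single writer rule.

```csharp
using System;
using System.IO;

public class AppendOnlyStore
{
    private const long CommitMarker = 0x434F4D4D49543031; // arbitrary magic value ("COMMIT01")
    private const int FooterSize = 8 + 4 + 8;             // offset + length + marker
    private readonly Stream file;

    public AppendOnlyStore(Stream file)
    {
        this.file = file;
    }

    // Appends the data, then a footer pointing back at it. The write only
    // counts as committed once the footer has been written completely.
    public void Commit(byte[] data)
    {
        file.Seek(0, SeekOrigin.End);
        long offset = file.Position;
        var writer = new BinaryWriter(file);
        writer.Write(data);
        writer.Write(offset);
        writer.Write(data.Length);
        writer.Write(CommitMarker);
        writer.Flush();
    }

    // Reads backwards from the end of the file, skipping any garbage left by
    // a crashed write, until it finds a footer consistent with its position.
    public byte[] ReadLastCommitted()
    {
        var reader = new BinaryReader(file);
        for (long pos = file.Length - FooterSize; pos >= 0; pos--)
        {
            file.Seek(pos, SeekOrigin.Begin);
            long offset = reader.ReadInt64();
            int length = reader.ReadInt32();
            long marker = reader.ReadInt64();
            if (marker != CommitMarker || length < 0 || offset < 0 || offset + length != pos)
                continue; // not a valid footer, keep looking further back
            file.Seek(offset, SeekOrigin.Begin);
            return reader.ReadBytes(length);
        }
        return null; // no committed transaction found
    }
}
```

Note that nothing is ever overwritten. A half-written transaction at the end of the file is simply ignored the next time we look for the last commit.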