time to read 2 min | 321 words

One of the goals that I set for myself with the NHibernate Profiler is to be able to run on unmodified NHibernate 2.0. The way that I do that is by intercepting and parsing the log stream from NHibernate.

NHibernate logging is extremely rich and detailed, so anything I have wanted to do so far has been possible. I am pretty sure there will come a time when a feature requires a more invasive approach, running profiler code in the client application to gather more information, but for now this is enough.

I did run into several problems with logging interception. Ideally, I want this to happen on the fly, as we go, so I really want to get the real time logging stream. The problem is how to do so. I started with the UdpAppender, but that doesn't work on Vista in the released version. The RemotingAppender is what I am using now, but it has one critical issue: it is an async appender, so messages can (and do) appear out of order.

The message order is pretty important to the profiler. It can deal with out-of-order messages, but that would lead to surprising results. So that one is out as well.

The only other appender that comes out of the box with log4net and can be used remotely is the telnet appender, which is next on the list to explore. It does mean that the profiler has to connect to the application, rather than the other way around, which can be a problem.

I built an appender that fits my needs, and I am using it now to test how the profiler works.
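To give an idea of what that involves, here is a minimal sketch of such an appender, built on log4net's AppenderSkeleton (this is hypothetical, not the actual profiler appender): it writes each event synchronously over a single TCP connection, so ordering is preserved.

using System.IO;
using System.Net.Sockets;
using log4net.Appender;
using log4net.Core;

// A sketch of a synchronous, ordered appender. Because Append writes
// inline, with no background queue, events arrive in the order logged.
public class SynchronousTcpAppender : AppenderSkeleton
{
    private TcpClient client;
    private StreamWriter writer;

    // Set from the log4net configuration.
    public string Host { get; set; }
    public int Port { get; set; }

    public override void ActivateOptions()
    {
        base.ActivateOptions();
        client = new TcpClient(Host, Port);
        writer = new StreamWriter(client.GetStream());
    }

    protected override void Append(LoggingEvent loggingEvent)
    {
        // RenderLoggingEvent applies the configured layout.
        writer.Write(RenderLoggingEvent(loggingEvent));
        writer.Flush();
    }

    protected override void OnClose()
    {
        if (writer != null)
            writer.Close();
        if (client != null)
            client.Close();
        base.OnClose();
    }
}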

But before starting to deal with the telnet appender, I thought it would be a good time to ask: how important is "running on unmodified NHibernate"?

I am not talking about having a profiler build of NHibernate, I am talking about doing things like using the profiler appender, or registering an HttpModule.

time to read 3 min | 542 words

Usually, "select" isn't broken is a good motto to follow. Occasionally, there are cases where this is case. In particular, it may not be that it is broken, it may very well be that the way it works doesn't match the things that we need it to do.

I spoke about an optimization story that happened recently, in which we managed to reduce the average time from 5 - 10 seconds to 5 - 15 milliseconds.

What we needed was to walk a tree structure, which was stored in a database, and do various interesting tree based operations on it. The most natural way of working with trees is with recursion, and SQL is just not the right way of dealing with it.

Deciding to load the entire table into memory, build a real tree structure, and perform all the operations on that tree structure has paid off tremendously. What is important to remember is that we didn't have to do anything radical to the data model or the way the application worked. We only had to modify the implementation of the component that exposed that tree to the application.
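The component boils down to something like the following sketch (the names are my own, not the actual code): load the flat table once, then wire up the parent/child references in memory.

using System.Collections.Generic;
using System.Linq;

// Hypothetical node shape, matching the description below: a few
// properties, a parent pointer and a set of children.
public class TreeNode
{
    public int Id;
    public int? ParentId;
    public string Name;
    public TreeNode Parent;
    public List<TreeNode> Children = new List<TreeNode>();
}

public static class TreeBuilder
{
    // One pass to index the rows, one pass to wire up the tree.
    public static TreeNode Build(IEnumerable<TreeNode> rows)
    {
        var byId = rows.ToDictionary(n => n.Id);
        TreeNode root = null;
        foreach (var node in byId.Values)
        {
            if (node.ParentId == null)
                root = node;
            else
            {
                node.Parent = byId[node.ParentId.Value];
                node.Parent.Children.Add(node);
            }
        }
        return root;
    }
}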

One of the things that we had to deal with was the case where the amount of data would exceed available memory. At least, we thought we had to deal with it.

But our tree was very simple; it consisted of a few properties and that is it. Let us do the math about this, shall we?

  • Name - 50 chars, unicode - 100 bytes
  • 4 decimal fields - 16 bytes each = 64 bytes
  • 3 boolean fields - 3 bytes
  • Parent pointer - 4 bytes
  • Set of children - average of 10 per node - 40 bytes + ~50 bytes bookkeeping

This is very rough, of course, but it will do. It puts the memory cost of a node at roughly 256 bytes. We will use that number, because it is easier to work with.

Now, with 256 bytes per node, how many can we reasonably use?

Well, 100 MB will hold 409,600 nodes or so, which is a pretty good number, I say. A table of that size is considered big by most people. A GB of memory will give us 4,194,304 items in the tree, and keep the traversal speed near instantaneous. At that point, I would start thinking about the size of the node, because 256 bytes is quite big. A more realistic size would be 64 bytes or so (drop the name, pack the decimals, use a linked list for children), which would give me 16,777,216 nodes for the same memory requirement.

All of those numbers are greater than the current and expected size of the data set, so there isn't a reason to care much beyond that.

The important thing here is to understand that the usual truth about "let the tool do the optimization" doesn't really hold true when you have specific scenarios. For solving very specific, very narrow circumstances, you can generally come up with a much better approach than the generic one.

Of course, this approach doesn't allow any generalization, and it doesn't have other benefits that using the common platform might have offered (we have to handle our own transactions, for example).

Keep that in mind.

time to read 4 min | 651 words

It is an extremely common issue and I have talked about it in the past quite a few times. I have learned a lot since then, however, and I want to show how you can create rich, complex querying support with very little effort.

We will start with the following model:

(image: the domain model being queried, with users, posts, tags, and comments)

And see how we can query it. We start by defining search filters, classes that look more or less like our domain. Here is a simple example:

public abstract class AbstractSearchFilter
{
	protected IList<Action<DetachedCriteria>> actions = new List<Action<DetachedCriteria>>();
	
	public void Apply(DetachedCriteria dc)
	{
		foreach(var action in actions)
		{
			action(dc);
		}
	}
}


public class PostSearchFilter : AbstractSearchFilter
{
	private string title;
	
	public string Title
	{
		get { return title; }
		set
		{
			title = value;
			actions.Add(dc => 
			{
				if(string.IsNullOrEmpty(title))
					return;
				
				dc.Add(Restrictions.Like("Title", title, MatchMode.Start));
			});
		}
	}
}

public class UserSearchFilter : AbstractSearchFilter
{
	private string username;
	private PostSearchFilter post;
	
	public string Username
	{
		get { return username; }
		set
		{
			username = value;
			actions.Add(dc =>
			{
				if(string.IsNullOrEmpty(username))
					return;
			
				dc.Add(Restrictions.Like("Username", username, MatchMode.Start));
			});
		}
	}
	
	public PostSearchFilter Post
	{
		get { return post; }
		set
		{
			post = value;
			actions.Add(dc=>
			{
				if(post==null)
					return;
				
				var postDC = dc.Path("Posts"); // Path is an extension method for GetCriteriaByPath(name) ?? CreateCriteria(path)
				post.Apply(postDC);
			});
		}
	}
}

Now that we have the code in front of us, let us talk about it. The main idea here is that we move the responsibility of deciding what to query into the hands of the client, which can make decisions just by setting our properties. Not only that, but we support rich domain queries using this approach. Notice what we are doing in UserSearchFilter.Post.set: we create a sub-criteria and pass it to the post search filter, which applies itself to it. Using this method, we completely abstract away the need to deal with our current position in the tree. We can query on posts directly, through users, through comments, etc. We don't care; we just run in the provided context and apply our conditions to it.

Let us take the example of wanting to search for all the users who post about NHibernate. I can express this as:

usersRepository.FindAll(
  new UserSearchFilter
  {
    Post = new PostSearchFilter
        {
            Title = "NHibernate"
        }
  }
);
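Not much is needed on the repository side, either. Here is a minimal sketch of what FindAll might look like (the repository and its session field are my assumptions; the post doesn't show them):

public IList<User> FindAll(UserSearchFilter filter)
{
	// The filter replays the actions recorded by its setters
	// onto the criteria, shaping the query.
	var criteria = DetachedCriteria.For<User>();
	filter.Apply(criteria);
	return criteria.GetExecutableCriteria(session) // session: an open ISession
		.List<User>();
}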

But that is only useful for static scenarios, and in those cases, it is easier to just write the query using the facilities NHibernate already gives us. Where does it shine?

There is a really good reason that I chose this design for the query mechanism. JSON.

I can ask the JSON serializer to deserialize a JSON string into this object graph. Along the way, it will do all the property setting (and query building) that I need. On the client side, I just need to build the JSON string (an easy task, I think you would agree) and send it to the server. On the server side, I just need to build the filter classes (another very easy task). Done, I have a very rich, very complex, very complete solution.

Just to give you an idea, assuming that I had fully fleshed out the filters above, here is how I search for users named 'ayende', who posted about 'nhibernate' with the tag 'amazing' and have a comment saying 'help':

{ // root is user, in this case
	Name: 'ayende',
	Post:
	{
		Title: 'NHibernate',
		Tag:
		{
			Name: ['amazing']
		},
		Comment:
		{
			Comment: 'Help'
		}
	}
}
Deserializing that into our filter object graph gives us immediate results that we can pass to the repository and query with exactly zero hard work.
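The server side wiring is then a couple of lines. A sketch, assuming Json.NET as the serializer (the post doesn't name one):

using Newtonsoft.Json;

// Deserializing runs the property setters, which record the query
// building actions; the repository then replays them onto the criteria.
var filter = JsonConvert.DeserializeObject<UserSearchFilter>(jsonFromClient);
var users = usersRepository.FindAll(filter);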
 

A bug story

time to read 7 min | 1301 words

I ran into a bug today in the way NHibernate dealt with order clauses. In particular, it can only happen if you are:

  • Using parameters in the order clause
  • Using SQL Server 2005
  • Using a limit clause

If you meet all three conditions, you run into a whole host of problems (in particular, NH-1527 and NH-1528). They are all fixed now, and I am writing this post as the build runs. The underlying issue is that the SQL Server 2005 syntax for paging is broken, badly.

Let us take this statement:

SELECT   THIS_.ID         AS ID0_0_,
         THIS_.AREA       AS AREA0_0_,
         THIS_.PARENT     AS PARENT0_0_,
         THIS_.PARENTAREA AS PARENTAREA0_0_,
         THIS_.TYPE       AS TYPE0_0_,
         THIS_.NAME       AS NAME0_0_
FROM     TREENODE THIS_
WHERE    THIS_.NAME LIKE ?
         AND THIS_.ID > ?
ORDER BY (SELECT THIS_0_.TYPE AS Y0_
          FROM   TREENODE THIS_0_
          WHERE  THIS_0_.TYPE = ?) ASC

And let us say that we want to get a paged view of the data. How can we do it? Here is the code:

SELECT   TOP 1000 ID0_0_,
                  AREA0_0_,
                  PARENT0_0_,
                  PARENTAREA0_0_,
                  TYPE0_0_,
                  NAME0_0_
FROM     (SELECT ROW_NUMBER()
                   OVER(ORDER BY __HIBERNATE_SORT_EXPR_0__) AS ROW,
                 QUERY.ID0_0_,
                 QUERY.AREA0_0_,
                 QUERY.PARENT0_0_,
                 QUERY.PARENTAREA0_0_,
                 QUERY.TYPE0_0_,
                 QUERY.NAME0_0_,
                 QUERY.__HIBERNATE_SORT_EXPR_0__
          FROM   (SELECT THIS_.ID         AS ID0_0_,
                         THIS_.AREA       AS AREA0_0_,
                         THIS_.PARENT     AS PARENT0_0_,
                         THIS_.PARENTAREA AS PARENTAREA0_0_,
                         THIS_.TYPE       AS TYPE0_0_,
                         THIS_.NAME       AS NAME0_0_,
                         (SELECT THIS_0_.TYPE AS Y0_
                          FROM   TREENODE THIS_0_
                          WHERE  THIS_0_.TYPE = ?) AS __HIBERNATE_SORT_EXPR_0__
                  FROM   TREENODE THIS_
                  WHERE  THIS_.NAME LIKE ?
                         AND THIS_.ID > ?) QUERY) PAGE
WHERE    PAGE.ROW > 10
ORDER BY __HIBERNATE_SORT_EXPR_0__

Yes, in this case, we could use TOP 1000 as well, but that doesn't work if we want a page of data that doesn't start at the beginning of the data set.

Now, here is an important fact: the question marks that you see are positional parameters. Do you see the bug now? In the original statement, the ORDER BY parameter was bound last; in the rewritten statement, it is emitted first, so the values end up bound to the wrong placeholders.

SQL Server 2005 (and 2008) paging support is broken. I find it hard to believe that a feature that is just a tad less important than SELECT is so broken. Every other database gets it right, for crying out loud.

Anyway, by now you have noticed that when we processed the statement to add the limit clause, we rewrote the structure of the statement and changed the order of the parameters. Tracking that problem down was a pain. Just to give you an idea, here is a bit of the change that I had to make:

/// <summary>
/// We need to know what the position of the parameter was in a query
/// before we rearranged the query.
/// This is used only by dialects that rearrange the query. Unfortunately,
/// the MS SQL 2005 dialect has to reshuffle the query (and ruin positional
/// parameter support) because the SQL 2005 and 2008 dialects have completely
/// broken support for paging, which is just a tad less important than SELECT.
/// See NH-1528
/// </summary>
public int? OriginalPositionInQuery;

I fixed the issue, but it is an annoying problem that keeps occurring. Paging in SQL Server 2005/8 is broken!

Oh, and just to clarify some things: the ability to use complex expressions in the order by clause using the projection API is fairly new in NHibernate; it is incredibly powerful, and it really scares me.

time to read 3 min | 597 words

I left work today very happy. There was a piece in the UI that was taking too long when run with a real world data set. How slow? Let us call it 40 seconds to start with. This is a pretty common operation in the UI, so it was a good place to optimize.

I wasn't there for that part, but optimizing the algorithms used reduced the time from 40 seconds to 5 - 10 seconds, an impressive improvement by all accounts, but still one in which the users had to wait an appreciable amount of time for a common UI operation. Today we decided to tackle this issue and see if we could optimize it further.

The root action is loading some data and executing a bit of business logic on top of that data. I checked the queries being generated, and while they weren't ideal, they weren't really bad (just not the way I would do things). At that point, we decided to isolate the issue in a test page, which would allow us to test just this function in isolation. Then we implemented it from scratch, as a plain data loading process.

The performance for that was simply amazing: 150 - 300 ms per operation, vs. 5 - 10 seconds in the optimized scenario. Obviously, however, we were comparing apples to oranges here. The real process also did a fair amount of business logic (and related data loading), which was the reason it was slow. I looked at the requirement again, then at the queries, and despaired.

I hoped that I would be able to use a clever indexing scheme and get the 1000% perf benefit using some form of SQL. But the requirement simply cannot be expressed in SQL. And trying to duplicate the existing logic would only put us in the same position as before.

What to do... what to do...

The solution was quite simple: take the database out of the loop. For a performance critical piece of the application, we really can't afford to rely on an external service (and the DB is considered one, in this scenario). I spent some time loading the data at application startup, as well as doing up front work on the data set to make it easier to work with.
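A minimal sketch of the shape this takes (the names are hypothetical, reusing the TreeNode shape from the earlier sketch): build the lookup tables once at startup, so the hot path is just a few dictionary probes.

using System.Collections.Generic;
using System.Linq;

// Hypothetical cache, built once at application startup.
public class NodeCache
{
    private readonly Dictionary<int, TreeNode> nodesById;
    private readonly Dictionary<int, List<TreeNode>> childrenByParent;

    public NodeCache(IEnumerable<TreeNode> allRows)
    {
        // All the expensive work happens up front, exactly once.
        nodesById = allRows.ToDictionary(n => n.Id);
        childrenByParent = nodesById.Values
            .Where(n => n.ParentId != null)
            .GroupBy(n => n.ParentId.Value)
            .ToDictionary(g => g.Key, g => g.ToList());
    }

    // The hot path: a hash table lookup, no database round trip.
    public TreeNode GetNode(int id)
    {
        TreeNode node;
        nodesById.TryGetValue(id, out node);
        return node;
    }

    public IEnumerable<TreeNode> GetChildren(int parentId)
    {
        List<TreeNode> children;
        if (childrenByParent.TryGetValue(parentId, out children))
            return children;
        return new List<TreeNode>();
    }
}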

This turned that operation into an O(1) operation, where the operation itself consists of a small set of in-memory hash table lookups. And the performance? The performance story goes like this:

I went into the manager's office and asked him how fast he wanted this piece of functionality to run. He hesitated for a moment and then said: "A second?"
I shook my head, "I can't do that, can you try again?"
"Two seconds?" he asked.
"I am sorry", I replied, "I can do five."
Then I left the office and threw over my shoulder, "oh, but it is in milliseconds".
Sometimes I have a rotten sense of humor, but the stunned silence that followed that declaration was very pleasing.

I am lucky in that the data set is small enough to fit in memory. But I am not going to rely on that; we need to implement soft paging of the data anyway (to make the application startup time acceptable), so it will be able to handle things easily enough even when the data set we are talking about grows beyond the limits of memory (which I don't expect to happen in the next couple of years).

Overall, it was a very impressive optimization, even if I say so myself.

time to read 2 min | 226 words

Note: I am explicitly not asking if this is optimal. I am asking if it is good enough.

There is a tendency to assume that 'it works', and to let sleeping dragons be. This is usually correct; my own definition of legacy code is "code that makes money". As such, any modifications to it should be justified in terms of ROI.

The term that I often use for that is technical debt, by no means my own invention, but a very useful concept. It allows me to explain, in terms that make sense to the client, the implications of leaving a working, but not good enough, implementation in place. Or why I need to take a week or two with a couple of developers to refactor parts of the application.

We like to think about refactoring as changing the internal structure of the code without changing observable behavior. Business people tend to think about it differently: a time in which the development team is going to do stuff that doesn't give them any value. The ability to translate the difficulty into terms that the business understands is important. And framing such discussions in terms of the technical debt into which they will get us is critical.

Setting expectations about the behavior of the team is just as important as setting expectations about the behavior of the application.

time to read 2 min | 353 words

I just finished writing the final test for the basic functionality that I want for NHibernate Profiler:

[Test]
public void SelectBlogById()
{
    ExecuteScenarioInDifferentProcess<SelectBlogByIdUsingCriteria>();
    StatementModel selectBlogById = observer.Model.Sessions.First()
        .Statements.First();
    const string expected = @"SELECT this_.Id as Id3_0_,
this_.Title as Title3_0_,
this_.Subtitle as Subtitle3_0_,
this_.AllowsComments as AllowsCo4_3_0_,
this_.CreatedAt as CreatedAt3_0_
FROM Blogs this_
WHERE this_.Id = @p0

";
    Assert.AreEqual(expected, selectBlogById.Text);
}

I actually had to invest some thought in the architecture of testing this. This little test has a whole set of ideas behind it, which I'll talk about at a later date. Suffice to say that this test creates a new process and starts listening to the interesting things that are going on there (populating the observer model with data).

Another interesting tidbit is that the output is formatted for readability. By default, NHibernate's SQL output looks something like this:

SELECT this_.Id as Id3_0_, this_.Title as Title3_0_, this_.Subtitle as Subtitle3_0_, this_.AllowsComments as AllowsCo4_3_0_, this_.CreatedAt as CreatedAt3_0_ FROM Blogs this_ WHERE this_.Id = @p0

This is pretty hard to read the moment that you have any sort of complex conditions.

API Design

time to read 2 min | 216 words

There are several important concerns that need to be taken into account when designing an API. Clarity is an important concern, of course, but the responsibilities of the users and implementers of the API should also be given a lot of consideration. Let us take a look at a couple of designs for a simple notification observer. We need to observe a set of actions (with context). I don't want to force mutable state on the users, so I started with this approach (using out parameters instead of return values in order to name the parameter):

public interface INotificationObserver
{
    void OnNewSession(out object sessionTag);
    void OnNewStatement(object sessionTag, StatementInformation statementInformation, out object statementTag);
    void OnNewAction(object statementTag, ActionInformation actionInformation);
}

I don't really like this; there are too many magic objects here, and too much work for the client. We can do it in a slightly different way, however:

public delegate void OnNewAction(ActionInformation actionInformation);

public delegate void OnNewStatement(StatementInformation statementInformation, out OnNewAction onNewAction);

public interface INotificationObserver
{
    void OnNewSession(out OnNewStatement onNewStatement);
}
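To see why I prefer the second design, here is a hypothetical implementer (my sketch, not from the post). Each closure captures its own context, so the tag objects simply disappear:

using System;

public class ConsoleObserver : INotificationObserver
{
    private int sessionCount;

    public void OnNewSession(out OnNewStatement onNewStatement)
    {
        // The captured local replaces the sessionTag magic object.
        int sessionId = ++sessionCount;
        int statementCount = 0;
        onNewStatement = (StatementInformation statement, out OnNewAction onNewAction) =>
        {
            int statementId = ++statementCount;
            Console.WriteLine("Session {0}, statement {1}: {2}",
                sessionId, statementId, statement);
            // This closure, in turn, carries the statement context.
            onNewAction = actionInformation =>
                Console.WriteLine("Session {0}, statement {1}: action {2}",
                    sessionId, statementId, actionInformation);
        };
    }
}

The implementer never sees an opaque object it has to hand back; the context lives in the closures, and the compiler does the bookkeeping.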
