Ayende @ Rahien

Refunds available at head office

Find the differences: The optimization that changed behavior

I was thinking about ways of optimizing NHibernate Search’ behavior, and I run into a bug in my proposed solution. It is an interesting one, so I thought that would be make a good post.

Right now NHibernate Search behavior is similar to this:

public IList<T> Search<T>(string query)
{
var results = new List<T>();
foreach(var idAndClass in DoLuceneSearch(query))
{
var result = (T)session.Get(idAndClass.ClassName, idAndClass.Id);
if(result != null)
results.Add(result);
}
return results;
}

This isn’t the actual code, but it shows how NHibernate works. It also shows the problem that I thought about fixing. The way it is implemented now, NHibernate Search will create a SELECT N+1 query.

Now, the optimization is simply:

public IList<T> Search<T>(string query)
{
return session
.CreateCriteria<T>()
.Add(Restrictions.In("id", DoLuceneSearch(query).Select(x=>x.Id)))
.List();
}

There are at least two major differences between the behavior of the two versions, can you find them?

Comments

configurator
09/16/2009 04:34 AM by
configurator

Is it possible that DoLuceneSearch can return the same row multiple times?

Also, this would do this in a different sort order if I'm not mistaken. (That is, if sort order in even an issue at that level which I assume it is)

Note: I don't use NHibernate, and I have no idea what Lucene is, so this is nothing more than an educated guess.

Dmitriy Nagirnyak
09/16/2009 04:49 AM by
Dmitriy Nagirnyak

I think 3 major differences are:

  1. The 2nd case executes (in theory) 2 SQL queries. One to return Class+Id, the 2nd to return the actual list (probably using SELECT... FROM... WHERE id in (...). With all the pros/cons.

  2. The 2nd case ignores the NH 2nd level cache.

  3. The 2nd approach ignores the class thus you cannot use search with inheritance.

Dmitry
09/16/2009 05:00 AM by
Dmitry

The query in the second example would always hit the database but it would only be done once. You might hit a 2100 parameter limit if it uses the IN clause.

The first example can take advantage of the identity map.

Markus Zywitza
09/16/2009 05:29 AM by
Markus Zywitza

1) Original code loads proxies, the optimized code loads the full objects.

2) The optimized code loads only T while the original code loads T and subclasses of T

Bunter
09/16/2009 05:48 AM by
Bunter

Will the second one fail if no rows are returned by lucenesearch?

Johannes Gustafsson
09/16/2009 06:01 AM by
Johannes Gustafsson

The first query uses idAndClass.className to get the entity. If it is possible for idAndClass.className to be something else than T, then the second query could return an entirely different entity.

On the other hand, the first query would throw in this case.

Mogens Heller Grabe
09/16/2009 06:04 AM by
Mogens Heller Grabe

One difference is that the first version will take advantage of the 1st level cache and the 2nd level entity cache if it is enabled.

The second version will always go to the database.

Ayende Rahien
09/16/2009 09:00 AM by
Ayende Rahien

Configurator,

Sort order is one such problem, yes.

Ayende Rahien
09/16/2009 09:02 AM by
Ayende Rahien

Johannes,

Yep, that is a big change in behavior.

Ayende Rahien
09/16/2009 09:11 AM by
Ayende Rahien

Dmitriy,

1 isn't true, the DoLuceneQuery doesn't hit the DB.

Howard Pinsley
09/16/2009 03:34 PM by
Howard Pinsley

Expanding on what Johannes mentioned, it seems like you can actually get an non-matching item of type T that exits with an id that was returned by the Lucene search for an entirely different class?

Johannes Gustafsson
09/17/2009 06:51 AM by
Johannes Gustafsson

I Guess one solution could be some kind of this?

public IList <t Search <t(string query)

{

return session

    .CreateCriteria

<t()

    .Add(Restrictions.In("id", DoLuceneSearch(query).Where(x => typeof(T).IsAssignableFrom(Assembly.GetExecutingAssembly().GetType(x.className))).Select(x=>x.Id)))

    .List();

}

Chris Smith
09/17/2009 10:54 AM by
Chris Smith

I can see a big stinking NullReferenceException about to happen though :)

gunteman
09/20/2009 12:48 PM by
gunteman

It's important to at least have the option to use the IN query version, especially if the query should be decorated from other sources, such as Rhino.Security

Pawel Lesnikowski
09/21/2009 06:20 AM by
Pawel Lesnikowski

As far as I remember IN query has limits at least in Oracle database.

Comments have been closed on this topic.