Find the differences: The optimization that changed behavior
I was thinking about ways of optimizing NHibernate Search's behavior, and I ran into a bug in my proposed solution. It is an interesting one, so I thought it would make a good post.
Right now, NHibernate Search's behavior is similar to this:
public IList<T> Search<T>(string query)
{
    var results = new List<T>();
    foreach (var idAndClass in DoLuceneSearch(query))
    {
        var result = (T)session.Get(idAndClass.ClassName, idAndClass.Id);
        if (result != null)
            results.Add(result);
    }
    return results;
}
This isn’t the actual code, but it shows how NHibernate Search works. It also shows the problem that I thought about fixing. The way it is implemented now, NHibernate Search causes a SELECT N+1 problem: one Get call, and potentially one query, per search hit.
Now, the optimization is simply:
public IList<T> Search<T>(string query)
{
    return session
        .CreateCriteria<T>()
        .Add(Restrictions.In("id", DoLuceneSearch(query).Select(x => x.Id).ToArray()))
        .List<T>();
}
There are at least two major differences between the behavior of the two versions. Can you find them?
Comments
Is it possible that DoLuceneSearch can return the same row multiple times?
Also, this would return the results in a different sort order, if I'm not mistaken. (That is, if sort order is even an issue at that level, which I assume it is.)
Note: I don't use NHibernate, and I have no idea what Lucene is, so this is nothing more than an educated guess.
I think 3 major differences are:
The 2nd case executes (in theory) 2 SQL queries: one to return Class+Id, and a second to return the actual list (probably using SELECT... FROM... WHERE id IN (...)), with all the pros and cons.
The 2nd case ignores the NH 2nd level cache.
The 2nd approach ignores the class, so you cannot use search with inheritance.
The query in the second example would always hit the database but it would only be done once. You might hit a 2100 parameter limit if it uses the IN clause.
The first example can take advantage of the identity map.
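To make the parameter-limit concern concrete, here is a minimal sketch of one possible workaround: splitting the ids into batches and issuing one IN query per batch. The `IdBatching` class and the `InBatches` helper are hypothetical names for illustration only and are not part of NHibernate or NHibernate Search. The right batch size depends on the database.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of NHibernate or NHibernate Search.
public static class IdBatching
{
    // Splits ids into lists of at most batchSize items, preserving order.
    // Because this is an iterator, the argument check runs on first enumeration.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> ids, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (var id in ids)
        {
            batch.Add(id);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        // Emit the final, partially filled batch, if any.
        if (batch.Count > 0)
            yield return batch;
    }
}
```

Each batch could then be passed to its own Restrictions.In query. That trades the single round trip for a handful of them, which is still far fewer than one per search hit.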
1) Original code loads proxies, the optimized code loads the full objects.
2) The optimized code loads only T, while the original code loads T and subclasses of T.
Will the second one fail if no rows are returned by the Lucene search?
The first query uses idAndClass.ClassName to get the entity. If it is possible for idAndClass.ClassName to be something other than T, then the second query could return an entirely different entity.
On the other hand, the first query would throw in this case.
One difference is that the first version will take advantage of the 1st level cache and the 2nd level entity cache if it is enabled.
The second version will always go to the database.
Configurator,
Sort order is one such problem, yes.
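To illustrate the sort-order difference: an IN query returns rows in whatever order the database chooses, so the ranking from Lucene is lost. A minimal sketch of one way to restore it, assuming the hit ids are still available in ranked order, is to re-sort the loaded entities in memory. `SearchOrdering` and `InHitOrder` are hypothetical names for illustration and are not part of NHibernate Search.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of NHibernate or NHibernate Search.
public static class SearchOrdering
{
    // Returns the loaded entities in the order of hitIds.
    // Ids with no matching entity are skipped, mirroring the null check
    // in the original loop. Assumes each loaded entity has a unique id.
    public static List<TEntity> InHitOrder<TEntity, TId>(
        IEnumerable<TId> hitIds,
        IEnumerable<TEntity> loaded,
        Func<TEntity, TId> idOf)
    {
        var byId = loaded.ToDictionary(idOf);
        var ordered = new List<TEntity>();
        foreach (var id in hitIds)
        {
            if (byId.TryGetValue(id, out var entity))
                ordered.Add(entity);
        }
        return ordered;
    }
}
```

This only addresses ordering. It does nothing about the cache or class-name differences raised in the other comments.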
Johannes,
Yep, that is a big change in behavior.
Mogens,
Yep :-)
Dmitriy,
1 isn't true; DoLuceneSearch doesn't hit the DB.
Expanding on what Johannes mentioned, it seems like you can actually get a non-matching item of type T that exists with an id that was returned by the Lucene search for an entirely different class?
I guess one solution could be something like this?
public IList<T> Search<T>(string query)
{
<t()
}
I can see a big stinking NullReferenceException about to happen though :)
Is this still true? I applied a patch (see http://nhjira.koah.net/browse/NHSR-17) that addressed at least one use case of this.
It's important to at least have the option to use the IN query version, especially if the query should be decorated from other sources, such as Rhino.Security.
As far as I remember, IN queries have limits, at least in Oracle.