Ayende @ Rahien

Full Text Search takes you only so far

A few weeks ago I had a really interesting engagement with a customer. They were using RavenDB to do some interesting searches, and eventually they hit a wall with what they were trying to do.

For simplicity's sake, let's say that the customer wanted to allow users to search for books. The scenario is something like this (a totally different domain, obviously, and the client isn’t Amazon; they are just a good place to get the screenshot from):

[Screenshot: Amazon book search box with its suggest dropdown]

Sure, the suggest feature is really nice, but what the customer really cared about is being able to search on the whole set of options.

In their field, people usually write the book name using one of the following formats:

  • Author First Name, Author Last Name – Book Title, Year
  • Year, Book Title, Author Last Name
  • Author Last Name, Book Title, Year

And a bunch of other options.

Also, they want to offer a free text search option.

Also, it had to be fast. They already had an existing system that worked, but had unacceptably high latency for most queries and had… issues under load. The first approach they tried was just moving to RavenDB, enabling full text search and seeing what it got them. It got them something, but not nearly enough.

When I started looking at the problem, I had several recommendations, none of which had much to do with full text search. They were mostly about being smarter in understanding the user.

To start with, given that most of the information was in one of a small number of formats, there was really no reason not to build a parser for that information. When you actually know what fields you are looking for, you can provide much better information for the user than if you are just doing a brute force full text search.

So, instead of issuing a query like this:

RavenSession.Query<Books_FullText.Result, Books_FullText>()
   .Search(x=> x.Result, searchTermFromUser)
   .ToList();

Which can work, but can’t really take advantage of your knowledge of the domain and the users, you will do something like this:

var parseResult = new BooksQueryParser(Context).Parse(searchTermFromUser);
if (parseResult.Success)
{
  var q = RavenSession.Query<Books_FullText.Result, Books_FullText>();
  // ApplyOn hands back the narrowed query, because Search() returns a new
  // queryable rather than changing q in place. It would do things like:
  //   q = q.Search(x => x.Title, "the lost fleet");
  //   q = q.Search(x => x.Author, "jack campbell");
  q = parseResult.ApplyOn(q);
  return q.ToList();
}
else // fall back, do a full text search, because there isn't anything else to do
{
  return RavenSession.Query<Books_FullText.Result, Books_FullText>()
   .Search(x => x.Result, searchTermFromUser)
   .ToList();
}

RavenDB can’t do that for you. It can provide awesome full text support, but if you guide it in this manner, it will be tremendously more helpful.

The next stage is to actually learn from your users. Whenever your users make a search, you are going to record it. In fact, you are going to track the entire interaction. It will end up looking something like this:
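To make the parsing idea concrete, here is a minimal sketch of how the three formats listed earlier might be recognized with regular expressions. This is not the customer's parser: `BookQueryFormats` and `BookQueryParseResult` are hypothetical names, the patterns assume a four-digit year and comma-separated parts, and the `ApplyOn` step from the snippet above is left out so the sketch has no RavenDB dependency.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical result type; the post only shows Success and ApplyOn.
public sealed class BookQueryParseResult
{
    public bool Success { get; init; }
    public string? AuthorFirstName { get; init; }
    public string? AuthorLastName { get; init; }
    public string? Title { get; init; }
    public int? Year { get; init; }

    public static readonly BookQueryParseResult Failed = new() { Success = false };
}

public static class BookQueryFormats
{
    private const RegexOptions Options = RegexOptions.CultureInvariant;

    // Tried in order; the first pattern that matches wins.
    private static readonly Regex[] Formats =
    {
        // Author First Name, Author Last Name – Book Title, Year
        new Regex(@"^\s*(?<first>[^,]+?)\s*,\s*(?<last>[^,]+?)\s+[–-]\s+(?<title>.+?)\s*,\s*(?<year>[0-9]{4})\s*$", Options),
        // Year, Book Title, Author Last Name
        new Regex(@"^\s*(?<year>[0-9]{4})\s*,\s*(?<title>.+?)\s*,\s*(?<last>[^,]+?)\s*$", Options),
        // Author Last Name, Book Title, Year
        new Regex(@"^\s*(?<last>[^,]+?)\s*,\s*(?<title>.+?)\s*,\s*(?<year>[0-9]{4})\s*$", Options),
    };

    public static BookQueryParseResult Parse(string searchTerm)
    {
        if (string.IsNullOrWhiteSpace(searchTerm))
            return BookQueryParseResult.Failed;

        foreach (var format in Formats)
        {
            var match = format.Match(searchTerm);
            if (!match.Success)
                continue;

            return new BookQueryParseResult
            {
                Success = true,
                // Only the first format captures a first name.
                AuthorFirstName = match.Groups["first"].Success ? match.Groups["first"].Value : null,
                AuthorLastName = match.Groups["last"].Value,
                Title = match.Groups["title"].Value,
                Year = int.Parse(match.Groups["year"].Value, CultureInfo.InvariantCulture),
            };
        }

        // Nothing recognized: the caller falls back to plain full text search.
        return BookQueryParseResult.Failed;
    }
}
```

For example, `BookQueryFormats.Parse("Campbell, The Lost Fleet, 2006")` (the year is only illustrative) would match the third pattern, while a free-form string such as `"the lost fleet"` returns `Failed` and goes down the full text path. Real input is messier than this, so a production parser would likely need more patterns and some scoring between them.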

{ // searchInteractions/4833424
  "User": "users/93432",
  "Terms": [
    "the last feetl",
    "the lost fast",
    "the lost fleet"
  ],
  "FollowedTo": "books/40273498723"
}

In this case, the sample data shows typos, but in the customer scenario, those would be the user trying different ways to format the actual valid search, to find something that the system recognizes.

What is important is that if you can’t find a search result with high enough ranking (for example, if you failed to parse the search terms), you can now do several fairly intelligent things.

You can search for similar searches made by other users. There is a high likelihood that the same search term was tried before, and that the user then corrected their typos or formatting errors and found what they wanted. The next user who runs into this can benefit from that experience. You can also suggest “did you mean ?” to the user when you can’t find a good result for the search query.

Note that the interaction always ends when the user has selected an appropriate result. This is the user’s way of telling you, “this is what I meant”, and you should learn from it.

In all, I don’t think that either suggestion is truly groundbreaking, but together they can result in a huge leap for the usability of the search feature. And for that particular client, the search feature is Major.
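As a rough illustration of that lookup, the sketch below works over an in-memory list of recorded interactions rather than a RavenDB index. `SearchInteraction` mirrors the document shown above; `Suggestion`, `SearchSuggestions` and `DidYouMean` are hypothetical names, and matching on the normalized term exactly is an assumption (a real system would probably want fuzzier matching).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors the shape of the searchInteractions document.
public sealed class SearchInteraction
{
    public string User { get; set; } = "";
    public List<string> Terms { get; set; } = new List<string>();
    public string FollowedTo { get; set; } = "";
}

// Hypothetical suggestion type: the term that finally worked, where it led,
// and how many recorded interactions support it.
public sealed record Suggestion(string Term, string FollowedTo, int Count);

public static class SearchSuggestions
{
    // Lower-case and collapse runs of spaces/tabs so trivially different inputs compare equal.
    private static string Normalize(string term) =>
        string.Join(" ", term.ToLowerInvariant()
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries));

    public static IReadOnlyList<Suggestion> DidYouMean(
        IEnumerable<SearchInteraction> history, string failedTerm, int take = 3)
    {
        var wanted = Normalize(failedTerm);

        return history
            // Only interactions that ended with the user picking a result.
            .Where(i => i.Terms.Count > 0 && !string.IsNullOrEmpty(i.FollowedTo))
            // The failed term must appear among the earlier (abandoned) attempts.
            .Where(i => i.Terms.Take(i.Terms.Count - 1).Any(t => Normalize(t) == wanted))
            // The last term is the one that led to the selected result.
            .GroupBy(i => (Term: Normalize(i.Terms[i.Terms.Count - 1]), i.FollowedTo))
            .OrderByDescending(g => g.Count())
            .Take(take)
            .Select(g => new Suggestion(g.Key.Term, g.Key.FollowedTo, g.Count()))
            .ToList();
    }
}
```

With the sample interaction above in `history`, calling `SearchSuggestions.DidYouMean(history, "the last feetl")` would offer "the lost fleet" together with the book the earlier user followed. Ranking by count means a suggestion only rises to the top once several users have taken the same path.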

Comments

10/12/2011 10:55 AM by Scooletz

"To start with, given that most of the information was in one of a small number of formats, there was really no reason not to build a parser for that information." LinkedIn does pretty amazing stuff with analyzing queries before querying their indexes. OK, they're big, but making the query semantic makes sense to me.

10/12/2011 12:14 PM by Frank Quednau

What happened to the Event Aggregation post? My newsreader already cached it, but here it is gone...

10/12/2011 12:42 PM by Ayende Rahien

Frank, we made a huge amount of changes in the architecture and the post wasn't relevant any longer, so I removed it. I'll post more about the actual system architecture later.

10/12/2011 02:17 PM by configurator

Too easy to game this system, I think. For example, searching for "Worst software developer ever" could give you: Did you mean "Ayende Rahien"?
