Full text search is a really interesting topic, which I have been dipping my toes into again and again over the years. It is a rich area of research, and there has been quite a few papers, books and articles about the topic. I read a bunch of projects for doing full text search, and I have been using Lucene for a while.
I thought that I would write some code to play with full text search and see where that takes me. This is a side project, and I hope it will be an interesting one. The first thing that I need to do is to define the scope of work:
- Be able to (eventually) do full text search queries
- Compare and contrast different persistence strategies for this
- Be able to work with multiple fields
What I don’t care about: Analysis process, actually implementing complex queries (I do want to have the foundation for them), etc.
Given that I want to work with real data, I went and got the Enron dataset. That is over 517,000 emails from Enron totaling more than 2.2 GB. This is one of the more commonly used test datasets for full text search, so that is helpful. The first thing that we need to do is to get the data into a shape that we can do something about it.
Enron is basically a set of MIME encoded files, so I’ve used MimeKit to speed the parsing process. Here is the code of the algorithm I’m using for getting the relevant data for the system. Here is the relevant bits:
As you can see, this is hardly a sophisticated approach. We are spawning a bunch of threads, processing all half million emails in parallel, select a few key fields and do some very basic text processing. The idea is that we want to get to the point where we have enough information to do full text search, but without going through the real pipeline that this would take.
Here is an example of the output of one of those dictionaries:
As you can see, this is bare bones (I forgot to index the Subject, for example), but on my laptop (8 cores Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz) with 16 GB of RAM, we can index this amount of data in under a minute and a half.
So far, so good, but this doesn’t actually gets us anywhere, we need to construct an inverted index, so we can ask questions about the data and be able to find stuff out. We are already about half way there, which is encouraging. Let’s see how far we can stretch the “simplest thing that could possibly work”… shall we?
Here is the key data structures:
Basically, we have an array of fields, each of which holds a dictionary from each of the terms and a list of documents for the terms.
For the full code for this stage, look at the following link, it’s less than 150 lines of code.
Indexing the full Enron data set now takes 1 minute, 17 seconds, and takes 2.5 GB in managed memory.
The key is that with this in place, if I want to search for documents that contains the term: “XML”, for example, I can do this quite cheaply. Here is how I can “search” over half a million documents to get all those that have the term HTML in them:
As you can imagine, this is actually quite fast.
That is enough for now, I want to start actually exploring persistence options now.
The final code bits are here, I ended up implementing stop words as well, so this is a really cool way to show off how you can do full text search in under 200 lines of code..
More posts in "Searching through text" series:
- (17 Oct 2019) Part III, Managing posting lists
- (16 Oct 2019) Part II, Exploring posting lists persistence
- (14 Oct 2019) Part I, full text search in under 200 lines of code