Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,520
|
Comments: 51,142
Privacy Policy · Terms
filter by tags archive
time to read 5 min | 843 words

In the previous posts in this series, I explored a bit how to generate a full text index on top of the Enron data set. In particular, we looked at (rudimentary) analysis of text in the first post and looked into posting lists (list of matching documents for specific terms) in the second one. It occurred to me that we need to actually have a much better understanding of the kind of requirements that we have from posting lists in general, so let’s look at them, shall we?

  • Add to the list (increasing numbers only).
  • Iterate the list (all, or from starting point).
  • Reduce disk space and memory utilization as much as possible.

The fact that I want to be able to add to the list is interesting. The typical use case in full text search is to generate the full blown posting list from scratch every time. The typical model is to use LSM (Log Structure Merge) and take advantage on the fact that we are dealing with sorted list to merge them cheaply.

Iterating the list is something you’ll frequently do, to find all the matches or to merge two separate lists. Here is the kind of API that I initially had in mind:

As you can see, there isn’t much there, which is intentional. I initially thought about using this an the baseline of a couple of test implementations using StreamVByte, FastPFor as well as Gorrilla compression. The problem is that there is the need to balance compression ratio with the cost of actually going over the list. Given that my test cases showed a big benefit for using Roaring Bitmaps, I decided to look at them first and see what I can get out of it.

RoaringBitamps is a way to store (efficiently) a set of bits, they are very widely used in the industry. The default implementation is also entirely suitable for my purposes. Mostly because they make use of managed memory, and a hard requirement that I have placed on this series is that I want to be able to use persistent memory. In other words, I want to be able to write the data out, then be able to do everything on top of memory mapped data, without having to parse it.

Roaring Bitmaps works in the following manner. Each 64K range of integers is divided into each own 8KB segments. Given that I’m using Voron as a persistence library, these numbers don’t work for my needs. Voron uses an 8KB page size, so we’ll drop these numbers by half. Each range will be 32K of integers and take a maximum of 4KB of disk space. This allows me to store it much more efficiently inside of Voron. Each segment, in turn, has a type. The types can be either:

  • Array – if the number of set bits in the segment is less than 2048, the data will use a simple sorted array implementation, with each value taking 2 bytes.
  • Bitmap – if the number of set bits in the segment is between 2048 and 30,720, the segment will use a total of 4096 bytes and be a standard bitmap.
  • Reversed array – if the number of set bits in the segment is higher than 30,720, we’ll store in the segment the unset bits as a sorted array.

This gives us quite a few advantages:

  • It is straightforward to build this incrementally (remember that we only ever add items in the end).
  • It is quite efficient in terms of space saving in the case of sparse / busy usage.
  • It is cheap (computationally) to work with and process.
  • It is very simple to use from memory mapped file without having to parse / create managed objects.

The one thing that we still need to take into account is how to deal with the segment metadata. How do we know what segment belong to what range. In order to handle that, we’ll define the following:

The idea is that we need to store two important pieces of information. The start location (is always going to be a multiple of 32K) and the number of set bits (which has a maximum of 32K). Therefor, we can pack all of them into a single int64. The struct is merely there for convenience.

In other words, in addition to the segments with the actual set bits, we are also going to have an array of all the segment’s metadata. In practice, we’ll also need another value here, the actual location of the segment’s data, but that is merely another int64, so that is still very reasonable.

As this is currently a mere exercise, I’m going to skip actually building the implementation, but it seems like it should be a fairly straightforward approach. I might do another post about how to actually implement this feature on Voron, because it is interesting. But I think that this is already long enough.

We still have another aspect to consider. So far, we talked only about the posting lists, but we also need to discuss the terms. But that is a topic for the next post in the series.

time to read 3 min | 551 words

In full text search terminology, a posting list is just a list of document ids. These are used to store and find matches for particular terms in the index.

I took the code from the previous post and asked it to give me the top 50 most frequent terms in the dataset and their posting lists. The biggest list had over 200,000 documents, and I intentionally use multiple threads to build things, so the actual list is going to be random from run to run (which adds a little more real-worldedness to the system*).

*Yes, I invented that term. It make sense, so I’m sticking with it.

I took those posting lists and just dumped them to a file, in the simplest possible format. Here are the resulting files:

image

There are a few things to note here. As you can see, the file name is the actual term in the index, the contents of the file is a sorted list of int64 of the document ids (as 8 bytes little endian values).

I’m using int64 here because Lucene uses int32 and thus has the ~2.1 billion document limit, which I want to avoid. It also make it more fun to work with the data, because of the extra challenge.  The file sizes seems small, but the from file contains over 250,000 entries.

When dealing with posting lists, size matter, a lot. So let’s see what it would take to reduce the size here, shall we?

image

Simply zipping the file gives us a massive space reduction, so there is a lot left on the table, which is great.

Actually, I might have skipped a few steps:

  • Posting lists are sorted, because it helps do things like union / intersect queries.
  • Posting lists are typically only added to.
  • Removal are handled separately, with a merge step to clean this up eventually.

Because the value is sorted, the first thing I tried was to use a diff model with variable sized int. Here is the core code:

Nothing really that interesting, I have to admit, but it did cut the size of the file to 242KB, which is nice (and better than ZIP). Variable sized integers are used heavily by Lucene, so I’m very familiar with them. But there are other alternatives.

  • StreamVByte is a new one, with some impressive perf numbers, but only gets us to 282 KB (but it is possible / likely that my implementation of the code is bad).
  • FastPFor compresses the (diffed) data down to 108KB.
  • RoaringBitmap gives us a total of 64KB.

There are other methods, but they tend to go to the esoteric and not something that I can very quickly test directly.

It is important to note that there are several separate constraints here:

  • Final size on disk
  • Computational cost to generate that final format
  • Computation cost to go from the final format to the original values
  • How much (managed) memory is required during this process

That is enough for now, I believe. My next post will deal delve into the actual semantics that we need to implement to get a good behavior from the system. This is likely going to be quite interesting.

time to read 4 min | 645 words

Full text search is a really interesting topic, which I have been dipping my toes into again and again over the years. It is a rich area of research, and there has been quite a few papers, books and articles about the topic. I read a bunch of projects for doing full text search, and I have been using Lucene for a while.

I thought that I would write some code to play with full text search and see where that takes me. This is a side project, and I hope it will be an interesting one. The first thing that I need to do is to define the scope of work:

  • Be able to (eventually) do full text search queries
  • Compare and contrast different persistence strategies for this
  • Be able to work with multiple fields

What I don’t care about: Analysis process, actually implementing complex queries (I do want to have the foundation for them), etc.

Given that I want to work with real data, I went and got the Enron dataset. That is over 517,000 emails from Enron totaling more than 2.2 GB. This is one of the more commonly used test datasets for full text search, so that is helpful. The first thing that we need to do is to get the data into a shape that we can do something about it.

Enron is basically a set of MIME encoded files, so I’ve used MimeKit to speed the parsing process. Here is the code of the algorithm I’m using for getting the relevant data for the system. Here is the relevant bits:

As you can see, this is hardly a sophisticated approach. We are spawning a bunch of threads, processing all half million emails in parallel, select a few key fields and do some very basic text processing. The idea is that we want to get to the point where we have enough information to do full text search, but without going through the real pipeline that this would take.

Here is an example of the output of one of those dictionaries:

As you can see, this is bare bones (I forgot to index the Subject, for example), but on my laptop (8 cores Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz) with 16 GB of RAM, we can index this amount of data in under a minute and a half.

So far, so good, but this doesn’t actually gets us anywhere, we need to construct an inverted index, so we can ask questions about the data and be able to find stuff out. We are already about half way there, which is encouraging. Let’s see how far we can stretch the “simplest thing that could possibly work”… shall we?

Here is the key data structures:

Basically, we have an array of fields, each of which holds a dictionary from each of the terms and a list of documents for the terms.

For the full code for this stage, look at the following link, it’s less than 150 lines of code.

Indexing the full Enron data set now takes 1 minute, 17 seconds, and takes 2.5 GB in managed memory.

The key is that with this in place, if I want to search for documents that contains the term: “XML”, for example, I can do this quite cheaply. Here is how I can “search” over half a million documents to get all those that have the term HTML in them:

image

As you can imagine, this is actually quite fast.

That is enough for now, I want to start actually exploring persistence options now.

The final code bits are here, I ended up implementing stop words as well, so this is a really cool way to show off how you can do full text search in under 200 lines of code..

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Challenge (75):
    01 Jul 2024 - Efficient snapshotable state
  2. Recording (14):
    19 Jun 2024 - Building a Database Engine in C# & .NET
  3. re (33):
    28 May 2024 - Secure Drop protocol
  4. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
  5. Production Postmortem (51):
    12 Dec 2023 - The Spawn of Denial of Service
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}