Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Big Data Search: Binary Search of Textual Data

time to read 29 min | 5758 words

The index I created for the exercise is just a text file, sorted by the indexed key. For a human doing a search, that makes it very easy to work with, much easier than trying to work with a binary file, and it also helps with debugging.

However, it does make running a binary search on the data a bit harder, mostly because there isn't a nice way to say "give me the #th line". Instead, I wrote the following:

    public void SetPositionToLineAt(long position)
    {
        // now we need to go back until we either get to the start of the file
        // or find a \n character
        const int bufferSize = 128;
        _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);

        var charCount = _encoding.GetMaxCharCount(bufferSize);
        if (charCount > _charBuf.Length)
            _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];

        while (true)
        {
            _input.Position = position - (position < bufferSize ? 0 : bufferSize);
            var read = ReadToBuffer(bufferSize);
            var buffer = _buffer.GetBuffer();
            var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
            for (int i = chars - 1; i >= 0; i--)
            {
                if (_charBuf[i] == '\n')
                {
                    _input.Position = position - (bufferSize - i) + 1;
                    return;
                }
            }
            position -= bufferSize;
            if (position < 0)
            {
                _input.Position = 0;
                return;
            }
        }
    }

This code starts at an arbitrary byte position and goes backward until it finds the newline character '\n'. That gives me the ability to jump to a rough location and then get line-oriented input.
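The same back-scan can be sketched in a few lines of Python (a hypothetical stand-alone helper, not the post's code; it assumes a file opened in binary mode):

```python
def seek_to_line_start(f, position, buffer_size=128):
    """Move f to the start of the line containing byte offset `position`
    by scanning backward in fixed-size chunks for a '\n' character."""
    while True:
        start = max(0, position - buffer_size)
        f.seek(start)
        chunk = f.read(position - start)
        idx = chunk.rfind(b'\n')
        if idx != -1:
            f.seek(start + idx + 1)  # first byte after the newline
            return
        if start == 0:
            f.seek(0)  # hit the beginning of the file: the line starts here
            return
        position = start
```

Here `rfind` does the backward character scan that the C# loop above does by hand.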

Once I have that, the rest is pretty easy. Here is the binary search:

    while (lo <= hi)
    {
        position = (lo + hi) / 2;
        _reader.SetPositionToLineAt(position);

        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // couldn't find anything

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);

        if (match == 0)
        {
            break;
        }
        if (match > 0)
            lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
        else
            hi = position - 1;
    }

    if (match != 0)
    {
        // no match
        yield break;
    }

The idea is that this positions us on a location in the index that has an entry with a value equal to what we are searching for.
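To make the offset-based search concrete, here is a hedged Python sketch under an assumed layout where every index line is `key<TAB>offset`; the names and the format are illustrative, not the post's actual classes:

```python
def snap_to_line_start(f, position, buffer_size=128):
    # back-scan for '\n' so an arbitrary byte offset becomes a line start
    while True:
        begin = max(0, position - buffer_size)
        f.seek(begin)
        idx = f.read(position - begin).rfind(b'\n')
        if idx != -1:
            f.seek(begin + idx + 1)
            return begin + idx + 1
        if begin == 0:
            f.seek(0)
            return 0
        position = begin

def find_match_offset(f, size, wanted):
    """Binary search over *byte offsets* of a sorted, newline-delimited
    index file. Each probe snaps to the enclosing line before comparing
    keys. Returns the byte offset of a matching line, or None."""
    lo, hi = 0, size - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = snap_to_line_start(f, mid)
        line = f.readline()
        key = line.split(b'\t', 1)[0]
        if key == wanted:
            return start
        if key < wanted:
            lo = start + len(line)  # step past the whole probed line
        else:
            hi = start - 1
    return None
```

Note how lo and hi move by whole lines, mirroring the `position + ... Sum(x => x.Count) + 1` adjustment in the C# version.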

We then write the following to actually get the data from the actual data file:

    // we have a match, now we need to return all the matches
    _reader.SetPositionToLineAt(position);

    while (true)
    {
        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // end of file

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);
        if (match != 0)
            yield break; // out of the valid range we need

        _buffer.SetLength(0);
        _data.Position = Utils.ToInt64(_reader.Current.Values[1]);

        while (true)
        {
            var b = _data.ReadByte();
            if (b == -1)
                break;
            if (b == '\n')
            {
                break;
            }
            _buffer.WriteByte((byte)b);
        }

        yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
    }

As you can see, we are moving forward in the index file, reading one line at a time. We then take the second value, the position of the relevant line in the data file, and read that line.

We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems; I'll discuss those in my next post.
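Putting both halves together, here is a Python sketch of the forward scan under the same assumed `key<TAB>data-offset` index layout (illustrative names, not the post's code):

```python
def read_matches(index, data, wanted, match_offset):
    """From a known matching line offset in the index file, scan forward
    and yield the data-file line for every entry whose key is `wanted`."""
    index.seek(match_offset)
    while True:
        line = index.readline()
        if not line:
            return  # end of the index file
        key, _, offset = line.rstrip(b'\n').partition(b'\t')
        if key != wanted:
            return  # past the run of equal keys, nothing more can match
        data.seek(int(offset.decode('ascii')))  # dereference into the data file
        yield data.readline().rstrip(b'\n')
```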

More posts in "Big Data Search" series:

  1. (24 Jan 2014) Sorting randomness
  2. (23 Jan 2014) Sorting Optimizations
  3. (22 Jan 2014) The index format is horrible
  4. (20 Jan 2014) Binary Search of Textual Data
  5. (17 Jan 2014) Setting up


Bartosz Adamczewski

@Ayende, Correct me if I'm wrong, but this sort of binary search would require O(log N) file seeks. Wouldn't it be better to store the ranges of the files we merged and binary search through them first to narrow the search in the index file? This would reduce file seeks, which are expensive, because instead of a 15TB region we would only need to search a small part of it.

Ayende Rahien

Bartosz, I am not sure that I am following. Yes, this would require O(log N) seeks, but what do you mean by "ranges of the files we merged"? Again, there are no guarantees for different ranges in different files.

Bartosz Adamczewski

@Ayende, since you merge the files into one index and this is sorted, you should be able to store some sorted values in memory along with their offsets in the index file. When issuing a query you could start by binary searching those values; this should give you a smaller range to search in the index file.

for example:

Assume that we have two data sets, [A,B,C,D] and [A,C,D,F]. Since all we have to do to merge them is to maintain two pointers, we can easily determine where one file's range starts and ends.


After the merge we would have [A,A,B,C,C,D,F], and the ranges would be [A,D,F]; one A is omitted as we already have an index entry for that value. If another set that we needed to merge looked like [A,A,B,B], then the range set would be [A,B,D,F].

The downside is that we might end up with lots of indexed values like that, but this could be optimized to hold just 10K ranges.

I don't know if my explanation makes any sense; I guess the best way to show it would be through code.

Ayende Rahien

Bartosz, I don't understand. After I merged the files, there is just one big index file. Why do I need anything else about the previous iterations?

Bartosz Adamczewski

@Ayende, what this operation is meant to do is partition this one big indexed file.

Given that:

[A,A,B,C,C,D,F] is our big index file. If we want to search for C, we just binary search the range set [A,D,F]; since C is between A and D, we issue a binary search over the index file starting with lo = A and hi = D.
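The suggestion above can be sketched with Python's bisect module (an illustrative sketch of the idea, with an assumed in-memory list of (key, offset) samples, not code from the post):

```python
import bisect

def narrowed_bounds(samples, wanted, file_size):
    """samples: sorted list of (key, byte_offset) pairs kept in memory.
    Returns (lo, hi) byte bounds of the index-file region that must
    contain `wanted`, so the on-disk binary search starts smaller."""
    keys = [k for k, _ in samples]
    i = bisect.bisect_right(keys, wanted)
    lo = samples[i - 1][1] if i > 0 else 0
    hi = samples[i][1] - 1 if i < len(samples) else file_size - 1
    return lo, hi
```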

Ayende Rahien

Bartosz, That isn't really going to save you anything in the real world. You basically want to cache the first level or two of the binary search. But given that the OS is already going to cache those internally anyway, I don't see any reason to do that.

Bartosz Adamczewski

@Ayende, actually it depends on the data, but ideally this would be about 30% of the seeks. By caching internally, are you referring to read-ahead disk caching?

Ayende Rahien

Bartosz, Since we are always going to be hitting roughly the same spots (midway parts into the file), the OS is going to cache those in its own memory, and we can avoid doing an actual hardware seek.

Eric Richards

Always full of interesting information. Maybe this is some mangling specific to The Old Reader, which I use to read your posts, but your code listings always get broken out into individual line blocks which make it a little harder to read.

Howard Chu

(Btw, this is exactly what the Unix "look" command does.)
