Ayende @ Rahien

Big Data Search: Binary Search of Textual Data

The index I created for the exercise is just a text file, sorted by the indexed key. That makes it very easy for a human to work with, much easier than a binary file, and it also helps with debugging.

However, it does make running a binary search on the data a bit harder, mostly because there isn’t a nice way to say “give me the Nth line”. Instead, I wrote the following:

    public void SetPositionToLineAt(long position)
    {
        // now we need to go back until we either get to the start of the file
        // or find a \n character
        const int bufferSize = 128;
        _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);

        var charCount = _encoding.GetMaxCharCount(bufferSize);
        if (charCount > _charBuf.Length)
            _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];

        while (true)
        {
            _input.Position = position - (position < bufferSize ? 0 : bufferSize);
            var read = ReadToBuffer(bufferSize);
            var buffer = _buffer.GetBuffer();
            var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
            // scan the buffer backward, looking for the last line break
            for (int i = chars - 1; i >= 0; i--)
            {
                if (_charBuf[i] == '\n')
                {
                    // position the stream on the first byte after the \n
                    _input.Position = position - (bufferSize - i) + 1;
                    return;
                }
            }
            position -= bufferSize;
            if (position < 0)
            {
                _input.Position = 0;
                return;
            }
        }
    }

This code starts at an arbitrary byte position and goes backward until it finds the newline character ‘\n’. That gives me the ability to jump to a rough location and get line-oriented input.
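In other words, this is the contract the binary search below relies on: seek to any rough byte offset, then read whole lines from there. A minimal sketch of the calling pattern (the offset is arbitrary, and ReadOneLine is the reader API used in the next snippet):

    _reader.SetPositionToLineAt(12345); // any rough byte offset in the index file

    // we are now aligned to the start of a line, so we can read
    // line-oriented input from this point on
    bool? result;
    do
    {
        result = _reader.ReadOneLine(); // null means an empty line, skip it
    } while (result == null);
    // _reader.Current now holds the first complete line at or after the offset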

Once I have that, the rest is pretty easy. Here is the binary search:

    while (lo <= hi)
    {
        position = (lo + hi) / 2;
        _reader.SetPositionToLineAt(position);

        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // couldn't find anything

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);

        if (match == 0)
        {
            break;
        }
        if (match > 0)
            // move past the bytes of the line we just read
            lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
        else
            hi = position - 1;
    }

    if (match != 0)
    {
        // no match
        yield break;
    }

The idea is that this positions us at a location in the index file that holds an entry whose value is equal to what we are searching on. Note that lo, hi, and position are byte positions in the index file, not line numbers; SetPositionToLineAt is what turns an arbitrary midpoint into the start of a line.

We then use the following to get the data from the actual data file:

    // we have a match, now we need to return all the matches
    _reader.SetPositionToLineAt(position);

    while (true)
    {
        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // end of file

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);
        if (match != 0)
            yield break; // out of the valid range we need

        // the second value on the index line is the byte offset of the
        // matching line in the data file
        _buffer.SetLength(0);
        _data.Position = Utils.ToInt64(_reader.Current.Values[1]);

        // copy the data line into the buffer, byte by byte
        while (true)
        {
            var b = _data.ReadByte();
            if (b == -1 || b == '\n') // end of file or end of line
                break;
            _buffer.WriteByte((byte)b);
        }

        yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
    }

As you can see, we are moving forward in the index file, reading one line at a time. For each line we take the second value, the byte position of the relevant line in the data file, and read that line from the data file.
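For example, an index line might look something like this (the exact layout is whatever ReadOneLine parses; this one is purely illustrative):

    a.user@example.com 1048576

The first value is the indexed key; the second is the byte offset that Utils.ToInt64 turns into _data.Position.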

We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems. I’ll discuss that in my next post.

Comments

Bartosz Adamczewski, 01/20/2014 12:07 PM

@Ayende, Correct me if I'm wrong, but this sort of binary search requires O(log N) file seeks. Wouldn't it be better to store the ranges of the files we merged and binary search through them to narrow the search in the index file? That would reduce file seeks, which are expensive, because instead of a 15TB region we would only need to search a small part of it.

Ayende Rahien, 01/20/2014 12:26 PM

Bartosz, I am not sure that I am following. Yes, this would require O(log N) seeks, but what do you mean by ranges of the files we merged? Again, there are no guarantees about different ranges in different files.

Bartosz Adamczewski, 01/20/2014 01:01 PM

@Ayende, since you merge the files into one index and it is sorted, you should be able to store some sorted values in memory along with their offsets in the index file. When a query comes in, you could start by binary searching those values; this should give you a smaller range to search in the index file.

for example:

Assume that we have two data sets, [A,B,C,D] and [A,C,D,F]. Since all we have to do to merge them is maintain two pointers, we can easily determine where one file's range starts and ends.

So:

After the merge we would have the following: [A,A,B,C,C,D,F]. The ranges would be [A,D,F]; one A is omitted, as we already have an index entry for that value. If another set we needed to merge looked like [A,A,B,B], then the range set would be [A,B,D,F].

The downside is that we might end up with lots of indexed values like that, but this could be optimized to hold just 10K ranges.

I don't know if my explanation makes any sense; I guess the best way to show it would be through code.
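Roughly, the idea seems to be something like this sketch (the SparseRangeIndex type and the (key, offset) sampling are made up for illustration):

    // A sparse in-memory index: a sorted sample of keys from the big
    // index file, each paired with the byte offset where it appears.
    public class SparseRangeIndex
    {
        private readonly string[] _keys;   // sorted sample of index keys
        private readonly long[] _offsets;  // byte offset of each sampled key

        public SparseRangeIndex(string[] keys, long[] offsets)
        {
            _keys = keys;
            _offsets = offsets;
        }

        // Narrow the (lo, hi) byte range that the on-disk binary search
        // has to cover for the given key.
        public (long lo, long hi) GetRange(string key, long fileLength)
        {
            int idx = Array.BinarySearch(_keys, key, StringComparer.Ordinal);
            if (idx < 0)
                idx = ~idx - 1; // key falls between samples; take the lower one
            long lo = idx < 0 ? 0 : _offsets[idx];
            long hi = idx + 1 < _offsets.Length ? _offsets[idx + 1] : fileLength;
            return (lo, hi);
        }
    }

With, say, 10K sampled keys this easily fits in memory, and the on-disk binary search only has to cover the slice between two neighboring samples instead of the whole file.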

Ayende Rahien, 01/20/2014 01:05 PM

Bartosz, I don't understand. After I merged the files, there is just one big index file. Why do I need anything else about the previous iterations?

Bartosz Adamczewski, 01/20/2014 01:12 PM

@Ayende, what this operation is meant to do is partition that one big index file.

Given that:

[A,A,B,C,C,D,F] is our big index file. If we want to search for C, we first binary search the range set [A,D,F]; since C is between A and D, we then issue a binary search over the index file starting with l = A and h = D.

Ayende Rahien, 01/20/2014 01:23 PM

Bartosz, That isn't really going to save you anything in the real world. You basically want to cache the first level or two of the binary search. But given that the OS is already going to cache those internally anyway, I don't see any reason to do that.

Bartosz Adamczewski, 01/20/2014 01:31 PM

@Ayende, it actually depends on the data, but ideally this would save around 30% of the seeks. By caching internally, are you referring to read-ahead from disk into the cache?

Ayende Rahien, 01/20/2014 01:34 PM

Bartosz, since we are always going to be hitting roughly the same spots (the midway points into the file), the OS is going to cache those in its own memory, and we can avoid doing an actual hardware seek.

Eric Richards, 01/20/2014 03:00 PM

Always full of interesting information. Maybe this is some mangling specific to The Old Reader, which I use to read your posts, but your code listings always get broken out into individual line blocks which make it a little harder to read.

Howard Chu, 01/20/2014 03:01 PM

(Btw, this is exactly what the Unix "look" command does.)
