Ayende @ Rahien

Refunds available at head office

When a race condition is what you want…

I have an interesting situation that I am not sure how to resolve. We need to record the last request time for a RavenDB database. Now, this last request time is mostly used to show the user, and to decide when a database is idle, and can be shut down.

As such, it doesn’t have to be really accurate (a skew of even a few seconds is just fine). However, under load, there are many requests coming in (in the case presented to us, 4,000 concurrent requests), and they all need to update the last request time.

Obviously, in such a scenario, we don’t really care about the actual value. But the question is, how do we deal with that? In particular, I want to avoid a situation where we do a lot of writes to the same value in an unprotected manner, mostly because it is likely to cause contention between cores.

Any ideas?

It is actually fine for us to go slightly back (so thread A at time T+1 and thread B at time T+2 running concurrently, and the end result is T+1), which is why I said that a race is fine for us. But what I want to avoid is any form of locking / contention.

I wrote the following test code:

class Program
{
    static void Main(string[] args)
    {
        var threads = new List<Thread>();

        var holder = new Holder();

        var mre = new ManualResetEvent(false);

        for (int i = 0; i < 2500; i++)
        {
            var thread = new Thread(delegate()
            {
                mre.WaitOne();
                for (long j = 0; j < 500*1000; j++)
                {
                    holder.Now = j;
                }
            });
            thread.Start();
            threads.Add(thread);
        }

        mre.Set();

        threads.ForEach(t => t.Join());


        Console.WriteLine(holder.Now);
    }
}

public class Holder
{
    public long Now;
}

And it looks like it is doing what I want it to. This creates a lot of contention on the same value, but it is also giving me the right value. And again, the value of “right” here is very approximate. The problem is that while I know how to write thread safe code, I’m not sure that this is a good way to go about doing it.

Note that this code (yes, even with 2,500 threads) runs quite fast, in under a second. Trying to use Interlocked.Exchange is drastically more expensive, and Interlocked.CompareExchange is even worse.

But it is just not sitting well with me.
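One direction I’m playing with (purely a sketch, not something we have settled on) is to keep the unsynchronized write, but skip it entirely when the recorded value is already close enough, so that under load most threads never touch the shared cache line:

public class LastRequestTracker
{
    private long _lastRequestTicks; // written without any locks, on purpose

    public void RecordRequest(DateTime utcNow)
    {
        var nowTicks = utcNow.Ticks;
        var previous = _lastRequestTicks; // racy read, we only want a rough value

        // Skip the write if we are within a second of the recorded value.
        // Under heavy load most threads take this branch, so the shared
        // value is written far less often.
        if (nowTicks - previous < TimeSpan.TicksPerSecond)
            return;

        _lastRequestTicks = nowTicks; // still racy, and that is fine here
    }

    public DateTime LastRequest
    {
        get { return new DateTime(_lastRequestTicks, DateTimeKind.Utc); }
    }
}

Whether the extra read and branch actually beat the blind write is something that would need measuring under the same kind of load as the test above.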


Message passing, performance

I got some replies about the async event loop post, mentioning LMAX Disruptor and performance. I decided to see for myself what the fuss was all about.

You can read about the LMAX Disruptor, but basically, it is a very fast single process messaging library.

I wondered what that meant, so I wrote my own messaging library:

public class Bus<T>
{
    Queue<T> q = new Queue<T>();

    public void Enqueue(T msg)
    {
        lock (q)
        {
            q.Enqueue(msg);
        }
    }

    public bool TryDequeue(out T msg)
    {
        lock (q)
        {
            if (q.Count == 0)
            {
                msg = default(T);
                return false;
            }
            msg = q.Dequeue();
            return true;
        }
    }
}

I think that you’ll agree that this is a thing of beauty and elegant coding. I then tested this with the following code:

public static void Read(Bus<string> bus)
{
    int count = 0;
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        string msg;
        while (bus.TryDequeue(out msg))
        {
            count++;
        }
    }
    sp.Stop();

    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", count, sp.Elapsed, (count / sp.Elapsed.TotalSeconds));
}

public static void Send(Bus<string> bus)
{
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        for (int i = 0; i < 1000; i++)
        {
            bus.Enqueue("test");
        }
    }
}

var bus = new Bus<string>();

ThreadPool.QueueUserWorkItem(state => Send(bus));

ThreadPool.QueueUserWorkItem(state => Read(bus));

The result of this code?

145,271,000 msgs in 00:00:10.4597977 for 13,888,510 ops/sec

Now, what happens when we use the DataFlow’s BufferBlock as the bus?

public static async Task ReadAsync(BufferBlock<string> bus)
{
    int count = 0;
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        try
        {
            await bus.ReceiveAsync(TimeSpan.FromMilliseconds(5));
            count++;
        }
        catch (TaskCanceledException)
        {
        }
    }
    sp.Stop();

    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", count, sp.Elapsed, (count / sp.Elapsed.TotalSeconds));
}

public static async Task SendAsync(BufferBlock<string> bus)
{
    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        for (int i = 0; i < 1000; i++)
        {
            await bus.SendAsync("test");
        }
    }
}

What we get is:

43,268,149 msgs in 00:00:10 for 4,326,815 ops/sec.

I then decided to check what happens with the .NET port of the LMAX Disruptor. Here is the code:

public class Holder
{
    public string Val;
}

internal class CounterHandler : IEventHandler<Holder>
{
    public int Count;
    public void OnNext(Holder data, long sequence, bool endOfBatch)
    {
        Count++;
    }
}

static void Main(string[] args)
{
    var disruptor = new Disruptor.Dsl.Disruptor<Holder>(() => new Holder(), 1024, TaskScheduler.Default);
    var counterHandler = new CounterHandler();
    disruptor.HandleEventsWith(counterHandler);

    var ringBuffer = disruptor.Start();

    var sp = Stopwatch.StartNew();
    while (sp.Elapsed.TotalSeconds < 10)
    {
        for (var i = 0; i < 1000; i++)
        {
            long sequenceNo = ringBuffer.Next();

            ringBuffer[sequenceNo].Val = "test";

            ringBuffer.Publish(sequenceNo);
        }
    }
    Console.WriteLine("{0:#,#;;0} msgs in {1} for {2:#,#} ops/sec", counterHandler.Count, sp.Elapsed, (counterHandler.Count / sp.Elapsed.TotalSeconds));
}

And the resulting performance is:

29,791,996 msgs in 00:00:10.0003334 for 2,979,100 ops/sec

Now, I’ll be the first to agree that this is really and absolutely not even close to being a fair benchmark. It is testing wildly different things. The Disruptor is using a ring buffer, the BufferBlock isn’t, and the original Bus implementation just used an unbounded queue.

But that is a very telling benchmark as well. Pretty much because it doesn’t matter. What I need this for is network protocol handling. As such, even assuming that every single byte is a message, we would have to go far beyond what any reasonable pipe can be expected to handle.


Async event loops in C#

I’m designing a new component, and I want to reduce the amount of complexity involved in dealing with it. This is a networked component, and after designing several such components, I wanted to remove one area of complexity, which is the use of explicitly concurrent code. Because of that, I decided to go with the following architecture:

[Architecture diagram: the network reader pushes incoming messages into an in-memory queue, and a single threaded event loop drains that queue.]

The network code just reads messages from the network and puts them in an in-memory queue. Then we have a single threaded event loop that simply goes over the queue and processes those messages.

All of the code that actually processes messages is single threaded, which makes it oh so much easier to work with.

Now, I can do this quite easily with a BlockingCollection<T>, which is how I usually did this sort of thing so far. It is simple, robust and easy to understand. It also ties up a full thread for the event loop, which can be a shame if you don’t get a lot of messages.

So I decided to experiment with async approaches. In particular, using the BufferBlock<T> from the DataFlow assemblies.

I came up with the following code:

var cts = new CancellationTokenSource();

var q = new BufferBlock<int>(new DataflowBlockOptions
{
    CancellationToken = cts.Token,
});

This just creates the buffer block, but the nice thing here is that I can set up a “global” cancellation token for all operations on it. The problem is that this actually generates bad exceptions (InvalidOperationException, instead of TaskCanceledException). Well, I’m not sure if bad is the right term, but it isn’t the one I would expect here, at least. If you pass a cancellation token directly to the method, you get the behavior I expected.

At any rate, the code for the event loop now looks like this:

private static async Task EventLoop(BufferBlock<object> bufferBlock, CancellationToken cancellationToken)
{
    while (true)
    {
        object msg;
        try
        {
            msg = await bufferBlock.ReceiveAsync(TimeSpan.FromSeconds(3), cancellationToken);
        }
        catch (TimeoutException)
        {
            NoMessagesInTimeout();
            continue;
        }
        catch (Exception)
        {
            break;
        }
        ProcessMessage(msg);
    }
}
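For completeness, the producer side (the network reading code, in my case) just pushes into the same block. A minimal sketch, assuming the same bufferBlock and cancellation token; ReadMessageFromNetwork is a stand-in for the actual I/O:

private static async Task NetworkReader(BufferBlock<object> bufferBlock, CancellationToken cancellationToken)
{
    while (cancellationToken.IsCancellationRequested == false)
    {
        // Stand-in for whatever actually reads a message off the wire.
        object msg = await ReadMessageFromNetwork();

        // Hand the message over to the single threaded event loop.
        await bufferBlock.SendAsync(msg, cancellationToken);
    }
}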

And that is pretty much it. We have a good way to handle timeouts, and processing messages, and we don’t take up a thread. We can also be easily cancelled. I still need to run this through a lot more testing, in particular, to verify that this doesn’t cause issues when we need to debug this sort of system, but it looks promising.

SQL Injection & Optimizations Path

It is silly, but I just had a conversation with one of our developers on SQL Injection. In RavenDB we support replicating to a relational database, which obviously requires using SQL. We are doing things properly, with parameters and everything.

No chance for SQL Injection from there. Great, and that would have been the end of a very short blog post, if that was everything.

As it turned out, there is a significant performance difference between:

@p1 = 'users/1'
@p2 = 'users/2'

DELETE FROM Users WHERE Id IN (@p1, @p2)

And:

DELETE FROM Users WHERE Id IN ('users/1', 'users/2')

Enough that we added that as an option. The reason why is related to the vagaries of the database query optimizer, and isn’t really relevant here.

This is off by default, obviously. And we use parameters by choice & preference. But we still added a minimal “protection” by adding:

sqlValue.Replace("'", "''")

Considering that this isn’t meant for user input (it is for document ids), that is merely annoying.
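To make it concrete, the inlined variant is generated along these lines (a sketch, not the actual RavenDB code; the method name and the document id list are just illustrative):

// Sketch only: builds the inlined IN clause, doubling single quotes as the
// minimal protection mentioned above.
static string BuildDeleteStatement(IEnumerable<string> documentIds)
{
    var escaped = documentIds.Select(id => "'" + id.Replace("'", "''") + "'");
    return "DELETE FROM Users WHERE Id IN (" + string.Join(", ", escaped) + ")";
}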

Any suggestions on how to improve this?


Compression finale

After a fairly long road, we are done. We have all the pieces, generating a shared dictionary, writing using Huffman encoding and getting the results back out.

Hopefully by now the theory behind it is fairly clear to you, and it is time to actually put this into practice.

I have 100,000 random user documents in this file, and I want to see what kind of compression I can get from a shared dictionary approach. The project with the code for all of this can be found here: Rhea Compression (most of it is basically a port of FemtoZip to .NET).

The actual file size is 8.49 MB, and when compressing it with Zip (Windows’ “send to compressed folder”) it turns into a 1.93 MB file.

The original file size in bytes is: 8,809,353.

I then tried to compress each document individually using GZipStream, resulting in a total of 10,004,614 bytes used. Or 9.5 MB! In other words, and to no one’s surprise (I hope), we see an increase in the file size.

However, when using Rhea’s compression, we do the following:

var trainer = new CompressionTrainer();

for (int i = 0; i < json.Length/100; i++)
{
    trainer.TrainOn(json[i*100]);
}

var compressionHandler = trainer.CreateHandler();

This creates a shared dictionary from every 100th document. So we have 1,000 documents as our sampling data. Then, I compressed all the individual documents one at a time.

The result took 2,593,235 bytes, or just 2.47 MB. We got a 29% compression ratio! Note that we did this with a 34Kb shared dictionary.

Here is the actual compression code:

long size = 0;
long compressedSize = 0;
var ms = new MemoryStream();

foreach (var s in docs)
{
    size += s.Length;
    ms.SetLength(0);
    compressedSize += compressionHandler.Compress(s, ms);
}

And that is pretty much it. Rhea Compression is on GitHub, and that concludes my spike into compression. In general, Rhea (and FemtoZip, obviously) is meant for very specific scenarios. I have high hopes of being able to use it in the future for doing great things.


Huffman coding and encoding compressed data

So far, we have dealt with relatively straightforward topics. Given a corpus, find a shared dictionary that we can then use to extract repeated patterns. This is the classic view of compression, even if real world compression schemes are actually a lot more convoluted. Now that we have compressed text like the following, what comes next?

<-43,6>11,'n<-60,6>Anna Nepal<-40,13><-18,8><-68,8>awest@twinte.gov'}

Obviously, it would be pretty important to encode this properly. Here is an example of a bad encoding scheme:

[prefix – byte – 1 for compressed, 0 for literal]

[length – int – length of compressed / literal]

[offset – int – back reference if the prefix was set to 1]

The problem here is that we need 6 bytes to encode a one letter literal, and 9 bytes to encode any back reference. Using this inefficient method, encoding the compressed string above actually takes more space than the original one.

As you can imagine, there has been a lot of research into this. The RFC 1951 specification, for example, sets out how to do that in detail, although I find this explanation much easier to go through.

Let us see a simple Huffman encoding scheme. We will use “this is an example of a huffman tree” as our text, and first, we’ll build the table.

[Image: Huffman tree and code table for “this is an example of a huffman tree”]

And now we can encode the above sentence in 135 bits, instead of 288.

For decoding, we just walk the tree, and stop the first time that we hit a leaf.

For example, let us see how we would decode ‘ ‘ and ‘r’.

Space is very common, appearing 7 times in the text. As such, it is encoded as 111. So, start from the root and move right three times. We are in a leaf node, and we output a space.

The letter R appears only once in the text, and its encoding is 10111. So move right from the root node, then left, then right three times, arriving at the leaf node for R.

So the result of Huffman encoding using the above table is:

[t] = 0001
[h] = 1010
[i] = 1011
[s] = 0000
[ ] = 111
[i] = 1011
[s] = 0000
[ ] = 111
[a] = 010
[n] = 0011
[ ] = 111
[e] = 011
[x] = 10001
[a] = 010
[m] = 0010
[p] = 10010
[l] = 110010
[e] = 011
[ ] = 111
[o] = 110011
[f] = 1101
[ ] = 111
[a] = 010
[ ] = 111
[h] = 1010
[u] = 10000
[f] = 1101
[f] = 1101
[m] = 0010
[a] = 010
[n] = 0011
[ ] = 111
[t] = 0001
[r] = 10011
[e] = 011
[e] = 011

However, we have a problem here: this results in 135 bits, or 17 bytes (vs. 36 bytes for the original text). However, 17 bytes contain 136 bits. How do we avoid corruption from that extra bit? The answer is that we have to include an EOF marker. We do that by just adding a new item to our Huffman table and encoding it normally.

So, that is all set, and we are ready to go, right?  Almost, we still need to decide how to actually encode the literals and compressed data. GZip (and FemtoZip) uses a really nice way of handling that.

The Huffman table contains the frequencies of individual byte values, but instead of having just 256 entries (0 – 255), FemtoZip uses 513 entries:

256 entries are for the actual byte entries.

256 entries are just lengths.

1 entry for EOF.

What do these length entries mean? Basically, the values 256 – 511 stand for lengths. When we read a Huffman value, it will give us a number in the range of 0 – 512.

If it is 512, this is the EOF marker, and we are done.

If it is 0 – 255, it is a byte literal, and we can treat it as such.

But if it is in the range of 256 – 511, it means that this is a length. Note that because lengths also cluster around common values, it is very likely that we’ll be able to store most lengths in under a byte.

Following a length, we’ll store the actual back reference offset. And again, FemtoZip is using Huffman encoding to do that. This is actually done by encoding the offset into multiple 4 bit entries. The idea is to gain as much as possible from commonalities in the actual byte patterns of the offsets.
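To make the dispatch clearer, here is a rough sketch of the decoder’s main loop under this scheme. ReadSymbol, ReadOffset and CopyFromHistory are stand-ins, not FemtoZip’s actual API:

// ReadSymbol() returns the next decoded Huffman value (0 - 512).
// ReadOffset() decodes a back reference offset from its 4 bit nibbles.
while (true)
{
    int symbol = ReadSymbol();

    if (symbol == 512)
        break; // EOF marker, we are done

    if (symbol <= 255)
    {
        output.WriteByte((byte)symbol); // plain byte literal
        continue;
    }

    int length = symbol - 256;  // 256 - 511 encode match lengths
    int offset = ReadOffset();  // how far back to copy from
    CopyFromHistory(offset, length, output);
}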

Yes, it is confusing, and it took me a while to figure it out. But it is also very sleek to see this in action.

In my next post, I’ll bring it all together and attempt to actually generate something that produces compressed data and can decompress it back successfully…


Using shared dictionary during compression

After the process of creating the actual dictionary, it is time for us to actually make use of it, isn’t it?

Again, I’m following the code from FemtoZip here, and I’m going to explain how it actually uses the dictionary for compression. The magic is happening in the PrefixHash class. Let us see the calling code:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna West','country':'Nepal','email':'awest@twinte.gov'}");

var prefixHash = new PrefixHash(dic, true);

var bestMatch = prefixHash.GetBestMatch(0, text);
Console.WriteLine(Encoding.UTF8.GetString(dic, bestMatch.BestMatchIndex, bestMatch.BestMatchLength));

The output of this code is: {‘id’:

How does this work?

When we create the prefixHash, it generates the following table by hashing every 4 byte sequence in the dictionary and storing the position where it appears.

hash[  5] =  58;
hash[ 11] =  71;
hash[ 13] =  22;
hash[ 14] =  11;
hash[ 17] =  23;
hash[ 30] =  16;
hash[ 34] =  61;
hash[ 35] =   5;
hash[ 37] =  70;
hash[ 41] =  65;
hash[ 45] =  54;
hash[ 49] =  66;
hash[ 57] =  36;
hash[ 58] =  29;
hash[ 63] =  57;
hash[ 65] =  60;
hash[ 66] =  56;
hash[ 67] =   2;
hash[ 72] =   0;
hash[ 73] =  28;
hash[ 78] =  62;
hash[ 79] =  35;
hash[ 80] =  51;
hash[ 87] =  55;
hash[ 89] =  15;
hash[ 91] =   9;
hash[ 96] =   7;
hash[ 99] =  67;
hash[100] =  52;
hash[105] =  21;
hash[108] =  25;
hash[109] =  69;
hash[110] =  68;
hash[111] =  64;
hash[118] =  17;
hash[120] =   4;
hash[125] =  33;
hash[126] =   3;
hash[127] =  26;
hash[130] =  18;
hash[131] =  31;
hash[132] =  59;

The hash of {‘id (the first 4 bytes) is 125. And as you can see, that maps to position 33. That means that there is a good chance that at position 33 in the dictionary we’ll find the value {‘id. What we do then is check, and keep extending the match as long as the bytes agree.

That is how we can figure out that there is a 6 character match starting at position 33. The actual code is more involved, of course, and we need to figure out if there might be another match, elsewhere in the dictionary, that would serve better. Another issue when we actually compress is that while we can use the dictionary for compression, it is also possible to use the plain text we have already compressed as another dictionary. Which is what FemtoZip is doing.

Basically, the logic goes like this. Check the current position for an entry in the dictionary, then check if we already had this value in the plain text we have seen so far. Select the largest match, then output that.
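A minimal sketch of that 4 byte prefix hash idea (illustrative only; this is not the actual PrefixHash implementation, and it keeps just one position per hash):

// Map every 4 byte prefix in the dictionary to a position where it occurs.
static Dictionary<int, int> BuildPrefixHash(byte[] dic)
{
    var hash = new Dictionary<int, int>();
    for (int i = 0; i + 4 <= dic.Length; i++)
    {
        int key = (dic[i] << 24) | (dic[i + 1] << 16) | (dic[i + 2] << 8) | dic[i + 3];
        hash[key] = i;
    }
    return hash;
}

// Given a candidate position, extend the match as long as the bytes agree.
static int MatchLength(byte[] dic, int dicPos, byte[] text, int textPos)
{
    int len = 0;
    while (dicPos + len < dic.Length &&
           textPos + len < text.Length &&
           dic[dicPos + len] == text[textPos + len])
    {
        len++;
    }
    return len;
}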

Here it is, actually using it all:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna Nepal','country':'Nepal','email':'awest@twinte.gov'}");

var substringPacker = new SubstringPacker(dic);
substringPacker.Pack(text, new DebugPackerOutput(), Console.Out);

The output is meant to be human readable, and the compressed text is:

<-43,6>11,'n<-60,6>Anna Nepal<-40,13><-18,8><-68,8>awest@twinte.gov'}

Note that we saved a total of 41 characters due to compression (assuming we don’t count the cost of actually encoding this).

Now, what about those references? They look very strange with those negative numbers. The reason those numbers are negative is actually quite simple. They aren’t dictionary entries, like you would think. Instead, they are back references. In other words, the first <-43,6> call is actually saying: go backward 43 bytes, then copy 6 bytes.

But we just started reading the compressed text, where do we go backward to? The answer is that we go backward into the dictionary. So all the references in the text are always relative to our current position. Let us resolve this compressed string one step at a time.

<-43,6> means go back 43 bytes into the dictionary and copy 6 bytes, giving us a string of:

{‘id’:

Then we have the literal “11,'n”, which we append to the string:

{‘id’:11,’n

Now we need to go 60 bytes back (from the current end of the string) and copy 6 bytes giving us:

{‘id’:11,’name’:’

The literal “Anna Nepal” giving us:

{‘id’:11,’name’:’Anna Nepal

Then we have to go 40 characters back, and copy 13 bytes:

{‘id’:11,’name’:’Anna Nepal’,’country’:’

Now this is fun, we have to go 18 chars back, and for the first time, we aren’t hitting the dictionary, we are using the actual string that we uncompressed to generate the rest of the string:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,

Another backward reference, 68 steps and copying 8 bytes (again to the dictionary):

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’

The literal awest@twinte.gov’} completes the picture, giving us the full text:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’awest@twinte.gov’}
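The decoding side of the back references is small enough to sketch (again illustrative, not FemtoZip’s actual code). The trick is to treat the dictionary as if it sat just before the output we have produced so far:

// Resolve a <-offset,length> back reference, where positions before the
// start of the output fall into the shared dictionary.
static void CopyBackReference(byte[] dic, List<byte> output, int offset, int length)
{
    // Position inside a virtual buffer of [dictionary][output so far].
    int start = dic.Length + output.Count - offset;

    for (int i = 0; i < length; i++)
    {
        int pos = start + i;
        byte b = pos < dic.Length
            ? dic[pos]                  // still inside the dictionary
            : output[pos - dic.Length]; // inside the text we already produced
        output.Add(b);
    }
}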

And that is how FemtoZip works. And that is pretty neat.

The actual implementation is doing Huffman compression as well, but I’ll touch on that in a later post.


Shared dictionary generation

As I said, generating a shared dictionary turned out to be a bit more complex than I thought it would be. I hoped to be able to just use a prefix tree and get the highest scoring entries, but that doesn’t fly. I turned to FemtoZip to see how they do it, and it became both easier and harder at the same time.

They are doing this using a suffix array and LCP. I decided to port this to C# so I can play with this more easily. We start with the following code:

 var dic = new DictionaryOptimizer();

 dic.Add("{'id':1,'name':'Ryan Peterson','country':'Northern Mariana Islands','email':'rpeterson@youspan.mil'");
 dic.Add("{'id':2,'name':'Judith Mason','country':'Puerto Rico','email':'jmason@quatz.com'");
 dic.Add("{'id':3,'name':'Kenneth Berry','country':'Pakistan','email':'kberry@wordtune.mil'");

 var optimize = dic.Optimize(512);

This gives me an initial corpus to work with. So let us dig in and figure out how it works. Note that I use a very small sample to reduce the amount of stuff we have to go through.

The first thing that FemtoZip does is concatenate all of those entries together and generate a suffix array. A suffix array is the sorted array of all the suffixes of the combined string; part of it, for the string above, is:

ariana Islands','email':'rpeterson@yousp
ason','country':'Puerto Rico','email':'j
ason@quatz.com'{'id':3,'name':'Kenneth B
atz.com'{'id':3,'name':'Kenneth Berry','
berry@wordtune.mil'
co','email':'jmason@quatz.com'{'id':3,'n
com'{'id':3,'name':'Kenneth Berry','coun
country':'Northern Mariana Islands','ema
country':'Pakistan','email':'kberry@word
country':'Puerto Rico','email':'jmason@q
d':1,'name':'Ryan Peterson','country':'N
d':2,'name':'Judith Mason','country':'Pu
d':3,'name':'Kenneth Berry','country':'P
dith Mason','country':'Puerto Rico','ema
ds','email':'rpeterson@youspan.mil'{'id'
dtune.mil'
e':'Judith Mason','country':'Puerto Rico
e':'Kenneth Berry','country':'Pakistan',
e':'Ryan Peterson','country':'Northern M
e.mil'
email':'jmason@quatz.com'{'id':3,'name':
email':'kberry@wordtune.mil'
email':'rpeterson@youspan.mil'{'id':2,'n
enneth Berry','country':'Pakistan','emai

The idea is to generate all the suffixes from the string, then sort them. Then use LCP (longest common prefix) to see what is the shared prefix between any two consecutive entries.

Together, we can use that to generate a list of all the common substrings. Then we start ranking them by how often they appear. Afterward, it is a matter of selecting the most frequent items that are also the largest, so our dictionary entries will be as useful as possible.
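A naive sketch of that step (quadratic string comparisons and all, just to show the shape of it; the real code works on byte arrays and is far more careful):

// Build a suffix array the naive way, then use the LCP between neighboring
// suffixes to collect repeated substrings and how often they show up.
static Dictionary<string, int> CandidateSubstrings(string corpus)
{
    var suffixes = Enumerable.Range(0, corpus.Length)
        .OrderBy(i => corpus.Substring(i), StringComparer.Ordinal)
        .ToArray();

    var candidates = new Dictionary<string, int>();
    for (int i = 1; i < suffixes.Length; i++)
    {
        // Longest common prefix of two consecutive suffixes.
        int a = suffixes[i - 1], b = suffixes[i], lcp = 0;
        while (a + lcp < corpus.Length && b + lcp < corpus.Length &&
               corpus[a + lcp] == corpus[b + lcp])
        {
            lcp++;
        }

        // Every prefix of the LCP is a repeated substring; count them all.
        for (int len = 2; len <= lcp; len++)
        {
            var substring = corpus.Substring(b, len);
            int count;
            candidates.TryGetValue(substring, out count);
            candidates[substring] = count + 1;
        }
    }
    return candidates;
}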

That gives us a list of potential entries:

','country':'
,'country':'
'country':'
','email':'
country':'
,'email':'
ountry':'
,'name':'
'email':'
untry':'
email':'
'name':'
ntry':'
name':'
mail':'
son','country':'
on','country':'
n','country':'
','country':'P
,'country':'P
{'id':
try':'
ame':'
ail':'
'country':'P
country':'P
ountry':'P
untry':'P
ntry':'P
ry':'
me':'
il':'
'id':
try':'P
ry':'P
y':'P
.mil'
y':'
n','
l':'
id':
e':'
eterson
terson
son@
mil'
':'P
erson
rson
erry
ason

One thing you can note here is that there are a lot of repeated strings. country appears in a lot of permutations, so we need to clear this up as well, remove all the entries that are overlapping, and then pack this into a final dictionary.

The dictionary resulting from the code above is:

asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'

This contains all the repeated strings that have been deemed valuable enough to put into the dictionary.

In my next post, I’ll talk about how to make use of this dictionary to actually handle compression.


Building a shared dictionary

This turned out to be a pretty hard problem. I wanted to do my own thing, but for reference, FemtoZip is considered to be the master source for such things.

The idea of a shared dictionary system is that you have a training corpus, from which you extract common elements, which you can then use to build a dictionary, which you’ll then be able to use to compress all the other data.

In order to test this, I generated 100,000 users using Mockaroo. You can find the sample data here: RandomUsers.

The data looks like this:

{"id":1,"name":"Ryan Peterson","country":"Northern Mariana Islands","email":"rpeterson@youspan.mil"},
{"id":2,"name":"Judith Mason","country":"Puerto Rico","email":"jmason@quatz.com"},
{"id":3,"name":"Kenneth Berry","country":"Pakistan","email":"kberry@wordtune.mil"},
{"id":4,"name":"Judith Ortiz","country":"Cuba","email":"jortiz@snaptags.edu"},
{"id":5,"name":"Adam Lewis","country":"Poland","email":"alewis@muxo.mil"},
{"id":6,"name":"Angela Spencer","country":"Poland","email":"aspencer@jabbersphere.info"},
{"id":7,"name":"Jason Snyder","country":"Cambodia","email":"jsnyder@voomm.net"},
{"id":8,"name":"Pamela Palmer","country":"Guinea-Bissau","email":"ppalmer@rooxo.name"},
{"id":9,"name":"Mary Graham","country":"Niger","email":"mgraham@fivespan.mil"},
{"id":10,"name":"Christopher Brooks","country":"Trinidad and Tobago","email":"cbrooks@blogtag.name"},
{"id":11,"name":"Anna West","country":"Nepal","email":"awest@twinte.gov"},
{"id":12,"name":"Angela Watkins","country":"Iceland","email":"awatkins@izio.com"},
{"id":13,"name":"Gregory Coleman","country":"Oman","email":"gcoleman@browsebug.net"},
{"id":14,"name":"Andrew Hamilton","country":"Ukraine","email":"ahamilton@rhyzio.info"},
{"id":15,"name":"James Patterson","country":"Poland","email":"jpatterson@skippad.net"},
{"id":16,"name":"Patricia Kelley","country":"Papua New Guinea","email":"pkelley@meetz.biz"},
{"id":17,"name":"Annie Burton","country":"Germany","email":"aburton@linktype.com"},
{"id":18,"name":"Margaret Wilson","country":"Saudia Arabia","email":"mwilson@brainverse.mil"},
{"id":19,"name":"Louise Harper","country":"Poland","email":"lharper@skinder.info"},
{"id":20,"name":"Henry Hunt","country":"Martinique","email":"hhunt@thoughtstorm.org"}

And what I want to do is to run over the first 1,000 records and extract a shared dictionary. Actually generating the dictionary is surprisingly hard. The first thing I tried is a prefix tree of all the suffixes. That is, given the following entries:

banana
lemon
orange

You would have the following tree:

  • b
    • ba
      • ban
        • bana
          • banan
            • banana
  • a
    • an
      • ana
        • anan
          • anana
      • ang
        • ange
  • n
    • na
      • nan
        • nana
    • ng
      • nge
  • l
    • le
      • lem
        • lemo
          • lemon
  • o
    • or
      • ora
        • oran
          • orang
            • orange
  • r
    • ra
      • ran
        • rang
          • range
  • g
    • ge
  • e

My idea was that this would allow me to easily find all the common substrings, and then rank them. But the problem is: how do I select the entries that are actually useful? That is the part where I gave up on my simple-to-follow-and-explain code and dived into the real science behind it. More on that in my next entry, but in the meantime, I would love it if someone could show me simple code to find the proper terms for the dictionary.
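For reference, a naive brute force version of that idea (counting repeated substrings and ranking them by frequency times length) might look like the sketch below; it finds candidates, but it does nothing about the overlap and selection problem, which is exactly where it falls short:

// Count every substring (up to a cap) across the sample entries and rank
// candidates by how much text they could save (count * length).
static List<KeyValuePair<string, int>> RankSubstrings(IEnumerable<string> entries, int maxLen)
{
    var counts = new Dictionary<string, int>();
    foreach (var entry in entries)
    {
        for (int start = 0; start < entry.Length; start++)
        {
            for (int len = 2; len <= maxLen && start + len <= entry.Length; len++)
            {
                var substring = entry.Substring(start, len);
                int count;
                counts.TryGetValue(substring, out count);
                counts[substring] = count + 1;
            }
        }
    }

    return counts
        .Where(kvp => kvp.Value > 1)                          // only repeated substrings
        .OrderByDescending(kvp => kvp.Value * kvp.Key.Length) // frequency * length score
        .ToList();
}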


Decompression code & Discussion

As I said, this is a good & small interview question. It is a good one because it is short and relatively simple to handle, but it should show a lot of things about your code. To start with, being faced with a non trivial task that most people are not that familiar with.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:

Prefix byte – composed of the highest bit + 7 bits len

If the highest bit is not set, this is uncompressed data; copy the next len bytes from the input to the output.

If the highest bit is set, this is compressed data; the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

I couldn’t resist doing this myself, and I came up with the following:

public void Decompress(Stream input, Stream output, byte[][] dictionary)
{
    var tmp = new byte[128];
    while (true)
    {
        var readByte = input.ReadByte();
        if (readByte == -1)
            break;
            
        var prefix = (byte) readByte;
        var compressed = (prefix & 0x80) != 0;
        var len = prefix & 0x7f;

        if (compressed == false)
        {
            while (len > 0)
            {
                var read = input.Read(tmp, 0, len);
                if(read == 0)
                    throw new InvalidDataException("Not enough data to read from compressed input stream");
                len -= read;
                output.Write(tmp, 0, read);
            }
        }
        else
        {
            readByte = input.ReadByte();
            if(readByte == -1)
                throw new InvalidDataException("Not enough data to read from compressed input stream");
            output.Write(dictionary[readByte], 0, len);
        }
    }
}

Things to pay attention to: Low memory allocations, error handling, and handling of partial reads from the stream.
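To see the protocol in action, here is a small usage sketch. The byte values are hand built to exercise both branches, and the Decompressor class name is just a placeholder for wherever the routine above lives:

// 0x05 -> literal, copy the next 5 bytes ("Hello")
// 0x86 -> high bit set, len = 6, next byte 0x00 -> copy 6 bytes from dictionary[0]
var input = new MemoryStream(new byte[]
{
    0x05, (byte)'H', (byte)'e', (byte)'l', (byte)'l', (byte)'o',
    0x86, 0x00
});
var output = new MemoryStream();
var dictionary = new[] { Encoding.UTF8.GetBytes(" world") };

new Decompressor().Decompress(input, output, dictionary);

Console.WriteLine(Encoding.UTF8.GetString(output.ToArray())); // Hello world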

But that is just part of the question. After reading the protocol and implementing it, the question turns to what the protocol says about this kind of compression scheme. The use of just 7 bits to store len drastically limits the compression utility in a general format. It also requires an external dictionary, which most compression formats don’t use; they use the actual compressed text itself as the dictionary. Of course, I’ve been reading compression algorithms for a while now, so that isn’t quite fair. But I would expect people to note that the 7 bit length limits the compression usability.

And hopefully, with a bit of a hint, they should note that the external dictionary is useful for small data sets where the repetitions are actually between entries, not within a single entry.

Your customer isn’t a single entity

An interesting issue came up in the comments for my modeling post.  Urmo is saying:

…there are no defined processes, just individual habits (even among people with same set of obligations) with loose coupling on the points where people need to interact. In these companies a software can be a boot that kicks them into more defined and organized operating mode.

This is part of the discussion of software modeling and the kind of thinking you have to do when you approach a system. The problem with Urmo’s approach is that there is a set of implicit assumptions: that the customer speaks with a single voice, that they actually know what they are doing, and that they have their own best interests in mind. Yes, it is really hard to create software (or anything, actually) without those, but the opposite happens more frequently than one might desire.

A few years ago I was working on software to manage what was essentially long term temp workers. Long term could be 20 years, and frequently was a number of years. The area in question was caring for invalids, and most of the end customers for that company were the elderly. That meant that a worker might suddenly no longer be required (the end customer died, and care was no longer needed).

Anyway, that is the back story. The actual problem we ran into was that by the time the development team got into place there was already a very detailed spec, written by a pretty good analyst after many sessions at a luxury hotel conference room. In other words, the spec cost a lot of money to generate, and involved a lot of people from the company’s management.

What it did not include, however, was feedback from the actual people who had to place the workers at particular people’s homes, and eventually pay them for their work. Little things like the 1st of the month (you have 100s of workers coming in to get their hours approved and get paid) weren’t taken into account. The software was very focused on the individual process, and there were a lot of checks to validate input.

What wasn’t there were things like: “How do I efficiently handle many applicants at the same time?”

The current process was paper form based, and they were basically going over the hours submitted, asking minimal questions, and provisionally approving them. Later on, they would do a more detailed scan of the hours, and do any fixups needed. That would also be the time that they would input the data into their old software. In other words, there was an entire messy process going on that the higher ups didn’t even realize was happening.

This includes decisions such as “you need an advance, we’ll register that as 10 extra hours you worked this month, and we’ll deduct it next month” and “you weren’t supposed to go to Mrs. Xyz, you were supposed to go to Mr. Zabc! We can’t pay for all your hours there”, etc.

When we started working on the software, we happened to do a demo to some of the on site people, and they were horrified by what they saw. The new & improved software would end up causing them many more issues, and it would actually result in more paperwork that they would have to manage just to make the software happy.

Modeling such things was tough, and at some point (with the client’s reluctant agreement) we essentially threw aside the hundreds of pages of well written spec, and just worked directly with the people who would end up using our software. The solution in the end was to codify many of the actual “business processes” that they were using. Those business processes made sense, and they were what kept the company working for decades. But management didn’t actually realize that they were working in this manner.

And that is leaving aside the “let us change the corporate structure through software” endeavors, which are unfortunately also pretty common.

To summarize: assuming that your client is a single entity, which speaks with one voice and actually knows what it is talking about? Not going to fly for very long. In another case, I had to literally walk a VP of Sales through the process of how a sale actually happens in his company versus what he thought was happening.

Sometimes this job is like playing a shrink, but for corporations.

Playing with compression

Compression is a pretty nifty tool to have, and it can save a lot in both I/O and overall time (CPU tends to have more cycles to spare than I/O bandwidth). I decided that I wanted to investigate it more deeply.

The initial corpus was all the orders documents from the Northwind sample database. And I tested this by running GZipStream over the results.

  • 753Kb - Without compression
  •   69Kb - With compression

That is a pretty big difference between the two options. However, this is when we compress all those documents together. What happens if we want to compress each of them individually? We have 830 orders, and the result of compressing all of them individually is:

  • 752Kb - Without compression
  • 414Kb - With compression

I did a lot of research into compression algorithms recently, and the reason for the difference is that when we compress more data together, we can get better compression ratios. That is because compression works by removing duplication, and the more data we have, the more duplication we can find. Note that we still manage to save a significant amount of space even when compressing each document on its own.
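For reference, this is roughly the kind of measurement code involved (a sketch, not the code I actually used; the documents collection is just the serialized orders as byte arrays):

// Compare compressing each document on its own vs. all of them together.
static long CompressIndividually(IEnumerable<byte[]> documents)
{
    long total = 0;
    foreach (var doc in documents)
    {
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(doc, 0, doc.Length);
            }
            total += ms.Length;
        }
    }
    return total;
}

static long CompressTogether(IEnumerable<byte[]> documents)
{
    using (var ms = new MemoryStream())
    {
        using (var gzip = new GZipStream(ms, CompressionMode.Compress, leaveOpen: true))
        {
            foreach (var doc in documents)
                gzip.Write(doc, 0, doc.Length);
        }
        return ms.Length;
    }
}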

What about smaller data sets? I created 10,000 records looking like this:

{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}

And gave it the same try (compressing each entry independently):

  • 550KB – Without compression
  • 715KB – With compression

The overhead of compression on small values is very significant. Just to note, compressing all the entries together results in a total of 150Kb.

So far, I don’t think that I’ve had any surprising revelations. That is pretty much exactly in line with my expectations. And that is the reason I’m actually playing with compression algorithms, rather than just using an off the shelf one.

This particular scenario is exactly where FemtoZip is supposed to help, and that is what I am looking at now. Ideally, what I want is a shared dictionary compression that allows me to manage the dictionary myself, and doesn’t have a static dictionary that is created once, although that seems a lot more likely to be the way we’ll go.


Modeling exercise: Flights & Travelers

I just got a really interesting customer inquiry, and I got their approval to share it. The basic problem is booking flights, and how to handle that.

The customer suggested something like the following:

{   //customers/12345
    "Name" : "John Doe",
    "Bookings" : [{
        "FlightId": "flights/1234",
        "BookingId": "1asifyupi",
        "Flight": "EA-4814",
        "From": "Iceland",
        "To" : "Japan",
        "DateBooked" : "2012/1/1"
    }]
}

{ // flights/1234
   "PlaneId": "planes/1234", // centralized miles flown, service history
   "Seats": [
       {
           "Seat": "F16",
           "BookedBy": "1asifyupi"
       }
   ]
}

But that is probably a… suboptimal way to handle this. Let us go over the types of entities that we have here:

  • Customers / Passengers
  • Flights
  • Planes
  • Booking

The key point here is that each of those is pretty independent. Note that for simplicity’s sake, I’m assuming that the customer is also the passenger (not true in many cases; a company may pay for your flight, so the company is the customer and you are the passenger).

The actual problem the customer is dealing with is that they have thousands of flights, tens or hundreds of thousands of seats and millions of customers competing for those seats.

Let us see if we can break it down into a model that can work for this scenario. Customers deserve their own documents, but I wouldn’t store the bookings directly in the customer document. There are many customers that fly a lot, and they are going to have a lot of bookings there. At the same time, there are many bookings that are made for a lot of people at the same time (an entire family flying).

That leaves the Customer’s document with data about the customer (name, email, phone, passport #, etc) as well as details such as # of miles traveled, the frequent flyer status, etc.

Now, we have the notion of flights and bookings. A flight is a (from, to, time, plane) tuple, which also holds the number of available seats. Note that we need to explicitly allow for overbooking, since that is a common practice for airlines.

There are several places where we have contention here:

  • When ordering, we want to overbook up to a certain limit.
  • When seating (usually 24 – 48 hours before the flight) we want to reserve seats.

The good thing about it is that we actually have a relatively small contention on a particular flight. And the way the airline industry works, we don’t actually need a transaction between creating the booking and taking a seat on the flight.

The usual workflows goes something like this:

  • A “reservation” is made for a particular itinerary.
  • That itinerary is held for 24 – 48 hours.
  • That itinerary is sent to the customer for approval.
  • Customer approve and a booking is made, flight reservations are turned into actual booked seats.

The good thing about this is that because a flight can have up to ~600 seats in it, we don’t really have to worry about contention on a single flight. We can just use normal optimistic concurrency and avoid more complex models. That means that we can just retry on concurrency errors and see where that leads us. The breaking of the actual order into reservation and booking also helps, since we don’t have to coordinate between the actual charge and the reservation on the flight.

Overbooking is handled by setting a limit of how much we allow overbooking, and managing the number of booked seats vs. reserved seats. When we look at the customer data, we show the customer document, along with the most recent orders and the stats. When we look at a particular flight, we can get pretty much all of its state from the flight document itself.
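To make that bookkeeping concrete, the flight document could look something like the sketch below (field names are illustrative, not a prescribed schema):

{ // flights/1234
    "PlaneId": "planes/1234",
    "From": "Iceland",
    "To": "Japan",
    "Departure": "2012/1/15 08:30",
    "TotalSeats": 600,
    "OverbookingLimit": 620,  // how far we allow reservations to go
    "ReservedSeats": 612,     // reservations held for 24 - 48 hours
    "BookedSeats": 580        // confirmed bookings with seat assignments
}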

And the plane’s stats are usually just handled via a map/reduce index on all the flights for that plane.

Now, in the real world, the situation is a bit more complex. We might give out 10 economy seats and 3 business seats to Expedia for a 2 months period, so they manage that, and we have partnership agreements with other airlines, and… but I think that this is a good foundation to start building this on.

The contracts of Lazy vs Task

There was a question about our use of Task<T> and Lazy<T> in RavenDB, and I thought that this is a subtle thing that deserves more than a short email.

The basic difference between Lazy<T> and Task<T> are the kind of contracts that they express.

  • Lazy<T> represents a potentially expensive operation that has been deferred. The promise given is that the lazy’s value will be generated the first time that it is needed.
  • Task<T> represents a potentially expensive operation that is currently executing. The promise is that the task’s value will be available on request, hopefully already there by the time you ask.

The major difference is when we actually start the operation. Most often, when we return a task, we return a reference to a scheduled / executing task, which will complete whether or not the task’s result is ever accessed. Conversely, a lazy whose value was never accessed is not something that will ever execute.

The use cases for them tend to be quite different.

Lazy<T> is used a lot of the time as a way to handle once and only once creation (basically, a safe singleton pattern), and for actually deferring the execution of work. Sometimes you can avoid having to do something, and it is easier to give a caller a lazy object and only have to pay for that additional work if they access the data.

Task<T> is primarily used to actually parallelize the work, so you’ll have it running while you are doing other stuff. Usually this is used for I/O, since that can be natively parallelized.

With RavenDB, we use Task<T> as the return type of all our async operations, allowing us to take advantage of the async nature of I/O. And we use Lazy<T> to set up a deferred call to the server. When someone actually accesses one of the lazy values, we have to provide the answer. At that point, we can go to the server with all the pending lazy operations, and save a lot of effort in making remote calls to the server.

In fact, in RavenDB 3.0, we have lazy support in the async API. That means that we have methods that return Lazy<Task<T>>, which means: Give me a deferred operation, that when required, will perform in an async manner. That gives me both the ability to combine requests to the server and avoid blocking up a thread while that I/O is in progress.
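A minimal sketch of what that composition looks like in plain C# (this is just the Lazy<Task<T>> idea, not RavenDB’s actual API):

static async Task LazyAsyncExample()
{
    // Nothing happens until .Value is read; once it is, the async work
    // starts and can be awaited without blocking a thread.
    var lazyQuery = new Lazy<Task<string>>(async () =>
    {
        await Task.Delay(100); // stand-in for the batched remote call
        return "result from server";
    });

    Console.WriteLine(lazyQuery.IsValueCreated); // False - nothing ran yet

    // First access starts the task; awaiting it does not block a thread.
    string result = await lazyQuery.Value;
    Console.WriteLine(result);
}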

Bisecting RavenDB

One of the results of expanding the RavenDB team is that we have to deal with a lot more pull requests. We have individual developers working on their features and submitting pull requests on a daily basis. That usually means that I have to merge anywhere between ten and thirty pull requests per day.

Given our test suite runtime, it isn’t really reasonable to run the full test suite after each pull request, so we merge them all, review them, and then run the tests. The problem then is that if there was a test failure… which pull request, and which commit, caused it?

That is why git has the wonderful git bisect command. And we have taken advantage of that to enable the following development time feature:

./bisect.ps1 HEAD HEAD~10 General_Initialized_WithoutErrors

It will output something like:

631caa60719b1a65891eec37c4ccb4673eb28d02 is the first bad commit
commit 631caa60719b1a65891eec37c4ccb4673eb28d02
Author: Pawel Pekrol
Date: Fri May 9 08:57:55 2014 +0200
another temp commit

Pinpointing exactly where the issue is.

And yes, this is here as a post primarily because I might forget about it.

C# Async programming pitfall

I really like the TPL, and I really like the async/await syntax. It is drastically better than any other attempt I’ve seen to handle concurrency.

But it also has a really major issue. There is no way to properly debug things. Imagine that I have the following code:

public Task DoSomething()
{
    return new TaskCompletionSource<object>().Task;
}

And now imagine that I have some code that is going to do an await DoSomething();

What is going to happen? This is a never ending task, so we’ll never return. And that is fine, except that there is absolutely no way to see that. I have no way of seeing which task didn’t return, and I have no way of seeing all the pending tasks, and what they are all waiting for, etc. I’ve run into something like that (obviously a lot harder to figure out) too many times.

If I were using non-async code, it would be obvious that there is this thread that is stopped on this thing, and I could figure it out. With async, this makes it a lot harder to work with.


Corax: The problem of space

Corax is a research project that we have, to see how we can build a full text search library on top of Voron. Along the way, we take the chance to find out how Lucene does things, and what we can do better. Pretty much from the get go, Corax is likely to use more disk space than Lucene, probably significantly so. I would be happy if we could get away with a mere 50% increase over Lucene.

The reason this is the case is that Lucene goes to great lengths to save disk space: from storing all integers in variable length format, to prefix compression, to implicitly referencing data in other files. For example, you can see that when you try reading term positions:

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

The downside of saving every little bit is that it is a lot more complex to read the data, requiring multiple caches and complex code paths to actually get it out properly. It made a lot of sense when Lucene was created: disk space was at a premium. I won’t go as far as to say that disk space doesn’t matter, but given a trade off of using more disk space vs. using more memory / complexity, it is much easier to justify disk space usage today*.

* The caveat here is that you need to be careful, because just accessing the disk can be very slow.

One of the major things that we wanted to deal with in Corax is reducing index corruption issues, and seeing if we can simplify things into a transactional system. As a side effect of that, we don’t need to have index segments, and we don’t need to do merges to free disk space. The problem is that in order to handle this, we need to track additional information that Lucene doesn’t need to.

Let us look at the actual data we keep. Here is a very simple index:

using (var fullTextIndex = new FullTextIndex(StorageEnvironmentOptions.CreateMemoryOnly(), new DefaultAnalyzer()))
{
    using (var indexer = fullTextIndex.CreateIndexer())
    {
        indexer.NewIndexEntry();
        indexer.AddField("Name", "Oren Eini");
        indexer.AddField("Email", "Ayende@ayende.com");

        indexer.NewIndexEntry();
        indexer.AddField("Name", "Arava Eini");
        indexer.AddField("Email", "Arava@houseof.dog");

        indexer.Flush();
    }
}

For each field, we are going to create a multi tree. And for each unique term in the field we have a list of (Index Entry Id, term frequency, boost).

  • @fld_Name
    • arava
      • { 2, 1, 1.0 }  (index id 2, freq 1, boost 1.0)
    • eini
      • { 1,1,1.0 }
      • { 2,1,1.0 }
    • oren
      • { 1,1,1 }
  • @fld_Email
    • arava@houseof.dog
      • { 2,1,1.0 }
    • ayende@ayende.com
      • { 1,1,1.0 }

This is pretty much the equivalent of the way Lucene stores things. Possible space optimizations here include not storing default values (term freq or boost of 1), storing index entry ids as variable length ints, etc.

The problem is that while this is actually enough for the way Lucene does things, it is not enough for the way Corax does things. Let us consider the case of deleting a document. How would you go about doing this using the information above?

Lucene does this by marking a document id as deleted, and will purge its details on the next segment merge. That works, but only because a segment merge actually reads & writes all of the relevant segments’ data. Without a segment merge, deleting a document would require us to scan all the data in the entire database. That is not really practical. Therefore, we need to store additional data so we can delete documents later on. In this case, we have the Docs tree, which has keys of (index entry id, field id, term num). This looks like this:

  • Docs
    • [1,1,1]: oren
    • [1,1,2]: eini
    • [1,2,1]: ayende@ayende.com
    • [2,1,1]: arava
    • [2,1,2]: eini
    • [2,2,1]: arava@houseof.dog

Using this information, we can now remove all traces of a document when it is deleted. However, the problem here is that we also need to keep the terms per document in the index. That really blows up the index size, obviously.
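A toy model of that delete path, using plain in-memory dictionaries instead of Voron trees (so the shapes are visible without the real storage API):

// Field "trees" map term -> set of index entry ids, and the Docs "tree"
// maps (entryId, fieldId, termNum) -> term.
static void DeleteIndexEntry(
    long entryId,
    Dictionary<int, string> fieldNames,
    Dictionary<string, Dictionary<string, HashSet<long>>> fieldTrees,
    Dictionary<Tuple<long, int, int>, string> docs)
{
    // Find every (field, term) this entry contributed, via the Docs tree.
    var keys = docs.Keys.Where(k => k.Item1 == entryId).ToList();
    foreach (var key in keys)
    {
        var fieldName = fieldNames[key.Item2];
        var term = docs[key];

        fieldTrees[fieldName][term].Remove(entryId); // drop the posting
        docs.Remove(key);                            // and the bookkeeping entry
    }
}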

The reason for storing the document fields in this peculiar manner is that we also want to reuse this information for sorting. When Lucene needs to sort data, it has to read all of the data from the fields, then recreate the values for all relevant documents. Corax can just serve the data already there.

A pretty obvious step to save space would be to track the terms separately, and use an id in the Docs tree, not the full term. That leads to an interesting problem, because we are going to need to be able to go from term –> id and id –> term, which pretty much requires storing them twice, unfortunately.

Final note, Corax is a research project.

The nice thing about working in a large team…

There is a lot of stuff that is hard to do when you are working on a large team. But the really nice thing is the velocity with which you can move.

I just started the morning with the following commands:

[Image: console screenshot of the morning’s commands, merging several pull requests]

And I have another PR pending for a different branch that I’m going to have to look at. Overall, I think that this is a pretty cool thing to have. We can push forward in many directions at once, and it can be pretty awesome to look at all the good things that are coming our way.


The Corax Experiment: API

I posted before about a design practice for how I would approach building a search engine library. I decided to bite the bullet and actually try to do this. Using Voron, that turned out to be a really simple thing to do. Of course, this doesn’t do a tenth of what Lucene does, but it actually does quite a lot. The code is available here, and I want to emphasize again, this is purely an experimental / research project.

The entire thing comes to less than 500 lines of code. And it is pretty functional even at this stage.

Corax is composed of:

  • Analysis
  • Indexing
  • Querying

Analysis of the documents is handled via analyzers:

public interface IAnalyzer
{
    ITokenSource CreateTokenSource(string field, ITokenSource existing);
    bool Process(string field, ITokenSource source);
}

An analyzer creates a token source, which accepts a TextReader and produces tokens. For each token, the Process method is called, and it is used to do things to the relevant token. For example, here is the default analyzer:

public class DefaultAnalyzer : IAnalyzer
{
    readonly IFilter[] _filters =
    {
        new LowerCaseFilter(),
        new RemovePossesiveSuffix(),
        new StopWordsFilter(),
    };

    public ITokenSource CreateTokenSource(string field, ITokenSource existing)
    {
        return existing ?? new StringTokenizer();
    }

    public bool Process(string field, ITokenSource source)
    {
        for (int i = 0; i < _filters.Length; i++)
        {
            if (_filters[i].ProcessTerm(source) == false)
                return false;
        }
        return true;
    }
}

The idea here is to match, fairly closely, what Lucene is doing, but hopefully with clearer code. This analyzer will take a stream of text, break it up into discrete tokens, lower case them, remove the possessive ’s suffix and clear out stop words. Note that each of the filters actually modifies the token in place. And the tokenizer is pretty simple, but it does the job for now.

Now, let us move to indexing. With Lucene, the recommendation is that you reuse your document and field instances, to avoid creating garbage for the GC. With Corax, I took it a step further:

using (var indexer = fullTextIndex.CreateIndexer())
{
    indexer.NewDocument();
    indexer.AddField("Name", "Oren Eini");

    indexer.NewDocument();
    indexer.AddField("Name", "Ayende Rahien");

    indexer.Flush();
}

There are a couple of things to note here. An index can create indexers, it is intended to have multiple concurrent indexers running at the same time. Note that unlike Lucene, we don’t have Document or Field classes. Instead, we call methods on the indexer to create a new document and then add fields to the current document. When you are done with a document, you start a new one, or flush to complete the entire operation. For long running indexing, the indexer will flush itself automatically for you.

I think that this API gives us the best approach. It guides you toward using a single instance, with internal optimizations to make it memory efficient. Multiple instances can be used concurrently to speed up indexing time. And it knows when to flush itself for you, so you don’t have to worry about that. Although you do have to complete the operation by calling Flush() at the end.

How about searching? That turned out to be pretty similar as well. All you have to do is create a searcher:

using (var searcher = fti.CreateSearcher())
{
    QueryResults queryResults = searcher.QueryTop(new TermQuery("Name", "Arava"), 10);
    Console.WriteLine(queryResults.TotalResults);
    foreach (var match in queryResults.Results)
    {
        Console.WriteLine(match.DocumentId + " - " + match.Score);
    }
}

We create a searcher, and then we can utilize it to perform queries.

So far, this has been all about the API we have; I’ll talk about the actual implementation in my next post.

Design practice: Building a search engine library

Note: This is done purely as a design practice. We don’t have any current plans to implement this, but I find that it is a good exercise in general.

How would I go about building a search engine for RavenDB to replace Lucene? Well, we have Voron as the basis for storage, so from the get go, we have several interesting changes. To start with, we inherit the transactional properties of Voron, but more importantly, we don’t have to do merges, so we don’t have any of those issues. In other words, we can actually generate stable document ids.

But I’m probably jumping ahead a bit. We’ll start with the basics. Analysis / indexing is going to be done in very much the same way. Users will provide an analyzer and a set of documents, like so:

var index = new Corax.Index(storageOptions, new StandardAnalyzer());

var scope = index.CreateIndexingScope();

long docId = scope.Add(new Document
{
    {"Name", "Oren Eini", Analysis.Analyzer},
    {"Name", "Ayende Rahien", Analysis.Analyzer},
    {"Email", "ayende@ayende.com", Analyzed.AsIs, Stored.Yes}
});

docId = scope.Add(new Document
{
    {"Name", "Arava Eini", Analysis.Analyzer},
    {"Email", "arava@doghouse.com", Analyzed.AsIs, Stored.Yes}
});

index.Submit(scope);

Some notes about this API. It is modeled quite closely after the Lucene API, and it would probably need additional work. The idea here is that you are going to need to get an indexing scope, which is single-threaded, and you can have multiple indexing scopes running at the same time. You can batch multiple writes into a single scope, and it behaves like a transaction.

The idea is to confine all of the work associated with indexing a document to a single-threaded unit of work, which makes things easier for us. Note that we immediately get the generated document id, but the document will only be available for searching when you have submitted the scope.

Under the hood, this does all of the work at the time of calling Add(). The analyzer will run on the relevant fields, and we will create the appropriate entries. How does that work?

Every document has a long id associated with it. In Voron, we are going to have a ‘Documents’ tree, with the long id as the key, and the value is going to be all the data about the document we need to keep. For example, it would have the stored fields for that document, or the term positions, if required, etc. We’ll also have a Fields tree, which will have a mapping of all the field names to an integer value.

Of more interest is how we are going to deal with the terms and the fields. For each field, we are going to have a tree called “@Name”, “@Email”, etc. Those are multi trees, with the keys in that tree being the terms, and the multi values being the document ids that have those terms. In other words, the code above is going to generate the following data in Voron:

  • Documents tree:
    • 0 – { “Fields”: { 1: “ayende@ayende.com” } }
    • 1 – { “Fields”: { 1: “arava@doghouse.com” } }
  • Fields tree
    • 0 – Name
    • 1 – Email
  • @Name tree
    • ayende – [ 0 ]
    • arava – [ 1 ]
    • oren – [ 0 ]
    • eini – [ 0, 1 ]
    • rahien – [ 0 ]
  • @Email tree
    • ayende@ayende.com – [ 0 ]
    • arava@doghouse.com – [ 1 ]
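
To illustrate that layout (and only to illustrate it; plain dictionaries stand in for the Voron trees here, this is not the actual implementation), adding analyzed terms would look roughly like this:

// @Name, @Email, ... : term -> posting list of document ids
var postings = new Dictionary<string, Dictionary<string, List<long>>>();
// Fields tree: field name -> integer field id
var fields = new Dictionary<string, int>();

Action<long, string, string> addTerm = (docId, field, term) =>
{
    if (fields.ContainsKey(field) == false)
        fields[field] = fields.Count;

    Dictionary<string, List<long>> tree;
    if (postings.TryGetValue("@" + field, out tree) == false)
        postings["@" + field] = tree = new Dictionary<string, List<long>>();

    List<long> docIds;
    if (tree.TryGetValue(term, out docIds) == false)
        tree[term] = docIds = new List<long>();

    docIds.Add(docId);
};

// The analyzer output for document 0 above would translate to:
addTerm(0, "Name", "oren");
addTerm(0, "Name", "eini");
addTerm(0, "Name", "ayende");
addTerm(0, "Name", "rahien");
addTerm(0, "Email", "ayende@ayende.com");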

Given this model, we can now turn to the actual querying part of the business. Let us assume that we have the following query:

var queryResults = index.Query("Name: Oren OR Email: ayende@ayende.com");
foreach (var result in queryResults.Matches)
{
    Console.WriteLine("{0}: {1}", result.Id, result["Email"]);
}

How does querying work? The query above would actually be something like:

new BooleanQuery(
    Match.Or,
    new TermQuery("Name", "Oren"),
    new TermQuery("Email", "ayende@ayende.com")
    )

The actual implementation of this query would just access the relevant field tree and load the values under the particular key. The result is a set of document ids for each part of the query, and since we are using an OR here, we will just union them and return the list of results.
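
As a sketch (GetDocumentIds is an assumed helper that reads the posting list for a term out of the relevant @Field tree), the OR evaluation is just a set union:

IEnumerable<long> nameMatches  = GetDocumentIds("@Name", "oren");
IEnumerable<long> emailMatches = GetDocumentIds("@Email", "ayende@ayende.com");

var results = new HashSet<long>(nameMatches);
results.UnionWith(emailMatches);   // OR is a union; an AND would use IntersectWith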

Optimizations here can be to run those queries in parallel, or to just cache the results of a particular term query, so we can speed things up even more. This looks simple, and it is, because a lot of the work has already been done in Voron. Searching is pretty much already there; we only need to tell it what to search with.

My distributed build system

Yes, I know that you are probably getting geared up to hear about some crazy setup, and in some manner, it is crazy. My distributed build system is this:

[Photo: IMG_20140402_103433]

Yep, that is me manually distributing the build to do a cross check on a reasonable time frame.

I’ve mentioned before that our build time is atrocious. With over 3,500 tests, it is no longer feasible to just run them normally. So we started parallel efforts (pun intended) to reduce the time it takes the system to build, to reduce individual test times, and to improve our ability to parallelize things.

We are actually at the point where we can run concurrent tests, even those we previously had to distribute. And we can even do that with replicated tests, which are the ones usually taking the longest. But what we’ll probably have in the end is just a bunch of test projects (currently we have ~5) that we will run on different machines at the same time.

We are using TeamCity as our build system, and I know that it has capabilities in this regard, but I’ve never actually looked into them. We’re going to be pretty busy in the next couple of weeks, so I thought that this would be a good time to ask.

My current thinking is:

  • One build that would actually do the compilation, etc.
  • Then branch off that to a build per test project, which can run on different agents
  • Then a build on top of that that can actually deploy the build.

Any recommendations on that? We are going to need 5 – 7 agents to run the tests in parallel. Any recommendations on how to get those? Ideally I would like to avoid having to pay for those machines to sit idle 60% – 80% of the time.

Any other alternatives?

Maximum addressable virtual memory

I wonder what it says about what I am doing right now that I really wish the OS gave me more control over virtual memory allocation. At any rate, the point of this post is to point out something quite important to people writing databases, especially databases that make use of virtual memory a lot.

There isn’t quite as much of it as I thought there would be. Oh, on 32 bits the 4GB limit is really hard to swallow. On 64 bits the situation is much better, but still constrained.

On Windows, using x64, you are actually limited to merely 8TB of address space. In Windows 8.1 (and I assume, but couldn’t verify, Windows Server 2012 R2) you can use up to 128TB of virtual address space. With Linux, at least since 2.6.32, and probably earlier, the limit is 128TB per process.
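
If you want to see the limit on your own machine, Windows exposes it via GetSystemInfo. Here is a quick P/Invoke sketch (error handling omitted):

using System;
using System.Runtime.InteropServices;

class AddressSpace
{
    [StructLayout(LayoutKind.Sequential)]
    struct SYSTEM_INFO
    {
        public ushort wProcessorArchitecture;
        public ushort wReserved;
        public uint dwPageSize;
        public IntPtr lpMinimumApplicationAddress;
        public IntPtr lpMaximumApplicationAddress; // top of the usable user-mode address space
        public IntPtr dwActiveProcessorMask;
        public uint dwNumberOfProcessors;
        public uint dwProcessorType;
        public uint dwAllocationGranularity;
        public ushort wProcessorLevel;
        public ushort wProcessorRevision;
    }

    [DllImport("kernel32.dll")]
    static extern void GetSystemInfo(out SYSTEM_INFO info);

    static void Main()
    {
        SYSTEM_INFO info;
        GetSystemInfo(out info);
        // Just under 8TB on x64 Windows before 8.1, just under 128TB on 8.1 and later
        Console.WriteLine("Max application address: 0x{0:X}",
            info.lpMaximumApplicationAddress.ToInt64());
    }
}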

The implication for Voron, by the way, is that the total size of all databases in a single process can be up to 8TB (probably somewhat less than that, since the other stuff will also need memory). Currently the biggest RavenDB database that I am aware of was approaching the 1.5 – 2.0 TB mark last I checked (several months ago), and Esent, our current technology, is limited to 16TB per database.

So it isn’t great news, but it is probably something that I can live with. And at least I can give proper recommendations. In practice, I don’t think that this would be an issue. But that is good to know.


Big Data Search: Sorting randomness

As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4GB of random numbers, let us find how many times we have the number 1.

Because I wanted to ensure a consistent answer, I wrote:

public static IEnumerable<uint> RandomNumbers()
{
    const long count = 1024 * 1024 * 1024L * 1;
    var random = new MyRand();
    for (long i = 0; i < count; i++)
    {
        if (i % 1024 == 0)
        {
            yield return 1;
            continue;
        }
        var result = random.NextUInt();
        while (result == 1)
        {
            result = random.NextUInt();
        }
        yield return result;
    }
}

/// <summary>
/// Based on Marsaglia, George. (2003). Xorshift RNGs.
/// http://www.jstatsoft.org/v08/i14/paper
/// </summary>
public class MyRand
{
    const uint Y = 842502087, Z = 3579807591, W = 273326509;
    uint _x, _y, _z, _w;

    public MyRand()
    {
        _y = Y;
        _z = Z;
        _w = W;

        _x = 1337;
    }

    public uint NextUInt()
    {
        uint t = _x ^ (_x << 11);
        _x = _y; _y = _z; _z = _w;
        return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
    }
}

I am using a custom Rand function because it is significantly faster than System.Random. This generates 4GB of random numbers, and also ensures that we get exactly 1,048,576 instances of 1. Consuming the sequence in an empty loop takes about 30 seconds on my machine.
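
For reference, the expected answer can also be checked with a trivial single pass over the RandomNumbers() sequence above, without ever holding the 4GB in memory:

long ones = 0;
foreach (var n in RandomNumbers())
{
    if (n == 1)
        ones++;
}
Console.WriteLine("{0:#,#} ones", ones);   // should print 1,048,576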

For fun, I ran the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256MB in size, we flush it every 128MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random numbers, sorting then compressing the values resulted in about a 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for a real world scenario it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items, and that takes time. Implementing a parallel sort cut this down to less than a minute per batch of 26 million items.

That lets us complete the entire process, but then it stalls on the merge. The reason for that is that we push all the values through a heap, and there are 1 billion of them. Now, the heap never exceeds 40 items, but that is still 1 billion * O(log 40), or about 5.4 billion comparisons, that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallelize this, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them.
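
For reference, here is a minimal sketch of that sequential merge (not the actual code from the repository linked below): one sorted sequence per intermediate file, a small min-heap holding one head value per source, and a single loop that pops roughly a billion times in total.

public static class KWayMerge
{
    public static IEnumerable<uint> Merge(IList<IEnumerator<uint>> sources)
    {
        // heap of (value, source index); never holds more than sources.Count entries
        var heap = new List<KeyValuePair<uint, int>>();

        Action<KeyValuePair<uint, int>> push = item =>
        {
            heap.Add(item);
            int i = heap.Count - 1;
            while (i > 0)
            {
                int parent = (i - 1) / 2;
                if (heap[parent].Key <= heap[i].Key) break;
                var tmp = heap[parent]; heap[parent] = heap[i]; heap[i] = tmp;
                i = parent;
            }
        };

        // seed the heap with the first value from each sorted source
        for (int i = 0; i < sources.Count; i++)
            if (sources[i].MoveNext())
                push(new KeyValuePair<uint, int>(sources[i].Current, i));

        while (heap.Count > 0)
        {
            // pop the current minimum and sift down
            var min = heap[0];
            heap[0] = heap[heap.Count - 1];
            heap.RemoveAt(heap.Count - 1);
            int j = 0;
            while (true)
            {
                int left = 2 * j + 1, right = 2 * j + 2, smallest = j;
                if (left < heap.Count && heap[left].Key < heap[smallest].Key) smallest = left;
                if (right < heap.Count && heap[right].Key < heap[smallest].Key) smallest = right;
                if (smallest == j) break;
                var tmp = heap[j]; heap[j] = heap[smallest]; heap[smallest] = tmp;
                j = smallest;
            }

            yield return min.Key;

            // refill from the source that produced the minimum
            if (sources[min.Value].MoveNext())
                push(new KeyValuePair<uint, int>(sources[min.Value].Current, min.Value));
        }
    }
}

With 40 sources the heap itself is tiny; the cost is purely the number of sequential pops, which is why the merge is the part that drags.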

Obviously we can merge each set of 10 files in parallel, then merge the resulting 4, but the cost we have now is the actual merge cost, not I/O. I am not sure how to approach this.

For what it is worth, you can find the code for this here.


Big Data Search: Sorting Optimizations

I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here.

In no particular order, I think that there are at least several things that we could do to significantly improve the time it takes to sort. Right now we define 2 indexes on top of a 1GB file, and it takes under 1 minute to complete. Extrapolating linearly (roughly 15,360 one-minute batches), that gives us a runtime of about 10 days over a 15TB file.

Well, one of the reasons for this performance is that we execute this in a serial fashion, that is, one index after another. But we have two completely isolated indexes, so there is no reason why we can’t parallelize the work between them.

For that matter, we are buffering in memory up to a certain point, then we sort, then we buffer some more, etc. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and filling a fresh buffer while the previous one is being sorted.
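
A minimal sketch of that idea (hypothetical names, not the actual implementation): one batch is sorted and handed off on a background task while the caller keeps filling the next one.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DoubleBufferedSorter
{
    // Sorts full batches on a background task while the caller keeps feeding values,
    // then hands each sorted batch to 'writeBatch' (e.g. write an intermediate file).
    public static void Run(IEnumerable<long> values, int bufferSize, Action<List<long>> writeBatch)
    {
        var current = new List<long>(bufferSize);
        Task pending = Task.FromResult(0); // already-completed placeholder

        foreach (var value in values)
        {
            current.Add(value);
            if (current.Count < bufferSize)
                continue;

            pending.Wait(); // allow at most one sort/write to run behind us
            var full = current;
            current = new List<long>(bufferSize);
            pending = Task.Run(() =>
            {
                full.Sort();
                writeBatch(full);
            });
        }

        pending.Wait();
        if (current.Count > 0)
        {
            current.Sort();
            writeBatch(current);
        }
    }
}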

We wrote to intermediary files, but we did so using plain file I/O. It is usually a lot more costly to write to disk than to compress and then write to disk, and we are writing sorted data, so it is probably going to compress pretty well.
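
The change itself is small; here is a sketch with GZipStream (chosen just as an example compressor, and with path and sortedBatch as placeholder names; requires System.IO and System.IO.Compression):

using (var file = File.Create(path))
using (var gzip = new GZipStream(file, CompressionMode.Compress))
using (var writer = new BinaryWriter(gzip))
{
    foreach (var value in sortedBatch)   // the already-sorted buffer
        writer.Write(value);
}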

Those are the things that pop to mind. Can you think of additional options?
