Ayende @ Rahien

Refunds available at head office

The pitfalls of transparent security

A long time ago, I needed to implement a security subsystem for an application. I figured that it was best to make the entire security subsystem transparent to the developer, under the assumption that if the infrastructure took care of security, it would make a lot more sense than relying on the developer to remember to add the security calls.

It took me a long while to realize how wrong that decision was. Security is certainly important, but security doesn’t apply to the system itself. In other words, while a specific user may not be allowed to read/write to the audit log, actions that the user made should still be written to that log. That is probably the simplest case, but there are a lot of similar ones.

Ever since then, I favored using an explicit approach:

var books = session.CreateQuery("from Book")
                        .ThatUserIsAllowedToRead(CurrentUser)
                        .List<Book>();

This also helps you implement more interesting features, such as “on behalf of”. And yes, it does put the onus of security on the developer, but considering the alternative, that is a plus.

Challenge: Find the bug

The following code contains a bug that would only occur in rare situations; can you figure out what the bug is?

[Image: the code snippet containing the bug]

NHibernate: Streaming large result sets

Note: I am not feeling very well for the past week or so, which is why I am posting so rarely.

NHibernate is meant to be used in an OLTP system; as such, it is usually used in cases where we want to load a relatively small amount of data from the database, work with it and save it back. For reporting scenarios, there are better alternatives, usually (and before you ask, any reporting package will do. Right tool for the job, etc.).

But there are cases where you want to use NHibernate in reporting scenarios nonetheless. Maybe because the reporting requirements aren’t enough to justify going to a separate tool, or because you want to use what you already know. It is in those cases where you tend to run into problems, because you violate the assumptions that were made while building NHibernate.

Let us imagine the following use case: we want to print a list of book names to the user:

using (ISession s = OpenSession())
{
    var books = s.CreateQuery("from Book")
        .List<Book>();

    foreach (var book in books)
    {
        Console.WriteLine(book.Name);
    }
}

There are several problems here:

  • We query on a large result set without a limit clause.
  • We read a lot of data into memory.
  • We only start processing the data after it was completely read from the database.

What I would like to see is something like this:

while(dataReader.Read())
{
     Console.WriteLine(dataReader.GetString("Name"));
}

This still suffers from the problem of reading a large result set, but we will consider that a part of our requirements, so we’ll just have to live with it. The data reader code has two major advantages: it uses very little memory, and we can start processing the data as soon as the first row loads from the database.

How can we replicate that with NHibernate?

Well, as usual with NHibernate, it is only a matter of finding the right extension point. In this case, the List method on the query also has an overload that accepts an IList parameter:

[Image: the List overload on the query that accepts an IList parameter]

That makes it as simple as providing our own IList implementation:

public class ActionableList<T> : IList
{
    private Action<T> action;

    public ActionableList(Action<T> action)
    {
        this.action = action;
    }

    public int Add(object value)
    {
        action((T)value);
        return -1;
    }

    public bool Contains(object value)
    {
        throw new NotImplementedException();
    }

    // The remaining IList members are never called by List(IList), so they can simply throw as well.
    // ...
}

And now we can call it:

using (ISession s = OpenSession())
{
    var books = new ActionableList<Book>(book => Console.WriteLine(book.Name));
    s.CreateQuery("from Book")
        .List(books);

}

This will have the exact same effect as the previous NHibernate code, but it will start printing the results as soon as the first result loads from the database. We still have the problem of memory consumption, though. The session will keep track of all the loaded objects, and if we load a lot of data, it will eventually blow up with an out of memory exception.

Luckily, NHibernate has a ready made solution for this, the stateless session. The code now looks like this:

using (IStatelessSession s = sessionFactory.OpenStatelessSession())
{
    var books = new ActionableList<Book>(book => Console.WriteLine(book.Name));
    s.CreateQuery("from Book")
        .List(books);

}

The stateless session, unlike the normal NHibernate session, doesn’t keep track of loaded objects, so the code here and the data reader code are essentially the same thing.

Challenge: Dynamically dynamic

Can you figure out a way to write the following code without using a try/catch?

class Program
{
    static void Main(string[] args)
    {
        dynamic e = new ExpandoObject();
        e.Name = "Ayende";

        Console.WriteLine(HasProperty("Name", e));
        Console.WriteLine(HasProperty("Id", e));
    }

    private static bool HasProperty(string name, IDynamicMetaObjectProvider dyn)
    {
        try
        {
            var callSite =
                CallSite<Func<CallSite, object, object>>.Create(
                    Binder.GetMember(CSharpBinderFlags.None, name, typeof (Program),
                                     new[]
                                     {
                                         CSharpArgumentInfo.Create(
                                             CSharpArgumentInfoFlags.None, null)
                                     }));
            callSite.Target(callSite, dyn);
            return true;
        }
        catch (RuntimeBinderException)
        {

            return false;
        }
    }
}

The HasProperty method should accept any IDynamicMetaObjectProvider implementation, not just ExpandoObject.

Modeling hierarchical structures in RavenDB

The question pops up frequently enough and is interesting enough for a post. How do you store a data structure like this in Raven?

The problem here is that we don’t have enough information about the problem to actually give an answer. That is because when we think of how we should model the data, we also need to consider how it is going to be accessed. In more precise terms, we need to define what is the aggregate root of the data in question.

Let us take the following two examples:

[Images: the Person and Question class models]

As you can imagine, a Person is an aggregate root. It can stand on its own. I would typically store a Person in Raven using one of two approaches:

Bare references:

{
  "Name": "Ayende",
  "Email": "Ayende@ayende.com",
  "Parent": "people/18",
  "Children": [
        "people/59",
        "people/29"
  ]
}

Denormalized references:

{
  "Name": "Ayende",
  "Email": "Ayende@ayende.com",
  "Parent": { "Name": "Oren", "Id": "people/18"},
  "Children": [
        { "Name": "Raven", "Id": "people/59"},
        { "Name": "Rhino", "Id": "people/29"}
  ]
}

The first option is bare references, just holding the id of the associated document. This is useful if I only need to reference the data very rarely. If, however, (as is common), I need to also show some data from the associated documents, it is generally better to use denormalized references, which keep the data that we need to deal with from the associated document embedded inside the aggregate.
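In C# terms, the denormalized option usually ends up looking something like the classes below. These are purely illustrative (nothing RavenDB generates for you); the point is that the reference type carries the id plus just the display data the aggregate needs:

// Illustrative only: a denormalized reference holds the id of the related
// document plus whatever we need to show without loading it.
public class PersonReference
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class Person
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }

    // A bare reference would be just: public string Parent { get; set; }
    public PersonReference Parent { get; set; }
    public List<PersonReference> Children { get; set; }
}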

But the same approach wouldn’t work for Questions. In the Question model, we have utilized the same data structure to hold both the question and the answer. This sort of double utilization is pretty common, unfortunately. For example, you can see it being used in StackOverflow, where both Questions & Answers are stored as posts.

The problem from a design perspective is that in this case a Question is not a root aggregate in the same sense that a Person is. A Question is a root aggregate if it is an actual question, not if it is a Question instance that holds the answer to another question. I would model this using:

{
   "Content": "How to model relations in RavenDB?",
   "User": "users/1738",
   "Answers" : [
      {"Content": "You can use.. ", "User": "users/92" },
      {"Content": "Or you might...", "User": "users/94" },
   ]
}

In this case, we are embedding the children directly inside the root document.

So I am afraid that the answer to that question is: it depends.

The cost of money

This is just some rambling about the way the economy works, it has nothing to do with tech or programming. I just had to sit down recently and do the math, and I am pretty annoyed by it.

The best description of how the economy works that I ever heard was in a Terry Pratchett book; it is called Captain Vimes’ Boots Theory of Money. Stated simply, it goes like this:

A good pair of boots costs 50$, and they last for 10 years and keep your feet warm. A bad pair of boots costs 10$ and lasts only a year or two. After 10 years, the cheap boots have cost twice as much as the good boots, and your feet are still cold!

The sad part about that is that this theory is quite true. Let me outline two real world examples (from Israel, numbers are in Shekels).

Buying a car is expensive, so a lot of people opt for a leasing option. Here are the numbers associated with this (real world numbers):

                                          Buying car outright    Leasing
Upfront payment                           120,000                42,094.31
Monthly payment (36 payments)             0                      1,435.32
Buying the car after 3 yrs (optional)     0                      52,039.67

The nice part of going with a leasing contract is that you need so much less money upfront (only about a third), and the payments are pretty low. The problem starts when you try to compare costs on more than just how much money you are paying out of pocket.

Let us see what is going to happen in three years’ time, when we want to switch to a new car.

                      Buying car outright    Leasing
Upfront payment       120,000.00             42,094.31
Total payments        0.00                   51,671.52
Selling the car       -80,000.00             0.00
Total cost            40,000.00              93,765.83

With the upfront payment, we can actually sell the car to recoup some of our initial investment. With the leasing option, at the end of the three years, you are out 93,765.83 and have nothing to show for it.

Total cost of ownership for the leasing option is over twice as much as the upfront payment option.

Buying an apartment is usually one of the biggest expenses most people make in their lives. The cost of an apartment/house in Israel is typically over a decade of a person’s salary. Israel’s real estate is in a funky state at the moment, being one of the only places in the world where the prices keep going up. Here are some real numbers:

  • Avg. salary in Israel: 8,611
  • Avg. price of an apartment (in central Israel): 1,071,900

It isn’t surprising that most people require a mortgage to buy a place to live.

Let us say that we are talking about a 1,000,000 price, just to make the math simpler, and that we have 400,000 available for the down payment. Let us further say that we got a good interest rate of 2% on the 600,000 mortgage (if you borrow more than 60% of the price, you are penalized with a higher interest rate in Israel).

Assuming a fixed interest rate and no inflation, you will need to pay about 3,035 a month for 20 years. A 2% interest rate looks pretty good, right? It sounds pretty low.

Except that over 20 years, you’ll actually pay 728,400 back on your 600,000 loan, which means that the bank gets 128,400 more than it gave you.

The bank gets back 21.4% more money. With a more realistic 3% interest rate, you’ll pay back 33% more over the lifetime of the loan. And that is ignoring inflation. Assuming a (pretty low) 2% inflation per year, you would pay 49% more to the bank at a 2% interest rate and 65% more at a 3% interest rate.
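The monthly figure above is just the standard annuity formula. Here is a quick sketch (with the post’s numbers hard-coded) that reproduces it:

// payment = principal * r / (1 - (1 + r)^-n), where r is the monthly rate and n the number of payments
double principal = 600000;
double monthlyRate = 0.02 / 12;
int months = 20 * 12;

double payment = principal * monthlyRate / (1 - Math.Pow(1 + monthlyRate, -months));
double totalPaid = payment * months;

Console.WriteLine("{0:N0}", payment);                    // ~3,035 a month
Console.WriteLine("{0:N0}", totalPaid);                  // ~728,500 over 20 years
Console.WriteLine("{0:P1}", totalPaid / principal - 1);  // ~21.4% extra paid to the bank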

Just for the fun factor, let us say that you rent, instead. And assume further that you rent for the same price of the monthly mortgage payment. We get:

 

                            Mortgage         Rent
Upfront payment             400,000.00       0.00
Monthly payment             3,000.00         3,000.00
Total payments (20 years)   720,000.00       720,000.00
Total money out             1,120,000.00     720,000.00
House value                 1,000,000.00     0.00
Total cost                  120,000.00       720,000.00

After 20 years, renting cost 720,000, while buying the house cost 120,000. And yes, I am ignoring a lot of factors here; that is intentional. This isn’t a buy vs. rent column, it is a cost of money post.

But after spending all this time doing the numbers, it all comes back to Vimes’ Boots theory of money.

Table scans, index scans and index seeks, oh my!

In general, when you break it down to the fundamentals, a database is just a fancy linked list plus some btrees. Yes, I am ignoring a lot, but if you push aside a lot of the abstraction, that is what is actually going on.

If you have ever dealt with database optimizations, you are familiar with query plans, like the following (from NHProf):

[Image: query plan from NHProf, showing an Index Seek and an Index Scan]

You can see that we have some interesting stuff going on here:

  • Index Seek
  • Index Scan

And if you are unlucky, you are probably familiar with the dreaded “your !@&!@ query is causing a table scan!” scream from the DBA. But most of the time, people just know that table scan is slow, index scan is fast and index seek is fastest. I am ignoring things like clustered vs. unclustered indexes, since they aren’t really important for what I want to do.

For simplicity’s sake, I’ll use the following in-memory data structure:

public class DocumentDatabase
{
    public List<JObject> Documents = new List<JObject>();
    public IDictionary<string, IDictionary<JToken, JObject>> Indexes =
        new Dictionary<string, IDictionary<JToken, JObject>>();
}

To keep things simple, we will only bother with the case of exact matching. For example, I might store the following document:

{ "User": "ayende", "Name": "Ayende Rahien", "Email": "Ayende@ayende.com" }

And define an index on Name & Email. What would happen if I wanted to make a query by the user name?

Well, we don’t really have any other option, we have to do what amounts to a full table scan:

foreach (var doc in Documents)
{
    if ((string)doc["User"] == "ayende")
        yield return doc;
}

As you can imagine, this is an O(N) operation, which can get pretty expensive if we are querying a large table.

What happens if I want to find data by name & email? We have an index that is perfectly suited for that, so we might as well use it:

Indexes["NameAndEmail"][new{Name="Ayende Rahien", Email = “Ayende@ayende.com”}];

Note that what we are doing here is accessing the NameAndEmail index, and then making a query on that. This is essentially an index seek.
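To make the seek concrete, here is a minimal sketch of building and probing such an index over the Documents list. It uses a plain composite string key instead of a JToken, purely for illustration:

// Build the NameAndEmail index once (an O(N) pass); after that, every exact-match
// lookup is a single dictionary probe - that probe is the "seek".
var nameAndEmail = new Dictionary<string, JObject>();
foreach (var doc in Documents)
{
    var key = (string)doc["Name"] + "|" + (string)doc["Email"];
    nameAndEmail[key] = doc;
}

JObject match;
if (nameAndEmail.TryGetValue("Ayende Rahien|Ayende@ayende.com", out match))
    Console.WriteLine(match["User"]);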

What happens if I want to query just by email? There isn’t an index just for emails, but we do have an index that contains emails. We have two options: use a table scan, or an index scan. We already saw what a table scan is, so let us look at what an index scan is:

var nameAndEmailIndex = Indexes["NameAndEmail"];
foreach (var indexed in nameAndEmailIndex.Values)
{
    if ((string)indexed["Email"] == "ayende@ayende.com")
        yield return indexed;
}

All in all, it is very similar to the table scan, and when using in memory data structures, it is probably not worth doing index scans (at least, not if the index is over the entire data set).

Where index scans prove to be very useful is when we are talking about persistent data sets, where reading the data from the index may be more efficient than reading it from the table. That is usually because the index is much smaller than the actual table. In certain databases, the format of the data on the disk may make it as efficient to do a table scan in some situations as it to do an index scan.

Another thing to note is that while I am using hashes to demonstrate the principle, in practice most persistent data stores are going to use some variant of trees.

Building data store – indexing data structure

I ran into an interestingly annoying problem recently. Basically, I am trying to write the following code:

tree.Add(new { One = 1, Two = 2 }, 13);
tree.Add(new { One = 2, Two = 3 }, 14);
tree.Add(new { One = 3, Two = 1 }, 15);

var docPos = tree.Find(new { One = 1, Two = 2 });
Assert.Equal(13, docPos);

docPos = tree.Find(new { Two = 2 });
Assert.Equal(13, docPos);

docPos = tree.Find(new { Two = 1 });
Assert.Equal(14, docPos);

As you can imagine, this is part of an indexing approach, the details of which aren’t really important. What is important is that I am trying to figure out how to support partial searches. In the example, we index by One & Two, and we can search on both of them. The problem begins when we want to make a search on just Two.

While the tree can compare between partial results just fine, the problem is how to avoid traversing the entire tree for a partial result. The BTree is structured like this:

[Image: the BTree structure]

The problem when doing a partial search is that at the root, I have no idea if I should turn right or left.

What I am thinking now is that since I can’t do a binary search, I’ll have to use a BTree+ instead. Since a BTree+ also has the property that the leaf nodes form a linked list, I can scan it effectively. I am hoping for a better option, though.

Any ideas?

Building data stores – Append Only

One of the interesting aspects in building a data store is that you run head on into things that you would generally leave to the infrastructure. By far, most developers deal with concurrency by relegating that responsibility to a database.

When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:

  • Maximizing Concurrency – do readers wait for writers? Do writers wait for readers? Do writers wait for writers?
  • Ensuring Consistency – can I read uncommitted data? can I read partially written data?

As I mentioned in my previous post, there are two major options when building a data store: Transaction Log & Append Only. There are probably better names for each, but that is how I know them.

This post is going to focus on append only. An append only store is a very simple idea, in both concept and implementation. It requires that you always append to the file. It makes things a bit finicky with the type of data structures that you have to use, since typical persistent data structures rely on being able to modify data on the disk. But once you get over that issue, it is actually very simple.

An append only store works in the following manner:

  • On startup, the data is read in reverse, trying to find the last committed transaction.
  • That committed transaction contains pointers to locations in the file where the actual data is stored.
  • A crash in the middle of a write just means garbage at the end of the file that you have to skip when finding the last committed transaction.
  • In memory, the only thing that you have to keep is just the last committed transaction.
  • A reader with a copy of the last committed transaction can execute independently of any other reader / writer. It will not see any changes made by writers after it started, but it also doesn’t have to wait for any writers.
  • Concurrency control is simple:
    • Readers don’t wait for readers
    • Readers don’t wait for writers
    • Writers don’t wait for readers
    • There can be only one concurrent writer

The last one is a natural fallout from the fact that we use the append only model. Only one thread can write to the end of the file at a given point in time. That is actually a performance boost, and not something that would slow the database down, as you might expect.

The reason for that is pretty obvious, once you start thinking about it. Writing to disk is a physical action, and the head can be only in a single place at any given point in time. By ensuring that all writes go to the end of the file, we gain a big perf advantage since we don’t do any seeks.
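To make the startup sequence concrete, here is a rough sketch of locating the last committed transaction. The on-disk layout (each commit record ends with its start offset followed by a "TXOK" marker) is entirely made up for the example; a real format would obviously differ:

// Scan backwards from the end of the file until we hit a commit marker.
// Anything after the marker is garbage from a crashed write and is simply ignored.
static long? FindLastCommittedTransaction(Stream file)
{
    var marker = Encoding.ASCII.GetBytes("TXOK");
    var buffer = new byte[marker.Length];

    for (long pos = file.Length - marker.Length; pos >= sizeof(long); pos--)
    {
        file.Position = pos;
        file.Read(buffer, 0, buffer.Length);
        if (buffer.SequenceEqual(marker) == false)
            continue;

        // The 8 bytes just before the marker hold the offset of the commit record.
        file.Position = pos - sizeof(long);
        var offsetBytes = new byte[sizeof(long)];
        file.Read(offsetBytes, 0, offsetBytes.Length);
        return BitConverter.ToInt64(offsetBytes, 0);
    }

    return null; // empty file, or no committed transaction survived
}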

Building data stores – Transaction log

One of the interesting aspects in building a data store is that you run head on into things that you would generally leave to the infrastructure. By far, most developers deal with concurrency by relegating that responsibility to a database.

When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:

  • Maximizing Concurrency – do readers wait for writers? Do writers wait for readers? Do writers wait for writers?
  • Ensuring Consistency – can I read uncommitted data? can I read partially written data?

As I mentioned in my previous post, there are two major options when building a data store: Transaction Log & Append Only. There are probably better names for each, but that is how I know them.

This post is going to focus on the transaction log. A transaction log is actually a pretty simple idea, conceptually. It simply requires that you state what you intend to do before you do it, in such a way that you can reverse it.

For example, let us say that I want to store "users/ayende" -> "ayende@ayende.com". All I need to do is write the following to the transaction log:

{
   "Key": "users/ayende",
   "OldVal": "AYENDE@AYENDE.COM",
   "NewVal": "ayende@ayende.com",
   "TxId": 19474774
}

If the data store crashes before the transaction is completed, we can run a recovery process that would resolve any issues in the actual data from the transaction log. Once a transaction is committed, we can safely delete it from the transaction log.
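As a rough illustration, the recovery pass might look something like the sketch below. The TxLogEntry class and the in-memory dataStore are assumptions made up for the example, matching the JSON record above:

public class TxLogEntry
{
    public string Key { get; set; }
    public string OldVal { get; set; }
    public string NewVal { get; set; }
    public long TxId { get; set; }
}

// Any entry still sitting in the log belongs to a transaction that never
// committed, so recovery rolls the key back to its old value.
static void Recover(IEnumerable<TxLogEntry> uncommittedEntries, IDictionary<string, string> dataStore)
{
    foreach (var entry in uncommittedEntries)
    {
        dataStore[entry.Key] = entry.OldVal;
    }
}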

As I said, conceptually it is a very simple idea, but it leads to some interesting implementation challenges:

  • You can optimize things by not writing to disk (except for writing to the transaction log) immediately.
  • You need to keep track of concurrent transactions touching the same records.
  • You need to handle background flushing to disk.
  • The crash recovery process can be.. finicky to write.

Concurrency control is something that you essentially have to roll on your own, and you can make it as granular as you feel like. There is some complexity involved in ensuring that you read the appropriate data from the transaction log / data file (based on whether you are inside a transaction, reading data you modified, or outside it, reading the old data), but where it gets really complex is the number of moving parts that you have to deal with.


Building a managed persistent, transactional, queue

When building transactional systems, there are two main approaches one can take:

  • Transaction log
  • Append only

In both cases, we rely on a very important property: fsync. fsync is actually a unix call, which flushes all data to the actual device. In essence, it means “write to the actual hardware and don’t return until the hardware has confirmed that the data was saved”. In Windows, the equivalent is FlushFileBuffers, or the WriteThrough flag. WriteThrough is better if you would otherwise need to call FlushFileBuffers for every single write, while FlushFileBuffers is significantly faster if you only need to call it once in a while.

FileStream.Flush(true) in .NET 4.0 translates to FlushFileBuffers.
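For example, here is a minimal sketch of both options (the file name is made up):

// Option 1: WriteThrough - every write goes through to the device.
using (var writeThrough = new FileStream("data.log", FileMode.Append, FileAccess.Write,
                                         FileShare.Read, 4096, FileOptions.WriteThrough))
{
    var payload = Encoding.UTF8.GetBytes("some record");
    writeThrough.Write(payload, 0, payload.Length);
}

// Option 2: buffer normally, and flush to the hardware only when we actually need durability.
using (var buffered = new FileStream("data.log", FileMode.Append, FileAccess.Write))
{
    var payload = Encoding.UTF8.GetBytes("another record");
    buffered.Write(payload, 0, payload.Length);
    buffered.Flush(true); // maps to FlushFileBuffers on .NET 4.0
}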

Transaction log systems tend to be more complex than append only systems, but append only systems use more space. Space tends to be pretty cheap, so that is a good tradeoff, I think.

Given that, let us see how we can build a managed, transactional, persistent queue. One interesting aspect is that append only, just like immutable data structures, is really not friendly to things like queues. The problem is that adding an item to the queue requires modifying the entire queue, which results in a large number of memory allocations / writes to disk.

The functional crowd has solved the problem long ago, it seems. In essence, a functional queue is composed of two immutable lists, one for the front of the queue and the other for the back of the queue. You add things to the back of the queue, and pop things from the front of the queue. When the front of the queue is empty, you set the front of the queue to be the reversed back of the queue and clear the back. The link should make it clear how this works.
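Here is a minimal in-memory sketch of that two-stack trick, using mutable Stack<T> rather than immutable lists, just to show the mechanics before we get to the persistent version:

public class TwoStackQueue<T>
{
    private readonly Stack<T> front = new Stack<T>();
    private readonly Stack<T> back = new Stack<T>();

    public void Enqueue(T item)
    {
        back.Push(item);
    }

    public T Dequeue()
    {
        if (front.Count == 0)
        {
            // Move the back stack into the front stack, reversing it in the process,
            // so the oldest item ends up on top. Amortized O(1) per item.
            while (back.Count > 0)
                front.Push(back.Pop());
        }
        return front.Pop(); // throws if the queue is empty, just like Stack<T>
    }
}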

Our first task is implementing the list. Since we really only need it to be a stack, I won’t bother with full list functionality and will just implement a very simple stack:

public class Stack
{
    private readonly Stream reader;
    private readonly Stream writer;
    private StackItem current;
    private readonly BinaryWriterWith7BitEncoding binaryWriter;
    private readonly BinaryReaderWith7BitEncoding binaryReader;
    private readonly List<StackItem> unwritten = new List<StackItem>();

    private class StackItem
    {
        public long Position { get; set; }
        public readonly StackItem NextItem;
        public readonly long Value;
        public readonly long? Next;

        public StackItem(long value, StackItem nextItem)
        {
            Value = value;
            NextItem = nextItem;
        }

        public StackItem(long value, long? next)
        {
            Value = value;
            Next = next;
        }
    }

    public long? CurrentPosition
    {
        get
        {
            return current == null ? (long?)null : current.Position;
        }
    }

    public Stack(Stream reader, Stream writer, StartMode mode)
    {
        this.reader = reader;
        this.writer = writer;

        binaryReader = new BinaryReaderWith7BitEncoding(reader);
        binaryWriter = new BinaryWriterWith7BitEncoding(writer);

        if (mode != StartMode.Open)
            return;
        current = ReadStackItem(reader.Position);
    }

    public void Push(byte[] data)
    {
        var pos = writer.Position;
        binaryWriter.Write7BitEncodedInt(data.Length);
        binaryWriter.Write(data);

        PushInternal(pos);
    }

    private void PushInternal(long pos)
    {
        current = new StackItem(pos, current);
        unwritten.Add(current);
    }

    public byte[] Pop()
    {
        var result = PopInternal(ref current);

        if (result == null)
            return null;

        reader.Position = result.Value;
        var size = binaryReader.Read7BitEncodedInt();
        var bytes = binaryReader.ReadBytes(size);
        return bytes;
    }

    private long? PopInternal(ref StackItem item)
    {
        if (item == null)
            return null;
        unwritten.Remove(item);
        var result = item.Value;
            
        if (item.NextItem != null)
            item = item.NextItem;
        else if (item.Next != null)
            item = ReadStackItem(item.Next.Value);
        else
            item = null;

        return result;
    }

    public void Flush()
    {
        foreach (var stackItem in unwritten)
        {
            stackItem.Position = writer.Position;
            binaryWriter.WriteBitEncodedNullableInt64(stackItem.Value);
            binaryWriter.WriteBitEncodedNullableInt64(stackItem.NextItem != null ? stackItem.NextItem.Position : stackItem.Next);
        }
        unwritten.Clear();
    }

    private StackItem ReadStackItem(long position)
    {
        reader.Position = position;
        return new StackItem(
            binaryReader.ReadBitEncodedNullableInt64().Value,
            binaryReader.ReadBitEncodedNullableInt64()
            )
        {
            Position = position
        };
    }

    public Stack Reverse()
    {
        var stack = new Stack(reader, writer, StartMode.Create);
        var item = current;

        while(item != null)
        {
            var value = PopInternal(ref item);
            if(value!=null)
                stack.PushInternal(value.Value);
        }

        return stack;
    }
}

The code makes several assumptions:

  • The stream is using WriteThrough, so once a write has completed, it is saved to the disk.
  • It is not our responsibility to keep track of things like the current position; that is handled by higher level code.
  • We are allowed to jump around on the reader, but the writer only does appends.

Given the stack behavior, we can now implement the queue:

public class Queue
{
    private readonly Stream reader;
    private readonly Stream writer;
    private Stack front;
    private Stack back;
    private readonly BinaryReaderWith7BitEncoding binaryReader;
    private readonly BinaryWriterWith7BitEncoding binaryWriter;

    public Queue(Stream reader, Stream writer, StartMode mode)
    {
        this.reader = reader;
        this.writer = writer;
        binaryReader = new BinaryReaderWith7BitEncoding(reader);
        binaryWriter = new BinaryWriterWith7BitEncoding(writer);

        switch (mode)
        {
            case StartMode.Open:
                ReadStacks();
                break;
            case StartMode.Create:
                InitializeEmptyStacks();
                break;
        }
    }

    private void InitializeEmptyStacks()
    {
        front = new Stack(reader, writer, StartMode.Create);
        back = new Stack(reader, writer, StartMode.Create);
    }

    private void ReadStacks()
    {
        var frontPos = binaryReader.ReadBitEncodedNullableInt64();
        var backPos = binaryReader.ReadBitEncodedNullableInt64();
        if (frontPos != null)
        {
            reader.Position = frontPos.Value;
            front = new Stack(reader, writer, StartMode.Open);
        }
        else
        {
            front = new Stack(reader, writer, StartMode.Create);
        }
        if (backPos != null)
        {
            reader.Position = backPos.Value;
            back = new Stack(reader, writer, StartMode.Open);
        }
        else
        {
            back = new Stack(reader, writer, StartMode.Create);
        }
    }

    public void Enqueue(byte[] data)
    {
        back.Push(data);
    }

    public byte[] Dequeue()
    {
        var result = front.Pop();
        if (result != null)
            return result;

        front = back.Reverse();
        back = new Stack(reader, writer, StartMode.Create);
        return front.Pop();
    }

    public void Flush()
    {
        front.Flush();
        back.Flush();

        QueuePosition = writer.Position;

        binaryWriter.WriteBitEncodedNullableInt64(front.CurrentPosition);
        binaryWriter.WriteBitEncodedNullableInt64(back.CurrentPosition);
    }

    public long QueuePosition { get; private set; }
}

Now, just to make things interesting, let us see what it actually means:

var sp = Stopwatch.StartNew();
using (var writer = new FileStream("data",
    FileMode.OpenOrCreate,
    FileAccess.ReadWrite,
    FileShare.Delete | FileShare.Read,
    16 * 1024, 
    FileOptions.SequentialScan))
using (var reader = new FileStream("data", 
    FileMode.Open, 
    FileAccess.Read, 
    FileShare.ReadWrite | FileShare.Delete, 
    16 * 1024, 
    FileOptions.RandomAccess))
{
    Queue queue = new Queue(reader, writer, StartMode.Create);

    var bytes = new byte[1024*7];
    new Random().NextBytes(bytes);
    for (int i = 0; i < 100000; i++)
    {
        queue.Enqueue(bytes);
    }
    queue.Flush();
    writer.Flush(true);
}
Console.WriteLine(sp.ElapsedMilliseconds);

And this completes in a bit over 14 seconds, or over seven thousand (pretty big) items per second.

Historical Mondial Moments

It sometimes seems that the whole world goes mad over this thing, which I can’t really understand. But every time that I hear about the Mondial, it brings tears to my eyes, as I remember…

It was 2002, and the Mondial took place somewhere in Asia, which meant that the games were played in the morning, Israel time. A lot of people in Israel are a bit touched in the head regarding soccer, so that caused quite a bit of disturbance all around, since people didn’t show up for work because they wanted to watch the games.

Anyway, I was minding my own business in Prison Six, just doing my usual rounds. Okay, calm down, you didn’t misread the last sentence, I was a prison guard at the time. One of the procedures for working in a prison is to never work alone; we always worked in pairs. One of us would go into a cell to handle something, while the other remained outside the cell’s door. The idea was that it would prevent potential kidnapping.

The prison was very conscious of that, since there had been a major prison break / prison guard kidnapping / hostage situation just 5 years previously.

Where was I? Oh, yeah. It was Mondial time, and we put a lot of the inmates in the club, so they could watch the game. I was checking that a cell was clean for inspection, so I was pretty deep inside, when there was a giant roar from the club.


My partner was an avid soccer fan, so he didn’t think twice (actually, knowing the guy, thinking once was too much); he slammed the cell’s door shut and ran to watch the replay.

I wouldn’t have minded, except that he locked me in the cell with 7 inmates.

The key for handling the situation was to pretend that I didn’t notice that and hope that the inmates wouldn’t either. By the time my partner got back to me, that cell was clean.

Checking For Empty Enumerations

Phil Haack has an interesting post about this topic, where he presents the following solution:

public static bool IsNullOrEmpty<T>(this IEnumerable<T> items) {
    return items == null || !items.Any();
}

This solution, unfortunately, suffers from a common problem related to handling IEnumerables: the assumption that you can iterate over an enumerable more than once. This holds true for things like collections, but in many cases, this sort of code will silently hide data:

var files = Directory.EnumerateFiles(".", "*.cs");
if (files.IsNullOrEmpty())
{
    Console.WriteLine("No files");
}
else
{
    foreach (var file in files)
    {
        Console.WriteLine(file);
    }
}

The first file will never appear here.

A better solution is:

// Returns true when there is nothing to enumerate; otherwise hands back an
// enumerable that does not lose the item we already consumed while checking.
public static bool IsNullOrEmpty<T>(this IEnumerable<T> items, out IEnumerable<T> newItems)
{
    newItems = items;
    if (items == null)
        return true;

    var enumerator = items.GetEnumerator();
    if (enumerator.MoveNext() == false)
        return true;

    newItems = new[] { enumerator.Current }.Concat(Remaining(enumerator));

    return false;
}

private static IEnumerable<T> Remaining<T>(IEnumerator<T> enumerator)
{
    while (enumerator.MoveNext())
        yield return enumerator.Current;
}

That will not lose data.

Deleting hard links problem

This code fails:

[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
public static extern bool CreateHardLink(string lpFileName, string lpExistingFileName, IntPtr lpSecurityAttributes);

static void Main()
{
    // Note that the writer is deliberately not disposed, so test.txt (and its hard link) remains open.
    var writer = File.CreateText("test.txt");
    writer.Write("test");
    writer.Flush();

    CreateHardLink("hard_link.txt", "test.txt", IntPtr.Zero);


    File.Delete("hard_link.txt");

}

The problem here is that the hard_link.txt file is considered to be open. What I would expect to happen is that the delete would succeed in removing the directory entry, but keep the file. In other words, I would only expect it to fail if this is the last directory entry for the file (resulting in an actual delete).

For now I changed this to try to delete, but on failure, mark the file to be deleted on the next reboot.
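That fallback uses the Win32 MoveFileEx call with the MOVEFILE_DELAY_UNTIL_REBOOT flag; roughly, as a sketch:

[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
static extern bool MoveFileEx(string lpExistingFileName, string lpNewFileName, int dwFlags);

const int MOVEFILE_DELAY_UNTIL_REBOOT = 0x4;

static void DeleteNowOrOnReboot(string path)
{
    try
    {
        File.Delete(path);
    }
    catch (IOException)
    {
        // A null destination tells Windows to delete the file on the next reboot.
        MoveFileEx(path, null, MOVEFILE_DELAY_UNTIL_REBOOT);
    }
}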

WTF?! Google Groups really hates me!

There are no other owners for the group, and I have no idea what is going on…


Update: Looks like Google disabled my account, for reasons known only to it. I had to go through a “send me an SMS to activate it” loop, but now it is working.

Set based operations with RavenDB

One of the most common problems that people run into with NoSQL databases is the inability to perform set based operations. A typical example of that would be:

DELETE FROM Users WHERE LastLogin < '2009-01-01'
UPDATE Users SET IsActive = 0 WHERE LastLogin < '2010-01-01'

The problem is that most NoSQL solutions really have no notion of set based operations, but they are something that:

  • Users really like
  • Can significantly simplify users’ code
  • Drastically reduce the number of remote calls

With all of that, I decided that I really want RavenDB to support SET based operations. And now it does :-)

The secret for set based operations is that you need two things:

  • What should you execute the operation on? (defining the set that you are working on). This is the WHERE clause.
  • What is the operation?

We will start with DELETE, because that operation is very simple to understand, so we only need to figure out what the set we operate on is. Well, Raven already has a way to express WHERE clauses, using its indexing system. There is no reason why we can’t use the same mechanism for that.

Here is how we would execute the first statement using Raven. First, we need to define the appropriate index:

// Users/LastLogin
from user in docs.Users
select new { user.LastLogin }

And now we can issue an HTTP DELETE call:

DELETE /bulk_docs/Users/LastLogin?query=LastLogin:[00010101000000000 TO 20090101000000000]&allowStale=True

This asks Raven to delete all the documents where the LastLogin is between 01-01-0001 and 01-01-2009.

There is one important thing to note here: Raven’s indexes allow stale reads, but set based operations will not work on stale indexes by default.

In this case, I am pretty sure that the index is capable of handling requests that go that far into the past, so I can safely add allowStale=True. If we remove that, or specify allowStale=False, the operation will only succeed if the index is up to date. We can also specify a cutoff date for the staleness check. This is similar to how staleness works everywhere else in Raven.
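From .NET, issuing that call is just an ordinary HTTP request. A sketch (the server address is made up, and the range brackets and spaces are percent-encoded):

var url = "http://localhost:8080/bulk_docs/Users/LastLogin" +
          "?query=LastLogin:%5B00010101000000000%20TO%2020090101000000000%5D&allowStale=True";

var request = WebRequest.Create(url);
request.Method = "DELETE";

using (var response = (HttpWebResponse)request.GetResponse())
{
    Console.WriteLine(response.StatusCode); // expect a success status once the deletes are done
}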

So far, this has been pretty easy. What about the next step, updating a set of documents? That presented some challenges: how do you define the operation?

Luckily, Raven has support for the PATCH command, which means that you can specify an update command very easily. Using the same index, we can specify:

PATCH /bulk_docs/Users/LastLogin?query=LastLogin:[00010101000000000 TO 20100101000000000]&allowStale=True

[
  { 
     "Type": "Set",
     "Name": "IsActive",
     "Value": false           
   }
]

Which will set the IsActive property to false on all the documents whose LastLogin is before 2010.

And yes, you can execute those operations using the Client API as well.


Extracting a list of committers from Git

There doesn’t seem to be a nice way of getting stats like “who are the committers in this project” from Git. There is probably some fancy script that does sed/awk magic to do so, but I went with a simpler alternative:

git log --raw > log.txt

var logFile = @"C:\work\ravendb\log.txt";

var committers = from line in File.ReadAllLines(logFile)
                 where line.StartsWith("Author: ")
                 let author = line.Substring("Author: ".Length)
                 group author by author
                 into g
                 let result = new {Committer = g.Key, CommitsCount = g.Count()}
                 orderby result.CommitsCount descending
                 select result;

foreach (var committer in committers)
{
    Console.WriteLine(committer);
}

Running this on Raven’s code produces:

{ Committer = Ayende Rahien <ayende>, CommitsCount = 555 }
{ Committer = Ayende Rahien <Ayende>, CommitsCount = 477 }
{ Committer = unknown <Ayende, CommitsCount = 72 }
{ Committer = Paul B <github, CommitsCount = 24 }
{ Committer = Andrew Stewart <Andrew.Stewart, CommitsCount = 24 }
{ Committer = Benny Thomas <jan.thomas, CommitsCount = 19 }
{ Committer = Luke Hertert <lukehertert, CommitsCount = 15 }
{ Committer = Bobby Johnson <bobby.johnson, CommitsCount = 13 }
{ Committer = Rob Ashton <robashton, CommitsCount = 11 }
{ Committer = unknown <LukeHertert, CommitsCount = 7 }
{ Committer = Andrew Theken <theken.1, CommitsCount = 6 }
{ Committer = AndyStewart <Andy.Stewart, CommitsCount = 3 }
{ Committer = Steve Strong <steve, CommitsCount = 3 }
{ Committer = unknown <Aaron Weiker, CommitsCount = 1 }
{ Committer = Emil Cardell <emil.cardell, CommitsCount = 1 }
{ Committer = bjarte.skogoy <bjarte.skogoy, CommitsCount = 1 }
{ Committer = unknown <henke, CommitsCount = 1 }