Ayende @ Rahien

It's a girl

The best bug reports are pull requests

I just got the following bug report, I’ll just let you read it, and I have additional commentary below:

image

The great thing about getting a pull request with a failing test is that the whole process of working with it is pretty seamless.

For the one above, GitHub told me that I need to run the following command:

git pull https://github.com/benjamingram/ravendb.git DynamicFieldsBug

I did, and got the failing test. From there it was just a matter of fixing the actual bug, which was rather simple, and nothing about it even smelled like ceremony.

Performance optimizations: Rinse, Repeat, Repeat, Repeat

Originally posted at 1/18/2011

We got some reports that there was an O(N) issue with loading a large number of documents from Raven.

I wrote the following test code:

var db = new DocumentDatabase(new RavenConfiguration
{
    DataDirectory = "Data"
});
Console.WriteLine("Ready");
Console.ReadLine();
while (true)
{
    var sp = Stopwatch.StartNew();

    db.Query("Raven/DocumentsByEntityName", new IndexQuery
    {
        Query = "Tag:Item",
        PageSize = 1024
    });

    Console.WriteLine(sp.ElapsedMilliseconds);
}

With 1,024 documents in the database, I could clearly see that most requests took on the order of 300 ms. Not content with that speed, I decided to dig deeper, and pulled out my trusty profiler (dotTrace, and no, I am not getting paid for this) and got this:

image

As should be clear, it seems like de-serializing the data from byte[] to a JObject instance is taking a lot of time (relatively speaking).

To be more precise, it takes 0.4 ms to do two deserialization operations (for the following document):

{
    "cartId": 666,
    "otherStuff": "moohahaha",
    "itemList": [{
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    },
    {
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    },
    {
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    },
    {
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    },
    {
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    },
    {
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    },
    {
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    },
    {
        "productId": 42,
        "name": "brimstone",
        "price": 6.66,
        "quantity": 2000
    }]
}

I found it quite surprising, to tell you the truth. I wrote the following test case to prove this:

static void Main()
{
    
    var st =
        "IQAAAAJSYXZlbi1FbnRpdHktTmFtZQAFAAAASXRlbQAAvgIAABJjYXJ0SWQAmgIAAAAAAAACb3RoZXJTdHVm"+
        "ZgAKAAAAbW9vaGFoYWhhAARpdGVtTGlzdACFAgAAAzAATQAAABJwcm9kdWN0SWQAKgAAAAAAAAACbmFtZQAKA"+
        "AAAYnJpbXN0b25lAAFwcmljZQCkcD0K16MaQBJxdWFudGl0eQDQBwAAAAAAAAADMQBNAAAAEnByb2R1Y3RJZAA"+
        "qAAAAAAAAAAJuYW1lAAoAAABicmltc3RvbmUAAXByaWNlAKRwPQrXoxpAEnF1YW50aXR5ANAHAAAAAAAAAAMy"+
        "AE0AAAAScHJvZHVjdElkACoAAAAAAAAAAm5hbWUACgAAAGJyaW1zdG9uZQABcHJpY2UApHA9CtejGkAScXVhbn"+
        "RpdHkA0AcAAAAAAAAAAzMATQAAABJwcm9kdWN0SWQAKgAAAAAAAAACbmFtZQAKAAAAYnJpbXN0b25lAAFwcmljZ"+
        "QCkcD0K16MaQBJxdWFudGl0eQDQBwAAAAAAAAADNABNAAAAEnByb2R1Y3RJZAAqAAAAAAAAAAJuYW1lAAoAAABi"+
        "cmltc3RvbmUAAXByaWNlAKRwPQrXoxpAEnF1YW50aXR5ANAHAAAAAAAAAAM1AE0AAAAScHJvZHVjdElkACoAAAAAA"+
        "AAAAm5hbWUACgAAAGJyaW1zdG9uZQABcHJpY2UApHA9CtejGkAScXVhbnRpdHkA0AcAAAAAAAAAAzYATQAAABJwcm9"+
        "kdWN0SWQAKgAAAAAAAAACbmFtZQAKAAAAYnJpbXN0b25lAAFwcmljZQCkcD0K16MaQBJxdWFudGl0eQDQBwAAAAAAAA"+
        "ADNwBNAAAAEnByb2R1Y3RJZAAqAAAAAAAAAAJuYW1lAAoAAABicmltc3RvbmUAAXByaWNlAKRwPQrXoxpAEnF1YW50a"+
        "XR5ANAHAAAAAAAAAAAA";
    var buffer = Convert.FromBase64String(st);

    while (true)
    {
        var sp = Stopwatch.StartNew();
        for (int i = 0; i < 1024; i++)
        {
            DoWork(buffer);
        }
        Console.WriteLine(sp.ElapsedMilliseconds);
    }
}

private static void DoWork(byte[] buffer)
{
    var ms = new MemoryStream(buffer);
    JObject.Load(new BsonReader(ms));
    JObject.Load(new BsonReader(ms));
}

On my machine, this runs at around 70 ms for each batch of 1,024 iterations. In other words, it takes significantly less than I would have thought: roughly 0.06 ms per iteration.

Note: The first number (0.4 ms) is under the profiler while the second number (0.06ms) is outside the profiler. You can routinely see order of magnitude differences between running inside and outside the profiler!

So far, so good, but we can literally see that this is adding almost 100 ms to the request processing. That is good, because it is fairly simple to fix.

What I did was introduce a cache inside the serialization pipeline that made the entire cost go away. Indeed, running the same code above showed much better performance, an average of 200 ms.
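
The cache itself isn't shown here; purely as an illustration of the idea (keyed by document key and etag, which is an assumption on my part, not necessarily what RavenDB does internally), it could look something like this:

// Illustration only - not RavenDB's actual implementation. Caches the parsed
// JObject per (document key, etag) so repeated reads of an unchanged document
// skip the BSON parsing. Uses System.Collections.Concurrent, System.IO,
// Newtonsoft.Json.Linq and Newtonsoft.Json.Bson.
public class ParsedDocumentCache
{
    private readonly ConcurrentDictionary<Tuple<string, Guid>, JObject> cache =
        new ConcurrentDictionary<Tuple<string, Guid>, JObject>();

    public JObject Get(string key, Guid etag, byte[] raw)
    {
        var cached = cache.GetOrAdd(Tuple.Create(key, etag), _ =>
        {
            using (var ms = new MemoryStream(raw))
                return JObject.Load(new BsonReader(ms));
        });
        // hand out a copy so callers cannot mutate the cached instance
        return (JObject)cached.DeepClone();
    }
}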

The next step is to figure out where the next cost factor is… For that, we use the profiler.

image

And… we can see that the cost of de-serialization went down drastically. Now the actual cost is just the search for the document by its key.

You might have noticed that those results are for Munin. I have run the same test with Esent, with remarkable similarity in the overall performance.

Code Review tools

Originally posted at 1/17/2011

Occasionally I get asked what code review tools I use.

The truth is, I tend to use TortoiseGit’s Log Viewer and just read the history, but I am now leaning toward this baby:

image

This is really nice!

RavenMQ update

It wasn’t planned so much as it happened, but RavenMQ just slipped into the private beta stage. The API is still in a somewhat clumsy state, but it is working quite nicely :)

You can see an example of the client API below:

using(var connection = RavenMQConnection.Connect("http://reduction:8181"))
{
    connection.Subscribe("/queues/abc", (context, message) => 
        Console.WriteLine(Encoding.UTF8.GetString(message.Data)));

    connection.PublishAsync(new IncomingMessage
    {
        Queue = "/queues/abc",
        Data = Encoding.UTF8.GetBytes("Hello Ravens")
    });

    Console.ReadLine();
}

Please note that this is likely to be subject to many changes.

This is written about 10 minutes after I posted the code above:

using(var connection = RavenMQConnection.Connect("http://localhost:8181"))
{
    connection.
        Subscribe<User>("/queues/abc", (context, message) => Console.WriteLine(message.Name));

    connection
        .StartPublishing
        .Add("/queues/abc", new User {Name = "Ayende"})
        .PublishAsync();

    Console.ReadLine();
}

I told you it would change… :D

It is the little things that trip you up: Reducing RavenDB’s Map/Reduce cost

After finally managing to get to Inbox Zero, I had the chance to tackle some problems that were on the back burner. One of the more acute ones was RavenDB performance for map/reduce indexes.

Standard indexes are actually very fast, especially since I just gave them an additional boost, but map/reduce indexes had two problems:

  • The reduce operation is currently single threaded (for all indexes).
  • The reduce operation blocks standard indexing.

In order to avoid that, we split the work so reduce would run in a separate thread from standard indexing operations. That done, I started to take a look at the actual cost of map/reduce operations.

It quickly became apparent that while the map part was pretty fast, it was the reduce operation that was killing us.

After some narrowing down, I was able to figure out that this is the code at fault:

public override void Execute(WorkContext context)
{
  if (ReduceKeys.Length == 0)
    return;

  var viewGenerator = context.IndexDefinitionStorage.GetViewGenerator(Index);
  if (viewGenerator == null)
    return; // deleted view?

  context.TransactionaStorage.Batch(actions =>
  {
    IEnumerable<object> mappedResults = null;
    foreach (var reduceKey in ReduceKeys)
    {
      IEnumerable<object> enumerable = actions.MappedResults.GetMappedResults(Index, reduceKey, MapReduceIndex.ComputeHash(Index, reduceKey))
        .Select(JsonToExpando.Convert);

      if (mappedResults == null)
        mappedResults = enumerable;
      else
        mappedResults = mappedResults.Concat(enumerable);
    }

    context.IndexStorage.Reduce(Index, viewGenerator, mappedResults, context, actions, ReduceKeys);
  });
}

Can you see the problem?

My first thought was that we had a problem with the code inside the foreach, since it effectively generates something like:

select * from MappedResults where Index = "MapReduceTestIndex" and ReduceKey = "Oren"
select * from MappedResults where Index = "MapReduceTestIndex" and ReduceKey = "Ayende"
select * from MappedResults where Index = "MapReduceTestIndex" and ReduceKey = "Arava"

And usually you’ll have about 2,500 of those.

Indeed, I modified the code to look like this:

public override void Execute(WorkContext context)
{
  if (ReduceKeys.Length == 0)
    return;

  var viewGenerator = context.IndexDefinitionStorage.GetViewGenerator(Index);
  if (viewGenerator == null)
    return; // deleted view?
  
  context.TransactionaStorage.Batch(actions =>
  {
    IEnumerable<object> mappedResults = new object[0];
    foreach (var reduceKey in ReduceKeys)
    {
      IEnumerable<object> enumerable = actions.MappedResults.GetMappedResults(Index, reduceKey, MapReduceIndex.ComputeHash(Index, reduceKey))
        .Select(JsonToExpando.Convert);

      mappedResults = mappedResults.Concat(enumerable);
    }
    var sp = Stopwatch.StartNew();
    Console.WriteLine("Starting to read {0} reduce keys", ReduceKeys.Length);

    var results = mappedResults.ToArray();

    Console.WriteLine("Read {0} reduce keys in {1} with {2} results", ReduceKeys.Length, sp.Elapsed, results.Length);

    context.IndexStorage.Reduce(Index, viewGenerator, results, context, actions, ReduceKeys);
  });
}

And got the following:

Starting to read 2470 reduce keys
Read 2470 reduce keys in 00:57:57.5292856 with 2499 results

Yes, for 2,470 results, that took almost an hour!!

I started planning how to fix this by moving to what is effectively an “IN” approach, when I realized that I had skipped a very important step: I hadn’t run this through the profiler. And as we know, when we are talking about performance testing, if it hasn’t been run through the profiler, it isn’t real.

And the profiler led me to this method:

public IEnumerable<JObject> GetMappedResults(string view, string reduceKey, byte[] viewAndReduceKeyHashed)
{
    Api.JetSetCurrentIndex(session, MappedResults, "by_reduce_key_and_view_hashed");
    Api.MakeKey(session, MappedResults, viewAndReduceKeyHashed, MakeKeyGrbit.NewKey);
    if (Api.TrySeek(session, MappedResults, SeekGrbit.SeekEQ) == false)
        yield break;


    Api.MakeKey(session, MappedResults, viewAndReduceKeyHashed, MakeKeyGrbit.NewKey);
    Api.JetSetIndexRange(session, MappedResults, SetIndexRangeGrbit.RangeUpperLimit | SetIndexRangeGrbit.RangeInclusive);
    if (Api.TryMoveFirst(session, MappedResults) == false)
        yield break;
    do
    {
        // we need to check that we don't have hash collisions
        var currentReduceKey = Api.RetrieveColumnAsString(session, MappedResults, tableColumnsCache.MappedResultsColumns["reduce_key"]);
        if (currentReduceKey != reduceKey)
            continue;
        var currentView = Api.RetrieveColumnAsString(session, MappedResults, tableColumnsCache.MappedResultsColumns["view"]);
        if (currentView != view)
            continue;
        yield return Api.RetrieveColumn(session, MappedResults, tableColumnsCache.MappedResultsColumns["data"]).ToJObject();
    } while (Api.TryMoveNext(session, MappedResults));
}

You might have noticed the part that stands out :). Yes, we actually had an O(n) algorithm here (effectively a table scan) that was absolutely meaningless and didn’t need to be there. I am pretty sure that it was test code that I wrote (yes, I ran git blame on that and ordered some crow for tomorrow’s lunch).
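
Since the highlighting didn’t survive here, my reading of the snippet above is that the culprit is this block (an assumption on my part, based on the O(n) description):

// Suspected leftover test code: after the successful TrySeek, TryMoveFirst
// repositions the cursor at the start of the index, so the do/while ends up
// walking every mapped result and filtering by reduce_key/view by hand,
// effectively a table scan. Removing it lets the loop start from the seek
// position, bounded by the index range set just above.
if (Api.TryMoveFirst(session, MappedResults) == false)
    yield break;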

Once I removed that, things changed for the better :D. To give you an idea of how much, take a look at the new results:

Starting to read 2470 reduce keys
Read 2470 reduce keys in 00:00:00.4907501 with 2499 results

And just for fun, I tested how long it takes to reduce each batch of results:

Indexed 2470 reduce keys in 00:00:01.0533390 with 2499 results

And that is without doing any actual optimizations; that is just from removing the brain-dead code that had no business being there.

Silverlight and HTTP and Caching, OH MY!

We are currently building the RavenDB Silverlight Client, and along the way we have run into several problems. Some of them were solved, but others proved to be much harder to deal with.

In particular, it seems that the way Silverlight handles HTTP caching is absolutely broken. Silverlight aggressively caches HTTP requests (which is good & proper), but the problem is that it ignores just about all of the cache control mechanisms that the HTTP spec specifies.

Note: I am talking specifically about the Client HTTP stack, the Browser HTTP stack behaves properly.

Here is the server code:

static void Main()
{
    var listener = new HttpListener();
    listener.Prefixes.Add("http://+:8080/");
    listener.Start();

    int reqNum = 0;
    while (true)
    {
        var ctx = listener.GetContext();

        Console.WriteLine("Req #{0}", reqNum);

        if(ctx.Request.Headers["If-None-Match"] == "1234")
        {
            ctx.Response.StatusCode = 304;
            ctx.Response.StatusDescription = "Not Modified";
            ctx.Response.Close();
            continue;
        }

        ctx.Response.Headers["ETag"] = "1234";
        ctx.Response.ContentType = "text/plain";
        using(var writer = new StreamWriter(ctx.Response.OutputStream))
        {
            writer.WriteLine(++reqNum);
            writer.Flush();
        }
        ctx.Response.Close();
    }
}

This is a pretty simple implementation, let us see how it behaves when we access the server from Internet Explorer (after clearing the cache):

image

The first request will hit the server to get the data, but the second request will ask the server if the cached data is fresh enough, and use the cache when we get the 304 reply.

So far, so good. And, as expected, we can see the presence of the ETag in the request.

Let us see how this behaves with an Out of Browser Silverlight application (running with elevated security):

public partial class MainPage : UserControl
{
    public MainPage()
    {
        InitializeComponent();
    }

    private void StartRequest(object sender, RoutedEventArgs routedEventArgs)
    {
        var webRequest = WebRequestCreator.ClientHttp.Create(new Uri("http://ipv4.fiddler:8080"));

        webRequest.BeginGetResponse(Callback, webRequest);
    }

    private void Callback(IAsyncResult ar)
    {
        var webRequest = (WebRequest) ar.AsyncState;
        var response = webRequest.EndGetResponse(ar);

        var messageBoxText = new StreamReader(response.GetResponseStream()).ReadToEnd();
        Dispatcher.BeginInvoke(() => MessageBox.Show(messageBoxText));
    }
}

The StartRequest method is wired to a button click on the window.

Before starting, I cleared the IE cache again. Then I started the SL application and hit the button twice. Here is the output from fiddler:

image

Do you notice what you are not seeing here? That is right, there is no second request to the server to verify that the resource has not been changed.

Well, I thought to myself, that might be annoying behavior, but we can fix it: all we need to do is specify must-revalidate in the Cache-Control header. And so I did just that.
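
In terms of the toy server above, that is a single extra header (a sketch of the change, added here for clarity):

// Added to the HttpListener loop above, before writing the response body:
ctx.Response.Headers["Cache-Control"] = "must-revalidate";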

image

Do you see Silverlight pissing all over the HTTP spec? The only aspect of Cache-Control that the ClientHttp stack in Silverlight seems to respect is no-cache, which completely ignores etags.

As it turns out, there is one way of doing this. You need to send an Expires header in the past as well as an ETag header. The Expires header will force Silverlight to make the request again, and the ETag will be used to re-validate the request, resulting in a 304 reply from the server, which will load the data from the cache.
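
Translated back to the toy server above, the workaround would look roughly like this (my sketch, not code from the original post):

// Workaround sketch: an already-expired Expires header forces the ClientHttp stack
// to go back to the server, and the ETag lets the server answer with a cheap 304,
// so the cached body is reused without re-downloading it.
ctx.Response.Headers["ETag"] = "1234";
ctx.Response.Headers["Expires"] = DateTime.UtcNow.AddYears(-1).ToString("R");
ctx.Response.ContentType = "text/plain";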

The fact that there is a workaround doesn’t mean it is not a bug, and it is a pretty severe one, making it much harder to write proper REST clients in Silverlight.

Today is a great day: Zero Inbox!

Originally posted at 1/10/2011

This is the first time in I don’t know how long, months, closer to half a year!

image

It is 16:37, and I think that I’ll just leave the office before something else jumps at me.

More on the joy of support: My trial expired!

I got the following very interesting email:

image

You might have noticed that I have kept the email address of the sender public. That is an important clue.

The email was sent from a public email gateway, one of those places where you have a disposable email address.

I suspect that there isn’t actually a bug, but that the system is working as planned :)

And there is this complaint:

image

Uber Prof New Features: A better query plan

Originally posted at 1/7/2011

Because I keep getting asked, this feature is available for the following profilers:

This feature is actually two separate ones. The first is the profiler detecting the most expensive part of the query plan and making it instantly visible. As you can see, in this fairly complex query, it is this select statement that is the hot spot.

image

Another interesting feature that only crops up whenever we are dealing with complex query plans is that the query plan can get big. And by that I mean really big. Too big for a single screen.

Therefore, we added zooming capabilities as well as the mini map that you see in the top right corner.

Uber Prof New Features: Go To Session from alert

Originally posted at 1/7/2011

This is another oft requested feature that we just implemented. The new feature is available for the full suite of Uber Profilers:

You can see the new feature below:

image

I think it is cute, and was surprisingly easy to do.

Uber Prof has recently passed the stage where it is mostly implemented using itself, so I just had to wire a few things together, and then I spent most of the time making sure that things aligned correctly in the UI.

Executing TortoiseGit from the command line

Originally posted at 1/6/2011

I love git, but as much as I like the command line, there are some things that are ever so much simpler with a UI. Most specifically, due to my long years of using TortoiseSVN, I am very much used to the way TortoiseGit works.

I still work from the command line a lot, and I found myself wanting to execute various actions on the UI from the command line. Luckily, it is very easy to do so with TortoiseGit. I simply wrote the following script (tgit.ps1):

param($cmd)
& "C:\Program Files\TortoiseGit\bin\TortoiseProc.exe" /command:$cmd /path:.

And now I can execute the following from the command line:

tgit log

tgit commit

And get the nice UI.

Please note that I am posting this mostly because I want to be able to look it up afterward. I am sure your git tools are superior to mine, but I like the way I am doing things, and am reluctant to change.

RavenDB & HTTP Caching

The RavenDB’s Client API uses the session / unit of work model internally. That means that this code will only go to the database once:

session.Load<User>("users/1");
session.Load<User>("users/1");
session.Load<User>("users/1");

And that all three calls will return the same instance as well. This is just the identity map at work, and with NHibernate, it is also called the first level cache or the session level cache.
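
A quick way to see that in action (a sketch of my own, assuming an already-initialized documentStore):

using (var session = documentStore.OpenSession())
{
    var a = session.Load<User>("users/1"); // goes to the server
    var b = session.Load<User>("users/1"); // served from the session's identity map
    Console.WriteLine(ReferenceEquals(a, b)); // True - same instance
}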

Having implemented that, a natural progression was to ask about the second level cache. NHibernate’s second level cache is complicated (it takes an hour just to explain how exactly it works, and that is when skipping all the actual implementation details).

For a while, my response was that we don’t actually need that; RavenDB is fast enough that we don’t need caching. Except that I forgot about the Fallacies of Distributed Computing, the first three of which state:

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.

Most specifically, caching can help with the third fallacy, since when you are querying potentially large documents (or a large set of documents), you are going to spend most of your time just on the network, sending bytes to and fro.

It is to avoid that cost that we actually need caching.

I was slightly depressed that I actually had to implement the same complicated logic as NHibernate for caching, so I dawdled in implementing this. And suddenly it dawned on me that as usual, I was being stupid.

RavenDB is REST based. One of the important parts of REST is that:

Cacheable
As on the World Wide Web, clients are able to cache responses. Responses must therefore, implicitly or explicitly, define themselves as cacheable or not to prevent clients reusing stale or inappropriate data in response to further requests. Well-managed caching partially or completely eliminates some client–server interactions, further improving scalability and performance.

RavenDB is an HTTP server, in the end. Why not use HTTP caching?

That required some thought, I’ll admit. It couldn’t be that simple, right?

HTTP caching is a somewhat complex topic; if you think it is not, talk to me after reading this 24-page document describing it. But in essence, I am actually using only a small bit of it.

Whenever RavenDB sends a response to a GET request (the only thing that can be safely cached), it adds an ETag header. The ETag header stands for Entity Tag, and it changes every time that the resource is changed.

RavenDB already generated ETags for documents and attachments; they are part of how we implement optimistic concurrency. Since we already had those, we could move to the next stage: have the client remember the responses for all GET requests, and when it makes a new request for a URL it has already fetched, send an If-None-Match header with that request.

RavenDB then checks whether the ETag that the client holds matches the ETag on the server, and if so, it will generate a 304 Not Modified response. That instructs the client that it can safely use the cached response.

In order to fully implement caching on the client, that was all we had to do. On the server side, we had to modify a few endpoints to properly generate an ETag and a 304 if the client sent us the current If-None-Match value. With RavenDB, this is handled very deep in the guts of the client API, directly on top of the HTTP layer. It is always on by default, and it should drastically reduce the amount of data going across the network when the data hasn’t been modified.
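
Boiled down, the client-side handling of conditional GETs amounts to something like the following (a simplified sketch of the idea, not the actual RavenDB client code; it uses System.Net, System.IO and System.Collections.Concurrent):

// Sketch: a per-URL cache of (etag, body). Every GET sends If-None-Match when we
// have a cached copy; a 304 answer means the cached body is still valid.
public class HttpCacheSketch
{
    private readonly ConcurrentDictionary<string, Tuple<string, string>> cache =
        new ConcurrentDictionary<string, Tuple<string, string>>();

    public string Get(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        Tuple<string, string> cached;
        if (cache.TryGetValue(url, out cached))
            request.Headers["If-None-Match"] = cached.Item1;

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                var body = reader.ReadToEnd();
                var etag = response.Headers["ETag"];
                if (etag != null)
                    cache[url] = Tuple.Create(etag, body);
                return body;
            }
        }
        catch (WebException e)
        {
            var httpResponse = e.Response as HttpWebResponse;
            if (cached != null && httpResponse != null &&
                httpResponse.StatusCode == HttpStatusCode.NotModified)
                return cached.Item2; // 304 - serve the body from the local cache
            throw;
        }
    }
}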

Please note that unlike NHibernate’s second level cache, we don’t need a distributed cache to ensure consistency. Each node has its own local cache, but all of them will always get valid results, thanks to RavenDB’s ETag checks. In fact, the biggest challenge was actually figuring out how to cheaply generate a valid ETag without performing the actual work for the request :)

Git Subtree

Originally posted at 1/10/2011

I have no idea why, but yesterday I tried using the git-subtree project on a different machine, and it did not work. Today, I tried it on my main work machine, and It Just Worked.

At any rate, let us see where we are, shall we?

PS C:\Work\temp> git init R1
Initialized empty Git repository in C:/Work/temp/R1/.git/
PS C:\Work\temp> git init R2
Initialized empty Git repository in C:/Work/temp/R2/.git/
PS C:\Work\temp> git init Lic
Initialized empty Git repository in C:/Work/temp/Lic/.git/
PS C:\Work\temp> cd R1
PS C:\Work\temp\R1> echo "Hello Dolly" > Dolly.txt
PS C:\Work\temp\R1> git add --all
PS C:\Work\temp\R1> git commit -m "initial commit"
[master (root-commit) b507184] initial commit
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Dolly.txt
PS C:\Work\temp\R1> cd ..\R2
PS C:\Work\temp\R2> echo "Hello Jane" > Jane.txt
PS C:\Work\temp\R2> git add --all
PS C:\Work\temp\R2> git commit -m "initial commit"
[master (root-commit) ec99676] initial commit
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Jane.txt
PS C:\Work\temp\R2> cd ..\Lic
PS C:\Work\temp\Lic> echo "Copyright Ayende (C) 2011" > license.txt
PS C:\Work\temp\Lic> git add --all
PS C:\Work\temp\Lic> git commit -m "initial commit"
[master (root-commit) a3a9b48] initial commit
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 license.txt
PS C:\Work\temp\Lic> cd..
PS C:\Work\temp> git clone .\Lic Lic.Bare --bare
Cloning into bare repository Lic.Bare...
done.

Those are the current repositories, and we want to be able to share the Lic repository among the two projects. Note that we created a bare repository for Lic, because we can’t by default push to a remote repository if it is not bare.

Using git subtree, we can run:

PS C:\Work\temp> cd .\R1
PS C:\Work\temp\R1> git subtree add --prefix Legal C:\Work\temp\Lic.Bare master
git fetch C:\Work\temp\Lic.Bare master
warning: no common commits
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From C:\Work\temp\Lic.Bare
 * branch            master     -> FETCH_HEAD
Added dir 'Legal'
PS C:\Work\temp\R1> ls -recurse


    Directory: C:\Work\temp\R1


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----         1/10/2011  11:59 AM            Legal
-a---         1/10/2011  11:58 AM         28 Dolly.txt


    Directory: C:\Work\temp\R1\Legal


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         1/10/2011  11:59 AM         56 license.txt

We do the same in the R2 repository:

PS C:\Work\temp\R1> cd ..\R2
PS C:\Work\temp\R2> git subtree add --prefix Legal C:\Work\temp\Lic.Bare master
git fetch C:\Work\temp\Lic.Bare master
warning: no common commits
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From C:\Work\temp\Lic.Bare
 * branch            master     -> FETCH_HEAD
Added dir 'Legal'

Now let us see what happens when we modify things…

PS C:\Work\temp\R2> echo "Not for Jihad use" > .\Legal\disclaimer.txt
PS C:\Work\temp\R2> git add --all
PS C:\Work\temp\R2> git commit -m "adding disclaimer"
[master 3ac3e15] adding disclaimer
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Legal/disclaimer.txt

A couple of things to note here:

  • We could add & commit from the root repository, because as far as Git is concerned, there is only one repository.
  • If we were to push our changes to the root repository location, it would include the changes just made.

This is a Good Thing, because if I want to create a branch / fork, I get everything, not just references.

Now, let us push our changes to the Lic repository:

PS C:\Work\temp\R2> git subtree push C:\Work\temp\Lic.Bare master --prefix Legal
git push using:  C:\Work\temp\Lic.Bare master
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 320 bytes, done.
Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
To C:\Work\temp\Lic.Bare
   a3a9b48..10fea68  10fea680b0783e0cf6e5d3ba5130d154557ffbe5 -> master

And now let us see how we get those changes back in the R1 repository:

PS C:\Work\temp\R1> git subtree pull C:\Work\temp\Lic.Bare master --prefix Legal
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From C:\Work\temp\Lic.Bare
 * branch            master     -> FETCH_HEAD
Merge made by recursive.
 Legal/disclaimer.txt |  Bin 0 -> 40 bytes
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Legal/disclaimer.txt
PS C:\Work\temp\R1> ls -recurse


    Directory: C:\Work\temp\R1


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----         1/10/2011  12:04 PM            Legal
-a---         1/10/2011  11:58 AM         28 Dolly.txt


    Directory: C:\Work\temp\R1\Legal


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         1/10/2011  12:04 PM         40 disclaimer.txt
-a---         1/10/2011  11:59 AM         56 license.txt

There is another important advantage to git subtree: only I have to use it. Everyone else can just work with the usual git tools, without having to be aware that I am sharing code between projects in this manner.

The problem with Git Submodules

The built-in answer for sharing code between multiple projects is quite simple…

git submodule

But it introduces several problems along the way:

  • You can’t just git clone the repository, you need to clone the repository, then call git submodule init & git submodule update.
  • You can’t just download the entire source code from github.
  • You can’t branch easily with submodules, well, you can, but you have to branch in the related projects as well. And that assumes that you have access to them.
  • You can’t fork easily with submodules, well, you can, if you really feel like updating the associations all the time. Which is really nasty.

Let me present you with a simple scenario, okay? I have two projects that share a common license. Obviously I want all projects to use the same license and the whole thing to be under source control.

Here is our basic setup:

PS C:\Work\temp> git init R1
Initialized empty Git repository in C:/Work/temp/R1/.git/
PS C:\Work\temp> git init R2
Initialized empty Git repository in C:/Work/temp/R2/.git/
PS C:\Work\temp> git init Lic
Initialized empty Git repository in C:/Work/temp/Lic/.git/
PS C:\Work\temp> cd R1
PS C:\Work\temp\R1> echo "Hello Dolly" > Dolly.txt
PS C:\Work\temp\R1> git add --all
PS C:\Work\temp\R1> git commit -m "initial commit"
[master (root-commit) 498ab77] initial commit
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Dolly.txt
PS C:\Work\temp\R1> cd ..\R2
PS C:\Work\temp\R2> echo "Hello Jane" > Jane.txt
PS C:\Work\temp\R2> git add --all
PS C:\Work\temp\R2> git commit -m "initial commit"
[master (root-commit) deb45bc] initial commit
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Jane.txt
PS C:\Work\temp\R2> cd ..\Lic
PS C:\Work\temp\Lic> echo "Copyright Ayende (C) 2011" > license.txt
PS C:\Work\temp\Lic> git add --all
PS C:\Work\temp\Lic> git commit -m "initial commit"
[master (root-commit) 8e8b1b4] initial commit
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 license.txt

This just gives us the basics. Now I want to share the license.txt file between the projects. I can do that with submodules, like so:

PS C:\Work\temp\R1> git submodule init
PS C:\Work\temp\R1> git submodule add C:\Work\temp\Lic Legal
Cloning into Legal...
done.
PS C:\Work\temp\R1> cd ..\R2
PS C:\Work\temp\R2> git submodule init
PS C:\Work\temp\R2> git submodule add C:\Work\temp\Lic Legal
Cloning into Legal...
done.

Now, this looks nice, and it works beautifully. Until you start sharing this with other people. Then it starts to become somewhat messy.

For example, let us say that I want to add a disclaimer in R1:

PS C:\Work\temp\R1\Legal> echo "Not for Jihad use" > Disclaimer.txt
PS C:\Work\temp\R1\Legal> git add .\Disclaimer.txt
PS C:\Work\temp\R1\Legal> git commit -m "adding disclaimer"
[master db3987c] adding disclaimer
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Disclaimer.txt

And here is where the problems start. Let us assume that I want to make a change that is local to just this project.

Well, guess what, you can’t. Not if you intend to share this with other people. You need to push your changes to the submodule somewhere, and that means that if you need to fork the original project, you have to update the references to it as well. Of course, if there is an update to the original submodule, you need a two-stage process to pull that in.

And we haven’t spoken yet about the fun of pushing the main repository but forgetting to push the submodule. It gives a new meaning to “it works on my machine”.

In short, git submodules look like a good idea, but they aren’t really workable in the real world. I’ll have a new post up shortly showing how I deal with the issue.

Git submodules & subtrees

I am getting really sick of git submodules, and I am trying to find alternatives.

So far, I have discovered a few options. Here is what happened when I tried Braid:

PS C:\Work\RavenDB> braid add git@github.com:ravendb/raven.munin.git
F, [2011-01-09T18:41:09.788525 #224] FATAL -- : uninitialized constant Fcntl::F_SETFD (NameError)
C:/Ruby186/lib/ruby/gems/1.8/gems/open4-1.0.1/lib/open4.rb:20:in `popen4'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/lib/braid/operations.rb:103:in `exec'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/lib/braid/operations.rb:114:in `exec!'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/lib/braid/operations.rb:51:in `version'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/lib/braid/operations.rb:57:in `require_version'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/lib/braid/operations.rb:78:in `require_version!'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/lib/braid/command.rb:51:in `verify_git_version!'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/lib/braid/command.rb:10:in `run'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/bin/braid:58:in `run'
C:/Ruby186/lib/ruby/gems/1.8/gems/main-4.4.0/lib/main/program/class_methods.rb:155:in `run!'
C:/Ruby186/lib/ruby/gems/1.8/gems/main-4.4.0/lib/main/program/class_methods.rb:155:in `run'
C:/Ruby186/lib/ruby/gems/1.8/gems/main-4.4.0/lib/main/program/class_methods.rb:144:in `catch'
C:/Ruby186/lib/ruby/gems/1.8/gems/main-4.4.0/lib/main/program/class_methods.rb:144:in `run'
C:/Ruby186/lib/ruby/gems/1.8/gems/main-4.4.0/lib/main/factories.rb:18:in `run'
C:/Ruby186/lib/ruby/gems/1.8/gems/main-4.4.0/lib/main/factories.rb:25:in `Main'
C:/Ruby186/lib/ruby/gems/1.8/gems/evilchelu-braid-0.5/bin/braid:13
C:/Ruby186/bin/braid:19:in `load'
C:/Ruby186/bin/braid:19

Does anyone know about a good solution that will work on Windows? Most specifically, I am looking for something that is plug & play; I don’t want to write code or to understand how git works. I just want it to work.

Google vs. Bing

I was trying to find my office in Google’s Maps:

image

And then I tried Bing’s Maps:

image

Dear Microsoft, there is a good reason why you are less successful than Google…

The design of RavenDB’s attachments

Originally posted at 1/6/2011

image

I got a question on attachments in RavenDB recently:

I know that RavenDb allows for attachments. Thinking in terms of facebook photo albums - would raven attachments be suitable?

And one of the answers from the community was:

We use attachments and it works ok. We are using an older version of  RavenDB (Build 176 unstable), and the thing I wish would happen is that attachments were treated like regular documents in the DB. That way you could query them just like other documents. I am not sure if this was changed in newer releases, but there was talk about it being changed.

If I had to redesign again, I would keep attachments out of the DB cause they are resources you could easily off load to a CDN or cloud service like Amazon or Azure. If the files are in your DB, that makes it more work to optimize later.

In summary: You could put them in the DB, but you could also put ketchup on your ice cream. :)

I thought that this is a good point to stop and explain a bit about the attachment support in RavenDB. Let us start from the very beginning.

The only reason RavenDB has attachment support is that we wanted to support the notion of Raven Apps (see Couch Apps), which are completely hosted in RavenDB. That was the original impetus. Since then, they have evolved quite nicely. Attachments in RavenDB can have metadata, are replicated between nodes, can be cascade deleted on document deletions and are HTTP cacheable.

One of the features that was requested several times was automatically turning a binary property into an attachment in the client API. I vetoed that feature for several reasons:

  • It makes things more complicated.
  • It doesn’t actually give you much.
  • I couldn’t think of a good way to explain the rules governing this without it being too complex.
  • It encourages storing large binaries in the same place as the actual document.

Let us talk in concrete terms here, shall we? Here is my model class:

public class User
{
  public string Id {get;set;}
  public string Name {get;set;}
  public byte[] HashedPassword {get;set;}
  public Bitmap ProfileImage {get;set;}
}

From the point of view of the system, how is it supposed to make a distinction between HashedPassword (16 – 32 bytes, which should be stored inside the User document) and ProfileImage (1 KB – 2 MB, which should be stored as a separate attachment)?

What is worse, and the main reason why attachments are clearly separated from documents, is that there are some things that we don’t want to store inside our document, because that means that:

  • Whenever we pull the document out, we have to pull the image as well.
  • Whenever we index the document, we need to load the image as well.
  • Whenever we update the document we need to send the image as well.

Do you sense a theme here?

There is another issue: whenever we update the user, we invalidate all the user data. But when we are talking about large files, changing the password doesn’t mean that you need to invalidate the cached image. For that matter, I really want to be able to load all the images separately and concurrently. If they are stored in the document itself (or even if they are stored as an external attachment with client magic to make it appear that they are in the document), you can’t do that.

You might be familiar with this screen:

image

If we store the image in the Animal document, we run into all of the problems outlined above.

But if we store it as a URL reference to the information, we can then:

  • Load all the images on the form concurrently.
  • Take advantage of HTTP caching.
  • Only update the images when they are actually changed.

Overall, that is a much nicer system all around.
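
To make that concrete, here is roughly how I would shape it (a sketch of my own; the PutAttachment signature is from memory of the client API of that era, so treat it as an assumption):

public class User
{
  public string Id {get;set;}
  public string Name {get;set;}
  public byte[] HashedPassword {get;set;}   // tiny, stays inside the document
  public string ProfileImageKey {get;set;}  // points at an attachment, e.g. "images/users/1"
}

public static void SaveProfileImage(IDocumentStore documentStore, User user, byte[] imageBytes)
{
  // The image lives outside the document, so loading, indexing or updating the user
  // never drags the image along, and the attachment URL can be HTTP cached on its own.
  documentStore.DatabaseCommands.PutAttachment(user.ProfileImageKey, null, imageBytes, new JObject());
}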

Solving an OutOfMemoryException

The following code generates an OutOfMemoryException in certain circumstances (see my previous post about it):

private string ReplaceParametersWithValues(string statement, bool useComment)
{
  if (sqlStatement.Parameters == null)
    return statement;

  foreach (var parameter in sqlStatement.Parameters
            .Where(x => x != null && x.Name.Empty() == false)
            .OrderByDescending(x => x.Name.Length))
  {
    var patternNameSafeForRegex = Regex.Escape(parameter.Name);
    var pattern = patternNameSafeForRegex + @"(?![\d|_])";
    //static Regex methods will cache the regex, so we don't need to worry about that
      var replacement = useComment ? 
                parameter.Value + " /* " + parameter.Name + " */" : 
                parameter.Value;
      statement = Regex.Replace(statement,
                  pattern,
                  replacement);
  }
  return statement;
}

When trying to resolve this, I had to adjust my assumptions. The code above was written to handle statements of 1 – 5 kilobytes with up to a dozen or two parameters.

But it started crashing badly with a statement of 190 kilobytes and 4 thousand parameters. It is fairly obvious that this method generates a lot of temporary strings, probably leading to GC pressure and a lot of other nasty stuff besides. Unfortunately, there isn’t a Regex API that works against a StringBuilder, so I had to make do with my own approach.

I chose a fairly brute force approach for that, and I am sure it can be made better, but basically, it is just using a StringBuilder and doing the work manually.

private string ReplaceParametersWithValues(string statement, bool useComment)
{
  if (sqlStatement.Parameters == null)
    return statement;
  var sb = new StringBuilder(statement);
  foreach (var parameter in sqlStatement.Parameters
            .Where(x => x != null && x.Name.Empty() == false)
            .OrderByDescending(x => x.Name.Length))
  {
    var replacement = useComment ?
      parameter.Value + " /* " + parameter.Name + " */" :
      parameter.Value;

    int i;
    for ( i = 0; i < sb.Length; i++)
    {
      // cheap first-character check before comparing the full parameter name
      if(sb[i] != parameter.Name[0])
        continue;
      int j;
      for (j = 1; j < parameter.Name.Length && (j+i) < sb.Length; j++)
      {
        if (sb[i + j] != parameter.Name[j])
          break;
      }
      if (j != parameter.Name.Length) // not a full match of the parameter name
        continue;

      // only replace when the match isn't immediately followed by a digit
      if ((i + j) >= sb.Length || char.IsDigit(sb[i + j]) == false)
      {
        sb.Remove(i, parameter.Name.Length);
        sb.Insert(i, replacement);
        i += replacement.Length - 1; // skip past the value we just inserted
      }
    }
  }
  return sb.ToString();
}

The code is somewhat complicated by the check that I need to make: I can’t just use sb.Replace(), because I only want to replace the name when it isn’t followed by a digit.

At any rate, this code is much more complicated, but it is also much more conservative in terms of memory usage.

Investigating an OutOfMemoryException

Originally posted at 1/5/2011

I finally got a reliable reproduction for a repeated error, but I have to say, just based on the initial impression, something very strange is going on.

image

The stack trace was a bit more interesting:

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
  at System.String.GetStringForStringBuilder(String value, Int32 startIndex, Int32 length, Int32 capacity)
  at System.Text.StringBuilder.GetNewString(String currentString, Int32 requiredLength)
  at System.Text.StringBuilder.Append(String value)
  at System.Text.RegularExpressions.RegexReplacement.ReplacementImpl(StringBuilder sb, Match match)
  at System.Text.RegularExpressions.RegexReplacement.Replace(Regex regex, String input, Int32 count, Int32 startat)
  at System.Text.RegularExpressions.Regex.Replace(String input, String replacement, Int32 count, Int32 startat)
  at System.Text.RegularExpressions.Regex.Replace(String input, String replacement)
  at System.Text.RegularExpressions.Regex.Replace(String input, String pattern, String replacement)
  at HibernatingRhinos.Profiler.BackEnd.SqlStatementProcessor.ReplaceParametersWithValues(String statement, Boolean useComment) 

I put a breakpoint in the appropriate place, and discovered that the error occurred in:

  • A SQL Statement that was 190 kilobytes in size
  • It had 4,060 parameters

Now, let us look at the actual code:

private string ReplaceParametersWithValues(string statement, bool useComment)
{
  if (sqlStatement.Parameters == null)
    return statement;

  foreach (var parameter in sqlStatement.Parameters
            .Where(x => x != null && x.Name.Empty() == false)
            .OrderByDescending(x => x.Name.Length))
  {
    var patternNameSafeForRegex = Regex.Escape(parameter.Name);
    var pattern = patternNameSafeForRegex + @"(?![\d|_])";
    //static Regex methods will cache the regex, so we don't need to worry about that
      var replacement = useComment ? 
                parameter.Value + " /* " + parameter.Name + " */" : 
                parameter.Value;
      statement = Regex.Replace(statement,
                  pattern,
                  replacement);
  }
  return statement;
}

The problem is that:

  • There is no heavy memory pressure.
  • While the string is big, it is not that big.
  • In practice, there is a single replacement for each parameter.

Just for fun, I wasn’t able to reproduce the issue without running the full NH Prof application.

I solved the issue, but I am not entirely pleased with the way I solved it. (That is tomorrow’s post)

Any ideas how to reproduce this?

Any elegant ideas on how to solve this?

The BCL bug of the day

Now this one is quite an interesting one. Let us take a look and see what happens when we have the following calling code:

public class Program
{
    static void Main()
    {
        dynamic d = new MyDynamicObject();
        Console.WriteLine(d.Item.Key);
    }
}

And the following MyDynamicObject:

public class MyDynamicObject : DynamicObject
{
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        result = new {Key = 1};
        return true;
    }
}

What do you expect the result of executing this code would be?

If you think that this will print 1 on the console, you are absolutely correct.

Except…

If Program and MyDynamicObject are on separate assemblies.

In that case, we end up with a terribly confusing message:

Microsoft.CSharp.RuntimeBinder.RuntimeBinderException was unhandled
  Message='object' does not contain a definition for 'Key'
  Source=Anonymously Hosted DynamicMethods Assembly
  StackTrace:
       at CallSite.Target(Closure , CallSite , Object )
       at System.Dynamic.UpdateDelegates.UpdateAndExecute1[T0,TRet](CallSite site, T0 arg0)
       at ConsoleApplication1.Program.Main() 
       at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)

I have been able to narrow this down to “anonymous objects from a different assembly”.

Now that you have the bug, figure out:

  • Why is this happening?
  • How would you work around this bug?
  • How would you reproduce this bug without using anonymous types?
  • How would you fix this bug?
    • What should you be careful of when fixing this bug?
  • What would be Microsoft’s response to that?
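
For the workaround question, here is one hedged attempt of mine (not the post’s answer): the usual explanation is that anonymous types are compiled as internal classes, so the runtime binder in the calling assembly can’t see their members; handing back an ExpandoObject instead keeps everything resolvable across the assembly boundary.

// One possible workaround (my sketch): avoid exposing the anonymous type across
// the assembly boundary by returning an ExpandoObject (System.Dynamic).
public class MyDynamicObject : DynamicObject
{
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        dynamic value = new ExpandoObject();
        value.Key = 1;
        result = value;
        return true;
    }
}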

New year decisions

Originally posted at 1/2/2011

It is a new year, and I understand it is a tradition to make some rules that you’ll violate during the year.

Mine are:

  • Work less, a lot less.
  • Enjoy life.

They are more or less associated with one another, as you can imagine.

A few weeks ago, I got myself a nice birthday gift, a shiny office. The idea is that if I am no longer working from home, I have that much greater a chance to actually be able to put some separation between the periods in which I am working and the periods in which I am not.

Associated with the office is the decision to hire employees and to leave the office at a reasonable time.

For example, this post is my last official act of “work” for the day, and I am leaving the office at 18:10. Quite a good number, I think :)