Ayende @ Rahien


Azure DocumentDB

On Friday, Microsoft announced Azure DocumentDB. You might say that I have a small interest in such things, so I headed over there to see what I could learn about this project.

Aside from being somewhat annoyed with the name, this seems to be a very different animal from RavenDB, and something that was built to serve a different niche. One of the things that we put first with RavenDB is ease of use, development and deployment for business applications. The ADB design appears to be built around a different goal: very big datasets.

Nitpicker corner: Yes, I know this is a preview, and I know that there are going to be changes. And I repeat, I have no knowledge about this project beyond the documentation and several hours of playing with it.

That said, I do have a fair bit of experience in this area. So I feel that I can speak with confidence about the topic.

ADB is supposed to be a highly scalable system that stores documents. So far, so good; I can certainly understand that need. But it has made drastically different design choices, some of which I feel very strongly about. I'll explore the issues I have with it, and contrast them with what you can do with RavenDB.

This post has two parts: the first talks about conceptual issues, the second about the currently published limits and their implications for general use of ADB.

TL;DR:

  • No sorting option, or a good paging story
  • SQL Injection, without any other alternative
  • Hard to deploy and to keep current with your codebase
  • Poor development story & no testing story
  • Poor client API
  • Lots of table scans
  • Limited queries and few optimization options
  • Single document transactions (from the client)
  • No cross collection transactions at all
  • Very small document sizes allowed

Also see the “What is this for?” section below.

For a document database platform that doesn’t have any of those issues, and runs in Azure, see RavenHQ on Azure.

Transactions – ADB says that it has transactions, and for a very limited meaning of the word, I believe it. A transaction in ADB means that a single document can be saved with a guarantee that it will either be saved in full or not at all. That is great, in the sense that at least you won’t have data corruption, but it isn’t really something that means much. Even MongoDB can satisfy that bar.

Oh, sure, you can get actual transactions if you write JS code that runs as a “stored procedure” inside ADB. This means that you can send data to the server and have your JS stored procedure make multiple operations in a single transaction. Which is just slightly better (although see my comments on those stored procedures later), but that is still limited to operations inside the same collection.

A trivial example of transactions in a document database would be to add a new comment and update the comment count. You cannot do that in ADB, not in a single transaction. I don’t know about you, but most of the interesting use cases happen when you are working with multiple document types. Sure, you can put all your documents inside the same collection, but have fun trying to work with that in the long term.

In contrast, RavenDB fully supports actual transactions that can span multiple documents (even in different collections, which I would never have believed would count as an accomplishment). RavenDB can even support DTC and transactions that span multiple interactions with the server. You know, the kind of transactions you actually want to use. For more, see the documentation on RavenDB transactions.
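To make that concrete, here is a minimal sketch of the comment scenario using the RavenDB client API (the BlogPost and Comment types are made up for the example, and documentStore is assumed to be an initialized document store):

using (var session = documentStore.OpenSession())
{
    // Both the new comment and the updated post are part of the
    // same unit of work, and are saved in a single transaction.
    var post = session.Load<BlogPost>("posts/1234");
    post.CommentsCount++;

    session.Store(new Comment
    {
        PostId = post.Id,
        Author = "ayende",
        Text = "A new comment"
    });

    session.SaveChanges(); // all or nothing, across both documents
}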

Management – it honestly feels like someone missed all the points that made people want to ditch SQL in the first place. ADB has the concepts of triggers, user defined functions (more on that travesty later, when I discuss queries) and stored procedures. You can define them in JS, and you create something that looks like this:

[image: a trigger/stored procedure defined in JS in the ADB portal]

Let me count the ways that this is going to cause problems for you.

  • Business logic in the database, because we haven’t learned anything about that in the past.
  • Code that you cannot run or test independently. Just debugging something like that is going to be hard.
  • No way to actually manage deployment or make sure that this code is in sync with the rest of your codebase.
  • Didn’t we already learn that triggers are a source of a lot of pain? Are they really necessary for you to do things?

Yes, you have a DB that is schema-less, but those kinds of things are actually important. They define what you can do with the database, and not having a good way to move them around, and most importantly, not having a way to tie them to the source control system you are using, is going to be a giant PITA.

Sorry, that isn’t actually something that you can put off for later. You need a good development story, and as I see it, the entire development story here is just going to be hard. You would have to manually schlep things around between development and production. And that isn’t just about the SPs or UDFs. There are a lot of settings that you’re going to have to deal with. For example, the configuration per collection, which you’ll want to make sure is the same (otherwise you get some very subtle and hard to understand bugs).

For that matter, there doesn’t seem to be a development story. You are probably expected to just run another ADB instance on Azure and use that. This means a lot of latency in development, and it also means that you can’t have separate databases per developer, which is a standard practice. This means having to write a lot of code just to manage those things, and you are right back in the good old days of “who didn’t update the schema script” and failed deployments.

In contrast, RavenDB makes it very easy to handle your indexes & transformers in your code and deploy them in a single step. That also means that they are versioned in the same place as your code, so you don’t have to worry about moving between dev & prod. We spent a lot of time thinking and working on this specific area, because this is a common pain point in relational databases, and we weren’t willing to accept that being the case in our database. For more information, please see the documentation about index management in RavenDB.
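As a sketch, defining an index in code and deploying it on startup looks roughly like this (the Orders_ByCustomer index and the Order entity are illustrative):

public class Orders_ByCustomer : AbstractIndexCreationTask<Order>
{
    public Orders_ByCustomer()
    {
        Map = orders => from order in orders
                        select new { order.CustomerId, order.Total };
    }
}

// Typically run on application startup; creates or updates all the
// indexes defined in the assembly, so dev & prod stay in sync.
IndexCreation.CreateIndexes(typeof(Orders_ByCustomer).Assembly, documentStore);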

Indexing – there are several things here that bother me. By default, everything is indexed, and in the same transaction. This is a great decision, for a demo system. But in a real world system, the overhead of indexing everything is prohibitive, especially in a high write system. So ADB allows you to specify the paths that you will include in or exclude from indexing, as well as whether indexing should be within the same transaction or lazy.

The problem with that is that those are per collection settings, and there doesn’t appear to be any way to modify them after the fact. So you start running your system in production, realize that the cost of indexing is high, and you need to change the indexing strategy for a collection. The only way to do that is to create a new collection with a new indexing strategy, move all the data there, then delete the old one. For even more fun, consider the case where you have production and development environments. In production, you have a different indexing strategy than in development (where the ‘index everything’ mode is still on). That means that when you push things to production, your system will fail silently, because you won’t be indexing the fields you thought were indexed.

This needs reiteration: the way this currently works, you start running with the default indexing option, which is expensive. As long as you don’t have any performance requirements (for example, during development), that is just fine. But when you actually have a lot of data there, or have a lot of writes, that is when you’ll figure out that those things need to be changed. At that point, you are pretty much screwed, because you need to pull all the data out, create a new collection with the new indexing options, and write it all back. That is a horrible experience, especially because you’ll likely need to do it under pressure, with users breathing down your neck and management complaining about the performance.

For that matter, indexing in general scares me. Again, I don’t actually have any knowledge of the internal operations, but there is a lot of stuff there that just doesn’t make sense. It looks like the precision of the indexes is up to 3 characters (by default) per value. I’m guessing that this is done to reduce the amount of space used by the indexing; at least, that is what the docs say. The problem is that when you do that, you do a lookup by the first 3 characters, and then you have to do a flat search over all the other values. That is going to cause problems.

It is also indicated that you cannot do any range searches except on numeric values. Which has interesting implications if you want to do searches on something like a date range or time spans, an incredibly common operation.

In contrast, RavenDB indexes always use the full value, so you are getting O(log N) search behavior, and not a fallback to O(N) behavior. Range searches are possible on any value type: numeric, date time, time span, string, etc. For more information, see the RavenDB documentation about searching with RavenDB.

Queries – speaking of problems, let me talk for a moment about ADB SQL. It looks nice on the surface, and it would certainly be familiar to most people. It also contains a lot of hidden traps.

For example, the docs talk about being able to do joins, but you are only actually able to do “joins” into the sub documents, not into other collections, or even documents in the same collection. A query such as:

SELECT c.Name as CustomerName, o.Total, o.Date
FROM Orders o
JOIN Customers c ON c.Id = o.CustomerId

Can’t be executed on ADB. So the whole notion of “joins” is actually limited to what you can do in a single document and the sub documents it contains. That makes it very limited.

The options for filtering (the where clause) are also interesting, mostly because of the wide range they allow. It is very easy to create queries that cannot be computed using indexes. That means that your query is now running table scans. Lots & lots of table scans. Sure, you don’t have tables, but O(N) is still O(N), and when N is large, as is apparently the expected case here, you are going to be pretty much dead in the water.

Another thing that I can’t wrap my head around is the queries shown. There is no way to pass parameters to the query. None. This appears to be the case because 30+ years of working with SQL have shown that there is absolutely no issue with putting a user’s input directly into the query. And since complex queries require you to use the raw ADB SQL, you are pretty much guaranteed to have SQL injection attacks.

Sure, you might not get caught by Little Bobby Tables (you can’t modify data via SQL), but you are still exposed and can leak important data. This query works just fine, and will return all products:

SELECT * FROM Products p WHERE p.Name = "testing" OR 1 = 1 -- "

I’ll assume that you understand how I got there. This is a brand new database engine, but ADB is bringing very old issues back into the future. Not only that, we don’t have any way around it. I guess you are going to have to write your own parameter scrubbing code, and make sure to use it everywhere.
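A minimal sketch of the kind of scrubbing code you would be forced to write (the escaping rules here are my assumption, based on the double quoted string literals in the examples; verify them against the actual ADB SQL grammar before trusting this, and name is a placeholder for user input):

public static string EscapeSqlString(string userInput)
{
    // Assumed rules: double-quote delimited strings with backslash
    // escapes. Without an official parameter mechanism, this is guesswork.
    return userInput
        .Replace("\\", "\\\\")
        .Replace("\"", "\\\"");
}

var query = "SELECT * FROM Products p WHERE p.Name = \"" +
            EscapeSqlString(name) + "\"";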

In general, queries are limited. Severely limited, actually. Take a look at the following query:

SELECT * FROM Products p 
WHERE p.Type = "Beer"
AND p.Maker = "Guinness"
AND p.Discontinued = false 
AND p.Price > 10 AND p.Price < 100

You can’t run it in ADB. It is too complex to run. Note that this is about as trivial a query as you can get, in pretty much any reasonable business system.

Continuing on with the problems-for-business-apps theme, there doesn’t appear to be any good way to do things like paging. When you issue a query, you can specify the number of items to take, and you can repeat a query by passing a continuation. But that doesn’t really help when you need to actually page with the user. So you show the data to the user, then want to go to the next page… you have to pass the continuation token all the way around, and hope that it will remain valid for the duration. For that matter, the current client API does paging at the server level, but it will fetch all the results for a query, even if it takes hours to do so.

There is no way to actually get the total number of items that match the query. So you can’t show the user something like: “You have 250 new emails”, nor can you show them “Page 1 … 50”.

Another troubling omission is the total lack of anything that would allow you to actually query your documents in a particular order. If I want to get the latest orders in descending order (or in fact, in any well defined order), I am out of luck. There is no way of doing that. This is a huge deal, because this isn’t just something that you can try papering over. This is a core functionality that you need in pretty much any application. And it is just not there. There is some indication that this is being worked on, but I’m surprised that this isn’t here already. Distributed sorting is a non trivial problem, of course, so I’ll reserve further judgment until I see what they have.

ADB’s queries are highly limited, so I expect the workaround for that is going to be to push functionality into UDFs. Note that a UDF doesn’t have access to any context, so it can’t load additional documents. What it can do is utterly destroy any chance you’ll ever have of optimizing a query. The moment that a UDF is involved, you don’t really have a choice about how to execute a query; you pretty much have to go to a table scan. Maybe filtering some stuff based on the other filters in the query, but in many cases, that means that you’ll have to run your UDF over millions of records. Because UDFs are permitted to perform non-pure operations (like reading the current time), you can’t even cache their values, or do anything smart around that. You’ll always have to execute the UDF, regardless of the amount of data you have to go through. I don’t expect that to perform very well.

In contrast, RavenDB was explicitly designed to give you both flexibility and performance in queries. There are no table scans in RavenDB, and complex queries are expected, encouraged and handled properly. Queries across multiple documents (and in other collections) are possible, and quite easy to do. Common operations, like paging or sorting, are part of the core functionality, and are both very easy to use and come with no additional costs. Complex things like full text search, spatial queries, facets and many more are right there for you to use. For more information, see the RavenDB documentation about querying in RavenDB, spatial searches in RavenDB and how RavenDB actually indexes the data to allow complex operations.
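For instance, a paged and sorted query with a total count is just a few lines (a sketch; Order, pageNumber and pageSize are placeholders):

RavenQueryStatistics stats;
var page = session.Query<Order>()
    .Statistics(out stats)          // total matching results, for "Page 1 ... 50"
    .Where(o => o.Total > 100)
    .OrderByDescending(o => o.Date) // a well defined ordering
    .Skip(pageNumber * pageSize)    // paging is first class
    .Take(pageSize)
    .ToList();

var totalPages = stats.TotalResults / pageSize;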

Data types – ADB data types are the ones defined in the JSON spec. In particular, it doesn’t have native support for date times. The ADB documentation suggests that you do custom serialization to handle that, which renders questions like “give me all the orders for this customer for 2014” very hard, leaving aside the issue of querying for orders in a particular month, which is not possible either, since you can only do range searches on numeric data. Dates in particular are a very complex topic, and not actually handling them in the database is going to open you up to a lot of issues down the road. And dates are a kinda important type to have.
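In practice, the workaround looks something like this sketch: shadow every date with a number, purely so that range queries become possible (the property names are made up):

public class Order
{
    public DateTime Date { get; set; }

    // Duplicated as a number only because range searches
    // are limited to numeric values.
    public long DateTicks { get; set; }
}

// On every write, you must remember to keep the two in sync:
order.DateTicks = order.Date.Ticks;

// And then query on the numeric shadow property:
// SELECT * FROM Orders o WHERE o.DateTicks >= 635241312000000000

Forget to update the shadow property in one code path, and your date queries silently return wrong results.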

In contrast, RavenDB handles complex (including user defined) types in a well defined manner, and has full support for dates, operations on dates, etc. It seems silly to even mention, to be fair, because it is so basic to have that. For more information, you can read the documentation about dates in RavenDB.

Aggregation – this one is simple: you don’t have any. That means that you cannot get the total number of unread emails, or the total sum of orders per customer, or the maximum order for this month. This whole functionality just isn’t there.

In contrast, RavenDB has explicit support for counting the number of results for a query, as well as map/reduce indexes. Those give you a powerful aggregation framework, which executes the work in the background. When you query, you get the pre-computed results very quickly, without having to do any work at query time. For more information, you can read about Map/Reduce in RavenDB and dynamic aggregation queries.
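As a sketch, a map/reduce index that keeps the total of orders per customer looks like this (the entity and index names are illustrative):

public class Orders_TotalByCustomer : AbstractIndexCreationTask<Order, Orders_TotalByCustomer.Result>
{
    public class Result
    {
        public string CustomerId { get; set; }
        public decimal Total { get; set; }
    }

    public Orders_TotalByCustomer()
    {
        // Runs over each order as it is written...
        Map = orders => from order in orders
                        select new { order.CustomerId, order.Total };

        // ...and the totals are aggregated in the background,
        // so queries read pre-computed results.
        Reduce = results => from result in results
                            group result by result.CustomerId into g
                            select new
                            {
                                CustomerId = g.Key,
                                Total = g.Sum(x => x.Total)
                            };
    }
}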

Set operations – another easy one, it is just not there. You can do some operations in a stored procedure, but you have 5 seconds to run, and that is it. If you need to do something like: Split FullName to FirstName and LastName, get ready to write a lot of code, and wait for a long time for this to complete. For that matter, something as simple as “delete all inactive users” is very hard to do as well.

In contrast, RavenDB has explicit support for set based updates and deletes. You can specify a query that matches a set of results, which would then either be deleted or patched using a JS script. For more information, read the documentation about Set Based Operations.
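For example, the “delete all inactive users” case is a sketch like the following (the index name and query are illustrative):

// Delete everything matching a query, server side:
documentStore.DatabaseCommands.DeleteByIndex(
    "Users/ByActive",
    new IndexQuery { Query = "Active:false" });

// Or patch the matching documents with a JS script:
documentStore.DatabaseCommands.UpdateByIndex(
    "Users/ByActive",
    new IndexQuery { Query = "Active:false" },
    new ScriptedPatchRequest { Script = "this.Archived = true;" });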

Client API – this is still a preview, so this is somewhat unfair, but the client API is very primitive. Basically, it is a very thin wrapper around the REST API, and it does a poor job even at that. The REST API supports paging, but the C# client API does not, for example. There is no concept of unit of work, change tracking, client side behavior or anything at all that would actually make this work nicely. There is also an interesting design decision to go async for all operations except queries.

With queries, you actually issue an async REST call, but you are going to be waiting on that query synchronously. This is probably because of the IQueryable interface and its assumption that the query is sync. But that is a very bad thing to do in terms of mixing sync and async work. It is easy to get into problems such as deadlocks, self-locks and just plain weirdness.
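The classic failure mode looks something like this sketch (the types and helpers are hypothetical, the pattern is not):

// Called on a UI or ASP.NET thread that has a SynchronizationContext:
public Product GetProduct()
{
    // Blocks the current thread until the task completes...
    return GetProductAsync().Result;
}

public async Task<Product> GetProductAsync()
{
    var json = await httpClient.GetStringAsync(productUri);
    // ...but this continuation wants to resume on the original
    // context, which is blocked above. Deadlock.
    return Deserialize(json);
}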

In contrast, RavenDB has carefully designed client APIs (for .NET, the JVM, etc.), which fully expose the power of RavenDB. They have been designed to be intuitive, easy to use, and to guide you into the pit of success. RavenDB also has separate sync and async APIs, including fully async queries. For more information, read the documentation about the client API.

Self links – when issuing any operation whatsoever to the database, you have to use something called the object link, or self link. For example, in my test database, the Products collection link is: dbs/frETAA==/colls/frETANSmEAA=/

You have to use links like that for every operation you make. For fun, they are unique per database, so creating a Products collection in another database would result in a different collection link. That means that I can’t just store them in configuration, so you’ll probably have to read them from the database every time you need to use them (maybe with some caching?). This is just silly, and it makes it very hard to look at what is going on and see what the system is doing (for example, by watching the traffic in Fiddler).
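So you end up writing something like the following sketch: resolve the opaque links once and cache them by the names you actually think in (LookupCollectionSelfLink is a hypothetical helper; the actual lookup depends on the client API):

private static readonly ConcurrentDictionary<string, string> collectionLinks =
    new ConcurrentDictionary<string, string>();

public static string GetCollectionLink(string databaseName, string collectionName)
{
    // Query the server for the collection's self link the first time,
    // then serve it from the cache.
    return collectionLinks.GetOrAdd(
        databaseName + "/" + collectionName,
        _ => LookupCollectionSelfLink(databaseName, collectionName));
}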

In contrast, RavenDB uses human readable names whenever possible. For more information, see the documentation about the efforts to make sure that everything in RavenDB is human readable and easily debuggable. One such place is the id generation strategy.

Development and testing – in this day and age, people are connected to the internet through most of their day to day life. That doesn’t mean that they are always connected, or that you can actually rely on the network, or that the latency is acceptable. There is no current development story for ADB: no way to run your own database and develop while you are offline (on the train or at 30,000 feet in the air). That means that every call to ADB has to go over the internet, and that means, in turn, that there is no local development story at all. It means a lot more waiting from the point of view of the developer (also see the next point), and it means that there is just no testing story.

If you want to run code to test your ADB usage, you have to set up (and pay for) a whole new ADB instance, make sure that it is set up exactly the same way as your production instance, and run against that. It means that tests not only have to go outside your process, but across the internet to a remote server. This pretty much kills the notion of fast tests.

In contrast, RavenDB has an excellent development and testing story. You don’t pay for development or CI instances, and you can run tests against RavenDB using an in memory mode embedded inside your process. This has been heavily optimized to allow fast running tests. We are developers, and we care about making other developers’ lives easier. It shows. For more information, see the documentation about unit testing RavenDB.
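A typical test setup is a sketch like this (Order is a placeholder entity):

using (var store = new EmbeddableDocumentStore { RunInMemory = true })
{
    store.Initialize();

    using (var session = store.OpenSession())
    {
        session.Store(new Order { Total = 42 });
        session.SaveChanges();
    }

    // Assert against the in memory instance; nothing leaves the process.
}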

Joins are for your code – because ADB doesn’t actually support joins beyond the document scope, or any other option like that, if you want to do something trivial, like show a customer a list of their orders, you are actually going to have to do the join in your own code, not in the database. In fact, let us take a silly scenario: we want to show a list of new employees as well as their managers, so we can have a chat with them about how they are settling in.

If we were using SQL, we would be using something like this:

SELECT emp.Id as EmpId, emp.Name as EmpName, mngr.Id as ManagerId, mngr.Name as ManagerName
FROM Employees emp
JOIN Managers mngr ON emp.ManagerId = mngr.Id
WHERE emp.JoinedAt > '2014-06-01'

That is pretty easy, right? How do you do something like that in ADB? Well, you start with the first query:

SELECT emp.Id as EmpId, emp.Name as EmpName, emp.ManagerId as ManagerId
FROM Employees emp
WHERE emp.JoinedAt > '2014-06-01'

And then, for each of the returned managers’ ids, we have to issue a separate query (ADB doesn’t have support for IN). This pattern of usage is called SELECT N+1, and it is a very well known anti-pattern, even leaving aside the fact that you have to manually do the join in your own code, with all that this implies. This sort of operation will effectively kill the performance of any application, because you are very chatty with the database.

In contrast, RavenDB contains several ways to load related items. From including a related document to projecting it via a transformer, you can very easily and efficiently get all the data you need, in a single query to RavenDB. In fact, RavenDB applies a Safe By Default approach and limits the number of times you can call the server (configurable) to prevent just this case. We’ll error if you go over the budget of remote calls you are allowed to make. This gives you an early chance to catch performance problems. For more information, see the documentation about includes, transformers and the Safe By Default approach practiced by RavenDB.
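The employees & managers example from above becomes a single round trip (a sketch, assuming string ids on the documents):

var employees = session.Query<Employee>()
    .Include(e => e.ManagerId) // managers are fetched in the same request
    .Where(e => e.JoinedAt > new DateTime(2014, 6, 1))
    .ToList();

foreach (var emp in employees)
{
    // Served from the session cache, no extra server call:
    var manager = session.Load<Manager>(emp.ManagerId);
}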

Limits – reading the limits for ADB makes for some head scratching. Yes, I know that we are talking about the preview mode only, and I’m aware that you can ask to increase those limits. Nevertheless, those limits likely reflect real trade-offs made in the system, so increasing them for a particular use case means that you’ll have to pay the price for that elsewhere.

For example, take this blog post. It is over 22KB in size, but I can’t store it in ADB, because documents are limited to 16KB in size. This is utterly ridiculous. I just checked a few of our databases, and a common size for documents is 4 – 8 KB, true. But larger documents appear all the time. Even if you exclude blog posts as BLOBs of text, we have order documents with multiple order lines that are easily past that size. Among our users, we see every document size possible, from hundreds of KB to several MB.

I reached out to Codealike, one of our customers, who were also featured in one of Azure’s case studies, to hear what their situation was. Out of 1.6 million documents in one of their databases, about 90% are in the 500KB range.

I’m assuming that a large part of this limitation is the fact that, by default, everything is indexed. You can’t index everything, have large documents, and still get reasonable performance, so this limit was introduced. I’m also assuming that there are other issues here (to be able to fit into pages? low level technical stuff?). Regardless, this is an utterly ridiculously low limit. The problem is that even raising this limit by x5 or x10 is still not enough. And I’m assuming that they didn’t choose this limit out of thin air; there is a technical reason for it.

Another issue is the number of stored procedures and UDFs that you have available. You get 5 of each, and that is it. So you don’t get to actually express anything complex there. You also get to use only a single UDF per query, and a maximum of 3 AND / OR clauses in a query. I’m assuming that the reasoning here is that the more clauses you have, the more complex it is to run the query, especially in a distributed environment, so they put a hard limit on that.

Those limits together, along with the lack of sorting, basically render ADB an interesting curiosity, but not a real contender as a generally applicable database.

What is this for?

After going over the documentation, there is one thing that I couldn’t find. What is the primary use case for ADB? 

It looks more like a solution in search of a problem than the other way around. It appears that this is used by several MS systems to store 100s of TB of data and process millions of queries. Sheer data size isn’t really interesting; we have customers with multiple TB of data. And millions of queries per day isn’t really something to brag about (10 million queries per day translates to about 115 queries per second, or about 20 – 30 queries per second per node).

What interests me is what sort of data you would put there. The small size limitation makes it pretty much unsuitable for storing actual complex documents. You have to always be aware of the size you are storing, and that puts a serious crimp in how you can work with this. The limited queries and the inability to sort also lead me to believe that this is a very purpose-built tool.

OneNote’s server side is apparently one such use case, but from the look of things, I would expect that it is the other way around: that ADB is actually the OneNote backend that Microsoft has decided to make public (like Dynamo in Amazon’s case).

Some of those limitations are probably going to be alleviated by using additional Microsoft tools. So the new Search Server (presumably that one has complex searching & sorting available) would allow you to do some proper queries, and HDInsight might be used for doing aggregation.

You aren’t going to be able to get the “show me the count of unread emails for this user” from Hadoop, not when the data is constantly changing. And using a secondary search server will introduce high latencies for the indexing. That is leaving aside the additional operational complexity of having to manage multiple systems (and the communication between them) just to get things done.

Here are a few things that would be hard to build in ADB, as it stands today:

  • This blog – the posts are too big, can’t sort posts by date, can’t do “complex” queries (tag & date & published & not deleted)
  • Logging – I thought that this would be a great use case, but we actually need to show logs by date, as well as be able to search using multiple fields (more than 3) or do contains queries.
  • Orders system – important orders with a lot of line items will be rejected because of the size limitation.

In fact, I don’t know what would work there. What kind of data are you putting there? This isn’t good for bulk data work, because the ingest rate is really small (~500 writes/second? The debug version of RavenDB does 2,500 writes per second on my dev laptop, without even using the bulk insert API) and there isn’t a good way to work with large amounts of data at once. It isn’t good for business applications, for the reasons outlined above.

I guess that if you patched this together with the search server and Hadoop, you would get something that might be able to serve. But I think that the complexity involved is going to be very high, and I just don’t see where this would be a great solution.

In short, what is the problem that this is trying to solve? What application would be a perfect fit for this?

With RavenDB, the answer is simple: it is a general purpose database focused on OLTP applications. Until you have an answer, you can use RavenDB on Azure today using RavenHQ on Azure.

Code Challenge: Make the assertion fire

Can you write calling code that would make this method fire the assertion?

public static IEnumerable<int> Fibonacci(CancellationToken token)
{
    yield return 0;
    yield return 1;

    var prev = 0;
    var cur = 1;

    try
    {
        while (token.IsCancellationRequested == false)
        {
            var tmp = prev + cur;
            prev = cur;
            cur = tmp;
            yield return tmp;
        }
    }
    finally
    {
        Debug.Assert(token.IsCancellationRequested);
    }
}

Note that the cancellation token cannot be changed; once cancelled, it can never revert.


Challenge: The problem of locking down tasks…

The following code has a very subtle bug:

public class AsyncQueue
{
    private readonly Queue<int> items = new Queue<int>();
    private volatile LinkedList<TaskCompletionSource<object>> waiters = new LinkedList<TaskCompletionSource<object>>();

    public void Enqueue(int i)
    {
        lock (items)
        {
            items.Enqueue(i);
            while (waiters.First != null)
            {
                waiters.First.Value.TrySetResult(null);
                waiters.RemoveFirst();
            }
        }
    }

    public async Task<IEnumerable<int>> DrainAsync()
    {
        while (true)
        {
            TaskCompletionSource<object> taskCompletionSource;
            lock (items)
            {
                if (items.Count > 0)
                {
                    return YieldAllItems();
                }
                taskCompletionSource = new TaskCompletionSource<object>();
                waiters.AddLast(taskCompletionSource);
            }
            await taskCompletionSource.Task;
        }
    }

    private IEnumerable<int> YieldAllItems()
    {
        while (items.Count > 0)
        {
            yield return items.Dequeue();
        }
    }
}

I’ll even give you a hint, try to run the following client code:

for (int i = 0; i < 1000 * 1000; i++)
{
    q.Enqueue(i);
    if (i % 100 == 0)
    {
        Task.Factory.StartNew(async () =>
        {
            foreach (var result in await q.DrainAsync())
            {
                Console.WriteLine(result);
            }
        });
    }
}
Can you figure out what the problem is?


Unprofessional code

This code bugs me.

[image: the code in question, with the problematic part marked]

It isn’t wrong, and it is going to produce the right result. But it ain’t pro code, in the sense that this code lacks some important *ities.

Just to cut down on the guessing, I marked the exact part that bugs me. Can you see the issue?


Explain this code: Answers

The reason this code is useful?

[image: the code being explained]

Because it allows you to write this sort of code:

class Program
{
    private static Collection<int> nums;

    static void Main(string[] args)
    {
        nums.Add(1);
        Console.WriteLine(nums.Count);
    }
}

I was aiming for that, and still this code strikes me as wrong.


Challenge: Minimum number of round trips

We are working on creating a better experience for RavenDB & Sharding, and that led us to the following piece of code:

shardedSession.Load<Post>("posts/1234", "posts/3214", "posts/1232", "posts/1238", "posts/1232");

And the following Shard function:

public static string[] GetAppropriateUrls(string id)
{
    switch (id.Last())
    {
        case '4':
            return new[] { "http://srv-4", "http://srv-backup-4" };

        case '2':
            return new[] { "http://srv-2" };

        case '8':
            return new[] { "http://srv-backup-4" };

        default:
            throw new InvalidOperationException();
    }
}

Write a function that would make the minimum number of queries to load all of the posts from all of the servers.


The tax calculation challenge

People seem to be more interested in answering the question than in the code that solved it. Actually, people seemed to be more interested in outdoing one another in creating answers to it. What I found most interesting is that a large percentage of the answers (both in the blog post and in the interviews) got a lot of it wrong.

So here is the question in full. The following table shows the current tax rates in Israel:

  Income                Tax Rate
  Up to 5,070           10%
  5,071 up to 8,660     14%
  8,661 up to 14,070    23%
  14,071 up to 21,240   30%
  21,241 up to 40,230   33%
  Higher than 40,230    45%

Here are some example incomes and the expected tax for each:

  • 5,000 –> 500
  • 5,800 –> 609.2
  • 9,000 –> 1087.8
  • 15,000 –> 2532.9
  • 50,000 –> 15,068.1

This problem is a bit tricky because the tax rate doesn’t apply to the whole sum, only to the part that is within the current rate.
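One straightforward way to express that is the following sketch in C# (bracket boundaries taken from the table above):

public static decimal CalculateTax(decimal salary)
{
    // The upper bound of each bracket and its rate; the last bracket is unbounded.
    var brackets = new[]
    {
        new { Limit = 5070m,            Rate = 0.10m },
        new { Limit = 8660m,            Rate = 0.14m },
        new { Limit = 14070m,           Rate = 0.23m },
        new { Limit = 21240m,           Rate = 0.30m },
        new { Limit = 40230m,           Rate = 0.33m },
        new { Limit = decimal.MaxValue, Rate = 0.45m },
    };

    decimal tax = 0;
    decimal lowerBound = 0;
    foreach (var bracket in brackets)
    {
        if (salary <= lowerBound)
            break;
        // Only the slice of the salary inside this bracket is taxed at its rate.
        var taxable = Math.Min(salary, bracket.Limit) - lowerBound;
        tax += taxable * bracket.Rate;
        lowerBound = bracket.Limit;
    }
    return tax;
}

Running it against the examples above reproduces them: CalculateTax(5800m) is 507 + 730 * 0.14 = 609.2, and CalculateTax(50000m) comes out to 15,068.1.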


Elegancy challenge: Cacheable Batches

Let us say that we have the following server code:

public JsonDocument[] GetDocument(string[] ids)
{
  var results = new List<JsonDocument>();
  foreach(var id in ids)
  {
    results.Add(GetDocument(id));
  }
  return results.ToArray();
}

This method is a classic example of batching requests to the server. For the purpose of our discussion, we have a client proxy that looks like this:

public class ClientProxy : IDisposable
{
  MemoryCache cache = new MemoryCache();
  
  public JsonDocument[] GetDocument(params string[] ids)
  {
    
    // make the request to the server, for example
    var request = WebRequest.Create("http://server/get?id=" + string.Join("&id=", ids));
    using(var stream = request.GetResponse().GetResponseStream())
    {
      return  GetResults(stream);
    }
  }
  
  public void Dispose()
  {
    cache.Dispose();
  }
}

Now, as you can probably guess from the title and from the code above, the question relates to caching. We need to make the following pass:

using(var proxy = new ClientProxy())
{
  proxy.GetPopularity("ayende", "oren"); // nothing in the cache, make a request to the server
  proxy.GetPopularity("rhino", "hippo"); // nothing in the cache, make a request to the server
  proxy.GetPopularity("rhino", "alligator"); // only request 'alligator', 'rhino' is in the cache
  proxy.GetPopularity("rhino", "hippo");  // don't make any request, serve from cache
  proxy.GetPopularity("rhino", "oren");   // don't make any request, serve from cache
}

The tricky part, of course, is to make this elegant. You can modify both server and client code.

I simplified the problem drastically, but one of the major benefits in the real issue was reducing the size of the data we have to fetch over the network, even for partially cached queries.

Tags:

Published at

Originally posted at

Comments (27)

Challenge: Recent Comments with Future Posts

We were asked to implement a comments RSS feed for this blog, and many people asked about the recent comments widget. That turned out to be quite a bit more complicated than it appeared at first, and I thought it would make a great challenge.

On the face of it, it looks like a drop dead simple feature, right? Show me the last 5 comments in the blog.

The problem with that is that it ignores something that is very important to me, the notion of future posts. One of the major advantages of RaccoonBlog is that I am able to post stuff that would go on the queue (for example, this post would be scheduled for about a month from the date of writing it), but still share that post with other people. Moreover, those other people can also comment on the post. To make things interesting, it is quite common for me to re-schedule posts, moving them from one date to another.

Given that complication, let us try to define how we want the Recent Comments feature to behave with regards to future posts. The logic is fairly simple:

  • Until the post is public, do not show the post comments.
  • If a post had comments on it while it was a future post, when it becomes public, the comments that were already posted there, should also be included.

The last requirement is a bit tricky. Allow me to explain. It would be easier to understand with an example, which luckily I have:

[image: a post created on 01 Jul and published on 12 Jul, with comments from before and after publication]

As you can see, this is a post that was created on the 1st of July, but was published on the 12th. Mike and I commented on the post shortly after it was created (while it was still hidden from the general public). Then, after it was published, grega_g and Jonty commented on it.

Now, let us assume that we query at 12 Jul, 08:55 AM, we will not get any comments from this post, but on 12 Jul, 09:01 AM, we should get both comments from this post. To make things more interesting, those should come after comments that were posted (chronologically) after them. Confusing, isn’t it? Again, let us go with a visual aid for explaining things.

In other words, let us say that we also have this as well:

[image: a second post, with its own comments]

Here is what we should see in the Recent Comments:

12 Jul, 2011 – 08:55 AM:

  1. 07/07/2011 07:42 PM – jdn
  2. 07/08/2011 07:34 PM – Matt Warren

12 Jul, 2011 – 09:05 AM:

  1. 07/03/2011 05:26 PM – Ayende Rahien
  2. 07/03/2011 05:07 PM – Mike Minutillo
  3. 07/07/2011 07:42 PM – jdn
  4. 07/08/2011 07:34 PM – Matt Warren

Note that the 1st and 2nd entries at 9:05 should sort after the 3rd and 4th, but are sorted before them, because of the post publish date, which we also take into account.

Given all of that, and regardless of the actual technology that you use, how would you implement this feature?

Answer: Modifying execution approaches

In RavenDB, we had this piece of code:

        internal T[] LoadInternal<T>(string[] ids, string[] includes)
        {
            if(ids.Length == 0)
                return new T[0];

            IncrementRequestCount();
            Debug.WriteLine(string.Format("Bulk loading ids [{0}] from {1}", string.Join(", ", ids), StoreIdentifier));
            MultiLoadResult multiLoadResult;
            JsonDocument[] includeResults;
            JsonDocument[] results;
#if !SILVERLIGHT
            var sp = Stopwatch.StartNew();
#else
            var startTime = DateTime.Now;
#endif
            bool firstRequest = true;
            do
            {
                IDisposable disposable = null;
                if (firstRequest == false) // if this is a repeated request, we mustn't use the cached result, but have to re-query the server
                    disposable = DatabaseCommands.DisableAllCaching();
                using (disposable)
                    multiLoadResult = DatabaseCommands.Get(ids, includes);

                firstRequest = false;
                includeResults = SerializationHelper.RavenJObjectsToJsonDocuments(multiLoadResult.Includes).ToArray();
                results = SerializationHelper.RavenJObjectsToJsonDocuments(multiLoadResult.Results).ToArray();
            } while (
                AllowNonAuthoritiveInformation == false &&
                results.Any(x => x.NonAuthoritiveInformation ?? false) &&
#if !SILVERLIGHT
                sp.Elapsed < NonAuthoritiveInformationTimeout
#else 
                (DateTime.Now - startTime) < NonAuthoritiveInformationTimeout
#endif
                );

            foreach (var include in includeResults)
            {
                TrackEntity<object>(include);
            }

            return results
                .Select(TrackEntity<T>)
                .ToArray();
        }

And we needed to take this same piece of code and execute it in:

  • Async fashion
  • As part of a batch of queries (sending multiple requests to RavenDB in a single HTTP call).

Everything else is the same, but in each case the marked line (the actual call to the server, DatabaseCommands.Get) is completely different.

I chose to address this by doing a Method Object refactoring. I created a new class, moved all the local variables to fields, and moved each part of the method to its own method. I also explicitly gave up control of execution, deferring that to whoever is calling us. We ended up with this:

    public class MultiLoadOperation
    {
        private static readonly Logger log = LogManager.GetCurrentClassLogger();

        private readonly InMemoryDocumentSessionOperations sessionOperations;
        private readonly Func<IDisposable> disableAllCaching;
        private string[] ids;
        private string[] includes;
        bool firstRequest = true;
        IDisposable disposable = null;
        JsonDocument[] results;
        JsonDocument[] includeResults;
                
#if !SILVERLIGHT
        private Stopwatch sp;
#else
        private    DateTime startTime;
#endif

        public MultiLoadOperation(InMemoryDocumentSessionOperations sessionOperations, 
            Func<IDisposable> disableAllCaching,
            string[] ids, string[] includes)
        {
            this.sessionOperations = sessionOperations;
            this.disableAllCaching = disableAllCaching;
            this.ids = ids;
            this.includes = includes;
        
            sessionOperations.IncrementRequestCount();
            log.Debug("Bulk loading ids [{0}] from {1}", string.Join(", ", ids), sessionOperations.StoreIdentifier);

#if !SILVERLIGHT
            sp = Stopwatch.StartNew();
#else
            startTime = DateTime.Now;
#endif
        }

        public IDisposable EnterMultiLoadContext()
        {
            if (firstRequest == false) // if this is a repeated request, we mustn't use the cached result, but have to re-query the server
                disposable = disableAllCaching();
            return disposable;
        }

        public bool SetResult(MultiLoadResult multiLoadResult)
        {
            firstRequest = false;
            includeResults = SerializationHelper.RavenJObjectsToJsonDocuments(multiLoadResult.Includes).ToArray();
            results = SerializationHelper.RavenJObjectsToJsonDocuments(multiLoadResult.Results).ToArray();

            return    sessionOperations.AllowNonAuthoritiveInformation == false &&
                    results.Any(x => x.NonAuthoritiveInformation ?? false) &&
#if !SILVERLIGHT
                    sp.Elapsed < sessionOperations.NonAuthoritiveInformationTimeout
#else 
                    (DateTime.Now - startTime) < sessionOperations.NonAuthoritiveInformationTimeout
#endif
                ;
        }

        public T[] Complete<T>()
        {
            foreach (var include in includeResults)
            {
                sessionOperations.TrackEntity<object>(include);
            }

            return results
                .Select(sessionOperations.TrackEntity<T>)
                .ToArray();
        }
    }

Note that this class doesn’t contain two very important things:

  • The actual call to the database, we gave up control on that.
  • The execution order for the methods, we don’t control that either.

That was ugly, and I decided that since I have to write another implementation as well, I might as well do the right thing and have a shared implementation. The key was to extract everything away except for the call to get the actual value. So I did just that, and we got a new class, that does all of the functionality above, except control where the actual call to the server is made and how.

Now, for the sync version, we have this code:

internal T[] LoadInternal<T>(string[] ids, string[] includes)
{
    if(ids.Length == 0)
        return new T[0];

    var multiLoadOperation = new MultiLoadOperation(this, DatabaseCommands.DisableAllCaching, ids, includes);
    MultiLoadResult multiLoadResult;
    do
    {
        using(multiLoadOperation.EnterMultiLoadContext())
        {
            multiLoadResult = DatabaseCommands.Get(ids, includes);
        }
    } while (multiLoadOperation.SetResult(multiLoadResult));

    return multiLoadOperation.Complete<T>();
}

This isn’t the most trivial of methods, I’ll admit, but it is ever so much better than the alternative, especially since now the async version looks like:

/// <summary>
/// Begins the async multi load operation
/// </summary>
public Task<T[]> LoadAsyncInternal<T>(string[] ids, string[] includes)
{
    var multiLoadOperation = new MultiLoadOperation(this,AsyncDatabaseCommands.DisableAllCaching, ids, includes);
    return LoadAsyncInternal<T>(ids, includes, multiLoadOperation);
}

private Task<T[]> LoadAsyncInternal<T>(string[] ids, string[] includes, MultiLoadOperation multiLoadOperation)
{
    using (multiLoadOperation.EnterMultiLoadContext())
    {
        return AsyncDatabaseCommands.MultiGetAsync(ids, includes)
            .ContinueWith(t =>
            {
                if (multiLoadOperation.SetResult(t.Result) == false)
                    return Task.Factory.StartNew(() => multiLoadOperation.Complete<T>());
                return LoadAsyncInternal<T>(ids, includes, multiLoadOperation);
            })
            .Unwrap();
    }
}

Again, it isn’t trivial, but at least the core stuff, the actual logic that isn’t related to how we execute the code is shared.

Challenge: Modifying execution approaches

In RavenDB, we had this piece of code:

        internal T[] LoadInternal<T>(string[] ids, string[] includes)
        {
            if(ids.Length == 0)
                return new T[0];

            IncrementRequestCount();
            Debug.WriteLine(string.Format("Bulk loading ids [{0}] from {1}", string.Join(", ", ids), StoreIdentifier));
            MultiLoadResult multiLoadResult;
            JsonDocument[] includeResults;
            JsonDocument[] results;
#if !SILVERLIGHT
            var sp = Stopwatch.StartNew();
#else
            var startTime = DateTime.Now;
#endif
            bool firstRequest = true;
            do
            {
                IDisposable disposable = null;
                if (firstRequest == false) // if this is a repeated request, we mustn't use the cached result, but have to re-query the server
                    disposable = DatabaseCommands.DisableAllCaching();
                using (disposable)
                    multiLoadResult = DatabaseCommands.Get(ids, includes);

                firstRequest = false;
                includeResults = SerializationHelper.RavenJObjectsToJsonDocuments(multiLoadResult.Includes).ToArray();
                results = SerializationHelper.RavenJObjectsToJsonDocuments(multiLoadResult.Results).ToArray();
            } while (
                AllowNonAuthoritiveInformation == false &&
                results.Any(x => x.NonAuthoritiveInformation ?? false) &&
#if !SILVERLIGHT
                sp.Elapsed < NonAuthoritiveInformationTimeout
#else 
                (DateTime.Now - startTime) < NonAuthoritiveInformationTimeout
#endif
                );

            foreach (var include in includeResults)
            {
                TrackEntity<object>(include);
            }

            return results
                .Select(TrackEntity<T>)
                .ToArray();
        }

And we needed to take this same piece of code and execute it in:

  • Async fashion
  • As part of a batch of queries (sending multiple requests to RavenDB in a single HTTP call).

Everything else is the same, but in each case the marked line (the actual call to the server, DatabaseCommands.Get) is completely different.

When we had only one additional option, I chose the direct approach, and implemented it like this:

public Task<T[]> LoadAsync<T>(string[] ids)
{
    IncrementRequestCount();
    return AsyncDatabaseCommands.MultiGetAsync(ids)
        .ContinueWith(task => task.Result.Select(TrackEntity<T>).ToArray());
}

You might notice a few differences between those approaches. The implementations behave, most of the time, the same, but all the behavior for edge cases is wrong. The reason for that, by the way, is that initially the Load and LoadAsync implementations were functionally the same, but the Load behavior kept getting more sophisticated, and I kept forgetting to also update the LoadAsync behavior.

When I started building support for batches, this really stumped me. The last thing that I wanted to do was to either try to maintain complex logic in three different locations or have different behaviors depending on whether you were using a direct call, a batch, or an async call. Just trying to document that gave me a headache.

How would you approach solving this problem?

Caching, the funny way

One of the most frustrating things in working with RavenDB is that the client API is compatible with the 3.5 framework. That means that for a lot of things we either have to use conditional compilation or we have to forgo using the new stuff in 4.0.

Case in point, we have the following issue:

[image: the issue in question]

The code in question currently looks like this:

[image: the current code]

This is the sort of code that simply begs to be used with ConcurrentDictionary. Unfortunately, we can’t use that here, because of the 3.5 limitation. Instead, I went with the usual, non-thread-safe dictionary approach. I wanted to avoid locking, so I ended up with:

[image: the lock-free version of the code]

Pretty neat, even if I say so myself. The fun part is that, without any locking, this is completely thread safe. The field itself is initialized to an empty dictionary in the constructor, of course, but that is the only thing that happens outside this method. For that matter, I didn’t even bother to make the field volatile. The only thing that this relies on is that pointer writes are atomic.
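The screenshots didn’t survive here, but the pattern described is the classic copy-on-write dictionary, which presumably looked something like this sketch (the cached values and CreateFactory helper are made up):

private Dictionary<Type, Func<object>> factories = new Dictionary<Type, Func<object>>();

public Func<object> GetFactory(Type type)
{
    Func<object> factory;
    if (factories.TryGetValue(type, out factory))
        return factory;

    factory = CreateFactory(type); // the expensive part (hypothetical helper)

    // Never mutate the shared dictionary; copy, add, then swap the reference.
    // Readers see either the old dictionary or the new one, both of them valid.
    var copy = new Dictionary<Type, Func<object>>(factories);
    copy[type] = factory;
    factories = copy;

    return factory;
}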

How come this works, and what assumptions am I making that make this possible?

Rhino Mocks Challenge: Implement This Feature

Okay, let us see if this approach works...

Here is a description of a feature that I would like to have in Rhino Mocks (modeled after a new feature in Type Mock). I don't consider this a complicated feature, and I would like to get more involvement from the community in building Rhino Mocks (see the list of all the people that helped get Rhino Mocks 3.5 out the door).

The feature is fluent mocks. The idea is that this code should work:

var mockService = MockRepository.GenerateMock<IMyService>();
Expect.Call( mockService.Identity.Name ).Return("foo");

Assert.AreEqual("foo", mockService.Identity.Name);

Where Identity is an interface.

The best place to capture such semantics is in the RecordMockState.

Have fun, and send me the patch :-)

Challenge: Don't stop with the first DSL abstraction

I was having a discussion today about the way business rules are implemented. And large part of the discussion was focused on trying to get a specific behavior in a specific circumstance. As usual, I am going to use a totally different example, which might not be as brutal in its focus as the real one.

We have a set of business rules that relate to what is going to happen to a customer in certain situations. For example, we might have the following:

upon bounced_check or refused_credit:
	if customer.TotalPurchases > 10000: # preferred
		ask_authorization_for_more_credit
	else:
		call_the_cops

upon new_order:
	if customer.TotalPurchases > 10000: # preferred
		apply_discount 5.percent

upon order_shipped:
	send_marketing_stuff unless customer.RequestedNoSpam

What is this code crying for? Here is a hint, it is not the introduction of IsPreferred, although that would be welcome.

I am interested in hearing what you will have to say in this matter.

And as a total non sequitur, cockroaches at Starbucks, yuck.

System.Reflection.Emit fun: Find the differences

This is annoying. I am trying to make something like this work using SRE:

public void Foo(out T blah)
{
    blah = (T)Arguments[0];
}

I created the SRE code to generate the appropriate values, but it is producing invalid code.

Works, but not verified IL:

    L_0089: stloc.2
    L_008a: ldarg.1
    L_008b: ldloc.2
    L_008c: ldc.i4 0
    L_0091: ldelem.ref
    L_0092: unbox.any !!T
    L_0097: stobj !!T
    L_009c: ret

Works, valid (csc.exe output, of course):

    L_000d: stloc.1
    L_000e: ldarg.1
    L_000f: ldloc.1
    L_0010: ldc.i4.0
    L_0011: ldelem.ref
    L_0012: unbox.any !!T
    L_0017: stobj !!T
    L_001c: ret

And yes, the stloc.2 and stloc.1 difference is expected.

Challenge: What does this code do?

Without compiling this, can you tell me whether this piece of code will compile? And if so, what does it do?

var dummyVariable1 = 1;
var dummyVariable2 = 3;
var a = dummyVariable1
+-+-+-+-+ + + + + + +-+-+-+-+-+
dummyVariable2;

Oh, and I want to hear reasons, too.

Challenge: Find the bug fix

Usually I tend to pose bugs as the challenges, and not the bug fixes, but this is an interesting one. Take a look at the following code:

var handles = new List<WaitHandle>();
using (
	var stream = new FileStream(path, FileMode.CreateNew, FileAccess.Write, FileShare.None, 0x1000,
								 FileOptions.Asynchronous))
{
	for (int i = 0; i < 64; i++)
	{
		var handle = new ManualResetEvent(false);
		var bytes = Encoding.UTF8.GetBytes( i + Environment.NewLine);
		stream.BeginWrite(bytes, 0, bytes.Length, delegate(IAsyncResult ar)
		{
			stream.EndWrite(ar);
			handle.Set();
		}, stream);
		handles.Add(handle);
	}
	WaitHandle.WaitAll(handles.ToArray());
	stream.Flush();

}

Now, tell me why I am creating a ManualResetEvent manually, instead of using the WaitHandle on the IAsyncResult that BeginWrite returns?

[Unstable code] Why timeouts don't mean squat...

Because they aren't helpful for the pathological cases. Let us take this simple example:

[ServiceContract]
public interface IFoo
{
	[OperationContract]
	string GetMessage();
}

var stopwatch = Stopwatch.StartNew();
var channel = ChannelFactory<IFoo>.CreateChannel(
	new BasicHttpBinding
	{
		SendTimeout = TimeSpan.FromSeconds(1), 
		ReceiveTimeout = TimeSpan.FromSeconds(1),
		OpenTimeout = TimeSpan.FromSeconds(1),
                CloseTimeout = TimeSpan.FromSeconds(1)
	},
	new EndpointAddress("http://localhost:6547/bar"));

var message = channel.GetMessage();

stopwatch.Stop();
Console.WriteLine("Got message in {0}ms", stopwatch.ElapsedMilliseconds);

On the face of it, it looks like we are safe from the point of view of timeouts, right? We set all the timeout settings that are there. At most, we will spend a second waiting for the message, and get a timeout exception if we fail there.

Here is a simple way to make this code hang for a minute (more after the code):

namespace ConsoleApplication1
{
	using System;
	using System.Linq;
	using System.Diagnostics;
	using System.IO;
	using System.Net;
	using System.ServiceModel;
	using System.Threading;

	class Program
	{
		static void Main(string[] args)
		{
			var host = new ServiceHost(typeof(FooImpl), 
				new Uri("http://localhost/foo"));
			host.AddServiceEndpoint(typeof(IFoo), 
				new BasicHttpBinding(), 
				new Uri("http://localhost/foo"));
			host.Open();

			new SlowFirewall();

			var stopwatch = Stopwatch.StartNew();
			var channel = ChannelFactory<IFoo>.CreateChannel(
				new BasicHttpBinding
				{
					SendTimeout = TimeSpan.FromSeconds(1), 
					ReceiveTimeout = TimeSpan.FromSeconds(1),
					OpenTimeout = TimeSpan.FromSeconds(1),
                 			CloseTimeout = TimeSpan.FromSeconds(1)
				},
				new EndpointAddress("http://localhost:6547/bar"));
			
			var message = channel.GetMessage();
			
			stopwatch.Stop();
			Console.WriteLine("Got message in {0}ms", stopwatch.ElapsedMilliseconds);


			host.Close();
		}
	}

	[ServiceContract]
	public interface IFoo
	{
		[OperationContract]
		string GetMessage();
	}

	public class FooImpl : IFoo
	{
		public string GetMessage()
		{
			return new string('*', 5000);
		}
	}

	public class SlowFirewall
	{
		private readonly HttpListener listener;

		public SlowFirewall()
		{
			listener = new HttpListener();
			listener.Prefixes.Add("http://localhost:6547/bar/");
			listener.Start();
			listener.BeginGetContext(OnGetContext, null);
		}

		private void OnGetContext(IAsyncResult ar)
		{
			var context = listener.EndGetContext(ar);
			var request = WebRequest.Create("http://localhost/foo");
			request.Method = context.Request.HttpMethod;
			request.ContentType = context.Request.ContentType;
			var specialHeaders = new[] { "Connection", "Content-Length",
				"Host", "Content-Type", "Expect" };
			foreach (string header in context.Request.Headers)
			{
				if (specialHeaders.Contains(header))
					continue;
				request.Headers[header] = context.Request.Headers[header];
			}
			var buffer = new byte[context.Request.ContentLength64];
			ReadAll(buffer, context.Request.InputStream);
			using (var stream = request.GetRequestStream())
			{
				stream.Write(buffer, 0, buffer.Length);
			}
			using (var response = request.GetResponse())
			using (var responseStream = response.GetResponseStream())
			{
				buffer = new byte[response.ContentLength];
				ReadAll(buffer, responseStream);
				foreach (string header in response.Headers)
				{
					if (specialHeaders.Contains(header))
						continue;
					context.Response.Headers[header] = response.Headers[header];
				}
				context.Response.ContentType = response.ContentType;
				int i = 0;
				foreach (var b in buffer)
				{
					// trickle the response one byte every 10ms: each write
					// arrives well within the client's timeouts, but the
					// total transfer time is unbounded
					context.Response.OutputStream.WriteByte(b);
					context.Response.OutputStream.Flush();
					Thread.Sleep(10);
					Console.WriteLine(i++);
				}
				context.Response.Close();
			}
		}

		private void ReadAll(byte[] buffer, Stream stream)
		{
			int current = 0;
			while (current < buffer.Length)
			{
				int read = stream.Read(buffer, current, buffer.Length - current);
				current += read;
			}
		}
	}
}

This problem means that even supposedly safe code, which has taken care to specify timeouts properly, is not safe from blocking because of network issues. Exactly the thing we specified the timeouts to avoid. The trick is that each individual read completes well inside its timeout, so no timeout ever fires, while the total call time is unbounded. I should note that this sample code still operates at a very high level. There are a lot of things that you can do at all levels of the network stack to play havoc with your code.
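If you do need to bound a call like this, the per-operation timeouts are the wrong tool; you need a wall-clock deadline around the entire call. Here is a minimal sketch of one way to do that; the watchdog thread and the abort-on-deadline policy are my own additions, not anything WCF gives you out of the box:

// Sketch: enforce a total (wall-clock) deadline around the call, since the
// binding's timeouts only bound each individual network operation.
static string GetMessageWithDeadline(IFoo channel, TimeSpan deadline)
{
	string result = null;
	Exception error = null;
	var worker = new Thread(() =>
	{
		try { result = channel.GetMessage(); }
		catch (Exception e) { error = e; }
	}) { IsBackground = true };
	worker.Start();
	if (worker.Join(deadline) == false)
	{
		// the call outlived its deadline; abort the channel so the blocked
		// read fails instead of trickling on forever
		((IClientChannel)channel).Abort();
		throw new TimeoutException("Call exceeded its wall-clock deadline");
	}
	if (error != null)
		throw error;
	return result;
}

Note that after the abort the channel is dead and has to be recreated before the next call.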

As an aside, what book am I re-reading?

[Unstable code] So you think you are safe...

There is some interesting discussion on my previous post about unstable code.

I thought that it would be good to give a concrete example of the issue. Given the following interface & client code, is there a way to make this code block for a long time?

[ServiceContract]
public interface IFoo
{
	[OperationContract]
	string GetMessage();
}

var stopwatch = Stopwatch.StartNew();
var channel = ChannelFactory<IFoo>.CreateChannel(
	new BasicHttpBinding
	{
		SendTimeout = TimeSpan.FromSeconds(1), 
		ReceiveTimeout = TimeSpan.FromSeconds(1),
		OpenTimeout = TimeSpan.FromSeconds(1),
		CloseTimeout = TimeSpan.FromSeconds(1)
	},
	new EndpointAddress("http://localhost:6547/bar"));

var message = channel.GetMessage();

stopwatch.Stop();
Console.WriteLine("Got message in {0}ms", stopwatch.ElapsedMilliseconds);

You are free to play around with the server implementation as well as the network topology.

Have fun....

Challenge: What is wrong with this code

Let us assume that we have the following piece of code. It has a big problem in it. The kind of problem that you get called at 2 AM to solve.

Can you find it? (more below)

public static void Main()
{
	while(true)
	{
		srv.ProcessMessages();
		Thread.Sleep(5000);
	}
}

public void ProcessMessages()
{
	try
	{
	   var msgs = GetMessages();
	   byte[] data = Serialize(msgs);
	   var req =  WebRequest.Create("http://some.remote.server");
	   req.Method = "PUT";
	   using(var stream = req.GetRequestStream())
	   {
		   stream.Write(data,0,data.Length);
	   }
	   var resp = req.GetResponse();
	   resp.Close();// we only care that no exception was thrown
	   
	   MarkMessagesAsHandled(msgs); // assume this can't throw 
	}
	catch(Exception)
	{
		// bummer, but never mind,
		// we will get it the next time that ProcessMessages 
		// is called
	}
}

public Message[] GetMessages()
{
    List<Message> msgs = new List<Message>();
    using(var reader = ExecuteReader("SELECT * FROM Messages WHERE Handled = 0;"))
    while(reader.Read())
    {
        msgs.Add( HydrateMessage(reader) );
    }
    return msgs.ToArray();
}

This code is conceptual, just to make the point. It is not real code. Things that you don't have to worry about:

  • Multi threading
  • Transactions
  • Failed database

The problem is both a bit subtle and horrifying. And just to make things interesting, for most scenarios, it will work just fine.

Challenge: why did the tests fail?

For a few days, some (~4) of SvnBridge's integration tests would fail. Not always the same ones (but usually the same group), and only if I ran them all as a group; never if I ran each test individually, or if I ran the entire test class (which rules out most of the test dependencies that cause this sort of thing). This was incredibly annoying, and several attempts to track down this issue had been less than successful.

Today I got annoyed enough to say that I am not leaving until I solve this. Considering that a full test run of all SvnBridge's tests is... lengthy, that took a while, but I finally tracked down what was going on. The fault was with this method:

protected string Svn(string command)
{
	StringBuilder output = new StringBuilder();
	string err = null;
	ExecuteInternal(command, delegate(Process svn)
	{
		ThreadPool.QueueUserWorkItem(delegate
		{
			err = svn.StandardError.ReadToEnd();
		});
		ThreadPool.QueueUserWorkItem(delegate
		{
			string line;
			while ((line = svn.StandardOutput.ReadLine()) != null)
			{
				Console.WriteLine(line);
				output.AppendLine(line);
			}
		});
	});
	if (string.IsNullOrEmpty(err) == false)
	{
		throw new InvalidOperationException("Failed to execute command: " + err);
	}
	return output.ToString();
}

This will execute svn.exe and gather its output. Only sometimes it would not do so.

I fixed it by changing the implementation to:

protected static string Svn(string command)
{
	var output = new StringBuilder();
	var err = new StringBuilder();
	var readFromStdError = new Thread(prc =>
	{
		string line;
		while ((line = ((Process)prc).StandardError.ReadLine()) != null)
		{
			Console.WriteLine(line);
			err.AppendLine(line);
		}
	});
	var readFromStdOut = new Thread(prc =>
	{
		string line;
		while ((line = ((Process) prc).StandardOutput.ReadLine()) != null)
		{
			Console.WriteLine(line);
			output.AppendLine(line);
		}
	});
	ExecuteInternal(command, svn =>
	{
		readFromStdError.Start(svn);
		readFromStdOut.Start(svn);
	});

	readFromStdError.Join();
	readFromStdOut.Join();

	if (err.Length!=0)
	{
		throw new InvalidOperationException("Failed to execute command: " + err);
	}
	return output.ToString();
}

And that fixed the problem. What was the problem?

Challenge: calling generics without the generic type

Assume that I have the following interface:

public interface IMessageHandler<T> where T : AbstractMessage
{
	void Handle(T msg);
}

How would you write this method so that dispatching a message doesn't require reflection on every call? The following doesn't compile, of course; it is just there to show the intent:

public void Dispatch(AbstractMessage msg)
{
	IMessageHandler<msg.GetType()> handler = new MyHandler<msg.GetType()>();
	handler.Handle(msg);
}

Note that you can use reflection the first time you encounter a message of a particular type, but not in any subsequent calls.
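For completeness, here is one shape a solution can take (a sketch under my own naming, and not necessarily the answer the challenge is after): reflect once per message type, then cache a strongly typed delegate. MyHandler<T> is taken from the question; everything else here is mine.

// Sketch: one reflection hit per message type, then a cached delegate.
// Requires System.Reflection and System.Collections.Generic.
public class Dispatcher
{
	// not thread-safe; good enough for a sketch
	readonly Dictionary<Type, Action<AbstractMessage>> handlers =
		new Dictionary<Type, Action<AbstractMessage>>();

	public void Dispatch(AbstractMessage msg)
	{
		Action<AbstractMessage> handler;
		if (handlers.TryGetValue(msg.GetType(), out handler) == false)
		{
			// first encounter: close the generic helper over the runtime type
			var method = typeof(Dispatcher)
				.GetMethod("DispatchTyped", BindingFlags.Instance | BindingFlags.NonPublic)
				.MakeGenericMethod(msg.GetType());
			handler = (Action<AbstractMessage>)Delegate.CreateDelegate(
				typeof(Action<AbstractMessage>), this, method);
			handlers[msg.GetType()] = handler;
		}
		handler(msg);
	}

	void DispatchTyped<T>(AbstractMessage msg) where T : AbstractMessage
	{
		new MyHandler<T>().Handle((T)msg);
	}
}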

Challenge: The directory tree

Since people seem to really enjoy posts like this, here is another one. This time it is an interesting issue that I dealt with today.

Given a set of versioned files, you need to cache them locally. Note that the IPersistentCache semantics mean that once you put a value into it, it is always going to remain there.

Here is the skeleton:

public interface IPersistentCache
{
	void Set(string key, params string[] items);
	string[] Get(string key);
}

public enum Recursion
{
	None,
	OneLevel,
	Full
}

public class VersionedFile
{
	public int Version;
	public string Name;
}

public class FileSystemCache
{
	IPersistentCache cache;
	
	public FileSystemCache(IPersistentCache cache)
	{
		this.cache = cache;
	}
	
	public void Add(VersionedFile[] files)
	{
		// to do
	}

	public string[] ListFilesAndFolders(string root, int version, Recursion recursion)
	{
		// to do
	}
	
}

How would you implement this? Note that your only allowed external dependency is the IPersistentCache interface.

The usage is something like this:

// given 
var fsc = new FileSystemCache(cache);
fsc.Add(new []
{
	new VersionedFile { Version = 1, Name = "/" },
	new VersionedFile { Version = 1, Name = "/foo" },
	new VersionedFile { Version = 1, Name = "/foo/bar" },
	new VersionedFile { Version = 1, Name = "/foo/bar/text.txt" },
});
fsc.Add(new []
{
	new VersionedFile { Version = 2, Name = "/" },
	new VersionedFile { Version = 2, Name = "/foo" },
	new VersionedFile { Version = 2, Name = "/foo/bar" },
	new VersionedFile { Version = 2, Name = "/foo/bar/text.txt" },
	new VersionedFile { Version = 2, Name = "/test.txt" },
});

// then 
fsc.ListFilesAndFolders("/", 1, Recursion.None) == { "/" } 

fsc.ListFilesAndFolders("/", 1, Recursion.OneLevel) == { "/", "/foo", } 

fsc.ListFilesAndFolders("/", 2, Recursion.OneLevel) == { "/", "/foo", "test.txt" } 

fsc.ListFilesAndFolders("/", 1, Recursion.Full) == { "/", "/foo", "/foo/bar", "/foo/bar/text.txt"}  

You can assume that all paths are '/' separated and always start with '/'.
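For what it's worth, the direction I would probe first (a sketch under my own assumptions, not the answer): since each version's files arrive as a complete snapshot, group them by parent directory and store every directory's immediate children under a "version:directory" key. OneLevel then costs a single Get, and Full is a recursive walk of Gets. ParentOf is a hypothetical helper of mine.

// Sketch only: Add groups the snapshot by parent directory, so each
// "version:directory" key is written exactly once, which respects the
// write-once semantics of IPersistentCache.
public void Add(VersionedFile[] files)
{
	var childrenByDir = new Dictionary<string, List<string>>();
	foreach (var file in files)
	{
		var parent = ParentOf(file.Name);
		if (parent == null)
			continue; // the root has no parent to register under
		var key = file.Version + ":" + parent;
		if (childrenByDir.ContainsKey(key) == false)
			childrenByDir[key] = new List<string>();
		childrenByDir[key].Add(file.Name);
	}
	foreach (var pair in childrenByDir)
		cache.Set(pair.Key, pair.Value.ToArray());
}

private static string ParentOf(string path)
{
	if (path == "/")
		return null;
	var lastSlash = path.LastIndexOf('/');
	return lastSlash == 0 ? "/" : path.Substring(0, lastSlash);
}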