Oren Eini
CEO of RavenDB, a NoSQL Open Source Document Database

time to read 2 min | 308 words

There was a known, but irreproducible, bug in SvnBridge for a long while now.

The essence of the bug is that under some circumstances, updating from SvnBridge would fail with an assertion failure. Somehow, someone had made a change to a file that doesn't exist.

Did you note the "some circumstances" part? That is the critical issue here. It is the circumstances that I wasn't able to reproduce. As a matter of fact, I wasn't even able to narrow down what was happening. I sometimes got it, and when I had a repro, I could fix it. But all too often, retracing the same steps would not trigger the bug; everything would work as expected.

Annoying as hell, as you might imagine.

Today I finally managed to figure out what was going on. You can see the repro on the right. This is a timeline, where each user is colored differently.

As you can see, it requires a fairly unusual set of circumstances, and the way the SVN protocol works makes it harder to figure out. (When you ask for changes on an item, you are actually requesting all the changes on its parent, which is a critical issue here.)

It is also important which side of the rename you are asking about, and... but now I am probably boring you with technical details that are not interesting even to geeks.

I have been in weird-bug-fixing mode for SvnBridge for a while now, and it is... interesting to see what crops up. Right now, most bugs take quite a long time to track down and fix, unfortunately.

Oh, and as a parting shot, take a look here at how I tracked that one:

[image]

time to read 1 min | 147 words

Let us take a look at this dialog, shall we?

[image]

What I see is that we have two text fields and two checkboxes, in a fairly big dialog, to express the following bit of information:

[image]

I honestly have no idea what the purpose of this is. To make it harder to input the information, presumably. Given a URL, I now need to split it apart manually, and I need to know that https is usually on 443, etc.

It also means that when I am talking to a user, I need to give her three pieces of information and explain where to put each of them, rather than sending a URL that she can just copy/paste in place and be done with it.
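By way of contrast, a few lines are enough to derive every one of those fields from a single pasted URL. A minimal sketch using Python's standard urllib.parse (the field names here are my own illustration, not the dialog's):

from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def fields_from_url(url):
    """Split one pasted URL into the dialog's separate fields."""
    parts = urlsplit(url)
    return {
        "server": parts.hostname,
        # Fall back to the scheme's well-known port when none is given.
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        "use_ssl": parts.scheme == "https",
        "path": parts.path or "/",
    }

print(fields_from_url("https://tfs.example.com:8443/svn/MyProject"))
# {'server': 'tfs.example.com', 'port': 8443, 'use_ssl': True, 'path': '/svn/MyProject'}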

time to read 6 min | 1048 words

There are some interesting discussions about the way Twitter is architected. Frankly, I think that having routine outages over such a long period is a sign of either negligence or incompetence. The problem isn't that hard, for crying out loud.

With that in mind, I set out to try architecting such a system. As far as I am concerned, Twitter can be safely divided into two separate applications: the server side and the client side. The web UI, by the way, is considered part of the client.

The server responsibilities:

  • Accept twits from the client; this includes:
    • Analyzing the message content (@foo should go to foo, #tag should be tagged, etc.)
    • Forwarding the message to all the followers of the particular user
  • Answer queries about twits from the people that a certain person is following
  • Answer queries about a person
  • It should scale
  • Clients are assumed to be badly written

The client responsibilities:

  • Display information to the user
  • Pretty slicing & dicing of the data

Obviously, I am talking as someone who already knows that there is even a need to scale, but let us leave that aside.

I am going to ignore the client; I don't care much about that bit. For the server, it turns out that we have a fairly simple way of handling things.

We will split it into several pieces, and deal with each of them independently. The major ones are read, write and analysis.

[image]

There isn't much need to deal with analysis. We can handle that on the backend, without really affecting the application, so we will start with processing writes.

Twitter, like most systems, is heavily skewed toward reads. In addition to that, we don't really need instant responsiveness. It is allowed to take a few moments before the new message is visible to all the followers.

As such, the process of handling new twits is fairly simple. A gateway server will accept the new twit and place it in a queue. A set of worker servers will take the new twits out of the queue and start processing them.

There are several ways of doing that: either by distributing the load by function, or by simple round robin. Myself, I tend toward round robin, unless there is a process that is significantly slower than the others or has extra requirements (sending email may require opening a port in the firewall; as such, it cannot run on just any machine, but only on machines dedicated to it).
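Roughly, the write pipeline looks like this. A minimal in-process sketch, where queue.Queue stands in for a real message queue and handle_new_twit is a placeholder for the fan-out step described below:

import queue
import re
import threading

incoming = queue.Queue()          # stand-in for a real message queue

def gateway(author, text):
    # The gateway does nothing but accept the twit and queue it.
    incoming.put({"author": author, "text": text})

def worker():
    while True:
        msg = incoming.get()
        # Analysis: pull @replies and #tags out of the message content.
        msg["mentions"] = re.findall(r"@(\w+)", msg["text"])
        msg["tags"] = re.findall(r"#(\w+)", msg["text"])
        handle_new_twit(msg)      # the fan-out, shown below
        incoming.task_done()

# Round robin in its simplest form: several identical workers.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()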

[image]

The process of handling a twit is fairly straightforward. As I mentioned, we are heavily skewed toward reads, so it is worth taking more time when processing a write to make sure that a read is as simple as possible.

This means that our model should support the following qualities:

  • Simple - Reading the timeline should involve no joins and no complexity whatsoever. Preferably, it should involve a query that uses a clustered index and that is it.
  • Cacheable - There should be as few factors affecting the data that we need to handle as possible.
  • Shardable - The ability to split the work into multiple databases means that we will be able to scale out very easily.

As such, the model on the right seems like a good one (obviously this is very oversimplified, but it works as an example).

This means that in order to display the timeline for a particular user, we will need to perform exactly two queries: one to the routing database, to find out which server this user's data is sitting on, and a second, trivial select on the timelines table.
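A sketch of that read path, with SQLite in-memory databases standing in for the routing database and the shards (the schema and names are mine, for illustration):

import sqlite3

# Hypothetical setup: one routing database plus one connection per shard.
routing_db = sqlite3.connect(":memory:")
routing_db.execute("CREATE TABLE user_shards (user_id INTEGER, shard INTEGER)")
shards = {0: sqlite3.connect(":memory:"), 1: sqlite3.connect(":memory:")}
for db in shards.values():
    db.execute("CREATE TABLE timelines (owner_id INTEGER, author TEXT,"
               " text TEXT, posted_at TEXT)")

def get_timeline(user_id, count=20):
    # Query 1: the routing database tells us which shard holds this user.
    shard, = routing_db.execute(
        "SELECT shard FROM user_shards WHERE user_id = ?", (user_id,)).fetchone()
    # Query 2: a trivial select against the clustered index on that shard.
    return shards[shard].execute(
        "SELECT author, text, posted_at FROM timelines"
        " WHERE owner_id = ? ORDER BY posted_at DESC LIMIT ?",
        (user_id, count)).fetchall()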

What this means, in turn, is that the process for writing a new twit can be described using the following piece of code:

# Everyone who should see the twit: the author's followers,
# plus anyone mentioned via @reply syntax in the message body.
followers = GetFollowersFor(msg.Author)
followers.UnionWith( GetRepliesToIn(msg.Text) )

# Fan out: write the message once to each recipient's timeline.
for follower in followers:
	DirectPublish(follower, msg)

DirectPublish would simply locate the appropriate server and insert a new message; that is all.
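A minimal sketch of DirectPublish, reusing the hypothetical routing_db and shards from the read-path sketch above (Msg is a stand-in message type, not the real one):

from collections import namedtuple

Msg = namedtuple("Msg", "Author Text PostedAt")   # illustrative only

def DirectPublish(follower, msg):
    # Locate the server (shard) that owns this follower's timeline...
    shard, = routing_db.execute(
        "SELECT shard FROM user_shards WHERE user_id = ?", (follower,)).fetchone()
    # ...and insert the new message into it. That really is all.
    shards[shard].execute(
        "INSERT INTO timelines (owner_id, author, text, posted_at)"
        " VALUES (?, ?, ?, ?)",
        (follower, msg.Author, msg.Text, msg.PostedAt))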

If we take the pathological case of someone who has 10,000 followers, what this means is that each time this person publishes a new twit, the writer section of the application will have to go and write the message 10,000 times. Ridiculous, isn't it?

Not really. This model allows us to keep contention very low, since we don't have any need for complex coordination; it is easily scalable by adding servers as needed; and it means that even if several such pathological users suddenly start sending a high volume of messages, the application as a whole is not really affected. That is quite apart from the fact that inserting 10,000 rows is not really a big deal, especially since we are splitting it across several servers.

But if it really bothered me, I would designate separate machines for handling such high volume users. That would ensure that regular traffic on the site can still flow while some machine in the back of the data center slowly processes the big volume. In fact, I would probably decide that it is worth my time to use bulk insert techniques for those users.
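For instance, a sketch of batching the fan-out by shard (same hypothetical schema as above, with executemany standing in for a real bulk insert API):

from collections import defaultdict

def bulk_publish(followers, msg):
    # Group the recipients by the shard their timeline lives on.
    rows_by_shard = defaultdict(list)
    for follower in followers:
        shard, = routing_db.execute(
            "SELECT shard FROM user_shards WHERE user_id = ?",
            (follower,)).fetchone()
        rows_by_shard[shard].append(
            (follower, msg.Author, msg.Text, msg.PostedAt))

    # One batched insert per shard instead of one insert per follower.
    for shard, rows in rows_by_shard.items():
        shards[shard].executemany(
            "INSERT INTO timelines (owner_id, author, text, posted_at)"
            " VALUES (?, ?, ?, ?)", rows)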

All of that said, we now have a system where the database end is trivially simple (probably because the problem, as presented in this post, is trivial; I am pretty sure that the real world is more complex, but never mind that). Scaling out the writing part is a matter of adding more workers to process more messages. Scaling out the database is a matter of putting more boxes in the data center, nothing truly complex. The read portion is a good place for judicious use of caching, and the model lends itself well to that.

Now, feel free to tell me where I am off base...

time to read 3 min | 500 words

The bug: SvnBridge will not accept a change to a filename that uses multi-byte characters.

Let us start figuring out what is going on, shall we?

Hm... looks like when TortoiseSVN is PUT-ing the file, it uses the directory name instead of the actual file name. Obviously, this is a client bug, and I can close this bug and move on to doing other things. Except that I am pretty sure that SVN can handle non-ASCII characters...

Let us compare the network trace from SVN and SvnBridge and see what is going on...

Oh, I got the problem. TortoiseSVN is requesting a CHECKOUT, and we return the wrong URL. Obviously I did something astoundingly stupid there when I processed that request. Indeed, taking a look at what is going on there is... interesting.

Finally I realize that the problem is that while TortoiseSVN sends the URL in a seemingly reasonable format, it is being read wrongly by SvnBridge, which uses ASCII encoding to read it when it probably should use UTF-8. Because of that, the URL looks something like /tfsrtm08/test/????.txt, and ? is the query string terminator; no wonder I have issues.
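A quick illustration of the mangling (the Hebrew file name is my own stand-in). .NET's Encoding.ASCII substitutes '?' for anything it cannot represent, which is exactly how the file name turns into question marks:

raw = "/tfsrtm08/test/שלום.txt".encode("utf-8")   # the bytes on the wire

# Round-tripping through ASCII, the way Encoding.ASCII does in .NET,
# replaces every non-ASCII character with '?'.
mangled = raw.decode("utf-8").encode("ascii", errors="replace").decode("ascii")
print(mangled)               # /tfsrtm08/test/????.txt

# Reading the bytes as UTF-8 keeps the file name intact.
print(raw.decode("utf-8"))   # /tfsrtm08/test/שלום.txt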

But how does it work for standard SVN servers? Hm, looks like TortoiseSVN uses Url Encoding for the URL when talking to a standard SVN server. Why is it doing this?

Hm... looks like the URL that TortoiseSVN uses when it requests a CHECKOUT is the one returned in the response to a PROPFIND request.

Sure, that is easy to fix; we will just Url Encode the response from the PROPFIND handler.

Except that it doesn't work...

Oh, we also handle some of that in the file node class, so we will fix it there.

Yeah!

Except that it still doesn't work!

Grr! Looks like the way we handle Url Encoding is too invasive; we need to Url Encode only non-ASCII characters, which means we should not encode characters such as '/', for example.
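Something along these lines (a sketch of the idea, not SvnBridge's actual code): percent-encode only the non-ASCII bytes, leaving structural characters such as '/' alone:

def encode_non_ascii(path):
    # Percent-encode only non-ASCII bytes; '/', '.', etc. pass through.
    out = []
    for byte in path.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))               # plain ASCII, keep as-is
        else:
            out.append("%{:02X}".format(byte))  # escape everything else
    return "".join(out)

print(encode_non_ascii("/tfsrtm08/test/שלום.txt"))
# /tfsrtm08/test/%D7%A9%D7%9C%D7%95%D7%9D.txt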

Let us try this again, shall we?

Hm... it still doesn't work. What is going on?

Excuse me while I am having a nervous breakdown, please.

Oh! When we send the response to the CHECKOUT request, we send the URL for the PUT request in the Location header, and we haven't Url Encoded that yet.

And now it works... :-D

Let us run the test and commit...

We have a broken test.

*Sob*

Well, okay, that is actually a test that is supposed to be broken (it is testing the way we Url Encode URLs, after all).

Number of code lines changed: less than 20.

Number of hours spent on this bug: over eight.

Conclusion, I am not really being productive today. I am off to visit the code generation wizard...
