Oren Eini
CEO of RavenDB, a NoSQL Open Source Document Database

time to read 2 min | 308 words

There was a known, but irreproducible, bug in SvnBridge for a long while now.

The essence of the bug is that under some circumstances, updating from SvnBridge would fail with an assertion failure. Somehow, someone had made a change to a file that doesn't exist.

Did you note the "some circumstances" part? That is the critical issue here. It is the circumstances that I wasn't able to reproduce. As a matter of fact, I wasn't even able to narrow down what was happening. I sometimes got it, and when I had a repro, I could fix it. But all too often, retracing the same steps would not trigger the bug; everything would work as expected.

Annoying as hell, as you might imagine.

Today I finally managed to figure out what was going on. You can see the repro on the right. This is a timeline, where each user is colored differently.

As you can see, it requires a fairly unusual set of circumstances, and the way the SVN protocol works makes it harder to figure out. (When you ask for changes on an item, you are actually requesting all the changes on its parent, which is a critical issue here.)

It is also important which side of the rename you are asking about, and... but now I am probably boring you with technical details that are not interesting even to geeks.

I have been in weird-bug-fixing mode for SvnBridge for a while now, and it is... interesting to see what crops up. Right now, most bugs take quite a long time to track down and fix, unfortunately.

Oh, and as a parting shot, take a look here at how I tracked that one:

[image]

time to read 1 min | 147 words

Let us take a look at this dialog, shall we?

[image]

What I see is that we have two text fields and two checkboxes, in a fairly big dialog, to express the following bit of information:

[image]

I honestly have no idea what the purpose of this is. To make it harder to input the information, presumably. Given a URL, I now need to split it apart manually, and I need to know that https is usually on 443, etc.

It also means that when I am talking to a user, I need to give her three pieces of information and explain where to put each of them, rather than sending a URL that she can just copy/paste in place and be done with it.
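By way of contrast, a few lines are enough to derive every one of those fields from a single pasted URL. A minimal sketch using Python's standard urllib.parse (the field names here are my own illustration, not the dialog's):

from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def fields_from_url(url):
    """Split one pasted URL into the dialog's separate fields."""
    parts = urlsplit(url)
    return {
        "server": parts.hostname,
        # Fall back to the scheme's well-known port when none is given.
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        "use_ssl": parts.scheme == "https",
        "path": parts.path or "/",
    }

print(fields_from_url("https://tfs.example.com:8443/svn/MyProject"))
# {'server': 'tfs.example.com', 'port': 8443, 'use_ssl': True, 'path': '/svn/MyProject'}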

time to read 6 min | 1048 words

There are some interesting discussions about the way Twitter is architected. Frankly, I think that having routine outages over such a long period is a sign of either negligence or incompetence. The problem isn't that hard, for crying out loud.

With that in mind, I set out to try architecting such a system. As far as I am concerned, Twitter can be safely divided into two separate applications: the server side and the client side. The web UI, by the way, is considered part of the client.

The server responsibilities:

  • Accept twits from the client; this includes:
    • Analyzing the message content (@foo should go to foo, #tag should be tagged, etc.)
    • Forwarding the message to all the followers of the particular user
  • Answer queries about twits from the people that a certain person is following
  • Answer queries about a person
  • It should scale
  • Clients are assumed to be badly written

The client responsibilities:

  • Display information to the user
  • Pretty slicing & dicing of the data

Obviously, I am talking as someone who already knows that there is even a need to scale, but let us leave that aside.

I am going to ignore the client; I don't care much about that bit. For the server, it turns out that we have a fairly simple way of handling things.

We will split it into several pieces, and deal with each of them independently. The major ones are read, write and analysis.

[image]

There isn't much need to deal with analysis. We can handle that on the backend, without really affecting the application, so we will start with processing writes.

Twitter, like most systems, is heavily skewed toward reads. In addition to that, we don't really need instant responsiveness. It is allowed to take a few moments before the new message is visible to all the followers.

As such, the process of handling new twits is fairly simple. A gateway server will accept the new twit and place it in a queue. A set of worker servers will take the new twits out of the queue and start processing them.

There are several ways of doing that: either by distributing the load by function, or by simple round robin. Myself, I tend toward round robin, unless there is a process that is significantly slower than the others or has extra requirements (sending email may require opening a port in the firewall; as such, it cannot run on just any machine, but only on machines dedicated to it).
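Roughly, the write pipeline looks like this. A minimal in-process sketch, where queue.Queue stands in for a real message queue and handle_new_twit is a placeholder for the fan-out step described below:

import queue
import re
import threading

incoming = queue.Queue()          # stand-in for a real message queue

def gateway(author, text):
    # The gateway does nothing but accept the twit and queue it.
    incoming.put({"author": author, "text": text})

def worker():
    while True:
        msg = incoming.get()
        # Analysis: pull @replies and #tags out of the message content.
        msg["mentions"] = re.findall(r"@(\w+)", msg["text"])
        msg["tags"] = re.findall(r"#(\w+)", msg["text"])
        handle_new_twit(msg)      # the fan-out, shown below
        incoming.task_done()

# Round robin in its simplest form: several identical workers.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()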

[image]

The process of handling a twit is fairly straightforward. As I mentioned, we are heavily skewed toward reads, so it is worth taking more time when processing a write to make sure that a read is as simple as possible.

This means that our model should support the following qualities:

  • Simple - Reading the timeline should involve no joins and no complexity whatsoever. Preferably, it should involve a query that uses a clustered index and that is it.
  • Cacheable - There should be as few factors affecting the data that we need to handle as possible.
  • Shardable - The ability to split the work into multiple databases means that we will be able to scale out very easily.

As such, the model on the right seems like a good one (obviously this is very oversimplified, but it works as an example).

This means that in order to display the timeline for a particular user, we will need to perform exactly two queries: one to the routing database, to find out which server this user's data is sitting on, and a second, trivial select on the timelines table.
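A sketch of that read path, with SQLite in-memory databases standing in for the routing database and the shards (the schema and names are mine, for illustration):

import sqlite3

# Hypothetical setup: one routing database plus one connection per shard.
routing_db = sqlite3.connect(":memory:")
routing_db.execute("CREATE TABLE user_shards (user_id INTEGER, shard INTEGER)")
shards = {0: sqlite3.connect(":memory:"), 1: sqlite3.connect(":memory:")}
for db in shards.values():
    db.execute("CREATE TABLE timelines (owner_id INTEGER, author TEXT,"
               " text TEXT, posted_at TEXT)")

def get_timeline(user_id, count=20):
    # Query 1: the routing database tells us which shard holds this user.
    shard, = routing_db.execute(
        "SELECT shard FROM user_shards WHERE user_id = ?", (user_id,)).fetchone()
    # Query 2: a trivial select against the clustered index on that shard.
    return shards[shard].execute(
        "SELECT author, text, posted_at FROM timelines"
        " WHERE owner_id = ? ORDER BY posted_at DESC LIMIT ?",
        (user_id, count)).fetchall()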

What this means, in turn, is that the process for writing a new twit can be described using the following piece of code:

# Everyone who should see the twit: the author's followers,
# plus anyone mentioned via @reply syntax in the message body.
followers = GetFollowersFor(msg.Author)
followers.UnionWith( GetRepliesToIn(msg.Text) )

# Fan out: write the message once to each recipient's timeline.
for follower in followers:
	DirectPublish(follower, msg)

DirectPublish would simply locate the appropriate server and insert a new message; that is all.
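A minimal sketch of DirectPublish, reusing the hypothetical routing_db and shards from the read-path sketch above (Msg is a stand-in message type, not the real one):

from collections import namedtuple

Msg = namedtuple("Msg", "Author Text PostedAt")   # illustrative only

def DirectPublish(follower, msg):
    # Locate the server (shard) that owns this follower's timeline...
    shard, = routing_db.execute(
        "SELECT shard FROM user_shards WHERE user_id = ?", (follower,)).fetchone()
    # ...and insert the new message into it. That really is all.
    shards[shard].execute(
        "INSERT INTO timelines (owner_id, author, text, posted_at)"
        " VALUES (?, ?, ?, ?)",
        (follower, msg.Author, msg.Text, msg.PostedAt))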

If we take the pathological case of someone who has 10,000 followers, what this means is that each time this person publishes a new twit, the writer section of the application will have to go and write the message 10,000 times. Ridiculous, isn't it?

Not really. This model allows us to keep contention very low, since we don't have any need for complex coordination; it is easily scalable by adding servers as needed; and it means that even if several such pathological users suddenly start sending a high volume of messages, the application as a whole is not really affected. That is quite apart from the fact that inserting 10,000 rows is not really a big deal, especially since we are splitting it across several servers.

But if it really bothered me, I would designate separate machines for handling such high volume users. That would ensure that regular traffic on the site can still flow while some machine in the back of the data center slowly processes the big volume. In fact, I would probably decide that it is worth my time to use bulk insert techniques for those users.
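For instance, a sketch of batching the fan-out by shard (same hypothetical schema as above, with executemany standing in for a real bulk insert API):

from collections import defaultdict

def bulk_publish(followers, msg):
    # Group the recipients by the shard their timeline lives on.
    rows_by_shard = defaultdict(list)
    for follower in followers:
        shard, = routing_db.execute(
            "SELECT shard FROM user_shards WHERE user_id = ?",
            (follower,)).fetchone()
        rows_by_shard[shard].append(
            (follower, msg.Author, msg.Text, msg.PostedAt))

    # One batched insert per shard instead of one insert per follower.
    for shard, rows in rows_by_shard.items():
        shards[shard].executemany(
            "INSERT INTO timelines (owner_id, author, text, posted_at)"
            " VALUES (?, ?, ?, ?)", rows)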

All of that said, we now have a system where the database end is trivially simple (probably because the problem, as presented in this post, is trivial; I am pretty sure that the real world is more complex, but never mind that). Scaling out the writing part is a matter of adding more workers to process more messages. Scaling out the database is a matter of putting more boxes in the data center, nothing truly complex. The read portion is a good place for judicious use of caching, and the model lends itself well to that.

Now, feel free to tell me where I am off base...

time to read 3 min | 500 words

The bug: SvnBridge will not accept a change to a filename that uses multi-byte characters.

Let us start figuring out what is going on, shall we?

Hm... looks like when TortoiseSVN is PUT-ing the file, it uses the directory name instead of the actual file name. Obviously, this is a client bug, and I can close this bug and move on to doing other things. Except that I am pretty sure that SVN can handle non-ASCII characters...

Let us compare the network trace from SVN and SvnBridge and see what is going on...

Oh, I got the problem. TortoiseSVN is requesting a CHECKOUT, and we return the wrong URL. Obviously I did something astoundingly stupid there when I processed that request. Indeed, taking a look at what is going on there is... interesting.

Finally I realize that the problem is that while TortoiseSVN sends the URL in a seemingly reasonable format, it is being read wrongly by SvnBridge, which uses ASCII encoding to read it when it probably should use UTF-8. Because of that, the URL looks something like /tfsrtm08/test/????.txt, and ? is the query string terminator; no wonder I have issues.
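A quick illustration of the mangling (the Hebrew file name is my own stand-in). .NET's Encoding.ASCII substitutes '?' for anything it cannot represent, which is exactly how the file name turns into question marks:

raw = "/tfsrtm08/test/שלום.txt".encode("utf-8")   # the bytes on the wire

# Round-tripping through ASCII, the way Encoding.ASCII does in .NET,
# replaces every non-ASCII character with '?'.
mangled = raw.decode("utf-8").encode("ascii", errors="replace").decode("ascii")
print(mangled)               # /tfsrtm08/test/????.txt

# Reading the bytes as UTF-8 keeps the file name intact.
print(raw.decode("utf-8"))   # /tfsrtm08/test/שלום.txt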

But how does it work for standard SVN servers? Hm, looks like TortoiseSVN uses Url Encoding for the URL when talking to a standard SVN server. Why is it doing this?

Hm... looks like the URL that TortoiseSVN uses when it requests a CHECKOUT is the one returned in the response to a PROPFIND request.

Sure, that is easy to fix; we will just Url Encode the response from the PROPFIND handler.

Except that it doesn't work...

Oh, we also handle some of that in the file node class, so we will fix it there.

Yeah!

Except that it still doesn't work!

Grr! Looks like the way we handle Url Encoding is too invasive; we need to Url Encode only non-ASCII characters, which means we should not encode characters such as '/', for example.
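Something along these lines (a sketch of the idea, not SvnBridge's actual code): percent-encode only the non-ASCII bytes, leaving structural characters such as '/' alone:

def encode_non_ascii(path):
    # Percent-encode only non-ASCII bytes; '/', '.', etc. pass through.
    out = []
    for byte in path.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))               # plain ASCII, keep as-is
        else:
            out.append("%{:02X}".format(byte))  # escape everything else
    return "".join(out)

print(encode_non_ascii("/tfsrtm08/test/שלום.txt"))
# /tfsrtm08/test/%D7%A9%D7%9C%D7%95%D7%9D.txt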

Let us try this again, shall we?

Hm... it still doesn't work. What is going on?

Excuse me while I am having a nervous breakdown, please.

Oh! When we send the response to the CHECKOUT request, we send the URL for the PUT request in the Location header, and we haven't Url Encoded that yet.

And now it works... :-D

Let us run the test and commit...

We have a broken test.

*Sob*

Well, okay, that is actually a test that is supposed to be broken (it is testing the way we Url Encode URLs, after all).

Number of code lines changed: less than 20.

Number of hours spent on this bug: over eight.

Conclusion, I am not really being productive today. I am off to visit the code generation wizard...
