Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 1 min | 185 words

Yes, I know, 2003 called and asked to get its distribution technology back. Nevertheless, remoting is an extremely useful tool, if you can make several assumptions about the way that you are going to use it.

In my case, I am assuming inter-process communication on the local machine, with a high expectation of reliability on both ends. Considering that I also need low latency, it seems like an appropriate solution indeed. I was pretty happy about this, until all my integration tests started to break.

After a while, I managed to figure out that the root cause is this error: Because of security restrictions, the type XYZ cannot be accessed.

Now, it worked, and it worked for a long time. What the hell is going on?

After thinking about it for a while, I realized that the major thing that changed was that I am now signing my assemblies. And that caused all hell to break loose. I managed to find this post with the solution, but I am still not happy. I really dislike things that can just go and break on me.
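For reference, the commonly cited fix for this error is to raise the type filter level on the remoting formatter, since signed assemblies trip the default "Low" deserialization restrictions. A minimal app.config sketch, assuming an IPC channel as in my scenario (the port name is hypothetical):

```xml
<configuration>
  <system.runtime.remoting>
    <application>
      <channels>
        <channel ref="ipc" portName="myAppChannel">
          <serverProviders>
            <!-- Allow full type deserialization across the remoting boundary -->
            <formatter ref="binary" typeFilterLevel="Full" />
          </serverProviders>
        </channel>
      </channels>
    </application>
  </system.runtime.remoting>
</configuration>
```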

time to read 3 min | 460 words

A while ago I worked at a bank, doing stuff there, and I was exposed to their internal IT structure. As a result of that experience, I decided that I will never put any money in that bank. I am in no way naive enough to think that the situation is different in other banks, but at least I didn't know how bad it was. In fact, that experience has led me to the following observation:

There is an inverse relationship between the amount of money a piece of code handles and its quality.

The biggest bank in Israel just had about 60 hours of downtime. Oh, and it also provides computing services for a couple of other banks, so we had three major banks down for over two days. The major bank, Hapoalim, happens to be my bank as well, and downtime in this scenario means that all of the systems in the bank were down: from credit card processing to the internal systems, and from trading systems to their online presence and their customer service.

From what I was able to find out, they managed to mess up an upgrade and went down hard. I was personally affected by this when I arrived in Israel on Sunday morning: I wasn't able to withdraw any money, and my credit cards weren't worth the plastic they are made of (a bit of a problem when I need a cab to get home). I am scared to think what would have happened if I were still abroad while my bank was in system meltdown and inaccessible.

I was at the bank yesterday, one of the few times that I actually had to physically go there, and I was told that this is the first time they have ever had such a problem, and the people I was speaking with have more than 30 years of working for the bank.

I am dying to know what exactly happened. Not that I expect I ever will, but professional curiosity is eating me up. My personal estimate of the damage to the bank is upward of 250 million, in addition to the reputation & trust damage. That doesn't take into account the lawsuits that are going to be filed against the bank, nor the additional costs they are going to incur just from what the auditors are going to do to them.

Oh, conspiracy theories are flourishing, but the most damning piece, as far as I am concerned, is how little attention the media has paid to this issue overall.

Leaving aside the actual cause, I am now much more concerned about the disaster recovery procedures there...

time to read 4 min | 665 words

I mentioned that this line has the potential to destabilize an application, because it is a remote blocking call.

var cart = customerSrv.GetShoppingCart(customerId);

Neil Mosafi left the following comment:

I've never experienced other threads being blocked whilst making a sync service call.  Even an Async call is essentially a sync call but done in another thread or using an iocompletion port.  Or are you saying we should be making duplex service calls to avoid possible problems?

Let us start by saying that I am talking about pathological scenarios, nothing that you'll meet in everyday use. However, "once in a million is next Tuesday" in our business. I have seen applications behave... strangely in production.

Let us focus on the trivial issues first, shall we?

  • HTTP: only 2 concurrent requests per host.
    This is fairly well known, and there are ways around it, but it is neither trivial nor something you can ignore.
    Result: requests are serialized in the HTTP layer.
  • HTTPS: all of the HTTP limitations, plus ~4,000 requests per IP (not host) in any 2-minute window.
    This is not well known, and while there are ways around it, it is not something that most people think of until the application fails.
    Result: the request is denied.
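For the first limitation, the per-host connection limit can be raised in configuration (or via ServicePointManager.DefaultConnectionLimit in code). A sketch of the app.config approach; the value 16 is an arbitrary illustration, not a recommendation:

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- Raise the default limit of 2 concurrent connections per host -->
      <add address="*" maxconnection="16" />
    </connectionManagement>
  </system.net>
</configuration>
```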

Those are the common ones, but with TCP based protocols, the server can hang the client in so many ways, it isn't even funny. TCP redirection loops, waiting on the listen queues, slow transfer rates, malformed TCP protocols and high packet loss are just the things that occur to me right now.

In general, we can divide the issues into fail fast and block. Fail fast are what we want, block is what we have to deal with.

Now, how can a blocking call take down an application? Starting with a convoy and ending with a chain reaction.

Let us say that we are making the blocking call above, and for some reason, it takes longer to process this than our SLA allows. In most scenarios, we would like to abort the current call and send an error downstream. What we don't want is to have a situation on our hands where we block. If we block, we hold a valuable thread that is doing nothing but wait.
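One way to bound the wait is to use the asynchronous version of the call and give up when the SLA is exceeded. A sketch, assuming the service proxy also exposes the classic Begin/End asynchronous pattern (the method names here are hypothetical, modeled on the blocking call above):

```csharp
// Enforce an SLA on the remote call instead of blocking indefinitely.
// BeginGetShoppingCart / EndGetShoppingCart are assumed to be the
// APM-style methods on the generated proxy.
IAsyncResult call = customerSrv.BeginGetShoppingCart(customerId, null, null);
if (call.AsyncWaitHandle.WaitOne(TimeSpan.FromSeconds(2), false) == false)
{
    // SLA exceeded: fail fast and send an error downstream
    throw new TimeoutException("GetShoppingCart exceeded its SLA");
}
var cart = customerSrv.EndGetShoppingCart(call);
```

This still occupies the calling thread for up to the timeout, but it converts an unbounded block into a bounded one that fails fast.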

In .NET, there are several types of threads that we utilize: thread pool threads (ASP.NET, WCF, QueueUserWorkItem, etc.), the main thread (in client applications), free threads (my own term for threads the application creates manually), IO threads (we mostly don't deal with them; they are an infrastructure concern) and private thread pools.

A thread is an expensive resource, so we tend to hang on to threads rather than creating them all the time. In particular, for most servers, we have a finite number of threads available for doing work.

Now, assume that some threads are blocked, or even just processing things more slowly. The concept of blocking remote calls means that we have now propagated this issue to all our clients, which will propagate them to their clients, etc.  In fact, a convoy (serialization of processing work in one place) can easily lead to a chain reaction which will lead to the entire application meltdown.

And that is the good part.

The bad part is if all your threads are blocked for some reason. (I had a case once where some idiot ran a long query with serializable isolation on the log table. Guess what happened to the application in the meantime?) If all the threads are blocked, you can't do anything; you are dead in the water.

I will talk about approaches to dealing with this in a future post.

time to read 1 min | 111 words

I have an interesting problem with SvnBridge.

After around 5,000 full revision requests (a set of requests that can occur), the application gets hung making a web service call to TFS. This comes after making quite a few calls to TFS, and it is generally fairly easy to reproduce. The actual call being made is not an expensive one (nor is it the same call each time). TFS is responsive during that time, so it is not at fault.

It looks very much like I am hitting the 2 concurrent HTTP requests limit, except that all requests are serialized, and there is no multi-threaded work involved.

I have been unable to reproduce this under a profiler or debugger...


time to read 5 min | 823 words

For some reason, it seems like I am talking about Release It a lot lately, to a lot of people. As I said when I reviewed it, that book literally changed the way that I approach problems. It also made me much more aware of the failure scenarios that I need to deal with.

A while ago I sat down in one of Jeremy Miller's talks and he mentioned that he had added the ability to do Environment Validation to StructureMap, so when the application is starting up, it can verify that all its dependencies are in a valid state. That made so much sense to me that I immediately added this facility to Windsor.

What I am going to talk about today is to take this approach one step further. Instead of running those tests just at application startup, they should be run every day, or every hour.

Yes, the operations team is supposed to have monitoring on the application, but unless they were part of the development process (or are a dedicated ops team), that still leaves you as the principal owner of knowledge about the environment your application needs. Even if you have a capable operations team with a very good understanding of your application, it is often best to support them by providing this functionality. It is very likely that you can get more information from your application than the operations team can.

And if you don't have an operations team, you really want to be able to do that.

Now that we have taken care of the motivation for this approach, let us see what exactly we are talking about.

Environment validation means validating that your entire environment is in a state that allows your application to run at full capacity. I am going to list a few things that I think are essential for many applications. I am sure I am going to miss some, so feel free to add more items to the list.

  • Certificates are valid and expire in more than a month.
  • Domain registration expires in more than one month.
  • For each server in the application (web, database, cache, application):
    • Server is alive and responding (within specified time).
    • Server's HD has more than 10% free space.
    • Server CPU usage is less than 80%
  • Associated 3rd party servers are responding within their SLA.
  • Sample execution of common scenarios finish successfully in a specified time frame.
  • Number of faults (non critical ones) in the application is below the threshold.
  • No critical fault (critical defined as taking the entire system down).
  • Current traffic / work on the system is within expected range (too low, and we may have external network issue, too high, and we need to up our capacity).
  • Application audit trail is updated. (Can do the same for log, if required).
  • System backup was performed and completed successfully.
  • All batch jobs have been run and completed successfully.
  • Verify the previously generated faults has been dealt with.

Those are the generalities, I am pretty sure that you can think of a lot more that fit your own systems.

The important thing to remember here is that you should treat this piece as a core part of the application infrastructure. In many production environments, you simply cannot get access. This is part of the application and should be deployed with the application. At any rate, it should be made clear that this is part of the deployment, not just a useless appendix.

My preference would be to have a windows service monitor my systems and alert when there are failures.

This is another important consideration, how do you send alerts? And when? You should have at least three levels of warnings: Warning, Error and Fatal. You send them according to the severity of the problem.

In all cases, I would log them to the event log at a minimum, and probably send mail as well. For the Error and Fatal levels, I would use SMS / generate an alert to the operations monitoring systems. If there are monitoring systems in place that the operations team is using, it is best to route things through them; they probably already have the ability to wake someone up at 3 AM. If you don't have that, then an SMS is at least near-instantaneous, and you can more or less rely on it being read.

That is long enough, and I have to do some work today, so I'll just stop here, I think.

time to read 2 min | 292 words

No, I am not going to tell you to use throw; instead of throw e; I am going to talk about exception messages, assumptions, and pain.

Exception hierarchies are useful in many ways, mostly because they bring order to the way we handle exceptions. We can catch a specific exception, or a root exception in a hierarchy, and handle them specifically. But one of the uses of exception hierarchies is to add additional data to an exception. In many cases this is very useful data, such as the SQL error code, the details node in a SOAP fault, or the list of assemblies that could not be loaded.

Do you know what these three pieces of data have in common?

  1. They are very useful
  2. They do not show in ex.ToString()

Guess what is going to be shown in any log, error message, etc?

You got that right, the ex.ToString() output!

If you have additional information in the exception, it must show up in ex.ToString() as well. Trying to diagnose assembly load failures is driving me mad.

Imagine finding things like this in the log:

  • "ReflectionTypeLoadException: Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information."
  • "SoapException: Server was unable to process request"

Imagine getting one of those during development: you have no way of knowing where this is happening. Often you can't even set a breakpoint in the code and inspect the exception, because it is handled inside some library code. ASP.NET is a good example of how this can happen, and of where this is a highly annoying issue to work with.

To summarize: if you create exceptions, remember a simple rule, everything should go into ex.ToString().
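Applying the rule looks something like this. The exception type and its data are hypothetical, made up to illustrate the pattern of carrying extra data and surfacing it in ToString():

```csharp
using System;

// Hypothetical exception type: any extra data the exception carries must
// also appear in ToString(), because that is what ends up in the logs.
public class AssemblyLoadFailureException : Exception
{
    private readonly string[] failedAssemblies;

    public AssemblyLoadFailureException(string message, string[] failedAssemblies)
        : base(message)
    {
        this.failedAssemblies = failedAssemblies;
    }

    // The structured data is still available for callers that want it
    public string[] FailedAssemblies
    {
        get { return failedAssemblies; }
    }

    // ...and it also survives into any log that only calls ToString()
    public override string ToString()
    {
        return base.ToString() + Environment.NewLine +
               "Failed assemblies: " + string.Join(", ", failedAssemblies);
    }
}
```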

time to read 1 min | 68 words

Alex has more or less hit on the worst description of a platform I can think of:

XYZ is a technology of highs and lows... the highs are when you've finally got something to work that should've worked in the first place, the lows are well... all the times in between.

If you are working on such a platform, make yourself happy, just go away.

Amazon's Dynamo

time to read 1 min | 173 words

Okay, Amazon has just published an interesting paper about how they manage state for some of their services. The underlying idea is a hash table, distributed, reliable, versioned and simple.  They have some interesting constraints that influenced the design of the system, and it is an interesting, if dry, read.

Dare has some comments about it.

I'll limit myself to saying that the data versioning approach is extremely interesting. The idea is that you issue a get(key) and the result is a set of relevant object versions that may need reconciliation. They conclude that while this may seem like putting a lot of responsibility in the app developer's hands, it is something they already had to deal with due to the scalability requirements that they face.

I wouldn't want to do this for a small site, but I can see the advantages for scaling wide.

Amusingly enough, the classic shopping cart sample appears to be a core service for this system, and a complex one.
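To give a feel for what that reconciliation responsibility means for the application, here is a hypothetical sketch (my own illustration, not code from the paper): get(key) hands back several conflicting cart versions, and the application merges them, here by taking the union of the item IDs. Note that a naive union like this can resurrect deleted items, which is exactly the kind of trade-off the paper leaves to the developer:

```csharp
using System.Collections.Generic;

// Hypothetical reconciliation of conflicting shopping cart versions
// returned by a Dynamo-style get(key): merge by union of item IDs.
public static class CartReconciliation
{
    public static HashSet<string> Reconcile(IEnumerable<HashSet<string>> versions)
    {
        var merged = new HashSet<string>();
        foreach (HashSet<string> version in versions)
            merged.UnionWith(version);
        return merged;
    }
}
```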

The CLR Sources

time to read 1 min | 85 words

I have no idea why this isn't in much wider circulation, but this is huge.

ScottGu has announced that Microsoft is Releasing the Source Code for the .NET Framework Libraries.

I am disappointed to see that even in the tiny source code samples he has in the post I have violent disagreements (they talk about sealing stuff, which I have serious objections to).

This hopefully means a lot less ReflectorDebugging, although I am not sure about all the implication that this has.

