Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 1 min | 185 words

Yes, I know, 2003 called and asked to get its distribution technology back. Nevertheless, remoting is an extremely useful tool, if you can make several assumptions about the way that you are going to use it.

In my case, I am assuming an inter-process, local-machine configuration, with a high expectation of reliability from both ends. Considering that I also need low latency, it seemed like an appropriate solution indeed. I was pretty happy about this, until all my integration tests started to break.

After a while, I managed to figure out that the root cause for that is this error: Because of security restrictions, the type XYZ cannot be accessed.

Now, it worked, and it worked for a long time. What the hell is going on?

After thinking about this for a while, I realized that the major thing that changed was that I am now signing my assemblies. And that caused all hell to break loose. I managed to find this post with the solution, but I am still not happy. I really dislike things that can just go and break on me.

time to read 3 min | 460 words

A while ago I worked at a bank, doing stuff there, and I was exposed to their internal IT structure. As a result of that experience, I decided that I would never put any money in that bank. I am in no way naive enough to think that the situation is different in other banks, but at least there I don't know how bad it is. In fact, that experience has led me to the following observation:

There is an inverse relationship between the amount of money a piece of code handles and its quality.

The biggest bank in Israel just had about 60 hours of downtime. Oh, and it also provides computing services for a couple of other banks as well, so we had three major banks down for over two days. The major bank, Hapoalim, happens to be my bank as well, and downtime in this scenario means that all of the systems in the bank were down, from credit card processing to the internal systems and from trading systems to their online presence and their customer service.

From what I was able to find out, they managed to mess up an upgrade, and went down hard. I was personally affected by this when I came to Israel on Sunday morning: I wasn't able to withdraw any money, and my credit cards weren't worth the plastic they are made of (a bit of a problem when I need a cab to go home). I am scared to think what would have happened if I were still abroad while my bank was basically in system meltdown and inaccessible.

I was at the bank yesterday, one of the few times that I actually had to physically go there, and I was told that this is the first time that they ever had such a problem, and the people I was speaking with have more than 30 years of working for the bank.

I am dying to know what exactly happened, not that I expect that I ever will, but professional curiosity is eating me up. My personal estimate of the damage to the bank is upward of 250 million, in addition to reputation & trust damage. That doesn't take into account lawsuits that are going to be filed against the bank, nor does it take into account the additional costs that they are going to incur as a result of that just from what the auditors are going to do to them.

Oh, conspiracy theories are flourishing, but the most damning piece, as far as I am concerned, is how little attention the media has paid to this issue overall.

Leaving aside the actual cause, I am now much more concerned with the disaster recovery procedures there...

time to read 4 min | 665 words

I mentioned that this line has the potential to destabilize an application, because it is a remote blocking call.

var cart = customerSrv.GetShoppingCart(customerId);

Neil Mosafi left the following comment:

I've never experienced other threads being blocked whilst making a sync service call.  Even an Async call is essentially a sync call but done in another thread or using an iocompletion port.  Or are you saying we should be making duplex service calls to avoid possible problems?

Let us start by saying that I am talking about pathological scenarios, nothing that you'll meet in an everyday scenario. However, "once in a million is next Tuesday" in our business. I have seen applications behave... strangely in production.

Let us focus on the trivial issues first, shall we?

  • HTTP: Only 2 concurrent requests per host
    This is fairly well known, and there are ways around it, but it is neither trivial nor something you can ignore.
    Result: requests are serialized in the HTTP layer
  • HTTPS: All of the HTTP limitations, plus ~4,000 requests per IP (not host) in any 2-minute window.
    This is not well known, and while there are ways around it, it is not something that most people think of until the application fails.
    Result: request is denied.
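
One of the ways around the two-connection limit can be sketched like this. It assumes the classic .NET Framework HTTP stack (HttpWebRequest and the web service proxies built on top of it); the HttpLimits class name is made up for the example, and the right number depends entirely on what the server can take. Newer HTTP stacks may ignore this setting.

```csharp
using System.Net;

// Hypothetical helper; only ServicePointManager is a real framework type here.
public static class HttpLimits
{
    // Call once at startup, before the first request goes out.
    // Raises the number of concurrent connections allowed per host.
    public static void Raise(int connectionsPerHost)
    {
        ServicePointManager.DefaultConnectionLimit = connectionsPerHost;
    }
}
```

Raising the limit only moves the point where requests start to queue; it does nothing about a server that is slow to answer.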

Those are the common ones, but with TCP based protocols, the server can hang the client in so many ways, it isn't even funny. TCP redirection loops, waiting on the listen queues, slow transfer rates, malformed TCP protocols and high packet loss are just the things that occur to me right now.

In general, we can divide the issues into fail fast and block. Fail fast is what we want, block is what we have to deal with.

Now, how can a blocking call take down an application? Starting with a convoy and ending with a chain reaction.

Let us say that we are making the blocking call above, and for some reason, it takes longer to process this than our SLA allows. In most scenarios, we would like to abort the current call and send an error downstream. What we don't want is to have a situation on our hands where we block. If we block, we hold a valuable thread that is doing nothing but wait.
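
A minimal sketch of "abort the current call and send an error downstream", assuming the remote call offers no timeout of its own. BoundedCall is a hypothetical helper, not a framework type, and the timeout you pass is whatever your SLA dictates.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper that bounds how long the caller waits.
public static class BoundedCall
{
    // Runs the blocking call on a worker thread and stops waiting after `timeout`.
    // The worker is still stuck until the call returns; only the caller is
    // released, so this limits the damage rather than removing it.
    public static T Execute<T>(Func<T> blockingCall, TimeSpan timeout)
    {
        Task<T> task = Task.Run(blockingCall);
        if (!task.Wait(timeout))
            throw new TimeoutException("Remote call exceeded " + timeout);
        return task.Result;
    }
}
```

With the call from above, that would read BoundedCall.Execute(() => customerSrv.GetShoppingCart(customerId), TimeSpan.FromSeconds(2)), where the two seconds are an invented number. If the underlying proxy has a real timeout setting, prefer that, since it also frees the worker thread.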

In .NET, there are several types of threads that we utilize: thread pool threads (ASP.Net, WCF, QueueUserWorkItem, etc.), the main thread (in client applications), free threads (my own term for threads that the application created manually), IO threads (we mostly don't deal with them; they are an infrastructure concern) and private thread pools.

A thread is an expensive resource, so we tend to hang on to threads rather than creating new ones all the time. In particular, for most servers, we have a finite number of threads that are available for doing work.

Now, assume that some threads are blocked, or even just processing things more slowly. The concept of blocking remote calls means that we have now propagated this issue to all our clients, which will propagate it to their clients, etc. In fact, a convoy (serialization of processing work in one place) can easily lead to a chain reaction which will lead to a meltdown of the entire application.

And that is the good part.

The bad part is if all your threads are blocked for some reason. (I had a case once where some idiot ran a long query with serializable isolation on the log table. Guess what happened to the application in the meantime?) If all the threads are blocked, you can't do anything; you are dead in the water.

I will talk about approaches to dealing with this in a future post.

time to read 1 min | 111 words

I have an interesting problem with SvnBridge.

After around 5000 full revision requests (a set of requests that can occur), the application hangs while making a web service call to TFS. This comes after making quite a few calls to TFS, and is generally fairly easily reproducible. The actual call being made is not an expensive one (nor is it the same call). TFS is responsive during that time, so it is not its fault.

It looks very much like I am hitting the limit of 2 concurrent HTTP requests, except that all requests are serialized, and there is no multi-threaded work involved.

I have been unable to reproduce this under a profiler or debugger...

Thoughts?

time to read 5 min | 823 words

For some reason, it seems like I am talking about Release It a lot lately, to a lot of people. As I said when I reviewed it, that book literally changed the way that I approach problems. It also made me much more aware of the failure scenarios that I need to deal with.

A while ago I sat down in one of Jeremy Miller's talks and he mentioned that he had added the ability to do Environment Validation to StructureMap, so when the application is starting up, it can verify that all its dependencies are in a valid state. That made so much sense to me that I immediately added this facility to Windsor.

What I am going to talk about today is to take this approach one step further. Instead of running those tests just at application startup, they should be run every day, or every hour.

Yes, the operations team is supposed to have monitoring on the application, but unless they were part of the development process (or are a dedicated ops team), that still leaves you as the principal owner of knowledge about the environment your application needs. Even if you have a capable operations team, and they have a very good understanding of your application, it is often best to support them by providing this functionality. It is very likely that you can get more information from your application than the operations team can.

And if you don't have an operations team, you really want to be able to do that.

Now that we have taken care of the motivation for this approach, let us see what exactly we are talking about.

Environment validation means that you validate that your entire environment is in a state that allows your application to run at full capacity. I am going to list a few things that I think are essential for many applications. I am sure that I am going to miss some, so feel free to add more items to the list.

  • Certificates are valid and expire in more than a month.
  • Domain registration expires in more than one month.
  • For each server in the application (web, database, cache, application):
    • Server is alive and responding (within specified time).
    • Server's HD has more than 10% free space.
    • Server CPU usage is less than 80%.
  • Associated 3rd party servers are responding within their SLA.
  • Sample execution of common scenarios finish successfully in a specified time frame.
  • Number of faults (non critical ones) in the application is below the threshold.
  • No critical fault (critical defined as taking the entire system down).
  • Current traffic / work on the system is within expected range (too low, and we may have external network issue, too high, and we need to up our capacity).
  • Application audit trail is updated. (Can do the same for log, if required).
  • System backup was performed and completed successfully.
  • All batch jobs have been run and completed successfully.
  • Verify that the previously generated faults have been dealt with.

Those are the generalities, I am pretty sure that you can think of a lot more that fit your own systems.
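
A few of the checks above can be sketched as follows. All of the names here (CheckResult, EnvironmentValidator and its methods) are invented for the example; this is not the StructureMap or Windsor API, just one way the idea could look.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical result type for a single environment check.
public sealed class CheckResult
{
    public string Name;
    public bool Passed;
    public string Details;
}

// Hypothetical validator; the thresholds are passed in by the caller.
public static class EnvironmentValidator
{
    // Passes when `expiresUtc` is more than `margin` away from `nowUtc`.
    // Usable for certificates and domain registrations alike.
    public static CheckResult ExpiresLaterThan(string name, DateTime expiresUtc, DateTime nowUtc, TimeSpan margin)
    {
        return new CheckResult
        {
            Name = name,
            Passed = expiresUtc - nowUtc > margin,
            Details = "expires " + expiresUtc.ToString("u")
        };
    }

    // Passes when the drive has more than `minFreeRatio` (0.10 for 10%) free.
    public static CheckResult FreeDiskSpace(string driveName, double minFreeRatio)
    {
        var drive = new DriveInfo(driveName);
        double free = (double)drive.AvailableFreeSpace / drive.TotalSize;
        return new CheckResult
        {
            Name = "disk " + driveName,
            Passed = free > minFreeRatio,
            Details = free.ToString("P0") + " free"
        };
    }

    // Runs every check. A check that throws is recorded as a failure,
    // so one dead dependency cannot stop the rest of the run.
    public static List<CheckResult> RunAll(IDictionary<string, Func<CheckResult>> checks)
    {
        var results = new List<CheckResult>();
        foreach (var check in checks)
        {
            try
            {
                results.Add(check.Value());
            }
            catch (Exception e)
            {
                results.Add(new CheckResult { Name = check.Key, Passed = false, Details = e.Message });
            }
        }
        return results;
    }
}
```

For a certificate, the expiry date could come from X509Certificate2.NotAfter (which is reported in local time, so convert it before comparing against a UTC clock). Running RunAll on a timer, rather than only at startup, is the whole point.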

The important thing to remember here is that you should treat this piece as a core part of the application infrastructure. In many production environments, you simply cannot get access. This is part of the application, and should be deployed with the application. At any rate, it should be made clear that this is part of the deployment program, not just a useless appendix.

My preference would be to have a windows service to monitor my systems and alert when there are failures.

This is another important consideration, how do you send alerts? And when? You should have at least three levels of warnings: Warning, Error and Fatal. You send them according to the severity of the problem.

In all cases, I would log them to the event log at a minimum, and probably send mail as well. For Error and Fatal levels, I would use SMS or generate an alert to the operations monitoring systems. If there is a monitoring system in place that the operations team is using, it is best to route things through it. They probably already have the ability to wake someone up at 3 AM. If you don't have that, then an SMS is at least near instantaneous, and you can more or less rely on it being read.
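
The routing described above, as a small sketch. The channel names are placeholders for whatever senders you actually have, and AlertRouter is a made-up name.

```csharp
using System;
using System.Collections.Generic;

public enum Severity { Warning, Error, Fatal }

// Hypothetical router; it only decides where an alert goes, not how it is sent.
public static class AlertRouter
{
    // Everything goes to the event log and mail.
    // Error and Fatal also go to SMS or the operations monitoring system.
    public static IList<string> ChannelsFor(Severity severity)
    {
        var channels = new List<string> { "event-log", "mail" };
        if (severity >= Severity.Error)
            channels.Add("sms-or-ops-monitoring");
        return channels;
    }
}
```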

That is long enough, and I have to do some work today, so I'll just stop here, I think.

time to read 2 min | 292 words

No, I am not going to tell you to use throw; instead of throw e; I am going to talk about exception messages, assumptions, and pain.

Exception hierarchies are useful in many ways, mostly because they bring order to the way we handle exceptions. We can catch a specific exception, or a root exception in a hierarchy, and handle them specifically. But one of the usages of exception hierarchies is to add additional data to an exception. In many cases, this is very useful data, such as the SQL error code, the details node in a SOAP fault, or the list of assemblies that could not be loaded.

Do you know what these three pieces of data have in common?

  1. They are very useful
  2. They do not show in ex.ToString()

Guess what is going to be shown in any log, error message, etc?

You got that right, the ex.ToString() output!

If you have additional information in the exception, it must be available on the exception afterward. Trying to diagnose assembly load failures is driving me mad.

Imagine finding things like this in the log:

  • "ReflectionTypeLoadException: Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information."
  • "SoapException: Server was unable to process request"

Imagine getting one of those during dev: you have no way of knowing where this is happening. Often you can't even set a breakpoint in the code there and inspect the exception, because it is handled inside some library code. ASP.Net is a good example of how this can happen, and where this is a highly annoying issue to work with.

To summarize, if you create exceptions, make sure to remember a simple rule: everything should go in the ex.ToString().
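
Here is roughly what that rule looks like in practice. PluginLoadException is a made-up type for the example; it carries a list of assemblies in the same spirit as the LoaderExceptions property mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical exception type with extra diagnostic data attached.
public class PluginLoadException : Exception
{
    public IList<string> FailedAssemblies { get; private set; }

    public PluginLoadException(string message, IList<string> failedAssemblies)
        : base(message)
    {
        FailedAssemblies = failedAssemblies;
    }

    // Append the extra data, so it shows up in every log that calls ToString().
    public override string ToString()
    {
        return base.ToString() + Environment.NewLine +
               "Failed assemblies: " + string.Join(", ", FailedAssemblies);
    }
}
```

The property stays available for code that wants to handle the failure, but whoever reads the log no longer needs a debugger to see it.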

time to read 1 min | 68 words

Alex has more or less hit on about the worst description of a platform I can think of:

XYZ is a technology of highs and lows... the highs are when you've finally got something to work that should've worked in the first place, the lows are well... all the times in between.

If you are working on such a platform, make yourself happy, just go away.

Amazon's Dynamo

time to read 1 min | 173 words

Okay, Amazon has just published an interesting paper about how they manage state for some of their services. The underlying idea is a hash table, distributed, reliable, versioned and simple.  They have some interesting constraints that influenced the design of the system, and it is an interesting, if dry, read.

Dare has some comments about it.

I'll limit myself to saying that the data versioning approach is extremely interesting. The idea is that you issue a get(key) and the result is a set of relevant objects that may need reconciliation. They end with a conclusion that while this may seem like putting a lot of the responsibility in the app developers' hands, this is something that they already had to deal with due to the scalability requirements that they face.
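
To make the reconciliation idea concrete, here is a toy sketch of what the application side might do with the set of versions that a get(key) hands back. The merge rule (take the union of the items in all versions, so an added item is never lost) is an assumption made for the example, CartReconciler is a made-up name, and nothing here models how the store itself tracks versions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical application-level reconciler for a shopping cart.
public static class CartReconciler
{
    // Merges conflicting cart versions into one cart by taking the union of
    // their items. A removed item can reappear if an older version still has it.
    public static HashSet<string> Merge(IEnumerable<IEnumerable<string>> versions)
    {
        var merged = new HashSet<string>();
        foreach (var version in versions)
            merged.UnionWith(version);
        return merged;
    }
}
```

The merged cart would then be written back as the new version, so later reads of that key no longer see the conflict.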

I wouldn't want to do this for a small site, but I can see the advantages for scaling wide.

Amusingly enough, the classic shopping cart sample appears to be a core service for this system, and a complex one.

The CLR Sources

time to read 1 min | 85 words

I have no idea why this isn't in much wider circulation, but this is huge.

ScottGu has announced that Microsoft is Releasing the Source Code for the .NET Framework Libraries.

I am disappointed to see that even in the tiny source code samples that he has in the post I have violent disagreements (they speak about sealing stuff, which I have serious objections to).

This hopefully means a lot less ReflectorDebugging, although I am not sure about all the implications that this has.
