Ayende @ Rahien

Refunds available at head office

Your queries are going to fail in production

Reading Davy Brion’s post There Is No Excuse For Failing Queries In Production had me twitching slightly. The premise is that since you have tests for your queries, they are not going to fail unless some idiot goes to the database and mess things up,and when that is going to happen, you are going to trip a circuit breaker and fail fast. Davy’s post is a response to a comment thread on another post:

Bertrand Le Roy: Would it make sense for the query to implement IDisposable so that you can put it in a using block? I would be a little concerned about the query failing and never getting disposed of.
Davy Brion: well, if a query fails the underlying SqlCommand object will eventually be garbage collected so i wouldn’t really worry about that. Failed queries shouldn’t happen frequently anyway IMO because they typically are the result of incorrect code.

The problem with Davy’s reasoning is that I can think of several very easy ways to get queries to fail in production even if you go through testing and put your DB under lock & key. Probably the easiest problems to envision are timeouts and transaction deadlocks. Imagine the following scenario, your application is running in production, and the load is big enough (and the request pattern is such that) transaction deadlocks begin to happen. Deadlocks in the DB are not errors. Well, not in the usual sense of the word. They are transient and should be dealt separately.

Using Davy’s approach, we would either have the circuit breaker tripping because of a perfectly normal occurrence or we may have a connection leak because we are assuming that all failures are either catastrophic or avoidable.

Comments

Nima
08/31/2009 08:51 AM by
Nima

It might be irrelevant to your post but since you have mentioned deadlocks , I really like to know what's your advise (as a NHibernate professional) to prevent them ? As a matter of fact I've been working on a projects for 4 years and we are using NHibernate. In rare cases we are getting deadlocks .we have tried to trace them in SQL server logs but we haven't had any success to get rid of them.

Ayende Rahien
08/31/2009 09:09 AM by
Ayende Rahien

Nima,

There aren't really ways of getting rid of deadlocks.

They are what happens when you have two transactions competing for the same resources.

In theory, you can try to restructure your queries so you will always access rows using the same path all the time, thus eliminating deadlocks.

In practice, it is not really possible to do something like that.

You can reduce deadlocks by specifying a lower transaction isolation level, but preventing them isn't likely in real world scenarios.

Luckily, recovering from them is very simple

Steve Degosserie
08/31/2009 11:06 AM by
Steve Degosserie

All true. It also makes me think about the common assumption most developers have about Transaction Commits that generally don't fail. There are also a number of cases where this can happen (e.g. deferrable constraints in Oracle) but is usually not well handled.

Basically, to develop robust software, you must be paranoid :) (re-re-read M. Nygard's excellent Release It !).

Niraj
08/31/2009 11:35 AM by
Niraj

Ayende, can you elaborate on recovering aspects? Timeouts ???

Alex Yakunin
09/01/2009 08:14 AM by
Alex Yakunin

Fully agree. Failing query in production is normal. Deadlocks, version conflicts & timeouts are typical cases. Their ignorance = lack of actual understanding of processes @ RDBMS.

Alex Yakunin
09/01/2009 08:24 AM by
Alex Yakunin

Or simply a mistake, of course. I hardly believe Davy isn't aware about this.

Dennis
09/01/2009 08:37 AM by
Dennis

Would be nice to just be able to replay the transaction. But SQL server has so many bugs related to its transaction handling, that you never quite know when your transaction is broken.

And in usual style, they pretty much ignore whatever bug reports they get.

Ayende Rahien
09/01/2009 08:39 AM by
Ayende Rahien

Dennis,

Do you have a repro that shows that transaction handling is broken on SQL Server?

Ayende Rahien
09/01/2009 09:01 AM by
Ayende Rahien

Interesting, but while this is a bug, it doesn't relate to what I consider the most important thing about transactions: ACID

You don't have data corruption here, you have a transaction that mysteriously dies (and rollback).

As long as you can trust ACID, you can replay transactions

Dennis
09/01/2009 09:04 AM by
Dennis

Yes and No... Since the transaction is stopped without your knowledge and no error handling is done, then all subsequent writes are still written.

Meaning that you cannot just assume that none of your work was completed, and can thus not repeat the transaction.

btw. The "Remember me?" keeps forgetting to remember itself, thus forgetting me the second time a post is made.

Ayende Rahien
09/01/2009 09:08 AM by
Ayende Rahien

The remember me shouldn't be on a second time, you are already remember. IOW, it shouldn't forget you.

And I fully agree about the problematic nature of silent aborts.

Have you tried contacting PSS?

Dennis
09/01/2009 09:12 AM by
Dennis

No. They are a waste of time also...

Last time they spent weeks finding out why our failover setup was failing. Just to have us find out that they dont really support failover unless it is in a domain setup.

So in usual style, I implemented a workaround, which in this particular case actually ended up much cleaner.

Dennis
09/01/2009 09:14 AM by
Dennis

I did remember me... Not quite sure why I am forgotten every second time I have something useful to say. Perhaps the timeout is too low?

Just checked, yes indeed just 1 month timeout on the cookie.

Paul Robson
09/01/2009 11:50 AM by
Paul Robson

Ayende, when you mention "replaying the transaction" above, do you mean throw away the session and restart the whole unit of work from the beginning?

Ayende Rahien
09/01/2009 11:55 AM by
Ayende Rahien

Paul,

Yes, you have to. You can't just execute the same statements, there is logic that may be affected

Roman
09/01/2009 01:37 PM by
Roman

The most common case of failing queries in production is IMHO a database update with feature change. This happens often especially with MySQL (even for minor version changes), but we've seen it with other DBMS too, e.g. PostgreSQL.

Mike Brown
09/01/2009 02:50 PM by
Mike Brown

Something was striking me as odd about his assertions there, and even if you have the most stringent of testing, some environments (e.g. FDA regulated) still require you to place checks in your code to handle scenarios that "are never supposed to happen"

Comments have been closed on this topic.