Ask Ayende: What about the QA env?
Matthew Bonig asks, with regard to a bug in the RavenDB MVC Integration (RavenDB Profiler) that caused a major slowdown on this blog:
I'd be very curious to know how this code got published to a production environment without getting caught. I would have thought this problem would have occurred in any testing environment as well as it did here. Ayende, can you comment on where the process broke down and how such an obvious bug was able to slip through?
Well, the answer to that comes in two parts. The first part is that no process broke down. We use our own assets for the final testing of all our software, which means that whenever there is a stable RavenDB release pending (and sometimes just when we feel like it), we move our infrastructure to the latest and greatest.
Why?
Because as hard as you try testing, you will never be able to catch everything. Production is the final test ground, and we have obvious incentives of trying to make sure that everything works. It is dogfooding, basically. Except that if we get a lemon, that is a very public one.
It means that whenever we make a stable release, we can do so with a high degree of confidence that everything is going to work, not just because all the tests are passing, but because our production systems have had days to actually show whether things are right.
The second part of this answer is that this was neither an obvious bug nor one that was easy to catch. Put simply, things worked. There wasn't even an infinite loop that would make it obvious that something was wrong; there was just a lot of network traffic, which you would notice only if you either had a tracer running or were trying to figure out why the browser was suddenly so busy.
Here is a challenge: try to devise some form of automated test that would catch something like this error, but do so without actually testing for this specific issue. After all, it is unlikely that someone would have written a test for this unless they had run into the error in the first place. So I would be really interested in seeing what sort of automated approaches would have caught that.
More posts in "Ask Ayende" series:
- (28 Feb 2012) Aggregates and repositories
- (31 Jan 2012) What about the QA env?
- (25 Jan 2012) Handling filtering
- (19 Jan 2012) Life without repositories, are they worth living?
- (17 Jan 2012) Repository for abstracting multiple data sources?
Comments
I don't know if it's worth it or not, but you can test for this class of bug in a way that's not too dissimilar to how your profiler warns you about things like N+1 selects or other bad behavior.
From the perspective of a proxy between the browser and the server, there's a "normal" communication pattern and then there is an "abnormal" one. It would not be hard to imagine a scenario where this sort of traffic would be flagged as abnormal, causing a test to fail somewhere.
If I was shipping the Facebook Like button, I'd be doing tests like that. In this specific case, I kind of doubt the payoff is there.
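Here is a minimal sketch of that proxy idea: drive the page however you like, capture the request log with whatever proxy you already have (BrowserMob, Fiddler, mitmproxy, ...), and then assert that no URL is being hammered. The (timestamp, url) log format and the thresholds below are assumptions for illustration, not anything from the comment above.

```python
# Sketch: classify a captured request log as normal or abnormal.
# Assumes some proxy has already recorded (timestamp, url) pairs while a
# test script drove the page; the thresholds are invented for illustration.
from datetime import timedelta

def find_abnormal_traffic(requests, max_repeats=10, window=timedelta(seconds=30)):
    """requests: list of (timestamp, url) tuples in the order they were seen."""
    suspicious = []
    recent_by_url = {}
    for ts, url in requests:
        hits = recent_by_url.setdefault(url, [])
        hits.append(ts)
        # Keep only the hits inside the sliding window ending at this request.
        recent_by_url[url] = [t for t in hits if ts - t <= window]
        if len(recent_by_url[url]) > max_repeats:
            suspicious.append((url, len(recent_by_url[url])))
    return suspicious

# In the test itself, fail on anything that looks like a feedback loop:
# assert not find_abnormal_traffic(captured), "abnormal request pattern detected"
```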
Something we are planning in the near future is to run a crawler bot (potentially using Selenium) on our staging environment, and then compare the current run's performance (from the crawler and from monitoring) against historical data, to see if there are any significant changes.
Also, IIRC people doing Continuous Deployment use monitoring to trigger rollbacks to prior versions on performance degradation of the production system.
Hard one really. The only thing I could think of would be some kind of record/replay regression testing software, which records all traffic between client and server whilst simulating some user actions and verifies there are no unexpected requests.
Trouble is, these kinds of tests tend to be really fragile, so they would be breaking every time you made a change, and you might still miss it.
Shawn, how would you try capturing something like that? Is this something that would be easy or obvious to capture without actually thinking about this scenario up front?
Peter, the problem is that there are many variables that can affect client performance, so you have to allow for a big fudge factor, and that means things can easily slip through. And monitoring will generally not tell you that there is a problem client side.
Neil, exactly. This sort of test is incredibly fragile, and you really don't want to try to write those. You would have failing tests left & right.
Yes. I've seen teams get into a LOT of trouble having to maintain hundreds of those kinds of tests. Still, maybe having 5 or 10 high-level ones might catch this, not sure.
Ayende,
while I have no experience/data to back this up, this is one of the reasons we are considering using Selenium to do the crawling, so we get closer to the real client experience. And by monitoring I don't just mean occasional GET requests to the server to see whether it's alive, but the whole network infrastructure (network traffic data would be useful for this case). And of course, we are monitoring application metrics and plan to expose the metrics gathered from the crawler to the monitoring service, to be able to see trends easily.
I will blog about it once we've got around to implementing it, though that's not in the near future (the pain is not big enough yet).
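For what it's worth, here is a rough sketch of that crawl-and-compare idea, assuming Selenium's Python bindings and a JSON file of per-page baseline timings; the staging URLs, the baseline file name, and the 1.5x tolerance are all placeholders:

```python
# Sketch: crawl a few pages with Selenium and compare browser-side load
# times against a stored baseline. Pages, baseline file and tolerance are
# made up for illustration.
import json
from selenium import webdriver

PAGES = ["http://staging.example.com/", "http://staging.example.com/archive"]

def page_load_ms(driver, url):
    driver.get(url)
    # Navigation Timing gives the browser's view of the full page load.
    return driver.execute_script(
        "var t = window.performance.timing;"
        "return t.loadEventEnd - t.navigationStart;")

def run():
    baseline = json.load(open("baseline.json"))  # e.g. {"http://...": 850}
    driver = webdriver.Firefox()
    try:
        for url in PAGES:
            elapsed = page_load_ms(driver, url)
            allowed = baseline.get(url, float("inf")) * 1.5
            assert elapsed <= allowed, "%s took %sms (baseline %s)" % (
                url, elapsed, baseline.get(url))
    finally:
        driver.quit()

if __name__ == "__main__":
    run()
```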
If you had set up automated UI tests and they explicitly waited until all ajax calls for loading the page had finished, you'd notice that they were waiting for an unreasonably long time (that is, forever).
Waiting for all the ajax loading to finish is reasonable if your page makes use of it for standard content. This blog seems relatively static, so that's a step I probably would have neglected.
And that's assuming you have automated tests in the first place...
If you're testing manually, it's a question of whether you happen to have Firebug open at the time, taking up enough space for you to notice the fishy ajax requests.
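For completeness, a sketch of that "wait until the ajax settles" check with Selenium, assuming the page uses jQuery (so jQuery.active reflects in-flight requests); the timeout and settle window are arbitrary:

```python
# Sketch: fail a UI test if ajax activity never settles down.
# Assumes the page uses jQuery, so jQuery.active counts in-flight requests.
import time
from selenium import webdriver

def wait_for_ajax_idle(driver, timeout=10, settle=2):
    """Return once no ajax call has been active for `settle` seconds."""
    deadline = time.time() + timeout
    quiet_since = None
    while time.time() < deadline:
        active = driver.execute_script("return window.jQuery ? jQuery.active : 0")
        if active == 0:
            quiet_since = quiet_since or time.time()
            if time.time() - quiet_since >= settle:
                return
        else:
            quiet_since = None   # traffic again; restart the quiet period
        time.sleep(0.25)
    raise AssertionError("ajax never went idle; possible request loop")

driver = webdriver.Firefox()
driver.get("http://staging.example.com/")
wait_for_ajax_idle(driver)
driver.quit()
```

A page stuck in a request loop, like the profiler bug here, would keep resetting the quiet period and eventually trip the timeout.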
When I posted the original comment I thought this was something that could have been caught with just one client hitting a test environment. After reading the OP more carefully, I can see that this probably only happened because of the load that a production environment imposes.
Assuming my understanding is correct, that leads me to my next question:
Do you do stress testing?
On the whole testing-with-Selenium angle -- check out BrowserMob. It does Selenium testing, but out in the cloud, so you can get a lot closer to real traffic. It has definitely shown us where our apps fall down.
Is it related to that annoying debugger/profiler panel that flashes before the pages load?
Definitely agree with the notion that there's little point adding a specific test case for this specific fault case 'after the horse has bolted'. But there is a way to test for problems like this in general, which can be worthwhile:
The reported issue was that the blog was responding slowly. The cause was the repeated requests.
This implies that response speed is part of your acceptance criteria for the software. If that's the case, it's worth having some automated perf tests which would then catch problems with performance, and specifically response time. (Of course, coming up with the NFRs for that is always 'interesting'...)
So, you could introduce some very basic system load tests that simulated some number of users doing some set of actions, and checking that performance is as desired.
Then, no matter what performance fault is introduced that would affect overall system page response (be it bad AJAX handling, dodgy DB indexes, or anything else), it would be detected before going into production. Obviously there are always some bits of production you can't roll into testing, but you can mock at least some of those out.
Basically:
- If you only do functional or performance tests against bits of a system before production, you're open to whole-system functional or performance problems in production.
- If you only do functional, and not performance, tests against the full system before production, some performance problems will only show up in production.
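A bare-bones sketch of the kind of load check being suggested: N simulated users hitting a page and asserting on response time. The URL, user count, and time budget are placeholders that real NFRs would have to supply (and, as the reply below points out, a server-side check like this would not have caught this particular client-side bug):

```python
# Sketch: N simulated users fetch a page concurrently; fail the test if
# the slowest response blows the budget. All numbers are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL, USERS, BUDGET_SECONDS = "http://staging.example.com/", 20, 2.0

def one_request(_):
    start = time.time()
    urllib.request.urlopen(URL).read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=USERS) as pool:
    timings = list(pool.map(one_request, range(USERS)))

worst = max(timings)
assert worst <= BUDGET_SECONDS, (
    "slowest response %.2fs exceeds %.2fs budget" % (worst, BUDGET_SECONDS))
```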
Royston, it ain't that simple. The response speed was absolutely fine. You would need a full-fledged browser, along with several minutes of idle time, to actually discover that anything was wrong.
If "caused major slow down on this blog" didn't impact the response speed, what did it impact that was eventually detectable as a problem? Whatever that is, that's your criteria that needs to be tested before production.
Browsers are part of the overall system (and have certainly been known to have the occasional functional or performance foible!), as is the Javascript code that lives in those pages. If you're not running system tests involving real browsers, I'm not sure I'd call them full system tests.
Of course, the only way to get true real-world fault detection is to run your system in the real world. At some point there's a crossover between the cost of simulating the real-world sufficiently at test time, and the cost of having a bug in production. So yeah, it's all _doable_. But whether it's worthwhile is entirely down to where that cost crossover lies for you...
Royston, it caused a slowdown on the _client's machine_. On the server, everything was good. Full system tests that include browser code are slow, fragile, hard to work with, and generally a mess. I'd much rather have a staging env to do those sorts of things in. Which is why we dogfood stuff at this blog.
Ayende, I've run across this type of issue several times under positive feedback loops like the one you mentioned (event handler triggers itself) or negative feedback loops from misconfigured resource pools (thundering herd where retry mechanisms don't back off and keep you in a failed state indefinitely).
A generic way to test for this concept is to use a "quiesced assertion" (a made-up term). You would need to have usage counters at various points in your system (number of HTTP requests, number of times a function is called, whatever is relevant to your system) and also make them externally measurable via some API.
At the end of a regression suite or load test, once you expect the system to be idle, you can verify that specific usage counters are staying constant and that your system is not in a feedback loop.
I've used this approach for testing as described above or in monitoring to alert when usage counters exceed some specified rate.
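A sketch of that quiesced assertion, assuming the counters are exposed over a hypothetical /metrics endpoint that returns JSON; the counter names and settle time are made up:

```python
# Sketch of a "quiesced assertion": after a test run, wait for the system
# to go idle and verify the usage counters have stopped moving.
# The /metrics endpoint and counter names are hypothetical.
import json
import time
import urllib.request

METRICS_URL = "http://staging.example.com/metrics"  # returns {"http_requests": 1234, ...}

def read_counters():
    with urllib.request.urlopen(METRICS_URL) as resp:
        return json.loads(resp.read())

def assert_quiesced(settle_seconds=30, watched=("http_requests",)):
    """Call at the very end of a regression suite or load test."""
    before = read_counters()
    time.sleep(settle_seconds)      # nothing should be touching the system now
    after = read_counters()
    for name in watched:
        delta = after[name] - before[name]
        assert delta == 0, (
            "%s grew by %d while idle: possible feedback loop" % (name, delta))
```

The same counters can feed the monitoring alerts mentioned above, firing when a rate exceeds its expected ceiling.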
What you're saying sounds identical to Microsoft's strategy. Release now, fix later, regardless of how buggy the system is. Just do some basic testing and then release; let the customers experience the bugs, report them, and then improve based on the feedback.
I don't think it's the best strategy, but sadly, it is the only strategy, especially when the product becomes very large.
Fadi, I would strongly disagree. There is a LOT of difference between pushing unreleased software to this blog for testing in a live prod env and sending this to customers.