What kind of problems you’ll find only when you are dog fooding
Just minutes after I published the previous post about dog fooding and the issues it brings, we ran into major trouble. Several issues, actually.
The first was that we started getting authentication errors, even though the server upgrade hadn't touched any security related configuration. So what was going on? I tried logging in from my own machine, using the same credentials as the server, and everything worked! I then tried replacing the credentials with ones I knew were good, because another system was using them with no issues, and that didn't work either.
Looking in Fiddler, I saw no 401 responses either. Very strange.
This ended up being an issue with lazy requests. Basically, a lazy request is one that carries multiple requests in a single round trip to the server. We changed how we handle those internally, so to the rest of the code they look pretty much like any other request, but it turned out that we were also forcing them to go through authorization again, and obviously that didn't work. Once we knew what was going on, fixing this was very easy.
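To make that concrete, here is a minimal sketch of the idea in Python. The names and structure are mine, not RavenDB's; the point is just that the outer lazy request is authorized once, and the inner requests it carries need to be flagged as already authenticated so they skip a second authorization pass.

```python
# Hypothetical sketch of the lazy request fix; none of these names are RavenDB's.
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    user: str | None = None
    already_authenticated: bool = False  # set once the outer request passes auth

def authorize(request: Request) -> bool:
    return request.user is not None  # stand-in for the real credential check

def handle(request: Request) -> None:
    if not request.already_authenticated and not authorize(request):
        raise PermissionError(f"401 for {request.path}")
    print(f"served {request.path}")

def handle_lazy(outer: Request, inner_paths: list[str]) -> None:
    handle(outer)  # authorized normally, exactly once
    for path in inner_paths:
        # The bug: inner requests built *without* the flag were forced
        # through authorization again, and failed.
        handle(Request(path, user=outer.user, already_authenticated=True))

handle_lazy(Request("/lazy", user="admin"), ["/docs/1", "/docs/2"])
```

That would presumably also explain the missing 401s in Fiddler: the inner requests never travel as separate HTTP requests, so their authorization failures never show up as 401 responses on the wire.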
The next issue was a little crazier. Some things were failing, and we weren't sure why. We got an illegal duplicate key error from somewhere, but it made no sense. Worse, it appeared to be happening in random places. It took me a while to figure out the root cause. In RavenDB 2.5, we tracked indexes by the index name. In RavenDB 3.0, we track indexes by numeric ids. However, the conversion process from 2.5 to 3.0 didn't take that into account: while it gave the existing indexes ids, it didn't set the next index id to the correct value. When we then tried to create a new index, it would be handed an id that already existed, and that failed. The error could crop up when you ran a dynamic query that had to create a new index, so that was kind of funky.
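Here is a hypothetical sketch of the conversion bug, again with invented names. The migration assigns numeric ids to the existing indexes but never advances the next-id counter, so the first new index (for example, one created by a dynamic query) collides with an existing id.

```python
# Illustrative only; not RavenDB's actual migration code.
class IndexStore:
    def __init__(self) -> None:
        self.indexes: dict[int, str] = {}  # index id -> index name
        self.next_id = 0

    def create_index(self, name: str) -> int:
        index_id = self.next_id
        if index_id in self.indexes:
            raise KeyError(f"illegal duplicate key: index id {index_id}")
        self.indexes[index_id] = name
        self.next_id += 1
        return index_id

def convert_from_2_5(store: IndexStore, old_index_names: list[str]) -> None:
    for index_id, name in enumerate(old_index_names):
        store.indexes[index_id] = name
    # The missing step that caused the bug: without this line, next_id stays 0
    # and the next create_index() call hands out an id that already exists.
    store.next_id = max(store.indexes, default=-1) + 1

store = IndexStore()
convert_from_2_5(store, ["Users/ByName", "Orders/ByDate"])
store.create_index("Auto/Products/ByPrice")  # safe only because next_id was fixed
```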
The last error, however, is something that I consider to be purely evil. RavenDB 3.0 added a limitation that an index cannot output more than a set number of index entries per document. The idea is to stop the fan out problem: an index with high fan out can consume a lot of memory and other resources without any control.
We wanted to stop that explicitly, early on. But what about existing indexes? We added a way to increase the limit explicitly, but old indexes obviously wouldn't have this option set. The problem is that the limit is only triggered during indexing, so you start the application and everything works. Then indexing starts to fail. Which is fine, except that another RavenDB feature then kicks in: if an index accumulates too many errors, it is marked as failed. And failed indexes throw when queried.
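A rough sketch of how those two features interact, with thresholds I made up for illustration: every over-limit document counts as an indexing error, and once the error count crosses the failure threshold, the index is marked as failed and every query against it throws.

```python
# Hypothetical numbers and names; only the mechanism mirrors the post.
MAX_FANOUT = 15    # assumed per-document entry limit
MAX_ERRORS = 100   # assumed error count before an index is marked failed

class Index:
    def __init__(self) -> None:
        self.error_count = 0
        self.failed = False

    def index_document(self, doc_id: str, entries: list) -> None:
        if len(entries) > MAX_FANOUT:
            self.error_count += 1
            if self.error_count > MAX_ERRORS:
                self.failed = True  # from here on, every query throws
            raise ValueError(f"{doc_id}: fan out {len(entries)} > {MAX_FANOUT}")

    def query(self) -> list:
        if self.failed:
            raise RuntimeError("index is marked as failed")
        return []

index = Index()
for i in range(MAX_ERRORS + 1):
    try:
        index.index_document(f"orders/{i}", list(range(40)))
    except ValueError:
        pass  # errors accumulate quietly in the background indexing
index.query()  # hours or days later: RuntimeError, the index has failed
```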
The reason this is evil is that it takes quite a bit of time for the error to surface. So you run your tests, and everything works, and a few hours or days later, everything crashes.
All of those issues have been resolved. I found that the fix for the index id issue had actually been in place, but I appear to have removed it, without noticing, during an unrelated fix for the previous problem. The lazy requests now know that they are already authenticated, and the maximum fan out when loading from an existing system is 32K, which is big enough for pretty much anything you can think of. The behavior when you exceed the max fan out is also more consistent: the offending document is skipped, and only if you have a lot of them will the index actually be disabled.
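Sketching the fixed behavior under the same caveats (hypothetical names; only the 32K figure comes from the post): an over-limit document is now simply skipped, and only a large number of skipped documents disables the index.

```python
# Illustrative sketch of the fixed, more consistent behavior.
UPGRADE_MAX_FANOUT = 32 * 1024  # limit applied when loading an existing system
MAX_SKIPPED = 100               # assumed threshold before the index is disabled

def index_document(doc_id: str, entries: list, state: dict) -> list:
    if len(entries) > UPGRADE_MAX_FANOUT:
        state["skipped"] += 1
        state["disabled"] = state["skipped"] > MAX_SKIPPED
        return []  # skip just this document; the index keeps running
    return entries

state = {"skipped": 0, "disabled": False}
index_document("orders/1", list(range(40_000)), state)  # skipped, not fatal
```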
And yes, pushing things to ayende.com and seeing what breaks is a pretty good testing strategy, egg on face and all.
Comments
No more dog pictures please
Rafal, Yes, I would appreciate it as well, I hope that I won't need to either :-)
Great posts! They are very interesting, but somehow they scare me. If I had a production system with RavenDB I would wait at least till 3.1.
You wrote last week: "we are going to go live on our own systems with RavenDB 3.0. And shortly after that we'll do a release candidate, followed by the actual release." I don't know what you mean by "shortly", but to me it looks as if you'd need some time for fixing all these things (and much more if you want to test everything with Voron).
Please, don't get me wrong. I've been working in corporate environments for 20 years and I am probably the "cautious guy".
As a programmer I love the ideas and concepts behind RavenDB.
Raul, All of the things that we are speaking about here have already been fixed. Note that most of these are basically just upgrade issues that we hadn't taken into account yet, which is specifically why we test those on this blog.
When I said "time for fixing all these things" I should have written "all these kinds of things".
Usually things are easy to fix once you know about them, but first they must be found. Your dog fooding is showing the kind of problems that arise when you test with production systems.
We upgraded our own product last year, and even though we tested a lot, we got some crazy bugs from data / system combinations we hadn't thought about.
I've been following your blog and RavenDB for months now. Both impress me and I would love to use RavenDB one day for a project.
Raul, Sure, that is the rub, isn't it? Fixing the things that you don't know about. That is why dog fooding is so important, and running on a live system has been very good for us in terms of finding all sorts of interesting stuff. Next, we are going to be doing a lot more stability & performance tests.
Regarding Fiddler with NTLM ... http://stefsewell.com/2014/06/18/the-case-of-the-fiddler-heisenbug/