Production Test Run: Too much of a good thing isn’t so good for you
Not all of our testing happened in a production setting. One of our test clusters was running a simple loop of writes, reads, and queries on all the nodes in the cluster while we intentionally destabilized the system.
After about a week of this we learned two things: the system held up, with no memory leaks or increased resource usage, but the size of the data on disk was about three orders of magnitude larger than it should have been.
Investigating this, we discovered that the test process introduced conflicts because it repeatedly wrote the same set of documents to each of the nodes. We resolve such conflicts automatically, but we also keep the conflicted copies around so users can figure out what happened to their system. In this particular scenario, that left us with a lot of conflicted revisions, and it was initially hard to figure out what was taking up all that space.
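To make the failure mode concrete, here is a minimal sketch of the kind of loop the test was running, assuming the RavenDB .NET client; the node URLs, database name, and Item class are made up for illustration:

```csharp
using System;
using Raven.Client.Documents;

public class Item
{
    public string Name { get; set; }
}

public static class ConflictLoop
{
    public static void Main()
    {
        // Hypothetical node URLs; one store per node so that each write
        // lands on a specific node instead of being load-balanced.
        var nodeUrls = new[] { "http://node-a:8080", "http://node-b:8080", "http://node-c:8080" };

        foreach (var url in nodeUrls)
        {
            using (var store = new DocumentStore { Urls = new[] { url }, Database = "test" })
            {
                store.Initialize();
                using (var session = store.OpenSession())
                {
                    // Writing the same id on every node creates concurrent
                    // versions that replication later has to reconcile.
                    session.Store(new Item { Name = $"written via {url}" }, "items/1");
                    session.SaveChanges();
                }
            }
        }
    }
}
```

Because each node accepts its own version of items/1 before replication catches up, the cluster ends up with conflicting versions that have to be reconciled, and in our case retained as conflicted revisions.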
In our production system, we also discovered that we log too much. One of the things we wanted to learn from this production test run was what kind of information we could get from the logs, and whether the details there are actionable. Part of that was seeing whether we could troubleshoot an issue using the logs alone, and adding the missing details whenever there was something we couldn't figure out from them.
We also discovered that under load, we would log a lot. In particular, we had log entries detailing every indexed document and every replicated item. These are almost never useful, but they generate a lot of noise whenever we lower the log level, so they went away as well. We are very focused on log usability: it should be possible to understand what is going on, and why, without drowning in minutiae.
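As an illustration of the kind of change involved, here is a minimal sketch, assuming NLog as the logging framework; the IndexingWorker call site is hypothetical:

```csharp
using NLog;

public class IndexingWorker
{
    private static readonly Logger Log = LogManager.GetCurrentClassLogger();

    public void OnDocumentIndexed(string docId)
    {
        // Per-document messages are demoted to Trace, so they only show up
        // when someone explicitly opts into the most verbose level; the
        // IsTraceEnabled guard also skips the formatting cost otherwise.
        if (Log.IsTraceEnabled)
            Log.Trace("Indexed document {0}", docId);
    }
}
```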
More posts in "Production Test Run" series:
- (26 Jan 2018) Overburdened and under provisioned
- (24 Jan 2018) The self flagellating server
- (22 Jan 2018) Rowhammer in Voron
- (18 Jan 2018) When your software is configured by a monkey
- (17 Jan 2018) Too much of a good thing isn’t so good for you
- (16 Jan 2018) The worst is yet to come
Comments
I wonder if NLog, for example, could have a function to record all log messages associated with an HTTP request being handled and decide what to do with them at the end of the request. If no error occurred, throw them away, leaving just the most important ones; if any error happened in that request, log everything.
Rafal, At least at some point, log4net had a feature like that: buffering a certain amount of messages and only flushing them if there is an error.
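For reference, a minimal sketch of that log4net setup, configured programmatically; the file name and buffer size are arbitrary:

```csharp
using log4net.Appender;
using log4net.Config;
using log4net.Core;
using log4net.Layout;

public static class LossyLogSetup
{
    public static void Configure()
    {
        var layout = new PatternLayout("%date %-5level %logger - %message%newline");
        layout.ActivateOptions();

        // Hypothetical target file for the flushed events.
        var file = new FileAppender { File = "requests.log", Layout = layout };
        file.ActivateOptions();

        // Buffer up to 512 events and discard them (Lossy) unless an
        // Error-level event arrives, in which case the whole buffer is
        // flushed, preserving full context for the failed request only.
        var buffering = new BufferingForwardingAppender
        {
            BufferSize = 512,
            Lossy = true,
            Evaluator = new LevelEvaluator(Level.Error)
        };
        buffering.AddAppender(file);
        buffering.ActivateOptions();

        BasicConfigurator.Configure(buffering);
    }
}
```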
What an interesting concept, testing while in production and purposefully destabilizing your own system. I think you guys are brave to try that. Even on a simple PHP web application, a production test at scale could cause a lot of backend issues. I'm really glad you were able to uncover so many insights and can lower the amount of logging that you output.
Rafal, I think .NET Core has such a feature, called DiagnosticSource. I have not used it, but it is described here: https://github.com/dotnet/corefx/blob/master/src/System.Diagnostics.DiagnosticSource/src/DiagnosticSourceUsersGuide.md