Configuration is user input too
Every programmer knows that input validation is important for good application behavior. If you aren’t validating the input, you will get… interesting behavior, to say the least.
What developers generally don't consider is that the system configuration is also user input, and it should be treated with the same suspicion. It can be something as obvious as a connection string that is malformed, or just plain wrong. Or it can be that the user has put executionFrequency="Monday 9AM" into the field. Typically, at least, those are fairly easy to discover and handle, although most systems behave fairly poorly when their configuration doesn't match their expectations: you typically get some error, usually with very few details, and frequently the system won't start, so you need to dig into the logs…
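To make that concrete, here is a minimal sketch (in C#, with hypothetical setting names) of what treating a configuration value as untrusted input might look like: parse it defensively and fail with a message that tells the admin what was expected.

```csharp
using System;
using System.Configuration;

public static class ConfigReader
{
    // Hypothetical example: the setting is supposed to be a timespan,
    // but nothing stops an admin from typing "Monday 9AM" into it.
    public static TimeSpan ReadExecutionFrequency()
    {
        var raw = ConfigurationManager.AppSettings["executionFrequency"];

        if (string.IsNullOrWhiteSpace(raw))
            throw new ConfigurationErrorsException(
                "executionFrequency is missing. Expected a positive timespan such as '01:00:00'.");

        if (!TimeSpan.TryParse(raw, out var frequency) || frequency <= TimeSpan.Zero)
            throw new ConfigurationErrorsException(
                $"executionFrequency = '{raw}' is not a valid positive timespan. " +
                "Expected something like '01:00:00' (run every hour).");

        return frequency;
    }
}
```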
Then there are the perfectly valid and legal configuration items, such as dataDirectory="H:\Databases\Northwind" when the H: drive is inaccessible for some reason (a SAN drive that was snatched away), or listenPort="1234" when the port is currently busy and cannot be used. Handling those gracefully is something that devs tend to forget, especially if this is an unattended application (a service, for example).
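Checking that the environment can actually honor a syntactically valid value is a separate step. A rough sketch (hypothetical names, not any specific product's API) might run checks like these at startup and fail with an actionable message instead of an obscure stack trace:

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;

public static class StartupChecks
{
    public static void VerifyDataDirectory(string dataDirectory)
    {
        // A SAN drive that was snatched away shows up here, not as a parse error.
        if (!Directory.Exists(dataDirectory))
            throw new InvalidOperationException(
                $"dataDirectory '{dataDirectory}' does not exist or is not accessible. " +
                "Check that the drive is mounted and that the service account has permissions.");
    }

    public static void VerifyListenPort(int listenPort)
    {
        try
        {
            // Try to bind once, just to find out early whether the port is usable.
            var listener = new TcpListener(IPAddress.Any, listenPort);
            listener.Start();
            listener.Stop();
        }
        catch (SocketException e)
        {
            throw new InvalidOperationException(
                $"listenPort {listenPort} is already in use or cannot be bound: {e.Message}", e);
        }
    }
}
```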
The problem is that we tend to assume that the people modifying the system configuration actually know what they are doing, because they are the system administrators. Unfortunately, that is not always the case, and I'm not talking just about issues like a wrong path or a busy port.
In a recent case, we had a sysadmin who called us about high memory usage when creating an index. As it turned out, they had set the minimum number of items to load from disk to 10,000, and they had large documents (in the MB range). This configuration meant that before we could index, we had to load 10,000 documents into memory (each of them about 1 MB on average), and only then could we start indexing. That means 10GB of documents to load before indexing even begins (and indexing has its own memory costs). That ended up pushing other data out of memory and slowed things down considerably, because each indexing batch had to be at least 10GB in size.
We also couldn't reduce the memory usage by reducing the batch size (as would normally be the case under memory pressure), because the minimum amount was set so high.
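The arithmetic here is simple enough that the system could have done it for the admin. A back-of-the-envelope sketch (hypothetical names, not our actual configuration API) that turns the minimum batch size and the observed average document size into a memory estimate and a warning:

```csharp
using System;

public static class BatchSizeSanity
{
    // 10,000 documents * ~1 MB each = ~10 GB that must be in memory before indexing starts.
    public static long EstimateMinBatchMemory(int minBatchSizeInDocs, long averageDocSizeInBytes)
    {
        return (long)minBatchSizeInDocs * averageDocSizeInBytes;
    }

    public static void WarnIfExcessive(
        int minBatchSizeInDocs, long averageDocSizeInBytes, long availableMemoryInBytes, Action<string> warn)
    {
        var estimate = EstimateMinBatchMemory(minBatchSizeInDocs, averageDocSizeInBytes);

        // The threshold of half the available memory is an arbitrary, assumed value.
        if (estimate > availableMemoryInBytes / 2)
            warn($"A minimum batch size of {minBatchSizeInDocs:N0} documents at ~" +
                 $"{averageDocSizeInBytes / (1024 * 1024)} MB each requires roughly " +
                 $"{estimate / (1024.0 * 1024 * 1024):N1} GB per indexing batch, " +
                 "which will push other data out of memory.");
    }
}
```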
In another case, a customer was experiencing a high I/O write rate. When we investigated, it turned out to be caused by a very high fan-out rate in their indexes. There is a limit on the number of fan-out entries per document, and it is there for a reason, but it is something that we allow the system admin to configure. They had disabled this limit and went on with very high fan-out rates, with the predictable issues as a result.
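One possible middle ground, sketched below with made-up names and thresholds (this is not how our fan-out limit is actually implemented): when the admin disables the hard limit, keep tracking fan-out anyway and surface the worst offenders, so the elevated write rate can at least be traced back to its cause.

```csharp
public class FanOutTracker
{
    // Assumed recommended threshold, purely for illustration.
    private const int RecommendedMaxFanOut = 1024;

    private int _maxSeen;
    private string _worstDocumentId = "";

    public void Record(string documentId, int fanOutEntries)
    {
        if (fanOutEntries <= _maxSeen)
            return;

        _maxSeen = fanOutEntries;
        _worstDocumentId = documentId;
    }

    public string BuildWarningOrNull()
    {
        if (_maxSeen <= RecommendedMaxFanOut)
            return null;

        return $"Document '{_worstDocumentId}' produced {_maxSeen:N0} index entries, " +
               $"well above the recommended {RecommendedMaxFanOut:N0}. " +
               "High fan-out is a likely cause of elevated I/O write rates.";
    }
}
```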
So now the problem is: what to do?
On the one hand, accepting administrator input about how to tune the system is pretty much required. On the other hand, to quote an ops guy I recently spoke to about a service his own company wrote: "How this works? I have no idea, I'm in operations, I just need it to work."
So the question is, how do you handle such a scenario? And no, writing warnings to the log won't do; no one reads those.
Comments
Well, it depends on what kind of application you have. If it is a windowed app, I would suggest that developers use tooltips for every option, with a detailed explanation of what that option does and maybe examples or valid ranges for it. If you have a service or an app running in the background, then I guess the only thing to do is to emit warnings/errors when the service/app starts, saying that your config will lead to high memory or I/O usage. I agree that warnings in logs are usually ignored.
We had pretty good experience offering a diagnostic tool for similar issues. The tool would consume logs and configuration parameters and raise warnings when values were out of the recommended ranges. Obviously, alerts and warnings need to be properly documented.
In our case it was for a multithreaded, event-driven framework which could be run in a diagnostic mode and output recommendations based on actual usage statistics. It was fed with well-known errors and abuses in order to guide people toward better usage.
The main impact of this type of tool was definitely fewer calls for help!
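For what it's worth, the rule-based core of such a diagnostic tool can be quite small. A hypothetical sketch (all names and thresholds invented for illustration):

```csharp
using System;
using System.Collections.Generic;

public class ConfigRule
{
    public string Key;
    public Func<double, bool> IsSuspicious;
    public string Recommendation;
}

public static class ConfigAnalyzer
{
    // Feed in the effective configuration plus a set of rules, get back warnings.
    public static IEnumerable<string> Analyze(
        IReadOnlyDictionary<string, double> effectiveConfig, IEnumerable<ConfigRule> rules)
    {
        foreach (var rule in rules)
        {
            if (effectiveConfig.TryGetValue(rule.Key, out var value) && rule.IsSuspicious(value))
                yield return $"{rule.Key} = {value}: {rule.Recommendation}";
        }
    }
}

// Example usage with made-up rules:
// var warnings = ConfigAnalyzer.Analyze(config, new[]
// {
//     new ConfigRule { Key = "minBatchSizeInDocs",
//                      IsSuspicious = v => v > 1024,
//                      Recommendation = "values above 1024 can require tens of GB per batch" },
// });
```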
I'd have to echo Cyrille's comment: this is a diagnostic situation, not a validation one. If I've got a setting that's out of the "normal" range for a good reason, I'm going to be very annoyed if every time I turn around I'm asked to confirm that "yes, I mean what I've said, just like the last 10,000 times before". Now, if you have a way of correlating issues to particular sets of configuration settings, being able to query for those as needed would be cool.
How about introducing some kind of "prediction/analysis" mode? It doesn't have to be accurate, just a basic guess like: we have seen that you want to set value XY to 10,000, and here is the potential impact it might have based on the historical load/usage we have seen so far.
I'm sure I read a good idea for solving this kind of communication problem ...
Yeah, here's the link: http://ayende.com/blog/3570/does-you-application-has-a-blog
Daniel, those values are configured in a config file, so we don't generally get involved when this is set. And it requires us to write specific code for each of those, which sucks. But that is the price for professional software.
TeamCity handles configuration warnings pretty smoothly, unobtrusively, and more visually than just adding warnings to the log: http://www.ralbu.com/Media/Default/Windows-Live-Writer/Azure-Mobile-Service-deployment-with-Te_1311A/p21_2.png
Orjan, I assume you meant those images here: http://www.ralbu.com/azure-mobile-service-deployment-with-team-city-part-2-configure-teamcity-nuget-server
Note that we already do that, for alerts like disk space, etc. But the problem with such things is that they tend to freak people out, and they generate a very harsh response.
In making such configuration changes, users will generally be trying to improve some aspect of their system's behavior. Thus, they should use some type of metric to measure whether the changes they have made are getting them closer to that goal, and other (regression) metrics to ensure there are no big unintended consequences. I.e., changing the configuration should be combined with explicit monitoring against a baseline.
If you could provide an explicit support workflow for such configuration changes, that includes (1) describing the goal of the change, (2) saving the previous configuration, (3) requiring the user to select the metrics to use to monitor the desired effects and unintended regressions, and (4) committing/rolling back the configuration change, this might help with the kinds of problems you've highlighted.
It would give you a "configuration changes" event stream with its associated metric values. It also forces the user to think a bit more before making the changes, and could allow them to quickly determine whether the change was a good idea. You could combine this with a "trial period" for the change, during which you send (e.g. by email) report(s) whenever there are big differences/trends in the monitored metrics.
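A bare-bones data model for that workflow might look something like the sketch below (names and structure are my own invention, not a proposal for any specific product): record the goal, the old and new configuration, the chosen metrics, and a trial period, so the change can be correlated with the metric stream and rolled back if needed.

```csharp
using System;
using System.Collections.Generic;

public class ConfigurationChange
{
    public string Goal;                      // (1) what the admin is trying to improve
    public string PreviousConfiguration;     // (2) saved for rollback
    public string NewConfiguration;
    public List<string> MetricsToWatch;      // (3) e.g. indexing latency, query latency, memory
    public DateTime AppliedAt;
    public TimeSpan TrialPeriod;             // report/compare against the baseline during this window

    public bool TrialExpired(DateTime now) => now > AppliedAt + TrialPeriod;

    // (4) rolling back is just re-applying the saved configuration.
    public string RollbackConfiguration() => PreviousConfiguration;
}
```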
Alex, this also turns any single configuration value into a multi-step, incredibly complex thing. Consider batch size: it impacts I/O, memory utilization, network utilization, etc. Which metric do you measure for each? How do you define success? And how much work is it now to add a configuration option?
Oren, these are exactly the types of questions an admin should ask **before** modifying the configuration. In the example you gave (increasing the indexing batch size), presumably they intended to improve indexing performance without a big impact on e.g. query time & latency, and thus they should monitor at least those metrics. Actually, in this specific example, using the metric of average document size, it could have been determined that the amount of memory required for indexing would be around 10 GB for this config change. Those are useful things to know. One might also wonder if specifying the batch size as a number of documents is very useful; specifying the maximum amount/percentage of available memory to use for indexing might be more useful.
Additionally, I fail to see how my suggestion makes configuration incredibly complex. A minimal solution would provide a text editor view for the config file, a selection list of metrics and an entry field for the monitoring period. Granted, this will then also require work to be done to be able to follow up on the change, such as having a view to inspect these metrics, and actually collecting them. Then the question becomes one of whether that is worth the development effort or whether there are higher priorities.
Alex, at last count, we had about a gazillion configuration options that impacted indexing speed. :-) We already monitor every facet of indexing that you can consider, from I/O rates to indexing speed to consumed memory. But tying all of this together to a particular change in a particular value is really hard.