On Infinite Scalability
Udi Dahan posted about the myth of infinite scalability. It is a good post, and I recommend reading it. I have my own 2 cents to add, though.
YAGNI!
When building a system, I always assume an order of magnitude increase in the work the system has to do (usually, we are talking about # of requests per time period).
In other words, if we have 50,000 users, with roughly 5,000 of them active during work hours, we typically see 100,000 requests per hour. My job is to make sure that the system can handle up to 1,000,000 requests per hour. (Just to give you the numbers, we are talking about moving from ~30 requests per second to ~300 requests per second.)
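To make that arithmetic concrete, here is a quick back-of-the-envelope check (a minimal sketch in Python; the 20 requests per active user per hour is implied by the figures above, not something measured):

```python
# Back-of-the-envelope check of the numbers above: 100,000 requests/hour
# across 5,000 active users implies ~20 requests per active user per hour.
active_users = 5_000
requests_per_user_per_hour = 20  # implied by the post's figures

current_per_hour = active_users * requests_per_user_per_hour  # 100,000
target_per_hour = current_per_hour * 10                       # 1,000,000 (the buffer)

print(f"current: {current_per_hour:,}/hour = ~{current_per_hour / 3600:.0f}/sec")
print(f"target:  {target_per_hour:,}/hour = ~{target_per_hour / 3600:.0f}/sec")
# current: 100,000/hour = ~28/sec
# target:  1,000,000/hour = ~278/sec  (rounded to ~30 and ~300 above)
```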
Why do we need that buffer?
There are several reasons to want that. First, we assume that the system is going to be in production for a relatively long time, so the # of users or their activity is going to grow. Second, in most systems we are usually talking about some % of the users being active, but there are times when significantly more users are active. For example, at the end of the tax year, an accounting system can be expected to see much greater utilization, and the same at the end of every month.
And within that buffer, we don't need to change the system architecture; we just need to add more resources to the system.
And why a full order of magnitude?
Put simply, because it is relatively easy to get there. Let us assume the end of the year again, when we have 15,000 active users (vs. the "normal" 5,000). But the amount of work we do isn't directly related to the number of users; it is related to how active they are and what operations they are doing. In such a scenario, the users are also likely to be more active, stressing the system even more.
Finally, there is another aspect. Usually, moving a single order of magnitude up is no problem. In fact, I feel comfortable running on a single server (maybe replicated for HA) from 1 request per second to 50 or so. That is 3,600 requests per hour to 180,000 requests per hour. But beyond that, we are going to start talking about multiple servers. We can handle a million requests per hour on a web farm without much trouble, but moving to the next level is likely to require more changes, and when you get beyond that, you require more changes still.
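To illustrate those tiers, here is a rough sketch; the thresholds are just the ballpark figures from this post, not hard limits:

```python
# Rough mapping from load to the architectural tier described above.
# Thresholds are this post's ballpark figures, not hard limits.
def rough_tier(requests_per_second: float) -> str:
    if requests_per_second <= 50:    # up to ~180,000 requests/hour
        return "single server (maybe replicated for HA)"
    if requests_per_second <= 300:   # around ~1,000,000 requests/hour
        return "web farm"
    return "bigger architectural changes (see below)"

for rps in (1, 30, 300, 3_000):
    print(f"{rps:>5} req/sec (~{rps * 3600:,}/hour): {rough_tier(rps)}")
```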
Note, those changes are not changes to the code, they are architectural and system changes. Where before you had a single database, now you have many. Where before you could use ACID, now you have to use BASE. You need to push a lot more tasks to the background, the user interaction changes, etc.
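For example, "pushing tasks to the background" usually means the request handler stops doing the work inline and just enqueues it for a worker. A minimal sketch, using an in-process queue as a hypothetical stand-in for a real message broker (the `send_invoice` task is made up for illustration):

```python
import queue
import threading

# Hypothetical stand-in for a real broker (RabbitMQ, SQS, etc.).
work_queue: "queue.Queue[dict]" = queue.Queue()

def handle_request(order_id: int) -> str:
    # Before: generate and send the invoice inline, inside an ACID
    # transaction, while the user waits. After: enqueue and return.
    work_queue.put({"task": "send_invoice", "order_id": order_id})
    return "accepted"  # the UI now says "processing", not "done"

def worker() -> None:
    # Background worker drains the queue; the work happens eventually (BASE).
    while True:
        job = work_queue.get()
        print(f"processing {job['task']} for order #{job['order_id']}")
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
print(handle_request(42))
work_queue.join()
```

Note how the user interaction change falls out of this: the UI acknowledges the request instead of confirming the result.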
Of course, you can probably start your application design with 10,000,000 requests per hour in mind. That is big enough that you can move to a hundred million requests per hour easily enough. But, and this is important, is this something that is valuable? Is this really something that you can justify to the guys writing the checks?
Scalability is a cost function, as Udi has mentioned. And you need to be aware that this cost is incurred both during development and during production. If your expected usage can be handled (including the buffer) with a simpler architecture, go for it. If you grow beyond an order of magnitude, three things will happen:
- You will have the money to deploy a new system for the new load.
- You will have a better idea of the actual usage, rather than just guessing about the future.
- You are going to change how the system works anyway, from the UI to the internal workings.
The last one is important. I am guessing that many people would say “but we can do it right the first time”. And the problem is that you really can’t. You don’t know what users will do, what they will like, etc. Whatever you are guessing now, you are bound to be wrong. I have had that happen to me on multiple systems.
More than that, remember that time to market is a big feature as well.
Comments
Pretty sure the comments for this post are from a different post...
@Igor: That depends on which post you look at. When looking at http://ayende.com/blog/152801/windows-7-conflict-resolution-dialog-how-i-hate-thee your comment is out of place. :-)
Yes, it was, fixed now
Excellent points Ayende!
You didn't address topics like new functionality that significantly changes the scalability profile of the system. For example, from the time that one user performs action X, all other users need to see it within 2 seconds.
See my post on Non-Functional Architectural Woes for some more examples: http://www.udidahan.com/2010/01/12/non-functional-architectural-woes/
I do agree that you can't predict all this stuff up-front, therefore it is important to communicate that there is no such thing as a system that is infinitely scalable - there will always be some feature which will require a full system rewrite/redesign/rearchitecture.
Udi, I don't see it having any effect on the actual post, though. Having a new requirement like you said is trivial if you stay within the order of magnitude that you have started with, in most cases. It is only when you get to really high scale that those situations start to crop up, and then you already have a set of solutions for that.
Ayende, "Having a new requirement like you said is trivial if you stay within the order of magnitude that you have started with, in most cases"
Ah - "in most cases"
It could be that I end up hearing from all the people that don't fit in the "most cases" and that everybody else is not having any issues, but I doubt it.
Udi, I don't think that we disagree. But I think that if you already have solutions in place for the sort of scale that you already have (including the buffer that I talked about), having this come up is already something that you had to address.
This could become a meeting game - if you hear 'scalability' and 'infinite' stand up and shout "YAGNI, YAGNI!"
Very good post. I absolutely agree that it's quite useless to worry too much about scaling before you've collected some data on how the system will be used, what bottlenecks come up, etc.
The biggest challenge about scaling is predictability. You simply don't know how you need to design your system or how many servers you need so that it can handle a million requests per hour. It might be easier with web-only applications, but when systems become more complex (data analysis, live streaming of stock prices, etc.) you can only guess how much infrastructure you will need to meet certain perf requirements.
Can you elaborate more on: "Note, those changes are not changes to the code, they are architectural and system changes. Where before you had a single database, now you have many. Where before you could use ACID, now you have to use BASE. You need to push a lot more tasks to the background, the user interaction changes, etc."
When you talk about jumping from 1 server to multiple servers, ACID to BASE, and how user interaction changes, how do you quantify that this is done without code changes?