TalkExtreme Performance Architecture
I gave this talk a few months ago, and now it is out, enjoy.
I gave this talk a few months ago, and now it is out, enjoy.
We are currently busy converting our entire infrastructure to the new version. Somehow we gathered quite a bit of stuff internally, so that is taking some time. However, this blog is now running on RavenDB 4.0. It feels faster, but we haven’t done proper measurements yet. Primarily because this blog was written to be a sample app for RavenDB about seven years ago, and the code show its age.
We’ll be working on also upgrading that to a more modern system. In particular, we want to turn that into a sample app of how to properly deploy a RavenDB 4.0 application for the modern world. This means that beside actually talking to a highly available cluster, the blog itself is going to be distributed and highly available. The idea is that it would be nice to not take down anything while we are updating stuff, but at the same time, the blog is small enough that it makes it possible to talk about its high availability features without drowning in details.
True work on that is going to start next week, and we would appreciate any feedback on what you are interested in seeing. I’ll probably make that into a series of posts, detailing how to take an existing RavenDB application and move it to RavenDB 4.0, adding all the nice touches along the way, ending up in a distributed and highly available system that can be deployed to production and survive all the nasty things going on there.
So please let me know what you’ll like us to cover.
You probably know that Chrome is a memory hog, I came up with the following extremely brute force manner to deal with it:
This works quite nicely in 99% of the cases, but sometimes, this fails. Can you see why?
I’ll give you a hint, the EmptyWorkingSet() returns a failure and an invalid handle error . Why is that?
Well, the problem is a bit tricky. We first execute the Get-Process cmdlet, and extract the handles from the results. That is great, but we don’t keep track of the Process instances that we get from Get-Process, which means that they are garbage.
That means that the GC might clean them, but they require finalization, so at some point, the finalizer will claim them, closing their handles, which means that the EmptyWorkingSet will fail sporadically in a very non obvious way. The “fix”, by the way, is to iterate of the processes directly, not on their handles, because that keep the process instance live for the duration (and thus its handle).
This issue was reported to the mailing list with a really scary error: UseAfterFree detected! Attempt to return memory from previous generation, Reset has already been called and the memory reused!
I initially read it as an error that is raised from the server, which raised up all sort of flags and caused us to immediately try to track down what is going on.
Here is the code that would reproduce this:
And a key part of that is that this is not happening on the server, but on the client. You now have all the information required to see what the error is.
Can you figure it out?
The problem is that this method returns a Task, but it isn’t an async method. In other words, we return a task that is still running from ToListAsync, but because we aren’t awaiting on it, the session’s dispose is going to run, and by the time the server request completes and is ready to actually do something with the data that it go, we are already disposed and we get this error.
The solution? Turn this into an async method and await on the ToListAsync() before disposing the session.
The intent is clear, we have a setup in which we have a certificate, and the only valid URLs in this case are hostnames that are in the certificate. If the user configured the system so we’ll be listening on: https://my-rvn-one but the certificate has hostname “their-rvn-two”, then we know that this is going to cause issues for clients. They will try to connect but fail because of certificate validation. The hostname in the URL and the hostnames in the certificate don’t match, therefor, an error.
I closed this bug as Won’t Fix, and that deserve an explanation. I usually care very deeply about the kind of errors that we generate, and we want to catch things as early as possible.
But sometimes, that is the worst possible thing. By preventing the user from doing the “wrong” thing, we also prevent it from doing something that is required if you got yourself into a bad state.
Consider the following case, a node is down, and we provisioned another one. We got a different IP, but we need to update the DNS record. That is going to take 24 hours to propagate properly, but we need to be up now. So I change the system configuration to use a different URL, but I can’t get a certificate for the new one yet for whatever reason. Now the validation kicks in, and I’m dead in the water. I might just want to be able to peek into the system, or configure the clients to ignore the certificate error, or something.
In this case, putting the system into an invalid state (such as mismatch between hostname and certificate) is desirable. An admin may want to do this for a host of reasons, mostly because they are under the gun and need things to work. There are surprisingly a large number of such cases, where you know that the situation is invalid, but you allow it because not doing so will lead to blocking off important scenarios.
This fallacy is meant to refer to different administrators defining conflicting policies, at least according to wikipedia. But our experience has shown that the problem is much more worse. It isn’t that you have two companies defining policies that end up at odds with one another. It is that even in the same organization, you have very different areas of expertise and responsibility, and troubleshooting any problem in an organization of a significant size is a hard matter.
Sure, you have the firewall admin which is usually the same as the network admin or at least has that one’s number. But what if the problem is actually in an LDAP proxy making a cross forest query to a remote domain across DMZ and VPN lines, with the actual problem being a DNS change that hasn’t been propagated on the internal network because security policies require DNSSEC verification but the update was made without one.
Figure that one out, with the only error being “I can’t authenticate to your software, therefor the blame is yours” can be a true pain. Especially because the person you are dealing with has likely no relation to the people who have a clue about the problem and is often not even able to run the tools you need to diagnose the issue because they don’t have permissions to do so (“no, you may not sniff our production networks and send that data to an outside party” is a security policy that I support, even if it make my life harder).
So you have an issue, and you need to figure out where it is happening. And in the scenario above, even if you managed to figure out what the actual problem was (which will require multiple server hoping and crossing of security zones) you’ll realize that you need the key used for the DNSSEC, which is at the hangs of yet another admin (most probably at vacation right now).
And when you fix that you’ll find that you now need to flush the DNS caches of multiple DNS proxies and local servers, all of which require admin privileges by a bunch of people.
So no, you can’t assume one (competent) administrator. You have to assume that your software will run in the most brutal of production environments and that users will actively flip all sort of switches and see if you break, just because of Murphy.
What you can do, however, is to design your software accordingly. That means reducing to a minimum the number of external dependencies that you have and controlling what you are doing. It means that as part of the design of the software, you try to consider failure points and see how much of everything you can either own or be sure that you’ll have the facilities in place to inspect and validate with as few parties involved in the process.
A real world example of such a design decision was to drop any and all authentication schemes in RavenDB except for X509 client certificates. In addition to the high level of security and trust that they bring, there is also another really important aspect. It is possible to validate them and operate on them locally without requiring any special privileges on the machine or the network. A certificate can be validate easy, using common tools available on all operating systems.
The scenario above, we had similar ones on a pretty regular basis. And I got really tired of trying to debug someone’s implementation of active directory deployment and all the ways it can cause mishaps.
Today is my birthday, and we are celebrating with no remaining issues for RavenDB 4.0.
We just made the final commits for RavenDB 4.0. This means that it is (almost) done. As we speak the release train is already picking up speed with the bits currently being churned on the build server and on their way to be publicly available.
And yet there is this almost, what does this mean? We don’t have anything left to do in 4.0, but the release process we have is not something as simple as just pushing a build through the build server and sending it to the world.
We are now going into the final proof stage. For the next week, the entire company is going to be focused primarily on trying to see if we can break RavenDB in interesting ways. We are also rolling the new RavenDB bits to all our productions systems, doing a much larger scale test of all the features.
We decided to make these bits available to users as well, to give you direct access to the final product before the actual release. The RavenDB website is already updated with a new coat of paint, which I’m quite fond of.
As it turns out, releasing the project is a bit of a chore, so we are also working furiously on the docs, but it will take some time to complete. We pushed a lot of the updates to the online docs already, but there is still a lot to be done, which is another reason why we hold off on the RTM label.
That said, this is it, the only work going on right now is docs, testing and making sure that all the bits are glued together. Please download the new bits and give it a try, we would dearly appreciate any and all feedback.
We’ll have a full blog post with all the details in a week, when the official release will happen, in the meantime, we’re off to celebrate.
One of the tough problems in distributed programming is how to deal with a component that is made up of multiple nodes. Consider a reservation service that is made up of a few nodes that needs to ensure that regardless of which node you’re talking to, if you made a reservation, you’ll have a spot. There are a lot of ways to solve this from the inside, but that isn’t what I want to talk about right now. I want to talk about the overall approach to modeling such systems.
Instead of focusing on how you would implement such a system, consider this to be an internal problem for this particular component. A good parallel for this problem is making plans with a couple for a meetup. You might be talking to both of them or just one, but you don’t care. The person you are talking to is the one that is going to giving you a decision that is valid for the couple.
How they do that is not relevant. It can be that one of them is in charge of the social calendar or that they flip based on the day of the week or whoever got out of bed first this morning or whatever his mother called her dog last year or… you don’t care. Furthermore, you probably don’t want to know. That is an internal problem and sticking your nose into the internal decision making is a Bad Idea that may lead to someone sleeping on your couch for an indefinite period of time.
But, and this is important, you can walk to either one of them and they will make such a decision. It may be something in the order of “Let me talk to my significant other and I’ll get back to you by tomorrow” or it can be “Sure, how about 8 PM on Sunday?” or even “I don’t like you, so nope, nope nope” but you’ll get some sort of an answer.
Taking this back to the distributed components design, that kind of decision in internal to the component and the mechanics of this is handled internally shouldn’t be exposed outside. Let’s take a look on why this is the case.
Starting out, we run all such decisions as a consensus that required a majority of the nodes. But a couple of nodes went down and took down the system in a bad way, so the next iteration we had moved to reserving some spots for each node that they own and can hand off to others on their own, without consulting any other nodes. This sort of change shouldn’t matters to callers of this component, but it is very common to have outside parties take notice of how you are doing things and take a dependency on that.
The main reason I call it the married couple design problem is that it should immediately cause you to consider how you should stay away from the decision making there. Of course, if you don’t, I’m going to call your design the Mother In Law Architecture.
In a recent PR, I run into this code, which is used in query generation to decide if we need to quote a particular alias. The code itself is pretty straightforward and easy to follow:
It also have two distinct issues. First, there is the allocation because of the ToUpper call and second, we are doing O(N) search on the alias array every single time.
I asked for a change, to use HashSet and to use the OrdinalIgnoreCase comparer.
Here is the change I got back:
This is exactly what I asked for, and it is very subtly wrong. We are now saving an allocation, which is great, but the problem is with the Contains method.
This looks okay, but this is not HashSet.Contains, instead, this is an extension method from Enumerable.Contains, which iterates over the set and compare each value.
The fix is also simple:
And now we don’t have O(N) anymore.
Although I’ll admit that for such small size, it probably doesn’t matter.
Spatial queries are fun, when you look at them from the outside. Not so fun when you are working to implement them, but that is probably not your concern.
RavenDB had had support for spatial queries for many years now, but the RavenDB 4.0 release has touched on that as well and now you can query spatial data with much greater each. Here is a small sample of how this works:
This query is doing a polygon search for all the employees located inside that polygon. You can visualize the query on the map, we have 4 employees (in yellow) in the viewport and two of them are included within the specified polygon (in blue).
And here is what this looks like in the studio:
In this case, you can see how we support automatic indexing of spatial data. You can also define your own spatial indexes if you need greater control but it is as easy as pie to just go ahead and start running it.
From code, this is just as easy:
I’m not sure why, but when looking at the results, this just feels like magic.
No future posts left, oh my!