Ayende @ Rahien

Migration strategy considerations for Dev –> UAT –> Production

time to read 5 min | 866 words

Part of the reason for RavenDB was that I wanted a database that actually took into account how it is being used, and provided good support for common usage scenarios. Making the process of moving between Dev –> UAT –> Production easier is a big part of that.

Typically, databases don’t handle that directly, but let the users figure it out. You can see that in the plethora of SQL schema deployment and versioning options that you have to deal with.

With RavenDB, for indexes in particular, we made the process very easy. Indexes are defined in code, deployed alongside the application and versioned in the exact same manner, in the exact same system.

But as the feature set of RavenDB grows, we need to consider the deployment scenario in additional places. We recently started talking about the development cycle of ETL processes, Subscriptions, Backups and external replication. The last two are fairly rare in development / UAT scenarios, so we’ll ignore them for now. They are typically only ever set up & used in production. Sometimes you test them on a dedicated instance, but it doesn’t make sense to deploy a backup configuration in most cases. External replication is basically just destination + credentials, so there isn’t really all that much to track or deploy.

ETL Processes and Subscriptions, on the other hand, can contain quite a bit of logic in them. An ETL process that feeds into a reporting database might be composed of several distinct pieces, each of them feeding some part of the data to the reporting db. If the reporting needs change, we’ll likely need to update the ETL process as well, which means that we need to consider exactly how we’ll do that. Ideally, we want a developer to be able to start working on the ETL process on their own machine, completely isolated. Once they are done working, they can check in their work into the code repository and move on to other tasks. At some future time, this code will get deployed, which will set up the right ETL process in production.

That is a really nice story, and it is how we are dealing with indexes, but it doesn’t actually work for ETL processes. The problem is that ETL is typically not the purview of the application developer; it is in the hands of the operations team, or maybe it is owned by the team that owns the reports. Furthermore, changes to the ETL process are pretty common and typically happen outside the release cycle of the application itself. That means that we can’t tie this behavior to the code. Unlike indexes, which have a pretty tight integration with the code that is using them, ETL is a background kind of operation, with little direct impact on the application, so it can’t be tied to the application code the way indexes are. Even with indexes, we have measures in place that lock the index definition, so an administrator can update the index definition on the fly without the application overwriting it with the old version of the index.

Subscriptions are more of a middle ground. A subscription is composed of a client side application that processes the data and some server side logic related to filtering and shaping it. On the one hand, it makes a lot of sense for the subscribing application to control its subscription, but an admin that wants to update the subscription definition is a very likely scenario. Maybe as a result of a data change, or new input from the business. We can update the server side code without re-deployment, and that is usually a good idea.

To make matters a bit more complex, we also have to consider secrets management. ETL processes, in particular, can contain sensitive information (connection strings). So we need to figure out a way to have the connection string, but not have the connection string. In other words, if I write a new ETL process and deploy it to production, I need to be sure that I don’t need to remember to update the connection string from my local machine to the production database. Or, much worse, if I’m taking the ETL from production, I don’t want to accidentally also get the production connection string. That means that we need to use named connection strings, and rely on the developer / admin to set it up properly across environments.
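To make the idea concrete, here is a rough sketch of what a named connection string buys us. The types here are hypothetical and only illustrate the shape of the solution, not the actual RavenDB API: the ETL definition carries just a name, and each environment supplies its own value for that name.

```csharp
// Hypothetical types, for illustration only.
public class ReportingEtlDefinition
{
    public string Name { get; set; }
    public string TransformScript { get; set; }

    // The definition only references the connection string by name. The actual
    // value lives in the environment (dev / UAT / production), so deploying the
    // definition never drags a production secret along with it.
    public string ConnectionStringName { get; set; }
}

public class EnvironmentConnectionStrings
{
    private readonly System.Collections.Generic.Dictionary<string, string> _values =
        new System.Collections.Generic.Dictionary<string, string>();

    // Set once per environment by the developer / admin, e.g.:
    //   Set("reporting", "Server=localhost;Database=ReportsDev;...")   on a dev machine
    //   Set("reporting", "Server=reports-prod;Database=Reports;...")   in production
    public void Set(string name, string value) => _values[name] = value;

    public string Resolve(ReportingEtlDefinition etl) => _values[etl.ConnectionStringName];
}
```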

I would really appreciate any feedback you have about how to handle this scenario.

Both ETL processes and Subscriptions are just JSON documents of not too much complexity, so actually moving them around between servers isn’t hard, it is the process of doing so that we are trying to flesh out. I should also mention that we are probably just going to make sure that there is a process to handle that, not something that is mandated, because some companies have very different deployment models that we need to fit into. This is us trying to think about the best way to handle the most common scenario with as little friction as possible.

Keeping secrets on Linux, Redux

time to read 2 min | 203 words

A few weeks ago I talked about how we can keep secrets on Linux. I wasn’t really happy with the solution we came up with, namely a file that is ACLed so only the database user can access it, and I kept mulling that over in my head. We wanted something better, but we didn’t want to start taking dependencies on technologies that the user might not have.

What we came up with eventually was to externalize that decision. The only thing that we actually need is the single root key, so instead of keeping it in a file, we can let the admin provide it for us (a script / executable / etc.). What we’ll do is check if the user configured an executable to run, and if so, we’ll run that executable, send it the data over STDIN, and read the result back over STDOUT. That frees us from having to take dependencies and gives the admin a great degree of freedom in how to deal with keeping secrets in the way the organization is used to.
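A minimal sketch of that exchange, assuming a configured executable path and arguments (this is my own illustration, not the actual RavenDB implementation):

```csharp
using System.Diagnostics;
using System.IO;

public static class SecretKeyExecutable
{
    // Run the admin-configured executable, hand it the payload over STDIN,
    // and read whatever it writes back over STDOUT (e.g. the unwrapped root key).
    public static byte[] GetSecret(string exec, string args, byte[] payload)
    {
        var psi = new ProcessStartInfo
        {
            FileName = exec,
            Arguments = args,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false
        };

        using (var process = Process.Start(psi))
        {
            process.StandardInput.BaseStream.Write(payload, 0, payload.Length);
            process.StandardInput.Close();

            using (var output = new MemoryStream())
            {
                process.StandardOutput.BaseStream.CopyTo(output);
                process.WaitForExit();
                return output.ToArray();
            }
        }
    }
}
```

The executable itself can be anything the organization already trusts: a shell script that talks to a vault, a call to a hardware module, or just reading a file with different permissions.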

After figuring this out, I found out that Apache is already doing something very similar with SSL Pass Phrase Dialog.

History of storage costs and the software design impact

time to read 4 min | 642 words

Edgar F. Codd formulated the relational model in 1969. Ten years later, Oracle 2.0 came to the market. And Sybase SQL Server came out with its first version in 1984. By the early 90s, it was clear that relational databases had pushed out the competition (such as navigational or object oriented databases) to the sidelines. It made sense: you could do a lot more with a relational database, and you could do it more easily, usually faster and certainly in a more convenient manner.

Let us look at what environment those relational databases were written for. In 1979, you could buy IBM's 3370 direct access storage device. It offered a stunning 571MB (read, megabytes) of storage for the mere cost of $35,100. For reference, the yearly salary of a programmer at that time was $17,535. In other words, the cost of a single 571MB hard drive was as much as two full time developers, for an entire year.

In 1980, the first drives with more than 1 GB of storage appeared. The IBM 3380, for example, was able to store a whopping 2.52 GB of information; the low end version cost, at the time, $97,650 and it was about as big as a washing machine. By 1986, the situation had improved and you could purchase a good internal hard drive, with all of 20MB, for merely $800. For reference, a good car at the time would cost you less than $7,000.

Skipping ahead again, by 1996 you could actually purchase a 2.83 GB drive for merely $2,900. A car at that time would cost you $12,371. I could go on, but I'm sure that you get the point by now. Storage used to be expensive. So expensive that it dominated pretty much every other concern that you can think of.

At the time of this writing, you can get a hard disk with 10 TB of storage for about $400 [1]. And a 1 TB SSD drive will cost you less than $300[2]. Those numbers give us about a quarter of a dollar (26 cents, to be exact) per GB for SSD drives, and less than 4 cents per GB for the hard disk.

Compare that to a price of $38,750 per gigabyte in 1980. Oh, except that we forgot about inflation, so the inflation adjusted price for a single GB was $114,451.63. Now, you will be right if you point out that this is a very unfair comparison. I'm comparing consumer grade hardware to high end systems. Enterprise storage systems, the kind you actually run databases on, tend to be a bit above that price range. We can compare the cost of storing information in the cloud, and based on current pricing it looks like storing a GB on Amazon S3 for 5 years (to compare with the expected lifetime of a hard disk) will cost less than $1.50, with Azure being even cheaper.
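For reference, the per-gigabyte figures above are just simple division over the prices quoted earlier:

$$
\frac{\$97{,}650}{2.52\ \text{GB}} \approx \$38{,}750\ /\ \text{GB (IBM 3380, 1980)}
\qquad
\frac{\$400}{10{,}000\ \text{GB}} \approx \$0.04\ /\ \text{GB (10 TB hard disk today)}
$$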

The really interesting aspect of those numbers is the way they shaped the software written at that time period. It made a lot of sense to put a lot more on the user, not because you were lazy, but because it was the only way to do things. Most document databases, for example, store the document structure alongside the document itself (so property names are stored in each document). It would be utterly insane to try to do that in a system where hard disk space was so expensive. On the other hand, decisions such as “normalization is critical” were mostly driven by the necessity to reduce storage costs, and only transitioned later on to the “purity of data model” reasoning once the disk space cost became a non-issue.

 


[1] ST10000VN0004 - 7200RPM with 256MB Cache

[2] The SDSSDHII-1T00-G25 - with greater than 500 MB/sec read/write speeds and close to 100,000 IOPS

Distributed task assignment with failover

time to read 5 min | 818 words

When building a distributed system, one of the more interesting aspects is how you are going to distribute task assignment. In other words, given that you have multiple nodes, how do you decide which node will do what? In some cases, that is relatively easy, you can say “all nodes will process read requests”, but in others, this is more complex. Let us take the case where you have several nodes, and you need to have a regular backup of a database that is replicated between all those nodes. You probably don’t want to run the backup across all the nodes, after all, they are pretty much the same and you don’t want to backup the exact same thing multiple times. On the other hand, you probably don’t want to assign this work statically; if you do, and the node that is responsible for the backup is down, you have no backup.

Another example of the problem can be seen when you have other processes that you would like to be sticky if possible, and only jump around if there is a failure. Bringing a new node online is a common thing to do in a cluster, and the ideal scenario in that case is that a single node will feed it all the data that it needs. If we have multiple nodes doing that, they are likely to overlap and they might very well overload the poor new server. So we want just one node to update its state, but if that node goes down midway, we need someone else to pick up the slack. For RavenDB, those sorts of tasks include things like ETL processes, subscriptions, backups, bootstrapping new servers and more. We keep discovering new things that can use this sort of behavior.

But how do we actually make this work?

One way of doing this is to take advantage of the fact that RavenDB is using Raft and have the leader do task assignment. That works quite well, but it doesn’t scale. What do I mean by that? I mean that as the number of tasks that we need to manage grows, the complexity in the task assignment portion of the code grows as well. Ideally, I don’t want to have to consider twenty different variables and considerations before deciding what operation should go on which server, and trying to balance that sort of work in one place has proven to be challenging.

Instead, we have decided to allocate work in the cluster based on simple rules. Each task in the cluster has a key (which is typically generated by hashing some of its parameters), and that task can be assigned to a group of servers. Given those two details, we can use Jump Consistent Hashing to spread the load around. However, that doesn’t handle failover. We have a heartbeat process that can detect and notify nodes that a sibling has gone down, so combining those two, we get the following code:
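The code itself isn’t reproduced here; a simplified sketch of the approach described in the next paragraph (my own reconstruction, with the rehashing step as a placeholder rather than the exact mixing RavenDB uses) looks roughly like this:

```csharp
using System.Collections.Generic;

public static class TaskAssignment
{
    // Jump Consistent Hashing (Lamping & Veach): maps a 64 bit key to one of
    // numBuckets buckets, moving as few keys as possible when the count changes.
    public static int JumpConsistentHash(ulong key, int numBuckets)
    {
        long b = -1, j = 0;
        while (j < numBuckets)
        {
            b = j;
            key = key * 2862933555777941757UL + 1;
            j = (long)((b + 1) * ((1L << 31) / (double)((key >> 33) + 1)));
        }
        return (int)b;
    }

    // Hash the task key to a preferred owner; if that node is down, drop it from
    // the topology, re-mix the key and try again, so the dead node's tasks spread
    // across the remaining live members instead of all landing on one neighbor.
    public static string AssignOwner(ulong taskKey, IReadOnlyList<string> topology, ISet<string> liveNodes)
    {
        var candidates = new List<string>(topology);
        var key = taskKey;
        while (candidates.Count > 0)
        {
            var preferred = candidates[JumpConsistentHash(key, candidates.Count)];
            if (liveNodes.Contains(preferred))
                return preferred;

            candidates.Remove(preferred);
            key = Rehash(key);
        }
        return null; // no live nodes at all
    }

    private static ulong Rehash(ulong key)
    {
        // Any decent 64 bit mixer will do here; this one is just illustrative.
        key ^= key >> 33;
        key *= 0xFF51AFD7ED558CCDUL;
        key ^= key >> 33;
        return key;
    }
}
```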

What we are doing here is relying on two different properties: Jump Consistent Hashing to let us know which node is responsible for what, and the Raft cluster leader that keeps track of all the live nodes and lets us know when a node goes down. When we need to assign a task, we use its hashed key to find its preferred owner, and if it is alive, that is that. But if it is currently down, we do two things: we remove the downed node from the topology and re-hash the key with the new number of nodes in the cluster. That gives us a new preferred node, and so on until we find a live one.

The reason we rehash on failover is that Jump Consistent Hashing is going to usually point to the same position in the topology (that is why we chose it in the first place, after all), so we rehash to get a different position so it won’t all fall onto the next node in the list. All downed node tasks are fairly distributed among the remaining live cluster members.

The nice thing about this is that aside from keeping the live/down list up to date, the Raft cluster doesn’t really need to do anything. This is a consistent algorithm, so different nodes operating on the same data will arrive at the same result. A node going down will result in one surviving node picking up the job of bringing the new server up to spec, and another starting the backup process. And all of that logic is right where we want it, right next to where the task logic itself is written.

This allows us to reason much more effectively about the behavior of each independent task, and also allows each node to let you know where each task is executing.

De-virtualization in CoreCLR: Part II

time to read 2 min | 323 words

In my previous post, I discussed how the CLR handles various method calls, depending on exactly what we are doing (interface dispatch, virtual method call, struct method call).

That was all fun and games, but how can we actually put this into practice? Let us take a look at a generic method and how it is actually translated to machine code:
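The code itself isn’t reproduced here, but the shape of the method being discussed is simple; here is a sketch of my own (not the exact code from the post): the same generic call site, used once with a class and once with a struct.

```csharp
using System;

public interface IRunner
{
    void Run();
}

public class ClassRunner : IRunner
{
    public void Run() => Console.WriteLine("class");
}

public struct StructRunner : IRunner
{
    public void Run() => Console.WriteLine("struct");
}

public static class Program
{
    // For T == ClassRunner the JIT uses the shared generic code and goes through
    // the interface dispatch stub; for T == StructRunner it compiles a dedicated
    // instantiation, knows the exact target, and can inline the call entirely.
    public static void Execute<T>(T runner) where T : IRunner
    {
        runner.Run();
    }

    public static void Main()
    {
        Execute(new ClassRunner());
        Execute(new StructRunner());
    }
}
```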

In the case of an interface, we go through the standard virtual stub to dispatch the method. In the case of a struct, the JIT was smart enough to inline the generic call.

I’ll let that sink in for a second. Using a struct generic argument, we were able to inline the call.

Remember the previous post when we talked about the cost of method dispatch, in number of instructions, in number of memory jumps and references? We now have a way to replace those invocation costs with inlinable code.

When is this going to be useful? This technique is most beneficial when we are talking about code that is used a lot and is relatively small / efficient already, to the point where the cost of calling it is a large part of the execution time.

One situation that pops to mind and fits just this scenario is hash code and equality checks in a dictionary. The number of such calls that we make is in the many billions per second, and the cost of indirection here is tremendous. We have our own dictionary implementation (with different defaults and design guidelines than the built-in one), but one that is also meant to be used with this approach.

In other words, instead of passing an EqualityComparer instance, we use an EqualityComparer generic parameter. And that allows the JIT to tune us to the Nth degree, allowing us to inline the hash / equals calls. That means that in a very hot code path, we can drastically reduce costs.
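A minimal sketch of that pattern (hypothetical names, not the actual RavenDB dictionary): the comparer is a struct generic parameter, so the JIT specializes the code per comparer and can inline GetHashCode / Equals instead of dispatching through the interface.

```csharp
using System;
using System.Collections.Generic;

public struct OrdinalStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string obj) => obj.GetHashCode();
}

public class FastLookup<TKey, TComparer> where TComparer : struct, IEqualityComparer<TKey>
{
    // Struct comparer: the calls below are direct and are candidates for inlining.
    private readonly TComparer _comparer = default(TComparer);

    public bool AreEqual(TKey a, TKey b) => _comparer.Equals(a, b);
    public int Hash(TKey key) => _comparer.GetHashCode(key);
}

// Usage: var lookup = new FastLookup<string, OrdinalStringComparer>();
```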

De-virtualization in CoreCLR: Part I

time to read 5 min | 852 words

I mentioned that we are doing some work to enable de-virtualization of our code, as well as getting ready for CoreCLR changes that will get the JIT to do more de-virtualization.

I was asked (by Maayan) about this, more specifically:

How does manual de-virtualization work? AFAIK, the compiler always emits a CallVirt instruction for non-static method calls, regardless of whether the method is virtual or not (and regardless of whether the class is sealed or not). Are you extending the C# compiler and overriding the emission code? Are you re-JITting the code in runtime (as profilers do using the profiler API)?

And the answer is that Maayan is correct, C# is defined (for various reasons related to the ECMA approval process over 15 years ago) to always dispatch methods using CallVirt. But CallVirt is an IL instruction, not an assembly one.

Here is the code for the various cases, as well as the relevant assembly being generated for it (CoreCLR 1.1, X64, Release). Don’t worry about the assembly, I have detailed explanations below for each part.

This is a pretty simple (non-inlined) set of methods that are just there to make sure that you can see what ends up actually running. The actual methods just end up calling Console.WriteLine, but that is about it.
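The original code isn’t reproduced here; a plausible reconstruction of the kind of setup being described (three call shapes — interface, virtual and struct — each marked NoInlining so the dispatch itself stays visible in the generated assembly) would be:

```csharp
using System;
using System.Runtime.CompilerServices;

public interface IMessage
{
    void Say();
}

public class InterfaceMessage : IMessage
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Say() => Console.WriteLine("interface dispatch");
}

public class VirtualMessage
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public virtual void Say() => Console.WriteLine("virtual call");
}

public struct StructMessage
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Say() => Console.WriteLine("struct call");
}
```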

Now, let us inspect each of those behaviors one at a time. First, let us look at how an interface dispatch works.

If you want the gory details, you can look for those  in the Book of the Runtime. But basically, we are jumping to a code location that will find the proper code that we need to execute. Note that there is still additional work there to actually route the method to the appropriate place, which is hidden from us by the runtime, JIT, etc.

Note the funny cmp instruction there? It is there to generate a dereference of the address by the CPU, which will cause it to raise a trap if the address is invalid, and if the address is 0, the CLR will convert that to a NullReferenceException.

Now, what about a virtual method call? That is actually simpler, since it is similar to how I learned it when I worked with C++.

Basically, at the end of the object we have a method table pointer, and we dereference that, and then another pointer to the specific method we want to invoke. Part of the reason that virtual calls are expensive is exactly this: we have to jump around in memory a lot, and that means that we both have to issue more instructions and, more to the point, we have to touch a lot more memory, which can cause cache lines to be evicted, forcing us to stall.

What about a struct method call? To make things easier, I made sure that it couldn’t be inlined, which generated:

In this case, I’m using an empty struct, so it is no bigger than a pointer, which is why you can see it being passed around as if it was just a pointer. If I had a bigger struct, I would see very different code.

Why is the code simpler when the struct is bigger? Well, the answer is quite simple, we are looking at the actual method that was called with this parameter, but the job of actually sending the parameter is done by the caller, so we aren’t seeing it here.

What is going on is that as long as your struct is empty or has a single value that is 4 / 8 bytes long, the CLR can optimize that to be a regular parameter, making it effectively free. In such cases, you can see the struct being passed around in registers (the struct itself, since it is copied, not the address to it).

However, if you have a struct that is composed of multiple fields, that requires us to copy each field to the stack before the call, which for large structs can take a bit.

I mentioned that this was done with a struct that we disabled inlining for; what happens if we allow inlining (the default)?

This looks completely different than anything we have seen so far. And in fact, it is. What we are seeing here is the result of inlining the struct method invocation. Because the compiler was able to figure out what the end target of the call is, and because it is small enough to be inlined, we can skip calling the method entirely and just directly run the code.

As it turns out, this can have a dramatic effect on performance (in both directions, mind), and is something that you need to carefully consider when you analyze your application performance.

But the short of it is, the fewer jumps and dereferences we have, the better it is for us. And you can see the various methods (pun intended) that the CLR uses to dispatch them. In my next post, I’ll talk about how we can make use of this behavior.

Thread pool starvation? Just add another thread

time to read 2 min | 388 words

One of the nastier edge cases with TaskCompletionSource is that you can attach a continuation to it which will run synchronously. You can avoid that to a certain extent by using RunContinuationsAsynchronously, and that works, but under load, it can still be problematic.

In particular, consider the case where we have a task with:

  1. Do computation
  2. Enqueue a task to be completed by a different thread (getting a Task back)
  3. Continue computation until done
  4. Wait for previous operation to complete
  5. Go to 1

Even with avoiding running the continuation in sync mode, that still results in an issue. In particular, when we are running an asynchronous continuation, that isn’t magic; it still needs a thread to run on, and that will typically be a thread pool thread.

But if all the thread pool threads are busy doing the work above, we may be forced to wait until the current computation is done and threads come back to pull more work from the thread pool queue, before the queue ever gets to the notification that we are ready to continue. In other words, we may suffer from jitter, where the running task is waiting for an already completed async operation, but it doesn’t know it (and hence gives up the thread), because there wasn’t any available thread to run the continuation.

We resolve it by adding a dedicated thread, which simply waits for those notifications and runs only them. Because those are typically very short, and there aren’t that many of them, we can process them very quickly. In order to prevent stalls on that thread, we use what I think is a pretty nifty trick.

We are registering the event twice, once on our dedicated thread, and once on the normal thread pool. If somehow the dedicated thread is too busy, the thread pool (and its auto growth) will handle it, but most of the time, the dedicated thread can catch it and run it.

And adding a basically noop task to the thread pool isn’t going to generate any pressure on the thread pool if there is no load, and if there is load, it will encourage it to grow faster, which is what we want.
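A rough sketch of the double-registration idea (my own illustration, not the actual code linked below): each notification is handed both to the dedicated thread and to the thread pool; whichever gets to it first runs it, and the other sees that it was already handled and does nothing.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public class NotificationRunner
{
    private readonly BlockingCollection<Action> _work = new BlockingCollection<Action>();

    public NotificationRunner()
    {
        var thread = new Thread(() =>
        {
            // Notifications are short, so this loop stays responsive
            foreach (var item in _work.GetConsumingEnumerable())
                item();
        })
        { IsBackground = true };
        thread.Start();
    }

    public void Enqueue(Action notification)
    {
        var executed = 0;
        Action runOnce = () =>
        {
            // Only the first of the two registrations actually runs the work
            if (Interlocked.CompareExchange(ref executed, 1, 0) == 0)
                notification();
        };

        _work.Add(runOnce);                            // dedicated thread registration
        ThreadPool.QueueUserWorkItem(_ => runOnce());  // thread pool registration, usually a noop
    }
}
```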

If you care to look at the code, it is here.

I was wrong, reflecting on the .NET design choices

time to read 3 min | 480 words

I have been re-thinking some of my previous positions with regards to development, and it appears that I have been quite wrong in the past.

In particular, I’m talking about things like:

Note that those posts are part of a much larger discussion, and both are close to a decade old. They aren’t really relevant anymore, I think, but it still bugs me, and I wanted to outline my current thinking on the matter.

C# is non virtual by default, while Java is virtual by default. That seems like a minor distinction, but it has huge implications. It means that proxying / mocking / runtime subclassing is a lot easier with Java than with C#. In fact, a lot of frameworks that were ported from Java rely on this heavily, and that made it much harder to use them in C#. The most common one was NHibernate, and this was one of the chief frustrations that I kept running into.

However, given that I’m working on a database engine now, not on business software, I can see a whole different world of constraints. In particular, a virtual method call is significantly more expensive than a direct call, and that adds up quite quickly. One of the things that we routinely do is try to de-virtualize method calls using various tricks, and we are eagerly waiting for .NET Core 2.0 with the de-virtualization support in the JIT (we have already started writing code to take advantage of it).

Another issue is that my approach to software design has significantly changed. Where I would previously do a lot of inheritance and explicit design patterns, I’m far more motivated toward using composition instead. I’m also marking very clear boundaries between My Code and Client Code. In My Code, I don’t try to maintain encapsulation, or hide state, whereas with stuff that is expected to be used externally, that is very much the case. But that gives a very different feel to the API and usage patterns that we handle.

This also relates to abstract classes vs. interfaces, and why you should care. As a consumer, unless you are busy doing some mocking or some such, you likely don’t, but as a library author, that matters a lot to the amount of flexibility you get.

I think that a lot of this has to do with my viewpoint, not just as an Open Source author, but as someone who runs a project where customers are using us for years on end, and they really don’t want us to make any changes that would impact their code. That leads to a lot more emphasis on backward compatibility (source, binary & behavior), and if you mess it up, you get ricochets from people who pay you money because their job is harder.

Single roundtrip authentication

time to read 5 min | 851 words

One of the things that we did in RavenDB 4.0 was support running on Linux, which meant giving up on Windows Authentication. To the nitpickers among you, I’m aware that we can do LDAP authentication, and we can do Windows Auth over HTTP/S on Linux. Let us say that given the expected results, all I can wish you is that you’ll have to debug / deploy / support such a system for real.

Giving up on Windows Authentication in general, even on Windows, is something that we did with great relief. Nothing like getting an urgent call from a customer and trying to figure out why a certain user isn’t authenticated, and trying to figure out the trust relationship between different domains, forests, DMZ and a whole host of other stuff that has an utter lack of anything related to us. I hate those kinds of cases.

In fact, I think that my new ill wish phrase will become “may you get a 2 AM call to support Kerberos authentication problems on a Linux server in the DMZ to a nested Windows domain”. But that might be going too far and damage my Karma.

That led us to API key authentication. For that, we want to be able to authenticate against the server and get a token, which can then be used in future requests. An additional benefit is that by building our own system we are able to actually own the entire thing and support it much better. A side issue here is that we need to support / maintain security critical code, which I’m not so happy about. But owning this also gives me the option of doing things in a more optimized fashion. And in this case, we want to handle authentication in as few network roundtrips as possible, ideally one.

That is a bit challenging, since on the first request, we know nothing about the server. I actually implemented a couple of attempts to do so, but pretty much all of them were vulnerable to some degree after we did security analysis on them. You can look here for some of the details. So we gave that up in favor of mostly single roundtrip authentication.

The very first time the client starts, it will use a well known endpoint to get the server public key. When we need to authenticate, we’ll generate our own key pair, and use it together with the server public key to encrypt a hash of the secret key and the client’s public key, then send our own public key and the encrypted data to the server for authentication. In return, the server will validate the client based on the hash of the password and the client public key, generate an authentication token, encrypt that with the client’s public key and send it back.

Now, this is a bit complex, I’ll admit, and it is utterly unnecessary if the users are using HTTPS, as they should. However, there are actually quite a few deployments where HTTPS isn’t used, mostly inside organizations where they choose to deploy over HTTP to avoid the complexity of cert management / update / etc. Our goal is to prevent leakage of the secret key over the network even in the case that the admin didn’t setup things securely. Note that in this case (not using HTTPS), anyone listening on the network is going to be able to grab the authentication token (which is valid for about half an hour), but that is out of scope for this discussion.

We can assume that communication between client & server is not snoopable, given that they are using public keys to exchange the data. We also ensure that the size of the data is always the same, so there is no information leakage there. The return trip with the authentication token is also encrypted, using the client’s public key.

A bad actor can pretend to be a server and fool a client into sending the authentication request using the bad actor’s public key, instead of the server’s. This is possible because we don’t have trust chains. The result is that the bad actor now has the hash of the password with the public key of the client (which is also known). So the bad guy can try to do some offline password guessing; we mitigate that by using secret keys that are a minimum of 192 bits and generated using a cryptographically secure RNG.

The bad actor can of course send the data as is to the server, which will authenticate it normally, but since it uses the client’s public key both for the hash validation and for the encryption of the reply, the bad actor wouldn’t be able to get the token.

On the client side, after the first time that the client hits the server, it will be able to cache the server public key, and any future authentication token refresh can be done in a single roundtrip.

RavenDB 4.0 Features: Document Versioning

time to read 3 min | 487 words

After focusing for so long on the low level and performance of RavenDB 4.0, you might be forgiven if you think that we are neglecting real meaningful features in 4.0. That couldn’t be further from the truth; the problem is that it is actually quite hard to talk about some features meaningfully before they have been through a complete round of implementing, testing, wiring for UI and UX, and then beating them with a stick to see what falls out.

Document versioning in RavenDB isn’t new. It has been a feature for several releases now. In RavenDB 4.0 we have decided to reimagine it completely. One of the strong points of the RavenDB design from the get go was that triggers allowed you to build new features into the product without making core changes. That allowed quickly adding new features, but could cause issues with feature intersections.

With RavenDB 4.0, versions are now part of the core code. That means that they get special treatment. Here is how it looks in the UI:


You can switch between versions of the document, and see it as it was at various points in time.

Externally, this is pretty much all there is to it. Here is what this will look like when you are looking at a revision.


In terms of the client API, there is little change:
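As an illustration (using the shape of the released 4.x client API, which may differ slightly from what existed when this was written), reading revisions looks almost the same as any other session operation:

```csharp
using System.Collections.Generic;
using Raven.Client.Documents;

public class Order
{
    public string Id { get; set; }
    public decimal Total { get; set; }
}

public static class RevisionsExample
{
    public static List<Order> LoadRevisions()
    {
        using (var store = new DocumentStore { Urls = new[] { "http://localhost:8080" }, Database = "Orders" })
        {
            store.Initialize();
            using (var session = store.OpenSession())
            {
                // Revisions are read through the advanced session API; everything
                // else about the session works the way it always did.
                return session.Advanced.Revisions.GetFor<Order>("orders/1");
            }
        }
    }
}
```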

So what did change?

Revisions are now stored separately from documents. Previously we stored them as special documents, which led to some problems (indexes had to read them just to see that they aren’t interested in them, for example), and we had to special case a lot of stuff that is supposed to work on documents so it wouldn’t apply to revisions. Making them their own unique thing simplifies everything significantly. We have also made sure that revisions can properly flow with replication, and prevent the potential for versioning conflicts which were tricky to solve. The downside of this is that desirable operations on revisions (load a revision with include, for example) will no longer work.

The whole thing is also significantly faster and cheaper to work with. A major change we also made was to ensure that versioning is always available, so you don’t have to specify it at DB creation time. Instead, you can configure it whenever you want, without having to add a bundle at runtime and risk unsupported operations.

That means that if at some point you want to start versioning your documents, you can just flip a switch, and it will immediately start versioning all changes for you.
