Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969



Transactional Patterns: Conversation vs. Batch

time to read 6 min | 1136 words

When I designed RavenDB, I had a very particular use case at the forefront of my mind. That scenario was a business application talking to a database, usually as a web application.

These kinds of applications have a particular style of communication with the user. As you can see below, there are two very distinct operations: showing the user the data, followed by some “think time” (seconds at minimum, but it can be much longer), and then an action.

image

This shouldn’t really be a surprise to anyone who has developed any kind of application in the last decade or two, so why do I mention this explicitly? I mention this because of the nature of communication between the application and the database.

Some databases use the conversation pattern with the application. In terms of API, this will look something like this:

  • BeginTransaction()
  • Update()
  • Insert()
  • Commit()

This is a very natural model and should be quite familiar for most developers. The other alternative to this method is to use batches:

  • SaveChanges( [Update, Insert] )

I want to use this post to talk about the difference between the two styles and how that impacts your work. Relational databases use the conversation style while RavenDB uses the batch style. On the surface, it looks like it would be more complex to use RavenDB to achieve the same task, but there is very little difference in the API as far as the user is concerned. In both cases, the code looks very much the same:

Behind the scenes, however, the RavenDB code will send just a single request to the server, while a relational database will need four separate commands to execute the transaction. In many cases, you can send all of these commands to the server in a single roundtrip, but that is an optimization that doesn’t always work and often isn’t applied even when it is possible.
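To make the roundtrip difference concrete, here is a minimal sketch in Python. It is not RavenDB or any relational driver; FakeServer and all of its methods are hypothetical names invented for this illustration, and the "network" is just a counter that goes up once per request.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Toy in-memory server (hypothetical) that counts the requests it receives."""
    round_trips: int = 0
    data: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    # Conversation style: every call is a separate request to the server.
    def begin_transaction(self):
        self.round_trips += 1
        self.pending = []

    def put(self, key, value):
        self.round_trips += 1
        self.pending.append((key, value))

    def commit(self):
        self.round_trips += 1
        self.data.update(self.pending)
        self.pending = []

    # Batch style: the whole transaction arrives in a single request.
    def save_changes(self, operations):
        self.round_trips += 1
        self.data.update(operations)


def conversation(server):
    server.begin_transaction()
    server.put("orders/1", {"status": "shipped"})  # stands in for Update()
    server.put("orders/2", {"status": "new"})      # stands in for Insert()
    server.commit()
    return server.round_trips


def batch(server):
    server.save_changes([
        ("orders/1", {"status": "shipped"}),
        ("orders/2", {"status": "new"}),
    ])
    return server.round_trips
```

Both functions leave the toy server holding the same two documents; the conversation version pays for four requests and the batch version pays for one.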

Sidebar: Reducing server roundtrips

Why is the reduction in server roundtrips so important? Because it has a lot of implications on the overall performance of the system. In many cases the cost of making a remote query from the application to the database far outstrips the costs of actually executing the query. This ties closely to the Fallacies of Distributed Computing. Latency isn’t zero, even though when you develop locally it certainly seems like this is the case.

The primary goal of this design in RavenDB was to reduce the number of network roundtrips that your application must endure. Because in the vast majority of the cases, your application is going to follow the “show data” / “modify data” as two separate operations (often separated by a long idle time) there is a lot of value in having the database interaction model match what you will actually be doing.

As it turned out, there are some additional advantages (and disadvantages, which I’ll cover a bit later) to this approach, beyond just the obvious reduction in the number of server roundtrips.

When the server gets all the operations that need to be done in a single request, it can apply all of them at once. For that matter, it can choose how to apply them in the most optimal order. This gives the database server a lot more chances for optimization. It is similar to going to the supermarket with a list of items to purchase vs. a treasure hunt. When you have the full list, you can decide to pick things up based on how close they are on the shelves. If you only get the next instruction after you complete the previous one, you have no option for optimization.

When using the conversation style, durability and state management become more complex as well. Relational databases typically use some variation of ARIES for their journals. This is because they need to record information about ongoing transactions that haven’t yet been committed. This adds significant complexity to the amount of work that is required from the database engine. Furthermore, when running in a distributed system, you need to share this transaction state (which hasn’t yet been committed!) across the nodes to allow failover of the transaction if the server fails. With the conversation style, you need to support concurrent transactions all operating at the same time and potentially reading and modifying the same data. This leads to a great deal of code that is required to properly manage locking and latching inside the database engine.

On the other hand, batch mode gives the server all the operations in the transaction in a single go. This means that failover can simply be sending the batch of operations to another node, without the need to share complex state between them. It means that the database server has all the required information and can make decisions based on it. For example, if there are no data dependencies, it can execute the operations in the transaction in whatever order it desires, leading to more optimal execution time. The database can also mix & match operations from different transactions into a single batch (as long as it keeps the externally visible behavior consistent, of course) to optimize things even further.

There are two major disadvantages to batch mode. The first is that there is usually a strict separation of reads from writes. That means that you usually can’t get a single consistent read/modify operation that stays in the same transaction. The second issue is similar: because you need to generate all the operations ahead of time, you can’t make decisions about what operations to execute based on the data you read, at least not in the same transaction. The typical solution for that is to send a script in the batch. This script can then read / modify data in the same context, apply logic, etc. The important thing here is that this script runs inside the server, already inside the transaction. This means that you don’t pay network roundtrip time to make such operations.

On the other hand, it means that you need to write potentially complex logic in the database’s scripting language, rather than your own platform, which you’ll likely prefer.

Luckily, for most scenarios, especially with web applications, you don’t need to execute complex logic on the server side. You can usually just send the commands you need in a single batch and be done with it. Often, just having optimistic concurrency is enough to get you the consistency you want, with scripting reserved for more exceptional cases.
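Here is a rough sketch of how optimistic concurrency can cover the read / think / write gap in batch mode. This is a toy in-memory store in Python, not RavenDB’s implementation; Store, ConcurrencyError and the integer version counter are all assumptions made for the example. The idea is that each write in the batch carries the version the client saw when it read the document, and the whole batch is rejected if any of them has changed in the meantime.

```python
class ConcurrencyError(Exception):
    """Raised when a document changed since the client read it."""


class Store:
    """Toy document store (hypothetical) with per-document version numbers."""

    def __init__(self):
        self._docs = {}  # key -> (version, value)

    def load(self, key):
        # Version 0 means "this document does not exist yet".
        return self._docs.get(key, (0, None))

    def save_changes(self, batch):
        """batch is a list of (key, expected_version, new_value); all or nothing."""
        # First pass: verify every expected version before touching anything.
        for key, expected, _value in batch:
            current, _old = self._docs.get(key, (0, None))
            if current != expected:
                raise ConcurrencyError(
                    f"{key}: expected version {expected}, found {current}"
                )
        # Second pass: apply all the writes and bump the versions.
        for key, expected, value in batch:
            self._docs[key] = (expected + 1, value)
```

Two users who load the same document during their “think time” will both hold the same version. The first save succeeds and bumps the version, the second one fails with ConcurrencyError and can reload and retry, all without holding a transaction open on the server.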

RavenDB’s usage scenario was meant to make the common operations easy and the hard stuff possible. I think that we got it right and ended up with an API that is functional, highly performant and one that has withstood the test of time very well.

The iterative design process: Query parameters example

time to read 4 min | 660 words

When we start building a feature, we often have a pretty good idea of what we want to have and how to get there. And then we actually start building it and we often end up with something that is quite different (and usually much better). It has gotten to the point where we aren’t even trying to do hard specs and detailed design at anything beyond the exploratory levels. For example, in the design of RavenDB 4.0, there was not even a mention of RQL. That ended up being a very late addition to the codebase, but it improved RavenDB significantly. On the other hand, the low level mechanisms of zero copy documents from Voron all the way to the network were designed up front, but only at a fairly high level.

In this post, I want to talk about query parameters in RavenDB. Actually, let me be more specific: we have query parameters, but what we don’t have (or rather, didn’t have, because that will be merged in by the time you read this post) is the ability to run parameterized queries from the studio. We always meant to have that capability, but we ran out of time with the 4.0 release. As we are gearing up for the 4.1 release, we are clearing the table of the major-minor issues (major in terms of impact, minor in terms of the amount of work required). Query parameters in the studio are one such example. Here is what this looks like:

image

My first thought was to just build something like this:

image

Give the user the ability to define arguments and be done with it. The task was assigned to one of our developers and I expected to get a PR in a short while.

This particular developer has a tendency to consider not just the task at hand but also other aspects of the problem. He didn’t want the user to have to manually specify each argument, since that has poor ergonomics. Instead, he wanted the studio to figure it out on its own and help the user. So the first thing he did was detect the arguments (regex: “\$\w+”) and present them in the grid. Then there was the issue of how to deal with edits, etc. Then he ran into another problem: types. Query parameters can be more than just strings, they can be any JSON data type.
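The detection step is simple enough to sketch. The regex is the one quoted above; everything else (Python rather than the studio’s own code, the detect_parameters name, the sample query) is an assumption for illustration only.

```python
import re

# The pattern quoted in the post: a dollar sign followed by word characters.
PARAM_PATTERN = re.compile(r"\$\w+")


def detect_parameters(query_text):
    """Return the distinct parameter names in order of first appearance."""
    seen = []
    for name in PARAM_PATTERN.findall(query_text):
        if name not in seen:
            seen.append(name)
    return seen
```

For a query such as "from Orders where Company = $company and Total > $min", this returns the two names $company and $min. Note that a bare regex will also pick up anything that looks like a parameter inside a string literal, which is one reason a grid driven purely by detection needs extra care.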

Here is what he came up with:

image

Instead of having to define the query parameters in a separate location, just put them right in. Having the parameters grid involves pointing and clicking with the mouse, entering possibly complex values (such as long arrays) and in general much more work than just having them right above the query.

Note that this is a studio only feature; queries from the client API already have ways to specify arguments properly. So the next question is how we are going to handle passing the arguments to the server. Remember, this is only in the studio, so we can take quite a few shortcuts. In this case, we’ll simply snip the entire first section of the query text (which contains the query parameters). We can do that by going from the start of the query to the first from or declare keyword. We do a basic pre-processing to turn “$name = …” into “results.$name = …” and then just execute this code in the browser, giving us a JS object with all the parameters that we can then send to the server.
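The text-processing half of that shortcut can be sketched as follows. This is Python rather than the studio’s actual browser code, the function names are made up, and it assumes that the from / declare keyword starts a line, which is stricter than what is described above but avoids tripping on the word “from” inside a parameter value. Executing the rewritten script is left out, since that part happens in the browser.

```python
import re

# Assumption: the query proper starts at the first line beginning with
# "from" or "declare"; everything before it is the parameters section.
QUERY_START = re.compile(r"^\s*(from|declare)\b", re.IGNORECASE | re.MULTILINE)

# Matches "$name =" at the start of a line, keeping any leading whitespace.
ASSIGNMENT = re.compile(r"^(\s*)(\$\w+\s*=)", re.MULTILINE)


def split_parameters(text):
    """Split the editor text into (parameters section, query text)."""
    match = QUERY_START.search(text)
    if match is None:
        return "", text
    start = match.start(1)  # position of the keyword itself
    return text[:start], text[start:]


def to_results_script(params_section):
    """Turn '$name = ...' lines into 'results.$name = ...' lines."""
    return ASSIGNMENT.sub(r"\1results.\2", params_section)
```

Given a parameters section with the two lines $name = "Gary" and $min = 10, the rewritten script assigns results.$name and results.$min, and the remaining text starting at from is what gets sent as the query.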

The next stage is to make this discoverable, by detecting parameters whose value is not provided and giving the user a quick fix to add them.

Dealing with massively distributed data flows

time to read 4 min | 610 words

Imagine that you are the owner of Gary’s Shoes, and that you want to get data from all of your multitudes of stores into a centralized location. You’ll use that data to make decisions, predict future trends, etc. Given that each store must operate independently, you have a server in each location that will push up its changes to (and get updates from) the HQ cluster. You can see an example of this kind of setup in this post.

This works quite well, but it does require the user to be aware of a potential issue. When you have a massively distributed data flow process set up, you also need to pay attention to the quiet in the noise. What do I mean by that?

One of our customers has RavenDB deployed to tens of thousands of locations worldwide. At any given time, you are going to have at least some of those locations unavailable. In some locations, part of closing down for the day means literally flipping the master switch on electricity for the entire building. In others, you might have someone tripping over the router or have some local or regional network outage.

Part of the strategy for dealing with such a data set, coming from so many separate locations, is the need to monitor when we aren’t getting data. The fact that in most of our locations we have near real time data is very powerful for the business. But you also need to see where you aren’t getting the data from and set up proper alerts and monitoring for the missing data. From a business perspective, it is also advisable to surface that kind of detail all the way to the user. If you are going to be ordering inventory for the stores in a particular state, but the two major stores in the area are down because of a network issue and have been down for two days now, you want to be aware of that and realize that you are working on out of date data.

To be honest, the issue isn’t so much about two days of lag in the case of a once in a blue moon type of error. In the scenario outlined above, in pretty much all business scenarios that I can think of, you won’t really see any impact on the decision making of the organization.

The killer is when you have some sort of a problem that goes on for a while. A DNS update that was missed because of a bad DNS cache policy, for example. Now your updates to HQ go into the void on a consistent basis. On the other hand, everything else continues to function properly both locally and for HQ. If this isn’t accounted for, it is easy to miss this for a long period of time. I have seen such a case that was only discovered when the year’s end numbers didn’t quite match up with what they were supposed to be. Given that this was the second year in a row this happened, the investigation found that some network issue had indeed caused a very long term topology failure. This was actually properly reported, in a log file that no one ever read.

Lesson learned: make sure that part of your data flow strategy accounts for such things and brings them to the users’ attention. Actually resolving the issue was a network configuration change that took minutes and the entire dataset was synchronized within a few hours afterward. But finding out that there was even a problem took effectively forever.
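The check itself does not need to be sophisticated. As a minimal sketch in Python, assuming HQ keeps track of when each location last delivered data (the last_seen mapping, the function name and the threshold are all made up for this illustration), finding the quiet locations looks like this:

```python
from datetime import datetime, timedelta


def find_silent_locations(last_seen, now, max_silence):
    """Return (location, silence) pairs for locations quiet for too long.

    last_seen maps a location name to the time HQ last received data from it.
    The result is sorted with the longest silence first.
    """
    silent = [
        (location, now - seen)
        for location, seen in last_seen.items()
        if now - seen > max_silence
    ]
    return sorted(silent, key=lambda pair: pair[1], reverse=True)
```

Run on a schedule, with the result pushed to a dashboard or an alert rather than a log file, this surfaces both the store that has been offline for two days and the one that has been sending updates into the void for a month.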

Unexpected use cases for RavenDB in IoT

time to read 3 min | 558 words

We designed RavenDB to be a server side database, to be used to run large scale business applications. Surprisingly for us, there is a large group of users that have taken RavenDB and actually run it as part of their deployed systems. In other words, instead of having a single large RavenDB cluster they will typically deploy many (hundreds in the small cases, tens of thousands to millions in the large cases) RavenDB instances across a wide variety of locations.

Part of that is the fact that RavenDB can be embedded inside an application quite easily. That means that we don’t need complex setup or administration. You can just use RavenDB from your application and everything Will Just Work. Another factor is the fact that you can run RavenDB on very low end machines, including 32-bit machines, ARM SoCs, etc.

One use case was a point of sale system that had to spec out its hardware a decade in advance and had to deal with existing installations that were still running hardware from 10 years ago (with little desire to upgrade). Another use case was deploying RavenDB as part of an industrial robot package, with RavenDB installed on a 32-bit ARM system on chip that controls the robot.

That kind of deployment pattern leads to interesting requests. For example, several of our customers need ad hoc replication in a location. So all the nodes in a particular physical location will join together into a full mesh of replicated nodes. This gives us high availability in a particular location with any node in the network being able to service any request across the entire location. Boot up a new machine, wait a bit for the rest of the network to update it and you are good to go. This also helps when you consider your machines to be unreliable (because they are old, beaten down and generally minimally maintained).

Another scenario with the need for dynamic topologies is the deployment of RavenDB as a set of independent nodes that need to report to some sort of headquarters. This is easy to do by defining external replication or ETL on the node and having it send all the relevant data to a central location for processing. This way, you get a cheap “always available” local node but can still have a global view of your data. I posted about something similar in the past, if you care for the details.

We are now looking for additional features to serve this kind of deployment. In particular, we are interested in making it easy to share data and generate analytics across widely distributed and separated sets of instances. One of the things that we are currently considering is some form of integration with the cloud. For example, consider Amazon Athena, which allows you to run analytics queries on files residing in S3. We can define ETL processes that would upload the data from RavenDB as it is changed on each individual node. This way, you have each node pushing data to the cloud and a central location that can run live analytics on the data.

What are your thoughts on this? And what other features do you think will serve this kind of scenario?

Product Release Postmortem: Things You Should Never Do, Part II

time to read 18 min | 3409 words

This post is the text version of a presentation I gave a few weeks ago. It is a reference to this classic post by Joel.

In 2015, I decided that we needed to reboot RavenDB. I did that with the full understanding that this is going to be a huge task, including knowing that it will be bigger than what I can project, even if I take this line of thinking into account.

RavenDB 1.0 was written a decade ago. It was written because it didn’t leave me alone and I wanted to get it out of my head. At the time, I was focused more on getting it out the door (and my head) and was taking shortcuts in the implementation. That allowed me to cut down dramatically on the amount of work that is involved in it. At the same time, this put some constraints on the implementation and architecture. The most obvious one was the reliance on Esent, which tied us to Windows. C# as the implementation language, to a lesser extent, also had the same issue until .NET Core. (Yes, I’m aware of Mono, I have no idea how people managed to run anything beyond hello world on it. We tried porting RavenDB to Mono multiple times, and I still bear the scars.)

I went back and looked at our release notes: in literally every major release, we have spent a significant amount of time and effort on “performance optimizations”. In January of 2015 we had a few sprints that were dedicated to just this issue. We went down to assembly code in some cases, analyzed our hotspots and optimized things in a very serious manner. We got some amazing performance improvements, reducing the runtime by orders of magnitude in some cases. But it still felt like we were hitting a limit. What is more, experience from customers in production showed us that there were a number of cases where we ran into problematic behavior. This mostly happened on large / complex projects. And nearly all those issues were related in one way or another to memory and the GC.

Our indexing, for example, would be reading data from disk into memory. That was meant to save disk I/O during indexing, and included pretty smart prefetching and monitoring behavior. It also had the side effect of loading documents (which can be large) into managed memory and holding on to them long enough to push them into Gen1 and Gen2. Then they would be indexed and need to go away. But given that they were pushed to a higher generation… that meant a more expensive collection cycle.

RavenDB was created before the pervasive use of fast disks, and it turns out that in some cases, reading the data from disk was actually faster than parsing it using JSON.Net. In other words, our “I/O bound” process of reading documents was actually dominated by the time it took to parse the JSON text. That does not include the costs of actually cleaning up this memory. Complex JSON documents can have a lot of objects, and the cost of GC rises with the number of objects that are being tracked. These were pretty fundamental problems, which I didn’t think we could fix in a piecemeal fashion.

That time also coincided with a peak in the number of support incidents that we got. Unlike many other open source projects, we treat support as a cost center, not a revenue center. In other words, we don’t want to have more support, that isn’t how we want to make money. Being a database, we were frequently at the heart of things and our customers and users are very sensitive to any issue that might arise. I’m painting somewhat of a bleak picture, I’m aware. It wasn’t nearly that bad from the point of view of any particular customer. But on aggregate, from our point of view, it felt like a nasty game of whack a mole. As soon as we provided a solution to one customer’s issue, another would pop up, somewhat related but just different enough to not be fixed by the previous change. These weren’t regressions, mind. These were just a lot of places where the changing times violated some of our core assumptions.

Toward the end of 2015, I sat down and really thought about what we needed and were missing. This was the situation as I saw it.

image

There was also the issue that we have learned a lot over the years. We built Voron (our storage engine) from the ground up, we had a lot of experience running in production and we knew what kind of tasks our customers were using us for. I kept thinking that I wished I had a time machine and could do things over properly. Given that my time machine is still in the shop, I decided that we had two options:

  1. Minor fixes along the way – slowly improving our behavior as we stride toward the desired architecture and usage.
  2. Break it all – essentially start from scratch, with a new architecture and write it the way we want it to be written.

The obvious choice was to do this slowly. The problem was that I really couldn’t think of a good way to actually achieve that. The kind of changes we wanted to make started from replacing the most fundamental structure we had, how we represent JSON in our document database and got more complex from there. We wanted to change how we store data on disk, how we index data, how we … literally every single feature that we had was going to be transformed in some way.

We also had additional issues. The Windows only limitation was really hurting us and we really wanted to get a good Linux story going. The support burden was also at the very top of my mind as we considered what to do. In the end, we came up with the following decisions:

  • We don’t require backward compatibility. Either on the server side or client side.
    • That was the hardest decision, but it meant that we could actually tackle some of the biggest issues freely and without constraint.
    • That meant that we wanted to keep the same feeling, but be able to make changes to corners of the API that atrophied.
  • Support cost and simplified operations as a primary concern.
    • This meant that, at the design level, we took into account debugging considerations.
  • Order of magnitude performance improvement across the board.
    • Otherwise, it isn’t worth the effort.
  • Cross platform from the get go.

That was in Sep 2015. I sat down and wrote a design document that outlined the new architectural approach, spiked a few things and then we were off to the races. I blogged all about the process extensively, so I’m not going to repeat that.

We decided to use DNX (which became .NET Core) at a very early stage. Initially, I don’t believe that we even had a debugger, and most of our builds had to be triggered from the command line. I guess that if you are going to make a risky decision, you might as well make a few others…

I’ll say that I made a lot of preparation to fail up front. Part of the reason we went with DNX was that we knew that worst case scenario, we could spend a few days and get it working on the full .NET framework if we had to. I took this step with a lot of backward glances to make sure that we won’t get lost.

Alongside our experience in supporting RavenDB, we also ran a UX study and combed all the incident reports we generate from support calls. The idea was to take as much time as necessary to get things as right as we could. The studio change between 3.5 and 4.0 is massive, and was driven by getting a talented professional to design each part of the UI, guided by real world UX study and analysis. We kept asking “where does it hurt?” and whenever we found a cause of pain we worked to alleviate it.

Some of our guiding principles during that phase of the project were:

  • Cross platform from the get go.
    • We couldn’t afford to port it midway through. Too complex and prone to failure.
  • OWN the stack.
    • We don’t want to use any components that we don’t have good visibility into and the ability to work with.
    • In particular, anything that is a core competence should be owned and built by us. For our scenario, that means primarily the storage engine.
  • Build for performance.
    • I wasn’t kidding about requiring x10 performance improvement. We had one or two devs at all times running benchmarks and fixing the performance of every completed feature.
  • Build for operations.
    • Each and every design decision should be considered in light of its operational behavior.
    • In particular, we excised any feature that relied on hard to figure out technology or integration (I’m looking at you, Windows Auth).
    • This included changing the design of the software so a core dump would make it easier to figure out what is going on. We also explicitly opened up a lot of the internal behavior as debug endpoints and plugged them into the studio so operators will have greater visibility. As an aside, that was very helpful in figuring out our performance bottlenecks and we worked to improve that part of the project as we strove for ever faster performance.
  • Reducing the support burden as a major goal.
    • A lot of the previous points tie into this. But this is also where we combed over any issue that had a “user misconfigured / misused” cause and built alerts directly into RavenDB to give the user early warnings about common issues.
  • We defined a set of common scenarios. Reading / writing documents, for example, and then we spent months on designing the whole system so it will work to make these fast, seamless and easy.

A good example of that is how we store documents in RavenDB now. We have our own binary format that allows us to avoid parsing the document when reading from disk, plays nicely with memory mapped files (which is how Voron, our storage engine, works) and can effectively allow us to hand a pointer to a memory mapped buffer and start working with that as a JSON document without:

  • Allocating any managed memory
  • Parsing JSON
  • Requiring caching / pre-fetching, etc.

We spent a lot of time thinking about what we want to do, and then we looked into how the operating system expects us to behave. The idea is that if we play to the operating system’s expectations, we can reap a lot of benefits from the OS’ own behavior. This is how RavenDB handles loading data to memory. We let the operating system handle it and just make sure that our own behavior is both predictable and applicable to the OS’ optimizations.

I mentioned that GC was the bane of our existence, right? We moved a lot of the memory management in RavenDB to unmanaged code and handle that explicitly. That gave us the advantage that we know a lot more about how we should expect to use the memory and can spend the time to make this highly optimized.

On the debugger side of things, we made some changes to the design of RavenDB with the intent to make it easier to debug and analyze core dumps. For example, most of the long running threads are named, so it is easy to figure out who they belong to (and not just what they are currently doing). For that matter, long running tasks are using synchronous mode, specifically because it means that we can drop in the debugger / core dump and look at their state. This is much harder to do with async methods. You might have noticed that I mentioned core dumps a few times, right? These are essential to figuring out what is going on with your software on production systems. We learned a lot about production debugging over the years and with RavenDB 4.0 we took steps to make things easier. For example, many data structures in RavenDB have an extra field called tag that is there specifically to provide debugging information about the value if we are looking at the value in the debugger.

An obvious question for this project was whether we should still stay on .NET or should we move to an unmanaged language. I considered this seriously, with Rust, C/C++ and Go being the top contenders as the implementation language. I decided to stay with .NET for several reasons. Productivity was right there at the top. We already had a team that was well versed in .NET, and while that isn’t a blocker, it was a consideration. The tooling around .NET is leagues ahead of anything else that I have seen. That includes both write time (where Rider / ReSharper rules) and debug time (I found nothing remotely close to Visual Studio for debugging non trivial code easily). The cross platform angle, which was the most serious issue for us, was resolved with .NET Core.

Rust wasn’t mature enough at the time (2015) and even today I think that a language that prides itself on being hard to learn isn’t a good choice. C++ was a strong contender, but the slow compilation times were an issue. The tooling is similar, but inferior in many respects. Cross platform C++ is possible, and modern C++ is very different from what I remember. However, it comes with a very high degree of complexity and would take a lot of time to master again properly. C (distinct from C++) is a much simpler language. It still has the compilation speed issues, but the language is much simpler. I think that if it had a defer mechanism builtin it would be a much nicer language. Go was ruled out because if I’m going to be writing everything from scratch, I might as well go all the way down to C’s level and not stop with something that still has GC pauses.

The choice of C# and CoreCLR has been vindicated. The project team and the community at large put a large emphasis on performance and we keep getting more and better ways to handle low level details while still being able to use higher level concepts when needed. And the tooling… dear God. I routinely work with other platforms, testing things out, but there is nothing that comes close to the toolsets that are available for C#.

An interesting wrinkle with the 4.0 release was that we started it before the 3.5 release was even out. For a while, we had a small team working on the foundations of 4.0 while the rest of us were busy hammering in the last details of RavenDB 3.5.

As soon as we had the bare minimum to go (basically, it compiled and could save a single document and even regurgitate it back up again) we started heavy parallelization of the work. We had a team working on indexes while another was dealing with (even at this early stage) performance and another working on the user interface. In that time frame, we hired a few more people and could really see the benefits of all of the separate teams working in concert. One of the priorities of this method was to get to a demoable state. In fact, at some point we had over 30% of the people working on either the UI directly or UI related infrastructure.

One of the things we kept hearing back is that the UI and the insight it provided into what is going on inside the database were crucial for our users. It also helped us a lot as we developed RavenDB to get to play with it directly and see things in an easy manner. The UI has been at once one of the most trivial of changes and the most profound. On the one hand, we didn’t really make any significant architectural changes in the UI. On the other hand, we re-wrote most of it with the aid of a UX study and a real professional at the helm. That gave us a lot of visible polish that underscores the amount of work that happened in the engine.

How did all of that turn out?

  • I initially thought it would last a year to 15 months, with an expected due date of Dec 2016. That was with a team size of about 25 people.  Work started in Sep 2015.
  • As it turns out, RavenDB accumulated a lot of features in the years it spent in production. We had to evaluate each of them, see how it would fit into our architecture and get it ported. That took a lot of time, especially because in many cases we took the time to change the approach we had for the feature completely.
  • By mid 2016 I had already changed the schedule to Jun 2017.
  • Close to the end of 2016 we released RavenDB 3.5. This freed up some people to work on the 4.0 release, but also meant we had higher than usual support calls while customers integrated the new release.
  • The actual 4.0 release happened in Feb 2018. So just about 30 months from the start, or about double the time I expected it to take.
  • We had to cut some features out to make the 4.0 release. All of them are back in the 4.1 release, scheduled for next month.
    • This means that getting back to the same place took us 3 years. But we now have a lot of extra features.
    • Most of the missing features were pretty minor, though, and rarely used.

What did all of that gain us?

  • Performance: Single node. Over 100,000 writes / sec and over 1,000,000 reads / sec in our benchmarks.
    • Real world users report performance boosts of 20x to 52x.
  • Support call duration dropped from days / weeks to about 2 – 4 hours.
  • Cross platform on Windows, Linux, ARM and MacOSX.
    • We are now deployed to production on Raspberry PIs, because we are the fastest real database on that kind of hardware.

We were over a year overdue, and even with the deadline being extended several times we had to cut some features to actually make the cut for release. The general acceptance of the new release by the community has been a roaring success. We exceeded our own goals for the project, even if we took a lot longer than expected to get there.

Now, for some additional thoughts. We didn’t really re-write the whole thing from scratch. Instead, we had a lot of code that we could at least partially reuse. The storage engine was ported, not re-written, for example. However, we changed architectures in a pretty significant way. For example, the format and manner of working with JSON changed entirely between these two releases. We are a JSON document database. As you can imagine, we pretty much had to modify everything as a result of that.

We didn’t design the whole thing from the start. We had a rough outline and we let things roll from there. As a result of the new architecture and expectations, by the time we hit a particular feature we were able to utilize what we had already learned about how to work with the new architecture to improve things. We also weren’t afraid of changing things multiple times. Authentication had several major design changes midway through, and it ended up so much simpler than what we had before. Even pretty late in the game, we still made significant changes. The RQL support, having a SQL like querying language, came about in the last 20% of the project.

That was a huge change, and I got a lot of “here comes the crazy train again” feedback. This is probably one of the reasons we were delayed by another few months. But it was worth it by far. Basically, because we were able to give up on backward compatibility, we were able to move quickly and change stuff as we wished. We knew that we wouldn’t have another chance like that for another decade, so we tried to get the big changes done.

In retrospect, I think it worked quite well. I’m really proud of how RavenDB 4.0 turned out.

Spanification in RavenDB

time to read 2 min | 333 words

We are nearly done with RavenDB 4.1. There are currently a few minor things that we are still handling, but we are gearing up to push this to our production systems as part of our usual test matrix. Naturally, this means that we are already thinking about what we should do next.

There is a whole bunch of big ticket items that we want to look at, but the most important one is likely to garner very little attention from the outside. We are going to take advantage of the new Span<T> API throughout the product. This is something that I really want to get to, since we have a lot of places where we touch native memory and memory mapped sections, and in general pay a lot of attention to manual memory management. There are several cases where we had to copy data from unmanaged memory to managed memory just to make some API happy (I’m looking at you, Stream).

With the Span<T> API, that is no longer required, which means that we can usually just hand a pointer to the network that is mapped directly to a file and reduce the amount of work we need to do significantly.  We are also going to go over the codebase and see where else we can take advantage of this behavior. For example, moving our code to System.IO.Pipelines opens up some really interesting scenarios for simplifying code and reducing overhead.

We are going to apply lessons learned about how we actually manage memory as part of that, so just calling it Span<T> is a bit misleading. The underlying reasoning is that we want to simplify both I/O and memory management, which are very closely tied together. This shouldn’t actually matter to users, except that the intent is to improve performance once again.
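
To make the "no copy" point concrete, here is a minimal sketch (not RavenDB code; the `SpanSketch` class and its `Checksum` method are made up for illustration). A method that accepts `ReadOnlySpan<byte>` can run over a managed array, a slice of one, or stack memory without an intermediate copy. Wrapping a raw native pointer works the same way through the `Span<T>(void*, int)` constructor, but that requires an unsafe context, so it is left out here.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SpanSketch
{
    // Works over any contiguous memory; the caller decides where it lives.
    public static int Checksum(ReadOnlySpan<byte> data)
    {
        int sum = 0;
        foreach (byte b in data)
            sum += b;
        return sum;
    }

    public static int[] Demo()
    {
        byte[] managed = { 1, 2, 3, 4 };

        // Stack memory, never touches the managed heap.
        Span<byte> onStack = stackalloc byte[4];
        onStack.Fill(2);

        return new[]
        {
            Checksum(managed),              // whole managed array
            Checksum(managed.AsSpan(1, 2)), // a slice of it, no copy made
            Checksum(onStack)               // stack memory
        };
    }
}
```

The same signature serves all three callers, which is the property that removes the "copy it into a byte[] to make the API happy" step.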

Living in the foundations, missing all the amenities

time to read 2 min | 377 words

We talked to a candidate recently with a CV that included topics such as Assembly, SQL and JavaScript.  The list of skills was quite eclectic and we called the candidate to hear more about them.

The candidate completed a two-year degree focused on the foundations of development, but it looked like whoever designed it was looking primarily to give a good foundation more than anything else. In other words, the end result is someone who can write SQL queries, but has never built a data driven application, who knows (about? I’m not really clear at what level that was) assembly, but has never written a real application. It doesn’t sound bad, I know, but it was like moving into a new house just after the contractor is done with the foundation. Sure, that is a really important part, but you don’t even have walls yet.

In 1999, I did a year long course that was focused on teaching me C and C++. I credit this course for much of my understanding of the basics of programming and how computers actually work. It was an eye opening experience. Still, I wouldn’t hire my 1999 self. As I recall, that guy (can I deny knowing him?) wrote the following masterpieces:

  • sparse_matrix<T> in C++ templates that used five (5!) levels of pointer indirection!
  • The original single page application. I wrote an entire BBS system using a single .VBS script that used three levels of recursive switch statements and included inline HTML, JS and VB code!

These are horrible things to inflict on an innocent computer, but that got me started in actually working on software and understanding things beyond the basics of syntax and action. I usually take the other side, that people are focused far too much on the high level stuff and do not pay attention to what is actually going on under the hood. This was an interesting reversal, because the candidate was the opposite. They had some knowledge about the basics, but nothing to build upon that yet.

And until you actually build upon the foundation, it is just a hole in the ground that was covered in some cement.

Modeling Milk: A discussion on domain modeling

time to read 2 min | 342 words

I recently had a discussion at work about the complexity of modeling data in real world systems. I used the example of a bottle of milk in the discussion, and I really like it, so I thought it would make for a good blog post.

Consider a supermarket that sells milk. In most scenarios, this is not exactly a controversial statement. How would you expect the system to model the concept of milk? The answer turns out to be quite complex, in practice.

To start with, there is no one system here. A supermarket is composed of many different departments that work together to achieve the end goal. Let’s try to list some of the most prominent ones:

  • Cashier
  • Stock
  • Warehouse
  • Product catalog
  • Online

Let’s see how each of these thinks about milk, shall we?

The cashier rings up a specific bottle of milk, but aside from that, they don’t actually care. Milk is fungible (assuming the same expiry date). The cashier doesn’t care which particular milk carton was sold, only that the milk was sold.

The stock clerks care somewhat about the specific milk cartons, but mostly because they need to make sure that the store doesn’t sell any expired milk. They might also need to remove milk cartons that don’t look nice (crumpled, etc).

The warehouse cares about the number of milk cartons that are in stock on the shelves and in the warehouse, as well as predicting how much should be ordered.

The product catalog cares about the milk as a concept, the nutritional values, its product picture, etc.

The online team cares about presenting the data to the user, mostly similar to the product catalog, until it hits the shopping cart / actual order. The online team also does prediction, based on past orders, and may suggest shopping carts or items to be purchased.

All of these departments are talking about the same “thing”, or so it appears, but it looks, behaves and is acted upon in very different ways.
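
One way to picture this is to give each department its own model of milk instead of a single shared `Milk` class. The sketch below is only an illustration under stated assumptions: every type name, every property, and the idea that a shared `Sku` string ties the models together are hypothetical, not taken from any real system.

```csharp
using System;

// Cashier: milk is fungible, only the product code and price matter.
public sealed class CashierLineItem
{
    public string Sku { get; set; }
    public decimal Price { get; set; }
}

// Stock clerks: individual cartons matter, mostly because of expiry.
public sealed class ShelfCarton
{
    public string Sku { get; set; }
    public DateTime ExpiresOn { get; set; }
    public bool LooksDamaged { get; set; }

    public bool IsExpired(DateTime today) => today.Date > ExpiresOn.Date;
}

// Warehouse: quantities on the shelves and in storage.
public sealed class StockLevel
{
    public string Sku { get; set; }
    public int OnShelves { get; set; }
    public int InWarehouse { get; set; }

    public int Total => OnShelves + InWarehouse;
}

// Product catalog: milk as a concept, not as a physical carton.
public sealed class CatalogEntry
{
    public string Sku { get; set; }
    public string Name { get; set; }
    public string NutritionalValues { get; set; }
    public string PictureUrl { get; set; }
}
```

None of these types can stand in for another, even though all four describe "milk".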

Working with legacy embedded types inside documents

time to read 2 min | 338 words

Databases hold data for long periods of time. Very often, they keep the data for longer than a single application generation. As such, one of the tasks that RavenDB has to take care of is the ability to process data from older generations of the application (or even from a completely different application).

For the most part, there isn’t much to it, to be honest. You process the JSON data and can either conform to whatever there is in the database or use your platform’s tooling to rename it as needed. For example:

There are a few wrinkles still. You can use RavenDB with dynamic JSON objects, but for the most part, you’ll use entities in your application to represent the documents. That means that we need to store the type of the entities you use. At the top level, we have metadata elements such as:

  • Raven-Clr-Type
  • Raven-Java-Class
  • Raven-Python-Type
  • Etc…

This is something that you can control, using the Conventions.FindClrType event. If you change the class name or assembly, you can use that to tell RavenDB how to treat the old values. This requires no changes to your documents and only a single modification to your code.

A more complex scenario happens when you are using polymorphic behavior inside your documents. For example, let’s imagine that you have an Order document, as shown on the right. This document has an internal property called Payment which can be any of the following types:

  • Legacy.CreditCardPayment
  • Legacy.WireTransferPayment
  • Legacy.PayPalPayment

How do you load such a document? If you try to just de-serialize it, you’ll get a deserialization error. The type information about the polymorphic property is encoded in the document and you’ll need these legacy types to successfully load the document.

Luckily, there is a simple solution. You can customize the JSON serializer like so:

And the implementation of the binder is straightforward from that point:

In this manner, you can decide to keep the existing data as is or migrate it slowly over time.
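
The core of such a binder is a lookup from the type name stored in the document to a type that still exists in the application. The sketch below shows only that lookup, with no serializer wiring. The `Current` namespace, the `LegacyTypeMap` class and its `Resolve` method are hypothetical names for illustration. In a Json.NET based setup this logic would typically sit inside the `BindToType` method of an `ISerializationBinder`, but how that is registered depends on the RavenDB client version, so treat that part as an assumption to check against your version.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical current-generation types that replace the Legacy.* ones.
namespace Current
{
    public class CreditCardPayment { }
    public class WireTransferPayment { }
    public class PayPalPayment { }
}

public static class LegacyTypeMap
{
    // Stored (legacy) type name -> type to instantiate today.
    private static readonly Dictionary<string, Type> Renamed = new Dictionary<string, Type>
    {
        ["Legacy.CreditCardPayment"] = typeof(Current.CreditCardPayment),
        ["Legacy.WireTransferPayment"] = typeof(Current.WireTransferPayment),
        ["Legacy.PayPalPayment"] = typeof(Current.PayPalPayment),
    };

    // Returns the mapped type for a legacy name. Any other name falls back
    // to normal resolution, which yields null when the type is unknown.
    public static Type Resolve(string storedTypeName)
    {
        return Renamed.TryGetValue(storedTypeName, out var mapped)
            ? mapped
            : Type.GetType(storedTypeName);
    }
}
```

Because only reading is redirected, documents that still carry the legacy names keep loading, and they can be rewritten with the new names whenever they are next saved.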

Using GOTO in C#

time to read 2 min | 309 words

After talking about GOTO in C, I thought that I should point out some interesting use cases for using GOTO in C#. Naturally, since C# actually has proper methods for resource cleanup (IDisposable and using), the situation is quite different.

Here is one usage of GOTO in RavenDB’s codebase:

This is used for micro optimization purposes. The idea is that we put the hot spots of this code first, and only jump to the rare parts of the code if the list is full. This keeps the size of the method very small, which allows us to inline it in many cases and can substantially improve performance.
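
The shape being described can be sketched like this. It is not the RavenDB code, just a minimal illustration of the pattern: `FastList<T>`, `GrowAndAdd` and the initial capacity of 4 are all made up for the example.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical list type, for illustration only.
public sealed class FastList<T>
{
    private T[] _items = new T[4];
    private int _count;

    public int Count => _count;

    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_count)
                throw new ArgumentOutOfRangeException(nameof(index));
            return _items[index];
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(T item)
    {
        // Hot path first: a single comparison and a store.
        if (_count == _items.Length)
            goto Grow;

        _items[_count++] = item;
        return;

    Grow:
        // Rare path: kept out of line so Add itself stays tiny.
        GrowAndAdd(item);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private void GrowAndAdd(T item)
    {
        Array.Resize(ref _items, _items.Length * 2);
        _items[_count++] = item;
    }
}
```

The goto moves the rare case to the end of the method and the resize logic lives in a separate, non-inlined method, so the body the JIT has to consider for inlining is just the compare, the store and a call.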

Here is another example, which is a bit crazier:

As you can see, this is a piece of code that is full of gotos, and there is quite a bit of jumping around. The answer to why we are doing this is, again, performance. In particular, this method is located in a very important hot spot in our code, as you can imagine. Let’s consider a common usage of this:

var val = ReadNumber(buffer, 2);

What would be the result of this call? Well, we asked the JIT to inline the method, and it is small enough that it would comply. We are also passing a constant to the method, so the JIT can simplify it further by checking the conditions. Here is the end result in assembly:

Of course, this is the best (and pretty common for us) case where we know what the size would be. If we have to send a variable, we need to include the checks, but that is still very small.
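
To show the kind of method this reasoning applies to, here is a sketch of a `ReadNumber` written in that style. It is an assumption-laden stand-in, not the method from the codebase: the `ReadOnlySpan<byte>` parameter, the little-endian layout, the set of fast-path sizes and the `ReadSlow` helper are all invented for illustration.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical reader, for illustration only.
public static class NumberReader
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static long ReadNumber(ReadOnlySpan<byte> buffer, int size)
    {
        long val = 0;

        // With a constant size at the call site, an inlining JIT can
        // discard the comparisons that cannot be true.
        if (size == 1) goto One;
        if (size == 2) goto Two;
        if (size == 4) goto Four;
        goto Rare;

    Four:
        val |= (long)buffer[3] << 24;
        val |= (long)buffer[2] << 16;
    Two:
        // Execution falls through from the label above into this one.
        val |= (long)buffer[1] << 8;
    One:
        val |= buffer[0];
        return val;

    Rare:
        return ReadSlow(buffer, size);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static long ReadSlow(ReadOnlySpan<byte> buffer, int size)
    {
        if (size < 1 || size > 8)
            throw new ArgumentOutOfRangeException(nameof(size));

        long val = 0;
        for (int i = size - 1; i >= 0; i--)
            val = (val << 8) | buffer[i];
        return val;
    }
}
```

Unlike switch sections, plain labels allow execution to fall from one into the next, which is what lets the 4 byte case reuse the 2 byte and 1 byte code.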

In other words, we use GOTO to control the generated machine code as much as possible, explicitly favoring performance and friendliness toward the machine at the expense of readability.
