Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 2 min | 399 words

Terraform is all the rage in the DevOps world now, and we decided to take a look. Terraform itself is written in Go, but it uses a pretty nifty plugin system to work with additional providers. A Terraform provider is an executable that you run, and it communicates with the Terraform controller using gRPC.

What happens is that Terraform will invoke the executable and then communicate with it over gRPC, using connection details that the provider writes to its standard output. That means that you don’t need to write your provider in Go; any language will do. Indeed, Samuel Fisher did all the hard work of making that possible. Please note that this post likely makes no sense if you didn’t read Samuel’s post first.

However, his work assumes that you are running on Linux, and there are a few minor issues that you have to deal with in order to get everything working properly.

For safety, Terraform uses TLS for communication and ensures that the data is safe in transit. The provider will usually generate a self signed certificate and provide the public key on the standard output. I would like to understand what threat model they are trying to protect against, but at any rate, that method should be fairly safe against anything that I can think of. The problem is that there is an issue with the way Samuel’s C# Terraform provider handles the certificate on Windows. I sent a pull request to resolve it; it’s just a few lines, and it is quite silly.

The next challenge is how to make Terraform actually execute your provider. The suggestion from Samuel is to copy the provider to where Terraform will cache it and use it from there. I don’t really like that approach, to be honest, and it looks like there are better options.

You can save a %APPDATA%\terraform.rc file with the following content:
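The snippet itself isn’t included in this extract. A minimal sketch of what such a file might look like, assuming the provider address and the F:\terraform\providers layout used below, would be:

provider_installation {
  filesystem_mirror {
    path    = "F:/terraform/providers"
    include = ["example.com/*/*"]
  }
  direct {
    exclude = ["example.com/*/*"]
  }
}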

This will ensure that your provider is loaded from the local directory instead of being fetched over the network. Finally, there is another challenge: Terraform expects the paths and names to match, which can be quite annoying for development.

I had to run the following code to get it working:

cp F:\TerraformPluginDotNet\samples\SampleProvider\bin\release\net5.0\win-x64\publish\* F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64
mv F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\SampleProvider.exe F:\terraform\providers\example.com\example\dotnetsample\1.0.0\windows_amd64\terraform-provider-dotnetsample.exe

What this does is ensure that the files are in the right location and have the right name for Terraform to execute them. From there on, you can go on as usual developing your provider.

time to read 10 min | 1949 words

For a database engine, controlling the amount of work that is being executed is a very important factor for the stability of the system. If we can’t control the work that is being done, it is quite easy to get to the point where we are overwhelmed.

Consider the case of an (almost) pure computational task, one that is completely CPU bound. In some cases, those tasks can easily be parallelized. Here is one such scenario:
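The snippet isn’t included in this extract, so here is a stand-in sketch of the kind of task in question (the sizes and names are made up):

// Many independent arrays, each of which needs to be sorted: purely CPU bound work.
var random = new Random();
var arrayOfArrays = new int[1_000][];
for (var i = 0; i < arrayOfArrays.Length; i++)
{
    arrayOfArrays[i] = new int[10_000];
    for (var j = 0; j < arrayOfArrays[i].Length; j++)
        arrayOfArrays[i][j] = random.Next();
}

// The actual task: sort each array, one after another, on a single thread.
foreach (var array in arrayOfArrays)
{
    Array.Sort(array);
}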

This example seems pretty obvious, right? This is completely CPU bound (sorting), and leaving aside the fact that the sort itself can be done in parallel, we have many arrays that we need to sort. As you can imagine, I would much rather have this task done as quickly as possible.

One way to make this happen is to parallelize the work. Here is a pretty simple way to do that:

Parallel.ForEach(arrayOfArrays, array => Array.Sort(array));

A single one liner and we are going to see much better performance from the system, right?

Yep, this particular scenario will run faster; depending on the sizes, it may be much faster. However, that single one liner is a nasty surprise for the stability of the system as a whole. What’s the issue?

Under the covers, this is going to use the .NET thread pool to do the work. In other words, this is going to be added to the global workload on the system. What else is using the same thread pool? Well, pretty much everything. For example, processing requests in Kestrel is done in the thread pool, all the async machinery uses the thread pool under the covers as well as pretty much everything else.

What is the impact of adding heavily CPU bound work to the thread pool, one may ask? Well, the thread pool would start executing it on its threads. This is heavy CPU work, so it will likely run for a while, displacing other work. If we consider a single instance of this code, there is a limit on the amount of concurrent work that is placed in the thread pool. However, if the code above is itself invoked from multiple places in parallel… we are going to be placing a lot of work on the thread pool. That is going to effectively starve the rest of the system. The thread pool will react by spawning more threads, but this is a slow process, and it is easy to get into a situation where all available threads are busy, leaving nothing for the rest of the application to run.

From the outside, it looks like 100% CPU usage, with the system being entirely frozen. That isn’t actually what is going on; we are simply holding up all the threads and can’t prioritize the work between request handling (important) and speeding up background work (less important). The opposite problem also exists: you may be running the operation on an otherwise idle system, and the non parallel code will utilize a single thread out of the many that are available.

In the context of RavenDB, we are talking here about indexing work. It turns out that there is a lot of work here that is purely computational. Analyzing and breaking apart text for full text search, sorting terms for more efficient access patterns, etc. The same problem remains: how can we balance the indexing speed and the overall workload on the system?

Basically, we have three scenarios that we need to consider:

  1. Busy processing requests – background work should be done in whatever free time the system has (avoiding starvation), with as little impact as possible.
  2. Busy working on many background tasks – concurrent background tasks should not compete with one another and step on each other’s toes.
  3. Busy working on a single large background task – which should utilize as much of the available computing resources as possible.

For the second and third options, we need to take into account the fact that the absence of request processing work right now doesn’t matter if there is incoming work. In that case, the system should prioritize the request processing over the background work.

Another important factor that I haven’t mentioned is that it would be really good for us to be able to tell what work is taking the CPU time. If we are running a set of tasks on multiple threads, it would be great to be able to see what tasks they are running in a debugger / profiler.

This sounds very much like a class in operating systems; in fact, scheduling work is a core responsibility of an operating system. The question we have here is how we can build a system that meets all the above requirements, given that we can’t actually schedule CPU work directly.

We cannot use the default thread pool, because there are too many existing users there that can interfere with what we want. For that matter, we don’t actually want a dynamic thread pool; we want to maximize the amount of work we do for computationally heavy workloads. Instead, for the entire process, we will define a set of threads to handle work offloading, like this:
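The snippet isn’t included in this extract; a sketch of what such a setup might look like (the _offloadWorkers and OffloadWorkerLoop names are placeholders, not the actual RavenDB code):

private static readonly Thread[] _offloadWorkers = new Thread[Environment.ProcessorCount];

public static void StartOffloadWorkers()
{
    for (var i = 0; i < _offloadWorkers.Length; i++)
    {
        _offloadWorkers[i] = new Thread(OffloadWorkerLoop)
        {
            IsBackground = true,
            Name = "Offload Worker #" + i,
            // Lowest priority: these threads get CPU time only when nothing else is ready to run.
            Priority = ThreadPriority.Lowest
        };
        _offloadWorkers[i].Start();
    }
}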

This creates a set of threads, one for each CPU core on the machine. It is important to note that these threads are running with the lowest priority; if there is anything else that is ready to run, it will get priority. In order to do some background work, such as indexing, we’ll use the following mechanism:
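Again, a sketch rather than the actual code; the index.Run and index.Name members are assumed for illustration:

var indexingThread = new Thread(index.Run)
{
    IsBackground = true,
    Name = "Indexing of " + index.Name,
    // Below normal: request processing threads (at Normal) win most of the time, while the
    // OS starvation prevention still lets indexing make progress under heavy load.
    Priority = ThreadPriority.BelowNormal
};
indexingThread.Start();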

Because indexing is a background operation in RavenDB, we do that in a background thread and we set the priority to below normal. Request processing threads are running at normal priority, which means that we can rely on the operating system to run them first, and at the same time, the starvation prevention that already exists in the OS scheduler will run our indexing code even if there is extreme load on the system.

So far, so good, but what about those offload workers? We need a way to pass work from the indexing thread to the offload workers, which is done in the following manner:
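The snippet isn’t reproduced here; a simplified sketch of the idea, sticking with the array sorting example (the types and names are assumptions, and completion tracking is elided):

// using System.Collections.Concurrent;

// Process wide: a bounded collection of per-index work queues. Bounding it limits the memory we hold.
private static readonly BlockingCollection<ConcurrentQueue<int[]>> _globalWorkQueue =
    new BlockingCollection<ConcurrentQueue<int[]>>(boundedCapacity: 512);

// Inside the indexing thread:
var localQueue = new ConcurrentQueue<int[]>();
foreach (var array in arrayOfArrays)
{
    localQueue.Enqueue(array);
    // Publish our queue so that idle offload workers can help; if the global queue is full, just skip it.
    _globalWorkQueue.TryAdd(localQueue);
}

// The indexing thread drains its own queue regardless of whether anyone is helping.
while (localQueue.TryDequeue(out var array))
{
    Array.Sort(array);
}

// The real code would also wait here for any items that an offload worker is still processing.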

Note that the _globalWorkQueue is process wide. For now, I’m using the simple sort-an-array example because it makes things easier; the real code would need a bit more complexity, but that is not important for explaining the concept. The global queue contains the queues of the tasks that need to be run.

The index itself will have a queue of work that it needs to complete. Whenever it needs to start doing the work, it will add that to its own queue and try to publish to the global queue. Note that the size of the global queue is limited, so we won’t use too much memory there.

Once we have published the work we want to do, the indexing thread will start working on the work itself, occasionally trying to solicit the aid of the other workers. Finally, once the indexing thread is done processing all the remaining work, we need to wait for any pending work that is currently being executed by the workers. That done, we can use the results.

The workers all run very similar code:
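A sketch of such a worker loop, matching the assumed _globalWorkQueue above:

private static void OffloadWorkerLoop()
{
    // Blocks until an index publishes a queue, then drains that queue to completion.
    foreach (var indexQueue in _globalWorkQueue.GetConsumingEnumerable())
    {
        while (indexQueue.TryDequeue(out var array))
        {
            Array.Sort(array);
        }
    }
}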

In other words, they pull a queue of tasks from the global task queue and start working on that exclusively. Once they are done processing a single index queue to completion, the offload worker will try to pick another from the global queue, and so on.

The whole code is small and fairly simple, but there is a lot of behavior hidden here. Let me try to explore all of that now.

The indexing background work pushes all the work items to its local queue, and it will register the queue itself in the global queue for the offloading threads to process. The indexing queue may be registered multiple times, so multiple offloading threads may take part in this. The indexing code, however, does not rely on that and will also process its own queue. The idea is that if there are offloading threads available, they will help, but we do not rely on them.

The offloading thread, for its part, will grab an indexing thread’s queue and process all the items from that queue until it is done. For sorting arbitrary arrays, it doesn’t matter much, but for other workloads, we’ll likely get much better locality in terms of task execution by issuing the same operation over a large set of data.

The thread priority here is also important, mind. If there is nothing else going on, the OS will schedule the offloading threads and let them do the work. If there is a lot of other stuff happening, they will not be scheduled often. That is fine; we are being assisted by the offloading threads, they aren’t mandatory.

Let’s consider the previous scenarios in light of this architecture and its impact.

If there are a lot of requests, the OS is going to mostly schedule the request processing threads; the indexing threads will also run, but mostly when there is nothing else to do. The offload threads are going to get their chance, but that will mostly not have any major impact. That is fine, we want most of the processing power to go to request processing.

If there is a lot of background work, on the other hand, the situation is different. Let’s say that there are 25 indexes running and there are 16 cores available on the machine. In this case, the indexes are going to compete with one another. The offloading threads are again not going to get much of a chance to run, because there isn’t much that adding more threads will do in this context; there is already competition between the indexing threads over the CPU resources. However, the offloading threads are going to be of some help. Indexing isn’t pure computation; there are enough cases where it needs to do I/O, and while it waits, a CPU core is available. The offloading threads can then take advantage of the free cores (if there aren’t any other indexing threads that are ready to run instead).

It is the case of a single task running on a mostly idle system where this design really shines. The index is going to submit all its work to the global queue, at which point the offloading threads kick in and help it complete the task much sooner than it would be able to otherwise.

There are other things that we need to take into account in this system:

  • It is, by design, not fair. An index that manages to put its queue into the global queue first may have all the offloading threads busy working on its behalf, while all the other indexes are on their own. That is fine; when that index completes all its work, the offloading threads will assist another index. We care about overall throughput, not fairness between indexes.
  • There is an issue with this design under certain loads. An offloading thread may be busy doing work on behalf of an index. That index has completed all its other work and is now waiting for that last item to be processed. However, the offloading thread has the lowest priority, so it will rarely get to run. I don’t think we’ll see a lot of cost here, to be honest; it requires the system to be at full capacity (rare, and admins usually hate that, so they prevent it) for it to actually be a real issue for a long time. If needed, we can play with thread priorities dynamically, but I would rather avoid it.

This isn’t something that we have implemented, rather something that we are currently playing with. I would love to hear your feedback on this design.

time to read 2 min | 316 words

I ran into this question on Hacker News, asking for the best computer science papers. There are a few that I keep coming back to, either because they are so fundamental or because they are so useful.

Without any particular order

  • The Raft Paper – a distributed consensus algorithm that made sense to me on first read. There are a lot of subtle issues to consider, but when reading the paper, everything clicked. That is head and shoulders above the Paxos literature.
  • The Ubiquitous B-Tree – talk about a paper that I use daily. Admittedly, I didn’t get started on B-Trees from this paper, but it is a very well written one and it does a great job presenting the topic. It is also from 1979, and B-Trees were already “ubiquitous” at that time, which tells us something.
  • Extendible Hashing – this is also from 1979, and it is well written. I implemented extendible hashing based on this article directly and I grokked it right away.
  • How Complex Systems Fail – not strictly a computer science paper. In fact, I’m fairly certain that this fits more into civil engineering, but it does an amazing job of explaining the internals of complex systems and the why and how they fail. I took a lot from this paper. It is also very short and highly readable.
  • OLTP Through the Looking Glass – discusses the internal structure of database engines and the costs and complexities of their various pieces.
  • You’re doing it wrong – discusses the implementation of the Varnish proxy from the point of view of a kernel hacker. A totally different approach to the design of the system. It had a lot of influence on how I build systems.

I’m fairly certain that my criteria won’t be yours, but those are all papers that I have read multiple times and have utilized their insights in my daily work.

time to read 1 min | 85 words

Every now and then I need to do some work with text, and the Enron dataset is one of the most well known corpora.

I ended up writing the parsing code for it so many times that it isn’t even funny. Therefore, I decided to make my life easier and just post it somewhere that I can refer back to.

This code simply unpacks the Enron dataset into .NET objects, from which you can start processing the text in interesting ways.
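The code itself isn’t included in this extract; a rough sketch of the idea, assuming the dataset has already been extracted to disk and that the MimeKit package is used for the actual message parsing:

using System.Collections.Generic;
using System.IO;
using MimeKit; // assumed NuGet package for parsing the raw messages

public static class EnronReader
{
    // Walks the extracted Enron maildir folders and parses each file as an email message.
    public static IEnumerable<MimeMessage> ReadEnronDataset(string rootDirectory)
    {
        foreach (var file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))
        {
            using var stream = File.OpenRead(file);
            yield return MimeMessage.Load(stream);
        }
    }
}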


time to read 2 min | 295 words

I’ve been talking a lot about candidates and the hiring process we are going through right now. I thought it would only be fair to share a story about an interview task that I failed.

That was close to 20 years ago, and I was looking for my first job. I had absolutely no professional experience and was painfully aware of that. I did have a few years of working on Open Source projects, so I was confident that I had a good way to show my abilities.

The question was simple: write the code to turn the contents of this table into a hierarchical XML file:

(image of the table, not reproduced here)

In other words, they wanted:

To answer the question, I was given pen and paper, by the way. That made my implementation choices quite hard, since I had to write it all out in longhand. I tried to reproduce this from memory, and it looks like this:
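The snippet isn’t included in this extract; a sketch of the approach described, under assumed column names (Id, ParentId, Name) and an assumed connectionString, looks roughly like this:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Xml.Linq;

// Single query; relies on NULL sorting first (so the root shows up first) and on parents
// appearing before their children, as discussed below.
using var connection = new SqlConnection(connectionString);
connection.Open();
using var command = new SqlCommand("SELECT Id, ParentId, Name FROM Items ORDER BY ParentId, Id", connection);
using var reader = command.ExecuteReader();

var elements = new Dictionary<int, XElement>();
XElement root = null;
while (reader.Read())
{
    var element = new XElement("item",
        new XAttribute("id", reader.GetInt32(0)),
        new XAttribute("name", reader.GetString(2)));
    elements[reader.GetInt32(0)] = element;
    if (reader.IsDBNull(1))
        root = element;                            // NULL parent: this is the root
    else
        elements[reader.GetInt32(1)].Add(element); // attach to the already read parent
}
root.Save("output.xml");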

This is notepad code, and I wrote it using modern APIs. At the time, I was using ADO.NET and the XmlDocument. The idea is the same, however, and it will spare you going through a mass of really uninteresting details.

I got so many challenges to this answer, though. I relied on null being sorted first on SQL Server and on the fact that a parent must exist before its children. Aside from these assumptions, which I feel are fairly safe to make, I couldn’t figure out what the big deal was.

Eventually it turned out that the interviewers were trying to guide me toward a recursive solution. It never even occurred to me, since I was doing that with a single query and a recursive solution would need many such queries.

time to read 2 min | 363 words

A few days ago I posted about looking at GitHub projects for junior developer candidates. One of the things that is very common in such scenarios is to see them use string concatenation for queries, which I hate. I just went to a random candidate’s GitHub profile right now and found this gem:
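The candidate’s actual code isn’t reproduced here; an illustrative example of the pattern being criticized (not the real snippet) would be something like:

public bool IsValidUser(string username)
{
    // The connection and command are never closed or disposed.
    var connection = new SqlConnection(connectionString);
    connection.Open();
    // String concatenation straight into the SQL text: the SQL injection vulnerability.
    var command = new SqlCommand(
        "SELECT COUNT(*) FROM Users WHERE Username = '" + username + "'", connection);
    var count = (int)command.ExecuteScalar();
    // The if can be dropped entirely; this is just: return count > 0;
    if (count > 0)
        return true;
    return false;
}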

The number of issues that I have with this code is legion.

  • Not closing the connection or disposing the command.
  • The if can be dropped entirely.
  • And, of course, the actual SQL INJECTION vulnerability in the code.

There is a reason that I have such a reaction to this type of code, even when looking at junior developer candidates. For them, this is acceptable, I guess. They are learning and focusing mostly on what is going on, not on the myriad of taxes that you have to pay in order to get something to production. This was never meant to be production code (I hope, at least), and I’m not judging it on that level. But I have to very consciously remind myself of this fact whenever I run into code like this (and as I said, this is all too common).

The reason I have such a visceral reaction to this type of code is that I see it in production systems all too often. And that leads to nasty stuff like this:

And this code led to a 70GB data leak from Gab. The killer for me is that this code was written by someone with 23 years of experience.

I actually had to triple check what I was seeing when I read the code the first time, because I wasn’t sure that this was actually possible. I thought maybe this was some fancy processing done to avoid SQL injection, not that this was basically string interpolation.

Some bugs are things that you can excuse. A memory leak or a double free will happen to anyone who writes in C, regardless of experience and how carefully they write. They are often subtle and easy to miss, happening in corner cases of error handling.

This sort of bug is a big box of red flags. It is also on fire.

time to read 6 min | 1093 words

I mentioned that we are currently hiring for a junior dev position and we have been absolutely swamped with candidates. Leaving aside the divorce lawyer that tried to apply for the position and the several accountants (I don’t really get it either), we typically get people with very little experience.

In fact, this position is explicitly open to people with no experience whatsoever. Given that most junior positions require a minimum of two years, I think that got us a lot of candidates.

The fact that we don’t require prior experience doesn’t mean that we don’t have prerequisites, of course. We are a database company and the fundamentals are important to us. A typical task in RavenDB involves a lot of taxes: ACID compliance, distributed computing, strict performance requirements, visibility into the actions of the database, readability of the code, etc.

I talked before about the cost of a bad hire, and in the nearly a decade that has passed since I wrote that post, I haven’t changed my mind. I would rather end up with no one than hire someone that isn’t a match for our needs.

Our interview process is composed of a phone call, a few coding questions and then an in person interview. At this point, given that I have been doing this for over a decade, I think that I have interviewed well over 5,000 people. A job interview stresses some people out a lot. Yesterday I had to listen to a candidate speak so fast that I could barely understand the words, and I had to stop a candidate and tell them that they were currently in the 95th percentile of people I have spoken to, so they wouldn’t freeze because of a flubbed question.

I tweeted (anonymously) about the ups and downs of the process and seem to have created quite a lot of noise. A typical phone call with a potential candidate takes about 15 – 30 minutes and is mostly there to serve as an explicit filter. If they don’t meet the minimum requirements that we have, there is no point in wasting either of our time.

One of the questions that I ask is: Build a phone book application that stores the data in memory and outputs the records in lexical order. This can stump some people, so we have an extra question to help. Instead of trying to output the data in lexical order, how would you ensure that you don’t have a duplicate phone number in such a system? Scanning through the entire list of records each time is obviously not the way to go. If they still can’t think of a way to do that, the next hint is to think about O(1) and what data structure would fit this requirement. On the Twitter thread, quite a few people were up in arms about that.
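To make the expectation concrete, here is a sketch of the kind of answer being looked for (my illustration, not part of the original question): a hash based set gives an O(1) duplicate check, and a sorted dictionary keeps the records in lexical order.

var phoneBook = new SortedDictionary<string, string>(StringComparer.Ordinal); // name to number, lexical order
var knownNumbers = new HashSet<string>();                                     // O(1) duplicate detection

bool TryAdd(string name, string number)
{
    if (knownNumbers.Add(number) == false)
        return false; // this phone number is already in the book
    phoneBook[name] = number;
    return true;
}

void PrintAll()
{
    foreach (var (name, number) in phoneBook)
        Console.WriteLine($"{name}: {number}");
}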

Building a phone book is the kind of task that I remember doing in my high school programming class as a teenager. Admittedly, that was in Pascal, but I just checked six different computer science degrees and for all of them, data structures is a compulsory course. Moreover, things like “what is the complexity of this operation” are things that we do multiple times a day here. We are building a database here, so operations on data are literally our bread and butter. But even for “normal” development, that is crucial. A common example: we need to display some information to the user about their database. The information actually comes from two sources internally. One is the database list, which contains various metadata, and the other is the active database instance, which can give us running stats such as the number of requests for this database in the past minute.

Let’s take a look at this code:
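The original snippet isn’t included in this extract; a sketch of the shape of the problem (the type and property names are made up) would be:

// For every database record, scan the whole list of active instances to find its match: O(N^2).
var results = new List<DatabaseInfo>();
foreach (var record in databaseRecords)
{
    var active = activeDatabases.FirstOrDefault(db => db.Name == record.Name);
    results.Add(new DatabaseInfo
    {
        Name = record.Name,
        RequestsPerMinute = active?.RequestsPerMinute ?? 0
    });
}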

The complexity of this code is O(N^2). In other words, for ten databases, it would cost us a hundred operations. But for 250 databases it would cost 62,500 and for 500 it would be 250,000. Here is almost the same code, but without the needless cost:
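Again a sketch with the same made up names: build a dictionary once and each lookup becomes O(1).

// Index the active instances by name once, then every lookup is O(1): O(N) overall.
var activeByName = activeDatabases.ToDictionary(db => db.Name);
var results = new List<DatabaseInfo>();
foreach (var record in databaseRecords)
{
    activeByName.TryGetValue(record.Name, out var active);
    results.Add(new DatabaseInfo
    {
        Name = record.Name,
        RequestsPerMinute = active?.RequestsPerMinute ?? 0
    });
}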

This is neither theoretical nor rare, and it is part of every programming curriculum that you’ll find. Furthermore, I’m not even asking about the theoretical properties of various algorithms, or asking a candidate to compute the complexity of a particular piece of code. I’m presenting a fairly simple and common task (make sure that a piece of data is unique) and asking how to perform that efficiently.

From a lot of the reactions, it seems that plenty of people believe that data structures aren’t part of the fundamentals and shouldn’t be something that is deeply embedded in the mindset of developers. To me, that is like saying that a carpenter shouldn’t be aware of the difference between a nail and a screw.

Rob has an interesting thread on the topic, which I wanted to address specifically:

It does not matter what language, actually. In JavaScript, you won’t use a “new Hashtable()”; you’ll use an object, sure, but that is a meaningless detail. In fact, arrays implemented as hashes in JavaScript maintain most of their important properties, O(1) access time by index being the key one. If you want to go deeper, then in practice the JS runtime will usually use an actual array if it detects that this is how you use the variable.

And I’m sorry, that is actually really important, for many reasons. The underlying structure of our data has a huge impact on what you can do with that data and how you can operate on it. That is why you have ArrayBuffer and friends in JavaScript now, because it matters, a lot.

And to the people whose 30 years experience never included ever needing to know those details, I have two things to say:

  • Either your code is full of O(N^2) traps (which is sadly common) or you do know the difference, because lists vs. hashes is not something that you can really get away with ignoring.
  • In this company, implementing basic data structures is literally part of the day to day tasks. Over the years, we have needed customized versions of arrays, lists, dictionaries, trees and many more. This isn’t pie in the sky stuff, that is Monday morning.
time to read 5 min | 927 words

I’m writing this post at a time when Donald Trump’s social media accounts have been closed by Twitter, Facebook and pretty much all other popular social networks. Parler, an alternative social network, has been kicked off the Apple and Google app stores and its AWS account was closed. It also appears that many vendors are dropping it, and it will take significant time for it to get back online, if it is able to do so at all.

I’m not interested in talking about the reasons for this, mind. This is a hot topic political issue, in a country I’m not a citizen of, and I have no interest in getting bogged down with the political details.

I wish I could say that I had no dog in this fight, but I suspect that the current events will have a long term impact on the digital world. Note that all of those actions were taken by private companies, of their own volition. In other words, it isn’t a government or the courts demanding this behavior, but the companies’ own decision making process.

The thing that I can’t help thinking is that the current behavior by those companies is direct, blatant and very much short sighted. To start with, all of those companies operate at a global scale, and they have just proven that they are powerful enough to rein in the President of the United States. Sure, he is a lame duck at the moment, but that is still something that upsets the balance of power.

The problem with that is that while it would appear that the incoming US administration is favorable to this course of action, there are other countries and governments that are looking at this with concern. Poland is on track to pass a law prohibiting the removal of posts in social media that do not break local laws. Israel’s parliament is also considering a similar proposal.

In both cases, mind, these proposed laws got traction last year, before the current escalation in this behavior. I feel that more governments will now consider such laws in the near future, given the threat level that this represents to them. A politician in this day and age that doesn’t use social media to its fullest extent is going to be severely hampered. Both the Obama and the Trump campaigns were lauded for their innovative use of social media, for example.

There are also other considerations to ponder. One of the most costly portions of running a social network is the monitoring and filtering of posts. You have to take into account that people will post bile, illegal and obscene stuff. That’s expensive, and one of the reasons vendors dropped Parler was its moderation policies. That means that there is a big and expensive barrier in place for future social networks that try to grow.

I’m not sure how this is going to play out in the short term, to be honest. But in the long term, I think that there is going to be a big push, both legally and from a technical perspective to fill those holes. From a legal perspective, I would expect that many lawyers will make a lot of money on the fallout from the current events, just with regards to the banning of Parler.  I expect that there are going to be a whole lot of new precedents, both in the USA and globally.

From a technical perspective, the technology to run a distributed social network already exists. Leaving aside currently esoteric choices such as social networks on the blockchain (there appear to be a lot of them, search for that, wow!), people can fall back on good old Blog & RSS and get quite a bit of traction. It wouldn’t take much to get to something that looks very similar to current social networks.

Consider RSS Bandit or Google Reader vs. Twitter or Facebook. There isn’t much that you’d need to do to go from aggregation of RSS feeds to a proper social network. One advantage of such a platform, by the way, is that it allows (and encourages) thought processes that are longer than 140 characters. I dearly miss the web of the 2000s.

Longer term, however, I would expect a rise of distributed social networks that are composed of independent but cooperating nodes (yes, I’m aware of Mastodon, and I’m familiar with Gab breaking out of that). I don’t know if this will be based on existing software or if we’ll end up with new networks, but I think that the die has been cast in this regard.

That means that the next social network will have to operate under the assumption of a hostile environment. That means running on multiple vendors, having no single point of failure, etc.

The biggest issue with getting a social network off the ground is… well, network effects. You need enough people in the network before you start getting more bang for the buck. But right now, there is a huge incentive for such a network, given the migration of many users from the established networks.

Parler’s app has seen hundreds of thousands of downloads a day in the past week, before it was taken down from the app stores. Gab is reporting 10,000+ new users an hour and more users in the past two days than they had seen in the past two years.

There is a hole there that will be filled, I think. Who will be the winner of all those users, I don’t know, but I think that this will have a fundamental impact on the digital world.

time to read 4 min | 705 words

I want to comment on the following tweet:

When I read it, I had an immediate and visceral reaction, because this is one of those things that sound nice but are actually a horrible dystopian trap. It confuses two very important concepts and puts them in the wrong order, resulting in utter chaos.

The two ideas are “writing tests” and “producing high quality code”. And they are usually expressed in something like this:

We write tests in order to produce high quality code.

Proper tests ensure that you can make forward progress without having regressions. They are a tool you use to ensure a certain level of quality as you move forward. If you assume that the target is the tests and that you’ll get high quality code because of that, however, you end up in weird places. For example, take a look at the following set of stairs. They aren’t going anywhere, and aside from being decorative, they serve no purpose.

(image of a decorative set of stairs that leads nowhere)

When you start considering tests themselves to be the goal, instead of a means to achieve it, you end up with decorative tests. They add to your budget and make it harder to change things, but don’t benefit the project.

There are a lot of things that you are looking for in code review that shouldn’t be in the tests. For example, consider the error handling strategy of the code. We have an invariant that says that exceptions may not escape our code when running in a background thread, because that would kill the process; you would end up with an error raised from a write to a file when the disk is full taking down the server. How would you express something like that in a test?

Another example is a critical piece of code that needs to safely handle out of memory exceptions. You can test for that, sure, but it is hard and expensive. It also tends to freeze your design and implementation, because you are now testing implementation concerns, and that makes it very hard to change your code.

Reviewing code for performance pitfalls is another major consideration. How do you police allocations using a test? And do that without killing your productivity? For fun, the following code allocates:

Console.WriteLine(1_000_000);

There are ways to monitor and track these kinds of things, for sure, but they are very badly suited for repeatable tests.

Then there are other things that you’ll cover in the review, more tangible items. For example, the quality of error messages you raise, or the logging output.

I’m not saying that you can’t write tests for those. I’m saying that you shouldn’t. That is something that you want to be able to change and modify quickly, because you may realize that you want to add more information in a certain scenario. Freezing the behavior using tests just means that you have more work to do when you need to make the changes. And reviewing just the test code is an even bigger problem when you consider that you need to look at interactions between features and their impact on one another. Not in terms of correctness, you absolutely need to test that, but in terms of behavior.

The interleaving of internal tasks inside RavenDB was carefully designed to ensure that we’ll be biased in favor of user facing operations, starving background operations if needed. At the same time, it means that we need to ensure that we’ll give the background tasks time to run. I can’t really think of how you would write a test for something like that without basically setting in stone the manner in which we make that determination. That is something that I explicitly don’t want to do, since it would make changing how and what we do harder. But that is something that I absolutely consider in code reviews.

time to read 2 min | 287 words

We got an interesting question a few times in recent weeks: how can I manually create a document revision with RavenDB? The answer is that you can use the ForceRevisionCreationFor() method to do so. Here is how you’ll typically use this API:
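The snippet isn’t included in this extract; a sketch of the typical usage (the entity and property names are made up, and the exact API shape may vary between client versions):

using (var session = store.OpenSession())
{
    var contract = session.Load<Contract>("contracts/1-A");
    contract.Status = "Published"; // e.g. moving the document out of draft mode

    // Ask RavenDB to create a revision for this document, even if the collection
    // isn't otherwise configured for revisions.
    session.Advanced.Revisions.ForceRevisionCreationFor(contract);
    session.SaveChanges();
}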

This is the typical expected usage for the API. We intended this to make it so users can manually trigger revisions, for example, when moving a document from draft mode to public and the like.

It turns out that there is another reason to want to use this API, when you migrate data to RavenDB and want to create historical revisions. The API we envisioned isn’t suitable for this, but the layered API in RavenDB means that we can still get the desired behavior.

Here is how we can achieve this:
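That code also isn’t included here; a rough sketch of the layered approach, assuming the client’s low level batch commands (PutCommandData and ForceRevisionCommandData) and a historicalVersion document that you built yourself, could look something like this:

using (var session = store.OpenSession())
{
    // Write one historical version of the document, then force a revision from it.
    // Repeat the pair of commands for each historical version, oldest first.
    session.Advanced.Defer(
        new PutCommandData("contracts/1-A", null, historicalVersion),
        new ForceRevisionCommandData("contracts/1-A"));
    session.SaveChanges();
}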

Basically, we manually create the transaction steps that will run on the server, and we can apply the command to the same document multiple times.

Note that RavenDB requires a document to create a revision from it, so we set the document, create a revision and overwrite the document again, as many times as we need to.

Another issue that was brought up is that the @last-modified property on the document is set to the date of the revision creation. In some cases, users doing such a migration want the revision to carry the time at which the document was originally created, rather than the time of the migration.

That is not supported by RavenDB, because the @last-modified is tracking the time that RavenDB modified the document or revision. If you need to track the time a document was modified in the domain, you need to keep that as part of your actual domain model.
