I have written an article about the uses of the profiler and the benefits it brings. And here is the 30-second video:
You can watch the full demo in the Cosmos DB webinar.
The very first product of Hibernating Rhinos was a profiler for NHibernate, to allow you to figure out exactly what is going on between your database and your application. Now I’m proud to present our latest product: the Cosmos DB Profiler.
If you are using Azure, you are likely familiar with Cosmos DB. Cosmos DB is not a traditional relational database. It is marketed by Microsoft as a multi-model database and it is widely known in the world of distributed databases. The first part is important enough to bear repeating: Cosmos DB is not a relational database, even if there is a tendency to treat it as such.
We have gathered everything we know about optimal database usage, mixed in all the experience we have gained from seeing users bump into issues working with distributed systems, and then looked into all the best practices published about successful Cosmos DB applications. After we had all of that, we looked for patterns, things that we can do for you, automatically, that would prevent you from messing up. Thus, the Cosmos DB profiler was born.
Here is what it looks like, profiling an application locally:
As you can see, it gives you context for the interaction between your application and the database. It allows you to see exactly what is going on behind the scenes. This is important, since most Cosmos DB applications aren’t trivial; we are usually talking about big applications with a lot of data and moving pieces. It can be hard to understand what is actually going on when you run a particular action.
Furthermore, the profiler is able to give you concrete suggestions that will improve your performance and reduce your cloud bills.
The pricing model for Cosmos DB is based on provisioned capacity, and it is very easy to get into a state where you need to provision a lot more than you expected. The profiler is able to detect such issues, provide you with concrete recommendations on how to fix them, and show you the savings immediately.
I’m doing a webinar on the Cosmos DB profiler on Tuesday and I would love to see you there.
I posted about the @refresh feature in RavenDB, explaining why it is useful and how it can work. Now, I want to discuss a possible extension to this feature. It might be easier to show than to explain, so let’s take a look at the following document:
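Something along these lines (the exact property names and the shape of the metadata below are illustrative only, not a final design):

```json
{
    "Tenant": "tenants/1-A",
    "Amount": 1500,
    "DueDate": "2019-08-01",
    "Status": "Pending",
    "@metadata": {
        "@collection": "LeasePayments",
        "@refresh": [
            {
                "At": "2019-08-04T00:00:00.0000000Z",
                "Script": "this.Amount += 50; this.Status = 'Late';"
            },
            {
                "At": "2019-08-15T00:00:00.0000000Z",
                "Script": "this.Status = 'PastDue';"
            }
        ]
    }
}
```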
The idea is that in addition to the data inside the document, we also specify behaviors that will run at specified times. In this case, if the user is three days late in paying the rent, they’ll have a late fee tacked on. If enough time has passed, we’ll mark this payment as past due.
The basic idea is that in addition to just having a @refresh timer, you can also apply actions. And you may want to apply a set of actions, at different times. I think that the lease payment processing is a great example of the kind of use cases we envision for this feature. Note that when a payment is made, the code will need to clear the @refresh array, to avoid it being run on a completed payment.
The idea is that you can apply operations to documents at a future time, automatically. This is a way to enhance your documents with behaviors and policies with ease. You don’t need to set up your own code to execute this; you can simply let RavenDB handle it for you.
Some technical details:
We also considered another option (look at the Script property):
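Something like the following (again, the document id and property names here are illustrative only):

```json
{
    "Tenant": "tenants/1-A",
    "Amount": 1500,
    "DueDate": "2019-08-01",
    "Status": "Pending",
    "@metadata": {
        "@collection": "LeasePayments",
        "@refresh": [
            {
                "At": "2019-08-04T00:00:00.0000000Z",
                "Script": "scripts/lease-payments#LateFee"
            },
            {
                "At": "2019-08-15T00:00:00.0000000Z",
                "Script": "scripts/lease-payments#PastDue"
            }
        ]
    }
}
```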
The idea is that instead of specifying the script to run inline, we can reference a property on a document. The advantage is that we can apply changes globally much more easily; we can fix a bug in the script once. The disadvantage here is that you may be modifying a script for new values, but not accounting for the old documents that may be referencing it. I’m still of two minds about whether we should allow a script reference like this.
This is still an idea, but I would like to solicit your feedback on it, because I think that this can add quite a bit of power to RavenDB.
Exactly 9 years ago, Hibernating Rhinos had a major breakthrough. We moved to our own offices for the first time. Before that, I was mostly working from a home office or at clients’ locations. Well, I say we, but I mean I. At the time, the change mostly involved me having to put on some shoes and go out of the house to work alone in a big empty office. The rest of the team at the time was completely remote.
I got the office because I needed to. Some people can manage a proper life / work balance while working from home. I find it very hard. I’m the kind of person that would get up at 2 AM to get something to drink, see a new mail notification on the monitor, and start working until 8 AM. Having a separate office was hugely beneficial for me. The other reason was that it allowed me to hire more people locally. The first real employee I had was hired within three months of moving to the new office.
That first office was great, but small. Just 5 rooms, about 120 m² (1,300 ft²). We stayed in that office until we got to about 12 people. At that point, we really didn’t have enough room to swing a cat (to be fair, we didn’t have an office cat, nor a really good reason to want to swing one). We moved offices in 2015, from the center of the industrial zone of the city to the periphery of the business zone. The new offices were 250 m² (2,700 ft²) and gave us a lot of room to expand. They also had two major advantages: it was nice to be able to walk downstairs and get to pretty much anywhere we needed, and we no longer had to deal with having a garage next door.
When we moved to the 2nd office, it felt like we had a huge amount of room, but it filled up quite quickly. It was certain that we would outgrow the new place in short order, so we started looking for a permanent home that would suffice for the next 10 years or so. We got one, smack in the center of the business zone of the city. Next door to city hall, actually. Well, I say “got one”. What we actually got was a piece of paper and a hole in the ground. Before we could move into the new offices, they had to be built first.
We stayed in the second office space for 3 years, but we ran out of room before the new offices were ready. So we moved for the third time. Because our new offices weren’t ready, we moved to a shared working space (like WeWork). We planned on being there for a short while, but it ended up being over a year. On the plus side, we were able to expand much more easily. We hired quite a few people this year and were able to simply add more offices as we grew. The downside was that this was very much not our office, so we really wanted to move.
This week, however, we are finally going to move. The new offices have more than enough space, 415 m² (4,500 ft²), for the next five to ten years of growth. They cover two floors in a brand new location, centrally located and beautifully done. I’m not posting any pictures because the vast majority of our own team haven’t seen it yet (we have an unveiling party tomorrow), but I’m super happy that we got to this point and just had to share it on the blog.
RavenDB 5.0 will come out with support for time series. I talked about this briefly in the past, and now we are at the point where we are almost ready for the feature to stand on its own. Before we get there, I have a few questions to settle before the design is set. Here is what a typical query is going to look like:
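Something in this spirit (the syntax below is a placeholder of mine, not the final design; the method names and parameters are made up purely to illustrate the shape of the thing):

```
from Machines as m
where m.Datacenter = 'DC-1'
select timeseries.avg(m.CpuUsage, '-7d', 'now', '1 hour') as HourlyCpu
```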
We intend to make the queries as obvious as possible, so I’m not going to explain it. If you can’t figure out what the query above is meant to do, I would like to know, though.
What sort of queries would you want to run against the data? For example, here is something that we expect users to want to do: compare and contrast different time periods, which you’ll be able to do in the following manner:
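Roughly along these lines (again, timeseries.daily and timeseries.diff are placeholder names of mine, used only to illustrate the question):

```
from Machines as m
where m.Datacenter = 'DC-1'
select
    timeseries.daily(m.CpuUsage, '-60d', '-30d') as PreviousMonth,
    timeseries.daily(m.CpuUsage, '-30d', 'now')  as CurrentMonth,
    timeseries.diff(PreviousMonth, CurrentMonth) as Comparison
```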
The output of this query will give you the daily summaries for the last two months, as well as a time based diff between the two (meaning that it will match on the same dates, ignoring missing values, etc).
What other methods in the “timeseries.*” family would you need?
The other factor that we want to get feedback on: what sort of visualizations would you like to see on top of this data in the RavenDB Studio?
In my company, I have a simple rule. If you want a tool, ask for it, you’ll get it. If you want training, ask for it, you’ll get it. If you want technical books, let me know, you’ll get them. I don’t ask questions, and I don’t try to enforce any rules around that. I have gotten requests for things like a Pluralsight subscription (very relevant) and technical books on topics that we would probably never touch (which I happily purchased). But by and large, I don’t get many requests for stuff. Things have gotten so bad that we had an internal marketing effort to get people to ask for stuff.
I’ll repeat that: I had to actively make it attractive to have people send me an email “can you get us XYZ”.
There is no tedious process involving multiple pitches and getting buy-in. It is literally: send an email and you’ll get it. And people don’t take advantage of this option.
Recently, I tried outsmarting my folks and put that as an item in the current sprint. Something like: “Suggest a course / conference / training that you should go to this quarter”. I got pushback from the team leaders, saying that no one could find something that they wanted to go to.
I’m still in the process of trying to find a solution to this problem, to be frank.
I thought about just giving individual people a budget and just letting them handle that directly. That actually fails for a bunch of reasons:
How do you pay for this? Simplest would be to just have the developers pay and reimburse them for that money. I don’t like this option, because there is no need for the dev to float money for the company. Especially since some of these can be fairly high. The cost of a training course, for example, can be thousands of dollars. At that point, it is likely that we are going to have a discussion on this anyway, so I might as well pay that directly. The same applies for tooling / books, etc (although they usually cost less).
Of more interest to me is that if there is a tool / training that one dev wants, it is likely that others will want it as well. That matters, because you can usually get volume discounts instead of paying for multiple individual options.
Finally, there are tools, and then there are tools. What sort of text editor you use doesn’t really matter to me. Nor do I care what sort of Git client you use. But a tool that is used to generate code, or part of the build / test process, is something that I do care about and want to look at.
What we end up with is a situation where we can’t decentralize the process, but we also can’t seem to get the people involved to just ask.
I would like to hope that this is because they have everything they need. I have tried to make the process as smooth and painless as possible, with no takers. At this point, I’m just going to go and meditate over this bit of wisdom.
A few customers reported an error similar to the following one:
Invalid checksum for page 1040, data file Raven.voron might be corrupted, expected hash to be 0 but was 16099259854332889469
A single case might be disk corruption, but multiple customers reporting it is an indication of a much bigger problem. That was a trigger for a STOP SHIP reaction. We consider data safety a paramount goal of RavenDB (part of the reason why I’m doing this Production Postmortem series), and we put some of our most experienced people on it.
The problem was, we couldn’t find it. Having access to the corrupted databases showed that the problem occurred at random. We use Voron in many different capacities (indexing, document storage, configuration store, distributed log, etc.) and these incidents happened across the board. That narrowed the problem to Voron specifically, and not bad usage of Voron. This reduced the problem space considerably, but not enough for us to be able to tell what was going on.
Given that we didn’t have a lead, we started by recognizing what the issue was and added additional guards against it. In fact, the error itself was a guard we added, validating that the data on disk is the same data that we have written to it. The error above indicates that there has been a corruption in the data, because the expected checksum doesn’t match the actual checksum from the data. This gives us an early warning system for data errors and prevents us from proceeding on erroneous data. We added this primarily because we were worried about physical disk corruption of data, but it turns out that this is also a great early warning system for when we mess up.
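To give a rough sense of what this guard does, here is a simplified sketch of the idea (this is not Voron’s actual code; I’m using the System.IO.Hashing package’s XxHash64 just as an example hash): compute a checksum when a page is written, and recompute and compare it when the page is read back.

```csharp
using System;
using System.IO.Hashing;

public static class PageChecksum
{
    // Compute the checksum of the page payload when the page is written,
    // and store it alongside the page.
    public static ulong Compute(ReadOnlySpan<byte> pageData) =>
        XxHash64.HashToUInt64(pageData);

    // When the page is read back, recompute the hash and compare it to the
    // stored value. A mismatch means the data on disk is not what we wrote.
    public static void Validate(long pageNumber, ReadOnlySpan<byte> pageData, ulong expectedChecksum)
    {
        var actual = XxHash64.HashToUInt64(pageData);
        if (actual != expectedChecksum)
            throw new InvalidOperationException(
                $"Invalid checksum for page {pageNumber}, data file might be corrupted, " +
                $"expected hash to be {expectedChecksum} but was {actual}");
    }
}
```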
The additional guards were primarily additional checks for the safety of the data at various locations in the pipeline. Given that we couldn’t reproduce the issue ourselves, and none of the customers affected were able to reproduce it, we had no idea how to go on from there. Therefore, we had one team that kept on trying different steps to reproduce the issue and another team that added additional safety measures so the system would catch any such issue as early as possible.
The additional safety measures went into the codebase for testing, but we still didn’t have any luck in figuring out what was going on. We went from trying to reproduce this by running various scenarios to analyzing the code and trying to figure out what was happening. Everything pointed to it being completely impossible for this to happen, obviously.
We got a big break when the repro team managed to reproduce this error while running a set of heavy tests on 32-bit machines. That was really strange, because none of the incidents we had seen to date had been on 32-bit machines.
It turns out that this was a really lucky break, because the problem wasn’t related to 32 bits at all. What was going on there is that under 32 bits, we run in a heavily constrained address space, which under load can cause us to fail to allocate memory. If this happens at certain locations, it is considered a catastrophic error and requires us to close the database and restart it to recover. So far, this is pretty standard and both an expected and a desired reaction. However, it looked like sometimes this caused an issue. This also tied in with some observations from customers about the state of the system when this happened (low memory warnings, etc.).
The very first thing we did was to test the same scenario on the codebase with the new checks added. Until that point, the repro team had worked on top of the version that failed at the customers’ sites, to prevent any other code change from masking the problem. With the new checks, we were able to confirm that they actually triggered and caught the situation early. That was a great confirmation, but we still didn’t know what was going on. Luckily, we were able to add more and more checks to the system and run the scenario. The idea was to trip over a guard rail as early as possible, to allow us to inspect what actually caused the issue.
Even with a reproducible scenario, that was quite hard. We didn’t have a reliable method of reproducing it; we had to run the same set of operations for a while to hopefully trigger the scenario. That took quite a bit of time and effort. Eventually, we figured out the root cause of the issue.
In order to explain that, I need to give you a refresher on how Voron is handling I/O and persistent data.
Voron uses an MVCC model, in which any change to the data is actually done on a scratch buffer. This allows us to have snapshot isolation at very little cost and gives us a drastically simplified model for working with Voron. Other important factors include the need to be transactional, which means that we have to make durable writes to disk. In order to avoid doing random writes, we use a Write Ahead Journal. For these reasons, I/O inside Voron is basically composed of the following operations: journal writes (appending committed transactions to the write ahead journal), data flushes (moving modified pages from the scratch buffers to the data file) and file syncs (making sure the flushed data is durably on disk).
In Voron 3.5, we had journal writes (which happen on each transaction commit) on one side of the I/O behavior and flush & sync on the other side. In Voron 4.0, we split this even further, meaning that journal writes, data flushes and file syncs are all separate operations which can happen independently.
Transactions are written to the journal file one at a time, until it reaches a certain size (usually about 256MB), at which point we’ll create a new journal file. Flushing will move data from the scratch buffers to the data file, and syncing will ensure that the data that was moved to the data file is durably stored on disk, at which point you can safely delete the old journals.
In order to trigger this bug, you needed to have the following sequence of events:
All of these steps are just the setup for the actual problem, mind you.
At this point, we are primed to have this issue, but we haven’t actually experienced it yet. What has happened is that the persistent state (on disk) of the database is now suspect: if a crash happens, we will miss the oldest journal that still has transactions that haven’t been properly persisted to the data file.
Once you have set up the system properly, you aren’t done, in terms of reproducing this issue. We now have a race: the next flush / sync cycle is going to fix the problem. So you need to have a restart of the database within a very short period of time.
For additional complexity, the series of steps above will cause a problem, but even if you crash in just the right location, there are still some mitigating circumstances. In many cases, you are modifying the same set of pages in multiple transactions, and if the transactions that were lost because of the early deletion of the journal file had pages that were also modified in future transactions, these later transactions will fill in the missing details and there will be no issue. That was one of the things that made it so hard to figure out what was going on. We needed a very specific timing between three separate threads (journal, flush, sync) that creates the hole, then another race to restart the database at this point before Voron fixes itself in the next cycle, all happening just at the stage where Voron moves between journal files (typically every 256MB of compressed transactions, so not very often at all) and with just the right mix of writes to different pages in transactions that span multiple journal files.
These are some pretty crazy requirements for reproducing such an issue, but as the saying goes: One in a million is next Tuesday.
What made this bug even nastier was that we hadn’t caught it earlier. We take the consistency guarantees of Voron pretty seriously and we most certainly have code to check if we are missing transactions during recovery. However, we had a bug in this case. Because obviously there couldn’t be a transaction previous to Tx #1, we aren’t checking for a missing transaction at that point. At least, that was the intention of the code. What was actually executing was a check for missing transactions on every transaction except for the first transaction in the first journal file during recovery. So instead of skipping just the check on Tx #1, we skipped it for the first tx on every recovery.
Of course, this is the exact state that we have caused in this bug.
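To illustrate the nature of the bug, here is a simplified sketch (not the actual Voron recovery code, and the names are mine): the intention was that only Tx #1 is allowed to show up without a predecessor, but the check that actually ran gave a free pass to the first transaction seen in any recovery, which is exactly the transaction that would expose a wrongly deleted journal.

```csharp
using System.IO;

public class JournalRecoveryChecker
{
    private long _lastTxId; // last transaction id applied, 0 if the database is brand new

    public JournalRecoveryChecker(long lastTxIdAlreadyInDataFile)
    {
        _lastTxId = lastTxIdAlreadyInDataFile;
    }

    public void ValidateNextTransaction(long txId, bool isFirstTxInRecovery)
    {
        // Intended rule: Tx #1 is the only transaction that legitimately has no
        // predecessor; every other transaction must directly follow the previous one.
        bool maySkipCheck = txId == 1;

        // The bug: the condition that actually ran behaved like
        //     bool maySkipCheck = isFirstTxInRecovery;
        // so the very first transaction recovered was never validated against the
        // last transaction already persisted to the data file.

        if (maySkipCheck == false && txId != _lastTxId + 1)
            throw new InvalidDataException(
                $"Expected transaction {_lastTxId + 1} but found {txId}, " +
                "transactions are missing from the journals");

        _lastTxId = txId;
    }
}
```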
We added all the relevant checks, tightened the guard rails a few more times to ensure that a repeat of this issue will be caught very early and provided a lot more information in case of an error.
Then we fixed the actual problems and subjected the database to what, in humans, would be called enhanced interrogation techniques. Hammers were involved, as well as an irate developer with a penchant for pulling the power cord at various stages just to see what would happen.
We have released the fix in RavenDB 4.1.4 stable release and we encourage all users to upgrade as soon as possible.
I talk a lot about the hiring process that we go through, but there is also the other side of that: when people leave us. Hibernating Rhinos has been around for about a decade, and in that time it grew from a single-guy operation to a company that crossed the bridge from small to medium business a couple of years ago.
When I founded the company, I had a pretty good idea of what I wanted to have. Actually, I had a very clear idea of what I didn’t want to have. The things that I didn’t want to carry over to my own company. For example, being on call 24/7, working hours that exceed the usual norms, or being under constant pressure. By and large, looking back at our history and where we are today, I think that we did a pretty good job of upholding these values.
But that isn’t the topic of this post. I wanted to talk about people leaving the company. Given the time that we have been in business, we actually have very little turnover. Oh, we had people come and go, and I had to fire people who weren’t pulling their weight. But those were almost always people who were at the company for a short while (typically under a year).
In the past six months, we had two people leave who were with us for three and seven years (about three months apart from one another). That is a very different kind of separation. When I was told that they intended to leave, I was both sad and happy. I was sad because I hated to lose good people; I was happy because they were going to very good places.
After getting over my surprise, I sat down and planned for their leaving. Israel has a month’s notice requirement, so we had the time to do things properly. I was careful to check (very gently) whether this was a reversible decision, and once I confirmed that they had made up their minds, I carried on with that.
My explicit goals for that time were:
The last three, I believe, are pretty common goals when people are leaving, but the most important piece was the first one. What does this mean?
I wrote each of them a recommendation letter. Note that they both already had accepted positions elsewhere at that time, so it wasn’t something they needed. It is something that they might be able to make use of in the future, and it was something that I wanted to do, formally, as an appreciation for their work and skills.
As an aside, I have an open invitation to my team. I’ll provide both recommendation letters and serve as a reference in any job search they have, while they are working for us. I sometimes get CVs from candidates that explicitly note: “sensitive, current employer isn’t aware”. I don’t want to be the kind of place that you have to hide from.
We also threw each of them a going away party, with the entire company stopping everything and going somewhere to celebrate.
I did that for several reasons. First, each of them, in very different ways, contributed significantly to RavenDB. It was a joy to work with them, I don’t see any reason why it shouldn’t be a joy to see them go. I can certainly say that not saying goodbye properly would have created a bad taste for the entire thing, and that is something that I don’t want.
Second, and a bit more cold-minded, I want to leave the door open to have them come back again. After so much time in the company, the amount of knowledge that they have in their heads would be a shame to lose for good. But even if they never come back, that is still a net benefit, because…
Third, there is the saying about “if you love someone, let them go…”. I think that a really good way to make people want to leave is to make it hard to do so. By making the separation easy and cordial, the people who stay know that they don’t need to fear or worry about things if they want to see what else is available for them.
The last few statements came out a bit colder than I intended them to be, but I can’t really think of a good way to phrase the intention that wouldn’t sound like that. I don’t like that these people left, and I would much rather have them stay. But I started out from the assumption that they are going to leave, and the goal is to make the best out of that.
I was careful to not apply any pressure on them to stay regardless. In fact, in one case, I upfront apologized to the person on the way out, saying: “I want you to know that I’m not pressuring you to stay not because I want you to go, but because I respect your decision to leave and don’t want to make it awkward”.
Fourth, and coming back to what I want to have as a value for the company, is the idea that I wouldn’t mind at all being a place where people retire from. In fact, I decidedly want that to be the case. And we do a lot of work to ensure that we are the kind of place that you can be at for long periods of time (investing in our people, working on cool stuff, ensuring that grunt work is shared and minimized, etc). However, I would also take great pride in being the place that serves as a launching pad for people’s careers.
In closing, people are going to leave. If it is because of something that you can control, that should be a warning sign and something that you should look at to see if you can do better. If it is out of your hands, you should accept it as given and make the best of it.
I was very sad to see them go, and I wish them all the best in their future endeavors.
Last week we pushed an update to our public demo site. It is intended to walk you through using RavenDB, show code samples and provide detailed guidance on using RavenDB from your application.
Here is an example screen shot:
We spent a lot of time and effort on it, and I would appreciate you taking a peek and providing feedback on how useful it is for learning RavenDB and how to use it.
I created RavenDB because I couldn’t not do it. It was an idea that had to get out of my head. I looked up the details, and toward the end of 2008 I started to work on it as a side project. At the time I was involved in five or six active open source projects, had just gotten my NHibernate Profiler product to stable ground, and had been turning the idea of getting deeper into databases over in my head for a while. So I sat down and wrote some code.
I was just doing some code doodling, and it turned into deep design discussions, and at some point I was actually starting to actively look for help building the user interface for a “done” project. That was in late Feb 2010. Somehow, throwing some code at the compiler had turned into a journey that lasted over a year, in which I worked 16+ hour days on this project.
Around Mar 2010 I knew that I had a problem. Continuing as I did before, just writing a lot of code and trying to create an OSS project out of it would eat up all my time (and money). The alternatives were actually making money from RavenDB or stop working on it completely. And I didn’t want to stop working on it.
I decided that I had to make an effort to actually make a product out of this project. And that meant that I had to sit down and plan how I would actually make money from it. I firmly believe that “build it, and they will come” is a nice slogan, but it doesn’t replace planning, strategy and (at least some) luck.
That left us with dual licensing as a way to make money. I chose the AGPL because it was an OSI-approved license that isn’t friendly for commercial use, leading most users who want to use it commercially to purchase a commercial license.
So far, this is fairly standard, I believe.
I decided that RavenDB was going to be OSS, but from most other aspects, I was going to treat it as a commercial product. It had a paid team working on it from the moment it stopped being a proof of concept. It meant that we intentionally set out to make our money on licenses. This, in turn, had a lot of implications. Support is defined as a Cost Center in Hibernating Rhinos. In other words, one of the things that we routinely do in Hibernating Rhinos is look at how we can reduce support.
One way of doing that, of course, is to not have support, or to staff the support team with students or the cheapest offshore option available. Instead, our support staff consists of dedicated support engineers and the core team that builds RavenDB. This serves several goals. First, it means that when you raise a support issue with us, you get someone who knows what they are doing. Second, it means that the core team is directly exposed to (and affected by) the support issues that are raised. I have structured things in this manner explicitly because having insight into actual deployments and customer behavior means that the team is directly aware of the impact of their work. For example, writing an error message that will explain some issue to the user matters, because it reduces the time an engineer spends on the phone troubleshooting (not fun) and increases the amount of time they can sling code around (fun).
We had a major update between versions 3.5 and 4.0, taking almost 3 years to finish. The end result was vastly improved performance, the ability to run on multiple platforms and a whole host of other cool stuff. But the driving force behind it all? We had to make a significant change to our architecture in order to reduce the support burden. It worked, and the need for support went down by over 80%.
Treating RavenDB as a commercial product from the get go, even though it had an OSS license, meant that we focused on a lot of the stuff that is mostly boring. Anything from docs, setup and smoothing out all the bumps in the road, etc. The AGPL was there as a way to have your cake and eat it too. Be an OSS project with all the benefits that this entails. Confidence from our users about what we do, entry to the marketplace, getting patches from users and many more. Just having the ability to directly talk to our community with the code in front of all of us has been invaluable.
At the same time, we sell licenses to RavenDB, which is how we make money. The idea is that we provide value above and beyond whatever it is our license cost, and we can do that because we are very upfront and obvious in how we get paid.
We have a few users who have chosen to go with the AGPL version and skip paying us. I would obviously rather get paid, but I have laid out the rules of the game when I started playing and that is certainly within the rules. I believe that we’ll meet these users as customers in the future, it isn’t really that different from the community edition which we offer freely. In both cases, we aren’t getting paid, but it expands our reach, which will usually get us more customers in the long run.
We have been doing this for a decade and Hibernating Rhinos currently has about 30 people working full time on it, so it is certainly working so far!