Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969



time to read 10 min | 1824 words

This is a sordid tale of chance and mystery and the nasty tricks that Murphy can play on you.

A few customers reported an error similar to the following one:

Invalid checksum for page 1040, data file Raven.voron might be corrupted, expected hash to be 0 but was 16099259854332889469

A single such case might be disk corruption, but multiple customers reporting it is an indication of a much bigger problem. That triggered a STOP SHIP reaction. We consider data safety a paramount goal of RavenDB (part of the reason why I’m doing this Production Postmortem series), and we put some of our most experienced people on it.

The problem was, we couldn’t find it. Having access to the corrupted databases showed that the problem occurred at random. We use Voron in many different capacities (indexing, document storage, configuration store, distributed log, etc.) and these incidents happened across the board. That narrowed the problem down to Voron specifically, rather than bad usage of Voron. This reduced the problem space considerably, but not enough for us to be able to tell what was going on.

Given that we didn’t have a lead, we started by acknowledging what the issue was and adding additional guards against it. In fact, the error itself was a guard we added, validating that the data on disk is the same data that we wrote to it. The error above indicates that there has been a corruption in the data, because the expected checksum doesn’t match the actual checksum computed from the data. This gives us an early warning system for data errors and prevents us from proceeding on erroneous data. We added this primarily because we were worried about physical disk corruption, but it turns out that it is also a great early warning system for when we mess up.
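To make the guard concrete, here is a minimal sketch of this kind of page checksum validation. The page layout and hash function are hypothetical stand-ins, not Voron’s actual format or hash; they only illustrate checking on read what was computed on write.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Illustrative only: a hypothetical page layout (page number, then checksum, then payload)
// and a stand-in 64 bit hash. Voron's real page format and hash function differ.
public static class PageValidator
{
    public static void AssertValidChecksum(ReadOnlySpan<byte> page, string dataFile)
    {
        long pageNumber = BinaryPrimitives.ReadInt64LittleEndian(page);
        ulong expected = BinaryPrimitives.ReadUInt64LittleEndian(page.Slice(8));
        ulong actual = ComputeChecksum(page.Slice(16)); // hash everything after the header

        if (expected != actual)
            throw new InvalidDataException(
                $"Invalid checksum for page {pageNumber}, data file {dataFile} might be corrupted, " +
                $"expected hash to be {expected} but was {actual}");
    }

    // FNV-1a as a stand-in for the real 64 bit hash.
    private static ulong ComputeChecksum(ReadOnlySpan<byte> data)
    {
        ulong hash = 14695981039346656037UL;
        foreach (byte b in data)
            hash = (hash ^ b) * 1099511628211UL;
        return hash;
    }
}
```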

The additional guards were primarily additional checks for the safety of the data in various locations in the pipeline. Given that we couldn’t reproduce the issue ourselves, and none of the affected customers were able to reproduce it either, we had no idea how to proceed from there. Therefore, we had one team that kept trying different steps to reproduce this issue and another team that added additional safety measures to the system to catch any such issue as early as possible.

The additional safety measures went into the codebase for testing, but we still didn’t have any luck in figuring out what was going on. We went from trying to reproduce this by running various scenarios to analyzing the code and trying to figure out what was happening. Everything pointed to it being completely impossible for this to happen, obviously.

We got a big break when the repro team managed to reproduce this error while running a set of heavy tests on 32 bit machines. That was really strange, because none of the failures we had seen to date had been on 32 bits.

It turns out that this was a really lucky break, because the problem wasn’t related to 32 bits at all. What was going on there is that under 32 bits, we run in a heavily constrained address space, which under load can cause us to fail to allocate memory. If this happens at certain locations, it is considered a catastrophic error and requires us to close the database and restart it to recover. So far, this is pretty standard and both an expected and a desired reaction. However, it looked like sometimes this caused an issue. This also tied in with some observations from customers about the state of the system when this happened (low memory warnings, etc.).

The very first thing we did was to test the same scenario on the codebase with the new checks added. Until then, the repro team had worked on top of the version that failed at the customers’ sites, to prevent any other code change from masking the problem. With the new checks, we were able to confirm that they actually triggered and caught the situation early. That was a great confirmation, but we still didn’t know what was going on. Luckily, we were able to add more and more checks to the system and run the scenario. The idea was to trip over a guard rail as early as possible, to allow us to inspect what actually caused it.

Even with a reproducible scenario, that was quite hard. We didn’t have a reliable method of reproducing it; we had to run the same set of operations for a while and hope to hit the scenario. That took quite a bit of time and effort. Eventually, we figured out what the root cause of the issue was.

In order to explain that, I need to give you a refresher on how Voron is handling I/O and persistent data.

Voron uses an MVCC model, in which any change to the data is actually done on a scratch buffer. This allows us to have snapshot isolation at very little cost and gives us a drastically simplified model for working with Voron. Another important factor is the need to be transactional, which means that we have to make durable writes to disk. In order to avoid doing random writes, we use a Write Ahead Journal. For these reasons, I/O inside Voron is basically composed of the following operations:

  • Scratch (MEM) – copy on write data for pages that are going to be changed in the transaction. Usually purely in memory. This is how we maintain the Isolation and Atomicity aspects of ACID.
  • Journal (WAL) – sequential, unbuffered writes that include all the modifications made in the transaction. This is how we maintain the Atomicity and Durability aspects of ACID.
  • Flush (MMAP) – copies data from the scratch buffers to the data file, which allows us to reuse space in the scratch file.
  • Sync (FSYNC) – ensures that the data from a previous flush is stored on a durable medium, allowing us to delete old journal files.

In Voron 3.5, we had journal writes (which happen on each transaction commit) on one side of the I/O behavior and flush & sync on the other side. In Voron 4.0, we actually split this even further, meaning that journal writes, data flushes and file syncs are all separate operations which can happen independently.

Transactions are written to the journal file one at a time, until it reaches a certain size (usually about 256MB), at which point we’ll create a new journal file. Flush will move data from the scratch buffers to the data file, and sync will ensure that the data that was moved to the data file is durably stored on disk, at which point we can safely delete the old journals.
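To make the relationship between the three operations concrete, here is a rough sketch of the rule that decides when a journal file may be deleted. The names and bookkeeping are hypothetical and much simplified compared to the real Voron code; the point is the invariant that the bug ended up violating.

```csharp
// Illustrative only: a simplified model of the journal / flush / sync bookkeeping.
// Invariant: a journal may be deleted only after every transaction that was flushed
// out of it has also been fsync'ed to the data file.
public class JournalBookkeeping
{
    public long LastTransactionWrittenToJournals;  // advanced on each commit
    public long LastTransactionFlushedToDataFile;  // advanced by Flush (scratch -> data file)
    public long LastTransactionSyncedToDataFile;   // advanced by Sync (fsync of the data file)

    public bool CanDeleteJournal(long lastTransactionInJournal)
    {
        // Correct rule: the journal's newest transaction must be covered by a sync,
        // not merely by a flush. Deleting based on the flush position alone is what
        // allowed a journal that still held unsynced transactions to be removed.
        return lastTransactionInJournal <= LastTransactionSyncedToDataFile;
    }
}
```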

In order to trigger this bug, you needed to have the following sequence of events:

  • Have enough transactions happen quickly enough that the flush / sync operations are lagging by more than a single file behind the transaction rate.
  • Have a transaction start a new journal file while a flush operation is in progress.
  • Have, concurrently, the sync operation complete an operation that includes that last journal file. Sync can take a lot of time.
  • Have another flush operation go on while the sync is in progress, which will move the flush target to the new journal file.
  • Have the sync operation complete, having only synced some of the changes that came from that journal; but because the new flush (which we didn’t sync yet) has already moved on from that journal, it mistakenly believes that the journal file is completely done and deletes it.

All of these steps are just the setup for the actual problem, mind you.

At this point, we are primed to run into the issue, but we haven’t actually experienced it yet. What has happened is that the persistent (on disk) state of the database is now suspect: if a crash happens, we will be missing the oldest journal that still holds transactions that haven’t been properly persisted to the data file.

Once you have set up the system properly, you still aren’t done, in terms of reproducing this issue. We now have a race: the next flush / sync cycle is going to fix the problem. So you need to have a restart of the database within a very short period of time.

For additional complexity, the series of steps above will cause a problem, but even if you crash in just the right location, there are still some mitigating circumstances. In many cases, you are modifying the same set of pages in multiple transactions, and if the transactions that were lost because of the early deletion of the journal file touched pages that were modified again in future transactions, those transactions will fill in the missing details and there will be no issue. That was one of the things that made it so hard to figure out what was going on. We needed a very specific timing between three separate threads (journal, flush, sync) that creates the hole, then another race to restart the database at this point before Voron fixes itself in the next cycle, all happening just at the stage where Voron moves between journal files (typically every 256MB of compressed transactions, so not very often at all) and with just the right mix of writes to different pages in transactions that span multiple journal files.

These are some pretty crazy requirements for reproducing such an issue, but as the saying goes: one in a million is next Tuesday.

What made this bug even nastier was that we hadn’t caught it earlier. We take the consistency guarantees of Voron pretty seriously and we most certainly have code to check whether we are missing transactions during recovery. However, that check had a bug of its own. Because there obviously couldn’t be a transaction previous to Tx #1, we don’t check for a missing transaction at that point. At least, that was the intention of the code. What was actually executing was a check for missing transactions on every transaction except the first transaction in the first journal file processed during recovery. So instead of skipping the check just for Tx #1, we skipped it for the first transaction in every recovery.
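To illustrate the difference (with hypothetical names; this is not the actual Voron recovery code):

```csharp
using System.IO;

// Illustrative only: during recovery we walk the journals and verify there is no gap
// between the last transaction we applied and the next one we read.
public static class RecoveryChecks
{
    public static void ValidateNoGap(long previousTxId, long currentTxId, bool isFirstTxSeenInRecovery)
    {
        // Intended rule: the only transaction that legitimately has no predecessor is Tx #1.
        //     if (currentTxId == 1) return;
        //
        // What effectively ran: skip the check for the first transaction seen in this
        // recovery, whichever transaction that happens to be. If the oldest journal was
        // wrongly deleted, recovery starts in the middle and the gap goes unnoticed.
        if (isFirstTxSeenInRecovery)
            return;

        if (currentTxId != previousTxId + 1)
            throw new InvalidDataException(
                $"Transaction gap detected during recovery: expected {previousTxId + 1} but found {currentTxId}");
    }
}
```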

Of course, this is the exact state that we have caused in this bug.

Sigh.

We added all the relevant checks, tightened the guard rails a few more times to ensure that a repeat of this issue would be caught very early, and provided a lot more information in case of an error.

Then we fixed the actual problems and subjected the database to what in humans would be called enhanced interrogation techniques. Hammers were involved, as well as an irate developer with a penchant for pulling the power cord at various stages just to see what would happen.

We have released the fix in RavenDB 4.1.4 stable release and we encourage all users to upgrade as soon as possible.

time to read 6 min | 1059 words

I talk a lot about the hiring process that we go through, but there is also the other side of that: when people leave us. Hibernating Rhinos has been around for about a decade; in that time it grew from a single guy operation to a company that crossed the bridge from small to medium business a couple of years ago.

When I founded the company, I had a pretty good idea of what I wanted to have. Actually, I had a very clear idea of what I didn’t want to have. The things that I didn’t want to carry over to my own company. For example, being on call 24/7, working hours that exceed the usual norms or being under constant pressure. By and large, looking back at our history and where we are today, I think that we did a pretty good job at upholding these values.

But that isn’t the topic of this post. I wanted to talk about people leaving the company. Given the time that we have been in business, we actually have very little turnover. Oh, we had people come and go, and I had to fire people who weren’t pulling their weight. But those were almost always people who were at the company for a short while (typically under a year).

In the past six months, we had two people leave that were with us for three and seven years (about three months apart from one another). That is a very different kind of separation. When I was told that they intend to leave, I was both sad and happy. I was sad because I hated to lose good people, I was happy because they were going to very good places.

After getting over my surprise, I sat down and planned for their leaving. Israel has a one month notice requirement, so we had the time to do things properly. I was careful to check (very gently) whether this was a reversible decision, and once I confirmed that they had made their decision, I carried on with that.

My explicit goals for that time were:

  • Make sure that they are leaving on good terms and great spirits.
  • Ensure proper handoff of their current tasks.
  • Provide guidance about current and past tasks.
  • Map areas of responsibility and make sure that they are covered after they are gone.

The last three, I believe, are pretty common goals when people are leaving, but the most important piece was the first one. What does this mean?

I wrote each of them a recommendation letter. Note that they both already had accepted positions elsewhere at that time, so it wasn’t something they needed. It is something that they might be able to make use of in the future, and it was something that I wanted to do, formally, as an appreciation for their work and skills.

As an aside, I have an open invitation to my team. I’ll provide both recommendation letters and serve as a reference in any job search they have, while they are working for us. I sometimes get CVs from candidates that explicitly note: “sensitive, current employer isn’t aware”. I don’t want to be the kind of place that you have to hide from.

We also threw each of them a going away party, with the entire company stopping everything and going somewhere to celebrate.

I did that for several reasons. First, each of them, in very different ways, contributed significantly to RavenDB. It was a joy to work with them, I don’t see any reason why it shouldn’t be a joy to see them go. I can certainly say that not saying goodbye properly would have created a bad taste for the entire thing, and that is something that I don’t want.

Second, and a bit more cold-minded, I want to leave the door open to have them come back again. After so much time in the company, the amount of knowledge that they have in their heads would be a shame to lose for good. But even if they never come back, that is still a net benefit, because…

Third, there is the saying about “if you love someone, let them go…”. I think that a really good way to make people want to leave is to make it hard to do so. By making the separation easy and cordial,  the people who stay know that they don’t need to fear or worry about things if they want to see what else is available for them.

The last few statements came out a bit colder than I intended them to be, but I can’t really think of a way to phrase the intention that wouldn’t sound like that. I don’t like that these people left, and I would much rather have them stay. But I started out from the assumption that they were going to leave, and the goal was to make the best out of that.

I was careful to not apply any pressure on them to stay regardless. In fact, in one case, I upfront apologized to the person on the way out, saying: “I want you to know that I’m not pressuring you to stay not because I want you to go, but because I respect your decision to leave and don’t want to make it awkward”.

Fourth, and coming back to what I want to have as a value for the company, is the idea that I wouldn’t mind at all being a place that people retire from. In fact, I decidedly want that to be the case. And we do a lot of work to ensure that we are the kind of place that you can stay at for long periods of time (investing in our people, working on cool stuff, ensuring that grunt work is shared and minimized, etc.). However, I would also take great pride in being the place that serves as a launching pad for people’s careers.

In closing, people are going to leave. If it is because of something that you can control, that should be a warning sign and something that you should look at to see if you can do better. If it is out of your hands, you should accept it as given and make the best of it.

I was very sad to see them go, and I wish them all the best in their future endeavors.

time to read 3 min | 454 words

If you are tracking the nightlies of RavenDB, the Pull Replication feature has fully landed. You now have three options to choose from when you define replication in your systems.

[screenshot]

External Replication is meant to go from the current database to another database (usually in a different cluster). It is a way to share data with another location. The owner of the replication is the current database, which initiates the connection and sends the data to the other side.

Pull Replication reverses this behavior. The first thing you’ll need to do to get Pull Replication working is to define the Pull Replication Hub.

[screenshot]

As you can see, there isn’t much to do here. We give the hub a name and minimal configuration (how far back this should go, basically). In this case, we are allowing sinks to get the data from the database, with a 20 minute delay built into the loop. You can also export the sink configuration from this view. We also define a certificate that provides access to this Pull Replication Hub; the certificate grants access only to this hub and no additional permissions. In this way, you may have one certificate that provides access to a delayed public stock ticker and another that provides immediate access to the data.

The next step is to go to the other side, the sink. There, we either manually define the details of the hub or, more likely, import the configuration. The sink will then connect to the hub and start pulling the data from it. Here is what this looks like:

[screenshot]

The idea is that you are very likely to have a lot more sinks than hubs. That is why we make it easy to define the sink just by importing (although in practical terms we expect that this will just be part of a shared image that is deployed many times).

Once we have defined the Sink Pull Replication, it will connect to the Hub and start accepting data. You can track how this works from the studio:

[screenshot]

On the other side, you can track the connected sinks on the Hub:

[screenshot]

And this is all you need to set up Pull Replication yourself.

time to read 1 min | 92 words

Last week we pushed an update to our public demo site. It is intended to walk you through using RavenDB, show code samples and provide detailed guidance on using RavenDB from your application.

Here is an example screen shot:

[screenshot]

We spent a lot of time and effort on it, and I would appreciate you taking a peek and providing feedback on how useful it is for you in learning RavenDB and how to use it.

time to read 3 min | 500 words

In the previous post I talked about how to use a map reduce index to aggregate events into a final model. This is an interesting use case of indexing, and it can consolidate a lot of complexity into a single place, at which point you can utilize additional tooling available inside of RavenDB.

As a reminder, you can get the dump of the database that you can import into your own copy of RavenDB (or our live demo instance) if you want to follow along with this post.

Starting from the previous index, all we need to do is edit the index definition and set the Output Collection, like so:

[screenshot]
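In code, such an index definition would look roughly like the sketch below. The event and cart shapes here are hypothetical stand-ins for the model from the previous post; the relevant part is the OutputReduceToCollection property, which is what the Output Collection setting above corresponds to.

```csharp
using System.Linq;
using Raven.Client.Documents.Indexes;

// A sketch only: CartEvent, ShoppingCart and CartLine are hypothetical shapes,
// not the exact model from the previous post.
public class ShoppingCarts_ByEvents : AbstractIndexCreationTask<CartEvent, ShoppingCart>
{
    public ShoppingCarts_ByEvents()
    {
        Map = events => from e in events
                        select new ShoppingCart
                        {
                            CartId = e.CartId,
                            Paid = e.Type == "Pay",
                            Lines = e.Type == "AddToCart"
                                ? new[] { new CartLine { ProductId = e.ProductId, Quantity = e.Quantity, Total = e.Quantity * e.PricePerUnit } }
                                : new CartLine[0]
                        };

        Reduce = results => from r in results
                            group r by r.CartId into g
                            select new ShoppingCart
                            {
                                CartId = g.Key,
                                Paid = g.Any(x => x.Paid),
                                Lines = g.SelectMany(x => x.Lines).ToArray()
                            };

        // The new bit: persist each reduce result as an artificial document.
        OutputReduceToCollection = "ShoppingCarts";
    }
}

// Hypothetical shapes, for illustration only.
public class CartEvent
{
    public string CartId { get; set; }
    public string Type { get; set; }          // e.g. "AddToCart", "Pay"
    public string ProductId { get; set; }
    public int Quantity { get; set; }
    public decimal PricePerUnit { get; set; }
}

public class ShoppingCart
{
    public string CartId { get; set; }
    public bool Paid { get; set; }
    public CartLine[] Lines { get; set; }
}

public class CartLine
{
    public string ProductId { get; set; }
    public int Quantity { get; set; }
    public decimal Total { get; set; }
}
```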

What does this do? This tells RavenDB that in addition to indexing the data, it should also take the output of the index and create new documents from it in the ShoppingCarts collection. Here is what these documents look like:

[screenshot]

You can see at the bottom that this document is flagged as artificial and coming from an index. The document id is a hash of the reduce key, so changes to the same cart will always go to this document.

What is important about this feature is that once the result of the index is a document, we can operate on it using all the usual tools for indexes. For example, we might want to create another index on top of the shopping carts, like the following example:
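The sketch below reuses the hypothetical ShoppingCart and CartLine shapes from above (the actual example may differ). It runs over the artificial ShoppingCarts documents and computes the total sales per product across the paid carts.

```csharp
using System.Linq;
using Raven.Client.Documents.Indexes;

// A sketch only: a second level map-reduce index over the artificial ShoppingCarts collection.
public class Products_SalesTotals : AbstractIndexCreationTask<ShoppingCart, Products_SalesTotals.Result>
{
    public class Result
    {
        public string ProductId { get; set; }
        public decimal TotalSales { get; set; }
    }

    public Products_SalesTotals()
    {
        Map = carts => from cart in carts
                       where cart.Paid
                       from line in cart.Lines
                       select new Result
                       {
                           ProductId = line.ProductId,
                           TotalSales = line.Total
                       };

        Reduce = results => from r in results
                            group r by r.ProductId into g
                            select new Result
                            {
                                ProductId = g.Key,
                                TotalSales = g.Sum(x => x.TotalSales)
                            };
    }
}
```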

In this case, we are building another aggregation: taking all the paid shopping carts and computing the total sales per product from them. Note that we are now operating on top of our event streams but are able to extract a second level aggregation from the data.

Of course, normal indexes on top of the artificial ShoppingCarts collection allow you to do things like: “Show me my previous orders”. In essence, you are using the events for your writes, defining the aggregation to the final model in an index, and then RavenDB takes care of the read model.
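For instance, here is a sketch of querying the artificial collection directly from a session, again using the hypothetical ShoppingCart shape from above; RavenDB answers it with an auto index over these documents just as it would for any other collection. The store URL and database name are placeholders.

```csharp
using System.Linq;
using Raven.Client.Documents;

// A sketch: the artificial ShoppingCarts collection is queried like any other collection.
using (var store = new DocumentStore { Urls = new[] { "http://localhost:8080" }, Database = "Shop" }.Initialize())
using (var session = store.OpenSession())
{
    var paidCarts = session.Query<ShoppingCart>()   // convention maps ShoppingCart -> ShoppingCarts collection
        .Where(c => c.Paid)
        .ToList();                                  // answered by an auto index over the artificial documents
}
```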

Another option to pay attention to is not doing the read model and the full work on the same database instance as your events. Instead, you can output the documents to a collection and then use RavenDB’s native ETL capabilities to push them to another database (which can be another RavenDB instance or a relational database) for further processing.

The end result is a system that is built on dynamic data flow. Add an event to the system and the index will go through it, aggregate it with other events on the same root and output it to a document, at which point more indexes will pick it up and do further work, ETL will push it to other databases, subscriptions can start operating on it, etc.

time to read 7 min | 1259 words

After my previous two posts on the topic, it is now time to discuss how we make money from Open Source Software. Before I start, I want to clarify what I’m talking about:

  • RavenDB is a document database.
  • It is about a decade old.
  • The server is released under the AGPL / commercial license.
    • We offer free community / developer licenses without any AGPL hindrance.
  • The RavenDB client APIs are licensed under the MIT license.
  • RavenDB (the product) is created by Hibernating Rhinos (the company).

I created RavenDB because I couldn’t not do it. It was an idea that had to get out of my head. I looked up the details, and toward the end of 2008 I started to work on it as a side project. At the time I was involved in five or six active open source projects, had just gotten my NHibernate Profiler product to stable ground and had been turning the idea of getting deeper into databases over in my head for a while. So I sat down and wrote some code.

I was just doing some code doodling and it turned into deep design discussions, and at some point I was actually starting to actively look for help building the user interface for a “done” project. That was in late Feb 2010. Somehow, throwing some code at the compiler became a journey that lasted over a year, in which I worked 16+ hour days on this project.

Around Mar 2010 I knew that I had a problem. Continuing as I did before, just writing a lot of code and trying to create an OSS project out of it, would eat up all my time (and money). The alternatives were actually making money from RavenDB or stopping work on it completely. And I didn’t want to stop working on it.

I decided that I had to make an effort to actually make a product out of this project. And that meant that I had to sit down and plan how I would actually make money from it. I firmly believe that “build it, and they will come” is a nice slogan, but it doesn’t replace planning, strategy and (at least some) luck.

  • I already knew that I couldn’t sustain the project as a labor of love, and donations are not a sustainable way (or indeed, a way) to make money.
  • Sponsorship seemed like it would be unlikely unless I got one of my clients to start using RavenDB and then have them pay me to maintain it. That seemed… unethical, so it wasn’t an option.
  • Services / consulting was something that I was already doing quite heavily, and was quite successful at it. But this is a labor intensive way of making money and it would compete directly with the time that it would take to build RavenDB itself.
  • Support is a model I really don’t like, because it puts me in a conflict of interest. I take pride in what I do, and I wanted to make something that would be easy to use and not require support.
  • Open Core / N versions back – these are models that I don’t like. The open core model often leaves out critical functionality (such as security), and the N versions back model means that you give the users you most want to have the best experience (since that would encourage them to give you money) the worst experience (here are all our bugs that we fixed but won’t give to you yet).

That left us with dual licensing as a way to make money. I chose the AGPL because it was an OSI approved license that isn’t friendly for commercial use, leading most users who want to use it commercially to purchase a commercial license.

So far, this is fairly standard, I believe.

I decided that RavenDB is going to be OSS, but from most other aspects, I’m going to treat it as a commercial product. It had a paid team working on it from the moment it stopped being a proof of concept. It meant that we intentionally set out to make our money on the license. This, in turn, had a lot of implications. Support is defined as a Cost Center in Hibernating Rhinos. In other words, one of the things that we routinely do in Hibernating Rhinos is look at how we can reduce support.

One way of doing that, of course, is to not have support at all, or to staff the support team with students or the cheapest offshore option available. Instead, our support staff consists of dedicated support engineers and the core team that builds RavenDB. This serves several goals. First, it means that when you raise a support issue with us, you get someone who knows what they are doing. Second, it means that the core team is directly exposed to (and affected by) the support issues that are raised. I have structured things in this manner explicitly because having insight into actual deployments and customer behavior means that the team is directly aware of the impact of their work. For example, writing an error message that explains some issue to the user matters, because it reduces the time an engineer spends on the phone troubleshooting (not fun) and increases the amount of time they can sling code around (fun).

We had a major update between versions 3.5 and 4.0, which took almost 3 years to finish. The end result was vastly improved performance, the ability to run on multiple platforms and a whole host of other cool stuff. But the driving force behind it all? We had to make a significant change to our architecture in order to reduce the support burden. It worked, and the need for support went down by over 80%.

Treating RavenDB as a commercial product from the get go, even though it had an OSS license, meant that we focused on a lot of the stuff that is mostly boring. Anything from docs, setup and smoothing out all the bumps in the road, etc. The AGPL was there as a way to have your cake and eat it too. Be an OSS project with all the benefits that this entails. Confidence from our users about what we do, entry to the marketplace, getting patches from users and many more. Just having the ability to directly talk to our community with the code in front of all of us has been invaluable.

At the same time, we sell licenses to RavenDB, which is how we make money. The idea is that we provide value above and beyond whatever our license costs, and we can do that because we are very upfront and obvious about how we get paid.

We have a few users who have chosen to go with the AGPL version and skip paying us. I would obviously rather get paid, but I laid out the rules of the game when I started playing, and that is certainly within the rules. I believe that we’ll meet these users as customers in the future; it isn’t really that different from the community edition which we offer freely. In both cases, we aren’t getting paid, but it expands our reach, which will usually get us more customers in the long run.

We have been doing this for a decade and Hibernating Rhinos currently has about 30 people working full time on it, so it is certainly working so far!

time to read 6 min | 1149 words

Richard Stallman is, without a doubt, one of the most influential people in the Open Source movement. I think it is fitting in a post like this to look a bit at some of his reasoning around what Open Source is.

When we call software “free,” we mean that it respects the users' essential freedoms: the freedom to run it, to study and change it, and to redistribute copies with or without changes. This is a matter of freedom, not price, so think of “free speech,” not “free beer.”

The essential freedoms he talks about are “users have the freedom to run, copy, distribute, study, change and improve the software”. That was the intent behind the GNU, the GPL and much of the initial drive for Open Source. Rather, to be exact, the lack of these freedoms drove a lot of the proponents of Open Source.

I find the philosophy behind Open Source to be a good match for my own views in many respects. But it is really important to understand that the point behind all of this is the user’s freedom, not the developer’s. In many cases (such as developer tools), the distinction isn’t obvious, but it is an important one. The developer in this case is the entity that developed the software and released it. The user is whoever got hold of the software, however that was done.

As you might have noticed above, Stallman’s reasoning explicitly calls out free speech and not free beer. In other words, nothing here is constructed so as to prevent or forbid paying for Open Source software. So far, this is great, but the problem with selling Open Source software is that one of the essential freedoms is the ability to redistribute the software to 3rd parties. Let’s assume that party A is selling licenses for an OSS project at $1,000. The OSS license is explicitly okay with the first buyer of the software (party B) immediately turning around and selling the software for $500. And with the first buyer from party B selling it onward for $250, etc.

In practical terms, this means that you would expect the price of OSS projects to be near the distribution cost. When the GPL came about in 1986, that meant floppy disks as the primary mode of data transfer. There was a very real cost to distributing software, especially en masse. With today’s networks, the cost of distributing software is essentially nil for most projects.

In my previous post on the topic, I mentioned that this causes a real problem for OSS projects. Building software projects costs a lot of time and money. Sometimes you get projects that are funded for the “common good”. The Linux Kernel is one such project, and while other examples exist (jQuery, I believe), they are rare. If you want to make money and work on OSS projects, this isn’t really a good way to go about doing it.

If you want to make money and not do OSS, you are likely to run into a lot of pressure from the market. In many environments, being an OSS project gives you a huge leg up in marketing, users and mindshare. Conversely, not being OSS is a major obstacle for your intended users. This pretty much forces you toward an OSS business model, as described in my previous post.

A really interesting aspect of OSS business models is the use of the core principles of Open Source as a monetization strategy. Very rarely will you find that there is something truly interesting / novel in a particular project. It is the sum of the individual pieces that makes it valuable. Sometimes you do have projects with a secret sauce that they want to protect, but for the most part, keeping the source closed isn’t done to hide something. It is done so you’ll be able to sell the software. Dual licensing with a viral license takes a very different approach to the same problem.

Instead of keeping the source secret and selling licenses to it, you release your software under an OSS license, but one that requires your potential customers to release their source code in turn. Remember how I said that most projects don’t have anything interesting / novel in them? That was from a technical point of view. From a business perspective, that is a wholly different matter. And if you aren’t in the business of selling software, you probably don’t want to release your code (which includes many sensitive details about your organization and its behavior).

An example that would hopefully make it clear is the Google ranking algorithm. Knowing exactly how Google ranks pages would be a huge boon to any SEO effort. Or, if you consider the fact that the actual algorithm probably wouldn’t make sense without the data that drives it, consider the case of a credit rating agency. Knowing exactly how your credit score is computed can allow you to manipulate it, and the exact details matter. So you can take it for granted that businesses would typically want to avoid Open Sourcing their internal systems.

Dual licensing with a viral license utilizes this desire in order to charge money for OSS projects. Instead of using the software under a viral OSS license, customers pay to purchase a commercial license, which typically has none of the freedoms associated with Open Source projects.

Here is the dichotomy at the heart of this business model. In order to make money from OSS projects, companies choose viral licenses so that their users will pay to have less freedom (and to avoid its obligations). There Is No Such Thing As A Free Lunch still applies.

Recent moves by Redis and MongoDB, for example, show how this applies in practice. Redis’ Commons Clause prevents selling the software (directly or via SaaS), and MongoDB’s SSPL is used to prevent hosting of MongoDB by cloud providers without paying a license fee. The problem that both of them (and others) have run into is that new deployment models (SaaS in particular) have rendered the previous “protections” of viral licenses obsolete.

I find it refreshingly honest that Redis’ license change explicitly acknowledged that this isn’t an Open Source license any longer. And the SSPL was almost immediately classified as a non OSI license. MongoDB seems to think it meets the criteria for an Open Source license, but the OSI seems to disagree with that.

I wrote this post (and had an interesting time researching OSS history and license discussions) to point out this dissonance between a license that has more freedom (as the GPL / AGPL are usually described) and being more limited in how you can use it in practice. This is long enough, so I’ll have a separate post talking about how we approach both licensing and making money from Open Source.

time to read 7 min | 1304 words

Open Source is a funny business model: first you give away the crown jewels, then you try to get some money back. I have been working on OSS projects for close to twenty years now. I have been making my living off of OSS projects for most of that time. It is a very interesting experience, because of a simple problem: after you gave away everything, what do you charge for? I wrote this post because of this article and this tweet. The article talks about the Open Core model and how it usually ends up. The tweet talks about the reaction of (some) people in the marketplace when they are faced with the unconscionable request to pay for software.

The root problem is that there are two very different forces at play here.

  1. Building software is expensive. And that is only the coding part*.
  2. There is a very strong expectation that software should be freely available.

* If you also need to do documentation, double that. If you need to do deployment (multi platform, docker, k8s, etc.), do that again. If you need to support multiple versions, you guessed it. There is also the website, graphics, GDPR compliance and a dozen other things that you have to do if you want to go beyond the “some code on a GitHub repository” stage. There is so much more to a software project than just slinging code, and most of these things are not fun to do and take a whole lot of time. Which means that you have to pay people to do them.

When I say very strong expectation, I mean just that. To the point where if the code isn’t available, it is a huge barrier to entry. So in practice, you usually have to open source the project, or at least enough of it to satisfy people.

Reading the last statement again, it sounds very negative, but that isn’t meant to be the case. A major advantage of being an Open Source project is that you get a lot of credibility from potential users. To start with, people go in and go through your code. They do strange things like talk to you about it, offer advice, patches and pull requests. They make the software better. They also take your software and do some really amazing things with it. For the past decade and a half, my default position has been that software I write is open by default. I have yet to regret that decision.

An OSS project can typically get a lot more traction than a closed source one, these days. Which creates a lot of pressure to open source things. And that, in turn, leads us to a simple problem. How can you make money from OSS projects?

There are a few ways to do so:

Labor of love – in some cases, you have people who simply enjoy coding and sharing their work. The problem here is that eventually you’ll run out of time to donate to the project and have to find some means to pay for it.

Donations – this is how people typically imagine OSS projects are paid for. I have tried that a few times in the past; I don’t believe that I made enough money to go hit the movie theater midday.

Sponsorship (small) – sometimes a project is important enough for a company that they are willing to pay for it. That means either hiring the major contributors or paying them in some manner. This is a great way to get paid while working on what you are passionate about, especially because you can usually complete all the tasks that a project requires (from the website to the documentation).

Sponsorship (large) – I’m thinking about something like the Apache or Linux foundations, etc. These are typically reserved for stuff that is core infrastructure, and trying to build something like that from scratch seems… hard.

Services / Consulting – I did that actively for several years. Going to customers, helping them integrate / customize various projects that I was involved in. It was a lot of fun, but also exhausting. It’s basically being a consultant, but you are focusing on a few projects. Here, OSS work is basically awesome for building your reputation and showing off your skills. You can build a business around that, but that requires having a large number of users, and it is subject to the usual constraints of consulting companies. The most limiting of these is that the company is charging some % on top of the cost of its employees, and the % can’t be too high (otherwise the employees will just do that directly).

The common thread among all the options above? None of them are viable options if you have VC money. The problem with all of these options is that (even in the case of something like the Linux Kernel), the ROI just isn’t worth it.

So what can you do, if you believe that your project should be OSS (for marketing, political or strongly held belief reasons) and you want a business model that can show significant returns?

Support – Offer the project itself for free, but charge for support. For certain industries and environments, that works great. But it does suffer from a problem: if you don’t have to buy support, why would you? There is usually a conflict of interest here. Making the software simpler and easier to use will cannibalize the support that the company relies on. Red Hat is a good example of that. Note that a large part of what Red Hat does is the grunt work: back porting patches, ensuring compatibility, etc. The kind of things that need to be done, but that you won’t get people doing for fun. To my knowledge, however, there are very few, if any, examples of other companies that successfully monetize this approach.

Open Core – in this model, you offer the core pieces for all, but keep the features that matter most to the customers with the most money closed in some fashion. In a sense, this is basically what the Support model is doing, for customers who need support. GitLab, MySQL, Redis and Neo4J are common examples of open core models. The idea is that serving developers and the small fries (people who would typically not pay much, or at all) will get you the customers that will pay for the high end features. The idea here is to get people to purchase licenses, similar to how commercial software works.

N versions back – A more aggressive example of the open core model is simply having two editions. An open source one and a commercial one. The major difference is that the open source one is delayed behind the commercial one. Couchbase, for example, is licensed under such a model.

Dual licensing with viral license – in this model, the idea is that the code is offered under a license which isn’t going to be acceptable for the target customers. Leading them to purchase the commercial edition. This model also mandates that the company is able to dual license the code, so any outside contributions require a copyright assignment to the company.

Cloud hosting – in this model, the software itself is offered under an OSS license, but the most common use case is to use the cloud offering (and thus pay for the software). WordPress is a good example of that. The idea is that while people can install your software on their own machines, paying the company to do that is the more cost effective solution.

I’m sure that I have skipped many other options for making money out of OSS projects, but the ones I mentioned seem to be the most popular ones right now. I have got a lot more to say about this topic, so there will be more posts coming.

time to read 7 min | 1321 words

Almost by accident, it turned out that I implemented a pretty simple, but non trivial, task in both C and Rust and blogged about them.

Now that I’m done with both of them, I thought it would be interesting to talk about the differences in the experiences.

The Rust version clocks in at exactly 400 lines of code and uses 12 external crates.

The C version has 911 lines of C code and another 140 lines in headers and depends on libuv and openssl.

Both took about two weeks of evenings of me playing around. If I were working full time on this, I could probably have done it in a couple of days (but probably more, to be honest).

The C version was very straightforward. The C language is pretty much not there, and on the one hand, it didn’t get in my way at all. On the other hand, you are left pretty much on your own. I had to write my own error handling code to be sure that I got good errors, for example. I had to copy some string processing routines that aren’t available in the standard library, and I had to always be sure that I was releasing resources properly. Adding dependencies is something that you do carefully, because it is so painful.

The Rust version, on the other hand, uses the default error handling that Rust has (which has much improved since the last time I tried it). I’m pretty sure that I’m getting worse error messages than in the C version, but they are good enough to get by, so that is fine. I had to do no resource handling. All of that is already handled for me, and that was something that I didn’t even consider until I started doing this comparison.

When writing the C version, I spent a lot of time thinking about the structure of the code, debugging through it (to understand what is going on, since I was also learning how OpenSSL works) and seeing if things worked. Writing the code and compiling it were both things that I spent very little time on.

In comparison, the Rust version (although benefiting from the fact that I did it second, so I already knew what I needed to do) made me spend a lot more time on just writing code and getting it to compile. In both cases, I decided that I wanted this to be production worthy code, which meant handling all errors, producing good errors, etc. In C, that was simply a tax that needed to be dealt with. With Rust, that was a lot of extra work.

The syntax and language really make it obvious that you want to do that, but in most of the Rust code that I reviewed, there are a lot of unwrap() calls, because trying to handle all errors is too much of a burden. When you don’t take that shortcut, your code size balloons, but the complexity of the code doesn’t, which was a great thing to see.

What was really annoying is that in C, if I got a compiler error, I knew exactly what the problem was, and errors were very localized. In Rust, a compiler error could stymie me for hours, just trying to figure out what I need to do to move forward. Note that the situation is much better than it used to be, because I eventually managed to get there, but it took a lot of time and effort, and I don’t think that I was trying to explore any dark corners of the language.

What really sucked is that Rust, by its nature, does a lot of type inference for you. This is great, but this type inference goes both backward and forward. So if you have a function and you create a variable using HashMap::new(), the actual type of the variable depends on the parameters that you pass in the first usage of this instance. That sounds great, and for the first few times, it looked amazing. The problem is that when you have errors, they compound. A mistake in one location means that Rust has no information about other parts of your code, so it generates errors about those as well. It was pretty common to make a change, run cargo check and see three or four screens’ worth of errors pass by, and then go into a “let’s fix the next compiler error” loop for a while.

The type inference bit also comes into play when you write the code: because you don’t have the types in front of you (and because Rust loves composing types), it can be really hard to understand what a particular method will return.

C’s lack of async/await meant that when I wanted to do async operations, I had to decompose that to event loop mode. In Rust, I ended up using tokio, but I think that was a mistake. I should have used the event loop model there as well. It isn’t as nice, in terms of the code readability, but the fact that Rust doesn’t have proper async/await meant that I had a lot more additional complexity to deal with, and that nearly caused me to give up on the whole thing.

I do want to mention that for C, I ran Valgrind a few times to catch memory leaks and invalid memory accesses (it found a few, even when I was extra careful). In Rust, the compiler was very strict and several times complained about stuff that, if allowed, would have caused problems. I did like that, but most of the time, it felt like fighting the compiler.

Speaking of which, the compilation times for Rust felt really high. Even with 400 lines of code, it can take a couple of seconds to compile (with cargo check, mind, not a full build). I do wonder what it will do with a project of significant size.

I gotta say, though, compiling the C code meant that I would have to test the code. Compiling the Rust code meant that I could run things and they usually worked. That was nice, but at the same time, getting the thing to compile at all was a major chore many times. Even with the C code not working properly, the feedback loop was much shorter with C than with Rust. And some part of that was that I already had a working implementation for most of what I needed, so I had a lot less need to explore when I wrote the Rust code.

I don’t have any hard conclusions from the experience. I like the simplicity of C, and if I had something like Go’s defer to ensure resource disposal, that would probably be enough (I’m aware of libdefer and friends). I find the Rust code elegant (except the async stuff) and the standard library is great. The fact that the crates system is there means that I have very rich access to additional libraries and that this is easy to do. However, Rust is full of ceremony that sometimes seems really annoying. You have to use Cargo.toml and extern crate, for example.

There is a lot more to be done to make the compiler happy. And while it does sometimes catch you doing something you shouldn’t, I found that it usually felt like busy work more than anything else. In some ways, it feels like Rust is trying to do too much. I would have liked to see something less ambitious, focusing on just one or two concepts, instead of trying to be a high and low level language, with type inference set to the max, a borrow checker, memory safety, etc. It feels like this is a very high bar to cross, and I haven’t seen that the benefits are clearly on the plus side here.
