Creating a good SaaS experience in the cloud

time to read 8 min | 1407 words

Iimage care about usability and the user experience for our users. We spend a lot of time on making sure that things are running smoothly. When we created RavenDB Cloud, I knew that it was important to create a good experience for our cloud offering.

One of the most important things that I did was to go and look at other people’s offerings and see where they failed to meet customer expectations. I recently run into this article about AWS Elastic. Similar issues has been raised about it for a while. And it was one of my explicit design goals of what not to do in our cloud offerings.

Summarizing the discussion, it seems like the following major issues across the board.

  • Backups – Being able to have your own backup schedule and destinations. Retention policies based on your needs, etc. AWS Elastic uses hourly backups and you only get 14 days of that. Cosmos DB, to look at an Azure offering, is taking a backup every 4 hours and you have a maximum of two of them.

We have users that needs to be able to go back weeks / months / years and look at the state of their database at a given point in time. The 8 hours backup period for Cosmos DB is really short, but even 14 days on AWS Elastic is short enough that you probably need to roll some other solutions for that.

With RavenDB Cloud, you have automatic backups (hourly) with a retention period that defaults to 14 days. The key here, by the way, is defaults. You are absolutely free to define your own backup policies (per database or per cluster), that means that you can set your own destinations (want to do cross cloud backups, no problem) and your own retention policies.

  • Visibility – This is a very common complaint, it showed up in pretty much all resources that I checked and have been a common cause of complaints among the peers I sampled when we did the background for RavenDB Cloud. In particular, no logs or access to the debug endpoints is a killer for productivity. If you can’t tell what the problem is, you can’t fix it, so you’ll need to call support and have someone look things over. Here are a few choice quotes, that I think goes to the heart of things.

Here is Liz Bennett talking about things in 2017:

I feel equipped to deal with most Elasticsearch problems, given access to administrative Elasticsearch APIs, metrics and logging. AWS’s Elasticsearch offers access to none of that. Not even APIs that are read-only, such as the /_cluster/pending_tasks API.

Without access to logs, without access to admin APIs, without node-level metrics (all you get is cluster-level aggregate metrics) or even the goddamn query logs, it’s basically impossible to troubleshoot your own Elasticsearch cluster.

AWS’s Elasticsearch doesn’t provide access to any of those things, leaving you no other option but to contact AWS’s support team. But AWS’s support team doesn’t have the time, skills or context to diagnose non-trivial issues

And here is Nick Price, just a few days ago:

So your cluster resize job broke (on a service you probably chose so you wouldn’t have to deal with this stuff in the first place), so you open a top severity ticket with AWS support. Invariably, they’ll complain about your shard count or sizing and will helpfully add a link to the same shard sizing guidelines you’ve read 500 times by now. And then you wait for them to fix it. And wait. And wait. The last time I tried to resize a cluster and it locked up, causing a major production outage, it took SEVEN DAYS for them to get everything back online.

They couldn’t even tell if they’d fixed the problem and had to have me verify whether they had restored connectivity between their own systems.

I think that the reason this is the case is that AWS Elastic is a multi tenant service. so each instance is shared among multiple clients. That means that they have to limit the endpoints that you can access, to not leak data from other clients.

With RavenDB Cloud, you get all the usual diagnostic features features that you would usually get. We don’t want you to escalate things to support. The ideal scenario is that if you run into any trouble, you’ll be able to figure things out completely on your own. Hell, we do active monitoring and will contact our customers if we see outliers in the system behavior to gives them a heads up.

You can go to your RavenDB Cloud instance, pull the logs, watch ongoing traffic and even get a stack trace of the running production system. Everything is there, in the box.

  • Flexibility – The most common cited issue with AWS Elastic seems to be related to the issue of not being able to make changes in the environment. Or, rather, you can do that (increasing node count or changing the instance types) but when you do that, you are going to have a full blown migration step. That means that you’ll double the number of nodes you’ll run and incur a really expensive operation to copy all the data. Given that Elastic already has this feature (add a node to an existing cluster) the decision not to support it is likely related to constraints on the AWS Elastic multi tenancy layer.

I’m not really sure what to say here, RavenDB Cloud has no such issue. To be rather more exact, our multi tenant architecture was specifically designed so the outward facing differences between how you’ll operate RavenDB on your own systems vs. how you’ll operate RavenDB on the Cloud will be minimal.

Adding and removing node is certainly possible. And in fact, in my intro video to RavenDB Cloud I showed how I can upgrade a cluster along multiple axes, while it is being actively written to. All with exactly zero interruptions in service. Client code that was busy reading and writing from the cluster didn’t even noticed that. That was certainly not an easy feature to implement, but I considered this to be the baseline of what we had to offer.

  • Support – Both Nick and Liz had a poor experience with AWS Elastic support. You can read the quotes above, or read the full posts for the whole picture.

I don’t like support. We explicitly modeled the company so support is a cost center, not a source of revenue. That means that we want to close each support incident as soon as possible.  What this doesn’t mean is that we do an auto close of all issues on creation. That would give us fantastic closure rates, I imagine, but at a cost.

Instead, we have a multi layered system to deal with things. Consider the scenario Nick run into, a full disk. That is certainly something that you might run into, no?

  1. Automatic recovery - With RavenDB,  a single full disk will simply cause that node to reject writes, nothing else. And the cluster as a whole will continue to operate normally. With RavenDB Cloud, we have a lot more control, and our monitoring systems will automatically alert on disk full and increase its size automatically behind the scenes with no impact on users.
  2. Diagnostics - Active monitoring means that you’ll be notified in the RavenDB Studio about suspicious issues (for example, you run out of IOPS) and be able to investigate them with full visibility. RavenDB does a lot of work to ensure that if something is broken, you’ll know about it.
  3. Front line support – if you need to call our support, the person answering the call is going to be able to help you. They would typically be an engineer that was involved either in the actual building / managing of RavenDB Cloud or (2nd tier) involved in the development of RavenDB itself.

My goal with a support call is to get you back to speed as soon as possible, and the usual metrics for that are measured in minutes, not days.

We are now several months post the launch of RavenDB Cloud and the pickup of customers has been great. What is more important from my end, however, is that we are seeing how this kind of investment in our architecture and setup is paying off.