Whitepaper: Couchbase vs RavenDB Performance at Rakuten Kobo
We just published a white paper comparing RavenDB and Couchbase performance in a real customer scenario.
I had to check the results three times before I believed them. RavenDB is pretty awesome, but I had no idea it was that awesome.
The data set was reasonably big, 1.35 billion documents, and the scenario we present is a real-world one based on production load.
Some of the interesting details:
- RavenDB uses 1/3 of the disk space that Couchbase uses, but stores 3 times as much data.
- Operationally, RavenDB just worked; Couchbase needed 6 times the hardware just to scrape by. A single failure in Couchbase meant 15 – 45 minutes for the node to recover. Inducing failures in RavenDB brought the node back up in a few seconds.
- For queries, we pitted a Couchbase cluster with 96 cores and 384 GB RAM against a single RavenDB node running on a Raspberry Pi. RavenDB on the Pi was able to sustain better latencies at the 99th percentile while handling twice as much load as Couchbase could.
There are all sorts of other goodies in the white paper, and we went pretty deep into the overall architecture and the impact of the different design decisions.
As usual, we welcome your feedback.
Comments
That's nice, maybe I'll live to see the day when their JP store won't throw some kind of error on every page navigation 😹
Congrats, looks like RavenDB is not a couch potato, and it managed to do the task with almost no overhead in disk usage vs. the raw data. However, I wonder: if the goal was to optimize the data structure for quick lookup of highlights by user id and book id, I think there's still a lot of overhead even in the raw data. 1.35 billion records; assume big numbers and let's take 8 bytes for the book id, 8 bytes for the user id and 4 bytes for the position in the book - this gives us 27 GB of data. With binary storage of data and indexes we would fit everything in 64 GB. Just put it in bare Voron or BerkeleyDB and a single laptop would handle hundreds of thousands of queries per second. And you don't need clusters, sharding, caching...
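A quick back-of-envelope check of that estimate (a minimal sketch; the 8/8/4-byte field sizes are the commenter's assumptions above, not figures from the whitepaper):

```python
# Back-of-envelope check of the estimate in the comment above.
# Field sizes are the commenter's assumptions, not whitepaper figures.
records = 1_350_000_000           # 1.35 billion highlight records
bytes_per_record = 8 + 8 + 4      # book id + user id + position in book

raw_gb = records * bytes_per_record / 10**9
print(f"raw data: ~{raw_gb:.0f} GB")  # ~27 GB raw; the comment allows ~64 GB
                                      # total once binary indexes are included
```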
Rafal,
The key here isn't the association of users to books; what we were working with here was the highlights. There is a sample document there that shows the data.
Yes, you can try to model things in the manner you describe, but then the cost of loading the data for a user request becomes much higher. You'll need to get the book contents (which may be big), scan to the relevant location, parse the content, translate markup to text, etc. It is cheaper and easier to do it the other way around.
Especially when you have to do that once per highlight, and some people do a LOT of highlights.
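To make the tradeoff concrete, here is a hypothetical sketch of the two modeling options being discussed (the document shapes below are illustrative only, not the sample document from the whitepaper):

```python
# Hypothetical document shapes, for illustration only.

# Option A (as described in the reply above): each highlight is its own small
# document with the extracted text, so "show this user's highlights for this
# book" is a direct lookup with no further processing.
highlight_doc = {
    "UserId": "users/1234",
    "BookId": "books/5678",
    "Text": "the highlighted passage, already extracted as plain text",
    "Position": 88734,
}

# Option B (the compact binary layout suggested in the comment): store only
# ids and a position. To display the highlight you must also load the
# (possibly large) book content, seek to the position, parse the markup and
# extract the text - and you pay that cost once per highlight.
compact_record = ("user:1234", "book:5678", 88734)
```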
Rafal,
Another thing, note that the data wasn't just for the highlights. The dataset included a lot of other details which weren't relevant for this specific benchmark. They were there to show data management for large databases.
Yep, I must have oversimplified it. But you know, 'billions of something' looks impressive until you realize that a gigabyte is a billion bytes, and even your phone has a few GB of RAM. So not everything with a billion records is necessarily a large database that requires a datacenter (but with a careful choice of database you may well need one)
... which reminds me of the recent mention of Parler and their insane data overheads - the Couchbase case doesn't look that bad compared to that
Excellent read. And now go against Mongo and Cosmos DB please...
This is probably a difficult subject - going against competing products while knowing they all serve the same purpose, all get the job done, and none of them is particularly expensive - I would not expect spectacular differences. However, as shown here, if you get a 2-3x reduction in disk usage, need half the RAM, and maybe half the infrastructure, then it's substantial; not spectacular, but still worth showing. Spectacular would be, for example, negating the need for an expensive cluster or reducing the number of servers 10-fold, but this is not possible without changing the approach entirely. And IT folk are not that easy to impress - after all, they are the IT gurus in their companies, the experts and know-it-alls, who made some decisions and need to prove they were right - so anyone coming and announcing 'hey, your database is a slow, bloated, data-losing monstrosity' will be shot immediately, or at least called an idiot.
Much better, in my opinion, is to find a specialization, some niche where your product really solves some problem better than everything else out there, and then it will shine. Not sure if that applies to databases - a very general-purpose tool - but maybe in some particular class of problems, in some specific businesses... NB, there are many specialized, niche products (for example, software for handling medical data) where companies can successfully sell products of inferior quality just because they get a hit on several keywords, or have some compliance certificates that technically mean nothing but that no one else has... not implying that this is the way to go, but it seems a clever strategy.
Rafal,
Do note that for real-world scenarios, you can run at 8% of the hardware costs! That is better than your 10-fold scenario (it works out to more than a 12-fold reduction).
I admit I didn't parse that information from the article. Pretty bad differences at some points; I don't know Couchbase at all, but maybe there's some configuration problem, or it's used in the wrong way for the data? Or the community edition has some speed limit built in?
Rafal,
I pinged someone who is quite knowledgeable about how Couchbase works; they didn't find any glaring issues in the way we set things up. We also tested the Enterprise edition; their license limits the detail I can expose, but it isn't a magic fix.
Then I hope they do the right thing at Rakuten :)