re: Entity Framework Core performance tuning – Part II
After looking at this post detailing how to optimize data queries in EF Core, I obviously decided that I needed to test how RavenDB handles the same load.
To make things fair, I tested this on my laptop, running in battery mode. The size of the data isn’t that large, only 100,000 books and half a million reviews, so I decided to increase that by an order of magnitude.
The actual queries we make from the application are pretty simple and static. We can sort by votes / publication date / price (ascending / descending), and we can filter by the number of votes and the publication year.
This means that we don’t have an explosion of querying options, which simplifies the kind of work we are doing. To make things simple for myself, I kept the same model of books / authors and reviews as separate collections. This isn’t the best model for a document database, but it allows us to compare apples to apples between the work the EF Core based solution and the RavenDB solution need to do.
A major cost in Jon’s solution is the need to aggregate the reviews for a book (so the average review score can be computed). In the end, the only way to get the required performance was to manually calculate the average for each book and store the computed value on the book itself. We’ll discuss this a bit more in a few minutes; for now, I want to turn our eyes toward the simplest possible query on this page: getting 100 books sorted by the book id.
Because we aren’t running on the same machine, it is hard to make direct parallels, but on Jon’s machine this kind of query took 80 ms on 100,000 books. When the data was increased to half a million books, the query time rose to 150 ms. Running the same query on RavenDB gives us the results instantly (zero ms). Querying and sorting by the title, for example, gives us the results in 19 ms for a page size of 100 books.
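For reference, here is a minimal sketch of that query with the RavenDB .NET client; the Book class and the already-initialized IDocumentStore named store are illustrative assumptions, not code from the original post:

```csharp
using System.Linq;
using Raven.Client.Documents;

// store: an initialized IDocumentStore (assumed)
using var session = store.OpenSession();

var books = session.Query<Book>()
    .OrderBy(b => b.Id)   // or b.Title, for the title-sorted variant
    .Take(100)            // page size of 100 books
    .ToList();
```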
Now, let us look at the major complexity of this system: sorting and filtering by the number of votes. This is hard because the reviews are stored separately from the books. With EF Core, there is the need to join between the tables, which is quite expensive and eventually led Jon to take it upon himself to manually maintain the values. With RavenDB, we can use a map/reduce index to handle this all for us. More specifically, we are going to use a multi map/reduce index.
Here is what the index definition looks like:
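A minimal sketch of what such an index could look like, using the RavenDB .NET client’s AbstractMultiMapIndexCreationTask; the Book / BookReview shapes and the Published / Votes property names here are illustrative assumptions:

```csharp
using System;
using System.Linq;
using Raven.Client.Documents.Indexes;

public class Books_WithVotes : AbstractMultiMapIndexCreationTask<Books_WithVotes.Result>
{
    public class Result
    {
        public string BookId { get; set; }
        public DateTime Published { get; set; }
        public int NumberOfVotes { get; set; }
        public int TotalVotes { get; set; }
        public double AverageVotes { get; set; }
    }

    public Books_WithVotes()
    {
        // One entry per book, carrying its publication date.
        AddMap<Book>(books =>
            from book in books
            select new Result
            {
                BookId = book.Id,
                Published = book.Published,
                NumberOfVotes = 0,
                TotalVotes = 0,
                AverageVotes = 0
            });

        // One entry per review, carrying a single vote.
        AddMap<BookReview>(reviews =>
            from review in reviews
            select new Result
            {
                BookId = review.BookId,
                Published = DateTime.MinValue,
                NumberOfVotes = 1,
                TotalVotes = review.Votes,
                AverageVotes = 0
            });

        // Reduce both streams into a single aggregated entry per book.
        Reduce = results =>
            from result in results
            group result by result.BookId into g
            select new Result
            {
                BookId = g.Key,
                Published = g.Max(x => x.Published),
                NumberOfVotes = g.Sum(x => x.NumberOfVotes),
                TotalVotes = g.Sum(x => x.TotalVotes),
                AverageVotes = g.Sum(x => x.NumberOfVotes) == 0
                    ? 0
                    : g.Sum(x => x.TotalVotes) / (double)g.Sum(x => x.NumberOfVotes)
            };
    }
}
```

Note that the reduce computes the average from the running sums rather than averaging averages, which keeps the reduction correct no matter how the entries are batched.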
We map the results from both the Books and the BookReviews into the same shape, and then reduce them together into the final output, which contains the relevant aggregation.
Now, let us do some queries, shall we? Here we are querying over the entire dataset (an order of magnitude larger than the EF Core sample set), filtering by the publication date and ordering by the computed vote average. Here, we get the first 100 items, and you can see that we got 289,753 total results:
One very interesting feature of this query is that we are asking to include the book documents for the results. This is handled after the query (so there is no need to join against the entire 289K+ result set), and we are able to get everything we want in a very simple fashion.
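Under the same illustrative assumptions as the index sketch above, the query could look roughly like this (the year 2000 cutoff is an arbitrary example, and store is again an assumed, initialized IDocumentStore):

```csharp
using System;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;
using Raven.Client.Documents.Session;

using var session = store.OpenSession();

var page = session.Query<Books_WithVotes.Result, Books_WithVotes>()
    .Statistics(out QueryStatistics stats)                // exposes the total result count
    .Include(x => x.BookId)                               // ship the Book documents back with the results
    .Where(x => x.Published >= new DateTime(2000, 1, 1))  // illustrative cutoff date
    .OrderByDescending(x => x.AverageVotes)
    .Take(100)
    .ToList();

Console.WriteLine($"{stats.TotalResults:N0} total results");

foreach (var item in page)
{
    // Already tracked by the session thanks to Include - no extra server round trip.
    var book = session.Load<Book>(item.BookId);
}
```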
Oh, and the total time? 17 ms, compared to the 80 ms result for EF with 1/10 of the data size. That is pretty nice (and yes, different machines, hard to compare, etc.).
I’ll probably have another post on this topic, showing off some of the cool things that you can do with RavenDB and queries.
More posts in "re" series:
- (19 Jun 2024) Building a Database Engine in C# & .NET
- (05 Mar 2024) Technology & Friends - Oren Eini on the Corax Search Engine
- (15 Jan 2024) S06E09 - From Code Generation to Revolutionary RavenDB
- (02 Jan 2024) .NET Rocks Data Sharding with Oren Eini
- (01 Jan 2024) .NET Core podcast on RavenDB, performance and .NET
- (28 Aug 2023) RavenDB and High Performance with Oren Eini
- (17 Feb 2023) RavenDB Usage Patterns
- (12 Dec 2022) Software architecture with Oren Eini
- (17 Nov 2022) RavenDB in a Distributed Cloud Environment
- (25 Jul 2022) Build your own database at Cloud Lunch & Learn
- (15 Jul 2022) Non relational data modeling & Database engine internals
- (11 Apr 2022) Clean Architecture with RavenDB
- (14 Mar 2022) Database Security in a Hostile World
- (02 Mar 2022) RavenDB–a really boring database
Comments
Hi Oren, I follow your blog but have no experience with RavenDB - so maybe my question is stupid: when you create the index where the sum, total and average are computed - is the result computed more or less immediately after the creation and updated with every update/insert/delete on Books or Reviews? Or are those values calculated when issuing the query? Regards Thomas
@Tom, the results are computed on the data available at index creation and then updated after every update/insert/delete. Basically, you are querying a table that is automatically kept in sync with the rest of the data.
The downside is that it does not update immediately; there is no ACID guarantee for the indexes. They could be updated 10 ms after an insert or 2 minutes later. There are ways to mitigate this for the most common scenarios, but still, it can't happen in a transaction.
@Mircea Thank you for the explanation. But then it seems to me that apples are compared to peaches, since the relational database calculates the aggregated values on the fly. A better comparison would be one where the relational database calculates those aggregated values beforehand and stores them somewhere. The calculation could be started by a trigger or by a scheduler.
I would have a hard time coming up with a scenario where the results need to be up to the second or up to the minute for ranking based on reviews.
The non-async computation (the EF solution) forces that behavior on you because of the technology, while RavenDB allows the computation to be eventually realized (which, for all practical purposes, ends up being consistent very quickly).
I would be interested in how an async computation of rankings would be implemented with an EF solution.
@Tom, well in that case you might as well just stick to comparing SQL databases to each other, because yes, in this case you are comparing apples to oranges. The use case is the same, but there are different trade-offs. It really depends on what you want to do; there are use cases where RavenDB's map/reduce indexes don't quite fit what you want to do and a relational database might be a better fit. There is also the opposite case - true for most "web" applications - where you need to load one entity with a lot of related data and rarely need to make complex queries on it.
How does the map/reduce index in RavenDb compare in performance to using an INSERT/UPDATE trigger on BookReviews to update the Total/Average fields on the Book table in SQL?
That's perhaps not as simple as a map/reduce index, but if you want a true comparison of what each system is capable of, it makes more sense to compare stored and indexed values against stored and indexed values when the capability exists on both sides, instead of forcing one side to calculate on every query.
That'd also be an interesting comparison on the insert/update side -- how does SQL compare in that situation with a lot of inserts firing triggers vs. RavenDb updating the index? And how does SQL perform using an initial bulk load with triggers disabled and a single query to update all of the averages vs. RavenDb in a similar situation?
Do you have any tutorials on many to many with add and attach operations to keep data unique?
Chris, RavenDB map/reduce is faster; it doesn't run during the write transaction, for one, so it doesn't slow down writes. It also means that it can operate in bulk, which gives it a great speed advantage.
The major difference here is that with RavenDB, the process I described is the natural and easy way to do things, while triggers tend to be frowned upon, especially as the number of operations you need grows.
Joe Hoeller, I don't follow what you mean?
TomCollins, Yes, we compute the results for all the permutations immediately and then update them whenever there is a change. At query time, you are seeing the already computed results, which is partly why we are so much faster.
TomCollins, Yes, this isn't the same comparison, but the idea isn't to compare the two databases doing the same operation. I'm pretty sure that whenever RavenDB and SQL Server compete head to head doing the exact same thing, SQL Server will be faster. The whole point is that we don't need to do the same things. And by not doing the same things, we can be faster and easier to work with.
When we designed RavenDB, we thought a lot about the use of databases in OLTP scenarios and how we could create a database suited for business applications. One of the results is the way aggregation works: it is very fast and can be very easily plugged into your system, without all the considerations that you have to make with a relational system.
To wit, book rankings are hardly the place where you'll need absolute consistency, but if you are going to build something that does offline aggregation, you are going to have to build, monitor, schedule and update it all the time. That is a lot of code to write, all of which you don't have to write with RavenDB. That means you can actually make use of aggregations within your business app without paying the tremendous cost around them. That cost is ridiculous, since even something as trivial as putting "you have 10 unread messages" at the top of the page can have a major impact on the system with a relational database, and no impact with RavenDB.
Perhaps map/reduce is faster than triggers, perhaps not. If you compare an optimized RavenDb to an RDBMS scenario running with one hand tied behind its back, then it's hard to say. A site displaying reviews is read-heavy, not write-heavy, so the impact of triggers is likely negligible in a real-world scenario -- but again, it'd be great to see an apples-to-apples comparison of both reads and writes.
If you're going to talk about optimization and compare performance metrics, at least do the RDBMS the favor of actually optimizing the low-hanging fruit and using the functionality it's capable of.
I love RavenDb and I'm an avid follower of your blog. I learn a lot about all of the intricacies of programming performant algorithms, how much of an impact allocations have, etc. But if you're running benchmarks and throwing out comparisons of records processed and milliseconds taken, it'd be much more beneficial to see an actual optimized RDBMS version vs. an optimized RavenDb version.
Chris, This is hardly a benchmark per se. I took an article that specifically showed a set of changes to improve the performance of a relational based system and showed how I could do the same in RavenDB, significantly faster. Note that I spent very little time trying to optimize anything in RavenDB, and consider the fact that the article stopped at a certain point.
That should tell you that there is a clear cost/benefit line here with regard to the complexity of the solution, which causes a threshold to be reached at some point with the relational system. In particular, consider the case where you are not talking about just a single feature but dozens of them, and then consider the impact on the system of having to hand tailor a solution for each of them.
A good example would be this blog. On the right, you have the total number of posts and comments; that is sent on each request, and it is done via map/reduce. It is a tiny feature, but to do it "properly" with an RDBMS I would have to write a lot of code. Consider a typical business application and the amount of data that you want to show, and it should be clear why such things matter, a lot.
A little sidestep from the main topic of this discussion... We all know and love/hate Excel - when working with business software you can't avoid your application being compared and matched with Excel. And sometimes Excel is the model, perfect application (with only a few minor disadvantages) that you can never beat. Everything can be done in Excel; it's flexible and can accept any input and any change in the logic without complaining, and if you don't like something you can always make a copy and change it. So usually your application rates poorly compared to the almighty spreadsheet. But there is one nice feature I really admire in Excel - dependency tracking. No matter how convoluted the formulas are and how many sheets they span, if you modify something then all the data changes where necessary and you always see the correct state. It's so hard to replicate such behavior in a database without making a mess everywhere, plus it's going to get slower and slower with every modification. So, my question is: how do you approach such a requirement, in a SQL database and maybe in Raven, so you can tell the system what the dependencies between the data are and it will figure out how to keep everything in sync?
Rafal, With RavenDB, that can be handled with recursive map/reduce. You change one part of the system, and it will roll over to the rest of it.
However, note that this is a pretty strange thing to do, mostly because in software you don't want to have this kind of flow; it makes tracking changes and figuring out what is going on very hard. In particular, even though something may trigger a change in something else, we need to worry about all sorts of things that also need to happen. For example, if you change the price of a product, its tax rate may change, and that may apply a different policy, etc. That isn't just a value calculation; there are additional actions that need to take place.