Chapter 9 of the Inside RavenDB 3.0 book is now available. This chapter goes over Full Text Search, and how to make the best use of that in RavenDB. It covers how Full Text Search works, Suggestions, Facets, and more.
You can download it using the following link. We now have PDF, Kindle and ePub versions available.
On Friday, Microsoft came up with Azure DocumentDB. You might say that I have a small interest in such things, so I headed over there to see what I can learn about this project.
Aside from being somewhat annoyed with the name, this seems to be a very different animal from RavenDB, and something that was built to serve a different niche. One of the things that we put first with RavenDB is ease of use, development and deployment for business applications. The ADB design appears to be built around a different goal, around very big datasets.
Nitpicker corner: Yes, I know this is a preview, and I know that they are going to be changes. And I repeat, I have no knowledge about this project beyond the documentation and several hours of playing with it.
That said, I do have a fair bit of experience in this area. So I feel that I can speak with confidence about the topic.
ADB is supposed to be an highly scalable system that store documents. So far, so good, I can certainly understand that need. But it has made drastically different design choices, some of which I feel very strongly about. I'll try to explore the issues that I have issues with, and contrast that with what you can do with RavenDB.
This post has two parts, the first talks about conceptual issues. The second talk about the currently published limits, and their implications for general use for ADB.
- No sorting option, or a good paging story
- SQL Injection, without any other alternative
- Hard to deploy and to keep current with your codebase
- Poor development story & no testing story
- Poor client API
- Lots of table scans
- Limited queries and few optimization options
- Single document transactions (from the client)
- No cross collection transactions at all
- Very small document sizes allowed
Also see the “What is this for?” section below.
For a document database platform that doesn’t have any of those issues, and run in Azure, see RavenHQ on Azure.
Transactions – ADB say that it has transactions, and for a very limited meaning of the word, I believe it means it. Transactions in ADB means a single document only can be saved with a guarantee it will either be saved or not. That is great, in the sense that at least you won’t have data corruption, but that isn’t really something that mean much. Even MongoDB can satisfy that bar.
Oh, sure, you can get actual transactions if you write JS code that run as a “stored procedure” inside ADB. This means that you can send data to the server and have your JS Stored Procedure make multiple operations in a single transaction. Which is just slightly better (although see my comments on those stored procedures later), but that is still limited to only operations inside the same collections.
A trivial example for transactions in a document database would be to add a new comment, and update the comment count. You cannot do that in ADB. Not in a single transaction. I don’t know about you, but most of the interesting use cases happen when you are working with multiple document types. Sure, you can put all your documents inside the same collection, but have fun trying to work with that in the long term.
In contrast, RavenDB fully support actual transactions that can span multiple documents (even on different collections, which I would never believe would be an accomplishment). RavenDB can even support DTC and transactions that spans multiple interactions with the server. You know, the kind of transactions you actually want to use. For more, see the documentation on RavenDB transactions.
Management – it honestly feels like someone missed all the points that made people want to ditch SQL in the first place. ADB has the concepts of triggers, user defined functions (more on that travesty later, when I discuss queries) and stored procedures. You can define them in JS, and you create something that looks like this:
Let me count the ways that this is going to cause problems for you.
- Business logic in the database, because we haven’t learned anything about that in the past.
- Code that you cannot run or test independently. Just debugging something like that is going to be hard.
- No way to actually manage deployment or make sure that this code is in sync with the rest of your codebase.
- Didn’t we already learn that triggers are a source for a lot of pain? Are they really necessary for you to do things?
Yes, you have a DB that is schema less, but those kind of things are actually important. They define what you can do with the database, and not having a good way to move those around, and most importantly, not having a way to tie them to the source control system you are using is going to be a giant PITA.
Sorry, that isn’t actually something that you can delay doing for later. You need a good development story, and as I see it, the entire development story all around here is just going to be hard. You would have to manually schlep things around between development and production. And that isn’t just about the SP or UDFs. There are a lot of settings that you’re going to have to deal with. For example, the configuration per collection, which you’ll want to make sure is the same (otherwise you get some very subtle and hard to understand bugs).
For that matter, there doesn’t seem to be a development story. You are probably expected to just run another ADB instance on Azure and use that. This means a lot of latency in development, and that also means that you can’t have separate databases per developer, which is a standard practice. This means having to write a lot of code just to manage those things, and you are right back again at the good old days of “who didn’t update the schema script” and failed deployments.
In contrast, RavenDB make is very easy to handle your indexes & transformers in your code and deploying them as a single step. That also means that they are versioned in the same place as your code, so you don’t have to worry about moving between dev & prod. We spent a lot of time thinking and working around this specific area, because this is a common pain point in relational databases, and we weren’t willing to accept that being the case in our database. For more information, please see the documentation about index management in RavenDB.
Indexing – there are several things here that bother me. By default, everything is indexed, and in the same transaction. This is a great decision, for a demo system. But in a real world system, the overhead of indexing everything is prohibitive, especially in a high write system. So ADB is allowing to specify the paths that you will include or exclude from indexing, as well as whatever indexing should be within the same transaction or lazy.
The problem with that is that those are per collection settings and there doesn’t appear to be any way to modify them after the fact. So you start running your system in production, realize that the cost of indexing is high, so you need to change the indexing strategy for a collection. The only way to do that is to create a new collection, with a new indexing strategy, move all the data there, then delete the old one. For even more fun, consider the case where you have a production and development environments. In production, you have a different indexing strategy then in development (where the ‘index everything’ mode is still on). That means that when you push things to production, your system will fail silently, because you won’t be indexing the fields you though were indexed.
This need re-iteration, the way this currently work, you start running with the default indexing option, which is expensive. As long as you don’t have any performance requirements (for example, during development), that is just fine. But when you actually have a lot of data there, or have a lot of writes, that is when you’ll figure out that those things need to be changed. At that point, you are are pretty much screwed, because you need to pull all the data out, create a new collection with the new indexing options, and write it all back. That is a horrible experience, especially because you’ll likely need to do that under pressure with users breathing down your necks and management complaining about the performance.
For that matter, indexing in general scares me. Again, I don’t actually have any knowledge of the internal operations, but there are a lot of stuff there that just doesn’t make sense. It looks like the precision of the indexes used are up to 3 characters (by default) per value. I’m guessing that this is done to reduce the amount of space used by the indexing, at least that is what the docs says. The problem is that when you do that, you do a lookup by the first 3 characters, then you have to do a flat search over all the other values. That is going to be causing problems.
It is also indicated that you cannot do any range searches except on numeric values. Which has interesting implications if you want to do searches on something like a date range, or time spans, an incredibly common operation.
In contrast, RavenDB indexes are always using the full value, so you are getting an O(logN) search behavior, and not a fallback to O(N) behavior. Range searches are possible on any value, numeric, date time, time span, string, etc. For more information, see the RavenDB documentation about searching with RavenDB.
Queries – Speaking of problems. Let me talk for a moment on ADB SQL. It looks nice on the surface, it is certainly would be familiar to most people. It is also contain a lot of hidden traps.
For example, the docs talk about being able to do joins, but you are only actually able to do “joins” into the sub documents, not into other collections, or even documents in the same collection. A query such as:
SELECT c.Name as CustomerName, o.Total, o.Date FROM Orders o JOIN Customers c ON c.Id = o.CustomerId
Can’t be executed on ADB. So the whole notion of “joins” is actually limited to what you can do in a single document and the sub documents it contains. That make it very limited.
The options for filtering (where clause) is also interesting. Mostly because of the wide range they allow. It is very easy to create queries that cannot be computed using indexes. That means that your query is now running table scans. Lots & lots of table scans. Sure, you don’t have tables, but O(N) is still O(N), and when N is large, as it is apparently the expected case here, you are going to be pretty much dead in the water.
Another thing that I can’t wrap my head around is the queries shown. There is no way to pass parameters to the query. None. This appears to be the case because 30+ years of working with SQL has shown that there is absolutely no issue with putting user’s input directly into the query. And since complex queries require you to use the raw ADB SQL, you are pretty much have guaranteed that you’ll have SQL Injection attacks.
Sure, you might no get caught by Little Bobby Tables (you can’t modify data via SQL), but you are still exposed and can leak important data. This query works just fine, and will return all products:
SELECT * FROM Products p WHERE p.Name = "testing" OR 1 = 1 -- "
I’ll assume that you understand how I got there. This is a brand new database engine, but ADB is bringing very old issues back into the future. Not only that, we don’t have anyway around that. I guess you are going to have to write your on parameter scrubbing code, and make sure to use it everywhere.
In general, queries are limited. Severely limited, actually. Take a look at the following query:
SELECT * FROM Products p WHERE p.Type = "Beer" AND p.Maker = "Guinness" AND p.Discontinued = false AND p.Price > 10 AND p.Price < 100
You can’t run it in ADB. It is too complex to run. Note that this is about as trivial a query as you can get, in pretty much any reasonable business system.
Continuing on with the problems for business apps theme, there doesn’t appear to any good way to do things like paging. When you issue a query, you can specify the number of items to take and you can repeat a query by passing a continuation. But that doesn’t really help when you need to actually page with the user. So you show the data to the user, then want to go to the next page… you have to pass the continuation token all the way around, and hope that it will remain valid for the duration. For that matter, the current client API does paging at the server level, but it will fetch all the results for a query, even if it take it hours to do so.
There is no way to actually get the total number of items that match the query. So you can’t show the user something like: “You have 250 new emails”, nor can you show them “Page 1 … 50”.
Another troubling omission is the total lack of anything that would allow you to actually query your documents in a particular order. If I want to get the latest orders in descending order (or in fact, in any well defined order), I am out of luck. There is no way of doing that. This is a huge deal, because this isn’t just something that you can try papering over. This is a core functionality that you need in pretty much any application. And it is just not there. There is some indication that this is being worked on, but I’m surprised that this isn’t here already. Distributed sorting is a non trivial problem, of course, so I’ll reserve further judgment until I see what they have.
ADB’s queries are highly limited, so I expect a workaround for that is going to be to push functionality into the UDF. Note that UDF don’t have access to any context, so it can’t load additional documents. What it can do it utterly destroy any chance you’ll ever have for optimizing a query. The moment that a UDF is involved, you don’t really have a choice about how to execute a query, you pretty much have to go to a table scan. Maybe filtering some stuff based on the other filters in the query, but in many cases, that means that you’ll have to run your UDF over millions of records. Because UDFs are permitted to perform non pure operations (like the current time), you can’t even cache its values, or do anything smart around that. You’ll always have to execute the UDF, regardless of the amount of data you have to go through. I don’t expect that to perform very well.
In contrast, RavenDB was explicitly designed to give you both flexibility and performance in queries. There are no table scans in RavenDB, and complex queries are expected, encouraged and are handled properly. Queries across multiple documents (and in other collections) are possible, and quite easy to do. Common operations, like paging or sorting are part of the core functionality, and are both very easy to use and come with no additional costs. Complex things like full text search, spatial queries, facets and many more are right there for you to use. For more information, see the RavenDB documentation about querying in RavenDB, spatial searches in RavenDB and how RavenDB actually index the data to allow complex operations.
Data types – ADB data types are the ones defined in the JSON spec. In particular, it doesn’t have native support for date times. The ADB documentation suggest that you’ll do custom serialization to handle that. Rendering things like asking: “Give me all the orders for this customer for 2014” very hard, leaving aside the issues of querying for orders in a particular month, which is not possible either, since you can only do range searches on numeric data. Dates, in particular, are a very complex topic, and not actually handling this in the database is going to open you up for a lot of issues down the road. And dates are kinda important type to have.
In contrast, RavenDB handles complex (including user defined) types in a well defined manner. And has full support for dates, operations on dates, etc. It seems silly to mention, to be fair, because it seems so basic to have that. For more information, you can read the documentation about dates in RavenDB.
Aggregation – this one is simple, you don’t have any. That means that you cannot get the total number of unread emails, or the total sum of orders per customer, or the maximum order per this month . This whole functionality just isn’t there.
In contrast, RavenDB has explicit support for counting the number of results for a query as well as map/reduce indexes. Those give you powerful aggregation framework, which execute the work in the background. When you query, you get the pre-computed results very quickly, without having to do any work at query time. For more information, you can read about Map/Reduce in RavenDB and dynamic aggregation queries.
Set operations – another easy one, it is just not there. You can do some operations in a stored procedure, but you have 5 seconds to run, and that is it. If you need to do something like: Split FullName to FirstName and LastName, get ready to write a lot of code, and wait for a long time for this to complete. For that matter, something as simple as “delete all inactive users” is very hard to do as well.
In contrast, RavenDB has explicit support for set based updates and deletes. You can specify a query that match a set of results that would either be deleted or patched using a JS script. For more operations, read the documentations about Set Based Operations.
Client API – this is still a preview, so that is somewhat unfair, but the client API is very primitive. Basically, it is a very thin wrapper around the REST API, and it does a poor job at that. The REST API support paging, but the C# client API does not, for example. There is no concept of unit of work, change tracking, client side behavior or anything at all that would actually make this work nicely. There is also an interesting design decision to go async for all operations except queries.
With queries, you actually issue an async REST call, but you are going to be waiting on that query synchronously. This is probably because of the IQueryable interface and its assumption that the query is sync. But that is a very bad thing to do in terms of mixing sync and async work. It is easy to get into problems such as deadlocks, self lock and just plain weirdness.
In contrast, RavenDB has a carefully designed client APIs (for .NET, JVM, etc), which fully expose the power of RavenDB. They have been designed to be intuitive, easy to use and guide you into the pit of success, RavenDB also have separate sync and async API, including fully async queries. For more information, read the documentation about the client API.
Self links – when issuing any operation whatsoever to the database, you have to use something call the object link, or self link. For example, in my test database, the Products collection link is: dbs/frETAA==/colls/frETANSmEAA=/
You have to use links like that whenever you make any operation what so ever. For fun, those are going to be unique per database, so creating a Products collection in another database would result in a different collection link. That means that I can’t just store them in configuration. So you’ll probably have to read them from the database every time you need to use them (maybe with some caching?). This is just silly. This is making it very hard to look at what is going on and see what the system is doing (for example, by watching what is going on in Fiddler).
In contrast, RavenDB applies human readable names whenever possible. For more information, see the documentation about the efforts to make sure that everything in RavenDB in human readable and easily debuggable. One such place is the id generation strategy.
Development and testing – in this day and age, people are connected to the internet through most of their day to day life. That doesn’t mean that they are always connected, or that you can actually rely on the network, or that the latency is acceptable. There is no current development story for ADB. No way to run your own database and develop while you are offline (on the train or at 30,000 feet in the air). That means that every call to ADB has to go over the internet, and that means, in turn, that there is no local development story at all. It means a lot more waiting from the point of view of the developer (also see next point), it means that there is just no testing story.
If you want to run code to test your ADB usage, you have to setup (and pay) a whole new ADB instance, have to make sure that it is setup exactly the same way as your production instance, and run it against that. It means that test not only have to go outside your process, but across the internet to a remote server. This pretty much kills the notion of fast tests.
In contrast, RavenDB has an excellent development and testing story. You don’t pay for development or CI instances, and you can run tests against RavenDB using an in memory mode embedded inside your process. This has been heavily optimize to allow fast running tests. We are developers, and we care to make other developers’ life easy. It shows. For more information, see the documentation about unit testing RavenDB.
Joins are for your code – because ADB doesn’t actually support joins beyond the document scope, or any other option like that, it means that if you want to do something trivial, like show a customer a list of their orders, you are actually going to have to do the join in your own code, not in the database. In fact, let us take a silly scenario, let us say that we want to show a list of new employees as well as their managers, so we can have a chat with them about how they are settling in.
If we were using SQL, we would be using something like this:
SELECT emp.Id as EmpId, emp.Name as EmpName, mngr.Id as ManagerId, mngr.Name as ManagerName FROM Employees emp JOIN Managers mngr where emp.ManagerId = mngr.Id WHERE emp.JoinedAt > '2014-06-01'
That is pretty easy, right? How do you do something like that in ADB? Well, you start with the first query:
SELECT emp.Id as EmpId, emp.Name as EmpName, emp.ManagerId as ManagerId FROM Employees emp WHERE emp.JoinedAt > '2014-06-01'
And then, for each of the returned managers’ ids, we have to issue a separate query (ADB doesn’t have support for IN). This pattern of usage is called SELECT N+1, and it is a very well known anti pattern, even leaving aside the fact that you have to manually do the join in your own code, with all that this implies. This sort of operations will effectively kill the performance of any application, because you are very chatty with the database.
In contrast, RavenDB contains several ways to load related items. From including a related document to projecting it via a transformer, you can very easily and efficiently get all the data you need, in a single query to RavenDB. In fact, RavenDB applies a Safe By Default approach and limit the number of times you can call the server (configurable) to prevent just this case. We’ll error if you go over the budget of remote calls you are allowed to make. This gives you an early chance to catch performance problems. For more information, see the documentation about includes, transformers and the Safe By Default approach practiced by RavenDB.
Limits - reading the limits for ADB makes for some head scratching. Yes, I know that we are talking about the preview mode only. I’m aware that you can ask to increase those limits. Nevertheless, those limits likely reflect real trade offs made in the system. So increasing those limits for a particular use case means that you’ll have to pay the price for that elsewhere.
For example, let us take this blog post as an example. It is over 22KB in size. But I can’t store this blog post in ADB. That is because documents are limited to 16KB in size. This is utterly ridiculous. I just checked a few of our databases, an common size for documents is 4 – 8 KB, this is true. But larger documents appear all the time. Even if you exclude blog posts as BLOB of text, we have order documents that have with multiple order lines that are easily past that size. In our users, we see every document size possible, from hundreds of KB to several MB.
I reached out to Codealike, one of our customers, who were also featured in one of Azure’s case studies, to hear from them what their situation was. Out of 1.6 million documents in one of their databases, about 90% are in the 500Kb range.
I’m assuming that a large part of this limitation is the fact that by default, everything is indexed. You can’t index everything and have large documents and have reasonable performance. So this limit was introduced. I’m also assuming that there are other issues here (to be able to fit into pages? low level technical stuff?). Regardless, this is just utterly ridiculous low limit. The problem is that even raising this limit by x5 or x10, that is still not enough. And I’m assuming that they didn’t chose this limit out of thin air, that there is a technical reason for it.
Other issues is the number of stored procedure and UDF that you have available. You get 5 of each, and that is it. So you don’t get to actually express anything complex there. You also get to use only a single UDF per query, and to use a maximum of 3 AND / OR clauses in a query. I’m assuming that the reasoning here is that the more clauses you have, the more complex it is to run the query, especially in a distributed environment. So they put a hard limit on that.
Those limits together, along with not supporting sorting basically render ADB into an interesting curiosity, but not a real contender for a generally applicable database.
What is this for?
After going over the documentation, there is one thing that I couldn’t find. What is the primary use case for ADB?
It looks more like a solution in search of a problem than the other way around. It appears that this is used by several MS systems to store 100s of TB of data, and process millions of queries. Sheer data size isn’t really interesting, we have customers that have multiple TB data. And millions of queries per day isn’t really something to brag about (10 million queries per day translate to about 115 queries per second, or about 20 – 30 queries per second per node).
What interests me is what sort of data do you put there? The small size limitation make it pretty much unsuitable for storing actual complex documents. You have to always be aware of the size you are storing, and that put a serious crimp in how you can work with this. The limited queries and the inability to sort also lead me to believe that this is a very purpose built tool.
OneNote’s server side is apparently one such use case, but from the look of things, I would expect that this is the other way around. That ADB is actually the backend of OneNote that Microsoft has decided to make public (like Dynamo’s in Amazon’s case).
Some of those limitations are probably going to be alleviated by using additional Microsoft tools. So the new Search Server (presumably that one has complex searching & sorting available) would allow you to do some proper queries, and HDInsight might be used for doing aggregation.
You aren’t going to be able to get the “show me the count of unread emails for this user” from Hadoop, not when the data is constantly changing. And using a secondary search server will introduce high latencies for the indexing. That is leaving aside the additional operational complexity of having to manage multiple systems (and the communication between them) just to get things done.
Here are a few things that would be hard to build in ADB, as it stands today:
- This blog – the posts are too big, can’t sort posts by date, can’t do “complex” queries (tag & date & published & not deleted)
- Logging – I actually thought that this would be a great use case, but we actually need to show logs by date. As well as be able to search using multiple fields (more than 3) or do contains queries.
- Orders system – important orders with a lot of line items will be rejected because of the size limitation.
In fact, I don’t know what would work there. What kind of data are you putting there? This isn’t good for bulk data work, because the ingest rate is really small (~500 writes / second? The debug version RavenDB does 2,500 writes per sec that on my dev laptop without even using the bulk insert API) and there isn’t a good way to work with large amount of data at once. It isn’t good for business applications, for the reasons outlined above.
I guess that if you patched this and the search server and Hadoop together you would get something that might be able to serve. But I think that the complexity involved is going to be very high, and I just don’t see where this would be a great solution.
In short, what is the problem that this is trying to solve? What application would be a perfect fit for this?
With RavenDB, the answer is simple, it is a general purpose database focused on OTLP applications. Until you have an answer, you can use RavenDB on Azure today using RavenHQ on Azure.
I’m going to have availability for on site consulting in Malmo, Sweden (17 Sep) and in New York City, NY (end of Sep – beginning of Oct).
If you want me to come by and discuss what you are doing (architecture, nhibernate or ravendb), please drop me a line.
I’m especially interested in people who need to do “strange” things with data and data access. We are building a set of tailored database solutions for customers now, and we have seem customers show x750 improvement in performance when we gave them a database that was designed to fit their exact needs, instead of having to contort their application and their database to a seven dimensional loop just to try to store and read what they needed.
This can be a very short post, just: See CAP. Unfortunately, we have a lot of people who actually have experience in using distributed transactions, and have a good reason to believe that they work. The answer is that yes, they do, as long as you don’t run into some of the interesting edge cases.
By the way, that is not a new observation, see The Two Generals.
Allow me to demonstrate. Assume that we have a distributed system with the following actors:
This is a fairly typical setup. You have a worker that pull messages from a queue and read/write to a database based on those messages. To coordinate between them, it uses a transaction coordinator such as MSDTC.
Transaction coordinators use a two phase commit (or sometimes a three phase commit protocols) to ensure that either all the data would be committed, or none of it would be.
The general logics goes like this:
- For each participant in the transaction, send a prepare message.
- If the participant answered “prepared”, it is guaranteeing that the transaction can be committed.
- If any of the participants failed on prepare, abort the whole transaction.
- If all of the participants successfully prepared, send the commit message to all of them.
The actual details are a bit more involved, obviously, but that is pretty much it.
Now, let us take a look at an interesting scenario. Worker #1 is pulling (in a distributed transaction) a message from the queue, and based on that message, it modify the database. Then it tells the transaction coordinator that it can commit the transaction. At this point, the TC is sending the prepare message to the database and the queue. Both reply that they have successfully prepared the transaction to be committed. The TC sends a commit message to the queue, completing the transaction from its point of view, however, at this point, it is unable to communicate with the database.
What happens now?
Before I answer that, let us look at another such scenario. The TC needs to commit a transaction, it sends a prepare message to the database, and receive a successful reply. However, before it manages to send a prepare message to the queue, it becomes unavailable.
Note that from the point of view of the database, the situation looks exactly the same. It got (and successfully replied) to a Prepare message, then it didn’t hear back from the transaction coordinator afterward.
Now, that is where it gets interesting. In an abstract world, we can just wait with the pending transaction until the connection with the coordinator is resumed, and we can actually get a “commit / abort” notification.
But we aren’t in abstract world. When we have such a scenario, we are actually locking records in the database (because they are in the process of being modified). What happens when another client comes to us and want to modify the same record?
For example, it is quite common for to host the business logic, queue and transaction coordinator on the same worker instance, while the database is on a separate machine. That means that in the image above, if Worker #1 isn’t available, we recover by directing all the users to the 2nd worker. However, at that point, we have a transaction that was prepared, but not committed.
When the user continue to make requests to our system, the 2nd worker, which has its own queue and transaction coordinator is going to try and access the user’s record. The same user whose record are currently locked because of the ongoing transaction.
If we just let it hang in this manner, we have essentially created a situation where the user’s data become utterly unavailable (at least for writes). In order to resolve that, transactions comes with a timeout. So after the timeout has expired, we can roll back that transaction. Of course, that leads to a very interesting situation.
Let us go back to the first scenario we explored. In this scenario, the queue got both Prepare & Commit messages, while the database got just a Prepare message. The timeout has expired, and the database has rolled back the transaction. In other words, as far as the queue is concerned, the transaction committed, and the message is gone. As far as the database is concerned, that transaction was rolled back, and never happened.
Of course, the chance that something like that can happen in one of your systems? Probably one in a million.
There has been quite a lot of research on the notion of early lock release as a way to improve database concurrency. For a while, Voron actually supported that, mostly because I thought that this is a good idea. Lately, however, I decided that it is anything but for modern databases.
In short, this is how transactions behave:
|Standard transaction||Early Lock Release Transaction|
| || |
As you note, the major difference is when we release the locks. In the case of ELR, we release the locks immediately when start flushing the transaction state to log. The idea is that we don’t need to wait for potentially length I/O to allow the next transaction to start running. However, we don’t report the transaction as completed before we flushed the transaction.
There is a tiny complication in there, the next transaction cannot start doing the async flushing to log before our transaction is also over, but that is fine and expected, anyway.
However, what about error conditions? What happens if we’ve started to flush the transaction to disk in an async manner, then we got an error. Well, the obvious thing to do here (and as called out in the paper) is to abort the transaction, and then abort any transaction that has started after the failed transaction released its locks.
This is usually okay, because in general, that is no big deal at all. Aside from something like out of disk space errors (which can be resolved by pre-allocating the data), there aren’t really any non catastrophic disk errors. So usually if the disk give you a hard time, it pretty much time to give up and go home.
However, with the use of cloud computing, it is actually pretty common (however much it is a bad idea) to have a “networked disk”. This means that it can certainly be the case that a I/O request you made to the disk can get lost, delayed and just appear to fail (but actually go through). It is the last scenario in particular that worries me. If you actually wrote to the log, but you think that you didn’t, what is your state now?
And while I can handle that in a case where I can fail the transaction and rollback all the previous state, it is much harder to do that if I’ve already committed the transaction in memory, since we might need to do a memory only rollback, and that isn’t something that we are actually setup to do.
In short, we’ll be rolling back early lock release in Voron, it isn’t worth the complexity involved, especially since we already have better ways to handle concurrency.
I was asked to review the book (and received a free electronic copy).
As someone that is very into storage engines, I was quite excited about this. After going over the leveldb codebase, I would finally get to read a real book about how it works.
I was disappointed, badly.
This book isn’t really about leveldb. It contains pretty much no background, explanation, history or anything much at all about how leveldb works. Instead, it is pretty much a guide of how to use LevelDB to write iOS application. There is a lot of chapters dealing with Objective-C, NSString and variants, how to do binding, how to handle drag and drop.
However, things that I would expect. Such as explanations of how it works, what does it do, alternative use cases, etc are very rare, if there at all. Only chapter 10 is really worth reading, and even so, I got the feeling that it only made sense to me because I already knew quite a lot leveldb already. I can’t imagine actually starting from scratch and actually being able to understand leveldb from this book.
If you are working on iOS apps / OS X, I guess that this might be a good choice, but only if you want to know about actually implementing leveldb. You’ll need to do your actual leveldb learning elsewhere.
The book does contain some interesting tidbits. Chapter 10 is talking about tuning and key policies, and it did have some interesting things to talk about, but it also contain wrong information* (and if I could spot it, with my relatively little experience with leveldb, I’m pretty sure that there are other things there too that are wrong).
* The book said that is it better to write keys in order, to reduce I/O. But leveldb writes to a skip list in memory, then flush that entire thing in sorted fashion to disk. Your writes have to be bigger than the buffer size of that to actually matter, and that still won’t help you much.
In short, feel free to skip this book, unless you are very focused on writing leveldb apps on iOS. In which case it might be a worth it, but I don’t think so. You are better off reading the docs or any of the tutorials.
I was pointed at this blog post, and I thought that I would comment, from a RavenDB perspective.
If you don’t know what how to tie your shoes, don’t run.
The actual details in the posts are fascinating, I’ve never heard about this Diaspora project. But to be perfectly honest, the problems that they run into has nothing to do with MongoDB or its features. They have a lot to do with a fundamental lack of understanding on how to model using a document database.
In particular, I actually winced in sympathetic pain when the author explained how they modeled a TV show.
They store the entire thing as a single document:
Hell, the author even talks about General Hospital, a show that has 50+ sessions and 12,000 episodes in the blog post. And at no point did they stop to think that this might not be such a good idea?
A much better model would be something like this:
- [episode ids]
- TV shows
- [seasons ids]
- [review ids]
- [episode ids]
- [actor ids]
Now, I am not settled in my own mind if it would be better to have a single season document per season, containing all episodes, reviews, etc. Or if it would be better to have separate episode documents.
What I do know is that having a single document that large is something that I want to avoid. And yes, this is a somewhat relational model. That is because what you are looking at is a list of independent aggregates that have different reasons to change.
Later on in the post the author talks about the problem when they wanted to show “all episodes by actor”, and they had to move to a relational database to do that. But that is like saying that if you stored all your actors information as plain names (without any ids), you would have hard time to handle them in a relational database.
Now, as for the Disapora project issues. They appear to have been doing some really silly things there. In particular, let us store store the all information multiple times:
You can see for example all the user information being duplicated all over the place. Now, this is a distributed social network, but that doesn’t call for it to be stupid about it. They should have used references instead of copying the data.
Now, to be fair, there are very good reasons why you’ll want to duplicate the data, when you want a point in time view of it. For example, if I commented on a post, I probably want my name in that post to remain frozen. It would certainly make things much easier all around. But if changing my email now requires that we’ll run some sort of a huge update operation on the entire database… well, it ain’t the db fault. You are doing it wrong.
Now, when you store references to other documents, you have a few options. If you are using RavenDB, you have Include support, so you can get the associated documents easily enough. If you are using mongo, you have an additional step in that you have to call $in(the ids), but that is about it.
I am sorry, but this is blaming the dancer blaming the floor.
Kelly has an interesting post about memory mapped files and the cloud. This is in response to a comment on my post where I stated that we don’t reserve space up front in Voron because we support cloud providers that charge per storage.
From Kelly’s post, I assume she thinks about running it herself on her own cloud instances, and that is what here pricing indicates. Indeed, if you want to get a 100GB cloud disk from pretty much anywhere, you’ll pay for the full 100GB disk from day 1. But that isn’t the scenario that I actually had in mind.
I was thinking about the cloud providers. Imagine that you want to go to RavenHQ, and get a db there. You sigh up for a 2 GB plan, and all if great. Except that on the very first write, we allocate a fixed 10 GB, and you start paying overage charges. This isn’t what you pay when you run on your own hardware. This is what you would have to deal with as a cloud DBaaS provider, and as a consumer of such a service.
That aside, let me deal a bit with the issues of memory mapped files & sparse files. I created 6 sparse files, each of them 128GB in size in my E drive.
As you can see, this is a 300GB disk, but I just “allocated” 640GB of space in it.
This also shows that there has been no reservation of space on the disk. In fact, it is entirely possible to create files that are entirely too big for the volume they are located on.
I did a lot of testing with mmap files & sparseness, and I came to the conclusion that you can’t trust it. You especially can’t trust it in a cloud scenario.
But why? Well, imagine the scenario where you need to use a new page, and the FS needs to allocate one for you. At this point, it need to find an available page. That might fail, let us imagine that this fails because of no free space, because that is easiest.
What happens then? Well, you aren’t access things via an API, so there isn’t an error code it can return, or an exception to be thrown.
In Windows, it will use Standard Exception Handler to throw the error. In Linux, that will be probably generate a SIVXXX error. Now, to make things interesting, this may not actually happen when you are writing to the newly reserved page, it may be deferred by the OS to a later point in time (or if you call msync / FlushViewOfFile). At any rate, that means that at some point the OS is going to wake up and realize that it promised something it can’t deliver, and in that point (which, again, may be later than the point you actually wrote to that page) you are going to find yourself in a very interesting situation. I’ve actually tested that scenario, and it isn’t a good one form the point of view of reliability. You really don’t want to get there, because then all bets are off with regards to what happens to the data you wrote. And you can’t even do graceful error handling at that point, because you might be past the point.
Considering the fact that disk full is one of those things that you really need to be aware about, you can’t really trust this intersection of features.
This is more of less how a page is represented in memory.
There are a few things to note. The entries immediately following the page header points to the actual data at the end of the page. Why is that? Well, remember that we are working with a fixed size page. And we need to add data to the page in a sorted fashion, and do this cheaply. In order to do that, we always add the actual data at the end of the file, before any other data. When the page is empty, we add the value in the end. And the next value is immediately before the previous value, etc.
However, as you can see in the image, the fact that we add values in a particular order doesn’t mean that they are sorted in that order. In order to get good sorting, we keep a list of the entries offsets in the beginning of the page, just after the header. This is a sorted list, which gives us the ability to easy add / move an item in the page without actually moving the page, just by changing the offset position it is located on. If we would add another entry to this page, with the key C, the actual data would go on top of entry# 3. But the offset recording this entry would go between B & A at the head of the page.
In other words, the actual data of the page grows upward, and the offsets pointing to the locations of entries in the page grows downward. A page is full when you can’t fit the next data item and its offset in the space between the data & the offsets.
So that is addition. How about deletion. When we delete, we need to recover the space, but that is actually very easy to handle. Let us say that we want to delete the A entry. That means that we need to do the following. Move all the higher entries data downward to cover the space taken by the entry, and then update the offsets to reflect that. In the end, we are left with:
Updates work as a pair of delete/add operations, with some important edge cases. What happen if we have a page that is nearly full, and we want to update an entry that is bigger than it can currently hold? We have to check if the value is also too big if the previous value is removed. If not, we can fit it into the currently page. Otherwise, we would have to do a page split to accommodate it.
Surprisingly enough, the code that is required to handle that is quite pretty, although the actual mechanism probably require a whiteboard to explain. And a lot of hand waving.
In the previous post, I gave a brief introduction about B+Trees. Now I want to talk about a crucial part of handling them. Let us start with the basic, we have the following tree, which consist of a single page:
As you can imagine, attempting to add a single item to this page isn’t going to happen (there isn’t enough space), so we are going to have to something about it. That something is call page splitting. The easiest way to do that is to simply cut things in half, like so:
We have split the page, and now we are left with two pages that we can start writing to. Everyone is happy. Almost… in this case, you can see that we are doing sequential inserts. So page #0 is never going to have any additional data written to it. That means that we are wasting half this page!
What I have done is to add just a tiny bit of smarts into the mix. Instead of doing a 50/50 split, we can detect if this is a sequential insert that is causing the split. If it is, we are going to split the data in a way that put more of it in the original page, like so:
We leave 85% of the data in the original page because that give some room to play with if one of the entries change there. This gives us a better page utilization in the common case of sequential inserts. But what about non sequential inserts? Well, if we detect that the page is being split because of a non sequential insert, we are going to do a 50/50 split, expecting that there would be additional non sequential inserts along the way.
I am not sure how useful that would be, but it seems like something that could be a nice optimization for a common case, and it requires nothing else from the user, we can do it all on our own.
One of the things that I enjoy about learning new things is the way it changes the way I look at the stuff that I already knows. Reading the LMDB codebase, and implementing a persistent B+Tree has given me a new depth of understanding about how relational databases (and Esent, too), work.
I am going to go back to level zero and assume that you have never heard about this. You probably know what a binary search tree is. And you know how they work. Here is such a tree:
Most tree algorithms spend a lot of time ensuring that the trees are balanced, because their most important property, O(log N) search performance, require that. Binary search trees seems ideally suitable for building databases ,since they allow speedy lookups and backward & forward iteration through the data. Unfortunately, they suffer from a pretty crippling limitation. They have absolutely horrible performance when used in persistent medium. Let us assume that I have a file that contains the following information (Key, Left offset, Right offset). I could search through that file by just walking the links. That, however, would result quite terrible performance. Every time that we move between nodes, we have to do a seek, and that would kill any benefits we will gain from using binary search tree.
Instead, we have a data structure that is aimed at reducing those costs. B+Tree. On the face of it, B+Tree is a truly stupidly simple idea. Instead of having just 2 items in the tree, left & right, let us have more.
Now, since B+Trees are usually used in persistent format, we start out by defining pages. That is because if we want to do random I/O, we have to be able to write in pages to the file. Therefor, we usually talk about B+Trees in terms of pages. Here is the above tree, rendered as a B+Tree:
Since the data is small, it all fits into a single page. That means that you can read the entire page into memory, and then do a binary search on the data page cheaply. The page above uses a page size of 4,096, so it can contain about 200 entries (of the size we just put in, which are very small). What happens when we have more than that? We Split the page. The resulting data now looks like this (256 entries):
Now, if we want to do a search for key 152, we start by reading the root node, we can see that the page containing values greater than 102 is page 1, so we will go there. We read that, and do a binary search on the data to find our value. So that was two seeks to get the data. Instead of eight. And usually, the high level pages are already cached, so we would only need a single seek.
Let us see what happens when we put even more data in (512):
As you can see, we doubled the amount of information we store, but we remain at the same height. That is quite important, in order to get good performance from the disk. The usual fan out is about a 100, depending on your page size and the keys you selected.
I played with the settings a a bit so you can get a good idea about what this look like when you have some more depth, I wrote the following code:
Which resulted in the following tree:
So far, so good, but this is actually the best case scenario. This is what happens when we are inserting data in a sequential manner. Let us mix things up a bit, okay?
This code will generate values in effectively a random fashion, and you can see the impact it has on the tree:
What happens is that we have had to insert data in a lot of places, which resulted in a lot of page splits, which increase the depth of the tree. Deeper trees means more operations required to actually get to the data on the leaf nodes. And yes, if you have a flash back to “always use sequential ids” section in RDMBS class, this is the reason why.
Now, I think that this is enough for now, I’ll do some more talking about the subject in my next post.
In my previous post I started to talk about the architecture of Naver. But this is actually still too early to tell, since before anything else, I want to create the low level interface for working with pages. And even before that, I had to decide how to actually represent a page in C#. In LMDB, this is easy because you can just access the memory using a pointer, and work with memory in this fashion is really natural in C. But in C#, that is much harder. It took me some trial and error, but I realize that I was trying to write C code in C#, and that isn’t going to work. Instead, I have to write native C#, which is similar, but different. In C, I can just pass around a pointer, and start doing evil things to it at will. In C#, there are a lot of road blocks along the way that prevent you from doing that.
So I ended up with this:
This is the memory layout that I want. But, there is not way in C# to say, allocate this struct at this location. Luckily, I found a workaround.
What I can do, I can have a Page class that manage this for me. This means that I just give the class a pointer, and it can treat this either as a PageHeader* or as a byte array. This means that I also get to do cool tricks like having an array backed by memory directly in the code:
The Page class have a lot of methods relating to managing the page itself, and it should abstract away all the pesky details of actually working with memory directly.
So, this isn’t a really informative post, I fear. But it did take me a few hours to come up with the approach that I wanted. I kept trying to find ways to embed addresses inside the PageHeader, until I realized that I can just hold that externally. Note that all the important state about the Page is actually stored in memory outside the Page class, so that is shared globally, but you can have two Page instances that point to the same place. And that is where I decided to put local state. Things like where we are currently positioned in the page, for example.
As the author of a schema less database, I find myself in the strange position of the barefoot shoemaker. I need to explain a bit. Our current storage engines, Esent and Munin (which was designed mostly to be like Esent) have rigid schemas. There are tables, and indexes, etc. This means that features that touch the storage layer tend to be much more complex. They require migrations, adding new features that require storage means that we have to create new storage tables, or modify indexes, or any number of a bunch of stuff that we developed RavenDB so our users wouldn’t have to.
I have been working with our managed implementation of LevelDB quite a lot lately. In order to do more than merely write tests for this, I tried to create a feature complete feature, an aggregation engine. The code is not production worthy (yet!), but what struck me quite hard was the fact that except for the fact that the storage layer is full of bugs (well, that is why I was writing stuff on top of it, to expose it), I had a blast actually working with it.
I could make modifications & changes with hardly any friction, and it was a real pleasure to start working with things in this fashion.
In the mailing list, we got asked about an issue with code that looked like this:
I fixed the bug, but that was a strange thing to do, I thought. Happily, the person asking that question was actually taking part of a RavenDB course and I could sit with him and understand the whole question.
It appears that in their system, they have a lot of things like that:
And along with that, they also have a coordinating class:
The question was, could they keep using the same approach using RavenDB? They were quite anxious about this, since they had a need for the capabilities of this in their software.
This is why I hate Hello World questions. I could answer just the question that was asked, and that was it. But the problem is quite different.
You might have recognized it by now, what they have here is Entity Attribute Value system. A well known anti pattern for the relational database world and one of the few ways to actually get a dynamic schema in that world.
In RavenDB, you don’t need all of those things. You can just get things done. Here is the code that we wrote to replace the above monstrosity:
Not only will this class handle the dynamics quite well, it also serializes to idiomatic JSON, which means that querying that is about as easy as you can ask.
The EAV schema was created because RDBMS aren’t suitable for dynamic work, and like many other things from the RDMBS world, this problem just doesn’t exists for us in RavenDB.
Okay, enough about the data access parts. Let us see take a look at a few of the other things that are going on in the application. In particular, this is supposed to be an application with…
Domain logic is implemented by means of a Domain Model, onto a layer of services adds application logic. The model is persisted by a DAL designed around the principles of the "Repository" patterns, which has been implemented in a LINQ-friendly way.
Let us try to figure this out one at a time, okay?
The only method in the domain model that have even a hint of domain logic is the CalculateTotalIncome method. Yes, you got it right, that is a method, as in singular. And that method should be replaced with a query, it has no business being on the domain model.
So let us move to the services, okay? Here are the service definitions in the entire project:
Look at the methods carefully. Do you see any action at all? You don’t, the entire thing is just about queries.
And queries should be simple, not abstracted away and made very hard to figure out.
The rule of the thumb is that you try hard to not abstract queries, it is operations that you try to abstract. Operations is where you usually actually find the business logic.
One of the things that I repeatedly call out is the forwarding type of architecture, a simple operation that is hidden away by a large number of abstractions that serves no real purpose.
Instead of a controller, let us look at a web service, just to make things slightly different. We have the following:
Okay, let us dig deeper:
I really like the fact that this repository actually have have FindById method, which this service promptly ignores in favor of using the IQueryable<Customer> implementation. If you want to know how that is implemented, just look (using the EF Code First repository implementations, the others are fairly similar):
All in all, the entire thing only serves to make things harder to understand and maintain.
Does anyone really think that this abstraction adds anything? What is the point?!
In my previous post, I talked about how the CatalogController.Product(id) action was completely ridiculous in the level of abstraction that it used to do its work and promised to show how to do the same work on an actual read model in a much simpler fashion. Here is the code.
There are several things that you might note about this code:
- It is all inline, so it is possible to analyze, optimize and figure out what the hell is going on easily.
- It doesn’t got to the database to load data that it already has.
- The code actually does something meaningful.
- It only do one thing, and it does this elegantly.
This is actually using a read model. By, you know, reading from it, instead of abstracting it.
But there is a problem here, I hear you shout (already reaching for the pitchfork, at least let me finish this post).
Previously, we have hidden the logic of discontinued products and available products behind the following abstraction:
Now we are embedding this inside the query itself, what happens if this logic changes? We would now need to go everywhere we used this logic and update it.
Well, yes. Except that there are two mitigating factors.
- This nice abstraction is never used elsewhere.
- It is easy to create our own abstraction.
Here is an example on how to do this without adding additional layers of abstractions.
Which means that our action method now looks like this:
Simple, easy, performant, maintainable.
And it doesn’t make my head hurt or cause me to stay up until 4 AM feeling like an XKCD item.
In my last few posts, I have gone over the data access strategy used in NSK. I haven’t been impressed. In fact, I am somewhere between horrified, shocked and amused in the same way you feel when you see a clown slipping on a banana peel. Why do I say that? Let us trace a single call from the front end all the way to the back.
The case in point, CatalogController.Product(id) action. This is something that should just display the product on the screen, so it should be fairly simple, right? Here is how it works when drawn as UML:
To simplify things, I decided to skip any method calls on the same objects (there are more than a few).
Let me show you how this looks like in actual code:
Digging deeper, we get:
We will deal with the first method call to CatalogServices now, which looks like:
I’ll skip going deeper, because this is just a layer of needless abstraction on top of Entity Framework and that is quite enough already.
Now let us deal with the second call to CatalogServices, which is actually more interesting:
Note the marked line? This is generating a query. This is interesting, because we have already loaded the product. There is no way of optimizing that, of course, because the architecture doesn’t let you.
Now, you need all of this just to show a single product on the screen. I mean, seriously.
You might have noticed some references to things like Read Model in the code. Which I find highly ironic. Read Models are about making the read side of things simple, not drowning the code in abstraction on top of abstraction on top of abstraction.
In my next post, I’ll show a better way to handle this scenario. A way that is actually simpler and make use an of actual read model and not infinite levels of indirection.
Update: Andrea, the project’s leader has posted a reply to this series of posts.
I like to start reviewing applications from their database interactions. That it usually low level enough to tell me what is actually going on, and it is critical to the app, so a lot of thought usually goes there.
In good applications, I have hard time finding the data access code, because it isn’t there. It is in the OR/M or the server client API (in the case of RavenDB). In some applications, if they work against legacy databases or without the benefit of OR/M or against a strange data source (such as a remote web service target) may need an explicit data layer, but most don’t.
NSK actually have 5 projects dedicated solely to data access. I find this.. scary.
Okay, let me start outlying things in simple terms. You don’t want to do things with regards to data access the way NSK does them.
Let us explore all the ways it is broken. First, in terms of actual ROI. There is absolutely no reason to have multiple implementations with different OR/Ms. There is really not a shred of reason to do so. The OR/M is already doing the work of handling the abstraction from the database layer, and the only thing that you get is an inconsistent API, inability to actually important features and a whole lot more work that doesn’t' get you anything.
Second, there are the abstractions used:
I don’t like repositories, because they abstract important details about the way you work with the database. But let us give this the benefit of the doubt and look at the implementation. There is only one implementation of IRepository, which uses NHibernate.
As you can see, this is pretty basic stuff. You can also see that there are several methods that aren’t implemented. That is because they make no sense to a data. The reason they are there is because IRepository<T> inherits from ICollection<T>. And the reason for that is likely because of this:
Mediates between the domain and data mapping layers using a collection-like interface for accessing domain objects.
The fact that this is totally the wrong abstraction to use doesn’t enter to the design, it seems.
Note that we also violate the contract of ICollection<T>.Remove:
There are other reasons to dislike this sort of thing, but I’ll touch on that on my next post.
This article thinks so, and I was asked to comment on that. I have to say that I agree with a lot in this article. It starts by laying out what an anti pattern is:
- It initially appears to be beneficial, but in the long term has more bad consequences than good ones
- An alternative solution exists that is proven and repeatable
And then goes on to list some of the problems with OR/M:
- Inadequate abstraction - The most obvious problem with ORM as an abstraction is that it does not adequately abstract away the implementation details. The documentation of all the major ORM libraries is rife with references to SQL concepts.
- Incorrect abstraction – …if your data is not relational, then you are adding a huge and unnecessary overhead by using SQL in the first place and then compounding the problem by adding a further abstraction layer on top of that.
On the the other hand, if your data is relational, then your object mapping will eventually break down. SQL is about relational algebra: the output of SQL is not an object but an answer to a question.
- Death by a thousand queries – …when you are fetching a thousand records at a time, fetching 30 columns when you only need 3 becomes a pernicious source of inefficiency. Many ORM layers are also notably bad at deducing joins, and will fall back to dozens of individual queries for related objects.
If the article was about pointing out the problems in OR/M I would have no issues in endorsing it unreservedly. Many of the problems it points out are real. They can be mitigated quite nicely by someone who knows what they are doing, but that is beside the point.
I think that I am in a pretty unique position to answer this question. I have over 7 years of being heavily involved in the NHibernate project, and I have been living & breathing OR/M for all of that time. I have also created RavenDB, a NoSQL database, that gives me a good perspective about what it means to work with a non relational store.
And like most criticisms of OR/M that I have heard over the years, this article does only half the job. It tells you what is good & bad (most bad) in OR/M, but it fails to point out something quite important.
To misquote Churchill, Object Relational Mapping is the worst form of accessing a relational database, except all of the other options when used for OLTP.
When I see people railing against the problems in OR/M, they usually point out quite correctly problems that are truly painful. But they never seem to remember all of the other problems that OR/M usually shields you from.
One alternative is to move away from Relational Databases. RavenDB and the RavenDB Client API has been specifically designed by us to overcome a lot of the limitations and pitfalls inherit to OR/M. We have been able to take advantage of all of our experience in the area and create what I consider to be a truly awesome experience.
But if you can’t move away from Relational Databases, what are the alternative? Ad hoc SQL or Stored Procedures? You want to call that better?
A better alternative might be something like Massive, which is a very thin layer over SQL. But that suffers from a whole host of other issues (no unit of work means aliasing issues, no support for eager load means better chance for SELECT N+1, no easy way to handle migrations, etc). There is a reason why OR/M have reached where they have. There are a lot of design decisions that simply cannot be made any other way without unacceptable tradeoffs.
From my perspective, that means that if you are using Relational Databases for OLTP, you are most likely best served with an OR/M. Now, if you want to move away from Relational Databases for OLTP, I would be quite happy to agree with you that this is the right move to make.
I am working on managed storage solution for RavenDB in an off again on again fashion for a while now. The main problem Is that doing something like this is not particularly hard, but it is complex. You either have to go with a transaction log or an append only model.
There are more than enough material on the matter, so I won’t touch that. The problem is that building that is going to take time, and probably a lot of it I decided that it is better off to have something than nothing, and scaled back the requirements.
The storage has to have:
- Multi threaded
- Fully managed
- Support both memory and files
- Easy to work with
The last one is important. Go and take a look at the codebase of any of the available databases. They can be… pretty scary.
But something has to be give, so I decided that to make things easier, I am not going to implement indexing on the file system. Instead, I’ll store the data on the disk, and keep the actual index completely in memory. There is an overhead of roughly 16 bytes per key plus the key itself, let us round it to 64 bytes per held per key. Holding 10 million keys would cost ~600 MB. That sounds like a lot, because it is. But it actually not too bad. It isn’t that much memory for modern hardware. And assuming that our documents are 5 KB in size, we are talking about 50 GB for the database size, anyway.
Just to be clear, the actual data is going to be on the disk, it is the index that we keep in memory. And once we have that decision, the rest sort of follow on its own.
It is intentionally low level interface, and mostly it gives you is a Add/Read/Remove interface, but it gives you multiple tables, the ability to do key and range scans and full ACID compliance (including crash recovery).
And the fun part, it does so in 400 lines!
The title of this post is a translation of an Arabic saying that my father quoted me throughout my childhood.
I have been teaching my NHibernate course these past few days, and I had come to realize that my approach for designing RDBMS based applications has gone a drastic change recently. I think that the difference in my view was brought home when I started getting angry about this model:
I mean, it is pretty much a classic, isn’t it? But what really annoyed me was that all I had to do was look at this and know just how badly this is going to end up as when someone is going to try to show an order with its details. We are going to have, at least initially, 3 + N(order lines) queries. And even though this is a classic model, loading it efficiently is actually not that trivial. I actually used this model to show several different ways of eager loading. And remember, this model is actually a highly simplified representation of what you’ll need in real projects.
I then came up with a model that I felt was much more palatable to me:
And looking at it, I had an interesting thought. My problem with the model started because I got annoyed by how many tables were involved in dealing with “Show Order”, but the end result also reminded me of something, Root Aggregates in DDDs. Now, since my newfound sensitivity about this has been based on my experiences with RavenDB, I found it amusing that I explicitly modeled documents in RavenDB after Root Aggregates in DDD, then went the other way (reducing queries –> Root Aggregates) with modeling in RDBMS).
The interesting part is that once you start thinking like this, you end up with a lot of additional reasons why you actually want that. (If the product price changed, it doesn’t affect the order, for example).
If you think about it, normalization in RDBMS had such a major role because storage was expensive. It made sense to try to optimize this with normalization. In essence, normalization is compressing the data, by taking the repeated patterns and substituting them with a marker. There is also another issue, when normalization came out, the applications being being were far different than the type of applications we build today. In terms of number of users, time that you had to process a single request, concurrent requests, amount of data that you had to deal with, etc.
Under those circumstances, it actually made sense to trade off read speed for storage. In today’s world? I don’t think that it hold as much.
The other major benefit of normalization, which took extra emphasis when the reduction in storage became less important as HD sizes grew, is that when you state a fact only once, you can modify it only once.
Except… there is a large set of scenarios where you don’t want to do that. Take invoices as a good example. In the case of the order model above, if you changed the product name from “Thingamajig” to “Foomiester”, that is going to be mighty confusing for me when I look at that order and have no idea what that thing was. What about the name of the customer? Think about the scenarios in which someone changes their name (marriage is most common one, probably). If a woman orders a book under her maiden name, then changes her name after she married, what is supposed to show on the order when it is displayed? If it is the new name, that person didn’t exist at the time of the order.
Obviously, there are counter examples, which I am sure the comments will be quick to point out.
But it does bear thinking about, and my default instinct to apply 3rd normal form has been muted once I realized this. I now have a whole set of additional questions that i ask about every piece of information that I deal with.
I recently had a chance to work on an interesting project, doing a POC of moving from a relational model to RavenDB. And one of the most interesting hurdles along the way wasn’t technical at all, it was trying to decide what an entity is. We are so used to make the assumption that Entity == Table that we started to associate the two together. With a document database, an entity is a document, and that map much more closely to a root aggregate than to a RDMBS entity.
That gets very interesting when we start looking at tables and having to decide if they represent data that is stand alone (and therefore deserve to live is separate documents) or whatever they should be embedded in the parent document. That led to a very interesting discussion on each table. What I found remarkable is that it was partly a discussion that seem to come directly from the DDD book, about root aggregates, responsibilities and the abstract definition of an entity and partly a discussion that focused on meeting the different modeling requirement for a document database.
I think that we did a good job, but I most valued the discussion and the insight. What was most interesting to me was how right was RavenDB for the problem set, because a whole range of issues just went away when we started to move the model over.
I decided to take a chance (installing Oracle is a big leap :-) ) and see how things match in Oracle.
I decided to run the following query:
SELECT deptno, dname, loc, (SELECT COUNT(*) FROM emp WHERE emp.deptno = dept.deptno) AS empcount FROM dept WHERE deptno = 20
Please note that I run in on a database that had (total) maybe a 100 records, so the results may be skewed.
Like in the SQL Server case, we need to create an index on the FK column. I did so, after which I got:
Then I dropped that index and create a simple view:
CREATE VIEW depswithempcount AS SELECT deptno, dname, loc, (SELECT COUNT(*) FROM emp WHERE emp.deptno = dept.deptno) AS empcount FROM dept
Querying on top of that gives me the same query plan as before. Trying to create a materialized view out of this fails, because of the subquery expression, I’ll have to express the view in terms of joins, instead. Like this:
SELECT dept.deptno, dname, loc, COUNT(*) empcount FROM dept LEFT JOIN emp ON dept.deptno = emp.deptno WHERE dept.deptno = 20 GROUP BY dept.deptno, dname, loc
Interestingly enough, this is a different query plan than the subquery, with SQL Server, those two query exhibit identical query plans.
Now, to turn that into an materialized view.
CREATE materialized VIEW deptwithempcount AS SELECT dept.deptno, dname, loc, COUNT(*) empcount FROM dept left join emp ON dept.deptno = emp.deptno GROUP BY dept.deptno, dname, loc
And querying on this gives us very interesting results:
select * from deptwithempcount where deptno = 20
Unlike SQL Server, we can see that Oracle is reading everything from the view. But let us try one more thing, before we conclude this with a victory.
update emp set deptno = 10 where deptno = 20; select * from deptwithempcount where deptno = 20But now, when we re-run the materialized view query, we see the results as they were at the creation of the view.
There appears to be a set of options to control that, but the one that I want (RERESH FAST), which update the view as soon as data changes will not work with this query, since it consider it too complex. I didn’t investigate too deeply, but it seems that this is another dead end.
Let us say that I have the homepage of the application, where we display Blogs with their Post count, using the following query:
select dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle, (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) as PostCount from dbo.Blogs
Given what I think thoughts of denormalization, and read vs. write costs, it seems a little wasteful to run the aggregate all the time.
I can always add a PostCount property to the Blogs table, but that would require me to manage that myself, and I thought that I might see whatever the database can do it for me.
This isn’t a conclusive post, it details what I tried, and what I think is happening, but it isn’t the end all be all. Moreover, I run my tests on SQL Server 2008 R2 only, not on anything else. I would like to hear what you think of this.
My first thought was to create this as a persisted computed column:
ALTER TABLE Blogs ADD PostCount AS (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) PERSISTED
But you can’t create computed columns that uses subqueries. I would understand easier why not if it was only for persisted computed columns, because that would give the database a hell of time figuring out when that computed column needs to be updated, but I am actually surprised that normal computed columns aren’t supporting subqueries.
Given that my first attempt failed, I decided to try to create a materialized view for the data that I needed. Materialized views in SQL Server are called indexed views, There are several things to note here. You can’t use subqueries here either (likely because the DB couldn’t figure which row in the index to update if you were using subqueries), but have to use joins.
I created a data set of 1,048,576 rows in the blogs table and 20,971,520 posts, which I think should be enough to give me real data.
Then, I issued the following query:
select dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle, count_big(*) as PostCount from dbo.Blogs left join dbo.Posts on dbo.Blogs.Id = dbo.Posts.BlogId where dbo.Blogs.Id = 365819 group by dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle
This is before I created anything, just to give me some idea about what kind of performance (and query plan) I can expect.
Query duration: 13 seconds.
And the execution plan:
The suggest indexes feature is one of the best reasons to move to SSMS 2008, in my opinion.
Following the suggestion, I created:
CREATE NONCLUSTERED INDEX [IDX_Posts_ByBlogID] ON [dbo].[Posts] ([BlogId])
And then I reissued the query. It completed in 0 seconds with the following execution plan:
After building Raven, I have a much better understanding of how databases operate internally, and I can completely follow how that introduction of this index can completely change the game for this query.
Just to point out, the results of this query is:
Id Title Subtitle PostCount
----------- --------------------- ---------------------- --------------------
365819 The lazy blog hibernating in summer 1310720
I decided to see what using a view (and then indexed view) will give me. I dropped the IDX_Posts_ByBlogID index and created the following view:
CREATE VIEW BlogsWithPostCount WITH SCHEMABINDING AS select dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle, count_big(*) as PostCount from dbo.Blogs join dbo.Posts on dbo.Blogs.Id = dbo.Posts.BlogId group by dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle
After which I issued the following query:
select Id, Title, Subtitle, PostCount from BlogsWithPostCount where Id = 365819
This had the exact same behavior as the first query (13 seconds and the suggestion for adding the index).
I then added the following index to the view:
CREATE UNIQUE CLUSTERED INDEX IDX_BlogsWithPostCount ON BlogsWithPostCount (Id)
And then reissued the same query on the view. It had absolutely no affect on the query (13 seconds and the suggestion for adding the index). This make sense, if you understand how the database is actually treating this.
The database just created an index on the results of the view, but it only indexed the columns that we told it about, which means that is still need to compute the PostCount. To make things more interesting, you can’t add the PostCount to the index (thus saving the need to recalculate it).
Some points that are worth talking about:
- Adding IDX_Posts_ByBlogID index resulted in a significant speed increase
- There doesn’t seem to be a good way to perform materialization of the query in the database (this applies to SQL Server only, mind you, maybe Oracle does better here, I am not sure).
In other words, the best solution that I have for this is to either accept the cost per read on the RDBMS and mitigate that with proper indexes or create a PostCount column in the Blogs table and manage that yourself. I would like your critique on my attempt, and additional information about whatever what I am trying to do is possible in other RDMBS.