When should you NOT use RavenDB?
That is a much more interesting question, especially when it is me that gets asked. Again, I might be biased, but I really think that RavenDB is suitable for a whole range of problems.
One case that we don’t recommend using RavenDB for is reporting, though. In many cases, reporting databases need to answer queries that would be awkward or slow on a RavenDB system. For reporting, we recommend just throwing that data into a reporting database (either a star schema or a cube) and doing the reporting directly from there.
Most reporting tools will give you great support when you do that, and that is one sour spot (the reverse of a sweet spot) for RavenDB. We can do reporting, but you have to know what you want to report on beforehand. Most reporting systems give the users a great deal of freedom in how they look at, aggregate, and process the data, and that isn’t something that you can easily do with RavenDB.
So far, that is the one place where we have said, “nope, that isn’t us, go look at that other thing that does this feature really well”. RavenDB is focused on being a good database for OLTP, not for reporting.
Ayende, this is one thing that I've seen around many times during my cloud databases research. Given that reporting is the most natural use of RELATIONS, and that nosql-like databases are focused on objects, it's easy to understand that each database system has its most favorable context. We have to change the paradigm that a solution has one database. A solution is composed of as many software artifacts as necessary. And so, as many database systems as needed.
Isn't squirting out the data to a reporting DB the area Raven's Index Replication Bundle touches on?
I suppose RavenDB is as bad a reporting db as any other OLTP relational database. Even with a relational db you will not be able to do ad-hoc reports once the amount of data is big enough - you will need a separate reporting db optimized for that purpose. On the other hand, Lucene can be used to compute some statistics very fast, so Raven has some reporting capabilities built in, and these can be extended with a map-reduce index. But of course it will never be an ad-hoc reporting tool. There are at least a few more things worth considering:
- operations on large sets of data will probably be slower in Raven than in an RDBMS (because of the storage format)
- relational data is much more compact, so a RavenDB database is probably going to be much bigger than a relational one (depending on the data structure) - and so might require more disk performance and more memory (I'm not sure about that, because Raven's read of a single document == multiple reads from many rdbms tables)
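The map-reduce index Rafal mentions precomputes aggregates at indexing time, so the query itself only reads already-computed results. A minimal Python sketch of the idea (the document shapes and field names are invented for illustration, not RavenDB's actual implementation):

```python
from collections import defaultdict

# Hypothetical order documents, as they might look in a document store.
orders = [
    {"Company": "companies/1", "Total": 100.0},
    {"Company": "companies/1", "Total": 50.0},
    {"Company": "companies/2", "Total": 75.0},
]

# Map: emit one index entry per document.
mapped = [{"Company": o["Company"], "Count": 1, "Total": o["Total"]} for o in orders]

# Reduce: group by key and aggregate. This runs at indexing time,
# so a statistics query is just a lookup of the reduced entry.
reduced = defaultdict(lambda: {"Count": 0, "Total": 0.0})
for entry in mapped:
    r = reduced[entry["Company"]]
    r["Count"] += entry["Count"]
    r["Total"] += entry["Total"]

print(reduced["companies/1"])  # {'Count': 2, 'Total': 150.0}
```

The point is the cost shift: aggregation happens once per write (in the background), not once per report query.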
The problem with an RDBMS is that even if it is designated OLTP, it is very tempting to do just "some" reports on it, because it allows it - and then you die when you grow big.
And I am pretty sure that we are going to be using less memory and CPU for standard operations, if only because of the different modeling requirements.
Reports are obvious, but how about ad-hoc queries? NoSQL designs usually use specific indexes, often spanning multiple objects, for each query, and they perform well. But SQL allows us to join arbitrary entities. And if you don't think internet, but enterprise scale, these often perform well. (Of course this does not mean that SQL has to be your primary data store, as CQRS shows)
How about DRI? Don't you ever miss that? In NoSQL you have to explicitly deal with situations that just cannot occur in a fully specified relational schema. (Well, to be fair, DRI won't prevent everything, of course.)
Stefan, RavenDB supports ad hoc queries very well, so that isn't it. DRI is something else, indeed. I assume that you are talking about denormalization, and RavenDB supports multiple ways to handle that scenario (includes, live projections) if you want to avoid it. Then again, in many cases you actually do want the denormalization, so you'll have a point-in-time snapshot of the info
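The point-in-time snapshot idea can be sketched in Python (the document shapes here are invented for illustration): instead of referencing the customer and joining at read time, the order embeds the bits of customer data it needs as they were when the order was placed.

```python
customer = {"Id": "customers/1", "Name": "Acme", "Discount": 0.1}

# Embed a snapshot of the customer data the order needs, at order time.
order = {
    "Id": "orders/1",
    "Customer": {"Id": customer["Id"], "Name": customer["Name"]},
    "Total": 200.0,
}

# Later changes to the customer do not rewrite history in the order.
customer["Name"] = "Acme Corp"
print(order["Customer"]["Name"])  # Acme
```

With a normalized relational model you would get the updated name on every join, which for historical records is often the wrong answer.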
How does RavenDB support ad-hoc queries? From what I've read it creates a dynamic index for ad-hoc queries. If so, then how long will you have to wait before the index is created (let's say for 20 million documents)? And what will happen to database performance during creation of that index (think about disk i/o and memory cache)? I don't think this is the operation you would like to do on a busy database.
Rafal, This isn't something that you want to just do, since it -is- expensive, but the same is true of making an unindexed query on a big RDBMS. We have facilities in place to optimize that scenario, mind
Sorry if this is a RTFM question, but if you issue an ad-hoc query, does Raven attempt to create an index based on it?
Jdn, Yes, it does. Well, first the query optimizer tries to find a matching index, but if it can't, it will create one
Sorry, another RTFM question.
For a basic scenario in SQL, I just need to create one index per key column (FK/PK), and any query joining any of those tables will be able to use those indexes. There's no need to dynamically build one index for each query, because the query optimizer will just combine them as needed. E.g.:
from c in Companies
from e in c.Employees
where c.Country == "USA"
where e.Wage > 10000
SQL would just use the indexes for Companies.Country, Employees.Company (FK) and Employees.Wage. In Lucene, I'd need to create a special index on Companies that includes all relevant data for employees, right? How does RavenDB do this? The docs for live projections indicate that a specific index needs to be built. I understand the advantages of such a model, but it is not exactly what you want for an ad-hoc query, is it?
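The kind of special index Stefan describes can be pictured as a flattened map over the nested documents: one index entry per employee, carrying the company fields it filters on, so query time needs no join. A rough Python sketch (document shapes invented):

```python
companies = [
    {"Id": "companies/1", "Country": "USA",
     "Employees": [{"Name": "Alice", "Wage": 12000}, {"Name": "Bob", "Wage": 8000}]},
    {"Id": "companies/2", "Country": "UK",
     "Employees": [{"Name": "Carol", "Wage": 15000}]},
]

# Map step: emit one flat index entry per employee, inlining company fields.
index = [
    {"Company": c["Id"], "Country": c["Country"], "Name": e["Name"], "Wage": e["Wage"]}
    for c in companies
    for e in c["Employees"]
]

# Query time is then a filter over precomputed flat entries - no join.
results = [e["Name"] for e in index if e["Country"] == "USA" and e["Wage"] > 10000]
print(results)  # ['Alice']
```

The trade-off is exactly the one debated here: the join work moves from query time to indexing time, which pays off when reads dominate writes.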
And how does building all those indexes not affect write performance? By being asynchronous? Do you have some real-world data about how the indexing affects server load? I assume in a large system, you'll end up with a lot of query-specific indexes. (It's kind of a combinatorial explosion problem, right?)
My take would have been to eventually ship data for relational queries to an RDBMS, just like I'd use an OLAP DB for statistical reports.
Stefan, Most queries in RavenDB are only over a single document; we don't try to do complex operations over multiple indexes, because the data format allows us to store complex information inline, instead of spreading it across many tables.
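That inline storage can be illustrated in Python (shapes invented): one document holds the whole aggregate, where a relational model spreads the same data across tables that must be joined back together on every read.

```python
# Document model: the whole aggregate in one read.
order_doc = {
    "Id": "orders/1",
    "Customer": "customers/1",
    "Lines": [
        {"Product": "products/1", "Qty": 2, "Price": 10.0},
        {"Product": "products/2", "Qty": 1, "Price": 5.0},
    ],
}

# Relational model: the same data, joined by keys across tables.
orders_table = [{"Id": 1, "CustomerId": 1}]
order_lines_table = [
    {"OrderId": 1, "ProductId": 1, "Qty": 2, "Price": 10.0},
    {"OrderId": 1, "ProductId": 2, "Qty": 1, "Price": 5.0},
]

# Reading the aggregate: a single key lookup vs. a join/filter.
total_doc = sum(l["Qty"] * l["Price"] for l in order_doc["Lines"])
total_rel = sum(l["Qty"] * l["Price"] for l in order_lines_table if l["OrderId"] == 1)
print(total_doc, total_rel)  # 25.0 25.0
```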
We build indexes on background threads, yes. You can look at real world data here: http://ayende.com/blog/24577/raccoon-blog-and-ravendbndash-one-month-later
Ayende, it's not that I don't understand the advantages of document DBs, of course you'll often find that mapping an aggregate (DDD sense) to a single document avoids many problems that you'd seem to have from a relational angle. But sometimes you'll need to join aggregates not for reporting, but just for finding or selecting entities (in enterprise apps much more than in internet apps like blogs).
While this works too, it needs extra indexes, denormalization, or some other persistent data. A SQL database can do these things based on a small set of indexes, and I'd consider this an advantage, all other issues aside. Eventually, some NoSQL-based apps will find that they need explicit query databases for this just as they do for OLAP (depending on their use cases of course). CQRS could be a good starting point for this. It deals with all the questions RavenDB's architecture brings up (stale indexes, eventual consistency, dedicated data structures for queries...)
Stefan, It is fairly easy to do cross-aggregate joins in RavenDB (see multi map). The advantage that you refer to with RDBMS indexing exists, but I don't think that it is nearly as important as you believe it is. Mostly, that is because RavenDB does no computation during the query: all the work has already been done, so we are very fast.
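A multi-map index, roughly: two map functions, one per document type, emit entries into one index under a shared key, so a single query spans both aggregates. A simplified Python sketch (document shapes invented, and the merge is much cruder than a real reduce):

```python
users = [{"Id": "users/1", "Name": "Alice"}]
posts = [{"Id": "posts/1", "Author": "users/1", "Title": "Hello"}]

# Two maps emitting into one index, keyed so related entries line up.
entries = []
for u in users:
    entries.append({"Key": u["Id"], "Name": u["Name"], "Title": None})
for p in posts:
    entries.append({"Key": p["Author"], "Name": None, "Title": p["Title"]})

# Merge entries sharing a key into one cross-aggregate result.
merged = {}
for e in entries:
    m = merged.setdefault(e["Key"], {"Name": None, "Titles": []})
    if e["Name"]:
        m["Name"] = e["Name"]
    if e["Title"]:
        m["Titles"].append(e["Title"])

print(merged["users/1"])  # {'Name': 'Alice', 'Titles': ['Hello']}
```

As with the other indexes, the "join" is materialized at indexing time; the query is a lookup.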
Sure, but if you create your queries ad hoc, that's not a lot of help. Likewise, if you've got too many predefined queries spanning documents from various angles, but not executed all that often, you'll eventually reach a point where SQL-like ad hoc joins would have been more efficient. Even if most apps would benefit from your trade-off (as opposed to that of any RDBMS), there's still a downside to it, and I think it belongs in a list of things not to do with RavenDB.
Good timing of these posts, I'm currently evaluating RavenDB for my projects and the main question I'm trying to answer is what do I lose compared to SQL Server when I chose RavenDB. I've created a StackOverflow question for it, I'd be grateful if someone could answer it here or over there: http://stackoverflow.com/questions/8250254/if-i-choose-ravendb-what-benefits-of-sql-server-do-i-lose
Stefan, I think that you are significantly overestimating the cost. We actually have clients that have > 500 indexes, and they have no real issues with that. There is also the tradeoff between costs to consider: we pay most of the cost after write (async), while an RDBMS pays most of the cost on read. You tend to have a LOT more reads than writes.
Do I understand this right? When I do an ad-hoc query, a new dynamic index is created. This index is then permanent, i.e. it will need to be maintained for every write operation (in the background). The index needs to be built before my ad-hoc query can execute, so there's a delay. (Will indexes be reused for identical/similar queries?)
And you say even if I do this hundreds of times, and keep doing it, I don't have a problem? That would be impressive. Or did I just get the process wrong?
Related topic: can RavenDB create indexes that include data from other partitions?
Stefan, Ad hoc queries will try to find an existing index, if there is no index that matches this query, it will be created. There might be a delay while the query is being executed, yes.
Indexes are reused, yes.
Dynamic indexes (created because of ad hoc queries) will hang around for a while, ready to serve the next matching query. If there isn't enough activity, they will go away. If there is a lot of activity, they will become permanent.
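The lifecycle just described can be sketched as a usage-counting policy. The thresholds below are invented for illustration; the real heuristics are internal to RavenDB.

```python
class AutoIndex:
    """Toy model of a dynamic index's lifecycle (thresholds are made up)."""
    PROMOTE_AFTER = 3   # queries before a temporary index becomes permanent
    IDLE_LIMIT = 2      # cleanup ticks without use before it is dropped

    def __init__(self):
        self.uses = 0
        self.idle_ticks = 0
        self.permanent = False
        self.dropped = False

    def query(self):
        self.uses += 1
        self.idle_ticks = 0
        if self.uses >= self.PROMOTE_AFTER:
            self.permanent = True  # enough activity: keep it forever

    def tick(self):
        # Periodic cleanup: unused temporary indexes go away.
        if not self.permanent:
            self.idle_ticks += 1
            if self.idle_ticks >= self.IDLE_LIMIT:
                self.dropped = True

busy, quiet = AutoIndex(), AutoIndex()
for _ in range(3):
    busy.query()        # heavy use: promoted to permanent
quiet.query()           # a single ad hoc query
for _ in range(2):
    busy.tick()
    quiet.tick()        # idles out and is dropped

print(busy.permanent, quiet.dropped)  # True True
```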
Nice. Do you have any real-world customer stories showing how RavenDB handles large amounts of data and transactions? How RavenDB integrates with operational procedures (backup/restore)? What's the failover-story? etc.
With all these paradigm shifts, it's hard to convince anybody that what looks good in theory or works well in the lab is actually ready for critical enterprise apps. (The latest MongoDB data-loss rumors didn't help.)
Stefan, Yes, we have several big clients working already. You can see some who allowed us to talk about them here: http://beta.ravendb.net/testimonials/
What MongoDB data loss rumors?
http://www.infoq.com/news/2011/11/MongoDB-Criticism Unfortunately, this all easily adds up to a vague impression that all these Web-targeting DBs (MySQL, NoSQL) don't really work well in critical situations, where you can't afford to lose even a single record.
What I'm interested in is not so much the developer story, that one looks pretty solid. I'd like to know how it compares to, say, SQL Server from an operations perspective. Because one good thing about shipping an app based on SQL Server is that it's implicitly trusted by the customer, and they already have experience and operational procedures for it in place. There's features like log shipping and database mirroring, transparent data encryption, you name it. How would I actually deploy and support a large RavenDB instance/farm in the real world?
Stefan, MongoDB is intentionally designed to be really fast at the expense of reliability. Personally, I think that this is an insane decision for most scenarios, but that is what it is designed for.
RavenDB is designed to be _safe_, it is ACID, transactional, and its storage format is known to have taken a beating for decades without issues. We are actually using the same backend as Exchange and Active Directory. We have support for each of the items that you have mentioned, btw.
I'm talking about perception too, not just facts. Not only the programmer needs to be convinced, but the customer too. That you're using ESE is helping a lot, didn't know that. Where can I find documentation of these things? The documentation section on ravendb.net is very developer-centric.
Stefan, We would love your comments on the new docs at beta.ravendb.net
Doesn't seem to have much administration information either, just a few bits here and there within the developer docs. At least Google can't find any; the search box at the top of the page does not seem to work. Convincing developers is one thing. If you want to help them convince their admins, DBAs and customers, I suggest adding an IT-Pro section. I would also consider posting a list of SQL Server and Oracle features and describing how the same effect is reached using RavenDB (or why it is not required)
Stefan, Thanks, we will add that.
Given that the write operations are separated from the read and indexing operations, would RavenDB still be a good choice for a system where users are concurrently editing records in a system and near real-time updates need to be pushed out to any user viewing/editing the same record? I'm concerned that the latency between a user saving changes to a record and that record being ready to be queried could be a problem in some scenarios.
Chris, If you already know what document you are looking at, there is no latency whatsoever.