Large, interconnected, in-memory model
I got into an interesting discussion about Event Sourcing in the comments for a post and that was interesting enough to make a post all of its own.
Basically, Harry is suggesting (I’m paraphrasing, and maybe not too accurately) that you can keep the computed model from all the events directly in memory. The idea is that you can pretty easily get machines with enough RAM to store stupendous amounts of data in memory. That gives you all the benefits of holding a rich domain model without any persistence constraints. It is also likely to be faster than any other solution.
And to a point, I agree. It is likely to be faster, but that isn’t enough to make this a good solution for most problems. Let me point out a few cases where this fails to be a good answer.
If the only way you have to build your model is to replay your events, then that is going to be a problem when the server restarts. Assuming a reasonably sized data model of 128GB or so, and assuming that we have enough events to build something like that, let’s say about 0.5 TB of raw events, we are going to be in a world of hurt. Even assuming no I/O bottlenecks, I believe that it would be fair to state that you can process the events at a rate of 50 MB/sec. That gives us just under 3 hours to replay all the events from scratch. You can try to play games here: read in parallel, replay events on different streams independently, etc. But it is still going to take time.
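To sanity-check that figure, here is the arithmetic as a quick back-of-envelope script (the 0.5 TB event size and the 50 MB/sec processing rate are the assumptions from the paragraph above):

```python
# Back-of-envelope: how long does a full replay from scratch take?
raw_events = 0.5 * 1024**4   # 0.5 TB of raw events, in bytes
rate = 50 * 1024**2          # assumed processing rate: 50 MB/sec

hours = raw_events / rate / 3600
print(f"{hours:.2f} hours")  # -> 2.91 hours
```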
And enough time that this isn’t a good technology to have without a good backup strategy, which means that you need to have at least a few of these machines and ensure that you have some failover between them. But even ignoring that, and assuming that you can indeed replay all your state from the events store, you are going to run into other problems with this kind of model.
Put simply, if you have a model that is tens or hundreds of GB in size, there are two options for its internal structure. On the one hand, you may have a model where each item stands on its own, with no relations to other items. Or, if there are any relations to other items, they are well scoped to a particular root. Call it the Root Aggregate model, with no references between aggregates. You can make something like that work, because you have good isolation between the different items in memory, so you can access one of them without impacting another. If you need to modify it, you can lock it for the duration, etc.
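As a minimal sketch of that shape (the `Aggregate` class and the `orders/*` keys here are made up purely for illustration), each root owns its own state and its own lock, so mutating one never touches another:

```python
import threading

class Aggregate:
    """One root aggregate: isolated state plus its own lock."""
    def __init__(self, key):
        self.key = key
        self.state = {}
        self._lock = threading.Lock()

    def mutate(self, field, value):
        # Safe: no other aggregate is reachable from this one, so
        # locking just this item is enough to mutate it.
        with self._lock:
            self.state[field] = value

# A keyed store of independent aggregates; each can be locked alone.
store = {k: Aggregate(k) for k in ("orders/1", "orders/2")}
store["orders/1"].mutate("status", "shipped")
```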
However, if your model is interconnected, so you may traverse between one Root Aggregate to another, you are going to be faced with a much harder problem.
In particular, because there are no hard breaks between the items in memory, you cannot safely or easily mutate a single item without worrying about access to it from another item. You could make everything single-threaded, but that is obviously a waste of a lot of horsepower.
Another problem with in-memory models is that they don’t do such a good job of allowing you to roll back operations. If your code mutates objects and hits an exception midway, what is the current state of your data?
You can resolve that. For example, you can decide to keep only immutable data in memory and replace it atomically. That… works, but it requires a lot of discipline and makes it complex to program against.
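A minimal sketch of that approach, assuming a hypothetical `Model` class that guards an immutable snapshot: readers always see a consistent view, and a failed mutation never becomes visible, which gives rollback for free.

```python
import threading
from types import MappingProxyType

class Model:
    """Immutable in-memory model, replaced atomically on each change."""
    def __init__(self):
        self._snapshot = MappingProxyType({})
        self._write_lock = threading.Lock()

    def read(self):
        # Readers never block: they get whatever snapshot is current.
        return self._snapshot

    def apply(self, mutate):
        # Writers build a fresh copy; if `mutate` raises, the old
        # snapshot is untouched, so a failed operation leaves no trace.
        with self._write_lock:
            draft = dict(self._snapshot)
            mutate(draft)
            self._snapshot = MappingProxyType(draft)

m = Model()
m.apply(lambda d: d.__setitem__("balance", 100))
try:
    def bad(d):
        d["balance"] = -1
        raise ValueError("validation failed")
    m.apply(bad)
except ValueError:
    pass
print(m.read()["balance"])  # -> 100: the failed mutation was rolled back
```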
Off the top of my head, you are going to be facing problems around atomicity, consistency and isolation of operations. We aren’t worried about durability because this is a purely in-memory solution, but if we were to add that, we would have ACID, and that does ring a bell.
The in-memory solution sounds good, and it is usually very easy to start with, but it suffers from major issues when used in practice. To start with, how do you look at the data in production? That is something that you do surprisingly often, to figure out what is going on “behind the scenes”. So you need some way to peek into what is going on. If your data is in memory only, and you haven’t thought about how to expose it to the outside, your only option is to attach a debugger, which is… unfortunate. Given the reluctance to restart the server (startup time is high), you’ll usually find that you have to provide some scripting that you can run in process to make changes, inspect things, etc.
Versioning is also a major player here. Sooner or later you’ll probably put the data inside a memory-mapped file to allow for (much) faster restarts, but then you have to worry about the structure of the data and how it is modified over time.
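One common mitigation is to stamp the snapshot with a schema version, so an incompatible file is detected at startup instead of being silently misread. A sketch, assuming a made-up `MODL` header format (the magic bytes, version number and file name are all hypothetical):

```python
import mmap
import struct

MAGIC, VERSION = b"MODL", 2  # hypothetical format tag and schema version

def open_model(path):
    """Map a snapshot file, refusing one written by an incompatible schema."""
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
    magic, version = struct.unpack_from("<4sI", mm, 0)
    if magic != MAGIC:
        raise ValueError("not a model snapshot")
    if version != VERSION:
        # A real system would run a migration here instead of failing.
        raise ValueError(f"snapshot is schema v{version}, expected v{VERSION}")
    return mm

# Write a v2 snapshot: an 8-byte header followed by the (placeholder) payload.
with open("model.bin", "wb") as f:
    f.write(struct.pack("<4sI", MAGIC, VERSION))
    f.write(b"\x00" * 1024)

snapshot = open_model("model.bin")
print(len(snapshot))  # -> 1032
```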
None of the issues I have raised is super hard to figure out or fix, but in conjunction? They turn out to be a pretty big set of additional tasks that you have to do just to be in the same place you were before you started to put everything in memory to make things easier.
In some cases, this is perfectly acceptable. For high frequency trading, for example, you would have an in memory model to make decisions on as fast as possible as well as a persistent model to query on the side. But for most cases, that is usually out of scope. It is interesting to write such a system, though.
I stumbled upon the same issues you mentioned; it's exactly like that with an in-memory model. Sure, there are plenty of ways to fix those issues, but it requires a considerable amount of time. And you'll eventually end up with some sort of semi-completed database. Why not use a DB from the start for keeping aggregate root state? Another PITA I encountered was aggregate root schema migrations; they are like the 9th level of hell when your business model evolves, and you have to take down your servers for a while to let it happen or, again, develop some complex solution to do it at runtime. But what if you made a mistake... It's just like juggling razor-sharp knives without a handhold.
Could a solution resolving all those issues be the future of software development? Something between application and database.
Remi, I don't see this, to be honest. Object databases tried to do that for a while, but you might have noticed that there aren't any in common use any more?
Indeed. But for me document databases are really close to object databases, and with the progress in distributed data we can see from many providers (RavenDB included), I think the advantage for the developer and architect would be big: your application/microservice is no longer data store + business logic but one single thing, and developers wouldn't have to think about persistence. We can nearly do that with containers now, apart from the fact that concurrent writing to a volume is not possible and you have to set up your nodes' synchronisation strategy.
Remi, A really important distinction between object dbs and document databases is the boundaries drawn. A document database has a clear boundary between documents. There are references, but they are separated. Objects are much more fluid, and it is much harder to draw a line inside an object graph and state that this particular set of objects can be operated on independently from the others.
I don't think that you can ignore persistence or its concerns. There is only so much that abstractions can do for you. I'm not sure that I'm following how you are thinking about containers in this context, though.
I get your point. My point about containers was that now you can see your app as a deployable unit assembling data and behaviour, where in the past you had two distinct components communicating, but you'll have some work if you want all your "units" of the same kind to sync their data. Not sure if I am clear now (it's in my head I guess), nevermind :)
Remi, Technically speaking, you can deploy your container with a file that holds all your data as in-memory objects. That is _possible_, but really inadvisable.
How do you handle versioning, backup, inspection, modifications outside your code? How do you handle migration of data? etc?
The issue isn't just having the data, the issue is the whole concept of owning it.
Note that unlike your code, you can't usually blow away the data and start from scratch with the next time you compile. The data lives longer and is treated very differently.
I was only slightly paraphrased, and I'm very pleased to see it discussed more!
I do think picking on the mega-size monoaggregate is a bit of a reductio ad absurdum, though; in the real world you would definitely shard the object, or pick a problem space that won't cause these problems*. I doubt the (smallest possible) in-memory model for the scheduling domain would have been so large that it ran into the scaling issues mentioned above.
Remi, you can perhaps answer some of the questions by using a virtual actor system like Orleans to provide hosting with multiple redundancy (although updating the code without downtime then becomes an interesting problem, perhaps solvable using .NET Core 3's new unloadable assemblies).
Some links for the interested:
https://martinfowler.com/bliki/MemoryImage.html
https://github.com/devrexlabs/memstate
https://www.infoq.com/articles/Big-Memory-Part-2
*although I half remember seeing a tweet from someone who works at SO who was trying to fit the whole Rep system into RAM and having to do something like compress bigints in order to get it all to fit
Harry, The problem isn't a single object, it is the interconnection between them. Using the nurse scheduling example, we have:
Schedule.Location.Schedules <-- can be a huge list
Schedule.Assignees.CurrentSchedules <-- multiple schedules.
So just by having these two paths, you can probably start at any object in the graph and traverse the entire thing.
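A minimal sketch of why that matters (the class names are made up to mirror the two paths above): a walk starting from any single schedule reaches every object in the graph, so no item can be mutated in isolation.

```python
# Hypothetical classes mirroring Schedule.Location.Schedules and
# Schedule.Assignees[i].CurrentSchedules from the comment above.
class Location:
    def __init__(self):
        self.schedules = []

class Nurse:
    def __init__(self):
        self.current_schedules = []

class Schedule:
    def __init__(self, location, assignees):
        self.location, self.assignees = location, assignees
        location.schedules.append(self)
        for n in assignees:
            n.current_schedules.append(self)

def reachable(start):
    """Follow both back-references; count distinct objects we can touch."""
    seen, stack = set(), [start]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        if isinstance(obj, Schedule):
            stack += [obj.location, *obj.assignees]
        elif isinstance(obj, Location):
            stack += obj.schedules
        elif isinstance(obj, Nurse):
            stack += obj.current_schedules
    return len(seen)

loc, nurses = Location(), [Nurse(), Nurse()]
s1, s2 = Schedule(loc, nurses[:1]), Schedule(loc, nurses)
print(reachable(s1))  # -> 5: the whole graph (2 schedules, 2 nurses, 1 location)
```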
And storing a lot of data in RAM for various purposes is very common, the question is what you do with it and if this is the master data set.