The Guts n’ Glory of Database Internals: The curse of old age…
This is a fun series to write, but I’m running out of topics where I can talk about the details at a high level without getting into nitty-gritty details that will make no sense to anyone but database geeks. If you have any suggestions for additional topics, I would love to hear about them.
This post, however, is about another aspect of running a database engine. It is all about knowing what is actually going on in the system. A typical web application has very little state (maybe some caches, but that is pretty much it) and can be fairly easily restarted if you run into some issue (memory leak, fragmentation, etc.) to recover from most problems while you investigate exactly what is going on. A surprising number of production systems actually rely on this and just restart on a regular basis. IIS, for example, will restart a web application every 29 hours, and I have seen production deployments of serious software where the team was simply unaware of that. It did manage to reduce a lot of the complexity, because the application never lived long enough to actually accumulate that much garbage.
A database tends to be different. A database engine lives for a very long time, typically weeks, months or years, and it is pretty bad when it goes down. It isn’t a single node in the farm that is temporarily missing or slow while it is filling its cache; it is the entire system being down without anything that you can do about it (note, I’m talking about single node systems here; distributed systems have high availability mechanisms that I’m ignoring at the moment). That tends to give you a very different perspective on how you work.
For example, if you are using Cassandra, it (at least) used to have an issue with memory fragmentation over time. It would still have a lot of available memory, but certain access patterns would chop that up into smaller and smaller slices, until just managing the memory at the JVM level caused issues. In practice, this can cause very long GC pauses (multiple minutes). And before you think that this is a situation unique to managed databases, Redis is known to suffer from fragmentation as well, which can lead to higher memory usage (and can even kill the process, eventually) for pretty much the same reason.
Databases can’t really afford to use common allocation patterns (so no malloc / free or the equivalent) because they tend to hold on to memory for a lot longer, and their memory consumption is almost always dictated by the client. In other words, saving increasingly large records will likely cause memory fragmentation, which a client can then exploit further by doing another round of allocations, slightly larger than the round before (forcing even more fragmentation, and so on). Most databases use dedicated allocators (typically some form of arena allocator) with limits, which allows them to have better control over memory and mitigate that issue (for example, by blowing away the entire arena on failure and allocating a new one, which doesn’t have any fragmentation).
But you actually need to build this kind of thing directly into the engine, and you need to account for it in operations as well. When you have a customer calling with “why is the memory usage going up?”, you need to have some way to inspect this and figure out what to do about it. Note that we aren’t talking about memory leaks; we are talking about when everything works properly, just not in the ideal manner.
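To make that concrete, here is a minimal sketch of the arena idea in C. This is not RavenDB’s allocator or any particular engine’s code (all the names here are made up for illustration); it just shows the shape of it: reserve one big block, hand out allocations by bumping an offset, keep a counter you can show to that customer on the phone, and release everything in one shot instead of freeing individual pieces.

```c
/* Minimal arena allocator sketch (illustrative only, not RavenDB's code).
 * The engine grabs a large block up front, serves allocations by bumping an
 * offset, and drops or resets the whole arena in one shot, so per-allocation
 * fragmentation never accumulates. */
#include <stdlib.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char   *buffer;     /* one large block reserved up front            */
    size_t  capacity;   /* total bytes reserved for this arena          */
    size_t  used;       /* bump offset; doubles as a usage counter      */
} arena_t;

static int arena_init(arena_t *a, size_t capacity) {
    a->buffer = malloc(capacity);
    a->capacity = capacity;
    a->used = 0;
    return a->buffer != NULL;
}

/* Returns NULL when the arena is exhausted; the caller's recovery is to
 * discard the whole arena and start a fresh one, not to free pieces. */
static void *arena_alloc(arena_t *a, size_t size) {
    size = (size + 7) & ~(size_t)7;          /* keep allocations 8-byte aligned */
    if (a->used + size > a->capacity)
        return NULL;
    void *p = a->buffer + a->used;
    a->used += size;
    return p;
}

static void arena_reset(arena_t *a) { a->used = 0; }   /* release everything at once */
static void arena_free(arena_t *a)  { free(a->buffer); a->buffer = NULL; }

/* The kind of counter you expose when someone asks "why is memory going up":
 * not a leak report, just an honest view of what the engine is holding. */
static void arena_report(const arena_t *a, const char *name) {
    printf("%s: %zu / %zu bytes in use\n", name, a->used, a->capacity);
}

int main(void) {
    arena_t records;
    if (!arena_init(&records, 1 << 20)) return 1;    /* 1 MB arena */
    void *rec = arena_alloc(&records, 300);          /* a record-sized allocation */
    (void)rec;
    arena_report(&records, "records arena");
    arena_reset(&records);                           /* wholesale release, no fragmentation */
    arena_free(&records);
    return 0;
}
```

The point of the reset is that when something goes wrong, or the arena fills up, the engine can drop the whole thing and start from a clean block, instead of inheriting whatever fragmentation the previous workload created.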
Memory is just one aspect of that, if one that is easy to look at. Other things that you need to watch for include anything whose cost grows with how long the system has been running. For example, if you have a large LRU cache, you need to make sure that after a couple of months of running, pruning that cache isn’t going to be an O(N) job running every 5 minutes, never finding anything to prune, but costing a lot of CPU time. The number of file handles is also a good indication of a problem in some cases; some database engines have a lot of files open (typically LSM ones), and they can accumulate over time until the server runs out of them.
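As a sketch of the cache point (again, illustrative rather than taken from any particular engine): if the cache entries are already kept in a list ordered by last access, the pruning job can walk from the oldest end and stop at the first entry that is still fresh, so a pass that finds nothing to evict costs almost nothing instead of scanning every entry every five minutes.

```c
/* Illustrative sketch of cheap cache pruning (hypothetical types, not from a
 * real engine).  Entries live in a doubly linked list ordered by last access:
 * an entry is moved to the head on every hit (not shown), so the tail is
 * always the oldest.  The background job walks from the tail and stops at the
 * first live entry, so "nothing to prune" costs O(1), not O(N). */
#include <stddef.h>
#include <time.h>

typedef struct cache_entry {
    struct cache_entry *prev, *next;   /* intrusive LRU list links   */
    time_t              last_access;   /* updated on every cache hit */
    /* ... key, value, etc. ... */
} cache_entry_t;

typedef struct {
    cache_entry_t *head;   /* most recently used  */
    cache_entry_t *tail;   /* least recently used */
    size_t         count;
} lru_cache_t;

/* Unlink an entry from the list (the caller frees or recycles it). */
static void lru_unlink(lru_cache_t *c, cache_entry_t *e) {
    if (e->prev) e->prev->next = e->next; else c->head = e->next;
    if (e->next) e->next->prev = e->prev; else c->tail = e->prev;
    c->count--;
}

/* Evict expired entries.  Because the list is ordered by last access, the
 * loop breaks as soon as it meets a live entry; it never scans the whole
 * cache just to discover there is nothing to evict. */
static size_t lru_prune(lru_cache_t *c, time_t now, double max_age_seconds) {
    size_t evicted = 0;
    while (c->tail && difftime(now, c->tail->last_access) > max_age_seconds) {
        cache_entry_t *victim = c->tail;
        lru_unlink(c, victim);
        /* release victim's key/value here */
        evicted++;
    }
    return evicted;
}

int main(void) {
    cache_entry_t a = {0}, b = {0};
    lru_cache_t cache = {0};
    time_t now = time(NULL);

    a.last_access = now - 3600;        /* stale entry at the tail */
    b.last_access = now;               /* fresh entry at the head */
    cache.head = &b; cache.tail = &a;
    b.next = &a; a.prev = &b;
    cache.count = 2;

    return lru_prune(&cache, now, 600.0) == 1 ? 0 : 1;   /* evicts only 'a' */
}
```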
Part of the job of the database engine is to consider not only what is going on now, but also how to deal with (sometimes literally) abusive clients that try to do very strange things, and how to handle them. In one particular case, a customer was using a feature that was designed for a maximum of a few dozen entries per query to pass 70,000+ entries. The amazing thing is that this worked, but as you can imagine, all sorts of assumptions internal to that feature were viciously violated, requiring us to consider whether to put a hard limit on the feature, so it stays within its design specs, or to see if we can redesign the entire thing so it can handle this kind of load.
And the most “fun” part is that those sorts of bugs only show up after a couple of weeks of harsh production use. So even when you know what is causing the problem, actually reproducing the scenario (you need memory fragmented in a certain way, and a certain number of cache entries, and the application requesting a certain load factor) can be incredibly hard.
More posts in "The Guts n’ Glory of Database Internals" series:
- (08 Aug 2016) Early lock release
- (05 Aug 2016) Merging transactions
- (03 Aug 2016) Log shipping and point in time recovery
- (02 Aug 2016) What goes inside the transaction journal
- (18 Jul 2016) What the disk can do for you
- (15 Jul 2016) The curse of old age…
- (14 Jul 2016) Backup, restore and the environment…
- (11 Jul 2016) The communication protocol
- (08 Jul 2016) The enemy of thy database is…
- (07 Jul 2016) Writing to a data file
- (06 Jul 2016) Getting durable, faster
- (01 Jul 2016) Durability in the real world
- (30 Jun 2016) Understanding durability with hard disks
- (29 Jun 2016) Managing concurrency
- (28 Jun 2016) Managing records
- (16 Jun 2016) Seeing the forest for the trees
- (14 Jun 2016) B+Tree
- (09 Jun 2016) The LSM option
- (08 Jun 2016) Searching information and file format
- (07 Jun 2016) Persisting information
Comments
I would actually really like to see your take on the nitty-gritty details. You already have regular visitors who will understand those posts, even if I may not. BTW, one of the ways your blog stands out is the regularity/frequency of your posts, besides the value of the posts themselves.
Peter, Thanks. The problem is that I tend to either paint with broad strokes, as in this series, where I explain what is going on and the full reasoning, or go into super specific stuff (like the cost of transaction logs).
It is hard for me to find topics in the middle, because I'm not usually working at that level. If you have suggestions for topics, I would love to hear them.
What about covering locking (all the different kinds of locks and escalation)? There is probably enough there that you could write a whole series of posts if you wanted to.
I happily concur with what Peter said. Your blog for years has been a beacon of incredibly useful and valuable information.
If you're looking for post ideas - have you done anything on packet-level security? For example, if you have two disparate systems passing sensitive data back and forth via TCP/IP, what are some good approaches for keeping the data secure? You may have already covered this, but I couldn't think of anything else at the moment.
I'd like to understand the process you go through to determine the high level client API. For example, why do you create a connection and then a session within that? Why do you set limits around the number of records returned from a query? How do you decide on the naming convention to use?
Chris B, Did you see this post about locking: https://ayende.com/blog/174562/the-guts-n-glory-of-database-internals-managing-concurrency
The problem with talking about locking at this level is that it really depends on the actual db you are using. In relational databases, you have row locks, page locks, etc. In NoSQL databases you (maybe) have record locks and typically a writer lock. Very little to talk about.
Lucas, While that is an interesting topic, that is quite outside of my own area of expertise. I know enough about security to know that I don't know enough about security.
My probable recommendation is to put SslStream around the connection, verify the certificate on both ends, and assume that the SSL implementation that I'm using is safe and secure enough.
Beyond that, I would go to an expert in that particular field
Stuart, Thanks, I'll think about how to answer those questions as a full post
An example for a topic "in the middle", that relates to this one, would be to run through a scenario in which a number of concurrent transactions (e.g. 2 read-only with one that is "slow", and 2 read-write) are performed and how this is handled in Voron in terms of allocation/use/freeing of memory and files. I.e. journal files being created, filled and recycled, scratch memory pages being allocated / freed, MVCC finding pages from scratch buffers or main data file, writing to and growing the main data file, etc.
This could complement and to some extent tie together a recent set of blog posts on how "logical" pages and their contents are found using B+Trees, the "copy-on-write" MVCC snapshotting mechanism, the recent "durability" series and this one about management of allocated memory and file handles.
A similar alternative scenario could be to run through the steps occurring (again referring to journal files, scratch pages, data sync checkpoint, main data files, ...) when Voron recovers after a crash.
Stuart, The high level client API started out by looking at both the NHibernate session API and the Entity Framework API. The idea is to give users something that they are familiar with, with very few surprises.
The naming convention was the standard .NET one, again, for fewer surprises.
The limitations on unbounded result sets and other stuff like that mostly came out of reading Release It! and from actual experience of seeing applications fail to perform (or just fail) as a result of those common mistakes. One of the things that we tried to do is to ensure that we don't have those hard pitfalls in RavenDB.
What about writing about arena memory management? You do not define the word Arena in the text, even though you write for a broader audience.
Database defragmentation could be another subject.
Database statistics for building indexes.
Another memory issue is stack overflow due to recursive calls - as in the Toyota Camry case: http://embeddedgurus.com/state-space/2014/02/are-we-shooting-ourselves-in-the-foot-with-stack-overflow/
Alex, Thanks, lots of good ideas here. I added a new series here: https://ayende.com/blog/posts/series/
It is a bit far into the queue, though.
Carsten,
Arena allocation is actually explained pretty nicely in Wikipedia, I'm not seeing much to add there.
DB Defrag - You mean how the database handles reducing the file size after mass deletes?
What do you mean by db stats for index building? In RavenDB? In General?
That issue is related to safety critical code. In those cases, you don't even consider such things as managed languages. You use C / Ada and have very little freedom (no memory allocations, no recursion, no function pointers, no double pointers, etc.).
Yes, I found information about Arena when reading your article: https://en.wikipedia.org/wiki/Region-based_memory_management
You could have provided the link or a short explanation.
How is it used in RavenDB? Do you have your own UNSAFE memory management with custom new/delete of different block size? https://ayende.com/blog/173089/measuring-baseline-costs
DB Defrag, yes, auto shrink. I know MS Exchange does it online during the night, and you have offline as well: https://support.microsoft.com/en-us/kb/328804
Both in general and in RavenDB. I know from SQL that the stats need to be updated to make the query optimizer work.
The issue is that you can compare a DB engine with embedded software because it has to run 24x7 for years.
NB: Query optimizer is another subject which you can write about.
Carsten, Arena allocation is covered there pretty well. We allocate a bunch of memory and release it all at once; I don't think we do anything special there, but yes, we do use native memory for that.
I'll do a post about defrag, yes.
The query optimizer in relational databases and in RavenDB are very different, and statistics, for example, play no role in RavenDB.