Ayende @ Rahien

It's a girl

Lightning memory-mapped database is Esent

Here is something that suddenly dawned at me as I was going through the LMDB codebase. LMDB is actually how Esent works. In fact, going through LMDB codebase I am in a much better position to understand how Esent works.

For example, Esent is limited to 16TB per database. I did some math to figure out the max size of a LMDB db that used 32bits for page number and I got to 2 ** 32 times 4,096 == 16 TB, and that isn’t a coincidence. And all the background information in the Esent API that talk about large values, for example. That is how LMDB does overflow, and tables are implemented as additional B-Trees (what LMDB calls dbs).

At some point I was going through this in my head and realized that indexes are also working in the exact same way, and so does pretty much everything else. What Esent does on top is provide better concurrency (LMDB allows only a single writer), and a whole lot of additional features (schemas, secondary indexes, etc).

But I think that the major difference from my point of view is that I had a theoretical idea about how Esent really worked, under the covers. Now I feel like I actually grok things at a much better level. For instance, Esent will not give back space to the operating system, even if you deleted stuff. I knew that, and just accepted that as how it worked. But now I can understand why it doesn’t do that. Or the MAX_VER_PAGES setting, I knew what it purpose was, and how it impacted operations (it is the amount of changes that can be pending). But after reading LMDB dirty_pages and its limit, I understand the why and how it works at a much deeper level.

To be honest, I kept reading LMDB codebase purely because I wasn’t going to let that code beat me, but it is very educational.

Comments

Howard Chu
07/12/2013 04:48 PM by
Howard Chu

You seem to be confusing "embedded key/value data store" with "DBMS". LMDB is only a key/value data store, DBMSs get implemented on top of it. E.g., SQLite3 on top of LMDB, MariaDB on top of LMDB, Riak on top of LMDB, memcacheDB on top of LMDB, OpenLDAP slapd on top of LMDB, etc. etc. etc.

Ayende Rahien
07/12/2013 04:52 PM by
Ayende Rahien

Howard, Do you know what Esent is? I am not sure that I understand this comment.

Howard Chu
08/09/2013 10:07 AM by
Howard Chu

ESENT is a DBMS layer built on top of JET. All the "whole lot of extras" you talk about like schemas and secondary indexes are at the management system layer, while LMDB is only a key/value store. You could more fairly compare LMDB to JET since they serve the same role. The examples I listed above are all cases where schemas and indexing mechanisms get built on top of LMDB.

Of course, comparing JET to LMDB would be like comparing an 80386 to a modern CPU. JET is firmly stuck in 32-bit land, and a lot slower.

James C
08/09/2013 12:49 PM by
James C

Actually ESENT is not the same as the generic "Jet" and is internally called Jet Blue versus the original Jet Red used by Access. Since XP and Server 2003 it has a native 64-bit version for 64-bit machines.

http://en.m.wikipedia.org/wiki/ExtensibleStorageEngine

Justin
08/09/2013 05:12 PM by
Justin

ESENT aka Jet Blue is used as the storage engine for Exchange server and interestingly enough as the storage engine for Active Directory which is also a LDAP server.

http://en.wikipedia.org/wiki/ESENT

It is a relational database, but not a SQL database.

Effectively Raven DB is a NoSQL database built on a relational DB. At least for now.

Of course a relation is a key-value data structure where the keys and values are homogeneous allowing some optimizations such as not storing key values or interned key id's with each tuple nor having to do searches for the key per tuple, just follow offsets.

You see this surface in key-values based systems like say Mongo DB (I guess Raven too?) where it is suggested to reduce key length to reduce storage requirements and improve performance: http://christophermaier.name/blog/2011/05/22/MongoDB-key-names.

Not sure how LMDB stores key values, does it do some sort of interning?

Ayende Rahien
08/09/2013 06:58 PM by
Ayende Rahien

Justin, Esent isn't relational. It is table based, but it doesn't have the concept of relations.

Ayende Rahien
08/09/2013 06:59 PM by
Ayende Rahien

Justin, Regarding keys, yes, that is important (to some aspect) in RavenDB as well, and it is relevant for LMDB as well. The keys you are talking about are the data size, not key size. So it is outside the scope of something like LMDB. And see my reaction to this post here: http://ayende.com/blog/4669/you-saved-5-cents-and-your-code-is-not-readable-congrats

Justin
08/09/2013 07:29 PM by
Justin

Table==Relation, I think you're confusing foreign keys or joins:

"In relational database theory, a relation is a set of tuples (d1, d2, ..., dj), where each element dn is a member of Dn, a data domain.[1] Each distinct domain used in the definition of a relation is called an attribute, and each attribute is usually named."

"In SQL, a query language for relational databases, relations are represented by tables, where each row of a table represents a single tuple, and where the values of each attribute form a column."

http://en.wikipedia.org/wiki/Relation_(database)

ESENT is definitely relational in the it has tables which are relations even if it has no joins or foreign keys.

I was referring to key size, the is the larger you have a k/v of longkeyname=1, longkeyname=2 etc. in a relational db the longkeyname only needs to be stored in the schema and only the values are stored in the tuples and the engine can find the data in each tuple without doing a key lookup.

Ayende Rahien
08/09/2013 07:32 PM by
Ayende Rahien

Justin, I understand what you are saying, but in practice, relational database doesn't means just tables.

WRT keys, you need to be more specific. In particular, in key/value store there isn't a name for the key. There is the key, and then there are the values. The values may or may not store the full key names, but that is an implementation decision.

Justin
08/09/2013 08:12 PM by
Justin

Sorry but it's not my fault most don't know what the relation in relational means. It is very well defined and ESENT falls under the definition. ESENT sure isn't a document or K/V store. I guess it's NoSQL since it it doesn't implement SQL the language and yet it still stores data in tables with columns and rows with a predefined schema.

Your article on the key size issue touches only on the cost of disk space without recognizing that those extra bytes also must be read and written to the disk which take more time and in order to find the key within the document it must be searched rather than seeking to the columns offset.

I realize there are two levels of abstraction here when discussing k/v, one is document ID/document content the other is the k/v structure of the JSON documents themselves.

I guess you are looking at LMDB just in the sense of storing the JSON blob in the value with the key being a doc id of some sort?

This is what I see when looking at a Raven DB ESENT database, it has a documents table with a id column and a data column. However you would not be shredding the JSON down into a more discrete data structure as say MSSQL does with the XML datatype.

So Raven isn't just repeating keys in each document it also storing the full json with brackets etc too? Is it BSON? That would be a little better, but Mongo uses BSON and the large key issue comes up constantly since most documents do have a repeating schema that is duplicated over and over.

Ayende Rahien
08/09/2013 11:14 PM by
Ayende Rahien

Justin, If you are going to quote wikipedia, take a look here: https://en.wikipedia.org/wiki/Relationaldatabase, also Esent is defined as ISAM store (http://en.wikipedia.org/wiki/ExtensibleStorage_Engine). Yes, it fits some of the bill for relational db, but in the same sense, so does fixed size record files. It is a meaningless distinction.

I think you are missing something. The key you are talking about is the property names, and those are the actual value that goes there. Shredding the document into distinct values is a really bad idea. you can see why when you look at the recommendation for SQL Server XML type. It doesn't really work for real world performance. You end up having to manually do the lifting of values out. And it make even less sense when you have dynamic schema. Internally in RavenDB we store it as BSON, yes. The repeating keys is a really silly issue, see: http://ayende.com/blog/4669/you-saved-5-cents-and-your-code-is-not-readable-congrats

Sure, you can optimize that by doing interning at the storage layer, but that is an optimizastion step that isn't really interesting.

Justin
08/10/2013 12:35 AM by
Justin

ISAM isn't data model, it's an access method, just look at Berkeley DB which is also ISAM compared to ESENT, completely different. Berkely has no schema or constraints on the data while ESENT has a full schema, using tables aka relations with columns with fixed data types.

I am sure ESENT would not satisfy all of Codds 12 rules, but it's damn close, closer than Berkeley, way closer than a fixed size record file, geez.

How about semi-relational?

It so close you could take Ravens db structure from ESENT and port it to any relational database without much trouble.

Repeating keys is so silly MongoDB has a highly voted issue, marked major with a plan to fix. And many client libraries have code built that tries to shorten names for you:

https://jira.mongodb.org/browse/SERVER-863

I personally think reading and writing 2x or more data then necessary is not a silly issue, it impacts all aspects of performance and database management. Disk are cheap sure, then again if your database is smaller you might have been able to put it on a ssd or faster but smaller drive, or just transferred less bytes from that cheap disk to accomplish the same thing. Are you really going to tell me every byte doesn't make a difference after examining these c code bases that do thier best to save every byte for performance reasons?

Ayende Rahien
08/10/2013 02:52 AM by
Ayende Rahien

Justin, Sure, Esent give you a structure that is very similar to RDBMS, but that doesn't make it one. I suggest we drop this topic, I don't think we will agree here.

As for the other thing, I started answering you, but it got to big. I made a post out of this, you can find it here: http://ayende.com/blog/163426/field-tokenization-a-problem-analysis?key=d3fc0fb799a34d54b8631aa1bf6b8759

njy
08/11/2013 02:09 PM by
njy

Oren,now that you have a better understanding of LMDB and ESENT, do you have an idea of the reasons why the files produced by ESENT are not OS-portable? I mean, I know that different versions of Win cannot reuse the same physical files, and in fact for RavenDB you suggest to use the smuggler tool to do the job. If I'm correct, on top of being able to run on different OSs, LMDB does not have this specific problem, so why does ESENT, that btw only runs on Win?

Thanks

Ayende Rahien
08/11/2013 02:13 PM by
Ayende Rahien

Njy, There are two reasons. 1) Between different OS versions, they make changes to the file format. LMDB also have that, but they don't appear to have a previous version, so they don't deal with that (there is code there to be forward prepare, mind). 2) They do case insensitive match by default, and that depend on the exact OS version for what it does. Because the comparison function might change in different versions on the OS.

njy
08/11/2013 03:27 PM by
njy

Thanks Oren, it makes sense. Now i wonder why, for example, they didn't code directly a match function of their own, so not to be dependent on the OS version.

Anyway, if RavenDB would support, for example, LMDB it would automatically gains not only transparent support for different versions of win, but also wholly different OS (non win) altogether.

Cheers.

Ayende Rahien
08/11/2013 04:18 PM by
Ayende Rahien

Njy, Because that is a REALLY REALLY hard problem! Case insensitive matches require a lot of work. For example, are you using UTF8? UTF16? MultiByte? For that matter what language are you using?

Here is a nice example: http://blogs.msdn.com/b/michkap/archive/2004/12/31/344739.aspx

What about sorting for Hebrew characters? Sorting for mixed Hebrew & numbers? Sorting for mixed Hebrew & English text? Japanese text sorting is something that you should just give up on (http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-%E2%80%94-an-unsolved-problem/).

For fun, did you know that the sorting rules can change ?

Put simply, Esent said, "I am not going to be maintaining that", there is a perfectly nice thing build into Windows: LCMapString (http://msdn.microsoft.com/en-us/library/windows/desktop/dd318700(v=vs.85).aspx), I am going to use that.

And then you have bug reports to Microsoft about those sort of things, or flat out rule changes, and you need to change the output, and then you have to rebuild indexes because the sort order might have changed.

njy
08/11/2013 04:30 PM by
njy

Yes, sure, don't get me wrong: i was not trying to say it is a smple thing, I was just pointing out the dependency problem.

I have to point out that i really don't know the nitty gritty details of the thing and I may be saying stupid things, so yeah, maybe it's me.

Just like, for example, handling JSON is not something easy peasy, it has its owh set of problems and difficulties: but to avoid dependency problems you guys interned JSON.NET instead of depending on an "external" version of the code, so you did not had to write the code, or mantain the code, but nonetheless you also do not depend "actively" on that.

Ayende Rahien
08/11/2013 08:12 PM by
Ayende Rahien

Njy, I can follow every line of JSON.Net. I can't even understand the problem set involved in culture sensitive comparisons. There is a big difference between those things. I have some ideas on how to solve those, but they are just that, ideas.

Howard Chu
08/12/2013 02:35 AM by
Howard Chu

... actually, LMDB's data format is not OS-specific. It is, however, CPU arch-specific. In particular, it is dependent on word-size and byte-order. (The lock.mdb format is OS-specific, but it's also irrelevant - you don't need to copy it when you want to copy a DB to another machine.)

Case-insensitive string compare as the default compare function for a general-purpose DB is quite obviously a stupid design choice. There's more than just strings that get used for keys. (Heck, Asian languages don't even have upper/lower case distinctions.)

Of course, this is a Microsoft package we're talking about. It's riddled with stupid design choices. http://thread.gmane.org/gmane.comp.ldap.umich/2990/focus=2991

Ayende Rahien
08/12/2013 04:40 AM by
Ayende Rahien

Howard, LMDB is pretty new, I'm guessing that you didn't have to yet make some changes that would be incompatible at the file version. You already have the structure to detect that (MDB_VERSION), but consider that Esent has been around for a while, it makes sense that you would need to make changes like that that would impact that ability.

As for the string compare. It isn't the general purpose function, but if you are using strings, it is something that you pretty much need. Windows as a whole is a case insensitive operation system. And an LDAP query on user is case insensitive, for example. So, once you need that, you pretty much have to live with the issues I mentioned.

With LMDB, you have md_cmp to provide the comparison function, for pretty much the same reason, I am assuming. Now, what would happen if someone would change the comparison function in any way? Pretty much any tree that you have that used that function became invalid. That is the underlying issue here.

njy
08/12/2013 10:32 AM by
njy

Oren, I think you misunderstood me: I was not asking why you (Oren/HR) didn't internalized that part into RavenDB, I was asking myself why they (a.k.a. Microsoft) didn't do that.

I mean, since not doing that gives ESENT the portability problem, and since ESENT is built by MS, and the OS is built by MS, and every core part of the OS (string match and whatnot) is built by MS, I was thinking out loud about why they didn't internalize those parts, just like you guys did for JSON.NET, completly removing the portability problem. Just like LMDB seems to have done.

I hope I made myself clearer now. What do you think?

Howard Chu
08/12/2013 10:56 AM by
Howard Chu

The difference is, LMDB's default comparison functions have no OS-version dependencies. md_cmp is there if the user wants to define their own custom functions. If a user specifies a custom function, the responsibility is all theirs to ensure its correct behavior. As an informed LMDB user, you have the ability to fix your own mistakes. As an informed ESE user, you're still stuck with the stupid choices of the original design.

This in particular, goes back to my previous point about abstractions. At the DB engine layer, bytes are bytes. There is no difference between "string" and "binary," nor should there be. If you want to implement such distinctions, you do so at a higher layer. In OpenLDAP we provide support for caseInsensitive UTF8 comparisons, because LDAP uses UTF8. But the DB engine doesn't know anything about that, nor does it need to.

Our UTF8 compare functions' behavior might change if you change the version of Unicode spec they're built against. But they won't just suddenly and silently change because an OS version changed. You have the OpenLDAP source code, you can explicitly control which version of the spec you want to use, and freeze it at that level forever if you wish.

Ayende Rahien
08/12/2013 01:03 PM by
Ayende Rahien

Njy, Esent didn't do that because then you would have things like: "On Windows, I can call LCMapString and get result xyz but on AD I see result zyx" It would mean more work they have to do. And most of the stuff they did with Esent was pretty tied to the OS they were using.

Ayende Rahien
08/12/2013 01:04 PM by
Ayende Rahien

Howard, And as you said, Esent does a lot more than LMDB. It had to provide support for string comparisons.

njy
08/12/2013 01:39 PM by
njy

"And most of the stuff they did with Esent was pretty tied to the OS they were using"

Yep, that is my main issue here. Anyway I think I cannot go on with this without actually really looking at the code in more details, which I cannot do as of now. Anyway thanks for the time.

njy
08/12/2013 04:41 PM by
njy

Btw Oren, for your info: the blog archive pages based on dates are not working. I'm talking about urls like http://ayende.com/blog/archive/2010/10

Comments have been closed on this topic.