Hello Voron
What is Voron? Well, it is a joke, but I’ll leave it for someone else to explain.
Voron is the codename for our next-generation storage engine. It is a managed port of LMDB, with some tweaks of our own added on top. The basic idea is that we will have our own storage solution, based on our own code, that we fully control.
Voron differs from LMDB in a few details, mostly:
- Only a single process (multiple processes can read, but only one process can write, and we don’t really care about multi-process reads).
- No duplicates (LMDB allows them, we have no need for them yet, so we didn’t do that).
- Support for increasing the file size (LMDB doesn’t support it, but we need to, so we added support for it).
- Pure mmap access to the data (we don’t do manual I/O writes).
Early testing shows that for read speed, we really couldn’t ask for anything more. For write speed, we are good, but not as good as we would like to be.
Personally, I think that the codebase is a lot easier to follow, but I wrote it, so… It isn’t ready yet for public consumption, but I would like to share it with a few people. So here is the deal.
I am looking for people to join the alpha program for it. Like with RavenDB at its early stage, I am looking for commitment to do something for the project. It can be anything from testing it out and submitting failing scenarios to actually working on the code to building small scale proofs on it, etc.
What I am not interested in is viewers. There will be time enough for that at a later point in time. What I am looking for right now is contributors. And I think that this is a really cool project to be a part of. If you want to join, please ping me via email.
Comments
The interesting thing is that Voron is the Russian word for Raven, but I guess you might already know that.
Wikipedia entry for Voron redirects to Raven: http://en.wikipedia.org/wiki/Voron
Voron is a male crow. Grach would be the word for Raven.
Gleb, Yes, that is why we selected the name.
And the licence would be...?
Howard, I think that we use different models for approaching things like concurrent tasks. I would tend to use a separate thread for this, instead.
Can you explain what you mean WRT duplicate keys required for fast indexing? Do you mean indexes that allow duplicate entries?
Jesus, We wrote a Lucene Directory a few times, but Lucene pretty much has hard assumptions about files, and it doesn't deal very well when you aren't using files.
Scooletz, 99.999% that it would match RavenDB's license.
"LMDB doesn’t support it, but we need to, so we added support for it"
Actually LMDB does support this. You simply re-open the database which isn't a big deal.
Oops bad paste in the comment above :P I meant "Support for increasing the file size".
what about temp.raven.storage?
Kelly, Close & re-open the env require that all the current transactions would stop, because it would invalidate all their data. It also means that the calling code needs to keep careful attention on this, because if you are writing to the db, you need to beware of out of space error, then abort the transaction, lock all current transactions, close the env, open the env with bigger size, then restart the transaction. That is quite a bit of stuff to have to do.
Alim, Call it a back-burner option :-) I'm not forgetting it.
About licencing, does it mean: closed for non-OSS, non-Raven usage?
Scooletz, Yes, or commercial.
Oren,
Re-opening the database is not a big deal. You shouldn't be resizing the DB very often to begin with, and when you infrequently do, it's a minimal hit in my benchmarking.
You need to be aware of running out of space regardless of which storage engine you use. You can't just write to the DB and ignore the fact that you may run out, no matter what the implementation is.
Kelly, Did you miss the whole process I described? Assume that you are using it as a server process. You have the writer thread hitting an out-of-space error. What now?
Well, you need to:
- stop all new requests
- wait for all the current requests to complete
- stop & start the environment
- resume other requests, but make sure that you are the only one that can get the write lock
- restart your transaction from scratch
Don't know about you, but I think that this would be a big deal.
And yes, I need to plan for what happens if I run out of space on the _disk_, but if I have space on the disk, I don't see why I need to think about this. On the other hand, asking for way more space than I need just because I don't want to run out is bad too. Remember, we support cloud providers that offer plans based on storage used. It wouldn't be nice if we started out with "oh, you want to store a 3Kb document, I'll just go ahead and reserve 10GB for you anyway."
Oren,
Yes I'm aware of that situation. Do you think implementing this inside the storage engine makes that situation any different?
As I said, you shouldn't be re-sizing the database frequently, whether it's implemented outside or inside the storage engine.
Kelly, Sure it does. When I resize things inside the storage engine, I can do that _while still serving concurrent read requests_. From the outside world, there is no observable change. Internally, I know how to request enough from the OS so that I won't have to do that very often. But not using too much space is important, because if you do use too much space you can get backlash from the users.
And "frequently" is an interesting word. What is frequent enough? If I am currently writing 10 tx a second, I might need to request more space once an hour. If I am writing 10,000 tx / sec, I'll need to request it once a minute. In either case, by the way, I'll be requesting a different amount of space, because I can detect what is actually going on.
That explains the LMDB review series. How does this relate to the managed port of LevelDB?
Judah, I've not forgotten about that. Call it spreading our chances.
Voron is already there - Raven. Why not Nosorog :P
Oren,
When he talks about duplicate keys required for fast indexing, Howard means that if you implement an indexing/search engine on top of LMDB, then the duplicate keys feature makes it suitable for use as a very efficient inverted index.
I must admit that dual licensing a storage engine which from the very beginning is meant to be OSS, written by contributors, gets me a bit fuzzy. Why don't you publish it under a permissive, free license? Ayende, could you elaborate on this topic? Dual licensing, code ownership?
Scooletz, My code, my rules, pretty much. You are free to do the same and publish it under any license you want.
As for why go with dual licensing, it's pretty simple. I want to be able to actually make money out of my work. Going with a permissive license means that users can take the code and write commercial applications without compensation. That tends to make it hard to buy food.
But you do realize that it is you who is taking code with a permissive license (LMDB) and using it to write a commercial application without compensation?
Rafal, Yes, I do. And the authors made an explicit decision to do so. The code I am doing is going to be public, available for OSS projects, etc. But I am absolutely respecting the author's wishes here as expressed by the choice of license.
Ayende, yep, you're right it's your code and your rules and you can publish under any license you want. You're the one who establishes the license. You can even encourage people to work on your codebase as contributors and that's also perfectly fine :)
Didn't you do something similar with the SignalR code base too?
In my experience, dual licensing is more hassle than it's worth. Large enterprises require support contracts for all the software they use, so they pay us despite the permissive license. Small organizations or individuals/hobbyists won't purchase/pay regardless.
Ultimately there's no value in so-called "intellectual property" - the only real value is in expertise. The creative works of a mind only have value by virtue of being disseminated - if you come up with a brilliant idea and keep it entirely to yourself, your impact on the world is zero. Astute business leaders will pay for real value, because they understand its importance.
Howard +1. There are so many more projects I could contribute to, which have permissive licenses and let me use my code in any of my solutions.
+1 Howard. A storage engine is a perfect candidate for a permissive license. It allows 3rd parties to build solutions on top of it, and once they do that they have a vested interest in contributing and ensuring the engine moves forward and survives.
Another architectural consideration for Voron would be to support prefix compression for keys, like LevelDB. It's another drawback of LMDB (other than the fixed size and no compression). This would yield a significant size reduction when the key is a composite key.
You might also want to look into the LSM key-value store used by SQLite4; you might find some interesting things there.
Actually, now that I see Howard is looking at this, I'd be interested to know why he chose a fixed-size database, considering the headaches it causes with having to handle a full db and retry transactions after reopening the environment. Why not just grow the file and create another mmap of the appended bytes?
Duke, Actually, no. We tried using SignalR as is, but we ran into problems. We tried fixing it a few times, then gave up and wrote a version that did what we wanted and was suited for our needs. It is a very strict subset of what SignalR is doing.
Howard, I think we are targeting different audiences. A large portion of our customers are small businesses / developers, and they do pay.
Scooletz, And you are welcome to do so. Last I checked, I pointed no guns.
Philippe, Prefix compression is really cool, yeah. But it has a major impact on the complexity of the code. Because it can generate page splits pretty much everywhere, it requires a lot of very careful coding. I don't know if it is worth it at this point.
Philippe, Creating a new mmap means that you have discontinuity in your memory space; it means that if you have another 1MB available at the end, and you need 2 MB, you can't just re-grow that. I solved that in Voron by re-mapping the whole thing again with the increased size, while keeping the old map around until all transactions using it are done.
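For reference, a minimal sketch of that remap-on-grow idea using .NET memory-mapped files; transaction tracking and error handling are omitted, and the names are illustrative only, not Voron's actual API:

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

class GrowableMap
{
    private readonly string _path;
    private readonly List<MemoryMappedFile> _oldMaps = new List<MemoryMappedFile>();
    private MemoryMappedFile _current;
    private MemoryMappedViewAccessor _accessor;

    public GrowableMap(string path, long initialSize)
    {
        _path = path;
        Map(initialSize);
    }

    public void Grow(long newSize)
    {
        // Keep the previous mapping alive; transactions that started against it
        // can keep using their views until they complete.
        _oldMaps.Add(_current);
        Map(newSize);
    }

    public void ReleaseOldMaps()
    {
        // Called once every transaction that used the old mappings is done.
        foreach (var map in _oldMaps)
            map.Dispose();
        _oldMaps.Clear();
    }

    private void Map(long size)
    {
        // A capacity larger than the current file length extends the file and
        // maps the whole thing as one contiguous range.
        _current = MemoryMappedFile.CreateFromFile(_path, FileMode.OpenOrCreate, null, size);
        _accessor = _current.CreateViewAccessor(0, size);
    }
}
```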
I don't see any problems with discontinuity in memory space; why not use a linked list of mmaps instead of one big mmap? In Windows an mmap is just a pointer anyway. I'm assuming you're using fixed-size pages, so as long as you grow by a multiple of the page size there shouldn't be too much added complexity. The downside to your approach is that it is vulnerable to address space fragmentation: you cannot grow a database if no contiguous addressable memory space for the entire file is available, or the mmap will fail. I guess you could just tell users to use a 64bit process, then you have nothing to worry about :D
Philippe, It is an issue of complexity. If I need 2 MB, and I am only left with 1 MB at the end of the file, I can allocate a new map, sure. But then I would have to remember that I have 1 MB free there, and use that for small allocations, while using the new map for big allocations, and... it gets crazy.
Since lmdb uses a free list, I thought you had done the same.
Something that might reduce complexity is to use regular IO for writes, so you don't have to know ahead of time how much space you need and therefore reserve. You just keep appending, and when you're done you mmap all the new blocks/nodes and append a new node to the linked list of mmaps.
Another advantage of regular IO for writes is that you can do some neat tricks with async (overlapped) unbuffered (FILE_FLAG_NO_BUFFERING) IO, like avoiding blocking while flushing and still being tolerant to a process crash (if the machine crashes, then you lose the last transaction). Again, I think this is what LevelDB does in default mode.
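As a rough sketch of that kind of append-only write path in .NET (not what LevelDB or Voron actually do): FILE_FLAG_NO_BUFFERING is not exposed by FileOptions directly (its raw value 0x20000000 can be passed, but then every write must be sector-aligned), so this example sticks to WriteThrough plus an explicit flush:

```csharp
using System.IO;

class AppendOnlyLog
{
    private readonly FileStream _log;

    public AppendOnlyLog(string path)
    {
        // WriteThrough maps to FILE_FLAG_WRITE_THROUGH; Asynchronous maps to
        // overlapped I/O (you would normally pair it with WriteAsync/BeginWrite).
        _log = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read,
                              4096, FileOptions.WriteThrough | FileOptions.Asynchronous);
    }

    public void Append(byte[] data)
    {
        _log.Write(data, 0, data.Length);
        _log.Flush(true); // push the managed buffer out and ask the OS to flush to the device
    }
}
```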
Sorry for long post!
Philippe: re key prefix compression, I already addressed that here: http://ayende.com/blog/162914/the-design-of-managed-memory-map-database#comment10
To reiterate: composite keys are a poor solution used in DB implementations that don't support multiple tables. LMDB duplicate key support and named sub-tables are superior in all aspects.
re: data compression - that is not the purpose of a B+tree library. You can write a wrapper around LMDB that compresses on put and decompresses on get if you wish. You can also just throw LMDB on top of a filesystem that does dynamic compression and forget about it. It's easy to do the job elsewhere, without great complexity, so it belongs elsewhere.
re: fixed size, growing the map - I think that's already been covered pretty well. Address-space fragmentation will kill you. LMDB's purpose is also to be as fast and efficient as possible; preallocation is the fastest and most reliable approach. There are too many uncontrollable failure cases for the alternative. Even on modern 64bit machines, due to address space layout randomization there's every possibility that when demand hits you won't have any contiguous range of space large enough to satisfy your new size requirement.
We could certainly have mapped the DB in chunks, using a linked list. That would make page lookups slower.
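To make that cost difference concrete, here is an illustrative sketch (not LMDB or Voron code): with a single contiguous map a page lookup is one multiply, while with a list of chunks you first have to find the chunk that owns the page:

```csharp
using System;
using System.Collections.Generic;

static class PageLookup
{
    const int PageSize = 4096;

    // Single contiguous map: the page's byte offset is one multiply from the base.
    public static long OffsetInSingleMap(long pageNumber)
    {
        return pageNumber * PageSize;
    }

    // Chunked maps: first find the chunk that owns the page, then offset into it.
    // Each chunk is described by the first page it covers and how many pages it holds.
    public static (int chunkIndex, long offsetInChunk) OffsetInChunks(
        List<(long firstPage, long pageCount)> chunks, long pageNumber)
    {
        for (int i = 0; i < chunks.Count; i++)
        {
            var (firstPage, pageCount) = chunks[i];
            if (pageNumber >= firstPage && pageNumber < firstPage + pageCount)
                return (i, (pageNumber - firstPage) * PageSize);
        }
        throw new ArgumentOutOfRangeException(nameof(pageNumber));
    }
}
```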
LMDB delivers consistent, predictable latency in its current design. All of the choices you propose would destroy this property. LSM designs cannot deliver consistent, predictable latency. Nothing that relies on garbage collection or compaction can.
Again, see http://symas.com/mdb/hyperdex/ - this is the difference between good and bad design choices.
One other thought, related to reimplementing LMDB on your own - as a learning exercise, I can see the value in it. As a production move it's quite shortsighted, and using a managed language is also shortsighted.
When we were looking for new storage engines for use in OpenLDAP, we found quite a lot of interesting ideas but most of them were written in Java, which was an immediate non-starter. You cannot embed Java code inside any other app. Likewise with C#. Like it or not, C is the universal language - write a library in C and every other language user can use it. (Witness how quickly wrappers for LMDB got written, in so many different languages.) Write a library in Java or Python or C# or whatever, and only other programs in the same language will ever use it. If your purpose is to write something that is widely applicable and will grow a large community, this is the wrong approach.
OpenLDAP has been the fastest directory software in the world since 2005. It is the 2nd most deployed in the world. (M$ AD is the most, at 80% of the market, but that number has been steadily shrinking.) Even our competitors use our libraries. The accumulated expertise and experience is unmatched anywhere. We've already been down all the ratholes and dead ends that you've yet to discover.
Howard, The intent is to use this for our own software. Applicability outside that field is of little relevance to us. And I need to support everything I ship. Having to deal with some issue with storage is something that would happen, and I want to be able to do so more easily. In this case, having a fully managed stack trumps it.
Ayende, you probably already know this, but didn't mention it, so...
If you are going to grow the database, you shouldn't be doing it every minute no matter how fast inserts are coming because you should grow it to a size proportional to the current size (say doubling or adding 25%). That makes the number of grow operations logarithmic in the number of inserts, so no matter how fast you are doing inserts the number of grow operations will quickly taper off.
David, That is correct, until you get into interesting issues with "I want to increase by 256MB but I have only 250MB available", etc. There is also the issue of not wanting to take too much space too fast. Growth is something where you need to balance between available space, efficiency and users' expectations.
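A minimal sketch of such a balancing act, with made-up numbers purely for illustration (not Voron's actual policy):

```csharp
using System;

static class GrowthPolicy
{
    // Made-up limits, for illustration only.
    const long MinIncrease = 1L * 1024 * 1024;    // never grow by less than 1 MB
    const long MaxIncrease = 1024L * 1024 * 1024; // never grow by more than 1 GB at a time

    public static long ComputeIncrease(long currentSize, long freeDiskSpace)
    {
        long proposed = Math.Max(MinIncrease, currentSize / 4); // grow by ~25% of current size
        proposed = Math.Min(proposed, MaxIncrease);

        // Don't promise more than the disk can actually give us.
        return Math.Min(proposed, freeDiskSpace);
    }
}
```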
Random prefix compression notes:
- Prefix compression is useful for non-compound keys if they are reasonably long. In practice this normally means people have indexed a string (like a URL).
- Prefix compression works very well with suffix compression, basically guaranteeing a high fanout for internal pages no matter how the keys are distributed.
- One way to approach prefix compression is to have a bit per value indicating whether it is prefix compressed or not. That allows a mix of compressed and non-compressed values on a single page.
Laurion, Any comments on Howard's take for prefix compression? http://ayende.com/blog/162914/the-design-of-managed-memory-map-database#comment10
I think that allowing compressed & uncompressed values on the same page is important, but I agree that it tends to create very complex code.
I would disagree with that analysis. Whenever you order entries by a field you will always get entries with a common prefix, even if there are no duplicates. As a simple example, consider ordering by a 64-bit customer ID. All the entries on a page will probably have the first 6 or 7 bytes of the customer ID in common. As a very useful example, imagine someone has built a secondary index over URLs or file paths. Those entries will have a huge common prefix ("C:\Users\Laurion\Documents...") with all the uniqueness coming at the end. Prefix compression is incredibly useful there.
As far as duplicates: it depends on how your secondary indexes are structured.
In a lot of database engines (including ESENT) the secondary index is just a mapping of (secondary-key => primary-key). When you do a seek in the secondary index you get back the primary key of the record and then have to traverse the primary index to get the rest of the record. Some people refer to this as a 'logical bookmark'.
In other systems the secondary index contains the physical location of the primary record (e.g. the page number). This can be referred to as a 'physical bookmark' and is how database engines that support heaps (e.g. Oracle) work. This is faster, but makes moving the primary index record trickier.
So, for systems that use logical bookmarks the secondary index entries always contain the primary key! Supporting duplicates actually just means changing the key comparison code to include the primary key in the comparison.
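In other words, the stored key is effectively the pair (secondary key, primary key), so "duplicate" secondary values are still unique entries. A small illustrative comparer, with a hypothetical key type rather than any particular engine's API:

```csharp
using System.Collections.Generic;

// Hypothetical secondary-index key: the indexed value plus the logical bookmark.
struct SecondaryIndexKey
{
    public string SecondaryKey; // e.g. a customer name
    public long PrimaryKey;     // the bookmark back into the primary index
}

class SecondaryIndexKeyComparer : IComparer<SecondaryIndexKey>
{
    public int Compare(SecondaryIndexKey x, SecondaryIndexKey y)
    {
        int bySecondary = string.CompareOrdinal(x.SecondaryKey, y.SecondaryKey);
        if (bySecondary != 0)
            return bySecondary;
        // Same secondary value: fall back to the primary key, which is unique.
        return x.PrimaryKey.CompareTo(y.PrimaryKey);
    }
}
```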
In ESENT the prefix compression code was a bit tricky, but definitely worth it. The single most complex part was actually determining the optimal prefix out of all the values on the page.
Laurion, Since the data is already sorted, isn't the way to find the common prefix just a simple "find first non shared byte" from the previous one?
With Voron, I thought about making it use a physical bookmark, but that would have been really hard to support given the page splits / merges that happen for other keys. With LMDB, FWIW, you have to have logical bookmarks.
That is how you find the common prefix between two records, but when you pull out a common prefix you need to know the prefix that will save the most space on the page.
(Remember that there is no guarantee that all the entries on the page share a common prefix. This is especially true as you move higher in the b-tree.)
For example, you might have these keys on a page:
AAAAAAAA AAAAAAAB BBBBBBBB BBBBBBBC BBBBBCCC BBBBBCCD BBBBBCCE BBBBBCCF BBBBBBCG CCCCCCCC CCCCCCCD
In this case "BBBBBCC" is the best prefix to use, because that will shorten the keys on the page to:
AAAAAAAA AAAAAAAB BBB BBC C D E F G CCCCCCCC CCCCCCCD
A naive algorithm to find an optimal prefix is to take each key on the page, calculate the savings if that key were used as the prefix, and then take the key that has the biggest savings (there will probably actually be several such keys), but that has horrible time complexity.
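Spelled out, that naive algorithm looks roughly like the sketch below (per-entry overhead ignored); it is O(n²) in the number of keys on the page, which is why real implementations get cleverer:

```csharp
using System;
using System.Collections.Generic;

static class PrefixChooser
{
    // Try every key on the page as the candidate prefix and keep the one that
    // would shave the most bytes off the other keys.
    public static string ChooseBestPrefix(IReadOnlyList<string> sortedKeys)
    {
        string best = "";
        long bestSavings = 0;

        foreach (var candidate in sortedKeys)
        {
            long savings = 0;
            foreach (var key in sortedKeys)
                savings += CommonPrefixLength(candidate, key);

            if (savings > bestSavings)
            {
                bestSavings = savings;
                best = candidate;
            }
        }
        return best;
    }

    static int CommonPrefixLength(string a, string b)
    {
        int max = Math.Min(a.Length, b.Length);
        int i = 0;
        while (i < max && a[i] == b[i]) i++;
        return i;
    }
}
```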
The level beyond prefix compression is to do column compression like SQL Server, which can pull out all shared values into a page-level dictionary. That gives even better compression but is one or two orders of magnitude more complex than simple prefix compression.
Also, although it is quite old (1977), you might want to read the "Prefix B-Trees" paper by Rudolf Bayer and Karl Unterauer. Bayer's prefix compression is completely different from what we are discussing here -- it involves throwing away the common suffix of keys when selecting a separator key during a b-tree split at the leaf level. I know this technique as "suffix compression".
The reason that is so useful is that for internal pages we can expect keys to either share a common prefix (so prefix compression will work) or a common suffix (in which case we can throw the suffix away). When internal pages get good compression there is less data to cache and the height of the b-tree can be reduced, which is extremely useful.
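For illustration, the separator-shortening step ("suffix compression") during a leaf split can be sketched as keeping just one character past the shared prefix of the two keys that straddle the split point:

```csharp
using System;

static class SeparatorKey
{
    // Returns the shortest key S such that lastKeyOnLeftPage < S <= firstKeyOnRightPage,
    // assuming the keys are in sorted (ordinal) order.
    public static string ShortestSeparator(string lastKeyOnLeftPage, string firstKeyOnRightPage)
    {
        int i = 0;
        int max = Math.Min(lastKeyOnLeftPage.Length, firstKeyOnRightPage.Length);
        while (i < max && lastKeyOnLeftPage[i] == firstKeyOnRightPage[i])
            i++;

        // Keep one character past the shared prefix; that is enough to order the two pages.
        return firstKeyOnRightPage.Substring(0, Math.Min(i + 1, firstKeyOnRightPage.Length));
    }
}

// Example: ShortestSeparator("Johnson", "Jones") == "Jon"
```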
Laurion, What are internal pages? Leaf pages? Branch pages?
Also, you seem to be assuming that there will only be a single prefix per page, right? What if we did something like this:
input: AAAAAAAA AAAAAAAB BBBBBBBB BBBBBBBC BBBBBCCC BBBBBCCD BBBBBCCE BBBBBCCF BBBBBBCG CCCCCCCC CCCCCCCD
AAAAAAAA B BBBBBBBB C CCC D E F G CCCCCCCC D
LevelDB operates in pretty much the same fashion: the first 2 bytes are (shared count, non-shared count). That does limit you to 256, but it can be handled.
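A toy sketch of that (shared count, non-shared count) delta encoding, using one byte per count as described; keys whose non-shared tail exceeds 255 bytes would need the extra handling mentioned:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class KeyDeltaEncoding
{
    // Encodes each key as: [shared-with-previous count][non-shared count][non-shared bytes].
    public static byte[] Encode(IEnumerable<string> sortedKeys)
    {
        var output = new MemoryStream();
        byte[] previous = new byte[0];

        foreach (var key in sortedKeys)
        {
            byte[] current = Encoding.UTF8.GetBytes(key);

            int shared = 0;
            int max = Math.Min(Math.Min(previous.Length, current.Length), 255);
            while (shared < max && previous[shared] == current[shared])
                shared++;

            int nonShared = current.Length - shared; // assumes keys shorter than ~255 bytes
            output.WriteByte((byte)shared);
            output.WriteByte((byte)nonShared);
            output.Write(current, shared, nonShared);

            previous = current;
        }
        return output.ToArray();
    }
}
```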
Internal pages are branch pages.
I am assuming a simple implementation with one prefix per page. You can definitely have more than one prefix per page, but it makes the code a bit more complex. Taking that to the limit gives something like SQL Server column compression, which tries to compress all common values.
If you look at different database engines there are a ton of interesting ways to optimize the page format (you only need to store the key length if you have a variable-length key, you only need to store the shared count if it is non-zero, etc.).
Will the Voron storage option free RavenDB from being Windows-bound?
Gleb, That is one of the reasons for building it, yes.