Field Tokenization: A problem analysis
I was asked about the problem of repeating property names in RavenDB. In particular, let us consider the following document:
{
    "FirstName": "John",
    "LastName": "Doe",
    "BirthDay": "1980-01-01",
    "LastLoggedInAt": "2013-08-10"
}
There are 40 chars in this document dedicated solely to field names, and just 28 chars dedicated to actual data. Now, 40 vs 28 means that if we have 1 million of those documents, we are losing a lot of space for no good reason. Surely the database can do better on this, right? It can tokenize those things so that instead of storing 40 bytes for fields, it would store only 16 (assuming int32 for token ids). And for larger documents, you would see an even more massive saving.
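To make the idea concrete, here is a minimal sketch of what such an intern table could look like. This is illustrative C# only, not RavenDB code, and every name in it is made up: each distinct field name gets an int32 token, and the storage layer would keep the token instead of the name.

using System.Collections.Generic;

public class FieldInternTable
{
    private readonly Dictionary<string, int> _nameToToken = new Dictionary<string, int>();
    private readonly List<string> _tokenToName = new List<string>();

    // Returns the existing token for a field name, or assigns the next free one.
    public int GetOrAddToken(string fieldName)
    {
        int token;
        if (_nameToToken.TryGetValue(fieldName, out token))
            return token;

        token = _tokenToName.Count;
        _tokenToName.Add(fieldName);
        _nameToToken.Add(fieldName, token);
        return token;
    }

    // Resolves a token back to its field name when reading a document back.
    public string GetFieldName(int token)
    {
        return _tokenToName[token];
    }
}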
I think that I talked about this before, oh yes, here it is: http://ayende.com/blog/4669/you-saved-5-cents-and-your-code-is-not-readable-congrats, in which you manually (or via config at the client side) change the document to look like this:
{
    "FN": "John",
    "LN": "Doe",
    "BD": "1980-01-01",
    "LLD": "2013-08-10"
}
This is pretty horrible and I cover exactly why this is bad in the blog post above. Now, for 1 million documents, those 40 chars represent about 38MB of disk space. Not really a major issue, I think.
The actual question I was asked was (by Justin):
Repeating keys is so silly that MongoDB has a highly voted issue, marked major, with a plan to fix it. And many client libraries have code built in that tries to shorten names for you:
https://jira.mongodb.org/browse/SERVER-863
I personally think reading and writing 2x or more data than necessary is not a silly issue, it impacts all aspects of performance and database management. Disks are cheap, sure; then again, if your database is smaller you might have been able to put it on an SSD or a faster but smaller drive, or just transferred fewer bytes from that cheap disk to accomplish the same thing. Are you really going to tell me every byte doesn't make a difference after examining these C code bases that do their best to save every byte for performance reasons?
The ugly solution shown above is a manual approach, but surely the db can do better, right? Interning property names is relatively simple conceptually, but a lot more complex in practice.
For example, if you want to do interning, you probably want to do it end to end. Which means that the client libs need to understand the server tokens. That means that you need some way of informing the client when there is a new interned property added, otherwise they wouldn't be able to figure out the real property name. Then there is the issue of how you are going to handle sharding or replication. Is each server going to have its unique copy of the intern table? Is each client going to have to navigate that? Or are you going to try to have a single intern table, and somehow manage to agree on all the positions in that table among a distributed set of nodes?
Imagine what is going to happen during failover: you suddenly need to get the intern table for each of the replicas. And making sure that you aren't using the wrong intern table is going to be fun. Not to mention what happens when you have different clients talking to different servers, especially if you had a network split.
Even assuming that you don't care about any of that, and only want this for saving disk I/O on the server, you are going to run into issues with concurrency management around the intern list. If you have two transactions adding values that both require modifying the intern list, you have an interesting problem. Not least because if you save just the property name id to the data store, instead of the full key, that id has to be guaranteed to exist once the transaction commits; otherwise you have an unknown (or wrong) property name when you read the document back. So now you have concurrency issues, and the potential for conflicts, because two transactions may be modifying the values of the intern table at the same time. For fun, assume that they both have documents with the same property name that needs to be added, as well as other property names.
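To make the shared-state part of that concrete, here is a rough, purely hypothetical sketch (not how RavenDB or any other database actually does it). A concurrent dictionary can make the id assignment itself atomic, but the new id becomes visible the moment it is assigned, with no connection to whether the transaction that introduced the field name ever commits.

using System.Collections.Concurrent;
using System.Threading;

public class SharedInternTable
{
    private readonly ConcurrentDictionary<string, int> _tokens =
        new ConcurrentDictionary<string, int>();
    private int _lastToken = -1;

    public int GetOrAddToken(string fieldName)
    {
        // Two transactions adding the same new property name race here. The
        // dictionary resolves the race (one id wins, the loser's increment is
        // simply wasted), but the winning id is visible to everyone immediately,
        // whether or not the transaction that introduced it ever commits.
        return _tokens.GetOrAdd(fieldName, _ => Interlocked.Increment(ref _lastToken));
    }
}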
And that doesn't even take into account that it is actually pretty common to have dynamic property names. For example, dates & times. You have 1 million documents, and each of them has a property with a name like: "2013-08-03T19:57:32.7060000Z"
A common scenario where this occurs is tracking things by time. For instance, if you want to track the amount of hours you worked on a bug. The document might look like this:
{
    "2013-08-03T19:57:32.7060000Z": {
        "Project": "RavenDB",
        "RavenDB-1248": "Fixed",
        "RavenDB-1212": "Progress made"
    }
}
As you can see, we actually have multiple properties with dynamic names here. It would be a waste to try to intern them. In fact, it is likely going to cause us to fall over and die. And wait for that to happen when you have to pass this to the client... fun & joy all around.
But you know what, I can still probably go and figure out what the 10,000 most common field names are, and have a fixed table of those in place. This would give me the ability to handle things like Name, FirstName, Date, etc. Because it is a fixed list, a lot of the problems listed above don't matter. It would also mean that if you wanted to call your property FirstNames, you might need to call it Names instead, because the former wouldn't be in the list and therefore wouldn't be interned.
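A sketch of what that fixed list might look like (again, purely illustrative; the names and the list itself are invented). Names found in the list get a token; everything else would be stored as the raw string, which is exactly why FirstNames loses out:

using System;
using System.Collections.Generic;

public static class FixedFieldTokens
{
    // A fixed list shipped with the database; these entries are just examples.
    private static readonly string[] WellKnownNames =
    {
        "Id", "Name", "FirstName", "LastName", "Email", "Date", "CreatedAt"
        // ... and so on, up to the 10,000 most common names
    };

    private static readonly Dictionary<string, int> Lookup = BuildLookup();

    private static Dictionary<string, int> BuildLookup()
    {
        var lookup = new Dictionary<string, int>(StringComparer.Ordinal);
        for (var i = 0; i < WellKnownNames.Length; i++)
            lookup[WellKnownNames[i]] = i;
        return lookup;
    }

    // "FirstName" gets a token; "FirstNames" falls through and is stored as-is.
    public static bool TryGetToken(string fieldName, out int token)
    {
        return Lookup.TryGetValue(fieldName, out token);
    }
}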
And that leads to a lot of interesting potential issues with backward and forward compatibility. How do I add more terms to the list? Older clients wouldn't be able to understand the tokens, and that leads to additional complication just managing the versioning story. And I can assure you that you'll have a lot of people clamoring and wanting to add their own unique properties to that list. (What do you mean 'PensionFundTarget' isn't on the intern list? It's a really common term and we could really use the perf boost from that!)
And then we have the issue of debugging. Right now, you can watch what is going on over the wire using Fiddler very easily, and the values are easily understood. Do you really want to try to figure out data like this?
{
    "231": "John",
    "432": "Doe",
    "127": "1980-01-01",
    "5841": "2013-08-10"
}
In pretty much every aspect you can think of, this is a really crazy hard feature, and the benefits that it brings aren’t really that important.
Sure, you can reduce the amount of data that you read from the disk. But that data is already sitting in the same place. And there is absolutely no difference between reading a 30 byte value and a 200 byte value (reads/writes are always done at a minimum of 512 bytes). So reading a few extra bytes means pretty much nothing to us. We don't have to seek to it, it is already there. And anyway, we have caches to alleviate that problem for us.
And reading from the disk isn't really our major problem; that is relatively cheap compared to sending the data to the user over the network. But wait, we actually have found a solution for that: we use gzip compression on the wire, a well known and widely supported method that tools like Fiddler can easily see through. That usually gives you the best of all worlds. You get a small response size, you still use a readable format, and you don't have to deal with any of the issues mentioned above. Happy happy joy joy!
Hm… can we not apply the same method internally on the server side? Sure we can. We can compress & decompress data on the fly when we write / read to the disk, further reducing the amount of data that we write.
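As a minimal sketch of that transparent approach, using nothing more exotic than the standard GZipStream (there is nothing RavenDB specific below): you spend CPU on compressing and decompressing in exchange for fewer bytes hitting the wire or the disk.

using System.IO;
using System.IO.Compression;
using System.Text;

public static class DocumentCompression
{
    public static byte[] Compress(string json)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                var bytes = Encoding.UTF8.GetBytes(json);
                gzip.Write(bytes, 0, bytes.Length);
            }
            // ToArray still works after the gzip stream has closed the MemoryStream.
            return output.ToArray();
        }
    }

    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}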
In RavenDB, we do both. Data is compressed on the wire, and can be compressed to the disk as well.
As for the MongoDB issue, it may be highly voted, it may be marked as major, but it has been sitting in the issue tracker for the past 3+ years. I would say that they aren't in any rush to resolve it. Probably for much the same reasons as I outlined here. It is cheap to vote for an issue, and usually the people setting the issue severity are the people creating it. I wouldn't put too much stock in anything like that.
Put simply, trying to solve this is a really hard problem, with a lot of cascading implications and dangerous side effects. Conversely, there are more effective solutions that are cheaper, simpler, and orthogonal to everything else.
I think you can guess what I would pick.
Comments
Interning at the database level isn't the only option. It can be done at the schema level. This is how Protocol Buffers and Apache Thrift work. The schema assigns each field a number, the value is serialized against that field number, and the name only exists in the schema definition.
Doing it at the schema level also helps with forward and backward compatibility, because fields can be added or renamed without affecting a client with an older schema definition.
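For example, with the protobuf-net library for .NET, the contract carries the field numbers and only the numbers go on the wire; the class below is just an illustration:

using ProtoBuf;

[ProtoContract]
public class Person
{
    [ProtoMember(1)] public string FirstName { get; set; }
    [ProtoMember(2)] public string LastName { get; set; }
    [ProtoMember(3)] public string BirthDay { get; set; }
    [ProtoMember(4)] public string LastLoggedInAt { get; set; }
}

// A client built against an older contract simply skips field numbers it does not
// know about, which is what gives the forward/backward compatibility mentioned above.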
For things like sensor data and other high-rate sources, when storing 10 million records per hour, the overhead of field names and of numbers stored as text does add up.
Nathan, Schemas work, if you can define one upfront. However, defining a schema upfront means that you are going to introduce a whole new level of complexity. See the document here: http://ayende.com/blog/163425/json-packing-text-based-formats-and-other-stuff-that-come-to-mind-at-5-am?key=08561a0c8e6e4a8f9247ebd41a6f491a It means that you lose a lot of the benefits of schemaless documents, and have to really think about schemas, updates, migrations, etc. Oh, and doing dynamic stuff like an item per day, or dictionaries keyed by custom values, becomes hard to impossible. And again, compression really helps, and is far less complex.
"And there is absolutely no difference between reading a 30 bytes value and a 200 bytes value (reads/writes are always done at a minimum of 512 bytes). So reading a bit of extra bytes means pretty much nothing to us. We don’t have to seek to it, it is already there. And anyway, we have caches to alleviate that problem for us."
So you do know that databases generally store data in pages and that MULTIPLE rows are stored per page, right? What does this mean? It means a single disk read can read multiple contiguous rows, and I hope I don't have to explain all the optimizations that can come from this.
These pages are generally cached, using the database's own caching mechanisms or the OS's in the case of memory-mapped files. Do I have to explain why fewer bytes per record mean more records are cached in the same amount of RAM? Are you going to cache gzipped BSON in RAM and get the same outcome?
Cached pages can also be partially read; they do not have the same restriction as disk reads. This means less data needs to be read from RAM to the CPU to actually use the information, and that's without even getting into the orders of magnitude more work it takes to compress/decompress data in your solution.
"Interning at the database level isn't the only option. It can be done at the schema level. "
This is exactly what relational databases do. All that is stored per row is data and the metadata necessary to manage it (locking, transactions, offsets, etc.). The schema, being fixed, is stored elsewhere. PostgreSQL, for instance, has about 28 bytes of overhead per row; the rest is just your data. Plus it can optimize based on datatype, for instance a date and time taking only 8 bytes rather than 28.
Here's the funny thing: Raven has all this plus its own overhead of storing BSON documents in relations! ESENT, having TABLES with a similar structure to any relational database, has a row overhead similar to, say, PostgreSQL or MSSQL; in fact more, because Raven stores its own locking and tx data as columns in the ESENT table!
Looking at the Raven ESENT database, each document has 7 columns with at least 44 bytes of overhead, if not more (not including key size), not counting ESENT's own row overhead, which is going to be similar to PostgreSQL or MSSQL.
So let's look at what it takes to store the following in Raven vs PostgreSQL, assuming ESENT is as efficient as PostgreSQL per row and not counting a primary key size on either:
{ "FirstName":"John", "LastName":"Doe", "BirthDay": "1980-01-01", "LastLoggedInAt": "2013-08-10" }
ESENT: 28 bytes row overhead + 44 bytes Raven row overhead + BSON 99 bytes = 171 bytes (as a side note the JSON itself is only 91 bytes, BSON has overhead too!)
PostgreSQL: 28 bytes row overhead + 4 bytes First Name + 3 bytes Last Name + 4 bytes BirthDay + 4 bytes LastLoggedInAt = 43 bytes
So Raven must read/write/store/cache 4x as much data for every one of these documents. A single 512 byte disk read can read 2 Raven documents while PostgreSQL can read 11. PostgreSQL can cache 4x as many of these docs in RAM as Raven, and it also doesn't need to deserialize BSON, or even worse unzip and then deserialize, to, say, build an index on a column; it just follows offsets to the column data.
As far as dynamic schema, change management, etc. well that is definitely a consideration, but to dismiss the I/O overhead of a document store like Raven or Mongo is not wise. Here is the first sentence from the Mongo issue: "Most collections, even if they dont contain the same structure , they contain similar. So it would make a lot of sense and save a lot of space to tokenize the field names."
Better yet, how about avoiding tokenization altogether if the documents all contain the same structure, and using a relational db? How about using a relational db like PostgreSQL that lets you store and query JSON while also letting you use regular, lighter-weight tables when that makes sense, the best of both worlds so to speak?
Justin, In RavenDB, most of our documents are actually too big to fit on a single page; they go to overflow storage. So being able to read multiple rows in a single page read is nice, but not relevant for what we are doing.
And yes, we do use a page cache for compressed BSON under certain scenarios.
Compression is actually something that uses CPU, and CPU tends to be something that I have in abundance. I/O is usually my limiting factor. I would happily trade CPU time for I/O time.
As for schemas, sure, you don't need to worry about field names if you have a schema. But 30 - 40 years of experience with RDBMS taught us that while schemas can be good, they also put severe limitations on how they allow you to work. In particular, when you are talking about complex, dynamic or even just changing data. RavenDB is schemaless. It has advantages & disadvantages, but you can't really come and say: "you could do this one thing better with a schema" without explaining all the implications around that.
The locking & tx data that we store are actually there for DTC, not for our transactions.
Your math doesn't really hold up the moment you move from the trivial example. Let us take the document outlined in this post: http://ayende.com/blog/163425/json-packing-text-based-formats-and-other-stuff-that-come-to-mind-at-5-am?key=08561a0c8e6e4a8f9247ebd41a6f491a
To store this in PostgreSQL would require 6 tables and 30 rows (for this particular document). In RavenDB, it is a 4.5Kb doc. Using your own math, I would have overhead of about 840 bytes just for the row metadata.
That is leaving aside the actual cost of reading something like that (6 joins and who knows how many seeks).
And Justin, sure, it would be nice to have tokenization for "static" stuff. Except that you forget the other costs associated with it. See above, and then think about versioning & changes.
@Justin,
I don't believe PostgreSQL would only need 4 bytes to store "John". I don't know about PostgreSQL, but in MSSQL, for example, you either have to define the column as char(x), in which case it will take up exactly x bytes, or you have to define it as varchar(x), in which case there will be some overhead, because the length of that row's first name will have to be stored somewhere.
Unless you already know every first name will have just 4 characters, in which case you can define it as char(4) and you win. :P
And if you want to support international names, you would have to use nchar / nvarchar, in which case the overhead will be even greater. But RavenDB's might be too; I don't know how it stores Unicode.
Patrick, We store things in UTF8
Yeah, I already thought that. It's the most logical choice.
@Patrick
Yes, you're right, PostgreSQL needs an extra 4 bytes for every variable length field to store the length of the data.
So add 8 bytes to 43 and you get 51 bytes for those docs, still not storing field names in every row, I still win ;)
You can specify your charset in PostgreSQL, UTF8 is typical.
@Ayende
I am not sure your math holds up; that 4.5k JSON doc only has 2k of string data in it.
I am counting 6 tables but only 23 rows:
Build: 1, Downloads: 0, Changes: 7, Files: 13, Resolved Issues: 0, Contributors: 2
That's 644 bytes of row overhead, not 840.
Doing the math shows PostgreSQL will take about 2.5k to store the same data in full table form. Almost half of the JSON form.
Of course PostgreSQL supports arrays, so why make files a separate table? Just make it a varchar array. That just eliminated another 400 bytes and the seeks to get to it. You can still query, index, join, etc. on arrays.
PostgreSQL gives you choice. You want to store a full JSON doc? Fine, it will even compress it automatically using TOAST, and you can use the full V8 js engine in plv8 to manipulate/index it. Want to store in tables but keep certain fields as arrays or maybe an hstore? Fine. Have a lot of small docs with a fixed schema? Use a table with primitive types and save 4x the space, I/O and RAM with no compression overhead.
Bundles seem like the way to go if you had a specific type of document that would really benefit from interning.
I just do this…
#if !DEBUG
[JsonProperty(PropertyName = "FN", NullValueHandling = NullValueHandling.Ignore)]
#else
[JsonProperty(NullValueHandling = NullValueHandling.Ignore)]
#endif
public string FName { get; set; }
When I am developing against an in-memory database…in DEBUG…I see the whole thing….in RELEASE mode I get the short version.
I do see your point but I don't ever even have to think about this….it just works.
@Rob I think the point is about trying to read documents in production.
@Rob... OK, but it is not following the interning approach described above, and it requires no separate data dictionary, as it is right there with the property in the code.
So while the in-your-head translation has to occur when manually editing (say in Raven Studio), for me 99% of the time that I am looking at the raw document is pre-production, during development and testing, and then it shows the full property name.
Are FirstName and FN really so hard to map in your head? Unless you have hundreds of properties on a document, one or two character names can be relatively intuitive.
I would say that for smaller documents it probably isn't that worth it and is more a matter of style and personal preference rather than hard drive savings etc.
Rob, Assume a reasonably sized set of entities, and reasonably complex entities. You can get into trouble very quickly:
SA: { L1: "...", L2: null, S: "...", Z: "..." }, BA: { L1: "...", L2: null, S: "...", Z: "..." }, S: true,
vs:
ShippingAddress: { Line1: "...", Line2: null, State: "...", Zip: "..." }, BillingAddress: { Line1: "...", Line2: null, State: "...", Zip: "..." }, Shipped: true,
For example. Now take that into a model like this: http://www.databaseanswers.org/data_models/customers_and_payments_e_govt/index.htm
And try to imagine how it would look with single or two letter property names.
The decision is a lot more straightforward when you have a formal schema. LDAP/X.500 has very rigid schema rules, and every attributeType has multiple properties associated with it. Once you have more than just a name to keep track of, but also types, syntaxes, and other properties, it makes sense to define a structure to carry all of that and only store handles referencing those structures. The name itself is insufficient, and looking up a name to find the structure is inefficient.
As for schemaless, document-oriented DBs, I find the entire situation rather ironic. All of these developers running away from weakly typed languages (like C) to more strongly typed languages, and at the same time running away from strongly typed DBs to weakly typed ones.