Field Tokenization: A problem analysis

time to read 15 min | 2857 words

I was asked about the problem of repeating property names in RavenDB. In particular, let us consider the following document:

   1: { 
   2:     "FirstName":"John",
   3:     "LastName":"Doe",
   4:     "BirthDay": "1980-01-01",
   5:     "LastLoggedInAt": "2013-08-10"
   6: }

There are 40 chars in this document that are dedicated solely for field names, and just 28 chars dedicate to actual data. Now, 40 vs 28 means that if we have 1 million of those documents, we are losing a lot of space for no good reason. Surely the database can do better on this, right? It can tokenize those things so instead of storing 40 bytes for fields, it would store only 16 (assuming int32 for token ids). And for larger documents, you would see an even more massive saving.

I think that I talked about this before, oh yes, here it is:, in which you manually (or via config at the client side) change the document to look like this:

   1: { 
   2:  "FN":"John",
   3:  "LN":"Doe",
   4:  "BD": "1980-01-01",
   5:  "LLD": "2013-08-10"
   6: }

This is pretty horrible and I cover exactly why this is bad in the blob post above. Now, for 1 million documents, those 40 chars represent about 38MB of disk space. Not really a major issue, I think.

The actual question I was asked was (by Justin):

Repeating keys is so silly MongoDB has a highly voted issue, marked major with a plan to fix. And many client libraries have code built that tries to shorten names for you:

I personally think reading and writing 2x or more data then necessary is not a silly issue, it impacts all aspects of performance and database management. Disk are cheap sure, then again if your database is smaller you might have been able to put it on a ssd or faster but smaller drive, or just transferred less bytes from that cheap disk to accomplish the same thing. Are you really going to tell me every byte doesn't make a difference after examining these c code bases that do their best to save every byte for performance reasons?

The ugly solution shown above is a manual approach, but surely the db can do better, right? Interning property names is relatively simple conceptually, but a lot more complex in practice.

For example, if you want to do interning, you probably want to do it end to end. Which means that the client libs need to understand the server tokens. That means that you need some way of informing the client when there is a new interned property added, otherwise they wouldn't be able to figure out the real property name. Then there is the issue of how you are going to handle sharding or replication. Is each server going to have its unique copy of the intern table? Is each client going to have to navigate that? Or are you going to try to have a single intern table, and somehow manage to agree on all the positions in that table among a distributed set of nodes?

Imagine what going to happen during failover, you suddenly need to get the intern table for each of the replicas. And making sure that you aren't using the wrong intern table is going to be fun. Not to mention what happens when you have different clients talking to different servers, especially if you had a network split.

Even assuming that you don't care for that, and only want this for saving disk I/O on the server, you are going to run into issue with concurrency management with regards to the itern list. If you have two transaction adding values that both require modifying the intern list, you are in an interesting problem. Not least of which because if you save just the property name id to the data store, instead of the full key, you have to be able to say that this is there if the transaction is committed, because otherwise you have an unknown (or wrong) property name when you read it back. So now you have concurrency issues, and the potential for conflicts because two transactions may be modifying the values of the intern table at the same time. For fun, assume that they both have documents with the same property name that needs to be added, as well as other property names.

And that doesn't even take into account that it is actually pretty common to have dynamic property names. For example, dates & times. You have 1 million documents, each of them have a property with the names like: "2013-08-03T19:57:32.7060000Z" 

A common scenario where this occurs is tracking things by time. For instance, if you want to track the amount of hours you worked on a bug. The document might look like this:

   1: {
   2:   "2013-08-03T19:57:32.7060000Z":{
   3:      "Project":"RavenDB",
   4:      "RavenDB-1248":"Fixed",
   5:      "RavenDB-1212":"Progress made"
   6:   }
   7: }

As you can see, we actually have multiple properties with dynamic names here. It would be a waste to try to intern them. In fact, it is likely going to cause us to fall over and die. And wait for that to happen when you have to pass this to the client... fun & joy all around.

But you know what, I can still probably go and figure out what the most 10,000 common field names are, and have a fixed table of those in place. This would give me the ability to handle things like Name, FirstName, Date, etc. Because it is a fixed list, it would means that a lot of the problems listed above don’t matter. It would mean that if you wanted to call your property FirstNames, you might need to call it Names, because the first one would be in the list and won’t be interned.

And that leads to a lot of interesting potential issues with backward and forward compatibility. How do I add more terms to the list? Older clients wouldn’t be able to understand the tokens, and that leads to additional complication just managing the versioning story. And I can assure you that you’ll have a lot of people clamoring and wanting to add their own unique properties to that list. (What do you mean ‘PensionFundTarget’ isn’t on the intern list, is a a really common term and we could really use the perf boost of that!).

And then we have the issue of debugging. Right now, you can watch what is going on over the wire using Fiddler very easily, and the values are easily understood. Do you really want to try to figure out data like this?

   1: { 
   2:  "231":"John",
   3:  "432":"Doe",
   4:  "127": "1980-01-01",
   5:  "5841": "2013-08-10"
   6: }

In pretty much every aspect you can think of, this is a really crazy hard feature, and the benefits that it brings aren’t really that important.

Sure, you can reduce the amount of data that you read from the disk. But that data is already sitting in the same place. And there is absolutely no difference between reading a 30 bytes value and a 200 bytes value (reads/writes are always done at a minimum of 512 bytes). So reading a bit of extra bytes means pretty much nothing to us. We don’t have to seek to it, it is already there. And anyway, we have caches to alleviate that problem for us.

And reading from the disk isn’t really our major problem, that is relatively cheap compare to sending the data to the user over the network. But wait, we actually have found a solution for that, we use gzip compression on the wire, a supported and well known method that you can easily see through with tools like Fiddler. That usually gives you the best of all worlds. You get small response size, you still use readable format, you don’t have to deal with all of the issues mentioned above. Happy happy joy joy!

Hm… can we not apply the same method internally on the server side? Sure we can. We can compress & decompress data on the fly when we write / read to the disk, further reducing the amount of data that we write.

In RavenDB, we do both. Data is compressed on the wire, and can be compressed to the disk as well.

As for the MongoDB issue, it may be highly voted, it may be marked as major, but it has been sitting in the issue tracker for the past 3+ years. I would say that they aren’t in any rush to resolve this. Probably for much the same reasons as I outlined here. It is cheap to vote for an issue, and usually the people setting the issue severity are the people creating it. I wouldn’t put too much stock on anything like that.

Put simply, trying to solve this is a really hard problem, with a lot of cascading implication and dangerous side effects. Conversely there are more effective solutions that are cheaper, simpler and are orthogonal to everything else.

I think you can guess what I would pick.