Optimization story: GetNextIdentityValueWithoutOverwritingOnExistingDocuments

time to read 16 min | 3026 words

A customer had a problem. They were mostly using the RavenDB HiLo algorithm for saving documents to the database, which is very fast & cheap. That client, however, chose to use the identity method. Which means that RavenDB will assign the value.

This is usually used if you need to have sequential values. The identity is actually being managed internally by RavenDB, and that works perfectly fine.

Except… What happens when you enter replication to the mix. The documents with the identity values are replicated to the secondary server, and there we don’t have the identity value, we just have the docs being written with their full id. (users/1, users/2, users/3, etc).

So far, so good. But what happens when you have a failover and you need to write to the secondary, and you use the identity? Well, RavenDB ain’t stupid, and it won’t overwrite the users/1 document. Instead, it will search for the next available opening from the smallest identity value generated and use that. The code looks like this:

   1: private long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key, 
   2:     IStorageActionsAccessor actions, 
   3:     TransactionInformation transactionInformation)
   4: {
   5:     long nextIdentityValue;
   6:     do
   7:     {
   8:         nextIdentityValue = actions.General.GetNextIdentityValue(key);
   9:     } while (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) != null);
  10:     return nextIdentityValue;
  11: }

This works, great. Except when you have large number of documents that have already been written. Instead of the brute force search, we now use the following approach:

   1: public long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key,
   2:     IStorageActionsAccessor actions,
   3:     TransactionInformation transactionInformation,
   4:     out int tries)
   5: {
   6:     long nextIdentityValue = actions.General.GetNextIdentityValue(key);
   7:  
   8:     if (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) == null)
   9:     {
  10:         tries = 1;
  11:         return nextIdentityValue;
  12:     }
  13:     tries = 1;
  14:     // there is already a document with this id, this means that we probably need to search
  15:     // for an opening in potentially large data set. 
  16:     var lastKnownBusy = nextIdentityValue;
  17:     var maybeFree = nextIdentityValue*2;
  18:     var lastKnownFree = long.MaxValue;
  19:     while (true)
  20:     {
  21:         tries++;
  22:         if(actions.Documents.DocumentMetadataByKey(key + maybeFree, transactionInformation) == null)
  23:         {
  24:             if (lastKnownBusy + 1 == maybeFree)
  25:             {
  26:                 actions.General.SetIdentityValue(key, maybeFree);
  27:                 return maybeFree;
  28:             }
  29:             lastKnownFree = maybeFree;
  30:             maybeFree = Math.Max(maybeFree - (maybeFree - lastKnownBusy) / 2, lastKnownBusy + 1);
  31:  
  32:         }
  33:         else
  34:         {
  35:             lastKnownBusy = maybeFree;
  36:             maybeFree = Math.Min(lastKnownFree, maybeFree*2);
  37:         }
  38:     }
  39: }

This can figure out the first free item in a range of billion documents in under 100 tries, which I am pretty sure if good enough.