Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,972 | Comments: 44,518

filter by tags archive

Optimization story: GetNextIdentityValueWithoutOverwritingOnExistingDocuments


A customer had a problem. They were mostly using the RavenDB HiLo algorithm for saving documents to the database, which is very fast & cheap. That client, however, chose to use the identity method. Which means that RavenDB will assign the value.

This is usually used if you need to have sequential values. The identity is actually being managed internally by RavenDB, and that works perfectly fine.

Except… What happens when you enter replication to the mix. The documents with the identity values are replicated to the secondary server, and there we don’t have the identity value, we just have the docs being written with their full id. (users/1, users/2, users/3, etc).

So far, so good. But what happens when you have a failover and you need to write to the secondary, and you use the identity? Well, RavenDB ain’t stupid, and it won’t overwrite the users/1 document. Instead, it will search for the next available opening from the smallest identity value generated and use that. The code looks like this:

   1: private long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key, 
   2:     IStorageActionsAccessor actions, 
   3:     TransactionInformation transactionInformation)
   4: {
   5:     long nextIdentityValue;
   6:     do
   7:     {
   8:         nextIdentityValue = actions.General.GetNextIdentityValue(key);
   9:     } while (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) != null);
  10:     return nextIdentityValue;
  11: }

This works, great. Except when you have large number of documents that have already been written. Instead of the brute force search, we now use the following approach:

   1: public long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key,
   2:     IStorageActionsAccessor actions,
   3:     TransactionInformation transactionInformation,
   4:     out int tries)
   5: {
   6:     long nextIdentityValue = actions.General.GetNextIdentityValue(key);
   7:  
   8:     if (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) == null)
   9:     {
  10:         tries = 1;
  11:         return nextIdentityValue;
  12:     }
  13:     tries = 1;
  14:     // there is already a document with this id, this means that we probably need to search
  15:     // for an opening in potentially large data set. 
  16:     var lastKnownBusy = nextIdentityValue;
  17:     var maybeFree = nextIdentityValue*2;
  18:     var lastKnownFree = long.MaxValue;
  19:     while (true)
  20:     {
  21:         tries++;
  22:         if(actions.Documents.DocumentMetadataByKey(key + maybeFree, transactionInformation) == null)
  23:         {
  24:             if (lastKnownBusy + 1 == maybeFree)
  25:             {
  26:                 actions.General.SetIdentityValue(key, maybeFree);
  27:                 return maybeFree;
  28:             }
  29:             lastKnownFree = maybeFree;
  30:             maybeFree = Math.Max(maybeFree - (maybeFree - lastKnownBusy) / 2, lastKnownBusy + 1);
  31:  
  32:         }
  33:         else
  34:         {
  35:             lastKnownBusy = maybeFree;
  36:             maybeFree = Math.Min(lastKnownFree, maybeFree*2);
  37:         }
  38:     }
  39: }

This can figure out the first free item in a range of billion documents in under 100 tries, which I am pretty sure if good enough.


Comments

Jonas H

Stop the presses, it's binary search!

Rafal

But isn't it possible to retrieve max value from the index?

dotnetchris

Ayende, from some recent discussions on ravendb's forums about generating sequential with no gaps invoice numbers due to government regulations, does this provide an easy solution for that?

Edward

This reminds me of the days with MS Access (client database),

Event OnDuplicateSaveException (key)

while true {

keyAdded=0; key++; loop ++ try { SaveWithNewKey(key) break; } catch { if (loop > maxKeyAddition) throw system exception }

}

}

Edward

When in a disconnected replicated enviroment how does Raven handle unique id's ?

Like: 2 servers, one in the cloud and 1 at the office.

dotnetchris

Rafal, whose to say there's a single index whatsoever? You can use ravendb without a single index ever existing.

Edward 3

Why maybeFree = nextIdentityValue*2 ?

Could this not cause huge gaps when ids are like 5 and more digits?

nit picking corner:
Move line 13 to 7 and remove line 10

Frank Quednau

Some kind of nested interval algorithm...I doubt there will be a "hole" .

Is access to this algorithm serialized?

Rafal

@dotnetchris I was thinking about the 'primary key' index in the Esent database. On a second thought, this index is probably sorted in lexicographic order so getting max numeric value could be a problem.

Ayende Rahien

Chris, That is certainly the way to go about this, sure.

Ayende Rahien

Edward, 2 machines, that have a partition among them? Failover?

There are many scenarios.

Ayende Rahien

Edward, The scenario we had for that is that we had millions of docs already there, we want to get high ASAP. We have ids with seven digits, and this code does NOT generate holes.

Ayende Rahien

Frank, This is not serialized, but the calls are transitionally safe.

Edward

@Ayende, about replication:

3 servers, master-master scenario. They all have the same data (no partioning although it could be possible) and changes are happening on every server.

1 server at the office with an internet connection which can disconnect, The software at the office is connected to this database.

1 server in the field with an internet connection which can disconnect, The software at this location is connected to this database.

1 server in the cloud for consumers and also office employees.

These servers are replicating each other. When a connection is reconnected the servers are syncing each other.

It is like MS SQL merge replication: http://msdn.microsoft.com/en-us/library/ms152746(v=sql.105).aspx

Ayende Rahien

Edward, Okay, in that scenario, what you want to do is to make sure that you are using Raven/ServerPrefix, which will make sure that each of them has its own prefix for the docs. That would mean that you won't have conflicts when creating new ids.

Edward

Sounds solid. How is that serverprefix calculated/determined? is it unique? stays it unique?

Ayende Rahien

Edward, You set it up, and you need to ensure that is is unique per all servers

Edward

Okay clear, thanks.

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

  1. Reducing parsing costs in RavenDB - 14 hours from now

There are posts all the way to Aug 04, 2015

RECENT SERIES

  1. Production postmortem (5):
    29 Jul 2015 - The evil licensing code
  2. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats