Ayende @ Rahien

It's a girl

Optimization story: GetNextIdentityValueWithoutOverwritingOnExistingDocuments

A customer had a problem. They were mostly using the RavenDB HiLo algorithm for saving documents to the database, which is very fast & cheap. That client, however, chose to use the identity method. Which means that RavenDB will assign the value.

This is usually used if you need to have sequential values. The identity is actually being managed internally by RavenDB, and that works perfectly fine.

Except… What happens when you enter replication to the mix. The documents with the identity values are replicated to the secondary server, and there we don’t have the identity value, we just have the docs being written with their full id. (users/1, users/2, users/3, etc).

So far, so good. But what happens when you have a failover and you need to write to the secondary, and you use the identity? Well, RavenDB ain’t stupid, and it won’t overwrite the users/1 document. Instead, it will search for the next available opening from the smallest identity value generated and use that. The code looks like this:

   1: private long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key, 
   2:     IStorageActionsAccessor actions, 
   3:     TransactionInformation transactionInformation)
   4: {
   5:     long nextIdentityValue;
   6:     do
   7:     {
   8:         nextIdentityValue = actions.General.GetNextIdentityValue(key);
   9:     } while (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) != null);
  10:     return nextIdentityValue;
  11: }

This works, great. Except when you have large number of documents that have already been written. Instead of the brute force search, we now use the following approach:

   1: public long GetNextIdentityValueWithoutOverwritingOnExistingDocuments(string key,
   2:     IStorageActionsAccessor actions,
   3:     TransactionInformation transactionInformation,
   4:     out int tries)
   5: {
   6:     long nextIdentityValue = actions.General.GetNextIdentityValue(key);
   7:  
   8:     if (actions.Documents.DocumentMetadataByKey(key + nextIdentityValue, transactionInformation) == null)
   9:     {
  10:         tries = 1;
  11:         return nextIdentityValue;
  12:     }
  13:     tries = 1;
  14:     // there is already a document with this id, this means that we probably need to search
  15:     // for an opening in potentially large data set. 
  16:     var lastKnownBusy = nextIdentityValue;
  17:     var maybeFree = nextIdentityValue*2;
  18:     var lastKnownFree = long.MaxValue;
  19:     while (true)
  20:     {
  21:         tries++;
  22:         if(actions.Documents.DocumentMetadataByKey(key + maybeFree, transactionInformation) == null)
  23:         {
  24:             if (lastKnownBusy + 1 == maybeFree)
  25:             {
  26:                 actions.General.SetIdentityValue(key, maybeFree);
  27:                 return maybeFree;
  28:             }
  29:             lastKnownFree = maybeFree;
  30:             maybeFree = Math.Max(maybeFree - (maybeFree - lastKnownBusy) / 2, lastKnownBusy + 1);
  31:  
  32:         }
  33:         else
  34:         {
  35:             lastKnownBusy = maybeFree;
  36:             maybeFree = Math.Min(lastKnownFree, maybeFree*2);
  37:         }
  38:     }
  39: }

This can figure out the first free item in a range of billion documents in under 100 tries, which I am pretty sure if good enough.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Jonas H
10/30/2012 10:33 AM by
Jonas H

Stop the presses, it's binary search!

Rafal
10/30/2012 12:38 PM by
Rafal

But isn't it possible to retrieve max value from the index?

dotnetchris
10/30/2012 02:16 PM by
dotnetchris

Ayende, from some recent discussions on ravendb's forums about generating sequential with no gaps invoice numbers due to government regulations, does this provide an easy solution for that?

Edward
10/30/2012 04:34 PM by
Edward

This reminds me of the days with MS Access (client database),

Event OnDuplicateSaveException (key)

while true {

keyAdded=0; key++; loop ++ try { SaveWithNewKey(key) break; } catch { if (loop > maxKeyAddition) throw system exception }

}

}

Edward
10/30/2012 04:36 PM by
Edward

When in a disconnected replicated enviroment how does Raven handle unique id's ?

Like: 2 servers, one in the cloud and 1 at the office.

dotnetchris
10/30/2012 04:55 PM by
dotnetchris

Rafal, whose to say there's a single index whatsoever? You can use ravendb without a single index ever existing.

Edward 3
10/30/2012 05:00 PM by
Edward 3

Why maybeFree = nextIdentityValue*2 ?

Could this not cause huge gaps when ids are like 5 and more digits?

nit picking corner:
Move line 13 to 7 and remove line 10

Frank Quednau
10/30/2012 07:47 PM by
Frank Quednau

Some kind of nested interval algorithm...I doubt there will be a "hole" .

Is access to this algorithm serialized?

Rafal
10/30/2012 11:18 PM by
Rafal

@dotnetchris I was thinking about the 'primary key' index in the Esent database. On a second thought, this index is probably sorted in lexicographic order so getting max numeric value could be a problem.

Ayende Rahien
10/31/2012 01:14 PM by
Ayende Rahien

Chris, That is certainly the way to go about this, sure.

Ayende Rahien
10/31/2012 01:15 PM by
Ayende Rahien

Edward, 2 machines, that have a partition among them? Failover?

There are many scenarios.

Ayende Rahien
10/31/2012 01:17 PM by
Ayende Rahien

Edward, The scenario we had for that is that we had millions of docs already there, we want to get high ASAP. We have ids with seven digits, and this code does NOT generate holes.

Ayende Rahien
10/31/2012 01:17 PM by
Ayende Rahien

Frank, This is not serialized, but the calls are transitionally safe.

Edward
10/31/2012 05:32 PM by
Edward

@Ayende, about replication:

3 servers, master-master scenario. They all have the same data (no partioning although it could be possible) and changes are happening on every server.

1 server at the office with an internet connection which can disconnect, The software at the office is connected to this database.

1 server in the field with an internet connection which can disconnect, The software at this location is connected to this database.

1 server in the cloud for consumers and also office employees.

These servers are replicating each other. When a connection is reconnected the servers are syncing each other.

It is like MS SQL merge replication: http://msdn.microsoft.com/en-us/library/ms152746(v=sql.105).aspx

Ayende Rahien
11/01/2012 08:01 AM by
Ayende Rahien

Edward, Okay, in that scenario, what you want to do is to make sure that you are using Raven/ServerPrefix, which will make sure that each of them has its own prefix for the docs. That would mean that you won't have conflicts when creating new ids.

Edward
11/01/2012 03:40 PM by
Edward

Sounds solid. How is that serverprefix calculated/determined? is it unique? stays it unique?

Ayende Rahien
11/01/2012 03:52 PM by
Ayende Rahien

Edward, You set it up, and you need to ensure that is is unique per all servers

Edward
11/02/2012 01:29 PM by
Edward

Okay clear, thanks.

Comments have been closed on this topic.