Implementing CreateSequentialUuid()

time to read 2 min | 281 words

We run into an annoying problem in Raven regarding the generation of sequential guids. Those are used internally to represent the etag of a document.

For a while, we used the Win32 method CreateSequentialUuid() to generate that. But we run into a severe issue with that, it create sequential guids only as long as the machine is up. After a reboot, the guids are no longer sequential. That is bad, but it also means that two systems calling this API can get drastically different results (duh! that is the point, pretty much, isn’t it?). Which wouldn’t bother me, except that we use etags to calculate the freshness of an index, so we have to have an always incrementing number.

How would you implement this method?

public static Guid CreateSequentialUuid()

A few things to note:

We really actually care about uniqueness here, but only inside a single process, not globally.

The results must always be incrementing.

The always incrementing must be consistent across machine restarts and between different machines.

Yes, I am fully aware of the NHibernate’s implementation of guid.comb that creates sequential guids. It isn't applicable here, since it doesn't create truly sequential guids, only guids that sort near one another.

Tweet Share Share 60 comments

Tags:

Programming

Comments

13 Sep 2010
10:44 AM

Andrey Shchekin

Personally, I would probably use some kind of timestamp instead of guid here, especially if you want to use it for index freshness.

But I do not really understand how the consistency of incrementing between different machines corresponds with uniqueness within a single process. How many processes are actually there?

13 Sep 2010
10:45 AM

tobi

There seems to be a contradiction between "only inside a single process" and "between different machines". Or is it allowed to have globally non-unique numbers that are always-increasing across the cluster? That seems to be impossible as "always incrementing must be consistent across machine restarts and between different machines" seems to imply global uniqueness.

13 Sep 2010
10:47 AM

Eugene Kirpichov

What do you meen under consistency between different machines? Will Lamport timestamps work?

13 Sep 2010
10:55 AM

Ayende Rahien

Andrey & Tobi,

By that I mean that given machine A & machine B running in parallel:

Guids generated on machine A are allowed to be identical to guids on machine B.

However, a guid generated in time T on machine B must be greater than guid generated on machine A at any time prior to T.

13 Sep 2010
11:03 AM

FallenGameR

May be you'll be interested in GUID algorithm implementation details (not MS specific it seams) - blogs.msdn.com/.../8659071.aspx

I think that you should check time sync algorithms - http://en.wikipedia.org/wiki/Clock_synchronization. Than you can rely on machine time across the machines.

13 Sep 2010
11:04 AM

tobi

Well, what is "time"? Wall clock time only has a meaning if the clocks are synchronized which is never exact.

When this challenge is only about assigning a unique time (=guid) to each event/update then you could partition the guid space into chunks of N and hand out the chunks to the cluster machines. That will reduce coordination traffic by a factor of N. The disadvantage would be that guid time will only roughly correspond to wall clock time.

13 Sep 2010
11:19 AM

Bill Michell

You seem to be asking the impossible, unless you have some kind of "master" process handing out tokens. Database sequence numbers spring to mind. But I assume you don't want the cost of network access (or it is unsuitable for some other reason).

For systems with synchronized clocks, a simple timestamp seems to be suitable at first glance, but falls apart if a process wants to create two GUIDs in the space of a single tick. It can decorate the timestamp with an internal counter, but how does a machine running remotely know when this occurs, and ensure it interleaves correctly? Or does it not matter?

13 Sep 2010
11:20 AM

Steve Py

That's a tough requirement to meet, and in any case even if you do come up with something it's bound to be fragile.

Might try generating your own UUID similar to the standard. Based on your comments about machine A & B...

private static Guid CreateSequentialUID()

{

var address = NetworkInterface.GetAllNetworkInterfaces()[0].GetPhysicalAddress().GetAddressBytes(); // TODO: Interrogate, choose, and store the MAC address.

var time = System.BitConverter.GetBytes(DateTime.Now.Ticks).Reverse().ToList();


var byteList = new List

<byte(time);

byteList.Add(0x0);

byteList.Add(0x0);

byteList.AddRange(address);

Thread.Sleep((int)time[4]/10);

return new Guid(byteList.ToArray());

}

Provided the times were in-sync the precidence is on Time. If this is running on a server then I'd have the clients send their MAC address to the server to ensure one truth to the time variable. The Thread.Sleep is to help ensure two calls aren't made within the date range tick. Though I'm sure there'd be a better option (high-sensitivity timer) plus some additional considerations for Threading to add as well. I just tossed 2 buffer bytes in the middle, I'm sure there's probably something better/unique that might be useful to use them for.

13 Sep 2010
11:20 AM

Ayende Rahien

Bill,

You can assume clock synchronization, but no master process, no.

13 Sep 2010
11:21 AM

Ayende Rahien

Bill,

Also note that it is explicitly fine to have two machines produce the same guid.

13 Sep 2010
11:21 AM

Augi

I would probably try to use time-based UUID: coderjournal.com/.../creating-a-time-uuid-guid-...

13 Sep 2010
11:23 AM

Ayende Rahien

Thread.Sleep means that it is not possible to use this.

There is a better solution.

13 Sep 2010
11:47 AM

Steve Py

Ah, n/m. I thought you were looking for ideas, not another code challenge. I'd still think relying on sequential but unique identifiers is bound to cause you more problems in days to come.

13 Sep 2010
12:14 PM

Ted

I would use the current time plus an index to disambiguate UUID's generated at the same time. The time is used for the higher portion of the UUID so that UUID's for times in the future always compare greater than UUID's generated in the past. Assuming you can't restart fast enough to be within the same second (or whatever the minimum time resolution you use) then you don't need to store these values on disk. Here's some pseudo code to illustrate:

int index = 0; // global

public Guid GetNextUUID() {

DateTime currentTime = DateTime.NowUTc;

int indexToUse = Interlock.Increment(ref index);

return MakeGuidFromTimeAndIndex(currentTime, indexToUse);

}

13 Sep 2010
12:29 PM

Bill Michell

Yeah, that is more or less what I meant by decorating the timestamp with an internal counter. And you seem to be saying that you don't care which way UUIDs created at the same tick on different machines will sort.

13 Sep 2010
12:35 PM

Cory Foy

"Also note that it is explicitly fine to have two machines produce the same guid. "

GUID - Globally Unique IDentifier.

I don't understand this requirement. If you want incrementing numbers, use incrementing numbers. Why wouldn't that work?

13 Sep 2010
12:55 PM

Greg

Perhaps the requirement to have a consistently incrementing number/guid/stamp across all machines is a red herring. Maybe there is some simplification that can be made to the freshness algorithm itself? Don't know enough about the freshness algorithm you are using to say so though.

13 Sep 2010
13:06 PM

Alois Kraus

When it should be unique you can use a combination of current time and the QueryPerformanceCounter call which has an accuracy of ns which ensures uniqueness. DateTime is a long and QueryPerformanceCounter does also return a long so you can stuff this 16 bytes into a "normal" Guid class.

    static Guid CreateSequentialGuid()

    {

        var bytes = new byte[16];

        WriteLong(DateTime.Now.Ticks, bytes, 0);

        WriteLong(Stopwatch.GetTimestamp(), bytes, 8);

        return new Guid(bytes);

    }


    static void WriteLong(long value, byte[] arr, int startIdx)

    {

        for (int i = startIdx + 7, j = 0; i >= startIdx; i--, j++)

        {

            arr[i] = (byte)(value >> j * 8);

        }

    }

Then you can write a custom sorter which does split both times the system and highres time where you can use the highres time if the clock time has the same value. It is well known that the standard clock time has only a resolution of 16ms so you will need a second high resolution time which is strictly growing.

Yours,

Alois Kraus

13 Sep 2010
13:43 PM

Dmitry

If Mono compatibility is not an issue:

private const string NativeLibrary = "rpcrt4.dll";

private const int NativeLibrarySuccessCode = 0;

[DllImport(NativeLibrary, SetLastError = true)]

private static extern int UuidCreateSequential(out Guid guid);

    public static Guid NewSequentialGuid()

    {

        Guid sequentialGuid;


        if (UuidCreateSequential(out sequentialGuid) != NativeLibrarySuccessCode)

throw new Exception("....");

       return sequentialGuid;

    }

13 Sep 2010
14:12 PM

Frank Quednau

Didn't you write about Paxos once? tbh, I only know so much as you wrote that day but it sounds relevant as it ensures ordering of events (e.g. generated guids) over several machines and ensure the correct sequence..?

13 Sep 2010
14:17 PM

configurator

Have a long at MongoDB's ObjectId. There are 12-byte (almost-)globally unique IDs. They do exactly what you need, and then some - if machine A and B are on the same network, they will definitely be unique.

http://www.mongodb.org/display/DOCS/Object+IDs

13 Sep 2010
15:41 PM

Rik Hemsley

I made this a week or so ago:

www.rikkus.info/sequential-uuid-based-entity-ke...

13 Sep 2010
15:43 PM

Rik Hemsley

Er, what do you mean by 'After a reboot, the guids are no longer sequential.'?

That there are clashes with previously created GUIDs?

13 Sep 2010
15:49 PM

Ayende Rahien

Rik,

Let us say that you start out with guids: 100,101,102,103, etc.

After reboot, the guid may be: 50,51,52,53, etc

They would still be unique, but they won't be sequential across reboots.

13 Sep 2010
15:51 PM

Rik Hemsley

Ayende,

I expected this as I didn't imagine there would be some mechanism to save the current number.

Does this jump in initial number cause the index speed issue to return? I was assuming that if most GUIDs were sequential then all would be well, but I haven't actually measured that yet.

13 Sep 2010
15:53 PM

Ayende Rahien

Rik,

Depending on what you use the guid for.

If you worry about index fragmentation in a B-Tree, then no, that isn't an issue.

13 Sep 2010
15:55 PM

Rik Hemsley

Yes, that's what I am using it for, so I'm happy.

Mind if I ask why you need it to be sequential (always increasing) without jumps?

13 Sep 2010
16:12 PM

Ants Aasma

"However, a guid generated in time T on machine B must be greater than guid generated on machine A at any time prior to T."

This requirement is ill-defined not only because of practical concerns, but due to fundamental laws of physics. The order of two events depends on the reference frame of the observer.

You can either live with the fact that the order may be incorrect within some time-window (on the order of milliseconds for synchronized clocks), or you should use some algorithm that ensures an uniform agreement on a total order of the generated guids. For the first case, just make the guid a timestamp + a counter for guids within one tick. For the second, simplest would be a master handing out sequential tokens, but there are algorithms for distributed fault tolerant agreement of total order. (e.g. http://infoscience.epfl.ch/record/83648)

13 Sep 2010
18:18 PM

josh

Oren, is there another way to calculate freshness?It just occurs to me we're focusing a lot on one side of the problem while ignoring the other. Can this be solved on the other end?

Also, restating the guid problem..

You're looking for a globally unique ID which is also time dependent and time relevant. Does it necessarily have to be numerically sequential as long as the other criteria are satisfied?

I don't really have answers, but I'm thinking about seeding, ticks, and MAC's (as in hardware addresses).

13 Sep 2010
18:19 PM

Agarwal / Simon Labrecque

IF the times are synchronized between machines, and you allow having the same GUID on different machine, then the problem comes down to having always-incrementing GUIDS on a single machine.

In order to do that accross reboots, I guess good source of (given correct configuration) always-incrementing numbers is the clock.

So the GUID needs to use the clock as it's source, and in order to have always-incrementing GUIDS between multiple threads, you need some sort of lock somewhere, so that you are sure that no 2 process can Clock.GetBytes() at exactly the same time. The lock can be extremely short, though.

From then it should be a simple matter of understanding the GUID format and correctly transforming the bytes from the Clock to a GUID.

13 Sep 2010
18:56 PM

josh

Agarwal / Simon,

Duplicates are a bad idea generally; and completely opposite the point of a GUID, which should be unique everywhere.

Also, Time sync isn't always reliable. Given the precision sensitivity (milliseconds/nanoseconds), you just can't absolutely ensure that level of sync.

In my experience anyway.

13 Sep 2010
19:26 PM

Alois Kraus

Well the point is that you need the time. My previous guid approach can very simple be enhanced to e.g. shave off 8 redundant bytes and fill the rest with the true guids from the OS or Guid.CreateNew. That way you have shortened time stamps with a range of e.g. 30 years if you use an int for both and stil have 8 bytes left which can be taken from the "true" guid approach the OS uses.

Yours,

Alois Kraus

13 Sep 2010
23:01 PM

Agarwal / Simon Labrecque

josh,

I fully agree that duplicates are usually bad, and that time sync isn't reliable; but I'm working within the hypothesis Ayende has given us.

That said, I too would like to know why this problem needs solving. In my day job, I tend to fight anyone coming to see me with the solution; I'd rather like to see the problem, then insure that the chosen solution is the best.

So, why do you need this "special" ever-incrementing GUID, Ayende? ;)

14 Sep 2010
03:42 AM

Ivan Krivyakov

Sorry if someone already proposed it - I would use machine specific guid (for uniqueness) plus a timestamp (for increment). That is, if you don't care about the number of bits it consumes.

14 Sep 2010
04:29 AM

Martin

I guess it cant really get to be more precise than ticks. So to make it unique you have to create your own rules / assumptions, which you can base your logic on.

For an example, that ids created on one machine is always created after or before another machine (if the exact same date time). That should make it possible to know how to handle the id in the logic (depending on what you need to use it for).

The first thing that comes to my mind is creating a structure like this

Days/Ticks Offset - Global id - Internal Counter

If you want it to be 16 bytes as a guid.

Date and time could be split into days and ticks, and allocated 6 bytes,

If you let the days take up 2 bytes it will give you an increasing number for next 179 years (approx. if you use 1/1-2010 as an offset date), and then let the ticks take up the last 4 bytes.

If a limit of 65536 for the global id is okay it will take up 2 bytes. If the global id is used per document, database, process, shard is up to you. You just have to assign it before it starts creating ids,

Then there is 8 bytes left for the internal counter, which is used to assure that the number is sequential if two ids is generated at the exact same time (using the same "global id").

This somewhat mimics the guid.comb, but is precise down to ticks and contains a consistent rule you can use in your logic when handling the ids.

Basically of course it is just a very long number. Ids generated after the predefined offset date will be something like:

days ..........ticks .......global....internal id

00001 0123456789 00001 000...........01

00001 0123456789 00001 000...........02

00001 0123456789 00002 000...........01

00001 0123456789 00002 000...........02

00001 0123456789 00002 000...........03

They are sequential with the precision of ticks and the rule that one is always created before or after another (depending on the global id) if created at the exact same time, and truly sequential within that "global id".

The only other alternative that i can think of is a central service handing out ids (which i assume you dont want).

At least if it should be sequential you dont want anything to be random.

... on a different note.

A few things I have never understood about the Guid.Comb implementation in Nhiberntate is why it doesnt use a more recent offset date. Also there is no reason to divide the ticks with 3.333... just because sql server accuracy when getting time is only 1/300th of a millisecond. Also the actual type of the guid created should be of type "SqlGuid" and not "Guid" for sorting correctly (if used in code), as far as i remember.

Martin

... im really tired so probably i am not thinking clearly ... :)

14 Sep 2010
05:55 AM

Steve

I wouldn't get too clever. I'd create something similar to Oracle's concept of a sequence generator either in the database or outside as a service.

Start off with a seed Guid, write a routine to increment it by 1, and then persist it so it maintains state between reboots. It'd have to have some lock, or be a singleton or something to insure uniqueness, but even so I suspect you could generate these pretty fast.

14 Sep 2010
05:55 AM

Venu

Ayende

Assuming that you you already have the implementation of generating sequential guids, and since you mention that this problem manifests itself only when the machine reboots, can't you have a guid tracker componenet in each server.

When there is a reboot, it would ask every machine in the cluster (using gossip protocol???) what their current max guid value is (or what it should be at a future time) and use that to compute what its new base/seed ought to be.

-Venu

14 Sep 2010
08:31 AM

Rafal

And will there be a solution to this riddle? Because I suppose it will be based on some assumptions not mentioned here that weaken the requirements a bit.

14 Sep 2010
09:55 AM

Set

just because sql server accuracy when getting time is only 1/300th of a millisecond.

Not with SQL Server 2008, there are more datetime formats now.

14 Sep 2010
10:59 AM

Ayende Rahien

Ants,

Note that I am explicitly stating that clock sync is a given. Therefor, you can relay on the clocks of the different systems to be accurate.

More formally: Time T is defined as the Ticks value on each machine, all references to prior / after with regards to the different machines refer to the current Ticks value, which was synced to be identical at a previous point in time.

14 Sep 2010
11:00 AM

Ayende Rahien

Josh,

It doesn't have to be numerically sequential, no. It just have to always sort properly

14 Sep 2010
11:01 AM

Ayende Rahien

Agarwal,

The reason for the requirements is that I am using this guid as the etag for documents, and I need to compare that to previous known etag to know if I see this before or not.

14 Sep 2010
11:03 AM

Ayende Rahien

Steve,

Have fun making it work correctly without locks in a multi threaded env.

14 Sep 2010
11:07 AM

Ayende Rahien

Venu,

Except that this creates a big problem in the sense that you need to persist the values, get the value on startup, etc.

There are simpler solutions

14 Sep 2010
12:10 PM

Ants Aasma

Clocks are never 100% accurate, so sooner or later you can observe a situation where you see that event A, with ticks value T_A has happened on machine A, and a tiny bit later machine B thinks that event B with ticks value T_B not happened, even though T_B < T_A. I don't know what you're going to use this for, so this might not be an issue for your usecase.

14 Sep 2010
12:16 PM

Ayende Rahien

Ants,

I know, but for my scenarios, it doesn't matter

14 Sep 2010
12:40 PM

Imran

Since you stated clocks can be assumed to be more or less in sync and you mentioned previously that duplicate identifiers are not a problem then why can't you just something like ticks since some reference date

e.g.

ticks since 1900 or something of that sort. Maybe I am over simplifying this.

14 Sep 2010
12:52 PM

Imran

Also if you need per process uniqueness you could stick a discriminator value (possibly a guid) to the end of the tick value to handle date time collisions on the same machine.

14 Sep 2010
14:06 PM

Karhgath

@Ayende

Aren't etag actually not supposed to be unique/sortable across URL/entity/resource? I don't know why you're using etag in a sorting manner across entities since it shouldn't be used as such at first glance. What exactly are you doing with your etag that requires this?

As far as I know, I've only used etag as a comparison parameter to check if a resource changed, not to sort em. Maybe you're using the wrong tool? As you can see by the response, this is non-trivial and probably has a better and simpler solution than those kinds of ids.

14 Sep 2010
14:22 PM

Ayende Rahien

Karhgath,

The etag will change on each document change, but for replication purposes (among others), it is important to be able to capture all changes.

14 Sep 2010
16:06 PM

Karhgath

Can you have multiple masters in your replication scheme? Or just a single master/many slaves relationship?

Do you version each object in Raven and thus have an etags history? (you talked about changes) Like document A with etags 1,2,3,4,5(current).

Are you using etags that are unique across documents (globally), and must remains unique because you use them for something else? SInce you're talking about Guid, I assume so. I also assume that you want your etag to be in a format compatible with Guid? All these are restrictions that makes this case hard indeed.

Let me think about it.

I can see the problem if do you track changes, and even moreso if your etags needs to be globally unique and follow a guid format.

14 Sep 2010
21:56 PM

Ayende Rahien

Kargath,

Multiple masters.

And yes, there are versions.

But documents etags don't need to be unique globally, just per machine.

15 Sep 2010
00:32 AM

Andrew

I was thinking along the same lines as Imram; why not just use DateTime.Ticks as your 'guid'.

15 Sep 2010
07:36 AM

Ayende Rahien

Andrew,

Two guids created on the same tick wouldn't be sequential

15 Sep 2010
12:09 PM

Andrew

Can you maybe keep another counter in memory (sequence) so that you generate IDs like <ticks_ <sequence where sequence is just += 1 each time a guid is created; so that sequential IDs on the same tick are unique.

If the computer reboots; it would have passed at least 1 tick by the time it is back online and so the sequential part can be reset to 0.

16 Sep 2010
02:39 AM

Martin

Andrew: My option that i wrote earlier is like that and includes also a global identifier.

The problem with the internal counter is that two guids created at the same time (tick) on two different machines wont necessary be truly sequential. The counter at one machine can be at 1000 and 2000 at another.

I cant think of a better option though.

16 Sep 2010
11:59 AM

Andrew

Martin: Yep sorry.

One idea for having multiple machines generating the same GUIDs is to perhaps:

1) Generate a random GUID on application startup (one time only) that is added to every GUID made by that server

2) OR; get a one time GUID assigned by a central server (?) so that it is guaranteed to be unique

Then each server will have its own namespace of GUIDs.

20 Sep 2010
23:41 PM

From

Ayende: What solution did you end up with ? or havent decided yet ?

21 Sep 2010
08:16 AM

Ayende Rahien

From,

I added it to the queue, will be published in about 1 week

21 Sep 2010
08:56 AM

Robert

Use Ticks - that gives you an always increasing number. Then every machine has a unique ID number. Figure out the precedence used in sorting. Shove the ticks into whatever positions in the GUID are more significant, find a position that is less significant than the where the ticks go and put the machine ID # in that. Sequential guids. Different ticks - always increasing. Same tick? Every machiine can make a GUID on the same tick and the IDS will now be significant and make them sequential up until the next tick.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB