Generating sequential numbers in a distributed manner
On its face, we have a simple requirement:
- Generate sequential numbers
- Ensure that there can be no gaps
- Do that in a distributed manner
Generating the next number in the sequence is literally as simple as ++, so surely that is a trivial task, right?
The problem is with the second requirement. The need to ensure that there are no gaps comes often when dealing with things like invoices. The tax authorities are really keen on “show me all your invoices”, and if there are gaps in the numbers, you may have to provide Good Answers.
You may think that the third one, running in a distributed environment, is the tough challenge, but that isn’t actually the case. If we are running in a single location, that is fairly easy. Run the invoice id generation as a transaction, and you are done. But the normal methods of doing that are usually wrong in edge cases.
Let’s assume that we use an Oracle database, which uses the following mechanism to generate the new invoice id:
invoice_seq.NEXTVAL
Or SQL Server with an identity column:
CREATE TABLE invoices ( invoice_id INT IDENTITY(1,1) PRIMARY KEY, ... );
In both cases, we may insert a new value to the invoices table, consuming an invoice id. At some later point in time, we may roll back the transaction. Care to guess what happens then?
You have INVOICE #1000 and INVOICE #1002, but nothing in between. In fact, no way to even tell what happened, usually.
In other words, identity, sequence, serial, or autonumber – regardless of what database platform you use, are not suitable for generating gapless numbers.
The reasoning is quite simple. Assume that you have two concurrent transactions, which generate two new invoices at roughly the same time. You commit the later one before the first one, and roll back the first. You now have:
- Invoice #1
- Invoice #2
- …
- Invoice #1000
- Invoice #1002
However, you don’t have Invoice #1001, and you cannot roll back the sequence value there, because if you do so, it will re-generate the #1002 on the next call.
Instead, for gapless numbers, we need to create this as a dedicated part of the transaction. So there would be a record in our system that contains the NextInvoiceId, which will be incremented as part of the new invoice creation.
In order to ensure that there are no gaps, you need to ensure that the NextInvoideId record increment is handled as a user operation, not a database operation. In other words, in SQL Server, that is a row in a table, that you manually increment as part of adding a new invoice. Here is what this will look like:
As you can see, we increment the row directly. So it will be rolledback as well.
The downside here is that we can no longer create two invoices concurrently. The second transaction would have to wait on the lock on the row in the next_gapless_ids table.
All of that happens inside a single database server. What happens when we are running in a distributed environment?
The answer in this case, is the exact same thing. You need a transaction, a distributed one, using a consensus algorithm. Here is how you can achieve this using RavenDB’s cluster wide transactions, which use the Raft protocol behind the scenes:
The idea is simple, we have a transaction that modifies the gapless ids document and creates a new invoice at the same time. We have to handle a concurrency exception if two transactions try to create a new invoice at the same time (because they both want to use the same invoice id value), but in essence this is pretty much exactly the same behavior as when you are running on a single node.
In other words, to ensure the right behavior, you need to use a transaction. And if you need a distributed transaction, that is just a flag away with RavenDB.
Comments
this obviously requires a policy of no physical deletes
Peter,
For _invoices_, for sure.Except that usually you only care about that in a fiscal year.
I didn't test it to the end, but you could create the invoice asynchron.
First create a SEQUENCE object for it (dbo.s_invoice_id). If you want to guarantee, that it is without gaps, even when your server crashes, you may either want to set cache to 1, but it will be SLOOOOW, when you have very much rows. Or you ensures, that the sequence will be checked / corrected every time the SQL server crashes (regular Shutdowns doesn't matter, they'll work fine, so it could be even a manual process in the (rare) case of a server crash)
Create a procedure that consists of a statement similar to the following: UPDATE dbo.invoice WITH (READCOMMITTED, READPAST) SET invoice_id = NEXT VALUE FOR dbo.s_invoice_id WHERE invoice_id IS NULL
The READPAST will skip rows from uncommited rows.
Insert your invoices with an empty invoice_id from your app. And create an Agent Job, which executes the procedure e.g. every minute.
This way you should be able to create / insert as many invoices at the same time (and roll back some transactions). Only drawback: the user has to wait up to a minute before he sees the final invoice number (you could even asign a temporary, negative number and replace negative invoice_ids with positive ones in the procedure.
Thomas,
That is hardly a solution.
To start with, you won't get an invoice id to give to the customer. You have to wait up to a minute to do that.That is bad UX. It also doesn't help if there is a transaction rollback.
It also doesn't ensure that your invoices will be created in sequential order.Two invoices may have different sort orders for invoice id vs. created-at
When I order something online it is very common, that I get the invoice >= one minute later. Yes, you can't it show immediate, but this is often not a problem (when you have the right processes or can adjust them a bit).
Rollbacks are not a problem. When the procedure rolls back, all invoices affected by it will be rolled back at the same time. And an "unfinished" order (with open / not commited transaction) will not get a real invoice_id because of the "READCOMMITED, READPAST" hints, so a ROLLBACK will not matter.
Regarding the order: in the procedure you you could enforce it, either by creating a filterered index (CREATE INDEX idx_invoices_without_invoice_id ON dbo.invoices (invoice_commit_time) INCLUDE (invoice_id) WHERE invoice_id IS NULL) and enforcing it in the UPDATE Statement by adding "INDEX (idx_invoices_without_invoice_id) " behind the READPAST (instead of invoice_commit_time you could use of course the invoice_id itself, if you are working with negative intermediate IDs). To be 100% sure, you could even add a MAX_DOP = 1 query hint.
Besides that - what is the "natural" order of an invoice that must not break under any circumstance? The creation timestamp, a random ID assigned, when you clicked the create button? Isn't it all random too? What, if the customer was delayed by a few seconds for whatever reason - he would get another ID.
And regarding the usabilitiy / user experience: is it better, when a user has to wait until x other users are done, because there is a lock on your table? What when a developer is adding a bug (e.g. an endless loop that happens in some very special circumstances) 3 years in the future or someone is debugging in prod? Will your users will wait 30 minutes? Will a system, that works with exclusive locks even scale (imaging 1000 invoices per second)? With an asynchron system you would prevent many of those problems. The endless looping / debugging invoice would not get its final invoice_id, but all others will. And an update over 60k new rows every minute (scaling) should be not a problem too (except you set the CACHE = 1 for the sequence).
Thomas,
You are using a sequence, when you roll back the sequence, the values aren't returned. That means that when you roll back, you'll lose those ids that you generated.
You have added a huge amount of complexity, but there is zero need for that. Don't use a sequence, use a value that is incremented as part of the transaction, and you don't need to worry about any of this.
You are right, forgot this behavior of the sequences. On the other hand it is more unlikely, that a small procedure with a single statement that will be called from exactly one job will be canceled / rolled back, than a insert-procedure, which could be called simultan 1000 times per second.
So let us combine both methods, using your id-table and the asynchrone procedure:
CREATE PROCEDURE dbo.SetInvoiceIds AS BEGIN DECLARE @invoice_id INT
END; GO
This way, you would ensure the consistence / gaplessness but prevent the locking / blocking problems - and it WILL occur - on a previous job I had to solve a very similar problem with an ID table (as your next_gapless_ids table), even when we had only 20-100 concurrent users.
Thomas,
I wouldn't go with this approach at all.
The locking on the id row is _be design_, since we need to have concurrency control here to ensure no gaps properly.
Note that even with the locking model, it should be possible to do this at a rate of thousands or tens of thousands / sec easily enough. Worst case scenario, you do transaction merging, but assuming you need gapless ids, that is just the cost of the requirement. The model of writing and then later getting the invoice id is not something that is viable for this problem
Comment preview