Using RavenDB from Serverless applications
I got a great question about using RavenDB from Serverless applications:
DocumentStore should be created as a singleton. For Serverless applications, there are no long-running instances that would be able to satisfy this condition. DocumentStore is listed as being "heavy weight", which would likely cause issues every time a new instance is created.
RavenDB’s documentation explicitly calls out that the DocumentStore should be a singleton in your application:
We recommend that your Document Store implement the Singleton Pattern as demonstrated in the example code below. Creating more than one Document Store may be resource intensive, and one instance is sufficient for most use cases.
But the point of Serverless systems is that there is no such thing as a long running instance. As the question points out, that is likely going to cause problems, no? On the one hand we have RavenDB’s DocumentStore, which is optimized for long running processes and on the other hand we have Serverless systems, which focus on minimal invocations. Is this really a problem?
The answer is that there is no real contradiction between those two desires. While the Serverless model is all about a single function invocation, the way the code actually runs means that there is a backing process that is reused between invocations. Taking AWS Lambda as an example, you can define a function that will be invoked for SQS (Simple Queue Service) messages; the signature for the function will look something like this:
async Task HandleSQSEvent(SQSEvent sqsEvent, ILambdaContext context);
The Serverless infrastructure will invoke this function for messages arriving on the SQS queue. Depending on its settings, the load and defined policies, the Serverless infrastructure may invoke many parallel instances of this function.
What is important about Serverless infrastructure is that a single function instance will be reused to process multiple events. It is the Serverless infrastructure's responsibility to decide how many instances it will spawn, but it will usually not spawn a separate instance per message in the queue. It will let an instance handle the messages and spawn more as they are needed to handle the ingest load. I’m using SQS as an example here, but the same applies for handling HTTP requests, S3 events, etc.
Note that this is relevant for AWS Lambda, Azure Functions, GCP Cloud Functions, etc. A single instance is reused across multiple invocations. This ensures far more efficient processing (you avoid startup costs) and can make use of caching patterns and common optimizations.
When it comes to RavenDB usage, the same thing applies. We need to make sure that we don't create a separate DocumentStore for each invocation, but rather once per instance. Here is a simplified example of how you can do this:
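A minimal sketch of this pattern for an AWS Lambda function, assuming the RavenDB.Client and Amazon.Lambda packages; the server URL and database name are placeholders:

```csharp
using Amazon.Lambda.Core;
using Amazon.Lambda.SQSEvents;
using Raven.Client.Documents;

public class Function
{
    // Created once per Lambda *instance* (static initializer runs on cold
    // start), then reused across all invocations handled by that instance.
    private static readonly IDocumentStore Store = new DocumentStore
    {
        Urls = new[] { "https://your-ravendb-server" }, // placeholder URL
        Database = "Orders"                             // placeholder database
    }.Initialize();

    public async Task HandleSQSEvent(SQSEvent sqsEvent, ILambdaContext context)
    {
        foreach (var message in sqsEvent.Records)
        {
            // Sessions are lightweight; open one per unit of work.
            using var session = Store.OpenAsyncSession();
            // ... process the message, load/store documents ...
            await session.SaveChangesAsync();
        }
    }
}
```

The key point is that the DocumentStore lives in a static field, tied to the lifetime of the Lambda instance, while sessions are created and disposed per message.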
We define the DocumentStore when we initialize the instance, then we reuse the same DocumentStore for each invocation of the lambda (the handler code).
We can now satisfy both RavenDB’s desire to use a singleton DocumentStore for best performance and the Serverless programming model that abstracts how we actually run the code, without really needing to think about it.
Request for a follow up to this post about RavenDB + Serverless + autogenerated IDs.
When you "Initialize" a document store with the default HiLo autogenerated IDs, I -thought- that when the document store initializes, it creates a range -for this document store instance- in RavenDB (stored in the db at Raven/Hilo/collection), and then the document store remembers this in memory to reduce round trips to the db?
If this understanding is correct, what does this mean for 'serverless' scenarios? I'm -assuming- that the "single function instance" will create the first HiLo range ... and this instance will be torn down "soonish" (yes, a single instance might be reused for multiple "invocations").
But in general, I'm guessing that in a serverless environment, the HiLo values will grow quickly, as new ranges are constantly being reserved?
REMEMBER -> I might be doing -heaps- of reads in a serverless scenario (with HiLo) .. so I don't -need- the range to keep on increasing, but this automatically happens each time a doc-store initializes, right? And there are heaps of read-only invocations happening and lots of new instances getting created, then deleted, then created .. etc.
I'm also remembering something you (Ayende) said -years- ago: (not an exact quote) why the emotional attachment to identity numbers and .. even more so .. the emotional attachment to "nicely" incrementing identities? I know I still struggle with this lame concept. Maybe that's just my OCD :(
So .. thoughts? Don't use HiLo? Accept that the HiLo range will be large and vary with documents?
@Justin I believe Oren will follow up on your questions later. But from my knowledge, what you said about the HiLo algorithm is incorrect. The HiLo algorithm does, as you said, reserve IDs by range, but the range is not initialized during document store initialization; it is a lazy operation that happens only when you ask for the next ID for the first time.
I believe many people have never looked at the source code, nor used the HiLo algorithm manually. You don't have to depend on the document store to generate HiLo IDs for you; you can create an async HiLo algorithm class yourself, call it manually, and assign the ID manually. What the document store does is: when you try to create a new document that does not have a dedicated ID field, or you didn't specify an ID, it will use the HiLo algorithm to create one for you.
On top of that, you can define the range size of the HiLo algorithm; the default is 32. In RavenDB 3.5 I know you can change the capacity, but I couldn't find a way to set the capacity in RavenDB 4 or 5. Of course, worst case you can just re-implement the class yourself and set up the batch size. It's not hard.
For the actual implementation, you can check here. It indicates that the next range is only requested, for a given type, the first time the next ID is asked for; which translates to the first time a document of that type is saved after the document store is initialized. If you didn't save any new document before the store is finalized, zero IDs will be consumed.
If managing HiLo is an issue for you, you can also use server-side generated IDs.
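A hedged sketch of what server-side ID generation looks like, assuming an initialized `store` and a hypothetical `User` class: passing an id that ends with "/" asks the server to assign the id, while an id ending with "|" uses the identity feature instead.

```csharp
using var session = store.OpenAsyncSession();

var user = new User { Name = "Justin" };

// Trailing "/" -> server-generated id, e.g. "users/0000000001-A".
await session.StoreAsync(user, "users/");
await session.SaveChangesAsync();

// After SaveChangesAsync, user.Id holds the server-assigned id.
```

With this approach, no HiLo ranges are reserved on the client at all; the cost is an extra dependency on the server for id assignment.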
I remember one of Oren's articles also talks about generating IDs for invoices. I forget how he described it at the time; I was thinking of combining the counter feature with a compare-exchange transaction scope. Never done that before, so I'd need to test it out.
@Jason - thanks mate for the lovely reply.
AH! It's been a -long time- since I checked the code for how HiLo was being used (back in the 3.x days) .. and then later on when it was changed to collection/number-tag (and the number-tag confused me). But that too was eons ago.
And yeah, I knew about the other ID generation techniques but just like using HiLo because it's automatic. (The tag in number-tag still does my head in .. but I need to practice not being emotionally attached to it. I'm not attached to VMs, etc .. so dunno why I am for IDs :) )
I'm not sure that I'm following your issue. Multiple document stores running at the same time in different lambda instances will use the HiLo algorithm to reserve a range each. If you dispose of them properly, they will return that range (if possible). In other words, there is no issue with parallel instances generating the same ids.
The number of "outstanding" ranges may be high (equal to the total number of lambda instances), but the size of the range is 32 items by default, and the value range is 64 bits. Not an issue, in other words. This is also something that happens only on write operations, not for reads.
You would get gaps in the document ids, yes. If you really care about that, use identities, or "users/" as the id prefix, which will generate ids like "users/0000000021-A" on the server side.
@Oren thanks for the reply, mate.
My initial assumption was incorrect -> which made the question .. poor.
I was thinking that each time a lambda instance (or Azure Function instance, etc) was created, it would create a range in the HiLo collection in the db. This is not the case. That range is lazily created .. on a write request.
So let's say, for example, I have an Azure Function with an HTTP trigger, serving a large number of requests/second .. I dunno .. let's say 100 r/s or even 250 r/s. Assuming that's a largish number of requests/second, there would be a large number of (short lived) instances created. I also -thought- that each time an instance is created, a doc-store is created and initialized, and the initialization process would generate a range of 32 "available docs" in the in-memory doc-store. For argument's sake, these requests are all reads. Fine. Now the instance is destroyed quickly (again, poorly assuming the function doesn't live for long, for reasons I don't understand about how the plumbing works at Azure, etc). Ok .. fine ... now more requests come in, more instances are created, more ranges are reserved ... but not getting used. So when a write finally happens, the gap / gaps would be large. -That- was what I was incorrectly thinking.
So there are a number of key points that I've learnt here from this convo and also don't 100% grok:
Returning a range happens when you dispose the DocumentStore, typically. That will happen if the range we return is the latest one. If you have many ranges out, only the latest one can be returned. This is mostly so that if you re-run your code manually, you won't have large skips all the time; probably not that relevant in this context.
Yes, each lambda instance will get its own dedicated range.
Is there a way to adjust the range like what was possible in 3.5? That way, people who worry about scattered IDs could use a lower number such as 5. Of course, generating identities via "users/" also works. It's just that the documentation says the default is 32, so it feels like there should be a way to change it. In 3.5 we could; in the new code base, I wasn't able to find it.
DocumentStore disposal: if it has been registered as a singleton, will disposal get called during a sudden server shutdown? Or only during a graceful shutdown? Such as in ASP.NET, registering an IHostedService, then on the close event, disposing the DocumentStore? Or do we expect the GC's disposal call on the singleton class to trigger that?
The default (and minimum) is 32 items; we adjust the value dynamically and automatically, so it will not necessarily be 32 if you request a lot.
If you are disposing because of an orderly shutdown, we'll return the range, yes. If you just shut down / abort the process, no.
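An ASP.NET Core sketch of the orderly-shutdown case, assuming the standard Microsoft.Extensions.DependencyInjection container: when the store is registered through a factory, the container owns it and disposes it on graceful host shutdown, which is when the HiLo range gets returned.

```csharp
using Raven.Client.Documents;

var builder = WebApplication.CreateBuilder(args);

// Register via a factory so the DI container owns the instance and calls
// Dispose() on graceful shutdown (instances passed in pre-built are not
// disposed by the container).
builder.Services.AddSingleton<IDocumentStore>(_ => new DocumentStore
{
    Urls = new[] { "https://your-ravendb-server" }, // placeholder URL
    Database = "Orders"                             // placeholder database
}.Initialize());
```

A hard kill of the process skips disposal entirely, so in that case the reserved range is simply lost, as described above.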
Tried the approach but I'm getting an error message: "Document Store does not contain a definition for Certificate". Also, the namespace referenced here is Raven.Client.Documents, while I have the namespace Raven.Client.Document. Can you let me know how to get the right references added?
Balachandar, You are using the 3.5 client, which is no longer supported. Please use the modern one.