Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 3 min | 432 words

It’s the end of November, so like almost every year around this time, we have an AWS outage impacting a lot of people. If past experience is any indication, we’re likely to see higher failure rates in many instances, even if it doesn’t qualify as an “outage”, per se.

The image on the right shows the status of an instance of a RavenDB cluster running in us-east-1. The additional load is sufficient to disrupt the connection between the members of the cluster. Note that this isn’t load on the specific RavenDB cluster; this is the environment load. We are seeing busy neighbors and higher network latency in general, enough to disrupt the connection between the nodes in the cluster.

And yet, while the cluster connection is unstable, the individual nodes continue to operate normally and are able to keep working with no issues. This is part of the multi-layer design of RavenDB. The cluster runs using a consensus protocol, which is sensitive to network issues and requires a quorum to make progress. The databases, on the other hand, use a separate, gossip-based protocol to allow for multi-master distributed work.

What this means is that even in the presence of increased network disruption, we are able to run without needing to consult other nodes. As long as the client is able to reach any node in the database, we are able to serve reads and writes successfully.

In RavenDB, both clients and servers understand the topology of the system and can independently fail over between nodes without any coordination. A client that can’t reach a server will consult the cached topology to know which server is next in line. That server will be able to process the request (be it a read or a write) without consulting any other machine.
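
As a minimal sketch of what that looks like with the RavenDB C# client (the URLs, database name, and Order class here are illustrative, not from the post):

using Raven.Client.Documents;

public class Order
{
    public string Id { get; set; }
    public string Status { get; set; }
}

public static class Program
{
    public static void Main()
    {
        // The client is given all the cluster nodes up front and keeps a cached
        // copy of the topology, so it can fail over on its own if a node is unreachable.
        using var store = new DocumentStore
        {
            Urls = new[]
            {
                "https://a.example-cluster.ravendb.net",
                "https://b.example-cluster.ravendb.net",
                "https://c.example-cluster.ravendb.net"
            },
            Database = "Orders"
        };
        store.Initialize();

        using var session = store.OpenSession();
        // If the preferred node is down, the request is retried against the next
        // node in the cached topology; no coordination with other nodes is needed.
        var order = session.Load<Order>("orders/1-A");
        order.Status = "Shipped";
        session.SaveChanges();
    }
}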

The servers will gossip with one another about misplaced writes and set everything in order.

RavenDB gives you a lot of knobs to control exactly how this process works, but we have worked hard to ensure that by default, everything should Just Work.

Since we released the 4.0 version of RavenDB, we have had multiple Black Fridays, Cyber Mondays and Crazy Tuesdays go by peacefully. Planning, explicitly, for failures has proven to be a major advantage. When they happen, it isn’t a big deal and the system knows how to deal with them without scrambling the ready team. Just another (quiet) day, with a 1,000% increase in load.

time to read 5 min | 922 words

One of the interesting features of RavenDB is Subscriptions. These allow you to define a query and then subscribe to its results. When a client opens the subscription, RavenDB will send it all the documents matching the subscription query. This is an ongoing process, so you will get updates from documents that were modified after your subscription started. For example, you can ask to get “all orders in UK” and then use that to compute tax / import rules.
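
A rough sketch of that “all orders in UK” subscription with the RavenDB C# client follows; it assumes an already-initialized DocumentStore named store and an Order class with a ShipTo.Country field, and ComputeTax is a stand-in for your own processing:

using Raven.Client.Documents.Subscriptions;

// Create the subscription once; the server tracks its progress from then on.
var name = store.Subscriptions.Create(new SubscriptionCreationOptions<Order>
{
    Filter = order => order.ShipTo.Country == "UK"
});

// Open a worker and start processing batches.
var worker = store.Subscriptions.GetSubscriptionWorker<Order>(name);
await worker.Run(batch =>
{
    foreach (var item in batch.Items)
    {
        // item.Result is the matching order document.
        ComputeTax(item.Result);
    }
    // Returning from the delegate acknowledges the batch; the server then sends
    // the next one, including documents modified after the subscription started.
});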

Subscriptions are ideal for a number of use cases, but backend business processing is where they shine. This is because of the other property that subscriptions have: the ability to process the subscription results reliably. In other words, a failure in processing a subscription batch will not lose anything; we can simply restart the subscription. In the same manner, a server failure will simply fail over to another node and things will continue processing normally. You can shut down the client for an hour or a week, and when the subscription is restarted, it will process all the matching documents that changed while we weren’t listening.

Subscriptions currently have one very strong requirement. There can only ever be a single subscription client open at a given point. This is done to ensure that we can process the batches reliably. A subscription client will accept a batch, process it locally and then acknowledge it to the server, which will then send the next one.

Doing things in this manner ensures that there is an obvious path of progression in how the subscription operates. However, there are scenarios where you’ll want to use concurrent clients on a single subscription. For example, if the processing requires a lengthy computation and you want concurrent workers to parallelize the work. That is not a scenario that is currently supported, and it turns out that there are significant challenges in supporting it. I want to use this post to walk through them and consider possible options.

The first issue we have to take into account is that the reliability of subscriptions is a hugely important feature; we don’t want to lose that. This means that if we allow multiple concurrent clients at the same time, we have to have a way to handle a client going down. Right now, RavenDB keeps track of a single number to do so. You can think of it as the last modified date that was sent to the subscription client. That isn’t exactly how it works, but it is a close enough lie that saves us the deep details.

In other words, we send a batch of documents to the client and only update our record of the “last processed” when the batch is acknowledged. This design is simple and robust, but it cannot handle the situation where we have concurrent clients processing batches. We have to account for a client failing to process a batch and needing to resend it. The batch can be resent to the same client or to another one. That means that in addition to the “last processed” value, we also need to keep a record of in-flight documents that were sent in batches and haven’t been acknowledged yet.

We keep track of our clients by holding on to the TCP connection. That means that as long as the connection is open, the batch of documents that was sent is considered to be in an in-transit state. If the client that got the batch fails, we’ll note that (when the TCP connection closes) and then send the old batch to another client. There are issues with that, by the way; different clients may have different batch sizes, for example. What if the batch we need to retry has 100 documents, but the only available client takes 10 at a time?

There is another problem with this approach, however. Consider the case of a document that was sent to a client for processing. While it is being processed, it is modified again; that means we have a problem. Do we send the document again to another client for processing? Remember that the processing is very likely to do something related to this document, and that can be a cause for bugs, because two clients will be handling the same document (albeit two different versions of it) at the same time.

In order to support concurrent clients on the same subscription, we need to handle all of these problems.

  • Keep track of all the documents that were sent and haven’t been acknowledged yet.
  • Keep track of all the active connections and re-schedule documents that were sent but not acknowledged if the connection is broken.
  • When a document is about to be sent, check that it isn’t already being processed (an earlier version of it, rather) by another client. If that is the case, we have to wait until that document is acknowledged before allowing it to be sent again.

The latter is meant to avoid concurrency issues with the handling of a single document. I think that limiting the work on a per-document basis is a reasonable behavior. If your model requires coordination across multiple distinct documents, that is something that you’ll need to implement directly. Implementing “don’t send the same document to multiple clients at the same time”, on the other hand, is likely to result in a better experience all around.
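
To make that concrete, here is a purely illustrative sketch of the bookkeeping a concurrent subscription would need on the server side. This is not RavenDB’s actual implementation, just the shape of the problem:

using System.Collections.Generic;
using System.Linq;

public class ConcurrentSubscriptionState
{
    // Documents sent to some client but not yet acknowledged,
    // keyed by document id, holding the connection that owns them.
    private readonly Dictionary<string, string> _inFlight = new Dictionary<string, string>();

    public bool TryAssign(string docId, string connectionId)
    {
        // Never send a document (even a newer version of it) while an older
        // version is still being processed by another client.
        if (_inFlight.ContainsKey(docId))
            return false;
        _inFlight[docId] = connectionId;
        return true;
    }

    public void Acknowledge(string docId)
    {
        _inFlight.Remove(docId);
    }

    public IReadOnlyList<string> ConnectionClosed(string connectionId)
    {
        // Everything owned by the dead connection must be re-scheduled
        // and sent to another client.
        var orphaned = _inFlight.Where(kv => kv.Value == connectionId)
                                .Select(kv => kv.Key)
                                .ToList();
        foreach (var docId in orphaned)
            _inFlight.Remove(docId);
        return orphaned;
    }
}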

This post is meant to explore the design of such a feature, and as such, I would dearly love any and all feedback.

time to read 2 min | 376 words

RavenDB can handle large documents. There actually is a limit to the size of a document in RavenDB, but given that it is 2 GB in size, that isn’t a practical one. I actually had to go and check if the limit was 2 or 4 GB, because it doesn’t actually matter.

That said, having large documents is something that you should be wary of. They work, but they have very high costs.

I ran some benchmarks on the topic a while ago, and the results are interesting. Let’s consider a 100MB document. Parsing time for that should be around 4 – 5 seconds. That ignores the fact that there are also memory costs. For example, you can have a JSON document that is parsed into 50(!) times the size of the raw text. That is 5GB of memory to handle a single 100MB document. And that is just the parsing cost. There are others. Reading 100MB from most disks will take about a second, assuming that the data is sequential. Assuming you have a 1 Gbit/s network, all of which is dedicated to this single document, you can push it over the network in 800 ms or so.

Dealing with such documents is hard and awkward. If you accidentally issue a query over a bunch of those documents and get a page of 25 of them, you just got a result that is 2.5 GB in size. With documents of this size, you are also likely to want to modify multiple pieces at the same time, so you’ll need to be very careful about concurrency control as well.

In general, at those sizes, you stop treating this as a simple document and move to a streaming approach, because anything else doesn’t make much sense; it is too costly.

A better alternative is to split this document up into its component parts. You can then interact with each one of them on an independent basis.
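
As an illustration (the class names here are made up), instead of one giant order document that also embeds every status change, you model the status changes as their own small documents that point back to the order:

using System;
using System.Collections.Generic;

public class Order
{
    public string Id { get; set; }                 // "orders/1-A"
    public string Customer { get; set; }
    public List<OrderLine> Lines { get; set; }
}

public class OrderLine
{
    public string Product { get; set; }
    public int Quantity { get; set; }
}

// Each status change is an independent document that references the order,
// so you load, modify and stream them separately instead of rewriting a huge blob.
public class OrderStatusChange
{
    public string Id { get; set; }                 // "orders/1-A/status-changes/..."
    public string OrderId { get; set; }
    public string Status { get; set; }
    public DateTime At { get; set; }
}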

It is the difference between driving an 18-wheeler and driving a family car. You can pack a whole lot more onto the 18-wheeler, but it gets pretty poor mileage and is very awkward to park. You aren’t going to want to use it for going to the grocery store.

time to read 8 min | 1540 words

As I’m writing this, there seems to be a huge amount of furor over the US elections. I don’t want to get into that particular mess, but I had a few idle thoughts about how to fix one of the root issues here: the fact that the system of voting we use is based on pen & paper and was designed when a computer was a job title for a practical mathematician.

A large part of the problem is that we have to wait for a particular date, and there are polls, which seem to be used for information and misinformation at the same time. This means that people are basically trying to vote with insufficient information. It would be much more straightforward if the whole system was transparent and iterative.

Note: This is a blog post I’m writing to get this idea out of my head. I did very little research and I’m aware that there is probably a wide body of proposals in the area, which I didn’t look at.

The typical manner I have seen suggested for solving the election problem is to ensure the following properties:

  • Verifying that my vote was counted properly.
  • Verifying that the total votes were counted fairly.
  • (maybe) Making it hard / impossible to show a 3rd party who I voted for.

The last one is obviously pretty hard to do, but is important to prevent issues with pressuring people to vote for particular candidates.

I don’t have the cryptographic chops to come up with such a system, nor do I think that it would be particularly advisable. I would rather come up with a different approach entirely. In my eyes, we can have a system where we have the following:

  • The government issues each voter with a token (coin) that they can spend. That will probably get you to think about blockchains, but I don’t think this is a good idea. If we are talking about technical details, let’s say that the government issues a certificate to each voter (who generates their own private key, obviously).
  • The voter can then give that voting coin to a 3rd party. For example, by signing a vote for a particular candidate using their certificate (see the sketch after this list).
  • These coins can then be “spent” during the election. The winner of the election is the candidate that got more than 50% of the total coins spent.
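
A minimal sketch of that signing step (purely illustrative; the payload format and broker name are made up):

using System;
using System.Security.Cryptography;
using System.Text;

public static class VoteSigning
{
    public static void Main()
    {
        // The voter generates their own key pair; the government-issued certificate
        // binds the public key to the voter.
        using var voterKey = ECDsa.Create(ECCurve.NamedCurves.nistP256);

        // Spending the coin: sign a vote, or a delegation to a broker.
        var vote = Encoding.UTF8.GetBytes("delegate-to: broker/olive-pac");
        var signature = voterKey.SignData(vote, HashAlgorithmName.SHA256);

        // Anyone holding the voter's certificate (public key) can verify the vote,
        // but cannot forge a different one.
        Console.WriteLine(voterKey.VerifyData(vote, signature, HashAlgorithmName.SHA256));
    }
}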

As it currently stands, this is a pretty horrible system. To start with, it is painfully obvious who voted for whom. I think that a transparent voting record might be a good idea in general, but there are a multitude of problems with that. So, like many ideas that are great in theory, we shouldn’t allow it.

This is usually where [complex cryptography] comes into play, but I think that a much better option would be to introduce the notion of brokers into the mix. What do I mean?

While you could spend your coin directly on a particular candidate, you could also spend it at a broker. That broker is then delegated to use your vote. You may use a party-affiliated broker or a neutral one. Imagine a two-party system where you have the Olive party and the Opal party. I’m using obscure colors here to try to reduce any meaning people will read into the color choices. For what it’s worth, red & blue as party colors have opposite meanings in the States and Israel, which is confusing.

Let’s take two scenarios into consideration:

  • A voter spends their coin on the Olive Political Action Committee, which is known to support Olive candidates. In this case, you can clearly track who they want to vote for. Note that they aren’t voting directly for a candidate, because they want to delegate their vote to a trusted party to manage it.
  • A voter spends their coin at Private Broker #435. While they do that, they instruct the broker to pass their vote to the Olive PAC broker, or to a particular candidate, etc.

The idea is that the private broker is taking votes from enough people that while it is possible to know that you went through a particular broker, you can’t know who you voted for. The broker itself obviously knows, but that is similar to tracking the behavior of a voting booth, which would also allow you to discover who voted for whom. I think that it is possible to set things up so the broker itself won’t be able to make this association, however. A secure sum protocol would probably work. A key point for me is that this is an issue that is internal to a particular broker, not relevant to the system as a whole.

So far, I don’t think that I’m doing too much, but the idea here is that I want to take things further. Instead of stopping the system there, we allow voters to change their vote. In other words, instead of having to vote blindly, we can look at the results and adjust them.

In the April 2019 Israeli election, over 330 thousand votes were discarded because the parties they were cast for didn’t reach the minimum threshold. That was mostly because the polls were wrong; I don’t think that people would have voted for those parties if they had known they were on the verge. Being able to look at that and then adjust the votes would allow all those votes to be counted.

Taking this further, once we have the system of brokers and electronic votes in place, there is no reason to hold an election once every N years. Instead, we can say that the coins are literal political capital. In order to remain in office, the elected officials must keep holding over 50% of the amount of spent coins. It would probably be advisable to actually count these on a weekly / bi-weekly basis, but doing this at short intervals means that there is going to be a lot more accountability.

Speaking from Israel’s point of view, there have sadly been numerous cases where candidates campaigned on A, then did the exact opposite once elected. There are even a couple of famous sayings about it:

  • We promised, but didn’t promise to fulfil.
  • What you see from here you can’t see from there.

Note that this is likely to result in a more populist system, since the elected officials are going to pay attention to the electorate on an ongoing basis, rather than just around election time. I can think of a few ways to handle that. For example, once seated, for a period of time, you’ll need > 50% of the coins to get an elected official out of office.

A few more details:

  • For places like the States, where you vote for multiple things at the same time (local, state, federal house & senate, president), you’ll get multiple coins to spend, and can freely spend them in different locations. Or, via a broker, designate that they are to be spent in particular directions.
  • A large part of the idea is that a voter can withdraw their coin from a broker or candidate at any time.
  • Losing the key isn’t a big deal. You go to the government, revoke your previous certificate and get a new one. The final computation will simply ignore any revoked coins.
  • The previous point is also important for dealing with people who have died or moved. It is trivial to ensure that the dead don’t vote in this scheme, and you can verify that a single person doesn’t have coins from multiple different locations.

The implications of such a system are interesting, in my eyes. The idea is that you delegate the vote to someone you trust, directly or indirectly. That can be a candidate, but most likely will be a group. The usual term in the States is PAC, I believe. The point is that you then have active oversight by the electorate over the elected officials.

Over seven years ago I wrote a post about what I most dislike in the current election system. You are forced to vote on your top issue, but you usually have a far more complex system of values that you have to balance. For example, let’s say that my priorities are:

  • National security
  • Fiscal policy
  • Healthcare
  • Access to violet flowers

I may have to choose between a candidate that wants to ban violet flowers but proposes to lower taxes and a candidate that wants to raise taxes and legalize violet flowers. Which one do I choose? If I can shift my vote as needed, I can make that determination at the right political time. During budget month, my vote goes to the $PreferredFiscalPolicy candidate, and then if they don’t follow my preferences on violet flowers, I can shift.

This will have the benefit of killing the polling industry, since you can just look at where the political capital is. And it will allow the electorate to have far greater control over the government. I assume that elected officials will then be paying very careful attention to how much political capital they have to spend and act accordingly.

I wonder if I should file this post under science fiction, because that might make a good background for world building. What do you think of this system? And what do you think the effects of it would be?

time to read 2 min | 297 words

I mentioned in my previous post that I managed to lock myself out of the car by inputting the wrong pin code. I had to wait until the system reset before I could enter the right pin code. I got a comment on the post saying that it would be better to use a thumbprint scanner for the task, to avoid this issue.

I couldn’t disagree more.

Let’s leave aside the issue of biometrics, their security and the issue of using them for identity. I don’t want to talk about that subject. I’ll assume that biometrics cannot fail and can 100% identify a person, with no mistakes and no false positives or negatives.

What is the problem with a thumbprint vs. a pin code as the locking mechanism on a car?

Well, what about when I need someone else to drive my car? The simplest examples may be valet parking, leaving the car at the shop or just loaning it to someone. I can give them the pin code over the phone; I’m hardly going to mail someone my thumb. There are many scenarios where I actually want to grant someone the ability to drive my car, and making that harder to do is a bad idea.

There is also the issue of what happens if my thumb is unusable. It might be raining and my hands are wet, or I just changed a tire and will need half an hour at the sink to get cleaned up again.

You can think up solutions to those issues, sure, but they are cases where the advanced solution makes anything out of the ordinary a whole lot more complex. You don’t want to go there.

time to read 3 min | 582 words

Complex systems are considered problematic, but I’m using this term explicitly here. For reference, see this wonderful treatise on the subject. Another way to phrase this is how to create robust systems with multiple layers of defense against failure.

When something is important, you should prepare in advance for it. And you should prepare for failure. One of the typical (and effective) methods for handling failures is the good old retry. I managed to lock myself out of my car recently and had to wait 15 minutes before it would let me start it up. But a retry isn’t going to help you if the car ran out of fuel or there is a flat tire. In some cases, a retry is just going to give you the same failing response.

Robust systems do not have a single way to handle a process; they have multiple, often overlapping, manners of doing their task. There is absolutely duplication between these methods, which tends to raise the hackles of many developers. Code duplication is bad, right? Not when it serves a (very important) purpose.

Let’s take a simple example: order processing. Consider an order made on a website, which needs to be processed in a backend system:

[Diagram: website → payment provider → web hook → backend system]

The way it works: the application sends the order to a payment provider (such as PayPal), which processes the actual order and then submits the order information via a web hook to the backend system for processing.

In most such systems, if the payment provider is unable to contact the backend system, there will be some form of retry, usually with an exponential back-off strategy. That is sufficient to handle > 95% of the cases without issue. Things get more interesting when we have a scenario where this breaks. In the above case, assume that there is a network issue that prevents the payment provider from accessing the backend system. For example, a misconfigured DNS entry means that external access to the backend system is broken. Retrying in this case won’t help.

For scenarios like this, we have another component in the system that handles order processing. Every few minutes, the backend system queries the payment provider and checks for recent orders. It then processes those orders as usual. Note that this means you have to handle scenarios such as an order notification arriving from the backend pull process concurrently with the web hook execution. But you need to handle that anyway (retrying a slow web hook can cause the same situation).
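
Here is a minimal sketch of that pull process, assuming hypothetical IPaymentProvider and IOrderStore abstractions (none of these names come from the post):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public record Order(string Id, decimal Total);

public interface IPaymentProvider
{
    Task<IReadOnlyList<Order>> GetRecentOrdersAsync(TimeSpan lookBack);
}

public interface IOrderStore
{
    Task<bool> AlreadyProcessedAsync(string orderId);
    Task ProcessAsync(Order order);
}

public class OrderPullWorker
{
    private readonly IPaymentProvider _provider;
    private readonly IOrderStore _orders;

    public OrderPullWorker(IPaymentProvider provider, IOrderStore orders)
    {
        _provider = provider;
        _orders = orders;
    }

    public async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Ask the payment provider for anything recent, whether or not the
            // web hook already told us about it.
            foreach (var order in await _provider.GetRecentOrdersAsync(TimeSpan.FromMinutes(15)))
            {
                // Processing must be idempotent: the same order may arrive via the
                // web hook, via a retried web hook, and via this poll.
                if (await _orders.AlreadyProcessedAsync(order.Id))
                    continue;

                await _orders.ProcessAsync(order);
            }

            await Task.Delay(TimeSpan.FromMinutes(5), token);
        }
    }
}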

There is additional complexity and duplication in the system in this situation, because we don’t have a single way to do something.

On the other hand, this system is also robust in the other direction. Let’s assume that the backend credentials for the payment provider have expired. We aren’t stopped from processing orders; we still have the web hook to cover for us. In fact, each piece of the system is individually redundant. In practice, the web hook is used to speed up common order processing, with the pull of recent orders acting as a backup and recovery mechanism.

In other words, it isn’t code duplication, it is building redundancy into the system.

Again, I would strongly recommend reading: Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety.

time to read 5 min | 970 words

A customer reported that on their system, they suffered from frequent cluster elections in some cases. That is usually an indication that system resources are strained in some manner. From experience, that usually means that the I/O on the machine is capped (very common in the cloud) or that there is some network issue.

The customer was able to rule these issues out. The latency to the storage was typically within less than a millisecond, and the maximum latency never exceeded 5 ms. The network monitoring showed that everything was fine as well. The CPU was hovering around 7%, and there was no obvious reason for the issue.

Looking at the logs, we saw very suspicious gaps in the servers’ activity, but with no good reason for them. Furthermore, the customer was able to narrow the issue down to a single scenario. Resetting the indexes would cause the cluster state to become unstable. And it did so with very high frequency.

“That is flat out impossible”, I said. And I meant it. Indexing workload is something that we have a lot of experience managing and in RavenDB 4.0 we have made some major changes to our indexing structure to handle this scenario. In particular, this meant that indexing:

  • Runs in dedicated threads.
  • Is scoped away from certain cores, to leave the OS capacity to run other tasks.
  • Self-monitors and knows when to wind down to avoid impacting system performance.
  • Runs its threads with lower priority.
  • The cluster state, on the other hand, runs with high priority.

The entire thing didn’t make sense. However… the customer did a great job in setting up an environment where they could show us: click on the reset button, and the cluster becomes unstable. So it is impossible, but it happens.

We explored a lot of stuff around this issue. The machine is big and runs multiple NUMA nodes; maybe it was some interaction with that? It was highly unlikely, and it eventually didn’t pan out, but that is one example of the things that we tried.

We set up a similar cluster on our end and gave it 10x more load than the customer did, on a set of machines that had a fraction of the customer’s power. The cluster and the system state remained absolutely stable.

I officially declared that we were in a state of perplexation.

When we ran the customer’s own scenario on our system, we saw something, but nothing like what we saw on their end. One of the things that we usually do when we investigate resource constraint issues is to give the machines under test a lot less capability. Less memory and slower disks, for example, mean that it is much easier to surface many problems. But the worse we made the situation for the test cluster, the better the results became.

We changed things up. We gave the cluster machines with 128 GB of RAM and fast disks and tried it again. The situation immediately reproduced.

Cue facepalm sound here.

Why would giving more resources to the system cause instability in the cluster? Note that the other metrics also suffered, which made absolutely no sense.

We started digging deeper and we found the following index:
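
(The post shows the index itself as an image. A hypothetical index of roughly that shape, in RavenDB’s C# index syntax, would look like the following; this is an illustration, not the customer’s actual definition.)

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Item
{
    // In the customer's documents this turned out to be a large array of
    // complex objects, not a simple enum.
    public object[] State { get; set; }
}

public class Items_ByState : AbstractIndexCreationTask<Item>
{
    public Items_ByState()
    {
        // A single map that simply indexes the State field of each document.
        Map = items => from item in items
                       select new
                       {
                           item.State
                       };
    }
}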

It is about as simple an index as you can imagine, and it should cause absolutely no issue for RavenDB. So what was going on? We then looked at the documents…

[Screenshot: a sample document whose State field is a large array of complex objects]

I would expect the State field to be a simple enum property. But it is an array that describes the state changes in the system. This array also holds complex objects. The size of the array is about 450 items on average, and I saw it hit a max of 13,000 items.

That helped clarify things. During indexing, we have to process the State property, and because it is an array, we index each of the elements inside it. That means that for each document, we’ll index between 400 – 13,000 items for the State field. What’s more, we have complex objects to index. RavenDB will index each of those as a JSON string, so effectively the indexing would generate a lot of strings. These strings are going to be held in memory until the end of the indexing batch. So far, so good, but the problem in this case was that there were enough resources to have a big batch of documents.

That means that we would generate over 64 million string objects in one of those batches.

Enter GC, stage left.

The GC will be invoked based on how many allocations you have (in this case, a lot) and how many live objects you have (in this case, also a lot, until the index batch is completed). However, because the GC ran multiple times during the indexing batch, we had promoted significant numbers of objects to the next generation, and Gen1 or Gen2 collections are far more expensive.

Once we knew what the problem was, it was easy to find a fix. Don’t index the State field. Given that the values that were indexed were JSON strings, it is unlikely that the customer actually queried on them (later confirmed by talking to the customer).

On the RavenDB side, we added monitoring for the allocation frequency and will close the indexing batch early to prevent handing the GC too much work all at once.

The reason we failed to reproduce this on lower-end machines was simple: RavenDB was already using enough memory that we closed the batch early, before we could gather enough objects to make the GC really work hard. When running on a big machine, it had time to get the ball rolling and hand the whole big mess to the GC for cleanup.

time to read 2 min | 213 words

I was asked what the meaning of default(object) is in the following piece of code:
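
(The snippet in question is shown as an image in the original post. A representative example of where you’d see the construct, assuming the Northwind sample classes Company and Employee, is a multi-map index along these lines; it is an illustration, not the exact code from the question.)

using System.Linq;
using Raven.Client.Documents.Indexes;

public class CompaniesAndEmployees_ByName : AbstractMultiMapIndexCreationTask
{
    public CompaniesAndEmployees_ByName()
    {
        AddMap<Company>(companies => from c in companies
                                     select new
                                     {
                                         Name = c.Name,
                                         // Companies have no manager, but every map in a
                                         // multi-map index must produce the same shape.
                                         Manager = default(object)
                                     });

        AddMap<Employee>(employees => from e in employees
                                      select new
                                      {
                                          Name = e.FirstName,
                                          Manager = e.ReportsTo
                                      });
    }
}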

The code is something that you’ll see a lot in RavenDB indexes, but I understand why it is a strange construct. The default(object) is a way to express null. It asks the C# compiler to use the default value of the object type, which is null.

So why not simply say null there?

Look at the code: we aren’t setting a field here, we are creating an anonymous object. When we set a field to null, the compiler can tell what the type of the field is from the class definition and check that the value is appropriate. You can’t set a Boolean property to null, for example.

With anonymous objects, the compiler needs to know what the type of the field is, and null doesn’t provide this information. You can use the (object)null construct, which has the same meaning as default(object), but I find the latter to be syntactically more pleasant to read.

It may make more sense if you look at the following code snippet:
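
(A minimal standalone illustration of the difference:)

var a = new { Total = (object)null };        // compiles: the property type is object
var b = new { Total = default(object) };     // compiles: same meaning, arguably easier to read
// var c = new { Total = null };             // does not compile:
//                                           // "Cannot assign <null> to anonymous type property"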

This technique is probably only useful if you deal with anonymous objects a lot. That is something that you do frequently with RavenDB indexes, which is how I ran into this syntax.

time to read 3 min | 577 words

We got a feature request that we don’t intend to implement, but I thought the reasoning was interesting enough for a blog post. The feature request:

If there is a critical error or major issue with the current state of the database, for instance when the data is not replicated from Node C to Node A due to some errors in the database or network it should send out mail to the administrator to investigate on  the issue. Another example is, if the database not active due to some errors then it should send out mail as well.

On its face, the request is very reasonable. If there is an error, we want to let the administrator know about it, not hide it in some log file. Indeed, RavenDB has the concept of alerts for exactly that reason: to surface any issues directly to the admin ahead of time. We also have a mechanism in place to allow alerting the admin without checking in with the RavenDB Studio manually: SNMP. The Simple Network Management Protocol is designed specifically to enable this kind of monitoring, and RavenDB exposes a lot of state via it, which you can act upon in your monitoring system.

Inside your monitoring system, you can define rules that will alert you. Send an SMS if the disk space is low, or email on an alert from RavenDB, etc. The idea of actively alerting the administrator is something that you absolutely want to have.

Having RavenDB send those emails, not so much. RavenDB exposes monitoring endpoints and alerts; it doesn’t act or report on them. That is the role of your actual monitoring system. You can set up Zabbix or talk to your Ops team, which likely already has one installed.

Let’s talk about the reason that RavenDB isn’t a monitoring system.

Sending email is actually really hard. What sort of email provider do you use? What options are required to set up a connection? Do you need an X509 certificate or a user/pass combo? What happens if we can’t send the email? That is leaving aside the fact that actually getting the email delivered is hard enough. Spam, SPF, DKIM and DMARC are where things start. In short, that is a lot of complications that we would have to deal with.

For that matter, what about SMS integration? Surely that would also help. But no one uses SMS today; we want WhatsApp integration, and Telegram, and … You get the point.

Then there are social issues. How will we decide if we need to send an email or not? There should be some policy, and ways to configure that. If we don’t have that, we’ll end up sending either too many emails (which will get flagged / ignored) or too few (why aren’t you telling me about the XYZ issue?).

A monitoring system is built to handle those sorts of issues. It is able to aggregate reports, give you a single email with the current status, open issues for you to fix, and do a whole lot more that is simply outside the purview of RavenDB. There is also the most critical alert of all: if RavenDB is down, it will not be able to report that it is down, because it is down.

The proper way to handle this is to set up integration with a monitoring system, so we won’t be implementing this feature request.

time to read 2 min | 380 words

A sadly commonplace “attack” on applications is called “Web Parameter Tampering”. This is the case where you have a URL such as this:

https://secret.medical.info/?userid=823

And your users “hack” you using:

https://secret.medical.info/?userid=999

And get access to another user’s records.

As an aside, that might actually be considered to be hacking, legally speaking. Which makes me want to smash my head on the keyboard a few times.

Obviously, you need to run your security validation on parameters, but there are other reasons to avoid exposing raw identifiers to the user. If you are using an incrementing counter of some kind, creating two values might leak the rate at which your data changes. For example, a competitor might create an order once a week and track the order numbers. That will give them a good indication of how many orders there have been in that time frame.

Finally, there are other data leakage issues that you might want to take into account. For example, “users/321” means that you are likely to be using RavenDB, while “users/4383-B” means that you are using RavenDB 4.0 or higher, and “607d1f85bcf86cd799439011” means that you are using MongoDB.

A common reaction to this is to switch your ids to guids. I hate that option; it means that you are entering very unfriendly territory for the application. Guids convey no information to the developers working with the system and are hard to work with, from a human point of view. They are also less nice for the database system to work with.

A better alternative is to simply mask the information when it leaves your system. Here is the code to do so:
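
(The original code is shown as an embedded snippet that isn’t reproduced here. Below is a minimal sketch of the same idea, AES-encrypting the id and then Base58-encoding the result; the key handling and helper names are illustrative.)

using System;
using System.Linq;
using System.Numerics;
using System.Security.Cryptography;
using System.Text;

public static class IdMasker
{
    // The Base58 ("Bitcoin format") alphabet: no 0, O, I or l, and URL safe.
    private const string Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

    // key must be a valid AES key (16, 24 or 32 bytes) that stays on the server.
    public static string Mask(string id, byte[] key)
    {
        using var aes = Aes.Create();
        aes.Key = key;
        aes.GenerateIV();
        using var encryptor = aes.CreateEncryptor();
        var plain = Encoding.UTF8.GetBytes(id);
        var cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
        // Prepend the IV so the value can be unmasked on the way back in.
        return Base58Encode(aes.IV.Concat(cipher).ToArray());
    }

    private static string Base58Encode(byte[] data)
    {
        // Interpret the bytes as one big unsigned number, then emit base-58 digits.
        var value = new BigInteger(data.Reverse().Concat(new byte[] { 0 }).ToArray());
        var sb = new StringBuilder();
        while (value > 0)
        {
            value = BigInteger.DivRem(value, 58, out var remainder);
            sb.Insert(0, Alphabet[(int)remainder]);
        }
        // Preserve leading zero bytes as '1' characters, as Bitcoin's Base58 does.
        foreach (var b in data)
        {
            if (b != 0) break;
            sb.Insert(0, '1');
        }
        return sb.ToString();
    }
}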

You can see that I’m actually using AES encryption to hide the data, and then encoding it in the Bitcoin (Base58) format.

That means that an identifier such as "users/1123" will result in output such as this:

bPSPEZii22y5JwUibkQgUuXR3VHBDCbUhC343HBTnd1XMDFZMuok

The length of the identifier is larger, but not overly so, and the id is even URL safe. In addition to hiding the identifier itself, we also ensure that the users cannot muck about in the value. Any change to the value will result in an error when we try to unmask it.
