Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 6 min | 1084 words

In my previous post, I wrote about the case of a medical provider that has a cloud database to store its data, as well as a whole bunch of doctors making house calls. The doctors need to have (some of) that information on their machines, and they need to push the updates they make locally back to the cloud.

image

However, given that their machines are in the field, and that we may encounter a malicious doctor, we aren’t going to fully trust these systems. We still want the system to function, though. The question is how will we do it?

Let’s try to state the problem in more technical terms:

  • The doctor needs to pull data from the cloud (list of patients to visit, patient records, pharmacies and drugs available, etc).
  • The doctor needs to be able to create patient records (exam made, checkup results, prescriptions, recommendations, etc).
  • The doctor’s records need to be pushed to the cloud.
  • The doctor should not be able to see any record that is not explicitly made available to them.
  • The same applies for documents, attachments, time series, counters, revisions, etc.

Enforcing Distributed Data Integrity

The requirements are quite clear, but they do bring up a bit of a bother. How are we going to enforce it?

One way to do that would be to add some metadata rule to the document, deciding whether a doctor should or should not see that document. Something like this:

image

In this model, a doctor will be able to get this document if they have any of the tags associated with the document.

This can work, but it has a bunch of non trivial problems and one huge problem that may not be obvious. Let’s start with the non trivial issues:

  • How do you handle non document data? Based on the owner document, probably. But that means that we have to have a parent document. That isn’t always the case.
  • For example, there is no usable parent if the document was deleted, or is in a conflicted state.
  • What do you do with revisions, if the access tags have changed? Which tags do you follow?

There are other issues, but as you can imagine, they all revolve around the fact that this model allows you to change the tags for the document and expects us to handle that properly.

The huge problem, however, is what should happen when a tag is removed. Let’s assume that we have the following sequence of events:

  • patients/oren is created, with access tag of “doctors/abc”
  • That access tag is then removed
  • Doctor ABC’s machine is then connected to the cloud and sets up replication.
  • We need to remove patients/oren from the machine, so we send a tombstone.

So far, so good. However, what about Doctor XYZ’s machine? At this time, we don’t know what the old tags were, and that machine may or may not have that document. It shouldn’t have it now, so do we send a tombstone there as well? That leads to an information leak, since it reveals document ids that the machine isn’t authorized for.

We need a better option.

Using the Document ID as the Basis for Data Replication

We can define that once created, the access tags are immutable, and that would help considerably.  But that is still fairly complex to manage and opens up issues regarding conflicts, deletion and re-creation of a document, etc.

Instead, we are going to use the document’s id as the source for the decision to replicate the document or not. In other words, when we register the doctor’s machine, we set it up so it will allow:

Incoming paths:
  • doctors/abc/visits/*
  • tasks/doctors/abc/*

Outgoing paths:
  • patients/clinics/33-conventry-rd/*
  • pharmacies/*
  • tasks/doctors/abc/*
  • doctors/abc
  • laboratories/*

In this case, incoming and outgoing are defined from the point of view of the cloud cluster. So this setup allows the doctor’s machine to push updates to any document with an id that starts with “doctors/abc/visits/” or “tasks/doctors/abc/”. The cloud will send all pharmacies and laboratories data. It will also send all the patients for the doctor’s clinic, the tasks for this doctor and, finally, the doctor’s own record. Everything else will be filtered.
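To make the filtering rule concrete, here is a minimal sketch of how such a path check could look. It is not RavenDB’s implementation or API: the names ReplicationPaths, accepts_from_edge and sends_to_edge are made up for illustration, and I’m assuming that a trailing “*” means “any id with this prefix” while a path without it must match exactly.

```python
from dataclasses import dataclass
from typing import Sequence


def path_matches(pattern: str, doc_id: str) -> bool:
    """Assumed semantics: a trailing '*' is a prefix match, otherwise exact match."""
    if pattern.endswith("*"):
        return doc_id.startswith(pattern[:-1])
    return doc_id == pattern


@dataclass(frozen=True)
class ReplicationPaths:
    """Hypothetical holder for the two path lists, seen from the cloud's side."""
    incoming: Sequence[str]  # what the cloud accepts from the doctor's machine
    outgoing: Sequence[str]  # what the cloud sends to the doctor's machine

    def accepts_from_edge(self, doc_id: str) -> bool:
        return any(path_matches(p, doc_id) for p in self.incoming)

    def sends_to_edge(self, doc_id: str) -> bool:
        return any(path_matches(p, doc_id) for p in self.outgoing)


# The configuration from the post, for doctor "abc".
DOCTOR_ABC = ReplicationPaths(
    incoming=("doctors/abc/visits/*", "tasks/doctors/abc/*"),
    outgoing=(
        "patients/clinics/33-conventry-rd/*",
        "pharmacies/*",
        "tasks/doctors/abc/*",
        "doctors/abc",
        "laboratories/*",
    ),
)
```

Because the decision depends only on the document id, it gives the same answer for a live document, a tombstone or a revision, which is exactly what the tag based approach could not offer.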

This Model is Simple

This model is simple: it provides a list of outgoing and incoming paths for the data that will be replicated. It is also surprisingly powerful. Consider the implications of the configuration above.

The doctor’s machine will have a list of laboratories and pharmacies (public information) locally. It will have the doctor’s own document as well as the records of the patients in the clinic. The doctor is able to create and push patient visit records. Most interestingly, the tasks for the doctor are defined to allow both push and pull. The doctor will receive updates from the office about new tasks (home visits) to make, can mark them complete, and that change will show up in the cloud.

The doctor’s machine (and the doctor as well) is not trusted. So we limit the data that they can see on a Need To Know basis, and we also limit what they can push back to the cloud. Even with these limitations, there is a lot of freedom in the system, because once you have this defined, you can write your application on the cloud side and on the laptop and just let RavenDB handle the synchronization between them. The doctor doesn’t need access to a network to be able to work, since they have a RavenDB instance running locally and the cloud instance will sync up once there is any connectivity.

We are left with one issue, though. Note that the doctor can get the patients’ files, but is unable to push updates to them. How is that going to work?

The reason that the doctor is unable to write to the patients’ files is that they are not trusted. Instead, they will send a visit record, which contains their findings and recommendations. On the cloud, we’ll validate the data, merge it with the actual patient’s record, apply any business rules and then update the record. Once that is done, it will show up on the doctor’s machine magically. In other words, this setup is meant for untrusted input.
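As a rough illustration of that cloud side step, here is a sketch of validating a visit record and merging it into the patient’s record. Everything here is an assumption made for the example: the function name apply_visit, the field names (id, patient, findings, recommendation, history) and the specific checks are hypothetical, and real business rules would be much richer.

```python
from typing import Any, Dict


def apply_visit(patient: Dict[str, Any], visit: Dict[str, Any], doctor_id: str) -> Dict[str, Any]:
    """Validate an untrusted visit record and return an updated copy of the patient."""
    # The visit must live under the path this doctor is allowed to push to.
    if not visit["id"].startswith(f"doctors/{doctor_id}/visits/"):
        raise ValueError("visit id is outside the doctor's allowed path")
    # The visit must refer to the patient record we are about to update.
    if visit.get("patient") != patient["id"]:
        raise ValueError("visit does not refer to this patient")
    findings = str(visit.get("findings", "")).strip()
    if not findings:
        raise ValueError("visit has no findings")

    # Merge: the patient record is only ever written by this trusted code.
    updated = dict(patient)
    updated["history"] = list(patient.get("history", [])) + [{
        "visit": visit["id"],
        "doctor": f"doctors/{doctor_id}",
        "findings": findings,
        "recommendation": str(visit.get("recommendation", "")).strip(),
    }]
    return updated
```

The point of the design is that the untrusted machine only ever produces input documents. The patient record itself is changed by code running in the cloud, and the result then flows back out through the outgoing paths.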

There are more details that we can get into, but I hope that this outlines the concepts clearly. This is not a RavenDB 5.0 feature, but will be part of the next RavenDB release, due around September.

time to read 4 min | 753 words

RavenDB is typically deployed as a set of trusted servers. The network is considered to be hostile, which is why we encrypt everything over the wire and use X509 certificates for mutual authentication, but once the connection is established, we trust the other side to follow the same rules as we do.

To clarify, I’m talking here about trust between nodes, not about a client connected to RavenDB. Clients are also authenticated using X509 certificates, but they are limited to the access permissions assigned to them. Nodes in a cluster fully trust one another and need to do things like forward commands accepted by one node to another one. That requires that the second node trust that the first node properly authenticated the client and won’t pass on operations that the client has no authority for.

Use Case 1: A Database System for Multiple Medical Clinics

I think that a real use case might make things more concrete. Let’s assume that we have a set of clinics, with the following distribution of data.

image 

We have two clinics, one in Boston and one in Chicago, as well as a cloud system. The rules of the system are as follows:

  • Data from each clinic is replicated to the cloud.
  • Data from the cloud is replicated to the clinics.
  • Data from a clinic may only be at the clinic or in the cloud.
  • A clinic cannot get (or modify) data that didn’t come from that clinic.

In this model, we have three distinct locations, and we presumably trust all of them (otherwise, would we put patient data on them?). There is a need to ensure that we don’t expose patient data from one clinic to another, but that is about it. Note that in terms of RavenDB topology, we don’t have a single cluster here. That wouldn’t make sense. To start with, we need to be able to operate the clinic when there is no internet connectivity. And we don’t want to pay any avoidable latency even if everything is working fine. So in this case, we have three separate clusters, one in each location, and they are connected to one another using RavenDB’s multi master replication.

Use Case 2: A Database System Sharing with Outside Edge Points

Let’s look at another model. In this case, we are still dealing with medical data, but instead of a clinic, we have to deal with a doctor making house calls:

image

In this case, we are still talking about private data, but we are no longer trusting the end device. The doctor may lose the laptop, there may be malware running on the machine, or the doctor may be trying to do Bad Things directly. We still want to be able to push data to the doctor’s machine and receive updates from the field.

RavenDB has some measures at the moment to handle this scenario. You need to get only some of the data from the cloud to the doctor’s laptop, and you want to push only certain things back to the cloud. You can use pull replication and ETL to handle this scenario, and it will work, as long as you are willing to trust the end machine. Given the stringent requirements for medical data, that is not out of bounds, actually. Full volume encryption, forbidding the use of unknown software and a few other protections ensure that if the laptop is lost, the only thing you can do with it is repurpose the hardware. If we can go with that assumption, this is great.

However… we need to consider the case that our doctor is actually malicious.

image

When the Edge Point isn’t as Healthy as the Doctor Using It

So we need something in the middle, between all our data and what can reside on that doctor’s machine. As it currently stands, in order to create the appropriate barrier between the doctor’s machine and the cloud, you’ll have to write your own sync code and apply any logic / authorization at that level.

Sync code is non trivial, mostly because of the number of edge cases you have to deal with and the potential for conflicts. This has already been solved by RavenDB, so having to write it again is not ideal as far as we are concerned.

What would you do?

time to read 2 min | 307 words

A customer opened a support call about a recurring problem that they had in their system.

System.InvalidOperationException: Errno: 1224='The requested operation cannot be performed on a file with a user-mapped section open.
' (rc=0) - 'Failed to rvn_write_header 'headers.one', reason : FailOpenFile'. FailCode=FailOpenFile.
   at Sparrow.Server.Platform.PalHelper.ThrowLastError(FailCodes rc, Int32 lastError, String msg) in C:\Builds\RavenDB-Stable-4.2\42040\src\Sparrow.Server\Platform\PalHelper.cs:line 35
   at Voron.StorageEnvironmentOptions.DirectoryStorageEnvironmentOptions.WriteHeader(String filename, FileHeader* header) in C:\Builds\RavenDB-Stable-4.2\42040\src\Voron\StorageEnvironmentOptions.cs:line 706
   at Voron.Impl.FileHeaders.HeaderAccessor.Modify(ModifyHeaderAction modifyAction) in C:\Builds\RavenDB-Stable-4.2\42040\src\Voron\Impl\FileHeaders\HeaderAccessor.cs:line 200
   at Voron.Impl.Journal.WriteAheadJournal.JournalApplicator.SyncOperation.UpdateDatabaseStateAfterSync() in C:\Builds\RavenDB-Stable-4.2\42040\src\Voron\Impl\Journal\WriteAheadJournal.cs:line 998
   at Voron.Impl.Journal.WriteAheadJournal.JournalApplicator.LockTaskResponsible.RunTaskIfNotAlreadyRan() in C:\Builds\RavenDB-Stable-4.2\42040\src\Voron\Impl\Journal\WriteAheadJournal.cs:line 1192

And the code that failed was:

Based on that information, the issue is obvious. Someone is poking their nose into RavenDB’s files. It is important to understand that this will cause the I/O operation to fail. And since this is part of a critical path in managing the file, it will force the entire database to unload.

We asked the customer to make sure that all indexing services, anti virus programs and the like would be instructed to ignore the RavenDB directories and were assured that this is already the case.

That led us to a problem. If no one else is touching our files, is it possible that it is our own code that is causing the issue?

I asked the customer to use Process Monitor to track system calls made on that particular file.  It took a day or so, but they came back with:
8:13:31.4865562 AM MsMpEng.exe 3112 CreateFileMapping D:\RavenDB\Server\RavenData\System\headers.one FILE LOCKED WITH WRITERS SyncType: SyncTypeCreateSection, PageProtection: PAGE_EXECUTE

We had our culprit: Windows Defender was messing around in our files, causing some I/O operations to fail. The customer added RavenDB’s directories to Windows Defender’s exclusion list as well, and that should be the end of things.

Overall, this was a short issue, but it was only short because we were able to prove which other process was causing us trouble.

time to read 2 min | 380 words

A sadly commonplace “attack” on applications is called “Web Parameter Tampering”. This is the case where you have a URL such as this:

https://secret.medical.info/?userid=823

And your users “hack” you using:

https://secret.medical.info/?userid=999

And get access to another user’s records.

As an aside, that might actually be considered to be hacking, legally speaking. Which makes me want to smash my head on the keyboard a few times.

Obviously, you need to run your security validation on parameters, but there are other reasons to avoid exposing the raw identifiers to the user. If you are using an incrementing counter of some kind, creating two values at different times can leak the rate at which your data changes. For example, a competitor might create an order once a week and track the number of the order. That will give them a good indication of how many orders there have been in that time frame.

Finally, there are other data leakage issues that you might want to take into account. For example, “users/321” means that you are likely to be using RavenDB, while “users/4383-B” means that you are using RavenDB 4.0 or higher and “607d1f85bcf86cd799439011” means that you are using MongoDB.

A common reaction to this is to switch your ids to use guids. I hate that option; it means that you are entering very unfriendly territory for the application. Guids convey no information to the developers working with the system and they are hard to work with, from a human point of view. They are also less nice for the database system to work with.

A better alternative is to simply mask the information when it leaves your system. Here is the code to do so:

You can see that I’m actually using AES encryption to hide the data, and then encoding it in the Bitcoin format.

That means that an identifier such as "users/1123" will result in output such as this:

bPSPEZii22y5JwUibkQgUuXR3VHBDCbUhC343HBTnd1XMDFZMuok

The length of the identifier is larger, but not overly so, and the id is even URL safe. In addition to hiding the identifier itself, we also ensure that the users cannot muck about in the value. Any change to the value will result in an error when we try to unmask it.
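For reference, here is a sketch of just the encoding half of that approach, assuming that “the Bitcoin format” refers to Base58 with the Bitcoin alphabet. It deliberately leaves out the AES step (Python’s standard library has no AES, and the post’s own masking code is not reproduced here), so on its own this hides nothing and detects no tampering. The function names b58encode and b58decode are mine.

```python
# Base58 alphabet as used by Bitcoin: no 0, O, I or l, and no '+' or '/',
# which is why the output can be dropped into a URL as-is.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as Base58. Leading zero bytes become leading '1' characters."""
    number = int.from_bytes(data, "big")
    digits = []
    while number > 0:
        number, remainder = divmod(number, 58)
        digits.append(ALPHABET[remainder])
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + "".join(reversed(digits))


def b58decode(text: str) -> bytes:
    """Decode Base58 text. Raises ValueError on characters outside the alphabet."""
    number = 0
    for ch in text:
        number = number * 58 + ALPHABET.index(ch)
    leading_ones = len(text) - len(text.lstrip("1"))
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * leading_ones + body
```

In the full scheme, the bytes passed to b58encode would be the encrypted (and authenticated) form of “users/1123” rather than the raw id, and the error on unmasking would come from the decryption step rejecting a modified value.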

time to read 1 min | 154 words

The most famous example of the use of transactions is the money transfer scenario. As money is shifted from one account to the other, we want to ensure that no money goes poof or is made up out of whole cloth.

I just logged in to the bank to pay the taxes. It is a boring process that mostly consists of checking a box in a transfer filled in by the accounting department. Today there was much excitement. The transfer failed.

That was strange.

I got an error saying that the money transfer failed and that I should process the order again later. I checked my balance and the money had been deducted from my account.

I’m trying to decide if I should shrug it off and just make sure that the money was sent in a couple of days or if I should call someone at the bank and offer them consulting services about how to build transactional systems.

time to read 3 min | 490 words

In my previous post, I showed how you can search for a range of cleared bits in a bitmap, as part of an allocation strategy. I want to do better than just finding a range; I want to find the best range. Best is defined as:

  • Near a particular location (to increase locality).
  • Create the least amount of fragmentation.

Here is the bitmap in question: [0x80520A27102E21, 0xE0000E00020, 0x1A40027F025802C8, 0xEE6ACAE56C6C3DCC].

And a visualization of it:

image

What we want is to find the following ranges:

  • single cell near the red marker (9th cell)
  • 2 cells range near the red marker (9th cell)
  • 4 cells range near the yellow marker (85th cell)
  • 7 cells range near the green marker (198th cell)

We have an existing primitive already, the find_next() function. Let’s see how we can use it to find the right buffer for us.

Here is how it goes: we scan forward from the beginning of the word containing the desired location, trying to find a range that matches the requirement. We’ll look for the smallest range we can find, but as we go further from the starting point, we’ll loosen our requirements. If we can’t find anything higher, we’ll search lower than the given position. Regardless, if we find an exact match, one where we don’t need to fragment anything, we’ll take it if it is close enough.

Here are the results from the code (the code itself will follow shortly):

  • single cell near the 9th – 12 – as expected
  • 2 cells near the 9th – 27 – because there is a two cell hole there and we won’t fragment
  • 4 cells near 85 – 64 – I mentioned that we are scanning only forward, right? But we are scanning on a word boundary. So the 85th cell’s bit is in the 2nd word and the first 4 bits there are free.
  • 7 cells near 198 – 47 – we couldn’t find anything suitable higher, so we started to look from the bottom.
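To illustrate the two criteria (low fragmentation, then locality), here is a deliberately simplified sketch. It is not the code that produced the results above: it scans the whole bitmap bit by bit instead of word by word, it has no “close enough” cutoff, and it always prefers the tightest fit before looking at distance. The names free_runs and best_range are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

WORD_BITS = 64

# The bitmap from the post, lowest word first.
BITMAP = [0x80520A27102E21, 0xE0000E00020, 0x1A40027F025802C8, 0xEE6ACAE56C6C3DCC]


def free_runs(words: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield (start, length) for every maximal run of cleared bits."""
    total = len(words) * WORD_BITS
    start = None
    for pos in range(total):
        is_set = (words[pos // WORD_BITS] >> (pos % WORD_BITS)) & 1
        if not is_set:
            if start is None:
                start = pos
        elif start is not None:
            yield start, pos - start
            start = None
    if start is not None:
        yield start, total - start


def best_range(words: List[int], size: int, near: int) -> Optional[int]:
    """Pick the run that wastes the fewest cells, breaking ties by distance to `near`."""
    candidates = [
        (length - size, abs(start - near), start)
        for start, length in free_runs(words)
        if length >= size
    ]
    if not candidates:
        return None
    return min(candidates)[2]
```

With this rule, asking for a single cell near the 9th cell gives 12, and asking for two cells near the 9th gives 27, which agrees with the first two results listed above. The larger requests can differ, because the real code trades fit against distance and scan time instead of always insisting on the tightest fit.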

Here is the full code for this. I want to highlight a couple of functions:

Here you can see how I’m approaching the issue. First, get the data from the start of the word containing the desired location, then see if I can get the best match. Or, if I can’t, just try to get anything.

The fact that I can play around with the state of the system and resume operations at ease makes it possible to write this code.

Note that there is a tension in allocation code between:

  • Finding the best possible match
  • The time that it takes to find the match
  • The resources consumed by the find

In particular, if I need to scan a lot of memory, I may run into issues. It is usually better to get something as soon as possible than to wait for the perfect match.

If you are interested in a nice mental workout, implement the find_prev function. That is somewhat of a challenge, because you need to manage the state in reverse. Hint, use: __builtin_ctzll().

time to read 2 min | 314 words

In my previous post I showed how you can find a range of cleared bits in a bitmap. I even got fancy and added features such as find range from location, etc. That was great to experiment with, but when I tried to take the next step, being able to ask questions such as find me the next range from location or get the smallest range available, I ran into problems.

The issue was simple: I had too much state in my hands and I got lost trying to manage it all. Instead, I sat down and refactored things. I introduced an explicit notion of state:

This structure allows me to keep track of what is going on when I’m calling functions. Which means that I can set it up once, and then call find_next immediately, without needing to remember the state at the call site.

This behavior also helped a lot in the actual implementation. I’m processing the bitmap one uint64 at a time, which means that I have to account for ranges that cross a word boundary. The core of the routine is the find_range_once call:

This function operates on a single word, but the state is continuous. That means that if I have a range that spans multiple words, I can manage that without additional complications. Here is how I’m using this:

We are getting all the ranges from the current word, then deciding if we should move to the next, etc.

Notice that I’m using the current member in the struct to hold a copy of the current word, that is so I can mask a copy and move on.

Note that if the first cleared bit is in the start, I’m overflowing from ULONG_MAX back to zero. And with that, I’m able to do much more targeted operations on the bitmap, but I’ll discuss that in the next post.
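To show the idea of keeping the scan state in one place, here is a small sketch in Python. It is a stand-in under my own assumptions, not the original C code: instead of holding a masked copy of the current word, this version keeps only an absolute bit position to resume from, and the names RangeState and find_next are used purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

WORD_BITS = 64


def ctz(value: int) -> int:
    """Count trailing zero bits of a positive integer (like __builtin_ctzll)."""
    return (value & -value).bit_length() - 1


@dataclass
class RangeState:
    words: List[int]  # the bitmap, one 64-bit word per entry
    pos: int = 0      # absolute bit position where the next scan resumes


def find_next(state: RangeState) -> Optional[Tuple[int, int]]:
    """Return (start, length) of the next run of cleared bits, or None when done."""
    total = len(state.words) * WORD_BITS
    pos = state.pos
    # Phase 1: skip set bits, one word at a time, until a cleared bit shows up.
    while pos < total:
        remaining = WORD_BITS - pos % WORD_BITS
        word = state.words[pos // WORD_BITS] >> (pos % WORD_BITS)
        cleared = ~word & ((1 << remaining) - 1)
        if cleared:
            pos += ctz(cleared)
            break
        pos += remaining
    else:
        state.pos = total
        return None
    start = pos
    # Phase 2: extend the run, possibly across word boundaries, until a set bit.
    while pos < total:
        remaining = WORD_BITS - pos % WORD_BITS
        word = state.words[pos // WORD_BITS] >> (pos % WORD_BITS)
        if word:
            pos += ctz(word)
            break
        pos += remaining
    state.pos = pos
    return start, pos - start
```

The caller sets up a RangeState once and then just calls find_next repeatedly. Each call picks up where the previous one stopped, and a run that crosses a word boundary comes back as a single range.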
