Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,408 | Comments: 50,842

Privacy Policy Terms
filter by tags archive
time to read 9 min | 1605 words

It’s very common to model your backend API as a set of endpoints that mirror your internal data model. For example, consider a blog engine, which may have:

  • GET /users/{id}: retrieves information about a specific user, where {id} is the ID of the user
  • GET /users/{id}/posts: retrieves a list of all posts made by a specific user, where {id} is the ID of the user
  • POST /users/{id}/posts: creates a new post for a specific user, where {id} is the ID of the user
  • GET /posts/{id}/comments: retrieves a list of all comments for a specific post, where {id} is the ID of the post
  • POST /posts/{id}/comments: creates a new comment for a specific post, where {id} is the ID of the post

This mirrors the internal structure pretty closely, and it is very likely that you’ll get to an API similar to this if you’ll start writing a blog backend. This represents the usual set of operations very clearly and easily.

The problem is that the blog example is so attractive because it is inherently limited. There isn’t really that much going on in a blog from a data modeling perspective. Let’s consider a restaurant and how its API would look like:

  • GET /menu: Retrieves the restaurant's menu
  • POST /orders: Creates a new order
  • POST /orders/{order_id}/items: Adds items to an existing order
  • POST /payments: Allows the customer to pay their bill using a credit card

This looks okay, right?

We sit at a table, grab the menu and start ordering. From REST perspective, we need to take into account that multiple users may add items to the same order concurrently.

That matters, because we may have bundles to take into account. John ordered the salad & juice and Jane the omelet, and Derek just got coffee. But coffee is already included in Jane’s order, so no separate charge for that. Here is what this will look like:

 ┌────┐┌────┐┌─────┐┌──────────────────────┐
 │John││Jane││Derek││POST /orders/234/items│
 └─┬──┘└─┬──┘└──┬──┘└─────┬────────────────┘
   │     │      │         │       
   │    Salad & Juice     │       
   │─────────────────────>│       
   │     │      │         │       
   │     │     Omelet     │       
   │     │───────────────>│       
   │     │      │         │       
   │     │      │ Coffee  │       
   │     │      │────────>│       

The actual record we have in the end, on the other hand, looks like:

  • Salad & Juice
  • Omelet & Coffee

In this case, we want the concurrent nature of separate requests, since each user will be ordering at the same time, but the end result should be the final tally, not just an aggregation of the separate totals.

In the same sense, how would we handle payments? Can we do that in the same manner?

 ┌────┐┌────┐┌─────┐┌──────────────────┐
 │John││Jane││Derek││POST /payments/234│
 └─┬──┘└─┬──┘└──┬──┘└────────┬─────────┘
   │     │      │            │          
   │     │     $10           │          
   │────────────────────────>│          
   │     │      │            │          
   │     │      │ $10        │          
   │     │──────────────────>│          
   │     │      │            │          
   │     │      │    $10     │          
   │     │      │───────────>│  

In this case, however, we are in a very different state. What happens in this scenario if one of those charges were declined? What happens if they put too much. What happens if there is a concurrent request to add an item to the order while the payment is underway?

When you have separate operations, you have to somehow manage all of that. Maybe a distributed transaction coordinator or by trusting the operator or by dumb luck, for a while. But this is actually an incredibly complex topic. And a lot of that isn’t inherent to the problem itself, but instead about how we modeled the interaction with the server.

Here is the life cycle of an order:

  • POST /orders: Creates a new order – returns the new order id
  • ** POST /orders/{order_id}/items: Adds / removes items to an existing order
  • ** POST /orders/{order_id}/submit: Submits all pending order items to the kitchen
  • POST /orders/{order_id}/bill: Close the order, compute the total charge
  • POST /payments/{order_id}: Handle the actual payment (or payments)

I have marked with ** the two endpoints that may be called multiple times. Everything else can only be called once.

Consider the transactional behavior around this sort of interaction. Adding / removing items from the order can be done concurrently. But submitting the pending orders to the kitchen is a boundary, a concurrent item addition would either be included (if it happened before the submission) or not (and then it will just be added to the pending items).

We are also not going to make any determination on the offers / options that were selected by the diners until they actually move to the payment portion. Even the payment itself is handled via two interactions. First, we ask to get the bill for the order. This is the point when we’ll compute orders, and figure out what bundles, discounts, etc we have. The result of that call is the final tally. Second, we have the call to actually handle the payment. Note that this is one call, and the idea is that the content of this is going to be something like the following:

{
  "order_id": "789",
  "total": 30.0,
  "payments": [
    {
      "amount": 15.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-3456",
      "expiration_date": "12/22",
      "cvv": "123"
    },
    { 
        "amount": 10.0, 
        "payment_method": "cash" },
    {
      "amount": 5.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-5678",
      "expiration_date": "12/23",
      "cvv": "456"
    }
  ]
}

The idea is that by submitting it all at once, we are removing a lot of complexity from the backend. We don’t need to worry about complex interactions, race conditions, etc. We can deal with just the issue of handling the payment, which is complicated enough on its own, no need to borrow trouble.

Consider the case that the second credit card fails the charge. What do we do then? We already charged the previous one, and we don’t want to issue a refund, naturally. The result here is a partial error, meaning that there will be a secondary submission to handle the remainder payment.

From an architectural perspective, it makes the system a lot simpler to deal with, since you have well-defined scopes. I probably made it more complex than I should have, to be honest. We can simply make the entire process serial and forbid actual concurrency throughout the process. If we are dealing with humans, that is easy enough, since the latencies involved are short enough that they won’t be noticed. But I wanted to add the bit about making a part of the process fully concurrent, to deal with the more complex scenarios.

In truth, we haven’t done a big change in the system, we simply restructured the set of calls and the way you interact with the backend. But the end result of that is the amount of code and complexity that you have to juggle for your infrastructure needs are far more lightweight. On real-world systems, that also has the impact of reducing your latencies, because you are aggregating multiple operations and submitting them as a single shot. The backend will also make things easier, because you don’t need complex transaction coordination or distributed locking.

It is a simple change, on its face, but it has profound implications.

time to read 2 min | 399 words

Let’s assume that you want to make a remote call to another server. Your code looks something like this:

var response = await httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);

This is simple, and it works, until you realize that you have a problem. By default, this request will time out in 100 seconds. You can set it to a shorter timeout using HttpClient.Timeout property, but that will lead to other problems.

The problem is that internally, inside HttpClient, if you are using a Timeout, it will call CancellationTokenSource.CancelAfter(). That is... what we want to do, no?

Well, in theory, but there is a problem with this approach. Let's sa look at how this actually works, shall we?

It ends up setting up a Timer instance, as you can see in the code. The problem is that this will modify a global value (well, one of them, there are by default N timers in the process, where N is the number of CPUs that you have on the machine.

What that means is that in order to register a timeout, you need to take a look. If you have a high concurrency situation, setting up the timeouts may be incredibly expensive.

Given that the timeout is usually a fixed value, within RavenDB we solved that using a different manner. We set up a set of timers that will go off periodically and then use this instead. We can request a task that will be completed on the next timeout duration. This way, we'll not be contending on the global locks, and we'll have a single value to set when the timeout happens.

The code we use ends up being somewhat more complex:

var sendTask = httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);
var waitTask = TimeoutManager.WaitFor(TimeSpan.FromSeconds(15), cancellationTokenSource.Token);

if (Task.WaitAny(sendTask, waitTask) == 1)
{
        throw new TimeoutException("The request to the service timed out.");
}

Because we aren't spending a lot of time doing setup for a (rare) event, we can complete things a lot faster.

I don't like this approach, to be honest. I would rather have a better system in place, but it is a good workaround for a serious problem when you are dealing with high-performance systems.

You can see how we implemented the TimeoutManager inside RavenDB, the goal was to get roughly the same time frame, but we are absolutely fine with doing roughly the right thing, rather than pay the full cost of doing this exactly as needed. For our scenario, roughly is more than accurate enough.

time to read 5 min | 962 words

When I started using GitHub Copilot, I was quite amazed at how good it was. Sessions using ChatGPT can be jaw dropping in terms of the generated content.

The immediate reaction from many people is to consider what the impact of that would be on the humans who currently fill those roles. Surely, if we can get a machine to do the task of a human, we can all benefit (except for the person made redundant, I guess).

I had a long discussion on the topic recently and I think that it is a good topic for a blog post, given the current interest in the subject matter.

The history of replacing manual labor with automated machines goes back as far as you’ll like to stretch it. I wouldn’t go back to the horse & plow, but certain the Luddites and their arguments about the impact of machinery on the populace will sound familiar to anyone today.

The standard answer is that some professions will go away, but new ones will pop up, instead. The classic example is the ice salesman. That used to be a function, a guy on a horse-drawn carriage that would sell you ice to keep your food cold. You can assume that this profession is no longer relevant, of course.

The difference here is that we now have computer programs and AI taking over what was classically thought impossible. You can ask Dall-E or Stable Diffusion for an image and in a few seconds, you’ll have a beautiful render that may actually match what you requested.

You can start writing code with GitHub Copilot and it will predict what you want to do to an extent that is absolutely awe-inspiring.

So what is the role of the human in all of this? If I can ask ChatGPT or Copilot to write me an email validation function, what do I need a developer for?

Here is ChatGPT’s output:

image

And here is Copilot’s output:

image

I would rate the MailAddress version better, since I know that you can’t actually manage emails via Regex. I tried to take this further and ask ChatGPT about the Regex, and got:

image

ChatGPT is confused, and the answer doesn’t make any sort of sense.

Most of the time spent on “research” for this post was waiting for ChatGPT to actually produce a result, but this post isn’t about nitpicking, actually.

The whole premise around “machines will make us redundant” is that the sole role of a developer is taking a low-level requirement such as email validation and producing the code to match.

Writing such low-hanging fruit is not your job. For that matter, a function is not your job. Nor is writing code a significant portion of that. A developer needs to be able to build the system architecture and design the interaction between components and the overall system.

They need to make sure that the system is performant, meet the non-functional requirements, etc. A developer would spend a lot more time reading code than writing it.

Here is a more realistic example of using ChatGPT, asking it to write to a file using a write-ahead log. I am both amazed by the quality of the answer and find myself unable to use even a bit of the code in there. The scary thing is that this code looks correct at a glance. It is wrong, dangerously so, but you’ll need to be a subject matter expert to know that. In this case, this doesn’t meet the requirements, the provided solution has security issues and doesn’t actually work.

On the other hand, I asked it about password hashing and I would give this answer a good mark.

I believe it will get better over time, but the overall context matters. We have a lot of experience in trying to get the secretary to write code. There have been many tools trying to do that, going all the way back to CASE in the 80s.

There used to be a profession called: “computer”, where you could hire a person to compute math for you. Pocket calculators didn’t invalidate them, and Excel didn’t make them redundant. They are now called accountants or data scientists, instead. And use the new tools (admittedly, calling calculators or Excel new feels very strange) to boost up their productivity enormously.

Developing with something like Copilot is a far easier task, since I can usually just tab complete a lot of the routine details. But having a tool to do some part of the job doesn’t mean that there is no work to be done. It means that a developer can speed up the routine bits and get to grips faster / more easily with the other challenges it has, such as figuring out why the system doesn’t do what it needs to, improving existing behavior, etc.

Here is a great way to use ChatGPT as part of your work, ask it to optimize a function. For this scenario, it did a great job. For more complex scenarios? There is too much context to express.

My final conclusion is that this is a really awesome tool to assist you. It can have a massive impact on productivity, especially for people working in an area that they aren’t familiar with. The downside is that sometimes it will generate junk, then again, sometimes real people do that as well.

The next few years are going to be really interesting, since it provides a whole new level of capability for the industry at large, but I don’t think that it would shake the reality on the ground.

time to read 1 min | 108 words

The recording of my webinar showing off the new Sharding feature in RavenDB 6.0 is now live. I’m showcasing the new technology preview of RavenDB 6.0 and we have a nightly build already available for it. I think the webinar was really good, and I’m super excited to discuss all the ins & out of how this works.

Please take a look, and take the software for a spin. We have managed to get sharding down to a simple, obvious and clear process. And we are very much looking for your feedback.

time to read 7 min | 1202 words

Sometimes, you need to hold on to data that you really don’t want to have access to. A great example may be that your user will provide you with their theme color preference. I’m sure you can appreciate the sensitivity of preferring a light or dark theme when working in the IDE.

At any rate, you find yourself in an interesting situation, you have a piece of data that you don’t want to know about. In other words, the threat model we have to work with is that we protect the data from a malicious administrator. This may seem to be a far-fetched scenario, but just today I was informed that my email was inside the 200M users leak from Twitter. Having an additional safeguard ensures that even if someone manages to lay their hands on your database, there is little that they can do about it.

RavenDB supports Transparent Data Encryption. In other words, the data is encrypted on disk and will only be decrypted while there is an active transaction looking at it. That is a server-side operation, there is a single key (not actually true, but close enough) that is used for all the data in the database. For this scenario, that is not good enough. We need to use a different key for each user. And even if we have all the data and the server’s encryption key, we should still not be able to read the sensitive data.

How can we make this happen? The idea is that we want to encrypt the data on the client, with the client’s own key, that is never sent to the server. What the server is seeing is an encrypted blob, basically. The question is, how can we make it work as easily as possible. Let’s look at the API that we use to get it working:

As you can see, we indicate that the value is encrypted using the Encrypted<T> wrapper. That class is a very simple wrapper, with all the magic actually happening in the assigned JSON converter. Before we’ll look into how that works, allow me to present you with the way this document looks like to RavenDB:

As you can see, we don’t actually store the data as is. Instead, we have an object that stores the encrypted data as well as the authentication tag. The above document was generated from the following code:

The JSON document holds the data we have, but without knowing the key, we do not know what the encrypted value is. The actual encrypted value is composed of three separate (and quite important) fields:

  • Tag – the authentication tag that ensures that the value we decrypt is indeed the value that was encrypted
  • Data – this is the actual encrypted value. Note that the size of the value is far larger than the value we actually encrypted. We do that to avoid leaking the size of the value.
  • Nonce – a value that ensures that even if we encrypt similar values, we won’t end up with an identical output. I talk about this at length here.

Just storing the data in the database is usually not sufficient, mind. Sure, with what we have right now, we can store and read the data back, safe from data leaks on the server side. However, we have another issue, we want to be able to query the data.

In other words, the question is how, without telling the database server what the value is, can we query for matching values? The answer is that we need to provide a value during the query that would match the value we stored. That is typically fairly obvious & easy. But it runs into a problem when we have cryptography. Since we are using a Nonce, it means that each time we’ll encrypt the value, we’ll get a different encrypted value. How can we then query for the value?

The answer to that is something called DAE (deterministic authenticated encryption). Here is how it works: instead of generating the nonce using random values and ensuring that it is never repeated, we’ll go the other way. We’ll generate the nonce in a deterministic manner. By effectively taking a hash of the data we’ll encrypt. That ensures that we’ll get a unique nonce for each unique value we’ll encrypt. And it means that for the same value, we’ll get the same encrypted output, which means that we can then query for that.

Here is an example of how we can use this from RavenDB:

And with that explanation out of the way, let’s see the wiring we need to make this happen. Here is the JsonConverter implementation that makes this possible:

There is quite a lot that is going on here. This is a JsonConverter, which translates the in-memory data to what is actually sent over the wire for RavenDB.

On read, there isn’t much that is going on there, we pull the individual fields from the JSON and pass them to the DeterministicEncryption class, which we’ll look at shortly. We get the plain text back, read the JSON we previously stored, and translate that back into a .NET object.

On write, things are slightly more interesting. We convert the object to a string, and then we write that to an in memory stream. We ensure that the stream is always aligned on 32 bytes boundary (to avoid leaking the size). Without that step, you could distinguish between “Dark” and “Light” theme users simply based on the length of the encrypted value. We pass the data to the DeterministicEncryption class for actual encryption and build the encrypted value. I choose to use a complex object, but we could also put this into a single field just as easily.

With that in place, the last thing to understand is how we perform the actual encryption:

There is actually very little code here, which is pretty great. The first thing to note is that we have GetCurrentKey, which is a delegate you need to provide to find the current key. You can have a global key for the entire application or for the current user, etc. This key isn’t the actual encryption key, however. In the DerivedKeys function, we use the Blake2b algorithm to turn that 32 bytes key into a 64 bytes value. We then split this into two 32 bits keys. The idea is that we separate the domains, we have one key that is used for computing the SIV and another for the actual encryption.

We use HMAC-Blake2b using the SIV key to compute the nonce of the value in a deterministic manner and then perform the actual encryption. For decryption, we go in reverse, but we don’t need to derive a SIV, obviously.

With this in place, we have about 100 lines of code that add the ability to store client-side encrypted values and query them. Pretty neat, even if I say so myself.

Note that we can store the encrypted value inside of RavenDB, which the database have no way of piercing, and retrieve those values back as well as query them for equality. Other querying capabilities, such as range or prefix scans are far more complex and tend to come with security implications that weaken the level guarantees you can provide. 

time to read 2 min | 221 words

imageThis Wednesday I’m going to be doing a webinar about RavenDB & Sharding. This is going to be the flagship feature for RavenDB 6.0 and I’m really excited to be talking about it in public finally.

Sharding involves splitting your data into multiple nodes. Similar to having different volumes of a single encyclopedia.

RavenDB’s sharding implementation is something that we have spent the past three or four years working on. That has been quite a saga to get it out. The primary issue is that we want to achieve two competing goals:

  • Allow you to scale the amount of data you have to near infinite levels.
  • Ensure that RavenDB remains simple to use and operate.

The first goal is actually fairly easy and straightforward. It is the second part that made things complicated. After a lot of work, I believe that we have a really good solution at hand.

In the webinar, I’m going to be presenting how RavenDB 6.0 implements sharding, the behavior of the system at scale, and all the details you need to know about how it works under the cover.

I’m really excited to finally be able to show off the great work of the team! Join me, it’s going to be really interesting.

time to read 4 min | 749 words

incaseofemergency

Preconditions, postconditions, and invariants, oh my!

The old adage about Garbage In, Garbage Out is a really important aspect of our profession. If you try to do things that don’t make sense, the output will be nonsensical.

On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

~Charles Babbage – Inventor of the first computer

As you can see, the issue isn’t a new one. And there are many ways to deal with that. You should check your inputs, assume they are hostile, double check on every layer, etc.

Those are the principles of sound programming design, after all.

This post is about a different topic. When everything is running smoothly, you want to reject invalid operations and dangerous actions. The problem is when everything is hosed.

The concept of emergency operations is something that should be a core part of the design, because emergencies happen, and you don’t want to try to carve new paths in emergencies.

Let’s consider a scenario such as when the root certificate has expired, which means that there is no authentication. You cannot authenticate to the servers, because the auth certificate you use has also expired. You need to have physical access, but the data center won’t let you in, since you cannot authenticate.

Surely that is fiction, right? Happened last year to Facebook (bad IP configuration, not certs, but same behavior).

An important aspect of good design is to consider what you’ll do in the really bad scenarios. How do you recover from such a scenario?

For complex systems, it’s very easy to get to the point where you have cross dependencies. For example, your auth service relies on the database cluster, which uses the auth service for authentication. If both services are down at the same time, you cannot bring them up.

Part of the design of good software is building the emergency paths. When the system breaks, do you have a well-defined operation that you can take to recover?

A great example of that is fire doors in buildings. They are usually alarmed and open to the outside world only, preventing their regular use. But in an emergency, they allow the quick evacuation of a building safely, instead of creating a chokepoint.

We recently got into a discussion internally about a particular feature in RavenDB (modifying the database topology). There are various operations that you shouldn’t be able to make, because they are dangerous. They are also the sort of things that allow you to recover from disaster. We ended up creating two endpoints for this feature. One that included checks and verification. The second one is an admin-only endpoint that is explicitly meant for the “I know what I mean” scenario.

RavenDB actually has quite a bit of those scenarios. For example, you can authenticate to RavenDB using a certificate, or if you have a root access on the machine, you can use the OS authentication mechanism instead. We had scenarios where users lost their certificates and were able to use the alternative mechanism instead to recover.

Making sure to design those emergency pathways ahead of time means that you get to do that with a calm mind and consider more options. It also means that you get to verify that your emergency mechanism doesn’t hinder normal operations. For example, the alarmed fire door. Or in the case of RavenDB, relying on the operating system permissions as a backup if you are already running as a root user on the machine.

Having those procedures ahead of time, documented and verified, ends up being really important at crisis time. You don’t need to stumble in the dark or come up with new ways to do things on the fly. This is especially important since you cannot assume that the usual invariants are in place.

Note that this is something that is very easy to miss, after all, you spend a lot of time designing and building those features, never to use them (hopefully). The answer to that is that you also install sprinklers and fire alarms with the express hope & intent to never use them in practice.

The amusing part of this is that we call this: Making sure this areup to code.

You need to ensure that your product and code are up to code.

time to read 5 min | 969 words

One of the big changes in RavenDB is the new search engine, Corax. We want to replace Lucene with a purpose built search engine, capable of doing everything that we can do with Lucene, but far faster.

In this post, I want to discuss a particular detail of the implementation. Managing the posting list. In information retrieval, the idea of a posting list is that we have a list of document ids that match a particular term. I’m ignoring the details of how we create or manage the list. The basic idea is that we have a need to store a potentially large number of document ids, update them on the fly as well query them. Lucene manages that by creating immutable segments, but that design decision doesn’t match the way we want to do things in Corax. The problem with immutable segments is that you’ll eventually need to run compaction, which can be really expensive.

As a database, we already have a really good data structure that matches most of those requirements, it turns out. A B+Tree can do a pretty good approximation of a posting list, but it does have a problem. It’s not efficient in terms of space usage. A document id is a 64 bits integer, and we can make a number of assumptions about it. Therefore, I create a dedicated B+Tree like structure to hold the posting list. This B+Tree is called a Set inside of Voron, and it can hold any number of uint64 values. Internally, this is managed as two separate types of pages. The Branch pages (which are fairly standard B+Tree branches) and the Leaf pages.

Here is the first version of the Set Leaf Page implementation:

image

Let’s figure out what is going on here. The page has a notion of a base. That means all the documents IDs that have the same upper 33 bits. Basically, inside a single page, all the IDs are in the same 2 billion range. That means that even though the document ids are 64 bits, in the context of a single page, we can treat them as 32 bits integers. That turns out to be important, since most integer compression routines work on 32 bits integers.

How does that work? We have a section in the page that is dedicated to raw values, and we insert values into that section until it is full. Whenever that happens, we compress the raw values section using PForDelta compression. The page will then contain multiple compressed segments and a single raw values segment. Each compressed segment is non-overlapping with the other compressed segments (but may overlap with the raw values). PForDelta is really good in compressing integers, especially if it is able to figure out patterns in your data. And documents IDs in Corax are explicitly designed to have common patterns so it will be able to take advantage of this behavior. When we read the entries from the page, we merge the compressed segments with the raw values and return the full details.

The code isn’t particularly simple, but has a fairly straightforward approach to the problem once you understand the design.

One thing that I haven’t touched is the notion of removals. That is an important concept, and we handle that in an interesting way. Remember that I said that the baseline for the page is the upper 33 bits? That is because the numbers inside the page are 31 bits in length, we reserve the top bit to mark a value as a tombstone marker.  In other words, when we write to the raw values, we can have either a new value or a removed value, distinguished using the uppermost bit.

When we fill the raw values segment, we compress it alongside the relevant compressed segments. At that point, we’ll filter out the removed values. This is very similar to the segments that Lucene is using, but we do that on a page boundary (8KB), not across all values.

We are able to push a lot of values into a single page. We see typically thousands to tens of thousands of documents IDs fitting into a single 8KB page. That is pretty amazing, since even if you have a posting list that has millions of entries, the actual disk reads are minimal.

The design was with us throughout most of the development of Corax, but we ran into a serious issue with the way it works when we started to build the query optimizer for Corax.

That is an interesting statement, I think you’ll agree. What is the relation between a (very) low-level design of the on-disk data format and the result of the query optimizer?

Well, it turns out that one of the things that we need to know for the query optimizer is: “How big is this posting list?”

This question is actually really important to be able to generate optimal queries. And given the structure we have, we can provide a good answer to that, most of the time, but not always.

Why is that?

The problem is what happens when we want to remove a value from the set, or add an existing value. If the value already exists inside a compressed segment, we don’t open the compressed segement (which will require re-writing it from scratch), so we record an addition that is actually spurious. Conversely, if we try to remove a value that isn’t in the set, we’ll wrongly decrement the number of entries in the posting list, leading to issues with a mismatch between the record number of entries and the actual number we have in the posting list.

That was very confusing to figure out, I’ll admit. It was also really hard to fix, but I’ll address that in the next post in the series.

time to read 4 min | 750 words

I’ve been calling myself a professional software developer for just over 20 years at this point. In the past few years, I have gotten into teaching university courses in the Computer Science curriculum. I have recently had the experience of supporting a non-techie as they went through a(n intense) coding bootcamp (aiming at full stack / front end roles). I’m also building a distributed database engine and all the associated software.

I list all of those details because I want to make an observation about the distinction between fundamental and transient knowledge.

My first thought is that there is so much to learn. Comparing the structure of C# today to what it was when I learned it (pre-beta days, IIRC), it is a very different language. I had literally decades to adjust to some of those changes, but someone that is just getting started needs to grasp everything all at once. When I learned JavaScript you still had browsers in the market that didn’t recognize it, so you had to do the “//<!—” trick to get things to work (don’t ask!).

This goes far beyond mere syntax and familiarity with language constructs. The overall environment is also critically important. One of the basic tasks that I give in class is something similar to: “Write a network service that would serve as a remote dictionary for key/value operations”.  Most students have a hard time grasping details such as IP vs. host, TCP ports, how to read from the network, error handling, etc. Adding a relatively simple requirement (make it secure from eavesdroppers) will take it entirely out of their capabilities.

Even taking a “simple” problem, such as building a CRUD website is fraught with many important details that aren’t really visible. Responsive design, mobile friendly, state management and user experience, to name a few. Add requirements such as accessibility and you are setting the bar too high to reach.

I intentionally choose the examples of accessibility and security, because those are “invisible” requirements. It is easy to miss them if you don’t know that they should be there.

My first website was a PHP page that I pushed to the server using FTP and updated live in “production”. I was exposed to all the details about DNS and IPs, understood exactly that the server side was just a machine in a closet, and had very low levels of abstractions. (Naturally, the solution had no security or any other –ities). However, that knowledge from those early experiments has served me very well for decades. Same for details such as how TCP works or the basics of operating system design.

Good familiarity with the basic data structures (heap, stack, tree, list, set, map, queue) paid itself many times over. The amount of time that I spent learning WinForms… still usable and widely applicable even in other platforms and environments. WPF or jQuery? Not so much.

Learning patterns paid many dividends and was applicable on a wide range of applications and topics.

I looked into the topics that are being taught (both for bootcamps and universities) and I understand why in many cases, those are being skipped. You can actually be a front end developer without understanding much (if at all) about networks. And the breadth of details you need to know is immense.

My own tendency is to look at the low level stuff, and given that I work on a database engine, that is obviously quite useful. What I have found, however, is that whenever I dug deep into a topic, I found ways to utilize that knowledge at a later point in time. Sometimes I was able to solve a problem in a way that would be utterly inconceivable to me previously. I’m not just talking about being able to immediately apply new knowledge to a problem. If that were the case, I would attribute that to wanting to use the new thing I just learned.

However, I’m talking about scenarios where months or years later I ran into a problem, and was then able to find the right solution given what was then totally useless knowledge.

In short, I understand that chasing the 0.23-alpha-stage-2.3.1-dev updates on the left-pad package is important, but I found that spending time deep in the stack has a great cumulative effect.

Joel Spolsky wrote about leaky abstractions, that was 20 years ago. I remember reading that blog post and grokking that. And it is true, being able to dig one or two layers down from where you usually live has a huge amount of leverage on your capabilities.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Recording (8):
    17 Feb 2023 - RavenDB Usage Patterns
  2. Production postmortem (48):
    27 Jan 2023 - The server ate all my memory
  3. Answer (12):
    05 Jan 2023 - what does this code print?
  4. Challenge (71):
    04 Jan 2023 - what does this code print?
  5. RavenDB Indexing (2):
    20 Oct 2022 - exact()
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats