Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 2 min | 270 words

I found myself today needing to upload a file to S3; the upload is a few hundred GBs in size. I executed the appropriate command, like so:

aws s3api put-object --bucket twitter-2020-rvn-dump --key mydb.backup --body ./mydb.backup

But then I realized that this would be uploading a few hundred GB file to S3, which may take a while. The command doesn’t give any progress information, so I had no way to figure out how far along it was.

I decided to see what I could poke around and find. First, I ran this command:

ps -aux | grep s3api

This gave me the PID of the upload process in question.

Then I checked the file descriptors for this process, like so:

$ ls -alh /proc/84957/fd


total 0
dr-x------ 2 ubuntu ubuntu  0 Mar 30 08:10 .
dr-xr-xr-x 9 ubuntu ubuntu  0 Mar 30 08:00 ..
lrwx------ 1 ubuntu ubuntu 64 Mar 30 08:10 0 -> /dev/pts/8
lrwx------ 1 ubuntu ubuntu 64 Mar 30 08:10 1 -> /dev/pts/8
lrwx------ 1 ubuntu ubuntu 64 Mar 30 08:10 2 -> /dev/pts/8
lr-x------ 1 ubuntu ubuntu 64 Mar 30 08:10 3 -> /backups/mydb.backup

As you can see, file descriptor #3 is the one that we care about. We can then ask for more details:

$ cat /proc/84957/fdinfo/3
pos:    140551127040
flags:  02400000
mnt_id: 96
ino:    57409538

In other words, the position is 140,551,127,040 bytes, which puts the process at roughly 130GB into the file.

It’s not ideal, but it does give me some idea about where we are at. It is a nice demonstration of the ability to poke into the insides of a running system to figure out what is going on.

time to read 5 min | 962 words

When I started using GitHub Copilot, I was quite amazed at how good it was. Sessions using ChatGPT can be jaw dropping in terms of the generated content.

The immediate reaction from many people is to consider what the impact of that would be on the humans who currently fill those roles. Surely, if we can get a machine to do the task of a human, we can all benefit (except for the person made redundant, I guess).

I had a long discussion on the topic recently and I think that it is a good topic for a blog post, given the current interest in the subject matter.

The history of replacing manual labor with automated machines goes back as far as you’d like to stretch it. I wouldn’t go back to the horse & plow, but certainly the Luddites and their arguments about the impact of machinery on the populace will sound familiar to anyone today.

The standard answer is that some professions will go away, but new ones will pop up instead. The classic example is the ice salesman. That used to be a real occupation: a guy on a horse-drawn carriage who would sell you ice to keep your food cold. You can assume that this profession is no longer relevant, of course.

The difference here is that we now have computer programs and AI taking over what was classically thought impossible. You can ask Dall-E or Stable Diffusion for an image and in a few seconds, you’ll have a beautiful render that may actually match what you requested.

You can start writing code with GitHub Copilot and it will predict what you want to do to an extent that is absolutely awe-inspiring.

So what is the role of the human in all of this? If I can ask ChatGPT or Copilot to write me an email validation function, what do I need a developer for?

Here is ChatGPT’s output:

[Image: ChatGPT’s generated email validation code]

And here is Copilot’s output:

[Image: Copilot’s generated email validation code]
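For reference, here is a minimal sketch of the MailAddress-based approach (my own illustration of the idea, not the verbatim Copilot output):

using System.Net.Mail;

public static class EmailValidator
{
    // Let the framework's parser decide instead of hand-rolling a Regex.
    public static bool IsValidEmail(string candidate)
    {
        if (string.IsNullOrWhiteSpace(candidate))
            return false;

        try
        {
            // MailAddress throws FormatException on malformed input.
            var address = new MailAddress(candidate);

            // MailAddress may normalize the input (e.g. strip a display name);
            // require the parsed address to round-trip to the original string.
            return address.Address == candidate;
        }
        catch (FormatException)
        {
            return false;
        }
    }
}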

I would rate the MailAddress version better, since I know that you can’t actually validate emails via Regex. I tried to take this further and asked ChatGPT about the Regex, and got:

[Image: ChatGPT’s answer about the Regex]

ChatGPT is confused, and the answer doesn’t make any sort of sense.

Most of the time spent on “research” for this post was waiting for ChatGPT to actually produce a result, but this post isn’t about nitpicking.

The whole premise around “machines will make us redundant” is that the sole role of a developer is taking a low-level requirement such as email validation and producing the code to match.

Writing such low-hanging fruit is not your job. For that matter, writing a single function is not your job. Nor is writing code a significant portion of it. A developer needs to be able to build the system architecture and design the interaction between components and the overall system.

They need to make sure that the system is performant, meets the non-functional requirements, etc. A developer would spend a lot more time reading code than writing it.

Here is a more realistic example of using ChatGPT, asking it to write to a file using a write-ahead log. I am both amazed by the quality of the answer and find myself unable to use even a bit of the code in there. The scary thing is that this code looks correct at a glance. It is wrong, dangerously so, but you’ll need to be a subject matter expert to know that. In this case, the answer doesn’t meet the requirements, the provided solution has security issues, and it doesn’t actually work.

On the other hand, I asked it about password hashing and I would give this answer a good mark.

I believe it will get better over time, but the overall context matters. We have a lot of experience in trying to get the secretary to write code. There have been many tools trying to do that, going all the way back to CASE tools in the 80s.

There used to be a profession called “computer”, where you could hire a person to compute math for you. Pocket calculators didn’t invalidate them, and Excel didn’t make them redundant. They are now called accountants or data scientists instead, and they use the new tools (admittedly, calling calculators or Excel new feels very strange) to boost their productivity enormously.

Developing with something like Copilot is a far easier task, since I can usually just tab-complete a lot of the routine details. But having a tool to do some part of the job doesn’t mean that there is no work to be done. It means that a developer can speed up the routine bits and get to grips faster / more easily with the other challenges they have, such as figuring out why the system doesn’t do what it needs to, improving existing behavior, etc.

Here is a great way to use ChatGPT as part of your work: ask it to optimize a function. For this scenario, it did a great job. For more complex scenarios? There is too much context to express.

My final conclusion is that this is a really awesome tool to assist you. It can have a massive impact on productivity, especially for people working in an area that they aren’t familiar with. The downside is that sometimes it will generate junk; then again, sometimes real people do that as well.

The next few years are going to be really interesting, since it provides a whole new level of capability for the industry at large, but I don’t think that it would shake the reality on the ground.

time to read 2 min | 221 words

This Wednesday I’m going to be doing a webinar about RavenDB & Sharding. This is going to be the flagship feature for RavenDB 6.0 and I’m really excited to finally be talking about it in public.

Sharding involves splitting your data across multiple nodes, similar to having different volumes of a single encyclopedia.

RavenDB’s sharding implementation is something that we have spent the past three or four years working on. It has been quite a saga to get it out the door. The primary issue is that we want to achieve two competing goals:

  • Allow you to scale the amount of data you have to near infinite levels.
  • Ensure that RavenDB remains simple to use and operate.

The first goal is actually fairly easy and straightforward. It is the second part that made things complicated. After a lot of work, I believe that we have a really good solution at hand.

In the webinar, I’m going to be presenting how RavenDB 6.0 implements sharding, the behavior of the system at scale, and all the details you need to know about how it works under the covers.

I’m really excited to finally be able to show off the great work of the team! Join me, it’s going to be really interesting.

time to read 4 min | 750 words

I’ve been calling myself a professional software developer for just over 20 years at this point. In the past few years, I have gotten into teaching university courses in the Computer Science curriculum. I have recently had the experience of supporting a non-techie as they went through a(n intense) coding bootcamp (aiming at full stack / front end roles). I’m also building a distributed database engine and all the associated software.

I list all of those details because I want to make an observation about the distinction between fundamental and transient knowledge.

My first thought is that there is so much to learn. Comparing the structure of C# today to what it was when I learned it (pre-beta days, IIRC), it is a very different language. I had literally decades to adjust to some of those changes, but someone that is just getting started needs to grasp everything all at once. When I learned JavaScript you still had browsers in the market that didn’t recognize it, so you had to do the “//<!--” trick to get things to work (don’t ask!).

This goes far beyond mere syntax and familiarity with language constructs. The overall environment is also critically important. One of the basic tasks that I give in class is something similar to: “Write a network service that would serve as a remote dictionary for key/value operations”.  Most students have a hard time grasping details such as IP vs. host, TCP ports, how to read from the network, error handling, etc. Adding a relatively simple requirement (make it secure from eavesdroppers) will take it entirely out of their capabilities.
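To make the exercise concrete, here is a minimal sketch of the sort of thing the task asks for (my own illustration, using an assumed line-based GET / SET protocol over TCP, and deliberately skipping the hard parts: no TLS, no authentication, barely any error handling):

using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// A naive in-memory remote dictionary: one command per line,
// "SET key value" or "GET key". Deliberately insecure and fragile.
var store = new ConcurrentDictionary<string, string>();
var listener = new TcpListener(IPAddress.Any, 9999);
listener.Start();

while (true)
{
    var client = await listener.AcceptTcpClientAsync();
    _ = Task.Run(async () =>
    {
        using (client)
        {
            using var reader = new StreamReader(client.GetStream());
            using var writer = new StreamWriter(client.GetStream()) { AutoFlush = true };
            string? line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                var parts = line.Split(' ', 3);
                if (parts[0] == "SET" && parts.Length == 3)
                {
                    store[parts[1]] = parts[2];
                    await writer.WriteLineAsync("OK");
                }
                else if (parts[0] == "GET" && parts.Length >= 2)
                {
                    await writer.WriteLineAsync(
                        store.TryGetValue(parts[1], out var value) ? value : "(not found)");
                }
                else
                {
                    await writer.WriteLineAsync("ERR unknown command");
                }
            }
        }
    });
}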

Even a “simple” problem, such as building a CRUD website, is fraught with many important details that aren’t really visible: responsive design, mobile friendliness, state management and user experience, to name a few. Add requirements such as accessibility and you are setting the bar too high to reach.

I intentionally chose the examples of accessibility and security, because those are “invisible” requirements. It is easy to miss them if you don’t know that they should be there.

My first website was a PHP page that I pushed to the server using FTP and updated live in “production”. I was exposed to all the details about DNS and IPs, understood exactly that the server side was just a machine in a closet, and had very low levels of abstraction. (Naturally, the solution had no security or any other -ities). However, that knowledge from those early experiments has served me very well for decades. Same for details such as how TCP works or the basics of operating system design.

Good familiarity with the basic data structures (heap, stack, tree, list, set, map, queue) paid itself many times over. The amount of time that I spent learning WinForms… still usable and widely applicable even in other platforms and environments. WPF or jQuery? Not so much.

Learning patterns paid many dividends and was applicable on a wide range of applications and topics.

I looked into the topics that are being taught (both in bootcamps and universities) and I understand why, in many cases, those are being skipped. You can actually be a front end developer without understanding much (if anything at all) about networks. And the breadth of details you need to know is immense.

My own tendency is to look at the low-level stuff, and given that I work on a database engine, that is obviously quite useful. What I have found, however, is that whenever I dug deep into a topic, I found ways to utilize that knowledge at a later point in time. Sometimes I was able to solve a problem in a way that would have been utterly inconceivable to me previously. I’m not just talking about being able to immediately apply new knowledge to a problem. If that were the case, I would attribute it to wanting to use the new thing I just learned.

However, I’m talking about scenarios where months or years later I ran into a problem and was then able to find the right solution using what had seemed, at the time, like totally useless knowledge.

In short, I understand that chasing the 0.23-alpha-stage-2.3.1-dev updates on the left-pad package is important, but I found that spending time deep in the stack has a great cumulative effect.

Joel Spolsky wrote about leaky abstractions 20 years ago. I remember reading that blog post and grokking it. And it is true: being able to dig one or two layers down from where you usually live gives you a huge amount of leverage.

time to read 1 min | 160 words

FizzBuzz is a well-known test to show that you can program. To be more exact, it is a simple test that does not tell you whether you can program well, but if you cannot do FizzBuzz, you cannot program. It is a fail-only kind of metric. We need this because, sadly, we see people who fail FizzBuzz coming to interviews.
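For reference, the canonical FizzBuzz looks roughly like this (a minimal C# sketch, not the snippet I use in interviews):

using System;

// The classic FizzBuzz: multiples of 3 print "Fizz", multiples of 5
// print "Buzz", multiples of both print "FizzBuzz".
for (int i = 1; i <= 100; i++)
{
    if (i % 15 == 0) Console.WriteLine("FizzBuzz");
    else if (i % 3 == 0) Console.WriteLine("Fizz");
    else if (i % 5 == 0) Console.WriteLine("Buzz");
    else Console.WriteLine(i);
}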

I have another test, which I feel is simpler than FizzBuzz and can significantly reduce the field of candidates. I show them this code and ask them to analyze what is going on here:

Acceptable answers include puking, taking a few moments to breathe into a paper bag and mild to moderate professional swearing.

This is something that I actually ran into (about 15 years ago, in the WebForms days) and I have used it ever since. It is a great way to measure just how much a candidate knows about the environment in which they operate.

time to read 2 min | 324 words

The phrase “work well under pressure” is something that I consider to be a red flag in a professional environment. My company builds a database that is used as the backend of business-critical systems. If something breaks, there is a need to fix it. It costs money (sometimes a lot of money) for every minute of downtime.

Under such a scenario, I absolutely want the people handling the issue to remain calm, collected and analytical. In such a case, being able to work well under pressure is a huge benefit.

That is not how this term is typically used, however. The typical way you’ll hear this phrase is in reference to the usual working environment. For example, working under time pressure to deliver certain functionality. That sort of pressure is toxic over time.

Excess stress is a well-known contributor to health issues (both mental and physical), it will cause you to make mistakes, and it adds friction all around.

From my perspective, the ability to work well under pressure is an important quality, one that should be hoarded. You may need to draw on this ability in order to deal with a blocking customer issue, but you should be careful not to spend it on non-critical stuff.

And by definition, most things are not critical. If everything is critical, you have a different problem.

That means that part of the manager’s task is to identify the places where pressure is applied and remove it. In the context of software, that may mean delaying a release date or removing features to reduce the amount of work.

When working with technology, the most valuable asset you have is the people and the knowledge they have. And one of the easiest ways to lose that is to burn the candle at both ends. You get more light, sure, but you also get no candle.

time to read 3 min | 424 words

I like to think of myself as a database guy. My go-to joke about building user interfaces is that a <table> is all I need for layout (it’s not a joke). About a decade ago I just gave up on trying to follow what is going on in frontend land and accepted that I’ll reside in the backend from here on out.

Being ignorant of the way you’d write a modern frontend doesn’t change the fact that I like to use a good user interface. I have seriously mixed feelings about the importance of RavenDB Studio to the project. On the one hand, I care that it is easy to use, obvious and functional. I love that it is beautiful and will generally make your life easier. At the same time, I abhor the fact that it has such an impact on people’s decisions. I mean, the backend of RavenDB is absolutely beautiful, from a technical perspective. But everyone always talks about the studio.

Leaving aside my mini rant, we spend quite a lot of time and effort on the studio and the User Experience in general. This release is no exception, and we have a couple of major new updates to the studio.

One of the most common things you’ll do in the studio is run queries. In this release we have done a complete revamp of the automatic code completion for the client-side RQL queries written in the studio.
The new code assistance is available when writing any query in the Query view, the Patch view, and in the Subscription Query. This was actually quite interesting from a computer science perspective: we now have a formal grammar for RQL, which means that we can provide a much better experience for query editing. For example, take a look:

[Image: RQL code completion in the studio’s query editor]

Full code completion assistance and better error handling directly in the studio make it easier to work with RavenDB, for both developers and operations.

The second feature is the Identities page:

[Image: the Identities page in the studio]

Identities have been a feature in RavenDB for a long time, and somehow they have never been front and center. Maybe the discoverability of the feature suffered? You can now create, edit and modify identities directly in the studio, not just through the API.
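For context, here is roughly what using an identity looks like from the client API (a minimal sketch; the collection, entity and server URL are illustrative):

using Raven.Client.Documents;

public class Company
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public static class IdentityExample
{
    public static void Main()
    {
        using var store = new DocumentStore
        {
            Urls = new[] { "http://localhost:8080" }, // assumed local server
            Database = "demo"                         // assumed database name
        };
        store.Initialize();

        using var session = store.OpenSession();
        // The trailing '|' asks the server for the next identity value,
        // producing ids such as "companies/1", "companies/2", and so on.
        session.Store(new Company { Name = "Hibernating Rhinos" }, "companies|");
        session.SaveChanges();
    }
}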

[Image: editing an identity in the studio]

time to read 2 min | 273 words

Next week is Black Friday, which has become a global phenomenon. It is a fun day for shoppers, and a nervous wreck for IT admins everywhere. It is not uncommon to see traffic double or triple, and the actual load (processing more heavyweight requests) can go up by an order of magnitude. Preparing for Black Friday can be a harrowing experience, since you have a narrow window of opportunity and it is hard to know exactly where the stress points are.

This year, I decided to make your life easier, and RavenDB is offering a Black Friday Surge to all our customers. No, we aren’t offering you 50% off and everything must go. What we do instead is try to be of help.

This Black Friday (and Cyber Monday as well), we are offering all our customers double what they paid for. When running RavenDB on premise, if you purchased a RavenDB license for a 12 cores cluster (running on 3 nodes of 4 cores each), we’ll offer you 30 days of double the core count. In other words, you can scale your system to be twice as powerful, and it won’t cost you a cent.

On the cloud, as well, we will provide users with credits to upgrade their clusters to the next level up (doubling their power) for a full week during the next 30 days. Again, there is no extra cost here.

You can register for the Surge here to request the upgrade and you’ll get twice as much power to handle the increased load.

Enjoy the power up!

time to read 2 min | 285 words

Everyone is on the cloud these days, and one of the things that I keep seeing pushed is the notion of usage-based billing: basically, the idea that you pay for what you use.

Let’s assume that we are building a software-as-a-service product where users can submit an image and we’ll do some computation on it. The actual details aren’t relevant. What matters is that the pricing model is based on how much time processing each image takes and how much memory is used. You are running this on many machines and need to figure out how to do billing at the end of the month. It turns out that this can be quite a challenge. With incremental time series, a lot of the details around that just go away.

Here is how you can implement this:

You count the required memory as well as the actual runtime and record that in an incremental time series. We also store the details in a separate document for that particular run, in the same transaction (if the user cares about that level of detail). The interesting bit is that the data is now immediately available for the user to see how much they are going to be billed.
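Here is a minimal sketch of that idea using the RavenDB C# client’s incremental time series API (the ProcessingRun document, the “INC:Billing” series name and the metric layout are illustrative assumptions):

using System;
using Raven.Client.Documents;

public class ProcessingRun
{
    public string Id { get; set; }
    public string UserId { get; set; }
    public string ImageName { get; set; }
    public double DurationMs { get; set; }
    public double MemoryMb { get; set; }
}

public static class BillingExample
{
    public static void RecordRun(IDocumentStore store, string userId,
                                 string imageName, double durationMs, double memoryMb)
    {
        using var session = store.OpenSession();

        // Detailed record of this particular run, stored in the same transaction.
        session.Store(new ProcessingRun
        {
            UserId = userId,
            ImageName = imageName,
            DurationMs = durationMs,
            MemoryMb = memoryMb
        });

        // Incremental time series can be incremented concurrently from many
        // machines; the values here are [duration, memory] per measurement.
        // Assumes the "users/{id}" document already exists.
        session.IncrementalTimeSeriesFor("users/" + userId, "INC:Billing")
               .Increment(DateTime.UtcNow, new[] { durationMs, memoryMb });

        session.SaveChanges();
    }
}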

Typically, a lot of time is spent figuring out how to record those details efficiently and then how to query and aggregate them. We tested time series in RavenDB to billions of data points, and the internal format lends itself very well to aggregated queries.

Now you can take the code above, run it on hundreds of machines, and it will all add up to the proper result in the end.
