Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

time to read 1 min | 113 words

For Episode 123 of the CollabTalk Podcast, we explored the pivotal role of community in shaping businesses, discussing my guest’s founding of his company and the strategies for building and nurturing open-source communities. We covered the symbiosis between commercial success and community engagement, emphasizing the importance of community feedback in innovation and the challenges and benefits of integrating open-source models into business strategies. You can listen to the podcast above and follow me using your favorite app, such as Spotify, Apple Podcasts, Stitcher, Soundcloud, or the iHeartRadio app. Be sure to subscribe!

time to read 1 min | 103 words

A couple of months ago I had the joy of giving an internal lecture to our developer group about Voron, RavenDB’s dedicated storage engine. In the lecture, I’m going over the design and implementation of our storage engine.

If you ever had an interest in how RavenDB’s transactional, high-performance storage works, this is the lecture for you. Note that this is aimed at our developers, so we are going deep.

You can find the slides here and here is the full video.

time to read 1 min | 99 words

One of the most fun things that I do at work is share knowledge about how various things work. A few months ago I talked internally about how Certificates work. Instead of just describing the mechanism of that, I decided to actually walk our developers through the process of building the certificate infrastructure from scratch.

You can find the slides here and the full video is available online. It’s just over an hour of both lecture and discussion.

time to read 9 min | 1696 words

Following my previous post about updating the publishing platform of this blog, I realized that I dug myself into a hole. The new workflow was pretty sweet. To the point where I wrote my blog posts a lot more frequently than before, as you can probably tell.

The problem was that I wanted to edit and process the blog post inside Google Docs, where I have a great workflow for editing, reviews, collaboration, etc. And then I want to push that same document to the blog. The killer for me is that I want that to be a smooth process, and the end text should fit into the blog. That means, if I want to emphasize something, it should be seen in the blog as bold. And if I want to write some code, that should work as well. In fact, the reason that I started this process is that it got so annoying to post code to the blog.

I’m using Google Docs’ export functionality to get the HTML back, and I did some basic cleaning to get it blog-ready instead of being focused on visual fidelity. I was using HTML Agility Pack to do that, and it turned out to be the wrong tool for the job. The issue is that it processed the data as if it were an XML document. I actually have a long track record with XML, so that wasn’t the issue. The problem is that I wanted to do a series of non-trivial things with the HTML, and there aren’t any off-the-shelf facilities to do that in .NET that I could find.

For example, given how important it is to me to show code snippets properly, I wanted to be able to grab them from the document, figure out what language I’m actually using there, and syntax highlight it properly. I couldn’t find anything like that in .NET; all the libraries I found were for JavaScript.

You know the adage: “Let’s rewrite it in Rust”? I rewrote my entire publishing process in JavaScript. Which then led me to another adventure. How can I do two contrary things? When I’m writing this document, I want to be able to just write the code. When I publish it, I want to see the syntax highlighted code, properly formatted and working.

Google Docs has support for writing code blocks inline (for some small number of languages), which is great for the editing process. However, the HTML that this generates is beyond atrocious. What is even worse, the exported HTML doesn’t align things properly using fixed-width fonts, etc. In other words, it is almost there, but not quite.

When analyzing the Google Docs output, I noticed a couple of funny characters in the code output. Here is what it looks like. I believe this is a bug in the export process, probably related to the way code blocks work in Google Docs.

Dear Googlers, if you are reading this, please make a note that this thing now falls under Hyrum's Law. It is an observable behavior, and I’m relying on it to do important tasks. Don’t break this in the future.

It turns out these are actually a pair of Unicode characters. More specifically, they are Unicode characters that are marked for private use:

  • 0xEC03 - appears to be used to mark the beginning of a code block
  • 0xEC02 - appears to be used to mark the end of a code block

Note the “appears”, and my blatant disregard for things like software maintenance discipline and all things proper and good in the world of Computer Science. This is a project where there are no rules, there is one customer, and he can code 🙂.
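A quick sketch of how those two markers relate to the numeric entities that show up later in the HTML. The names `START_MARKER`, `END_MARKER`, and `isPrivateUse` are made up for illustration; the only facts taken from the post are the two code points, and the range check uses the Basic Multilingual Plane's Private Use Area (U+E000 to U+F8FF).

```javascript
// Illustrative names only; the post just mentions the two code points.
const START_MARKER = "\uEC03"; // appears to open a code block
const END_MARKER = "\uEC02";   // appears to close a code block

// The BMP Private Use Area spans U+E000..U+F8FF.
function isPrivateUse(ch) {
    const cp = ch.codePointAt(0);
    return cp >= 0xE000 && cp <= 0xF8FF;
}

// Decimal forms, as they would appear in an HTML numeric entity (&#NNNNN;).
const startEntity = "&#" + START_MARKER.codePointAt(0) + ";";
const endEntity = "&#" + END_MARKER.codePointAt(0) + ";";

console.log(startEntity, endEntity, isPrivateUse(START_MARKER));
```

This is why the HTML-processing code can compare element text against the decimal entities while the plain-text pass matches the raw characters.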

As mentioned earlier, while extracting the Google Doc as HTML and processing it, I encounter those Unicode markers that delineate the code section. This is good, because in terms of HTML itself, what it is doing inside is a… mess. Getting the actual text as it is supposed to be is not easy. So I exported the file again, as text. Those markers are showing up in the textual edition as well, which made things a lot easier for me.

With all of this done, allow me to show you some truly horrifying beautiful code:


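// Collect every code block from the plain-text export, highlighted and ready to insert.
let blocks = [];
for (const match of text.data.matchAll(/\uEC03(.*?)\uEC02/gs)) {
    const code = match[1].trim();
    // Guess the language of the snippet, then highlight it with Prism.
    const lang = flourite(code, { shiki: true, noUnknown: true }).language;
    const formattedCode = Prism.highlight(code, Prism.languages[lang], lang);

    blocks.push("<hr/><pre class='line-numbers language-" + lang + "'>" +
        "<code class='line-numbers language-" + lang + "'>" +
        formattedCode + "</code></pre><hr/>");
}

// Walk the HTML export and swap each marked region for its highlighted block.
let codeSegmentIndex = 0;
let inCodeSegment = false;
htmlDoc.findAll().forEach(e => {
    const elementText = e.getText().trim();
    if (elementText == "&#60419;") {
        // Start marker: put the next highlighted block in its place.
        e.replaceWith(blocks[codeSegmentIndex++]);
        inCodeSegment = true;
    }
    if (inCodeSegment) {
        // Drop everything that sits inside the marked region.
        e.extract();
    }
    if (elementText == "&#60418;") {
        // End marker: resume normal processing.
        inCodeSegment = false;
    }
})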

That isn’t a lot of code, but it does plenty. We scan through the textual version of the document and find all the code blocks using a regular expression. We then try to figure out what language I’m using and apply code formatting during the publication process (this saves the need to change anything on the blog, which is nice, especially since we have to take into account syndication).

I push the code snippets into an array and then I process the actual HTML document using the DOM and find all the code snippets. I replace the start marker with the actual formatted code and continue to discard all the other elements until I hit the end of the code segment. The rest of the code remains pretty much the same as before.
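To try the extraction step on its own, here is a minimal sketch that runs without flourite, Prism, or an HTML parser. `extractCodeBlocks` and `escapeHtml` are hypothetical helpers, and HTML escaping stands in for real syntax highlighting; only the regular expression is taken from the code above.

```javascript
// Stand-in for a real highlighter: just makes the code safe to embed in HTML.
function escapeHtml(s) {
    return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: returns one <pre><code> string per marked region.
function extractCodeBlocks(text, highlight = escapeHtml) {
    const blocks = [];
    // The 's' flag lets '.' match newlines; '.*?' keeps each match to one block.
    for (const match of text.matchAll(/\uEC03(.*?)\uEC02/gs)) {
        const code = match[1].trim();
        blocks.push("<pre><code>" + highlight(code) + "</code></pre>");
    }
    return blocks;
}

const sample = "Intro\n\uEC03\nif (a < b) { run(); }\n\uEC02\nMiddle\n\uEC03let x = 1;\uEC02\nEnd";
const sampleBlocks = extractCodeBlocks(sample);
console.log(sampleBlocks);
```

Passing a different `highlight` function is where a real highlighter would plug in.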

I was writing this in VS Code, and Copilot suggested the following code for handling images:


htmlDoc.findAll('img').forEach(img => {
    if (img.attrs.hasOwnProperty('src')) {
        let src = img.attrs.src;
        let imgName = src.split('/').pop();
        let imgData = entries.find(e => e.entryName === 'images/' + imgName).getData();
        let imgType = imgName.split('.').pop();
        let imgSrc = 'data:image/' + imgType + ';base64,' + imgData.toString('base64');
        img.replaceWith('<img src="' + imgSrc + '" style="float: right"/>');
    }
})

In other words, instead of uploading the images as separate files, I can just encode them into the blog post directly. I like that idea very much because it means that I don’t have to store the images elsewhere.
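The encoding step can be sketched in isolation like this. `toDataUri` and `MIME_BY_EXTENSION` are hypothetical, not part of the published script, and the sketch assumes Node.js (for `Buffer`). One deliberate difference from the Copilot suggestion: the file extension is mapped to a registered MIME type (`jpg` becomes `image/jpeg`) rather than pasted into the URI as-is.

```javascript
// Hypothetical lookup table; extend it with whatever formats the export produces.
const MIME_BY_EXTENSION = {
    jpg: "image/jpeg",
    jpeg: "image/jpeg",
    png: "image/png",
    gif: "image/gif",
    svg: "image/svg+xml",
};

// Hypothetical helper: turns raw image bytes into a data: URI.
function toDataUri(fileName, data) {
    const ext = fileName.split(".").pop().toLowerCase();
    const mime = MIME_BY_EXTENSION[ext];
    if (!mime) {
        throw new Error("Unsupported image type: " + ext);
    }
    return "data:" + mime + ";base64," + Buffer.from(data).toString("base64");
}

// The first four bytes of a PNG file, just to have something to encode.
const uri = toDataUri("image1.PNG", Buffer.from([0x89, 0x50, 0x4e, 0x47]));
console.log(uri);
```

Keep in mind that base64 makes the payload roughly a third larger than the raw bytes, so this trades page size for not having to host the images separately.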

Given that I don’t have any npm packages to abandon, I don’t know if I can call myself a JavaScript developer, but I did put the full code up for people to take a peek and then recoil.

time to read 1 min | 101 words

RavenDB will be participating in the DevWeek hackathon in February. The hackathon is now live, and we are offering prizes worth 4,000 USD for the top two winners.

The hackathon is open to both attendees of the DevWeek conference and the general public. The challenge we put forth is building a sharing platform in a community. I’m excited to see what kind of solutions will be submitted.

I will also be personally attending the DevWeek conference and would be very happy to meet you in person. Happy hacking!

time to read 2 min | 235 words


If you are reading this blog, I assume that you are a like-minded person. My idea of relaxation is to sit and write code. Hopefully on something that I’m not familiar with. I have many such blog post series covering topics I care about. It’s my idea of meditation.

For the end of 2023, I thought that we could do something similar but on a broader scale. A while ago Alex Klaus wrote a walkthrough on how to build a complete application from scratch using modern best practices (and RavenDB). We refreshed the code and made it widely available, offering you something fun, educational, and productive to engage with.

The system is a bug tracker (allowing us to focus on the architecture rather than domain concerns), and you can play with a deployed version live. The code is available under the MIT license, and we’ll be very happy to receive any suggested improvements.

Topics that are covered:

  1. Building an enterprise application with the .NET and RavenDB

  2. Non-Relational Data Modeling Through Domain Driven Design Prism

  3. Hidden side of document IDs in RavenDB

  4. Dynamic Fields for Indexing

  5. Entity Relationships in non-relational database (one-to-many, many-to-many)

  6. Multi-tenant database in NoSQL

  7. Database Integration Testing – The Secret Recipe

As usual, I would love any feedback you have to offer.

time to read 2 min | 380 words

I was looking into reducing the allocation in a particular part of our code, and I ran into what was basically the following code (boiled down to the essentials):

As you can see, this does a lot of allocations. The actual method in question was a pretty good size, and all those operations happened in different locations and weren’t as obvious.

Take a moment to look at the code, how many allocations can you spot here?

The first one, obviously, is the string allocation, but there is another one, inside the call to GetBytes(), let’s fix that first by allocating the buffer once (I’m leaving aside the allocation of the reusable buffer, you can assume it is big enough to cover all our needs):

For that matter, we can also easily fix the second problem, by avoiding the string allocation:

That is a few minutes of work, and we are good to go. This method is called a lot, so we can expect a huge reduction in the amount of memory that we allocated.

Except… that didn’t happen. In fact, the amount of memory that we allocate remained pretty much the same. Digging into the details, we allocate roughly the same number of byte arrays (how!) and instead of allocating a lot of strings, we now allocate a lot of character arrays.

I broke the code apart into multiple lines, which made things a lot clearer. (In fact, I threw that into SharpLab, to be honest). Take a look:

This code: buffer[..len] is actually translated to:

char[] charBuffer = RuntimeHelpers.GetSubArray(buffer, Range.EndAt(len));

That will, of course, allocate. I had to change the code to be very explicit about the types that I wanted to use:

This will not allocate, but if you note the changes in the code, you can see that the use of var in this case really tripped me up, because of the number of overloads and the automatic coercion of types that didn’t happen.

For that matter, note that any slicing on arrays will generate a new array, including this code:

This makes perfect sense when you realize what is going on, but it can still be a big surprise. I looked at the code a lot before I figured out what was going on, and that was with a profiler output that pinpointed the fault.
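The code in this post is C#, but the same copy-versus-view trap exists elsewhere. As a loose analogy only (this is JavaScript, not the code discussed above), typed arrays have two similar-looking operations: `slice` allocates a new backing buffer and copies into it, while `subarray` returns a view over the same memory.

```javascript
const buffer = new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8]);
const len = 4;

const copy = buffer.slice(0, len);    // allocates a new backing buffer and copies
const view = buffer.subarray(0, len); // a view over the same backing buffer

// Writing to the original shows which one shares memory with it.
buffer[0] = 42;
console.log(copy[0], view[0]);
console.log(copy.buffer === buffer.buffer, view.buffer === buffer.buffer);
```

The two calls are nearly identical at the call site, which is exactly why this kind of allocation is easy to miss in review.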

time to read 7 min | 1203 words

I have been doing Open Source work for just under twenty years at this point. I have been paying my mortgage from Open Source software for about 15. I’m stating that to explain that I have spent quite a lot of time struggling with the inherent tension between having an Open Source project and getting paid.

I wrote about it a few times in the past. It is not a trivial problem, and the core of the issue is not something that you can easily solve with technical means. I ran into this fascinating thread on Twitter over the weekend:

And another part of that is here:

I’m quoting the most relevant pieces, but the idea is pretty simple.

Donations don’t work, period. They don’t work not because companies are evil or developers don’t want to pay for Open Source. They don’t work because it takes a huge amount of effort to actually get paid.

If you are an independent developer, your purchasing process goes something like this:

  1. I would like to use this thing
  2. I need to pay for that
  3. The price matches the value I’m getting
  4. Where is my credit card…
  5. Paid!

Did you note step 2? The part about needing to pay?

If you don’t have that step, what will happen? Same scenario, an independent developer:

  1. I would like to use this thing
  2. I use this thing
  3. It would be great to pay something to show my appreciation
  4. Where did I put the credit card? Oh, it’s down the hall… I’ll get to that later (never).

That is the best-case scenario, where the thought of donating actually crossed your mind. In all likelihood, the process is more like:

  1. I would like to use this thing
  2. I use this thing
  3. Ticket closed, what is the next one… ?

Now, what happens if you are not an independent developer? Let’s say that you are a contract worker for a company. You need to talk to your contact person, they will need to get purchasing approval. Depending on the amount, that may require escalating upward a few levels, etc.

Let’s say that the amount is under $100, so basically within the budgetary discretion of the first manager you run into. They would still need to know what they are paying for and what they are getting out of it (they need to justify that). If this is a donation, welcome to the beauty of tax codes in multiple jurisdictions and what counts as such. If this is not a donation, what do they get? That means that you now have to do a meeting, potentially multiple ones. Present your case, open a new supplier at the company, etc.

The cost of all of those is high, both in time and money. Or… you can just run dotnet add package and move on.

In the case of RavenDB, it is Open Source software (with a license to match, and the code is freely available), but we treat it as a commercial project for all intents and purposes. If you want to install RavenDB, you’ll get a popup saying you need a license, directing you to a page where you see how much we would like to get, what you get in return, etc. That means that from a commercial perspective, we are on familiar ground for companies. They are used to paying for software, and there isn’t an option to just move on to the next task.

There is another really important consideration here. In the ideal Open Source donation model, money just shows up in your account. In the commercial world, there is a huge amount of work that is required to get things done. That is when you have a model where “the software does not work without a purchase”.  To give some context, 22% is Sales & Marketing and they spent around 21.8 billion in 2022 on Sales & Marketing. That is literally billions being spent to make sales.

If you want to make money, you are going to invest in sales, sales strategy, etc. I’m ignoring marketing here because if you are expected to make money from Open Source, you likely already have a project well-known enough to at least get started.

That means that you need to figure out what you are charging for, how you get customers, etc. In the case of RavenDB, we use the per-core model, which is a good indication of how much use the user is getting from RavenDB. LLBLGen Pro, on the other hand, charges per seat. Particular’s NServiceBus uses a per-endpoint / number-of-messages-a-day model.

There is no one model that fits all. And you need to be able to tailor your pricing model to how your users think about your software.

So pricing strategy, creating a proper incentive to purchase (hard limit, usually) and some sales organization to actually drive all of that are absolutely required.

Notice what is missing here? GitHub. It simply has no role at all up to this point. So why the title of this post?

There is one really big problem with getting paid that GitHub can solve for Open Source (and in general, I guess).

The whole process of actually getting paid is absolutely atrocious. In the best case, you need to create a supplier at the customer, fill out various forms (no, we don’t use child labor or slaves, indeed), figure out all sorts of weird rules (the German tax authority requires special dispensation, and let’s not talk about getting paid from India, etc). Welcome to Anti Money Laundering rules and GDPR compliance with Know Your Customer and SOC 2 regulations. The last sentence is basically nonsense words, but I understand that if you chant it long enough, you get money in the end.

What GitHub can do is be a payment pipe. Since presumably your organization is already set up with them in place, you can get them to do the invoicing, collecting the payment, etc. And in the end, you get the money.

That sounds exactly like GitHub Sponsorships, right? Except that in this case, this is not a donation. This is a flat-out simple transaction, with GitHub as the medium. The idea is that you have a limit, which you enforce, on your usage, and GitHub is how you are paid. The ability to do it in this fashion may make things easier, but I would assume that there are about three books’ worth of regulations and EULAs to go through to make it actually successful.

Yet, as far as I’m concerned, that is really the only important role that we have for GitHub here.

That is not a small thing, mind. But it isn’t a magic bullet.

time to read 3 min | 533 words

Measuring the length of time that a particular piece of code takes is a surprisingly challenging task. There are two aspects to this. The first is how to ensure that the cost of getting the start and end times won’t interfere with the work you are doing. The second is how to actually get the time (potentially many times a second) in as efficient a way as possible.

To give some context, Andrey Akinshin does a great overview of how the Stopwatch class works in C#. On Linux, that basically means calling clock_gettime, which looks like a system call, except that it is not really one. It is actually a piece of code that the Kernel sticks inside your process, which then integrates with other aspects of the Kernel to optimize this. The idea is that this call is so frequent that you cannot pay the cost of the Kernel mode transition. There is good coverage of this here.

In short, that is a very well-known problem and quite a lot of brainpower has been dedicated to solving it. And then we reached this situation:

image

What you are seeing here is us testing the indexing process of RavenDB under the profiler. This is indexing roughly 100M documents, and according to the profiler, we are spending 15% of our time gathering metrics?

The StatsScope.Start() method simply calls Stopwatch.Start(), so we are basically looking at a profiler output that says that Stopwatch is accounting for 15% of our runtime?

Sorry, I don’t believe that. I mean, it is possible, but it seems far-fetched.

In order to test this, I wrote a very simple program, which will generate 100K integers and test whether they are prime or not. I’m doing that to test compute-bound work, basically, and testing calling Start() and Stop() either across the whole loop or in each iteration.
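The shape of that experiment can be sketched as follows. This is a JavaScript analog for illustration, not the program I actually ran: it assumes Node.js and uses `process.hrtime.bigint()` where the original used Stopwatch, and the function names and the number range are arbitrary choices. Its timings are not comparable to the numbers in this post.

```javascript
// Naive trial-division primality test; deliberately compute-bound.
function isPrime(n) {
    if (n < 2) return false;
    for (let i = 2; i * i <= n; i++) {
        if (n % i === 0) return false;
    }
    return true;
}

// Times the whole loop with a single pair of clock reads.
function timeWholeLoop(numbers) {
    let primes = 0;
    const start = process.hrtime.bigint();
    for (const n of numbers) {
        if (isPrime(n)) primes++;
    }
    const elapsedNs = process.hrtime.bigint() - start;
    return { primes, elapsedNs };
}

// Reads the clock twice per iteration and accumulates the total.
function timePerIteration(numbers) {
    let primes = 0;
    let elapsedNs = 0n;
    for (const n of numbers) {
        const start = process.hrtime.bigint();
        if (isPrime(n)) primes++;
        elapsedNs += process.hrtime.bigint() - start;
    }
    return { primes, elapsedNs };
}

// 100K pseudo-random integers (the upper bound is an arbitrary choice).
const numbers = Array.from({ length: 100_000 },
    () => Math.floor(Math.random() * 1_000_000_000));

const whole = timeWholeLoop(numbers);
const perIteration = timePerIteration(numbers);
console.log("whole loop:    " + Number(whole.elapsedNs) / 1e6 + " ms");
console.log("per iteration: " + Number(perIteration.elapsedNs) / 1e6 + " ms");
```

Both variants do identical work on identical input, so any gap between the two totals is attributable to the extra clock reads (plus noise).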

I ran that a few times and got:

  • Windows: 311 ms with Stopwatch per iteration and 312 ms without
  • Linux: 450 ms with Stopwatch per iteration and 455 ms without

On Linux, there is a difference of about 5 ms between the two runs. On Windows, the per-iteration stopwatch costs either the same or is slightly cheaper.

Here is the profiler output on Windows:

image

And on Linux:

image

Now, that is what happens when we are doing a significant amount of work, what happens if the amount of work is negligible? I made the IsPrime() method very cheap, and I got:

image

So that is a good indication that this isn’t free, but still…

Comparing the costs, it is utterly ridiculous that the profiler says that so much time is spent in those methods.

Another aspect here may be the issue of the profiler impact itself. There are differences between using Tracing and Sampling methods, for example.

I don’t have an answer, just a lot of very curious questions.
