For Episode 123 of the CollabTalk Podcast, we explored the pivotal role of community in shaping businesses, discussing my guest’s founding of his company and the strategies for building and nurturing open-source communities. We covered the symbiosis between commercial success and community engagement, emphasizing the importance of community feedback in innovation and the challenges and benefits of integrating open-source models into business strategies. You can listen to the podcast above and follow me using your favorite app, such as Spotify, Apple Podcasts, Stitcher, Soundcloud, or the iHeartRadio app. Be sure to subscribe!
A couple of months ago I had the joy of giving an internal lecture to our developer group about Voron, RavenDB’s dedicated storage engine. In the lecture, I go over the design and implementation of our storage engine.
If you’ve ever had an interest in how RavenDB’s transactional, high-performance storage works, this is the lecture for you. Note that it is aimed at our developers, so we go deep.
You can find the slides here and here is the full video.
One of the most fun things that I do at work is share knowledge about how various things work. A few months ago I gave an internal talk about how certificates work. Instead of just describing the mechanism, I decided to actually walk our developers through the process of building the certificate infrastructure from scratch.
You can find the slides here and the full video is available online, it’s just over an hour of both lecture and discussion.
In this episode we talk to Dejan Milicic about the new version of RavenDB that just dropped.
We also talk about how important passion is as a developer.
RavenDB can run on the Raspberry Pi; in fact, this is an important use case for us, since our users deploy RavenDB as part of Internet of Things systems. We wanted to showcase RavenDB’s performance and decided that instead of scaling up and showing you how well RavenDB handles ridiculous loads, we would go the other way around. We’ll go small, and let you directly experience how efficient RavenDB is.
You can look at the demo unit directly on this page.
We decided to dial it down yet further, and run RavenDB on the Raspberry Pi Zero.
This tiny computer is about the size of a cigarette lighter and is small enough to comfortably fit on your keychain. Most Raspberry Pis are impressive machines given their cost, more than powerful enough to power real applications.
Here is what this actually looks like, with me as a reference for size 🙂.
However, just installing RavenDB on the Zero isn't much of a challenge or particularly interesting, to be honest. We wanted to do something that would be both fun and useful. One of the features we want users to explore is the ability to run RavenDB in appliance mode. The question is, what sort of an appliance will we build?
A key part of our thinking was that we wanted to show something that works with realistic data sizes. We wanted to have an actual use case for this, beyond just showing a toy. One of the things that I always find maddening about being disconnected is that I feel like half my brain has been cut away.
We set out to fix that: the project was to create a knowledge system inside the Pi Zero that would be truly Plug & Play. That turned out to be quite a challenge, but I think we met it in a very nice manner.
We went to archive.org and got some of the Stack Exchange data sets, focusing on the ones most relevant to DevOps scenarios: raspberrypi.stackexchange.com, unix.stackexchange.com, serverfault.com, and superuser.com.
I find it deliciously recursive that we can use the Raspberry Pi Zero to store the dataset about the Raspberry Pi itself. We loaded all those datasets into the Zero, for a total of about 7.5 GB, and over 4.2 million documents were stored there.
Note that this is using RavenDB’s document compression, which reduced the total size by over 50% over the original dataset size.
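For reference, loading those dumps can be done with the RavenDB Node.js client’s bulk insert. Here is a rough sketch of what that looks like; the connection details, document shape, and ids are my own assumptions rather than the project’s actual import code:

const { DocumentStore } = require("ravendb");

// The document shape is an assumption; the Stack Exchange dumps carry many more fields.
class Post {
    constructor(title, body, tags) {
        this.title = title;
        this.body = body;
        this.tags = tags;
    }
}

const store = new DocumentStore("http://localhost:8080", "StackExchange");
store.initialize();

async function importPosts(posts) {
    // bulkInsert() streams documents to the server in large batches,
    // which is far cheaper than opening a session per document.
    const bulk = store.bulkInsert();
    for (const p of posts) {
        await bulk.store(new Post(p.title, p.body, p.tags), "posts/" + p.id);
    }
    await bulk.finish();
}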
Next, it was time to actually make this accessible. Just working with RavenDB directly to query the data is cool, for sure, but we wanted something more broadly useful.
So we built a portal to access the data. Here is what it looks like when you enter it for the first time:
We offer full search capabilities and complete offline access to all those data sets. Perfect when you are stuck in the middle of nowhere and urgently need to remember that awk syntax or how to configure networking on a stubborn device.
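Behind that search box, this maps naturally to a RavenDB full-text query. As a minimal sketch with the Node.js client, assuming a Posts collection with a Body field (the portal’s actual schema and queries may differ):

// A sketch of the kind of query behind the portal's search box.
// The collection and field names are assumptions, not the portal's real schema.
async function searchPosts(store, terms) {
    const session = store.openSession();
    return await session
        .query({ collection: "Posts" })
        .search("Body", terms)   // full-text search over the field
        .take(25)                // keep result pages small on the Pi Zero
        .all();
}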
Another aspect we had to consider is how this can actually work. The Raspberry Pi Zero is a tiny device, and working with it can be annoying. It needs Micro-USB power but has no Ethernet or standard USB ports. For display, it uses a mini HDMI port. That means you can safely assume that you’re likely to have a power cable for it, but not much else.
We want to provide a good solution, so what do we do? The Raspberry Pi Zero we use does have a wifi chip, so we took things further and set it up as an access point with a captive portal.
You can read exactly how we configured that in this post.
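The short version, assuming the usual hostapd + dnsmasq setup on Raspberry Pi OS (my assumption about the general approach, not a copy of the actual configuration), looks roughly like this:

# /etc/hostapd/hostapd.conf - broadcast the open "Hugin" network
interface=wlan0
ssid=Hugin
hw_mode=g
channel=7

# /etc/dnsmasq.conf - hand out addresses and answer every DNS query
# with the Pi's own address, which is what triggers the captive portal
interface=wlan0
dhcp-range=192.168.4.10,192.168.4.100,24h
address=/#/192.168.4.1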
In other words, the expected deployment model is to plug this into power, wait 30 seconds for the machine to boot, and then connect to the “Hugin” wireless network. You will then land directly into the application, able to deep dive into the questions of your choice.
We have been giving away those appliances at the DevWeek conference, and we got a really good reaction from users. Beyond the coolness factor, the fact that we can run a high-performance system on top of a… challenging hardware platform (512MB of RAM, a 1GHz CPU, and an SD card for disk) and still provide sub-100ms response times is quite amazing.
You can view the project page here, the entire thing is Open Source, and you can explore how we are able to do that on GitHub.
Join Oren Eini, CEO of RavenDB, as he explores the design and implementation of RavenDB’s indexing engine Corax, its impact on indexing and query performance, and how the engine addresses common challenges such as slow data retrieval, high hosting expenses, and sluggish development processes. You’ll also gain valuable insights into the architecture's performance costs and its ability to unlock efficiency in data handling.
Following my previous post about updating the publishing platform of this blog, I realized that I had dug myself into a hole. The new workflow was pretty sweet, to the point where I wrote my blog posts a lot more frequently than before, as you can probably tell.
The problem was that I wanted to edit and process the blog post inside Google Docs, where I have a great workflow for editing, reviews, collaboration, etc. And then I want to push that same document to the blog. The killer for me is that I want that to be a smooth process, and the end text should fit into the blog. That means, if I want to emphasize something, it should be seen in the blog as bold. And if I want to write some code, that should work as well. In fact, the reason that I started this process is that it got so annoying to post code to the blog.
I’m using Google Docs’ export functionality to get the HTML back, and I did some basic cleaning to get it blog-ready instead of being focused on visual fidelity. I was using HTML Agility Pack to do that, and it turned out to be the wrong tool for the job. The issue is that it processes the data as if it were an XML document. I actually have a long track record with XML, so that wasn’t the issue. The problem is that I wanted to do a series of non-trivial things with the HTML, and there aren’t any off-the-shelf facilities to do that in .NET that I could find.
For example, given how important it is to me to show code snippets properly, I wanted to be able to grab them from the document, figure out what language I’m actually using there and syntax highlight it properly. There isn’t anything like that in .NET, all the libraries I found were for JavaScript.
You know the adage, “let’s rewrite it in Rust”? I rewrote my entire publishing process in JavaScript. Which then led me to another adventure. How can I do two contrary things? When I’m writing this document, I want to be able to just write the code. When I publish it, I want to see the syntax-highlighted code, properly formatted and working.
Google Docs has support for writing code blocks inline (for a small number of languages), which is great for the editing process. However, the HTML this generates is beyond atrocious. Even worse, the exported HTML doesn’t use fixed-width fonts or align things properly. In other words, it is almost there, but not quite.
When analyzing the Google Docs output, I noticed a couple of funny characters in the code output. Here is what it looks like. I believe this is a bug in the export process, probably related to the way code blocks work in Google Docs.
Dear Googlers, if you are reading this, please note that this behavior has just fallen under Hyrum’s Law. It is observable behavior, and I’m relying on it to do important tasks. Don’t break it in the future.
It turns out these are actually a pair of Unicode characters. More specifically, they are Unicode characters that are marked for private use:
- 0xEC03 - appears to be used to mark the beginning of a code block
- 0xEC02 - appears to be used to mark the end of a code block
Note the “appears”, and my blatant disregard for things like software maintenance discipline and all things proper and good in the world of Computer Science. This is a project where there are no rules, there is one customer, and he can code 🙂.
As mentioned earlier, while extracting the Google Doc as HTML and processing it, I encountered those Unicode markers that delineate the code sections. This is good, because in terms of the HTML itself, what is going on inside a code block is a… mess. Getting the actual text as it is supposed to be is not easy. So I exported the file again, as text. Those markers show up in the textual edition as well, which made things a lot easier for me.
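The post doesn’t spell out how the two exports are fetched; assuming it goes through the Drive API with the googleapis package, a sketch of getting both renditions might look like this (the auth setup and documentId are placeholders, not taken from the post):

const { google } = require("googleapis");

// A sketch under the assumption that the exports come from the Drive API.
// The auth object and documentId are placeholders.
async function exportDoc(auth, documentId) {
    const drive = google.drive({ version: "v3", auth });
    const html = await drive.files.export({ fileId: documentId, mimeType: "text/html" });
    const text = await drive.files.export({ fileId: documentId, mimeType: "text/plain" });
    // html.data holds the markup to publish; text.data still contains the
    // private-use marker characters that delineate each code block.
    return { html, text };
}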
With all of this done, allow me to show you some truly horrifying beautiful code:
let codeSegmentIndex = 0;
let blocks = [];
for (const match of text.data.matchAll(/\uEC03(.*?)\uEC02/gs)) {
    const code = match[1].trim();
    const lang = flourite(code, { shiki: true, noUnkown: true }).language;
    const formattedCode = Prism.highlight(code, Prism.languages[lang], lang);
    blocks.push("<hr/><pre class='line-numbers language-" + lang + "'>" +
        "<code class='line-numbers language-" + lang + "'>" +
        formattedCode + "</code></pre><hr/>");
}

let inCodeSegment = false;
htmlDoc.findAll().forEach(e => {
    var text = e.getText().trim();
    if (text == "\uEC03") { // start-of-code marker
        e.replaceWith(blocks[codeSegmentIndex++]);
        inCodeSegment = true;
    }
    if (inCodeSegment) {
        e.extract();
    }
    if (text == "\uEC02") { // end-of-code marker
        inCodeSegment = false;
    }
})
That isn’t a lot of code, but it does plenty. We scan through the textual version of the document and find all the code blocks using a regular expression. We then try to figure out what language I’m using and apply code formatting during the publication process (this saves the need to change anything on the blog, which is nice, especially since we have to take into account syndication).
I push the code snippets into an array and then I process the actual HTML document using the DOM and find all the code snippets. I replace the start marker with the actual formatted code and continue to discard all the other elements until I hit the end of the code segment. The rest of the code remains pretty much the same as before.
I was writing this in VS Code, and Copilot suggested the following code for handling images:
htmlDoc.findAll('img').forEach(img => {
    if (img.attrs.hasOwnProperty('src')) {
        let src = img.attrs.src;
        let imgName = src.split('/').pop();
        let imgData = entries.find(e => e.entryName === 'images/' + imgName).getData();
        let imgType = imgName.split('.').pop();
        let imgSrc = 'data:image/' + imgType + ';base64,' + imgData.toString('base64');
        img.replaceWith('<img src="' + imgSrc + '" style="float: right"/>');
    }
})
In other words, instead of uploading the images as separate files, I can just encode them into the blog post directly. I like that idea very much because it means that I don’t have to store the images elsewhere.
Given that I don’t have any npm packages to abandon, I don’t know if I can call myself a JavaScript developer, but I did put the full code up for people to take a peek and then recoil.
RavenDB will be participating in the DevWeek hackathon in February. The hackathon is now live, and we are offering prizes worth 4,000 USD for the top two winners.
The hackathon is open to both attendees of the DevWeek conference and the general public. The challenge we put forth is building a community sharing platform. I’m excited to see what kind of solutions will be submitted.
I will also be personally attending the DevWeek conference and would be very happy to meet you in person. Happy hacking!
I spoke with Jaime recently in the Modern .NET Podcast:
In this episode of The Modern .NET Show podcast, Oren Eini, a seasoned developer with over 20 years of experience in the .NET field, discussed the evolution of the .NET framework and the complexities that come with it. Eini highlighted the rapid pace of change in the language, from the introduction of generics at version 2.0 to switch expressions and pattern matching in the latest versions. While these new features allow for more concise code, Eini acknowledged that they also increase the scope and complexity of learning C# from scratch.
Would love to hear your feedback.
This was actually released a while ago; I was occupied with other matters and missed it.
I had a blast talking with Carl & Richard about data sharding and how we implemented that in RavenDB.
What is data sharding, and why do you need it? Carl and Richard talk to Oren Eini about his latest work on RavenDB, including the new data sharding feature. Oren talks about the power of sharding a database across multiple servers to improve performance on massive data sets. While a sharded database is typically in a single data center, it is possible to distribute the shards across multiple locations. The conversation explores the advantages and disadvantages of the different approaches, including that you might not need it today, but it's great to know it's there when you do!
You can listen to the podcast here.