Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,488
|
Comments: 51,038
Privacy Policy · Terms
filter by tags archive
time to read 9 min | 1696 words

Following my previous post about updating the publishing platform of this blog, I realized that I dug myself into a hole. The new workflow was pretty sweet. To the point where I wrote my blog posts a lot more frequently than before, as you can probably tell.

The problem was that I wanted to edit and process the blog post inside Google Docs, where I have a great workflow for editing, reviews, collaboration, etc. And then I want to push that same document to the blog. The killer for me is that I want that to be a smooth process, and the end text should fit into the blog. That means, if I want to emphasize something, it should be seen in the blog as bold. And if I want to write some code, that should work as well. In fact, the reason that I started this process is that it got so annoying to post code to the blog.

I’m using Google Docs’ export functionality to get the HTML back, and I did some basic cleaning to get it blog-ready instead of being focused on visual fidelity. I was using HTML Agility Pack to do that, and it turned out to be the wrong tool for the job. The issue is that it processed the data as if it were an XML document. I actually got a lot of track record with XML, so that wasn’t the issue. The problem is that I wanted to do a series of non-trivial things with the HTML, and there aren’t any off-the-shelf facilities to do that in .NET that I could find.

For example, given how important it is to me to show code snippets properly, I wanted to be able to grab them from the document, figure out what language I’m actually using there and syntax highlight it properly. There isn’t anything like that in .NET, all the libraries I found were for JavaScript.

You know the adage about: Let’s rewrite it in Rust? I rewrote my entire publishing process to JavaScript. Which then led me to another adventure. How can I do two contrary things? When I’m writing this document, I want to be able to just write the code. When I publish it, I want to see the syntax highlighted code, properly formatted and working.

Google Docs has support for writing code blocks inline (for some small number of languages), which is great for the editing process. However,  the HTML that this generates is beyond atrocious. What is even worse, in HTML, it doesn’t align things properly using fixed-sized fonts, etc. In other words, it is almost there, but not quite.

When analyzing the Google Docs output, I noticed a couple of funny characters in the code output. Here is what it looks like. I believe this is a bug in the export process, probably related to the way code blocks work in Google Docs.

Dear Googlers, if you are reading this, please make a note that this thing has just been Hyrum's Law. It is an observable state, and I’m relying on it to do important tasks. Don’t break this in the future.

It turns out these are actually a pair of Unicode characters. More specifically, they are Unicode characters that are marked for private use:

  • 0xEC03 - appears to be used to mark the beginning of a code block
  • 0xEC02 - appears to be used to mark the end of a code block

Note the “appears”, and my blatant disregard for things like software maintenance discipline and all things proper and good in the world of Computer Science. This is a project where there are no rules, there is one customer, and he can code 🙂.

As mentioned earlier, while extracting the Google Doc as HTML and processing it, I encounter those Unicode markers that delineate the code section. This is good, because in terms of HTML itself, what it is doing inside is a… mess. Getting the actual text as it is supposed to be is not easy. So I exported the file again, as text. Those markers are showing up in the textual edition as well, which made things a lot easier for me.

With all of this done, allow me to show you some truly horrifying beautiful code:


let blocks = [];
for (const match of text.data.matchAll(/\uEC03(.*?)\uEC02/gs)) {
    const code = match[1].trim();
    const lang = flourite(code, { shiki: true, noUnkown: true }).language;
    const formattedCode = Prism.highlight(code, Prism.languages[lang], lang);


    blocks.push("<hr/><pre class='line-numbers language-" + lang + ">" +
        "<code class='line-numbers language-" + lang + "'>" +
        formattedCode + "</code></pre><hr/>");
}


let inCodeSegment = false;
htmlDoc.findAll().forEach(e => {
    var text = e.getText().trim();
    if (text == "&#60419;") {
        e.replaceWith(blocks[codeSegmentIndex++]);
        inCodeSegment = true;
    }
    if (inCodeSegment) {
        e.extract();
    }
    if (text == "&#60418;") {
        inCodeSegment = false;
    }
})

That isn’t a lot of code, but it does plenty. We scan through the textual version of the document and find all the code blocks using a regular expression. We then try to figure out what language I’m using and apply code formatting during the publication process (this saves the need to change anything on the blog, which is nice, especially since we have to take into account syndication).

I push the code snippets into an array and then I process the actual HTML document using the DOM and find all the code snippets. I replace the start marker with the actual formatted code and continue to discard all the other elements until I hit the end of the code segment. The rest of the code remains pretty much the same as before.

I was writing this in VS Code and copilot suggested the following code for handling images:


htmlDoc.findAll('img').forEach(img => {
    if (img.attrs.hasOwnProperty('src')) {
        let src = img.attrs.src;
        let imgName = src.split('/').pop();
        let imgData = entries.find(e => e.entryName === 'images/' + imgName).getData();
        let imgType = imgName.split('.').pop();
        let imgSrc = 'data:image/' + imgType + ';base64,' + imgData.toString('base64');
        img.replaceWith('<img src="' + imgSrc + '" style="float: right"/>');
    }
})

In other words, instead of uploading the images as separate files, I can just encode them into the blog post directly. I like that idea very much because it means that I don’t have to store the images elsewhere.

Given that I don’t have any npm packages to abandon, I don’t know if I can call myself a JavaScript developer, but I did put the full code up for people to take a peek and then recoil.

time to read 20 min | 3954 words

I've been writing this blog since 2004. That means I have been doing this for twenty years, which is frankly unbelievable to me. The actual date is sometime in April, so I’ll probably do a summary post then about that.

What I want to talk about today is a different aspect. The mechanism and processes I use to write blog posts. A large part of the reason I write blog posts is that it helps me understand and organize my own thoughts. And in order to do that effectively, I have found that I need very little friction in the blogging process.

About a decade ago, Google Reader was shut down, and I’m still very bitter about that. It effectively killed a significant portion of the blogging audience and made the ergonomics of reading blogs a lot harder. That also led people to use walled gardens to communicate with others, instead of the decentralized network and feed aggregators. A side effect of that decision is that blogging tools have stopped being a viable thing people spend time or money on.

At the time, I was using Windows Live Writer, which was a high-quality editor and had a rich plugin system. Microsoft discontinued it at some point, it became an open-source project, and even that died. The website is no longer functional and even in terms of the GitHub project, the last commit was 5 years ago.

I’m still using Open Live Writer to write the majority of my blog posts, but given there are no longer any plugins, even something as simple as embedding code in my posts has become an… annoyance. That kills the ergonomics of blogging for me.

Not a problem, this is Open Source, and I can do that myself. Except… I really don’t have the time to spend on something ancillary like that. I would happily pay (a reasonable amount) for a blogging client, but I’m going to assume that I’m not part of a large enough group that there is a market for this.

Taking the code snippets example, I can go into the code, figure out what is going on there, and add a “code snippet” feature. I estimate that would take several days. Alternatively, I can place the code as a GitHub gist and embed it in the page. It is annoying, but far quicker than going to the trouble of figuring that out.

Another issue that bugs me (pun intended) is a problem with copy/paste of images, where taking screenshots using the Snipping Tool doesn’t paste into Writer. I need to first paste them into Paint, then into Writer. In this case, I assume that Writer doesn’t recognize the clipboard format or something similar.  

Finally, it turns out that I’m not writing blog posts in the same manner as I used to. It got to the point where I asked people to review my posts before making them public. It turns out that no matter how many times it is corrected, my brain seems unable to discern when to write “whether” or “whatever”, for example. At this point I gave up updating that piece of software 🙂. Even the use of emojis doesn’t work properly (Open Live Writer mostly predates a lot of them and breaks the HTML in a weird fashion 🤷).

In other words, there are several problems in my current workflow, and it has finally reached the point where I need to do something about it. The last requirement, by the way, is the most onerous. Consider the workflow of getting the following fixes to a blog post:

  • and we run => and we ran
  • we spend => we spent

Where is my collaborating editing and the ability to suggest changes with good UX? Improving the ergonomics for the blog has just expanded in scope massively. Now it is a full-fledged publishing platform with modern sensibilities. It’s 2024, features like proper spelling and grammar corrections should absolutely be there, no? And what about AI integration? It turns out that predicting text makes the writing process more efficient. Here is what this may look like:

At this stage, this isn’t just a few minor fixes. I should mention that for the past decade and a half or so, I stopped considering myself as someone who can do UI in any meaningful manner. I find that the <table/> tag, which used to be my old reliable method, is not recommended now, for some reason.

This… kind of sucks. I want to upgrade my process by a couple of decades, but I don’t want to pay the price for that. If only there was an easier way to do that.

I started using Google Docs to edit my blog posts, then pasting them into Live Writer or directly to the blog (using a Rich Text Box with an editor from… a decade ago). I had to check the source code for this, by the way. The entire experience is decidedly Developer UX. Then I had a thought, I already have a pretty good process of writing the blog posts in Google Docs, right? It handles rich text editing and management much better than the editor in the blog. There are also options for things like proper workflows. For example, someone can go over my drafts and make comments or suggestions.

The only thing that I need is to put both of those together. I have to admit that I spent quite some time just trying to figure out how to get the document from Google Docs using code. The authentication hurdles are… significant to someone who isn’t aware of how it all plugs together. Once I got that done, I got my publishing platform with modern features. Here is what the end result looks like:


public class PublishingPlatform
{
  private readonly DocsService GoogleDocs;
  private readonly DriveService GoogleDrive;
  private readonly Client _blogClient;


  public PublishingPlatform(string googleConfigPath, string blogUser, string blogPassword)
  {
    var blogInfo = new MetaWeblogClient.BlogConnectionInfo(
     "https://ayende.com/blog",
     "https://ayende.com/blog/Services/MetaWeblogAPI.ashx",
     "ayende.com", blogUser, blogPassword);
    _blogClient = new MetaWeblogClient.Client(blogInfo);


    var initializer = new BaseClientService.Initializer
    {
      HttpClientInitializer = GoogleWebAuthorizationBroker.AuthorizeAsync(
          GoogleClientSecrets.FromFile(googleConfigPath).Secrets,
          new[] { DocsService.Scope.Documents, DriveService.Scope.DriveReadonly },
          "user", CancellationToken.None,
          new FileDataStore("blog.ayende.com")
      ).Result
    };


    GoogleDocs = new DocsService(initializer);
    GoogleDrive = new DriveService(initializer);
  }


  public void Publish(string documentId)
  {
    using var file = GoogleDrive.Files.Export(documentId, "application/zip").ExecuteAsStream();
    var zip = new ZipArchive(file, ZipArchiveMode.Read);


    var doc = GoogleDocs.Documents.Get(documentId).Execute();
    var title = doc.Title;


    var htmlFile = zip.Entries.First(e => Path.GetExtension(e.Name).ToLower() == ".html");
    using var stream = htmlFile.Open();
    var htmlDoc = new HtmlDocument();
    htmlDoc.Load(stream);
    var body = htmlDoc.DocumentNode.SelectSingleNode("//body");


    var (postId, tags) = ReadPostIdAndTags(body);


    UpdateLinks(body);
    StripCodeHeader(body);
    UploadImages(zip, body, GenerateSlug(title));


    string post = GetPostContents(htmlDoc, body);


    if (postId != null)
    {
      _blogClient.EditPost(postId, title, post, tags, true);
      return;
    }


    postId = _blogClient.NewPost(title, post, tags, true, null);


    var update = new BatchUpdateDocumentRequest();
    update.Requests = [new Request
    {
      InsertText = new InsertTextRequest
      {
        Text = $"PostId: {postId}\r\n",
        Location = new Location
        {
          Index = 1,
        }
      },
    }];


    GoogleDocs.Documents.BatchUpdate(update, documentId).Execute();
  }


  private void StripCodeHeader(HtmlNode body)
  {
    foreach(var remove in body.SelectNodes("//span[text()='&#60419;']").ToArray())
    {
      remove.Remove();
    }
    foreach (var remove in body.SelectNodes("//span[text()='&#60418;']").ToArray())
    {
      remove.Remove();
    }
  }


  private static string GetPostContents(HtmlDocument htmlDoc, HtmlNode body)
  {
    // we use the @scope element to ensure that the document style doesn't "leak" outside
    var style = htmlDoc.DocumentNode.SelectSingleNode("//head/style[@type='text/css']").InnerText;
    var post = "<style>@scope {" + style + "}</style> " + body.InnerHtml;
    return post;
  }


  private static void UpdateLinks(HtmlNode body)
  {
    // Google Docs put a redirect like: https://www.google.com/url?q=ACTUAL_URL
    foreach (var link in body.SelectNodes("//a[@href]").ToArray())
    {
      var href = new Uri(link.Attributes["href"].Value);
      var url = HttpUtility.ParseQueryString(href.Query)["q"];
      if (url != null)
      {
        link.Attributes["href"].Value = url;
      }
    }
  }


  private static (string? postId, List<string> tags) ReadPostIdAndTags(HtmlNode body)
  {
    string? postId = null;
    var tags = new List<string>();
    foreach (var span in body.SelectNodes("//span"))
    {
      var text = span.InnerText.Trim();
      const string TagsPrefix = "Tags:";
      const string PostIdPrefix = "PostId:";
      if (text.StartsWith(TagsPrefix, StringComparison.OrdinalIgnoreCase))
      {
        tags.AddRange(text.Substring(TagsPrefix.Length).Split(","));
        RemoveElement(span);
      }
      else if (text.StartsWith(PostIdPrefix, StringComparison.OrdinalIgnoreCase))
      {
        postId = text.Substring(PostIdPrefix.Length).Trim();
        RemoveElement(span);
      }
    }
    // after we removed post id & tags, trim the empty lines
    while (body.FirstChild.InnerText.Trim() is "&nbsp;" or "")
    {
      body.RemoveChild(body.FirstChild);
    }
    return (postId, tags);
  }


  private static void RemoveElement(HtmlNode element)
  {
    do
    {
      var parent = element.ParentNode;
      parent.RemoveChild(element);
      element = parent;
    } while (element?.ChildNodes?.Count == 0);
  }


  private void UploadImages(ZipArchive zip, HtmlNode body, string slug)
  {
    var mapping = new Dictionary<string, string>();
    foreach (var image in zip.Entries.Where(x => Path.GetDirectoryName(x.FullName) == "images"))
    {
      var type = Path.GetExtension(image.Name).ToLower() switch
      {
        ".png" => "image/png",
        ".jpg" or "jpeg" => "image/jpg",
        _ => "application/octet-stream"
      };
      using var contents = image.Open();
      var ms = new MemoryStream();
      contents.CopyTo(ms);
      var bytes = ms.ToArray();
      var result = _blogClient.NewMediaObject(slug + "/" + Path.GetFileName(image.Name), type, bytes);
      mapping[image.FullName] = new UriBuilder { Path = result.URL }.Uri.AbsolutePath;
    }
    foreach (var img in body.SelectNodes("//img[@src]").ToArray())
    {
      if (mapping.TryGetValue(img.Attributes["src"].Value, out var path))
      {
        img.Attributes["src"].Value = path;
      }
    }
  }


  private static string GenerateSlug(string title)
  {
    var slug = title.Replace(" ", "");
    foreach (var ch in Path.GetInvalidFileNameChars())
    {
      slug = slug.Replace(ch, '-');
    }


    return slug;
  }
}

You’ll probably not appreciate this, but the fact that I can just push code like that into the document and get it with proper formatting easily is a major lifestyle improvement from my point of view.

The code works with the document in two ways. First, in the Document DOM (which is quite complex), it extracts the title of the blog post and afterward updates it with the document ID. But the core of this code is to extract the document as a zip file, grab everything from there, and push that to the blog. I do some editing for the HTML to get everything set up properly, mostly editing the links and uploading the images. There is also some stuff happening with CSS scopes that I frankly don’t understand. I think I got it right, which is fine for now.

This cost me a couple of evenings, and it was fun. Nothing earth-shattering, I’ll admit. But it’s the first time in a while that I actually wrote a piece of code that was immediately useful. My blogging queue is rather full, and I hope that with this new process it will be easier to push the ideas out of my head and to the blog.

And with that, it is now 01:26 AM, and I’m going to call it a night 🙂.

And as a final thought, I had just made several changes to the post after publication, and it went smoothly. I think that I like it.

FUTURE POSTS

  1. RavenDB Cloud Global Status vs. Product Status - about one day from now
  2. Recording: Technology & Friends - Oren Eini on the Corax Search Engine - 3 days from now
  3. RavenDB and Two Factor Authentication - 4 days from now

There are posts all the way to Mar 06, 2024

RECENT SERIES

  1. Recording (13):
    15 Jan 2024 - S06E09 - From Code Generation to Revolutionary RavenDB
  2. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
  3. Production postmortem (51):
    12 Dec 2023 - The Spawn of Denial of Service
  4. Challenge (74):
    13 Oct 2023 - Fastest node selection metastable error state–answer
  5. Filtering negative numbers, fast (4):
    15 Sep 2023 - Beating memcpy()
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}