time to read 8 min | 1542 words

That is one scary headline, isn't it? Go into the story and read all about it; there is a plot twist at the end.

This post was written at 05:40 AM, and I have spent the entire night up and awake*.

A customer called me in a state of panic: their database was not loading, and nothing they tried worked. I worked with him for a while to understand what was going on and how to try to recover the data.

Here is the story as I got it from the customer in question, only embellished a lot to give the proper context.

It all started with a test failure; in fact, it started with all the tests failing. The underlying reason was soon discovered: the test database disk was completely full, without enough space left to put even half a byte. The customer took a single look at that and ran a search on the hard disk to find what was taking so much space. The answer was apparent. The logs directory was full; in fact, the Webster dictionary would need to search hard and wide to find a word to express just how full it was.

So the customer in question did the natural thing and hit Shift+Delete to remove all those useless debug logs that had been cluttering the disk. He then started the server again, and off to the races. Except that there was a slight problem: when trying to actually load the database, the server choked, cried and ended up curled in a fetal position, refusing to cooperate even when a big stick was fetched and waved in its direction.

The problem was that the log files that were deleted were not debug logs. Those were the database transaction logs. Removing them from the disk made the database unable to recover, so it stopped and refused to work.

Now, remember that this is a test server, which explains why developers and operations guys are expected to do stuff to it. But the question was raised: what actually caused the issue? Can this happen in production as well? If it happens, can we recover from it? And more importantly, how can this be prevented?

The underlying reason for the unbounded transaction log growth was that the test server was an exact clone of the production system. One of the configuration settings that was defined was “enable incremental backups”. In order to enable incremental backups, we can’t delete old journal files; we have to wait for a backup to actually happen, at which point we can copy the log files elsewhere and then delete them. If you don’t back up a database marked with enable incremental backups, it can’t free the disk space, and hence, the problem.
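To make the retention rule concrete, here is a minimal illustrative sketch of the decision a storage engine has to make before deleting a journal file. This is not RavenDB's actual code; the method and parameter names are hypothetical:

// Illustrative sketch only -- not RavenDB's actual implementation.
// A journal can be deleted once its transactions are flushed to the data file,
// unless incremental backups are enabled, in which case it must also have been
// copied out by a backup first.
static bool CanDeleteJournal(
    long journalNumber,
    long lastFlushedJournal,
    long lastBackedUpJournal,
    bool incrementalBackupsEnabled)
{
    if (journalNumber > lastFlushedJournal)
        return false; // still needed for crash recovery

    if (incrementalBackupsEnabled && journalNumber > lastBackedUpJournal)
        return false; // must wait for the next incremental backup to copy it

    return true; // safe to delete and reclaim the disk space
}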

In production, regular backups are being run, so transaction log files were not piling up. But no one bothered to do any maintenance work on the test server, and the configuration explicitly forbade us from automatically handling this situation. With a safe-by-default mindset, however, we want to do everything we can so the operations guy notices the problem with enough time to do something about it. That’s why for 3.0 we are taking a proactive step toward this case, and we will alert when the database is about to run out of free space.
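The check behind such an alert is simple enough. Here is a minimal sketch of the idea using the standard .NET DriveInfo API; this is illustrative only, not RavenDB's implementation, and raiseLowDiskSpaceAlert is a hypothetical callback:

using System;
using System.IO;

// Periodically check free space on the drive that holds the database
// and alert before it actually runs out.
static void CheckFreeSpace(string databasePath, double minimumFreeGb,
                           Action<string, double> raiseLowDiskSpaceAlert)
{
    var drive = new DriveInfo(Path.GetPathRoot(databasePath));
    var freeGb = drive.AvailableFreeSpace / (1024.0 * 1024 * 1024);
    if (freeGb < minimumFreeGb)
        raiseLowDiskSpaceAlert(databasePath, freeGb); // hypothetical alert hook
}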

Now, for the actual corruption issue. Any database makes certain assumptions, and chief among them is that when you write to disk and actually flush the data, it isn’t going away or being modified behind our back. If that happens, whether because of disk corruption, manual intervention in the files, stupid anti virus software or just someone deleting files by accident, all bets are off, and there isn’t much that can be done to protect ourselves from this situation.

The customer in question? Federico from Corvalius.

[Corvalius logo]

Note that this post is pretty dramatized. This is a test server, not a production system, so the guarantees, expectations and behavior toward them are vastly different. The reason for making such a big brouhaha from what is effectively a case of fat fingers is that I wanted to discuss the high availability story with RavenDB.

The general recommendation we are moving toward in 3.0 is that any high availability story in RavenDB has to take the Shared Nothing approach. In effect, this means that you will not be using technologies such as Windows Clustering, because they rely on a common shared resource, such as the SAN. Issues there, which tend to creep up on you (running out of quota space in the SAN can happen very easily), can take down the whole system, even though you spent a lot of time and money on a supposedly highly available solution.

A shared nothing approach limits the potential causes for failure by having multiple nodes that can each operate independently. With RavenDB, this is done using replication: you define master/master replication between two nodes, and you usually run it with one primary node that your servers connect to. At that point, any failure in this node means automatic failover to the secondary, with no downtime. You don’t have to plan for it, you don’t have to configure it, It Just Works.

Now, that is almost true, because you need to be aware that in a split brain situation, you might have conflicts, but you can set a default strategy for that (server X is the authoritative source) or a custom conflict resolution policy.
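To make that concrete, here is a rough sketch of the client-side setup, assuming the Raven.Abstractions.Replication types and the "Raven/Replication/Destinations" document that the Replication bundle reads. Exact type, property and document names may differ between versions, so treat this as illustrative rather than the definitive API:

using System.Collections.Generic;
using Raven.Abstractions.Replication;
using Raven.Client.Document;

// The Replication bundle is assumed to be active on both databases.
var store = new DocumentStore
{
    Url = "http://primary:8080",
    DefaultDatabase = "Northwind",
    Conventions =
    {
        // Let the client fail over to the secondary if the primary is down.
        FailoverBehavior = FailoverBehavior.AllowReadsFromSecondariesAndWritesToSecondaries
    }
}.Initialize();

using (var session = store.OpenSession())
{
    // Tell this node about the other one; do the mirror image on the
    // secondary server to get master/master replication.
    session.Store(new ReplicationDocument
    {
        Destinations = new List<ReplicationDestination>
        {
            new ReplicationDestination
            {
                Url = "http://secondary:8080",
                Database = "Northwind"
            }
        }
    }, "Raven/Replication/Destinations");
    session.SaveChanges();
}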

Having two nodes means that you always have a hot spare, which can also handle scale-out scenarios by taking some of the load from the primary server if needed.

Beyond replication, we also need to ensure that the data we have is kept safe. A common request we heard from admins is that a hot spare is wonderful, but they don’t trust that they have a backup if they can’t put it on a tape and file it on a shelf somewhere. That also helps with issues such as offsite data storage in case you don’t have a secondary data center (if you do, put a replicated node there as well). This may sound paranoid, but an offline backup matters when, say, a batch job that was supposed to delete old customers deleted all customers instead: you won’t be very happy to realize that this batch delete was replicated to your entire cluster, your customer count is now zero, and it only declines from there. And that is the easy case. A worse case is when a bug in your code wrote bad data over time and you really want to go back to the database as it was two weeks ago; you can only do that from cold storage.

One way to do that is to actually do backups. The problem is that you usually go for full backups, which means that you might be backing up tens of GB on every backup, and that is very awkward to deal with. Incremental backups are certainly easier to work with. But when building highly available systems, I usually don’t bother with full backups. I already have the data in one or two additional locations, after all. I don’t care about quick restore at this point, because I can do that from one of the replicated nodes. What I do care about is having an offsite copy of the data that I can use if I ever need to. Because time to restore isn’t a factor, but convenience and management are, I would usually go with the periodic export system.

This is how it looks:

[Screenshot: periodic export configuration]

The Q drive is a shared network drive, and we do an incremental export to it every 10 minutes and a full export every 3 days.
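For reference, here is a rough sketch of configuring the same thing from code by storing the periodic export setup document. The document name ("Raven/Backup/Periodic/Setup"), the PeriodicExportSetup type and its property names are assumptions from the 2.5/3.0 era and may differ in your version; documentStore is assumed to be an initialized store for the target database. The studio screen above does the same thing:

// Assumed API: the periodic export bundle reads its configuration from a
// well-known document. Names below are assumptions and may differ by version.
using (var session = documentStore.OpenSession())
{
    session.Store(new PeriodicExportSetup
    {
        LocalFolderName = @"Q:\Exports\Northwind",                                    // the shared network drive
        IntervalMilliseconds = (long)TimeSpan.FromMinutes(10).TotalMilliseconds,      // incremental export
        FullBackupIntervalMilliseconds = (long)TimeSpan.FromDays(3).TotalMilliseconds // full export
    }, "Raven/Backup/Periodic/Setup");
    session.SaveChanges();
}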

I am aware that this is a pretty paranoid setup: we have multiple nodes holding the data, and exports of the data. Sometimes I even have each node export the data independently, for the “no data loss, ever” requirement.

Oh, and about Federico’s issue? While he was trying to see if he could fix the database in the event such a thing happened in production (in all 3 live replicas at once), he was already replicating to the test sandbox from one of the production live replicas. With big databases that takes time, but a high availability setup allows it. So even though the data file appears to be corrupted beyond repair, everything is good now.

* To be fair, that is because I’m actually at the airport waiting for a plane to take me on vacation, but I thought it was funnier to state it this way.

time to read 4 min | 784 words

It still feels funny to say that a major feature in a database product is the user interface, but I’m feeling a lot less awkward about saying that about the new studio now.

The obvious change here is that it is using HTML5, not Silverlight. That alone would be great, because Silverlight has gotten pretty annoying, but we have actually done much more than that. We moved to HTML5 and we added a lot of new features.

Here is how it looks:

[Screenshot: the new HTML5 studio]

Now, let me show you some of the new stuff. None of it is groundbreaking on its own, but combined they create a vastly improved experience.

Index copy/paste allows you to easily transfer index definitions from one database to another, without requiring any external tools.

[Screenshot: index copy/paste]

Also on indexing, we have the format index feature, which can take a nasty index and turn it into prettier and more easily understood code:

[Screenshot: the format index feature]

Speaking of code and indexing, did you notice the C# button there? Clicking on that will give you this:

[Screenshot: the generated C# index creation code]

Like the copy/paste index feature, the idea is that you can modify the index in the studio, play with the various options, then hit this button and copy working index creation code into your project, without worrying any more about how you are going to deploy it.
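A minimal sketch of what such index creation code looks like once it lands in your project, using the usual AbstractIndexCreationTask pattern. The index, class and property names here are illustrative (assuming a Northwind-style Order document), not the exact code the studio generates:

using System.Linq;
using Raven.Client.Indexes;

public class Order
{
    public string Company { get; set; }
}

// Illustrative index definition over the Orders collection.
public class Orders_ByCompany : AbstractIndexCreationTask<Order>
{
    public Orders_ByCompany()
    {
        Map = orders => from order in orders
                        select new { order.Company };
    }
}

// Deploying it, e.g. at application startup:
// IndexCreation.CreateIndexes(typeof(Orders_ByCompany).Assembly, documentStore);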

We also added some convenience features, such as computed columns. Let us see how that works. Here is the default view of the employees in the Northwind database:

[Screenshot: the default employees view]

That is nice, but it seems complex to me; all I care about is the full name and the age. So I head to the settings and define a few custom functions:

[Screenshot: defining custom functions in the settings]

I then head back to the employees collection and click on the grid icon at the header row, which gives me this:

[Screenshot: the column customization dialog]

After pressing “Save as default”, I can see that the values shown for employees are:

[Screenshot: employees shown with the computed columns]

You can also do the same for the results of specific queries or indexes, so you’ll have a better experience looking at the data. The custom functions also serve additional roles, but I’ll touch on them in a future post.

Speaking of queries, here is how they look:

[Screenshot: the query page]

 

Note the Excel icon at the top; you can export the data directly to Excel now. This is a common need if you have to send the data to a colleague or anyone on the business side of things. For that matter, you can also load data into RavenDB from a CSV file:

[Screenshot: importing data from a CSV file]

There is actually a lot more going on in the studio that I won’t talk about now: replication tracking, better metrics, etc. I’ll cover them in posts dedicated to the major bundles, and in a post (or posts) about better operations support.

I’ll leave you with one final feature, the map reduce visualizer:

[Screenshot: the map reduce visualizer]

More posts are coming.

time to read 5 min | 802 words

A frequent request from RavenDB users was the ability to store binary data. Be that actual documents (PDF, Word), images (user’s photo, accident images, medical scans) or very large items (videos, high resolution aerial photos).

RavenDB can do that, sort of, with attachments. But attachments were never a first class feature in RavenDB.

With RavenFS, files now have first class support. Here is a small screenshot; a detailed description of how it works is below.

[Screenshot: RavenFS in the studio]

The Raven File System exposes a set of files, which are binary data with a specific key. However, unlike a simple key/value store, RavenFS does much more than just store the binary values.

It was designed upfront to handle very large files (multiple GBs) efficiently, at both the API and storage layers. To the point where it can find common data patterns in distinct files (or even in the same file) and just point to them, instead of storing duplicate information. RavenFS is a replicated and highly available system; updating a file will only send the changes made to the file between the two nodes, not the full file. This lets you update very large files and only replicate the changes. This works even if you upload the file from scratch; you don’t have to deal with that manually.

Files aren’t just binary data. Files have metadata associated with them, and that metadata is available for searching. If you want to find all of Joe’s photos from May 2014, you can do that easily. The client API was carefully structured to give you full functionality even when sitting in a backend server: you can stream a value from one end of the system to the other without having to do any buffering.

Let us see how this works from the client side, shall we?

var fileStore = new FilesStore()
{
    Url = "http://localhost:8080",
    DefaultFileSystem = "Northwind-Assets",
};
fileStore.Initialize(); // the store needs to be initialized before opening sessions

using (var fileSession = fileStore.OpenAsyncSession())
{
    var stream = File.OpenRead("profile.png");
    var metadata = new RavenJObject
    {
        {"User", "users/1345"},
        {"Formal", true}
    };
    fileSession.RegisterUpload("images/profile.png", stream, metadata);
    await fileSession.SaveChangesAsync(); // actually upload the file
}

using (var fileSession = fileStore.OpenAsyncSession())
{
    // search on the metadata we attached during the upload
    var file = await fileSession.Query()
                    .WhereEquals("Formal", true)
                    .FirstOrDefaultAsync();

    var stream = await fileSession.DownloadAsync(file.Name);

    var destination = File.Create("profile.png");

    await stream.CopyToAsync(destination);
}

First of all, you start by creating a FilesStore, similar to RavenDB’s DocumentStore, and then create a session. RavenFS is fully async, and we don’t provide any sync API. The common scenario is working with large files, where blocking operations are simply not going to cut it.

Now we upload a file to the server. Note that at no point do we need to actually have the file in memory; we open a stream to the file and register that stream to be uploaded. Only when we call SaveChangesAsync will we actually read from that stream and write to the file store. You can also see that we are specifying metadata on the file. Later, we are going to be searching on that metadata. The result of the search is a FileHeader object, which is useful if you want to show the user a list of matching files. To actually get the contents of the file, you call DownloadAsync. Here, again, we don’t load the entire file into memory, but rather give you a stream for the contents of the file that you can send to its final destination.

A pretty simple and highly efficient process, overall.

RavenFS also has all the usual facilities you need from a data storage system, including full & incremental backups, full replication and high availability features. And while it has the usual file system folder model, to encourage familiarity, the most common usage is actually as a metadata driven system, where you locate the desired file by searching its metadata.

time to read 1 min | 171 words

If you have been following this blog at all, you must have heard quite a lot about Voron. If you haven’t been paying attention, you can watch my talk about it at length, or you can get the executive summary below.

The executive summary is that Voron is a high performance, low level transactional storage engine, written from scratch by Hibernating Rhinos with the intent of moving most or all of our infrastructure to it. RavenDB 3.0 can run on either Voron or Esent, and shows comparable performance using either one.

More importantly, because Voron was created by us, this means that we can do more with it, optimize it exactly to our own needs and requirements. And yes, one of those would be running on Linux machines.

Even more important, having Voron also allows us to create dedicated database solutions much more easily. One of those is RavenFS, obviously, but we have additional offerings that are just waiting to get out and blow your minds away.

time to read 2 min | 361 words

“I don’t know, why are you asking me such hard questions? It is new, it is better, go away and let me play with the fun stuff, I think that I got the distributed commit to work faster now. Don’t you have a real job to do?”

That, more or less, was my response when I was told that we really do need a “What has changed” list for RavenDB. And after some kicking and screaming, I agreed that this is indeed something that is probably not going to be optional. While I would love to just put a sticker saying “It is better, buy it!”, I don’t think that RavenDB is targeting that demographic.

There is a reason why I didn’t want to compile such a list. Work on RavenDB 3.0 actually started before 2.5 was even out, and it currently encompasses 1,270 resolved issues and 21,710 commits. The team size (as in people actually paid to work on this full time, excluding any outside contributions) grew to just over 20. And we had close to two years of work. In other words, this release represents a lot of work.

The small list that I had compiled contained over a hundred items. That is just too big to do justice to all the kinds of things we did. So I won’t be doing a single big list with all the details. Instead, I’m going to do a rundown of the new stuff in a separate blog post per area.

All the indexing improvements in one blog post, all the client API changes in another, etc.

At a very high level, here are the major areas that were changed:

  • Voron
  • RavenFS
  • HTML5 Studio
  • JVM API
  • Operations
  • Indexes & Queries

I’ll get to the details of each of those (and much more) in the upcoming posts.

Because there is so much good stuff, I'm afraid that I'll have to break tradition. For the following week or so, we are going to be moving to a two-posts-a-day schedule.

Also, please remember that we're hosting two RavenDB events in Malmo and Stockholm, Sweden next week. We'll be talking about all the new stuff.

RavenDB Events

time to read 1 min | 138 words

After about a week of running (and no further issues) on the Esent database, we have now converted the backend database behind this blog to Voron.

The process was done by:

  • Putting an App_Offline.html file in place for ayende.com
  • Exporting the data from blog.ayende.com database using the smuggler.
  • Deleting the database configuration from RavenDB, but retaining the actual database on disk.
  • Creating a new RavenDB database with Voron as the new blog.ayende.com database.
  • Importing the data from the previous export using smuggler.
  • Deleting the App_Offline.html file.

Everything seems to be operating normally, at least for now.

To my knowledge, this is the first production deployment of Voron.

time to read 4 min | 686 words

I got a log file with some request trace data from a customer, and I wanted a better view of what was actually going on. The log file size was 35MB, so that made things very easy.

I know about Log Parser, but to be honest, it would take more time to learn to use that effectively than to write my own tool for a single use case.

The first thing I needed to do was to actually get the file into a format that I could work with:

// Requires a reference to Microsoft.VisualBasic.dll
using Microsoft.VisualBasic.FileIO;

var file = @"C:\Users\Ayende\Downloads\u_ex140904\u_ex140904.log";
var parser = new TextFieldParser(file)
{
    CommentTokens = new[] { "#" },
    Delimiters = new[] { " " },
    HasFieldsEnclosedInQuotes = false,
    TextFieldType = FieldType.Delimited,
    TrimWhiteSpace = false,
};

// fields:
// "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query", "s-port", "cs-username", "c-ip",
// "cs(User-Agent)", "sc-status", "sc-substatus", "sc-win32-status", "time-taken"

var entries = new List<LogEntry>(); // LogEntry is a simple class with one property per field below

while (parser.EndOfData == false)
{
    var values = parser.ReadFields();
    if (values == null)
        break;
    var entry = new LogEntry
    {
        Date = DateTime.Parse(values[0]),
        Time = TimeSpan.Parse(values[1]),
        ServerIp = values[2],
        Method = values[3],
        Uri = values[4],
        Query = values[5],
        Port = int.Parse(values[6]),
        UserName = values[7],
        ClientIp = values[8],
        UserAgent = values[9],
        Status = int.Parse(values[10]),
        SubStatus = int.Parse(values[11]),
        Win32Status = int.Parse(values[12]),
        TimeTaken = int.Parse(values[13])
    };
    entries.Add(entry);
}

Since I want to run many queries, I just serialized the output to a binary file, to save the parsing cost next time. But the binary file (BinaryFormatter) was actually 41MB in size, and while text parsing took 5.5 seconds, the binary load process took 6.7 seconds.
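For completeness, here is a minimal sketch of that save/load round trip, assuming LogEntry is marked [Serializable]; the post doesn't show this part, so treat it as illustrative:

using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Serialize the parsed entries once, so later runs can skip the text parsing.
static void SaveEntries(string path, List<LogEntry> entries)
{
    using (var stream = File.Create(path))
        new BinaryFormatter().Serialize(stream, entries);
}

static List<LogEntry> LoadEntries(string path)
{
    using (var stream = File.OpenRead(path))
        return (List<LogEntry>)new BinaryFormatter().Deserialize(stream);
}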

After that, I can run queries like this:

var q = from entry in entries
        where entry.TimeTaken > 10
        group entry by new { entry.Uri }
        into g
        where g.Count() > 2
        select new
        {
            g.Key.Uri,
            Avg = g.Average(e => e.TimeTaken)
        }
        into r
        orderby r.Avg descending
        select r;

And start digging into what the data is telling me.

time to read 1 min | 113 words

I’m trying to get better insight into a set of log files sent by a customer. So I went looking for a tool that could do that, and I found Inidihiang. There is an x86 vs x64 issue that I had to work around, but then it just sat there trying to parse a 34MB log file.

I got annoyed enough that I actually checked, and this is the reason why:

[Screenshot]

Sigh…

I gave up on this and wrote my own stuff.
