Ayende @ Rahien


Raven.Storage – that ain’t proper logging for our kind of environment

Continuing with my work on porting leveldb to .NET, we ran into another problem: the log file. The log file is pretty important, this is how you ensure durability, so any problem there is a big cause for concern.

You can read a bit about the format used by leveldb here, but basically, it uses the following:

block := record* trailer?
record :=
  checksum: uint32    // crc32c of type and data[] ; little-endian
  length: uint16      // little-endian
  type: uint8         // One of FULL, FIRST, MIDDLE, LAST
  data: uint8[length]

A block is 32KB in size.

The type can be First, Middle, Last or Full, since it is legitimate to split a record across multiple blocks. The reasoning behind this format is outlined in the link above.

It is also a format that assumes that you know, upfront, the entire size of your record, so you can split it accordingly. That makes a lot of sense when working in C++ and passing buffers around.

 

This is straightforward in C++, where the API is basically:

Status Writer::AddRecord(const Slice& slice)

(Slice is basically just a byte array).

In .NET, we do not want to be passing buffers around, mostly because of the impact on the LOH. So we had to be a bit smarter about things. In particular, we had an interesting issue with streaming the results: if I want to write a document with a size of 100KB, how do I handle that?

Instead, I wanted it to look like this:

var buffer = BitConverter.GetBytes(seq);
await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);
buffer = BitConverter.GetBytes(opCount);
await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);

foreach (var operation in writes.SelectMany(writeBatch => writeBatch._operations))
{
    buffer[0] = (byte) operation.Op;
    await state.LogWriter.WriteAsync(buffer, 0, 1);
    await state.LogWriter.Write7BitEncodedIntAsync(operation.Key.Count);
    await state.LogWriter.WriteAsync(operation.Key.Array, operation.Key.Offset, operation.Key.Count);
    if (operation.Op != Operations.Put)
        continue;
    using (var stream = state.MemTable.Read(operation.Handle))
        await stream.CopyToAsync(state.LogWriter); // copy the value into the log, not back into the source stream
}

The problem with this approach is that we don't know, upfront, what size we are going to have. This means that we don't know how to split the record, because we don't have the record until it is over. And we don't want to (actually, can't) go back in the log and change things to set the record straight (pun intended).

What we ended up doing is this:

[Image: the log writer API, with explicit calls to mark the start and end of a record]

Note that we explicitly mark the start / end of the record, and in the meantime, we can push however many bytes we want. Internally, we buffer up to 32KB in size (a bit less, actually, but good enough for now) and based on the next call, we decide whether the current block should be marked as good or bad.
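
In rough terms, the shape of the API we ended up with is something along these lines (a sketch only; the names here are illustrative, not the actual Raven.Storage types):

// Illustrative shape only, not the actual Raven.Storage API.
public interface ILogRecordWriter
{
    // Mark the beginning of a record; from here on we buffer up to ~32KB at a time.
    void StartRecord();

    // Push however many bytes we want; the writer decides how to split them into
    // FULL / FIRST / MIDDLE / LAST fragments at block boundaries.
    Task WriteAsync(byte[] buffer, int offset, int count);

    // Mark the end of the record; the buffered tail is flushed as FULL or LAST.
    Task EndRecordAsync();
}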

This is important because it allows us to keep the same format as leveldb, with all of the benefits for dealing with corrupted data, if we need to. I also really like the idea of being able to have parallel readers on the log file, because we know that we can just skip ahead at block boundaries.


Raven’s Scripted Index Results

Scripted Index Results (I wish it had a better name) is a really interesting new feature in RavenDB 2.5. As the name implies, it allows you to attach scripts to indexes. Those scripts can operate on the results of the indexing.

Sounds boring, right? But the options that it opens are anything but. Using Scripted Index Results you can get recursive map/reduce indexes, for example. But we won't be doing that today. Instead, I'll show how you can enhance entities with additional information from other sources.

Our sample database is Northwind, and we have defined the following index to get some statistics about our customers:

[Image: the index definition]
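
The index in question is roughly along these lines (a sketch; the class and field names are assumptions based on the Northwind sample data, not the exact definition from the screenshot):

// A map/reduce index that aggregates order statistics per company (illustrative).
public class Companies_Stats : AbstractIndexCreationTask<Order, Companies_Stats.Result>
{
    public class Result
    {
        public string Company { get; set; }
        public int Count { get; set; }
        public decimal Total { get; set; }
    }

    public Companies_Stats()
    {
        Map = orders => from order in orders
                        select new
                        {
                            order.Company,
                            Count = 1,
                            Total = order.Lines.Sum(l => (l.Quantity * l.PricePerUnit) * (1 - l.Discount))
                        };

        Reduce = results => from result in results
                            group result by result.Company into g
                            select new
                            {
                                Company = g.Key,
                                Count = g.Sum(x => x.Count),
                                Total = g.Sum(x => x.Total)
                            };
    }
}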

And we can query it like this:

[Image: querying the index in the studio]

However, what we want to do is to be able to embed those values inside the company document, so we won’t have to query for it separately. Here is how we can use the new Scripted Index Results bundle to do this:

[Image: the Scripted Index Results setup]
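
Conceptually, the setup boils down to storing a configuration document that holds the scripts for the index. A sketch of the idea (the document id and property names here are assumptions based on the bundle's conventions; check the documentation for the exact details):

// Attach scripts to the Companies/Stats index (assumed name) via the bundle's setup document.
store.DatabaseCommands.Put(
    "Raven/ScriptedIndexResults/Companies/Stats", // assumed id format: Raven/ScriptedIndexResults/<index name>
    null,
    RavenJObject.FromObject(new
    {
        // Runs for every index entry: copy the aggregated values into the company document.
        IndexScript = @"
var company = LoadDocument(this.Company);
if (company == null) return;
company.Orders = { Count: this.Count, Total: this.Total };
PutDocument(this.Company, company);",
        // Runs when an entry is removed from the index: clean up after ourselves.
        DeleteScript = @"
var company = LoadDocument(key);
if (company == null) return;
delete company.Orders;
PutDocument(key, company);"
    }),
    new RavenJObject());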

Once we have defined that, whenever the index is done, it will run these scripts, and that, in turn, means that this is what our dear ALFKI looks like:

[Image: the updated ALFKI company document]

I'll leave recursive map/reduce as a tidbit for my dear readers :)


RavenDB’s Dynamic Reporting

One of the nice things about having a more realistic demo data set is that we can now actually show demos on that data. I didn't realize how much of an issue that was until we actually improved things.

Let us see how we are going to demo RavenDB’s dynamic reporting feature.

We start by creating the following index. It is a pretty simple one; the one thing to notice is that we are explicitly setting the Sort mode for Total to be Double.

[Image: the Orders/Total index definition]
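
The index boils down to something like the following (a sketch; the field names are assumptions based on the Northwind sample data):

// Orders/Total: a simple map index, with Total explicitly sorted as Double for reporting.
public class Orders_Total : AbstractIndexCreationTask<Order, Orders_Total.Result>
{
    public class Result
    {
        public string Employee { get; set; }
        public string Company { get; set; }
        public decimal Total { get; set; }
    }

    public Orders_Total()
    {
        Map = orders => from order in orders
                        select new
                        {
                            order.Employee,
                            order.Company,
                            Total = order.Lines.Sum(l => (l.Quantity * l.PricePerUnit) * (1 - l.Discount))
                        };

        Sort(x => x.Total, SortOptions.Double); // the one thing to notice: numeric sorting on Total
    }
}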

Now that we have done that, we are going to go to Query > Reporting:

[Image: the Reporting option under Query > Reporting]

And then I can start issuing reporting queries:

[Image: a reporting query and its results]

This is the equivalent of doing:

select EmployeeID, sum(tot.Total) Total from Orders o join 
    (
        select sum((Quantity * UnitPrice) * (1- Discount)) Total, OrderId from [Order Details]
        group by OrderID
    ) tot
    on o.OrderID = tot.OrderID
where o.CustomerID = @CustomerId
group by EmployeeID

The nice thing about this, and what makes this feature different from standard map/reduce, is that you can filter the input data into the aggregation.

In code, this would look something like this:

session.Query<Order>("Orders/Total")
  .Where(x => x.Company == companyId)
  .AggregateBy(x => x.Employee)
  .SumOn(x => x.Total)
  .ToList();

Pretty nice, even if I say so myself.


RavenDB Sample Data: Hello Northwind

RavenDB has always come with some sample data that you could use to test things out. Unfortunately, that sample data was pretty basic, and didn't really cover a lot of interesting scenarios.

For RavenDB 2.5, we updated the sample data to use the brand new and amazing (wait for it…) Northwind database.

[Image: the Northwind sample data in the studio]

At a minimum, it would make demoing stuff easier. And in order to make things even nicer, you can get the C# classes for the sample data here.


Challenge: The problem of locking down tasks…

The following code has a very subtle bug:

public class AsyncQueue
{
    private readonly Queue<int> items = new Queue<int>();
    private volatile LinkedList<TaskCompletionSource<object>> waiters = new LinkedList<TaskCompletionSource<object>>();

    public void Enqueue(int i)
    {
        lock (items)
        {
            items.Enqueue(i);
            while (waiters.First != null)
            {
                waiters.First.Value.TrySetResult(null);
                waiters.RemoveFirst();
            }
        }
    }

    public async Task<IEnumerable<int>> DrainAsync()
    {
        while (true)
        {
            TaskCompletionSource<object> taskCompletionSource;
            lock (items)
            {
                if (items.Count > 0)
                {
                    return YieldAllItems();
                }
                taskCompletionSource = new TaskCompletionSource<object>();
                waiters.AddLast(taskCompletionSource);
            }
            await taskCompletionSource.Task;
        }
    }

    private IEnumerable<int> YieldAllItems()
    {
        while (items.Count > 0)
        {
            yield return items.Dequeue();
        }
    }
}

I’ll even give you a hint, try to run the following client code:

for (int i = 0; i < 1000 * 1000; i++)
{
    q.Enqueue(i);
    if (i % 100 == 0)
    {
        Task.Factory.StartNew(async () =>
            {
                foreach (var result in await q.DrainAsync())
                {
                    Console.WriteLine(result);
                }
            });
    }
}
Can you figure out what the problem is?


Special Offer: 29% discount for all our products

Well, it is nearly the 29th of May, and that means that I have been married for two years.

To celebrate that, I am offering a 29% discount on all our products (RavenDB, NHibernate Profiler, Entity Framework Profiler, etc).

All you have to do is purchase any of our products using the following coupon code:

2nd anniversary

This offer is valid until the end of the month, so hurry up.

Raven Xyz: Trying out some ideas

One of the things that we are planning for Raven 3.0 is the introduction of additional options. In addition to having RavenDB, we will also have RavenFS, which is a replicated file system with an eye toward very large files. But that isn't what I want to talk about today. Today I would like to talk about something that is currently just in my head. I don't even have a proper name for it yet.

Here is the deal, RavenDB is very good for data that you care about individually. Orders, customers, etc. You track, modify and work with each document independently. If you are writing a lot of data that isn’t really relevant on its own, but only as an aggregate, that is probably not a good use case for RavenDB.

Examples of such things include logs, click streams, event tracking, etc. The trivial example would be any reality show, where you have a lot of users sending messages to vote for a particular candidate, and you don't really care about the individual data points, only the aggregate. Another example might be tracking how many items were sold in a particular period, broken down by region, etc.

The API that I had in mind would be something like:

foo.Write(new PurchaseMade { Region = "Asia", Product = "products/1", Amount = 23 });
foo.Write(new PurchaseMade { Region = "Europe", Product = "products/3", Amount = 3 });

And then you can write map/reduce statements on them like this:

// map
from purchase in purchases
select new
{
    purchase.Region,
    purchase.Product,
    purchase.Amount
}

// reduce
from result in results
group result by new { result.Region, result.Product }
into g
select new
{
    g.Key.Region,
    g.Key.Product,
    Amount = g.Sum(x => x.Amount)
}

Yes, this looks pretty much like you would have in RavenDB, but there are important distinctions:

  • We don’t allow modifying writes, nor deleting them.
  • Most of the operations are assumed to be made on the result of the map/reduce statements.
  • The assumption is that you don’t really care for each data point.
  • There is going to be a lot of those data points, and they are likely to be coming in at a relatively high rate.

Thoughts?


RavenDB, Victory

Jeremy Miller's post about whether he would use RavenDB again has been making the rounds. It is a good post, and I was asked to comment on it by multiple people.

I wanted to comment very briefly on some of the issues that were brought up:

  • Memory consumption – this is probably mostly related to the long term session usage, which we expect to be much more short lived.
    • The 2nd level cache is mostly there to speed things up when you have relatively small documents. If you have very large documents, or routinely have requests that return many documents, that can be a memory hog. That said, the 2nd level cache is limited to 2,048 items by default, so that shouldn’t really be a big issue. And you can change that (or even turn it off) with ease.
  • Don't abstract RavenDB too much – yeah, that has pretty much been our recommendation for a while.
    • I don’t see this as a problem. You have just the same issue if you are using any OR/M against an RDBMS.
  • Bulk Insert – the issue has already been fixed. In fact, IIRC, it was fixed within a day or two of the issue being brought up.
  • Eventual Consistency – Yes, you need to decide how to handle that. As Jeremy said, there are several ways of handling that, from using natural keys with no query latency associated with them to calling WaitForNonStaleResultsAsOfNow();

Truthfully, the thing that really caught my eye wasn’t Jeremy’s post, but one of the comments:

[Image: one of the comments on the post]

Thank you, we spend a lot of time on that!


Fixing up the build process

There is a big problem in the RavenDB build process. To be rather more exact, there is a… long problem in the RavenDB build process.

[Image: the build process duration]

As you can imagine, when the build process runs for that long, it doesn't get run too often. We have already had several rounds of "let us optimize the build". But… the actual reason for the tests taking this long is a bit sneaky.

[Image: the test run statistics]

To save you from having to do the math, this means an average of 1.15 seconds per test.

In most tests, we actually have to create a RavenDB instance. That doesn't take too long, but it does take some time. And we have a lot of tests that use the network, because we need to test how RavenDB works on the wire.

From that perspective, it means that we don't seem to have any real options. Even if we cut the average cost of running the tests by half, it would still be a 30 minute build process.

Instead, we are going to create a layered approach. We are going to freeze all of our existing tests and move them to an Integration Tests project. We will create a small suite of tests that cover just the core stuff in RavenDB, and use that. Over time, we will be adding tests to the new test project. When that becomes too slow, we will have another migration.

What about the integration tests? Well, those will be run solely by our build server, and we will set things up so we can automatically test when running from our own forks, not the main code line.


The difference between Ordering & Boosting

This seems to be a pretty common issue with people getting the two of them confused. As an example, let us take the users in Stack Overflow:

[Image: Stack Overflow users, ordered by reputation]

Here, we want to get the users in order. We want to get all the users in descending order of reputation.

But what happens when we want to do an actual search? For example, we want to get users by tag. Perhaps we want to get someone that knows some ravendb.

Here is the data that we have to work with:

[Image: the users' tags and reputation data]

Now, when searching, we want to be able to do the following: find users that match the tags that we specified and are relevant, and have them show up in reputation order.

And that is where it kills us. Relevancy & order are pretty much mutually exclusive. Before we can explain that, we need to understand that order is absolute, but relevancy is not. If a user has 10,000 tags, having any particular tag means very little. But if a user has just 10 tags, having that tag or not is a lot more important. You want to talk with an expert in a specific field, not just someone who is a jack of all trades.

Now, it might be that you want to apply some boost factor to users with high reputation, because there are people who are jack of all trades and master of most. That is the difference between boosting and ordering.

Ordering is absolute, while boosting is a factor applied against the relative relevancy of the current query.
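
To make the distinction concrete, here is a rough sketch of the two approaches, assuming a hypothetical Users/Search index with Tags and Reputation fields (illustrative only, not an exact recipe):

// Ordering: absolute, reputation alone dictates the order of the results.
var ordered = session.Query<User>("Users/Search")
    .Where(u => u.Tags.Any(t => t == "ravendb"))
    .OrderByDescending(u => u.Reputation)
    .ToList();

// Boosting: relevance still drives the order; the boosted clause only raises the score
// of the documents that match it, it does not impose an absolute order.
var boosted = session.Advanced.LuceneQuery<User>("Users/Search")
    .WhereEquals("Tags", "ravendb")
    .OrElse()
    .WhereGreaterThan("Reputation", 10000).Boost(2)
    .ToList();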


How not to deal with Replication Lag

Because RavenDB replication is async in nature, there is a period of time between when a write has been committed on the master and when it becomes visible to the clients.

A user requested that we provide a low latency solution for that. The idea was that the master server would report to the secondaries that a write happened, and then they would mark all reads of those documents as dirty, until replication caught up.

Implementation wise, this is all ready to happen. We have the Changes API, which is an easy way to get changes from a database. We have the ability to return a 203 Non-Authoritative Information response, so it looks easy.

In theory, it sounds reasonable, but this idea just doesn't hold water. Let us talk about normal operations. Even with the "low latency" notifications (and replication is about as low latency as it already gets), we have to deal with a window of time between the write completing on the master and the notification arriving at the secondaries. In fact, it is the exact same window as with replication. Sure, if you have a high replication load, that might be different, but those cases tend to be rare (high write load, very big documents, etc).

But let us assume that this is really the case. What about failures?

Let us assume servers A & B and client C. Client C makes a write to A, A notifies B, and when C reads from B, it would get a 203 (non-authoritative) response until A replicates to B. All nice & dandy. But what happens when A can't talk to B? Remember, a server being down is the easiest scenario; the hard part is when both A & B are operational, but can't talk to one another. RavenDB is designed to gracefully handle network splits and merges, so what would happen in this case?

Client C writes to A, but A can't notify B or replicate to it. Client C reads from B, but since B got no notification about a change, it returns a 200 OK response, which means that this is the latest version. Problem.

In this case, this is actually a bigger problem than you might consider. If we support the notifications under the standard scenario, users will make assumptions about this. They will have separate code paths for non-authoritative responses, for example. But as we have seen, we have a window of time where the reply would say it is authoritative even though it isn't (a very short one, sure, but still), and under failure scenarios we will outright lie.

It is better not to have this "feature" at all, and let the user handle that on their own (and there are ways to handle it, such as reading from the master for important stuff).


RavenDB Clusters & Write Assurances

RavenDB handles replication in an async manner. Let us say that you have 5 nodes in your cluster, set to use master/master replication.

That means that you call SaveChanges(), the value is saved to one node, and then replicated to the other nodes. But what happens when you have safety requirements? What happens if a node goes down after the call to SaveChanges() completed, but before it replicated the information out?

In other systems, you have the ability to specify a W factor: to how many nodes a value must be written before it is considered "safe". In RavenDB, we decided to go down a similar route. Here is the code:

await session.StoreAsync(user);
await session.SaveChangesAsync(); // save to one of the nodes

var userEtag = session.Advanced.GetEtagFor(user);

var replicas = await store.Replication.WaitAsync(etag: userEtag, replicas: 1);

As you can see, we now have a way to actually wait until replication is completed. We will ping all of the replicas, waiting to see that replication has matched or exceeded the etag that we just wrote.  You can specify the number of replicas that are required for this to complete.

Practically speaking, you can specify a timeout, and if the nodes aren’t reachable, you will get an error about that.

This gives you the ability to handle write assurances very easily. And you can choose how to handle this on a case by case basis (you might care to wait for users to be created, but not for new comments, for example) or globally.


RavenDB & Locking indexes

One of the things that we keep thinking about with RavenDB is how to make it easier for you to run in production.

To that end, we introduced a new feature in 2.5, Index Locking. It looks like this:

[Image: the index locking options in the studio]

But what does this mean, to lock an index?

Well, let us consider a production system, in which you have the following index:

from u in docs.Users
select new
{
   Query = new[] { u.Name, u.Email, u.Email.Split('@') }
}

After you go to production, you realize that you actually needed to also include the FullName in the search queries. You can, obviously, do a full deployment from scratch, but it is generally so much easier to just fix the index definition on the production server, update the index definition in the codebase, and wait for the next deploy for them to match.
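
For example, the fixed definition on the production server might look something like this (an illustrative sketch of the change):

from u in docs.Users
select new
{
   Query = new[] { u.Name, u.FullName, u.Email, u.Email.Split('@') }
}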

This works, except that in many cases, RavenDB applications call IndexCreation.CreateIndexes() on startup. Which means that on the next startup of your application, the change you just made will be reverted. These options allow you to lock an index against changes, either in a way that silently ignores changes to the index, or by raising an error when someone tries to modify it.

It is important to note that this is not a security feature; you can unlock the index at any time. This is there to help operations, that is all.


Better patching API for RavenDB: Creating New Documents

A while ago we introduced the ability to send JS scripts to RavenDB for server side execution. And we have just recently completed a nice improvement to that feature: the ability to create new documents from existing ones.

Here is how it works:

store.DatabaseCommands.UpdateByIndex("TestIndex",
                                     new IndexQuery {Query = "Exported:false"},
                                     new ScriptedPatchRequest { Script = script }
  ).WaitForCompletion();

Where the script looks like this:

for(var i = 0; i < this.Comments.length; i++ ) {
   PutDocument('comments/', {
    Title: this.Comments[i].Title,
    User: this.Comments[i].User.Name,
    By: this.Comments[i].User.Id
  });
}

this.Exported = true;

This will create a new document for each of the embedded comments.


RavenDB Map/Reduce optimizations

So I was diagnosing a customer problem, which required me to write the following:

[Image: the map/reduce index definition]

This is working on a data set of about half a million records.

I took a peek at the stats and I saw this:

[Image: the indexing statistics over time]

You can ignore everything before 03:23, this is the previous index run. I reset it to make sure that I have a clean test.

What you can see is that we start out by mapping & reducing values, and initially this is quite expensive. But very quickly we recognize that we are reducing a single value, and we switch strategies to a more efficient method, and suddenly there is very little cost involved. In fact, you can see that the entire process took about 3 minutes from start to finish, and very quickly we got to the point where our bottleneck was actually the maps pushing data our way.

That is pretty cool.


The state of Rhino Mocks

I was asked to comment on the current state of Rhino Mocks. The current codebase is located here: https://github.com/hibernating-rhinos/rhino-mocks

The last commit was 2 years ago. And I am no longer actively / passively monitoring the mailing list.

From my perspective, Rhino Mocks is done. Done in the sense that I don’t have any interest in extending it, done in the sense that I don’t really use mocking any longer.

If there is anyone in the community that wants to step in and take charge as the Rhino Mocks project leader, I would love that. Failing that, the code is there, it works quite nicely, but that is all I am going to be doing with it for the time being and the foreseeable future.


RavenDB 2.5 Features: Import data to Excel

I wonder what it says about RavenDB that we spend time doing Excel integration :)

At any rate, we have the following documents inside RavenDB:

[Image: the album documents in the studio]

And we want to get this data into Excel. Not only that, but we want this to be something more than just a flat file. We want something that will auto update itself.

We start by defining the shape of the output, using a transformer.

[Image: the Albums/ShapedForExcel transformer definition]
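
Such a transformer is roughly along these lines (a sketch; the Album properties are assumptions, not the actual MusicBox model):

// Albums/ShapedForExcel: flatten each album into the columns we want to see in Excel.
public class Albums_ShapedForExcel : AbstractTransformerCreationTask<Album>
{
    public Albums_ShapedForExcel()
    {
        TransformResults = albums => from album in albums
                                     select new
                                     {
                                         album.Title,
                                         album.Artist,
                                         album.Price
                                     };
    }
}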

Then we go and visit the following URL:

http://localhost:8080/databases/MusicBox/streams/query/Raven/DocumentsByEntityName?query=Tag:Albums&resultsTransformer=Albums/ShapedForExcel&format=excel

  • http://localhost:8080/databases/MusicBox – The server & database that we are querying.
  • streams/query/Raven/DocumentsByEntityName?query=Tag:Albums – Stream the results of querying the index Raven/DocumentsByEntityName for all Tag:Albums (effectively, give me all the albums).
  • resultsTransformer=Albums/ShapedForExcel – transform the results using the specified transformer.
  • format=excel – output this in a format that Excel will find easy to understand.

The output looks like this:

[Image: the CSV output in the browser]

Now, let us take this baby and push this to Excel. We create a new document, and then go to the Data tab, and then to From Text:

[Image: Excel's Data tab, From Text]

We have a File Open Dialog, and we paste the previous URL as the source, then hit enter.

[Image: pasting the URL into the file open dialog]

We have to deal with the import wizard, just hit next on the first page.

[Image: the text import wizard, first page]

We mark the input as comma delimited, and then hit finish.

[Image: marking the input as comma delimited]

We now need to select where it should go in the document:

[Image: choosing where to put the data]

And now we have the data inside Excel:

[Image: the data inside Excel]

We aren’t done yet, we have the data in, now we need to tell Excel to refresh it:

[Image: telling Excel to refresh the data]

Click on the connections button, where you’ll see something like this:

[Image: the workbook connections dialog]

Go to Properties:

[Image: the connection properties dialog]

  • Uncheck Prompt for file name on refresh
  • Check Refresh data when opening the file

Close the file, go to your database and change something. Open the file again, and you can see the new values in there.

You have now created an Excel file that can automatically pull data from RavenDB and give your users immediate access to the data in a format that they are very comfortable with.


Raven’s Storage: Memtables are tough

Memtables are conceptually a very simple thing. You have the list of values that you were provided, as well as a skip list for searches.

Complications:

  • Memtables are meant to be used concurrently.
  • We are going to have to hold all of our values in memory. And I am really not sure that I want to be doing that.
  • When we switch between memtables (and under write conditions, we might be doing that a lot), I want to immediately clear the used memory, not wait for the GC to kick in.

The first thing to do was to port the actual SkipList from the leveldb codebase. That isn’t really hard, but I had to make sure that assumptions made for the C++ memory model are valid for the CLR memory model. In particular, .NET doesn’t have AtomicPointer, but Volatile.Read / Volatile.Write are a good replacement, it seems. I decided to port the one from leveldb because I don’t know what assumptions other list implementations have made. That was the first step in order to create a memtable. The second was to decide where to actually store the data.
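
For reference, a minimal sketch of what that AtomicPointer replacement looks like with Volatile (illustrative only):

// The C++ AtomicPointer becomes a volatile read / write pair on a reference field.
public class AtomicReference<T> where T : class
{
    private T _value;

    public T NoBarrierRead() { return _value; }
    public T AcquireRead() { return Volatile.Read(ref _value); }
    public void ReleaseWrite(T value) { Volatile.Write(ref _value, value); }
}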

Here is the most important method for that part:

   public void Add(ulong seq, ItemType type, Slice key, Stream val)

The problem is that we cannot just hold a reference to these. We have to copy the values into memory that we control. Why is that? Because the user is free to change the Stream contents or the Slice's array as soon as we return from this method. By the same token, we can't just batch this stuff in memory, again because of the LOH. The way this is handled in leveldb never made much sense to me, so I am going to drastically change that behavior.

In my implementation, I decided to do the following:

  • Copy the keys to our own buffer, and keep them inside the skip list. This is what we will use for actually doing searches.
  • Change the SkipList to keep track of values, as well as the key.
  • Keep the actual values in unmanaged memory, instead of managed memory. That avoids the whole LOH issue, and gives me immediate control over when the memory is disposed.

This took some careful coding, because I want to explicitly give up on the GC for this. That means that I need to make damn sure that I don't have bugs that would generate memory leaks.

Each memtable allocates 4MB of unmanaged memory, and writes the values to it. Note that you can go over 4MB (for example, by writing a very large value, or by writing a value whose length exceeds the 4MB limit). At that point, we would allocate more unmanaged memory, and hand the memtable over to compaction.
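
A minimal sketch of the idea, values going into a 4MB unmanaged buffer that we free deterministically (illustrative, not the actual Raven.Storage memtable code):

using System;
using System.Runtime.InteropServices;

public sealed class UnmanagedValueBuffer : IDisposable
{
    private readonly IntPtr _memory;
    private readonly int _capacity;
    private int _used;

    public UnmanagedValueBuffer(int capacity = 4 * 1024 * 1024)
    {
        _capacity = capacity;
        _memory = Marshal.AllocHGlobal(capacity);
    }

    // Copies the value into the unmanaged buffer and returns its offset, or -1 when full
    // (at which point the caller switches to a new memtable and hands this one to compaction).
    public int TryWrite(byte[] value, int offset, int count)
    {
        if (_used + count > _capacity)
            return -1;
        Marshal.Copy(value, offset, IntPtr.Add(_memory, _used), count);
        var position = _used;
        _used += count;
        return position;
    }

    public void Dispose()
    {
        Marshal.FreeHGlobal(_memory); // free immediately, without waiting for the GC
    }
}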

The whole thing is pretty neat, even if I say so myself :)


Raven’s Storage: Understanding the SST file format

This is an example of an SST that stores:

  • tests/0000 –> values/0
  • tests/0001 –> values/1
  • tests/0002 –> values/2
  • tests/0003 –> values/3
  • tests/0004 –> values/4

As well as a bloom filter for rapid checks.

[Image: hex dump of the full SST file]

If you are wondering about the binary format, that is what this post is all about. We actually start from the end: the last 48 bytes of the file are dedicated to the footer.

[Image: hex dump of the footer]

The footer format is:

  • The last 8 bytes of the file are a magic number: 0xdb4775248b80fb57ul – this means that we can quickly identify whether this is an SST file or not.
    Here is what this number looks like broken down to bytes:
    [Image: the magic number bytes]
  • The other 40 bytes are dedicated to the metadata handle and the index handle.

Those are two pairs of longs, encoded using 7 bit encoding. In our case, here is what they look like:

[Image: the encoded handle bytes]

Let us see if we can parse them properly:

  • 100
  • 38
  • 143, 1 = 143
  • 14

Note that in order to encode 143 properly, we needed two bytes, because it is higher than 127 (we use the high bit of each byte to indicate whether there are more bytes to read). The first pair is the metadata handle (position: 100, count: 38), the second is the index handle (position: 143, count: 14).
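
Reading such a 7 bit encoded int is straightforward; a sketch of the decoder (the high bit of each byte says whether another byte follows):

static int Read7BitEncodedInt(Stream stream)
{
    int value = 0, shift = 0;
    while (true)
    {
        var b = stream.ReadByte();
        if (b == -1)
            throw new EndOfStreamException();
        value |= (b & 0x7F) << shift;   // the lower 7 bits carry the payload
        if ((b & 0x80) == 0)            // high bit clear: this was the last byte
            return value;
        shift += 7;
    }
}

// For example, the bytes 143, 1 decode to (143 & 0x7F) + (1 << 7) = 15 + 128 = 143.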

We will start by parsing the metadata block first:

[Image: hex dump of the metadata block]

You can see the relevant portions in the image.

The actual interesting bits here are the first three bytes. Here we have:

  • 0 – the number of bytes shared with the previous key (there is none, which is why it is zero).
  • 25 – the number of non shared bytes (in this case, the full key, which is 25 bytes long).
  • 2 – the size of the value; in this case, the value is the handle in the file for the filter data.

You can read the actual key name on the left, "filter.BuiltinBloomFilter", and then we have the actual filter data:

[Image: the filter handle bytes]

You can probably guess that this is the filter handle (position: 82, count: 18).

The rest of the data are two 4 byte integers. Those are the restart array (positions 130–133) and the restart count (positions 134–137). Restarts are a very important concept for reducing the size of the SST, but I'll cover them in depth when talking about the actual data, not the metadata.

Next, we have the actual filter data itself, which looks like this:

[Image: hex dump of the bloom filter block]

This is actually a serialized bloom filter, which allows us to quickly decide if a specified key is here or not. There is a chance for errors, but errors can only be false positives, never false negatives. This turns out to be quite useful down the road, when we have multiple SST files and need to work with them in tandem. Even so, we can break it apart into a more detailed breakdown:

  • The first 8 bytes are the actual serialized bloom filter bits.
  • The 9th byte is the k value in the bloom filter algorithm. The default value is 6.
  • The last value (11) is the lg value (also used in the bloom filter algo).
  • The rest of the data is a bit more interesting. The 4 bytes preceding the 11 (those are 9,0,0,0) are the offset of valid data inside the filter data.
  • The four zeros preceding that are the position of the relevant data in the file.

Honestly, you can think about this as a black box; knowing that it holds the filter data is probably enough.

Now that we are done with the filter, we have to look at the actual index data. This is located at 143, but we know that the metadata block actually ended at 100 + 38 = 138, so why the gap? The answer is that after each block, we have a block type (compressed or not, basically) and the block CRC, which is used to determine if the file has been corrupted.

Back to the index block: it starts at 143 and goes on for 14 bytes, looking like this:

[Image: hex dump of the index block]

The last 4 bytes (a 32 bit int equal to 1) are the number of restarts that we have for this block. And the 4 bytes preceding them (a 32 bit int equal to 0) are the offset of the restarts in the index block.

In this case, we have just one restart, and the offset of that is 0. Hold on about restarts, I’ll get there. Now let us move to the beginning of the index block. We have the following three bytes: 0,1,2.

Just like in the meta block case, those fall under (shared, non shared, value size) and are all 7 bit encoded ints. That means that there is no shared data with the previous key (because there isn't a previous key), the non shared data is 1 byte and the value size is 2. If you memorized your ASCII table, 117 is a lower case 'u'. The actual value is a block handle, this time for the actual data associated with this index. In this case, a block with position: 0 and count: 70.

[Image: hex dump of the data block]

Let us start analyzing this data. 0, 10, 8 tells us shared, non shared, value size. Indeed, the next 10 bytes spell out 'tests/0000' and the 8 after that are 'values/0'. And what about the rest?

[Image: hex dump of the next entry]

Now we have 9, 1, 8. Shared is 9, non shared is 1 and value size is 8. We take the first 9 bytes of the previous key, giving us 'tests/000', and append to it the non shared data, in this case a byte with value 49 ('1' in ASCII), giving us a full key of 'tests/0001'. The next 8 bytes after that spell out 'values/1'. And the rest is pretty much just like it.
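
A sketch of how a key is reconstructed from the (shared, non shared) counts (illustrative; the real code works on byte buffers rather than strings):

static string ReconstructKey(string previousKey, int shared, byte[] nonSharedBytes)
{
    // take the first `shared` bytes of the previous key and append the new bytes
    return previousKey.Substring(0, shared) + Encoding.ASCII.GetString(nonSharedBytes);
}

// From the dump above: previous key "tests/0000", shared = 9, non shared byte 49 ('1')
// gives "tests/000" + "1" == "tests/0001".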

Now, I promised that I would talk about restarts. You see how we can use the shared/non shared data to effectively compress data. However, that has a major hidden cost. In order to figure out what the key is, we need to read all the previous keys.

In order to avoid this, we use the notion of restarts. Every N keys (by default, 16), we will have a restart point and put the full key into the file. That means that we can skip ahead based on the position specified in the restart offset, and that in turn is governed by the number of restarts that we have in the index block.

And… that is pretty much it. This is the full and unvarnished binary dump of an SST. Obviously real ones are going to be more complex, but they all follow the same structure.


Raven’s Storage: Reading a Sorted String Table

When reading an SST, we have to deal with values of potentially large sizes. I want to avoid loading anything into managed memory if I can possibly avoid it. That, along with other considerations, has led me to use memory mapped files as the underlying abstraction for reading from the table.

Along the way, I think that I made the following assumptions:

  • Work in little endian only.
  • Can work in 32 bits, but 64 bits are preferred.
  • Doesn’t put pressure on the .NET memory manager.

In particular, I am worried about users wanting to store large values, or a large number of keys.

As it turned out, .NET's memory mapped files are a pretty good answer for what I wanted to do. Sure, it is a bit of a pain with regards to how to handle things like WinRT / Silverlight, etc. But I am mostly focused on the server side for now. And I have some ideas on how to provide the mmap illusion on top of regular streams for platforms that don't support it.

The fact that the SST is written by a single thread, and once it is written it is immutable, has drastically simplified the codebase. Although I have to admit that looking at a hex dump to figure out that I wrote to the wrong position is a bit of a bother, but more on that later. A lot of the code is basically the leveldb code, tweaked for .NET use. One important difference that I made was with regards to the actual API.

  • Keys are assumed to be small. Most of the time, less than 2KB in size, and there are optimizations in place to take advantage of that. (It will still work with keys bigger than that, but will consume more memory).
  • In general, through the codebase I tried to put major emphasis on performance and memory use even at this early stage.
  • Values are not assumed to be small.

What does the last one mean?

Well, to start with, we aren't going to map the entire file into our memory space. It might be big enough to start fragmenting our virtual address space, but mostly, there is simply no need. We always map just a single block at a time, and usually we never bother to read the values into managed memory, instead accessing the data directly from the memory mapped file.
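
A rough sketch of that approach, mapping only the block we need and handing back a Stream over the mapped range (assumed names, not the actual Raven.Storage code):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public class SstBlockReader : IDisposable
{
    private readonly MemoryMappedFile _file;

    public SstBlockReader(string path)
    {
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
    }

    // Returns a stream over just [position, position + count) of the file;
    // nothing is copied into managed memory until the caller actually reads.
    public Stream CreateValueStream(long position, long count)
    {
        return _file.CreateViewStream(position, count, MemoryMappedFileAccess.Read);
    }

    public void Dispose()
    {
        _file.Dispose();
    }
}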

Here is an example of reading the key for an entry from the SST:

[Image: the code for reading a key from the SST]

As you can see, we need to read just the key itself into memory; the value itself is not even touched. Also, there is some buffer management going on to make sure that we don't need to re-allocate buffers as we are scanning through the table.

When you want to get a value, you call CreateValueStream, which gives you a Stream that you can work with. Here is how you use the API:

[Image: code using CreateValueStream to read a value]

This is actually part of the internal codebase; we are actually storing data inside the SST that will later help us optimize things, but that is something I'll touch on at a later point in time.

Except for the slight worry that I am going to have to change the underlying approach from memory mapped files to streams if I need to run it outside the server/client, this is very cool.

Next on my list is to think about how to implement the memtable, again without putting too much pressure on managed memory.
