Ayende @ Rahien

It's a girl

Tail/Feather–Snapshots

The Raft protocol gives us a stable, replicated, distributed log. In other words, all servers in the cluster will agree on all the committed entries to the log (both what they are, and in what position). We usually fill the log with operations that a state machine will execute.

In the Tail/Feather example, the commands are set/del operations on the key value store. Note that this doesn’t mean that all servers will always have the same state. It is possible that a server (or set of servers) will have an outdated view of the log, but the log that they have will match up to the point that they have it.

So, what is the problem? What happens when we have an active system? Well, every time that we make a modification, we’ll add it to the log. That is all good and great, but what about the actual log? Well, it is going to stay there, we need it so we can catch up any new server that will join the cluster. But that means that over time, we are going to have an unbounded growth. Which isn’t a very nice thing to have.

Rachis handles this by asking the state machine to implement snapshots: a way to take the current state of the state machine and transmit it over the network. For example, assume that we have a log full of entries like these:

{ Op: "Add", Key: "users/1/login-attempts", "Value": 1}
{ Op: "Add", Key: "users/1/login-attempts", "Value": 2}
{ Op: "Add", Key: "users/1/login-attempts", "Value": 3}

// ...

{ Op: "Add", Key: "users/1/login-attempts", "Value": 300000}

The log for that is 300,000 entries long, but the current state of the machine:

{ "users/1/login-attempts": 300000 }

Which is obviously much smaller. Rachis doesn’t force a state machine to implement this, but if it isn’t doing so, we can never clear the log. But implementing snapshots has its own problems.

What about the actual cost of creating the snapshot? Imagine that we ask the state machine for a snapshot every 10,000 entries. In the example above, that would mean just writing out { "users/1/login-attempts": 300000 } or whatever the actual current value is. But now consider a log with a very different shape:

{ Op: "Add", Key: "users/1/login-attempts", "Value": 1}
{ Op: "Add", Key: "users/2/login-attempts", "Value": 1}
{ Op: "Add", Key: "users/3/login-attempts", "Value": 1}

// ...

{ Op: "Add", Key: "users/300000/login-attempts", "Value": 1}

Note that instead of having 300,000 changes to the same key, we are going to have 300,000 keys. In this case, writing the full list down on every snapshot is very expensive. That is what incremental backups are here to solve.  We let Voron know that this is what we want by specifying:

options.IncrementalBackupEnabled = true;
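For context, that flag is set when we open the Voron storage environment for the state machine. A minimal sketch, using only the calls that show up elsewhere in this post (the dataPath variable is just a placeholder here):

var options = StorageEnvironmentOptions.ForPath(Path.Combine(dataPath, "KeyValue"));

// opt in to incremental backups, otherwise we can only ever do full backups
options.IncrementalBackupEnabled = true;

var storageEnvironment = new StorageEnvironment(options);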

And now it is time to define policies about taking snapshots. We are going to handle this using Voron full & incremental snapshots. You can see the logic in the following code.

public void CreateSnapshot(long index, long term, ManualResetEventSlim allowFurtherModifications)
{
    // we have no snapshot files, so this is the first time that we create a snapshot
    // we handle that by asking voron to create a full backup
    var files = Directory.GetFiles(_storageEnvironment.Options.BasePath, "*.Snapshot");
    Array.Sort(files, StringComparer.OrdinalIgnoreCase); // make sure we get it in sort order
    if (files.Any() == false)
    {
        DoFullBackup(index, term, allowFurtherModifications);
        return;
    }
    string lastFullBackup = null;
    int fullBackupIndex = -1;
    for (int i = files.Length - 1; i >= 0; i--)
    {
        if (!Path.GetFileName(files[i]).StartsWith("Full")) // the entries are full paths, so check just the file name
            continue;
        fullBackupIndex = i;
        lastFullBackup = files[i];
        break;
    }
            
    if (lastFullBackup == null)
    {
        // this shouldn't be the case, we must always have at least one full backup. 
        // maybe user deleted it? We'll do a full backup here to compensate
        DoFullBackup(index, term, allowFurtherModifications);
        return;
    }
            
    var fullBackupSize = new FileInfo(lastFullBackup).Length;
    var incrementalBackupsSize = files.Skip(fullBackupIndex + 1).Sum(f => new FileInfo(f).Length);

    // now we need to decide whether to do a full or incremental backup. Doing incremental backups stops
    // making sense if they will take more space than the full backup. Our cutoff point is when they pass 50%
    // of the full backup size.
    // If the full backup size is 1 GB, and we have 25 incremental backups that are 600 MB in size, we need to transfer
    // 1.6 GB to restore. If we generate a new full backup, we'll only need to transfer 1 GB to restore.

    if (incrementalBackupsSize / 2 > fullBackupSize)
    {
        DoFullBackup(index, term, allowFurtherModifications);
        return;
    }

    DeleteOldSnapshots(files.Take(fullBackupIndex)); // delete snapshots older than the current full backup

    var incrementalBackup = new IncrementalBackup();
    incrementalBackup.ToFile(_storageEnvironment,
        Path.Combine(_storageEnvironment.Options.BasePath, string.Format("Inc-{0:D19}-{1:D19}.Snapshot", index, term)),
        infoNotify: Console.WriteLine,
        backupStarted: allowFurtherModifications.Set);
}

private void DoFullBackup(long index, long term, ManualResetEventSlim allowFurtherModifications)
{
    var snapshotsToDelete = Directory.GetFiles(_storageEnvironment.Options.BasePath, "*.Snapshot");

    var fullBackup = new FullBackup();
    fullBackup.ToFile(_storageEnvironment,
        Path.Combine(_storageEnvironment.Options.BasePath, string.Format("Full-{0:D19}-{1:D19}.Snapshot", index, term)),
        infoNotify: Console.WriteLine,
        backupStarted: allowFurtherModifications.Set
        );

    DeleteOldSnapshots(snapshotsToDelete);
}

private static void DeleteOldSnapshots(IEnumerable<string> snapshotsToDelete)
{
    foreach (var snapshot in snapshotsToDelete)
    {
        try
        {
            File.Delete(snapshot);
        }
        catch (Exception)
        {
            // we ignore snapshots we can't delete, they are expected if we are concurrently writing
            // the snapshot and creating a new one. We'll get them the next time.
        }
    }
}

Basically, we need to strike a balance between full and incremental backups. We do that by first taking a full backup, and then starting to take incremental backups until our incremental backups take more than 50% of the full backup, at which point we are probably better off doing another full backup. Note that we use the occasion of a full backup to clear out the old incremental and full backup files.

And with that, we can move to actually sending the snapshot over the wire. This is exposed by the GetSnapshotWriter() method, which just hands all the responsibility off to the SnapshotWriter:

public ISnapshotWriter GetSnapshotWriter()
{
    return new SnapshotWriter(this);
}

public class SnapshotWriter : ISnapshotWriter
{
    private readonly KeyValueStateMachine _parent;

    private List<FileStream> _files = new List<FileStream>();

    public SnapshotWriter(KeyValueStateMachine parent)
    {
        _parent = parent;
        var files = Directory.GetFiles(_parent._storageEnvironment.Options.BasePath, "*.Snapshot");
        var fullBackupIndex = GetFullBackupIndex(files);

        if (fullBackupIndex == -1)
            throw new InvalidOperationException("Could not find a full backup file to start the snapshot writing");

        var last = Path.GetFileNameWithoutExtension(files[files.Length-1]);
        Debug.Assert(last != null);
        var parts = last.Split('-');
        if(parts.Length != 3)
            throw new InvalidOperationException("Invalid snapshot file name " + files[files.Length - 1] + ", could not figure out index & term");

        Index = long.Parse(parts[1]);
        Term = long.Parse(parts[2]);

        for (int i = fullBackupIndex; i < files.Length; i++)
        {
            _files.Add(File.OpenRead(files[i]));
        }
    }

    public void Dispose()
    {
        foreach (var file in _files)
        {
            file.Dispose();
        }
    }

    public long Index { get; private set; }
    public long Term { get; private set; }
    public void WriteSnapshot(Stream stream)
    {
        var writer = new BinaryWriter(stream);
        writer.Write(_files.Count);
        foreach (var file in _files)
        {
            // write only the file name (not the full path), so the receiving side
            // can recreate the file under its own base path
            writer.Write(Path.GetFileName(file.Name));
            writer.Write(file.Length);
            writer.Flush();
            file.CopyTo(stream);
        }
    }
}

What is going on here? We get the snapshot files and find the latest full backup, then we open all the files that we’ll need for the snapshot (the last full backup and everything afterward). We need to open them in the constructor to lock them against deletion by the CreateSnapshot() method.

Then we just concatenate them all and send them over the wire. And getting them? That is pretty easy as well:

public void ApplySnapshot(long term, long index, Stream stream)
{
    var basePath = _storageEnvironment.Options.BasePath;
    _storageEnvironment.Dispose();

    foreach (var file in Directory.EnumerateFiles(basePath))
    {
        File.Delete(file);
    }

    var files = new List<string>();

    var buffer = new byte[1024*16];
    var reader = new BinaryReader(stream);
    var filesCount = reader.ReadInt32();
    if (filesCount == 0)
        throw new InvalidOperationException("Snapshot cannot contain zero files");
    for (int i = 0; i < filesCount; i++)
    {
        var name = reader.ReadString();
        files.Add(name);
        var len = reader.ReadInt64();
        using (var file = File.Create(Path.Combine(basePath, name)))
        {
            file.SetLength(len);
            var totalFileRead = 0;
            while (totalFileRead < len)
            {
                var read = stream.Read(buffer, 0, (int) Math.Min(buffer.Length, len - totalFileRead));
                if (read == 0)
                    throw new EndOfStreamException();
                totalFileRead += read;
                file.Write(buffer, 0, read);
            }
        }
    }
            
    new FullBackup().Restore(Path.Combine(basePath, files[0]), basePath);

    var options = StorageEnvironmentOptions.ForPath(basePath);
    options.IncrementalBackupEnabled = true;
    //TODO: Copy any other customizations that might have happened on the options

    new IncrementalBackup().Restore(options, files.Skip(1));

    _storageEnvironment = new StorageEnvironment(options);

    using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
    {
        var metadata = tx.ReadTree("$metadata");
        metadata.Add("last-index", EndianBitConverter.Little.GetBytes(index));
        LastAppliedIndex = index;
        tx.Commit();
    }
}

Unpack the snapshots from the stream, apply the full backup first, then all the incremental backups. Make sure to update the last applied index, and we are set.

Tail/Feather–The client API

As I mentioned Tail/Feather is a weekend project to test out how stuff works for real. After creating the highly available distributed key/value store, we are now in need of actually building a client API for it.

Externally, that API is going to look like this:

public class TailFeatherClient : IDisposable
{
    public TailFeatherClient(params Uri[] nodes);

    public Task Set(string key, JToken value);

    public Task<JToken> Get(string key);

    public Task Remove(string key);

    public void Dispose();
}
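
To make the shape concrete, here is roughly what using it looks like from calling code. A hypothetical snippet inside an async method; the URLs just match the sample cluster we set up in the previous post:

using (var client = new TailFeatherClient(
    new Uri("http://localhost:9077"),
    new Uri("http://localhost:9078"),
    new Uri("http://localhost:9079")))
{
    await client.Set("ravendb", JToken.Parse("{ 'Url': 'http://live-test.ravendb.net' }"));

    var value = await client.Get("ravendb");
    Console.WriteLine(value);

    await client.Remove("ravendb");
}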

If this wasn’t a weekend project, I would add batch support, but that isn’t important for our purposes right now. The API itself is pretty stupid, which is great, but what about the actual behavior?

We want it to be able to handle dynamic cluster changes, and we need it to be smart about it. A lot of that is shared among all operations, so the next layer of the API is:

public Task Set(string key, JToken value)
{
    return ContactServer(client => client.GetAsync(string.Format("tailfeather/key-val/set?key={0}&val={1}",
        Uri.EscapeDataString(key), Uri.EscapeDataString(value.ToString(Formatting.None)))));
}

public async Task<JToken> Get(string key)
{
    // note: this needs to hit the read endpoint, not the del endpoint
    var reply = await ContactServer(client => client.GetAsync(string.Format("tailfeather/key-val/read?key={0}",
        Uri.EscapeDataString(key))));
    var result = JObject.Load(new JsonTextReader(new StreamReader(await reply.Content.ReadAsStreamAsync())));

    if (result.Value<bool>("Missing"))
        return null;

    return result["Value"];
}

public Task Remove(string key)
{
    return ContactServer(client => client.GetAsync(string.Format("tailfeather/key-val/del?key={0}",
        Uri.EscapeDataString(key))));
}

The actual behavior is in ContactServer:

private readonly ConcurrentDictionary<Uri, HttpClient> _cache = new ConcurrentDictionary<Uri, HttpClient>();
private Task<TailFeatherTopology> _topologyTask;

public TailFeatherClient(params Uri[] nodes)
{
    _topologyTask = FindLatestTopology(nodes);
}

private HttpClient GetHttpClient(Uri node)
{
    return _cache.GetOrAdd(node, uri => new HttpClient { BaseAddress = uri });
}

private async Task<TailFeatherTopology> FindLatestTopology(IEnumerable<Uri> nodes)
{
    var tasks = nodes.Select(node => GetHttpClient(node).GetAsync("tailfeather/admin/flock")).ToArray();

    try
    {
        // wait for all the nodes to reply (or fail), some of them may well be down
        await Task.WhenAll(tasks);
    }
    catch (Exception)
    {
        // unreachable nodes are handled below, by skipping their tasks
    }

    var topologies = new List<TailFeatherTopology>();
    foreach (var task in tasks)
    {
        if (task.Status != TaskStatus.RanToCompletion)
            continue;

        var message = task.Result;
        if (message.IsSuccessStatusCode == false)
            continue;

        topologies.Add(new JsonSerializer().Deserialize<TailFeatherTopology>(
            new JsonTextReader(new StreamReader(await message.Content.ReadAsStreamAsync()))));
    }

    return topologies.OrderByDescending(x => x.CommitIndex).FirstOrDefault();
}

private async Task<HttpResponseMessage> ContactServer(Func<HttpClient, Task<HttpResponseMessage>> operation, int retries = 3)
{
    if (retries < 0)
        throw new InvalidOperationException("Cluster is not reachable, or no leader was selected. Out of retries, aborting.");

    var topology = (await _topologyTask ?? new TailFeatherTopology());

    var leader = topology.AllVotingNodes.FirstOrDefault(x => x.Name == topology.CurrentLeader);
    if (leader == null)
    {
        _topologyTask = FindLatestTopology(topology.AllVotingNodes.Select(x => x.Uri));
        return await ContactServer(operation, retries - 1);
    }

    // now we have a leader, we need to try calling it...
    var httpResponseMessage = await operation(GetHttpClient(leader.Uri));
    if (httpResponseMessage.IsSuccessStatusCode == false)
    {
        // we were sent to a different server, let us try that...
        if (httpResponseMessage.StatusCode == HttpStatusCode.Redirect)
        {
            var redirectUri = httpResponseMessage.Headers.Location;
            httpResponseMessage = await operation(GetHttpClient(redirectUri));
            if (httpResponseMessage.IsSuccessStatusCode)
            {
                // we successfully contacted the redirected server, this is probably the leader, let us ask it for the topology,
                // it will be there for next time we access it
                _topologyTask = FindLatestTopology(new[] { redirectUri }.Union(topology.AllVotingNodes.Select(x => x.Uri)));

                return httpResponseMessage;
            }
        }

        // we couldn't get to the server, and we didn't get redirected, we'll check in the cluster in general
        _topologyTask = FindLatestTopology(topology.AllVotingNodes.Select(x => x.Uri));
        return await ContactServer(operation, retries - 1);
    }

    // happy path, we are done
    return httpResponseMessage;
}

There is quite a bit going on here, but the basic idea is simple. Starting from the initial list of nodes we have, we contact all of them and find the topology with the highest commit index. That means it is the freshest, so it is most likely to be the current one. From the topology, we take the leader, and send all queries to the leader.

If there is any sort of error, we’ll contact all the other servers to find out who we are supposed to be using now. If we can’t find it after three tries, we give up and let the caller sort it out, probably by retrying once the cluster is in a steady state again.

Now, this is really nice, but it falls under the heading of weekend code. That means that this is quite far from what I would call production code. What is missing?

  • Caching the topology locally in a persistent manner, so that we can start from the last known good topology even when all the servers in the initial node list are down (a rough sketch follows this list).
  • Proper error handling, and in particular error reporting, to make sure that we can understand what is actually going on.
  • Features such as allowing reads from non-leaders, testing, etc.
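
For the first item, here is a rough sketch of what persistent topology caching might look like. Note that this is not part of Tail/Feather; the file location and helper names are made up for illustration:

private static readonly string TopologyCachePath =
    Path.Combine(Environment.CurrentDirectory, "last-known-topology.json"); // hypothetical location

private void CacheTopology(TailFeatherTopology topology)
{
    // best effort only; losing the cache just means we fall back to the seed node list
    try
    {
        File.WriteAllText(TopologyCachePath, JObject.FromObject(topology).ToString());
    }
    catch (Exception)
    {
    }
}

private TailFeatherTopology TryLoadCachedTopology()
{
    try
    {
        if (File.Exists(TopologyCachePath) == false)
            return null;
        return new JsonSerializer().Deserialize<TailFeatherTopology>(
            new JsonTextReader(new StringReader(File.ReadAllText(TopologyCachePath))));
    }
    catch (Exception)
    {
        return null;
    }
}

FindLatestTopology could write through CacheTopology whenever it finds a topology, and the constructor could union the cached nodes with the ones it was given.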

But overall, I’m quite happy with this.

Tail/Feather–highly available distributed key/value store weekend project

Weekend project means just that, I’m trying some things out, and writing something real is the best way to exercise. This isn’t going to be a full blown project, but it should be functional and usable.

The basic idea: I’m going to build a distributed key/value configuration store. Similar to etcd, this will allow me to explore how to handle full blown Rachis from both the server & client sides.

We want this to be a full blown implementation, which means persistence, snapshots, network API, the works.

In terms of the data model, we’ll go for the simplest possible one: a key/value store. A key is a string of up to 128 characters. A value is a JSON formatted value of up to 16KB. Persistence will be handled by Voron. The persistence side of the project is mostly Voron’s job, so what we are left with is the following:

public enum KeyValueOperationTypes
{
    Add,
    Del
}

public class KeyValueOperation
{
    public KeyValueOperationTypes Type;
    public string Key;
    public JToken Value;
}

public class OperationBatchCommand : Command
{
    public KeyValueOperation[] Batch { get; set; }
}

This gives us the background for the actual state machine:

public class KeyValueStateMachine : IRaftStateMachine
{
    readonly StorageEnvironment _storageEnvironment;

    public KeyValueStateMachine(StorageEnvironmentOptions options)
    {
        _storageEnvironment = new StorageEnvironment(options);
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
        {
            _storageEnvironment.CreateTree(tx, "items");
            var metadata = _storageEnvironment.CreateTree(tx, "$metadata");
            var readResult = metadata.Read("last-index");
            if (readResult != null)
                LastAppliedIndex = readResult.Reader.ReadLittleEndianInt64();
            tx.Commit();
        }
    }

    public event EventHandler<KeyValueOperation> OperatonExecuted;

    protected void OnOperatonExecuted(KeyValueOperation e)
    {
        var handler = OperatonExecuted;
        if (handler != null) handler(this, e);
    }

    public JToken Read(string key)
    {
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.Read))
        {
            var items = tx.ReadTree("items");

            var readResult = items.Read(key);

            if (readResult == null)
                return null;

            return JToken.ReadFrom(new JsonTextReader(new StreamReader(readResult.Reader.AsStream())));
        }
    }

    public long LastAppliedIndex { get; private set; }

    public void Apply(LogEntry entry, Command cmd)
    {
        var batch = (OperationBatchCommand)cmd;
        Apply(batch.Batch, cmd.AssignedIndex);
    }

    private void Apply(IEnumerable<KeyValueOperation> ops, long commandIndex)
    {
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
        {
            var items = tx.ReadTree("items");
            var metadata = tx.ReadTree("$metadata");
            metadata.Add("last-index", EndianBitConverter.Little.GetBytes(commandIndex));
            var ms = new MemoryStream();
            foreach (var op in ops)
            {
                switch (op.Type)
                {
                    case KeyValueOperationTypes.Add:
                        ms.SetLength(0);

                        var streamWriter = new StreamWriter(ms);
                        op.Value.WriteTo(new JsonTextWriter(streamWriter));
                        streamWriter.Flush();

                        ms.Position = 0;
                        items.Add(op.Key, ms);
                        break;
                    case KeyValueOperationTypes.Del:
                        items.Delete(op.Key);
                        break;
                    default:
                        throw new ArgumentOutOfRangeException();
                }
                OnOperatonExecuted(op);
            }

            tx.Commit();
        }
    }

    public void Dispose()
    {
        if (_storageEnvironment != null)
            _storageEnvironment.Dispose();
    }
}

As you can see, there isn’t much here. Not surprising, since we are storing a key/value data structure. I’m also ignoring snapshots for now. That is good enough for now, let us go for the network portion of the work. We are going to be using Web API for the network stuff. And we’ll be initializing it like so:

var nodeName = options.NodeName ?? (Environment.MachineName + ":" + options.Port);

var kvso = StorageEnvironmentOptions.ForPath(Path.Combine(options.DataPath, "KeyValue"));
using (var statemachine = new KeyValueStateMachine(kvso))
{
    using (var raftEngine = new RaftEngine(new RaftEngineOptions(
        new NodeConnectionInfo
        {
            Name = nodeName,
            Url = new Uri("http://" + Environment.MachineName + ":" + options.Port),
        },
        StorageEnvironmentOptions.ForPath(Path.Combine(options.DataPath, "Raft")),
        new HttpTransport(nodeName),
        statemachine
        )))
    {
        using (WebApp.Start(new StartOptions
        {
            Urls = { "http://+:" + options.Port + "/" }
        }, builder =>
        {
            var httpConfiguration = new HttpConfiguration();
            RaftWebApiConfig.Register(httpConfiguration);
            httpConfiguration.Properties[typeof(HttpTransportBus)] = new HttpTransportBus(nodeName);
            httpConfiguration.Properties[typeof(RaftEngine)] = raftEngine;
            builder.UseWebApi(httpConfiguration);
        }))
        {
            Console.WriteLine("Ready & processing requests, press ENTER to stop");

            Console.ReadLine();
        }
    }
}

Note that we need to initialize both the state machine and the raft engine, then wire up the raft engine controllers. Now we are pretty much done with setup, and we can turn to the actual semantics of running the cluster. The first thing that I want to do is to set up the baseline, so we create this base controller:

public abstract class TailFeatherController : ApiController
{
    public KeyValueStateMachine StateMachine { get; private set; }
    public RaftEngine RaftEngine { get; private set; }

    public override async Task<HttpResponseMessage> ExecuteAsync(HttpControllerContext controllerContext, CancellationToken cancellationToken)
    {
        RaftEngine = (RaftEngine)controllerContext.Configuration.Properties[typeof(RaftEngine)];
        StateMachine = (KeyValueStateMachine)RaftEngine.StateMachine;
        try
        {
            return await base.ExecuteAsync(controllerContext, cancellationToken);
        }
        catch (NotLeadingException)
        {
            var currentLeader = RaftEngine.CurrentLeader;
            if (currentLeader == null)
            {
                return new HttpResponseMessage(HttpStatusCode.PreconditionFailed)
                {
                    Content = new StringContent("{ 'Error': 'No current leader, try again later' }")
                };
            }
            var leaderNode = RaftEngine.CurrentTopology.GetNodeByName(currentLeader);
            if (leaderNode == null)
            {
                return new HttpResponseMessage(HttpStatusCode.InternalServerError)
                {
                    Content = new StringContent("{ 'Error': 'Current leader " + currentLeader + " is not found in the topology. This should not happen.' }")
                };
            }
            return new HttpResponseMessage(HttpStatusCode.Redirect)
            {
                Headers =
                {
                    Location = leaderNode.Uri
                }
            };
        }
    }
}

That is a lot of error handling, but basically it just gets the right values from the configuration and exposes them to the controller actions, plus a lot of error handling for when a command that requires a leader hits a follower.

Next step, actually managing the cluster, here we go:

public class AdminController : TailFeatherController
{
    [HttpGet]
    [Route("tailfeather/admin/fly-with-us")]
    public async Task<HttpResponseMessage> Join([FromUri] string url, [FromUri] string name)
    {
        var uri = new Uri(url);
        name = name ?? uri.Host + (uri.IsDefaultPort ? "" : ":" + uri.Port);

        await RaftEngine.AddToClusterAsync(new NodeConnectionInfo
        {
            Name = name,
            Uri = uri
        });
        return new HttpResponseMessage(HttpStatusCode.Accepted);
    }

    [HttpGet]
    [Route("tailfeather/admin/fly-away")]
    public async Task<HttpResponseMessage> Leave([FromUri] string name)
    {
        await RaftEngine.RemoveFromClusterAsync(new NodeConnectionInfo
        {
            Name = name
        });
        return new HttpResponseMessage(HttpStatusCode.Accepted);
    }
}

So now we have a way to add and remove nodes from the cluster, which is all the admin stuff that we need to handle right now. Next, we need to actually wire up the operations, which is done here:

public class KeyValueController : TailFeatherController
{
    [HttpGet]
    [Route("tailfeather/key-val/read")]
    public HttpResponseMessage Read([FromUri] string key)
    {
        var read = StateMachine.Read(key);
        if (read == null)
        {
            return Request.CreateResponse(HttpStatusCode.NotFound, new
            {
                RaftEngine.State,
                Key = key,
                Missing = true
            });
        }
        return Request.CreateResponse(HttpStatusCode.OK, new
        {
            RaftEngine.State,
            Key = key,
            Value = read
        });
    }

    [HttpGet]
    [Route("tailfeather/key-val/set")]
    public Task<HttpResponseMessage> Set([FromUri] string key, [FromUri] string val)
    {
        JToken jVal;
        try
        {
            jVal = JToken.Parse(val);
        }
        catch (JsonReaderException)
        {
            jVal = val;
        }

        var op = new KeyValueOperation
        {
            Key = key,
            Type = KeyValueOperationTypes.Add,
            Value = jVal
        };

        return Batch(new[] { op });
    }

    [HttpGet]
    [Route("tailfeather/key-val/del")]
    public Task<HttpResponseMessage> Del([FromUri] string key)
    {
        var op = new KeyValueOperation
        {
            Key = key,
            Type = KeyValueOperationTypes.Del,
        };

        return Batch(new[] { op });
    }

    [HttpPost]
    [Route("tailfeather/key-val/batch")]
    public async Task<HttpResponseMessage> Batch()
    {
        var stream = await Request.Content.ReadAsStreamAsync();
        var operations = new JsonSerializer().Deserialize<KeyValueOperation[]>(new JsonTextReader(new StreamReader(stream)));

        return await Batch(operations);
    }

    private async Task<HttpResponseMessage> Batch(KeyValueOperation[] operations)
    {
        var taskCompletionSource = new TaskCompletionSource<object>();
        RaftEngine.AppendCommand(new OperationBatchCommand
        {
            Batch = operations,
            Completion = taskCompletionSource
        });
        await taskCompletionSource.Task;

        return Request.CreateResponse(HttpStatusCode.Accepted);
    }
}

And we are pretty much set.

Note that I’ve been writing this post while I’m writing the code, so I’ve made some small changes, you can see actual code here.

Anyway, we are pretty much done. Now we can compile and try testing what is going on.

First, we seed the cluster by running:

.\TailFeather.exe --port=9079 --DataPath=One --Name=One --Bootstrap

This tells TailFeather that this node is allowed to become a leader without having to pre-configure a cluster. This command runs and exits, so now we’ll run three such copies:

  • start .\TailFeather.exe "--port=9079 --DataPath=One --Name=One"
  • start .\TailFeather.exe "--port=9078 --DataPath=Two --Name=Two"
  • start .\TailFeather.exe "--port=9077 --DataPath=Three --Name=Three"

We have all three nodes up and running, so now is the time to actually make use of it:

http://localhost:9079/tailfeather/key-val/set?key=ravendb&val={ 'Url': 'http://live-test.ravendb.net', 'Database': 'Sample' }

In this case, you can see that we are setting a configuration value to point to a RavenDB database on the first node. Note that at this point, we have a single node cluster, and the two others are waiting to join it, but are taking no action.

We can get the value back using:

image
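
The screenshot didn't make it into this version, but given the KeyValueController routes above, the read call would presumably look something like:

http://localhost:9079/tailfeather/key-val/read?key=ravendb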

So far, so good. Now, let us add a second node in by inviting it to fly with our cluster. We do that using the following command:

http://localhost:9079/tailfeather/admin/fly-with-us?url=http://localhost:9078&name=Two

Which will give us:

image

Note that we are using the just-added node to look at this.
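
Again, the screenshot is missing here; judging by the client code, that topology view comes from the flock endpoint, presumably queried on the newly added node:

http://localhost:9078/tailfeather/admin/flock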

Next, we can add the third node.

http://localhost:9079/tailfeather/admin/fly-with-us?url=http://localhost:9077&name=Three

I would put the image in, but I think you get the point.

This is it for now. We have a highly available persistent & distributed key/value store. Next, we need to tackle the idea of snapshots and the client API, but I’ll deal with that at another post.

“Incremental” map/reduce in MongoDB isn’t

Rafal and Ben Foster commented on my previous post with some ideas on how to deal with incremental updates to map/reduce indexes. Rafal said:

Actually, it's quite simple if you can 'reverse' the mapping operation (for given key find all documents matching that key): you just delete aggregate record with specified key and run incremental map-reduce on all matching documents. In today's example, you would delete the aggregate with key='oren' and then run map reduce with a query:

db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'oren' } });

And Ben said:

It's worth mentioning that I was able to get the MongoDB map-reduce collections updating automatically (insert/update/delete) by monitoring the MongoDB OpLog …

…and listen for new documents in the OpLog which could then be used to re-execute an incremental Map-Reduce.

And while this looks right, it actually can’t possibly work. I’ll start with Rafal’s suggestion first. He suggests just issuing the following set of commands whenever we delete something from the database:

db.distinct_item_names.remove({name: 'oren' });
db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'oren' } });

And yes, that will actually work, as long as you are careful to never do this concurrently. Because if you do run this concurrently… well, the best you can hope for is missing data, but the likelier scenario is data corruption.

But this actually gets better. Deletes are annoying, but they are a relatively simple case to process. You have updates to deal with too. We’ll assume that we are watching the oplog to get notified when this happens. Here is a MongoDB oplog entry:

{
  "ts": {
    "t": 1286821984000,
    "i": 1
  },
  "h": "1633487572904743924",
  "op": "u",
  "ns": "items",
  "o2": {
    "_id": "4cb35859007cc1f4f9f7f85d"
  },
  "o": {
    "$set": {
      "Name": "Eini"
    }
  }
}

As you can see, we have an update operation (op: u) on a specific document (o2._id) with the specified update (o.$set). That is really great, and it is utterly useless for our purposes. In this case, we updated the name from Oren to Eini, so we would like to be able to run this:

db.distinct_item_names.remove({name: 'oren' });
db.distinct_item_names.remove({name: 'eini' });
db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'oren' } });
db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'eini' } });

Except that we don’t have any way to get the old value out from the oplog. And this still isn’t going to work concurrently.

But let us say that we decided to have a watcher process monitor the oplog somehow, and that it will ensure no concurrency of those requests. Now you have to deal with fun issues like: “what happens if the watcher process recycles?” How do you keep your place in the oplog (and remember, the oplog is capped, so stuff you haven’t seen might be removed if it goes beyond the specified size)?

And… to be frank, once we have done all of that, this is still the easy part. One of the reasons that you want to do this work in the first place is to deal with large amounts of data. But you cannot assume that you’ll have an even distribution of the data.

One bug request that came against the RavenDB map/reduce implementation was for a map/reduce index on the US Census data. That is ~300 million documents, and the index the user wanted to build was a map/reduce grouping by state. You have states like California, with more than 30 million people in it, and you realize that you don’t want to have to re-do the map/reduce over the entire 30+ million documents that you have there. In RavenDB, under this scenario, you’ll have to issue about 3,073 operations, by the way. Versus the 30 million you would need for this approach.

So yeah, “incremental” map/reduce can’t handle concurrent work, can’t handle deletes, can’t handle updates, and definitely shouldn’t be used on large data sets. And that is after you went to the trouble of setting up the watcher process, monitoring the oplog, etc.

Or, you can use RavenDB and you get a true incremental map/reduce without having to worry about any of that.

Differences in Map/Reduce between RavenDB & MongoDB

Ben Foster has a really cool article showing some of the similarities and differences between MongoDB & RavenDB with regards to their map/reduce implementation.

However, there is a very important distinction that was missed. Map/reduce operations are run online in MongoDB, which means that for large collections, map/reduce is going to be very expensive. MongoDB has the option of taking the result of a map/reduce operation and writing it to a collection, so you don’t need to run map/reduce jobs all the time. However, that is a snapshot view of the data, not a live view. Ben mentioned that you can do something called incremental map/reduce, but that isn’t actually a really good idea at all.

Let us look at the following sequence of operations:

db.items.insert({name: 'oren', ts: 1 });
db.items.insert({name: 'ayende', ts: 2});

var map = function Map() { emit(this.name, null); };
var reduce = function(key, val) { return key; };

db.items.mapReduce(map, reduce, { out: 'distinct_item_names' });

This creates two items, and gives me the distinct names in a separate collection. Now, let us see how that works with updates…

db.items.insert({name: 'eini', ts: 3 });

db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {ts: {$gt: 2} } });

This is actually nice, mongo is able to merge the previous results with the new results, so you only have to do the work on the new data. But this has several implications:

  • You have to keep something like a ‘ts’ property around to check for new stuff. And you have to update that ts property on every update.
  • You have to run this on a regular basis yourself, mongo won’t do that for you.
  • It can’t work with deletes.

It is the last part that is really painful:

db.items.remove({name: 'oren'});

Now, there is just no way for you to construct a map/reduce job that would remove the name when it is gone.

This sort of thing works very nicely when what you want is to just append stuff. That is easy. It is a PITA when we are talking about actually using it for live data that can change and be modified.

Contrast that with the map/reduce implementation in RavenDB:

  • No need to manually maintain state, the database does it for you.
  • No need to worry about updates & deletes, the database does it for you.
  • No need to schedule map/reduce job updates, database does it for you.
  • Map/reduce queries are very fast, regardless of data size.

To be frank, the map/reduce implementation in RavenDB is complex, and pretty much all of it comes down to the fact that we don’t do stupid stuff like run a map/reduce operation on a large database on every query, and that we support edge case scenarios like data that is actually updated or deleted.

Naturally I’m biased, but it seems to me that trying to use map/reduce in Mongo just means that you have to do a lot of hand holding yourself, while with RavenDB, we take care of everything and leave you to actually do stuff.

Review: Getting started with LevelDB

Getting Started with LevelDB

I was asked to review the book (and received a free electronic copy).

As someone that is very into storage engines, I was quite excited about this. After going over the leveldb codebase, I would finally get to read a real book about how it works.

I was disappointed, badly.

This book isn’t really about leveldb. It contains pretty much no background, explanation, history or anything much at all about how leveldb works. Instead, it is pretty much a guide to using LevelDB to write iOS applications. There are a lot of chapters dealing with Objective-C, NSString and variants, how to do binding, how to handle drag and drop.

However, the things that I would expect, such as explanations of how it works, what it does, alternative use cases, etc., are very rare, if there at all. Only chapter 10 is really worth reading, and even so, I got the feeling that it only made sense to me because I already knew quite a lot about leveldb. I can’t imagine starting from scratch and actually being able to understand leveldb from this book.

If you are working on iOS apps / OS X, I guess that this might be a good choice, but only if you want to know about actually implementing leveldb. You’ll need to do your actual leveldb learning elsewhere.

The book does contain some interesting tidbits. Chapter 10 talks about tuning and key policies, and it did have some interesting things to say, but it also contains wrong information* (and if I could spot it, with my relatively little experience with leveldb, I’m pretty sure that there are other things there that are wrong too).

* The book said that it is better to write keys in order, to reduce I/O. But leveldb writes to a skip list in memory, then flushes that entire thing in sorted fashion to disk. Your writes have to be bigger than the buffer size for that to actually matter, and even then it still won’t help you much.

In short, feel free to skip this book, unless you are very focused on writing leveldb apps on iOS. In which case it might be worth it, but I don’t think so. You are better off reading the docs or any of the tutorials.

So, what have YOU been learning lately?

One of the worst things that can happen to you professionally is stagnation. You know what you are doing, you know how it works, and you can coast along very easily. Unfortunately, there is the old saying: it isn’t what we know that we don’t know that is going to hurt us. It is what we don’t know that we don’t know that is going to bite us in the end.

One of the reasons that I have routinely been going out and searching for difficult codebases to read has been to avoid that. I know that I don’t know a lot, I just don’t know what I don’t know. So I go into an unfamiliar codebase and try to figure out how things work over there.

I have been doing that for quite some time now. And I am not talking about looking at some sample project a poor schlump put out to show how you can do CQRS with 17 projects to create a ToDo app. I am talking about production code, and usually in areas or languages that I am not familiar with.

A short list of the stuff that I have been gone over:

  • CouchDB (to learn Erlang, actually, but that got me to do DB stuff).
  • LevelDB
  • LMDB
  • NServiceBus
  • Mass Transit
  • SignalR
  • Hibernate
  • Hibernate Search

Those are codebases that do interesting things that I wanted to learn from. Indeed, I have learned from each of those.

Some people can learn by reading academic papers, I find that I learn best from having a vague idea about what is going on, then diving into the implementation details and seeing how it all fits together.

But the entire post so far was a preface to the question I wanted to ask. If you are reading this post, I am pretty sure that you are a professional developer. Doctors, lawyers and engineers (to name a few) have to recertify every so often, to make sure that they are current. But I have seen all too many developers stagnate to the point where they are very effective in their chosen field (building web apps with jQuery Mobile on ASP.Net WebForms 3.5) and nearly useless otherwise.

So, how are you keeping your skills sharp and your knowledge current? What have you been learning lately? It can be a course, or a book or a side project or just reading code. But, in my opinion, it cannot be something passive. If you were going to answer: “I read your blog” as the answer to that question, that is not sufficient, flatterer. Although, you might want to go a bit further and consider that imitation is the sincerest form of flattery, so go ahead and do something.

Some final notes about LMDB review

Okay, having gone through the LMDB codebase with a fine-tooth comb, I think that I can safely say that it is both a very impressive codebase and one that dearly needs some TLC. I’ll freely admit that I am by no means a C guy. And it is entirely possible that a lot of the issues that have been bugging me are standard C things. But I don’t think so. Methods that go on for hundreds of lines, duplicated code and a plethora of gotos hardly seem to be the things that pop to mind when I hear good C code.

But beyond my issues with the code, the implementation is really quite brilliant. The way LMDB manages to pack so much functionality by not doing things is quite impressive. Interestingly, you couldn’t write this database even 5 years ago. LMDB relies on being able to map the db into memory, and up until x64 became prevalent, you just couldn’t do that for any db with a meaningful size. With x64 and the effectively unlimited address space we have (will I be laughing at my naivety in a few years?), that is no longer an issue.

I learned quite a lot from the project, and it has been frustrating, annoying and fascinating experience.

World’s smallest No SQL Database: Persistent data

The second item to go over with the World’s smallest No SQL database is about persistence, and following that, I/O. Right now, there is no persistence. If you restart the process, all the data is gone. Now, there are actually quite a few real world dbs that behave in this fashion. But they are a rarity. For the most part, if I put data inside a db, I expect it to be there until I do something about it.

And at that point, you are in for quite a bit of complexity. How are you going to persist the data? The easiest way to do it, just creating a file per value in the db, is going to be… problematic on most systems. So you need to put a lot of data in a small set of files. Which means that you have to decide how you are going to put the data together. In general, there is the fixed size option, in which you divide the file(s) into pages and work around that. The good thing about this is that it gives you the ability to reuse space in the file after deletes / updates. The bad thing about it is that it is quite complex. Alternatively, you can just write the data out as needed, but then you can’t really update written data, and would need to run compactions.

And we haven’t talked about searching yet. Some DBs, like Bitcask / Munin, would actually store the keys in memory, and store the position on the disk for retrieving the value. But for the most part, both keys & values tend to be on disk in some form. In CouchDB, they are held inside an append only B+Tree. In LevelDB, they are held in Sorted String Tables. LMDB uses a Copy-On-Write B+Tree. Esent uses a B+Tree with a log file.

In each of those cases, the actual semantics for persistent data involve at least two concerns. You need to actually be able to search the data (that usually means more than just O(1) access, you want to be able to go back & forth on the keys) and you need to be able to do a transactional save. This is so you can recover in case of a crash, most of the time.

But there is actually a lot more that goes into the selection of the proper persistence format. To start with, how you store the data on disk will have a big effect on your performance. If you store the data as a linked list, for example, you might as well kiss your performance goodbye. Beyond that, we also have issues with things like how the format is going to scale when we have concurrent readers. For example, if you have something that does a lot of seeks, and rely on the seeks always going forward to ensure performance, that is going to be very badly hit the moment that you have concurrent readers doing concurrent reads on different parts of the system. You would be forcing the system to do random seeks.

There are other considerations, as well. For example, if you are using something like B+Tree, it is likely that you’ll be overwriting the same section on the disk multiple times. That is fine with HDD, but SSD would just give up & die on you at some point. And we haven’t started talking about secondary indexes yet…

Lightning memory-mapped database is Esent

Here is something that suddenly dawned on me as I was going through the LMDB codebase. LMDB is actually how Esent works. In fact, having gone through the LMDB codebase, I am in a much better position to understand how Esent works.

For example, Esent is limited to 16TB per database. I did some math to figure out the max size of an LMDB db that used 32 bits for the page number and I got to 2 ** 32 times 4,096 == 16 TB, and that isn’t a coincidence. And all the background information in the Esent API that talks about large values, for example. That is how LMDB does overflow, and tables are implemented as additional B-Trees (what LMDB calls dbs).

At some point I was going through this in my head and realized that indexes are also working in the exact same way, and so does pretty much everything else. What Esent does on top is provide better concurrency (LMDB allows only a single writer), and a whole lot of additional features (schemas, secondary indexes, etc).

But I think that the major difference from my point of view is that I had a theoretical idea about how Esent really worked, under the covers. Now I feel like I actually grok things at a much better level. For instance, Esent will not give back space to the operating system, even if you deleted stuff. I knew that, and just accepted that as how it worked. But now I can understand why it doesn’t do that. Or the MAX_VER_PAGES setting: I knew what its purpose was, and how it impacted operations (it is the amount of changes that can be pending). But after reading about LMDB’s dirty_pages and its limit, I understand the why and how it works at a much deeper level.

To be honest, I kept reading LMDB codebase purely because I wasn’t going to let that code beat me, but it is very educational.

Reviewing Lightning memory-mapped database library: MVCC

MVCC stands for Multi Versioning Concurrency Control. This is how you can have both readers & writers at the same time and not have to arrange locks for the two. With LMDB, MVCC is implemented by always creating a new page, never modifying pages in place. That also means that when we “free” a page, we need to make sure not to actually use it until all the transactions that could see it have completed.
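
The reuse rule itself is compact enough to state in a few lines. A conceptual sketch only (C# rather than C, and obviously not LMDB's actual code):

// a page freed by transaction F can only be handed out again once every reader
// still running started after F, i.e. no live transaction can still see the old page
static bool CanReusePage(long freedAtTxnId, long oldestActiveReaderTxnId)
{
    return freedAtTxnId < oldestActiveReaderTxnId;
}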

To be honest, I can’t follow the code. It is somewhat related to me_pghead, but I just can’t follow what is going on. I think that this is related to the way it manages multiple transactions, but I am just unable to follow the code. Maybe it is just that I overloaded my senses with too much C code. I have been diving into this code, and sometimes it feels like this:

That said, I understand how it has to work, so that should be enough for now. Next, I want to see how to do it myself.

Reviewing Lightning memory-mapped database library: What about free pages?

Okay, so in my last post (and last night, from my perspective), I wrote about the process in which LMDB maintains the B-Tree when it grows. But what about when it shrinks? And how does LMDB handle things like reusing space?

For that matter, we need to discuss another thing. As I understand this right now, LMDB is going to write a page once and only once. And when that is done, that page is now immutable. Updates to this page will result in a new page, and the old one will be marked for reuse. So you don’t actually have to worry about updating the page while others may be reading it. In other words, I would expect this code to use a lot of pages:

image

Actually, I would expect it to use two pages all the time. Alternating between the two after every transaction.

Indeed, following the code, it is easy to see that this magic happens during mdb_page_touch. We find (somehow, not sure how yet) a free page (or create a new one), and we mark the old one for reuse. That means that when you actually write to the data, you aren’t really writing to the old page, you are creating a new one. This drastically reduces a lot of complexity. However, it does mean that the db will be using more space, since every write transaction will use a minimum of one new page (and probably more, considering that there is the B-Tree to maintain). In fact, on average-size dbs, I would be surprised if a single transaction could write less than 3 – 4 pages as a minimum. I’ll test the min number of pages per transaction in a bit, right now I want to focus on the implications of that decision.

Well, to start with, concurrency is easier. We aren’t writing to old data, so we have MVCC. The only thing we need to ensure is that we aren’t reusing a page before all of its transactions are complete. But it also has other implications. I haven’t confirmed my suspicions about transaction commits, but because we are writing to new locations, not modifying existing ones, a crash midway would simply restore us to the previous state, and will not corrupt any data.

Since we don’t modify pages, it means that free pages aren’t just things that happen when we delete data, they happen all the time, so it would be interesting to see how LMDB handles them. Here is the relevant piece of the code:

image

So either find me a free page, or allocate a new one. Then add the existing page to the free list for the current transaction. A bit lower in the code we copy the data in the old page to the new one, and we are ready to go.

So, when we make updates, at least in this scenario, we are cycling between two pages, always allocating one and freeing the other, and on the next transaction we go the other way. This is really quite interesting, from the perspective of comparing it to other storage engines. CouchDB works in pretty much the same way, a B-Tree with all the associated benefits. But CouchDB’s model is based on append only writes, and it cannot reuse the space in the file, which requires compactions. LMDB’s model is much nicer in that regard. Since in both cases in-place updates are not allowed, there is a lot of wasted space that goes just for those updates. In LMDB’s case, that wasted space is a lot bigger, because it works in page sizes, but it can reuse them. In CouchDB’s case, it doesn’t reuse space, so it doesn’t use pages (it is more space efficient that way).

Side note: C’s lack of native data structures beyond array is really annoying. You realize that in C, a Stack is a design pattern?!

And about the minimum number of pages per transaction, let us see what is going on about that. So I wrote the following code:

image

This should generate a deep enough B-Tree to show what I want. And indeed, the action happens here:

image

Both the root node and the data node are modified, because of the cursor stack. It isn’t really a big deal, to be honest, since usually you would only need to update H pages (where H is the height of the tree) and H is very small for B-Trees.

But this leads to me to consider something else. Usually when talking about on disk structures, one of the more important things that you want to keep in mind is reducing seeks. But the B-Tree model that we have here would result in having pages pretty much all over the file. Now, in something like CouchDB, it doesn’t actually matter, because the file is append only, so you always end up scanning from the end of the file backward. But because LMDB is reusing pages, it is entirely possible for a root page to be in the middle of the file, the next level to be in the start and the actual data in the end.

Now, I am guessing that this isn’t actually an issue, because LMDB relies on the OS page cache to handle this, and that would take care of that. In practice, I expect, the branches of the tree are always going to be in memory, since they were the last written and the most accessed.

And that is enough for now. Next, I want to figure out how we return free pages to the system. In order to do that, I think that I’ll need to look at how transactions are committed.

Reviewing Lightning memory-mapped database library: A thoughtful hiatus

I thought that I would stop a bit from focusing on what the LMDB code is doing in favor of some observations about the code itself. Going into this codebase is like getting hit in the face with a shovel. Now, this might be my personal experience, as someone who has done a lot of managed code work in the past. But I used to be a pretty horrible C/C++ guy (the fact that I say C/C++ should tell you exactly what my level was).

But I don’t think that it was just that. Even beyond the fact that the code is C, and not C++ (which I am much more used to), there is a problem that only became clear to me well after I read the code for the millionth time. It grew. Looking at the way the code is structured, it looks like it started out about as nice a C codebase as you can get (don’t get excited, that isn’t saying much). But over time, features were added, and the general structure of the codebase wasn’t adjusted to account for that.

I am talking about things like this:

image

image

image

There are actually 22 (!) ‘if(IS_LEAF(mp))’ references in the codebase.

Or what about this?

image

image

It looks like certain features (duplicate key support, for example) were added that had a lot of implications on the code, but it wasn’t refactored accordingly. That makes it very hard to go through.

Reviewing Lightning memory-mapped database library: On page splits and other painful things

image

I don’t know if you noticed, but the LMDB codebase is giving me serious headache issues. The bloody thing is a very dense piece of code, but at the same time, it is also quite interesting. In particular, B-Trees are pretty much The Answer for a lot of on disk data structures, but they tend to be quite hard to handle properly. So I am looking forward to this piece of code, in which I am going to see if I can figure this code out. Just to note, mdb_page_split is also another 400-line method with goto sprinkled all over.

In order to do that, I wrote the following code (in a single transaction):

image

And then I spent another few minutes trying to get it to compile. Mostly because I started out with “for (int i=0; …” and C doesn’t allow that (blah!).

Anyway, I got the page to split, and now I want to see how it actually behaves. I am currently at the first part where we take the root (currently leaf) page and turn it into a branch. This is an interesting case of a root split.

We start with:

image

And the first thing we do is to go to this mode:

image

We add an empty implicit pointer to the previous root. But this is really just the beginning, now we need to divide the entries that used to be in the root between the left & right sides. This is made complex by the fact that you might have it setup so it has a few large & small values, so just cutting them in the middle would produce a result that would be too big. At least, that is what a comment says. I am not sure how that can be. We are going from one page to two pages, so I don’t see how you can get into a situation where that would happen. I guess it is time to slot into more code.

Okay, I got it, the issue is that we do a page split because we need to add a new item. So that is the reason we have to jump through those hoops. Because we add a new item (one that is already too big for the original page, since that is why we are actually splitting it).

Another complex issue with page splits is that they might be recursive. If we need to split a page, we need to add an entry to the parent page, which might cause that page to split, etc…

An interesting observation on the sizes involved: an LMDB page has 4084 bytes available for storage. The data for the page number is 8 bytes long (it uses pointer size page numbers) and assuming keys that are 16 bytes in size (and including about 8 bytes for the node header), we get about 128 keys per branch page. Even considering that B-Trees are specifically designed to make themselves efficient in the presence of memory hierarchies, it is quite impressive.

Consider, assuming a full tree, if we hold just the root and the first level in memory, we would consume about 512kb. And that would give us just one seek to get any of ~2 million items. As an aside, one reason that I like reading this code is that for once, this is a B-Tree implementation that isn’t covered in locking code, which is how this usually works in RDBMS.
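
A quick back-of-the-envelope check of those numbers (a sketch in C#, using the sizes assumed above):

const int usableBytesPerPage = 4084;   // LMDB page minus the page header
const int pageNumberSize = 8;          // pointer sized page number
const int keySize = 16;                // the key size assumed above
const int nodeHeaderSize = 8;          // approximate per node overhead

// ~127 branch entries per page, close enough to the 128 quoted above
var keysPerBranchPage = usableBytesPerPage / (pageNumberSize + keySize + nodeHeaderSize);

// root page + one full level of branch pages kept in memory: roughly 512 KB
var memoryForTopLevels = (1 + keysPerBranchPage) * 4096;

// one more seek reaches any leaf page; with ~128 entries per leaf that is ~2 million items
var itemsOneSeekAway = keysPerBranchPage * keysPerBranchPage * 128;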

Another aspect that is quite interesting is that this code really shows important aspects of working with relational databases. It is all about keeping things inside the same page, with spill over to overflow pages slowing things down because you need to do another seek. Or a really obvious explanation of why page splits are such a bad thing, and a lot of other details that you learn when you go a bit deep into relational databases but (at least for me) have never been real before I started dealing with building databases myself.


Reviewing Lightning memory-mapped database library: On disk data

As mentioned, the data in LMDB is actually stored in pages. And I am currently tracking through the process in which we add a new item to the database. The first thing that happens is that we allocate a leaf page. I think that I’ll have to go over how pages are allocated now, which should explain a lot about how LMDB reuses disk space.

A new page is allocated by first finding the oldest transaction that is still active. This is how LMDB implements MVCC. Basically, any page that was freed before the oldest active transaction started is eligible for reuse. The first db (although a better term would be to call it a btree, or maybe a table) contains the free pages, and we first search there for an available page. If we can’t find such a page, we use the end of the file instead. This happens in mdb_page_alloc.

A very interesting aspect is the fact that LMDB allows direct memory writes (with the additional risk of corrupting the db if you mess up the data). Interestingly, this is done in the following code:

image

If the options allow direct memory writes, you get a pointer into the mmap'ed file. Otherwise, LMDB will allocate a page (or re-use one that has already been allocated).

I am not really sure what is going on with the insert yet. This is a function pointer (a delegate, in C# terms), and it selects which behavior will happen later on. I am just not sure what those two different functions do yet.

I think I got it, the key is here:

image

You can see that we use the insert method to add the mid variable to the dirty list. So if we give you a direct pointer, we append it to the list. But if we allocate a page for you, we need to add it in sorted order.

mdb_mid2l_insert adds an item to the list in the appropriate location to keep it sorted. I think that this is done so that when we actually write the dirty pages to disk, if we do that using file I/O, we will do it in ascending file order, so we can get good seek times from the disk. A new page has 4,084 bytes available for it (12 bytes are taken by header data).
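
Conceptually (a minimal C# sketch of the idea, not the actual mdb_mid2l_insert code), keeping the dirty list sorted by page number is just an ordered insert:

using System.Collections.Generic;

// Hypothetical sketch: keep dirty pages sorted by page number so that a later
// write-back pass touches the file in ascending order.
var dirtyPages = new List<long>();

void AddDirtyPage(long pageNumber)
{
    int index = dirtyPages.BinarySearch(pageNumber);
    if (index >= 0)
        return;                                // already tracked as dirty
    dirtyPages.Insert(~index, pageNumber);     // ~index is the sorted insertion point
}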

A database is actually created the first time data is written to it. The root page is allocated and recorded. And then the data is added to the page.

As you might recall, data inside a page is kept in a sorted list. But remember that the data is also stored on the page, and there is the whole notion of overflow pages, so I think that I am going to have to draw a picture to figure out what is going on.

image

This is more or less how it works. But I don’t think that I really get it yet. Note that there are two collections here. The first is the positions of the nodes in the page, and the second is the nodes themselves. The reason we have this complexity is that we need to maintain both of them in an efficient manner, and we need to do that without too much copying. Therefore, the list of node offsets sits at the beginning of the page, growing downward, and the actual nodes are at the end, growing upward. The page is full when you can’t fit anything between those two lists.
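
Here is my own C# illustration of that layout idea (not LMDB’s actual structs): an array of node offsets grows from one end of the page, the node data grows from the other, and the page is full when the two would meet.

using System;
using System.Collections.Generic;

// Hypothetical illustration of the "offsets at one end, data at the other" page layout.
class Page
{
    const int PageSize = 4096;
    const int HeaderSize = 12;

    readonly byte[] buffer = new byte[PageSize];
    readonly List<ushort> nodeOffsets = new List<ushort>(); // kept in key order
    int upperBound = PageSize;                              // node data grows down from here

    int FreeSpace => upperBound - (HeaderSize + nodeOffsets.Count * sizeof(ushort));

    public bool TryAdd(int sortedIndex, byte[] node)
    {
        // adding a node costs its payload plus one more offset slot
        if (FreeSpace < node.Length + sizeof(ushort))
            return false;                                   // caller has to split the page

        upperBound -= node.Length;
        Buffer.BlockCopy(node, 0, buffer, upperBound, node.Length);
        nodeOffsets.Insert(sortedIndex, (ushort)upperBound); // keep the offsets sorted by key
        return true;
    }
}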

A node, by the way, is the LMDB internal name for a key (and maybe data). But I think that I might have an error here, because I don’t see the code here to actually add nodes to the page in a sorted fashion, so it might be doing a linear search on the contents of a page. I am now tracking through the code for adding a second value into a database, and I think that I have to be wrong about the linear node search. The code I was looking at was invoked in the new db creation section, so it might be taking shortcuts that aren’t generally available.

At any rate, the logic for adding an item goes like this… first we need to find the appropriate page. And we do this by starting from the root and asking for the actual page. Which gets us to mdb_page_get, where we have the following:

image

The really interesting part here is that each transaction has a dirty list of pages, so if you modified a page, you’ll get your own version, even before the data was committed. Otherwise, we just hand you the current version. This is quite nice.
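
In other words (my own C# sketch of the idea, not LMDB’s code), page lookup inside a write transaction checks the transaction’s private dirty pages before falling back to the shared, committed view:

using System;
using System.Collections.Generic;

// Hypothetical sketch: a write transaction sees its own modified pages first.
class Transaction
{
    readonly Dictionary<long, byte[]> dirtyPages = new Dictionary<long, byte[]>();
    readonly Func<long, byte[]> readCommittedPage; // e.g. reads from the memory map

    public Transaction(Func<long, byte[]> readCommittedPage)
    {
        this.readCommittedPage = readCommittedPage;
    }

    public byte[] GetPage(long pageNumber)
    {
        // uncommitted changes made by this transaction win...
        if (dirtyPages.TryGetValue(pageNumber, out var dirty))
            return dirty;

        // ...everyone else just sees the last committed version
        return readCommittedPage(pageNumber);
    }
}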

And then we get to the already familiar mdb_page_search_root function, and I can confirm that the nodes in a page are actually sorted, which only makes sense. I am not sure where that happens yet, but I have got the feeling that I am going to see it in a bit.

Okay, I am going to go on a bit of a rant here, mostly against C, to be honest.

image

Take a look at this code. mdb_page_search does something, but what it does is mutate the state of the cursor (mc), and you are then going to access mc->mc_pg[mc->mc_top] to get the actual result you wanted. To make things more interesting, there is also mc->mc_ki, which is the index of the node inside the page. And it drove me crazy, because I just couldn’t see the association between those three values, especially because I am so not used to this type of code that I never even considered this as a possibility.

At any rate, now I know how it gets to insert things in the right order. When doing mdb_page_search, it calls mdb_node_search, which does the appropriate setup to tell the next call where it needs to actually do the insert to keep things sorted. I am currently in mdb_cursor_put, which is thrice cursed and really hard to follow. A 400+ line method with gotos galore.

But I am in the section where we are actually going to be writing a new value (the new_sub: goto label, if you care). And the code is pretty straightforward from there.

Next up, I want to see how it handles a page split, but that is a subject for another post.


Reviewing Lightning memory-mapped database library: Stepping through makes everything easier

Okay, I know that I have been critical about the LMDB codebase so far. But one thing that I really want to point out is that it was pretty easy to actually get things working on Windows. It wasn’t smooth, in the sense that I had to muck around with the source a bit (hack endianness, remove a bunch of unix specific header files, etc). But that took less than an hour, and that was pretty much it. Since I am by no means an experienced C developer, I consider this a major win. Compare that to leveldb, which flat out won’t run on Windows no matter how much time I spent trying, and it is a pleasure.

Also, stepping through the code I am starting to get a sense of how it works that is much different than the one I had when I just read the code. It is like one of those 3D images, you suddenly see something.

The first thing that became obvious is that I totally missed the significance of the lock file. LMDB actually creates two files:

  • lock.mdb
  • data.mdb

Lock.mdb is used to synchronize data between different readers. It seems to mostly be there if you want to have multiple writers using different processes. That is a very interesting model for an embedded database, I have to admit. Not something that I think other embedded databases are offering. In order to do that, it creates two named mutexes (one for read and one for write).
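
Named mutexes are how you get cross-process synchronization on Windows; a minimal C# sketch of the idea (the mutex name here is made up, not the one LMDB uses):

using System.Threading;

// Hypothetical example: serialize writers across processes with a named mutex.
using (var writeLock = new Mutex(initiallyOwned: false, name: @"Global\my-db-write-lock"))
{
    writeLock.WaitOne();          // blocks until no other process holds the write lock
    try
    {
        // ... perform the write transaction ...
    }
    finally
    {
        writeLock.ReleaseMutex();
    }
}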

A side note on Windows support:

LMDB supports Windows, but it is very much a 2nd class citizen. You can see it in things like a path not found error turning into a no such process error (because it tries to use GetLastError() codes as C errno codes), or the fact that it doesn’t create the data directory for you, even though not having it there will cause a failure.

I am currently debugging through the code and fixing such issues as I go along (but these are heavy handed magic fixes, just to get past this stage to the next one; otherwise I would have sent a pull request).

Here is one such example. Here is the original code:

image

But ReadFile in Win32 will return false if the file is empty, so you actually need to write something like this to make the code work:

image

Past that hurdle, I think that I get a lot more about what is going on with the way LMDB works than before.

Let us start with the way data.mdb works. It is important to note that for pretty much everything in LMDB we use the system page size. By default, that is 4KB.

The data file starts with 2 pages allocated. Those pages contain the following information:

image

Looking back at how CouchDB did things, I am pretty sure that those two pages are going to be pretty important. I am guessing that they will always contain the root of the data in the file. There is also the last transaction on them, which is what I imagine determines how something gets committed. I don’t know yet; as I said, I am guessing based on how CouchDB works.

I’ll continue this review another time. Next time, transactions…


Reviewing Lightning memory-mapped database library: going deeper

Okay, I now have a pretty rough idea about how the codebase actually works. I still think that the codebase is quite ugly. For example, take a look at this:

image

The len parameter here is used to decide whether CreateFile should open-or-create the file or just open it (read only). But why is it passed in a parameter called len?

Probably because it was just there, and it would be a sin to create another local variable just for this, I guess (in a codebase where a method can have > 18 local variables!). To make things more interesting, in the rest of this method, it actually is used as a string length variable, sigh.

At any rate, let us actually dig deeper now. The following structure holds the data about a db.

image

This is actually somewhat misleading, at least with regards to how I would think about a db. This is the entry point for all the pages that belong to a specific db. But a db in LMDB is not really the same thing as a db in SQL Server or RavenDB. It all resides in the same file, and you always have at least two dbs. The first one is the free db, which is used to track all the free pages. The second one is the main db. Then you have additional, named databases.

This is used here:

image

This defines the metadata for the entire environment. Note that we have the two dbs there in mm_dbs. The mm_txnid denotes the last committed transaction id. This value is what gives LMDB its MVCC support. The mm_last_pg value is the last used page; any transaction that wants to write will start writing at that value.

Let us see how we deal with pages here, shall we?

image

The first part tries to find a dirty page, if we are in a read/write transaction and we haven’t specified that we can write directly to memory. This is done by doing a binary search on the list of dirty pages.

Otherwise, we can just hand the user the actual page by accessing it directly.

Next, let us look where this is actually used. Searching for a page with a specific key in it. This is done mostly in mdb_node_search.

image

This seems to be doing a binary search for the keys inside a specific page (in this case, the page that is at the top of the stack on the cursor). That leads to the conclusion that pages internally store their data as sorted arrays.

And this leads me to another pet peeve with this code base. Take a look at this line:

image

Sure, this is a well known trick to cheaply divide a number by two, but are you seriously telling me that the compiler isn’t going to optimize (low + high) / 2? To my knowledge, no C compiler updated in the last 10 – 15 years manages to miss this optimization. So why write code that is going to be harder to read?

Okay, so now we know how we search for a specific key inside a page, but how do we get to the actual page that we want to search in? This happens in mdb_page_search_root. Let me see if I can take it apart.

When this method is called, the cursor is set up so the first page on the pages stack is the root page.

And… that is enough for me. Up until now, I have been trying to just read the code. Not execute it, not debug through it, nothing. Just go over the code one line at a time and figure out what is going on. I actually think that I have a really good grasp of what is going on in there, but I think that this is pretty much all I can do at this point from just reading the code. So I am going to stop now and set up a debug environment so I can work with it, and report my findings from stepping through the code.


World’s Smallest No SQL Database: Persistent transaction logs

As it stands, the World’s Smallest No SQL Database will last only as long as you actually have power. The moment that you have a restart, all the data is gone. That is actually how several databases run, but for now, we are going to assume that this is not desirable. The question now becomes, how do you actually store the data on disk?

This really becomes a pretty complex question, because you need to store the data on disk in a way that is crash safe, allows updates, and doesn’t take all the disk space in the world. Before we get to the actual on disk data structures, we need to discuss how we implement persistent logs. Persistent logs are the key way that databases get Durability. And as it turns out, there are just two ways of doing that that I am aware of:

  • Append only
  • Transaction log

Append only models rely on the fact that you only ever append, as a way to ensure that the data is either all in or all out. When we write, we don’t overwrite values, we create new values, and then we write where the last log entry is located. In CouchDB, for example, this is done by modifying the header portion of the file to point to the new data. In LMDB this is done by updating the page listing with the new pages. In both cases, you actually commit a transaction when the on disk pointer is changed to point to the new data. If you crash midway, nothing bad happened: you didn’t update the on disk pointer, so it is still pointing at the old location. Worst case scenario, you wasted some disk space, and you probably have a way to reclaim that anyway.

Transaction logs use a different way to handle this. They are also usually implemented as append only files, into which we write the new data. Afterward, we can apply those changes in memory / on disk safely. A crash would simply mean having to replay the log. This is how leveldb, for example, works. This is also the essential idea behind how SQL Server and Oracle work, although in their case I believe that the transaction log usually contains both the previous and the new state of the pages changed by a specific transaction.

One thing that you have to be aware of, either way, is that you should be prepared to handle scenarios where your process crashed midway, or when your entire machine had the plug pulled out. That means using fsync, sure, but it also means that you might have to do log replay, or be ready to see if you can recover something from the append only model.

The good thing about the transaction log approach is that after you have committed the changes to the standard persistent format, you can clear the log (usually by just creating a new file and deleting the old one). With the append only model, you usually have to run some sort of compaction to clear things out. Note that the transaction log model vs the append only model doesn’t really say anything about how the rest of the persistent data is actually stored. We will touch on that in the next post.
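
To make the transaction log idea concrete, here is a minimal C# sketch (my own, not taken from any of the databases mentioned): append length prefixed records, flush to disk, and replay whatever survived on startup. A real log also needs checksums and torn write detection, which I am skipping here.

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical minimal write-ahead log: append records, fsync, replay on startup.
class TransactionLog
{
    readonly FileStream log;

    public TransactionLog(string path)
    {
        log = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        log.Seek(0, SeekOrigin.End);
    }

    public void Append(byte[] record)
    {
        var writer = new BinaryWriter(log);
        writer.Write(record.Length);       // length prefix
        writer.Write(record);
        writer.Flush();
        log.Flush(flushToDisk: true);      // the fsync part, this is what buys durability
    }

    public IEnumerable<byte[]> Replay()
    {
        log.Seek(0, SeekOrigin.Begin);
        var reader = new BinaryReader(log);
        while (log.Position < log.Length)
        {
            int len = reader.ReadInt32();
            if (len <= 0 || log.Position + len > log.Length)
                yield break;               // torn tail from a crash; a real log would truncate it
            yield return reader.ReadBytes(len);
        }
    }
}

With something like that in place, clearing the log after a checkpoint is just replacing the file, which is exactly the advantage over the append only model described above.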

On Memory Mapped Files

Tobi had a few questions about memory mapped files. And it is quite an interesting topic.

A memory mapped file is a feature of all modern operating systems. It requires coordination between the memory manager and the I/O subsystem. Basically, you can tell the OS that some file is the backing store for a certain portion of the process memory. In order to understand that, we have to understand virtual memory.

Here is your physical memory, using 32 bits for ease of use:

image

Now, you have two processes that are running. Each of them gets its own 4 GB address space (actually, only 2 GB is available to the process in 32 bits, but that is good enough). What happens when both of those processes obtain a pointer to 0x0232194?

Well, what actually happens is that the pointer that looks like a pointer to physical memory is actually a virtual pointer, which the CPU and the OS work together to map to physical memory. It is obvious from the image that there is a problem here: what happens if two processes use 4GB each? There isn’t enough physical memory to handle that. This is the point where the page file gets involved. So far, this is pretty basic OS 101 stuff.

The next stuff is where it gets interesting. So the OS already knows how to evict pages from memory and store them on the file system, because it needs to do that for the page file. The next step is to make use of that for more than just the page file. So you can map any file into your memory space. Once that is done, you can access the part of the memory you have mapped and the OS will load the relevant parts of the file to memory. Again, pretty basic stuff so far.

You can read about this more in this article.

The reason you want to do this sort of thing is that it gives you the ability to work with files as if they were memory. And you don’t have to worry about all that pesky file I/O stuff. The OS will take care of that for you.
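
In .NET terms, that looks roughly like this (a minimal sketch; the file name and sizes are made up):

using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical example: map a file and read/write it as if it were memory.
using (var mmf = MemoryMappedFile.CreateFromFile("data.bin", FileMode.OpenOrCreate, null, 64 * 1024))
using (var accessor = mmf.CreateViewAccessor())
{
    accessor.Write(0, 42L);             // write a long at offset 0, no explicit file I/O
    long value = accessor.ReadInt64(0); // read it back straight from the mapped view
}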

And now, to Tobi’s questions:

  • What are the most important advantages and disadvantages?

Probably the most important is that you don’t have to do manual file I/O. That can drastically reduce the amount of work you have to do, and it can also give you a pretty significant perf boost. Because the I/O is actually being managed by the operating system, you benefit from a lot of the optimization work that has already gone into it. The OS will do things like page buffering, caching, preloading, etc. It also makes it drastically easier to do parallel I/O safely, since you can read/write from “memory” concurrently without having to deal with a complex API.

As for the disadvantages, you need to be aware that things like the base address can change whenever you re-open the file, and that data structures that are good in memory might not perform well when they are stored on disk.

Another problem relates to how the changes actually are saved to the file. It is hard to make sure that the writes you do are coherently stored in the file system. For example, let us say that you made changes to two different memory locations, which reside on two different pages. The OS can decide, at any time, to take one of those pages away from you because it needs that memory for other things. When that happens, it will write that page to disk for you. So when you ask for it the next time, it can load it up again and the application won’t notice.

However, what would happen during a crash? You might have partially written data in that scenario. Usually, when writing to files using file I/O routines, you have a very predictable write pattern. When you use memory mapped files for writes, you don’t know for sure in what order that is going to happen. The OS is free to choose that. There are ways you can control that. For example, you might use FILE_MAP_COPY to avoid the OS writing stuff for you, but you would have to handle the writes yourself now, and that is decidedly not trivial.

You can use something like FlushViewOfFile() to make sure that a specific range is flushed, but that still doesn’t mean that the pages will be written in any order that you might expect. As a db writer, I care, quite deeply, about the exact way the data is written to the file, because otherwise it is really hard to reason about how to read it back. Especially when you have to consider failures and corruptions.

Personally, I would not want to rely on memory mapped writes for data that I consider critical.

  • Why are well known products like SQL Server using custom caches instead of memory mappings?

SQL Server is actually its own operating system. It manages memory, locks, threads, I/O and a lot more. That comes from the fact that for a long time, it had to do those sorts of things. But note that SQL Server probably uses memory mapped files quite a lot. They are also using their own custom caches because it makes sense to them. For example, the query plan cache. Even when you do have memory mapped files, you usually have multiple layers of caching.

In RavenDB, for example, we have the native page cache (managed by Esent), the documents cache (managed by RavenDB) and the client cache. The reason that we have multiple layers is that we cache different things. The client cache avoids having to call the server. The documents cache avoids having to parse documents, and the native page cache avoids having to go to disk.

  • Why are other products like LevelDB using mmap instead of custom caches?

Because they are drastically simpler products. They really want to just give you the data as quickly as possible, and they don’t need to do additional processing of the data. They can lean on the OS page cache to a much larger extent. Note that when used in real products, there are often higher level caches on top of them anyway, used for storing processed / parsed information.

  • Are they well suited for managed code?

Memory Mapped Files are usually harder to use from managed code, because we don’t do our own memory management. It does mean that you lose the benefit of just treating this as part of your memory space, because there is a very clear line between managed memory and native memory, and memory mapped files are on the other side of the hatch. That said, you can easily use them via UnmanagedMemoryStream, or read from them directly via a memory accessor. The managed implementation of leveldb makes heavy use of memory mapped files.

  • How does performance compare with other techniques?

It is generally better, mostly because the OS is really good at managing paging, and because you rely directly on the low level I/O routines. Another thing that you have to remember is that if you use file I/O, you need to create a buffer, copy to/from that buffer, etc. Using memory mapped files saves all of that.

  • Is it possible to use them with mutable data structures and retain crash recoverability?

Yes, but… is probably the best answer. Yes, you can use them for mutable data, but you really want to be careful about how you do it. In particular, you need to make sure that you write in such a way that you can survive partial writes. A good way of doing that is to make writes to specific pages, and you “commit” by recording that those pages are now available on a metadata page, or something like that. This requires a lot of really careful work, to be honest. Probably more work than you would like it to be. LMDB does it this way, and even if the code wasn’t an eye tearing pain, what it is doing is quite hard.

Note that in order to actually be sure that you are flushing to disk, you have to call both FlushViewOfFile and FlushFileBuffers on Windows.
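
From managed code, the rough equivalent (a sketch, assuming a reasonably recent .NET and a view created over your own FileStream) is to flush the mapped view and then flush the file handle to disk:

using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical example: flushing a mapped view all the way to disk on Windows.
using (var file = new FileStream("data.bin", FileMode.OpenOrCreate, FileAccess.ReadWrite))
using (var mmf = MemoryMappedFile.CreateFromFile(file, null, 64 * 1024,
           MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: true))
using (var accessor = mmf.CreateViewAccessor())
{
    accessor.Write(0, 42L);
    accessor.Flush();              // roughly FlushViewOfFile: push the dirty view to the OS
    file.Flush(flushToDisk: true); // roughly FlushFileBuffers: push the OS buffers to the disk
}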

  • What guarantees does the OS make regarding ordering and durability?

Pretty much none regarding ordering, as noted. Windows will guarantee that if both FlushViewOfFile and FlushFileBuffers have been called and successfully completed, you are golden. But there aren’t any promises in the API about what will happen for partway failures, or in what order this happens.

To summarize:

Memory mapped files are a great tool, and for reads, they are excellent. For writes, they are awesome too, but since there is no way to ensure in what order dirty pages get written to disk, it makes it hard to build a reliable system using them.

I would rather use standard file I/O for writes, since that is far more predictable.

World’s smallest No SQL Database: ACID & Transactions

Even for a first attempt, the World’s Smallest No SQL Database is actually pretty good. We got 3 of the 4 ACID properties.

The DB is:

  • Atomic, since a change will either go in or be rejected.
  • Consistent, you can’t see changes half way.
  • Isolated, one change can’t modify/view another.

It isn’t durable, because everything is in memory.

It even has transactions, in the sense that a change goes in or not in an atomic fashion.

But, I think you’ll agree, this is very much a poor man’s solution. And those properties only apply as long as you work with just a single item. In practice, you would usually want to do a lot more than that.

The basic property you want is the ability to do multi item transactions, so you can modify two items at the same time, and they either both go in or not at all. And that is where it gets really complex, because at that point we need to build our own way to handle atomic updates to multiple values, to keep them isolated and to handle all of this consistently.

There isn’t really a simple way to handle that, especially if we want to handle concurrent transactions. You would have to make a lot of decisions all of a sudden. For example, what happens if you have two concurrent transactions trying to update the same value? Is one of them rejected? And if so, at what point? Initial write? Commit time?

And that is just the start.

  • Do you allow a transaction to live across multiple requests?
  • How do you handle transaction expiration?
  • Do you need to have a transaction span more than one node? How do you handle that?

And we haven’t even talked about making transactions durable. There is no point, since we don’t do any persistence. But we will touch on persistence in a future post, so let’s talk about durability right now.

In general, there are two major ways to handle transactions. Either an append only model or a transaction log. In both cases, concurrent transactions make it… interesting to figure out how to properly write to the log. And the moment that you have either option, you have to build in some form of a recovery tool for system or db crashes. You had better also love fsync, since it is a very important tool in making sure that you can actually get durable transactions.

Oh, and of course, I forgot, if we allow long running transactions…

  • Who can read the as yet uncommitted value? Everyone? Just the current transaction? Can you explicitly ask to see uncommitted values?
  • How do you allow it to be read?
  • Is it available for queries?

This is usually the most touchy part of the system, since it is so critical. It is also something that you have to take into account in pretty much the entire codebase.


World’s Smallest No SQL Database: Talk

As I mentioned earlier, I actually gave that talk at USI Events in Paris a short while ago. The recording for this talk is now available, you can find it here:

http://www.usievents.com/en/conferences/12-paris-usi-2013/sessions/1090-why-you-should-never-write-your-own-database

I am afraid that I had a very short time to go through the details, which is why I am also doing the blog post series.


Reviewing Lightning memory-mapped database library: Because, damn it!

I decided to resume my attempt to figure out the LMDB codebase. Okay, I did some more figuring out and I think that I know where I went wrong. The codebase appears to be trying to be small enough to fit into the CPU caches, and in order to do that, they decided to make it hard enough to make sure it won’t fit into my cache. But I’m trying to work through that.

One of the things that I figured out was that I made a mistake: I kept thinking about databases in the same way you’ll see in RavenDB or SQL Server, as completely separate items. Instead, LMDB does something quite different. It basically uses a single file for everything. This is called the environment, and that can contain named dbs. If you have multiple databases, all of them are going to reside in the same file. The way it works, it creates a file and then maps it into memory. On Windows, it actually allocates the space for the file upfront (which doesn’t appear to be necessary).

But the basic idea is that you need to specify upfront the size of the file you want, and that is the maximum data size you can have. This can change, but only if you close the environment and start it up again with a higher value. This also explains why you have to specify the number of databases you want to have per environment. Once you have created the environment, everything else becomes an issue of just managing the pages. In particular, because LMDB only allows a single writer, it doesn’t really have to worry about concurrency issues. It has mapped all the data into memory, and then it is just a matter of creating the appropriate data structures.

A lot of the code is dedicated to managing pages of data, and now that I have gone through enough of the codebase to be sure that I sort of have a vague idea about what is going on, I can still say that I think that this is way denser than it should be. And I shudder to think what it would take to make any sort of change to this codebase.


Reviewing Lightning memory-mapped database library: tries++

I decided to resume my attempt to figure out the LMDB codebase. The previous attempt ended when my eyes started bleeding a little, but I think that I can get a bit further now.

One thing that I have figured out already is that because this is C code, memory management is really painful. I noted previously how good it was of leveldb to use std::string for most allocations, since it frees both the caller and the lib from having to worry about the memory. LMDB only shows how important that was. Take a look at the following method:

image

Quick, what do you think it does? As it turns out, what it does is update the key to point to the key that the cursor is currently pointing at. The reason for that is that the library doesn’t want to own the memory for the key (since then it would have to provide a way to free it).

Now, to be fair, this appears to be an internal method, only used by the lib itself, but such things are pretty pervasive in the codebase, and it is really annoying that the code is all in one spot, with no real structure, since I am still trying to figure out some sort of hierarchy for the code so I could get a grip on what is actually going on.

I decided to just go to the start and figure out how it opens a database:

image

It is interesting that you need a transaction to open the database. However, the opened database survives the transaction closing, and from the docs, if you open a db in a transaction, that is the only thing you can do in that transaction.

I think that I just figured out something, the code in mdb.c is actually written in reverse depth order. So the closer to the end of the file a function is, the more visible it is.

Okay, it appears that there is the notion of multiple dbs, but there is also the notion of the main db. It looks like the main db is also used for tracking things for child dbs. In particular, when opening a named db, we have the following:

image

We search for the db information in the main db. As you can see, mdb_cursor_set is actually setting the data on the cursor to the value we passed in. More interesting is what happens if the db is not there. It appears to merely create the db data and set it in the main db. Okay, looking at this function’s docs, it appears that this is just about allocating a database handle, not about opening the actual database.

I am not really sure that I understand the reasoning behind it, but a new database is actually created in mdb_cursor_put. So only after the first time you create a value will the db actually be created.

image

I guess that explains why this function goes on for over 400 lines of code and has no fewer than 10 gotos (!) and 7 goto target sections!

image

I get that this is C, but come on, seriously!


World’s Smallest No SQL Database: Concurrency

I am pretty sure that it would surprise you, but the World’s Smallest No SQL Database has a well defined concurrency model. Basically, it is using Last Write Wins. And we are safe from any concurrency issues. That is pretty much it, right?

Well, not really.  In a real world system, you actually need to do a lot more with concurrency. Some obvious examples:

  • Create this value only if it doesn’t exist already.
  • Update this value only if it didn’t change since I last saw it.

Implementing those is actually going to be pretty simple. All you need to do is to have a metadata field, version, that is incremented on every change. Here is the change that we need to make:

public class Data
{
    public byte[] Value;
    public int Version;
}

static readonly ConcurrentDictionary<string, Data> data =
    new ConcurrentDictionary<string, Data>(StringComparer.InvariantCultureIgnoreCase);

public HttpResponseMessage Get(string key)
{
    Data value;
    if (data.TryGetValue(key, out value) == false)
        return new HttpResponseMessage(HttpStatusCode.NotFound);

    var response = new HttpResponseMessage
    {
        Content = new ByteArrayContent(value.Value)
    };
    response.Headers.Add("Version", value.Version.ToString());
    return response;
}

public void Put(string key, [FromBody]byte[] value, int version)
{
    data.AddOrUpdate(key, _ =>
    { // create
        if (version != 0)
            throw new ConcurrencyException();
        return new Data { Value = value, Version = 1 };
    }, (_, prev) =>
    { // update
        if (prev.Version != version)
            throw new ConcurrencyException();
        return new Data { Value = value, Version = prev.Version + 1 };
    });
}

As you can see, it merely doubled the amount of code that we had to write, but it is pretty obvious how it works. RavenDB actually uses something very similar to that for concurrency control for writes, although the RavenDB ETag mechanism is also doing a lot more.

But the version system that we have above is actually not enough, it only handles concurrency control for updates. What about concurrency control for reads?

In particular, how are we going to handle non repeatable reads or phantom reads?

  • Non repeatable reads happen when you are reading a value, it is then deleted, and when you try to read it again, it is gone.
  • A phantom read is the other way around: first you tried to read it, but didn’t find anything, then it was created, and when you read again, you find it.

This is actually interesting, because you only care about those for the duration of a single operation / transaction / session. As it stands now, we actually have no way to handle either issue. This can lead to… interesting bugs that only happen under very specific scenarios.

With RavenDB, we actually handle both cases. In a session’s lifetime, you are guaranteed that if you saw a document, you’ll continue to see this document until the end of the session, which deals with the issue of non repeatable reads. Conversely, if you didn’t see a document, you will continue to not see it until the session is closed. This is done for Load; queries are a little bit different.
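
A session that gives you this guarantee is conceptually just a cache of everything it has already answered; here is a rough C# sketch of the idea (mine, not RavenDB’s actual session code):

using System;
using System.Collections.Generic;

// Hypothetical sketch: a session that remembers both hits and misses,
// so repeated loads inside the session always return the same answer.
class Session
{
    readonly Dictionary<string, byte[]> seen = new Dictionary<string, byte[]>();
    readonly Func<string, byte[]> loadFromServer; // returns null when the key is missing

    public Session(Func<string, byte[]> loadFromServer)
    {
        this.loadFromServer = loadFromServer;
    }

    public byte[] Load(string key)
    {
        byte[] value;
        if (seen.TryGetValue(key, out value))
            return value;              // same answer as last time, even if the server changed

        value = loadFromServer(key);   // may be null, and we remember that too
        seen[key] = value;
        return value;
    }
}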

Another aspect of concurrency that we need to deal with is Locking. Sometimes a user has a really good reason why they want to lock a record for a period of time. This is pretty much the only way to handle “checking out” of a record in a scenario where you have multiple users wanting to make changes to a record concurrently. Locks can be Write or ReadWrite locks. A Write lock allows users to read the data, but prevents them from changing it. When used in practice, this is usually going to immediately fail an operation, rather than make you wait for it.

The reasoning behind immediate failure for writes is that if you encountered a record with a write lock, it means that it was either already written to or is about to be written to. In that case, your write is going to be operating on stale data, so we might as well fail you immediately. For ReadWrite locks, the situation is a bit different. In this case, we want to also prevent readers from moving on. This is usually done to ensure a consistent state system wise, and basically, any operation on the record would have to wait until the lock is removed.

In practice, ReadWrite locks can cause a lot of issues. The moment that you have people start placing locks, you have to deal with lock expiration, manual unlocking, abandoned lock detection, lock maintenance, etc. About the only thing that they are good for is to allow the user to make a set of changes and present them as one unit, if we don’t have better transaction support. But I’ll discuss that in another post. In the meantime, just keep in mind that from my point of view, ReadWrite locks are pretty useless all around.


World’s smallest No SQL Database: Memory

The first item to go over with the World’s smallest No SQL database is how it uses memory. The codebase is really simple, which is pretty much the point, but can you think about the implications of the current codebase with regards to memory?

All the data is kept in a ConcurrentDictionary<string, byte[]>, and there is an unspoken assumption: both keys and values are assumed to be small. What happens if that isn’t the case? Well, in .NET, large byte arrays go on the Large Object Heap, and you are going to get memory fragmentation pretty quickly. That means that on 32 bits, you are probably going to be able to store a lot less than the 2 GB of virtual memory that you get from the OS.

We will ignore that as well and move to other aspects of the code. What about how the system actually works in production? There is a lot of work that goes into ensuring that the CPU will be able to access memory efficiently. In particular, most dbs use some form of in memory data structure that takes advantage of the L1, L2 and L3 caches of the CPU. Usually something like a B+Tree. How does the dictionary compare here?

For that matter, what happens when we grow really big? What about when we are paging? How would that work? Can we just go over the keys in the dictionary without having to page through the values as well?

And…

I think that you get my drift by now. And this is literally just me throwing stuff off the top of my head. There are a lot of other things that we need to worry about. For example, if we implement the sensible notion of a buffer pool, we might actually have a buffer that was released by a write but is still being sent to a client, so we can’t just reuse buffers blindly, we have to use ref counting as well.
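
For example, a buffer pool entry that cannot be reused until the last reader lets go might look roughly like this (a C# sketch with hypothetical names, not production code):

using System;
using System.Threading;

// Hypothetical sketch: a pooled buffer that only goes back to the pool
// when every outstanding user has released it.
class PooledBuffer
{
    public readonly byte[] Buffer;
    readonly Action<byte[]> returnToPool;
    int refCount = 1; // the writer that rented it holds the first reference

    public PooledBuffer(byte[] buffer, Action<byte[]> returnToPool)
    {
        Buffer = buffer;
        this.returnToPool = returnToPool;
    }

    public void AddReference()      // e.g. before handing the buffer to a response
    {
        Interlocked.Increment(ref refCount);
    }

    public void Release()           // called by the writer and by every reader when done
    {
        if (Interlocked.Decrement(ref refCount) == 0)
            returnToPool(Buffer);   // now, and only now, is it safe to reuse
    }
}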
