Ayende @ Rahien

RavenDB – Working with the query statistics

Querying in RavenDB is both similar and dissimilar to the sort of queries that you are familiar with from an RDBMS. For example, in RavenDB it is crazily cheap to get count(*). And then there is the notion of potentially stale queries that we need to expose.

Following the same logic as my previous post, I want to have an easy way of exposing this to the user without making it harder to use. I think that I found a nice balance here:

RavenQueryStatistics stats;
var results =  s.Query<User>()
     .Customize(x=>x.WaitForNonStaleResults())
     .Statistics(out stats)
     .Where(x => x.Name == "ayende")
     .ToArray();

The stats expose some additional information about the query:

[image: the properties exposed by RavenQueryStatistics]

You can use those to decide how to act.

But I have to admit that, like in the MultiQuery case with NHibernate, the real need for this was to be able to do a paged query + total count in a single remote request.
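
For instance, here is roughly how that paging scenario looks. This is a sketch from memory, so treat the exact property name (TotalResults) as an assumption, and pageNumber / pageSize as whatever your UI hands you:

RavenQueryStatistics stats;
var page = s.Query<User>()
     .Statistics(out stats)              // ask the server to include the query statistics in the response
     .Where(x => x.Name == "ayende")
     .Skip(pageNumber * pageSize)        // standard paging
     .Take(pageSize)
     .ToArray();

// stats.TotalResults holds the total number of matching documents, returned in the same round trip
var totalPages = (stats.TotalResults + pageSize - 1) / pageSize;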

Distributed transactions with remote queuing system

I got into an interesting discussion about this two days ago, so I thought that I would put down some of my thoughts on the topic.

Just to give some context here, I am NOT talking about solving the generic problem of DTC on top of a remote queuing system. I am talking specifically about sharing a transaction between a remote queuing system and a database. This is important if we want to enable “pull msg from queue, update db” scenarios. Without that, you run the risk of processing a message twice.

We will assume that the remote queuing system operates on the Dequeue/Timeout/Ack model, in which you dequeue a message from the queue and then have a timeout in which to acknowledge its processing before it is put back into the queue. We will also assume a database that supports transactions.
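
To make the model concrete, here is the shape of the queue API that I am assuming for the rest of this post (the names are mine, not those of any particular product):

public interface IRemoteQueue
{
    // Dequeue the next message and reserve it for this consumer until the ack timeout expires
    Message Dequeue();

    // Acknowledge successful processing; the message is then removed for good
    void Ack(Message message);
}

public class Message
{
    public Guid Id;
    public byte[] Data;
}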

Basically, what we need to do is to keep a record of all the messages we processed. We do that by storing that record in the database, where it is subject to the same transactional rules as the rest of our data. We would need a table similar to this:

CREATE TABLE Infrastructure.ProcessedMessages
(
   MsgId uniqueidentifier primary key,
   Timestamp DateTime not null
)

Given that, we can handle messages using the following code:

using(var tx = con.BeginTransaction())
{
    var msg = queue.Dequeue();
    try
    {
        InsertToProcessedMessages(msg);
    }
    catch(DuplicatePrimaryKey)
    {
        queue.Ack(msg);
        tx.Rollback();
        return;
    }

    // process the msg using this transaction context
    tx.Commit();
    queue.Ack(msg);
}

Because we can rely on the database to ensure transactional consistency, we can be sure that:

  • we will only process a message once
  • an error in processing a message will result in the processed-message record being removed from the database (the transaction rolls back)

We will probably need a daily job or something similar to clear out the table, but that would be about it.
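
Here is a minimal sketch of that cleanup job, assuming a plain IDbConnection named con and a one-day retention window (the window just needs to be comfortably longer than the queue’s ack timeout):

using(var cmd = con.CreateCommand())
{
    // anything older than the retention window can no longer be redelivered by the queue
    cmd.CommandText = @"DELETE FROM Infrastructure.ProcessedMessages
                        WHERE Timestamp < @cutoff";
    var cutoff = cmd.CreateParameter();
    cutoff.ParameterName = "@cutoff";
    cutoff.Value = DateTime.UtcNow.AddDays(-1);
    cmd.Parameters.Add(cutoff);
    cmd.ExecuteNonQuery();
}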

Thoughts?

Queuing Systems

It is not a surprise that I have a keen interest in queuing systems and messaging. I have been pondering building another queuing system recently, and that led to some thinking about the nature of queuing systems.

In general, there are two major types of queuing systems: Remote Queuing Systems and Local Queuing Systems. While the way they operate is very similar, there are actually major differences in the way you would work with either system.

With a Local Queuing System, the queue is local to your machine, and you can make several assumptions about it:

  • The queue is always going to be available – in other words: queue.Send()  cannot fail.
  • Only the current machine can pull data from the queue – in other words: you can use transactions to manage queue interactions.
  • Only one consumer can process a message.
  • There is a lot of state held locally.

Examples of Local Queuing Systems include:

  • MSMQ
  • Rhino Queues

A Remote Queuing System uses a queue that isn’t stored on the same machine. That means that:

  • Both send and receive may fail.
  • Multiple machines may work against the same queue.
  • You can’t rely on transactions to manage queue interactions.
  • Under some conditions, multiple consumers can process the same message.
  • Very little state is held on the client side.

An example of a Remote Queuing System is Amazon SQS.

Let us take an example of simple message processing with each system. Using local queues, we can write:

using(var tx = new TransactionScope())
{
   var msg = queue.Receive();

   // process msg
  
   tx.Complete();
}

There is actually a lot going on here. The act of receiving a message in a transaction means that no other consumer may receive it. If the transaction completes, the message will be removed from the queue. If the transaction rolls back, the message will become eligible for consumers once again.

The problem is that this pattern of behavior doesn’t work when using remote queues. DTC is a really good way to kill both scalability and performance when talking to remote systems. Instead, Remote Queuing Systems apply the concept of a timeout.

var msg = queue.Receive( ackTimeout: TimeSpan.FromMinutes(1) );

// process msg

queue.Ack(msg);

When the message is pulled from the queue, we specify the time by which we promise to process it. The server is going to set aside the message for that duration, so no other consumer can receive it. If the ack for successfully processing the message arrives within the specified timeout, the message is deleted and everything just works. If the timeout expires, however, the message becomes available for other consumers to process. The implication is that if, for some reason, processing a message exceeds the specified timeout, it may be processed by several consumers. In fact, most Remote Queuing Systems implement poison message handling, so that if consumers fail to ack a message X times within the given time frame, the message is marked as poison and moved aside.
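
To make that bookkeeping concrete, here is a conceptual, in-memory sketch of what the server side does with the ack timeout and the delivery counter. This is not any real product’s code, and MaxDeliveries is an assumed poison threshold; it is only meant to illustrate the mechanics described above:

public class ConceptualRemoteQueue
{
    private const int MaxDeliveries = 5; // assumed poison threshold
    private readonly List<QueuedMessage> messages = new List<QueuedMessage>();
    private readonly List<QueuedMessage> poisonMessages = new List<QueuedMessage>();

    public QueuedMessage Dequeue(TimeSpan ackTimeout)
    {
        var now = DateTime.UtcNow;
        // take the next message that isn't currently reserved by another consumer
        var msg = messages.FirstOrDefault(m => m.InvisibleUntil <= now);
        if (msg == null)
            return null;

        msg.DeliveryCount++;
        if (msg.DeliveryCount > MaxDeliveries)
        {
            // poison handling: stop redelivering it, move it aside for manual inspection
            messages.Remove(msg);
            poisonMessages.Add(msg);
            return Dequeue(ackTimeout);
        }

        // reserve the message until the consumer acks it or the timeout expires
        msg.InvisibleUntil = now + ackTimeout;
        return msg;
    }

    public void Ack(QueuedMessage msg)
    {
        // successful processing: the message is gone for good
        messages.Remove(msg);
    }
}

public class QueuedMessage
{
    public byte[] Data;
    public DateTime InvisibleUntil;
    public int DeliveryCount;
}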

It is important to understand the differences between those two types, because they impact the design of the systems using them. Rhino Service Bus, MassTransit and NServiceBus, for example, all assume that the queuing system that you use is a local one.

A good use case for a remote queuing system is when your clients are very simple (usually web clients) or you want to avoid deploying a queuing system.

The building blocks of a database: Transactional & persistent hash table

I have been working on a managed storage solution for RavenDB in an off-again, on-again fashion for a while now. The main problem is that doing something like this is not particularly hard, but it is complex. You either have to go with a transaction log or an append only model.

There is more than enough material on the matter, so I won’t touch that. The problem is that building it is going to take time, and probably a lot of it, so I decided that it is better to have something than nothing, and scaled back the requirements.

The storage has to have:

  • ACID
  • Multi threaded
  • Fully managed
  • Support both memory and files
  • Fast
  • Easy to work with

The last one is important. Go and take a look at the codebase of any of the available databases. They can be… pretty scary.

But something has to give, so I decided that, to make things easier, I am not going to implement indexing on the file system. Instead, I’ll store the data on the disk, and keep the actual index completely in memory. There is an overhead of roughly 16 bytes per key plus the key itself; let us round it to 64 bytes per key held. Holding 10 million keys would then cost ~600 MB. That sounds like a lot, because it is. But it is actually not too bad. It isn’t that much memory for modern hardware. And assuming that our documents are 5 KB in size, we are talking about 50 GB for the database size, anyway.

Just to be clear, the actual data is going to be on the disk; it is the index that we keep in memory. And once we have that decision, the rest sort of follows on its own.

It is intentionally a low level interface. Mostly what it gives you is an Add/Read/Remove interface, but with multiple tables, the ability to do key and range scans, and full ACID compliance (including crash recovery).
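
To give a feeling for what that means, the surface area is roughly along these lines (a sketch; these are not the actual type names in the project linked below):

public interface IPersistentHashTable : IDisposable
{
    ITable this[string tableName] { get; }   // multiple named tables
    IDisposable BeginTransaction();          // ACID, including crash recovery
}

public interface ITable
{
    void Add(byte[] key, byte[] value);
    byte[] Read(byte[] key);
    void Remove(byte[] key);

    // key and range scans run against the in-memory index; the values themselves come from disk
    IEnumerable<KeyValuePair<byte[], byte[]>> ScanRange(byte[] fromKey, byte[] toKey);
}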

And the fun part, it does so in 400 lines!

http://github.com/ayende/Degenerated.Storage

Smart post re-scheduling with Subtext

As you know, I use future posting quite heavily, which is awesome as long as I keep to the schedule. Unfortunately, when you have posts scheduled two or three weeks in advance, it is actually quite common to need to post something in a more immediate sense.

And that is just a pain. I just added smart re-scheduling to my fork of Subtext. Basically, it is very simple. If I post now, I want the post now. If I post it in the future, move everything one day ahead. If I post with no date, put it as the last item on the queue.
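
In pseudo-C#, the rule looks more or less like this. It is a sketch of the logic, not the actual Subtext code, and Post / PublishDate are stand-ins for whatever Subtext really uses:

public DateTime ScheduleNewPost(DateTime? requestedDate, IList<Post> futurePosts)
{
    if (requestedDate == null)
    {
        // no date: tack it on to the end of the queue
        var last = futurePosts.Count == 0
            ? DateTime.Now
            : futurePosts.Max(p => p.PublishDate);
        return last.AddDays(1);
    }

    if (requestedDate.Value <= DateTime.Now)
        return DateTime.Now; // posting now means now

    // posting in the future: push everything scheduled from that day onward one day ahead
    foreach (var post in futurePosts.Where(p => p.PublishDate >= requestedDate.Value))
        post.PublishDate = post.PublishDate.AddDays(1);

    return requestedDate.Value;
}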

This is the test for this feature.

Exception handling for flow control is EVIL!!!

I just spent over half a day trying to fix a problem in NHibernate. To be short, one of the updates that I made caused a backward compatibility error, and I really wanted to fix it. The actual error condition happens only when you use a triple nested detached criteria with a reference to the root from the innermost child. To make things fun, that change was made close to 9 months and over 1,000 revisions ago.

I finally found the revision that introduced the error, and started looking at the changes. It was a big change, and a while ago, so it took some time. Just to give you some idea, here is where the failure happened:

public string GetEntityName(ICriteria criteria)
{
  ICriteriaInfoProvider result;
  if(criteriaInfoMap.TryGetValue(criteria, out result)==false)
    throw new ArgumentException("Could not find a matching criteria info provider to: " + criteria);
  return result.Name;
}

For some reason, the old version could find it, and the new version couldn’t. I traced how we got the values into criteriaInfoMap every which way, and I couldn’t see where the behavior was different between the two revisions.

Finally, I resorted to a line by line revert of the revision, trying to see when the test would break. Oh, I forgot to mention, here is the old version of this function:

public string GetEntityName(ICriteria criteria)
{
  string result;
  criteriaEntityNames.TryGetValue(criteria, out result);
  return result;
}

The calling method (several layers up) looks like this:

private string[] GetColumns(string propertyName, ICriteria subcriteria)
{
  string entName = GetEntityName(subcriteria, propertyName);
  if (entName == null)
  {
    throw new QueryException("Could not find property " + propertyName);
  }
  return GetPropertyMapping(entName).ToColumns(GetSQLAlias(subcriteria, propertyName), GetPropertyName(propertyName));
}

But even when I did a line by line change, it still kept failing. Eventually I got fed up and changed GetEntityName to return null if it doesn’t find something, instead of throwing.

The test passed!

But I knew that returning null wasn’t valid, so what the hell(!) was going on?

Here is the method that calls the calling method:

public string[] GetColumnsUsingProjection(ICriteria subcriteria, string propertyName)
{
  // NH Different behavior: we don't use the projection alias for NH-1023
  try
  {
    return GetColumns(subcriteria, propertyName);
  }
  catch (HibernateException)
  {
    //not found in inner query , try the outer query
    if (outerQueryTranslator != null)
    {
      return outerQueryTranslator.GetColumnsUsingProjection(subcriteria, propertyName);
    }
    else
    {
      throw;
    }
  }
}

That was the mantrap: the old code’s null made GetColumns throw a QueryException, which is a HibernateException, so this catch silently retried the lookup against the outer query translator. My new ArgumentException never matched the catch, so the fallback never ran. I actually tried to track down who exactly wrote this (the nastiness came from the Hibernate codebase), but I got lost in migrations, reformatting, etc.

All in all, given how I feel right now, probably a good thing.

Why I hate implementing Linq

Linq has two sides. The pretty side is the one that most users see, where they can just write queries in C# and a magic fairy comes and makes them work on everything.

The other side is the one that is shown only to the few brave souls who dare contemplate the task of actually writing a Linq provider. The real problem is that the sort of data structure that a Linq query generates has very little to do with the actual code that was written. That means that there are multiple steps that need to be taken in order to actually do something useful in a real world Linq provider.

A case in point, let us take NHibernate’s Linq implementation. NHibernate gains a lot by being able to use the re-linq project, which takes care of a lot of the details of linq parsing. But even so, it is still too damn complex. Let us take a simple case as an example: how the Cacheable operation is implemented.

(
  from order in session.Query<Order>()
  where order.User == currentUser
  select order
).Cacheable()

Cacheable is used to tell NHibernate that it should use the 2nd level cache for this query. Implementing this is done by:

  1. Define an extension method called Cacheable:

     public static IQueryable<T> Cacheable<T>(this IQueryable<T> query)

  2. Register a node to be inserted into the parsed query in place of that specific method call:

     MethodCallRegistry.Register(new[] { typeof(LinqExtensionMethods).GetMethod("Cacheable"), }, typeof(CacheableExpressionNode));

  3. Implement the CacheableExpressionNode, which is what will go into the parse tree instead of the Cacheable call:

     public class CacheableExpressionNode : ResultOperatorExpressionNodeBase

  4. Actually, that last step was a lie, because the action really happens in the CacheableResultOperator, which is generated by the node:

     protected override ResultOperatorBase CreateResultOperator(ClauseGenerationContext clauseGenerationContext)
     {
         return new CacheableResultOperator(_parseInfo, _data);
     }

  5. Now we have to process that, which we do by registering an operator processor:

     ResultOperatorMap.Add<CacheableResultOperator, ProcessCacheable>();

  6. That processor is another class, in which we finally get to the actual work that we wanted to do:

     public void Process(CacheableResultOperator resultOperator, QueryModelVisitor queryModelVisitor, IntermediateHqlTree tree)
     {
         NamedParameter parameterName;

         switch (resultOperator.ParseInfo.ParsedExpression.Method.Name)
         {
             case "Cacheable":
                 tree.AddAdditionalCriteria((q, p) => q.SetCacheable(true));
                 break;

Actually, I lied. What is really going on is that this is just the point where we are actually registering our intent. The actual code will be executed at a much later point in time.

To forestall the people who know that this is an overly complicated mess that could be written in a much simpler fashion…

No, it can’t be.

Sure, it is very easy to write a trivial linq provider, assuming that all you do is a from / where / select and nothing else. But drop into the mix multiple from clauses, group bys, joins, into clauses, lets and… well, I could probably go on for a while there. The point is that industrial strength Linq providers (i.e. non toy ones) are incredibly complicated to write. And that is a pity, because it shouldn’t be that hard!

Synchronization primitives, MulticastAutoResetEvent

I have a very interesting problem within RavenDB. I have a set of worker processes that all work on top of the same storage. Whenever a change happen in the storage, they wake up and start working on it. The problem is that this change may be happening while the worker process is busy doing something other than waiting for work, which means that using Monitor.PulseAll, which is what I was using, isn’t going to work.

AutoResetEvent is what you are supposed to use in order to avoid losing updates on the lock, but in my scenario, I don’t have a single worker, but a set of workers. And I really wanted to be able to use PulseAll to release all of them at once. I started looking at holding arrays of AutoResetEvents, keeping track of all changes in memory, etc. But none of it really made sense to me.

After thinking about it for a while, I realized that we are actually looking at a problem of state. And we can solve that by having the client hold the state. This led me to write something like this:

public class MultiCastAutoResetEvent 
{
    private readonly object waitForWork = new object();
    private int workCounter = 0;
    
    
    public void NotifyAboutWork()
    {
        Interlocked.Increment(ref workCounter);
        lock (waitForWork)
        {
            Monitor.PulseAll(waitForWork);
            Interlocked.Increment(ref workCounter);
        }
    }
    
    
    public void WaitForWork(TimeSpan timeout, ref int workerWorkCounter)
    {
        var currentWorkCounter = Thread.VolatileRead(ref workCounter);
        if (currentWorkCounter != workerWorkCounter)
        {
            workerWorkCounter = currentWorkCounter;
            return;
        }
        lock (waitForWork)
        {
            currentWorkCounter = Thread.VolatileRead(ref workCounter);
            if (currentWorkCounter != workerWorkCounter)
            {
                workerWorkCounter = currentWorkCounter;
                return;
            }
            Monitor.Wait(waitForWork, timeout);
        }
    }
}

By forcing the client to pass us the most recently visible state, we can efficiently tell whether they still have work to do or whether they have to wait.
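
To show how the pieces fit together, here is a sketch of how a worker would use this; keepRunning and ProcessAnyPendingWork are hypothetical stand-ins for the actual worker logic:

var workNotifier = new MultiCastAutoResetEvent();
var lastSeenWorkCounter = 0;

// each worker runs a loop like this, holding its own lastSeenWorkCounter (the client-side state)
while (keepRunning)
{
    workNotifier.WaitForWork(TimeSpan.FromSeconds(5), ref lastSeenWorkCounter);
    ProcessAnyPendingWork();
}

// and whenever a change is written to the storage, the producer simply calls:
workNotifier.NotifyAboutWork();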

Why RavenDB is OSS

There are actually several reasons for that, not the least of which is that I like working on OSS projects. But there is also another reason for that, which is very important for adoption:

[image]

Joel Spolsky has an interesting article about just this topic:

When I sit down to architect a system, I have to decide which tools to use. And a good architect only uses tools that can either be trusted, or that can be fixed. "Trusted" doesn't mean that they were made by some big company that you're supposed to trust like IBM, it means that you know in your heart that it's going to work right. I think today most Windows programmers trust Visual C++, for example. They may not trust MFC, but MFC comes with source, and so even though it can't be trusted, it can be fixed when you discover how truly atrocious the async socket library is. So it's OK to bet your career on MFC, too.

You can bet your career on the Oracle DBMS, because it just works and everybody knows it. And you can bet your career on Berkeley DB, because if it screws up, you go into the source code and fix it. But you probably don't want to bet your career on a non-open-source, not-well-known tool. You can use that for experiments, but it's not a bet-your-career kind of tool.

I have used the same logic myself in the past, and I think it is compelling.

What is the cost of storage, again?

So, I got a lot of exposure for my recent post about the actual costs of saving a few bytes on field names in schema-less databases. David has been kind enough to post some real numbers about costs, which I am going to use for this post.

The most important part is here:

  • We have now moved to a private cloud with Terremark and use Fibre SANs. Pricing for these is around $1000 per TB per month.
  • We are not using a single server – we have 4 servers per shard so the data is stored 4 times. See why here. Each shard has 500GB in total data so that’s 2TB = $2000 per month.

So, that gives a price point of 4 US dollars per gigabyte per month.

Note that this is a per month cost, which means that it comes to a whopping 48$ per GB per year. Now, that is a much higher cost than the 5 cents that I gave earlier, but let us see what this gives us.

We will assume that the saving is actually higher than 1 GB; let us call it 10 GB across all fields in all documents, which seems a reasonable number.

That now costs 480$ per year.

Now let us put this in perspective, okay?

At 75,000$ a year (which is decidedly on the low end, I might add), that comes to less than 2 days of developer time.

It is also less than the following consumer items:

  • The cheapest iPad - 499$
  • The price of the iPhone when it came out - 599$

But let us talk about cloud stuff, okay?

  • A single small Linux instance on EC2 – 746$ per year.

In other words, your entire saving isn’t even the cost of adding a single additional node to your cloud solution.

And to the nitpickers, please note that we are talking about data that is already replicated 4 times, so it already includes such things as backups. And back to the original problem: you are going to lose more than 2 days of developer time on this usage scenario when you have variable names like tA.

A much better solution would have been to simply put the database on a compressed directory, which would slow down some IO, but isn’t really important to MongoDB, since it does most operations in RAM anyway, or just implement document compression per document, like you can do with RavenDB.

Doing merges in git

One of the things that some people fear in a distributed source control is that they might run into conflicts all the time.

My experience has shown that this isn’t the case, but even when it is, there isn’t anything really scary about it.

Here is an example of a merge conflict in Git:

[image: a merge conflict shown in the Git GUI]

Double click the conflicting file, and you get the standard diff dialog. You can then resolve the conflict, and then you are done.

You saved 5 cents, and your code is not readable, congrats!

I found myself reading this post, and at some point, I really wanted to cry:

We had relatively long, descriptive names in MySQL such as timeAdded or valueCached. For a small number of rows, this extra storage only amounts to a few bytes per row, but when you have 10 million rows, each with maybe 100 bytes of field names, then you quickly eat up disk space unnecessarily. 100 * 10,000,000 = ~900MB just for field names!

We cut down the names to 2-3 characters. This is a little more confusing in the code but the disk storage savings are worth it. And if you use sensible names then it isn’t that bad e.g. timeAdded -> tA. A reduction to about 15 bytes per row at 10,000,000 rows means ~140MB for field names – a massive saving.

Let me do the math for a second, okay?

A two terabyte hard drive now costs 120 USD. By my math, that makes:

  • 1 TB = 60 USD
  • 1 GB = 0.058 USD

In other words, that massive saving that they are talking about? 5 cents!

Let me do another math problem, okay?

A developer costs about 75,000 USD per year.

  • (52 weeks – 2 vacation weeks) x 40 work hours = 2,000 work hours per year.
  • 75,000 / 2,000 = 37.5 $ / hr
  • 37.5 / 60 minutes = 62 cents per minute.

In other words, assuming that this change cost a single minute of developer time, the entire saving is worse than moot.

And it is going to take a lot more than one minute.

Update: Fixed decimal placement error in the cost per minute. Fixed mute/moot issue.

To those of you pointing out that real server storage costs are much higher: you are correct, of course. I am trying to make a point. Even assuming that it costs two orders of magnitude more than what I said, that is still only 5$. Are you going to tell me that saving the price of a single cup of coffee is actually meaningful?

To those of you pointing out that MongoDB effectively stores the entire DB in memory: the post talked about disk size, not about memory, but even so, that is still not relevant. Mostly because MongoDB only requires indexes to fit in memory, and (presumably) indexes don't really need to store the field name per each indexed entry. If they do, then there is something very wrong with the impl.

RavenDB – Defining indexes

One of the more powerful features of RavenDB is the notion of indexes. Indexes allow you to query RavenDB efficiently; they stand at the core of RavenDB’s philosophy of “we shall never make you wait”, and they are the express way to a lot of RavenDB’s capabilities (spatial searches and full text searches, to name just a few).

Unfortunately, indexes require that you pre-define them before you can use them. That led to a problem. A big one. It meant that in addition to simply deploying RavenDB, you had to manage additional assets: the index definitions. In essence, that is very similar to having to worry about the database schema. That, in turn, made deployment that bit more complex, and that was unacceptable.

So we implemented the Code Only Indexes, which allow you to define your indexes as part of your project, and have them automatically created the first time the application starts.

Defining the indexes is pretty simple:

public class Movies_ByActor : AbstractIndexCreationTask<Movie>
{
    public Movies_ByActor()
    {
        Map = movies => from movie in movies
                        select new {movie.Name};
        Index(x=>x.Name, FieldIndexing.Analyzed);
    }
}

public class Users_CountByCountry : AbstractIndexCreationTask<User>
{
    public Users_CountByCountry()
    {
        Map = users => from user in users
              select new {user.Country, Count = 1};
        Reduce = results => from result in results
                            group result by result.Country into g
                            select new { Country = g.Key, Count = g.Sum(x => x.Count) };
    }
}

But this is just the first step, the next is to tell RavenDB about them:

IndexCreation.CreateIndexes(typeof(Movies_ByActor).Assembly, store);

If you put this in your Global.asax, you would never have to think about indexing or deployment issues again.
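
For example, a minimal sketch of wiring this up from Application_Start could look like the following; the Url and the static Store field are assumptions, so use whatever your application already has:

public class Global : System.Web.HttpApplication
{
    public static IDocumentStore Store;

    protected void Application_Start(object sender, EventArgs e)
    {
        Store = new DocumentStore { Url = "http://localhost:8080" }.Initialize();

        // scans the assembly for AbstractIndexCreationTask implementations and creates them on the server
        IndexCreation.CreateIndexes(typeof(Movies_ByActor).Assembly, Store);
    }
}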

Raven’s dynamic queries

One of the pieces of feedback that we got from people frequently enough to be annoying is that the requirement to define indexes upfront, before you can query, is itself annoying. For a while, I thought that yes, it is annoying, but so is the need to sleep, and that there isn’t much you can do about it.

Then Rob came along and asked: why do we require users to define indexes when we can gather enough information to do this ourselves? And since I couldn’t think of a good reason why, he went ahead and implemented this.

var activeUsers = from user in session.Query<User>()
                  where user.IsActive == true
                  select user;

You don’t have to define anything, it just works.

There are a few things to note, though.

First, unindexed queries are notorious for working fine on small data sets and then killing systems in production. That was one of the main reasons that I didn’t want to support them in the first place.

Moreover, since RavenDB’s philosophy is that it should be pretty hard to shoot yourself in the foot, we figured out how to make this efficient.

  • RavenDB will look at the query, see that there is no matching index, and create one for you.
    • Currently, it will create an index per each query type. In the future (as in, by the time you read this, most probably) we will have a query optimizer that can deal with selecting the appropriate index.
  • Those indexes are going to be temporary indexes, maintained for a short amount of time.
  • However, if RavenDB notices that you are making a large number of queries to a particular index, it will materialize it and turn it into a permanent index.

In other words, based on your actual system behavior, RavenDB will optimize itself for you :-) !

This feature actually had us stop and think, because it fundamentally changed the way that you work with RavenDB. We had to go back and ask why you would still want to create indexes manually.

As it turned out, there are still a number of reasons why you would want to do that, but they have become far more rare:

  • You want to apply complex filtering logic or do something to the index output. (For example, you may be interested in aggregating several fields into one searchable item.)
  • You want to use RavenDB’s spatial support.
  • Aggregations still require defining a map/reduce query.
  • You want to use the Live Projections feature, which I’ll discuss in my next post.

RavenDB in comparison to CouchDB

I ran into this question, and I thought that this is an important enough topic to put it on the blog as well.

A cursory dig suggests CouchDB is more mature, with a larger community to support it. That aside, what do you consider to be the significant differences?

RavenDB was heavily inspired by CouchDB. But when I sat down to build it, I tried to find all the places where you would have friction in using CouchDB and eliminate them, as well as trying to build a product that would be a natural fit for the .NET ecosystem. That isn't just being able to run easily on Windows, btw. It is about being a product that fits the thought processes, requirements and environment in which it is used.

Here are some of the things that distinguish RavenDB:

  • Transactions - support for single document, document batch, multi request, multi node transactions, including support for DTC. To my knowledge, CouchDB supports transactions only on a single document.
  • Patching - you can perform a PATCH op against a document, instead of having to send the entire document to the server.
  • Set based operations - basically, a way to do things like: "update active = false where last_login < '2010-10-01'"
  • Deployment options - can run embedded, separate executable, windows service, iis, windows azure.
  • Client API - comes with a client API for .NET that is very mature. Supports things like unit of work, change tracking, etc.
  • Safe by default - both the server and the client have builtin limits (overridable) that prevent you from doing things that will kill your app.
  • Queries - Support the following querying options:
    • Indexes - similar to couch's views. Define by specifying a linq query.
    • Just do a search - doesn't have to have an index. RavenDB will analyze the query and create a temporary index for you. However, unlike couch temp views, this is meant for production use. And those temp indexes will automatically become permanent ones based on usage. Note that you don't have to define anything, just issue the actual query: Name:ayende will give you back the correct result. Tags,Name:raven will also do the same, including when you have to deal with extracting information directly from the composite docs.
    • Run a linq query - this is similar to the way temp view works, it is an O(n) operation, but it allows you to do whatever you want with the full power of linq. (For the non .NET guys, it allows you to run a SQL query against the data store) Mostly meant for testing.
  • Index backing store - Raven puts the index information in Lucene, which means we get full text searching OOTB. We can also do spatial queries OOTB.
  • Searching - It is very easy to say "index users by first name and last name", then search for them by either one. (As I understand it, I would have to define two separate views in couch for this).
  • Scaling - Raven comes with replication builtin, including master/master. Sharding is natively supported by the client API and requires you to simply define your sharding strategy.
  • Authorization - Raven has an auth system that allows defining queries based on user / role on document, set of documents (based on the doc data) and globally. You can define something like: "Only Senior Engineers can Support Strategic Clients"
  • Triggers - Raven gives you the option to register triggers that will run on document PUT/READ/DELETE/INDEX
  • Extensibility - Raven is intended to be customized by the user for a typical deployment. That means that you would typically have some sort of customization, such as triggers or additional operations that the DB can do.
  • Includes  & Live projections -  Let us say that we have the following set of documents: { "name": "ayende", "partner": "docs/123" }, { "name": "arava" }
    • Includes means that you can load the "ayende" document, while asking RavenDB to load the document referred to by the partner property. That means that you have only a single request to make, vs. 2 of them without this feature.
    • Live projections means that we can ask for the document's name and the partner's name, effectively joining the two together.
    • Those two features will only work on local data, obviously.

And I am probably forgetting some stuff, to be honest. Oh, and I am naturally the most unbiased of observers :-)

re: Diverse.NET

This post is a reply for this post, you probably want to read that one first.

Basically, the problem is pretty simple. It is the chicken & the egg problem. There is a set of problems where it doesn’t matter. Rhino Mocks is a good example where it doesn’t really matter how many users there are for the framework. But there are projects where it really does matter.

A package management tool is almost the definition of the chicken & egg problem. Having a tool coming from Microsoft pretty much solves this, because you get a fried chicken pre-prepared.

If you look at other projects, you can see that the result has been interesting.

  • Unity / MEF didn’t have a big impact on the OSS containers.
  • ASP.Net MVC pretty much killed a lot of the interest in MonoRail.
  • Entity Framework had no impact on NHibernate.

In NHibernate’s case, I think it is mostly because it had already moved beyond the chicken & egg problem. In MonoRail’s case, there wasn’t enough outside difference, and most people bet on the MS solution. For Unity / MEF, there wasn’t any push to use something else, because you really didn’t depend on it.

In short, it depends :-)

There are some projects that really need critical mass to succeed. And for those projects, having Microsoft get behind them and push is going to make all the difference in the world.

And no, I don’t really see anything wrong with that.

Presentation: Intro to NoSQL

My session with the .NET user group in Aarhus, Denmark was recorded and is available.

I tried to cover (in the space of an hour) the major types of NoSQL databases. It is a bit rushed, but I think it is pretty good.

RavenDB – The great simplification

There is always a tension between giving users an interface that is simple to use and giving them something that is powerful.

With RavenDB, we ran into a problem with the session interface. I mean, just take a look…

[image: the session class, with all of its operations]

Look how many operations you can do here! Yes, it is powerful, but it is also very complex to talk about & explain. For that matter, look at the interface:

[image: the session interface, showing the different ways to query]

We have 4(!) different ways of querying (actually, we have more, but that is beside the point).

Something had to be done, but we didn’t want to lose any power. So we decided on the following method:

[image: the regrouped session interfaces]

Basically, we grouped all the common methods into a super simple session interface. This means that when you need to use the session, you see only:

[image: the simplified session interface, as the user sees it]

4 of those methods (Equals, GetHashCode, GetType, ToString) are from System.Object and one is from IDisposable; users are pretty much never going to notice them. The CRUD interface is there, as well as the starting point for querying. And that is it.
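
To give a rough idea, the surface you end up with is something along these lines. This is a sketch from memory rather than the literal interface definition, and the name of the Advanced property's type is approximate:

public interface IDocumentSession : IDisposable
{
    T Load<T>(string id);
    void Store(object entity);
    void Delete<T>(T entity);
    void SaveChanges();

    IQueryable<T> Query<T>();
    IQueryable<T> Query<T, TIndexCreator>() where TIndexCreator : AbstractIndexCreationTask, new();

    // everything that is important but uncommonly used is tucked behind an explicit property
    IAdvancedSessionOperations Advanced { get; }
}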

I think that it makes it very easy to get started with RavenDB. And all those operations that are important but uncommonly used are still there, they just require an explicit step to access them.

RavenDB & Managed Storage

One of the things that I really wanted to do with RavenDB is to create something that is really easy for .NET developers to use. I think that I have managed to do that. But the one big challenge that we still had was running everything on top of managed storage.

As you can imagine, building transactional, crash-safe data stores isn’t particularly easy, but we actually did it, and now RavenDB can run on fully managed storage. That has other implications, like being able to run completely in memory, which means that you can test your RavenDB code simply by using:

var store = new DocumentStore { RunInMemory = true };

And just use this as you normally would. For that matter, you can ask the RavenDB server to run in memory as well (extremely useful for demos):

Raven.Server.exe /ram
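
Back on the client side, an in-memory test ends up looking something like this (a sketch, assuming NUnit and a trivial User class with Id and Name properties):

[Test]
public void Can_store_and_load_a_user_without_touching_disk()
{
    using (var store = new DocumentStore { RunInMemory = true }.Initialize())
    {
        string id;
        using (var session = store.OpenSession())
        {
            var user = new User { Name = "ayende" };
            session.Store(user);
            session.SaveChanges();
            id = user.Id;
        }

        using (var session = store.OpenSession())
        {
            // the document is there, but nothing was ever written to disk
            Assert.IsNotNull(session.Load<User>(id));
        }
    }
}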

As a side note, I am going to be posting a lot about the recent storm of features that were just added to RavenDB.

RavenDB – Live projections, or how to do joins in a non relational database

This feature has me really excited, because it solves a pretty big problem and it does so in a really elegant fashion.

Let us start from the beginning. Documents are independent, which means that processing a single document should not require loading additional documents. This, in turn, leads to denormalization, so we can keep the data that we need about our associations in the same document.

The problem, of course, is that there are many cases where this denormalization is annoying. In particular, it means that you have to take responsibility for handling the updates to the denormalized data. There are good reasons to want to do that, particularly if you are working on a sharded data store. And yet… many people don’t run in a sharded store, so why make them pay for that?

I found it hard to answer, especially since we had already introduced the includes feature. For a while, I thought that just doing the include was enough, but then I got into a discussion with Rob about this, and we had the chance to talk about projections vs. documents. We agreed that we wanted a solution, and we started throwing a lot of crazy ideas around (multi sourced, background updated, materialized views – to name a few), until we finally realized that we were being really stupid. The data is already there!

It was just that I was too blind to see how we could push it out.

Let us talk in code for a minute, since it would be easier to demonstrate how things work:

using (var s = ds.OpenSession())
{
    var entity = new User { Name = "Ayende" };
    s.Store(entity);
    s.Store(new User { Name = "Oren", AliasId = entity.Id });
    s.SaveChanges();
}

This creates two documents and links between them.

Now, let us say that we want to display the following grid:

User    Alias
Oren    Ayende

Well, we need to query for all the users that have an alias, then include the associated document. Something like this:

var usersWithAliases = from user in session.Query<User>().Include(x=>x.AliasId)
                       where user.AliasId != null
                       select user;


var results = new List<UserAndAlias>();

foreach(var user in usersWithAliases)
{
    results.Add(
        new UserAndAlias
        {
            User = user.Name,
            Alias = session.Load<User>(user.AliasId).Name
        }
    );
}

Here is the deal: this is very efficient in terms of calling the database only once, but it does mean that we are passing the full documents back, which may be something that we do not want to do.

Not to mention that there is a whole lot of code here.
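
(UserAndAlias itself is nothing interesting, by the way; it is just a plain projection class along these lines:)

public class UserAndAlias
{
    public string User { get; set; }
    public string Alias { get; set; }
}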

Okay, so far we have introduced the problem. Let us see how we can solve it. We can do that by applying a live projection at the server side. A live projection transforms the results of a query on the server side, and it has access to other documents as well. Let us see what I mean by that:

public class Users_ByAlias : AbstractIndexCreationTask<User>
{
    public Users_ByAlias()
    {
        Map =
            users => from user in users
                     select new {user.AliasId};

        TransformResults =
            (database, users) => from user in users
                                 let alias= database.Load<User>(user.AliasId)
                                 select new {Name = user.Name, Alias = alias.Name};
    }
}

It is important to understand exactly what is going on here. The TransformResults will be executed on the results of the query, which gives it the chance to modify, extend or filter them. In this case, it gives you the ability to look at data from another document.

For the DB guys among you, this performs a nested loop join.

Now, we can just write:

var usersWithAliases = 
     (from user in session.Query<User, Users_ByAlias>()
     where user.AliasId != null
     select user).As<UserAndAlias>();

This will query the index, transform the results on the server side, and give us the UserAndAlias collection that we can just use.

Did I mention that I am really excited about this feature?

Small optimizations? Did you check that idiot checkbox yet?

I started to write the following code:

operationsInTransactions.GetOrAdd(txId,  new List<Command>())
    .Add(new Command
    {
        Key = key,
        Position = position,
        Type = CommandType.Put
    });

But then I realized that in the happy path, I would always allocate a list, even if I don’t need it, and I changed it to this:

operationsInTransactions.GetOrAdd(txId, guid => new List<Command>())
    .Add(new Command
    {
        Key = key,
        Position = position,
        Type = CommandType.Put
    });

Why am I being stupid?

Hint: It has nothing to do with the stupidity associated with trying to do micro optimizations…

Riddle me this: File access & Multi threading

What would be the output of the following code?

var list = new List<Task>();

var fileStream = File.Open("test", FileMode.Create, FileAccess.ReadWrite, FileShare.None);
for (int i = 0; i < 150; i++)
{
    var index = i;
    var task = Task.Factory.StartNew(() =>
    {
        var buffer = Encoding.ASCII.GetBytes(index + "\r\n");
        fileStream.Write(buffer, 0, buffer.Length);
    });
    list.Add(task);
}


Task.WaitAll(list.ToArray());
fileStream.Flush(true);
fileStream.Dispose();

Please note that:

  • What order the data is saved is not important.
  • Preventing data corruption is important.

Can you guess?

How should the RavenDB UI look like? (There is a prize!)

Update: The offer is now closed.

Currently, RavenDB comes with a builtin user interface which is HTML/JS based. The UI is nice, but it has limitations. Getting real applications on top of HTML/JS requires a lot of work and expertise, and while I understand how it is done, I find it painful to do so. Another problem is that the UI is already becoming cluttered, and we can’t really offer high end UX with the type of options that we have in a web interface without a ridiculously high investment in time and effort. But I still want the no-deploy option, so the next UI for RavenDB is going to be done in Silverlight. For example, giving syntax highlighting for editing index definitions would be trivial in Silverlight and very hard in HTML/JS.

Hibernating Rhinos (my company) is going to sponsor a professional UI developed for RavenDB using Silverlight. Think about it like Management Studio. The work is going to be done in Silverlight, with the idea that you could gain all the usual benefits of being able to host your own UI. I was talking today with a few people about that, and everyone agrees that the actual process of implementing this is fairly simple.

The problem is what sort of UI we should do. One option, which we will not take, is to simply copy the current web UI. That is something that I strongly oppose; I think that the web UI is nice, but that we can do much better.

One of the things I have decided is that I do not want to give you a grid at any point in the UI. RavenDB is all about documents, and I want the UI to direct you toward thinking in documents.

I have all sorts of crazy ideas about representing documents as a series of documents in an infinite space. Something like this:

[image: documents floating in an infinite space]

And then you could zoom in on a particular document to be able to work on it effectively, literally paging between different documents with page turns, etc.

I am not sure if I am going in the right direction or if I am just doing the equivalent of building an interface made of nothing but <blink/> tags.

So, here is how it goes. I would really like to get ideas about how to do a good UI for RavenDB. But I don’t think that just talking about this would work; I would really like to see a prototype that I can look at. It doesn’t have to be actually workable, I would gladly look at hand drawn pictures. It doesn’t even have to look good, because I am more interested in seeing UX rather than UI, and in how we can focus users toward document oriented thinking.

To facilitate that, I am going to offer a prize: your choice of a profiler license or a RavenDB license if you send me your ideas about how to build such a UI. I am going to send a free license for either the profiler or RavenDB regardless of whether I choose your suggestion or not (as long as the suggestion itself is serious); this isn’t just for the winner.

The other side of chasing the money

When I started doing my own consulting, I realized that sometimes I would have to chase after a client in order to get paid. Luckily, it hasn’t happened often.

I did not expect the reverse to happen, but it did. I just had to send the following formal notice to someone who does work for me:

Hi guys!
I am pretty sure that I owe you money. Would you mind terribly if I paid you?

And yes, this post is here to serve as a kick to the people in question.

A murder of Ravens

In the last 96 hours, I have slept a total of 6 hours (all of them while I was on a plane with no electricity, so I couldn’t work).

All the rest of my time was pretty much focused on doing crazy stuff with RavenDB. I am very happy to say that I (with a lot of help from Rob Ashton) have managed to really push what RavenDB can do. In particular, about 5 minutes ago I finished the last touch required to get RavenDB to run using fully managed storage.

And yes, this means that we will have a Mono build shortly.

The crazy part is that a week ago I would have said that managed storage would be the feature for RavenDB, but it is actually on an equal footing with a few others that we recently added. We have been so busy adding those features, and working through the implications of having them, that we didn’t have time to sit down and document & explain them properly.

No worries, this is coming. In fact, this is what I intend to do for most of tomorrow (at least, as soon as I wake up).

Ayende the excited.