Ayende @ Rahien

Refunds available at head office

That No SQL Thing: Document Database Migrations

One of the things that I get asked frequently about NoSQL is how to handle migrations. First, I want to clarify something very important:

Migrations suck.

It doesn’t matter what you are doing, the moment that you need to be forward / backward compatible, or just update existing data to a new format, you are going to run into a lot of unpleasant situations. Chief of which is that what you thought was true about your data isn’t, and probably never was.

But leaving that aside, in the RDBMS world we have three major steps when migrating from one database version to another: a non-destructive schema update (add columns / tables), a data update (update the data, or move it to its new location), and a destructive schema update (drop columns / tables, introduce indexes, etc.). With a document database, the situation is much the same: while we don't have a mandatory schema, we do have the data in a certain shape.

I am going to use three examples, which hopefully demonstrate enough about migrations to give you the basic idea.

  • Renaming a column – Posts.Text should be renamed to Posts.Content
  • Update column value – Update tax rate in all orders after 2010-01-01
  • Change many-to-one association to many-to-many association – Posts.Category to Posts.Categories

Renaming a column

Using SQL, we can do this using:

-- non destructive update
alter table Posts add Content nvarchar(max) null

-- update the data
update Posts
set Content = Text

-- destructive schema update
alter table Posts drop column Text
alter table Posts alter column Content nvarchar(max) not null

Using a document database, we run the following LINQ query on the server (this is effectively an update):

from doc in docs
where doc.type == "post"
select new 
{
   type = doc.type,
   content = doc.text, // rename a column
   posted_at = doc.posted_at
}

Update column value

In SQL, it is pretty easy:

-- update the data
update Orders
set Tax = 0.16
where OrderDate > '2010-01-01'

And using an update LINQ query:

from doc in docs
where doc.type == "order" && doc.order_date > new DateTime(2010,1,1)
select new
{
   tax = 0.16,
   order_date = doc.order_date,
   order_lines = doc.order_lines
}

Change many-to-one association to many-to-many association

Using SQL, it is pretty complex (especially when you add the FKs):

create table CategoriesPosts (PostId int, CategoryId int)

insert into CategoriesPosts (PostId, CategoryId)
select Id, CategoryId from Posts

alter table Posts drop column CategoryId

And using a doc db:

from doc in docs
where doc.type == "post"
select new 
{
   type = doc.type,
   content = doc.content,
   posted_at = doc.posted_at,
   categories = new [] { doc.category }
}

It is pretty much the same pattern over and over again :-)
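For the curious, that recurring pattern can be sketched outside of any particular database as a plain projection over documents. The `docs` list, helper function, and field names below are illustrative only, not RavenDB's actual client API:

```python
# Sketch: project each matching document into a new shape, exactly like the
# update LINQ queries above. Everything here is illustrative, not a real API.

def migrate_post(doc):
    """Rename 'text' to 'content' and turn the single category into a list."""
    return {
        "type": doc["type"],
        "content": doc.get("text", doc.get("content")),  # rename a field
        "posted_at": doc["posted_at"],
        "categories": [doc["category"]] if "category" in doc else doc.get("categories", []),
    }

docs = [
    {"type": "post", "text": "hello", "posted_at": "2010-04-22", "category": "raven"},
    {"type": "order", "tax": 0.15},  # skipped, just like the 'where' clause filters it out
]

migrated = [migrate_post(d) if d["type"] == "post" else d for d in docs]
```

The point is that every migration above is the same shape: filter the documents you care about, then emit each one in its new form.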

Comments

04/22/2010 10:06 AM by MF

That answers updates, but I am interested in how you do creates and deletes.

04/22/2010 10:08 AM by MF

By that I meant creating and deleting documents. But after thinking about it some more, you probably don't end up creating/deleting documents very often.

04/22/2010 10:32 AM by Rafal

  1. How do you tell if it's an update query and not a select?

  2. Changing the db schema in an RDBMS is a nightmare if your db is big enough. It puts a table lock on your data, and the update can take hours or days if the server concludes it has to move all the data due to a record size change. The application is inaccessible during that time, and practically you can't do anything but wait and hope everything's ok. And in case of a screw-up, transaction rollback can take twice as much time :) NoSQL is a clear winner here.

04/22/2010 10:34 AM by Rafal

.... anyway, could you estimate how much time it takes to update a million docs in Raven?

04/22/2010 04:00 PM by josh

hmm. interesting. I hadn't even gotten to thinking about migrations.

04/22/2010 05:03 PM by Ayende Rahien

MF,

Deletes / creates are handled the usual way.

Rafal,

1) Based on how you submit it.

In my current perf tests for Raven, I am testing on a 7 GB database with ~2.5 million documents.

Doing about 2 million inserts, then more than 1 million updates, takes about 1 hour and 15 minutes.

04/22/2010 05:50 PM by fschwiet

Thanks for this blog series. It seems all these migration scenarios require putting the application on hold while the database is being updated. Are there solutions that would not require downtime? I was thinking the documents could be versioned so things with an old schema could be recognized/converted on an as needed basis. I have no experience with document databases though. Do you have any thoughts or experience with this approach, or other no-downtime approaches?
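Something like this is what I had in mind — the version field, converter, and document shape are all just made up to illustrate the idea:

```python
# Sketch of as-needed migration: each document carries a schema version,
# and readers upgrade old shapes on load. All names here are hypothetical.

def v1_to_v2(doc):
    """v1 -> v2: rename 'text' to 'content'."""
    doc = dict(doc)  # work on a copy
    doc["content"] = doc.pop("text")
    doc["schema_version"] = 2
    return doc

MIGRATIONS = {1: v1_to_v2}  # maps a version to its upgrade step

def load(doc):
    """Apply upgrade steps until the document reaches the current version."""
    while doc.get("schema_version", 1) in MIGRATIONS:
        doc = MIGRATIONS[doc.get("schema_version", 1)](doc)
    return doc

old_post = {"type": "post", "text": "hi", "schema_version": 1}
upgraded = load(old_post)
```

That way documents get converted only when they are actually read, and the bulk update can happen lazily (or never).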

04/22/2010 07:08 PM by Ayende Rahien

fschwiet,

You can certainly do that, sure. But I am showing a more common case of a point in time upgrade.

04/23/2010 12:37 AM by Demis Bellot

I'm glad that I'm not the only person who thinks that data migration is a pain - I was beginning to think I was suffering this task in silence.

Database Schema migration (although we automatically generate the required DDL) is the only reason why we have downtime when we upgrade our systems at mflow.

To mitigate these problems, I've thrown away the normalization book on RDBMS design and we keep most of our key-value objects stored as blobs. This is transparent to the model, as they are exposed as strongly-typed IList, IDictionary, etc., and even our LINQ provider has partial support for them wherever possible. I've found the benefits to be numerous; one is that we can modify the type without a schema migration, only runtime updates if required.

Because of this pain, 'schema flexibility' is something I cared very deeply about when developing the C# Redis client, so much so that I decided to build my own text-serialization format specifically catered to serializing C# POCO classes. Performance and resilience to schema changes are its top priorities: as long as it doesn't lose data, it will de/serialize. I.e., you can serialize a custom POCO class and read it back as a string Dictionary (and vice versa, as long as the dictionary keys match the POCO properties). As a benefit of controlling the implementation, I use 'generic optimizations' to remove all runtime reflection, which gives me a 5.3x speedup over .NET's BCL JsonDataContractSerializer.

Anyway for anyone that's interested it's ideal for blobbing any .NET POCO and comes in a dependency-free .dll that's open source under the liberal BSD licence at:

code.google.com/p/servicestack/wiki/TypeSerializer

04/23/2010 12:38 AM by Chris Wright

I must say, that's really ugly. I have to name every field (twice) in order to update a single field. It's an API issue, but it means I would really try to avoid modifying my document format or doing bulk updates.

I thought that NoSQL was about making it easy to change your schema, but this post makes it clear that SQL's better at that.

Or am I just misunderstanding the examples?

04/23/2010 01:10 AM by Demis Bellot

@Chris Wright

I don't actually think it's that bad, as we are talking about renaming a field (i.e. putting the value of one field into another), and it looks like this is achieved with a single query (processed on the server?).

I think the real benefits show in the most typical use cases, i.e. adding/removing fields, where in most cases you won't have to do anything in your NoSQL data store. In an RDBMS this normally requires scripted logic as part of your deployment, which could potentially require downtime.

04/23/2010 03:03 AM by Chris Wright

@Demis Bellot

You need to schedule downtime or explicitly plan for compatibility in any case. Or are you going to let your application crash because multiple versions are running, some of which expect a field to have one name, some of which expect it to have a different name?

Now in SQL, I can change a column name in one line with sp_rename, or even if I don't have that convenience, in three lines (create column, update, drop old column; maybe need to alter the new column to disallow nulls, or drop a foreign key constraint). And I only need to write the table name, the old column name, and the new column name. Only three chances to mess up, and it's pretty hard to mess up by forgetting something. Whereas (if I'm interpreting the examples correctly) RavenDB will force me to list all fields to rename one, which means I can drop one by forgetting about it.

Not to mention, that's implicitly creating a schema. If I don't have a schema, I'll be unable to predict what fields will exist, so I'll never be able to run an update query without removing fields from some of the stored objects.

Even assuming you want and can deal with a schema, this system makes it harder. At work, I have tables with 40 columns. The examples indicate to me that I simply wouldn't be able to rename or drop fields, without a tool to write the queries for me. Even then, I'd be quite hesitant to do so.

This could be fixed if there were a special variable "the_rest" (or some such) that got interpreted to mean "any fields that I haven't explicitly named".
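Something along these lines, where the update only names what changes and everything else carries over untouched — plain dict semantics, nothing RavenDB-specific:

```python
# Sketch: apply a partial patch to a document, keeping every field the
# patch doesn't mention -- the "the_rest" behaviour described above.

def patch(doc, changes, remove=()):
    """Return a copy of doc with changes applied and listed fields dropped."""
    out = {k: v for k, v in doc.items() if k not in remove}
    out.update(changes)
    return out

post = {"type": "post", "text": "hello", "posted_at": "2010-04-22", "author": "mf"}

# Rename 'text' to 'content' without enumerating the other 40 columns.
renamed = patch(post, {"content": post["text"]}, remove=("text",))
```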

04/23/2010 08:35 AM by Demis Bellot

@Chris Wright

"You need to schedule downtime or explicitly plan for compatibility in any case. Or are you going to let your application crash because multiple versions are running, some of which expect a field to have one name, some of which expect it to have a different name?"

Do we let our app crash?? Obviously that's insane. We code defensively, maintain backwards compatibility, and our app only accesses web services that are not tightly coupled to the db schema. The typical use case is that we add new fields, which means additional functionality that the latest version of the client app can take advantage of. We have an automated CI process which includes a 1-click deployment to all our load-balanced app servers. IIS can handle a redeployment where it's still serving old requests off the old app domain while we're deploying a newer version of the software — which means it is always serving requests during deployment, old or new. Now, one of our back-end services is built on a sharded architecture, and we can't change the table schema when there are active connections; it is for this reason that we have downtime. Adding new fields to text blobs doesn't exhibit this behaviour and, like I said, requires no downtime.

Schema migration is mitigated as best as it can be: we have a tool to generate the required DDL statements to migrate the old schema to the new one. We literally take IIS down, run the migration script, redeploy the services, and after that we can finally bring the system back online. So it's not down for long, but it still goes offline, which causes client disruption. Now, I'm not advocating against RDBMS for this reason; as far as I'm concerned it is a necessary evil, and when your data/requirements best fit into an RDBMS it is best to leave it there. But easy schema migration is one of the benefits of schema-less designs.

04/23/2010 02:46 PM by Chris Wright

Okay, so if you're going to explicitly plan for compatibility, you can do that with SQL just as well as with NoSQL, using the same mechanisms. Version A depends on a field; version B writes to it; version C ignores it entirely (sets the column to nullable, or adds a DEFAULT constraint); version D eliminates it.

How would this work with NoSQL? Pretty much the same way. SQL allows you to define constraints, but your application has the same constraints, so it doesn't help you to eliminate SQL.

Unless, of course, you're just looking to reduce the friction for setup and adding code. Or if you're actually storing documents and your application is treating the bulk of the documents as a blob.

04/29/2010 12:11 AM by Demis Bellot

Hi @Chris Wright

I think it's going to be hard to explain what is actually possible, and what I mean when I say 'in most cases nothing needs to be done', without an example.

So I've followed the theme in this article and used this blog example to show you what you can expect with schema-less designs:

code.google.com/.../MigrationsUsingSchemalessNoSql

Essentially, in nearly all cases, data migrations (if they even need to be done) are much less painful in schema-less datastores. There is no need for an 'intermediate language' (e.g. DDL) to define your custom application's schema in the datastore; it's just your code and your datastore, so life's easy.

Redis and most NoSQL datastores keep with this theme and design their functionality to be non-invasive. E.g. you can store a 'BlogPost' into a list that doesn't even exist, and it will just be created on the fly; likewise, when you query a non-existent list of BlogPosts, you get the same result as if it were an empty list, i.e. 0 results.
