Bug magnet

time to read 6 min | 1055 words

imageIf you ever need to understand why I tend to be involved so many projects, it is for a simple reason. I am apparently a bug magnet. I mentioned that I managed to cause FireBird to corrupt itself, which is why I am not intending to use it. Now I have another contender to the list.

Berkeley DB is a well known, widely used, and in general though to be very stable. I decided to use it, mainly because it is cool and I really need to learn something new. (It is running a lot of interesting things, and I want to be able to give concrete opinions on that.)

So, I start by writing tests, to ensure that it behave the way I expect it to. It doesn't, so I go and read the documentation, and it explained that my assumptions were wrong. Okay, I can deal with that, I learn more about this, and starting to experiment with how it works.

It was going fairly well, until I started to toy with transactions and isolations. One of the early tests is to see whatever read committed transactions work.

I wanted to get this (pseudo code) to pass:

using (var environment = new BerkeleyDbEnvironment("test"))
{
    using (var tx = environment.BeginTransaction())
    using (var queue = environment.OpenQueue("my-queue"))
    {

        queue.Append(new DateTime(2000, 1, 1));

        // now test that we cannot read it
        using (var environment2 = new BerkeleyDbEnvironment("test"))
        {
            using (var tx2 = environment2.BeginTransaction())
            using (var queue2 = environment2.OpenQueue("my-queue"))
            {
                Assert.IsNull(queue2.Consume());
                tx2.Commit();
            }
        }

        tx.Commit();
    }
}

However, not only did the test fail, it also corrupt the DB state, and corrupt the process state (leaving file handles hanging). At first, I blamed myself, then I blamed the code. Finally, I blamed the .Net API Binding.

Eventually I decided that I really need to write my own, to fix this issue. Luckily, I decided to try to reproduce the same scenario using the native C API. This is the first time I touched C in years, so you can imagine how pissed off I was.

I got the same error! And it is a scary one:

PANIC: fatal region error detected; run recovery
PANIC: fatal region error detected; run recovery
File handles still open at environment close
Open file handle: test\my-test-queue
PANIC: fatal region error detected; run recovery

Here the full C code, running against BDB 4.7.

//create environment & queue
u_int32_t envCreateFlags = DB_CREATE |
			   DB_INIT_LOCK|
			   DB_INIT_LOG|
			   DB_INIT_MPOOL|
			   DB_INIT_TXN|
			   DB_RECOVER;

char* path = "test";

DB_ENV *createDbEnv;
db_env_create(&createDbEnv, 0);
createDbEnv->open(createDbEnv,path, envCreateFlags, 0);

DB *createDbDb;
db_create(&createDbDb, createDbEnv, 0);
createDbDb->set_re_len(createDbDb, 128);
createDbDb->open(createDbDb, NULL, "my-test-queue", NULL, DB_QUEUE, DB_CREATE, 0);
createDbDb->close(createDbDb, 0);
createDbEnv->close(createDbEnv, 0);

// write to queue
DB_ENV *writeToQueueEnv ;
db_env_create(&writeToQueueEnv, 0);
writeToQueueEnv->open(writeToQueueEnv, path, envCreateFlags, 0);
DB_TXN *writeToQueueTxn;

writeToQueueEnv->txn_begin(writeToQueueEnv, NULL, &writeToQueueTxn, DB_READ_COMMITTED);

DB *writeToQueueDb;
db_create(&writeToQueueDb, writeToQueueEnv, 0);
writeToQueueDb->open(writeToQueueDb, writeToQueueTxn, "my-test-queue", NULL, DB_QUEUE, 0, 0);

int keyVal = 5;
int valVal = 3;

DBT key, value;
memset(&key, 0, sizeof(DBT));
memset(&value, 0, sizeof(DBT));

key.data = &keyVal;
key.size = sizeof(int);
value.data = &valVal;
value.size = sizeof(DBT);

writeToQueueDb->put(writeToQueueDb,	writeToQueueTxn, &key, &value, DB_APPEND);

// now try to read using another transaction

// read from queue in middle of uncommited transaction
DB_ENV *readFromQueueEnv ;
db_env_create(&readFromQueueEnv, 0);
readFromQueueEnv->open(readFromQueueEnv, path, envCreateFlags, 0);
DB_TXN *readFromQueueTxn;

readFromQueueEnv->txn_begin(readFromQueueEnv, NULL, &readFromQueueTxn, DB_READ_COMMITTED);

DB *readFromQueueDb;
db_create(&readFromQueueDb, readFromQueueEnv, 0);
readFromQueueDb->open(readFromQueueDb, readFromQueueTxn, "my-test-queue", NULL, DB_QUEUE, 0, 0);

int readFromQueueKeyVal = 5;
int val2Val = 3;

DBT readFromQueueKey, readFromQueueData;
memset(&readFromQueueKey, 0, sizeof(DBT));
memset(&readFromQueueData, 0, sizeof(DBT));

readFromQueueKey.data = &keyVal;
readFromQueueKey.size = sizeof(int);
readFromQueueData.data = &valVal;
readFromQueueData.size = sizeof(DBT);

int ret = readFromQueueDb->get(readFromQueueDb, readFromQueueTxn, &key, &value, DB_CONSUME);
if( ret == DB_NOTFOUND)
	Console::WriteLine("not found");

readFromQueueTxn->close(readFromQueueTxn, 0);
readFromQueueDb->close(readFromQueueDb, 0);
readFromQueueEnv->close(readFromQueueEnv, 0);


// now commit write transaction
writeToQueueTxn->commit(writeToQueueTxn, 0);
writeToQueueDb->close(writeToQueueDb,0);
writeToQueueEnv->close(writeToQueueEnv, 0);

Now, what the hell am I doing wrong? I am pretty sure that I am missing something. Just to give you an idea about what the issue is, this is about as bad as if running this code would crash a SQL database:

using( var con = new SomeDbConnection(" connection string "))
using( var tx = con.BeginTransaction())
using( var cmd = con.CreateCommand() )
{
	cmd.Transaction = tx;
	cmd.CommandText = "INSERT INTO Queue (Data) VALUES('a')";
      cmd.ExecuteNonQuery();
	
	using( var con2 = new SomeDbConnection(" connection string "))
	using( var tx2 = con2.BeginTransaction())
	using( var cmd2 = con2.CreateCommand() )
	{
		cmd.Transaction = tx;
		cmd.CommandText = "DELETE FROM Queue"
	      int rowAffected = cmd.ExecuteNonQuery();		
		Assert.AreEqual(0, rowAffected);
		tx2.Commit();
	}	
	
	tx.Commit();
}

Sigh, if I were a tester, I would be a menace.

Updates:

The error occurs if & only if you are creating two environments in the same process. What is annoying is that this is at least implicitly allowed. Only if you specify DB_REGISTER you are limited to a single environment per process (and there is a check for it).

It turns out that this is an issue of using the DB_RECOVER flag in multiple environments at the same time. When you create a new environment with DB_RECOVER, it clear the process state. Yuck. Full details can be found here.