Ayende @ Rahien

Refunds available at head office

More code review errors

Take a look at this method:

image

Now, let us make this simple, shall we?

image

Same meaning, and a significant reduction of complexity. Damn, but this is annoying.

Common issues found in code review

I am going over a code base that I haven't seen in a while, and I am familiarizing myself with it by doing a code review to see that I understand what the code is doing now.

I am going to post code samples of changes that I made, along with some explanations.

image

This code can be improved by introducing a guard clause, like this:

image

This reduces nesting and makes the code easier to read in the long run.
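A generic illustration of the transformation (Order and the processing code are my stand-ins, not the code from the screenshot):

```csharp
public class Order
{
    public bool IsValid;
}

public class OrderProcessor
{
    public int ShippedCount;

    // Before: the happy path is buried two levels of nesting deep.
    public void ProcessNested(Order order)
    {
        if (order != null)
        {
            if (order.IsValid)
            {
                ShippedCount++; // stand-in for the actual work
            }
        }
    }

    // After: guard clauses exit early, and the happy path stays unindented.
    public void ProcessWithGuards(Order order)
    {
        if (order == null)
            return;
        if (order.IsValid == false)
            return;

        ShippedCount++; // same work, no nesting
    }
}
```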

image

I hope you recognize the issue. The code is using reflection to do an operation that is already built into the CLR.

This is much better:

image

Of course, there is another issue here: why the hell do we have those if statements on types instead of pushing this into polymorphic behavior? No answer yet, I am currently just doing a blind code review.
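A hypothetical reconstruction of the problem; IMessage and both checks are my own stand-ins, not the code from the screenshot:

```csharp
public interface IMessage { }
public class PingMessage : IMessage { }

public static class TypeChecks
{
    // What the reviewed code was (roughly) doing via reflection:
    public static bool IsMessageViaReflection(object obj)
    {
        return typeof(IMessage).IsAssignableFrom(obj.GetType());
    }

    // The CLR has this operation built in:
    public static bool IsMessage(object obj)
    {
        return obj is IMessage;
    }
}
```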

Here is another issue, using List explicitly:

image

It is generally better to rely on the most abstract type that you can use:

image
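For example, a sketch of what this usually looks like (Customer and the repository are invented for illustration):

```csharp
using System.Collections.Generic;

public class Customer
{
    public string Name;
}

public class CustomerRepository
{
    private readonly List<Customer> customers = new List<Customer>();

    public void Add(Customer customer)
    {
        customers.Add(customer);
    }

    // Before: public List<Customer> GetCustomers() -- the concrete type leaks out.
    // After: callers only need to enumerate, so the abstract type is enough, and
    // the implementation is free to switch to an array, a set, or a lazy query.
    public IEnumerable<Customer> GetCustomers()
    {
        return customers;
    }
}
```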

This is a matter of style more than anything else, but it drives me crazy:

image

I would much rather have this:

image

Note that I added braces for both clauses, because it also bothers me if one has them and the other doesn't.
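A made-up example of the style in question:

```csharp
using System.Collections.Generic;

public class PriceLookup
{
    private readonly Dictionary<string, int> cache =
        new Dictionary<string, int> { { "widget", 10 } };

    // What drives me crazy (one clause braced, the other not):
    //
    //     if (cache.ContainsKey(key))
    //     {
    //         return cache[key];
    //     }
    //     else
    //         return -1;
    //
    // With braces on both clauses:
    public int GetPrice(string key)
    {
        if (cache.ContainsKey(key))
        {
            return cache[key];
        }
        else
        {
            return -1; // stand-in for the fallback path
        }
    }
}
```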

Another issue is hanging ifs:

image

Which we can rewrite as:

image
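Since the screenshot isn't reproduced here, a typical hanging if and its braced rewrite would look like this (the guard itself is a made-up example):

```csharp
using System;

public static class Guards
{
    // A hanging if leaves its single statement dangling on the next line:
    //
    //     if (user == null)
    //         throw new ArgumentNullException("user");
    //
    // Braced, a later edit cannot accidentally fall outside the condition:
    public static string Require(string user)
    {
        if (user == null)
        {
            throw new ArgumentNullException("user");
        }
        return user;
    }
}
```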

I think that this is enough for now...

Emulating Java Enums

Java Enums are much more powerful than the ones that exist in the CLR. There are numerous ways of handling this issue, but here is my approach.

Given this enum (defined in Java):

private static enum Layer {
    FIRST,
    SECOND;

    public boolean isRightLayer(WorkType type) {
        if (this == FIRST && type != WorkType.COLLECTION) return true;
        return this == SECOND && type == WorkType.COLLECTION;
    }
}

And the C# version is:

private class Layer
{
    public static readonly Layer First = new Layer(delegate(WorkType type)
    {
        return type != WorkType.Collection;
    });
    public static readonly Layer Second = new Layer(delegate(WorkType type)
    {
        return type == WorkType.Collection;
    });

    public delegate bool IsRightLayerDelegate(WorkType type);

    private readonly IsRightLayerDelegate isRightLayer;

    protected Layer(IsRightLayerDelegate isRightLayer)
    {
        this.isRightLayer = isRightLayer;
    }

    public bool IsRightLayer(WorkType type)
    {
        return isRightLayer(type);
    }
}
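For what it is worth, with C# 3.0 you can drop the custom delegate type and let lambdas tighten the pattern further. A condensed sketch (the WorkType members other than Collection are assumed, since the enum isn't shown):

```csharp
using System;

public enum WorkType { Collection, Index } // Index is an invented second member

public class Layer
{
    public static readonly Layer First = new Layer(type => type != WorkType.Collection);
    public static readonly Layer Second = new Layer(type => type == WorkType.Collection);

    private readonly Func<WorkType, bool> isRightLayer;

    private Layer(Func<WorkType, bool> isRightLayer)
    {
        this.isRightLayer = isRightLayer;
    }

    public bool IsRightLayer(WorkType type)
    {
        return isRightLayer(type);
    }
}
```

Usage is then the same as on the Java side: Layer.First.IsRightLayer(type).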

On jQuery & Microsoft

No, I am not going to bore you with another repetition of the news. Yeah, Microsoft is going to bundle jQuery with Visual Studio and the ASP.Net MVC. That is important, but not quite as important as something else that I didn't see other people pointing out.

This is the first time in a long time that I have seen Microsoft incorporating an Open Source project into their product line.

I am both thrilled and shocked.

Hibernating Rhinos #10 - Producing Production Quality Software

image

Don't get used to the deluge of screen casts; I usually do them months apart, not a mere day.

At any rate, this screen cast is another significant diversion from my usual style.

To start with, it is a zero code webcast, and it would probably work well as a podcast, although I think that the artwork and presentation are still important.

Anyway, this webcast is focused on several lessons learned from an unsuccessful project: what kinds of things we should pay attention to, and how we can avoid them.

It runs just under 40 minutes, and it is pretty intense.

As I said, this is a new approach for me, and I would like to get your feedback on the matter.

You can download it at the bottom of this page: http://ayende.com/hibernating-rhinos.aspx

Wacky Traveling Schedule

Well, I spent almost two hours yesterday just getting things organized for the coming couple of months. It was only when I actually sat down to start making reservations that I figured out what I had set myself up for.

2nd Oct - 16th Oct - London; there is an ALT.Net beer night on the 14th that I am looking forward to.

27th Oct - 29th Oct - Dallas.

30th Oct - 2nd Nov - Austin for the ALT.Net conference there.

2nd Nov - 15th Nov - New Jersey.

16th Nov - 22nd Nov - Sweden, for Øredev.

Oh, and there is DevTeach as well at the beginning of December, but I am not thinking that far ahead.

Hibernating Rhinos #9 - Application Architecture

image

It has been a while since I last published a screen cast, but here is a new one.

This one is in a slightly different style. I decided to follow Rob Conery's method of using a lot of prepared code instead of my usual ad hoc programming.

Please let me know what you think about the different style.

This is a far more condensed episode, lasting just under half an hour, and it is focused primarily on the internal architecture of a real world application.

I tried to go over a lot of the concepts that seem to trip people up when they come to define the structure of the application.

 

The technical details:

  • ~30 minutes
  • 28.4 MB

You can download this from this address: http://ayende.com/hibernating-rhinos.aspx

Chapter 11 is done, or the tale of meta documentation

Here is the table of contents:

  • Writing the Getting Started Guide
    • Create Low Hanging Fruits
  • The User Guide
    • Documenting the Language Syntax
    • The Language Reference
    • Debugging for business users
  • Creating developer documentation
    • Outlining DSL structure
    • The syntax implementation
      • Keywords
      • Behaviors
      • Conventions
      • Notations
      • External Integration
  • Documenting AST transformations
  • Executable documentation
  • Summary

The chapter starts with...

Documentation is a task that most developers strongly dislike. It is treated as a tedious, annoying task and often falls on the developer who protests the least. One additional problem is that a developer trying to document his own work is often not going to do a good job.

This is not an aspersion against developers in general; the problem is that there are too many things that the people who actually write the code take for granted. And even if we ignore that, developers tend to write for developers, in a way that makes little sense to non developers.

That is true for me as well. And trying to document how developers should write documentation is... hard.

At least I can look forward to a really interesting Chapter 12.

The Managed Extensibility Framework

The Managed Extensibility Framework is a "new library in .NET that enables greater reuse of applications and components. Using MEF, .NET applications can make the shift from being statically compiled to dynamically composed. If you are building extensible applications, extensible frameworks and application extensions, then MEF is for you."

(I was too lazy to think about my own description for it, so I just copied the official one.)

Probably the first thing that you should know about MEF is what will undoubtedly be the most common cause for confusion.

The Managed Extensibility Framework is not an IoC container.

This is not a dig at MEF, it is an important distinction. If you try to judge MEF through IoC container glasses, you will come away confused. It may walk like a duck, but it meows.

MEF is, first and foremost, a composition framework. And its target audience is BIG applications. Those two, taken together, are important to understanding what MEF is and how we should look at it.

What is the difference between a composition framework and an IoC container? On the surface, they are doing much of the same thing, managing dependencies for the application in an automated fashion. The difference (and the devil) are in the details.

IoC containers have long ago stopped just managing dependencies. They are taking care of a lot of additional responsibilities. Managing lifecycles, proxies, aspect orientation, event aggregation, transaction semantics and a lot of other features.

In addition to that, there is a lot of focus on problem solving by utilizing the container. Things like generic specialization or component selectors allow you to approach a lot of very complex problems with a completely different mindset.

A composition framework, on the other hand, is focused on a single goal: dependency management.

It sounds like MEF is a subset of what an IoC container is doing, I know. This is not the case. MEF, at least the bits we have right now, is doing a lot more in the area of dependency management than the containers are. Where a container is usually static and opaque, MEF's primary focus is to make the dependency management itself a dynamic and transparent process.
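To make the distinction concrete, here is a toy attribute-driven composer. This is emphatically not MEF's actual API, just a sketch of the style of dependency management it focuses on: parts declare what they export and import, and the composer wires them up by discovery rather than by explicit registration:

```csharp
using System;
using System.Linq;
using System.Reflection;

[AttributeUsage(AttributeTargets.Class)]
public class ExportAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public class ImportAttribute : Attribute { }

public interface ILogger
{
    string Log(string message);
}

[Export]
public class ConsoleLogger : ILogger
{
    public string Log(string message) { return "logged: " + message; }
}

public class Shell
{
    [Import]
    public ILogger Logger { get; set; }
}

public static class ToyComposer
{
    // Find all [Export] classes in the assembly, then satisfy every [Import]
    // property on the given part by instantiating a matching export.
    public static void Compose(object part)
    {
        var exports = Assembly.GetExecutingAssembly().GetTypes()
            .Where(t => t.GetCustomAttributes(typeof(ExportAttribute), false).Length > 0)
            .ToList();
        foreach (var property in part.GetType().GetProperties())
        {
            if (property.GetCustomAttributes(typeof(ImportAttribute), false).Length == 0)
                continue;
            var implementation = exports.First(t => property.PropertyType.IsAssignableFrom(t));
            property.SetValue(part, Activator.CreateInstance(implementation), null);
        }
    }
}
```

The real MEF layers catalogs, contracts, metadata and static verification on top of this basic idea.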

This is where the second part of the MEF design goals come into place. MEF is targeting Big applications. The really big ones. One of the immediate customers of MEF is Visual Studio itself.

Things like ( take a deep breath ):

  • being able to query metadata without loading assemblies
  • statically verifying the dependency graph for all the components and rejecting those that would put the system in an invalid state
  • being explicit by default
  • contract adapters
  • discovery
  • metadata tagging

All of those are key concepts in the overall dependency management theme. And all of those are the product of having Visual Studio as one of the first consumers of this project. Visual Studio needs these kinds of things, across tens of thousands of components. And MEF is set up to handle those kinds of scenarios.

So, MEF is very similar to IoC containers, but it has very different goals (or maybe it would be more accurate to say that it has very different priorities).

Another important aspect of MEF has nothing to do with it at all and everything to do with where it is going to be used. MEF is going to ship with .Net 4.0, which puts it in a position to be very widely distributed, but more importantly, since it is in the framework, it can be used by other parts of the framework. Which is where it gets interesting.

There are a lot of places in the framework that could make use of a container. IControllerFactory is a good example of something that should not exist, for example. I am ambivalent with regard to that, because I think that the correct abstraction for those kinds of things is not necessarily MEF, but that is beside the point.

And that is enough for now, I am going to toss a coin and see if it is going to be Erlang code or meta documentation next.

KB957541 is my favorite hotfix

It is not public yet (but you can call and ask for it), but it will be when SP1 goes to Windows Update. This is the fix for the ExecutionEngineException that appeared in .Net 3.5 SP1, and was found by Rhino Mocks.

It took a while (but not unreasonably so), and it is here, yeah!

This is fixed; you can get the fix here: http://support.microsoft.com/?id=957541

More CouchDB reading: btree:query_modify

Okay!

After diving into the CouchDB source code and doing an in-depth review of btree:lookup, I am ready for the real challenge: figuring out how CouchDB writes to the btree. This is the really interesting part, from my point of view.

The heart of this functionality is the query_modify method. This method allows us to perform a batch of operations on the tree in one go. Following my own ideal that all remote APIs (and this is talking to disk, hence, remote) should support batching, I really like the idea.

image

The first part is dedicated to building the actions list. I am not going to go over that in detail, but I will say that it first sorts things by key, and then by action.

So, for example, let us say that I want to perform the following actions:

query_modify(tree, {'c', 'a'},'a','a')

What will happen is that we will first query the current value of 'a', then we will remove it, and then we will insert it. Finally, we will query the value of 'c'. This is done in order to ensure good traversal performance of the tree, as well as to ensure consistency when performing multiple operations.

The part that really interests us is modify_node, which performs the actual work:

image

Now we are starting to see how things really happen. We can see that if the root is null we create a value node, or we find the appropriate node if it is not. This follows what we saw in the lookup.

Moving on, we forward the call to modify_kpnode or modify_kvnode. I am going to touch kpnode first, since it is bound to be simpler (kvnode needs to handle splitting, I would assume).

image

This is far from trivial code, and I don't think that I understand it fully yet, but what it basically does is fairly simple. First we find the first node that matches the current FirstActionKey; if this is the last node that we have to deal with, we call modify_node on it, accumulate the result and return it. If it is not the last node, we split the work, sending everything that is less than or equal to the current FirstActionKey to be handled using modify_node (which is likely to be a key/value node and thus handled directly) and continue to handle the rest using modify_kpnode.

In a way, this is the exact reverse of how lookup_kvnode is handled.

modify_kvnode is complex. I can't fit the function on a single screen at 1280x1024 resolution, so I am going to split it into two sections. It is not ideal, since they are supposed to go together, but I'm sure you'll manage.

image

The first part handles the case where we have no more actions to perform. In this case, we can simply return the results. The second is there to handle the case where we run out of items in the node. Note how insert works, for example. You can see that the btree works the same way Erlang does; that is, we will always rewrite the entire node, rather than modify it. remove is also interesting: if we got to this point and haven't found the node, it doesn't exist, so we can move on. Same for fetching.

Now let us see the more complex part. What happens if we do find the items in the value node?

image

Note that we have AccNode, which is our accumulator. We find the first node that matches ActionKey, and then we take the NodeTuple and the AccNode and turn them into a reversed list. This copies all the items that are less than the current one to the ResultNode; those are the ones that we are not interested in, so we can just copy them as is.

The next part handles the actual adding/removing/fetching from the node. It is pretty self explanatory, I think, so I'll leave it at that.

So now we understand how modify_kvnode and modify_kpnode work. But there is nothing here about splitting nodes, which I find suspicious. Let us go back to modify_node and look at what else is going on there:

image

Ah, note the handling of NewNodeList; that is probably where the money is.

We make a couple of tests. The first is to see if there are any nodes left, the second to see if we changed the node list (by comparing it to the old value). We don't care for any of those at the moment, so we will focus on the last one. write_node is called, and this is likely where we will see what we are looking for.

image

Well, this is simple enough, and chunkify seems like what we were looking for. However, it bothers me that we write all the nodes to disk. It seems... wrong somehow. More specifically, since we are always appending, aren't we going to break the binary tree? There is also the reduce_node call that is hiding there, which we also need to check. It is being called after the node was written to disk, so I am not sure what is going on here.

Let us read chunkify first, and then deal with reduce_node.

image

Well... it looks like the chunkify function sucks. But anyway, what is going on there is fairly simple. We check if the list that we were passed is greater than CHUNK_THRESHOLD. This is set to 1279, for some strange reason that I can't follow. I assume that the intent is to ensure blocks of less than 1280, but no sector size that I have heard of comes in this size.

The second part is more interesting (and complex). OutputChunks is a list of lists of the elements that started in the InList. This piece of code is very short, but it does a tremendous amount of work.
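If I translate my reading of chunkify into C#, the gist is something like this (a sketch only; the real code measures serialized byte sizes and has different edge cases):

```csharp
using System;
using System.Collections.Generic;

public static class BTreeWrite
{
    // Split items into chunks whose accumulated "size" stays at or under a
    // threshold. The size function is pluggable, standing in for the byte
    // size of the serialized Erlang terms.
    public static List<List<T>> Chunkify<T>(IList<T> items, Func<T, int> sizeOf, int threshold)
    {
        var chunks = new List<List<T>>();
        var current = new List<T>();
        int currentSize = 0;
        foreach (var item in items)
        {
            int size = sizeOf(item);
            if (currentSize + size > threshold && current.Count > 0)
            {
                chunks.Add(current); // current chunk is full, start a new one
                current = new List<T>();
                currentSize = 0;
            }
            current.Add(item);
            currentSize += size;
        }
        if (current.Count > 0)
            chunks.Add(current);
        return chunks;
    }
}
```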

And now we can move back to reduce_node, like I promised.

image

This is short, to the point, and interesting. The idea of rereduce is that when something has changed, you don't have to go and recalculate everything from scratch, you can take partial reduced results and the new values and combine them to produce the same result that you would have if you had reduced over the entire data set.

As you can see, calling reduce_node on a key pointer node will cause a re-reduction, while on a value node, it just reduces after a map. I am assuming that the thought was that value nodes are small enough that there is no point in optimizing this.

There are a few interesting points that need to be raised, which I don't have answers for at the moment.

  • Why is this happening after we write to file?
  • How does this relate to CouchDB's ability to shell out to JavaScript in order to perform maps and reductions?
  • What ensures that the ordering of the node reduction matches the tree hierarchy?

At any rate, this completes our examination of write_node and modify_node, we can now go back to where we started, query_modify:

image

We are nearly at the end. We have seen that the nodes are written to disk after being chunkified. Note that currently nothing actually has a reference to them at this point. We do have KeyPointers, but they aren't attached to anything. If we crashed right now (directly after modify_node), there is no state change as far as we are concerned; we just lost some disk space that we will recover in the next compaction.

I am pretty sure that complete_root is the one responsible for hooking everything together, so let us see what it does...

image

Yes, I expected complete_root to be complicated as well :-)

What is going on here is the actual missing piece. This is what takes all the value nodes and turns them into pointer nodes, and does so recursively until we finally get to the point where we only have a single value returned, which is the root node. There is also handling for no nodes, in which case the tree is empty.

Basically, what this means is that the way CouchDB is able to achieve ACID using a btree is by saving the modified tree structure to disk on each commit. Since it is only the modified part that is saved, and since btree structures are really small, there is no real storage penalty. Now I want to go and find what actually saves the new root tree to disk, since query_modify does not modify the actual header (which means that in the case of a crash, nothing will change from the point of view of the system).

Right now I suspect that this is intentional, and that it allows combining even more actions into a single transaction, even beyond what you can do in a single query_modify. This is especially true since the common interface for those would usually be add, remove, etc.

As an interesting side effect, this is also how CouchDB is able to handle MVCC. Since everything is immutable, it can just hand a tree reference to the calling code and forget about it. No changes ever occur, so you get serializable isolation level by default. And, from what I can see, basically for free.
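The trick is easy to demonstrate with any persistent data structure. A C# sketch of the copy-on-write idea (obviously not CouchDB's code):

```csharp
// An immutable singly linked list: "modifying" it builds new nodes and leaves
// every previously handed-out reference untouched, so an old reference is a
// consistent snapshot of the structure as it was.
public class Node
{
    public readonly string Value;
    public readonly Node Next;

    public Node(string value, Node next)
    {
        Value = value;
        Next = next;
    }
}

public static class PersistentList
{
    public static Node Prepend(Node root, string value)
    {
        return new Node(value, root); // the old root is still a valid snapshot
    }
}
```

A reader holding the old root never sees the new version, which is exactly the serializable-snapshot-for-free property described above.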

Going over the couch_httpd file, it seems that the main interface is couch_db, so I am heading that way... and I ran into some very complex functions that I don't really feel like reading right now.

Hope this post doesn't contain too many mistakes...

More CouchDB reading: btree:lookup

I want to dive deep into CouchDB's file format, which is interesting because it maintains ACID guarantees and the code is small enough to make it possible to read.

The file starts with a header, with the following structure:

"gmk\0"
db_header:
    writer_version
    update_seq
    summary_stream_state,
    fulldocinfo_by_id_btree_state
    docinfo_by_seq_btree_state
    local_docs_btree_state
    purge_seq
    purged_docs
// zero padding to 2048
md5 (including zero padding)

At the moment, I am not sure what some of the fields are (all the state fields and purged_docs), and there is an indication that this header can get larger than 2Kb. I'll ignore it for now and go study the way CouchDB retrieves a node. The internal data structure for the CouchDB file is a btree.

Here is the root lookup/2, which takes the btree and a list of keys:

image

It makes a call to lookup/3; the first clause is error handling for a null btree, and the second one is the really interesting one.

get_node will perform a read from the file at the specified location. As you can see, the Pointer we get passed is the btree root at this stage. So this basically reads a term (an Erlang data structure) from the file. It is important to note that Erlang has stable serialization in the face of versioning, unlike .Net.

So, we get the node, and it is either a kp or a kv node (I think that kv is key/value and kp is key/pointer). Since key value seems to be easier to understand, I am going to look it up first.

image

As usual, we are actually interested in the lower one, with the first two being for error handling. The first matches an empty list of keys, which returns the reversed output. This is a standard pattern in Erlang: you accumulate output until you run out of input to process in a recursive manner, at which point you simply return the reversed accumulator, since you need to return the data in the same order that you got it.
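The same accumulate-then-reverse idiom, written out in C# for clarity (a made-up doubling function; Erlang prepends because prepending to a list is O(1)):

```csharp
using System.Collections.Generic;

public static class Recursion
{
    // Double each input value. Results are prepended to the accumulator (the
    // cheap operation on an Erlang list), then reversed once at the end to
    // restore the original input order.
    public static List<int> DoubleAll(Queue<int> input, List<int> acc)
    {
        if (input.Count == 0)
        {
            acc.Reverse(); // one reversal instead of appending on every step
            return acc;
        }
        acc.Insert(0, input.Dequeue() * 2); // "prepend", like Erlang's [H|T]
        return DoubleAll(input, acc);
    }
}
```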

Indeed, we can see from the signature of the third function that this is how it works. We split LookupKey from RestLookupKeys. We will ignore the second function for now, since I don't understand its purpose.

find_first_gteq seemed strange at first, until I realized that gteq stands for greater than or equal to. This performs a simple binary search on NodeTuple. Assuming that it contains a list of ordered (key, value) tuples, it will give you the index of the first item that is greater than or equal to the LookupKey. The rest of the function is a simple recursive search for all the other items in the RestLookupKeys.
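In C#, find_first_gteq boils down to something like this (a sketch over a plain sorted array, not the Erlang tuple handling):

```csharp
public static class BTreeSearch
{
    // Binary search for the index of the first element >= key in a sorted array.
    public static int FindFirstGteq(int[] sorted, int key)
    {
        int low = 0, high = sorted.Length;
        while (low < high)
        {
            int mid = (low + high) / 2;
            if (sorted[mid] < key)
                low = mid + 1; // everything up to mid is too small
            else
                high = mid;    // mid is a candidate, look to its left
        }
        return low; // equals sorted.Length when every element is smaller than key
    }
}
```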

That is interesting, since it means that all the items are in memory. The only file access that we had was in the get_node call in lookup. That is interesting, since I would assume that it is entirely possible to get keys across a wide distribution of nodes. I am going to assume that this is there for the best case scenario, where the root node is also a value node that has no branches. Let us look at the handling of kp_node and see what is going on.

And indeed, this seems to be the case:

image

As usual, the first two are stop conditions, and the last one is what we are really interested in.

We get the first key that interest us, as before. But now, instead of having the actual value, we have a PointerInfo. This is the pointer to where this resides on the disk, I assume.

We then split the LookupKeys list on those that are greater than FirstLookupKey. For all those that are less than or equal to it, we call back to lookup, to recursively find the results for them. Then we call again with all those that are greater than FirstLookupKey, passing them along.

In short, this is an implementation of reading a binary tree from a file. I started to call it simple, but then I realized that it isn't, really. It is elegant, though.

Of course, this is still missing a critical piece, the part in which we actually write the data, but for now I am actually able to understand how the file format works. Now I have to wonder about attachments, and how the actual document content is stored, beyond just the key.

Considering that just yesterday I gave up on reading lookup for being too complex, I am pretty happy with this.

Reading Erlang: Inspecting CouchDB

I like the ideas that Erlang promotes, but I find myself in a big problem trying to actually read Erlang. That is, I can read sample code, more or less, but real code? Not so much. I won't even touch on writing the code.

One of the problems is the lack of IDE support. I tried both ErlIDE and Erlybird, and neither of them gave me so much as syntax highlighting. They should, according to this post, but I couldn't get it to work. I ended up with Emacs, which is a tool that I just don't understand. It supports syntax highlighting, which is what I want in order to read the code more easily, so that is fine.

Emacs is annoying, since I don't understand it, but I can live with that. The source code that I want to read is CouchDB, which is a document DB with a REST interface that has gotten some attention lately. I find the idea fascinating, and some of the concepts that they raise in the technical overview are quite interesting. Anyway, I chose that to be the source code that will help me understand Erlang better. Along the way, I expect to learn something about how CouchDB is implemented.

Unlike my usual code reviews, I don't have the expertise or the background to actually judge anything with Erlang, so please take that into account.

One of the things that you notice very early with CouchDB is how little code there is in it. I would expect a project for a database to be in the hundreds of thousands of lines and hundreds of files. But apparently there is something else going on. There are about 25 files, the biggest of them being 1,200 lines or so. I know that the architecture is drastically different (CouchDB shells out to other processes to do some of the things that it does), but I don't think it is just that.

As usual, I am going to start reading code in file name order, and jump around in order to understand what is going on.

couch_btree is strange; I would expect Erlang to have an implementation of a btree already. Although, considering functional languages' bias toward lists, that may not be the case. A more probable option is that CouchDB couldn't use the default implementation because of the different constraints it is working under.

This took me a while to figure out:

image

The equivalent in C# is:

// Loose C# sketch of the Erlang record and its default values
public class btree
{
	public object fd, root;
	public Func<object, object> extract_kv = kvp => kvp;
	public Func<object, object> assemble_kv = kvp => kvp;
	public Func<IComparable, object, bool> less = (a, b) => a.CompareTo(b) < 0;
	public Func<object, object> reduce = null;
}

public object Extract(btree btree, object val)
{
	return btree.extract_kv(val);
}

The ability to magically extract values out of a record makes the code extremely readable, once you know what is going on. I am not sure why the Extract variable is on the right side, but I can live with it.

Here is another piece that I consider extremely elegant:

image

You call set_options with a list of tuples. The first value in each tuple is the option you are setting, the second is the actual value to set. This is very similar to Ruby's hash parameters.

Note how it uses recursion to systematically handle each of those items, then calls itself with the rest of the list. Elegant.

It is also interesting to see how it deals with immutable structures, in that it keeps constructing new ones as copies of the old. The only strange thing here is why we map split to extract and join to assemble. I think it would be best to name them the same.

There is also the implementation of reduce:

image

This gave me a headache for a while, before I figured it out. Let us say that we are calling final_reduce with a binary tree and a value. We expect that value to always be a tuple of two lists. We call the btree.reduce function on the value. And this is the sequence of operations that is going on:

  • if the value is two empty lists, we call reduce with an empty list, and return its value.
  • if the value is composed of an empty list and a list with a single value, then we have reduced the value to its final state, and we can return that value.
  • if the value is an empty list and a list of reduced values, we call reduce again to re-reduce the list, and return its value.
  • if the value is two lists, we call reduce on the first and prepend the result to the second list. Then we call final_reduce again with an empty list and the new second list.

Elegant, if you can follow what is going on.
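The shape of it translates to something like this in C#, using sum as the reduce function (my own sketch of the control flow, not the actual CouchDB semantics):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Reduction
{
    // Reduce raw values first, then rereduce partial results, until a single
    // reduced value remains -- mirroring the four cases of final_reduce.
    public static int FinalReduce(List<int> rawValues, List<int> partialResults)
    {
        if (rawValues.Count == 0 && partialResults.Count == 1)
            return partialResults[0]; // fully reduced, we are done

        if (rawValues.Count > 0)
        {
            // reduce the raw values and move the result over to the partials
            partialResults.Insert(0, rawValues.Sum());
            return FinalReduce(new List<int>(), partialResults);
        }

        // rereduce: combine the partial results into a single value
        return FinalReduce(new List<int>(), new List<int> { partialResults.Sum() });
    }
}
```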

There is a set of lookup functions that I am not sure I can understand. That seems to be a problem in general with the CouchDB source code: there is very little documentation, and I don't think that it is just my newbie status at Erlang that is making this difficult.

I got up to get_node (line 305) before I realized that this isn't actually an in memory structure. This is backed by couch_file:pread_term, whatever that is. And then we have write_node. I am skipping ahead and see things like reduce_stream_kv_node2, which I am pretty sure are discouraged in any language.

Anyway, general observation tells me that we have kvnode and kpnode. I think that kvnode is a key/value node and kpnode is a key/pointer node. I am lost beyond that. Trying to read a function with eleven arguments taxes my ability to actually understand what each of them is doing.

I found it very interesting to find test functions at the end of the file. They seem to be testing the btree functionality. Okay, I am giving up on understanding exactly how the btree works (not uncommon when starting to read code, mind you), so I'll just move on and record anything new that I have to say about the btree if I find something.

Next target, couch_config. This one actually has documentation! Be still, my heart:

image

Not sure what an ets table is, mind you.

couch_config is not just an API, like couch_btree is. It is a process, using Erlang's OTP in order to create an isolated process:

image

gen_server is a generic server component. Think about this like a separate process that is running inside the Erlang VM.

Hm, looks like ets is an API to handle in-memory databases. Apparently they are implemented using Judy arrays, a data structure implementation that aims to reduce cache misses on the CPU.

This is the public API for couch_config:

image

It is a fairly typical OTP process, as far as I understand. All calls to the process are translated to messages and sent to the server process to handle. I do wonder about the ets:lookup and ets:match calls that the get is making. I assume that those have to be sync calls.

Reading the ini file itself is an interesting piece of code:

image

In the first case, I am not sure why we need io:format("~s~n", [Msg]); that seems like a no-op to me. The file parsing itself (the lists:foldl) is true functional code. It took me a long time to actually figure out what is going on there, but I am impressed with the ability to express so much with so little code.

The last line uses a list comprehension to write all the values to the ets table.

On the other hand, it is fairly complex, for something that procedural code would handle with extreme ease and be more readable.

Let us take a look at the rest of the API, okay?

image

This is the part that actually handles the calls that can be made to it (the public API that we examined earlier). Init is pretty easy to figure out. We create a new in-memory table and load all the ini files (using a list comprehension to do so).

The last line is interesting. Because servers need to maintain state, they do it by always returning their state to their caller. It is the responsibility of the Erlang platform (actually, I think it is OTP specifically) to maintain this state between calls. The closest thing that I can think of is that this is similar to the user's session, which can live beyond the span of a single request.

However, I am not sure how Erlang deals with concurrent requests to the same process, since then two calls might result in different state. Maybe they solve this by serializing access to processes, and creating more processes to handle concurrency (that sounds likely, but I am not sure).

The second function, handle_call of set will call insert_and_commit, which we will examine shortly, and just return the current state. The function handle_call of delete is pretty obvious, I think, as is handle_call of all.

handle_call of register is interesting. We use this to register a callback function. The first thing we do is setup a monitor on the process that asked us for the callback. When that process goes down, we will be called (I'll not show that, it is pretty trivial code).

Then we prepend the new callback and the process id to the notify_funs variable (the callback functions) and return it as the new state.

But what are we doing with those?

image

Well, we call them in the insert_and_commit function. The first thing we do is insert into the table, then we use a list comprehension over all the callbacks (notify_funs) and call them. Note that we use a catch there to ignore errors from the callbacks. We write the newly added config item to file as well, on the last line.
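
The shape of that notification loop is roughly the following. This is a hypothetical Java sketch (names invented, not CouchDB code): store the value, then call every registered callback, catching exceptions per callback so that one broken subscriber cannot stop the rest, which mirrors the Erlang catch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of insert-then-notify with per-callback error isolation.
class ConfigNotifier {
    private final List<BiConsumer<String, String>> notifyFuns = new ArrayList<>();
    private final Map<String, String> values = new HashMap<>();

    void register(BiConsumer<String, String> callback) {
        notifyFuns.add(0, callback); // prepend, like the Erlang version
    }

    // Returns how many callbacks were successfully notified.
    int insertAndNotify(String key, String value) {
        values.put(key, value); // insert first, like insert_and_commit
        int delivered = 0;
        for (BiConsumer<String, String> fun : notifyFuns) {
            try {
                fun.accept(key, value);
                delivered++;
            } catch (RuntimeException ignored) {
                // a failing callback is ignored; the rest still run
            }
        }
        return delivered;
    }
}
```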

And this is it for couch_config. couch_config_writer isn't really worth talking about. Moving on to the first item of real meat: the couch_db.erl file itself! Unsurprisingly, couch_db, too, is an OTP managed process, and the first two functions in the file tell us quite a bit.

image

catch, like all things in Erlang, is an expression, so start_link will return the error instead of throwing it. I am not quite sure why this is used; it seems strange, like seeing a method whose return type is System.Exception.

start_link0, however, is interesting. couch_file:open returns an Fd, I would assume that this is a File Descriptor, but the call to unlink below means that it is actually a process identifier, which means that couch_file:open creates a new process. Probably we have a process per file. I find this interesting. We have seen couch_file used in the btree, but I haven't noticed it being its own process before.

Note the error recovery for the file-not-found case. We then actually start the server, after we ensure that the file exists and is valid, delete the old file if it exists, and return.

We then have a few functions that just forward to the rest of couch_db:

image

Looks like those are used as a facade. delete_doc is funny:

image

I am surprised that I can actually read this. What this does is create a new document for each of the relevant versions that were passed. I initially read it as modifying the doc and using pattern matching to do so, but it appears that this is actually creating a new one. I hope that I'll be able to understand update_docs.

In the meantime, this is fairly readable, I think:

image

This translates to the following C# code:

public Doc open_doc(Db db, string id, ICollection<string> options)
{
	// Db and Doc stand in for whatever types the real code would use
	var doc = open_doc_internal(db, id, options);
	if (doc.Deleted)
	{
		if (options.Contains("deleted"))
			return doc;
		return null;
	}
	return doc;
}

About the same line count, and I am not sure which is more readable.

I am skipping a few functions that are not interesting, but I was proud that I was able to figure this one out:

image

We start with a sorted list of documents, which we pass to group_alike_docs/2:

  • If the first argument is an empty list, return the reversed list from the second argument.
  • If the first argument is a non-empty list, and the second one is an empty list, create a new bucket for that document, and recurse.

The case where both lists are non-empty requires its own explanation. Note this piece of code:

[#doc{id=BucketId}|_] = Bucket,

This is the equivalent of the following C# code:

var BucketId = Bucket.First().Id;

Now, given that, we can test whether the bucket id equals the current document id. If it does, we add the document to the existing bucket; if not, we create a new one. The syntax is pretty confusing, I admit, but it makes sense when you realize that we are dealing with a sorted list.
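
The same bucketing logic can be sketched in Java. This is my own hypothetical version, assuming documents are (id, revision) pairs sorted by id; note that it appends to buckets, whereas the Erlang code prepends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of group_alike_docs: group consecutive entries with
// the same id into one bucket. This only works because the input is sorted.
class DocGrouper {
    static List<List<String[]>> groupAlikeDocs(List<String[]> sortedDocs) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (String[] doc : sortedDocs) {
            List<String[]> last = buckets.isEmpty()
                    ? null : buckets.get(buckets.size() - 1);
            if (last != null && last.get(0)[0].equals(doc[0])) {
                last.add(doc); // same id as the current bucket: add to it
            } else {
                List<String[]> bucket = new ArrayList<>();
                bucket.add(doc); // new id: start a new bucket
                buckets.add(bucket);
            }
        }
        return buckets;
    }
}
```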

By this time I am pretty sure that you are tired of me spelling out all the CouchDB code. So I'll only point out things that are of real interest.

couch_server is really elegant, and I was able to figure out all the code very easily. One thing that is apparent to me is that I don't understand the choice between sync and async calls in Erlang. couch_server uses all synchronous calls, for example.

That is all for now, I reviewed couch_file as well, and I think that I need to go back to couch_btree to understand it more.

couch_ft_query is very interesting. It is basically a way to shell out to an external process to do additional work. This fits very well with the way that Erlang itself works. What seems to be going on is that full text search is shelled out elsewhere, and the result of calling that JavaScript file is a set of doc ids and scores, which presumably can be used later to get an ordered list from the database itself.

This seems to indicate that there is a separate process that handles the actual indexing of the data (Lucene / Solr seems to be a natural choice here). It isn't currently used, which maps to what I know about the current state of CouchDB. It is interesting to note that Lucene is also a document based store.

Moving on, couch_httpd is where the dispatching of all commands occurs. I took only a single part, which represents how most of the code in this module looks:

image

I really like the way pattern matching is used here. Although I must admit that it is strange to see function declaration being used as logic.

couch_view is the one that is responsible for views, obviously. I can't follow all it is doing yet, but it is shelling out some functionality to couch_query_servers, which in turn shells the work out to... a JavaScript file. This one, to be precise. CouchDB is using an external process to handle all of its view processing. This gives it immense flexibility, since a user can define a view that does everything they want. Communication is done over standard IO, which surprised me, but it seems like the only way to do cross-platform process communication, especially if you want to communicate with JavaScript.
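
To make the standard IO idea concrete, here is a hypothetical Java sketch of talking to an external worker process one line at a time. None of this is CouchDB code, and `cat` stands in for the real JavaScript view server.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

// Hypothetical sketch: a line-oriented request/response protocol over the
// child process's stdin and stdout, which is roughly the shape described.
class StdIoWorker implements AutoCloseable {
    private final Process process;
    private final BufferedWriter toWorker;
    private final BufferedReader fromWorker;

    StdIoWorker(String... command) throws Exception {
        process = new ProcessBuilder(command).start();
        toWorker = new BufferedWriter(new OutputStreamWriter(process.getOutputStream()));
        fromWorker = new BufferedReader(new InputStreamReader(process.getInputStream()));
    }

    String roundTrip(String request) throws Exception {
        toWorker.write(request);
        toWorker.newLine();
        toWorker.flush();             // the worker sees one complete line
        return fromWorker.readLine(); // and answers with one line
    }

    public void close() {
        process.destroy();
    }
}
```

The nice property of this design is that the worker can be written in any language that can read stdin and write stdout, which is exactly why it is so easy to plug JavaScript in.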

All in all, that has been a very interesting codebase. That is all for now...

For the single reader who actually waded through all of that, thanks.

Coddling is considered harmful

Recently there has been a lot of discussion about how we can make development easier. It usually starts with someone stating "X is too hard, we must make X approachable to the masses".

My response to that was:

You get what you pay for, deal with it.

It was considered rude, which wasn't my intention, but that is beside the point.

One of the things that I just can't understand is the assumption that development is unskilled labor; that you can just get a bunch of apes from the zoo and crack the whip until the project is done.

Sorry, it just doesn't work like this.

Development requires a lot of skill, it requires quite a lot of knowledge and at least some measure of affinity. It takes time and effort to be a good developer. You won't be a good developer if you seek the "X in 24 Hours" or "Y in 21 days". Those things only barely scratch the surface, and that is not going to help at all for real problems.

And yes, a lot of the people who call themselves developers should put down their keyboards and go home.

I don't think that we need to apologize if we are working at a high level that makes it hard for beginners. This comes back to the idea that this is not something that you can just pick up.

And yes, experience matters. And no, one year repeated fifteen times does not count.

Teach Yourself Programming in Ten Years is a good read, which I recommend. But the underlying premise is that there is quite a lot of theory and practical experience underlying what we are doing on a day to day basis. If you don't get that, you are out of the game.

I refuse to be ashamed of requiring people to understand advanced concepts. That is what their job is all about.

Amazon EC2 now offers RDBMS

That is pretty amazing, since this was a big pain point when developing for EC2 powered systems. They support both Oracle and MySQL, in addition to SimpleDB, which is a non-relational DB.

From the looks of things, however, there is a significant difference between the Oracle and MySQL offerings. MySQL is a DB limited to a single machine. They talk about the ability to dynamically scale the machine (which sounds just awesome) from small to extra large based on requirement, but not about multi instance databases.

Oracle, however, does have this ability, and it is supported on EC2. So you can scale your database horizontally, rather than just vertically. At least, that is how I read things.

I found this to be extremely interesting.

Designing Erlang#

I am currently reading another Erlang book, and I am once again impressed by the elegance of the language. Just to be clear, I don't have any intention to actually implement what I am talking about here; this is merely a way to organize my thoughts. And to avoid nitpickers, yes, I know of Retlang.

Processes

Erlang's processes are granular and lightweight. There is no real equivalent in OS threads or processes, or even in .Net's AppDomains. A quick, back-of-the-envelope design for this in .Net would lead to the following API:

public class ErlangProcess
{
	public ErlangProcess(ICommand cmd);
	public static ProcessHandle Current { get; }
	public MailBox MailBox { get; }
	public IDictionary Dictionary { get; }

	public Action Execution { get; set; }
	public Func<Message> ExecutionFilter { get; set; }
}

A couple of interesting aspects of the design, we always start a process with a command. But we allow waiting using a delegate + filter. This is quite intentional, the reason is the spawn() API call, which looks like this:

public ProcessHandle Spawn<TCommand>();

We do not allow passing an instance, only the type. The reason for that is that passing instances between processes opens up the chance of threading bugs. For that matter, we don't give you back a reference to the process either, just a handle to it.

Messages

Message passing is an important concept for Erlang, and something that I consider to be more and more essential to the way I think about software. Sending a message is trivial, all you need to do is call:

public void Send(ProcessHandle process, object msg);

Receiving a message is a lot more complex, because you may have multiple receivers or conditional receivers. C#'s lack of pattern matching syntax is really annoying in this regard (Boo has that, though :-) ). But I was able to come up with the following API:

public void Receive<TMsg>(
	Action<TMsg> process);

public void Receive<TMsg>(
	Expression<Func<TMsg, bool>> condition, 
	Action<TMsg> process);

A simple example of using this API would be:

public class PMap : ICommand
{
	List<object> results = new List<object>();

	public void Execute()
	{
		this.Receive(delegate(MapMessage msg)
		{
			foreach(var item in msg.Items)
			{
				Spawn<ActionExec>().Send(new ActionExecMessage(Self(), msg.Action, item));
		this.Receive(delegate(ProcessedItemMessage itemMsg)
				{
					results.Add(itemMsg.Item);
					if(results.Count == msg.Items.Length)
						msg.Parent.Send(new ResultsMessage( results.ToArray() ));
				});
			}
		});
	}
}

This demonstrates a couple of important ideas. Chief among them is how we actually communicate between processes. Another interesting issue is the actual execution of this. Note that we have no threading involved; we are just registering to be notified at some later point. When we have no more receivers registered, the process dies.

Execution environment

The processes should be executed by something. In this case, I think we can define a very simple set of rules for the scheduler:

  • A process is in the runnable state if it has a message to process.
  • A process is in the stopped state if there are no messages matching the current receivers list.
  • A process with no receivers is done, and will be killed.
  • A process may only run on a single thread at any given point.
  • It is allowed to move processes between threads (no thread affinity).
  • Processes have no priorities, but there is a preference for LIFO scheduling.
  • A process unit of work is the processing of a single message (not sure about that, though).
  • Since the only thing that can wake a process is a message, the responsibility for moving a process from stopped to runnable is at the hands of the process MailBox implementation.
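
Those rules can be sketched as a tiny state machine. This is a hypothetical Java sketch (names invented); it simplifies by retiring one receiver per processed message, which is enough to show the runnable / stopped / done transitions:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the scheduling rules: RUNNABLE when a message is
// waiting, STOPPED when it must wait for mail, DONE once no receivers remain.
class ScheduledProcess {
    enum State { RUNNABLE, STOPPED, DONE }

    private final Queue<String> mailbox = new ArrayDeque<>();
    private int receivers;

    ScheduledProcess(int receivers) {
        this.receivers = receivers;
    }

    void deliver(String msg) {
        mailbox.add(msg); // the mailbox is what wakes a stopped process
    }

    // One unit of work: process a single message, retiring one receiver.
    void step() {
        if (state() != State.RUNNABLE) return;
        mailbox.poll();
        receivers--;
    }

    State state() {
        if (receivers == 0) return State.DONE;       // no receivers: kill it
        if (mailbox.isEmpty()) return State.STOPPED; // nothing to match: wait
        return State.RUNNABLE;                       // has work: schedulable
    }
}
```

A scheduler thread would simply pick any RUNNABLE process, call step() once, and move on, which also gives the "no thread affinity" property for free.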

Okay, that is enough for now.

Comments?

Enter the demoware

I am writing about documentation at the moment, and I found myself writing the following:

I don’t think that I can emphasize enough how important it is to make a good first impression with the DSL. It can literally make or break your project. We should make more than reasonable efforts to ensure that the user's first impression of our system is positive.

This includes investing time in building a good looking UI and snappy graphics. They might not actually have a lot of value for the project from a technical perspective, not even from the point of day to day usage in some cases, but they are crucially important from a social engineering perspective.

A project that looks good is pleasant to use, easier to demo and in general easier to get funding for.

This also includes the documentation. If we can do something in a short amount of time, we get a level of trust from the users. "Hey, I can make it go bang!" is important for gaining acceptance. The first step should be a very easy one, even if you have to design for that specifically.

After reading that, I quickly added this as well:

Note, however, that you should be wary of creating a demoware project, one that is strictly focused on demoing well and does not actually add value in real world conditions. Such projects may demo well, and get funding and support, but they tend to fall into the land of tortureware very rapidly, making things harder to do instead of easier.

Beware of the demoware.

Get that queue out of my head

For some reason, I am having a lot of trouble lately getting Rhino Queues out of my head. I don't actually have a project to use it on, but it keeps popping up in my mind. Before committing to the fifth or sixth rewrite of the project, I decided to take a stab at testing its performance.

The scenario is sending a hundred thousand (100,000) messages, with sizes ranging from 4Kb to 16Kb, which are the most common sizes for most messages.

On my local machine, I can push that amount of data in 02:02.02, so just above two minutes. This is for pushing 1,024,540,307 bytes, or just under a gigabyte. That translates to over 800 messages a second and close to 8.5 MB per second.
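
As a quick sanity check on those numbers (treating 02:02.02 as 122.02 seconds, and using decimal megabytes):

```java
// Sanity-checking the throughput figures quoted above: 100,000 messages
// totalling 1,024,540,307 bytes, pushed in roughly 122.02 seconds.
class ThroughputMath {
    static long messagesPerSecond(long messages, double seconds) {
        return Math.round(messages / seconds);
    }

    static double megabytesPerSecond(long bytes, double seconds) {
        return bytes / seconds / 1_000_000.0; // decimal MB
    }
}
```

That works out to roughly 820 messages and 8.4 MB per second, which matches the figures in the post.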

Considering that I am pushing everything over HTTP, that is just fine by me. But the real test would be to see how it works over the Internet.

I have a limited upload capacity here, so trying to push a gigabyte up would take too long. Instead, I confined myself to 5,000 messages, of the same varying sizes as before.

This time, over the Internet, it took 27:25.95. Just short of half an hour to push 51,135,538 bytes (~48 MB).

But that is testing throughput. While that is interesting on its own, it doesn't really tell us much about another important quality of the system: latency. Over the Internet, it took an average of 25 seconds to send a message batch. That is pretty fast, but I don't think it is great.

Anyway, getting back into the code didn't really help much, and I think that I will need to rewrite it again.

That Agile Thing

Agile introduction is an interesting problem, and one that I have learned to avoid. I don't feel comfortable standing up to a business owner and saying (paraphrased, of course), "if you do it my way, your life will be better". At least, not about development methodologies; I have no problem saying that about techniques, design, or tools. The reason I don't feel comfortable saying it is that there are too many issues surrounding agile introduction to talk confidently about the benefits.

I usually never mention agile to the customer at all. What I am doing, however, is insisting on a regularly scheduled demo. Usually every week or two. Oh, and I ask the customer what they want to see in the next demo. Given that I have a very short amount of time between demos, I can usually get a very concise set of features for the next demo.

Having demos every week or two really strengthens confidence in the project. So from the point of view of the customer, we have regular demos and they get to choose what they will see in the next one; that is all.

From the point of view of the team, having to have a demo every week or two makes a lot of difference in the way we have to work. As a simple example, I can't demo a UML diagram, so we can't have a six month design phase. Having to accept new requirements for each demo means that we need to enable ourselves to make changes very rapidly, so we turn to practices such as TDD, Continuous Integration, etc. It also means that the design of the application tends to be much more lightweight than it would be otherwise.

In other words, everything flows from the single initial requirement, having a demo every week or two.

Once you have that, you have free rein (more or less) to implement agile practices as they are needed, in order to get the demo up and running. You get customer involvement from the feature selection for the next demo, and you get customer buy-in from having the demos at frequent intervals, always showing additional value.

I refuse to be an [type] developer

Recently I was asked, in two different situations, what kind of a developer I am. I refused to answer. I am not a C# developer, or a database developer, or an agile developer (I don't even know what that means).

If pressed, I would admit that I am mostly familiar with the .Net platform, but I am not going to limit myself to that. I don't even believe that trying to put such labels on people is useful.

Blog updated

I updated the site to Subtext 2.0. So far, it looks good.

Reason for the upgrade? I wanted the ability to schedule posts for the future. That should give me a way to throttle my blogging bandwidth.

This post, for that matter, is posted with 4 minutes delay.

Broken: I managed to forget that this blog also has a lot of links built from the old old blog, using dasBlog. It ran for the last year and a half with no issues, so I just didn't notice that. Trying to fix it now, but it may be delayed until tomorrow.

Getting things done, my way

When I was in the army, I used to have a notebook (a physical one, made of paper) and I wrote just about everything there. I stopped doing it when I realized that I never read it. I don't do to-do lists. I can barely manage to handle task tracking with a bug tracking system, and that is because even I recognize it as mandatory.

This is about how I manage to get things done. It is likely not applicable to anyone else.

Currently, I use my inbox as the master to-do list, and I am using read/unread to manage it. If it is unread, I need to take care of it. This works, except when I have to check up on things after some time period. That is, let us say that I mail someone something, and I need to get back to them about it in a week. My current option is to mark the message as unread and leave it that way for a week.

This approach drives me crazy.

Oh, I tried using things like Outlook's reminders and such. They don't work for me; too much association with meeting reminders and annoyances.

What I wanted was a way to have a mail sent to me in the future. That is, if I want to follow up on something in a week, I will get an email next week reminding me of it. Since I am using Gmail, I'll also get the entire conversation, which is the context for what I want to read.

Eventually, I decided that I am going to build this. TimeBox is a simple future email forwarder. It supports natural date syntax, courtesy of DateTimeEnglishParser. Now, if I want to be reminded of something, all I do is forward it to the mailbox, where the service will read it, parse the date, and email it back to me at the appropriate time.

From the UX perspective, it is:

  • hit 'f' for forward
  • enter timebox email
  • tab twice (subject and then to the actual text)
  • enter time, such as 'in one week'
  • tab to send, enter
  • Done.

I started using this already, and I am liking it quite a bit.

How to expose an OSS build server?

I just finished setting up a build server for Rhino Tools. Ideally, I want it to be publicly accessible, and have people download the build artifacts after each build. However, CC.Net is not something that you want to just expose to the web. It has no security model (any random Joe can just start a build, hence a DoS risk).

Any suggestions?

I should note that any suggestion involving a significant amount of time is going to be answered with: "Great, when can you help me do that?"

WPF is magic

And I mean that in the kindest way possible. I am currently working with these wizards, and they (and the possibilities WPF opens up) keep surprising me.

I am well aware that this piece of code is par for the course for WPF devs, but I have only dabbled in WPF, and seeing what it can do from a visual perspective doesn't mean much until I have seen how clean the UI code looks. I mean, just take a look at this:

image

All the information that I need about how to handle a piece of the UI is right there, and the infrastructure supports working in the way that I think is appropriate. The reason for this post is seeing how context menus work. I was extremely pleased to see how it all comes together in a single cohesive unit.

It is alive! CodePlex has Subversion Access

It is so much fun to see things that I worked on coming alive. The official announcement is here, with all the boring details. You can skip all of that and go read the code directly using SVN by hitting: https://svnbridge.svn.codeplex.com/svn

Substitute your project name for svnbridge, and you are done. Note that this is https. And yes, it should work with git-svn as well.

Way cool!