Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,640
|
Comments: 51,260
Privacy Policy · Terms
filter by tags archive
time to read 1 min | 96 words

This is speculative at the moment, just to be clear.

I am thinking about creating a profiler for NHibernate. This came out of the common need to actually get a good view about what is going on with NHibernate.

This is intended to be a commercial project.

I have a feature set in mind, but I would rather hear from you if you think that it is a tool that you would use (and buy) and what kind of features do you expect such a tool to have?

And, to forestall the nitpickers, I am well aware of SQL Profiler.

time to read 12 min | 2299 words

Jan has corrected a misconception that I had here, where I assumed that PUT was create and POST was update. Apparently this is not quite correct. A PUT request will update/create a new document with a user supplied name, while a POST request will create a new document with a server generated name.

That narrows down my search for "create new document" code path again. Here is the method in couch_httpd_db that handles a POST request on a document.

200810041143.jpg

It parse the form, get the revision of the document, and then it calls to open_doc_revs, which we haven't explored so far. I feel confident that I understand what this is doing, however:

200810041145.jpg

I'll not show the open_doc_revs_int, because it is too big, but it contains some references that so far I haven't encountered. Specifically, couch_key_tree, which I don't think that I seen before, and to get_full_doc_infos, which looks like this:

200810041150.jpg

Oh, now we are on a more familiar territory. Lookup performs a search on a binary tree, and we have gone over it in depth before. However, we can now see that the we pass to it a fulldocinfo_by_id_tree... I don't think we have explored how CouchDB make use of the btree, so let us head to the db structure and see what else we have there.

200810041153.jpg

Here is what I know about this structure at the moment. db_header container the data that is actually written to the file, while db is the in memory structure. Not sure what write_version is all about, update_seq looks like the the internal sequence number for CouchDB. I don't think that I have even seen the concept of stream before, but it is there, and look like most stream implementations that I am already familiar with.

get_state on that returns a {Position, BytesRemaining}, since I don't understand how this is used at the moment, I don't understand why this is important. fulldocsinfo_by_id_btree_state contains the root of the btree, and the same holds for docinfo_by_seq_btree_state, I assume that local_docs_btree_state is also the same.

purge_seq is a counter for the last purge (physical delete), and it is used in conjunction with the views in order to ensure that there are no purged documents in the view. purged_docs keeps track of the documents that were last purged.

db, the runtime structure, is considerably more interesting. It took me a while to figure out where it is created, but I finally found the init_db function in couch_db_updater:

200810041217.jpg

The first part is to open the summary_state stream, still not sure what this is about, though. We then define a comparison function for use when opening the fulldocsinfo btree, and we pass several additional functions when we create the btree. Reduce is usually used when we compact the tree, and to get information about the btree (such as count of documents). The join and split also prepare the data to be written to disk in a consistent fashion.

I don't like the name join/split. The internal names, assemble and extract seems to be much more accurate.

Finally, we get to the actual creation of the db structure itself. An interesting thing to note is that only the update_pid is filled. Since CouchDB implements the many readers / single writer, I assume that this is how we ensure the single reader, because only a single process ever handle writes to the DB. I am assuming for now that main_pid is for readers, and compactor_pid if for compacting (obviously). I'll track them next.

Well, that took some doing, the main_pid is set in the couch_db_updater:init (the same function that calls the init_db function):


200810041256.jpg

I had hard time tracking this because I couldn't figure out what who was calling couch_db_updater:init (and what is the MainPid). Usually, in gen_server processes, there is a start_link function that you call, but in this case, there wasn't. Some digging revealed that the couch_db process is the one that actually call start_link on the couch_db_updater. I wonder if this is a common pattern in erlang, because it sure did caught me by surprise.

So main_pid is the couch_db process, and update_pid is the couch_db_updater. It actually goes a lot further than that. couch_db_updater doesn't actually have a public interface. It looks like couch_db is the one that exposes the interface, and send synchronous messages to the couch_db_updater process:

200810041301.jpg

That is the way in which CouchDB ensures that we have only a single write, I presume. Compaction is handled in couch_db_updater as well, the implementation is interesting:

200810041313.jpg

We create a new process and return before we finish compacting. It is interesting because it is the first example of explicit concurrency that I saw. Let us dig a bit better into start_copy_compact_int, which is pretty straight forward:

200810041315.jpg

This just create a new file, init a database and call_compact_docs, finishing by notifying the updater that it finished compacting. copy_compact_docs is interesting in itself:

200810041320.jpg

We start by defining the EnumBySeqFun, which batches documents to write to disk until should_flush returns true. should_flush is funny, it check memory thresholds and flush to disk whenever there is about 10 MB of data ready to be flushed.

The usage of foldl here is a good reason to go back and examine it in more details:

200810041328.jpg

And fold is defined as:

200810041329.jpg

So we have this stream_node, whatever that is, let us take an additional look.

200810041332.jpg

We are already familiar with get_node(), but adjust_dir and stream_kp_node and stream_kv_node are new. The first is easy:

200810041334.jpg

Let us explore stream_kv_node, which is likely to be easier than stream_kp_node.

200810041337.jpg

This is slightly complex. We start by defining the drop off point (smaller / greater from the current key). The important bit is in the last function, the call to our function and the recursion back into the function. As a reminder, here is the function that we pass:

200810041340.jpg

I hope that this make it clear how this works. We simply stream all the items to the function, and it copies it. Note that an important subtlety is that we have several btree over the same data (we will touch down in a bit) and in this case we are only copying the items that are greater than the update_seq in the current DB. Still not sure what this update_seq is all about. We will go all the way back to copy_compact_docs_int's last line, where it notify the updater that it finished the current compaction. This message arrive to this method:

200810041348.jpg

This looks complex, but it really isn't. It is simply opening the new database, and checking if the current update_seq equal to the latest in the new database. If it isn't, we restart the process. If it is, then we copy the local documents (not sure why they get special treatment, and it looks like it is expected to have very few of them), move the files and move on with our lives.

Now, it looks like update_seq is updated whenever someone make a change to the database, but let us verify this, shall we?

Looking at couch_db_updater:update_docs_int, we can see that the update_seq is updated for each new document that is modified (see merge_rev_trees).

Okay, that was interesting. But I still don't know how we save a document to disk. Let us go back to couch_db:update_docs, and see what is going on here. This is pretty complex, so we will break it apart to discrete pieces.

200810041409.jpg

We start by generating a new revision for the documents. Note that in the case local revisions, we just increment the revision, if it is a standard document, we generate a random revision number. I think that this is done in order to support farm wide revisions, so if I update document #1 in server A and in server B, I get two different revisions, which would be detected as conflicts.

We extract all the new revisions, group all the documents by id and extract all the ids. So far, so good. Let us see the rest of the function:

200810041418.jpg

We get the current existing documents, and then we merge the existing with the new bucket. That requires some explanation.

The first function clause uses list comprehensions in order to ensure that if we couldn't find a document in the DB, we didn't specify a previous version. The second is more complex and require us to understand about key trees. CouchDB defines them as:

200810041423.jpg

Based on that and the dict:from_list arguments, we can deduct what the structure of a revision is. It should be noted that it looks like there is an expectation that the number of revisions wouldn't be too big. This make sense, CouchDB does keep previous versions, but it make no guarantees about their life spans. Also, we should note that we have both revs and rev_tree, I am not quite sure why.

The last line in the zipwith function is prepare_doc_for_new_edit, which is called for each document, let us take a look at it.

200810041429.jpg

This is an example of how condense functional code can get. Let us see what it does:

  • Get the new revisions and all the previous revisions
  • If the previous revisions has a value, it perform a lookup based on the revision to the revision tree, to get the revision information. (Which answers the question of why we need both revs and rev_tree. revs is an ordered list of revision keys. rev_tree is apparently the metadata).
  • Now, the LeafRevsDict contains only the most recent one, so failing to find it in the dictionary means that we trying to update a non current version, so we error.
  • I am not sure what a stub is, but the rest of it seems pretty simple.

The last part is interesting, however, we check that we aren't trying to create a new document with the same id as existing one (although we allow it if that document is deleted).

After the call to zipwith(), we have a call to doc_flush_binaries, which we totally ignored so far. It flushes all the documents to disk, it is not really interesting, it is dealing with writing the attachments to disk. I am not quite sure what is going on (it is complex), mostly because I don't follow to which file it is writing. Or, to be rather more exact, I am not following how the attachment can be written to a different file, which is what the code seems to imply.

Anyway, the code there is pretty clear about what is going on, we append to the current file (except if we write to a different file, again, not following that) and update the in memory structure of the document to point to the range of data that each attachment contains.

Let us look at the last part of the couch_db:update_docs:

200810041452.jpg

And is looks like we have hit the place where it is actually happening. We call to the couch_db_updater in order to do the actual update, and we make be asked to retry (this will require more study), in which case we do try.

Let us look at update_docs at the couch_db_updater.

200810041456.jpg

Sigh, I can feel it in my waters, update_docs_int is going to be complex... once more, we will try to divide and conquer.

200810041459.jpg

We start by extracting variable from the Db, then we split the list based on whatever this is local or not. NonRep documents (local) are not replicated, have only a single version and are expected to have very few of them. I still don't know what is the use case for those, however. It looks like DocsList is a list of lists of docs, probably this is the buckets that we had to deal with earlier.

We finish the snipper by extracting the ids from the documents, what comes next?

200810041504.jpg

We lookup the existing documents, and then we merge them into a file info (note that we create a new full_doc_info with just the id if this is a new file). We then create the new doc infos by calling merge_rev_trees. We will ignore the implementation for now.

Finally, we gather all the sequences that we would like to remove. Moving on, we see that we start by updating the local docs and then flushing the trees. Both of which looks promising for my quest to understand how the data is stored on the disk:

200810041516.jpg

update_local_docs just modify the in memory btree, so I am not going to show it, flush_trees should be more interesting, based on just the comment.

200810041523.jpg

This is interesting. What we see here is that we write the attachment references and the document body to the summary stream. That is interesting, because I am still not sure what the summary stream is. Note that here we have the issue of writing to a file during compaction, so we ask the higher level code to retry that again.

We still have to see how the indexes are maintained, but that just update the in memory structure, the real bit happens next:

200810041528.jpg

Here we see our good old friend the btree, and we add/remove the new index to the file. We have seen how this works here. We finish by updating the Db structure with the new values and committing the data.

I now feel that I have a much better understanding on how CouchDB file handling works. Except that I still don't know what this summary stream is. That is a topic for later, though.

time to read 5 min | 902 words

As you can probably figure out from the title, the purpose of this post is to track how CouchDB is saving a document on the disk. We have already tracked how much of CouchDB works, including some fascinating pieces such as btree:lookup and btree:query_modify. Now it is time to dig ourselves from the trenches and have a slightly higher level view of what is going on.

We start with the recently refactored couch_httpd_db module. The fun starts with handle_request.

200810040223.jpg

It took me a few seconds to realize what is going on here. Again, some of the problem is that I am not used to the function declaration being such an important role in the function behavior. In this case, the function declaration decompose the request into several parameters. The case statement is used to dispatch based on the arguments. If we have a PUT or DELETE requests with just the DB name, we know that we need to create/delete a database. Otherwise, we assume we have some other operation on the database directly, and we send a function for it to execute. The reason for that becomes clear when you check the rest of the code.


200810040228.jpg

As we can see, the create_db_req and delete_db_req have different disposal semantics than do_db_req. do_db_req will just open/close the database and then execute the function in the cotext of this database handle.

This is quite a common pattern in functional programming, and it used frequently in Ruby. I am using it as much as I can in C# as well, since it is a very natural syntax to use with anonymous delegates and lambdas.

Anyway, what we are interested in today is the actual process of getting data from the request and putting it on the disk, so let us move along and see what is going on here. The next place of interest, of course, is db_req itself, which we pass into do_db_req. This is fairly interesting in its own, since it deals with all the commands that the server can support (from bulk_docs to compactions to purging). That isn't really interesting to me at this time, so we will move to the first code sample and look at the last clause.

This is where we are handling with a request for a specific document. couch_util:dict_find is where we need to look now...

200810040251.jpg

This is simply a way to do a lookup into a dictionary with a default value. So what does this code do?

200810040253.jpg

Well, it lookup in the DbUrlHandlers (which we got passed to the method) and defaults to the db_req if it isn't found there. But what is this DbUrlHandlers? I'll save you the trouble of finding out on your own and guide us directly to the location that we care about. Here is a section from the couchdb.ini configuration:

[httpd_global_handlers]
/ = {couch_httpd_misc_handlers, handle_welcome_req, <<"Welcome">>}
_utils = {couch_httpd_misc_handlers, handle_utils_dir_req, "%localdatadir%/www"}
_all_dbs = {couch_httpd_misc_handlers, handle_all_dbs_req}
_config = {couch_httpd_misc_handlers, handle_config_req}
_replicate = {couch_httpd_misc_handlers, handle_replicate_req}
_uuids = {couch_httpd_misc_handlers, handle_uuids_req}

[httpd_db_handlers]
_view = {couch_httpd_view, handle_view_req}
_temp_view = {couch_httpd_view, handle_temp_view_req}

If this looks like erlang code, you are almost right. This is actually just an .ini file, but it is format like erlang on purpose, since this is a simple mapping of a url to its handler in the code. This is a new extension mechanism to CouchDB that was added just yesterday.

This is a really nice way of handling extensibility, but enough beating around the bush, where is the save document code path? It starts with this function clause:

200810040303.jpg

So we need to go to the do_doc_req to see what else is going on here. And in the db_doc_req, we just need to find the PUT handler, which will create a new document. And here it is:

200810040304.jpg

Let us examine this in depth. couch_httpd:json_body(Req) just takes the request body and decode the JSON information there. I'll save you the way JSON is parsed using erlang. Suffice to say that I don't think it is pretty.

couch_doc:from_json_doc is interesting, though. First, take a look at this code:

200810040312.jpg

This is the equivalent of this code:

var allowedSpecialMembers = new[]{ /* see list above */ };

foreach(var prop in Props)
{
if( allowedSpecialMembers.Contains(prop)) continue;
throw new DocValidationexception('Bad special document member " + prop);

}

The method is doing a few extra things that aren't really interesting to us at the moment, basically taking the JSON object that we got and turning that into this object:

200810040318.jpg

Going back to db_doc_req, we can see that next we just get the appropriate revisions and then we extract the header revision, whatever that may be...

200810040321.jpg

This is merely dealing with ensuring that if we specified the etag header, it will match the current revision. Not sure why this is important for PUT request, however. This make sense for POST (update) request, however. Taking that as a given, we then call update_doc (which seems to serve double duty as create_doc) and send as JSON response back to the user.

This means that we finally have a lead, and can move from couch_httpd_db to couch_db :-)

200810040326.jpg

update_doc just forward the call to update_docs (plural). That should be interesting, but it wouldn't be today.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. API Design (10):
    29 Jan 2026 - Don't try to guess
  2. Recording (20):
    05 Dec 2025 - Build AI that understands your business
  3. Webinar (8):
    16 Sep 2025 - Building AI Agents in RavenDB
  4. RavenDB 7.1 (7):
    11 Jul 2025 - The Gen AI release
  5. Production postmorterm (2):
    11 Jun 2025 - The rookie server's untimely promotion
View all series

Syndication

Main feed ... ...
Comments feed   ... ...