Here is an interesting one. Can you write code that would take the first piece of text and would turn it into the second piece of text?
First (not compiling):
Second (compiling):
Hint: you can use NRefactory to do the C# parsing.
Here is a snippet from a blog post describing a lecture I gave yesterday at the Ural State University:
Imagine him [Ayende] giving a public lecture at the Ural State University and demonstrating one of the numerous code snippets he prepared. Suddenly a guy (I think his name is Alex) interrupts him and tries to point out an error in the code. Unfortunately, Alex fails to express himself in English, and instead just mumbles incomprehensibly. After two more attempts, he gives up and explains the bug to the audience in Russian. But before anyone has a chance to translate it, Oren smiles and says: “Oh right! You are absolutely correct, I have to insert a break statement here!”. Now, granted, Oren is a great talker and has no problem understanding people; but that was unbelievable even for him, because I can swear that: 1) he doesn't know a single word of Russian, 2) the guy who spotted the problem didn't use a single word of English (like break or foreach). Truth be told, the whole situation was even a bit scary.
There was a roar of laughter in the audience when I did that, and it took me a while to understand why.
One of the more interesting problems with document databases is how you handle views. But a lot of people already had some issues with understanding what I mean by a document database (hint: I am not talking about a Word docs repository), so I had better explain what I mean by this.
A document database stores documents. Those aren’t what most people would consider a document, however; we are not talking about Excel or Word files. Rather, we are talking about storing data in a well known format, but with no schema. Consider the case of storing an XML document or a Json document. In both cases, we have a well known format, but there is no required schema for those. That is, after all, one of the advantages of a document DB’s schema-less nature.
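A minimal illustration of what “well known format, no schema” means in practice. This uses the modern System.Text.Json API purely as an example parser (any Json library would do); the document contents are made up:

```csharp
using System;
using System.Text.Json;

class SchemaLessDemo
{
    static void Main()
    {
        // Two documents living in the same store. The format (Json) is
        // well known, but the documents share no schema at all.
        using var page = JsonDocument.Parse("{\"title\":\"Home\",\"version\":3}");
        using var user = JsonDocument.Parse("{\"name\":\"oren\",\"roles\":[\"admin\"]}");

        // The store only cares that each document is well-formed Json.
        Console.WriteLine(page.RootElement.GetProperty("title").GetString());
        Console.WriteLine(user.RootElement.GetProperty("roles")[0].GetString());
    }
}
```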
However, trying to query on top of schema-less data can be… problematic. Unless you are talking about Lucene, which I would consider to be a document indexer rather than a document DB, although it can be used as such. Even with Lucene, you have to specify the things that you are actually interested in to be able to search on them.
So, what are views? Views are a way to transform a document into some well known and well defined format. For example, let us say that I want to use my DB to store wiki information. I can do this easily enough by storing the document as a whole, but how do I look up a page by its title? Trying to do this on the fly is a recipe for disastrous performance. In most document databases, the answer is to create a view. For RDBMS people, a document DB view is roughly what an RDBMS calls a materialized view.
I thought about creating it like this:
Please note that this is only to demonstrate the concept; actually implementing the above syntax requires either on-the-fly rewrites or C# 4.0.
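For readers who can't see the snippet, a rough sketch of the idea, with each document represented as a plain dictionary (an assumption for illustration only): filter down to pages, order by title ascending and version descending.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ViewSketch
{
    // "pagesByTitleAndVersion": filter to page documents,
    // title ascending, version descending.
    public static IEnumerable<(string Title, int Version)> PagesByTitleAndVersion(
        IEnumerable<Dictionary<string, object>> docs) =>
        from doc in docs
        where (string)doc["type"] == "page"
        orderby (string)doc["title"], (int)doc["version"] descending
        select ((string)doc["title"], (int)doc["version"]);

    static void Main()
    {
        var docs = new List<Dictionary<string, object>>
        {
            new() { ["type"] = "page", ["title"] = "Home", ["version"] = 1 },
            new() { ["type"] = "page", ["title"] = "Home", ["version"] = 2 },
            new() { ["type"] = "user", ["name"] = "oren" },
        };
        foreach (var (title, version) in PagesByTitleAndVersion(docs))
            Console.WriteLine($"{title}/{version}"); // Home/2, then Home/1
    }
}
```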
The code above can scan through the relevant documents and, in a very clean fashion (I think), generate the values that we actually care about. Basically, we have now created a view called “pagesByTitleAndVersion”, indexed by title (ascending) and version (descending). We can now query this view for a particular value, and get it very quickly.
Note that this means that updating views happens as part of a background process, so there is going to be some delay between updating the document and updating the view. That is BASE for you :-)
Another important thing is that this syntax is for projections only. Those are actually very simple to build. Well, simple is relative, there is going to be some very funky Linq stuff going on in there, but from my perspective, it is fairly straightforward. The part that is going to be much harder to deal with is aggregation. I am going to deal with that separately, however.
In a previous post, I asked about designing a document DB, and brought up the issue of replication, along with a set of questions that affect the design of the system:
I think that we can assume that the faster we replicate, the better it is. However, there are costs associated with this. I think that a good way of doing replication would be to post a message on a queue for the remote replication machine, and have the queuing system handle the actual process. This makes it very simple to scale, and creates a distinction between the “start replication” part and the actual replication process. It also allows us to handle spikes in a very nice manner.
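The split between "start replication" and the actual replication process can be sketched like this. The message shape and the in-memory queue are assumptions; any real queuing system would play the same role:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message: which document goes to which machine.
public record ReplicationMessage(string DocumentId, string TargetMachine);

public class ReplicationSketch
{
    private readonly Queue<ReplicationMessage> queue = new();

    // Runs inline with the write: cheap, just posts a message and returns.
    public void StartReplication(string docId, string target) =>
        queue.Enqueue(new ReplicationMessage(docId, target));

    // Runs in the background and drains the queue at its own pace,
    // which is what lets the system absorb spikes.
    public int RunReplicationWorker()
    {
        var replicated = 0;
        while (queue.Count > 0)
        {
            var msg = queue.Dequeue();
            Console.WriteLine($"replicating {msg.DocumentId} to {msg.TargetMachine}");
            replicated++;
        }
        return replicated;
    }

    static void Main()
    {
        var r = new ReplicationSketch();
        r.StartReplication("pages/1", "server-b");
        r.StartReplication("pages/2", "server-b");
        r.RunReplicationWorker();
    }
}
```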
We don’t replicate attachments, since those are out of scope.
Generated view data is a more complex issue, mostly because we have a trade-off here of network payload vs. CPU time. Since views are by their very nature stateless (they can only use the document data), running the view on the source machine or on the replica would result in exactly the same output. I think that we can safely ignore the view data, treating it as something that we can regenerate. CPU time tends to be far less costly than network bandwidth, after all.
Note that this assumes that view generation is the same across all machines. We discuss this topic more extensively in the views part.
I think that a sharding algorithm would be the best option: given a document, it will give a list of machines to replicate to. We can provide a default implementation that replicates to all machines, or to secondaries and tertiaries.
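As an interface, that might look like the following. All the names here are assumptions for illustration; only the shape (document in, list of machines out) comes from the description above:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical strategy: given a document, decide where it replicates to.
public interface IReplicationStrategy
{
    IEnumerable<string> GetTargets(IDictionary<string, object> document);
}

// The suggested default: replicate everything to every known machine.
public class ReplicateToAll : IReplicationStrategy
{
    private readonly string[] machines;
    public ReplicateToAll(params string[] machines) => this.machines = machines;
    public IEnumerable<string> GetTargets(IDictionary<string, object> document) => machines;
}

public class ShardingDemo
{
    static void Main()
    {
        var strategy = new ReplicateToAll("secondary", "tertiary");
        var doc = new Dictionary<string, object> { ["title"] = "Home" };
        foreach (var machine in strategy.GetTargets(doc))
            Console.WriteLine(machine);
    }
}
```

A user-supplied implementation could inspect the document (say, shard by a key property) instead of broadcasting.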
In a previous post, I asked about designing a document DB, and brought up the issue of attachments, along with a set of questions that need to be handled:
We pretty much have to, otherwise we will have users sticking them into the document directly, resulting in very inefficient use of space (binaries in Json format are very wasteful).
Storing them in the DB will lead to very high database sizes. And there is the simple question of whether a Document DB is the appropriate storage for BLOBs. I think that there are better alternatives for that than the Document DB: things like Rhino DHT, S3, the file system, CDNs, etc.
Out of scope for the document DB, I am afraid. That depends on the external storage that you wish for.
Yes, we can and we should.
However, we still want to be able to add attachments to documents. I think we can resolve this pretty easily by adding the notion of document attributes. That would allow us to add external information to a document, such as attachment URLs. Those should be used for things that are related to the actual document, but are conceptually separate from it.
An attribute would be a typed key/value pair, where both key and value contain strings. The type is an additional piece of information, containing the type of the attribute. This will allow us to do things like add relations, specify attachment types, etc.
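In code, an attribute is almost nothing. The type strings and the URL below are made up for illustration; only the typed key/value structure comes from the description above:

```csharp
using System;

// A document attribute: a typed key/value pair, everything stored as strings.
public record DocumentAttribute(string Type, string Key, string Value);

public class AttributeDemo
{
    static void Main()
    {
        // An attachment reference pointing at external storage, and a relation:
        var attachment = new DocumentAttribute("attachment", "invoice.pdf", "http://external-storage/blobs/abc123");
        var relation   = new DocumentAttribute("relation",   "parent",      "pages/42");
        Console.WriteLine($"{attachment.Type}: {attachment.Key} -> {attachment.Value}");
        Console.WriteLine($"{relation.Type}: {relation.Key} -> {relation.Value}");
    }
}
```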
This is actually a topic that I haven’t considered upfront. Now that I do, it looks like it is a bit of a hornet’s nest.
In order to have authorization we must first support authentication. And that brings a whole bunch of questions of its own. For example, which auth mechanism to support? Windows auth? Custom auth? If we have auth, don’t we need to also support sessions? But sessions are expensive to create, so do we really want that?
For that matter, would we need to support SSL?
I am not sure how to implement this, so for now I am going to assume that magic happened and it got done. Because once we have authorization, the rest is very easy.
By default, we assume that any user can access any document. We also support only two operations: Read & Write.
Therefore, we have two pre-defined attributes on the document, read & write. Those attributes may contain a list of users that may read/write the document. If either the read or the write permission is set, then only the authorized users may access it.
The owner of the document (the creator) is the only one allowed to set permissions on a document. Note that write permission implies read permission.
In addition to that, an administrator may not view or write documents that they do not own, but they are allowed to change the owner of a document to the administrator account, at which point they can change the permissions. Note that there is no facility to assign ownership away from a user, only to take ownership if you are the admin.
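One reading of these rules as code, a sketch only. The Document shape is an assumption, and the treatment of "no permissions set" (open to all) is my interpretation of the default:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical document shape; null means the permission attribute is not set.
public class Document
{
    public string Owner = "";
    public List<string> Read;
    public List<string> Write;
}

public static class Authorization
{
    public static bool CanWrite(Document doc, string user)
    {
        if (user == doc.Owner) return true;
        if (doc.Read == null && doc.Write == null) return true; // no permissions set: open
        return doc.Write != null && doc.Write.Contains(user);
    }

    // Write permission implies read permission.
    public static bool CanRead(Document doc, string user) =>
        CanWrite(doc, user) || (doc.Read != null && doc.Read.Contains(user));
}

public class AuthDemo
{
    static void Main()
    {
        var doc = new Document { Owner = "oren", Write = new List<string> { "alice" } };
        Console.WriteLine(Authorization.CanRead(doc, "alice")); // write implies read
        Console.WriteLine(Authorization.CanRead(doc, "bob"));
    }
}
```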
There is a somewhat interesting problem here related to views. What sort of permissions should we apply there? What about views which are aggregated over multiple documents with different security requirements? I am not sure how to handle this yet, and I would appreciate any comments you have on the matter.
In my previous post, I asked about designing a document DB, and brought up the issue of concurrency, along with a set of questions that affect the design of the system:
We have several options. Optimistic and pessimistic concurrency are the most obvious ones. Merge concurrency, such as the one implemented by Rhino DHT, is another. Note that we also have to handle the case where we have a conflict as a result of replication.
I think that it would make a lot of sense to support optimistic concurrency only. Pessimistic concurrency is a scalability killer in most systems. As for conflicts as a result of replication, Couch DB handles those using merge concurrency, which may be a good idea after all. We can probably support both of them pretty easily.
It does cause problems with the API, however. A better approach might be to fail reads of documents with multiple versions, and force the user to resolve them using a different API. I am not sure if this is a good idea or a time bomb. Maybe returning the latest version as well as a flag that indicates that there is a conflict? That would allow you to ignore the issue.
In addition to the Document ID, each document will have an associated version. The Document ID is a UUID, which means that it can be generated on the client side. Each document is also versioned by the server accepting it. The version follows this format: [server guid]/[increasing numeric id]/[time].
That will ensure global uniqueness, as well as giving us all the information that we need for the document version.
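The version scheme above is simple enough to sketch directly. The type and method names are assumptions; the three-part format is the one described:

```csharp
using System;

// The format described above: [server guid]/[increasing numeric id]/[time].
public record DocumentVersion(Guid ServerId, long Sequence, DateTime Time)
{
    public override string ToString() => $"{ServerId}/{Sequence}/{Time:O}";
}

public class VersionDemo
{
    private static long sequence;

    // The server accepting the write assigns the next version; the guid
    // makes it globally unique, the counter makes it ordered per server.
    public static DocumentVersion NextVersion(Guid serverId) =>
        new(serverId, ++sequence, DateTime.UtcNow);

    static void Main()
    {
        var server = Guid.NewGuid();
        Console.WriteLine(NextVersion(server));
        Console.WriteLine(NextVersion(server)); // numeric part keeps increasing
    }
}
```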
In my previous post, I asked about designing a document DB, and brought up the issue of scale, along with a set of questions that affect the design of the system:
Yes and no. I think that we should assume from the get go that a database is not alone, but we shouldn’t burden it with the costs that are associated with this. I think that building replication alone should be a fairly well contained task, which means that we can push more of the smarts regarding distribution into the client library. Simpler server side code usually means goodness, so I think we should go with that.
Joins are usually not used in a document DB. They are very useful, however. The problem is how do we resolve them, and by whom. This is especially true when we consider that a joined document may reside on a completely different server. I think that I am going to stick closely to the actual convention in other document databases, that is, joins are not supported. There is another idea that I am toying with, the notion of document attributes, which may be used to record this, but that is another aspect altogether. See the discussion about attachments for more details.
Yes and no. The database only cares about data that is stored locally; while it may reference data on other nodes, we don’t care about that.
That is a tricky question. The initial answer is yes, I want this feature. The complete answer is that while I want this feature, I am not sure how I can implement this.
Basically, this is desirable since we can use it to reduce the amount of data we send over the network. The problem is that we run into an interesting issue of how to express that partial update. My current thinking is that we can compute a diff between the initial Json version and the updated Json version, and send that. That is problematic since there is no standard way of actually diffing Json. We can just throw it into a string and compare that, of course, but that exposes us to Json formatting differences that may cause problems.
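To make the problem concrete, here is a minimal structural diff over flat documents represented as dictionaries. This is a sketch of the idea only: real documents nest, so a real implementation would have to recurse, and using null to mark a deleted property is an arbitrary convention of this sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class JsonDiffSketch
{
    // Returns the properties that must change to turn `before` into `after`.
    public static Dictionary<string, object> Diff(
        Dictionary<string, object> before,
        Dictionary<string, object> after)
    {
        var patch = new Dictionary<string, object>();
        foreach (var (key, value) in after)
            if (!before.TryGetValue(key, out var old) || !Equals(old, value))
                patch[key] = value; // property added or changed
        foreach (var key in before.Keys.Except(after.Keys))
            patch[key] = null; // property removed (null marks deletion here)
        return patch;
    }

    static void Main()
    {
        var before = new Dictionary<string, object> { ["title"] = "Home", ["version"] = 1 };
        var after  = new Dictionary<string, object> { ["title"] = "Home", ["version"] = 2 };
        foreach (var (k, v) in Diff(before, after))
            Console.WriteLine($"{k} => {v}"); // only "version => 2" is sent
    }
}
```

Note that this diffs the parsed structure, not the serialized string, which sidesteps the formatting-differences problem at the cost of needing a diff format both sides agree on.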
I think that I am going to mark this issue as: postponed.
In a previous post, I asked about designing a document DB, and brought up the issue of storage, along with a set of questions that need to be handled:
There are several options, from building our own persistence format to using an RDBMS. I think that the most effective option would be to use Esent. It is small, highly efficient, requires no installation and is very simple to use. It also neatly resolves a lot of the additional questions that we have to ask.
Esent already has the facilities to do that, so we have very little to worry about here.
See above. Esent is also pretty good at auto recovery, which is a nice plus.
I think not. I think that the best alternative is to have a file per view. That should make things such as backing up just the DB easier, not to mention that it will reduce contention internally. Esent is built to handle that, but it is better to make it this way than not. All the data (including logs & temp dirs) should reside inside the same directory.
Crash recovery on startup should be enabled. Transactions should probably avoid crossing file boundaries. It is important that the files include a version table, which will allow us to detect invalid versions (this caused a whole bunch of problems with RDHT until we fixed it).
Yes, we are transactional. But only for document writes. We are not transactional for document + views, for example, since view generation is done as a background service.
Yes, depending on the operation. We allow submitting several document writes / deletes at the same time, and they will succeed or fail as a single unit. Beyond that, no.
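The unit-of-work semantics can be sketched with an in-memory store. This is illustration only; a real implementation would lean on the storage engine's transactions rather than staging a copy:

```csharp
using System;
using System.Collections.Generic;

public class BatchSketch
{
    private Dictionary<string, string> store = new();

    public int Count => store.Count;

    // Several writes/deletes submitted together: all apply, or none do.
    public void Batch(Action<Dictionary<string, string>> operations)
    {
        var staged = new Dictionary<string, string>(store);
        operations(staged);   // if this throws, `store` is never touched
        store = staged;       // a single reference swap commits everything
    }

    static void Main()
    {
        var db = new BatchSketch();
        db.Batch(s =>
        {
            s["pages/1"] = "{\"title\":\"Home\"}";
            s["pages/2"] = "{\"title\":\"About\"}";
        });
        Console.WriteLine(db.Count); // 2

        try
        {
            db.Batch(s => { s["pages/3"] = "{}"; throw new Exception("boom"); });
        }
        catch { }
        Console.WriteLine(db.Count); // still 2: the failed batch changed nothing
    }
}
```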
A while ago I started experimenting with building my own document DB, based on the concepts that Couch DB has. As it turns out, there isn’t really much to it, at a conceptual level. A document DB requires the following features:
The first two requirements are easily handled, and should generally take less than a day to develop. Indeed, after learning about the Esent database, it took me very little time to create this. I should mention, as an interesting limitation of the DB, that I made the decision to accept only documents in Json format. That makes some things very simple, specifically views and partial updates.
There are several topics here that are worth discussing, because they represent non-trivial issues. I am going to raise them here as questions, and answer them in future posts.
Storage:
Scale:
Concurrency:
Attachments:
Replication:
Views:
There are some very interesting challenges relating to doing the views. Again, I am interested in your opinions about this.
There are several other posts, detailing my current design, which will be posted spaced about a day apart from one another. I’ll post a summary post with all the relevant feedback as well.