The bug that ruined my weekend

This is bloody strange. I have a test failing in our unit tests, which isn’t an all too uncommon occurrence after a big batch of work. The only problem is that this test shouldn’t fail; no one has touched this part of the code.

For reference, here is the commit where this is failing. You can reproduce this by running the Raven.Tryouts console project.

Note that it has to be done in Release mode. When that happens, we consistently get the following error:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
at Raven.Client.Connection.MultiGetOperation.<TryResolveConflictOrCreateConcurrencyException>d__b.MoveNext() in c:\Work\ravendb\Raven.Client.Lightweight\Connection\MultiGetOperation.cs:line 156
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

Here is the problem with this stack trace:

[image]

This only happens in Release mode, but it happens there consistently. I’ve verified that this isn’t an issue of me running an old version of the code, so that explanation is out. There is no concurrency going on; all the data this method touches is only touched by this thread.

What is more, the exception is not thrown from inside the foreach loop on line 139. I’ve verified that by putting a try/catch around the inside of the loop, and still getting the NRE thrown outside it. In fact, I have tried to figure this out in pretty much every way I can. Attaching a debugger makes the problem disappear, as does disabling optimizations, or anything like that.

At this point I’m going to throw up my hands in disgust, because this is not something that I can figure out. Here is why: this is my “fix” for the issue. I replaced the following failing code:

[image]

With the following passing code:

[image]

And yes, this should make zero difference to the actual behavior, but it does. I suspect a JIT issue.


Solve this bug!

This took some head scratching before I figured it out.

[image]

And here is the relevant code:

[image]


Beyond RavenDB 3.0: The future road map for RavenDB

We are pretty much done with RavenDB 3.0; we are waiting for fixes to the internal apps we use to process orders and support customers, and then we can actually make a release. In the meantime, that means that we need to start looking beyond the 3.0 release. We have had a few people internally focused on post 3.0 work for the past few months, and we have a rough outline of what we have done there. Primarily, we are talking about better distribution and storage models.

Storage models – the polyglot database

Under this umbrella we put dedicated database engines that support specific needs. We are talking about distributed counters (high scale out, rapid throughput), time series, and event stores as the primary areas that we are focused on. For example, the counters work is pretty much complete, but we didn’t have time to actually turn it into a fully mature product.

I talked about this several times in the past, so I’ll not get into too many details here.

Distribution models

We have been working on a Raft implementation for the past few months, and it is now at the stage where we are starting to integrate it into the rest of our software. Raft is planned to be the core replication protocol for the time series and event databases. But you are probably going to see it first as a topology super layer for RavenDB and RavenFS.

Distributed topology management

Replication support in RavenDB and RavenFS follows the multi master model. You can write to any node, and your write will be distributed by the server to all the nodes. This has several advantages, in particular the fact that we can operate in a disconnected or partially disconnected manner, and that we need little coordination between clients to get everything working. It also has the disadvantage of allowing conflicts. In fact, if you are writing to multiple replicating nodes, and aren’t careful about how you are splitting writes, you are pretty much guaranteed to have conflicts. We have repeatedly heard that this is both a good thing and something that customers really don’t want to deal with.

It is a good thing because we don’t have data loss; it is a bad thing because if you aren’t ready to handle this, some of your data is inaccessible because of the conflict until it is resolved.

Because of that, we are considering implementing a server side topology management system. The actual replication mechanics are going to remain the same; the difference is in how we decide how to work with them.

A cluster (in this case, a set of RavenDB servers and all the databases and file systems on them) is composed of cooperating nodes. The cluster is managed via Raft, which is used to store the topology information of the cluster. The topology includes each of the nodes in the system, as well as all of the databases and file systems on the cluster. The cluster will select a leader, and that leader will also be the primary node for writes for all databases. In other words, assume we have a 3 node cluster and 5 databases in the cluster. All the databases are replicated to all three nodes, and a single node is going to serve as the write primary for all operations.

During normal operations, clients will query any server for the replication topology (and cache it) every 5 minutes or so. If a node is down, we’ll switch over to an alternative node. If the leader is down, we’ll query all other nodes to try to find out who the new leader is, then continue using that leader’s topology from then on. This gives us the advantage that a down server causes clients to switch over and stay switched. That avoids an operational hazard when you bring a downed node back up again.

Clients will include the topology version they have in all communication with the server. If the topology version doesn’t match, the server will return an error, and the client will query all the nodes it knows about to find the current topology version. It will always choose the latest topology version, and continue from there.
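To make that concrete, here is a minimal sketch of what the client side of that handshake could look like. All of the type and method names here are my own invention for illustration, not the actual RavenDB client API:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical types, sketching only the version handshake described above.
public class ClusterTopology
{
    public long Version;
    public string Leader;
    public List<string> AllNodes = new List<string>();
}

public class TopologyAwareClient
{
    private ClusterTopology topology;

    // Transport callbacks are assumed to exist elsewhere:
    // send(node, topologyVersion) -> (serverRejectedOurVersion, responseBody)
    private readonly Func<string, long, Task<(bool Mismatch, string Body)>> send;
    private readonly Func<string, Task<ClusterTopology>> fetchTopology;

    public TopologyAwareClient(
        ClusterTopology initial,
        Func<string, long, Task<(bool Mismatch, string Body)>> send,
        Func<string, Task<ClusterTopology>> fetchTopology)
    {
        topology = initial;
        this.send = send;
        this.fetchTopology = fetchTopology;
    }

    public async Task<string> SendAsync()
    {
        // Every request carries the client's cached topology version.
        var (mismatch, body) = await send(topology.Leader, topology.Version);
        if (!mismatch)
            return body;

        // The server rejected our version: ask every node we know about,
        // and adopt the latest topology anyone can give us.
        foreach (var node in topology.AllNodes)
        {
            var candidate = await fetchTopology(node);
            if (candidate != null && candidate.Version > topology.Version)
                topology = candidate;
        }

        return await SendAsync(); // retry (kept naively simple for the sketch)
    }
}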

Note that there is still a chance for conflicts: a leader may become disconnected from the network, not be aware of that, and keep accepting writes to the database, while another node takes over as the cluster leader and clients start writing to it. There is a gap where a conflict can occur, but it is a pretty small one, and we have good mechanisms to deal with conflicts anyway.

We are also thinking about exposing a system similar to the topology for clients directly. Basically, a small distributed and consistent key/value store. Mostly meant for configuration.

Thoughts?


RavenDB Wow! Features presentation

At Oredev, besides sitting in a booth and demoing why RavenDB is cool about one trillion times, I also gave a talk. I intended it to be a demo packed 60 minutes, but then I realized that I only had 40 minutes for the entire thing.

The only thing to do was to speed things up; I think I breathed twice throughout the entire presentation. And I think it went great.

RAVENDB: WOW! FEATURES - THE THINGS THAT YOU DIDN'T KNOW THAT YOUR DATABASE CAN DO FOR YOU from Øredev Conference on Vimeo.


RavenDB 3.0 RTM!

RavenDB 3.0 is out and about!

It is available on our downloads page and on NuGet. You can read all about what is new with RavenDB 3.0 here.

This is a stable release, fully supported. It is the culmination of over a year and a half of work by a very large team, and enough improvements to make you dance a jig.

You can play with the new version here, and all of our systems have been running on 3.0 for a while now, of course.

And with that, I’m exhausted, thrilled and very excited. Have fun playing with 3.0, and check by tomorrow to see some of the cool Wow features.

:D


The road to RavenDB 3.0 stable release

We are currently busy shouting at the build cluster to hurry up and finish (it is not impressed by us and keeps chugging along on our test suite), but I was quite amused by the following:

[image]

This is the merge from the 3.0 development branch to the stable branch. That is a lot of goodness coming your way…


Large scale distributed consensus approaches: Concurrent consistent decisions

So far we have tackled the idea of a large compute cluster and a large storage cluster. I mentioned that the problem with the large storage cluster is that it doesn’t handle consistency within itself. Two concurrent requests can hit two storage nodes and make concurrent operations that aren’t synchronized between themselves. That is usually a good thing, since that is what you want for a high throughput system. The less coordination you can get away with, the more you can actually do.

So far, so good, but that isn’t always suitable. Let us consider a case where we need to have a consistent approach, for some business reason. The typical example would be transactions in a bank, but I hate this example, because in the real world banks deal with inconsistency all the time; it is an explicit part of their business model. Let us talk about auctions and bids instead. We have an auction service, which allows us to run a large number of auctions.

For each auction, users can place bids, and it is important for us that bids are always processed sequentially per auction, because we have to know who placed a bid that is immediately rejected ($1 commission) or a winning bid that was later overbid (no commission except for the actual winner). We’ll leave aside the fact that this is something that we can absolutely figure out from the auction history, and say that we need this to be immediate and consistent. How do we go about doing this?

Remember, we have enough load on the system that we are running a cluster with a hundred nodes in it. The rough topology is still this:

[image]

We have the consensus cluster, which decides on the network topology. In other words, it decides which set of servers is responsible for which auction. What happens next is where it gets interesting.

Instead of just a set of cooperating nodes that share the data between them, each of which can accept both reads and writes, we are going to twist things a bit. Each set of servers is its own consensus cluster for that particular auction. In other words, we first go to the root consensus cluster to get the topology information, then we add another command to the auction’s log. That command goes through the same distributed consensus algorithm between the three nodes. The overall cluster is composed of many consensus clusters, one for each auction.

This means that we have a fully consistent set of operations across the entire cluster, even in the presence of failure. Which is quite nice. The problem here is that you have to have a good way to distinguish between the different consensuses. In this case, an auction is the key per consensus, but it isn’t always so easy to make such a distinction, and it is important that an auction cannot grow large enough to overwhelm the set of servers that it is actually using. In those cases, you can’t really do much beyond relaxing the constraints and going about it in a different manner.

For optimization purposes, you usually don’t run an independent consensus for each of the auctions. Or rather, you do, but you make sure that they share the same communication resources, so for auctions/123 the nodes are D, E, U with E being the leader, while for auctions/321 the nodes are also D, E, U but U is the leader. This gives you the ability to spread processing power across the cluster, and the communication channels (TCP connections, for example) are shared between both auctions’ consensuses.
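A sketch of how that placement might be computed; the hash and the node names are illustrative, the point is just that the leader role is spread deterministically across the shared replica set:

using System;
using System.Collections.Generic;
using System.Linq;

public static class AuctionConsensusPlacement
{
    // Stable hash, so every node computes the same answer for an auction id.
    private static int StableHash(string s) =>
        s.Aggregate(17, (h, c) => unchecked(h * 31 + c)) & 0x7fffffff;

    // Given the replica set assigned by the root consensus cluster
    // (e.g. D, E, U), pick the preferred leader for one auction's group.
    public static string PreferredLeader(string auctionId, IReadOnlyList<string> replicas) =>
        replicas[StableHash(auctionId) % replicas.Count];

    public static void Main()
    {
        var replicas = new[] { "D", "E", "U" };
        Console.WriteLine(PreferredLeader("auctions/123", replicas)); // one of D/E/U
        Console.WriteLine(PreferredLeader("auctions/321", replicas)); // possibly another
    }
}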


RavenDB 3.0 RC discount ends in two days

I forgot to mention this explicitly, but we are currently giving a 20% discount on RavenDB 3.0 licenses for the release candidate.

This discount is going to be discontinued with the release of RavenDB 3.0 in two days, so if you are counting on it, hurry up :).


RavenDB 3.0 Release date: 25 Nov, 2014

Barring anything major, we’ll be releasing RavenDB 3.0 in 5 days :).

It will be a stable release, and you’re encouraged to move to it as soon as it is available, using the Esent storage engine.

The Voron database is still in RC mode (mostly because we’re paranoid and want to have more real world experience before we go full forward with this), but it is going to be fully supported.

Upgrading instances will use Esent, and new databases will default to Esent unless you explicitly select Voron.


Large scale distributed consensus approaches: Large data sets

In my previous post, I talked about how we can design a large cluster for compute bound operations. The nice thing about this is that the actual amount of shared data that you need is pretty small, and you can just distribute that information among your nodes, then let them do stateless computation on it, and you are done.

A much more common scenario is when we can’t just do stateless operations, but need to keep track of what is actually going on. The typical example is a set of users changing data. For example, let us say that we want to keep track of the pages each user visits on our site. (Yes, that is a pretty classic Big Table scenario; I’ll ignore the prior art issue for now.) How would we design such a system?

Well, we still have the same considerations. We don’t want a single point of failure, and we want to have a very large number of machines and make the most of their resources.

In this case, we are merely going to change the way we look at the data. We still have the following topology:

[image]

There is the consensus cluster, which is responsible for cluster wide immediately consistent operations. And there are all the other nodes, which actually handle processing requests and keeping the data.

What kind of decisions do we get to make in the consensus cluster? Those would be:

  • Adding & removing nodes from the entire cluster.
  • Changing the distribution of the data in the cluster.

In other words, the state that the consensus cluster is responsible for is the entire cluster topology. When a request comes in, the cluster topology is used to decide which set of nodes to direct it to.

Typically in such systems, we want to keep the data on three separate nodes, so we get a request, then route it to one of the three nodes that match it. This is done by sharding the data according to the actual user id whose page views we are trying to track.

Distributing the sharding configuration is done as described in the compute cluster example, and the actual handling of requests, or sending the data between the sharded instances is handled by the cluster nodes directly.
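A sketch of the routing step, using rendezvous hashing to pick the three owning nodes (the post doesn’t name a particular scheme, so this is just one reasonable choice):

using System;
using System.Collections.Generic;
using System.Linq;

public static class ShardRouter
{
    private static int StableHash(string s) =>
        s.Aggregate(17, (h, c) => unchecked(h * 31 + c)) & 0x7fffffff;

    // The three nodes responsible for this user's page view data; every
    // client and server computes the same answer from the same topology.
    public static List<string> NodesFor(string userId, IEnumerable<string> allNodes) =>
        allNodes.OrderByDescending(node => StableHash(userId + "@" + node))
                .Take(3)
                .ToList();

    public static void Main()
    {
        var nodes = Enumerable.Range(1, 100).Select(i => $"node-{i:D3}").ToList();
        Console.WriteLine(string.Join(", ", NodesFor("users/1234", nodes)));
    }
}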

Note that in this scenario, you cannot ensure any kind of safety. Two requests for the same user might hit different nodes and do separate operations without being able to consider the concurrent operation. Usually that is a good thing, but not always. That, however, is an issue for the next post.


Large scale distributed consensus approaches: Computing with a hundred node cluster

I’m using a 100/99 node cluster as the example, but the discussion also applies to smaller clusters (dozens of nodes) and bigger clusters (hundreds or thousands). Pretty much the only reason you would go with a cluster of that size is that you want to scale out your processing in some manner. I’ve already discussed why a hundred node cluster isn’t a good option for safety reasons.

Consensus algorithms create a single consensus across the entire cluster, usually about an ordered set of operations that are fed to a state machine. The easiest such example would be a dictionary. But it makes no sense to have a single dictionary spread across a hundred nodes. Why would you need to do that? How would it let you make full use of the power of all those nodes?

Usually nodes are used for either computing or storage purposes. Computing is much easier, so let us take that as a good example. A route calculating system needs to do a lot of computations on a relatively small amount of information (the map data). Whenever there is a change in the map (route blocked, new road opened, etc), it needs to send the information to all the servers, and make sure that it isn’t lost.

Since calculating routes is expensive (we’ll ignore the options for optimizations and caching for now), we want to scale it to many nodes. And since the source data is relatively small, each node can have a full copy of the data. Under this scenario, the actual problem we have to solve is how to ensure that once we save something to the cluster, it is propagated to the entire cluster.

The obvious way to do this is with a hierarchy:

[image]

Basically, the big icons are the top cluster, each of which is responsible for updating a set of secondary servers, which are in turn responsible for updating the tertiary servers.

To be perfectly honest, this looks nice, and even reasonable, but it is going to cause a lot of issues. Sure, the top cluster is resilient to failures, but relying on a node to be up to notify other nodes isn’t so smart. If one of the nodes in the top cluster goes down, then we have about 20% of our cluster that didn’t get the notice, which kind of sucks.

A better approach would be to go with a management system and a gossip background:

[image]

In other words, the actual decisions are made by the big guys (literally, in this picture). This is a standard consensus cluster (Paxos, Raft, etc). Once a decision has been made by the cluster, we need to send it to the rest of the nodes in the system. We can do that either by just sending the messages to all the nodes, or by selecting a few nodes and having them send the messages to their peers. The protocol for that is something like: “What is the last command id you have? Here is what I have after that.” Assuming that each processing node is connected to a few other servers, we can send the information very quickly to the entire cluster. And even if there are errors, the gossiping servers will correct them (note that there is an absolute order of the commands, ensured by the consensus cluster, so there isn’t an issue of agreeing on it, just of distributing the data).
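Here is a minimal sketch of that gossip exchange; this is my own illustration of the protocol the paragraph describes, assuming command ids are assigned in a contiguous, globally agreed order by the consensus cluster:

using System;
using System.Collections.Generic;
using System.Linq;

public class GossipNode
{
    // Commands indexed by their globally agreed id; the order is fixed by the
    // consensus cluster, so gossip only has to fill in the missing tail.
    private readonly SortedDictionary<long, string> commands =
        new SortedDictionary<long, string>();

    public long LastCommandId => commands.Count == 0 ? 0 : commands.Keys.Last();

    public void Receive(long id, string command) => commands[id] = command;

    // "What is the last command id you have? Here is what I have after that."
    public void GossipWith(GossipNode peer)
    {
        long peerLast = peer.LastCommandId;
        foreach (var pair in commands.Where(c => c.Key > peerLast))
            peer.Receive(pair.Key, pair.Value);
    }
}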

Usually the gossip topology follows the actual physical distribution. So the consensus cluster will notify a couple of servers on each rack, and let the servers in the rack gossip among themselves about the new value.

This means that once we send a command to the cluster, the consensus cluster agrees on it, then we distribute it to the rest of the nodes. There is a gap between the consensus confirming it and the actual distribution to all the nodes, but that is expected in any distributed system. If it is important to apply a command at the same time across the entire cluster, the command is usually time activated (which requires clock sync, but that is something that we can blame on the ops team, so we don’t care :)).

With this system, we can have an eventually consistent set of data across the entire cluster, and we are happy.

Of course, this is something that is only relevant for compute clusters, the kind of thing where you compute a result, return it to the client, and that is about it. There are other types of clusters, but I’ll talk about them in my next post.


Live playground for RavenDB 3.0

We are getting to the part where we are running out of things to do, so we set up a live instance of RavenDB 3.0 and opened it up for the world to play with.

It is available here: http://live-test.ravendb.net

Disclaimer: it may go down at any moment, and data will routinely be wiped; whatever you put there is public and can be copied and used by other users. This is strictly for playing around with it, nothing more.

Give it a shot, see all the new cool stuff.


Large scale distributed consensus approaches: Calculating a way out

The question crossed my desk, and it was interesting enough that I felt it deserved a post. The underlying scenario is this: we have distributed consensus protocols that are built to make sure that we can properly arrive at a decision and have the entire cluster follow it, regardless of failure. Those are things like Paxos or Raft. The problem is that those protocols are all aimed at a relatively small number of nodes, typically 3 – 5. What happens if we need to manage a large number of machines?

Let us assume that we have a cluster of 99 machines. What would happen under this scenario? Well, all consensus algorithms work on top of the notion of a quorum: at least (N/2+1) machines must have the same data. For a 3 node cluster, that means that any decision that is on 2 machines is committed, and for a 5 node cluster, it means that any decision that is on 3 machines is committed. What about 99 nodes? Well, a decision would have to be on 50 machines to be committed.

That means making 196 requests (98 x 2; once for the command, then for the confirmation) for each command. That… is a lot of requests. And I’m not sure that I want to see what it would look like in terms of perf. So just scaling things out in this manner is out.

In fact, this is also a pretty strange thing to do. The notion of distributed consensus is that you will reach agreement on a state machine. The easiest way to think about it is that you reach agreement on a set of values among all nodes. But why would you share those values among so many nodes? It isn’t for safety, that is for sure.

Assume that we have a cluster of 5 nodes, with each node having 99% availability (which translates to about 3.5 days of downtime per year). The availability of all the nodes together is then 95%, or about 18 days of downtime a year.

But we don’t need them to all be up. We just need any three of them to be up. That means that the math is going to be much nicer for us (see here for an actual discussion of the math).

In other words, here are the availability numbers if each node has a 99% availability:

Number of nodes | Quorum | Availability
3               | 2      | 99.97%   (~2.5 hours of downtime per year)
5               | 3      | 99.999%  (5 nines, ~5 minutes per year)
7               | 4      | 99.9999% (6 nines, ~12 seconds per year)
99              | 50     | ~100%

Note that all of this is based around each node having about 3.5 days of downtime per year. If we can have availability of 99.9% (or about 9 hours a year), the availability story is:

Number of nodes | Quorum | Availability
3               | 2      | 99.9997%   (~2 minutes of downtime per year)
5               | 3      | 99.999999% (8 nines, ~30 seconds per year)
7               | 4      | ~100%
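If you want to check these numbers yourself, quorum availability is just a binomial sum: the probability that at least `quorum` out of `n` nodes are up at the same time. A small sketch (my own, not from the post) that recomputes the tables above:

using System;

public static class QuorumAvailability
{
    // Probability that at least `quorum` of `n` nodes are up, given the
    // per-node availability (e.g. 0.99 for 3.5 days of downtime a year).
    public static double Compute(int n, int quorum, double perNode)
    {
        double total = 0;
        for (int up = quorum; up <= n; up++)
            total += Choose(n, up) * Math.Pow(perNode, up)
                                   * Math.Pow(1 - perNode, n - up);
        return total;
    }

    private static double Choose(int n, int k)
    {
        double result = 1;
        for (int i = 1; i <= k; i++)
            result = result * (n - k + i) / i;
        return result;
    }

    public static void Main()
    {
        foreach (var (n, q) in new[] { (3, 2), (5, 3), (7, 4), (99, 50) })
            Console.WriteLine($"{n} nodes, quorum {q}: {Compute(n, q, 0.99):P5}");
    }
}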

So in rough terms, we can say that going to a 99 node cluster isn’t a good idea. It is quite costly in terms of the number of operations required to ensure a commit, and from a safety perspective, you can get the same safety level at a drastically lower cost.

But there is now another question: what would we actually want to do with a 99 node cluster*? I’ll talk about this in my next post.

* A hundred node cluster only makes sense if you have machines with about 80% availability. In other words, they are down for 2.5 months every year. I don’t think that this is a scenario worth discussing.


Is the library open or not?

An interesting challenge came across my desk. Let us assume that we have a set of libraries, which looks like this:

{
    "Name": "The Geek Hangout",
    "OpeningHours": {
        "Sunday": [
            {   "From": "08:00", "To": "13:00"  },
            {   "From": "16:00", "To": "19:00"  }
        ],
        "Monday": [
            {   "From": "09:00", "To": "18:00"  },
            {   "From": "22:00", "To": "23:59"  }
        ],
        "Tuesday": [
            {   "From": "00:00", "To": "04:00"  },
            {   "From": "11:00", "To": "18:00"  }
        ]
    }
}
{
    "Name": "Beer & Books",
    "OpeningHours": {
        "Sunday": [
            {   "From": "16:00", "To": "23:59"  }
        ],
        "Monday": [
            {   "From": "00:00", "To": "02:00"  },
            {   "From": "10:00", "To": "22:00"  }
        ],
        "Tuesday": [
            {   "From": "10:00", "To": "22:00"  }
        ]
    }
}

I only included three days, to make it shorter, but you get the point. You can also see that there are opening hours that run past midnight into the next day.

Now, the question we need to answer is: “find me an open library now”.

How can we answer such a question? If we were using SQL, it would be something like this:

select * from Libraries l
where l.Id in (
         select oh.LibraryId from OpeningHours oh
         where oh.Day = dayofweek(now())
           and oh.[From] <= time(now()) and oh.[To] > time(now())
)

I’ll leave the performance of such a query to your imagination, but the key point is that we cannot actually express such a computation in RavenDB. We can do range queries, but in this case, it is the current time that we compare to the range of values. So how do we answer such a query?

As usual, by not trying to answer the question the same way at all. Here is my index:

[image]
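The screenshot of the index didn’t survive, but based on the description that follows (an entry per day, with the open half hours output into the index), it likely looked something along these lines; treat this as a reconstruction, not the original code:

// Hypothetical reconstruction: one index entry per library per day, with
// every half hour slot (0..47) during which the library is open, so that
// 16:30 on Sunday becomes (Day = "Sunday", OpenHalfHours contains 33).
from library in docs.Libraries
from day in library.OpeningHours
select new
{
    library.Name,
    Day = day.Key,
    OpenHalfHours = day.Value.SelectMany(range => Enumerable.Range(
        (int)(TimeSpan.Parse(range.From).TotalMinutes / 30),
        (int)((TimeSpan.Parse(range.To).TotalMinutes -
               TimeSpan.Parse(range.From).TotalMinutes) / 30)))
}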

The result of this is an index entry per day, and in each index entry we output the half hours during which this library is open. So if we want to check for libraries that are open on Sunday at 4:30 PM, all we have to do is issue the following query:

[image]
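The query screenshot is also missing; with an index like the sketch above, it would be roughly as follows (4:30 PM falls into half hour slot 33; the API shape here is approximate):

// Hypothetical query: libraries open on Sunday at 16:30.
var openLibraries = session.Advanced
    .DocumentQuery<Library>("Libraries/OpeningHours")
    .WhereEquals("Day", "Sunday")
    .AndAlso()
    .WhereEquals("OpenHalfHours", 33)
    .ToList();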

The power of dynamic fields and index time computation means that this is an easy query to make, and even more importantly, this is something that we can answer very efficiently.


RavenDB 3.0 & Subscription Licenses

We were asked this a few times, so I think it is worth clarifying.

If you have a subscription license to RavenDB, you have automatic access to all versions of RavenDB for as long as your subscription is current. That means that if you purchase a RavenDB 2.x subscription, your license allows you to use RavenDB 3.0 without any issues.

Note that this doesn’t include using RavenFS, which will require an updated license.

If you purchased RavenDB using the one time code, you’ll need to purchase a new license for RavenDB 3.0.


Fixing a production issue

So we had a problem in our production environment. It showed up like this.

[image]

The first thing that I did was log into our production server, and look at the logs for errors. This is part of the new operational features that we have, and it was a great time to try it under real world conditions:

[image]

This gave me a subscription to the log, which gave me the exact error in question:

[image]

From there, I went to look at the exact line of code causing the problem:

[image]

Can you see the issue?

We grouped the items using a case sensitive comparison, then used the result to create a dictionary that was case insensitive, so two keys that differed only in casing would collide. The actual fix was adding the ignore case comparer to the group by clause, and then pushing to production.
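A minimal repro of that class of bug (my own sketch, not the actual production code):

using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var names = new[] { "Users/1", "users/1" };

        try
        {
            // Case sensitive grouping produces two groups that differ only
            // in casing, which then collide inside the case insensitive
            // dictionary: "An item with the same key has already been added."
            var broken = names
                .GroupBy(n => n)
                .ToDictionary(g => g.Key, g => g.Count(),
                              StringComparer.OrdinalIgnoreCase);
        }
        catch (ArgumentException e)
        {
            Console.WriteLine(e.Message);
        }

        // The fix: make the group by use the ignore case comparer as well.
        var fixedDict = names
            .GroupBy(n => n, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(),
                          StringComparer.OrdinalIgnoreCase);
        Console.WriteLine(fixedDict["USERS/1"]); // 2
    }
}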


The RavenDB new website (and the beta discount)!

I’m currently with some of our team at the Oredev conference. So if you are here, seek us out.

In other good news, the new website for RavenDB is now up, and that means that we are no longer selling RavenDB 2.x. We are now selling RavenDB 3.0 only*!

With this, the last hurdle for releasing RavenDB 3.0 is pretty much out of the way. We’ll probably wait until we are back from Oredev and have recovered a bit, but we are on track for a stable release of RavenDB 3.0 next week or the one just after.

In the meantime, go ahead and look at the new website.

* A RavenDB 3.0 license can work for 2.5, though.


Finding the “best” book scenario

This started out as a customer engagement, but it was interesting to see how we solved it.

The problem is searching for books. Let us take the following books as a good example:

[image]

We have users that want recommendations for books on specific topics, and authors can pay us to promote their books. You can see how it looks above.

Now, the rules we want to follow for sorting the results are fairly simple. Find all the matching books, and sort them so that:

  • If the user searched for a book’s primary tag, and the author paid to promote that tag, it shows first.
  • If the user searched for a book’s secondary tag, and the author paid to promote that tag, it shows second.
  • If the user searched for a book’s primary tag, and the author didn’t pay to promote that tag, it shows third.
  • If the user searched for a book’s secondary tag, and the author didn’t pay to promote that tag, it shows fourth.

Actually trying to specify the sort order according to this tends to be quite hard to do, as it turns out, but we can take advantage of boosting to get what we want.

We define the following index:

from book in docs.Books
select new
{
  // tags the author paid to promote, split into primary / secondary
  PaidPrimaryTag = book.Tags.Where(x => x.Primary && x.Paid).Select(x => x.Name),
  PaidSecondaryTag = book.Tags.Where(x => x.Primary == false && x.Paid).Select(x => x.Name),
  // all tags, paid or not
  PrimaryTag = book.Tags.Where(x => x.Primary).Select(x => x.Name),
  SecondaryTag = book.Tags.Where(x => x.Primary == false).Select(x => x.Name),
}

And now we want to do a few searches: first for NoSQL, and then for RavenDB.

The actual query we issue is:

[image]
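The screenshot of the query is missing; given the index fields and the desired ordering, it was presumably a boosted OR query along these lines (the boost values and the exact API shape here are illustrative, not taken from the post):

// Hypothetical query: each field gets a boost matching its rank in the
// rules above, so a paid primary match outranks a paid secondary match,
// which outranks an unpaid primary match, and so on.
var results = session.Advanced
    .DocumentQuery<Book>("Books/ByTags")
    .WhereEquals("PaidPrimaryTag", "NoSQL").Boost(8)
    .OrElse().WhereEquals("PaidSecondaryTag", "NoSQL").Boost(4)
    .OrElse().WhereEquals("PrimaryTag", "NoSQL").Boost(2)
    .OrElse().WhereEquals("SecondaryTag", "NoSQL")
    .ToList();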

And as you can see, books/3 is shown first, because the author paid for higher ranking. What about when we do that with RavenDB?

[image]

We have books/3 first, as before, but books/2 is ranked higher than books/1. Why is that? Because books/2 paid for a higher ranking on a secondary tag, and according to our query that is more important than even an unpaid primary tag match.

This is quite elegant, and it also allows us to take into account relevancy in the search as well.


Bug tracking, when your grandparent isn’t in your family tree

We got a failing test because of some changes we made in RavenDB, and the underlying reason ended up being this code:

[image]

The problem was that the type that I was expecting did inherit from the right stuff. See this:

[image]

So something here is very wrong. I tracked this until I got to:

return RuntimeTypeHandle.CanCastTo(fromType, this);

And there I stopped. I worked around this issue by using IsSubclassOf instead of IsAssignableFrom.

The problem with IsAssignableFrom is that it is a confusing method. The parent is supposed to be the target, and the type you check is the parameter, but it is very easy to forget that and get confused. This worked in 99% of cases, because the single assembly we usually use also contained RavenBaseApiController (which obviously can be assigned to itself), so it looked like it worked. IsSubclassOf is much nicer, but you need to understand that it won’t work for interfaces, nor will it match the type itself. In this case, that was exactly what I needed, so it worked.
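To make the argument order confusion concrete (my own example, not the RavenDB code):

using System;

class Animal { }
class Dog : Animal { }

class Program
{
    static void Main()
    {
        // IsAssignableFrom: the *target* of the assignment is the receiver.
        Console.WriteLine(typeof(Animal).IsAssignableFrom(typeof(Dog))); // True
        // Swap the two, and you silently get the wrong answer:
        Console.WriteLine(typeof(Dog).IsAssignableFrom(typeof(Animal))); // False
        // It only "works" when both sides are the same type:
        Console.WriteLine(typeof(Dog).IsAssignableFrom(typeof(Dog)));    // True

        // IsSubclassOf reads in the direction you expect, but it won't
        // match the type itself and doesn't work for interfaces.
        Console.WriteLine(typeof(Dog).IsSubclassOf(typeof(Animal)));     // True
        Console.WriteLine(typeof(Dog).IsSubclassOf(typeof(Dog)));        // False
    }
}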


Modeling exercise: The grocery store’s checkout process approach

I posted about the grocery store checkout process exercise before. Now I want to see if I can do a short outline on how I would handle this.

The key aspect from my perspective is that we need to separate the notion of the data we have and the processing of the data. That means that we are going to have the following model:

public class ShoppingCart
{
    public List<ProductInShoppingCart> Products { get; set; }
    public List<Discount> Discounts { get; set; }
}

public class ProductInShoppingCart
{
    public string ProductId { get; set; }
    public Discount Discount { get; set; }
}

Note that we explicitly do not have a quantity field here. If we purchase 6 bottles of milk, milk appears six times in the cart. Why is that?

Let us assume that we have a sale of 2 bottles of milk for a 20% discount, or a 3+1 bottles of milk offer. Consider the kind of code you would have to write in the offer code:

  • Find all products that have this offer and have 4 items without a discount.
  • Add the discount to those products.
  • After searching for products without a discount, search for products that already have a discount, but where applying this offer instead would give a better deal.

In this case, we start by doing:

  • Add bottle of milk
  • Add bottle of milk – 2 for 20% discount is triggered.
  • Add bottle of milk
  • Add bottle of milk – 3+1 offer is triggered, removing the previous discount.

Because this is likely going to be complex, I’m going to write it once: a set of offers and the kinds of rules that we want. Then we will give the users the ability to define those rules.

Note that we keep the raw data (products) and the transformations (discounts) separate, so we can always reapply everything without losing any data.
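To sketch how that data/processing split could look in code (the offer interface and the rule below are my own illustration, not from the original post):

using System;
using System.Collections.Generic;
using System.Linq;

public class Discount
{
    public string Description { get; set; }
    public decimal Percentage { get; set; }
}

public class ProductInShoppingCart
{
    public string ProductId { get; set; }
    public Discount Discount { get; set; }
}

// An offer looks at the whole cart and attaches discounts to line items;
// the raw product lines themselves are never modified.
public interface IOffer
{
    void Apply(List<ProductInShoppingCart> products);
}

public class TwoForPercentOff : IOffer
{
    private readonly string productId;
    private readonly decimal percentage;

    public TwoForPercentOff(string productId, decimal percentage)
    {
        this.productId = productId;
        this.percentage = percentage;
    }

    public void Apply(List<ProductInShoppingCart> products)
    {
        var matches = products.Where(p => p.ProductId == productId).ToList();
        // Every complete pair gets the discount; a leftover single does not.
        for (int i = 0; i + 1 < matches.Count; i += 2)
        {
            var discount = new Discount
            {
                Description = "2 for " + percentage + "% off",
                Percentage = percentage
            };
            matches[i].Discount = discount;
            matches[i + 1].Discount = discount;
        }
    }
}

Because the raw products are untouched, reapplying the whole set of offers after every scan (clearing discounts first) naturally produces the add-a-bottle sequence above, including a 3+1 offer displacing the 2-for-20% one.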

Career planning: Mine

I got some really good questions about my career, which caused me to reflect back and snort at my own attempts to make sense of what happened.

Here is a rough timeline:

  • 1999 – Start of actually considering myself to be a professional developer. This is actually the start of a great one year course I took to learn C++, right out of high school.
  • 2001 – Joined the army, was sent to the Military Police, and spent 4 years in prison. Roles ranged from a prison guard, XO of big prison, teacher in officer training course and concluded with about a year as a small prison commander.
  • 2004 – Opened my blog and started writing about the kind of stuff that I was doing; first version of Rhino Mocks.
  • 2005 – Joined the Castle committers team, left the army, joined We!, worked as a consultant.
  • 2006 – My first international conference – DevTeach.
  • 2008 – Left We!, started working as an independent consultant.
  • 2009 – NHibernate Profiler beta release.
  • 2010 – DSLs in Boo book is published, Entity Framework Profiler, Linq to SQL Profiler, RavenDB.
  • 2011 – Hiring of first full employee.
  • 2014 – Writing this blog post.

A lot of my history doesn’t make sense without a deeper understanding. In the army, there was a long time in which I wasn’t actually able to do anything with computers. That meant that on vacations, I would do stuff voraciously. At that time, I had already read a lot of university level books (the dinosaurs book, Tanenbaum’s book, TCP/IP, the DDD book, and a lot of other stuff like that). At some point, I got an actual office and had some free time, so I could play with things. I wrote code, a lot. Nothing that was actually very interesting. It was anything from mouse tracking software (that I never actually used) to writing a custom application to track medical data for inmates. Mostly I played around and learned how to do stuff.

When I got to be a prison commander, I also got myself a laptop and started doing more serious stuff. I wasn’t able to afford any professional software, so it was mostly Open Source work. I started working on NHibernate Query Analyzer, just to see what I could do with it. That taught me a lot about reflection and NHibernate itself. I then got frustrated with the state of mocking tools in the .NET framework, and I decided to write my own. Around that time, I also started to blog.

What eventually became Rhino Mocks was a pretty big thing. Still one of the best pieces of software that I have written, it required that I have a deep understanding of a lot of things, from IL generation to how classes are composed by the runtime to AppDomains to pretty much everything.

Looking back, Rhino Mocks was pretty huge in terms of pushing my career. It was very widely used, and it got me a lot of attention. After that, I was using NHibernate and talking about it a lot, so I got a lot of reputation points in that arena as well. But the first thing that I did after starting as an independent consultant was to actually work on SvnBridge, a component that would allow an SVN client to talk to a Team Foundation Server. That was something that I had very little experience with, but I think that I did a pretty good job there.

Following that, I mostly did consulting and training on NHibernate. I was pretty busy. So busy that at some point I had a six week business trip that took me to five countries and three continents. I came back home and didn’t leave my bed for about a week. For two weeks following that, I would feel physically ill if I sat in front of the computer for more than a few minutes.

That was a pretty big wakeup call for me. I knew that I had to do something about it. That is when I actually sat down and thought about what I wanted to do. I knew that I wanted to stay in development, and that I couldn’t continue being a road warrior without burning out again. I decided that my route would be to continue to do consulting, but on a much reduced frequency, and to start focusing on creating products. Stuff that I could work on while at home, and hopefully get paid for. That is how the NHibernate Profiler was born.

From there, it was a matter of working more on that and continuing to other areas, such as Entity Framework, Linq to SQL, etc. RavenDB came about because I got tired of fixing the same old issues over and over again, even with the profilers to help me. And that one actually had a business plan: we were going to invest so much money and time to get it out, and it far exceeded our expectations.

Looking back, there were several points that were of great influence: writing my blog, writing Rhino Mocks, joining open source projects such as Boo or Castle, working and blogging with NHibernate, going to conferences and speaking there. All of those gave me a lot of experience and got me out there, building reputation and getting to know people.

That was very helpful down the road, when I was looking for consultancy jobs or doing training, or when the time came to actually market our software.

In terms of the future, Hibernating Rhinos is growing at a modest rate, which is the goal. I don’t want to double every six months; that is very disturbing to one’s peace of mind. I would much rather have a slow & steady approach. We are working on cool software, we are going home early, and for the most part, there isn’t a “sky is falling” culture. The idea is that this is going to be a place that you can spend a decade or four in. I’ll get back to you on that when I retire.

Career planning: Disaster recovery

One of the more important things that you have to remember is that you should always be ready for failure. As developers, we are used to thinking about stuff like that in our code, but this is true for real life as well.

I’m going to leave aside things like personal disasters for this post (things like car accidents, getting seriously sick, etc), because there are some ways to mitigate those (insurance, family, etc) and there really isn’t anything special about development to say about them. Instead, I want to talk about professional disasters.

Those can be things like:

  • Company closing (nicely or otherwise).
  • Getting fired.
  • Product going under.
  • Product doing badly.
  • Reputation smear.
  • High profile failure.

Let me try to take them in turn. The easiest one to handle is probably a company closing down: there is very little blame attached here, so there shouldn’t be an issue finding a new job. This is also the time to consider if you want to change tracks to being an independent or an entrepreneur. Getting fired is a bit harder, but assuming that you weren’t fired for cause (such as negligence or criminal behavior), the old “everyone is downsizing” line is going to work.

Even in a so-so economy, there are still a lot of jobs out there for software developers, but getting a good one might require you to polish your skills and get a good idea of what is marketable today. Note that there is a big difference between what is popular and what is marketable (as in, will land you a job). Node.JS seems to be the buzzword of the day, but knowing Java very well is probably a much better path for quick employment.

This comes back to what kind of approach you want to take. For now, I’m going to assume that the fallback position for a good developer is to get hired in some fashion; it can be a short term contract, or just being gainfully employed writing software. This is important when we consider the other things that can happen, the kinds of disasters that strike when you are more than just an employee. If you are an entrepreneur and your product is just losing too much money, for example, what is your next path?

The easy case is when you know that you can’t go on. Maybe a competitor is pricing you out of the market, or the bank is closing the credit line, or you can’t get more clients, or any of a hundred reasons. You are done, and you are well aware of that. A much harder issue is when you are just doing badly. So you do make sales, but not enough to cover expenses, or just enough to get by. Not enough to bankrupt you immediately, but you can see it happening, unless something changes… So you have the option of pulling the cord, or trying to get it to work, with the chance of sliding into actual failure.

For a startup, you usually don’t have to deal with those details; you might just show up one day and find the company closing down. In those cases, there is usually not much that you can do (unless you are the founder, in which case there is a wealth of information on that issue out there).

The last issue that you need to take into account is how to deal with reputation damage or a high profile failure. That depends on what the actual issue is. If it is a high profile arrest for doing coke, it might be hard to get / retain clients. If it is a big failure that cost a customer a lot of money, you might be dealing with legal consequences as well as the actual damage with other customers.

We can simplify how we look at this if we treat it all as the same thing, just a basic setback to zero (or negative). The question is how to recover and move on, and at that point, what sort of future you want. Setbacks like these are a great reason to do some thinking about where you want to go and what you want to actually do.

The conservative choice would be to find a job as a full time developer of some kind, since that at least gives you a steady paycheck for the duration. More complex is the decision to do contracting, either short term (at worst, you can be a WordPress consultant and install it for people) or longer term projects (which require you to actually sell yourself). Hopefully you won’t be doing someone else’s homework, at least not for long.

Note that actually being able to recover from a disaster properly requires prior planning. Do you have resources to survive a stretch with no money? Can you handle (mentally) being out of work? Are you running on the razor’s edge, where a single disruption in money flow causes an utter collapse? If that is the case, your disaster planning is going to focus on just building reserves to handle any hiccups, rather than managing an actual disaster.

Oh, and of course, you need to consider the cost of disaster planning. It is all very well to build a bunker to survive atomic war and the zombie rampage, but it isn’t that good if it also bankrupts you on its own.

The general recommendation is to stay current, so it would be easier to hire you, and have some idea about what to do if you wake up one day, and for whatever reason, showing up for work is not going to happen.

Career planning: What is your path?

I got a lot of really great answers to my “Where do old developers go?” post; I’m feeling much better about this now :).

Now let’s turn this question around: instead of asking what is going on in the industry, let’s check what is going on with you. In particular, do you have a career plan at all?

An easy way to check that is asking: “What are you going to do in 3 years, in 7 years and in 20 years from now?”

Of course, best laid plans of mice and men often go awry, plans for the future are written in sand on a stormy beach, and other stuff like that. Any future planning has to include the caveat that these are just plans, with reality and life getting in the way.

For lawyers*, the career path might be: Trainee, associate, senior associate, junior partner, partner, named partner. (* This is based solely on seeing some legal TV shows, not actual knowledge.) Most lawyers don’t actually become named partners, obviously, but that is what you are planning for.

As discussed in the previous post, a lot of developers move to management positions at some point in their careers, mostly because salaries and benefits tend to flatline after about ten years or so for most people on the development track. Others decide that going independent and becoming consultants or contractors is a better way to increase their income. Another path is to rise in the technical track in a company that recognizes technical excellence; those are usually pure tech companies, and it is rare to have such positions in non technical companies. Yet another track that seems to be available is the architect route; this one is available in non tech companies, especially big ones. You have the startup route, and the Get Rich Burning Your Twenties mode, but that is high risk / high reward, and people who think about career planning tend to avoid such things unless carefully considered.

It is advisable to actually consider those options, try to decide what options you’ll want to have available in the next 5 – 15 years, and take steps accordingly. For example, if you want to go down the management track, you’ll want to work on things like people skills, being able to fluently converse with the business in their own terms, and learning to play golf. You’ll want to try to hold leadership positions from a relatively early stage, so team lead is a stepping stone you’ll want to get to, for example. There is a lot of material on this path, so I’m not going to cover it in detail.

If you want to go with the Technical Expert mode, that means that you probably need to grow a beard (there is nothing like stroking a beard in quiet contemplation to impress people). More seriously, you’ll want to get a deep level of knowledge in several fields, preferably ones that you can tie together into a cohesive package. For example, a networks expert would be able to understand how TCP/IP works and be able to actually make use of that when optimizing an HTML5 app. Crucial at this point is also the ability to actually transfer that knowledge to other people. If you are working within a company, that increases the overall value you have, but a lot of the time, technical experts would be consultants. Focusing on a relatively narrow field gives you a lot more value, but narrows your utility. Remember that updating your knowledge is very important. But the good news is that if you have a good grasp of the basics, you can get to grips with new technology very easily.

The old timer mode fits people who work in big companies and who believe that they can carve out a niche in that company based on their knowledge of the company’s business and how things actually work. This isn’t necessarily one year of experience repeated 20 times, although in many cases, that seems to be what happens. Instead, it is a steady job with reasonable hours, and you know the business and the environment in which you are working well enough that you can just get things done, without a lot of fussing around. Change is hard, however, because those places tend to be very conservative. Then again, you can do new systems in whatever technology you want, at a certain point (you tend to become the owner of certain systems, having been around longer than the people who are actually using them). That does carry a risk, however. You can be fired for whatever reason (merger, downsizing, etc) and you’ll have a hard time finding an equivalent position.

The entrepreneur mode is for people who want to build something. That can be a tool or a platform, and they create a business selling that. A lot of the time, it involves a lot of technical work, but there is a huge amount of stuff that needs to be done that is non technical: marketing and sales, insurance and taxes, hiring people, etc. The good thing about this is that you usually don’t have to make a very big investment in your product before you can start selling it. We are talking about roughly 3 – 6 months for most things, for 1 – 3 people. That isn’t a big leap, and in many cases, you can handle it by eating some savings, or moonlighting. Note that this can completely swallow your life, but you are your own boss, and there is a great deal of satisfaction in building a product around your vision. Be aware that you need to have contingency plans for both failure and success. If your product becomes successful, you need to make sure that you can handle the load (hire more people, train them, etc).

The startup mode is very different from the entrepreneur mode. In a startup, you are focused on getting investments, and the scope is usually much bigger. There is less risk financially (you usually have investors for that), but there is a much higher risk of failure, and there is usually a culture that considers throwing yourself on a hand grenade advisable. The idea is that you are going to burn yourself at both ends for two to four years, and in return, you’ll have enough money to maybe stop working altogether. I consider this foolish, given the success rates, but there are a lot of people who consider it the only way worth doing. The benefits usually include a nice environment, both physically and professionally, but it comes with the expectation that you’ll stay there for so many hours that it becomes your second home.

There are other modes and career paths, but now I have to return to my job :).

Modeling exercise: The grocery store’s checkout model

I went to the supermarket yesterday, and I forgot to get out of work mode, so here is this post.

[image]

The grocery store checkout exercise deals with the following scenario: you have a customer that is scanning products in a self checkout lane, and you need to process the order.

In terms of external environment, you have:

  • ProductScanned ( ProductId: string ) event
  • Complete Order command
  • Products ( Product Id –> Name, Price ) dataset
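If it helps to see the environment above as code, it might be sketched like this (the type names are my own, not from the original exercise):

using System;

// Hypothetical sketch of the external environment for the exercise.
public record ProductScanned(string ProductId);   // event
public record CompleteOrder();                    // command

public record Product(string ProductId, string Name, decimal Price);

public interface IProductsDataset
{
    Product Get(string productId);                // Product Id -> Name, Price
}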

So far, this is easy, however, you also need to take into account:

  • Sales (1+1, 2+1, 5% off for store brands, 10% off for store brands for loyalty card holders).
  • Purchase of items by weight (apples, bananas, etc).
  • Per customer discount for 5 items.
  • Rules such as alcohol can only be purchased after store clerk authorization.
  • Purchase limits (can only purchase up to 6 items of the same type, except for specific common products)

The nice thing about such an exercise is that it forces you to see how many things you have to juggle for such a seemingly simple scenario.

A result of this would be to see how you handle relatively complex rules. Given the number of rules we already have, it should be obvious that there are going to be more, and that they are going to change on a fairly frequent basis. A better way to run the exercise would be to actually do this over time: you start with just the first part, then the other requirements are streamed in, and what you actually observe is how the code changes over time. Each new requirement causes you to make modifications and accommodate the new behavior.

The end result might be a Git repository that allows you to see the full approach that was used and how it changed over time. Ideally, you should see a lot of churn in the beginning, but then you’ll have a lot less work to do as your architecture settles down.