Ayende @ Rahien


User experience on the main path – get it or get lost


The background for this post:

Recently I got an email from a startup founder about a service that they are offering. It so happened that this service matched something that I was actually considering doing, so I was very happy to try it out. I registered, and on two separate occasions I attempted to use the service for its intended purpose. I wasn’t able to do that. I got errors, and I couldn’t tell whether they were validation failures or plain bugs.

When I got in touch with the founder in question, he pointed me to the guide, which (presumably) had step by step instructions on how to get things working. I commented that this shouldn’t be so hard, and got a reply that this is just the way it is, anywhere, not just with this service. It was at that point that I gave up on the service entirely.

A few things to note: this is a web based service (and I’m intentionally not revealing which one) that is supposed to allow collaboration in one-to-many scenarios. I was trying to use it as a publisher.

This experience jarred me, for a very simple reason: it shouldn’t be this hard. And sending a user who just wants to get their feet wet to the guide is a really bad mistake. Picture the customer acquisition funnel, and anyone who is familiar with customer acquisition will recognize the problem.

Basically, the process of acquiring a new user begins when they are just looking at your software, and you want to lead them in without anything jarring. Any hurdle in their path is going to cause some of them to leave, probably never to return. So you want to make damn sure that there is as little friction in the process as possible.

In this case, I was trying to use a service for its intended purpose, and I was bogged down with a lot of details that had to be completed before I could even test the service out.

In days of old, games used to come with a guide that told you how to actually play the game. I remember going over the Red Alert and Diablo guides with great excitement while the games were installing.

Then game makers noticed that no one was reading those guides, and new gamers ran into problems playing the games even though the answers were clearly documented in the guide.

The solution was to stop relying on guides. Instead, games started incorporating the first few levels as a tutorial, to make sure that gamers actually learned all about the game while playing it.

It is a great way to reduce the friction of playing, and it ensured a smooth transition from having no idea how to play the game to being ready to spend a lot of time blowing pixels up.

Taking this back to the realm of non-gaming software, you really have to identify the few core paths that users are likely to walk down when using your software, and then streamline pretty much everything along those paths to make sure that they don’t hit any friction.

It means asking the user to do as little work as possible, choosing defaults for them (until they care enough to change those), and having dedicated UI to lead them to the “Wow, this is amazing!” moments. Only after you have built up enough positive experience with the users can you require them to do some work.

And note that by “do some work”, I’m talking about anything from setting up the logo for the publisher to selecting which categories the data will go under. By work, I mean anything that holds the user back from the main thing they came to your site for.

If you don’t do that, then you are going to end up with a narrower funnel, and the end result is that you’ll have fewer users. Not because your service is inferior, or your software sucks. Simply because you failed to prove to the user that you are worth the time investment of giving you a fair shot.

Career planning: The immortal choices aren't


In response to my previous post, Eric had the following comment (well, tweet):

I guess some baskets last longer or some eggs don't seem to rot e.g. C, C++, SQL, Java*, etc

And that is true, in some sense of the word. In other words, there isn't any expected shortage of C or C++ opportunities for the medium to long term. The problem is that it isn't the same language, framework or environment over time.

In the late 90s / early 2000s I was deep into C++. I read Effective C++ and More Effective C++, I went through the entire STL with a fine-tooth comb, and I was a pretty enthusiastic (and bad) C++ developer. But let's assume that I was a competent C++ dev in the late 90s.

What was the environment like at the time? Pretty much everything was 32 bit, and STL was still a hotly debated topic. MFC and ATL were all the rage, making the C++ compiler die via template meta programming was extremely common, and COM and Windows DNA were everywhere.

Assume that you froze your knowledge at that time, and skip forward 15 years. Where are you at?

Modern C++ has embraced STL, then moved beyond it to Boost. In Windows land, MFC and ATL are only used for legacy stuff. COM is still there, but you try to avoid it. And cross platform code isn't something esoteric.

Now, I stopped doing C++ a few years after getting started with .NET, so I'm pretty sure that the kind of changes I can see are just the tip of the iceberg.

In short, just because the title of your job didn't change doesn't mean that what you do hasn't changed considerably. And choosing a safe (from a job prospects standpoint) programming language and sticking to it, knowing that you can always rely on it, is a pretty good way to commit career suicide.

On the other hand, I know people looking for Cobol programmers...

Career planning: The age of least resistance


On Sunday, there was a news program about how tough it is to find work after 40. It was full of the usual stuff about employers only looking for young people who can work 30 hours days*, freezing out anyone too old for their taste, etc.

This is a real problem in many cases, and one that I find abhorrent. Not least because I plan to have a long career in my chosen field, and I really don’t like the idea of there being a certain age after which I should be shuffled off to do data entry tasks, if that. Especially since that age seems not too far off at all. Currently, the oldest person we have in a development role is over fifty, although most of our team is late twenties to mid thirties.

So we reached out to one of the people featured, asking for a CV so we could take a look. And it took me very little time to realize why this person had a hard time finding a job. In particular, while the news program was about people who are unable to find a job, this particular person actually had one. It just wasn’t a job that he was happy with. As far as I understand, he was paid a lot less than what he was used to, comparable to someone just starting out, rather than someone with close to two decades of experience.

Looking at the CV, it was obvious why that was. This particular person’s job history included a 15-year stretch in a large municipality, during which he worked mostly on VB6 programs. It has only been in the last couple of years that he started working in .NET.

Now, Microsoft released VB6 in 1998, and announced that it was moving to VB.NET in early 2000, with the .NET framework being released in 2002. By 2005, VB6 was no longer supported, and by 2008 even extended support had run out. So we are talking about 7 – 8 years in which the main tool at his disposal was quite clearly a dead end.

While I have fond memories of VB6, and I’m pretty sure that there is still a lot of software using it, it is not surprising that demand for people with VB6 expertise has already peaked and is currently in a decline that I don’t really see reversing. None of this is a shock; you would have had to willfully ignore reality to believe, at any point in the past decade, that VB6 had a strong future.

So we have a person with expertise in obsolete tech, trying to find a job in a market where he effectively has 1-2 years of experience using C#. It isn’t surprising that he got what is effectively a starter position, even given his age. It isn’t that his age affected the offered position; it is that it didn’t.

This leads back to the advice I gave previously on the matter of career planning. Saying “I got a nice job” and resting on your laurels is a good way to end up in a marginalized position down the road.

Keeping your skills up to date (ideally as part of your job, but outside of it if that isn’t possible) is crucial. Otherwise you end up as the guy with one year of actual experience, repeated many times over.

* Not a typo, it is intentionally stupid.

API Design: We’ll let the users sort it out


In my previous post, I explained an API design that gives the user the option to perform an immediate operation, use the default fire and forget, or use an explicit bulk mechanism. The idea is that most operations are small, and that the cost of actually going over the network is going to dominate the cost of the entire operation. In this case, we want to give the user the option of selecting between the “I want to know the operation completed” and “just make a best effort, I’m fine if there is a failure” modes.

Eli asks:

Trying to understand why this isn't just a scripting case. In your example you loop over CSV and propose an increment call per part which gets batched up and committed outside the user's control. Why not define a JavaScript on the server that takes an array ("a batch") of parts sent over the wire and iterates over it on the server instead? This way the user gets fine grained control over the batch. I'm guessing the answer has something to do with your distributed counter architecture...

If I understand Eli properly, the idea is that we’ll just expose an API like this:

Increment("users/1/visits", 1);

And provide an endpoint where a user can POST JavaScript code that will call this. The idea is that the user will be able to decide whether to call this once, or send a whole batch of updates in one go.

This is certainly an option, but in my considered opinion, it is a pretty bad one. It has nothing to do with the distributed architecture; it has to do with the burden we put on the user. The actual semantics of “go all the way to the server and confirm the operation” vs. “let us do a bulk insert kind of thing” are pretty easy: each of them has a well defined behavior.

But what happens when you want to do an operation per page view? From the point of view of your code, you are making a single operation (incrementing the counters for a particular entity). From the point of view of the system as a whole, you are generating a whole lot of individual requests that would be much better off as a single bulk request.

Having a scripting endpoint gives the user the option of doing that, sure, but then they need to handle:

  • Error recovery
  • Multi threading
  • Flushing on batch size / time
  • Failover

And many more things that I’m probably forgetting. By providing the users with the option of making an informed choice about speed vs. safety, we avoid putting the onus of the actual implementation on them.
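
To make that burden concrete, here is a rough sketch of the kind of batching plumbing each user would end up writing on top of a bare scripting endpoint. All names here are hypothetical (HomeGrownCounterBatcher, sendBatch); this is not part of any RavenDB client API.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch only: the plumbing a user would have to own if all we
// exposed was a raw scripting endpoint.
public class HomeGrownCounterBatcher : IDisposable
{
	private readonly ConcurrentQueue<KeyValuePair<string, long>> _pending =
		new ConcurrentQueue<KeyValuePair<string, long>>();
	private readonly Action<List<KeyValuePair<string, long>>> _sendBatch; // wraps the POSTed script
	private readonly int _maxBatchSize;
	private readonly Timer _flushTimer;

	public HomeGrownCounterBatcher(Action<List<KeyValuePair<string, long>>> sendBatch,
		int maxBatchSize, TimeSpan flushInterval)
	{
		_sendBatch = sendBatch;
		_maxBatchSize = maxBatchSize;
		// Flush on time, so a slow trickle of increments still reaches the server.
		_flushTimer = new Timer(_ => Flush(), null, flushInterval, flushInterval);
	}

	public void Increment(string name, long delta)
	{
		_pending.Enqueue(new KeyValuePair<string, long>(name, delta));
		if (_pending.Count >= _maxBatchSize)
			Flush(); // flush on size as well
	}

	private void Flush()
	{
		var batch = new List<KeyValuePair<string, long>>();
		KeyValuePair<string, long> item;
		while (batch.Count < _maxBatchSize && _pending.TryDequeue(out item))
			batch.Add(item);
		if (batch.Count > 0)
			_sendBatch(batch); // retries and failover are left as an exercise
	}

	public void Dispose()
	{
		_flushTimer.Dispose();
		Flush();
	}
}

And even this sketch still punts on error recovery and failover, which is exactly the point.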

API Design: Small modifications over a network


In RavenDB 4.0 (yes, that is quite a bit away), we are working on additional data types and storage engines. One of the things that we’ll add, for example, is the notion of gossiping distributed counters. That doesn’t actually matter for our purposes here, however.

What I wanted to talk about today is the problem in making small updates over a network. Consider the counters example. By far the most common operations are some variant of:

counters.Increment(name, 1);

Since the database is remote, we need to send that over the network. But that is a very small operation. Going all the way to the remote database just for that is a big waste of time. Consider that in most systems, we don’t have just a single thread of execution, we have many. Each of them performing their own operations. Allowing each operation to go on its own is a big waste, but what other options can we offer?

There are other considerations as well. While most of the time we’ll be making small operations, it is very common to need to do bulk operations as well. Maybe you are setting up the system for the first time, or importing data from a file, etc. Making a large number of individual requests will kill any hope of doing this fast. The obvious answer is to use batching: send a lot of values to the server all at once. This reduces the network overhead significantly.

The API for that would be something like:

using(var batch = counters.Advanced.NewBatch())
{
	// File.ReadLines streams the file lazily, one line at a time
	foreach(var line in File.ReadLines("counters.csv"))
	{
		var parts = line.Split(',');
		batch.Increment(parts[0], long.Parse(parts[1]));
	}
}

So that is great for big things, but not very helpful for individual operations. We still have to go to the server for each of those.

Of course, we could also use the batch API, but making use of that for an individual operation is… a big waste.

What to do?

The end result we arrived at was that we have three levels of API:

  • counters.Increment(name, 1); – single operation, executed immediately over the network. Guaranteed to succeed or fail and give you the result.
  • counters.Advanced.NewBatch() – batch operation, executed over all of the items in the batch (but not as a single transaction), will let you know if the whole operation succeeded, or if there was an issue with something.
  • counters.Batch.Increment() – the default batch, thread safe, can be used from individual requests. This is a fire & forget operation: we increment locally, and behind the scenes we’ll merge all the increments from all the threads and send them to the server in batches.

Note that the last option means that we’ll only send batches, so only when enough time has elapsed or we have accumulated enough items will we send the data to the server. The idea is that you get to choose, based on the importance of what you are doing.

If you need confirmation that something was successful, use the individual operation. If you just want us to make a best effort, and you don’t care if something bad occasionally slips through, use the default batch option.
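
Putting the three levels side by side, client code might look roughly like this. This is a sketch built from the API names listed above; the exact syntax isn’t final.

// Sketch, not final client syntax.

// Immediate: one round trip, you get success or failure back.
counters.Increment("users/1/visits", 1);

// Explicit batch: for bulk work such as an import; reports whether
// the batch as a whole went through.
using (var batch = counters.Advanced.NewBatch())
{
	batch.Increment("users/1/visits", 1);
	// ... many more increments, sent together
}

// Default batch: fire & forget, thread safe; increments from all threads
// are merged and flushed to the server on size or time.
counters.Batch.Increment("users/1/visits", 1);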

Note that I’m using the counters example here because it is simple, but the same applies for things like time series, and other stuff that we are currently building.

Buffer Managers, production code and alternative implementations


We are porting RavenDB to Linux, and as such, we run into a lot of… interesting issues. Today we ran into a really annoying one.

We make use of the BufferManager class inside RavenDB to reduce memory allocations. On the .NET side of things, everything works just fine, and we never really had any issues with it.

On the Mono side of things, we started getting all sorts of weird errors, from ArgumentOutOfRangeException to NullReferenceException to just plain weird stuff. That was the time to dig in and look into what was going on.

On the .NET side of things, the BufferManager implementation selects between large (more than 85KB) and small buffers. For large buffers, there is a single large pool that is shared among all the users of the pool. For small buffers, the BufferManager uses a pool per active thread as well as a global pool, etc. In fact, looking at the code, it is really nice, and a lot of effort has been made to harden it and make it work nicely for many scenarios.

The Mono implementation, on the other hand, decides to blithely discard the API contract by ignoring the maximum buffer pool size, seemingly because “no user code is designed to cope with this”. Considering the fact that RavenDB is certainly dealing with that, I’m somewhat insulted, but it seems par for the course for Linux, where “memory is infinite until we kill you”* is the way to go.

But what is far worse is that this class is absolutely not thread safe. That was a lot of fun to discover. Considering that this piece of code is pretty central to the entire WCF stack, I’m not really sure how that ever worked. We ended up writing our own BufferManager implementation for Mono to avoid those issues.

* Yes, somewhat bitter here, I’ll admit. The next post will discuss this in detail.
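
For illustration, a replacement doesn’t need to be sophisticated. A minimal thread-safe pool along these lines already addresses both issues: it honors a maximum pool size and can be called from multiple threads. This is just a sketch, not our actual Mono-side implementation; SimpleBufferManager and its members are made-up names.

using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of a minimal thread safe buffer pool; not the actual RavenDB code.
public class SimpleBufferManager
{
	private readonly ConcurrentStack<byte[]> _buffers = new ConcurrentStack<byte[]>();
	private readonly long _maxPoolSize;
	private readonly int _bufferSize;
	private long _pooledBytes;

	public SimpleBufferManager(long maxPoolSize, int bufferSize)
	{
		_maxPoolSize = maxPoolSize;
		_bufferSize = bufferSize;
	}

	public byte[] TakeBuffer(int size)
	{
		byte[] buffer;
		if (size <= _bufferSize && _buffers.TryPop(out buffer))
		{
			Interlocked.Add(ref _pooledBytes, -_bufferSize);
			return buffer;
		}
		// Requests we can't serve from the pool are simply allocated.
		return new byte[Math.Max(size, _bufferSize)];
	}

	public void ReturnBuffer(byte[] buffer)
	{
		if (buffer == null || buffer.Length != _bufferSize)
			return; // oversized or foreign buffers are left to the GC
		// Honor the maximum pool size instead of growing without bound.
		if (Interlocked.Add(ref _pooledBytes, _bufferSize) > _maxPoolSize)
		{
			Interlocked.Add(ref _pooledBytes, -_bufferSize);
			return;
		}
		_buffers.Push(buffer);
	}
}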

.NET Packaging mess


In the past few years, we had:

  • .NET Full
  • .NET Micro
  • .NET Client Profile
  • .NET Silverlight
  • .NET Portable Class Library
  • .NET WinRT
  • Core CLR
  • Core CLR (Cloud Optimized)*
  • MessingWithYa CLR

* Can’t care enough to figure out if this is the same as the previous one or not.

In each of those cases, they offered similar, but not identical, APIs and options. That is completely ignoring the versioning side of things, where we have .NET 2.0 (1.0 finally died a while ago), .NET 3.5, .NET 4.0 and .NET 4.5. I don’t think that something can be done about versioning, but the packaging issue is painful.

Here is a small example why:

[image]

In each case, we need to subtly tweak the system to accommodate the new packaging option. This is pure additional cost to the system, with zero net benefit. Each time we have to do that, we add a whole new dimension to the testing and support matrix, leaving aside the fact that the complexity of the solution keeps increasing.
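
To make that concrete, the tweaks in question usually end up as conditional compilation along these lines. The symbols below (NETFX_CORE, DNXCORE50) are typical per-target defines of that era, not an exact excerpt from our code base.

// Illustrative only: a routine that has to behave differently per packaging option.
public static class PortableReflection
{
	public static System.Reflection.Assembly AssemblyOf(System.Type type)
	{
#if NETFX_CORE || DNXCORE50
		// WinRT / Core CLR moved most reflection members over to TypeInfo.
		return System.Reflection.IntrospectionExtensions.GetTypeInfo(type).Assembly;
#else
		// Full framework, Client Profile, Silverlight: the classic API is available.
		return type.Assembly;
#endif
	}
}

Multiply that by every place that touches files, threads, reflection or sockets, and you get the extra dimension in the testing matrix mentioned above.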

I wouldn’t mind it so much if it weren’t for the fact that a lot of those feel like drive-bys. Silverlight took a lot of effort, and it is dead. WinRT took a lot of effort, and it is effectively dead.

This adds a real cost in time and effort, and it is hurting the platform as a whole.

Now users are running into issues with the Core CLR not supporting stuff that we use. So we need to rip MEF out of some of our code and implement it ourselves just to get things back to where they were before.

Gossip much? Use cases and bad practices for gossip protocols


My previous few posts have talked about specific algorithms for gossip protocols, specifically HyParView and Plumtrees. They dealt with the technical behavior of the system, the process by which we send data over the cluster to all the nodes. In this post, I want to talk a bit about what kind of messages we are going to send in such a system.

The obvious one is to try to keep the entire state of the system up to date using gossip. So whenever we make a change, we gossip about it to the entire network, and we get an eventually consistent system in which all nodes have roughly the same data. There is one problem with that: you now have a lot of nodes with the same data on them. At some point, that stops making sense. Usually gossip is used when you have a large group of servers, and just keeping all the data on all the nodes is not a good idea unless your data set is very small.

So you don’t do that.  Gossip is usually used to disseminate a small data set, one that can fit comfortably inside a single machine (usually it is a very small data set, a few hundred MB at most). Let us consider a few types of messages that would fit in a gossip setting.

The obvious example is the actual topology of the whole network. A node joining the cluster will announce its presence, and that will percolate to the entire cluster, eventually. That can give you an idea (note, this isn’t a certainty) of the structure of the cluster, and maybe let you make decisions based on it.

System wide configuration data is also a good candidate for gossip. For example, you can use gossip as a distributed service locator in the cluster. Whenever a new SMTP server comes online, it announces itself via gossip to the cluster; it is added to the list of SMTP servers on all the nodes that heard about it, and then it gets used. In this kind of system, you have to take into account that servers can be down for a long period of time, and miss out on messages. Gossip does not guarantee that the messages will arrive, after all. Oh, it is going to do its best, but you need to also build an anti-entropy system. If a server finds that it has missed out on too much, it can request one of its peers to send it a full snapshot of the current global state as that peer knows it.

Going in the same vein, nodes can gossip about the health of the network. If I’m trying to send an email via an SMTP server, and it is down, I’m going to try another server, and let the network know that I failed to talk to that particular server. If enough nodes fail to communicate with the server, that becomes part of the state of the system, so nodes that learned about it can avoid that server for a period of time.

Moving in a different direction, you can also do gossip queries, which are done by sending a gossip message across the cluster with a specific query in it. A typical example might be “which node has 10GB free that I can use?”. Such queries typically carry a timeout element: you send the query, and any matches are sent back to you (either directly or via gossip). After a predefined timeout, you can assume that you got all the replies you are going to get, and operate on that. More interesting is when you want to query the actual data held in each node, if we want to find all the users who logged in today, for example.

The problem with doing something like that is that you might have a large result set, and you’ll need to have some way to work with it. You don’t want to send it all to a single destination, and what would you do with it, anyway? For that reason, most of the time gossip queries are actually aggregations. We can use them to get an estimate of certain things in our cluster. If we wanted to get the number of users per country, that would be a good candidate, for example. Note that you won’t necessarily get accurate results if you have failures, so there are aggregation methods for getting a good approximation of the actual value.

For fun, here is an interesting exercise: tracking trending topics across a large number of conversations. In this case, whenever you get a new message, you analyze its topics, and periodically (every second, let us say) you gossip to your peers about them. Here we don’t just blindly pass the gossip between nodes. Instead, we use a slightly different method. Each second, every node contacts its peers and sends them the current trending topics in the node. Each time the trending topics change, a version number is incremented. In addition, the node also sends its peers the node ids and versions of the messages it got from other nodes. The peer, in reply, sends a confirmation about all the node ids and versions that it has. So the origin node can fill in any new information that it got, or ask for updates on information that it doesn’t have.

This reduces the number of updates that flow through the cluster, while still maintaining an eventually consistent model. We’ll be able to tell, from each node, what the current trending topics are globally.
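
A rough sketch of that digest exchange, to make the flow concrete. All types and names here (TrendingGossiper, TopicsEntry) are illustrative, not from any particular library.

using System.Collections.Generic;

// Sketch of the per-second digest exchange for the trending topics example.
public class TopicsEntry
{
	public long Version;
	public List<string> Topics;
}

public class TrendingGossiper
{
	// Latest known entry per node id, including our own node.
	private readonly Dictionary<string, TopicsEntry> _state = new Dictionary<string, TopicsEntry>();

	// Step 1: the digest we send to a peer every second is just
	// node ids and versions, not the full topic lists.
	public Dictionary<string, long> BuildDigest()
	{
		var digest = new Dictionary<string, long>();
		foreach (var kvp in _state)
			digest[kvp.Key] = kvp.Value.Version;
		return digest;
	}

	// Step 2: given a peer's digest, find the entries it is missing
	// or has an older version of, so we can send only those in full.
	public Dictionary<string, TopicsEntry> EntriesNewerThan(Dictionary<string, long> peerDigest)
	{
		var updates = new Dictionary<string, TopicsEntry>();
		foreach (var kvp in _state)
		{
			long peerVersion;
			if (!peerDigest.TryGetValue(kvp.Key, out peerVersion) || peerVersion < kvp.Value.Version)
				updates[kvp.Key] = kvp.Value;
		}
		return updates;
	}

	// Step 3: merge whatever the peer sent us that is newer than what we have.
	public void Merge(Dictionary<string, TopicsEntry> updates)
	{
		foreach (var kvp in updates)
		{
			TopicsEntry existing;
			if (!_state.TryGetValue(kvp.Key, out existing) || existing.Version < kvp.Value.Version)
				_state[kvp.Key] = kvp.Value;
		}
	}
}

Each round, a node sends only (node id, version) pairs; full topic lists travel only for entries the peer is actually missing.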
