Ayende @ Rahien


User experience on the main path–get it or get lost


The background for this post:

Recently I got an email from a startup founder about a service that they are offering. It so happened that this service matched something that I was actually considering doing, so I was very happy to try it out. I registered, and on two separate occasions I attempted to use the service for its intended purpose. I wasn't able to do so. I got some errors, and I'm not sure whether they were data validation issues or plain errors.

After getting in touch with the founder in question, he pointed me to the guide which (presumably) had step by step instructions on how to get things working. I commented that this shouldn’t be this hard and got a reply that this is just the way it is, anywhere, not just with this service. It was at that point that I gave up on the service entirely.

A few things to note: this is a web-based service (and I'm intentionally not revealing which one) that is supposed to allow collaboration in one-to-many scenarios. I was trying to use it as a publisher.

This experience jarred me, for a very simple reason: it shouldn't be this hard. And sending a user who just wants to get their feet wet to the guide is a really bad mistake. Anyone who is familiar with the customer acquisition funnel will recognize the problem.

Basically, the process of acquiring a new user begins when they are just looking at your software, and you want to lead them in without anything jarring. Any hurdle in their path is going to cause some of them to leave, probably never to return. So you want to make damn sure that there is as little friction in the process as possible.

In this case, I was trying to use a service for its intended purpose, and I was bogged down with a lot of details that had to be completed before I could even test out the service.

In days of old, games used to come with a guide that told you how to actually play the game. I remember going over the Red Alert and Diablo guides with great excitement while the games were installing.

Then the game makers noticed that no one was reading those guides, and new gamers ran into problems playing the games even though the answers were clearly documented in the guide.

The solution was to stop relying on guides. Instead, games started incorporating the first few levels as a tutorial, to make sure that gamers actually learned all about the game while playing it.

It was a great way to reduce the friction of getting started, and it ensured a smooth transition from having no idea how to play to being ready to spend a lot of time blowing pixels up.

Taking this back to the realm of non-gaming software, you really have to identify the few core paths that users are likely to walk down when using your software, and then streamline pretty much everything along those paths to make sure that they don't hit any friction.

It means asking the user to do as little work as possible, choosing defaults for them (until they care enough to change those), and having dedicated UI to lead them to the "Wow, this is amazing!" moments. Only after you have built up enough positive experience with the user can you actually require them to do some work.

And note that by "do some work", I'm talking about anything from setting up the publisher's logo to selecting which categories the data will go in. By work, I mean anything that holds the user up from doing the main thing they came to your site for.

If you don't do that, then you are going to end up with a narrower funnel, and the end result is that you'll have fewer users. Not because your service is inferior, or your software sucks, but simply because you failed to prove to the user that you are actually worth the time investment of giving you a fair shot.

Career planning: The immortal choices aren't


In response to my previous post, Eric had the following comment (well, tweet):

I guess some baskets last longer or some eggs don't seem to rot e.g. C, C++, SQL, Java*, etc

And that is true, in some sense of the word. In other words, there isn't any expected shortage of C or C++ opportunities in the medium to long term. The problem is that this isn't the same language, framework or environment over time.

In the late 90s / early 2000s I was deep into C++. I read Effective C++ and More Effective C++, I went through the entire STL with a fine-tooth comb, and I was a pretty enthusiastic (and bad) C++ developer. But let's assume that I was a competent C++ dev in the late 90s.

What was the environment like at the time? Pretty much everything was 32 bits, the STL was still a hotly debated topic, MFC and ATL were all the rage, and making the C++ compiler die via template metaprogramming was extremely common. COM and Windows DNA were everywhere.

Assume that you freeze your knowledge at that time, and skip forward 15 years. Where are you at?

Modern C++ has embraced the STL, then moved beyond it to Boost. In Windows land, MFC and ATL are only used for legacy stuff. COM is still there, but you try to avoid it. And cross-platform code isn't something esoteric anymore.

Now, I stopped doing C++ a few years after getting started with .NET, so I'm pretty sure that the kind of changes that I can see are just the tip of the iceberg.

In short, just because the title of your job didn't change doesn't mean that what you do hasn't changed considerably. And picking a "safe" (from a job prospects standpoint) programming language, sticking with it and assuming you can always rely on it is a pretty good way to commit career suicide.

On the other hand, I know people looking for Cobol programmers...

Production postmortem: The case of the native memory leak


This one is a pretty recent one. A customer complained about high memory usage in RavenDB under moderate usage. That was a cause for concern, since we care a lot about our memory utilization.

So we started investigating, and it turned out that we were wrong: the problem wasn't with RavenDB itself, it was with the RavenDB client library. The customer had a scenario where 100% of the time, after issuing a small number of requests (less than ten), the client process would be using hundreds of MB, for no purpose at all. The customer had already turned off caching, profiling and pretty much anything else that both they and we could think of.

We got a process dump from them and looked at that, and everything seemed to be fine. The size of the heap was good, and there didn't appear to be any memory being leaked. Our assumption at that point was that there was some sort of native memory leak in their application.

To continue the investigation, NDAs were required, but we got through that and finally had a small repro that we could look at ourselves. The fact that the customer was able to create such a thing is really appreciated, because very often we have to work with a lot of missing information. Of course, when we ran this on our own systems, everything was just fine and dandy. There was no issue. We got back to the customer and they told us that the problem would only reproduce in IIS.

And indeed, once we tested inside IIS (vs. IIS Express), the problem was immediately obvious. That was encouraging (at least from my perspective). RavenDB doesn't contain anything that would differentiate between IIS and IIS Express, so if the problem only showed up in IIS, it was not likely to be something that we did.

Once I had a repro of the problem, I sat down to observe what was actually going on. The customer was reading 1,024 documents from the server, and then sending them to the browser. That meant that each of the requests we tested was roughly 5MB in size. Note that this means that we have to:

  • Read the response from the server
  • Decompress the data
  • Materialize it to objects
  • Serialize the data back to send it over the network

That is a lot of memory being used here. And we had never run into any such issues before.

I had a hunch, and I created the following:

using System;
using System.Net;
using System.Net.Http;
using System.Runtime;
using System.Web.Http;

[RoutePrefix("api/gc")]
public class GcController : ApiController
{
    [Route]
    public HttpResponseMessage Get()
    {
        // Ask for the Large Object Heap to be compacted on the next full collection,
        // then force that collection immediately.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(2);
        return Request.CreateResponse(HttpStatusCode.OK);
    }
}

This allows me to manually invoke a rude GC, and indeed, when running this, memory utilization dropped quite nicely. Of course, that isn't something that you want to run in your production systems, but it is a very important diagnostic tool.

Next, I tried to do the following:

[Route]
public HttpResponseMessage Get()
{
    // Read a ~5MB JSON document and deserialize it, mimicking the memory
    // pressure (several large buffers) of the customer's request.
    var file = File.ReadAllText(@"D:\5mb.json");
    var deserializeObject = JsonConvert.DeserializeObject<Item>(file);

    return this.Request.CreateResponse(HttpStatusCode.OK,
        new { deserializeObject.item, deserializeObject.children });
}

This involves no RavenDB code, but it has roughly the same memory pressure semantics as the customer code. Indeed, the problem reproduced quite nicely there as well.

So the issue was a request that uses a lot of memory (including at least one big buffer), likely causing fragmentation in the Large Object Heap that keeps memory utilization high. When the system has had enough, it will reclaim all of that, but unless there is active memory pressure, I'm guessing that the runtime would rather leave things as they are than pay that price up front.
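For context on why those big buffers matter: any object of roughly 85,000 bytes or more is allocated directly on the Large Object Heap, which is not compacted unless you opt in (as the CompactOnce call above does). A minimal standalone sketch of that behavior:

using System;

class LohDemo
{
    static void Main()
    {
        // Objects of ~85,000 bytes or more go straight to the Large Object Heap,
        // which the GC reports as generation 2 and does not compact by default.
        var small = new byte[80 * 1024];        // small object heap, starts in gen 0
        var large = new byte[5 * 1024 * 1024];  // LOH, reported as gen 2 right away

        Console.WriteLine(GC.GetGeneration(small)); // typically 0
        Console.WriteLine(GC.GetGeneration(large)); // 2
    }
}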

Career planning: The age of least resistance


On Sunday, there was a news program about how tough it is to find work after your 40s. It was full of the usual stuff about employers only looking for young people who can work 30-hour days*, freezing out anyone too old for their taste, etc.

This is a real problem in many cases, and one that I find abhorrent. Not the least because I plan to have a long career in my chosen field, and I really don’t like the idea of having a certain age after which I should be shuffled off to do data entry tasks, if that. Especially since that age seems not too far at all. Currently, the oldest person we have in a development role is over fifty, although most of our team is late twenties to mid thirties.

So we reached out to one of them, asking for a CV so we could take a look. And it took me very little time to realize why this person had a hard time finding a job. In particular, while the news program was about people who are unable to find a job, this particular person actually had a job. It just wasn't a job that he was happy with. As far as I understand, he was paid a lot less than what he was used to, comparable to someone just starting out, rather than someone with close to two decades of experience.

Looking at the CV, it was obvious why that was. This person's job history included a 15-year stretch in a large municipality, during which he worked mostly on VB6 programs. It has only been in the last couple of years that he started working in .NET.

Now, Microsoft released VB6 in 1998, and announced that it was moving to VB.NET in early 2000, with the .NET Framework being released in 2002. By 2005, VB6 was no longer supported, and by 2008 even extended support had run out. So we are talking about 7-8 years in which the main tool at his disposal was quite clearly a dead end.

While I have fond memories of VB6, and I'm pretty sure that there is still a lot of software using it, it is not surprising that demand for VB6 expertise has already peaked and is in a decline that I don't see reversing. This isn't really surprising; you would have had to willfully ignore reality over the past decade to believe that VB6 has a strong future.

So we have a person with expertise in obsolete tech, trying to find a job in the market with effectively 1-2 years of experience using C#. It isn’t surprising that he got what is effectively a starter position, even given his age.  It isn’t that his age affected the offered position, it is that it didn’t.

This leads back to the advice I gave previously on the matter of career planning. Saying "I got a nice job" and resting on your laurels is a good way to end up in a marginalized position down the road.

Keeping your skills up to date (ideally as part of your job, but outside of it if that isn't possible) is crucial; otherwise you are the guy with one year of actual experience, repeated many times over.

* Not a typo, it is intentionally stupid.

API Design: We'll let the users sort it out


In my previous post, I explained an API design that gives the user the option to perform an immediate operation, use the default fire and forget behavior, or use an explicit bulk mechanism. The idea is that most operations are small, and that the cost of actually going over the network is going to dominate the cost of the entire operation. In this case, we want to give the user the option of selecting between the "I want to know the operation completed" and "just do your best, I'm fine if there is a failure" modes.

Eli asks:

Trying to understand why this isn't just a scripting case. In your example you loop over CSV and propose an increment call per part which gets batched up and committed outside the user's control. Why not define a JavaScript on the server that takes an array ("a batch") of parts sent over the wire and iterates over it on the server instead? This way the user gets fine grained control over the batch. I'm guessing the answer has something to do with your distributed counter architecture...

If I understand Eli properly, the idea is that we’ll just expose an API like this:

Increment("users/1/visits", 1);

And provide an endpoint where a user can POST JavaScript code that will call this. The idea is that the user will be able to decide whether to call this once, or send a whole batch of updates in one go.

This is certainly an option, but in my considered opinion, it is a pretty bad one. It has nothing to do with the distributed architecture; it has to do with the burden we put on the user. The actual semantics of "go all the way to the server and confirm the operation" vs. "let us do a bulk insert kind of thing" are pretty simple. Each of them has a well defined behavior.
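To make the two modes concrete, using the call shapes described in the previous post:

// Immediate mode: a round trip to the server, guaranteed to succeed or fail visibly.
counters.Increment("users/1/visits", 1);

// Fire and forget mode: merged client side and flushed to the server in batches.
counters.Batch.Increment("users/1/visits", 1);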

But what happens when you want to do an operation per page view? From the point of view of your code, you are making a single operation (incrementing the counters for a particular entity). From the point of view of the system as a whole, you are generating a whole lot of individual requests that would be much better off as a single bulk request.

Having a scripting endpoint gives the user the option of doing that, sure, but then they need to handle:

  • Error recovery
  • Multi threading
  • Flushing on batch size / time
  • Failover

And many more that I’m probably forgetting. By providing the users with the option of making an informed choice about speed vs. safety, we avoid putting the onus of the actual implementation on them.

API Design: Small modifications over a network


In RavenDB 4.0 (yes, that is quite a bit away), we are working on additional data types and storage engines. One of the things that we’ll add, for example, is the notion of gossiping distributed counters. That doesn’t actually matter for our purposes here, however.

What I wanted to talk about today is the problem in making small updates over a network. Consider the counters example. By far the most common operations are some variant of:

counters.Increment(name, 1);

Since the database is remote, we need to send that over the network. But that is a very small operation. Going all the way to the remote database just for that is a big waste of time. Consider that in most systems, we don’t have just a single thread of execution, we have many. Each of them performing their own operations. Allowing each operation to go on its own is a big waste, but what other options can we offer?

There are other considerations as well. While most of the time we'll be making small operations, it is very common to need to do bulk operations as well. Maybe you are setting up the system for the first time, or importing data from a file, etc. Making a large number of individual requests will kill any hope of doing this fast. The obvious answer is to use batching: send a lot of values to the server all at once. This reduces the network overhead significantly.

The API for that would be something like:

using(var batch = counters.Advanced.NewBatch())
{
	foreach(var line in File.ReadLines("counters.csv"))
	{
		// Each line is "name,delta"
		var parts = line.Split(',');
		batch.Increment(parts[0], long.Parse(parts[1]));
	}
}

So that is great for big things, but not very helpful for individual operations. We still have to go to the server for each of those.

Of course, we could also use the batch API, but making use of that for an individual operation is… a big waste.

What to do?

The end result we arrived at was that we have three levels of API:

  • counters.Increment(name, 1); – single operation, executed immediately over the network. Guaranteed to succeed or fail and give you the result.
  • counters.Advanced.NewBatch() – batch operation, executed over all of the items in the batch (but not as a single transaction), will let you know if the whole operation succeeded, or if there was an issue with something.
  • counters.Batch.Increment() – the default batch, thread safe, can be utilized by individual requests. This is a fire & forget operation. We increment, and behind the scenes we'll merge all the increments from all the threads and send them in batches to the server.

Note that the last option means that we'll only do batches, so only when enough time has elapsed or we have enough items to send will we send the data to the server. The idea is that you get to choose, based on the importance of what you are doing.
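To make that last option concrete, here is a rough sketch of what such a default batch might look like internally (a simplified illustration, not the actual RavenDB client code): increments from all threads are merged per counter name and flushed when enough items accumulate or a timer fires.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Simplified sketch of a fire & forget counter batch: thread safe,
// merges increments per counter name, flushes on size or on a timer.
public class CounterBatchSketch : IDisposable
{
    private readonly ConcurrentDictionary<string, long> pending = new ConcurrentDictionary<string, long>();
    private readonly Timer timer;
    private readonly int maxItems;
    private readonly Action<IDictionary<string, long>> send; // e.g. an HTTP POST to the server

    public CounterBatchSketch(Action<IDictionary<string, long>> send, int maxItems = 1024, TimeSpan? interval = null)
    {
        this.send = send;
        this.maxItems = maxItems;
        var due = interval ?? TimeSpan.FromSeconds(1);
        timer = new Timer(_ => Flush(), null, due, due);
    }

    public void Increment(string name, long delta = 1)
    {
        // Merge with any pending increment for the same counter.
        pending.AddOrUpdate(name, delta, (_, current) => current + delta);
        if (pending.Count >= maxItems)
            Flush();
    }

    public void Flush()
    {
        // Swap out the pending increments and send them as a single request.
        var batch = new Dictionary<string, long>();
        foreach (var name in pending.Keys)
        {
            long value;
            if (pending.TryRemove(name, out value) && value != 0)
                batch[name] = value;
        }
        if (batch.Count > 0)
            send(batch);
    }

    public void Dispose()
    {
        timer.Dispose();
        Flush();
    }
}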

If you need confirmation that something was successful, use the individual operation. If you just want us to make a best effort, and you don't care if something occasionally goes wrong, use the batch option.

Note that I’m using the counters example here because it is simple, but the same applies for things like time series, and other stuff that we are currently building.

Production postmortem: The case of the intransigent new database


A customer called to tell us that they had a problem with RavenDB. As part of their process for handling new customers, they would create a new database, set up indexes, and then direct all the queries for that customer to that database.

Unfortunately, this system that had worked so well in development died a horrible death in production. But, and this was strange, only for new customers, and only in the create-new-customer stage. The problem was:

  • The user would create a new database in RavenDB. This just creates a db record and its location on disk. It doesn't actually initialize the database.
  • On the first request, we initialize the db, creating it if needed. The first request will wait until this happens, then proceed.
  • On their production systems, that first request (which they used to create the indexes they require) would time out with an error.

Somehow, the creation of a new database would take way too long.

Our first thought was that they were creating the database on the path of an already existing database, maybe a big one that had a long initialization period, or maybe one that required recovery. But the customer confirmed that they were creating the database in an empty folder.

We looked at the logs, and they just showed long stretches of time where there was no activity. In fact, we had a single method call to open the database that took over 15 seconds to run. Except that on a new database, this method just creates a bunch of files to start things out and returns really quickly.

That is the point that led us to suspect that the issue was environmental. Luckily, as a result of many such support calls, RavenDB comes with a pretty basic I/O test tool. I asked the customer to run it on their production system, and the results (screenshot not reproduced here) showed very low write throughput.

And now everything was clear. They were running on an I/O constrained system (a cloud machine), and they were running into an interesting problem. When RavenDB creates a database, it pre-allocates some files for its transactional journal.

Those files are 64MB in size, and the total write for a new Esent RavenDB database with default configuration is just over 65MB. If your sustained write throughput is less than 1MB/sec, that is more than a minute just to create an empty database, which is more than enough to hit the request timeout.
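To put a number on it, a rough standalone measurement like the following (a sketch, not RavenDB's actual I/O test tool) is enough to estimate how long that pre-allocation will take on a given disk:

using System;
using System.Diagnostics;
using System.IO;

class WriteThroughputCheck
{
    static void Main()
    {
        const int totalMb = 64;
        var buffer = new byte[1024 * 1024]; // 1MB of zeroes
        var path = Path.Combine(Path.GetTempPath(), "io-check.tmp");

        var sw = Stopwatch.StartNew();
        using (var file = new FileStream(path, FileMode.Create, FileAccess.Write,
            FileShare.None, 4096, FileOptions.WriteThrough))
        {
            for (int i = 0; i < totalMb; i++)
                file.Write(buffer, 0, buffer.Length);
            file.Flush(true); // force the data to disk, not just to the OS cache
        }
        sw.Stop();
        File.Delete(path);

        var mbPerSec = totalMb / sw.Elapsed.TotalSeconds;
        Console.WriteLine("Sustained write throughput: {0:F2} MB/sec", mbPerSec);
        Console.WriteLine("Estimated time to write ~65MB of journals: {0:F0} seconds", 65 / mbPerSec);
    }
}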

I let the customer know about the configuration option to take less space at startup (Esent RavenDB databases can go as low as 5MB, Voron RavenDB starts at 256KB), but I also gave them a hearty recommendation to make sure that their I/O rates improved, because this isn't going to be the only case where slow I/O will kill them.

What is new in RavenDB 3.5: Exploring data in the dark


One of the things that tends to happen a lot when we are developing with a database is that we need to peek at the data, and a lot of the time, just looking at the data one document at a time isn't good enough.

We noticed that a lot of users will create temporary indexes (usually map/reduce ones) to get some idea about what is actually going on in the database, or for one-off reporting. That is pretty inefficient, and in order to handle that, we added the Data Exploration feature.
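For a sense of what those throwaway indexes look like, here is a typical one-off map/reduce index (the Order and OrderLine classes are made-up sample documents; the AbstractIndexCreationTask shape is the regular client API):

using System.Collections.Generic;
using System.Linq;
using Raven.Client.Indexes;

// Made-up sample documents, just to have something to aggregate over.
public class Order
{
    public string Company;
    public List<OrderLine> Lines;
}

public class OrderLine
{
    public int Quantity;
    public decimal PricePerUnit;
}

// A typical "I just want to see the totals per company" throwaway index.
public class Orders_TotalByCompany : AbstractIndexCreationTask<Order, Orders_TotalByCompany.Result>
{
    public class Result
    {
        public string Company;
        public decimal Total;
    }

    public Orders_TotalByCompany()
    {
        Map = orders => from order in orders
                        select new { order.Company, Total = order.Lines.Sum(l => l.Quantity * l.PricePerUnit) };

        Reduce = results => from result in results
                            group result by result.Company into g
                            select new { Company = g.Key, Total = g.Sum(x => x.Total) };
    }
}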


This feature gives you the option of exploring the data in detail. You can even run queries over large data sets, doing some complex aggregations and looking at the results.

Note that this is an admin / developer only feature, we provide no API for this, because the idea is that we have a human sitting in front of the DB going… “Hmm.. I wonder about…”

What is new in RavenDB 3.5: My thread pool is smarter


The .NET thread pool is a really amazing piece of technology, and it is suitable for a wide range of usages. RavenDB has been making use of it for almost all concurrent work since the very beginning.

In RavenDB 3.5, we have decided to change that. RavenDB has a lot of parallel execution requirements, but most of them have unique characteristics that we can express better with our own thread pool.

To start with, unlike the normal thread pool, we aren't registering just a delegate and some state for it to execute; we are always registering a list of items to process, and a delegate that takes either a single item from that list or a section of that list. This lets us do a much better job at work stealing, because we have a lot more context about the actual operation. We know that when we are done executing a particular delegate, we get to run the same delegate on the next available item in the list it was given. That gives us higher locality of code, because we are always executing the same task, as long as there are items left for it in the pool.
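As a rough sketch of that shape (mine, not the actual RavenDB code): the pool is handed a whole list plus the delegate to run on each item, and every worker keeps claiming the next unprocessed index, which is what gives the work stealing and the code locality described above.

using System;
using System.Collections.Generic;
using System.Threading;

// Minimal sketch of the idea: register a list of items and the delegate to
// run on each, and let every worker keep claiming the next unprocessed index.
public static class BatchRunnerSketch
{
    public static void ExecuteBatch<T>(IList<T> items, Action<T> action, int workers = 4)
    {
        int next = -1;
        var threads = new Thread[workers];
        for (int i = 0; i < workers; i++)
        {
            threads[i] = new Thread(() =>
            {
                int index;
                // Work stealing by atomic index: a fast worker simply takes more items.
                while ((index = Interlocked.Increment(ref next)) < items.Count)
                    action(items[index]);
            });
            threads[i].Start();
        }
        foreach (var thread in threads)
            thread.Join();
    }
}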

We often have nested operations: a parallel task (execute indexing work) that spawns additional parallel work (index the following documents). By basing this all on our custom thread pool, we can perform those operations in a way that doesn't involve blocking while waiting for that work to be done. Instead, the thread pool thread that we run on is able to "wait" by executing the work that we are waiting for. We have no blocked threads, and in many cases we can avoid any context switches.

Under load, that means that threads won't put a lot of work on the thread pool and then have to fight with each other over who will finish first. It means that we get to run our own tasks, and only when there are enough threads available for other work will we spread out to additional threads.

Speaking of load, the new thread pool also has a dynamic load balancing feature. Because we know that RavenDB will use this thread pool for background work only, we can prioritize things accordingly. RavenDB tries to keep CPU usage in the 60%-80% range by default. If we detect higher CPU usage, we'll start decreasing the background work we are doing, to make sure that we aren't impacting front row work (like serving requests). We start by lowering the priority of the background threads, and eventually just stop processing work in most of the background threads (we always keep a minimum number of threads working, of course).
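A stripped down version of that feedback loop might look something like this (a sketch of the approach only, with a single placeholder background thread, not the actual implementation):

using System;
using System.Diagnostics;
using System.Threading;

// Sketch of a CPU-aware throttle for background threads: aim for 60-80% total CPU.
// Uses the Windows/.NET Framework PerformanceCounter for the total CPU sample.
class BackgroundThrottleSketch
{
    static void Main()
    {
        var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        cpu.NextValue(); // the first sample is always 0, prime the counter

        var background = new Thread(() =>
        {
            while (true) { /* indexing-style background work would go here */ }
        }) { IsBackground = true, Priority = ThreadPriority.BelowNormal };
        background.Start();

        while (true)
        {
            Thread.Sleep(1000);
            var usage = cpu.NextValue();
            if (usage > 80)
                background.Priority = ThreadPriority.Lowest;   // back off, requests come first
            else if (usage < 60)
                background.Priority = ThreadPriority.Normal;   // spare capacity, speed up
        }
    }
}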

Another fun thing that the thread pool can do is detect and handle slowpokes. A common example is an index that is taking a long time to run, significantly longer than all the other indexes. The thread pool can release all the other indexes, and let the calling code know that this particular task has been left to run on its own. RavenDB will then split the indexing work so the slow index does not hold up the rest of the indexing.

And having split the thread pools between front row work (the standard .NET thread pool) doing request processing and the background pool (our own custom implementation), we get a lot more predictability in the environment. We don't have to worry about indexing jobs taking over the threads required to serve requests, or about requests on the server impacting the loading of a new database, etc.

And finally, like every other feature in RavenDB nowadays, we have a rich set of debug endpoints that can tell us in detail exactly what is going on. That is crucial when we are talking about systems that run for months and years, or when we are trying to troubleshoot a problematic server.
