Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,968 | Comments: 44,488

filter by tags archive

Pointer arithmetic and dynamic HTML generation FTW–at 2 AM


I just finished writing the following code, the time is 2 AM local time, and I am chasing a conceptual issue deep in the guts of Voron (RavenDB’s internal storage system).

image

By any measure of the word, this code should frighten anyone who reads it. For what I think can we can all safely assume are self evident reasons.

So why am I so very happy about this kind of code? Because the output of this code is this:

image

This is pretty trivial tree view for internal data structures inside of Voron.

But the point here isn’t that this code is probably going to be a no-go zone by tomorrow morning. This is code explicitly designed for that. In fact, the calling method for this is:

image

So this is a piece of code that we are trying very hard to ensure that will never be called unless we are actively requesting it in a debug session. In other words, this isn’t production code. It is scaffolding code. The code that you’ll call when you lost your way, and can no longer figure out what is going on using the standard tools.

This kind of visualization is cheap & dirty trick, but it means that instead of having to reason about this kind of thing, I can immediately see what is going on.

For that matter, if you look closely at the visualization, you might be able to see the bug in question. It is pretty obvious, once we are able to look at the entire data set in one go. The more complex the algorithm, the harder it is to conceptualize it. But in most cases, if we could but look at the data in a way that would make sense to us, we will be able to immediately see what we did wrong.

User experience on the main path–get it or get lost


The background for this post:

Recently I got an email from a startup founder about a service that they are offering. It so happened that this service matched something that I was actually considering doing, so I was very happy to try it out. I registered, and on two separate occasions I attempted to use the service for its intended purpose. I wasn’t able to do that. I got some errors, and I’m not sure if it was data validation or plain errors.

After getting in touch with the founder in question, he pointed me to the guide which (presumably) had step by step instructions on how to get things working. I commented that this shouldn’t be this hard and got a reply that this is just the way it is, anywhere, not just with this service. It was at that point that I gave up on the service entirely.

A few things to note, this is a web based service (and I intentionally not revealing which one) that is supposed to allow collaboration in one –> many scenarios. I was trying to use it as a publisher.

This experience jarred me, for a very simple reason. It shouldn’t be this hard. And sending a user that just want to get their feet wet to the guide is a really bad mistake. You can see the funnel on the right, and anyone that is familiar with customer acquisition will recognize the problem.

Basically, the process of getting a new user begins when they are just looking at your software, and you want to lead them in without any jarring. Any hurdle in their path is going to cause some of them to leave, probably never to return. So you want to make damn sure that you have as little friction to the process as you can.

In this case, I was trying to use a service for the purpose it was intended to. And I was just bogged down with a lot of details that had to be completed before I could even test out the service.

In days of old, games used to come with a guide that told you about how to actually play the game. I remember going over the Red Alert and Diablo guides with great excitement while the games were installing.

Then the games makers noticed that no one was reading those guides, and new gamers run into problems playing the games even though they were clearly documented in the guide.

The solution was to stop using guides. Instead, the games started incorporating several initial levels as tutorial, to make sure that gamers actually learned all about the game while playing the game.

It is a great way to reduce the friction of playing, and it ensured a smooth transition from no idea how to play the game to ready to spend a lot of time blowing pixels up.

Taking this back to the realm of non gaming software, you really have to identify a few core paths that users are likely to walk down into when using your software, and then you are going to stream line pretty much everything along their path to make sure that they don’t hit any friction.

It means asking the user to do as little work as possible, choosing defaults for them (until they care enough to change those), having dedicated UI to lead them to the “Wow, this is amazing!” moments. Only after you actually gained enough positive experience with the users can you actually require them to do some work.

And note that by do some work, I’m talking about anything from setting up the logo for the publisher to selecting what categories the data will go on. By work, I’m talking about anything that is holding up the user from doing the major thing that they are on their site for.

If you don’t do that, then you are going to end up with narrower funnel, and the end result is that you’ll have fewer users. Not because you service is inferior, or your software sucks. Simply because you failed to prove to the user that you are actually worth the time investment in give you a fair shoot.

Career planningThe immortal choices aren't


In response for my previous post, Eric had the folowing comment (well, tweet):

I guess some baskets last longer or some eggs don't seem to rot e.g. C, C++, SQL, Java*, etc

And that is true, in some sense of the word. In other words, there isn't any expected shortage of C or C++ opportunities anywhere in the medium to long future. The problem is that this isn't the same language, framework or enviornment over time.

In the late 90s / early 2000s I was deep into C++. I read Effective C++ and More Effective C++, I gone through the entire STL with a fine tooth comb, and I was a pretty enthusiastic (and bad) C++ developer. But let assume that I was a compotent C++ dev in the late 90s.

What was the environment like at the time? Pretty much all 32 bits, STL was still a hotly debated topic. MFC and ATL were all the rage, and making the C++ compiler die via template meta programming was extremely common. COM and Windows DNA were all the rage. 

Assume that you freeze the knoweldge at that time, and skip forward 15 years. Where are you at?

Modern C++ has embraced STL, then moved beyond it to Boost. In Windows land, MFC and ATL are only used for legacy stuff. COM is still there, but you try to avoid it. And cross platform code isn't something esoteric.

Now, I stopped doing C++ a few years after getting starting with .NET, so I'm pretty sure that the kind of changes that I can see are just the tip of the iceberg.

In short, just because the title of your job didn't change doesn't mean that what you did hasn't changed considerablely. And choosing safe (from a job prospect) programming language and sticking to it, knowing that you can always rely on that is a pretty good way to perfom career suicide.

On the other hand, I know people looking for Cobol programmers...

Production postmortemThe case of the native memory leak


This one is a pretty recent one. A customer complained about high memory usage in RavenDB under moderate usage. That was a cause for concern, since we care a lot about our memory utilization.

So we started investigating that, and it turned out that we were wrong, the problem wasn’t with RavenDB, it was with the RavenDB Client Library. The customer had a scenario where 100% of the time, after issuing a small number of requests (less than ten), the client process would be using hundreds of MB, for really no purpose at all. The client already turned off caching, profiling and pretty much anything else that both they and us could think of.

We got a process dump from them and looked at that, and everything seemed to be fine. The size of the heap was good, and there didn’t appear to be any memory being leaked. Our assumption at that point was that there is some sort of native memory leak from their application.

To continue the investigation further, NDAs was required, but we managed to go through that and we finally had a small repro that we could look at ourselves. The fact that the customer was able to create such a thing is really appreciated, because very often we have to work with a lot of missing information. Of course, when we run this on our own system, everything was just fine & dandy. There was no issue. We got back to the customer and they told us that the problem would only reproduce in IIS.

And indeed, once we have tested inside IIS (vs. using IIS Express), that problem was immediately obvious. That was encouraging (at least from my perspective). RavenDB doesn’t contain anything that would differentiate between IIS and IIS Express. So if the problem was only in IIS, that is not likely to be something that we did.

Once I had a repro of the problem, I sat down to observe what was actually going on. The customer was reading 1,024 documents from the server, and then sending them to the browser. That meant that each of the requests we tested was roughly 5MB in size. Note that this means that we have to:

  • Read the response from the server
  • Decompress the data
  • Materialize it to objects
  • Serialize the data back to send it over the network

That is a lot of memory that is being used here. And we never run into any such issues before.

I had a hunch, and I created the following:

[RoutePrefix("api/gc")]
public class GcController : ApiController
{
    [Route]
    public HttpResponseMessage Get()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(2);
        return Request.CreateResponse(HttpStatusCode.OK);
    }
}

This allows me to manually invoke a rude GC, and indeed, when running this, memory utilization dropped quite nicely. Of course, that isn’t something that you want to run into your systems, but it is very important diagnostic tool.

Next, I tried to do the following:

[Route]
public HttpResponseMessage Get()
{
    var file = File.ReadAllText(@"D:\5mb.json");
    var deserializeObject = JsonConvert.DeserializeObject<Item>(file);

    return this.Request.CreateResponse(HttpStatusCode.OK, new { deserializeObject.item, deserializeObject.children });
}

This involves no RavenDB code, but it has roughly the same memory pressure semantics as the customer code. Indeed, the problem reproduced quite nicely there as well.

So the issue is about a request that uses a lot of memory (include at least one big buffer), likely causing some fragmentation in the heap that would bring memory utilization high. When the system had had enough, it would reclaim all of that, but unless there is active memory pressure, I’m guessing that it would rather leave it like that until it has to pay that price.

Career planningThe age of least resistance


On Sunday, there was a news program about how tough it is to find work after 40s. It was full of the usual stuff about employers only looking for young people who can work 30 hours days*, and freezing out anyone too old for their taste, etc.

This is a real problem in many cases, and one that I find abhorrent. Not the least because I plan to have a long career in my chosen field, and I really don’t like the idea of having a certain age after which I should be shuffled off to do data entry tasks, if that. Especially since that age seems not too far at all. Currently, the oldest person we have in a development role is over fifty, although most of our team is late twenties to mid thirties.

So we reached out to one of them, asking to get a CV so we can look at that. And it took me very little time to realize why this person had a hard time finding a job. In particular, while the news program was about people who are unable to find a job, this particular person actually had a job. It just wasn’t a job that he was happy with. As far as I understand, he was paid a lot less than what he was used to, comparable to someone just starting out, rather than someone with close to two decades of experience.

Looking at the CV, it was obvious why that was. This particular person job history included a 15 years stretch in a large municipality. During which he worked mostly on VB6 programs. It has only been in the last couple of years that he started working in .NET.

Now, Microsoft released VB6 in 1998. And announced that it is moving to VB.Net in early 2000, with the .NET framework being released on 2002. By 2005, VB6 was no longer supported, and by 2008 even extended support run out. So we are talking about 7 – 8 years in which the main tool at their disposal was quite clearly a dead end.

While I’ve fond memories of VB6, and I’m pretty sure that there is still a lot of software using it. It is not surprising that demand for people with VB6 expertise has already peaked and is currently in a decline that I don’t really see changing. Note that this isn’t really surprising, and you would have to willfully ignore reality to believe that there is a strong future in VB6 in the past decade.

So we have a person with expertise in obsolete tech, trying to find a job in the market with effectively 1-2 years of experience using C#. It isn’t surprising that he got what is effectively a starter position, even given his age.  It isn’t that his age affected the offered position, it is that it didn’t.

This lead back to the advice I gave previously on the matter of career planning. Saying “I got a nice job” and resting on the laurels is a good way to end up in a marginalized position down the road.

Keeping your skills up to date (ideally as part of your job, but outside of it is if isn’t possible) is crucial, otherwise you are the guy with one year of actual experience, repeated many times over.

* Not a typo, it is intentionally stupid.

API DesignWe’ll let the users sort it out


In my previous post, I explained an API design that give the user the option to perform an immediate operation, use the default fire and forget or use an explicit bulk mechanism. The idea is that most operations are small, and that the cost of actually going over the network is going to dominate the cost of the entire operation. In this case, we want to give the user the option of selecting the “I want to know the operation completed” or “I just want to try the best, I’m fine if there is a failure” modes.

Eli asks:

Trying to understand why this isn't just a scripting case. In your example you loop over CSV and propose an increment call per part which gets batched up and committed outside the user's control. Why not define a JavaScript on the server that takes an array ("a batch") of parts sent over the wire and iterates over it on the server instead? This way the user gets fine grained control over the batch. I'm guessing the answer has something to do with your distributed counter architecture...

If I understand Eli properly, the idea is that we’ll just expose an API like this:

Increment(“users/1/visits”, 1);

And provide an endpoint where a user can POST a JavaScript code that will call this. The idea is that the user will be able to decide whatever to call this once, or send a whole batch of updates in one go.

This is certainly an option, but in my considered opinion, it is a pretty bad one. It has nothing to do with the distributed architecture, it has to do with the burden we put on the user. The actual semantics of “go all the way to the server and confirm the operation” vs “let us do a bulk insert kind of thing” are pretty easy. Each of them has a pretty well defined behavior.

But what happens when you want to do an operation per page view? From the point of view of your code, you are making a single operation (incrementing the counters for a particular entity). From the point of view of the system as a whole, you are generating a whole lot of individual requests that would be much better off as a single bulk request.

Having a scripting endpoint gives the user the option of doing that, sure, but then they need to handle:

  • Error recovery
  • Multi threading
  • Flushing on batch size / time
  • Failover

And many more that I’m probably forgetting. By providing the users with the option of making an informed choice about speed vs. safety, we avoid putting the onus of the actual implementation on them.

API DesignSmall modifications over a network


In RavenDB 4.0 (yes, that is quite a bit away), we are working on additional data types and storage engines. One of the things that we’ll add, for example, is the notion of gossiping distributed counters. That doesn’t actually matter for our purposes here, however.

What I wanted to talk about today is the problem in making small updates over a network. Consider the counters example. By far the most common operations are some variant of:

counters.Increment(name, 1);

Since the database is remote, we need to send that over the network. But that is a very small operation. Going all the way to the remote database just for that is a big waste of time. Consider that in most systems, we don’t have just a single thread of execution, we have many. Each of them performing their own operations. Allowing each operation to go on its own is a big waste, but what other options can we offer?

There are other considerations as well. While most of the time we’ll be making small operations, it is very common to need to do bulk operations as well. Maybe you are setting up the system for the first time, or importing data from a file, etc. Making large number of individual requests will kill any hope of doing this fast. The obvious answer is to use batching. send a lot of values to the server all at once. This reduce the network overhead significantly.

The API for that would be something like:

using(var batch = counters.Advanced.NewBatch())
{
	foreach(var line in File.EnumerateAllLines("counters.csv"))
	{
		var parts = line.Split();
		batch.Increment(parts[0], long.Parse(line[1]));
	}
}

So that is great for big things, but not very helpful for individual operations. We still have to go to the server for each of those.

Of course, we could also use the batch API, but making use of that for an individual operation is… a big waste.

What to do?

The end result we arrived at was that we have three levels of API:

  • counters.Increment(name, 1); – single operation, executed immediately over the network. Guaranteed to succeed or fail and give you the result.
  • counters.Advanced.NewBatch() – batch operation, executed over all of the items in the batch (but not as a single transaction), will let you know if the whole operation succeeded, or if there was an issue with something.
  • counters.Batch.Increment() – the default batch, thread safe, can be utilized by individual requests. This is a fire & forget operation. We increment, and behind the scene we’ll merge all the increment from all the threads and send them in batches to the server.

Note that the last option means that we’ll only do batches, so only when enough time has lapsed or we have enough items to send will we send the data to the server. The idea is that you get to choose, based on the importance of what you are doing.

If you need confirmation that something was successful, use the individual operation. If you just want us to make best effort, and if something bad really happened you don’t care about it, use the batch option.

Note that I’m using the counters example here because it is simple, but the same applies for things like time series, and other stuff that we are currently building.

Production postmortemThe case of the intransigent new database


A customer called us to tell that they had a problem with RavenDB. As part of their process for handling new customers, they would create a new database, setup indexes, and then direct all the queries for that customer to that database.

Unfortunately, this system that has worked so well in development died a horrible death in production. But, and this was strange, only for new customers, and only in the create new customer stage. The problem was:

  • The user would create a new database in RavenDB. This just create a db record, and its location on disk. It doesn’t actually initialize a database.
  • On the first request, we initialize the db, creating it if needed. The first request will wait until this happens, then proceed.
  • On their production systems, that first request (which they used to create the indexes they require) would time out with an error.

Somehow, the creation of a new database would take way too long.

The first thought we had was they are creating the database on a path of an already existing database, maybe a big one that had a long initialization period, or maybe one that required recovery. But the customer validated that they were creating the database on an empty folder.

We looked at the logs, and the logs just showed a bunch of time were there was no activity. In fact, we had a single method call to open the database that took over 15 seconds to run. Except that on a new database, this method just create a bunch of files to start things out and is ready really quickly.

That is the point that led us to suspect that the issue was environmental. Luckily, as the result of many such calls, RavenDB comes with a pretty basic I/O Test tool. I asked the customer to run this on their production system, and I got the following:image

And now everything was clear. They were running on an I/O constrained system (a cloud machine), and they were running into an interesting problem. When RavenDB creates a database, it pre-allocate some files for its transactional journal.

Those files are 64MB in size, and the total write for a new Esent RavenDB database with default configuration is just over 65MB. If your write throughput is less than 1MB/sec sustained, that will be problematic.

I let the customer know about the configuration option to take less space at startup (Esent RavenDB databases can go as low as 5MB, Voron RavenDB starts at 256Kb), but I also gave them a hearty recommendation to make sure that their I/O rates improved, because this isn’t going to be the only case where slow I/O will kill them.

What is new in RavenDB 3.5Exploring data in the dark


One of the things that tend to happen a lot when we are developing with a database is that we need to peek at the data, and a lot of the time, just looking at the data one document at a time isn’t good for us.

We noticed that a lot of users will create temporary indexes (usually map/reduce ones) to get some idea about what is actually going on in the database, or for one off reporting. That is pretty inefficient, and in order to handle that, we added the Data Exploration feature.

image

 

This feature gives you the option of exploring the data in details. You can even run the query over large data sets, doing some complex aggregations and looking at the results.

Note that this is an admin / developer only feature, we provide no API for this, because the idea is that we have a human sitting in front of the DB going… “Hmm.. I wonder about…”

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  2. Production postmortem (4):
    23 Jul 2015 - The case of the native memory leak
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats