Ayende @ Rahien

It's a girl

Things we learned from production, part IV–is your paperwork in order?

One of the major points that we worked on in the 1.2 release was making the ops team's work easier. That included additional logging, as we have previously discussed, making RavenDB play nicer with other parts of the system, adding performance counters, etc.

But those are the obvious things, and this series isn't about the obvious things. One of the problems that we ran into is that we already had a moderately good porthole into how RavenDB works.

The problem was that this porthole gave you access to the state of a single database, which was great…

Except that in order to get a database's statistics, you had to actually load that database. Imagine a system under load, where the admin needs to check what is causing the load. The act of checking a database's statistics will actually force that database to load, generating even more load. This is especially dangerous when we are talking about automated health monitoring tools; the fact that we monitor the health of our software shouldn't cause it to do additional work.

In RavenDB 1.2 we have taken steps to make sure that we can report on all the active databases without having to guess which ones are active and which aren't. We have also taken additional steps to make sure that we give the admin even more information about what is going on.

You can see this pattern pretty much everywhere, in indexes, in operations, in database and server stats. There are a lot more places where we explicitly built the hooks to make it possible for the admin to figure out what is going on.

The lesson from that is that you have to provide a lot of information for the administrators, so they can figure out what is going on (and that administrator may very well be you, at 2 AM, trying to diagnose a problem). At the same time, you have to be sure to provide those hooks in a way that has minimal impact on the system. Having admin hooks in place that put an undue burden on the application is seriously not a cool thing to do.
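To make that concrete, here is a rough sketch of the approach (illustrative only, not the actual RavenDB API): statistics are reported only for databases that are already resident in memory, so the act of monitoring never triggers a database load. The DocumentDatabase type and its members are made up for the example.

using System.Collections.Concurrent;
using System.Linq;

public class DatabasesLandlord
{
    // Only databases that have actually been loaded end up in this map.
    private readonly ConcurrentDictionary<string, DocumentDatabase> loadedDatabases =
        new ConcurrentDictionary<string, DocumentDatabase>();

    // Called by the admin / monitoring endpoint: reads what is already active,
    // never causes a database to load.
    public object[] GetLoadedDatabasesStatistics()
    {
        return loadedDatabases
            .Select(kvp => new { Name = kvp.Key, kvp.Value.CountOfDocuments })
            .ToArray();
    }

    // Called by a normal request: this is the only path that loads a database.
    public DocumentDatabase GetOrLoad(string name)
    {
        return loadedDatabases.GetOrAdd(name, n => new DocumentDatabase(n));
    }
}

public class DocumentDatabase
{
    public DocumentDatabase(string name) { /* open files, run recovery, etc. */ }
    public long CountOfDocuments { get; set; }
}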

Things we learned from production, part III–singleton thinking makes long queues

One of the more interesting things that we had to learn in production was that we aren't an only child. It is a bit more complex than that, and I am not explaining this well, so let me start at the beginning.

Usually, when we work on RavenDB, we work within the scope of a single database; all of our efforts are usually scoped to that. That means that when we worked on the multi database feature for RavenDB, we actually focused on the process of getting a single database up in the air. We considered how multiple databases would interact, and we made sure that they are isolated from one another, but that was about it.

In particular, as mentioned in the previous post, starting up and shutting down were done sequentially, on a per database basis. In order to prevent issues, we had a lock on the initialize database part of the process, so two requests to the same database will not result in the same database being loaded twice.

I mentioned that we were thinking in a single database mindset, right?

Can you guess what happened?

  • Request for DB #1 – lock acquired, starting up
    • Request for DB #1 – waiting for lock to release
    • Request for DB #1 – waiting for lock to release
    • Request for DB #1 – waiting for lock to release
  • DB initialized, lock released
  • All requests are now freed and can be processed.

What happens when we have multiple databases, however?

  • Request for DB #1 – lock acquired, starting up
    • Request for DB #1 – waiting for lock to release
    • Request for DB #2 – waiting for lock to release
    • Request for DB #3 – waiting for lock to release
  • DB initialized, lock released
  • Request for DB #2 – lock released, lock acquired, starting up
    • Request for DB #3 – waiting for lock to release

You guessed it, we actually had a global lock for starting (or disposing, for that matter) databases. That meant that a single db that took time to start would impact other databases.

More importantly, it would mean that other requests, which were waiting for that database to load and then had to load their own database, had far less time to actually do the processing they needed. Which meant that they were far more likely to run into the request time limit and be aborted by IIS. Which left them in an inconsistent state. Which was a nightmare to figure out.

We resolved this issue by making sure that the lock is now scoped to a single database, and that we won't wait on it forever; if after a while we still don't have the db, we will error out early and give you a 503 Service Unavailable error until the db is ready to rock.
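Conceptually, the fix looks something like the following sketch (not the actual RavenDB code; the type names are made up). The important parts are that each database name gets its own lock, and that waiting on it is bounded, so a slow database load turns into a 503 for that database only, instead of a pile-up behind a global lock.

using System;
using System.Collections.Concurrent;
using System.Threading;

public class DatabaseLoader
{
    private readonly ConcurrentDictionary<string, object> locks =
        new ConcurrentDictionary<string, object>();
    private readonly ConcurrentDictionary<string, Database> databases =
        new ConcurrentDictionary<string, Database>();
    private static readonly TimeSpan MaxWaitForLoad = TimeSpan.FromSeconds(5);

    public Database GetDatabase(string name)
    {
        Database db;
        if (databases.TryGetValue(name, out db))
            return db; // already loaded, no locking needed

        // one lock per database name, so a request for DB #2 never waits on DB #1
        var dbLock = locks.GetOrAdd(name, _ => new object());

        if (Monitor.TryEnter(dbLock, MaxWaitForLoad) == false)
        {
            // someone else is still initializing this database; fail fast with a 503
            throw new ServiceUnavailableException("Database '" + name + "' is still loading");
        }
        try
        {
            return databases.GetOrAdd(name, n => new Database(n)); // the expensive part
        }
        finally
        {
            Monitor.Exit(dbLock);
        }
    }
}

public class Database
{
    public Database(string name) { /* open files, run recovery, etc. */ }
}

public class ServiceUnavailableException : Exception
{
    public ServiceUnavailableException(string message) : base(message) { }
}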

Things we learned from production, part II–wake up or I kill you dead

Getting started is probably easier than shutting down, I mean, no one is going to begrudge us some time to get our feet under us, right?

As it turned out, this assumption is wrong on quite a few levels.

To start with, hosts such as IIS / Windows Service Manager will give you a certain time to start before they decide that you have hung and ruthlessly execute you without even thinking twice about it. This doesn't even include the issue of admins with people breathing down their necks who assume that a taste of mortality must convince RavenDB to try even harder the next time it is started, after the 7th time it was killed for not starting fast enough.

Because killing us during startup is pretty much the same as a standard crash, it means that we need to run recovery after it happens, which means that the next startup is going to take longer, and then…

I think you can get the picture, right?

But the issue here is actually much more complex.

It is actually easier to recover from a real crash (something like a process termination or kill –9). It is harder when it isn't a real crash, but something like IIS just recycling the AppDomain. The reason it is harder is that anything that is scoped to the OS, like file handles, unmanaged resources, etc., is actually still alive. It means that during the crash, you have to be very careful about detecting that you are crashing and cleaning up after yourself properly.

Moving back to the actual startup issue: we have to start up fairly quickly, even if we just crashed. That makes sense, I guess. Now, that is fine and dandy, but that is just for the system database. What happens when you want to access a non-system database (for example, the Northwind database)?

In RavenDB, we load those databases lazily, so on the first request to that particular database, we will load it.

As it turned out, this simple and fairly obvious decision has caused no end of problems.

Starting up a database may take a while; in bad cases, that while may be long enough that the request times out. Now, what does a request timeout mean? You might get a 408 Request Timeout from the server, but that is the client's perspective.

What happens on the server? Well, IIS handed over control of the request to RavenDB, and as far as IIS is concerned, RavenDB is sitting there doing nothing, well above its time limit. Now, IIS doesn’t have a way to tell RavenDB, stop processing this request. So what do you think it does?

Welcome to the nice land of Thread.Abort().

Now, if you have ever read about Thread.Abort(), you probably know that every single reference to that is filled with warnings about the need to be very careful about what you are doing, that it is a very bad idea in general and that you should take care to never use it. The reason it is such a bad idea is that you basically cut the thread at mid execution, leaving it no chance at all to actually handle things. It is an easy way to violate invariants.

In particular, it is a good way for your cleanup to never happen. Think about it, we are in the middle of our constructor, opening files, setting things up, and suddenly the floor is yanked right out from under us.

As it turned out, in those cases, we would leak some stuff out. The next time that you tried to access the database, you would get an error that said that the files were already opened by someone else. (To make things worse, those were unmanaged resources; they wouldn't get cleaned up by the system when the GC runs.)

That led to errors that were extremely hard to figure out, because they would only occur when running at high load, with a db that had crashed and was now recovering, and with a few other databases waiting as well. And going over the code, thinking multi-threading thoughts, got us nowhere. At some point, I had put so many locks there, just to figure out what was going on, that the code looked something like this:
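(The screenshot of that code is no longer available; a rough reconstruction, for flavor only, would be something like this.)

public class OverLockedInit
{
    private static readonly object initLock = new object();
    private static readonly object storageLock = new object();
    private static readonly object fileLock = new object();

    public void Initialize()
    {
        lock (initLock)
        {
            lock (storageLock)
            {
                lock (fileLock)
                {
                    // open files, set up storage, create indexes...
                    // none of these locks helped, because the real culprit
                    // was Thread.Abort() cutting the thread off mid-way.
                }
            }
        }
    }
}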

But the actual problem wasn’t another thread corrupting state, the problem was that the current thread was ruthless killed in mid operation.

Once we figured that one out, it was straightforward, but in no way easy, to devise a solution. We made sure that our db init code was robust for thread aborts, and then we moved the actual db initialization to a separate thread, one that wasn't controlled by IIS, so we could actually get things done without having a hard time limit.
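The shape of the solution is roughly the following sketch (an illustration of the idea, not the actual RavenDB implementation): run the initialization on a thread that we own, so an IIS request-thread abort cannot land in the middle of it, and have the request merely wait on the result with a timeout.

using System;
using System.Threading;
using System.Threading.Tasks;

public static class SafeDatabaseInit
{
    // Runs the expensive initialization on a dedicated thread that IIS does not control.
    // If the request thread gets aborted, the initialization still runs to completion
    // (or fails cleanly), instead of being cut off in the middle of a constructor.
    public static Task<T> RunOnOwnThread<T>(Func<T> init)
    {
        var tcs = new TaskCompletionSource<T>();
        var thread = new Thread(() =>
        {
            try
            {
                tcs.SetResult(init());
            }
            catch (Exception e)
            {
                tcs.SetException(e);
            }
        })
        {
            IsBackground = true,
            Name = "Database initialization"
        };
        thread.Start();
        return tcs.Task;
    }
}

// Usage from the request path: wait with a timeout, answer 503 if the db isn't ready yet.
// var task = SafeDatabaseInit.RunOnOwnThread(() => new Database("Northwind"));
// if (task.Wait(TimeSpan.FromSeconds(5)) == false)
//     return ServiceUnavailable(); // initialization keeps going in the background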

In my next post, I’ll discuss the fallacy of the singleton and how much pain it caused us.

Bug Fixes in OSS environment

A user reported a bug in RavenDB. We tracked that bug down to a race condition in a 3rd party library, which then forced us to fix the bug, and then do the dependency roll up:

image

 

Sigh…

Then again, we could do all of that ourselves.

Things we learned from production, part I–shutting down is hard to do

This series of posts is going to talk about the things that we have learned ourselves and via our customers about running RavenDB in production. Those customers include people running a single database on a Celeron 600 MHz with 512 MB of RAM all the way to monsters like what RavenHQ is doing.

This particular story is about the effect of shutdown on RavenDB in production environments. Before we can do that, I have to explain the sequence of operations when RavenDB shuts down:

  • Stop accepting new connections
  • Abort all existing connections
  • For each loaded database:
    • Shut down indexing
    • For each index:
      • Wait for current indexing batch to complete
      • Flush the index
      • Close the index
    • Close database
  • Repeat the same sequence for the system database
  • Done

I am skipping a lot of details, but that is the gist of it.

In this case, however, you might have noticed something interesting. What happens if we have a large number of active databases, with a large number of actual indexes?

In that case, we have to wait for the current indexing batch to complete, then shut down each of the indexes, then move to the next db, and do the same.

In some cases, that can take a while. In particular, a long enough while that we would get killed. Either by automated systems that decided we passed our threshold (in particular, iisreset gives you a mere 20 seconds to restart, which tends to be not enough) or by an out-of-patience admin.

That sucks, because if you get killed, you don't have the time to do a proper shutdown. You crashed & burned and died and now you have to deal with all the details of proper resurrection. Now, RavenDB prides itself on handling these matters properly. You can yank the power cord out and once everything is back up, RavenDB will recover gracefully and with no data loss.

But, recovering from such scenarios can take precious time. Especially if, as is frequently the case in such scenarios, we have a lot of databases and indexes to recover.

Because of that, we actually had to spend quite a bit of time on optimizing the shutdown sequence. It sounds funny, doesn't it? Very few people actually care about the time it takes them to shut down. But as it turned out, we have a fairly limited budget for that. In particular, we parallelized the process of shutting down all of the databases together, and all of their indexes together as well.

That means more IO contention than before, but at least we could usually meet the deadline. Speaking of which, we also added additional logging and configuration that told common hosts (such as IIS) that we really would like some more time before we would be hung out to dry.
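The parallelized shutdown mentioned above looks conceptually like the following sketch (simplified; the interface names here are invented for the example):

using System.Threading.Tasks;

public interface IIndexToClose
{
    void WaitForCurrentBatch();
    void Flush();
    void Close();
}

public interface IDatabaseToClose
{
    IIndexToClose[] Indexes { get; }
    void StopIndexing();
    void Close();
}

public static class ServerShutdown
{
    public static void Shutdown(IDatabaseToClose[] loadedDatabases)
    {
        // Previously: for each database, for each index, flush & close - strictly sequential.
        // Now: all databases shut down in parallel, and each database closes its indexes
        // in parallel as well. More IO contention, but we fit inside the shutdown budget.
        Parallel.ForEach(loadedDatabases, database =>
        {
            database.StopIndexing();
            Parallel.ForEach(database.Indexes, index =>
            {
                index.WaitForCurrentBatch();
                index.Flush();
                index.Close();
            });
            database.Close();
        });
    }
}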

On my next post, I’ll discuss the other side, how hard it is to actually wake up in the morning Smile.

The subtle distinction between snapshot isolation and read committed

I am using db transaction isolation levels for a reason here, they make it easier to reason about what is going on.

In short, RavenDB currently supports two storage engine options, Esent and Munin. Esent is what we usually use for production, and Munin is usually used for testing. We wrote Munin as a transactional, fully managed, storage engine a while ago. And it has mostly served us well, but Esent is what we usually aim for. That is the production use case.

We recently made a few changes that resulted in test failures on Munin, only in one run out of two dozen or so, while always working with Esent.

Naturally, because of the random nature of the problem, I suspected the issue was a race condition in Munin. That has happened in the past, and obviously such issues are very hard to root out completely. But after finally isolating everything down to a simple test case (writing to two "tables" with associated information), I figured it out.

Munin is working just fine, it hasn't got a speck of a problem. It is just that, when we built it, I built it to support the Read Committed isolation level, while Esent provides the Snapshot isolation level. The code assumes snapshot isolation at some pretty deep level. Obviously, this sort of thing shows up as a race condition, and it is extremely hard to debug, as anyone who has ever dealt with those issues in an RDBMS can testify.
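The distinction in a nutshell, using a toy versioned store (this has nothing to do with Munin's actual internals, it just illustrates the two behaviors): under read committed, a transaction that reads the same key twice can observe a value committed by someone else in between; under snapshot isolation, it keeps seeing the state as of the moment it started.

using System;
using System.Collections.Generic;
using System.Linq;

public class ToyVersionedStore
{
    private readonly List<KeyValuePair<long, int>> versions = new List<KeyValuePair<long, int>>();
    private long currentVersion;

    public long Commit(int value)
    {
        versions.Add(new KeyValuePair<long, int>(++currentVersion, value));
        return currentVersion;
    }

    // Read committed: you always see the latest committed value, whenever you read.
    public int ReadCommitted()
    {
        return versions.Last().Value;
    }

    // Snapshot: you see the latest value committed no later than your transaction's start.
    public int ReadSnapshot(long transactionStartVersion)
    {
        return versions.Last(v => v.Key <= transactionStartVersion).Value;
    }
}

public static class IsolationDemo
{
    public static void Main()
    {
        var store = new ToyVersionedStore();
        long txStart = store.Commit(1); // our transaction starts after this commit

        store.Commit(2);                // another transaction commits in the meantime

        Console.WriteLine(store.ReadCommitted());       // 2 - read committed sees the new value
        Console.WriteLine(store.ReadSnapshot(txStart)); // 1 - snapshot still sees the old one
    }
}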

So my task now is not to fix a bug in Munin, but to actually implement snapshot isolation. As it turned out, actually moving Munin from read committed isolation to snapshot isolation was a lot easier than finding the problem.

I am torn between being pleased that I found the issue, happy that Munin doesn’t have a bug and pissed that it took me that long.


NuGet Perf, The Final Part – Load Testing – Source Code

This is just some logistical cleanups.

The code for the entire series can be found here: https://github.com/ayende/nuget.perf

No, I’ll not do a similar SQL version, if you want to, I would be very interested in seeing one, but that isn’t something that I intend to do.

Yes, it is a simple and trivial implementation, but that was pretty much the whole point. Being able to get to that scale without actually doing anything special is what we strive for in RavenDB.


NuGet Perf, The Final Part – Load Testing – Results ^ 2

After seeing how well RavenDB does in perf testing, I decided to take it up a notch.

  • Starting from 10 users, with a  step duration of 1 sec, add 50 users for each step, all the way to 3,000.
  • Start with a warm up period of 20 seconds, then run the test for 10 minutes.

Let us see what happens, okay?

Just to be clear, this is a RavenDB application running with three thousand concurrent users, on an off-the-shelf laptop, while I was busy doing other stuff.

One word of warning beforehand: because I ran everything on a single machine, just running so many users on the machine significantly slowed down how RavenDB reacted. Basically, the code for managing the perf test took so many resources that RavenDB had to fight to get some to actually answer the queries.

Scared yet? Because here are the results in graph form.

image

Now you can actually see that we have some fluctuations in the graphs; the number of users grows and grows until we get to 3,000, and we have 0.37 second response times.

Again, I remind you, we have done zero optimizations and this is idiomatic RavenDB code. And we were able to serve requests at a frankly pretty amazing rate of speed.

And here they are in their full details:

 

Load Test Summary
Test Run Information
Load test name LoadTest1
Description  
Start time 04/09/12 15:28:48
End time 04/09/12 15:38:48
Warm-up duration 00:00:20
Duration 00:10:00
Controller Local run
Number of agents 1
Run settings used Load
Overall Results
Max User Load 3,000
Tests/Sec 196
Tests Failed 0
Avg. Test Time (sec) 14.3
Transactions/Sec 0
Avg. Transaction Time (sec) 0
Pages/Sec 741
Avg. Page Time (sec) 0.37
Requests/Sec 741
Requests Failed 0
Requests Cached Percentage 0
Avg. Response Time (sec) 0.37
Avg. Content Length (bytes) 3,080
Key Statistic: Top 5 Slowest Pages
URL (Link to More Details) 95% Page Time (sec)
Page 1 0.83
Page 0 0.82
Page 2 0.82
Page 1 0.82
http://localhost:52688/api/search 0.81
Key Statistic: Top 5 Slowest Tests
Name 95% Test Time (sec)
Browsing 20.8
BrowseAndSearch 19.8
Searching 12.9
Test Results
Name Scenario Total Tests Failed Tests (% of total) Avg. Test Time (sec)
Browsing Load 31,843 0 (0) 17.4
BrowseAndSearch Load 33,989 0 (0) 16.8
Searching Load 51,650 0 (0) 10.8
Page Results
URL (Link to More Details) Scenario Test Avg. Page Time (sec) Count
Page 2 Load Browsing 0.40 32,338
Search yui Load Searching 0.39 52,597
Page 1 Load Browsing 0.39 32,627
http://localhost:52688/api/search Load BrowseAndSearch 0.39 68,576
Page 0 Load Browsing 0.38 32,803
Search grid Load Searching 0.38 52,283
Page 1 Load BrowseAndSearch 0.37 34,766
Page 0 Load BrowseAndSearch 0.36 34,982
Search debug Load Searching 0.35 51,991
Search ravendb Load Searching 0.33 51,846
Transaction Results
Name Scenario Test Response Time (sec) Elapsed Time (sec) Count
System Under Test Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
Controller and Agents Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
RAVEN 85.4 1,203
Errors
Type Subtype Count Last Message

Note that the reason for the high CPU usage is that the tests and RavenDB were running on the same machine.

NuGet Perf, The Final Part – Load Testing – Results

The test was run locally (no network involved) on a Lenovo W520 laptop with 8 cores & 8 GB RAM and an SSD drive. The storage engine we used was Esent, using safe transactions. Default RavenDB configuration, running in console mode, with logging disabled.

We took the most obvious approach, both in the code we wrote and in the test approach. I am pretty sure that I'll get a lot of helpful suggestions about the load testing. The code is available here, and you are more than welcome to take it for a spin and get your own results. What is important for me to note is that we have done exactly zero performance tuning. That applies to the index we use, to the code that we wrote, everything. I just wrote things down, and didn't worry about performance, even though this code is going to go through a load test.

Why don’t I worry about it? Because RavenDB is setup to do the Right Thing. It will self optimize itself without you need to take care of that.

With that said, here are the test results:

image

You can see that the red line is the number of users we have, and we have this worrying green line that seems to go crazy…

Except that this is actually the number of pages served. The part that we care about is actually the Avg. Page Time, and that is the blue line.

This line, however, is basically flat no matter the load.

Here are the test results in detail:

 
Load Test Summary
Test Run Information
Load test name LoadTest1
Description  
Start time 04/09/12 14:16:38
End time 04/09/12 14:21:38
Warm-up duration 00:00:20
Duration 00:05:00
Controller Local run
Number of agents 1
Run settings used Run Settings1
Overall Results
Max User Load 300
Tests/Sec 20.0
Tests Failed 0
Avg. Test Time (sec) 12.5
Transactions/Sec 0
Avg. Transaction Time (sec) 0
Pages/Sec 77.1
Avg. Page Time (sec) 0.0062
Requests/Sec 77.1
Requests Failed 0
Requests Cached Percentage 0
Avg. Response Time (sec) 0.0062
Avg. Content Length (bytes) 3,042
Key Statistic: Top 5 Slowest Pages
URL (Link to More Details) 95% Page Time (sec)
Page 0 0.018
Page 0 0.018
Page 2 0.014
http://localhost:52688/api/search 0.014
Search ravendb 0.014
Key Statistic: Top 5 Slowest Tests
Name 95% Test Time (sec)
Browsing 19.3
BrowseAndSearch 17.6
Searching 10.6
Test Results
Name Scenario Total Tests Failed Tests (% of total) Avg. Test Time (sec)
Browsing Load 1,533 0 (0) 16.0
BrowseAndSearch Load 1,685 0 (0) 15.0
Searching Load 2,770 0 (0) 9.00
Page Results
URL (Link to More Details) Scenario Test Avg. Page Time (sec) Count
Page 0 Load Browsing 0.0072 1,629
Page 0 Load BrowseAndSearch 0.0071 1,783
http://localhost:52688/api/search Load BrowseAndSearch 0.0064 3,443
Search ravendb Load Searching 0.0064 2,798
Page 1 Load Browsing 0.0063 1,617
Page 2 Load Browsing 0.0063 1,580
Page 1 Load BrowseAndSearch 0.0063 1,760
Search debug Load Searching 0.0055 2,810
Search grid Load Searching 0.0055 2,839
Search yui Load Searching 0.0054 2,866
Transaction Results
Name Scenario Test Response Time (sec) Elapsed Time (sec) Count
System Under Test Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
Controller and Agents Resources
Machine Name % Processor Time Available Memory at Test Completion (Mb)
RAVEN 13.0 1,356
Errors
Type Subtype Count Last Message

 

You can dig in and look at the data, it is quite interesting. Under the load of 300 users, the average page response time was… 0.0062 seconds.

And RavenDB was using just 13% of the CPU, and that includes the agents running the tests.

In my next post, we will go totally crazy…

NuGet Perf, The Final Part – Load Testing – The Tests

For the tests, we used the VS 2012 load testing tool.

We defined the following tests:

Just browsing through the packages listing:

image

Browsing a bit then searching, and then narrowing the search:

image

And finally, searching a few packages by their id, tags, etc:

image

I then defined the following load test:

image

With the following distribution:

image

Finally, we have the way we actually run the test:

image

We get 20 seconds of warm up, then 5 minutes of tough load.

On my next post, we will see how we did.


NuGet Perf, The Final Part – Load Testing – Setup

So, after talking so long about the perf issues, here is the final part of this series. In which we actually take this for a spin using Load Testing.

I built a Web API application to serve as the test bed. It has a RavenController, which looks like this:

public class RavenController : ApiController
{
    private static IDocumentStore documentStore;

    public static IDocumentStore DocumentStore
    {
        get
        {
            if (documentStore == null)
            {
                lock (typeof (RavenController))
                {
                    if (documentStore != null)
                        return documentStore;
                    documentStore = new DocumentStore
                        {
                            Url = "http://localhost:8080",
                            DefaultDatabase = "Nuget"
                        }.Initialize();
                    IndexCreation.CreateIndexes(typeof (Packages_Search).Assembly, documentStore);
                }
            }
            return documentStore;
        }
    }

    public IDocumentSession DocumentSession { get; set; }

    public override async Task<HttpResponseMessage> ExecuteAsync(HttpControllerContext controllerContext, CancellationToken cancellationToken)
    {
        using (DocumentSession = DocumentStore.OpenSession())
        {
            HttpResponseMessage result = await base.ExecuteAsync(controllerContext, cancellationToken);
            DocumentSession.SaveChanges();
            return result;
        }
    }
}

And now we have the following controllers:

public class PackagesController : RavenController
{
    public IEnumerable<Packages_Search.ReduceResult> Get(int page = 0)
    {
        return DocumentSession.Query<Packages_Search.ReduceResult, Packages_Search>()
            .Where(x=>x.IsPrerelease == false)
            .OrderByDescending(x=>x.DownloadCount)
                .ThenBy(x=>x.Created)
            .Skip(page*30)
            .Take(30)
            .ToList();
    }
}

public class SearchController : RavenController
{
    public IEnumerable<Packages_Search.ReduceResult> Get(string q, int page = 0)
    {
        return DocumentSession.Query<Packages_Search.ReduceResult, Packages_Search>()
            .Search(x => x.Query, q)
            .Where(x => x.IsPrerelease == false)
            .OrderByDescending(x => x.DownloadCount)
                .ThenBy(x => x.Created)
            .Skip(page * 30)
            .Take(30)
            .ToList();
    }
}

And, just for completeness sake, the Packages_Search index looks like this:

public class Packages_Search : AbstractIndexCreationTask<Package, Packages_Search.ReduceResult>
{
    public class ReduceResult
    {
        public DateTime Created { get; set; }
        public int DownloadCount { get; set; }
        public string PackageId { get; set; }
        public bool IsPrerelease { get; set; }
        public object[] Query { get; set; }
    }

    public Packages_Search()
    {
        Map = packages => from p in packages
                          select new
                              {
                                  p.Created, 
                                  DownloadCount = p.VersionDownloadCount, 
                                  p.PackageId, 
                                  p.IsPrerelease,
                                  Query = new object[] { p.Tags, p.Title, p.PackageId}
                              };
        Reduce = results =>
                 from result in results
                 group result by new {result.PackageId, result.IsPrerelease}
                 into g
                 select new
                         {
                             g.Key.PackageId,
                             g.Key.IsPrerelease,
                             DownloadCount = g.Sum(x => x.DownloadCount),
                             Created = g.Select(x => x.Created).OrderBy(x => x).First(),
                             Query = g.SelectMany(x=>x.Query).Distinct()
                         };

        Store(x=>x.Query, FieldStorage.No);
    }
}

That is enough setup, in the next post, I’ll discuss the actual structure of the load tests.

Implementing LRU cache

In my last post I mentioned that checking whether a user is an administrator or not using an Active Directory query can be slow. That means that we can't just make use of that directly; we have to cache it.

When caching is involved, we have to consider a few things. When do we expire the data? How much memory are we going to use? How do we handle concurrency?

The first thing that pops to mind is the usage of MemoryCache, now part of the .NET framework and easily accessible. Sadly, this is a heavyweight object; it creates its own threads to manage its state, which probably means we don't want to use it for a fairly simple feature like this.

Instead, I implemented the following:

public class CachingAdminFinder
{
    private class CachedResult
    {
        public int Usage;
        public DateTime Timestamp;
        public bool Value;
    }

    private const int CacheMaxSize = 25;
    private static readonly TimeSpan MaxDuration = TimeSpan.FromMinutes(15);
    // the log field was not part of the original snippet; assuming an NLog-style logger here
    private static readonly Logger log = LogManager.GetCurrentClassLogger();
    private readonly ConcurrentDictionary<SecurityIdentifier, CachedResult> cache =
        new ConcurrentDictionary<SecurityIdentifier, CachedResult>();


    public bool IsAdministrator(WindowsIdentity windowsIdentity)
    {
        if (windowsIdentity == null) throw new ArgumentNullException("windowsIdentity");
        if (windowsIdentity.User == null)
            throw new ArgumentException("Could not find user on the windowsIdentity", "windowsIdentity");

        CachedResult value;
        if (cache.TryGetValue(windowsIdentity.User, out value) && (DateTime.UtcNow - value.Timestamp) <= MaxDuration)
        {
            Interlocked.Increment(ref value.Usage);
            return value.Value;
        }
        bool isAdministratorNoCache;
        try
        {
            isAdministratorNoCache = IsAdministratorNoCache(windowsIdentity.Name);
        }
        catch (Exception e)
        {
            log.WarnException("Could not determine whatever user is admin or not, assuming not", e);
            return false;
        }
        var cachedResult = new CachedResult
            {
                Usage = value == null ? 1 : value.Usage + 1,
                Value = isAdministratorNoCache,
                Timestamp = DateTime.UtcNow
            };

        cache.AddOrUpdate(windowsIdentity.User, cachedResult, (_, __) => cachedResult);
        if (cache.Count > CacheMaxSize)
        {
            foreach (var source in cache
                .OrderByDescending(x => x.Value.Usage)
                .ThenBy(x => x.Value.Timestamp)
                .Skip(CacheMaxSize))
            {
                if (source.Key == windowsIdentity.User)
                    continue; // we don't want to remove the one we just added
                CachedResult ignored;
                cache.TryRemove(source.Key, out ignored);
            }
        }

        return isAdministratorNoCache;
    }

    private static bool IsAdministratorNoCache(string username)
    {
       // see previous post
    }
}

Amusingly enough, properly handling the cache takes (much) more code than it takes to actually get the value.

We use ConcurrentDictionary as the backing store for our cache, and we enhance the value with usage & timestamp information. Those come in handy when the cache grows too big and needs to be trimmed.

Note that we also make sure to check the source every 15 minutes or so, because there is nothing as annoying as "you have to restart the server for it to pick up the change". We also handle the case where we can't get this information for some reason.

In practice, I doubt that we will ever hit the cache max size limit, but I wouldn't have been able to live with myself without adding the check :)

Are you an administrator?

In RavenDB vNext, we tightened the security story a bit. Some operations that used to be possible for standard users are now administrator operations. For example, creating a new database requires you to be an admin.

Figuring out whether you are an admin is a bit tough, though. In particular, we use the following logic to determine that:

  • If you logged in using OAuth, the credentials will tell us whether you are an admin or not.
  • If you are logged in using Windows Auth, we make the following assumption:
    • If you are a Windows Admin, you are an administrator (ouch!).
    • If you are running on the same user as the one RavenDB is using, you are an administrator (debug / dev scenarios).
  • If you are running embedded, you are admin.

You might have noticed that there is an "ouch" on the Windows Admin line. The reason for that is that it is actually quite hard to figure that one out. RavenDB is running as a web server, and when we use Windows Auth, we get a WindowsIdentity that we can use. The problem is with UAC. When that is turned on, what we get is the non-elevated user. But that user is not an Admin in the Windows sense of the word. We don't actually care about elevation (it isn't like we need to impersonate the user); we just use this as a "yes/no" for certain ops.

This is documented here: https://connect.microsoft.com/VisualStudio/feedback/details/679546/problem-with-windowsprincipal-isinrole-when-uac-is-enabled

The resolution is by design.

So… we need another way to check for this. Luckily, since we don’t need impersonation, we can just check Active Directory for that. Here is how we do so:

private static bool IsAdministratorNoCache(string username)
{
    PrincipalContext ctx;
    try
    {
        Domain.GetComputerDomain();
        try
        {
            ctx = new PrincipalContext(ContextType.Domain);
        }
        catch (PrincipalServerDownException)
        {
            // can't access domain, check local machine instead 
            ctx = new PrincipalContext(ContextType.Machine);
        }
    }
    catch (ActiveDirectoryObjectNotFoundException)
    {
        // not in a domain
        ctx = new PrincipalContext(ContextType.Machine);
    }
    var up = UserPrincipal.FindByIdentity(ctx, username);
    if (up != null)
    {
        PrincipalSearchResult<Principal> authGroups = up.GetAuthorizationGroups();
        return authGroups.Any(principal =>
                              principal.Sid.IsWellKnown(WellKnownSidType.BuiltinAdministratorsSid) ||
                              principal.Sid.IsWellKnown(WellKnownSidType.AccountDomainAdminsSid) ||
                              principal.Sid.IsWellKnown(WellKnownSidType.AccountAdministratorSid) ||
                              principal.Sid.IsWellKnown(WellKnownSidType.AccountEnterpriseAdminsSid));
    }
    return false;
}

Here we check whether the user is directly or indirectly an admin. Note that we have to take care of cases in which we are running inside or outside a domain, as well as cases where the domain controller is down.

This works, but there is just one problem with that, it is sloooow. As in, multiple seconds slow. Even on the local machine without any domain involved.

I’ll discuss how we solved that on my next post.

On Professional Code

Trystan made a very interesting comment on my post about unprofessional code:

I think it's interesting that your definition of professional is not about SOLID code, infrastructure, or any other technical issues. Professional means that you, or the support staff, can easily see what the system is doing in production and why.

It is a pretty accurate statement, yes. More to the point, a professional system is one that can be supported in production easily. About the most unprofessional thing that you can say is: “I have no idea what is going on.”

Expanding on this, we have been paying a LOT of attention recently to production readiness. We can't afford not to. Just building the software is often not enough for us. In many cases, if there is a problem, we can't just debug through the process, either because reproducing the problem is too hard or because it happens at a client site with their own private data. Even more important than that, if we can give the ops team the tools to actually see what is going on within the system, we drastically reduce the number of support calls we have to take.

Not to mention that software that actively supports and helps the ops team gets into the actual data center a lot faster and easier than software that doesn't. Sure, clean code is important, but production ready code is often not clean code. I read this a long time ago, and it stuck:

Back to that two page function. Yes, I know, it's just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I'll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn't have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.

Each of these bugs took weeks of real-world usage before they were found. The programmer might have spent a couple of days reproducing the bug in the lab and fixing it. If it's like a lot of bugs, the fix might be one line of code, or it might even be a couple of characters, but a lot of work and time went into those two characters.

Some parts of the RavenDB code are ugly. The HttpServer class, for example, goes on for over a thousand lines of mostly error detection and recovery code. But it works, and it allows us to inspect it on a running production server.

That is important, and that makes the difference between good code and production-worthy code.

Rule out the stupid stuff first: Select still ain’t broken

So, I sent a simple repro to the client and he reproduced the problem, and I was happy, because it ain’t my problem.

The good thing about sending a simple repro app to the client is that he can play with that easily, and that is how he discovered that it is my fault. In particular, it is all about the actual buffer size that we use.

If we try to send a large dataset using a 4KB buffer, it takes much longer than it would using a 128KB buffer. But only when using a real network, not when running locally.

After looking at the matter for a while, I figured it out. When using the default RavenDB built-in server (based on the .NET HttpListener), it is actually flushing everything to the network card every time that you make a write. And there appears to be a not insubstantial cost to just doing that. I suspect that most of the cost is actually moving from user land to http.sys to do the work, but the problem was fairly clear.

When you have an expensive resource like that, there is a solution for it: buffering. And luckily for us, the .NET framework comes with a BufferedStream. Sadly, it uses a single buffer, and we don't know ahead of time how much data we are going to write. Even more important, it creates its own buffers; about the only thing you can customize there is the buffer size.

So sure, we can just wrap it in a BufferedStream and set the buffer size to 128Kb and be done with it, right? Not quite.
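For reference, the naive version of that idea is a one-liner (illustrative snippet only):

// Wrap the response stream so writes go to http.sys in 128Kb chunks instead of
// one network flush per small write. Simple, but it allocates a 128Kb buffer per
// request, which lands straight on the Large Object Heap.
var bufferedResponse = new BufferedStream(context.Response.OutputStream, 128 * 1024);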

The reason this is problematic is that this would allocate a buffer of 128Kb for every request, and buffers of that size go to the Large Object Heap. Never mind that just allocating that much memory has its own performance issues.

So, here is the problem, and the question is how we deal with it. There is a good solution for the problem of allocating a lot of memory and filling up the Large Object Heap, and that is to use the BufferManager that comes with the framework. I discussed this in this post. Next, we wrote a Buffer Pool Stream, which uses a buffer taken from a buffer manager. This resolves the problem of filling up the Large Object Heap, but it creates another problem: if for every request we use up a 128Kb buffer, that means that we would use up a lot of memory that we probably don't need. Admittedly, even if we had a thousand concurrent requests, it would still amount to less than 130 MB, but that still bothers me.

Instead, we took a different path. We use multiple buffers of fixed sizes (4Kb, 8Kb, 16Kb, 32Kb, 64Kb, 128Kb) to buffer the response, and we switch between them on demand.

Here is how it works now:

Size         Buffer
<= 8 Kb      4 Kb
<= 28 Kb     8 Kb
<= 60 Kb     16 Kb
<= 124 Kb    32 Kb
<= 252 Kb    64 Kb
> 252 Kb     128 Kb

For the first 8 Kb, we will use a 4 Kb buffer, then switch to an 8 Kb buffer until we get to 28 Kb, then a 16 Kb buffer all the way to 60 Kb, etc. In our tests, this strategy showed the best trade-off of time vs. memory on both large and small requests. And we make sure to use big buffers only when we absolutely have to.
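A much simplified sketch of the resulting stream looks something like this (the real implementation handles more edge cases and switching rules; the BufferManager is the one from System.ServiceModel.Channels that the earlier post talked about):

using System;
using System.IO;
using System.ServiceModel.Channels; // BufferManager

// A write-only stream that buffers output using pooled buffers, starting small and
// switching to bigger pooled buffers as the response grows.
public class BufferPoolStream : Stream
{
    private static readonly int[] BufferSizes = { 4 * 1024, 8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024 };
    private readonly Stream inner;
    private readonly BufferManager bufferManager;
    private byte[] buffer;
    private int used;
    private int sizeIndex;

    public BufferPoolStream(Stream inner, BufferManager bufferManager)
    {
        this.inner = inner;
        this.bufferManager = bufferManager;
        buffer = bufferManager.TakeBuffer(BufferSizes[0]);
    }

    public override void Write(byte[] data, int offset, int count)
    {
        while (count > 0)
        {
            int space = BufferSizes[sizeIndex] - used;
            if (space == 0)
            {
                FlushBuffer(switchToBiggerBuffer: true);
                continue;
            }
            int toCopy = Math.Min(space, count);
            Buffer.BlockCopy(data, offset, buffer, used, toCopy);
            used += toCopy;
            offset += toCopy;
            count -= toCopy;
        }
    }

    private void FlushBuffer(bool switchToBiggerBuffer)
    {
        inner.Write(buffer, 0, used);
        used = 0;
        if (switchToBiggerBuffer && sizeIndex < BufferSizes.Length - 1)
        {
            bufferManager.ReturnBuffer(buffer);
            sizeIndex++;
            buffer = bufferManager.TakeBuffer(BufferSizes[sizeIndex]);
        }
    }

    public override void Flush()
    {
        if (used > 0)
            FlushBuffer(switchToBiggerBuffer: false);
        inner.Flush();
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            Flush();
            bufferManager.ReturnBuffer(buffer);
        }
        base.Dispose(disposing);
    }

    // The rest is the boilerplate required for a write-only Stream:
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { throw new NotSupportedException(); } set { throw new NotSupportedException(); } }
    public override int Read(byte[] buf, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}

You would wrap the HttpListener response output stream with this, sharing a single BufferManager instance across requests so that the buffers actually get reused.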

The moral of the story: once you get your repro and see what is actually happening, dig a bit deeper. You might find some gold there.

In our case, we optimized network traffic significantly when you are running in service / debug mode. This is very relevant for a very common scenario, downloading the Silverlight xap for the management studio, which should improve quite a bit now :)


Rule out the stupid stuff first

A customer reported that they have problems downloading large (10+ MB) attachments from RavenDB over the network. The problem wasn’t present when running locally.

The first thing I did was go and check how we are loading and sending the data to the user. But it seemed fine; we were doing buffered reads from disk and streaming the results to the network.

I then wrote the simplest thing that I could think of to reproduce this issue:

class Program
{
    static void Main(string[] args)
    {
        var listener = new HttpListener
            {
                Prefixes = {"http://+:8080/"}
            };
        listener.Start();
        while (true)
        {
            var context = listener.GetContext();
            var sp = Stopwatch.StartNew();
            var chunks = int.Parse(context.Request.QueryString["chunks"]);
            var _4kb = new byte[4*1024];
            for (int i = 0; i < chunks; i++)
            {
                context.Response.OutputStream.Write(_4kb, 0, _4kb.Length);
            }
            context.Response.Close();
            var totalSize = (double) (_4kb.Length*chunks);
            Console.WriteLine("{0:#,#.##;;0} mb in {1:#,#}", Math.Round((totalSize / 1024) / 1024, 2), sp.ElapsedMilliseconds);
        }
    }
}

This is a simple HTTP server, which will stream to you as much data as you request, as fast as it can. The nice thing about it is that it is a purely in-memory device, no IO / waiting required. So it should be able to saturate the network easily.

Using this, we can determine whether the problem is really with RavenDB or something else.

The customer came back with a report that on their network, a 2MB request was taking over a minute. Considering that we are talking about internal networks, that sounded like we had identified the problem. There is something fishy going on with the network. Now it is in the hands of the sys admin, I just hope he is not this guy.


NuGet Perf, Part VIII: Correcting a mistake and doing aggregations

I hope this is the last one, because I can never recall what is the next Latin number.

At any rate, it has been pointed out to me that I made an error in importing the data. I assumed that the DownloadCount field that I got from the Nuget API is the download count for the specific package, but it appears that this is the total download count, across all versions of this package. The actual download number for a specific version is in VersionDownloadCount.

That changes things a bit, because the way Nuget sorts things is based on the total download count, not the download count for a specific version. The reason this complicates things is that we aren't going to store the total download count in all the version documents. First, let us see the sort of query we need to write. In SQL, it would look like this:

select top 30 skip 30 
    Id,
    PackageId,
     Created, 
    (select sum(VersionDownloadCount) from Packages all where all.PackageId = p.PackageId) as TotalDownloadsCount
from Packages p
where IsPrerelease = 0
order by TotalDownloadsCount desc, Created

This is a much simplified version of the real query, and something that you can't actually write this simply in SQL, most probably. But it gets the point across.

Note that in order to process this query, the RDBMS would have to first aggregate all of the data (for each row, mind), then do the paging, then give you the results. Sure, you can keep a counter for all the downloads for a package, but considering the fact that downloads are highly parallel and happen all the time, you would constantly be waiting for writers to finish doing their updates.

Instead, with RavenDB, we are going to use a map/reduce index and query on that.

image

This should be fairly simple to follow. In the map we go over all the packages, and output their package id, whether they are a pre-release, the specific version download count and the date each was created.

In the reduce, we group by the package id and whether it was pre-released or not (I am assuming that we usually don't want to show the pre-release stuff there).

Finally, we sum up all of the individual package downloads and we output the oldest created date. Using all of that, we can now move to the next step, and actually query that:

image
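In code, the same query would look roughly like this (reusing an index shaped like the Packages_Search class that appears in the load testing setup elsewhere on this page):

using (var session = store.OpenSession())
{
    var mostDownloaded = session.Query<Packages_Search.ReduceResult, Packages_Search>()
        .Where(x => x.IsPrerelease == false)
        .OrderByDescending(x => x.DownloadCount)
            .ThenBy(x => x.Created)
        .Skip(30)   // second page
        .Take(30)
        .ToList();
}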

There  is a small bug here, since I don’t see RavenDB in the results,  but I guess I’ll have to wait until I get the updated data from Nuget.

Actually, that is not quite true; for pre-release software, we are pretty high up:

image

That explains much, RavenDB 1.2 is pretty awesome.

NuGet Perf, Part VII AKA getting results is only half the work

So far, we have been focusing on various ways to get the raw results from RavenDB: what are the packages that match your queries, and whether we can be really smart about it.

But let us say that we got the results that we wanted, this is still just half the work, because we can give the user additional information about those results. In particular, in this post I am going to talk about facets.

Facets are a way to provide easily understood context to a search, allowing the user to narrow down what he is looking for quickly. In our case, let us take a look at what it takes to add facet support to our NuGet console app. The first thing to do, of course, is to actually define the facets we want to work on. In this case, we care only about the Tags:

using (var session = store.OpenSession())
{
    session.Store(new FacetSetup
        {
            Id = "facets/PackagesTags",
            Facets =
            {
                new Facet
                    {
                        Name = "Tags",
                        MaxResults = 4,
                        Mode = FacetMode.Default,
                        TermSortMode = FacetTermSortMode.HitsDesc
                    }
            },
        });
    session.SaveChanges();
}

When doing a facet search using this document, we will use the Tags field, with a value per term found. We want to get the top 4, sorted by their hits.

And here is how we are actually doing the faceted query:

var facetResults = q.ToFacets("facets/PackagesTags");
foreach (var result in facetResults.Results)
{
    Console.WriteLine();
    Console.Write("{0}:\t", result.Key);
    foreach (var val in result.Value.Values)
    {
        Console.Write("{0} [{1:#,#}] | ", val.Range, val.Hits);
    }
    Console.WriteLine();
}

It is a one liner, with all of the rest of the code dedicated to just printing things out.

Finally, here are the results:

image

As you can see, searching for "dal", we can narrow the search to linq, orm, etc. Searching for events, we get reactive extensions, etc.

Using facets gives the user additional information about his search (including things like, am I close to what I want), discoverability over your dataset and additional tools to explore it.

All in all, I think that this is a pretty neat thing.

NuGet Perf, Part VI AKA how to be the most popular dev around

So far, we have imported the NuGet data into RavenDB, seen how we can get it out for the packages page, and then looked into how we can utilize RavenDB features to help us in package search. I think we did a good job there, but we can probably do better still. In this post, I am going to stop showing off things in the Studio and focus on code. In particular, advanced searching options.

We will start from the simplest search possible. Or not, because we are doing full text search and quite a few other things besides, even in the baseline search. Anyway, here is the skeleton program:

while (true)
{
    Console.Write("Search: ");
    var search = Console.ReadLine();
    if(string.IsNullOrEmpty(search))
    {
        Console.Clear();
        continue;
    }
    using (var session = store.OpenSession())
    {
        var q = session.Query<PackageSearch>("Packages/Search")
            .Search(x => x.Query, search)
            .Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.IsPrerelease == false)
            .As<Package>()
            .OrderByDescending(x => x.DownloadCount).ThenBy(x => x.Created)
            .Take(3);
        var packages = q.ToList();

        foreach (var package in packages)
        {
            Console.WriteLine("\t{0}", package.Id);
        }
    }
}

Now, we are going to run this and see what we get.

image

So far, so good. Now let us try to improve things. What happens when we search for “jquryt”? Nothing is found, and that is actually pretty sad, because to a human, it is obvious what you are trying to search on.

If you have fat fingers and have a tendency to creatively spell words, I am sure you can empathize with this feeling. Luckily for us, RavenDB is going to help, let us see how:

image

What?!

How did it do that? Well, let us look at the changes in the code, shall we?

private static void PerformQuery(IDocumentSession session, string search, bool guessIfNoResultsFound = true)
{
    var packages = session.Query<PackageSearch>("Packages/Search")
        .Search(x => x.Query, search)
        .Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.IsPrerelease == false)
        .As<Package>()
        .OrderByDescending(x => x.DownloadCount).ThenBy(x => x.Created)
        .Take(3).ToList();

    if (packages.Count > 0)
    {
        foreach (var package in packages)
        {
            Console.WriteLine("\t{0}", package.Id);
        }
    }
    else if(guessIfNoResultsFound)
    {
        DidYouMean(session, search);
    }
    else
    {
        Console.WriteLine("\tNo search results were found");
    }
}

The only major change was the call to DidYouMean(), so let us see what is going on in there.

private static void DidYouMean(IDocumentSession session, string search)
{
    var suggestionQueryResult = session.Query<PackageSearch>("Packages/Search")
        .Search(x => x.Query, search)
        .Suggest();
    switch (suggestionQueryResult.Suggestions.Length)
    {
        case 0:
            Console.WriteLine("\tNo search results were found");
            break;
        case 1:
            // we may have it filtered because of the other conditions, don't recurse again
            Console.WriteLine("\tSearch corrected to: {0}", suggestionQueryResult.Suggestions[0]);
            Console.WriteLine();

            PerformQuery(session, suggestionQueryResult.Suggestions[0], guessIfNoResultsFound: false);
            break;
        default:
            Console.WriteLine("\tDid you mean?");
            foreach (var suggestion in suggestionQueryResult.Suggestions)
            {
                Console.WriteLine("\t - {0} ?", suggestion);
            }
            break;
    }
}

Here, we ask RavenDB, "we couldn't find anything with what we had, can you give us some other ideas?" RavenDB can check the actual data that we have on disk and suggest similar alternatives.

In essence, we asked RavenDB for what is nearby, and it provided us with some useful suggestions. Because the suggestions are actually based on the data we have in the db, searches on that will produce correct results.

Note that we have three code paths here; if there is exactly one suggestion, we are going to select that immediately. Let us see how this looks in practice:

image

Users tend to fall in love with those sort of features, and with RavenDB you can provide them in just a few lines of code and absolutely no hassle.

In my next post (and probably the last in this series) we will discuss even more awesome search features :)

NuGet Perf, Part V–Searching Packages

Now we get to the good parts: actually doing searches for packages, not just showing them in the packages page, but doing complex and interesting searches. The current (after optimization) query looks like this:

SELECT        TOP (30)
       -- fields removed for brevity
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        (Packages.IsPrerelease <> cast(1 as bit))
                                AND    (Packages.IsLatestStable = 1)
                                AND    (Packages.IsLatest = 1)
                                AND    (
                                        PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~'
                                    OR    PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Title LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Title LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Tags LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Tags LIKE '%ui%' ESCAPE N'~'
                                    )
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

I can hear the DB whimpering in fear in a dark corner, where it is hiding while it isn’t being flogged by cruel and unusual queries.

Okay, there is a certain amount of hyperbole here, I'll admit. But at least it is funny.

At any rate, here we have a query that allows the user to search for the latest stable packages by their id, title or tags. To make things interesting for the DB, all queries use the '%jquery%' form. This is something that practically every single resource you can find about databases will warn you against. You can read why here. I think we can safely assume that the NuGet guys do not use EF Prof, or they wouldn't have gone this route.

Actually, I am being unfair here. There really aren't many other good options when you start to need these sorts of things. Yes, I know of SQL Server Full Text Indexes; they are complex to set up and maintain and they don't provide enough facilities to do interesting stuff. They are also more complex to program against. You could maintain your own indexes on the side (Lucene, Fast, etc). Now you have triple the amount of work to do, and the care and maintenance of those isn't trivial, for either the devs or the ops team.

So I can certainly follow why the decision was made to use LIKE '%jquery%', even though it is a well-known problem.

That said, it is the wrong tool for the job, and I think that RavenDB can do a lot more and in more interesting ways as well.

Let us see the index that can handle these sorts of queries.

image
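In code, the index from the screenshot would look roughly like this (a sketch; the exact field names on the Package class are assumed from the rest of this series, and the important part is marking the combined Query field as analyzed):

public class Packages_FullTextSearch : AbstractIndexCreationTask<Package, Packages_FullTextSearch.Result>
{
    public class Result
    {
        public bool IsLatestVersion { get; set; }
        public bool IsAbsoluteLatestVersion { get; set; }
        public bool IsPrerelease { get; set; }
        public object[] Query { get; set; }
    }

    public Packages_FullTextSearch()
    {
        Map = packages => from p in packages
                          select new
                              {
                                  p.IsLatestVersion,
                                  p.IsAbsoluteLatestVersion,
                                  p.IsPrerelease,
                                  p.DownloadCount,
                                  p.Created,
                                  Query = new object[] { p.Id, p.Title, p.Tags }
                              };

        // the Query field goes through full text analysis, so "jquery ui" matches
        // terms coming from the id, the title and the tags
        Index(x => x.Query, FieldIndexing.Analyzed);
    }
}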

What does this index do?

Well, it indexes a bunch of fields to allow them to be searched by value, but it also does something else that is quite interesting. The Query field in the index takes information from several different fields that are all indexed as one. We also specify that this index will treat the Query field as the target for full text analysis. This means that we can now write the following query:

image

In code, this would look like this:

var results = session.Query<Package_Search.Request, Package_Search>()
    .Where(x=> x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.IsPrerelease == false)
    .Search(x=>x.Query, userSearchTerms)
    .OrderByDescending(x=>x.DownloadCount).ThenBy(x=>x.Created)
    .Take(30)
    .As<Package>()
    .ToList();

This will generate the query you can see above, and return the first 30 results.

But a lot more is actually happening here, let us look at what actually goes on in the index:

image

Here you can see the actual terms that were indexed in the database for each of the documents. The reason that this is important is that when it comes time to do searches, we aren't going to need to do anything as crass as a full table scan, which is what SQL has to do. Instead, all of those terms are located in an index, and we have the <<jquery ui>> search string. We can then do a very simple index lookup (the cost of that is O(log N), if you'll recall) to find your results.

And of course, we have this guy:

image

So I am pretty happy about this so far, but we can probably do better. We will see how in our next post.

RavenDB Role Playing with RPG With Me

Join us for a discussion with Rob Eisenberg, about RPG With Me, a compelling and beautiful product for tabletop role playing games.
In this webinar, we will discuss how RPG With Me came about, what role RavenDB plays (the Wizard, of course) in building RPGWithMe and the use of cloud technologies to speed development and deployment, including using RavenHQ, a hosted RavenDB provider.

Space is limited.
Reserve your Webinar seat now at:
https://www2.gotomeeting.com/register/675590066
