time to read 2 min | 203 words

Very often I see people trading startup performance for overall performance. That is generally a good approach, but it has huge implications when it gets out of hand.

There is one type of user that pays the startup cost of the system over and over and over again. It is YOU!

The developer is the one who keeps restarting the application, and an overly long startup procedure will directly affect the way developers work with the application.

Not only that, but since startup times can get ridiculous quickly, developers start working around the system to avoid paying the startup costs. I have seen a system that had a startup time of about three minutes, and the developers had moved a lot of the actual application logic outside the system (into XML files, mostly) in order to avoid having to restart it.

This is something that you should keep in mind when you trade off startup performance for overall performance. It is usually preferable either to do things in the background, or to lazily page things in as demand requires.

This is usually something that requires a more complex solution, though.
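
To make that concrete, here is a minimal sketch of the lazy approach (the RulesEngine type and its two second constructor are made up for illustration; any expensive-to-build component works the same way):

using System;
using System.Threading;

public class RulesEngine // hypothetical expensive component, invented for this example
{
	public RulesEngine()
	{
		Thread.Sleep(2000); // stand-in for parsing configuration, building indexes, etc.
	}

	public bool Evaluate(string input)
	{
		return input.Length > 0;
	}
}

public static class Program
{
	// Built on first access rather than at startup; Lazy<T> is thread safe
	// by default.
	private static readonly Lazy<RulesEngine> Engine =
		new Lazy<RulesEngine>(() => new RulesEngine());

	public static void Main()
	{
		Console.WriteLine("Application started instantly.");
		// The two second cost is paid here, on first use.
		Console.WriteLine(Engine.Value.Evaluate("hello"));
	}
}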

NH Prof Stats

time to read 1 min | 166 words

I was just doing some work on NH Prof when I ran into the line counter button, and I thought the results were interesting. Here is the breakdown of line counts in the application:

  • Appender - 818 lines of code
    The Appender is responsible for communication with the profiled application
  • Backend - 3,548 lines of code
    The Backend is responsible for translating the event stream from the Appender into a coherent model, applying analysis, and sending the results to the UI.
  • Client - 6,430 lines of code
    The Client is where all the UI work is done. It contains the view model (which is a large part of this assembly), reports, user interface work, and other such things that directly relate to what the user is doing with the application. Note that this includes things like error reporting, actually executing queries, and other things that are not strictly UI, but are directly related to the way the client works.

This includes only C# files, not XAML files.

time to read 6 min | 1081 words

I mentioned before that NH Prof has a problem with dealing with a lot of activity from the profiled application. In essence, until it catches up with what the application is doing, there isn't much you can do with the profiler.

I consider this a bug, and I decided to track down the issue with some verifiable results, not just rely on hunches. The results were... interesting. The actual problem was in the place I expected it to be, but it was more complex than expected. Let me try to explain.

The general architecture of NH Prof is shown here:

[Image: NH Prof architecture diagram]

The actual trouble, from the UI perspective, is in the last stage: actually updating the UI. UI updates are done directly on a (very) rich view model, which supports a lot of the functionality that NH Prof provides.

Because, as usual, we have to synchronize all access to the UI, we actually update the UI through a dispatcher class. The relevant code is this:

private void SendBatchOfUpdates()
{
	while (true)
	{
		int remainingCount;
		Action[] actions;
		lock (locker)
		{
			// Grab a batch of up to 100 queued view model updates
			// and remove them from the queue.
			actions = queuedActions
				.Take(100)
				.ToArray();

			for (int i = 0; i < actions.Length; i++)
			{
				queuedActions.RemoveAt(0);
			}
			remainingCount = queuedActions.Count;
		}

		if (actions.Length == 0)
		{
			// Nothing to do; back off before polling again.
			Thread.Sleep(500);
			continue;
		}

		// Marshal the whole batch to the UI thread in one call.
		dispatcher.Invoke((Action)delegate
		{
			using(dispatcher.DisableProcessing())
			{
				var sp = Stopwatch.StartNew();
				foreach (var action in actions)
				{
					action();
				}
				// Log queue depth and batch duration; this feeds the graph below.
				Debug.WriteLine(remainingCount + "\t" + sp.ElapsedMilliseconds);
			}
		});
	}
}

We queue all the operations on the view model, batch them, and execute them against the UI. This is the behavior as it stands right now (that is, with the problem). Looking at the code, it should be obvious to you that there is one glaring issue. (Well, it should be obvious; it wasn't to me until I dug deeper into it.)

If the number of queued actions is greater than the batch size, we are going to try to drain the queue. But as long as we add actions to the queue faster than we can process them, we keep hammering the UI: we keep it busy with updates all the time and never give it time to process user input.

Note the debug line at the end; from it, I can generate the following graph:

[Image: graph of batch processing duration and remaining queue size over time]

The duration line is the number of milliseconds it took to process a batch of 100 items; the red line is the number of items remaining in the queue that we still need to drain. You can clearly see the spikes where more work was added to the queue before we could deal with it.

You can also see that doing the actual UI work is slow. In many cases it takes hundreds of milliseconds to process a batch, and at times it takes seconds. That is because, again, we have a very rich view model, and a lot of stuff is bound to it.

Before I can go forward, I need to describe a bit how NH Prof actually talks to the UI. Talking to the UI is done by raising events when interesting things happen. The problem I initially tried to solve was handling an input model that is a stream of events, and that mapped nicely to updating the UI as I got the information.

Let us say that we opened a session, executed a single SQL statement, and then closed the session. What are the implications for UI marshalling?

  1. Create new session
  2. Add new statement
  3. Add alert, not running in transaction
  4. Update statement with parsed SQL
  5. Update statement with row count
  6. Add alert, unbounded row set
  7. Update statement duration
  8. Close session
  9. Update session duration

So for the simplest scenario, we have to go to the UI thread nine times. Even when we batch those calls, we still end up going to the UI thread too often, and updating the UI frequently. That, as I mentioned, can take time.

Now, a very easy fix for the problem is to move the thread sleep above the if statement. That would introduce a delay in talking to the UI, giving it time to catch up with user input. This would make the application responsive even during periods when there are many actions to perform in the UI.
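
For clarity, here is roughly what that fix looks like (a sketch of the loop from SendBatchOfUpdates above, with the unchanged parts elided):

	while (true)
	{
		// ... dequeue a batch under the lock, as before ...

		// Sleep first, unconditionally, so the UI thread gets breathing
		// room between batches even when the queue never fully drains.
		Thread.Sleep(500);

		if (actions.Length == 0)
			continue;

		// ... dispatcher.Invoke(...) as before ...
	}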

The downside is that while this works, it means a slow stream of updates into the application. That is actually an acceptable solution, except that we find ourselves in the same situation when processing an offline profiling session. When those are big, it can take a really long time to process them.

Therefore, this is not a good solution.

A much better solution would be to reduce the number of times that we go to the UI. We can reduce the number of trips to the UI significantly by changing the process to:

  1. Create new session
  2. Add new statement (with alerts, parsed SQL, row count and duration)
  3. Close session (with duration)

That requires some changes to the way that we work, but those are actually good changes.

This is not quite enough, though. We can do better when we realize that we only show a single session at a time. So most often, we can reduce the work to:

  1. Create new session
  2. Close session (with duration, number of statements, alerts aggregation, etc.)

If we also add batching to the mix, we can reduce it further to:

  1. Create new session & close it (with duration, number of statements, alerts aggregation, etc)

We have to support both of them, because we need to support both streaming & debugging scenarios.

We don't need to let the UI know about the statements at all; we can give them to the UI when it asks for them, when you select a session in the sessions list.

And now, when we are looking at a session, we can raise an event only for that session, when a statement is added to it. As an added bonus, we are going to batch that as well, so we will update the UI about it every 0.5 seconds or so. At that point, we might as well aggregate the changes, and instead of sending just the new statement, we can send a snapshot of the currently viewed session. The idea here is that if we added several statements, we can reduce the amount of UI work by making a single set of modifications to the UI, as the sketch below shows.
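
To illustrate the snapshot idea, here is a minimal sketch. All the names here (SessionSnapshot, SessionUpdatePublisher, the half-second timer wiring) are assumptions invented for illustration, not NH Prof's actual code; the point is that changes to the currently viewed session are coalesced, and at most one update reaches the UI per interval:

using System;
using System.Threading;

public class SessionSnapshot // hypothetical: an immutable view of one session
{
	public Guid SessionId;
	public int StatementCount;
	public TimeSpan Duration;
}

public class SessionUpdatePublisher
{
	private readonly object locker = new object();
	private SessionSnapshot pending; // only the latest snapshot matters
	private readonly Action<SessionSnapshot> publishToUi;
	private readonly Timer timer;

	public SessionUpdatePublisher(Action<SessionSnapshot> publishToUi)
	{
		this.publishToUi = publishToUi;
		// Push at most one update to the UI every half second.
		timer = new Timer(_ => Flush(), null, 500, 500);
	}

	// Called for every change to the currently viewed session. Instead of
	// queuing each change, we just remember the most recent snapshot;
	// intermediate states are dropped.
	public void SessionChanged(SessionSnapshot snapshot)
	{
		lock (locker)
			pending = snapshot;
	}

	private void Flush()
	{
		SessionSnapshot toPublish;
		lock (locker)
		{
			toPublish = pending;
			pending = null;
		}
		if (toPublish != null)
			publishToUi(toPublish); // one trip to the UI, however many changes occurred
	}
}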

Thoughts?

Oh, and do you care that I am doing detailed design for NH Prof problems on the blog?

time to read 1 min | 181 words

Right now, we have one challenge with building NH Prof: making the UI scale to what people are actually using.

The problem that we face is:

  • A profiled application is generating a lot of events.
  • Many of those events are relevant to the profiler and so result in UI Model changes.
  • UI Model changes require UI refresh.

There is a limit on how often you can refresh the UI. It is not just a matter of the UI being incredibly slow. (Try it; updating the UI is about as slow as making remote system calls!) That problem we have already solved with batching and aggregation.

The problem is deeper than that: how do you keep a good UI experience when you have so many changes streaming into the UI? If my recent statements list is updated 5,000 times in the space of one minute, there is no value in actually showing anything there, because you never get to see any of it.

We are thinking about various ways of handling that, but I would appreciate additional input. How do you think we should deal with this issue?

time to read 6 min | 1007 words

Greg Young has a post about the cost of messaging. I fully agree that the cost isn't going to be in the time you spend actually writing the message body. You are going to have a lot of those, and if you take more than a minute or two to write one, I am offering overpriced speed-typing courses.

The cost of messaging, and a very real one, comes when you need to understand the system. In a system where message exchange is the form of communication, it can be significantly harder to understand what is going on. For a tightly coupled system, you can generally just follow the path of the code. But for messages?

When I publish a message, that is all I care about from the view of the current component. But from the view of the system? I sure as hell care about who is consuming it and what they are doing with it.

Usually, the very first feature that I write in a system is logging in a user. That is a good proof that all the pieces are working.

We will ignore the UI and the actual backend for user storage for a second; let us think about how we would deal with this issue if we had messaging in place. We have the following guidance from Udi about this exact topic. I am going to try to break it down even further.

We have the following components in the process: the user, the browser (distinct from the user), the web server, and the authentication service.

We will start looking at how this approach works by examining system startup.

[Image: system startup, the web server loading users from the authentication service]

The web server asks the authentication service for the users. The authentication service sends the web server all the users it is aware of, and the web server then caches them internally. When a user tries to log in, we can satisfy that request directly from our cache, without having to talk to the authentication service at all. This means that we have a fully local authentication story, which is blazingly fast.

[Image: login satisfied directly from the local cache]

But what happens if we get a user that we don't have in the cache? (Maybe the user just registered and we weren't notified about it yet?).

We ask the authentication service whether or not this is a valid user, but we don't wait for a reply. Instead, we send the browser an instruction to call us back after a brief wait; the browser sets this up using JavaScript. During that time, the authentication service responds, telling us that this is a valid user. We simply put that into the cache, the same way we handle all user updates.

Then the browser calls us again (note that this is transparent to the user), and now we have the information that we need, so we can successfully log the user in:

[Image: the browser's second call completing the login]

There is another scenario here: what happens if the user is not valid? The first part of the scenario is identical; we ask the authentication service to tell us whether this is a valid user. When the service replies that it is not, we cache that as well. When the browser calls back, we can tell it that this is not a valid user.

[Image: the invalid user flow]

(Just to make things interesting, we also have to ensure that the invalid users cache expires or has a limited size, because otherwise this is an invitation for a DOS attack.) A sketch of the web server's side of all this appears below.
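
To tie the scenarios together, here is a minimal sketch of the web server's side of this flow. All the names (CachedAuthenticator, LoginResult, queryAuthenticationService) are hypothetical, invented for illustration; this is not Udi's or anyone's actual implementation:

using System;
using System.Collections.Concurrent;

public enum LoginResult { Valid, Invalid, RetryShortly }

public class CachedAuthenticator
{
	// Users pushed to us by the authentication service, at startup and
	// via ongoing update messages.
	private readonly ConcurrentDictionary<string, string> knownUsers =
		new ConcurrentDictionary<string, string>();

	// Negative results; in real code this would expire or be bounded,
	// to avoid the DOS issue mentioned above.
	private readonly ConcurrentDictionary<string, DateTime> knownInvalid =
		new ConcurrentDictionary<string, DateTime>();

	private readonly Action<string> queryAuthenticationService; // sends an async message

	public CachedAuthenticator(Action<string> queryAuthenticationService)
	{
		this.queryAuthenticationService = queryAuthenticationService;
	}

	public LoginResult TryLogin(string user, string passwordHash)
	{
		// The happy path: answered entirely from local state, no waiting.
		if (knownUsers.TryGetValue(user, out var storedHash))
			return storedHash == passwordHash ? LoginResult.Valid : LoginResult.Invalid;

		if (knownInvalid.ContainsKey(user))
			return LoginResult.Invalid;

		// Cache miss: ask the service asynchronously, and have the browser
		// call us again after a brief wait (set up with JavaScript).
		queryAuthenticationService(user);
		return LoginResult.RetryShortly;
	}

	// Handlers for replies and updates from the authentication service.
	public void UserIsValid(string user, string passwordHash)
	{
		knownUsers[user] = passwordHash;
	}

	public void UserIsInvalid(string user)
	{
		knownInvalid[user] = DateTime.UtcNow;
	}
}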

Finally, we have the process of creating a new user in the application, which works in the following fashion:

[Image: the new user creation flow]

Now, I just took three pages to explain something that can be explained very easily using:

  • usp_CreateUser
  • usp_ValidateLogin

Backed by the ACID guarantees of the database, those two stored procedures are much simpler to reason about, explain and in general work with.

With messaging, we have way more complexity to work with. And this complexity spans all layers, from the back end to the UI! My UI guys need to know about async messaging!

Isn't this a bit extreme? Isn't this heavyweight? Isn't this utterly ridiculous?

Yes, it is, absolutely. The problem with the two-SPs solution is that while it works beautifully for the simple scenario, it creaks when we start talking about the more complex ones.

Authentication is usually a heavy operation. ValidateLogin is not just doing a query; it is also recording stats, updating the last login date, etc. And it is something that users do frequently. It makes sense to try to optimize it.

Once we leave the trivial solution area, we are faced with a myriad of problems that the messaging solution solves. There is no chance of farm-wide locks in the messaging solution, because no lock is ever taken. There are no waiting threads in the messaging solution, because we never query anything but our own local state.

We can take the authentication service down for maintenance and the only thing that will be affected is new user registration. The entire system is more robust.

Those are the tradeoffs that we have to deal with when we get to high complexity features. It makes sense to start crafting them, instead of just assembling a solution.

Just stop and think about what it would require of you to understand how logins work in the messaging system vs. the two-SP system. I don't think anyone can argue that the messaging system is simpler to understand, and that is where the real cost is.

However, I think that after getting used to the new approach, you'll find that it starts making sense. Not only that, but it is fairly easy to see how to approach problems once you have started to get a feel for messaging.
