time to read 5 min | 875 words

The problem is quite simple: I want to be able to support certain operations on Raven. In order to support those operations, the user needs to be able to submit a LINQ query to the server. To allow this, we need to accept a string, compile it, and run it.

So far, it is pretty simple. The problem begins when you consider that assemblies can’t be unloaded. I was very hopeful when I learned about collectible assemblies in .NET 4.0, but they focus exclusively on assemblies generated from System.Reflection.Emit, while my scenario is compiling code on the fly (so I invoke the C# compiler to generate an assembly, then use that).
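To make it concrete, here is roughly what compiling code on the fly looks like. This is only a sketch, the source string and type name are made up and this isn't the actual Raven code. Note that even with GenerateInMemory set, the resulting assembly is loaded into the current AppDomain and can never be unloaded, so every compilation leaks:

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

public static class OnTheFlyCompiler
{
    // Compile a user supplied source string and create an instance of a type from it.
    // The compiled assembly stays loaded in the current AppDomain forever.
    public static object CompileAndCreate(string source, string typeName)
    {
        var parameters = new CompilerParameters { GenerateInMemory = true };
        parameters.ReferencedAssemblies.Add("System.dll");

        using (var provider = new CSharpCodeProvider())
        {
            CompilerResults results = provider.CompileAssemblyFromSource(parameters, source);
            if (results.Errors.HasErrors)
                throw new InvalidOperationException(results.Errors[0].ErrorText);

            return results.CompiledAssembly.CreateInstance(typeName);
        }
    }
}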

Collectible assemblies don’t help in this case. Maybe, in C# 5.0, the compiler will use SRE, which would help, but I don’t hold much hope there. I also checked out the Mono.CSharp assembly, hoping that maybe it could do what I wanted, but it suffers from the same memory leak as well.

So I turned to the one solution that I knew would work: generating those assemblies in another app domain, and unloading that app domain when it becomes too full. I kept thinking that I couldn’t do that because of the slowdown of cross app domain communication, but then I realized that I was violating one of the first rules of performance: you don’t know until you measure it. So I set out to test it.

I am only interested in testing the speed of cross app domain communication, not anything else, so here is my test case:

public class RemoteTransformer : MarshalByRefObject
{
    private readonly Transformer transformer = new Transformer();

    public JObject Transform(JObject o)
    {
        return transformer.Transform(o);
    }
    }
}

public class Transformer
{
    public JObject Transform(JObject o)
    {
        o["Modified"] = new JValue(true);
        return o;
    }
}

Running things in the same app domain (baseline):

static void Main(string[] args)
{
    var t = new RemoteTransformer();
    
    var startNew = Stopwatch.StartNew();

    for (int i = 0; i < 100000; i++)
    {
        var jobj = new JObject(new JProperty("Hello", "There"));

        t.Transform(jobj);

    }

    Console.WriteLine(startNew.ElapsedMilliseconds);
}

This consistently gives results under 200 ms (185ms, 196ms, etc). In other words, we are talking about over 500 operations per millisecond.

What happens when we do this over an AppDomain boundary? The first problem I ran into was that the JSON objects were not serializable, but that was easy to fix. Here is the code:

static void Main(string[] args)
{
    var appDomain = AppDomain.CreateDomain("remote");
    var t = (RemoteTransformer)appDomain.CreateInstanceAndUnwrap(
        typeof(RemoteTransformer).Assembly.FullName,
        typeof(RemoteTransformer).FullName);

    var startNew = Stopwatch.StartNew();

    for (int i = 0; i < 100000; i++)
    {
        var jobj = new JObject(new JProperty("Hello", "There"));

        t.Transform(jobj);
    }

    Console.WriteLine(startNew.ElapsedMilliseconds);
}

And that ran in close to 8 seconds (7,871 ms). That is over 40 times slower, or just about 12 operations per millisecond.

To give you some indication about the timing, this means that an operation over 1 million documents would spend about 1.3 minutes just serializing data across app domains.

That is… long, but it might be acceptable. I need to think about this more.
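For reference, the recycling approach I am considering would look roughly like this. This is only a sketch, the threshold and the class name are made up, and the real thing would need to deal with in-flight calls when the domain goes away:

using System;

public class RecyclingCompilationHost
{
    private AppDomain domain;
    private const int MaxAssembliesBeforeRecycle = 128; // illustrative threshold

    public RemoteTransformer CreateTransformer()
    {
        if (domain == null || domain.GetAssemblies().Length > MaxAssembliesBeforeRecycle)
        {
            // Throw away the old domain, and everything that was compiled into it.
            if (domain != null)
                AppDomain.Unload(domain);
            domain = AppDomain.CreateDomain("remote");
        }

        return (RemoteTransformer)domain.CreateInstanceAndUnwrap(
            typeof(RemoteTransformer).Assembly.FullName,
            typeof(RemoteTransformer).FullName);
    }
}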

time to read 8 min | 1547 words

Let us say that I have the homepage of the application, where we display Blogs with their Post count, using the following query:

select 
    dbo.Blogs.Id, 
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle,
    (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) as PostCount
 from dbo.Blogs 

Given what I think about denormalization, and about read vs. write costs, it seems a little wasteful to run the aggregate all the time.

I can always add a PostCount column to the Blogs table, but that would require me to manage it myself, and I thought that I might see whether the database can do it for me.

This isn’t a conclusive post; it details what I tried and what I think is happening, but it isn’t the be all and end all. Moreover, I ran my tests on SQL Server 2008 R2 only, not on anything else. I would like to hear what you think of this.

My first thought was to create this as a persisted computed column:

ALTER TABLE Blogs
ADD PostCount AS (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) PERSISTED

But you can’t create computed columns that use subqueries. I could understand why more easily if it were only for persisted computed columns, because that would give the database a hell of a time figuring out when the computed column needs to be updated, but I am actually surprised that normal computed columns don’t support subqueries.
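One workaround that I have seen suggested (I haven’t benchmarked it as part of this post) is to hide the subquery behind a scalar function and use that in a non-persisted computed column. The function name here is mine, and note that this does nothing to avoid the per-row aggregation cost, it just moves it:

CREATE FUNCTION dbo.BlogPostCount(@BlogId int)
RETURNS bigint
WITH SCHEMABINDING
AS
BEGIN
    RETURN (SELECT COUNT_BIG(*) FROM dbo.Posts WHERE BlogId = @BlogId);
END
GO

-- The column is recomputed on every read; it cannot be PERSISTED, since the
-- function's result depends on data in another table.
ALTER TABLE dbo.Blogs ADD PostCount AS dbo.BlogPostCount(Id);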

Given that my first attempt failed, I decided to try creating a materialized view for the data that I needed. Materialized views in SQL Server are called indexed views, and there are several things to note here. You can’t use subqueries here either (likely because the DB couldn’t figure out which row in the index to update if you were using subqueries); you have to use joins.

I created a data set of 1,048,576 rows in the blogs table and 20,971,520 posts, which I think should be enough to give me real data.

Then, I issued the following query:

select 
        dbo.Blogs.Id, 
        dbo.Blogs.Title,
        dbo.Blogs.Subtitle,
        count_big(*) as PostCount
from dbo.Blogs left join dbo.Posts
        on dbo.Blogs.Id = dbo.Posts.BlogId
where dbo.Blogs.Id = 365819
group by dbo.Blogs.Id,
        dbo.Blogs.Title,
        dbo.Blogs.Subtitle

This is before I created anything, just to give me some idea about what kind of performance (and query plan) I can expect.

Query duration: 13 seconds.

And the execution plan:

[execution plan screenshot]

The index suggestion feature is one of the best reasons to move to SSMS 2008, in my opinion.

Following the suggestion, I created:

CREATE NONCLUSTERED INDEX [IDX_Posts_ByBlogID]
ON [dbo].[Posts] ([BlogId])

And then I reissued the query. It completed in 0 seconds with the following execution plan:

[execution plan screenshot, after adding the index]

After building Raven, I have a much better understanding of how databases operate internally, and I can follow exactly how the introduction of this index completely changes the game for this query.

Just to point out, the result of this query is:

Id          Title                 Subtitle               PostCount
----------- --------------------- ---------------------- --------------------
365819      The lazy blog         hibernating in summer  1310720

I decided to see what using a view (and then an indexed view) would give me. I dropped the IDX_Posts_ByBlogID index and created the following view:

CREATE VIEW BlogsWithPostCount 
WITH SCHEMABINDING
AS 
select 
    dbo.Blogs.Id, 
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle,
    count_big(*) as PostCount
 from dbo.Blogs join dbo.Posts
    on dbo.Blogs.Id = dbo.Posts.BlogId
 group by dbo.Blogs.Id,
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle

After which I issued the following query:

select 
        Id, 
        Title,
        Subtitle,
        PostCount
from BlogsWithPostCount
where Id = 365819

This had the exact same behavior as the first query (13 seconds and the suggestion for adding the index).

I then added the following index to the view:

CREATE UNIQUE CLUSTERED INDEX IDX_BlogsWithPostCount
ON BlogsWithPostCount (Id)

And then reissued the same query on the view. It had absolutely no effect on the query (13 seconds and the suggestion to add the index). This makes sense, if you understand how the database is actually treating this.

The database just created an index on the results of the view, but it only indexed the columns that we told it about, which means that it still needs to compute the PostCount. To make things more interesting, you can’t add the PostCount to the index (which would save the need to recalculate it).

Some points that are worth talking about:

  • Adding IDX_Posts_ByBlogID index resulted in a significant speed increase
  • There doesn’t seem to be a good way to perform materialization of the query in the database (this applies to SQL Server only, mind you, maybe Oracle does better here, I am not sure).

In other words, the best solution that I have for this is to either accept the cost per read on the RDBMS and mitigate it with proper indexes, or create a PostCount column in the Blogs table and manage it yourself. I would like your critique of my attempt, and additional information about whether what I am trying to do is possible in other RDBMSs.
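For completeness, here is roughly what managing the column yourself would look like. This is a sketch only, the trigger name is mine, and it ignores UPDATEs that move a post from one blog to another:

ALTER TABLE dbo.Blogs ADD PostCount int NOT NULL DEFAULT 0;
GO

CREATE TRIGGER dbo.Posts_MaintainPostCount ON dbo.Posts
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Add the newly inserted posts to their blogs' counts.
    UPDATE dbo.Blogs
    SET PostCount = PostCount + i.cnt
    FROM dbo.Blogs
    JOIN (SELECT BlogId, COUNT(*) AS cnt FROM inserted GROUP BY BlogId) i
        ON dbo.Blogs.Id = i.BlogId;

    -- Subtract the deleted posts from their blogs' counts.
    UPDATE dbo.Blogs
    SET PostCount = PostCount - d.cnt
    FROM dbo.Blogs
    JOIN (SELECT BlogId, COUNT(*) AS cnt FROM deleted GROUP BY BlogId) d
        ON dbo.Blogs.Id = d.BlogId;
END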

time to read 2 min | 336 words

That one was annoying to figure out. Take a look at the following code:

static void Main(string[] args)
{
    var listener = new HttpListener();
    listener.Prefixes.Add("http://+:8080/");
    listener.Start();

    Console.WriteLine("Started");

    while(true)
    {
        var context = listener.GetContext();
        context.Response.Headers["Content-Encoding"] = "deflate";
        context.Response.ContentType = "application/json";
        using(var gzip = new DeflateStream(context.Response.OutputStream, CompressionMode.Compress))
        using(var writer = new StreamWriter(gzip, Encoding.UTF8))
        {
            writer.Write("{\"CountOfIndexes\":1,\"ApproximateTaskCount\":0,\"CountOfDocuments\":0}");
            writer.Flush();
            gzip.Flush();
        }
        context.Response.Close();
    }
}

Firefox and IE have no trouble using this. But here is how it looks in Chrome.

[screenshot: Chrome error page]

To make matters worse, pay attention to the conditions of the bug:

  • If I use gzip instead of deflate, it works.
  • If I use "text/plain" instead of "application/json", it works.
  • If I tunnel this through Fiddler, it works.

I hate stupid bugs like that.
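Since gzip works everywhere, the obvious workaround is to serve gzip instead of deflate. A minimal sketch of that variant (the helper method name is mine; it would be called from the while loop instead of the inline code above):

static void WriteCompressedJson(HttpListenerContext context)
{
    // Same hard-coded JSON as above, but gzip-compressed, which Chrome accepts.
    context.Response.Headers["Content-Encoding"] = "gzip";
    context.Response.ContentType = "application/json";
    using (var gzip = new GZipStream(context.Response.OutputStream, CompressionMode.Compress))
    using (var writer = new StreamWriter(gzip, Encoding.UTF8))
    {
        writer.Write("{\"CountOfIndexes\":1,\"ApproximateTaskCount\":0,\"CountOfDocuments\":0}");
        writer.Flush();
        gzip.Flush();
    }
    context.Response.Close();
}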

Hunt the bug

time to read 1 min | 180 words

The following code will throw under certain circumstances, what are they?

public class Global : HttpApplication
{
    public void Application_Start(object sender, EventArgs e)
    {
        HttpUtility.UrlEncode("Error inside!");
    }
}

Hint: the exception will not be raised because of transient conditions such as low memory.

What are the conditions in which it would throw, and why?

Hint #2: I had to write my own HttpUtility (well, I took the one from Mono and modified it) to avoid this bug.

ARGH!

time to read 2 min | 351 words

Dave has some interesting requirements in his project:

We're not in control of where the data is located, how it's stored and in what configuration. In most cases employees need to be retrieved from Active Directory (there is no 'login'; the Windows identity determines what a user can or can't do). Customer contacts are usually handled by the helpdesk department and each contact moment is logged in a helpdesk database. The customer (account information) itself often needs to be retrieved from an IBM DB2 database.

What you have is not one application that needs to access different data sources. That would be the wrong way to think about this, because it introduces a whole lot of complexity into the application.

[diagram: one application reaching directly into every data source]

It is much better to structure the application as an independent application with each integration point made explicit. Instead of touching the DB2 database directly, you put a service on it and access that.

[diagram: the application talking to explicit services in front of each data source]

This isn’t just “oh, SOA nonsense again”, it is an important distinction. When you tie yourself directly to so many external integration points, you are also ensuring that whenever there is a change in one of them, you are going to be impacted. When you put a service boundary between you and the integration point (even if you have to build the service), the effect is much less noticeable.
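To make “each integration point made explicit” concrete, here is a minimal sketch of what such a boundary might look like in code. The names (ICustomerAccountService, Db2CustomerAccountService, CustomerAccount) are hypothetical, not from Dave's system:

using System;

// The rest of the application only ever sees this contract.
public interface ICustomerAccountService
{
    CustomerAccount GetAccount(string customerId);
}

public class CustomerAccount
{
    public string Id { get; set; }
    public string Name { get; set; }
}

// The only piece of code that knows the data really lives in DB2
// (or in a locally replicated copy maintained by the ETL process).
public class Db2CustomerAccountService : ICustomerAccountService
{
    public CustomerAccount GetAccount(string customerId)
    {
        // Call the service that fronts DB2, or read the replicated table, here.
        throw new NotImplementedException();
    }
}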

Also, did you notice the blue lines going from the databases? Those are background ETL processes, replicating data to/from the databases. They allow us to handle situations where the integration points are not available.

In short, design your application so it doesn’t stick its nose into other people’s databases. If you need data from another database, put a service there, or replicate it. You’ll thank me when your app stays up.

time to read 2 min | 367 words

There seems to be some suspicion about the usage data from NH Prof that I published recently.

I would like to apologize for responding late to the comments. I know that there are some people who believe that I have a 3G chip installed directly in my head, but I was actually busy in the real world and didn’t look at my email until recently. The blog runs on autopilot just so I’ll be able to do that, but sometimes that gives the wrong impression.

So, what does NH Prof “phone home” about?

Well, the data is actually divided into two distinct pieces. Most of the data (numbers, usages, geographic location, etc) actually comes from looking at the server logs for the update check.

Another piece of data that the profiler reports is feature usage. There are about 20–30 individual features being tracked. What does it mean to track a feature?

Well, here are three examples that show what gets reported:

[three screenshots of the reported feature-usage events]

There is no way to correlate this data to an individual user, nor is there a way to track the behavior of a single user.

I use this data mainly in order to see what features are being used most often (therefore deserving the most attention, optimizations, etc).

Those are mentioned in the product documentation.

To summarize:

  • I am not stealing your connection strings.
  • I don’t gather any personally identifying data (and I am somewhat at a loss to understand what I would do with it even if I did).
  • There is never any data about what you are profiling being sent anywhere.

I hope this clears things up.

time to read 5 min | 955 words

I get asked quite frequently how to become a speaker. More to the point, how do you become an international speaker?

I was recently at a gathering where no less than three different people asked me this question, so I thought that it might be a good post.

Note: this post isn’t meant for someone who isn’t already speaking. And if you are speaking but are bad at it, this isn’t for you. The underlying assumption here is that you can speak and are reasonably good at it.

Note II: For this post, speaking is used to refer to presenting some technical content in front of an audience.

Why would you want to be a speaker anyway?

I heard that it is actually possible to make a living as a speaker. I haven’t found it to be the case, but then again, while I speak frequently, I don’t speak that frequently.

There are several reasons to want to be a speaker:

  • reputation (and in the end, good reputation means you get to raise your rates, get more work, etc).
  • contacts (speaking puts you in front of dozens or hundreds of people, and afterward you get to talk with the people who are most interested in what you talked about)
  • advertising for your product (all those “lap around Visual Studio 2010” talks are actually an hour-long ad that you paid to see :-) ).

I’ll focus on the first two; reputation & contacts give you a much wider pool of potential work to choose from, increase the money you can make, etc.

So how do I do that, damn it?

Honestly, I have no idea. The first time that I presented at a technical conference, it was due to a mixup in communication. Apparently, in the US “it would have been delightful” means “we regret to inform you”, but in Israel we read that as “great, let’s do it”, and put the guy on the spot, so he had to scramble and do something.

Okay, I lied, I do have some idea about how to do this.

Again, I am assuming you are a reasonably good speaker (for myself, I know that my accent is a big problem when speaking English), but there are a lot of reasonably good speakers out there.

So, what is the answer? Make yourself different.

Pick a topic that is near & dear to your heart (or to your purse, which also works) and prepare a few talks on it. Write about it on a blog, comment on other people’s blogs about the topic. Your goal should be that when people think about topic X, your name is on that list. Forums like Stack Overflow can help, as can writing articles (whether for pay or in places like CodeProject). Join a mailing list and be active (and helpful) there. Don’t focus on regional forums / mailing lists, though. The goal is international recognition.

It will probably take at least a year for people to start recognizing your name (it took over 2 years for me). If possible, produce a set of tools related to your topic. Publish them for free, and write it off as an investment in your future.

For myself, NHibernate Query Analyzer was a huge boost in terms of getting recognized. And Rhino Mocks was probably what clinched the deal. I honestly have no idea how much time & effort I put into Rhino Mocks, but Ohloh estimates that project at $12,502,089(!). While I disagree with that number, I did put a lot of effort into it, and it paid itself off several times over.

If you don’t have a blog, get one. Don’t get one at a community site, either. Community sites like blogs.microsoft.co.il are good for getting your stuff read, but they have a big weakness in terms of branding yourself. You don’t want to get lost in the crowd; you want people to notice who you are. And most people are going to read your posts in a feed reader, and they are going to notice that the community feed is interesting, not that you are interesting.

Post regularly. I try to have a daily post, but that will probably not be possible for you; try to post at least once a week, and try to time it so it is always on the same day & time. Monday at midnight usually works.

Okay, I did all of that, what now?

Another note: this is something that you may want to do in parallel with the other efforts.

Unless you become very well known, you won’t be approached; you’ll have to submit session suggestions. Keep an eye on the conferences that interest you, and wait until they have a call for sessions. Submit your stuff. Don’t get offended if they reject you.

If you live in a place that hosts international conferences (which usually rules Israel out), a good bet is to try to get accepted as a speaker there. You would be considerably cheaper than bringing someone from out of town/country, and that also plays a role. Usually, if you manage to get into a conference once, they’ll be much more likely to have you again. They have your speaker eval, and unless you truly sucked (like going on stage and starting to speak in Hebrew in Denmark), that gives them more confidence in bringing you back a second time.

And that is about it for now.

time to read 3 min | 563 words

I was recently asked to contrast the business decisions related to the profiler and RavenDB. I thought that it would make an excellent post.

There are a lot of aspects to think about here, actually. The profiler is an add-on tool; it is only useful if you are using one of the supported OR/Ms, but if you do… it:

  • has a very low barrier to entry: you only need to reference the dll and add a single line of code.
  • provides immediate value; you can see the benefits that it gives you.
  • has very few moving parts that users can break.

NH Prof was released on Jan 1st, 2009. The first sale happened on Jan 2nd, 2009 (thanks Yann!).

The lead time for the profiler tends to be very short, because there is very little that you need to invest and a lot that you gain. Yesterday I introduced a guy to the profiler as a way to help him see what his app is doing; he made a purchase about an hour later.

That is excellent news from my point of view. :-)

RavenDB, on the other hand:

  • has a very high barrier to entry, not so much from a technical perspective, but from an adoption one.
  • requires you to make significant changes to the way you work.
  • takes time to show why it is beneficial.
  • requires payment only when you actually go live.
  • requires much higher degree of support for users.

That means that while it takes a few minutes to decide if you want the profiler (and the rest of the 30-day trial is spent getting corporate approval :-) ), for RavenDB the lead time until you pull out your credit card is much longer.

That has some interesting implications. I actually spent a lot more (time & money) on the profiler than I spent (outright) on RavenDB. But the major difference is what type of investment each one is.

There is a term in economics called sunk cost: all the costs associated with building a product up to the point you released it. That is money already spent. But what usually matters a lot more is whether, once you have reached the release point, the cash flow from the product can justify continued work on it (and maybe, at some point, pay back the product’s development).

NH Prof was a big investment for me, but money started coming in shortly afterward, and it became apparent that it was a sustainable product. For RavenDB, the costs have actually been a lot lower (since the majority of them represented my own time), but the expectation is that it will take about a year or two before it is possible to say whether RavenDB is a sustainable product.

In that sense, RavenDB represents a much riskier investment. If RavenDB hadn’t been rattling around in my head for so long, I would probably have gone for something with a much shorter lead time.

It is interesting to me to see how many factors there are in these sorts of decisions. So many things to balance.

time to read 1 min | 93 words

I thought that I would announce that, following JAOO, I am going to head off to the European NHibernate Day, a full-day conference dedicated to NHibernate.

I am going to show off a lot of the new features in NHibernate 3.0, and Steve Strong is going to discuss LINQ to NHibernate and what you can do with it. For extra fun, I am also going to spend an hour discussing worst practices in NHibernate. That is going to be an hour full of ranting & raving, which should be amusing.

time to read 2 min | 223 words

I have been doing some studying of how people are using the profiler, and it shows some interesting results.

  • The typical profiler session length is:
    • NH Prof: 1:15 hours
    • Hibernate Profiler: 1:05 hours
    • EF Prof: 42 minutes
    • L2S Prof: 50 minutes
  • 83% of the profiler users have used it more than once. In fact, here is the breakdown of the number of usages:
    [chart: number of usages per user]
    So we have over 50% who use it regularly.
  • Most people use it predominantly to view the statements executed:
    [chart: feature usage breakdown]
    This means that the reports are getting comparatively little attention.
  • The results per geographical location are also interesting:
    [chart: usage by geographical location]
