Ayende @ Rahien


Black box reverse engineering speculation

Terrance has pointed me to a really interesting feature in Solr, called Facets. After reading the documentation, I am going to try to guess how this is implemented, based on my understanding of how Lucene works.

But first, let me explain what Facets are. Facets are a way to break down a search result so that the user gets more meaningful results. I haven't looked at the code, and haven't read any further than that single link, but I think that I can safely extrapolate from that. I mean, the worst case that could happen is that I would look stupid.

Anyway, Lucene, the underpinning of Solr, is a document indexing engine. It has no ability to do any sort of aggregation, and I am pretty sure that Solr didn't sneak in something relational when no one was looking. So how can it do this sort of thing?

Well, let us look at a simple example: ?q=camera&facet=true&facet.field=manu, which will give us the following results:

<!-- search results snipped -->
<lst name="facet_fields">
  <lst name="manu">
    <int name="Canon USA">17</int>
    <int name="Olympus">12</int>
    <int name="Sony">12</int>
    <int name="Panasonic">9</int>
    <int name="Nikon">4</int>
  </lst>
</lst>

Remember what we said about Lucene being an indexing engine? You can query the index itself very efficiently, and this sort of result is something that Lucene can provide you almost instantly.

Moreover, when we start talking about price facets, which look something like this:

?q=camera&facet=true&facet.query=price:[* TO 100]
    &facet.query=price:[100 TO 200]&facet.query=price:[200 TO 300]
    &facet.query=price:[300 TO 400]&facet.query=price:[400 TO 500]
    &facet.query=price:[500 TO *]

It gets even nicer. If I had that problem (which I actually do, but that is a story for another day), I would resolve it using multiple individual Lucene searches. Something like:

  • type:camera –> get docs
  • type:camera price:[* TO 100] –> but just get count
  • type:camera price:[100 TO 200] –> but just get count

In essence, Solr functions as a query batching mechanism for Lucene, and then massages the data into a form that is easy to consume by the front end. That is quite impressive.
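To make that guess a bit more concrete, here is a minimal sketch of the batching idea using Lucene.Net (assuming a Lucene.Net 3.x style API; the field names and the idea of passing the facet values in from the outside are my own simplifications, real faceting would enumerate the terms from the index itself):

using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class FacetCountsSketch
{
    // For each facet value, run a count-only query that combines the main
    // query with a filter on the facet field. No documents are materialized,
    // we only ask Lucene for the hit count.
    public static IDictionary<string, int> CountFacets(
        Directory index, string mainQuery, string facetField, IEnumerable<string> facetValues)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var parser = new QueryParser(Version.LUCENE_30, "text", analyzer);
        var counts = new Dictionary<string, int>();

        using (var searcher = new IndexSearcher(index, true)) // open read-only
        {
            foreach (var value in facetValues)
            {
                var query = parser.Parse(mainQuery + " AND " + facetField + ":\"" + value + "\"");
                counts[value] = searcher.Search(query, 1).TotalHits;
            }
        }
        return counts;
    }
}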

By doing this aggregation, Solr can provide some really impressive capabilities, on top of a really simple concept. I am certainly going to attempt something similar for Raven.

Of course, I may have headed in the completely wrong direction, in which case I am busy wiping egg off my face.


User Interface Delegation – Rhino Divan DB

No, this isn't a post about how to do UI delegation in software. This is me, looking at HTML code and realizing that I have a deep and troubling disadvantage: while I know how to build UI, I am just not very good at it.

For a certain value of done, Rhino Divan DB is done. Architecture is stable, infrastructure is in place, everything seems to be humming along nicely. There is still some work to be done (error handling, production worthiness, etc) but those are relatively minor things.

The most major item on the table, however, is providing a basic UI on top of the DB. I already have a working prototype, but I came to realize that I am tackling something that I am just not good at.

This is where you, hopefully, come into play. I am interested in someone with a good grasp of HTML/JS to build a very simple CRUD interface on top of a JSON API. That is simple enough, but the small caveat is that my deadlines are short; I would like to see something working tomorrow.

Anyone is up for the challenge?

NHibernate donation campaign

NHibernate is the most popular Open Source Object Relational Mapper for the .NET framework. As an Open Source project, all the work done on it is done for free. We would like to be able to dedicate more time to NHibernate, but even as a labor of love, the amount of time that we can spend on a free project is limited.

In order to facilitate that, we opened a donation campaign that will allow you to donate money to the project.

Click here to lend your support to: NHibernate and make a donation at www.pledgie.com !

What is this money going to be used for?

This money will go directly to NHibernate development, primarily to sponsor the time required for the development of NHibernate itself.

Donation Matching

Moreover, my company, Hibernating Rhinos, is going to match any donation to this campaign (up to a total of $5,000), as a way to give back to the NHibernate project for the excellent software it has produced.

In addition to that, my company will also sponsor any resources needed for the project, such as production servers (NHibernate's Jira is already running on our servers), build machines, etc.

Why should you donate?

If you are a user of NHibernate, you have gained a lot from building on such a solid foundation. We ask you to donate so that we can make the project even better. If your company uses NHibernate, ask it to donate to this campaign.

Thanks,

~Ayende

Pricing software

It seems that people make some rather interesting assumptions regarding software pricing. When talking about commercial products, there is a wide variety of strategies that you might want to push, depending on what exactly you are trying to achieve.

You may be trying to have a loss leader product, gain reputation or entry to a new market, in which case your goals are skewed toward optimizing other things. What I want to talk about today is pricing software with regards to maximizing profits.

Probably the first thing that you need to do is consider the base price of the product. That is important on several levels. For the obvious reason that this is what you will make out of it, but also because the way you price your product impacts its perceived value.

If you try to sell your product at a price significantly below what the market expects it to be, the perception is that it is crap and you are trying to get rid of it.

Another issue that you have to face is that it is very hard to change the base price once you have settled on it. It is fairly obvious why you would have a hard time jacking the price upward, but pushing the price down is also something to approach with trepidation; it has a very negative impact on the product's image. Now, you can effectively price it at a lower rate very easily, by doing things like offers, discounts, etc. Those do not affect the base price and so generally do not affect how the product is perceived.

Setting the price significantly higher than market expectations is just as bad as setting it significantly below them. The problem here is that you come across as a loony with delusions of grandeur. Even if your product is wonderful, and even if it will return its investment in a single week, pricing it too high generates a strong rejection reaction. It also means that you need to actively combat that perception, and that takes a lot of time & effort.

Finally, in the range between too little and too much, you have a lot of room to play. And that is where the usual supply & demand metrics come into play. Since supply isn't a problem with software, what we are talking about is demand vs. price. Everyone who has studied economics is familiar with the graph, I assume.

As price goes up, demand goes down, and vice versa. So the question is what is the optimum price point that would generate the most revenue. Here are some numbers, showing a probable correlation between price and sales.

Price        Users    Income
$ 2,000.00       1    $  2,000.00
$ 1,000.00       1    $  1,000.00
$   200.00     100    $ 20,000.00
$   100.00     200    $ 20,000.00
$    10.00   5,000    $ 50,000.00

It might be easier to look at it in its graphical form to the right.

You can see that when the price point is too high, very few people will buy it, if at all. When the price point is in the range that the market expects, we make some money, drastically more than we would if the price were above market expectations.

However, if we set the price way down, we get a lot more users, and a lot more money pouring in. We will ignore the question of damage to the reputation for now.

This seems to indicate that by lowering the price we can make a lot more money, right? Except that it isn’t really the case. Let us see why.

1 in 10 users has a question, requires support, etc. Let us say that each support call costs $100, and now let us look at the numbers.

Price        Users    Calls    Support Cost    Income         Revenue
$ 2,000.00       1        0    $         -     $  2,000.00    $  2,000.00
$ 1,000.00       1        0    $         -     $  1,000.00    $  1,000.00
$   200.00     100       10    $  1,000.00     $ 20,000.00    $ 19,000.00
$   100.00     200       20    $  2,000.00     $ 20,000.00    $ 18,000.00
$    10.00   5,000      500    $ 50,000.00     $ 50,000.00    $         -

And that tells us a very different story indeed. The end goal is increasing profit, not just the amount of money coming in, after all.
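If you want to play with the numbers yourself, the arithmetic behind that table is trivial (this is just an illustration; the ratios are made up, same as in the table above):

// Net revenue = income from sales minus the cost of supporting those users,
// assuming 1 support call per 10 users and $100 per call.
public static decimal NetRevenue(decimal price, int users,
                                 double callsPerUser = 0.1, decimal costPerCall = 100m)
{
    decimal income = price * users;
    decimal supportCost = (decimal)(users * callsPerUser) * costPerCall;
    return income - supportCost;
}

// NetRevenue(200m, 100)  -> 19,000
// NetRevenue(10m, 5000)  -> 0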

Oh, and a final thought: the following coupon code, EXR-45K2D462FD, is my pricing experiment; it allows the first 10 users to buy NH Prof, L2S Prof or EF Prof at a 50% discount.

Since all the coupon codes ran out in short order, I am going to continue the experiment: this coupon code, ER2-45K2D462FJ, will give the first 25 users to buy NH Prof, L2S Prof or EF Prof a 25% discount.

Fun with a non relational databases

My post showing a different approach for handling data got a lot of traffic, and a lot of good comments. But I think that there is some misunderstanding with regards to the capabilities of NoSQL databases, so I am going to try to expand on those capabilities in this post.

Instead of hand waving, and since I am thinking about this a lot lately, we will assume that we are talking about DivanDB (unreleased version), which has the following API:

  • JsonDoc[] Get(params Id[] ids);
  • Set(params JsonDoc[] docs);
  • JsonDoc[] Query(string indexName, string query);

DivanDB is a simple document database, storing documents as JSON and using Lucene as an indexer for the data. An index is defined by specifying a function that creates it:

var booksByAuthor = from doc in documents
                    where doc.type == "book"
                    from author in doc.authors
                    select new { author };

And the data looks like this:

[images: sample documents]

Indexing is done at write time; you can think about those indexes as materialized views.
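To make that a bit more tangible, here is a rough sketch (not the actual DivanDB code, just my illustration of the idea) of what indexing at write time could look like: every index function is run over the incoming document, and each projected entry is stored in Lucene along with the id of the source document.

using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class WriteTimeIndexing
{
    // indexFunc is the compiled form of an index definition such as booksByAuthor;
    // for a single document it yields zero or more projected entries.
    public static void IndexDocument(IndexWriter writer, string documentId, object document,
                                     Func<object, IEnumerable<object>> indexFunc)
    {
        foreach (var entry in indexFunc(document))
        {
            var luceneDoc = new Document();
            // store the source document id so a query can get back to the JSON document
            luceneDoc.Add(new Field("__document_id", documentId,
                                    Field.Store.YES, Field.Index.NOT_ANALYZED));

            // each property of the projection (author, category, ...) becomes a searchable field
            foreach (var property in entry.GetType().GetProperties())
            {
                var value = property.GetValue(entry, null);
                if (value == null) continue;
                luceneDoc.Add(new Field(property.Name.ToLowerInvariant(), value.ToString(),
                                        Field.Store.NO, Field.Index.ANALYZED));
            }
            writer.AddDocument(luceneDoc);
        }
    }
}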

It appears that people assume that just because you aren't using an RDBMS, you can't use queries. Here are a few options to show how you can do so.

Books by author:

Query("booksByAuthor", "author:weber");

Books by category:

Query("booksByCategory", "category:scifi");

How is the books by category index defined? I think you can guess.

var booksByCategory = from doc in documents
                      where doc.type == "book"
                      from category in doc.categories
                      select new { category };

What other queries did people bring up?

Who has book X in their queue?

Query("usersByQueuedBooks", "book:41");

And the view?

var usersByQueuedBooks = from doc in documents
                         where doc.type == "user"
                         from book in doc.queued_books
                         select new { book };

I’ll leave the implementation of Who has book X checked out as an exercise for the reader.

What about deletes?

Using this system, it looks like deletes might be really expensive, right? Well, that depends on what exactly you want here.

My default approach would be to consider whether you actually want this at all; as Udi pointed out, in the real world, you don't delete.

But it is actually fairly easy to support something like this cheaply. It is all about defining the association and letting the DB handle this (although I am not fond of the syntax I came up with):

var books = from book in documents
            where book.type == "book"
            select new { book = book.id };

var checkedOutBooks = from user in documents
                      where user.type == "user"
                      from book in user.checked_out
                      select new { book };

var queuedBooks = from user in documents
                  where user.type == "user"
                  from book in user.queued_books
                  select new { book };

FK_DisallowDeletion(books, checkedOutBooks);
FK_RemoveElement(books, queuedBooks);

Under the covers this creates indexes and can check them at delete / insert time.

However, I would probably not implement this for Rhino DivanDB, mostly because I agree with Udi.

Rhino Divan DB reboot idea

Divan DB is my pet database. I created it to scratch an itch [Nitpickers: please note this!], to see if I can create a Couch DB-like system in .NET. You can read all about it in the following series of posts.

It stalled for a while, mostly because I ran into the hard problems (building map/reduce views). But I think that I actually have a better idea: instead of trying to build something that would just mimic Couch DB, a .NET based Document DB is actually a very achievable goal.

The way it would work is actually pretty simple: the server would accept JSON-formatted documents, like these:

[
    {
        "id": 153,
        "type": "book",
        "name": "Storm from the Shadows",
        "authors": [
            "David Weber"
        ],
        "categories": [
            "SciFi",
            "Awesome",
            "You gotta read it"
        ],
        "avg_stars": 4.5,
        "reviews": [13,5423,423,123,512]
    },
    {
        "id": 1337,
        "type": "book",
        "name": "DSLs in Boo",
        "authors": [
            "Ayende Rahien",
            "Oren Eini"
        ],
        "categories": [
            "DSL",
            ".NET",
            "You REALLY gotta read it"
        ],
        "avg_stars": 7,
        "reviews": [843,214,451]
    }
]

Querying could be done either by id, or using a query on an index. Indexes can be defined using the following syntax:

var booksByTitle = 
   from book in docs
   where book.type == "book"
   select new { book.title };

The fun part here is that this index would be translated into a Lucene index, which means that you could query the index using a query:

Query("booksByTitle", "title:Boo") –> documents that match this query.

As well as apply any & all the usual Lucene tricks.
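Just to show where I am going with this, the query side could be as simple as the following sketch (again, this is an illustration rather than the real implementation; GetSearcherFor, GetDocumentsByIds and the __document_id field are assumptions of mine):

using System.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

public JsonDoc[] Query(string indexName, string query)
{
    // each index definition is backed by its own Lucene index
    IndexSearcher searcher = GetSearcherFor(indexName);           // hypothetical helper

    var parser = new QueryParser(Version.LUCENE_30, "title",
                                 new StandardAnalyzer(Version.LUCENE_30));
    TopDocs hits = searcher.Search(parser.Parse(query), 128);

    // the index entries only point back at the documents; load the actual JSON docs by id
    var ids = hits.ScoreDocs
                  .Select(hit => searcher.Doc(hit.Doc).Get("__document_id"))
                  .ToArray();

    return GetDocumentsByIds(ids);                                // hypothetical helper
}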

You don’t get Map/Reduce using this method, but the amount of complexity you have is quite low, and the implementation should take only several days to build.

Thoughts?


Profiler new feature: Too many joins detection

This is Josh's feature, since we wrote most of the code for it together. Basically, it recognizes a very common performance problem: queries that use too many joins, such as this one:

[image: a query with a large number of joins]

Which would result in the following warning:

[image: the too many joins warning]

Queries with too many joins might be a performance problem. Each join requires the database to perform additional work, and the complexity and cost of the query grows rapidly with each additional join. While relational databases are optimized for handling joins, it is often more efficient to perform several separate queries instead of a single query with several joins in it.

For OLTP systems, you should consider simplifying your queries or simplifying the data model. While I do not recommend avoiding joins completely, I strongly discourage queries with large numbers of joins. Another issue to pay attention to is possible Cartesian products in queries containing joins; it is very easy to create such a thing and not notice it during development.
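For what it is worth, this sort of detection does not need anything fancy. A heuristic along these lines (a sketch, not the profiler's actual code, and the threshold is arbitrary) is enough to flag suspicious statements:

using System.Text.RegularExpressions;

public static class TooManyJoinsHeuristic
{
    private const int Threshold = 6;

    // Count the JOIN clauses in the statement text; past a certain number,
    // raise the "too many joins" warning on that statement.
    public static bool ShouldWarn(string sql)
    {
        int joins = Regex.Matches(sql, @"\bjoin\b", RegexOptions.IgnoreCase).Count;
        return joins > Threshold;
    }
}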

Profiler Subscription?

One repeated request for the profiler is to have a personal version (with a price to match).

I am not really happy with the idea, for several reasons. One of the major ones is that I have enough variability in the product already, and adding a new edition in addition to the 4 we already support is bound to create headaches. Another is that I simply cannot just draw a line and say “those are the pro features and these are the personal features”.

Instead, I wonder about offering a subscription model, something with a cost around 10 – 15 Euro per month. This would be renewed monthly (automatically), and beyond just having a lower price point, it will also provide automatic upgrades across versions (so free upgrade from 1.x to 2.x).

Thoughts?

Profiler Speculative Feature: Query plans

This isn't a new feature, because you can't use it right now, but it is a really nice feature that we are working on, and I couldn't resist showing it off "hot off the press", so to speak.

Given the following query:

SELECT this_.id             AS id7_1_,
       this_.title          AS title7_1_,
       this_.subtitle       AS subtitle7_1_,
       this_.allowscomments AS allowsco4_7_1_,
       this_.createdat      AS createdat7_1_,
       posts2_.blogid       AS blogid3_,
       posts2_.id           AS id3_,
       posts2_.id           AS id0_0_,
       posts2_.title        AS title0_0_,
       posts2_.TEXT         AS text0_0_,
       posts2_.postedat     AS postedat0_0_,
       posts2_.blogid       AS blogid0_0_,
       posts2_.userid       AS userid0_0_
FROM   blogs this_
       LEFT OUTER JOIN posts posts2_
         ON this_.id = posts2_.blogid
WHERE  this_.id = 1 /* @p0 */

SELECT this_.id            AS id0_1_,
       this_.title         AS title0_1_,
       this_.TEXT          AS text0_1_,
       this_.postedat      AS postedat0_1_,
       this_.blogid        AS blogid0_1_,
       this_.userid        AS userid0_1_,
       comments2_.postid   AS postid3_,
       comments2_.id       AS id3_,
       comments2_.id       AS id2_0_,
       comments2_.name     AS name2_0_,
       comments2_.email    AS email2_0_,
       comments2_.homepage AS homepage2_0_,
       comments2_.ip       AS ip2_0_,
       comments2_.TEXT     AS text2_0_,
       comments2_.postid   AS postid2_0_
FROM   posts this_
       LEFT OUTER JOIN comments comments2_
         ON this_.id = comments2_.postid
WHERE  this_.blogid = 1 /* @p1 */

The profiler can show you the query plan using this UI:

[image: the query plan shown in the profiler]

And here is how the same query looks using the query plan feature in Management Studio:

[image: the query plan in Management Studio]

So, why implement it?

  • This isn’t limited to SQL Server, the profiler can display query plans for: SQL Server, Oracle, PostgreSQL and MySQL
  • This lets you stay in the flow: just hit a button to see the query plan, instead of copying the SQL, opening SSMS, displaying the query plan, etc.

Don't discount the last one; making it easy is one of the core values of the profiler.

The idea is that if you make it easy enough, the barriers to using it go away. If you can instantly see the query plan for a query, you are far more likely to look at it than if it takes 30 seconds to get it. At that point, you would only do it when you already have a performance problem.

Slaying relational dragons

I recently had a fascinating support call, talking about how to optimize a very big model and an access pattern that basically required having the entire model in memory to perform certain operations.

A pleasant surprise was that it wasn’t horrible (when I get called, there is usually a mess), which is what made things interesting. In the space of two hours, we managed to:

  • Reduced number of queries by 90%.
  • Reduced size of queries by 52%.
  • Increased responsiveness by 60%, even for a data set an order of magnitude larger.

My default answer whenever I am asked when to use NHibernate is: Whenever you use a relational database.

My strong recommendation at the end of that support call? Don’t use a relational DB for what you are doing.

The ERD just below has absolutely nothing to do with the support call, but hopefully it will help illustrate the example. Note that I dropped some of the association tables, to make it simpler.

[image: the ERD]

And the scenario we have to deal with is this one:

[image: the screen in question]

Every single table in the ERD is touched by this screen.  Using a relational database, I would need something like the following to get all this data:

SELECT * 
FROM   Users 
WHERE  Id = @UserID 

SELECT * 
FROM   Subscriptions 
WHERE  UserId = @UserId 
       AND GETDATE() BETWEEN StartDate AND EndDate 

SELECT   MIN(CheckedBooks.CheckedAt), 
         Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     CheckedBooks 
         JOIN Books 
           ON BookId 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    CheckedBooks.UserId = @UserId 
GROUP BY BookId 

SELECT   Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     Books 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    BookId IN (SELECT BookID 
                    FROM   QueuedBooks 
                    WHERE  UserId = @UserId) 
GROUP BY BookId 

SELECT   Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     Books 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    BookId IN (SELECT BookID 
                    FROM   RecommendedBooks 
                    WHERE  UserId = @UserId) 
GROUP BY BookId 

SELECT   Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     Books 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    Books.Name LIKE @search 
          OR Categories.Name LIKE @search 
          OR Reviews.Review LIKE @search 
GROUP BY BookId

Yes, this is a fairly simplistic approach, without de-normalization, and I would never perform searches in this manner, but… notice how complex things are getting. For bonus points, look at the fourth query: the queued books are ordered, so try to figure out how we can get the order in a meaningful way. I shudder to think about the execution plan of this set of queries, even if we ignore the last one, which does full text searching in the slowest possible way. And this is just for bringing the data for a single screen, assuming that magically it will show up (you need to do a lot of manipulation at the app level to make this happen).

The problem is simple: our data access pattern and the data storage technology that we use are at odds with one another. While relational modeling dictates normalization, our actual data usage means that we aren't really dealing with single-row entities with relatively rare access to associations, which is the best case for OLTP. Nor are we dealing with set based logic, which is the best case for OLAP / relational based queries.

Instead, we are dealing with an aggregate that spans multiple tables, mostly because we have no other way to express lists and many to many associations in a relational database.

Let us see how we could handle things if we were using a document or key/value database. We would have two aggregates, User and Book.

GetUser(userId) –> would result in:

[image: the user document]

We can now issue another query, to bring the associated books. GetBooks(153, 1337) would result in:

[image: the book documents]

Note that the entire data structure is different; we haven't just copied the normalized relational model, we have a totally different model. An aggregate (similar to DDD's aggregate) is a single entity that contains anything except other aggregates. References to other aggregates are allowed (from the user to all of the books), but most of the entity's data is stored as a single value.

That has several interesting implications. First, we need just two queries to get the data for the screen: one to get the user's data, and a second to get the books that we need to display. Reducing remote calls is something that you really care about, and simplifying the queries to mere query-by-id is going to have a significant effect as well.
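In code, the entire screen boils down to something like this sketch (the method and property names here are illustrative, not a real API):

// one call for the user aggregate, one batched call for every book the screen references
var user = documentDb.GetUser(userId);

var bookIds = user.CheckedOutBooks
                  .Concat(user.QueuedBooks)
                  .Concat(user.RecommendedBooks)
                  .Distinct()
                  .ToArray();

var books = documentDb.GetBooks(bookIds);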

By changing the data storage technology, we also enforced a very rigid aggregate boundary. Transactions become much simpler as well, since most transactions will now modify only a single aggregate, which is a single operation, no matter how many actual operations we perform on that aggregate. And by tailoring the data structure that we use to match our needs, we have natural aggregate boundaries.

The end result is a far simpler method of working with the data. It may mean that we have to do more work upfront, but look at the type of work we would have to do in order to try to solve our problems using the relational model. I know what model I would want for this sort of a problem.

Linq to SQL Profiler 1.0 Released!


Well, I planned to do it last week, but it got delayed for personal reasons.

But here it is: Linq to SQL Profiler is now out of beta, and I personally think it is awesome. Using the profiler, you gain valuable insight into the actual data access pattern of your application (which is usually abstracted away by the Linq to SQL framework). But the profiler goes beyond just dumping a heap of data on you, it takes things several steps further by:

  • Tying together queries and code: you can go directly from any query to the line of code that generated it, no more scratching your head about "what caused this query".
  • Analyzing your data access patterns, alerting you about bad practices and suggesting how to fix them.
  • Providing detailed reports on your application's database usage, perfect for sending to the DBA for optimization.

I got some really nice feedback from people using it:

[images: feedback from users]

Happy linqing

Nice process, but what about the engineering bits?

Software processes have always been a popular topic of discussion in our industry. Those discussions can get quite heated, with advocates of the "stable / stale" Waterfall method pointing fingers toward "rapid / rabid" Agile methods, with the CMMI people throwing documents around and the Lean people standing on the sidelines muttering about Waste.

This isn't a post about a specific software process; I'll defer that to another day. Instead, I want to focus on a flaw in the basic building blocks of many* software building processes.

They ignore the actual building of the software.

That may sound ridiculous on the face of it; after all, how can a software process ignore the act of building software? But take a look at the following diagrams:

[image: software process diagrams]

If you pay attention, you'll notice that those processes talk about everything except how to actually build software. They talk about people, about requirements, about managing customers, about a whole lot of things, but not about the part where you have people sitting down and writing code. In most of them, in fact, that part is usually defined as one of these:

[image]

Why is that a problem? After all, isn't there a big distinction between software engineering (we know what to do, now let us do it) and project management (getting to know what we need to do, and verifying that we did it right)? Those processes deal primarily with project management and leave the engineering part to be defined in a way that fits that particular project. Surely that is better, right? In theory, it might be. But there is a big problem when you have a software process that ignores the software engineering aspects of building software.

The problem is that in many cases, there are hidden assumptions that are going to hammer you down the road if you use a certain process with engineering practices that don't fit it. Take a look at the following chart, showing a team's velocity over time. Does this look familiar?

[image: team velocity over time]

The term I have heard used for this is the Scrum Wall, but I have seen similar results in other processes as well. The best description of it is Allan Kelly's:

You hit the Scrum wall when you adopt Scrum and everything goes well, then, after a few Sprints things don’t work any more - to use an English expression, they go pear shaped. You can’t keep your commitments, you can’t release software, your customers get annoyed and angry, it looks like Scrum is broken.

This is what happens when you adopt Scrum without technical practices such as Test Driven Development, continuous integration and refactoring. When teams adopt the Scrum process, they go faster, show progress, things look good... and then the quality becomes a problem. Now the team are fighting through quicksand.

The code quality is poor and developers are expected to continue to make progress. Maybe the Scrum Master/Project Manager reverts to past behavior and demands overtime and weekend working. Maybe the team start busting a gut to keep their commitments. Either way the team is heading for burn-out.

The major issue is that focusing so much effort and time on project management, with what amounts to willful ignorance of the technical and engineering practices, will inevitably lead to disaster. The process of building software is inextricably linked to the engineering practices involved in building that software. Moreover, some technical practices are actively harmful in some scenarios and life savers in others.

Many Agile and post-Agile processes focus on short cycles, each of them producing something with a distinct value to the customer. That may be an iteration, a commit or a feature, where the goal is to increase the velocity over time so we can provide as much value to the customer in as short a time as possible. What those processes ignore are things like technical debt, large scale refactoring and non functional efforts. Oh, you see those things mentioned at the edges, but they aren't something that is dealt with head on, as a core issue to consider.

There is a bit more to it, actually. The software engineering practices and the project management strategies are linked, and they are of paramount importance when the time comes to decide how the software should actually be built. No, this is not a tautology. We just need to take into account Conway's law and expand on it a bit.

Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization's communication structure.

Part of the design process of a project should include designing the team(s) structure, the project management strategy and the software engineering practices, in order to align the end result with what is desired. Ignoring this leads to imbalance in the project, and if that imbalance is big enough, and goes on for long enough, the project is going to rip itself apart.

* Nitpicker corner: I said many, not all. Don't bother listing software processes that deal with it; I had a reason to explicitly list the processes that I did.

My new laptop – 2010 edition

Well, it arrived two days ago, but I only just finished doing the bare bones installation. My new laptop is a Lenovo Thinkpad W510 (4319-29G, if anyone really cares). I have been a happy Thinkpad user for a long time, so it seemed to be a natural choice. One thing that I did right off the bat, though, was to replace the HD that it came with with an SSD.

It seems to be a pretty nice machine, so far.

[image: performance scores]

I find it highly amusing that the slowest part here is actually the memory.

One thing that I do find unforgivable, however, is the install-from-scratch experience. Since I replaced the HD, I installed the machine using Win7, and after a successful installation, the computer could neither connect to the internet nor use USB. That meant that in order to get drivers (you know, so I can connect to the internet or use USB) I had to burn a CD. It felt so 90s.

Even more unforgivable?

[image]

After being able to get on the net and download all the drivers for my machine, I still can't get USB to work. There don't seem to be any USB drivers for my model, which I find extremely puzzling.

At a minimum, I would expect Lenovo to include a drivers disk.

Challenge: Premature optimization, and all of that…

Let us look at the following pieces of code:

public void Consume(MyBooksRequest message)
{
    var user = session.Get<User>(message.UserId);
    
    bus.Reply(new MyBooksResponse
    {
        UserId = message.UserId,
        Timestamp = DateTime.Now,
        Books = user.CurrentlyReading.ToBookDtoArray()
    });
}

public void Consume(MyQueueRequest message)
{
    var user = session.Get<User>(message.UserId);

    bus.Reply(new MyQueueResponse
    {
        UserId = message.UserId,
        Timestamp = DateTime.Now,
        Queue = user.Queue.ToBookDtoArray()
    });
}

public void Consume(MyRecommendationsRequest message)
{
    var user = session.Get<User>(message.UserId);

    bus.Reply(new MyRecommendationsResponse
    {
        UserId = message.UserId,
        Timestamp = DateTime.Now,
        Recommendations = user.Recommendations.ToBookDtoArray()
    });
}

Looking at this, I see that I have a requirement for getting my books, my queue and my recommendations. It appears that getting each datum is going to result in 2 queries: the first to load the User, and the second to lazy load the actual collection that we want to return.

An almost trivial optimization would be to eliminate the lazy loading, right? That would reduce the cost from 6 queries to just 3.

However, that assumption would be wrong. The following client code:

bus.Send(
    new MyBooksRequest
    {
        UserId = userId
    },
    new MyQueueRequest
    {
        UserId = userId
    },
    new MyRecommendationsRequest
    {
        UserId = userId
    });

Produces this SQL:

-- statement #1
enlisted session in distributed transaction with isolation level: Serializable

-- statement #2
SELECT user0_.Id          as Id2_0_,
       user0_.Name        as Name2_0_,
       user0_.Street      as Street2_0_,
       user0_.Country     as Country2_0_,
       user0_.City        as City2_0_,
       user0_.ZipCode     as ZipCode2_0_,
       user0_.HouseNumber as HouseNum7_2_0_
FROM   Users user0_
WHERE  user0_.Id = 1 /* @p0 */

-- statement #3
SELECT currentlyr0_.[User] as User1_1_,
       currentlyr0_.Book   as Book1_,
       book1_.Id           as Id0_0_,
       book1_.Name         as Name0_0_,
       book1_.ImageUrl     as ImageUrl0_0_,
       book1_.Image        as Image0_0_,
       book1_.Author       as Author0_0_
FROM   UsersReadingBooks currentlyr0_
       left outer join Books book1_
         on currentlyr0_.Book = book1_.Id
WHERE  currentlyr0_.[User] = 1 /* @p0 */

-- statement #4
SELECT queue0_.[User]  as User1_1_,
       queue0_.Book    as Book1_,
       queue0_.[Index] as Index3_1_,
       book1_.Id       as Id0_0_,
       book1_.Name     as Name0_0_,
       book1_.ImageUrl as ImageUrl0_0_,
       book1_.Image    as Image0_0_,
       book1_.Author   as Author0_0_
FROM   UsersWaitingBooks queue0_
       left outer join Books book1_
         on queue0_.Book = book1_.Id
WHERE  queue0_.[User] = 1 /* @p0 */

-- statement #5
SELECT recommenda0_.[User]  as User1_1_,
       recommenda0_.Book    as Book1_,
       recommenda0_.[Index] as Index3_1_,
       book1_.Id            as Id0_0_,
       book1_.Name          as Name0_0_,
       book1_.ImageUrl      as ImageUrl0_0_,
       book1_.Image         as Image0_0_,
       book1_.Author        as Author0_0_
FROM   UsersRecommendedBooks recommenda0_
       left outer join Books book1_
         on recommenda0_.Book = book1_.Id
WHERE  recommenda0_.[User] = 1 /* @p0 */


-- statement #7
commit transaction

That seems strange, can you figure out why?

Bonus points for figuring out whether it would be worth it to do the eager load optimization or not.

Challenges: Where is the optimization?

Let us look at the following pieces of code:

public void Consume(MyBooksRequest message)
{
    var user = session.Get<User>(message.UserId);
    
    bus.Reply(new MyBooksResponse
    {
        UserId = message.UserId,
        Timestamp = DateTime.Now,
        Books = user.CurrentlyReading.ToBookDtoArray()
    });
}

public void Consume(MyQueueRequest message)
{
    var user = session.Get<User>(message.UserId);

    bus.Reply(new MyQueueResponse
    {
        UserId = message.UserId,
        Timestamp = DateTime.Now,
        Queue = user.Queue.ToBookDtoArray()
    });
}

public void Consume(MyRecommendationsRequest message)
{
    var user = session.Get<User>(message.UserId);

    bus.Reply(new MyRecommendationsResponse
    {
        UserId = message.UserId,
        Timestamp = DateTime.Now,
        Recommendations = user.Recommendations.ToBookDtoArray()
    });
}

Looking at this, I see that I have a requirement for getting my books, my queue and my recommendations. Looking at the code, can you guess how many queries are being generated to get those?

And can you suggest an optimization?

Profiler new feature: Integrating with application frameworks

One of the things that makes working with the profiler easier is the fact that it gives you not just information, but information in context.

I was working with an app using Rhino Service Bus, and it really bothered me that I couldn't immediately figure out what the trigger for a session was. When using ASP.Net or WCF, the profiler can show the URL that triggered the request, but when we are not using a URL based mechanism, that turns out to be much harder.

So I set out to fix that, you can see the results below:

[image: the profiler showing the originating message batch]

This session was generated by a message batch containing messages for MyBooks, MyQueue, etc.

The integration is composed of two parts, first, from the profiler perspective, you now have the ProfilerIntegration.CurrentSessionContext property, which allows you to customize how the profiler detects the current context.

The second part is the integration from the application framework itself, you can see how I did that for Rhino Service Bus, which will dynamically detect the presence of the profiler and fill the appropriate values. The result makes it a lot easier to track down what is going on.

Resolution: Efficient Querying

Well, I have already covered how you can handle this challenge several times in the past, so I'll not do that again. What I actually did is quite different: instead of having to deal with the complexity (which is possible), I decided to remove it entirely.

The solution to my problem is to simplify the model:

[image: the simplified model]

Which is represented as the following physical data model:

[image: the physical data model]

Now, querying for this is about as simple as you can get:

select user0_.Id          as Id2_0_,
       book2_.Id          as Id0_1_,
       user0_.Name        as Name2_0_,
       user0_.Street      as Street2_0_,
       user0_.Country     as Country2_0_,
       user0_.City        as City2_0_,
       user0_.ZipCode     as ZipCode2_0_,
       user0_.HouseNumber as HouseNum7_2_0_,
       book2_.Name        as Name0_1_,
       book2_.ImageUrl    as ImageUrl0_1_,
       book2_.Image       as Image0_1_,
       book2_.Author      as Author0_1_,
       queue1_.[User]     as User1_0__,
       queue1_.Book       as Book0__,
       queue1_.[Index]    as Index3_0__
from   Users user0_
       inner join UsersWaitingBooks queue1_
         on user0_.Id = queue1_.[User]
       inner join Books book2_
         on queue1_.Book = book2_.Id
where  user0_.Id = 1 /* @p0 */

De-normalizing the model has significantly improved my ability to work with it.

Challenge: Efficient querying

This issue came up in my Alexandria sample app. I wanted to have a queue of books for each user, and each book has a collection of its authors. The model looks something like this:

[image: the model]

And the physical data model:

[image: the physical data model]

So far, so good, and relatively simple and straightforward to work with, right?

Except, what kind of a query would you want to make to get all the books in the user’s queue? Here is the code that uses your results:

bus.Reply(new MyQueueResponse
{
    UserId = message.UserId,
    Timestamp = DateTime.Now,
    Queue = books.Select(book => new BookDTO
    {
        Id = book.Id,
        Image = book.Image,
        Name = book.Name,
        Authors = book.Authors.Select(x => x.Name).ToArray()
    }).ToArray()
});

Hint: note that we need to bring the Book's image as well, and pay attention to the number of joins you require, as well as the number of queries.

Suggestions for a new laptop?

Okay, the previous time I asked this question, I decided that buying a big desktop machine would be preferable, but I am currently traveling and I am pained by the speed of my current laptop.

Therefore, I am going to need a new one. I am the happy owner of a MacBook Pro, but I can't really justify purchasing a new Mac given the cost of a new Mac vs. a comparable PC. The Mac tax just doesn't pay for itself at those levels for me.

Any recommendations? I want something 15” with enough horsepower to be a major development machine. If there is something like the Mac’s touchpad, I would love that.

Any ideas?

Answer: Debugging a resource leak

As it turns out, there are a LOT of issues with this code:

public class QueueActions : IDisposable
{
    UnmanagedDatabaseConnection database;
    public string Name { get; private set; }

    public QueueActions(UnmanagedDatabaseConnectionFactory factory)
    {
         database = factory.Create();
         database.Open(()=> Name = database.ReadName());
    }

   // assume proper GC finalizer impl

    public void Dispose()
    {
          database.Dispose();
    }
}

And the code using this:

using(var factory = CreateFactory())
{
   ThreadPool.QueueUserWorkItem(()=>
   {
          using(var actions = new QueueActions(factory))
          {
               actions.Send("abc");     
          }
    });
}

To begin with, what happens if we close the factory between the first and second lines of the QueueActions constructor?

We already have an unmanaged resource, but when we try to open it, we are going to get an exception. Since the exception is thrown from the constructor, it will NOT invoke the usual using logic, and the object will not be disposed.
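One way to deal with that particular issue (a sketch of the general pattern, not necessarily how I fixed it) is to make the constructor clean up after itself if it cannot complete:

    public QueueActions(UnmanagedDatabaseConnectionFactory factory)
    {
        database = factory.Create();
        try
        {
            database.Open(() => Name = database.ReadName());
        }
        catch
        {
            // the constructor is the only thing that knows about the connection at
            // this point, so it must release it before letting the exception escape
            database.Dispose();
            throw;
        }
    }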

Furthermore, and this is the reason for this blog post: Dispose itself can also fail.

Here is the actual stack trace that caused this blog post:

Microsoft.Isam.Esent.Interop.EsentErrorException: Error TermInProgress (JET_errTermInProgress, Termination in progress)
at Microsoft.Isam.Esent.Interop.Api.Check(Int32 err) in Api.cs: line 1492
at Microsoft.Isam.Esent.Interop.Api.JetCloseTable(JET_SESID sesid, JET_TABLEID tableid) in Api.cs: line 372
at Microsoft.Isam.Esent.Interop.Table.ReleaseResource() in D:\Work\esent\EsentInterop\Table.cs: line 97
at Microsoft.Isam.Esent.Interop.EsentResource.Dispose() in EsentResource.cs: line 63
at Rhino.Queues.Storage.AbstractActions.Dispose() in AbstractActions.cs: line 146 

Challenge: Find the resource leak

I am not sure if I should be embarrassed or furious about this bug, but I spent a lot of time working on this issue. Basically, I had a resource leak in my app. I knew that I had it, and even though it took a while to create a reliable reproduction, I managed to narrow it down to a simple repro. It was obvious that the issue was leaking resources during the shutdown phase of the application as a result of a race condition.

Now, usually we don't care about such things, because shutting down will, obviously, clear all of those nasty things anyway. Unfortunately, my scenario is to start & shut down many times in a single process, as part of a unit test. So I had to figure out what was actually going on. The maddening part is that the code should work.

Here is a close approximation of the code in question:

public class QueueActions : IDisposable
{
    UnmanagedDatabaseConnection database;
    public string Name { get; private set; }

    public QueueActions(UnmanagedDatabaseConnectionFactory factory)
    {
         database = factory.Create();
         database.Open(()=> Name = database.ReadName());
    }

   // assume proper GC finalizer impl

    public void Dispose()
    {
          database.Dispose();
    }
}

And the code using this:

using(var factory = CreateFactory())
{
   ThreadPool.QueueUserWorkItem(()=>
   {
          using(var actions = new QueueActions(factory))
          {
               actions.Send("abc");     
          }
    });
}

Note that this isn't remotely like the actual code; it is highly simplified.

And yes, there is an intentional race condition here.

Here are two pieces of information that are important to remember about this unmanaged library:

  • The factory will not close if there are opened connections (exception)
  • Trying to open a connection after the factory was closed will fail (exception)

How many problems does the code above have?

Performance optimizations, managed code and leaky abstractions

I ran into this post from Jeff Atwood, talking about the performance difference between managed and unmanaged code:

C# versus unmanaged C++, Chinese/English dictionary reader

There were a lot of optimizations for this along the way, but the C++ version has soundly beaten the C# version. As expected, right?

Well, yes, but with extenuating circumstances.

So am I ashamed by my crushing defeat? Hardly. The managed code achieved a very good result for hardly any effort. To defeat the managed version, Raymond had to:
  • Write his own file/io stuff
  • Write his own string class
  • Write his own allocator
  • Write his own international mapping

Of course he used available lower level libraries to do this, but that's still a lot of work. Can you call what's left an STL program? I don't think so, I think he kept the std::vector class which ultimately was never a problem and he kept the find function. Pretty much everything else is gone.

So, yup, you can definitely beat the CLR. I think Raymond can make his program go even faster.

I find this interesting, because it isn't really specific to C++. In my recent performance sprint for the profiler, I had to:

  • Write my own paging system
  • Write my own string parsing routines
  • Write my own allocator

For the most part, performance optimizations fall into four categories:

  • Inefficient algorithms – O(N) notation, etc.
  • Inefficient execution – not applying caching, doing too much work upfront, doing unneeded work.
  • I/O Bound – the execution waits for a file, database, socket, etc.
  • CPU Bound – it just takes a lot of calculations to get the result.

I can think of very few problems that are really CPU bound; they tend to be very specific and small. And those are just about the only ones that will gain any real benefit from faster code. Of course, in pure math scenarios, which is pretty much where most of the CPU bound code resides, there isn't much of a difference between the languages that you choose (assuming it is not interpreted, at least, and that you can run directly on the CPU using native instructions). But as I said, those are pretty rare.

In nearly all cases, you'll find that the #1 cause of perf issues is IO. Good IO strategies (buffering, pre-loading, lazy loading, etc.) are usually applicable to specific scenarios, but they are the ones that will make a world of difference between poorly performing code and highly performing code. Caching can also make a huge difference, as can deferring work to when it is actually needed.
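As a trivial illustration of deferring work (the type and loader here are made up for the example), Lazy<T> gives you the "load it only if and when someone actually asks for it" behavior almost for free:

using System;

public class ReportScreen
{
    // the expensive load runs at most once, and only on first access
    private readonly Lazy<string> report =
        new Lazy<string>(() => LoadReportFromDisk());   // hypothetical expensive IO

    public string Report
    {
        get { return report.Value; }
    }

    private static string LoadReportFromDisk()
    {
        // stand-in for the actual expensive IO work
        return System.IO.File.ReadAllText("report.txt");
    }
}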

I intentionally kept "optimize the algorithm" for last, because while it can make a drastic performance difference, it is also the easiest to do, since there is so much information about it, assuming that you didn't accidentally get yourself into O(N^2) or worse.

Say hello to Uber Prof

I got several requests for this, so I am making Uber Prof itself available for purchasing.

What is Uber Prof?

It is a shorthand way of saying: all the OR/M profilers that we make.

An Uber Prof license gives you the ability to use:

And it will automatically give you the ability to use any additional profilers that we will create. And yes, there is an upgrade path if you already purchased a single profiler license and would like to upgrade to Uber Prof.

A marketing mistake: WCF Data Services & WCF RIA Services

There are some things that I just don't understand, and the decision to name two apparently different technologies working in the same area using those two names is one of them.

The image on the right is from In Search Of Stupidity, an excellent and funny book which talks about a lot of marketing mistakes made by software companies, and dedicates a whole chapter to this kind of error.

Coming back to WCF Data Services & WCF RIA Services, I read this page, and I am still confused. It appears that the major difference is that RIA Services will generate the Silverlight client classes as part of the build process, whereas with Data Services this is a separate process.

And I am not sure even of that.