Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,953 | Comments: 44,408

filter by tags archive

Black box reverse engineering speculation


Terrance has pointed me to some really interesting feature in Solr, called Facets. After reading the documentation, I am going to try and guess how this is implemented, based on my understanding of how Lucene works.

But first, let me explain what Facets are, Facets are a way to break down a search result in a way that would give the user more meaningful results. I haven’t looked at the code, and haven’t read any further than that single link, but i think that I can safely extrapolate from that. I mean, the worst case that could happen is that I would look stupid.

Anyway, Lucene, the underpinning of Solr, is a document indexing engine. It has no ability to do any sort of aggregation, and I am pretty sure that Solr didn’t sneak in something relational when no one was looking. So how can it do these sort of things?

Well, let us look at a simple example: ?q=camera&facet=true&facet.field=manu, which will give us the following results:

<!-- search results snipped -->
<lst name="facet_fields">
  <lst name="manu">
    <int name="Canon USA">17</int>
    <int name="Olympus">12</int>
    <int name="Sony">12</int>
    <int name="Panasonic">9</int>
    <int name="Nikon">4</int>
  </lst>
</lst>

Remember what we said about Lucene being an indexing engine? You can query the index itself very efficiently, and these sort of results are something that Lucene can provide you instantly.

More over, when we start talking about facets prices, which looks something like this;

?q=camera&facet=true&facet.query=price:[* TO 100]
    &facet.query=price:[100 TO 200];&facet.query=[price:200 TO 300]
    &facet.query=price:[300 TO 400];&facet.query=[price:400 TO 500]
    &facet.query=price:[500 TO *]

It gets even nicer. If I would have that problem (which I actually do, but that is a story for another day), I would resolve this using individual multiple Lucnene searches. Something like:

  • type:camera –> get docs
  • type:camera price:[* TO 100] –> but just get count
  • type:camera price:[100 TO 200] –> but just get count

In essence, Solr functions as a query batching mechanism to Lucene, and then message the data to a form that is easy to consume by the front end. That is quite impressive.

By doing this aggregation, Solr can provide some really impressive capabilities, on top of a really simple concept. I am certainly going to attempt something similar for Raven.

Of course, I may have headed in the completely wrong direction, in which case I am busy wiping egg of my face.

User Interface Delegation – Rhino Divan DB


No, this isn’t a post about how to do UI delegation in software. This is me, looking at HTML code and realizing that I have a deep and troubling disadvantage, while I know how to build UI, I am just not very good at it.

For a certain value of done, Rhino Divan DB is done. Architecture is stable, infrastructure is in place, everything seems to be humming along nicely. There is still some work to be done (error handling, production worthiness, etc) but those are relatively minor things.

The most major item on the table, however, is providing a basic UI on top of the DB. I already have a working prototype, but I came to realize that I am tackling something that I am just not good at.

This is where you, hopefully, comes into play. I am interested in someone with good graph of HTML/JS to build a very simple CRUD interface on top of a JSON API. That is simple enough, but the small caveat is that my dead lines are short, I would like to see something working tomorrow.

Anyone is up for the challenge?

NHibernate donation campaign


NHibernate is the most popular Open Source Object Relational Mapper in the .NET framework. As an Open Source project, all the work done on it is done for free.  We would like to be able to dedicate more time to NHibernate, but even as a labor of love, the amount of time that we can spend on a free project is limited.

In order to facilitate that, we opened a donation campaign that will allow you to donate money to the project.

Click here to lend your support to: NHibernate and make a donation at www.pledgie.com !

What is this money going to be used for?

This money will go directly to NHibernate development, primarily to sponsor the time required development of NHibernate itself.

Donation Matching

Moreover, my company, Hibernating Rhinos, is going to match any donation to this campaign (to a total limit of 5,000$), as a way to give back to the NHibernate project for the excellent software it produced.

In addition to that, my company will also sponsor any resources need for the project, such as production servers (the NHibernate’s Jira is already running on our servers), build machines, etc.

Why should you donate?

If you are a user of NHibernate, you gained a lot from build on such a solid foundation. We ask to you to donate so that we can make the project even better. If your company uses NHibernate, ask it to donate to this campaign.

Thanks,

~Ayende

Pricing software


It seems that people make some rather interesting assumptions regarding software pricing. When talking about commercial products, there is a wide variety of strategies that you might want to push, depending on what exactly you are trying to achieve.

You may be trying to have a loss leader product, gain reputation or entry to a new market, in which case your goals are skewed toward optimizing other things. What I want to talk about today is pricing software with regards to maximizing profits.

Probably the first thing that you need to do consider the base price of the product. That is important for several levels. For the obvious reason that this is what you will make out of it, but also because the way you price your product impacts its Perception of Value.

If you try to sell your product at a significant cost below what the market expect it to be, the perception is that it is crap, and you are trying to get rid of it.

Another issue that you have to face is that it is very hard to change the base price once you settled on it. It is fairly obvious why you have a hard time jacking the price upward, but pushing the price down is also something to approach with trepidation. It has a very negative impact on the product image. Now, you can effectively price it at a lower rate very easily, but doing things like offers, discounts, etc. Those do not affect the base price and so generally do not affect how the product is perceived.

And just as bad as with setting the price significantly below market expectations, setting the price significantly higher than market expectations. The problem here is that you come out as a loony with delusions of grandeur. Even if your product is wonderful, and even if it will return its investment in a single week, pricing it too high generate a  strong rejection reaction. It also means that you need to actively combat that perception, and that takes a lot of time & effort.

Finally, in the range between too little and too much, you have a wide room to play. And that is where the usual supply & demand metrics come into play. Since supply isn’t a problem with software, what we are talking about is demand vs. price. Everyone in imageeconomics is familiar with the graph, I assume.

As price goes up, demands goes down, and vice versa. So the question is what is the optimum price point that would generate the most revenue. Here are some numbers, showing the probable correlation between price and sales.

Price Users $
$ 2,000.00 1.00 $ 2,000.00
$ 1,000.00 1.00 $ 1,000.00
$ 200.00 100.00 $ 20,000.00
$ 100.00 200.00 $ 20,000.00
$ 10.00 5000.00 $ 50,000.00

It might be easier to look at in its graphical form to the right.

You can see that when the price point is too high, very few people will buy it, if at all. When the price point is in the range that the market expects, we make some money, drastically more than we would when the price was above market expectations.

However, if we set the price way down, we get a lot more users, and a lot more money pouring in. We will ignore the question of damage to the reputation for now.

This seems to indicate that by lowering the price we can make a lot more money, right? Except that it isn’t really the case. Let us see why.

1 in 10 users has a question, requires support, etc. Let us say that each question cost 100$, and now let us look at the numbers.

Price Users Calls Support Cost Income Revenue
$ 2,000.00 1.00 0.00 $ - $ 2,000.00 $ 2,000.00
$ 1,000.00 1.00 0.00 $ - $ 1,000.00 $ 1,000.00
$ 200.00 100.00 10.00 $ 1,000.00 $ 20,000.00 $ 19,000.00
$ 100.00 200.00 20.00 $ 2,000.00 $ 20,000.00 $ 18,000.00
$ 10.00 5000.00 500.00 $ 50,000.00 $ 50,000.00 $ -

And that tells us a very different story indeed. The end goal is increasing profit, not just the amount of money coming in, after all.

Oh, and a final thought, the following coupon code EXR-45K2D462FD is my pricing experiment, it allows the first 10 users to buy NH Prof, L2S Prof or EF Prof at a 50% discount.

Since all the coupon codes run out in a short order, I am going to continue the experiment, this coupon code ER2-45K2D462FJ will give the first 25 users to buy NH Prof, L2S Prof or EF Prof at a 25% discount.

Fun with a non relational databases


My post showing a different approach for handling data got a lot of traffic, and a lot of good comments. But I think that there is some misunderstanding with regards to the capabilities of NoSQL databases, so I am going to try to expand on those capabilities in this post.

Instead of hand waving, and since I am thinking about this a lot lately, we will assume that we are talking about DivanDB (unreleased version), which has the following API:

  • JsonDoc[] Get(params Id[] ids);
  • Set(params JsonDoc[] docs);
  • JsonDoc[] Query(string indexName, string query);

DivanDB is a simple Document database, storing documents as Json, and using Lucene as a indexer for the data. An index is defined by specifying a function that creates it:

var booksByAuthor = from doc in documents
                                  where doc.type == “book”
                                  from author in doc.authors
                                  select new { author };

And the data looks like this:

imageimage

Indexing is done at write time, you can think about those indexes as materialized views.

It appears that people assumes that just because you aren’t using an RDBMS, you can’t use queries. Here are a few options to show how you can do so.

Books by author:

Query(“booksByAuthor”, “author:weber”);

Books by category:

Query(“booksByCategory”, “category:scifi”);

How is the books by category index defined? I think you can guess.

var booksByCategory = from doc in documents
                                  where doc.type == “book”
                                  from category in doc.categories
                                  select new { category };

What other queries did people brought up?

Who has book X in their queue?

Query(“usersByQueuedBooks”, “book:41”);

And the view?

var usersByQueuedBooks = from doc in documents
                                  where doc.type == "user"
                                  from book in doc.queues_books
                                  select new { book };

I’ll leave the implementation of Who has book X checked out as an exercise for the reader.

What about deletes?

Using this system, it looks like deletes might be really expensive, right? Well, that depends on what exactly you want here.

My default approach would be to consider exactly what you want, as Udi pointed out, in the real world, you don’t delete.

But it is actually fairly easy to support something like this cheaply. It is all about defining the association and letting the DB handle this (although I am not fond of the syntax I came up with):

var books = from book in documents
            where book.type == "book"
            select new { book = book.id };
var checkedOutBooks =    from user in documents
                        where user.type = "user"
                        from book in user.checked_out
                        select new { book };

var queuedBooks =    from user in documents
                    where user.type = "user"
                    from book in user.queued_books
                    select new { book };

FK_DisallowDeletion(books, checkoutBooks);
FK_RemoveElement(books, queuedBooks);

Under the cover this creates indexers and can check out those at delete / insert time.

However, I would probably not implement this for Rhino DivanDB, mostly because I agree with Udi.

Rhino Divan DB reboot idea


Divan DB is my pet database. I created it to scratch an itch [Nitpickers: please note this!], to see if I can create Couch DB like system in .NET. You can read all about it in the following series of posts.

It stalled for a while, mostly because I run into the hard problems (building map/reduce views). But I think that I actually have a better idea, instead of trying to build something that would just mimic Couch DB, a .NET based Document DB is actually a very achievable goal.

The way it would work is actually pretty simple, the server would accept Json-formatted documents, like those:

[
    {
        "id": 153,
        "type": "book",
        "name": "Storm from the Shadows",
        "authors": [
            "David Weber"
        ],
        "categories": [
            "SciFi",
            "Awesome",
            "You gotta read it"
        ],
        "avg_stars": 4.5,
        "reviews": [13,5423,423,123,512]
    },
    {
        "id": 1337,
        "type": "book",
        "name": "DSLs in Boo",
        "authors": [
            "Ayende Rahien",
            "Oren Eini"
        ],
        "categories": [
            "DSL",
            ".NET",
            "You REALLY gotta read it"
        ],
        "avg_stars": 7,
        "reviews": [843,214,451]
    }
]

Querying could be done either by id, or using a query on an index. Indexes can be defined using the following syntax:

var booksByTitle = 
   from book in docs
   where book.type == "book"
   select new { book.title };

The fun part here is that this index would be translated into a Lucene index, which means that you could query the index using a query:

Query(“booksByTitle”, “title:Boo”) –> documents that match this query.

As well as apply any & all the usual Lucene tricks.

You don’t get Map/Reduce using this method, but the amount of complexity you have is quite low, and the implementation should take only several days to build.

Thoughts?

Profiler new featureToo many joins detection


This is Josh’s feature, since we wrote most of the code for it together. Basically, it recognize a very common performance problem, queries that uses too many joins, such as this one:

image

Which would result in the following warning:

image

Queries with too many joins might be a performance problem. Each join requires the database to perform additional work, and the complexity and cost of the query grows rapidly with each additional join. While relational database are optimized for handling joins, it is often more efficient to perform several separate queries instead of a single query with several joins in it.

For OLTP systems, you should consider simplifying your queries or simplifying the data model. While I do not recommend avoiding joins completely, I strong discourage queries with large numbers of joins. Another issue to pay attention to is possible Cartesian products in queries contains joins, it is very easy to create such a thing and not notice it during development.

Profiler Subscription?


One repeated request for the profiler is to have a personal version (which a price to match).

I am not really happy with the idea, for several reasons. One of the major ones is that I have enough variability in the product already, and adding a new edition in addition to the 4 we already support is bound to create headaches. Another is that I simply cannot just draw a line and say “those are the pro features and these are the personal features”.

Instead, I wonder about offering a subscription model, something with a cost around 10 – 15 Euro per month. This would be renewed monthly (automatically), and beyond just having a lower price point, it will also provide automatic upgrades across versions (so free upgrade from 1.x to 2.x).

Thoughts?

Profiler Speculative Feature: Query plans


This isn’t a new feature, because you can’t use it right now, but it is a really nice feature that we are working on, and I couldn’t resist showing it off hot “off the press”, so to speak.

Given the following query:

SELECT this_.id             AS id7_1_,
       this_.title          AS title7_1_,
       this_.subtitle       AS subtitle7_1_,
       this_.allowscomments AS allowsco4_7_1_,
       this_.createdat      AS createdat7_1_,
       posts2_.blogid       AS blogid3_,
       posts2_.id           AS id3_,
       posts2_.id           AS id0_0_,
       posts2_.title        AS title0_0_,
       posts2_.TEXT         AS text0_0_,
       posts2_.postedat     AS postedat0_0_,
       posts2_.blogid       AS blogid0_0_,
       posts2_.userid       AS userid0_0_
FROM   blogs this_
       LEFT OUTER JOIN posts posts2_
         ON this_.id = posts2_.blogid
WHERE  this_.id = 1 /* @p0 */

SELECT this_.id            AS id0_1_,
       this_.title         AS title0_1_,
       this_.TEXT          AS text0_1_,
       this_.postedat      AS postedat0_1_,
       this_.blogid        AS blogid0_1_,
       this_.userid        AS userid0_1_,
       comments2_.postid   AS postid3_,
       comments2_.id       AS id3_,
       comments2_.id       AS id2_0_,
       comments2_.name     AS name2_0_,
       comments2_.email    AS email2_0_,
       comments2_.homepage AS homepage2_0_,
       comments2_.ip       AS ip2_0_,
       comments2_.TEXT     AS text2_0_,
       comments2_.postid   AS postid2_0_
FROM   posts this_
       LEFT OUTER JOIN comments comments2_
         ON this_.id = comments2_.postid
WHERE  this_.blogid = 1 /* @p1 */

The profiler can show you the query plan using this UI:

image

And here is how the same query looks like using the query plan feature in Management Studio:

image

So, why implement it?

  • This isn’t limited to SQL Server, the profiler can display query plans for: SQL Server, Oracle, PostgreSQL and MySQL
  • This let you keep yourself in the flow, just hit a button to see the query plan, instead of copying the SQL, opening SSMS, displaying the query plan, etc.

Don’t discount the last one, making it easy is one of the core values of the profiler.

The idea is that if you make it easy enough, the barriers for using it goes away. If you can instantly see the query plan for a query, you are far more likely to look at it than if it takes 30 seconds to get that. At that point, you would only do it when you already have a performance problem.

Slaying relational dragons


I recently had a fascinating support call, talking about how to optimize a very big model and an access pattern that basically required to have the entire model in memory for performing certain operations.

A pleasant surprise was that it wasn’t horrible (when I get called, there is usually a mess), which is what made things interesting. In the space of two hours, we managed to:

  • Reduced number of queries by 90%.
  • Reduced size of queries by 52%.
  • Increased responsiveness by 60%, even for data set an order of magnitude.

My default answer whenever I am asked when to use NHibernate is: Whenever you use a relational database.

My strong recommendation at the end of that support call? Don’t use a relational DB for what you are doing.

The ERD just below has absolutely nothing to do with the support call, but hopefully it will help make the example. Note that I dropped some of the association tables, to make it simpler.

image

And the scenario we have to deal with is this one:

image

Every single table in the ERD is touched by this screen.  Using a relational database, I would need something like the following to get all this data:

SELECT * 
FROM   Users 
WHERE  Id = @UserID 

SELECT * 
FROM   Subscriptions 
WHERE  UserId = @UserId 
       AND GETDATE() BETWEEN StartDate AND EndDate 

SELECT   MIN(CheckedBooks.CheckedAt), 
         Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     CheckedBooks 
         JOIN Books 
           ON BookId 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    CheckedBooks.UserId = @UserId 
GROUP BY BookId 

SELECT   Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     Books 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    BookId IN (SELECT BookID 
                    FROM   QueuedBooks 
                    WHERE  UserId = @UserId) 
GROUP BY BookId 

SELECT   Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     Books 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    BookId IN (SELECT BookID 
                    FROM   RecommendedBooks 
                    WHERE  UserId = @UserId) 
GROUP BY BookId 

SELECT   Books.Name, 
         Books.ImageUrl, 
         AVG(Reviews.NumberOfStars), 
         GROUP_CONCAT(', ',Authors.Name), 
         GROUP_CONCAT(', ',Categories.Name) 
FROM     Books 
         JOIN BookToAuthors 
           ON BookId 
         JOIN Authors 
           ON AuthorId 
         JOIN Reviews 
           ON BookId 
         JOIN BooksCategories 
           ON BookId 
         JOIN Categories 
           ON CategoryId 
WHERE    Books.Name LIKE @search 
          OR Categories.Name LIKE @search 
          OR Reviews.Review LIKE @search 
GROUP BY BookId

Yes, this is a fairly simplistic approach, without de-normalization, and I would never perform searches in this manner, but… notice how complex things are getting. For bonus points, look at the forth query, the queued books are ordered, try to figure out how we can get the order in a meaningful way. I shudder to thing about the execution plan of this set of queries. Even if we ignore the last one that does full text searching in the slowest possible way. And this is just for bringing the data for a single screen, assuming that magically it will show up (you need to do a lot of manipulation at the app level to make this happen).

The problem is simple, our data access pattern and the data storage technology that we use are at odds with one another. While relational modeling dictate normalization, our actual data usage means that we don’t really deal with a single-row entity, with relatively rare access to associations, which is the best case for OLTP. Nor are we dealing with set based logic, which is the best case for OLAP / Relational based queries.

Instead, we are dealing an aggregate that spans multiple tables, mostly because we have no other way to express lists and many to many associations in a relational database.

Let us see how we could handle things if we were using a document or key/value database. We would have two aggregates, User and Book.

GetUser(userId) –> would result in:

image

We can now issue another query, to bring the associated books. GetBooks(153, 1337) would result in:

image

Note that the entire data structure is different, we haven’t just copied the normalized relational model, we have a totally different model. An aggregate (similar to DDD’s aggregate) is a single entity that contains anything except other aggregates. References to other aggregates are allowed (from user to all the books), but most of the entity’s data is stored as a single value.

That has several interesting implications. First, we need two queries to get the data for the screen. One to get the user’s data, and the second to get the books that we need to display. Reducing remote calls is something that you really care about, and simplifying the queries to mere query by ids is going to have a significant effect as well.

By changing the data storage technology, we also enforced a very rigid aggregate boundary. Transactions becomes much simpler as well, since most transactions will now modify only a single aggregate, which is a single operation, no matter how many actual operations we perform on that aggregate. And by tailoring the data structure that we use to match our needs, we have natural aggregate boundaries.

The end result is a far simpler method of working with the data. It may mean that we have to do more work upfront, but look at the type of work we would have to do in order to try to solve our problems using the relational model. I know what model I would want for this sort of a problem.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
  2. Special Offer (2):
    27 May 2015 - 29% discount for all our products
  3. RavenDB Sharding (3):
    22 May 2015 - Adding a new shard to an existing cluster, splitting the shard
  4. Challenge (45):
    28 Apr 2015 - What is the meaning of this change?
  5. Interview question (2):
    30 Mar 2015 - fix the index
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats