Database assisted denormalization

time to read 8 min | 1547 words

Let us say that I have the homepage of the application, where we display Blogs with their Post count, using the following query:

select 
    dbo.Blogs.Id, 
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle,
    (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) as PostCount
 from dbo.Blogs 

Given what I think thoughts of denormalization, and read vs. write costs, it seems a little wasteful to run the aggregate all the time.

I can always add a PostCount property to the Blogs table, but that would require me to manage that myself, and I thought that I might see whatever the database can do it for me.

This isn’t a conclusive post, it details what I tried, and what I think is happening, but it isn’t the end all be all. Moreover, I run my tests on SQL Server 2008 R2 only, not on anything else. I would like to hear what you think of this.

My first thought was to create this as a persisted computed column:

ALTER TABLE Blogs
ADD PostCount AS (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) PERSISTED

But you can’t create computed columns that uses subqueries. I would understand easier why not if it was only for persisted computed columns, because that would give the database a hell of time figuring out when that computed column needs to be updated, but I am actually surprised that normal computed columns aren’t supporting subqueries.

Given that my first attempt failed, I decided to try to create a materialized view for the data that I needed. Materialized views in SQL Server are called indexed views, There are several things to note here. You can’t use subqueries here either (likely because the DB couldn’t figure which row in the index to update if you were using subqueries), but have to use joins.

I created a data set of 1,048,576 rows in the blogs table and 20,971,520 posts, which I think should be enough to give me real data.

Then, I issued the following query:

select 
        dbo.Blogs.Id, 
        dbo.Blogs.Title,
        dbo.Blogs.Subtitle,
        count_big(*) as PostCount
from dbo.Blogs left join dbo.Posts
        on dbo.Blogs.Id = dbo.Posts.BlogId
where dbo.Blogs.Id = 365819
group by dbo.Blogs.Id,
        dbo.Blogs.Title,
        dbo.Blogs.Subtitle

This is before I created anything, just to give me some idea about what kind of performance (and query plan) I can expect.

Query duration: 13 seconds.

And the execution plan:

image

The suggest indexes feature is one of the best reasons to move to SSMS 2008, in my opinion.

Following the suggestion, I created:

CREATE NONCLUSTERED INDEX [IDX_Posts_ByBlogID]
ON [dbo].[Posts] ([BlogId])

And then I reissued the query. It completed in 0 seconds with the following execution plan:

image

After building Raven, I have a much better understanding of how databases operate internally, and I can completely follow how that introduction of this index can completely change the game for this query.

Just to point out, the results of this query is:

Id          Title                 Subtitle               PostCount
----------- --------------------- ---------------------- --------------------
365819      The lazy blog         hibernating in summer  1310720

I decided to see what using a view (and then indexed view) will give me. I dropped the IDX_Posts_ByBlogID index and created the following view:

CREATE VIEW BlogsWithPostCount 
WITH SCHEMABINDING
AS 
select 
    dbo.Blogs.Id, 
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle,
    count_big(*) as PostCount
 from dbo.Blogs join dbo.Posts
    on dbo.Blogs.Id = dbo.Posts.BlogId
 group by dbo.Blogs.Id,
    dbo.Blogs.Title,
    dbo.Blogs.Subtitle

After which I issued the following query:

select 
        Id, 
        Title,
        Subtitle,
        PostCount
from BlogsWithPostCount
where Id = 365819

This had the exact same behavior as the first query (13 seconds and the suggestion for adding the index).

I then added the following index to the view:

CREATE UNIQUE CLUSTERED INDEX IDX_BlogsWithPostCount
ON BlogsWithPostCount (Id)

And then reissued the same query on the view. It had absolutely no affect on the query (13 seconds and the suggestion for adding the index). This make sense, if you understand how the database is actually treating this.

The database just created an index on the results of the view, but it only indexed the columns that we told it about, which means that is still need to compute the PostCount. To make things more interesting, you can’t add the PostCount to the index (thus saving the need to recalculate it).

Some points that are worth talking about:

  • Adding IDX_Posts_ByBlogID index resulted in a significant speed increase
  • There doesn’t seem to be a good way to perform materialization of the query in the database (this applies to SQL Server only, mind you, maybe Oracle does better here, I am not sure).

In other words, the best solution that I have for this is to either accept the cost per read on the RDBMS and mitigate that with proper indexes or create a PostCount column in the Blogs table and manage that yourself. I would like your critique on my attempt, and additional information about whatever what I am trying to do is possible in other RDMBS.