Database assisted denormalization
Let us say that I have the homepage of the application, where we display Blogs with their Post count, using the following query:
select dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle, (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) as PostCount from dbo.Blogs
Given what I think thoughts of denormalization, and read vs. write costs, it seems a little wasteful to run the aggregate all the time.
I can always add a PostCount property to the Blogs table, but that would require me to manage that myself, and I thought that I might see whatever the database can do it for me.
This isn’t a conclusive post, it details what I tried, and what I think is happening, but it isn’t the end all be all. Moreover, I run my tests on SQL Server 2008 R2 only, not on anything else. I would like to hear what you think of this.
My first thought was to create this as a persisted computed column:
ALTER TABLE Blogs ADD PostCount AS (select COUNT(*) from Posts where Posts.BlogId = Blogs.Id) PERSISTED
But you can’t create computed columns that uses subqueries. I would understand easier why not if it was only for persisted computed columns, because that would give the database a hell of time figuring out when that computed column needs to be updated, but I am actually surprised that normal computed columns aren’t supporting subqueries.
Given that my first attempt failed, I decided to try to create a materialized view for the data that I needed. Materialized views in SQL Server are called indexed views, There are several things to note here. You can’t use subqueries here either (likely because the DB couldn’t figure which row in the index to update if you were using subqueries), but have to use joins.
I created a data set of 1,048,576 rows in the blogs table and 20,971,520 posts, which I think should be enough to give me real data.
Then, I issued the following query:
select dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle, count_big(*) as PostCount from dbo.Blogs left join dbo.Posts on dbo.Blogs.Id = dbo.Posts.BlogId where dbo.Blogs.Id = 365819 group by dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle
This is before I created anything, just to give me some idea about what kind of performance (and query plan) I can expect.
Query duration: 13 seconds.
And the execution plan:
The suggest indexes feature is one of the best reasons to move to SSMS 2008, in my opinion.
Following the suggestion, I created:
CREATE NONCLUSTERED INDEX [IDX_Posts_ByBlogID] ON [dbo].[Posts] ([BlogId])
And then I reissued the query. It completed in 0 seconds with the following execution plan:
After building Raven, I have a much better understanding of how databases operate internally, and I can completely follow how that introduction of this index can completely change the game for this query.
Just to point out, the results of this query is:
Id Title Subtitle PostCount
----------- --------------------- ---------------------- --------------------
365819 The lazy blog hibernating in summer 1310720
I decided to see what using a view (and then indexed view) will give me. I dropped the IDX_Posts_ByBlogID index and created the following view:
CREATE VIEW BlogsWithPostCount WITH SCHEMABINDING AS select dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle, count_big(*) as PostCount from dbo.Blogs join dbo.Posts on dbo.Blogs.Id = dbo.Posts.BlogId group by dbo.Blogs.Id, dbo.Blogs.Title, dbo.Blogs.Subtitle
After which I issued the following query:
select Id, Title, Subtitle, PostCount from BlogsWithPostCount where Id = 365819
This had the exact same behavior as the first query (13 seconds and the suggestion for adding the index).
I then added the following index to the view:
CREATE UNIQUE CLUSTERED INDEX IDX_BlogsWithPostCount ON BlogsWithPostCount (Id)
And then reissued the same query on the view. It had absolutely no affect on the query (13 seconds and the suggestion for adding the index). This make sense, if you understand how the database is actually treating this.
The database just created an index on the results of the view, but it only indexed the columns that we told it about, which means that is still need to compute the PostCount. To make things more interesting, you can’t add the PostCount to the index (thus saving the need to recalculate it).
Some points that are worth talking about:
- Adding IDX_Posts_ByBlogID index resulted in a significant speed increase
- There doesn’t seem to be a good way to perform materialization of the query in the database (this applies to SQL Server only, mind you, maybe Oracle does better here, I am not sure).
In other words, the best solution that I have for this is to either accept the cost per read on the RDBMS and mitigate that with proper indexes or create a PostCount column in the Blogs table and manage that yourself. I would like your critique on my attempt, and additional information about whatever what I am trying to do is possible in other RDMBS.