Ayende @ Rahien

Refunds available at head office

Database assisted denormalization – Oracle edition

I decided to take a chance (installing Oracle is a big leap :-) ) and see how things match in Oracle.

I decided to run the following query:

SELECT deptno, 
       dname, 
       loc, 
       (SELECT COUNT(*) 
        FROM   emp 
        WHERE  emp.deptno = dept.deptno) AS empcount 
FROM   dept 
WHERE  deptno = 20 

Please note that I run in on a database that had (total) maybe a 100 records, so the results may be skewed.

image

Like in the SQL Server case, we need to create an index on the FK column. I did so, after which I got:

image

Then I dropped that index and create a simple view:

CREATE VIEW depswithempcount 
AS 
  SELECT deptno, 
         dname, 
         loc, 
         (SELECT COUNT(*) 
          FROM   emp 
          WHERE  emp.deptno = dept.deptno) AS empcount 
  FROM   dept 

Querying on top of that gives me the same query plan as before. Trying to create a materialized view out of this fails, because of the subquery expression, I’ll have to express the view in terms of joins, instead. Like this:

SELECT dept.deptno, 
       dname, 
       loc, 
       COUNT(*) empcount 
FROM   dept 
       LEFT JOIN emp 
         ON dept.deptno = emp.deptno 
WHERE  dept.deptno = 20 
GROUP  BY dept.deptno, 
          dname, 
          loc 

Interestingly enough, this is a different query plan than the subquery, with SQL Server, those two query exhibit identical query plans.

image

Now, to turn that into an materialized view.

CREATE materialized VIEW deptwithempcount 
AS SELECT dept.deptno, 
          dname, 
          loc, 
          COUNT(*) empcount 
   FROM   dept 
          left join emp 
            ON dept.deptno = emp.deptno 
   GROUP  BY dept.deptno, 
             dname, 
             loc 

And querying on this gives us very interesting results:

select * from deptwithempcount 
where deptno = 20

image

Unlike SQL Server, we can see that Oracle is reading everything from the view. But let us try one more thing, before we conclude this with a victory.

update emp 
set deptno = 10
where deptno = 20;

select * from deptwithempcount 
where deptno = 20

But now, when we re-run the materialized view query, we see the results as they were at the creation of the view.

There appears to be a set of options to control that, but the one that I want (RERESH FAST), which update the view as soon as data changes will not work with this query, since it consider it too complex. I didn’t investigate too deeply, but it seems that this is another dead end.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Justin
08/27/2010 03:12 PM by
Justin

Probably one of the biggest deficiencies in the various RDBMS offerings out there is their inability to materialize complex views.

This is a hard problem to solve but could be improved and fits very well into the relational model.

If you went with a eventually consistent model with a background view generation like some NOSQL products, the the complex view update could be deferred, but you could get stale results.

You should try a trigger based solution with MSSQL or Oracle at least as a comparison.

Not sure what the aversion to triggers is, they are ideal for creating complex user defined "indexes" such as what you are doing with the Post Count.

Materialize views are essentially just declarative system managed triggers that update another table.

Bruno Martinez
08/27/2010 05:58 PM by
Bruno Martinez

Ayende,

Put both tables in the same table cluster, so that finding the employees is not another disk seek. This gets close to document data bases.

tobi
08/27/2010 06:06 PM by
tobi

"This is a hard problem to solve" It is not. It can be solved by using stacked indexed views (search on google to find a pdf from MS; alas not implemented).

I was appalled as I learned about oracles deferred materialized views. I believe there is no synchronous option which makes them worthless for many applications. What an unnecessary crippling.

tobi
08/27/2010 06:13 PM by
tobi

On the other hand If you want a deferred update system then the automatic view matching in the query optimizer must come in very handy.

Sam
08/28/2010 07:47 AM by
Sam

I'm with Justin - relational databases have an existing, proven, built-in mechanism to implement what you're trying to accomplish; they're called triggers.

On one occasion I remember fiddling around with indexed views for a week before giving up and implementing a trigger solution. Quickest, easiest, most sensible and most performant change I ever made to that system.

Frans Bouma
08/28/2010 09:28 AM by
Frans Bouma

@Tobi: materialized/indexed views are by definition used in functionality which consumes large buckets of data and processes them for a lengthy period of time for readonly purposes. This alone makes it already not necessary to have the views be 'up to date' at all time, as that's only possible with a live query.

many people think data has to be 'non-stale' at any given moment, but they forget that only some data has to be 'non-stale'. All the rest can be consumed from for example materialized/indexed views. The post count is a good example for data which is totally acceptable to be 'stale': who's going to check whether user X has indeed 103 posts? no-one.

It therefore comes down to:

  • decide what data elements can be stale

  • make sure the non-stale data is fast and easy to retrieve and manipulate.

And with 'stale' I don't mean data which is a year old, but for example materialized views (or your own tables with duplicated denormalized data, same thing, if your db doesn't support materialized views) which are refreshed every 10 minutes. For reports about monthly sales etc., that's perfectly fine.

This also leads to the point where one could decide to use two distinct databases instead of 1: one for the non-stale data and one for the stale data, which is for example updated every night.

It's true that this means one has to think about what the data in a system really means at time T and time T+t. But IMHO that's a good thing, it makes one realize that a database is more than just a bucket to store bits in, and actually a machine which can make things really easy.

gunteman
08/28/2010 01:08 PM by
gunteman

"many people think data has to be 'non-stale' at any given moment, but they forget that only some data has to be 'non-stale'. All the rest can be consumed from for example materialized/indexed views. The post count is a good example for data which is totally acceptable to be 'stale': who's going to check whether user X has indeed 103 posts? no-one."

Word!

tobi
08/28/2010 02:49 PM by
tobi

@Frans Bouma: Materialized views are not only used for aggregation. You can also save tons of join with them. Example: A forum thread. At least the following tables contribute one row for each post: ForumPosts, Users, UserAvatarImages. Then the resultset has to be sorted by datetime. All of this can be materialized so that the forum page needs exactly one range seek. Very cheap, can do that 2k times per second per core.

Imran
08/28/2010 09:07 PM by
Imran

Have you tried using count(1) instead of count(*) and seeing if it makes a difference to your execution time? Also if it is ok for the count to be a little stale you could store it as a normal numerical field that gets updated by a batch process (database job in sql server) thats runs every so often?

Simon Mac
08/29/2010 06:35 PM by
Simon Mac

@ Justin

+1 for materialized / schema bound views, being "sytem shortcuts" for triggers.

Ultimately the DB needs to infer if ANY object insert / update / delete would alter the output of ANY of the materialized views (Which I guess are little more than tables anyway).

I suspect this is the reason they are fairly limited in complexity, otherwise they could cripple modification performance, but in a way that is completely non-obvious.

Whilst triggers too maybe less than obvious, at least they are tied to the table you find the bottleneck on.

Would be interesting to see the query plan for the insert / updates on tables that form part of the materialized views.

tobi
08/30/2010 01:00 PM by
tobi

The query plan does multiple updates in sequence. For aggregation views it actually aggregates the changed rows for each different view.

Comments have been closed on this topic.