Ayende @ Rahien

It's a girl

Awesome indexing with RavenDB

I am currently teaching a course in RavenDB, and as usual during a course, we keep doing a lot of work that pushes what we do with RavenDB. Usually because we try to come up with new scenarios on the fly and adapting to the questions from the students.

In this case, we were going over the map/reduce stack and we kept coming more and more complex example and how to handle them, and then we got to this scenario.

Given the following class structure:

   1: public class Animal
   2: {
   3:     public string Name { get; set; }
   4:     public string Species { get; set; }
   5:     public string Breed { get; set; }
   6: }

Give me the count of all the species and all the breeds.  That is pretty easy to do, right?  In SQL, you would write it like this:

   1: SELECT Species, Breed, Count(*) FROM Animals
   2: GROUP BY Species, Breed

And that is nice, but it still means that you have to do some work on the client side to merge things up to get the final result, since we want something like this:

  • Dogs: 6
    • German Shepherd: 3
    • Labrador: 1
    • Mixed: 2
  • Cats: 3
    • Street: 2
    • Long Haired: 1

In RavenDB, we can express the whole thing in a simple succinct index:

   1: public class Animals_Stats : AbstractIndexCreationTask<Animal, Animals_Stats.ReduceResult>
   2: {
   3:     public class ReduceResult
   4:     {
   5:         public string Species { get; set; }
   6:         public int Count { get; set; }
   7:         public BreedStats[] Breeds { get; set; }
   8:  
   9:         public class BreedStats
  10:         {
  11:             public string Breed { get; set; }
  12:             public int Count { get; set; }
  13:         }
  14:     }
  15:  
  16:     public Animals_Stats()
  17:     {
  18:         Map = animals =>
  19:                 from animal in animals
  20:                 select new
  21:                     {
  22:                         animal.Species,
  23:                         Count = 1,
  24:                         Breeds = new [] {new {animal.Breed, Count = 1}}
  25:                     };
  26:         Reduce = animals =>
  27:                     from r in animals
  28:                     group r by r.Species
  29:                     into g
  30:                     select new
  31:                         {
  32:                             Species = g.Key,
  33:                             Count = g.Sum(x => x.Count),
  34:                             Breeds = from breed in g.SelectMany(x => x.Breeds)
  35:                                     group breed by breed.Breed
  36:                                     into gb
  37:                                     select new {Breed = gb.Key, Count = gb.Sum(x => x.Count)}
  38:                         };
  39:  
  40:     }
  41: }

And the result of this beauty?

image

And that is quite pretty, even if I say so myself.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

TomCollins
10/05/2012 01:17 PM by
TomCollins

In SQL you could do similar things. This example works on Oracle 11:

CREATE TABLE Animal ( Name VARCHAR2(30), Species VARCHAR2(30), Breed VARCHAR2(30));

INSERT INTO Animal (Name, Species, Breed) VALUES ('Albert','Dog','German Shepherd'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Berta','Dog','German Shepherd'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Carla','Dog','German Shepherd'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Doris','Dog','Labrador'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Emil','Dog','Mixed'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Fritz','Dog','Mixed'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Gerd','Cat','Street'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Hans','Cat','Street'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Ida','Cat','Long Haired');

SELECT Species, Breed, Count FROM (SELECT Species, 'Sum' AS Breed, 0 AS Detaillevel, COUNT() AS Count FROM Animal GROUP BY Species UNION SELECT Species, Breed, 1 AS Detaillevel, COUNT() AS Count FROM Animal GROUP BY Species, Breed ORDER BY 1, 3, 2);

Result: SPECIES BREED COUNT Cat Sum 3 Cat Long Haired 1 Cat Street 2 Dog Sum 6 Dog German Shepherd 3 Dog Labrador 1 Dog Mixed 2

TomCollins
10/05/2012 01:19 PM by
TomCollins

Hope this format will look nicer:

CREATE TABLE Animal ( Name VARCHAR2(30),

Species VARCHAR2(30),

Breed   VARCHAR2(30));

INSERT INTO Animal (Name, Species, Breed) VALUES ('Albert','Dog','German Shepherd');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Berta','Dog','German Shepherd');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Carla','Dog','German Shepherd');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Doris','Dog','Labrador');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Emil','Dog','Mixed');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Fritz','Dog','Mixed');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Gerd','Cat','Street');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Hans','Cat','Street');

INSERT INTO Animal (Name, Species, Breed) VALUES ('Ida','Cat','Long Haired');

SELECT Species, Breed, Count FROM

(SELECT Species, 'Sum' AS Breed, 0 AS Detaillevel, COUNT(*) AS Count FROM Animal GROUP BY Species

UNION

SELECT Species, Breed, 1 AS Detaillevel, COUNT(*) AS Count FROM Animal GROUP BY Species, Breed

ORDER BY 1, 3, 2);

SPECIES BREED COUNT

Cat Sum 3

Cat Long Haired 1

Cat Street 2

Dog Sum 6

Dog German Shepherd 3

Dog Labrador 1

Dog Mixed 2

wqweto
10/05/2012 02:04 PM by
wqweto

SELECT Species, Breed, Count(*) FROM Animals

GROUP BY Species, Breed

WITH ROLLUP

justin
10/05/2012 02:13 PM by
justin

What about ROLLUP which was designed for exactly this purpose and is almost every SQL implementation?

SQL server has had it since at SQL 2000:

http://msdn.microsoft.com/en-us/library/ms189305(v=SQL.90).aspx

Rasmus Schultz
10/05/2012 02:23 PM by
Rasmus Schultz

Doing this with rollup in SQL yields the same result - but there's an important difference: the result of the projection is materialized, so it's extremely fast to query. Of course, if your SQL engine supports materialized views, again, you can accomplish something similar - but I'll be you anything, RavenDB indexes will still beat the performance of materialized views in any SQL engine ;-)

Ajai
10/05/2012 02:26 PM by
Ajai

select species, breed, count(1) from animals group by rollup(species, breed)

And that is pretty if I say so myself :)

Ajai

João P. Bragança
10/05/2012 03:41 PM by
João P. Bragança

If only there were an in memory version of sql server to allow tests on top of this to execute quickly.

Justin
10/05/2012 05:13 PM by
Justin

"but I'll be you anything, RavenDB indexes will still beat the performance of materialized views in any SQL engine ;-)"

I'll bet you a database engine that been around many more years than Raven has a good chance of winning that bet. Dumping row from a materialized view will be one of the fastest thing a SQL server can do.

"If only there were an in memory version of sql server to allow tests on top of this to execute quickly."

SQL server will host the pages of a materialized view in RAM for reads if it has enough. If your worried about write performance and not durability then put the SQL server db files on a RAMDISK:

http://www.sqltreeo.com/wp/sql-server-on-steroids-with-ram-disk/

João P. Bragança
10/05/2012 08:32 PM by
João P. Bragança

"http://www.sqltreeo.com/wp/sql-server-on-steroids-with-ram-disk/"

That sounds suspiciously like effort. I'd rather have something xcopy deployable.

Rafal
10/07/2012 08:20 AM by
Rafal

I'm all enthusiastic about NoSQL but we still don't have a true data processing language. The C# + map-reduce construct is not very elegant compared to SQL select with rollup.

Ayende Rahien
10/07/2012 10:21 AM by
Ayende Rahien

Justin, There is a very important difference between the way RavenDB and SQL Server. RavenDB is going to compute the result once and then use that as the source for the query. SQL Server is going to have to go through the entire data set, on each query. Materialized views (indexed views in SQL Server) and quite limited - for example, they can't do Rollups.

Justin
10/07/2012 03:24 PM by
Justin

"RavenDB is going to compute the result once and then use that as the source for the query."

Probably why everyone is talking about materialized views no?

In the original post you said: "And that is nice, but it still means that you have to do some work on the client side to merge things up to get the final result"

This has nothing to do with how often Raven or SQL computes the result, you said SQL simply could not compute the result on it own, which means one of two things, either you did not know ROLLUP existed or you are purposely are saying SQL the language can do less than it actually does.

Also the limitation of ROLLUP not being in am materialized view is a MS SQL Server limitation not a SQL/Relational limitation. Oracle and DB2 have no problem with ROLLUP in materialized view. Again this just shows that you simply don't know what is possible in the relational world or you are purposely ignoring the facts. For MS SQL ever heard of triggers or the job scheduler?

Oracle can even defer the refresh of of the materialized view for later if you want Raven DB like staleness.

Ayende Rahien
10/07/2012 03:28 PM by
Ayende Rahien

Justin, Are you aware of my history with regard to RDBMS?

Justin
10/07/2012 03:30 PM by
Justin

Yes, so which is it, did you not know about ROLLUP or purposely say SQL can't do it?

Ayende Rahien
10/07/2012 03:43 PM by
Ayende Rahien

Justin, The only place where I talked about SQL was when I wanted to show the simple aggregation. I had nothing to say about the feasibility of this in SQL.

Justin
10/07/2012 04:06 PM by
Justin

Still not answering the question.

You most definitely talked about the feasibility in SQL by saying work must be done on the client side while this was not necessary in Raven, simply untrue.

The whole article's premise was a contrast to SQL yet didn't mention the ROLLUP operator at all even though it is the textbook example of ROLLUP. You might as well have left off GROUP BY and COUNT and done ALL the work client side and declared SQL can't do aggregation but Raven can.

Adding WITH ROLLUP to the SQL statement is even simpler and more succinct than Raven if I do say so myself. 3 lines of SQL vs 40+ lines of C# map/reduce for the same result.

Duke
10/07/2012 08:21 PM by
Duke

3 lines of SQL has my vote too on this one

Rytis
10/08/2012 09:14 AM by
Rytis

I like the idea about simple things expressed in 40 lines of code instead of 2-3lines :)

João P. Bragança
10/09/2012 03:15 PM by
João P. Bragança

Yes, forget test-ability! This won't run on SQLITE so there goes quick test execution. Then again when was the last time you saw a DBA write a test for anything?

Alex Henderson
10/09/2012 07:47 PM by
Alex Henderson

An application I work on, implements this really easily as part of an in-built query language - here's the one-liner in that case:

GROUP BY Species { GROUP BY Breed { COUNT } }

http://dev.enterprisetester.com/help/#SearchTQLAggregation

http://blog.bittercoder.com/2012/08/20/odata-tql-and-filtering/

We don't build an index, but the result sets are built on the fly against Lucene.Net indexes, so it's pretty fast (few milliseconds) to execute.

I think the ROLLUP/CUBE can work well for these scenarios too when using Sql Server (unfortunately I don't have the luxury of supporting just a single Database) - but I think as things get more complex, like faceted results, or including a list of entities matching a facet (top 10 / bottom 10 etc.) then the Sql Server solution becomes a lot more complex...

Comments have been closed on this topic.