Slaying relational hydras (or dating them)

time to read 2 min | 262 words

Sometimes client confidentiality can be really annoying, because the problem sets & suggested solutions that come up are really interesting. That said, since I am interesting in having future clients, it is pretty much a must have. As such, the current post represent a real world customer problem, but probably in a totally different content. In fact, I would be surprised if the customer was able to recognize the problem as his.

That said, the problem is actually quite simple. Consider a dating site, where you can input your details and what you seek, and the site will match you with the appropriate person. I am going to ignore a lot of things here, so if you actually have built a dating site, try not to cringe.

At the most basic level, we have two screens, the My Details screen, where the users can specifies their stats and their preferences:

And the results screen, which shows the user the candidate matching their preferences.

There is just one interesting tidbit, the list of qualities is pretty big (hundreds or thousands of potential qualities).

Can you design a relational model that would be a good fit for this? And allow efficient searching?

I gave it some thought, and I can’t think of one, but maybe you can.

I’ll follow up on this post in a day or two, showing how to implement the problem using Raven.

Tweet Share Share 26 comments

Tags:

Comments

08 Mar 2010
10:49 AM

Joe

I think a correctly indexed attribute value pair model would work. You could use a few strategies for the search like multiple pass reducing constraint match - or just plain old aggregate function to see how many attributes in common.

The nice thing about the av pair model is that as long as you think through your search and data entry - nothing says that the list of qualities has to stay constant. It could change over time without affecting the integrity of previous entries.

You of course pay a price for the av pair model in mapping data to - well - pretty much anything. You will need adapter if you use things like mvc or databinding code in general. But it does serve the purpose.

08 Mar 2010
10:55 AM

Ayende Rahien

Joe,

Try to show me SQL that make a search work.

08 Mar 2010
11:39 AM

Erik

select * from people p inner join qualities q on q.personid=p.id and q.attribute = 'species' and q.value = 'Hydra'

08 Mar 2010
11:40 AM

Ayende Rahien

Erik,

You need to match on all the possible qualities, not on just one of them.

08 Mar 2010
11:49 AM

Erik

Right, so then it turns into a where clause with a series of and'd or or'd search terms, depending on what you're wanting to match.

select * from people p inner join qualities q on q.personid=p.id

where

(q.attribute = 'heads' and q.value >= '3' )

and (q.attribute = 'heads' and q.value <= '5' )

and (q.attribute = 'species' and q.value = 'Hydra' )

and (q.attribute = 'scales' and q.value = 'Yes' )

and ...

It's not very elegant, but no EAV problem is when you're stuck in traditional RDBMS.

en.wikipedia.org/wiki/Entity-attribute-value_model

08 Mar 2010
11:52 AM

Ayende Rahien

Erik,

You will never return anything using this approach.

q.attribute - 'heads' and q.attributes = 'species' are mutually exclusive.

08 Mar 2010
11:57 AM

Rafal

Convert the attributes to categories, like

Gender_Male

Species_Hydra

Eyes_Red

and build a Lucene index on all people, including these words.

08 Mar 2010
11:59 AM

tobi

select mq.MemberID, count(*) as CountOfMatchingAttributes

from MemberDesiredQualities mdq

join MemberQualities mq on mdq.QualityID = mq.QualityID and mdq.ValueMin <= mq.Value and mdq.ValueMax >= mq.Value

where mdq.MemberID = 1000

group by mq.MemberID

order by mq.MemberID

MemberQualities: QualityID, MemberID, Value

MemberDesiredQualities: MemberID, QualityID, ValueMin, ValueMax

Both tables clustered uniquely on the first two columns. Query returns instantly all members with at leas one matching attibute plus the count of matching attributes.

08 Mar 2010
12:00 PM

Ayende Rahien

Rafal,

That isn't relational solution.

08 Mar 2010
12:00 PM

Rafal

Oh, sh*t, the task was to design a RELATIONAL model...

08 Mar 2010
12:04 PM

Ayende Rahien

Tobi,

Hat tip, that is a really nice solution.

08 Mar 2010
13:50 PM

Bryan

You really shouldn't be solving this with a SQL database (unless you're using something like Postgis to represent this as a multidimensional space). This is really a multi-dimensional graph problem at it's heart.

08 Mar 2010
14:22 PM

Ayende Rahien

Tobi,

Thinking about this, what about if I don't have just ranges, but direct matches?

Eye color = Red, for example.

Or a single quality with multiple values (Fav. Food: Heroes, Witches).

08 Mar 2010
15:12 PM

configurator

Maybe this isn't what you're looking for, but in most cases (and I'm not only talking about dating sites) what you want isn't "someone with the most exact matches" but "someone with the closest match to each of the attributes." Consider it a multi-dimensional map and you want the closest point.

If my dream girl is blonde with blue eyes and a size L breast, I'd probably rather have a blonde girl with bluish-green eyes and a size M breast (1 match, two very close attributes) than a blonde with blue eyes and no breast (2 matches, one very far attribute).

What is normally done is you give a weight and a desired value to each attribute (the weight acts somewhat like the opposite of a standard deviation). Then for each object you calculate the weighted distance from the object of your desires and you choose the top ones. How would you index that in a relational database?

Maybe you can separate the multidimensional map into sectors. A 3-headed beast with red eyes is sector 6485, and a blonde with a really big breast is sector 20316.

You'd have the following table

Name, Type, Eye color, Sector

Blondie, Human, Blue, 20316

Hydra, Hydra, Red, 6485

Than, you choose all the sectors that are close enough to the desired value (probably the exact one and one sector to each side so you'd have 3^(number of attributes) sectors), and only make an approximate calculation for each of the values in your narrowed-per-sector resultset. There are some more ways you can optimize the sector-choosing, and the calculation, but that's the main idea.

08 Mar 2010
15:25 PM

tobi

Ayende, according to the execution plan that I got with a few million test-data rows sql server does not use the predicate "mdq.ValueMin <= mq.Value and mdq.ValueMax >= mq.Value" in its index seeks. It should not do any harm to change this predicate to anything at all.

For multiple data types you could create multiple nullabe columns or multiple tables.

08 Mar 2010
15:26 PM

Anonymous Coward

select p.Id, COUNT(*) from people p inner join qualities q on q.personid=p.id

where

(q.attribute = 'heads' and q.value >= '3' )

or (q.attribute = 'heads' and q.value <= '5' )

or (q.attribute = 'species' and q.value = 'Hydra' )

or (q.attribute = 'scales' and q.value = 'Yes' )

group by p.Id

order by p.Id desc

returns people and how many things they have in common.

Posting as AC because I'm suspecting something is horribly wrong with this solution if you didn't come up with it :-)

08 Mar 2010
15:26 PM

Mark

You could add a marker ro switch between range match and value match, e.g.

...

join MemberQualities mq on (mdq.QualityID = mq.QualityID and mdq.ValueMin <= mq.Value and mdq.ValueMax >= mq.Value and mq.IsRangeMatch = 1) or (mdq.Value = mq.Value and mq.IsRangeMatch = 0)

....

If you need multiple matches just have more than one "quality" for choices, e.g. a "favourite food 1" and a "favourite food 2", both of which have the same qualityID. Watch the number of matches though, if another member had red eyes and liked eating heroes and witches you'd get three matches, where you may be expecting two. Add the quality to the group by to help with this - the rowcount would then give you the number of matching qualities.

You can't have a fixed set of values like this:

mdq.ValueA = mq.ValueB or mdq.ValueB = mq.ValueB

Because you need to cross match A with B, not too bad if there are only two choices, but it soon gets ugly:

mdq.ValueA = mq.ValueA or mdq.ValueA = mq.ValueB

or mdq.ValueA = mq.ValueC or mdq.ValueB = mq.ValueA

or mdq.ValueB = mq.ValueB or mdq.ValueB = mq.ValueC

or mdq.ValueC = mq.ValueA or mdq.ValueC = mq.ValueB

or mdq.ValueC = mq.ValueC

nasty!

08 Mar 2010
15:28 PM

tobi

configurator, you could use my solution for this. In addition to the count(*) predicate, you add

sum(Value - MinValue) as Difference

Then you can calculate the average difference or other metrics of closeness.

08 Mar 2010
18:29 PM

MattMc3

Erik has the right idea as a starting point, but you'd have to use subqueries or temp tables instead of the join. It's not that hard to build the query dynamically. The hard part would be getting it to perform.

08 Mar 2010
21:45 PM

Yoni Shalom

is the matching criteria relatively static or dynamic ?

i.e. - do I fill out what qualities I like on a search page or do I fill them out once in a long while in an account preferences page ?

Where am I going with this ? - if the search itself is relatively static, then you might want to precalculate matches and save/search those....

08 Mar 2010
22:49 PM

Ayende Rahien

Yoni,

Assume that it is dynamic enough to cause problems

09 Mar 2010
01:57 AM

Gerke Geurts

Proposes adjustment of Tobi's solution to allow for multiple values in both desired and actual member qualities. And I would treat single values as a range with same ValueMin and ValueMax values.

select mq.MemberID, count(mq.QualityID) as CountOfMatchingAttributes

from MemberDesiredQualities mdq

join MemberQualities mq

on mdq.QualityID = mq.QualityID

and (mdq.ValueMin between mq.ValueMin and mq.ValueMax

or mdq.ValueMax BETWEEN mq.ValueMin and mq.ValueMax)

where mdq.MemberID = 1000

group by mq.MemberID, mq.QualityID

order by mq.MemberID

09 Mar 2010
02:00 AM

Gerke Geurts

Small correction in grouping logic:

select mq.MemberID, count(DISTINCT mq.QualityID) as CountOfMatchingAttributes

from MemberDesiredQualities mdq

join MemberQualities mq

on mdq.QualityID = mq.QualityID

and (mdq.ValueMin between mq.ValueMin and mq.ValueMax

or mdq.ValueMax BETWEEN mq.ValueMin and mq.ValueMax)

where mdq.MemberID = 1000

group by mq.MemberID

order by mq.MemberID

09 Mar 2010
04:46 AM

Francois Germain

What about this?

Catalog : CatalogID, Prompt, Type, etc

Person : PersonID, FirstName, LastName, etc

Member : PersonID, MemberSince, Status, etc

PersonStringAttributes : PersonID, CatalogID, Value

PersonDateAttributes : PersonID, CatalogID, Value

PersonTextAttributes : PersonID, CatalogID, Value

DesiredStringAttributes : PersonID, CatalogID, MinValue, MaxValue

DesiredDateAttributes : PersonID, CatalogID, MinValue, MaxValue

DesiredTextAttributes : PersonID, CatalogID, MinValue, MaxValue

Now you have a few possible ways to return the resulting data. You can combine the separate asynch returns on the middle tier. You can have a view on top of your data structure.

12 Mar 2010
12:37 PM

Joe Cincotta

...this is a transcript of our conversation to date...

2010/3/11 Joe Cincotta xxxx@spiralglobal.com

There would be 2 tables.

USER

ID as GUID

other columns for the user

AVPAIR

ID as GUID

UserID as GUID (FK)

Attribute as String

Value as String

My thought was that I would do a select where (attribute=x and value=y) or (attribute=z and value=a) etc etc for each attribute in the source profile. Then you would group by the result by the UserID - using an aggregate function to order by the most results on a UserID to the least. This would provide you the best matches first and the least further down.

I would have to spend more time than I have to give the precise SQL, but this at least gives my idea. It would use a fair amount of processing power performing an aggregate SQL function if you are talking about thousands of attributes.

On 12/03/2010, at 2:17 AM, Ayende Rahien xxxx@ayende.com wrote:

The problem is that this becomes expensive very quickly. You can't use indexing, for example.

2010/3/11 Joe Cincotta xxxx@pixolut.com

You could create some artificial indices. Like, for example - every time you 'save' a user profile you make a checksum. Create a checksum table which looks like

CHECKSUM

ID as Guid

UserID as Guid (foreign key)

Checksum as varchar or byte - could use a nice big crypto SHA hash as the checksum so there are no collisions.

When we do a save on the user profile we concat all attributes and values together and then hash them use a big (1024 bit?) SHA hash.

We can index this table easily and then exact match searches get waaaaay faster. One entry per user instead of per attribute.

You could also use a variation of this to cluster results - like

Do the above strategy but make several tables for the attribute clusters which are important. For example maybe everyone wants exact matches to 'gender, marital status, and age' - we make a checksum table for those attributes which gives us a shortlist for the more complex aggregate function.

The more 'checksum' tables the more funky you can get. You essentially can segment data and control performance of the search. The place you pay performance hit is on the create or update operations. This would be deemed acceptable.

On 12/03/2010, at 7:06 PM, Ayende Rahien xxxx@ayende.com wrote:

At which point, I am better off with a non relational option, it is simpler to follow, faster to program with and faster to execute.

On Fri, Mar 12, 2010 at 8:23 AM, Joe Cincotta xxxx@pixolut.com wrote:

If that's an option then, sweet. But you end up with other issues - do you use a non- relational data store? Does it provide the measured scalability of SQL? It's not just about the language you choose - it's about reliability. What would you suggest as an alternative to Sql?

Joe,

Yes, I do, and yes, it does.

But I can't really answer that question without having more information, such as what the actual need is.

...ende...

18 Mar 2010
10:03 AM

Frédéric

Ayende, my software is heavily using metade, or so called "client properties".

The real part of this is getting data validation and consistency if not defined.

But there is no major issue, just take a look at sharepoint, it's exactly the design principle used by moss.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB