﻿<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>Ayende @ Rahien</title><link>http://ayende.com</link><description>Ayende @ Rahien</description><copyright>Copyright (C) Ayende Rahien  2004 - 2021 (c) 2026</copyright><ttl>60</ttl><item><title>Avish commented on Perfoming joins without having all the data in memory</title><description>I wasn't implying that; I was replying to Luke's comment about a hand-written approach being better than using a standard in-memory engine due to increased domain knowledge.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment23</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment23</guid><pubDate>Mon, 10 Dec 2007 21:11:47 GMT</pubDate></item><item><title>Ayende Rahien commented on Perfoming joins without having all the data in memory</title><description>Yes, but as the sole developer, I don't have the time to build the full engine.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment22</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment22</guid><pubDate>Sun, 09 Dec 2007 18:40:18 GMT</pubDate></item><item><title>Avish commented on Perfoming joins without having all the data in memory</title><description>I'd argue that, this being an ETL tool, you have just the same generic requirements as any database engine, and no more domain-specific knowledge than one. 
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment21</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment21</guid><pubDate>Sun, 09 Dec 2007 18:26:24 GMT</pubDate></item><item><title>Luke Breuer commented on Perfoming joins without having all the data in memory</title><description>This is true, and OK as long as SQLite does what you want.  I have no idea how well it operates on sets of data that cannot be kept entirely in memory, nor whether a hand-written approach could be made more efficient due to fewer requirements and more domain-specific knowledge.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment20</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment20</guid><pubDate>Fri, 07 Dec 2007 18:20:37 GMT</pubDate></item><item><title>Ayende Rahien commented on Perfoming joins without having all the data in memory</title><description>Luke,
  
I can just dump it on a SQLite DB and don't have to deal with this, that would be better.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment19</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment19</guid><pubDate>Fri, 07 Dec 2007 18:19:11 GMT</pubDate></item><item><title>Luke Breuer commented on Perfoming joins without having all the data in memory</title><description>Sheesh, that's a lot of work.  I'll bet you've spent more time emailing people your comments than it would take to make the blog engine do that. :-p
  
  
It does require a second scan, but that might happen _no matter your approach_.  It's something to think about -- if you populate a table and then index a column, I believe you're doing a full scan of the data twice, although the second time should be more efficient.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment18</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment18</guid><pubDate>Fri, 07 Dec 2007 18:17:51 GMT</pubDate></item><item><title>Ayende Rahien commented on Perfoming joins without having all the data in memory</title><description>The blog engine doesn't do that, this is me taking care of it :-)
  
  
I meant, this would require knowing the relative sizes of the data, and would require me to scan it twice.
  
It is certainly possible, but I don't want to get there.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment17</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment17</guid><pubDate>Fri, 07 Dec 2007 18:11:21 GMT</pubDate></item><item><title>Luke Breuer commented on Perfoming joins without having all the data in memory</title><description>Ayende, can you define "how this goes"?  Also, your blog engine is no longer sending me emails when you reply to my posts. :-(
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment16</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment16</guid><pubDate>Fri, 07 Dec 2007 17:57:59 GMT</pubDate></item><item><title>Jon Skeet commented on Perfoming joins without having all the data in memory</title><description>Sorry, yes - I should have specified that key equality was what I'd been considering in terms of equality all along. An equijoin basically :)
  
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment15</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment15</guid><pubDate>Thu, 06 Dec 2007 20:35:04 GMT</pubDate></item><item><title>Avish commented on Perfoming joins without having all the data in memory</title><description>If you're doing equality, but not necessarily key equality (what I mean is, if there exist functions f and g so that a match is produced iff f(a) == g(b)) you can do a simple hash join. That would still require one of the rowsets to be in memory, but it'll perform better than O(N*M). Actually, O(N+M) on average if your hashes are "ideal". 
  
For my experience with this, see http://anavish.livejournal.com/29734.html. 
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment14</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment14</guid><pubDate>Thu, 06 Dec 2007 19:42:23 GMT</pubDate></item><item><title>Jon Skeet commented on Perfoming joins without having all the data in memory</title><description>Well, for non-equality it's still doable - you just don't get quite as efficient operation. If you fundamentally need to compare each outer row with each inner row, then it's inherently an O(N*M) operation anyway, so you're still okay to just keep a list of the inner rows in memory.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment13</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment13</guid><pubDate>Thu, 06 Dec 2007 18:04:56 GMT</pubDate></item><item><title>Ayende Rahien commented on Perfoming joins without having all the data in memory</title><description>Jon,
  
I need more than merely equality, so I am not sure what to do here, perhaps I'll limit this to that, or I'll use the inner row iteration only.
  
Thanks.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment12</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment12</guid><pubDate>Thu, 06 Dec 2007 17:09:54 GMT</pubDate></item><item><title>Jon Skeet commented on Perfoming joins without having all the data in memory</title><description>For joins in a pipeline, LINQ to Objects is still a good model to follow - with the restriction that the "inner" dataset has to be small.
  
  
(LINQ to Objects actually builds up a Lookup based on the inner dataset, so it doesn't need to loop through it in memory for each row in the outer dataset, but it's only able to do that because the joins are equality-based. That may or may not be your situation :)
  
  
Jon
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment11</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment11</guid><pubDate>Thu, 06 Dec 2007 16:04:21 GMT</pubDate></item><item><title>Ayende Rahien commented on Perfoming joins without having all the data in memory</title><description>Luke,
  
That would require some way of knowing ahead of time how this goes.
  
I am thinking of joins in a pipeline, which makes it more interesting.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment10</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment10</guid><pubDate>Thu, 06 Dec 2007 15:18:58 GMT</pubDate></item><item><title>Ayende Rahien commented on Perfoming joins without having all the data in memory</title><description>Luke,
  
That would require some way of knowing ahead of time how this goes.
  
I am thinking of joins in a pipeline, which makes it more interesting.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment9</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment9</guid><pubDate>Thu, 06 Dec 2007 15:18:57 GMT</pubDate></item><item><title>Chad Myers commented on Perfoming joins without having all the data in memory</title><description>I think Ayende's point was that he didn't want to have to go down the route of writing a SQL/join engine in his app.
  
  
Is there some sort of other quick/dirty way to accomplish this with minimal effort? 
  
  
I agree with Derick that SQLite is probably your best bet. Otherwise you're going to have to go down the NirD route and re-invent a SQL engine. If you're doing a SQL engine anyhow, might as well just use SQLite and let it do the complicated stuff.
  
  
Of course, I'm sure you could write a 3 line Boo script to implement a SQL engine, so...
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment8</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment8</guid><pubDate>Thu, 06 Dec 2007 14:49:46 GMT</pubDate></item><item><title>Philip L&amp;#248;ventoft commented on Perfoming joins without having all the data in memory</title><description>The Wikipedia article http://en.wikipedia.org/wiki/Join_algorithm#Join_algorithms
  
about join algorithms is actually pretty good and explains some of the common ways to make joins. Hope it helps.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment7</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment7</guid><pubDate>Thu, 06 Dec 2007 14:03:01 GMT</pubDate></item><item><title>Derick Bailey commented on Perfoming joins without having all the data in memory</title><description>I would actually go down the SQLite route. I've never seen iterative logic of code perform anywhere near what group logic of sql can do... even in a small system like SQLite. I would recommend putting SQLite into in-memory mode, though, so you don't have to worry about file read/writes. I think you can specify the connection string as ":memory:" or something like that, to put sqlite in to in-memory mode.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment6</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment6</guid><pubDate>Thu, 06 Dec 2007 13:54:12 GMT</pubDate></item><item><title>Jason commented on Perfoming joins without having all the data in memory</title><description>I agree with NirD - simplistically that's how Oracle would handle the join.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment5</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment5</guid><pubDate>Thu, 06 Dec 2007 13:51:24 GMT</pubDate></item><item><title>Luke Breuer commented on Perfoming joins without having all the data in memory</title><description>Are the files so big that you cannot load all of the key values &amp; the associated character indexes for the line beginnings?  You would have to scan the files twice, but I don't see that as prohibitive.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment4</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment4</guid><pubDate>Thu, 06 Dec 2007 13:22:23 GMT</pubDate></item><item><title>NirD commented on Perfoming joins without having all the data in memory</title><description>The most efficient way to join two large datasets is to partition them into several smaller datasets (or at least it was when I working on databases in collage).
  
  
This is easiest when your condition is something like row_a.id == rob_b.a_id (lucky for us it’s a common scenario)
  
  
You open N new files, let's call them a1 to aN and use a hash function to divide all rows in A between those N buckets by row_a.id.
  
  
Then you do the same for B (by row_b.a_id), so you have two sets of files a1..aN and b1..bN
  
  
Now you load a1 and b1 and join between them and repeat for a2,b2 up to aN,bN
  
  
Because the way you divided the data between files you know all rows in a1 are only joined with b1 etc.
  
  
It's not easy but this is how the "big boys" do it.
  
  
There are simpler methods for special circumstances (for example if you know one of the datasets is small).
  
  
If you need more information you can e-mail me.
  
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment3</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment3</guid><pubDate>Thu, 06 Dec 2007 11:18:39 GMT</pubDate></item><item><title>Ken Egozi commented on Perfoming joins without having all the data in memory</title><description>I once needed to write a simple program to diff two not-too-well-formed text files, with some sort of logic in it.
  
they were way too big for an in-memory work.
  
the solution was (do not laugh - it gave best ROI):
  
  
1. read first file line by line while parsing it,  insert each parsed record to MsAccess table (takes about 3-4 seconds)
  
2. do the same on the second file (another 3-4 seconds)
  
3. run a SP on the tables to create the output table (under .5 second)
  
4. export to csv file (another blink of an eye)
  
  
this app runs every day by that client for over two year now.
  
flawlessly.
  
  
know your toolbox. choose wisely.
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment2</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment2</guid><pubDate>Thu, 06 Dec 2007 11:14:31 GMT</pubDate></item><item><title>Jon Skeet commented on Perfoming joins without having all the data in memory</title><description>The way LINQ to Objects does it is to have all the data for the "inner" sequence in memory, but then stream the "outer" sequence. So in pseudo-code (apologies for not writing it in full Boo!)
  
  
CachedB = LoadB
  
for row_a in StreamA
  
    for row_b in CachedB
  
        if (condition(row_a, row_b)
  
            yield join (row_a, row_b)
  
  
Where "yield" is the way of streaming the result out. All easy in C# 2 or 3, but I don't know how easy it will be in your actual situation.
  
  
Hope it helps though.
  
</description><link>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment1</link><guid>http://ayende.com/3013/perfoming-joins-without-having-all-the-data-in-memory#comment1</guid><pubDate>Thu, 06 Dec 2007 10:33:01 GMT</pubDate></item></channel></rss>