Perfoming joins without having all the data in memory
Probably the easier way to perform a join is by a nested loop, given dataset A and dataset B, joining all the rows to dataset C is simple:
for row_a in A: for row_b in B: if condition(row_a, row_b): add join(row_a, row_b), C
Supporting left/right/cross joins is simple matter from here, but this has the issue of having to have both datasets in memory. I run into it while processing big files, I don't want to have to hold them in memory, especially if I need several level of joins in order to correctly process them.
I thought about bypassing the entire issue by simply writing the data down to a sqlite DB, and doing the join there. While this is possible, I would rather avoid having to do this.
Any suggestions on where to look to solve this issue?