## Big Data SearchSorting randomness

**15 min**|

**2925 words**

As it turns out, doing work on big data sets is quite hard. To start with, you need to *get* the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4GB of random numbers, let us find how many times we have the number 1.

Because I wanted to ensure a consistent answer, I wrote:

1: public static IEnumerable<uint> RandomNumbers()`2: {`

3: const long count = 1024 * 1024 * 1024L * 1;4: var random = new MyRand();5: for (long i = 0; i < count; i++)`6: {`

7: if (i % 1024 == 0)`8: {`

9: yield return 1;10: continue;`11: }`

`12: var result = random.NextUInt();`

13: while (result == 1)`14: {`

`15: result = random.NextUInt();`

`16: }`

17: yield return result;`18: }`

`19: }`

`20:`

21: /// <summary>22: /// Based on Marsaglia, George. (2003). Xorshift RNGs.23: /// http://www.jstatsoft.org/v08/i14/paper24: /// </summary>25: public class MyRand`26: {`

27: const uint Y = 842502087, Z = 3579807591, W = 273326509;28: uint _x, _y, _z, _w;`29:`

30: public MyRand()`31: {`

`32: _y = Y;`

`33: _z = Z;`

`34: _w = W;`

`35:`

`36: _x = 1337;`

`37: }`

`38:`

39: public uint NextUInt()`40: {`

41: uint t = _x ^ (_x << 11);`42: _x = _y; _y = _z; _z = _w;`

43: return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));`44: }`

`45: }`

I am using a custom Rand function because it is significantly faster than System.Random. This generate 4GB of random numbers, at also ensure that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I run the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random number, sorting then compressing the values resulted in about 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a *significant amount of time *is spent just compressing the data. I am pretty sure that for real world scenario, it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes *time*. Implementing parallel sort cut this down to less than a minute per batch of 26 million.

That let us complete the entire process, but then it halts with the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceed 40 items, but those are still 1 billion * O(log 40) or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallel, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them.

Obviously we can sort each 10 files set in parallel, then sort the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this.

For what is it worth, you can find the code for this here.

#### More posts in "Big Data Search" series:

*(24 Jan 2014)*Sorting randomness*(23 Jan 2014)*Sorting Optimizations*(22 Jan 2014)*The index format is horrible*(20 Jan 2014)*Binary Search of Textual Data*(17 Jan 2014)*Setting up

## Comments

Oh yeah, finally someone recognizes the value of the XorShift generator! It beats the crap out of all the commonly used ones. It is faster than light and has very strong randomness properties.

Am I the only one who thinks that the state size of 2KB of the Mersenne Twister is ridiculous?!

To replace the heap (which probably has a high constant factor) try an aggregation tree. If you want to merge 4 sorted sequences, do it like this: Merge(Merge(s0, s1), Merge(s2, s3)). An similarly for any number of sequences.

I have done this in an external sorting code I wrote. I did not compare perf with a heap, though.

One of the things I'm considering in this case is actually a modified radix sort. Basically, read through the initial data and split into the initial 16 files (use top 4 bits to decide which file), ignoring any consideration of sorting within the file. Then go back and read each file and sort them independently. Certainly, each of the individual files could be read and sorted independently, giving great parallelizatibility.

This question really intrigued me. I may start looking at your other big data question next weekend, but I've done some initial workup reviewing this sorting question. My results are here (https://github.com/viceroypenguin/ExternalIntSorter) if you're interested. Thanks for the challenge!

Stuart, Did I miss something, but I don't see you doing anything in the Radix impl:

https://github.com/viceroypenguin/ExternalIntSorter/blob/master/ExternalIntSorter/RadixSortTester.cs#L74

Regarding your merge sort, I would probably not do it 1GB at a time, and if I did so, I would make sure that I could

reusethe buffers. Also, sorting 1GB array is a really good place to say: "let us paralelize this work"mmmmm... How about QuickSort? you can parallelise the subsequent recursions, and run it in place... For 4GB you can do it on a memory mapped file...

I commented before reading the code...

## Comment preview