Big Data Search: Sorting Optimizations
I have mentioned several times that the entire point of this exercise was just to see how this works, not to actually build anything production-worthy. But it is interesting to see how we could do better here.
In no particular order, here are several things that we could do to significantly improve the time it takes to sort. Right now we define two indexes on top of a 1GB file, and it takes under a minute to complete. That extrapolates to a runtime of about 10 days over a 15TB file.
Well, one of the reasons for this performance is that we execute everything in a serial fashion, that is, one index after the other. But we have two completely isolated indexes, so there is no reason why we can't parallelize the work between them.
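For example, something along these lines would do it. This is a minimal sketch; the two build methods are hypothetical stand-ins for whatever actually builds each index:

```csharp
using System.Threading.Tasks;

public class IndexBuilder
{
    // Hypothetical per-index build methods. Each one reads the source file
    // on its own, so the two builds share no state and only compete for I/O.
    private Task BuildFirstIndexAsync(string sourcePath)  => Task.Run(() => { /* parse + sort + write */ });
    private Task BuildSecondIndexAsync(string sourcePath) => Task.Run(() => { /* parse + sort + write */ });

    public Task BuildBothAsync(string sourcePath)
    {
        // The indexes are completely isolated, so just run both builds
        // concurrently and wait for both of them to finish.
        return Task.WhenAll(
            BuildFirstIndexAsync(sourcePath),
            BuildSecondIndexAsync(sourcePath));
    }
}
```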
For that matter, we buffer in memory up to a certain point, then we sort, then we buffer some more, and so on. That is pretty inefficient. We can push the actual sorting to a different thread and keep parsing and filling the next buffer while the previous one is being sorted.
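A minimal sketch of that idea might look like this; the flushSortedAsync callback, which writes a sorted run out to disk, is a hypothetical stand-in:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class BufferedSorter
{
    private const int BufferSize = 1_000_000;

    // While the previous buffer is being sorted (and flushed), the caller's
    // enumeration keeps filling the next one.
    public async Task SortInChunksAsync(IEnumerable<string> lines, Func<List<string>, Task> flushSortedAsync)
    {
        var buffer = new List<string>(BufferSize);
        Task pending = Task.CompletedTask;

        foreach (var line in lines)
        {
            buffer.Add(line);
            if (buffer.Count < BufferSize)
                continue;

            await pending;                           // make sure the previous chunk is done
            var toSort = buffer;
            buffer = new List<string>(BufferSize);

            pending = Task.Run(async () =>
            {
                toSort.Sort(StringComparer.Ordinal); // sort off the parsing thread
                await flushSortedAsync(toSort);      // write the sorted run out
            });
        }

        await pending;
        if (buffer.Count > 0)
        {
            buffer.Sort(StringComparer.Ordinal);
            await flushSortedAsync(buffer);
        }
    }
}
```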
We wrote to intermediary files, but we did so using plain file I/O. It is usually a lot more costly to write the raw data to disk than to compress it and then write it. We are writing sorted data, so it is probably going to compress pretty well.
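Pushing each run through gzip is a small change, roughly like this (a sketch; the exact file layout is glossed over):

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class RunWriter
{
    // Sketch: write an intermediary sorted run through gzip instead of plain
    // file I/O. Sorted text tends to share long prefixes between adjacent
    // lines, so it usually compresses well, trading a bit of CPU for
    // noticeably less disk I/O.
    public static void WriteCompressedRun(string path, IEnumerable<string> sortedLines)
    {
        using var file = File.Create(path);
        using var gzip = new GZipStream(file, CompressionLevel.Fastest);
        using var writer = new StreamWriter(gzip);
        foreach (var line in sortedLines)
            writer.WriteLine(line);
    }
}
```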
Those are the things that pop to mind. Can you think of additional options?
More posts in "Big Data Search" series:
- (24 Jan 2014) Sorting randomness
- (23 Jan 2014) Sorting Optimizations
- (22 Jan 2014) The index format is horrible
- (20 Jan 2014) Binary Search of Textual Data
- (17 Jan 2014) Setting up
Comments
WriteFileGather and going totally asynchronous with the writes.
Hi, you should IMHO do the usual things you did for Raven - concentrate on minimizing the I/O, performing sequential reads and writes only, and reducing the number of runs needed for the merge sort. Apart from that, you can use more clever data structures - for example a priority queue instead of a simple memory buffer, which should make the merge sort runs longer and require fewer of them. I don't see how introducing more threads would improve anything; after all, you would still be limited by a single disk that serializes all operations. And I suppose that using multiple disks or multiple machines is not the kind of optimization you're looking for?
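Roughly, the priority queue idea is replacement selection. A rough sketch, where createRunFile is just a placeholder for however the run files get created:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReplacementSelection
{
    // Keep the in-memory buffer as a min-heap. Every time we emit the
    // smallest item we read a replacement from the input; if it is still
    // >= the last emitted value it can join the current run, otherwise it
    // has to wait for the next one. On random input the runs come out
    // roughly twice the buffer size, so there are fewer runs to merge.
    public static void WriteRuns(IEnumerator<string> input, Func<StreamWriter> createRunFile, int bufferSize)
    {
        var heap = new PriorityQueue<string, string>(StringComparer.Ordinal);
        var heldBack = new List<string>();

        // Prime the heap with the first bufferSize items.
        while (heap.Count < bufferSize && input.MoveNext())
            heap.Enqueue(input.Current, input.Current);

        while (heap.Count > 0)
        {
            using var run = createRunFile();

            while (heap.Count > 0)
            {
                var smallest = heap.Dequeue();
                run.WriteLine(smallest);

                if (!input.MoveNext())
                    continue; // input exhausted, just drain the heap

                var item = input.Current;
                if (StringComparer.Ordinal.Compare(item, smallest) >= 0)
                    heap.Enqueue(item, item); // still fits the current run
                else
                    heldBack.Add(item);       // goes into the next run
            }

            // Start the next run from everything we had to hold back.
            foreach (var item in heldBack)
                heap.Enqueue(item, item);
            heldBack.Clear();
        }
    }
}
```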
We could start by speeding up the sorting algorithm.
Assuming that we use the List type and List.Sort(), which uses QuickSort (if my memory is correct), we can do that part in parallel, since the problem is really partitioning over a pivot, and thus it can be efficiently optimized using more than one thread (see the sketch at the end of this comment).
The next thing is that we can load up more than 1 GB and do TOP(K) partitioning over a pivot, which is even faster than QuickSort itself, but that should not matter as much.
One thing to remember while we are using dynamically expandable data structures (like lists) is that we should size them up front, so there will be no expand operations, which are expensive.
As for the I/O:
Since we are using intermediary files, we can keep track of their data ranges, so ideally we could start merging those whose data ranges are closest together (especially the starting value of the range); this could lead to better average time complexity.
The actual binary search could use a probabilistic approach to set the starting point of the search, but this needs to be carefully tested, as it can degrade performance.
And like you said, since we have two indexes, we could just parallelize the operation, but the number of threads working on things needs to be adjusted.
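A rough, purely illustrative sketch of the parallel sort idea, sorting two halves of a pre-sized array on separate threads and then merging:

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelSorter
{
    // Rough sketch: sort the two halves of a pre-sized array concurrently,
    // then do a single merge pass into a new array.
    public static string[] Sort(string[] items)
    {
        int mid = items.Length / 2;

        Parallel.Invoke(
            () => Array.Sort(items, 0, mid, StringComparer.Ordinal),
            () => Array.Sort(items, mid, items.Length - mid, StringComparer.Ordinal));

        // Merge the two sorted halves.
        var result = new string[items.Length];
        int i = 0, j = mid, k = 0;
        while (i < mid && j < items.Length)
            result[k++] = StringComparer.Ordinal.Compare(items[i], items[j]) <= 0
                ? items[i++]
                : items[j++];
        while (i < mid) result[k++] = items[i++];
        while (j < items.Length) result[k++] = items[j++];
        return result;
    }
}
```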
Is it possible to use statistical methods? E.g. do 1000 seeks in the 15 TB file, sort the sample, pivot around the found values, and create 1000 x (records per seek) smaller files during a final sequential read of the big 15 TB file (sketched below).
Quicksort might have problems with bad pivot values or a lot of identical values. I guess only in constructed data sets, but you never know.
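A rough sketch of what that routing pass could look like; openBucket is a placeholder for creating the smaller files:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SamplePartitioner
{
    // Given pivot keys obtained by sampling the big file, route each line
    // into one of (pivots.Count + 1) bucket files during a single sequential
    // read. Each bucket should then be small enough to sort in memory.
    public static void Partition(IEnumerable<string> lines, IReadOnlyList<string> sortedPivots,
                                 Func<int, StreamWriter> openBucket)
    {
        var buckets = new StreamWriter[sortedPivots.Count + 1];
        try
        {
            foreach (var line in lines)
            {
                int index = BucketIndex(sortedPivots, line);
                buckets[index] ??= openBucket(index); // open bucket files lazily
                buckets[index].WriteLine(line);
            }
        }
        finally
        {
            foreach (var bucket in buckets)
                bucket?.Dispose();
        }
    }

    private static int BucketIndex(IReadOnlyList<string> pivots, string value)
    {
        // Binary search: count how many pivots are <= value.
        int lo = 0, hi = pivots.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (StringComparer.Ordinal.Compare(value, pivots[mid]) < 0) hi = mid;
            else lo = mid + 1;
        }
        return lo;
    }
}
```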
I think I might do something with pipelining of the data, maybe using the standard Microsoft libraries in the Microsoft.Tpl.Dataflow package. With pipelining of the data, I can continue to read on one thread, and when I have read enough data, it gets passed off to the sorting code, and then on to the writing code. In other words, I could be doing up to three things at the same time. Then once all my small index files are written and ready to be merged, I believe I would thread that as well and see what I got out of it.
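Something along these lines; a sketch where writeRunAsync is a placeholder for the code that writes a sorted run to disk:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // the TPL Dataflow package

public static class SortPipeline
{
    // Sketch of a read -> sort -> write pipeline: each stage runs
    // independently, so reading, sorting and writing can overlap.
    public static async Task RunAsync(IEnumerable<List<string>> chunks,
                                      Func<List<string>, Task> writeRunAsync)
    {
        var sortBlock = new TransformBlock<List<string>, List<string>>(chunk =>
        {
            chunk.Sort(StringComparer.Ordinal); // CPU-bound stage
            return chunk;
        });

        var writeBlock = new ActionBlock<List<string>>(writeRunAsync); // I/O-bound stage

        sortBlock.LinkTo(writeBlock, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var chunk in chunks)               // the "read" stage
            await sortBlock.SendAsync(chunk);

        sortBlock.Complete();
        await writeBlock.Completion;
    }
}
```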
As a cheating alternative, you can use the Unix Tools for Win32 sort command line utility.
In a project I was working on, we decided to just do a straight output to a new file, making sure the field to be sorted by was first on each line, and then apply the command line sort and let the tried and tested low level code do the job.
The performance improvement we got would never have been achieved without loads of work and research. Choosing your battles...