Code through the looking glass: Sorting large data sets

Mar 17 2016

Code through the looking glassSorting large data sets

time to read 3 min | 590 words

The task in question was to read a CSV file (which is large, but can fit in memory) and allow quick searches on the data by email to find the relevant users. Most candidates handle that with a dictionary, but since our candidate decided to flip things around, he also included an implementation of doing this with sorted arrays. That is actually quite nice, in theory. In practice, it ended up something like this:

funny, funny pictures, funny photos, hilarious, wtf, weird, bizarre, funny animals, 17 Bizarre Photoshop Hybrid Animals

In order to truly understand the code, I have to walk you through a bunch of it.

It starts with the following fields:

List<Structure> emails = new List<Structure>();

Structure[] namesArray = new Structure[0];

Structure, by the way, is a helper class that just has a key and a list of ids. The duplicate is strange, but whatever, let move on.

The class constructor is reading one line at a time and add an structure instance with the email and the line id to the emails list.

public void ToArray()
{
    emailsArray = this.emails.ToArray();
}

This code just copy emails to the array, we are not sure why yet, but then we have:

public void Sort()
{
    Array.Sort(emailsArray, delegate(Structure x, Structure y) { return x.STR.CompareTo(y.STR); });
}

So far, this is right in line with the "use a sorted array" method that the candidate talked about. There is a small problem here, because emails are allowed to be duplicated, but no fear, our candidate can solve that…

public void UnifyIdLists()
{
    for (int i = 0; i < emailsArray.Length; i++)
    {
        if (i + 1 == emailsArray.Length)
            break;
        if (emailsArray[i].STR.Equals(emailsArray[i + 1].STR))
        {
            emailsArray[i].ID.AddRange(emailsArray[i + 1].ID);
            emailsArray[i + 1] = null;
            List<Structure> temp = emailsArray.ToList<Structure>();
            temp.RemoveAll(item => item == null);
            emailsArray = temp.ToArray();
        }
    }
}

The intent of this code is to merge all identical email values into a single entry in the list.

Now, to put things in perspective, we are talking about a file that is going to be around the 500MB in size, and there are going to be about 3.5 million lines in it.

That means that the emailsArray alone is going to take about 25MB.

Another aspect to consider is that we are using dummy data in our test file. Do you want to guess how many duplicates there are going to be there? Each of which is generating two 25MB allocations and multiple passes over an array of 3.5 million items in size.

Oh, and for the hell of it, the code above doesn't even work. Consider the case when we have three duplicates…

Tweet Share Share 10 comments

Tags:

More posts in "Code through the looking glass" series:

(18 Mar 2016) And a linear search to rule them
(17 Mar 2016) Sorting large data sets
(16 Mar 2016) I'm the God of all OOP
(15 Mar 2016) All you need is a dictionary (and O(N) )
(14 Mar 2016) Finding today's log file

Comments

17 Mar 2016
11:21 AM

OmariO

The subject is getting less and less interesting. Yes, there are inexperienced people and plane stupid ones but why dedicate so much time to the fact.

Can we please get back to, for example, IO, throughput and latency )

17 Mar 2016
12:23 PM

Jesús López

Here is my naive implementation https://gist.github.com/jesuslpm/3e638fc33370f2f74f41

Just for fun, I would be interested in doing your coding tasks, would you send me them?

17 Mar 2016
12:35 PM

Boris Nostars

Am I a bad developer if half of the time I don't understand what is going on in these snippets and the other half I don't understand why is it written the way it is?

17 Mar 2016
13:25 PM

Stephen Ward

I also would be interested in the coding tasks Ayende.

Boris

Can't say. wrapping your head around straightforward bad code tends to take more time than wrapping it around straightforward good code. And you almost never understand why the bad code was written the way it was. You should be able to understand what it's trying to do though, although that will take longer.

17 Mar 2016
16:12 PM

Mike

"also included"? As in, he did it the right way and then said "hey this is also an option", for the lulz? I hope you didn't penalize him for any of the things you complained about here. You told him not to worry about memory size or load time, so it's a feasible idea given those constraints.

The thing about multiple duplicates is a bug for sure, but who among us hasn't written a bug or two, especially during a rush job like this one? Point the bug out during the interview and see how he does at fixing it. This can actually be more useful than if he'd submitted bug-free code.

17 Mar 2016
17:08 PM

Uri

@Jesús López, nice implementation, return using yield is very good if we access to the file directly. dictionary implementation is also very easy with 1 line of code:

var dictionary = userDataList.ToLookup(t => t.Email, StringComparer.InvariantCultureIgnoreCase).ToDictionary(t => t.Key, v => v.Select(t => t.Id),StringComparer.InvariantCultureIgnoreCase);

I think the questions are too easy, I would add a constraint that the data can't fit in ram, and you also need to build an autocomplete for the emails, for instance.

18 Mar 2016
06:47 AM

frank quednau

Note to self: Apply to ayende folks under pseudonym with most atrocious code I can come up and chuckle silently to myself when it gets ripped apart over here.

20 Mar 2016
09:07 AM

Oren Eini

OmariO, Wait for the last post in the series...

20 Mar 2016
09:09 AM

Oren Eini

Mike, At what point did you see something that looked like the right way? The candidate solved the problem in two different ways. I wouldn't care that much about the inefficiencies of this work if the rest was actually good. See the next post for what the candidate did with this

20 Mar 2016
09:10 AM

Oren Eini

Uri, The questions are meant to be easy, they aren't meant to take you a very long time. This particular question is supposed to be snort, implement, move on. There are other with data sets larger than RAM.

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB