Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 1 min | 187 words

The final piece for this candidate is the code to actually search through the sorted array. As you'll recall, the candidate certainly made our computer pay for that sorting and merging. But we are fine now, we have manage to recover from all of that abuse, and we are ready to rock.

Here is how the search (over sorted array) was implemented.

public List<int> EmailSearcher(string email)
{
    List<int> answer = new List<int>();
    for (int i = 0; i < emailsArray.Length; i++)
    {
        if (emailsArray[i].STR.ToString().Equals(email, StringComparison.OrdinalIgnoreCase))
        {
            answer = emailsArray[i].ID;
            return answer;
        }
    }

    return answer;
}

And with that, I have nothing left to say.

time to read 3 min | 590 words

The task in question was to read a CSV file (which is large, but can fit in memory) and allow quick searches on the data by email to find the relevant users. Most candidates handle that with a dictionary, but since our candidate decided to flip things around, he also included an implementation of doing this with sorted arrays. That is actually quite nice, in theory. In practice, it ended up something like this:

funny, funny pictures, funny photos, hilarious, wtf, weird, bizarre, funny animals, 17 Bizarre Photoshop Hybrid Animals

In order to truly understand the code, I have to walk you through a  bunch of it.

It starts with the following fields:

List<Structure> emails = new List<Structure>();

Structure[] namesArray = new Structure[0];

Structure, by the way, is a helper class that just has a key and a list of ids.  The duplicate is strange, but whatever, let move on.

The class constructor is reading one line at a time and add an structure instance with the email and the line id to the emails list.

public void ToArray()
{
    emailsArray = this.emails.ToArray();
}

This code just copy emails to the array, we are not sure why yet, but then we have:

public void Sort()
{
    Array.Sort(emailsArray, delegate(Structure x, Structure y) { return x.STR.CompareTo(y.STR); });
}

So far, this is right in line with the "use a sorted array" method that the candidate talked about. There is a small problem here, because emails are allowed to be duplicated, but no fear, our candidate can solve that…

public void UnifyIdLists()
{
    for (int i = 0; i < emailsArray.Length; i++)
    {
        if (i + 1 == emailsArray.Length)
            break;
        if (emailsArray[i].STR.Equals(emailsArray[i + 1].STR))
        {
            emailsArray[i].ID.AddRange(emailsArray[i + 1].ID);
            emailsArray[i + 1] = null;
            List<Structure> temp = emailsArray.ToList<Structure>();
            temp.RemoveAll(item => item == null);
            emailsArray = temp.ToArray();
        }
    }
}

The intent of this code is to merge all identical email values into a single entry in the list.

Now, to put things in perspective, we are talking about a file that is going to be around the 500MB in size, and there are going to be about 3.5 million lines in it.

That means that the emailsArray alone is going to take about 25MB.

Another aspect to consider is that we are using dummy data in our test file. Do you want to guess how many duplicates there are going to be there? Each of which is generating two 25MB allocations and multiple passes over an array of 3.5 million items in size.

Oh, and for the hell of it, the code above doesn't even work. Consider the case when we have three duplicates…

time to read 3 min | 472 words

We typically don't look too deeply into the overall design on a candidate's code. Most coding tasks just aren't big enough to warrant it.

When we do, it is typically because the candidate has done something… peculiar. For example, we have had a candidate that send a submission with no less than 17 classes to manager reading a CSV file and populating a dictionary. At some point we knew that this candidate wasn't going to move forward, but it became a challenge, to try to outline how the data actually traveled through that particular maze. But as much as we love Mr. Goldberg, I'm afraid that dear Rubbie isn't going to be remembered for much longer.

Here is how this candidate code started:

static void Main(string[] args)
{
    //Call for parsing method
    HelperClass cr = new HelperClass();
    cr.CsvParser();
    cr.ToArray();
    cr.Sort();
    cr.UnifyIdLists();
    //ask the user to enter input
    string input = cr.InputReader();

I actually started explaining the issues in this code, but I run out of… pretty much everything.

I have a challenge for you, can you think of a single OOP principle that isn't being broken here?

For… fun, I guess, you can also look at the following code, which come just after the one above:

Start:
    //check input type
    switch (cr.InputTester(input))
    {
        case 0:
            //valid email structure, but the input is not found in the file
            List<int> emails = new List<int>();
            emails = cr.EmailSearcher(input);
            if (emails.Count == 0)
            {
                Console.WriteLine("The input you have entered has not been found in the file. \nPlease try again.");
                input = cr.InputReader();
                goto Start;
            }

I guess someone heard that loops are a performance bottleneck in most applications and decided to do away with them.

time to read 2 min | 362 words

The first question that ask in the coding task goes something like this: "Given a CSV file containing users' data, who is large, but can fully fit into memory, we want to be able to search by email address very quickly and get all the matching user ids. Optimize for fast queries over fast startup".

The intent is basically that you'll read the file and load that into a dictionary, then use that to perform queries.

This candidate has done just that, although things started being strange pretty much from the get go…

Dictionary<int, string> emails = new Dictionary<int, string>();

That seemed like a pretty strange way to set things up, but we have seen crazier. At this point I thought that they are probably storing the hash code of the email in the dictionary, and the string value is the concatenated list of all the matching ids.

The truth was… far more interesting. Here is the code for querying:

public Dictionary<int, int> EmailSearcher(string email)
{
    Dictionary<int, int> answer = new Dictionary<int, int>();
    int count = 0;
    foreach (var entry in emails)
    {
        if (entry.Value.ToString().Equals(email, StringComparison.OrdinalIgnoreCase))
        {
            answer.Add(count, entry.Key);
            count++;
        }

    }
    return answer;
}

This code actually have multiple levels of strangeness. The obvious one is that this is doing a linear search on the dictionary, but look at the return type as well…

The candidate was well aware that this code is not able to handle large amount of information, so the candidate sent us an alternative implementation. But I'll talk about that in my next post.

time to read 4 min | 615 words

You might have noticed that we are looking for more people again. Mostly because I gripe about some of the bad ones here every now and then.

One of my absolute requirements is that I want to read a candidate's code before hiring them. Some of them have made significant contributions to Open Source code, but a large number don't have any significant body of code that they can show. That is why we ask candidates to send us a coding test.

We had people flat out refuse to do the coding test, and some who thought that the questions were incredibly hard and unrealistic. We had people who sent in code that was so bad it caused migraines, we had people who sent in code that wouldn't compile, we  had people do stuff like read a 15TB file (16,492,674,416,640)! times. And just to clarify, that is factorial(16492674416640). I don't know what the actual value of this number is, but is it big.

The nice thing is that you can usually tell right away when the code is bad. We also play a game called "guess the candidate's background". We have about 89% success rate there*.

Spotting good code is much harder, because a really successful submission is going to be boring. It does what needs to be done, and that is it. Our most recent hire's code submission was so boring, we had to analyze it using our standard metrics to find issues (our standard metrics are for production code running in a high performance DB for months on end, not the same metrics we evaluate code submissions).

And then… then there is this candidate, whose code is so unique that I decided to dedicate a full week to explore it. The code looks good, documented, there are explanations that show what is going on and they are going in the right direction, on the surface of it.

And then there is the devil in the details. I have quite a lot to write about this particular submission, but I'll just start with the following:

//Find today's log file in the directory
public string LogFileFinder()
{
    string[] files = Directory.GetFiles(LoggingAggregation.Program.GlobalVar.path, "*.log");
    for (int i = 0; i < files.Length; i++)
    {
        int slash = files[i].LastIndexOf(@"\");
        int dot = files[i].LastIndexOf(".");
        string fileName = files[i].Substring(slash + 1, dot - slash - 1);
        string fileDate = DateTime.Now.ToString("yyyy-MM-dd");
        if (fileName.Equals(fileDate))
            return fileName + ".log";
    }

    return null;
}

Now, another way to write this code would be:

Path.Combine(LoggingAggregation.Program.GlobalVar.path, DateTime.Now.ToString("yyyy-MM-dd") +".log") 

I literally stared at this piece of code for several minutes, trying to figure out what is actually going on there. I wasn't able to.

As an aside, we sent the candidate a rejection notice, along with a few pointers to some of the more egregious things that were wrong in the code.  They know how to code, it is just that it goes sideway in the middle.

* The game call to guess the candidate's default language settings. That is, someone who write C# code, but has been brought up on C, so have a C style to the code.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}