## New interview question

So, I think that I run through enough string sorting and tax calculations. My next interview question is going to be:

Given a finite set of unique numbers, find all the runs in the set. Runs are 1 or more consecutive numbers.

That is, given {1,59,12,43,4,58,5,13,46,3,6}, the output should be: {1}, {3,4,5,6}, {12,13}, {43}, {46},{58,59}.

Note that the size of the set may be very large.

This seems pretty simple, as a question, and should introduce interesting results.

Just to give you some idea, this is a real world problem we run into. We need to find runs in a sequence of freed pages so we can optimize sequential I/O.

## Comments

Interesting question, wouldn't have a clue where to start solving that one! <3 a good challenge.

Output should be ordered?

Fun stuff. I would do something along the lines of Zipping an ordered list with an ordered list skipping the first element, and adding to a run as long as the difference is 1. Yielding a best case performance of O(n log(n)), and worst case of O(n²) depending on the sorting algorithm.

Dangerous grounds for a Monday morning, but I'd think that the gist over here https://gist.github.com/flq/6578697 gives you what you want (at least for the example provided, which is far from a proof, but hey...).

I am not too happy about having to order the thing, but then again you extract a global information from the list.

I could imagine a situation where you open x buckets and add things as you go depending on the comparison with the last on the bucket, but I think you'd end up with all items in memory, too...

Can you specify "very large"? I assume it's between 'fits easily in memory' and 'O(n^2) isn't feasible'

gandjustas, No, it doesn't have to be ordered.

gandjustas, Sorry, yes, the output DOES have to be ordered.

This runs in (set.length + maxValue) iterations. Kind of an abused bucketsort. extremely efficient when set.length is big and maxValue is small

Patrick, Very large, assume 10 millions or more.

You can't give the output till you've read the entire input, as there might be a value on the input you still have to consume which makes two runs become one run (the value connects the two). So the predicate 'Note that the size of the set may be very large.' is not important: you have to read the entire input before processing can begin.

So in that light, the layman answer likely will be: 1) sort input 2) walk the sorted data and detect non-sequential pairs, which break a run so you've found a run.

The more clever people will try to find the runs during sorting, or better: during reading the input. The B+ tree might be a good fit for this, as nodes in the tree can be sets, which you can see a node as a run.

I think I can say that I would implement, for the third time in my career, the contracted set object I keep needing/using. Need to get a version open sourced.

Something like this (untested, sorry if doesn't compile) http://pastebin.com/bpSj6fk6 and here's the code (sorry, don't know how to paste it formatted):

"Very large, assume 10 millions or more."

Ok, so I guess an O(n^2) option is out then.

As we're talking about disk locations, we're talking about 64 bit numbers.

Since it would be nice to keep it up to date as we go (thus inserts and removals at random places), a simple array based collection is not an option as well. Thus we'll have to use some kind of tree. Which would suggest one or two pointers, which is another 2 * 64 bits per entry.

So we need max 30 MB of 64 bit memory for the collection for 10 million items. Seems reasonable to me.

However we can optimize by storing a run as {start, length} and assuming the average run is at least 2 items, that will save us a bit of memory and processing.

So that will lead to some kind of tree of runs. They will need to merged as elements / runs are added that fill gaps. And runs can be removed as they are used. And run start / length will change when a run of the right length can't be found. No need to be able to split runs, as you should only 'consume' a run from one of it's ends, not from the middle.

Could you hack a Patricia Trie such that each node just tracks the first and last entry in it's range? On add you would navigate down ala the binary variety until you either add a new node (outside of bounds for nearest leaf) or update the leaf that you were adjacent to by decrementing first # or incrementing last #.

I wanted to do it with linq and a GroupBy, here is my solution. Cobbled together but seems to get the result, not sure how it will perform on large lists though.

https://gist.github.com/lawrab/6579083

Here's a naive python implementation: https://gist.github.com/erikzaadi/6579100

As bit arrays (or bit sets) are inherently ordered sets of integers, they will probably be the most simple way to make it efficient. As in Java-version:

private static List<List> findRuns(List uniqueIntegers) { BitSet bitSet = new BitSet(); for (Integer integer : uniqueIntegers) { bitSet.set(integer); } List<List> runs = new ArrayList<List>(); for (int start = bitSet.nextSetBit(0); start >= 0; ) { List run = new ArrayList(); for (int i = start, end = bitSet.nextClearBit(start); i < end; i++) { run.add(i); } start = bitSet.nextSetBit(bitSet.nextClearBit(start)); runs.add(run); } return runs; }

FWIW, JVM converts those bitset-operations effectively to single BSF instruction, making it pretty fast. With stock BitSet, memory usage can be a bit unpredictable depending on order of the input, but there are packed bit array implementations that will minimize this downside.

Assuming you are willing to take the hit on an upfront sort, then http://stackoverflow.com/questions/4936876/grouping-into-ranges-of-continuous-integers seems a reasonable approach.

I just opted for simplicity. Since I assume you would be working within a team you don't want to be too tricky. Additionally, you don't know if the performance would be a bottleneck here, so I didn't do any "premature optimization".

using System; using System.Collections.Generic; using System.Collections.ObjectModel; using System.Linq;

namespace AyendeExam { class Program { public static readonly ReadOnlyCollection Numbers = new ReadOnlyCollection(new[] { 1, 59, 12, 43, 4, 58, 5, 13, 46, 3, 6 });

}

Actually, @Frans, I believe there is an algorithm that doesn't require reading the whole list in memory... unfortunately the worst case is still O(n) for memory (2n+ actually) and O(n^2) for time. Something like:

For a relatively low number of runs (compared to n) this would work quite fast; it might be possible to improve the time (but not the memory) by keeping the list of runs sorted and doing a binary search in it.

Marcel, you have described my implementation (above). Using a hashtable for runs makes it effectively an O(n) algorithm regarding the time and memory, but then there's an additional requirement of results being ordered, which cannot be done in linear time ;(

My attempt, requiring only (significant) memory for the sorted set: https://gist.github.com/koen-lee/6580025

Good point, @Rafal - I think Patrick also has something similar.

One thing though, I don't like this line:

seq[n] = n;

or rather, I don't like the fact that you don't remove it if you discover that n-1 or n+1 are present. It messes up the final pass that will be needed to print all runs.

All in all, interesting question :)

Marcel, I remove it when merging with neighboring runs.

... and the final pass for printing the results looks like this:

http://pastebin.com/aVR1KBSbThis is my attempt:

https://gist.github.com/jonnii/6580639

The idea is you keep a previous and next item to expect from the inputs, and then append/prepend to the associated linked list. I've modified the input to show a special case, if you run the output of this though you'll see a run like:

4567 8

Where you have two runs which touch, these should be combined. You could combine the two linked lists when this condition is met (I have the special condition case, but no implementation), but there's no way to combine linked lists in .net without iterating. The other option is to combine them while iterating for the output.

IMHO you should be much more concerned about whether someone can understand real world problems and come up with practical solutions, than whether they can solve isolated mathematical problems, taken out of context, put on the spot, in an artificial setting, under pressure to perform and "show you what they got."

Pop quizzing candidates is one of the weirdest and dumbest practices American companies use. There's a good chance you're weeding out candidates who don't do well when asked to solve nonsense problems without context or preparation - condidates who might have excellent analytical and problem solving skills when faced with real problems in a practical context and a real-world setting.

I have more than once turned down job offers in the US because I was asked to "pop quiz" - it's a big red flag for me, because it tells me that either (A) you don't know how to interview and hire, you're just following a manual, or blindly doing whatever it is you think other companies do, or (B) you're hiring for a fairly junior position, in which you would actually ask me to solve problems with little depth or complexity.

I have openly pointed this out at several interviews, and the answer is usually something along the lines of "yeah, we're just following process, we know it's dumb, but we have to do it."

Whether someone can solve problems like these and pass your little exam, at the most will tell you whether they would be able to pass an actual exam - but you hopefully already knew that before you called them in to interview?

Until I moved to America, I had never once (ever) been asked to "pop quiz" at an interview. In Europe, we take care of that shit in school - if you pass your exams, you get a certificate that says you earned your junior wings, and then you move on to bigger and better things.

Have you ever actually been faced with a candidate who lied about their level of education, and was able to fake years or work history, and github and stackoverflow history? If so, someone must be willing to go through an awful lot of work just to get fired in a couple of months when you realize they can't actually build software.

I don't understand what it is American companies think it is they're "protecting" themselves from.

Process and standards are great for software development - but not for hiring.

http://bit.ly/1go6u17

@Rasmus Schultz

For Ayende this

isa real world problem. He's building databases, remember.Just quickly jot my thoughts:

A sorted set of sorted sets.

As each item gets added a sorted set of sorted sets, it creates a new sorted set, and the new inserted sorted set does a compare with the previous and/or next sorted set, union if consecutive.

e.g. {1,59,12,43,4,58,5,13,46,3,6}

(* being currently inserted sorted set)

Since a sorted set can be implemented with O(log n) on average using BST, an insertion will cost O(log n) and union will cost O(log*n). Since a union is only 1 + exiting set (essentially an add to the inner set), the cost is O(log n).

Invoking a union or an add can be determined by the custom comparer function and a quick search through the set which is O(log n) * O(log n) on average.

Therefore this algorithm will have a complexity of O(log n)^2 on average.

pondersAm I on the right path?3N always

int[] inputNumbers = GetNumbers();

Not 3N but 2N+maxNumber. no need to sort. Assumption: no duplicates

Iterate through the initial list, inserting it (sorted) into a list of buckets with start and stop points. Merge the new bucket on insert.

https://gist.github.com/LaptopHeaven/6586644

using System; using System.Collections.Generic; using System.Linq;

namespace kygeek.AyendeRuns { class Program { static void Main(string[] args) { var limit = 1000; //var input = GetRandomInput(limit, 1, limit*6); int[] input = new[] {1, 59, 12, 43, 4, 58, 5, 13, 46, 3, 6};

}

Following up on Frans Bouma's approach (sorted input already), I would ask in what context this data is used - do we need to do anything with it or are these just temporary results used for further processing? If the later (e.g.: a scenario of loading document data from disk to memory based on the results from index scan(s)), why not just pass in a callback (or make it fire an event, yield, or whatever) with the lo/hi values? That way (with already sorted input) it becomes a method with O(n) operations (though the callback's performance could make it worse) and constant memory use. Sample code: https://gist.github.com/zsoldosp/6586803

public class SequenceFinder { private readonly List sequences = new List();

}

If the set to solve gets into millions, the problem is somewhere else.

Keeping track of things and group in a early stage makes the problem area much easier and efficient to solve. Waiting and get a huge set of data to resolve feels like neglecting the source/cause of the data.

If within say 100ms this mess (fragmentation) occurs there is a much more challeging puzzle to solve underneat. That should be the focus. The puzzle is about the symptoms but not the root.

I do also agree with Rasmus. Quizing results in false positives If you have your license you know how to drive a car ie how to build software. Not everybody makes it to F1, But an experienced observer should notice that within 2 weeks

A simple extension method

I'm leaning toward floating bit vectors but my suspicion is the bit manipulation instructions are still slower than our current implementation. Haven't actually prototyped it yet to prove one way or the other though.

My updated solution without global pos variable:

You do need to read all input before you can say there are e.g. {1,2} and {4,5} as runs and not {1, 2, 3, 4, 5} as '3' isn't in the input: but you can't conclude that until you've read till the end of the input ;)

The approach with storing start, length or min,max is indeed a good one (e.g. the approach of Patrick Huizinga above) and saves memory, so one could potentially store the complete input in runs in memory and create the data structure while reading the input (so the total algorithm will be O(n), even with non-sorted input).

My take on the algoritmic side of this (may be completly wrong): I think you can't do it in O(n) with a comparasion based sort because if you could do that it would mean that you could sort an array in O(n) with a comparastion sort which we all know is not possible.

Proof: if you could make this work in O(n) with some comparastion sort then you could sort an array in O(n) - you will only need to flatten the result and you will get a sorted array.

So the solutions should be: 1. O(n*log(n)) - if you use comparasion sort 2. O(n) - some kind of bucket sort or other datastrcture - but this will probably require O(f(n)) or more of memory - where f(n) is a linear function

So it sounds like a common trade off in CS :)

Uriel, the solution cannot be faster than sorting because it IS sorting (of the results). But this is an additional requirement by Ayende, the core 'find runs' algorithm doesn't require any sorting of results and can be made linear O(n) for both memory and time. The proof is above.

Another solution is to sort array and than find and store each run start and end position. By far most expensive part of this kind of solution is sorting, for sorting I would use some quick parallel sort (https://gist.github.com/wieslawsoltes/6592526).

Full code: https://gist.github.com/wieslawsoltes/6592536

I'm surprised that no-one has a solution with

realbit-twiddling in it ;-) Oh, and it doesn't what was asked, but it was nice to think about...so here you go:

The blog didn't like my code... https://gist.github.com/remco-sch/6593315

Represent existing runs by a (bottom, top) tuple.

Maintain a map of top -> run and a map of bottom -> run.

For each number k, look up its successor (k+1) in the bottoms map and its predecessor (k-1) in the tops map.

If both lookups fail, create a new run of (k, k) and insert into both maps.

If only one lookup succeeds, remove the run that comes back from both maps, and insert a new run, extended with k, into both maps.

If both lookups succeed, remove both runs from each map, and insert a new run which is their union into both maps.

After reading all numbers, print out the members of one of the maps.

CATCH: sorted output. Either ensure one of the maps is a sorted map, and then the runs can be output from that map, or sort all of of the runs before the output. It's not clear which to pick. Asymptotically, the latter is probably better (the number of runs is smaller than the input size). But the former might be faster in reality, and is definitely easier to implement.

Fairly simple implementation - uses winform to get input for flexibility in testing small sets...

Public Class Form1

End Class

One more try using to store run start and end position in long instead of struct: https://gist.github.com/wieslawsoltes/6594019

Rasmus,

It

isa real world problem. As mentioned in the post: " this is a real world problem we run into. We need to find runs in a sequence of freed pages so we can optimize sequential I/O" We are not an American company, so I don't see the association, either.And I got quite a BIT of highly certified candidates, good grades, really good schools that couldn't code their way out of a wet paper bag. Also see my post about the cost of a bad hire. http://ayende.com/blog/163266/the-cost-of-a-bad-hire-or-the-zombie-apocalypse-is-upon-us

A compromise between simplicity and efficiency. Sort values with values.OrderBy(v=>v) if not already sorted.

public static IEnumerable<Tuple<int, int>> GetRunsFromSorted(IEnumerable sortedValues) { bool first = true; int start = 0, end = 0; foreach (var value in sortedValues) { if (first) { start = end = value; first = false; continue; }

}

Since everyone seems to be sorting first, then finding runs, here's a way to find runs then sort:

@Shawn: I'm not sorting anythig at the beginning. You should notice it since your algorithm is almost copied from mine.

@Everybody: So how much does your algoritthm spend finding runs in 10 milion numbers? I'm asking people sorting input set at the beginning. Also I would like to ask people creating bool arrays how their algorithm works with 10 milion numbers input.

@Ayende: What did you came with? How much does it take to find runs in 10 milion numbers in input? Also what were the requirements? My algorithm is by far simplest and cleanest (in my opinion) and it works with 10 milion numbers in 2.5 seconds. I didn't take any time to further optimize it because I don't know what performance is needed. I checked other algorithms from here and they werent faster but way more complicated.

@Jiggaboo: You're right it does looks like we have hit on the same algorithm, I didn't notice at first since the formatting was weird. You're implementation is cleaner but the sequences.FirstOrDefault really kill it when there's more than a few sequences. When you timed it how many separate sequences were there? In my testing under 5 and it's a few seconds, around 20 is about 8 seconds, around 100 and it's about 20 seconds, a thousand and you're waiting minutes (my dictionary abuse keeps it to 1.5s).

You can try it too:

Also, for 10 million items a quick check shows it takes around 20s just to sort in the sort first approach.

@Jiggaboo

Tested with 10 million random integers (this is probably close to worst case as runs are very small):

sort: 1039ms algorithm: 78ms

parallel sort: 677ms algorithm: 77ms

and also with 100 million random integers:

sort: 11825ms algorithm: 797ms

parallel sort: 5356ms algorithm: 774ms

Code: https://gist.github.com/wieslawsoltes/6600771

My mistake. When I worked with 10 milion the numbers were sorted. So one sequence.

I suspected (and checked with profiler) that FirstOrDefault will be the bottleneck. And in fact was thinkig about something like what you did. But it was already 2AM so 4 hours of sleep till waking up before going to job. Also I didn't notice that Ayende was talking about set of more then 10 milion numbers and I didn't test for that.

Good job on your algorithm :)

@wiso I'm not sure how your code returns results

For input 1, 2, 3, 5, 6, 7 it should return two runs yet it returns 5 longs. And foreach with Print doesn't display anything meaningfull: {7} {6} {5} {3} {2}

@Shawn,@Jiggaboo You can merge the two dictionaries in one, because the numbers are distinct - so each value can only be either the start or the end of a run. You just have to make sure to select only ascending runs at the end (start <= end). I tried your method as well, but it gave me no significant better performance nor memory footprint than my much simpler method.

@Jiggaboo Please check the code below. I've created 3 tests. - first to check numbers given by Ayende (n1 array) - second test for your numbers (n2 array) - third for array with 10 million numbers

My algorithm in first step sorts the input array:

than finds each run start and end position in sorted array:

and finally it stores two int values (start and end of run) in one long variable:

where 'pos' is run Start index in 'n' array and 'j' is run end index in 'n' array.

To store run position in one long variable I use Pack method, to get back start and end position I'm using Unpack method. Storing start and end index in one long I have avoided creating new objects (eg. Tuple or struct) to store run position.

@Ayende

Of course this is a real-world problem somewhere in the world, every programming problem is - but it's just like the ones they would use in school and to "pop quiz" candidates, and it is no more meaningful

when taken out of contextthan any other problem you would face in a test or quiz."And I got quite a BIT of highly certified candidates, good grades, really good schools that couldn't code their way out of a wet paper bag."

Unfortunately, this is precisely the explanation (or excuse) you would hear from American companies.

Everyone knows most people don't come out of school equipped to really solve any real-world problems - that takes experience.

The kind of problems you're faced with in school are typically problems of this type - problems that

canbe solved in a vacuum, without any background knowledge of the problem. I don't think you would hire people to sit around and solve trivial textbook-type problems like this one all day? You can solve these on your own quite easily.All a candidate proves by solving this problem, is that they can solve isolated logic problems, under pressure, without any context. If that's what you require for the position, I don't think you will find many interested candidates that have any substantial experience building real systems. Most anyone who has been out of school and working for more than 3-5 years has (hopefully) long since moved on from "testing" to solving problems that require much higher faculties, abstract thinking and real-world problem-solving.

All I'm saying is, once you're out of that loop, and out of that environment, living in the real world and doing useful work, you're not necessarily going to perform well in a setting like that - and that is not and indication of your ability to absorb and understand a complex real-world domain, and create solutions to real problems, in a real-world setting.

A much better indicator for that, is to simply talk to the candidate, like a human being - discuss and review their history and examine a body of their recent work, have them explain the reasoning that drives their decision-making, and so on. Not expecting "text-book" answers, but looking for the kind of practical learning you can only do in the field, and not in a class-room... You have real-world experience, you know other capable programmers, and when you meet new developers, you quickly know if they have that shared experience, right? Use that instead. We're done with school, and there is no "test" for the kind of experience you earn in the field.

Unless you're hiring a junior, don't test you candidates like you were hiring a junior.

Of course, maybe you were hiring for a junior position, I don't actually know that :-)

@wiso: Yeah, now it works. It did not previously.

Since I assume you are interested not only in one time sorting of disk pages, but actively using this free list (i.e. inserting and removing sets of pages that are freed in a transaction or recycled from the free list and now in use), two types of structures with different trade-offs come to mind:

A compressed patricia trie that stores page ranges (i.e start/end page or start page/number of pages), and that switches on 4-bit nibble prefixes of the start page number would likely work well. A hybrid approach, where you use "burst" leaf nodes, that are basically hash tables and "burst" into prefix nodes when they become too full might perform even better.

A more compact in-memory/serialized representation could be to use a compressed bitmap (or page range partitioned sets thereof), like one of the many word-aligned hybrid (WAH) approaches. This would be slower on updates though.

A combination of both might be interesting, i.e. use the patricia trie in memory and serialize it to a compressed bitmap on writing the freelist to disk.

Rasmus, "You can solve these on your own quite easily." - Yes, I can (at least I hope so). Doesn't mean that I've got the

timeto do that.If you can't solve a clean room problem like that, without having to worry about things like IoC configuration, database access patterns, jquery version, etc. You wouldn't be able to solve the problems that I actually have to deal with.

As I mentioned, this is a real world problem. We had to solve it. And the good thing about it is that it

canbe pulled out of context and still make sense. I am sorry, that"once you're out of that loop, and out of that environment, living in the real world and doing useful work, you're not necessarily going to perform well in a setting like that" - then you won't be working for me. Because useful work include solving problems like that.

"A much better indicator for that, is to simply talk to the candidate" I do talk to the candidates, but there is a

vastdifference between being able to talk about code and being actuallyableto code.And pretty much any good programmer that I know socially would be able to solve this to one level of another in a matter of 10 - 20 minutes. That is what I am looking for.

@Ayende maybe I'm not explaining myself well, but many articles have been written on this subject over the years, here's one:

http://eliw.wordpress.com/2008/12/04/interviewing-programmers/

The most important point is, if you're interviewing for a junior position, sure, pop quiz, ask them to code - but if you're interviewing for an intermediate or senior position, why would you ask questions they would have had to answer in the past to get to where they are today? Looking at their work, and talking about their experience, the process they used, etc. should be more than enough to arouse your suspicion, if you were to come across a candidate that was trying to dupe you.

Here's another on "pop quiz" questions in general:

http://philtoland.com/post/448907461/pop-quiz-interviews-considered-harmful

The main problem with asking junior candidates to solve textbook problems, is that you're practically inviting them to dupe you - since there's a good chance they had solve similar problems in school recently.

My problem, personally, is that with 15 years of professional experience, a body of open source work and a good history on various social coding sites, there are some (many) American companies that

stillinsist on pop-quizzing and staged coding-sessions when they're interviewing for senior or lead positions.It's not that I can't solve trivial problems, or perform under pressure for that matter - it's that, the fact that they're using these techniques to interview for the position tells me one of two things: (1) these people have no idea how to hire for a senior/lead position, and/or (2) their expectations for a so-called "lead" or "senior" is someone who knows == from === and can solve textbook problems; i.e. questions I could have answered 20 years ago. Either way, I politely cut those interviews as short as I can, and continue my search for a job with real challenges. You see the problem there?

We can agree to disagree :-)

It seems to me that the simplest method would be to first sort the list and then starting from the beginning check the adjacent number if it's value is n+1 large using a recursive function.

@Rasmus "(1) these people have no idea how to hire for a senior/lead position" implies that the job doesn't have real challenges? I think that if you, as a candidate, do your due diligence before going into the interview, you'll already know whether there are real challenges available at the company, rather than making that determination based on whether or not their recruiting pipeline is well-equipped (in your opinion) to recruit for a senior position.

We ask everyone to solve these sort of problems, with code written on a white board, in the interview setting, regardless of the position. If we're recruiting for a junior position, I'm interested in "can this person solve problems and write code." If it's a senior position, my expectation is that solving the problem and writing code isn't going to be difficult for them -- so instead I raise the bar and expect them to be able to pose more than one solution, talk about their problem-solving methodology, and elucidate on edge cases that might be encountered if this problem were extrapolated to a real-world system. Because that's what the senior person will be doing -- planning and discussing solutions with members of the team, including edge cases, scaling concerns, maintainability, etc. The problem is the same, but the focus is different, and the same interview question offers the ability to evaluate both types of candidates. Sure, we probably could make that determination based on resume, code review, and discussion, but seeing the developer at work makes me much more confident of my evaluation. And, to be sure, thorough discussion of experience, development philosophy, favorite paradigms and patterns, etc. is also included in the interview.

Define a pair data structure, and a hash table T that allows constant time insert, retrieval, and delete based on either the left hand side or the right hand side of the pair. This is easily doable using two hash maps, if you like.

Then for each number n in the unordered set, we have four operations: 1) if T contains an entry (a, n-1) and an entry (n+1, b), remove both and insert (a, b) 2) else if T contains an entry (a, n-1), remove it and insert (a, n) 3) else if T contains an entry (n+1, b), remove it and insert (n, b) 4) else insert (n, n)

It gives you the set of runs in O(N) time and at most about O(2N/3) memory. To get the fully sorted output you can use a standard sorting algorithm on T, which will sort in O(T * log(T)) (where for the sake of convenience T actually represents size(T)).

Total time is O(N) + O(T*log(T)).

## Comment preview