Ayende @ Rahien

Refunds available at head office

Evil interview questions: Unique & Random C

Writing in C (and using only the C standard library as building blocks, which explicitly excludes C++ and all its stuff), generate 1 million unique random numbers.

For reference, here is the code in C#:

   1: var random = new Random();
   2: var set = new HashSet<int>();
   3:  
   4: var sp = Stopwatch.StartNew();
   5:  
   6: while (set.Count < 1000 * 1000)
   7: {
   8:     set.Add(random.Next(0, int.MaxValue));
   9: }
  10:  
  11: Console.WriteLine(sp.ElapsedMilliseconds);

It is a brute force approach, I'll admit, but it completes in about 150ms on my end. Your solution must run in under 10 seconds.

While this question may look stupid, it can actually tell you quite a bit about the developer.

Comments

Marco, 10/02/2013 09:23 AM

Considering http://xkcd.com/221/, returning numbers from 0 to 1M-1 should be fine :) If the numbers do not have to be integers, I'd add rand ([0,1[) to every number of the previous random set just to make them different on every run :D Am I a cheater?

Rafal, 10/02/2013 09:34 AM

#include <stdio.h>
#include <stdlib.h>

/* the opening lines of this snippet were eaten by the comment formatting;
   a plausible reconstruction: */
int main()
{
    int cnt = 0;
    while (cnt < 1000000)
    {
        int num = rand();
        if (num >= 500)
        {
            cnt++;
            //printf("%d\n",num); we don't have to print, we have to find...
        }
    }
    printf("found 1000000 unique random numbers, but don't know what to do with them\n");
}

Assuming nothing about the distribution of these numbers, and without having to print them all, this runs in a fraction of a second.

Rafal, 10/02/2013 09:35 AM

Heh, code formatting in comments sucks

Ramon Smits, 10/02/2013 09:36 AM

You have indeed a unique set. But you don't actually test that the set contains 1,000,000 items. If you get dupes from the random number generator, then what?

James, 10/02/2013 09:41 AM

@Ramon - yes it does. That would be the while on the add...

tobi, 10/02/2013 09:48 AM

1) We need to call rand() twice for each number because it returns [0, 32k]

2) We have no built-in hash table, so we'll use a bool[1000000] to keep track of the duplicates. I'd now point out the various reasons this is a hack and problematic and ask the interviewer whether he wants me to fix that or whether it is enough.

Frans Bouma, 10/02/2013 09:52 AM

Easy.

    var sw = new Stopwatch();
    var store = new SortedDictionary<Guid, int>();
    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        store.Add(Guid.NewGuid(), i);
    }
    sw.Stop();
    Console.WriteLine("Elapsed ms: {0}", sw.ElapsedMilliseconds);

    // show the results ;)
    int amount = 0;
    foreach (var kvp in store)
    {
        Console.WriteLine("Value: {0}", kvp.Value);
        amount++;
        if (amount > 100) { break; }
    }

The idea is to sort on the guid. It's not that efficient, but the question is rather stupid, and it does guarantee 1000000 numbers in (pseudo) random order.

Frans Bouma, 10/02/2013 09:53 AM

(oops, that formatting went seriously wrong).

Ayende Rahien, 10/02/2013 09:55 AM

Marco, How would you get a proper random distribution there? And how do you ensure uniqueness?

Ayende Rahien, 10/02/2013 09:56 AM

Ramon, Because it is a unique set, duplicates generated will be ignored.

Ayende Rahien, 10/02/2013 09:59 AM

Tobi, Assume that you need full 32 bits for the numbers, so you can't just have an array for all the items. What do you do then?

Ayende Rahien, 10/02/2013 10:00 AM

Frans, There is a reason I said C, not C#. And your code would produce only the numbers 1 .. 1,000,000; I want the full 32 bit range.

Marco, 10/02/2013 10:07 AM

Integers from 1 to 1M-1 are unique and normally distributed, while if I add a random floating number between 0 and 1 (not included) to each integer from 0 to 1M-1, I get a set of unique random floating numbers (sort of normally distributed).

It's a cheat, I know, but it solves the given problem while respecting the constraints.

Marco, 10/02/2013 10:17 AM

sorry I meant uniformly distributed not normally

Sergey Shumov, 10/02/2013 10:17 AM

The idea is to divide all integer numbers into N chunks. Then we choose a random number from each chunk. Here is the code: https://gist.github.com/SHSE/6791622

Rafal, 10/02/2013 10:18 AM

http://pastebin.com/VdScTnaa Same idea as Marco's from the first comment

Ramon Smits, 10/02/2013 10:36 AM

@James: Ah yes! I missed that :) thanks

tobi, 10/02/2013 10:44 AM

In that case I'd build myself a crude hash table that is a LinkedList[1000000]. Only adding is needed and I can rely on malloc to build the linked list. That's the simplest hash table possible and can be built as a quick hack.

Alternatively, I'd generate 10m random numbers, sort them with qsort and implement a simple distinct algorithm which is easy for a sorted array. I'd wrap this in a retry-loop in case I do not get 1m unique numbers out of the operation.

Rafal, 10/02/2013 11:29 AM

Ayende, you have to stop doing that. I really can't resist this kind of algorithmic exercise, and it hurts my productivity. Here's another shot at it, this time taking into consideration the requirement that the output must be uniformly distributed over the full 32 bit range. I made use of the fact that a cartesian product of two uniformly distributed sets will also be uniformly distributed. I mean, 1000000 is 1000x1000, so I can produce two sets of 1000 unique numbers and combine them.

http://pastebin.com/8NXDUSG5

Sorry if it crashes, didn't really wait until it spits out all these numbers ;)

Rafal, 10/02/2013 11:35 AM

sorry, the previous version contained an error (two random numbers were incorrectly combined into one in the 'printf' line)

http://pastebin.com/LCedvFNu

Rafal, 10/02/2013 11:37 AM

AAARGH!

http://pastebin.com/LCedvFNu

Andy Masters, 10/02/2013 11:42 AM

Unless external input is given, all solutions will be pseudo-random. Truly random would need to acquire data from a random physical source. For 1m numbers you would need quite a lot of entropy; maybe hook up my webcam to observe a lava lamp and take the least significant bits from some pseudo-random pixels. Now the C part....

Ruben Bartelink, 10/02/2013 11:45 AM

I know it's beside the point but you could speed up the C# segment:

    6: while (set.Count < 1000 * 1000)
    7: {
    8:     set.Add(random.Next(0, int.MaxValue));
    9: }

by leveraging the result of HashSet.Add:

    int todo = 1000 * 1000;
    while (todo != 0)
        if (set.Add(random.Next(0, int.MaxValue)))
            --todo;

or if you want the C# to be terser but [arguably] less legible:

    for (int todo = 1000 * 1000; todo > 0; )
        todo -= set.Add(random.Next(0, int.MaxValue)) ? 1 : 0;

Rafal, 10/02/2013 12:10 PM

errrr... a final version fixing a few bugs (incorrect parentheses + an uninitialized array that created some undesired entropy at run time)

http://pastebin.com/e9Sf3eJA

Mika Kolari, 10/02/2013 12:50 PM

I don't know C, but I guess this would be easy to translate:

    public static int[] getRandoms(int count, int max)
    {
        int[] results = new int[count];
        var mask = new BitArray(max + 1);
        int counter = 0;
        while (counter < count)
        {
            int r = random.Next(max + 1);
            if (!mask[r])
            {
                results[counter] = r;
                mask[r] = true;
                counter++;
            }
        }
        return results;
    }

Geert Baeyaert, 10/02/2013 01:24 PM

@Sergey,

That's not random! With your solution, you get precisely one number from each chunk, so for instance the set [1, 2, 3, whatever...] could not occur, because 1, 2 and 3 are all in the same chunk.

Joseph Daigle, 10/02/2013 02:30 PM

Here is one approach.

Start with a linked list containing all numbers in the range you care about (e.g. -2^31 through (2^31) - 1). Next loop 1000*1000 times. In each loop get a random number between 0 and (2^32) - i, where i = incrementing counter for the loop. Get and remove the number at the position found by the random number, push that number to end of an array for your output.

Yes, you do have to do 1,000,000 "seeks". But it should be fairly constant time. Perhaps there is a better/faster way to randomize the initial set of numbers (random swapping?).

Judah Gabriel Himango, 10/02/2013 02:40 PM

Here's a super short version. It kind of cheats by ensuring the random number is always greater than the last one. But hey, the numbers are still random, and the code is dead simple. :-)

https://gist.github.com/JudahGabriel/6794792

(It's been a long time since I wrote C!)

Ayende Rahien, 10/02/2013 03:17 PM

Judah, But this will generate a random set of sequential numbers. That is what we don't want.

Peter Ibbotson, 10/02/2013 03:29 PM

I have a sneaking suspicion that one answer to this is to use a 32 bit LFSR and just generate 1000000 numbers; since it's shift-register based, there can't be any duplicates (assuming you don't start from 0).

Need to think some more about what to do if you're using rand_s instead.

Sergey Shumov, 10/02/2013 03:31 PM

@Geert, good point! It can be fixed by introducing randomness in the chunk allocation logic. Here is the code: https://gist.github.com/SHSE/6795548

Thor Arne Johansen, 10/02/2013 05:38 PM

An LFSR representing a primitive polynomial will generate numbers from the whole range.

So a primitive order 32 polynomial will have period of 2^32-1 and go through all 32 bit numbers (excluding 0)

Jesús López, 10/02/2013 10:31 PM

https://gist.github.com/jesuslpm/6801439

Keith, 10/03/2013 12:50 AM

I'd probablly go with the following :-

http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/

Can't get much faster than that. The only question you have to ask is how random do you want your random to be :)

Vytis, 10/03/2013 06:57 AM

https://gist.github.com/xytis/6805873

A similar answer to Judah Gabriel Himango's, but with additional features: 1. It ensures that the random numbers are uniformly distributed over the range [0..MAX_INT]. 2. It shuffles the obtained set.

With one drawback: if the implementation of rand() has RAND_MAX < 1000000, the set shuffle does not work correctly. (This may be fixed by using offsets from the current location instead of absolute indices...)

Also, it is actually slower than Keith's answer, but in memory consumption it performs better than the C# example.

kb2011, 10/03/2013 07:38 AM

I haven't done much C (I do C# for a living) but I took a stab at it. There are four files, denoted by /** NAME.EXT **/ comments.

I didn't write it to fail if it took longer than 10 seconds, since the example C# didn't. I also assumed that by "C std lib" you meant the C standard libraries, and not just stdlib.h.

http://pastebin.com/AkWDNuXr

Jesús López, 10/03/2013 11:20 AM

I have another implementation; it takes 3 milliseconds to complete.

https://gist.github.com/jesuslpm/6808179

Jonathan North, 10/03/2013 11:27 AM

http://pastebin.com/iVDd9cYW

Basic algorithm: generate more values than needed and dedupe them. If there are too few results, fill in the empty part of the buffer and repeat. Shuffle and take the first N values. This wastes 0.8MB of RAM in exchange for being easily understood and easily implemented.

I could have used less extra space, or even no extra space, but that would devolve into O(#rounds * n log n). As is, it'll be rare to have less than the desired number of values in the initial run.

Juan Lopes, 10/04/2013 03:13 AM

I'd use a segment tree with lazy propagation (or a red-black tree) to be able to answer "what is the nth unused number" in O(log^2 n).

But I'm not messing around with mallocs today.

ori, 10/04/2013 09:02 AM

What do you mean by random here? That each set of 1m unique ints has an equal probability?

Matt, 10/04/2013 10:16 AM

I'd cheat.

n_(x+1) = n_x + rnd(1, int32.max / 10^6)

for x = 0 to 10^6, with n_0 = 0

matt, 10/04/2013 10:28 AM

Actually, n_0 should be -1 if I'm generating random numbers between 1 and int32.max/10^6. And I'd run x = 0 to (10^6 + 1) and drop the first result.

I should probably round int32.max/10^6 up, and the last result should be calculated between the previous number and int32.max; otherwise the possible numbers would only cover roughly 0 to int32.max - 10^6.

Igor, 10/07/2013 03:06 AM

A simple approach is to maintain a binary search tree of values generated so far. Generate a random integer and check the tree. If the tree already contains the value, retry. Otherwise, add the number into the result array and insert it into the tree.

Since the values are random, balancing the tree wouldn't be necessary in practice.

Also, the retries are expected to be rare, so their impact should be negligible. Even when generating the last 1,000,000th number, the probability of collision with a number already selected is < 0.05%.

As an implementation note, it is unnecessary to allocate memory for each tree node - i.e., we don't need 1,000,000 malloc calls. We can just pre-allocate an array of 1,000,000 tuples [value, leftchildindex, rightchildindex].

Jay Sullivan, 10/08/2013 02:13 AM

Generate sequence 1..1000000, then shuffle the results. https://gist.github.com/notfed/6878293

Comments have been closed on this topic.