Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969




Evil interview questions: Unique & Random C

time to read 4 min | 652 words

Writing in C (and using only the C standard library as building blocks, which explicitly excludes C++ and all its stuff), generate 1 million unique random numbers.

For reference, here is the code in C#:

   1: var random = new Random();
   2: var set = new HashSet<int>();
   3:  
   4: var sp = Stopwatch.StartNew();
   5:  
   6: while (set.Count < 1000 * 1000)
   7: {
   8:     set.Add(random.Next(0, int.MaxValue));
   9: }
  10:  
  11: Console.WriteLine(sp.ElapsedMilliseconds);

It is a brute force approach, I'll admit, but it completes in about 150ms on my end. The solution must run in under 10 seconds.

While this question just looks stupid, it can actually tell you quite a bit about the developer.


Comments

Marco

Considering http://xkcd.com/221/, returning the numbers from 0 to 1M-1 should be fine :) If the numbers don't have to be integers, I'd add rand([0,1[) to every number of the previous random set, just to make them different on every run :D Am I a cheater?

Rafal
= 500)
        {
            cnt++;
            //printf("%d\n",num); we don't have to print, we have to find...
        }
    }
    printf("found 1000000 unique random numbers, but don't know what to do with them\n");
}

Assuming nothing about the distribution of these numbers, and without having to print them all, this runs in a fraction of a second.

Rafal

Heh, code formatting in comments sucks

Ramon Smits

You do indeed have a unique set. But you don't actually test that the set contains 1.000.000 items. If you get dupes from the random number generator, then what?

James

@Ramon - yes it does. That would be the while on the add...

tobi

1) We need to call rand() twice for each number because it returns [0, 32k]

2) We have no built-in hash table, so we'll use a bool[1000000] to keep track of the duplicates. I'd now point out the various reasons this is a hack and problematic and ask the interviewer whether he wants me to fix that or whether it is enough.
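A quick sketch of point 1 (assuming RAND_MAX is 32767, the minimum the standard guarantees; the shift layout is my choice, any arrangement covering 32 bits works):

    #include <stdlib.h>
    #include <stdint.h>

    /* rand() only guarantees 15 bits, so compose a 32-bit value from three calls */
    static uint32_t rand32(void)
    {
        return ((uint32_t)(rand() & 0x7FFF) << 17)   /* bits 17..31 */
             | ((uint32_t)(rand() & 0x7FFF) << 2)    /* bits 2..16  */
             | ((uint32_t)rand() & 0x3);             /* bits 0..1   */
    }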

Frans Bouma

Easy.

var sw = new Stopwatch();
var store = new SortedDictionary<Guid, int>();
sw.Start();
for (int i = 0; i < 1000000; i++)
{
    store.Add(Guid.NewGuid(), i);
}
sw.Stop();
Console.WriteLine("Elapsed ms: {0}", sw.ElapsedMilliseconds);

// show the results ;)
int amount = 0;
foreach (var kvp in store)
{
    Console.WriteLine("Value: {0}", kvp.Value);
    amount++;
    if (amount > 100) { break; }
}

The idea is to sort on the guid. It's not that efficient, but the question is rather stupid, and it does guarantee 1000000 numbers in (pseudo) random order.

Frans Bouma

(oops, that formatting went seriously wrong).

Ayende Rahien

Marco, How would you get a proper random distribution there? And how do you ensure uniqueness?

Ayende Rahien

Ramon, Because it is a unique set, duplicates generated will be ignored.

Ayende Rahien

Tobi, Assume that you need the full 32 bits for the numbers, so you can't just have an array for all the items. What do you do then?

Ayende Rahien

Frans, There is a reason I said C, not C#. And your code would produce only the numbers 1 .. 1,000,000; I want the full 32-bit range.

Marco

Integers from 1 to 1M-1 are unique and normally distributed, while if I add a random floating-point number between 0 and 1 (not included) to each integer from 0 to 1M-1, I get a set of unique random floating-point numbers (sort of normally distributed).

It's a cheat, I know, but it solves the given problem while respecting the constraints.

Marco

sorry I meant uniformly distributed not normally
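A minimal C sketch of Marco's cheat (the integer part guarantees uniqueness; the fraction only adds per-run variety):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* i + rand()/(RAND_MAX+1.0) is unique for every i because the
           integer parts already differ; the fraction is in [0, 1) */
        for (int i = 0; i < 1000000; i++)
            printf("%f\n", i + rand() / ((double)RAND_MAX + 1.0));
        return 0;
    }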

Sergey Shumov

The idea is to divide all integer numbers into N chunks. Then we choose a random number from each chunk. Here is the code: https://gist.github.com/SHSE/6791622

Rafal

http://pastebin.com/VdScTnaa Same idea as Marco's from the first comment

Ramon Smits

@James: Ah yes! I missed that :) thanks

tobi

In that case I'd build myself a crude hash table that is a LinkedList[1000000]. Only adding is needed and I can rely on malloc to build the linked list. That's the simplest hash table possible and can be built as a quick hack.
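That crude hash table might look something like this (a sketch; the bucket count and chaining scheme are my choices, not tobi's exact design):

    #include <stdlib.h>
    #include <stdint.h>

    #define BUCKETS 1000003   /* a prime near 1M keeps the chains short */

    struct node { uint32_t value; struct node *next; };
    static struct node *table[BUCKETS];

    /* returns 1 if value was new, 0 if it was already present */
    static int add_unique(uint32_t value)
    {
        struct node **bucket = &table[value % BUCKETS];
        for (struct node *n = *bucket; n != NULL; n = n->next)
            if (n->value == value)
                return 0;
        struct node *n = malloc(sizeof *n);
        n->value = value;
        n->next = *bucket;
        *bucket = n;
        return 1;
    }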

Alternatively, I'd generate 10m random numbers, sort them with qsort and implement a simple distinct algorithm which is easy for a sorted array. I'd wrap this in a retry-loop in case I do not get 1m unique numbers out of the operation.
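And a sketch of the oversample-sort-dedupe alternative (the 10% oversampling factor is my assumption; with 1.1M draws from a 32-bit space the expected number of duplicates is only a few hundred):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define WANT 1000000
    #define OVER 1100000   /* oversample to survive duplicate removal */

    static uint32_t rand32(void)
    {
        return ((uint32_t)(rand() & 0x7FFF) << 17)
             | ((uint32_t)(rand() & 0x7FFF) << 2)
             | ((uint32_t)rand() & 0x3);
    }

    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        uint32_t *buf = malloc(OVER * sizeof *buf);
        if (!buf) return 1;
        for (int i = 0; i < OVER; i++)
            buf[i] = rand32();
        qsort(buf, OVER, sizeof *buf, cmp_u32);
        int kept = 0;
        for (int i = 0; i < OVER && kept < WANT; i++) {
            if (i > 0 && buf[i] == buf[i - 1])
                continue;                    /* skip duplicates */
            printf("%u\n", buf[i]);
            kept++;
        }
        free(buf);
        return kept == WANT ? 0 : 1;         /* nonzero exit would trigger the retry */
    }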

Rafal

Ayende, you have to stop doing that. I really can't resist this kind of algorithmic exercise, and it hurts my productivity. Here's another shot at it, this time taking into consideration the requirement that the output must be uniformly distributed over the full 32-bit range. I made use of the fact that the Cartesian product of two uniformly distributed sets will also be uniformly distributed. I mean, 1000000 is 1000x1000, so I can produce two sets of 1000 unique numbers and combine them.

http://pastebin.com/8NXDUSG5

Sorry if it crashes, didn't really wait until it spits out all these numbers ;)

Rafal

sorry, the previous version contained an error (two random numbers were incorrectly combined into one in the 'printf' line)

http://pastebin.com/LCedvFNucklwar fecuring

Rafal

AAARGH!

http://pastebin.com/LCedvFNu

Andy Masters

Unless external input is given, all solutions will be pseudo-random. Truly random would require acquiring data from a random physical source. For 1M numbers you would need quite a lot of entropy. Maybe hook up my webcam to observe a lava lamp and take the least significant bits from some pseudo-random pixels. Now the C part....

Ruben Bartelink

I know it's beside the point, but you could speed up the C# segment:

   6: while (set.Count < 1000 * 1000)
   7: {
   8:     set.Add(random.Next(0, int.MaxValue));
   9: }

by leveraging the result of HashSet.Add:

int todo = 1000 * 1000;
while (todo != 0)
    if (set.Add(random.Next(0, int.MaxValue)))
        --todo;

or, if you want the C# terser but [arguably] less legible:

for (int todo = 1000 * 1000; todo > 0; )
    todo -= set.Add(random.Next(0, int.MaxValue)) ? 1 : 0;

Rafal

errrr... a final version fixing a few bugs (incorrect parentheses + an uninitialized array that created some undesired entropy at run time)

http://pastebin.com/e9Sf3eJA
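For readers who don't want to chase the pastebin links, here is a rough sketch of the Cartesian-product idea (my reconstruction, not Rafal's actual code; note that, as with any product construction, not every possible set of 1M numbers can come out of it):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>

    /* pick `count` unique random 16-bit values */
    static void pick_unique(uint16_t *out, int count)
    {
        static unsigned char seen[65536];
        memset(seen, 0, sizeof seen);
        int n = 0;
        while (n < count) {
            uint16_t v = (uint16_t)(((rand() & 0xFF) << 8) | (rand() & 0xFF));
            if (!seen[v]) {
                seen[v] = 1;
                out[n++] = v;
            }
        }
    }

    int main(void)
    {
        uint16_t hi[1000], lo[1000];
        pick_unique(hi, 1000);
        pick_unique(lo, 1000);
        /* 1000 x 1000 distinct (hi, lo) pairs = 1M unique 32-bit numbers */
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                printf("%u\n", ((uint32_t)hi[i] << 16) | lo[j]);
        return 0;
    }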

Mika Kolari

I don't know C, but I guess this would be easy to translate

static readonly Random random = new Random(); // assuming a shared Random instance

public static int[] getRandoms(int count, int max)
{
    int[] results = new int[count];
    var mask = new BitArray(max + 1);
    int counter = 0;
    while (counter < count)
    {
        int r = random.Next(max + 1);
        if (!mask[r])
        {
            results[counter] = r;
            mask[r] = true;
            counter++;
        }
    }

    return results;
}

Geert Baeyaert

@Sergey,

That's not random! With your solution you get precisely one number from each chunk, so for instance the set [1, 2, 3, whatever...] could not occur, because 1, 2 and 3 are all in the same chunk.

Joseph Daigle

Here is one approach.

Start with a linked list containing all the numbers in the range you care about (e.g. -2^31 through (2^31) - 1). Next, loop 1000*1000 times. In each iteration get a random number between 0 and (2^32) - i, where i is an incrementing counter for the loop. Get and remove the number at the position given by the random number, and push that number to the end of an array for your output.

Yes, you do have to do 1,000,000 "seeks". But it should be fairly constant time. Perhaps there is a better/faster way to randomize the initial set of numbers (random swapping?).
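Materializing the full 32-bit range would need ~16 GiB, so here is a sketch of the same pick-and-remove idea on a reduced range, substituting an array with swap-with-last removal (my change, answering the "random swapping?" question) so each pick is O(1):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define RANGE 10000000   /* reduced range; the full 2^32 pool won't fit in RAM */
    #define COUNT 1000000

    int main(void)
    {
        uint32_t *pool = malloc((size_t)RANGE * sizeof *pool);
        if (!pool) return 1;
        for (uint32_t i = 0; i < RANGE; i++)
            pool[i] = i;
        uint32_t left = RANGE;
        for (int i = 0; i < COUNT; i++) {
            /* two rand() calls give 30 bits, enough to index RANGE */
            uint32_t j = (uint32_t)((((uint64_t)rand() << 15) | (uint64_t)rand()) % left);
            printf("%u\n", pool[j]);
            pool[j] = pool[--left];   /* swap-with-last removal, O(1) */
        }
        free(pool);
        return 0;
    }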

Judah Gabriel Himango

Here's a super short version. It kind of cheats by ensuring the random number is always greater than the last one. But hey, the numbers are still random, and the code is dead simple. :-)

https://gist.github.com/JudahGabriel/6794792

(It's been a long time since I wrote C!)

Ayende Rahien

Judah, But this will generate a random set of sequential numbers. That is what we don't want.

Peter Ibbotson

I have a sneaking suspicion that one answer to this is to use a 32-bit LFSR and just generate 1000000 numbers; since it's shift-register based, there can't be any duplicates (assuming you don't start from 0).

Need to think some more about what to do if you're using rand_s instead.

Sergey Shumov

@Geert, good point! It can be fixed by introducing randomness in the chunk allocation logic. Here is the code: https://gist.github.com/SHSE/6795548

Thor Arne Johansen

An LFSR representing a primitive polynomial will generate numbers from the whole range.

So a primitive polynomial of order 32 will have a period of 2^32-1 and go through all 32-bit numbers (excluding 0).
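A minimal Galois LFSR sketch along these lines (the tap mask 0x80200003, i.e. taps 32,22,2,1, is taken from the Xilinx XAPP052 table of maximal-length polynomials; any primitive degree-32 polynomial would do):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t lfsr = 0xDEADBEEFu;   /* any nonzero seed */
        for (int i = 0; i < 1000000; i++) {
            /* shift right; if the bit that fell off was 1, apply the taps */
            lfsr = (lfsr >> 1) ^ ((0u - (lfsr & 1u)) & 0x80200003u);
            printf("%u\n", lfsr);
        }
        return 0;
    }

Because a maximal-length LFSR cycles through all 2^32 - 1 nonzero states before repeating, the first million outputs cannot contain a duplicate; whether consecutive outputs look random enough is the question Peter raises.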

Jesús López

https://gist.github.com/jesuslpm/6801439

Keith

I'd probably go with the following :-

http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/

Can't get much faster than that. The only question you have to ask is how random you want your random to be :)
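For reference, a sketch of the technique behind that link: for the prime p = 4294967291 (the largest prime below 2^32, with p % 4 == 3), x -> x² mod p, negated on the upper half, is a permutation of 0..p-1, so feeding it a counter yields unique, random-looking values. The two rounds and the intermediate offset mirror the article's approach, though the constant here is my arbitrary choice:

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t permute_qpr(uint32_t x)
    {
        const uint32_t p = 4294967291u;   /* largest prime < 2^32, p % 4 == 3 */
        if (x >= p)
            return x;                     /* the 5 values >= p map to themselves */
        uint32_t r = (uint32_t)(((uint64_t)x * x) % p);
        return (x <= p / 2) ? r : p - r;
    }

    int main(void)
    {
        for (uint32_t i = 0; i < 1000000; i++) {
            /* a permutation of unique inputs yields unique outputs */
            printf("%u\n", permute_qpr(permute_qpr(i) + 0x5bf03635u));
        }
        return 0;
    }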

Vytis

https://gist.github.com/xytis/6805873

A similar answer to Judah Gabriel Himango, but with additional features:

1. Ensures that the random numbers are uniformly distributed over the range [0..MAX_INT].
2. Shuffles the obtained set.

With one drawback: if the implementation of rand() has RAND_MAX < 1000000, the set shuffle does not work correctly. (May be fixed by using offsets from the current location instead of absolute indices...)

Also, it is actually slower than Keith's answer, but it consumes less memory than the C# example.

kb2011

I haven't done much C (I do C# for a living) but I took a stab at it. There are four files, denoted by /** NAME.EXT **/ comments.

I didn't write it to fail if it took longer than 10 seconds, since the example C# didn't. I also assumed that by C std lib you meant the C standard libraries, and not just stdlib.h.

http://pastebin.com/AkWDNuXr

Jesús López

I have another implementation; it takes 3 milliseconds to complete.

https://gist.github.com/jesuslpm/6808179

Jonathan North

http://pastebin.com/iVDd9cYW

Basic algorithm: generate more values than needed and dedupe them. If there are too few results, fill in the empty part of the buffer and repeat. Shuffle and take the first N values. This wastes 0.8MB of RAM in exchange for being easily understood and easily implemented.

I could have used less extra space, or even no extra space, but that would devolve into O(#rounds * n log n). As is, it'll be rare to get fewer than the desired number of values in the initial run.

Juan Lopes

I'd use a segtree with lazy propagation (or an rb-tree) to be able to answer the nth-unused-number query in O(log^2 n).

But I'm not messing around with mallocs today.

ori

What do you mean by random here? That each set of 1m unique ints has an equal probability?

Matt

I'd cheat.

n(x+1) = n(x) + rnd(1, int32.max / 10^6)

for x = 0 to 10^6, with n(0) = 0

matt

Actually, n(0) should be -1 if I'm generating random numbers between 1 and int32.max/10^6. And I'd run for x = 0 to (10^6 + 1) and drop the first result.

I should probably round int32.max/10^6 up, and the last result should be calculated between the previous number and int32.max; otherwise the possible numbers would only cover roughly 0 to int32.max - 10^6.
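A sketch of the basic recurrence, without the last-step refinement matt mentions (strictly increasing steps make uniqueness automatic; the trade-off, as with Judah's version, is that the output comes out sorted rather than shuffled):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    int main(void)
    {
        const int32_t step = INT32_MAX / 1000000;   /* 2147 */
        int32_t n = -1;                             /* n(0) = -1, per the fix above */
        for (int x = 0; x < 1000000; x++) {
            n += 1 + rand() % step;                 /* rnd(1, step) */
            printf("%d\n", n);                      /* max value stays below INT32_MAX */
        }
        return 0;
    }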

Igor

A simple approach is to maintain a binary search tree of values generated so far. Generate a random integer and check the tree. If the tree already contains the value, retry. Otherwise, add the number into the result array and insert it into the tree.

Since the values are random, balancing the tree wouldn't be necessary in practice.

Also, the retries are expected to be rare, so their impact should be negligible. Even when generating the last 1,000,000th number, the probability of collision with a number already selected is < 0.05%.

As an implementation note, it is unnecessary to allocate memory for each tree node - i.e., we don't need 1,000,000 malloc calls. We can just pre-allocate an array of 1,000,000 tuples [int, int, int], whose meaning is [value, left_child_index, right_child_index].
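A sketch of that layout (one pre-allocated node pool, children stored as indices, -1 marking an empty child):

    #include <stdint.h>

    #define COUNT 1000000

    static struct { uint32_t value; int left, right; } nodes[COUNT];
    static int used = 0;

    /* returns 1 if value was inserted, 0 if it was already in the tree */
    static int insert(uint32_t value)
    {
        if (used == 0) {
            nodes[0].value = value;
            nodes[0].left = nodes[0].right = -1;
            used = 1;
            return 1;
        }
        for (int i = 0;;) {
            if (value == nodes[i].value)
                return 0;
            int *next = (value < nodes[i].value) ? &nodes[i].left : &nodes[i].right;
            if (*next == -1) {
                *next = used;                 /* attach a fresh node from the pool */
                nodes[used].value = value;
                nodes[used].left = nodes[used].right = -1;
                used++;
                return 1;
            }
            i = *next;
        }
    }

The driver is then the same retry loop as the C# version: keep generating random values until insert() has succeeded 1,000,000 times.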

Jay Sullivan

Generate sequence 1..1000000, then shuffle the results. https://gist.github.com/notfed/6878293


Comments have been closed on this topic.
